Posts Tagged ‘Big Data’

Why Healthcare IT Should Abandon Data Storage Islands and Take the Plunge into Data Lakes

One of the most significant technology-related challenges of the modern era is managing data growth. As healthcare organizations leverage new data-generating technology and as medical record retention requirements evolve, the exponential rise in data (already growing at 48 percent each year, according to the Dell EMC Digital Universe Study) could continue for decades.

Let’s start by first examining the factors contributing to the healthcare data deluge:

  • Longer legal retention times for medical records – in some cases up to the lifetime of the patient.
  • Digitization of healthcare and new digitized diagnostics workflows such as digital pathology, clinical next-generation sequencing, digital breast tomosynthesis, surgical documentation and sleep study videos.
  • With more digital images to store and manage, there is also an increased need for bigger picture archive and communication system (PACS) or vendor-neutral archive (VNA) deployments.
  • Finally, more people are having these digitized medical tests (especially given the large aging population), resulting in a higher number of yearly studies with larger data sizes.

Healthcare organizations also face frequent and complex storage migrations, rising operational costs, storage inefficiencies, limited scalability, increasing management complexity and storage tiering issues caused by storage silo sprawl.

Another challenge is the growing demand to understand and utilize unstructured clinical data. Mining this data requires a storage infrastructure that supports in-place analytics, which in turn enables better patient insights and the evolution of healthcare toward precision medicine.

Isolated Islands Aren’t Always Idyllic When It Comes to Data

The way that healthcare IT has approached data storage infrastructure historically hasn’t been ideal to begin with, and it certainly doesn’t set up healthcare organizations for success in the future.

Traditionally, when adding new digital diagnostic tools, healthcare organizations provided a dedicated storage infrastructure for each application or diagnostic discipline. For example, to deal with the growing storage requirements of digitized X-rays, an organization might create a new storage system solely for the radiology department. As a result, isolated storage silos, or data islands, must be individually managed, making processes and infrastructure complicated and expensive to operate and scale.

Isolated silos further undermine IT goals by increasing the cost of data management and compounding the complexity of analytics, which often requires copying large amounts of data into yet another dedicated storage infrastructure that can't be shared with other workflows. Even maintaining these silos is involved and expensive, because each tech refresh requires migrating medical data to new storage. Each migration, typically performed every three to five years, is labor-intensive and complicated. Frequent migrations not only strain resources but also pull IT staff away from projects aimed at modernizing the organization, improving patient care and increasing revenue.

Further, silos make it difficult for healthcare providers to search data and analyze information, preventing them from gaining the insights they need for better patient care. Healthcare providers are also looking to tap potentially important medical data from Internet-connected medical devices or personal technologies such as wireless activity trackers. If healthcare organizations are to remain successful in a highly regulated and increasingly competitive, consolidated and patient-centered market, they need a simplified, scalable data management strategy.

Simplify and Consolidate Healthcare Data Management with Data Lakes

The key to modern healthcare data management is to employ a strategy that simplifies storage infrastructure and storage management and supports multiple current and future workflows simultaneously. A Dell EMC healthcare data lake, for example, leverages scale-out storage to house data for clinical and non-clinical workloads across departmental boundaries. Such healthcare data lakes reduce the number of storage silos a hospital uses and eliminate the need for data migrations. This type of storage scales on the fly without downtime, addressing IT scalability and performance issues and providing native file and next-generation access methods.

Healthcare data lake storage can also:

  • Eliminate storage inefficiencies and reduce costs by automatically moving data that can be archived to denser, more cost-effective storage tiers.
  • Allow healthcare IT to expand into private, hybrid or public clouds, enabling IT to leverage cloud economies by creating storage pools for object storage.
  • Offer long-term data retention without the security risks or loss of data sovereignty associated with the public cloud; the same cloud expansion can be used for next-generation use cases such as healthcare IoT.
  • Enable precision medicine and better patient insights by fostering advanced analytics across all unstructured data, such as digitized pathology, radiology, cardiology and genomics data.
  • Reduce data management costs and complexities through automation, and scale capacity and performance on demand without downtime.
  • Eliminate storage migration projects.

 

The greatest technical challenge facing today's healthcare organizations is effectively leveraging and managing data. By employing a healthcare data management strategy that replaces siloed storage with a Dell EMC healthcare data lake, however, healthcare organizations will be better prepared to meet the infrastructure requirements of today and tomorrow and to usher in advanced analytics and new storage access methods.

 

Get your fill of news, resources and videos on the Dell EMC Emerging Technologies Healthcare Resource Page

 

 

Using a World Wide Herd (WWH) to Advance Disease Discovery and Treatment

Patricia Florissi

Vice President & Global Chief Technology Officer, Sales at Dell EMC
Patricia Florissi is Vice President and Global Chief Technology Officer (CTO) for Sales. As Global CTO for Sales, Patricia helps define mid- and long-term technology strategy, representing the needs of the broader EMC ecosystem in EMC strategic initiatives. Patricia is an EMC Distinguished Engineer, holds a Ph.D. in Computer Science from Columbia University in New York, graduated valedictorian with an MBA from the Stern School of Business at New York University, and holds a Master's and a Bachelor's degree in Computer Science from the Universidade Federal de Pernambuco, in Brazil. Patricia holds multiple patents and has published extensively in periodicals including Computer Networks and IEEE Proceedings.


Analysis of very large genomic datasets has the potential to radically alter the way we keep people healthy. Whether it is quickly identifying the cause of a new infectious outbreak to prevent its spread or personalizing a treatment based on a patient’s genetic variants to knock out a stubborn disease, modern Big Data analytics has a major role to play.

By leveraging cloud, Apache™ Hadoop®, next-generation sequencers and other technologies, life scientists potentially have a powerful new way to conduct innovative, global-scale collaborative genomic analysis research that was not possible before. With the right approach, great benefits can be realized.


To illustrate the possibilities and benefits of using coordinated worldwide genomic analysis, Dell EMC partnered with researchers at Ben-Gurion University of the Negev (BGU) to develop a global data analytics environment that spans across multiple clouds. This environment lets life sciences organizations analyze data from multiple heterogeneous sources while preserving privacy and security. The work conducted by this collaboration simulated a scenario that might be used by researchers and public health organizations to identify the early onset of outbreaks of infectious diseases. The approach could also help uncover new combinations of virulence factors that may characterize new diseases. Additionally, the methods used have applicability to new drug discovery and translational and personalized medicine.

 

Expanding on past accomplishments

In 2003, SARS (severe acute respiratory syndrome) was the first infectious outbreak where fast global collaborative genomic analysis was used to identify the cause of a disease. The effort was carried out by researchers in the U.S. and Canada who decoded the genome of the coronavirus to prove it was the cause of SARS.

The Dell EMC and BGU simulated disease detection and identification scenario makes use of technological developments (the much lower cost of sequencing, the availability of greater computing power, the use of cloud for data sharing, etc.) to address some of the shortcomings of past efforts and enhance the outcome.

Specifically, some diseases are caused by a combination of virulence factors. These factors may all be present in one pathogen or spread across several pathogens in the same biome, and there can also be geographical variations. This makes it very hard to identify the root causes of a disease when pathogens are analyzed in isolation, as has been the case in the past.

Addressing these issues requires sequencing entire micro-biomes from many samples gathered worldwide. The computational requirements for such an approach are enormous. A single facility would need a compute and storage infrastructure on a par with major government research labs or national supercomputing centers.

Dell EMC and BGU simulated a scenario of distributed sequencing centers scattered worldwide, where each center sequences entire micro-biome samples. Each center analyzes the sequence reads generated against a set of known virulence factors. This is done to detect the combination of these factors causing diseases, allowing for near-real time diagnostic analysis and targeted treatment.

To carry out these operations in the different centers, Dell EMC extended the Hadoop framework to orchestrate distributed and parallel computation across clusters scattered worldwide. This pushed computation as close as possible to the source of data, leveraging the principle of data locality at world-wide scale, while preserving data privacy.

Since one Hadoop instance is represented by a single elephant, Dell EMC concluded that a set of Hadoop instances scattered across the world but working in tandem forms a World Wide Herd, or WWH. This is the name Dell EMC has given to its Hadoop extensions.


Using WWH, Dell EMC wrote a distributed application in which each of a set of collaborating sequencing centers calculates a profile of the virulence factors present in each microbiome it sequenced and sends just these profiles to a center selected to perform the global computation.

That center would then use bi-clustering to uncover common patterns of virulence factors among subsets of micro-biomes that could have been originally sampled in any part of the world.
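To make the flow concrete, here is a minimal Python sketch of the pattern described above. It is illustrative only, not the actual WWH code: the virulence-factor IDs and center data are placeholders, and scikit-learn's SpectralCoclustering stands in for the bi-clustering step. The key point it shows is that only compact profiles travel to the coordinating center, never the raw reads.

```python
# Illustrative sketch: each sequencing center computes a compact virulence-factor
# profile locally; only those profiles are sent to the coordinating center,
# which bi-clusters biomes against virulence factors. Names and data are
# placeholders, and SpectralCoclustering is an assumed stand-in for the
# bi-clustering method used in the actual work.
from collections import Counter

import numpy as np
from sklearn.cluster import SpectralCoclustering

KNOWN_VIRULENCE_FACTORS = ["vf_001", "vf_002", "vf_003", "vf_004"]  # placeholder IDs


def local_profile(read_hits):
    """Runs at each center: count hits against the known virulence factors.

    `read_hits` is whatever the local alignment step produced; only the
    resulting counts (a small vector) leave the center, not the reads.
    """
    counts = Counter(read_hits)
    return [counts.get(vf, 0) for vf in KNOWN_VIRULENCE_FACTORS]


def global_biclustering(profiles, n_clusters=2):
    """Runs at the coordinating center: bi-cluster biomes x virulence factors."""
    matrix = np.array(profiles)
    model = SpectralCoclustering(n_clusters=n_clusters, random_state=0)
    model.fit(matrix)
    return model.row_labels_, model.column_labels_


# Toy usage: three centers send their profiles; the coordinator clusters them.
profiles = [
    local_profile(["vf_001", "vf_001", "vf_003"]),  # center A
    local_profile(["vf_002", "vf_004", "vf_004"]),  # center B
    local_profile(["vf_001", "vf_003", "vf_003"]),  # center C
]
biome_groups, factor_groups = global_biclustering(profiles)
print(biome_groups, factor_groups)
```

Because each profile is just a short vector of counts, the global step scales with the number of participating centers rather than with the volume of sequence data, which is what makes the data-locality principle workable at worldwide scale.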

This approach could allow researchers and public health organizations to potentially identify the early onset of outbreaks and also uncover new combinations of virulence factors that may characterize new diseases.

There are several biological advantages to this approach. It eliminates the time required to isolate a specific pathogen for analysis and to re-assemble the genomes of individual microorganisms. Sequencing the entire biome lets researchers identify known and unknown combinations of virulence factors. And collecting samples independently worldwide helps ensure the detection of variants.

On the compute side, the approach uses local processing power to perform the biome sequence analysis. This reduces the need for a large centralized HPC environment. Additionally, the method addresses data diversity: it can support heterogeneous data sources and data formats.

This investigative approach could be used as a next-generation outbreak surveillance system. It allows collaboration in which different geographically dispersed groups simultaneously investigate different variants of a new disease. In addition, the WWH architecture has great applicability to pharmaceutical industry R&D efforts, which increasingly rely on a multi-disciplinary approach where geographically dispersed groups investigate different aspects of a disease or drug target using a wide variety of analysis algorithms on shared data.

 

Learn more about modern genomic Big Data analytics

 

 

Dell EMC DSSD D5 And Toshiba Accelerate AI With Deep Learning Test Bed

Jason Tolu

Senior Product Marketing Manager | DSSD, Emerging Technologies Division at Dell EMC


DSSD D5 Rack-Scale Flash Storage Provides High Performance Storage To New Toshiba Deep Learning Test Bed For Facilities Management

With increasing computing power and the Internet of Things supplying ever-increasing sources, types and amounts of data, the potential for new, innovative applications and products is limitless. One way organizations are taking advantage of this is through machine learning and deep learning, where computers learn from data with the use of analytical models. Increasingly complex algorithms are being applied to massive quantities of data to develop machine learning and AI-driven applications such as self-driving cars or smart buildings.

Toshiba Corporation is at the forefront of the deep learning movement. The Toshiba Smart Community Center in Kawasaki, Japan, which opened in 2013, makes use of a wide variety of IoT sensor devices and is a key element in Toshiba's vision to bring new innovations to market for smarter facility management. To make this vision possible, Toshiba and Dell Technologies have joined forces to develop a deep learning test bed to improve the management of IoT edge devices that provide data to enterprise networks. The jointly developed solution has become the first deep learning platform to be approved by the Industrial Internet Consortium (IIC).

The test bed will be used in Toshiba's Smart Community Center in Kawasaki, Japan, and will utilize big data from a variety of sensors, including building management, air conditioning and building security systems, to provide more efficient machine control, reduce maintenance costs and improve the management of building facilities.


DSSD D5 Provides The Storage Performance For Machine Learning and AI

Toshiba will provide the deep learning technology for analyzing and evaluating big data in the test bed. Dell EMC DSSD D5 will provide the high-speed storage: the low latency, IOPS, bandwidth and capacity required for rapid ingest and complex analytics on large data sets in near real time.

In developing the solution, Toshiba utilized Hadoop to achieve record-breaking performance. Toshiba's choice of DSSD D5 as the storage layer in this deep learning solution validates DSSD D5's standing as the high-performance storage of choice for next-generation applications that aim to take advantage of growing data and computational power.

Limitless Possibilities For Smart Facilities Management

With the integration of Toshiba deep learning and high-performance storage from Dell EMC DSSD D5, Toshiba and Dell Technologies are accelerating the application of artificial intelligence to benefit multiple industries. And, with the IIC's approval, the jointly developed solution is a major step in the advancement of IoT for industrial usage. The verification of the test bed at the Smart Community Center is expected to be concluded by September 2017. Once the verification is complete, Toshiba intends to roll out the solution to hospitals, hotels, shopping malls, factories and airports.

If you would like to find out more about the Toshiba and Dell Technologies Deep Learning Solution:

Announcing Isilon OneFS 8.0.1

David Noy

VP Product Management, Emerging Technologies Division at EMC

It’s really been an exhilarating couple of months leading up to the recent historic merger between Dell and EMC! We just completed our first Dell EMC World and announced Isilon All-Flash last week. While all that was in progress, the Isilon team was heads-down focused on the next update to OneFS, the industry-leading scale-out NAS operating system.

Today, we’re announcing the new OneFS 8.0.1 release with a strong focus on strengthening the Data Lake with features supporting the horizontal and vertical markets we serve. For the horizontal markets, we’ve added new and improved capabilities around Hadoop big data analytics, Isilon CloudPools and IsilonSD Edge. For the vertical industries we support, we’ve focused on addressing the needs of the Healthcare and Financial markets.

Customers continue to gain more value from their data with analytics. Hadoop-based solutions have always been a pillar for Isilon customers because of native support for the HDFS protocol in the OneFS operating system. In OneFS 8.0.1, we’ve added support for Apache Ambari to proactively monitor key performance metrics and alerts, giving enterprise customers a single point of management for the entire Hadoop cluster. In addition, from a security perspective, not only have we integrated with Apache Ranger to deliver seamless authorization and access control, but we’ve also added support for end-to-end in-flight data encryption between Isilon nodes and the HDFS client.
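As a rough illustration of what that single point of management looks like in practice, the sketch below polls Ambari's REST API for the state of the HDFS service. The host, port, cluster name and credentials are placeholders, and the exact response layout can vary between Ambari versions, so treat it as an assumption-laden example rather than a reference integration.

```python
# Minimal sketch of checking HDFS service health through Ambari's REST API.
# Endpoint, cluster name and credentials are placeholders; verify the response
# shape against your own Ambari version before depending on it.
import requests

AMBARI_URL = "http://ambari.example.com:8080"   # placeholder
CLUSTER = "my_cluster"                          # placeholder
AUTH = ("admin", "admin")                       # placeholder credentials

resp = requests.get(
    f"{AMBARI_URL}/api/v1/clusters/{CLUSTER}/services/HDFS",
    auth=AUTH,
    headers={"X-Requested-By": "ambari"},
    timeout=10,
)
resp.raise_for_status()
service = resp.json()
# In the Ambari versions we have seen, the service state lives under "ServiceInfo".
print(service.get("ServiceInfo", {}).get("state", "UNKNOWN"))
```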

Many Isilon enterprise customers continue to use OneFS because of its simplicity and ease of management at scale. We’ve added many new features for enterprises, such as CloudPools proxy support to increase security, reduce risk and simplify management. For IsilonSD Edge software-defined storage, we’ve added support for VMware ESX 6.0 and seamlessly integrated with EMC Secure Remote Services (ESRS) for remote monitoring, issue resolution and troubleshooting.

Other enterprise capabilities include seamless non-disruptive upgrades from OneFS 8.0, upgrade rollback support, a 5x improvement in audit performance and a completely rewritten framework for performance resource management, reporting and data insights.

Isilon deployments continue to add value for customers across verticals like Media & Entertainment, Healthcare, Life Sciences, EDA and others. In this release we have strengthened our solutions for the Healthcare and Finance verticals. For Healthcare PACS workloads, we’ve added capabilities in OneFS 8.0.1 that increase efficiency and significantly improve storage utilization for PACS archive workloads. For the Financial industry, we’ve added seamless integration of compliance data with business continuity features by combining SmartLock compliance mode with SyncIQ replication for push-button failover and failback.

OneFS 8.0.1 is the first major update to the OneFS 8.0 code base, and it contains a number of features that many enterprises have been waiting for. If you have been holding off on moving to the OneFS 8.0 code base until a subsequent “dot release,” today is the day: your wait is over!

Big Data Analysis for the Greater Good: Dell EMC & the 100,000 Genomes Project

Wolfgang Mertz

CTO of Healthcare, Life Sciences and High Performance Computing

It might seem far-reaching to say that big data analysis can fundamentally impact patient outcomes around cancer and other illnesses, and that it has the power to ultimately transform health services and indeed society at large, but that’s the precise goal behind the 100,000 Genomes Project from Genomics England.

For background, Genomics England is a wholly owned company of the Department of Health, set up to deliver the 100,000 Genomes Project. This exciting endeavor will sequence and collect 100,000 whole genomes from 70,000 NHS patients and their families (with their full consent), focusing on patients with rare diseases as well as those with common cancers.

The program is designed to create a lasting legacy for patients as well as the NHS and the broader UK economy, while encouraging innovation in the UK’s bioscience sector. The genetic sequences will be anonymized and shared with approved academic researchers to help develop new treatments and diagnostic testing methods targeted at the genetic characteristics of individual patients.

Dell EMC provides the platform for large-scale analytics in a hybrid cloud model for Genomics England, which leverages our VCE vScale with EMC Isilon and EMC XtremIO solutions. The Project has been using EMC storage for its genomic sequence library, and now it will be leveraging an Isilon data lake to securely store data during the sequencing process. Backup services are provided by EMC Data Domain and EMC NetWorker.

The Genomics England IT environment uses both on-prem servers and IaaS provided by cloud service providers on G-Cloud. According to an article from Government Computing, “one of Genomics England’s key legacies is expected to be an ecosystem of cloud service providers providing low cost, elastic compute on demand through G-Cloud, bringing the benefits of scale to smaller research groups.”

There are two main considerations from an IT perspective around genome and DNA sequencing projects such as those being done by Genomics England and others: data management and speed. Vast amounts of research data have to be stored and retrieved, and this large-scale biological data has to be processed quickly in order to gain meaningful insights.

Scale is another key factor. Sequencing and storing genomic information digitally is a data-intensive endeavor, to say the least. Sequencing a single genome creates hundreds of gigabytes of data, and the Project, which has sequenced over 13,000 genomes to date, is expected to generate ten times more data over the next two years. The data lake being used by Genomics England allows 17 petabytes of data to be stored and made available for multi-protocol analytics (including Hadoop).

For perspective, 1 PB is a quadrillion bytes – think of that as 20 million four-drawer filing cabinets filled with text. Or, considering that the Milky Way galaxy contains roughly two hundred billion stars, if you count each star as a single byte, it would take 5,000 Milky Way galaxies to reach 1 PB of data. It’s staggering.
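For readers who want to sanity-check those comparisons, here is a quick back-of-the-envelope calculation. The 200 GB-per-genome figure is an assumption standing in for "hundreds of gigabytes"; the other numbers come straight from the figures quoted above.

```python
# Back-of-the-envelope check of the figures above. The ~200 GB-per-genome
# value is an assumption; the 13,000 genomes, 17 PB data lake and
# two-hundred-billion-star analogy come from the text.
GB = 10**9
PB = 10**15

genomes_sequenced = 13_000
bytes_per_genome = 200 * GB                       # assumed "hundreds of gigabytes"
sequenced_so_far = genomes_sequenced * bytes_per_genome
print(f"~{sequenced_so_far / PB:.1f} PB sequenced to date")   # ~2.6 PB

data_lake_capacity = 17 * PB
print(f"Data lake capacity: {data_lake_capacity / PB:.0f} PB")

stars_in_milky_way = 200 * 10**9                  # ~two hundred billion stars
print(f"Galaxies per PB: {PB / stars_in_milky_way:,.0f}")     # 5,000
```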

The potential to contribute to eradicating disease and identifying exciting new treatments is truly awe-inspiring. And considering the immense scale of the data involved – 5,000 galaxies! – provides new context around reaching for the stars.

Get first access to our LifeScience Solutions

 

Analyst firm IDC evaluates EMC Isilon: Lab-validation of scale-out NAS file storage for your enterprise Data Lake

Suresh Sathyamurthy

Sr. Director, Product Marketing & Communications at EMC

A Data Lake should now be a part of every big data workflow in your enterprise organization. By consolidating file storage for multiple workloads onto a single shared platform based on scale-out NAS, you can reduce costs and complexity in your IT environment, and make your big data efficient, agile and scalable.

That’s the expert opinion in analyst firm IDC’s recent Lab Validation Brief: “EMC Isilon Scale-Out Data Lake Foundation: Essential Capabilities for Building Big Data Infrastructure”, March 2016. As the lab validation report concludes: “IDC believes that EMC Isilon is indeed an easy-to-operate, highly scalable and efficient Enterprise Data Lake Platform.”

The Data Lake Maximizes Information Value

The Data Lake model of storage represents a paradigm shift from the traditional linear enterprise data flow model. As data and the insights gleaned from it increase in value, enterprise-wide consolidated storage is transformed into a hub around which the ingestion and consumption systems work. This enables enterprises to bring analytics to data in place, avoiding the expense of multiple storage systems and the time needed for repeated ingestion and analysis.

But pouring all your data into a single shared Data Lake would put serious strain on traditional storage systems – even without the added challenges of data growth. That’s where the virtually limitless scalability of EMC Isilon scale-out NAS file storage makes all the difference…

The EMC Data Lake Difference

The EMC Isilon Scale-out Data Lake is an Enterprise Data Lake Platform (EDLP) based on Isilon scale-out NAS file storage and the OneFS distributed file system.

As well as meeting the growing storage needs of your modern datacenter with massive capacity, it enables big data accessibility using traditional and next-generation access methods – helping you manage data growth and gain business value through analytics. You can also enjoy seamless replication of data from the enterprise edge to your core datacenter, and tier inactive data to a public or private cloud.

We recently reached out to analyst firm IDC to lab-test our Isilon Data Lake solutions – here’s what they found in 4 key areas…

  1. Multi-Protocol Data Ingest Capabilities and Performance

Isilon is an ideal platform for enterprise-wide data storage, and provides a powerful centralized storage repository for analytics. With the multi-protocol capabilities of OneFS, you can ingest data via NFS, SMB and HDFS. This makes the Isilon Data Lake an ideal and user-friendly platform for big data workflows, where you need to ingest data quickly and reliably via protocols most suited to the workloads generating the information. Using native protocols enables in-place analytics, without the need for data migration, helping your business gain more rapid data insights.
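As a small, hedged example of what protocol-native ingest can look like, the sketch below writes a file over WebHDFS with the Python hdfs package and then lists it back. The endpoint, user and paths are placeholders, so check them against the HDFS/WebHDFS settings of your own cluster or OneFS access zone before relying on them.

```python
# Minimal sketch of ingesting data over HDFS (via WebHDFS) with the Python
# `hdfs` package. The endpoint, user and paths are placeholders.
from hdfs import InsecureClient

client = InsecureClient("http://hdfs.example.com:9870", user="analytics")  # placeholder

# Write a small CSV into the shared data lake namespace...
client.write(
    "/datalake/ingest/readings.csv",
    data="sensor,value\ns1,42\n",
    overwrite=True,
    encoding="utf-8",
)

# ...and the same file is immediately visible for in-place analytics.
print(client.list("/datalake/ingest"))
```

The same file could just as well have been dropped in over NFS or SMB; the point of multi-protocol access is that whichever path the workload uses, the data lands in one shared namespace and is available for analytics without a separate copy step.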


IDC validated that the Isilon Data Lake offers excellent read and write performance for Hadoop clusters accessing HDFS via OneFS, compared with access via direct-attached storage (DAS). In the lab tests, Isilon performed:

  • nearly 3x faster for data writes
  • over 1.5x faster for reads and read/writes.

As IDC says in its validation: “An Enterprise Data Lake platform should provide vastly improved Hadoop workload performance over a standard DAS configuration.”

  2. High Availability and Resilience

Policy-based high availability capabilities are needed for enterprise adoption of Data Lakes. The Isilon Data Lake is able to cope with multiple simultaneous component failures without interruption of service. If a drive or other component fails, it only has to recover the specific affected data (rather than recovering the entire volume).

IDC validated that a disk failure on a single Isilon node has no noticeable performance impact on the cluster. Replacing a failed drive is a seamless process and requires little administrative effort. (This is in contrast to traditional DAS, where the process of replacing a drive can be rather involved and time consuming.)

Isilon can even cope easily with node-level failures. IDC validated that a single-node failure has no noticeable performance impact on the Isilon cluster. Furthermore, the operation of removing a node from the cluster, or adding a node to the cluster, is a seamless process.

  3. Multi-tenant Data Security and Compliance

Strong multi-tenant data security and compliance features are essential for an enterprise-grade Data Lake. Access zones are a crucial part of the multi-tenancy capabilities of the Isilon OneFS. In tests, IDC found that Isilon provides no-crossover isolation between Hadoop instances for multi-tenancy.

Another core component of secure multi-tenancy is the ability to provide a secure authentication and authorization mechanism for local and directory-based users and groups. IDC validated that the Isilon Data Lake provides multiple federated authentication and authorization schemes. User-level permissions are preserved across protocols, including NFS, SMB and HDFS.

Federated security is an essential attribute of an Enterprise Data Lake Platform, with the ability to maintain confidentiality and integrity of data irrespective of the protocols used. For this reason, another key security feature of the OneFS platform is SmartLock – specifically designed for deploying secure and compliant (SEC Rule 17a-4) Enterprise Data Lake Platforms.

In tests, IDC found that Isilon enables a federated security fabric for the Data Lake, with enterprise-grade governance, regulatory and compliance (GRC) features.

  4. Simplified Operations and Automated Storage Tiering

The Storage Pools feature of Isilon OneFS allows administrators to apply common file policies across the cluster locally – and extend them to the cloud.

Storage Pools consists of three components:

  • SmartPools: Data tiering within the cluster – essential for moving data between performance-optimized and capacity-optimized cluster nodes.
  • CloudPools: Data tiering between the cluster and the cloud – essential for implementing a hybrid cloud, and placing archive data on a low-cost cloud tier.
  • File Pool Policies: Policy engine for data management locally and externally – essential for automating data movement within the cluster and the cloud.

As IDC confirmed in testing, Isilon’s federated data tiering enables IT administrators to optimize their infrastructure by automating data placement onto the right storage tiers.
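To make the policy idea concrete, here is a conceptual Python sketch of how match criteria map files to target tiers. It models the concept only: it is not the OneFS CLI or API, and every name, tier and threshold in it is hypothetical.

```python
# Conceptual sketch of a file pool policy: match criteria decide which tier a
# file should live on, in the spirit of the SmartPools / CloudPools / File Pool
# Policies split described above. Not the OneFS CLI or API; names and
# thresholds are invented for illustration.
import time
from dataclasses import dataclass


@dataclass
class FileInfo:
    path: str
    size_bytes: int
    last_access: float  # epoch seconds


@dataclass
class PoolPolicy:
    name: str
    target_tier: str      # e.g. "performance", "capacity", "cloud-archive"
    max_idle_days: float  # files idle longer than this match the policy

    def matches(self, f: FileInfo, now: float) -> bool:
        idle_days = (now - f.last_access) / 86_400
        return idle_days > self.max_idle_days


# Policies are evaluated in order; the first match decides placement.
POLICIES = [
    PoolPolicy("archive-to-cloud", "cloud-archive", max_idle_days=365),
    PoolPolicy("move-to-capacity", "capacity", max_idle_days=90),
]


def place(f: FileInfo, now: float) -> str:
    for policy in POLICIES:
        if policy.matches(f, now):
            return policy.target_tier
    return "performance"  # default tier for active data


# A study untouched for ~200 days lands on the capacity tier.
example = FileInfo("/ifs/data/study.dcm", 2_000_000, time.time() - 200 * 86_400)
print(place(example, time.time()))
```

The design point this illustrates is that the policy engine, not the application, decides placement, so data can age from performance to capacity to cloud tiers without any change to the workflows that read and write it.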

The expert verdict on the Isilon Data Lake

IDC concludes that: “EMC Isilon possesses the necessary attributes such as multi-protocol access, availability and security to provide the foundations to build an enterprise-grade Big Data Lake for most big data Hadoop workloads.”

Read the full IDC Lab Validation Brief for yourself: “EMC Isilon Scale-Out Data Lake Foundation: Essential Capabilities for Building Big Data Infrastructure”, March 2016.

Learn more about building your Data Lake with EMC Isilon.
