Archive for the ‘Solutions’ Category

Galaxy: A Workflow Management System for Modern Life Sciences Research

Nathan Bott

Healthcare Solutions Architect at EMC

Am I a life scientist or an IT data manager? That’s the question many researchers are asking themselves in today’s data-driven life sciences organizations.

Whether it is a bench scientist analyzing a genomic sequence or an M.D. exploring biomarkers and a patient’s genomic variants to develop a personalized treatment, researchers are spending a great amount of time searching for, accessing, manipulating, analyzing, and visualizing data.

Organizations supporting such research efforts are trying to make it easier to perform these tasks without the user needing extensive IT expertise and skills. This mission is not easy.

Focus on the data

Modern life sciences data analysis requirements are vastly different than they were just a handful of years ago.

In the past, once data was created, it was stored, analyzed soon after, and then archived to tape or another long-term medium. Today, not only is more data being generated, but the need to re-analyze that data means it must be retained where it can be easily accessed for longer periods.

Additionally, today’s research is much more collaborative and multi-disciplinary. As a result, organizations must provide an easy way for researchers to access data, ensure that results are reproducible, and provide transparency to ensure best practices are used and that procedures adhere to regulatory mandates.

Analytics and collaboration are two areas where The Galaxy Project (also known simply as Galaxy) can help. Galaxy is a scientific workflow, data integration, and data and analysis persistence and publishing platform designed to make computational biology accessible to research scientists who do not have computer programming experience.

Galaxy is most often used as a general-purpose bioinformatics workflow management system that automatically tracks and manages data while providing support for capturing the context and intent of computational methods.

Organizations have several ways to make use of Galaxy. They include:

Free public instance: The Galaxy Main instance is available as a free public service at UseGalaxy.org. This is the Galaxy Project’s primary production instance; it is useful for routine analysis and for sharing or publishing data and methods, whether with colleagues or with the larger scientific community for publications.

Anyone can use the public servers, with or without an account. (With an account, data quotas are increased and full functionality across sessions opens up, such as naming, saving, sharing, and publishing Galaxy-defined objects).
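For users who would rather script against a Galaxy server than click through the web interface, an account’s API key also unlocks programmatic access. The sketch below is a minimal, illustrative example using the community-maintained BioBlend Python client; the server URL, API key, history name, and file path are placeholders you would replace with your own.

```python
# Minimal sketch using the BioBlend client for the Galaxy API.
# Assumes: `pip install bioblend`, an account on the target Galaxy server,
# and an API key generated under the account's user preferences.
from bioblend.galaxy import GalaxyInstance

GALAXY_URL = "https://usegalaxy.org"   # or your institution's Galaxy instance
API_KEY = "YOUR_API_KEY_HERE"          # placeholder; never hard-code keys in real scripts

gi = GalaxyInstance(url=GALAXY_URL, key=API_KEY)

# Create a named history, one of the "Galaxy-defined objects" an account lets you keep.
history = gi.histories.create_history(name="example-analysis")

# Upload a local file into that history; the path is a placeholder.
upload = gi.tools.upload_file("reads.fastq", history["id"])

# List the workflows already saved to the account.
for wf in gi.workflows.get_workflows():
    print(wf["name"], wf["id"])
```

The same calls work against any Galaxy instance, public or local, provided you have an account and API key on that server.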

Publicly available instances: Many other Galaxy servers besides Main have been made publicly available by the Galaxy community. Specifically, a number of institutions have installed Galaxy and have made those installations either accessible to individual researchers or open to certain organizations or communities.

For example, the Centre de Bioinformatique de Bordeaux offers a general purpose Galaxy instance that includes EMBOSS (a software analysis package for molecular biology) and fibronectin (diversity analysis of synthetic libraries of a Fibronectin domain). Biomina offers a general purpose Galaxy instance that includes most standard tools for DNA/RNA sequencing, plus extra tools for panel resequencing, variant annotation, and some tools for Illumina SNP array analysis.

A list of the publicly available installations of Galaxy can be found here.

Do-it-yourself: Organizations also have the choice of deploying their own Galaxy installations. There are two options: an organization can install a local instance of Galaxy (more information on setting up a local instance of Galaxy can be found here), or Galaxy can be deployed to the cloud. The Galaxy Project supports CloudMan, a software package that provides a common interface to different cloud infrastructures.

How it works

Architecturally, Galaxy is a modular, Python-based web application that provides a data abstraction layer to integrate with various storage platforms. This allows researchers to access data on a variety of storage back-ends, such as standard direct-attached storage, S3 object-based cloud storage, storage management systems like iRODS (the Integrated Rule-Oriented Data System), or a distributed file system.

For example, a Galaxy implementation might use object-based storage such as that provided by Dell EMC Elastic Cloud Storage (ECS). ECS is a software-defined, cloud-scale, object storage platform that combines the cost advantages of commodity infrastructure with the reliability, availability, and serviceability of traditional storage arrays.

With ECS, any organization can deliver scalable and simple public cloud services with the reliability and control of a private-cloud infrastructure.

ECS provides comprehensive protocol support, including S3 and Swift, for unstructured workloads on a single, cloud-scale storage platform. This allows a Galaxy implementation to easily access data stored on such cloud storage platforms.
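To make the protocol point concrete, the following minimal sketch shows how an S3-compatible endpoint such as ECS can be addressed with the standard boto3 library simply by overriding the endpoint URL. The endpoint, credentials, and bucket name are placeholders; in a Galaxy deployment this access would normally be configured through Galaxy’s object store settings rather than called directly from user code.

```python
# Illustrative only: reading and writing objects on an S3-compatible store
# (for example, ECS) with boto3. Endpoint, credentials, and bucket are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://object.example-ecs.local",  # hypothetical S3-compatible endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Write an analysis result, then read it back.
s3.put_object(Bucket="galaxy-datasets", Key="results/variants.vcf", Body=b"...")
obj = s3.get_object(Bucket="galaxy-datasets", Key="results/variants.vcf")
print(obj["Body"].read()[:100])
```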

With ECS, organizations can easily manage a globally distributed storage infrastructure under a single global namespace with anywhere access to content. ECS features a flexible software-defined architecture that is layered to promote limitless scalability. Each layer is completely abstracted and independently scalable with high availability and no single points of failure.

Get first access to our Life Sciences Solutions

You can test drive Dell EMC ECS by registering for an account and getting access to our APIs by visiting https://portal.ecstestdrive.com/

Or you can download the Dell EMC ECS Community Edition here and try it for FREE in your own environment, with no time limit for non-production use.

Overcoming the Exabyte-Sized Obstacles to Precision Medicine

Wolfgang Mertz

CTO of Healthcare, Life Sciences and High Performance Computing

As we make strides towards a future that includes autonomous cars and grocery stores sans checkout lines, concepts that once seemed reserved only for utopian fiction, it seems there’s no limit to what science and technology can accomplish. It’s an especially exciting time for those in the life sciences and healthcare fields, with 2016 seeing breakthroughs such as a potential “universal” flu vaccine and CRISPR, a promising gene editing technology that may help treat cancer.

Several of Dell EMC’s customers are also making significant advances in precision medicine, the medical model that focuses on using an individual’s specific genetic makeup to customize and prescribe treatments.

Currently, physicians and scientists are researching a myriad of applications for precision medicine, including oncology, diabetes and cardiology. Before we can realize the vision President Obama shared in his 2015 Precision Medicine Initiative of “the right treatments at the right time, every time, to the right person,” there are significant challenges to overcome.

Accessibility

For precision medicine to become available to the masses, researchers and doctors will need not only the technical infrastructure to support genomic sequencing, but also the storage capacity and resources to access, view and share additional relevant data. They will need visibility into patients’ electronic health records (EHR), along with information on environmental conditions, lifestyle behaviors and biological samples. While increased data sharing may sound simple enough, the reality is that there is still much work to be done on the storage infrastructure side to make this possible. Much of this data is typically siloed, which impedes healthcare providers’ ability to collaborate and review critical information that could impact a patient’s diagnosis and treatment. To take full advantage of the potential life-saving insights available from precision medicine, organizations must implement a storage solution that enables high-speed access anytime, anywhere.

Volume

Another issue to confront is the storage capacity needed to house and preserve the petabytes of genomic data, medical imaging, EHR and other data. Thanks to the decreased cost of genomic sequencing and the growing number of genomes being analyzed, the sheer volume of genomic data alone is quickly eclipsing the storage available in most legacy systems. According to a scientific report by Stephens et al. published in PLOS Biology, between 100 million and two billion human genomes may be sequenced by 2025. This may lead to storage demands of 2 to 40 exabytes, since storage requirements must take into consideration the accuracy of the data collected. The paper states that, “For every 3 billion bases of human genome sequence, 30-fold more data (~100 gigabases) must be collected because of errors in sequencing, base calling and genome alignment.” With this exponential projected growth, scale-out storage that can simultaneously manage multiple current and future workflows is more necessary than ever.
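To put those projections in perspective, here is a rough back-of-the-envelope calculation based only on the figures quoted above. It assumes roughly one byte per base of raw, uncompressed sequence, so treat it as an illustration of scale rather than a forecast; retained storage depends heavily on compression and on how much raw data is kept, which is why published estimates span such a wide range.

```python
# Rough scale illustration only (assumes ~1 byte per base before compression;
# actual retained storage will be lower depending on compression and what is kept).
GENOME_BASES = 3e9            # one human genome, ~3 billion bases
OVERSAMPLING = 30             # ~30-fold more raw data per genome (quoted above)
BYTES_PER_BASE = 1            # uncompressed, sequence only

EXABYTE = 1e18

for genomes in (100e6, 2e9):  # 100 million and 2 billion genomes by 2025
    raw_bytes = genomes * GENOME_BASES * OVERSAMPLING * BYTES_PER_BASE
    print(f"{genomes:.0e} genomes -> ~{raw_bytes / EXABYTE:.0f} EB of raw sequence data")

# Prints roughly 9 EB and 180 EB of raw sequence, which is why retained-storage
# estimates such as the 2-40 EB figure above hinge on compression and selective retention.
```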

Early Stages 

Finally, while it’s easy to get caught up in the excitement of the advances made thus far in precision medicine, we have to remember that this remains a young discipline. At the IT level, there’s still much to be done around network and storage infrastructure and workflows in order to develop the solutions that will make this ground-breaking research readily available to the public, the physician community and healthcare professionals. Third-generation platform applications need to be built to make this more mainstream. Fortunately, major healthcare technology players such as GE and Philips have undertaken initiatives to attract independent software vendor (ISV) applications. The more that high-profile companies devote time and resources to supporting ISV applications, the more likely it is that scientists will have access to more sophisticated tools sooner.

More cohort analyses such as Genomics England’s 100,000 Genomes Project must be put in place to ensure researchers have sufficient data to develop new forms of screening and treatment, and these efforts will also necessitate additional storage capabilities.

Conclusion

Despite these barriers, the future remains promising for precision medicine. With the proper infrastructure in place to provide reliable shared access and massive scalability, clinicians and researchers will have the freedom to focus on discovering the breakthroughs of tomorrow.

Get first access to our Life Sciences Solutions

TGen Cures Storage Needs with Dell EMC to Advance Precision Medicine

Sasha Paegle

Sr. Business Development Manager, Life Sciences

As the gap between theoretical treatment and clinical application for precision medicine continues to shrink, we’re inching closer to the day when doctors routinely use an individual’s genome to prescribe specific care strategies.

Organizations such as the Translational Genomics Research Institute (TGen), a leading biomedical research institute, are on the forefront of enabling a new generation of life-saving treatments. With innovations from TGen, breakthroughs in genetic sequencing are unraveling mysteries of complex diseases like cancer.

To help achieve its goal of successfully using -omics to prevent, diagnose and treat disease, the Phoenix-based non-profit research institute selected Dell EMC to enhance the IT systems and infrastructure that manage its petabyte-scale sequencing cluster.

Data Tsunami 

The time and cost of genomic sequencing for a single person has dropped dramatically since the Human Genome Project, which spanned 13 years and cost $1 billion. Today, sequencing can be completed in roughly one day for approximately $1,000. Furthermore, technological advances in sequencing and on the IT front have enabled TGen to increase the number of patients being sequenced from the hundreds to the thousands annually. To handle the storage output from current sequencing technologies and emerging single molecule real-time (SMRT) sequencing, TGen required an infrastructure with the storage capacity and performance to support big data repositories produced by genetic sequencing—even as they grow exponentially.

“When you get more sequencers that go faster and run cheaper, and the more people are being sequenced, you’re going to need more resources in order to process this tsunami of data,” said James Lowey, TGen’s CIO.

TGen stores vast amounts of data generated by precision medicine, such as genetic data and data from wearables, including glucose monitors and pain management devices, as well as clinical records and population health statistics. Scientists must then correlate and analyze this information to develop a complete picture of an individual’s illness and potential treatment. This involves TGen’s sequencing cluster churning through one million CPU hours per month and calls for a storage solution that maintains high availability, which is critical to the around-the-clock research environment.
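For a sense of what one million CPU hours per month means in hardware terms, here is a quick back-of-the-envelope calculation. The utilization figure is an assumption for illustration, not a description of TGen’s actual cluster.

```python
# Illustration only: how many cores must run continuously to deliver
# one million CPU hours in a month (not TGen's actual cluster size).
cpu_hours_per_month = 1_000_000
hours_per_month = 30 * 24          # ~720 wall-clock hours

cores_at_full_utilization = cpu_hours_per_month / hours_per_month
print(f"~{cores_at_full_utilization:.0f} cores running 24/7")   # ~1389 cores

# At an assumed 70% average utilization:
print(f"~{cores_at_full_utilization / 0.7:.0f} cores")          # ~1984 cores
```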

Benefits for Researchers

In the coming years, researchers can expect the number of genetic sequences to keep increasing, with SMRT sequencing paving the way for even larger data volumes.

Lowey notes, “As genetic data continues to grow exponentially, it’s even more important to have an extremely reliable infrastructure to manage that data and make it accessible to the scientists 24/7.”

Having a robust storage infrastructure in place allows researchers to devote their time and attention fully to the core business of science, without worrying about whether there’s enough disk space or processing capacity. It also helps scientists get more precise treatments to patients faster, enabling breakthroughs that lead to life-saving and life-changing medical treatments, the ultimate goal of TGen and like-minded research institutes.

Looking Ahead

With the likelihood of sequencing clusters growing to exabyte-scale, TGen and its peers must continue to seek out an enterprise approach that emphasizes reliability and scalability and ensures high availability of critical data for 24/7 operations.

Lowey summarizes the future of precision medicine and IT by saying, “The possibilities are endless, but the real trick is to build all of that backend infrastructure to support it.”

To learn more about Dell EMC’s work with TGen, check out our video below.

 

Get first access to our Life Sciences Solutions

Why Healthcare IT Should Abandon Data Storage Islands and Take the Plunge into Data Lakes

One of the most significant technology-related challenges in the modern era is managing data growth. As healthcare organizations leverage new data-generating technology and as medical record retention requirements evolve, data volumes are rising exponentially (already growing at 48 percent each year, according to the Dell EMC Digital Universe Study) and retention periods can span decades.

Let’s start by first examining the factors contributing to the healthcare data deluge:

  • Longer legal retention times for medical records – in some cases up to the lifetime of the patient.
  • Digitization of healthcare and new digitized diagnostics workflows such as digital pathology, clinical next-generation sequencing, digital breast tomosynthesis, surgical documentation and sleep study videos.
  • With more digital images to store and manage, there is also an increased need for bigger picture archive and communication system (PACS) or vendor-neutral archive (VNA) deployments.
  • Finally, more people are having these digitized medical tests (especially given the large aging population), resulting in a higher number of yearly studies with increased data sizes.

Healthcare organizations also face frequent and complex storage migrations, rising operational costs, storage inefficiencies, limited scalability, increasing management complexity and storage tiering issues caused by storage silo sprawl.

Another challenge is the growing demand to understand and utilize unstructured clinical data. Mining this data requires a storage infrastructure that supports in-place analytics, which is needed for better patient insights and for the evolution of healthcare toward precision medicine.

Isolated Islands Aren’t Always Idyllic When It Comes to Data

The way that healthcare IT has approached data storage infrastructure historically hasn’t been ideal to begin with, and it certainly doesn’t set up healthcare organizations for success in the future.

Traditionally, when adding new digital diagnostic tools, healthcare organizations provided a dedicated storage infrastructure for each application or diagnostic discipline. For example, to deal with the growing storage requirements of digitized X-rays, an organization might create a new storage system solely for the radiology department. As a result, isolated storage silos, or data islands, must be individually managed, making processes and infrastructure complicated and expensive to operate and scale.

Isolated silos further undermine IT goals by increasing the cost of data management and compounding the complexity of performing analytics, which may require copying large amounts of data into yet another dedicated storage infrastructure that can’t be shared with other workflows. Even sustaining these silos is involved and expensive, because tech refreshes require migrating medical data to new storage. Each migration, typically performed every three to five years, is labor-intensive and complicated. Frequent migrations not only strain resources, but also take IT staff away from projects aimed at modernizing the organization, improving patient care and increasing revenue.

Further, silos make it difficult for healthcare providers to search data and analyze information, preventing them from gaining the insights they need for better patient care. Healthcare providers are also looking to tap potentially important medical data from Internet-connected medical devices or personal technologies such as wireless activity trackers. If healthcare organizations are to remain successful in a highly regulated and increasingly competitive, consolidated and patient-centered market, they need a simplified, scalable data management strategy.

Simplify and Consolidate Healthcare Data Management with Data Lakes

The key to modern healthcare data management is to employ a strategy that simplifies storage infrastructure and storage management and supports multiple current and future workflows simultaneously. A Dell EMC healthcare data lake, for example, leverages scale-out storage to house data for clinical and non-clinical workloads across departmental boundaries. Such healthcare data lakes reduce the number of storage silos a hospital uses and eliminate the need for data migrations. This type of storage scales on the fly without downtime, addressing IT scalability and performance issues and providing native file and next-generation access methods.

Healthcare data lake storage can also:

  • Eliminate storage inefficiencies and reduce costs by automatically moving data that can be archived to denser, more cost-effective storage tiers (see the sketch after this list).
  • Allow healthcare IT to expand into private, hybrid or public clouds, enabling IT to leverage cloud economies by creating storage pools for object storage.
  • Offer long-term data retention without the security risks of the public cloud or the loss of data sovereignty; the same cloud expansion can be utilized for next-generation use cases such as healthcare IoT.
  • Enable precision medicine and better patient insights by fostering advanced analytics across all unstructured data, such as digitized pathology, radiology, cardiology and genomics data.
  • Reduce data management costs and complexities through automation, and scale capacity and performance on demand without downtime.
  • Eliminate storage migration projects.
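As a sketch of the automated tiering called out in the first bullet above, the snippet below applies a simple age-based policy that moves cold files from a fast tier to a denser archive tier. In a real data lake this logic is enforced by policies inside the storage platform itself; the paths and threshold here are arbitrary placeholders.

```python
# Illustrative age-based tier-down policy (placeholder paths and thresholds).
# Production data lakes apply equivalent policies inside the storage platform.
import os
import shutil
import time

HOT_TIER = "/lake/hot"          # fast tier for active clinical workloads
ARCHIVE_TIER = "/lake/archive"  # denser, cheaper tier for cold studies
MAX_AGE_DAYS = 365              # archive anything untouched for a year

cutoff = time.time() - MAX_AGE_DAYS * 86400

for root, _dirs, files in os.walk(HOT_TIER):
    for name in files:
        path = os.path.join(root, name)
        if os.path.getmtime(path) < cutoff:           # last modified before the cutoff
            dest = os.path.join(ARCHIVE_TIER, os.path.relpath(path, HOT_TIER))
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            shutil.move(path, dest)                   # tier the file down to archive
```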

 

The greatest technical challenge facing today’s healthcare organizations is the ability to effectively leverage and manage data. However, by employing a healthcare data management strategy that replaces siloed storage with a Dell EMC healthcare data lake, healthcare organizations will be better prepared to meet the infrastructure requirements of today and tomorrow and to usher in advanced analytics and new storage access methods.

 

Get your fill of news, resources and videos on the Dell EMC Emerging Technologies Healthcare Resource Page

 

 

Using a World Wide Herd (WWH) to Advance Disease Discovery and Treatment

Patricia Florissi

Vice President & Global Chief Technology Officer, Sales at Dell EMC
Patricia Florissi is Vice President and Global Chief Technology Officer (CTO) for Sales. As Global CTO for Sales, Patricia helps define mid- and long-term technology strategy, representing the needs of the broader EMC ecosystem in EMC strategic initiatives. Patricia is an EMC Distinguished Engineer, holds a Ph.D. in Computer Science from Columbia University in New York, graduated valedictorian with an MBA from the Stern School of Business at New York University, and has a master’s and a bachelor’s degree in Computer Science from the Universidade Federal de Pernambuco in Brazil. Patricia holds multiple patents and has published extensively in periodicals including Computer Networks and IEEE Proceedings.


Analysis of very large genomic datasets has the potential to radically alter the way we keep people healthy. Whether it is quickly identifying the cause of a new infectious outbreak to prevent its spread or personalizing a treatment based on a patient’s genetic variants to knock out a stubborn disease, modern Big Data analytics has a major role to play.

By leveraging cloud, Apache™ Hadoop®, next-generation sequencers, and other technologies, life scientists potentially have a powerful new way to conduct innovative, global-scale collaborative genomic analysis that has not been possible before. With the right approach, great benefits can be realized.


To illustrate the possibilities and benefits of using coordinated worldwide genomic analysis, Dell EMC partnered with researchers at Ben-Gurion University of the Negev (BGU) to develop a global data analytics environment that spans across multiple clouds. This environment lets life sciences organizations analyze data from multiple heterogeneous sources while preserving privacy and security. The work conducted by this collaboration simulated a scenario that might be used by researchers and public health organizations to identify the early onset of outbreaks of infectious diseases. The approach could also help uncover new combinations of virulence factors that may characterize new diseases. Additionally, the methods used have applicability to new drug discovery and translational and personalized medicine.

 

Expanding on past accomplishments

In 2003, SARS (severe acute respiratory syndrome) was the first infectious outbreak where fast global collaborative genomic analysis was used to identify the cause of a disease. The effort was carried out by researchers in the U.S. and Canada who decoded the genome of the coronavirus to prove it was the cause of SARS.

The Dell EMC and BGU simulated disease detection and identification scenario makes use of technological developments (the much lower cost of sequencing, the availability of greater computing power, the use of cloud for data sharing, etc.) to address some of the shortcomings of past efforts and enhance the outcome.

Specifically, some diseases are caused by a combination of virulence factors. These factors may all be present in one pathogen or spread across several pathogens in the same biome, and there can also be geographical variations. This makes it very hard to identify the root causes of a disease when pathogens are analyzed in isolation, as has been the case in the past.

Addressing these issues requires sequencing entire micro-biomes from many samples gathered worldwide. The computational requirements for such an approach are enormous. A single facility would need a compute and storage infrastructure on a par with major government research labs or national supercomputing centers.

Dell EMC and BGU simulated a scenario of distributed sequencing centers scattered worldwide, where each center sequences entire micro-biome samples. Each center analyzes the sequence reads generated against a set of known virulence factors. This is done to detect the combination of these factors causing diseases, allowing for near-real time diagnostic analysis and targeted treatment.

To carry out these operations in the different centers, Dell EMC extended the Hadoop framework to orchestrate distributed and parallel computation across clusters scattered worldwide. This pushed computation as close as possible to the source of data, leveraging the principle of data locality at world-wide scale, while preserving data privacy.

Since a single Hadoop instance is represented by an elephant, Dell EMC concluded that a set of Hadoop instances scattered across the world but working in tandem forms a World Wide Herd, or WWH. This is the name Dell EMC has given to its Hadoop extensions.


Using WWH, Dell EMC wrote a distributed application in which each of a set of collaborating sequencing centers calculates a profile of the virulence factors present in each of the micro-biomes it sequenced and sends just these profiles to a center selected to do the global computation.

That center would then use bi-clustering to uncover common patterns of virulence factors among subsets of micro-biomes that could have been originally sampled in any part of the world.
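The sketch below illustrates the shape of that computation rather than Dell EMC’s actual WWH code: each center reduces its micro-biome reads to a small presence profile of known virulence factors, only those profiles travel to the coordinating center, and the coordinating center assembles them for global analysis. The virulence factor names, simulated reads, and the naive grouping step are placeholders; a real implementation would run a bi-clustering algorithm on the assembled matrix.

```python
# Illustrative sketch of the WWH pattern described above (not Dell EMC's code):
# local centers compute small virulence-factor profiles; only the profiles
# travel to the coordinating center, which analyzes them globally.
KNOWN_FACTORS = ["vfA", "vfB", "vfC", "vfD"]   # placeholder virulence factor markers

def local_profile(reads):
    """Run at each sequencing center: reduce raw reads to a presence profile."""
    return {vf: int(any(vf in read for read in reads)) for vf in KNOWN_FACTORS}

def global_grouping(profiles_by_center):
    """Run at the coordinating center: group centers with identical profiles.
    A real implementation would apply bi-clustering to the factor matrix instead."""
    groups = {}
    for center, profile in profiles_by_center.items():
        signature = tuple(sorted(profile.items()))
        groups.setdefault(signature, []).append(center)
    return groups

# Simulated reads at three centers (placeholders for micro-biome sequence data).
centers = {
    "center_eu": ["...vfA...", "...vfC..."],
    "center_us": ["...vfA...", "...vfC..."],
    "center_ap": ["...vfB..."],
}

profiles = {name: local_profile(reads) for name, reads in centers.items()}
for signature, members in global_grouping(profiles).items():
    print(members, dict(signature))
```

The key property is that raw sequence data never leaves its center of origin; only the compact profiles do, which is what preserves both data locality and privacy.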

This approach could allow researchers and public health organizations to potentially identify the early onset of outbreaks and also uncover new combinations of virulence factors that may characterize new diseases.

There are several biological advantages to this approach. The approach eliminates the time required to isolate a specific pathogen for analysis and for re-assembling the genomes of the individual microorganisms. Sequencing the entire biome lets researchers identify known and unknown combinations of virulence factors. And collecting samples independently world-wide helps ensure the detection of variants.

On the compute side, the approach uses local processing power to perform the biome sequence analysis, which reduces the need for a large centralized HPC environment. Additionally, the method overcomes the challenge of data diversity: it can support all data sources and any data format.

This investigative approach could be used as a next-generation outbreak surveillance system. It allows collaboration in which different geographically dispersed groups simultaneously investigate different variants of a new disease. In addition, the WWH architecture has great applicability to pharmaceutical industry R&D efforts, which increasingly rely on a multi-disciplinary approach where geographically dispersed groups investigate different aspects of a disease or drug target using a wide variety of analysis algorithms on shared data.

 

Learn more about modern genomic Big Data analytics

 

 

Data Security: Are You Taking It For Granted?

Keith Manthey

CTO of Analytics at EMC Emerging Technologies Division


Despite the fact that the Wells Fargo fake account scandal first broke in September, the banking giant still finds itself the topic of national news headlines and facing public scrutiny months later. While it’s easy to assign blame, whether to the now-retired CEO, the company’s unrealistic sales goals and so forth, let’s take a moment to discuss a potential solution for Wells Fargo and its enterprise peers. I’m talking about data security and governance.

There’s no question that the data security and governance space is still evolving and maturing. Currently, the weakest link in the Hadoop ecosystem is data masking. As it stands at most enterprises using Hadoop, access to the Hadoop environment translates to uncensored access to information that can be highly sensitive. Fortunately, there are some initiatives to change that. Hortonworks recently released Ranger 2.5, which starts to add dynamic row- and column-level masking. Shockingly enough, I can count on one hand the number of clients who understand they need this feature. In some cases, CIO- and CTO-level executives aren’t even aware of just how critical configurable row and column masking capabilities are to the security of their data.

Another aspect I find shocking is the lack of controls around data governance in many enterprises. Without data restrictions, it’s all too easy to envision Wells Fargo’s situation, which resulted in 5,300 employees being fired, repeating itself at other financial institutions. It’s also important to point out that entering unmasked sensitive and confidential healthcare and financial data into a Hadoop system is not only an unwise and negligent practice; it’s a direct violation of mandated security and compliance regulations.

Identifying the Problem and Best Practices

From enterprise systems administrators to C-suite executives, both groups are guilty of taking data security for granted and of assuming that masking and encryption capabilities are guaranteed simply by having a database. These executives are failing to do their research, dig into the weeds and ask the more complex questions, often because their professional background focused on analytics or IT rather than governance. Unless an executive’s background includes building data systems or setting up controls and governance around these types of systems, he or she may not know the right questions to ask.

Another common mistake is not strictly controlling access to sensitive data, putting it at risk of theft and loss. Should customer service representatives be able to pull every file in the system? Probably not. Even IT administrators’ access should be restricted to the specific actions and commands required to perform their jobs. Encryption provides some file-level protection from unauthorized users, but authorized users who have permission to unlock an encrypted file can often see fields that aren’t required for their job.
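As an illustration of what configurable, role-based masking looks like in practice, the generic sketch below applies per-role masking rules to a record before it is handed to a user. This is not Apache Ranger’s implementation; platforms like Ranger enforce equivalent policies centrally, and the roles, fields, and masking rules here are placeholders.

```python
# Generic role-based column masking sketch (placeholder roles and fields);
# platforms such as Apache Ranger enforce equivalent policies centrally.
MASKING_RULES = {
    # role: {field: masking function}
    "customer_service": {
        "ssn": lambda v: "***-**-" + v[-4:],      # show last four digits only
        "account_balance": lambda v: None,         # hide the field entirely
    },
    "fraud_analyst": {
        "ssn": lambda v: "***-**-" + v[-4:],
    },
    "compliance_officer": {},                      # sees everything
}

def mask_record(record, role):
    """Return a copy of `record` with the role's masking rules applied."""
    rules = MASKING_RULES.get(role, {})
    masked = {field: rules[field](value) if field in rules else value
              for field, value in record.items()}
    return {k: v for k, v in masked.items() if v is not None}

record = {"name": "A. Customer", "ssn": "123-45-6789", "account_balance": 1042.17}
print(mask_record(record, "customer_service"))
# -> {'name': 'A. Customer', 'ssn': '***-**-6789'}
```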

As more enterprises adopt Hadoop and other similar systems, they should consider the following:

Do your due diligence. When meeting with customers, I can tell they’ve done their homework if they ask questions about more than the “buzz words” around Hadoop. These questions alone indicate they’re not simply regurgitating a sales pitch and have researched how to protect their environment. Be discerning and don’t assume the solution you’re purchasing off the shelf contains everything you need. Accepting what the salesperson says at face value, without probing further, is reckless and could lead to a very damaging and costly security scandal.

Accept there are gaps. Frequently, we engage with clients who are confident they have the most robust security and data governance available.
However, when we start to poke and prod a bit more to understand what other controls they have in place, the astonishing answer is zero. Lest we forget, “core” Hadoop only gained security without third-party add-ons in 2015, and governance around the software framework is still in its infancy in many ways. Without something as rudimentary in traditional IT security as a firewall in place, it’s difficult for enterprises to claim they are secure.

Have an independent plan. Before purchasing Hadoop or a similar platform, map out your exact business requirements, consider what controls your business needs and determine whether or not the product meets each of them. Research regulatory compliance standards to select the most secure configuration of your Hadoop environment and the tools you will need to supplement it.

To conclude, here is a seven-question checklist enterprises should be able to answer about their Hadoop ecosystem:

  • Do you know what’s in your Hadoop?
  • Is it meeting your business goals?
  • Do you really have the controls in place that you need to enable your business?
  • Do you have the governance?
  • Where are your gaps and how are you protecting them?
  • What are your augmented controls and supplemental procedures?
  • Have you reviewed the information the salesperson shared and mapped it to your actual business requirements to decide what you need?
