Archive for the ‘Life Sciences’ Category

Galaxy: A Workflow Management System for Modern Life Sciences Research

Nathan Bott

Healthcare Solutions Architect at EMC

Am I a life scientist or an IT data manager? That’s the question many researchers are asking themselves in today’s data-driven life sciences organizations.

Whether it is a bench scientist analyzing a genomic sequence or an M.D. exploring biomarkers and a patient’s genomic variants to develop a personalized treatment, researchers are spending a great amount of time searching for, accessing, manipulating, analyzing, and visualizing data.

Organizations supporting such research efforts are trying to make it easier to perform these tasks without the user needing extensive IT expertise and skills. This mission is not easy.

Focus on the data

Modern life sciences data analysis requirements are vastly different than they were just a handful of years ago.

In the past, once data was created, it was stored, analyzed soon after, and then archived to tape or another long-term medium. Today, not only is more data being generated, but the need to re-analyze that data also means it must be retained where it can be easily accessed for longer periods.

Additionally, today’s research is much more collaborative and multi-disciplinary. As a result, organizations must provide an easy way for researchers to access data, ensure that results are reproducible, and provide transparency to ensure best practices are used and that procedures adhere to regulatory mandates.

More analytics and more collaboration are two areas where The Galaxy Project (also known simply as Galaxy) can help. Galaxy is a scientific workflow, data integration, and data and analysis persistence and publishing platform designed to make computational biology accessible to research scientists who do not have computer programming experience.

Galaxy is typically used as a general-purpose bioinformatics workflow management system that automatically tracks and manages data while providing support for capturing the context and intent of computational methods.

Organizations have several ways to make use of Galaxy. They include:

Free public instance: The Galaxy Main instance is available as a free public service at UseGalaxy.org. This is the Galaxy Project’s primary production Galaxy instance and is useful for sharing or publishing data and methods with colleagues for routine analysis or with the larger scientific community for publications.

Anyone can use the public servers, with or without an account. (With an account, data quotas are increased and full functionality across sessions becomes available, such as naming, saving, sharing, and publishing Galaxy-defined objects.)

Publicly available instances: Many other Galaxy servers besides Main have been made publicly available by the Galaxy community. Specifically, a number of institutions have installed Galaxy and have made those installations either accessible to individual researchers or open to certain organizations or communities.

For example, the Centre de Bioinformatique de Bordeaux offers a general purpose Galaxy instance that includes EMBOSS (a software analysis package for molecular biology) and fibronectin (diversity analysis of synthetic libraries of a Fibronectin domain). Biomina offers a general purpose Galaxy instance that includes most standard tools for DNA/RNA sequencing, plus extra tools for panel resequencing, variant annotation, and some tools for Illumina SNP array analysis.

A list of the publicly available installations of Galaxy can be found here.

Do-it-yourself: Organizations also have the choice of deploying their own Galaxy installations. There are two options: install a local instance of Galaxy (more information on setting up a local instance can be found here), or deploy Galaxy to the cloud. The Galaxy Project supports CloudMan, a software package that provides a common interface to different cloud infrastructures.
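However an organization deploys it, a Galaxy instance also exposes a REST API that researchers can script against using the community’s BioBlend Python library. The sketch below is illustrative only: the URL and API key are placeholders, and the exact client calls are assumptions that should be checked against the BioBlend documentation for the version you install.

    # Illustrative sketch using the BioBlend library (pip install bioblend).
    # The URL and API key are placeholders for your own Galaxy instance.
    from bioblend.galaxy import GalaxyInstance

    gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")

    # Create a history to hold this analysis and upload a FASTQ file into it.
    history = gi.histories.create_history(name="rna-seq-demo")
    gi.tools.upload_file("reads.fastq", history["id"])

    # List the workflows already defined in this Galaxy instance.
    for wf in gi.workflows.get_workflows():
        print(wf["id"], wf["name"])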

How it works

Architecturally, Galaxy is a modular, Python-based web application that provides a data abstraction layer to integrate with various storage platforms. This allows researchers to access data on a variety of storage back ends, such as standard direct-attached storage, S3 object-based cloud storage, storage management systems like iRODS (the Integrated Rule-Oriented Data System), or a distributed file system.

For example, a Galaxy implementation might use object-based storage such as that provided by Dell EMC Elastic Cloud Storage (ECS). ECS is a software-defined, cloud-scale, object storage platform that combines the cost advantages of commodity infrastructure with the reliability, availability, and serviceability of traditional storage arrays.

With ECS, any organization can deliver scalable and simple public cloud services with the reliability and control of a private-cloud infrastructure.

ECS provides comprehensive protocol support, including S3 and Swift, for unstructured workloads on a single, cloud-scale storage platform. This allows the user of a Galaxy implementation to easily access data stored on such cloud storage platforms.
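Because ECS speaks the S3 protocol, data it holds can be reached with any standard S3 client simply by pointing that client at the ECS endpoint. The sketch below uses the boto3 library; the endpoint URL, bucket name, object keys, and credentials are placeholders, not real ECS values.

    # Minimal sketch: reading sequencing data from an S3-compatible endpoint,
    # such as an ECS deployment. Endpoint, bucket, and credentials are placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://ecs.example.org:9021",   # hypothetical ECS S3 endpoint
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    # List FASTQ objects in a project bucket and download one for analysis.
    for obj in s3.list_objects_v2(Bucket="genomics-data", Prefix="runs/").get("Contents", []):
        print(obj["Key"], obj["Size"])

    s3.download_file("genomics-data", "runs/sample01.fastq.gz", "/tmp/sample01.fastq.gz")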

With ECS, organizations can easily manage a globally distributed storage infrastructure under a single global namespace with anywhere access to content. ECS features a flexible software-defined architecture that is layered to promote limitless scalability. Each layer is completely abstracted and independently scalable with high availability and no single points of failure.


You can test drive Dell EMC ECS by registering for an account and getting access to our APIs by visiting https://portal.ecstestdrive.com/

Or you can download the Dell EMC ECS Community Edition here and try it for FREE in your own environment, with no time limit for non-production use.

Overcoming the Exabyte-Sized Obstacles to Precision Medicine

Wolfgang Mertz

CTO of Healthcare, Life Sciences and High Performance Computing

As we make strides towards a future that includes autonomous cars and grocery stores sans checkout lines, concepts that once seemed reserved only for utopian fiction, it seems there’s no limit to what science and technology can accomplish. It’s an especially exciting time for those in the life sciences and healthcare fields, with 2016 seeing breakthroughs such as a potential “universal” flu vaccine and CRISPR, a promising gene editing technology that may help treat cancer.

Several of Dell EMC’s customers are also making significant advances in precision medicine, the medical model that focuses on using an individual’s specific genetic makeup to customize and prescribe treatments.

Currently, physicians and scientists are in the research phase of a myriad of applications for precision medicine, including oncology, diabetes, and cardiology. Before we are able to realize the vision President Obama shared in his 2015 Precision Medicine Initiative of “the right treatments at the right time, every time, to the right person,” there are significant challenges to overcome.

Accessibility

For precision medicine to become available to the masses, researchers and doctors will need not only the technical infrastructure to support genomic sequencing, but also the storage capacity and resources to access, view, and share additional relevant data. They will need visibility into patients’ electronic health records (EHR), along with information on environmental conditions, lifestyle behaviors, and biological samples. While increased data sharing may sound simple enough, the reality is there is still much work to be done on the storage infrastructure side to make this possible. Much of this data is typically siloed, which impedes healthcare providers’ ability to collaborate and review critical information that could impact a patient’s diagnosis and treatment. To take full advantage of the potential life-saving insights available from precision medicine, organizations must implement a storage solution that enables high-speed access anytime, anywhere.

Volume

Another issue to confront is the storage capacity needed to house and preserve the petabytes of genomic data, medical imaging, EHR, and other data. Thanks to the decreased cost of genomic sequencing and more genomes being analyzed, the sheer volume of genomic data alone is quickly eclipsing the storage available in most legacy systems. According to a scientific report by Stephens et al. published in PLOS Biology, between 100 million and two billion human genomes may be sequenced by 2025. This may lead to storage demands of 2 to 40 exabytes, since storage requirements must take into consideration the accuracy of the data collected. The paper states that, “For every 3 billion bases of human genome sequence, 30-fold more data (~100 gigabases) must be collected because of errors in sequencing, base calling and genome alignment.” With this exponential projected growth, scale-out storage that can simultaneously manage multiple current and future workflows is more necessary now than ever.
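To make that scale concrete, a back-of-the-envelope calculation reproduces the ballpark of the 2-40 exabyte range. The per-genome figure used below (roughly 20 gigabytes retained per genome, derived from ~100 gigabases of raw reads at an assumed effective 0.2 bytes per base after compression) is an illustrative assumption, not a number taken from the paper.

    # Back-of-the-envelope storage estimate for genomic sequencing by 2025.
    # Assumption (illustration only): ~100 gigabases of raw reads per genome,
    # retained at an effective ~0.2 bytes per base, i.e. ~20 GB per genome.
    BYTES_PER_GENOME = 100e9 * 0.2    # ~20 GB retained per genome
    EXABYTE = 1e18

    for genomes in (100e6, 2e9):      # 100 million to two billion genomes
        total_eb = genomes * BYTES_PER_GENOME / EXABYTE
        print(f"{genomes:.0e} genomes -> ~{total_eb:.0f} EB")

    # Prints roughly 2 EB at the low end and 40 EB at the high end,
    # in line with the range projected by Stephens et al.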

Early Stages 

Finally, while it’s easy to get caught up in the excitement of the advances made thus far in precision medicine, we have to remember this remains a young discipline. At the IT level, there’s still much to be done around network and storage infrastructure and workflows in order to develop the solutions that will make this ground-breaking research readily available to the public, the physician community, and healthcare professionals. Third-generation platform applications need to be built to make this more mainstream. Fortunately, major healthcare technology players such as GE and Philips have undertaken initiatives to attract independent software vendor (ISV) applications. With high-profile companies willing to devote time and resources to supporting ISV applications, scientists are more likely to have access to sophisticated tools sooner.

More cohort analyses such as Genomics England’s 100,000 Genomes Project must be put in place to ensure researchers have sufficient data to develop new forms of screening and treatment, and these efforts will also necessitate additional storage capabilities.

Conclusion

Despite these barriers, the future remains promising for precision medicine. With the proper infrastructure in place to provide reliable shared access and massive scalability, clinicians and researchers will have the freedom to focus on discovering the breakthroughs of tomorrow.


TGen Cures Storage Needs with Dell EMC to Advance Precision Medicine

Sasha Paegle


Sr. Business Development Manager, Life Sciences

As the gap between theoretical treatment and clinical application for precision medicine continues to shrink, we’re inching closer to the day when doctors routinely use an individual’s genome to prescribe specific care strategies.

Organizations such as the Translational Genomics Research Institute (TGen), a leading biomedical research institute, are on the forefront of enabling a new generation of life-saving treatments. With innovations from TGen, breakthroughs in genetic sequencing are unraveling mysteries of complex diseases like cancer.

To help achieve its goal of successfully using -omics to prevent, diagnose, and treat disease, the Phoenix-based non-profit research institute selected Dell EMC to enhance its IT system and infrastructure to manage its petabyte-size sequencing cluster.

Data Tsunami 

The time and cost of genomic sequencing for a single person has dropped dramatically since the Human Genome Project, which spanned 13 years and cost $1 billion. Today, sequencing can be completed in roughly one day for approximately $1,000. Furthermore, technological advances in sequencing and on the IT front have enabled TGen to increase the number of patients being sequenced from the hundreds to the thousands annually. To handle the storage output from current sequencing technologies and emerging single molecule real-time (SMRT) sequencing, TGen required an infrastructure with the storage capacity and performance to support big data repositories produced by genetic sequencing—even as they grow exponentially.

“When you get more sequencers that go faster and run cheaper, and the more people are being sequenced, you’re going to need more resources in order to process this tsunami of data,” said James Lowey, TGen’s CIO.

TGen stores vast amounts of data generated by precision medicine, such as genetic data and data from wearables including glucose monitors and pain management devices, as well as clinical records and population health statistics. Scientists must then correlate and analyze this information to develop a complete picture of an individual’s illness and potential treatment. This involves TGen’s sequencing cluster churning through one million CPU hours per month and calls for a storage solution that can also maintain high availability, which is critical to the around-the-clock research environment.

Benefits for Researchers

In the coming years, researchers can expect the number of genetic sequences to increase, with SMRT sequencing paving the way for even larger data volumes.

Lowey notes, “As genetic data continues to grow exponentially, it’s even more important to have an extremely reliable infrastructure to manage that data and make it accessible to the scientists 24/7.”

Having a robust storage infrastructure in place allows researchers to fully devote their time and attention to the core business of science without worrying whether there’s enough disk space or processing capacity. It also helps scientists get more precise treatments to patients faster, enabling breakthroughs that lead to life-saving and life-changing medical treatments – the ultimate goal of TGen and like-minded research institutes.

Looking Ahead

With the likelihood of sequencing clusters growing to exabyte-scale, TGen and its peers must continue to seek out an enterprise approach that emphasizes reliability and scalability and ensures high availability of critical data for 24/7 operations.

Lowey summarizes the future of precision medicine and IT by saying, “The possibilities are endless, but the real trick is to build all of that backend infrastructure to support it.”

To learn more about Dell EMC’s work with TGen, check out our video below.

 


Using a World Wide Herd (WWH) to Advance Disease Discovery and Treatment

Patricia Florissi

Vice President & Global Chief Technology Officer, Sales at Dell EMC
Patricia Florissi is Vice President and Global Chief Technology Officer (CTO) for Sales. As Global CTO for Sales, Patricia helps define mid- and long-term technology strategy, representing the needs of the broader EMC ecosystem in EMC strategic initiatives. Patricia is an EMC Distinguished Engineer, holds a Ph.D. in Computer Science from Columbia University in New York, graduated valedictorian with an MBA from the Stern School of Business at New York University, and has a Master’s and a Bachelor’s degree in Computer Science from the Universidade Federal de Pernambuco in Brazil. Patricia holds multiple patents and has published extensively in periodicals including Computer Networks and IEEE Proceedings.


Analysis of very large genomic datasets has the potential to radically alter the way we keep people healthy. Whether it is quickly identifying the cause of a new infectious outbreak to prevent its spread or personalizing a treatment based on a patient’s genetic variants to knock out a stubborn disease, modern Big Data analytics has a major role to play.

By leveraging cloud, Apache™ Hadoop®, next-generation sequencers, and other technologies, life scientists potentially have a powerful new way to conduct innovative, global-scale, collaborative genomic analysis research that has not been possible before. With the right approach, there are great benefits to be realized.


To illustrate the possibilities and benefits of using coordinated worldwide genomic analysis, Dell EMC partnered with researchers at Ben-Gurion University of the Negev (BGU) to develop a global data analytics environment that spans across multiple clouds. This environment lets life sciences organizations analyze data from multiple heterogeneous sources while preserving privacy and security. The work conducted by this collaboration simulated a scenario that might be used by researchers and public health organizations to identify the early onset of outbreaks of infectious diseases. The approach could also help uncover new combinations of virulence factors that may characterize new diseases. Additionally, the methods used have applicability to new drug discovery and translational and personalized medicine.

 

Expanding on past accomplishments

In 2003, SARS (severe acute respiratory syndrome) was the first infectious outbreak where fast global collaborative genomic analysis was used to identify the cause of a disease. The effort was carried out by researchers in the U.S. and Canada who decoded the genome of the coronavirus to prove it was the cause of SARS.

The Dell EMC and BGU simulated disease detection and identification scenario makes use of technological developments (the much lower cost of sequencing, the availability of greater computing power, the use of cloud for data sharing, etc.) to address some of the shortcomings of past efforts and enhance the outcome.

Specifically, some diseases are caused by a combination of virulence factors. These factors may all be present in one pathogen or spread across several pathogens in the same biome. There can also be geographical variations. This makes it very hard to identify the root causes of a disease when pathogens are analyzed in isolation, as has been the case in the past.

Addressing these issues requires sequencing entire micro-biomes from many samples gathered worldwide. The computational requirements for such an approach are enormous. A single facility would need a compute and storage infrastructure on a par with major government research labs or national supercomputing centers.

Dell EMC and BGU simulated a scenario of distributed sequencing centers scattered worldwide, where each center sequences entire micro-biome samples. Each center analyzes the sequence reads generated against a set of known virulence factors. This is done to detect the combination of these factors causing diseases, allowing for near-real time diagnostic analysis and targeted treatment.

To carry out these operations in the different centers, Dell EMC extended the Hadoop framework to orchestrate distributed and parallel computation across clusters scattered worldwide. This pushed computation as close as possible to the source of data, leveraging the principle of data locality at world-wide scale, while preserving data privacy.

Since one Hadoop instance is represented by a single elephant, Dell EMC concluded that a set of Hadoop instances scattered across the world but working in tandem formed a World Wide Herd, or WWH. This is the name Dell EMC has given to its Hadoop extensions.


Using WWH, Dell EMC wrote a distributed application in which each of a set of collaborating sequence centers calculates a profile of the virulence factors present in each of the micro-biomes it sequenced and sends just these profiles to a center selected to do the global computation.

That center would then use bi-clustering to uncover common patterns of virulence factors among subsets of micro-biomes that could have been originally sampled in any part of the world.
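WWH itself is a proprietary Dell EMC extension to Hadoop, so the sketch below is not its actual API. It only illustrates, in plain Python, the pattern described above: each center reduces its large local reads to a small profile of virulence-factor hits, only those profiles travel to the coordinating site, and that site stacks them into a matrix and bi-clusters it (here using scikit-learn’s SpectralCoclustering as a stand-in). All sequences and factor signatures are toy placeholders.

    # Illustration of the distributed pattern described above -- NOT the WWH API.
    import numpy as np
    from sklearn.cluster import SpectralCoclustering

    FACTOR_SIGNATURES = ["GATTACA", "CCGGTT", "ATATGC", "TTGACC"]  # toy stand-ins

    def local_profile(reads):
        """Runs at each sequencing center: count reads containing each signature."""
        return [sum(sig in read for read in reads) for sig in FACTOR_SIGNATURES]

    # Toy reads from three centers; in practice each center keeps terabytes locally
    # and ships only the tiny profile computed from them.
    center_samples = [
        ["GATTACACCA", "TTGACCGATT", "GATTACATTT"],   # micro-biome sampled at center A
        ["CCGGTTAAAA", "CCGGTTTTGA", "ATATGCGGGG"],   # micro-biome sampled at center B
        ["ATATGCCCGG", "TTGACCATAT", "CCGGTTATAT"],   # micro-biome sampled at center C
    ]
    profiles = np.array([local_profile(reads) for reads in center_samples])

    # Global step at the coordinating center: bi-cluster samples and factors together.
    model = SpectralCoclustering(n_clusters=2, random_state=0)
    model.fit(profiles)
    print(model.row_labels_)     # which samples group together
    print(model.column_labels_)  # which virulence factors characterize each group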

This approach could allow researchers and public health organizations to potentially identify the early onset of outbreaks and also uncover new combinations of virulence factors that may characterize new diseases.

There are several biological advantages to this approach. The approach eliminates the time required to isolate a specific pathogen for analysis and for re-assembling the genomes of the individual microorganisms. Sequencing the entire biome lets researchers identify known and unknown combinations of virulence factors. And collecting samples independently world-wide helps ensure the detection of variants.

On the compute side, the approach uses local processing power to perform the biome sequence analysis. This reduces the need for a large centralized HPC environment. Additionally, the method overcomes the matter of data diversity. It can support all data sources and any data formats.

This investigative approach could be used as a next-generation outbreak surveillance system. It allows collaboration in which different geographically dispersed groups simultaneously investigate different variants of a new disease. In addition, the WWH architecture has great applicability to pharmaceutical industry R&D efforts, which increasingly rely on a multi-disciplinary approach in which geographically dispersed groups investigate different aspects of a disease or drug target using a wide variety of analysis algorithms on shared data.

 

Learn more about modern genomic Big Data analytics

 

 

Big Data Analysis for the Greater Good: Dell EMC & the 100,000 Genome Project

Wolfgang Mertz

CTO of Healthcare, Life Sciences and High Performance Computing

It might seem far-reaching to say that big data analysis can fundamentally impact patient outcomes around cancer and other illnesses, and that it has the power to ultimately transform health services and indeed society at large, but that’s the precise goal behind the 100,000 Genome Project from Genomics England.

For background, Genomics England is a wholly-owned company of the Department of Health, set up to deliver the 100,000 Genomes Project. This exciting endeavor will sequence and collect 100,000 whole genomes from 70,000 NHS patients and their families (with their full consent), focusing on patients with rare diseases as well as those with common cancers.

The program is designed to create a lasting legacy for patients as well as the NHS and the broader UK economy, while encouraging innovation in the UK’s bioscience sector. The genetic sequences will be anonymized and shared with approved academic researchers to help develop new treatments and diagnostic testing methods targeted at the genetic characteristics of individual patients.

Dell EMC provides the platform for large-scale analytics in a hybrid cloud model for Genomics England, which leverages our VCE vScale with EMC Isilon and EMC XtremIO solutions. The Project has been using EMC storage for its genomic sequence library, and now it will be leveraging an Isilon data lake to securely store data during the sequencing process. Backup services are provided by EMC Data Domain and EMC NetWorker.

The Genomics England IT environment uses both on-prem servers and IaaS provided by cloud service providers on G-Cloud. According to an article from Government Computing, “one of Genomics England’s key legacies is expected to be an ecosystem of cloud service providers providing low cost, elastic compute on demand through G-Cloud, bringing the benefits of scale to smaller research groups.”

There are two main considerations from an IT perspective around genome and DNA sequencing projects such as those being done by Genomics England and others: data management and speed. Vast amounts of research data have to be stored and retrieved, and this large-scale biologic data has to be processed quickly in order to gain meaningful insights.

Scale is another key factor. Sequencing and storing genomic information digitally is a data-intensive endeavor, to say the least. Sequencing a single genome creates hundreds of gigabytes of data, and the Project has sequenced over 13,000 genomes to date, a volume expected to grow roughly tenfold over the next two years. The data lake being used by Genomics England allows 17 petabytes of data to be stored and made available for multi-protocol analytics (including Hadoop).
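A rough, illustrative calculation shows how those figures hang together. The per-genome number below (~130 GB of stored data per whole genome) is an assumption chosen for illustration, not a figure published by Genomics England.

    # Rough sanity check of the scale described above.
    # Assumption (illustrative only): ~130 GB of stored data per whole genome.
    GB_PER_GENOME = 130
    GENOMES_SO_FAR = 13_000

    current_pb = GENOMES_SO_FAR * GB_PER_GENOME / 1e6    # gigabytes -> petabytes
    projected_pb = current_pb * 10                       # "ten times more data"

    print(f"current: ~{current_pb:.1f} PB, projected: ~{projected_pb:.0f} PB")
    # ~1.7 PB today and ~17 PB after tenfold growth -- the same order of magnitude
    # as the 17 PB data lake mentioned above.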

For perspective, 1 PB is a quadrillion bytes – think of that as 20 million four-drawer filing cabinets filled with text. Or, considering that the Milky Way contains roughly two hundred billion stars, if you count each star as a single byte, it would take 5,000 Milky Way galaxies to reach 1 PB of data. It’s staggering.

The potential of being able to contribute to eradicating disease and identifying exciting new treatments is truly awe-inspiring. And considering the immense scale of the data involved – 5,000 galaxies! – provides new context around reaching for the stars.


 

Metalnx: Making iRODS Easy

Stephen Worth

Stephen Worth is a director of Global Innovation Operations at Dell EMC. He manages development and university research projects in Brazil, serves as a technical liaison helping to improve innovation across our global engineering labs, and works in digital asset management leveraging user-defined metadata. Steve is based out of Dell EMC’s RTP Software Development Center, which focuses on data protection, core storage products, and cloud storage virtualization. Steve started with Data General in 1985, which was acquired by EMC in 1999, and Dell Technologies in 2016. He has led many product development efforts involving operating systems, diagnostics, UI, database, and applications porting. His background includes vendor and program management, performance engineering, engineering services, manufacturing, and test engineering. Steve, an alumnus of North Carolina State University, received a B.S. degree in Chemistry in 1981 and an M.S. degree in Computer Studies in 1985. He served as an adjunct faculty member of the Computer Science department from 1987-1999. Steve is an emeritus member of the Computer Science Department’s Strategic Advisory Board and is currently chairperson of the Technical Advisory Board for the James B. Hunt Jr. Library on Centennial Campus.



Advances in sequencing, spectroscopy, and microscopy are driving life sciences organizations to produce vast amounts of data. Most organizations are dedicating significant resources to the storage and management of that data. Until recently, however, their primary efforts have focused on hosting the data for high-performance, rapid analysis and on moving it to more economical disks for longer-term storage.

The nature of life sciences work demands better data organization. The data produced by today’s next-generation lab equipment is rich in information, making it of interest to different research groups and individuals at varying points in time. Examples include:

  • Raw experimental and analyzed data may be needed as new drug candidates move through research and development, clinical trials, FDA approval, and production
  • A team interested in new indications for an existing chemical compound would want to leverage work already done by others in the organization on the compound in the past
  • In the realm of personalized medicine, clinicians may need to evaluate not only a person’s health history, but correlate that information with genome sequences and phenotype data throughout the individual’s life.

The great challenge is how to make data more generally available and useful throughout an organization. Researchers need to know what data exists and have a way to access it. For this to happen, data must be properly categorized, searchable, and easy to find.

To get help in this area, many research organizations and government agencies worldwide are using the Integrated Rule-Oriented Data System (iRODS), which is open source data management software developed by the iRODS Consortium. iRODS enables data discovery using a data/metadata catalog that can retain machine and user-defined metadata describing every file, collection, and object in a data grid.

Additionally, iRODS automates data workflows with a rule engine that permits any action to be initiated by any trigger on any server or client in the grid. iRODS enables secure collaboration, so users only need to log in to their home grid to access data hosted on a remote, federated grid.
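For readers who prefer to see the idea in code, the sketch below uses the python-irodsclient package to attach descriptive metadata to a file registered in iRODS. The host, zone, credentials, paths, and attribute names are all placeholders, and the calls should be checked against the client version you install.

    # Minimal sketch using python-irodsclient (pip install python-irodsclient).
    # Connection details, paths, and attribute names are placeholders.
    from irods.session import iRODSSession

    with iRODSSession(host="irods.example.org", port=1247,
                      user="researcher", password="secret", zone="labZone") as session:
        obj = session.data_objects.get("/labZone/home/researcher/run42/sample.fastq")

        # Attach user-defined metadata (attribute/value pairs) to the data object.
        obj.metadata.add("project", "NGS-2017-042")
        obj.metadata.add("organism", "Malus domestica")
        obj.metadata.add("sequencer", "HiSeq 2500")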

Leveraging iRODS can be simplified, and its benefits enhanced, when it is used with Metalnx, an administrative and metadata management user interface (UI) for iRODS. Metalnx was developed by Dell EMC through its efforts as a corporate member of the iRODS Consortium. The intuitive Metalnx UI helps both the IT administrators charged with managing metadata and the end users and researchers who need to find and access relevant data based upon metadata descriptions.

Making use of metadata via the easy-to-use UI that Metalnx provides on top of iRODS can help organizations:

  • Maximize storage assets
  • Find what’s valuable, no matter where the data is located
  • Automate movement and processing of data
  • Securely share data with collaborators

Real world example: Putting the issues into perspective

A simple example illustrates why iRODS and Metalnx are needed. Plant & Food Research, a New Zealand-based science company providing research and development that adds value to fruit, vegetable, crop and food products, makes great use of next-generation sequencing and genotyping. The work generates a lot of mixed data types.

“In the past, we were good at storing data, but not good at categorizing the data or using metadata,” said Ben Warren, bioinformatician at Plant & Food Research. “We tried to get ahead of this by looking at what other institutions were doing.”

iRODS seemed a good fit. It was the only decent open source solution available. However, there were some limitations. “We were okay with the rule engine, but not the interface,” said Warren.

A system administrator working with EMC on hardware for the organization’s compute cluster had heard of Metalnx and mentioned this to Warren. “We were impressed off the bat with its ease of use,” said Warren. “Not only would it be useful for bioinformaticians, coders, and statisticians, but also for the scientists.”

The reason: Metalnx makes it easier to categorize the organization’s data, to control the metadata used to categorize the data, and to use the metadata to find and access any data.

Benefits abound

At Plant & Food Research, metadata is an essential element of a scientist’s workflow. The metadata makes it easier to find data at any stage of a research project. When a project is conceived, scientists start by determining all the metadata required for the project using Metalnx and cataloging the data using iRODS. With this approach, everything associated with a project, including the samples used, sample descriptions, experimental design, NGS data, and other information, is searchable.
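Once projects are tagged this way, the same catalog can be searched programmatically as well as through the Metalnx UI. The hedged sketch below again uses python-irodsclient (reusing the session from the earlier example); the attribute name and value are placeholders standing in for whatever scheme a project defines.

    # Sketch: find every file tagged with a given project code (placeholder values),
    # reusing the python-irodsclient session from the earlier example.
    from irods.column import Criterion
    from irods.models import Collection, DataObject, DataObjectMeta

    query = session.query(Collection.name, DataObject.name).filter(
        Criterion("=", DataObjectMeta.name, "project"),
        Criterion("=", DataObjectMeta.value, "NGS-2017-042"),
    )

    for row in query:
        print(f"{row[Collection.name]}/{row[DataObject.name]}")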

One immediate benefit is that someone undertaking a new project can quickly determine if similar work has already been done. This is increasingly important in life sciences organizations as research becomes more multi-disciplinary in nature.

Furthermore, the more an organization knows about its data, the more valuable the data becomes. Researchers can connect with other work done across the organization. Being able to find the right raw data of a past effort means an experiment does not have to be redone. This saves time and resources.

Warren notes that there are other organizational benefits using iRODS and Metalnx. When it comes to collaborating with others, the data is simply easier to share. Scientists can put the data in any format and it is easier to publish the data.

Learn more

Metalnx is available as an open source tool. It can be found at Dell EMC Code (www.codedellemc.com) or on GitHub at www.github.com/Metalnx. EMC has also made binary versions available on Bintray at www.bintray.com/metalnx and has posted a Docker image on Docker Hub at https://hub.docker.com/r/metalnx/metalnx-web/

A broader discussion of the use of Metalnx and iRODS in the life sciences can be found in an on-demand video of a recent web seminar “Expanding the Face of Meta Data in Next Generation Sequencing.” The video can be viewed on the EMC Emerging Tech Solutions site.

 

