Posts Tagged ‘genomics’

TGen Cures Storage Needs with Dell EMC to Advance Precision Medicine

Sasha Paegle

Sr. Business Development Manager, Life Sciences

As the gap between theoretical treatment and clinical application for precision medicine continues to shrink, we’re inching closer to a reality in which doctors routinely use an individual’s genome to prescribe a specific care strategy.

Organizations such as the Translational Genomics Research Institute (TGen), a leading biomedical research institute, are on the forefront of enabling a new generation of life-saving treatments. With innovations from TGen, breakthroughs in genetic sequencing are unraveling mysteries of complex diseases like cancer.

To help achieve its goal of using -omics to prevent, diagnose and treat disease, the Phoenix-based non-profit research institute selected Dell EMC to enhance the IT infrastructure that manages its petabyte-scale sequencing cluster.

Data Tsunami 

The time and cost of sequencing a single person’s genome have dropped dramatically since the Human Genome Project, which spanned 13 years and cost $1 billion. Today, sequencing can be completed in roughly one day for approximately $1,000. Furthermore, advances in both sequencing technology and IT have enabled TGen to increase the number of patients sequenced annually from hundreds to thousands. To handle the storage output of current sequencing technologies and emerging single-molecule real-time (SMRT) sequencing, TGen required an infrastructure with the storage capacity and performance to support the big data repositories produced by genetic sequencing, even as those repositories grow exponentially.
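For a rough sense of what that growth means for storage (a back-of-envelope sketch using assumed figures, not TGen’s actual numbers), a 30x whole human genome yields on the order of 100 GB of raw data:

```python
# Illustrative estimate only; per-genome size and patient count are assumptions.
GB_PER_GENOME = 100        # assumed raw output of one 30x whole human genome
GENOMES_PER_YEAR = 3_000   # "thousands of patients" sequenced annually

raw_tb_per_year = GENOMES_PER_YEAR * GB_PER_GENOME / 1_000
print(f"~{raw_tb_per_year:,.0f} TB of raw reads per year")  # ~300 TB, before
# alignments (BAM/CRAM), variant calls, and backups multiply that footprint.
```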

“When you get more sequencers that go faster and run cheaper, and the more people are being sequenced, you’re going to need more resources in order to process this tsunami of data,” said James Lowey, TGen’s CIO.

TGen stores vast amounts of data generated by precision medicine, including genetic data, data from wearables such as glucose monitors and pain management devices, clinical records, and population health statistics. Scientists must then correlate and analyze this information to develop a complete picture of an individual’s illness and potential treatments. TGen’s sequencing cluster churns through one million CPU hours per month, which calls for a storage solution that can also maintain the high availability critical to an around-the-clock research environment.

Benefits for Researchers

In the coming years, researchers can expect the number of genetic sequences to keep increasing, with SMRT sequencing paving the way for even larger data volumes.

Lowey notes, “As genetic data continues to grow exponentially, it’s even more important to have an extremely reliable infrastructure to manage that data and make it accessible to the scientists 24/7.”

Having a robust storage infrastructure in place allows researchers to devote their time and attention fully to the core business of science, without worrying whether there is enough disk space or processing capacity. It also helps scientists get more precise treatments to patients faster, enabling breakthroughs that lead to life-saving and life-changing medical treatments – the ultimate goal of TGen and like-minded research institutes.

Looking Ahead

With the likelihood of sequencing clusters growing to exabyte-scale, TGen and its peers must continue to seek out an enterprise approach that emphasizes reliability and scalability and ensures high availability of critical data for 24/7 operations.

Lowey summarizes the future of precision medicine and IT by saying, “The possibilities are endless, but the real trick is to build all of that backend infrastructure to support it.”

To learn more about Dell EMC’s work with TGen, check out our video below.

 


Using a World Wide Herd (WWH) to Advance Disease Discovery and Treatment

Patricia Florissi

Vice President & Global Chief Technology Officer, Sales at Dell EMC
Patricia Florissi is Vice President and Global Chief Technology Officer (CTO) for Sales. As Global CTO for Sales, Patricia helps define mid- and long-term technology strategy, representing the needs of the broader EMC ecosystem in EMC strategic initiatives. Patricia is an EMC Distinguished Engineer, holds a Ph.D. in Computer Science from Columbia University in New York, graduated valedictorian with an MBA from the Stern School of Business at New York University, and holds master's and bachelor's degrees in Computer Science from the Universidade Federal de Pernambuco in Brazil. Patricia holds multiple patents and has published extensively in periodicals including Computer Networks and IEEE Proceedings.


Analysis of very large genomic datasets has the potential to radically alter the way we keep people healthy. Whether it is quickly identifying the cause of a new infectious outbreak to prevent its spread or personalizing a treatment based on a patient’s genetic variants to knock out a stubborn disease, modern Big Data analytics has a major role to play.

By leveraging cloud, Apache™ Hadoop®, next-generation sequencers, and other technologies, life scientists now have a powerful new way to conduct global-scale collaborative genomic analysis that was not possible before. With the right approach, the benefits can be substantial.


To illustrate the possibilities and benefits of coordinated worldwide genomic analysis, Dell EMC partnered with researchers at Ben-Gurion University of the Negev (BGU) to develop a global data analytics environment that spans multiple clouds. This environment lets life sciences organizations analyze data from multiple heterogeneous sources while preserving privacy and security. The work conducted by this collaboration simulated a scenario that might be used by researchers and public health organizations to identify the early onset of outbreaks of infectious diseases. The approach could also help uncover new combinations of virulence factors that may characterize new diseases. Additionally, the methods used have applicability to new drug discovery and translational and personalized medicine.

 

Expanding on past accomplishments

In 2003, SARS (severe acute respiratory syndrome) became the first infectious outbreak in which rapid, global, collaborative genomic analysis was used to identify the cause of a disease. The effort was carried out by researchers in the U.S. and Canada, who decoded the genome of the coronavirus to prove it was the cause of SARS.

The Dell EMC and BGU simulated disease detection and identification scenario makes use of technological developments (the much lower cost of sequencing, the availability of greater computing power, the use of cloud for data sharing, etc.) to address some of the shortcomings of past efforts and enhance the outcome.

Specifically, some diseases are caused by a combination of virulence factors. These factors may all be present in one pathogen or spread across several pathogens in the same biome. There can also be geographical variations. This makes it very hard to identify the root causes of a disease when pathogens are analyzed in isolation, as has been the case in the past.

Addressing these issues requires sequencing entire microbiomes from many samples gathered worldwide. The computational requirements for such an approach are enormous. A single facility would need a compute and storage infrastructure on a par with major government research labs or national supercomputing centers.

Dell EMC and BGU simulated a scenario of distributed sequencing centers scattered worldwide, where each center sequences entire microbiome samples. Each center analyzes the sequence reads it generates against a set of known virulence factors. This is done to detect the combinations of these factors that cause disease, allowing for near-real-time diagnostic analysis and targeted treatment.
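As a rough illustration of that per-center step, the sketch below screens reads against a handful of known virulence-factor sequences by shared k-mers and reduces them to a compact count profile. The factor names, sequences, k-mer length, and matching rule are all illustrative assumptions, not the actual BGU pipeline.

```python
# Hypothetical per-center screening: count reads sharing a k-mer with each known virulence factor.
K = 21  # assumed k-mer length

def kmers(seq: str, k: int = K) -> set[str]:
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def virulence_profile(reads: list[str], factors: dict[str, str]) -> dict[str, int]:
    """Return, for each virulence factor, how many reads share at least one k-mer with it."""
    factor_kmers = {name: kmers(seq) for name, seq in factors.items()}
    profile = {name: 0 for name in factors}
    for read in reads:
        read_kmers = kmers(read)
        for name, fk in factor_kmers.items():
            if read_kmers & fk:
                profile[name] += 1
    return profile

# Toy example with made-up sequences:
factors = {"toxA": "ATGGCTAGCTAGGCTAACGTAGGCTA", "adhB": "TTGACCGTAGGCATCGATCGTAGGAC"}
reads = ["ATGGCTAGCTAGGCTAACGTAGGCTAAC", "CCCCCCCCCCCCCCCCCCCCCCCC"]
print(virulence_profile(reads, factors))  # {'toxA': 1, 'adhB': 0}
```

The point of the profile is its size: a few integers per sample, rather than the raw reads themselves.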

To carry out these operations in the different centers, Dell EMC extended the Hadoop framework to orchestrate distributed and parallel computation across clusters scattered worldwide. This pushed computation as close as possible to the source of data, leveraging the principle of data locality at world-wide scale, while preserving data privacy.
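WWH itself is Dell EMC’s own extension of the Hadoop framework, so the sketch below only illustrates the underlying data-locality pattern in plain Python rather than the WWH API: each simulated center runs its analysis where the reads live and ships back nothing but a small profile.

```python
# Assumed illustration of the data-locality pattern (not the WWH API): analyze locally,
# move only the small per-center profile across the network.
from concurrent.futures import ThreadPoolExecutor

FACTORS = {"toxA": "ATGGCTAGCTAGG", "adhB": "TTGACCGTAGGCA"}  # made-up marker sequences

def local_profile(reads: list[str]) -> dict[str, int]:
    # Stand-in for the real per-center pipeline: count reads containing each marker.
    return {name: sum(marker in read for read in reads) for name, marker in FACTORS.items()}

def analyze_center(name: str, reads: list[str]) -> tuple[str, dict[str, int]]:
    # In the real scenario this step would run inside the center's own Hadoop cluster,
    # next to its data; only the resulting profile dict travels to the coordinator.
    return name, local_profile(reads)

centers = {  # toy stand-ins for geographically distributed sequencing centers
    "phoenix": ["ATGGCTAGCTAGGCTAACGT", "CCCCCCCCCCCCCCCC"],
    "beer_sheva": ["TTGACCGTAGGCATCGATCG"],
}

with ThreadPoolExecutor() as pool:
    profiles = dict(pool.map(lambda item: analyze_center(*item), centers.items()))
print(profiles)  # {'phoenix': {'toxA': 1, 'adhB': 0}, 'beer_sheva': {'toxA': 0, 'adhB': 1}}
```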

Since a single Hadoop instance is represented by a single elephant, Dell EMC reasoned that a set of Hadoop instances scattered across the world but working in tandem forms a World Wide Herd, or WWH – the name Dell EMC has given to its Hadoop extensions.


Using WWH, Dell EMC wrote a distributed application in which each of a set of collaborating sequencing centers calculates a profile of the virulence factors present in each microbiome it sequenced, and sends just these profiles to a center selected to perform the global computation.

That center then uses bi-clustering to uncover common patterns of virulence factors among subsets of microbiomes that could originally have been sampled in any part of the world.
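A hedged sketch of that global step, assuming the collected profiles have been stacked into a centers × virulence-factors count matrix; scikit-learn’s SpectralCoclustering stands in here for whichever bi-clustering method the researchers actually used.

```python
# Illustrative bi-clustering of per-center virulence-factor profiles (data and method are assumed).
import numpy as np
from sklearn.cluster import SpectralCoclustering

centers = ["phoenix", "beer_sheva", "nairobi", "mumbai"]
factors = ["toxA", "adhB", "hlyC", "fimD"]
# Rows = sequenced microbiome samples per center, columns = virulence-factor read counts.
profiles = np.array([
    [120,   3,  98,   1],
    [115,   5, 101,   0],
    [  2,  87,   4,  93],
    [  1,  90,   2,  88],
])

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(profiles)
for c in range(2):
    rows = [centers[i] for i in np.where(model.rows_[c])[0]]
    cols = [factors[j] for j in np.where(model.columns_[c])[0]]
    print(f"bicluster {c}: centers={rows}, factors={cols}")
# A bicluster that pairs a subset of samples with a subset of factors flags a candidate
# combination of virulence factors occurring together across geographies.
```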

This approach could allow researchers and public health organizations to potentially identify the early onset of outbreaks and also uncover new combinations of virulence factors that may characterize new diseases.

There are several biological advantages to this approach. The approach eliminates the time required to isolate a specific pathogen for analysis and for re-assembling the genomes of the individual microorganisms. Sequencing the entire biome lets researchers identify known and unknown combinations of virulence factors. And collecting samples independently world-wide helps ensure the detection of variants.

On the compute side, the approach uses local processing power to perform the biome sequence analysis, reducing the need for a large centralized HPC environment. It also addresses the challenge of data diversity, since it can accommodate heterogeneous data sources and formats.

This investigative approach could serve as a next-generation outbreak surveillance system, allowing geographically dispersed groups to simultaneously investigate different variants of a new disease. In addition, the WWH architecture has great applicability to pharmaceutical industry R&D efforts, which increasingly rely on a multi-disciplinary approach in which geographically dispersed groups investigate different aspects of a disease or drug target using a wide variety of analysis algorithms on shared data.

 

Learn more about modern genomic Big Data analytics

 

 

GET 2010: Isilon and the Future of Genomics

Recently, a historic event took place in Boston, hosted by Dr. George Church and his Personal Genome Project. The Genomes Environments Traits (GET) Conference brought together every person who has had their full genome mapped – fewer than 20 people in total – to share a stage and discuss the impact of genomics research on the future of human medicine, health, culture and society as a whole.

Needless to say, this event attracted a number of interested journalists. One among them, Beth Pariseau of SearchStorage, posted this article highlighting the data management challenges and storage growth in the field of genomic research. Beth highlighted conversations from several bioinformatics institutions – the Broad Institute (a joint venture between MIT and Harvard), Oklahoma Medical Research Foundation and Stanford University’s Quake Lab, as well as Illumina (the leading manufacturer of DNA sequencing devices and genomic services). All the organizations detailed not only the data storage growth and management challenges of genomic research, but also the continuing evolution in this rapidly changing environment.

There is, of course, one thing all the aforementioned organizations share in common – Isilon scale-out NAS is their primary storage architecture.

Isilon’s ability to scale from a few terabytes to tens of petabytes – all within a single file system – provides a distinct advantage to genomic research organizations, which are usually data-heavy but IT-light, with few full-time IT staff available to manage massive storage systems. Add to this the fact that Isilon IQ can scale a single volume non-disruptively via NFS/CIFS connectivity, eliminating the complexity and high operating costs associated with traditional storage architectures, and it’s clear why so many genomics organizations are using Isilon to power their research. With Isilon IQ and the OneFS® operating system, users can cost-effectively accelerate metadata operations, optimize both concurrent I/O workflows (used in high-performance computing) and single-stream workflows (used for binary sequence data), and maintain high data availability – enabling genomics organizations to spend their time and resources on science, not on storage.

Along with GET, Isilon recently attended Bio-IT World, where our CTO Paul Rutherford spoke on the evolution of cloud computing, its pros and cons, and its potential impact on life sciences organizations. Later that evening, at a partner event with Cambridge Computers, one of my Isilon colleagues introduced himself to a gentleman who had walked up to his table:

“Hi, I’m from Isilon. We’re a data storage company,” my colleague said.

The gentleman laughed in reply, “That’s like saying: Hi, I’m from Microsoft. We’re a software company.”

The study of genomics holds the potential to usher in an era of truly personalized medicine, which would enable individual treatments that could improve not only the state of healthcare, but possibly the quality of life for millions, if not billions, of people. Here at Isilon, we’re proud to be the storage architecture helping power this groundbreaking research.
