
Using a World Wide Herd (WWH) to Advance Disease Discovery and Treatment

Patricia Florissi

Vice President & Global Chief Technology Officer, Sales at Dell EMC
Patricia Florissi is Vice President and Global Chief Technology Officer (CTO) for Sales. As Global CTO for Sales, Patricia helps define mid- and long-term technology strategy, representing the needs of the broader EMC ecosystem in EMC strategic initiatives. Patricia is an EMC Distinguished Engineer, holds a Ph.D. in Computer Science from Columbia University in New York, graduated valedictorian with an MBA from the Stern Business School at New York University, and has a Master's and a Bachelor's degree in Computer Science from the Universidade Federal de Pernambuco in Brazil. Patricia holds multiple patents and has published extensively in periodicals including Computer Networks and IEEE Proceedings.


Analysis of very large genomic datasets has the potential to radically alter the way we keep people healthy. Whether it is quickly identifying the cause of a new infectious outbreak to prevent its spread or personalizing a treatment based on a patient’s genetic variants to knock out a stubborn disease, modern Big Data analytics has a major role to play.

By leveraging cloud, Apache™ Hadoop®, next-generation sequencers, and other technologies, life scientists now have a powerful new way to conduct global-scale collaborative genomic analysis research that was not previously possible. With the right approach, great benefits can be realized.


To illustrate the possibilities and benefits of using coordinated worldwide genomic analysis, Dell EMC partnered with researchers at Ben-Gurion University of the Negev (BGU) to develop a global data analytics environment that spans across multiple clouds. This environment lets life sciences organizations analyze data from multiple heterogeneous sources while preserving privacy and security. The work conducted by this collaboration simulated a scenario that might be used by researchers and public health organizations to identify the early onset of outbreaks of infectious diseases. The approach could also help uncover new combinations of virulence factors that may characterize new diseases. Additionally, the methods used have applicability to new drug discovery and translational and personalized medicine.


Expanding on past accomplishments

In 2003, SARS (severe acute respiratory syndrome) was the first infectious outbreak where fast global collaborative genomic analysis was used to identify the cause of a disease. The effort was carried out by researchers in the U.S. and Canada who decoded the genome of the coronavirus to prove it was the cause of SARS.

The Dell EMC and BGU simulated disease detection and identification scenario makes use of technological developments (the much lower cost of sequencing, the availability of greater computing power, the use of cloud for data sharing, etc.) to address some of the shortcomings of past efforts and enhance the outcome.

Specifically, some diseases are caused by a combination of virulence factors. These factors may all be present in one pathogen or spread across several pathogens in the same biome, and there can also be geographical variations. This makes it very hard to identify the root causes of a disease when pathogens are analyzed in isolation, as has been the case in the past.

Addressing these issues requires sequencing entire microbiomes from many samples gathered worldwide. The computational requirements for such an approach are enormous. A single facility would need a compute and storage infrastructure on a par with major government research labs or national supercomputing centers.

Dell EMC and BGU simulated a scenario of distributed sequencing centers scattered worldwide, where each center sequences entire microbiome samples. Each center analyzes the sequence reads generated against a set of known virulence factors. This is done to detect the combinations of these factors causing diseases, allowing for near-real-time diagnostic analysis and targeted treatment.

To carry out these operations in the different centers, Dell EMC extended the Hadoop framework to orchestrate distributed and parallel computation across clusters scattered worldwide. This pushed computation as close as possible to the source of data, leveraging the principle of data locality at world-wide scale, while preserving data privacy.
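As a minimal illustration of the local step, each center could match its reads against marker k-mers for known virulence factors and emit only a small profile, so raw sequence data never leaves the site. This sketch is hypothetical (the factor names, signatures, and matching scheme are invented for illustration and are not Dell EMC's actual WWH code):

```python
from collections import Counter

def kmers(seq, k=8):
    """All overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def local_profile(reads, factor_signatures, k=8):
    """Profile one microbiome sample against known virulence factors.

    factor_signatures maps factor name -> set of marker k-mers.
    Returns a Counter of reads matching each factor. Only this small
    profile, not the raw reads, needs to leave the sequencing center.
    """
    profile = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        for factor, markers in factor_signatures.items():
            if read_kmers & markers:
                profile[factor] += 1
    return profile

# Toy example: two hypothetical factors and a handful of reads.
signatures = {
    "toxA": kmers("ACGTACGTACGT"),
    "adhB": kmers("TTGGCCAATTGG"),
}
reads = ["ACGTACGTACGTAAAA", "CCCCCCCCCCCCCCCC", "GGTTGGCCAATTGGAA"]
print(local_profile(reads, signatures))
```

Shipping only these compact profiles, rather than terabytes of raw reads, is what makes the data-locality principle pay off at worldwide scale.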

Since one Hadoop instance is represented by a single elephant, Dell EMC concluded that a set of Hadoop instances scattered across the world but working in tandem formed a World Wide Herd, or WWH. This is the name Dell EMC has given to its Hadoop extensions.


Using WWH, Dell EMC wrote a distributed application in which each of a set of collaborating sequencing centers calculates a profile of the virulence factors present in each microbiome it sequenced and sends just these profiles to a center selected to perform the global computation.

That center would then use bi-clustering to uncover common patterns of virulence factors among subsets of micro-biomes that could have been originally sampled in any part of the world.
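As a greatly simplified stand-in for the global step, the coordinating center could group the received profiles by the combination of factors they contain and surface combinations shared across geographically distant samples. Real bi-clustering partitions samples and factors simultaneously and tolerates partial matches; this exact-match grouping (with invented sample names) only sketches the idea:

```python
from collections import defaultdict

def shared_factor_patterns(profiles, min_samples=2):
    """Find virulence-factor combinations recurring across samples.

    profiles maps sample id -> set of detected factors.
    Returns each combination seen in at least min_samples samples,
    a crude approximation of bi-clustering samples and factors together.
    """
    by_pattern = defaultdict(list)
    for sample, factors in profiles.items():
        by_pattern[frozenset(factors)].append(sample)
    return {pattern: samples
            for pattern, samples in by_pattern.items()
            if len(samples) >= min_samples}

# Toy profiles received from four hypothetical sequencing centers.
profiles = {
    "brazil-01": {"toxA", "adhB"},
    "israel-02": {"toxA", "adhB"},
    "us-03":     {"toxA"},
    "india-04":  {"toxA", "adhB"},
}
patterns = shared_factor_patterns(profiles)
print(patterns)
```

A recurring pattern like {toxA, adhB} appearing in samples from three continents is exactly the kind of signal that could flag the early onset of an outbreak.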

This approach could allow researchers and public health organizations to potentially identify the early onset of outbreaks and also uncover new combinations of virulence factors that may characterize new diseases.

There are several biological advantages to this approach. The approach eliminates the time required to isolate a specific pathogen for analysis and for re-assembling the genomes of the individual microorganisms. Sequencing the entire biome lets researchers identify known and unknown combinations of virulence factors. And collecting samples independently world-wide helps ensure the detection of variants.

On the compute side, the approach uses local processing power to perform the biome sequence analysis. This reduces the need for a large centralized HPC environment. Additionally, the method overcomes the problem of data diversity: it can support all data sources and any data formats.

This investigative approach could be used as a next-generation outbreak surveillance system. It allows collaboration in which different geographically dispersed groups simultaneously investigate different variants of a new disease. In addition, the WWH architecture has great applicability to pharmaceutical industry R&D efforts, which increasingly rely on a multi-disciplinary approach where geographically dispersed groups investigate different aspects of a disease or drug target using a wide variety of analysis algorithms on shared data.


Learn more about modern genomic Big Data analytics



GET 2010: Isilon and the Future of Genomics

Recently, a historic event took place in Boston, hosted by Dr. George Church and his Personal Genome Project. The Genomes Environments Traits (GET) Conference brought together every person who has had their full genome mapped – fewer than 20 people in total – to share a stage and discuss the impact of genomics research on the future of human medicine, health, culture and society as a whole.

Needless to say, this event attracted a number of interested journalists. One among them, Beth Pariseau of SearchStorage, posted this article highlighting the data management challenges and storage growth in the field of genomic research. Beth highlighted conversations from several bioinformatics institutions – the Broad Institute (a joint venture between MIT and Harvard), Oklahoma Medical Research Foundation and Stanford University’s Quake Lab, as well as Illumina (the leading manufacturer of DNA sequencing devices and genomic services). All the organizations detailed not only the data storage growth and management challenges of genomic research, but also the continuing evolution in this rapidly changing environment.

There is, of course, one thing all the aforementioned organizations share in common – Isilon scale-out NAS is their primary storage architecture.

Isilon’s ability to scale from a few terabytes to tens of petabytes – all within a single file system – provides a distinct advantage to genomic research organizations, which are usually data-heavy but IT-light, with few full-time IT staff available to manage massive storage systems. Add to this the fact that Isilon IQ can scale a single volume non-disruptively via NFS/CIFS connectivity, eliminating the complexity and high operating costs associated with traditional storage architectures, and it’s clear why so many genomics organizations are using Isilon to power their research. With Isilon IQ and the OneFS® operating system, users can cost-effectively accelerate metadata operations, optimize concurrent I/O workflows (used in high-performance computing) and single-stream workflows (used for binary sequence data), and maintain high data availability – enabling genomics organizations to spend their time and resources on science, not on storage.

Along with GET, Isilon recently attended Bio-IT World, where our CTO Paul Rutherford spoke on the evolution of cloud computing, its pros and cons, and its potential impact for life sciences organizations. Later that evening at a partner event with Cambridge Computers, one of my Isilon colleagues introduced himself to a gentleman who had walked up to his table:

“Hi, I’m from Isilon. We’re a data storage company,” my colleague said.

The gentleman laughed in reply, “That’s like saying: Hi, I’m from Microsoft. We’re a software company.” The study of genomics holds the potential to usher in an era of truly personalized medicine, which would enable individual treatments that could improve not only the state of healthcare, but possibly the quality of life for millions, if not billions, of people. Here at Isilon, we’re proud to be the storage architecture helping power this groundbreaking research.
