Posts Tagged ‘DNA’

Big Data Analysis for the Greater Good: Dell EMC & the 100,000 Genomes Project

Wolfgang Mertz

CTO of Healthcare, Life Sciences and High Performance Computing

It might seem far-fetched to say that big data analysis can fundamentally improve patient outcomes for cancer and other illnesses, and that it has the power to ultimately transform health services and indeed society at large, but that’s precisely the goal behind the 100,000 Genomes Project from Genomics England.

For background, Genomics England is a wholly owned company of the Department of Health, set up to deliver the 100,000 Genomes Project. This exciting endeavor will sequence and collect 100,000 whole genomes from 70,000 NHS patients and their families (with their full consent), focusing on patients with rare diseases as well as those with common cancers.

The program is designed to create a lasting legacy for patients as well as the NHS and the broader UK economy, while encouraging innovation in the UK’s bioscience sector. The genetic sequences will be anonymized and shared with approved academic researchers to help develop new treatments and diagnostic testing methods targeted at the genetic characteristics of individual patients.

Dell EMC provides the platform for large-scale analytics in a hybrid cloud model for Genomics England, which leverages our VCE vScale with EMC Isilon and EMC XtremIO solutions. The Project has been using EMC storage for its genomic sequence library, and it will now use an Isilon data lake to securely store data during the sequencing process. Backup services are provided by EMC Data Domain and EMC NetWorker.

The Genomics England IT environment uses both on-prem servers and IaaS provided by cloud service providers on G-Cloud. According to an article from Government Computing, “one of Genomics England’s key legacies is expected to be an ecosystem of cloud service providers providing low cost, elastic compute on demand through G-Cloud, bringing the benefits of scale to smaller research groups.”

There are two main considerations from an IT perspective around genome and DNA sequencing projects such as those being done by Genomics England and others: data management and speed. Vast amounts of research data have to be stored and retrieved, and this large-scale biologic data has to be processed quickly in order to gain meaningful insights.

Scale is another key factor. Sequencing and storing genomic information digitally is a data-intensive endeavor, to say the least. Sequencing a single genome creates hundreds of gigabytes of data. The Project has sequenced over 13,000 genomes to date and is expected to generate ten times more data over the next two years. The data lake being used by Genomics England allows 17 petabytes of data to be stored and made available for multi-protocol analytics (including Hadoop).
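To get a feel for these volumes, here is a rough sizing sketch in Python. The post only says "hundreds of gigabytes" per genome, so the 200 GB figure below is purely an illustrative assumption, and the function name is hypothetical. The results are order-of-magnitude estimates, not Genomics England's actual numbers.

```python
GB = 10**9   # decimal gigabyte, in bytes
PB = 10**15  # decimal petabyte, in bytes

# Assumption for illustration only: the post says "hundreds of gigabytes"
# per genome without giving an exact figure.
ASSUMED_GB_PER_GENOME = 200


def raw_storage_pb(genomes: int, gb_per_genome: float = ASSUMED_GB_PER_GENOME) -> float:
    """Estimate raw sequence storage in petabytes for a number of genomes."""
    return genomes * gb_per_genome * GB / PB


print(raw_storage_pb(13_000))   # genomes sequenced to date
print(raw_storage_pb(100_000))  # the Project's overall target
```

Under that assumed per-genome size, 13,000 genomes come to a few petabytes and the full 100,000 to roughly twenty, which is why a multi-petabyte data lake is part of the design.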

For perspective, 1 PB is a quadrillion bytes – think of that as 20 million four-drawer filing cabinets filled with text. Or consider that the Milky Way has roughly two hundred billion stars: if you count each star as a single byte, it would take 5,000 Milky Way galaxies to reach 1 PB of data. It’s staggering.
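The galaxy comparison is easy to check with a few lines of Python. This is only a back-of-the-envelope sketch: it uses the decimal definition of a petabyte and the rough star count quoted above, and the function name is made up for illustration.

```python
PETABYTE_BYTES = 10**15           # 1 PB in decimal (SI) units
STARS_IN_MILKY_WAY = 200 * 10**9  # rough figure quoted in the post


def galaxies_per_petabyte(bytes_per_star: int = 1) -> float:
    """Milky Way-sized galaxies needed if each star stands for bytes_per_star bytes."""
    return PETABYTE_BYTES / (STARS_IN_MILKY_WAY * bytes_per_star)


print(galaxies_per_petabyte())  # 5000.0
```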

The potential to contribute to eradicating disease and to identify exciting new treatments is truly awe-inspiring. And the immense scale of the data involved (5,000 galaxies!) gives new context to reaching for the stars.

Why DNA Sequencing Eclipses the Moon Landing

Sanjay Joshi

CTO, Healthcare & Life-Sciences at EMC
Sanjay Joshi is the Isilon CTO of Healthcare and Life Sciences at the EMC Emerging Technologies Division. Based in Seattle, Sanjay's 28+ year career has spanned the entire gamut of life-sciences and healthcare from clinical and biotechnology research to healthcare informatics to medical devices. His current focus is a systems view of Healthcare, Genomics and Proteomics for infrastructures and informatics. Recent experience has included information and instrument systems in Electronic Medical Records; Proteomics and Flow Cytometry; FDA and HIPAA validations; Lab Information Management Systems (LIMS); Translational Genomics research and Imaging. Sanjay holds a patent in multi-dimensional flow cytometry analytics. He began his career developing and building X-Ray machines. Sanjay was the recipient of a National Institutes of Health (NIH) Small Business Innovation Research (SBIR) grant and has been a consultant or co-Principal-Investigator on several NIH grants. He is actively involved in non-profit biotech networking and educational organizations in the Seattle area and beyond. Sanjay holds a Master of Biomedical Engineering from the University of New South Wales, Sydney and a Bachelor of Instrumentation Technology from Bangalore University. He has completed several medical school and PhD level courses.

“That’s one small step for man, one giant leap for mankind.”

Many of us are familiar with Neil Armstrong’s famous statement, marking one of mankind’s greatest scientific achievements of the 20th century.

Fast-forward to the 21st century and that statement still holds true, this time for a scientific accomplishment that we believe eclipses the moon landing: the completion of the Human Genome Project (HGP). Here’s why.

To give you an idea of the project’s magnitude, it took 13 years and some 18 countries to identify between 20,000 and 25,000 genes and to determine the sequences of the 3 billion chemical base pairs that make up human DNA, according to Explorable. While recent studies dispute this figure and peg the count of human genes at under 20,000, the point here is this: the large-scale collaboration behind the project ultimately aims to achieve one thing, and that is to rid the world of the tyranny of disease.

Even Hollywood’s in on It

No, this isn’t a zombie apocalypse waiting to happen, or a biological experiment gone wrong like you’ve seen in The Walking Dead or World War Z. On the contrary, it is a pivotal breakthrough in mankind’s existence that has led to the discovery of disease genes, paving the way for genetic tests and biotechnology-based products.

We’re sure some of you have heard, as CNN reported, of Angelina Jolie’s double mastectomy in 2013 and of how she had her ovaries removed in March 2015. But why? Genetic testing revealed that she carried a faulty copy of BRCA1, a gene linked to breast and ovarian cancer. It was a hard decision, but one that would reduce her cancer risk by a great deal.

Revving DNA Sequencing

The rise of DNA sequencing can be partially credited to a stark drop in the cost of whole genome sequencing, from US$100 million per human genome to between US$1,000 and US$3,000 today. Of course, we do need to consider the cost of analysis after genetic testing is completed, and that number can stretch to US$20,000. But that brings me to my point: affordability for all, not just Hollywood celebrities. Affordability is a dream for scientists in this field. For one, they can now stretch funding budgets to take on more experimental risks and beef up their sequencing activities, pushing boundaries and gathering more research data that could lead to new discoveries.

That being said, we should all be aware that DNA sequencing runs on two engines: storage and speed. It’s simple. Without enough space to store research data and adequate speed to process it, scientists have little means to glean insights.

Take, for example, SciGenom Labs (SciGenom), a company based in Cochin, India. SciGenom focuses on molecular diagnostics, cancer treatment, and metagenomics. Before adopting an EMC Isilon X200 scale-out storage platform, it found that performance dropped as storage expanded, which slowed the analysis of large-scale biological data sets.

Since moving to EMC Isilon, project tasks are completed 40 percent faster. The lab expects to achieve reductions in the workflow times associated with analyzing, annotating, and understanding the terabytes of data generated every day by the sequencing machines.

Says Saneesh Chembakasseri, IT Manager at SciGenom Labs, “The key reason for moving to Isilon scale-out storage was to increase the performance and speed of analyzing raw data generated by DNA sequencing machines. There is no better choice in the market than EMC Isilon in providing both the needed scalability and performance for meeting the demands of DNA sequencing.”

Read the SciGenom Case Study to learn more.

Being Nimble Now a Reality

18 countries. Can you imagine the kind of coordination that went into HGP? To minimize miscommunication and mistakes, sequencing workflows not only had to be established well in advance; they also had to be nimble enough to adapt to changes. The only way to do so was to store and share findings seamlessly, even with massive quantities of data being exchanged. The same applies today to follow-on DNA sequencing initiatives.

The Malaysia Genome Institute (MGI) is another establishment that has embraced the strengths of EMC Isilon. Engaged in comparative genomics and genetics, structure and synthetic biology, computational and systems biology, and metabolic engineering, MGI has sequencing machines delivering 1 gigabyte per second of throughput. To put that in perspective, it amounts to an astounding 1 terabyte in under 17 minutes. MGI uses the Illumina HiSeq 2000 and Illumina MiSeq sequencing platforms for DNA sequencing, whole genome sequencing, whole transcriptome sequencing, and targeted resequencing.
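The throughput claim can be verified with simple arithmetic. The Python sketch below assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and a sustained, constant rate; the helper name is hypothetical.

```python
GB = 10**9   # decimal gigabyte, in bytes
TB = 10**12  # decimal terabyte, in bytes


def transfer_minutes(data_bytes: float, throughput_bytes_per_s: float) -> float:
    """Minutes needed to move data_bytes at a constant throughput."""
    return data_bytes / throughput_bytes_per_s / 60


minutes = transfer_minutes(1 * TB, 1 * GB)
print(f"{minutes:.1f} minutes")  # 16.7 minutes
```

At 1 GB per second, a terabyte takes 1,000 seconds, which is indeed just under 17 minutes.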

“The way we analyze Big Data can require millions of inputs at the same time. This involves transferring data back and forth between the storage and high-performance computing cluster. EMC can comfortably handle the high throughput required within the analysis,” says Mohd. Noor Mat Isa, Head of Genome Technology and Innovation at MGI.

Read the MGI Case Study to learn more.

A Healthier Future

The National Human Genome Research Institute discusses how individualized DNA analysis based on each person’s genome will lead to a very powerful form of predictive, personalized, participatory and preventive medicine, with the ability to learn about the risks of future illness – as seen with Angelina Jolie.

With this understanding, a new generation of more effective and precise drugs can be developed, compared with the one-size-fits-all versions available today. How fast these breakthroughs will happen, we do not yet know. But for certain, the storage and processing speed of Big Data lie at the heart of progress in the next few leaps for mankind.

 
