Archive for the ‘Big Data’ Category

Announcing Isilon OneFS 8.0.1

David Noy

VP Product Management, Emerging Technologies Division at EMC

It’s been an exhilarating couple of months leading up to the recent historic merger between Dell and EMC! We just completed our first Dell EMC World and announced Isilon All-Flash last week. While all that was in progress, the Isilon team was heads-down on the next update to OneFS, the industry-leading scale-out NAS operating system.

Today, we’re announcing the new OneFS 8.0.1 release with a strong focus on strengthening the Data Lake, with features supporting the horizontal and vertical markets we serve. For the horizontal markets, we’ve added new and improved capabilities around Hadoop big data analytics, Isilon CloudPools, and IsilonSD Edge. For the vertical industries, we’ve focused on addressing the needs of the Healthcare and Financial markets.

Customers continue to gain more value from their data with analytics. Hadoop-based solutions have always been a pillar for Isilon customers because of the native support for the HDFS protocol in the OneFS operating system. In OneFS 8.0.1, we’ve added support for Apache Ambari to proactively monitor key performance metrics and alerts, giving enterprise customers a single point of management for the entire Hadoop cluster. On the security front, we’ve integrated with Apache Ranger to deliver seamless authorization and access control, and we’ve added support for end-to-end in-flight encryption of data between Isilon nodes and the HDFS client.
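
Because OneFS exposes HDFS natively, standard Hadoop tooling can point straight at an Isilon cluster. Here is a minimal sketch using the open source `hdfs` Python package over WebHDFS; the hostname, port, user, and paths are illustrative placeholders, not values from this release:

```python
from hdfs import InsecureClient

# Endpoint, user, and paths below are illustrative placeholders.
client = InsecureClient('http://isilon.example.com:8082', user='hadoop')

# Enumerate a directory exactly as you would against a NameNode.
for name in client.list('/analytics/raw'):
    print(name)

# Stream a file back without first copying it anywhere.
with client.read('/analytics/raw/events.log', encoding='utf-8') as reader:
    print(reader.read())
```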

Many Isilon enterprise customers continue to use OneFS because of its simplicity and ease of management at scale. We’ve added many new enterprise features, such as CloudPools proxy support to increase security, reduce risk, and simplify management. For IsilonSD Edge software-defined storage, we’ve added support for VMware ESX 6.0 and seamless integration with EMC Remote Support (ESRS) for remote monitoring, issue resolution, and troubleshooting.

Other enterprise capabilities include seamless non-disruptive upgrades from OneFS 8.0, upgrade rollback support, a 5x improvement in audit performance, and a completely rewritten framework for performance resource management, reporting, and data insights.

Isilon deployments continue to add value for customers across verticals like Media & Entertainment, Healthcare, Life Sciences, EDA and others. In this release we have strengthened our solutions for the Healthcare and Finance verticals. For Healthcare PACS workloads, we’ve added capabilities in OneFS 8.0.1 that increase efficiency and significantly improve storage utilization for PACS archive workloads. For the Financial industry, we’ve integrated SmartLock compliance mode with SyncIQ replication, pairing compliance data management with business continuity through push-button failover and failback.

OneFS 8.0.1 is the first major upgrade to the OneFS 8.0 code base, and it contains a number of features that many enterprises have been waiting for. If you’ve been holding off on the OneFS 8.0 code base until a subsequent “dot release”, today is the day: your wait is over!

Population Boom, Energy Boom

Yasir Yousuff

Sr. Director, Global Geo Marketing at EMC Emerging Technologies Division


In its 2015 World Population Prospects report, the United Nations put the global population at 7.3 billion, with 60 percent of it living in rising Asia.

As with most, if not all emerging economies, energy consumption has witnessed exponential growth in recent decades, spurred by expanded economic activity that has done well to lift much of the masses into a more prosperous middle class. Living standards have improved as a consequence, further accelerating demand for energy to power luxuries like air-conditioning, consumer electronics, and automobiles, among others.

So what does this all mean? With the global population set to climb to 9.7 billion by 2050, the strain on the world’s energy resources is expected to increase exponentially.

Hitting a Rocky Patch

Here’s where Chuanqing Drilling Engineering (CDE), a geophysical prospecting company of China National Petroleum Corporation, enters the fray. China is the biggest energy consumer in Asia and relies on companies like CDE to uncover new energy sources to feed its ever-growing appetite.

The services offered by CDE are extensive and in high demand, both in China and on the global stage. They include engineering and geological research, geophysical surveys, drilling engineering, downhole services, mud logging, well logging and perforating, oil and gas field engineering construction and development, civil works, as well as oil/gas field cooperative development.

Specialized as these fields are, they all derive their invaluable insights from seismic data processed by high-performance computing software. Unbeknownst to many, these industry applications are far more demanding than those that run on any given PC. The algorithms and calculations that make sense of the gargantuan raw data collected from seismic explorations in varying geological environments are extremely complex, owing to the need for precision that can make or break any mission-critical task.

In this regard, it is not surprising that the few SAN storage systems deployed by CDE had trouble handling the massive growth of data from its exploration activities. Dwindling storage capacity led to diminished performance, which in turn made data analysis increasingly challenging.

A Well-Drilled Solution

CDE’s problem was overwhelming, to say the least, but its solution was relatively simple: a highly scalable, high-performance platform to meet the demands of its exploration activities and data analysis. In essence, an overhaul of CDE’s IT environment was long overdue.

Following comprehensive consultations with EMC, CDE decided to implement a single cluster of 23 EMC Isilon X410 nodes, providing capacity of up to 1.6 petabytes. With scale-out NAS deployed, the system can scale to 50 petabytes when the need eventually arises.

Discovering Productivity

“EMC Isilon storage is far easier to manage and has resulted in considerable savings. We estimate that we’ve achieved a 33 percent reduction in costs,” says Tang Chengbing, Director of the Computer Center at CDE.

In an industry where financials are measured in billions, not millions, 33 percent is a significant figure that can mean the difference between a profitable and a loss-making year.

Storage aside, software features such as EMC Isilon SmartQuotas and SmartDedupe have enabled CDE to simplify workflows and eliminate duplicates in a highly accessible and secure single volume of storage. It is all part of an ecosystem of smarter allocation, so working divisions never face a shortage of capacity or a wasteful surplus, as they did with disparate storage systems.

Future Ready Energy Exploration

Problem solved? Yes, but it is just the beginning.

“It is inevitable that data volumes will increase in the next few years, and with that in mind we need strong technical support from our IT partners,” adds Tang.

Seismic data analytics will only grow in diversity and intensity as innovation continues to take flight in this vital field. With EMC’s storage infrastructure firmly established in the company’s operating DNA, CDE can now adopt new solutions flexibly, in step with fast-paced technological progress.

Read the Chuanqing Drilling Engineering Case Study to learn more.

 

Metalnx: Making iRODS Easy

Stephen Worth

Stephen Worth is a director of Global Innovation Operations at Dell EMC. He manages development and university research projects in Brazil, serves as a technical liaison helping to improve innovation across Dell EMC’s global engineering labs, and works in digital asset management leveraging user-defined metadata. Steve is based out of Dell EMC’s RTP Software Development Center, which focuses on data protection, core storage products, and cloud storage virtualization. He started with Data General in 1985, which was acquired by EMC in 1999 and became part of Dell Technologies in 2016. He has led many product development efforts involving operating systems, diagnostics, UI, databases, and applications porting. His background includes vendor and program management, performance engineering, engineering services, manufacturing, and test engineering. Steve, an alumnus of North Carolina State University, received a B.S. degree in Chemistry in 1981 and an M.S. degree in Computer Studies in 1985, and served as an adjunct faculty member of the Computer Science department from 1987 to 1999. He is an emeritus member of the Computer Science Department’s Strategic Advisory Board and currently chairs the Technical Advisory Board for the James B. Hunt Jr. Library on Centennial Campus.



Advances in sequencing, spectroscopy, and microscopy are driving life sciences organizations to produce vast amounts of data. Most organizations are dedicating significant resources to the storage and management of that data. Until recently, however, their primary efforts have focused on hosting the data for high-performance, rapid analysis and on moving it to more economical disks for longer-term storage.

The nature of life sciences work demands better data organization. The data produced by today’s next-generation lab equipment is rich in information, making it of interest to different research groups and individuals at varying points in time. Examples include:

  • Raw experimental and analyzed data may be needed as new drug candidates move through research and development, clinical trials, FDA approval, and production
  • A team interested in new indications for an existing chemical compound would want to leverage work already done by others in the organization on the compound in the past
  • In the realm of personalized medicine, clinicians may need to evaluate not only a person’s health history, but correlate that information with genome sequences and phenotype data throughout the individual’s life.

The great challenge is how to make data more generally available and useful throughout an organization. Researchers need to know what data exists and have a way to access it. For this to happen, data must be properly categorized, searchable, and easy to find.

To get help in this area, many research organizations and government agencies worldwide are using the Integrated Rule-Oriented Data System (iRODS), which is open source data management software developed by the iRODS Consortium. iRODS enables data discovery using a data/metadata catalog that can retain machine and user-defined metadata describing every file, collection, and object in a data grid.

Additionally, iRODS automates data workflows with a rule engine that permits any action to be initiated by any trigger on any server or client in the grid. iRODS enables secure collaboration, so users only need to log in to their home grid to access data hosted on a remote, federated grid.
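
As a brief, hedged sketch of what cataloging looks like from code, here metadata is attached to a file with the open source python-irodsclient; the host, zone, credentials, and paths are illustrative placeholders:

```python
from irods.session import iRODSSession

# Connection details below are illustrative placeholders.
with iRODSSession(host='irods.example.org', port=1247,
                  user='alice', password='secret', zone='labZone') as session:
    # Fetch a data object already ingested into the grid.
    obj = session.data_objects.get('/labZone/home/alice/run42/sample.fastq')

    # Attach user-defined metadata (attribute, value, optional units);
    # the catalog makes these triples searchable across the grid.
    obj.metadata.add('experiment', 'NGS-run-42')
    obj.metadata.add('organism', 'apple', 'species')
```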

Leveraging iRODS can be simplified and its benefits enhanced when used with Metalnx, an administrative and metadata management user interface (UI) for iRODS. Metalnx was developed by Dell EMC through its efforts as a corporate member of the iRODS Consortium. The intuitive Metalnx UI helps both the IT administrators charged with managing metadata and the end-users / researchers who need to find and access relevant data based upon metadata descriptions.

Using metadata via the easy-to-use UI that Metalnx provides on top of iRODS can help:

  • Maximize storage assets
  • Find what’s valuable, no matter where the data is located
  • Automate movement and processing of data
  • Securely share data with collaborators

Real world example: Putting the issues into perspective

A simple example illustrates why iRODS and Metalnx are needed. Plant & Food Research, a New Zealand-based science company providing research and development that adds value to fruit, vegetable, crop and food products, makes great use of next-generation sequencing and genotyping. The work generates a lot of mixed data types.

“In the past, we were good at storing data, but not good at categorizing the data or using metadata,” said Ben Warren, a bioinformatician at Plant & Food Research. “We tried to get ahead of this by looking at what other institutions were doing.”

iRODS seemed a good fit. It was the only decent open source solution available. However, there were some limitations. “We were okay with the rule engine, but not the interface,” said Warren.

A system administrator working with EMC on hardware for the organization’s compute cluster had heard of Metalnx and mentioned this to Warren. “We were impressed off the bat with its ease of use,” said Warren. “Not only would it be useful for bioinformaticians, coders, and statisticians, but also for the scientists.”

The reason: Metalnx makes it easier to categorize the organization’s data, to control the metadata used to categorize the data, and to use the metadata to find and access any data.

Benefits abound

At Plant & Food Research, metadata is an essential element of a scientist’s workflow. The metadata makes it easier to find data at any stage of a research project. When a project is conceived, scientists will start by determining all metadata required for the project using Metalnx and cataloging data using iRODS. With this approach, everything associated with a project including the samples used, sample descriptions, experimental design, NGS data, and other information are searchable.
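
A hedged sketch of that searchability with python-irodsclient might look like the following catalog query, which finds every data object tagged with a given project attribute (connection details and attribute values are again placeholders):

```python
from irods.session import iRODSSession
from irods.models import Collection, DataObject, DataObjectMeta
from irods.column import Criterion

# Connection details are illustrative placeholders.
with iRODSSession(host='irods.example.org', port=1247,
                  user='alice', password='secret', zone='labZone') as session:
    # Find every data object tagged project=apple-genome, wherever it lives.
    query = (session.query(Collection.name, DataObject.name)
             .filter(Criterion('=', DataObjectMeta.name, 'project'))
             .filter(Criterion('=', DataObjectMeta.value, 'apple-genome')))
    for row in query:
        print(f"{row[Collection.name]}/{row[DataObject.name]}")
```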

One immediate benefit is that someone undertaking a new project can quickly determine whether similar work has already been done. This is increasingly important in life science organizations as research becomes more multidisciplinary in nature.

Furthermore, the more an organization knows about its data, the more valuable the data becomes. Researchers can connect with other work done across the organization. Being able to find the right raw data of a past effort means an experiment does not have to be redone. This saves time and resources.

Warren notes that there are other organizational benefits to using iRODS and Metalnx. When it comes to collaborating with others, the data is simply easier to share. Scientists can keep data in any format, and publishing it becomes easier.

Learn more

Metalnx is available as an open source tool. It can be found at Dell EMC Code (www.codedellemc.com) or on GitHub at www.github.com/Metalnx. EMC has also made binary versions available on Bintray at www.bintray.com/metalnx, and a Docker image is posted on Docker Hub at https://hub.docker.com/r/metalnx/metalnx-web/

A broader discussion of the use of Metalnx and iRODS in the life sciences can be found in an on-demand video of a recent web seminar “Expanding the Face of Meta Data in Next Generation Sequencing.” The video can be viewed on the EMC Emerging Tech Solutions site.

 


Analyst firm IDC evaluates EMC Isilon: Lab-validation of scale-out NAS file storage for your enterprise Data Lake

Suresh Sathyamurthy

Sr. Director, Product Marketing & Communications at EMC

A Data Lake should now be a part of every big data workflow in your enterprise organization. By consolidating file storage for multiple workloads onto a single shared platform based on scale-out NAS, you can reduce costs and complexity in your IT environment, and make your big data efficient, agile and scalable.

That’s the expert opinion in analyst firm IDC’s recent Lab Validation Brief: “EMC Isilon Scale-Out Data Lake Foundation: Essential Capabilities for Building Big Data Infrastructure”, March 2016. As the lab validation report concludes: “IDC believes that EMC Isilon is indeed an easy-to-operate, highly scalable and efficient Enterprise Data Lake Platform.”

The Data Lake Maximizes Information Value

The Data Lake model of storage represents a paradigm shift from the traditional linear enterprise data flow model. As data and the insights gleaned from it increase in value, enterprise-wide consolidated storage is transformed into a hub around which the ingestion and consumption systems work. This enables enterprises to bring analytics to the data in place, avoiding the expense of multiple storage systems and the time lost to repeated ingestion and analysis.

But pouring all your data into a single shared Data Lake would put serious strain on traditional storage systems – even without the added challenges of data growth. That’s where the virtually limitless scalability of EMC Isilon scale-out NAS file storage makes all the difference…

The EMC Data Lake Difference

The EMC Isilon Scale-out Data Lake is an Enterprise Data Lake Platform (EDLP) based on Isilon scale-out NAS file storage and the OneFS distributed file system.

As well as meeting the growing storage needs of your modern datacenter with massive capacity, it enables big data accessibility using traditional and next-generation access methods – helping you manage data growth and gain business value through analytics. You can also enjoy seamless replication of data from the enterprise edge to your core datacenter, and tier inactive data to a public or private cloud.

We recently reached out to analyst firm IDC to lab-test our Isilon Data Lake solutions – here’s what they found in 4 key areas…

  1. Multi-Protocol Data Ingest Capabilities and Performance

Isilon is an ideal platform for enterprise-wide data storage, and provides a powerful centralized storage repository for analytics. With the multi-protocol capabilities of OneFS, you can ingest data via NFS, SMB and HDFS. This makes the Isilon Data Lake an ideal and user-friendly platform for big data workflows, where you need to ingest data quickly and reliably via protocols most suited to the workloads generating the information. Using native protocols enables in-place analytics, without the need for data migration, helping your business gain more rapid data insights.
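
To picture what that multi-protocol, in-place workflow looks like, here is a hedged sketch in Python: a file written through an NFS mount is read back over WebHDFS with the open source `hdfs` package, with no intermediate copy. Mount point, endpoint, and paths are hypothetical placeholders:

```python
from hdfs import InsecureClient

# A writer drops results onto the cluster via an NFS mount
# (mount point and paths are illustrative placeholders).
with open('/mnt/isilon/analytics/results.csv', 'w') as f:
    f.write('sensor,reading\nA1,42\n')

# A Hadoop-side consumer reads the very same file over WebHDFS,
# with no copy or migration step in between.
client = InsecureClient('http://isilon.example.com:8082', user='hadoop')
with client.read('/analytics/results.csv', encoding='utf-8') as reader:
    print(reader.read())
```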


IDC validated that the Isilon Data Lake offers excellent read and write performance for Hadoop clusters accessing HDFS via OneFS, compared with clusters using direct-attached storage (DAS). In the lab tests, Isilon performed:

  • nearly 3x faster for data writes
  • over 1.5x faster for reads and read/writes.

As IDC says in its validation: “An Enterprise Data Lake platform should provide vastly improved Hadoop workload performance over a standard DAS configuration.”

  2. High Availability and Resilience

Policy-based high availability capabilities are needed for enterprise adoption of Data Lakes. The Isilon Data Lake can cope with multiple simultaneous component failures without interruption of service. If a drive or other component fails, the cluster only has to recover the specific data affected, rather than an entire volume.

IDC validated that a disk failure on a single Isilon node has no noticeable performance impact on the cluster. Replacing a failed drive is a seamless process and requires little administrative effort. (This is in contrast to traditional DAS, where the process of replacing a drive can be rather involved and time consuming.)

Isilon can even cope easily with node-level failures. IDC validated that a single-node failure has no noticeable performance impact on the Isilon cluster. Furthermore, the operation of removing a node from the cluster, or adding a node to the cluster, is a seamless process.

  3. Multi-tenant Data Security and Compliance

Strong multi-tenant data security and compliance features are essential for an enterprise-grade Data Lake. Access zones are a crucial part of the multi-tenancy capabilities of Isilon OneFS. In tests, IDC found that Isilon provides no-crossover isolation between Hadoop instances for multi-tenancy.

Another core component of secure multi-tenancy is the ability to provide a secure authentication and authorization mechanism for local and directory-based users and groups. IDC validated that the Isilon Data Lake provides multiple federated authentication and authorization schemes. User-level permissions are preserved across protocols, including NFS, SMB and HDFS.

Federated security is an essential attribute of an Enterprise Data Lake Platform, with the ability to maintain confidentiality and integrity of data irrespective of the protocols used. For this reason, another key security feature of the OneFS platform is SmartLock – specifically designed for deploying secure and compliant (SEC Rule 17a-4) Enterprise Data Lake Platforms.

In tests, IDC found that Isilon enables a federated security fabric for the Data Lake, with enterprise-grade governance, regulatory and compliance (GRC) features.

  4. Simplified Operations and Automated Storage Tiering

The Storage Pools feature of Isilon OneFS allows administrators to apply common file policies across the cluster locally – and extend them to the cloud.

Storage Pools consists of three components:

  • SmartPools: Data tiering within the cluster – essential for moving data between performance-optimized and capacity-optimized cluster nodes.
  • CloudPools: Data tiering between the cluster and the cloud – essential for implementing a hybrid cloud, and placing archive data on a low-cost cloud tier.
  • File Pool Policies: Policy engine for data management locally and externally – essential for automating data movement within the cluster and the cloud.

As IDC confirmed in testing, Isilon’s federated data tiering enables IT administrators to optimize their infrastructure by automating data placement onto the right storage tiers.

The expert verdict on the Isilon Data Lake

IDC concludes that: “EMC Isilon possesses the necessary attributes such as multi-protocol access, availability and security to provide the foundations to build an enterprise-grade Big Data Lake for most big data Hadoop workloads.”

Read the full IDC Lab Validation Brief for yourself: “EMC Isilon Scale-Out Data Lake Foundation: Essential Capabilities for Building Big Data Infrastructure”, March 2016.

Learn more about building your Data Lake with EMC Isilon.

The Democratization of Data Science with the Arrival of Apache Spark

Keith Manthey

CTO of Analytics at EMC Emerging Technologies Division

As an emerging field, data science has seen rapid growth over the span of just a few short years. With Harvard Business Review referring to the data scientist role as the “sexiest job of the 21st century” in 2012 and job postings for the role growing 57 percent in the first quarter of 2015, enterprises are increasingly seeking out talent to help bolster their organizations’ understanding of their most valuable assets: their data.

The growing demand for data scientists reflects a larger business trend – a shifting emphasis from the zeros and ones to the people who help manage the mounds of data on a daily basis. Enterprises are sitting on a wealth of information but are struggling to derive actionable insights from it, in part due to its sheer volume but also because they don’t have the right talent on board to help.

The problem enterprises now face isn’t capturing data – but finding and retaining top talent to help make sense of it in meaningful ways. Luckily, there’s a new technology on the horizon that can help democratize data science and increase accessibility to the insights it unearths.

Data Science Scarcity & Competition

The talent pool for data scientists is notoriously scarce. According to McKinsey & Company, by 2018 the United States alone may face a 50 to 60 percent gap between supply and demand for “deep analytic talent, i.e., people with advanced training in statistics or machine learning.” Data scientists possess an essential blend of business acumen, statistical knowledge and technological prowess, rendering them as difficult to train as they are invaluable to the modern enterprise.

Moreover, banks and insurance companies face an added struggle in hiring top analytics talent, with the allure of Silicon Valley beckoning top performers away from organizations perceived as less inclined to innovate. This perception issue hinders banks’ and insurance companies’ ability to remain competitive in hiring and retaining data scientists.

As automation and machine learning grow increasingly sophisticated, however, there’s an opportunity for banks and insurance companies to harness the power of data science, without hiring formally trained data scientists. One such technology that embodies these innovations in automation is Apache Spark, which is poised to shift the paradigm of data science, allowing more and more enterprises to tap into insights culled from their own data.

Spark Disrupts & Democratizes Data Science

Data science requires three pillars of knowledge: statistical analysis, business intelligence and technological expertise. Spark does the technological heavy lifting by processing data at a scale that most people aren’t comfortable working with. It handles the distribution and categorization of the data, removing the burden from individuals and automating the process. By allowing enterprises to load data into clusters and query it on an ongoing basis, the platform is particularly adept at machine learning and automation – a crucial component in any system intended to analyze mass quantities of data.
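
As a concrete (and deliberately simple) sketch of that load-and-query pattern, here is a PySpark snippet; the file path and column names are invented for illustration:

```python
from pyspark.sql import SparkSession

# Spark distributes the load, parsing, and aggregation across the
# cluster; none of that machinery is visible in user code.
spark = SparkSession.builder.appName("transactions-demo").getOrCreate()

# Path and schema are illustrative placeholders.
df = spark.read.csv("hdfs:///data/transactions.csv",
                    header=True, inferSchema=True)

# Query the data in place with ordinary SQL-style operations.
df.groupBy("merchant").sum("amount") \
  .orderBy("sum(amount)", ascending=False).show(10)

spark.stop()
```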

Spark was created in the labs of UC Berkeley and has quickly taken the analytics world by storm, with two main business propositions: the freedom to model data without hiring data scientists, and the power to leverage analytics models that are already built and ready for use in Spark today. The combination of these two attributes allows enterprises to gain speed on analytics endeavors with a modern, open source technology.

The arrival of Spark signifies a world of possibility for companies that are hungry for the business value data science can provide but are finding it difficult to hire and keep deep analytic talent on board. The applications of Spark are seemingly endless, from cybersecurity and fraud detection to genomics modeling and actuarial analytics.

What Spark Means for Enterprises

Not only will Spark enable businesses to hire non-traditional data scientists, such as actuaries, to effectively perform the role, but it will also open a world of possibilities in terms of actual business strategy.

Banks, for example, have been clamoring for Spark from the get-go, in part because of Spark’s promise to help banks bring credit card authorizations back in-house. For over two decades, credit card authorizations have been outsourced, since it was more efficient and far less dicey to centralize the authorization process.

The incentive to bring this business back in-house is huge, however, with estimated cost savings of tens to hundreds of millions annually. With Spark, the authorization process could be automated in-house – a huge financial boon to banks. The adoption of Spark allows enterprises to effectively leverage data science and evolve their business strategies accordingly.
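
As a hedged illustration of what an in-house authorization scorer could look like at its core, here is a small sketch using Spark’s built-in MLlib models; the features, training rows, and columns are invented for the example and do not describe any bank’s actual system:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("auth-demo").getOrCreate()

# Toy training data: (amount, merchant_risk, approved) -- purely illustrative.
train = spark.createDataFrame(
    [(25.0, 0.1, 1), (900.0, 0.9, 0), (40.0, 0.2, 1), (1500.0, 0.8, 0)],
    ["amount", "merchant_risk", "approved"])

# Assemble raw columns into the feature vector MLlib models expect.
assembler = VectorAssembler(inputCols=["amount", "merchant_risk"],
                            outputCol="features")
model = LogisticRegression(labelCol="approved").fit(assembler.transform(train))

# Score an incoming authorization request with the ready-made model.
request = spark.createDataFrame([(310.0, 0.4)], ["amount", "merchant_risk"])
model.transform(assembler.transform(request)) \
     .select("probability", "prediction").show()

spark.stop()
```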

The Adoption of Spark & Hadoop

Moreover, Spark works seamlessly with the Hadoop distributions sitting on EMC’s storage platforms. As I noted in my last post, Hadoop adoption among enterprises has been incredible, and Hadoop is quickly becoming the de facto standard for storing and processing terabytes or even petabytes of data.

By leveraging Spark and existing Hadoop platforms in tandem, enterprises are well-prepared to solve the ever-increasing data and analytics challenges ahead.

This summer, NBC captured history while setting standards for the future

Tom "TV" Burns

CTO, Media & Entertainment at EMC


Building on its history covering the Olympic Games, NBC provided viewers in the United States a front row seat to the Games of the XXXI Olympiad.

Projects such as covering the Games, a 17-day live concurrent event, require the ultimate in scalable, reliable storage. NBC uses the EMC Isilon product line to store and stage video captured during these irreplaceable moments of sporting glory, as well as audio, stills and motion graphics.

Isilon’s 3-petabyte storage repository bridged the gap from Stamford to Rio, functioning as a single large Data Lake that enabled real-time, globally collaborative production across the entire broadcast. Adding Isilon nodes without downtime allowed NBC to grow storage capacity and network throughput while maintaining seamless access to a rock-solid platform.

NBC selected the EMC Isilon product line as a reliable, proven infrastructure to manage its storage.

 

 
