Posts Tagged ‘analytics’

Unwrapping Machine Learning

Ashvin Naik

Cloud Infrastructure Marketing at Dell EMC

In a recent IDC report, the Worldwide Cognitive Systems and Artificial Intelligence Spending Guide, some fantastic numbers were projected in terms of opportunity and growth: a 50+% CAGR, with industry verticals pouring billions of dollars into cognitive systems. One of the key components of cognitive systems is Machine Learning.

According to Wikipedia, Machine Learning is a subfield of computer science that gives computers the ability to learn without being explicitly programmed. Just these two pieces of information were enough to get me interested in the field.

After hours of daily searching and digging through inane babble and noise across the internet, an understanding of how machines can learn evaded me for weeks, until I hit the jackpot. A source that shall not be named pointed me to a "security by obscurity" share that held exact and valuable insights on machine learning. It was simple, elegant, and made complete sense to me.

Machine Learning is not all noise; it works on a very simple principle. Imagine there is a pattern in the world that can be used to forecast or predict the behavior of some entity. There is no mathematical notation available to describe the pattern, but if you have the data that traces the pattern, you can use Machine Learning to model it. Now, this may sound like a whole lot of mumbo jumbo, but allow me to break it down in simple terms.

Machine learning can be used to understand patterns so you can forecast or predict anything, provided that:

  • You are certain there is a pattern
  • You do not have a mathematical model to describe the pattern
  • You have the data to try to figure out the pattern.
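Those three conditions can be made concrete with a tiny sketch. This is purely illustrative (the data and the linear model are my own assumptions, not anything from the IDC guide): we know a pattern exists, we have no formula for it, so we let a model learn it from examples.

```python
# A minimal sketch of "learning a pattern from data" using least squares.
# The hidden pattern (y = 2x + 1) is unknown to the model; it only sees samples.

def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b -- the simplest learnable pattern."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# Noisy observations of the hidden pattern.
xs = [0, 1, 2, 3, 4]
ys = [1.0, 3.1, 4.9, 7.2, 9.0]

a, b = fit_line(xs, ys)          # the model recovers roughly a=2, b=1
prediction = a * 5 + b           # forecast the next, unseen point
```

Real systems use far richer models, but the workflow is the same: data in, pattern out, forecast from the pattern.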

Voilà, this makes so much sense already. If you have data and know there is a pattern but don't know what it is, you can use machine learning to find it. The applications for this are endless, from natural language processing and speech-to-text to predictive analytics. The most important is forecasting, something we do not give enough credit these days. The most critical component of Machine Learning is data: you must have the data. If you do not have data, you cannot find the pattern.

As a cloud storage professional, this is a huge insight. You should have data: pristine, raw data coming from the systems that generate it, sort of like a tip straight from the horse's mouth. I know exactly where my products fit in. We are able to ingest, store, protect and expose the data for any purpose in its native format, complete with metadata, all through one system.

We have customers in the automobile industry leveraging our multi-protocol cloud storage across 2,300 locations in Europe to capture data from cars on the road. They are using proprietary Machine Learning systems to look for patterns in how their customers (the car owners) use their products in the real world, to inform the design of better, more reliable and more efficient cars. We have customers in the life-sciences business saving lives by looking at patterns of efficacy and effective therapies for terminal diseases. Our customers in retail are using Machine Learning to detect fraud and protect their customers. The list goes on and on.

I personally do not know the details of how they make it happen, but this is the world of the third platform. There are so many possibilities and opportunities ahead if only we have the data. Talk to us and we can help you capture, store and secure your data so you can transform humanity for the better.

 

Learn more about how Dell EMC Elastic Cloud Storage can fit into your Machine Learning Infrastructure

 

 

Using a World Wide Herd (WWH) to Advance Disease Discovery and Treatment

Patricia Florissi

Vice President & Global Chief Technology Officer, Sales at Dell EMC
Patricia Florissi is Vice President and Global Chief Technology Officer (CTO) for Sales. As Global CTO for Sales, Patricia helps define mid- and long-term technology strategy, representing the needs of the broader EMC ecosystem in EMC strategic initiatives. Patricia is an EMC Distinguished Engineer, holds a Ph.D. in Computer Science from Columbia University in New York, graduated valedictorian with an MBA from the Stern School of Business at New York University, and has a Master's and a Bachelor's Degree in Computer Science from the Universidade Federal de Pernambuco in Brazil. Patricia holds multiple patents and has published extensively in periodicals including Computer Networks and IEEE Proceedings.


Analysis of very large genomic datasets has the potential to radically alter the way we keep people healthy. Whether it is quickly identifying the cause of a new infectious outbreak to prevent its spread or personalizing a treatment based on a patient’s genetic variants to knock out a stubborn disease, modern Big Data analytics has a major role to play.

By leveraging cloud, Apache™ Hadoop®, next-generation sequencers, and other technologies, life scientists have a powerful new way to conduct innovative, global-scale collaborative genomic analysis research that has not been possible before. With the right approach, great benefits can be realized.


To illustrate the possibilities and benefits of using coordinated worldwide genomic analysis, Dell EMC partnered with researchers at Ben-Gurion University of the Negev (BGU) to develop a global data analytics environment that spans across multiple clouds. This environment lets life sciences organizations analyze data from multiple heterogeneous sources while preserving privacy and security. The work conducted by this collaboration simulated a scenario that might be used by researchers and public health organizations to identify the early onset of outbreaks of infectious diseases. The approach could also help uncover new combinations of virulence factors that may characterize new diseases. Additionally, the methods used have applicability to new drug discovery and translational and personalized medicine.

 

Expanding on past accomplishments

In 2003, SARS (severe acute respiratory syndrome) was the first infectious outbreak where fast global collaborative genomic analysis was used to identify the cause of a disease. The effort was carried out by researchers in the U.S. and Canada who decoded the genome of the coronavirus to prove it was the cause of SARS.

The Dell EMC and BGU simulated disease detection and identification scenario makes use of technological developments (the much lower cost of sequencing, the availability of greater computing power, the use of cloud for data sharing, etc.) to address some of the shortcomings of past efforts and enhance the outcome.

Specifically, some diseases are caused by a combination of virulence factors. These factors may all be present in one pathogen, or spread across several pathogens in the same biome, and there can also be geographical variations. This makes it very hard to identify the root causes of a disease when pathogens are analyzed in isolation, as has been the case in the past.

Addressing these issues requires sequencing entire micro-biomes from many samples gathered worldwide. The computational requirements for such an approach are enormous. A single facility would need a compute and storage infrastructure on a par with major government research labs or national supercomputing centers.

Dell EMC and BGU simulated a scenario of distributed sequencing centers scattered worldwide, where each center sequences entire micro-biome samples. Each center analyzes the sequence reads generated against a set of known virulence factors. This is done to detect the combination of these factors causing diseases, allowing for near-real time diagnostic analysis and targeted treatment.

To carry out these operations in the different centers, Dell EMC extended the Hadoop framework to orchestrate distributed and parallel computation across clusters scattered worldwide. This pushed computation as close as possible to the source of data, leveraging the principle of data locality at world-wide scale, while preserving data privacy.

Since one Hadoop instance is represented by a single elephant, Dell EMC concluded that a set of Hadoop instances, scattered across the world, but working in tandem formed a World Wide Herd or WWH. This is the name Dell EMC has given to its Hadoop extensions.


Using WWH, Dell EMC wrote a distributed application in which each of a set of collaborating sequencing centers calculates a profile of the virulence factors present in each micro-biome it sequences, and sends just these profiles to a center selected to do the global computation.

That center would then use bi-clustering to uncover common patterns of virulence factors among subsets of micro-biomes that could have been originally sampled in any part of the world.
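The flow described above (local profiling, then global pattern discovery over profiles only) can be sketched as follows. This is a hypothetical illustration with made-up factor and sample names, not the actual WWH implementation, and it stands in a simple grouping step for the bi-clustering the article describes:

```python
# Each center reduces its raw sequence reads to a small virulence-factor
# profile; only these profiles travel to the coordinating center.

KNOWN_FACTORS = ["vf1", "vf2", "vf3", "vf4"]

def local_profile(reads, factors=KNOWN_FACTORS):
    """Run at each sequencing center: screen reads against known factors."""
    return {f: any(f in read for read in reads) for f in factors}

def global_patterns(profiles):
    """Run at the coordinating center: group samples sharing the same
    combination of factors (a stand-in for the bi-clustering step)."""
    groups = {}
    for sample_id, profile in profiles.items():
        combo = frozenset(f for f, present in profile.items() if present)
        groups.setdefault(combo, []).append(sample_id)
    return groups

# Simulated reads at two centers (strings stand in for sequence data).
center_a = {"sample1": ["...vf1...", "...vf3..."]}
center_b = {"sample2": ["...vf1...", "...vf3..."], "sample3": ["...vf2..."]}

profiles = {}
for center in (center_a, center_b):
    for sid, reads in center.items():
        profiles[sid] = local_profile(reads)  # only this summary leaves the center

patterns = global_patterns(profiles)
# sample1 and sample2 share the {vf1, vf3} combination across centers.
```

The key property is data locality: the heavy computation over raw reads stays where the data lives, and only compact profiles cross the wire, which also helps with privacy.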

This approach could allow researchers and public health organizations to potentially identify the early onset of outbreaks and also uncover new combinations of virulence factors that may characterize new diseases.

There are several biological advantages to this approach. The approach eliminates the time required to isolate a specific pathogen for analysis and for re-assembling the genomes of the individual microorganisms. Sequencing the entire biome lets researchers identify known and unknown combinations of virulence factors. And collecting samples independently world-wide helps ensure the detection of variants.

On the compute side, the approach uses local processing power to perform the biome sequence analysis. This reduces the need for a large centralized HPC environment. Additionally, the method overcomes the matter of data diversity. It can support all data sources and any data formats.

This investigative approach could be used as a next-generation outbreak surveillance system. It allows collaboration where different geographically dispersed groups simultaneously investigate different variants of a new disease. In addition, the WWH architecture has great applicability to pharmaceutical industry R&D efforts, which increasingly rely on a multi-disciplinary approach where geographically dispersed groups investigate different aspects of a disease or drug target using a wide variety of analysis algorithms on shared data.

 

Learn more about modern genomic Big Data analytics

 

 

When It Comes To Data, Isolation Is The Enemy Of Insights

Brandon Whitelaw

Senior Director of Global Sales Strategy for Emerging Technologies Division at Dell EMC


Within IT, data storage, servers and virtualization, there have always been ebbs and flows of consolidation and deconsolidation. You had the transition from terminals to PCs and now we’re going back to virtual desktops – it flows back and forth from centralized to decentralized. It’s also common to see IT trends repeat themselves.

In the mid-to-late 90s, the major trend was to consolidate structured data sources onto a single platform: to go from direct-attached storage with dedicated servers per application to a consolidated central storage system, called a storage area network (SAN). SANs allowed organizations to go from a shared-nothing (SN) architecture to a shared-everything (SE) architecture, where you have a single point of control, allowing users to share available resources rather than having data trapped or siloed within independent direct-attached storage systems.

The benefit of consolidation is an ongoing IT trend that continues to repeat itself on a regular basis, whether it's storage, servers or networking. What's interesting is that once you consolidate all the data sources, IT is finally able to look at doing more with them. Consolidation onto a SAN enables cross-analysis of data sources that were previously isolated from each other, something that was simply infeasible before. With these sources in one place, systems such as the enterprise data warehouse emerged: the concept of ingesting and transforming all the data into a common schema to allow for reporting and analysis. Companies embracing this process drove growth in IT consumption because of the value gained from that data. It also led to new insights, with the result that most of the world's finance, strategy, accounting, operations and sales groups now rely on the data they get from these enterprise data warehouses.
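The "common schema" idea at the heart of the enterprise data warehouse can be shown in miniature. The source systems, field names, and mappings below are entirely hypothetical; the point is only the mechanics of transforming siloed records into one schema so they can be analyzed together:

```python
# Records from two isolated systems, each with its own field names.
sales_system = [{"cust": "C1", "amt_usd": 120.0}]
support_system = [{"customer_id": "C1", "tickets": 3}]

def to_common(record, mapping):
    """Rename source fields to the warehouse's common schema."""
    return {common: record[src] for src, common in mapping.items()}

warehouse = [
    to_common(r, {"cust": "customer_id", "amt_usd": "revenue"})
    for r in sales_system
] + [
    to_common(r, {"customer_id": "customer_id", "tickets": "open_tickets"})
    for r in support_system
]

# Cross-analysis that was impossible while the systems were siloed:
by_customer = {}
for row in warehouse:
    by_customer.setdefault(row["customer_id"], {}).update(
        {k: v for k, v in row.items() if k != "customer_id"}
    )
# by_customer["C1"] now holds both revenue and support data for one customer.
```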

Next, companies started giving employees PCs, and what do you do on PCs? Create files. Naturally, the next step is to ask, "How do I share these files?" and "How do I collaborate on these files?" The end result is home directories and file shares. From an infrastructure perspective, there needed to be a shared common platform for this data to come together. Regular PCs can't talk to a SAN without direct block-level access, a Fibre Channel connection, or being connected to a server in the data center, so unless you want everyone to physically sit in the data center, you run Ethernet.

Businesses ended up building Windows file servers to be the middleman brokering the data between the users on Ethernet and the backend SAN. This method worked until companies reached the point where the Windows file servers steadily grew to dozens. Yet again, this led to IT teams being left with complexity, inefficiency and facing the original problem of having several isolated silos of data and multiple different points of management.

So what’s the solution? Let’s take the middleman out of this. Let’s take the file system that was sitting on top of the file servers and move it directly onto the storage system and allow Ethernet to go directly to it. Thus the network-attached storage (NAS) was born.

However, continuing the cycle, what started as a single NAS eventually became dozens for organizations. Each NAS device contained specific applications with different performance characteristics and protocol access. Also, each system could only store so much data before it didn’t have enough performance to keep up, so systems would continue expanding and replicating to accommodate.

This escalates until an administrator is startled to realize that 80 percent of the data his or her company creates is unstructured. The biggest challenge of unstructured data is that it's not confined to the four walls of a data center. Once again, we find ourselves with silos that aren't being shared (notice the trend repeating itself?). Ultimately, this creates the need for a scale-out architecture with multiprotocol data access that can combine and consolidate unstructured data sources to optimize collaboration.

Doubling every two years, unstructured data is the vast majority of all data being created. Traditionally, the approach to gaining insights from this data has involved building yet another silo, which prevents having a single source of truth with all your data in one place. Due to the associated cost and complexity, not all of the data goes into a data lake, for instance, but only the sub-samples of the data relevant to an individual query. One option for ending this particular cycle is investing in a storage system that not only has the protocol access and tiering capabilities to consolidate all your unstructured data sources, but can also serve as your analytics platform. Your primary storage then becomes the single source of truth, and its ease of management lends itself to the next phase: unlocking its insights.

Storing data is typically viewed as a red-ink line item, but it can actually work to your benefit: not because of regulation or policies dictating it, but because a deeper, wider set of data can provide better answers. Often, you may not know what questions to ask until you're able to see data sets together. Consider the painting technique pointillism. If you look too closely, it's just a bunch of dots of paint. However, if you stand back, a landscape emerges, ladies with umbrellas materialize, and suddenly you realize you're staring at Georges Seurat's famous painting, A Sunday Afternoon on the Island of La Grande Jatte. Similarly, with data analytics, you never think of connecting the dots if you don't realize they're next to one another.

Bullseye: NCC Media Helps Advertisers Score a Direct Hit with Their Audiences

Keith Manthey

CTO of Analytics at EMC Emerging Technologies Division

Can you recall the last commercial you watched? If you're having a hard time remembering, don't beat yourself up. Nowadays, it's all too easy for us to fast-forward, block or avoid ads entirely; when watching content on any of our many devices, advertising is no longer a necessary precursor to enjoying our favorite shows. While people may be consuming more media in more places than ever before, it's more challenging than ever for advertisers to reach their target audiences. The rise of media fragmentation and ad-blocking technology has exponentially increased the cost of reaching the same number of people it did years ago.

As a result, advertisers are under pressure to accurately allocate spending across TV and digital media and to ensure those precious ad dollars being spent are reaching the right people. Gone are the days when a media buyer’s gut instinct determined which block of airtime to purchase on what channel, replaced instead with advanced analytics.

By aggregating hundreds of terabytes of data from sources like Nielsen ratings, Scarborough local market data and even voting and census data, companies such as NCC Media are able to provide targeted strategies for major retailers, nonprofits and political campaigns. In one of the most contentious election years to date, AdAge reported data and analytics are heavily influencing how political advertisers are purchasing cable TV spots. According to Tim Kay, director of political strategy at NCC, “Media buyers…are basing TV decisions on information such as whether a program’s audience is more likely to vote or be registered gun owners.”

To identify its customers' targets more quickly (for example, matching data about cable audiences to voter information and ZIP codes), NCC built an enterprise data lake on EMC Isilon scale-out storage with the Hortonworks Data Platform, allowing it to streamline data aggregation and analytics.
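The kind of matching described above is, at heart, a join on a shared key. Here is a toy illustration with entirely hypothetical fields and numbers (not NCC's actual data or methods), matching cable-audience records to voter data on ZIP code so buyers can rank ad inventory:

```python
# Two hypothetical datasets sharing a ZIP code key.
audience = [
    {"zip": "30301", "network": "NewsChannel", "avg_viewers": 12000},
    {"zip": "30302", "network": "SportsNet", "avg_viewers": 8000},
]
voters = [
    {"zip": "30301", "likely_voters_pct": 61},
    {"zip": "30302", "likely_voters_pct": 42},
]

# Index voter data by ZIP for an efficient lookup join.
voters_by_zip = {v["zip"]: v for v in voters}

# Rank ad inventory by estimated likely voters reached per spot.
ranked = sorted(
    (
        {
            "zip": a["zip"],
            "network": a["network"],
            "est_voters_reached": a["avg_viewers"]
            * voters_by_zip[a["zip"]]["likely_voters_pct"]
            // 100,
        }
        for a in audience
        if a["zip"] in voters_by_zip
    ),
    key=lambda row: row["est_voters_reached"],
    reverse=True,
)
```

At production scale this join runs over hundreds of terabytes in the data lake rather than two small lists, but the shape of the computation is the same.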

To learn more about how Isilon has helped NCC eliminate time-consuming overnight data importation processes and tackle growing data aggregation requirements, check out the video below or read the NCC Media Case Study.

 

As New Business Models Emerge, Enterprises Increasingly Seek to Leave the World of Silo-ed Data

Keith Manthey

CTO of Analytics at EMC Emerging Technologies Division

As Bob Dylan famously wrote back in 1964, the times, they are a changin’. And while Dylan probably wasn’t speaking about the Fortune 500’s shifting business models and their impact on enterprise storage infrastructure (as far as we know), his words hold true in this context.

Many of the world’s largest companies are attempting to reinvent themselves by abandoning their product- or manufacturing-focused business models in favor of a more service-oriented approach. Look at industrial giants such as GE, Caterpillar or Procter & Gamble, and consider how they leverage existing data about their products (in the case of GE, say it’s a power plant) and apply it to a service model (say, for utilities, in this example).

The evolution of a product-focused model into a service-oriented one can offer more value (and revenue) over time, but also requires a more sophisticated analytic model and holistic approach to data, a marked difference from the traditional silo-ed way that data has been managed historically.

Transformation

Financial services is another example of an industry undergoing a transformation from a data storage perspective. Here you have a complex business with lots of traditionally silo-ed data, split between commercial, consumer and credit groups. But increasingly, banks and credit unions want a more holistic view of their business in order to better understand how various divisions or teams could work together in new ways. Enabling consumer credit and residential mortgage units to securely share data could allow them to build better risk score models across loans, for example, ultimately allowing a financial institution to provide better customer service and expand their product mix.

Early days of Hadoop: compromise was the norm

As with any revolution, it’s the small steps that matter most at first. Enterprises have traditionally started small when it comes to holistically governing their data and managing workflows with Hadoop. In the earlier days of Hadoop, say five to seven years ago, enterprises accepted potential compromises around data availability and efficiency, as well as around how workflows could be governed and managed. Operational issues could arise, making it difficult to keep things running one to three years down the road. Security and availability were often best-effort; there weren’t expectations of five-nines reliability.

Data was secured by making it an island by itself. The idea was to scale up as necessary, and build a cluster for each additional department or use case. Individual groups or departments ran what was needed and there wasn’t much integration with existing analytics environments.

With Hadoop’s broader acceptance, new business models can emerge

However, with Hadoop’s 10-year anniversary last year, we’ve started to see broader acceptance of the platform, and as a result it’s becoming both easier and more practical to consolidate data company-wide. What’s changed is the realization that Hadoop was a true proof of concept and not a science experiment. The number of Hadoop environments has grown, and users are realizing there is real power in combining data from different parts of the business and real business value in keeping historical data.

At best, the model of building different islands and running them independently is impractical; at worst it is potentially paralyzing for businesses. Consolidating data and workflows allows enterprises to focus on and implement better security, availability and reliability company-wide. In turn, they are also transforming their business models and expanding into new markets and offerings that weren’t possible even five years ago.

Analyst firm IDC evaluates EMC Isilon: Lab-validation of scale-out NAS file storage for your enterprise Data Lake

Suresh Sathyamurthy

Sr. Director, Product Marketing & Communications at EMC

A Data Lake should now be a part of every big data workflow in your enterprise organization. By consolidating file storage for multiple workloads onto a single shared platform based on scale-out NAS, you can reduce costs and complexity in your IT environment, and make your big data efficient, agile and scalable.

That’s the expert opinion in analyst firm IDC’s recent Lab Validation Brief: “EMC Isilon Scale-Out Data Lake Foundation: Essential Capabilities for Building Big Data Infrastructure”, March 2016. As the lab validation report concludes: “IDC believes that EMC Isilon is indeed an easy-to-operate, highly scalable and efficient Enterprise Data Lake Platform.”

The Data Lake Maximizes Information Value

The Data Lake model of storage represents a paradigm shift from the traditional linear enterprise data flow model. As data and the insights gleaned from it increase in value, enterprise-wide consolidated storage is transformed into a hub around which the ingestion and consumption systems work. This enables enterprises to bring analytics to data in-place – and avoid expensive costs of multiple storage systems, and time for repeated ingestion and analysis.

But pouring all your data into a single shared Data Lake would put serious strain on traditional storage systems – even without the added challenges of data growth. That’s where the virtually limitless scalability of EMC Isilon scale-out NAS file storage makes all the difference…

The EMC Data Lake Difference

The EMC Isilon Scale-out Data Lake is an Enterprise Data Lake Platform (EDLP) based on Isilon scale-out NAS file storage and the OneFS distributed file system.

As well as meeting the growing storage needs of your modern datacenter with massive capacity, it enables big data accessibility using traditional and next-generation access methods – helping you manage data growth and gain business value through analytics. You can also enjoy seamless replication of data from the enterprise edge to your core datacenter, and tier inactive data to a public or private cloud.

We recently reached out to analyst firm IDC to lab-test our Isilon Data Lake solutions – here’s what they found in 4 key areas…

  1. Multi-Protocol Data Ingest Capabilities and Performance

Isilon is an ideal platform for enterprise-wide data storage, and provides a powerful centralized storage repository for analytics. With the multi-protocol capabilities of OneFS, you can ingest data via NFS, SMB and HDFS. This makes the Isilon Data Lake an ideal and user-friendly platform for big data workflows, where you need to ingest data quickly and reliably via protocols most suited to the workloads generating the information. Using native protocols enables in-place analytics, without the need for data migration, helping your business gain more rapid data insights.


IDC validated that the Isilon Data Lake offers excellent read and write performance for Hadoop clusters accessing HDFS via OneFS, compared against via direct-attached storage (DAS). In the lab tests, Isilon performed:

  • nearly 3x faster for data writes
  • over 1.5x faster for reads and read/writes.

As IDC says in its validation: “An Enterprise Data Lake platform should provide vastly improved Hadoop workload performance over a standard DAS configuration.”

  2. High Availability and Resilience

Policy-based high availability capabilities are needed for enterprise adoption of Data Lakes. The Isilon Data Lake is able to cope with multiple simultaneous component failures without interruption of service. If a drive or other component fails, it only has to recover the specific affected data (rather than recovering the entire volume).

IDC validated that a disk failure on a single Isilon node has no noticeable performance impact on the cluster. Replacing a failed drive is a seamless process and requires little administrative effort. (This is in contrast to traditional DAS, where the process of replacing a drive can be rather involved and time consuming.)

Isilon can even cope easily with node-level failures. IDC validated that a single-node failure has no noticeable performance impact on the Isilon cluster. Furthermore, the operation of removing a node from the cluster, or adding a node to the cluster, is a seamless process.

  3. Multi-tenant Data Security and Compliance

Strong multi-tenant data security and compliance features are essential for an enterprise-grade Data Lake. Access zones are a crucial part of the multi-tenancy capabilities of the Isilon OneFS. In tests, IDC found that Isilon provides no-crossover isolation between Hadoop instances for multi-tenancy.

Another core component of secure multi-tenancy is the ability to provide a secure authentication and authorization mechanism for local and directory-based users and groups. IDC validated that the Isilon Data Lake provides multiple federated authentication and authorization schemes. User-level permissions are preserved across protocols, including NFS, SMB and HDFS.

Federated security is an essential attribute of an Enterprise Data Lake Platform, with the ability to maintain confidentiality and integrity of data irrespective of the protocols used. For this reason, another key security feature of the OneFS platform is SmartLock – specifically designed for deploying secure and compliant (SEC Rule 17a-4) Enterprise Data Lake Platforms.

In tests, IDC found that Isilon enables a federated security fabric for the Data Lake, with enterprise-grade governance, regulatory and compliance (GRC) features.

  4. Simplified Operations and Automated Storage Tiering

The Storage Pools feature of Isilon OneFS allows administrators to apply common file policies across the cluster locally – and extend them to the cloud.

Storage Pools consists of three components:

  • SmartPools: Data tiering within the cluster – essential for moving data between performance-optimized and capacity-optimized cluster nodes.
  • CloudPools: Data tiering between the cluster and the cloud – essential for implementing a hybrid cloud, and placing archive data on a low-cost cloud tier.
  • File Pool Policies: Policy engine for data management locally and externally – essential for automating data movement within the cluster and the cloud.
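The file pool policy idea can be sketched in a few lines. To be clear, this is a simplified illustration of the concept, not OneFS’s actual policy language or API; the policy names, thresholds, and tier labels are my own assumptions:

```python
# A toy policy engine: each policy matches file attributes and names a
# target tier; policies are evaluated in order and the first match wins.

POLICIES = [
    ("archive-cold", lambda f: f["days_since_access"] > 180, "CloudPools:low-cost-cloud"),
    ("demote-warm",  lambda f: f["days_since_access"] > 30,  "SmartPools:capacity-nodes"),
    ("keep-hot",     lambda f: True,                         "SmartPools:performance-nodes"),
]

def place(file_attrs, policies=POLICIES):
    """Return (policy name, target tier) for the first matching policy."""
    for name, matches, tier in policies:
        if matches(file_attrs):
            return name, tier
    return None

files = [
    {"path": "/ifs/data/report.csv", "days_since_access": 2},
    {"path": "/ifs/data/2014/old.log", "days_since_access": 400},
]
placements = {f["path"]: place(f) for f in files}
# Recently used data stays on performance nodes; stale data drifts to the cloud tier.
```

The design choice this illustrates is the one IDC tested: a single policy engine driving both in-cluster tiering (SmartPools) and cloud tiering (CloudPools), so administrators describe placement once rather than moving data by hand.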

As IDC confirmed in testing, Isilon’s federated data tiering enables IT administrators to optimize their infrastructure by automating data placement onto the right storage tiers.

The expert verdict on the Isilon Data Lake

IDC concludes that: “EMC Isilon possesses the necessary attributes such as multi-protocol access, availability and security to provide the foundations to build an enterprise-grade Big Data Lake for most big data Hadoop workloads.”

Read the full IDC Lab Validation Brief for yourself: “EMC Isilon Scale-Out Data Lake Foundation: Essential Capabilities for Building Big Data Infrastructure”, March 2016.

Learn more about building your Data Lake with EMC Isilon.
