Archive for the ‘Analytics’ Category

At the Speed of Light

Keith Manthey

CTO of Analytics at EMC Emerging Technologies Division

Over the last year, a clear trend in analytics has emerged: batch analytics are getting bigger and real-time analytics are getting faster. This divergence has never been more apparent than it is today.

Batch Analytics

Batch analytics primarily covers descriptive analytics, massive-scale analytics, and the development of models destined for online systems. Descriptive analytics remains the main purview of data warehouses, but Hadoop has expanded the ability to ask "what if" questions across far more data types and analytic techniques. Some Hadoop descriptive analytics installations have reached truly massive scale.

The documented successes of massive-scale analytics are well trodden. Cross-data analytics (such as disease detection across multiple data sets), time-series modeling, and anomaly detection are particularly impressive given their depth of adoption across several verticals. The health care analytics deployments on Hadoop in the past year alone are numerous, and they show the potential of this use case to provide remarkable insights into caring for our aging population and treating rare diseases.

Model development is an application that highlights the groundbreaking potential unlocked by Hadoop's newest capabilities and analytics. Building real-time models from trillions of transactions for a hybrid architecture is a good example of this category. Because only a tiny percentage of daily transactions are actually fraudulent, trillions of transactions are needed before a fraud model can be certified as effective. The model is then deployed into production, which is often a real-time system.
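
To make that hybrid pattern concrete, here is a minimal sketch, assuming scikit-learn, joblib, and synthetic transaction data purely for illustration: a model is trained in batch on historical transactions and then loaded by a separate real-time scorer. This is the shape of the batch-train, real-time-score hand-off, not the production architecture described above.

```python
# Minimal sketch: train a fraud model in batch, hand it off, score in real time.
# Synthetic data and scikit-learn are illustrative assumptions, not the
# production stack described in the post.
import joblib
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

# --- Batch side: learn from historical transactions (random stand-ins here) ---
rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 10))                 # transaction features
y = (rng.random(50_000) < 0.002).astype(int)      # ~0.2% fraud: highly imbalanced

model = HistGradientBoostingClassifier().fit(X, y)
joblib.dump(model, "fraud_model.joblib")          # artifact handed to production

# --- Real-time side: load the model once, score each incoming transaction ---
scorer = joblib.load("fraud_model.joblib")

def score_transaction(features):
    """Return the fraud probability for a single transaction."""
    return float(scorer.predict_proba(np.asarray(features).reshape(1, -1))[0, 1])

print(score_transaction(X[0]))
```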

One data point behind my belief that batch is getting "bigger": I have been engaged with no fewer than 10 Hadoop clusters that have crossed the 50 PB threshold this year alone. In each case, the cluster had hit a logical pause point, prompting the customer to re-evaluate its architecture and operations, whether because of cost, scale, limitations, or other catalysts. These are often the moments when I get involved. Not every customer reaches these catalysts at the same size or time, so it is notable that 10 clusters larger than 50 PB have hit this point in 2017 alone. Nonetheless, customers continue to set all-time records for Hadoop cluster size.

Real-Time Analytics

While hybrid analytics was certainly in vogue last year, real-time or streaming analytics appears to be the hottest trend of late. Real-time analytics, such as efforts to catch fraudulent authorizations, is not a new endeavor. So why is streaming analytics suddenly the "new hot thing"? Several factors are at play.

Data is growing at an ever-increasing rate, so one contributing factor can be summed up as "to store or not to store." This decision usually happens alongside more complex processing, but it always involves some form of analytics to judge whether the data is useful. Not every piece of data is valuable, and an enormous amount of it is being generated. Deciding whether a particular piece of data is worth committing to batch storage is one job for real-time analytics.
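
As a toy example of that "store or not to store" decision, the sketch below filters a stream and keeps only readings that look informative. The change-threshold rule and field names are hypothetical stand-ins for whatever lightweight analytic a real pipeline would apply.

```python
# Minimal sketch of a "store or not to store" filter on a stream of readings.
# The value test (a simple change threshold) is a hypothetical stand-in for
# whatever analytic actually decides whether a record is worth keeping.
from typing import Iterable, Iterator

def filter_for_storage(readings: Iterable[dict], min_delta: float = 0.5) -> Iterator[dict]:
    """Yield only readings that changed enough from the last stored value."""
    last_stored = None
    for reading in readings:
        value = reading["value"]
        if last_stored is None or abs(value - last_stored) >= min_delta:
            last_stored = value
            yield reading            # worth writing to batch storage
        # otherwise drop the reading and move on

# Example: only 2 of these 5 readings make it to storage.
stream = [{"value": v} for v in (1.0, 1.1, 1.2, 2.0, 2.1)]
print(list(filter_for_storage(stream)))
```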

Moving up the value chain, the more significant factor is that the value proposition of real time often far outweighs that of batch. This does not mean batch and real time are decoupled or no longer symbiotic. In high-frequency trading, fraud authorization detection, cyber security, and other streaming use cases, the difference between gaining insight in real time and gaining it days later can be critical. Real-time systems have historically not relied on Hadoop for their architectures, which has not gone unnoticed by traditional Hadoop-ecosystem tools such as Spark. UC Berkeley recently wound down its AMPLab and launched its successor, RISELab, greenlighting projects such as Drizzle that aim to bring low-latency streaming capabilities to Spark. The ultimate goal of Drizzle and RISELab is to increase the viability of Spark for real-time, non-Hadoop workloads. This emphasis on lower-latency tooling will only accelerate the use of streaming analytics as real time continues to get "faster."
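
For a feel of what a low-latency Spark workload looks like today, here is a minimal PySpark Structured Streaming sketch. It does not use Drizzle or any RISELab project; the built-in "rate" source and console sink are stand-ins for a real event stream and a real downstream system.

```python
# Minimal PySpark Structured Streaming sketch: windowed counts over a stream.
# The "rate" source generates synthetic events; swap in Kafka or another
# source for real workloads.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

events = (spark.readStream
          .format("rate")                 # synthetic source: rows with a timestamp
          .option("rowsPerSecond", 100)
          .load())

# Count events per 10-second window as they arrive.
counts = (events
          .groupBy(F.window(F.col("timestamp"), "10 seconds"))
          .count())

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())

query.awaitTermination(30)  # run for ~30 seconds for the demo
query.stop()
spark.stop()
```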

The last factor is the "Internet of Things" (IoT), sometimes called the "Internet of Everything" or machine-to-machine (M2M) communications. While sensors are top of mind, most companies are still finding their way in this new world of streaming sensor data. Highly advanced use cases and designs already exist, but installations remain bespoke and limited in nature; mass adoption is still a work in progress. The theoretical value of this data for governance analytics or for improving business operations is massive. Given the sheer volume of data, storing all of it in batch systems is not feasible at scale, so most IoT analytics are streaming-based. The value proposition is outstanding, and although IoT analytics remains partly in the hype phase, the enthusiasm and spending are already driving full-scale deployments.

In closing, the divergence between batch and online analytics is growing. The symbiotic relationship remains strong, but the architectures are quickly separating. Predictions from IDC, Gartner, and Forrester indicate that streaming analytics will grow at a far greater rate than batch analytics, for most of the reasons above. It will be interesting to see how this trend continues to unfold. Dell EMC is always interested in learning about specific use cases, and we welcome your stories on how these trends are affecting your business.

Unwrapping Machine Learning

Ashvin Naik

Cloud Infrastructure Marketing at Dell EMC

In a recent IDC report, the Worldwide Cognitive Systems and Artificial Intelligence Spending Guide, some fantastic numbers were put forward in terms of opportunity and growth: more than 50% CAGR, with verticals pouring billions of dollars into cognitive systems. One of the key components of cognitive systems is Machine Learning.

According to Wikipedia, Machine Learning is a subfield of computer science that gives computers the ability to learn without being explicitly programmed. These two pieces of information alone were enough to get me interested in the field.

Despite hours of daily searching and digging through the inane babble and noise of the internet, an understanding of how machines can learn evaded me for weeks, until I hit the jackpot. A source that shall not be named pointed me to a "secure by obscurity" share holding exactly the insights on Machine Learning I was after. It was so simple and elegant that it made complete sense to me.

Machine Learning is not all noise; it works on a very simple principle. Imagine there is a pattern in the world that can be used to forecast or predict the behavior of some entity. There is no mathematical notation available to describe the pattern, but if you have the data that would let you plot it, you can use Machine Learning to model it. That may sound like a lot of mumbo jumbo, so allow me to break it down in simple terms.

Machine learning can be used to understand patterns so you can forecast or predict almost anything, provided that the following hold (see the short sketch after this list):

  • You are certain there is a pattern
  • You do not have a mathematical model to describe the pattern
  • You have the data to try to figure out the pattern.
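
Here is that idea as a minimal sketch, assuming scikit-learn and synthetic data for illustration only: we generate points from a pattern the model is never told about (a noisy sine curve), pretend no formula exists, and let a model learn the pattern from the data alone.

```python
# Minimal sketch of the three conditions above: the data follow a hidden
# pattern, we pretend no formula is available, and a model learns it from
# examples. scikit-learn and the sine curve are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(2_000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=2_000)   # the "unknown" pattern

model = RandomForestRegressor(n_estimators=100).fit(X, y)

# The trained model now predicts the pattern it was never explicitly given.
print(model.predict([[1.5708]]))   # close to sin(pi/2) = 1.0
```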

Voilà, this makes so much sense already. If you have data and know there is a pattern but don't know what it is, you can use Machine Learning to find it. The applications are endless, from natural language processing and speech-to-text to predictive analytics. The most important is forecasting, something we do not give enough credit these days. The most critical component of Machine Learning is data: you must have the data. If you do not have data, you cannot find the pattern.

For a cloud storage professional, this is a huge insight. You need the data: pristine, raw data coming from the systems that generate it, straight from the horse's mouth. I know exactly where my products fit in. We can ingest, store, protect, and expose data for any purpose in its native format, complete with metadata, all through one system.

We have customers in the automobile industry using our multi-protocol cloud storage across 2,300 locations in Europe to capture data from cars on the road. They use proprietary Machine Learning systems to look for patterns in how their customers, the car owners, use their products in the real world, so they can design better, more reliable, and more efficient cars. We have customers in the life sciences saving lives by studying patterns of efficacy and effective therapies for terminal diseases. Our customers in retail use Machine Learning to detect fraud and protect their customers. The list goes on and on.

I personally do not know the details of how they make it happen, but this is the world of the third platform. There are so many possibilities and opportunities ahead, if only we have the data. Talk to us and we can help you capture, store, and secure your data so you can transform humanity for the better.

Learn more about how Dell EMC Elastic Cloud Storage can fit into your Machine Learning infrastructure.

When It Comes To Data, Isolation Is The Enemy Of Insights

Brandon Whitelaw

Senior Director of Global Sales Strategy for Emerging Technologies Division at Dell EMC


Within IT, across data storage, servers, and virtualization, there have always been ebbs and flows of consolidation and deconsolidation. We saw the transition from terminals to PCs, and now we are going back to virtual desktops; things flow back and forth between centralized and decentralized. It's also common to see IT trends repeat themselves.

In the mid-to-late 90s, the major trend was to consolidate structured data sources onto a single platform: to go from direct-attached storage with dedicated servers per application to a consolidated, central storage platform called a storage area network (SAN). SANs allowed organizations to move from a shared-nothing (SN) architecture to a shared-everything (SE) architecture, with a single point of control that lets users share available resources rather than leaving data trapped or siloed within independent direct-attached storage systems.

Consolidation has been an ongoing IT trend that repeats itself on a regular basis, whether in storage, servers, or networking. What's interesting is that once all the data sources are consolidated, IT can finally look at doing more with them. Consolidation onto a SAN enables cross-analysis of data sources that were previously isolated from each other, something that was practically infeasible before. With these sources in one place, systems such as the enterprise data warehouse emerged: the concept of ingesting and transforming all the data onto a common schema to allow reporting and analysis. Companies that embraced this drove growth in IT consumption because of the value gained from the data. It also led to new insights, with the result that most of the world's finance, strategy, accounting, operations, and sales groups now rely on the data they get from these enterprise data warehouses.
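
As a toy illustration of that "common schema" idea, the sketch below conforms two previously siloed sources to a shared key and answers a question neither silo could answer alone. pandas and the column names are assumptions for illustration, not a real warehouse pipeline.

```python
# Toy sketch of the warehouse idea: conform two siloed sources to a common
# schema, then analyze them together.
import pandas as pd

# Source 1: orders exported from a sales application.
orders = pd.DataFrame({"cust": ["A", "B", "A"], "amount_usd": [120.0, 75.5, 30.0]})

# Source 2: support tickets from a separate CRM, with different field names.
tickets = pd.DataFrame({"customer_id": ["A", "B", "B"], "severity": [2, 1, 3]})

# Transform both onto a common schema (a shared "customer" key).
orders = orders.rename(columns={"cust": "customer"})
tickets = tickets.rename(columns={"customer_id": "customer"})

# Cross-source report: revenue per customer alongside average ticket severity.
report = (orders.groupby("customer")["amount_usd"].sum().to_frame()
          .join(tickets.groupby("customer")["severity"].mean()))
print(report)
```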

Next, companies started giving employees PCs, and what do you do on PCs? Create files. Naturally, the next questions are, "How do I share these files?" and "How do I collaborate on them?" The end result is home directories and file shares. From an infrastructure perspective, there needed to be a shared common platform for this data to come together. Regular PCs can't talk to a SAN without direct block-level access over Fibre Channel or a connection to a server in the data center, so unless you want everyone to physically sit in the data center, you run Ethernet.

Businesses ended up building Windows file servers as the middlemen brokering data between users on Ethernet and the back-end SAN. This worked until the Windows file servers steadily grew into the dozens. Yet again, IT teams were left with complexity, inefficiency, and the original problem: several isolated silos of data and multiple points of management.

So what's the solution? Take the middleman out. Take the file system that sat on top of the file servers, move it directly onto the storage system, and let clients reach it directly over Ethernet. Thus network-attached storage (NAS) was born.

However, continuing the cycle, what started as a single NAS eventually became dozens for many organizations. Each NAS device served specific applications with different performance characteristics and protocol access. And each system could hold only so much data before it no longer had the performance to keep up, so systems kept expanding and replicating to accommodate.

This escalates until an administrator is startled to realize that 80 percent of the data the company creates is unstructured. The biggest challenge of unstructured data is that it is not confined to the four walls of a data center. Once again, we find ourselves with silos that aren't being shared (notice the trend repeating itself?). Ultimately, this creates the need for a scale-out architecture with multiprotocol data access that can combine and consolidate unstructured data sources to optimize collaboration.

Doubling every two years, unstructured data makes up the vast majority of all data being created. Traditionally, gaining insights from this data has meant building yet another silo, which prevents having a single source of truth with all of your data in one place. Because of the cost and complexity involved, not all of the data goes into a data lake, for instance, only the sub-samples relevant to an individual query. One way to end this cycle is to invest in a storage system that not only has the protocol access and tiering capabilities to consolidate all your unstructured data sources, but can also serve as your analytics platform. Your primary storage then becomes the single source of truth, and that ease of management lends itself to the next phase: unlocking its insights.

Storing data is typically viewed as a red-ink line item, but it can actually work to your benefit, not because regulations or policies dictate it, but because a deeper, wider set of data can provide better answers. Often you may not know what questions to ask until you can see the data sets together. Consider the painting technique of pointillism. If you look too closely, it's just a bunch of dots of paint. But if you stand back, a landscape emerges, ladies with umbrellas materialize, and suddenly you realize you're staring at Georges Seurat's famous painting, A Sunday Afternoon on the Island of La Grande Jatte. As with pointillism, in data analytics you never think of connecting the dots if you don't realize they're next to one another.
