Author Archive

At the Speed of Light

Keith Manthey

CTO of Analytics at EMC Emerging Technologies Division

For the last year, an obvious trend in analytics has been emerging: batch analytics are getting bigger and real-time analytics are getting faster. This divergence has never been more apparent than it is now.

Batch Analytics

Batch analytics primarily cover descriptive analytics, massive-scale analytics, and the development of models for online systems. Descriptive analytics are still the main purview of data warehouses, but Hadoop has expanded the ability to ask “What if” questions across far more data types and with far richer analytics. Some Hadoop descriptive-analytics installations have reached massive scale.

The successes of massive-scale analytics are well documented. Cross-data analytics (such as disease detection across multiple data sets), time-series modeling, and anomaly detection are particularly impressive because of their depth of adoption across several verticals. The Hadoop-based health care analytics deployments of the past year alone are numerous and show the potential of this use case to provide remarkable insights into caring for our aging population as well as treating rather bespoke diseases.

Model development is an application that highlights the groundbreaking potential unlocked by Hadoop’s newest capabilities and analytics. Building real-time models on trillions of transactions for a hybrid architecture is a good example of this category. Because only a small percentage of daily records are actually fraudulent, trillions of transactions are required before a fraud model can be certified as effective. The model is then deployed into production, which is often a real-time system.
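
To make that hybrid pattern concrete, here is a minimal PySpark sketch of training a fraud model offline and reloading it in a separate scoring process. The feature columns, label, and HDFS paths are hypothetical, and the “incoming” table stands in for whatever live feed the production system actually uses.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("fraud-model-offline-training").getOrCreate()

# Batch side: fit the model over the full historical transaction archive.
history = spark.read.parquet("hdfs:///warehouse/card_transactions/")  # hypothetical path

assembler = VectorAssembler(
    inputCols=["amount", "merchant_risk", "txn_hour"],  # hypothetical feature columns
    outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="is_fraud")
model = Pipeline(stages=[assembler, lr]).fit(history)
model.write().overwrite().save("hdfs:///models/fraud/current")

# Real-time side (a separate, low-latency process): reload the batch-trained model
# and score each incoming authorization request as it arrives.
scorer = PipelineModel.load("hdfs:///models/fraud/current")
incoming = spark.read.parquet("hdfs:///staging/authorizations/")  # stand-in for the live feed
scorer.transform(incoming).select("transaction_id", "probability", "prediction").show(5)
```

In practice the scoring side would typically be a streaming job or an in-memory service rather than a batch read, but the division of labor is the same: train where the data is big, score where the latency is low.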

One data point behind my belief that batch is getting “bigger” is that I have been engaged with no fewer than 10 Hadoop clusters that have crossed the 50 PB threshold this year alone. In each case, the cluster hit a logical pause point, causing the customer to re-evaluate its architecture and operations. The catalyst may be cost, scale, limitations, or something else, and these are often the moments when I am engaged with these customers. Not every client reaches these catalysts at a consistent size or time, so it’s notable that 10 clusters larger than 50 PB have hit this point in 2017 alone. Nonetheless, customers continue to set all-time records for Hadoop cluster size.

Real-Time Analytics

While hybrid analytics were certainly in vogue last year, real-time or streaming analytics appear to be the hottest trend of late. Real-time analytics, such as efforts to combat fraudulent authorizations, are not new endeavors. So why is streaming analytics suddenly the “new hot thing”? There are several factors at play.

Data is growing at an ever-increasing rate. One contributing factor can be categorized as “to store or not to store.” Although this step usually happens as part of a more complex pipeline, it clearly involves some form of analytics to decide whether the data is useful. Not every piece of data is valuable, and an enormous amount of data is being generated. Determining whether a particular artifact of data is worth landing in batch storage is one use for real-time analytics.
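
As a rough illustration of that store-or-not decision, the following Spark Structured Streaming sketch applies a lightweight relevance check at ingest and persists only the records that pass it. The Kafka topic, payload schema, and quality threshold are assumptions made for the example.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("store-or-not-filter").getOrCreate()

# Hypothetical sensor payload schema.
schema = (StructType()
          .add("device_id", StringType())
          .add("reading", DoubleType())
          .add("quality", DoubleType()))

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
       .option("subscribe", "sensor-events")               # assumed topic
       .load())

events = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
             .select("e.*"))

# The real-time "is this worth keeping?" rule: drop null or low-quality readings.
worth_storing = events.filter(F.col("reading").isNotNull() & (F.col("quality") > 0.8))

# Only records that pass the check ever land in long-term (batch) storage.
(worth_storing.writeStream
    .format("parquet")
    .option("path", "hdfs:///datalake/sensor-events/")
    .option("checkpointLocation", "hdfs:///checkpoints/sensor-events/")
    .start())
```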

Moving up the value chain, the more significant factor is that the value proposition of real-time analytics often far outweighs that of batch. That doesn’t mean batch and real-time are decoupled or no longer symbiotic. In high-frequency trading, fraud-authorization detection, cyber security, and other streaming use cases, gaining insight in real time rather than several days later can be critical. Real-time systems have historically not relied on Hadoop for their architectures, which has not gone unnoticed by traditional Hadoop-ecosystem tools like Spark. UC Berkeley recently shifted the focus of its AMPLab to create RISELab, greenlighting projects such as Drizzle that aim to bring low-latency streaming capabilities to Spark. The ultimate goal of Drizzle and RISELab is to increase the viability of Spark for real-time, non-Hadoop workloads. This emphasis on lower-latency tools will certainly escalate the use of streaming analytics as real time continues to get “faster.”

The last factor is the “Internet of Everything,” often referred to as “IoT” or “M2M.” While sensors are top of mind, most companies are still finding their way in this new world of streaming sensor data. Highly advanced use cases and designs are already in place, but the installations are still bespoke and limited in nature; mass adoption is a work in progress. The theoretical value of this data for governance analytics or for improving business operations is massive. Given the sheer volume of the data, storing it all in batch is not feasible at scale, so most IoT analytics are streaming-based. The value proposition is still outstanding, and even though IoT analytics remain in the hype phase, the furor and spending are already at full scale.

In closing, the divergence between batch and real-time analytics is growing. The symbiotic relationship remains strong, but the architectures are quickly separating. Most predictions from IDC, Gartner, and Forrester indicate streaming analytics will grow at a far greater rate than batch analytics, largely for the reasons above. It will be interesting to see how this trend continues to manifest itself. Dell EMC is always interested in learning more about specific use cases, and we welcome your stories on how these trends are affecting your business.

Data Security: Are You Taking It For Granted?

Keith Manthey

CTO of Analytics at EMC Emerging Technologies Division

Despite the fact that the Wells Fargo fake account scandal first broke in September, the banking giant still finds itself the topic of national news headlines and facing public scrutiny months later. While it’s easy to assign blame, whether to the now-retired CEO, the company’s unrealistic sales goals and so forth, let’s take a moment to discuss a potential solution for Wells Fargo and its enterprise peers. I’m talking about data security and governance.

There’s no question that the data security and governance space is still evolving and maturing. Currently, the weakest link in the Hadoop ecosystem is data masking. As it stands at most enterprises using Hadoop, access to the Hadoop environment translates to uncensored access to information that can be highly sensitive. Fortunately, there are initiatives to change that. Hortonworks recently released Apache Ranger with HDP 2.5, which begins to add dynamic column masking and row-level filtering. Shockingly, I can count on one hand the number of clients who understand that they need this feature. In some cases, CIO- and CTO-level executives aren’t even aware of just how critical configurable row- and column-masking capabilities are to the security of their data.
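
Ranger enforces masking policies at query time inside the cluster; the sketch below is not Ranger’s API but a conceptual stand-in, using a masked Spark SQL view over a hypothetical table to show what column-level masking buys you: analysts see hashed or truncated values, never the raw fields.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("masking-demo").enableHiveSupport().getOrCreate()

# Hypothetical raw table (raw.customers) holding sensitive columns.
spark.sql("""
    CREATE VIEW IF NOT EXISTS analytics.customers_masked AS
    SELECT
        customer_id,
        sha2(ssn, 256)                        AS ssn_hash,      -- irreversible hash, not the raw SSN
        concat('XXX-XXX-', substr(phone, -4)) AS phone_masked,  -- keep only the last four digits
        state,                                                  -- non-sensitive columns pass through
        balance
    FROM raw.customers
""")

# Analysts are granted access to the masked view, never to raw.customers itself.
spark.sql("""
    SELECT state, avg(balance) AS avg_balance
    FROM analytics.customers_masked
    GROUP BY state
""").show()
```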

Another aspect I find shocking is the lack of controls around data governance in many enterprises. Without data restrictions, it’s all too easy to envision Wells Fargo’s situation – which resulted in 5,300 employees being fired – repeating itself at other financial institutions. It’s also important to point out that entering unmasked, sensitive and confidential healthcare and financial data into a Hadoop system is not only an unwise and negligent practice; it’s a direct violation of mandated security and compliance regulations.

Identifying the Problem and Best Practices

Enterprise systems administrators and C-suite executives alike are guilty of taking data security for granted, assuming that masking and encryption capabilities come by default with having a database. These executives fail to do their research, dig into the weeds and ask the more complex questions, often because their professional background is in analytics or IT rather than governance. Unless an executive’s background includes building data systems or setting up the controls and governance around them, he or she may not know the right questions to ask.

Another common mistake is not strictly controlling access to sensitive data, putting it at risk of theft and loss. Should customer service representatives be able to pull every file in the system? Probably not. Even IT administrators’ access should be restricted to the specific actions and commands required to perform their jobs. Encryption provides some file-level protection from unauthorized users, but authorized users with permission to unlock an encrypted file can often still see fields that aren’t required for their job.

As more enterprises adopt Hadoop and other similar systems, they should consider the following:

Do your due diligence. When meeting with customers, I can tell they’ve done their homework if they ask questions that go beyond the “buzz words” around Hadoop. Those questions alone indicate they’re not simply regurgitating a sales pitch and have researched how to protect their environment. Be discerning and don’t assume the solution you’re purchasing off the shelf contains everything you need. Accepting what the salesperson says at face value, without probing further, is reckless and could lead to a very damaging and costly security scandal.

Accept there are gaps. Frequently, we engage with clients who are confident they have the most robust security and data governance available. However, when we start to poke and prod a bit to understand what other controls they have in place, the astonishing answer is often zero. Lest we forget, “core” Hadoop only gained built-in security in 2015 without third-party add-ons, and governance around the framework is still in its infancy in many ways. Without something as rudimentary in traditional IT security as a firewall in place, it’s difficult for enterprises to claim they are secure.

Have an independent plan. Before purchasing Hadoop or a similar platform, map out your exact business requirements, consider what controls your business needs and determine whether or not the product meets each of them. Research regulatory compliance standards to select the most secure configuration of your Hadoop environment and the tools you will need to supplement it.

To conclude, here is a seven-question checklist enterprises should be able to answer about their Hadoop ecosystem:

  • Do you know what’s in your Hadoop?
  • Is it meeting your business goals?
  • Do you really have the controls in place that you need to enable your business?
  • Do you have the governance?
  • Where are your gaps and how are you protecting them?
  • What are your augmented controls and supplemental procedures?
  • Have you reviewed the information the salesperson shared and mapped it to your actual business requirements to decide what you need?

Bullseye: NCC Media Helps Advertisers Score a Direct Hit with Their Audiences

Keith Manthey

CTO of Analytics at EMC Emerging Technologies Division

Can you recall the last commercial you watched? If you’re having a hard time remembering, don’t beat yourself up. Nowadays, it’s all too easy for us to fast forward, block or avoid ads entirely; advertising is no longer a necessary precursor to enjoying our favorite shows on any of our many devices. While people may be consuming more media in more places than ever, it’s also more challenging than ever for advertisers to reach their target audiences. The rise of media fragmentation and ad-blocking technology has dramatically increased the cost of reaching the same number of people as just a few years ago.

As a result, advertisers are under pressure to accurately allocate spending across TV and digital media and to ensure those precious ad dollars are reaching the right people. Gone are the days when a media buyer’s gut instinct determined which block of airtime to purchase on which channel; advanced analytics now drive those decisions.

By aggregating hundreds of terabytes of data from sources like Nielsen ratings, Scarborough local market data and even voting and census data, companies such as NCC Media are able to provide targeted strategies for major retailers, nonprofits and political campaigns. In one of the most contentious election years to date, AdAge reported data and analytics are heavily influencing how political advertisers are purchasing cable TV spots. According to Tim Kay, director of political strategy at NCC, “Media buyers…are basing TV decisions on information such as whether a program’s audience is more likely to vote or be registered gun owners.”

In order to identify its customers’ targets more quickly (for example, by matching data about cable audiences to voter information and zip codes), NCC built an enterprise data lake on EMC Isilon scale-out storage with the Hortonworks Data Platform, allowing it to streamline data aggregation and analytics.

To learn more about how Isilon has helped NCC eliminate time-consuming overnight data importation processes and tackle growing data aggregation requirements, check out the video below or read the NCC Media Case Study.

 

As New Business Models Emerge, Enterprises Increasingly Seek to Leave the World of Silo-ed Data

Keith Manthey

CTO of Analytics at EMC Emerging Technologies Division

As Bob Dylan famously wrote back in 1964, the times, they are a changin’. And while Dylan probably wasn’t speaking about the Fortune 500’s shifting business models and their impact on enterprise storage infrastructure (as far as we know), his words hold true in this context.

Many of the world’s largest companies are attempting to reinvent themselves by abandoning their product- or manufacturing-focused business models in favor of a more service-oriented approach. Look at industrial giants such as GE, Caterpillar or Procter & Gamble, and consider how they leverage existing data about their products (in GE’s case, say, a power plant) and apply it to a service model (say, for utilities).

The evolution of a product-focused model into a service-oriented one can offer more value (and revenue) over time, but it also requires a more sophisticated analytic model and a holistic approach to data, a marked departure from the silo-ed way data has traditionally been managed.

Transformation

Financial services is another example of an industry undergoing a transformation from a data storage perspective. Here you have a complex business with lots of traditionally silo-ed data, split between commercial, consumer and credit groups. But increasingly, banks and credit unions want a more holistic view of their business in order to understand how various divisions or teams could work together in new ways. Enabling consumer credit and residential mortgage units to securely share data, for example, could allow them to build better risk-score models across loans, ultimately allowing a financial institution to provide better customer service and expand its product mix.

Early days of Hadoop: compromise was the norm

As with any revolution, it’s the small steps that matter most at first. Enterprises have traditionally started small when it comes to holistically governing their data and managing workflows with Hadoop. In the earlier days of Hadoop, say five to seven years ago, enterprises accepted potential compromises around data availability and efficiency, as well as around how workflows could be governed and managed. Operational issues could arise, making it difficult to keep things running one to three years down the road. Security and availability were often best effort – there was no expectation of five-nines reliability.

Data was secured by making it an island unto itself. The idea was to scale up as necessary and build a cluster for each additional department or use case. Individual groups or departments ran what was needed, and there wasn’t much integration with existing analytics environments.

With Hadoop’s broader acceptance, new business models can emerge

However, with Hadoop passing its 10-year anniversary last year, we’ve started to see broader acceptance of the platform, and as a result it’s becoming both easier and more practical to consolidate data company-wide. What’s changed is the realization that Hadoop has proven itself and is not a science experiment. The number of Hadoop environments has grown, and users are realizing there is real power in combining data from different parts of the business and real business value in keeping historical data.

At best, the model of building different islands and running them independently is impractical; at worst it is potentially paralyzing for businesses. Consolidating data and workflows allows enterprises to focus on and implement better security, availability and reliability company-wide. In turn, they are also transforming their business models and expanding into new markets and offerings that weren’t possible even five years ago.

The Democratization of Data Science with the Arrival of Apache Spark

Keith Manthey

CTO of Analytics at EMC Emerging Technologies Division

As an emerging field, data science has seen rapid growth over the span of just a few short years. With Harvard Business Review referring to the data scientist role as the “sexiest job of the 21st century” in 2012 and job postings for the role growing 57 percent in the first quarter of 2015, enterprises are increasingly seeking out talent to help bolster their organizations’ understanding of their most valuable assets: their data.

The growing demand for data scientists reflects a larger business trend – a shifting emphasis from the zeros and ones to the people who help manage the mounds of data on a daily basis. Enterprises are sitting on a wealth of information but are struggling to derive actionable insights from it, in part due to its sheer volume but also because they don’t have the right talent on board to help.

The problem enterprises now face isn’t capturing data – but finding and retaining top talent to help make sense of it in meaningful ways. Luckily, there’s a new technology on the horizon that can help democratize data science and increase accessibility to the insights it unearths.

Data Science Scarcity & Competition

The talent pool for data scientists is notoriously shallow. According to McKinsey & Company, by 2018 the United States alone may face a 50 to 60 percent gap between supply and demand for “deep analytic talent, i.e., people with advanced training in statistics or machine learning.” Data scientists possess an essential blend of business acumen, statistical knowledge and technological prowess, rendering them as difficult to train as they are invaluable to the modern enterprise.

Moreover, banks and insurance companies face an added struggle in hiring top analytics talent, with the allure of Silicon Valley beckoning top performers away from organizations perceived as less inclined to innovate. This perception issue hinders banks’ and insurance companies’ ability to remain competitive in hiring and retaining data scientists.

As automation and machine learning grow increasingly sophisticated, however, there’s an opportunity for banks and insurance companies to harness the power of data science, without hiring formally trained data scientists. One such technology that embodies these innovations in automation is Apache Spark, which is poised to shift the paradigm of data science, allowing more and more enterprises to tap into insights culled from their own data.

Spark Disrupts & Democratizes Data Science

Data science requires three pillars of knowledge: statistical analysis, business intelligence and technological expertise. Spark does the technological heavy lifting, understanding and processing data at a scale that most people aren’t comfortable with. It handles the distribution and categorization of the data, removing the burden from individuals and automating the process. By allowing enterprises to load data into clusters and query it on an ongoing basis, the platform is particularly adept at machine learning and automation – a crucial component of any system intended to analyze massive quantities of data.
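
A minimal PySpark sketch of that heavy lifting: the user declares a read, a SQL aggregation and a built-in MLlib algorithm, and Spark distributes all three across the cluster. The dataset path and column names are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("spark-heavy-lifting").getOrCreate()

# Spark parallelizes the read, the SQL, and the model fit across the cluster;
# the user describes *what* to compute, not *how* to distribute it.
txns = spark.read.parquet("hdfs:///datalake/transactions/")  # hypothetical dataset
txns.createOrReplaceTempView("txns")

summary = spark.sql("""
    SELECT customer_id, count(*) AS n_txns, sum(amount) AS total_spend
    FROM txns
    GROUP BY customer_id
""")

# One of Spark's pre-built MLlib algorithms, applied without any hand-written
# distributed code: cluster customers into five behavioral segments.
assembled = VectorAssembler(inputCols=["n_txns", "total_spend"],
                            outputCol="features").transform(summary)
segments = KMeans(k=5, featuresCol="features", seed=42).fit(assembled)
segments.transform(assembled).groupBy("prediction").count().show()
```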

Spark was created in the labs of UC Berkeley and has quickly taken the analytics world by storm, with two main business propositions: the freedom to model data without hiring data scientists, and the power to leverage analytics models that are already built and ready for use in Spark today. The combination of these two attributes allows enterprises to accelerate their analytics efforts with a modern, open-source technology.

The arrival of Spark signifies a world of possibility for companies that are hungry for the business value data science can provide but are finding it difficult to hire and keep deep analytic talent on board. The applications of Spark are seemingly endless, from cybersecurity and fraud detection to genomics modeling and actuarial analytics.

What Spark Means for Enterprises

Not only will Spark enable businesses to hire non-traditional data scientists, such as actuaries, to effectively perform the role, but it will also open a world of possibilities in terms of actual business strategy.

Banks, for example, have been clamoring for Spark from the get-go, in part because of Spark’s promise to help banks bring credit card authorizations back in-house. For over two decades, credit card authorizations have been outsourced, since it was more efficient and far less dicey to centralize the authorization process.

The incentive to bring this business back in-house is huge, however, with estimated cost savings of tens to hundreds of millions annually. With Spark, the authorization process could be automated in-house – a huge financial boon to banks. The adoption of Spark allows enterprises to effectively leverage data science and evolve their business strategies accordingly.

The Adoption of Spark & Hadoop

Moreover, Spark works seamlessly with the Hadoop distributions sitting on EMC’s storage platforms. As I noted in my last post, Hadoop adoption among enterprises has been incredible, and Hadoop is quickly becoming the de facto standard for storing and processing terabytes or even petabytes of data.

By leveraging Spark and existing Hadoop platforms in tandem, enterprises are well-prepared to solve the ever-increasing data and analytics challenges ahead.

Digital Strategies: Are Analytics Disrupting the World?

Keith Manthey

CTO of Analytics at EMC Emerging Technologies Division

“Software is eating the world.” It is a phrase we often see written but sometimes do not fully understand. More recently I have read derivations of that phrase positing that “analytics are disrupting the world.” Both phrases hold a lot of truth. But why? Some of the major disruptions of the last five years can be attributed to analytics. Most companies that serve as an intermediary, such as Uber or Airbnb, with a business model built on making consumer-and-supplier “connections,” are driven by analytics. Pricing surges, routing optimizations, available rentals, available drivers and so on are all algorithms to these “connection” businesses that are disrupting the world. It could be argued that analytics is their secret weapon.

It is normal for startups to make new and sometimes crazy, risky bets on new technologies like Hadoop and analytics. That trend is carrying over into traditional industries and established businesses as well. What are the analytics use cases in industries like Financial Services (aka FSI)?

Established Analytics Plays in FSI

Two use cases naturally come to mind when I think of “Analytics” and “Financial Services”: high-frequency trading and fraud are two traditional use cases that have long utilized analytics. Both are fairly well respected and well documented with regard to their heavy use of analytics. I recently blogged (From Kinetic to Synthetic) on behalf of Equifax about market trends in synthetic fraud. Beyond these obvious trends, though, where are analytics affecting the Financial Services industry? Which use cases are relevant and making an impact in 2016, and why?

Telematics

The insurance industry has been experimenting for several years with opt-in programs that monitor driving behavior. Insurance companies have varying opinions of its usefulness, but it is clear that driving-behavior data (1) involves heavy use of unstructured data and (2) represents a dramatic leap from the traditional statistical approach built on financial data, actuarial tables and demographics. Telematics is the name given to this set of opt-in, usage-based insurance and driver-monitoring programs. Its use has fostered a belief, long established in other verticals like fraud, that behavior should be pinned down to an individual pattern rather than predicted across broad swaths of a population. To be more precise, telematics looks to derive a “behavior of one” rather than a “generalized driving pattern for 1,000 individuals.” To see why this differs from past insurance practice, draw a specific comparison between the two methods. Method One – historical actuarial tables of life expectancy, along with demographic and financial data, to denote risk – versus Method Two – how ONE individual actually drives, based upon real driving data received from their car. Which is more predictive of the expected rate of accidents is the question for analytics. While this is a gross over-simplification of the entire process, it is a radical shift in the types of data and the analytical methods available to the industry for deriving value. Truly transformational.
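
A hedged sketch of the “behavior of one” idea, assuming a PySpark environment with invented column names and weights: each driver’s score is computed from his or her own telemetry rather than inherited from a demographic cohort.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("telematics-behavior-of-one").getOrCreate()

# Hypothetical raw telemetry: one row per reading, per vehicle.
# Columns assumed: driver_id, speed_mph, speed_limit_mph, brake_g, hour_of_day
telemetry = spark.read.parquet("hdfs:///datalake/telematics/")

# Method Two ("behavior of one"): summarize how THIS driver actually drives.
per_driver = telemetry.groupBy("driver_id").agg(
    F.avg((F.col("speed_mph") > F.col("speed_limit_mph") + 10).cast("double")).alias("pct_speeding"),
    F.avg((F.col("brake_g") > 0.4).cast("double")).alias("pct_hard_braking"),
    F.avg(F.col("hour_of_day").between(0, 4).cast("double")).alias("pct_late_night"))

# Illustrative (made-up) weighting into a single usage-based risk score.
scored = per_driver.withColumn(
    "telematics_risk",
    0.5 * F.col("pct_speeding") + 0.3 * F.col("pct_hard_braking") + 0.2 * F.col("pct_late_night"))

# Method One, by contrast, would join every driver to a demographic cohort table and
# assign all members of the cohort the same actuarial risk factor.
scored.orderBy(F.desc("telematics_risk")).show(10)
```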

Labor Arbitrage

The insurance industry has also been experimenting with analytics based on past performance data. The industry has years of predictive information (i.e., claim reviews along with their actual outcomes) from past claims. By exploring this historical data, insurance companies can apply logistic regression to derive weighted scores for incoming claims, and those scores are then used to determine a path forward. For example, if claims scoring greater than 50 were almost always paid after evaluation, then all scores above 50 should be immediately approved and paid. The inverse also holds: low-scoring claims can be quickly rejected, because they are rarely appealed and are regularly turned down under review when they are. The analytics compare the present case against the outcomes in the corpus of past performance data to derive its most likely outcome. The business effect is that the workforce reviewing medical claims is given only those files that actually need to be worked, resulting in better workforce productivity. This is labor arbitrage, with data and analytics as the disruptor of workforce trends.
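
Here is a minimal PySpark ML sketch of that scoring-and-routing approach (Spark 3.x, for vector_to_array). The feature columns, paths and the 0.9/0.2 probability thresholds are illustrative assumptions, not a production rule set; in practice the cutoffs would be tuned against the historical outcomes.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.functions import vector_to_array  # Spark 3.x

spark = SparkSession.builder.appName("claims-triage").getOrCreate()

# Historical claims with known review outcomes (was_paid: 1 = paid, 0 = denied).
history = spark.read.parquet("hdfs:///warehouse/claims_history/")    # hypothetical
new_claims = spark.read.parquet("hdfs:///staging/claims_incoming/")  # hypothetical

assembler = VectorAssembler(
    inputCols=["claim_amount", "provider_history_score", "procedure_risk"],  # assumed features
    outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="was_paid")
model = Pipeline(stages=[assembler, lr]).fit(history)

# Probability that each new claim would be paid after review.
scored = model.transform(new_claims).withColumn(
    "p_paid", F.element_at(vector_to_array(F.col("probability")), 2))

# Route the claim: high scores auto-approve, low scores auto-reject,
# and only the uncertain middle band goes to a human reviewer.
routed = scored.withColumn(
    "route",
    F.when(F.col("p_paid") >= 0.9, "auto_approve")
     .when(F.col("p_paid") <= 0.2, "auto_reject")
     .otherwise("manual_review"))

routed.groupBy("route").count().show()
```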

Know Your Customer

Retail banking has turned to analytics in its focus on attracting and retaining customers. After a large wave of acquisitions in the last decade, retail banks are working to integrate their various portfolios. In some cases, resolving the identity of all their clients across all their accounts isn’t as straightforward as it sounds; this is especially hard with dormant accounts that might carry maiden names, mangled data attributes or old addresses. The ultimate goal of co-locating all customer data in an analytics environment is a “customer 360,” which is mainly focused on gaining full insight into a customer. That insight can surface upsell opportunities by understanding a customer’s peer set and which products a similar demographic shows strong interest in. For example, if individuals of a given demographic typically subscribe to three of a company’s five products, an individual matching that demographic who subscribes to only one product should be targeted for upsell on the others. This uses large swaths of data and a company’s own product-adoption history to build upsell and marketing strategies for its existing customers. If someone was both a small-business owner and a personal consumer of the retail bank, the bank may not have previously tied those accounts together. A customer 360 gives the bank a whole new perspective on who its customer base really is.
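
To show the mechanics of that demographic-based upsell targeting, here is a small PySpark sketch over hypothetical customer-360 tables: it computes product adoption per segment and flags customers who lack products that most of their peers hold. The table names and the 60 percent popularity cutoff are assumptions for the example.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("customer-360-upsell").getOrCreate()

# Hypothetical customer-360 tables, after identity resolution has been done.
customers = spark.read.parquet("hdfs:///c360/customers/")  # customer_id, segment
holdings = spark.read.parquet("hdfs:///c360/holdings/")    # customer_id, product

# Product adoption rate within each demographic segment.
seg_sizes = customers.groupBy("segment").agg(F.countDistinct("customer_id").alias("seg_size"))
adoption = (customers.join(holdings, "customer_id")
            .groupBy("segment", "product")
            .agg(F.countDistinct("customer_id").alias("holders"))
            .join(seg_sizes, "segment")
            .withColumn("adoption_rate", F.col("holders") / F.col("seg_size")))

# Products popular with a customer's peers (assumed 60% cutoff) that the customer lacks.
popular = adoption.filter(F.col("adoption_rate") >= 0.6).select("segment", "product")
upsell = (customers.join(popular, "segment")
          .join(holdings, ["customer_id", "product"], "left_anti"))

upsell.groupBy("customer_id").agg(F.collect_list("product").alias("suggested_products")).show(10)
```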

Wrap Up

Why are these trends interesting? In most of the cases above, people are familiar with certain portions of the story, but the underlying why or what often gets missed. It is important to understand not only the technology and capabilities involved in a transformation, but also the underlying shift it is causing. EMC has a long history of helping customers through these journeys, and we look forward to helping even more clients face them.