Hadoop Grows Up: How Enterprises Can Successfully Navigate its Growing Pains

Keith Manthey

CTO of Analytics at EMC Emerging Technologies Division

If you’d asked me 10 years ago whether enterprises would be migrating to Hadoop, I would’ve answered with an emphatic no. Slow to entice enterprise customers and named after a toy elephant, at first glance, the framework didn’t suggest it was ready for mass commercialization or adoption.

But Hadoop’s adoption among enterprises has been phenomenal. Its open-source software framework gives enterprises the ability to process and store unprecedented volumes of data – a capability today’s enterprise sorely needs. Hadoop has effectively become today’s default standard for storing, processing and analyzing massive quantities of data: hundreds of terabytes or even petabytes.

While the adoption and commercialization of Hadoop is remarkable, and an overall positive move for enterprises hungry for streamlined data storage and processing, enterprises are in for a significant challenge with the migration from Hadoop 2.X to 3.X.

Most aren’t sure what to expect, and few experienced the pain points of the earlier migration. Though Hadoop has “grown up”, in that it is now used by some of the world’s largest enterprises, the project has yet to offer a non-disruptive upgrade path between major releases.

Arriving in just a few short years, this next migration will have dramatic implications for the storage capabilities of today’s insurance companies, banks and largest corporations. It’s imperative that these organizations begin planning for the change now to ensure that their most valuable asset – their data – remains intact and accessible in an “always on” culture that demands it.

Why the Migration Matters

First, let’s explore the significant benefits of the migration and why, despite the headaches, this conversion will ultimately be beneficial for enterprises.

One of the key benefits of Hadoop 3.X is erasure coding, which dramatically decreases the amount of storage needed to protect data. In a more traditional system, files are replicated multiple times to protect against loss; if one copy becomes lost or corrupted, a replica can simply be used in its place.

As you can imagine, replicating data shields against failure but requires significant volumes of storage, and that is expensive. In fact, HDFS’s default three-way replication requires an additional 200 percent in storage space, along with other resources such as network bandwidth when writing the data.

Hadoop 3.X’s move to erasure coding resolves the storage issue while maintaining the same level of fault tolerance. In other words, erasure coding protects data as effectively as traditional replication but takes up far less storage. In fact, erasure coding is estimated to reduce the storage cost by 50 percent – a huge financial boon for enterprises moving to Hadoop 3.X. With Hadoop 3.X, enterprises will be able to store twice as much data on the same amount of raw storage hardware.
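The arithmetic behind those percentages is easy to check. Here’s a minimal sketch comparing HDFS’s default 3x replication with a Reed-Solomon (6, 3) erasure-coding layout (six data cells plus three parity cells), one of the policies Hadoop 3 supports; the 100 TB starting figure is just an illustration:

```python
# Raw storage needed to protect a given amount of data under each scheme.

def replication_raw_tb(data_tb, copies=3):
    """Total raw storage with n-way replication (HDFS default: 3 copies)."""
    return data_tb * copies

def erasure_coded_raw_tb(data_tb, data_cells=6, parity_cells=3):
    """Total raw storage with a Reed-Solomon (data_cells, parity_cells) layout."""
    return data_tb * (data_cells + parity_cells) / data_cells

data_tb = 100  # illustrative dataset size
rep = replication_raw_tb(data_tb)   # 300 TB raw: a 200% overhead
ec = erasure_coded_raw_tb(data_tb)  # 150 TB raw: a 50% overhead
print(f"3x replication: {rep:.0f} TB raw; RS(6,3) erasure coding: {ec:.0f} TB raw")
print(f"Raw storage saved: {1 - ec / rep:.0%}")
```

The 50 percent reduction cited above falls straight out of the ratio: nine cells stored for every six cells of data, versus three full copies.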

That being said, enterprises upgrading to Hadoop 3.X will face significant roadblocks in keeping their data accessible and intact during a complicated migration process.

Anticipating Challenges Ahead

For those of us who experienced the conversion from Hadoop 1.X to Hadoop 2.X, it was harrowing, requiring a complete unload of the data in the Hadoop environment and a complete re-load onto the new system. That meant long periods of data inaccessibility and, in some cases, data loss. Take a typical laptop upgrade and multiply the pain points a thousand-fold.

Data loss is no longer a tolerable scenario for today’s enterprises and can have huge financial, not to mention reputational implications. However, most enterprises adopted Hadoop after its last revamp, foregoing the headaches associated with major upgrades involving data storage and processing. These enterprises may not anticipate the challenges ahead.

The looming migration can have potentially dire implications for today’s enterprises. A complete unload and re-load of enterprises’ data will be expensive, painful and fraught with data loss. Without anticipating the headaches in store for the upcoming migration, enterprises may forego the necessary measures to ensure the accessibility, security and protection of their data.

Navigating the Migration Successfully

The good news is that there is a simple, actionable step enterprises can take to manage migration and safeguard their data against loss, corruption and inaccessibility.

Enterprises need to ensure that their current system does not require a complete unload and reload of their data. Most systems do require a complete unload and reload, so it is crucial that enterprises understand their current system and its capabilities when it comes to the next Hadoop migration.

If the enterprise were on Isilon for Hadoop, for example, there would be no need to unload and re-load its data. The enterprise would simply point the newly upgraded compute nodes at Isilon, with limited downtime, no re-load time and no risk of data loss.

Isilon for Hadoop helps enterprises ensure the accessibility and protection of their data through the migration process to an even stronger, more efficient Hadoop 3.X. While I’m eager for the next revamp of Hadoop and its tremendous storage improvements, today’s enterprises need to take precautionary measures before the jump to protect their data and ensure the transition is as seamless as possible.

Cloud Computing and EDA – Are we there yet?

Lawrence Vivolo

Sr. Business Development Manager at EMC²

Today anything associated with “Cloud” is all the rage. In fact, depending on your cellular service provider, you’re probably already using cloud storage to back up the e-mail, pictures, texts, etc. on your cell phone. (I realized this when I got spammed with “you’re out of cloud space – time to buy more” messages.) Major companies that offer cloud-based solutions (servers, storage, infrastructure, applications, management, etc.) include Microsoft, Google, Amazon, Rackspace, Dropbox, EMC and others. For those who don’t know the subtleties of Cloud and its terms – Public vs. Private vs. Hybrid vs. Funnel – and why some are better suited for EDA, I thought I’d give you some highlights.

Let’s start with the obvious – what is “Cloud”? Cloud is a collection of resources which can include servers (for computing), storage, applications, infrastructure (ex: networking) and even services (management, backups, etc.). Public clouds are simply clouds that are made available by 3rd-parties and are shared resources. Being shared is often advertised as a key advantage of public cloud – because the resources are shared, so is the cost. These shared resources can also expand and contract as needs change, allowing companies to precisely balance need with availability.  Back in 2011, Synopsys, a leading EDA company, was promoting this as a means to address peak EDA resource demand [1].

Unfortunately, public cloud has some drawbacks. One is the predictability of storage cost. Though public cloud appears very affordable at first glance, most providers charge for the movement of data to and from their cloud, and those charges can exceed the actual cost of storing the data. This is further compounded when data is needed worldwide, as it may have to be copied to multiple regions for performance and redundancy. For semiconductor design these charges can be significant, since many EDA programs generate enormous volumes of data.
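To see how data movement can dominate the bill, here’s a rough sketch. The per-GB prices below are hypothetical placeholders, not any provider’s actual rates; the point is only that egress charges scale with how often data moves, not how much sits in storage:

```python
# Illustrative monthly cost model: storage-at-rest vs. data-out (egress).
# All prices are made-up assumptions for the sake of the comparison.

def monthly_cost_usd(stored_tb, egress_tb_per_month,
                     storage_per_gb=0.02, egress_per_gb=0.09):
    """Return (storage_cost, egress_cost) in dollars per month."""
    gb = 1024  # TB -> GB
    storage = stored_tb * gb * storage_per_gb
    egress = egress_tb_per_month * gb * egress_per_gb
    return storage, egress

# A design team storing 50 TB but pulling 20 TB of results back each month:
storage, egress = monthly_cost_usd(stored_tb=50, egress_tb_per_month=20)
print(f"storage ~${storage:,.0f}/mo, egress ~${egress:,.0f}/mo")
```

Under these assumed rates, moving less than half the stored data out each month already costs more than storing all of it – and replicating to multiple regions multiplies the movement.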

Perhaps the greatest drawback to EDA adoption of public cloud is the realization that your data might be sitting on physical compute and/or storage resources that are shared with someone else’s data. That doesn’t mean you can see others’ data – access is restricted via OS policy and other security measures – yet it does create a potential path for unauthorized access. As a result, most semiconductor companies have not been willing to risk having their “crown jewels” (their IP) hacked and stolen from a public cloud environment. Security has improved since 2011, however, and some companies are now considering cloud for long-term archiving of non-critical data as well as some less business-critical IP.

Private cloud avoids these drawbacks by isolating the physical infrastructure – including hardware, storage and networking – from all other users. Your own company’s on-premise hardware is typically a private cloud, even though, increasingly, some of that “walled-off” infrastructure is itself located off-premise and/or owned and managed by a 3rd party. While physical and network isolation reduce the security concerns, they also eliminate some of the flexibility: the number of servers available can’t be increased or decreased with a single key-click to accommodate peaks in demand, at least not without upfront planning and additional cost.

Hybrid cloud is another common term – which simply means a combination of public and private clouds.

In the world of semiconductor design, private cloud as a service has been available for some time and is offered in various forms by several EDA companies today. Cadence® Design Systems, for example, offers both Hosted Design Solutions [2], which includes HW, SW and IT infrastructure, and QuickCycles® Service which offers on-site or remote access to Palladium emulation and simulation acceleration resources [3]. Hybrid cloud is also starting to gain interest, where non-critical data that’s infrequently accessed can be stored with minimal transport costs.

The public cloud market is changing constantly, and as time progresses new improvements may make it more appealing to EDA. A challenge for IT administrators is meeting today’s growing infrastructure needs while avoiding investments that are incompatible with future cloud migrations. This is where you need to hedge your bets and choose a platform that delivers the performance and flexibility EDA companies require, yet enables easy migration from private to hybrid – or even public – cloud. EMC’s Isilon, for example, is an EDA-proven high-performance network-attached storage platform that provides native connectivity to the most popular public cloud providers, including Amazon Web Services, Microsoft Azure and EMC’s Virtustream.

Not only does native cloud support future-proof today’s storage investment, it makes the migration seamless, thanks to a single point of management that encompasses private, hybrid and public cloud deployments. EMC Isilon supports a feature called CloudPools, which transparently extends an Isilon storage pool into cloud infrastructures. With CloudPools, your company’s critical data can remain on-premise while less critical, rarely accessed data is encrypted and archived automatically and transparently to the cloud. Isilon can also be configured to archive your business-critical data (IP) to lower-cost on-premise media. This combination saves budget and keeps more high-performance storage space available locally for your critical EDA jobs.
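The tiering decision such policies automate boils down to a few rules. The sketch below is purely illustrative – the thresholds, tier names and function are my own, not Isilon’s actual policy engine or any CloudPools API:

```python
# Hypothetical sketch of an age- and criticality-based tiering rule,
# in the spirit of what automated tiering policies do for you.
from datetime import datetime, timedelta

def choose_tier(last_access: datetime, business_critical: bool,
                now: datetime, archive_after_days: int = 180) -> str:
    if business_critical:
        return "on-premise archive"          # critical IP never leaves the site
    if now - last_access > timedelta(days=archive_after_days):
        return "cloud archive (encrypted)"   # cold, non-critical data goes out
    return "performance tier"                # hot data stays on fast local storage

now = datetime(2016, 6, 1)
print(choose_tier(datetime(2015, 1, 1), business_critical=False, now=now))
print(choose_tier(datetime(2016, 5, 20), business_critical=False, now=now))
print(choose_tier(datetime(2015, 1, 1), business_critical=True, now=now))
```

The value of a platform-level feature is that this logic runs transparently: applications keep reading files through the same path regardless of which tier the bytes live on.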

Semiconductor companies and EDA vendors have had their eyes on public cloud for many years. While significant concerns over security continue to slow adoption, technology continues to evolve. Whether your company ultimately sticks with private cloud, or migrates seamlessly to hybrid or public cloud in the future depends on decisions you make today. The key is to focus on flexibility, and not let fear cloud your judgment.

[1] EDA in the Clouds: Myth Busting: https://www.synopsys.com/Company/Publications/SynopsysInsight/Pages/Art6-Clouds-IssQ2-11.aspx?cmp=Insight-I2-2011-Art6

[2] Cadence Design Systems Hosted Design Services: http://www.cadence.com/services/hds/Pages/Default.aspx

[3] Cadence Design System QuickCycles Service: http://www.cadence.com/products/sd/quickcycles/pages/default.aspx

From Kinetic to Synthetic

Keith Manthey

CTO of Analytics at EMC Emerging Technologies Division

Technology continues to evolve and drive disruption. By now, most of you have probably seen the meme shared online that lists the biggest ride-hailing company as owning no cars, the biggest accommodation company as owning no property, and so on. The identity and fraud space has not been immune to this trend either.

Several decades ago, identity theft and fraud already posed a challenge to businesses, but the challenge was very kinetic. A fraudster usually committed the fraud in person, often using forged documents: fraud was a physical, kinetic transaction.

Fast-forward to today: kinetic fraud has greatly shrunk in scope and impact, and in its place cyber fraud, committed via many different avenues, is burgeoning. In the wake of a number of recent cyber breaches, identity information and compromised payment methods like credit cards are readily available on the dark portions of the web. These identity elements sell for extremely low prices these days, but it’s the volume of this data that will ultimately prove financially rewarding to the fraudsters.


OLTP Power With EMC ScaleIO: Software-Defined Block Storage for SAP, Oracle & NoSQL Databases

Jeff Thomas

Global ScaleIO SE Director at EMC²

When it comes to getting the best from today’s enterprise OLTP databases, a powerful storage solution is vital. Let’s explore how EMC ScaleIO software-defined block storage gives DBAs all the performance, scalability and resiliency they demand – while also giving infrastructure managers the flexibility, ease of management and cost-efficiency they need.

You may be running traditional enterprise OLTP (on-line transaction processing) database applications from vendors like Microsoft, Oracle and SAP. Perhaps you’re exploring new in-memory databases like SAP HANA, or the latest scale-out databases based on NoSQL (including Cassandra, MongoDB, CouchDB, Apache HBase and so on).

When it comes to database storage, you may be using a high-performance purpose-built array to make it all work. Or, if economies of scale are an issue, that may drive you to build your own system with direct-attached storage (DAS).

But software-defined storage (SDS) now offers a third option that promises the best of both worlds. EMC ScaleIO uses smart software to connect multiple industry-standard x86 servers into a shareable pool of high-performance block storage – creating a server-based virtual SAN. Our customers are increasingly embracing ScaleIO as a next-generation block storage platform for their databases – and here are some great reasons why…

Software-Defined Flexibility and Agility

ScaleIO software is agnostic to hardware and hypervisor, running on the x86 server infrastructure most organizations already use. ScaleIO can also be deployed flexibly – in a ‘storage-only’ model where storage and applications are on physically separate servers, or in a ‘hyper-converged’ model where each server hosts both applications and shared storage. ScaleIO’s tiny resource footprint means that running hyper-converged has minimal impact on database performance, making this an increasingly popular option.


Software-defined scale-out NAS extends your data lake from core to edge: IsilonSD Edge NAS software

Sri Seshadri

Product Marketing at EMC Isilon

When people consider enterprise data growth, they often focus on the ‘core’ IT within the corporate headquarters and datacenter. But what’s happening further away from the core – at your remote offices and branch offices?

We all know that the amount of enterprise data requiring storage is doubling every 2–3 years (according to analyst IDC’s ‘Digital Universe’ study). Managing these ever-growing quantities of (mostly unstructured) data is a constant challenge for most enterprises.
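What “doubling every 2–3 years” means for capacity planning is worth making concrete. A quick sketch, where the 100 TB starting estate and six-year horizon are illustrative assumptions:

```python
# Project data growth under a fixed doubling period (compound growth).

def projected_tb(start_tb, years, doubling_period_years):
    """Data volume after `years`, doubling every `doubling_period_years`."""
    return start_tb * 2 ** (years / doubling_period_years)

start = 100  # TB today (hypothetical estate)
for period in (2, 3):
    print(f"doubling every {period} years: "
          f"{projected_tb(start, 6, period):,.0f} TB after 6 years")
```

Six years out, the same estate holds four to eight times the data it does today – which is why silo-by-silo storage management stops scaling.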

At the enterprise core, EMC Isilon is already addressing that challenge. The Isilon data lake offering helps you consolidate your data, eliminate storage silos, simplify management, increase data protection, and gain value from your data assets. Isilon’s built-in multi-protocol capabilities support a wide range of traditional and next-gen applications – including data analytics that can be used to gain better insights to accelerate your business.

But data is also growing at enterprise edge locations. A recent ESG study (“Remote Office/Branch Office Technology Trends”, May 2015) showed that 68% of organizations now have an average of more than 10 TB of data stored at each branch office – while only 23% reported this amount of edge-stored data in 2011.


What’s Next for Hadoop? Examining its Evolution, and its Potential

John Mallory

CTO of Analytics at EMC Emerging Technologies Division

In my last blog post, I talked about one of the most popular buzzwords in the IT space today – the Internet of Things – and offered some perspective in terms of what’s real and what’s hype, as well as which use cases make the most sense for IoT in the short-term.

Today I’d like to address the evolution of Apache’s Hadoop, and factors to consider that will drive Hadoop adoption to a wider audience beyond early use-cases.

First, consider that data informs nearly every decision an organization makes today. Customers across virtually every industry expect to interact with businesses wherever they go, in real time, across a myriad of devices and applications. The result is mountains of information that must be culled, sorted and organized to find the actionable data that drives businesses forward.

This evolution mirrors much of what’s taking place in the Apache-Hadoop ecosystem as it continues to mature and find its place among a broader business audience.

The Origins & Evolution of Hadoop

Let’s look at the origins of Hadoop as a start. Hadoop started out as a framework for big batch processing, which is exactly what early adopters like Yahoo! needed – an algorithm that could crawl all of the content on the Internet to help build big search engines, then take the outputs and monetize them with targeted advertising. That type of use case is entirely predicated on batch processing at very large scale.

The next phase centered on how Hadoop would reach a broader customer base. The challenge was to make Hadoop easier for a wider audience to use. Sure, it’s possible to do very rich processing with Hadoop, but it also has to be programmed very specifically, which can make it difficult for enterprise users doing business intelligence or reporting. This drove the trend toward SQL on Hadoop, which was the big thing about two years ago, with companies like Cloudera, IBM, Pivotal and others entering the space.
