Archive for the ‘Hadoop’ Category

As New Business Models Emerge, Enterprises Increasingly Seek to Leave the World of Silo-ed Data

Keith Manthey

CTO of Analytics at EMC Emerging Technologies Division

As Bob Dylan famously wrote back in 1964, the times, they are a-changin’. And while Dylan probably wasn’t speaking about the Fortune 500’s shifting business models and their impact on enterprise storage infrastructure (as far as we know), his words hold true in this context.

Many of the world’s largest companies are attempting to reinvent themselves by abandoning their product- or manufacturing-focused business models in favor of a more service-oriented approach. Look at industrial giants such as GE, Caterpillar or Procter & Gamble, and consider how they leverage existing data about their products (in GE’s case, say, a power plant) and apply it to a service model (say, services for utilities, in this example).

The evolution of a product-focused model into a service-oriented one can offer more value (and revenue) over time, but it also requires a more sophisticated analytic model and a holistic approach to data, a marked departure from the silo-ed way data has traditionally been managed.

Transformation

Financial services is another example of an industry undergoing a transformation from a data storage perspective. Here you have a complex business with lots of traditionally silo-ed data, split between commercial, consumer and credit groups. But increasingly, banks and credit unions want a more holistic view of their business in order to better understand how various divisions or teams could work together in new ways. Enabling consumer credit and residential mortgage units to securely share data could allow them to build better risk score models across loans, for example, ultimately allowing a financial institution to provide better customer service and expand its product mix.

Early days of Hadoop: compromise was the norm

As with any revolution, it’s the small steps that matter most at first. Enterprises have traditionally started small when it comes to holistically governing their data and managing workflows with Hadoop. In the earlier days of Hadoop, say five to seven years ago, enterprises accepted potential compromises around data availability and efficiency, as well as around how workflows could be governed and managed. Operational issues could arise, making it difficult to keep things running one to three years down the road. Security and availability were often best effort – there weren’t expectations of five-nines reliability.

Data was secured by making it an island by itself. The idea was to scale up as necessary, and build a cluster for each additional department or use case. Individual groups or departments ran what was needed and there wasn’t much integration with existing analytics environments.

With Hadoop’s broader acceptance, new business models can emerge

However, with Hadoop marking its 10-year anniversary last year, we’ve started to see broader acceptance of the platform, and as a result it’s becoming both easier and more practical to consolidate data company-wide. What’s changed is the realization that Hadoop has truly proven the concept and is not a science experiment. The number of Hadoop environments has grown, and users are realizing there is real power in combining data from different parts of the business and real business value in keeping historical data.

At best, the model of building different islands and running them independently is impractical; at worst it is potentially paralyzing for businesses. Consolidating data and workflows allows enterprises to focus on and implement better security, availability and reliability company-wide. In turn, they are also transforming their business models and expanding into new markets and offerings that weren’t possible even five years ago.

Analyst firm IDC evaluates EMC Isilon: Lab-validation of scale-out NAS file storage for your enterprise Data Lake

Suresh Sathyamurthy

Sr. Director, Product Marketing & Communications at EMC

A Data Lake should now be a part of every big data workflow in your enterprise organization. By consolidating file storage for multiple workloads onto a single shared platform based on scale-out NAS, you can reduce costs and complexity in your IT environment, and make your big data efficient, agile and scalable.

That’s the expert opinion in analyst firm IDC’s recent Lab Validation Brief: “EMC Isilon Scale-Out Data Lake Foundation: Essential Capabilities for Building Big Data Infrastructure”, March 2016. As the lab validation report concludes: “IDC believes that EMC Isilon is indeed an easy-to-operate, highly scalable and efficient Enterprise Data Lake Platform.”

The Data Lake Maximizes Information Value

The Data Lake model of storage represents a paradigm shift from the traditional linear enterprise data flow model. As data and the insights gleaned from it increase in value, enterprise-wide consolidated storage is transformed into a hub around which the ingestion and consumption systems work. This enables enterprises to bring analytics to the data in place – avoiding the expense of multiple storage systems and the time required for repeated ingestion and analysis.

But pouring all your data into a single shared Data Lake would put serious strain on traditional storage systems – even without the added challenges of data growth. That’s where the virtually limitless scalability of EMC Isilon scale-out NAS file storage makes all the difference…

The EMC Data Lake Difference

The EMC Isilon Scale-out Data Lake is an Enterprise Data Lake Platform (EDLP) based on Isilon scale-out NAS file storage and the OneFS distributed file system.

As well as meeting the growing storage needs of your modern datacenter with massive capacity, it enables big data accessibility using traditional and next-generation access methods – helping you manage data growth and gain business value through analytics. You can also enjoy seamless replication of data from the enterprise edge to your core datacenter, and tier inactive data to a public or private cloud.

We recently reached out to analyst firm IDC to lab-test our Isilon Data Lake solutions – here’s what they found in 4 key areas…

  1. Multi-Protocol Data Ingest Capabilities and Performance

Isilon is an ideal platform for enterprise-wide data storage, and provides a powerful centralized storage repository for analytics. With the multi-protocol capabilities of OneFS, you can ingest data via NFS, SMB and HDFS. This makes the Isilon Data Lake an ideal and user-friendly platform for big data workflows, where you need to ingest data quickly and reliably via protocols most suited to the workloads generating the information. Using native protocols enables in-place analytics, without the need for data migration, helping your business gain more rapid data insights.
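
To make the multi-protocol idea concrete, here is a minimal, hedged Python sketch of the same dataset being written over an NFS mount and then read back over HDFS with pyarrow. The mount point, SmartConnect hostname, paths and port are illustrative assumptions (the HDFS-visible path under the access zone will differ by configuration), and the HDFS client additionally requires the Hadoop native libraries (libhdfs) on the machine running it.

```python
# Hypothetical sketch: one dataset, two protocols (NFS in, HDFS out).
# The mount point, hostname and paths are assumptions, not a tested configuration.
from pyarrow import fs

NFS_MOUNT = "/mnt/isilon/analytics"              # Isilon export mounted over NFS (assumed path)
HDFS_HOST = "isilon-smartconnect.example.com"    # SmartConnect zone name (assumed)

# 1. A legacy application drops a file onto the NFS mount.
with open(f"{NFS_MOUNT}/clickstream/2016-03-01.csv", "w") as f:
    f.write("user_id,page,timestamp\n42,/home,2016-03-01T09:00:00\n")

# 2. A Hadoop job (or any HDFS client) reads the very same file in place,
#    with no copy or re-ingest step. Requires libhdfs / Hadoop client libraries.
hdfs = fs.HadoopFileSystem(HDFS_HOST, port=8020, user="hdfs")
with hdfs.open_input_stream("/analytics/clickstream/2016-03-01.csv") as stream:
    print(stream.read().decode())
```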


IDC validated that the Isilon Data Lake offers excellent read and write performance for Hadoop clusters accessing HDFS via OneFS, compared with access via direct-attached storage (DAS). In the lab tests, Isilon performed:

  • nearly 3x faster for data writes
  • over 1.5x faster for reads and read/writes.

As IDC says in its validation: “An Enterprise Data Lake platform should provide vastly improved Hadoop workload performance over a standard DAS configuration.”

  2. High Availability and Resilience

Policy-based high availability capabilities are needed for enterprise adoption of Data Lakes. The Isilon Data Lake is able to cope with multiple simultaneous component failures without interruption of service. If a drive or other component fails, it only has to recover the specific affected data (rather than recovering the entire volume).

IDC validated that a disk failure on a single Isilon node has no noticeable performance impact on the cluster. Replacing a failed drive is a seamless process and requires little administrative effort. (This is in contrast to traditional DAS, where the process of replacing a drive can be rather involved and time consuming.)

Isilon can even cope easily with node-level failures. IDC validated that a single-node failure has no noticeable performance impact on the Isilon cluster. Furthermore, the operation of removing a node from the cluster, or adding a node to the cluster, is a seamless process.

  3. Multi-tenant Data Security and Compliance

Strong multi-tenant data security and compliance features are essential for an enterprise-grade Data Lake. Access zones are a crucial part of the multi-tenancy capabilities of Isilon OneFS. In tests, IDC found that Isilon provides no-crossover isolation between Hadoop instances for multi-tenancy.

Another core component of secure multi-tenancy is the ability to provide a secure authentication and authorization mechanism for local and directory-based users and groups. IDC validated that the Isilon Data Lake provides multiple federated authentication and authorization schemes. User-level permissions are preserved across protocols, including NFS, SMB and HDFS.

Federated security is an essential attribute of an Enterprise Data Lake Platform, with the ability to maintain confidentiality and integrity of data irrespective of the protocols used. For this reason, another key security feature of the OneFS platform is SmartLock – specifically designed for deploying secure and compliant (SEC Rule 17a-4) Enterprise Data Lake Platforms.

In tests, IDC found that Isilon enables a federated security fabric for the Data Lake, with enterprise-grade governance, regulatory and compliance (GRC) features.

  4. Simplified Operations and Automated Storage Tiering

The Storage Pools feature of Isilon OneFS allows administrators to apply common file policies across the cluster locally – and extend them to the cloud.

Storage Pools consists of three components:

  • SmartPools: Data tiering within the cluster – essential for moving data between performance-optimized and capacity-optimized cluster nodes.
  • CloudPools: Data tiering between the cluster and the cloud – essential for implementing a hybrid cloud, and placing archive data on a low-cost cloud tier.
  • File Pool Policies: Policy engine for data management locally and externally – essential for automating data movement within the cluster and the cloud.

As IDC confirmed in testing, Isilon’s federated data tiering enables IT administrators to optimize their infrastructure by automating data placement onto the right storage tiers.

The expert verdict on the Isilon Data Lake

IDC concludes: “EMC Isilon possesses the necessary attributes such as multi-protocol access, availability and security to provide the foundations to build an enterprise-grade Big Data Lake for most big data Hadoop workloads.”

Read the full IDC Lab Validation Brief for yourself: “EMC Isilon Scale-Out Data Lake Foundation: Essential Capabilities for Building Big Data Infrastructure”, March 2016.

Learn more about building your Data Lake with EMC Isilon.

The Democratization of Data Science with the Arrival of Apache Spark

Keith Manthey

CTO of Analytics at EMC Emerging Technologies Division

As an emerging field, data science has seen rapid growth over the span of just a few short years. With Harvard Business Review referring to the data scientist role as the “sexiest job of the 21st century” in 2012 and job postings for the role growing 57 percent in the first quarter of 2015, enterprises are increasingly seeking out talent to help bolster their organizations’ understanding of their most valuable assets: their data.

The growing demand for data scientists reflects a larger business trend – a shifting emphasis from the zeros and ones to the people who help manage the mounds of data on a daily basis. Enterprises are sitting on a wealth of information but are struggling to derive actionable insights from it, in part due to its sheer volume but also because they don’t have the right talent on board to help.

The problem enterprises now face isn’t capturing data – but finding and retaining top talent to help make sense of it in meaningful ways. Luckily, there’s a new technology on the horizon that can help democratize data science and increase accessibility to the insights it unearths.

Data Science Scarcity & Competition

The talent pool for data scientists is notoriously scarce. According to McKinsey & Company, by 2018, the United States alone may face a 50 to 60 percent gap between supply and demand for “deep analytic talent, i.e., people with advanced training in statistics or machine learning.” Data scientists possess an essential blend of business acumen, statistical knowledge and technological prowess, rendering them as difficult to train as they are invaluable to the modern enterprise.

Moreover, banks and insurance companies face an added struggle in hiring top analytics talent, with the allure of Silicon Valley beckoning top performers away from organizations perceived as less inclined to innovate. This perception issue hinders banks’ and insurance companies’ ability to remain competitive in hiring and retaining data scientists.

As automation and machine learning grow increasingly sophisticated, however, there’s an opportunity for banks and insurance companies to harness the power of data science, without hiring formally trained data scientists. One such technology that embodies these innovations in automation is Apache Spark, which is poised to shift the paradigm of data science, allowing more and more enterprises to tap into insights culled from their own data.

Spark Disrupts & Democratizes Data Science

Data science requires three pillars of knowledge: statistical analysis, business intelligence and technological expertise. Spark does the technological heavy lifting, understanding and processing data at a scale that most people aren’t comfortable working with. It handles the distribution and categorization of the data, removing the burden from individuals and automating the process. By allowing enterprises to load data into clusters and query it on an ongoing basis, the platform is particularly adept at machine learning and automation – a crucial component in any system intended to analyze mass quantities of data.
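
As a minimal, hedged sketch of that “load it into a cluster and query it on an ongoing basis” workflow, the PySpark snippet below reads a shared dataset once and then answers ad-hoc questions with SQL; the file path and column names are invented for illustration.

```python
# Minimal PySpark sketch: load data once, then query it interactively.
# The path and schema are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("claims-exploration").getOrCreate()

# Load raw data from shared storage (HDFS, NFS-mounted or S3-compatible paths all work here).
claims = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///data/claims/2016/*.csv"))

# Register it as a table and ask ad-hoc questions without moving the data again.
claims.createOrReplaceTempView("claims")
spark.sql("""
    SELECT region, COUNT(*) AS claim_count, AVG(amount) AS avg_amount
    FROM claims
    GROUP BY region
    ORDER BY claim_count DESC
""").show()
```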

Spark was created in the labs of UC Berkeley and has quickly taken the analytics world by storm, with two main business propositions: the freedom to model data without hiring data scientists, and the power to leverage analytics models that are already built and ready for use in Spark today. The combination of these two attributes allows enterprises to gain speed on analytics endeavors with a modern, open-source technology.

The arrival of Spark signifies a world of possibility for companies that are hungry for the business value data science can provide but are finding it difficult to hire and keep deep analytic talent on board. The applications of Spark are seemingly endless, from cybersecurity and fraud detection to genomics modeling and actuarial analytics.
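
As one hedged example of the “already built and ready for use” models mentioned above, the sketch below fits Spark MLlib’s off-the-shelf logistic regression to labeled transactions and scores new ones for fraud; the column names and paths are assumptions, not a production pipeline.

```python
# Hedged sketch: using a model that ships with Spark (MLlib logistic regression)
# to score transactions for fraud. Column names and paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("fraud-scoring").getOrCreate()

# Historical transactions with a known fraud/not-fraud label.
txns = spark.read.parquet("hdfs:///data/transactions/labeled/")

# Assemble numeric features and fit the built-in classifier.
assembler = VectorAssembler(
    inputCols=["amount", "merchant_risk", "txn_hour", "distance_from_home"],
    outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="is_fraud")
model = Pipeline(stages=[assembler, lr]).fit(txns)

# Score new, unlabeled transactions; 'probability' holds P(not fraud), P(fraud).
scored = model.transform(spark.read.parquet("hdfs:///data/transactions/today/"))
scored.select("txn_id", "probability", "prediction").show(10, truncate=False)
```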

What Spark Means for Enterprises

Not only will Spark enable businesses to hire non-traditional data scientists, such as actuaries, to effectively perform the role, but it will also open a world of possibilities in terms of actual business strategy.

Banks, for example, have been clamoring for Spark from the get-go, in part because of Spark’s promise to help banks bring credit card authorizations back in-house. For over two decades, credit card authorizations have been outsourced, since it was more efficient and far less dicey to centralize the authorization process.

The incentive to bring this business back in-house is huge, however, with estimated cost savings of tens to hundreds of millions annually. With Spark, the authorization process could be automated in-house – a huge financial boon to banks. The adoption of Spark allows enterprises to effectively leverage data science and evolve their business strategies accordingly.
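
For a sense of what “automated in-house” could look like, here is a hedged Spark Structured Streaming sketch that scores authorization requests as they arrive against a model trained offline (such as the MLlib pipeline sketched earlier); the Kafka topic, message schema and model path are all illustrative assumptions.

```python
# Hypothetical sketch: scoring card authorizations as they stream in.
# Topic, broker, schema and model path are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("auth-decisions").getOrCreate()

schema = StructType([
    StructField("card_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("merchant_risk", DoubleType()),
    StructField("txn_hour", DoubleType()),
    StructField("distance_from_home", DoubleType()),
])

# Authorization requests arriving on a Kafka topic
# (requires the spark-sql-kafka connector on the classpath).
requests = (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "card-auth-requests")
            .load()
            .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
            .select("r.*"))

# A risk model trained offline is loaded and applied to requests in flight.
model = PipelineModel.load("hdfs:///models/auth-risk")
decisions = model.transform(requests).withColumn(
    "approved", F.col("prediction") == 0.0)

# Stream decisions back out for downstream systems to act on.
query = (decisions.selectExpr("card_id", "approved")
         .writeStream.format("console").outputMode("append").start())
query.awaitTermination()
```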

The Adoption of Spark & Hadoop

Moreover, Spark works seamlessly with the Hadoop distributions sitting on EMC’s storage platforms. As I noted in my last post, Hadoop adoption among enterprises has been incredible, and Hadoop is quickly becoming the de facto standard for storing and processing terabytes or even petabytes of data.

By leveraging Spark and existing Hadoop platforms in tandem, enterprises are well-prepared to solve the ever-increasing data and analytics challenges ahead.

Breakfast with ECS: Files Can’t Live in the Cloud? This Myth is BUSTED!

Welcome to another edition of Breakfast with ECS, a series where we take a look at issues related to cloud storage and ECS (Elastic Cloud Storage), EMC’s cloud-scale storage platform.

The trends toward increasing digitization of content and toward cloud-based storage have been driving a rapid increase in the use of object storage throughout the IT industry.  It may seem that all applications now use Web-accessible REST interfaces on top of cloud-based object storage, and new applications are indeed largely being designed with this model; in reality, however, file-based access models remain critical for a large proportion of existing IT workflows.

Given the shift in the IT industry towards object-based storage, why is file access still important?  There are several reasons, but they boil down to two fundamental ones:

  1. There exists a wealth of applications, both commercial and home-grown, that rely on file access, as it has been the dominant access paradigm for the past decade.
  2. It is not cost effective to update all of these applications and their workflows to use an object protocol. The data set managed by the application may not benefit from an object storage platform, or the file access semantics may be so deeply embedded in the application that the application would need a near rewrite to disentangle it from the file protocols.

What are the options?

The easiest option is to use a file-system protocol with an application that was designed with file access as its access paradigm.

ECS has supported file access natively since its inception, originally via its HDFS access method, and most recently via the NFS access method.  While HDFS lacks certain features of a true file system interface, the NFS access method has full support for applications, and NFS clients are a standard part of any OS platform, making NFS the logical choice for file-based application access.

Via NFS, applications gain access to the many benefits of ECS, including its scale-out performance, the ability to massively multi-thread reads and writes, industry-leading storage efficiencies, and multi-protocol access.  For example, data can be ingested from a legacy application via NFS while also being accessed over S3 by newer, mobile application clients, supporting next-generation workloads at a fraction of the cost of rearchitecting the complete application.
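
As a hedged illustration of that multi-protocol pattern, the Python sketch below writes a file through an NFS mount of an ECS bucket and then reads the same data back through the S3 API with boto3; the mount point, endpoint URL, bucket name and credentials are placeholder assumptions.

```python
# Hypothetical sketch: legacy NFS write, modern S3 read, one copy of the data on ECS.
# Mount point, endpoint, bucket and credentials are assumptions.
import boto3

NFS_MOUNT = "/mnt/ecs/app-data"                   # ECS bucket exported over NFS (assumed)
ECS_S3_ENDPOINT = "https://ecs.example.com:9021"  # ECS S3 endpoint (assumed)

# 1. The existing application keeps writing plain files over NFS.
with open(f"{NFS_MOUNT}/reports/2016-06-01.json", "w") as f:
    f.write('{"orders": 1342, "revenue": 58710.25}')

# 2. A newer mobile/web backend reads the same data as an S3 object.
s3 = boto3.client(
    "s3",
    endpoint_url=ECS_S3_ENDPOINT,
    aws_access_key_id="ECS_OBJECT_USER",        # placeholder credentials
    aws_secret_access_key="ECS_SECRET_KEY")
obj = s3.get_object(Bucket="app-data", Key="reports/2016-06-01.json")
print(obj["Body"].read().decode())
```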

Read the NFS on ECS Overview and Performance White Paper for a high-level summary of NFS version 3 support with ECS.

An alternative is to use a gateway or tiering solution to provide file access, such as CIFS-ECS, Isilon CloudPools, or third-party products like Panzura or Seven10.  However, if ECS supports direct file-system access, why would an external gateway ever be useful?  There are several reasons why this might make sense:

  • An external solution will typically support a broader range of protocols, including things like CIFS, NFSv4, FTP, or other protocols that may be needed in the application environment.
  • The application may be running in an environment where the access to the ECS is over a slow WAN link. A gateway will typically cache files locally, thereby shielding the applications from WAN limitations or outages while preserving the storage benefits of ECS.
  • A gateway may implement features like compression, which reduces WAN traffic to the ECS and provides direct cost savings on WAN transfer fees, or encryption, which provides an additional level of security for the data transfers.
  • While HTTP ports are typically open across corporate or data center firewalls, network ports for NAS (NFS, CIFS) protocols are normally blocked for external traffic. Some environments, therefore, may not allow direct file access to an ECS which is not in the local data center, though a gateway which provides file services locally and accesses ECS over HTTP would satisfy the corporate network policies.

So what’s the right answer?

There is no one right answer; instead, the correct answer will depend on the specifics of the environment and the characteristics of the application.

  • How close is the application to the ECS? File system protocols work well over LANs and less well over WANs.  For applications that are near the ECS, a gateway is an unnecessary additional hop on the data path, though gateways can give an application the experience of LAN-local traffic even for a remote ECS.
  • What are the application characteristics? For an application that makes many small changes to an individual file or a small set of files, a gateway can consolidate multiple such changes into a single write to ECS.  For applications that more generally write new files or update existing files with relatively large updates (e.g. rewriting a PowerPoint presentation), a gateway may not provide much benefit.
  • What is the future of the application? If the desire is to change the application architecture to a more modern paradigm, then files on ECS written via the file interface will continue to be accessible later as the application code is changed to use S3 or Swift.  Gateways, on the other hand, often write data to ECS in a proprietary format, thereby making the transition to direct ECS access via REST protocols more difficult.

As should be clear, there is no one right answer for all applications.  The flexibility of ECS, however, allows for some applications to use direct NFS access to ECS while other applications use a gateway, based on the characteristics of the individual applications.

If existing file-based workflows were the reason for not investigating the benefits of an ECS object-based solution, then rest assured that an ECS solution can address your file storage needs while still providing the many benefits of the industry’s premier object storage platform.

Want more ECS? Visit us at www.emc.com/ecs or try the latest version of ECS for FREE for non-production use by visiting www.emc.com/getecs.

Digital Strategies:  Are Analytics Disrupting the World?

Keith Manthey

CTO of Analytics at EMC Emerging Technologies Division

“Software is eating the world.”  It is a phrase we often see written, but sometimes do not fully understand.  More recently I have read derivations of that phrase positing that “analytics are disrupting the world.”  Both phrases hold a lot of truth.  But why? Some of the major disruptions of the last 5 years can be attributed to analytics.  Most companies that serve as an intermediary, such as Uber or AirBNB, with a business model built on making consumer and supplier “connections”, are driven by analytics.  Pricing surges, routing optimizations, available rentals, available drivers and so on are all algorithms to these “connection” businesses that are disrupting the world.  It could be argued that analytics is their secret weapon.

It is normal for startups to make new and sometimes crazy, risky investments in new technologies like Hadoop and analytics.  The trend is carrying over into traditional industries and established businesses as well.  What are the analytics use cases in industries like Financial Services (aka FSI)?

Established Analytics Plays in FSI

Two use cases naturally come to mind when I think of “Analytics” and “Financial Services”: high-frequency trading and fraud, both of which have long utilized analytics.  Both are fairly well respected and well documented with regard to their heavy use of analytics.  I myself blogged recently (From Kinetic to Synthetic) on behalf of Equifax regarding the market trends in Synthetic Fraud.  Beyond these obvious examples, though, where are analytics impacting the Financial Services industry?  What use cases are relevant and impacting the industry in 2016, and why?

Telematics

The insurance industry has been experimenting for several years with opt-in programs that monitor driving behavior.  Insurance companies have varying opinions of its usefulness, but it’s clear that driving behavior is (1) a heavy use of unstructured data and (2) a dramatic leap from the statistics-based approach built on financial data and actuarial tables.  Telematics is the name given to this set of opt-in, usage-based insurance and driver-monitoring programs.  Its use in insurance has fostered an approach long used in other verticals like fraud: pinning behavior down to an individual pattern instead of trying to predict broad swaths of patterns.  To be more precise, telematics looks to derive a “behavior of one” versus a “generalized driving pattern for 1,000 individuals.”  To see why this differs from past insurance practice, compare the two methods directly.  Method One uses historical actuarial tables of life expectancy along with demographic and financial data to denote risk.  Method Two asks how ONE individual drives, based on real driving data received from their car.  Which might be more predictive of the expected rate of accidents is the question for analytics.  While this is a gross over-simplification of the entire process, it is a radical shift in the types of data and the analytical methods for deriving value from the data available to the industry.  Truly transformational.
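
A hedged sketch of what deriving a “behavior of one” can look like in practice: raw telematics events are rolled up into a per-driver profile with PySpark, rather than fitting the driver to a population average. The field names and data location are illustrative assumptions.

```python
# Hedged sketch: rolling raw telematics events up into a per-driver risk profile.
# Field names and the data location are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("telematics-profiles").getOrCreate()

# One row per second of driving: driver, speed, braking force, time of day, miles, etc.
events = spark.read.parquet("hdfs:///data/telematics/events/")

# "Behavior of one": aggregate each driver's own record instead of assigning
# that driver a demographic average.
profiles = events.groupBy("driver_id").agg(
    F.avg("speed_mph").alias("avg_speed"),
    F.avg((F.col("brake_g") > 0.4).cast("double")).alias("hard_brake_rate"),
    F.avg(F.col("hour_of_day").isin(0, 1, 2, 3, 4).cast("double")).alias("late_night_share"),
    F.sum("miles").alias("miles_observed"),
)
profiles.show(5)
```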

Labor Arbitrage

The insurance industry has also been experimenting with analytics based on past performance data.  The industry has years of predictive information (i.e., claim reviews along with actual outcomes) from past claims.  By exploring this past performance data, insurance companies can apply logistic regression algorithms to derive weighted scores.  The derived scores are then analyzed to determine a path forward.  For example, if claims scoring greater than 50 were evaluated and then almost always paid by the insurer, then all scores above 50 should be immediately approved and paid.  The inverse is also true: treatments can be quickly rejected when they are rarely appealed, or are regularly turned down under review when they are appealed.  The analytics of the present case are compared against the outcomes in the corpus of past performance data to derive the most likely outcome of the case.  The resulting business effect is that the workforce reviewing medical claims is only given those files that actually need to be worked, and the result is better workforce productivity.  Labor arbitrage, with data and analytics as the disruptor of workforce trends.
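
Here is a hedged, toy-scale sketch of that score-and-route idea using scikit-learn’s logistic regression; the features, the 0–100 score scaling and the thresholds are illustrative assumptions rather than any insurer’s actual model.

```python
# Hedged sketch: score claims with logistic regression trained on past outcomes,
# then route only the ambiguous middle band to human reviewers.
# Features, scaling and thresholds are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Historical claims: [claim_amount, provider_history_score, prior_claims] and
# whether each was ultimately paid (1) or denied (0) after review.
X_history = np.array([[1200, 0.9, 0], [8000, 0.4, 3], [450, 0.95, 1], [15000, 0.2, 5]])
y_history = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X_history, y_history)

def route_claim(features, approve_above=50, deny_below=20):
    """Convert P(paid) to a 0-100 score and route the claim."""
    score = 100 * model.predict_proba([features])[0, 1]
    if score > approve_above:
        return score, "auto-approve and pay"
    if score < deny_below:
        return score, "auto-deny"
    return score, "send to a human reviewer"

print(route_claim([900, 0.85, 0]))     # likely auto-approved
print(route_claim([12000, 0.3, 4]))    # likely auto-denied or reviewed
```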

Know Your Customer

Retail banking has turned to analytics as banks focus on attracting and retaining their customers.   After a large wave of acquisitions in the last decade, retail banks are working to integrate their various portfolios.  In some cases, resolving the identity of all their clients across all their accounts isn’t as straightforward as it sounds.  This is especially hard with dormant accounts that might carry maiden names, mangled data attributes, or old addresses.  The ultimate goal of co-locating all this customer data in an analytics environment is a customer 360.  Customer 360 is mainly focused on gaining full insight into a customer.  This can lead to upsell opportunities by understanding a customer’s peer set and which products a similar demographic has a strong interest in. For example, if individuals of a given demographic typically subscribe to 3 of a company’s 5 products, an individual matching that demographic who subscribes to only 1 product should be targeted for upsell on the additional products.  This uses large swathes of data and a company’s own product adoption history to build upsell and marketing strategies for its own customers.  If someone was both a small business owner and a personal consumer of the retail bank, the bank may not have previously tied those accounts together.  A customer 360 gives the bank a whole new perspective on who its customer base really is.
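
A hedged, toy-scale sketch of that peer-based upsell logic in pandas: measure product adoption within each demographic segment, then flag the products a customer lacks but that most peers hold. The columns and the 60 percent threshold are illustrative assumptions.

```python
# Hedged sketch: peer-based upsell targets from a customer/product ownership table.
# Columns and the 60% adoption threshold are illustrative assumptions.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "segment": ["small_biz", "small_biz", "small_biz", "retail", "retail"],
    "checking": [1, 1, 1, 1, 1],
    "credit_card": [1, 1, 0, 1, 0],
    "merchant_services": [1, 1, 0, 0, 0],
})
products = ["checking", "credit_card", "merchant_services"]

# Adoption rate of each product within each demographic segment.
adoption = customers.groupby("segment")[products].mean()

# Recommend any product the customer lacks but >=60% of their segment holds.
def upsell_targets(row, threshold=0.6):
    peer_rates = adoption.loc[row["segment"]]
    return [p for p in products if row[p] == 0 and peer_rates[p] >= threshold]

customers["upsell"] = customers.apply(upsell_targets, axis=1)
print(customers[["customer_id", "segment", "upsell"]])
```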

Wrap Up

Why are these trends interesting?  In most of the cases above, people are familiar with certain portions of the story, but the underlying why or what often gets missed.  It is important to understand not only the technology and capabilities involved in a transformation, but also the underlying shift being caused. EMC has a long history of helping customers through these journeys, and we look forward to helping even more clients face them.

Hadoop Grows Up: How Enterprises Can Successfully Navigate its Growing Pains

Keith Manthey

CTO of Analytics at EMC Emerging Technologies Division

If you’d asked me 10 years ago whether enterprises would be migrating to Hadoop, I would’ve answered with an emphatic no. Slow to entice enterprise customers and named after a toy elephant, at first glance, the framework didn’t suggest it was ready for mass commercialization or adoption.

But the adoption of Hadoop among enterprises has been phenomenal. With its open-source software framework, Hadoop provides enterprises with the ability to process and store unprecedented volumes of data – a capability today’s enterprise sorely needs – effectively becoming today’s default standard for storing, processing and analyzing mass quantities of data, from hundreds of terabytes up to petabytes.

While the adoption and commercialization of Hadoop is remarkable and an overall positive move for enterprises hungry for streamlined data storage and processing, enterprises are in for a significant challenge with the migration from Hadoop 2.0 to 3.X.

Most aren’t sure what to expect, and few experienced the earlier migration’s pain points. Though Hadoop has “grown up”, in that it is now used by some of the world’s largest enterprises, it still hasn’t offered a non-disruptive path for jumping major releases.

Happening in just a few short years, this next migration will have dramatic implications for the storage capabilities of today’s insurance companies, banks and largest corporations. It’s imperative that these organizations begin planning for the change now to ensure that their most valuable asset—their data—remains intact and accessible in an “always on” culture that demands it.

Why the Migration Matters

First, let’s explore the significant benefits of the migration and why, despite the headaches, this conversion will ultimately be beneficial for enterprises.

One of the key benefits of Hadoop 3.X is erasure coding, which will dramatically decrease the amount of storage needed to protect data. In a more traditional system, files are replicated multiple times in order to protect against loss. If one file becomes lost or corrupted, its replica can easily be summoned in place of the original file or datum.

As you can imagine, replicating data to shield against failure requires significant volumes of storage and is expensive. In fact, default replication requires an additional 200 percent in storage space, along with other resources such as network bandwidth when writing the data.

Hadoop 3.X’s move to erasure coding resolves the storage issue while maintaining the same level of fault tolerance. In other words, erasure coding protects data as effectively as traditional replication but takes up far less storage. In fact, erasure coding is estimated to reduce the storage cost by 50 percent – a huge financial boon for enterprises moving to Hadoop 3.X. With Hadoop 3.X, enterprises will be able to store twice as much data on the same amount of raw storage hardware.
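
The storage arithmetic behind those numbers is easy to check; the short sketch below compares default 3x replication with a Reed-Solomon 6+3 erasure-coding layout (the scheme commonly cited for Hadoop 3.X), which is where the roughly 50 percent saving comes from.

```python
# Worked arithmetic: raw storage needed to hold 100 TB of data.
# Assumes default 3x replication vs. a Reed-Solomon 6+3 erasure-coding layout.
logical_tb = 100

# Replication: every block stored 3 times -> 200% overhead.
replicated_raw = logical_tb * 3          # 300 TB raw

# Erasure coding: 6 data + 3 parity blocks -> 50% overhead.
ec_raw = logical_tb * (6 + 3) / 6        # 150 TB raw

print(f"3x replication : {replicated_raw:.0f} TB raw ({replicated_raw/logical_tb - 1:.0%} overhead)")
print(f"RS(6,3) coding : {ec_raw:.0f} TB raw ({ec_raw/logical_tb - 1:.0%} overhead)")
print(f"Raw storage saved: {1 - ec_raw/replicated_raw:.0%}")   # 50%
```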

That being said, enterprises updating to Hadoop 3.X will face significant roadblocks to ensure that their data remains accessible and intact during a complicated migration process.

Anticipating Challenges Ahead

For those of us who experienced the conversion from Hadoop 1.X to Hadoop 2.X, it was a harrowing one, requiring a complete unload of the Hadoop environment’s data and a complete re-load onto the new system. That meant long periods of data inaccessibility and, in some cases, data loss. Take a typical laptop upgrade and multiply the pain points a thousand-fold.

Data loss is no longer a tolerable scenario for today’s enterprises and can have huge financial, not to mention reputational, implications. However, most enterprises adopted Hadoop after its last revamp, sparing themselves the headaches associated with major upgrades involving data storage and processing. These enterprises may not anticipate the challenges ahead.

The looming migration can have potentially dire implications for today’s enterprises. A complete unload and re-load of enterprises’ data will be expensive, painful and fraught with data loss. Without anticipating the headaches in store for the upcoming migration, enterprises may forego the necessary measures to ensure the accessibility, security and protection of their data.

Navigating the Migration Successfully

The good news is that there is a simple, actionable step enterprises can take to manage migration and safeguard their data against loss, corruption and inaccessibility.

Enterprises need to ensure that their current system does not require a complete unload and reload of their data. Most systems do require a complete unload and reload, so it is crucial that enterprises understand their current system and its capabilities when it comes to the next Hadoop migration.

If the enterprise were on Isilon for Hadoop, for example, there would be no need to unload and re-load its data. The enterprise would simply point the newly upgraded computer nodes to Isilon, with limited downtime, no re-load time and no risk for data loss.

Isilon for Hadoop helps enterprises ensure the accessibility and protection of their data through the migration process to an even stronger, more efficient Hadoop 3.X. While I’m eager for the next revamp of Hadoop and its tremendous storage improvements, today’s enterprises need to take precautionary measures before the jump to protect their data and ensure the transition is as seamless as possible.
