Posts Tagged ‘HDFS’

Breakfast with ECS: Files Can’t Live in the Cloud? This Myth is BUSTED!

Welcome to another edition of Breakfast with ECS, a series where we take a look at issues related to cloud storage and ECS (Elastic Cloud Storage), EMC’s cloud-scale storage platform.

The trends toward increasing digitization of content and toward cloud-based storage have been driving a rapid increase in the use of object storage throughout the IT industry.  It may seem that every application now uses a Web-accessible REST interface on top of cloud-based object storage, and new applications are indeed largely designed with this model, but file-based access remains critical for a large proportion of existing IT workflows.

Given the shift in the IT industry towards object based storage, why is file access still important?  There are several reasons, but they boil down to two fundamental ones:

  1. There exists a wealth of applications, both commercial and home-grown, that rely on file access, as it has been the dominant access paradigm for the past decade.
  2. It is not cost effective to update all of these applications and their workflows to use an object protocol. The data set managed by the application may not benefit from an object storage platform, or the file access semantics may be so deeply embedded in the application that the application would need a near rewrite to disentangle it from the file protocols.

What are the options?

The easiest option is to use a file-system protocol with an application that was designed with file access as its access paradigm.

ECS has supported file access natively since its inception, originally via its HDFS access method and most recently via the NFS access method.  While HDFS lacks certain features of a true file system interface, the NFS access method fully supports file applications, and NFS clients are a standard part of any OS platform, making NFS the logical choice for file-based application access.

Via NFS, applications gain access to the many benefits of ECS, including its scale-out performance, the ability to massively multi-thread reads and writes, industry-leading storage efficiency, and multi-protocol access.  For example, data can be ingested from a legacy application via NFS while also being served over S3 to newer mobile clients, supporting next-generation workloads at a fraction of the cost of rearchitecting the complete application.
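As a rough sketch of that multi-protocol pattern (the mount point, bucket name, endpoint, and credentials below are hypothetical placeholders, not real ECS settings), a legacy application could write through an NFS mount while a newer client reads the same data back as an S3 object:

```python
# Minimal multi-protocol sketch: all names below are assumed placeholders.
import boto3

NFS_MOUNT = "/mnt/ecs-bucket"                      # assumed NFS export mounted on the app host
ECS_S3_ENDPOINT = "https://ecs.example.com:9021"   # assumed ECS S3 endpoint
BUCKET = "legacy-data"                             # assumed bucket exposed over both NFS and S3

# Legacy application path: write a file through the NFS mount.
with open(f"{NFS_MOUNT}/orders/order-1001.csv", "w") as f:
    f.write("order_id,amount\n1001,42.50\n")

# Modern application path: read the same data back as an S3 object.
s3 = boto3.client(
    "s3",
    endpoint_url=ECS_S3_ENDPOINT,
    aws_access_key_id="OBJECT_USER",       # placeholder credentials
    aws_secret_access_key="SECRET_KEY",
)
obj = s3.get_object(Bucket=BUCKET, Key="orders/order-1001.csv")
print(obj["Body"].read().decode())
```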

Read the NFS on ECS Overview and Performance White Paper for a high level summary of version 3 of NFS with ECS.

An alternative is to use a gateway or tiering solution to provide file access, such as CIFS-ECS, Isilon CloudPools, or third-party products like Panzura or Seven10.  However, if ECS supports direct file-system access, why would an external gateway ever be useful?  There are several reasons why this might make sense:

  • An external solution will typically support a broader range of protocols, such as CIFS, NFSv4, FTP, or other protocols that may be needed in the application environment.
  • The application may be running in an environment where the access to the ECS is over a slow WAN link. A gateway will typically cache files locally, thereby shielding the applications from WAN limitations or outages while preserving the storage benefits of ECS.
  • A gateway may implement features such as compression, which reduces WAN traffic to the ECS and provides direct cost savings on WAN transfer fees, or encryption, which provides an additional level of security for data transfers.
  • While HTTP ports are typically open across corporate or data center firewalls, network ports for NAS (NFS, CIFS) protocols are normally blocked for external traffic. Some environments, therefore, may not allow direct file access to an ECS which is not in the local data center, though a gateway which provides file services locally and accesses ECS over HTTP would satisfy the corporate network policies.

So what’s the right answer?

There is no one right answer; instead, the correct answer depends on the specifics of the environment and the characteristics of the application.

  • How close is the application to the ECS? File system protocols work well over LANs and less well over WANs.  For applications that are near the ECS, a gateway is an unnecessary additional hop on the data path, though gateways can give an application the experience of LAN-local traffic even for a remote ECS.
  • What are the application characteristics? For an application that makes many small changes to an individual file or a small set of files, a gateway can consolidate multiple such changes into a single write to ECS.  For applications that more generally write new files or update existing files with relatively large updates (e.g. rewriting a PowerPoint presentation), a gateway may not provide much benefit.
  • What is the future of the application? If the desire is to change the application architecture to a more modern paradigm, then files on ECS written via the file interface will continue to be accessible later as the application code is changed to use S3 or Swift.  Gateways, on the other hand, often write data to ECS in a proprietary format, thereby making the transition to direct ECS access via REST protocols more difficult.

As should be clear, there is no one right answer for all applications.  The flexibility of ECS, however, allows for some applications to use direct NFS access to ECS while other applications use a gateway, based on the characteristics of the individual applications.

If existing file-based workflows were the reason for not investigating the benefits of an ECS object-based solution, then rest assured that an ECS solution can address your file storage needs while still providing the many benefits of the industry’s premier object storage platform.

Want more ECS? Visit us at www.emc.com/ecs or try the latest version of ECS for FREE for non-production use by visiting www.emc.com/getecs.

Storage Transformation Demands New Thinking (When it Comes to Software-Defined Storage, if it Walks like a Duck and Quacks like a Duck, It Still Might be a Pig!)

We live in interesting times right now in the storage business. What was once considered a “boring” sector of IT is now hot again. We have new vendors entering the market at a furious pace, trying to gain position in all-flash, flash attach, and software-defined storage. We also have traditional storage incumbents looking to box out the new entrants through different combinations of product re-brands, acquisitions, and partnerships.

The new vendor entrants are the most fun to watch, in my opinion. Unencumbered by installed bases or legacy technology (or politics!), they are free to try new approaches to the long-standing issues and roadblocks that always emerge as technology matures. Some new players have truly unique and interesting solutions; others have only marketing spin.

Watching some of the traditional storage vendors try to counter these new offerings is generally quite amusing, and in some cases just plain sad. They trot out technology that has been around for years, declaring it to be Software-Defined, Cloud-ready, or whatever they think will make them most relevant. The most common response I see is the re-brand. You know the drill: product XYZ was our storage virtualization/storage OS product for years, but now it’s called product ZYX and it’s software-defined storage because we dropped the hardware requirement! So it’s now Software-Defined Storage (SDS)?

It all just serves to remind me why I work where I do. One of the great things about working for EMC is the company’s ability to blend the innovation and enthusiasm of a startup with our traditional storage business. My group, the Advanced Software Division, is a great example of this. EMC looked out over the storage landscape some years ago and made a pretty bold bet. They did not choose to re-purpose and re-brand existing technology. Rather, they went outside the box (literally, outside the company walls) and hired Amitabh Srivastava to build something new from the ground up. Now, Amitabh was building cloud storage in his last gig, so he has been in on this SDS, cloud-ready stuff for a while. EMC was listening to our customers tell us they needed a new approach, and that’s what we went out and did, starting from scratch to develop a solution that could help customers transition to the next storage generation.

The product that we developed was, of course, ViPR, and it has been fun to watch the impact it has had on the storage market in the year it’s been around. First our competitors said it was vaporware and would never ship; then ViPR shipped. Next they witnessed the traction we received in the press, and they all said that they have SDS too, it just wasn’t called SDS; uh-huh. Then they said ViPR only works with EMC hardware; well, yeah, it does (kinda not smart of us to leave that out), but as we demonstrate in ViPR 2.0, it works with their stuff too.

So now we are at the point in the compete cycle that I call the “bundle” phase. That’s when you realize you can’t compete very well head to head, so you start adding other stuff off the truck: it’s software-defined storage plus game show answers! Or it’s our old file system plus our old management tools! You just didn’t recognize it as SDS all these years (bad customer! No discounts for you!). Now, I am not trying to throw stones at glass houses. All mature IT vendors have had to deal with something like this at some point in time, and I am sure that EMC has been guilty of the same at some point. But not this time.

My point here is to note the impact ViPR has had on the storage market in the year that it has been around: storage stalwarts like IBM and HDS are calling it out by name, and new startups are gunning for it. I think it is refreshing to see an incumbent IT company that is willing to take a different approach, something out of the norm, to solve a customer problem. As for the other vendors in the storage market: well, you had better buckle your chinstraps, people. ViPR became generally available in September of 2013; we added HDFS in January 2014, then ScaleIO (block), geo-distribution/redundancy, and deeper OpenStack integration in June 2014. Sensing a trend? I am looking forward to the next year and beyond to see if these competitors can keep re-spinning, re-branding, etc. quickly enough to keep up with the ViPR development teams.

Looking Back to Get Ahead Using ‘Divide and Conquer’

While my last blogs encouraged taking advantage of new technologies and not being constrained by “how we’ve always done things”, for this blog I’ll emphasize the wisdom that can come from looking backwards.

Divide and conquer is an old strategy that has been applied effectively to many situations and problems. In politics, splitting the opposition by offering deals that appeal to individual groups or subsets of it can enable successful policy implementations that would otherwise have united the opposition and prevented progress. Militarily, victory can be achieved by avoiding an enemy’s combined strength, engaging the enemy repeatedly in smaller battles and whittling down its fighting capacity.

It is not often that politics, warfare, computing, and storage all intersect, but in this case, leveraging an age-old strategy can help us gain insights into today’s seemingly intractable problems.

Divide and conquer has been used previously in computer science, notably in the realm of recursive problem solutions. The efficient sorting algorithm quicksort is one such example, where the original input is split into two chunks, one chunk of all elements less than or equal to a certain value, and another chunk of elements greater than that same value. As neither chunk is sorted, it might appear that no progress has been made, but instead of one large sorting problem, there are now two smaller, independent sorting problems. By repeating this approach on the two chunks, the sorting problem can eventually be reduced to a size that is trivial to process, and by combining all the results, the original, seemingly intractable problem, has been solved. If the subchunks can each be processed by independent processors, this can unleash a high degree of parallelism and enable a far faster sorting result than other sorting techniques.
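To make the recursion concrete, here is a minimal quicksort sketch in Python. It builds new lists for readability rather than partitioning in place, as a production implementation would:

```python
# A minimal divide-and-conquer quicksort sketch (illustrative, not the
# in-place partitioning used by production sorts).
def quicksort(items):
    # Base case: a list of zero or one elements is already sorted.
    if len(items) <= 1:
        return items
    pivot = items[0]
    rest = items[1:]
    # Divide: split into two independent, smaller sorting problems.
    smaller = [x for x in rest if x <= pivot]
    larger = [x for x in rest if x > pivot]
    # Conquer each chunk recursively, then combine the results.
    return quicksort(smaller) + [pivot] + quicksort(larger)

print(quicksort([7, 2, 9, 4, 4, 1, 8]))  # [1, 2, 4, 4, 7, 8, 9]
```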

The Hadoop infrastructure for analytics is built on this simple premise. In a Hadoop deployment, multiple commodity nodes are clustered together in a shared nothing environment (i.e. each node can directly access only its own storage, and storage on other nodes is only accessible via the network). A large data set is written to the Hadoop environment, typically copied from an online transaction processing system. Within the Hadoop environment, the task is processed as a series of independent “map” jobs, which process a small chunk of the data, purely local to a node, but with many such “map” jobs running concurrently (the “divide” part of divide and conquer). The final results are then combined together in a “reduce” phase, which combines all the smaller results together to produce the final output (e.g. combining all the sorted subchunks from a quicksort algorithm to produce the entire sorted list).
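As a toy, single-process sketch of that shape in Python (Hadoop would distribute the map tasks across nodes; here the made-up chunks are simply processed in a loop):

```python
# Single-process sketch of the map/reduce shape: each chunk is mapped
# independently, then the partial results are merged in a reduce step.
from collections import Counter

chunks = [
    "ecs vipr ecs isilon",
    "vipr ecs scaleio",
    "isilon ecs vipr vipr",
]

def map_chunk(chunk):
    # "Map" phase: count words in one chunk, independent of all others.
    return Counter(chunk.split())

def reduce_counts(partials):
    # "Reduce" phase: merge the per-chunk counts into a final result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

partial_counts = [map_chunk(c) for c in chunks]
print(reduce_counts(partial_counts))
# Counter({'ecs': 4, 'vipr': 4, 'isilon': 2, 'scaleio': 1})
```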

The Hadoop style of processing is an elegant solution to the problem of making sense of today’s reams of data and translating them into useful information. However, a typical implementation treats the Hadoop system as yet another storage system: a cluster of nodes storing protected copies of the data.  The Hadoop environment is optimized for batch processing of the data, rather than for normal data access, and scales most effectively when the size of each processing unit is measured in the tens to hundreds of megabytes or larger, often requiring multiple data items to be combined for Hadoop processing rather than allowing the natural data size to be used.

The description of a Hadoop cluster should sound somewhat familiar: commodity nodes, each with its own disk farm, connected via a high-speed network, are exactly the components of a modern object storage system.  However, object storage systems are optimized for both batch and interactive data access, are available to ingest and store both active, online data and older archive data, and are engineered to avoid bottlenecks when storing any size of object, even objects as small as individual purchase records of perhaps a few kilobytes.

The recent ViPR 1.1 release leverages the commonality between the world of Hadoop and the world of object storage to deliver a solution which combines the best of both. The object storage platform delivers a highly scalable storage environment that provides high-speed access to all data, regardless of its natural size, and without the need to copy the data from the object store to a secondary storage platform. A Hadoop file system (HDFS) implementation has been layered on top of the object store, providing the enabling mechanisms for the Hadoop framework to identify where each data object is stored and to run the “map” jobs in a highly efficient manner, as local as possible to the object. There is no additional storage infrastructure to manage, and the data can be viewed either through an HDFS lens or via an object lens, depending on the needs of the moment.

As we continue to rethink storage, there will be more such opportunities to combine old ideas with new technologies to produce real value for today’s customers. Hadoop is a novel application of the tried and true “divide and conquer” strategy, and, when combined with the new storage paradigm of a scale-out object storage system, produces an analytical framework that avoids the unnecessary overheads of a dedicated analytics cluster and the unnecessary costs of transferring and reformatting data.

Understanding Hadoop, HDFS, and What That Means to Big Data

Amrita Sadhukhan

Over the past few years, usage of the Internet has increased massively.  People are accessing email, using social networking sites, writing blogs, and creating websites.  As a result, petabytes of data are generated every moment.  Enterprises today are trying to derive meaningful information from this data and convert it into business value as well as features and functionality for their various products.

Huge volumes of a great variety of data, both structured and unstructured, are being generated at an unprecedented velocity and in many respects, that is the easy part!  It is the “gather, filter, derive and translate” part that has most organizations tied up in knots.  This is the genesis of today’s focus on Big Data solutions.

Previously we have used traditional enterprise architectures consisting of one or more servers, storage arrays, and storage area networks connecting the servers and the storage arrays.  This architecture was built for compute-intensive applications that require a lot of processing cycles but operate mostly on a small subset of the application data.  In the era of Big Data, petabytes of data are being generated every day, and our customers want to sort and derive business value from them.  In fact, the amount of Big Data generated every day must be sorted first in order to be analyzed.  This is a massively data-intensive operation.  Handling this volume of data in a manner that meets the performance requirements of the business drives us towards a Clustered Architecture.  Clustered Architectures typically consist of a set of simple, basic components deployed in the hundreds or thousands.  The computational capability of each component may be modest, but as a scalable group they can perform massive amounts of compute- and data-intensive work efficiently.

The problem with Clustered Architectures is that, when we are dealing with hundreds or thousands of nodes, nodes can and will fail, and the larger the number of nodes, the more often it will happen.  So the software managing the cluster and the applications running on it must detect and respond to those failures in an automated and efficient manner.  Moreover, data needs to be efficiently distributed among the nodes, and in many cases petabytes of data need to be replicated as well.

This has been the driving force behind Hadoop.

Let’s think of a simple problem.  People all over the world have written blogs about various EMC products and we have aggregated all of them.  The aggregated file is now petabytes in size and we want to find out something simple like how many times each EMC product is mentioned in the blogs.  This looks difficult at first blush, but using Hadoop we can get the data we need in a very efficient manner.

A file may be petabytes in size, but at its most basic level it still consists of a number of blocks.  Using the Hadoop Distributed File System (HDFS), we store these blocks across a cluster consisting of hundreds of nodes, replicating each block a certain number of times across the nodes.  HDFS takes care of fault tolerance: if any node fails, it can automatically assign the work to another node.
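To give a rough sense of the scale involved (the 128 MB block size and replication factor of 3 below are assumed common defaults; both are configurable per cluster), a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope HDFS block math; block size and replication factor
# are assumptions, not fixed properties of any particular cluster.
FILE_SIZE_BYTES = 1 * 1024**5        # a 1 PB aggregated blog file
BLOCK_SIZE_BYTES = 128 * 1024**2     # assumed 128 MB HDFS block size
REPLICATION_FACTOR = 3               # assumed replication factor

blocks = -(-FILE_SIZE_BYTES // BLOCK_SIZE_BYTES)   # ceiling division
replicas = blocks * REPLICATION_FACTOR

print(f"{blocks:,} blocks, {replicas:,} block replicas spread across the cluster")
# 8,388,608 blocks, 25,165,824 block replicas
```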

Now it uses MapReduce, which consists of two phases, the Mapper phase and the Reduction phase.  In the Mapper phase, each participating node gets a pointer to the Mapper function and the addresses of the file blocks located on that node.  In our example, we need to find out how many times each EMC product is mentioned in the file, so the problem is solved on each node using only the file blocks residing on that node.  Each Mapper outputs <key, value> pairs; in our case these will be <emc_product, number of times it appears in the file blocks>.  In the Reduction phase we consolidate all those <key, value> pairs and aggregate them to find out how many times each EMC product is mentioned in the file as a whole.
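As a concrete sketch of the two phases, assuming Hadoop Streaming and a hypothetical list of product names, the Mapper and Reducer could be two small Python scripts that read from standard input:

```python
# mapper.py -- a Hadoop Streaming style mapper sketch (the product list is
# hypothetical; a real job feeds this script the file blocks on its node).
import sys

PRODUCTS = {"ecs", "vipr", "isilon", "scaleio"}  # assumed product names

for line in sys.stdin:
    for word in line.lower().split():
        if word in PRODUCTS:
            # Emit a <key, value> pair for every product mention.
            print(f"{word}\t1")
```

```python
# reducer.py -- aggregates the <key, value> pairs emitted by the mappers.
# Hadoop Streaming delivers the pairs sorted by key on standard input.
import sys
from collections import defaultdict

totals = defaultdict(int)
for line in sys.stdin:
    product, count = line.rstrip("\n").split("\t")
    totals[product] += int(count)

for product, count in sorted(totals.items()):
    print(f"{product}\t{count}")
```

A real job would match product names more carefully and let Hadoop schedule one Mapper per file block; the point is simply that each script sees only its own slice of the data.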

Doug Cutting created Hadoop after being inspired by two Google papers, one on the Google File System and one on Google MapReduce.  He named it after his child’s favorite toy, a stuffed elephant.

The same issues that companies like Google and Yahoo faced in the early 2000s are now being faced by many enterprises.  According to Yahoo, by the second half of the decade, 50% of enterprise data will be processed and stored using Hadoop¹.

¹ Source:  The Enterprise of Hadoop, Internet Research Group, November 2011
