When Big Data hits the data centre!

Steve Jenkins, VP for EMEA at MapR Technologies and subject matter expert on Hadoop and Big Data technologies, provides a candid “need to know” around the major challenges that Big Data poses for the modern data centre environment.


THE SHEER VARIETY, velocity and volume of data associated with many Big Data projects is forcing data centre administrators to evaluate infrastructure architecture in new ways. Before deploying a Big Data solution, data centre administrators need to understand some fundamentals around how a solution operates to successfully design, build and manage a suitable new underlying infrastructure.

Of the several Big Data solutions that have emerged in the last few years, none has experienced the rapid adoption of Hadoop.

Google pioneered the distributed processing framework called MapReduce, which distributes data, breaks jobs into tasks, and processes these tasks in parallel across large distributed clusters of lower-end servers. Google published a paper in 2004 that spawned an open source implementation called Apache Hadoop. The Hadoop community has grown dramatically and is producing enhancements and innovations that expand its usability within the enterprise.
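To make the model concrete, the sketch below imitates the three MapReduce phases, map, shuffle and reduce, for a simple word count on a single machine in plain Python. It is purely illustrative: a real Hadoop job packages the same map and reduce logic as tasks and ships them out to the nodes that already hold the data.

```python
from collections import defaultdict

# Illustrative stand-in for the input splits Hadoop would read from storage blocks.
input_splits = [
    "big data hits the data centre",
    "hadoop distributes data and processing",
    "data centre administrators evaluate hadoop",
]

def map_phase(split):
    """Map: emit (key, value) pairs -- here, (word, 1) for every word in a split."""
    for word in split.split():
        yield word, 1

def shuffle(mapped_pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the grouped values -- here, sum the counts per word."""
    return key, sum(values)

# Run the "job" locally: map each split, shuffle, then reduce each key.
mapped = (pair for split in input_splits for pair in map_phase(split))
grouped = shuffle(mapped)
results = dict(reduce_phase(key, values) for key, values in grouped.items())
print(results)
```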

As Hadoop enters the enterprise data centre, IT administrators need to review and analyse its architecture from several aspects. Hadoop represents a paradigm shift in how data is processed and is distinctly different from analytical platforms or traditional data warehouses. IT administrators should evaluate Hadoop platforms against the same standards applied to other enterprise software products deployed in the data centre.

These include criteria for meeting SLAs, protecting data, disaster recovery, securing access, and ease of administration. IT administrators should also assume that, now that Hadoop has been established as a viable and valuable business analytics platform, the number of business groups wanting Hadoop-based applications will continue to grow within the enterprise.

Architecture matters
A typical Hadoop installation consists of a number of nodes, collectively called a cluster. In most Hadoop solutions, the majority of these nodes combine data processing and storage, while a smaller number run services that provide cluster coordination and management. An advanced architecture eliminates special-purpose nodes and distributes coordination and management across the data nodes in the cluster. This highly distributed architecture provides better performance, scale and recovery.
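As a rough illustration of the two layouts described above, the hypothetical sketch below plans roles for a small cluster: a conventional layout with dedicated coordination nodes, and an advanced layout where coordination is spread across the data nodes themselves. The node count and role names are illustrative assumptions, not a prescription from any particular distribution.

```python
# Hypothetical role planning for a 12-node cluster (illustrative assumptions only).
NODES = [f"node{i:02d}" for i in range(1, 13)]

def conventional_layout(nodes):
    """Dedicated coordination/management nodes; the rest combine compute and storage."""
    layout = {}
    for n in nodes[:2]:
        layout[n] = ["coordination", "management"]
    for n in nodes[2:]:
        layout[n] = ["data", "processing"]
    return layout

def distributed_layout(nodes):
    """Advanced layout: every node stores and processes data; coordination is
    replicated across a subset of the same nodes, so no special-purpose nodes exist."""
    layout = {n: ["data", "processing"] for n in nodes}
    for n in nodes[:3]:  # coordination services replicated on three data nodes
        layout[n].append("coordination")
    return layout

print(conventional_layout(NODES))
print(distributed_layout(NODES))
```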

One of the fundamental differences for Hadoop-based Big Data projects is that data and processing reside on the same servers. Hadoop is a framework that enables this processing to take place across large clusters of commodity hardware with direct-attached storage. There are several reference architectures that architects use to configure Hadoop nodes and the network infrastructure that goes with them, and there are benchmarks, such as MinuteSort, TeraSort and YCSB (the Yahoo! Cloud Serving Benchmark), used to compare the performance of Hadoop distributions.
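As one example, TeraSort ships with the standard Hadoop examples jar and is commonly run in three steps: generate the data, sort it, validate the result. The sketch below wraps those steps with Python's subprocess module; the jar path, row count and HDFS directories are assumptions that depend on the specific installation.

```python
import subprocess

# Assumed location of the examples jar -- adjust for your Hadoop version/install.
EXAMPLES_JAR = "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar"
ROWS = 10_000_000  # 100-byte rows, so roughly 1 GB of generated data

def run(*args):
    """Run a hadoop examples command and fail loudly on a non-zero exit code."""
    subprocess.run(["hadoop", "jar", EXAMPLES_JAR, *args], check=True)

# 1. Generate the input data set in the cluster file system.
run("teragen", str(ROWS), "/benchmarks/terasort-input")
# 2. Sort it with a full MapReduce job across the cluster.
run("terasort", "/benchmarks/terasort-input", "/benchmarks/terasort-output")
# 3. Validate that the output is globally sorted.
run("teravalidate", "/benchmarks/terasort-output", "/benchmarks/terasort-validate")
```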

All distributions of Hadoop typically support deployment on standard Linux distributions such as CentOS, Ubuntu and Red Hat on any x86_64 server. Typical hardware nodes contain two processors with 12 or 16 cores in total and 64 GB to 128 GB of RAM. Each node will have anywhere from 4 TB to 36 TB of disk storage, spread across 4 to 24 disk spindles of varying sizes. Both SATA and SAS disks are commonly used.
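Those figures translate into a simple first-order capacity estimate. The sketch below uses assumed values, raw dataset size, three-way replication of the kind HDFS uses by default, and some headroom for intermediate output, together with the per-node disk range quoted above; real sizing also has to account for compression, data growth and compute requirements.

```python
def estimate_nodes(dataset_tb, per_node_raw_tb, replication=3, headroom=0.25):
    """First-order node count: replicate the data, add working-space headroom,
    then divide by the raw disk available per node (illustrative assumptions)."""
    required_tb = dataset_tb * replication * (1 + headroom)
    return int(-(-required_tb // per_node_raw_tb))  # ceiling division

# 100 TB of source data on nodes at the low, middle and high end of the quoted range.
for per_node_tb in (4, 12, 36):
    print(f"{per_node_tb:>2} TB/node -> ~{estimate_nodes(100, per_node_tb)} nodes")
```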

Hadoop leverages direct-attached storage for all nodes, so no external storage arrays, and hence no storage fabric, are required. Cluster nodes should be connected with multiple network interfaces for reliability; 1 Gb or 10 Gb interconnects work well. Multi-rack deployments typically perform better when the top-of-rack switches are powerful enough to support a rate of 2 Gb/sec for each node in the rack, so a rack of 20 nodes should have 40 Gb/sec of bandwidth to other racks. On the networking side, a 1 Gb network fabric is critical; most nodes will have two or four 1 Gb channels for high availability. Currently, however, only one Hadoop distribution takes advantage of nodes with multiple network interfaces. A 10 Gb network is generally supported and may be helpful for certain workloads.
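The 2 Gb/sec-per-node guideline reduces to simple arithmetic, sketched below with assumed per-rack node and uplink figures; it checks whether a given inter-rack uplink keeps the oversubscription ratio at or below 1:1 under that rule of thumb.

```python
def required_uplink_gbps(nodes_per_rack, per_node_gbps=2):
    """Inter-rack bandwidth suggested by the 2 Gb/sec-per-node rule of thumb."""
    return nodes_per_rack * per_node_gbps

def oversubscription(nodes_per_rack, uplink_gbps, per_node_gbps=2):
    """Ratio of the traffic a rack could offer to the uplink capacity it has."""
    return required_uplink_gbps(nodes_per_rack, per_node_gbps) / uplink_gbps

# A 20-node rack needs ~40 Gb/sec to other racks; a 4 x 10 Gb uplink just meets it.
print(required_uplink_gbps(20))              # 40
print(oversubscription(20, uplink_gbps=40))  # 1.0 -> no oversubscription
print(oversubscription(20, uplink_gbps=20))  # 2.0 -> 2:1 oversubscribed
```
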
The number of nodes in any cluster depends on the size of the data set and the complexity of the analytics. Most distributions easily scale up to a few hundred nodes, while some alternatives run into the thousands. As Big Data projects become more prevalent, these massive clusters have begun to emerge in the data centre.

Operational impacts
This area is where the various Hadoop distributions differ considerably in what they offer for the data centre administrator. Most distributions have limited data protection, disaster recovery and high availability options. Only one distribution provides customers with advanced features to enable enterprises to meet their SLAs by running Big Data operations on a resilient infrastructure.

For example, MapR has added innovations to Apache Hadoop, providing open source support alongside features such as automated stateful failover, complete point-in-time data recovery, and mirroring support for efficient replication of cluster data between data centres for disaster recovery.

Another area that data centre designers need to consider is how to avoid single points of failure. Most distributions have single points of failure for critical services and have no protection from runaway jobs. For a distribution such as MapR, which has designed high availability features into every level of the Hadoop stack, there are no single points of failure in the cluster.

Big Data analysis is becoming a key driver for data centre expansion. Enabling technologies like Hadoop are now pushing past the web-based social media environments where they emerged and are being adopted within the well-established, production data centres of the Fortune 500.

Across verticals, Hadoop needs to run in “lights-out” environments that require high degrees of automation, so it is essential that IT administrators adopt advanced Hadoop architectures that eliminate low-level administrative tasks for file and database applications.

As the use of Hadoop expands, it is also wise to require support for industry standard APIs such as NFS, ODBC, REST and LDAP to allow enterprise software to easily integrate with the widest array of data sources and other Big Data tools.
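To illustrate why these interfaces matter, the sketch below assumes a distribution that exposes the cluster file system over NFS and that the cluster has been mounted at a hypothetical path such as /mnt/cluster; once that is in place, ordinary enterprise tools and standard file I/O can read and write cluster data with no Hadoop-specific client code.

```python
import csv
import os

# Hypothetical NFS mount point for the cluster file system (assumption: the chosen
# distribution supports NFS access and the mount has already been created).
CLUSTER_MOUNT = "/mnt/cluster"
REPORT_DIR = os.path.join(CLUSTER_MOUNT, "analytics", "daily-reports")

os.makedirs(REPORT_DIR, exist_ok=True)

# Any existing tool that can write a file can now feed data into the cluster.
with open(os.path.join(REPORT_DIR, "orders.csv"), "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "region", "amount"])
    writer.writerow(["1001", "EMEA", "250.00"])

# ...and results can be read back with plain file I/O, no Hadoop client libraries.
with open(os.path.join(REPORT_DIR, "orders.csv")) as f:
    print(f.read())
```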

Hadoop can transform data operations, but data centre administrators need to understand how Hadoop operates to successfully design, build and manage a suitable new underlying infrastructure.