Big Data: From Hype to Insight – Part 2 Infrastructure and Technology
By Jenna Danko on Mar 17, 2016
I hope you liked the first part of the Big Data Ecosystem series – welcome to the second part. Now it’s time to dig into the nitty-gritty of what makes Big Data tick! Big data requires more than a change in mindset – it requires a change in the technologies used to deal with the different types of data in play, and in the way they are handled to maximize the benefits.
When we talk of Infrastructure and Technology, the term covers many things and means different things to different people, so we need to define and limit our discussion to the following areas:
- Platform and Framework
- Storage and Database
- Hadoop Applications in the ecosystem
- Data Policy, Governance and Security
- Big Data Talent
Nevertheless, I’ll touch on the complete ecosystem from technology to talent – all of which can affect the successful implementation of Big Data projects. So, let’s get started!
1. Platform and Framework
Hadoop is a platform for distributed storage and processing of data sets on computer clusters built from commodity hardware. A computer cluster consists of a set of loosely or tightly connected computers (also called nodes) that work together so that, in many respects, they can be viewed as a single system. Each node is set to perform the same task, controlled and scheduled by software. The Hadoop framework assumes that moving computation is cheaper than moving data, and thus allows the data to be processed on the local nodes where it resides. This architecture processes data faster and more efficiently than a supercomputer architecture in which computation and data are connected via high-speed networking.
See Figure 1 below for the complete Hadoop Ecosystem. It’s interesting to note that the complete Hadoop Ecosystem is free and open source software (FOSS), incubated and developed by the Apache Software Foundation. Apache is funded and managed by many large corporations and individuals who voluntarily offer their services, time and money to the cause.
Figure 1: Hadoop 2.0 Ecosystem (Multi-use Data Platform-Batch, Online, Streaming, Interactive, etc.)
Hadoop 2.0 is a framework for processing, storing, and analyzing massive amounts of distributed unstructured data. The framework is composed of the following core modules:
- Hadoop Common – contains libraries and utilities needed by other Hadoop modules
- Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster (more on this later)
- Hadoop YARN – Yet Another Resource Negotiator (YARN) is often called the operating system of Hadoop because it is responsible for managing and monitoring workloads, maintaining a multi-tenant environment, implementing security controls, and managing high availability features of Hadoop
- Hadoop MapReduce – a programming model and associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster (more on this later)
The term Hadoop has come to refer not just to the above-mentioned core modules, but also to the collection of additional software packages that can be installed on top of core Hadoop, such as Pig, Hive, and HBase. We’ll discuss these components subsequently.
Right now, it’s important to address MapReduce and how it works. Hadoop manages the distribution of work across many servers in a divide-and-conquer methodology known as MapReduce. Since each server houses a subset of your overall data set, MapReduce lets you move the processing close to the data, minimizing the network access that would otherwise slow down the task.
The MapReduce algorithm, introduced by Google, contains two important tasks, namely Map and Reduce. Map takes a set of data and converts its individual elements into key/value tuples. The Reduce task then takes the output of a Map function as its input and combines those tuples into a smaller set of tuples. The Reduce task is always performed after the Map job. An example helps to show how MapReduce actually works; Figure 2 below captures how words are counted under this framework – a job that would scale poorly if run on a single machine.
Figure 2: Example of MapReduce
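The word-count flow in Figure 2 can be sketched in plain Python, simulating the map, shuffle, and reduce phases in one process (an illustration of the model only, not the distributed framework itself):

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) tuple for every word in the input split
    return [(word, 1) for word in line.split()]

def shuffle(mapped):
    # Shuffle/sort: group all emitted values by key across mapper outputs
    groups = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            groups[word].append(count)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's list of values into one (word, total) pair
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["deer bear river", "car car river", "deer car bear"]
counts = reduce_phase(shuffle(map(map_phase, lines)))
```

In the real framework the mappers and reducers run on different nodes, and the shuffle moves data between them over the network; the logic per phase is the same.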
2. Storage and Database
It is important to understand how HDFS stores data. At its most basic level, a Hadoop implementation creates four unique node types for cataloging, tracking, and managing data throughout the infrastructure, as follows:
- Data Node: these are the repositories for the data; blocks of data are distributed across many such nodes, scaling horizontally across the compute and storage resources of the infrastructure.
- Client Node: this represents the user interface to the big data implementation and query engine. The client could be a server or PC with a traditional user interface.
- Name Node: this is the equivalent of the address router for the distributed infrastructure. This node maintains the index and location of the data on every data node.
- Job Tracker: represents the software job-tracking mechanism that distributes and aggregates queries across multiple nodes for ultimate client analysis.
Figure 3 below captures how these nodes work in tandem to store, retrieve, and process data by taking advantage of data locality and the built-in fault tolerance of the HDFS architecture.
Figure 3: HDFS Architecture – How Storage Works Under Hadoop
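To make the division of labour concrete, here is a minimal in-memory sketch of how a client might resolve a file through the name node and then fetch blocks from data nodes (the paths and node names are hypothetical; real HDFS adds replication, heartbeats, and much more):

```python
# NameNode: keeps only metadata -- which blocks make up a file,
# and which data node holds each block.
name_node = {
    "/logs/day1.txt": [("blk_1", "datanode-a"), ("blk_2", "datanode-b")],
}

# DataNodes: hold the actual block contents.
data_nodes = {
    "datanode-a": {"blk_1": "first half of the file "},
    "datanode-b": {"blk_2": "second half of the file"},
}

def read_file(path):
    # The client asks the name node for block locations, then reads
    # each block directly from the data node that stores it.
    blocks = name_node[path]
    return "".join(data_nodes[node][blk] for blk, node in blocks)

content = read_file("/logs/day1.txt")
```

Note that the name node never touches the data itself: clients stream blocks straight from the data nodes, which is what lets HDFS scale its aggregate bandwidth with the cluster.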
Now, turning to database technologies, you would agree that relational databases do not lend themselves intuitively to unstructured or semi-structured data, and hence a new class of databases known as NoSQL (Not Only SQL) came into the picture. Broadly speaking, there are four types of NoSQL database in use right now – each with different runtime rules and different trade-offs. The complexity of the data and the scalability of the system decide which database to use when. Figure 4 below depicts the different types of databases and their usages.
Figure 4: NoSQL Database Types
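The trade-offs between the types show up in how you query them. As a rough sketch (with made-up data), contrast a key-value store, where values are opaque blobs addressable only by key, with a document store, where documents can be matched on their fields:

```python
# Key-value store: opaque values looked up by key only.
kv_store = {}
kv_store["session:42"] = b"...serialized blob..."

# Document store: values are structured documents that can be
# queried by their fields, not just by their key.
doc_store = {
    "user:1": {"name": "Ada", "city": "London"},
    "user:2": {"name": "Alan", "city": "Cambridge"},
}

def find_by_field(store, field, value):
    # Scan documents and match on an inner field
    return [doc for doc in store.values() if doc.get(field) == value]

matches = find_by_field(doc_store, "city", "London")
```

The key-value model scales simplest but answers only exact-key lookups; the richer the query model (document, column-family, graph), the more structure the database must understand.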
3. Applications in the Hadoop Ecosystem
As noted below in Figure 5, the Hadoop Ecosystem is well supported by many useful applications, which abstract away the complexity and provide business users with efficient, productive access to the Hadoop architecture. The applications below are discussed briefly to show their utility and applicability in the ecosystem; readers are encouraged to refer to external sources for more details.
Hive – Hive is a “SQL-like” bridge that allows conventional BI applications to run queries against a Hadoop cluster. It’s a higher-level abstraction of the Hadoop framework that allows anyone to make queries against data stored in a Hadoop cluster just as if they were manipulating a conventional data store. It amplifies the reach of Hadoop, making it more familiar for BI users. Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements and are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster.
Pig – Pig was initially developed to let analysts focus more on analyzing large data sets and spend less time writing mapper and reducer programs. Pig is made up of two components: the first is the language itself, called PigLatin, and the second is a runtime environment where PigLatin programs are executed. It’s a high-level scripting language that enables data workers to write complex data transformations without knowing Java.
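PigLatin is its own language, but the kind of dataflow it expresses – load, flatten, group, count – looks roughly like the following high-level Python sketch, with no mapper or reducer written by hand:

```python
from collections import Counter
from itertools import chain

# LOAD: raw lines of text
lines = ["deer bear river", "car car river", "deer car bear"]

# FOREACH ... GENERATE FLATTEN: split lines into individual words
words = chain.from_iterable(line.split() for line in lines)

# GROUP BY word + COUNT: a single declarative step instead of
# hand-written map and reduce functions
word_counts = Counter(words)
```

The runtime (Pig's, in the real case) is what translates such declarative steps into MapReduce jobs behind the scenes.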
Figure 5: Business Abstraction for Hadoop
Sqoop – Sqoop efficiently transfers bulk data between Hadoop and relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into an external RDBMS, and it works with almost all major RDBMSs available in the market.
Ambari – It’s a web-based platform for provisioning, managing, monitoring and securing Apache Hadoop clusters. Ambari takes the guesswork out of operating Hadoop. It makes Hadoop management simpler by providing a consistent, secure platform for operational control. Ambari provides an intuitive Web UI as well as a robust REST API, which is particularly useful for automating cluster operations.
HBase – HBase is a column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets, which are common in many big data use cases. Unlike relational database systems, HBase does not support a structured query language like SQL.
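A sparse, column-oriented layout can be pictured as rows that each carry only the columns they actually have. A rough in-memory sketch (with hypothetical rows and column names):

```python
# Each row stores only the (column, value) pairs that exist,
# so sparse data wastes no space on empty cells.
table = {
    "row1": {"cf:name": "Ada", "cf:email": "ada@example.com"},
    "row2": {"cf:name": "Alan"},  # no email column stored at all
}

def get(row_key, column):
    # Access is by row key and column name, not by a SQL query
    return table.get(row_key, {}).get(column)

name = get("row1", "cf:name")
missing = get("row2", "cf:email")  # the cell simply does not exist
```

This is why HBase suits sparse data sets: a relational table would reserve a column for every row, while here an absent cell costs nothing.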
Flume – Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS. It has a simple, flexible architecture based on streaming data flows, and is robust and fault tolerant with tunable reliability mechanisms for failover and recovery. YARN coordinates data ingest from Flume agents that deliver raw data into the Hadoop cluster.
Storm – Storm is a system for processing streaming data in real time. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm on YARN is powerful for scenarios requiring real-time analytics, machine learning and continuous monitoring of operations.
ZooKeeper – ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.
Oozie – Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages – such as MapReduce, Pig and Hive – and then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after the specified previous jobs on which it relies for data have completed.
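The “run this job only after its prerequisites finish” behaviour amounts to executing a dependency graph in topological order. A minimal sketch (the job names are hypothetical):

```python
# Each job lists the jobs whose output it depends on.
workflow = {
    "import_data": [],
    "clean_data": ["import_data"],
    "hive_query": ["clean_data"],
    "report": ["hive_query", "clean_data"],
}

def run_order(jobs):
    # Depth-first topological sort: a job is scheduled only after
    # every one of its prerequisites has been scheduled.
    done, order = set(), []
    def run(job):
        if job in done:
            return
        for dep in jobs[job]:
            run(dep)
        done.add(job)
        order.append(job)
    for job in jobs:
        run(job)
    return order

order = run_order(workflow)
```

A real Oozie workflow adds triggers, retries, and forks/joins on top of this basic ordering idea.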
Mahout – Mahout is a data mining library that implements the most popular algorithms for clustering, regression testing and statistical modeling using the MapReduce model. It also produces free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on collaborative filtering, clustering and classification.
Kafka – Kafka is a distributed publish-subscribe messaging service. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization, and it can be elastically and transparently expanded without downtime. Other messaging variants include RabbitMQ and NSQ.
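The publish-subscribe idea itself is simple: producers publish to a named topic without knowing who consumes, and the broker fans messages out to every subscriber. A toy in-memory broker (nothing Kafka-specific – Kafka adds partitioning, persistence, and replication on top):

```python
from collections import defaultdict

class PubSub:
    """Minimal in-memory publish-subscribe broker."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        # Register a callback for every future message on this topic
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Producers never address consumers directly; the broker fans out
        for handler in self.subscribers[topic]:
            handler(message)

broker = PubSub()
received = []
broker.subscribe("orders", received.append)
broker.publish("orders", {"id": 1, "amount": 9.99})
```

Decoupling producers from consumers this way is what lets one cluster act as a central data backbone: new consumers attach to a topic without any change to the producers.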
Spark – Spark is a newer cluster computing framework like Hadoop, albeit with much faster processing than MapReduce (up to 100x). However, Spark does not provide its own distributed storage system. For this reason, many Big Data projects involve installing Spark on top of Hadoop, where Spark’s advanced analytics applications can make use of data stored in HDFS. Spark handles most of its operations ‘in-memory’, copying data from the distributed physical storage into RAM. Spark arranges data in what are known as Resilient Distributed Datasets (RDDs), which can be recovered following a failure.
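The “resilient” part of RDDs comes from lineage: an RDD remembers the chain of transformations that produced it, so a lost partition can be recomputed from the source rather than restored from a replica. A toy single-process sketch of that idea (not Spark’s API):

```python
class ToyRDD:
    """Tiny stand-in for an RDD: remembers its lineage so results
    can be recomputed from the source data after a failure."""
    def __init__(self, source, lineage=()):
        self.source = source      # original data (stands in for HDFS input)
        self.lineage = lineage    # chain of transformations applied so far

    def map(self, fn):
        # Transformations are lazy: just extend the lineage
        return ToyRDD(self.source, self.lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self.source, self.lineage + (("filter", fn),))

    def collect(self):
        # An action recomputes from source through the full lineage;
        # this replay is also how a lost partition would be recovered.
        data = list(self.source)
        for op, fn in self.lineage:
            data = [fn(x) for x in data] if op == "map" else [x for x in data if fn(x)]
        return data

rdd = ToyRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
result = rdd.collect()
```

The lazy, lineage-based design means Spark never needs to checkpoint every intermediate result to disk, which is a large part of its speed advantage over MapReduce.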
4. Security of the Hadoop Ecosystem
“The larger the concentration of sensitive personal data, the more attractive a database is to criminals, both inside and outside a firm. The risk of consumer injury increases as the volume and sensitivity of the data grows.” – Edith Ramirez, Chairwoman, U.S. Federal Trade Commission.
In the age of Big Data, data security is a deal maker. Organizations need to adopt a data-centric approach to security, and now need big data environments that include enterprise-grade authentication and authorization (e.g., LDAP or the Apache Sentry project). Securing the big data life cycle requires the following security controls:
- Authentication and authorization of users, applications, and databases
- Privileged user access and administration
- Encryption of data at rest and in motion
- Data redaction and masking for both production and non-production environments
- Separation of responsibilities and roles
- Implementing least privilege
- Transport security
- API security
- Monitoring, auditing, alerting, and reporting
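Several of the controls above – authorization, least privilege, separation of roles, and auditing – reduce to the same check: does this user’s role grant exactly this action on this data, and was the decision recorded? A minimal sketch (the roles and policies are hypothetical):

```python
# Each role is granted only the specific actions it needs (least privilege),
# and duties are separated across distinct roles.
policies = {
    "analyst": {("read", "sales_data")},
    "etl_job": {("read", "raw_logs"), ("write", "sales_data")},
    "admin":   {("read", "audit_log"), ("manage", "users")},
}

audit_log = []

def authorize(role, action, resource):
    # Deny by default; record every decision for auditing and alerting
    allowed = (action, resource) in policies.get(role, set())
    audit_log.append((role, action, resource, allowed))
    return allowed

ok = authorize("analyst", "read", "sales_data")
denied = authorize("analyst", "write", "sales_data")
```

Production systems express this through directory services and policy engines rather than an in-code table, but the deny-by-default, audit-everything shape is the same.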
Organizations can achieve all the benefits that big data has to offer while providing a comprehensive, inside-out security approach that ensures that the right people, internally and externally, receive access to the appropriate data at the right time and place. In addition, organizations must be able to address regulatory compliance and extend existing governance policies across their big data platforms.
5. Big Data Talent
The discussion of big data infrastructure can’t be complete without talking about the new expertise and human resources required to manage and analyze this huge volume of data. A new breed of tools and programming languages is evolving, very much in demand to cater to the growing needs of the industry. As you can see below, the mix of talent and competency will drive both the demand for and the availability of a new breed of data scientists and data owners. Needless to say, right now there is a huge gap between the demand for and the availability of the core skills required for data science and Big Data talent.
Figure 6: Big Data skill requirement
I hope you found this post informative. Kindly drop any questions or comments you have in the comments section below. In the upcoming post, we’ll talk about Big Data Analytics – the area most talked about in the field right now. I’m excited for the next post and hope you are also plugged in and enjoying the journey!