Data Analytics with Hadoop, Zaloni Preview Edition

Issue link: https://resources.zaloni.com/i/790569

working across a cluster, HDFS should be easily integrated into your current operational workflows. For the rest of the book, our primary concern will be the management and computation of data that resides on HDFS, and to do that we need a fundamental understanding of distributed computing and its requirements.

While YARN has enabled Hadoop to become a general distributed computing platform, MapReduce (often abbreviated to MR) was the first computational framework for Hadoop. YARN allows non-MapReduce frameworks like Spark, Tez, and Storm (to name a few) to run alongside the original MapReduce application on a Hadoop cluster. However, for most Hadoop users, MapReduce is still the primary framework for many applications and analytics. Moreover, a general understanding of how MapReduce works will allow us to think more deeply about distributed analytics and inform discussions of how other platforms work, since the theoretical underpinnings of MapReduce are shared with those other frameworks.

In this section, we'll explore the basic principles of the MapReduce programming paradigm and discuss why these functional programming constructs are ideal for distributed systems. We will demonstrate how MapReduce works via two simple analytics that are routinely used to demonstrate computation in a distributed environment: word counting and shared friendships. Finally, we will describe how MapReduce applications are implemented on a Hadoop cluster and show how to submit and manage a sample MapReduce job, fetching the output via the Hadoop command line interface.

MapReduce: A Functional Programming Model

When people refer to MapReduce, they're usually referring to the distributed programming model that was devised and later described by Google in the paper by Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters [4]. MapReduce is a simple but very powerful computational framework specifically designed to enable fault-tolerant distributed computation across a cluster of centrally managed machines. It does this by employing a "functional" programming style that is inherently parallelizable: multiple independent tasks each execute a function on local chunks of data, and the results are aggregated after processing.

Functional programming is a style of programming that ensures that unit computations are evaluated in a stateless manner. This means that functions depend only on their inputs, and that they are closed and do not share state. Data is transferred between functions by sending the output of one function as the input to another, wholly independent function. These traits make functional programming a great fit

[4] MapReduce: Simplified Data Processing on Large Clusters: http://bit.ly/google-mapreduce-paper

36 | Chapter 2: An Operating System for Big Data
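
The stateless style described above can be illustrated with a minimal Python sketch. It is not Hadoop's API and not code from the book. The names count_words, merge_counts, and run, and the sample chunks, are hypothetical and chosen only for illustration. Each call to the map function depends only on its own chunk, and a separate function aggregates the partial results afterward.

```python
from collections import Counter
from functools import reduce


def count_words(chunk):
    """Map step: depends only on its input chunk and touches no shared state."""
    return Counter(word.lower() for line in chunk for word in line.split())


def merge_counts(left, right):
    """Reduce step: combine two partial results into a new Counter."""
    return left + right


def run(chunks):
    # Each count_words call is independent of the others, so on a cluster
    # the calls could run on different machines, each near its own data.
    partials = map(count_words, chunks)
    # Aggregate the partial results after processing.
    return reduce(merge_counts, partials, Counter())


# Hypothetical input: two "local chunks", each a list of text lines.
chunks = [
    ["the quick brown fox", "the lazy dog"],
    ["the fox jumps"],
]
totals = run(chunks)
print(totals["the"], totals["fox"])  # prints: 3 2
```

Because neither function reads or writes anything outside its arguments, the chunks can be processed in any order, or at the same time, without changing the totals.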
