Data Analytics with Hadoop (Zaloni Preview Edition)

Issue link: https://resources.zaloni.com/i/790569


Figure 2-8. Complex algorithms and applications are built by chaining MapReduce jobs, where the input of a downstream MapReduce job is the output of an upstream one.

In fact, using multiple MapReduce jobs to perform a single computation is how more complex applications are constructed, through a process called "job chaining". By creating data flows through a series of intermediate MapReduce jobs, as shown in Figure 2-8, we can build a pipeline of analytical steps that leads us to our end result. Our job as analysts and developers is to devise algorithms that implement Map and Reduce in order to arrive at a single analytical conclusion, a topic that we will explore in detail in ???.

Throughout the book, we will explore how to shift our computational frameworks away from more traditional iterative analytics toward "data flows" for large-scale computation. Data flows are directed acyclic graphs of jobs or operations applied to a large data set toward some end computation. In the end, the primary data engineering effort of a Big Data application is to filter and aggregate the larger data sets toward last-mile computation, potentially to the point where the data fits into memory and can be evaluated there. It is easy to see how chained jobs fit this data processing model, and it is equally relevant in other data processing systems such as Storm and Spark.

Submitting a MapReduce Job to YARN

The MapReduce API is written in Java, so MapReduce jobs submitted to the cluster are compiled Java Archive (JAR) files. Hadoop transmits the JAR files across the network to each node that will run a task (either a Mapper or a Reducer), and the individual tasks of the MapReduce job are then executed.

46 | Chapter 2: An Operating System for Big Data
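To make the job-chaining idea concrete, here is a small, pure-Python sketch of a chained data flow. This is a conceptual toy, not the Hadoop API: each simulated "job" runs a map phase, groups emitted values by key (the shuffle), and then runs a reduce phase, and the second job's input is exactly the first job's output. The function and variable names (`run_job`, `tokenize`, `sum_counts`, `by_count`) are illustrative inventions.

```python
# Toy illustration of MapReduce "job chaining" in plain Python.
# This is a conceptual sketch, NOT the Hadoop Java API: each "job" is a
# map phase, a shuffle (group values by key), and a reduce phase, and the
# output of the first job is fed as input to the second.
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Run one simulated MapReduce job over an iterable of records."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record may emit zero or more (key, value) pairs.
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    # Reduce phase: one reducer call per distinct key.
    return [reducer(key, values) for key, values in groups.items()]

# Job 1: classic word count over lines of text.
def tokenize(line):
    for word in line.split():
        yield word.lower(), 1

def sum_counts(key, values):
    return key, sum(values)

# Job 2: re-key job 1's output by count, producing a histogram of
# "how many distinct words occur n times"; this is the chained data flow.
def by_count(pair):
    word, n = pair
    yield n, 1

counts = run_job(["the cat sat", "the dog sat"], tokenize, sum_counts)
histogram = run_job(counts, by_count, sum_counts)
print(dict(counts))     # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
print(dict(histogram))  # {2: 2, 1: 2}
```

On a real cluster, each `run_job` call would instead be a full MapReduce job whose intermediate output is written to HDFS and read by the next job in the chain.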
