
Hadoop and to building algorithms and workflows for data processing. In this chapter, we will present Hadoop as an operating system for Big Data. We will discuss the high-level concepts of how the operating system works via its two primary components: the distributed filesystem, HDFS ("Hadoop Distributed File System"), and the workload and resource manager, YARN ("Yet Another Resource Negotiator"). We will also demonstrate how to interact with HDFS on the command line, as well as execute an example MapReduce job (a minimal sketch appears at the end of this section). At the end of this chapter you should be comfortable interacting with a cluster and ready to execute the examples in the rest of this book.

Basic Concepts

In order to perform computation at scale, Hadoop distributes an analytical computation involving a massive data set to many machines, each of which simultaneously operates on its own individual chunk of data. Distributed computing is not new, but it is a technical challenge, requiring distributed algorithms to be developed, machines in the cluster to be managed, and networking and architecture details to be solved. More specifically, a distributed system must meet the following requirements:

1. Fault Tolerance: if a component fails, it should not result in the failure of the entire system. The system should gracefully degrade into a lower-performing state. If a failed component recovers, it should be able to rejoin the system.

2. Recoverability: in the event of failure, no data should be lost.

3. Consistency: the failure of one job or task should not affect the final result.

4. Scalability: adding load (more data, more computation) should lead to a decline in performance, not failure; increasing resources should result in a proportional increase in capacity.

Hadoop addresses these requirements through several abstract concepts, as defined below. When implemented correctly, these concepts define how a cluster should manage data storage and distributed computation. Moreover, an understanding of why these concepts form the basic premise of Hadoop's architecture will inform other topics such as data pipelines and data flows for analysis.

1. Data is distributed immediately when added to the cluster and stored on multiple nodes. Nodes prefer to process data that is stored locally in order to minimize traffic across the network.

2. Data is stored in blocks of a fixed size (usually 128 megabytes), and each block is duplicated multiple times across the system to provide redundancy and data safety.
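To make the MapReduce job mentioned above concrete, the following is a minimal word count written for Hadoop Streaming in Python. It is a sketch under stated assumptions rather than the book's own listing: the file names mapper.py and reducer.py are illustrative, and the scripts assume plain whitespace-separated text as input.

    #!/usr/bin/env python
    # mapper.py: emit a (word, 1) pair, tab-separated, for every
    # whitespace-separated token read from standard input.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py: sum the counts for each word. Hadoop delivers the mapper
    # output sorted by key, so identical words arrive on consecutive lines.
    import sys

    current_word = None
    current_count = 0

    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word = word
            current_count = int(count)

    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

Because Hadoop Streaming communicates with the mapper and reducer over standard input and output, the pair can be tested locally with an ordinary shell pipeline (piping a text file through mapper.py, sort, and reducer.py) before being submitted to the cluster with the hadoop jar command and the Streaming jar that ships with your distribution; the jar's exact path and the HDFS input and output directories depend on your installation.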
