3. A computation is usually referred to as a job; jobs are broken into tasks, where each individual node performs a task on a single block of data.

4. Jobs are written at a high level without concern for network programming, time, or low-level infrastructure, allowing developers to focus on the data and computation rather than distributed programming details (a minimal sketch appears at the end of this section).

5. The amount of network traffic between nodes should be minimized transparently by the system. Each task should be independent, and nodes should not have to communicate with each other during processing, to ensure that there are no interprocess dependencies that could lead to deadlock.

6. Jobs are fault tolerant, usually through task redundancy, such that if a single node or task fails, the final computation is not incorrect or incomplete.

7. Master programs allocate work to worker nodes such that many worker nodes can operate in parallel, each on their own portion of the larger data set.

These basic concepts, while implemented slightly differently for various Hadoop systems, drive the core architecture and together ensure that the requirements for fault tolerance, recoverability, consistency, and scalability are met. These requirements also ensure that Hadoop is a data management system that behaves as expected for analytical data processing, which has traditionally been performed in relational databases or scientific data warehouses. Unlike data warehouses, however, Hadoop is able to run on more economical commercial off-the-shelf hardware. As such, Hadoop has been leveraged primarily to store and compute upon large, heterogeneous data sets stored in "lakes" rather than warehouses, and relied upon for rapid analysis and prototyping of data products.

Hadoop Architecture

Hadoop is composed of two primary components that implement the basic concepts of distributed storage and computation as discussed in the previous section: HDFS and YARN. HDFS (sometimes shortened to DFS) is the Hadoop Distributed File System, responsible for managing data stored on disks across the cluster. YARN ("Yet Another Resource Negotiator") acts as a cluster resource manager, allocating computational assets (processing availability and memory on worker nodes) to applications that wish to perform a distributed computation. The architectural stack is shown in Figure 2-1. Of note, the original MapReduce application is now implemented on top of YARN, alongside newer distributed computation applications such as the graph processing engine Apache Giraph and the in-memory computing platform Apache Spark.
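To make the idea of a high-level job concrete, the following is a minimal word-count sketch in the Hadoop Streaming style: a mapper and a reducer that simply read from stdin and write tab-separated key/value pairs to stdout, while block placement in HDFS, task scheduling on YARN, the shuffle-and-sort, and task retries are all handled by the framework. The file names mapper.py and reducer.py are illustrative assumptions, not part of this excerpt.

#!/usr/bin/env python
# mapper.py -- word-count mapper for Hadoop Streaming (illustrative sketch).
# Each map task receives one block of input text on stdin; data placement,
# scheduling, and retries are handled by the framework, not by this script.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # Emit tab-separated key/value pairs for the shuffle-and-sort phase.
        print("%s\t%d" % (word, 1))

The matching reducer receives its input grouped and sorted by key, so it can accumulate a count per word in a single pass:

#!/usr/bin/env python
# reducer.py -- word-count reducer for Hadoop Streaming (illustrative sketch).
# Input lines arrive sorted by key, so counts are accumulated per word.
import sys

current_word, count = None, 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, value = line.split("\t", 1)
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, int(value)

# Flush the final key after the input is exhausted.
if current_word is not None:
    print("%s\t%d" % (current_word, count))

Such a job is typically submitted with the hadoop-streaming JAR, passing -mapper, -reducer, -input, and -output options; YARN then allocates containers for the individual map and reduce tasks. The JAR location and the HDFS input and output paths vary by installation and are not specified here.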
