Data Analytics with Hadoop (Zaloni Preview Edition)

The end result of this step should be an application, job, or script that can be run on demand in an automated fashion. Hadoop has evolved into an ecosystem of tools, each of which operationalizes some part of the above pipeline. For example, Sqoop and Kafka are designed for ingestion: Sqoop imports relational databases into Hadoop, while Kafka provides distributed message queues for on-demand processing. Data management tools like Hive and HBase provide data warehousing and scalable, column-oriented storage. Libraries like Spark's GraphX and MLlib, or Mahout, provide analytical packages for large-scale computation as well as validation (a brief MLlib sketch appears at the end of this section). Throughout the book, we'll explore many different components of the Hadoop ecosystem and see how they fit into the overall big data pipeline.

Building Data Products with Hadoop

The conversation about what data science is has changed over the past decade, moving from purely analytical methods toward visualization, and now to the creation of data products. Data products are economic engines that derive their value from data; they are self-adapting, learning, and broadly applicable, and they generate new data in return. Data products have sparked an information economy revolution that has changed the way small businesses, technology startups, larger organizations, and governments view their data.

In this chapter, we've described a revision to the original pedagogical model of the data science pipeline and proposed a data product pipeline. The data product pipeline is iterative, with two phases (the building phase and the operational phase) and four stages (interaction, data, storage, and computation). It serves as an architecture for performing large-scale data analyses in a methodical fashion that preserves experimentation and human interaction with data products, while also enabling parts of the process to become automated as larger applications are built around them. We hope that this pipeline can be used as a general framework for understanding the data product lifecycle, and as a stepping stone toward more innovative projects.

Throughout this book we will explore distributed computing and Hadoop from the perspective of a data scientist, with the idea that the purpose of Hadoop is to take data from many disparate sources, in a variety of forms, with a large number of instances, events, and classes, and transform it into something of value: a data product.
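To make the computation stage concrete, here is a minimal sketch of training and validating a model with Spark's MLlib (via the DataFrame-based pyspark.ml API). The HDFS path, feature columns, and label column are hypothetical placeholders standing in for data landed earlier in the pipeline by an ingestion tool such as Sqoop:

    # A minimal sketch of the computation stage: fit a model at scale on
    # data previously ingested into HDFS, then validate on held-out data.
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

    # Hypothetical path to data landed by the ingestion stage.
    df = spark.read.parquet("hdfs:///data/events")

    # Assemble the (hypothetical) feature columns into the single
    # vector column that MLlib estimators expect.
    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"],
                                outputCol="features")
    train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

    # Train, then measure accuracy on the held-out split.
    model = LogisticRegression(labelCol="label",
                               featuresCol="features").fit(train)
    print("held-out accuracy:", model.evaluate(test).accuracy)

This fragment only illustrates how the storage and computation stages of the pipeline connect; the tools themselves are covered in depth throughout the book.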
