Big Data Workflows

With the goals of scalability and automation in mind, we can refactor the human-driven data science pipeline into an iterative model with four stages: ingestion, staging, computation, and workflow management, as shown in Figure 1-2. Like the data science pipeline, this model in its simplest form takes raw data and converts it into insights. The crucial distinction, however, is that the data product pipeline builds in the step to operationalize and automate the workflow. By converting the ingestion, staging, and computation steps into an automated workflow, this step ultimately produces a reusable data product as the output. The workflow management step also introduces a feedback flow mechanism, where the output from one job execution can be automatically fed in as the data input for the next iteration, and thus provides the necessary self-adapting framework for machine learning applications.

Figure 1-2. The Big Data Pipeline

• The ingestion stage is both the initialization of a model and an application interaction between users and the model. During initialization, users specify locations for data sources or annotate data (another form of ingestion). During interaction, users consume the predictions of the model and provide feedback that is used to reinforce the model.

• The staging stage is where transformations are applied to data to make it consumable, and where data is stored so that it can be made available for processing. Staging is responsible for normalization and standardization of data, as well as data management in some computational data store.

• The computation stage is the heavy-lifting stage, with the primary responsibility of mining the data for insights, performing aggregations or reports, or building machine learning models for recommendations, clustering, or classification.

• The workflow management stage performs abstraction, orchestration, and automation tasks that enable the workflow steps to be operationalized for production, as sketched in the example that follows.
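The following is a minimal sketch of these four stages in plain Python, showing only the ordering of the stages and the feedback edge from workflow management back to ingestion. The function names, the toy data, and the feedback rule (re-ingesting the previous run's mean) are illustrative assumptions, not code from the book.

# Hypothetical sketch of the four-stage pipeline described above.
# Stage names mirror the text; data and feedback rule are illustrative.

from statistics import mean


def ingest(raw_records, feedback=None):
    """Pull in raw data; feedback from a prior run is folded back in here."""
    records = list(raw_records)
    if feedback:
        records.extend(feedback)
    return records


def stage(records):
    """Normalize the records so they are consumable downstream."""
    return [float(r) for r in records if r is not None]


def compute(staged):
    """The heavy lifting: aggregate the staged data into a simple insight."""
    return {"count": len(staged), "mean": mean(staged)}


def run_workflow(raw_records, iterations=2):
    """Orchestrate ingestion -> staging -> computation, feeding each
    iteration's output back in as input for the next (the feedback flow)."""
    feedback = None
    result = None
    for _ in range(iterations):
        records = ingest(raw_records, feedback)
        staged = stage(records)
        result = compute(staged)
        # The output of this run becomes part of the next run's input.
        feedback = [result["mean"]]
    return result


if __name__ == "__main__":
    print(run_workflow([1, 2, None, 4, 5]))

In production the same structure would be expressed in a workflow engine rather than a single script; the point here is only how the workflow management step strings the other three stages together and closes the feedback loop.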
