Understanding Metadata: Create the Foundation for a Scalable Data Architecture

With Bedrock, all steps of data ingestion are defined in advance, tracked, and logged, so the process is repeatable. Bedrock captures streaming data and lets you define streams by integrating Kafka topics and Flume agents. It can be configured to automatically consume incoming files and streams, capture metadata, and register the data with the Hadoop ecosystem. It employs file- and record-level watermarking, making it possible to see where data moves and how it is used (data lineage). Input data can be enriched and transformed with Spark-based transformation libraries, providing flexible transformations at scale.

One challenge the Bedrock product addresses is metadata management in transient clusters. Transient clusters are configured to provide cost-effective, scalable, on-demand processing, and they are shut down when no data processing is required. Because metadata must persist beyond the life of the cluster, most companies pay extra for persistent storage; a data lake management platform such as Bedrock is one way to address this.

Zaloni also provides Mica, a self-service data preparation product built on top of Bedrock that enables business users to explore, prepare, and collaborate on data. It provides an enterprise-wide data catalog for exploring and searching datasets using free-form text or multifaceted search. It also lets users create transformations interactively, using a tabular view of the data along with a list of transformations that can be applied to each column. Users can define a process and operationalize it in Bedrock: Mica creates a workflow by automatically translating the UI steps into Spark code and transferring it to Bedrock.

Automating Metadata Capture

Metadata generation can be an exhausting process if it is performed by manually inspecting each data source. It is even harder in large companies with numerous, disparate data sources. As mentioned before, the key is to automate the capture of metadata on arrival of data in the lake, and to identify relationships with existing metadata definitions, governance policies, and business glossaries.
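Bedrock's internals are not public, but the ingestion pattern described above, consuming a Kafka stream while keeping per-record metadata for lineage, can be sketched with Spark Structured Streaming. The broker address, topic name, and lake paths below are hypothetical placeholders:

```python
# Sketch: consume a Kafka topic with Spark Structured Streaming and
# capture basic ingest metadata (topic, partition, offset, timestamp).
# Broker address, topic name, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("ingest-with-metadata").getOrCreate()

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "raw-events")                 # placeholder topic
    .load()
)

# The Kafka source exposes per-record metadata columns alongside the
# payload; persisting them gives record-level lineage of the kind the
# text attributes to watermarking.
records = stream.select(
    col("value").cast("string").alias("payload"),
    col("topic"), col("partition"), col("offset"),
    col("timestamp").alias("ingest_time"),
)

query = (
    records.writeStream
    .format("parquet")
    .option("path", "/lake/raw/events")              # placeholder landing zone
    .option("checkpointLocation", "/lake/_chk/events")
    .start()
)
```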
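Likewise, a Spark-based enrichment step of the kind the text mentions (and roughly of the kind Mica might generate from a user's UI steps) could look like the following. The paths, column names, and join key are illustrative assumptions, not Bedrock's or Mica's actual API:

```python
# Sketch: enrich raw events with reference data and derived columns.
# All table paths and column names are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, upper

spark = SparkSession.builder.appName("enrich-events").getOrCreate()

raw = spark.read.parquet("/lake/raw/events")            # placeholder path
customers = spark.read.parquet("/lake/ref/customers")   # reference dataset

enriched = (
    raw.withColumn("event_date", to_date("ingest_time"))  # derive a date column
       .withColumn("country", upper("country"))           # normalize a value
       .join(customers, on="customer_id", how="left")     # add reference attributes
)

enriched.write.mode("overwrite").parquet("/lake/curated/events")
```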
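For the transient-cluster problem, the usual remedy is to keep the metadata store outside the cluster itself, so that table definitions survive shutdown. A minimal sketch, assuming an external Hive metastore reachable at a placeholder Thrift address:

```python
# Sketch: point a Spark session on a transient cluster at an external
# Hive metastore so table metadata outlives the cluster.
# The Thrift URI and paths are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("transient-cluster-job")
    .config("hive.metastore.uris", "thrift://metastore.internal:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Tables registered here live in the external metastore, not on the
# cluster's local disk, so a freshly provisioned cluster can
# rediscover them later.
spark.sql(
    "CREATE TABLE IF NOT EXISTS curated_events USING parquet "
    "LOCATION '/lake/curated/events'"
)
```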
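As a rough illustration of automated capture on arrival, the sketch below infers a newly landed file's schema and records a few profile statistics. The paths and the shape of the catalog entry are assumptions; a real platform would also register the entry in a catalog and match it against existing metadata definitions, governance policies, and business glossaries:

```python
# Sketch: capture technical metadata when a file lands in the lake,
# rather than inspecting each source by hand. Paths and the structure
# of the catalog entry are hypothetical.
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("auto-metadata").getOrCreate()

def capture_metadata(path: str) -> dict:
    """Infer schema and basic profile statistics for a landed CSV file."""
    df = (
        spark.read
        .option("header", True)
        .option("inferSchema", True)
        .csv(path)
    )
    return {
        "path": path,
        "columns": [
            {"name": f.name, "type": f.dataType.simpleString()}
            for f in df.schema.fields
        ],
        "row_count": df.count(),
    }

entry = capture_metadata("/lake/landing/new_file.csv")  # placeholder path
print(json.dumps(entry, indent=2))
```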
