Understanding metadata scalable data architecture book

Issue link: https://resources.zaloni.com/i/790575


Depending on the provider, data lake management solutions can be classified into three different groups: (1) solutions from traditional data integration/management vendors, (2) tooling from open source projects, and (3) startups providing best-of-breed technology.

Traditional Data Integration/Management Vendors

The IBM Research Accelerated Discovery Lab is a collaborative environment specifically designed to facilitate analytical research projects. This lab leverages IBM's Platform Cluster Management and includes data curation tools and data lake support. The lab provides data lakes that can ingest data from open source environments (e.g., data.gov) or third-party providers, making contextual and project-specific data available. The environment includes tools to pull data from open APIs like Socrata and CKAN. IBM also provides InfoSphere Information Governance Catalog, a metadata management solution that helps to manage and explore data lineage.

The main drawback of solutions from traditional data integration vendors is integration with third-party systems: although most of them include some integration mechanism, it may complicate the data lake process. Moreover, they usually require a heavy investment in technical infrastructure and in people with skills specific to the vendor's product.

Tooling From Open Source Projects

Teradata Kylo is a sample framework for delivering data lakes in Hadoop and Spark. It includes a user interface for data ingestion and wrangling and provides metadata tracking. Kylo uses Apache NiFi for orchestrating the data pipeline.

Apache NiFi is an open source project developed under the Apache ecosystem and supported by Hortonworks as DataFlow. NiFi is an integrated data logistics platform for automating the movement of data between disparate systems. It provides data buffering and provenance when moving data, using visual commands (i.e., drag and drop) and control in a web-based user interface.

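
To make the idea of buffering and provenance during data movement more concrete, the following is a minimal Python sketch. It is not NiFi's API or implementation: the class names `BufferedMover` and `ProvenanceEvent`, the event labels `RECEIVE` and `SEND`, and the system names used in the usage note are all hypothetical, chosen only to illustrate the concept of a bounded buffer that records an event each time a record enters or leaves it.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Callable, Deque, List, Tuple


@dataclass
class ProvenanceEvent:
    """One entry in the movement history of a record (hypothetical schema)."""
    record_id: int
    event_type: str  # "RECEIVE" or "SEND" (illustrative labels, not NiFi's)
    component: str   # name of the system the record came from or went to


@dataclass
class BufferedMover:
    """Moves records from a source to a sink through a bounded buffer."""
    source_name: str
    sink_name: str
    capacity: int = 2
    queue: Deque[Tuple[int, Any]] = field(default_factory=deque)
    events: List[ProvenanceEvent] = field(default_factory=list)
    next_id: int = 0

    def receive(self, payload: Any) -> bool:
        """Buffer one record; refuse it when the buffer is already full."""
        if len(self.queue) >= self.capacity:
            return False
        record_id = self.next_id
        self.next_id += 1
        self.queue.append((record_id, payload))
        self.events.append(ProvenanceEvent(record_id, "RECEIVE", self.source_name))
        return True

    def flush(self, deliver: Callable[[Any], None]) -> int:
        """Hand every buffered record to `deliver`, logging a SEND event for each."""
        sent = 0
        while self.queue:
            record_id, payload = self.queue.popleft()
            deliver(payload)
            self.events.append(ProvenanceEvent(record_id, "SEND", self.sink_name))
            sent += 1
        return sent

    def history(self, record_id: int) -> List[str]:
        """Return the ordered event types recorded for one record."""
        return [e.event_type for e in self.events if e.record_id == record_id]
```

In this sketch, a caller would create something like `BufferedMover("crm_export", "data_lake")`, call `receive` for each incoming record, and later call `flush` with a function that writes to the destination. The `history` method then answers the provenance question "what happened to this record, and in what order?".
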
Apache Atlas is another solution, currently in the incubator state. Atlas is a scalable and extensible set of core foundational governance services. It provides support for data classification, centralized auditing, search, and lineage across Hadoop components.

Chapter 1: Understanding Metadata: Create the Foundation for a Scalable Data Architecture
