In the past year, the focus of big data has expanded from creating new streaming and computing frameworks into creating ways to manage and control these frameworks. Unfortunately, none of the tools for these frameworks provide a complete enough set of governance and management functionality to operate alone.
Often, we see multiple tools deployed over a single environment. Although this can improve functional coverage, arranging these tools into a coherent system can be difficult, time-consuming, and inefficient. Let’s look at a few of these tools, where each fits best, and why considering a data lake architecture helps to simplify governance and security.
Ambari and Hue
Ambari and Hue provide web-based UIs, along with a series of RESTful APIs, aimed at facilitating Hadoop cluster management and administration. These tools do a good job of exactly that: making it easier to access and tweak settings without having to modify configuration files manually. They also include tools for monitoring, provisioning, and integrating with Hadoop clusters. However, their metadata management, multi-cluster management, ingestion control, and security enforcement are fairly limited, and are adequate only for the most basic clusters.
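As a rough illustration of what the RESTful side of these tools looks like, here is a minimal Python sketch that builds (but does not send) a request for the list of clusters an Ambari server manages. The host name, port, and credentials are placeholders chosen for the example, and the helper function is hypothetical; it assumes Ambari's `/api/v1/clusters` endpoint with HTTP basic authentication.

```python
import base64
import urllib.request


def build_cluster_list_request(host, user, password, port=8080):
    """Build a GET request for Ambari's cluster list without sending it.

    Assumes the Ambari REST API is served under /api/v1 and protected
    with HTTP basic authentication. All argument values are supplied
    by the caller; nothing here is specific to a real deployment.
    """
    url = f"http://{host}:{port}/api/v1/clusters"
    credentials = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")

    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder host and credentials, for illustration only.
request = build_cluster_list_request("ambari.example.com", "admin", "admin")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the call against a live server. The point is that cluster settings become scriptable, not that this covers governance.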
Atlas is an open-source tool developed to provide exchange, modeling, and management of metadata within the Hadoop ecosystem. Atlas essentially bridges metadata between Hadoop and non-Hadoop tools, managed from a central location where schema, security, and other administrative features can be defined. On its own, Atlas offers little functionality beyond metadata management, but it is tightly integrated into the standard Hortonworks distribution (HDP), where it leans on other components such as Ranger to provide the security and entitlement layer. This is an acceptable model for those willing to adopt HDP as their Hadoop distribution, but those wishing to implement Atlas standalone will need to build the integrations themselves, which is no small task.
Navigator is Cloudera’s answer to Atlas, aiming to provide a unified metadata platform. In addition, Navigator includes tracking, entitlement and audit functionality, although it still requires a supporting security framework (such as Cloudera-supported Sentry). This comes at the cost of Navigator being proprietary to the Cloudera distribution, as it is closed-source and will only operate within a Cloudera cluster. In this sense, Navigator can be compared to an Atlas/Ranger deployment, with slightly tighter integration and less customization available. Cloudera does offer several third-party partner integrations, but these are reliant on relationships brokered by Cloudera itself, instead of open-source community efforts.
If there are so many issues with existing Hadoop governance tools, why not just use IBM or Informatica? Traditional governance processes are applied only after data has been structured and ingested, and this simply doesn’t work for modern architectures and business demands, which is a large part of why Hadoop emerged as an alternative to traditional ETL systems. For more agile, high-volume applications, a more flexible solution is necessary.
Arranging Tools into a Solution (Sort Of)
So, if the available governance and management solutions are all deficient in some way, how do we manage and govern Hadoop without creating unnecessary complexity? One approach is to layer multiple tools together, which is what Hortonworks and Cloudera have done. In the simplest case, the following stack illustrates how the components we’ve discussed align.
Depending on the components used, access points can differ greatly, and each component brings its own performance, integration, and security considerations. This diagram does not even begin to consider the various other products available in the security and governance space. While staying within a single distro can help to reduce the complexity, it does not eliminate the headache by any means. Hadoop is by nature a constantly shifting animal, and from release to release, even within a distribution, functionality can change drastically. This can cause issues for the external consumers and users of the system, who, after all, are the ones who must ultimately support and endorse it.
A Data Lake Approach
While the standard data lake architecture does not enforce a single approach to security or governance, thinking about your Hadoop ecosystem in terms of zones provides a logical, well-defined baseline for one. By separating data into zones such as Transient, Raw, Refined and Trusted, the task of architecting a governance solution appropriate to your business needs becomes a more manageable problem.
For example, we may know that all transient and raw data needs to be protected and must be retained for a minimum of 2 years. This knowledge helps us to create a clear picture of how these zones will look, and what kind of policies will need to be defined, regardless of what specific tools we will use.
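To make the idea of tool-independent zone policies concrete, here is a minimal Python sketch that writes the example above down as data. The `ZonePolicy` class, the `may_delete` helper, and the choice to leave the Refined and Trusted zones undecided are all assumptions made for illustration; only the protection and two-year retention rule for transient and raw data comes from the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class ZonePolicy:
    """Hypothetical, tool-independent description of one zone's rules."""
    name: str
    protected: Optional[bool]          # None means "not yet decided"
    min_retention_days: Optional[int]  # None means "no minimum defined"


TWO_YEARS_IN_DAYS = 2 * 365  # simplification: ignores leap days

POLICIES = {
    # From the example: protected, kept for at least 2 years.
    "transient": ZonePolicy("transient", True, TWO_YEARS_IN_DAYS),
    "raw": ZonePolicy("raw", True, TWO_YEARS_IN_DAYS),
    # Assumption: left undecided here; the business must fill these in.
    "refined": ZonePolicy("refined", None, None),
    "trusted": ZonePolicy("trusted", None, None),
}


def may_delete(zone: str, ingested_on: date, today: date) -> bool:
    """Return True if the zone's minimum retention period has elapsed."""
    policy = POLICIES[zone]
    if policy.min_retention_days is None:
        return True
    age = today - ingested_on
    return age >= timedelta(days=policy.min_retention_days)


# One-year-old raw data is still inside its retention window.
print(may_delete("raw", date(2023, 1, 1), date(2024, 1, 1)))
```

Whatever tool ends up enforcing these rules, writing them down in this form first makes it easier to check that each zone's policy is actually covered.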
Other governance-related questions that can be asked once we adopt a Data Lake architecture include:
- How many zones will I leverage, and what will each be used for?
- What is the main point of access for end users? Will they be accessing raw data, or refined/trusted data?
- What are the access protocols used by the business? Is the usage API-heavy, or does it rely largely on manual access by users?
- What are the external systems that I am dealing with? What data/metadata will I be sharing with these systems, and how/where will it be stored?
- What are the specific retention, security and access policies surrounding my data?
These are only a few of the many questions that need to be asked as a data lake is established, but if they are properly answered, they will help establish a solid foundation for the growth and evolution of your environment. Together, Bedrock and Mica are one way to guide the development, operationalization, and maturation of your data lake with built-in governance, orchestration, and integration features, helping to answer these (and other) questions as your Hadoop environment grows.
Check back for future articles about data lake solutions, including specific tasks and functionality that support governance and security.