Governed Data Lakes and the Race to GDPR Compliance

June 15, 2017 Scott Gidley

Enterprise-wide data governance is the lynchpin to achieving compliance with the European Union’s General Data Protection Regulation (GDPR), which goes into effect May 25, 2018. Meeting GDPR requirements is a tall order for most organizations. With the deadline less than a year away, do-it-yourself may not be a realistic option for many organizations.

Fortunately, there are comprehensive solutions available to more quickly bring your data architecture up to snuff. We believe that integrating a governed data lake into an existing data environment is a smart, practical way for organizations to become GDPR compliant – as well as set themselves up to derive more value from their data in the future.

Data governance refers to a company’s policies and procedures that manage the use, access, availability, quality, privacy, and security of data. Governance brings consistency to data practices and controls across an organization’s systems and data sources, no matter where they reside. This is becoming more complex as data types, sources and applications proliferate, the volume of data grows, and organizations store and process data both on premises and in the cloud.

Data governance is essential for organizations to meet GDPR requirements for protecting the security and privacy of consumers’ personal information. A data lake with a data lake management and governance platform and self-service data catalog enables you to meet requirements such as:

  • Tracking what data is in the data lake – its source, format, and lineage
  • Understanding the data quality of all data types
  • Restricting or permitting access to data with enterprise-wide controls
  • Pseudonymizing/tokenizing, masking, and encrypting sensitive data
  • Automating the organization of data, logically and transparently, in whatever way makes sense for the business, such as by category, user access, or data quality
  • Enabling broader, controlled access to data to eliminate IT bottlenecks and increase agility for reporting
  • Automating repeatable processes and operationalizing workflows to ensure consistency, reduce human error, and increase transparency

Integrate a data lake into your existing data environment 

Ideally a data lake serves as a centralized repository for data from any source for a holistic view of all of your data; however, in reality, it is rarely that neat. Instead, a data lake is typically integrated into an organization’s existing data environment to ingest, store and process unstructured and streaming data, and/or to augment a data warehouse by providing faster, cheaper processing.

What we call a “next-generation architecture,” which includes a data lake, enables organizations to take a hybrid approach to create a federated data platform and use a data catalog to have a 360-degree view of data. Users or processors can more quickly discover and access data regardless of where it is located, bring data into the data lake for data preparation, and then feed datasets back out to analytics tools.

Create a metadata-rich foundation

If you don’t have metadata, you can’t apply and enforce governance policies – you can’t govern what you don’t know. With a data lake management and governance platform, metadata, or information about data, can be applied automatically upon ingestion of all data types into the data lake. Only by applying technical, operational and business metadata to all of your data can you understand what you have and put that metadata to work in managing and using the data.
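To make the three kinds of metadata concrete, here is a minimal Python sketch of capturing them at ingestion time. It is an illustration only, not any particular platform's API: the CatalogEntry record, its field names, and the ingest function are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    """Hypothetical catalog record; the field names are illustrative only."""
    # Technical metadata: derived from the data itself
    name: str
    fmt: str
    size_bytes: int
    sha256: str
    # Operational metadata: where the data came from and when it arrived
    source: str
    ingested_at: str
    # Business metadata: supplied by data owners or stewards
    tags: list = field(default_factory=list)


def ingest(name: str, payload: bytes, source: str, tags=None) -> CatalogEntry:
    """Build a catalog entry at ingestion time so nothing lands untracked."""
    fmt = name.rsplit(".", 1)[-1].lower() if "." in name else "unknown"
    return CatalogEntry(
        name=name,
        fmt=fmt,
        size_bytes=len(payload),
        sha256=hashlib.sha256(payload).hexdigest(),
        source=source,
        ingested_at=datetime.now(timezone.utc).isoformat(),
        tags=list(tags or []),
    )


entry = ingest("customers.csv", b"id,email\n1,a@example.com\n", "crm", ["personal_data"])
```

Because the entry is created in the same step that lands the data, every later policy (zoning, retention, masking) has something to key off.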

Categorize information and control access

Another key feature of a governed data lake is the ability to organize data based on any criteria, such as GDPR’s definitions of standard (e.g., typical identifiers) and “special” personal data (e.g., biometric, racial, religious). For example, you can set up zones for your data in the data lake and use rules related to data quality, data type, controller/processor access, security and privacy (e.g., child protection, right to be forgotten, customer consent) to automatically move data into and between zones. The zone the data resides in can indicate the degree of confidence, level of access, or appropriate use for the data.
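Rule-based zoning of this kind might look something like the following Python sketch. The zone names ("restricted", "quarantine", "raw", "trusted"), the metadata keys, and the 0.8 quality threshold are assumptions chosen for illustration; a real deployment would define its own.

```python
# Special categories named in the text above; a real list would be longer.
SPECIAL_CATEGORIES = {"biometric", "racial", "religious"}


def assign_zone(meta: dict) -> str:
    """Return the zone a dataset belongs in, checking the strictest rules first."""
    tags = set(meta.get("tags", []))
    if tags & SPECIAL_CATEGORIES:
        # "Special" personal data gets the tightest access controls
        return "restricted"
    if "personal_data" in tags and not meta.get("consent", False):
        # Personal data without recorded consent is held back from general use
        return "quarantine"
    if meta.get("quality_score", 0.0) < 0.8:
        # Data that has not passed quality checks stays in the raw zone
        return "raw"
    return "trusted"
```

Ordering the rules from strictest to most permissive means a dataset that matches several rules always ends up in the most protective zone.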

Implement data retention policies

Metadata is also useful for implementing data retention policies. With a data lake management and governance platform, organizations can control data lifecycle management at the scale of big data through data retention policies that automatically tier data within the data lake based on whatever criteria make sense for the organization. Old or irrelevant data can be deleted or moved to more cost-effective storage in the cloud.
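As a rough sketch of how a metadata-driven retention policy could be evaluated, consider the Python below. The one-year and seven-year thresholds and the action names are hypothetical placeholders, not recommended retention periods.

```python
from datetime import date, timedelta

# Hypothetical policy: (minimum age, action), listed strictest first
RETENTION_POLICY = [
    (timedelta(days=365 * 7), "delete"),
    (timedelta(days=365), "archive"),
]


def retention_action(last_used: date, today: date, policy=RETENTION_POLICY) -> str:
    """Return the first action whose age threshold the dataset has reached."""
    age = today - last_used
    for min_age, action in policy:
        if age >= min_age:
            return action
    return "keep"
```

Passing the current date in as an argument, rather than reading the clock inside the function, keeps the policy easy to test and audit.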

Ensure privacy and security

A data lake management and governance platform allows you to create policies to automatically pseudonymize or tokenize, mask and encrypt sensitive personal data to protect individuals’ privacy rights. Further, you can implement a policy-based or attribute-based, enterprise-scale security model for the data lake. Leveraging metadata, a policy-based security model automates permissions and access – and is really the only way to confidently secure big data while still allowing the access necessary for business users or IoT.
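For a sense of what pseudonymizing and masking mean in practice, here is a small Python sketch using only the standard library. The function names are hypothetical, and the keyed hash (HMAC-SHA256) is one common pseudonymization technique chosen for illustration; it is not encryption and not vault-based tokenization, and the key is assumed to be stored separately from the data.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes) -> str:
    """Keyed hash: the same input and key give the same token, so joins still work."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_email(email: str) -> str:
    """Keep the first character and the domain; hide the rest of the local part."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        # Not a recognizable address, so reveal nothing
        return "***"
    return local[0] + "*" * (len(local) - 1) + "@" + domain


secret_key = b"example-key-keep-me-out-of-the-data-lake"  # placeholder value
token = pseudonymize("alice@example.com", secret_key)
masked = mask_email("alice@example.com")
```

The pseudonym lets analysts count and join on a customer without seeing the address, while the mask is meant for display where a hint of the original value is acceptable.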

Automate, automate, automate!

Consistent, enterprise-wide data governance at the scale of big data requires that you automate key processes including data ingestion, metadata management, security, data privacy, and data lifecycle management. In addition, operationalizing data analytics processes enables faster results, such as for reporting. A robust data management and governance platform can facilitate the automation you need.

Serious financial penalties for noncompliance mean you can’t afford not to make it to the GDPR finish line. You can do it! We can help. Talk with us to discuss your specific needs and how a next-generation architecture including a governed data lake can prepare your organization for GDPR and your data future.

Learn more about data governance and GDPR:

Why Good Data Governance is Good Business Beyond GDPR
Get Your Big Data House in Order for GDPR

About the Author

Scott Gidley

Scott Gidley is Vice President of Product Management for Zaloni, where he is responsible for the strategy and roadmap of existing and future products within the Zaloni portfolio. Scott is a nearly 20-year veteran of the data management software and services market. Prior to joining Zaloni, Scott served as senior director of product management at SAS and was previously CTO and cofounder of DataFlux Corporation.
