Data lake environments are growing from tens of terabytes to petabytes. As a result, more enterprises are asking how they should govern something so large and complex. From our perspective, a policy-based (or attribute-based) security model is essential to securing the data lake at enterprise scale. By leveraging metadata, a policy-based security model automates permissions and access, and we see it as the only practical way to confidently secure big data while still allowing the access necessary to democratize use and derive value from the data lake.
Why policy-based versus other options, such as role-based? A policy-based model enables more flexibility to control access based on policies that consider a combination of attributes in addition to the user’s role; e.g., the type of data being accessed, the desired action, and the context in which it is being accessed.
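To make the contrast with role-based checks concrete, here is a minimal sketch of a policy decision that weighs the user's role together with the type of data, the desired action and the context of the request. The attribute names, the `POLICIES` list and the `is_allowed` function are all hypothetical and chosen for illustration; a real deployment would load its policies from whatever governance platform is in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    role: str       # the user's role
    data_type: str  # type of data being accessed, e.g. "pii"
    action: str     # desired action, e.g. "read"
    network: str    # context of the access, e.g. "corporate" or "public"


# Hypothetical policies: each one lists the attribute values it permits.
POLICIES = [
    {"role": {"analyst", "steward"}, "data_type": {"public", "internal"},
     "action": {"read"}, "network": {"corporate", "public"}},
    {"role": {"steward"}, "data_type": {"pii"},
     "action": {"read", "write"}, "network": {"corporate"}},
]


def is_allowed(request: AccessRequest) -> bool:
    """Allow the request if any policy matches every attribute; deny otherwise."""
    return any(
        all(getattr(request, attr) in allowed for attr, allowed in policy.items())
        for policy in POLICIES
    )
```

In this sketch the same steward who may read PII from the corporate network is denied when the request comes from a public network, a distinction a role check alone cannot express.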
As you develop your policy-based data lake security strategy, we recommend taking the time to consider three important areas.
Key #1: Encryption
Determine which data needs to be encrypted while in transit (both coming into the data lake and being extracted from it) and which data needs to be encrypted while at rest in the data lake. Also decide where in the stack you will enable policy-based encryption, e.g., at the file system, storage or application layer. You'll want to put rules in place that automate encryption so that you comply with industry and other regulations.
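One way to picture such rules is a lookup from a dataset's classification metadata to its encryption requirements. The sketch below is only an illustration: the classification tags, the `ENCRYPTION_RULES` table and the `encryption_requirements` function are assumptions, not part of any particular product.

```python
# Hypothetical rule table keyed by a dataset's classification tag.
ENCRYPTION_RULES = {
    "public":    {"in_transit": False, "at_rest": False, "layer": None},
    "internal":  {"in_transit": True,  "at_rest": True,  "layer": "storage"},
    "regulated": {"in_transit": True,  "at_rest": True,  "layer": "application"},
}

# Datasets with a missing or unknown tag get the strictest treatment.
DEFAULT_RULE = {"in_transit": True, "at_rest": True, "layer": "application"}


def encryption_requirements(metadata: dict) -> dict:
    """Look up the encryption rule for a dataset from its classification tag."""
    return ENCRYPTION_RULES.get(metadata.get("classification"), DEFAULT_RULE)
```

Falling back to the strictest rule for untagged data is a design choice in this sketch: a dataset that has not yet been classified is treated as sensitive until someone says otherwise.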
Key #2: Access control
Ultimately, the data lake should be made available to many data users via a self-service data platform that leverages metadata to enable users to discover, curate and prepare datasets from the data lake as well as from other systems across the enterprise. A data catalog provides access while maintaining the required governance policies and controls. As the data lake environment becomes shared and widely used, a key requirement is policy-based access control to the data, based on attributes such as roles and business units or on functional groups defined within the organization. How you implement those policies, and how you carry them into the data access layer so that they can be enforced there, is important.
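As a rough illustration of enforcement at the data access layer, the sketch below filters catalog entries by the business units attached to each dataset's metadata. The `CATALOG` entries and the `visible_datasets` function are hypothetical.

```python
# Hypothetical catalog entries: each dataset is tagged with the business
# units that are allowed to discover and use it.
CATALOG = [
    {"name": "sales_orders", "business_units": {"sales", "finance"}},
    {"name": "hr_salaries", "business_units": {"hr"}},
    {"name": "web_clickstream", "business_units": {"marketing", "sales"}},
]


def visible_datasets(user_units: set) -> list:
    """Return the names of datasets shared with at least one of the user's units."""
    return [
        entry["name"]
        for entry in CATALOG
        if entry["business_units"] & user_units  # non-empty intersection
    ]
```

A user in the sales unit would see `sales_orders` and `web_clickstream` here, while `hr_salaries` never appears in their view of the catalog.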
Key #3: Data privacy
Masking personally identifiable information (PII), protected health information (PHI) and payment card industry (PCI) data helps you comply with industry and other regulations. For attributes that are sensitive, consider how you will do application-level masking and tokenization, so that you are either obfuscating the field or replacing the original field with a tokenized value. Your strategy should allow you to maintain the mapping to the original value in a secure area accessible only by privileged users, while the majority of users see only the tokenized value.
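The difference between the two approaches can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `TokenVault` class, its in-memory dictionaries and the `privileged` flag are hypothetical stand-ins for a secured token store and a real authorization check.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory stand-in for a secured token mapping store."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        """Replace a sensitive value with a random token, reusing prior tokens."""
        if value not in self._value_to_token:
            token = "tok_" + secrets.token_hex(8)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token: str, privileged: bool) -> str:
        """Return the original value, but only for privileged callers."""
        if not privileged:
            raise PermissionError("only privileged users may detokenize")
        return self._token_to_value[token]


def mask(value: str, visible: int = 4) -> str:
    """Obfuscate all but the last `visible` characters of a field."""
    hidden = max(len(value) - visible, 0)
    return "*" * hidden + value[hidden:]
```

Masking is one-way, so `mask("123-45-6789")` yields `*******6789` and the original cannot be recovered from it. Tokenization keeps a mapping, which is why the vault itself has to live in the secure area described above.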
The amount and types of data that enterprises collect are only increasing, making data ecosystems more and more complex. Putting a data management platform and policies in place to operationalize governance and automate processes is the only way for enterprises to confidently secure data at a level that meets or exceeds industry regulations.
About the Author
Parth Patel, Big Data Solutions Engineer - RTP Raleigh NC