Zaloni Zip: Data Quality

November 22, 2016 Adam Diaz

The term “data quality” refers not only to the properties that separate good data from bad data but also to what is done with that data once the determination has been made.

The first step in the process of separating good data from bad data might be as simple as filtering out missing values. It might be more complex, such as making sure an SSN field has a value and follows the correct numerical pattern. We could even implement sets of rules that check multiple columns, each with its own properties.
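To make these kinds of checks concrete, here is a minimal sketch in Python. The column names, the SSN format, and the rules themselves are illustrative assumptions; Bedrock's own rule engine is not shown here.

```python
import re

# Assumed SSN layout: three digits, two digits, four digits, dash-separated.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

# One rule per column, each with its own property to satisfy
# (hypothetical columns for illustration).
RULES = {
    "name": lambda v: v is not None and v.strip() != "",          # no missing values
    "ssn":  lambda v: v is not None and bool(SSN_PATTERN.match(v)),
    "age":  lambda v: v is not None and v.isdigit() and 0 < int(v) < 150,
}

def is_good(record):
    """A record passes only if every column satisfies its rule."""
    return all(rule(record.get(col)) for col, rule in RULES.items())

print(is_good({"name": "Ada", "ssn": "123-45-6789", "age": "36"}))  # True
print(is_good({"name": "",    "ssn": "123456789",   "age": "36"}))  # False
```

The same pattern scales from a single missing-value filter to an arbitrary set of per-column rules.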

The second step involves the actual use of that data. Once we confirm the data passes our quality standards, we can load it into an external Hive table at a specific location in HDFS. Equally important, what do we do with the bad data? Do we simply delete it? Do we copy it and archive it? The point is that there is also a process for data that fails the checks.
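The routing of passing and failing records could be sketched as follows. This local-filesystem example is purely illustrative (the function, directory layout, and file names are made up); against HDFS you would write through the appropriate filesystem client instead.

```python
import csv
import os

def route_records(records, good_dir, bad_dir, rule):
    """Write passing rows to the directory backing the 'good' dataset and
    failing rows to an archive directory, so bad data is kept, not deleted.
    Hypothetical helper: paths and names are illustrative only."""
    os.makedirs(good_dir, exist_ok=True)
    os.makedirs(bad_dir, exist_ok=True)
    good_path = os.path.join(good_dir, "part-0000.csv")
    bad_path = os.path.join(bad_dir, "rejected.csv")
    with open(good_path, "w", newline="") as g, open(bad_path, "w", newline="") as b:
        good_writer, bad_writer = csv.writer(g), csv.writer(b)
        for rec in records:
            # Route each record based on the quality rule.
            (good_writer if rule(rec) else bad_writer).writerow(rec)
    return good_path, bad_path
```

The directory of passing rows could then serve as the location for an external Hive table, while the rejected file is archived for later review.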

In this video, simple examples are used to represent what can be a much more complex process in the Bedrock platform. This includes separating good data from bad data, as well as the actions performed on that data in both cases.


To explore additional topics related to your data, please see more about the big data ecosystem or learn more about Bedrock.

About the Author

Adam Diaz

Director of Field Engineering Sales - RTP Raleigh NC
