The term “data quality” refers not only to the properties that distinguish good data from bad data but also to what is done with that data once the decision has been made.
The first step, separating good data from bad data, might be as simple as filtering out missing values. It might be more complex, such as making sure an SSN field has a value and follows the correct numerical pattern. We could even implement sets of rules that check multiple columns, each with its own properties.
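As a rough illustration of the checks described above, here is a minimal Python sketch. It is not how the Zaloni platform implements its rules. The column names (`name`, `ssn`), the helper functions, and the assumed SSN layout of `NNN-NN-NNNN` are all hypothetical choices made for the example.

```python
import re

# Assumption: an SSN is written as three digits, two digits, and four digits
# separated by hyphens. Real rules may accept other layouts.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")


def is_present(value):
    """Return True when the value is neither missing nor blank."""
    return value is not None and str(value).strip() != ""


def is_valid_ssn(value):
    """Return True when the value is present and matches the assumed pattern."""
    return is_present(value) and SSN_PATTERN.fullmatch(str(value).strip()) is not None


# Hypothetical rule set: one check per column, each with its own property.
RULES = {
    "name": is_present,
    "ssn": is_valid_ssn,
}


def failed_columns(record):
    """Return the names of the columns in a record that break their rule."""
    return [column for column, rule in RULES.items() if not rule(record.get(column))]
```

A record is "good" when `failed_columns` returns an empty list. For example, `failed_columns({"name": "Ann", "ssn": "123-45-6789"})` returns `[]`, while a record with a blank name and an unhyphenated SSN returns `["name", "ssn"]`.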
The second step involves the actual use of that data. Once we confirm that the data passes our quality standards, we can load it into an external Hive table at a specific location in HDFS. Equally, what do we do with bad data? Do we simply delete it? Do we copy it and archive it? The point is that data judged to be bad needs a defined process as well.
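The routing step can be sketched in the same spirit. The example below is only an assumption-laden stand-in: it writes CSV files to local directories rather than to HDFS or a Hive table location, the quality check is a placeholder, and the function names and the `part-0000.csv` file name are invented for illustration.

```python
import csv
from pathlib import Path


def is_good(record):
    """Placeholder quality check: every field must be present and non-blank."""
    return all(v is not None and str(v).strip() != "" for v in record.values())


def split_records(records):
    """Partition records into (good, bad) lists using the quality check."""
    good, bad = [], []
    for record in records:
        (good if is_good(record) else bad).append(record)
    return good, bad


def write_csv(records, path, fieldnames):
    """Write records to a CSV file, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)


def route(records, fieldnames, good_dir, bad_dir):
    """Send good records to one location and archive bad records in another.

    Both directories are local stand-ins for wherever the good data is
    published and the bad data is quarantined.
    """
    good, bad = split_records(records)
    write_csv(good, Path(good_dir) / "part-0000.csv", fieldnames)
    write_csv(bad, Path(bad_dir) / "part-0000.csv", fieldnames)
    return len(good), len(bad)
```

Here the bad records are archived rather than deleted, which is just one of the options raised above. Deleting them would amount to skipping the second `write_csv` call.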
In this video, simple examples are used to represent what can be a much more complex process in the Zaloni Data Lake Management Platform. That process includes deciding between good data and bad data, and then the action performed on the data in each case.
About the Author
Big data & Hadoop thought leader. More content by Adam Diaz