Data Lake Maturity Model

Issue link: https://resources.zaloni.com/i/1078782


store for key/value, document, or columnar formats, which are concepts that we discuss in the next section. Numerous tools exist for translating data between HDFS and other storage systems, both relational and NoSQL.

Characteristics of Big Data

Let's put together a tiny bit of data that illustrates the traits of big data and how it affects storage choices. We're going to start a clothing list that would let a woman figure out what's missing in her wardrobe, look for combinations that match, and make a travel list. Here's a sample of the list:

• blouse, chartreuse, linen
• shoes, black, leather, flat
• laundry bag
• blouse, white, cotton
• pants, blue, denim
• clip, silver

Each line is a record, and each field is delimited by commas. We always start with the item itself (blouse, shoes, etc.). Many data processing tasks require a single field to represent the whole record, so this first field will be considered the key. Note that it's not represented in the list in any special way. We just assume that the first field will be treated as the key.

The key doesn't need to be unique to a single record, as relational databases usually require. In our short list, for instance, we have two blouses.

It's also important to note that fields can be arbitrary. We try to impose some order by putting a color right after the name of the item, and then a material as the third field. But big data is messy, so not all records conform. The clip's color is not indicated here, and the laundry bag has neither color nor material. Finally, the shoes have a fourth field to indicate the style.

Big data algorithms can handle this kind of variety; that is one of the strengths that makes them more appropriate than relational databases for the messy data that comes into many organizations. But it's

Structure and Purpose of a Data Lake | 5
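
The record layout described above (comma-delimited fields, first field as a non-unique key, variable field counts) can be illustrated with a minimal Python sketch. The names `RAW_LIST`, `parse_records`, and `group_by_key` are hypothetical and chosen only for this illustration; the text does not prescribe any particular implementation, and a real big data system would do this work at much larger scale.

```python
from collections import defaultdict

# The sample clothing list from the text, one record per line.
RAW_LIST = """\
blouse, chartreuse, linen
shoes, black, leather, flat
laundry bag
blouse, white, cotton
pants, blue, denim
clip, silver
"""


def parse_records(text):
    """Split each line on commas; the first field is treated as the key.

    No fixed number of fields is assumed, so short or long records
    are accepted as they are.
    """
    records = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if not fields[0]:
            continue  # skip blank lines
        key, values = fields[0], fields[1:]
        records.append((key, values))
    return records


def group_by_key(records):
    """Collect the remaining fields of every record under its key.

    Keys are not required to be unique, so each key maps to a list
    of records rather than to a single record.
    """
    grouped = defaultdict(list)
    for key, values in records:
        grouped[key].append(values)
    return dict(grouped)


records = parse_records(RAW_LIST)
wardrobe = group_by_key(records)

for key, entries in wardrobe.items():
    print(key, entries)
```

In this sketch the two blouses end up under one key, the laundry bag keeps an empty field list, and the shoes keep their extra fourth field, which mirrors the variety the text describes.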
