Data Lake Maturity Model

Issue link: https://resources.zaloni.com/i/1078782


color or material) besides the key can be cumbersome and time consuming. Keys often are not unique.

• Fields can be duplicated. For instance, if you own five white items, the term "white" will appear five times. Relational database theory dislikes such repetition, and will try to eliminate it, introducing considerable complexity along the way and slowing down queries. Big data usually thrives on duplicated fields and can process such records much faster.

As we've seen, the storage of redundant fields makes big data even bigger. Modern methods of storing big data, such as the aforementioned HDFS, are designed to handle large files quickly. They are also optimized for the kinds of algorithms that data scientists run on big data, which we discuss next.

Tools for Big Data Processing

In 2004, the world awoke to the need for new algorithms to process big data when two engineers at Google, Jeffrey Dean and Sanjay Ghemawat, published a paper titled "MapReduce: Simplified Data Processing on Large Clusters." It described the new class of algorithms Google had discovered for indexing documents (a critical step in creating search databases).

MapReduce, as the name implies, is a two-step algorithm. It accepts input data as a key/value store, or creates a key, as part of the Map step. The Map step determines all of the values associated with a key; for instance, all the web pages that contain a word. This step is highly parallelizable. The Reduce step takes all output of the Map jobs and creates the final dataset linking documents to words. As the subtitle of the article indicates, the algorithm is designed from the start to distribute work among large numbers of computers.

MapReduce suddenly appeared to be the solution to a wide swath of data problems throughout many industries.
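
The word-to-page example can be sketched in a few lines of plain Python. This is only an illustration of the two steps running in a single process, not Google's or Hadoop's implementation: the toy corpus `PAGES` and the helper names `map_page`, `shuffle`, `reduce_word`, and `build_index` are all hypothetical, and `shuffle` stands in for the grouping by key that a real framework would perform between the Map and Reduce steps.

```python
from collections import defaultdict

# Hypothetical toy corpus (page name -> text); not taken from the report.
PAGES = {
    "page1": "big data lake",
    "page2": "data lake storage",
    "page3": "big storage",
}


def map_page(page_id, text):
    """Map step: emit one (word, page_id) pair per distinct word on a page."""
    return [(word, page_id) for word in sorted(set(text.split()))]


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's grouping."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_word(word, page_ids):
    """Reduce step: collapse all values for one key into a final record."""
    return word, sorted(page_ids)


def build_index(pages):
    # Each map_page call is independent of the others, which is why this
    # step could be spread across many machines.
    emitted = [
        pair
        for page_id, text in pages.items()
        for pair in map_page(page_id, text)
    ]
    return dict(reduce_word(word, ids) for word, ids in shuffle(emitted).items())


index = build_index(PAGES)
```

Looking up `index["data"]` returns the pages that contain that word. Because no `map_page` call depends on another, the Map work could in principle be handed to separate machines, with only the grouping step needing to see all of the emitted pairs.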
When the open source implementation Hadoop was released in 2006, with the kind of Java interface that was standard for data processing in those days, it quickly became a résumé builder for programmers and data managers. However, the MapReduce algorithm turned out to be restrictive. For instance, you might need to run several Map jobs in a row, and Hadoop makes this difficult to do: you must store results from the
