Data Lake Maturity Model


Structure and Purpose of a Data Lake

Data lakes are loose confederations of databases that can have different structures, come from different vendors, and be processed through different tools. Such data lakes are well suited to the modern era, in which you might be hungry for data from any source: your own sales records, third-party data brokers, freely available government records, data streamed from sensors in the field, and whatever else comes along.

The power of Google, Alibaba, and other large online companies lies largely in the varieties of data they possess, along with their intelligence at making use of it. The extent of their data collection can actually be quite frightening. Google knows with whom you exchange email, where you travel, what interests you enough to do a search, and even (thanks to its affiliate DoubleClick, which Google bought in 2007) which websites you visit. Alibaba has a huge amount of information about your purchases, as it "datafies" every customer exchange, as well as about drivers who use its navigation service. So far, these companies have used this power mostly for good, improving their services to each visitor. Of course, the recent scandal involving Cambridge Analytica and Facebook shows the darker side (although it's not clear whether Cambridge Analytica was accurate enough to have any real influence on elections).

Relational databases range from slow to intolerably sluggish when trying to incorporate, represent, and serve up such varieties of data. So what works better? Data experts are experimenting with technologies for big data all the time. That's why in the 2000s we saw a huge number of "NoSQL" databases sprout up: the open source Cassandra, CouchDB, MongoDB, and so on, plus various proprietary variants offered by cloud providers such as Amazon.com, Microsoft, and Google.
All of these data stores are designed to partition data efficiently among multiple computer systems, an absolute necessity given the volumes of data they must store, and to seamlessly handle common data management tasks such as replication and failure recovery.

Because Hadoop was the first major open source tool to process big data algorithms (it was introduced around 2006), its storage layer, the Hadoop Distributed File System (HDFS), became a de facto standard. It is agnostic as to data formats but is usually an underlying
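The partitioning-plus-replication idea that these stores rely on can be sketched in a few lines. This is an illustrative toy, not the API of Cassandra, MongoDB, or any other product: a record key is hashed to pick a primary node, and copies go to the next few nodes, so the data survives a node failure. The function names and the replication factor of 3 are assumptions made for the example.

```python
import hashlib

def partition(key: str, num_nodes: int) -> int:
    """Map a record key to one of num_nodes storage nodes by hashing.

    Hashing (rather than, say, alphabetical ranges) spreads keys
    roughly evenly, so no single node becomes a hot spot.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

def replicas(key: str, num_nodes: int, replication_factor: int = 3) -> list:
    """Return the nodes holding copies of this key: the primary node
    plus the next nodes in ring order, so a failed node's data can be
    served from its neighbors."""
    primary = partition(key, num_nodes)
    return [(primary + i) % num_nodes for i in range(replication_factor)]

# Every read or write for "user:42" is routed to the same three nodes,
# which is what makes both lookup and failure recovery cheap.
nodes = replicas("user:42", num_nodes=8)
```

Real systems refine this with consistent hashing so that adding or removing a node reshuffles only a small fraction of the keys, but the routing principle is the same.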
