
Data Analytics with Hadoop (Zaloni Preview Edition)

Issue link: https://resources.zaloni.com/i/790569


ion. We hope to introduce most of the concepts, tools, and techniques involved with distributed computing for data analysis and provide a path for deeper dives into specific topic areas.

What to Expect from This Book

This book is not an exhaustive compendium on Hadoop (see Tom White's excellent Hadoop: The Definitive Guide for that) or an introduction to Spark (here we might point to Holden Karau et al., Learning Spark), and it is certainly not meant to teach the operational aspects of distributed computing. Instead, this book is meant as a survey of the Hadoop ecosystem and distributed computation, intended to arm data scientists, statisticians, programmers, and folks who are interested in Hadoop with just enough knowledge to make them dangerous. We hope that you will use this book as a guide as you dip your toes into the world of Hadoop and find the tools and techniques that interest you the most, be it Spark, Hive, Machine Learning, ETL, Relational Databases, or one of the many other topics related to cluster computing.

Who This Book is For

Data science is often erroneously conflated with Big Data, and while many machine learning model families do require large data sets in order to be widely generalizable, even small data sets can provide a pattern recognition punch. For that reason, most of the data science software literature focuses on corpora or data sets that are easily analyzable on a single machine (especially machines with many gigabytes of memory). Although Big Data and data science are well suited to work in concert with each other, to date computing literature has separated them.

This book intends to fill in the gap, and is written for an audience of data scientists. It will introduce them to the world of clustered computing and analytics with Hadoop, from a data science perspective. The focus will not be on deployment, operations, or software development, but rather on common analyses, data warehousing techniques, and higher order data workflows.
So who are data scientists? We expect that a data scientist is a software developer with strong statistical skills or a statistician with strong software development skills. Typically our data teams are composed of three types of data scientists: data engineers, data analysts, and domain experts.

Data engineers are programmers or computer scientists who can build or utilize advanced computing systems. They typically program in Python, Java, or Scala and are familiar with Linux, servers, networking, databases, and application deployment. For those data engineers reading this book, we expect that you're used to the difficulties of programming multi-process code as well as the challenges of data wrangling and numeric computation. We hope that after reading this book you'll have a better
