What's the hype?
Apache Spark is currently revolutionizing the world of Hadoop as a powerful open source data processing engine that can be used for transformations (ETL), interactive queries (SQL), machine learning and streaming over large datasets. Spark can work with data stored in HDFS, Cassandra, HBase and Amazon S3, and Spark applications can be developed in Java, Python or Scala. What makes Spark unique is its simplicity and the breadth of options it offers under a single framework. Since its release, it has become the largest open source community in Big Data, with over 500 contributors from 100+ organizations. Several products in the Big Data market, such as Talend, Platfora and Zaloni Bedrock, are leveraging Spark as their underlying execution engine and in most cases replacing their legacy MapReduce code base.
MapReduce was designed as a high-latency batch model for executing simple computations; Spark provides a generalized execution model with significantly improved performance (up to 100X). Spark is easy to code against, and its Python, Scala and Java APIs are much simpler than Java-based MapReduce. This helps democratize Hadoop-based Data Lakes for a wide range of personas such as Data Scientists, Data Engineers and Data Analysts, who earlier had to wait for developers to write complex MapReduce jobs or use Hive/Pig, which were not interactive.
Spark’s success can be broadly summarized into 4 key points:
- Expressive and clean interfaces empowering engineers and data scientists.
- Unified runtime across multiple environments making Spark really portable.
- Powerful standard libraries out of the box: SQL, MLlib, GraphX and others.
- Performance: up to 100X improvement over traditional MapReduce.
At its core, Spark abstracts data as Resilient Distributed Datasets (RDDs), which can be roughly viewed as collections of objects distributed across a cluster. RDDs are built through parallel transformations and are represented by a lazily evaluated lineage DAG of connected RDDs. This is similar to the Pig Latin programming model, where relations are not computed until the Pig script is executed. RDDs are powerful by themselves and can be transformed and joined using simple commands. Spark recently added DataFrames (Feb 2015), which extend RDDs with a schema and further reduce the development effort required of a data engineer or data scientist. You can think of a DataFrame as a table in the relational world that does not need to be persisted all the time.
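The lazy lineage idea can be illustrated with a toy, pure-Python stand-in. This is a minimal sketch and not Spark's API: `LazyDataset`, `square` and the sample numbers are hypothetical names chosen for illustration, and nothing here is distributed. It only shows that transformations record steps while an action such as `collect()` triggers the actual computation.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source, lineage=()):
        self._source = source    # the base collection
        self._lineage = lineage  # ordered (kind, function) steps recorded so far

    def map(self, fn):
        # Transformation: returns a new dataset with a longer lineage; no work is done.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded, not executed.
        return LazyDataset(self._source, self._lineage + (("filter", pred),))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []  # records every value that square() actually processes


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset(range(1, 6))
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(square)

ran_before_collect = len(calls)      # 0: only the lineage has been built
result = squares_of_evens.collect()  # the action runs the whole chain
```

Because each dataset keeps only its source and its lineage, any result can be recomputed from scratch by replaying the steps, which is roughly the property that lets Spark rebuild lost data.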
Consider a use case where you want to analyze feeds coming from social media sources such as Twitter and Facebook. The general workflow used by a data scientist is:
1) With Spark, you can define a DataFrame over the Twitter feed using a single command, as shown below:
tweetszaloni = sqlContext.load("/data/incoming/tweets-zaloni-com.json", "json")
2) You can review the entire schema with the DataFrame's printSchema() method.
You can also view a sample of the data with its show() method.
3) Next, you can transform the data with standard Spark transformations such as filter, join and map.
4) Lastly, you can keep the results in another RDD for a subsequent Spark job or persist them to a file.
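The four steps above can be sketched as one small PySpark function. This is a hedged illustration, not the author's code: the function name `analyze_tweets`, the `min_retweets` threshold, the output path and the field names `text` and `retweet_count` are all assumptions. It also uses the `read`/`write` interfaces of later Spark releases (1.4 onward) rather than the `sqlContext.load` call shown earlier.

```python
def analyze_tweets(sql_context, in_path, out_path, min_retweets=10):
    """Sketch of steps 1-4.

    `sql_context` is an existing SQLContext (or SparkSession) supplied by
    the caller, for example the `sqlContext` provided by the pyspark shell.
    """
    # 1) Define a DataFrame over the JSON feed; the schema is inferred.
    tweets = sql_context.read.json(in_path)

    # 2) Review the schema and a small sample of rows.
    tweets.printSchema()
    tweets.show(5)

    # 3) Transform: keep frequently retweeted tweets and project two columns.
    #    `retweet_count` and `text` are assumed field names.
    condition = "retweet_count >= {}".format(int(min_retweets))
    popular = tweets.filter(condition).select("text", "retweet_count")

    # 4) Persist the results as JSON files for a subsequent job.
    popular.write.json(out_path)
    return popular
```

From a pyspark shell this could be called as `analyze_tweets(sqlContext, "/data/incoming/tweets-zaloni-com.json", "/data/out/popular-tweets")`, where the output path is purely illustrative.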
Spark uses the RDD (its fundamental data model) as the building block, and on top of this it is building high-level, domain-specific libraries for streaming, machine learning, graph processing (GraphX), SQL (Spark SQL) and R. The following figure shows the Spark ecosystem, which runs on top of existing Hadoop clusters and the YARN resource manager.
Following are the key highlights:
- Spark Core is the general execution engine for the Spark platform. It provides in-memory computing capabilities to deliver speed and a generalized execution model to support a wide range of applications.
- Spark SQL allows interactive SQL queries and is integrated with HCatalog, enabling low-latency SQL access to data in Hadoop.
- Spark Streaming integrates with popular streaming data sources like Flume, Kafka, Twitter and HDFS. It also has built-in write-ahead logs for high availability (HA).
- MLlib is a set of machine learning libraries that enable data mining for data scientists.
- GraphX enables transformation of and reasoning over graph data at large scale.
- SparkR is a package for the R statistical language that allows R users to leverage Spark functionality interactively within the R shell.
In 2014, Spark had the largest number of committers of any project under the Apache Software Foundation and has seen exponential growth. The best way to describe this revolution is with a quote from Reynold Xin of the Berkeley AMPLab: “Spark is what you might call a Swiss Army knife of Big Data analytics tools.”
About the Author
Rajesh Nadipalli, Director of Product Support and Professional Services