Traditionally, Hadoop (via MapReduce, Pig or Hive) was used to prepare data for OLAP cubes for external, proprietary OLAP engines. Now we at Zaloni are encountering firms using Apache Kylin to achieve real-time query capabilities on OLAP cubes backed by 40-billion-plus row fact tables. We are helping a firm unify billing data from disparate systems to create OLAP cubes that provide analytics that are not possible with current systems. This all happens in the Hadoop cluster.
The Evolution of Analytics on Hadoop
Hadoop has evolved from a distributed data platform with generic compute capabilities (via MapReduce) into a platform that supports a much wider range of workloads. Hadoop and its ecosystem tools can now tackle a broad set of use cases beyond low-cost distributed batch processing, Hadoop’s original claim to fame. From iterative machine learning to OLAP and OLTP systems, open source analytics capabilities that run “on the cluster” are putting pressure on the traditional players in the field (Oracle, SAS, Teradata, IBM, etc.).
Designed for Scale
Apache Kylin, named after a mythical Chinese creature, is an open source multidimensional online analytical processing (MOLAP) engine. Originating from eBay, Inc., Kylin is designed to handle petabyte-scale datasets. Here’s a quote from the Apache Software Foundation blog from December 2015:
"Apache Kylin is the best OLAP engine on Big Data so far," said Wilson Pang, Senior Director of Data Services and Solutions at eBay. "At eBay, we collect every user behavior on every eBay screen. While other OLAP engines struggle with the data volume, Kylin enables query responses in the milliseconds. Moreover, we are also starting to leverage Kylin for near real time data streaming storage and analytics engine. All together, Kylin serves as a critical backend component for eBay’s product analytics platform."
How it Works
Kylin achieves its speed by precomputing the various dimensional combinations and their measure aggregates via Hive queries and populating HBase with the results. The Kylin query engine (accessible in Kylin’s user-friendly UI, via an API or via JDBC) leverages the Apache Calcite query processor and HBase features (such as fuzzy row filters) to achieve fast lookups. The HBase rowkeys are compact too, because the dimension values are encoded through a dictionary built on a trie data structure.
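To make the precomputation idea concrete, here is a minimal Python sketch of it. It is not Kylin's implementation: the real build runs as Hive and Hadoop jobs and stores the results in HBase, whereas this toy version keeps everything in a dictionary. The dimension names, the sample rows and the helper functions `build_cube` and `lookup` are all invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical dimensions and fact rows, made up for this sketch.
DIMENSIONS = ("region", "product", "year")
ROWS = [
    {"region": "EU", "product": "A", "year": 2015, "sales": 10},
    {"region": "EU", "product": "B", "year": 2015, "sales": 5},
    {"region": "US", "product": "A", "year": 2015, "sales": 7},
    {"region": "US", "product": "A", "year": 2016, "sales": 3},
]


def build_cube(rows, dimensions, measure):
    """Precompute SUM(measure) for every combination of dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube


def lookup(cube, dimensions, **equals):
    """Answer an equality-filtered SUM from the matching precomputed group."""
    dims = tuple(d for d in dimensions if d in equals)
    key = tuple(equals[d] for d in dims)
    return cube[dims].get(key, 0)


cube = build_cube(ROWS, DIMENSIONS, "sales")
eu_sales = lookup(cube, DIMENSIONS, region="EU")
a_sales_2015 = lookup(cube, DIMENSIONS, product="A", year=2015)
```

With three dimensions the sketch materializes 2^3 = 8 groupings, and every lookup becomes a dictionary read instead of a scan over the fact rows. The number of combinations doubles with each added dimension, which is why the choice of dimensions matters when a cube is designed.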
Kylin only supports the star schema. You are limited to a single fact table for each cube.
Building a cube is a snap. Assuming you already have a Hive table in place, the wizard walks you through the process of selecting the dimensions (which may be hierarchical), selecting the lookup tables, choosing the measures, etc. Partitioning by date is possible and makes refreshing individual segments of the cube a breeze, for example when incremental or streaming data is involved. Once the cube is defined, the build process can be monitored in Kylin’s UI.
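The benefit of date partitioning can be illustrated with a small Python sketch. It only models the idea (one set of aggregates per date partition, rebuilt independently) and says nothing about how Kylin stores or merges segments internally. The class `SegmentedCube`, the helper `build_segment` and the sample data are hypothetical.

```python
from collections import Counter


def build_segment(rows):
    """Aggregate one partition's (product, sales) rows into per-product totals."""
    totals = Counter()
    for product, sales in rows:
        totals[product] += sales
    return totals


class SegmentedCube:
    """Toy cube that keeps one set of aggregates per date partition."""

    def __init__(self):
        self.segments = {}  # ISO date string -> Counter of per-product totals

    def refresh(self, day, rows):
        # Rebuilding one day replaces only that segment; others are untouched.
        self.segments[day] = build_segment(rows)

    def total(self, product, start, end):
        # ISO dates compare correctly as strings, so a range check is enough.
        return sum(
            seg[product]
            for day, seg in self.segments.items()
            if start <= day <= end
        )


sales_cube = SegmentedCube()
sales_cube.refresh("2016-01-01", [("A", 10), ("B", 5)])
sales_cube.refresh("2016-01-02", [("A", 3)])
before = sales_cube.total("A", "2016-01-01", "2016-01-02")

# Late-arriving data for one day: only that day's segment is rebuilt.
sales_cube.refresh("2016-01-02", [("A", 4), ("A", 1)])
after = sales_cube.total("A", "2016-01-01", "2016-01-02")
```

Because each day is aggregated on its own, new or corrected data for one date costs a rebuild of that date's segment rather than of the whole cube.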
Beyond Kylin’s Web UI, you can query the OLAP cubes via JDBC, inside Zeppelin (there’s a Kylin interpreter distributed with Zeppelin), or by way of a well-designed REST API.
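As a rough illustration of the REST route, the Python sketch below prepares a SQL query request using only the standard library. It assumes the query endpoint is `POST /kylin/api/query` with HTTP Basic authentication and a JSON body carrying `sql` and `project`; check the REST API documentation for your Kylin version before relying on these details. The host, port, credentials, project name and table name are placeholders, and `build_query_request` is a helper invented for this example. The request is built but not sent, so the snippet runs without a Kylin server.

```python
import base64
import json
import urllib.request


def build_query_request(base_url, project, sql, user, password, limit=100):
    """Build (but do not send) a POST request for a Kylin SQL query.

    The endpoint path and payload fields are assumptions to verify
    against the documentation of your Kylin version.
    """
    payload = {"sql": sql, "project": project, "offset": 0, "limit": limit}
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; replace them with your own server, project and query.
request = build_query_request(
    base_url="http://localhost:7070",
    project="my_project",
    sql="SELECT part_dt, SUM(price) FROM my_fact_table GROUP BY part_dt",
    user="my_user",
    password="my_password",
)

# To actually run the query against a live server:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Keeping request construction separate from sending makes it easy to inspect the payload first, and to swap in another HTTP client later.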
Other Options for OLAP on Hadoop
Kylin is just one open source option for OLAP on Hadoop. Apache Lens is another, but it is a ROLAP solution and does not currently offer the responsiveness that Kylin’s precomputed cubes give. Druid is also an option, but it relies on its own clustering technology and does not require Hadoop. There are also vendor solutions that claim to achieve OLAP capabilities on Hadoop.