Leveraging Big Data Lessons to Optimize Data Lakes for IoT

August 30, 2017

This article is an excerpt from the "Overcoming Obstacles to Data Lake Success" compendium by Datanami.

Big data isn’t a simple exercise. Nonetheless, the pioneers, and the teams that have pushed big data projects to maturity every day since, learned important lessons along the way. Zaloni has mined those lessons in order to optimize data lakes and data management for the next big data surge -- the avalanche of data soon to come from the Internet of Things (IoT). The result of that research and analysis is a set of second-generation tools that leverage the hard-won knowledge in big data, specifically in conquering complexity at scale and overcoming the ongoing skill shortages.

However, there are more challenges to master on the horizon than just IoT data.

Predictive analytics, prescriptive analytics and machine learning will soon combine and evolve into new assisted intelligence services. Data tools will have to prepare now if they are to keep up.

One example on the horizon is anticipatory analytics, a next-generation class of analytics centered on serving a multitude of possibilities rather than a single expected probability.

“Predictive analytics look at probability -- what is likely to occur in the future -- and try to give you a reliable singular answer,” explains Dave Wells, an industry analyst and consultant. “Anticipatory analytics serve multiple answers reflecting many different possibilities.”

“Predictive analytics can drive insights. Anticipatory analytics can drive innovation since innovating is about exploration and trying many different things.”

Here are the highlights of some of the more outstanding features and products that resulted from Zaloni’s efforts to address the obstacles and opportunities in the future and in the Land of Now.

Automated data ingestion and organization

One of the biggest lessons learned is that the ability to ingest disparate data from multiple sources and at multiple speeds, from batch to streaming, is a must have. Further, that data needs to be organized and retrievable from the moment it hits the data lake. Zaloni built a fully integrated data lake management and governance platform, with automated data preparation that includes automatic consumption of data, metadata capture and catalog entry, and registry with the Hadoop ecosystem.
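The ingestion pattern described above, where metadata is captured and a catalog entry created the moment data lands, can be sketched in plain Python. This is an illustrative sketch only; the function names, catalog structure, and field names are assumptions for the example, not Zaloni's actual API.

```python
import hashlib
import json
from datetime import datetime, timezone

# In-memory stand-in for a data lake catalog (illustrative only).
catalog = {}

def ingest(dataset_name, records, source):
    """Land a batch of records and register it in the catalog immediately."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    entry = {
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "record_count": len(records),
        # Schema inferred from the first record's keys (batch assumed uniform).
        "schema": sorted(records[0].keys()) if records else [],
        # Checksum lets later consumers verify the batch is unchanged.
        "checksum": hashlib.sha256(payload).hexdigest(),
    }
    catalog[dataset_name] = entry
    return entry

entry = ingest("sensor_readings",
               [{"device_id": "d-1", "temp_c": 21.5}],
               source="iot-gateway")
```

The point of the sketch is that cataloging happens inside the ingest path itself, so data is organized and retrievable from the moment it hits the lake rather than cataloged after the fact.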

Automated data prep

There’s much ado these days about self-service data and analytics, but data prep remains the major bottleneck in that process. Data preparation includes such arduous tasks as workflow orchestration, data tagging, masking, and tokenization. It also includes cleaning the data to remove outdated and duplicate information, filling or deleting empty fields and outliers, and standardizing fields -- and it requires tools able to perform those tasks quickly.
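As a minimal illustration of the cleaning steps listed above (deduplication, dropping empty fields, and standardizing values), here is a hedged sketch in plain Python. A data lake platform would apply rules like these at scale, for example on Spark; the function below only shows the logic of each step.

```python
def clean(records):
    """Deduplicate, drop empty fields, and standardize string values."""
    seen = set()
    cleaned = []
    for rec in records:
        # Standardize: trim whitespace and lowercase string fields.
        rec = {k: v.strip().lower() if isinstance(v, str) else v
               for k, v in rec.items()}
        # Drop empty fields rather than carrying nulls forward.
        rec = {k: v for k, v in rec.items() if v not in ("", None)}
        # Deduplicate on the record's full contents.
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            cleaned.append(rec)
    return cleaned

rows = [{"city": " Raleigh ", "zip": "27601"},
        {"city": "raleigh", "zip": "27601"},
        {"city": "Durham", "zip": ""}]
result = clean(rows)
```

Note that standardizing before deduplicating matters: " Raleigh " and "raleigh" only collapse into one record because they were normalized first.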

Alongside these tasks is the need to assign granular permissions and controls to ensure compliance and security. Converting data formats and orchestrating complex workflows so they can rapidly integrate updated or changed data, even in real time, is also a prime concern.
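Granular permissions and masking can be illustrated with a simple role-based check. The roles, field names, and masking rule below are hypothetical examples chosen for the sketch, not the platform's actual access model.

```python
# Hypothetical role-to-field policy: which fields each role may see in clear.
POLICY = {
    "technician": {"device_id", "temp_c", "last_service"},
    "cfo": {"device_id", "purchase_cost", "depreciation"},
}

def read_record(record, role):
    """Return the record with fields outside the role's policy masked."""
    allowed = POLICY.get(role, set())
    return {k: (v if k in allowed else "***")
            for k, v in record.items()}

asset = {"device_id": "d-1", "temp_c": 22.0, "purchase_cost": 1400}
tech_view = read_record(asset, "technician")  # purchase_cost masked
cfo_view = read_record(asset, "cfo")          # temp_c masked
```

Both users query the same record, but each sees only the fields their role permits, which is the compliance guarantee the text describes.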

Zaloni’s Spark-based transformation libraries provide flexible transformations at scale. It rapidly prepares data for use in analytics, automating many of these tasks and providing tools designed to remove complexities for rule applications.

Democratizing access to data

Democratizing data is essential in forming a truly data-driven organization. This simply means making data accessible to many users throughout the organization for use according to their job roles.

For example, a repair or maintenance technician needs access to IoT and other data for use with self-service analytics to determine what action they need to take in maintaining, repairing or replacing any specific equipment. Meanwhile, a CFO may need access to that same data in performing asset management on the financial side.

Therefore, making data accessible not only to different roles but also to many users simultaneously is essential. Zaloni’s data lake management platform democratizes access to data sets in the data lake. A self-service tool serves data to users on demand, while Zaloni delivers it cleansed, enriched, governed and secured, so self-service analytics can be made available organization-wide with confidence.

Automated transformations in the self-service capability

Another valuable lesson learned from big data projects is that, to relieve the bottleneck in IT or on the data scientist team, users must be able to retrieve, clean, prep, customize and transform data themselves. Those are not easy tasks, even for users with data-scientist-level skills -- and in truth, most users excel because of skills outside the data science specialty.

Zaloni offers a self-service data capability specifically designed to assist professionals of all stripes in making data-driven decisions. It enables users of a variety of skill levels to use a consolidated catalog to easily find the data sets they need and then prepare them for analytics in a simplified manner with useful prompts. Transformations on the resulting new data are automated.

Mastering the future through conquering obstacles now

With Zaloni, organizations are fully prepared to realize a higher ROI in the here and now, but they’ll also have the data lake they need to quickly leverage IoT data and evolving use cases such as assisted intelligence and ambient computing.

Having such flexibility in the data lake -- to accommodate and manage the growing disparity in data, the increasing number of data sources, the rising flux in ingestion speeds from batch to streaming, and a growing mix of devices the analytics are to be used on -- is essential to maturing the company into a full-fledged data-driven organization and driving it forward to future prosperity.


Enjoyed this article? You can read the rest of the compendium here.
