Every year, the Big Data community at large meets up at the San Jose Convention Center for a week-long mix of tutorials, keynote talks, sessions and, of course, the expo, for all things big data. As in previous years, Zaloni was one of the sponsors. I got the opportunity to mix with the crowd, attend some sessions and check out the state of the art in the big data ecosystem.
For the impatient, here are some links to the popular content:
If you're not in a hurry, then geek out with me over the things that I took note of during the week's technical presentations:
Political geek humor never ceases. Two years back it was President Obama making data science puns in a recorded keynote address ("Half the data science jokes my staff came up with were below average."). Last year it was a company riffing on Donald Trump's election slogan in its marketing material ("Make data great again!"). This year, a session on HDFS transparent encryption used sample email files that curiously had the word "Clinton" plastered all over them.
Kafka performance debugging, finally. Apache Kafka has been receiving well-deserved attention over the years as a robust streaming platform. This year saw some good feedback about the production challenges of running large Kafka clusters. This talk by Confluent has some nuggets on general software support: do not subscribe to the Windows philosophy of "restart if it gives you trouble", and do learn the USE methodology (Utilization, Saturation, Errors), which is a must-read for all support personnel, no matter what stack you maintain.
Uber addresses a common irritant. Uber unveiled Hoodie, a library for handling upserts on HDFS. This is a problem that most RDBMS offloads to Hadoop face: how do we merge frequent updates into HDFS files? It is the same common problem that Zaloni Bedrock addressed a couple of versions back using its Change Data Capture action.
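To make the upsert problem concrete, here is a minimal sketch of the merge step, assuming each record carries a primary key and a version field. It is not Hoodie's API or Bedrock's implementation; the function name `merge_upserts` and the field names `id` and `updated_at` are hypothetical, and plain Python lists stand in for HDFS files.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def merge_upserts(base: Iterable[Record], changes: Iterable[Record],
                  key: str = "id", version: str = "updated_at") -> List[Record]:
    """Build a new snapshot by applying a batch of changes on top of a base snapshot.

    HDFS files are effectively immutable, so an "update" means rewriting the
    snapshot: for every key, keep the record with the highest version.
    """
    latest: Dict[Any, Record] = {}
    for row in base:
        latest[row[key]] = row
    for row in changes:
        current = latest.get(row[key])
        # Insert new keys; overwrite existing keys only with newer-or-equal versions.
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


# Hypothetical data: one update, one insert, and one stale change that is ignored.
base = [
    {"id": 1, "name": "alice", "updated_at": 1},
    {"id": 2, "name": "bob", "updated_at": 1},
]
changes = [
    {"id": 2, "name": "bobby", "updated_at": 2},
    {"id": 3, "name": "carol", "updated_at": 2},
    {"id": 1, "name": "stale", "updated_at": 0},
]
snapshot = merge_upserts(base, changes)
```

At real scale the same idea would likely be expressed as a keyed join or group-by in Spark or MapReduce and limited to the files that actually contain changed keys, but the rule of "latest version per key wins" stays the same.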
Mesos is missing. Apache Mesos, a cluster resource manager, was conspicuous by its absence from the proceedings. Mesos is a poster child in the Spark community and was demoed many times at Strata in previous years. With the rise of the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka), Mesos may be competing with what some traditionally call Hadoop.
No sessions on HDFS 3. You should be interested because HDFS 3.0 promises to cut your storage costs substantially using HDFS erasure coding: with the Reed-Solomon (6 data, 3 parity) scheme, raw storage is roughly half of what 3x replication needs. There was a session on launching Docker on YARN, but that feature has some security issues that need to be sorted out first.
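As a rough check on the erasure coding savings, here is the arithmetic as a small sketch. It assumes a Reed-Solomon layout with 6 data units and 3 parity units compared against plain 3x replication; the function names and the 600 TB figure are made up for illustration.

```python
def raw_storage_replicated(logical, replicas=3):
    """Raw storage consumed when every block is stored `replicas` times."""
    return logical * replicas


def raw_storage_erasure_coded(logical, data_units=6, parity_units=3):
    """Raw storage consumed when each group of data units gets extra parity units."""
    return logical * (data_units + parity_units) / data_units


logical_tb = 600  # hypothetical amount of user data, in TB
replicated_tb = raw_storage_replicated(logical_tb)
erasure_coded_tb = raw_storage_erasure_coded(logical_tb)

# Fraction of raw storage saved by switching from replication to erasure coding.
raw_saving = 1 - erasure_coded_tb / replicated_tb

# Redundancy overhead: the extra storage on top of the user data itself.
overhead_replicated = replicated_tb / logical_tb - 1
overhead_erasure_coded = erasure_coded_tb / logical_tb - 1
```

Under these assumptions total raw storage falls by half (1800 TB to 900 TB), while the redundancy overhead drops from 200% to 50% of the user data. Other coding schemes would give different numbers.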
A lot of sessions from Google on TensorFlow. With massive interest in deep learning, and the fact that it is often combined with Spark to scale out, there were plenty of people attending multiple sessions on Google’s TensorFlow toolkit. Unfortunately, for non-data scientists and non-mathematicians, there was a lot of unfamiliar terminology. A talk by the rather famous (and opinionated) Ted Dunning cut through to the elegance and sophistication needed to do tensor math efficiently on computers. Also not to be missed was a session from Facebook on Torch, an alternative to TensorFlow.
Crowd pullers. Two sessions that I wanted to go to but could not get into because of sheer crowding were Zaloni's on building a data architecture and Netflix's on their internal data architecture. It’s always interesting to see what folks at Netflix are doing, because they are engineering-oriented but justifiably wary of using a technology just because it’s well known (two years ago, they were one of the few companies to declare that although they were experimenting with Apache Spark, it wasn't yet “production ready”). Zaloni also turned up in Emirates’ (du) telecommunication use case.
Storm or Heron. Storm (now Apache Storm) is a stream processing system used by Spotify, Twitter and Yahoo, amongst others. Twitter has recently open-sourced Heron, an API-compatible replacement for Storm. Heron uses cluster resource managers to scale up or down, depending on load. Here are slides from the talk, and a comparison. (Heron can run on Mesos, by the way.)
Artificial intelligence and machine learning are (still!) all the rage. Each year at Strata there seems to be a core focus or theme that permeates many of the sessions and keynotes. This year the use of artificial intelligence and machine learning to improve analytic time to value and improve the overall data management process was front and center. It will be interesting to see if the NYC Strata in September will have the same themes and if there will be success stories from actual implementations.
Cloudera joins the Data Science market. Cloudera made a big announcement during one of the keynotes touting their new Data Science Workbench release. The goal behind this workbench is self-service data science.
About the Author
Dev Ayon is a software engineer at Zaloni and specializes in scalable solutions with MapReduce and Spark. Interests include scalable data anonymization, semantic data and reasoning, continuous build and deployment systems, and meddling in other people's business.