Machine Learning: How to Master the Basics and Transform your Dataset

September 28, 2016 Jean Georges Perrin

You might be familiar with various number puzzles on LinkedIn. Although some might complain about how they disrupt their LinkedIn news feed (e.g. “This isn’t Facebook!”), the puzzles are designed to trigger your intelligence or challenge your neurons.

Let’s look at the puzzle in the featured image of this post.

5, 15, 25, 35, 45, 55…

What comes after 55? Instantly, you want to say 65, right? But do you know how you got there? Did you add 10 to the last one?

What if the numbers were 5, 13, 27, 39, 41, and 55… Would you still say 65? You can still add 10 to 55, but where does this 10 come from?

Linear Regression and Machine Learning Theory

If you have browsed the opening chapters of Fundamentals of Deep Learning by Nikhil Buduma, you’ve probably noticed the algorithmic complexities that are involved in machine learning. But puzzles like our sample above can be distilled to the bare basics to help you understand these algorithms.

The mathematical concept behind our sample puzzle is linear regression, a statistical technique that is usually solved with linear algebra. It is one of the very first exercises you go through when you learn about machine learning. If you take a piece of graph paper and start plotting, you will get something like the following diagram:


The 7th element in the series can be 65 and you get a sense that the 8th might be 75.

As with any new concept, there is some new vocabulary to learn. The elements on our x axis (1, 2, 3…8) are called features, while the values (5, 15…) are called labels.

On my second series, I get the following graph:


The idea is to draw a straight line that is the closest to all points. The line is then expressed as an equation:

y= β1 x+ β0

β1 (the slope, also called the regression coefficient) and β0 (the intercept) can be (easily) calculated if you like linear algebra, or you can use tools to do it. In our first example, our equation is simply y=10x-5. In our second example, the equation is approximately y=9.8857x-4.6. So there is a difference. When x is 7, we get 65 from our first equation, but 64.6 from our second. Close enough?
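If you would rather let the machine do the algebra, here is a minimal sketch of the closed-form least-squares calculation in plain Java, with no Spark involved. The class and method names (LeastSquares, fit, predict) are mine, chosen for illustration, and are not taken from the GitHub project. It assumes the features are simply the positions 1 to 6.

```java
import java.util.Locale;

public class LeastSquares {

    /** Fits y = slope * x + intercept by ordinary least squares; returns {slope, intercept}. */
    public static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("Need at least two (x, y) pairs of equal length");
        }
        int n = x.length;

        // Means of the features and of the labels.
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // slope = sum((x - meanX) * (y - meanY)) / sum((x - meanX)^2)
        double sxy = 0.0;
        double sxx = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            sxy += dx * (y[i] - meanY);
            sxx += dx * dx;
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("All x values are identical");
        }
        double slope = sxy / sxx;

        // The fitted line always passes through the point (meanX, meanY).
        double intercept = meanY - slope * meanX;
        return new double[] {slope, intercept};
    }

    /** Evaluates the fitted line {slope, intercept} at x. */
    public static double predict(double[] line, double x) {
        return line[0] * x + line[1];
    }

    public static void main(String[] args) {
        double[] positions = {1, 2, 3, 4, 5, 6};
        double[] first = {5, 15, 25, 35, 45, 55};
        double[] second = {5, 13, 27, 39, 41, 55};
        for (double[] labels : new double[][] {first, second}) {
            double[] line = fit(positions, labels);
            System.out.printf(Locale.ROOT, "slope=%.4f intercept=%.4f prediction(7)=%.4f%n",
                    line[0], line[1], predict(line, 7));
        }
    }
}
```

Run against both series, it should give back the two equations quoted above, and 65 and 64.6 for the 7th element.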

So far so good? Okay. I loved math during my high school and college years, but I must admit that it is not my biggest passion anymore. Let’s code!

A Coding Exercise in Machine Learning

I will use Java and Apache Spark ML 2.0.0. Java is probably the most used development language in enterprises and Spark is a wonderful analytics package. ML is the Machine Learning library – yes, they lacked inspiration the day they had to find a name. 

You can download all the code examples from GitHub. There are a few dependencies.

There are quite a few imports, even for a small example. I left them in because I do not want you to be confused by similar names in different packages (Vector is one example).


Then we have a basic main() that will instantiate the class and start() it.


Spark 2.0.0 enforces the use of a Spark session. In prior versions, it was a little confusing because you might have needed several session and configuration objects.


We need a UDF (User Defined Function) that transforms our input into a format that can be used by Spark.


Our data is in tuple-data-file.csv. Actually, our first set is in tuple-data-file-set1.csv and the second is in guess-what-file.csv? No, they are in tuple-data-file-set2.csv, but I wanted to check if you were following.


In this situation, we need to force the structure of our data because Spark needs some guidance on the metadata of our data.
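To see why the structure matters, here is a small plain-Java sketch, independent of Spark, of what forcing a structure amounts to: each line of text becomes a pair of doubles rather than a pair of strings. The two-column layout without a header (position, then value) is my assumption about the CSV files, and TypedCsvSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TypedCsvSketch {

    /** Parses lines of the assumed form "x,y" into {x, y} pairs of doubles. */
    public static List<double[]> parse(List<String> lines) {
        List<double[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = line.split(",");
            if (cells.length != 2) {
                throw new IllegalArgumentException("Expected 2 columns but got: " + line);
            }
            // Without an explicit type, "5" is only text; here we decide it is a double.
            rows.add(new double[] {
                Double.parseDouble(cells[0].trim()),
                Double.parseDouble(cells[1].trim())
            });
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1,5", "2,15", "3,25");
        for (double[] row : parse(lines)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```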


In Spark 2.0.0, in a Java context, our beloved dataframe is implemented as a Dataset<Row>.


As you can see, we transformed our dataframe to create a label and features. More precisely, each label is paired with a vector of features.
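The sketch below illustrates that shape in plain Java, without Spark: each raw (x, y) row becomes a label plus a one-element vector of features. LabeledPoints and its nested LabeledPoint class are hypothetical names for this illustration, not Spark's classes, and the {x, y} row layout is assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LabeledPoints {

    /** One training example: the value to predict and the inputs it depends on. */
    public static final class LabeledPoint {
        public final double label;
        public final double[] features;

        public LabeledPoint(double label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    /** Turns raw {x, y} rows into (label = y, features = [x]) pairs. */
    public static List<LabeledPoint> fromRows(double[][] rows) {
        List<LabeledPoint> points = new ArrayList<>();
        for (double[] row : rows) {
            points.add(new LabeledPoint(row[1], new double[] {row[0]}));
        }
        return points;
    }

    public static void main(String[] args) {
        double[][] rows = {{1, 5}, {2, 15}, {3, 25}};
        for (LabeledPoint p : fromRows(rows)) {
            System.out.println("label=" + p.label + " features=" + Arrays.toString(p.features));
        }
    }
}
```

With a single input per row the vector looks like overkill, but the same shape would carry over unchanged if each label depended on several features.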


We are now ready to build our linear regression. We will limit it to 20 iterations.


We fit our linear regression to our dataframe, which gives us a model.


And now, we can try it on the 7th element, whose feature is 7. We create a vector and predict from it.
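Putting the steps together, the whole exercise could look roughly like the sketch below. It is a reconstruction under assumptions, not the code from the GitHub project: the class name, the UDF name toVector, the column names x and label, and the two-column layout without a header of tuple-data-file-set1.csv (assumed to sit in the working directory) are all mine. It needs the Spark SQL and Spark MLlib artifacts on the classpath. Note also that predict() is declared protected on the Scala side in Spark 2.0.0; it is normally still callable from Java, and I believe later releases made it public.

```java
import org.apache.spark.ml.linalg.SQLDataTypes;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class SimplePredictionSketch {

    public static void main(String[] args) {
        // One session object replaces the several contexts of earlier versions.
        SparkSession spark = SparkSession.builder()
                .appName("Simple prediction sketch")
                .master("local[*]") // assumption: run locally
                .getOrCreate();

        // UDF that wraps a single double into an ML vector (name is hypothetical).
        spark.udf().register("toVector",
                (UDF1<Double, Vector>) x -> Vectors.dense(x),
                SQLDataTypes.VectorType());

        // Force the structure: two double columns (names are assumptions).
        StructType schema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("x", DataTypes.DoubleType, false),
                DataTypes.createStructField("label", DataTypes.DoubleType, false)
        });

        // Load the CSV file as a dataframe, that is, a Dataset<Row>.
        Dataset<Row> df = spark.read()
                .format("csv")
                .schema(schema)
                .load("tuple-data-file-set1.csv"); // assumed location

        // Add the "features" vector column next to the "label" column.
        df = df.withColumn("features", callUDF("toVector", col("x")));

        // Build the regression, limited to 20 iterations, and fit it.
        LinearRegression lr = new LinearRegression().setMaxIter(20);
        LinearRegressionModel model = lr.fit(df);

        // Predict the label of the 7th element, whose feature is 7.
        double prediction = model.predict(Vectors.dense(7.0));
        System.out.println("Prediction for feature 7.0 is " + prediction);

        spark.stop();
    }
}
```

The sketch relies on the default column names of LinearRegression, label and features, which is why the schema and the UDF output use exactly those names.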


In the code on GitHub, you’ll have a lot more statistical information displayed.

After executing, we should get the following information on the first dataset:


And the following information on the second dataset:


This is a very basic example with a very limited dataset. Now that you have worked through it, you can proudly check the box for acquiring “knowledge of machine learning”.

There are many other technologies and data science methods that you can explore to get a better understanding of how data is changing the way machines ‘learn’. For additional thoughts, check out our video series Big Data Think Tank or topics in Data Science.

About the Author

Jean Georges Perrin

Jean Georges Perrin is a software architect for Zaloni. He is passionate about software engineering and all things data, small and big data. His latest endeavors have brought him into the Apache ecosystem, with a definite penchant for Spark and Zeppelin. He is proud to have been the first in France to be recognized as an IBM Champion, and to have been awarded the honor for the ninth consecutive year. Jean Georges shares his more than 20 years of experience in the IT industry as a presenter and participant at conferences and through publishing articles in print and online media. His blog is visible at When he is not immersed in IT, which he loves, he enjoys exploring his adopted region of North Carolina with his kids.
