7 concepts to launch into Machine Learning

dekoupled
7 min read · Feb 11, 2019

After completing the How Google Does Machine Learning course (summarized in an earlier post), I began the Launching into Machine Learning course by Google Cloud on Coursera. I was hoping to beat the course deadline, but I miscalculated the depth and complexity of the material. In addition to refreshing some calculus, I had to search a few blogs and refer to the book Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron to gain a solid grounding in the concepts.

In this post, I’ll document seven key concepts covered by the course to launch into Machine Learning.

1. Machine Learning (ML) terminologies demystified

While it is exciting to see how mathematical concepts are applied to a data problem, some terms go by new names influenced by the ML discipline. For example, the independent variable in an equation is referred to as a feature in ML. Once I got through the ML-speak, it became easier to follow the course.

Below are some basic ML terms with descriptions. ML purists out there are welcome to correct me :)

  • Algorithm — a set of rules to solve a problem. The algorithm is independent of the dataset, e.g. a simple linear regression y = a + x * b (strictly an equation, but you get the idea)
  • Model — the output of the machine learning process. It is specific to a dataset, e.g. Tip amount = 1 + 0.12 * Bill amount. Many use algorithm and model interchangeably
  • Feature — an input parameter or independent variable. In ML, this is information we already have access to, e.g. Bill amount in the above equation
  • Feature weight — the coefficient of a variable. In ML, feature weights are determined through the training process, e.g. 0.12 in the above equation
  • Bias term — a constant in the equation, e.g. 1 in the above equation
  • Hyperparameter — this requires some explanation, so it is covered separately below
  • Labeled data — a label is essentially what the ML model predicts. Labeled data is a dataset containing the answers (i.e. the values to be predicted), e.g. the past year of Bill amounts with corresponding Tip amounts
  • Loss function — also known as a cost function, the error metric used to measure model performance, e.g. root mean squared error (RMSE) for linear regression
  • Training — the process of applying an algorithm to an existing dataset to come up with a model, e.g. the training process outputs the equation Tip amount = 1 + 0.12 * Bill amount by modeling the existing dataset with regression
  • Generalization — the property of an ML model to fairly represent the current dataset so it can predict labels on unseen data with minimal error. In other words, the model is a good proxy for real-world data it hasn't seen before
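To make the terms concrete, here is a minimal sketch (my own, not from the course) using scikit-learn and a made-up bills-and-tips dataset generated from the equation above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical labeled data: Bill amount is the feature, Tip amount is the label.
bill_amount = np.array([[10.0], [20.0], [35.0], [50.0], [80.0]])
tip_amount = 1.0 + 0.12 * bill_amount.ravel()   # labels follow Tip = 1 + 0.12 * Bill

# Training: apply the algorithm (linear regression) to the dataset to get a model.
model = LinearRegression().fit(bill_amount, tip_amount)

print(model.coef_[0])            # feature weight, ~0.12
print(model.intercept_)          # bias term, ~1.0
print(model.predict([[60.0]]))   # predict the label for an unseen bill
```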

2. Machine Learning types

Supervised and Unsupervised are the two main Machine Learning types. There is a ton of literature on the web about these two types, but I was looking for a side-by-side comparison that captures the key differences between them, so I created one below.

There are two more types, Semi-supervised and Reinforcement. Semi-supervised Machine Learning is a hybrid of Supervised and Unsupervised: it starts from a small labeled dataset using supervised techniques and then transitions to unsupervised learning on a large unlabeled dataset. A great example is the tagging of photos with names, e.g. in the iOS Camera app and Google Photos.

The fourth type is Reinforcement Learning, where the model learns by itself based on rewards or penalties from previous actions. Robots learn how to walk using this approach.

3. Optimizing the ML model

The goal of ML is to create a model for a given dataset that predicts the label (dependent variable) accurately. The accuracy of the prediction increases as the prediction error is minimized. If the model is a linear regression with a few variables, there are analytical tools to determine the best feature weights that minimize the error, which is RMSE in this case.
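To give a feel for what "analytical" means here, ordinary least squares has a closed-form solution known as the normal equation. A tiny sketch with made-up numbers (not from the course):

```python
import numpy as np

# Hypothetical design matrix X (a column of ones for the bias, then the Bill amounts)
# and labels y that follow Tip = 1 + 0.12 * Bill exactly.
X = np.array([[1.0, 10.0], [1.0, 20.0], [1.0, 35.0], [1.0, 50.0]])
y = np.array([2.2, 3.4, 5.2, 7.0])

# Normal equation: weights = (X^T X)^-1 X^T y
weights = np.linalg.inv(X.T @ X) @ X.T @ y
print(weights)  # [bias, feature weight] -> approximately [1.0, 0.12]
```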

What if you have hundreds or thousands of features and a large dataset? Analytical computation becomes impractical. Gradient Descent comes to the rescue!

Gradient Descent (GD) is an optimization algorithm that iteratively tweaks the weights to minimize the error. GD scales well, runs fast, and is widely used to optimize ML models.

Gradient Descent iterates by first identifying the direction in which to change the weight (referred to as the slope, gradient, or derivative in calculus), second changing the weight in steps (the step size is the learning rate), and third checking the error and looping until a minimum is reached. This is the process of walking down the loss curve (for a single parameter) or loss surface (for multiple parameters) in parameter space. Check this blog post at Machine Learning Mastery for further explanation.
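Here is a minimal sketch of that loop for a one-feature tips model, written in plain NumPy with synthetic data; the learning rate and iteration count are arbitrary choices for illustration:

```python
import numpy as np

# Hypothetical data following Tip = 1 + 0.12 * Bill
bill = np.array([10.0, 20.0, 35.0, 50.0, 80.0])
tip = 1.0 + 0.12 * bill

weight, bias = 0.0, 0.0      # start from arbitrary parameter values
learning_rate = 0.0002       # step size (a hyperparameter)

for step in range(100_000):
    prediction = bias + weight * bill
    error = prediction - tip
    # Gradients of the mean squared error with respect to each parameter
    grad_weight = 2 * np.mean(error * bill)
    grad_bias = 2 * np.mean(error)
    # Move each parameter a small step against its gradient
    weight -= learning_rate * grad_weight
    bias -= learning_rate * grad_bias

print(weight, bias)  # should print values close to 0.12 and 1.0
```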

There are mainly three flavors of Gradient Descent — Batch, Mini-batch, and Stochastic. Mini-batch is the most popular: each step runs on a small subset of the data, extracting samples that on average balance each other out. The size of this subset is referred to as the batch size.
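Mini-batch only changes how examples are fed into the loop above. A sketch of one epoch of mini-batch sampling, with a made-up batch size of 2:

```python
import numpy as np

bill = np.array([10.0, 20.0, 35.0, 50.0, 80.0])   # same synthetic data as above
tip = 1.0 + 0.12 * bill
batch_size = 2                                     # a hyperparameter
rng = np.random.default_rng(seed=42)               # fixed seed for repeatability

# One epoch: shuffle once, then walk through the data in mini-batches
indices = rng.permutation(len(bill))
for start in range(0, len(bill), batch_size):
    batch = indices[start:start + batch_size]
    batch_bill, batch_tip = bill[batch], tip[batch]
    # ...compute the gradients on just this mini-batch and update the
    #    weight and bias exactly as in the full-batch loop above...
    print(batch_bill, batch_tip)
```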

4. Parameter vs. Hyperparameter

You will hear Data Scientists talk a lot about hyperparameter tuning. To understand what a hyperparameter is, it is first important to highlight a key point about the previous concept of optimization.

Most ML models require an optimization algorithm to perfect the model, and Gradient Descent is a good example. But note that the optimization algorithm itself is not part of the final ML model.

Let me try a simple analogy — assume you are a student learning to drive with an instructor who guides you so you get trained quickly and effectively. Once you learn to drive well (with minimal error) and pass the driver's test (model validation), you don't need the instructor anymore.

Now, an attribute of the student (i.e. the model) learning to drive is a parameter, or model parameter, while an attribute of the instructor (i.e. the optimization algorithm) is a hyperparameter. For Gradient Descent, the learning rate (aka step size) and the batch size are hyperparameters.

Below is a simple side-by-side comparison and explanation of parameter and hyperparameter.
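To complement that, here is how the distinction might look in code for the gradient descent sketch above (my own illustration, not from the course):

```python
# Hyperparameters — chosen by the data scientist before training starts
# (attributes of the "instructor", i.e. the optimization algorithm)
learning_rate = 0.0002
batch_size = 2
num_training_steps = 100_000

# Parameters — learned during training and shipped as part of the final model
# (attributes of the "student", i.e. the model itself)
weight = 0.0   # ends up near 0.12 for the tips example above
bias = 0.0     # ends up near 1.0
```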

5. Machine Learning workflow

The ML workflow puts together all the above concepts to arrive at a generalized model for prediction. Model selection is the first step (after gathering and cleansing the data) and is driven by the use case and the data distribution.

Then train the model with a set of hyperparameters and validate the results. Based on the results, tune the hyperparameters (i.e. adjust the batch size and learning rate, or the nodes and layers for neural nets) and repeat the process until the model shows signs of overfitting.

Hyperparameters play a key role in regulating the ML workflow by generalizing the model for a given data set.

Finally, use the last non-overfit model (the model from just before overfitting starts) for prediction.
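A hedged sketch of that tune-train-validate loop, using scikit-learn's SGDRegressor on a synthetic bills-and-tips dataset; the candidate learning rates and the split are illustrative choices only:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical data split into training and validation buckets (see concept 6);
# a sequential split is fine here because the synthetic data is already random.
rng = np.random.default_rng(0)
bill = rng.uniform(5, 100, size=200).reshape(-1, 1)
tip = 1.0 + 0.12 * bill.ravel() + rng.normal(0, 0.2, size=200)
train_x, val_x = bill[:150], bill[150:]
train_y, val_y = tip[:150], tip[150:]

best_model, best_rmse = None, float("inf")
for learning_rate in (0.0001, 0.00003, 0.00001):       # hyperparameter tuning loop
    model = SGDRegressor(learning_rate="constant", eta0=learning_rate,
                         max_iter=5000, random_state=0)
    model.fit(train_x, train_y)                          # training bucket
    rmse = mean_squared_error(val_y, model.predict(val_x)) ** 0.5  # validation bucket
    if rmse < best_rmse:
        best_model, best_rmse = model, rmse

print(best_rmse)   # keep the model with the best validation error for the final test
```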

6. Three buckets of data

The ML workflow requires three different data buckets.

  • The Training set (or bucket) is used to compute the feature weights, running the optimization process (e.g. Gradient Descent) to minimize the error
  • The Validation set is used to frequently evaluate the performance of the model on data it hasn't seen, in order to tune the hyperparameters
  • The Test set is used for the final evaluation of the model to make a go or no-go decision at the business level. This is usually done by evaluating the model against a business performance metric or a simple benchmark

While the Validation set is used multiple times to tune hyperparameters for better model performance, the Test set is used only once.

This prevents the model from learning the Test set and provides an unbiased evaluation of the final model fit.
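One common way to carve out the three buckets, sketched with scikit-learn's train_test_split on a synthetic dataset (the 60/20/20 proportions are just an example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset of 1,000 bill/tip examples
rng = np.random.default_rng(0)
bill = rng.uniform(5, 100, size=1000).reshape(-1, 1)
tip = 1.0 + 0.12 * bill.ravel() + rng.normal(0, 0.2, size=1000)

# First carve out the Test bucket (20%), then split the rest into Training and Validation
train_val_x, test_x, train_val_y, test_y = train_test_split(
    bill, tip, test_size=0.2, random_state=42)
train_x, val_x, train_y, val_y = train_test_split(
    train_val_x, train_val_y, test_size=0.25, random_state=42)  # 60/20/20 overall

print(len(train_x), len(val_x), len(test_x))  # 600, 200, 200
```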

7. Sampling for repeatability

Machine Learning is all about experimenting with models and data to automate prediction at scale.

Successful experimentation requires repeatability. Data scientists experiment by keeping the dataset constant while tweaking and tuning the model, then running it again on the exact same experimentation dataset.

Repeatability also enables collaboration by allowing team members to run the experiment and get the same results.

In order to repeat experiments and get the same results, you need a repeatable sample of the dataset.

A systematic sampling technique is used to create repeatable datasets.

One such technique is to use a one-way hash function on a column that has a fair distribution, e.g. apply a hash to the date column and use the modulo operator to select a certain percentage of the data for a given bucket.
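A minimal sketch of that idea using a hypothetical pandas dataframe, MD5 as the one-way hash, and an example 80/10/10 split; the same rows land in the same buckets on every run:

```python
import hashlib
import pandas as pd

# Hypothetical dataframe with a date column that is fairly distributed
df = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=365).astype(str),
    "bill": range(365),
})

def bucket(value: str, num_buckets: int = 10) -> int:
    """One-way hash of a column value, reduced to a bucket number 0..num_buckets-1."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

df["bucket"] = df["date"].map(bucket)
train = df[df["bucket"] < 8]         # ~80% of the data, always the same rows
validation = df[df["bucket"] == 8]   # ~10%
test = df[df["bucket"] == 9]         # ~10%
```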

That completes the seven basic Machine Learning concepts. It is worth repeating that Machine Learning is an experimentation process with data and models. Understanding these basic concepts is not only essential for putting that process into practice, but also helps the business engage better with Data Scientists.
