course: (ML Series) Introduction to Machine Learning I

By Julien Hernandez Lallement, 2020-05-15, in category Course

course, machine learning, python

Introduction

This post is a repository of material I explain to folks who want to get into machine learning and data science. As often happens when explaining, I realized I had some trouble describing a few concepts, so I started a repository I could easily reuse whenever I needed to prepare a workshop or crash course for colleagues.

I used several resources to prepare this post, such as lectures I attended back in Amsterdam, as well as different books, like this one and that one.

This will be a series of posts. I start with a general introduction, and will follow with algorithm-specific posts.

Here, I will:

  • Introduce machine learning fundamental concepts
  • Introduce the concept of cost functions

General concepts

One definition of machine learning could be:

"The field of study that allows computers to learn without being explicitly programmed."

- Arthur Samuel

In recent years, machine learning became a very fancy & hyped term that, in the private sector, describes pretty much everything involving data-driven approaches. In my opinion, machine learning is not much more than (mostly) decades-old statistical concepts applied to data. And for most of the algorithms out there, these applications themselves are also quite old.

In [22]:
from IPython.display import Image
PATH = "/home/julien/website/content/images/2020_05_ML_Intro_I/"
Image(filename = PATH + "ML_hype.jpeg", width=500, height=500)
Out[22]:

Probably one game-changing factor was the emergence of 1) high computing power combined with 2) virtually unlimited storage space in the cloud. The first allowed the implementation of complex statistical techniques that were previously not feasible (e.g., Bayesian statistics and modelling). The second allowed large amounts of data to be accumulated and used to make predictions about the future.

It should be noted that in academia, the machine learning algorithms used nowadays in the private sector have been in regular use for a while. While models need to be validated with real, empirical data in both the private and academic sectors, one main difference is that in academia, datasets are (mostly) finite (experiments have an end, after which no more data is collected), while in the private sector, data flows in every day, and models are heavily challenged by novel trends in the data. That is inherent to the processes in each sector: academia tries to elucidate a process, and collects data to get a kind of snapshot of that process at an instant $t$; the private sector tries to understand trends to propose better services and make profit.

As a result, in academia, models are fitted on collected datasets that do not grow in size and complexity anymore. Then a paper is written, the model is presented, and another group validates or invalidates the model in a further study. Nothing wrong with that of course; it gets us closer to the truth of a given process.

The way I see it, both environments have their pros and cons. The academic situation gives you extremely clean data collected in a well-controlled environment, hence probably getting you closer to a pure version of the process you are studying. On the other hand, a situation where data flows in constantly allows you to collect ever more evidence about this process, and to capture more of its complexity.

Anyway, as noted before, the final goal is completely different in each case, so let's close this parenthesis here.

Learning machine

"A computer program is said to learn from experience E with respect some class of tasks T and performance measure P , if its performance at tasks in T , as measured by P , improves with experience E."

- Tom Mitchell

Let's take an example: you want to classify which emails should land in the spam folder, and which should make it to your mailbox. The software learns to solve this task $T$ if its accuracy (or performance $P$) at classifying spam correctly increases with increasing experience $E$.

In other words, when relationships exist between variables in your dataset, a machine learning approach allows you to capture them, and to identify these relationships in new, incoming, unseen data.

Let's say you have collected two variables in different subjects, or different time points. You can plot them to observe a possible relationship between them.

In [11]:
show_example()

Here, you can observe what would be called a positive correlation between the variables: there is a trend for y to increase when x increases as well. Now, these relationships might not be systematic, i.e., they might not hold for every data point in your set, but when visualizing many observations, the trend emerges.

Once this trend is captured by a machine learning algorithm, you could start making predictions about what the value of y might be for a yet unobserved value of x (red point in the illustration below).

In [12]:
show_example(make_prediction=True)
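
show_example is a small plotting helper I use for these illustrations; it is not shown in this post. As a rough sketch of the same idea, here is what it could look like with made-up synthetic data: fit a line with numpy and predict y for an unseen x.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)                  # 50 synthetic observations
y = 2.0 * x + 1.0 + rng.normal(0, 2, 50)    # positive linear trend plus noise

slope, intercept = np.polyfit(x, y, deg=1)  # fit a straight line to the observed data
x_new = 12                                  # a yet unobserved value of x
y_pred = slope * x_new + intercept          # predicted outcome for the new value

xs = np.linspace(0, 12, 2)
plt.scatter(x, y, label="observed data")
plt.plot(xs, slope * xs + intercept, color="black", label="fitted line")
plt.scatter([x_new], [y_pred], color="red", label="prediction")
plt.legend()
plt.show()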

Now, machine learning is by nature based on past behavior, i.e., it requires previously collected data to make predictions about what the future could look like. This is why it's quite hard to make predictions in very novel and unknown situations, since data is lacking...

"All models are wrong, but some are useful."

- George E. P. Box

The quote above captures, in a nutshell, the idea that the models we use to make predictions do not reflect the absolute truth of the process they aim at explaining. The real world is far too complex and noisy for us to model fully. Our approximations will thus always be wrong. But becoming accurate at predicting something, while measuring uncertainty where possible, can be very useful for making informed decisions.

I remember that back when I was doing research in neuroscience, the complexity of the processes at hand was often overwhelming: trying to understand how decisions are made from brain activation patterns, or from electrical activity collected in animals... The models applied to the data cannot, at least for now, capture the 'real truth' of the processes we study. Our trained models will always be wrong to some extent since they are a simplified version of the truth. However, the observed relationships can be used to make inferences about how the world might work, and get us a bit closer to that truth.

ML Projects

This is why it is so important to understand your data. "Garbage in, garbage out" is one of the most important mottos for data scientists, analysts and all other data profiles. Understanding where the data came from, what its limitations, caveats and missing data points are, etc., is maybe the most important step when starting a machine learning project.

Below, I drew a flowchart of what a typical data science and machine learning project could look like:

In [5]:
PATH = "/home/julien/website/content/images/2020_05_ML_Intro_I/"
Image(filename = PATH + "datascience_workflow_crop.jpg", width=800, height=500)
Out[5]:

Based on this flow chart, a few core best practices emerge:

* Understand the business case: talk to all stakeholders, and make sure you understand as much about the data as possible before you start digging into it. More questions will emerge, so make sure you regularly discuss what you see in the data.
* What I mentioned before: garbage in -> garbage out... ensure high data quality.
* Validate your model: we will discuss about this in a follow up post.
* Answer the problem at hand: make sure you understand the scope of the project, and that your solution will address that objective.

While all of these points are very important in any data-related project, I really want to emphasize the last one. It became apparent to me lately that many data specialists provide solutions, sometimes very elegant ones, that do not address the initial problem described by the business. It is important to talk to the stakeholders and to the people who will end up using your data product to really understand their needs. It is not enough to provide a model that predicts sales if what was asked for was to improve sales. Furthermore, sometimes what is explicitly asked for is not the best approach to the underlying problem. As a data specialist, I think your job also lies there, in filling the gap between tech and non-tech profiles, and making sure everyone is aligned on the product you will be building.
Thus, talk to the stakeholders, and come back often to the business problem to make sure you are not going sideways!

In [23]:
PATH = "/home/julien/website/content/images/2020_05_ML_Intro_I/"
Image(filename = PATH + "datascience_workflow2_crop.jpg", width=800, height=500)
Out[23]:

Your First Machine Learning Problem

Data is the bread and butter

The substrate of every machine learning problem is data.

Data can take various shapes, but let's start with what computers digest most easily: numbers.

In [25]:
import numpy as np
import pandas as pd

# Set seed for reproducibility
np.random.seed(41)
n_cols = 5 
n_rows = 10
# Define random data
x = pd.DataFrame(data=np.random.randint(1,100, (n_rows, n_cols)), columns=[f'feature{i}' for i in range(5)])
y1 = pd.DataFrame(data=np.random.randint(0, 2, n_rows), columns=['target'])
y2 = pd.DataFrame(data=np.random.randint(0, 1000, n_rows), columns=['target2'])
# Concatenate data in dataframe
pd.concat([x, y1,y2], axis=1)
Out[25]:
feature0 feature1 feature2 feature3 feature4 target target2
0 65 36 13 99 81 1 180
1 66 90 24 86 27 0 74
2 57 4 36 51 71 1 102
3 85 87 62 52 93 0 826
4 89 76 93 53 56 1 641
5 18 99 82 29 54 1 724
6 20 14 62 96 35 1 78
7 59 5 60 2 77 1 201
8 13 36 5 70 21 1 585
9 6 45 23 35 47 0 974

To use a few famous examples from the data science community, the value in the table above could be:

  • prices from houses in the Boston area
  • the age of passengers on the titanic
  • pixel intensities (coming from pictures, for classification purposes).

Note how the $target$ column contains binary data, which could be whether the house was purchased or not, whether the passenger died or survived, or whether the image was classified in the first or second image category.

Note as well how the $target2$ column holds continuous data, which could represent the price of the house that was purchased, or the number of hairs present on the head of the Titanic's passengers.

I won't be describing the different types of data (continuous, nominal, binary, etc.) here because I assume that you have a background in data-related projects. But the data type strongly influences the type of model you should be using.

Basic ML model

Let's look at a first-year data science problem: predicting continuous numerical data, a problem that can be addressed with a regression approach. Check this post for more details on regressions.

As you probably remember from your early math course, a line can be defined by a slope $\mathbf{a}$ and an intercept parameter $\mathbf{b}$:

$$ y = \mathbf{a} x + \mathbf{b} $$

You can then rewrite this with so-called 'weights' $\theta$, one multiplying each feature, plus an intercept term, which determine how much each feature contributes:

$$ y(\theta, \mathbf{x}) = \theta_0 + \theta_1 x_1 (+ \theta_2 x_2 + ...) $$

I used a and b in the first notation because this is the notation used in most schools, but I will be using $\theta$ (also referred to as the weights $\mathbf{w}$) from now on.
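
To make the notation concrete, here is a minimal sketch (with arbitrary, made-up weights) of how such a weighted linear model turns a feature vector into a prediction:

import numpy as np

theta = np.array([1.5, 0.8, -0.3])  # theta_0 (intercept), theta_1, theta_2: arbitrary example weights
x = np.array([1.0, 2.0, 4.0])       # leading 1 multiplies the intercept; x_1 = 2, x_2 = 4

y = theta.dot(x)                    # y = theta_0*1 + theta_1*x_1 + theta_2*x_2
print(y)                            # 1.5 + 1.6 - 1.2 = 1.9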

In [16]:
show_example()

The slope defines the steepness of the line you see in the plot above, while the intercept defines the point where the line crosses the y axis.

The question is: how do we define this line? How do we optimize the parameters so that the line captures the most variance in the data, maximizing the information we extract from each data point?

One could think of drawing many different lines, each better capturing a particular sub-population of the plotted data.

In [17]:
show_example(random_lines=3)

However, that approach would be relatively arbitrary. The thing is, among all the lines that can be drawn, one is the optimal solution given the data seen by the model: the line that comes closest to all the data points at once, the best fit.

One way to find the best fit for linear models is to use the residuals, which measure the error between the actual data points and the fitted line.

In [41]:
show_example(plot_errors=True)

That is one approach, which works quite well for linear models. Other approaches better suit other types of models. Altogether, what we want is some kind of cost or loss function that represents the error in our model, and that we aim to minimize.

Cost Functions

In machine learning, a cost function (or loss function) is used to represent the deviation between a mathematical model and the real data. The goal is to vary the model's parameters to minimize this function, so that the model approaches the real data as closely as possible.

For example, the black lines in the graph above represent the difference between what your model would predict, and the real values of the target you are aiming to predict. Reducing these errors is what you should aim for, without falling into the issue of over-fitting which I discuss in a follow-up post. In other words, we want to minimize the residuals.

One function that is often used in these cases is the Mean Squared Error (MSE) cost function (closely related to the Residual Sum of Squares):

$$ J(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^n(y_i-y(\theta, \mathbf{x}^i))^2 $$

where:
n = number of training examples
$y_i$ = observed value
$y(\theta, \mathbf{x}^i)$ = predictions

The MSE quantifies the average squared deviation of the predicted values from the actual values. Conceptually, it can be interpreted in terms of the model's performance on the training set (the dataset fed to the model before assessing its generalization to unseen data). As you can see from the equation, the cost is larger when the model is performing poorly on this data set.
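
As a quick illustration, here is a minimal sketch of the MSE computation in numpy; the data points and parameters are made up:

import numpy as np

y_true = np.array([3.1, 4.9, 7.2, 8.8])   # observed values y_i
x = np.array([1.0, 2.0, 3.0, 4.0])        # a single feature
theta_0, theta_1 = 1.0, 2.0               # candidate parameters
y_pred = theta_0 + theta_1 * x            # model predictions

mse = np.mean((y_true - y_pred) ** 2)     # J as defined above
print(mse)                                # 0.025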

One objective of the algorithm you would be using is to minimize the cost (term J in the equation above), by testing different parameters.

As you might have guessed, the cost function defined above has a so-called optimum, that is, a set of parameter values that minimizes the cost.

Since you are looking for the lowest amount of error, an easy way to find the minimum analytically is to set the partial derivatives of the cost with respect to both parameters to zero:

\begin{equation*} \sum_{n=1}^{N} \left(\theta_0 + \theta_1 x^n - y^n\right) = 0, \qquad \sum_{n=1}^{N} x^n \left(\theta_0 + \theta_1 x^n - y^n\right) = 0 \end{equation*}

Solving this system gives you both parameters $\theta_0$ & $\theta_1$:

\begin{equation*} \theta_0 = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{N \sum x^2 - (\sum x)^2} \end{equation*}
\begin{equation*} \theta_1 = \frac{N(\sum xy) - (\sum x)(\sum y)}{N \sum x^2 - (\sum x)^2} \end{equation*}
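
These closed-form (ordinary least squares) expressions can be checked numerically; here is a minimal sketch with synthetic data:

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 5.0 + rng.normal(0, 1, 100)   # true slope 3, intercept 5, plus noise

N = len(x)
denom = N * (x**2).sum() - x.sum()**2
theta_0 = (y.sum() * (x**2).sum() - x.sum() * (x * y).sum()) / denom
theta_1 = (N * (x * y).sum() - x.sum() * y.sum()) / denom
print(theta_0, theta_1)                     # should land close to 5 and 3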

There are other cost functions that make sense for problems other than linear regression. For example, to classify data into two categories, one typically pairs a logistic function with a cross-entropy cost. See here for more details.

One can also add regularization terms to the cost function in order to constrain the parameters' coefficients, which can be useful in some cases. See Ridge or Lasso regression for useful examples.
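
As an illustration, here is a minimal sketch of fitting regularized linear models with scikit-learn on synthetic data (the regularization strengths are arbitrary choices):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))     # 100 samples, 5 features
y = X @ np.array([2.0, 0.0, 0.0, 1.0, -3.0]) + rng.normal(0, 0.5, 100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty can push some coefficients to exactly zero
print(ridge.coef_)
print(lasso.coef_)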

Gradient Descent

When you work with multiple features (that is, your dependent variable is associated with a vector of features or attributes that can be useful to predict its behavior), the multi-dimensional slope of the cost function (since we have more than one feature that we use to make predictions) is called a gradient.

One way to localize the optimum here is to use the so-called gradient descent technique. The term descent refers to the ability of the algorithm to detect in which direction each parameter should be updated. Since we want to find the optimum cost, i.e., the minimum value of the function, a parameter should be decreased if the slope with respect to it is positive, and vice versa.

Since the optimum can often not be found algebraically, we can use numerical approximation by "descending" the gradient. Gradient descent is an iterative algorithm, meaning that it runs many times. At every repetition, the value of each parameter (typically called $\theta$) is updated based on the previous iteration and a learning rate:

$ \theta_k^{new} = \theta_k^{old} - \beta \, \partial J / \partial \theta_k $

where $\theta_k$ is each of the parameters and $\beta$ is the so-called learning rate.

The number of iterations is a parameter that can be played with. In the example a few cells below, I used 10,000 iterations, but instead of using a fixed number of iterations, you could:

  • Monitor your cost after each loop and, when it decreases by less than some tolerance - say 0.001 - stop (see the sketch after this list).
  • Use a validation set and track the loss - for example, MSE - on that. When it stops decreasing, stop.
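
Here is a minimal sketch of the first option, a tolerance-based stopping rule, applied to a simple linear model on synthetic data (the tolerance and learning rate are arbitrary choices):

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 5.0 + rng.normal(0, 1, 100)    # synthetic data: true slope 3, intercept 5

theta_0, theta_1 = 0.0, 0.0
beta = 0.01                                  # learning rate
tolerance = 0.001
previous_cost = float("inf")

for iteration in range(10000):               # upper bound on the number of iterations
    residuals = y - (theta_0 + theta_1 * x)
    cost = np.mean(residuals ** 2)           # MSE for the current parameters
    if previous_cost - cost < tolerance:     # improvement smaller than the tolerance: stop early
        break
    previous_cost = cost
    theta_0 += beta * (2 / len(x)) * residuals.sum()   # gradient descent updates
    theta_1 += beta * (2 / len(x)) * residuals.dot(x)

print(iteration, theta_0, theta_1)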

Let's look at a specific example of how the cost changes as a function of a coefficient, using the Boston Housing dataset.

In [26]:
import os
os.chdir('/home/julien/website/datasets')
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
In [27]:
df = pd.read_csv('BostonHousing.csv')
In [28]:
mses = []
lstat_coef = np.arange(0.5, 1.5, 0.01)  # candidate coefficients for the lstat feature
for coef in lstat_coef:
    pred_values = np.array([coef * lstat for lstat in df.lstat.values])  # predictions with this coefficient
    mses.append(np.sum((df.medv - pred_values)**2))  # sum of squared errors for this coefficient

plt.plot(lstat_coef, mses)  # cost as a function of the coefficient
print("Min MSE: {}".format(np.sqrt(np.min(mses))))  # square root of the lowest summed squared error
print("Coeff: {}".format(lstat_coef[np.argmin(mses)]))  # coefficient value that minimizes the cost
Min MSE: 406.44763344312884
Coeff: 1.1200000000000006

Above, I took a range of coefficient values for the feature LSTAT and, for each one, calculated the squared error on our data. The resulting plot looks convex, and in fact it turns out that the MSE cost of a linear regression model is always convex. This means we can use gradient descent to find the optimal coefficients for our model.

Mathematically, gradients are just the partial derivatives with respect to the coefficients. For each coefficient we have, we calculate the derivative of our MSE with respect to that coefficient.

Remember our cost function: $$ J(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^n(y_i-y(\theta, \mathbf{x}^i))^2 $$

Now, let's expand it out for our simple example with an intercept and a single variable, lstat:

$$ J(\theta_0, \theta_1) = \frac{1}{n} \sum_{i}{\left(y_{i} - (\theta_{0} + \theta_{1} \cdot LSTAT_{i})\right)^2} $$

Now, for the derivative of this with respect to $\theta_{0}$, we get:

$$ \frac{\partial J}{\partial \theta_0} = -\frac{2}{n} \sum_{i}{\left(y_{i} - (\theta_{0} + \theta_{1} \cdot LSTAT_{i})\right)} $$

And for $\theta_{1}$:

$$ \frac{\partial J}{\partial \theta_1} = -\frac{2}{n} \sum_{i}{LSTAT_{i} \left(y_{i} - (\theta_{0} + \theta_{1} \cdot LSTAT_{i})\right)} $$

Now, let's see this gradient descent algorithm in code:

In [29]:
theta_0 = 0 # Param 1 (intercept)
theta_1 = 0 # Param 2 (lstat coefficient)
learning_rate = 0.001 # Learning rate beta
lstat_values = df.lstat.values # Feature data
n = len(lstat_values)
all_mse = []
for _ in range(10000): # Number of iterations, here fixed
    predicted = theta_0 + theta_1 * lstat_values # Compute predictions
    residuals = df.medv - predicted # Compute residuals
    all_mse.append(np.sum(residuals**2)) # Store the sum of squared residuals
    theta_0 = theta_0 - learning_rate * ((2/n) * np.sum(residuals) * -1) # Update param 1 with its gradient
    theta_1 = theta_1 - learning_rate * ((2/n) * residuals.dot(lstat_values) * -1) # Update param 2 with its gradient
In [30]:
plt.plot(range(len(all_mse)), all_mse)
Out[30]:
[<matplotlib.lines.Line2D at 0x7f453cb08b50>]
In [32]:
print("Minimum: ", min(np.sqrt(all_mse)))
print("theta_0: ",theta_0)
print("theta_1: ",theta_1)
Minimum:  139.57816823243377
theta_0:  34.27230357706076
theta_1:  -0.9331466362919295

We started at a quite high error value, and descended until we reached a plateau.

Let's plot our model on top of the real data:

In [33]:
plt.scatter(df['lstat'], df.medv)
x = range(0, 40)
plt.plot(x, [theta_0 + theta_1 * l for l in x])
Out[33]:
[<matplotlib.lines.Line2D at 0x7f453c1923d0>]

More complex shapes

Now, that works well if you have a nice convex cost function, like the left drawing below:

In [15]:
PATH = "/home/julien/website/content/images/2020_05_ML_Intro_I/"
Image(filename = PATH + "learning_rates2.jpg", width=800, height=500)
Out[15]:

What happens when your function is more complex and has many local minima? Using gradient descent might get you stuck in a local minimum, which you would mistakenly interpret as the global minimum (right drawing).

This is where the learning rate $\beta$ becomes quite an important parameter that should be chosen wisely. The learning rate is a hyper-parameter that determines how far we step along the direction of the (negative) gradient. Typically, many values can be tried (.001, .003, .01, .03, .1, .3, 1, 3).

If your learning rate is too small and your cost function has several local minima, you might get trapped in one of them, since your gradient descent algorithm won't be able to reach the global minimum. This is depicted in the right drawing above, where $\theta$ remains stuck in a local minimum, missing the optimum.

On the other hand, if your learning rate is too high, you will be jumping around your cost function, and won't find the so-desired optimum. That is depicted in the left drawing above.

  • Testing different learning rates as suggested above can help solve this issue (see the sketch after this list).

  • There are also other, more sophisticated gradient descent optimizers that adapt the learning rate over time for you. This is also something you can do yourself by slowly decaying the learning rate over time. You can read more about different optimizers here.
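
As a rough sketch of such a comparison, the snippet below runs gradient descent with a few candidate learning rates on the lstat example (re-using the dataframe df loaded earlier) and plots the resulting cost curves:

import numpy as np
import matplotlib.pyplot as plt

x = df.lstat.values   # re-using the dataframe loaded earlier
y = df.medv.values
n = len(x)

for beta in [0.0001, 0.001, 0.003]:          # candidate learning rates; much larger values may overshoot and diverge here
    theta_0, theta_1, costs = 0.0, 0.0, []
    for _ in range(1000):
        residuals = y - (theta_0 + theta_1 * x)
        costs.append(np.mean(residuals ** 2))
        theta_0 += beta * (2 / n) * residuals.sum()
        theta_1 += beta * (2 / n) * residuals.dot(x)
    plt.plot(costs, label=f"beta={beta}")

plt.xlabel("iteration")
plt.ylabel("MSE")
plt.legend()
plt.show()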

Other types of gradient descent

Another way to approach more complex gradients is to introduce randomness. Instead of using all of the data points simultaneously (which is referred to as batch gradient descent), we could update the parameters using one data point at a time (referred to as stochastic gradient descent). Note that the data point is chosen randomly.

Mathematically, this simply means taking the cost of one data point ($J_n$ in the equation below) and updating each parameter with:

$ \theta_k^{new} = \theta_k^{old} - \beta \, \partial J_n / \partial \theta_k $

This last method has a few pros and cons:

  • Convergence: since the data is chosen at random, convergence might not happen as smoothly as with batch gradient descent.
  • By the same token, this can be a good thing, since the "bouncing around" caused by the randomly selected points might get you past a local minimum where you would otherwise have gotten stuck.

Using stochastic gradient descent introduces an epoch parameter, which is the number of times you want to loop over your full dataset.

Another type of gradient descent is mini-batch gradient descent. This form is a compromise between the two, where you choose a batch size of, say, 32, and each iteration of your gradient descent uses 32 random rows of data to calculate the gradient. This offers some scalability and some randomness. This randomness, it turns out, is actually useful for cost functions which are not convex (deep learning), as it can help the model escape local minima. This is the most common method for non-convex cost functions.
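
Here is a minimal sketch of mini-batch gradient descent on the same lstat example (re-using the dataframe df loaded earlier; the batch size, learning rate and number of epochs are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
x = df.lstat.values   # re-using the dataframe loaded earlier
y = df.medv.values
n = len(x)

theta_0, theta_1 = 0.0, 0.0
beta, batch_size, epochs = 0.001, 32, 50

for _ in range(epochs):                      # one epoch = one full pass over the data
    order = rng.permutation(n)               # shuffle the rows at each epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]                  # one random mini-batch
        residuals = y[idx] - (theta_0 + theta_1 * x[idx])
        theta_0 += beta * (2 / len(idx)) * residuals.sum()     # update with the mini-batch gradient
        theta_1 += beta * (2 / len(idx)) * residuals.dot(x[idx])

print(theta_0, theta_1)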

Final words

That's it for the first post on basic ML concepts. Check out the following post, which covers more topics related to ML projects. Or jump right away to the specific posts on Regression, KMeans, KNeighborsClassifier and others!