Build your first linear regression model in 2 lines of code
Hello fellow machine learners,
I would like you to cast your mind back to your secondary school science classes. At some point, you were probably given a scatter plot and asked to draw a line of best fit.
The line of best fit is useful, because it allows us to infer a y-coordinate value, given an x-coordinate value (or vice versa). This means that you don’t need both coordinates of a new data point in order to evaluate where it should be plotted on the graph.
Hmm, wait a sec… predicting a y value given an x value that was not necessarily present in the original dataset … do we not have some flavours of machine learning here?
The typical machine learning model workflow is: train the model on a set of known data points, then use the trained model to make predictions on new, unseen inputs.
In this example, the training phase involves the computer drawing a good line of fit (which we hope is actually the ‘best’ fit); the prediction phase then uses that line to produce y-coordinates for the unseen x-coordinates it’s given.
So, if the ‘line of best fit’ exercise is something you did at school, then congratulations! Perhaps without realising, you have already leveraged some ideas from machine learning before. The technical term for this is linear regression.
For the case of the secondary school exercises, we humans conduct this ‘training’ step visually. I imagine you just looked at the points, got your ruler out and drew the line, bagging you a few easy marks in your GCSE chemistry/physics/biology/maths exam.
But how would a computer go about accomplishing such a task? Computers don’t see the data points on the graph as we do: all they see are the coordinates of each point.
Is there a way we can mathematically define the ‘line of best fit’ problem?
Mathematical formulation
When it comes to machine learning, the problem is usually set up in a mathematical way. In the context of linear regression, this helps us to rigorously define what would make a line of ‘good’ fit, a line of ‘bad’ fit and, most importantly, a line of ‘best’ fit.
We can model a generic straight line as y=ax+b, where the gradient of the line is given by ‘a’ and the y-intercept of the line is given by ‘b’. In full generality, our dataset of n points is of the form

\{ (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \}.
We aim to optimise the values of a and b to make our line fit the data as well as possible. To accomplish this, we will attempt to quantify how bad our line is at its job. We will call this value the loss of our line. Thus, the optimal parameters (a, b) will be such that the loss is minimised. Now for the formula for the loss:

L(a, b) = \sum_{i=1}^{n} \left( y_i - (a x_i + b) \right)^2
Let’s unpack where this formula came from:
- For each datapoint (x_i, y_i) in the dataset, compute the difference between the actual y_i value and the value that the regression model predicts. The model’s prediction is given by ax_i+b. This difference is called the residual.
- We compute the square of the residual. (We’ll get back to this in a sec.)
- We then sum up the squared error incurred for each datapoint, as indicated by the summation from i=1 to n. This gives us the total error. (There’s a short code sketch of this computation just after this list.)
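To make this concrete, here is a small NumPy sketch of the loss computation. The toy x and y arrays are made up purely for illustration, and the function name is my own choice:

```python
import numpy as np

def loss(a, b, x, y):
    """Sum of squared residuals for the line y = a*x + b."""
    predictions = a * x + b        # the model's predicted y-values
    residuals = y - predictions    # actual minus predicted
    return np.sum(residuals ** 2)  # square each residual, then sum over all points

# Toy data: three points that lie roughly (but not exactly) on a line
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.1, 2.9, 5.2])

print(loss(2.0, 1.0, x, y))  # loss for the line y = 2x + 1 (small)
print(loss(0.0, 3.0, x, y))  # loss for the flat line y = 3 (much larger)
```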
Why bother squaring the residuals?
Well, if we just took the loss to be the sum of the raw residuals (y_i-(ax_i+b)), then minimising it would not guarantee that the resulting parameters give us the ‘best’ line.
Consider the example where the dataset consists of just two points: {(0,1), (1,0)}.
Without any calculations, it’s clear that the line of best fit is simply y = 1 - x.
According to the non-squared residual, the loss for this line is (1 - 1) + (0 - 0) = 0.
But what if we use the line y = x + 1? Our loss for this line is (1 - 1) + (0 - 2) = -2.
And -2 < 0, so according to the algorithm, the second line provides a better fit to the data, because it incurs a smaller loss value.
This setup is illustrated below:
Hmm… what went wrong?
Well, since the residuals were not squared, negative residuals count as lower than any positive ones, so errors in one direction can cancel out errors in the other instead of adding to the total.
Square numbers cannot be negative, which helps us take care of the problem in the image. Indeed, with squaring we arrive at losses of 0 and 4 for the blue and green lines respectively, with 0 < 4 ✅
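If you want to check this in code, here is a quick Python sketch using the same two points and the same two candidate lines (the function names are just my own):

```python
# The two data points from the example above
points = [(0, 1), (1, 0)]

def signed_loss(a, b):
    # Sum of raw (non-squared) residuals: y - (a*x + b)
    return sum(y - (a * x + b) for x, y in points)

def squared_loss(a, b):
    # Sum of squared residuals
    return sum((y - (a * x + b)) ** 2 for x, y in points)

# Line of best fit: y = 1 - x  (a = -1, b = 1)
# Deliberately bad line: y = x + 1  (a = 1, b = 1)
print(signed_loss(-1, 1), signed_loss(1, 1))    # 0 and -2: the bad line 'wins'
print(squared_loss(-1, 1), squared_loss(1, 1))  # 0 and 4: the good line wins
```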
With the usual squared-residuals approach, the lowest possible value we could get for L(a, b) would be the case where the summand losses are all zero, resulting in L(a, b)=0. This would only be the result from the algorithm if our data points formed an exact straight line, which is almost never the case in real-world data.
Nice! We now have a formula that indicates how bad our line is. Next step: how do we make our line ‘less bad’?
Solution
The formula for the loss L depends on the parameters a and b. So the best fit line will come from the gradient-intercept pair (a, b) that minimises L.
I claim that the solution is given by:

a = \frac{\overline{xy} - \bar{x} \bar{y}}{\overline{x^2} - \bar{x}^2}, \qquad b = \bar{y} - a \bar{x}
In the above, we use the bar notation to denote the empirical average over the dataset, i.e.

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \quad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i, \quad \overline{xy} = \frac{1}{n} \sum_{i=1}^{n} x_i y_i, \quad \overline{x^2} = \frac{1}{n} \sum_{i=1}^{n} x_i^2.
Pay attention to the bar placement. In particular, notice the difference between the following two expressions: \overline{x^2}, the average of the squared x values, and \bar{x}^2, the square of the average x value. In general, these are not equal.
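If you’d like to see these formulas in action, here is a minimal NumPy sketch (the sample data is made up, and the function name fit_line is just my choice):

```python
import numpy as np

def fit_line(x, y):
    """Closed-form least-squares fit of y = a*x + b, using the formulas above."""
    x_bar = np.mean(x)          # x-bar
    y_bar = np.mean(y)          # y-bar
    xy_bar = np.mean(x * y)     # bar over the product x*y
    x2_bar = np.mean(x ** 2)    # bar over x squared (not x-bar squared!)

    a = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)
    b = y_bar - a * x_bar
    return a, b

# Made-up data, roughly following y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
print(fit_line(x, y))  # gradient close to 2, intercept close to 1
```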
Derivation
We follow the derivation provided in The Book of Statistical Proofs, which we modify slightly here to account for some small differences. The chain rule yields

\frac{\partial L}{\partial a} = -2 \sum_{i=1}^{n} x_i \left( y_i - (a x_i + b) \right)
and

\frac{\partial L}{\partial b} = -2 \sum_{i=1}^{n} \left( y_i - (a x_i + b) \right).
Setting the second partial derivative equal to zero and rearranging, we get

\sum_{i=1}^{n} y_i = a \sum_{i=1}^{n} x_i + n b, \quad \text{i.e.} \quad \bar{y} = a \bar{x} + b,
and this leaves us with the optimal value for b of

b = \bar{y} - a \bar{x}.
We repeat the approach for a: setting the first partial derivative equal to zero and rearranging gives

\sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i, \quad \text{i.e.} \quad \overline{xy} = a \overline{x^2} + b \bar{x}.
Substituting in our explicit b value, we get

\overline{xy} = a \overline{x^2} + (\bar{y} - a \bar{x}) \bar{x},
which we can rearrange to get the optimal value for a of

a = \frac{\overline{xy} - \bar{x} \bar{y}}{\overline{x^2} - \bar{x}^2}.
Derivation complete ✅
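As a quick numerical sanity check, the closed-form values should agree with NumPy’s own degree-1 least-squares fit. The synthetic data below is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(scale=1.5, size=50)  # noisy samples of y = 3x + 2

# Closed-form solution from the derivation above
a = (np.mean(x * y) - np.mean(x) * np.mean(y)) / (np.mean(x ** 2) - np.mean(x) ** 2)
b = np.mean(y) - a * np.mean(x)

# NumPy's built-in least-squares fit of a degree-1 polynomial
a_np, b_np = np.polyfit(x, y, deg=1)

print(a, b)        # should be close to 3 and 2
print(a_np, b_np)  # should match the closed-form values (up to floating point)
```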
Now you know how linear regression models obtain their optimal gradient and y-intercept values. Great stuff!
A tangible example
For the following, we will use the diabetes dataset provided in the scikit-learn docs.
Bear in mind the following from the documentation:
“Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.
Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of n_samples (i.e. the sum of squares of each column totals 1).”
For this article, we will stick with the standardised values. If you want to work with the raw values instead, you can scrape them yourself from the source.
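You can check that standardisation for yourself in a couple of lines (a quick sketch, assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.datasets import load_diabetes

X = load_diabetes().data            # shape (442, 10), already standardised
print(np.sum(X ** 2, axis=0))       # each column's sum of squares should be (very close to) 1
```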
Let’s see how the ‘standardised’ BMI correlates with the Disease Progression:
If we leverage scikit-learn’s built-in linear_model module, we can plot the line of best fit and compute the value for the sum of the squared residuals.
The proposed line of best fit certainly appears reasonable. But yikes. That’s a large error value.
It makes sense though, given the scale of the y-axis, as well as the fact that our points, although demonstrating some positive correlation, are by no means collinear.
As per the title, the two lines of code I used for the model:
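With scikit-learn, those two lines look something like the following. This is a sketch rather than the exact notebook code; the surrounding lines are just the pre-processing mentioned below:

```python
from sklearn import linear_model
from sklearn.datasets import load_diabetes

# Pre-processing: pull out the standardised BMI column (index 2) as a 2D array X,
# and the disease-progression values as the target y.
diabetes = load_diabetes()
X = diabetes.data[:, [2]]
y = diabetes.target

# The two lines that actually build and train the model:
model = linear_model.LinearRegression()
model.fit(X, y)
```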
Of course, there’s a bit of data pre-processing involved beforehand.
Here is a link to the full Python notebook.
Play around with the code yourself. For example, can you make regression plots for the other features in the dataset?
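To get you started, here is one way you might do that for a single feature. This is a sketch of my own rather than the exact code from the linked notebook, and it assumes scikit-learn and matplotlib are available:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
feature = "bp"                                # try any name from diabetes.feature_names
col = diabetes.feature_names.index(feature)

X = diabetes.data[:, [col]]                   # (442, 1) array of the chosen feature
y = diabetes.target                           # disease progression one year after baseline

model = linear_model.LinearRegression()
model.fit(X, y)

predictions = model.predict(X)
ssr = np.sum((y - predictions) ** 2)          # sum of squared residuals
print(f"gradient: {model.coef_[0]:.1f}, intercept: {model.intercept_:.1f}, SSR: {ssr:.0f}")

plt.scatter(X.ravel(), y, s=10)
plt.plot(X.ravel(), predictions, color="red")  # the fitted line y = ax + b
plt.xlabel(f"standardised {feature}")
plt.ylabel("disease progression")
plt.show()
```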
Why bother with the maths?
We didn’t actually need it in the code, right?
True, but that’s because we used the scikit-learn package, which handles the algorithm for us. If you wanted to code the model from scratch, you would have to know the closed-form solution, if not the derivation behind it.
Otherwise, you’ll just be blindly trusting the software packages that others have written. And where’s the fun in that?
Linear regression is a simple ML technique. But we will soon cover models with different formulations and, in particular, different use cases. And you won’t be able to critically evaluate good model choices unless you have at least a vague understanding of what’s going on under the hood.
And if my maths background has taught me anything, it’s this: you should always be able to justify why you’re doing what you’re doing, at least in some part. This is what, I believe, sets a ‘good’ data scientist apart from a ‘great’ one.
Or rather, it’s what distinguishes a data scientist of ‘good’ fit from a data scientist of ‘best’ fit 😉
Perhaps the full mathematical derivation isn’t necessary.
But the underlying concepts will take you a long way.
Plus, I find it fun to explore the formulas that underpin all this ‘machine learning’ business.
And this is my attempt at putting the ‘Unpacked’ in ‘Machine Learning Algorithms Unpacked’.
Let’s summarise what we’ve learned this week:
- The line of best fit exercise presents, believe it or not, a bona fide example of machine learning in action. We were all basically budding ML engineers back in our GCSE science days, who would’ve guessed.
- The process of training a linear regression model involves minimising the sum of the squared residuals with respect to the gradient and intercept of the line. The idea of minimising our loss corresponds directly to the line of best fit. Keep the idea of optimisation in mind, as we will come across it again in the future.
- We have to compute the sum of the squared differences in order to avoid the issue of errors cancelling each other out.
- You don’t need to memorise any formulas as such, but you should now be able to explain to another machine learner how a linear regression model obtains its optimal parameters.
I really hope you enjoyed reading!
There’s a bit more going on in this article than in the previous ones, so do leave a comment if you think I have made any mistakes, caused confusion, etc.
Until next Sunday,
Ameer
Originally published at https://ameersaleem.substack.com.