Multiple linear regression: closed-form solution and example in code
Hello fellow machine learners,
Last week, we derived the closed-form solution for simple linear regression and built a model which put the maths into action.
But you may have wondered: the dataset we used had more than one feature, and a linear combination of such features may contribute to the target feature. Can this be handled by a linear regression technique?
It can indeed! All it requires is an extension of what we covered last week.
Setup
Assume that there exists a linear relationship between our input features x_1, x_2, …, x_n and our target feature y. We can model this in a similar way to last week’s example, except this time we leverage a linear combination of the input features:
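$$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$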
You can think of this as a line of best fit where the y-intercept of the line is given by w_0, and we have n axes instead of just 1. We humans can’t really visualise this beyond n=2, an example of which we’ll showcase later.
Now, you may be wondering about the following:
1. What does the loss function look like now that we have n features?
2. How do we find the weights that minimise it?
Well, the loss is essentially the same as in the 1D case: we sum up the squared residuals. It’s just that y now depends on n features, so we need to step up our game slightly.
It’s time for us to bust out the most important tool in our maths toolbox when it comes to machine learning.
It’s time… for linear algebra.
Did someone say ‘linear algebra’?
A dataset of n features and one target, with m data samples, takes the form
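$$\big(x_{11}, x_{12}, \dots, x_{1n}, y_1\big), \quad \big(x_{21}, x_{22}, \dots, x_{2n}, y_2\big), \quad \dots, \quad \big(x_{m1}, x_{m2}, \dots, x_{mn}, y_m\big)$$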
To clarify:
- Each data point is now an (n+1)-dimensional vector, as showcased above.
- We have m such vectors in our dataset.
- The double subscript for each x is necessary to ensure we can keep track of which element of which vector we’re talking about. The first index denotes the sample number, and the second denotes the feature number.
The awesome thing about this is that we can write a matrix-vector equation that ties together the feature values x, the m target values y and the weights to be determined, w. In full generality, this takes the form
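$$y = Xw$$

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1n} \\ 1 & x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_n \end{pmatrix}$$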
where y, X and w are the target vector, feature matrix and vector of weights respectively. Note the column of ones in X: it pairs with the intercept w_0. Convince yourself that the above makes sense by multiplying out the matrix-vector product on the right-hand side. It should match up exactly with what we wrote in the ‘Setup’ section.
We now aim to solve for the optimal set of weights, as denoted below:
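$$w^* = \arg\min_{w} L(w)$$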
And now for the answer to question 1: the loss is given by
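$$L(w) = \sum_{i=1}^{m} \big(y_i - \hat{y}_i\big)^2$$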
where
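$$\hat{y}_i = w_0 + \sum_{j=1}^{n} w_j x_{ij}$$

is the model’s prediction for the i-th sample.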
Here’s the cool thing: if we set n=1, then we effectively recover the loss function for the simple linear regression model! This is no coincidence, and demonstrates one (of many) advantage(s) of utilising linear algebraic techniques.
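Concretely, setting n=1 gives

$$L(w_0, w_1) = \sum_{i=1}^{m} \big(y_i - (w_0 + w_1 x_{i1})\big)^2,$$

which is exactly the sum of squared residuals we minimised last week.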
We can also write the loss in terms of the vectors rather than using the summation notation. Verify for yourself that the following still works:
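$$L(w) = (y - Xw)^\top (y - Xw)$$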
Matrix-vector calculus
The idea behind higher-dimensional calculus is that we apply the standard 1D derivative rules to each component separately. Thanks to linear algebra, we can write derivatives with respect to vectors far more compactly than just listing out a bunch of very similar equations.
Let us begin by expanding out the loss function:
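$$L(w) = (y - Xw)^\top (y - Xw) = y^\top y - y^\top X w - w^\top X^\top y + w^\top X^\top X w$$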
Now, since the loss is just a scalar value, each of the four summands is also a scalar. This means that
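$$w^\top X^\top y = \big(w^\top X^\top y\big)^\top = y^\top X w$$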
and hence
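$$L(w) = y^\top y - 2\, y^\top X w + w^\top X^\top X w$$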
The calculations for the derivatives do get a bit messy. I’ve got a bunch of stuff written down on paper, but I reckon it’ll take a newsletter’s worth of LaTeX to type out. If this would be of interest, do leave a comment about it and I’d be happy to write it up 😃
For now, you’ll just have to trust me when I say that the derivative of the loss with respect to the vector of weights is
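$$\frac{\partial L}{\partial w} = -2\, X^\top y + 2\, X^\top X w$$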
Setting this equal to zero and rearranging, we get that
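$$w^* = \big(X^\top X\big)^{-1} X^\top y$$

(provided that $X^\top X$ is invertible).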
This means that, if we have the feature and target values, we can line them up in a matrix and vector respectively and carry out the above operations to find our optimal weight values!
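If you’d like to see this carried out in raw NumPy, here’s a minimal sketch of the closed-form calculation (the toy numbers and the array names X_raw, y and w_star are just placeholders, not the dataset we use later):

```python
import numpy as np

# Toy data: m = 4 samples, n = 2 features (swap in your own feature matrix and targets)
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])
y = np.array([3.0, 3.5, 6.0, 9.0])

# Prepend a column of ones so that w_0 acts as the intercept
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# Closed-form (normal equation) solution: solve (X^T X) w = X^T y
# np.linalg.solve is preferred over explicitly inverting X^T X
w_star = np.linalg.solve(X.T @ X, X.T @ y)

print(w_star)  # [w_0, w_1, w_2]
```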
Regression surface in code
This time, we will be dissecting the student exam performance dataset I found on Kaggle.
The students in the dataset have had their exam scores ranked on a performance index. Naturally, we aim to identify what features contribute the most toward exam scores.
For now, we will analyse the impact of just two features:
- Hours Studied: the total number of hours spent studying by each student.
- Sleep Hours: the average number of hours of sleep the student had per day.
As we are looking at the impact of two features on one target, we can plot our points in 3D. This time, rather than a line of best fit, we get a plane of best fit, as illustrated below:
The complexity of our regression surface scales with the number of features we include.
Namely: for a dataset of n+1 fields (n features and 1 target), our model will produce an n-dimensional hyperplane in (n+1)-dimensional space.
And now, for the moment you’ve all been waiting for… the two lines of code responsible for building the model:
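Here’s a sketch of how those two lines sit in context with scikit-learn (the CSV file name and column names are assumptions based on the dataset described above; adjust them to match your local copy):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the Kaggle dataset (file name is an assumption)
df = pd.read_csv("Student_Performance.csv")

# Two features and one target, as described above
X = df[["Hours Studied", "Sleep Hours"]]
y = df["Performance Index"]

# The two lines that build the model: instantiate, then fit
model = LinearRegression()
model.fit(X, y)

print(model.intercept_, model.coef_)  # w_0, followed by w_1 and w_2
```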
Yup, verrryyyy similar to last week’s. Be sure to check out my full Python notebook for the context of these two lines of code. It also includes a dynamic version of the surface plot that you can rotate and zoom in/out on. This can all be found here.
Packing it all up
Another week, another round-up!
- To extend our 1D linear regression to higher dimensions (i.e. more fields of data), we leveraged matrices and vectors. Brush up on your linear algebra skills; this is one of the most important tools for us machine learners.
- Notice how similar the workflow is for solving the n-dimensional problem compared to our 1D problem from last week. Sure, it’s more complicated. But it follows the exact same recipe: setting n=1 in all of the above collapses the problem back into what we tackled last week.
- Don’t be put off by terms such as ‘hyperplane’, or the fact that we can’t visualise solutions for regression in n>3 dimensions. This is the beauty of generalising the maths to n dimensions. We don’t need to visualise 4D hyperplanes to know that the maths checks out, and will provide us a closed-form solution no matter how, er, ‘hyper’ our hyperplane surfaces become.
Sources
Paper on estimating the parameters of the Ordinary Least Squares regression model: https://scik.org/index.php/jmcs/article/view/5454
Training complete!
I really hope you enjoyed reading!
Do leave a comment if you’re unsure about anything, if you think I’ve made a mistake somewhere, or if you have a suggestion for what we should learn about next 😎
Until next Sunday,
Ameer
Originally published at https://ameersaleem.substack.com.