MATHEMATICS BEHIND LINEAR REGRESSION-

Prakhar Saxena
7 min read · Jan 14, 2019

Supervised machine learning can be broadly classified as -

  1. Regression
  2. Classification

Regression - the model predicts continuous-valued outputs.

Classification - the model predicts a discrete value (class) for every input.

As can be seen above-

In the first figure we are predicting housing prices, that is, we predict the price of a house based on certain features.

In the second figure we are determining whether a tumor is malignant (cancerous) or benign (a normal tumor).

Well here are a few examples-

Tip - To determine whether a problem is a classification problem or not, ask yourself what form the answer to the question can take.

If the answer can be in the form of 0 or 1, yes or no, or true or false, it is a classification problem.

If the answer is a real value, like the price of a house, it is a regression problem.

LINEAR REGRESSION

Let us consider the housing price prediction model. Suppose we want to predict the price of a house whose size is 1250 square feet; as can be seen from the graph, it will be predicted as roughly 220k.
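To make this concrete, here is a minimal sketch of what the fitted model does at prediction time. The intercept and slope below are hypothetical values, chosen only so the numbers match the example above; in practice they come from training.

```python
# Hypothetical straight-line model: price (in $1000s) as a function of size.
# intercept and slope are made-up values, picked so that 1250 sq ft -> ~220k.
def predict_price(size_sqft, intercept=45.0, slope=0.14):
    return intercept + slope * size_sqft

print(predict_price(1250))  # -> 220.0, i.e. about 220k, as read off the graph
```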

The above pic shows the notation used in the equations below: m is the number of training examples, x is the input variable (feature), y is the output variable (target), and (x^(i), y^(i)) denotes the i-th training example.

HOW DOES THE ALGORITHM ACTUALLY WORK-

Flow of working of the algorithm.

The figure is self-explanatory: the training set is fed to the learning algorithm, which produces a hypothesis function h; given a new input (the size of a house), h outputs the predicted price.

COST FUNCTION-

Htheta(x), as defined earlier, is the hypothesis function. Theta0 and Theta1 are the parameters that define the equation of the hypothesis; as can be seen above, a completely different graph is obtained for each choice of these values.

The hypothesis function is the function whose predictions we want to be as close as possible to the actual values, that is, y.
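For a single feature, the hypothesis discussed above is the standard straight-line model, written out here:

h_\theta(x) = \theta_0 + \theta_1 x

\theta_0 is the intercept and \theta_1 is the slope of the line.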

The basic idea is written in the pic above.

The cost function is the function J(Theta0, Theta1). We have to minimize this function, that is, find the values of Theta0 and Theta1 for which the sum of squared errors is minimum.
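Written out in full, the squared-error cost function being referred to is:

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

and the goal is to find \min_{\theta_0, \theta_1} J(\theta_0, \theta_1). The factor of 1/2 is only a convenience that makes the derivative cleaner.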

CASE 1 - THETA0 = 0.

The above pic shows a particular example.

Further explanation of hypothesis function and cost function-

Consider a particular case of the cost function where Theta0 = 0.

Consider a second case -

When Theta1 = 1:

Working -

Similarly, computing J for different values of Theta1 (with Theta0 = 0), we can plot the graph of the cost function as obtained above.

As can be seen in the graph, Theta1 = 1 and Theta0 = 0 give the minimum value of J, which is represented by the light blue straight line in the left graph and the cross in the right one.
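A tiny numerical sketch makes this concrete. The dataset below is hypothetical (three points lying exactly on the line y = x), chosen so that Theta1 = 1 is the obvious best fit:

```python
# Hypothetical training set lying exactly on y = x, so theta1 = 1 fits perfectly.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]
m = len(xs)

def cost(theta1):
    """J(theta1) = (1/2m) * sum of squared errors of h(x) = theta1 * x (theta0 = 0)."""
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

for theta1 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(theta1, round(cost(theta1), 3))
# J is 0 at theta1 = 1 and grows on either side, giving the bowl-shaped
# curve of J against theta1 described above.
```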

SUMMARY-

CASE 2 - THETA0 NOT EQUAL TO 0

When Theta0 is not equal to zero, the cost function for the above problem is similar, but J is now a surface over both Theta0 and Theta1, as shown in the 3D figure.

This 3D figure can also be represented as contour plots-

The left figure is self-explanatory.

The ellipses in the right figure represent level curves of the cost function: every point on the same ellipse has the same value of J. The three crosses marked above therefore have the same value of J.

EXPLANATION OF THE PLOT GIVEN ABOVE-

Consider an example marked in red above.

The values of Theta0 and Theta1 are marked. The hypothesis is not a very good one, as can be seen from how far the marked point is from the centre of the contours (where J is minimum).
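For the curious, here is a rough sketch of how such a contour plot can be produced: evaluate J on a grid of (Theta0, Theta1) values and draw its level curves. The toy dataset below is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical toy data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 2.5, 3.5, 4.5])
m = len(x)

theta0_vals = np.linspace(-2.0, 3.0, 100)
theta1_vals = np.linspace(-1.0, 3.0, 100)
T0, T1 = np.meshgrid(theta0_vals, theta1_vals)

# Cost J(theta0, theta1) at every grid point.
J = np.zeros_like(T0)
for xi, yi in zip(x, y):
    J += (T0 + T1 * xi - yi) ** 2
J /= 2 * m

plt.contour(T0, T1, J, levels=30)  # each ellipse is a set of points with equal J
plt.xlabel("theta0")
plt.ylabel("theta1")
plt.show()
```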

GRADIENT DESCENT-

What do we aim to achieve?

How do we achieve the above goal?

Suppose this is the cost function for the values of Theta0 and Theta1.

As we run the gradient descent algorithm, we move downhill from the starting point to reach the minimum value -

Each step is marked by a cross, which we reach from the point just before it, starting from the initial point.

There can be several cases for this depending on the starting point we choose-

Are there many local minima in the graph?

Yes, and so gradient descent can end up at different values of J depending on the starting point.

MATHEMATICS OF GRADIENT DESCENT ALGORITHM-

The gradient descent algorithm is written above. In each step we perform an assignment (update) of the parameters, and a new term, alpha, is introduced here, which is the learning rate.

Here j = 0 and j = 1 means that we only have to compute the values of Theta0 and Theta1, and the two updates are performed simultaneously.
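For reference, the update rule referred to above is the standard one:

repeat until convergence {
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \qquad \text{(simultaneously for } j = 0 \text{ and } j = 1\text{)}
}

where := denotes assignment and \alpha is the learning rate.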

CASE WHEN THETA0=0.

Suppose we start from a random point marked as Theta1 in the figure.

The corresponding equation is written above.

The behaviour of gradient descent for different values of alpha is clearly shown in the figures: if alpha is too small, gradient descent takes tiny steps and is slow to converge; if alpha is too large, it can overshoot the minimum, fail to converge, or even diverge.
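The following sketch shows the same effect on a one-parameter cost J(theta) = (theta - 1)^2; the cost and the alpha values are hypothetical, chosen purely for illustration.

```python
# Gradient descent on J(theta) = (theta - 1)^2, whose minimum is at theta = 1.
def gradient_descent_1d(alpha, theta=5.0, steps=20):
    for _ in range(steps):
        grad = 2.0 * (theta - 1.0)   # dJ/dtheta
        theta = theta - alpha * grad
    return theta

print(gradient_descent_1d(alpha=0.01))  # too small: after 20 steps still far from 1
print(gradient_descent_1d(alpha=0.10))  # reasonable: ends up close to 1
print(gradient_descent_1d(alpha=1.10))  # too large: overshoots and diverges
```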

Now answer the following question-

GRADIENT DESCENT ALGORITHM FOR LINEAR REGRESSION-

By using differential calculus in the gradient descent algorithm (taking the partial derivatives of the cost function J), we can get the following results -

You can verify these results by working out the derivatives yourself.

So, finally, the results are -
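These are the standard update rules obtained from the differentiation:

repeat until convergence {
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)
\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
}

with both parameters updated simultaneously.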

Now, as we saw above, the gradient descent algorithm can lead to different local optima. So what should be done to obtain the global minimum?

CASE 1-

CASE 2-

There can be several such cases.

But in the case of linear regression the cost function is a bowl-shaped (convex) function -

So there will be only one minimum value.

When we run our algorithm, we move, as shown by the red cross marks, towards the centre of the contours, which is the minimum value.

The type of gradient descent we used is batch gradient descent - each step of the algorithm uses all m training examples.
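Putting the pieces together, here is a minimal, self-contained sketch of batch gradient descent for single-feature linear regression. The dataset, learning rate and iteration count are hypothetical choices for illustration.

```python
def batch_gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # "Batch": the errors h(x) - y are computed over the whole training set.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.5, 2.5, 3.5, 4.5]              # the points lie exactly on y = 0.5 + 1.0 * x
print(batch_gradient_descent(xs, ys))  # -> approximately (0.5, 1.0)
```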

Test your understanding-

NOTATIONS FOR MULTIVARIATE DATA(DATA WITH MULTIPLE FEATURES)-

Notations used in the equations are shown above: n is the number of features, x^(i) is the vector of features of the i-th training example, and x_j^(i) is the value of feature j in the i-th training example.

Hypothesis function in case of multiple features is given below-

The equations are-
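Written out in standard notation, those equations are:

h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n

and, with the convention x_0 = 1, the compact vector form h_\theta(x) = \theta^T x.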

Instead of handling the parameters one by one, we collect them into a single vector (matrix), and we work out gradient descent in the same way.

GRADIENT DESCENT FORMULA FOR MULTIPLE FEATURES-

By using differential calculus we can easily get the equations on the right side.
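Those equations are the natural generalisation of the single-feature updates:

repeat until convergence {
\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \qquad \text{(simultaneously for } j = 0, 1, \dots, n\text{)}
}

where x_0^{(i)} = 1, so the j = 0 case reduces to the Theta0 update seen earlier.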

Thus, this was the entire mathematics behind linear regression.
