Machine Learning – An Application of Math

Frontier technologies have been in the spotlight for many years now, and terms like 'Machine Learning' are thrown around in every other sentence. Yet very few people understand how these models work. The truth is that computer science, like most science, relies heavily upon mathematical modeling. The simplest form of Machine Learning is Linear Regression, and by the end of this article, you will understand the math needed to build such a model!

Linear Regression is used in situations where some past data is available for us to analyze. For example, we can use Linear Regression to predict the rent of a house based on various factors, because we have data on house rents in those regions over past years. The first step in Linear Regression is to make a scatterplot of the collected data, to visualize the relationships between rent, the area of the apartment, and so on.

The goal of any linear regression model is to find a line of best fit, i.e., an equation that accurately summarizes all the data points. Once such a line is found, it can be used to predict future values, or to see how the prediction changes as we change the variables. The general equation of this line is:

$$y = m_1 x_1 + m_2 x_2 + \dots + m_n x_n + b$$
Here $y$ is the dependent variable (the rent of the house) and $x_1, x_2$, etc. are independent variables, i.e., values that $y$ depends upon (size, age, etc.). In this equation, the values $m_1, m_2$, etc. are the slopes, each one being the ratio of the change in $y$ to the change in the respective $x$ value, and $b$ is a constant (the intercept).
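To make the equation concrete, here is a minimal sketch in Python. The feature names, slopes, and intercept below are made up purely for illustration, not fitted to any real data:

```python
# A made-up example of the line equation with two features:
# x1 = size (sq. ft.) and x2 = age (years). All numbers are illustrative.
def predict_rent(size, age, m1, m2, b):
    # y = m1*x1 + m2*x2 + b
    return m1 * size + m2 * age + b

# Hypothetical slopes and intercept, chosen just to show the arithmetic.
print(predict_rent(size=900, age=10, m1=1.5, m2=-20.0, b=300.0))  # 1450.0
```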

Finding the line of best fit, i.e., the line which summarizes the data most accurately, comes down to finding the most accurate values of $m$ and $b$. Thus, for each candidate pair of $m$ and $b$, we calculate the loss, which signifies the error in the model. This loss is calculated by running the model on values we already know the answer to. For example, if the rent of a house in 1970 was $1000, we put $x = 1970$ into the model and get a result. If the model returns a value of $1050, the loss is $50^2 = 2500$ in this case. We use the squared value so that a negative error does not cancel out and reduce the total loss. (Keep in mind the model has never seen this data point before: it is trained with one set of data, while another set is reserved for testing.)
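In code, the loss from the example above might look like this (a sketch for the one-feature case; the $1000 and $1050 figures come straight from the example):

```python
# Squared error for a single known data point, as in the example above.
def squared_error(y_true, y_pred):
    return (y_true - y_pred) ** 2

print(squared_error(1000, 1050))  # (-50)**2 = 2500

# Averaging the squared errors over a whole dataset gives the total
# loss of a candidate line y = m*x + b.
def loss(xs, ys, m, b):
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)
```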

Obviously, the model with the least loss is the most effective, so we must now write a program to find the values of $m$ and $b$ that minimize the loss. As computers are extremely fast at lengthy calculations, we use an iterative approach to minimize this loss and find the $m$ and $b$ values. This method is called gradient descent, and it relies on equations for the gradient of the loss. The graph below shows how this program finds the point of least loss and then outputs the $b$ value for that point. The same process is carried out for the $m$ value too.

The mathematical equation for the intercept gradient, used to update the $b$ value at each point of the graph, is:

$$\frac{\partial L}{\partial b} = -\frac{2}{N}\sum_{i=1}^{N}\bigl(y_i - (m x_i + b)\bigr)$$
And the mathematical equation for the slope gradient, used to update the $m$ value at each point of the graph, is:

$$\frac{\partial L}{\partial m} = -\frac{2}{N}\sum_{i=1}^{N} x_i\bigl(y_i - (m x_i + b)\bigr)$$
Here, $N$ is the number of values in the dataset, and $m$ and $b$ are the current guesses of the slope and intercept respectively. The summation sign acts like a loop: the expression inside is evaluated and added up for every data point, and the total is multiplied by $-2/N$. The $x_i$ and $y_i$ are the values of each data point in the training set.
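Translated directly into Python, the two gradient formulas might look like this (a sketch, again for the one-feature case):

```python
# Gradient of the loss with respect to b (the intercept gradient).
def intercept_gradient(xs, ys, m, b):
    N = len(xs)
    return -(2 / N) * sum(y - (m * x + b) for x, y in zip(xs, ys))

# Gradient of the loss with respect to m (the slope gradient).
def slope_gradient(xs, ys, m, b):
    N = len(xs)
    return -(2 / N) * sum(x * (y - (m * x + b)) for x, y in zip(xs, ys))
```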

These formulas are derived using calculus; try looking at the proofs and understanding them yourself!
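As a taste of one such proof: writing the loss as the mean of the squared errors and applying the chain rule gives the intercept gradient directly.

$$L(m, b) = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - (m x_i + b)\bigr)^2$$

$$\frac{\partial L}{\partial b} = \frac{1}{N}\sum_{i=1}^{N} 2\bigl(y_i - (m x_i + b)\bigr)\cdot(-1) = -\frac{2}{N}\sum_{i=1}^{N}\bigl(y_i - (m x_i + b)\bigr)$$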
Now that we can calculate the gradients, we use a loop to evaluate them at different points on the graph and home in on the optimum values of $m$ and $b$. This optimum occurs when the process converges, i.e., when the values of $m$ and $b$ stop changing, or change only by infinitesimally small amounts. The number of times we run the loop to find the point of convergence is called the number of iterations, and is decided by the user. The learning rate is another user-provided value: it decides how big a step the program takes down the loss curve at each iteration (see Fig. 3), and therefore how fast the program converges. A sketch of this loop in code is given below.
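Putting it all together, the loop might look like this in Python, reusing the gradient functions sketched earlier. The learning rate, iteration count, and convergence threshold are all user-chosen; the defaults below are only examples:

```python
# A sketch of gradient descent using the gradient functions above.
def gradient_descent(xs, ys, learning_rate=0.001, iterations=50_000):
    m, b = 0.0, 0.0  # arbitrary starting guesses
    for _ in range(iterations):
        # Step each parameter against its gradient.
        new_m = m - learning_rate * slope_gradient(xs, ys, m, b)
        new_b = b - learning_rate * intercept_gradient(xs, ys, m, b)
        # Convergence: stop once m and b have essentially stopped changing.
        if abs(new_m - m) < 1e-9 and abs(new_b - b) < 1e-9:
            break
        m, b = new_m, new_b
    return m, b
```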

In the model above, R is the point of convergence. If we start the program from point P, it will keep going down the curve until it reaches R. The learning rate and the number of iterations together control how the program moves down the curve: in the first iteration, subtracting the learning rate times the gradient from the $m$ and $b$ values at P gives the $m$ and $b$ values at Q. However, if the learning rate is too large, we will overshoot from P to Z and miss the convergence point at R; if it is too small, the program will take far too long to get there. Similarly, if the number of iterations is too small, the loop will stop before the values have converged, and if it is unnecessarily large, the computer wastes time after the values have already settled. The number of iterations and the learning rate do not have one exact correct value; any values within a certain range will produce a good model, so it is the user's responsibility to choose sensible ones.


Once we find the correct $m$ and $b$ values, we have completed Linear Regression! The equation can then be used to predict values in the future: for example, the rent of a house in 2025 can be read straight off the line of best fit we calculated for Fig. 1.
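The exact fitted equation depends on the data behind Fig. 1, so as an illustration, here is an end-to-end run on entirely made-up rent data, reusing the gradient_descent sketch from above:

```python
# Made-up data: rent recorded every decade, with years rescaled to
# "years since 1970" so the raw numbers stay small for gradient descent.
years = [0, 10, 20, 30, 40]            # 1970, 1980, ..., 2010
rents = [1000, 1400, 1800, 2300, 2900]

m, b = gradient_descent(years, rents)
print(m * 55 + b)  # predicted rent in 2025 (55 years after 1970), ~3525
```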

Congrats! We have officially finished the math required for Linear Regression. Try learning the code required to make these graphs and predict the requisite values. Until next time!

Aryan Agarwal

11 – A
