Why We Don’t Approach Classification Problems Using Linear Regression in Machine Learning
As we know, we are all at home because of Covid-19. When I started to think about how I could improve myself during this time, I came to the conclusion that I might finally start studying Machine Learning, a branch of Artificial Intelligence that has been on my mind for years, and even make myself a competent person in this field. By changing the Rubber Duck Debugging principle a little, I took Lord Vader with me and started Andrew Ng’s Machine Learning course on Coursera and Google’s Crash Course, with the mentorship of Global AI Hub, Deep Learning AI and Google Developers.
During the course, while I was able to solve regression problems with linear regression under the supervised learning umbrella, I came across the question of why we cannot solve our classification problems with linear regression, and I decided to write an article on this topic. First of all, I would like to thank my Vader figure, Spotify, Paint and the 3-liter water bottle that I used frequently during this writing 🙂
I thought it would be good to prepare an introduction section at the beginning of my writing to introduce the logic and reinforce the most basic concepts. If you think you will not need an introduction, you can skip to Using Linear Regression in Classification Problems and Consequences 🙂
If you are ready then take your plastic duck and let’s start 🙂
Roughly speaking, our goal in Machine Learning is to produce an output (a prediction) for an input value the model has not seen before. For example, say you want to estimate the value of your home using machine learning. To achieve that, you give data about your house as input, and the machine learning model predicts the corresponding price from this information.
But wait a minute. How does our machine learning algorithm interpret those values? Is our algorithm reading our minds using the Force? If so, what happens if we are a Toydarian instead of a human? (A Star Wars joke. Those who get it are already smiling 🙂 )
To allow our Artificial Intelligence to predict as desired, we first need to feed it with data. To make this step easier to understand, I have created scenarios for both regression and classification problems below.
INTRODUCTION — Regression Problem Scenario
Imagine you have a house and you want to sell it, but you don’t know the value of your house. You can use machine learning to estimate the price. But how?
In machine learning, you give your model hundreds, thousands or even billions of house-information and price records (this collection is referred to as a data set) that you already have, and the model learns from this information. You then ask it to estimate a price for a combination of house information (such as number of rooms, square meters or location) that was not included in the data set you previously provided, and it makes a prediction as well as it has learned. I say “as well as it has learned” because we can optimize how our model learns during training.
Feature: The values from our data set that we want the model to use in order to estimate the price of a house in our Regression Problem Scenario. For example, if we want to predict the price according to the number of rooms, house location and square meters, then these fields are accepted as features, and machine learning learns these features together with the corresponding house prices in our data set.
Label: The value we want to estimate in our data set. In our Regression Problem Scenario, this value is the price.
Linear Regression Equation: If we have a single feature, simple linear regression is used; if we have multiple features, multiple linear regression is used. In these equations our features are shown as x, and the weight of each feature is shown as w. The intercept term w0 represents the distance of our line from the origin.
Hypothesis function: When we train our model with the relevant data, it returns a hypothesis function. Our model makes estimates by plugging the feature values we supply into the relevant parameters of this function.
For example, in our regression problem, let’s say we want to make a price estimate using only the age of the building. As a result of the training, our model returns us the formula y = w0 + w1x1. You remember this formula from somewhere, don’t you? Yes! This is the linear regression equation. When we plot this formula on a chart, we get a straight line. Here, y is the label value to be estimated, x1 is the feature value (age of the building) used in the estimation, and w0 and w1 are the weight values that our model determines during training.
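To make this concrete, here is a minimal sketch of fitting such a hypothesis with numpy’s least-squares `polyfit`. The building ages and prices below are made-up illustrative values, not data from the article:

```python
import numpy as np

# Hypothetical data: building age (years) vs. price (thousand TL)
ages = np.array([1, 2, 3, 5, 8, 10, 15])
prices = np.array([120, 115, 110, 100, 88, 80, 60])

# Fit y = w0 + w1 * x1 by least squares; polyfit returns [w1, w0]
w1, w0 = np.polyfit(ages, prices, deg=1)

def hypothesis(x):
    """Predict the price for a given building age."""
    return w0 + w1 * x

print(hypothesis(4))  # predicted price for a 4-year-old building
```

Since older buildings are cheaper in this toy data set, the fitted weight w1 comes out negative.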
Cost Function: The cost function measures the difference between the actual label values in the data we previously provided to our model and the label values our model predicted from the corresponding features. What am I trying to say? For example, assume the price of a house with a building age of 3 in your data set is 100.000 Turkish Liras (TL), and assume our model predicted 98.000 TL for it. Then your loss is 2.000 TL, since 100.000 − 98.000 = 2.000 TL. The cost function performs this calculation for every example in your data set (typically squaring the differences first, as in mean squared error) and finds the average loss. You can observe the performance of your model with the cost function. For linear regression the cost function is convex, so when we plot it we get a bowl-shaped curve.
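The article computes a plain difference for one house; the usual regression cost squares each difference and averages, giving the mean squared error. A short sketch:

```python
import numpy as np

def mse_cost(y_true, y_pred):
    """Mean squared error: the average of the squared prediction errors."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

# One house: actual 100.000 TL, predicted 98.000 TL
# The raw error is 2.000 TL; the squared error is 4.000.000
print(mse_cost([100_000], [98_000]))
```

Squaring keeps errors positive and penalizes large misses more heavily, which is part of what makes this cost convex for linear regression.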
INTRODUCTION — Classification Problem Scenario
The home-price problem we just talked about was a regression problem, so the predicted value was continuous. (The value to be estimated can be anywhere between 0 and +infinity.)
Imagine that you’re a doctor trying to guess whether a tumor is benign or malignant based on each patient’s tumor size. You can use machine learning for this. But how? (You’ve felt the Déjà Vu, right? 🙂 )
In this problem, the machine learning model can predict only one of 2 outcomes: the tumor is either benign or malignant. There is no other option. Therefore, the prediction value is discrete, drawn from a limited set of values.
You give your machine learning model the patient data you already have. In this data set, we have two columns: the patient’s tumor size and whether this tumor is benign or malignant.
Can you tell which column is the feature and which is the label by looking at our terminology section?
If you named the tumor size as the feature and the tumor class (benign/malignant) as the label, you got it right!
INTRODUCTION — EPILOGUE
If we are using Python while developing Machine Learning projects, we can prepare our data with the numpy and pandas libraries and visualize it. Visualizing the data is a guide for us when creating our machine learning model.
Below you can see the two characteristic graphs for the two types of Supervised Learning problems (Regression and Classification), visualized with the help of numpy and pandas:
The points in this graph are the actual data values in our data set, and our model tries to draw the most optimal curve through this distribution so that its predictions come very close to the actual values. As you can see in the picture above, fitting a straight line to the data distribution of a classification problem, as we do in a regression problem, leads to inconsistencies and errors in the estimates.
Using Linear Regression in Classification Problems and Consequences
Let’s go back to the tumor example above and approach this problem using linear regression. (Paint skills are loading… 🙂)
First of all we can graph our data as follows:
Let’s imagine that we calculate and draw the linear regression line on our model’s graph.
If we want to perform our estimates through this linear regression model, we can determine a threshold value on the y-axis. For example, let’s say we set this threshold to 0.5. If we add this threshold value to our graph, it will look like the following.
The beta value is the x-axis value corresponding to our threshold value.
Tumor sizes smaller than the beta value will be classified as benign; tumor sizes larger than the beta value will be classified as malignant.
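The thresholding idea above can be sketched in a few lines of numpy. The tumor sizes and labels here are made-up illustrative values, and the 0.5 threshold matches the one in the article:

```python
import numpy as np

# Hypothetical tumor sizes (cm) and labels (0 = benign, 1 = malignant)
sizes = np.array([1.0, 1.5, 2.0, 3.0, 3.5, 4.0])
labels = np.array([0, 0, 0, 1, 1, 1])

# Fit a straight line y = w0 + w1 * x to the 0/1 labels
w1, w0 = np.polyfit(sizes, labels, deg=1)

THRESHOLD = 0.5
beta = (THRESHOLD - w0) / w1  # x-axis value where the line crosses 0.5

def classify(size):
    """1 (malignant) if the line's output reaches the threshold, else 0 (benign)."""
    return 1 if w0 + w1 * size >= THRESHOLD else 0

print(beta)  # the decision boundary on the tumor-size axis
```

With this symmetric toy data the boundary lands neatly between the two groups, which is exactly the lucky case the article warns will not always happen.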
Let’s assume we get one more training example, and that it lies far away from the others.
Then the comparison of the new linear regression line we just calculated with the linear regression line before adding the additional data would be as follows:
We see that our line changes with every new data point. The last added example gives us no new information: it was already obvious that the newly added point is a malignant tumor, since it lies to the right of our beta value. However, the slope of our line shifted, which moved our beta point further in the positive direction and gave us a worse hypothesis function. Of course, we should not ignore that our linear regression model will not always be even this lucky in separating two discrete values. Looking at our tumor example, the label can only be 1 or 0, which means our hypothesis function should also output only 1 or 0. However, the linear regression hypothesis may return a value less than 0 or greater than 1, since linear regression produces a continuous, unbounded value.
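We can reproduce this shift numerically. Using the same made-up tumor data as before, adding one obviously malignant tumor far to the right drags the decision boundary toward it:

```python
import numpy as np

def decision_boundary(sizes, labels, threshold=0.5):
    """Fit a line to 0/1 labels and return the x where it crosses the threshold."""
    w1, w0 = np.polyfit(sizes, labels, deg=1)
    return (threshold - w0) / w1

sizes = [1.0, 1.5, 2.0, 3.0, 3.5, 4.0]
labels = [0, 0, 0, 1, 1, 1]

beta_before = decision_boundary(sizes, labels)

# Add one far-away, obviously malignant tumor (size 15 cm, label 1)
beta_after = decision_boundary(sizes + [15.0], labels + [1])

print(beta_before, beta_after)
```

The outlier flattens the fitted line, so the boundary moves right and malignant tumors just above the old boundary are now misclassified as benign, even though the new point told us nothing we did not already know.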
Linear regression produces a linear hypothesis function. However, in classification problems our data do not follow a linear distribution but a grouped one, because the label is numerical data in regression problems and categorical data in classification problems. Therefore, using linear regression causes errors and inconsistencies in our estimates. So what can we use instead of linear regression?
We use Logistic Regression for classification problems. I won’t explain Logistic Regression in this article, but you can see the comparison of linear and logistic regression by their equations below (blue: linear regression, orange: logistic regression):
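The key difference between the two hypotheses can be sketched directly. The weight values below are arbitrary illustrative assumptions; what matters is that the logistic hypothesis passes the same linear output through a sigmoid, which keeps it strictly between 0 and 1:

```python
import numpy as np

def linear_hypothesis(x, w0=-0.57, w1=0.43):
    # Linear regression output: unbounded, can leave the [0, 1] range
    return w0 + w1 * x

def logistic_hypothesis(x, w0=-0.57, w1=0.43):
    # Logistic regression squashes the same linear output through a sigmoid,
    # so the result always stays in (0, 1)
    z = w0 + w1 * x
    return 1.0 / (1.0 + np.exp(-z))

print(linear_hypothesis(20))    # well above 1: not a valid class probability
print(logistic_hypothesis(20))  # stays below 1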
At the same time, if we plug the hypothesis equation of our logistic model into the squared-error cost function we used for linear regression, the cost function becomes non-convex. (This is why logistic regression uses a different cost function, the log loss.)
We know that our gradient descent algorithm wants to minimize the value of our cost function; that is, we want to reduce the error of our model to the best achievable value. For convex functions, this process works as follows:
You start with some weight value (the red values on the graph) and get closer to the minimum on every iteration of gradient descent. As the slope becomes flatter the closer you get to the minimum point, the derivative yields smaller values and the steps you take shrink, yet you still keep approaching the minimum. In this way, you reduce the value of your cost function by optimizing your weight values.
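The loop above can be written out in a few lines. This is a sketch on a made-up convex cost J(w) = (w − 3)², whose minimum is at w = 3; the starting weight and learning rate are illustrative assumptions:

```python
# Gradient descent on the convex cost J(w) = (w - 3)**2
# Its derivative is dJ/dw = 2 * (w - 3), zero at the minimum w = 3
w = 10.0            # initial weight (a "red value" on the curve)
learning_rate = 0.1

for _ in range(100):
    gradient = 2 * (w - 3)
    w -= learning_rate * gradient  # step downhill; steps shrink as the slope flattens

print(w)  # converges very close to 3
```

Because the cost is convex, any starting point slides down into the same bowl-shaped minimum.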
Since gradient descent is not the subject of this article, I leave a link here. What we are doing is reducing the model’s error by optimizing the weight values as much as possible. But what if we used a non-convex function? Take the graph below, and let the red horizontal line be the minimum point we converge to:
We have converged to a so-called minimum that is larger than the global minimum, while there is a much lower point we could have converged to. Since non-convex functions have multiple pits (local minima), it is very difficult, sometimes practically impossible, to reach the optimum value.
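To see this trap in action, here is a small sketch with a made-up non-convex function; the function, starting points and learning rate are illustrative assumptions, not from the article. Depending only on where it starts, gradient descent lands in a different pit:

```python
def cost(w):
    # A non-convex cost with a global minimum near w = -1.06
    # and a shallower local minimum near w = 0.93
    return w**4 - 2 * w**2 + 0.5 * w

def grad(w):
    return 4 * w**3 - 4 * w + 0.5

def gradient_descent(w, lr=0.01, steps=2000):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

w_right = gradient_descent(2.0)   # starts right of the hump -> stuck in the local minimum
w_left = gradient_descent(-2.0)   # starts left of the hump -> finds the global minimum

print(w_right, w_left)
```

Both runs stop where the gradient vanishes, but only one of them reaches the truly lowest cost, which is exactly why a non-convex cost function is a problem for gradient descent.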
It’s not that hard is it? 🙂
In this article, I tried to explain why we cannot solve our classification problems using linear regression. I hope it was a useful and instructive article. If you have read this far, congratulations and thank you 🙂
I wish everyone a good and healthy week. Please stay tuned for more articles 🙂