Linear to Logistic Regression, Explained Step by Step

This article was written by Clare Liu and originally appeared on the Towards Data Science Blog here: https://towardsdatascience.com/from-linear-to-logistic-regression-explained-step-by-step-11d7f0a9c29

 

 

When we discuss solving classification problems, Logistic Regression should be the first supervised learning type algorithm that comes to our mind and is commonly used by many data scientists and statisticians. It is fundamental, powerful and easy to implement. More importantly, its basic theoretical concepts are integral to understanding deep learning. So, I believe everyone who is passionate about machine learning should have acquired a strong foundation of Logistic Regression and theories behind the code on Scikit Learn. Alright…Let’s start uncovering this mystery of Regression (the transformation from Simple Linear Regression to Logistic Regression)!

Linear Regression

I believe that everyone should have heard or even have learnt Linear model in Mathethmics class at high school. Linear regression is the simplest and most extensively used statistical technique for predictive modelling analysis. It is a way to explain the relationship between a dependent variable (target) and one or more explanatory variables(predictors) using a straight line. There are two types of linear regression- Simple and Multiple.

Quick reminder4 Assumptions of Simple Linear Regression

  1. Linearity: The relationship between X and the mean of Y is linear.
  2. Homoscedasticity: The variance of residual is the same for any value of X (Constant variance of errors).
  3. Independence: Observations are independent of each other.
  4. Normality: The error(residuals) follow a normal distribution.

Simple Linear Regression with one explanatory variable (x): The red points are actual samples, we are able to find the black curve (y), all points can be connected using a (single) straight line with linear regression.

The equation of Multiple Linear Regression: X1, X2 … and Xn are explanatory variables

 

You might have a question “How to draw the straight line that fits as closely to these (sample) points as possible?” The most common method for fitting a regression line is the method of Ordinary Least Squares used to minimize the sum of squared errors (SSE) or mean squared error (MSE) between our observed value(yi) and our predicted value (ŷi).

Residual: = y — ŷ (Observed value — Predicted value)

 

From Numerical to Binary

Now we have a classification problem, we want to predict the binary output variable Y (2 values: either 1 or 0). For example, the case of flipping a coin (Head/Tail). The response yi is binary: 1 if the coin is Head, 0 if the coin is Tail. This is represented by a Bernoulli variable where the probabilities are bounded on both ends (they must be between 0 and 1).

The linear regression line is below 0

Linear regression is only dealing with continuous variables instead of Bernoulli variables. The problem of Linear Regression is that these predictions are not sensible for classification since the true probability must fall between 0 and 1 but it can be larger than 1 or smaller than 0. Noted that classification is not normally distributed which is violated assumption 4: Normality. Moreover, both mean and variance depend on the underlying probability. Any factor that affects the probability will change not just the mean but also the variance of the observations which means the variance is no longer constantly violating the assumption 2: Homoscedasticity. As a result, we cannot directly apply linear regression because it won't be a good fit.

So…how can we predict a classificiation problem?

Instead we can transform our liner regression to a logistic regression curve!

As we are now looking for a model for probabilities, we should ensure the model predicts values on the scale from 0 to 1. A powerful model Generalised linear model (GLM) caters to these situations by allowing for response variables that have arbitrary distributions (other than only normal distributions), and by using a link function to vary linearly with the predicted values rather than assuming that the response itself must vary linearly with the predictor. As a result, GLM offers extra flexibility in modelling. In this case, we need to apply the logistic function (also called the ‘inverse logit’ or ‘sigmoid function’).

Logistic Regression is a type of Generalized Linear Models.

Before we dig deep into logistic regression, we need to clear up some of the fundamentals of statistical terms — Probablility and Odds.

The probability that an event will occur is the fraction of times you expect to see that event in many trials. If the probability of an event occurring is Y, then the probability of the event not occurring is 1-Y. Probabilities always range between 0 and 1.

The odds are defined as the probability that the event will occur divided by the probability that the event will not occur. Unlike probability, the odds are not constrained to lie between 0 and 1, but can take any value from zero to infinity.

If the probability of Success is P, then the odds of that event is:

Odds: Success/ Failure

Example: If the probability of success (P) is 0.60 (60%), then the probability of failure(1-P) is 1–0.60 = 0.40(40%). Then the odds are 0.60 / (1–0.60) = 0.60/0.40 = 1.5.

It’s time…. to transform the model from linear regression to logistic regression using the logistic function.

In (odd)=bo+b1x

logistic function (also called the ‘inverse logit’).

 

We can see from the below figure that the output of the linear regression is passed through a sigmoid function (logit function) that can map any real value between 0 and 1.

Logistic Regression is all about predicting binary variables, not predicting continuous variables.

Don’t get confused with the term ‘Regression’ presented in Logistic Regression. I know it’s pretty confusing, for the previous ‘me’ as well 😀

 

 

Congrats~you have gone through all the theoretical concepts of the regression model. Feel bored?! Here’s a real case to get your hands dirty!

 

 

Imagine that you are a store manager at the APPLE store, increasing 10% of the sale revenue is your goal this month. Therefore, you need to know who the potential customers are in order to maximise the sale amount.

The client information you have is including Estimated Salary, Gender, Age and Customer ID.

3 samples of client information

If now we have a new potential client who is 37 years old and earns $67,000,can we predict whether he will purchase an iPhone or not (Purchase?/ Not purchase?)

Coding Time: Let’s build a logistic regression model with Scikit learn to predict who the potential clients are together!

# import the libraries
import numpy as np
import pandas as pd# import the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')# get dummy variables
df_getdummy=pd.get_dummies(data=df, columns=['Gender'])# seperate X and y variables
X = df_getdummy.drop('Purchased',axis=1)
y = df_getdummy['Purchased']# split the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)# feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)# fit Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)# predict the Test set results
y_pred = classifier.predict(X_test)# make the confusion matrix for evaluation
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)

Output

# find accuracy socre (Accuracy = number of times you’re right / number of predictions)from sklearn.metrics import accuracy_score
accuracy_score(y_true=y_train, y_pred=LogReg_model.predict(X_train))

Output

 

0.8333333333333334

The accuracy is 83%

 

You might not be familiar with the concepts of the confusion matrix and the accuracy score. No worries! I will share with you guys more about model evaluation in another blog (how to evaluate the model performance using some metrics for example, confusion matrix, ROC curve, recall and precision etc). Stay tuned!

 

Final Thoughts

Thank you for your time to read my blog. Instead of only knowing how to build a logistic regression model using Sklearn in Python with a few lines of code, I would like you guys to go beyond coding understanding the concepts behind.

 

  • The limitations of linear regression
  • The understanding of “Odd” and “Probability”
  • The transformation from linear to logistic regression
  • How logistic regression can solve the classification problems in Python

Please leave your comments below if you have any thoughts about Logistic Regression. Enjoy learning and happy coding 🙂

You can connect with me on LinkedInMediumInstagram, and Facebook.