This article was written by Clare Liu and originally appeared on the Towards Data Science Blog here: https://towardsdatascience.com/svm-hyper-parameter-tuning-using-gridsearchcv-49c0bc55ce29
In my previous article, I illustrated the concepts and mathematics behind the Support Vector Machine (SVM) algorithm, one of the best supervised machine learning algorithms for solving classification and regression problems. It is used in a variety of applications such as face detection, handwriting recognition and email classification. To show how SVM works in Python, including kernels, hyper-parameter tuning, model building and evaluation with the Scikit-learn package, I will use the famous Iris flower dataset to classify the types of Iris flower.
About the dataset
The Iris flower data set is a multivariate data set introduced by Sir Ronald Fisher in 1936 as an example of discriminant analysis.
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor), so there are 150 total samples. Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres.
Here’s a picture of the three different Iris species (Iris setosa, Iris versicolor, Iris virginica). Given the dimensions of the flower, we will predict its class.
Import the libraries
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
%matplotlib inline
Read the input data from the external CSV
irisdata = pd.read_csv('iris.csv')
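If iris.csv is not available locally, a roughly equivalent DataFrame can be built from scikit-learn's bundled copy of the dataset (a sketch; the feature column names will differ slightly from a typical CSV, and the target column is renamed to 'class' here only to match the rest of this article):

from sklearn.datasets import load_iris

# Fallback: construct the DataFrame from scikit-learn's bundled Iris data.
iris = load_iris(as_frame=True)
irisdata = iris.frame.rename(columns={'target': 'class'})
# Replace the integer labels with the species names.
irisdata['class'] = irisdata['class'].map(dict(enumerate(iris.target_names)))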
Take a look at the data
irisdata.head()
irisdata.info()
Visualise Data with Pairs Plots
We apply Seaborn, a library for making statistical graphics in Python. It is built on top of matplotlib and closely integrated with pandas data structures. The pairplot function creates a grid of Axes such that each numeric variable in irisdata is shared on the y-axis across a single row and on the x-axis across a single column.
import seaborn as sns
sns.pairplot(irisdata, hue='class', palette='Dark2')
Train Test Split — Split your data into a training set and a testing set.
from sklearn.model_selection import train_test_split
X = irisdata.drop('class', axis=1)
y = irisdata['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
Apply kernels to transform the data to a higher dimension
kernels = ['Polynomial', 'RBF', 'Sigmoid', 'Linear']

# A function which returns the corresponding SVC model
def getClassifier(ktype):
    if ktype == 0:
        # Polynomial kernel
        return SVC(kernel='poly', degree=8, gamma="auto")
    elif ktype == 1:
        # Radial Basis Function kernel
        return SVC(kernel='rbf', gamma="auto")
    elif ktype == 2:
        # Sigmoid kernel
        return SVC(kernel='sigmoid', gamma="auto")
    elif ktype == 3:
        # Linear kernel
        return SVC(kernel='linear', gamma="auto")
Train a model
Now it’s time to train a Support Vector Machine Classifier.
Call the SVC() model from sklearn and fit the model to the training data
for i in range(4):
    # Separate data into test and training sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

    # Train an SVC model using a different kernel
    svclassifier = getClassifier(i)
    svclassifier.fit(X_train, y_train)

    # Make predictions
    y_pred = svclassifier.predict(X_test)

    # Evaluate the model
    print("Evaluation:", kernels[i], "kernel")
    print(classification_report(y_test, y_pred))
Since SVMs are well suited to small data sets such as irisdata, the model achieves high accuracy with every kernel except the sigmoid. We can determine which kernel performs best by comparing performance metrics such as precision, recall and F1 score.
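To make that comparison concrete, a quick sketch (reusing the getClassifier helper defined above and scikit-learn's accuracy_score) collects a single accuracy number per kernel on the same split:

from sklearn.metrics import accuracy_score

# One accuracy score per kernel, reusing the helper defined above.
for i, name in enumerate(kernels):
    clf = getClassifier(i)
    clf.fit(X_train, y_train)
    print(f"{name} kernel accuracy: {accuracy_score(y_test, clf.predict(X_test)):.3f}")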
To improve the model accuracy, several parameters need to be tuned. The three major ones are:
1. Kernel: The main function of the kernel is to take a low-dimensional input space and transform it into a higher-dimensional space. This is mostly useful in non-linear separation problems.
2. C (Regularisation): C is the penalty parameter, which represents the misclassification or error term. It tells the SVM optimisation how much error is bearable, and thereby controls the trade-off between a smooth decision boundary and classifying the training points correctly.
3. Gamma: It defines how far the influence of a single training example reaches when calculating the plausible line of separation; a low gamma means points far from the boundary are also considered, while a high gamma means only nearby points count. A short sketch after this list illustrates the effect.
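As a rough illustration (using the train/test split created earlier; the specific values are arbitrary choices for demonstration, not recommendations), we can fit the RBF kernel with a few combinations of C and gamma and watch the test accuracy change:

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Illustrative values only: observe how test accuracy moves as C and gamma vary.
for C in [0.1, 1, 100]:
    for gamma in [0.01, 1]:
        model = SVC(kernel='rbf', C=C, gamma=gamma)
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        print(f"C={C}, gamma={gamma}: accuracy={acc:.3f}")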
Tuning the hyper-parameters of an estimator
Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn, they are passed as arguments to the constructor of the estimator classes. Grid search is commonly used as an approach to hyper-parameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid.
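For intuition, scikit-learn's ParameterGrid utility can enumerate every combination a grid search would try, before any model is fitted (a small demonstration grid, not the one used later in this article):

from sklearn.model_selection import ParameterGrid

# Enumerate every combination a grid search would evaluate.
demo_grid = {'C': [0.1, 1], 'gamma': [1, 0.01], 'kernel': ['rbf', 'linear']}
for params in ParameterGrid(demo_grid):
    print(params)
# With the default 5-fold cross-validation, GridSearchCV fits 5 models per combination.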
GridSearchCV helps us combine an estimator with a grid search to tune its hyper-parameters.
Import GridSearchCV from Scikit-learn
from sklearn.model_selection import GridSearchCV
Create a dictionary called param_grid and fill out some parameters for kernels, C and gamma
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['rbf', 'poly', 'sigmoid']}
Create a GridSearchCV object and fit it to the training data
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=2)
grid.fit(X_train, y_train)
Find the optimal parameters
print(grid.best_estimator_)
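The fitted grid object also exposes the winning parameter dictionary and its mean cross-validated score directly:

# The best parameter combination and its mean cross-validated score.
print(grid.best_params_)
print(grid.best_score_)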
Take this grid model to create some predictions using the test set and then create classification reports and confusion matrices
grid_predictions = grid.predict(X_test)
print(confusion_matrix(y_test, grid_predictions))
print(classification_report(y_test, grid_predictions))

# Output
[[15  0  0]
 [ 0 13  1]
 [ 0  0 16]]
For the coding and dataset, please check out here.
Summary: Now you should know
- Visualise data with pairs plots
- Understand the three major parameters of SVMs: kernel, C (regularisation) and gamma
- Apply kernels ('Polynomial', 'RBF', 'Sigmoid', 'Linear') to transform the data
- Use GridSearchCV to tune the hyper-parameters of an estimator
Final Thoughts
Thank you for reading. I hope you now understand how to build SVMs in Python. Please leave your comments below if you have any thoughts.
You can connect with me on LinkedIn, Medium, Instagram, and Facebook.