This article was written by Digan P and originally appeared on the Alteryx Engine Works Blog here: https://community.alteryx.com/t5/Data-Science/Exploring-EvalML-Automatically-Build-Optimize-and-Evaluate/ba-p/727749
This post takes you on a guided tour of using EvalML to build and evaluate supervised machine learning pipelines. We'll revisit the BigMart dataset we worked on in my recent Featuretools post.
What is EvalML?
EvalML is an AutoML library that builds, optimizes and evaluates machine learning pipelines using domain-specific objective functions. Combined with Featuretools, EvalML can be used to create end-to-end supervised machine learning solutions.
Background on the data for this example:
We are going to be looking at the BigMart dataset, which covers 1,559 products across 10 stores. You can think of it as two tables combined into one: an Item table and an Outlet table.
| Variable | Description |
|---|---|
| Item_Identifier | Unique product ID |
| Item_Weight | Weight of product |
| Item_Fat_Content | Whether the product is low fat or not |
| Item_Visibility | The % of total display area of all products in a store allocated to the particular product |
| Item_Type | The category to which the product belongs |
| Item_MRP | Maximum Retail Price (list price) of the product |
| Outlet_Identifier | Unique store ID |
| Outlet_Establishment_Year | The year in which store was established |
| Outlet_Size | The size of the store in terms of ground area covered |
| Outlet_Location_Type | The type of city in which the store is located |
| Outlet_Type | Whether the outlet is just a grocery store or some sort of supermarket |
| Item_Outlet_Sales | Sales of the product in the particular store. This is the outcome variable to be predicted. |
We are going to drop the Item_Identifier and the Outlet_Identifier as we won’t be using them as predictor variables. Our target variable is still Item_Outlet_Sales, the sales of a product in a particular store.
We will be bringing in the BigMart data and splitting the columns. Dataframe X will be all our predictor variables, while dataframe y will be our target, Item_Outlet_Sales.
The first step is to make sure that our physical, logical and semantic types are correct.
- Physical Type – The actual data type of the incoming data.
- Logical Type – This is how the DataFrame interprets the physical data type.
- Semantic Tags – These are enhanced feature types that allow you to more thoroughly describe your data.
Now we need to split the data into training and validation sets to train the model and gauge its performance. We are going to do an 80/20 split, with 80% of the dataset for the model to train on and 20% of the dataset for it to test on. We have 6,818 records for training and 1,705 records for testing purposes.
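The record counts above follow from how a fractional holdout size is rounded. Here is a minimal, library-free sketch of the idea. It assumes the full dataset has 8,523 rows (the sum of the two counts quoted above) and that the holdout size is rounded up, which is the convention scikit-learn uses for a fractional test size. The helper name `split_indices` and the seed are hypothetical and are not part of EvalML.

```python
import math
import random


def split_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices and split them into (train, holdout) lists."""
    n_test = math.ceil(test_size * n_rows)  # holdout size, rounded up
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    return indices[n_test:], indices[:n_test]


# 8,523 is an assumption: the sum of the 6,818 and 1,705 records quoted above.
train_idx, test_idx = split_indices(8523, test_size=0.2)
print(len(train_idx), len(test_idx))
```

The two index lists are disjoint, so no record used for training also appears in the holdout set.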
EvalML has many options to configure the pipeline search. We designate the problem type (regression or classification) and optionally select an objective function. (If you don't select a specific objective function, the default for your chosen problem type will be used.) A pipeline is simply a sequence of operations applied to data, where each operation is either a transformation or a modeling algorithm. An objective function is the metric that EvalML seeks to minimize or maximize during the search. The EvalML documentation covers objective functions in more detail.
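To make the "sequence of operations" idea concrete, here is a toy sketch in plain Python. It is not how EvalML implements pipelines. The step functions (`impute_mean`, `scale_to_unit`, `toy_linear_model`), the fixed model coefficients and the sample values are all hypothetical, chosen only to show two transformations followed by a modeling step.

```python
from statistics import mean


def impute_mean(column):
    """Transformation: replace missing values (None) with the column mean."""
    fill = mean(v for v in column if v is not None)
    return [fill if v is None else v for v in column]


def scale_to_unit(column):
    """Transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


def toy_linear_model(column):
    """Modeling step: a stand-in 'model' with fixed, made-up coefficients."""
    return [2.0 * v + 1.0 for v in column]


def run_pipeline(steps, data):
    """Apply each step in order, feeding its output to the next step."""
    for step in steps:
        data = step(data)
    return data


pipeline = [impute_mean, scale_to_unit, toy_linear_model]
predictions = run_pipeline(pipeline, [10.0, None, 30.0])
print(predictions)
```

A search over pipelines then amounts to trying different combinations of steps and parameters and comparing them with the objective function.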
EvalML has different objective functions available for regression and classification models.
Objective functions for regression include:
- Root Mean Squared Error
Objective functions for classification include:
- MCC Binary
- Log Loss Binary
- Balanced Accuracy Binary
- Accuracy Binary
For our regression problem, we are going to use Root Mean Squared Error as the objective function. The lower the score is, the better the pipeline.
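For reference, Root Mean Squared Error is the square root of the average squared difference between actual and predicted values. The sketch below computes it in plain Python rather than through EvalML. The function name and the sample sales figures are made up for illustration.

```python
import math


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared difference between two sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical sales values and two sets of predictions.
actual = [10.0, 20.0, 30.0, 40.0]
close_predictions = [12.0, 18.0, 30.0, 40.0]
far_predictions = [20.0, 10.0, 30.0, 40.0]

rmse_close = root_mean_squared_error(actual, close_predictions)
rmse_far = root_mean_squared_error(actual, far_predictions)
print(rmse_close, rmse_far)
```

The predictions that sit closer to the actual values produce the smaller error, which is why a lower score indicates a better pipeline. Because the errors are squared before averaging, a few large misses raise the score more than many small ones.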
When we call search(), the search for the best pipeline will begin. There is no need to wrangle missing data or categorical variables as EvalML includes various preprocessing steps (like imputation, one-hot encoding and feature selection) to ensure you are getting the best results.
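The search step might look roughly like the sketch below. It assumes an EvalML release in which `AutoMLSearch` receives the training data in its constructor (older releases passed the data to `search()` instead), so check the API reference for your installed version. The wrapper `run_search` is a hypothetical helper, not part of EvalML.

```python
def run_search(X_train, y_train):
    """Run an EvalML regression search and return the AutoMLSearch object.

    Assumes an EvalML version whose AutoMLSearch constructor accepts
    X_train and y_train; adjust for the version you have installed.
    """
    # Imported lazily so this module still loads where EvalML is absent.
    from evalml.automl import AutoMLSearch

    automl = AutoMLSearch(
        X_train=X_train,
        y_train=y_train,
        problem_type="regression",
        objective="Root Mean Squared Error",  # lower is better
    )
    automl.search()  # builds, tunes and cross-validates candidate pipelines
    return automl
```

Calling `run_search(X_train, y_train)` with the training split returns the search object, which holds the rankings and the best pipeline.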
As long as your data is in a single table, EvalML can handle it. If not, you can reduce your data to a single table by using Featuretools and its EntitySets. The EvalML documentation has more information on pipeline components and on integrating your own custom pipelines.
After the search is finished, we can view all of the pipelines that were searched, ranked by score. Internally, EvalML performs cross-validation to score the pipelines. If it notices high variance across cross-validation folds, it will warn you. EvalML also provides data checks that analyze your data to help you produce the best-performing pipeline. These data check utility functions help deal with problems such as overfitting, abnormal data and missing data.
If we want more details about a pipeline, we can view a summary description using its id from the rankings table.
We can also view the pipeline parameters directly.
An EvalML pipeline supports three main operations:
- Fit – Fits each component on the provided training data, in order.
- Predict – Computes the predictions of the component graph on the provided data.
- Score – Computes the value of an objective on the provided data.
We can now select the best pipeline and score it on our holdout data.
Using best_pipeline.graph() we can visualize the steps of this pipeline.
We can also get the importance associated with each feature of the resulting pipeline.
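The post-search steps described above can be gathered into one sketch. It assumes `automl` is an `AutoMLSearch` object whose search has finished and that the installed EvalML version exposes `rankings`, `describe_pipeline`, `best_pipeline`, `parameters`, `score` and `feature_importance` as used here. The helper name `inspect_best_pipeline` is hypothetical.

```python
def inspect_best_pipeline(automl, X_holdout, y_holdout):
    """Summarize the top-ranked pipeline of a finished AutoMLSearch."""
    rankings = automl.rankings  # table of searched pipelines, best first
    best_id = rankings.iloc[0]["id"]
    automl.describe_pipeline(best_id)  # logs a summary of that pipeline

    best_pipeline = automl.best_pipeline
    scores = best_pipeline.score(
        X_holdout, y_holdout, objectives=["Root Mean Squared Error"]
    )
    return {
        "rankings": rankings,
        "parameters": best_pipeline.parameters,
        "scores": scores,
        "feature_importance": best_pipeline.feature_importance,
    }
```

The `graph()` call is left out of the helper because, as far as I know, it needs Graphviz installed in addition to EvalML.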
Here are some extra links to help you explore what EvalML has to offer:
- EvalML FAQ
- EvalML Tutorials with examples for:
  - Fraud Prediction Model
  - Lead Scoring Model
  - Cost-Benefit Matrix Objective
  - Text Data with EvalML
- EvalML User Guide
- API Reference