This article was written by HwasuK and originally appeared on the Alteryx Engine Works Blog here: https://community.alteryx.com/t5/Data-Science/How-to-Use-Automated-Feature-Engineering-in-Alteryx-Intelligence/ba-p/715725
We are thrilled to announce a major new automated machine learning capability in Alteryx Intelligence Suite with the 2021.1 release: Feature Engineering. Automated feature engineering promises to help organizations build high-quality machine learning models faster while focusing on the business value of models.
Feature engineering is commonly defined as a process of creating new columns (or “features”) from raw data using various techniques, and it is widely accepted as a key factor of success in data science projects. Creating meaningful features is challenging—requiring significant time and often coding skills. The new feature-engineering capabilities in Alteryx Intelligence Suite make this process easy and fast for data scientists and analysts—even those users with limited experience.
Let’s begin altering our analytics journey with Feature Engineering!
When starting any type of analysis, the hardest part is typically acquiring the data, which usually sits in raw form across disparate data sources. Analysts commonly use Alteryx or a language like SQL to aggregate the data and generate fields for analysis.
Imagine we work for a retailer, and we have a defined set of products that we sell:
We have a set of products purchased by a customer in a specified transaction:
We also have dates for each transaction:
And we have information on all of the customers in our database:
Let’s say that management asks us the question, “Which customers will have the most transactions in the coming year?” Using Alteryx Intelligence Suite, this question can be answered with a few simple steps.
For each of the raw tables, we first need to set the data type correctly and optimize the field sizes for downstream analysis. We do this using Designer’s Auto Field tool together with Alteryx Intelligence Suite’s new Feature Types tool.
The Feature Types tool performs “semantic data typing,” which adds real-world context to the base data type. For example, a field for Zip Code might be stored as an Integer data type, but semantic data typing can map this Integer field as a Zip Code data type to better leverage this field in feature engineering.
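To see why the semantic type matters, here is a minimal Python sketch. It is not how the Feature Types tool is implemented, and the zip code values are invented for illustration. It only shows that the same Integer column supports very different operations depending on whether we read it as a number or as a Zip Code.

```python
from collections import Counter
from statistics import mean

# Hypothetical column of zip codes stored with an Integer base type.
zip_codes = [80301, 80301, 92618, 10001]

# Treated as a plain number, the column supports arithmetic,
# but the result has no real-world meaning.
average_zip = mean(zip_codes)

# Treated as a Zip Code (a categorical label), counting occurrences
# yields a value that actually describes the data.
most_common_zip, occurrences = Counter(zip_codes).most_common(1)[0]

print(f"average of zip codes (not meaningful): {average_zip}")
print(f"most common zip code: {most_common_zip} ({occurrences} rows)")
```

Knowing that the field is a Zip Code steers feature generation toward the second kind of operation and away from the first.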
In the configuration panel of Feature Types, we can use the “Autodetect” option under Change Type. This instructs the tool to analyze each column and automatically infer what the field represents in the real world. We can manually change Output Type as needed. Selecting the correct output type (semantic type) improves the quality of the features we generate in the next step.
After setting our data types, we pass all of our data into the Build Features tool. The Build Features tool can take in more than one stream of input data, and the meaningful name given to each data connection helps us track the data reference in the configuration panel.
In the Build Features tool, we define the relationships among our data. Build Features works best when data is in third normal form, where we have a set of tables that can be joined together via a set of relationships.
In this example, our Target Table is customers. Based on the three relationships defined, the data from all tables will be aggregated to the customer level.
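Conceptually, aggregating to the customer level means following the relationships from each child row back to a customer. The plain Python sketch below illustrates the idea under stated assumptions: the sample rows are invented, and only customerID and InvoiceNo are column names taken from this article. The product and quantity columns are hypothetical, and the tool itself may work differently.

```python
from collections import defaultdict

# Hypothetical customer_transactions table: InvoiceNo -> customerID.
customer_transactions = {
    "A001": 12346,
    "A002": 12346,
    "A003": 12347,
}

# Hypothetical transactions table: (InvoiceNo, product_id, quantity).
transactions = [
    ("A001", "P1", 2),
    ("A001", "P2", 1),
    ("A002", "P1", 4),
    ("A003", "P3", 5),
]

# Follow each transaction line to its invoice, then to its customer,
# and accumulate the quantity at the customer level.
total_items = defaultdict(int)
for invoice_no, _product_id, quantity in transactions:
    customer_id = customer_transactions[invoice_no]
    total_items[customer_id] += quantity

for customer_id, items in sorted(total_items.items()):
    print(customer_id, items)
```

Each relationship acts as one hop in the lookup, which is why the tables need shared keys before anything can be rolled up to the target table.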
Notice how the Build Features tool automatically aggregates our customer_transactions and transactions data. We can see how many transactions each customer had, and how many total items they purchased in all transactions.
Convenient! But how did this happen? Let’s take a look at COUNT(customer_transactions). Recall that we defined the customers table and the customer_transactions table as joined on customerID.
From there, we look at InvoiceNo, the primary key for customer transactions. For each customer, we count how many distinct invoices are in the table. Thus, for customer 12346, we return a count of two transactions in the final table.
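The counting logic can be sketched in a few lines of plain Python. This is an illustration rather than the tool's actual implementation, and the invoice numbers and the second customer are invented. Only the count of two for customer 12346 comes from the example above.

```python
from collections import defaultdict

# Hypothetical rows of customer_transactions: (InvoiceNo, customerID).
customer_transactions = [
    ("A001", 12346),
    ("A002", 12346),
    ("A003", 12347),
]

# Collect the distinct invoices that belong to each customer.
invoices_by_customer = defaultdict(set)
for invoice_no, customer_id in customer_transactions:
    invoices_by_customer[customer_id].add(invoice_no)

# COUNT(customer_transactions): number of distinct invoices per customer.
count_customer_transactions = {
    customer_id: len(invoices)
    for customer_id, invoices in invoices_by_customer.items()
}

print(count_customer_transactions[12346])
```

Using a set per customer keeps the count distinct even if an invoice were to appear in more than one row.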
We may ask how and why Build Features creates the additional COUNT and SUM fields. These are two new features created by the configuration we set under the Build Features tool’s Manage Primitives tab. Each “primitive” is a method used to generate new features. We can select up to five primitives. (The limit is intended to prevent the Build Features tool from generating too many features, which could negatively impact performance.)
In this example, 22 new features are generated in total by selecting the Median, Max, Sum, Std, and Count primitives. These new features can provide additional information about our store’s transactions that was not in the raw data, and they can be useful in predicting future customer behavior.
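As a rough illustration of what these five primitives compute, the plain Python sketch below applies each one to a single numeric column grouped by customer. The quantities are invented, the sample standard deviation is an assumption about how Std is defined, and the sketch does not attempt to reproduce the 22 features from the example.

```python
from statistics import median, stdev

# Hypothetical item quantities per invoice, grouped by customerID.
quantities_by_customer = {
    12346: [2, 4, 6],
    12347: [5],
}


def aggregate(values):
    """Apply five aggregation primitives to one customer's child rows."""
    return {
        "MEDIAN": median(values),
        "MAX": max(values),
        "SUM": sum(values),
        # Sample standard deviation is undefined for fewer than two values.
        "STD": stdev(values) if len(values) > 1 else None,
        "COUNT": len(values),
    }


features = {
    customer_id: aggregate(values)
    for customer_id, values in quantities_by_customer.items()
}

for customer_id, row in features.items():
    print(customer_id, row)
```

Each primitive yields one new column per numeric field it is applied to, which is how a handful of primitives across several related tables quickly adds up to dozens of features.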
Building features (aka feature engineering) is traditionally done by writing complex SQL code and spending hours on experimentation and iteration. With the new Build Features tool in Alteryx Intelligence Suite, we can rapidly calculate new features just by defining a few relationships! This is the magic of automated feature engineering.
Thanks to the new Feature Types and Build Features tools, we’ve generated new, meaningful features that can help us build a better predictive model, one that provides actionable insights and leads to better outcomes for our business. We hope you enjoy these innovations in the 2021.1 release! Happy Solving!