This article was written by B Hamel and originally appeared on the Alteryx Data Science Blog here: https://community.alteryx.com/t5/Data-Science/Using-Linear-Regression-to-Conduct-a-Pay-Equity-Study/ba-p/761002
Data can sometimes reveal patterns that provide insights into complex issues, especially around diversity, equity and inclusion. This post is your guide to conducting a pay equity analysis using Alteryx Designer — specifically, creating a linear regression model that helps you assess whether pay differences exist between people in different demographic groups.
If you’d like to follow along, download our workflow, attached at the bottom of this post. We hope this guide and workflow can be a starting point for your own analysis in your organization.
The Question and Approach
The overall question is, “Is there a pay gap at your company? If so, who may need a correction to their compensation?”
To answer this question, you want to build a model that will help you determine what “acceptable” variables influence pay, and check whether “unacceptable” terms (i.e., those representing potential causes of bias, like gender) may have a statistically significant effect in lowering an individual’s pay. If a pay gap exists, you can flag the employees whose pay is below the expected range for review by a compensation expert. You can use linear regression to help make this determination.
For this process to work, it is important to have a large enough number of employees to analyze in order to be able to judge the statistical significance of your findings. Say you have a team of 10 individuals, including four women, who are located in different cities, have varying years of experience, and perform different jobs. In this case, it would be very difficult to make conclusions about the statistical significance of any pay differences you identify. However, if your numbers are too low to leverage the full approach, you can still follow this framework and find look-alike employees to do manual comparisons.
Gathering the Right Data
Note: The input data shown here is from a fictional company with a 2% pay gap.
You’ll need to prepare an anonymized data set that includes all the factors that should drive an employee’s compensation, as well as the potential causes of bias that should not. In our sample data set, we identified factors such as education, level, and performance ratings that should drive employee compensation. A factor that should not drive compensation? Gender!
While age shouldn’t factor into compensation — you can see how being “too old” or “too young” might influence compensation unfairly over performance — we did use it as a proxy for years of experience.
For your analysis, consider what should be driving your pay process and make sure you are capturing that data. It’s vital to consult with your HR compensation experts who are most knowledgeable about these issues.
Exploratory Data Analysis
In your Exploratory Data Analysis (EDA), you’ll look for outliers, understand what correlations exist, and identify any errors in the dataset. Don’t skip this step, as lots of valuable learning happens here!
Here are some key EDA steps to complete.
Response variable distribution: This helps us to spot outliers in the dataset. By checking these distributions, we spotted two outliers, representing employees being paid in a different currency (large values, circled in the left histogram below), as well as part-time employees who do not work the entire year (low values, circled in the center histogram). Note you should run different countries separately as they have different job markets. Fixing these outliers (more on that later) made our distribution look a bit more normal (right histogram).
Correlations between variables: This step helps us understand which factors are correlated with the response variable, as well as with each other. There are a few techniques to view these correlations, depending on the column’s data types.
Continuous vs. Continuous Plots (Association Analysis)
The heat map on the left is shaded based on how correlated two columns are with one another. We can see age (our proxy for years of experience) is the most correlated with Total Cash Compensation, as denoted by a darker red square. Performance is also correlated with compensation, but less so than age, based on its lighter red color.
Continuous vs. Categorical Plots (Plot of Means)
We often have categorical data such as Education Level, which can’t be used for correlations in the above association analysis. Instead, we can use the Plot of Means Tool to see if there are different salary ranges for different categories. The above plot shows that people with doctorates in this dataset are paid higher than people at other education levels. Also, there is clear separation between the distributions, so we’d expect it to be a statistically significant term.
Now that we’ve gathered our data and done some preliminary analysis to better understand it, we can take some necessary steps to get it ready for modeling.
Log transform the response variable: Linear regression models assume that the model error (here, Total Cash Compensation) is normally distributed. However, the model error often is not normally distributed, and the model fit will not follow the regression assumptions. The histogram on the left below shows how our response variable is distributed in its original form.
A trick is to take the natural log of the response variable to produce a log normal distribution. This sounds complicated, but it’s really easy; you can use log(total_cash_compensation) in a Formula Tool. The histogram on the right below shows how our distribution has been shifted to look more like a normal “bell curve” after the log transformation. You can try building your model first without the transformation, and see if the residual diagnostics meet the assumptions in our model evaluation steps below. If not, try a natural log transformation, and reevaluate your results.
This transformation has the added benefit of making it easier to interpret the model output, because it reframes your results from "women make X dollars less" to "women make 98% of men in the same job." This is an easier way to understand the magnitude of the pay gap, if one is present at your company.
Binning categories into higher levels: This step will merge people in smaller groups into larger groups based on common characteristics. For example, in the table below, you can see how people with various management and executive job titles have been binned together into a broader category that encompasses more people. This step increases the sample size of each bin, thus increasing degrees of freedom. Work with your HR and compensation experts on this step, as they will know which groups can be aggregated together.
|Management Level (Actual)||Management Level Bin|
|Senior Vice President||Executive|
Removing outliers: In our dataset, we saw a few outliers. They represent different groups and processes (e.g., the part-time job market and another country’s job market), so we feel we can exclude these for a more robust analysis.
Cleaning up missing values: In our exploratory analysis, we noticed that some employees were missing educational information. We opted to replace missing data with “unknown.”
Fitting the Regression Model and Interpreting Results
Now that we have a dataset in good shape, it’s time to fit a model using the Linear Regression Tool!
Place a Browse Tool after the “R” output of the Linear Regression Tool, and click on it to see the model summary.
Let’s walk through the steps of evaluating this regression model.
- Is any factor significant in the model? Check the F-statistic.
- Which factors are/are not statistically significant? Check the p-values down the right-hand side.
- How well does the model fit the data? Check the R-Squared, which here is 0.8651. This means 86% of the variance in compensation data can be explained by the model.
- Does the model meet the regression assumptions? Check the residual diagnostics by clicking the arrow.
- Linear Relationship (“Residuals vs. Fitted” plot, top left): The errors have mean zero (red line is close to zero) and have a constant variance around the red line for different fitted values.
- Normality of the Error Term Distribution (“Normal Q-Q Plot,” top right): The model errors follow a normal distribution (verifying the normality assumption). The dots should follow the 45-degree line on the plot.
- Constant Variance of the Error Term (“Scale-Location” plot, bottom left)
- Independence of the Error Terms (“Residuals vs. Fitted” plot, top left)
We can then report out our findings.
In the sample dataset, we found a pay gap of 2% between men and women at the fictional company. In the table above, this is the coefficient of -0.02 calculated for the “GenderzFemale” variable, and it reflects how much a change in this variable (from male to female) would affect the response variable of Total Cash Compensation, if all other variables were kept the same. This gender variable was statistically significant.
Note that you can have a negative coefficient in your regression model; however, make sure you check that the coefficient is significantly different from 0.
In this fictional dataset, there was a pay gap estimated by the model. The next step is to flag employees who are being paid below their estimated range. If we were to use the Score Tool with the estimated model, it would predict compensation amounts that reflect this gender pay gap, since the model estimated that women are being paid less than men. Therefore, we need to flip gender to “male” for every employee and then estimate employees’ salary range. If the model uses these adjusted data to predict compensation for an employee that is below their actual salary range, that individual can be flagged for review by HR.