Partial Dependence Plots: How to Discover Variables Influencing a Model
Explainability of machine learning models
Have you ever wondered how machine learning models arrive at their predictions? The explainability of machine learning models, and their reputation as black boxes, is one of the most debated topics in model transparency. Today we will explore this and learn some quick techniques for finding out which variables influence a model’s results and by how much.
I have generated a synthetic dataset with variables that capture football match statistics, such as the number of goals scored, number of passes, ball possession %, and number of red or yellow cards. Using this dataset, we will explore the following models:
Decision Tree model
Random Forest model
This will be the agenda for today:
1. Train the decision tree model
2. Train the random forest model
3. Explore the influential variables in the models
4. Find the threshold of the influential variables
So without further ado, let’s get started.
1. Train the decision tree model
Before we begin, let us first talk a bit about decision trees. This concept will be used later in the article.
A decision tree algorithm starts with a root node containing the data sample, selects the feature to split on using a metric like Gini impurity or information gain, and keeps splitting nodes until no further split is possible; the terminal nodes are called leaf nodes or end nodes. This is illustrated in the diagram below with a sample tree.
Decision Tree (Image by Author)
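To make the split criterion concrete, here is a minimal sketch of how the Gini impurity of a candidate split could be computed; the toy feature, labels and threshold below are made up purely for illustration and are not part of the article’s dataset.
import numpy as np
def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)
def split_gini(feature_values, labels, threshold):
    # Weighted Gini impurity of the two child nodes produced by the split
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
# Toy example: a 'yellow cards' style feature and a binary target
cards = np.array([0, 1, 1, 2, 3, 4, 5, 5])
won_award = np.array([1, 1, 1, 1, 0, 0, 0, 0])
print(split_gini(cards, won_award, threshold=2))  # 0.0 => a perfect split
The tree greedily picks the feature and threshold that give the lowest weighted impurity at each node.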
After the data and the libraries have been imported, the following lines of code will help to train the decision tree model.
We are trying to identify which factors or variables positively or negatively influence the ‘Man of the Match’ outcome.
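(The imports and data load themselves are not shown in the article; for reference, a plausible set would look like the sketch below, with the CSV file name as a placeholder you should adjust to your own copy of the data.)
# Assumed imports and data load; the file name is a placeholder
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
data = pd.read_csv('football_match_stats.csv')  # placeholder file name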
# Create the dependent variable
y = (data['Man of the Match'] == 'Yes')
# Create the independent variables (all integer-typed columns)
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
x = data[feature_names]
# Split the data and train the decision tree model
train_x, test_x, train_y, test_y = train_test_split(x, y, random_state=1)
dt_model = DecisionTreeClassifier(random_state=0, max_depth=5, min_samples_split=5).fit(train_x, train_y)
# Evaluate on the test set
pred_y = dt_model.predict(test_x)
cm = confusion_matrix(test_y, pred_y)
print(cm)
accuracy_score(test_y, pred_y)
We will get the following output from the confusion matrix:
The accuracy of the decision tree model is low at ~48%, with (13+11) targets predicted correctly and (14+12) misclassified: 14 false positives and 12 false negatives.
Note that the accuracy is low because the data is synthetic; I urge readers to focus on the methodology rather than the actual numbers.
2. Train the random forest model
Let us now learn a bit about the random forest model and then train the data with it.
Random forest is an ensemble learning algorithm that constructs multiple decision trees and outputs either the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees.
An illustration is given below:
Random Forest (Image by Author)
We will now use the code below to train the random forest model.
# Train the random forest model on the same train/test split
rf_model = RandomForestClassifier(n_estimators=100, random_state=1).fit(train_x, train_y)
# Evaluate on the test set
pred_y = rf_model.predict(test_x)
cm = confusion_matrix(test_y, pred_y)
print(cm)
accuracy_score(test_y, pred_y)
The output of the random forest model is given below:
The random forest model has slightly better accuracy at ~50%, with (13+12) targets identified correctly and (14+11) targets misclassified: 14 false positives and 11 false negatives.
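As a quick sanity check on the ‘mode of the individual trees’ idea from the description above, here is a small sketch (my own addition, not from the original code) comparing the fitted forest’s prediction with a plain majority vote over its 100 trees:
import numpy as np
# Hard predictions from each individual tree in the fitted forest
tree_preds = np.array([tree.predict(test_x.values) for tree in rf_model.estimators_])
# Majority vote across the trees; strictly, scikit-learn averages class
# probabilities, but for a binary target the result usually agrees
majority_vote = tree_preds.mean(axis=0) > 0.5
print((majority_vote == rf_model.predict(test_x)).mean())  # fraction of agreement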
3. Explore the influential variables in the models
We will now look at the most influential variables in both models and how they affect the accuracy. We will use ‘PermutationImportance’ from the ‘eli5’ library for this purpose. We can do this with a few lines of code, as given below:
# Import eli5 and PermutationImportance
import eli5
from eli5.sklearn import PermutationImportance
# Influential variables for the decision tree model: fit the permutation
# importance on the test set, then display the weights
perm = PermutationImportance(dt_model, random_state=1).fit(test_x, test_y)
eli5.show_weights(perm, feature_names=test_x.columns.tolist())
The influential variables in the decision tree model are:
The most influential variables in the decision tree model are ‘1st Goal’, ‘Distance Covered’ and ‘Yellow Card’, among others. There are also variables that influence the accuracy negatively, such as ‘Ball possession %’ and ‘Pass accuracy %’. Some variables, like ‘Red Card’ and ‘Goal Scored’, have no influence on the accuracy of the model.
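The same call can be made for the random forest model; a minimal sketch mirroring the decision tree code above (perm_rf is simply my name for the fitted importance object):
# Influential variables for the random forest model
perm_rf = PermutationImportance(rf_model, random_state=1).fit(test_x, test_y)
eli5.show_weights(perm_rf, feature_names=test_x.columns.tolist())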
The influential variables in the random forest model are:
The most influential variables in the random forest model are ‘Ball possession %’, ‘Free Kicks’, ‘Yellow Card’ and ‘Own Goals’, among others. There are also variables that influence the accuracy negatively, such as ‘Red Card’ and ‘Offsides’; we could drop these variables from the model to try to increase the accuracy.
The weights indicate by how much the model accuracy changes when the values of a variable are randomly re-shuffled. For example, shuffling the feature ‘Ball possession %’ reduces the model accuracy by about 5.20%, with a reported spread of ± 5.99%.
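For intuition, here is a hand-rolled version of what PermutationImportance reports for a single feature: shuffle the column, re-score, and average the drop in accuracy. This is a sketch of the idea rather than eli5’s exact computation, and the column name is taken from the text above, so it may differ in your copy of the data.
import numpy as np
def manual_permutation_importance(model, X, y, feature, n_repeats=30, seed=1):
    # Mean and spread of the drop in accuracy when one feature's values are shuffled
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    drops = []
    for _ in range(n_repeats):
        X_shuffled = X.copy()
        X_shuffled[feature] = rng.permutation(X_shuffled[feature].values)
        drops.append(baseline - accuracy_score(y, model.predict(X_shuffled)))
    return np.mean(drops), np.std(drops)
mean_drop, std_drop = manual_permutation_importance(rf_model, test_x, test_y, 'Ball possession %')
print(f'{mean_drop:.4f} +/- {std_drop:.4f}')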
As you can observe there are significant differences in the variables that influence the 2 models and for the same variable like say ‘Yellow Card’ the percentage of change in accuracy also differs.
4. Find the threshold of the influential variables
Let us now take one variable, ‘Yellow Card’, which influences both models, and find the threshold at which its effect on the prediction changes. We can do this easily with partial dependence plots (PDPs).
A partial dependence (PD) plot depicts the functional relationship between input variables and predictions. It shows how the predictions partially depend on values of the input variables.
For example, we can create a partial dependence plot of the variable ‘Yellow Card’ to understand how changes in its value affect the model’s predictions.
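Before turning to the pdpbox library, it may help to see what a one-dimensional PDP computes under the hood: fix the feature at each grid value for every row, predict, and average. A minimal sketch using the models and test data defined above (the grid choice here is my own assumption):
import numpy as np
def partial_dependence_1d(model, X, feature, grid):
    # Average predicted probability of the positive class at each grid value
    pd_values = []
    for value in grid:
        X_mod = X.copy()
        X_mod[feature] = value  # fix the feature at this grid value for every row
        pd_values.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(pd_values)
grid = np.arange(0, test_x['Yellow Card'].max() + 1)
print(partial_dependence_1d(dt_model, test_x, 'Yellow Card', grid))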
We will start with the decision tree model first –
# Import the libraries
from matplotlib import pyplot as plt
from pdpbox import pdp
# Select the variable/feature to plot
feature_to_plot = 'Yellow Card'
features_input = test_x.columns.tolist()
print(features_input)
# PDP for the decision tree model: isolate the feature, then plot
pdp_yl = pdp.PDPIsolate(model=dt_model, df=test_x,
                        model_features=features_input,
                        feature=feature_to_plot, feature_name=feature_to_plot)
fig, axes = pdp_yl.plot(center=True, plot_lines=False, plot_pts_dist=True,
                        to_bins=False, engine='matplotlib')
fig.set_figheight(6)
plt.show()
The plot will look like this:
PDP Plot for Decision Tree model (Image by Author)
If the number of yellow cards is more than 3, that negatively impacts the predicted chance of ‘Man of the Match’, but below 3 the variable does not influence the prediction. Also, beyond 5 yellow cards there is no further significant effect on the model.
The PDP (Partial dependence plot) helps to provide an insight into the threshold values of the features that influence the model.
Now we can use the same code for the random forest model, swapping in rf_model, and look at the plot:
PDP Plot for Random Forest model (Image by Author)
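For reference, the only change from the decision tree code is the model argument; a sketch of that swap, using the same objects defined above:
# Same PDP, now for the random forest model
pdp_yl_rf = pdp.PDPIsolate(model=rf_model, df=test_x,
                           model_features=features_input,
                           feature=feature_to_plot, feature_name=feature_to_plot)
fig, axes = pdp_yl_rf.plot(center=True, plot_lines=False, plot_pts_dist=True,
                           to_bins=False, engine='matplotlib')
plt.show()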
For both the decision tree model and the random forest model the plot looks similar, with the prediction changing within the range of 3 to 5 yellow cards; beyond that, the variable ‘Yellow Card’ has little or no influence on the model, as shown by the flat line.
Summary:
This is how we can use simple partial dependence plots to understand the behaviour of influential variables in a model. This not only provides insight into which variables impact the model, but is especially helpful when training models and selecting the right features. The thresholds can also be used to create bins for sub-setting the features, which can further enhance the accuracy of the model. In turn, this helps to make the model results explainable to the business.
Please refer to this link on GitHub for the dataset and the full code.
I can be reached on Medium, LinkedIn or Twitter in case of any questions/comments.
You can also subscribe to my email list 📩 here, so that you don’t miss out on my latest articles.
Originally published at https://hackernoon.com on January 10, 2023.