# Decision Tree Regressor, Explained: A Visual Guide with Code Examples

#### REGRESSION ALGORITHM

#### Trimming branches smartly with cost-complexity pruning

Decision Tree Classifier, Explained: A Visual Guide with Code Examples for Beginners

Decision Trees aren’t limited to categorizing data — they’re equally good at predicting numerical values! Classification trees often steal the spotlight, but Decision Tree Regressors (or Regression Trees) are powerful and versatile tools in the world of continuous variable prediction.

The mechanics of building a regression tree are mostly the same as for a classification tree, and we’ll walk through them here. We’ll also go beyond the *pre*-pruning methods introduced in the classifier article, such as “minimum samples per leaf” and “maximum tree depth”, and explore the most common *post*-pruning method: **cost complexity pruning**, which adds a complexity parameter to the decision tree’s cost function.

All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

### Definition

A Decision Tree for regression is a model that predicts numerical values using a tree-like structure. It splits data based on key features, starting from a root question and branching out. Each node asks about a feature, dividing data further until reaching leaf nodes with final predictions. To get a result, you follow the path matching your data’s features from root to leaf.

Decision Trees for regression predict numerical outcomes by following a series of data-driven questions, narrowing down to a final value.

### 📊 Dataset Used

To demonstrate our concepts, we’ll work with our standard dataset. This dataset is used to predict the number of golfers visiting on a given day and includes variables like weather outlook, temperature, humidity, and wind conditions.

Columns: ‘Outlook’ (one-hot encoded to sunny, overcast, rain), ‘Temperature’ (in Fahrenheit), ‘Humidity’ (in %), ‘Wind’ (Yes/No) and ‘Number of Players’ (numerical, target feature)

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Create dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temp.': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humid.': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29, 25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41]
}
df = pd.DataFrame(dataset_dict)

# One-hot encode 'Outlook' column (new columns: 'overcast', 'rain', 'sunny')
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)

# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)

# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
```

### Main Mechanism

The Decision Tree for regression operates by recursively dividing the data based on features that best reduce prediction error. Here’s the general process:

1. Begin with the entire dataset at the root node.
2. Choose the feature that minimizes a specific error metric (such as mean squared error or variance) to split the data.
3. Create child nodes based on the split, where each child represents a subset of the data aligned with the corresponding feature values.
4. Repeat steps 2–3 for each child node, continuing to split the data until a stopping condition is reached.
5. Assign a final predicted value to each leaf node, typically **the average of the target values** in that node.

### Training Steps

We will explore the regression part in the decision tree algorithm CART (Classification and Regression Trees). It builds binary trees and typically follows these steps:

1. Begin with all training samples in the root node.

2. For each feature in the dataset:

a. Sort the feature values in ascending order.

b. Consider all midpoints between adjacent values as potential split points.

In total, there are 23 split points to check.

3. For each potential split point:

a. Calculate the mean squared error (MSE) of each of the two groups of samples that the split would create.

b. Compute the weighted average of errors for the resulting split.

As an example, here we calculate the weighted average of MSE for the split point “Temperature” with value 73.0.

4. After evaluating all features and split points, select the one with lowest weighted average of MSE.

5. Create two child nodes based on the chosen feature and split point:

– Left child: samples with feature value <= split point

– Right child: samples with feature value > split point

6. Recursively repeat steps 2–5 for each child node. (Continue until a stopping criterion is met.)

7. At each leaf node, assign the average target value of the samples in that node as the prediction.
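The split search in steps 2–4 can be sketched in plain NumPy. This is a minimal illustration rather than scikit-learn's actual implementation: the helper names (`candidate_splits`, `weighted_mse`, `best_split`) are made up for this sketch, and the arrays are the training half of the dataset (its first 14 rows) typed out by hand, with a column order chosen for this sketch.

```python
import numpy as np

# Training half of the golf dataset (first 14 rows), written out by hand.
# The column order below is an assumption made for this sketch.
feature_names = ["Temp.", "Humid.", "Wind", "overcast", "rain", "sunny"]
X = np.array([
    [85, 85, 0, 0, 0, 1], [80, 90, 1, 0, 0, 1], [83, 78, 0, 1, 0, 0],
    [70, 96, 0, 0, 1, 0], [68, 80, 0, 0, 1, 0], [65, 70, 1, 0, 1, 0],
    [64, 65, 1, 1, 0, 0], [72, 95, 0, 0, 0, 1], [69, 70, 0, 0, 0, 1],
    [75, 80, 0, 0, 1, 0], [75, 70, 1, 0, 0, 1], [72, 90, 1, 1, 0, 0],
    [81, 75, 0, 1, 0, 0], [71, 80, 1, 0, 1, 0],
], dtype=float)
y = np.array([52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13], dtype=float)

def candidate_splits(column):
    """Midpoints between adjacent distinct values of one feature."""
    values = np.unique(column)  # sorted and de-duplicated
    return (values[:-1] + values[1:]) / 2

def mse(targets):
    """Mean squared error of a node that predicts its own mean."""
    return float(np.mean((targets - targets.mean()) ** 2))

def weighted_mse(column, targets, threshold):
    """Sample-weighted average of the two child MSEs for one split."""
    left, right = targets[column <= threshold], targets[column > threshold]
    return (len(left) * mse(left) + len(right) * mse(right)) / len(targets)

def best_split(X, y):
    """Try every candidate split and keep the lowest weighted MSE."""
    best = None
    for j in range(X.shape[1]):
        for t in candidate_splits(X[:, j]):
            score = weighted_mse(X[:, j], y, t)
            if best is None or score < best[2]:
                best = (j, t, score)
    return best

n_candidates = sum(len(candidate_splits(X[:, j])) for j in range(X.shape[1]))
temp_73 = weighted_mse(X[:, 0], y, 73.0)
feature_idx, threshold, score = best_split(X, y)

print(n_candidates, temp_73)  # 23 128.25
print(feature_names[feature_idx], threshold, round(score, 2))
```

The count of 23 matches the number of split points quoted in step 2. Ties between equally good splits are broken here by feature order, which is not necessarily how scikit-learn breaks them.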

```python
from sklearn.tree import DecisionTreeRegressor, plot_tree
import matplotlib.pyplot as plt

# Train the model
regr = DecisionTreeRegressor(random_state=42)
regr.fit(X_train, y_train)

# Visualize the decision tree
plt.figure(figsize=(26,8))
plot_tree(regr, feature_names=list(X.columns), filled=True, rounded=True, impurity=False, fontsize=16, precision=2)
plt.tight_layout()
plt.show()
```

In this scikit-learn output, the samples and values are shown for the leaf nodes and interim nodes.

### Regression/Prediction Step

Here’s how a regression tree makes predictions for new data:

1. Start at the top (root) of the tree.

2. At each decision point (node):

– Look at the feature and split value.

– If the data point’s feature value is smaller or equal, go left.

– If it’s larger, go right.

3. Keep moving down the tree until you reach the end (a leaf).

4. The prediction is the average value stored in that leaf.
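To make the traversal concrete, here is a small sketch that walks a fitted scikit-learn tree by hand using the arrays stored in its `tree_` attribute (`children_left`, `children_right`, `feature`, `threshold`, `value`). The tiny dataset and the helper `predict_one` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Tiny illustrative dataset (hypothetical values, not the article's golf data)
X = np.array([[64.0], [68.0], [70.0], [75.0], [80.0], [85.0]])
y = np.array([20.0, 24.0, 30.0, 40.0, 44.0, 52.0])

model = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)

def predict_one(model, sample):
    """Walk from the root to a leaf and return the value stored there."""
    t = model.tree_
    node = 0                                  # the root is always node 0
    while t.children_left[node] != -1:        # -1 marks a leaf
        feature, threshold = t.feature[node], t.threshold[node]
        if sample[feature] <= threshold:
            node = t.children_left[node]      # smaller or equal: go left
        else:
            node = t.children_right[node]     # larger: go right
    return t.value[node][0][0]                # mean target of the leaf's training samples

new_points = np.array([[65.0], [72.0], [90.0]])
manual = np.array([predict_one(model, row) for row in new_points])

print(manual)
print(model.predict(new_points))  # should agree with the manual walk
```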

#### Evaluation Step

The RMSE of the tree on the test set is much better (lower) than the result of the dummy regressor.
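One way to reproduce this comparison is sketched below. It rebuilds the dataset so that it runs on its own, and it assumes the dummy regressor simply predicts the mean of the training targets (`DummyRegressor(strategy='mean')`); the helper `rmse` is defined here for the sketch. The printed numbers are left for you to check rather than quoted.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast',
                'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain',
                'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast',
                'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temp.': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0,
              81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humid.': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0,
               88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True,
             True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13,
                    33, 29, 25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41],
})
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)

X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

def rmse(y_true, y_pred):
    """Root mean squared error, computed from scikit-learn's MSE."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Unpruned tree versus a baseline that always predicts the training mean
tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
dummy = DummyRegressor(strategy='mean').fit(X_train, y_train)

tree_rmse = rmse(y_test, tree.predict(X_test))
dummy_rmse = rmse(y_test, dummy.predict(X_test))

print(f"Tree RMSE:  {tree_rmse:.4f}")
print(f"Dummy RMSE: {dummy_rmse:.4f}")
```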

### Pre-pruning vs Post-pruning

Once the tree is built, the main concern is keeping it small enough to avoid overfitting. In general, pruning methods fall into two categories:

#### Pre-pruning

Pre-pruning, also known as early stopping, involves halting the growth of a decision tree during the training process based on certain predefined criteria. This approach aims to prevent the tree from becoming too complex and overfitting the training data. Common pre-pruning techniques include:

- **Maximum depth**: Limiting how deep the tree can grow.
- **Minimum samples for split**: Requiring a minimum number of samples to justify splitting a node.
- **Minimum samples per leaf**: Ensuring each leaf node has at least a certain number of samples.
- **Maximum number of leaf nodes**: Restricting the total number of leaf nodes in the tree.
- **Minimum impurity decrease**: Only allowing splits that decrease impurity by a specified amount.

These methods stop the tree’s growth when the specified conditions are met, effectively “pruning” the tree during its construction phase.

(We discussed these methods in the classifier article; they work exactly the same way in the regression case.)
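In scikit-learn, each of the pre-pruning criteria listed above corresponds to a constructor parameter of `DecisionTreeRegressor`. The sketch below sets all five at once on synthetic data; the data and the specific parameter values are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only (not the article's dataset)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 3 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 1, size=200)

# No limits: the tree grows until every leaf is pure
full_tree = DecisionTreeRegressor(random_state=42).fit(X, y)

pre_pruned = DecisionTreeRegressor(
    max_depth=4,                 # maximum depth
    min_samples_split=10,        # minimum samples for split
    min_samples_leaf=5,          # minimum samples per leaf
    max_leaf_nodes=12,           # maximum number of leaf nodes
    min_impurity_decrease=0.01,  # minimum impurity decrease
    random_state=42,
).fit(X, y)

print("full tree:  depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("pre-pruned: depth", pre_pruned.get_depth(), "leaves", pre_pruned.get_n_leaves())
```

Growth stops as soon as any one of the criteria applies, so in practice the tightest setting decides the final size.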

#### Post-pruning

Post-pruning, on the other hand, allows the decision tree to grow to its full extent and then prunes it back to reduce complexity. This approach first builds a complete tree and then removes or collapses branches that don’t significantly contribute to the model’s performance. One common post-pruning technique is called **Cost-Complexity Pruning.**

### Cost Complexity Pruning

#### Step 1: Calculate the Impurity for Each Node

For each interim node, calculate the impurity (MSE in the regression case). We then sort these values from lowest to highest.

```python
# Visualize the decision tree
plt.figure(figsize=(26,8))
plot_tree(regr, feature_names=list(X.columns), filled=True, rounded=True, impurity=True, fontsize=16, precision=2)
plt.tight_layout()
plt.show()
```

In this scikit-learn output, the impurity is shown as “squared_error” for each node.

Let’s name these interim nodes (from A-J). We then sort them by their MSE, from lowest to highest.

#### Step 2: Create Subtrees by Trimming The Weakest Link

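The goal is to gradually turn the interim nodes into leaves, starting from the **node with the lowest MSE** (= weakest link). This gives us a pruning path. (This is a simplified view: scikit-learn formally identifies the weakest link as the interim node with the smallest effective *α*, the quantity introduced in Step 5.)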

Let’s name each result “Subtree *i*”, where *i* is the number of times the tree has been pruned. Starting from the original tree, we prune at the node with the lowest MSE, then at the next lowest, and so on (node J first, then M (already removed along with J), L, K, and so on).

#### Step 3: Calculate Total Leaf Impurities for Each Subtree

For each subtree *T*, total leaf impurities (*R*(*T*)) can be calculated as:

$$R(T) = \frac{1}{N} \sum_{L} I(L)\, n_L$$

where:

- $L$ ranges over all leaf nodes
- $n_L$ is the number of samples in leaf $L$
- $N$ is the total number of samples in the tree
- $I(L)$ is the impurity (MSE) of leaf $L$

The more we prune, the higher the total leaf impurities.
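The formula can be evaluated directly from a fitted scikit-learn tree, which stores each node's impurity and sample count in `tree_.impurity` and `tree_.n_node_samples`. The sketch below uses synthetic data and a hypothetical helper, `total_leaf_impurity`; it leans on the `ccp_alpha` parameter (covered in Step 5) only as a convenient way to get trees that are pruned to different degrees.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only (not the article's golf dataset)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 2))
y = 2 * X[:, 0] + rng.normal(0, 1, size=60)

def total_leaf_impurity(model):
    """R(T) = (1/N) * sum over leaves of I(L) * n_L."""
    t = model.tree_
    is_leaf = t.children_left == -1     # leaves have no children
    n_total = t.n_node_samples[0]       # the root contains every training sample
    return float(np.sum(t.impurity[is_leaf] * t.n_node_samples[is_leaf]) / n_total)

results = []
for alpha in (0.0, 0.5, 5.0):           # a larger ccp_alpha prunes more aggressively
    model = DecisionTreeRegressor(random_state=42, ccp_alpha=alpha).fit(X, y)
    r_t = total_leaf_impurity(model)
    train_mse = float(np.mean((y - model.predict(X)) ** 2))
    results.append((alpha, model.get_n_leaves(), r_t, train_mse))
    print(f"ccp_alpha={alpha}: leaves={model.get_n_leaves()}, R(T)={r_t:.3f}, training MSE={train_mse:.3f}")
```

Because each leaf predicts the mean of its own samples, *R*(*T*) is the same quantity as the tree's MSE on the training data, which is why the last two columns agree.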

#### Step 4: Compute the Cost Function

To control when to stop turning the interim nodes into leaves, we first compute the cost complexity of each subtree *T* using the following formula:

$$\text{Cost}(T) = R(T) + \alpha\,|T|$$

where:

- $R(T)$ is the total leaf impurities
- $|T|$ is the number of leaf nodes in the subtree
- $\alpha$ is the complexity parameter

#### Step 5: Select the Alpha

The value of alpha controls which subtree we end up with. The **subtree with the lowest cost will be the final tree**.

When *α* is small, we care more about accuracy (bigger trees). When *α* is large, we care more about simplicity (smaller trees).
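A short sketch of this selection rule, using made-up numbers: the `subtrees` list below holds hypothetical (*R*(*T*), leaf count) pairs, not values from the article's tree, and `cost` and `select_subtree` are helper names introduced only for this example.

```python
# Hypothetical subtrees: (total leaf impurities R(T), number of leaves |T|).
# These numbers are invented for illustration; they do not come from the article's tree.
subtrees = [
    (0.0, 8),
    (2.0, 6),
    (7.0, 4),
    (20.0, 2),
    (60.0, 1),
]

def cost(r, n_leaves, alpha):
    """Cost(T) = R(T) + alpha * |T|"""
    return r + alpha * n_leaves

def select_subtree(subtrees, alpha):
    """Return the (R, leaves) pair with the lowest cost for this alpha."""
    return min(subtrees, key=lambda s: cost(s[0], s[1], alpha))

for alpha in (0.5, 3.0, 50.0):
    r, n_leaves = select_subtree(subtrees, alpha)
    print(f"alpha={alpha}: leaves={n_leaves}, cost={cost(r, n_leaves, alpha):.1f}")

# alpha=0.5: leaves=8, cost=4.0
# alpha=3.0: leaves=4, cost=19.0
# alpha=50.0: leaves=1, cost=110.0
```

As *α* grows, the penalty on the leaf count dominates and the lowest-cost subtree shrinks from 8 leaves to a single leaf.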

While we can freely set *α*, scikit-learn can also report the smallest value of *α* that yields a particular subtree. This is called the **effective α**, and it can be computed from the pruning path:

```python
# Compute the cost-complexity pruning path (one call returns both arrays)
tree = DecisionTreeRegressor(random_state=42)
path = tree.cost_complexity_pruning_path(X_train, y_train)
effective_alphas, impurities = path.ccp_alphas, path.impurities

# Train a tree for each effective alpha and count its leaf nodes
leaf_counts = [
    DecisionTreeRegressor(random_state=42, ccp_alpha=alpha).fit(X_train, y_train).get_n_leaves()
    for alpha in effective_alphas
]

# Create DataFrame with analysis results
pruning_analysis = pd.DataFrame({
    'total_leaf_impurities': impurities,
    'leaf_count': leaf_counts,
    'cost_function': [f"{imp:.3f} + {leaves}α" for imp, leaves in zip(impurities, leaf_counts)],
    'effective_α': effective_alphas
})

print(pruning_analysis)
```

#### Final Remarks

Pre-pruning methods are generally faster and more memory-efficient, as they prevent the tree from growing too large in the first place.

Post-pruning can potentially produce better trees, as it considers the entire tree structure before making pruning decisions. However, it can be more computationally expensive.

Both approaches aim to find a balance between model complexity and performance, with the goal of creating a model that generalizes well to unseen data. The choice between pre-pruning and post-pruning (or a combination of both) often depends on the specific dataset, the problem at hand, and of course, computational resources available.

In practice, it’s common to use a combination of these methods, like applying some pre-pruning criteria to prevent excessively large trees, and then using post-pruning for fine-tuning the model’s complexity.
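A combination of the two might look like the sketch below, where `max_depth` and `min_samples_leaf` cap the tree while it grows and `ccp_alpha` then prunes what was grown. The synthetic data and all parameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only (not the article's dataset)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 3 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 1, size=200)

# Pre-pruning only: growth is capped by depth and leaf size
pre_only = DecisionTreeRegressor(
    max_depth=6, min_samples_leaf=3, random_state=42
).fit(X, y)

# Pre-pruning plus post-pruning: the same caps, then cost-complexity pruning
combined = DecisionTreeRegressor(
    max_depth=6, min_samples_leaf=3, ccp_alpha=0.05, random_state=42
).fit(X, y)

print("pre-pruning only:", pre_only.get_n_leaves(), "leaves")
print("pre + post:      ", combined.get_n_leaves(), "leaves")
```

Since the pruning step only removes branches from the tree that was grown, the combined model can never have more leaves than the pre-pruned one.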

### 🌟 Decision Tree Regressor (with Cost Complexity Pruning) Code Summarized

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Create dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Num_Players': [52,39,43,37,28,19,43,47,56,33,49,23,42,13,33,29,25,51,41,14,34,29,49,36,57,21,23,41]
}
df = pd.DataFrame(dataset_dict)

# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)

# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)

# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Initialize Decision Tree Regressor
tree = DecisionTreeRegressor(random_state=42)

# Get the cost complexity path, impurities, and effective alpha
path = tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
print(ccp_alphas)
print(impurities)

# Train the final tree with the chosen alpha
# (trees do not need feature scaling, so the unscaled training data is used)
final_tree = DecisionTreeRegressor(random_state=42, ccp_alpha=0.1)
final_tree.fit(X_train, y_train)

# Make predictions
y_pred = final_tree.predict(X_test)

# Calculate and print RMSE
rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE: {rmse:.4f}")
```

#### Further Reading

For a detailed explanation of the Decision Tree Regressor, Cost Complexity Pruning, and its implementation in scikit-learn, readers can refer to their official documentation. It provides comprehensive information on their usage and parameters.

#### Technical Environment

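This article uses Python 3.7 and scikit-learn 1.5. Note that scikit-learn 1.5 requires Python 3.9 or newer, so a more recent interpreter is needed to run the code as written. While the concepts discussed are generally applicable, specific code implementations may vary slightly with different versions.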

#### About the Illustrations

Unless otherwise noted, all images are created by the author, incorporating licensed design elements from Canva Pro.

Decision Tree Regressor, Explained: A Visual Guide with Code Examples was originally published in Towards Data Science on Medium.