Dummy Regressor, Explained: A Visual Guide with Code Examples for Beginners
REGRESSION ALGORITHM
Naively choosing the best single number for all of your predictions
My students often come to me saying they want to try the most sophisticated model out there for their machine learning task, and sometimes I jokingly reply, “Have you tried the best ever model first?” Especially in regression (where we don’t have that “100% accuracy” goal), some machine learning models seemingly get a good, low error score, but when you compare them with a dummy model, they’re actually… not that great.
So, here’s the dummy regressor. Just as with the classifier, a regression task also has its baseline model: the first model you should try, to get a rough idea of how much better your machine learning models could be.
All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.
Definition
A dummy regressor is a simple machine learning model that predicts numerical values using basic rules, without actually learning from the input data. Like its classification counterpart, it serves as a baseline for comparing the performance of more complex regression models. The dummy regressor helps us understand if our models are actually learning useful patterns or just making naive predictions.
Dummy Regressor is the simplest machine learning model imaginable.
📊 Dataset & Libraries
Throughout this article, we’ll use this simple artificial golf dataset (again, inspired by [1]) as an example. The task is to predict the number of golfers visiting our golf course. The dataset includes features like outlook, temperature, humidity, and wind, with the target variable being the number of golfers.
Columns: ‘Outlook’, ‘Temperature’ (in Fahrenheit), ‘Humidity’ (in %), ‘Wind’ (Yes/No) and ‘Number of Players’ (numerical, target feature)
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Create dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29, 25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41]
}
df = pd.DataFrame(dataset_dict)
# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)
# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
Evaluating Regression Result
Before getting into the dummy regressor itself, let’s recap how to evaluate a regression result. While in classification it is very intuitive to check the accuracy of the model (just the ratio of matching values), regression is a bit different.
RMSE (root mean squared error) is like a score for regression models. It tells us how far off our predictions are from the actual values. Just as we want high accuracy in classification to get more right answers, we want a low RMSE in regression to be closer to the true values.
People like using RMSE because its value is in the same units as the quantity we’re trying to predict.
An RMSE of 3 can be interpreted as the predictions being off from the actual values by about 3, on average.
from sklearn.metrics import mean_squared_error
y_true = np.array([10, 15, 20, 15, 10]) # True labels
y_pred = np.array([15, 11, 18, 14, 10]) # Predicted values
# Calculate RMSE using scikit-learn
rmse = mean_squared_error(y_true, y_pred, squared=False)
print(f"RMSE = {rmse:.2f}")
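As a sanity check, here is a minimal sketch (reusing the y_true, y_pred, and np defined above) that computes the same number directly from the formula, i.e. the square root of the mean squared difference:
# RMSE by hand: sqrt(mean((y_true - y_pred)^2))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(f"RMSE (manual) = {rmse_manual:.2f}")  # should match scikit-learn's result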
With that in mind, let’s get into the algorithm.
Main Mechanism
Dummy Regressor makes predictions based on simple rules, such as always returning the mean or median of the target values in the training data.
For our golf dataset, a dummy regressor might always predict “40.5” for the number of players, as that is the median of the training labels.
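To make this concrete, here is a minimal check (assuming the y_train split created earlier) showing that this “model” is nothing more than a summary statistic of the training target:
# The dummy prediction is just the median of the training target
print(np.median(y_train))  # 40.5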
Training Steps
It’s a bit of a stretch to say there is any real training process in a dummy regressor, but here’s a general outline anyway:
1. Select Strategy
Choose one of the following strategies:
Mean: Always predicts the mean of the training target values.
Median: Always predicts the median of the training target values.
Constant: Always predicts a constant value provided by the user.
Depending on the strategy, the Dummy Regressor makes a different numerical prediction.
from sklearn.dummy import DummyRegressor
# Choose a strategy for your DummyRegressor ('mean', 'median', 'constant')
strategy = 'median'
2. Calculate the Metric
Calculate either the mean or the median of the training target values, depending on your chosen strategy.
The algorithm simply calculates the median of the training data: in this case we get 40.5.
# Initialize the DummyRegressor
dummy_reg = DummyRegressor(strategy=strategy)
# “Train” the DummyRegressor (although no real training happens)
dummy_reg.fit(X_train, y_train)
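After fitting, scikit-learn stores the computed value in the fitted attribute constant_; a quick check confirms that it is simply the median of y_train:
# Inspect what the "trained" model actually stored
print(dummy_reg.constant_)   # [[40.5]]
print(np.median(y_train))    # 40.5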
3. Apply Strategy to Test Data
Use the chosen strategy to generate a list of predicted numerical labels for your test data.
If we choose the “median” strategy, the calculated median (40.5) will simply be the prediction for every test sample.
# Use the DummyRegressor to make predictions
y_pred = dummy_reg.predict(X_test)
print("Label :", list(y_test))
print("Prediction:", list(y_pred))
4. Evaluate the Model
The dummy regressor with this strategy gives an RMSE of 13.28, which serves as the baseline for future models.
# Evaluate the Dummy Regressor's error
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_test, y_pred, squared=False)
print(f"Dummy Regression Error: {rmse.round(2)}")
Key Parameters
There is really only one key parameter in the dummy regressor:
Strategy: This determines how the regressor makes predictions. Common options include:
– mean: Provides an average baseline, commonly used for general scenarios.
– median: More robust against outliers, good for skewed target distributions.
– constant: Useful when domain knowledge suggests a specific constant prediction.
Constant: When using the ‘constant’ strategy, this additional parameter specifies the value to always predict.
Regardless of the strategy used, the results are all similarly poor, but they set a clear bar: our next regression model should achieve an RMSE lower than this baseline (roughly 12 to 13 on our dataset), as shown in the comparison below.
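As a rough illustration, here is a minimal sketch (reusing the train/test split from above; the constant value of 40 is an arbitrary choice for demonstration) that compares the RMSE of all three strategies side by side:
# Compare the baseline RMSE of each DummyRegressor strategy
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
strategy_kwargs = {'mean': {}, 'median': {}, 'constant': {'constant': 40}}
for name, kwargs in strategy_kwargs.items():
    reg = DummyRegressor(strategy=name, **kwargs).fit(X_train, y_train)
    rmse = mean_squared_error(y_test, reg.predict(X_test), squared=False)
    print(f"{name:>8}: RMSE = {rmse:.2f}")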
Pros and Cons
As a lazy predictor, the dummy regressor certainly has its strengths and limitations.
Pros:
Easy Benchmark: Quickly shows the minimum performance other models should beat.
Fast: Takes no time to set up and run.
Cons:
Doesn’t Learn: Just uses simple rules, so it’s often outperformed by real models.
Ignores Features: Doesn’t consider any input data when making predictions.
Final Remarks
Using a dummy regressor should be the first step whenever we have a regression task. It provides a standard baseline, so we can be sure that a more complex model actually gives better results than a naive prediction. As you learn more advanced techniques, never forget to compare your models against these simple baselines: those naive predictions might be exactly what you need first!
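As an illustration of that workflow, here is a minimal sketch (using scikit-learn’s LinearRegression purely as a stand-in for “your model”, and the same train/test split as above) that puts a real model next to the dummy baseline:
# Check whether a trained model actually beats the dummy baseline
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
baseline = DummyRegressor(strategy='median').fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)
for name, reg in [('Dummy (median)', baseline), ('Linear Regression', model)]:
    rmse = mean_squared_error(y_test, reg.predict(X_test), squared=False)
    print(f"{name}: RMSE = {rmse:.2f}")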
🌟 Dummy Regressor Code Summarized
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.dummy import DummyRegressor
# Create dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29, 25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41]
}
df = pd.DataFrame(dataset_dict)
# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)
# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
# Initialize and train the model
dummy_reg = DummyRegressor(strategy='median')
dummy_reg.fit(X_train, y_train)
# Make predictions
y_pred = dummy_reg.predict(X_test)
# Calculate and print RMSE
print(f"RMSE: {mean_squared_error(y_test, y_pred, squared=False)}")
Further Reading
For a detailed explanation of the DummyRegressor and its implementation in scikit-learn, readers can refer to the official documentation [2], which provides comprehensive information on its usage and parameters.
Technical Environment
This article uses Python 3.7 and scikit-learn 1.5. While the concepts discussed are generally applicable, specific code implementations may vary slightly with different versions.
About the Illustrations
Unless otherwise noted, all images are created by the author, incorporating licensed design elements from Canva Pro.
Reference
[1] T. M. Mitchell, Machine Learning (1997), McGraw-Hill Science/Engineering/Math, pp. 59
[2] F. Pedregosa et al., Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html