Model Validation Techniques, Explained: A Visual Guide with Code Examples
MODEL EVALUATION & OPTIMIZATION
12 must-know methods to validate your machine learning models
Every day, machines make millions of predictions — from detecting objects in photos to helping doctors find diseases. But before trusting these predictions, we need to know if they’re any good. After all, no one would want to use a machine that’s wrong most of the time!
This is where validation comes in. Validation methods test machine predictions to measure their reliability. While this might sound simple, different validation approaches exist, each designed to handle specific challenges in machine learning.
Here, I’ve organized these validation techniques — all 12 of them — in a tree structure, showing how they evolved from basic concepts into more specialized ones. And of course, we will use clear visuals and a consistent dataset to show what each method does differently and why method selection matters.
All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.
What is Model Validation?
Model validation is the process of testing how well a machine learning model works with data it hasn’t seen or used during training. Basically, we use existing data to check the model’s performance instead of using new data. This helps us identify problems before deploying the model for real use.
There are several validation methods, and each method has specific strengths and addresses different validation challenges:
· Different validation methods can produce different results, so choosing the right method matters.
· Some validation techniques work better with specific types of data and models.
· Using incorrect validation methods can give misleading results about the model’s true performance.
Here is a tree diagram showing how these validation methods relate to each other:
The tree diagram shows which validation methods are connected to each other.
Next, we’ll look at each validation method more closely by showing exactly how they work. To make everything easier to understand, we’ll walk through clear examples that show how these methods work with real data.
📊 📈 Our Running Example
We will use the same example throughout to help you understand each testing method. While this dataset may not be ideal for some validation methods, using a single example for educational purposes makes it easier to compare the different methods and see how each one works.
📊 The Golf Playing Dataset
We’ll work with this dataset that predicts whether someone will play golf based on weather conditions.
Columns: ‘Outlook’ (one-hot encoded into 3 columns), ‘Temperature’ (in Fahrenheit), ‘Humidity’ (in %), ‘Wind’ (True/False) and ‘Play’ (Yes/No, target feature)

import pandas as pd
import numpy as np
# Load the dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast',
                'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy',
                'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast',
                'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,
                    72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,
                    88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,
                 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,
                 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True,
             True, False, True, True, False, False, True, False, True, True, False,
             True, False, False, True, False, False],
    'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes',
             'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes',
             'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# Data preprocessing
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)

# Set the label
X, y = df.drop('Play', axis=1), df['Play']
📈 Our Model Choice
We will use a decision tree classifier for all our tests. We picked this model because we can easily draw the resulting model as a tree structure, with each branch showing different decisions. To keep things simple and focus on how we test the model, we will use the default scikit-learn parameters with a fixed random_state.
Let’s be clear about these two terms we’ll use: The decision tree classifier is our learning algorithm — it’s the method that finds patterns in our data. When we feed data into this algorithm, it creates a model (in this case, a tree with clear branches showing different decisions). This model is what we’ll actually use to make predictions.
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
dt = DecisionTreeClassifier(random_state=42)
Each time we split our data differently for validation, we’ll get different models with different decision rules. Once our validation shows that our algorithm works reliably, we’ll create one final model using all our data. This final model is the one we’ll actually use to predict if someone will play golf or not.
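To make that last step concrete, here is a minimal sketch of fitting the final model on the full dataset (the new_weather_data frame is hypothetical, shown only to illustrate usage):

from sklearn.tree import DecisionTreeClassifier

# After validation confirms the algorithm works, fit one final model on ALL the data
final_model = DecisionTreeClassifier(random_state=42)
final_model.fit(X, y)

# This final model is the one we would actually use for predictions, e.g.:
# final_model.predict(new_weather_data)  # new_weather_data: a hypothetical DataFrame with the same columns as X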
With this setup ready, we can now focus on understanding how each validation method works and how it helps us make better predictions about golf playing based on weather conditions. Let’s examine each validation method one at a time.
Hold-out Methods
Hold-out methods are the most basic way to check how well our model works. In these methods, we basically save some of our data just for testing.
Train-Test Split
This method is simple: we split our data into two parts. We use one part to train our model and the other part to test it. Before we split the data, we mix it up randomly so the order of our original data doesn’t affect our results.
The sizes of the training and test sets depend on our total dataset size and are usually described by their ratio. To determine them, you can follow these guidelines:
· For small datasets (around 1,000–10,000 samples), use an 80:20 ratio.
· For medium datasets (around 10,000–100,000 samples), use a 70:30 ratio.
· For large datasets (over 100,000 samples), use a 90:10 ratio.

from sklearn.model_selection import train_test_split
### Simple Train-Test Split ###
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train and evaluate
dt.fit(X_train, y_train)
test_accuracy = dt.score(X_test, y_test)
# Plot
plt.figure(figsize=(5, 5), dpi=300)
plot_tree(dt, feature_names=X.columns, filled=True, rounded=True)
plt.title(f'Train-Test Split (Test Accuracy: {test_accuracy:.3f})')
plt.tight_layout()
This method is easy to use, but it has some limitations — the results can change a lot depending on how we randomly split the data. This is why it’s worth trying a few different random_state values to make sure the results are consistent. Also, if we don’t have much data to start with, we might not have enough to properly train or test our model.
Train-Validation-Test Split
This method splits our data into three parts. The middle part, called validation data, is used to tune the model’s parameters, and we aim for the lowest error there.
Since the validation results are considered many times during this tuning process, our model might start doing too well on this validation data (after all, that’s what we’re optimizing for). This is why we keep a separate test set. We test on it only once, at the very end — it gives us an honest measure of how well our model works.
Here are typical ways to split your data:
· For smaller datasets (1,000–10,000 samples), use a 60:20:20 ratio.
· For medium datasets (10,000–100,000 samples), use a 70:15:15 ratio.
· For large datasets (> 100,000 samples), use an 80:10:10 ratio.

### Train-Validation-Test Split ###
# First split: separate test set
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Second split: separate validation set
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=42
)
# Train and evaluate
dt.fit(X_train, y_train)
val_accuracy = dt.score(X_val, y_val)
test_accuracy = dt.score(X_test, y_test)
# Plot
plt.figure(figsize=(5, 5), dpi=300)
plot_tree(dt, feature_names=X.columns, filled=True, rounded=True)
plt.title(f'Train-Val-Test Split\nValidation Accuracy: {val_accuracy:.3f}'
          f'\nTest Accuracy: {test_accuracy:.3f}')
plt.tight_layout()
Hold-out methods work differently depending on how much data you have. They work really well when you have lots of data (> 100,000). But when you have less data (< 1,000), this method may not be the best choice. With smaller datasets, you might need more advanced validation methods to get a better understanding of how well your model really works.
📊 Moving to Cross-validation
We just learned that hold-out methods might not work very well with small datasets. This is exactly the challenge we currently face — we only have 28 days of data. Following the hold-out principle, we’ll keep 14 days of data separate for our final test. This leaves us with 14 days to work with for trying other validation methods.
# Initial train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=False)
In the next part, we’ll see how cross-validation methods can take these 14 days and split them up multiple times in different ways. This gives us a better idea of how well our model is really working, even with such limited data.
Cross Validation
Cross-validation changes how we think about testing our models. Instead of testing our model just once with one split of data, we test it many times using different splits of the same data. This helps us understand much better how well our model really works.
The main idea of cross-validation is to test our model multiple times, where each time the training and test sets come from different parts of our data. This helps prevent bias from one really good (or really bad) split of the data.
Here’s why this matters: say our model gets 95% accuracy when we test it one way, but only 75% when we test it another way using the same data. Which number shows how good our model really is? Cross-validation helps us answer this question by giving us many test results instead of just one. This gives us a clearer picture of how well our model actually performs.
K-Fold Methods
Basic K-Fold Cross-Validation
K-fold cross-validation fixes a big problem with basic splitting: relying too much on just one way of splitting the data. Instead of splitting the data once, K-fold splits the data into K equal parts. Then it tests the model multiple times, using a different part for testing each time while using all other parts for training.
The number we pick for K changes how we test our model. Most people use 5 or 10 for K, but this can change based on how much data we have and what we need for our project. Let’s say we use K = 3. This means we split our data into three equal parts. We then train and test our model three different times. Each time, 2/3 of the data is used for training and 1/3 for testing, but we rotate which part is being used for testing. This way, every piece of data gets used for both training and testing.
from sklearn.model_selection import KFold, cross_val_score
# Cross-validation strategy
cv = KFold(n_splits=3, shuffle=True, random_state=42)
# Calculate cross-validation scores
scores = cross_val_score(dt, X_train, y_train, cv=cv)
print(f"Validation accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
# Plot trees for each split
plt.figure(figsize=(4, 3.5*cv.get_n_splits(X_train)))
for i, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    # Train and visualize the tree for this split
    dt.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    plt.subplot(cv.get_n_splits(X_train), 1, i+1)
    plot_tree(dt, feature_names=X_train.columns, impurity=False, filled=True, rounded=True)
    plt.title(f'Split {i+1} (Validation Accuracy: {scores[i]:.3f})\n'
              f'Train indices: {train_idx}\nValidation indices: {val_idx}')
plt.tight_layout()
Validation accuracy: 0.433 ± 0.047
When we’re done with all the rounds, we calculate the average performance from all K tests. This average gives us a more trustworthy measure of how well our model works. We can also learn about how stable our model is by looking at how much the results change between different rounds of testing.
Stratified K-Fold
Basic K-fold cross-validation usually works well, but it can run into problems when our data is unbalanced — meaning we have a lot more of one type than others. For example, if we have 100 data points and 90 of them are type A while only 10 are type B, randomly splitting this data might give us pieces that don’t have enough type B to test properly.
Stratified K-fold fixes this by making sure each split has the same mix as our original data. If our full dataset has 10% type B, each split will also have about 10% type B. This makes our testing more reliable, especially when some types of data are much rarer than others.
from sklearn.model_selection import StratifiedKFold, cross_val_score
# Cross-validation strategy
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
# Calculate cross-validation scores
scores = cross_val_score(dt, X_train, y_train, cv=cv)
print(f"Validation accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
# Plot trees for each split
plt.figure(figsize=(5, 4*cv.get_n_splits(X_train)))
for i, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    # Train and visualize the tree for this split
    dt.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    plt.subplot(cv.get_n_splits(X_train), 1, i+1)
    plot_tree(dt, feature_names=X_train.columns, impurity=False, filled=True, rounded=True)
    plt.title(f'Split {i+1} (Validation Accuracy: {scores[i]:.3f})\n'
              f'Train indices: {train_idx}\nValidation indices: {val_idx}')
plt.tight_layout()
Validation accuracy: 0.650 ± 0.071
Keeping this balance helps in two ways. First, it makes sure each split properly represents what our data looks like. Second, it gives us more consistent test results. This means that if we test our model multiple times, we’ll most likely get similar results each time.
Repeated K-Fold
Sometimes, even when we use K-fold validation, our test results can change a lot between different random splits. Repeated K-fold solves this by running the entire K-fold process multiple times, using different random splits each time.
For example, let’s say we run 5-fold cross-validation three times. This means our model goes through training and testing 15 times in total. By testing so many times, we can better tell which differences in results come from random chance and which ones show how well our model really performs. The downside is that all this extra testing takes more time to complete.
from sklearn.model_selection import RepeatedKFold
# Cross-validation strategy
n_splits = 3
cv = RepeatedKFold(n_splits=n_splits, n_repeats=2, random_state=42)
# Calculate cross-validation scores
scores = cross_val_score(dt, X_train, y_train, cv=cv)
print(f"Validation accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
# Plot trees for each split
total_splits = cv.get_n_splits(X_train)  # Will be 6 (3 folds × 2 repetitions)
plt.figure(figsize=(5, 4*total_splits))
for i, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    # Train and visualize the tree for this split
    dt.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    # Calculate repetition and fold numbers
    repetition, fold = i // n_splits + 1, i % n_splits + 1
    plt.subplot(total_splits, 1, i+1)
    plot_tree(dt, feature_names=X_train.columns, impurity=False, filled=True, rounded=True)
    plt.title(f'Split {repetition}.{fold} (Validation Accuracy: {scores[i]:.3f})\n'
              f'Train indices: {list(train_idx)}\n'
              f'Validation indices: {list(val_idx)}')
plt.tight_layout()
Validation accuracy: 0.425 ± 0.107
When we look at repeated K-fold results, since we have many sets of test results, we can do more than just calculate the average — we can also figure out how confident we are in our results. This gives us a better understanding of how reliable our model really is.
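For example, here is a minimal sketch of turning the repeated K-fold scores above into a rough 95% confidence interval (only an approximation, since cross-validation scores are not fully independent):

import numpy as np

# Rough 95% confidence interval from the repeated K-fold scores (approximation)
mean_score = scores.mean()
margin = 1.96 * scores.std() / np.sqrt(len(scores))
print(f"Validation accuracy: {mean_score:.3f} ± {margin:.3f} (approx. 95% CI)")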
Repeated Stratified K-Fold
This method combines two things we just learned about: keeping class balance (stratification) and running multiple rounds of testing (repetition). It keeps the right mix of different types of data while testing many times. This works especially well when we have a small dataset that’s uneven — where we have a lot more of one type of data than others.
from sklearn.model_selection import RepeatedStratifiedKFold
# Cross-validation strategy
n_splits = 3
cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=2, random_state=42)
# Calculate cross-validation scores
scores = cross_val_score(dt, X_train, y_train, cv=cv)
print(f"Validation accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
# Plot trees for each split
total_splits = cv.get_n_splits(X_train)  # Will be 6 (3 folds × 2 repetitions)
plt.figure(figsize=(5, 4*total_splits))
for i, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    # Train and visualize the tree for this split
    dt.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    # Calculate repetition and fold numbers
    repetition, fold = i // n_splits + 1, i % n_splits + 1
    plt.subplot(total_splits, 1, i+1)
    plot_tree(dt, feature_names=X_train.columns, impurity=False, filled=True, rounded=True)
    plt.title(f'Split {repetition}.{fold} (Validation Accuracy: {scores[i]:.3f})\n'
              f'Train indices: {list(train_idx)}\n'
              f'Validation indices: {list(val_idx)}')
plt.tight_layout()
Validation accuracy: 0.542 ± 0.167
However, there’s a trade-off: this method takes more time for our computer to run. Each time we repeat the whole process, it multiplies how long it takes to train our model. When deciding whether to use this method, we need to think about whether having more reliable results is worth the extra time it takes to run all these tests.
Group K-Fold
Sometimes our data naturally comes in groups that should stay together. Think about golf data where we have many measurements from the same golf course throughout the year. If we put some measurements from one golf course in training data and others in test data, we create a problem: our model would indirectly learn about the test data during training because it saw other measurements from the same course.
Group K-fold fixes this by keeping all data from the same group (like all measurements from one golf course) together in the same part when we split the data. This prevents our model from accidentally seeing information it shouldn’t, which could make us think it performs better than it really does. This method can be important when working with data that naturally comes in groups, like multiple weather readings from the same golf course or data that was collected over time from the same location.
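Our small golf dataset has no natural grouping column, so the snippet below is only a sketch: it assumes a hypothetical groups array (for example, an ID telling us which golf course each row came from) and shows how GroupKFold plugs into cross_val_score.

from sklearn.model_selection import GroupKFold, cross_val_score
import numpy as np

# Hypothetical group labels: pretend each row came from one of three golf courses
groups = np.array([i % 3 for i in range(len(X_train))])

# GroupKFold keeps all rows with the same group label in the same fold
cv = GroupKFold(n_splits=3)
scores = cross_val_score(dt, X_train, y_train, cv=cv, groups=groups)
print(f"Validation accuracy: {scores.mean():.3f} ± {scores.std():.3f}")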
Time Series Split
When we split data randomly in regular K-fold, we assume each piece of data doesn’t affect the others. But this doesn’t work well with data that changes over time, where what happened before affects what happens next. Time series split changes K-fold to work better with this kind of time-ordered data.
Instead of splitting data randomly, time series split uses data in order, from past to future. The training data only includes information from times before the testing data. This matches how we use models in real life, where we use past data to predict what will happen next.
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
# Cross-validation strategy
cv = TimeSeriesSplit(n_splits=3)
# Calculate cross-validation scores
scores = cross_val_score(dt, X_train, y_train, cv=cv)
print(f"Validation accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
# Plot trees for each split
plt.figure(figsize=(4, 3.5*cv.get_n_splits(X_train)))
for i, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    # Train and visualize the tree for this split
    dt.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    plt.subplot(cv.get_n_splits(X_train), 1, i+1)
    plot_tree(dt, feature_names=X_train.columns, impurity=False, filled=True, rounded=True)
    plt.title(f'Split {i+1} (Validation Accuracy: {scores[i]:.3f})\n'
              f'Train indices: {train_idx}\n'
              f'Validation indices: {val_idx}')
plt.tight_layout()
Validation accuracy: 0.556 ± 0.157
For example, with K=3 and our golf data, we might train using weather data from January and February to predict March’s golf playing patterns. Then we’d train using January through March to predict April, and so on. By only going forward in time, this method gives us a more realistic idea of how well our model will work when predicting future golf playing patterns based on weather.
Leave-Out Methods
Leave-One-Out Cross-Validation (LOOCV)
Leave-One-Out Cross-Validation (LOOCV) is the most thorough validation method. It uses just one sample for testing and all other samples for training. The validation is repeated until every single piece of data has been used for testing.
Let’s say we have 100 days of golf weather data. LOOCV would train and test the model 100 times. Each time, it uses 99 days for training and 1 day for testing. This method removes any randomness in testing — if you run LOOCV on the same data multiple times, you’ll always get the same results.
However, LOOCV takes a lot of computing time. If you have N pieces of data, you need to train your model N times. With large datasets or complex models, this might take too long to be practical. Some simpler models, like linear ones, have shortcuts that make LOOCV faster, but this isn’t true for all models.
from sklearn.model_selection import LeaveOneOut
# Cross-validation strategy
cv = LeaveOneOut()
# Calculate cross-validation scores
scores = cross_val_score(dt, X_train, y_train, cv=cv)
print(f"Validation accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
# Plot trees for each split
plt.figure(figsize=(4, 3.5*cv.get_n_splits(X_train)))
for i, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    # Train and visualize the tree for this split
    dt.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    plt.subplot(cv.get_n_splits(X_train), 1, i+1)
    plot_tree(dt, feature_names=X_train.columns, impurity=False, filled=True, rounded=True)
    plt.title(f'Split {i+1} (Validation Accuracy: {scores[i]:.3f})\n'
              f'Train indices: {train_idx}\n'
              f'Validation indices: {val_idx}')
plt.tight_layout()
Validation accuracy: 0.429 ± 0.495
LOOCV works really well when we don’t have much data and need to make the most of every piece we have. Since the results depend on every single data point, they can change a lot if our data has noise or unusual values in it.
Leave-P-Out Cross-Validation
Leave-P-Out builds on the idea of Leave-One-Out, but instead of testing with just one piece of data, it tests with P pieces at a time. This creates a balance between Leave-One-Out and K-fold validation. The number we choose for P changes how we test the model and how long it takes.
The main problem with Leave-P-Out is how quickly the number of possible test combinations grows. For example, if we have 100 days of golf weather data and we want to test with 5 days at a time (P=5), there are millions of different possible ways to choose those 5 days. Testing all these combinations takes too much time when we have lots of data or when we use a larger number for P.
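As a quick aside before the main example, here is a small sketch of just how fast that count grows (the 100-day figure is the illustrative number from the paragraph above):

from math import factorial

# Number of ways to choose 5 test days out of 100: C(100, 5)
n, p = 100, 5
print(factorial(n) // (factorial(p) * factorial(n - p)))  # 75287520 possible test sets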
from sklearn.model_selection import LeavePOut, cross_val_score
# Cross-validation strategy
cv = LeavePOut(p=3)
# Calculate cross-validation scores (using all splits for accuracy)
scores = cross_val_score(dt, X_train, y_train, cv=cv)
print(f"Validation accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
# Plot first 15 trees
n_trees = 15
plt.figure(figsize=(4, 3.5*n_trees))
for i, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    if i >= n_trees:
        break
    # Train and visualize the tree for this split
    dt.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    plt.subplot(n_trees, 1, i+1)
    plot_tree(dt, feature_names=X_train.columns, impurity=False, filled=True, rounded=True)
    plt.title(f'Split {i+1} (Validation Accuracy: {scores[i]:.3f})\n'
              f'Train indices: {train_idx}\n'
              f'Validation indices: {val_idx}')
plt.tight_layout()
Validation accuracy: 0.441 ± 0.254
Because of these practical limits, Leave-P-Out is mostly used in special cases where we need very thorough testing and have a small enough dataset to make it work. It’s especially useful in research projects where getting the most accurate test results matters more than how long the testing takes.
Random Methods
ShuffleSplit Cross-Validation
ShuffleSplit works differently from other validation methods by using completely random splits. Instead of splitting data in an organized way like K-fold, or testing every possible combination like Leave-P-Out, ShuffleSplit creates random training and testing splits each time.
What makes ShuffleSplit different from K-fold is that the splits don’t follow any pattern. In K-fold, each piece of data gets used exactly once for testing. But in ShuffleSplit, a single day of golf weather data might be used for testing several times, or might not be used for testing at all. This randomness gives us a different way to understand how well our model performs.
ShuffleSplit works especially well with large datasets where K-fold might take too long to run. We can choose how many times we want to test, no matter how much data we have. We can also control how big each split should be. This lets us find a good balance between thorough testing and the time it takes to run.
from sklearn.model_selection import ShuffleSplit, train_test_split
# Cross-validation strategy
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=41)
# Calculate cross-validation scores
scores = cross_val_score(dt, X_train, y_train, cv=cv)
print(f"Validation accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
# Plot trees for each split
plt.figure(figsize=(4, 3.5*cv.get_n_splits(X_train)))
for i, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    # Train and visualize the tree for this split
    dt.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    plt.subplot(cv.get_n_splits(X_train), 1, i+1)
    plot_tree(dt, feature_names=X_train.columns, impurity=False, filled=True, rounded=True)
    plt.title(f'Split {i+1} (Validation Accuracy: {scores[i]:.3f})\n'
              f'Train indices: {train_idx}\n'
              f'Validation indices: {val_idx}')
plt.tight_layout()
Validation accuracy: 0.333 ± 0.272
Since ShuffleSplit can create as many random splits as we want, it’s useful when we want to see how our model’s performance changes with different random splits, or when we need more tests to be confident about our results.
Stratified ShuffleSplit
Stratified ShuffleSplit combines random splitting with keeping the right mix of different types of data. Like Stratified K-fold, it makes sure each split has about the same percentage of each type of data as the full dataset.
This method gives us the best of both worlds: the freedom of random splitting and the fairness of keeping data balanced. For example, if our golf dataset has 70% “yes” days and 30% “no” days for playing golf, each random split will try to keep this same 70–30 mix. This is especially useful when we have uneven data, where random splitting might accidentally create test sets that don’t represent our data well.
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split
# Cross-validation strategy
cv = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=41)
# Calculate cross-validation scores
scores = cross_val_score(dt, X_train, y_train, cv=cv)
print(f"Validation accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
# Plot trees for each split
plt.figure(figsize=(4, 3.5*cv.get_n_splits(X_train)))
for i, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    # Train and visualize the tree for this split
    dt.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    plt.subplot(cv.get_n_splits(X_train), 1, i+1)
    plot_tree(dt, feature_names=X_train.columns, impurity=False, filled=True, rounded=True)
    plt.title(f'Split {i+1} (Validation Accuracy: {scores[i]:.3f})\n'
              f'Train indices: {train_idx}\n'
              f'Validation indices: {val_idx}')
plt.tight_layout()
Validation accuracy: 0.556 ± 0.157
However, trying to keep both the random nature of the splits and the right mix of data types can be tricky. The method sometimes has to make small compromises between being perfectly random and keeping perfect proportions. In real use, these small trade-offs rarely cause problems, and having balanced test sets is usually matters more than having perfectly random splits.
🌟 Validation Techniques Summarized & Code Summary
To summarize, model validation methods fall into two main categories: hold-out methods and cross-validation methods:
Hold-out Methods
· Train-Test Split: The simplest approach, dividing data into two parts
· Train-Validation-Test Split: A three-way split for more complex model development
Cross-validation Methods
Cross-validation methods make better use of available data through multiple rounds of validation:
K-Fold Methods
Rather than a single split, these methods divide data into K parts:
· Basic K-Fold: Rotates through different test sets
· Stratified K-Fold: Maintains class balance across splits
· Group K-Fold: Preserves data grouping
· Time Series Split: Respects temporal order
· Repeated K-Fold
· Repeated Stratified K-Fold
Leave-Out Methods
These methods take validation to the extreme:
· Leave-P-Out: Tests on P data points at a time
· Leave-One-Out: Tests on single data points
Random Methods
These introduce controlled randomness:
· ShuffleSplit: Creates random splits repeatedly
· Stratified ShuffleSplit: Random splits with balanced classes
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import (
    # Hold-out methods
    train_test_split,
    # K-Fold methods
    KFold,                    # Basic k-fold
    StratifiedKFold,          # Maintains class balance
    GroupKFold,               # For grouped data
    TimeSeriesSplit,          # Temporal data
    RepeatedKFold,            # Multiple runs
    RepeatedStratifiedKFold,  # Multiple runs with class balance
    # Leave-out methods
    LeaveOneOut,              # Single test point
    LeavePOut,                # P test points
    # Random methods
    ShuffleSplit,             # Random train-test splits
    StratifiedShuffleSplit,   # Random splits with class balance
    cross_val_score           # Calculate validation score
)
# Load the dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast',
                'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy',
                'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast',
                'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,
                    72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,
                    88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,
                 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,
                 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True,
             True, False, True, True, False, False, True, False, True, True, False,
             True, False, False, True, False, False],
    'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes',
             'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes',
             'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# Data preprocessing
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)

# Set the label
X, y = df.drop('Play', axis=1), df['Play']
## Simple Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.5, shuffle=False,
)
## Train-Test-Validation Split
# First split: separate test set
# X_temp, X_test, y_temp, y_test = train_test_split(
# X, y, test_size=0.2, random_state=42
# )
# Second split: separate validation set
# X_train, X_val, y_train, y_val = train_test_split(
# X_temp, y_temp, test_size=0.25, random_state=42
# )
# Create model
dt = DecisionTreeClassifier(random_state=42)
# Select validation method
#cv = KFold(n_splits=3, shuffle=True, random_state=42)
#cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
#cv = GroupKFold(n_splits=3) # Requires groups parameter
#cv = TimeSeriesSplit(n_splits=3)
#cv = RepeatedKFold(n_splits=3, n_repeats=2, random_state=42)
#cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=42)
cv = LeaveOneOut()
#cv = LeavePOut(p=3)
#cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=42)
#cv = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=42)
# Calculate and print scores
scores = cross_val_score(dt, X_train, y_train, cv=cv)
print(f"Validation accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
# Final Fit & Test
dt.fit(X_train, y_train)
test_accuracy = dt.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
Validation accuracy: 0.429 ± 0.495
Test accuracy: 0.714
Comment on the result above: The large gap between validation and test accuracy, along with the very high standard deviation in validation scores, suggests our model’s performance is unstable. This inconsistency likely comes from using LeaveOneOut validation on our small weather dataset — testing on single data points causes performance to vary dramatically. A different validation method using larger validation sets might give us more reliable results.
Choosing the Right Validation Method
Choosing how to validate your model isn’t simple — different situations need different approaches. Understanding which method to use can mean the difference between getting reliable or misleading results. Here are some aspects you should consider when choosing a validation method:
1. Dataset Size
The size of your dataset strongly influences which validation method works best. Let’s look at different sizes:
Large Datasets (More than 100,000 samples)
When you have large datasets, the time testing takes becomes one of the main considerations. Simple hold-out validation (splitting data once into training and testing) often works well because you have enough data for reliable testing. If you need to use cross-validation, using just 3 folds or using ShuffleSplit with fewer rounds can give good results without taking too long to run.
Medium Datasets (1,000 to 100,000 samples)
For medium-sized datasets, regular K-fold cross-validation works best. Using 5 or 10 folds gives a good balance between reliable results and reasonable computing time. This amount of data is usually enough to create representative splits but not so much that testing takes too long.
Small Datasets (Less than 1,000 samples)
Small datasets, like our example of 28 days of golf records, need more careful testing. Leave-One-Out Cross-Validation or Repeated K-fold with more folds can actually work well in this case. Even though these methods take longer to run, they help us get the most reliable results when we don’t have much data to work with.
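To tie these size guidelines together, here is a rough sketch of a helper that picks a default strategy by sample count (the thresholds and fold counts are just the rules of thumb from this section, not a scikit-learn API):

from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit

def default_cv_strategy(n_samples, random_state=42):
    # Rule-of-thumb defaults; adjust to your data and compute budget
    if n_samples > 100_000:
        # Large data: a few random splits keep runtime manageable
        return ShuffleSplit(n_splits=3, test_size=0.1, random_state=random_state)
    elif n_samples >= 1_000:
        # Medium data: standard 5-fold (or 10-fold) cross-validation
        return KFold(n_splits=5, shuffle=True, random_state=random_state)
    else:
        # Small data: squeeze the most out of every sample
        return LeaveOneOut()

cv = default_cv_strategy(len(X_train))  # our 14 training samples -> LeaveOneOut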
2. Computational Resource
When choosing a validation method, we need to think about our computing resources. There’s a three-way balance between dataset size, how complex our model is, and which validation method we use:
Fast Training Models
Simple models like decision trees, logistic regression, and linear SVM can use more thorough validation methods like Leave-One-Out Cross-Validation or Repeated Stratified K-fold because they train quickly. Since each training round takes just seconds or minutes, we can afford to run many validation iterations. Even running LOOCV with its N training rounds might be practical for these algorithms.
Resource-Heavy Models
Deep neural networks, random forests with many trees, or gradient boosting models take much longer to train. When using these models, more intensive validation methods like Repeated K-fold or Leave-P-Out might not be practical. We might need to choose simpler methods like basic K-fold or ShuffleSplit to keep testing time reasonable.
Memory Considerations
Some methods like K-fold need to track multiple splits of data at once. ShuffleSplit can help with memory limitations since it handles one random split at a time. For large datasets with complex models (like deep neural networks that need lots of memory), simpler hold-out methods might be necessary. If we still need thorough validation with limited memory, we could use Time Series Split since it naturally processes data in sequence rather than needing all splits in memory at once.
When resources are limited, using a simpler validation method that we can run properly (like basic K-fold) is better than trying to run a more complex method (like Leave-P-Out) that we can’t complete properly.
3. Class Distribution
Class imbalance strongly affects how we should validate our model. With unbalanced data, stratified validation methods become essential. Methods like Stratified K-fold and Stratified ShuffleSplit make sure each testing split has about the same mix of classes as our full dataset. Without these stratified methods, some test sets might end up with few or no samples of the rarer class, making it impossible to properly test how well our model makes predictions.
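A quick way to check whether stratification is worth worrying about is to look at the class mix before splitting — a small sketch using our golf labels:

# Check the class proportions before choosing a validation strategy
print(y.value_counts(normalize=True))
# If one class dominates (e.g. 90% / 10%), prefer StratifiedKFold or
# StratifiedShuffleSplit so every split keeps roughly the same mix.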
4. Time Series
When working with data that changes over time, we need special validation approaches. Regular random splitting methods don’t work well because time order matters. With time series data, we must use methods like Time Series Split that respect time order.
5. Group Dependencies
Many datasets contain natural groups of related data. These connections in our data need special handling when we validate our models. When data points are related, we need to use methods like Group K-fold to prevent our model from accidentally learning things it shouldn’t.
Practical Guidelines
This flowchart will help you select the most appropriate validation method for your data. The steps below outline a clear process for choosing the best validation approach, assuming you have sufficient computing resources.
Final Remarks
Model validation is essential for building reliable machine learning models. After exploring many validation methods, from simple train-test splits to complex cross-validation approaches, we’ve learned that there is always a suitable validation method for whatever data you have.
While machine learning keeps changing with new methods and tools, these basic rules of validation stay the same. When you understand these principles well, I believe you’ll build models that people can trust and rely on.
Further Reading
For a detailed explanation of the validation methods in scikit-learn, readers can refer to the official documentation, which provides comprehensive information on its usage and parameters.
Technical Environment
This article uses Python 3.7 and scikit-learn 1.5. While the concepts discussed are generally applicable, specific code implementations may vary slightly with different versions.
About the Illustrations
Unless otherwise noted, all images are created by the author, incorporating licensed design elements from Canva Pro.