Least Squares Regression, Explained: A Visual Guide with Code Examples for Beginners
REGRESSION ALGORITHM
Gliding through points to minimize squares
When people start learning about data analysis, they usually begin with linear regression. There’s a good reason for this — it’s one of the most useful and straightforward ways to understand how regression works. The most common approaches to linear regression are called “Least Squares Methods” — these work by finding patterns in data by minimizing the squared differences between predictions and actual values. The most basic type is Ordinary Least Squares (OLS), which finds the best way to draw a straight line through your data points.
Sometimes, though, OLS isn’t enough — especially when your data has many related features that can make the results unstable. That’s where Ridge regression comes in. Ridge regression does the same job as OLS but adds a special control that helps prevent the model from becoming too sensitive to any single feature.
Here, we’ll glide through two key types of Least Squares regression, exploring how these algorithms smoothly slide through your data points and seeing how they differ in theory.
All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.
Definition
Linear Regression is a statistical method that predicts numerical values using a linear equation. It models the relationship between a dependent variable and one or more independent variables by fitting a straight line (or plane, in multiple dimensions) through the data points. The model calculates coefficients for each feature, representing their impact on the outcome. To get a result, you input your data’s feature values into the linear equation to compute the predicted value.
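As a minimal sketch of that last step, assuming a model has already been fitted, the prediction is just each feature value multiplied by its coefficient, summed up, plus the intercept (the numbers below are purely hypothetical):

import numpy as np

# Hypothetical fitted model: an intercept plus one coefficient per feature
intercept = 40.0                      # made-up value
coefficients = np.array([2.5, -0.3])  # made-up values

# A new data point with two feature values
x_new = np.array([3.0, 70.0])

# Prediction = intercept + coefficient * feature, summed over the features
y_pred = intercept + coefficients @ x_new
print(y_pred)  # 40.0 + 2.5*3.0 - 0.3*70.0 = 26.5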
📊 Dataset Used
To illustrate our concepts, we’ll use our standard dataset that predicts the number of golfers visiting on a given day. This dataset includes variables like weather outlook, temperature, humidity, and wind conditions.
Columns: ‘Outlook’ (one-hot encoded to sunny, overcast, rain), ‘Temperature’ (in Fahrenheit), ‘Humidity’ (in %), ‘Wind’ (Yes/No) and ‘Number of Players’ (numerical, target feature)

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Create dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temp.': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humid.': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29, 25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41]
}
df = pd.DataFrame(dataset_dict)

# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='')

# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)

# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
While it is not mandatory, standardizing the numerical features first helps Linear Regression (and especially Ridge Regression, whose penalty is sensitive to feature scale) work effectively.
Standard scaling is applied to ‘Temperature’ and ‘Humidity’, one-hot encoding to ‘Outlook’, and binary conversion to ‘Wind’

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# Create dataset
data = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny',
                'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny',
                'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temperature': [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71, 81, 74, 76, 78, 82,
                    67, 85, 73, 88, 77, 79, 80, 66, 84],
    'Humidity': [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80, 88, 92, 85, 75, 92,
                 90, 85, 88, 65, 70, 60, 95, 70, 78],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False,
             True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29, 25, 51, 41,
                    14, 34, 29, 49, 36, 57, 21, 23, 41]
}

# Process data
df = pd.get_dummies(pd.DataFrame(data), columns=['Outlook'])
df['Wind'] = df['Wind'].astype(int)

# Split data
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Scale numerical features
numerical_cols = ['Temperature', 'Humidity']
ct = ColumnTransformer([('scaler', StandardScaler(), numerical_cols)], remainder='passthrough')
# Transform data
X_train_scaled = pd.DataFrame(
ct.fit_transform(X_train),
columns=numerical_cols + [col for col in X_train.columns if col not in numerical_cols],
index=X_train.index
)
X_test_scaled = pd.DataFrame(
ct.transform(X_test),
columns=X_train_scaled.columns,
index=X_test.index
)
Main Mechanism
Linear Regression predicts numbers by making a straight line (or hyperplane) from the data:
· The model finds the best line by making the gaps between the real values and the line’s predicted values as small as possible. This is called “least squares.”
· Each input gets a number (coefficient/weight) that shows how much it changes the final answer. There’s also a starting number (intercept/bias) that’s used when all inputs are zero.
· To predict a new answer, the model takes each input, multiplies it by its number, adds all these up, and then adds the starting number. This gives you the predicted answer.
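To make these three ideas concrete, here is a small self-contained sketch (on made-up numbers, not our golf dataset) that fits a line with NumPy’s least-squares solver and then uses it for a prediction:

import numpy as np

# Tiny made-up dataset: one input feature and one target
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Add a column of ones so the solver also fits the starting number (intercept)
X = np.column_stack([np.ones_like(x), x])

# Least squares: find the coefficients that make the squared gaps as small as possible
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

# Predict a new answer: multiply the input by its number, then add the intercept
x_new = 6.0
print(intercept + slope * x_new)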
Ordinary Least Squares (OLS) Regression
Let’s start with Ordinary Least Squares (OLS) — the fundamental approach to linear regression. The goal of OLS is to find the best-fitting line through our data points. We do this by measuring how “wrong” our predictions are compared to actual values, and then finding the line that makes these errors as small as possible. When we say “error,” we mean the vertical distance between each point and our line — in other words, how far off our predictions are from reality. Let’s see what happens in the 2D case first.
In 2D Case
In the 2D case, we can imagine the linear regression algorithm like this:
Here’s the explanation of the process above:
1. We start with a training set, where each row has:
· x : our input feature (the numbers 1, 2, 3, 1, 2)
· y : our target values (0, 1, 1, 2, 3)
2. We can plot these points on a scatter plot and we want to find a line y = β₀ + β₁x that best fits these points
3. For any given line (any β₀ and β₁), we can measure how good it is by:
· Calculating the vertical distance (d₁, d₂, d₃, d₄, d₅) from each point to the line
· These distances are |y − (β₀ + β₁x)| for each point
4. Our optimization goal is to find β₀ and β₁ that minimize the sum of squared distances: d₁² + d₂² + d₃² + d₄² + d₅². In vector notation, this is written as ||y − Xβ||², where X = [1 x] contains our input data (with 1’s for the intercept) and β = [β₀ β₁]ᵀ contains our coefficients.
5. The optimal solution has a closed form: β = (XᵀX)⁻¹Xᵀy. Calculating this we get β₀ = -0.196 (intercept), β₁ = 0.761 (slope).
This vector notation makes the formula more compact and shows that we’re really working with matrices and vectors rather than individual points. We will see more details of our calculation next in the multidimensional case.
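As a small sketch of steps 3 and 4, here is how the sum of squared distances could be computed for any candidate line on the toy points listed above (the two candidate lines below are arbitrary trial values, not the optimal solution):

import numpy as np

# Toy training set from the 2D example
x = np.array([1.0, 2.0, 3.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 1.0, 2.0, 3.0])

def sum_of_squared_distances(beta0, beta1):
    # Vertical distance from each point to the line y = β₀ + β₁x
    distances = y - (beta0 + beta1 * x)
    # Least squares scores a candidate line by the sum of the squared distances
    return np.sum(distances ** 2)

# Whichever candidate gives the smaller score fits the points better
print(sum_of_squared_distances(0.0, 1.0))
print(sum_of_squared_distances(1.0, 0.2))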
In Multidimensional Case (📊 Dataset)
Again, the goal of OLS is to find coefficients (β) that minimize the squared differences between our predictions and actual values. Mathematically, we express this as minimizing ||y − Xβ||², where X is our data matrix and y contains our target values.
The training process follows these key steps:
Training Step
1. Prepare our data matrix X. This involves adding a column of ones to account for the bias/intercept term (β₀).
2. Instead of iteratively searching for the best coefficients, we can compute them directly using the normal equation:
β = (XᵀX)⁻¹Xᵀy
where:
· β is the vector of estimated coefficients,
· X is the data matrix (including a column of ones for the intercept),
· y is the vector of target values,
· Xᵀ represents the transpose of matrix X,
· ⁻¹ represents the inverse of the matrix.
Let’s break this down:
a. We multiply Xᵀ (X transpose) by X, giving us a square matrix
b. We compute the inverse of this matrix
c. We compute Xᵀy
d. We multiply (XᵀX)⁻¹ and Xᵀy to get our coefficients
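Assuming the X_train_scaled and y_train objects produced by the preprocessing code earlier, steps a-d can be sketched in plain NumPy. One caveat: because all three one-hot ‘Outlook’ columns are kept alongside the added column of ones, XᵀX is not strictly invertible for this dataset, so the sketch uses the pseudo-inverse (np.linalg.pinv), which still returns a least-squares solution; scikit-learn’s LinearRegression avoids the explicit inverse altogether by using a least-squares solver internally.

import numpy as np

# 1. Prepare the data matrix: add a column of ones for the intercept β₀
X = np.column_stack([np.ones(len(X_train_scaled)), X_train_scaled.to_numpy(dtype=float)])
y = y_train.to_numpy(dtype=float)

XtX = X.T @ X                  # a. Xᵀ times X, a square matrix
XtX_inv = np.linalg.pinv(XtX)  # b. its (pseudo-)inverse
Xty = X.T @ y                  # c. Xᵀy
beta = XtX_inv @ Xty           # d. coefficients: intercept first, then one value per feature

print(beta)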
Test Step
Once we have our coefficients, making predictions is straightforward: we simply multiply our new data point by these coefficients to get our prediction.
In matrix notation, for a new data point x*, the prediction y* is calculated as
y* = x*β = [1, x₁, x₂, …, xₚ] × [β₀, β₁, β₂, …, βₚ]ᵀ,
where β₀ is the intercept and β₁ through βₚ are the coefficients for each feature.
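Continuing the NumPy sketch above, predicting for the whole test set is a single matrix-vector product with the fitted β:

# Add the same leading column of ones to the test data, then multiply by β
X_new = np.column_stack([np.ones(len(X_test_scaled)), X_test_scaled.to_numpy(dtype=float)])
y_pred = X_new @ beta  # one predicted number of players per test row
print(y_pred[:5])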
Evaluation Step
We can do the same process for all data points. For our dataset, here’s the final result with the RMSE as well.
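If you are following along with the manual sketch, the RMSE over the test set can be computed directly from those predictions (it should be close to what the scikit-learn summary code at the end reports for OLS):

# Root mean squared error of the manual OLS predictions
rmse = np.sqrt(np.mean((y_test.to_numpy(dtype=float) - y_pred) ** 2))
print(f"RMSE: {rmse:.4f}")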
Ridge Regression
Now, let’s consider Ridge Regression, which builds upon OLS by addressing some of its limitations. The key insight of Ridge Regression is that sometimes the optimal OLS solution involves very large coefficients, which can lead to overfitting.
Ridge Regression adds a penalty term (λ||β||²) to the objective function. This term discourages large coefficients by adding their squared values to what we’re minimizing. The full objective becomes:
min ||y − Xβ||² + λ||β||²
The λ (lambda) parameter controls how much we penalize large coefficients. When λ = 0, we get OLS; as λ increases, the coefficients shrink toward zero (but never quite reach it).
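A quick way to see this shrinking behaviour is to fit scikit-learn’s Ridge on the scaled training data from earlier for a few increasing (arbitrary) values of λ, which it calls alpha, and watch the overall size of the coefficient vector fall:

import numpy as np
from sklearn.linear_model import Ridge

# As λ (alpha) grows, the learned coefficients are pulled toward zero
for alpha in [0.01, 1, 100, 10_000]:
    model = Ridge(alpha=alpha).fit(X_train_scaled, y_train)
    print(f"alpha={alpha:>7}: coefficient norm = {np.linalg.norm(model.coef_):.3f}")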
Training Step
Just like OLS, we prepare our data matrix X by adding a column of ones to account for the intercept term (β₀). The training process for Ridge then follows a similar pattern to OLS, but with a modification. The closed-form solution becomes:
β = (XᵀX + λI)⁻¹Xᵀy
where:
· I is the identity matrix (with the first diagonal element, corresponding to β₀, sometimes set to 0 to exclude the intercept from regularization in some implementations),
· λ is the regularization value,
· y is the vector of observed target values,
· other symbols remain as defined in the OLS section.
Let’s break this down:
a. We add λI to XᵀX. The value of λ can be any positive number (say 0.1).
b. We compute the inverse of this matrix. The benefits of adding λI to XᵀX before inversion are:
· Makes the matrix invertible, even if XᵀX isn’t (solving a key numerical problem with OLS)
· Shrinks the coefficients proportionally to λ
c. We multiply (XᵀX + λI)⁻¹ and Xᵀy to get our coefficients
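Reusing the X and y arrays built in the OLS sketch above, and an arbitrary λ of 0.1, the Ridge closed form can be sketched as follows. The identity matrix’s first diagonal entry is zeroed so that the intercept β₀ is not penalized, which matches the common convention (scikit-learn’s Ridge likewise leaves the intercept unpenalized):

lam = 0.1  # λ, the regularization strength (an arbitrary choice here)

# a. Add λI to XᵀX, zeroing the first diagonal entry so β₀ is not penalized
I = np.eye(X.shape[1])
I[0, 0] = 0.0
ridge_matrix = X.T @ X + lam * I

# b. Unlike XᵀX alone, this matrix is invertible for our data
# c. Multiply its inverse by Xᵀy to get the Ridge coefficients
beta_ridge = np.linalg.inv(ridge_matrix) @ (X.T @ y)

print(beta_ridge)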
Test Step
The prediction process remains the same as OLS — multiply new data points by the coefficients. The difference lies in the coefficients themselves, which are typically smaller and more stable than their OLS counterparts.
Evaluation Step
We can do the same process for all data points. For our dataset, here’s the final result with the RMSE as well.
Final Remarks: Choosing Between OLS and Ridge
The choice between OLS and Ridge often depends on your data:
· Use OLS when you have well-behaved data with little multicollinearity and enough samples (relative to features).
· Use Ridge when you have:
– Many features (relative to samples)
– Multicollinearity in your features
– Signs of overfitting with OLS
With Ridge, you’ll need to choose λ. Start with a range of values (often logarithmically spaced) and choose the one that gives the best validation performance.
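A simple version of that search, reusing the scaled train/test split from earlier as a stand-in for a proper validation set (in practice you would prefer cross-validation, for example via scikit-learn’s RidgeCV):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import root_mean_squared_error

# Try logarithmically spaced λ (alpha) values and keep the one with the lowest RMSE
best_alpha, best_rmse = None, np.inf
for alpha in np.logspace(-3, 3, 13):
    model = Ridge(alpha=alpha).fit(X_train_scaled, y_train)
    rmse = root_mean_squared_error(y_test, model.predict(X_test_scaled))
    if rmse < best_rmse:
        best_alpha, best_rmse = alpha, rmse

print(f"best alpha: {best_alpha:.3f}, RMSE: {best_rmse:.4f}")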
Apparently, the default value λ = 1 gives the best RMSE for our dataset.
🌟 OLS and Ridge Regression Code Summarized
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error
from sklearn.linear_model import Ridge
# Create dataset
data = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny',
                'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny',
                'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temperature': [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71, 81, 74, 76, 78, 82,
                    67, 85, 73, 88, 77, 79, 80, 66, 84],
    'Humidity': [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80, 88, 92, 85, 75, 92,
                 90, 85, 88, 65, 70, 60, 95, 70, 78],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False,
             True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29, 25, 51, 41,
                    14, 34, 29, 49, 36, 57, 21, 23, 41]
}
# Process data
df = pd.get_dummies(pd.DataFrame(data), columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df = df[['sunny', 'overcast', 'rain', 'Temperature', 'Humidity', 'Wind', 'Num_Players']]
# Split data
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
# Scale numerical features
numerical_cols = ['Temperature', 'Humidity']
ct = ColumnTransformer([('scaler', StandardScaler(), numerical_cols)], remainder='passthrough')
# Transform data
X_train_scaled = pd.DataFrame(
ct.fit_transform(X_train),
columns=numerical_cols + [col for col in X_train.columns if col not in numerical_cols],
index=X_train.index
)
X_test_scaled = pd.DataFrame(
ct.transform(X_test),
columns=X_train_scaled.columns,
index=X_test.index
)
# Initialize and train the model
#model = LinearRegression() # Option 1: OLS Regression
model = Ridge(alpha=0.1) # Option 2: Ridge Regression (alpha is the regularization strength, equivalent to λ)
# Fit the model
model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = model.predict(X_test_scaled)
# Calculate and print RMSE
rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE: {rmse:.4f}")

# Additional information about the model
print("\nModel Coefficients:")
print(f"Intercept : {model.intercept_:.2f}")
for feature, coef in zip(X_train_scaled.columns, model.coef_):
    print(f"{feature:13}: {coef:.2f}")
Further Reading
For a detailed explanation of OLS Linear Regression and Ridge Regression, and their implementation in scikit-learn, readers can refer to the official documentation, which provides comprehensive information on their usage and parameters.
Technical Environment
This article uses Python 3.7 and scikit-learn 1.5. While the concepts discussed are generally applicable, specific code implementations may vary slightly with different versions.
About the Illustrations
Unless otherwise noted, all images are created by the author, incorporating licensed design elements from Canva Pro.