# The Power of Optimization in Designing Experiments Involving Small Samples

#### A step-by-step guide to designing more precise experiments using optimization in Python

Experimentation is the foundation for testing hypotheses with scientific rigor.** **In medicine, it is used to assess the effect of new treatments on patients, while in the digital world, tech giants like Amazon, Netflix, and Uber run thousands of experiments each year to optimize and improve their platforms.

For large-scale experiments, random assignment is commonly used and its considered the “Gold Standard”. With a substantial amount of data, randomness tends to produce comparable groups, where important pre-experiment factors are balanced and the exchangeability assumption holds.

However, when the sample for an experiment is very small, random assignment often fails to create statistically equivalent groups. So, **how can we split units efficiently between treatment and control groups?**

#### What you will learn:

In this post, I’ll explain an optimization-based approach to constructing equivalent groups for an experiment which was proposed by Bertsimas et al. in this article.

With a simple example in Python we’ll learn how to:

Design an optimization-based experiment.Perform inference using bootstrap techniques.Implement the code in Python for your own experiments

### The Power of Optimization

Before diving into our example and Python code, let’s first discuss some insights into the benefits of using an optimization-based approach to design experiments.

Optimization makes statistics more precise, allowing for a more powerful inference.

The optimization-based approach **matches the experimental groups to minimize the en-masse discrepancies in means and variances**. This makes the statistics much more precise, concentrating them tightly around their nominal values while still being unbiased estimates. This increased precision allows for more powerful inference (using the bootstrap algorithm, which we’ll cover later).

This allows researchers to draw statistically valid conclusions with less data, reducing experimental costs — an important benefit in disciplines like oncology research, where testing chemotherapy agents in mouse cancer models is both laborious and expensive. Moreover, compared to other methods for small sample sizes, optimization has been shown to outperform them, as we’ll see later with simulated experiments.

Now, let’s dive into an example to see how to apply this approach using Python!

### Implementing the Algorithm in Python

#### Case Study: A Drug Experiment with 20 Mice

Let’s consider an experiment with 20 mice, where we are interested in studying the effect of a drug on tumor growth with varying initial tumor sizes. Suppose that the initial tumor sizes are normally distributed with a mean of 200 mg and a standard deviation of 300 mg (truncated to ensure non-negative values). We can generate the population of mice with the following Python code:

import numpy as np

import pandas as pd

import scipy.stats as stats

def create_experiment_data(n_mice, mu, sigma, seed):

lower, upper = 0, np.inf

initial_weights = stats.truncnorm(

(lower – mu) / sigma,

(upper – mu) / sigma,

loc=mu,

scale=sigma

).rvs(size=n_mice, random_state=seed)

return pd.DataFrame({

‘mice’: list(range(1, n_mice+1)),

‘initial_weight’: initial_weights,

})

tumor_data = create_experiment_data(n_mice=20, mu=200, sigma=300, seed=123)

print(tumor_data.head())> mice initial_weight

0 1 424.736888

1 2 174.691035

2 3 141.016478

3 4 327.518749

4 5 442.239789

Now, we need to divide the 20 rodents into two groups of 10 each — one group will receive the treatment, while the other will receive a placebo. We’ll accomplish this using optimization.

We will also assume that tumor growth is observed over a one-day period and follows the Gompertz model (detailed in *Mathematical Models in Cancer Research*). The treatment is assumed to have a deterministic effect of reducing tumor size by 250 mg.

#### Experimental Design

We aim to formulate the creation of equivalent groups as an optimization problem, where the objective is to minimize the discrepancy in both the mean and variance** **of the initial tumor weight.

To implement this, we need to follow three steps:

#### Paso 1: Normalize the Initial Tumor Weight

First, the entire sample must be pre-processed and the metric should be normalized so that it has a mean of zero and unit variance:

mean = tumor_data[‘initial_weight’].mean()

std = tumor_data[‘initial_weight’].std()

tumor_data[‘norm_initial_weight’] = (tumor_data[‘initial_weight’] – mean) / std

#### Paso 2: Create the groups using optimization

Next, we need to implement the general optimization model that construct *m-groups *with *k-subjects* each, minimizing the *maximum discrepancy between any two groups* (a full description of the model variables can be found in the article) and passing it the normalized metric:

Optimization model that creates m-groups of k-units each (from Bertsimas et al. 2015).

The mathematical model can be implemented in Python using the ortools library with the SCIP solver, as follows:

from ortools.linear_solver import pywraplp

from typing import Union

class SingleFeatureOptimizationModel:

“””

Implements the discrete optimization model proposed by Bertsimas et al. (2015) in “The Power of

Optimization Over Randomization in Designing Experiments Involving Small Samples”.

See: https://doi.org/10.1287/opre.2015.1361.

“””

def __init__(self, data: pd.DataFrame, n_groups: int, units_per_group: int, metric: str, unit: str):

self.data = data.reset_index(drop=True)

self.parameters = {

‘rho’: 0.5,

‘groups’: range(n_groups),

‘units’: range(len(self.data)),

‘unit_list’: self.data[unit].tolist(),

‘metric_list’: self.data[metric].tolist(),

‘units_per_group’: units_per_group,

}

self._create_solver()

self._add_vars()

self._add_constraints()

self._set_objective()

def _create_solver(self):

self.model = pywraplp.Solver.CreateSolver(“SCIP”)

if self.model is None:

raise Exception(“Failed to create SCIP solver”)

def _add_vars(self):

self.d = self.model.NumVar(0, self.model.infinity(), “d”)

self.x = {}

for i in self.parameters[‘units’]:

for p in self.parameters[‘groups’]:

self.x[i, p] = self.model.IntVar(0, 1, “”)

def _set_objective(self):

self.model.Minimize(self.d)

def _add_constraints(self):

self._add_constraints_d_bounding()

self._add_constraint_group_size()

self._add_constraint_all_units_assigned()

def _add_constraints_d_bounding(self):

rho = self.parameters[‘rho’]

for p in self.parameters[‘groups’]:

for q in self.parameters[‘groups’]:

if p < q:

self.model.Add(self.d >= self._mu(p) – self._mu(q) + rho * self._var(p) – rho * self._var(q))

self.model.Add(self.d >= self._mu(p) – self._mu(q) + rho * self._var(q) – rho * self._var(p))

self.model.Add(self.d >= self._mu(q) – self._mu(p) + rho * self._var(p) – rho * self._var(q))

self.model.Add(self.d >= self._mu(q) – self._mu(p) + rho * self._var(q) – rho * self._var(p))

def _add_constraint_group_size(self):

for p in self.parameters[‘groups’]:

self.model.Add(

self.model.Sum([

self.x[i,p] for i in self.parameters[‘units’]

]) == self.parameters[‘units_per_group’]

)

def _add_constraint_all_units_assigned(self):

for i in self.parameters[‘units’]:

self.model.Add(

self.model.Sum([

self.x[i,p] for p in self.parameters[‘groups’]

]) == 1

)

def _add_contraint_symmetry(self):

for i in self.parameters[‘units’]:

for p in self.parameters[‘units’]:

if i < p:

self.model.Add(

self.x[i,p] == 0

)

def _mu(self, p):

mu = self.model.Sum([

(self.x[i,p] * self.parameters[‘metric_list’][i]) / self.parameters[‘units_per_group’]

for i in self.parameters[‘units’]

])

return mu

def _var(self, p):

var = self.model.Sum([

(self.x[i,p]*(self.parameters[‘metric_list’][i])**2) / self.parameters[‘units_per_group’]

for i in self.parameters[‘units’]

])

return var

def optimize(

self,

max_run_time: int = 60,

max_solution_gap: float = 0.05,

max_solutions: Union[int, None] = None,

num_threads: int = -1,

verbose: bool = False

):

“””

Runs the optimization model.

Args:

max_run_time: int

Maximum run time in minutes.

max_solution_gap: float

Maximum gap with the LP relaxation solution.

max_solutions: int

Maximum number of solutions until stop.

num_threads: int

Number of threads to use in solver.

verbose: bool

Whether to set the solver output.

Returns: str

The status of the solution.

“””

self.model.SetTimeLimit(max_run_time * 60 * 1000)

self.model.SetNumThreads(num_threads)

if verbose:

self.model.EnableOutput()

self.model.SetSolverSpecificParametersAsString(f”limits/gap = {max_solution_gap}”)

self.model.SetSolverSpecificParametersAsString(f”limits/time = {max_run_time * 60}”)

if max_solutions:

self.model.SetSolverSpecificParametersAsString(f”limits/solutions = {max_solutions}”)

status = self.model.Solve()

if verbose:

if status == pywraplp.Solver.OPTIMAL:

print(“Optimal Solution Found.”)

elif status == pywraplp.Solver.FEASIBLE:

print(“Feasible Solution Found.”)

else:

print(“Problem infeasible or unbounded.”)

self._extract_solution()

return status

def _extract_solution(self):

tol = 0.01

self.assignment = {}

for i in self.parameters[‘units’]:

for p in self.parameters[‘groups’]:

if self.x[i,p].solution_value() > tol:

self.assignment.setdefault(p, []).append(self.parameters[‘unit_list’][i])

def get_groups_list(self):

return list(self.assignment.values())model = SingleFeatureOptimizationModel(

data = tumor_data,

n_groups = 2,

units_per_group = 10,

unit = ‘mice’,

metric = ‘norm_initial_weight’,

)

status = model.optimize()

optimized_groups = model.get_groups_list()

print(f”The optimized mice groups are: {optimized_groups}”)> The optimized mice groups are: [

[1, 4, 5, 6, 8, 12, 14, 16, 17, 18],

[2, 3, 7, 9, 10, 11, 13, 15, 19, 20]

]*Note:** The parameter *rho* controls the trade-off between minimizing discrepancies in the first moment and second moment and is chosen by the researcher. In our example, we have considered *rho* equals 0.5.*

#### Paso 3: Randomize which group receives which treatment

Lastly, we randomly determine which experimental mice group will receive the drug, and which will receive the placebo:

import random

def random_shuffle_groups(group_list, seed):

random.seed(seed)

random.shuffle(group_list)

return group_list

randomized_groups = random_shuffle_groups(optimized_groups, seed=123)

treatment_labels = [“Placebo”, “Treatment”]

treatment_dict = {treatment_labels[i]: randomized_groups[i] for i in range(len(randomized_groups))}

print(f”The treatment assignment is: {treatment_dict}”)> The treatment assignment is: {

‘Placebo’: [2, 3, 7, 9, 10, 11, 13, 15, 19, 20],

‘Treatment’: [1, 4, 5, 6, 8, 12, 14, 16, 17, 18]

}

Let’s see the quality of the result by analyzing the mean and variance of the initial tumor weights in both groups:

mice_assignment_dict = {inx: gp for gp, indices in treatment_dict.items() for inx in indices}

tumor_data[‘treatment’] = tumor_data[‘mice’].map(mice_assignment_dict)

print(tumor_data.groupby(‘treatment’).agg(

avg_initial_weight = (‘initial_weight’, ‘mean’),

std_initial_weight = (‘initial_weight’, ‘std’),

).round(2))> avg_initial_weight std_initial_weight

treatment

Placebo 302.79 202.54

Treatment 303.61 162.12

**This is the group division with minimal discrepancy **in the mean and variance of pre-experimental tumor weights!** **Now let’s conduct the experiment and analyze the results.

### Inference with bootstrap

**Simulating tumor growth**

Given the established treatment assignment, tumor growth is simulated over one day using the Gompertz model, assuming an effect of -250 mg for the treated group:

import numpy as np

from scipy.integrate import odeint

# Gompertz model parameters

a = 1

b = 5

# Critical weight

wc = 400

def gompex_growth(w, t):

“””

Gomp-ex differential equation model based on the initial weight.

“””

growth_rate = a + max(0, b * np.log(wc / w))

return w * growth_rate

def simulate_growth(w_initial, t_span):

“””

Simulate the tumor growth using the Gomp-ex model.

“””

return odeint(gompex_growth, w_initial, t_span).flatten()

def simulate_tumor_growth(data: pd.DataFrame, initial_weight: str, treatment_col: str, treatment_value: str, treatment_effect: float):

“””

Simulate the tumor growth experiment and return the dataset.

“””

t_span = np.linspace(0, 1, 2)

final_weights = np.array([simulate_growth(w, t_span)[-1] for w in data[initial_weight]])

experiment_data = data.copy()

mask_treatment = data[treatment_col] == treatment_value

experiment_data[‘final_weight’] = np.where(mask_treatment, final_weights + treatment_effect, final_weights)

return experiment_data.round(2)

experiment_data = simulate_tumor_growth(

data = tumor_data,

initial_weight = ‘initial_weight’,

treatment_col = ‘treatment’,

treatment_value = ‘Treatment’,

treatment_effect = -250

)

print(experiment_data.head())> mice initial_weight norm_initial_weight treatment final_weight

0 1 424.74 0.68 Treatment 904.55

1 2 174.69 -0.72 Placebo 783.65

2 3 141.02 -0.91 Placebo 754.56

3 4 327.52 0.14 Treatment 696.60

4 5 442.24 0.78 Treatment 952.13mask_tr = experiment_data.group == ‘Treatment’

mean_tr = experiment_data[mask_tr][‘final_weight’].mean()

mean_co = experiment_data[~mask_tr][‘final_weight’].mean()

print(f”Mean difference between treatment and control: {round(mean_tr – mean_co)} mg”)> Mean difference between treatment and control: -260 mg

Now that we have the final tumor weights, we observe that the average final tumor weight in the treatment group is 260 mg lower than in the control group. However, to determine if this difference is statistically significant, we need to apply the following bootstrap mechanism to calculate the p-value.

**Bootstrap inference for optimization-based design**

“In an optimization-based design, statistics like the average difference between groups become more precise but no longer follow the usual distributions. Therefore, a bootstrap inference method should be used to draw valid conclusions.”

The boostrap inference method proposed by Bertsimas et al. (2015) involves using sampling with replacement to construct the baseline distribution of our estimator. In each iteration, the group division is performed using optimization, and finally the the *p-value* is derived as follows:

from tqdm import tqdm

from typing import Any, List

def inference(data: pd.DataFrame, unit: str, outcome: str, treatment: str, treatment_value: Any = 1, n_bootstrap: int = 1000) -> pd.DataFrame:

“””

Estimates the p-value using bootstrap for two groups.

Parameters

———–

data (pd.DataFrame): The experimental dataset with the observed outcome.

unit (str): The experimental unit column.

outcome (str): The outcome metric column.

treatment (str): The treatment column.

treatment_value (Any): The value referencing the treatment (other will be considered as control).

n_bootstrap (int): The number of random draws with replacement to use.

Returns

———–

pd.DataFrame: The dataset with the results.

Raise

————

ValueException: if there are more than two treatment values.

“””

responses = data[outcome].values

mask_tr = (data[treatment] == treatment_value).values

delta_obs = _compute_delta(responses[mask_tr], responses[~mask_tr])

deltas_B = _run_bootstrap(data, unit, outcome, n_bootstrap)

pvalue = _compute_pvalue(delta_obs, deltas_B)

output_data = pd.DataFrame({

‘delta’: [delta_obs],

‘pvalue’: [pvalue],

‘n_bootstrap’: [n_bootstrap],

‘avg_delta_bootstrap’: [np.round(np.mean(deltas_B), 2)],

‘std_delta_bootstrap’: [np.round(np.std(deltas_B), 2)]

})

return output_data

def _run_bootstrap(data: pd.DataFrame, unit: str, outcome: str, B: int = 1000) -> List[float]:

“””

Runs the bootstrap method and returns the bootstrapped deltas.

Parameters

———–

data (pd.DataFrame): The dataframe from which sample with replacement.

outcome (str): The outcome metric observed in the experiment.

B (int): The number of random draws with replacement to perfrom.

Returns

———–

List[float]: The list of bootstrap deltas.

“””

deltas_bootstrap = []

for i in tqdm(range(B), desc=”Bootstrap Progress”):

sample_b = _random_draw_with_replacement(data, unit)

responses_b, mask_tr_b = _optimal_treatment_control_split(sample_b, unit, outcome, seed=i)

delta_b = _compute_delta(responses_b[mask_tr_b], responses_b[~mask_tr_b])

deltas_bootstrap.append(delta_b)

return deltas_bootstrap

def _compute_delta(response_tr, responses_co):

delta = np.mean(response_tr) – np.mean(responses_co)

return delta

def _compute_pvalue(obs_delta, bootstrap_deltas):

count_extreme = sum(1 for delta_b in bootstrap_deltas if abs(delta_b) >= abs(obs_delta))

p_value = (1 + count_extreme) / (1 + len(bootstrap_deltas))

return p_value

def _random_draw_with_replacement(data: pd.DataFrame, unit: str):

sample = data.sample(frac=1, replace=True)

sample[unit] = range(1, len(sample) + 1)

return sample

def _optimal_treatment_control_split(data: pd.DataFrame, unit: str, outcome: str, seed: int):

result = _sample(

data = data,

unit = unit,

normalized_feature = ‘norm_initial_weight’,

seed = seed

)

treatment_dict = {inx: gp for gp, indices in result.items() for inx in indices}

treatment = data[unit].map(treatment_dict)

mask_tr = (treatment == ‘Treatment’).values

responses = data[outcome].values

return responses, mask_tr

def _sample(data: pd.DataFrame, unit: str, normalized_feature: str, seed: int):

model = SingleFeatureOptimizationModel(

data,

n_groups = 2,

units_per_group = 10,

unit = unit,

metric = normalized_feature,

)

status = model.optimize()

optimized_groups = model.get_groups_list()

randomized_groups = random_shuffle_groups(optimized_groups, seed=seed)

treatment_labels = [“Placebo”, “Treatment”]

return {treatment_labels[i]: randomized_groups[i] for i in range(len(randomized_groups))}infer_result = inference(

data = experiment_data,

unit = ‘mice’,

outcome = ‘final_weight’,

treatment = ‘group’,

treatment_value = ‘Treatment’,

n_bootstrap = 1000

)

print(infer_result)> delta pvalue n_bootstrap avg_delta_bootstrap std_delta_bootstrap

0 -260.183 0.001998 1000 2.02 112.61

The observed difference of -260 mg between the groups is significant at the 5% significance level (the p-value less than 0.05). Therefore, we reject the null hypothesis of equal means and conclude that the treatment had a statistically significant effect.

**Results over 1000 mice experiments**

We can simulate the experiment multiple times, generating populations of mice with different initial tumor weights, drawn from the same normal distribution with a mean of 200 mg and a standard deviation of 300 mg.

This allows us to compare the optimization-based design with other experimental designs. In the following graph, I compare the optimization approach with simple random assignment and stratified random assignment (where the strata were created using a k-means algorithm based on initial tumor weight):

Results of 1000 simulated experiments to detect an effect of -250 mg (Image by author).

In their article, the authors also compare the optimization-based approach with re-randomization and pairwise matching across different effect sizes and group sizes. I highly recommend reading the full article if you’re interested in exploring the details further!

Congratulations! You’ve reached the end 🎉 If you found this article interesting, consider following me. I often share ideas about optimization and causal inference.

The Power of Optimization in Designing Experiments Involving Small Samples was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.