Five Reasons You Cannot Afford Not Knowing Probability Proportional to Size (PPS) Sampling
Data Science
Simple Random Sampling (SRS) works, but if you do not know Probability Proportional to Size Sampling (PPS), you are risking yourself some critical statistical mistakes. Learn why, when, and how you can use PPS Sampling here!
Photo by Justin Morgan on Unsplash
Rahul decides to measure the “pulse” of customers buying from his online store. He wanted to know how they are feeling, what is going well, and what can be improved for user experience. Because he has learnt about mathematics and he knows the numbers game, he decides to have a survey with 200 of his 2500 customers. Rahul uses Simple Random Sampling and gets 200 unique customer IDs. He sends them an online survey, and receives the results. According to the survey, the biggest impediment with the customers was lack of payment options while checking out. Rahul contacts a few vendors, and invests in rolling out a few more payment options. Unfortunately, the results after six months showed that there was no significant increase in the revenue. His analysis fails, and he wonders if the resources were spent in the right place.
Rahul ignored the biggest truth of all. All the customers are not homogenous. Some spend more, some spend less, and some spend a lot. Don’t be like Rahul. Be like Sheila, and learn how you can use PPS Sampling — an approach that ensures that your most important (profitable) customers never get overlooked — for reasonable and robust statistical analysis.
What is Sampling?
Before I discuss PPS Sampling, I will briefly mention what sampling is. Sampling is a statistical technique which allows us to take a portion of our population, and use this portion of our population to measure some characteristics of the population. For example, taking a sample of blood to measure if we have an infectious disease, taking a sample of rice pudding to check if sugar is enough, and taking a sample of customers to measure the general pulse of customers. Because we cannot afford measuring each and every single unit of the entire population, it is best to take a sample and then infer the population characteristics. This suffices for a definition here. If you need more information about sampling, the Internet has a lot of resources.
What is PPS Sampling?
Probability Proportional to Size (PPS) Sampling is a sampling technique, where the probability of selection of a unit in the sample is dependent upon the size of a defined variable or an auxiliary variable.
WHAT???
Let me explain with the help of an example. Suppose you have an online store, and there are 1000 people who are your customers. Some customers spend a lot of money and bring a lot of revenue to your organization. These are very important customers. You need to ensure that your organization serve the interests of these customers in the best way possible.
If you want to understand the mood of these customers, you would prefer a situation where your sample has a higher representation of these customers. This is exactly what PPS allows you to do. If you use PPS Sampling, the probability of selecting the highest revenue generating customers is also high. This makes sense. The revenue in this case is the auxiliary or dependency variable.
PPS Sampling vs SRS Sampling
Simple Random Sampling is great. No denial of that fact, but it’s not the only tool that you have in your arsenal. SRS works best for the situations where your population is homogenous. Unfortunately for many practical business applications, the audience or population is not homogenous. If you do an analysis with wrong assumption, you will get the wrong inferences. SRS Sampling gives the same probability of selection to each unit of the population which is different from PPS Sampling.
Why should I use PPS Sampling?
As the title of this article says, you cannot afford not knowing PPS Sampling. Here are five reasons why.
Better Representativeness — By prioritizing the units that have a higher impact on your variable of interest (revenue), you are ensuring that the sample has a better representativeness. This contrasts with SRS which assumes that a customer spending 100 USD a month is equal to the customer spending 1000 USD a month. Nein, no, nahin, that is not the case.Focus on High-Impact Units — According to the Pareto principle, 80% of your revenue is generated by 20% of the customers. You need to ensure you do not mess up with these 20% of the customers. By ensuring a sample having a higher say for these 20% customers, you will avoid yourself and them any unseen surprises.Resource Efficiency — There is a thumb’s rule in statistics which says that on an average if you have a sample of 30, you can get close to the estimated population parameters. Note that this is only a thumb rule. PPS Sampling allows you to use the resources you have in designing, distributing, and analyzing interventions are used judiciously.Improved Accuracy — Because we are putting more weight on the units that have a larger impact on our variable of interest, we are more accurate with our analysis. This may not be possible with just SRS. The sample estimates which you get from PPS Sampling are weighted for the units that have a higher impact. In simple words, you are working for those who pay the most.Better Decision-Making — When you use PPS sampling, you’re making decisions based on data that actually matters. If you only sample customers randomly, you might end up with feedback or insights from people whose opinions have little influence on your revenue. With PPS, you’re zeroing in on the important customers. It’s like asking the right people the right questions instead of just anyone in the crowd.
PPS Implementation in Python
Slightly more than six years ago, I wrote this article on Medium which is one of my most-read articles, and is shown on the first page when you search for Probability Proportional to Size Sampling (PPS Sampling, from now onwards). The article shows how one can use PPS Sampling for representative sampling using Python. A lot of water has flown under the bridge since then, and I now I have much more experience in causal inference, and my Python skills have improved considerably too. The code linked above used systematic PPS Sampling, whereas the new code uses random PPS Sampling.
Here is the new code that can do the same in a more efficient way.
import numpy as np
import pandas as pd
# Simulate customer data
np.random.seed(42) # For reproducibility
num_customers = 1000
customers = [f”C{i}” for i in range(1, num_customers + 1)]
# Simulate revenue data (e.g., revenue between $100 and $10,000)
revenues = np.random.randint(100, 10001, size=num_customers)
customer_data = pd.DataFrame({
“Customer”: customers,
“Revenue”: revenues
})
# Calculate selection probabilities proportional to revenue
total_revenue = customer_data[“Revenue”].sum()
customer_data[“Selection_Prob”] = customer_data[“Revenue”] / total_revenue
# Perform PPS Sampling
sample_size = 60 # decide for your analysis
# the actual PPS algorithm
sample_indices = np.random.choice(
customer_data.index,
size=sample_size,
replace=False, # No replacement, we are not replacing the units
p=customer_data[“Selection_Prob”]
)
# Extract sampled customers
sampled_customers = customer_data.iloc[sample_indices]
# Display results
print(“Sampled Customers:”)
print(sampled_customers)
Challenges with PPS Sampling
I am sure if you have read until here, you may be wondering that how is it possible that there will be no cons of PPS Sampling. Well, it has some. Here are they.
PPS Sampling is complex to understand so it may not always have a buy-in from the management of an organization. In that case, it is the data scientist’s job to ensure that the benefits are explained in the right manner.PPS Sampling requires that there is a dependency variable. For example, in our case we chose revenue as a variable upon which we select our units. If you are in agriculture industry, this could be the land size for measuring yield of a cropping season.PPS Sampling is perceived to be biased against the units having a lower impact. Well, it is not biased and the smaller units also have a chance of getting selected, but the probability is lower for them.
Conclusion
In this article, I explained to you what PPS Sampling is, why it’s better and more resource-efficient than SRS Sampling, and how you can implement it using Python. I am curious to hear more examples from your work to see how you implement PPS at your work.
Resources:
PPS Sampling Wiki https://en.wikipedia.org/wiki/Probability-proportional-to-size_samplingPPS Sampling in Python https://chaayushmalik.medium.com/pps-sampling-in-python-b5d5d4a8bdf7
Five Reasons You Cannot Afford Not Knowing Probability Proportional to Size (PPS) Sampling was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.