Learnings from a Machine Learning Engineer — Part 2: The Data Sets

In Part 1, we discussed the importance of collecting good image data and assigning proper labels for your Image Classification project to be successful. We also talked about classes and sub-classes of your data. These may seem like pretty straightforward concepts, but it’s important to have a solid understanding of them going forward. So, if you haven’t read it yet, please check it out.

Now we will discuss how to build the various data sets and the techniques that have worked well for my application. Then in the next part, we will dive into the evaluation of your models, beyond simple accuracy.

I will again use the example zoo animals image classification app.

Data Sets

As machine learning engineers, we are all familiar with the train-validation-test sets. But when we include the concept of sub-classes discussed in Part 1, and incorporate the concepts discussed below (setting a minimum and maximum image count per class, plus adding staged and synthetic data to the mix), the process gets a bit more complicated. I had to create a custom script to handle these options.

I will walk you through these concepts before we split the data for training:

Image cutoffs — Too few images and your model performance will suffer. Too many and you spend more time training than it’s worth.

Confidence thresholds — Your model indicates how confident it is in the predictions. Let’s use that to decide when to present results to the user.

Benchmark sets — Real-world data is messy and the benchmark sets should reflect that. These need to stretch the model to the limit and help us decide when it is ready for production.

Staged and synthetic data — Real-world data is king, but sometimes you need to produce your own, or even generate data, to get off the ground. Be careful it doesn’t hurt performance.

Duplicate images — Repeat data can skew your results and give you a false sense of performance. Make sure your data is diverse.

Building the data sets — Combine sub-classes, apply cutoffs, and create your train-validation-test sets. Now we are ready to get the show started.

Image cutoffs

In my experience, using a minimum of 40 images per class provides decent performance. Since I like to use 10% each for the test set and validation set, that means at least 4 images each are available to check what the model learned from the training set, which feels just barely adequate. With fewer than 40 images per class, I notice my model evaluation tends to suffer.

On the other end, I set a maximum of about 125 images per class. I have found that the performance gains tend to plateau beyond this, so having more data will slow down the training run with little to show for it. Having more than the maximum is fine, and this “overflow” can be added to the test set, so it doesn’t go to waste.

There are times when I will drop the minimum cutoff to, say, 35, with no intention of moving the trained model to production. Instead, the purpose is to leverage this throw-away model to find more images from my unlabelled set. This is a technique I will go into in more detail in Part 3.
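Here is a minimal sketch of how these per-class cutoffs could be applied in a script. The constant values come from the numbers above, but the function and the class-to-paths dictionary structure are just illustrative, not my actual implementation:

```python
# Sketch: apply minimum/maximum image cutoffs per class.
# MIN_IMAGES and MAX_IMAGES match the values discussed above;
# the data structure (class name -> list of image paths) is an assumption.
import random

MIN_IMAGES = 40   # classes below this are skipped for this training run
MAX_IMAGES = 125  # extra images become "overflow" for the test set

def apply_cutoffs(images_by_class: dict[str, list[str]], seed: int = 42):
    """Return (kept, overflow) image paths per class after applying the cutoffs."""
    rng = random.Random(seed)
    kept, overflow = {}, {}
    for cls, paths in images_by_class.items():
        if len(paths) < MIN_IMAGES:
            print(f"Skipping '{cls}': only {len(paths)} images (< {MIN_IMAGES})")
            continue
        shuffled = list(paths)
        rng.shuffle(shuffled)
        kept[cls] = shuffled[:MAX_IMAGES]
        overflow[cls] = shuffled[MAX_IMAGES:]  # may be empty
    return kept, overflow
```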

Confidence threshold

You are likely familiar with the softmax score. As a reminder, softmax is the probability assigned to each label. I like to think of it as a confidence score, and we are interested in the class that receives the highest confidence. Softmax is a value between zero and one, but I find it easier to interpret confidence scores between zero and 100, like a percentage.

In order to decide if the model is confident enough with its prediction, I have chosen a threshold of 95. I use this threshold when determining if I want to present results to the user.

Scores above the threshold have a better chance of being right, so I can confidently provide the results. Scores below the threshold may not be right — in fact, they could be “out-of-scope”, meaning the image is something the model doesn’t know how to identify. So, instead of taking the risk of presenting incorrect results, I prompt the user to try again and offer suggestions on how to take a “good” picture.

Admittedly, this is a somewhat arbitrary cutoff, and you should decide what is appropriate for your use-case. In fact, this score could probably be adjusted for each trained model, but that would make it harder to compare performance across models.
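To make the threshold check concrete, here is a small sketch that converts a model’s raw outputs into a 0–100 confidence score and decides whether to show the result. The function name, return structure, and the use of raw logits are assumptions for illustration:

```python
# Sketch: gate predictions on a 0-100 confidence score.
import numpy as np

CONFIDENCE_THRESHOLD = 95  # on a 0-100 scale, as discussed above

def predict_with_threshold(logits: np.ndarray, class_names: list[str]) -> dict:
    """Convert logits to a confidence score and decide whether to present the result."""
    exp = np.exp(logits - logits.max())     # numerically stable softmax
    probs = exp / exp.sum()
    best = int(probs.argmax())
    confidence = float(probs[best] * 100)   # express as a percentage
    if confidence >= CONFIDENCE_THRESHOLD:
        return {"label": class_names[best], "confidence": confidence, "show": True}
    # Below threshold: could be a hard image or something out-of-scope,
    # so prompt the user to retake the photo instead of guessing.
    return {"label": None, "confidence": confidence, "show": False}
```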

I will refer to this confidence score frequently in the evaluations section in Part 3.

Benchmark sets

Let me introduce what I call the benchmark sets, which you can think of as extended test sets. These are hand-picked images designed to stretch the limits of your model, and provide a measure for specific classes of your data. Use these benchmarks to justify moving your model to production, and for an objective measure to show to your manager.

Difficult Benchmark — These are the “extra credit” images, like the bonus questions a professor would add to the quiz to see which students are paying attention. You need a keen eye to spot the difference between the ground truth and a similar looking class. For example, a cheetah sleeping in the shade that could pass as a leopard if you don’t look closely.

Out-of-scope Benchmark — These are the “trick question” images. Our model is trained on zoo animals, but people are known for not following the rules. For example, a zoo guest takes a picture of their child wearing cheetah face paint.

Most-Common Benchmark — These are your “bread and butter” classes that need to get near perfect scores and zero errors. This would be a make-or-break benchmark for moving to production.

Least-Common Benchmark — These are your “rare but exceptional” classes that again need to be correct and reach a minimum score, such as the confidence threshold.

When looking for images to add to the benchmarks, you can likely find them in real-world images from your deployed model. See the evaluation in Part 3.

For each benchmark, calculate the min, max, median, and mean scores, and also how many images get scores above and below the confidence threshold. Now you can compare these measures against what is currently in production, and against your minimum requirements, to help decide if the new model is production worthy.
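As a rough sketch, those per-benchmark statistics can be computed from a list of confidence scores with just the standard library (the function name and example scores are purely illustrative):

```python
# Sketch: summarize confidence scores (0-100) for one benchmark set.
import statistics

def benchmark_summary(scores: list[float], threshold: float = 95.0) -> dict:
    return {
        "min": min(scores),
        "max": max(scores),
        "median": statistics.median(scores),
        "mean": statistics.mean(scores),
        "above_threshold": sum(s >= threshold for s in scores),
        "below_threshold": sum(s < threshold for s in scores),
    }

# Example: compare a candidate model against production on the same benchmark.
candidate_stats = benchmark_summary([97.1, 99.3, 88.0, 96.5, 94.9])
```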

Staged or Synthetic data

Perhaps the biggest hurdle to any supervised machine learning application is having data to train the model. Clearly, “real-world” data that comes from actual users of the application is ideal. However, you can’t really collect these until the model is deployed. It’s a chicken-and-egg problem.

One way to get started is to have volunteers collect “staged” images for you, trying to act like real users. So, let’s have our zoo staff go around taking pictures of the animals. This is a good start, but there will be a certain level of bias introduced in these images. For example, the staff may take the photos over a few days, so you may not capture the year-round weather conditions.

Another way to get pictures is to use computer-generated “synthetic” images. I would avoid these at all costs, to be honest. Based on my experience, the model struggles with these because they look…different. The lighting is not natural, the subject may be superimposed on a background so the edges look too sharp, etc. Granted, some of the AI-generated images look very realistic, but if you look closely you may spot something unusual. The neural network in your model will notice these things, so be careful.

Image generated using Dall-E

The way that I handle these staged or synthetic images is as a sub-class that gets merged into the training set, but only after giving preference to the real-world images. I cap the number of staged images at 60, so if I have 10 real-world images, I now only need 50 staged ones. Eventually, these staged and synthetic images are phased out completely, and I rely entirely on real-world images.
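A minimal sketch of that top-up logic might look like the following. The cap of 60 comes from the example above; the function name and list-based inputs are just for illustration:

```python
# Sketch: prefer real-world images, topping up with staged/synthetic only as needed.
STAGED_CAP = 60  # e.g., 10 real-world images means only 50 staged are needed

def fill_with_staged(real_world: list[str], staged: list[str], cap: int = STAGED_CAP):
    """Combine real-world images with just enough staged images to reach the cap."""
    needed = max(0, cap - len(real_world))
    return real_world + staged[:needed]
```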

Duplicate images

One problem that can creep into your image set is duplicate images. These can be exact copies of pictures, or they can be extremely similar. You may think that this is harmless, but imagine having 100 pictures of an elephant that are exactly the same — your model will not know what to do with a different angle of the elephant.

Now, let’s say you have only two pictures that are nearly the same. Not so bad, right? Well, here is what can happen to them:

Both pictures go in the training set — The model doesn’t learn anything new from the repeated image and wastes time processing it.

One goes into the training set, the other goes into the test set — Your test score will be higher, but it is not an accurate evaluation.

Both are in the test set — Your test score will be skewed either higher or lower than it should be, because that one image effectively counts twice.

None of these will help your model.

There are a few ways to find duplicates. The approach I have taken is to calculate a Hamming distance across all the pictures and identify the ones that are very close. I have an interface that displays the duplicates, and I decide which one I like best and remove the other.
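If you want to try something similar, here is a sketch using the third-party imagehash and Pillow packages, comparing perceptual hashes by Hamming distance. The distance cutoff of 5 is an illustrative value, not a recommendation:

```python
# Sketch: near-duplicate detection with perceptual hashes and Hamming distance.
from itertools import combinations
from PIL import Image
import imagehash

def find_near_duplicates(paths: list[str], max_distance: int = 5):
    """Return pairs of image paths whose perceptual hashes are within max_distance bits."""
    hashes = {p: imagehash.phash(Image.open(p)) for p in paths}
    duplicates = []
    for (p1, h1), (p2, h2) in combinations(hashes.items(), 2):
        if h1 - h2 <= max_distance:  # ImageHash subtraction gives the Hamming distance
            duplicates.append((p1, p2, h1 - h2))
    return duplicates
```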

Another way (I haven’t tried this yet) is to create a vector representation of your images. Store these in a vector database, and you can do a similarity search to find nearly identical images.
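As a rough sketch of that idea, assuming you already have embeddings for each image (for example, from a pretrained CNN’s penultimate layer), a brute-force cosine-similarity search could look like this. The 0.98 cutoff is an arbitrary illustration, and a real vector database would replace the nested loop:

```python
# Sketch: find nearly identical images by cosine similarity of their embeddings.
import numpy as np

def find_similar_pairs(embeddings: dict[str, np.ndarray], min_similarity: float = 0.98):
    """Return pairs of image paths whose embeddings are nearly identical."""
    paths = list(embeddings)
    vecs = np.stack([embeddings[p] for p in paths])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
    sims = vecs @ vecs.T                                       # cosine similarity matrix
    pairs = []
    for i in range(len(paths)):
        for j in range(i + 1, len(paths)):
            if sims[i, j] >= min_similarity:
                pairs.append((paths[i], paths[j], float(sims[i, j])))
    return pairs
```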

Whatever method you use, it is important to clean up the duplicates.

Building the data sets

Now we are ready to build the traditional training, validation, and test sets. This is no longer a straightforward task, since I want to:

Merge sub-classes into a main class.

Prioritize real-world images over staged or synthetic images.

Apply a minimum number of images per class.

Apply a maximum number of images per class, sending the “overflow” to the test set.

This process is somewhat complicated and depends on how you manage your image library. First, I would recommend keeping your images in a folder structure that has sub-class folders. You can get image counts by using a script to simply read the folders. Second is to keep a configuration of how the sub-classes are merged. To really set yourself up for success, put these image counts and merge rules in a database for faster lookups.

My train-validation-test set splits are usually 90–10–0. I originally started out using 80–10–10, but with diligence on keeping the entire data set clean, I noticed the validation and test scores became pretty even. This allowed me to increase the training set size and let the “overflow” become the test set, alongside the benchmark sets.
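Putting the pieces together, here is a condensed sketch of what a split-building script could look like: merge sub-classes, apply the cutoffs, carve out the validation set, and send the overflow to the test set. The merge_rules example and all names are hypothetical, not my actual configuration:

```python
# Sketch: build train/validation/test sets with merged sub-classes and cutoffs.
import random

def build_splits(merge_rules: dict[str, list[str]],
                 images_by_subclass: dict[str, list[str]],
                 min_images: int = 40, max_images: int = 125,
                 val_frac: float = 0.10, seed: int = 42):
    rng = random.Random(seed)
    train, val, test = {}, {}, {}
    for cls, subclasses in merge_rules.items():
        # List real-world sub-classes first in merge_rules so they are preferred
        # when the maximum cutoff truncates the merged list.
        paths = [p for sub in subclasses for p in images_by_subclass.get(sub, [])]
        if len(paths) < min_images:
            continue  # not enough data for this class in this run
        kept, overflow = paths[:max_images], paths[max_images:]
        rng.shuffle(kept)
        n_val = max(1, int(len(kept) * val_frac))
        val[cls] = kept[:n_val]
        train[cls] = kept[n_val:]
        test[cls] = overflow  # "overflow" images become the test set
    return train, val, test

# Hypothetical merge configuration: real-world sub-classes listed before staged ones.
merge_rules = {"cheetah": ["cheetah_adult", "cheetah_cub", "cheetah_staged"]}
```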

Up next…

In this part, we built our data sets by merging sub-classes and applying the image count cutoffs. Plus, we handled staged and synthetic data, as well as cleaned up duplicate images. We also created benchmark sets and defined confidence thresholds, which help us decide when to move a model to production.

In Part 3, we will discuss how we are going to evaluate the different model performances. And then finally we will get to the actual model training and the techniques to enhance accuracy.
