Product-Oriented ML: A Guide for Data Scientists
How to build ML products users love.
Photo by Pavel Danilyuk: https://www.pexels.com/photo/a-robot-holding-a-flower-8438979/
Data science offers rich opportunities to explore new concepts and demonstrate their viability, all towards building the ‘intelligence’ behind features and products. However, most machine learning (ML) projects fail! And this isn’t just because of the inherently experimental nature of the work. Projects may lack purpose or grounding in real-world problems, while integration of ML into products requires a commitment to long-term problem-solving, investment in data infrastructure, and the involvement of multiple technical experts. This post is about mitigating these risks at the planning stage (if a project is going to fail, it should fail here, and fast) while helping you develop into a product-oriented data scientist.
This article provides a structured approach to planning ML products by walking through the key areas of a product design document. We’ll cover clarifying requirements, understanding data constraints and defining what success looks like, all of which dictate your approach to building successful ML products. These documents should be flexible, so use them to figure out what works for your team.
I’ve been fortunate to work in startups, as part of small scrappy teams where roles and ownership become blended. I mention this because the topics covered below cross over traditional boundaries, into project management, product, UI/UX, marketing and more. I’ve found that people who can cross these boundaries and approach collaboration with empathy make great products and better colleagues.
To illustrate the process, we will work through a feature request, set out by a hypothetical courier company:
“As a courier company, we’d like to improve our ability to provide users with advanced warning if their package delivery is expected to be delayed.”
Problem Definition
This section is about writing a concise description of the problem and the project’s motivation. As development spans months or years, not only does this start everyone on the same page, but unique to ML, it serves to anchor you as challenges arise and experiments fail. Start with a project kickoff. Encourage open collaboration and aim to surface the assumptions present in all cross-functional teams, ensuring alignment on product strategy and vision from day one.
Actually writing the statement starts with reiterating the problem in your own words. For me, writing this out long form and then whittling it down makes it easier to home in on the specifics. In our example, we are starting with a feature request. It provides some direction but leaves room for ambiguity around specific requirements. For instance, “improve our ability” suggests an existing system. Do we have access to an existing dataset? “Advanced warning” is vague about what information is sent, but tells us customers will be actively notified in the event of a delayed delivery. These all have implications for how we build the system, and provide an opportunity to assess the feasibility of the project.
We also need to understand the motivation behind the project. While we can assume the new feature will provide a better user experience, what’s the business opportunity? When defining the problem, always tie it back to the larger business strategy. For example, improving delivery delay notifications isn’t just about building a better product — it’s about reducing customer churn and increasing satisfaction, which can boost brand loyalty and lower support costs. This is your real measure of success for the project.
Working within a team to unpack a problem is a skill all engineers should develop. Not only is it commonly tested as part of an interview process but, as discussed, it sets expectations for a project and a strategy that everyone, top-down, can buy into. A lack of alignment from the start can be disastrous for a project, even years later. Unfortunately, this was the fate of a health chatbot developed by Babylon. Babylon set out with the ambitious goal of revolutionising healthcare by using AI to deliver accurate diagnostics. To its detriment, the company oversimplified the complexity of healthcare, especially across different regions and patient populations. For example, symptoms like fever might indicate a minor cold in the UK, but could signal something far more serious in Southeast Asia. This lack of clarity and overpromising on AI capabilities led to a major mismatch between what the system could actually do and what was needed in real-world healthcare environments (https://sifted.eu/articles/the-rise-and-fall-of-babylon).
Requirements and Constraints
With the problem and its motivation defined, we can now document the requirements for delivering the project and set the scope. These typically fall into two categories:
Functional requirements, which define what the system should do from the user’s perspective. These are directly tied to the features and interactions the user expects.
Non-functional requirements, which address how the system operates: performance, security, scalability, and usability.
If you’ve worked with agile frameworks, you’ll be familiar with user stories: short, simple descriptions of a feature told from the user’s perspective. I’ve found that defining these as a team is a great way to align. Start by documenting functional requirements from a user perspective. Then, map them across the user journey, and identify the key moments where your ML model will add value. This approach helps establish clear boundaries early on, reducing the likelihood of “scope creep”. If your project doesn’t have traditional end-users, perhaps you’re replacing an existing process? Talk to people with boots on the ground, be that operational staff or process engineers. They are your domain experts.
From a simple set of stories we can build actionable model requirements:
What information is being sent to users?
As a customer awaiting delivery, I want to receive clear and timely notifications about whether my package is delayed or on time, so that I can plan my day accordingly.
How will users be sent the warnings?
As a customer awaiting delivery, I want to receive notifications via my preferred communication channel (SMS or native app) about the delay of my package, so that I can take action without constantly checking the app.
What user-specific data can the system use?
As a customer concerned about privacy, I only want essential information like my address to be used to predict whether my package is delayed.
Done right, these requirements should constrain your decisions regarding data, models and training evaluation. If you find conflicts, balance them based on user impact and feasibility. Let’s unpack the user stories above to find how our ML strategy will be constrained:
What information is being sent to users?
The model can remain simple (binary classification) if only a delay notification is needed; more detailed outputs require a more complex model and additional data.
How will users be sent the warnings?
Real-time warnings necessitate low-latency systems, which creates constraints around model and preprocessing complexity.
What user-specific data can the system use?
If we can only use limited user-specific information, our model accuracy might suffer. Alternatively, using more detailed user-specific data requires consent from users and increased complexity around how data is stored in order to adhere to data privacy best practices and regulations.
Thinking about users prompts us to embed ethics and privacy into our design while building products people trust. Does our training data result in outputs that contain bias, discriminating against certain user groups? For instance, low-income areas may have worse infrastructure affecting delivery times. Is this represented fairly in the data? We need to ensure the model does not perpetuate or amplify existing biases. Unfortunately, there is a litany of such cases. Take the ML-based recidivism tool COMPAS, used across the US, which was shown to overestimate the recidivism risk for Black defendants while underestimating it for white defendants (https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm).
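One way to make the fairness question concrete is to compare error rates between user groups on a held-out set. The sketch below is only an illustration, assuming each evaluation record carries a hypothetical group label (here an area type), the true outcome and the model's prediction. The function name and toy data are invented for the example.

```python
from collections import defaultdict

def false_negative_rate_by_group(records):
    """Return {group: share of truly delayed packages the model missed}.

    Each record is a (group, actual_delayed, predicted_delayed) tuple.
    Groups with no delayed packages are skipped, since the rate is undefined.
    """
    delayed = defaultdict(int)  # truly delayed packages per group
    missed = defaultdict(int)   # delayed packages predicted as on-time
    for group, actual, predicted in records:
        if actual:
            delayed[group] += 1
            if not predicted:
                missed[group] += 1
    return {g: missed[g] / delayed[g] for g in delayed}

# Hypothetical records: (area type, actually delayed?, predicted delayed?)
records = [
    ("urban", True, True), ("urban", True, True), ("urban", True, False),
    ("urban", False, False),
    ("rural", True, False), ("rural", True, False), ("rural", True, True),
    ("rural", False, False),
]

rates = false_negative_rate_by_group(records)
gap = max(rates.values()) - min(rates.values())
print(rates, "gap:", round(gap, 2))
```

A large gap between groups does not prove the model is unfair, but it is a prompt to look at how each group is represented in the training data.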
In addition to ethics, we also need to consider other non-functional requirements such as performance and explainability:
Transparency and Explainability: How much of a “black-box” do we present the model as? What are the implications of a wrong prediction or bug? These aren’t easy questions to answer. Showing more information about how a model arrives at its decisions requires robust models and the use of explainable models like decision trees. Techniques like SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) can help explain how different features contribute to a prediction, at the risk of overwhelming users. For our example, would telling users why a package is delayed build trust? Generally, model explainability increases buy-in from internal stakeholders.
Real-time or Batch Processing: Real-time predictions require low-latency infrastructure and streaming data pipelines. Batch predictions can be processed at regular intervals, which might be sufficient for less time-sensitive needs. Choosing between real-time or batch predictions affects the complexity of the solution and influences which models are feasible to deploy. For instance, simpler models or optimisation techniques reduce latency. More on this later.
A tip borrowed from marketing is the creation of user personas. This typically builds on market research collected through formal interviews and surveys to understand the needs, behaviours, and motivations of users. It’s then segmented based on common characteristics like demographics, goals and challenges. From this we can develop detailed profiles for each segment, giving them names and backstories. During planning, personas help us empathise with how model predictions will be received and the actions they elicit in various contexts.
Take Sarah, a “Busy Parent” persona. She prioritises speed and simplicity. Hence, she values timely, concise notifications about package delays. This means our model should focus on quick, binary predictions (delayed or on-time) rather than detailed outputs. Finally, since Sarah prefers real-time notifications via her mobile, the model needs to integrate seamlessly with low-latency systems to deliver instant updates.
By documenting functional and non-functional requirements, we define “What” we are building to meet the needs of users, combined with “Why” this aligns with business objectives.
Modelling Approach
It’s now time to think about “How” we meet our requirements. This starts with framing the problem in ML terms by documenting the type of inputs (features), outputs (predictions) and a strategy for learning the relationship between them. We only need enough to get us started, since we know it’s going to be experimental.
For our example, the input features could include traffic data, weather reports or package details while a binary prediction is required: “delayed” or “on-time”. It’s clear that our problem requires a binary classification model. For us this was simple, but for other product contexts a range of approaches exist:
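To make this framing tangible, the inputs and output can be written down as a typed schema alongside a rule-of-thumb baseline. This is a minimal sketch, not a recommended model: the field names, the traffic scale and the thresholds are all assumptions made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeliveryFeatures:
    """Hypothetical model inputs for one package."""
    distance_km: float
    traffic_index: float   # assumed scale: 0.0 (free-flowing) to 1.0 (gridlock)
    heavy_rain: bool
    parcel_weight_kg: float

def baseline_predict(x: DeliveryFeatures) -> str:
    """Rule-of-thumb baseline returning 'delayed' or 'on-time'.

    The thresholds are illustrative guesses, not tuned values.
    """
    risk = 0
    if x.traffic_index > 0.7:
        risk += 1
    if x.heavy_rain:
        risk += 1
    if x.distance_km > 50:
        risk += 1
    return "delayed" if risk >= 2 else "on-time"

long_congested = DeliveryFeatures(60.0, 0.8, False, 2.5)
short_rainy = DeliveryFeatures(5.0, 0.2, True, 1.0)
print(baseline_predict(long_congested))  # two risk factors
print(baseline_predict(short_rainy))     # one risk factor
```

A baseline like this gives any trained classifier something to beat, and fixes the input and output contract that the rest of the system can build against.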
Supervised Learning Models: Requires a labeled dataset to train.
Classification Models: Binary classification is simple to implement and interpret for stakeholders, making it ideal for an MVP. This comes at the cost of more nuanced insights provided by multi-class classification, like a reason for delay in our case. However, this often requires more data, meaning higher costs and development time.
Regression Models: If the target is a continuous value, like the exact time a package will be delayed (e.g., “Your package will be delayed by 20 minutes”), a regression model would be the appropriate choice. These outputs are also subject to more uncertainty.
Unsupervised Learning Models: Works with unlabelled data.
Clustering Models: In the context of delivery delays, clustering could be used during the exploratory phase to group deliveries based on similar characteristics, such as region or recurring traffic issues. Discovering patterns can inform product improvements or guide user segmentation for personalising features/notifications.
Dimensionality Reduction: For noisy datasets with a large feature space, dimensionality reduction techniques like Principal Component Analysis (PCA) or autoencoders can be used to reduce computational costs and overfitting by allowing for smaller models, at the cost of some loss in feature context.
Generative Models: Generates new data by training on either labelled or unlabelled data.
Generative Adversarial Networks (GANs): For us, GANs could be used sparingly to simulate rare but impactful delivery delay scenarios, such as extreme weather conditions or unforeseen traffic events, if a tolerance to edge cases is required. However, these are notoriously difficult to train, with high computational costs, and care must be taken that generated data is realistic. This isn’t typically appropriate for early-stage products.
Variational Autoencoders (VAEs): VAEs have a similar use case to GANs, with the added benefit of more control over the range of outputs generated.
Large Language Models (LLMs): If we wanted to incorporate text-based data like customer feedback or driver notes into our predictions, LLMs could help generate summaries or insights. However, real-time processing is a challenge given the heavy computational load.
Reinforcement Learning Models: These models learn by interacting with an environment, receiving feedback through rewards or penalties. For a delivery company, reinforcement learning could be used to optimise the system based on the real outcome of the delivery. Again, this isn’t really appropriate for an MVP.
It’s normal for the initial framing of a problem to evolve as we gain insights from data exploration and early model training. Therefore, start with a simple, interpretable model, to test feasibility. Then, incrementally increase complexity by adding more features, tuning hyperparameters, and then explore more complex models like ensemble methods or deep learning architectures. This keeps costs and development time low while making for a quick go to market.
ML differs significantly from traditional software development when it comes to estimating development time, because a large chunk of the work is made up of experiments, where both the outcome and the number of experiments required are unknown. This means any estimate you provide should have a large contingency baked in, or the expectation that it’s subject to change. If the product feature isn’t critical, we can afford to give tighter time estimates by starting with simple models while planning for incremental improvements later.
The time taken to develop your model is a significant cost to any project. In my experience, getting results from even a simple model fast will be massively beneficial downstream, allowing you to hand over to frontend developers and ops teams. To help, I have a few tips. First, fail fast: prioritise experiments by least effort and maximum likelihood of success, then adjust your plan on the go based on what you learn. Although this sounds obvious, people do struggle to embrace failure, so be supportive of your team; it’s part of the process. My second tip is, do your research! Find examples of similar problems and how they were solved, or not. Despite the recent boom in popularity of ML, the field has been around for a long time, and 9 times out of 10 someone has solved a problem at least marginally related to yours. Keep up with the literature using sites like Papers with Code, daily papers from Hugging Face, or AlphaSignal, which provides a nice email newsletter. For databases, try Google Scholar, Web of Science or ResearchGate. Frustratingly, the cost of accessing major journals is a significant barrier to a comprehensive literature review. Sci-Hub…
Data Requirements
Now that we know what our “black box” will do, what shall we put in it? It’s time for data, and from my experience this is the most critical part of the design with respect to mitigating risk. The goal is to create an early roadmap for sourcing sufficient, relevant, high-quality data. This covers training data, potential internal or external sources, and evaluating data relevance, quality, completeness, and coverage. Address privacy concerns and plan for data collection, storage, and preprocessing, while considering strategies for limitations like class imbalances.
Without proper accounting for the data requirements of a project, you risk exploding budgets and never fully delivering. Take Tesla Autopilot as one such example. Their challenge with data collection highlights the risks of underestimating real-world data needs. From the start, the system was limited by the data captured from early adopters’ vehicles, which, to date, has lacked the sensor depth required for true autonomy (https://spectrum.ieee.org/tesla-autopilot-data-deluge).
Data sourcing is made significantly easier if the feature you’re developing is already part of a manual process. If so, you’ll likely have existing datasets and a performance benchmark. If not, look internally. Most organisations capture vast amounts of data, such as system logs, CRM data or user analytics. Remember though, garbage in, garbage out! Datasets not built for ML from the beginning often lack the quality required for training. They might not be rich enough, or fully representative of the task at hand.
If unsuccessful, you’ll have to look externally. Start with high-quality public repositories specifically designed for ML, such as Kaggle, UCI ML Repository and Google Dataset Search.
If problem-specific data isn’t available, try more general publicly available datasets. Look through publicly released corpora like the Enron email dataset (for text analysis and natural language processing), government census data (for population-based studies), or commercially released datasets like the IMDb movie review dataset (for sentiment analysis). If that fails, you can start to aggregate from multiple sources to create an enriched dataset. This might involve pulling data from spreadsheets, APIs, or even scraping the web. The challenge in both cases is to ensure your data is clean, consistent, and appropriately formatted for ML purposes.
Worst case, you’re starting from scratch and need to collect your own raw data. This will be expensive and time-consuming, especially when dealing with unstructured data like video, images, or text. In some cases data collection can be automated by conducting surveys, setting up sensors or IoT devices, or even launching crowdsourced labelling challenges.
Regardless, manual labelling is almost always necessary. There are many highly recommended, off the shelf solutions here, including LabelBox, Amazon SageMaker Ground Truth and Label Studio. Each of these can speed up labelling and help maintain quality, even across large datasets with random sampling.
If it’s not clear already, as you move from internal sources to manual collection, the cost and complexity of building a dataset appropriate for ML grows significantly, and so does the risk for your project. While this isn’t a project-killer, it’s important to take into account what your timelines and budgets allow. If you can only collect a small dataset you’ll likely be restricted to smaller model solutions, or the fine-tuning of foundation models from platforms like Hugging Face and Ollama. In addition, ensure you have a costed contingency for obtaining more data later in the project. This is important because the question of how much data your project requires can only really be answered by solving the ML problem. Therefore, mitigate the risk upfront by ensuring you have a route to gathering more. It’s common to see back-of-the-napkin calculations quoted as a reasonable estimate for how much data is required, but this really only applies to very well understood problems like image classification and classical ML problems.
If it becomes clear you won’t be able to gather enough data, there has been some limited success with generative models for producing synthetic training data. Fraud detection systems developed by American Express have used this technique to simulate card numbers and transactions in order to detect discrepancies or similarities with actual fraud (https://masterofcode.com/blog/generative-ai-for-fraud-detection).
Once a basic dataset has been established, you’ll need to understand its quality. I have found manually working the problem to be very effective: it provides insight into useful features and future challenges, sets realistic expectations for model performance, and uncovers data quality issues and gaps in coverage early on. Get hands-on with the data and build up domain knowledge while taking note of the following:
Data relevance: Ensure the available data actually relates to the problem you are trying to solve. For our example, traffic reports and delivery distances are useful, but customer purchase history may be irrelevant. Identifying the relevance of data helps reduce noise, while allowing smaller datasets and models to be more effective.
Data quality: Pay attention to any biases, missing data, or anomalies that you find. This will be useful when building data preprocessing pipelines later on.
Data completeness and coverage: Check the data sufficiently covers all relevant scenarios. For our example, data might be required for both city centres and more rural areas. Failing to account for this impacts the model’s ability to generalise.
Class imbalance: Understand the distribution of classes or the target variable so that you can collect more data if possible. Hopefully for our case, “delayed” packages will be a rare event. While training we can implement cost-sensitive learning to counter this. Personally, I have always had more success oversampling minority classes with techniques like SMOTE (Synthetic Minority Over-sampling Technique) or Adaptive Synthetic (ADASYN) sampling.
Timeliness of data: Consider how up-to-date the data needs to be for accurate predictions. For instance, it might be that real-time traffic data is required for the most accurate predictions.
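Some of these checks are easy to script early on. The sketch below, which uses invented field names and toy rows, reports the share of missing values per field and the label distribution, then derives class weights for cost-sensitive learning using the common "balanced" heuristic of n_samples / (n_classes * class_count). It illustrates the idea and is not a substitute for proper EDA.

```python
from collections import Counter

def audit(rows, label_key):
    """Summarise missing values per field and the label distribution."""
    fields = {key for row in rows for key in row}
    missing = {
        f: sum(1 for row in rows if row.get(f) is None) / len(rows)
        for f in fields
    }
    labels = Counter(
        row[label_key] for row in rows if row.get(label_key) is not None
    )
    return missing, labels

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(labels.values())
    return {c: total / (len(labels) * n) for c, n in labels.items()}

# Hypothetical delivery records; None marks a missing value
rows = [
    {"distance_km": 12.0, "traffic_index": 0.3, "delayed": False},
    {"distance_km": 48.0, "traffic_index": None, "delayed": False},
    {"distance_km": 7.5, "traffic_index": 0.9, "delayed": True},
    {"distance_km": 3.0, "traffic_index": 0.1, "delayed": False},
]

missing, labels = audit(rows, "delayed")
weights = balanced_class_weights(labels)
print("missing share:", missing)
print("label counts:", dict(labels))
print("class weights:", weights)
```

The rarer "delayed" class receives the larger weight, so a mistake on a delayed package would cost more during training than a mistake on an on-time one.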
When it comes to a more comprehensive look at quality, Exploratory Data Analysis (EDA) is the way to uncover patterns, spot anomalies, and better understand data distributions. I will cover EDA in more detail in a separate post, but visualising data trends, using correlation matrices, and understanding outliers can reveal potential feature importance or challenges.
Finally, think beyond just solving the immediate problem — consider the long-term value of the data. Can it be reused for future projects or scaled for other models? For example, traffic and delivery data could eventually help optimise delivery routes across the whole logistics chain, improving efficiency and cutting costs in the long run.
Success Metrics — Finding Good Enough
When training models, quick performance gains are often followed by a phase of diminishing returns. This can lead to directionless trial-and-error while killing morale. The solution? Define “good enough” training metrics from the start, such that you meet the minimum threshold to deliver the business goals for the project.
Setting acceptable thresholds for these metrics requires a broad understanding of the product and the soft skills to bridge the gap between technical and business perspectives. Within agile methodologies, we call these acceptance criteria. Doing so allows us to ship quickly to the minimum spec and then iterate.
What are business metrics? Business metrics are the real measure of success for any project. These could be reducing customer support costs or increasing user engagement, and are measured once the product is live, hence they are referred to as online metrics. For our example, 80% accuracy might be acceptable if it reduces customer service costs by 15%. In practice, you should track a single model with a single business metric. This keeps the project focused and avoids ambiguity about when you have successfully delivered. You’ll also want to establish how you track these metrics. Look for internal dashboards and analytics that business teams should have available; if they’re not, maybe it’s not a driver for the business.
Balancing business and technical metrics: Finding a “good enough” performance starts with understanding the distribution of events in the real world, and then relating this to how it impacts users (and hence the business). Take our courier example: we expect delayed packages to be a rare event, and so for our binary classifier there is a class imbalance. This makes accuracy alone inappropriate, and we need to factor in how our users respond to predictions:
False positives (predicting a delay when there isn’t one) could generate annoying notifications for customers, but when a package subsequently arrives on time, the inconvenience is minor. Avoiding false positives means prioritising high precision.
False negatives (failing to predict a delay) are likely to cause much higher frustration when customers don’t receive a package without warning, reducing the chance of repeat business and increasing customer support costs. Avoiding false negatives means prioritising high recall.
For our example, it’s likely the business values high recall models. Still, for models less than 100% accurate, a balance between precision and recall is necessary (we can’t notify every customer that their package is delayed). This trade-off is best illustrated with a precision-recall curve; an ROC curve shows the related trade-off between true positive rate and false positive rate, but can look optimistic when classes are imbalanced. For classification problems, we can summarise the balance of precision and recall with the F1 score, and for imbalanced classes we can extend this to a weighted F1 score.
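A small worked example helps show why accuracy misleads on imbalanced data and how the decision threshold trades precision for recall. The labels and model scores below are toy values invented for illustration; the metric definitions are the standard ones.

```python
def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means 'delayed'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return tp, fp, fn, tn

def metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy data: 2 delayed packages out of 10, with hypothetical model scores
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.45, 0.38, 0.2, 0.1, 0.1, 0.1, 0.0]

# A "model" that never predicts a delay still looks accurate
always_on_time = metrics(y_true, [0] * len(y_true))

# Lowering the threshold catches more delays but sends more false alarms
strict = metrics(y_true, [int(s >= 0.5) for s in scores])
lenient = metrics(y_true, [int(s >= 0.35) for s in scores])

for name, m in [("always on-time", always_on_time),
                ("threshold 0.5", strict),
                ("threshold 0.35", lenient)]:
    print(name, m)
```

In this toy data the lenient threshold catches every delay but loses precision and accuracy, which is the kind of trade-off worth agreeing with the business before training starts.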
Balancing precision and recall is a fine art, and can lead to unintended consequences for your users. To illustrate this point, consider a service like Google Calendar that offers both company and personal user accounts. In order to reduce the burden on businesses that frequently receive fake meeting requests, engineers might prioritise high recall spam filtering (treating spam as the positive class). This ensures most fake meetings are correctly flagged as spam, at the cost of lower precision, where some legitimate meetings will be mislabelled as spam. However, for personal accounts, receiving fake meeting requests is far less common. Over the lifetime of the account, the risk of a legitimate meeting being flagged becomes significant due to the trade-off of a lower precision model. Here, the negative impact on the user’s perception of the service is significant.
If we consider our courier example as a regression task, with the aim of predicting a delay time, metrics like MAE and MSE are the usual choices, with slightly different implications for your product:
Mean Absolute Error (MAE): This is a fairly intuitive measure of how far predictions are, on average, from the actual values. It is therefore a simple indicator of the accuracy of delay estimates sent to users.
Mean Squared Error (MSE): This penalises larger errors more heavily due to the squaring of differences, and is therefore important if significant errors in delay predictions are deemed more costly to user satisfaction. However, this does mean the metric is more sensitive to outliers.
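The difference between the two metrics is easiest to see with numbers. In this sketch, two hypothetical models (the delay values are made up) have the same MAE, but the one that is occasionally badly wrong has a much larger MSE.

```python
def mae(actual, predicted):
    """Mean Absolute Error: average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean Squared Error: average of the squared errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual_delay_min = [10, 20, 30, 40]
steady_model = [15, 25, 35, 45]    # always 5 minutes off
erratic_model = [10, 20, 30, 60]   # perfect three times, 20 minutes off once

for name, preds in [("steady", steady_model), ("erratic", erratic_model)]:
    print(name, "MAE:", mae(actual_delay_min, preds),
          "MSE:", mse(actual_delay_min, preds))
```

If one 20-minute miss hurts users more than four 5-minute misses, MSE reflects that preference while MAE treats the two models as equal.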
As stated above, this is about translating model metrics into terms everyone can understand and communicating trade-offs. This is a collaborative process, as team members closer to the users and product will have a better understanding of the business metrics to drive. Find the single model metric that points the project in the same direction.
One final point, I have seen a tendency for projects involving ML to overpromise on what can be delivered. Generally this comes from the top of an organisation, where hype is generated for a product or amongst investors. This is detrimental to a project and your sanity. Your best chance to counter this is by communicating in your design realistic expectations that match the complexity of the problem. It’s always better to underpromise and overdeliver.
High Level System Design
At this point, we’ve covered data, models, and metrics, and addressed how we will approach our functional requirements. Now, it’s time to focus on non-functional requirements, specifically scalability, performance, security, and deployment strategies. For ML systems, this involves documenting the system architecture with system-context or data-flow diagrams. These diagrams represent key components as blocks, with defined inputs, transformations, and outputs, illustrating how different parts of the system interact, including data ingestion, processing pipelines, model serving, and user interfaces. This approach encourages a modular system, allowing teams to isolate and address issues without affecting the entire pipeline as data volume or user demand grows, and therefore minimising risks related to bottlenecks or escalating costs.
Once our models are trained, we need a plan for deploying the model into production, making it accessible to users or downstream systems. A common method is to expose your model through a REST API that other services or front-ends can call. For real-time applications, serverless platforms like AWS Lambda or Google Cloud Functions are ideal for low latency (just manage your cold starts). If high throughput is a requirement, then use batch processing with scalable data pipelines like AWS Batch or Apache Spark. We can break down the considerations for ML system design into the following:
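As a rough illustration of the REST approach, here is a minimal prediction endpoint built only on Python's standard library. Everything specific to it is an assumption for the example: the `/predict` path, the `traffic_index` field and the stand-in rule inside `predict_delay`. A production service would more likely use a web framework and load a trained model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict_delay(features: dict) -> dict:
    """Stand-in for a trained model; the rule here is hypothetical."""
    return {"delayed": features["traffic_index"] > 0.7}

def handle_predict(body: bytes):
    """Validate a JSON request body and return (status_code, response_dict)."""
    try:
        features = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return 400, {"error": "body must be valid JSON"}
    if not isinstance(features, dict):
        return 400, {"error": "body must be a JSON object"}
    if not isinstance(features.get("traffic_index"), (int, float)):
        return 400, {"error": "traffic_index must be a number"}
    return 200, predict_delay(features)

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

def serve(port: int = 8000):
    """Start the server on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Keeping the validation and prediction logic in `handle_predict`, separate from the HTTP handler, means it can be tested without starting a server. Calling `serve()` would then accept POST requests at `/predict`.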
Infrastructure and Scalability:
Firstly, we need to make a choice about system infrastructure. Specifically, where will we deploy our system: on-premises, in the cloud, or as a hybrid solution? Cloud platforms, such as AWS or Google Cloud, offer automated scaling in response to demand, both vertically (bigger machines) and horizontally (adding more machines). Think about how the system would handle 10x or 100x the data volume. Netflix provides excellent insight via their technical blog into how they operate at scale. For instance, they have open sourced their container orchestration platform Titus, which automates deployment of thousands of containers across AWS EC2 instances using Autoscaling groups (https://netflixtechblog.com/auto-scaling-production-services-on-titus-1f3cd49f5cd7). Sometimes on-premises infrastructure is required if you’re handling sensitive data. This provides more control over security while being costly to maintain and scale. Regardless, prepare to version control your infrastructure with infrastructure-as-code tools like Terraform and AWS CloudFormation, and automate deployment.
Performance (Throughput and Latency):
For real-time predictions, performance is critical. There are two key metrics to consider: throughput, measuring how many requests your system can handle per second, and latency, measuring how long it takes to return a prediction. If you expect to make repeated predictions with the same inputs, then consider adding caching for either part or all of the pipeline to reduce latency. In general, horizontal scaling is preferred in order to respond to spikes in traffic at peak times and to reduce single-point bottlenecks. This highlights how key decisions taken during your system design process will have direct implications on performance. Take Uber, who built their core service around the Cassandra database specifically to optimise for low-latency, real-time data replication, ensuring quick access to relevant data (https://www.uber.com/en-GB/blog/how-uber-optimized-cassandra-operations-at-scale/).
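The effect of caching repeated predictions can be sketched with the standard library's `functools.lru_cache`. The prediction function, its inputs and the 10 ms sleep that stands in for feature lookup and inference are all hypothetical.

```python
import time
from functools import lru_cache

@lru_cache(maxsize=1024)
def predict_cached(postcode: str, hour: int) -> bool:
    """Hypothetical slow prediction keyed on hashable inputs."""
    time.sleep(0.01)  # stands in for feature lookup and model inference
    return hour in (8, 9, 17, 18)  # toy rule: rush hours are delayed

def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

_, cold = timed(predict_cached, "AB1 2CD", 8)   # computed
_, warm = timed(predict_cached, "AB1 2CD", 8)   # served from the cache
info = predict_cached.cache_info()
print(f"cold: {cold * 1000:.2f} ms, warm: {warm * 1000:.2f} ms, {info}")
```

One caveat of this design is staleness: a cache is only safe when the key captures everything the prediction depends on, so fast-changing inputs like live traffic would need a short expiry or no caching at all.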
Security:
For ML systems, security starts with API authentication for user requests. This is relatively standard, with methods like OAuth2, alongside protecting endpoints with rate limiting and blocked IP address lists, and following OWASP guidance. Additionally, ensure that any stored user data is encrypted at rest and in transit, and that strict access control policies are in place for both internal and external users.
Monitoring and Alerts:
It’s also key to consider monitoring for maintaining system health. Track key performance indicators (KPIs) like throughput, latency, and error rates, and set up alerts to notify engineers if any of these metrics move outside acceptable thresholds. This can be done server-side (e.g., at your model endpoint) or client-side (e.g., at the user’s end) to include network latency.
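The KPI checks described above can be sketched in a few lines. This is only an illustration: the RequestLog record, the check_health function and all threshold values are assumptions, and a real system would normally delegate this to a monitoring stack rather than custom code.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestLog:
    latency_ms: float
    ok: bool


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def check_health(logs, window_seconds, *, max_p95_ms=300.0,
                 max_error_rate=0.01, min_rps=1.0):
    """Return a list of alert messages for one monitoring window."""
    if not logs:
        return ["no traffic received in window"]
    alerts = []
    throughput = len(logs) / window_seconds
    p95 = percentile([r.latency_ms for r in logs], 95)
    error_rate = sum(not r.ok for r in logs) / len(logs)
    if throughput < min_rps:
        alerts.append(f"throughput {throughput:.2f} rps below {min_rps}")
    if p95 > max_p95_ms:
        alerts.append(f"p95 latency {p95:.0f} ms above {max_p95_ms:.0f} ms")
    if error_rate > max_error_rate:
        alerts.append(f"error rate {error_rate:.1%} above {max_error_rate:.1%}")
    return alerts


# A one-minute window where 10% of requests are slow and 2% fail.
window = ([RequestLog(100.0, True)] * 90
          + [RequestLog(500.0, True)] * 8
          + [RequestLog(500.0, False)] * 2)
for message in check_health(window, window_seconds=60):
    print("ALERT:", message)
```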
Cost Considerations:
Cloud platforms simplify infrastructure management, but their costs can quickly spiral. Start by estimating the number of instances required for data processing, model training, and serving, and balance these against project budgets and growing user demands. Most cloud platforms provide cost-management tools to help you keep track of spending and optimise resources.
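A back-of-the-envelope estimate along the lines described above might look like the sketch below. Every number here (peak traffic, per-instance capacity, hourly price, headroom) is a made-up placeholder rather than real pricing; substitute figures from your own load tests and your provider’s price list.

```python
import math


def instances_needed(peak_rps: float, rps_per_instance: float, headroom: float = 0.3) -> int:
    """Number of serving instances to cover peak traffic plus a safety margin."""
    return math.ceil(peak_rps * (1 + headroom) / rps_per_instance)


def monthly_serving_cost(peak_rps: float, rps_per_instance: float,
                         hourly_price: float, hours_per_month: int = 730) -> float:
    """Rough monthly cost if capacity is provisioned for peak all month."""
    count = instances_needed(peak_rps, rps_per_instance)
    return count * hourly_price * hours_per_month


# Placeholder inputs: 200 requests/s at peak, 50 requests/s per instance,
# and a hypothetical price of 0.10 per instance-hour.
today = monthly_serving_cost(peak_rps=200, rps_per_instance=50, hourly_price=0.10)
print(f"estimated monthly serving cost: {today:.2f}")
```

Rerunning the same function with 10x the traffic is a quick way to see whether the budget still holds as demand grows.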
MLOps:
From the beginning, include a plan to efficiently manage the model lifecycle. The goal is to accelerate model iteration, automate deployment, and maintain robust monitoring for metrics and data drift. This allows you to start simple and iterate fast! Implement version control with Git for code and DVC (Data Version Control) for tracking changes to data and model artefacts. Tools like MLflow or Weights & Biases track experiments, while CI/CD pipelines automate testing and deployment. Once deployed, models require real-time monitoring with tools like Prometheus and Grafana to detect issues like data drift.
A high-level system design mitigates risks and ensures your team can adapt and evolve as the system grows. This means designing a system that is model agnostic and ready to scale, by breaking it down into modular components for a robust architecture that supports rapid trial and error, scalable deployment, and effective monitoring.
Prototyping By Mocking the ML
We now have an approach for delivering the project requirements, at least from an ML perspective. To round our design off, we can now outline a product prototype, focusing on the user interface and experience (UI/UX). Where possible, this should be interactive, validating whether the feature provides real value to users, ready to iterate on the UX. Since we know ML to be time-consuming and resource-intensive, you can set aside your model design and prototype without a working ML component. Document how you’ll simulate these outputs and test the end-to-end system, detailing the tools and methods used for prototyping in your design document. This is important, as the prototype will likely be your first chance to gather feedback and refine the design, likely evolving into V1.
To mock our ML we replace predictions with a simple placeholder and simulate outputs. This can be as simple as generating random predictions or building a rule-based system. Prototyping the UI/UX involves creating mockups with design tools like Figma, or prototyping APIs with Postman and Swagger.
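For the courier example, the two mocking options above could be sketched as follows. The field names, rule thresholds and scores are all hypothetical; the point is only that the mock returns outputs in the same shape a real model eventually would, so the rest of the prototype can be built against it.

```python
import random
from dataclasses import dataclass


@dataclass
class DelayPrediction:
    delayed: bool
    probability: float
    source: str  # records which mock produced the prediction


def random_mock(rng: random.Random) -> DelayPrediction:
    """Placeholder 'model': a uniformly random delay probability."""
    p = rng.random()
    return DelayPrediction(delayed=p >= 0.5, probability=p, source="random-mock")


def rule_based_mock(distance_km: float, severe_weather: bool,
                    depot_backlog: int) -> DelayPrediction:
    """Placeholder 'model': hand-written rules with made-up thresholds."""
    score = 10  # baseline risk, in percentage points
    if distance_km > 300:
        score += 30
    if severe_weather:
        score += 40
    if depot_backlog > 500:
        score += 20
    return DelayPrediction(delayed=score >= 50, probability=score / 100,
                           source="rule-based-mock")


print(random_mock(random.Random(42)))
print(rule_based_mock(distance_km=420, severe_weather=True, depot_backlog=120))
```

Because both mocks return the same DelayPrediction type, the UI and API layers do not need to change when a trained model replaces them.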
Once your prototype is ready, put it in the hands of people, no matter how embarrassed you are of it. Larger companies often have resources for this, but smaller teams can create their own user panels. I’ve had great success with local universities: students love to engage with something new, and Amazon vouchers also help! Gather feedback, iterate, and start basic A/B testing. As you approach a live product, consider more advanced methods like multi-armed bandit testing.
There is an excellent write-up by Apple that serves as an example of mocking ML in this way. During user testing of a conversational digital assistant similar to Siri, they used human operators to impersonate a prototype assistant, varying the response style between chatty, non-chatty, and mirroring the user’s own style. With this approach they showed that users preferred assistants that mirror their own level of chattiness, improving trustworthiness and likability, all without investing in extensive ML development to test UX (https://arxiv.org/abs/1904.01664).
From this we see that mocking the ML component puts the emphasis on outcomes, allowing us to change output formats, test positive and negative flows, and find edge cases. We can also gauge the limits of perceived performance and how we manage user frustration, which has implications for the complexity of models we can build and for infrastructure costs, all without concern for model accuracy. Finally, sharing prototypes internally helps get buy-in from business leaders. Nothing sparks support and commitment for a project more than putting it in people’s hands.
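To give a feel for multi-armed bandit testing, here is a minimal sketch of epsilon-greedy, one of the simplest bandit strategies. The variant names and their success rates are invented for the simulation; in a real test the reward would come from observed user behaviour, such as whether a user found a delay notification helpful.

```python
import random


class EpsilonGreedy:
    """Mostly show the best-performing variant, but explore at random with probability epsilon."""

    def __init__(self, arms, epsilon=0.1, rng=None):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = rng if rng is not None else random.Random()
        self.pulls = {arm: 0 for arm in self.arms}
        self.rewards = {arm: 0.0 for arm in self.arms}

    def mean(self, arm):
        """Average reward observed so far (0.0 for an untried arm)."""
        return self.rewards[arm] / self.pulls[arm] if self.pulls[arm] else 0.0

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)   # explore
        return max(self.arms, key=self.mean)    # exploit

    def update(self, arm, reward):
        self.pulls[arm] += 1
        self.rewards[arm] += reward


def simulate(true_rates, rounds=5000, epsilon=0.1, seed=0):
    """Run the bandit against simulated users with made-up success rates."""
    rng = random.Random(seed)
    bandit = EpsilonGreedy(true_rates, epsilon=epsilon, rng=rng)
    for _ in range(rounds):
        arm = bandit.choose()
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        bandit.update(arm, reward)
    return bandit


result = simulate({"plain": 0.1, "detailed": 0.6})
print(result.pulls)
```

Unlike a fixed 50/50 A/B split, the bandit shifts traffic towards the better variant while the test is still running, at the cost of less clean statistical comparisons.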
Gather Feedback and Iterate
As you move into development and deployment, you’ll inevitably find that requirements evolve and your experiments will throw up the unexpected. You’ll need to iterate! Document changes with version control, and incorporate feedback loops by revisiting the problem definition, re-evaluating data quality, and re-assessing user needs. This starts with continuous monitoring. As your product matures, look for performance degradation by applying statistical tests to detect shifts in prediction distributions (data drift). To counter this, implement online learning or, if possible, build user feedback mechanisms into the UI to help reveal real bias and build trust (so-called human-in-the-loop). Actively seek feedback internally first, then from users. Through interviews and panels, understand how they interact with the product and what new problems this creates. Use A/B testing to compare selected versions of your model and understand the impact on user behaviour and the relevant product/business metrics.
ML projects benefit from adopting agile methodologies across the model life cycle, allowing us to manage the uncertainty and change that is inherent in ML. This starts with the planning process. Start small, test quickly, and don’t be afraid to fail fast. Applying this to the planning and discovery phase will reduce risk, while delivering a product that not only works but resonates with your users.
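One way to apply a statistical test to prediction distributions, as suggested above, is the two-sample Kolmogorov-Smirnov test. The sketch below implements the statistic in plain Python and compares it against the usual large-sample critical value at the 5% level; the function names and the sample data are assumptions for illustration, and in practice a statistics library would typically be used instead.

```python
import bisect
import math


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two samples."""
    ref, cur = sorted(reference), sorted(current)
    largest_gap = 0.0
    for x in ref + cur:
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_cur = bisect.bisect_right(cur, x) / len(cur)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_cur))
    return largest_gap


def drift_detected(reference, current, c_alpha=1.358):
    """Flag drift when the KS statistic exceeds the approximate critical value.

    c_alpha=1.358 corresponds to a 5% significance level in the
    large-sample approximation.
    """
    n, m = len(reference), len(current)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical


# Hypothetical predicted delays (minutes) at launch versus a later week.
launch_week = list(range(100))
later_week = [minutes + 50 for minutes in launch_week]
print("drift detected:", drift_detected(launch_week, later_week))
```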
Product-Oriented ML: A Guide for Data Scientists was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.