Retail Demand Forecasting: Reducing the Error by 33%
SupChains delivered a proof of concept (POC) for an international grocery retailer operating thousands of stores. The model leverages data on promotions, prices, shortages, product launches, and stores' opening hours to forecast demand 14 days ahead, per product, per store. Compared to the retailer's current software provider, SupChains reduced forecasting errors by 33%. As explained at the end of this case study, this could result in savings of €172 million for a retail chain with 10,000 stores.
This case study provides an end-to-end process for developing and testing a machine learning (ML) forecasting model for a retailer. We begin by defining the project’s scope and context. Following this, we explain how data was collected, cleaned, and structured. We also address specific exclusions from the model’s design (outliers, segmentation). The study then covers the execution of the proof of concept, how success was measured, and a comparison with the retailer’s current planning software. Finally, we discuss the financial impact based on the achieved results (a 33% forecast error reduction).
Scope, Methodology, and Objectives
Our client is active worldwide; however, this project focused on a representative sample of approximately 150 stores and 500 products, selected by the client within a specific product segment of interest. As the value of the products within this segment is relatively similar, we didn’t use value-weighted metrics when evaluating forecasting accuracy.
The objectives of this project were to:
- Demonstrate how machine learning models can forecast demand during promotions (something the retailer's current software provider cannot do).
- Assess whether machine learning models can deliver more accurate forecasts than the current software provider during non-promotional periods.
- Determine whether machine learning can cope with all the operational business complexities of a retailer (opening hours, holidays, new products, and new stores).
To assess the results, we generated a full year (365 sets) of 14-day-ahead forecasts for around 65,000 store-product combinations, totaling around 332 million point forecasts. As discussed in the Results section, the client was responsible for evaluating the forecasts.
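The order of magnitude follows directly from the evaluation setup:

```python
# Scale of the evaluation: one point forecast per store-product
# combination, per forecast origin, per day of the horizon.
combinations = 65_000  # active store-product pairs (not all 150 x 500 exist)
origins = 365          # one forecast set per day, over a full year
horizon = 14           # days ahead in each set

print(f"{combinations * origins * horizon:,} point forecasts")  # 332,150,000
```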
Challenges
Big Data
A single store can feature from 1,000 to 40,000 products (as many as some of our entire projects with manufacturers). Coping with such volumes of data requires strict data hygiene, efficient computing, and the ability to scale IT resources.
New Products (and Stores)
Retailers routinely launch (or discontinue) products and need to plan stock deployment for these new products. Our model was specifically designed to forecast demand for new products before they hit the market, even in the complete absence of historical data. Similarly, the ability to forecast initial demand for newly opened stores was a critical requirement.
Extreme Seasonality
Retailers frequently manage products sold exclusively during specific events or limited periods, such as holiday-specific items or seasonal produce like cherries. These exhibit what we define as extreme seasonality, and our model was engineered to cope with these highly irregular and intermittent demand patterns.
Data Collection and Cleaning
Supply chain managers have many KPIs to track forecasting accuracy, but few, if any, to measure input data quality.
Collecting, cleaning, and structuring data is fundamental to forecasting success. Beyond the basic principle of garbage in, garbage out, high-quality data (including demand and business drivers) enables the model to understand demand signals better and ultimately generate more accurate predictions. At SupChains, we treat data as a source of insight for our machine learning models. To unlock their full potential, we need to feed them with the correct data in the proper format (formatting business drivers into structured inputs for a machine learning model is called feature engineering). This requires both business knowledge and data engineering know-how.
Moreover, our experience with our clients shows that most erratic forecasts are due to inconsistent inputs (such as wrong or missing master data, or erroneous sales transactions). That’s why we emphasize the data preparation phase: rigorously validating input data and integrating key business drivers (such as promotions, prices, sell-outs) ensures our models run smoothly and reliably.
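As a minimal sketch of what feature engineering means in practice (the column names are illustrative, not the client's actual schema), here is how a promotional calendar and a price list can be merged into a daily sales table to become structured model inputs:

```python
import pandas as pd

def build_features(sales: pd.DataFrame, promos: pd.DataFrame,
                   prices: pd.DataFrame) -> pd.DataFrame:
    """Turn raw business drivers into structured model inputs.
    sales:  one row per store, product, and date
    promos: promotional calendar (store_id, product_id, date, on_promo)
    prices: price list (product_id, date, actual_price)"""
    df = sales.merge(promos, on=["store_id", "product_id", "date"], how="left")
    df["on_promo"] = df["on_promo"].fillna(0).astype(int)
    df = df.merge(prices, on=["product_id", "date"], how="left")
    return df
```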
Master Data
Collecting consistent master and hierarchical data is key for two main reasons:
- Our machine learning forecasting engines use products’ hierarchical information (such as brands and product families) as inputs to generate forecasts.
This is especially critical for forecasting new products: since they have no historical sales data, the model must rely on general information, such as product family, pricing, or the type of store where the product is sold.
- Correct master data allows for flagging product transitions (old product -> new product) and launch dates. Without these, the forecasting engine would treat each successor as a brand-new product. Tracking product transitions is a low-effort, high-impact task: it's the lowest-hanging fruit of data collection (see the sketch below).
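One common way to use such a transition table, shown here as a hedged sketch with hypothetical identifiers, is to relabel the predecessor's sales under the successor's ID so the model sees one continuous demand history instead of a cold start:

```python
import pandas as pd

# Hypothetical transition table from master data: old product -> new product.
transitions = pd.DataFrame({
    "old_id": ["A-100"],
    "new_id": ["A-200"],
    "launch_date": [pd.Timestamp("2024-03-01")],
})

def apply_transitions(sales: pd.DataFrame) -> pd.DataFrame:
    """Relabel predecessor sales under the successor's ID so the
    forecasting engine sees a continuous series."""
    mapping = dict(zip(transitions["old_id"], transitions["new_id"]))
    out = sales.copy()
    out["product_id"] = out["product_id"].replace(mapping)
    return out
```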
Transactional Sales Data
For this proof of concept (POC), we collected and cleaned tens of millions of sales transactions. Specifically, we excluded transactions with:
- Null or negative sales price.
- Null or negative sales quantity.
As always, excluding dubious transactions carries a risk of false positives (flagging a transaction as erroneous when it was legitimate). However, the risk of keeping erroneous or irrelevant transactions (such as transactions used to validate stock counts) is too high.
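A minimal sketch of this exclusion filter (column names are illustrative):

```python
import pandas as pd

def exclude_dubious_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only transactions with a strictly positive price and quantity;
    null (missing or zero) and negative values are excluded."""
    valid = (
        df["unit_price"].notna() & (df["unit_price"] > 0)
        & df["quantity"].notna() & (df["quantity"] > 0)
    )
    return df[valid]
```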
Business Drivers and Model Features
When building our machine learning models, we aim to use the fewest possible features by selecting only those that truly drive accuracy. Let’s review the ones that were specific to this project.
Seasonality
At SupChains, we take pride in leveraging diverse techniques to evaluate seasonality accurately. For this project, we specifically addressed a complex double seasonality: products exhibit an overall yearly seasonality with distinct high and low seasons, coupled with an intraweek daily seasonality. Our approach focused on a by-store assessment for the intraweek daily seasonality, while yearly seasonality was mainly assessed per product.
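As an illustration (a sketch, not our production feature set), this double seasonality can be exposed to an ML model with two simple calendar features; a tree-based model can then learn a per-store intraweek profile and a per-product yearly profile from their interactions with the store and product identifiers:

```python
import pandas as pd

def add_seasonality_features(df: pd.DataFrame) -> pd.DataFrame:
    """Expose both seasonal cycles to the model.
    day_of_week interacts with store_id (intraweek, by store);
    week_of_year interacts with product_id (yearly, by product)."""
    df = df.copy()
    df["day_of_week"] = df["date"].dt.dayofweek
    df["week_of_year"] = df["date"].dt.isocalendar().week.astype(int)
    return df
```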
Dealing with Shortages
Supply chain planners must forecast demand, not sales, or face a vicious circle of shortages. Unfortunately, most supply chains can't track demand. Instead, they collect sales, which are a perfect proxy as long as there is enough inventory. But as soon as shortages occur, sales no longer reflect demand.
When facing a shortage, we can either try to interpolate the missing sales data or bypass it. At SupChains, we always favor the second option: all our models (machine learning and statistical) bypass shortages by default.
Demand forecasts should never be evaluated during periods of shortage.
Failing to exclude these periods means you are measuring sales forecasting accuracy, not demand forecasting accuracy. Forecasting sales (and not demand) will result in a vicious circle of everlasting shortages. Accurately tracking and accounting for shortages must be a top priority for S&OP leaders.
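In practice, bypassing shortages amounts to masking the affected periods, both when training the model and when scoring it. A hedged sketch, assuming a daily table with an on-hand inventory column (a simplification, as a day can also be partially out of stock):

```python
import numpy as np
import pandas as pd

def in_stock_mask(df: pd.DataFrame) -> pd.Series:
    """True where sales can be trusted as a proxy for demand."""
    return df["on_hand_inventory"] > 0

def mae_excluding_shortages(df: pd.DataFrame) -> float:
    """Score only in-stock days: otherwise we measure sales forecasting
    accuracy, not demand forecasting accuracy."""
    m = in_stock_mask(df)
    return float(np.abs(df.loc[m, "forecast"] - df.loc[m, "sales"]).mean())
```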
Promotions
When businesses are promotion-driven, promotions are usually the main demand driver. Unfortunately, collecting a promotional calendar with historical and future promotions is often tedious and requires multiple workshops with the client. But once the input data is cleaned and consistent, machine learning models do a great job of capturing the impact of promotions.
As you can see in the illustration, the impact of promotions varies over time (and in some situations results in long-lasting lower demand, as clients bought stock in advance). Capturing the effect of promotions requires going a long way beyond simply "increasing sales by X%," as many statistical tools do.
Traditional statistical models often struggle to capture the non-linear and complex temporal impact of promotions (what happens before, during, and after the promotion), particularly when a product experiences only a few promotional events per year. Properly tuned machine learning models, on the other hand, shine at capturing these effects.
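One way to expose these temporal effects to a model, shown here as a sketch with hypothetical column names, is to go beyond a single on/off promotion flag and encode where each day stands relative to the nearest promotions, so the model can learn both the pre-promotion dip and the post-promotion slump:

```python
import pandas as pd

def add_promo_timing(g: pd.DataFrame) -> pd.DataFrame:
    """g: one store-product series sorted by date, with a 0/1 'on_promo'
    flag taken from the promotional calendar."""
    g = g.copy()
    promo_days = g["date"].where(g["on_promo"] == 1)
    last_promo = promo_days.ffill()   # most recent promotion at or before each day
    next_promo = promo_days.bfill()   # next promotion at or after each day
    g["days_since_promo"] = (g["date"] - last_promo).dt.days
    g["days_to_promo"] = (next_promo - g["date"]).dt.days
    return g

# Applied per series:
# df.groupby(["store_id", "product_id"], group_keys=False).apply(add_promo_timing)
```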
Actual Prices and Future Pricing
Including prices in a forecasting model should, in theory, allow supply chain leaders to assess the price sensitivity of their clients. We identify four main reasons why, in practice, using prices to improve forecasting models for most supply chains is complicated.
1. It's Complicated. Pricing is just one of many business drivers, and price sensitivity is a non-linear relationship, which makes it hard to isolate.
2. Lack of Data. Most supply chains rarely change their prices, making it challenging to capture price sensitivity due to a lack of data.
3. Inconsistent Pricing. Some B2B supply chains offer different prices and discounts per client (such as end-of-quarter discounts if specific volumes are reached), making it challenging to assess pricing.
Moreover, analyzing only the historical relationship between prices and demand won’t help you forecast future demand unless you have access to a pricing plan.
4. Lack of Future Pricing Plan. Only a few supply chains maintain a rigorous pricing plan (historical and future). Without one, you can't use prices as a driver of future demand, as you don't know future prices.
For this project, we enjoyed a specific case where none of these limitations applied: we had access to the official historical and future price lists, actual historical prices, and prices that varied (up and down) over time. Only one aspect was missing:
5. Competitors' Prices. Capturing competitors' historical prices is very difficult (or expensive if you have to purchase the data). Moreover, you will never know your competitors' prices in advance.
In our case, for grocery retailers, it's unlikely that consumers visit multiple stores to chase the lowest price on each product. Using competitors' prices might be much more relevant for B2C online retailers.
As you can see in the graph below, even with this data, capturing the relationship between sales and price is tedious, as multiple drivers impact demand simultaneously (seasonality, promotions).
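One hedged sketch of how to make prices easier for a model to digest (illustrative column names): absolute prices are confounded with the product itself, so we can instead expose each day's price relative to a recent reference level, which isolates discounts and price moves:

```python
import pandas as pd

def add_price_features(g: pd.DataFrame) -> pd.DataFrame:
    """g: one store-product series sorted by date, with 'actual_price'."""
    g = g.copy()
    # Rolling median over ~13 weeks (daily rows) as a reference price:
    # robust to short promotional dips.
    g["reference_price"] = g["actual_price"].rolling(91, min_periods=7).median()
    g["relative_price"] = g["actual_price"] / g["reference_price"]
    return g
```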
Forecasting New Products
S&OP leaders typically view new product introductions as a challenge requiring manual work from demand planners. This is especially true for retailers, who usually oversee far more product launches than other businesses (simply due to the large number of products they carry). Furthermore, retailers must accurately forecast demand for newly opened stores.
Our POC included the launch of various products during the test phase (i.e., products that were never sold in a given store in the data we received to train our model). The following figure shows various forecasts submitted during the evaluation phase for such a product: the first forecast was generated without any historical data available, and as you can see, the model overshot the launch but reverted to normal levels within two weeks.
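To make this concrete, here is a hedged sketch (hypothetical feature names and values) of what the model sees at launch: the history-based features are simply missing, while the attribute-based features from master data are not, so the model falls back on what it learned from similar products and stores:

```python
import numpy as np
import pandas as pd

# Illustrative feature row for a store-product pair on its launch day.
launch_row = pd.DataFrame([{
    "product_family": "frozen_desserts",  # from master data
    "brand": "private_label",
    "store_format": "hypermarket",
    "actual_price": 2.49,
    "on_promo": 1,                        # launch promotion
    "sales_lag_7": np.nan,                # no sales history yet
    "sales_lag_364": np.nan,
}])
# Gradient-boosted tree libraries such as LightGBM handle missing feature
# values natively, so one model can serve both new and mature products.
```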
What We Didn’t Do
Outliers
At SupChains, we don't use ML- or statistics-based techniques to detect outliers. Instead, we identify and exclude erroneous transactions based on business rules (such as inconsistent prices or stock clearances) and explain most deviations based on business drivers, including promotions and price changes.
Clean the Historical Baseline
We advise against “cleaning” historical sales by removing promotion uplifts. The strength of machine learning models lies in their ability to understand how all business drivers, including promotions, impact demand (when these are provided as inputs). Allowing demand planning teams to remove specific events from historical data manually leads to a labor-intensive, inconsistent, and ultimately inaccurate forecasting process.
Segmentation
We do not use segmentation (such as ABC XYZ segmentation, or Intermittent, Lumpy, Erratic, Smooth classification) or clustering techniques. As explained in this article, we don’t see how they could add value to demand forecasting models. Our recent forecasting competition, VN1, also showed that very few top competitors used segmentation or outlier detection techniques, confirming the limited value of these practices, if any.
External Drivers
For this project, we didn't use any external drivers (such as economic indicators), as we are not aware of projects that successfully used external drivers to enrich forecasts. Generally, internal drivers (such as promotions, confirmed future orders, pricing, and sell-outs) provide greater insights into future demand than external economic indicators.
Weather
In our experience, weather data can only be meaningfully used for daily forecasts per store. Using weather data to forecast demand per region, or over weekly (or monthly) aggregated periods, averages out the weather itself (sunlight exposure, rainfall), which doesn't make much sense.
Even for retailers, using weather data for forecasting models isn’t a sure ROI:
- Collecting historical weather data (per point-of-sales) is complicated (and likely expensive).
- You can’t accurately predict the weather far in advance.
- You need to update the whole dataset daily.
Moreover, if you train your model using historical actual weather data rather than weather forecasts, the resulting correlation between weather and demand is likely to be artificially inflated. Finally, our previous experiences with retail projects showed that the added value of weather is marginal.
Delivering a Proof-of-Concept
Timeline
The proof of concept began with a bit more than a month of data cleaning and structuring (the client provided us with high-quality data). During this first phase, we aligned on the scope (as discussed above) and on product transitions. We then created the model using our ML framework, spending most of the time on feature engineering and selection (the most important part of building any ML model).
Following this, we evaluated the model by delivering 365 sets of 14-day forecasts to the client, who was responsible for the assessment.
Results
Measuring Success
Across our projects, we have learned that defining how to measure success, and aligning on definitions and scope, is key. For this project, we established these rules with our client:
- We compute the Score (MAE% + |Bias%|, as in VN1 and Demand Forecasting Best Practices).
- We measure accuracy only for periods during which:
- There is no shortage
- The product is officially active (as per the product launch calendar)
- The store is officially open (as per the store opening calendar)
Don’t Measure Accuracy Alone!
To measure forecasting quality, we advise against relying solely on accuracy metrics (such as MAPE, MAE, or WMAPE). Instead, we track both accuracy (using MAE) and bias, grouping them into a single metric, the Score (MAE + |Bias|), as recommended in Demand Forecasting Best Practices. As explained in Data Science for Supply Chain Forecasting, tracking accuracy alone will mechanically promote under-forecasting.
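A sketch of the Score in code, assuming (as in the VN1 competition) that both the MAE and the bias are scaled by total demand over the valid evaluation periods:

```python
import numpy as np

def score(forecast: np.ndarray, demand: np.ndarray) -> float:
    """Score = MAE% + |Bias%|.
    Both terms are scaled by total demand; inputs should already be
    restricted to valid periods (no shortage, product active, store open)."""
    error = forecast - demand
    mae_pct = np.abs(error).sum() / demand.sum()
    bias_pct = error.sum() / demand.sum()
    return mae_pct + abs(bias_pct)
```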
Our Forecasting Metrics: Score and Forecast Value Added
To evaluate our forecasting models, we submitted a full year (365 sets) of 14-day-ahead forecasts. These were reviewed directly by the client, who compared three models:
- Our model (SupChains), relying on our ML framework.
- The forecast engine of the current software (Current Software), relying on statistical models.
- A 28-day moving average (MA 28 Days).
At the core of our methodology for evaluating forecasts lies the Forecast Value Added (FVA) framework: rather than measuring the accuracy of our model in isolation, we compare it against various benchmarks. We advocate tracking FVA as the #1 practice that supply chain leaders should implement.
The 28-day moving average suffers from averaging out all historical sales (including promotions), resulting in a massive 62% bias during regular sales.
By reducing the Score from 74% to 49%, SupChains delivered an added value of 33% compared to the current planning software.[1]
When comparing the results during promotions, neither the benchmark nor the planning software could deliver meaningful results. SupChains' forecast was, in percentage of error, more accurate during promotions than during regular sales. This is explained by the much higher total sales during promotions, which mechanically results in a lower percentage error.
[1] We compute the added value as the % score reduction of one model against another. In this case, 33% ≈ 1 − (49% / 74%).
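The footnote's computation, as a one-liner that can be reused to compare any pair of models in an FVA analysis:

```python
def forecast_value_added(score_model: float, score_benchmark: float) -> float:
    """% score reduction of a model against a benchmark (footnote [1])."""
    return 1 - score_model / score_benchmark

print(f"{forecast_value_added(0.49, 0.74):.1%}")  # 33.8%, reported as 33%
```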
Financial Impact and ROI
While SupChains reduced the error by a significant 33% (and by even more during promotional events, which we exclude from this assessment), the true measure of success for supply chain leaders lies in the business impact. To estimate it, let's assume the following financial metrics for a single store of a typical retail chain:
- Yearly revenue per store: €10 million
- Average inventory per store: €1 million
- Annual holding and waste costs: Calculated at 13% of inventory value (€0.13 million per store per year), comprising 10% for financing costs and 3% for waste costs.
- Pre-tax profits per store: €0.3 million
Based on these inputs, we estimated the potential impact on retail chains of varying sizes, using various sources and our own simulations. A retailer operating 10,000 stores could realize annual savings of approximately €170 to 190 million, as per the IBF and SupChains simulations. (We excluded Gartner from the analysis, as it would yield unrealistically high numbers.) Inventory would be reduced by around 60% (computed as 1 − (1 − 2.7%)^33) and obsolescence by around 73% (= 1 − (1 − 3.9%)^33).
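A quick check of the compounding arithmetic behind those two figures, using the per-point-of-error elasticities stated in the formulas (2.7% for inventory and 3.9% for obsolescence, compounded over the 33 points of error reduction):

```python
# Compounding a per-point effect over 33 points of error reduction.
points = 33
inventory_reduction = 1 - (1 - 0.027) ** points      # ~59.5%, i.e. around 60%
obsolescence_reduction = 1 - (1 - 0.039) ** points   # ~73.1%

print(f"inventory: {inventory_reduction:.1%}, "
      f"obsolescence: {obsolescence_reduction:.1%}")
```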
Acknowledgments
Fabian Rang, Quan Pham, Daniel Reynoso Tapia, Valérie LeBlanc, Konrad, Thamin Rashid, Guillaume Clément, Ritavan, Sijo Manikandan, Ajaypal Singh, Gaurav Jhanwar