Case Study: Forecasting Sales for Make-to-Order Products with Confirmed Orders
SupChains delivered a forecasting model to a German manufacturer, leveraging future orders as a key demand driver. Compared to statistical benchmarks, SupChains reduced forecasting error by 20%.
This case study follows the end-to-end delivery of a machine learning (ML) forecasting model. We begin by setting the project’s scope and context, followed by an explanation of how the data was collected, cleaned, and structured, including the treatment of confirmed future orders. We also note key exclusions from the model design. Finally, we describe how the proof of concept was executed, followed by a comparison of results between statistical baselines and machine learning models.
Our client operates across Europe, Asia, and the Americas, employing more than 1,800 people and reporting revenues exceeding € 450 million in 2024. The company manufactures and supplies tailored products for various industrial applications, including automotive, industrial manufacturing, medical, and home appliances. Most products are made to order, with a strong focus on client-specific solutions.
Forecasting accuracy is more than a technical metric — it directly impacts supply chain performance. In Make-to-Stock (MTS) environments, better forecasts support better supply chain decisions, reducing inventory and obsoletes while improving service levels and overall sales. For Make-to-Order (MTO) environments, more accurate forecasts enhance capacity planning, enabling shorter delivery lead times to clients and more reliable available-to-promise dates. In both MTO and MTS settings, improved forecasts also enable better collaboration with suppliers and more effective raw material inventory management. While the exact benefits vary, the link between accurate forecasting and supply chain efficiency is well established.
Scope
Despite our client’s global presence, this project focused on the regional European market, with around 500 clients and 5,000 unique products (totaling around 10,000 product–client combinations, as most products are client-specific). The objective of this project isn’t to optimize finished-goods inventories, since most products are made to order. Instead, we aim to improve mid-term capacity planning using forecasts up to 12 months ahead. Although capacity planning only requires product-level forecasts (or even family-level forecasts), the client requested product–client forecasts to facilitate the forecast review and enrichment process. As most products are client-specific, forecasting sales by product–client isn’t much of a stretch compared to product-only granularity.
A Challenging Project. At project kickoff, only two and a half years of historical data were available due to a recent ERP implementation. This limited history posed challenges for training forecasting models, especially in capturing seasonal patterns, which typically require multiple cycles to detect reliably. Compounding the difficulty, the forecasting task was unusually complex: we aimed to produce mid- and long-term forecasts for products that are 80% client-specific and sold only 2.5 times per year on average.
Data Collection and Cleaning
Supply chain managers have many KPIs to track forecasting accuracy, but few, if any, to measure input data quality.
Collecting, cleaning, and structuring data is fundamental to forecasting success. Beyond the basic principle of garbage in, garbage out, high-quality data (including demand and business drivers) enables the model to understand demand signals better and ultimately generate more accurate predictions. At SupChains, we treat data as a source of insight for our machine learning models. To unlock their full potential, we need to feed them with the correct data in the proper format. This requires both business knowledge and data engineering know-how.
Moreover, our experience with our clients shows that most erratic forecasts are due to inconsistent inputs. That’s why we emphasize the data preparation phase: rigorously validating input data and integrating key business drivers ensures our models run smoothly and reliably.
Master Data
Collecting consistent master and hierarchical data is key for two main reasons:
- Our machine learning forecasting engines use products’ hierarchical information (such as brands and product families) as inputs to generate forecasts.
- Correct master data allows us to flag product transitions (old product → new product) and launch dates. Without these, the forecasting engine would treat every replacement product as a brand-new item. Tracking product transitions is a low-effort, high-impact task: it’s the lowest-hanging fruit when working on data collection (a minimal sketch of this remapping is shown below).
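As an illustration only, here is a minimal sketch, assuming a simple pandas sales table and a hypothetical transition table with `old_code`/`new_code` columns, of how a successor product can inherit the history of the product it replaces:

```python
import pandas as pd

# Hypothetical sales history and product-transition table
sales = pd.DataFrame({
    "product": ["A-100", "A-100", "A-200", "B-300"],
    "month":   ["2023-01", "2023-02", "2023-03", "2023-03"],
    "qty":     [120, 80, 95, 40],
})
transitions = pd.DataFrame({
    "old_code": ["A-100"],   # discontinued product
    "new_code": ["A-200"],   # its successor
})

# Remap discontinued codes to their successors so the new product
# inherits the full sales history instead of starting from scratch
code_map = dict(zip(transitions["old_code"], transitions["new_code"]))
sales["product"] = sales["product"].replace(code_map)

# Aggregate back to one row per product and month
history = sales.groupby(["product", "month"], as_index=False)["qty"].sum()
print(history)
```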
Confirmed Future Orders
In most Make-to-Order (MTO) businesses, clients place orders in advance, often based on a contractual delivery lead time. This project is no exception: as shown in the graph below, 40% of volumes are ordered at least 42 days in advance. These confirmed future orders provide the most valuable insight into future demand. In the following sections, we’ll discuss how our model uses them to improve forecasts (and how our approach differs from the usual one).
Business Drivers and Model Features
When building our machine learning models, we aim to use the fewest possible features by selecting only those that truly drive accuracy — less is often more. In this project, a key focus was making the most of the information provided by confirmed future orders, or the lack thereof.
Confirmed Future Orders
The most valuable business driver in this project is the confirmed future order book. Unlike traditional statistical models or ERP planning engines, which rely on fixed rules, our approach treats confirmed future orders as a predictive signal of expected future demand.
To illustrate how forecasting is commonly handled in systems without machine learning, consider a common rule used in planning engines:
Final Forecast = max(Statistical Forecast, Confirmed Orders)
Here’s a dummy example:
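The numbers below are purely illustrative (a single product over four future periods) and simply apply the rule above:

```python
# Illustrative numbers only: statistical forecast vs. confirmed orders
# for four future months of a single product-client combination.
statistical_forecast = [100, 100, 100, 100]
confirmed_orders     = [130,  60,   0,   0]

# The planning-engine rule: keep whichever is higher in each period
final_forecast = [max(f, o) for f, o in zip(statistical_forecast, confirmed_orders)]
print(final_forecast)  # [130, 100, 100, 100]
```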
This rule ensures that confirmed orders are always included in the forecast, but fails to leverage the predictive power of confirmed orders on future expected demand (for example, if a client just placed a big order, you might want to raise your long-term forecast).
Some systems introduce forecast consumption logic, redistributing confirmed orders across neighboring periods (a simplified sketch follows below).
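The sketch below is a generic, hypothetical illustration of such consumption logic, not the client’s planning system: confirmed orders first consume the statistical forecast of their own period, and any excess consumes the next period’s forecast.

```python
# Hypothetical forecast-consumption logic (illustrative only)
statistical_forecast = [100, 100, 100, 100]
confirmed_orders     = [130,  60,   0,   0]

consumed = statistical_forecast.copy()
for i, orders in enumerate(confirmed_orders):
    excess = orders - consumed[i]
    consumed[i] = max(consumed[i] - orders, 0)
    if excess > 0 and i + 1 < len(consumed):
        # Excess orders consume the next period's statistical forecast
        consumed[i + 1] = max(consumed[i + 1] - excess, 0)

# Final forecast = confirmed orders + whatever statistical forecast remains
final_forecast = [o + c for o, c in zip(confirmed_orders, consumed)]
print(final_forecast)  # [130, 70, 100, 100]
```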
While these methods represent a step forward, they fall short in several key scenarios:
- Edge cases: What happens when confirmed orders are slightly below the statistical forecast?
- Silence as a signal: If no orders are placed, should we lower the forecast, especially if the client usually places orders weeks in advance?
- Far-in-advance signals: If a confirmed order appears 10 weeks out, should forecasts in all prior weeks be reduced to reflect expected client inactivity?
Moreover, as you will see in the Results section, only adjusting forecasts upward (to reflect confirmed demand) without ever reducing them when expected orders are missing often leads to positively biased forecasts.
SupChains’ Approach
Our machine learning engine treats confirmed future orders like any other business driver: as a predictive signal of future demand. The model learns from historical ordering patterns at the client–product level to determine what can reasonably be expected based on the forecast horizon.
- When a client places a higher-than-usual order, the model may interpret it as a positive trend and increase forecasts in later periods, while potentially reducing forecasts in nearby weeks to avoid double-counting.
- When future order volume falls below expectations, the model may reduce forecasts, interpreting this as a likely softening in demand.
This approach eliminates hard-coded rules and allows forecasts to adjust dynamically, both upward and downward, in response to real business signals.
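To make this more concrete, here is a minimal sketch of the kind of order-book features such a model could use; the column names, horizon buckets, and numbers are illustrative assumptions, not the actual feature set of the project:

```python
import pandas as pd

# Illustrative order book: quantities already confirmed for future months,
# snapshotted at the moment the forecast is generated.
order_book = pd.DataFrame({
    "client":  ["C1", "C1", "C1", "C2"],
    "product": ["P1", "P1", "P1", "P9"],
    "month":   pd.to_datetime(["2025-02-01", "2025-03-01", "2025-05-01", "2025-02-01"]),
    "confirmed_qty": [50, 20, 10, 80],
})

snapshot_date = pd.Timestamp("2025-01-15")

# Horizon (in months) between the snapshot and each confirmed order
order_book["horizon"] = (
    (order_book["month"].dt.year - snapshot_date.year) * 12
    + (order_book["month"].dt.month - snapshot_date.month)
)

# Example features per client-product: confirmed volume per horizon bucket
features = (
    order_book
    .assign(bucket=pd.cut(order_book["horizon"], bins=[0, 2, 4, 12],
                          labels=["confirmed_1_2m", "confirmed_3_4m", "confirmed_5_12m"]))
    .pivot_table(index=["client", "product"], columns="bucket",
                 values="confirmed_qty", aggfunc="sum", fill_value=0, observed=False)
    .reset_index()
)
print(features)
```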
Seasonality
At the time of model development (Q2 2024), we had access to only 30 months of historical sales data, with several months reserved for testing, making seasonality estimation inherently difficult. To address this, we focused on designing model inputs (features) that could detect and project relevant seasonal patterns.
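As a generic illustration (not the project’s actual feature set), calendar features such as a cyclical month-of-year encoding are a common way to let a model pick up seasonality even from a short history:

```python
import numpy as np
import pandas as pd

months = pd.date_range("2022-01-01", "2024-06-01", freq="MS")
features = pd.DataFrame({"month": months})

# Cyclical encoding: month 12 and month 1 end up close to each other,
# which helps a model generalize seasonality from limited history.
features["month_sin"] = np.sin(2 * np.pi * features["month"].dt.month / 12)
features["month_cos"] = np.cos(2 * np.pi * features["month"].dt.month / 12)
print(features.head())
```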
What We Didn’t Do
Outliers
At SupChains, we don’t use ML-based or statistical techniques to detect outliers. Instead, we identify and exclude erroneous transactions based on business rules (such as inconsistent prices or stock clearances) and explain most deviations using business drivers, such as promotions or price changes.
Analyzing the demand variability of Make-to-Order products makes little sense: with only a few sales per year, most observations would be statistically flagged as outliers.
Segmentation
Furthermore, we do not use segmentation or clustering techniques. As explained in this article, we don’t see how they could add value to demand forecasting models. Our recent forecasting competition, VN1, also showed that very few top competitors used segmentation or outlier detection techniques, confirming the limited value of these practices, if any.
External Drivers
For this project, we didn’t use any external drivers (such as economic indicators), as we are not aware of projects that have successfully used external drivers to enrich forecasts. Generally, internal drivers (such as promotions, confirmed future orders, pricing, and sellouts) provide greater insight into future demand than external economic indicators.
Inventory, Promotions, Prices, Sellouts
Given the Make-to-Order (MTO) context, this project did not require inventory data. We identified unconstrained demand using the client’s originally requested shipment dates.[1] Promotions were also excluded, as the client does not run promotional campaigns. As most products are tailored for individual clients, price sensitivity analysis is also largely irrelevant.
[1] For MTS products, capturing unconstrained demand is usually the most challenging part of any demand forecasting project. This is why we invest time in collecting and cleaning inventory data to censor periods with shortages.
Delivering a Proof-of-Concept
Timeline
The proof of concept (POC) began with a three-month data cleaning phase, supported by weekly workshop sessions between SupChains and the client. We focused on data quality, business context, and forecast scope (i.e., which client–product combinations to include).
Following this, we conducted several dry runs of the forecasting engine to validate the forecasting scope and confirm inclusion criteria for product–client pairs.
A six-month POC phase followed, during which SupChains delivered monthly forecasts and hosted joint review sessions with the client. These reviews focused on assessing accuracy, interpreting model behavior, and refining model inputs and outputs. Specifically, we refined the project scope (including or excluding specific products or product families), discussed obvious forecast errors, and aligned on the horizon and granularity.
Iterating Models: Weekly vs Monthly Granularity
We initially created a forecasting model that generated weekly forecasts. During the testing phase, we realized that the client’s planning system aggregated weeks into months using Mondays as cut-off dates. This meant, for example, that the forecast for the week starting on Monday 31 March 2025 would be assigned entirely to March, even though most of that week falls in April. This didn’t align with how confirmed orders were structured on our side, which led to misaligned allocations and reduced accuracy.
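The snippet below sketches this cut-off logic: each weekly bucket is assigned to the month of its Monday, which is how the client’s system rolled weeks up into months (dates and volumes are illustrative).

```python
import pandas as pd

# Weekly forecast buckets, identified by their Monday start date
weeks = pd.to_datetime(["2025-03-17", "2025-03-24", "2025-03-31", "2025-04-07"])
weekly_forecast = pd.Series([25, 25, 25, 25], index=weeks)

# Aggregation rule: the whole week belongs to the month of its Monday,
# so the week starting 31 March 2025 is counted entirely in March,
# even though five of its seven days fall in April.
monthly_forecast = weekly_forecast.groupby(weekly_forecast.index.to_period("M")).sum()
print(monthly_forecast)
# 2025-03    75
# 2025-04    25
```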
We adjusted the weekly structure to align with the client’s aggregation logic. However, because the setup remained complex and difficult to manage, we also tested a simpler monthly model. Surprisingly, this not only reduced data complexity but also improved accuracy by 5%.
Learning Curve: S&OP and Forecasting Quality
Initially, our client’s S&OP process focused primarily on forecasting total volumes over a 6- to 18-month horizon, and they assessed their forecasting quality using the overall bias. As we began the project’s evaluation phase, we took the opportunity to refine how they monitored forecasting quality, shifting from a bias-only mindset to tracking both bias and absolute errors.
Tracking the absolute error tells us the magnitude of the forecast errors — how large the deviations are between forecast and actual demand — regardless of direction. Bias, on the other hand, reveals the direction of the error, helping us understand whether we consistently over- or underestimate demand. Tracking both allows us to evaluate the performance more clearly.
Results
Against Statistical Benchmarks
At the core of our methodology for evaluating forecasts lies the Forecast Value Added (FVA) framework: rather than measuring the accuracy of our model in isolation, we compare it against various benchmarks. We advocate tracking FVA as the #1 practice that supply chain leaders should implement.
To evaluate our forecasting models, we created out-of-sample forecasts from December 2024 to March 2025 and evaluated them against actual orders from January to April 2025. We compared six different approaches:
- A 12-month moving average (MA12), and a starred version (MA12*) enriched with the confirmed orders known at the time the forecast was made.
- A statistical engine (Statistical) relying mostly on exponential smoothing models, and a starred version (Statistical*) using confirmed orders.
- Our initial weekly ML model (Weekly ML)
- Our final monthly ML model (Monthly ML)
To measure forecasting quality, we advise against measuring solely accuracy (using metrics such as MAPE, MAE, or WMAPE). Instead, we track both accuracy (using MAE) and bias. To do so, we use the Score (MAE + |Bias|), as recommended in my book Demand Forecasting Best Practices.
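For reference, here is a minimal sketch of how such a Score can be computed; scaling both terms by total actual demand to express them as percentages is our assumption here, used purely for illustration:

```python
import numpy as np

def score(forecast, actual):
    """Score = MAE + |Bias|, both scaled by total demand (in %)."""
    forecast, actual = np.asarray(forecast, float), np.asarray(actual, float)
    errors = forecast - actual
    mae = np.abs(errors).sum() / actual.sum() * 100
    bias = errors.sum() / actual.sum() * 100
    return mae + abs(bias), mae, bias

# Illustrative numbers only
total_score, mae, bias = score([120, 80, 95], [100, 90, 110])
print(f"MAE: {mae:.1f}%  Bias: {bias:.1f}%  Score: {total_score:.1f}%")
```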
By reducing the score from 83.1% (benchmark: MA12*) to 65.2%, SupChains delivered an added value of 22%.
As explained in Data Science for Supply Chain Forecasting, tracking accuracy alone will mechanically promote under-forecasting.
We compute the added value as the percentage score reduction of one model compared to another. In this case, 21.5% = 1 − 65.2% / 83.1%.
We also measured the variability of our models. That is, how much forecasts (from the same model) change from one month to the next. Lower values indicate stable outputs from one iteration to the next. As always, the straight 12-month moving average enjoys the highest stability, followed by our monthly model (“Monthly ML”), which delivers more accurate, less biased, and more stable forecasts than any other model.
Technically, to compute the variability, we take the overlapping periods of two consecutive forecast sets and calculate the usual score metric (Score = MAE + |Bias|), treating one forecast as the actuals and the other as the prediction. We then scale the metric by the average of the two forecasts (only keeping the overlapping periods).
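A small sketch of that computation, assuming two consecutive forecast sets stored as pandas Series indexed by month:

```python
import pandas as pd

def variability(previous_forecast: pd.Series, current_forecast: pd.Series) -> float:
    """Score (MAE + |Bias|) between two consecutive forecast sets,
    computed on overlapping periods and scaled by their average volume."""
    overlap = previous_forecast.index.intersection(current_forecast.index)
    prev, curr = previous_forecast[overlap], current_forecast[overlap]
    errors = curr - prev                    # treat one forecast set as the "actuals"
    scale = (prev.sum() + curr.sum()) / 2   # average of the two forecasts
    mae = errors.abs().sum() / scale
    bias = errors.sum() / scale
    return (mae + abs(bias)) * 100          # in %

# Illustrative example: forecasts generated in January vs. February
jan = pd.Series([100, 110, 120, 130], index=pd.period_range("2025-02", periods=4, freq="M"))
feb = pd.Series([105, 115, 118],      index=pd.period_range("2025-03", periods=3, freq="M"))
print(f"{variability(jan, feb):.1f}%")
```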
We use LightGBM as the core algorithm for our model. Based on our experience and international forecasting competitions, it remains one of the most effective tools for supply chain forecasting. However, success doesn’t come from the algorithm alone; the key differentiator lies in how data is pre-processed, transformed, and selected. At SupChains, we place a strong emphasis on feature engineering, leveraging domain knowledge to transform raw data into meaningful inputs for our models.
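For readers who want a starting point, here is a bare-bones sketch of training a LightGBM regressor on such a feature table; the feature names, synthetic data, and hyperparameters are placeholders, not the project’s actual pipeline:

```python
import lightgbm as lgb
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Placeholder feature table: one row per product-client-month-horizon
X = pd.DataFrame({
    "horizon":        rng.integers(1, 13, 1000),  # months ahead
    "confirmed_1_2m": rng.poisson(30, 1000),      # order-book features
    "confirmed_3_4m": rng.poisson(10, 1000),
    "month_sin":      np.sin(2 * np.pi * rng.integers(1, 13, 1000) / 12),
    "sales_last_12m": rng.poisson(200, 1000),
})
y = rng.poisson(20, 1000)  # placeholder demand

model = lgb.LGBMRegressor(
    n_estimators=500,
    learning_rate=0.05,
    objective="regression_l1",  # L1 objective pairs naturally with MAE evaluation
)
model.fit(X, y)
print(model.predict(X.head()))
```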
Added Value of Including Future Orders
To assess the value of embedding confirmed future orders into our model, we ran another version of our monthly model without any input related to future orders (see ML (No Orders) below). As with the other models, we created a starred version enriched with confirmed orders after the forecast was generated. Our main model (using features derived from confirmed orders) delivered a 25% lower score than this post-processed no-future-orders version.
Enriching a 12-month moving average and a statistical engine with future orders resulted in an error score drop of 25 to 30%, but at the cost of overforecasting.
To recap, using simple post-processing rules to include future orders reduced the score by 25 to 30%, and feeding future orders directly to our machine learning model resulted in a 25% extra reduction.
Conclusion and Learning Points
This project showed the value of using machine learning (ML) in Make-to-Order (MTO) environments. As with all machine learning models, the gains didn’t come from the algorithm alone (simply using machine learning models won’t deliver much value). The key success factors were how the data was structured, cleaned, and fed into the ML engine, especially confirmed future orders, which proved to deliver an extra 25% added value when used properly by the engine. Feeding confirmed future orders into a forecast using post-processing rules helps, but embedding future orders directly as features in the model delivers significantly better results.
The overall proof-of-concept took a bit less than a year (including 3 months of data cleaning, 1 month of model creation, and 6 months of live testing). We had to refine our approach twice throughout the live-testing phase, first realizing that the translation from weekly to monthly forecasts didn’t work and then discovering that monthly forecasts delivered more accurate and stable results than weekly ones.
One of the key success factors of this project was the ability to deliver unbiased forecasts for mid- to long-term planning. Simple post-processing rules that incorporate confirmed future orders tend to push forecasts upwards, mechanically leading to systematic overforecasting. In contrast, our model utilized these inputs to adjust forecasts both upward and downward, resulting in more accurate, stable, and less biased forecasts.
Acknowledgments
Scott Hawes, Rishi Bawdekar, Joaquin Ruiz, Richard Maestas, Zishan Yusuf, Quan Pham, Thamin Rashid, Matt Drake, Axel Alfaro, Konrad Grondek, Nural Efe, Akshay Basrur, Evgenii Antipov, Sebastian Bello