Why does a trading probability look calibrated in sample but fail live?

A trading probability can look well calibrated in sample and then fail live because the regime, label process, data timing, execution conditions, or class balance changed in ways the calibration layer did not learn to handle.

What to remember

  • Class balance shifts change what a given score bucket really means.
  • Execution delays break the link between forecast timestamp and realized outcome.
  • Feature behavior can drift even when the model still outputs familiar-looking numbers.

Short answer

Calibration can fail live even when the original fit was technically correct. The problem is usually that the world the calibrator learned from is not the world the strategy is now trading.

What changes between sample and live

Regime mix, market microstructure, and even the label definition can drift over time. A forecast that looked stable across one historical slice may become misaligned once latency increases, spreads widen, or the underlying opportunity becomes more crowded.

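  • Class balance shifts change what a given score bucket really means: if the base rate of winning outcomes falls, the same score corresponds to a lower realized hit rate.
  • Execution delays break the link between forecast timestamp and realized outcome: the label was measured from the forecast time, but the position is entered later and at a different price.
  • Feature behavior can drift even when the model still outputs familiar-looking numbers, so a stable score distribution is not evidence that the mapping from score to outcome is intact.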
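
To make the class balance point concrete, here is a minimal sketch of the standard prior-shift correction. It assumes that only the base rate of positive outcomes changed and that the score distribution within each class stayed the same, which is a strong assumption in live trading. The function name adjust_for_base_rate, its parameters, and the numbers are hypothetical and chosen only for illustration.

```python
def adjust_for_base_rate(p, train_rate, live_rate):
    """Re-express a calibrated probability under a different base rate.

    Assumes pure prior shift: the share of positive outcomes moved from
    train_rate to live_rate, while the score distribution inside each
    class is unchanged. This is an illustrative assumption, not a claim
    about how any particular strategy behaves.
    """
    # Reweight the positive and negative evidence by the ratio of base rates.
    pos = p * live_rate / train_rate
    neg = (1.0 - p) * (1.0 - live_rate) / (1.0 - train_rate)
    return pos / (pos + neg)


if __name__ == "__main__":
    # Hypothetical numbers: calibrated at a 50% base rate, live rate is 40%.
    for p in (0.55, 0.60, 0.70):
        print(p, round(adjust_for_base_rate(p, 0.50, 0.40), 3))
```

Under these illustrative numbers, a score of 0.60 that was calibrated at a 50% base rate corresponds to roughly 0.50 once the base rate falls to 40%. The score bucket looks the same, but it no longer carries an edge.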

What to monitor

Track rolling reliability by score bucket, not just aggregate performance: over a trailing window, compare the mean forecast probability in each bucket with the realized hit rate. A strategy can keep making money for a while even as the calibration underneath it deteriorates, which is exactly why the monitoring layer matters.
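
One way to sketch that monitoring is a per-bucket reliability table recomputed over a trailing window of forecasts. The function names, the equal-width buckets, and the window measured in number of forecasts are all assumptions made for this example; a real monitor might bucket by quantile or window by calendar time instead.

```python
from collections import defaultdict


def reliability_by_bucket(probs, outcomes, n_buckets=5):
    """Return {bucket: (mean_forecast, realized_rate, count)}.

    Buckets are equal-width slices of [0, 1]; outcomes are 0 or 1.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for p, y in zip(probs, outcomes):
        # Clamp so that p == 1.0 lands in the top bucket.
        b = min(int(p * n_buckets), n_buckets - 1)
        sums[b][0] += p
        sums[b][1] += y
        sums[b][2] += 1
    return {
        b: (s[0] / s[2], s[1] / s[2], s[2])
        for b, s in sorted(sums.items())
    }


def rolling_reliability(probs, outcomes, window, n_buckets=5):
    """One reliability table per trailing window of `window` forecasts."""
    tables = []
    for end in range(window, len(probs) + 1):
        start = end - window
        tables.append(
            reliability_by_bucket(probs[start:end], outcomes[start:end], n_buckets)
        )
    return tables
```

A widening gap between mean_forecast and realized_rate in any bucket is the signal to watch. The count matters as well, because a thinly populated bucket will show large gaps from noise alone.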

How to reduce surprise

Use walk-forward slices, freeze the calibration rule before each forward window, and compare those results with paper-trading behavior once the strategy is live. That will not remove drift, but it makes the drift legible sooner and reduces the temptation to repair it with hindsight.
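
The walk-forward discipline described above can be sketched as follows. This is a minimal illustration, not a recommended calibrator: it uses simple histogram binning as a stand-in for whatever calibration rule the strategy actually uses, and every function name, the bin count, and the window sizes are hypothetical. The point it shows is that the rule is fitted only on the earlier window, stored as an immutable tuple, and then scored on the forward window without being touched.

```python
def fit_bin_calibrator(scores, outcomes, n_bins=5):
    """Histogram binning: map each score bin to its hit rate in the fit window."""
    hits = [0.0] * n_bins
    counts = [0] * n_bins
    for s, y in zip(scores, outcomes):
        b = min(int(s * n_bins), n_bins - 1)
        hits[b] += y
        counts[b] += 1
    overall = sum(hits) / max(sum(counts), 1)
    # Empty bins fall back to the overall rate (an arbitrary choice for this sketch).
    # Returning a tuple keeps the fitted rule frozen.
    return tuple(
        hits[b] / counts[b] if counts[b] else overall for b in range(n_bins)
    )


def apply_calibrator(table, scores):
    """Look up the frozen bin rate for each raw score."""
    n_bins = len(table)
    return [table[min(int(s * n_bins), n_bins - 1)] for s in scores]


def brier(probs, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def walk_forward(scores, outcomes, fit_size, test_size):
    """Fit on one window, score the next window, then roll forward."""
    results = []
    start = 0
    while start + fit_size + test_size <= len(scores):
        fit_end = start + fit_size
        test_end = fit_end + test_size
        table = fit_bin_calibrator(scores[start:fit_end], outcomes[start:fit_end])
        probs = apply_calibrator(table, scores[fit_end:test_end])
        results.append({
            "fit": (start, fit_end),
            "test": (fit_end, test_end),
            "brier": brier(probs, outcomes[fit_end:test_end]),
        })
        start += test_size
    return results
```

Keeping the per-window scores, rather than only their average, is what makes drift visible: a run of forward windows that score worse than their fit windows is harder to explain away after the fact.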