How do you calibrate a trading probability?

Calibrating a trading probability means making the forecast line up with realized outcomes or net expectancy well enough that a stated confidence level actually means something after costs and delay.

What to remember

  • Use held-out buckets to compare predicted confidence with realized behavior.
  • Check the forecast against net outcomes, not only gross outcomes, if costs drive the trade decision.
  • Separate by regime or liquidity state when one global curve hides obvious drift.

Short answer

A trading probability is calibrated when the numbers it outputs match the outcomes that matter downstream. If a model says a setup has a 70 percent chance of success, then roughly 70 percent of the trades in that bucket should succeed over a relevant validation window, not just in a polished in-sample chart.
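The bucket comparison described above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the function name `reliability_table`, the equal-width bucketing, and the sample numbers are all assumptions made for the example.

```python
from collections import defaultdict


def reliability_table(probs, outcomes, n_buckets=10):
    """Compare mean predicted probability with realized hit rate per bucket.

    probs    -- predicted success probabilities in [0, 1]
    outcomes -- 1 if the trade succeeded, 0 otherwise (held-out data)
    """
    buckets = defaultdict(list)
    for p, y in zip(probs, outcomes):
        # Equal-width buckets; p == 1.0 is placed in the top bucket.
        idx = min(int(p * n_buckets), n_buckets - 1)
        buckets[idx].append((p, y))

    rows = []
    for idx in sorted(buckets):
        pairs = buckets[idx]
        n = len(pairs)
        mean_pred = sum(p for p, _ in pairs) / n
        hit_rate = sum(y for _, y in pairs) / n
        rows.append({
            "bucket": idx,
            "n": n,
            "mean_pred": mean_pred,
            "hit_rate": hit_rate,
            "gap": hit_rate - mean_pred,  # positive: model was too cautious
        })
    return rows


# Made-up held-out sample, for illustration only.
probs = [0.72, 0.75, 0.71, 0.78, 0.22, 0.25]
outcomes = [1, 1, 0, 1, 0, 0]
for row in reliability_table(probs, outcomes):
    print(row)
```

A bucket whose `gap` stays near zero on held-out data is behaving as advertised. With only a handful of trades per bucket the gap is mostly noise, so the `n` column deserves as much attention as the gap itself.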

In trading, calibration usually has to go one step further than generic machine learning. The useful question is not only whether the direction was right, but whether the forecast was reliable enough to support a threshold, a size decision, or an expected-value comparison after friction.
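One way to see why friction raises the bar is to write the expected-value comparison out. The sketch below assumes a simplified trade with a fixed average win, a fixed average loss, and a flat cost per trade. Real payoffs are rarely this tidy, and the function names and numbers are hypothetical.

```python
def net_expected_value(p_win, avg_win, avg_loss, cost):
    """Expected value per trade after a flat cost (all in the same units)."""
    return p_win * avg_win - (1.0 - p_win) * avg_loss - cost


def breakeven_probability(avg_win, avg_loss, cost):
    """Win probability at which net expected value is exactly zero.

    Solves p * avg_win - (1 - p) * avg_loss - cost = 0 for p.
    A result above 1.0 means no win probability can cover the cost.
    """
    return (avg_loss + cost) / (avg_win + avg_loss)


# Illustrative assumptions: symmetric 1-unit win and loss, 0.1-unit cost.
threshold = breakeven_probability(avg_win=1.0, avg_loss=1.0, cost=0.1)
stated = net_expected_value(0.70, 1.0, 1.0, 0.1)    # what the score claims
realized = net_expected_value(0.52, 1.0, 1.0, 0.1)  # if the bucket really hits 52%
print(threshold, stated, realized)
```

Under these assumed numbers, a bucket labelled 70 percent that actually wins 52 percent of the time sits below the break-even probability, so a threshold built on the stated score would admit trades with negative net expectancy.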

What you calibrate against

The target depends on what the score is meant to drive. Some teams calibrate against realized hit rate for a binary decision. Others care more about return buckets, expected value, or payoff-weighted outcomes because a correct tiny trade and a correct large trade should not count the same.
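The choice of target changes the number the score is compared against. The following sketch contrasts a gross hit rate, a net hit rate, and one possible payoff-weighted variant. The flat per-trade cost, the weighting scheme, and the sample returns are assumptions for illustration, not a standard definition.

```python
def gross_and_net_hit_rate(gross_returns, cost_per_trade):
    """Share of trades that win before costs and after a flat per-trade cost."""
    n = len(gross_returns)
    gross_hits = sum(r > 0 for r in gross_returns)
    net_hits = sum(r - cost_per_trade > 0 for r in gross_returns)
    return gross_hits / n, net_hits / n


def payoff_weighted_hit_rate(net_returns):
    """Winning return as a share of total absolute return.

    A large correct trade counts for more than a tiny correct trade.
    """
    total = sum(abs(r) for r in net_returns)
    wins = sum(r for r in net_returns if r > 0)
    return wins / total if total else 0.0


# Made-up per-trade returns, for illustration only.
gross = [0.004, 0.0008, -0.002, 0.0005]
cost = 0.001
net = [r - cost for r in gross]

print(gross_and_net_hit_rate(gross, cost))
print(payoff_weighted_hit_rate(net))
```

In this made-up sample most trades are right on direction, yet only one clears the cost, which is why a score calibrated against gross outcomes can look reliable while the net target tells a different story.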

How teams usually do it

A practical workflow is to start with reliability plots and bucket tests, then apply a monotonic calibration layer such as Platt scaling or isotonic regression on held-out windows. The exact method matters less than the discipline of fitting it on past data and freezing it before the next evaluation slice.
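The fit-then-freeze discipline can be illustrated with a small isotonic calibrator. This is a dependency-free sketch of the pool-adjacent-violators algorithm that returns a plain step function. Library implementations differ in details such as interpolation between steps, and the names `fit_isotonic` and `apply_isotonic` as well as the sample data are hypothetical.

```python
from bisect import bisect_left


def fit_isotonic(scores, outcomes):
    """Fit a non-decreasing step function from raw score to hit rate.

    Returns immutable (uppers, values): scores up to uppers[i] that fall
    above uppers[i - 1] map to values[i].
    """
    blocks = []  # each block: [sum of outcomes, count, highest score covered]
    for score, y in sorted(zip(scores, outcomes)):
        if blocks and blocks[-1][2] == score:
            # Tied scores must share one fitted value.
            blocks[-1][0] += y
            blocks[-1][1] += 1
        else:
            blocks.append([float(y), 1, score])
        # Pool adjacent blocks while the fit would otherwise decrease.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    uppers = tuple(b[2] for b in blocks)
    values = tuple(b[0] / b[1] for b in blocks)
    return uppers, values


def apply_isotonic(model, score):
    """Map a raw score through a previously fitted, unchanged model."""
    uppers, values = model
    idx = bisect_left(uppers, score)
    return values[min(idx, len(values) - 1)]


# Fit on a past window only (made-up data), then leave the model untouched.
past_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
past_outcomes = [0, 0, 1, 0, 0, 1, 1, 1]
frozen = fit_isotonic(past_scores, past_outcomes)

# The next evaluation slice is scored with the frozen model, never refit on itself.
next_scores = [0.45, 0.9]
print([apply_isotonic(frozen, s) for s in next_scores])
```

Returning tuples is a small design choice that makes accidental refitting or mutation harder. The calibrator applied to the next slice is then exactly the one fitted on past data.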

That matters on Alphora because the model score is usually not the end product. It feeds a threshold, a size rule, or a multi-sleeve portfolio decision, so a badly calibrated score can pollute several layers at once.

What to validate before trusting it

Calibration is only useful if it holds up over time. Check whether adjacent windows produce similar curves, whether reliability breaks under higher costs or slower execution, and whether the same confidence bucket still means the same thing once the strategy moves from backtest into walk-forward and paper-trading phases.
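A simple way to compare curves across nearby windows is to compute the realized hit rate per bucket in each window and look at the largest difference. The sketch below is one possible check under stated assumptions: the bucket count, the function names, and the sample windows are invented for the example, and what counts as too much drift remains a judgment call.

```python
def bucket_hit_rates(probs, outcomes, n_buckets=5):
    """Realized hit rate per equal-width probability bucket."""
    sums = [0.0] * n_buckets
    counts = [0] * n_buckets
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_buckets), n_buckets - 1)
        sums[idx] += y
        counts[idx] += 1
    return {i: sums[i] / counts[i] for i in range(n_buckets) if counts[i]}


def max_curve_drift(window_a, window_b, n_buckets=5):
    """Largest hit-rate difference over buckets populated in both windows.

    Each window is a (probs, outcomes) pair. Returns None when the two
    windows share no populated bucket.
    """
    a = bucket_hit_rates(window_a[0], window_a[1], n_buckets)
    b = bucket_hit_rates(window_b[0], window_b[1], n_buckets)
    shared = set(a) & set(b)
    if not shared:
        return None
    return max(abs(a[i] - b[i]) for i in shared)


# Two adjacent validation windows with made-up outcomes.
scores = [0.7, 0.7, 0.7, 0.7, 0.3, 0.3]
earlier = (scores, [1, 1, 1, 0, 0, 0])
later = (scores, [1, 1, 0, 0, 1, 0])
print(max_curve_drift(earlier, later))
```

The same comparison can be rerun with outcomes recomputed under a higher assumed cost or a longer execution delay, which gives a rough view of whether a confidence bucket keeps its meaning once conditions get less friendly.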