Calibration = does the model say what it means? A well-calibrated model that outputs 70% confidence should actually be right 70% of the time on that batch. If the model says 80% but only wins 60% of those, it's overconfident, and your sizing will be too aggressive. If it says 60% but wins 80%, it's underconfident: you're leaving edge on the table.
This page bins predictions by confidence (50-60%, 60-70%, etc.) and plots actual win rate per bin against the diagonal. Closer to diagonal = better calibration.
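A minimal sketch of that binning, assuming each rated prediction arrives as a (confidence, won) pair; the `bin_calibration` name and the exact bin edges are illustrative, not the page's internals:

```python
import numpy as np

def bin_calibration(conf, won, edges=(0.50, 0.60, 0.70, 0.80, 0.90, 1.00)):
    """Per-bin predicted vs actual win rate.

    conf: stated confidences in [0.5, 1.0]; won: 0/1 outcomes.
    """
    conf, won = np.asarray(conf, float), np.asarray(won, float)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # last bin is closed on the right so 100% confidence isn't dropped
        mask = (conf >= lo) & (conf <= hi if hi == edges[-1] else conf < hi)
        if mask.any():
            rows.append({
                "bin": f"{lo:.0%}-{hi:.0%}",
                "n": int(mask.sum()),
                "predicted": float(conf[mask].mean()),  # avg stated confidence
                "actual": float(won[mask].mean()),      # realized hit rate
            })
    return rows
```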
Summary stats shown at the top of the page:
  • Brier Score: lower = better calibration
  • ECE: Expected Calibration Error (sample-weighted gap between predicted and actual win rate per bin)
  • Avg Overconfidence: predicted % - actual % (positive = overconfident)
  • Total Rated: number of predictions with a recorded win/loss outcome
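How those four numbers could be computed; a sketch assuming the same arrays of stated confidence and 0/1 outcomes, with the `calibration_summary` name and the 0.5-1.0 bin range as assumptions:

```python
import numpy as np

def calibration_summary(conf, won, n_bins=5):
    """Brier score, ECE, and average overconfidence over rated predictions."""
    conf, won = np.asarray(conf, float), np.asarray(won, float)
    brier = float(np.mean((conf - won) ** 2))   # squared error of the stated probability
    edges = np.linspace(0.5, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(conf, edges) - 1, 0, n_bins - 1)
    # ECE: per-bin |predicted - actual| gap, weighted by the share of samples in the bin
    ece = sum(
        (idx == b).mean() * abs(conf[idx == b].mean() - won[idx == b].mean())
        for b in range(n_bins) if (idx == b).any()
    )
    return {
        "brier": brier,
        "ece": float(ece),
        "avg_overconfidence": float(conf.mean() - won.mean()),  # predicted % - actual %
        "total_rated": int(conf.size),
    }
```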
📊 Reliability Diagram
🔵 dots = actual hit rate per confidence bin · gray dashed = perfect calibration · circles sized by sample count
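To reproduce the diagram offline, a matplotlib sketch; the `reliability_diagram` helper and its styling are approximations, not the page's actual rendering:

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_diagram(conf, won, n_bins=5):
    """Dots = actual hit rate per bin, sized by sample count; gray dashed = perfect calibration."""
    conf, won = np.asarray(conf, float), np.asarray(won, float)
    edges = np.linspace(0.5, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(conf, edges) - 1, 0, n_bins - 1)
    pred, act, n = [], [], []
    for b in range(n_bins):
        m = idx == b
        if m.any():
            pred.append(conf[m].mean()); act.append(won[m].mean()); n.append(int(m.sum()))
    plt.plot([0.5, 1.0], [0.5, 1.0], "--", color="gray", label="perfect calibration")
    plt.scatter(pred, act, s=[20 * c for c in n], alpha=0.7, label="per-bin hit rate")
    plt.xlabel("predicted confidence"); plt.ylabel("actual win rate")
    plt.legend(); plt.show()
```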
📋 Per-Bin Calibration
Table columns: BIN · N · PREDICTED · ACTUAL · STATUS
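One plausible way the STATUS column could be derived; the `tol` and `min_n` thresholds here are assumed for illustration, not the page's exact rule:

```python
def bin_status(predicted, actual, n, tol=0.05, min_n=10):
    """Classify one calibration bin. tol and min_n are assumed thresholds."""
    if n < min_n:
        return "low sample"        # too few rated calls to judge the bin
    gap = predicted - actual
    if gap > tol:
        return "overconfident"     # model claims more than it delivers
    if gap < -tol:
        return "underconfident"    # model delivers more than it claims
    return "ok"
```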
🔧 If poorly calibrated
  • Consistently overconfident (predicted > actual): Reduce sizing on high-confidence calls by 30-50% (see the sizing sketch after this list). Run a Full Retrain: the model is likely overfitting.
  • Consistently underconfident (predicted < actual): You can size up on high-confidence calls. The model is being more conservative than its track record warrants.
  • Calibrated on low but not high confidence (or vice versa): Common when training data is unbalanced and most outcomes cluster near 50%. More data fixes it.
  • Brier Score > 0.25: Roughly equivalent to always guessing 50%. The model needs more training.
  • Brier Score 0.18-0.25: Reasonable for trading models. Most professional systems live here.
  • Brier Score < 0.15: Excellent. The model is well-calibrated.
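The sizing advice in the first bullet, expressed as a hedged sketch; `high_conf` and `haircut` are hypothetical parameters, with the 40% haircut sitting inside the suggested 30-50% band:

```python
def adjusted_size(base_size, confidence, overconfident, high_conf=0.75, haircut=0.40):
    """Cut size on high-confidence calls while the model runs overconfident."""
    if overconfident and confidence >= high_conf:
        return base_size * (1.0 - haircut)  # 40% reduction, inside the 30-50% band
    return base_size

# Example: a 100-unit position on an 85%-confidence call shrinks to 60 units
# while the calibration page shows the model running overconfident.
print(adjusted_size(100, 0.85, overconfident=True))  # -> 60.0
```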