Standard hard-label training: when a trade resolves with y=1 (a win), the gradient pulls the model toward predicting 1.0 on those features. The prediction loss is CE(p, 1) = -log(p), which is zero only at p = 1, so every step keeps pushing the prediction toward saturation.
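A quick numerical sketch of why the hard label saturates: the loss -log(p) stays positive for any p < 1, so gradient steps never stop pushing. (Illustrative values only.)

```python
import math

# Hard-label cross-entropy on a winning trade (y = 1): loss = -log(p).
# The loss reaches zero only at p = 1.0, so training keeps pushing
# the prediction toward saturation.
for p in (0.6, 0.9, 0.99, 0.999):
    loss = -math.log(p)
    print(f"p={p}: CE={loss:.4f}")
```

Even at p = 0.999 the loss is still positive, so the pull toward 1.0 never lets up.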
Knowledge distillation (Hinton et al., 2015): use a TEACHER model to provide a soft target. The teacher knows that even winning examples carry uncertainty — instead of saying "y = 1.0," it says "y ≈ 0.78 given these features." That soft value carries more information than the hard label.
What's the teacher here? The SWA-averaged weights. SWA (stochastic weight averaging) sits in a flatter region of the loss surface than the live model, so its predictions are smoother and better calibrated.
See SWA tracker.
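The SWA teacher referenced above can be maintained as a simple running mean of the live model's weight snapshots. This is a minimal sketch; the names (`update_swa`, `swa`) are illustrative, not from the SWA tracker itself.

```python
# Running average of weight snapshots: the core of an SWA teacher.
def update_swa(swa_weights, live_weights, n_averaged):
    """Fold the live weights into the running average of n_averaged snapshots."""
    return [
        (sw * n_averaged + lw) / (n_averaged + 1)
        for sw, lw in zip(swa_weights, live_weights)
    ]

swa = [0.0, 0.0]
for step, live in enumerate([[1.0, 2.0], [3.0, 4.0]]):
    swa = update_swa(swa, live, step)  # after the first call, swa == live
print(swa)  # [2.0, 3.0] — the mean of the two snapshots
```

Averaging in weight space (rather than prediction space) is what places the teacher in the flatter minimum.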
The distillation step: After the main training step on hard label y, do one extra step:
soft_target = α × teacher.predict(features) + (1 − α) × y
model.train(features, soft_target)
With α = 0.3 the model is gently pulled toward consistency with the SWA teacher while still learning primarily from the hard labels. This regularizes the live model against drifting too far from its stable averaged version.
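The two-step procedure above can be sketched as follows. `StubModel` is a toy stand-in; the real model's `predict`/`train` interface is assumed, not taken from the doc.

```python
ALPHA = 0.3  # weight on the SWA teacher's soft prediction

class StubModel:
    """Toy stand-in that records training targets; the real interface is assumed."""
    def __init__(self):
        self.targets = []
    def predict(self, features):
        return 0.78  # pretend SWA-teacher output for this example
    def train(self, features, target):
        self.targets.append(target)

def train_on_resolution(model, teacher, features, y):
    """One main step on the hard label, then one distillation step."""
    model.train(features, y)  # main step: hard label (0 or 1)
    soft_target = ALPHA * teacher.predict(features) + (1 - ALPHA) * y
    model.train(features, soft_target)  # extra step: blended target

model, teacher = StubModel(), StubModel()
train_on_resolution(model, teacher, features=[0.1], y=1.0)
print(model.targets)  # [1.0, ≈0.934]
```

For a winning trade where the teacher says 0.78, the distill target is 0.3 × 0.78 + 0.7 × 1.0 ≈ 0.934 — close to the hard label, but nudged toward the teacher's uncertainty.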
Trade-off: this doubles the training compute per resolution (one main step plus one distillation step). For a brain running in a single browser tab that's fine, but it's why we keep it opt-in by default.