⚙ Configuration
Disabled by default because it doubles training compute per resolution. Enable when:
  • BSS is positive but plateaued and you want smoother convergence
  • SWA divergence (see swa-tracker) is consistently >0.5 — student is drifting too far from teacher
  • Calibration is poor and you want soft labels in addition to hard ones
α ∈ [0, 0.7]. Higher α = more teacher influence. Recommended: 0.3.
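As a sketch, the knob might be wired up roughly like this (TypeScript; the names DistillationConfig, enabled, alpha are illustrative, not the project's actual config keys):

  // Hypothetical config shape for the distillation feature.
  interface DistillationConfig {
    enabled: boolean;   // opt-in: off by default because it doubles per-resolution compute
    alpha: number;      // teacher weight in the soft target, kept within [0, 0.7]
  }

  const defaultDistillation: DistillationConfig = {
    enabled: false,
    alpha: 0.3,         // recommended default: mostly hard labels, gentle teacher pull
  };

  // Clamp alpha into the supported range before use.
  const clampAlpha = (a: number): number => Math.min(0.7, Math.max(0, a));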
📚 How distillation works here
Standard hard-label training: When a trade resolves with y=1 (win), the gradient pulls the model toward predicting 1.0 on those features. The prediction loss is CE(p, 1) = -log(p), which never reaches zero for p < 1, so the gradient keeps pushing predictions toward saturation.
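To make that concrete, here is a tiny sketch (plain TypeScript, not project code) of the hard-label cross-entropy and how it stays nonzero as the prediction approaches 1:

  // Binary cross-entropy against a hard label y = 1: CE(p, 1) = -log(p).
  function hardLabelLoss(p: number): number {
    return -Math.log(p);
  }

  hardLabelLoss(0.78); // ≈ 0.248
  hardLabelLoss(0.95); // ≈ 0.051
  hardLabelLoss(0.99); // ≈ 0.010: still nonzero, still pushing toward 1.0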

Knowledge distillation (Hinton 2015): Use a TEACHER model to provide a soft target. The teacher knows that even winning examples carry uncertainty — instead of saying "y=1.0," it says "y ≈ 0.78 based on the features." This is more information than the hard label.
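One way to see the extra information: with a sigmoid output, the gradient of cross-entropy with respect to the logit is simply (prediction − target), so a soft target stops pulling as soon as the prediction matches the teacher's estimate (illustrative sketch, not project code):

  // Gradient of binary cross-entropy w.r.t. the pre-sigmoid logit: (p - target).
  function logitGradient(p: number, target: number): number {
    return p - target;
  }

  logitGradient(0.78, 1.0);  // -0.22: the hard label keeps pushing the prediction up
  logitGradient(0.78, 0.78); //  0.00: the soft target says this is already about right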

What's the teacher here? The SWA-averaged weights. SWA sits in a flatter region of the loss surface than the live model, so its predictions are smoother and better calibrated. See the SWA tracker.
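For context, the teacher is just a running average of the live model's weights, roughly along these lines (a sketch of the standard SWA update; the real logic lives in the SWA tracker and these names are assumptions):

  // Equal-weight running average of parameters, as in SWA.
  // After n snapshots: w_swa = (w_swa * n + w_live) / (n + 1).
  function updateSwaWeights(swa: Float32Array, live: Float32Array, n: number): void {
    for (let i = 0; i < swa.length; i++) {
      swa[i] = (swa[i] * n + live[i]) / (n + 1);
    }
  }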

The distillation step: after the main training step on the hard label y, run one extra step on a blended soft target:
  soft_target = α × teacher.predict(features) + (1 − α) × y
  model.train(features, soft_target)
With α=0.3 the model is gently pulled toward consistency with the SWA teacher while still primarily learning from hard labels. This regularizes against drifting too far from the stable averaged version.
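Putting the two steps together, the per-resolution update might look like this sketch (the model.train / teacher.predict calls mirror the pseudocode above; the exact signatures are assumptions):

  // Hypothetical interfaces mirroring the pseudocode above.
  interface Predictor { predict(features: number[]): number; }
  interface Trainable extends Predictor { train(features: number[], target: number): void; }

  function trainOnResolution(
    model: Trainable,
    teacher: Predictor,   // SWA-averaged weights
    features: number[],
    y: 0 | 1,             // hard outcome of the resolved trade
    alpha = 0.3,
  ): void {
    // Step 1: the usual hard-label update.
    model.train(features, y);
    // Step 2: one extra step toward a blend of the teacher's prediction and the hard label.
    const softTarget = alpha * teacher.predict(features) + (1 - alpha) * y;
    model.train(features, softTarget);
  }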

Trade-off: doubles the training compute per resolution (one main step + one distill step). For a brain running on a single browser tab this is fine, but it's why we keep it opt-in by default.