⚙ Configuration
Disabled by default because it doubles training compute per resolution. Enable when:
  • BSS is positive but plateaued and you want smoother convergence
  • SWA divergence (see swa-tracker) is consistently >0.5 — student is drifting too far from teacher
  • Calibration is poor and you want soft labels in addition to hard ones
α ∈ [0, 0.7]. Higher α = more teacher influence. Recommended: 0.3.
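As a sketch, the knob might be wired up roughly like this (TypeScript; the names DistillationConfig, enabled, alpha are illustrative, not the project's actual config keys):

  // Hypothetical config shape for the distillation feature.
  interface DistillationConfig {
    enabled: boolean;   // opt-in: off by default because it doubles per-resolution compute
    alpha: number;      // teacher weight in the soft target, kept within [0, 0.7]
  }

  const defaultDistillation: DistillationConfig = {
    enabled: false,
    alpha: 0.3,         // recommended default: mostly hard labels, gentle teacher pull
  };

  // Clamp alpha into the supported range before use.
  const clampAlpha = (a: number): number => Math.min(0.7, Math.max(0, a));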
📚 How distillation works here
Standard hard-label training: When a trade resolves with y=1 (win), the gradient pulls the model toward predicting 1.0 on those features. The prediction loss is CE(p, 1) = -log(p), which never reaches zero for p < 1, so the gradient keeps pushing predictions toward saturation.
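To make that concrete, here is a tiny sketch (plain TypeScript, not project code) of the hard-label cross-entropy and how it stays nonzero as the prediction approaches 1:

  // Binary cross-entropy against a hard label y = 1: CE(p, 1) = -log(p).
  function hardLabelLoss(p: number): number {
    return -Math.log(p);
  }

  hardLabelLoss(0.78); // ≈ 0.248
  hardLabelLoss(0.95); // ≈ 0.051
  hardLabelLoss(0.99); // ≈ 0.010: still nonzero, still pushing toward 1.0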

Knowledge distillation (Hinton 2015): Use a TEACHER model to provide a soft target. The teacher knows that even winning examples carry uncertainty — instead of saying "y=1.0," it says "y ≈ 0.78 based on the features." This is more information than the hard label.
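One way to see the extra information: with a sigmoid output, the gradient of cross-entropy with respect to the logit is simply (prediction − target), so a soft target stops pulling as soon as the prediction matches the teacher's estimate (illustrative sketch, not project code):

  // Gradient of binary cross-entropy w.r.t. the pre-sigmoid logit: (p - target).
  function logitGradient(p: number, target: number): number {
    return p - target;
  }

  logitGradient(0.78, 1.0);  // -0.22: the hard label keeps pushing the prediction up
  logitGradient(0.78, 0.78); //  0.00: the soft target says this is already about right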

What's the teacher here? The SWA-averaged weights. SWA sits in a flatter region of the loss surface than the live model, so its predictions are smoother and better calibrated. See the SWA tracker.
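For context, the teacher is just a running average of the live model's weights, roughly along these lines (a sketch of the standard SWA update; the real logic lives in the SWA tracker and these names are assumptions):

  // Equal-weight running average of parameters, as in SWA.
  // After n snapshots: w_swa = (w_swa * n + w_live) / (n + 1).
  function updateSwaWeights(swa: Float32Array, live: Float32Array, n: number): void {
    for (let i = 0; i < swa.length; i++) {
      swa[i] = (swa[i] * n + live[i]) / (n + 1);
    }
  }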

The distillation step: after the main training step on the hard label y, run one extra step on a blended soft target:
  soft_target = α × teacher.predict(features) + (1 − α) × y
  model.train(features, soft_target)
With α=0.3 the model is gently pulled toward consistency with the SWA teacher while still primarily learning from hard labels. This regularizes against drifting too far from the stable averaged version.
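Putting the two steps together, the per-resolution update might look like this sketch (the model.train / teacher.predict calls mirror the pseudocode above; the exact signatures are assumptions):

  // Hypothetical interfaces mirroring the pseudocode above.
  interface Predictor { predict(features: number[]): number; }
  interface Trainable extends Predictor { train(features: number[], target: number): void; }

  function trainOnResolution(
    model: Trainable,
    teacher: Predictor,   // SWA-averaged weights
    features: number[],
    y: 0 | 1,             // hard outcome of the resolved trade
    alpha = 0.3,
  ): void {
    // Step 1: the usual hard-label update.
    model.train(features, y);
    // Step 2: one extra step toward a blend of the teacher's prediction and the hard label.
    const softTarget = alpha * teacher.predict(features) + (1 - alpha) * y;
    model.train(features, softTarget);
  }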

Trade-off: doubles the training compute per resolution (one main step + one distill step). For a brain running on a single browser tab this is fine, but it's why we keep it opt-in by default.