Entropy regularization · pushes model away from peaked outputs · alternative to label smoothing
⚙ Configuration
Disabled by default because it changes the loss surface. Label smoothing handles the same problem more conservatively; this is for advanced tuning when you want stronger anti-overconfidence pressure.
For each predicted probability p, the additive term to the standard (p − y) gradient at the current β is β × (p − 0.5) (derived under How it works below).
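As a quick illustration (a minimal sketch; β=0.05 and the probability grid are example values only), the size of this term can be tabulated directly:

```python
# Illustrative only: the extra gradient term β × (p − 0.5)
# for a few predicted probabilities at an example β = 0.05.
beta = 0.05
for p in (0.5, 0.7, 0.9, 0.95, 0.99):
    extra = beta * (p - 0.5)
    print(f"p = {p:.2f}  additive term = {extra:+.4f}")
```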
📚 How it works
Cross-entropy loss alone keeps pushing the model toward saturation: a correct prediction with p=0.95 still has a gradient (p − y = −0.05) pulling toward p=1.00, which a sigmoid output can only approach as the logit grows without bound. The result is overconfident outputs that don't generalize.
Confidence penalty (Pereyra et al. 2017) adds a Shannon-entropy term to the loss:
Loss = CE(p, y) − β × H(p)
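A minimal sketch of this loss for a binary sigmoid head, assuming PyTorch (the function name and the β default are illustrative, not part of this project's API):

```python
import torch
import torch.nn.functional as F

def confidence_penalty_bce(logits: torch.Tensor, targets: torch.Tensor,
                           beta: float = 0.05) -> torch.Tensor:
    # Standard binary cross-entropy, computed from the raw logits.
    ce = F.binary_cross_entropy_with_logits(logits, targets)
    # Shannon entropy H(p) of the predicted Bernoulli distribution.
    p = torch.sigmoid(logits)
    eps = 1e-7  # guards log(0) when the model saturates
    entropy = -(p * torch.log(p + eps) + (1.0 - p) * torch.log(1.0 - p + eps))
    # Loss = CE(p, y) − β × H(p): subtracting entropy rewards less peaked outputs.
    return ce - beta * entropy.mean()
```

Used as a drop-in replacement for plain BCE during training; with β=0 it reduces exactly to cross-entropy.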
Maximizing entropy pushes p toward 0.5 (the uniform distribution in the binary case). With respect to the logit z, the combined gradient becomes:
∂L/∂z ≈ (p − y) + β × (p − 0.5)
The β term pushes p toward 0.5 regardless of true label.
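To see the effect numerically (a sketch; the values of p, y, and β are example values only), compare the two terms on an already-confident correct prediction:

```python
# Combined gradient approximation from above: (p − y) + β × (p − 0.5).
p, y, beta = 0.95, 1.0, 0.10
ce_term = p - y                  # −0.05: still pulling toward p = 1.00
penalty_term = beta * (p - 0.5)  # +0.045: pulling back toward p = 0.5
print(ce_term + penalty_term)    # ≈ −0.005: the push toward saturation nearly cancels
```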
Relation to label smoothing: mathematically related but operationally different. Label smoothing changes the target; confidence penalty changes the loss surface. They can be combined; both reduce overconfidence.
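To make the operational difference concrete, a small sketch for the binary case (ε here is just an example value): label smoothing rewrites the target that cross-entropy sees, while the confidence penalty keeps the hard target and modifies the loss as shown above.

```python
def smoothed_target(y: float, eps: float = 0.1) -> float:
    # Label smoothing: soften the hard target, e.g. 1 -> 0.95 and 0 -> 0.05
    # at eps = 0.1, while the loss function itself stays unchanged.
    return y * (1.0 - eps) + 0.5 * eps
```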
When to enable: if Brier Skill Score is poor and conformal coverage is below 90% (indicating overconfidence), turning this on at β=0.05 typically improves calibration.
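A rough sketch of that check (the function name, inputs, and BSS threshold are assumptions, not part of this project; `covered` is taken to be a boolean array marking whether each conformal prediction set contained the true label):

```python
import numpy as np

def looks_overconfident(p, y, covered, target_coverage=0.90, bss_threshold=0.1):
    # Brier score and its skill relative to always predicting the base rate.
    brier = np.mean((p - y) ** 2)
    brier_ref = np.mean((np.mean(y) - y) ** 2)
    bss = 1.0 - brier / brier_ref
    # Empirical conformal coverage: fraction of sets containing the truth.
    coverage = np.mean(covered)
    # Heuristic from above: poor BSS plus coverage under the target suggests
    # enabling the penalty at β = 0.05 (the BSS cutoff is an assumption).
    return bss < bss_threshold and coverage < target_coverage
```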