⚙ Configuration
Disabled by default because it changes the loss surface. Label smoothing handles the same problem more conservatively; this is for advanced tuning when you want stronger anti-overconfidence pressure.
Range: [0, 0.5]. Larger β = stronger push away from peaked outputs.
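A minimal sketch of how such a setting could be validated before use (function and key names are hypothetical, not this product's API):

```python
def set_confidence_penalty(beta: float) -> dict:
    """Validate and store a hypothetical confidence-penalty setting.

    beta = 0.0 disables the penalty (the default); values in (0, 0.5]
    add increasingly strong pressure away from peaked outputs.
    """
    if not 0.0 <= beta <= 0.5:
        raise ValueError(f"beta must be in [0, 0.5], got {beta}")
    return {"confidence_penalty_beta": beta, "enabled": beta > 0.0}
```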
📊 Gradient adjustment preview
For each predicted probability, the additive term applied to the standard (p − y) gradient at the current β:
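The preview values can be reproduced offline with the first-order term β × (p − 0.5) used in this panel's derivation (function name is illustrative):

```python
def penalty_gradient_term(p: float, beta: float) -> float:
    # Additive adjustment to the standard (p - y) logit gradient,
    # using the first-order approximation beta * (p - 0.5).
    return beta * (p - 0.5)

# At beta = 0.1: p = 0.95 gets +0.045, while p = 0.5 gets 0
# (no push at the uniform distribution).
```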
📚 How it works
Cross-entropy loss alone keeps pushing the model toward saturation. Each correct prediction at p = 0.95 still carries a gradient pulling toward p = 1.00, which would require an infinite logit and is never reached. The result is overconfident outputs that don't generalize.

Confidence penalty (Pereyra et al., 2017) subtracts a Shannon-entropy bonus from the loss:

Loss = CE(p, y) − β × H(p)

Because entropy is maximized at the uniform distribution, minimizing this loss also pushes p toward 0.5. For a sigmoid output, the logit gradient becomes, to first order around p = 0.5:

∂L/∂z ≈ (p − y) + β × (p − 0.5)

The β term pushes p toward 0.5 regardless of the true label.
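A runnable sketch of the penalized loss for a single sigmoid output, together with the exact logit gradient that the β × (p − 0.5) expression approximates near p = 0.5 (names are illustrative):

```python
import math

def penalized_loss(p: float, y: float, beta: float) -> float:
    # CE(p, y) - beta * H(p) for a single binary prediction.
    ce = -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    entropy = -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
    return ce - beta * entropy

def grad_wrt_logit(p: float, y: float, beta: float) -> float:
    # Exact d(Loss)/dz with p = sigmoid(z): the entropy term contributes
    # beta * p * (1 - p) * log(p / (1 - p)), which linearizes to
    # beta * (p - 0.5) near p = 0.5.
    return (p - y) + beta * p * (1.0 - p) * math.log(p / (1.0 - p))
```

A finite-difference check on the logit confirms the exact form; the β × (p − 0.5) version is only the small-signal approximation.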

Relation to label smoothing: mathematically related but operationally different. Label smoothing changes the target; confidence penalty changes the loss surface. They can be combined; both reduce overconfidence.
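To make the operational distinction concrete: binary label smoothing only moves the target, while the loss stays plain cross-entropy (α = 0.1 here is illustrative):

```python
def smoothed_target(y: float, alpha: float = 0.1) -> float:
    # Label smoothing replaces the hard target with a softened one:
    # a 1.0 label becomes 1 - alpha/2 and a 0.0 label becomes alpha/2.
    return y * (1.0 - alpha) + alpha / 2.0
```

The confidence penalty instead leaves y untouched and subtracts β × H(p) from the loss surface, which is why the two knobs can be combined.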

When to enable: if Brier Skill Score is poor and conformal coverage is below 90% (indicating overconfidence), turning this on at β=0.05 typically improves calibration.
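A hypothetical gating helper mirroring that guidance; the thresholds and names are assumptions for illustration, not part of this panel's logic:

```python
def suggest_beta(brier_skill_score: float, conformal_coverage: float) -> float:
    # Hypothetical heuristic: a negative Brier Skill Score (worse than the
    # reference forecast) combined with conformal coverage below 0.90
    # suggests overconfidence, so start the penalty at beta = 0.05;
    # otherwise leave it disabled (beta = 0.0).
    if brier_skill_score < 0.0 and conformal_coverage < 0.90:
        return 0.05
    return 0.0
```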