Standard SGD trains on observed examples (x, y). Mixup (Zhang et al. 2018) generates additional synthetic examples by linearly interpolating between pairs:
x_mix = λ × x_a + (1-λ) × x_b
y_mix = λ × y_a + (1-λ) × y_b
where λ ~ Beta(α, α)
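A minimal sketch of generating one mixup example with NumPy; the function name `mixup_pair` and the default α = 0.2 are illustrative, not part of the actual implementation:

```python
import numpy as np

def mixup_pair(x_a, y_a, x_b, y_b, alpha=0.2, rng=None):
    """Blend two training examples into one synthetic (x_mix, y_mix) pair."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)            # λ ~ Beta(α, α), lies in [0, 1]
    x_mix = lam * x_a + (1.0 - lam) * x_b   # interpolate inputs
    y_mix = lam * y_a + (1.0 - lam) * y_b   # interpolate (soft / one-hot) labels
    return x_mix, y_mix
```

With small α the Beta distribution concentrates near 0 and 1, so most synthetic examples stay close to one of the two originals; larger α pushes λ toward 0.5 and produces more aggressive blends.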
Effect: the model is trained to behave linearly between data points. This regularizes toward smoother decision boundaries: the prediction at a point halfway between two training examples should be roughly halfway between their predictions. Empirically this:
- Reduces overfitting on small datasets
- Improves robustness to small input perturbations (catches the same kind of brittleness Counterfactual Replay measures)
- Improves calibration (synthetic intermediate labels train the model to output intermediate probabilities)
Pipeline: after the main training pass on resolved examples, the continuous-learner generates K=5 synthetic mixup examples and trains on them with a discounted sample weight (0.5×). This step is disabled by default.
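A rough sketch of how this step could sit after the main pass; the `learner.train_on(x, y, sample_weight=...)` interface, the random pairing over the resolved batch, and the `enabled` flag are assumptions made for illustration, not the actual pipeline API:

```python
import numpy as np

K_MIXUP = 5          # synthetic examples per update (K=5)
MIXUP_WEIGHT = 0.5   # discounted sample weight for synthetic examples
MIXUP_ALPHA = 0.2    # Beta(α, α) concentration (illustrative value)

def mixup_pass(learner, resolved_x, resolved_y, enabled=False, rng=None):
    """After the main pass on resolved examples, optionally train on K mixup examples."""
    if not enabled or len(resolved_x) < 2:
        return                                   # disabled by default; need at least one pair
    rng = rng or np.random.default_rng()
    for _ in range(K_MIXUP):
        i, j = rng.choice(len(resolved_x), size=2, replace=False)
        lam = rng.beta(MIXUP_ALPHA, MIXUP_ALPHA)
        x_mix = lam * resolved_x[i] + (1.0 - lam) * resolved_x[j]
        y_mix = lam * resolved_y[i] + (1.0 - lam) * resolved_y[j]
        learner.train_on(x_mix, y_mix, sample_weight=MIXUP_WEIGHT)
```

The 0.5× weight keeps the synthetic examples from dominating the update: they nudge the model toward linear behavior between resolved examples without being treated as equally trustworthy as observed data.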