Abstract
This paper provides a theoretical justification for the observed effectiveness of deep ReLU networks trained with the square loss in classification tasks. Through an analysis of the associated gradient flow dynamics, we demonstrate that convergence to solutions with absolute minimum norm is expected when normalization techniques such as Batch Normalization (BN) or Weight Normalization (WN) are used together with Weight Decay (WD). Our central finding is that the product ρ of the Frobenius norms of the unnormalized weight matrices is the key quantity bounding the expected error: among all close-to-interpolating solutions, those with smaller norm have larger margin and tighter bounds on the expected classification error.
The analysis also shows that the dynamical system becomes singular when BN is used without WD, while in the absence of both BN and WD implicit dynamical regularization is still possible: near-zero initial conditions bias the dynamics toward high-margin solutions. The theoretical framework yields several testable predictions, including the specific roles of BN and weight decay, aspects of the Neural Collapse phenomena identified by Papyan, Han and Donoho, and the constraints imposed by BN on the structure of the network weights.
1. Introduction
Previous research has identified complexity control mechanisms underlying generalization in networks trained with exponential-type loss functions, based on asymptotic margin maximization. These established frameworks, however, fail to explain two critical empirical observations: the strong classification performance obtained by minimizing the square loss, and the initialization-dependent convergence observed when minimizing the cross-entropy loss. This theoretical gap motivates our focused investigation of the square loss in deep network classifiers.
Our analysis primarily examines the commonly used gradient descent-based normalization algorithms, Batch Normalization and Weight Normalization, combined with weight decay, since these techniques are essential for reliable training of deep networks and were employed in the empirical studies we seek to explain. Additionally, we consider the case in which neither BN nor WD is used, demonstrating that implicit dynamical regularization for classification can still emerge, though with convergence behavior that depends strongly on initial conditions.
Key Research Motivations
- The puzzle of square loss effectiveness in classification despite theoretical expectations
- Limitations of existing margin maximization theories for exponential losses
- Empirical evidence of initialization-dependent convergence in cross-entropy optimization
- The essential role of normalization techniques in modern deep network training
2. Methodology and Notation
We define a deep network with L layers using coordinate-wise scalar activation functions σ(z): ℝ → ℝ as the set of functions g(W; x) = W_L σ(W_{L-1} ⋯ σ(W_1 x)), where x ∈ ℝ^d represents the input, and the weights are parameterized by matrices W_k, one per layer, with dimensionally compatible shapes. The shorthand W denotes the complete set of weight matrices {W_k}, k = 1, ..., L.
Notable aspects of our formalization include:
- Architecture Details: The network employs no explicit bias terms; instead, the bias is instantiated in the input layer through one input dimension maintained as a constant
- Activation Function: We utilize the ReLU activation function defined as σ(x) = x_+ = max(0, x)
- Normalized Representation: We define g(x) = ρf(x) where ρ represents the product of the Frobenius norms of the weight matrices across all L layers, and f denotes the corresponding network with normalized weight matrices V_k (leveraging the homogeneity property of ReLU activations)
- Notational Conventions: We use f_n to indicate f(x_n), designating the output of the normalized network for input x_n
- Input Normalization: We assume ||x|| = 1 for all inputs
- Separability Conditions: Separability is defined as correct classification for all training data (y_n f_n > 0, ∀n), with average separability defined as Σ_n y_n f_n > 0
Mathematical Framework
The decomposition g(x) = ρf(x) enables separate analysis of the scale (ρ) and direction (f(x)) components of the network output, facilitating theoretical insights into normalization effects and margin optimization.
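To make the notation concrete, the following NumPy sketch builds a bias-free ReLU network, computes ρ as the product of the per-layer Frobenius norms, evaluates the normalized network f, and checks the homogeneity identity g(x) = ρf(x). The layer widths, random weights, and input here are illustrative assumptions, not the construction analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative architecture: input dimension 5, two hidden layers, scalar output.
# (Explicit biases are omitted; the paper folds a bias into a constant input
# coordinate, which we do not model here.)
dims = [5, 16, 16, 1]
W = [rng.standard_normal((dims[k + 1], dims[k])) for k in range(len(dims) - 1)]

def g(W, x):
    """Bias-free deep ReLU network g(W; x) = W_L σ(W_{L-1} ... σ(W_1 x))."""
    h = x
    for Wk in W[:-1]:
        h = np.maximum(Wk @ h, 0.0)            # σ(z) = max(0, z)
    return (W[-1] @ h).item()

# Decomposition g(x) = ρ f(x): ρ is the product of the per-layer Frobenius
# norms and f is the network with normalized matrices V_k = W_k / ||W_k||_F.
rho = np.prod([np.linalg.norm(Wk) for Wk in W])
V = [Wk / np.linalg.norm(Wk) for Wk in W]

x = rng.standard_normal(5)
x /= np.linalg.norm(x)                         # inputs are assumed to satisfy ||x|| = 1
y = 1.0                                        # binary label in {-1, +1}

f_n = g(V, x)                                  # output f(x_n) of the normalized network
print(np.isclose(g(W, x), rho * f_n))          # homogeneity of ReLU: g(x) = ρ f(x)
print("margin y_n f_n =", y * f_n)
```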
3. Theoretical Framework
3.1 Regression versus Classification Objectives
Our analysis of the square loss must reconcile why regression optimization performs effectively for classification tasks. While training minimizes square loss, we ultimately care about classification performance. Unlike linear networks, deep networks typically exhibit multiple global zero square loss minima corresponding to interpolating solutions. Although all interpolating solutions achieve optimal regression performance, they generally possess different margin characteristics and consequently different expected classification performance.
Crucially, achieving zero square loss does not automatically guarantee a large margin or strong classification performance. If g is a zero-loss solution of the regression problem, then g(x_n) = y_n for all n, i.e. ρf_n = y_n. Since the labels are binary, y_n ∈ {-1, +1}, this gives a margin y_n f_n = y_n²/ρ = 1/ρ that is identical for every training point. The norm of an interpolating minimizer, which we denote ρ_eq, is therefore inversely proportional to its margin: for an exact zero-loss regression solution, the common margin equals 1/ρ_eq.
3.2 Gradient Flow Dynamics and Norm Minimization
Beginning from small initialization, gradient descent explores critical points with ρ growing from zero. Our analysis shows that interpolating solutions with small norm ρ_eq (and hence larger margin) can be discovered before solutions with large ρ_eq and smaller margin. When the weight decay parameter is non-zero and sufficiently large, the solution reached is independent of initial conditions; otherwise, convergence depends strongly on the initialization.
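As a minimal sketch of these dynamics, the code below runs gradient descent on the square loss of a one-hidden-layer ReLU network with weight decay, starting from small initialization, and tracks ρ together with the margins of the normalized network. The synthetic data, width, learning rate, and weight decay coefficient are arbitrary assumptions for illustration, not the paper's experimental setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary problem: N points on the unit sphere, labels in {-1, +1}.
N, d, h = 64, 10, 128
X = rng.standard_normal((N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.where(X[:, 0] > 0, 1.0, -1.0)

# Small ("close to zero") initialization, as in the dynamics described above.
W1 = 1e-3 * rng.standard_normal((h, d))
W2 = 1e-3 * rng.standard_normal((1, h))

lr, lam = 0.1, 1e-3        # learning rate and weight decay coefficient
for step in range(10001):
    # Loss: (1/2N) Σ_n (g(x_n) - y_n)² + (λ/2)(||W1||_F² + ||W2||_F²)
    H = np.maximum(X @ W1.T, 0.0)              # hidden ReLU activations, (N, h)
    out = (H @ W2.T).ravel()                   # outputs g(x_n)
    err = out - y
    gW2 = (err[None, :] @ H) / N + lam * W2
    gW1 = ((err[:, None] * W2 * (H > 0)).T @ X) / N + lam * W1
    W1 -= lr * gW1
    W2 -= lr * gW2
    if step % 2000 == 0:
        rho = np.linalg.norm(W1) * np.linalg.norm(W2)
        margins = y * out / max(rho, 1e-12)    # y_n f_n for the normalized network
        print(f"step {step:5d}  loss {0.5 * np.mean(err**2):.4f}  "
              f"rho {rho:.3f}  min margin {margins.min():.4f}")
```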
3.3 Explicit Regularization through Normalization and Weight Decay
The combination of normalization techniques (BN or WN) with weight decay induces explicit regularization that biases solutions toward those with minimal norm. This explicit regularization mechanism provides theoretical justification for the empirical effectiveness of these techniques in deep learning practice. The Frobenius norm of unnormalized weight matrices emerges as the key property governing expected error bounds.
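The sketch below illustrates, for a single weight-normalized layer, how the parameterization W = ρ · v/||v||_F separates the dynamics of scale and direction: the data gradient only rotates the direction, and it is the weight decay terms that act on the scale. The chain rule shown is the standard Weight Normalization one; where the decay is applied (here to both ρ and v) is an assumption made for illustration, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Single weight-normalized layer: W = rho * v / ||v||_F, so the scale rho and
# the direction V = v / ||v||_F are separate trainable quantities.
m, d = 4, 6
v = rng.standard_normal((m, d))
rho = 1.0

def split_gradients(dL_dW, v, rho):
    """Chain rule for W = rho * v / ||v||_F: returns (dL/drho, dL/dv)."""
    nv = np.linalg.norm(v)
    V = v / nv
    dL_drho = np.sum(dL_dW * V)                   # <grad_W L, V>
    dL_dv = (rho / nv) * (dL_dW - dL_drho * V)    # component orthogonal to V
    return dL_drho, dL_dv

dL_dW = rng.standard_normal((m, d))               # stand-in for a data gradient
g_rho, g_v = split_gradients(dL_dW, v, rho)

# The direction gradient is orthogonal to v, so under gradient flow it leaves
# ||v|| unchanged ...
print(np.isclose(np.sum(g_v * v), 0.0))

# ... hence, with learning rate lr and weight decay lam, only the weight-decay
# terms act on the scale of the parameterization; with lam = 0 nothing in the
# data-driven dynamics pulls the norms toward a minimum-norm solution.
lr, lam = 0.1, 1e-3
rho -= lr * (g_rho + lam * rho)
v -= lr * (g_v + lam * v)
```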
3.4 Implicit Dynamical Regularization
Even in the absence of both BN and WD, implicit dynamical regularization can occur through the initialization dynamics. Specifically, zero-initial conditions can bias the gradient flow trajectory toward high-margin solutions, providing an alternative mechanism for effective classification performance without explicit regularization techniques.
Theoretical Contributions
- Established connection between norm minimization and margin maximization in square loss optimization
- Characterized explicit regularization effects of BN/WN with WD
- Identified implicit regularization through initialization dynamics
- Provided theoretical foundation for Neural Collapse phenomena
- Explained constraints induced by BN on network weight structures
4. Experimental Results and Predictions
Our theoretical framework generates several testable predictions that align with empirical observations in deep learning practice:
4.1 Role of Batch Normalization and Weight Decay
The analysis predicts that BN without WD creates a singular dynamical system, while the combination of BN with WD promotes stable convergence to minimum-norm solutions. This explains the empirical observation that both components are often necessary for reliable training of deep networks.
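A generic property behind this prediction can be checked numerically: the output of a batch-normalized layer is invariant to positive rescaling of its incoming weights, so the loss gradient is orthogonal to the weights and plain gradient descent cannot shrink their norm; weight decay is what controls it. The toy layer, fixed readout, and data below are assumptions for illustration, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)

def bn(z, eps=1e-8):
    """Batch normalization across the batch dimension (no learned scale/shift)."""
    return (z - z.mean(0)) / np.sqrt(z.var(0) + eps)

# Toy loss that depends on W only through BN(X W^T), with a fixed readout c.
X = rng.standard_normal((32, 5))
W = rng.standard_normal((4, 5))
c = rng.standard_normal(4)
y = rng.standard_normal(32)

def loss(W):
    return 0.5 * np.mean((bn(X @ W.T) @ c - y) ** 2)

# 1) Scale invariance: rescaling W by any alpha > 0 leaves the loss unchanged.
print(np.isclose(loss(W), loss(3.7 * W)))

# 2) Consequence: the directional derivative of the loss along W itself,
#    i.e. <grad L(W), W>, vanishes, so the gradient is orthogonal to W and
#    plain gradient descent never decreases ||W||_F; only weight decay acts
#    on the norm, leaving the norm dynamics degenerate when WD is absent.
h = 1e-6
dd = (loss((1 + h) * W) - loss((1 - h) * W)) / (2 * h)
print(abs(dd) < 1e-6)
```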
4.2 Neural Collapse Phenomena
Our framework provides theoretical insights into the Neural Collapse phenomena identified by Papyan, Han and Donoho, in which, during the terminal phase of training, the last-layer class means converge to a symmetric structure (a simplex equiangular tight frame) and the classifier vectors align with the class means.
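To make these predictions operational, the diagnostics below (a hypothetical sketch, not the paper's measurement code) compute, from last-layer features H, labels y, and a classifier matrix Wc, the standard Neural Collapse quantities: within-class variability relative to between-class spread (NC1), how close the recentered class means are to being equinorm and equiangular (NC2), and the alignment of classifier rows with the class means (NC3).

```python
import numpy as np

def neural_collapse_stats(H, y, Wc):
    """Diagnostics in the spirit of Papyan, Han & Donoho's Neural Collapse.

    H  : (N, p) last-layer features,  y : (N,) integer class labels,
    Wc : (C, p) classifier weight matrix, one row per class.
    """
    classes = np.unique(y)
    mu_g = H.mean(axis=0)                                      # global feature mean
    mus = np.stack([H[y == c].mean(axis=0) for c in classes])  # class means
    centered = mus - mu_g

    # NC1: average within-class squared deviation relative to the total
    # between-class spread; this ratio tends to zero under collapse.
    within = sum(np.sum((H[y == c] - mus[i]) ** 2) for i, c in enumerate(classes))
    nc1 = (within / len(H)) / np.sum(centered ** 2)

    # NC2: recentered class means should become equinorm and equiangular.
    norms = np.linalg.norm(centered, axis=1)
    U = centered / norms[:, None]
    cosines = (U @ U.T)[~np.eye(len(classes), dtype=bool)]

    # NC3: each classifier row should align with its recentered class mean.
    align = np.mean([np.dot(Wc[i], centered[i]) /
                     (np.linalg.norm(Wc[i]) * norms[i] + 1e-12)
                     for i in range(len(classes))])

    return {"NC1 (variability ratio)": nc1,
            "NC2 norm spread": norms.std() / norms.mean(),
            "NC2 cosine spread": cosines.std(),
            "NC3 mean alignment": align}
```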
4.3 Initialization Dependence
In the absence of explicit regularization through BN and WD, convergence behavior depends strongly on initialization, with near-zero initial conditions biasing solutions toward larger margin.
4.4 Margin-Norm Relationship
We establish that among all interpolating solutions, those with a smaller product ρ of the Frobenius norms of the unnormalized weight matrices have larger margin and tighter generalization bounds, providing a theoretical foundation for norm-based generalization arguments in deep learning.
Empirical Validation
While complete experimental results are detailed in our extended technical report [5], the theoretical predictions align consistently with empirical observations in deep network training with square loss, particularly regarding the interaction between normalization techniques, weight decay, and initialization strategies.
5. Conclusion
This work provides a comprehensive theoretical framework explaining the effectiveness of deep ReLU networks trained with square loss for classification tasks. By analyzing gradient flow dynamics, we demonstrate that both explicit regularization (through normalization techniques with weight decay) and implicit regularization (through initialization dynamics) bias solutions toward those with favorable margin properties. The Frobenius norm of unnormalized weight matrices emerges as the fundamental property governing expected classification performance.
Our analysis resolves the apparent paradox of square loss effectiveness in classification, provides theoretical justification for common practices like BN and WD, and offers insights into phenomena like Neural Collapse. The framework generates testable predictions and establishes connections between optimization dynamics, regularization effects, and generalization performance in deep network classifiers.
Future work should explore extensions to multi-class classification, different network architectures, and alternative loss functions within this theoretical framework, potentially leading to more effective training strategies and improved understanding of deep learning generalization.