Regularizers
Regularizers add extra penalties or constraints on network parameters to restrict model complexity. The corresponding term in Caffe is weight decay. Regularization and weight decay are equivalent in backpropagation. The conceptual difference lies in the forward pass: when treated as weight decay, the penalty is not considered part of the objective function. However, to reduce the amount of computation, Mocha also omits the forward computation for regularizers by default. We use the term regularization instead of weight decay because it is easier to understand when generalizing to sparse, group-sparse, or even more complicated structural regularizations.
All regularizers have the property coefficient, corresponding to the regularization coefficient. During training, a global regularization coefficient can also be specified (see userguide/solver), which globally scales all local regularization coefficients.
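
As a rough illustration of how a global coefficient scales the local ones, the sketch below multiplies each local coefficient by the global value. This is plain Python, not Mocha's API: the function name, the parameter names, and the numeric values are all hypothetical and chosen only for the example.

```python
def effective_coefficient(global_coef, local_coef):
    """Scale a per-parameter (local) coefficient by the solver-wide (global) one."""
    return global_coef * local_coef

# Hypothetical parameter names and values, for illustration only.
local_coefs = {"conv1.weight": 1.0, "fc1.weight": 2.0, "fc1.bias": 0.0}
global_coef = 0.0005

# Coefficient actually applied to each parameter blob.
effective = {name: effective_coefficient(global_coef, coef)
             for name, coef in local_coefs.items()}
```

Under this reading, a local coefficient of zero switches regularization off for that parameter regardless of the global setting.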

class NoRegu
Regularizer that imposes no regularization.

class L2Regu
L2 regularizer. The parameter blob \(W\) is treated as a 1D vector. During the forward pass, the squared L2-norm \(\|W\|^2=\langle W,W\rangle\) is computed, and \(\lambda \|W\|^2\) is added to the objective function, where \(\lambda\) is the regularization coefficient. During the backward pass, \(2\lambda W\) is added to the parameter gradient, enforcing a weight decay when the solver moves the parameters towards the negative gradient direction.
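
The forward and backward computations described for the L2 regularizer can be sketched in a few lines of plain Python. This is not Mocha's implementation; the function names, the example weights, and the learning rate `lr` are assumptions made for illustration.

```python
def l2_forward(weights, coef):
    """Penalty added to the objective: coef * squared L2-norm of the flattened weights."""
    return coef * sum(w * w for w in weights)

def l2_backward(weights, coef):
    """Term added to the parameter gradient: 2 * coef * W, element by element."""
    return [2.0 * coef * w for w in weights]

W = [0.5, -1.0, 2.0]   # parameter blob, already flattened to a 1D vector
lam = 0.1              # regularization coefficient

penalty = l2_forward(W, lam)
grad = l2_backward(W, lam)

# A gradient-descent step on the penalty alone multiplies every weight by
# (1 - 2 * lr * lam), which is the "decay" in weight decay.
lr = 0.5
W_next = [w - lr * g for w, g in zip(W, grad)]
```

With these values each weight shrinks by the same factor, so larger weights lose more in absolute terms.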
Note
In Caffe, only \(\lambda W\) is added as weight decay in backpropagation, which is equivalent to having an L2 regularizer with coefficient \(0.5\lambda\).
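
The factor of one half in this note can be checked numerically. The sketch below compares the two gradient terms in plain Python; both function names are hypothetical stand-ins and neither is part of Caffe or Mocha.

```python
def l2_regularizer_gradient(weights, coef):
    """Gradient term of the penalty coef * ||W||^2, i.e. 2 * coef * W."""
    return [2.0 * coef * w for w in weights]

def weight_decay_term(weights, decay):
    """Weight-decay term as described in the note: decay * W."""
    return [decay * w for w in weights]

W = [0.5, -1.0, 2.0]
decay = 0.01  # illustrative value

decay_term = weight_decay_term(W, decay)
# Halving the coefficient of the L2 regularizer reproduces the same term.
regu_term = l2_regularizer_gradient(W, 0.5 * decay)
```

When porting a weight decay value to an L2 regularizer of this form, the coefficient would therefore be halved to keep the same gradient contribution.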

class L1Regu
L1 regularizer. The parameter blob \(W\) is treated as a 1D vector. During the forward pass, the L1-norm
\[\|W\|_1 = \sum_i |W_i|\]
is computed, and \(\lambda \|W\|_1\) is added to the objective function. During the backward pass, \(\lambda\text{sign}(W)\) is added to the parameter gradient. The L1 regularizer has the property of encouraging sparsity in the parameters.
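
A minimal plain-Python sketch of the L1 forward and backward computations follows. It is not Mocha's implementation: the function names are hypothetical, and returning 0 for the sign of a zero weight is an assumption (one common subgradient choice).

```python
def sign(x):
    """Return -1, 0, or 1; using 0 at x == 0 is an assumed subgradient choice."""
    return (x > 0) - (x < 0)

def l1_forward(weights, coef):
    """Penalty added to the objective: coef * sum of absolute values."""
    return coef * sum(abs(w) for w in weights)

def l1_backward(weights, coef):
    """Term added to the parameter gradient: coef * sign(W), element by element."""
    return [coef * sign(w) for w in weights]

W = [0.5, -1.0, 0.0, 2.0]   # parameter blob, already flattened to a 1D vector
lam = 0.1                   # regularization coefficient

penalty = l1_forward(W, lam)
grad = l1_backward(W, lam)
```

The gradient term has the same magnitude for every nonzero weight, however small, which is one way to see why this penalty tends to drive small weights all the way to zero.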