Loss Layers¶
-
class
HingeLossLayer
¶ Compute the hinge loss for binary classification problems:
\[\frac{1}{N}\sum_{i=1}^N \max(1 - \mathbf{y}_i \cdot \hat{\mathbf{y}}_i, 0)\]Here \(N\) is the batch-size, \(\mathbf{y}_i \in \{-1,1\}\) is the ground-truth label of the \(i\)-th sample, and \(\hat{\mathbf{y}}_i\) is the corresponding prediction.
-
weight
¶ Default
1.0
. Weight of this loss function. Could be useful when combining multiple loss functions in a network.
-
bottoms
¶ Should be a vector containing two symbols. The first one specifies the name for the prediction \(\hat{\mathbf{y}}\), and the second one specifies the name for the ground-truth \(\mathbf{y}\).
-
-
class
MultinomialLogisticLossLayer
¶ The multinomial logistic loss is defined as \(\ell = -w_g\log(x_g)\), where \(x_1,\ldots,x_C\) are probabilities for each of the \(C\) classes conditioned on the input data, \(g\) is the corresponding ground-truth category, and \(w_g\) is the weight for the \(g\)-th class (default 1, see bellow).
If the conditional probability blob is of the shape
(dim1, dim2, ..., dim_channel, ..., dimN)
, then the ground-truth blob should be of the shape(dim1, dim2, ..., 1, ..., dimN)
. Heredim_channel
, historically called the “channel” dimension, is the user specified tensor dimension to compute loss on. This general case allows to produce multiple labels for each sample. For the typical case where only one (multi-class) label is produced for one sample, the conditional probability blob is the shape(dim_channel, dim_num)
and the ground-truth blob should be of the shape(1, dim_num)
.The ground-truth should be a zero-based index in the range of \(0,\ldots,C-1\).
-
bottoms
¶ Should be a vector containing two symbols. The first one specifies the name for the conditional probability input blob, and the second one specifies the name for the ground-truth input blob.
-
weight
¶ Default
1.0
. Weight of this loss function. Could be useful when combining multiple loss functions in a network.
-
weights
¶ This can be used to specify weights for different classes. The following values are allowed
- Empty array (default). This means each category should be equally weighted.
- A 1D vector of length
channels
. This defines weights for each category. - An (N-1)D tensor of the shape of a data point. In other words, the same
shape as the prediction except that the last mini-batch dimension is
removed. This is equivalent to the above case if the prediction is a 2D
tensor of the shape
channels
-by-mini-batch
. - An ND tensor of the same shape as the prediction blob. This allows us to
fully specify different weights for different data points in
a mini-batch. See
SoftlabelSoftmaxLossLayer
.
-
dim
¶ Default
-2
(penultimate). Specify the dimension to operate on.
-
normalize
¶ Indicating how weights should be normalized if given. The following values are allowed
:local
(default): Normalize the weights locally at each location (w,h), across the channels.:global
: Normalize the weights globally.:no
: Do not normalize the weights.
The weights normalization are done in a way that you get the same objective function when specifying equal weights for each class as when you do not specify any weights. In other words, the total sum of the weights are scaled to be equal to weights x height x channels. If you specify
:no
, it is your responsibility to properly normalize the weights.
-
-
class
SoftlabelSoftmaxLossLayer
¶ Like the
SoftmaxLossLayer
, except that this deals with soft labels. For multiclass classification with \(K\) categories, we call an integer value \(y\in\{0,\ldots,K-1\}\) a hard label. In contrast, a soft label is a vector on the \(K\)-dimensional simplex. In other words, a soft label specifies a probability distribution over all the \(K\) categories, while a hard label is a special case where all the probability masses concentrates on one single category. In this case, this loss is basically computing the KL-divergence D(p||q), where p is the ground-truth softlabel, and q is the predicted distribution.-
dim
¶ Default
-2
(penultimate). Specify the dimension to operate on.
-
weight
¶ Default
1.0
. Weight of this loss function. Could be useful when combining multiple loss functions in a network.
-
bottoms
¶ Should be a vector containing two symbols. The first one specifies the name for the conditional probability input blob, and the second one specifies the name for the ground-truth (soft labels) input blob.
-
-
class
SoftmaxLossLayer
¶ This is essentially a combination of
MultinomialLogisticLossLayer
andSoftmaxLayer
. The given predictions \(x_1,\ldots,x_C\) for the \(C\) classes are transformed with a softmax function\[\sigma(x_1,\ldots,x_C) = (\sigma_1,\ldots,\sigma_C) = \left(\frac{e^{x_1}}{\sum_j e^{x_j}},\ldots,\frac{e^{x_C}}{\sum_je^{x_j}}\right)\]which essentially turn the predictions into non-negative values with exponential function and then re-normalize to make them look like probabilties. Then the transformed values are used to compute the multinomial logsitic loss as
\[\ell = -w_g \log(\sigma_g)\]Here \(g\) is the ground-truth label, and \(w_g\) is the weight for the \(g\)-th category. See the document of
MultinomialLogisticLossLayer
for more details on what the weights mean and how to specify them.The shapes of the inputs are the same as for the
MultinomialLogisticLossLayer
: the multi-class predictions are assumed to be along the channel dimension.The reason we provide a combined softmax loss layer instead of using one softmax layer and one multinomial logistic layer is that the combined layer produces the back-propagation error in a more numerically robust way.
\[\frac{\partial \ell}{\partial x_i} = w_g\left(\frac{e^{x_i}}{\sum_j e^{x_j}} - \delta_{ig}\right) = w_g\left(\sigma_i - \delta_{ig}\right)\]Here \(\delta_{ig}\) is 1 if \(i=g\), and 0 otherwise.
-
bottoms
¶ Should be a vector containing two symbols. The first one specifies the name for the conditional probability input blob, and the second one specifies the name for the ground-truth input blob.
-
dim
¶ Default
-2
(penultimate). Specify the dimension to operate on. For a 4D vision tensor blob, the default value (penultimate) translates to the 3rd tensor dimension, usually called the “channel” dimension.
-
weight
¶ Default
1.0
. Weight of this loss function. Could be useful when combining multiple loss functions in a network.
-
weights
¶ -
normalize
¶ Properties for the underlying
MultinomialLogisticLossLayer
. See its documentation for details.
-
-
class
SquareLossLayer
¶ Compute the square loss for real-valued regression problems:
\[\frac{1}{2N}\sum_{i=1}^N \|\mathbf{y}_i - \hat{\mathbf{y}}_i\|^2\]Here \(N\) is the batch-size, \(\mathbf{y}_i\) is the real-valued (vector or scalar) ground-truth label of the \(i\)-th sample, and \(\hat{\mathbf{y}}_i\) is the corresponding prediction.
-
weight
¶ Default
1.0
. Weight of this loss function. Could be useful when combining multiple loss functions in a network.
-
bottoms
¶ Should be a vector containing two symbols. The first one specifies the name for the prediction \(\hat{\mathbf{y}}\), and the second one specifies the name for the ground-truth \(\mathbf{y}\).
-
-
class
BinaryCrossEntropyLossLayer
¶ A simpler alternative to
MultinomialLogisticLossLayer
for the special case of binary classification.\[-\frac{1}{N}\sum_{i=1}^N \log(p_i)y_i + \log(1-p_i)(1-y_i)\]Here \(N\) is the batch-size, \(\mathbf{y}_i\) is the ground-truth label of the \(i\)-th sample, and :math:
p_i
is the corresponding prediction.-
weight
¶ Default
1.0
. Weight of this loss function. Could be useful when combining multiple loss functions in a network.
-
bottoms
¶ Should be a vector containing two symbols. The first one specifies the name for the prediction \(\hat{\mathbf{y}}\), and the second one specifies the name for the binary ground-truth labels \(\mathbf{p}\).
-
-
class
GaussianKLLossLayer
¶ Given two inputs mu and sigma of the same size representing the means and standard deviations of a diagonal multivariate Gaussian distribution, the loss is the Kullback-Leibler divergence from that to the standard Gaussian of the same dimension.
Used in variational autoencoders, as in Kingma & Welling 2013, as a form of regularization.
\[D_{KL}(\mathcal{N}(\mathbf{\mu}, \mathrm{diag}(\mathbf{\sigma})) \Vert \mathcal{N}(\mathbf{0}, \mathbf{I}) ) = -\frac{1}{2}\left(\sum_{i=1}^N (\mu_i^2 + \sigma_i^2 - 2\log\sigma_i) - N\right)\]-
weight
¶ Default
1.0
. Weight of this loss function. Could be useful when combining multiple loss functions in a network.
-
bottoms
¶ Should be a vector containing two symbols. The first one specifies the name for the mean vector \(\mathbf{\mu}\), and the second one the vector of standard deviations \(\mathbf{\sigma}\).
-