Neurons (Activation Functions)

They could be attached to any layers. The neuron of each layer will affect the output in the forward pass and the gradient in the backward pass automatically unless it is an identity neuron. A layer have an identity neuron by default [1].

class Neurons.Identity

An activation function that does nothing.

class Neurons.ReLU

Rectified Linear Unit. During the forward pass, it inhibit all the negative activations. In other words, it compute point-wisely \(y=\max(0, x)\). The point-wise derivative for ReLU is

\[\begin{split}\frac{dy}{dx} = \begin{cases}1 & x > 0 \\ 0 & x \leq 0\end{cases}\end{split}\]


ReLU is actually not differentialble at 0. But it has subdifferential \([0,1]\). Any value in that interval could be taken as a subderivative, and could be used in SGD if we generalize from gradient descent to subgradient descent. In the implementation, we choose 0.

class Neurons.Sigmoid

Sigmoid is a smoothed step function that produces approximate 0 for negative input with large absolute values and approximate 1 for large positive inputs. The point-wise formula is \(y = 1/(1+e^{-x})\). The point-wise derivative is

\[\frac{dy}{dx} = \frac{-e^{-x}}{\left(1+e^{-x}\right)^2} = (1-y)y\]
[1]This is actually not true: not all layers in Mocha support neurons. For example, data layers currently does not have neurons, but this feature could be added by simply adding a neuron property to the data layer type. However, for some layer types like loss layers or accuracy layers, it does not make much sense to have neurons.