
0x530 FFN

1. MLP

2. Gated Models

Deep models are difficult to train (vanishing gradients). To train them efficiently, we want to keep the singular values of the Jacobian near 1, as explained in this paper

One approach to achieve this is to use the models with gating mechanisms.

Model (residual network)

\[y = H(X) + X\]

Model (Highway network)

\[y = H(X)T(X) + X(1-T(X))\]

where \(T(X)\) is the transform gate, which interpolates the layer's output between the plain transformation \(H(X)\) and the identity mapping \(X\):

  • when \(T(X)=1\), \(y=H(X)\)
  • when \(T(X)=0\), \(y=X\)

The transform gate can be implemented as follows:

\[T(X) = \sigma(WX+b)\]

where \(\sigma\) is the sigmoid function and \(b\) should be initialized to a negative value so that the layer is initially biased towards the identity mapping.
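As a minimal sketch, a highway layer under these definitions could look like the following PyTorch module (the module name, the ReLU inside \(H\), and the default gate bias are illustrative assumptions, not the paper's reference code):

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """y = H(x) * T(x) + x * (1 - T(x)) with a sigmoid transform gate."""

    def __init__(self, dim: int, gate_bias: float = -2.0):
        super().__init__()
        self.plain = nn.Linear(dim, dim)   # H(x)
        self.gate = nn.Linear(dim, dim)    # T(x) = sigmoid(Wx + b)
        # Negative bias so T(x) is close to 0 at initialization,
        # i.e. the layer starts near the identity mapping.
        nn.init.constant_(self.gate.bias, gate_bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.plain(x))          # plain branch H(x)
        t = torch.sigmoid(self.gate(x))        # transform gate T(x)
        return h * t + x * (1.0 - t)
```

With the gate bias at \(-2\), \(T(X)\approx 0.12\) at initialization, so the block starts close to an identity mapping and gradients flow through the skip path.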

Model (sparsely-gated mixture of experts): a gating network activates only a few experts per input; the total number of experts can scale into the thousands.

(figure: sparsely-gated mixture-of-experts layer)
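A minimal sketch of top-k routing over a pool of expert FFNs, assuming the gate picks the k highest-scoring experts per token and renormalizes their weights with a softmax (the noisy gating and load-balancing losses from the original paper are omitted, and all names and sizes here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Route each token to its top-k experts and mix their outputs by gate weight."""

    def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (tokens, dim)
        scores = self.gate(x)                               # (tokens, num_experts)
        topk, idx = scores.topk(self.k, dim=-1)             # keep only k experts per token
        weights = F.softmax(topk, dim=-1)                   # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                    # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

Because only k experts run per token, the parameter count grows with the number of experts while the per-token compute stays roughly constant.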

Model (gMLP, MLP with gating): self-attention is not always necessary; depending on the task, an MLP with gating can be enough.

First we apply a channel projection (a linear projection) followed by an activation:

\[Z = \sigma(XU)\]

Then we model the spatial interaction with an operation \(s\):

\[\tilde{Z} = s(Z)\]

If \(s\) is the identity, the whole operation reduces to a plain FFN. This work uses a Spatial Gating Unit (SGU), which first splits \(Z\) into \((Z_1, Z_2)\) along the channel dimension:

\[f(Z_2) = WZ_2+b\]
\[\tilde{Z} = Z_1 \circ f(Z_2)\]

(figure: gMLP block with Spatial Gating Unit)
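A minimal sketch of the SGU and the surrounding gMLP block, assuming GELU for \(\sigma\) and a sequence-length-sized linear layer for \(f\); the near-identity initialization of the spatial projection follows the paper's description, but the module and dimension names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGatingUnit(nn.Module):
    """Split channels into (Z1, Z2), project Z2 across tokens, and gate Z1 with it."""

    def __init__(self, dim: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim // 2)
        self.spatial_proj = nn.Linear(seq_len, seq_len)   # f(Z2) = W Z2 + b over the token axis
        nn.init.zeros_(self.spatial_proj.weight)          # W ~ 0, b = 1: SGU starts near identity
        nn.init.ones_(self.spatial_proj.bias)

    def forward(self, z: torch.Tensor) -> torch.Tensor:   # z: (batch, seq, dim)
        z1, z2 = z.chunk(2, dim=-1)
        z2 = self.norm(z2)
        z2 = self.spatial_proj(z2.transpose(1, 2)).transpose(1, 2)  # mix across tokens
        return z1 * z2                                     # Z~ = Z1 ∘ f(Z2)

class GMLPBlock(nn.Module):
    """Channel projection + activation, spatial gating, then projection back."""

    def __init__(self, dim: int, ffn_dim: int, seq_len: int):
        super().__init__()
        self.proj_in = nn.Linear(dim, ffn_dim)
        self.sgu = SpatialGatingUnit(ffn_dim, seq_len)
        self.proj_out = nn.Linear(ffn_dim // 2, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = F.gelu(self.proj_in(x))   # Z = sigma(XU)
        z = self.sgu(z)               # spatial interaction s(Z)
        return self.proj_out(z)       # back to the model dimension
```

Initializing the spatial projection with \(W\approx 0\) and \(b=1\) makes \(f(Z_2)\approx 1\), so early in training the block behaves like an ordinary FFN branch.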

3. Activations

3.1. Nonlinearity

Activation (GELU) (Hendrycks and Gimpel, 2016)1 is motivated by stochastic regularization (e.g. dropout): multiply the input \(x\) by \(m \sim \text{Bernoulli}(\Phi(x))\), where \(\Phi\) is the standard Gaussian CDF. Taking the expectation over \(m\) gives the nonlinearity

\[\text{GELU}(x) = x\Phi(x)\]

Activation (Swish) was found by searching a space of activations built from unary and binary operations:

\[x \sigma (\beta x)\]

where \(\beta\) is a trainable parameter. The variant with \(\beta=1\) is called SiLU (Sigmoid Linear Unit).
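Both nonlinearities ship with PyTorch; a quick sketch checking them against the definitions above (the test values are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, steps=7)

# GELU(x) = x * Phi(x), with Phi the standard normal CDF.
gelu_manual = x * torch.distributions.Normal(0.0, 1.0).cdf(x)
print(torch.allclose(gelu_manual, F.gelu(x), atol=1e-6))   # expected: True

# SiLU / Swish with beta = 1: x * sigmoid(x).
silu_manual = x * torch.sigmoid(x)
print(torch.allclose(silu_manual, F.silu(x), atol=1e-6))   # expected: True
```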

3.2. Gated Activations

Activation (Gated Linear Unit, GLU) takes the element-wise product of a linear projection with a sigmoid gate to control the activation:

\[h(X) = (XW+b) \cdot \sigma(XV + c)\]

Shazeer (2020)2 applied GLU with other activations in place of the sigmoid, notably:

  • GEGLU: \(\text{GELU}(xW+b) \cdot (xV + c)\)
  • SwiGLU: \(\text{Swish}(xW+b) \cdot (xV + c)\)
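As a minimal sketch, the corresponding feed-forward block can be written with a swappable gate activation (the module name, the bias-free default, and the hidden size below are illustrative assumptions; Shazeer's paper also shrinks the hidden dimension to keep the parameter count comparable to a standard FFN):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUFeedForward(nn.Module):
    """FFN with a gated linear unit: out(act(x W) * (x V))."""

    def __init__(self, dim: int, hidden: int, activation=F.silu, bias: bool = False):
        super().__init__()
        self.w = nn.Linear(dim, hidden, bias=bias)   # activated branch
        self.v = nn.Linear(dim, hidden, bias=bias)   # linear gate branch
        self.out = nn.Linear(hidden, dim, bias=bias)
        self.activation = activation                 # sigmoid -> GLU, gelu -> GEGLU, silu -> SwiGLU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(self.activation(self.w(x)) * self.v(x))

# The variants above are the same block with a different activation.
geglu = GLUFeedForward(dim=512, hidden=1365, activation=F.gelu)
swiglu = GLUFeedForward(dim=512, hidden=1365, activation=F.silu)
```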

  1. Dan Hendrycks and Kevin Gimpel. 2016. Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415.

  2. Noam Shazeer. 2020. GLU Variants Improve Transformer. arXiv preprint arXiv:2002.05202.