While researching for my master's thesis, I tried to understand the paper by Goodfellow et al. on Maxout units. I found it very hard to understand the details and thought that a clear explanation combined with a nice figure would be really helpful. So this is my shot at doing so.
Please note that nothing explained here was developed by me; this is just an explanation of the paper by Goodfellow et al.
| Symbol | Meaning |
| --- | --- |
| $h$ | Maxout function |
| $i$ | index of the neuron in the layer |
| $x$ | input |
| $k$ | number of linear pieces |
| $W$ | 3D tensor of learned weights |
| $b$ | matrix of learned biases |
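Using this notation, the paper defines the Maxout function as

$$h_i(x) = \max_{j \in [1, k]} z_{ij} \qquad \text{where} \qquad z_{ij} = x^\top W_{\cdot ij} + b_{ij}$$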
Now, here is what a single layer with three Maxout units looks like. Try hovering over the units with the mouse to better see the connection scheme.
“But wait, this looks more like at least two layers!”
Yes indeed, and this is an important fact about Maxout: the activation function is implemented using a small sub-network whose parameters are learned as well (“Did somebody say ‘Network in Network’?”). So a single Maxout layer actually consists of two parts. (The first layer in the image is just the input layer. This doesn’t necessarily mean that it is the very first layer of the whole network; it can also be the output of a previous layer.)
The first is the linear part. It is a fully-connected layer with no activation function (which is referred to as affine): each unit in this layer just computes a weighted sum of all inputs, as defined in the second part of the Maxout definition above:
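$$z_{ij} = x^\top W_{\cdot ij} + b_{ij}$$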
Attentive readers might notice the missing biases in the image above. Well done.
The three-dimensional tensor $W$ contains the weights of this first part. The dot in the subscript means that all elements of the first dimension are taken. Consequently, $W_{\cdot ij}$ is the weight vector of the unit in row $i$ and column $j$.
It might also be confusing that in the figure above the units of this first part are arranged in a two-dimensional grid. The first dimension of this grid (the number of rows) matches the number of Maxout units in the layer, whereas the second dimension (the number of columns) is the hyperparameter $k$, which is chosen when the whole architecture is defined. This parameter controls the complexity of the Maxout activation function: the higher $k$ is, the more accurately any convex function can be approximated. Basically, each unit in this first part computes a linear function, and the units in row $i$ provide the $k$ linear pieces for Maxout unit $i$.
The second part is much simpler. It just performs max-pooling over each row of the first part, i.e. it takes the maximum of the outputs in each row:
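$$h_i(x) = \max_{j \in [1, k]} z_{ij}$$

To make the two parts concrete, here is a minimal NumPy sketch of a Maxout layer's forward pass. The names (`maxout_forward`, `d`, `m`, `k`) are my own choices, not from the paper; it simply illustrates the affine part followed by the row-wise max:

```python
import numpy as np

def maxout_forward(x, W, b):
    """Forward pass of one Maxout layer.

    x: input vector of shape (d,)
    W: weight tensor of shape (d, m, k) -- d inputs, m Maxout units, k pieces
    b: bias matrix of shape (m, k)
    """
    # First part (affine): z[i, j] = x . W[:, i, j] + b[i, j]
    z = np.einsum('d,dmk->mk', x, W) + b
    # Second part (max-pooling over each row): one output per Maxout unit
    return z.max(axis=1)

# Example: layer with d=2 inputs, m=3 Maxout units, k=4 linear pieces
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3, 4))
b = rng.normal(size=(3, 4))
x = np.array([0.5, -1.0])
print(maxout_forward(x, W, b))  # three values, one per Maxout unit
```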
Consider the function $f(x) = x^2$.
We can approximate this function with a single Maxout unit that uses three linear pieces ($k = 3$). Equivalently, we can say that the Maxout unit uses three hidden units.
This Maxout unit would look like this (biases included this time):
Each hidden unit calculates:
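$$z_j(x) = w_j x + b_j, \qquad j \in \{1, 2, 3\}$$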
This is a simple linear function. The max-pooling unit takes the maximum of these three linear functions.
Take a look at this picture. It shows the $x^2$ function and three linear functions that could be learned by the Maxout unit.
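If you want to play with this yourself, here is a small sketch that picks three concrete lines (the tangents of $x^2$ at $x = -1$, $0$ and $1$; these particular values are my choice for illustration, not from the paper) and takes their pointwise maximum, just like the Maxout unit would:

```python
import numpy as np

# Tangent line of x^2 at point a: y = 2*a*x - a**2
w = np.array([-2.0, 0.0, 2.0])   # slopes 2*a for a = -1, 0, 1
b = np.array([-1.0, 0.0, -1.0])  # intercepts -a**2

x = np.linspace(-2, 2, 9)
z = np.outer(x, w) + b           # shape (9, 3): one column per linear piece
h = z.max(axis=1)                # the Maxout unit's output

for xi, hi in zip(x, h):
    print(f"x = {xi:+.1f}   maxout = {hi:+.2f}   x^2 = {xi*xi:.2f}")
```

In a real network these slopes and intercepts would of course not be fixed by hand but learned from data.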
Finally, try to imagine how this would look with 4, 5, 6, or an arbitrary number of linear functions.