Suppose that we have a vocabulary of 3 words, "a", "b", and "c", and we want to predict the next word in a sentence given the previous two words. For this network, we don't want to use feature vectors for words: we simply use the local encoding, i.e. a 3-component vector with one entry being 1 and the other two entries being 0.
In the language models that we have seen so far, each of the context words has its own dedicated section of the network, so we would encode this problem with two 3-dimensional inputs. That makes for a total of 6 dimensions. For example, if the two preceding words (the "context" words) are "c" and "b", then the input would be $(0,0,1,0,1,0)$.
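For concreteness, here is a minimal sketch of that encoding in Python/NumPy (the vocabulary ordering a, b, c and the helper names are my own choices for illustration, not part of the original problem):

```python
import numpy as np

VOCAB = ["a", "b", "c"]  # assumed ordering: index 0 = "a", 1 = "b", 2 = "c"

def one_hot(word):
    """Local (one-hot) encoding: a 3-component vector with a single 1."""
    v = np.zeros(len(VOCAB))
    v[VOCAB.index(word)] = 1.0
    return v

def encode_context(first, second):
    """Concatenate the one-hot codes of the two context words."""
    return np.concatenate([one_hot(first), one_hot(second)])

print(encode_context("c", "b"))  # [0. 0. 1. 0. 1. 0.]
```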
Clearly, the more context words we want to include, the more input units our network must have. More inputs mean more parameters, which increases the risk of overfitting. Here is a proposal to reduce the number of parameters in the model:
Consider a single neuron that is connected to this input, and call the weights that connect the input to this neuron $w_1, w_2, w_3, w_4, w_5$, and $w_6$. $w_1$ connects the neuron to the first input unit, $w_2$ connects it to the second input unit, etc. Notice how for every neuron, we need as many weights as there are input dimensions (6 in our case), which will be the number of words times the length of the context.

A way to reduce the number of parameters is to tie certain weights together, so that they share a parameter. One possibility is to tie the weights coming from input units that correspond to the same word but at different context positions. In our example that would mean that $w_1 = w_4$, $w_2 = w_5$, and $w_3 = w_6$ (see the "after" diagram below).
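To make the proposal concrete, here is a small sketch of a tied-weight neuron in NumPy (the weight values and function names are made up for illustration): the untied neuron needs 6 weights, while the tied one has only 3 free parameters, reused at both context positions.

```python
import numpy as np

def untied_activation(x, w):
    """Ordinary neuron: one weight per input unit, so w has 6 components."""
    return np.dot(w, x)

def tied_activation(x, w_shared):
    """Tied neuron: w1=w4, w2=w5, w3=w6, so only 3 free parameters.
    The shared weights are applied to both 3-dimensional halves of x."""
    w_full = np.concatenate([w_shared, w_shared])  # expand back to 6 weights
    return np.dot(w_full, x)

x = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0])  # context "c", "b" as above
w_shared = np.array([0.1, -0.3, 0.7])          # example values for w1, w2, w3
print(tied_activation(x, w_shared))            # 0.7 + (-0.3) ≈ 0.4
```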
Are there any significant problems with this approach?
Yes: weight tying only makes sense when we are working with images.
No: the new model after weight tying is an example of a convolutional neural network, and these are more powerful than a non-convolutional network because they are invariant to small transformations in the data.
Yes: the network loses the knowledge of the location at which a context word occurs, and that is valuable knowledge.
No: this method is an appropriate solution in that it will reduce the number of parameters and therefore always improve generalization.
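Before settling on an answer, it may help to check numerically what the tying implies, reusing the illustrative helpers `encode_context` and `tied_activation` sketched above (again, these names and weight values are assumptions for this example):

```python
import numpy as np

w_shared = np.array([0.1, -0.3, 0.7])

a_cb = tied_activation(encode_context("c", "b"), w_shared)
a_bc = tied_activation(encode_context("b", "c"), w_shared)
print(a_cb, a_bc)  # identical: with tied weights the neuron's response
                   # depends only on which words appear, not on their positions
```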