GPT-2

15 Oct 2022

GPT-2: Language Models are Unsupervised Multitask Learners

Radford, Alec, et al. 2019. OpenAI blog.

Summary:

GPT-2 is a language model that aims to generalize directly to multiple specific NLP tasks without extra training. The motivation, as the paper states, is that ‘a language model with sufficient capacity will begin to learn to infer and perform tasks demonstrated in natural language sequences in order to better predict them.’ Based on this idea, the model uses only the decoder part of the Transformer to autoregressively predict the next token, trained in an unsupervised way.
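
As a rough illustration of this objective, here is a minimal sketch of the next-token prediction loss, assuming a PyTorch-style decoder-only model that maps token ids to per-position logits (the function and tensor names are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Autoregressive LM loss: the logits at position t are scored against token t+1."""
    # logits: (batch, seq, vocab) from a decoder-only model; tokens: (batch, seq).
    # Drop the last position (nothing follows it) and the first token (nothing predicts it).
    return F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
```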

The modification of the Transformer decoder for unsupervised training is quite straightforward. First, since there is no Transformer encoder in the model, the encoder-decoder attention layer of the original Transformer is discarded. Instead, the decoder stack directly takes the sum of the token embedding and the positional embedding as input. Since GPT-2 is designed to perform NLP tasks involving multiple sentences, the different sentences in the input sequence are separated by a special token.
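
A minimal sketch of this input construction, assuming learned positional embeddings as in GPT-2 (the class and attribute names are my own shorthand):

```python
import torch
import torch.nn as nn

class GPT2InputEmbedding(nn.Module):
    """Sketch of the decoder-stack input: token embedding plus learned positional embedding."""
    def __init__(self, vocab_size: int, max_len: int, d_model: int):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, d_model)  # token embedding table
        self.wpe = nn.Embedding(max_len, d_model)     # learned positional embedding table

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) integer ids; positions are simply 0..seq-1
        positions = torch.arange(tokens.size(1), device=tokens.device)
        return self.wte(tokens) + self.wpe(positions)  # broadcasts to (batch, seq, d_model)
```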

Second, the self-attention layer in GPT-2 is masked self-attention. Unlike the bidirectional self-attention in the Transformer encoder and in BERT, unmasked self-attention would leak information when predicting the next word. Masked self-attention assigns -Inf to the attention scores of later tokens, so that only the tokens before the target token can influence the prediction. The idea and implementation are similar to the masked self-attention in the original Transformer decoder.
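
A minimal single-head sketch of this causal masking (tensor shapes and names are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def masked_self_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Causal self-attention: position t may only attend to positions <= t."""
    # q, k, v: (batch, seq, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5                     # (batch, seq, seq)
    seq = scores.size(-1)
    causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool, device=scores.device))
    scores = scores.masked_fill(~causal, float("-inf"))               # future positions get -Inf
    weights = F.softmax(scores, dim=-1)                               # -Inf becomes weight 0
    return weights @ v
```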

Strength:

The multitask learning idea further lessens the burden on NLP engineers of writing task-specific fine-tuning programs, which makes the model easier to adopt for common language tasks.

Also, the model does not constrain its output length, which makes it possible to directly handle question answering, article summarization, and article generation.

Critiques:

Obviously, to gain the benefit of unconstrained output length (autoregression), GPT-2 sacrifices bidirectional information flow during self-attention. If self-attention were bidirectional, predicting the next word would be meaningless, since the next-word information would already have leaked. However, when the task is to fill in a blank word in the middle of a context, many clues certainly come from the latter part of the sentence; in that case, there is no guarantee the model can generate the correct word in one pass. If the single-direction decoder is kept, there should at least be some mechanism that lets the model correct its previous output when the later context shows it to be wrong.

Zero-shot learning, on the other hand, restricts the model's usefulness on specialized tasks. A good general model is not necessarily a good model for tasks requiring domain knowledge. Also, since the weights are fixed and there is no memory of the user's context across rounds, it is impossible for the model to become a good dialogue model.

The assumption that a language model will automatically generalize to multiple tasks if it is trained well and designed accordingly is questionable. Even though learning to infer may help the model predict better, the model lacks the embodied knowledge needed to understand causal relationships between entities. This argument is supported by GPT-2's poor performance on the Winograd Schema Challenge. Even though GPT-2 was the SOTA at the time, and the limitation arguably stems from the data modality rather than the model itself, making such a statement is quite untenable and reckless.