Reformer

20 Oct 2022

Reformer: The Efficient Transformer

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. ICLR 2020.

Summary:

The motivation of the Reformer is to reduce the space and time complexity of the Transformer, since the cost of training the model (even for fine-tuning) makes it unrealistic for non-industrial research labs. The paper analyzes possible architectural and implementation redundancies in the Transformer and proposes solutions that achieve comparable performance with much less computation time and memory usage.

The original Transformer self-attention materializes an N*N attention matrix in the Q*K^T dot-product; for a sequence of length 64K, this matrix alone takes 16GB of memory in 32-bit floats. However, the attention operation only cares about softmax(Q*K^T), and the softmax is dominated by the largest values. This indicates that for each query q we only need to compute scores against its closest keys rather than against K as a whole, which saves a lot of space for a 64K sequence. To find the closest keys for a given q, the authors apply locality-sensitive hashing (LSH): the vectors are multiplied by random rotations and assigned to buckets according to the projection axis they fall closest to. The intuition is that if two vectors are close/similar, they have a high probability of being assigned to the same bucket, and running more hashing rounds lowers the probability that similar vectors end up separated into different buckets. After hashing, the full attention can be approximated by self-attention within each bucket. Suppose there are m buckets of roughly equal size n/m; the self-attention space complexity is then m*(n/m)*(n/m)*d = (n^2/m)*d, which becomes n*log(n)*d when the bucket size n/m is on the order of log(n).
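As a rough illustration of the bucketing step, here is a minimal sketch of angular LSH in PyTorch; it is not the paper's exact implementation (which additionally sorts and chunks the hashed positions), and the name lsh_hash and the toy sizes are my own:

```python
import torch

def lsh_hash(vectors, n_buckets, n_rounds=4):
    """Angular LSH: multiply by a random rotation and assign each vector to
    the bucket whose axis (among [xR; -xR]) it projects onto most strongly.
    Close vectors land in the same bucket with high probability."""
    d = vectors.shape[-1]
    # One random rotation per hashing round; n_buckets is assumed even.
    rotations = torch.randn(n_rounds, d, n_buckets // 2)
    projected = torch.einsum('nd,rdb->rnb', vectors, rotations)
    buckets = torch.argmax(torch.cat([projected, -projected], dim=-1), dim=-1)
    return buckets  # shape: (n_rounds, n_vectors)

# Toy usage: with shared-QK attention, queries and keys are the same vectors,
# so attention only needs to be computed among positions that share a bucket.
x = torch.nn.functional.normalize(torch.randn(64, 32), dim=-1)
print(lsh_hash(x, n_buckets=8).shape)  # torch.Size([4, 64])
```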

The second idea is to replace the normal Transformer residual blocks with reversible layers. The network then does not need to save all intermediate activations for the backward pass, because they can be recovered from the following layer's outputs. Concretely, y1 = x1 + F(x2) and y2 = x2 + G(y1), so x2 = y2 - G(y1) and x1 = y1 - F(x2) can be reconstructed from the layer's outputs alone. In the Reformer, F is the attention layer and G is the feed-forward network, so activations no longer have to be stored for every one of the n_l layers, removing the n_l factor from activation memory.
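A minimal sketch of such a reversible block, assuming the hidden state is split into two halves x1 and x2 (the class name ReversibleBlock and the toy check below are illustrative, not the authors' code):

```python
import torch

class ReversibleBlock(torch.nn.Module):
    """Reversible residual block: the inputs can be recomputed from the
    outputs, so activations need not be cached for backpropagation.
    In the Reformer, F would be attention and G the feed-forward network."""
    def __init__(self, F, G):
        super().__init__()
        self.F, self.G = F, G

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2

# Toy check that the inverse recovers the inputs (up to float error).
block = ReversibleBlock(torch.nn.Linear(16, 16), torch.nn.Linear(16, 16))
x1, x2 = torch.randn(4, 16), torch.randn(4, 16)
with torch.no_grad():
    y1, y2 = block(x1, x2)
    r1, r2 = block.inverse(y1, y2)
print(torch.allclose(r1, x1, atol=1e-5), torch.allclose(r2, x2, atol=1e-5))
```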

Since the computation in the feed-forward layers is independent across positions in the sequence, the forward pass does not have to process the input sequence as a whole: it can be chunked into several segments that are computed one after another, so the feed-forward activations (the d_ff-sized tensors) only need to be materialized for a chunk of length N_c at a time instead of the full sequence length N.
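A hedged sketch of this chunked feed-forward pass (chunked_feed_forward and the toy dimensions are my own; the output is identical to applying the FFN to the whole sequence at once):

```python
import torch

def chunked_feed_forward(ffn, x, chunk_size):
    """Apply a position-wise FFN chunk by chunk along the sequence length.
    The result equals ffn(x), but the d_ff-sized activations only ever
    exist for one chunk at a time, cutting peak memory."""
    chunks = x.split(chunk_size, dim=-2)
    return torch.cat([ffn(c) for c in chunks], dim=-2)

# Toy usage with made-up dimensions.
ffn = torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.ReLU(),
                          torch.nn.Linear(256, 64))
x = torch.randn(1, 1024, 64)
print(chunked_feed_forward(ffn, x, chunk_size=128).shape)  # (1, 1024, 64)
```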

Strength:

Even though one can argue that computation time and memory problems will always be mitigated by fast progress in hardware design, I think this work still occupies an important position in the deep learning community: it makes Transformer-scale models feasible on non-industrial lab GPUs. The novelty mainly lies in the LSH attention design, which reduces the N^2 self-attention cost to the N*log(N) level. Reversible nets and chunking are existing methods that are successfully applied to the Transformer here. Overall, the dominant activation memory shrinks roughly from the per-layer N*d_ff feed-forward activations plus the N^2 attention matrices, accumulated across all layers and heads, to chunk-sized feed-forward activations plus N*log(N)*d attention.

Besides, a really important influence of the work is its rethinking of the original Transformer design: it shows that the shared-QK mechanism (using the same projection for queries and keys) does not hurt performance.

Critique:

The cost of the chunking step is time, since the segments are processed sequentially; the other revisions do not bring negative effects on performance. So overall, the work is solid.