Authors: Xindian Ma, Peng Zhang, Shuai Zhang, Nan Duan, Yuexian Hou, Dawei Song, Ming Zhou
Compiled by Heart of the Machine. Contributors: Prince Jia, Yiming, Road. Selected from arXiv.
Thanks to their outstanding performance on natural language processing tasks, Transformer-based pre-trained language models are a focus of NLP research. Because these models have so many parameters, they are difficult to train and inconvenient to deploy, so researchers have been exploring ways to compress them. Recently, Tianjin University and Microsoft Research Asia proposed a compression method for the Transformer that not only cuts the number of parameters by nearly half but also improves model performance on language modeling and neural machine translation tasks. This research can help deploy pre-trained language models in environments with limited computing resources.
Recently, pre-trained language models have performed well on many NLP tasks. In particular, Transformer-based pre-trained language models rely entirely on the self-attention mechanism and have achieved breakthroughs in natural language processing (NLP) tasks.
However, the Transformer's core structure, the multi-head attention mechanism, limits the development of these models. Multi-head attention itself introduces a large number of parameters, which may cause problems in training, and deploying a model with so many parameters also requires substantial resources. Compressing large neural pre-trained language models has therefore long been an important problem in NLP research.
To address this problem, the paper draws on tensor decomposition and parameter sharing and proposes multi-linear attention, built with Block-Term Tensor Decomposition (BTD). The researchers evaluated it on language modeling and neural machine translation tasks. Compared with many language modeling methods, the multi-linear attention mechanism not only greatly reduces the number of model parameters but also improves model performance.
1. Modifying the Transformer
In the Transformer, multi-head attention is a key mechanism. Because Query, Key, and Value each undergo multiple linear transformations during training, and each head's attention is computed separately, a large number of redundant parameters are produced. Compressing the parameters of the multi-head attention mechanism well currently faces two main challenges.
To solve these problems, the researchers' method combines low-rank approximation with parameter sharing, achieving a higher compression ratio. Although the self-attention mechanism in the Transformer (scaled dot-product attention) can be reconstructed, they did not do so; instead they chose to split the third-order tensor (that is, the output of multi-linear attention), which is more conducive to improving experimental accuracy.
The compression method used in the study is shown in the figure:
Figure 2: Model compression method. Left: Tucker decomposition is used to build single-block attention. Right: a new attention mechanism, multi-linear attention, is built.
Compressing multi-head self-attention
The first problem in compressing the model is reducing the number of parameters in multi-head self-attention. To solve it, the researchers first proved that orthogonal basis vectors can linearly represent the self-attention mechanism. Then, by initializing a low-rank core tensor, they reconstruct a new attention representation. To build the multi-head mechanism and compress the model, they used Block-Term Tensor Decomposition (BTD), a combination of CP decomposition and Tucker decomposition. Q, K, and V are shared when building each third-order block tensor, so many parameters can be saved.
Figure 2 (left) shows the structure of the single-block attention mechanism. First, Query, Key, and Value are mapped into three factor matrices Q, K, and V, each consisting of a set of orthogonal basis vectors. Then a new attention mechanism (single-block attention) is constructed by initializing a trainable third-order diagonal tensor G. In Figure 2 (left), R is the rank of the tensor, N is the length of the sequence, and d is the dimension of the matrix. With Tucker decomposition, the expression for single-block attention can be computed:
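The construction described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation. It assumes the factor matrices Q, K, and V each have shape (N, R) and that the diagonal core G is stored as a length-R vector `g`, so each entry of the output tensor is T[i, j, m] = sum over r of g[r] * Q[i, r] * K[j, r] * V[m, r]. The function name and the toy sizes are hypothetical.

```python
import numpy as np

def single_block_attention(g, Q, K, V):
    """Tucker-style reconstruction with a diagonal core (illustrative sketch).

    g       : (R,)   diagonal entries of the trainable core tensor G (assumed layout)
    Q, K, V : (N, R) factor matrices
    returns : (N, N, N) tensor with T[i, j, m] = sum_r g[r] * Q[i, r] * K[j, r] * V[m, r]
    """
    return np.einsum("r,ir,jr,mr->ijm", g, Q, K, V)

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
N, R = 4, 3
Q, K, V = (rng.standard_normal((N, R)) for _ in range(3))
g = rng.standard_normal(R)

T = single_block_attention(g, Q, K, V)
print(T.shape)
```

Because the core is diagonal, the block stores only R core parameters in addition to the factor matrices, which is where the savings over a dense core come from.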
Integrating multi-linear attention
To integrate the compressed single-block attention tensors into the Transformer, the researchers first compute the average of the block tensors. Second, the averaged tensor is split into matrices, which are concatenated and passed as input to the next layer, so the mechanism fits into the Transformer's encoder-decoder structure.
In Figure 2 (right), to implement the multi-head mechanism while compressing the multiple sets of projection parameters, the researchers use a single set of linear projections and share its output across blocks. The learned linear projections map Query, Key, and Value to three matrices composed of basis vectors. On this basis, Block-Term Tensor Decomposition is used to build the multi-head mechanism. The researchers named the model multi-linear attention, which can be expressed as:
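The steps described above (shared projections, one diagonal core per block, averaging, then split and concatenate) can be sketched as follows. This is a hedged illustration under assumed shapes, not the paper's code: the names `Wq`, `Wk`, `Wv`, and `Wo`, the choice to split along the third mode, and an output projection of shape (N*N, d_model) are all assumptions made to keep the example self-contained.

```python
import numpy as np

def block_attention(g, Q, K, V):
    # Single-block attention with a diagonal core g of shape (R,).
    return np.einsum("r,ir,jr,mr->ijm", g, Q, K, V)

def multi_linear_attention(X, Wq, Wk, Wv, cores, Wo):
    """Sketch of multi-linear attention under assumed shapes.

    X          : (N, d_model) input sequence
    Wq, Wk, Wv : (d_model, R) one shared set of projections for all blocks
    cores      : list of h vectors of shape (R,), one diagonal core per block
    Wo         : (N * N, d_model) output projection (assumed shape)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv  # shared across all blocks
    # Average the h third-order block tensors.
    T = sum(block_attention(g, Q, K, V) for g in cores) / len(cores)
    # Split along the third mode into N matrices of shape (N, N), then concatenate.
    N = T.shape[0]
    flat = np.concatenate([T[:, :, m] for m in range(N)], axis=1)  # (N, N * N)
    return flat @ Wo

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
N, d_model, R, h = 4, 8, 3, 2
X = rng.standard_normal((N, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, R)) for _ in range(3))
cores = [rng.standard_normal(R) for _ in range(h)]
Wo = rng.standard_normal((N * N, d_model))

out = multi_linear_attention(X, Wq, Wk, Wv, cores, Wo)
print(out.shape)
```

The point of the sketch is the parameter sharing: adding another block adds only one more length-R core vector, whereas adding another head in standard multi-head attention adds a full set of projection matrices.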
The result is a compressed model: once multi-head attention has been compressed, the Transformer is tensorized. Multi-linear attention can be incorporated directly into the Transformer.
2. Experimental results
To test the effect of modifying multi-head attention in the Transformer, the researchers ran experiments on two tasks: language modeling (LM) and neural machine translation (NMT).
The task of language modeling is to predict the next word in a sentence. The study adopted the standard language modeling setting: predict the next token given the preceding tokens. Three datasets were selected: the small PTB, the medium-sized WikiText-103, and the large One-Billion. In preprocessing, all words are lowercased and each newline is replaced with <eos>. The vocabulary consists of common words, and words outside it are represented by [UNK]. Models are evaluated by perplexity (PPL), the exponential of the average negative log-likelihood per word. The lower the PPL, the better the model.
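As a small illustration of the metric, perplexity can be computed from per-token log-probabilities as below. This is a generic sketch of the standard definition, assuming natural-log probabilities; the function name is hypothetical and it is not taken from the paper's evaluation code.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(average negative natural-log likelihood per token)."""
    n = len(token_log_probs)
    if n == 0:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / n)

# A model that assigns probability 1/4 to every token is as uncertain as a
# uniform choice among four words.
uniform_log_probs = [math.log(1 / 4)] * 10
ppl = perplexity(uniform_log_probs)
print(ppl)
```

Intuitively, a perplexity of k means the model is on average as uncertain as if it were choosing uniformly among k words, which is why lower is better.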
The experiments use the Transformer, a recent open-source language modeling architecture, and replace the standard multi-head attention layers with multi-linear attention layers. The researchers then tested different model configurations on the PTB, WikiText-103, and One-Billion Word benchmark datasets. The results are shown in Tables 1 and 2.
Table 1: Number of model parameters and perplexity scores on the One-Billion dataset. Core-1 indicates that the model uses a single core tensor; Core-2 indicates that two block-term tensors are used.
Table 2: Number of model parameters and perplexity scores on the PTB and WikiText-103 datasets. "-" means no result was reported for that model. "*" marks results that the researchers obtained from their own implementation.
Neural machine translation
In this task, the researchers trained the Transformer model on the WMT 2016 English-German translation dataset. In the experiments, each attention layer was replaced with multi-linear attention. For evaluation, they used beam search with a beam size of 5 and a length penalty of α = 0.6. The results are compared with the Transformer in Table 3. "*" marks results that the researchers obtained themselves.
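To show what the length penalty does, here is a minimal sketch of the final rescoring step of beam search. It assumes the widely used GNMT-style penalty, lp(Y) = ((5 + |Y|) / 6)^α; the article does not say which formula was used, so treat that choice, the function names, and the toy hypotheses as assumptions. The sketch ranks finished hypotheses only and does not implement the search itself.

```python
def length_penalty(length, alpha=0.6):
    # GNMT-style length penalty (assumed formula): ((5 + |Y|) / 6) ** alpha
    return ((5.0 + length) / 6.0) ** alpha

def rank_hypotheses(hypotheses, alpha=0.6):
    """Sort finished beam hypotheses by length-normalized log-probability.

    hypotheses: list of (tokens, total_log_prob) pairs
    """
    return sorted(
        hypotheses,
        key=lambda h: h[1] / length_penalty(len(h[0]), alpha),
        reverse=True,
    )

# Toy hypotheses (invented for illustration): a short one and a longer one.
short = (["a", "b"], -4.0)
longer = (["a", "b", "c", "d", "e", "f"], -5.0)

best_with_penalty = rank_hypotheses([short, longer], alpha=0.6)[0]
best_without_penalty = rank_hypotheses([short, longer], alpha=0.0)[0]
print(len(best_with_penalty[0]), len(best_without_penalty[0]))
```

With α = 0 the raw log-probability always favors shorter outputs; a positive α offsets that bias so longer translations can compete.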
Table 3: Number of model parameters and corresponding BLEU scores.