Training networks with billions of parameters has made great progress recently. Microsoft has updated DeBERTa (Decoding-enhanced BERT with disentangled attention) and trained a model consisting of 48 Transformer layers with 1.5 billion parameters.
The performance of the single DeBERTa model is greatly improved: its macro-average score on the SuperGLUE language understanding benchmark surpasses human performance for the first time (89.9 vs. 89.8), and the ensemble model surpasses the human baseline by a considerable margin (90.3 vs. 89.8).
The SuperGLUE benchmark comprises a wide range of natural language understanding tasks, including question answering and natural language inference. The model's macro-average score of 90.8 also sits at the top of the GLUE benchmark leaderboard.
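The macro-average score mentioned above is simply the unweighted mean of the per-task scores. A minimal sketch, where the task names are real SuperGLUE tasks but the scores are hypothetical placeholders, not DeBERTa's actual per-task results:

```python
# Macro average: every task contributes equally, regardless of dataset size.
# The scores below are illustrative placeholders only.
task_scores = {
    "BoolQ": 90.4,
    "CB": 94.9,
    "COPA": 96.8,
    "RTE": 93.2,
}

macro_average = sum(task_scores.values()) / len(task_scores)
print(round(macro_average, 1))
```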
DeBERTa uses three novel techniques to improve on state-of-the-art pre-trained language models (such as BERT, RoBERTa, and UniLM): a disentangled attention mechanism, an enhanced mask decoder, and a virtual adversarial training method for fine-tuning.
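Disentangled attention scores each token pair with separate content and relative-position terms. Below is a minimal single-head NumPy sketch of that score computation (content-to-content, content-to-position, and position-to-content, scaled by sqrt(3d)); the function name, dimensions, and random inputs are illustrative assumptions, and the real implementation adds softmax, masking, and multiple heads.

```python
import numpy as np

def disentangled_attention_scores(H, P_rel, Wq, Wk, Wq_r, Wk_r, rel_idx):
    """Illustrative single-head disentangled attention scores.

    H:       (L, d) content embeddings
    P_rel:   (2k, d) relative-position embeddings
    rel_idx: (L, L) integer map of clipped relative distances delta(i, j)
    """
    Qc, Kc = H @ Wq, H @ Wk              # content projections
    Qr, Kr = P_rel @ Wq_r, P_rel @ Wk_r  # relative-position projections
    # content-to-content term
    c2c = Qc @ Kc.T
    # content-to-position: query content against relative-position keys
    c2p = np.take_along_axis(Qc @ Kr.T, rel_idx, axis=1)
    # position-to-content: key content against relative-position queries
    p2c = np.take_along_axis(Kc @ Qr.T, rel_idx, axis=1).T
    d = H.shape[1]
    return (c2c + c2p + p2c) / np.sqrt(3 * d)  # sqrt(3d): three score terms

# Toy usage with random inputs (L tokens, hidden size d, max distance k)
L, d, k = 6, 8, 4
rng = np.random.default_rng(0)
H = rng.normal(size=(L, d))
P_rel = rng.normal(size=(2 * k, d))
Wq, Wk, Wq_r, Wk_r = rng.normal(size=(4, d, d))
idx = np.arange(L)
rel_idx = np.clip(idx[:, None] - idx[None, :] + k, 0, 2 * k - 1)
scores = disentangled_attention_scores(H, P_rel, Wq, Wk, Wq_r, Wk_r, rel_idx)
print(scores.shape)  # (6, 6)
```

In contrast to standard self-attention, where position information is added into the input embeddings once, here each pairwise score explicitly mixes content and relative position, which is the intuition behind the "disentangled" name.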
Compared with Google's T5 model, which has 11 billion parameters, the 1.5-billion-parameter DeBERTa is more energy-efficient to train and maintain, and easier to compress and deploy to applications in various environments.
DeBERTa's surpassing of human performance on SuperGLUE marks an important milestone toward general AI. Although the results on SuperGLUE are encouraging, the model is by no means reaching human-level intelligence in natural language understanding.