Pre-trained models: Past, present and future
Keywords
1. Introduction
Fig. 1. The two figures show the significant performance improvements on both language understanding and language generation tasks brought by large-scale PTMs.
Fig. 2. Fig. 2(a) shows the number of publications containing the keyword “language model”, as well as their citations, in different years. Fig. 2(b) shows that both the parameter size of large-scale PTMs for NLP tasks and the size of their pre-training data have been increasing by 10 times per year. From these figures, we can see that since 2018, when large-scale NLP PTMs began to be explored, more and more effort has been devoted to this field, and the model and data sizes used by PTMs have kept growing.
Fig. 3. GPT-3, with 175 billion parameters, was trained on 560 GB of data using 10,000 GPUs. It has shown the ability to learn world knowledge, common sense, and logical reasoning.
2. Background
2.1. Transfer learning and supervised pre-training
2.2. Self-supervised learning and self-supervised pre-training
Fig. 4. The spectrum of pre-training methods, from transfer learning and self-supervised learning to the latest pre-trained neural models.
3. Transformer and representative PTMs
3.1. Transformer
Fig. 5. The architecture of Transformer, GPT, and BERT.
- (1) Self-attention is used in the encoder, taking the output of the previous layer as Q, K, and V. In the encoding phase, given a word, self-attention computes its attention scores by comparing it with every word in the input sequence; these scores indicate how much each of the other words should contribute to the next representation of the given word. We give an example in Fig. 6, where self-attention accurately captures the referential relationship between “Jack” and “he”, assigning it the highest attention score.
Fig. 6. An illustration of the self-attention mechanism of Transformer. The figure shows the self-attention results when encoding the word “he”, where the darker the color of the square is, the larger the corresponding attention score is.
- (2) Masked self-attention is used in the decoder, whose attention matrix satisfies Aij = 0 for i < j, i.e., a position cannot attend to positions after it. This form of attention is beneficial to autoregressive language modeling. In the decoding phase, the self-attention is similar to that of the encoder, except that it decodes one representation at a time, from left to right. Since each step of the decoding phase only consults the previously decoded results, we therefore need to add a masking function to the self-attention.
- (3) Cross-attention is also used in the decoder; it takes the output of the previous decoder block as Q and the output of the encoder as K and V. This procedure essentially aggregates the information of the whole input sequence, and it is applied to every word to be generated in the decoding phase. Taking advantage of the input context in this way is of great significance for seq2seq tasks such as machine translation and text summarization. A minimal code sketch of the three attention variants is given after this list.
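As a concrete illustration of the three attention variants above, the following minimal NumPy sketch (our own illustration rather than code from any surveyed model, with the learned projection matrices for Q, K, and V omitted for brevity) computes scaled dot-product attention in the encoder self-attention, decoder masked self-attention, and cross-attention configurations.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(QK^T / sqrt(d_k)) V, optionally masking out forbidden positions."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # raw attention scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~zero weight after softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax -> attention matrix A
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, src_len, tgt_len = 8, 5, 3
enc_states = rng.normal(size=(src_len, d_model))  # output of the previous encoder layer
dec_states = rng.normal(size=(tgt_len, d_model))  # output of the previous decoder block

# (1) Encoder self-attention: Q, K, V all come from the previous layer's output.
_, A_enc = scaled_dot_product_attention(enc_states, enc_states, enc_states)

# (2) Decoder masked self-attention: position i may only attend to positions j <= i,
#     so A[i, j] = 0 whenever i < j (the upper triangle is masked out).
causal_mask = np.tril(np.ones((tgt_len, tgt_len), dtype=bool))
_, A_masked = scaled_dot_product_attention(dec_states, dec_states, dec_states, mask=causal_mask)

# (3) Cross-attention: Q from the decoder, K and V from the encoder output,
#     so every word to be generated aggregates information from the whole input sequence.
_, A_cross = scaled_dot_product_attention(dec_states, enc_states, enc_states)

print(np.triu(A_masked, k=1).max())  # ~0: future positions receive no attention
```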
3.2. GPT
Fig. 7. The difference between GPT and BERT in their self-attention mechanisms and pre-training objectives.
3.3. BERT
Fig. 8. The pre-training and fine-tuning phases for BERT.
3.4. After GPT and BERT
Fig. 9. The family of recent typical PTMs, including both pre-trained language models and multimodal models.
4. Designing effective architectures
4.1. Unified sequence modeling
- ● Natural language understanding: includes grammatical analysis, syntactic analysis, word/sentence/paragraph classification, question answering, factual/commonsense knowledge inference, etc.
- ● Open-ended language generation: includes dialog generation, story generation, data-to-text generation, etc.
- ● Non-open-ended language generation: includes machine translation, abstractive summarization, blank filling, etc.
Table 1. Three fundamental types of framework and their suitable downstream tasks. “NLU” refers to natural language understanding. “Cond. Gen.” and “Uncond. Gen.” refer to conditional and unconditional text generation, respectively. “✓” means “is good at”, “—” means “could be adapted to”, and “ × ” means “cannot be directly applied to”. We define unconditional generation as the task of generating text without further training as in a standard language model, while conditional generation refers to seq2seq tasks such as text summarization. Taken from (Du et al., 2021).
| Framework | NLU | Cond. Gen. | Uncond. Gen. |
|---|---|---|---|
| Autoregressive | — | — | ✓ |
| Autoencoding | ✓ | × | × |
| Encoder-Decoder | — | ✓ | — |
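As a rough, hedged illustration of how the three frameworks in Table 1 are instantiated in practice, the sketch below loads one representative publicly released checkpoint per framework. It assumes the Hugging Face transformers library and the gpt2, bert-base-uncased, and t5-small checkpoints; none of these choices are prescribed by the surveyed work.

```python
# Illustrative only: representative checkpoints for the three frameworks in Table 1.
from transformers import (AutoModelForCausalLM, AutoModelForMaskedLM,
                          AutoModelForSeq2SeqLM, AutoTokenizer)

# Autoregressive (good at unconditional generation): left-to-right language modeling.
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")
out = gpt.generate(**gpt_tok("Pre-trained models are", return_tensors="pt"), max_length=20)
print(gpt_tok.decode(out[0]))

# Autoencoding (good at NLU): bidirectional encoding with masked-token prediction.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
logits = bert(**bert_tok("Pre-trained models capture [MASK] knowledge.", return_tensors="pt")).logits

# Encoder-decoder (good at conditional generation): seq2seq tasks such as summarization.
t5_tok = AutoTokenizer.from_pretrained("t5-small")
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
summary = t5.generate(**t5_tok("summarize: Large-scale PTMs improve many NLP tasks.",
                               return_tensors="pt"), max_length=16)
print(t5_tok.decode(summary[0], skip_special_tokens=True))
```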
4.2. Cognitive-inspired architectures
4.3. More variants of existing PTMs
5. Utilizing multi-source data
5.1. Multilingual pre-training
5.2. Multimodal pre-training
5.3. Knowledge-enhanced pre-training
6. Improving computational efficiency
6.1. System-level optimization
Fig. 10. An illustration of ZeRO-Offload and ZeRO-Offload with delayed parameter update.
Fig. 11. An illustration of the data parallelism and model parallelism with 16 nodes.
Fig. 12. An illustration of the pipeline parallelism with 4 nodes and 4 micro batches.
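To make the schedule in Fig. 12 concrete, the toy sketch below (our illustration; only the 4-stage, 4-micro-batch setting is taken from the figure) prints a GPipe-style forward pipeline schedule, showing that stage s processes micro-batch m at clock step s + m and is otherwise idle in the start-up and drain “bubbles”.

```python
# Toy GPipe-style forward schedule for pipeline parallelism (illustrative only).
NUM_STAGES = 4         # 4 nodes, each holding a consecutive slice of the model's layers
NUM_MICRO_BATCHES = 4  # one mini-batch split into 4 micro-batches

# Stage s can start micro-batch m only after stage s-1 has finished it,
# so micro-batch m reaches stage s at clock step s + m.
num_steps = NUM_STAGES + NUM_MICRO_BATCHES - 1
for step in range(num_steps):
    busy = []
    for stage in range(NUM_STAGES):
        micro = step - stage
        if 0 <= micro < NUM_MICRO_BATCHES:
            busy.append(f"stage{stage}:mb{micro}")
        else:
            busy.append(f"stage{stage}:idle")  # pipeline bubble at start-up / drain
    print(f"step {step}: " + "  ".join(busy))
```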
6.2. Efficient pre-training
6.3. Model compression
7. Interpretation and theoretical analysis
7.1. Knowledge of PTMs
7.2. Robustness of PTMs
7.3. Structural sparsity of PTMs
7.4. Theoretical analysis of PTMs
8. Future directions
8.1. Architectures and pre-training methods
8.2. Multilingual and multimodal pre-training
8.3. Computational efficiency
8.4. Theoretical foundation
8.5. Modeledge learning
8.6. Cognitive and knowledgeable learning
8.7. Applications
9. Conclusion
Note and contribution
Declaration of competing interest
References
- Abadi et al., 2016Tensorflow: a system for large-scale machine learningProceedings of OSDI (2016), pp. 265-283
- Adi et al., 2017Fine-grained analysis of sentence embeddings using auxiliary prediction tasksProceedings of ICLR (2017)
- Adiwardana et al., 2020Towards a Human-like Open-domain Chatbot(2020)arXiv preprint arXiv:2001.09977
- Ainslie et al., 2020ETC: encoding long and structured inputs in transformersProceedings of EMNLP (2020), pp. 268-284
- Alberti et al., 2019Fusion of detected objects in text for visual question answeringProceedings of EMNLP-IJCNLP (2019), pp. 2131-2140
- Antol et al., 2015Vqa: visual question answeringProceedings of ICCV (2015), pp. 2425-2433
- Arjovsky et al., 2017Wasserstein generative adversarial networksProceedings of ICML (2017), pp. 214-223
- Ba et al., 2016Layer normalizationProceedings of NeurIPS (2016)
- Baan et al., 2019Understanding Multi-Head Attention in Abstractive Summarization(2019)arXiv preprint arXiv:1911.03898
- Baddeley, 1992Working memoryScience, 255 (5044) (1992), pp. 556-559
- Bao et al., 2020PLATO: pre-trained dialogue generation model with discrete latent variableProceedings of ACL (2020)
- Bao et al., 2021Plato-2: towards building an open-domain chatbot via curriculum learningProceedings of ACL (2021)
- Barrouillet et al., 2004Time constraints and resource sharing in adults' working memory spansJ. Exp. Psychol. Gen., 133 (1) (2004), pp. 83-100
- Belkin et al., 2019Reconciling modern machine-learning practice and the classical bias–variance trade-offProc. Natl. Acad. Sci. Unit. States Am., 116 (32) (2019), pp. 15849-15854
- Beltagy et al., 2019Scibert: a pretrained language model for scientific textProceedings of EMNLP-IJCNLP (2019), pp. 3615-3620
- Beltagy et al., 2020Longformer: the Long-Document Transformer(2020)arXiv preprint arXiv:2004.05150
- Ben-Nun and Hoefler, 2019Demystifying parallel and distributed deep learning: an in-depth concurrency analysisACM Comput. Surv.(CSUR), 52 (4) (2019), pp. 1-43
- Bengio et al., 1994Learning long-term dependencies with gradient descent is difficultIEEE TNNLS, 5 (2) (1994), pp. 157-166
- Bengio et al., 2003A neural probabilistic language modelJMLR, 3 (2003), pp. 1137-1155
- Bi et al., 2020Palm: pre-training an autoencoding&autoregressive language model for context-conditioned generationProceedings of EMNLP (2020), pp. 8681-8691
- Bojar et al., 2014Findings of the 2014 workshop on statistical machine translationProceedings of WMT (2014), pp. 12-58
- Bosselut et al., 2019Comet: commonsense transformers for automatic knowledge graph constructionProceedings of ACL (2019), pp. 4762-4779
- Bouraoui et al., 2020Inducing relational knowledge from BERTProceedings of AAAI (2020), pp. 7456-7463
- Brown, 1958Some tests of the decay theory of immediate memoryQ. J. Exp. Psychol., 10 (1) (1958), pp. 12-21
- Brown et al., 2020Language models are few-shot learnersProceedings of NeurIPS (2020), pp. 1877-1901
- Carion et al., 2020End-to-end object detection with transformersProceedings of ECCV (2020), pp. 213-229
- Chen and He, 2020Exploring Simple Siamese Representation Learning(2020)arXiv preprint arXiv:2011.10566
- Chen et al., 2015Microsoft Coco Captions: Data Collection and Evaluation Server(2015)arXiv preprint arXiv:1504.00325
- Chen et al., 2020aTowards a Universal Continuous Knowledge Base(2020)arXiv preprint arXiv:2012.13568
- Chen et al., 2020bVariance-reduced Language Pretraining via a Mask Proposal Network(2020)arXiv preprint arXiv:2008.05333
- Chen et al., 2020cGraph optimal transport for cross-domain alignmentProceedings of ICML, PMLR (2020), pp. 1542-1553
- Chen et al., 2020dA simple framework for contrastive learning of visual representationsProceedings of ICML (2020), pp. 1597-1607
- Chen et al., 2020eKgpt: knowledge-grounded pre-training for data-to-text generationProceedings of EMNLP (2020), pp. 8635-8648
- Chen et al., 2020fUniter: universal image-text representation learningProceedings of ECCV (2020), pp. 104-120
- Chi et al., 2020aCross-lingual natural language generation via pre-trainingProceedings of AAAI (2020), pp. 7570-7577
- Chi et al., 2020bInfoxlm: an Information-Theoretic Framework for Cross-Lingual Language Model Pre-training(2020)arXiv preprint arXiv:2007.07834
- Child et al., 2019Generating Long Sequences with Sparse Transformers(2019)arXiv preprint arXiv:1904.10509
- Choromanski et al., 2021Rethinking attention with performersProceedings of ICLR (2021)
- Chuang et al., 2019Speechbert: an Audio-And-Text Jointly Learned Language Model for End-To-End Spoken Question Answering(2019)arXiv preprint arXiv:1910.11559
- Clark et al., 2019What does bert look at? an analysis of bert's attentionProceedings of BlackboxNLP (2019), pp. 276-286
- Clark et al., 2020Electra: pre-training text encoders as discriminators rather than generatorsProceedings of ICLR (2020)
- Collobert and Weston, 2008A unified architecture for natural language processing: deep neural networks with multitask learningProceedings of ICML (2008), pp. 160-167
- Conneau et al., 2018aWhat you can cram into a single $&!#* vector: probing sentence embeddings for linguistic propertiesProceedings of ACL (2018), pp. 2126-2136
- Conneau et al., 2018bXnli: evaluating cross-lingual sentence representationsProceedings of EMNLP (2018), pp. 2475-2485
- Conneau et al., 2020Unsupervised cross-lingual representation learning at scaleProceedings of ACL (2020), pp. 8440-8451
- Cordts et al., 2016The cityscapes dataset for semantic urban scene understandingProceedings of CVPR (2016), pp. 3213-3223
- Cui et al., 2019Pre-training with Whole Word Masking for Chinese Bert(2019)arXiv preprint arXiv:1906.08101
- Da and Kasai, 2019Cracking the contextual commonsense code: understanding commonsense reasoning aptitude of deep contextual representationsProceedings of EMNLP Workshop (2019)
- Dai et al., 2007Co-clustering based classification for out-of-domain documentsProceedings of KDD (2007), pp. 210-219
- Dai et al., 2008Self-taught clusteringProceedings of ICML (2008), pp. 200-207
- Dai et al., 2019Transformer-xl: attentive language models beyond a fixed-length contextProceedings of ACL (2019), pp. 2978-2988
- Daume and Marcu, 2006Domain adaptation for statistical classifiersJAIR, 26 (2006), pp. 101-126
- Davison et al., 2019Commonsense knowledge mining from pretrained modelsProceedings of EMNLP-IJCNLP (2019), pp. 1173-1178
- Deng et al., 2009Imagenet: a large-scale hierarchical image databaseProceedings of CVPR (2009), pp. 248-255
- Der Kiureghian and Ditlevsen, 2009Aleatory or epistemic? does it matter?Struct. Saf., 31 (2) (2009), pp. 105-112
- Devlin et al., 2019BERT: pre-training of deep bidirectional transformers for language understandingProceedings of NAACL-HLT (2019), pp. 4171-4186
- Dhingra et al., 2020Differentiable reasoning over a virtual knowledge baseProceedings of ICLR (2020)
- Ding et al., 2019Cognitive graph for multi-hop reading comprehension at scaleProceedings of ACL (2019), pp. 2694-2703
- Ding et al., 2020Cogltx: applying bert to long textsProceedings of NeurIPS, 33 (2020), pp. 12792-12804
- Ding et al., 2021aCogview: Mastering Text-To-Image Generation via Transformers(2021)arXiv preprint arXiv:2105.13290
- Ding et al., 2021bPrototypical representation learning for relation extractionProceedings of ICLR (2021)
- Donahue et al., 2015Long-term recurrent convolutional networks for visual recognition and descriptionProceedings of CVPR (2015), pp. 2625-2634
- Dong et al., 2019Unified language model pre-training for natural language understanding and generationProceedings of NeurIPS (2019)
- Du et al., 2021All Nlp Tasks Are Generation Tasks: A General Pretraining Framework(2021)arXiv preprint arXiv:2103.10360
- Erhan et al., 2010Why does unsupervised pre-training help deep learning?Proceedings of AISTATS (2010), pp. 201-208
- Ethayarajh, 2019How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddingsProceedings of EMNLP-IJCNLP (2019), pp. 55-65
- Ettinger, 2020What BERT is not: lessons from a new suite of psycholinguistic diagnostics for language modelsTACL, 8 (2020), pp. 34-48
- Ettinger et al., 2016Probing for semantic evidence of composition by means of simple classification tasksProceedings of RepEval (2016), pp. 134-139
- Evgeniou and Pontil, 2004Regularized multi–task learningProceedings of KDD (2004), pp. 109-117
- Evgeniou and Pontil, 2007Multi-task feature learningProceedings of NeurIPS (2007)
- Fan et al., 2019Reducing transformer depth on demand with structured dropoutProceedings of ICLR (2019)
- Fedus et al., 2021Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity(2021)arXiv preprint arXiv:2101.03961
- Févry et al., 2020Entities as experts: sparse memory access with entity supervisionProceedings of EMNLP (2020), pp. 4937-4951
- Forbes et al., 2019Do neural language representations learn physical commonsense?Proceedings of CogSci (2019), pp. 1753-1759
- Gao et al., 2008Knowledge transfer via multiple model local structure mappingProceedings of KDD (2008), pp. 283-291
- Gao et al., 2015Are you talking to a machine? dataset and methods for multilingual image question answeringProceedings of NeurIPS (2015), pp. 2296-2304
- Gao et al., 2021Making pre-trained language models better few-shot learnersProceedings of ACL (2021)
- Gidaris and Komodakis, 2015Object detection via a multi-region and semantic segmentation-aware cnn modelProceedings of ICCV (2015), pp. 1134-1142
- Glavaš and Vulić, 2021Is supervised syntactic parsing beneficial for language understanding tasks? an empirical investigationProceedings of EACL (2021), pp. 3090-3104
- Goldberg, 2019Assessing Bert's Syntactic Abilities(2019)arXiv preprint arXiv:1901.05287
- Gong et al., 2019Efficient training of BERT by progressively stackingProceedings of ICML (2019), pp. 2337-2346
- Gordon et al., 2020Compressing BERT: studying the effects of weight pruning on transfer learningProceedings of RepL4NLP (2020), pp. 143-155
- Goyal et al., 2017Accurate, Large Minibatch Sgd: Training Imagenet in 1 Hour(2017)arXiv preprint arXiv:1706.02677
- Gu et al., 2020Train no evil: selective masking for task-guided pre-trainingProceedings of EMNLP (2020), pp. 6966-6974
- Guan et al., 2020A knowledge-enhanced pretraining model for commonsense story generationTACL, 8 (2020), pp. 93-108
- Guo et al., 2019Star-transformerProceedings of HLT-NAACL (2019), pp. 1315-1325
- Gupta et al., 2015Deep learning with limited numerical precisionProceedings of ICML (2015), pp. 1737-1746
- Gururangan et al., 2020Don't stop pretraining: adapt language models to domains and tasksProceedings of ACL (2020)
- Guu et al., 2020Realm: Retrieval-Augmented Language Model Pre-training(2020)arXiv preprint arXiv:2002.08909
- Hashemi et al., 2019Tictac: accelerating distributed deep learning with communication schedulingProceedings of MLSys (2019)
- Han et al., 2021Ptr: Prompt Tuning with Rules for Text Classification(2021)arXiv preprint arXiv:2105.11259
- He et al., 2016Deep residual learning for image recognitionProceedings of CVPR (2016), pp. 770-778
- He et al., 2019Rethinking imagenet pre-trainingProceedings of ICCV (2019), pp. 4918-4927
- He et al., 2020Momentum contrast for unsupervised visual representation learningProceedings of CVPR (2020), pp. 9729-9738
- He et al., 2021Fastmoe: A Fast Mixture-Of-Expert Training System(2021)arXiv preprint arXiv:2103.13262
- Heusel et al., 2017Gans trained by a two time-scale update rule converge to a local nash equilibriumAdv. Neural Inf. Process. Syst., 30 (2017)
- Hewitt and Manning, 2019A structural probe for finding syntax in word representationsProceedings of NAACL-HLT (2019), pp. 4129-4138
- Hinton et al., 2006A fast learning algorithm for deep belief netsNeural Comput., 18 (7) (2006), pp. 1527-1554
- Hinton et al., 2014Distilling the knowledge in a neural networkProceedings of NeurIPS (2014)
- Howard and Ruder, 2018Universal language model fine-tuning for text classificationProceedings of ACL (2018), pp. 328-339
- Htut et al., 2019Do Attention Heads in Bert Track Syntactic Dependencies?(2019)arXiv preprint arXiv:1911.12246
- Hu et al., 2021Knowledgeable Prompt-Tuning: Incorporating Knowledge into Prompt Verbalizer for Text Classification(2021)arXiv preprint arXiv:2108.02035
- Huang et al., 2019aUnicoder: a universal language encoder by pre-training with multiple cross-lingual tasksProceedings of EMNLP-IJCNLP (2019), pp. 2485-2494
- Huang et al., 2019bGpipe: efficient training of giant neural networks using pipeline parallelismProceedings of NeurIPS (2019), pp. 103-112
- Huang et al., 2020aSwap advisor: pushing deep learning beyond the gpu memory limit via smart swappingProceedings of ASPLOS (2020), pp. 1341-1355
- Huang et al., 2020bM3p: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training(2020)arXiv preprint arXiv:2006.02635
- Hudson and Manning, 2019Gqa: a new dataset for real-world visual reasoning and compositional question answeringProceedings of CVPR (2019), pp. 6700-6709
- Huo et al., 2021Wenlan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-training(2021)arXiv preprint arXiv:2103.06561
- Insightface Project, 2021. https://github.com/deepinsight/insightface
- Ioffe and Szegedy, 2015Batch normalization: accelerating deep network training by reducing internal covariate shiftProceedings of ICML (2015), pp. 448-456
- Jacobs et al., 1991Adaptive mixtures of local expertsNeural Comput., 3 (1991), pp. 79-87
- Jaderberg et al., 2015Spatial transformer networksProceedings of NeurIPS (2015), pp. 2017-2025
- Jawahar et al., 2019aWhat does BERT learn about the structure of language?Proceedings of ACL (2019), pp. 3651-3657
- Jawahar et al., 2019bWhat does bert learn about the structure of language?Proceedings of ACL (2019), pp. 3651-3657
- Jia et al., 2019Beyond data and model parallelism for deep neural networksProceedings of MLSys (2019)
- Jiang et al., 2020aA unified architecture for accelerating distributed DNN training in heterogeneous gpu/cpu clustersProceedings of OSDI (2020), pp. 463-479
- Jiang et al., 2020bHow can we know what language models knowTACL, 8 (2020), pp. 423-438
- Jiao et al., 2019Tinybert: distilling bert for natural language understandingProceedings of EMNLP (2019), pp. 4163-4174
- Jin et al., 2020Is bert really robust? a strong baseline for natural language attack on text classification and entailmentProceedings of AAAI (2020), pp. 8018-8025
- Johnson and Zhang, 2005A high-performance semi-supervised learning method for text chunkingProceedings of ACL (2005), pp. 1-9
- Johnson et al., 2016Densecap: fully convolutional localization networks for dense captioningProceedings of CVPR (2016), pp. 4565-4574
- Joshi et al., 2020Spanbert: improving pre-training by representing and predicting spansTACL, 8 (2020), pp. 64-77
- Kalchbrenner et al., 2014A convolutional neural network for modelling sentencesProceedings of ACL (2014), pp. 655-665
- Kao et al., 2020Further Boosting Bert-Based Models by Duplicating Existing Layers: Some Intriguing Phenomena inside Bert(2020)arXiv preprint arXiv:2001.09309
- Kaplan et al., 2020Scaling Laws for Neural Language Models(2020)arXiv preprint arXiv:2001.08361
- Katharopoulos et al., 2020Transformers are rnns: fast autoregressive transformers with linear attentionProceedings of ICML (2020), pp. 5156-5165
- Ke et al., 2020Sentilare: linguistic knowledge enhanced language representation for sentiment analysisProceedings of EMNLP (2020), pp. 6975-6988
- Kim et al., 2020Are pre-trained language models aware of phrases? simple but strong baselines for grammar inductionProceedings of ICLR (2020)
- Kim, 2014Convolutional neural networks for sentence classificationProceedings of EMNLP (2014), pp. 1746-1751
- Kipf and Welling, 2016Semi-supervised classification with graph convolutional networksProceedings of ICLR (2016)
- Kitaev et al., 2020Reformer: the efficient transformerProceedings of ICLR (2020)
- Köhn, 2015What's in an embedding? analyzing word embeddings through multilingual evaluationProceedings of EMNLP (2015), pp. 2067-2073
- Kong et al., 2020A mutual information maximization perspective of language representation learningProceedings of ICLR (2020)
- Kovaleva et al., 2019Revealing the dark secrets of BERTProceedings of EMNLP-IJCNLP (2019), pp. 4364-4373
- Krishna et al., 2017Visual genome: connecting language and vision using crowdsourced dense image annotationsIJCV, 123 (2017), pp. 32-73
- Krizhevsky et al., 2012ImageNet classification with deep convolutional neural networksProceedings of NeurIPS (2012), pp. 1097-1105
- Lample and Conneau, 2019Cross-lingual language model pretrainingProceedings of NeurIPS (2019)
- Lample et al., 2019Large memory layers with product keysProceedings of NeurIPS (2019), pp. 8546-8557
- Lan et al., 2019Albert: a lite bert for self-supervised learning of language representationsProceedings of ICLR (2019)
- Lawrence and Platt, 2004Learning to learn with the informative vector machineProceedings of ICML (2004)
- LeCun et al., 2012Efficient backpropNeural Networks: Tricks of the Trade, Springer (2012), pp. 9-48
- Lee et al., 2015Deeply-supervised netsProceedings of AISTATS (2015), pp. 562-570
- Lee et al., 2019Set transformer: a framework for attention-based permutation-invariant neural networksProceedings of ICML (2019), pp. 3744-3753
- Lee et al., 2020Biobert: a pre-trained biomedical language representation model for biomedical text miningBioinformatics, 36 (4) (2020), pp. 1234-1240
- Lepikhin et al., 2021Gshard: scaling giant models with conditional computation and automatic shardingProceedings of ICLR (2021)
- Lester et al., 2021The Power of Scale for Parameter-Efficient Prompt Tuning(2021)arXiv preprint arXiv:2104.08691
- Lewis et al., 2020aBART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehensionProceedings of ACL (2020), pp. 7871-7880
- Lewis et al., 2020bRetrieval-augmented generation for knowledge-intensive nlp tasksProceedings of NeurIPS (2020), pp. 9459-9474
- Li and Qiu, 2021Token-aware virtual adversarial training in natural language understandingProceedings of AAAI (2021), pp. 8410-8418
- Li et al., 2019VisualBERT: A Simple and Performant Baseline for Vision and Language(2019)arXiv preprint arXiv:1908.03557
- Li et al., 2020aUnicoder-vl: a universal encoder for vision and language by cross-modal pre-trainingProceedings of AAAI (2020), pp. 11336-11344
- Li et al., 2020bBERT-ATTACK: adversarial attack against bert using bertProceedings of EMNLP (2020), pp. 6193-6202
- Li et al., 2020cGenerating Adversarial Examples in Chinese Texts Using Sentence-Pieces(2020)arXiv preprint arXiv:2012.14769
- Li et al., 2020dPytorch distributed: experiences on accelerating data parallel trainingProceedings of PVLDB (2020), pp. 3005-3018
- Li et al., 2020eOscar: object-semantics aligned pre-training for vision-language tasksProceedings of ECCV (2020), pp. 121-137
- Li et al., 2021Terapipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models(2021)arXiv preprint arXiv:2102.07988
- Lin et al., 2014Microsoft coco: common objects in contextProceedings of ECCV (2014), pp. 740-755
- Lin et al., 2019Open sesame: getting inside bert's linguistic knowledgeProceedings of BlackboxNLP (2019), pp. 241-253
- Lin et al., 2021A Survey of Transformers(2021)arXiv preprint arXiv:2106.04554
- Liu et al., 2016Recurrent neural network for text classification with multi-task learningProceedings of IJCAI (2016), pp. 2873-2879
- Liu et al., 2019Linguistic knowledge and transferability of contextual representationsProceedings of NAACL-HLT (2019), pp. 1073-1094
- Liu et al., 2020aK-bert: enabling language representation with knowledge graphProceedings of AAAI (2020), pp. 2901-2908
- Liu et al., 2020bSelf-supervised Learning: Generative or Contrastive(2020)arXiv preprint arXiv:2006.08218
- Liu et al., 2020cMultilingual denoising pre-training for neural machine translationTACL, 8 (2020), pp. 726-742
- Liu et al., 2020dRoberta: a robustly optimized bert pretraining approachProceedings of ICLR (2020)
- Liu et al., 2021aOag-bert: Pre-train Heterogeneous Entity-Augmented Academic Language Model(2021)arXiv preprint arXiv:2103.02410
- Liu et al., 2021bGpt Understands, Too(2021)arXiv preprint arXiv:2103.10385
- Liu et al., 2021cSwin Transformer: Hierarchical Vision Transformer Using Shifted Windows(2021)arXiv preprint arXiv:2103.14030
- Long et al., 2015Fully convolutional networks for semantic segmentationProceedings of CVPR (2015), pp. 3431-3440
- Lu et al., 2019Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasksProceedings of NeurIPS Reproducibility Challenge (2019)
- Lu et al., 202012-in-1: multi-task vision and language representation learningProceedings of CVPR (2020), pp. 10437-10446
- Manning et al., 2020Emergent linguistic structure in artificial neural networks trained by self-supervisionProc. Natl. Acad. Sci. Unit. States Am., 117 (48) (2020), pp. 30046-30054
- McCann et al., 2017Learned in translation: contextualized word vectorsProceedings of NeurIPS (2017), pp. 6294-6305
- Melamud et al., 2016context2vec: learning generic context embedding with bidirectional lstmProceedings of CoNLL (2016), pp. 51-61
- Miaschi and Dell'Orletta, 2020Contextual and non-contextual word embeddings: an in-depth linguistic investigationProceedings of RepL4NLP (2020), pp. 110-119
- Michel et al., 2019Are sixteen heads really better than one?Proceedings of NeurIPS (2019), pp. 14014-14024
- Micikevicius et al., 2018Mixed precision trainingProceedings of ICLR (2018)
- Mihalkova et al., 2007Mapping and revising markov logic networks for transfer learningProceedings of AAAI (2007), pp. 608-614
- Mikolov et al., 2013aEfficient estimation of word representations in vector spaceProceedings of ICLR Workshop (2013)
- Mikolov et al., 2013bDistributed representations of words and phrases and their compositionalityProceedings of NeurIPS (2013)
- Mikolov et al., 2013cLinguistic regularities in continuous space word representationsProceedings of NAACL-HLT (2013), pp. 746-751
- MindSpore Deep Learning Framework, 2021. https://github.com/mindspore-ai/mindspore
- Narayanan et al., 2019Pipedream: generalized pipeline parallelism for dnn trainingProceedings of SOSP (2019)
- Narayanan et al., 2021Efficient Large-Scale Language Model Training on Gpu Clusters(2021)arXiv preprint arXiv:2104.04473
- Nie et al., 2020Adversarial nli: a new benchmark for natural language understandingProceedings of ACL (2020), pp. 4885-4901
- Niven and Kao, 2019Probing neural network comprehension of natural language argumentsProceedings of ACL (2019), pp. 4658-4664
- Oldridge et al., 2020Merlin: a gpu accelerated recommendation frameworkProceedings of IRS (2020)
- OneFlow Deep Learning Framework, 2021. https://github.com/Oneflow-Inc/oneflow
- Ordonez et al., 2011Im2text: describing images using 1 million captioned photographsAdv. Neural Inf. Process. Syst., 24 (2011), pp. 1143-1151
- Vinyals et al., 2015Show and tell: a neural image caption generatorProceedings of CVPR (2015), pp. 3156-3164
- Ouyang et al., 2020ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-Lingual Semantics with Monolingual Corpora(2020)arXiv preprint arXiv:2012.15674
- Pan and Yang, 2009A survey on transfer learningIEEE TKDE, 22 (10) (2009), pp. 1345-1359
- Pang et al., 2020Rethinking softmax cross-entropy loss for adversarial robustnessProceedings of ICLR (2020)
- Paszke et al., 2019Pytorch: an imperative style, high-performance deep learning libraryProceedings of NeurIPS (2019)
- Peng et al., 2019A generic communication scheduler for distributed dnn training accelerationProceedings of SOSP (2019), pp. 16-29
- Peng et al., 2021Random feature attentionProceedings of ICLR (2021)
- Pennington et al., 2014Glove: global vectors for word representationProceedings of EMNLP (2014), pp. 1532-1543
- Peters et al., 2018Deep contextualized word representationsProceedings of NAACL-HLT (2018), pp. 2227-2237
- Peters et al., 2019Knowledge enhanced contextual word representationsProceedings of EMNLP-IJCNLP (2019), pp. 43-54
- Petroni et al., 2019Language models as knowledge bases?Proceedings of EMNLP-IJCNLP (2019), pp. 2463-2473
- Pires et al., 2019How multilingual is multilingual BERT?Proceedings of ACL (2019), pp. 4996-5001
- Plummer et al., 2015Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence modelsProceedings of ICCV (2015), pp. 2641-2649
- Polino et al., 2018Model compression via distillation and quantizationProceedings of ICLR (2018)
- Pörner et al., 2020E-BERT: efficient-yet-effective entity embeddings for BERTProceedings of EMNLP (2020), pp. 803-818
- Prasanna et al., 2020When BERT plays the lottery, all tickets are winningProceedings of EMNLP (2020), pp. 3208-3229
- Qi et al., 2020Imagebert: Cross-Modal Pre-training with Large-Scale Weak-Supervised Image-Text Data(2020)arXiv preprint arXiv:2001.07966
- Qin et al., 2021Erica: improving entity and relation understanding for pre-trained language models via contrastive learningProceedings of ACL (2021)
- Qiu et al., 2020Pre-trained models for natural language processing: a surveySci. China Technol. Sci., 63 (2020), pp. 1872-1897
- Radford and Narasimhan, 2018Improving Language Understanding by Generative Pre-trainingOpenAI Blog (2018)
- Radford et al., 2019Language Models Are Unsupervised Multitask LearnersOpenAI Blog (2019)
- Radford et al., 2021Learning Transferable Visual Models from Natural Language SupervisionOpenAI Blog (2021)
- Raffel et al., 2020Exploring the limits of transfer learning with a unified text-to-text transformerJMLR, 21 (2020), pp. 1-67
- Raina et al., 2007Self-taught learning: transfer learning from unlabeled dataProceedings of ICML (2007), pp. 759-766
- Rajbhandari et al., 2020Zero: memory optimizations toward training trillion parameter modelsProceedings of SC (2020)
- Rajbhandari et al., 2021Zero-infinity: Breaking the Gpu Memory Wall for Extreme Scale Deep Learning(2021)arXiv preprint arXiv:2104.07857
- Ramesh et al., 2021Zero-shot Text-To-Image Generation(2021)arXiv preprint arXiv:2102.12092
- Rasley et al., 2020Deepspeed: system optimizations enable training deep learning models with over 100 billion parametersProceedings of KDD (2020), pp. 3505-3506
- Ren et al., 2016Faster r-cnn: towards real-time object detection with region proposal networksIEEE PAMI, 39 (6) (2016), pp. 1137-1149
- Ren et al., 2021Zero-offload: Democratizing Billion-Scale Model Training(2021)arxiv preprint arXiv:2101.06840
- Roberts et al., 2020How much knowledge can you pack into the parameters of a language model?Proceedings of EMNLP (2020), pp. 5418-5426
- Roller et al., 2021Recipes for building an open-domain chatbotProceedings of EACL (2021)
- Rosa and Mareček, 2019Inducing Syntactic Trees from Bert Representations(2019)arXiv preprint arXiv:1906.11511
- Rosset et al., 2020Knowledge-aware Language Model Pretraining(2020)arXiv preprint arXiv:2007.00655
- Roy et al., 2021Efficient content-based sparse attention with routing transformersTACL, 9 (2021), pp. 53-68
- Russakovsky et al., 2015Imagenet large scale visual recognition challengeIJCV, 115 (3) (2015), pp. 211-252
- Sanh et al., 2019Distilbert, a distilled version of bert: smaller, faster, cheaper and lighterProceedings of NeurIPS (2019)
- Saunshi et al., 2019A theoretical analysis of contrastive unsupervised representation learningProceedings of ICML (2019), pp. 5628-5637
- Saxe et al., 2013Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks(2013)arXiv preprint arXiv:1312.6120
- Schick and Schütze, 2020It's Not Just Size that Matters: Small Language Models Are Also Few-Shot Learners(2020)arXiv preprint arXiv:2009.07118
- Schlichtkrull et al., 2018Modeling relational data with graph convolutional networksProceedings of ESWC (2018), pp. 593-607
- Schmidt et al., 2018Adversarially robust generalization requires more dataProceedings of NeurIPS (2018)
- Sermanet et al., 2014Overfeat: integrated recognition, localization and detection using convolutional networksProceedings of ICLR (2014)
- Sharma et al., 2018Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioningProceedings of ACL (2018), pp. 2556-2565
- Shazeer et al., 2017Outrageously large neural networks: the sparsely-gated mixture-of-experts layerProceedings of ICLR (2017)
- Shazeer et al., 2018Mesh-tensorflow: deep learning for supercomputersProceedings of NeurIPS (2018)
- Shen et al., 2020aQ-bert: Hessian based ultra low precision quantization of bertProceedings of AAAI (2020), pp. 8815-8821
- Shen et al., 2020bBlank language modelsProceedings of EMNLP (2020), pp. 5186-5198
- Shi et al., 2016Does string-based neural MT learn source syntax?Proceedings of EMNLP (2016), pp. 1526-1534
- Shi et al., 2017ZhuSuan: A Library for Bayesian Deep Learning(2017)arXiv preprint arXiv:1709.05870
- Shimodaira, 2000Improving predictive inference under covariate shift by weighting the log-likelihood functionJ. Stat. Plann. Inference, 90 (2) (2000), pp. 227-244
- Shin et al., 2020Autoprompt: eliciting knowledge from language models with automatically generated promptsProceedings of EMNLP (2020), pp. 4222-4235
- Shoeybi et al., 2019Megatron-lm: Training Multi-Billion Parameter Language Models Using Model Parallelism(2019)arXiv preprint arXiv:1909.08053
- Si et al., 2020Better Robustness by More Coverage: Adversarial Training with Mixup Augmentation for Robust Fine-Tuning(2020)arXiv preprint arXiv:2012.15699
- Simonyan and Zisserman, 2015Very deep convolutional networks for large-scale image recognitionProceedings of ICLR (2015)
- Soares et al., 2019Matching the blanks: distributional similarity for relation learningProceedings of ACL (2019)
- Song et al., 2019Mass: masked sequence to sequence pre-training for language generationProceedings of ICML (2019), pp. 5926-5936
- Song et al., 2020Mpnet: masked and permuted pre-training for language understandingProceedings of NeurIPS (2020), pp. 16857-16867
- Stock et al., 2020And the bit goes down: revisiting the quantization of neural networksProceedings of ICLR (2020)
- Su et al., 2020Vl-bert: pre-training of generic visual-linguistic representationsProceedings of ICLR (2020)
- Sun et al., 2019aVideobert: a joint model for video and language representation learningProceedings of ICCV (2019), pp. 7464-7473
- Sun et al., 2019bPatient knowledge distillation for bert model compressionProceedings of EMNLP-IJCNLP (2019), pp. 4323-4332
- Sun et al., 2019cErnie: enhanced representation through knowledge integrationProceedings of ACL (2019), pp. 1441-1451
- Sun et al., 2019dErnie 2.0: A Continual Pre-training Framework for Language Understanding(2019)arXiv preprint arXiv:1907.12412
- Sun et al., 2020Colake: contextualized language and knowledge embeddingProceedings of COLING (2020), pp. 3660-3670
- Sun et al., 2021Reasoning over Virtual Knowledge Bases with Open Predicate Relations(2021)arXiv preprint arXiv:2102.07043
- Sutskever et al., 2014Sequence to sequence learning with neural networksProceedings of NeurIPS (2014), pp. 3104-3112
- Szegedy et al., 2015Going deeper with convolutionsProceedings of CVPR (2015), pp. 1-9
- Tan and Bansal, 2019LXMERT: learning cross-modality encoder representations from transformersProceedings of EMNLP-IJCNLP (2019), pp. 5103-5114
- Tay et al., 2020Efficient Transformers: A Survey(2020)arXiv preprint arXiv:2009.06732
- Taylor, 1953Cloze procedure: a new tool for measuring readabilityJournal. Q., 30 (4) (1953), pp. 415-433
- Tenney et al., 2019aBERT rediscovers the classical NLP pipelineProceedings of ACL (2019), pp. 4593-4601
- Tenney et al., 2019bWhat do you learn from context? probing for sentence structure in contextualized word representationsProceedings of ICLR (2019)
- Thrun and Pratt, 1998Learning to Learn: Introduction and OverviewSpringer Science & Business Media (1998)
- Turian et al., 2010Word representations: a simple and general method for semi-supervised learningProceedings of ACL (2010), pp. 384-394
- Van Schijndel et al., 2019Quantity doesn't buy quality syntax with neural language modelsProceedings of EMNLP-IJCNLP (2019), pp. 5830-5836
- Vaswani et al., 2017Attention is all you needProceedings of NeurIPS (2017), pp. 5998-6008
- Veličković et al., 2018Graph attention networksProceedings of ICLR (2018)
- Verga et al., 2020Facts as Experts: Adaptable and Interpretable Neural Memory over Symbolic Knowledge(2020)arXiv preprint arXiv:2007.00849
- Vilares et al., 2020Parsing as pretrainingProceedings of AAAI (2020), pp. 9114-9121
- Voita et al., 2019Analyzing multi-head self-attention: specialized heads do the heavy lifting, the rest can be prunedProceedings of ACL (2019), pp. 5797-5808
- Wallace et al., 2019aUniversal adversarial triggers for attacking and analyzing nlpProceedings of EMNLP-IJCNLP (2019), pp. 2153-2162
- Wallace et al., 2019bTrick me if you can: human-in-the-loop generation of adversarial examples for question answeringTACL, 7 (2019), pp. 387-401
- Wallace et al., 2019cDo NLP models know numbers? probing numeracy in embeddingsProceedings of EMNLP-IJCNLP (2019), pp. 5306-5314
- Wang et al., 2008Transferred dimensionality reductionProceedings of ECML-PKDD (2008), pp. 550-565
- Wang et al., 2017Residual attention network for image classificationProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 3156-3164
- Wang et al., 2019Supporting very large models using automatic dataflow graph partitioningProceedings of EuroSys (2019)
- Wang et al., 2020aLanguage Models Are Open Knowledge Graphs(2020)arXiv preprint arXiv:2010.11967
- Wang et al., 2020bK-adapter: infusing Knowledge into Pre-trained Models with Adapters(2020)arXiv preprint arXiv:2002.01808
- Wang et al., 2020cLinformer: Self-Attention with Linear Complexity(2020)arXiv preprint arXiv:2006.04768
- Wang et al., 2020dMinilm: deep self-attention distillation for task-agnostic compression of pre-trained transformersProceedings of NeurIPS (2020)
- Wang et al., 2020eA large-scale Chinese short-text conversation datasetNLPCC (2020)
- Wang et al., 2020fFurther analysis of outlier detection with deep generative modelsProceedings of NeurIPS (2020)
- Wang et al., 2021aCline: contrastive learning with semantic negative examples for natural language understandingProceedings of ACL (2021)
- Wang et al., 2021bKepler: a unified model for knowledge embedding and pre-trained language representationTACL, 9 (2021), pp. 176-194
- Warstadt and Bowman, 2020Can neural networks acquire a structural bias from raw linguistic data?Proceedings of CogSci (2020)
- Wei et al., 2019Nezha: Neural Contextualized Representation for Chinese Language Understanding(2019)arXiv preprint arXiv:1909.00204
- Wei et al., 2021On learning universal representations across languagesProceedings of ICLR (2021)
- Wharton et al., 1994Below the surface: analogical similarity and retrieval competition in remindingCognit. Psychol., 26 (1994), pp. 64-101
- Williams et al., 2007Multi-task Gaussian process predictionProceedings of NeurIPS (2007), pp. 153-160
- Wu et al., 2016Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation(2016)arXiv preprint arXiv:1609.08144
- Wu et al., 2018Unsupervised feature learning via non-parametric instance discriminationProceedings of CVPR (2018), pp. 3733-3742
- Wu et al., 2020Perturbed masking: parameter-free probing for analyzing and interpreting BERTProceedings of ACL (2020), pp. 4166-4176
- Xia et al., 2020Xgpt: Cross-Modal Generative Pre-training for Image Captioning(2020)arXiv preprint arXiv:2003.01473
- Xiong et al., 2016Dynamic memory networks for visual and textual question answeringProceedings of ICML (2016), pp. 2397-2406
- Xiong et al., 2019Pretrained encyclopedia: weakly supervised knowledge-pretrained language modelProceedings of ICLR (2019)
- Xu et al., 2021How neural networks extrapolate: from feedforward to graph neural networksProceedings of ICLR (2021)
- Yang et al., 2019Xlnet: generalized autoregressive pretraining for language understandingProceedings of NeurIPS (2019)
- Yang et al., 2020Alternating language modeling for cross-lingual pre-trainingProceedings of AAAI (2020), pp. 9386-9393
- You et al., 2017Scaling Sgd Batch Size to 32k for Imagenet Training(2017)arXiv preprint arXiv:1708.03888
- You et al., 2020Large batch optimization for deep learning: training bert in 76 minutesProceedings of ICLR (2020)
- Yao et al., 2021Adversarial language games for advanced natural language intelligenceProceedings of AAAI (2021)
- Zadrozny, 2004Learning and evaluating classifiers under sample selection biasProceedings of ICML (2004)
- Zafrir et al., 2019Q8bert: quantized 8bit bertProceedings of NeurIPS (2019)
- Zaheer et al., 2020Big bird: transformers for longer sequencesProceedings of NeurIPS (2020), pp. 17283-17297
- Zang et al., 2020Word-level textual adversarial attacking as combinatorial optimizationProceedings of ACL (2020), pp. 6066-6080
- Zeng et al., 2021Pangu-alpha: Large-Scale Autoregressive Pretrained Chinese Language Models with Auto-Parallel Computation(2021)arXiv preprint arXiv:2104.12369
- Zhang and He, 2020Accelerating training of transformer-based language models with progressive layer droppingProceedings of NeurIPS (2020), pp. 14011-14023
- Zhang et al., 2017Understanding deep learning requires rethinking generalizationProceedings of ICLR (2017)
- Zhang et al., 2019aOag: toward linking large-scale heterogeneous entity graphsProceedings of KDD (2019), pp. 2585-2595
- Zhang et al., 2019bErnie: enhanced language representation with informative entitiesProceedings of ACL (2019), pp. 1441-1451
- Zhang et al., 2020aPegasus: pre-training with extracted gap-sentences for abstractive summarizationProceedings of ICML (2020), pp. 11328-11339
- Zhang et al., 2020bTernarybert: distillation-aware ultra-low bit bertProceedings of EMNLP (2020), pp. 509-521
- Zhang et al., 2020cCpm: A Large-Scale Generative Chinese Pre-trained Language Model(2020)arXiv preprint arXiv:2012.00413
- Zhang et al., 2021aCpm-2: large-scale Cost-efficient Pre-trained Language Models(2021)
- Zhang et al., 2021bKnow what you don't need: single-Shot Meta-Pruning for attention headsAI Open, 2 (2021), pp. 36-42
- Zhang et al., 2021cRed Alarm for Pre-trained Models: Universal Vulnerabilities by Neuron-Level Backdoor Attacks(2021)arXiv preprint arXiv:2101.06969
- Zheng et al., 2015Conditional random fields as recurrent neural networksProceedings of ICCV (2015), pp. 1529-1537
- Zhou et al., 2020aUnified vision-language pre-training for image captioning and vqaProceedings of AAAI (2020), pp. 13041-13049
- Zhou et al., 2020bEvaluating commonsense in pre-trained language modelsProceedings of AAAI (2020), pp. 9733-9740
- Zhu et al., 2015Aligning books and movies: towards story-like visual explanations by watching movies and reading booksProceedings of ICCV (2015), pp. 19-27
- Zoph et al., 2020Rethinking pre-training and self-trainingProceedings of NeurIPS, 33 (2020)
- Zou et al., 2021Controllable Generation from Pre-trained Language Models via Inverse Prompting(2021)arXiv preprint arXiv:2103.10685
- 1. The first six authors contributed equally to organizing this paper; the order was determined by dice rolling.
- 2. All faculty authors are sorted alphabetically.
- 3. ResNet50 is a PTM with 50 layers.
- 4. Self-taught learning can be viewed as a variant of inductive transfer learning without available labeled data.
- 5. Since GPT uses autoregressive language modeling as its pre-training objective, the cross-attention in the original Transformer decoder is removed.
- 6. More examples of the Turing test of GPT-3 can be found at https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html.