| 研究生 (Graduate Student) | 王姵心 Wang, Pei-Hsin |
|---|---|
| 論文名稱 (Thesis Title) | 針對語言模型之語境溫度 Contextual Temperature for Language Model |
| 指導教授 (Advisor) | 張世杰 Chang, Shih-Chieh |
| 口試委員 (Committee Members) | 陳縕儂 Chen, Yun-Nung; 吳毅成 Wu, I-Chen |
| 學位類別 (Degree) | 碩士 Master |
| 系所名稱 (Department) | 電機資訊學院 資訊工程學系 (College of Electrical Engineering and Computer Science, Department of Computer Science) |
| 論文出版年 (Year of Publication) | 2020 |
| 畢業學年度 (Academic Year of Graduation) | 108 (ROC calendar) |
| 語文別 (Language) | 英文 (English) |
| 論文頁數 (Number of Pages) | 33 |
| 中文關鍵詞 (Keywords, Chinese) | 語言模型、溫度縮放、自然語言處理 |
| 外文關鍵詞 (Keywords, English) | language model, temperature scaling, natural language processing |
Temperature scaling efficiently adjusts the smoothness of a distribution and is commonly used together with the softmax function to shape the output probability distribution. Existing methods typically adopt either a fixed temperature or a manually designed temperature function; our study, however, shows that the optimal temperature of each class, that is, of each word, changes with the current context. We therefore propose contextual temperature scaling, in which the temperature is optimized jointly with the other model parameters so that every word obtains its own optimal temperature function. Experiments on the Penn Treebank and WikiText-2 datasets show that the proposed method significantly improves state-of-the-art language models, reaching perplexities of 55.31 and 62.89, respectively. In addition, in-depth analyses and ablation studies confirm that every word has a unique temperature function and that, as the history grows longer, the optimal temperature decreases in order to suppress uncertainty.
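For reference, per-token temperature scaling of the softmax described above can be written as follows; the notation (logits z_j, context c, and temperature function τ_j(c) for token j) is introduced here purely for illustration and is not drawn verbatim from the thesis.

```latex
% Softmax with a context-dependent, per-token temperature.
% z_j      : logit of vocabulary token j
% c        : the current context (history)
% \tau_j(c): learned temperature of token j given context c
P(w_i \mid c) = \frac{\exp\bigl(z_i / \tau_i(c)\bigr)}{\sum_{j} \exp\bigl(z_j / \tau_j(c)\bigr)}
```

A fixed scalar temperature is the special case τ_j(c) = τ for every token and context; contextual temperature lets τ_j(c) vary with both the token and the context.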
Temperature scaling is an effective way to control the smoothness of a probability distribution and is widely used with a softmax output layer to improve performance. Current practice applies either a fixed temperature or a manually crafted, dynamically changing schedule. Our studies, however, indicate that the optimal temperature trajectory of each class varies with the context. To this end, we propose contextual temperature, a generalized approach that gives each vocabulary token its own optimal temperature trajectory over the context, with the temperature learned jointly with the remaining model parameters during training. Experimental results confirm that the proposed method significantly improves state-of-the-art language models, achieving test perplexities of 55.31 and 62.89 on Penn Treebank and WikiText-2, respectively. In-depth analyses further show that every vocabulary token has its own unique temperature schedule and that, as the context grows, the optimal temperature decreases in order to suppress uncertainty.
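To make the mechanism concrete, below is a minimal PyTorch-style sketch of an output layer that predicts one temperature per vocabulary token from the context vector and applies it to the logits before the softmax, so the temperatures are trained together with the rest of the model. The module name ContextualTemperatureHead, the softplus parameterization, and the dimensions are illustrative assumptions rather than the thesis's exact architecture.

```python
# A minimal sketch of contextual temperature on a language-model output layer.
# ContextualTemperatureHead, hidden_dim, and vocab_size are illustrative names.
import torch
import torch.nn as nn


class ContextualTemperatureHead(nn.Module):
    """Predicts, from the current context vector, one temperature per
    vocabulary token and applies it to the logits before the softmax."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.logit_proj = nn.Linear(hidden_dim, vocab_size)
        # Context-dependent temperatures, one scalar per vocabulary token.
        self.temp_proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, hidden_dim) hidden state summarizing the history.
        logits = self.logit_proj(context)                      # (batch, vocab)
        # Softplus keeps temperatures positive; the offset avoids division
        # by values close to zero.
        temperature = nn.functional.softplus(self.temp_proj(context)) + 1e-3
        return torch.log_softmax(logits / temperature, dim=-1)


if __name__ == "__main__":
    head = ContextualTemperatureHead(hidden_dim=64, vocab_size=100)
    context = torch.randn(8, 64)            # a batch of 8 context vectors
    log_probs = head(context)               # (8, 100) log-probabilities
    targets = torch.randint(0, 100, (8,))
    loss = nn.functional.nll_loss(log_probs, targets)
    loss.backward()                          # temperatures learn with the model
    print(log_probs.shape, float(loss))
```

Replacing temp_proj with a single learnable constant would recover ordinary fixed-temperature scaling; letting the temperature depend on both the token and the context is what the abstract describes as giving each word its own temperature function.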