| Graduate Student: | 陳家惠 Chen, Jia-Hui |
|---|---|
| Thesis Title: | 優化效率:利用神經正切核決定大規模模型訓練的停止準則 (Exploring Economical Sweet Spot: Utilizing Neural Tangent Kernel to Determine Stopping Criteria in Large-Scale Models) |
| Advisor: | 吳尚鴻 Wu, Shan-Hung |
| Committee Members: | 劉奕汶 Liu, Yi-Wen; 沈之涯 Shen, Chih-Ya; 邱維辰 Chiu, Wei-Chen |
| Degree: | 碩士 Master |
| Department: | 電機資訊學院 資訊工程學系 Department of Computer Science, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2023 |
| Academic Year of Graduation: | 112 |
| Language: | English |
| Number of Pages: | 19 |
| Keywords (Chinese): | 神經正切核, 神經網絡訓練, 提前停止 |
| Keywords (English): | Neural Tangent Kernel, Neural Networks Training, Early Stopping |
In recent years, machine learning models have often been over-parameterized: they can reach zero training error while still generalizing well. To boost a model's performance, practitioners commonly apply an ad-hoc technique called early stopping, which holds out a portion of the training set as a validation set and halts training while the empirical risk on the validation set is still non-degenerate. In the over-parameterized setting, however, the model falls into the kernel regime, where generalization performance has been proven to improve monotonically during training. The question is therefore not how to find the point with the best generalization performance, but what the most economical way to train an over-parameterized model is. Based on neural tangent kernel theory, we show that 1) the training dynamics of an over-parameterized model exhibit a two-phase phenomenon, and 2) there exists a critical point in the training of an over-parameterized model after which the marginal gain decreases sharply.
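Under the standard NTK linearization, the training loss of a model in the kernel regime decomposes along the eigendirections of the NTK Gram matrix, with each component decaying exponentially at a rate proportional to its eigenvalue; the two phases and the critical point can be read off from that decomposition. The NumPy sketch below is only a minimal illustration of this idea under assumed numbers (the kernel spectrum, learning rate, and stopping threshold `tau` are illustrative choices, not the thesis's actual criterion).

```python
import numpy as np

# Idealized kernel-regime dynamics: under the NTK linearization the training
# loss decomposes along the eigendirections of the NTK Gram matrix, and each
# component decays exponentially at a rate set by its eigenvalue.
rng = np.random.default_rng(0)

n = 200                                              # number of training points (illustrative)
eigvals = np.sort(rng.pareto(1.5, n))[::-1] + 1e-3   # heavy-tailed NTK spectrum (assumed)
coeffs = rng.normal(size=n)                          # label projections onto eigenvectors
lr = 1e-2                                            # learning rate (illustrative)
steps = np.arange(0, 5000)

# Loss over time: L(t) = sum_i c_i^2 * exp(-2 * lr * lambda_i * t)
loss = (coeffs**2 * np.exp(-2 * lr * np.outer(steps, eigvals))).sum(axis=1)

# Marginal gain per step, relative to the remaining loss.
gain = loss[:-1] - loss[1:]
rel_gain = gain / loss[:-1]

tau = 1e-4                                 # stopping threshold (assumed, not from the thesis)
critical = int(np.argmax(rel_gain < tau))  # first step whose relative gain falls below tau

print(f"loss at step 0:        {loss[0]:.4f}")
print(f"critical step:         {critical}")
print(f"loss at critical step: {loss[critical]:.4f}")
print(f"final loss:            {loss[-1]:.4f}")
```

In this idealized picture the first phase is driven by the large eigenvalues and removes most of the loss quickly, while the second phase is governed by the small eigenvalues and contributes only marginal improvement; under these assumed numbers, `critical` marks the transition between the two, i.e. the kind of point after which the abstract says the marginal gain decreases sharply.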