| Field | Value |
|---|---|
| Student | 李奇璋 Li, Chi-Chang |
| Thesis Title | RPH-PGD: Randomly Projected Hessian for Perturbed Gradient Descent (利用隨機投影漢森矩陣的擾動梯度下降法) |
| Advisors | 韓永楷 Hon, Wing-Kai; 李哲榮 Lee, Che-Rung |
| Committee Members | 王弘倫 Wang, Hung-Lung; 蔡孟宗 Tsai, Meng-Tsung |
| Degree | Master |
| Department | College of Electrical Engineering and Computer Science, Department of Computer Science |
| Year of Publication | 2022 |
| Graduation Academic Year | 111 |
| Language | English |
| Number of Pages | 45 |
| Keywords | Algorithm, Gradient Descent, Optimization, Hessian, Saddle Point |
The perturbed gradient descent (PGD) method, which adds random noise to the search direction, has been widely used for solving large-scale optimization problems owing to its ability to escape from saddle points. However, it is sometimes inefficient for two reasons. First, the randomly generated noise may not point in a descent direction, so PGD may still stagnate around saddle points. Second, the size of the noise, which is controlled by the radius of the perturbation ball, may not be properly configured, so convergence is slow. In this thesis, we propose a method called RPH-PGD (Randomly Projected Hessian for Perturbed Gradient Descent) to improve the performance of PGD. The randomly projected Hessian (RPH) is created by projecting the Hessian matrix onto a relatively small subspace that contains rich information about the eigenvectors of the original Hessian. RPH-PGD uses the eigenvalues and eigenvectors of the randomly projected Hessian to identify negative curvature, and uses the matrix itself to estimate how the Hessian changes, which is the information needed to adjust the radius dynamically during the computation. In addition, RPH-PGD employs the finite difference method to approximate Hessian-vector products instead of constructing the Hessian explicitly. Amortized analysis shows that the time complexity of RPH-PGD is only slightly higher than that of PGD. Experimental results show that RPH-PGD not only converges faster than PGD but also converges in cases where PGD cannot.
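To make the abstract's two computational ingredients concrete, here is a minimal NumPy sketch, not the thesis's exact algorithm: Hessian-vector products are approximated by central finite differences of the gradient, and the Hessian is projected onto a small random subspace whose eigenpairs (Ritz pairs) are used to detect negative curvature. The function names, the subspace dimension `k`, and the step size `eps` are illustrative assumptions.

```python
import numpy as np

def hvp_fd(grad, x, v, eps=1e-6):
    # Approximate the Hessian-vector product H(x) @ v with a central finite
    # difference of the gradient, so the Hessian is never formed explicitly.
    return (grad(x + eps * v) - grad(x - eps * v)) / (2.0 * eps)

def randomly_projected_hessian(grad, x, k, rng):
    # Build an orthonormal basis Q of a random k-dimensional subspace and
    # return the small k-by-k projection B = Q^T H Q together with Q.
    n = x.shape[0]
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    HQ = np.column_stack([hvp_fd(grad, x, Q[:, j]) for j in range(k)])
    B = Q.T @ HQ
    return 0.5 * (B + B.T), Q   # symmetrize to suppress finite-difference noise

def negative_curvature_direction(grad, x, k=10, seed=0):
    # Eigenpairs of the small projected Hessian reveal negative curvature:
    # if the smallest Ritz value is negative, lift its Ritz vector back to R^n.
    B, Q = randomly_projected_hessian(grad, x, k, np.random.default_rng(seed))
    w, V = np.linalg.eigh(B)          # eigenvalues in ascending order
    return Q @ V[:, 0] if w[0] < 0 else None

# Toy usage: f(x, y) = x^2 - y^2 has a saddle at the origin; the sketch
# recovers a direction close to the y-axis, where the curvature is negative.
grad = lambda p: np.array([2.0 * p[0], -2.0 * p[1]])
print(negative_curvature_direction(grad, np.array([1e-3, 1e-3]), k=2))
```

In RPH-PGD as described in the abstract, this kind of curvature information also drives the dynamic adjustment of the perturbation radius; that surrounding descent loop is omitted from the sketch.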