Graduate Student: | Lien, Hsin-Yi 連欣怡
---|---
Thesis Title: | A Hyper-Parameters Tuning Procedure to Construct a Deep Neural Network for Regression Problems (運用超參數調優技巧建立深度神經網路來處理迴歸問題)
Advisor: | Su, Chao-Ton 蘇朝墩
Committee Members: | 陳穆臻, 蕭宇翔
Degree: | Master
Department: | College of Engineering - Department of Industrial Engineering and Engineering Management
Year of Publication: | 2019
Academic Year of Graduation: | 107
Language: | English
Number of Pages: | 62
Keywords: | artificial neural network, regression problems, back propagation neural network, deep learning, hyper-parameters
Owing to its general pattern-mapping capability, the back propagation neural network (BPNN) has been widely used in industry. However, a BPNN involves many hyper-parameters that must be tuned, including the initial weights, the activation function, and the learning rate. These hyper-parameters are usually selected by trial and error, which is time-consuming and may still yield an inaccurate model. To overcome this problem, this study proposes a hyper-parameter tuning procedure that helps select appropriate hyper-parameters during modeling and thereby achieves higher mapping accuracy.
Drawing on the literature on artificial neural network (ANN) hyper-parameter settings, including the number of hidden layers and units, the activation function, weight initialization and the optimization method, the mini-batch size, regularization, and the batch normalization technique, this study builds a deep neural network (deep NN) model for regression problems. Here, a regression problem means finding a model that represents the input-output relationship of a given dataset. A deep NN can overcome the shortcomings of a shallow BPNN, such as convergence to a local rather than the global minimum, a slow convergence rate, and a long training time. Eighteen datasets with various combinations of data size, complexity, and number of input features are then used to compare the deep NN with a shallow BPNN. The performance analysis shows that the proposed deep NN outperforms the shallow BPNN in prediction accuracy. In addition, it needs fewer training epochs because of early stopping, which also prevents the network from overfitting. As for training time, the deep NN saves a large amount of time on most of the datasets, which greatly speeds up the modeling process. Finally, a case study is presented to demonstrate the performance of the proposed model.
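The full thesis is not reproduced in this record, so the sketch below is only a minimal illustration, written in Python with TensorFlow/Keras (an assumed framework; the study does not state its implementation), of how the hyper-parameters listed in the abstract fit together in a fully connected regression network: ReLU activations with He weight initialization, batch normalization, dropout regularization, the Adam optimizer, mini-batch training, and early stopping. The helper name `build_deep_regressor` and every numeric setting (layer widths, dropout rate, learning rate, batch size, patience) are placeholders, not the values selected by the proposed tuning procedure.

```python
# Minimal sketch of a deep NN for regression exposing the hyper-parameters
# named in the abstract; all concrete values below are illustrative only.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, callbacks


def build_deep_regressor(n_features,
                         hidden_units=(64, 64, 32),  # hidden layers and units
                         dropout_rate=0.2,           # regularization strength
                         learning_rate=1e-3):        # optimizer step size
    """Fully connected regression network; every default is a placeholder."""
    model = tf.keras.Sequential([tf.keras.Input(shape=(n_features,))])
    for units in hidden_units:
        # He initialization is the usual pairing for ReLU activations.
        model.add(layers.Dense(units, activation="relu",
                               kernel_initializer="he_normal"))
        model.add(layers.BatchNormalization())   # batch normalization
        model.add(layers.Dropout(dropout_rate))  # dropout regularization
    model.add(layers.Dense(1))                   # linear output for regression
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="mse")                    # mean squared error objective
    return model


if __name__ == "__main__":
    # Synthetic data stands in for one of the thesis's eighteen datasets.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10)).astype("float32")
    y = (X[:, :3].sum(axis=1) + 0.1 * rng.normal(size=1000)).astype("float32")

    model = build_deep_regressor(n_features=X.shape[1])
    # Early stopping caps the number of epochs and guards against overfitting.
    early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                         restore_best_weights=True)
    history = model.fit(X, y, validation_split=0.2,
                        batch_size=32,           # mini-batch size
                        epochs=500,              # upper bound; stops earlier
                        callbacks=[early_stop], verbose=0)
    print("best validation MSE:", min(history.history["val_loss"]))
```

In the study itself, each of these settings is the object of the tuning procedure rather than a fixed default, and model quality is judged by prediction accuracy across the eighteen benchmark datasets.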