| Graduate student: | 楊鈞愷 Yang, Chun-Kai |
|---|---|
| Thesis title: | 360度全景影片視覺注意力預測:考慮敏感性與行動裝置推論速度之卷積神經網路模型剪枝 (Sensitivity-Latency-Aware Pruning Towards Efficient Saliency Prediction in 360° Videos) |
| Advisor: | 黃稚存 Huang, Chih-Tsun |
| Committee members: | 胡敏君 Hu, Min-Chun; 潘則佑 Pan, Tse-Yu |
| Degree: | Master |
| Department: | Department of Computer Science, College of Electrical Engineering and Computer Science |
| Year of publication: | 2024 |
| Academic year of graduation: | 113 |
| Language: | English |
| Number of pages: | 36 |
| Keywords (Chinese): | 視覺注意力預測, 模型剪枝, 360度全景影片, 行動裝置, 敏感性分析, 卷積神經網路 |
| Keywords (English): | Saliency Prediction, Model Pruning, 360° Videos, Mobile Devices, Sensitivity Analysis, CNN |
Saliency prediction is an important problem in computer vision, as it helps us understand human visual attention and even decision-making mechanisms, while opening up many possible applications. In recent years, with the rapid progress of multimedia technologies such as virtual reality and augmented reality, saliency prediction for 360-degree videos has also received considerable attention and become a popular research topic.
However, because saliency prediction models typically rely on deep and complex convolutional neural network architectures, and because 360-degree videos come at high resolutions, deploying such models on real head-mounted displays or smartphones remains a challenging problem, making model compression the primary concern for practical applications. Among the many compression techniques, pruning is a direct and efficient one; in this thesis, we therefore use pruning to obtain a 360-degree video saliency prediction model that can run on mobile devices.
Model pruning has a long history in the literature. Since the concept of sensitivity to pruning was introduced and applied, many studies have investigated how to measure this sensitivity and use it to prune models while minimizing the loss of prediction accuracy. However, few of these studies take practical deployment into account, and most of the pruning literature focuses on the amount of parameters and computation removed, easily overlooking the actual inference speed on real mobile devices.
In view of this, our proposed method first performs a pruning sensitivity analysis of the model for 360-degree video saliency prediction, then quantifies the analysis results and, together with the computational load of each layer, determines the pruning sparsity of every layer. Once the per-layer sparsities are determined, we introduce a latency predictor, a linear regression model that predicts the actual inference time on the mobile device from the per-layer sparsities, to ensure that the requirements of real applications are met. After pruning, we also apply knowledge distillation on hidden-layer features to reduce the drop in prediction accuracy.
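The per-layer sparsity allocation described above could be realized along the lines of the following minimal sketch. The function name `allocate_sparsity`, the scoring rule (favoring layers that are cheap in accuracy but heavy in FLOPs), and the clipping bounds are illustrative assumptions, not the exact formulation used in the thesis.

```python
import numpy as np

def allocate_sparsity(sensitivity, flops, target_sparsity, min_s=0.1, max_s=0.9):
    """Assign a pruning sparsity to each layer: layers that are less
    sensitive to pruning and more expensive (higher FLOPs) receive
    higher sparsity. The score and scaling rule here are illustrative."""
    sensitivity = np.asarray(sensitivity, dtype=float)
    flops = np.asarray(flops, dtype=float)
    # Higher score -> prune more: small accuracy impact, large compute share.
    score = (flops / flops.sum()) / (sensitivity / sensitivity.sum() + 1e-8)
    # Normalize so the average per-layer sparsity is roughly the target
    # before clipping to the allowed range.
    score = score / score.mean()
    return np.clip(target_sparsity * score, min_s, max_s)

# Example with made-up per-layer numbers.
sens = [0.02, 0.10, 0.01, 0.05]        # accuracy drop at a probe sparsity
flops = [4.0e9, 1.0e9, 6.0e9, 2.0e9]   # per-layer FLOPs
print(allocate_sparsity(sens, flops, target_sparsity=0.5))
```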
Experimental results show that, for 360-degree video saliency prediction, the proposed pruning method achieves a better balance between accuracy and inference speed: at the same speed we obtain higher accuracy, and at the same accuracy we obtain faster inference. Moreover, the smallest pruned model needs only 35 milliseconds per inference on a smartphone, a real-world speedup of 3.2 times over the original model, which is more than sufficient for subsequent applications. Although the prediction accuracy drops by 4.2%, our visual analysis suggests that the results remain within an acceptable range, and in some scenarios the pruned model even makes more appropriate predictions than the original one.
Saliency prediction is a crucial problem in computer vision because it helps us better understand human visual attention and enables many further applications. For conventional 2D images and videos, saliency prediction has been studied extensively, with methods ranging from traditional image processing approaches based on handcrafted features to the latest data-driven deep convolutional neural networks. Recently, saliency prediction for 360-degree visual content has also become a popular topic with the advancement of VR/AR techniques, devices, and related applications.
With the assistance of powerful CNNs (convolutional neural networks) and community efforts, several works already achieve strong saliency prediction performance for 360-degree videos. However, due to the complexity of deep models and the higher resolution of the inputs, these models are still difficult to deploy on mobile devices such as VR/AR headsets and smartphones. As a result, model reduction has become a critical problem for 360-degree video saliency prediction.
Among model compression methods, pruning is a straightforward and effective one. Since the concept of sensitivity to pruning was introduced, many works have considered sensitivity from different aspects and pruned models in many different ways. Yet there is little literature on pruning models for specific applications such as saliency prediction. Moreover, when pruning models, most existing works focus on reducing model size and computational cost rather than real-world inference time.
Therefore, we propose a sensitivity-latency-aware pruning method for efficient 360-degree video saliency prediction. We combine the result of a pruning sensitivity analysis with the computational cost of each layer to determine the pruning sparsity of every layer individually. We also utilize a linear regression model as a latency predictor to further ensure the desired speedup of inference time on mobile devices. Finally, a feature-based knowledge distillation training flow is used to minimize the accuracy drop. Experimental results show that our pruning method achieves a better accuracy-latency tradeoff: compared to the unpruned model, the smallest pruned model reaches a real-world speedup of 3.2 times, with an inference time of 35 ms measured on MediaTek's Dimensity 1000+ platform. The qualitative results also demonstrate that the accuracy drop of 4.2% is acceptable for potential applications.
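A latency predictor of the kind described above might look roughly like the following sketch, assuming a handful of pruned variants have already been benchmarked on the target device. The feature encoding (per-layer sparsity vectors), the measurement values, and the use of scikit-learn's LinearRegression are assumptions for illustration, not the thesis's actual implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical measurements: each row is the per-layer sparsity of one pruned
# variant, paired with its measured on-device inference time (ms).
X = np.array([
    [0.0, 0.0, 0.0, 0.0],   # unpruned baseline
    [0.3, 0.3, 0.3, 0.3],
    [0.5, 0.4, 0.6, 0.5],
    [0.7, 0.6, 0.8, 0.7],
])
y = np.array([112.0, 84.0, 61.0, 42.0])  # illustrative latencies

# Fit a linear model mapping per-layer sparsities to latency.
latency_model = LinearRegression().fit(X, y)

# Predict the latency of a candidate sparsity configuration before pruning,
# so configurations that miss the latency budget can be rejected early.
candidate = np.array([[0.6, 0.5, 0.7, 0.6]])
print(f"predicted latency: {latency_model.predict(candidate)[0]:.1f} ms")
```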