| Field | Value |
|---|---|
| Student | 詹富翔 (Chan, Fu-Hsiang) |
| Thesis title | 基於深度學習預測交通意外及事故 (Anticipating Accidents based on Deep Learning in Dashcam Videos) |
| Advisor | 孫民 (Sun, Min) |
| Committee members | 陳煥宗 (Chen, Hwann-Tzong), 賴尚宏 (Lai, Shang-Hong), 王鈺強 (Wang, Yu-Chiang) |
| Degree | Master |
| Department | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of publication | 2017 |
| Graduation academic year | 105 |
| Language | English |
| Pages | 38 |
| Keywords (Chinese) | 電腦視覺, 機器學習, 深度學習 (Computer Vision, Machine Learning, Deep Learning) |
| Keywords (English) | Computer Vision, Machine Learning, Deep Learning |
We propose a deep learning model, a Dynamic-Spatial-Attention (DSA) Recurrent Neural Network (RNN) (Fig. 1.1), to anticipate the moment an accident occurs in dashcam videos. The model learns (1) which objects in the scene are more likely to be dangerous at each time step and pays particular attention to them, and (2) how to reason over neighboring time steps to judge whether those risky objects may lead to an accident. Anticipating accidents is not as easy as anticipating driving maneuvers (e.g., lane changes or turns), because accidents happen suddenly and occur rarely on the road. We therefore (1) use a state-of-the-art object detection algorithm (Faster R-CNN [1]) to detect objects and a tracking algorithm (MDP [2]) to track them, and (2) combine scene-level information with object appearance and trajectory cues to anticipate the time of the accident. We collected 968 Taiwanese dashcam videos containing accidents of different kinds (e.g., a motorbike hits a motorbike, a car hits a motorbike, etc.) (Fig. 5.1), each annotated with the time of the accident and the categories of the objects involved, so the data can be used for supervised training and quantitative evaluation. Our model anticipates accidents 1.22 seconds in advance with 80% recall and 46.92% precision, and achieves up to 63.98% mean average precision.
We propose a Dynamic-Spatial-Attention (DSA) Recurrent Neural Network (RNN) for anticipating accidents in dashcam videos (Fig. 1.1). Our DSA-RNN learns to (1) dynamically distribute soft attention to candidate objects to gather subtle cues and (2) model the temporal dependencies of all cues to anticipate an accident robustly. Anticipating accidents is much less studied than anticipating maneuvers such as changing lanes or making turns, since accidents are rarely observed and can happen in many different ways, often suddenly. To overcome these challenges, we (1) utilize a state-of-the-art object detector [1] and tracking-by-detection [2] to detect and track candidate objects, and (2) incorporate full-frame and object-based appearance and motion features in our model. We also harvest a diverse dataset of 968 dashcam accident videos from the web (Fig. 5.1). The dataset is unique in that every video contains an accident, and the accidents are of various kinds (e.g., a motorbike hits a car, a car hits another car, etc.). We manually annotate the time and location of each accident and use these labels as supervision to train and evaluate our method. We show that our method anticipates accidents about 1.22 seconds before they occur with 80% recall and 46.92% precision. Most importantly, it achieves the highest mean average precision (63.98%), outperforming baselines without attention or without an RNN.
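To make the mechanism concrete, below is a minimal NumPy sketch of a single DSA-RNN time step as the abstract describes it: soft-attention weights are computed over the detected candidate objects, the attended object feature is fused with the full-frame feature, an LSTM updates its state, and a per-frame accident probability is emitted. It is written only from the description above; every dimension, parameter name, the gating layout, and the fusion-by-concatenation choice are illustrative assumptions rather than the thesis implementation.

```python
# Minimal sketch of one Dynamic-Spatial-Attention (DSA) RNN time step.
# Written from the abstract's description only; all shapes, parameter names,
# and the concatenation-based fusion are assumptions, not the thesis code.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsa_step(frame_feat, obj_feats, h_prev, c_prev, params):
    """One time step: attend over candidate objects, fuse with the full-frame
    feature, update an LSTM cell, and emit an accident probability.

    frame_feat : (Df,)    full-frame appearance/motion feature
    obj_feats  : (K, Do)  features of K detected and tracked candidate objects
    h_prev, c_prev : (H,) previous LSTM hidden / cell state
    """
    W_att, w_att = params["W_att"], params["w_att"]           # attention parameters
    W_x, W_h, b = params["W_x"], params["W_h"], params["b"]   # LSTM parameters (4 gates stacked)
    W_out, b_out = params["W_out"], params["b_out"]           # accident classifier

    # (1) dynamic soft attention: score each object conditioned on the previous hidden state
    scores = np.tanh(obj_feats @ W_att + h_prev @ w_att)      # (K, A)
    alpha = softmax(scores.sum(axis=1))                       # (K,) attention over objects
    attended = alpha @ obj_feats                               # (Do,) attention-weighted object feature

    # (2) fuse the attended object feature with the full-frame feature
    x_t = np.concatenate([frame_feat, attended])               # (Df + Do,)

    # (3) LSTM update to model temporal dependencies across frames
    H = h_prev.shape[0]
    gates = x_t @ W_x + h_prev @ W_h + b                       # (4H,)
    i, f, o = sigmoid(gates[:H]), sigmoid(gates[H:2 * H]), sigmoid(gates[2 * H:3 * H])
    g = np.tanh(gates[3 * H:])
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)

    # (4) per-frame probability that an accident is imminent
    p_accident = sigmoid(h_t @ W_out + b_out)
    return p_accident, alpha, h_t, c_t

if __name__ == "__main__":
    # Toy dimensions; the 4096-d features are only a plausible guess (e.g., VGG-like).
    rng = np.random.default_rng(0)
    Df, Do, K, H, A = 4096, 4096, 10, 512, 256
    params = {
        "W_att": 0.01 * rng.standard_normal((Do, A)),
        "w_att": 0.01 * rng.standard_normal((H, A)),
        "W_x": 0.01 * rng.standard_normal((Df + Do, 4 * H)),
        "W_h": 0.01 * rng.standard_normal((H, 4 * H)),
        "b": np.zeros(4 * H),
        "W_out": 0.01 * rng.standard_normal(H),
        "b_out": 0.0,
    }
    p, alpha, h, c = dsa_step(rng.standard_normal(Df),
                              rng.standard_normal((K, Do)),
                              np.zeros(H), np.zeros(H), params)
    print("accident probability:", float(p))
    print("attention over objects:", alpha.round(3))
```

In training, a sequence of such per-frame probabilities would typically be supervised against the annotated accident time, e.g., with a loss that rewards early, confident predictions; the exact loss and feature extractors are not specified in this abstract.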
[1] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in NIPS, 2015.
[2] Y. Xiang, A. Alahi, and S. Savarese, “Learning to track: Online multi-object tracking by decision making,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 4705–4713, 2015.
[3] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
[4] H. Wang and C. Schmid, “Action recognition with improved trajectories,” in ICCV, 2013.
[5] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, 1997.
[6] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[7] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in CVPR, 2012.
[8] Google Inc., “Google self-driving car project monthly report,” May 2015.
[9] National Highway Traffic Safety Administration, “2012 motor vehicle crashes: overview,” 2013.
[10] A. Jain, A. Singh, H. S. Koppula, S. Soh, and A. Saxena, “Recurrent neural networks for driver activity anticipation via sensory-fusion architecture,” in ICRA, 2016.
[11] V. V. Valenzuela, R. D. Lins, and H. M. De Oliveira, “Application of enhanced-2d-cwt in topographic images for mapping landslide risk areas,” in International Conference Image Analysis and Recognition, pp. 380–388, Springer, 2013.
[12] S. M. Arietta, A. A. Efros, R. Ramamoorthi, and M. Agrawala, “City forensics: Using visual elements to predict non-visual city attributes,” IEEE transactions on visualization and computer graphics, vol. 20, no. 12, pp. 2624–2633, 2014.
[13] A. Khosla, B. An An, J. J. Lim, and A. Torralba, “Looking beyond the visible scene,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3710–3717, 2014.
[14] M. S. Ryoo, “Human activity prediction: Early recognition of ongoing activities from streaming videos,” in ICCV, 2011.
[15] M. Hoai and F. De la Torre, “Max-margin early event detectors,” in CVPR, 2012.
[16] T. Lan, T.-C. Chen, and S. Savarese, “A hierarchical representation for future action prediction,” in ECCV, 2014.
[17] K. M. Kitani, B. D. Ziebart, J. A. D. Bagnell, and M. Hebert, “Activity forecasting,” in ECCV, 2012.
[18] J. Yuen and A. Torralba, “A data-driven approach for event prediction,” in ECCV, 2010.
[19] J. Walker, A. Gupta, and M. Hebert, “Patch to the future: Unsupervised visual prediction,” in CVPR, 2014.
[20] Z. Wang, M. Deisenroth, H. Ben Amor, D. Vogt, B. Schölkopf, and J. Peters, “Probabilistic modeling of human movements for intention inference,” in RSS, 2012.
[21] H. S. Koppula and A. Saxena, “Anticipating human activities using object affordances for reactive robotic response,” PAMI, vol. 38, no. 1, pp. 14–29, 2016.
[22] H. S. Koppula, A. Jain, and A. Saxena, “Anticipatory planning for human-robot teams,” in ISER, 2014.
[23] J. Mainprice and D. Berenson, “Human-robot collaborative manipulation planning using early prediction of human motion,” in IROS, 2013.
[24] H. Berndt, J. Emmert, and K. Dietmayer, “Continuous driver intention recognition with hidden markov models,” in Intelligent Transportation Systems, 2008.
[25] B. Frohlich, M. Enzweiler, and U. Franke, “Will this car change the lane? - turn signal recognition in the frequency domain,” in Intelligent Vehicles Symposium (IV), 2014.
[26] P. Kumar, M. Perrollaz, S. Lefèvre, and C. Laugier, “Learning-based approach for online lane change intention prediction,” in Intelligent Vehicles Symposium (IV), 2013.
[27] M. Liebner, M. Baumann, F. Klanner, and C. Stiller, “Driver intent inference at urban intersections using the intelligent driver model,” in Intelligent Vehicles Symposium (IV), 2012.
[28] B. Morris, A. Doshi, and M. Trivedi, “Lane change intent prediction for driver assistance: On-road design and evaluation,” in Intelligent Vehicles Symposium (IV), 2011.
[29] A. Doshi, B. Morris, and M. Trivedi, “On-road prediction of driver’s intent with multimodal sensory cues,” IEEE Pervasive Computing, vol. 10, no. 3, pp. 22–34, 2011.
[30] M. M. Trivedi, T. Gandhi, and J. McCall, “Looking-in and looking-out of a vehicle: Computer-vision-based enhanced vehicle safety,” IEEE Transactions on Intelligent Transportation Systems, vol. 8, no. 1, pp. 108–120, 2007.
[31] A. Jain, H. S. Koppula, B. Raghavan, S. Soh, and A. Saxena, “Car that knows before you do: Anticipating maneuvers via learning temporal driving models,” in ICCV, 2015.
[32] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville., “Describing videos by exploiting temporal structure,” in ICCV, 2015.
[33] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” arXiv preprint arXiv:1502.03044, 2015.
[34] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent models of visual attention,” in NIPS, 2014.
[35] J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with visual attention,” in ICLR, 2015.
[36] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, “Segmentation and recognition using structure from motion point clouds,” in ECCV, 2008.
[37] B. Leibe, N. Cornelis, K. Cornelis, and L. V. Gool, “Dynamic 3d scene analysis from a moving vehicle,” in CVPR, 2007.
[38] T. Scharwächter, M. Enzweiler, S. Roth, and U. Franke, “Efficient multi-cue scene segmentation,” in GCPR, 2013.
[39] M. Cordts, M. Omran, S. Ramos, T. Scharwächter, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset,” in CVPR Workshop on The Future of Datasets in Vision, 2015.
[40] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587, 2014.
[41] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, vol. 1, pp. 886–893, 2005.
[42] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
[43] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in European Conference on Computer Vision, pp. 346–361, Springer, 2014.
[44] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, 2015.
[45] W. Choi, “Near-online multi-target tracking with aggregated local flow descriptor,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 3029–3037, 2015.
[46] B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in IJCAI, 1981.
[47] G. Farnebäck, “Two-frame motion estimation based on polynomial expansion,” Image analysis, pp. 363–370, 2003.
[48] H. W. Kuhn, “The hungarian method for the assignment problem,” 50 Years of Integer Programming 1958-2008, pp. 29–47, 2010.
[49] R. E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of Basic Engineering, 1960.
[50] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” arXiv preprint arXiv:1211.5063, 2012.
[51] P. J. Werbos, “Backpropagation through time: what it does and how to do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[52] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
[53] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision, pp. 740–755, Springer, 2014.
[54] C. Vondrick, D. Patterson, and D. Ramanan, “Efficiently scaling up crowdsourced video annotation,” International Journal of Computer Vision, pp. 1–21, doi: 10.1007/s11263-012-0564-1.
[55] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[56] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[57] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
[58] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “C3d: generic features for video analysis,” CoRR, abs/1412.0767, 2014.
[59] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[60] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.