| Graduate Student | 詹誠盛 Chan, Cheng-Sheng |
|---|---|
| Thesis Title | 基於深度學習從手部視角辨識影像 Recognition from Hand Cameras: A Revisit with Deep Learning |
| Advisor | 孫民 Sun, Min |
| Oral Defense Committee | 賴尚宏 Lai, Shang-Hong; 陳煥宗 Chen, Hwann-Tzong; 王鈺強 Wang, Yu-Chiang Frank |
| Degree | Master |
| Department | 電機資訊學院 - 電機工程學系 Department of Electrical Engineering |
| Year of Publication | 2017 |
| Academic Year of Graduation | 105 |
| Language | English |
| Number of Pages | 37 |
| Keywords (Chinese) | 深度學習、手部視角、影像辨識 |
| Keywords (English) | deep learning, hand camera, image recognition |
In recent work on wearable cameras, many studies have exploited the wearable setting to recognize activities of daily living. The camera is commonly mounted on the head or the chest to approximate what a person sees (an egocentric, first-person camera). We start from a different viewpoint and place the camera under the wrist (referred to as a HandCam). Compared with the egocentric view, the hand always stays near the center of the HandCam frame, which gives the system two advantages: (1) no hand detection is needed, and (2) the activities of the hands can be observed at all times. Our goal is to use such a HandCam system to recognize hand states (whether the hand is interacting with an object, the hand gesture, and the object category). We use deep learning models for recognition, and the system can also automatically discover novel object categories in unseen scenes.
We collected 20 videos in which the views from the different cameras, including the HandCam view and the egocentric view, are synchronized; the users interact with various objects in three different scenes. Experimental results show that, under different settings, the HandCam outperforms the egocentric camera (also with a deep-learning-based model), as well as an egocentric method combined with temporal dense-trajectory features. Using an automatic alignment method, we organize the videos recorded by different users in different scenes to reduce the variation caused by differences in how the camera is worn, which improves recognition accuracy by 3.3%. Finetuning the deep learning model also consistently improves recognition accuracy. Most importantly, we propose a model that combines the two camera views (a two-streams model), which achieves the best recognition accuracy in four out of five tasks. Finally, we hope to deploy such a system in the real world; toward this goal, we have improved the system described above so that it can perform recognition in real time. We believe that, after collecting more data, our system will be able to robustly log the states of the hands in daily life.
We revisit the study of a wrist-mounted camera system (referred to as HandCam) for recognizing activities of hands. HandCam has two unique properties compared to egocentric systems [2, 3] (referred to as EgoCam): (1) it avoids the need to detect hands; (2) it more consistently observes the activities of hands. Taking advantage of these properties, we propose a deep-learning-based method to recognize hand states (free vs. active hands, hand gestures, object categories) and to discover object categories. Moreover, we propose a novel two-streams deep network to further take advantage of both HandCam and EgoCam. We have collected a new synchronized HandCam and EgoCam dataset with 20 videos captured in three scenes for hand state recognition. Experiments show that our HandCam system consistently outperforms a deep-learning-based EgoCam method (with estimated manipulation regions) and a dense-trajectory-based EgoCam method [4] in all tasks. We also show that HandCam videos captured by different users can be easily aligned to improve free vs. active recognition accuracy (a 3.3% improvement) in the across-scenes use case. Next, we finetune a Convolutional Neural Network [5], which consistently improves accuracy. Most importantly, our novel two-streams deep network combining HandCam and EgoCam features achieves the best performance in four out of five tasks. Finally, aiming at real-world use, we extend the basic system in several steps to make it suitable for real-time applications. With more data, we believe the new system with joint HandCam and EgoCam can robustly log hand states in daily life.
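The abstract names a two-streams deep network and CNN finetuning but does not spell out the implementation here. As a rough illustration only, not the thesis code, the sketch below shows one way to fuse HandCam and EgoCam frame features by late concatenation on top of ImageNet-pretrained backbones in PyTorch; the class name TwoStreamHandStateNet, the learning rates, and the choice of VGG-16 [38] rather than AlexNet [5] are assumptions made for this example.

```python
import torch
import torch.nn as nn
from torchvision import models


class TwoStreamHandStateNet(nn.Module):
    """Illustrative late-fusion classifier over synchronized HandCam/EgoCam frames."""

    def __init__(self, num_classes: int):
        super().__init__()
        # One ImageNet-pretrained backbone per camera view, finetuned end-to-end.
        self.hand_stream = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.ego_stream = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        # Drop the original 1000-way ImageNet layer; keep the 4096-d fc7 features.
        self.hand_stream.classifier = nn.Sequential(*list(self.hand_stream.classifier.children())[:-1])
        self.ego_stream.classifier = nn.Sequential(*list(self.ego_stream.classifier.children())[:-1])
        # Joint classifier over the concatenated 2 x 4096-d features.
        self.head = nn.Linear(2 * 4096, num_classes)

    def forward(self, hand_frame: torch.Tensor, ego_frame: torch.Tensor) -> torch.Tensor:
        h = self.hand_stream(hand_frame)            # HandCam feature, shape (N, 4096)
        e = self.ego_stream(ego_frame)              # EgoCam feature, shape (N, 4096)
        return self.head(torch.cat([h, e], dim=1))  # hand-state logits, shape (N, num_classes)


# Finetuning recipe: small learning rate for pretrained layers, larger for the new head.
model = TwoStreamHandStateNet(num_classes=2)        # e.g., free vs. active hand
optimizer = torch.optim.SGD(
    [
        {"params": model.hand_stream.parameters(), "lr": 1e-4},
        {"params": model.ego_stream.parameters(), "lr": 1e-4},
        {"params": model.head.parameters(), "lr": 1e-3},
    ],
    momentum=0.9,
)
```

Concatenating fc7-style features and finetuning the pretrained layers with a reduced learning rate is a common recipe for this kind of two-stream fusion; the actual architecture and hyperparameters used in the thesis may differ.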
[1] C.-S. Chan, S.-Z. Chen, P.-X. Xie, C.-C. Chang, and M. Sun, “Recognition from hand cameras: A revisit with deep learning,” in ECCV, 2016.
[2] A. Fathi, X. Ren, and J. M. Rehg, “Learning to recognize objects in egocentric activities,” in CVPR, 2011.
[3] D. Damen, T. Leelasawassuk, O. Haines, A. Calway, and W. Mayol-Cuevas, “You-do, I-learn: Discovering task relevant objects and their modes of interaction from multi-user egocentric video,” in BMVC, 2014.
[4] H. Wang and C. Schmid, “Action recognition with improved trajectories,” in ICCV, 2013.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012.
[6] D.-A. Huang, W.-C. Ma, M. Ma, and K. M. Kitani, “How do we use our hands? Discovering a diverse set of common grasps,” in CVPR, 2015.
[7] C. Wu, “Towards linear-time incremental structure from motion,” in 3DV, 2013.
[8] H. Pirsiavash and D. Ramanan, “Detecting activities of daily living in first-person camera views,” in CVPR, 2012.
[9] G. Rogez, J. S. Supančič, M. Khademi, J. M. M. Montiel, and D. Ramanan, “3D hand pose detection in egocentric RGB-D images,” CoRR, vol. abs/1412.0065, 2014.
[10] G. Rogez, J. S. Supančič, and D. Ramanan, “First-person pose recognition using egocentric workspaces,” in CVPR, 2015.
[11] D. Kim, O. Hilliges, S. Izadi, A. D. Butler, J. Chen, I. Oikonomidis, and P. Olivier, “Digits: Freehand 3D interactions anywhere using a wrist-worn gloveless sensor,” in UIST, 2012.
[12] A. Saxena, J. Driemeyer, and A. Ng, “Robotic grasping of novel objects using vision,” Int. J. Rob. Res., vol. 27, no. 2, pp. 157–173, 2008.
[13] W. Mayol-Cuevas, B. Tordoff, and D. Murray, “On the choice and placement of wearable vision sensors,” IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, vol. 39, no. 2, pp. 414–425, 2009.
[14] T. Maekawa, Y. Yanagisawa, Y. Kishino, K. Ishiguro, K. Kamei, Y. Sakurai, and T. Okadome, “Object-based activity recognition with heterogeneous sensors on wrist,” in ICPC, 2010.
[15] T. Maekawa, Y. Kishino, Y. Yanagisawa, and Y. Sakurai, “WristSense: Wrist-worn sensor device with camera for daily activity recognition,” in PERCOM Workshops, IEEE, 2012.
[16] Y. Li, A. Fathi, and J. M. Rehg, “Learning to predict gaze in egocentric video,” in ICCV, 2013.
[17] D. J. Patterson, D. Fox, H. Kautz, and M. Philipose, “Fine-grained activity recognition by aggregating abstract object usage,” in ISWC, 2005.
[18] M. Stikic, T. Huynh, K. V. Laerhoven, and B. Schiele, “ADL recognition based on the combination of RFID and accelerometer sensing,” in Pervasive Computing Technologies for Healthcare, 2008.
[19] J. Wu, A. Osuntogun, T. Choudhury, M. Philipose, and J. M. Rehg, “A scalable approach to activity recognition based on object use,” in ICCV, 2007.
[20] A. Fathi, A. Farhadi, and J. M. Rehg, “Understanding egocentric activities,” in ICCV, 2011.
[21] A. Fathi and J. M. Rehg, “Modeling actions through state changes,” in CVPR, 2013.
[22] A. Fathi, Y. Li, and J. M. Rehg, “Learning to recognize daily actions using gaze,” in ECCV, 2012.
[23] Y. Li, Z. Ye, and J. M. Rehg, “Delving into egocentric actions,” in CVPR, 2015.
[24] J. Ghosh, Y. J. Lee, and K. Grauman, “Discovering important people and objects for egocentric video summarization,” in CVPR, 2012.
[25] Z. Lu and K. Grauman, “Story-driven summarization for egocentric video,” in CVPR, 2013.
[26] M. Sun, A. Farhadi, B. Taskar, and S. Seitz, “Salient montages from unconstrained videos,” in ECCV, 2014.
[27] F. De la Torre, J. K. Hodgins, J. Montano, and S. Valcarcel, “Detailed human data acquisition of kitchen activities: The CMU-Multimodal Activity Database (CMU-MMAC),” in Workshop on Developing Shared Home Behavior Datasets to Advance HCI and Ubiquitous Computing Research, in conjunction with CHI 2009, 2009.
[28] M. Moghimi, P. Azagra, L. Montesano, A. C. Murillo, and S. Belongie, “Experiments on an RGB-D wearable vision system for egocentric activity recognition,” in CVPR Workshop on Egocentric (First-Person) Vision, 2014.
[29] D. Damen, A. Gee, W. Mayol-Cuevas, and A. Calway, “Egocentric real-time workspace monitoring using an RGB-D camera,” in IROS, 2012.
[30] C. Li and K. M. Kitani, “Pixel-level hand detection in egocentric videos,” in CVPR, 2013.
[31] C. Li and K. M. Kitani, “Model recommendation with virtual probes for egocentric hand detection,” in ICCV, 2013.
[32] A. Betancourt, M. Lopez, C. Regazzoni, and M. Rauterberg, “A sequential classifier for hand detection in the framework of egocentric vision,” in CVPRW, 2014.
[33] A. Vardy, J. Robinson, and L.-T. Cheng, “The WristCam as input device,” in ISWC, 1999.
[34] L. Chan, Y.-L. Chen, C.-H. Hsieh, R.-H. Liang, and B.-Y. Chen, “CyclopsRing: Enabling whole-hand and context-aware interactions through a fisheye ring,” in UIST, 2015.
[35] K. Ohnishi, A. Kanehira, A. Kanezaki, and T. Harada, “Recognizing activities of daily living with a wrist-mounted camera,” in CVPR, 2016.
[36] R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun, “Deep Image: Scaling up image recognition,” CoRR, vol. abs/1501.02876, 2015.
[37] J. Canny, “A computational approach to edge detection,” PAMI, vol. PAMI-8, no. 6, pp. 679–698, 1986.
[38] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
[39] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in NIPS, 2014.
[40] Z. Li and D. Hoiem, “Learning without forgetting,” in ECCV, 2016.
[41] R. Caruana, “Multitask learning,” Machine Learning, vol. 28, pp. 41–75, July 1997.
[42] S. Bambach, S. Lee, D. J. Crandall, and C. Yu, “Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions,” in ICCV, 2015.
[43] P.-N. Tan, M. Steinbach, and V. Kumar, “Introduction to data mining,” 2005.