
Author: 詹誠盛 (Chan, Cheng-Sheng)
Title: Recognition from Hand Cameras: A Revisit with Deep Learning
Chinese title: 基於深度學習從手部視角辨識影像
Advisor: 孫民 (Sun, Min)
Committee members: 賴尚宏 (Lai, Shang-Hong), 陳煥宗 (Chen, Hwann-Tzong), 王鈺強 (Wang, Yu-Chiang Frank)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of publication: 2017
Graduation academic year: 105
Language: English
Number of pages: 37
Keywords (Chinese): 深度學習、手部視角、影像辨識
Keywords (English): deep learning, hand camera, image recognition
Recently, much of the research on wearable cameras has exploited the wearable setting to recognize activities of daily living. The camera is usually mounted on the head or chest to approximate what the wearer sees (a first-person, or egocentric, camera). We take a different point of view and mount the camera under the wrist (referred to as HandCam). Compared with the egocentric view, the hand always stays near the center of the HandCam frame, which gives this camera system two advantages: (1) no hand detection is required, and (2) the activities of the hand can be observed at all times. Our goal is to use this HandCam system to recognize hand states (whether the hand is interacting with an object, the hand gesture, and the object category). We use deep-learning models for recognition, and the system can also automatically discover new object categories in unseen scenes.

We collected 20 videos in which the frames from different viewpoints, covering both the HandCam view and the egocentric view, are synchronized; the users interact with various objects in three different scenes. Experimental results show that, under all settings, HandCam outperforms the egocentric camera (using the same deep-learning-based models) and also outperforms an egocentric method augmented with temporal trajectory information (dense trajectories). Using an automatic alignment method, we registered videos recorded by different users in different scenes, reducing the variation caused by differences in how the camera is worn and improving recognition accuracy by 3.3%. Fine-tuning the deep-learning model further yields consistent accuracy gains. Most importantly, we propose a model that combines both camera views (a two-streams model), which achieves the best recognition accuracy in four out of five tasks. Finally, aiming to deploy such a system in the real world, we extended the system described above so that it can perform recognition in real time. We believe that, with more data, our system will be able to robustly log the states of the hands in daily life.
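To make the fine-tuning step mentioned above concrete, the following is a minimal sketch and not the thesis's actual training code: it replaces the final layer of an ImageNet-style CNN [5] with a new hand-state classification head and continues training on HandCam frames. The names num_hand_states, handcam_loader, and finetune_epoch are placeholders introduced for this sketch.

import torch
import torch.nn as nn
import torchvision.models as models

num_hand_states = 2  # placeholder, e.g. free vs. active hand
model = models.alexnet(weights=None)  # pretrained ImageNet weights would be loaded in practice
model.classifier[6] = nn.Linear(4096, num_hand_states)  # new task-specific head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def finetune_epoch(handcam_loader):
    """One training pass over (frame, hand_state_label) batches
    drawn from HandCam videos (handcam_loader is a placeholder)."""
    model.train()
    for frames, labels in handcam_loader:
        optimizer.zero_grad()
        loss = criterion(model(frames), labels)
        loss.backward()
        optimizer.step()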


We revisit the study of a wrist-mounted camera system (referred to as HandCam) for recognizing activities of hands. HandCam has two unique properties compared to egocentric systems [2, 3] (referred to as EgoCam): (1) it avoids the need to detect hands; (2) it more consistently observes the activities of hands. By taking advantage of these properties, we propose a deep-learning-based method to recognize hand states (free vs. active hands, hand gestures, object categories) and to discover object categories. Moreover, we propose a novel two-streams deep network to further take advantage of both HandCam and EgoCam. We have collected a new synchronized HandCam and EgoCam dataset with 20 videos captured in three scenes for hand-state recognition. Experiments show that our HandCam system consistently outperforms a deep-learning-based EgoCam method (with estimated manipulation regions) and a dense-trajectory-based EgoCam method [4] in all tasks. We also show that HandCam videos captured by different users can be easily aligned to improve free vs. active recognition accuracy (a 3.3% improvement) in the across-scenes use case. Next, we fine-tune the Convolutional Neural Network [5], which consistently improves accuracy. More importantly, our novel two-streams deep network combining HandCam and EgoCam features achieves the best performance in four out of five tasks. Finally, we aim to apply our system to daily life: starting from the basic system, we add several steps to make it suitable for real-time applications. With more data, we believe the new system with joint HandCam and EgoCam can robustly log hand states in daily life.
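To illustrate the two-streams idea, below is a minimal late-fusion sketch, assuming one CNN backbone per camera view whose penultimate features are concatenated and fed to a joint hand-state classifier. The class name TwoStreamHandStateNet, the choice of VGG-16 backbones, and the single fusion layer are illustrative assumptions, not the exact architecture used in the thesis.

import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamHandStateNet(nn.Module):
    """Late fusion of HandCam and EgoCam CNN features for hand-state
    classification (illustrative sketch, not the thesis's exact model)."""

    def __init__(self, num_states):
        super().__init__()
        # One backbone per camera view; pretrained ImageNet weights would
        # normally be loaded (weights=None keeps the sketch download-free).
        self.hand_stream = models.vgg16(weights=None)
        self.ego_stream = models.vgg16(weights=None)
        # Drop each backbone's 1000-way ImageNet head, keeping the
        # 4096-d penultimate features.
        self.hand_stream.classifier = self.hand_stream.classifier[:-1]
        self.ego_stream.classifier = self.ego_stream.classifier[:-1]
        # Joint classifier over concatenated HandCam + EgoCam features.
        self.fusion = nn.Linear(4096 * 2, num_states)

    def forward(self, handcam_frame, egocam_frame):
        h = self.hand_stream(handcam_frame)  # (B, 4096)
        e = self.ego_stream(egocam_frame)    # (B, 4096)
        return self.fusion(torch.cat([h, e], dim=1))

# Example: one synchronized HandCam/EgoCam frame pair, 224x224 RGB.
model = TwoStreamHandStateNet(num_states=2)  # e.g., free vs. active hand
logits = model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))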

Table of Contents
1 Introduction
2 Related Work
  2.1 Egocentric Recognition
  2.2 Hand Detection and Pose Estimation
  2.3 Camera for Hands
3 Our Basic System
  3.1 Wearable Cues
  3.2 Hand Alignment
  3.3 Hand States Recognition
  3.4 State Change Detection
  3.5 Full Model
  3.6 Deep Feature
  3.7 Object Discovery
  3.8 Combining HandCam with EgoCam
4 System Promotion
  4.1 Deeper Model
  4.2 Single Model for Multi-tasking
  4.3 Recurrent Neural Network
5 Dataset
  5.1 Data Collection
  5.2 Implementation Details
6 Experiment Results
  6.1 EgoCam Baseline
  6.2 Basic System
    6.2.1 Method Analysis
    6.2.2 Hand State Recognition
    6.2.3 Combining HandCam with EgoCam
  6.3 After System Promotion
  6.4 Qualitative Results
7 Conclusion
References

References
[1] C.-S. Chan, S.-Z. Chen, P.-X. Xie, C.-C. Chang, and M. Sun, "Recognition from hand cameras: A revisit with deep learning," in ECCV, 2016.
[2] A. Fathi, X. Ren, and J. M. Rehg, "Learning to recognize objects in egocentric activities," in CVPR, 2011.
[3] D. Damen, T. Leelasawassuk, O. Haines, A. Calway, and W. Mayol-Cuevas, "You-do, I-learn: Discovering task relevant objects and their modes of interaction from multi-user egocentric video," in BMVC, 2014.
[4] H. Wang and C. Schmid, "Action recognition with improved trajectories," in ICCV, 2013.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[6] D.-A. Huang, W.-C. Ma, M. Ma, and K. M. Kitani, "How do we use our hands? Discovering a diverse set of common grasps," in CVPR, 2015.
[7] C. Wu, "Towards linear-time incremental structure from motion," in 3DV, 2013.
[8] H. Pirsiavash and D. Ramanan, "Detecting activities of daily living in first-person camera views," in CVPR, 2012.
[9] G. Rogez, J. S. Supančič, M. Khademi, J. M. M. Montiel, and D. Ramanan, "3D hand pose detection in egocentric RGB-D images," CoRR, vol. abs/1412.0065, 2014.
[10] G. Rogez, J. S. Supančič, and D. Ramanan, "First-person pose recognition using egocentric workspaces," in CVPR, 2015.
[11] D. Kim, O. Hilliges, S. Izadi, A. D. Butler, J. Chen, I. Oikonomidis, and P. Olivier, "Digits: Freehand 3D interactions anywhere using a wrist-worn gloveless sensor," in UIST, 2012.
[12] A. Saxena, J. Driemeyer, and A. Ng, "Robotic grasping of novel objects using vision," Int. J. Rob. Res., vol. 27, no. 2, pp. 157–173, 2008.
[13] W. Mayol-Cuevas, B. Tordoff, and D. Murray, "On the choice and placement of wearable vision sensors," IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, vol. 39, no. 2, pp. 414–425, 2009.
[14] T. Maekawa, Y. Yanagisawa, Y. Kishino, K. Ishiguro, K. Kamei, Y. Sakurai, and T. Okadome, "Object-based activity recognition with heterogeneous sensors on wrist," in ICPC, 2010.
[15] T. Maekawa, Y. Kishino, Y. Yanagisawa, and Y. Sakurai, "WristSense: Wrist-worn sensor device with camera for daily activity recognition," in PERCOM Workshops, IEEE, 2012.
[16] Y. Li, A. Fathi, and J. M. Rehg, "Learning to predict gaze in egocentric video," in ICCV, 2013.
[17] D. J. Patterson, D. Fox, H. Kautz, and M. Philipose, "Fine-grained activity recognition by aggregating abstract object usage," in ISWC, 2005.
[18] M. Stikic, T. Huynh, K. V. Laerhoven, and B. Schiele, "ADL recognition based on the combination of RFID and accelerometer sensing," in Pervasive Computing Technologies for Healthcare, 2008.
[19] J. Wu, A. Osuntogun, T. Choudhury, M. Philipose, and J. M. Rehg, "A scalable approach to activity recognition based on object use," in ICCV, 2007.
[20] A. Fathi, A. Farhadi, and J. M. Rehg, "Understanding egocentric activities," in ICCV, 2011.
[21] A. Fathi and J. M. Rehg, "Modeling actions through state changes," in CVPR, 2013.
[22] A. Fathi, Y. Li, and J. M. Rehg, "Learning to recognize daily actions using gaze," in ECCV, 2012.
[23] Y. Li, Z. Ye, and J. M. Rehg, "Delving into egocentric actions," in CVPR, 2015.
[24] J. Ghosh, Y. J. Lee, and K. Grauman, "Discovering important people and objects for egocentric video summarization," in CVPR, 2012.
[25] Z. Lu and K. Grauman, "Story-driven summarization for egocentric video," in CVPR, 2013.
[26] M. Sun, A. Farhadi, B. Taskar, and S. Seitz, "Salient montages from unconstrained videos," in ECCV, 2014.
[27] F. De la Torre, J. K. Hodgins, J. Montano, and S. Valcarcel, "Detailed human data acquisition of kitchen activities: The CMU-Multimodal Activity Database (CMU-MMAC)," in Workshop on Developing Shared Home Behavior Datasets to Advance HCI and Ubiquitous Computing Research, in conjunction with CHI 2009, 2009.
[28] M. Moghimi, P. Azagra, L. Montesano, A. C. Murillo, and S. Belongie, "Experiments on an RGB-D wearable vision system for egocentric activity recognition," in CVPR Workshop on Egocentric (First-person) Vision, 2014.
[29] D. Damen, A. Gee, W. Mayol-Cuevas, and A. Calway, "Egocentric real-time workspace monitoring using an RGB-D camera," in IROS, 2012.
[30] C. Li and K. M. Kitani, "Pixel-level hand detection in egocentric videos," in CVPR, 2013.
[31] C. Li and K. M. Kitani, "Model recommendation with virtual probes for egocentric hand detection," in ICCV, 2013.
[32] A. Betancourt, M. Lopez, C. Regazzoni, and M. Rauterberg, "A sequential classifier for hand detection in the framework of egocentric vision," in CVPRW, 2014.
[33] A. Vardy, J. Robinson, and L.-T. Cheng, "The WristCam as input device," in ISWC, 1999.
[34] L. Chan, Y.-L. Chen, C.-H. Hsieh, R.-H. Liang, and B.-Y. Chen, "CyclopsRing: Enabling whole-hand and context-aware interactions through a fisheye ring," in UIST, 2015.
[35] K. Ohnishi, A. Kanehira, A. Kanezaki, and T. Harada, "Recognizing activities of daily living with a wrist-mounted camera," in CVPR, 2016.
[36] R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun, "Deep Image: Scaling up image recognition," CoRR, vol. abs/1501.02876, 2015.
[37] J. Canny, "A computational approach to edge detection," PAMI, vol. PAMI-8, no. 6, pp. 679–698, 1986.
[38] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
[39] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in NIPS, 2014.
[40] Z. Li and D. Hoiem, "Learning without forgetting," in ECCV, 2016.
[41] R. Caruana, "Multitask learning," Machine Learning, vol. 28, pp. 41–75, July 1997.
[42] S. Bambach, S. Lee, D. J. Crandall, and C. Yu, "Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions," in ICCV, 2015.
[43] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, 2005.
