Graduate Student: 胡厚寧 Hu, Hou-Ning
Thesis Title: 跨越人類感知:代理人的 360 視角導航與深度感知 (Beyond Human Vision: 360 View Pilot and Depth Perception for an Embodied Agent)
Advisor: 孫民 Sun, Min
Oral Defense Committee: 王傑智 Wang, Chieh-Chih; 賴尚宏 Lai, Shang-Hong; 莊永裕 Chuang, Yung-Yu; 黃朝宗 Huang, Chao-Tsung
Degree: Doctor
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of Publication: 2021
Academic Year of Graduation: 110
Language: English
Number of Pages: 176
Keywords (Chinese): 單目三維物體追蹤, 準稠密個體相似度, 長短期記憶模型, 360 度影片, 注意力輔助系統
Keywords (English): Monocular 3D Object Tracking, Quasi-Dense Instance Similarity, Long Short-Term Memory, 360-Degree Videos, Focus Assistance System
Abstract (Chinese): This dissertation investigates the ability of an embodied agent to understand object semantics and 3D geometric information from raw pixels in order to help humans explore their environments. Human vision is limited: a person cannot take in a full 360-degree panoramic view, and attentional constraints make it impossible to process everything in sight while moving at high speed. An embodied agent, however, can break through these limits and thereby assist humans. In this dissertation, two chapters describe how we design embodied agents that assist and complement these limits of human perception.
First, starting from content understanding within a limited human field of view, we approach the problem through watching 360-degree videos. The challenge is that, within a limited field of view, the viewer must continuously keep the viewing angle locked on a target or shift it to a new target. We study how an agent understands the scenes and content of 360-degree videos, distilling raw pixels into object-level information and combining positional information across time so that the agent can steer the viewing angle. Both quantitative results and qualitative examples show that our agent ranks first on the key metrics of viewing-angle selection accuracy and user preference.
Next, we go a step further and investigate the bottleneck of tracking the dynamics of multiple objects in 3D space. When moving at high speed, humans can hardly detect and track every object in view simultaneously, let alone accurately estimate object geometry in 3D space. We train an agent to predict object geometry in 3D space from monocular video sequences, using a long short-term memory (LSTM) model to capture the spatio-temporal information of objects in a bird's-eye view, and relying on quasi-dense instance similarity and velocity information to continuously track multiple targets. By integrating temporal and spatial information, the agent effectively tracks and predicts the locations of objects appearing in the scene, and maintains robust data association in the complex environments of real-world datasets.
Abstract (English): This dissertation aims to help humans explore their environments by developing an embodied agent that understands high-level semantic and geometric information from raw pixels. Given a limited field of view (FoV), it is hard for a person to sense a 360° view of the surroundings. Moreover, human vision naturally discards peripheral information when a person moves at speed. Thankfully, embodied agents with well-trained perception come to the rescue. We introduce two chapters describing how an embodied agent assists and complements the limits of human perception.
To address the limited-FoV problem, we first conducted a 2D content-understanding experiment on 360° videos. One challenge of navigating through 360° videos is continuously focusing and refocusing on intended targets within a limited FoV. We built an agent that aggregates video content from pixels into object-level scene information. Given this object-level information and the trajectory of viewing angles, our agent regresses a shift in the current viewing angle to move to the next preferred one. Our agent achieved the best performance among all methods on viewing-area selection accuracy and user preference.
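To make the viewing-angle regression concrete, the following is a minimal PyTorch-style sketch of such an agent: it pools per-frame object-level features, carries the viewing-angle trajectory in a recurrent state, and regresses a (pan, tilt) shift. The module structure, feature sizes, and mean pooling are illustrative assumptions, not the dissertation's exact architecture.

```python
# Minimal sketch (illustrative, not the dissertation's exact model): a
# recurrent agent that consumes per-frame object-level features plus the
# previous viewing angle and regresses a shift (delta_pan, delta_tilt).
import torch
import torch.nn as nn

class ViewingAnglePilot(nn.Module):
    def __init__(self, obj_feat_dim=128, hidden_dim=256):
        super().__init__()
        # Encode each object feature, then mean-pool over objects.
        self.obj_encoder = nn.Linear(obj_feat_dim, hidden_dim)
        # The recurrent state carries the trajectory of past viewing angles.
        self.rnn = nn.GRUCell(hidden_dim + 2, hidden_dim)
        # Regress the shift applied to the current (pan, tilt).
        self.head = nn.Linear(hidden_dim, 2)

    def forward(self, obj_feats, prev_angle, hidden):
        # obj_feats: (num_objects, obj_feat_dim); prev_angle: (2,)
        pooled = self.obj_encoder(obj_feats).mean(dim=0)
        hidden = self.rnn(torch.cat([pooled, prev_angle]).unsqueeze(0), hidden)
        delta = self.head(hidden).squeeze(0)        # (delta_pan, delta_tilt)
        return prev_angle + delta, hidden

# Usage: roll the agent over a clip, one frame at a time.
pilot = ViewingAnglePilot()
angle = torch.zeros(2)                  # start looking straight ahead
hidden = torch.zeros(1, 256)
for _ in range(30):                     # 30 frames
    detections = torch.randn(5, 128)    # stand-in for per-object features
    angle, hidden = pilot(detections, angle, hidden)
```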
We further investigate the peripheral-vision-loss challenge: as moving speed increases, the effective FoV shrinks, making it hard for a person to detect all objects in sight, locate their future 3D positions, and track them. We propose a framework in which an agent effectively estimates 3D bounding box information and associates moving objects over time from a sequence of 2D images captured on a moving platform. We use quasi-dense instance similarity for robust data association and a velocity-based LSTM to aggregate spatial-temporal information in a bird's-eye view for 3D trajectory prediction. Experiments on real-world benchmarks show that our 3D tracking framework offers robust object association and tracking in urban driving scenarios.
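As a rough illustration of the two ingredients named above, the sketch below pairs (1) appearance matching between quasi-dense instance embeddings via a bi-directional softmax with (2) an LSTM that predicts each track's next bird's-eye-view position from its recent motion. Embedding sizes, the confidence threshold, and the matching rule are assumptions for the example, not the dissertation's exact implementation.

```python
# Illustrative sketch under assumed shapes and thresholds.
import torch
import torch.nn as nn
import torch.nn.functional as F

def bisoftmax_similarity(track_embs, det_embs):
    """Symmetric softmax over a cosine-similarity matrix.

    track_embs: (T, D) embeddings of existing tracks
    det_embs:   (N, D) embeddings of new detections
    returns:    (T, N) association scores
    """
    sim = F.normalize(track_embs, dim=1) @ F.normalize(det_embs, dim=1).T
    return 0.5 * (sim.softmax(dim=0) + sim.softmax(dim=1))

class BEVMotionLSTM(nn.Module):
    """Predicts the next (x, z) ground-plane position from past velocities."""
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden_dim,
                            batch_first=True)
        self.head = nn.Linear(hidden_dim, 2)

    def forward(self, past_positions):
        # past_positions: (batch, steps, 2) BEV coordinates over time.
        velocities = past_positions[:, 1:] - past_positions[:, :-1]
        out, _ = self.lstm(velocities)
        # Predicted displacement added to the last observed position.
        return past_positions[:, -1] + self.head(out[:, -1])

# Usage: score detections against tracks, keep confident pairs.
scores = bisoftmax_similarity(torch.randn(3, 256), torch.randn(4, 256))
matches = (scores > 0.5).nonzero()      # hypothetical confidence threshold

motion = BEVMotionLSTM()
next_xy = motion(torch.randn(3, 5, 2))  # 3 tracks, 5 past steps -> (3, 2)
```

In a full tracker, the association scores would typically feed a one-to-one assignment (e.g., Hungarian matching) rather than the simple threshold shown here, and the motion prediction would gate which track-detection pairs are even considered.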