
Author: 胡厚寧 (Hu, Hou-Ning)
Thesis Title: 跨越人類感知:代理人的 360 視角導航與深度感知
(Beyond Human Vision: 360 View Pilot and Depth Perception for an Embodied Agent)
Advisor: 孫民 (Sun, Min)
Oral Examination Committee: 王傑智 (Wang, Chieh-Chih), 賴尚宏 (Lai, Shang-Hong), 莊永裕 (Chuang, Yung-Yu), 黃朝宗 (Huang, Chao-Tsung)
Degree: Doctor (Ph.D.)
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Academic Year of Graduation: 110
Language: English
Number of Pages: 176
Keywords (Chinese): 單目三維物體追蹤、準稠密個體相似度、長短期記憶模型、360 度影片、注意力輔助系統
Keywords (English): Monocular 3D Object Tracking, Quasi-Dense Instance Similarity, Long Short-Term Memory, 360-degree Videos, Focus Assistance System
Chinese Abstract (translated)
    This dissertation explores the ability of an embodied agent to understand object semantics and 3D geometric information from raw pixels in order to assist humans in exploring their environments. People generally have a limited field of view that falls short of a full 360-degree panorama, and, constrained by attention, they cannot process all the information in view while moving at high speed; embodied agents, however, can break through these limitations and thereby assist humans. Two parts of this dissertation describe how we design embodied agents to assist with and compensate for the limits of human perception.
    First, starting from content understanding within the limited human field of view, we begin with the task of watching 360-degree videos. The challenge is that, under a limited field of view, the viewing angle must continuously follow a target or be shifted to a new one. We study how an agent understands the scenes and content of 360-degree videos, distills pixel data into object-level information, and combines positional information across time so that the agent can steer the viewing angle. Both quantitative results and examples show that the agent ranks first on key metrics such as viewing-angle selection accuracy and user preference.
    Next, we go a step further and investigate the bottleneck of tracking the dynamics of multiple objects in 3D space. At high speed, humans can hardly detect and track all objects in view at once, let alone precisely estimate the 3D geometry of those objects. We train an agent to predict object geometry in 3D space from a monocular image sequence, use a long short-term memory (LSTM) model to capture spatio-temporal object information in a bird's-eye view, and rely on quasi-dense instance similarity and velocity information to continuously track multiple targets. By integrating temporal and spatial information, the agent effectively tracks and predicts the positions of objects appearing in the scene and maintains robust data association in the complex environments of real-world datasets.


English Abstract
    This dissertation aims to help humans explore their environments by developing an embodied agent that understands high-level semantic and geometric information from raw pixels. Given a limited field of view (FoV), it is hard for a person to sense a 360° view of the surroundings. Moreover, human vision naturally bypasses peripheral information when moving at speed. Thankfully, embodied agents with well-trained perception abilities come to the rescue. We introduce two parts describing how an embodied agent assists and complements the limitations of human perception.
    To address the limited-FoV problem, we first conducted a 2D content-understanding experiment on 360° videos. One challenge of navigating through 360° videos is continuously focusing on, and re-focusing on, intended targets within a limited FoV. We built an agent that aggregates video content from raw pixels into object-level scene information. Given the object-level information and the trajectories of viewing angles, our agent regresses a shift from the current viewing angle to the next preferred one. Our agent achieved the best viewing-area selection accuracy and user preference among all compared methods.
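    To make the viewing-angle regression concrete, the following is a minimal, illustrative sketch rather than the actual Deep 360 Pilot network: a recurrent model consumes pooled object-level features and the previous viewing angle for each frame and regresses a (pan, tilt) shift. The class name, feature dimensions, mean-pooling of object features, and GRU choice are assumptions for illustration; the object detector, the selector over candidate objects, and the smoothness objective described in the thesis are omitted.

    # Illustrative sketch only (not the thesis network): regress a viewing-angle
    # shift from per-frame object-level features and the previous viewing angle,
    # aggregated over time with a GRU.
    import torch
    import torch.nn as nn

    class ViewAnglePilot(nn.Module):
        def __init__(self, obj_feat_dim=128, hidden_dim=256):
            super().__init__()
            # Project per-object features, then mean-pool them into one vector per frame.
            self.obj_proj = nn.Sequential(nn.Linear(obj_feat_dim, hidden_dim), nn.ReLU())
            # Aggregate frame-level information over time.
            self.rnn = nn.GRU(hidden_dim + 2, hidden_dim, batch_first=True)
            # Regress a (pan, tilt) shift of the current viewing angle.
            self.head = nn.Linear(hidden_dim, 2)

        def forward(self, obj_feats, prev_angles):
            # obj_feats: (B, T, N, obj_feat_dim) features of N candidate objects per frame
            # prev_angles: (B, T, 2) viewing angle (pan, tilt) before each step
            pooled = self.obj_proj(obj_feats).mean(dim=2)   # (B, T, hidden_dim)
            x = torch.cat([pooled, prev_angles], dim=-1)    # condition on the current angle
            h, _ = self.rnn(x)
            return self.head(h)                             # (B, T, 2) predicted angle shifts

    if __name__ == "__main__":
        model = ViewAnglePilot()
        feats = torch.randn(1, 8, 5, 128)    # 8 frames, 5 candidate objects per frame
        angles = torch.zeros(1, 8, 2)
        print(model(feats, angles).shape)    # torch.Size([1, 8, 2])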
    We further investigate the challenge posed by peripheral vision loss: detecting all objects in sight, locating their future 3D positions, and tracking them. Increased moving speed incurs a decreased effective FoV. We propose a framework in which an agent effectively estimates 3D bounding box information and associates moving objects over time from a sequence of 2D images captured on a moving platform. We utilize quasi-dense instance similarity for robust data association and a velocity-based LSTM to aggregate spatio-temporal information in a bird's-eye view for 3D trajectory prediction. Experiments on real-world benchmarks show that our 3D tracking framework offers robust object association and tracking in urban-driving scenarios.
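    As an illustration of the data-association step only, here is a hedged sketch: appearance affinity from instance embeddings (as produced by quasi-dense similarity learning) is combined with a bird's-eye-view location affinity, and the resulting score matrix is solved with the Hungarian algorithm. The function name, the affinity weights, the exponential distance kernel, and the score threshold are illustrative assumptions; the thesis pipeline additionally uses motion-aware cues, single-frame 3D confidence, and the VeloLSTM motion model, which are not shown here.

    # Illustrative sketch of track-to-detection association (a simplification of
    # the thesis pipeline): blend appearance and bird's-eye-view location affinity,
    # then solve the assignment with the Hungarian algorithm.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def associate(track_embs, det_embs, track_xy, det_xy,
                  w_app=0.7, w_loc=0.3, dist_scale=10.0, min_score=0.3):
        """Return a list of (track_idx, det_idx) matches."""
        # Appearance affinity: cosine similarity of L2-normalized embeddings.
        t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
        d = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
        app = t @ d.T                                    # (num_tracks, num_dets)

        # Location affinity: decays with bird's-eye-view distance between a track's
        # predicted position and a detection's estimated position.
        dist = np.linalg.norm(track_xy[:, None, :] - det_xy[None, :, :], axis=-1)
        loc = np.exp(-dist / dist_scale)

        score = w_app * app + w_loc * loc
        rows, cols = linear_sum_assignment(-score)       # maximize the total score
        return [(int(r), int(c)) for r, c in zip(rows, cols) if score[r, c] >= min_score]

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        print(associate(rng.normal(size=(3, 16)), rng.normal(size=(4, 16)),
                        rng.normal(size=(3, 2)) * 5.0, rng.normal(size=(4, 2)) * 5.0))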

Table of Contents

Verification Letter from the Oral Examination Committee
Acknowledgements
Chinese Abstract
Abstract
Contents
List of Figures
List of Tables

Part I: Deep 360 Video Pilot
  Foreword
  Chapter 1 Introduction
    1.1 Motivation
    1.2 Problem Description
    1.3 Main Contribution
      1.3.1 Deep 360 Pilot
      1.3.2 360 Assistance Techniques Survey
    1.4 Thesis Organization
  Chapter 2 Related Work
    2.1 Previous Techniques on 360 Videos
    2.2 Auto Pilot
    2.3 Visual Guidance
    2.4 Video Summarization
      2.4.1 Important Frame Sampling
      2.4.2 Ego-centric Video Summarization
    2.5 Saliency Detection
      2.5.1 Visual Search
      2.5.2 Ranking Foreground Objects of Interest
    2.6 Virtual Cinematography
  Chapter 3 Deep 360 Pilot Neural Network
    3.1 Definitions
    3.2 Observing in Object Level
    3.3 Focusing on the Main Object
    3.4 Aggregating Object Information
    3.5 Learning Smooth Transition
    3.6 Our Final Network
    3.7 Training
  Chapter 4 Sports-360 Dataset
  Chapter 5 Experiments
    5.1 Evaluation Metrics
    5.2 Implementation Details
    5.3 Methods to be Compared
    5.4 Benchmark Experiments
    5.5 User Study
    5.6 Typical Examples
      5.6.1 Deep 360 Pilot on Sports-360 Dataset
      5.6.2 Deep 360 Pilot on AUTOCAM Dataset
      5.6.3 Human Evaluation Videos
  Chapter 6 360 Assistance Techniques Survey
    6.1 Two Focus Assistance Techniques
      6.1.1 Auto Pilot (AP)
      6.1.2 Visual Guidance (VG)
    6.2 User Study
      6.2.1 Experiment Design
      6.2.2 Participants
    6.3 Quantitative Analysis and Results
    6.4 Discussion
      6.4.1 The Video Content Matters
      6.4.2 The Goal of Watching the Video Matters
      6.4.3 Other Considerations for Choosing Focus Assistance
      6.4.4 Design Implications
      6.4.5 Study Limitations
  Chapter 7 Summary

Part II: Monocular 3D Object Tracking
  Foreword
  Chapter 8 Introduction
    8.1 Motivation
    8.2 Problem Description
    8.3 Main Contribution
      8.3.1 Quasi-Dense 3D Object Tracking Pipeline
      8.3.2 Depth & Motion-based Similarity and VeloLSTM
      8.3.3 Detailed Experiments on Large-scale Datasets
  Chapter 9 Related Works
    9.1 2D Object Detection
    9.2 Image-based 3D Object Detection
    9.3 2D Object Tracking
    9.4 3D Object Tracking
    9.5 Joint Detection and Tracking
    9.6 Autonomous Driving Datasets
  Chapter 10 Joint 3D Detection and Tracking
    10.1 Problem Formulation
    10.2 Candidate Box Detection
      10.2.1 Projection of 3D Bounding Box Center
    10.3 Quasi-dense Similarity Learning
    10.4 3D Bounding Box Estimation
      10.4.1 3D World Location
      10.4.2 Initial Projection of 3D Bounding Box Center
      10.4.3 Single-frame 3D Confidence
      10.4.4 Object Orientation
      10.4.5 Object Dimension
    10.5 Data Association and Tracking
      10.5.1 Data Association Scheme
      10.5.2 3D Location-aware Data Association
      10.5.3 Motion-aware Data Association
    10.6 Motion Model Refinement
      10.6.1 VeloLSTM: Deep Motion Estimation and Update
  Chapter 11 3D Vehicle Tracking Simulation Dataset
    11.1 Dataset Statistics
  Chapter 12 Experiments
    12.1 Dataset Details
      12.1.1 GTA 3D Vehicle Tracking Dataset
      12.1.2 KITTI MOT Benchmark
      12.1.3 nuScenes Dataset
      12.1.4 Waymo Open Dataset
      12.1.5 Cross Camera Aggregation
    12.2 Training and Evaluation
      12.2.1 Network Specification
      12.2.2 Training Procedure
      12.2.3 3D Detection Evaluation
      12.2.4 Multiple Object Tracking Evaluation
      12.2.5 Discussion of Evaluation Metrics
    12.3 Ablation Experiments
      12.3.1 Importance of Each Sub-affinity Matrix
      12.3.2 Importance of Sub-affinity Module Design
      12.3.3 Comparison of Proposed Tracking Module versus Our Initial Work
      12.3.4 Importance of 3D Center Projection Estimation
    12.4 Motion Modeling Comparison
      12.4.1 Pure Detection (Detection)
      12.4.2 Dummy Motion Model (Momentum)
      12.4.3 Kalman Filter 3D Baseline (KF3D)
      12.4.4 Deep Motion Estimation and Update (VeloLSTM)
    12.5 Real-world Evaluation
      12.5.1 nuScenes Tracking Challenge
      12.5.2 Waymo Open Benchmark
    12.6 Evaluation Metrics
    12.7 Amount of Data Matters
    12.8 Comparison of Matching Algorithms
  Chapter 13 Summary

Chapter 14 Conclusions and Future Directions
  14.1 Summary and Conclusions
  14.2 Suggestions and Futuristic Remarks
    14.2.1 360-degree Augmented Reality System
    14.2.2 Multi-sensory Fusion

References

Appendix A - Deep 360 Pilot
  A.1 Qualitative Analysis and Results
    A.1.1 Auto Pilot and Visual Guidance for the SPORT Video
    A.1.2 Auto Pilot and Visual Guidance for the TOUR Video

Appendix B - Monocular Quasi-Dense 3D Object Tracking
  B.1 Network Design of VeloLSTM
  B.2 Dataset Statistics
  B.3 Training Details
  B.4 Experiments

