
Author: Wang, Fu-En (王福恩)
Title: Self-Supervised Learning of Depth and Camera Motion from 360 Videos (基於自我監督式學習360影片之深度與相機位移)
Advisor: Sun, Min (孫民)
Committee Members: Wang, Yu-Chiang (王鈺強); Chen, Hwann-Tzong (陳煥宗)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2019
Graduation Academic Year: 107 (ROC calendar)
Language: English
Number of Pages: 37
Keywords: panorama, depth, deep-learning, ai, cv, point-cloud
    As 360 cameras become increasingly popular in autonomous systems (e.g., self-driving cars and unmanned aerial vehicles), efficient 360 perception becomes ever more important. In this thesis, we propose a novel self-supervised method that predicts panoramic depth and camera motion from 360 videos. Starting from SfMLearner, a method designed for cameras with a normal field of view, we introduce three key techniques to process 360 images efficiently. First, we convert each frame from equirectangular projection to cubemap projection to avoid the distortion of 360 images; before every convolution and deconvolution layer, we apply Cube Padding, which removes the artificial boundary of each cube face by padding it with features from its neighboring faces. Second, we propose a novel spherical photometric consistency constraint defined on the whole viewing sphere, so that, unlike previous methods, no training loss is lost to pixels projected outside the image boundary. Finally, instead of independently estimating six camera motions (i.e., directly applying SfMLearner to each cube face), we propose a novel camera pose consistency loss so that the motion estimates of the faces constrain one another. To train and evaluate the proposed method, we collect a new dataset, PanoSUNCG, which contains the largest amount of 360 videos to date together with ground-truth depth and camera motion. On PanoSUNCG, our method achieves state-of-the-art accuracy in depth and camera motion estimation with faster inference speed. On real-world videos, our method still predicts reasonable depth and camera motion.


    As 360 cameras become prevalent in many autonomous systems (e.g., self-driving cars and drones), efficient 360 perception becomes more and more important.
    We propose a novel self-supervised learning approach for predicting omnidirectional depth and camera motion from 360 videos.
    In particular, starting from SfMLearner, which is designed for cameras with a normal field of view, we introduce three key features to process 360 images efficiently.
    First, we convert each image from equirectangular projection to cubemap projection to avoid image distortion. Before every convolution layer, we apply Cube Padding (CP), which pads each face's intermediate features with those of its adjacent faces, so that no face sees an artificial image boundary.
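    The cube padding step can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch-style example (the helper name cube_pad_face and the handling of face orientations are assumptions for illustration, not the thesis implementation): each cube face is padded with feature strips taken from its four neighbouring faces instead of zeros.

        import torch
        import torch.nn.functional as F

        def cube_pad_face(face, up, down, left, right, pad=1):
            # Pad one cube face with feature strips from its four neighbouring
            # faces instead of zeros. All tensors have shape (B, C, H, W).
            # NOTE: a real implementation must also rotate/flip the neighbouring
            # strips so they align with this face's orientation; that bookkeeping
            # is omitted here for brevity.
            top = up[:, :, -pad:, :]        # bottom rows of the face above
            bottom = down[:, :, :pad, :]    # top rows of the face below
            x = torch.cat([top, face, bottom], dim=2)
            # Left/right strips are zero-extended at the corners to match height.
            l = F.pad(left[:, :, :, -pad:], (0, 0, pad, pad))
            r = F.pad(right[:, :, :, :pad], (0, 0, pad, pad))
            return torch.cat([l, x, r], dim=3)  # (B, C, H + 2*pad, W + 2*pad)

        faces = [torch.randn(1, 8, 64, 64) for _ in range(5)]
        print(cube_pad_face(*faces).shape)      # torch.Size([1, 8, 66, 66])

    The padded tensor can then be fed to an ordinary convolution with zero built-in padding, so the output spatial size matches the input while the receptive field extends across face boundaries.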
    Second, we propose a novel "spherical" photometric consistency constraint defined on the whole viewing sphere. In this way, no pixel is projected outside the image boundary, which commonly happens with normal field-of-view images.
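    The geometry behind the spherical constraint can be sketched as follows. This is a minimal NumPy example under assumed projection conventions (the function names and the longitude/latitude layout are illustrative, not taken from the thesis): every equirectangular pixel corresponds to a unit ray on the viewing sphere, and any warped 3D point projects back to a valid pixel, so the photometric loss never has to discard out-of-boundary pixels.

        import numpy as np

        def equi_to_rays(h, w):
            # Unit viewing rays for every pixel of an h x w equirectangular image
            # (assumed convention: longitude in [-pi, pi], latitude in [-pi/2, pi/2]).
            lon = (np.arange(w) + 0.5) / w * 2 * np.pi - np.pi
            lat = np.pi / 2 - (np.arange(h) + 0.5) / h * np.pi
            lon, lat = np.meshgrid(lon, lat)
            return np.stack([np.cos(lat) * np.sin(lon),   # x (right)
                             np.sin(lat),                 # y (up)
                             np.cos(lat) * np.cos(lon)],  # z (forward)
                            axis=-1)                      # (h, w, 3)

        def sphere_to_equi(points, h, w):
            # Project 3D points back to equirectangular pixel coordinates.
            # Every finite point maps inside the image, unlike a pinhole camera.
            x, y, z = points[..., 0], points[..., 1], points[..., 2]
            lon = np.arctan2(x, z)
            lat = np.arcsin(np.clip(y / np.linalg.norm(points, axis=-1), -1.0, 1.0))
            u = (lon + np.pi) / (2 * np.pi) * w - 0.5
            v = (np.pi / 2 - lat) / np.pi * h - 0.5
            return u, v

        # Toy warp: scale rays by a constant depth, apply a small translation,
        # and re-project; sampling the source view at (u, v) gives the
        # photometric reconstruction compared against the target view.
        h, w = 32, 64
        points = equi_to_rays(h, w) * 2.0 + np.array([0.1, 0.0, 0.05])
        u, v = sphere_to_equi(points, h, w)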
    Finally, rather than naively estimating six independent camera motions (i.e., applying SfMLearner separately to each face of the cube), we propose a novel camera pose consistency loss to ensure that the estimated camera motions reach consensus.
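    A minimal sketch of the pose consistency idea, assuming hypothetical names (pose_consistency_loss, face_to_rig) and a simple squared-difference penalty; the actual loss in the thesis may be formulated differently: each face's predicted motion is mapped into a shared rig frame, and the six estimates are penalised for deviating from their mean.

        import torch

        def pose_consistency_loss(rotations, translations, face_to_rig):
            # rotations:    (6, 3, 3) predicted rotation for each cube face
            # translations: (6, 3)    predicted translation for each cube face
            # face_to_rig:  (6, 3, 3) fixed rotation from each face frame to the rig frame
            R_rig = face_to_rig @ rotations @ face_to_rig.transpose(1, 2)
            t_rig = (face_to_rig @ translations.unsqueeze(-1)).squeeze(-1)
            # Penalise disagreement with the mean estimate (a simple surrogate;
            # a proper formulation could use geodesic distance between rotations).
            return (((R_rig - R_rig.mean(0, keepdim=True)) ** 2).sum()
                    + ((t_rig - t_rig.mean(0, keepdim=True)) ** 2).sum())

        # Example: identical motions across all faces give zero loss.
        R = torch.eye(3).repeat(6, 1, 1)
        t = torch.zeros(6, 3)
        print(pose_consistency_loss(R, t, torch.eye(3).repeat(6, 1, 1)))  # tensor(0.)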
    To train and evaluate our approach, we collect a new dataset, PanoSUNCG, containing a large number of 360 videos with ground-truth depth and camera motion. Our approach achieves state-of-the-art depth prediction and camera motion estimation with faster inference than processing in the equirectangular domain. On real-world indoor videos, our approach also produces qualitatively reasonable depth predictions.

    Table of Contents
      Introduction (p. 1)
      Related Work (p. 4)
      Our Approach (p. 7)
      Dataset (p. 15)
      Experiments (p. 18)
      Conclusion (p. 32)
      References (p. 33)

