
Author: Wang, Ning-Hsu (王寧緒)
Thesis Title: 360SD-Net: 360° Stereo Depth Estimation with Learnable Cost Volume (360SD-Net: 360度雙目深度估測與可學習立體成本容積)
Advisor: Sun, Min (孫民)
Committee Members: Chen, Hwann-Tzong (陳煥宗); Wang, Yu-Chiang (王鈺強)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2020
Graduation Academic Year: 108 (2019-2020)
Language: English
Number of Pages: 38
Keywords (Chinese): 360度影像 (360° image), 雙目深度估測 (stereo depth estimation), 全景影像 (panoramic image), 卷積神經網路 (convolutional neural network), 深度學習 (deep learning), 電腦視覺 (computer vision)
Keywords (English): 360° Image, Stereo Depth Estimation, Equirectangular Image, Neural Network, Deep Learning, Computer Vision
Abstract (Chinese, translated): Recently, end-to-end trainable deep learning models have achieved excellent performance on stereo matching and depth estimation for pairs of small-FoV perspective images, far surpassing traditional matching algorithms. For 360° images, however, directly applying existing deep learning models cannot reach comparable results because of the distortion introduced by equirectangular projection: as the projection warps the image, straight lines in 3D space are not necessarily projected onto straight lines in the 2D plane. To overcome this difficulty, we adopt a top-bottom matching configuration and propose a new deep learning model that targets this distortion. Our architecture addresses the distortion mainly through two designs: (1) the spherical coordinates are taken as an additional model input, passed through a geometric feature extractor, and merged with the image features; and (2) a newly proposed learnable cost volume. Because no 360° stereo dataset was available, we derive two 360° stereo datasets from the existing monocular datasets Matterport3D and Stanford2D3D for training and evaluation. We conduct extensive experiments on these two new datasets to validate the performance of our model and compare it against existing methods. Finally, we test with two consumer-level 360° cameras to verify the robustness and compatibility of our model, and show promising results.
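To make the spherical-coordinate input concrete, here is a minimal sketch, assuming NumPy and an equirectangular image stored as an (H, W, 3) array; it only builds a normalized polar-angle map and appends it as a fourth channel. In the model described above, the polar angle is additionally passed through a geometric feature extractor before being merged with the image features, so the function below (add_polar_angle_channel, a hypothetical name) illustrates the extra input rather than the thesis's implementation.

```python
import numpy as np

def add_polar_angle_channel(equi_rgb):
    """Append a polar-angle channel to an equirectangular RGB image.

    equi_rgb: float array of shape (H, W, 3) in equirectangular projection.
    Returns an (H, W, 4) array whose last channel encodes the polar angle
    of each pixel row, normalized to [-1, 1].
    """
    h, w, _ = equi_rgb.shape
    # Row v corresponds to a polar angle in (0, pi): ~0 at the zenith (top row),
    # ~pi at the nadir (bottom row).
    theta = (np.arange(h) + 0.5) / h * np.pi            # shape (H,)
    theta_norm = (theta - np.pi / 2) / (np.pi / 2)      # normalize to [-1, 1]
    polar = np.repeat(theta_norm[:, None], w, axis=1)   # broadcast to (H, W)
    return np.concatenate(
        [equi_rgb, polar[..., None].astype(equi_rgb.dtype)], axis=-1)

# Example with a dummy 512x1024 equirectangular image.
img = np.zeros((512, 1024, 3), dtype=np.float32)
print(add_polar_angle_channel(img).shape)  # (512, 1024, 4)
```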


Abstract (English): Recently, end-to-end trainable deep neural networks have significantly improved stereo depth estimation for perspective images. However, 360° images captured under equirectangular projection cannot benefit from directly adopting existing methods due to the distortion introduced by the projection (i.e., lines in 3D are not projected onto lines in 2D). To tackle this issue, we present a novel architecture specifically designed for spherical disparity using the setting of top-bottom 360° camera pairs. Moreover, we propose to mitigate the distortion issue with (1) an additional input branch capturing the position and relation of each pixel in spherical coordinates, and (2) a cost volume built upon a learnable shifting filter. Due to the lack of 360° stereo data, we collect two 360° stereo datasets from Matterport3D and Stanford2D3D for training and evaluation. Extensive experiments and an ablation study are provided to validate our method against existing algorithms. Finally, we show promising results in real-world environments, capturing images with two consumer-level cameras.
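As a hedged illustration of the second point, the sketch below builds a concatenation-based cost volume for a top-bottom pair in which the vertical shift is performed by a small trainable depthwise filter initialized as a hard one-row shift. The class name, filter size, initialization, and disparity sampling are assumptions made for illustration; the learnable cost volume in the thesis may be designed differently.

```python
import torch
import torch.nn as nn

class LearnableShiftCostVolume(nn.Module):
    """Sketch of a cost volume whose vertical shift is a learnable filter.

    For each candidate (angular) disparity level, the bottom-view feature map
    is shifted upward by a trainable 3x1 depthwise filter, initialized as an
    exact one-row shift, and concatenated with the top-view features.
    """

    def __init__(self, channels, max_disp):
        super().__init__()
        self.max_disp = max_disp
        self.shift = nn.Conv2d(channels, channels, kernel_size=(3, 1),
                               padding=(1, 0), groups=channels, bias=False)
        with torch.no_grad():
            self.shift.weight.zero_()
            self.shift.weight[:, :, 2, 0] = 1.0  # take the row below -> shift up by one

    def forward(self, feat_top, feat_bottom):
        # feat_top, feat_bottom: (N, C, H, W) features of the top/bottom views.
        volume = []
        shifted = feat_bottom
        for _ in range(self.max_disp):
            volume.append(torch.cat([feat_top, shifted], dim=1))
            shifted = self.shift(shifted)  # re-apply the learnable shift
        # (N, 2C, D, H, W) volume over D disparity hypotheses.
        return torch.stack(volume, dim=2)

# Usage on dummy feature maps.
cv = LearnableShiftCostVolume(channels=32, max_disp=8)
out = cv(torch.randn(1, 32, 64, 128), torch.randn(1, 32, 64, 128))
print(out.shape)  # torch.Size([1, 64, 8, 64, 128])
```

Because the shifting filter is trainable, the per-row sampling can deviate from a fixed integer shift during training, which is the intuition suggested by building the cost volume upon a learnable shifting filter.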

Table of Contents:
Declaration ii
Acknowledgements (Chinese) iii
Acknowledgements iv
Abstract (Chinese) v
Abstract vi
1 Introduction 1
2 Related Work 5
2.1 Classical Methods 5
2.2 Deep Learning-based Depth Estimation 6
2.2.1 Supervised Monocular Methods 6
2.2.2 Unsupervised Methods 7
2.2.3 Supervised Stereo Methods 7
2.3 Vision Techniques for 360° Camera 8
3 Our Approach 10
3.1 Camera Setting and Spherical Disparity 10
3.2 Algorithm Overview 11
3.3 Incorporation with Polar Angle 12
3.4 ASPP Module 12
3.5 Learnable Cost Volume 13
3.6 3D Encoder-Decoder and Regression Loss 13
4 Dataset 15
4.1 Dataset 15
4.2 Real World System Configuration 17
5 Experiments 18
5.1 Metrics 18
5.2 Experimental Setting 19
5.3 Overall Performance 20
5.4 Ablation Study 20
5.5 Qualitative Results 30
5.6 Qualitative Results for Real-World Images 30
5.7 Failure Case 32
6 Conclusion 33
References 34

