
Author: Wang, Fu-En (王福恩)
Title: Self-Supervised Learning of Depth and Camera Motion from 360 Videos (基於自我監督式學習360影片之深度與相機位移)
Advisor: Sun, Min (孫民)
Committee Members: Wang, Yu-Chiang (王鈺強); Chen, Hwann-Tzong (陳煥宗)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2019
Graduation Academic Year: 107 (ROC calendar)
Language: English
Number of Pages: 37
Keywords: panorama, depth, deep-learning, ai, cv, point-cloud
    As 360 cameras become increasingly popular in autonomous systems (e.g., self-driving cars and unmanned aerial vehicles), efficient 360 perception becomes ever more important. In this thesis, we propose a novel self-supervised method that predicts panoramic depth and camera motion from 360 videos. Starting from SfMLearner, a method designed for cameras with a normal field of view, we introduce three key techniques to process 360 images efficiently. First, we convert each frame from equirectangular projection to cubemap projection to avoid the distortion of 360 images; before every convolution and deconvolution layer, we apply Cube Padding, which removes the artificial boundary of each cube face by padding it with features from its neighboring faces. Second, we propose a novel spherical photometric consistency constraint defined on the whole viewing sphere, so that, unlike previous methods, no training loss is lost to pixels projected outside the image boundary. Finally, instead of independently estimating six camera motions (i.e., directly applying SfMLearner to each cube face), we propose a novel camera pose consistency loss so that the motion estimates of the faces constrain one another. To train and evaluate the proposed method, we collect a new dataset, PanoSUNCG, which contains the largest amount of 360 videos to date together with ground-truth depth and camera motion. On PanoSUNCG, our method achieves state-of-the-art accuracy in depth and camera motion estimation with faster inference speed. On real-world videos, our method still predicts reasonable depth and camera motion.


    As 360 cameras become prevalent in many autonomous systems (e.g., self-driving cars and drones), efficient 360 perception becomes more and more important.
    We propose a novel self-supervised learning approach for predicting omnidirectional depth and camera motion from 360 videos.
    In particular, starting from SfMLearner, which is designed for cameras with a normal field of view, we introduce three key features to process 360 images efficiently.
    First, we convert each image from equirectangular projection to cubemap projection to avoid image distortion. Before every convolution layer, we apply Cube Padding (CP), which pads each face's intermediate features with those of its adjacent faces, so that no face sees an artificial image boundary.
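    The cube padding step can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch-style example (the helper name cube_pad_face and the handling of face orientations are assumptions for illustration, not the thesis implementation): each cube face is padded with feature strips taken from its four neighbouring faces instead of zeros.

        import torch
        import torch.nn.functional as F

        def cube_pad_face(face, up, down, left, right, pad=1):
            # Pad one cube face with feature strips from its four neighbouring
            # faces instead of zeros. All tensors have shape (B, C, H, W).
            # NOTE: a real implementation must also rotate/flip the neighbouring
            # strips so they align with this face's orientation; that bookkeeping
            # is omitted here for brevity.
            top = up[:, :, -pad:, :]        # bottom rows of the face above
            bottom = down[:, :, :pad, :]    # top rows of the face below
            x = torch.cat([top, face, bottom], dim=2)
            # Left/right strips are zero-extended at the corners to match height.
            l = F.pad(left[:, :, :, -pad:], (0, 0, pad, pad))
            r = F.pad(right[:, :, :, :pad], (0, 0, pad, pad))
            return torch.cat([l, x, r], dim=3)  # (B, C, H + 2*pad, W + 2*pad)

        faces = [torch.randn(1, 8, 64, 64) for _ in range(5)]
        print(cube_pad_face(*faces).shape)      # torch.Size([1, 8, 66, 66])

    The padded tensor can then be fed to an ordinary convolution with zero built-in padding, so the output spatial size matches the input while the receptive field extends across face boundaries.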
    Second, we propose a novel "spherical" photometric consistency constraint defined on the whole viewing sphere. In this way, no pixel is projected outside the image boundary, which commonly happens with normal field-of-view images.
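    The geometry behind the spherical constraint can be sketched as follows. This is a minimal NumPy example under assumed projection conventions (the function names and the longitude/latitude layout are illustrative, not taken from the thesis): every equirectangular pixel corresponds to a unit ray on the viewing sphere, and any warped 3D point projects back to a valid pixel, so the photometric loss never has to discard out-of-boundary pixels.

        import numpy as np

        def equi_to_rays(h, w):
            # Unit viewing rays for every pixel of an h x w equirectangular image
            # (assumed convention: longitude in [-pi, pi], latitude in [-pi/2, pi/2]).
            lon = (np.arange(w) + 0.5) / w * 2 * np.pi - np.pi
            lat = np.pi / 2 - (np.arange(h) + 0.5) / h * np.pi
            lon, lat = np.meshgrid(lon, lat)
            return np.stack([np.cos(lat) * np.sin(lon),   # x (right)
                             np.sin(lat),                 # y (up)
                             np.cos(lat) * np.cos(lon)],  # z (forward)
                            axis=-1)                      # (h, w, 3)

        def sphere_to_equi(points, h, w):
            # Project 3D points back to equirectangular pixel coordinates.
            # Every finite point maps inside the image, unlike a pinhole camera.
            x, y, z = points[..., 0], points[..., 1], points[..., 2]
            lon = np.arctan2(x, z)
            lat = np.arcsin(np.clip(y / np.linalg.norm(points, axis=-1), -1.0, 1.0))
            u = (lon + np.pi) / (2 * np.pi) * w - 0.5
            v = (np.pi / 2 - lat) / np.pi * h - 0.5
            return u, v

        # Toy warp: scale rays by a constant depth, apply a small translation,
        # and re-project; sampling the source view at (u, v) gives the
        # photometric reconstruction compared against the target view.
        h, w = 32, 64
        points = equi_to_rays(h, w) * 2.0 + np.array([0.1, 0.0, 0.05])
        u, v = sphere_to_equi(points, h, w)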
    Finally, rather than naively estimating six independent camera motions (i.e., applying SfMLearner separately to each face of the cube), we propose a novel camera pose consistency loss to ensure that the estimated camera motions reach consensus.
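    A minimal sketch of the pose consistency idea, assuming hypothetical names (pose_consistency_loss, face_to_rig) and a simple squared-difference penalty; the actual loss in the thesis may be formulated differently: each face's predicted motion is mapped into a shared rig frame, and the six estimates are penalised for deviating from their mean.

        import torch

        def pose_consistency_loss(rotations, translations, face_to_rig):
            # rotations:    (6, 3, 3) predicted rotation for each cube face
            # translations: (6, 3)    predicted translation for each cube face
            # face_to_rig:  (6, 3, 3) fixed rotation from each face frame to the rig frame
            R_rig = face_to_rig @ rotations @ face_to_rig.transpose(1, 2)
            t_rig = (face_to_rig @ translations.unsqueeze(-1)).squeeze(-1)
            # Penalise disagreement with the mean estimate (a simple surrogate;
            # a proper formulation could use geodesic distance between rotations).
            return (((R_rig - R_rig.mean(0, keepdim=True)) ** 2).sum()
                    + ((t_rig - t_rig.mean(0, keepdim=True)) ** 2).sum())

        # Example: identical motions across all faces give zero loss.
        R = torch.eye(3).repeat(6, 1, 1)
        t = torch.zeros(6, 3)
        print(pose_consistency_loss(R, t, torch.eye(3).repeat(6, 1, 1)))  # tensor(0.)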
    To train and evaluate our approach, we collect a new dataset, PanoSUNCG, containing a large number of 360 videos with ground-truth depth and camera motion. Our approach achieves state-of-the-art depth prediction and camera motion estimation with faster inference than processing in the equirectangular domain. On real-world indoor videos, our approach also produces qualitatively reasonable depth predictions.

    Table of Contents
      Introduction (p. 1)
      Related Work (p. 4)
      Our Approach (p. 7)
      Dataset (p. 15)
      Experiments (p. 18)
      Conclusion (p. 32)
      References (p. 33)

