
Graduate Student: 鄭仙資 (Cheng, Hsien-Tzu)
Thesis Title: 立方填補於360影片之非監督式學習 (Cube Padding for Unsupervised Saliency Prediction in 360 Videos)
Advisor: 孫民 (Sun, Min)
Committee Members: 邱維辰 (Chiu, Wei-Chen); 詹力韋 (Chan, Li-Wei)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2018
Graduation Academic Year: 106
Language: English
Number of Pages: 38
Chinese Keywords: 360 Panoramic Videos, Cube Padding, Unsupervised Learning, Saliency Prediction, Deep Neural Network
English Keywords: 360 Videos, Cube Padding, Unsupervised Learning, Saliency Prediction, Deep Neural Network
    Automatic saliency prediction for 360° videos is critical for viewpoint guidance applications (e.g., Facebook's 360 guided viewing). This thesis proposes a spatio-temporal neural network with two key properties: (1) it can be trained with unsupervised learning, and (2) it is tailor-made for the 360° sphere. Notably, most existing methods are hard to scale up because they require annotated saliency maps for training; worse, they convert the 360° sphere into 2D images, such as a single equirectangular image or multiple normal field-of-view (NFoV) images, which introduces distortion and image boundaries. To avoid these problems, we propose a simple and effective method, Cube Padding, summarized as follows. First, we render the 360° view onto the six faces of a cube using perspective projection, so each face image has almost no distortion. Then we stack the six faces and exploit the connectivity between adjacent faces of the cube to pad the borders in the convolution, pooling, and convolutional LSTM layers (i.e., Cube Padding). Applying Cube Padding in this way to nearly all convolutional neural network architectures eliminates the boundaries between images. To evaluate our method, we propose Wild-360, a new 360° video saliency dataset containing challenging videos with annotated saliency maps. In experiments, our method outperforms existing methods in both runtime and prediction quality.
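To make the first step concrete (rendering a 360° frame onto cube faces with perspective projection), below is a minimal NumPy sketch. The function name `equirect_to_face`, the coordinate conventions, and the nearest-neighbour sampling are illustrative assumptions, not the thesis's actual rendering code.

```python
import numpy as np

def equirect_to_face(equi, face_rot, size):
    # equi: (He, We, 3) equirectangular image; longitude spans [-pi, pi]
    # across the width, latitude spans [+pi/2, -pi/2] down the height.
    # face_rot: 3x3 rotation taking the canonical frame (+z forward,
    # +x right, +y up) to the desired cube face.
    He, We = equi.shape[:2]
    # Pixel centres on the image plane z = 1 (90-degree field of view).
    u = (np.arange(size) + 0.5) / size * 2 - 1       # rightward
    v = 1 - (np.arange(size) + 0.5) / size * 2       # upward (row 0 at the top)
    uu, vv = np.meshgrid(u, v)
    dirs = np.stack([uu, vv, np.ones_like(uu)], axis=-1) @ face_rot.T
    x, y, z = dirs[..., 0], dirs[..., 1], dirs[..., 2]
    lon = np.arctan2(x, z)                                # [-pi, pi]
    lat = np.arcsin(y / np.linalg.norm(dirs, axis=-1))    # [-pi/2, pi/2]
    col = ((lon / (2 * np.pi) + 0.5) * We).astype(int) % We
    row = np.clip(((0.5 - lat / np.pi) * He).astype(int), 0, He - 1)
    return equi[row, col]                                 # (size, size, 3) face image

# Front face of the cubemap (identity rotation); the other five faces would
# use the corresponding 90/180-degree rotation matrices.
equi = np.zeros((512, 1024, 3), dtype=np.uint8)
front = equirect_to_face(equi, np.eye(3), size=256)
```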


    Automatic saliency prediction in 360° videos is critical for viewpoint guidance applications (e.g., Facebook 360 Guide). We propose a spatial-temporal network that is (1) trained without supervision and (2) tailor-made for the 360° viewing sphere. Note that most existing methods are less scalable because they rely on annotated saliency maps for training. More importantly, they convert the 360° sphere to 2D images (e.g., a single equirectangular image or multiple separate Normal Field-of-View (NFoV) images), which introduces distortion and image boundaries. In contrast, we propose a simple and effective Cube Padding (CP) technique as follows. First, we render the 360° view on the six faces of a cube using perspective projection, which introduces very little distortion. Then, we concatenate all six faces while exploiting the connectivity between adjacent faces on the cube for image padding (i.e., Cube Padding) in convolution, pooling, and convolutional LSTM layers.
    In this way, CP introduces no image boundaries and is applicable to almost all Convolutional Neural Network (CNN) architectures. To evaluate our method, we propose Wild-360, a new 360° video saliency dataset containing challenging videos with saliency heatmap annotations. In experiments, our method outperforms all baseline methods in both speed and quality.
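To illustrate the Cube Padding idea itself, the following PyTorch sketch pads the four side faces of a cubemap with pixels from their ring neighbours before a 3×3 convolution. The function `cube_pad_ring`, the face ordering, and the replicate fallback on the vertical borders are simplifying assumptions for illustration; the full technique also copies pixels from the top and bottom faces (with the appropriate rotations) and applies the same padding inside pooling and convolutional LSTM layers.

```python
import torch
import torch.nn.functional as F

def cube_pad_ring(faces, pad):
    # faces: [4, C, H, W] side faces ordered around the horizontal ring,
    # e.g. front, right, back, left, all sharing the same "up" direction.
    left_nb = torch.roll(faces, shifts=1, dims=0)    # left_nb[i] = faces[i-1]
    right_nb = torch.roll(faces, shifts=-1, dims=0)  # right_nb[i] = faces[i+1]
    # Horizontal borders come from the neighbouring faces on the ring.
    padded = torch.cat([left_nb[..., -pad:],   # rightmost columns of the left neighbour
                        faces,
                        right_nb[..., :pad]],  # leftmost columns of the right neighbour
                       dim=-1)
    # Vertical borders: replicate padding as a stand-in for the top/bottom faces.
    padded = F.pad(padded, (0, 0, pad, pad), mode='replicate')
    return padded                              # [4, C, H + 2*pad, W + 2*pad]

# Usage: pad across the face seams so the convolution itself needs no zero padding.
faces = torch.randn(4, 64, 128, 128)           # 4 side faces, 64 channels each
conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=0)
out = conv(cube_pad_ring(faces, pad=1))        # output stays [4, 64, 128, 128]
```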

    Declaration
    Acknowledgements
    Abstract (Chinese)
    Abstract
    1 Introduction
      1.1 Motivation and Problem Description
      1.2 Main Contribution
      1.3 Related Work
      1.4 Thesis Structure
    2 Approach
      2.1 Notations
      2.2 Spherical and Cube Projection
      2.3 Cube Padding
      2.4 Static Model
      2.5 Temporal Model
        2.5.1 Convolutional LSTM
        2.5.2 Temporal Consistent Loss
    3 Dataset
      3.1 Collection
      3.2 Annotation
    4 Experiment and Results
      4.1 Implementation Details
      4.2 Baseline Methods
      4.3 Computational Efficiency
      4.4 Evaluation Metrics
      4.5 Saliency Result Comparison
      4.6 NFoV Piloting
      4.7 Human Evaluation
      4.8 Hyper-parameters Study
    5 Conclusion and Future Work
    References

