| 研究生 (Graduate Student) | 陳坤傑 Chen, Kun-Jie |
|---|---|
| 論文名稱 (Thesis Title) | 使用時空注意力的轉換器網路進行事件相機除雨 (Event Camera Deraining Using Transformer Networks with Time-Space Attention) |
| 指導教授 (Advisor) | 張世杰 Chang, Shih-Chieh |
| 口試委員 (Committee Members) | 賴尚宏 Lai, Shang-Hong; 陳添福 Chen, Tien-Fu |
| 學位類別 (Degree) | 碩士 Master |
| 系所名稱 (Department) | 電機資訊學院資訊工程學系 (College of Electrical Engineering and Computer Science, Department of Computer Science) |
| 論文出版年 (Year of Publication) | 2024 |
| 畢業學年度 (Academic Year) | 112 (ROC academic year, 2023–2024) |
| 語文別 (Language) | 英文 (English) |
| 論文頁數 (Number of Pages) | 31 |
| 中文關鍵詞 (Keywords, Chinese) | 事件相機、除雨、轉換器、深度學習 |
| 外文關鍵詞 (Keywords, English) | event camera, deraining, transformer, deep learning |
摘要 (Abstract, translated from Chinese):

Event cameras are bio-inspired visual sensors with high dynamic range, high temporal resolution, and low power consumption, which makes them perform exceptionally well in high-speed scenes and under rapidly changing lighting. These advantages have motivated extensive research into applying event cameras to a wide range of vision tasks. However, when an event camera is mounted on a vehicle such as a drone, it must operate under adverse weather conditions and faces several challenges, rain in particular. Raindrops falling at high speed trigger events that form rain streaks, which interfere with object edges and degrade the performance of downstream tasks.

This thesis studies the problem of removing rain from event camera data. The previous work WTSD attempts deraining by exploiting differences between event trajectories in xyt space. However, this approach performs poorly when the event camera undergoes ego motion, because the scene-trajectory events induced by ego motion make rain streaks harder to distinguish from object edges.

To address this problem, we propose a novel transformer-based deraining model. We adapt the Vision Transformer (ViT) to a spatiotemporal tokenization suited to event frames, allowing spatiotemporal relationships to be modeled more flexibly. We introduce several self-attention designs and validate them empirically on a paired event deraining dataset. Our best design, the "xt-y attention" architecture, first applies local attention to tokens in neighboring xt planes and then to tokens along the y direction. This approach significantly reduces voxel and reconstruction errors on both synthetic and real rain datasets.

The results show that the proposed transformer-based method outperforms previous approaches, providing a robust solution for deraining event camera data and improving the performance of downstream tasks in rainy conditions.
Abstract (English):

Event cameras are bio-inspired visual sensors that offer high dynamic range, high temporal resolution, and low power consumption, making them well suited to high-speed scenes and variable lighting conditions. These advantages have spurred extensive research into their applications in various visual tasks. However, when mounted on vehicles such as drones, event cameras face challenges under adverse weather conditions, particularly rain. Raindrops falling at high speed trigger events that form rain streaks, which interfere with object edges and degrade the performance of downstream tasks.
This thesis addresses the problem of rain removal in event camera data. Previous work, WTSD, attempted rain removal by leveraging differences between event trajectories in xyt space. However, this approach falls short when the event camera experiences ego motion, as the resulting scene-trajectory events complicate the differentiation between rain streaks and object edges.
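Both WTSD and the proposed model reason about events laid out in xyt space. For reference, the sketch below shows one common way to bin an asynchronous event stream (x, y, t, polarity) into an xyt voxel grid; the function name, bin count, and timestamp normalization are illustrative assumptions, not the thesis's exact preprocessing.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Illustrative sketch: events is an (N, 4) array with columns [x, y, t, p], p in {-1, +1}."""
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    t = events[:, 2]
    # Map the timestamps of this window onto bin indices 0 .. num_bins - 1.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    bins = t_norm.astype(np.int64)
    xs = events[:, 0].astype(np.int64)
    ys = events[:, 1].astype(np.int64)
    ps = events[:, 3].astype(np.float32)
    # Accumulate signed polarity into each (t-bin, y, x) cell; rain streaks and
    # object edges then appear as trajectories through this xyt volume.
    np.add.at(grid, (bins, ys, xs), ps)
    return grid

# Example (hypothetical window and DAVIS346-like 346x260 resolution):
# voxels = events_to_voxel_grid(event_window, num_bins=5, height=260, width=346)
```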
To tackle this, we propose a novel transformer-based model for event deraining. By adapting the Vision Transformer (ViT) to a spatiotemporal tokenization of event frames, we enable more flexible modeling of spatiotemporal relationships. We introduce several self-attention designs and empirically validate their performance on paired event deraining datasets. Our best-performing design, the "xt-y attention" architecture, applies local attention first to tokens in neighboring xt planes and then to neighboring tokens along the y direction. This approach significantly reduces voxel and reconstruction errors on both synthetic and real rain datasets.
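To make the factorized attention concrete, the sketch below shows one way the "xt-y attention" idea could be realized in PyTorch: attention is applied first within each xt plane (fixed y) and then along the y axis. This is an illustrative interpretation under assumed token shapes; the class name is hypothetical, full attention is used within each plane for brevity, and the thesis's actual local-window implementation may differ.

```python
import torch
import torch.nn as nn

class XTYAttention(nn.Module):
    """Illustrative factorized attention: first within each xt plane, then along y."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        # dim must be divisible by num_heads.
        self.attn_xt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_y = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens):
        # tokens: (B, T, H, W, C) embeddings of an event voxel grid.
        B, T, H, W, C = tokens.shape
        # Stage 1: for each row y, attend over all (t, x) positions (the xt plane).
        xt = tokens.permute(0, 2, 1, 3, 4).reshape(B * H, T * W, C)
        q = self.norm1(xt)
        xt = xt + self.attn_xt(q, q, q, need_weights=False)[0]
        tokens = xt.reshape(B, H, T, W, C).permute(0, 2, 1, 3, 4)
        # Stage 2: for each (t, x) column, attend over all y positions.
        y = tokens.permute(0, 1, 3, 2, 4).reshape(B * T * W, H, C)
        q = self.norm2(y)
        y = y + self.attn_y(q, q, q, need_weights=False)[0]
        return y.reshape(B, T, W, H, C).permute(0, 1, 3, 2, 4)

# Example: 2 windows, 8 temporal bins, 32x32 spatial tokens, 64-dim embeddings.
# out = XTYAttention(dim=64)(torch.randn(2, 8, 32, 32, 64))  # same shape as input
```

Because the block preserves the (B, T, H, W, C) token layout, it can be stacked like a standard transformer encoder layer.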
Our results demonstrate that the proposed transformer-based method outperforms WTSD and the alternative attention designs we evaluated, providing a robust solution for rain removal in event camera data and enhancing the performance of downstream tasks under rainy conditions.
參考文獻 (References):

P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor,” IEEE Journal of Solid-State Circuits, vol. 43, no. 2, pp. 566–576, 2008.
C. Posch, D. Matolin, and R. Wohlgenannt, “A QVGA 143 dB dynamic range frame-free PWM image sensor with lossless pixel-level video compression and time-domain CDS,” IEEE Journal of Solid-State Circuits, vol. 46, no. 1, pp. 259–275, 2011.
H. Rebecq, R. Ranftl, V. Koltun, and D. Scaramuzza, “High speed and high dynamic range video with an event camera,” IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI), 2019.
H. Rebecq, R. Ranftl, V. Koltun, and D. Scaramuzza, “Events-to-video: Bringing modern computer vision to event cameras,” IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2019.
B. Su, L. Yu, and W. Yang, “Event-based high frame-rate video reconstruction with a novel cycle-event network,” in 2020 IEEE International Conference on Image Processing (ICIP), pp. 86–90, 2020.
C. Scheerlinck, H. Rebecq, D. Gehrig, N. Barnes, R. E. Mahony, and D. Scaramuzza, “Fast image reconstruction with an event camera,” in 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 156–163, 2020.
L. Pan, R. Hartley, C. Scheerlinck, M. Liu, X. Yu, and Y. Dai, “High frame rate video reconstruction based on an event camera,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
D. Gehrig, J. Hidalgo-Carrió, and D. Scaramuzza, “Learning monocular dense depth from events,” IEEE International Conference on 3D Vision (3DV), 2020.
M. Gehrig, J. Hidalgo-Carrió, and D. Scaramuzza, “Combining events and frames using recurrent asynchronous multimodal networks for monocular depth prediction,” IEEE Robotics and Automation Letters (RA-L), 2021.
S. Tulyakov, F. Fleuret, M. Kiefel, P. Gehler, and M. Hirsch, “Learning an event sequence embedding for dense event-based deep stereo,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1527–1537, 2019.
S. M. N. Uddin, S. H. Ahmed, and Y. J. Jung, “Unsupervised deep event stereo for depth estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 11, pp. 7489–7504, 2022.
S. M. Mostafavi I, K.-J. Yoon, and J. Choi, “Event-intensity stereo: Estimating depth by the best of both worlds,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4238–4247, 2021.
A. Zhu, L. Yuan, K. Chaney, and K. Daniilidis, “EV-FlowNet: Self-supervised optical flow estimation for event-based cameras,” in Proceedings of Robotics: Science and Systems, (Pittsburgh, Pennsylvania), June 2018.
A. Zihao Zhu, L. Yuan, K. Chaney, and K. Daniilidis, “Unsupervised event-based optical flow using motion compensation,” in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, September 2018.
C. Ye, A. Mitrokhin, C. Fermüller, J. A. Yorke, and Y. Aloimonos, “Unsupervised learning of dense optical flow, depth and egomotion with event-based sensors,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5831–5838, 2020.
M. Gehrig, M. Millhäusler, D. Gehrig, and D. Scaramuzza, “E-RAFT: Dense optical flow from event cameras,” in International Conference on 3D Vision (3DV), 2021.
Z. Wan, Y. Dai, and Y. Mao, “Learning dense and continuous optical flow from an event camera,” IEEE Transactions on Image Processing, vol. 31, pp. 7237–7251, 2022.
L. Cheng, N. Liu, X. Guo, Y. Shen, Z. Meng, K. Huang, and X. Zhang, “A novel rain removal approach for outdoor dynamic vision sensor event videos,” Frontiers in Neurorobotics, vol. 16, 2022.
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.
Y. Zhang, J. Wang, W. Weng, X. Sun, and Z. Xiong, “EGVD: Event-guided video deraining,” 2023.
J. Wang, W. Weng, Y. Zhang, and Z. Xiong, “Unsupervised video deraining with an event camera,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10831–10840, October 2023.
T. Stoffregen, C. Scheerlinck, D. Scaramuzza, T. Drummond, N. Barnes, L. Kleeman, and R. Mahony, “Reducing the sim-to-real gap for event cameras,” ECCV, Aug 2020.
G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all you need for video understanding?,” in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event (M. Meila and T. Zhang, eds.), vol. 139 of Proceedings of Machine Learning Research, pp. 813–824, PMLR, 2021.
D. Gehrig, M. Gehrig, J. Hidalgo-Carrió, and D. Scaramuzza, “Video to events: Recycling video datasets for event cameras,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), June 2020.
J. Chen, C.-H. Tan, J. Hou, L.-P. Chau, and H. Li, “Robust video content alignment and compensation for rain removal in a CNN framework,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6286–6295, 2018.
H. Rebecq, D. Gehrig, and D. Scaramuzza, “ESIM: an open event camera simulator,” Conf. on Robotics Learning (CoRL), Oct. 2018.