
Graduate Student: 楊善雅 (Yang, Shan-Ya)
Thesis Title: 有效多任務學習的關鍵:用於任務協同處理及節點利用的分離查詢
A Key to Effective Multi-task Learning: Separate Query Selection for Task-Synergized Handling and Node Utilization
Advisor: 李濬屹 (Lee, Chun-Yi)
Oral Defense Committee: 廖弘源 (Liao, Hong-Yuan), 王建堯 (Wang, Chien-Yao)
Degree: Master
Department: Computer Science (電機資訊學院 資訊工程學系)
Publication Year: 2024
Graduation Academic Year: 112
Language: English
Number of Pages: 35
Chinese Keywords: 多任務、機器學習、圖神經網路、節點利用、物件偵測、語義分割
English Keywords: Multi-task, Machine Learning, Graph Neural Network, Node Utilization, Object Detection, Semantic Segmentation
Abstract (Chinese):
在電腦視覺領域,同時有效處理多個任務是一項需要創新解決方法的挑戰。為了更好地解決多任務視覺問題,我們提出SeTano,一個基於圖神經網絡(GNN)的整合型多任務框架。這個框架包括一個動態邊感知GNN(DES-GNN)骨幹,能夠動態調整邊以提取更關鍵的特徵,以及一個包括節點減少策略和分離查詢選擇策略的下游設計,以增強多任務學習。骨幹和下游架構將可以互相配合。DES-GNN使用注意力感知邊選擇器(AES)和邊感知節點選擇器(ENS)模組,動態捕捉局部和長距離特徵,使DES-GNN能夠自適應地提供從圖像中選擇更具訊息性的特徵以供下游任務使用。在下游架構中,實施了節點減少策略以篩選來自DES-GNN的提取特徵,旨在最小化冗餘信息干擾。此外,分離查詢選擇策略與骨幹協同工作,選擇出適合不同類型任務的查詢。為了驗證我們的方法,我們在ImageNet和MS COCO數據集上進行了多任務實驗。結果表明,SeTano的整合設計受益於骨幹與下游架構之間的相互配合,而且根據不同任務分離查詢選擇來改進下游架構的結果顯著,從而在包括物體檢測、實例分割和全景分割在內的各種多任務中提升性能。


Abstract (English):
In the realm of computer vision, effectively handling multiple tasks simultaneously presents a challenge that necessitates innovative solutions. To better address multi-task vision problems, we introduce SeTano, an integrated Graph Neural Network (GNN)-based multi-task framework. This framework comprises a Dynamic Edge-Sensing GNN (DES-GNN) backbone, which can dynamically adjust edges to extract more pivotal features, and a downstream design that includes a node reduction strategy and a separate query selection strategy to enhance multi-task learning. The backbone and the downstream architecture are tailored to operate synergistically. DES-GNN utilizes Attention-aware Edge Selector (AES) and Edge-aware Node Selector (ENS) modules that dynamically capture local and long-range features, enabling DES-GNN to adaptively provide more informative features, selected from the input image, for downstream tasks. In the downstream architecture, a node reduction strategy is implemented to filter the extracted features from DES-GNN; this strategy aims to minimize interference from redundant information. In addition, the separate query selection strategy is developed to work in concert with the backbone and select queries suited to different types of tasks. To validate our approach, we perform multi-task experiments on the ImageNet and MS COCO datasets. The results indicate that the integrated design of SeTano not only benefits from the synergy between the backbone and the downstream architecture but also demonstrates that the downstream architecture can be improved by separating query selection according to task, which leads to enhanced performance across multiple tasks, including object detection, instance segmentation, and panoptic segmentation.
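
The abstract describes the downstream design only at a high level. As a rough illustration of how a node reduction step followed by per-task query selection might be wired together, the minimal PyTorch sketch below scores the backbone's node features, keeps the top-ranked nodes, and then lets each task pick its own query subset with a separate head. Every name, shape, and hyperparameter here (NodeReduction, SeparateQuerySelection, keep_ratio, queries_per_task, the linear scoring heads) is an assumption made for illustration, not the thesis's actual implementation.

```python
# Hypothetical sketch of a node-reduction + separate-query-selection pipeline.
# All module names, shapes, and scoring heads are assumptions for illustration.
import torch
import torch.nn as nn


class NodeReduction(nn.Module):
    """Keep the top-k backbone nodes ranked by a learned importance score."""

    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # assumed scoring head
        self.keep_ratio = keep_ratio

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (batch, num_nodes, dim) features from the GNN backbone
        k = max(1, int(nodes.size(1) * self.keep_ratio))
        scores = self.score(nodes).squeeze(-1)               # (batch, num_nodes)
        idx = scores.topk(k, dim=1).indices                  # retained node indices
        idx = idx.unsqueeze(-1).expand(-1, -1, nodes.size(-1))
        return nodes.gather(1, idx)                          # (batch, k, dim)


class SeparateQuerySelection(nn.Module):
    """Select task-specific queries from the shared reduced node set, one head per task."""

    def __init__(self, dim: int, tasks=("detection", "instance", "panoptic"),
                 queries_per_task: int = 100):
        super().__init__()
        self.heads = nn.ModuleDict({t: nn.Linear(dim, 1) for t in tasks})
        self.queries_per_task = queries_per_task

    def forward(self, nodes: torch.Tensor) -> dict:
        # Each task ranks the shared nodes with its own head, so the queries
        # handed to each task decoder can differ.
        out = {}
        for task, head in self.heads.items():
            scores = head(nodes).squeeze(-1)                 # (batch, k)
            q = min(self.queries_per_task, nodes.size(1))
            idx = scores.topk(q, dim=1).indices
            idx = idx.unsqueeze(-1).expand(-1, -1, nodes.size(-1))
            out[task] = nodes.gather(1, idx)                 # (batch, q, dim)
        return out


if __name__ == "__main__":
    backbone_nodes = torch.randn(2, 196, 256)                # stand-in for DES-GNN output
    reduced = NodeReduction(dim=256)(backbone_nodes)
    queries = SeparateQuerySelection(dim=256)(reduced)
    print({t: v.shape for t, v in queries.items()})
```

In this sketch, the per-task heads are what makes query selection "separate": detection-style and segmentation-style decoders would receive different query subsets even though they share one reduced node set.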

Table of Contents:
Abstract (Chinese)
Acknowledgements (Chinese)
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
2 Related Work
2.1 Vision Backbones
2.2 Multi-Task Vision Frameworks
3 Methodology
3.1 Problem Definition
3.2 Overview of the Framework
3.3 DES-GNN Backbone
3.3.1 Attention-aware Edge Selector (AES)
3.3.2 Edge-aware Node Selector (ENS)
3.4 Node Reduction Strategy
3.5 Separate Query Selection
3.6 Loss Function Design
4 Experimental Results
4.1 Experimental Setups
4.2 Quantitative Results
4.2.1 Instance segmentation and object detection
4.2.2 Panoptic segmentation
4.2.3 Semantic segmentation
4.3 Qualitative Results
4.4 Ablation Studies
4.4.1 Attention region visualization of DES-GNN backbone
4.4.2 Node Reduction Strategy
4.4.3 Separate Query Selection Strategy
5 Conclusion
Bibliography

