
Graduate Student: 楊晶宇 (Yang, Ching-Yu)
Thesis Title: 防禦語音偽造:基於原始波形的攻擊方法應對單樣本語音轉換
(Defending Against Voice Forgery: Raw Waveform Approach to Attacking One-Shot Voice Conversion)
Advisor: 李祈均 (Lee, Chi-Chun)
Committee Members: 吳尚鴻 (Wu, Shan-Hung), 廖元甫 (Liao, Yuan-Fu), 王新民 (Wang, Hsin-Min)
Degree: Master
Department: 電機資訊學院 - 電機工程學系 (Department of Electrical Engineering)
Year of Publication: 2024
Academic Year of Graduation: 113
Language: English
Number of Pages: 49
Chinese Keywords: 語音轉換、隱私保護、深度偽造
English Keywords: Voice Conversion, Privacy Protection, Deepfake
    In recent years, generative artificial intelligence has made breakthrough progress in images, text, and speech. Deepfake technology is a representative example: it uses AI and deep learning algorithms to produce extremely realistic video, audio, text, or images, fundamentally changing how media is created, with applications ranging from virtual avatars and chatbots to artistic creation. Among deepfake technologies, one-shot voice conversion has gradually matured; it needs only a single voice sample from a target speaker to clone that speaker's timbre and synthesize arbitrary forged utterances in that voice. This raises serious security and privacy concerns such as identity theft and voice fraud.

    To address these problems, we propose an adversarial attack method based on raw waveforms. It protects speech samples by generating small, imperceptible adversarial noise, effectively weakening a one-shot voice conversion model's ability to clone the speaker's timbre. Compared with conventional approaches, the novelty of this method lies in generating the perturbation directly at the raw waveform level, which avoids the feature mismatch that can arise in two-stage processing pipelines.

    In addition, we propose two strategies to strengthen the model's generalization across speakers. The first builds the speakers' features into a graph, analyzes each training speaker's centrality in that graph, and applies centrality-weighted sampling during training so that the protection generalizes to a wider range of speakers. The second is a target-speech selection strategy that analyzes the feature distances between speakers and chooses the most suitable target speech, further improving the effectiveness and generalization of the adversarial attack.

    In summary, we provide an effective adversarial defense that prevents one-shot voice conversion models from successfully cloning a voice while preserving the speaker's own vocal characteristics. This technique offers important support for defending against voice cloning and privacy leakage.


    In recent years, generative AI has made significant advancements in images, text, and speech, with deepfake technology being a prominent example. Deepfakes use AI to generate highly realistic media, transforming content creation in areas such as virtual avatars, chatbots, and art. One-shot voice conversion, which replicates a speaker's voice from a single sample, has also matured, raising concerns about identity theft and voice fraud.

    To address these issues, we propose a raw waveform-based adversarial attack that protects speech samples by generating subtle, imperceptible noise, effectively weakening one-shot voice conversion models' ability to replicate voice characteristics. This method innovatively generates perturbations at the raw waveform level, avoiding the feature mismatch common in two-stage processes.
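    As a rough illustration of what a raw-waveform adversarial perturbation of this kind can look like, the following is a minimal projected-gradient-style sketch in PyTorch. The speaker encoder, the cosine-similarity objective, and all hyperparameters are assumptions made here for illustration; they are not the method described in the thesis, whose actual objective and perturbation budget are not given in the abstract.

```python
import torch
import torch.nn.functional as F

def protect_waveform(waveform, speaker_encoder, eps=0.005, alpha=0.001, steps=50):
    """Hypothetical PGD-style sketch: learn a bounded perturbation on the raw
    waveform that pushes a speaker encoder's embedding away from the clean one.

    waveform:        float tensor of shape (1, num_samples), values in [-1, 1]
    speaker_encoder: any differentiable module mapping waveform -> speaker embedding
    """
    with torch.no_grad():
        clean_emb = speaker_encoder(waveform)            # embedding to move away from

    delta = torch.zeros_like(waveform, requires_grad=True)
    for _ in range(steps):
        emb = speaker_encoder(waveform + delta)
        # Minimizing similarity to the clean embedding weakens timbre cloning.
        loss = F.cosine_similarity(emb, clean_emb, dim=-1).mean()
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()           # signed-gradient descent step
            delta.clamp_(-eps, eps)                      # keep the noise imperceptible
        delta.grad.zero_()

    return (waveform + delta).clamp(-1.0, 1.0).detach()
```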

    We also introduce two strategies to enhance model generalization across speakers: (1) constructing a graph of speaker features and applying weighted sampling based on speaker centrality to improve protection; (2) selecting target speech by analyzing speaker feature distances, further optimizing adversarial attack effectiveness.
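    The abstract only names these two strategies, so the sketch below is one plausible reading rather than the thesis's implementation: a speaker-similarity graph built with NetworkX, degree centrality converted into sampling weights, and a farthest-embedding rule for picking the target speech. The similarity threshold, the choice of degree centrality, the weighting direction, and the farthest-distance rule are all assumptions.

```python
import numpy as np
import networkx as nx

def centrality_sampling_weights(speaker_embs, sim_threshold=0.6):
    """Build a speaker-similarity graph and turn degree centrality into
    per-speaker sampling weights (whether central speakers are up- or
    down-weighted is not stated in the abstract; up-weighting is assumed)."""
    normed = speaker_embs / np.linalg.norm(speaker_embs, axis=1, keepdims=True)
    sim = normed @ normed.T                          # pairwise cosine similarity
    graph = nx.Graph()
    graph.add_nodes_from(range(len(speaker_embs)))
    for i in range(len(speaker_embs)):
        for j in range(i + 1, len(speaker_embs)):
            if sim[i, j] > sim_threshold:            # connect sufficiently similar speakers
                graph.add_edge(i, j)
    centrality = nx.degree_centrality(graph)
    weights = np.array([centrality[i] for i in range(len(speaker_embs))]) + 1e-8
    return weights / weights.sum()

def pick_target_speaker(source_emb, candidate_embs):
    """Pick the candidate whose embedding is farthest from the source speaker."""
    dists = np.linalg.norm(candidate_embs - source_emb, axis=1)
    return int(np.argmax(dists))

# Usage sketch: draw training speakers in proportion to their centrality weights.
# weights = centrality_sampling_weights(train_speaker_embs)
# batch = np.random.choice(len(train_speaker_embs), size=8, p=weights)
```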

    Experiments in both black-box and white-box scenarios, with subjective and objective evaluations, demonstrate that our method significantly alters the speaker characteristics in the converted voice while maintaining a high signal-to-noise ratio with minimal quality loss. This provides an effective defense against voice cloning, preventing one-shot voice conversion models from successfully replicating voices.
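    For reference, the signal-to-noise ratio mentioned above is conventionally computed between the clean waveform x and the protected waveform x_hat as SNR = 10 log10(||x||^2 / ||x - x_hat||^2). The snippet below is a sketch of that standard definition, not the thesis's evaluation code.

```python
import numpy as np

def snr_db(clean, protected):
    """Signal-to-noise ratio (dB) of the clean waveform versus the added perturbation."""
    noise = protected - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / (np.sum(noise ** 2) + 1e-12))
```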

    Abstract (Chinese)
    Acknowledgements (Chinese)
    Abstract
    Contents
    List of Figures
    List of Tables
    List of Algorithms
    Introduction---------------------1
    Related Work---------------------5
    Preliminaries---------------------9
    Methodology---------------------12
    Experiment Results and Analysis---------------------16
    Speaker Bias Mitigation for Improved Performance---------------------25
    Conclusion---------------------41
    Bibliography---------------------42

