| Student: | 吳柏澄 Wu, Bo-Cheng |
|---|---|
| Thesis Title: | 使用多提取器在多說話人重疊問題中的個人語音活動檢測 / A Multi-Extractor Strategy for Personal-VAD in Multiple Speakers Overlapped Problem |
| Advisor: | 李祈均 Lee, Chi-Chun |
| Committee Members: | 廖元甫 Liao, Yuan-Fu; 曹昱 Tsao, Yu |
| Degree: | 碩士 Master |
| Department: | 電機資訊學院 - 通訊工程研究所 (Communications Engineering) |
| Year of Publication: | 2023 |
| Academic Year of Graduation: | 111 |
| Language: | English |
| Number of Pages: | 38 |
| Keywords (Chinese): | 語者自動分段標記 (speaker diarization), 語音活動檢測 (voice activity detection), 多重提取器 (multiple extractors), 重疊說話人 (overlapped speakers) |
| Keywords (English): | PVAD, multi-extractors, overlapped speech, RawNet |
Multiple overlapping speakers pose a great challenge for speaker diarization in real-life scenarios. An important branch of speaker diarization is personal voice activity detection (PVAD), an essential procedure that can significantly reduce the computational and time cost of many downstream speech applications such as automatic speech recognition (ASR). Most related studies focus on modifying the diarization model itself (the network architecture) rather than the overall strategy. In this work, we propose a novel strategy that uses multiple extractors to tackle PVAD under extremely challenging multi-speaker overlap (up to five speakers speaking simultaneously). The proposed multi-extractors aim to extract robust and stable speaker representations that characterize the target speaker across conditions with varying numbers of overlapped speakers (e.g., two to five). The features from the individual extractors are then combined into a joint representation, from which an ensemble model decides whether the target speaker is present. For the experiments, we construct a new speech database from the public LibriSpeech corpus to simulate these highly challenging overlapped-speaker scenarios. Experimental results show that the proposed framework outperforms state-of-the-art single-extractor approaches by a large margin in both open-set and closed-set settings, and demonstrate that the choice of extractor backbone, the diversity of target speakers, and the amount of enrollment data all have a substantial impact on detection performance. Further analyses provide insights into the trade-off between temporal granularity and VAD performance.
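Although only the abstract is reproduced here, the multi-extractor idea it describes (several speaker-embedding backbones whose outputs are concatenated into a joint representation and then classified for target-speaker activity) can be illustrated with a minimal sketch. The PyTorch module below is a hypothetical illustration, not the thesis implementation: the placeholder extractors, embedding dimension, and classifier head are assumptions chosen only to show how the joint representation is formed.

```python
# Hypothetical sketch of a multi-extractor PVAD model (not the thesis code).
import torch
import torch.nn as nn


class MultiExtractorPVAD(nn.Module):
    """Combine embeddings from several speaker-embedding backbones for PVAD."""

    def __init__(self, extractors, emb_dim=256, hidden_dim=128):
        super().__init__()
        # Speaker-embedding backbones (e.g., ECAPA-TDNN- or RawNet-style
        # networks trained under different overlap conditions); here they are
        # stand-in modules supplied by the caller.
        self.extractors = nn.ModuleList(extractors)
        joint_dim = 2 * emb_dim * len(extractors)  # segment + enrollment embedding per extractor
        self.classifier = nn.Sequential(
            nn.Linear(joint_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # logit for "target speaker is active"
        )

    def forward(self, segment, enrollment):
        feats = []
        for extractor in self.extractors:
            feats.append(extractor(segment))     # (B, emb_dim) embedding of the test segment
            feats.append(extractor(enrollment))  # (B, emb_dim) embedding of the enrolled target
        joint = torch.cat(feats, dim=-1)         # joint representation across all extractors
        return torch.sigmoid(self.classifier(joint)).squeeze(-1)


if __name__ == "__main__":
    # Dummy linear "extractors" stand in for real pre-trained backbones.
    make_dummy = lambda: nn.Linear(400, 256)
    model = MultiExtractorPVAD([make_dummy(), make_dummy()], emb_dim=256)
    segment = torch.randn(4, 400)     # batch of 4 dummy segment feature vectors
    enrollment = torch.randn(4, 400)  # matching enrollment feature vectors
    print(model(segment, enrollment).shape)  # -> torch.Size([4])
```

In this sketch the per-segment decision is framed as binary classification of the concatenated segment and enrollment embeddings; the actual backbones, fusion scheme, and training objective used in the thesis may differ.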