| Field | Value |
|---|---|
| Graduate Student | 許鈞棠 Hsu, Chun-Tang |
| Thesis Title | 基於共振峰迭代濾波的歌聲轉換 / Singing Voice Conversion based on Iterative Formant Filtering |
| Advisor | 劉奕汶 Liu, Yi-Wen |
| Committee Members | 蘇文鈺 Su, Wen-Yu; 蘇黎 Su, Li |
| Degree | Master |
| Department | College of Electrical Engineering and Computer Science, Department of Electrical Engineering |
| Year of Publication | 2019 |
| Academic Year of Graduation | 108 (ROC calendar) |
| Language | English |
| Number of Pages | 57 |
| Keywords (Chinese) | 共振峰濾波, 歌聲合成 |
| Keywords (English) | Formant Filtering, Singing Synthesis |
In the field of singing voice conversion, a source singer provides singing audio whose timbre is replaced using a voice bank built by another singer. This thesis proposes a system that achieves this goal while reducing the effort of voice-bank construction and phone labelling of the source data. We assume that a timbre source signal (TSS), a vowel source signal (VSS), and a singing source signal (SSS) are available, recorded by singerT, singerV, and singerS, respectively. The TSS contains only one of the five Japanese vowels /a, i, u, e, o/, the VSS covers all five vowels, and the SSS is the singing signal whose timbre is to be converted. The system comprises three modules: augmentation, recognition, and synthesis. First, the augmentation module uses the TSS and the VSS to build a voice library specific to singerT; the goal of this stage is to extend singerT's vowel coverage from a single vowel to all five Japanese vowels while preserving singerT's speaker characteristics. Next, the recognition module, a deep-neural-network model, predicts the phone category of the SSS at each time step; it is trained on a speech dataset and tested on a singing dataset. Finally, the synthesis module substitutes the timbre of the SSS according to the frame-level vowel annotation predicted by the network and synthesizes the converted audio with the WORLD vocoder. The main contributions of this thesis are the augmentation module and the recognition module.
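As a concrete illustration of the synthesis step, the sketch below analyzes the SSS with the WORLD vocoder, substitutes the spectral envelope of each voiced frame with a singerT vowel envelope selected by the predicted vowel label, and resynthesizes the converted audio. It is a minimal sketch using the pyworld Python bindings; the frame-aligned label sequence `vowel_frames` and the envelope dictionary `singer_t_envelopes` are hypothetical placeholders for the outputs of the recognition and augmentation modules, not the thesis's actual data structures.

```python
# Minimal sketch of the synthesis module using the pyworld bindings of the
# WORLD vocoder. `vowel_frames` (one label per WORLD frame) and
# `singer_t_envelopes` (vowel -> spectral envelope of length fft_size//2 + 1,
# matching pyworld's CheapTrick output) are assumed inputs.
import numpy as np
import soundfile as sf
import pyworld as pw

def convert(sss_wav, vowel_frames, singer_t_envelopes, out_wav="converted.wav"):
    """Replace the timbre (spectral envelope) of the singing source signal
    frame by frame while keeping its F0 and aperiodicity, then resynthesize."""
    x, fs = sf.read(sss_wav)
    if x.ndim > 1:                       # mix down to mono if needed
        x = x.mean(axis=1)
    x = np.ascontiguousarray(x, dtype=np.float64)

    f0, t = pw.harvest(x, fs)            # F0 contour (5 ms frames by default)
    sp = pw.cheaptrick(x, f0, t, fs)     # spectral envelope (timbre)
    ap = pw.d4c(x, f0, t, fs)            # aperiodicity

    for i, vowel in enumerate(vowel_frames[:len(f0)]):
        if f0[i] > 0 and vowel in singer_t_envelopes:   # voiced vowel frames only
            sp[i, :] = singer_t_envelopes[vowel]        # timbre substitution

    y = pw.synthesize(f0, sp, ap, fs)    # converted singing
    sf.write(out_wav, y, fs)
    return y
```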
To evaluate synthesis quality, sustained vowels sung by 11 singers were recorded and cross-synthesized with the proposed augmentation algorithm. Eleven listeners were then asked whether the speaker identity of each synthesized vowel resembled singerT or singerV. Seven of the eleven listeners chose singerT with an accuracy above 70%. In addition, the recognition module reached 85% frame-level accuracy in identifying the phones of the test singing voice.
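For reference, the two figures reported above (the number of subjects above the 70% identification threshold and the frame-level phone accuracy) can be computed as in the following minimal sketch. The data layouts are assumptions: `responses_per_subject` is a per-subject list of 1/0 trial outcomes for correctly identifying singerT, and `predicted`/`reference` are frame-aligned phone label sequences; the thesis's actual scoring scripts are not reproduced here.

```python
# Sketch of the two evaluation measures under assumed data layouts.
import numpy as np

def subjects_above_threshold(responses_per_subject, threshold=0.70):
    """Count subjects whose rate of correctly identifying singerT exceeds `threshold`.
    `responses_per_subject`: list of per-subject lists of 1/0 trial outcomes."""
    rates = [float(np.mean(r)) for r in responses_per_subject]
    return sum(rate > threshold for rate in rates), rates

def frame_phone_accuracy(predicted, reference):
    """Fraction of frames whose predicted phone label matches the reference."""
    predicted, reference = np.asarray(predicted), np.asarray(reference)
    n = min(len(predicted), len(reference))   # guard against length mismatch
    return float(np.mean(predicted[:n] == reference[:n]))
```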
1. “UTAU” retrieved from “https://en.wikipedia.org/wiki/Utau”
2. “VOCALOID” retrieved from “http://www.vocaloid.com/en/”
3. “VOCALOID development” retrieved from “https://vocaloid.fandom.com/wiki/VOCALOID_development”
4. T. Nakano, M. Goto. “VocaListener: A singing-to-singing synthesis system based on iterative parameter estimation,” Proc. SMC, pp.343-348; 2009.
5. M. Morise, F. Yokomori, and K. Ozawa. “WORLD: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Trans. Information Systems, vol.E99-D, no.7, pp.1877-1884; 2016.
6. M. Morise, “CheapTrick, a spectral envelope estimator for high-quality speech synthesis,” Speech Communication, vol.67, pp.1–7; 2015.
7. M. Morise, “Error evaluation of an f0-adaptive spectral envelope estimator in robustness against the additive noise and f0 error,” IEICE Trans. Information Systems, vol.E98-D, no.7, pp.1405–1408; 2015.
8. M. Morise, “Platinum: A method to extract excitation signals for voice synthesis system,” Acoust. Sci. Tech., vol.33, no.2, pp.123–125; 2012.
9. Y.-C. Wu, H.-T. Hwang, C.-C. Hsu, Y. Tsao, and H.-M. Wang, “Locally linear embedding for exemplar-based spectral conversion,” Proc. Interspeech, pp. 1652–1656; 2016.
10. S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326; 2000.
11. M. Latinus, P. Belin. “Human voice perception,” Current Biology, vol.21, no. 4, pp. 143-145; 2011.
12. K. Jensen. “The Timbre Model,” JASA, vol. 112, no. 5, pp. 2238-2238; 2001.
13. J. O. Smith. “Introduction to Digital Filters: with Audio Applications,” vol.2, p.210; 2007.
14. L. Dmitriev, A. Kiselev, “Relationship between the formant structure of different types of singing voices and the dimensions of supraglottic cavities,” Folia Phoniatr, vol.31, no.4, pp. 238-241; 1979.
15. “The Orb” retrieved from “https://www.pluginboutique.com/product/2-Effects/19-Filter/2888-The-Orb.”
16. “iZotope Vocal Synth 2” retrieved from “https://www.izotope.com/en/products/create-and-design/vocal-synth.html”
17. T. Bohm and G. Nemeth. “Algorithm for formant tracking, modification and synthesis,” Infocommunications journal, vol.62, no.1, p.1116; 2007.
18. P. Boersma, D. Weenink. “Praat: doing phonetics by computer” [computer program]; 2019.
19. Y. Wang et al. “Tacotron: Towards end-to-end speech synthesis,” Proc. Interspeech, pp.4006–4010; 2017.
20. D. Ahn. Deep-voice-conversion. GitHub repository; 2018
21. “TIMIT acoustic-phonetic continuous speech corpus LDC93S1,” retrieved from “https://catalog.ldc.upenn.edu/LDC93S1”; 1993.
22. “Nagoya singing dataset NIT-SONG070-F001” retrieved from “http://hts.sp.nitech.ac.jp/archives/2.3/HTSdemo_NIT-SONG070-F001.tar.bz2”
23. M. Morise, H. Kawahara, and H. Katayose. “Fast and reliable f0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech,” Proc. AES 35th International Conference, pp.77-81; 2009.
24. M. Morise, H. Kawahara, and T. Nishiura. “Rapid f0 estimation for high-snr speech based on fundamental component extraction,” IEICE Trans. Inf. Syst. (Japanese Edition), vol.J93-D, no.2, pp.109–117; 2010.
25. P. Ahuja. “A case study on comparison of male and female vowel formants by native speakers of Gujarati,” ICPhS, vol. 17, pp. 934-937; 2015.
26. S. Kumar, K. Stephan, J. Warren, K. Friston, T. Griffiths. “Hierarchical processing of auditory objects in humans,” PLoS Comput Biol, vol.3, no.6, p.100; 2007.
27. J. Markel and A. Gray. “Linear prediction of speech,” JSV, vol.51, iss.4, p.595; 1982.
28. A. Robel and X. Rodet. “Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation,” Proc. Digital Audio Effects (DAFx-05), pp.30-35; 2005.
29. “Spectral Envelope Extraction” retrieved from “https://www.dsprelated.com/freebooks/sasp/Spectral_Envelope_Extraction.html”
30. Q. Meng, M. Yuan, Z. Yang. “An empirical envelope estimation algorithm,” CISP, vol.2, pp.1132-1136; 2013.
31. L. Rabiner, B. Gold. “Theory and Application of Digital Signal Processing,” Englewood Cliffs, N.J.: Prentice-Hall, pp.65–67, ISBN 0-13-914101-4; 1975.
32. M. Morise, G. Miyashita, K. Ozawa. “Low-dimensional representation of spectral envelope without deterioration for full-band speech analysis/synthesis system,” Interspeech 2017, pp.409-413; 2017.
33. G. F. Welch. “Vocal range and poor pitch singing,” POM, vol.7, no.2, pp.13-31; 1979.
34. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov. “Dropout: A simple way to prevent neural networks from overfitting,” JMLR, vol.15, no.1, pp.1929-1958; 2014.
35. S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber. “Gradient flow in recurrent nets: the difficulty of learning long-term dependencies,” IEEE Press; 2001.
36. J. Lee, K. Cho, T. Hofmann. “Fully character-level neural machine translation without explicit segmentation,” TACL, vol.5, pp.365-378; 2016.
37. R. Srivastava, K. Greff, J. Schmidhuber. “Highway networks,” ICML 2015, vol.abs/1507.06228; 2015.
38. R. K. Srivastava, K. Greff, J. Schmidhuber. “Training very deep networks,” NIPS, pp.2377-2385; 2015.
39. K. Cho, B. van Merriënboer, D. Bahdanau, Y. Bengio. “On the properties of neural machine translation: Encoder-decoder approaches,” SSST-8, pp.103-111; 2014.
40. S. Hochreiter and J. Schmidhuber. “Long short-term memory,” Neural Computation, vol.9, no.8, pp.1735-1780; 1997.
41. S. Ioffe, C. Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” ICML 2015, vol.37, pp.448-456; 2015.
42. K. He, X. Zhang, S. Ren and J. Sun. “Deep residual learning for image recognition,” IEEE CVPR 2016, pp.770-778; 2016.
43. D. Kingma, J. Ba. “Adam: A method for stochastic optimization,” ICLR 2015, vol.abs/1412.6980; 2015.
44. E. Mendoza, N. Valencia, J. Muñoz, H. Trujillo. “Differences in voice quality between men and women: use of the long-term average spectrum (LTAS),” Journal of Voice, vol.10, no.1, pp.59–66; 1996.
45. S. Itahashi, S. Yokoyama, “A formant extraction method utilizing mel scale and equal loudness contour,” STL-QPSR Journal, pp.17-29; 1978.