
Author: Hsu, Chao-Ling (許肇凌)
Thesis Title: Monaural Singing Voice Separation from Music Accompaniment (單聲道音樂之歌聲分離)
Advisor: Jang, Jyh-Shing (張智星)
Oral Defense Committee:
Degree: Doctoral
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of Publication: 2011
Academic Year of Graduation: 99 (2010-2011)
Language: English
Number of Pages: 79
Chinese Keywords: voice separation (聲音分離), music (音樂), computational auditory scene analysis (計算聽覺場景分析)
Foreign-Language Keywords: voice separation, music, Computational Auditory Scene Analysis
    Separating the singing voice from monaural music is an extremely challenging problem. While pitch-based singing voice separation methods have made considerable progress, little attention has been paid to their inability to separate the unvoiced parts of the singing voice: unlike the voiced (vowel) parts, unvoiced sounds lack a harmonic structure, and their energy is usually weaker, so they blend into the background music and are difficult to extract. In this dissertation, we propose a method to detect and separate the unvoiced singing voice from the background music, and a new singing pitch extraction algorithm to improve the separation of the voiced parts. The proposed methods follow the framework of computational auditory scene analysis (CASA), which consists of a segmentation stage and a grouping stage. In the segmentation stage, the input song signal is decomposed into small sensory elements at different time and frequency resolutions; the sensory elements belonging to the unvoiced singing voice are then identified with Gaussian mixture models. Experimental results show a clear improvement in the unvoiced parts of the separated singing voice. On the other hand, since most of the singing voice is voiced, target pitch detection is the key technique that determines the separation performance of a CASA system. Unfortunately, robust target pitch detection is very difficult, especially against non-stationary background interference such as music accompaniment. This dissertation therefore also proposes a tandem algorithm that detects the pitch and separates the voiced singing voice jointly. Rough pitches are first estimated and used to separate the singing voice by considering the harmonic structure of the spectrum and temporal continuity; the separated singing voice and the detected pitches are then used to improve each other iteratively. To improve the performance of the tandem algorithm on musical recordings, we propose a pitch trend estimation algorithm that estimates, frame by frame, the most likely range of the singing pitch; this greatly reduces the erroneous pitch estimates caused by musical instruments or by overtones of the singing voice. Systematic evaluation shows that the tandem algorithm clearly outperforms previous algorithms in both pitch detection and singing voice separation. Combined with the unvoiced separation algorithm, it yields a complete CASA-based singing voice separation system. Finally, to address the lack of a public, large-scale corpus for singing voice separation research, we constructed the MIR-1K (Multimedia Information Retrieval lab, 1000 song clips) corpus. In MIR-1K, the singing voice and the background music of every recording are stored in separate channels, and every clip comes with manually labeled pitch values, the locations and types of unvoiced sounds, vocal/non-vocal segments, lyrics, and a speech recording of the lyrics, which makes the corpus useful for a wide range of applications.
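    The abstract states that the sensory elements belonging to the unvoiced singing voice are identified with Gaussian mixture models. As a rough illustration of that kind of classifier, the sketch below trains one GMM per class (accompaniment / unvoiced / voiced) on frame-level features and labels new frames by maximum likelihood. The MFCC features, the librosa and scikit-learn APIs, and all function names are assumptions made for illustration, not the thesis's actual front end or models.

# Sketch: GMM-based accompaniment/unvoiced/voiced (A/U/V) frame classification.
# Assumptions (not from the thesis record): MFCC features and scikit-learn's
# GaussianMixture stand in for the thesis's acoustic features and models.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(wav_path, sr=16000, n_mfcc=20):
    """Return one MFCC feature vector per analysis frame."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # shape: (n_frames, n_mfcc)

def train_auv_gmms(features_by_class, n_components=8):
    """Fit one GMM per class from labeled training frames.

    features_by_class maps a label such as "accompaniment", "unvoiced",
    or "voiced" to an array of shape (n_frames, n_features).
    """
    gmms = {}
    for label, feats in features_by_class.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(feats)
        gmms[label] = gmm
    return gmms

def classify_frames(gmms, feats):
    """Label each frame with the class whose GMM gives the highest likelihood."""
    labels = list(gmms)
    scores = np.stack([gmms[l].score_samples(feats) for l in labels], axis=1)
    return [labels[i] for i in scores.argmax(axis=1)]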


    Monaural singing voice separation is an extremely challenging problem. While pitch-based inference methods have led to considerable progress in voiced singing voice separation, little attention has been paid to the inability of such methods to separate the unvoiced singing voice, owing to its inharmonic structure and weaker energy. In this dissertation we propose a systematic approach to identify and separate the unvoiced singing voice from the music accompaniment. The proposed system follows the framework of computational auditory scene analysis (CASA), which consists of a segmentation stage and a grouping stage. In the segmentation stage, the input song signal is decomposed into small sensory elements at different time-frequency resolutions. The unvoiced sensory elements are then identified by Gaussian mixture models. Experimental results demonstrate that the quality of the separated singing voice is improved for the unvoiced parts. On the other hand, target pitch detection is key to the performance of a CASA system, since most of the singing voice is voiced. Unfortunately, it is difficult to detect the target pitch robustly, especially for mixtures with non-stationary and harmonic interference such as music. This dissertation therefore also investigates a tandem algorithm that estimates the singing pitch and separates the singing voice jointly and iteratively. Rough pitches are first estimated and then used to separate the target singer by considering harmonicity and temporal continuity; the separated singing voice and the estimated pitches are then used to improve each other iteratively. To enhance the performance of the tandem algorithm on musical recordings, we propose a trend estimation algorithm that detects the pitch range of the singing voice in each time frame. The detected trend substantially reduces the difficulty of singing pitch detection by removing a large number of wrong pitch candidates produced either by musical instruments or by the overtones of the singing voice. Systematic evaluation shows that the tandem algorithm outperforms previous systems in both pitch extraction and singing voice separation. With the proposed voiced and unvoiced separation methods combined, we obtain a complete CASA system that separates the singing voice from the music accompaniment. Moreover, to address the lack of a publicly available dataset for singing voice separation, we have constructed a corpus called MIR-1K (Multimedia Information Retrieval lab, 1000 song clips), in which all singing voices and music accompaniments were recorded separately. Each clip comes with human-labeled pitch values, unvoiced-sound and vocal/non-vocal segment annotations, and lyrics, as well as a speech recording of the lyrics.
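    The tandem algorithm described above alternates between estimating the singing pitch and estimating a mask that separates the voiced singing voice, each refining the other. The sketch below is a minimal, self-contained illustration of that alternation using harmonic summation for pitch and a harmonic binary mask for separation; the thesis's auditory front end, temporal-continuity grouping, and trend estimation are omitted, and all parameter values (frequency range, tolerance, STFT settings) are assumptions made for illustration.

# Sketch: a tandem-style loop alternating per-frame pitch estimation and
# harmonic binary masking. A simplified illustration, not the thesis's algorithm.
import numpy as np
from scipy.signal import stft, istft, medfilt

def estimate_pitch(mag, freqs, f0_grid, n_harm=5):
    """Pick, per frame, the f0 whose first few harmonics carry the most energy."""
    # Precompute, for each candidate f0, the FFT bins nearest to its harmonics.
    harm_bins = [
        [int(np.argmin(np.abs(freqs - h * f0))) for h in range(1, n_harm + 1)]
        for f0 in f0_grid
    ]
    scores = np.array([mag[bins, :].sum(axis=0) for bins in harm_bins])
    return f0_grid[scores.argmax(axis=0)]  # shape: (n_frames,)

def harmonic_mask(mag, freqs, pitches, tol=0.03):
    """Binary mask keeping bins within a relative tolerance of a harmonic of f0."""
    mask = np.zeros_like(mag, dtype=bool)
    for t, f0 in enumerate(pitches):
        ratio = freqs[1:] / f0  # skip the DC bin
        near = np.abs(ratio - np.round(ratio)) < tol * np.round(np.maximum(ratio, 1))
        mask[1:, t] = near
    return mask

def tandem_separate(x, fs=16000, iters=3):
    """Alternate pitch estimation and masking, then resynthesize the voice."""
    f, _, X = stft(x, fs, nperseg=1024)
    mag = np.abs(X)
    f0_grid = np.arange(80.0, 1000.0, 5.0)
    pitches = estimate_pitch(mag, f, f0_grid)          # rough initial pitch
    for _ in range(iters):
        pitches = medfilt(pitches, kernel_size=5)      # crude temporal continuity
        mask = harmonic_mask(mag, f, pitches)
        pitches = estimate_pitch(np.where(mask, mag, 0.0), f, f0_grid)
    _, voice = istft(X * harmonic_mask(mag, f, pitches), fs, nperseg=1024)
    return voice, pitches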

    Chapter 1. Introduction ... 10
    Chapter 2. System Overview and Front End Processing ... 13
    Chapter 3. Unvoiced Singing Voice Separation ... 16
      3.1. Background of Unvoiced Singing Voice Separation ... 16
      3.2. A/U/V Detection ... 19
      3.3. Unvoiced-dominant T-F Unit Identification within Unvoiced Frames ... 21
      3.4. Resynthesis ... 22
      3.5. MIR-1K Dataset ... 22
    Chapter 4. Evaluation of Unvoiced Singing Voice Separation ... 25
      4.1. Evaluation of A/U/V Detection ... 25
      4.2. Evaluation of Unvoiced-dominant T-F Unit Identification ... 29
      4.3. Evaluation of Singing Voice Pitch Detection ... 31
      4.4. Evaluation of Unvoiced Singing Voice Separation ... 33
      4.5. Comparison to Existing Methods ... 39
    Chapter 5. Voiced Singing Voice Separation ... 43
      5.1. Background of Voiced Singing Voice Separation ... 43
      5.2. Overview of the Singing Pitch Extraction and Voiced Singing Voice Separation ... 44
      5.3. Trend Estimation ... 46
        5.3.1. Vocal Component Enhancement ... 46
        5.3.2. Pitch Range Estimation ... 48
      5.4. Mask Estimation and Pitch Determination ... 52
        5.4.1. IBM Estimation Given Pitch ... 52
        5.4.2. Pitch Estimation Given Binary Mask ... 55
      5.5. Iterative Procedure ... 56
        5.5.1. Initial Estimation ... 56
        5.5.2. Iterative Estimation ... 57
        5.5.3. Post-Processing ... 59
      5.6. Singing Voice Detection ... 59
    Chapter 6. Evaluation of Voiced Singing Voice Separation ... 61
      6.1. Evaluation of Singing Voice Detection ... 61
      6.2. Evaluation of Trend Estimation ... 64
      6.3. Evaluation of Singing Pitch Extraction ... 64
      6.4. Evaluation of Voiced Singing Voice Separation ... 68
    Chapter 7. Conclusions ... 70
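    Because MIR-1K stores the singing voice and accompaniment in separate channels, an ideal binary mask (IBM), the reference quantity behind the mask estimation of Section 5.4.1 and the resynthesis step of Section 3.4, can be computed directly from the clean tracks for training and evaluation. The sketch below shows one straightforward way to build such a mask and resynthesize a separated voice from the mixture; the 0 dB local-SNR threshold and the STFT settings are assumptions for illustration, not values taken from the thesis.

# Sketch: ideal binary mask (IBM) from separately recorded vocal and
# accompaniment tracks (as a MIR-1K-style clip provides), plus resynthesis.
import numpy as np
from scipy.signal import stft, istft

def ideal_binary_mask(voice, accomp, fs=16000, nperseg=1024, snr_db=0.0):
    """Keep T-F units where vocal energy exceeds accompaniment energy by snr_db."""
    _, _, V = stft(voice, fs, nperseg=nperseg)
    _, _, A = stft(accomp, fs, nperseg=nperseg)
    local_snr = 20 * np.log10(np.abs(V) + 1e-12) - 20 * np.log10(np.abs(A) + 1e-12)
    return local_snr > snr_db

def resynthesize(mixture, mask, fs=16000, nperseg=1024):
    """Apply the mask to the mixture spectrogram and invert back to a waveform."""
    _, _, M = stft(mixture, fs, nperseg=nperseg)
    _, y = istft(M * mask, fs, nperseg=nperseg)
    return y

# Illustrative usage with a clip whose voice and accompaniment are separate tracks:
#   mixture = voice + accomp
#   separated = resynthesize(mixture, ideal_binary_mask(voice, accomp))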


    Full-Text Availability: Not authorized for public access (campus and off-campus networks).
