| Graduate Student: | 陳致中 Chih-Chung Chen |
|---|---|
| Thesis Title: | 適用於多使用者音訊視訊轉換之混合高斯模型調變 Adaptation of Gaussian Mixture Model for Multi-user Audio-to-Visual Conversion |
| Advisor: | 陳永昌 Yung-Chang Chen |
| Committee Members: | |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of Publication: | 2000 |
| Graduation Academic Year: | 88 (ROC calendar; AY 1999-2000) |
| Language: | English |
| Number of Pages: | 48 |
| Keywords: | Audio-to-Visual, Line Spectrum Pair, LSP, Cepstral, Gaussian Mixture Model, GMM, E-M Algorithm, Expectation Maximization |
Speech audio and mouth images are highly correlated, so it is possible to map acoustic features to visual features. Audio-to-visual conversion not only provides "visible speech" that aids speech recognition in noisy environments, it also enriches the visual experience. It therefore lends itself to a wide range of applications, such as multimedia telephony for hearing-impaired people, human-computer interfaces, lip synchronization in cartoon animation, and low-bit-rate video conferencing.
In this thesis, we propose a real-time audio-to-visual conversion system based on a Gaussian mixture model (GMM). With the GMM, visual parameters can be predicted from the corresponding audio parameters; we use this method to map continuous speech to synchronized lip movements in real time.
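As a concrete sketch of this mapping (the abstract does not give the estimator's exact form; the code below assumes the standard joint-GMM conditional-expectation formulation, with hypothetical function and variable names), a GMM trained on stacked [audio; visual] feature vectors yields a soft, piecewise-linear regression from audio parameters to visual parameters:

```python
import numpy as np

def gmm_audio_to_visual(a, weights, means, covs, d_a):
    """Predict visual parameters from an audio feature vector `a`
    using a GMM trained on stacked [audio; visual] vectors.

    weights: (K,) mixture weights
    means:   (K, d_a + d_v) component means
    covs:    (K, d_a + d_v, d_a + d_v) component covariances
    d_a:     dimensionality of the audio part
    """
    K = len(weights)
    post = np.empty(K)          # p(k | a), component responsibilities
    cond = []                   # E[v | a, k], per-component regressors
    for k in range(K):
        mu_a = means[k, :d_a]
        mu_v = means[k, d_a:]
        S_aa = covs[k, :d_a, :d_a]
        S_va = covs[k, d_a:, :d_a]
        diff = a - mu_a
        # Gaussian likelihood of the audio part, N(a; mu_a, S_aa)
        _, logdet = np.linalg.slogdet(S_aa)
        maha = diff @ np.linalg.solve(S_aa, diff)
        post[k] = weights[k] * np.exp(-0.5 * (maha + logdet + d_a * np.log(2 * np.pi)))
        # Per-component linear regression from audio to visual features
        cond.append(mu_v + S_va @ np.linalg.solve(S_aa, diff))
    post /= post.sum()
    # MMSE estimate: responsibility-weighted blend of the regressors
    return post @ np.array(cond)
```

Each mixture component contributes one linear regressor, and the responsibilities p(k | a) blend them, so the overall mapping is smooth and can be evaluated per speech frame, which is what makes real-time operation plausible.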
This system utilizes line spectrum pairs (LSPs) as the audio features, so it can be integrated easily, and without much additional computational complexity, with LSP-based speech coders such as G.723.1 and the MPEG-4 CELP and HVXC coders. LSPs were initially chosen as the audio features because they are immediately available during the speech decoding process and require no extra computation; moreover, the experimental results show that visual features estimated from LSPs are considerably more accurate than those estimated from traditional cepstral coefficients.
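For readers unfamiliar with LSPs, the sketch below shows the textbook construction of line spectrum pairs from LPC coefficients (an illustrative sketch, not code from the thesis): the LPC inverse filter A(z) is split into a symmetric polynomial P(z) and an antisymmetric polynomial Q(z) whose roots interleave on the unit circle, and the root angles are the LSP frequencies.

```python
import numpy as np

def lpc_to_lsp(a):
    """Convert LPC coefficients a = [1, a1, ..., ap] (even order p
    assumed, as in typical LPC analysis) to LSP frequencies in
    radians on (0, pi), via the symmetric/antisymmetric split:
        P(z) = A(z) + z^-(p+1) A(1/z)
        Q(z) = A(z) - z^-(p+1) A(1/z)
    """
    a = np.asarray(a, dtype=float)
    # Coefficients of z^-(p+1) A(1/z): A's coefficients reversed,
    # with one extra delay (leading zero).
    flipped = np.concatenate(([0.0], a[::-1]))
    forward = np.concatenate((a, [0.0]))
    p_poly = forward + flipped     # symmetric polynomial P(z)
    q_poly = forward - flipped     # antisymmetric polynomial Q(z)
    # For a stable A(z), the roots of P and Q lie on the unit circle;
    # keep the angles in the upper half-plane and sort them.
    angles = []
    for poly in (p_poly, q_poly):
        roots = np.roots(poly)
        angles.extend(np.angle(r) for r in roots if 0 < np.angle(r) < np.pi)
    return np.sort(np.array(angles))
```

Because an LSP-based coder already computes these frequencies when decoding each frame, feeding them to the conversion model adds essentially no cost on top of speech decoding.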
Furthermore, we propose three model adaptation algorithms to reduce the heavy computational cost of the training phase. We evaluate each of these algorithms to examine whether a GMM can be obtained with fewer computations. With model adaptation, the audio-to-visual conversion system becomes easier to deploy in practical applications.
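The abstract does not name the three adaptation algorithms, so the following is only one plausible illustration of the general idea: a MAP-style update that adapts the component means of a pretrained GMM to a small amount of new-user data, which costs far less than retraining with full EM from scratch.

```python
import numpy as np

def map_adapt_means(X, weights, means, covs, tau=10.0):
    """One MAP adaptation pass: shift the means of a pretrained GMM
    toward a new user's data X of shape (N, d), leaving the weights
    and covariances fixed.

    tau: relevance factor; larger values trust the prior model more.
    (Illustrative only -- the thesis' three algorithms are not
    spelled out in the abstract.)
    """
    N, d = X.shape
    K = len(weights)
    # E-step: responsibilities p(k | x_n) under the pretrained model
    log_resp = np.empty((N, K))
    for k in range(K):
        diff = X - means[k]
        _, logdet = np.linalg.slogdet(covs[k])
        maha = np.einsum('nd,nd->n', diff @ np.linalg.inv(covs[k]), diff)
        log_resp[:, k] = np.log(weights[k]) - 0.5 * (maha + logdet + d * np.log(2 * np.pi))
    log_resp -= log_resp.max(axis=1, keepdims=True)
    resp = np.exp(log_resp)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step (means only): interpolate between data mean and prior mean
    n_k = resp.sum(axis=0)                       # soft counts per component
    x_bar = (resp.T @ X) / np.maximum(n_k, 1e-12)[:, None]
    alpha = (n_k / (n_k + tau))[:, None]         # data-vs-prior weight
    return alpha * x_bar + (1.0 - alpha) * means
```

A mean-only update of this kind touches only K·d parameters per pass instead of re-estimating the full model, which is why adaptation schemes along these lines are attractive when per-user training time matters.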