
Graduate student: 劉怡芬
Liu, Yi-Fen
Thesis title: 結合語言知識建置自然口語發音變異模型以提升中文連續口語語音辨識效能
Linguistically Motivated Word Pronunciation Modeling for Automatic Speech Recognition of Chinese Conversational Speech
Advisors: 張俊盛
Chang, Jason S.
張智星
Jang, Jyh-Shing Roger
曾淑娟
Tseng, Shu-Chuan
Committee members: 王逸如
Wang, Yih-Ru
陳柏琳
Chen, Berlin
廖元甫
Liao, Yuan-Fu
Degree: Doctor
Department: 電機資訊學院 - 資訊系統與應用研究所
Institute of Information Systems and Applications
Year of publication: 2016
Academic year of graduation: 104
Language: English
Number of pages: 113
Chinese keywords: 語音變體, 音節縮讀, 語音弱化類型, 字詞類型, 發音變異模型, 雙音節字詞, 自然口語語音辨識系統
English keywords: Pronunciation variation, Reduction type, Word type, Pronunciation modeling, Disyllabic word, Spontaneous Speech Recognition, ASR system
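  • This thesis starts from observations of word pronunciation variation in natural connected speech. The possible variants of Chinese words, which arise as the structure across syllables changes and are identified by statistical acoustic models, are classified into four categories of reduction according to changes in syllable CV-structure and the constraint that segment length progressively decreases (Tseng et al., 2013b). Specifically, the automatic variant-derivation algorithm developed from the reduction types (RT) defined in this thesis incorporates the typical variants frequently used in spontaneous speech into pronunciation modeling, in order to improve utterance recognition accuracy.
    In actual language use, besides pronunciation variation conditioned by neighboring sounds, similar to that found in Germanic languages, Chinese exhibits syllable contraction, in which the structure across syllables changes. Phonologically, the variants likely to be used can be predicted via the Edge-in Theory (Chung, 1997; Hsu, 2003). In variants of this type the word-internal syllable boundary has already disappeared, and the word is reduced to a single legal Chinese syllable. This study differs from previous work in that we treat whether the word-internal syllable boundary is present, blurred, or absent as the main feature for categorizing degrees of reduction. We learn the typical variants of each word in spontaneous Chinese speech that cannot be ignored, rather than excluding possible variants derived from knowledge-based rules or data-driven methods on the grounds that they readily cause pronunciation confusion.
    We therefore take into account word types (WT), defined by the original syllable CV-structure of each word's canonical pronunciation, in learning to identify the reduction types of pronunciation variants obtained automatically from the speech signal of real corpus data. Frequency is the criterion for selecting typical variants: from the frequently used reduction types, the frequently occurring variants are chosen as the typical variants of a word. The results show that variants found through knowledge rules that generalize over the reduction of syllable structure in Chinese words are reasonably representative, both in actual language use and cognition and in spontaneous Mandarin speech recognition systems. When the typical variants are added to the pronunciation dictionary and applied to word recognition in spontaneous utterances, the added variants increase lexical confusability, yet recognition accuracy does not decrease and even improves.
    Following the proposed framework, we focus on disyllabic words, which are both highly frequent and diverse in spontaneous Chinese speech. We identify the important, typical variants of each disyllabic word, add them to the pronunciation dictionary, and implement them in spontaneous Mandarin speech recognition. Chapter 7 describes in detail the recognition systems and pronunciation dictionaries built separately for task-oriented dialogues with a restricted vocabulary (MMTC) and free conversations with an unrestricted vocabulary (MCDC8), and whether they improve word recognition in utterances. The experimental results show that the pronunciation model proposed here, based on linguistic knowledge rules, can find the important and frequently used variants of disyllabic words in spontaneous Chinese speech. We will also try to extend the approach to the degree of reduction between any two syllables in spontaneous speech, such as 我的 ("my") in the noun phrase 我的學生 ("my student"), in the hope of further improving speech recognition systems.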
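    The frequency-based selection described in this abstract (pick the most frequent reduction types for a word, then the most frequent variants within them) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's actual algorithm: the function name `select_typical_variants`, the `top_types` and `top_variants` parameters, and the example phone strings are hypothetical. Only the reduction-type labels CAN, MSD, NUM, and SYM are taken from the thesis's table of contents.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# One observation = (reduction type label, surface variant as a phone string).
Observation = Tuple[str, str]


def select_typical_variants(
    observations: Sequence[Observation],
    top_types: int = 1,      # hypothetical knob: how many reduction types to keep
    top_variants: int = 1,   # hypothetical knob: how many variants per kept type
) -> List[Observation]:
    """Pick frequent variants from the most frequent reduction types of one word."""
    # Count how often each reduction type was observed for this word.
    type_counts = Counter(rt for rt, _ in observations)
    selected: List[Observation] = []
    for rt, _count in type_counts.most_common(top_types):
        # Within a kept reduction type, count each surface variant.
        variant_counts = Counter(v for t, v in observations if t == rt)
        selected.extend(
            (rt, v) for v, _n in variant_counts.most_common(top_variants)
        )
    return selected


# Hypothetical observations for a single disyllabic word (made-up phone strings).
example = (
    [("SYM", "j iang")] * 5
    + [("CAN", "zh e y ang")] * 3
    + [("SYM", "z ang")] * 2
    + [("MSD", "zh e ang")]
)
typical = select_typical_variants(example, top_types=2)
```

    With these made-up counts, the two most frequent reduction types are SYM and CAN, so one variant from each is kept. A real system would presumably also weigh the similarity score and the word type, which this sketch leaves out.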


    This thesis examines how pronunciations of disyllabic words vary across the word-internal syllable boundary in spoken Mandarin Chinese, and how multiple-pronunciation dictionaries constructed via the proposed variant-selection algorithm may significantly enhance the performance of pronunciation modeling for speech recognition. Three preprocessing stages precede the selection of typical variants for the multiple-pronunciation dictionaries: 1) deriving word variants with a free phone recognizer; 2) calculating similarity scores by aligning phonetic surface forms to canonical forms; 3) categorizing pronunciation variants by stipulated reduction rules on changes in the word-internal syllable structure. In our work, four reduction types are suggested by considering the presence of a within-word syllable boundary: citation form-like reduction, marginal segment deletion, nuclei merger, and syllable merger. Additionally, the results of a series of statistical analyses show that the most frequent reduction types for disyllabic words in Chinese conversation are citation form-like reduction and syllable merger. In particular, high-frequency disyllabic words preferentially take the extreme syllable-merger form. Furthermore, our results show that segmental reduction in Chinese disyllabic words is morphology-dependent, and is also related to the prosodic position at which a disyllabic word is produced as well as to the temporal quality of the word. Motivated by the quasi-categorical reduced forms of disyllabic words produced in Chinese conversational speech, a frequency-based selection procedure is adopted to select typical pronunciations for the dictionary. The implementation of our new pronunciation models derived from the training data has shown a 2.4% absolute improvement on the domain-specific recognition task (MMTC) and a 1.2% improvement on the freely-conversed recognition task (MCDC8). Even though the confusability of the resulting lexicons increases, our findings suggest that the automatically learned pronunciation models may capture more linguistic variation beyond short-span contextual effects, such as phoneme substitution and deletion.
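    The second preprocessing stage, scoring a surface form by aligning it to the canonical form, can be illustrated with a simple stand-in. The abstract does not state the thesis's actual phonetic similarity score, so the sketch below assumes a normalized Levenshtein (edit) distance over phone sequences with unit costs. The function names and the phone symbols in the example are hypothetical.

```python
from typing import Sequence


def edit_distance(canonical: Sequence[str], surface: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences (unit costs, an assumption)."""
    prev = list(range(len(surface) + 1))
    for i, c in enumerate(canonical, start=1):
        curr = [i]
        for j, s in enumerate(surface, start=1):
            cost = 0 if c == s else 1
            curr.append(
                min(
                    prev[j] + 1,         # canonical phone deleted in the surface form
                    curr[j - 1] + 1,     # extra phone inserted in the surface form
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def similarity(canonical: Sequence[str], surface: Sequence[str]) -> float:
    """Map the edit distance to [0, 1]; 1.0 means identical sequences."""
    longest = max(len(canonical), len(surface))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(canonical, surface) / longest


# Hypothetical canonical form (two syllables) and a merged one-syllable surface form.
canonical_form = ["zh", "e", "y", "ang"]
merged_form = ["zh", "ang"]
score = similarity(canonical_form, merged_form)  # two deletions out of four phones: 0.5
```

    A uniform substitution cost treats all phone confusions alike. A phonetically informed cost matrix would likely be closer to what a pronunciation model needs, but that choice is outside what the abstract specifies.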

    Contents
    List of Figures
    List of Tables
    1 Introduction
        1.1 What is pronunciation?
        1.2 Modeling pronunciations in an ASR system
        1.3 Structure of thesis
    2 Pronunciation Modeling (PM)
        2.1 Overview of a speech recognizer
        2.2 Types of pronunciation variations
            2.2.1 Variations that depend on higher-level knowledge
            2.2.2 Variations captured by context-dependent HMMs
            2.2.3 Variations derived from phonological / rewrite rules
        2.3 Sources of pronunciation derivation
        2.4 Formalize pronunciation or not?
        2.5 Level of modeling in ASR systems
    3 Word, Syllable Mergers, a Continuum Reduction of Spoken Words
        3.1 Word coverage in natural, spontaneous Mandarin
        3.2 Syllable mergers as a representative spoken word form
        3.3 Categorical phonetic forms of reduced disyllabic words
        3.4 Summary
    4 Automatic Pronunciation Variation Derivation of Chinese Disyllabic Words
        4.1 Surface form generation
        4.2 Reduction type (RT) categorization
            4.2.1 Language-dependent word type (WT) definition
            4.2.2 Four reduction types (CAN, MSD, NUM, and SYM)
        4.3 Typical variant selection
            4.3.1 Phonetic similarity score
            4.3.2 Selection by CV-structure type (CVT) and RT
            4.3.3 Conditioned selection to Expand/Reduce Variants
        4.4 Summary
    5 Statistical Analysis on Selected Typical Variants
        5.1 Frequent words prefer syllable merger
        5.2 Prosodic position results in differing variants
        5.3 Variants correlate with word types and phonetic similarity
        5.4 Likely more than one typical variant
        5.5 Variant selection on realistic speech data
    6 Are Monophones Suitable for Aligning?
        6.1 A phone-aligner for spontaneous, conversational Mandarin
        6.2 A combined procedure on phone boundary verification
        6.3 Criteria of phone boundary labeling
            6.3.1 Separation of two adjacent phones
            6.3.2 Allophones
            6.3.3 Marking segment deletion
            6.3.4 Transcription errors
        6.4 Labeling results
            6.4.1 Inter-labeler consistency
            6.4.2 Phones with severe boundary deviation
            6.4.3 Perceived segment deletion
            6.4.4 Small discussion on labeling difference
        6.5 Summary
    7 Inclusion of Variants into ASR systems
        7.1 Speech data
        7.2 Experiment setup
        7.3 The compared pruning method
        7.4 Results on different speech corpora
            7.4.1 Task-oriented dialogues: MMTC
            7.4.2 Freely talked conversations: MCDC8
        7.5 Discussion on added variants in ASR systems
    8 Conclusions
    Appendix
    References

    Akaike, H. (1973). “Information theory and an extension of the maximum likelihood principle,” in Proceedings of the Second International Symposium on Information Theory (Budapest, Hungary), pp. 267-281.
    Akita, Y., and Kawahara, T. (2010). “Statistical transformation of language and pronunciation models for spontaneous speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing 18, 1539-1549.
    Amdal, I., Korkmazskiy, F., and Surendran, A. C. (2000). “Joint pronunciation modelling of non-native speakers using data-driven methods,” in Proceedings of International Conference on Spoken Language Processing (Beijing, China), pp. 622-625.
    Badr, I., McGraw, I., and Glass, J. (2011). “Pronunciation learning from continuous speech,” in Proceedings of Interspeech (Florence, Italy), pp. 549-552.
    Bell, A., Jurafsky, D., Fosler-Lussier, E., Girand, C., Gregory, M., and Gildea, D. (2003). “Effects of disfluencies, predictability, and utterance position on word form variation in English conversation,” Journal of the Acoustical Society of America 113, 1001-1024.
    Bigi, B., and Hirst, D. (2012). “Speech Phonetization Alignment and Syllabification (SPPAS): a tool for automatic analysis of speech prosody,” in Proceedings of Speech Prosody (Shanghai, China), pp. 19-22.
    Bisani, M., and Ney, H. (2008). “Joint-sequence models for grapheme-to-phoneme conversion,” Speech Communication 50, 434-451.
    Boersma, P., and Weenink, D. (2012). Praat: doing phonetics by computer. Software package from http://www.praat.org.
    Byrne, W., Venkataramani, V., Kamm, T., Zheng, T. F., Song, Z., Fung, P., Liu, Y., and Ruhi, U. (2001). “Automatic generation of pronunciation lexicons for Mandarin spontaneous speech,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (Utah, USA), pp. 569-572.
    Chen, K., and Hasegawa-Johnson, M. (2004). “Modeling pronunciation variation using artificial neural networks for English spontaneous speech,” in Proceedings of Interspeech (Jeju Island, South Korea), pp. 400-403.
    Chien, J.-T., and Huang, C.-H. (2003). “Bayesian learning of speech duration models,” IEEE Transactions on Speech and Audio Processing 11, 558-567.
    Chomsky, N., and Halle, M. (1968). The Sound Pattern of English. Harper and Row, New York.
    Chung, R.-F. (1997). “Syllable contraction in Chinese,” in Chinese Languages and Linguistics III: Morphology and Lexicon, edited by F.-F. Tsao and H. Samuel Wang (Academia Sinica, Taipei), pp. 199-235.
    Connolly, J. H. (1997). “Quantifying target-realization differences: Part I: Segments,” Clinical Linguistics & Phonetics 11, 267-287.
    Cohen, M. H. (1989). “Phonological structures for speech recognition,” Ph.D. dissertation, University of California, Berkeley.
    Cooper, W., Soares, C., Ham, A., and Damon, K. (1983). “The influence of inter- and intra-speaker tempo on fundamental frequency and palatalization,” Journal of the Acoustical Society of America 73, 1723-1730.
    Cucchiarini, C., and Binnenpoorte, D. (2002). “Validation and improvement of automatic phonetic transcriptions,” in Proceedings of International Conference on Spoken Language Processing (Denver, Colorado, USA), pp. 313-316.
    Chung, K. S. (2006). “Contraction and Backgrounding in Taiwan Mandarin,” Concentric: Studies in Linguistics 32, 69-88.
    Dilley, L., and Pitt, M. (2007). “A study of regressive place assimilation in spontaneous speech and its implications for spoken word recognition,” Journal of the Acoustical Society of America 122, 2340-2353.
    Duanmu, S. (2000). The phonology of standard Chinese. New York: Oxford University Press.
    Engstrand, O., and Krull, D. (2001). “Segment and syllable reduction: preliminary observations,” in Lund University, Dept. of Linguistics Working Papers, pp. 26-29.
    Ernestus, M. (2000). “Voice assimilation and segment reduction in casual Dutch – A corpus-based study of the phonology-phonetics interface,” Ph.D. dissertation, Utrecht: LOT.
    Ernestus, M., Baayen, R. H., and Schreuder, R. (2002). “The recognition of reduced word forms,” Brain and Language 81, 162-173.
    Ernestus, M., and Warner, N. (2011). “An introduction to reduced pronunciation variants,” Journal of Phonetics 39, 253-260.
    Fosler-Lussier, E. (1999a). “Dynamic pronunciation models for automatic speech recognition,” Ph.D. dissertation, International Computer Science Institute, University of California, Berkeley.
    Fosler-Lussier, E., and Williams, G. (1999b). “Not just what, but also when: Guided automatic pronunciation modeling for broadcast news,” in DARPA Broadcast News Workshop (Herndon, Virginia, USA), pp. 171-174.
    Fukada, T., Yoshimura, T., and Sagisaka, Y. (1999). “Automatic generation of multiple pronunciations based on neural networks,” Speech Communication 27, 63-73.
    Gerstman, L. (1968). “Classification of self-normalized vowels,” IEEE Transactions on Audio and Electroacoustics 16, 78-80.
    Greenberg, S. (1999). “Speaking in shorthand – A syllable-centric perspective for understanding pronunciation variation,” Speech Communication 29, 159-176.
    Hämäläinen, A., Gubian, M., ten Bosch, L., and Boves, L. (2007). “Modelling pronunciation variation using multi-path HMMs for syllables,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (Honolulu HI, USA), pp. 781-784.
    Hämäläinen, A., ten Bosch, L., and Boves, L. (2009a). “Modelling pronunciation variation with single-path and multi-path syllable models: Issues to consider,” Speech Communication 51, 130-150.
    Hämäläinen, A., Gubian, M., ten Bosch, L., and Boves, L. (2009b). “Analysis of acoustic reduction using spectral similarity measures,” Journal of the Acoustical Society of America 126, 3227-3235.
    Hanique, I., Ernestus, M., and Schuppler, B. (2013). “Informal speech processes can be categorical in nature, even if they affect many different words,” Journal of the Acoustical Society of America 133, 1644-1655.
    Hazen, T. J., Hetherington, I. L., Shu, H., and Livescu, K. (2005). “Pronunciation modeling using a finite-state transducer representation,” Speech Communication 46, 189-203.
    Hermansky, H. (1990). “Perceptual linear predictive (PLP) analysis of speech,” Journal of the Acoustical Society of America 87, 1738-1752.
    Hetherington, I. L. (2001). “An efficient implementation of phonological rules using finite-state transducers,” in Proceedings of Interspeech (Aalborg, Denmark), pp. 1599-1602.
    Ho, D.-an. (1996). Some Concepts and Methodology of Phonology. Da-An Press. (in Chinese).
    Hofmann, H., Sakti, S., Isotani, R., Kawai, H., Nakamura, S., and Minker, W. (2010). “Improving spontaneous English ASR using a joint-sequence pronunciation model,” in Proceedings of IEEE International Universal Communication Symposium (Beijing, China), pp. 58-61.
    Holter, T., and Svendsen, T. (1999). “Maximum likelihood modeling of pronunciation variation,” Speech Communication 29, 177-191.
    Hsu, H.-C. (2003). “A sonority model of syllable contraction in Taiwanese Southern Min,” Journal of East Asian Linguistics 12, 349-377.
    Johnson, K. (2004). “Massive reduction in conversational American English,” in Spontaneous speech: Data and analysis, edited by K. Yoneyama, and K. Maekawa, Tokyo: The National International Institute for Japanese Language, pp. 29-54.
    Jurafsky, D., Bell, A., Gregory, M., and Raymond, W. D. (2001a). “Probabilistic relations between words: Evidence from reduction in lexical production,” in Frequency and the emergence of linguistic structure, edited by J. L. Bybee and P. J. Hopper, John Benjamins, pp. 229-254.
    Jurafsky, D., Ward, W., Zhang, J., Herold, K., Yu, X., and Zhang, S. (2001b). “What kind of pronunciation variation is hard for triphones to model?” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (Salt Lake City, UT, USA), pp. 577-580.
    Jurafsky, D., and Martin, J. H. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 2nd edition, Prentice-Hall, Englewood Cliffs, NJ.
    Jyothi, P., Fosler-Lussier, E., and Livescu, K. (2013). “Discriminative training of WFST factors with application to pronunciation modeling,” in Proceedings of Interspeech (Lyon, France), pp. 1961-1965.
    Kaplan, R., and Kay, M. (1994). “Regular models of phonological rule systems,” Computational Linguistics 20, 331-378.
    Karanasou, P., Yvon, F., Lavergne, T., and Lamel, L. (2013). “Discriminative training of a phoneme confusion model for a dynamic lexicon in ASR,” in Proceedings of Interspeech (Lyon, France), pp. 1966-1970.
    Karttunen, L. (1993). “Finite state constraints,” in The Last Phonological Rule, edited by J. Goldsmith, University of Chicago Press, Chicago, Chapter 6, pp. 173-194.
    Kessens, J. M., Wester, M., and Strik, H. (1999). “Improving the performance of a Dutch CSR by modeling within-word and cross-word pronunciation variation,” Speech Communication 29, 193-207.
    Kessens, J. M., Cucchiarini, C., and Strik, H. (2003). “A data-driven method for modeling pronunciation variation,” Speech Communication 40, 517-534.
    Kipp, A., Wesenick, M. B., and Schiel, F. (1996). “Automatic detection and segmentation of pronunciation variants in German speech corpora,” in Proceedings of International Conference on Spoken Language Processing (Philadelphia, USA), pp. 106-109.
    Kipp, A., Wesenick, M. B., and Schiel, F. (1997). “Pronunciation modeling applied to automatic segmentation of spontaneous speech,” in Proceedings of European Conference on Speech Communication and Technology (Rhodes, Greece), pp. 1023-1026.
    Kondrak, G. (2003). “Phonetic alignment and similarity,” Computers and Humanities 37, 273-291.
    Kuhl, P. K., Conboy, B. T., Coffey-Corina, S., Padden, D., Rivera-Gaxiola, M., and Nelson, T. (2008). “Phonetic learning as a pathway to language: new data and native language magnet theory expanded (NLM-e),” Journal of Philosophical Transactions of the Royal Society B: Biological Sciences 363, 979-1000.
    Ladefoged, P. (2006). A Course in Phonetics. Thomson Wadsworth, Boston, MA, Chapter 6-9, pp. 133-236.
    Lindblom, B. (1990). “Explaining phonetic variation: A sketch of the H&H theory,” in Speech Production and Speech Modelling, edited by W. J. Hardcastle, and A. Marchal, Springer, Netherlands, pp. 403-439.
    Liu, Y., and Fung, P. (2004a). “State-dependent phonetic tied mixtures with pronunciation modeling for spontaneous speech recognition,” IEEE Transactions on Speech and Audio Processing 12, 351-364.
    Liu, Y., and Fung, P. (2004b). “Pronunciation modeling for spontaneous Mandarin speech recognition,” International Journal of Speech Technology 7, 155-172.
    Liu, Y.-F., and Tseng, S.-C. (2009). “Linguistic patterns detected through a prosodic segmentation in Spontaneous Taiwan Mandarin speech,” in Linguistic Patterns in Spontaneous Speech, edited by S.-C. Tseng, Institute of Linguistics, Academia Sinica, Taiwan, pp. 147-166.
    Liu, Y.-F., Tseng, S.-C., and Jang, R. J.-S. (2014). “Phone boundary annotation in conversational speech,” in Proceedings of Language Resources and Evaluation Conference (Reykjavik, Iceland), pp. 848-853.
    Lobanov, B. M. (1971). “Classification of Russian vowels spoken by different speakers,” Journal of the Acoustical Society of America 49, 606-608.
    Ma, W.-Y., and Chen, K.-J. (2004). “Design of CKIP Chinese word segmentation system,” International Journal of Asian Language Processing 14, 235-249.
    Maekawa, K. (2003). “Corpus of spontaneous Japanese: its design and evaluation,” in Proceedings of ICSA & IEEE Workshop on Spontaneous Speech Processing and Recognition (Tokyo, Japan), pp. 7-12.
    McGraw, I., Badr, I., and Glass, J. R. (2013). “Learning lexicons from speech using a pronunciation mixture model,” IEEE Transactions on Audio, Speech, and Language Processing 21, 357-366.
    Mohri, M., Pereira, F., and Riley, M. (2002). “Weighted finite-state transducers in speech recognition,” Computer Speech and Language 16, 69-88.
    Oostdijk, N. (2002). “The design of Spoken Dutch Corpus,” in New Frontiers of Corpus Research, edited by P. Peters, P. Collins, and A. Smith (Rodopi, Amsterdam), pp. 105-112.
    Pierrehumbert, J. (1994). “Knowledge of variation,” in CLS 30 Vol. 2: Papers from the parasession on variation, edited by K. Beals, J. Denton, R. Knippen, L. Melnar, H. Suzuki and E. Zeinfeld (Chicago IL, USA), pp. 232-256.
    Pitt, M. A., Johnson, K., Hume, E., Kiesling, S., and Raymond, W. (2005). “The Buckeye corpus of conversational speech: labeling conventions and a test of transcriber reliability,” Speech Communication 45, 89-95.
    Pitt, M. A., Dilley, L., and Tat, M. (2011). “Exploring the role of exposure frequency in recognizing pronunciation variants,” Journal of Phonetics 39, 304-311.
    Riley, M., Byrne, W., Finke, M., Khudanpur, S., Ljolje, A., McDonough, J., Nock, H., Saraclar, M., Wooters, C., and Zavaliagkos, G. (1999). “Stochastic pronunciation modelling from hand-labelled phonetic corpora,” Speech Communication 29, 209-224.
    Sakti, S., Markov, K., and Nakamura, S. (2008). “Probabilistic pronunciation variation model based on Bayesian Network for conversational speech recognition,” in Proceedings of the Second International Symposium on Universal Communication (Osaka, Japan), pp. 405-410.
    Saraçlar, M., Nock, H., and Khudanpur, S. (2000). “Pronunciation modeling by sharing Gaussian densities across phonetic models,” Computer Speech and Language 14, 137-160.
    Schramm, H., Aubert, X., Bakker, B., Meyer, C., and Ney, H. (2006). “Modeling spontaneous speech variability in professional dictation,” Speech Communication 48, 493-515.
    Schuppler, B., Ernestus, M., Scharenborg, O., and Boves, L. (2011). “Acoustic reduction in conversational Dutch: A quantitative analysis based on automatically generated segmental transcriptions,” Journal of Phonetics 39, 96-109.
    Schuppler, B., van Dommelen, W. A., Koreman, J., and Ernestus, M. (2012). “How linguistic and probabilistic properties of a word affect the realization of its final /t/: Studies at the phonemic and sub-phonemic level,” Journal of Phonetics 40, 595-607.
    Seneff, S., and Wang, C. (2005). “Statistical modeling of phonological rules through linguistic hierarchies,” Speech Communication 46, 204-216.
    Shafran, I. (2001). “Clustering wide-contexts and HMM topologies for spontaneous speech recognition,” Ph.D. dissertation, Department of Electrical Engineering, University of Washington.
    Shattuck-Hufnagel, S., and Veilleux, N. (2007). “Robustness of acoustic landmarks in spontaneously-spoken American English,” in Proceedings of International Congress of Phonetic Sciences (Saarbrucken, Germany), pp. 123-128.
    Shriberg, L. D., and Lof, G. L. (1991). “Reliability studies in broad and narrow phonetic transcription,” Clinical Linguistics and Phonetics 5, 225-279.
    Sloboda, T., and Waibel, A. (1996). “Dictionary learning for spontaneous speech recognition,” in Proceedings of International Conference on Spoken Language Processing (Philadelphia, USA), pp. 2328-2331.
    Sproat, R., and Riley, M. (1996). “Compilation of weighted finite-state transducers from decision trees,” in Proceedings of annual meeting on Association for Computational Linguistics (Santa Cruz, California, USA), pp. 215-222.
    Stolcke, A. (2002). “SRILM-an extensible language modeling toolkit,” in Proceedings of International Conference on Spoken Language Processing (Denver, Colorado, USA), pp. 901-904.
    Strik, H., and Cucchiarini, C. (1999). “Modeling pronunciation variation for ASR: A survey of the literature,” Speech Communication 29, 225-246.
    Tajchman, G., Fosler, E., and Jurafsky, D. (1995). “Building multiple pronunciation models for novel words using exploratory computational phonology,” in Proceedings of European Conference on Speech Communication and Technology (Madrid, Spain), pp. 2247-2250.
    Torre, D., Villarrubia, L., Hernandez, L., and Elvira, L. M. (1997). “Automatic alternative transcription generation and vocabulary selection for flexible word recognizers,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (Munich, Germany), pp. 1463-1466.
    Torreira, F., and Ernestus, M. (2011). “Vowel elision in casual French: The case of vowel /e/ in the word c’était,” Journal of Phonetics 39, 50-58.
    Tsai, M.-Y., Chou, F.-C., and Lee, L.-S. (2007). “Pronunciation modeling with reduced confusion for Mandarin Chinese using a three-stage framework,” IEEE Transactions on Audio, Speech, and Language Processing 15, 661-675.
    Tseng, H.-H., Chang, P.-C., Andrew, G., Jurafsky, D., and Manning, C. (2005). “A Conditional Random Field word segmenter for SIGHAN bakeoff 2005,” in Proceedings of SIGHAN workshop on Chinese Language Processing (Jeju Island, South Korea), pp. 168-171.
    Tseng, S.-C. (2005). “Contracted Syllables in Mandarin: Evidence from Spontaneous Conversations,” Language and Linguistics 6, 153-180.
    Tseng, S.-C. (2013a). “Lexical coverage in Taiwan Mandarin conversation,” International Journal of Computational Linguistics and Chinese Language Processing 18, 1-18.
    Tseng, S.-C., Soemer, A., and Lee, T.-L. (2013b). “Tones of Reduced T1-T4 Mandarin Disyllables,” International Journal of Computational Linguistics and Chinese Language Processing 18, 81-106.
    Tseng, S.-C. (2016). “/kwo/ and / y/ in Taiwan Mandarin: social factors and phonetic variation,” Language and Linguistics 17, 383-405.
    Van Bael, C., Boves, L., van den Heuvel, H., and Strik, H. (2007a). “Automatic phonetic transcription of large speech corpora,” Computer Speech and Language 21, 652-668.
    Van Bael, C., Baayen, H., and Strik, H. (2007b). “Segment deletion in spontaneous speech: A corpus study using Mixed Effects Models with crossed random effects,” in Proceedings of Interspeech (Antwerp, Belgium), pp. 2741-2744.
    Weintraub, M., Fosler-Lussier, E., Galles, C., Kao, Y.-H., Khudanpur, S., Saraclar, M., and Wegmann, S. (1996). “WS96 project report: Automatic learning of word pronunciation from data,” presented at the JHU Workshop Pronunciation Group (Baltimore, USA).
    Wester, M., and Fosler-Lussier, E. (2000). “A comparison of data-derived and knowledge-based modeling of pronunciation variation,” in Proceedings of International Conference on Spoken Language Processing (Beijing, China), pp. 270-273.
    Wester, M. (2003). “Pronunciation modeling for ASR – knowledge-based and data-derived methods,” Computer Speech and Language 17, 69-85.
    Williams, G., and Renals, S. (1998). “Confidence measures for evaluating pronunciation models,” in Proceedings of ISCA workshop on Modeling Pronunciation for Automatic Speech Recognition (Rolduc, Netherlands), pp. 151-156.
    Xue, N., and Shen, L. (2003). “Chinese word segmentation as LMR tagging,” in Proceedings of SIGHAN workshop on Chinese Language Processing (Sapporo, Japan), pp. 176-179.
    Yang, Q., and Martens, J.-P. (2000). “Data-driven lexical modeling of pronunciation variation for ASR,” in Proceedings of International Conference on Spoken Language Processing (Beijing, China), pp. 417-420.
    Yang, Q., Martens, J.-P., Ghesquiere, P.-J., and Van Compernolle, D. (2002). “Pronunciation variation modeling for ASR: Large improvements are possible but small ones are likely to achieve,” in Proceedings of ISCA workshop on Pronunciation Modeling and Lexicon Adaptation for Spoken Language Technology (Colorado, USA), pp. 123-128.
    Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X. A., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., and Woodland, P. (2006). The HTK Book 3.4, Cambridge University Press, London.
    Zhao, H., Huang, C.-N., and Li, M. (2006). “An improved Chinese word segmentation system with Conditional Random Field,” in Proceedings of SIGHAN workshop on Chinese Language Processing (Sydney, Australia), pp. 162-165.

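    Full-text release date: full text not authorized for public release (campus network)
    Full-text release date: full text not authorized for public release (off-campus network)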
