Learning Semantic Parsing Using Probabilistic Context-Free Grammar in Chinese Poetry Domains


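Graduate student:	傅怡婷 Fu Yi-Ting
Thesis title:	在中國近體詩中用機率式免境語法學習語義剖析 Learning Semantic Parsing Using Probabilistic Context-Free Grammar in Chinese Poetry Domains
Advisor:	蘇豐文 Von-Wun Soo
Oral defense committee:
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications
Year of publication:	2004
Graduation academic year:	93 (ROC calendar)
Language:	English
Number of pages:	61
Chinese keywords:	機率式免境語法 (probabilistic context-free grammar), 語義剖析 (semantic parsing), 中國近體詩 (Chinese regulated verse)
English keywords:	Probabilistic Context-Free Grammar, Semantic Parsing, Classical Chinese Poetry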

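Because Chinese regulated verse has no explicit grammar, a reader needs a great deal of background knowledge to connect the concepts in a line and thereby grasp the poem's imagery and meaning. In natural language processing, probabilistic models are typically used for part-of-speech tagging; this thesis applies a probabilistic model to the construction of semantics.
However, when a large corpus is unavailable, a probabilistic model cannot predict well. We therefore exploit the characteristics of poetry lines to build general semantic rules, construct parse trees of the lines for prediction, and analyze the semantic structure hidden in each line. This is the groundwork for building a semantic repository of poetry lines.
We expect that semantic parse trees of poetry lines can speed up large-scale semantic-code annotation and help resolve ambiguity, and that matching on semantic structure can support information retrieval, translation of poems into vernacular Chinese (machine translation), automatic poem composition (poetry generation), and similar tasks.
This thesis constructs a semantic grammar for Chinese regulated verse, learns the probability of each rule from an annotated corpus, and parses with the Viterbi algorithm. Finally, two experiments evaluate how well the grammar model fits poetry lines and compare the probabilistic context-free grammar (PCFG) against a hidden Markov model (HMM) bi-gram tagger on the accuracy of semantic-code tagging and disambiguation.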
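The combination of rule probabilities and Viterbi parsing described above can be sketched on a toy grammar. This is an illustration only, not the thesis's grammar: the category labels (`LINE`, `SCENE`, `EVENT`, `PLACE`, `OBJECT`, `ACTION`), every probability, and the segmentation of the line 水中明月臥浮圖 into four tokens are hypothetical. The sketch also assumes the grammar is in Chomsky normal form so that a CKY-style chart applies.

```python
from collections import defaultdict

# Hypothetical toy PCFG in Chomsky normal form (not taken from the thesis).
# Binary rules: (parent, left child, right child) -> probability.
BINARY = {
    ("LINE", "SCENE", "EVENT"): 1.0,
    ("SCENE", "PLACE", "OBJECT"): 1.0,
    ("EVENT", "ACTION", "OBJECT"): 1.0,
}
# Lexical rules: (category, token) -> probability.
LEXICAL = {
    ("PLACE", "水中"): 1.0,
    ("OBJECT", "明月"): 0.5,
    ("OBJECT", "浮圖"): 0.5,
    ("ACTION", "臥"): 1.0,
}


def viterbi_parse(tokens, start="LINE"):
    """Return (probability, tree) of the most likely parse, or (0.0, None)."""
    n = len(tokens)
    # best[(i, j)][label] = (best probability, backpointer) for tokens[i:j].
    best = defaultdict(dict)
    for i, tok in enumerate(tokens):
        for (label, word), p in LEXICAL.items():
            if word == tok:
                best[(i, i + 1)][label] = (p, tok)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point
                for (parent, left, right), p in BINARY.items():
                    if left in best[(i, k)] and right in best[(k, j)]:
                        prob = p * best[(i, k)][left][0] * best[(k, j)][right][0]
                        # Keep only the highest-probability derivation (max, not sum).
                        if prob > best[(i, j)].get(parent, (0.0, None))[0]:
                            best[(i, j)][parent] = (prob, (k, left, right))
    if start not in best[(0, n)]:
        return 0.0, None

    def build(i, j, label):
        # Follow backpointers to rebuild the tree as nested tuples.
        _, back = best[(i, j)][label]
        if isinstance(back, str):
            return (label, back)
        k, left, right = back
        return (label, build(i, k, left), build(k, j, right))

    return best[(0, n)][start][0], build(0, n, start)


prob, tree = viterbi_parse(["水中", "明月", "臥", "浮圖"])
print(prob, tree)
```

Replacing the `max` comparison with a running sum over derivations would turn the same chart into an inside-probability table, which is the quantity the Inside-Outside algorithm builds on.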

Statistical models have been used quite successfully in natural language processing to recover hidden structure such as part-of-speech tags or syntactic structure. This thesis considers semantic parsing and tagging of classical Chinese poetry lines.
This thesis has five aims: (1) construct semantic grammars; (2) modify the semantic grammars and learn their probabilities from the training corpus; (3) parse each sentence into a tree structure; (4) evaluate the accuracy of the parsing results; and (5) compare the parser with a Hidden Markov Model bi-gram tagger.
For the first three tasks, we assumed that the categories of the Chinese thesaurus are representative enough to help us analyze the semantics of the sentences. The semantic grammars were built upon these semantic categories and semantic rules. We modified the grammars and learned the probabilities from training data with the Inside-Outside algorithm, and used the Viterbi algorithm to find the most likely parse.
For the last two tasks, we found that the PCFG semantic parser predicts semantic tags better than the HMM bi-gram tagger when data are sparse, and that it is better at disambiguation. We believe the parsing results may be broadly useful in machine translation, poetry generation, and other applications in the future.
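For readers unfamiliar with the baseline, a bi-gram HMM tagger can be sketched as follows. This is a minimal illustration, not the thesis's tagger: the tag set, the start, transition, and emission probabilities, and the segmentation of 水中明月臥浮圖 are all hypothetical. The token 臥 is deliberately given two possible tags so that the decoder has an ambiguity to resolve.

```python
# Hypothetical toy bi-gram HMM (not taken from the thesis).
TAGS = ["PLACE", "OBJECT", "ACTION"]
START_P = {"PLACE": 0.6, "OBJECT": 0.4}            # P(first tag)
TRANS_P = {                                         # P(tag | previous tag)
    ("PLACE", "OBJECT"): 1.0,
    ("OBJECT", "ACTION"): 0.7,
    ("OBJECT", "OBJECT"): 0.3,
    ("ACTION", "OBJECT"): 1.0,
}
EMIT_P = {                                          # P(token | tag)
    ("PLACE", "水中"): 1.0,
    ("OBJECT", "明月"): 0.45,
    ("OBJECT", "浮圖"): 0.45,
    ("OBJECT", "臥"): 0.1,
    ("ACTION", "臥"): 1.0,
}


def hmm_viterbi(tokens, tags=TAGS, start_p=START_P, trans_p=TRANS_P, emit_p=EMIT_P):
    """Return (probability, tag sequence) of the most likely tagging."""
    if not tokens:
        return 1.0, []
    # trellis[t][tag] = (best probability of a path ending in tag at t, previous tag)
    trellis = [{
        tag: (start_p.get(tag, 0.0) * emit_p.get((tag, tokens[0]), 0.0), None)
        for tag in tags
    }]
    for t in range(1, len(tokens)):
        column = {}
        for tag in tags:
            # Pick the predecessor that maximises path probability times transition.
            prev, score = max(
                ((p, trellis[t - 1][p][0] * trans_p.get((p, tag), 0.0)) for p in tags),
                key=lambda pair: pair[1],
            )
            column[tag] = (score * emit_p.get((tag, tokens[t]), 0.0), prev)
        trellis.append(column)
    # Trace back from the best final tag.
    last = max(tags, key=lambda tag: trellis[-1][tag][0])
    path = [last]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(trellis[t][path[-1]][1])
    path.reverse()
    return trellis[-1][last][0], path


prob, path = hmm_viterbi(["水中", "明月", "臥", "浮圖"])
print(prob, path)
```

The tagger conditions each tag only on the previous tag, whereas a PCFG scores a whole tree over the line, so the two models resolve the same ambiguity from different evidence.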

TABLE OF CONTENTS
Chinese Abstract
Abstract
Acknowledgement
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Related work
1.4 Organization of this thesis
Chapter 2 System Overview
2.1 Data Description
2.2 The Preprocessing of corpus
2.3 Rule Induction
Chapter 3 PCFG EM Algorithm
3.1 PCFG Introduction
3.2 Notations of PCFG
3.3 Learning the probabilities
3.4 Parsing
3.5 HMM tagger
Chapter 4 Evaluation
4.1 Experiment 1
4.2 Experiment 2
4.3 Results Discussion
Chapter 5 Conclusions and Future Work
Reference
Appendix
A. Grammar induction
B. Dependency rules
C. Training examples
D. Inside probabilities of the sentence ‘水中明月臥浮圖’
E. Outside probabilities of the sentence ‘水中明月臥浮圖’

                                


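Full-text availability: not authorized for public release (on-campus network)
Full-text availability: not authorized for public release (off-campus network)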
