T3 Taiwanese Treebank and Brill Part-of-Speech Tagger | National Tsing Hua University Master's and Doctoral Thesis Repository

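Graduate student:	周思源
Thesis title:	T3 Taiwanese Treebank and Brill Part-of-Speech Tagger (T3台語剖析樹語料庫與Brill詞類標記)
Advisor:	江永進
Oral defense committee:
Degree:	Master
Department:	College of Science - Institute of Statistics
Year of publication:	2006
Academic year of graduation:	94 (ROC calendar)
Language:	Chinese
Number of pages:	42
Keywords (Chinese):	Part-of-Speech Tagging, N-gram Language Model, Markov Model, Hidden Markov Model, Viterbi Algorithm, K-Fold Cross Validation, Brill Part-of-Speech Tagging
Keywords (English):	Part-of-Speech Tagging, N-gram, Markov Language Model, Hidden Markov Model, Viterbi Algorithm, Deleted Interpolation, K-fold Cross Validation, Transformation-Based Error-Driven Learning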

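Given a sequence of words, part-of-speech (POS) tagging assigns each word a part of speech, that is, it pairs the word sequence with a suitable sequence of POS tags. This is a fundamental problem in natural language processing. In this thesis, we apply Brill POS tagging to a portion of the Taiwanese part of the T3 treebank. Brill tagging has two stages: transformation rules are first learned from training data, and the rules are then applied to obtain the tag sequence. It is an error-driven learning procedure whose output is a set of tag transformation rules. Brill tagging builds on the output of another tagger and improves it further. The underlying taggers are commonly N-gram language models; here we tag with uni-gram, bi-gram, and tri-gram Markov models and hidden Markov models. Besides reporting POS tagging performance on the T3 corpus, we use confusion matrices to detect, inspect, and correct inconsistencies in the corpus annotation. The best tagging accuracy obtained is 92.80% on the inside test and 85.59% on the outside test.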
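The two stages of Brill tagging described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: it assumes a unigram (most-frequent-tag) baseline, a single rule template ("change tag a to b when the previous tag is z"), and toy English data with a made-up tag set (`D`, `N`, `V`, `TO`) rather than the T3 tag set. The function names `train_unigram`, `apply_rule`, `learn_rules`, and `tag` are hypothetical.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Baseline tagger: map each word to its most frequent training tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, t in sent:
            counts[word][t] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def apply_rule(tags, rule):
    """Rule (a, b, z): change tag a to b when the previous tag is z."""
    a, b, z = rule
    out = list(tags)
    for i in range(1, len(tags)):
        # The context is read from the input tags (an assumed design choice).
        if tags[i] == a and tags[i - 1] == z:
            out[i] = b
    return out

def count_errors(pred, gold):
    return sum(p != g for ps, gs in zip(pred, gold) for p, g in zip(ps, gs))

def learn_rules(tagged_sents, default_tag="N", max_rules=10):
    """Stage 1: greedily pick the rule that removes the most errors."""
    lexicon = train_unigram(tagged_sents)
    gold = [[t for _, t in s] for s in tagged_sents]
    current = [[lexicon.get(w, default_tag) for w, _ in s] for s in tagged_sents]
    rules = []
    while len(rules) < max_rules:
        # Candidate rules are proposed only at positions that are currently wrong.
        candidates = {(cur[i], ref[i], cur[i - 1])
                      for cur, ref in zip(current, gold)
                      for i in range(1, len(cur)) if cur[i] != ref[i]}
        base = count_errors(current, gold)
        best_rule, best_gain = None, 0
        for rule in sorted(candidates):
            trial = [apply_rule(t, rule) for t in current]
            gain = base - count_errors(trial, gold)  # net error reduction
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:  # no rule reduces the error count any further
            break
        current = [apply_rule(t, best_rule) for t in current]
        rules.append(best_rule)
    return lexicon, rules

def tag(words, lexicon, rules, default_tag="N"):
    """Stage 2: baseline tags first, then the learned rules in order."""
    tags = [lexicon.get(w, default_tag) for w in words]
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags

# Toy training data (illustrative only, not from the T3 corpus).
TRAIN = [
    [("the", "D"), ("race", "N")],
    [("a", "D"), ("race", "N")],
    [("to", "TO"), ("race", "V")],
]
lexicon, rules = learn_rules(TRAIN)
```

On this toy data the baseline tags "race" as `N` everywhere, and the learner finds one rule that retags it as `V` after `TO`. A real system would use many rule templates and a stronger baseline, such as the N-gram models named in the abstract.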

Part-of-speech tagging is a basic problem in natural language processing. In this thesis, we study the performance of the Brill tagger (1992) on part of the T3 Taiwanese treebank. The Brill tagger is a transformation-based, error-driven approach. Starting from the output of another tagging method, such as an N-gram language model, it learns a set of transformation rules from an annotated corpus. The learning process is error-driven in that its objective is to minimize the tagging errors, computed by comparing the transformed results with the standard annotated corpus. Annotated corpora often suffer from inconsistency problems, which we also study using the confusion matrix. The best tagging accuracies we obtained are 92.80% for the inside test and 85.59% for the outside test.
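As a rough illustration of how a confusion matrix can expose systematic tag confusions, the sketch below counts (gold tag, predicted tag) pairs, derives accuracy from the diagonal, and lists the most frequent off-diagonal cells. The tag sequences are invented for the example, and the helper names `confusion_matrix`, `accuracy`, and `top_confusions` are hypothetical; the thesis's actual procedure may differ.

```python
from collections import Counter

def confusion_matrix(gold, pred):
    """Count (gold_tag, predicted_tag) pairs over two aligned tag sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return Counter(zip(gold, pred))

def accuracy(cm):
    """Share of tokens on the diagonal (gold tag equals predicted tag)."""
    total = sum(cm.values())
    correct = sum(n for (g, p), n in cm.items() if g == p)
    return correct / total if total else 0.0

def top_confusions(cm, k=3):
    """The k most frequent off-diagonal cells, ties broken by tag names."""
    off = [((g, p), n) for (g, p), n in cm.items() if g != p]
    return sorted(off, key=lambda item: (-item[1], item[0]))[:k]

# Invented tag sequences (illustrative only, not T3 data).
gold = ["N", "V", "N", "A", "V", "N"]
pred = ["N", "N", "N", "A", "V", "V"]
cm = confusion_matrix(gold, pred)
print(accuracy(cm))
print(top_confusions(cm))
```

Cells with large off-diagonal counts point to tag pairs that are worth inspecting by hand, either as tagger errors or as inconsistent annotation in the corpus.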

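Chapter 1   Introduction
Chapter 2   The T3 Treebank
2.1   The T3 Treebank
2.2   POS Tag Set of the T3 Treebank
Chapter 3   Model Theory
3.1   N-gram Models
3.2   Markov Models
3.3   Hidden Markov Models
3.4   The Viterbi Algorithm
3.5   Smoothing
3.6   K-fold Cross Validation
3.7   Computing Accuracy
3.8   Brill POS Tagging
Chapter 4   Model Application and Results
4.1   Markov Model with Lexicalized (word/tag-word/tag) Transition Probabilities
4.2   Hidden Markov Model with Non-Lexicalized (tag-tag) Transition Probabilities
4.3   Applying Brill POS Tagging
Chapter 5   Confusion Matrix
Chapter 6   Conclusion
Appendix
References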

1. Brill, Eric (1992), "A Simple Rule-Based Part of Speech Tagger", in Proceedings of the Third Conference on Applied Natural Language Processing, ACL, Trento, Italy.
2. Brill, Eric (1995), "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging", Computational Linguistics, 21(4): 543-565.
3. Daniel Jurafsky and James H. Martin (2000), "Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition", Prentice Hall.
4. Fei Xia (2000), "The Bracketing Tagging Guidelines for the Penn Chinese Treebank (3.0)", http://www.cis.upenn.edu/~chinese/parsequide.3rd.ch.pdf.
5. Fei Xia (2000), "The Part-Of-Speech Tagging Guidelines for the Penn Chinese Treebank (3.0)", http://www.cis.upenn.edu/~chinese/posguide.3rd.ch.pdf.
6. Fei Xia (2000), "The Segmentation Guidelines for the Penn Chinese Treebank (3.0)", http://www.cis.upenn.edu/~chinese/segguide.3rd.ch.pdf.
7. Jelinek, Frederick (1997), "Statistical Methods for Speech Recognition", Cambridge, Mass.: MIT Press.
8. 朱德熙 (1982), "語法講義" [Lectures on Grammar], Beijing: The Commercial Press.
9. 朱德熙 (1984), "語法答問" [Questions and Answers on Grammar], Beijing: The Commercial Press.
10. 陸儉明 (2003), "對「NP+的+VP」結構的重新認識" [A Reanalysis of the "NP + 的 + VP" Construction], Peking University.
11. 洪俊詠 (2005), "馬可夫語言模型應用di台語變調gah注音" [Markov Language Models Applied to Taiwanese Tone Sandhi and Phonetic Annotation], master's thesis, Institute of Statistics, National Tsing Hua University.
12. 劉亦真 (2005), "建立T3剖析樹語料庫：台語部分" [Building the T3 Treebank: The Taiwanese Part], master's thesis, Institute of Statistics, National Tsing Hua University.

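Full-text release date: full text not authorized for public release (on-campus network)
Full-text release date: full text not authorized for public release (off-campus network)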
