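A Study of Chinese Word Segmentation under LangGeh Orthography | National Tsing Hua University Theses and Dissertations Repository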


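Graduate student:	王建傑 Wang, Jian Jie
Thesis title:	讓格書寫下之斷詞探討 A Study of Chinese Word Segmentation under LangGeh Orthography
Advisor:	江永進 Jiang, Yong Jin
Oral defense committee:	呂仁園, 高明達
Degree:	Master
Department:	College of Science - Institute of Statistics
Year of publication:	2013
Academic year of graduation:	101 (ROC calendar)
Language:	Chinese
Number of pages:	49
Keywords (Chinese):	Chinese word segmentation, segmentation standard, avoiding isolated single-character words, LangGeh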

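Chinese word segmentation is a basic step in information processing, but the definition of a Chinese word is vague, which limits its applications. The main segmentation standard in Taiwan is the CKIP specification of Academia Sinica (CKIP, 1997 [8]), which is built on semantics, syntax, and usage frequency. This thesis proposes a new segmentation standard. Its main idea is to avoid leaving single-character words isolated, to reduce fragmented segmentation results, and to give character count a larger role in the standard, so that the standard becomes simpler and easier to use. Under the new standard, we prepared an internet text of nearly 30,000 characters, rewrote it in LangGeh orthography, segmented it by the new standard, and then wrote a simple segmentation system, whose segmentation F-measure reaches 98%. By comparison, simple longest-match segmentation reaches only about 70%, while traditional segmentation of conventionally written text, with models trained on large corpora, reaches about 96%. The method is simple to use and simple to implement.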
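The longest-match baseline and the F-measure mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's code: the function names, the tiny lexicon, the `max_len` parameter, and the example sentence are all hypothetical, and the F-measure here counts a predicted word as correct only when both of its boundaries agree with the reference segmentation.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest-match: at each position take the longest lexicon entry."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always succeeds.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def f_measure(gold_words, predicted_words):
    """Harmonic mean of word-level precision and recall."""
    gold = to_spans(gold_words)
    predicted = to_spans(predicted_words)
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical lexicon and reference segmentation, for illustration only.
lexicon = {"中文", "斷詞", "標準", "中文斷詞"}
gold = ["中文", "斷詞", "標準"]
predicted = forward_max_match("中文斷詞標準", lexicon)
print(predicted)                   # ['中文斷詞', '標準']
print(f_measure(gold, predicted))  # 1 of 2 predicted and 1 of 3 gold words match
```

Greedy matching commits to the longest entry it finds, so one over-long lexicon entry can displace several reference words, which is one way a simple baseline loses accuracy.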

The concept of a word in Mandarin Chinese is not well defined. As a result, word segmentation, a basic and important module in Chinese natural language processing, is difficult to implement. The primary word segmentation standard in Taiwan is the CKIP standard of Academia Sinica, which defines a word using semantics, syntax, and usage frequency. We propose an additional singleton-avoiding principle that minimizes single-character words in a segmented text. More specifically, two-character and three-character strings are in principle treated as single words. Because the number of characters is used in defining a word, the standard becomes easy to follow. Furthermore, when Chinese sentences are written with spaces between simple short phrases (called LangGeh orthography) instead of the traditional way without spaces, the segmentation module becomes much easier to implement. A segmentation module implemented in Python was tested on a corpus of around 30,000 characters, collected from the internet and transformed into LangGeh orthography. It reaches an F-measure of 98%, which compares favorably with the roughly 96% of traditional word segmentation trained on a large amount of data. For marginalized languages such as Taiwanese and Hakka, LangGeh orthography and the new segmentation standard appear to be a promising way forward.

Keywords: Chinese word segmentation, singleton-avoiding principle, LangGeh orthography, segmentation standard.
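To make the idea concrete, here is a toy sketch of how spaces in LangGeh text and the singleton-avoiding principle could simplify segmentation. It is an assumption-laden illustration, not the thesis's system, which according to its table of contents uses many dedicated procedures. The function names, the `short_len` and `max_len` parameters, the lexicon, and the spacing of the example sentence are all hypothetical.

```python
def longest_match(chunk, lexicon, max_len=4):
    """Greedy longest-match fallback for chunks that are not kept whole."""
    words = []
    i = 0
    while i < len(chunk):
        for size in range(min(max_len, len(chunk) - i), 0, -1):
            candidate = chunk[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def segment_langgeh(text, lexicon, short_len=3, max_len=4):
    """Segment space-delimited text one chunk at a time."""
    words = []
    for chunk in text.split():
        if len(chunk) <= short_len:
            # Assumed rule: a chunk of up to three characters stays one word,
            # so no single character is split off from it.
            words.append(chunk)
        else:
            # Longer chunks fall back to dictionary matching in this sketch.
            words.extend(longest_match(chunk, lexicon, max_len))
    return words


# Hypothetical LangGeh-style spacing and lexicon, for illustration only.
print(segment_langgeh("我們 準備了 一份 網路文章", {"網路", "文章"}))
# ['我們', '準備了', '一份', '網路', '文章']
```

Because the writer's spaces already mark phrase boundaries, the segmenter only has to decide what happens inside each chunk, and here "準備了" stays whole instead of leaving "了" on its own.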

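Chapter 1 Introduction
1 Research background
2 Research motivation
3 Chapter overview
Chapter 2 Segmentation Methods and LangGeh Orthography
1 Segmentation methods
1.1 Dictionary-based segmentation
1.2 Hidden Markov Model (HMM)
2 LangGeh orthography
2.1 Brief rules of LangGeh orthography
Chapter 3 The Fragment-Avoiding Segmentation Principle
1 The fragment-avoiding segmentation principle and its implications
2 Examples of fragment-avoiding segmentation
3 Introduction to the test corpus
4 Evaluation criterion: F-measure
Chapter 4 Implementation and Evaluation of LangGeh Text Segmentation
1 Segmentation workflow and results
2 Recursive procedure of the segmentation system
3 Special-case procedures
3.1 Hint dictionary
3.2 Mixed Chinese and English
3.3 Procedure for 「的」
3.4 "So-and-so said" (某某某說)
3.5 Reduplicated word formation
4 Grammar-based procedures
4.1 Numeral-classifier procedure
4.2 Particle procedure
4.3 Adverb rules
4.4 Suffix procedure
4.5 Conjunction procedure
4.6 Preposition procedure
4.7 Localizer procedure
5 Character-count procedures
5.1 Phrases of three characters or fewer
5.2 Four-character phrases
5.3 Phrases of five or more characters
6 Conclusion
Chapter 5 Segmenting Articles in LangGeh Orthography
1 Preface
2 LLCCS
3 LLCCS new-word extraction in unspaced writing and its segmentation performance
4 LLCCS new-word extraction in LangGeh orthography and its segmentation performance
Chapter 6 Conclusion
References
Appendix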


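[1] Hongmei Zhao and Qun Liu (2010). "The CIPS-SIGHAN CLP 2010 Chinese Word Segmentation Bakeoff". In Proceedings of the First CIPS-SIGHAN Joint Conference on Chinese Language Processing. Beijing, China.
[2] 江永進, 楊佩琦, 林淑卿, 張春凰, 高明達, 呂仁園, 陳孟彰 (2009). "LangGeh orthography and a preliminary study of Taiwanese-Mandarin mutual translation". Proceedings of the 21st Conference on Computational Linguistics and Speech Processing (ROCLING 2009), pp. 399-413.
[3] 李佳鴻 (2010). "A preliminary study of automatic Taiwanese phonetic annotation under LangGeh orthography". Master's thesis, Institute of Statistics, National Tsing Hua University, Hsinchu City.
[4] 陳建忠 (2010). "A preliminary study of 延複詞 and 延複詞 categories". Master's thesis, Institute of Statistics, National Tsing Hua University, Hsinchu City.
[5] 謝博行 (2013). "Local longest continuous common subsequences and new-word collection". Master's thesis, Institute of Statistics, National Tsing Hua University, Hsinchu City.
[6] 林千翔 (2006). "A study of Chinese word segmentation based on specialized hidden Markov models". Master's thesis, Institute of Computer Science and Information Engineering, National Central University, Taoyuan County.
[7] CKIP word segmentation (2013). Chinese Word Segmentation System (http://asbc.iis.sinica.edu.tw/) (provides an online segmentation service).
[8] CKIP specification (1996). "'Sou' Wen Jie Zi: A Study of Chinese Word Boundaries and a Segmentation Standard for Information Processing". Technical Report 96-01, Chinese Knowledge Information Processing Group. Taipei: Academia Sinica.
[9] 何孟翰 (October 4, 2012). "Super Illustrated Guide to Entering the App Store! iOS6 SDK Hands-On Practice". 悅知文化.
[10] 林宏翰 (January 9, 2013). "A sensation before the first game: signs of a revival for the CPBL". Central News Agency. Retrieved June 14, 2013, from http://www.cna.com.tw/News/aSaM/201301090378-1.aspx.
[11] Liberty Times e-paper, http://www.libertytimes.com.tw/
[12] Python 3.2.3 (2012), http://www.python.org/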

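Full-text release date: full text not authorized for public release (campus network)
Full-text release date: full text not authorized for public release (off-campus network)
Full-text release date: full text not authorized for public release (National Central Library: National Digital Library of Theses and Dissertations in Taiwan)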
