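Locally Longest Common Consecutive Subsequence and Collection of New Phrases | National Tsing Hua University Master's and Doctoral Theses Repository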


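Graduate student:	謝博行 Sie, Bo-Sing
Thesis title:	局部最長連續共同子序列與新詞組收集 Locally Longest Common Consecutive Subsequence and Collection of New Phrases
Advisor:	江永進 Chiang, Yuang-chin
Committee members:	高明達, 呂仁園
Degree:	Master
Department:	College of Science - Institute of Statistics
Year of publication:	2013
Graduation academic year:	101
Language:	Chinese
Number of pages:	52
Keywords (Chinese):	未知詞 (unknown word), 新詞組 (new phrase), 局部最長共同子序列 (locally longest common subsequence)
Keywords (English):	Unknown word, New phrase, Locally longest common consecutive subsequence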

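Times change and vocabulary changes, so dictionary entries should change as well; a dictionary that fails to keep up with the times reflects a foundational culture that has fallen behind. For a single article or a pair of articles, we propose the locally longest consecutive common subsequence (LLCCS) method, which resembles the well-known longest common subsequence (LCS) algorithm and efficiently extracts the strings that are used repeatedly in the text. The extracted strings are then further processed and filtered to obtain new phrases and new words that are more grammatically meaningful. Because large volumes of news and articles can be collected automatically from the web, extracting new phrases and new words should help dictionaries accumulate new entries quickly.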

Adapting the well-known longest common subsequence (LCS) algorithm, we propose an efficient algorithm capable of extracting locally longest consecutive common subsequences (LLCCS) from one or two different articles. Further processing of the extracted subsequences brings them closer to syntactic phrases and words. With the World Wide Web full of abundant articles, we hope this is an efficient way to enrich the entries of Chinese lexicons.
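The abstract does not spell out the LLCCS procedure, so the following is only a rough sketch of the general idea, not the thesis's algorithm. It assumes that "locally longest" means a common run of consecutive characters that cannot be extended on either side, and it finds such runs with an LCS-style dynamic-programming table in which a mismatch resets the count to zero. The function name `maximal_common_runs` and the `min_len` parameter are hypothetical, introduced for illustration.

```python
def maximal_common_runs(a, b, min_len=2):
    """Collect common consecutive runs of a and b that cannot be extended.

    Assumption (not taken from the thesis): a run is "locally longest"
    when the characters just before and just after it differ in a and b.
    """
    prev = [0] * (len(b) + 1)   # DP row for a[:i-1]
    found = set()
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Length of the common run ending at a[i-1] and b[j-1].
                cur[j] = prev[j - 1] + 1
                # The run stops here if either string ends or the
                # next characters differ.
                stops = i == len(a) or j == len(b) or a[i] != b[j]
                if stops and cur[j] >= min_len:
                    found.add(a[i - cur[j]:i])
        prev = cur
    # Longest runs first, ties broken alphabetically.
    return sorted(found, key=lambda s: (-len(s), s))


if __name__ == "__main__":
    print(maximal_common_runs("新詞組收集很重要", "我們做新詞組收集"))
```

Keeping only the previous row holds memory at O(len(b)) while the running time stays O(len(a) × len(b)), the same order as the classic LCS table. Repeated strings within a single article could presumably be found by comparing the text with itself and skipping the main diagonal, but that variant is not shown here.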

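Abstract

Chapter 1 Introduction

Chapter 2 Related Work
2.1 New Words and Unknown Words
2.2 Methods for Unknown Word Extraction
2.3 Overview of Similarity Measures

Chapter 3 Locally Longest Common Consecutive Subsequence
3.1 Longest Common Subsequence
3.2 Locally Longest Common Consecutive Subsequence of Two Strings
3.3 Locally Longest Common Consecutive Subsequence of a Single String
3.4 Summary

Chapter 4 New Phrase Extraction
4.1 Overview of the Base Data
4.2 Preliminary Processing
4.3 Phrase Determination
4.4 Splitting and Shortening Overlong Phrases
4.5 Extracting Phrases Containing Numbers
4.6 Mixed Chinese-English Phrases: Accepted

Chapter 5 Implementation of the New Phrase Collection System
5.1 Interface
5.2 Workflow
5.3 Analysis of Results

Chapter 6 Conclusion

References
Appendix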

                                

[1] K. J. Chen and M. H. Bai (1998). “Unknown Word Detection for Chinese by a Corpus-based Learning Method”. International Journal of Computational Linguistics and Chinese Language Processing, Vol. 3, No. 1, pp.27-44.

[2] K. J. Chen and W. Y. Ma (2002). “Unknown Word Extraction for Chinese Documents”. COLING, pp.169-175.

[3] Fuchun Peng, Fangfang Feng and Andrew McCallum (2004). “Chinese Segmentation and New Word Detection Using Conditional Random Fields”. COLING, pp.562-568.

[4] T. H. Chang and C. H. Lee (2003). “Automatic Chinese Unknown Word Extraction Using Small-Corpus-based Method”. Proceedings of the IEEE International Conference on Natural Language Processing and Knowledge Engineering, pp.459-464.

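[5] 楊傑程 (2009). “A Study of Chinese Unknown Word Extraction Using Pattern Mining and Machine Learning Methods” (應用樣式探勘與機器學習方法於中文未知詞擷取之研究). Master's thesis, Department of Computer Science and Information Engineering, National Central University.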

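[6] 陳崇正 (2009). “Applying Web Bookmarks and the VSM Similarity Algorithm to Strengthen the Formation of Communities of Practice” (應用網路書籤與VSM相似度演算法於強化實踐社群的形成). Master's thesis, Department of Computer Science and Information Engineering, National Central University.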

[7] Python 3.2.3 (2012). http://www.python.org/

[8] Beautiful Soup 4 (2013). http://www.crummy.com/software/BeautifulSoup/

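Full-text release date: the full text is not authorized for public release (campus network)
Full-text release date: the full text is not authorized for public release (off-campus network)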
