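| Graduate student: | 陳鈺瑾 Yu-Jin Chen |
|---|---|
| Thesis title: | 可調式之中文文件自動摘要 Scalable Summarization for Chinese Text |
| Advisor: | 張俊盛 博士 Dr. Jason S. Chang |
| Oral defense committee: | |
| Degree: | Master (碩士) |
| Department: | College of Electrical Engineering and Computer Science, Department of Computer Science |
| Year of publication: | 2000 |
| Academic year of graduation: | 88 (ROC calendar) |
| Language: | English |
| Number of pages: | 64 |
| Chinese keywords: | 可調, 中文摘要 (scalable, Chinese summarization) |
| Foreign-language keywords: | scalable, summarization, Chinese text |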
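In today's society, the Internet has become the main channel through which information circulates. With the spread of retrieval tools, users can quickly find whatever information they need. When a retrieved document is long and the user does not want to read it in full, however, reducing the user's reading time becomes an important problem in an age where every second counts. We therefore propose an automatic summarization method in which a short summary indicates the content of a document, or even replaces the full text, to reduce the time users spend browsing and reading.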
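A summary consists mainly of topic sentences, and topic sentences consist of topic words (keywords). A topic word is a word, among those that make up an article, that is important for expressing the article's meaning. A topic sentence is an important sentence that can represent a paragraph or an article. Topic sentences generally appear at the beginning or end of a paragraph, although they sometimes appear in the middle of an article. We assume that the topic sentence of a paragraph is the sentence that contains the most topic words.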
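We observe two linguistic properties of topic words: 1. topic words are often repeated (Repetition); 2. topic words are combinations of noun phrases (Syntactic Patterns). Unlike English, Chinese does not separate words with spaces, so word segmentation must be performed before topic words can be identified. We use a corpus together with a probabilistic competition method to find the most suitable segmentation. We then use the linguistic properties of topic words to find the topic words that represent the article.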
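As a rough illustration of the two properties above (repetition and noun-phrase patterns), the sketch below filters candidate phrases from an article that has already been segmented and tagged. The one-letter tag set (`N` for noun, `A` for adjective), the pattern itself, the `max_len` and `min_freq` thresholds, and the function name `topic_phrases` are all assumptions made for illustration; the record does not list the thesis's actual patterns.

```python
import re
from collections import Counter

# Hypothetical pattern over one-letter tags: any mix of adjectives (A)
# and nouns (N), ending in a noun. Not the thesis's actual pattern set.
NOUN_PHRASE = re.compile(r"[AN]*N")

def topic_phrases(tagged_sentences, max_len=3, min_freq=2):
    """Return candidate phrases that match the pattern and repeat in the article.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    """
    counts = Counter()
    for sentence in tagged_sentences:
        for start in range(len(sentence)):
            last_end = min(start + max_len, len(sentence))
            for end in range(start + 1, last_end + 1):
                window = sentence[start:end]
                tags = "".join(tag for _, tag in window)
                # Syntactic property: the tag sequence must fit the pattern.
                if NOUN_PHRASE.fullmatch(tags):
                    counts["".join(word for word, _ in window)] += 1
    # Discourse property: keep only phrases that repeat.
    return {phrase: n for phrase, n in counts.items() if n >= min_freq}
```

Calling `topic_phrases(article)` on a list of tagged sentences returns a dictionary that maps each surviving phrase to its number of occurrences.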
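After obtaining the topic words, we use clustering to divide the article into segments automatically. Because many authors split a single subtopic across several paragraphs for readability, we consider it more reasonable to re-segment the article and extract the summary from each new segment.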
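We designed several scoring methods and select the highest-scoring sentence in each new segment as its topic sentence for the summary. Besides the traditional term-frequency method, we add scoring based on position, the number of topic words, and topic word length.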
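The experimental results confirm that scoring by topic word length gives better results, and that taking position into account is indeed necessary. We should therefore be able to obtain better results if we can capture the structure of the article more precisely.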
This paper proposes an approach to generate scalable summaries for Chinese text automatically. We observe that summaries usually consist of topic sentences, and topic sentences usually contain topic phrases. Unlike English words, Chinese words are not separated by white space, so we have to carry out word segmentation before identifying topic phrases.
We adopt a dynamic programming method based on a Markov model to segment and tag both known and unknown words. Then, we identify the topic phrases of the article based on linguistic properties of topic phrases at the syntactic and discourse levels. At the syntactic level, the topic phrases always follow a limited set of syntactic patterns, while at the discourse level, the topic phrases always repeat in the article.
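To make the dynamic programming idea concrete, here is a minimal sketch of segmentation that picks the split with the highest product of word probabilities. It is a deliberate simplification and not the thesis's model: it uses independent word probabilities rather than a Markov model over tags, performs no tagging, and treats any out-of-lexicon character as a one-character unknown word. The lexicon values, `max_word_len`, `unknown_prob`, and the function name `segment` are assumptions for illustration.

```python
import math

def segment(text, word_prob, max_word_len=4, unknown_prob=1e-8):
    """Split text into the word sequence with the highest total log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i]: start of the last word in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in word_prob:
                prob = word_prob[word]
            elif len(word) == 1:
                prob = unknown_prob  # assumed fallback for unknown characters
            else:
                continue
            score = best[start] + math.log(prob)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]
```

Because every single character is always a valid fallback word, the table `best` is filled for every position and the back-pointer walk always reaches the start of the text.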
After identifying topic phrases, we divide the article into subtopic segments, because authors often divide one subtopic into several paragraphs for readability. We merge the most similar adjacent sentences into one segment using a clustering method, and extract one sentence from each segment as the summary.
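The merging step can be sketched as a bottom-up procedure that repeatedly joins the most similar pair of neighbouring segments. The record does not state the similarity measure or the stopping rule, so the cosine similarity over topic phrase counts and the `num_segments` target used here are assumptions, as are the function names.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two phrase-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def merge_adjacent(sentences, num_segments):
    """Group sentences into contiguous segments.

    sentences: list of lists of topic phrases, one list per sentence.
    Returns a list of segments, each a list of sentence indices.
    """
    segments = [[i] for i in range(len(sentences))]
    vectors = [Counter(s) for s in sentences]
    while len(segments) > max(1, num_segments):
        # Find the most similar pair of neighbouring segments.
        best = max(range(len(segments) - 1),
                   key=lambda i: cosine(vectors[i], vectors[i + 1]))
        # Replace the pair with their union, keeping document order.
        segments[best:best + 2] = [segments[best] + segments[best + 1]]
        vectors[best:best + 2] = [vectors[best] + vectors[best + 1]]
    return segments
```

Under this assumed stopping rule, one sentence is extracted per segment, so `num_segments` controls the length of the summary.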
We design six scoring methods to calculate the informativeness of sentences: measurement by topic phrase length, topic phrase frequency, and topic phrase count, each with or without lead weight. The experiment shows that the lead weight methods perform better, and that among all scoring methods, the measurement by topic phrase length has the best performance.
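The six variants (three measures, each with or without lead weight) can be illustrated as follows. The exact formulas are not given in this record, so every formula below is an assumption: length is taken as the total number of characters in the sentence's topic phrases, frequency as the sum of their article-level counts, count as the number of topic phrases, and lead weight as a division by one plus the sentence's position in its segment.

```python
def score_sentence(phrases, position, phrase_freq,
                   measure="length", lead_weight=False):
    """Score one sentence.

    phrases: topic phrases found in the sentence.
    position: 0-based index of the sentence within its segment.
    phrase_freq: mapping from topic phrase to its count in the article.
    """
    if measure == "length":
        score = sum(len(p) for p in phrases)
    elif measure == "frequency":
        score = sum(phrase_freq.get(p, 0) for p in phrases)
    elif measure == "count":
        score = len(phrases)
    else:
        raise ValueError(f"unknown measure: {measure}")
    if lead_weight:
        # Hypothetical decay that favours sentences near the segment start.
        score = score / (1 + position)
    return score

def pick_summary_sentence(segment, phrase_freq,
                          measure="length", lead_weight=True):
    """Return the index of the top-scoring sentence in a segment.

    segment: list of phrase lists, one per sentence.
    """
    scores = [score_sentence(p, i, phrase_freq, measure, lead_weight)
              for i, p in enumerate(segment)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Applying `pick_summary_sentence` to each segment and concatenating the chosen sentences in document order gives an extractive summary with one sentence per segment.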
In the future, we will combine more article features, such as cue phrases, to produce better results. Furthermore, we can shorten or combine sentences using reduction and combination rules to produce summaries whose quality approaches that of manual ones.