Graduate student: | 梁庭耀 Ting-Yao Liang |
---|---|
Thesis title: | 非文件集基礎的中文文件自動摘要系統之探討 A Study of Non-Corpus Based Automatic Chinese Document Summarizers |
Advisor: | 陳鴻基 Houn-Gee Chen |
Oral defense committee: | |
Degree: | Master |
Department: | College of Technology Management - Institute of Technology Management |
Year of publication: | 2006 |
Academic year of graduation: | 94 (ROC calendar) |
Language: | Chinese |
Number of pages: | 49 |
Keywords (Chinese): | 文件自動摘要、非文件集基礎文件自動摘要、機率潛在語義分析、潛在語義分析、關聯性衡量 |
Keywords (English): | Automatic Summarization, Non-Corpus based Automatic Summarization, Probabilistic Latent Semantic Analysis, Latent Semantic Analysis, Relevance Measure |
This study applies Probabilistic Latent Semantic Analysis (PLSA) to single-document automatic summarization. PLSA is based on the aspect model, a statistical model that can be used to analyze the co-occurrence of terms and sentences. PLSA has already been shown to outperform Latent Semantic Analysis (LSA) in automatic indexing; automatic summarization is a new application proposed in this study.
In the literature, most automatic summarizers have been built with corpus-based techniques. A corpus-based summarizer, however, requires a large number of documents together with human-written summaries for training, and it cannot produce good summaries for emerging topics, where too few training documents are available. The summarizer in this study therefore takes a non-corpus based approach, built on a modified PLSA. Its performance was compared experimentally with that of two other non-corpus based summarizers, one using LSA and one using Relevance Measure (RM). With articles from New Taiwan Magazine as the documents to summarize, the RM summarizer performed best, the PLSA summarizer ranked second, and the LSA summarizer performed worst.
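The aspect model that PLSA rests on can be illustrated with a short sketch. The code below fits a standard aspect model to a term-by-sentence count matrix with EM and then picks sentences from the fitted model. It is not the modified PLSA of the thesis, whose details the abstract does not give: the function names `plsa` and `select_sentences`, the toy count matrix, and the selection rule (take the most probable sentence of each of the most probable aspects) are all assumptions made for illustration.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit an aspect model to a term-by-sentence count matrix with EM.

    counts[w, s] is how often term w occurs in sentence s.
    Returns P(z), P(w|z) and P(s|z) for the latent aspects z.
    """
    rng = np.random.default_rng(seed)
    n_terms, n_sents = counts.shape
    eps = 1e-12  # guards against division by zero

    # Random initialization, normalized so each column is a distribution.
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_w_z = rng.random((n_terms, n_topics))
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_s_z = rng.random((n_sents, n_topics))
    p_s_z /= p_s_z.sum(axis=0, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(z | w, s) is proportional to P(z) P(w|z) P(s|z).
        joint = p_z[None, None, :] * p_w_z[:, None, :] * p_s_z[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + eps)

        # M-step: re-estimate parameters from expected counts.
        weighted = counts[:, :, None] * post      # shape (terms, sents, topics)
        totals = weighted.sum(axis=(0, 1))        # expected count per aspect
        p_w_z = weighted.sum(axis=1) / (totals + eps)
        p_s_z = weighted.sum(axis=0) / (totals + eps)
        p_z = totals / (totals.sum() + eps)

    return p_z, p_w_z, p_s_z

def select_sentences(p_z, p_s_z, k):
    """Hypothetical selection rule (an assumption, not the thesis's method):
    visit aspects from most to least probable and take each aspect's most
    probable sentence that has not been chosen yet, up to k sentences.
    Returns fewer than k indices when there are fewer than k aspects.
    """
    chosen = []
    for z in np.argsort(-p_z):
        for s in np.argsort(-p_s_z[:, z]):
            if int(s) not in chosen:
                chosen.append(int(s))
                break
        if len(chosen) == k:
            break
    return sorted(chosen)  # keep the original sentence order

# Toy example: 4 terms (rows) by 4 sentences (columns), made-up counts.
counts = np.array([[2, 0, 1, 0],
                   [1, 0, 1, 0],
                   [0, 3, 0, 1],
                   [0, 1, 0, 2]], dtype=float)
p_z, p_w_z, p_s_z = plsa(counts, n_topics=2)
summary = select_sentences(p_z, p_s_z, k=2)
print(summary)
```

Because the model is fitted on the sentences of the single document being summarized, no training corpus or human-written summaries are needed, which is the sense in which such a summarizer is non-corpus based. For Chinese text, the count matrix would first require word segmentation, which is outside this sketch.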