A Study of Non-Corpus Based Automatic Chinese Document Summarizers

Graduate student:	梁庭耀 Ting-Yao Liang
Thesis title:	A Study of Non-Corpus Based Automatic Chinese Document Summarizers (非文件集基礎的中文文件自動摘要系統之探討)
Advisor:	陳鴻基 Houn-Gee Chen
Oral defense committee:
Degree:	Master
Department:	College of Technology Management - Institute of Technology Management
Year of publication:	2006
Academic year of graduation:	94 (ROC calendar)
Language:	Chinese
Number of pages:	49
Keywords:	Automatic Summarization, Non-Corpus based Automatic Summarization, Probabilistic Latent Semantic Analysis, Latent Semantic Analysis, Relevance Measure

This study applies Probabilistic Latent Semantic Analysis (PLSA) to single-document automatic summarization. PLSA is based on the aspect model, a statistical model that can be used to analyze the co-occurrence of terms and sentences. PLSA has already been shown to outperform Latent Semantic Analysis (LSA) in automatic indexing; automatic summarization is a new application proposed in this study.
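
For reference, the aspect model behind PLSA is usually written in the symmetric form below. The notation is an assumption, since this record gives neither the thesis's own symbols nor the details of its modified PLSA: $w$ is a term, $s$ a sentence, $z$ a latent aspect, and $n(w, s)$ the number of times term $w$ occurs in sentence $s$.

```latex
% Joint probability of a term-sentence co-occurrence under the aspect model
P(w, s) = \sum_{z} P(z)\, P(w \mid z)\, P(s \mid z)

% Log-likelihood of the observed co-occurrence counts, maximized with EM
\mathcal{L} = \sum_{s} \sum_{w} n(w, s)\, \log P(w, s)
```

Fitting the model means choosing $P(z)$, $P(w \mid z)$, and $P(s \mid z)$ to maximize $\mathcal{L}$. How the fitted probabilities are then turned into a sentence ranking is specific to the thesis and is not described in this record.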

In the literature, most automatic summarizers have been built with corpus-based techniques. A corpus-based summarizer, however, requires a large number of documents and human-written summaries for training, and it cannot produce good summaries for emerging topics because too few training documents are available. We therefore adopt a non-corpus based approach and propose a modified PLSA to build the summarizer. Its performance was compared with that of the LSA and Relevance Measure (RM) summarizers, which are likewise non-corpus based. In experiments on articles from New Taiwan Magazine, the RM summarizer performed best, the PLSA summarizer ranked second, and the LSA summarizer performed worst.
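
To make the non-corpus based idea concrete, here is a minimal sketch of a relevance-measure style summarizer, loosely following the procedure commonly attributed to Gong and Liu (2001): score each sentence against the whole document, pick the best one, remove its terms from the document vector, and repeat. It is not the thesis's implementation. The function names, the raw term-count weighting, and the cosine scoring are all assumptions made for illustration, and the input is assumed to be already segmented into words (Chinese text would first need a word segmenter).

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rm_summarize(sentences, k):
    """Return up to k sentence indices, in the order they were selected.

    sentences: list of token lists (hypothetical pre-segmented input).
    """
    vectors = [Counter(tokens) for tokens in sentences]
    doc = Counter()
    for vec in vectors:
        doc.update(vec)

    candidates = list(range(len(sentences)))
    chosen = []
    while candidates and len(chosen) < k:
        # max() keeps the first maximum, so ties go to the earliest sentence.
        best = max(candidates, key=lambda i: cosine(vectors[i], doc))
        chosen.append(best)
        candidates.remove(best)
        # Drop the terms just covered so later picks favor new content.
        for term in vectors[best]:
            doc.pop(term, None)
    return chosen


sentences = [
    ["plsa", "model", "topic"],
    ["plsa", "model", "topic", "summary", "sentence"],
    ["weather", "rain"],
]
print(rm_summarize(sentences, 2))  # [1, 2]
```

No training corpus or reference summaries are involved: everything is computed from the single input document. In the example, the second pick skips sentence 0 because all of its terms were already covered by sentence 1.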

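Abstract (Chinese)
Abstract (English)
Chapter 1 Introduction
  Section 1 Research Background
  Section 2 Research Motivation
  Section 3 Research Objectives
  Section 4 Overview of Chapters
Chapter 2 Literature Review
  Section 1 Development of Automatic Summarization
  Section 2 Corpus-Based Automatic Summarization
  Section 3 Automatic Summarization Based on Relevance Measure
  Section 4 Automatic Summarization Based on Latent Semantic Analysis
Chapter 3 Automatic Summarization Based on Probabilistic Latent Semantic Analysis
  Section 1 Introduction to Probabilistic Latent Semantic Analysis
  Section 2 The Probabilistic Latent Semantic Analysis Model
  Section 3 Applying Probabilistic Latent Semantic Analysis to Sentence Summarization
Chapter 4 Evaluation of Experimental Results
  Section 1 Description of Experimental Data
  Section 2 System Evaluation Method
  Section 3 Term Weighting Combinations
  Section 4 System Performance Evaluation
Chapter 5 Conclusions and Future Research Directions
  Section 1 Conclusions
  Section 2 Future Research Directions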

                                

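Full-text release date: full text not authorized for public release (on-campus network)
Full-text release date: full text not authorized for public release (off-campus network)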
