
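Graduate student: 詹權恩
Thesis title: 以詞彙關聯性詞庫為基礎之文件關鍵字擷取模式
A Document Keyword Extraction Model Using the Keyword Correlation Thesaurus
Advisor: 侯建良
Oral defense committee:
Degree: Master
Department: College of Engineering - Department of Industrial Engineering and Engineering Management
Year of publication: 2004
Academic year of graduation: 92 (ROC calendar)
Language: Chinese
Number of pages: 114
Keywords: keyword correlation, keyword extraction, document content management, knowledge management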
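  • With the spread of Internet technology in recent years, information seekers face an ever-growing volume of online information and documents. How to obtain the information users actually need from this broad pool, while saving the considerable time spent on filtering and screening, has become an important issue in information, document, and knowledge management. In addition, to let information seekers browse documents quickly and judge what each document is meant to convey, key terms must be extracted from each document to serve as its index. Under traditional information management practice, document keywords are mostly determined manually, which consumes substantial labor and time and cannot meet timeliness requirements. This thesis therefore first analyzes document content and extracts the frequency and position of each term, then uses these two factors to derive correlations between terms automatically and to build a keyword correlation thesaurus. For keyword extraction, the thesis uses these term correlations to extract the keywords of each document automatically, such that the extracted keywords appropriately express the important information in the document. Finally, a document content management model and system based on term correlation is developed to confirm the model's feasibility, and a case study is used to evaluate its effectiveness. In sum, the proposed document content management model and techniques are not limited to a specific application domain. Adopting this model offers enterprises a feasible solution for knowledge content management and helps them accumulate and reuse personal experience, documents, and knowledge, so that members of the organization can use the Internet-based system built on this framework for document information extraction and for centralized storage, sharing, and management of data.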


    Abstract

    With the large amount of electronic information now available on the Internet, efficiently retrieving and filtering information that meets user requirements has become an essential issue for enterprise document and knowledge management. In addition, to help users quickly acquire the critical information in documents, keywords must be extracted from each document. Traditionally, document keywords are determined manually, which requires considerable expertise and time. This research proposes algorithms for automatic keyword correlation analysis and keyword extraction. Keyword correlation is analyzed from the frequencies and locations of words in the documents of the document repository. Based on the keyword correlation, a recursive approach to document keyword extraction is developed to determine the critical and representative concepts of the target documents. A web-based system for document content management is established, and a case study is provided to evaluate the proposed model. The proposed approach is not domain-specific and can therefore be applied in different applications. This research provides a feasible solution for enterprise knowledge accumulation and reuse in collaboration networks.
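
    The abstract names two inputs to the correlation analysis, word frequency and word location, without giving formulas. The Python sketch below is only an illustration of how such a scheme might look and is not the thesis's algorithm: the function names `term_correlations` and `extract_keywords`, the `window` parameter, the 1/distance proximity weighting, and the frequency-times-correlation ranking are all assumptions made for this example. In particular, it ranks terms in a single pass and does not reproduce the recursive extraction procedure the thesis describes.

    ```python
    from collections import Counter


    def term_correlations(docs, window=3):
        """Score term pairs across tokenized documents (hypothetical scoring).

        Each time two distinct terms occur within `window` tokens of each
        other, 1 / distance is added to the pair's score, so pairs that
        co-occur often and close together score higher.
        """
        scores = Counter()
        for tokens in docs:
            for i, left in enumerate(tokens):
                for j in range(i + 1, min(i + window + 1, len(tokens))):
                    right = tokens[j]
                    if left != right:
                        scores[frozenset((left, right))] += 1.0 / (j - i)
        return scores


    def extract_keywords(tokens, correlations, top_k=3):
        """Rank a document's terms by frequency times (1 + summed correlation).

        The summed correlation covers only the other terms that appear in
        the same document. Ties are broken alphabetically.
        """
        freq = Counter(tokens)
        vocab = list(freq)
        weight = {}
        for term in vocab:
            related = sum(
                correlations.get(frozenset((term, other)), 0.0)
                for other in vocab
                if other != term
            )
            weight[term] = freq[term] * (1.0 + related)
        return sorted(vocab, key=lambda t: (-weight[t], t))[:top_k]
    ```

    A caller would first build the correlation table from a tokenized document repository with `term_correlations(docs)` and then pass one document's tokens to `extract_keywords`. Chinese text would need word segmentation before this step, which the sketch leaves out.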

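    Table of Contents
    Chinese Abstract
    English Abstract
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1. Research Background
        1.1 Research Motivation and Objectives
        1.2 Research Procedure
        1.3 Research Positioning
    Chapter 2. Literature Review
        2.1 Keyword Correlation Analysis
            2.1.1 Characteristics of Keyword Correlation
            2.1.2 Keyword Correlation Extraction
            2.1.3 Applications of Keyword Correlation
        2.2 Thesaurus Construction and Application
            2.2.1 Thesaurus Construction
            2.2.2 Thesaurus Application
        2.3 Information Retrieval
            2.3.1 Text-Based Information Extraction
            2.3.2 Special-Type Information Extraction
            2.3.3 Applications of Information Retrieval Techniques
    Chapter 3. Keyword Correlation Analysis and Keyword Extraction Model
        3.1 Keyword Correlation Analysis Model
            3.1.1 Frequency-Based Keyword Correlation Analysis
            3.1.2 Position-Based Keyword Correlation Analysis
        3.2 Keyword Extraction Model
            3.2.1 Recursive Keyword Extraction Model
            3.2.2 Generalized Keyword Extraction Model
    Chapter 4. System Architecture Planning
        4.1 Automated Keyword Extraction Model
        4.2 System Functional Architecture
        4.3 Data Model Definition
        4.4 System Flow
            4.4.1 System Operation Flow
            4.4.2 System Data Flow
        4.5 System Development Tools
    Chapter 5. Case Verification and Evaluation
        5.1 System Operation Guide
        5.2 System Analysis and Evaluation
    Chapter 6. Conclusions and Future Development
    References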

