Graduate student: | 楊庭豪 Yang, Ting-Hao |
---|---|
Thesis title: | 基於統計準則式方法強化出版物參考元數據提取方法之研究 A Study of the Statistical Principle-Based Approach to Enhance the Publication Reference Metadata Extraction |
Advisor: | 許聞廉 Hsu, Wen-Lian |
Oral defense committee: | 蘇豐文 Soo, Von-Wun; 張詠淳 Chang, Yung-Chun; 戴敏育 Day, Min-Yuh; 吳世弘 Wu, Shih-Hung |
Degree: | 博士 Doctor |
Department: | 電機資訊學院 - 資訊系統與應用研究所 College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications |
Year of publication: | 2022 |
Academic year of graduation: | 110 |
Language: | Chinese |
Number of pages: | 68 |
Keywords (Chinese): | 參考元數據 (reference metadata), 準則式方法 (principle-based approach), 自動模板生成 (automatic template generation) |
Keywords (English): | reference-metadata, principle-based, principle-generation |
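A publication string is a specially formatted string that describes a resource so that other researchers can find it; it is typically used for the cited works listed at the end of a paper and for researchers' personal publication lists. Building on our previous research, we propose a method that combines statistical techniques with knowledge rules, converting publication strings into structured information through automated principle generation and matching algorithms. As a basic task in structuring academic data, publication reference metadata extraction not only supports precise information extraction for literature retrieval but also helps in studying the network relationships of academic community activities. However, citation formats vary widely, and the number of formats is growing at an alarming rate, which hinders publication reference metadata extraction. In this thesis we examine this problem and look for ways to improve the effectiveness of reference metadata extraction. The thesis focuses on two topics:
(1) System design integrating statistical techniques and ontology: We built an environment for knowledge representation and application. It comprises a knowledge management environment and an integrated core model, which combines a hierarchical ontology with statistical methods. This simplifies the annotation work while preserving information extraction performance. The knowledge-integrated architecture also lets experts analyze errors at each stage and quickly improve the system where it matters most. We used this environment to develop a publication reference metadata extraction model.
(2) Strengthening publication reference metadata extraction with the Statistical Principle-Based Approach (SPBA): Our laboratory previously developed several systems for publication reference metadata extraction. Over the course of that work we improved how principles are generated and applied them to different tasks, eventually arriving at SPBA. SPBA has three steps. The first builds an ontology and uses that knowledge to perform semantic labeling on the text. The second takes the patterns produced by the first step and consolidates them into representative principles using the principle generation algorithm. The last uses the principle matching algorithm, which provides a flexible matching mechanism, to handle varied citation formats.
In this thesis we verify the usability of SPBA on public datasets for publication reference metadata extraction and on expert-edited noisy datasets; the experiments cover four datasets of journal-article citation strings and one of conference-paper citation strings. We also compare against the current CRF and Bi-LSTM-CRF methods, and SPBA improves metadata extraction performance on every dataset. Experiments with less training data also confirm the robustness of SPBA. Few studies on publication reference metadata extraction propose methods that integrate machine learning with knowledge rules, and SPBA can fill this gap. The contributions of this study are as follows. First, combining simplified labeling with matching reduces the annotation burden. Second, generating principles from data reduces experts' burden of writing principles. Third, we share a new dataset for the publication reference metadata extraction task, giving follow-up research new material.
As a technique that combines ontology with statistical methods, SPBA produces readable principles and makes it possible to understand the cause of errors at each step. This interpretability will help extend it to other information extraction tasks that require fine-grained handling of semantics.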
A publication reference string is a specially formatted string that describes a cited resource so that other researchers can find it. Building on our past research, we propose a method that combines statistical techniques and knowledge rules to convert publication reference strings into structured information through automated principle generation and matching algorithms. Reference metadata extraction is a basic task in transforming publication strings into structured data. Besides supporting accurate metadata extraction for academic document retrieval, it also helps in exploring academic social network relationships. However, the formats of publication reference strings are highly variable, and their number is increasing at an alarming rate, which creates obstacles to the extraction of publication reference metadata. In this thesis, we investigate the following two topics.
(1) A carefully designed system architecture that integrates knowledge-based and statistical methods.
The environment includes a knowledge management module and the core model of the hybrid approach, which combines a hierarchical ontology with statistical methods. We use simplified markup while still maintaining information extraction performance. The knowledge-incorporating architecture also enables experts to analyze errors at each stage and quickly improve the system. We develop the publication reference metadata extraction model in this environment and apply it to public datasets and to noisy datasets composed by experts for further testing.
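To make the idea of labeling text with a hierarchical ontology more concrete, the following is a minimal sketch only. The two-level `ONTOLOGY` dictionary, its field and concept names, the regular expressions, and the `label_spans` helper are all hypothetical illustrations and are not taken from the thesis.

```python
import re

# Hypothetical two-level ontology: field -> concept -> surface patterns.
# The names and regular expressions are illustrative assumptions only.
ONTOLOGY = {
    "DATE": {
        "YEAR": [r"\b(?:19|20)\d{2}\b"],
    },
    "LOCATOR": {
        "VOLUME": [r"\bvol\.\s*\d+"],
        "PAGES": [r"\bpp?\.\s*\d+\s*[-–]\s*\d+"],
    },
}

def label_spans(text):
    """Return (start, end, 'FIELD/CONCEPT', matched_text) tuples sorted by position."""
    spans = []
    for field, concepts in ONTOLOGY.items():
        for concept, patterns in concepts.items():
            for pattern in patterns:
                for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                    spans.append((m.start(), m.end(), f"{field}/{concept}", m.group()))
    return sorted(spans)

reference = "Scientometrics, vol. 119, p. 643-656, 2019."
labels = label_spans(reference)
for start, end, tag, surface in labels:
    print(tag, repr(surface))
```

Because each label carries both its field and its concept, an expert inspecting a wrong extraction can see which level of the hierarchy produced it, which is the kind of stage-by-stage error analysis the paragraph above describes.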
(2) Statistical principle-based approach (SPBA)
In the past, we developed several systems for the task of publication reference metadata extraction. During development, we refined the principle generation method and tried it on different tasks, and on that basis we developed SPBA. SPBA has three steps. The first step builds an ontology and uses this knowledge to perform semantic labeling on the corpus. The second step collects the labeled patterns from the first step and uses the principle generation algorithm to consolidate them into representative principles. The third step uses a principle matching algorithm, which provides a flexible matching mechanism, to handle varied citation formats.
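The second and third steps can be illustrated with a rough sketch. This abstract does not specify the actual principle generation or matching algorithms, so the code below substitutes two simple stand-ins: frequency counting to pick representative label sequences, and Python's `difflib.SequenceMatcher` for tolerant comparison. The function names, the `min_support` threshold, and the label sequences are all assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher

def generate_principles(patterns, min_support=2):
    """Stand-in for principle generation: keep label sequences seen at least min_support times."""
    counts = Counter(tuple(p) for p in patterns)
    return [seq for seq, n in counts.most_common() if n >= min_support]

def match_principle(pattern, principles):
    """Stand-in for flexible matching: return the most similar principle and its ratio in [0, 1]."""
    best, best_score = None, 0.0
    for principle in principles:
        matcher = SequenceMatcher(None, list(pattern), list(principle), autojunk=False)
        score = matcher.ratio()
        if score > best_score:
            best, best_score = principle, score
    return best, best_score

# Hypothetical label sequences, as might come out of the semantic labeling step.
labeled_patterns = [
    ["AUTHOR", "TITLE", "JOURNAL", "VOLUME", "PAGES", "YEAR"],
    ["AUTHOR", "TITLE", "JOURNAL", "VOLUME", "PAGES", "YEAR"],
    ["AUTHOR", "TITLE", "BOOKTITLE", "PAGES", "YEAR"],
    ["AUTHOR", "TITLE", "BOOKTITLE", "PAGES", "YEAR"],
    ["AUTHOR", "YEAR", "TITLE"],
]
principles = generate_principles(labeled_patterns)

# A citation whose volume field is missing still maps to the journal-style principle.
unseen = ["AUTHOR", "TITLE", "JOURNAL", "PAGES", "YEAR"]
best, score = match_principle(unseen, principles)
print(best, round(score, 3))
```

The point of the sketch is the division of labor: principles are derived from labeled data rather than written by hand, and matching tolerates a missing or extra field instead of requiring an exact template hit.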
In this thesis, we validate the usability of SPBA on publication reference metadata extraction datasets, comprising public datasets and expert-edited noisy datasets. The experiments cover four journal-style datasets and one conference-style dataset. We compare SPBA against state-of-the-art approaches such as CRF and Bi-LSTM-CRF on these datasets, and SPBA reduces the field error rate relative to them. The robustness of SPBA is also validated in experiments with reduced training data. Few studies on publication metadata extraction propose methods that integrate statistical and knowledge-based methods, and SPBA can fill this research gap. The contributions of our research are as follows: (i) combining simplified labeling with matching algorithms reduces the burden of labeling work on experts; (ii) generating principles from data reduces experts' burden of writing principles; (iii) we share a new dataset that gives follow-up research new development material.
As a technique that combines ontology and statistical methods, SPBA produces readable principles and makes it possible to analyze the cause of errors at each step. This interpretability is helpful when developing solutions for other information extraction tasks that require careful processing of semantic meaning.
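The abstract does not define how the field error rate is computed. As one plausible reading, the sketch below counts a gold-standard field as an error when the extracted value is missing or differs from it, and divides by the total number of gold fields. The `field_error_rate` function and the sample records are hypothetical.

```python
def field_error_rate(gold_records, predicted_records):
    """Fraction of gold fields whose predicted value is missing or different (assumed definition)."""
    total = 0
    errors = 0
    for gold, predicted in zip(gold_records, predicted_records):
        for field, value in gold.items():
            total += 1
            if predicted.get(field) != value:
                errors += 1
    return errors / total if total else 0.0

# Hypothetical gold annotations and system output for two reference strings.
gold = [
    {"journal": "Scientometrics", "year": "2019"},
    {"journal": "Computer", "year": "1999"},
]
predicted = [
    {"journal": "Scientometrics", "year": "2019"},
    {"journal": "Computer"},  # year was not extracted
]
rate = field_error_rate(gold, predicted)
print(rate)
```

Under this definition, a lower value means more fields were extracted exactly; other definitions (for example token-level or per-field-type rates) would give different numbers.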