Author: | 謝佑明 Hsieh, Yu-Ming |
---|---|
Thesis title: | 以結構機率重估改進中文句法分析 Syntactic Parsing for Mandarin Chinese via Structural Probability Re-estimation |
Advisors: | 張俊盛 Chang, Jason S.; 陳克健 Chen, Keh Jiann |
Oral defense committee: | 陳信希; 許聞廉; 高照明 |
Degree: | Doctor |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of publication: | 2015 |
Academic year of graduation: | 103 (ROC calendar) |
Language: | English |
Number of pages: | 107 |
Keywords: | Syntactic Parsing, PCFG, Structural Disambiguation, Grammar Representation |
Syntactic parsing is the first major step of natural language understanding and plays an important role in machine translation, question answering, information retrieval, speech recognition, and other natural language processing applications. Given a sentence and a set of grammar rules, a syntactic parser identifies the parts of speech of words and the syntactic functions of phrases, then produces the ambiguous structures that the grammar accepts. Selecting the best structure among these candidates is difficult, and the quality of the selection depends on how precisely structure probabilities are estimated.
In this thesis we first propose a general model, the context-dependent probability re-estimation model (CDM), to improve the structure probabilities produced by probabilistic context-free grammars (PCFG). Compared with using rule probabilities alone, the proposed model can draw on a broader range of contextual features efficiently and flexibly to estimate structure probabilities more precisely. Second, because the general model handles certain special structures poorly, we propose special-case disambiguation models that add features or methods specifically useful for those structures, and we integrate them into the general re-estimation model. The special cases tested in this thesis are Vt-N structures (a transitive verb followed by a noun) and conjunctive structures. Experimental evaluation shows that the proposed models outperform the baseline PCFG parser and existing state-of-the-art statistical parsers.
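The abstract's central idea, ranking ambiguous structures by their probability, can be illustrated with a minimal sketch. It scores a tree under a plain PCFG as the product of its rule probabilities and then optionally multiplies in a context factor. Everything here is an assumption made for illustration: the rule probabilities, the `CONTEXT_FACTOR` table, the function names, and the example phrase are hypothetical, and the re-weighting step is a simple stand-in, not the thesis's CDM, whose formulation the abstract does not give.

```python
# Toy PCFG rule probabilities P(rhs | lhs). All numbers are hypothetical.
RULE_PROB = {
    ("VP", ("Vt", "NP")): 0.6,  # transitive verb takes the noun phrase as object
    ("NP", ("N",)): 0.5,
    ("NP", ("Vt", "N")): 0.1,   # verb acts as a modifier of the head noun
}

# Hypothetical context factors keyed by (rule, words covered by the rule).
CONTEXT_FACTOR = {
    (("NP", ("Vt", "N")), ("研究", "方法")): 5.0,
}


def is_preterminal(tree):
    """A preterminal is a (tag, word) pair."""
    return len(tree) == 2 and isinstance(tree[1], str)


def leaves(tree):
    """Return the words covered by a tree, left to right."""
    if is_preterminal(tree):
        return (tree[1],)
    return tuple(word for child in tree[1:] for word in leaves(child))


def tree_score(tree, context_factor=None):
    """Product of rule probabilities, optionally re-weighted by context.

    With context_factor=None this is the plain PCFG tree probability
    (lexical probabilities are omitted for brevity). With a factor table
    the result is an unnormalized score, not a probability.
    """
    if is_preterminal(tree):
        return 1.0
    rule = (tree[0], tuple(child[0] for child in tree[1:]))
    score = RULE_PROB[rule]
    if context_factor is not None:
        score *= context_factor.get((rule, leaves(tree)), 1.0)
    for child in tree[1:]:
        score *= tree_score(child, context_factor)
    return score


# Two candidate structures for the Vt-N sequence 研究 方法:
verb_object = ("VP", ("Vt", "研究"), ("NP", ("N", "方法")))   # "study methods"
modifier_head = ("NP", ("Vt", "研究"), ("N", "方法"))         # "research method"
candidates = [verb_object, modifier_head]

best_plain = max(candidates, key=tree_score)
best_rescored = max(candidates, key=lambda t: tree_score(t, CONTEXT_FACTOR))
```

With these toy numbers, the plain PCFG prefers the verb-object reading, while the context factor flips the choice to the modifier-head reading. This shows only the general effect that context-sensitive re-estimation can have on structure selection.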