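Learning to Respond to Mixed-code Queries using Bilingual Word Embeddings | National Tsing Hua University Doctoral and Master's Thesis Repository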

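Graduate student:	何佳芳 Ho, Chia-Fang
Thesis title:	Learning to Respond to Mixed-code Queries using Bilingual Word Embeddings (Chinese title: 混合式語言查詢解析：雙語內嵌式語言模型)
Advisor:	張俊盛 Chang, Jyun-Sheng
Committee members:	馬偉雲 Ma, Wei-Yun; 陳浩然 Chen, Hao-Jan
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications
Year of publication:	2019
Academic year of graduation:	107 (ROC calendar)
Language:	English
Number of pages:	30
Chinese keywords:	混合語言查詢 (mixed-code query), 內嵌式語言模型 (word embedding model), 搜索引擎 (search engine)
English keywords:	Bilingual Word Embeddings, Mixed-code Query, Linguistic Search Engine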

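This thesis presents a method for learning bilingual word embeddings to help second-language (L2) learners use queries that contain native-language words to obtain relevant second-language phrases and example sentences. We convert a mixed-code query into the target language before running it, and then retrieve relevant target-language phrases and bilingual example sentences. In the training stage, we first process and align a parallel corpus, convert the processed data into mixed-code training data, and then generate bilingual word embeddings from the mixed-code data. At run time, the system converts a mixed-code query into the target language based on the bilingual word embeddings, runs the query, and re-ranks the retrieved phrases and example sentences by frequency and relevance before returning them. We present a prototype search engine, x.Linggle, that applies the method to a parallel corpus and an underlying linguistic search engine. A preliminary evaluation on a list of real errors from an English learner corpus shows that using word embeddings to convert mixed-code queries performs well.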

We present a method for learning bilingual word embeddings (BWE) to help second language (L2) learners find recurring phrases and example sentences that match mixed-code queries (e.g., “接受 education”), which combine words in the target language with words in the learner's native language (L1). In our approach, mixed-code queries are transformed into target language queries so as to maximize the probability of retrieving relevant target language phrases and sentences. The method involves re-aligning a given parallel corpus, converting the re-aligned parallel data into mixed-code data, and generating word embeddings from the mixed-code data. At run time, mixed-code queries are transformed into monolingual ones based on the bilingual word embeddings, and the retrieved phrases and sentences are re-ranked by frequency. We present a prototype search engine, x.Linggle, that applies the method to a parallel corpus and an underlying linguistic search engine. Preliminary evaluation on a list of real miscollocations in the Chinese ESL Learner Corpus shows that the method performs reasonably well. Our methodology supports processing mixed-code queries with a system that provides relevant target language phrases and sentences.
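
The pipeline described in the abstract can be illustrated with a small Python sketch. This is not the thesis implementation: the function names (`transplant`, `cooccurrence_vectors`, `translate_query`), the two-sentence toy corpus, and the hand-written word alignments are all hypothetical, and plain co-occurrence counts stand in for trained word embeddings so that the example runs without external libraries. The sketch builds mixed-code sentences by swapping aligned words into the target sentence, derives vectors from the mixed-code data, and then replaces each native-language word in a query with its nearest target-language neighbour.

```python
import math
from collections import Counter, defaultdict


def is_l1(token):
    """Assumption: any token containing a CJK ideograph is a native-language word."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in token)


def transplant(target, source, links):
    """Yield the target sentence once per alignment link, with that one
    target word replaced by its aligned source word."""
    for t_idx, s_idx in links:
        mixed = list(target)
        mixed[t_idx] = source[s_idx]
        yield mixed


def cooccurrence_vectors(sentences, window=2):
    """Sparse context-count vectors (a stand-in for trained embeddings)."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    vectors[word][sent[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def translate_query(query, vectors):
    """Replace each L1 token with its most similar target-language word."""
    targets = [w for w in vectors if not is_l1(w)]
    result = []
    for token in query:
        if is_l1(token) and token in vectors:
            query_vec = vectors[token]
            token = max(targets, key=lambda w: cosine(query_vec, vectors[w]))
        result.append(token)
    return result


# Hypothetical toy parallel corpus: (target sentence, source sentence, word links).
parallel = [
    (["they", "receive", "education"], ["他們", "接受", "教育"], [(0, 0), (1, 1), (2, 2)]),
    (["we", "receive", "training"], ["我們", "接受", "訓練"], [(0, 0), (1, 1), (2, 2)]),
]

# Training data = original target sentences plus their mixed-code variants.
training = []
for target, source, links in parallel:
    training.append(target)
    training.extend(transplant(target, source, links))

vectors = cooccurrence_vectors(training)
print(translate_query(["接受", "education"], vectors))
```

Because the swapped-in native-language word appears in the same contexts as the word it replaces, the two end up with similar vectors, which is what lets the query word be mapped back to a target-language word. A real system would presumably train dense embeddings on a large corpus and pass the transformed query to the underlying search engine for retrieval and re-ranking.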

Abstract
Abstract (Chinese)
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
2 Related Work
3 Methodology
3.1 Preprocessing Parallel Corpus
3.1.1 Re-aligning a Parallel Corpus
3.1.2 Transplanting Translations across Parallel Sentences
3.2 Learning Bilingual Word Embeddings
3.3 Run-Time
4 Experiment and Evaluation
4.1 Training x.Linggle
4.1.1 Dataset
4.1.2 Bilingual Word Embeddings
4.1.3 x.Linggle: a Cross Language Search Engine
4.2 Preliminary Evaluation
5 Conclusion
Reference
