SMURF: A Cross-lingual Co-derivative Detection System | National Tsing Hua University Theses and Dissertations Repository

Graduate student:	鞏和平 Jose P. Gonzalez-Brenes
Thesis title:	SMURF: A Cross-lingual Co-derivative Detection System (一個偵測跨語言內容相互引用的系統)
Advisor:	林福仁 Fu-Ren Lin
Oral defense committee:
Degree:	Master
Department:	College of Technology Management - Institute of Technology Management
Year of publication:	2007
Academic year of graduation:	95
Language:	English
Number of pages:	46
Keywords (Chinese):	multilingual plagiarism detection, cross-lingual co-derivative detection, translation detection
Keywords (foreign language):	multilingual plagiarism detection, cross-lingual co-derivative detection, translation detection

An automatic approach to detecting overlapping content can reduce the repetitive and tedious work of manually checking the originality of a large pool of documents. The objective of this research is to design and evaluate a novel algorithm, SMURF (Semantic MUltilingual Related-Document Finder), which finds pairs of documents in different languages that share a common source (co-derivatives) and may thereby help protect intellectual property. We demonstrate SMURF by identifying English co-derivatives on the Web of Spanish documents from several textual domains, with a sentence-level precision of 88.75%. Although SMURF's design focuses on English and Spanish, the concepts could be applied to other languages for which the constituent technologies have been studied.
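
The thesis's list of tables refers to a Dice score for sentence-level co-derivatives and mentions a 55% figure, but this record does not say how SMURF computes or applies it. The sketch below is therefore only an illustration of the general idea, not the thesis's method: it assumes the Spanish sentences have already been translated into English, uses plain whitespace tokenization, treats each sentence as a set of words, and uses 0.55 as a hypothetical cutoff. The names `dice_score` and `find_co_derivatives` are invented for this example.

```python
def dice_score(tokens_a, tokens_b):
    """Dice coefficient between two token collections, treated as sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0  # two empty sentences: define the score as 0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def find_co_derivatives(translated_sentences, candidate_sentences, threshold=0.55):
    """Return (i, j, score) for every sentence pair scoring at or above threshold.

    translated_sentences: source sentences already rendered in English (assumed).
    candidate_sentences: English sentences from a retrieved Web document.
    threshold: hypothetical cutoff, chosen only for illustration.
    """
    matches = []
    for i, src in enumerate(translated_sentences):
        src_tokens = src.lower().split()
        for j, cand in enumerate(candidate_sentences):
            score = dice_score(src_tokens, cand.lower().split())
            if score >= threshold:
                matches.append((i, j, score))
    return matches


if __name__ == "__main__":
    translated = ["the enigma machine was a cipher device"]
    candidates = [
        "the enigma machine was an encryption device",
        "yesterday all my troubles seemed so far away",
    ]
    for i, j, score in find_co_derivatives(translated, candidates):
        print(f"source sentence {i} ~ candidate sentence {j}: Dice = {score:.2f}")
```

A set-based Dice score ignores word order, which makes it tolerant of the reordering that translation introduces, at the cost of occasionally matching unrelated sentences that share common words.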

Table of Contents
ABSTRACT
Acknowledgment
List of Tables
List of Figures
Chapter 1 Introduction
Chapter 2 Literature Review
2.1 Constituent Technologies
(1) Sentence Boundary Detection
(2) Bilingual Sentence Alignment
(3) Bilingual Word Alignment
(4) Part of Speech Tagging
(5) Key-phrase Selection
2.2 Previous Efforts
(1) SNITCH: Copy-Paste Detection
(2) Sherlock: Sentence Based Plagiarism Detection
Chapter 3 Research Framework
3.1 Overview
3.2 Interface
3.3 Architecture of SMURF
(1) Tokenization
(2) Translation
(3) Key-phrase extraction
(4) Key-phrase translation
(5) Document Search
(6) Clean search results
(7) Co-derivative identification
(8) Parsing of results
Chapter 4 Evaluation
4.1 Experimental Design
4.2 Testing Data
4.3 Evaluation results and discussion
(1) Experiment 1
(2) Experiment 2
Chapter 5 Conclusions and Future Work
References
Appendix A Cross Document Relationship (Source: Radev, n.d)
Appendix B Data Set
B.1 Enigma Original English Source
B.2 Enigma Translation
B.3 Space Elevator English Source
B.4 Space Elevator Spanish Translation
B.5 Salvador Dalí English Source
B.6 Salvador Dalí Spanish Translation
B.7 Yesterday (Song) English Source
B.8 Yesterday (Song) Spanish Translation
B.9 Supreme Court of the United States English Source
B.10 Supreme Court of the United States Spanish Translation
B.11 Tony Blair English Source
B.12 Tony Blair Spanish Translation
List of Tables
Table 2.1 Mean of acceptable key terms extracted
Table 2.2 Comparison of commercial plagiarism detection software
Table 4.1 Translations Detected
Table 4.2 Precision of the sentence-level alignments found by SMURF
Table 4.3 Sentences incorrectly identified as co-derivatives by SMURF
Table 4.4 Co-derivatives sentences detected with Dice Score below 55%
Table 4.5 Co-derivatives example found for Tony Blair document
Table 4.6 The different kinds of the co-derivatives found
List of Figures
Figure 2.1 Example of a non-trivial sentence alignment
Figure 2.2 All the possible alignments of Chinese-English parallel sentences
Figure 2.3 SNITCH Algorithm
Figure 3.1 Architecture of SMURF
Figure 3.2 Upload Spanish document
Figure 3.3 Automatic key-phrases detection
Figure 3.4 Co-derivative annotations on the Spanish document
Figure 3.5 Sample Spanish text
Figure 3.6 Pseudo-code of the find co-derivative function
Figure 3.7 Example of grouping words with a sliding window
Figure 3.8 Example of a Spanish tokenization
Figure 3.9 Translation algorithm example
Figure 3.10 Sample translation
Figure 3.11 Sample n-grams extracted
Figure 3.12 Example of translation of “illegal copy” key-phrase
Figure 4.1 Workflow for Experiment 1
Figure 4.2 Proportion of the relation type annotated of the co-derivatives found

References
Brown, P.F., Lai, J.C., Mercer, R.L. (1991). Aligning Sentences in Parallel Corpora [Electronic version]. Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pp. 169-176.
Brown, P.F., Della Pietra, V.J., Della Pietra, S.A., Mercer, R. (1993). The mathematics of statistical machine translation: parameter estimation [Electronic version]. Computational Linguistics, Volume 19, Issue 2.
Dorr, B., Martí, M.A., Castellón, I. (1997). Spanish EuroWordNet and LCS-Based Interlingual MT. Proceedings of the Workshop on Interlinguas in MT, pp. 19-32.
Gale, W., Church, K. (1991). A program for aligning sentences in bilingual corpora [Electronic version]. Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pp. 177-184.
Fellbaum, C. (Ed.). (1998). WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA.
Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C., Nevill-Manning, C.G. (1999). Domain-specific keyphrase extraction [Electronic version]. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), pp. 668-673.
Frantzi, K., Ananiadou, S., Mima, H. (2000). Automatic recognition of multi-word terms [Electronic version]. International Journal of Digital Libraries 3(2), Special issue edited by Nikolau, C. & Stephanidis, C. (eds.), pp. 117-132. http://personalpages.manchester.ac.uk/staff/sophia.ananiadou/IJODL2000.pdf
Hoad, T.C., Zobel, J. (2003). Methods for identifying versioned and plagiarized documents [Electronic version]. J. Am. Soc. Inf. Sci. Technol. 54, 3, pp. 203-215.
Koehn, P. (2002). Europarl: A Multilingual Corpus for Evaluation of Machine Translation. Unpublished. http://people.csail.mit.edu/~koehn/publications/europarl.ps
Monostori, K., Finkel, R., Zaslavsky, A.B., Hodász, G., Pataki, M. (2002). Comparison of Overlap Detection Techniques [Electronic version]. International Conference on Computational Science (1), pp. 51-60.
Niezgoda, S., Way, T. (2006). SNITCH: a software tool for detecting cut and paste plagiarism [Electronic version]. Proceedings of the 37th SIGCSE technical symposium on Computer Science, pp. 51-55.
Radev, D. (n.d.). Cross-document relationship classification for text summarization. Retrieved January 8, 2007 from University of Michigan Web Site: http://tangra.si.umich.edu/~radev/papers/progress/p1.pdf
Reynar, J., Ratnaparkhi, A. (1997). A maximum entropy approach to identifying sentence boundaries [Electronic version]. Proceedings of the fifth conference on Applied natural language processing, pp. 16-19.
Hewavitharana, S. (2006). (Statistical) Approaches to Word Alignment. Retrieved January 8, 2007 from Carnegie Mellon University, Language Technologies Institute Web Site: http://www.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-55/lti/Courses/734/Spring-06/Sanjika_11734.ppt
Schmid, H. (1994). Probabilistic Part-of-Speech Tagging Using Decision Trees [Electronic version]. Proceedings of the International Conference on New Methods in Language Processing, pp. 44-49.
Simard, M., Plamondon, P. (1998). Bilingual Sentence Alignment: Balancing Robustness and Accuracy [Electronic version]. Machine Translation, Volume 13, Number 1, pp. 59-80.
Si, A., Leong, H.V., Lau, R.W.H. (1997). CHECK: a document plagiarism detection system [Electronic version]. Proceedings of the 1997 ACM symposium on Applied computing.
White, D.R., Joy, M.S. (2004). Sentence-Based Natural Language Plagiarism Detection [Electronic version]. ACM Journal on Educational Resources in Computing, Vol. 4, No. 4, December 2004, Article 2.
Zhang, Y., Zincir-Heywood, N., Milios, E. (2005). Narrative text classification for automatic key phrase extraction in web document corpora [Electronic version]. Workshop On Web Information And Data Management.

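Full-text release date: full text not authorized for public release (campus network)
Full-text release date: full text not authorized for public release (off-campus network)
Full-text release date: full text not authorized for public release (National Central Library: Taiwan theses and dissertations system)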
