
Author: 鞏和平 (Jose P. Gonzalez-Brenes)
Title: SMURF: A Cross-lingual Co-derivative Detection System (Chinese title: 一個偵測跨語言內容相互引用的系統)
Advisor: 林福仁 (Fu-Ren Lin)
Oral defense committee:
Degree: Master
Department: College of Technology Management, Institute of Technology Management
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar)
Language: English
Number of pages: 46
Keywords: multilingual plagiarism detection, cross-lingual co-derivative detection, translation detection
  • An automatic approach to detecting content overlap would reduce the repetitive and tedious work of manually checking the originality of a large pool of documents. The objective of this research is to design and evaluate a novel algorithm, SMURF (Semantic MUltilingual Related-Document Finder), which finds pairs of documents in different languages that share a common source (co-derivatives) and may thereby help protect intellectual property. We demonstrate SMURF by identifying English co-derivatives on the Web of Spanish documents from several textual domains, with a sentence-level precision of 88.75%. Although SMURF's design focused on English and Spanish, the concepts applied could be implemented for other languages for which the constituent technologies have been studied.



    Table of Contents
    ABSTRACT
    Acknowledgment
    List of Tables
    List of Figures
    Chapter 1 Introduction
    Chapter 2 Literature Review
      2.1 Constituent Technologies
        (1) Sentence Boundary Detection
        (2) Bilingual Sentence Alignment
        (3) Bilingual Word Alignment
        (4) Part of Speech Tagging
        (5) Key-phrase Selection
      2.2 Previous Efforts
        (1) SNITCH: Copy-Paste Detection
        (2) Sherlock: Sentence Based Plagiarism Detection
    Chapter 3 Research Framework
      3.1 Overview
      3.2 Interface
      3.3 Architecture of SMURF
        (1) Tokenization
        (2) Translation
        (3) Key-phrase extraction
        (4) Key-phrase translation
        (5) Document Search
        (6) Clean search results
        (7) Co-derivative identification
        (8) Parsing of results
    Chapter 4 Evaluation
      4.1 Experimental Design
      4.2 Testing Data
      4.3 Evaluation results and discussion
        (1) Experiment 1
        (2) Experiment 2
    Chapter 5 Conclusions and Future Work
    References
    Appendix A Cross Document Relationship (Source: Radev, n.d.)
    Appendix B Data Set
      B.1 Enigma Original English Source
      B.2 Enigma Translation
      B.3 Space Elevator English Source
      B.4 Space Elevator Spanish Translation
      B.5 Salvador Dalí English Source
      B.6 Salvador Dalí Spanish Translation
      B.7 Yesterday (Song) English Source
      B.8 Yesterday (Song) Spanish Translation
      B.9 Supreme Court of the United States English Source
      B.10 Supreme Court of the United States Spanish Translation
      B.11 Tony Blair English Source
      B.12 Tony Blair Spanish Translation

    List of Tables
      Table 2.1 Mean of acceptable key terms extracted
      Table 2.2 Comparison of commercial plagiarism detection software
      Table 4.1 Translations Detected
      Table 4.2 Precision of the sentence-level alignments found by SMURF
      Table 4.3 Sentences incorrectly identified as co-derivatives by SMURF
      Table 4.4 Co-derivatives sentences detected with Dice Score below 55%
      Table 4.5 Co-derivatives example found for Tony Blair document
      Table 4.6 The different kinds of the co-derivatives found

    List of Figures
      Figure 2.1 Example of a non-trivial sentence alignment
      Figure 2.2 All the possible alignments of Chinese-English parallel sentences
      Figure 2.3 SNITCH Algorithm
      Figure 3.1 Architecture of SMURF
      Figure 3.2 Upload Spanish document
      Figure 3.3 Automatic key-phrases detection
      Figure 3.4 Co-derivative annotations on the Spanish document
      Figure 3.5 Sample Spanish text
      Figure 3.6 Pseudo-code of the find co-derivative function
      Figure 3.7 Example of grouping words with a sliding window
      Figure 3.8 Example of a Spanish tokenization
      Figure 3.9 Translation algorithm example
      Figure 3.10 Sample translation
      Figure 3.11 Sample n-grams extracted
      Figure 3.12 Example of translation of “illegal copy” key-phrase
      Figure 4.1 Workflow for Experiment 1
      Figure 4.2 Proportion of the relation type annotated of the co-derivatives found
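
    The list of tables refers to a Dice score with a 55% cut-off (Table 4.4), but this record does not say how the thesis computes it. As a rough illustration only, the sketch below assumes the standard Dice coefficient over sets of word tokens, applied to an English sentence and a candidate sentence that has already been translated into English. The names `tokens`, `dice`, and `looks_co_derivative`, the naive tokenizer, and the use of 0.55 as a decision threshold are assumptions made for this example and are not taken from SMURF's implementation.

```python
from __future__ import annotations

import re


def tokens(sentence: str) -> set[str]:
    """Return the set of lowercase word tokens (a deliberately naive tokenizer)."""
    return set(re.findall(r"\w+", sentence.lower()))


def dice(a: str, b: str) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) over the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0  # convention chosen here for two empty sentences
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def looks_co_derivative(a: str, b: str, threshold: float = 0.55) -> bool:
    """Hypothetical decision rule: flag the pair when the score reaches the threshold."""
    return dice(a, b) >= threshold


if __name__ == "__main__":
    source = "The Supreme Court is the highest court in the United States"
    candidate = "The Supreme Court is the highest tribunal of the United States"
    print(round(dice(source, candidate), 3), looks_co_derivative(source, candidate))
```

    A set-based score ignores word order and repeated words, so it tolerates the reordering that translation introduces, at the cost of matching unrelated sentences that happen to share common words.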

    References
    Brown, P.F., Lai, J. C., Mercer, R.L. (1991). Aligning Sentences in Parallel Corpora
    [Electronic version]. Proceedings of 29th Annual Meeting of the Association for
    Computational Linguistics, pp. 169-176.
    Brown, P.F., Della Pietra, V.J., Della Pietra, S.A., Mercer, R. (1993). The mathematics
    of statistical machine translation: parameter estimation [Electronic version].
    Computational Linguistics, Volume 19, Issue 2.
    Dorr, B., Martí, M.A., Castellón, I. (1997). Spanish EuroWordNet and LCS-Based
    Interlingual MT. Proceedings of the Workshop on Interlinguas in MT, pp. 19-32.
    Gale, W., Church, K. (1991). A program for aligning sentences in bilingual corpora
    [Electronic version]. Proceedings of the 29th Annual Meeting of the Association for
    Computational Linguistics, pp. 177-184.
    Fellbaum, C. (Ed.) (1998). WordNet: An Electronic Lexical Database. The MIT
    Press, Cambridge, MA.
    Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C., Nevill-Manning, C.G. (1999).
    Domain-specific keyphrase extraction [Electronic version]. Proceedings of the Sixteenth
    International Joint Conference on Artificial Intelligence (IJCAI-99), pp. 668-673.
    Frantzi, K., Ananiadou, S., Mima, H. (2000). Automatic recognition of multi-word terms
    [Electronic version]. International Journal of Digital Libraries 3(2), Special issue
    edited by Nikolau, C. & Stephanidis, C. (eds.), 117-132.
    http://personalpages.manchester.ac.uk/staff/sophia.ananiadou/IJODL2000.pdf
    Hoad, T. C., Zobel, J. (2003). Methods for identifying versioned and plagiarized
    documents [Electronic version]. J. Am. Soc. Inf. Sci. Technol. 54, 3, 203-215.
    Koehn, P. (2002). Europarl: A Multilingual Corpus for Evaluation of Machine
    Translation. Unpublished. http://people.csail.mit.edu/~koehn/publications/europarl.ps
    Monostori, K., Finkel, R., Zaslavsky, A. B., Hodász, G., Pataki, M. (2002). Comparison of
    Overlap Detection Techniques [Electronic version]. International Conference on
    Computational Science (1), 51-60.
    Niezgoda, S., Way, T. (2006). SNITCH: a software tool for detecting cut and paste
    plagiarism [Electronic version]. Proceedings of the 37th SIGCSE technical
    symposium on Computer Science. Pages: 51 - 55
    Radev, D. (n.d.). Cross-document relationship classification for text summarization.
    Retrieved January 8, 2007 from University of Michigan Web Site:
    http://tangra.si.umich.edu/~radev/papers/progress/p1.pdf
    Reynar, J., Ratnaparkhi, A. (1997). A maximum entropy approach to identifying
    sentence boundaries [Electronic version]. Proceedings of the fifth conference on
    Applied natural language processing. Pages: 16 - 19
    Hewavitharana, S. (2006). (Statistical) Approaches to Word Alignment. Retrieved
    January 8, 2007 from Carnegie Mellon University, Language Technologies Institute
    Web Site:
    http://www.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-55/lti/Courses/734/Spring-06/Sanjika_11734.ppt
    Schmid, H. (1994). Probabilistic Part-of-Speech Tagging Using Decision Trees
    [Electronic version]. Proceedings of the International Conference on New Methods in
    Language Processing, 44-49.
    Simard, M., Plamondon, P. (1998). Bilingual Sentence Alignment: Balancing
    Robustness and Accuracy [Electronic version]. Machine Translation. Volume 13,
    Number 1. Pages 59-80.
    Si, A., Leong, H.V., Lau, R.W.H. (1997). CHECK: a document plagiarism detection
    system [Electronic version]. Proceedings of the 1997 ACM symposium on Applied
    computing.
    White, D.R., Joy, M. S. (2004). Sentence-Based Natural Language Plagiarism Detection
    [Electronic version]. ACM Journal on Educational Resources in Computing. Vol. 4,
    No. 4, December 2004. Article 2.
    Zhang, Y., Zincir-Heywood, N., Milios, E. (2005). Narrative text classification for
    automatic key phrase extraction in web document corpora [Electronic version].
    Workshop On Web Information And Data Management.

    Full text: not authorized for public release (on-campus network, off-campus network, or the National Central Library's Taiwan theses and dissertations system).