Graduate student: | 鞏和平 Jose P. Gonzalez-Brenes |
---|---|
Thesis title: | 一個偵測跨語言內容相互引用的系統 SMURF: A Cross-lingual Co-derivative Detection System |
Advisor: | 林福仁 Fu-Ren Lin |
Oral defense committee: | |
Degree: | Master |
Department: | College of Technology Management - Institute of Technology Management |
Year of publication: | 2007 |
Academic year of graduation: | 95 |
Language: | English |
Number of pages: | 46 |
Keywords (Chinese): | multilingual plagiarism detection, cross-lingual co-derivative detection, translation detection |
Keywords (English): | multilingual plagiarism detection, cross-lingual co-derivative detection, translation detection |
Manually checking the originality of a large pool of documents is repetitive and tedious, and an automatic approach to detecting content overlap reduces that workload. The objective of this research is to design and evaluate a novel algorithm, SMURF (Semantic MUltilingual Related-Document Finder), which finds pairs of documents in different languages that share a common source (co-derivatives) and may thereby help protect intellectual property. We demonstrate SMURF by identifying English co-derivatives on the Web of Spanish documents from several textual domains, with a sentence-level precision of 88.75%. Although SMURF's design focused on English and Spanish, the concepts applied could be easily implemented for other languages for which the constituent technologies have been studied.