
Author: 馬喬斯 (Martinez Bueso, Jose Isaac)
Thesis title: 應用於拉丁美洲社群媒體之非法毒品術語識別
Identifying Illegal-Drug Terminology in Latin American Social Media
Advisor: 陳宜欣 (Chen, Yi-Shin)
Oral defense committee: 蘇豐文 (Soo, Von-Wun); 陳朝欽 (Chen, Chaur-Chin)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications
Year of publication: 2017
Academic year of graduation: 105 (ROC calendar)
Language: English
Number of pages: 36
Keywords (Chinese): 非法毒品, 拉丁美洲, 文字嵌入技術, 文本挖掘
Keywords (English): Illegal Drugs, Latin America, Word Embeddings, Text Mining
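  • In recent years, illegal drug activity on social media has been increasing. To curb it, social platforms continually update blacklists of drug keywords in order to block or hide such content. Even so, the updates cannot keep pace with how quickly these words change and evolve. Languages other than English are even harder to cover, so the prevalence of illegal drugs in the non-English-speaking countries of Latin America remains high.

    To address the drug problem on Latin American social media, the first step is to expand the drug keyword blacklists for the region's languages so that related content can be blocked. However, because drug vocabulary in the region is evasive, information is scarce, and usage is inconsistent across countries, finding the coded words that allude to drugs in Latin America is a major difficulty.

    Using social media data and word embedding techniques, this study targets the problem of automatically extracting drug-related terms across different countries and proposes a method that measures drug-word similarity from a small number of known drug terms. It further trains a support vector machine classifier to learn drug vocabulary. In addition, evaluation by human annotators shows that the method can extract drug terms from different countries and that these terms correlate highly with currently known drug terminology.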


    Recent years have seen increased activity in illicit drug markets on social media websites. These platforms rely on a set of blacklisted keywords to block or obscure this kind of content in a timely fashion. However, these sets of words are always evolving, and most current sources do not cover languages other than English, which excludes regions heavily affected by illicit drugs, such as Latin America. Finding these coded words for the Latin American region is a challenging task due to their evasive nature, scarcity, and inconsistent meaning across countries. This paper studies the problem of automatically finding drug-related terms at a country level using social media data and neural word embedding techniques. We propose a way to measure the similarity of a word to the whole drug context, given an initial set of known drug words. We then refine these results using the most informative features of an SVM classifier trained specifically to learn drug contexts. Finally, we extract country-level drug terminology that was later evaluated by human annotators, showing a high correlation between known drug terminology and the extracted terms.
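
The idea of scoring a word against a whole seed set can be illustrated with a minimal sketch. The abstract does not state the exact scoring function, so the rule below (mean cosine similarity to each seed word) is an assumption, not the thesis's method. The function names and the 3-dimensional toy vectors are hypothetical; real embeddings would come from a model trained on social media text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def seed_similarity(word, seeds, vectors):
    """Assumed scoring rule: mean cosine similarity between `word` and each seed."""
    sims = [cosine(vectors[word], vectors[s]) for s in seeds if s != word]
    return sum(sims) / len(sims)

def rank_candidates(seeds, vectors):
    """Rank every non-seed word by its similarity to the seed set, highest first."""
    candidates = [w for w in vectors if w not in seeds]
    return sorted(
        candidates,
        key=lambda w: seed_similarity(w, seeds, vectors),
        reverse=True,
    )

# Hypothetical toy embeddings; the placeholder names stand in for real vocabulary.
toy_vectors = {
    "seed_a": [0.9, 0.1, 0.0],
    "seed_b": [0.8, 0.2, 0.1],
    "candidate_x": [0.85, 0.15, 0.05],  # points in roughly the same direction as the seeds
    "candidate_y": [0.0, 0.1, 0.9],     # points in a different direction
}
seeds = {"seed_a", "seed_b"}

ranking = rank_candidates(seeds, toy_vectors)
print(ranking)
```

In this sketch, words whose vectors lie close to the seed set rank first. A later filtering stage, such as the SVM-based refinement the abstract mentions, would operate on the top of that ranking.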

    1 Introduction
    2 Related Work
      2.1 Illicit Drugs and Social Media
      2.2 Illicit Drug Lexicons
      2.3 Word Embeddings
    3 Methodology
      3.1 Data Collection and Pre-processing
      3.2 Candidate word selection
      3.3 Candidate word filtering
      3.4 Synonyms and Contextual Words
    4 Experiments
      4.1 Word Embedding Modeling
      4.2 Human Evaluation
      4.3 Other Findings
    5 Conclusions and Future Work

