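Predicting Injection Vulnerabilities in Web Applications (一個基於機器學習的網頁靜態檢測方法) | National Tsing Hua University Theses and Dissertations Repository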

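Graduate student:	Chien, Yi (簡易)
Thesis title:	Predicting Injection Vulnerabilities in Web Applications (Chinese title: 一個基於機器學習的網頁靜態檢測方法, "A Machine-Learning-Based Static Detection Method for Web Pages")
Advisor:	Sun, Hung-Min (孫宏民)
Committee members:	Tzeng, Wen-Guey (曾文貴); Yen, Sung-Ming (顏嵩銘)
Degree:	Master
Department:
Year of publication:	2017
Academic year of graduation:	105 (ROC calendar)
Language:	English
Pages:	71
Keywords (translated from Chinese):	static detection, injection-type web attacks, PHP, JavaScript, machine learning
Keywords (English):	Static analysis, Injection type vulnerability, PHP, JavaScript, Machine learning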

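Web pages have become inseparable from modern life: booking flights, shopping online, and browsing Facebook are all part of our daily routine. These activities, however, expose a great deal of users' personal and private information, which makes web security all the more important.
A large share of today's websites use PHP for back-end development; Facebook, Wikipedia, and WordPress, for example, are all developed and maintained with PHP on the back end. Node.js, on the other hand, has become popular in recent years. Its advantage is that a single language, JavaScript, can cover both the front end and the back end, and more and more developers are choosing Node.js for web development.
In this thesis we propose a static detection method for injection-type vulnerabilities in PHP and JavaScript. We propose a way to extract features from PHP and JavaScript files that represent each file's vulnerability-related behavior, and we use these features with machine learning to train vulnerability detection models. Given a PHP or JavaScript file, our system returns the injection-type vulnerabilities the file may contain and reports them to the developer, so that developers can check a site before it goes live and patch potential vulnerabilities.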

Browsing websites has become part of modern life, whether for online shopping,
booking flight tickets, or browsing Facebook. Our daily lives have become inseparable
from the internet and websites, and our personal and private data are uploaded
to web services. Securing websites has therefore become an important issue.
Vulnerabilities often come from program flaws that go unnoticed, and it is the
developers' obligation to make sure that web projects are safe and secure.
Developers can choose from numerous languages to build a website. For
example, a large share of websites are built on PHP, such as Facebook and Wikipedia.
Node.js, on the other hand, is becoming increasingly popular with developers.
If developers can examine a website's security flaws and repair them before
release, the website's service will be more secure, and users can browse
without worrying about leakage of their personal data.
In this thesis, we propose a system that uses a static analysis method based on
machine learning to predict injection-type vulnerabilities in PHP and JavaScript. We
propose a feature extraction algorithm for source code and use machine learning
techniques to learn the possible vulnerabilities. Given source code written in PHP
or JavaScript, our system uses the trained models to predict possible injection-type
vulnerabilities and reports them to the developers. As a consequence, developers can
detect potential vulnerabilities in a website project and repair weak points before
the website's release.
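
The pipeline described in the abstract (extract a feature vector from each source file, train a classifier on labeled files, then predict on a new file) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the table of contents indicates that the thesis builds abstract syntax trees and trains classifiers with Weka and LIBSVM, whereas this sketch swaps in regular-expression matching and a small hand-written Bernoulli Naive Bayes so that it runs with the Python standard library alone. The feature names, the patterns, and the six training snippets are all hypothetical.

```python
import math
import re

# Hypothetical features: each name/pattern pair is an assumption made for
# illustration, not the feature vector designed in the thesis.
FEATURE_PATTERNS = {
    "reads_user_input": re.compile(r"\$_(GET|POST|REQUEST|COOKIE)\b"),
    "calls_sql_sink": re.compile(r"\b(mysql_query|mysqli_query|pg_query)\s*\("),
    "calls_os_sink": re.compile(r"\b(system|exec|shell_exec|passthru)\s*\("),
    "echoes_output": re.compile(r"\b(echo|print)\b"),
    "calls_sanitizer": re.compile(
        r"\b(htmlspecialchars|mysqli_real_escape_string|escapeshellarg|intval)\s*\("
    ),
}
FEATURE_NAMES = sorted(FEATURE_PATTERNS)


def extract_features(source):
    """Return a binary feature vector (tuple of 0/1) for one PHP source string."""
    return tuple(int(bool(FEATURE_PATTERNS[name].search(source)))
                 for name in FEATURE_NAMES)


class BernoulliNB:
    """Minimal Bernoulli Naive Bayes with Laplace smoothing."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        n_features = len(X[0])
        self.log_prior = {}
        self.prob = {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(rows) / len(X))
            # P(feature_j = 1 | class c), smoothed so it is never 0 or 1.
            self.prob[c] = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                            for j in range(n_features)]
        return self

    def predict(self, x):
        def score(c):
            return self.log_prior[c] + sum(
                math.log(p if v else 1 - p) for v, p in zip(x, self.prob[c]))
        return max(self.classes, key=score)


# Toy labeled data, invented for this sketch (not the thesis dataset).
TRAIN = [
    ('<?php $id = $_GET["id"]; mysql_query("SELECT * FROM t WHERE id=" . $id); ?>',
     "vulnerable"),
    ('<?php system("ping " . $_POST["host"]); ?>', "vulnerable"),
    ('<?php echo $_GET["name"]; ?>', "vulnerable"),
    ('<?php $id = intval($_GET["id"]); mysql_query("SELECT * FROM t WHERE id=" . $id); ?>',
     "safe"),
    ('<?php echo htmlspecialchars($_GET["name"]); ?>', "safe"),
    ('<?php echo "hello"; ?>', "safe"),
]

model = BernoulliNB().fit([extract_features(src) for src, _ in TRAIN],
                          [label for _, label in TRAIN])

candidate = '<?php $q = $_POST["q"]; mysqli_query($db, "SELECT * FROM t WHERE name=" . $q); ?>'
print(dict(zip(FEATURE_NAMES, extract_features(candidate))))
print(model.predict(extract_features(candidate)))
```

Pattern matching on raw text cannot tell whether the sanitized value is the one that reaches the sensitive call, which is one reason an AST-based extractor is the more robust choice in practice.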

Table of Contents
List of Algorithms
List of Figures
List of Tables
Chapter 1 Introduction
  1 Motivation
  2 Our Contributions
  3 Organization
Chapter 2 Background
  1 Categories of Injection Attacks
    1.1 SQL Injection
    1.2 OS Command Injection
    1.3 LDAP Injection
    1.4 XML Injection
    1.5 Local File Inclusion Injection
    1.6 PHP Remote File Injection
    1.7 Cross-site Scripting
  2 Malware Analysis Method
    2.1 Static Analysis
    2.2 Dynamic Analysis
  3 Abstract Syntax Tree
  4 Machine Learning
    4.1 Decision Tree
    4.2 Random Forest
    4.3 Naive Bayes
    4.4 Support Vector Machine
Chapter 3 Related Works
  1 Test Cases Generation
    1.1 Stivalet et al. Proposed Method
  2 Static Analysis Method
    2.1 Scandariato et al. Proposed Method
    2.2 Shar et al. Proposed Method
    2.3 Walden et al. Proposed Method
    2.4 Gupta et al. Proposed Method
    2.5 Medeiros et al. Proposed Method
  3 Dynamic Analysis Method
    3.1 Kiezun et al. Proposed Method
    3.2 Huang et al. Proposed Method
  4 Hybrid Analysis Method
    4.1 Shar et al. Proposed Method
Chapter 4 Scheme
  1 Overview
  2 Proposed Scheme
    2.1 Labeled Data Collection
    2.2 Building Abstract Syntax Tree
    2.3 Designing Feature Vector
    2.4 Extraction of Feature Vector
    2.5 Training Classifiers
    2.6 Predicting Vulnerabilities
Chapter 5 Implementation
  1 Tools
    1.1 Python Packages
    1.2 Weka
    1.3 LIBSVM
  2 Building Abstract Syntax Tree
    2.1 Abstract Syntax Tree of PHP
    2.2 Abstract Syntax Tree of JavaScript
  3 Extracting Feature Vector
    3.1 Extracting Feature Vector of PHP
    3.2 Extracting Feature Vector of JavaScript
  4 Training Classifiers
    4.1 Weka
    4.2 LIBSVM
  5 Prediction Vulnerability
Chapter 6 Experimental Result and Analysis
  1 Dataset for Experiment
  2 Validation
    2.1 Cross-Validation
  3 Result and Analysis
    3.1 Confusion Matrix and Statistic Numbers
    3.2 Experimental Results
    3.3 Analysis
  4 Comparison
Chapter 7 Conclusions and Future Works
  1 Conclusions
  2 Future Works
List of Algorithms
  Algorithm of Feature Vector Extraction
  Functions of Pseudo Code in Algorithm 1
List of Figures
  1 Number of Websites before 2015
  1 Reflected XSS Attack Flow
  2 Abstract Syntax Tree
  3 Decision Tree
  4 J48 Decision Tree
  1 Training System Flow Chart
  2 Prediction System Flow Chart
  1 PHP findGetPostVariable Function
  2 PHP findSystemCall Function
  3 PHP Sanitize Function
  4 JavaScript checkModule Function
  5 JavaScript findSQL Function
  6 Weka
  7 Weka Input Format
  8 J48 Properties
  9 Random Forest Properties
  10 Naive Bayes Properties
  11 Weka Package Manager
  12 Predicting Vulnerability
List of Tables
  1 PHP Generated Test Cases
  2 Table of Input Variables
  3 Table of Sensitive Points
  4 Table of General Behavior
  5 Feature Vector Table
  1 Abstract Syntax Tree Node Types and Corresponding Properties of PHP
  2 Abstract Syntax Tree Node Types and Corresponding Properties of JavaScript
  1 Table of Confusion Matrix
  2 Performance of Training Models with J48
  3 Performance of Training Models with Random Forest
  4 Performance of Training Models with Naive Bayes
  5 Performance of Training Models with SVM
  6 XSS Vulnerability Prediction Evaluation

[1] Web statistics report. https://whitehatsec.com/categories/statisticsreport.
[2] Ldap injection owasp. https://www.owasp.org/index.php/LDAP_injection.
[3] Xml injection owasp. https://www.owasp.org/index.php/Testing_for_XML_Injection_(OWASP-DV-008).
[4] Local file inclusion injection owasp. https://www.owasp.org/index.php/Testing_for_Local_File_Inclusion.
[5] Static analysis wikipedia. https://en.wikipedia.org/wiki/Static_program_analysis.
[6] Dynamic analysis wikipedia. https://en.wikipedia.org/wiki/Dynamic_program_analysis.
[7] Abstract syntax tree wikipedia. https://en.wikipedia.org/wiki/Abstract_syntax_tree.
[8] Machine learning wikipedia. https://en.wikipedia.org/wiki/Machine_learning.
[9] Andy Liaw and Matthew Wiener. Classification and regression by randomforest.
R news, 2(3):18–22, 2002.
[10] Naive bayes wikipedia. https://en.wikipedia.org/wiki/Naive_Bayes_classifier.
[11] Irina Rish. An empirical study of the naive bayes classifier. In IJCAI 2001
workshop on empirical methods in artificial intelligence, volume 3, pages 41–46.
IBM New York, 2001.
[12] Svm wikipedia. https://en.wikipedia.org/wiki/Support_vector_machine.
[13] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines,
2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2012.
[14] Bertrand Stivalet and Elizabeth Fong. Large scale generation of complex and
faulty php test cases. In Software Testing, Verification and Validation (ICST),
2016 IEEE International Conference on, pages 409–415. IEEE, 2016.
[15] Riccardo Scandariato, James Walden, Aram Hovsepyan, and Wouter Joosen.
Predicting vulnerable software components via text mining. IEEE Transactions
on Software Engineering, 40(10):993–1006, 2014.
[16] Lwin Khin Shar and Hee Beng Kuan Tan. Predicting sql injection and cross
site scripting vulnerabilities through mining input sanitization patterns. Information
and Software Technology, 55(10):1767–1780, 2013.
[17] James Walden, Jeff Stuckman, and Riccardo Scandariato. Predicting vulnerable
components: Software metrics vs text mining. In Software Reliability Engineering
(ISSRE), 2014 IEEE 25th International Symposium on, pages 23–33. IEEE,
2014.
[18] Mukesh Kumar Gupta, Mahesh Chandra Govil, and Girdhari Singh. Predicting
cross-site scripting (xss) security vulnerabilities in web applications. In
Computer Science and Software Engineering (JCSSE), 2015 12th International
Joint Conference on, pages 162–167. IEEE, 2015.
[19] Ibéria Medeiros, Nuno F Neves, and Miguel Correia. Automatic detection and
correction of web application vulnerabilities using data mining to predict false
positives. In Proceedings of the 23rd international conference on World wide
web, pages 63–74. ACM, 2014.
[20] Adam Kiezun, Philip J Guo, Karthick Jayaraman, and Michael D Ernst. Automatic
creation of sql injection and cross-site scripting attacks. In Software
Engineering, 2009. ICSE 2009. IEEE 31st International Conference on, pages
199–209. IEEE, 2009.
[21] Shih-Kun Huang, Han-Lin Lu, Wai-Meng Leong, and Huan Liu. Craxweb:
Automatic web application testing and attack generation. In Software Security
and Reliability (SERE), 2013 IEEE 7th International Conference on, pages
208–217. IEEE, 2013.
[22] Lwin Khin Shar, Hee Beng Kuan Tan, and Lionel C Briand. Mining sql injection
and cross site scripting vulnerabilities using hybrid program analysis.
In Proceedings of the 2013 International Conference on Software Engineering,
pages 642–651. IEEE Press, 2013.
[23] Abstract syntax tree for php. https://pypi.python.org/pypi/phply.
[24] Abstract syntax tree for javascript. https://pypi.python.org/pypi/slimit.
[25] re package of python. https://docs.python.org/2/library/re.html.
[26] os package of python. https://docs.python.org/2/library/os.html.
[27] sys package of python. https://docs.python.org/2/library/sys.html.
[28] json package of python. https://docs.python.org/2/library/json.html.
[29] Regular expression of python. https://docs.python.org/2/library/copy.html.
[30] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann,
and Ian H Witten. The weka data mining software: an update. ACM
SIGKDD explorations newsletter, 11(1):10–18, 2009.
[31] Ecma. https://www.ecma-international.org/.
[32] Npm. https://www.npmjs.com/.
[33] Cross-validation wikipedia. https://en.wikipedia.org/wiki/Cross-validation_(statistics).
[34] Confusion matrix. https://en.wikipedia.org/wiki/Confusion_matrix.
