Author: | 蔡孟學 Tsai, Meng Xue |
---|---|
Thesis title: | 英文和中文語料的整體語言結構 Global linguistic structure for both English and Chinese Corpora |
Advisor: | 洪在明 Hong, Tzay Ming |
Oral defense committee: | 王道維 Wang, Daw Wei; 徐南蓉 Hsu, Nan Jung |
Degree: | Master |
Department: | College of Science, Department of Physics |
Year of publication: | 2016 |
Academic year of graduation: | 104 (ROC calendar) |
Language: | English |
Number of pages: | 66 |
Keywords (Chinese): | power law, Zipf's law, Akaike information criterion, word count, crumpling |
Keywords (English): | Zipf's law, word count |
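Because of its scale-free nature, the power law appears in hundreds of reports on natural and man-made complex systems, for example the frequency of earthquakes, stock market fluctuations, population structure, and word counts in Latin-based languages. To explain these empirical findings, Bak et al. proposed the sandpile model and self-organized criticality in 1987, and Mandelbrot proposed a fractal explanation. They all tried to explain the phenomenon with the kind of unified theoretical model that physicists favor.
In this thesis we start from a different angle, focusing on the statistical rigor demanded of phenomena claimed to follow a power law. By using and recommending the Akaike information criterion (AIC), we studied several well-known examples of power laws and found that their data are better described by a double power law (DPL). With the support of the sound and powerful AIC, we measured in depth the sound emitted in crumpling experiments. When two different sheets are crumpled separately, the AIC correctly picks out the double power law. When the two sheets are crumpled together, the AIC indicates a transition from favoring the double power law to favoring the power law as compression proceeds. This transition matches our expectation of the statistical behavior that interactions should produce.
In the second half of this thesis, equipped with the AIC and experience in data analysis, we turn to the linguistic regularity of Zipf's law in Chinese, English, and other Latin-based languages, also called word count morphing. After further extending the study to music and patterns, we find that morphing of the statistical distribution is common and depends on the specific way the data are analyzed. Because the power law has an elegant and simple mathematical form with scale-free and self-similar features, it serves as the main subject with which we quantify the different methods.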
Power laws have been reported in more than one hundred instances of natural and man-made complex systems, and their scale-free character has set off a great deal of research. Examples include the frequency of earthquakes in geology, stock market fluctuations in economics, population structure, and word counts for Latin languages. In parallel with these empirical findings, Bak et al. proposed the sandpile model and self-organized criticality in 1987, and Mandelbrot offered a fractal interpretation. Each sought a unified theoretical model, of the kind physicists favor, to explain this widespread phenomenon.
In this thesis, we approach the problem from a different angle, namely the statistical rigor required to claim a power law. Adopting and advocating the Akaike information criterion (AIC), we examine several famous examples of power laws and find that their data are better described by double power laws (DPL). To lend further support to the legitimacy and power of the AIC, we measure in depth the acoustic emission from crumpling experiments. The AIC correctly picks out the DPL for two different sheets crumpled separately. When the sheets are crumpled jointly, the AIC indicates that the statistics shift from favoring the DPL to favoring a simple power law as the compaction increases. This supports the expectation that interactions can give rise to such a shift in statistical behavior.
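The model comparison described above can be illustrated with a minimal sketch. It uses the standard definition AIC = 2k - 2 ln L, where k is the number of fitted parameters and L the maximized likelihood; the model with the lower AIC is preferred. Everything else here is an assumption made for illustration, not the thesis's own procedure: the DPL is taken to be a continuous broken power law on x >= 1 with exponents `a1` and `a2` joined at a breakpoint `xc`, its fit is a coarse grid search rather than a proper optimizer, and the data are synthetic. All function names are hypothetical.

```python
import math
import random


def aic(log_likelihood, n_params):
    """Akaike information criterion, AIC = 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def fit_power_law(xs):
    """Closed-form MLE for p(x) = (alpha - 1) * x**(-alpha) on x >= 1."""
    n = len(xs)
    sum_log = sum(math.log(x) for x in xs)
    alpha = 1.0 + n / sum_log
    return alpha, n * math.log(alpha - 1.0) - alpha * sum_log


def dpl_norm(a1, a2, xc):
    """Normalizing integral of the assumed broken power law (a1, a2 > 1)."""
    below = (1.0 - xc ** (1.0 - a1)) / (a1 - 1.0)
    above = xc ** (1.0 - a1) / (a2 - 1.0)
    return below + above


def fit_double_power_law(xs):
    """Approximate MLE by grid search over (a1, a2, xc).

    Assumed density: x**(-a1) below xc and xc**(a2 - a1) * x**(-a2) above,
    divided by dpl_norm. Exponents are restricted to the grid 1.05 .. 5.00.
    """
    exponents = [1.05 + 0.05 * i for i in range(80)]
    data = sorted(xs)
    n = len(data)
    logs = [math.log(x) for x in data]
    total = sum(logs)
    best_params, best_ll = None, -math.inf
    for q in range(1, 20):              # breakpoint candidates: 5% .. 95% quantiles
        idx = n * q // 20
        xc = data[idx]
        s1 = sum(logs[:idx])            # sum of ln x below the breakpoint
        s2 = total - s1                 # sum of ln x at or above it
        n2 = n - idx
        for a1 in exponents:
            for a2 in exponents:
                ll = (-n * math.log(dpl_norm(a1, a2, xc)) - a1 * s1
                      + n2 * (a2 - a1) * math.log(xc) - a2 * s2)
                if ll > best_ll:
                    best_params, best_ll = (a1, a2, xc), ll
    return best_params, best_ll


def sample_double_power_law(n, a1, a2, xc, rng):
    """Draw synthetic samples from the assumed DPL by inverting its CDF."""
    z = dpl_norm(a1, a2, xc)
    p_below = (1.0 - xc ** (1.0 - a1)) / (a1 - 1.0) / z
    out = []
    for _ in range(n):
        u = rng.random()
        if u < p_below:
            x = (1.0 - u * z * (a1 - 1.0)) ** (1.0 / (1.0 - a1))
        else:
            x = ((1.0 - u) * z * (a2 - 1.0) * xc ** (a1 - a2)) ** (1.0 / (1.0 - a2))
        out.append(x)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    xs = sample_double_power_law(3000, 1.5, 3.5, 10.0, rng)
    alpha, ll_pl = fit_power_law(xs)
    params, ll_dpl = fit_double_power_law(xs)
    print("power law:        alpha =", round(alpha, 3), " AIC =", round(aic(ll_pl, 1), 1))
    print("double power law: (a1, a2, xc) =", params, " AIC =", round(aic(ll_dpl, 3), 1))
```

The DPL is charged for three parameters against one for the simple power law, so it is preferred only when its gain in log-likelihood exceeds the extra penalty of 4.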
In the second half of this thesis, armed with the AIC and the accompanying experience in data analysis, we turn to Zipf's law in Chinese, English, and other Latin-based languages, also known as word count morphing. After further generalization to music and graphs, we find that morphing of the statistical distribution is common and depends on the specific method used for data analysis. Since the power law is mathematically simple and elegant, with scale-free and self-similar features, we use it as the primary subject for quantifying the different methods.
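A word count of the kind Zipf's law describes can be sketched as follows. Zipf's law states that the frequency of a word falls off roughly as a power of its frequency rank, so a log-log plot of count against rank is close to a straight line. The sketch ranks tokens by count and estimates that slope by ordinary least squares. The toy token list and the function names are assumptions for illustration only, and the least-squares slope is a quick descriptive estimate rather than a rigorous fit. How a text is split into tokens is left to the caller; Chinese text, which has no whitespace between words, needs a different segmentation step than English.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return [(rank, count), ...], with rank 1 for the most frequent token."""
    counts = Counter(tokens)
    return [(rank, count)
            for rank, (_, count) in enumerate(counts.most_common(), start=1)]


def loglog_slope(pairs):
    """Ordinary least-squares slope of ln(count) against ln(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(count) for _, count in pairs]
    n = len(pairs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Toy corpus built so that count = 60 / rank, i.e. an ideal Zipf exponent of 1.
    tokens = []
    for word, count in [("the", 60), ("of", 30), ("and", 20), ("to", 15), ("a", 12)]:
        tokens.extend([word] * count)
    pairs = rank_frequency(tokens)
    print(pairs)
    print("log-log slope:", loglog_slope(pairs))  # close to -1 for this toy corpus
```

Changing the tokenization rule (words, characters, or phrases) changes the resulting rank-frequency table, which is one concrete way the measured distribution can depend on the method of analysis.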