
Graduate student: 凱 碧 (Silvia Gabriela Herrera Poggio)
Thesis title: Automatic Organization Scoring: Leveraging ChatGPT for Data Annotation (篇章結構自動評分:運用 ChatGPT 進行資料標記)
Advisors: Chang, Jason S. (張俊盛); Hu, Anita (胡敏君)
Committee members: Kao, Hung-Yu (高宏宇); Huang, Yun-Yin (黃芸茵)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications
Year of publication: 2024
Academic year of graduation: 112 (R.O.C. calendar)
Language: English
Number of pages: 33
Keywords (Chinese): 作文自動評分, ChatGPT, 評分模型
Keywords (English): Automatic essay scoring, ChatGPT, Scoring model
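  • We propose a generative automatic scoring method that targets the organization aspect of essays. In this method, we annotate every sentence with additional organizational-structure information in order to maximize the model's accuracy when scoring organization. The approach uses ChatGPT to annotate the sentences of intermediate-level student essays, supplying extra structural features for each essay; these are combined with holistic essay scores assigned by professional English teachers to train an automatic scoring model, which can then produce an organization trait score. At run time, each sentence of the input essay is tagged with organizational-structure features, which are fed to the model as additional information to obtain a score. Evaluation on a test set of real student essays shows that the method's scoring comes quite close to the level of professional English teachers. This demonstrates that, even on essay datasets that lack trait-level scores, our method can effectively provide automatic organization scoring with reasonably good performance.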


    We introduce a method for automatically generating an organization score for student essays. In our approach, the essay is split into sentences, each of which is annotated with a structural tag, with the aim of maximizing the probability of obtaining an accurate organization score. The method involves leveraging ChatGPT to enrich an intermediate-level essay dataset with tags that highlight the structural characteristics of each essay, then training a model on the structurally annotated dataset and the corresponding holistic scores so that it can automatically generate organization scores. At run time, the input essay is transformed into a sentence-level structurally annotated essay, which is then fed into the model to derive the score. Blind evaluation on a set of real learner essays shows that the method achieves performance comparable to human evaluators. Our methodology cleanly supports automatic organization scoring, yielding reasonably good performance.
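The run-time transformation described above (essay, to sentences, to structurally annotated text for the scoring model) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the tag names (`INTRO`, `BODY`, `CONCLUSION`), the `<TAG> sentence` input format, and the position-based `positional_tagger` are all assumptions standing in for the ChatGPT annotation step, whose actual tag set and prompts are not given in this record. The scoring model itself is not shown.

```python
import re
from typing import Callable, List

# Hypothetical tag set; the thesis's actual structural tags are not listed here.
HYPOTHETICAL_TAGS = ("INTRO", "BODY", "CONCLUSION")


def split_sentences(essay: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", essay.strip())
    return [p for p in parts if p]


def positional_tagger(index: int, total: int, sentence: str) -> str:
    """Stand-in for the ChatGPT annotation step: tags by position only."""
    if index == 0:
        return "INTRO"
    if index == total - 1:
        return "CONCLUSION"
    return "BODY"


def annotate_essay(
    essay: str,
    tagger: Callable[[int, int, str], str] = positional_tagger,
) -> str:
    """Prefix each sentence with its tag, producing one annotated string
    that could serve as text input to a text-to-text scoring model."""
    sentences = split_sentences(essay)
    total = len(sentences)
    return " ".join(
        f"<{tagger(i, total, s)}> {s}" for i, s in enumerate(sentences)
    )


if __name__ == "__main__":
    print(annotate_essay("I like dogs. They are loyal. So dogs are great!"))
```

In a real pipeline the `tagger` argument would be replaced by a function that queries a language model for each sentence; keeping it as a parameter lets the sentence splitting and formatting be tested without any network call.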

    Contents
    Abstract (Chinese)
    Abstract
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    1 Introduction
    2 Related works
    3 Methodology
      3.1 Stage 1
      3.2 Stage 2
    4 Experiments
      4.1 Dataset
      4.2 Tag creation
      4.3 Prompt engineering
      4.5 Training
      4.6 Evaluation
    5 Results and Discussion
      5.1 Stage 1
        5.1.1 Results from prompt evaluation
        5.1.2 Results from data annotation
      5.2 Stage 2
        5.2.1 Optimized hyperparameters
        5.2.2 Results from T5 Model
    6 Conclusion and Future Work
    References

