Automatic Organization Scoring: Leveraging ChatGPT for Data Annotation


Author:	凱碧 Silvia Gabriela Herrera Poggio
Thesis title:	Automatic Organization Scoring: Leveraging ChatGPT for Data Annotation
Advisors:	張俊盛 Chang, Jason S.; 胡敏君 Hu, Anita
Oral defense committee:	高宏宇 Kao, Hung-Yu; 黃芸茵 Huang, Yun-Yin
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications
Year of publication:	2024
Academic year of graduation:	112 (R.O.C. calendar)
Language:	English
Number of pages:	33
Keywords:	Automatic Essay Scoring (作文自動評分), ChatGPT, Scoring Model (評分模型)

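We propose a generative automatic scoring method targeting the organization aspect of essays. In this method, we annotate every sentence with additional organizational-structure information to maximize the model's accuracy in scoring organization. The method uses ChatGPT to annotate the sentences of intermediate-level student essays, giving each essay additional structural features; these are paired with holistic essay scores assigned by professional English teachers to train an automatic scoring model, which can then produce a trait score for organization. At run time, each sentence of the input essay is tagged with its organizational-structure feature, and the tags are fed to the model as additional information to obtain a score. Evaluation on a test set of real student essays shows that the method's scoring comes quite close to the level of professional English teachers. This demonstrates that our method can effectively provide automatic trait scoring for organization even on essay datasets that lack trait scores, with quite good performance.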

We introduce a method for automatically generating an organizational aspect score for student essays. In our approach, the essay is split into sentences, which are transformed into structurally annotated sentences with the aim of maximizing the probability of obtaining an accurate organizational score. The method involves leveraging ChatGPT to enrich an intermediate-level essay dataset with tags that highlight the structural characteristics of each essay, then training a model on the structurally annotated dataset and the corresponding holistic scores so that it can automatically generate organizational scores. At run-time, the input essay is transformed into a sentence-level structurally annotated essay, which is then fed into the model to derive the score. Blind evaluation on a set of real learner essays shows that the method achieves performance comparable to that of human evaluators. Our methodology cleanly supports automatic organization scoring, yielding reasonably good performance results.
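The run-time transformation described in the abstract (split the essay into sentences, attach a structural tag to each, and pass the annotated text to a text-to-text model) might look roughly like the sketch below. It is an illustration only, not the thesis's implementation: the tag names (`INTRO`, `BODY`, `CONCLUSION`), the position-based `stub_tagger` that stands in for the ChatGPT annotation step, the naive regex sentence splitter, and the `score organization:` task prefix are all assumptions made for this example.

```python
import re
from typing import Callable, List

# Hypothetical signature for an annotator: (sentence, index, total) -> tag.
Tagger = Callable[[str, int, int], str]


def split_sentences(essay: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", essay.strip())
    return [p for p in parts if p]


def stub_tagger(sentence: str, index: int, total: int) -> str:
    """Stand-in for the LLM annotator; assigns a tag by position only.

    The tag set here is hypothetical and chosen just to show the data flow.
    """
    if index == 0:
        return "INTRO"
    if index == total - 1:
        return "CONCLUSION"
    return "BODY"


def annotate_essay(essay: str, tagger: Tagger = stub_tagger) -> str:
    """Prefix every sentence with its structural tag and rejoin the essay."""
    sentences = split_sentences(essay)
    total = len(sentences)
    tagged = [f"<{tagger(s, i, total)}> {s}" for i, s in enumerate(sentences)]
    return " ".join(tagged)


def build_model_input(essay: str, prefix: str = "score organization: ") -> str:
    """Build the text that would be fed to a text-to-text scoring model."""
    return prefix + annotate_essay(essay)


if __name__ == "__main__":
    sample = "I like dogs. They are loyal! Do you agree? Everyone should have one."
    print(build_model_input(sample))
```

In a real pipeline the `tagger` argument would be replaced by a function that sends each sentence to the annotation model and returns its label; keeping it as a parameter lets the rest of the transformation be tested without any network calls.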

Contents
Abstract (Chinese)
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Introduction
Related works
Methodology
  1 Stage 1
  2 Stage 2
Experiments
  1 Dataset
  2 Tag creation
  3 Prompt engineering
  5 Training
  6 Evaluation
Results and Discussion
  1 Stage 1
    1.1 Results from prompt evaluation
    1.2 Results from data annotation
  2 Stage 2
    2.1 Optimized hyperparameters
    2.2 Results from T5 Model
Conclusion and Future Work
References
                                

