
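Author: Lin, Yu-Hsuan (林育萱)
Thesis title: E.M.I.ghty: Using Generative AI for Assisting EMI Classroom Communication
Original title: E.M.I.ghty:利用大型語言模型來運用詞彙串撰寫英語授課腳本
Advisors: Chang, Jason S. (張俊盛); Hsiao, Jo-Chi (蕭若綺)
Oral defense committee: Jang, Jyh-Shing (張智星); Chung, Siaw-Fong (鍾曉芳)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Institute of Information Systems and Applications
Year of publication: 2024
Academic year of graduation: 113
Language: English
Number of pages: 55
Keywords (translated from Chinese): English-medium instruction (EMI), automatic generation, lexical bundles, language models, retrieval-augmented generation, generative AI, data augmentation
Keywords (English): Automatic text generation, Retrieval Augmented Generation (RAG)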
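  • This thesis proposes a method that takes a script entered by the user and automatically generates a new EMI script containing effective lexical bundles, for use in classroom teaching.
    We build an enriched lexical bundle bank by adding new entries to an existing list of lexical bundles.
    The method also retrieves relevant bundles from this bank and uses prompt engineering with a large language model (LLM) to generate an improved script.
    Preliminary evaluation shows that the proposed system effectively improves the quality of input scripts. We present a prototype system, E.M.I.ghty, and apply the method to real-world classroom communication.
    Human evaluation on a set of classroom scripts used as test data indicates that the system effectively improves script quality. Our method supports supplementing the phrase bank with new lexical bundles from a corpus and, combined with retrieval-augmented generation (RAG), helps users generate improved EMI scripts more effectively.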


    We present a method that automatically generates a new script with effective EMI lexical bundles for a given script.
    In our approach, new entries are added to an existing list of lexical bundles to generate an enriched Lexical Bundle Phrase Bank.
    The method involves retrieving relevant phrases from the Lexical Bundle Phrase Bank, composing a RAG prompt, and using an LLM to generate an improved script.
    At run time, a script supplied by a user is split into sentences, and lexical bundles are retrieved from the Phrase Bank. The LLM takes the script and the retrieved lexical bundles as input and returns an improved EMI script to the user.
    We present a prototype system, E.M.I.ghty, that applies this method to real-world classroom communication. Human evaluation on a set of classroom scripts shows that the proposed system effectively enhances script quality.
    Our methodology supports supplementing a phrase bank with new lexical bundles drawn from a speech corpus and, combined with retrieval-augmented generation, helps users generate improved EMI scripts more effectively.
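
    The run-time flow described in the abstract (sentence splitting, bundle retrieval, prompt composition) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the phrase bank entries, the word-overlap scoring, the prompt wording, and the function names `split_sentences`, `retrieve_bundles`, and `build_prompt` are all assumptions made for the example, and the final LLM call is left out.

```python
import re
from typing import List, Set

# Hypothetical phrase bank entries; the thesis's actual Lexical Bundle
# Phrase Bank is not reproduced in this record.
PHRASE_BANK = [
    "let's take a look at",
    "as you can see",
    "what I want you to do is",
    "if you have any questions",
    "the main point here is",
]


def split_sentences(script: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", script.strip())
    return [p for p in parts if p]


def tokens(text: str) -> Set[str]:
    """Lowercased word set used for the overlap score."""
    return set(re.findall(r"[a-z']+", text.lower()))


def retrieve_bundles(sentence: str, bank: List[str], top_k: int = 2) -> List[str]:
    """Rank bundles by word overlap with the sentence.

    Word overlap is an assumed stand-in for whatever retrieval the
    thesis actually uses (for example, embedding similarity).
    """
    sent_tokens = tokens(sentence)
    scored = [(len(sent_tokens & tokens(b)), b) for b in bank]
    scored = [(s, b) for s, b in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [b for _, b in scored[:top_k]]


def build_prompt(script: str, bank: List[str], top_k: int = 2) -> str:
    """Compose a RAG-style prompt: each sentence plus its retrieved bundles."""
    lines = [
        "Rewrite the classroom script below for an EMI lecture.",
        "Where natural, use the suggested lexical bundles.",
        "",
    ]
    for i, sent in enumerate(split_sentences(script), 1):
        bundles = retrieve_bundles(sent, bank, top_k)
        hint = "; ".join(bundles) if bundles else "(none)"
        lines.append(f"{i}. {sent}")
        lines.append(f"   Suggested bundles: {hint}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("Look at this slide. Do you have any questions?", PHRASE_BANK))
```

    The string returned by `build_prompt` would then be sent to an LLM of the user's choice; keeping retrieval and prompt composition separate from the model call makes it easy to swap in a different retriever or model.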

    1 Introduction
    2 Related Work
    3 Methodology
      3.1 Problem Statement
      3.2 Training a Classifier to Augment Phrase Bank
        3.2.1 Extracting n-grams from Target Corpus
        3.2.2 Construct a Training set and N-gram set
        3.2.3 Calculate the statistic features
        3.2.4 Categorizing the N-gram set into Phrase Bank
      3.3 Generating Script with Lexical Bundles
    4 Experiment
      4.1 Datasets and Toolkits
      4.2 Data Preprocessing
        4.2.1 Training Dataset
        4.2.2 Pre-train Model
      4.3 Lexical Bundle Phrase Bank
      4.4 Generating Script Using RAG
      4.5 Script Generation Systems Compared
      4.6 Evaluation Metrics
    5 Evaluation Results
      5.1 Result of Lexical Bundle Phrase Bank
      5.2 Results of Generating Improved EMI script
    6 Conclusion and Future Work

