
Graduate Student: 陳佳偉 (Chen, Jia-Wei)
Title: The development and validation of a rating scale for definition essays: A data-based approach
定義型文章評分量表的發展與其效度考驗研究
Advisor: 張寶玉 (Viphavee Vongpumivitch)
Committee Members:
Degree: Master
Department: Department of Foreign Languages and Literature, College of Humanities and Social Sciences
Year of Publication: 2010
Graduation Academic Year: 99 (ROC calendar)
Language: English
Number of Pages: 162
Chinese Keywords: 寫作測驗 (writing assessment), 評分量表 (rating scale)
English Keywords: Writing Assessment, Rating Scale
    To date, no rating scale has been created specifically for definition writing. The holistic rating scales used in large-scale standardized tests such as the TOEFL iBT are either general-purpose, as with the independent writing scale, or bound to one particular task, as with the integrated writing scale. Brindley (1994) criticized such scales as too general to apply to a specific task and context. This project therefore aims to create a scale for definition writing using a data-based approach.
    The development of the scale largely followed the procedures in Knoch (2007), whose study described in detail how a data-based scale can be constructed and whose rating scale proved valid and reliable. In this study, rating criteria for the definition-essay scale were first chosen on the basis of several models of writing performance. From these models, six traits were selected: accuracy, fluency, complexity (syntactic and lexical), coherence, cohesion, and content. Then, 268 samples were selected from a pool of 1,365 short definition essays and analyzed with discourse measures covering the six traits. The results of these discourse measures were subjected to statistical analyses, which identified the measures that could effectively distinguish essays at different performance levels. The scale was finally written on the basis of these discriminating measures and featured two characteristics. First, the scale descriptions were made specific to the definition writing task. Second, the scale criteria were prioritized to reflect their importance in the writing task and divided into three steps so that they could be judged separately.
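    To make the data-based selection step concrete, the sketch below (not part of the thesis) computes two toy discourse measures for essays grouped by performance level and uses one-way ANOVA to flag measures that discriminate among levels. The essays, the measure definitions, and the three-level grouping are all invented for illustration; the actual study used a much richer set of measures and additional analyses (factor analysis, multiple regression).

```python
# Minimal sketch (NOT the thesis's actual analysis) of data-based
# criterion selection: compute discourse measures per essay, then use
# one-way ANOVA to see which measures separate performance levels.
from scipy import stats

def words_per_sentence(essay: str) -> float:
    """Crude syntactic-complexity proxy (the study used T-unit-based
    measures; sentences stand in for T-units here)."""
    sentences = [s for s in essay.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return len(essay.split()) / max(len(sentences), 1)

def type_token_ratio(essay: str) -> float:
    """Crude lexical-complexity proxy: unique words / total words."""
    tokens = [w.strip(".,;!?").lower() for w in essay.split()]
    return len(set(tokens)) / max(len(tokens), 1)

# Hypothetical definition essays at three holistic levels (1 = low).
essays_by_level = {
    1: ["A dog is animal. Dog is good. Dog is friend.",
        "A car is thing. People drive car. Car go fast."],
    2: ["A metaphor is a figure of speech. It compares two things without using like or as.",
        "Culture means the shared habits of a group. It includes food, language, and customs."],
    3: ["A metaphor, unlike a simile, asserts an implicit identity between two domains, "
        "mapping properties of one onto the other.",
        "Culture denotes the socially transmitted system of beliefs, practices, and artifacts "
        "through which a community interprets experience."],
}

for name, measure in [("words/sentence", words_per_sentence),
                      ("type-token ratio", type_token_ratio)]:
    groups = [[measure(e) for e in essays] for essays in essays_by_level.values()]
    f, p = stats.f_oneway(*groups)  # significant F => measure separates the levels
    print(f"{name:16s} F = {f:6.2f}, p = {p:.3f}")
```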
    Once the scale was created, it was tested in a validation study. The purpose was to collect evidence for the validity of the scale by comparing it with the TOEFL scale, which is often considered generalizable across writing tasks, and to find out which of the two was more suitable for definition writing tasks like the one investigated here. Four raters used both scales on the same batch of 65 essays, filled out a questionnaire, and participated in semi-structured interviews about their experience with the scales. The ratings were analyzed statistically to estimate inter-rater reliability, and the raters' questionnaire and interview responses were used to investigate the validity of the scales.
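    As an illustration of the reliability estimate mentioned above, the following sketch computes one common inter-rater reliability statistic, the average pairwise Pearson correlation among raters; the abstract does not specify which statistic the thesis used, and the ratings matrix here is invented.

```python
# Illustrative inter-rater reliability check: average pairwise Pearson
# correlation among raters. Data are invented; in the study, four raters
# each scored the same 65 essays with both scales.
from itertools import combinations
import numpy as np

ratings = np.array([   # rows = essays, columns = raters A-D (hypothetical 1-5 band)
    [4, 5, 4, 3],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [1, 2, 1, 1],
    [3, 4, 3, 2],
    [2, 2, 3, 1],
])

pairs = list(combinations(range(ratings.shape[1]), 2))
rs = [np.corrcoef(ratings[:, i], ratings[:, j])[0, 1] for i, j in pairs]
print(f"mean pairwise r = {np.mean(rs):.2f} over {len(pairs)} rater pairs")
```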
    The analysis of rating consistency indicated that both scales yielded similar inter-rater reliability estimates, and neither was desirably high. Further statistical analyses showed that the raters applied the scales with different levels of scoring severity. A follow-up interview with the raters revealed several factors behind the inconsistent ratings, largely attributable to insufficient rater training and inappropriate design of the scoring procedures. Even though the new scale failed to reach high rating consistency, the questionnaire and interview results showed that the raters perceived it positively because it could (1) benefit test users and test takers, (2) generate ratings fair to test takers, (3) adequately represent the writing ability involved in definition writing, (4) connect strongly to the definition writing task, and (5) provide enough information for the raters to discriminate test takers at different levels. The raters therefore considered the definition-writing scale more suitable for definition writing.
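    A crude way to see severity differences of the kind reported above is to compare each rater's mean score on the same essays and test whether the raters differ systematically. The thesis's exact method is not stated in the abstract (multi-facet Rasch analysis is the usual tool for this); the nonparametric Friedman test below is only a stand-in, applied to the same invented matrix as before.

```python
# Stand-in severity check (NOT the thesis's method): a significant
# Friedman test on the same essays suggests systematic severity
# differences among raters. Data are invented.
import numpy as np
from scipy import stats

ratings = np.array([   # rows = essays, columns = raters A-D
    [4, 5, 4, 3],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [1, 2, 1, 1],
    [3, 4, 3, 2],
    [2, 2, 3, 1],
])

for r in range(ratings.shape[1]):
    print(f"rater {chr(65 + r)}: mean = {ratings[:, r].mean():.2f}")

stat, p = stats.friedmanchisquare(*[ratings[:, j] for j in range(ratings.shape[1])])
print(f"Friedman chi-square = {stat:.2f}, p = {p:.3f}")
```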
    Even though the reliability of the scale created in this study was not satisfactory, the raters still perceived it quite positively. It can thus be concluded that this study, following Knoch (2007), confirmed that a scale developed from empirical analysis offers stronger evidence of validity for its target task.


    [Chinese abstract, translated] To date, no rating scale has been designed specifically for grading definition essays. The scales used in many large-scale standardized tests (such as the TOEFL iBT writing test) are either too general or too specific to the task types those tests employ. Brindley (1994) pointed out that such scales are too broad to suit a particular writing task and testing context; this study therefore aimed to build a rating scale for definition essays from authentic performance data.
    Knoch (2007) documented in detail the steps involved in developing a data-based scale, and the scale she built showed good reliability and validity, so this study followed her research design. The rating criteria were drawn from several theoretical models of writing performance, from which six indicators were selected: fluency, accuracy, syntactic and lexical complexity, cohesion, coherence, and content. First, 268 essays were randomly selected from a pool of 1,365 definition essays and analyzed with discourse analysis measures covering the six indicators. Statistical analyses of the resulting data then identified which measures effectively discriminated essays at different levels. Finally, the new scale and its score descriptors were written on the basis of these results. The scale has two features: first, its descriptors are written for the characteristics of definition essays; second, its criteria are ranked by importance and scored separately.
    After the scale was built, a validation study compared its performance with the TOEFL independent writing scale to determine which is more suitable for grading definition essays. Four raters used both scales to score 65 essays, then completed a questionnaire and attended interviews about their rating process and experience. The scores were analyzed statistically for inter-rater reliability, and the raters' questionnaire and interview responses were also analyzed to examine the validity of the two scales.
    The consistency analysis showed that the new scale and the TOEFL scale reached similar, but not particularly high, inter-rater reliability. Further statistical analysis found that the four raters applied the two scales with different degrees of severity. Follow-up interviews revealed several factors behind the insufficient reliability, most of them attributable to inadequate rater training and poorly designed scoring procedures. Although the new scale did not reach high reliability, the raters gave it positive feedback in the questionnaire and interviews. In their view, the new scale could (1) benefit test users and test takers, (2) let raters score fairly, (3) adequately represent the ability involved in definition writing, (4) relate closely to the definition writing task, and (5) provide enough information to help raters distinguish essays at different levels. The raters therefore judged the new scale more suitable for grading definition essays.
    Although the reliability of this definition-writing scale was not ideal, the raters still endorsed it. Following Knoch (2007), this study again demonstrates that a rating scale built from performance data can show better validity for its target writing task.

    Chinese Abstract
    Abstract
    Acknowledgments
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 INTRODUCTION
      1.1 Introduction
      1.2 Research Background
      1.3 Purpose of the Study
      1.4 Thesis Outline
    Chapter 2 LITERATURE REVIEW
      2.1 Overview of This Chapter
      2.2 The Nature of Writing Ability
        2.2.1 The nature of writing ability
        2.2.2 Models of writing process
        2.2.3 Models of writing performance
        2.2.4 Components essential for writing performance
        2.2.5 Summary of Section 2.2
      2.3 Selection of Criteria for Rating Scales
        2.3.1 Intuitive method (expert judgment)
        2.3.2 Existing-scale-based approach
        2.3.3 Data-based approach
      2.4 Scale Design and Scoring Procedures for Writing Assessment
        2.4.1 Factors to consider in designing rating scales
        2.4.2 Scoring process
        2.4.3 Evaluating scoring procedures
      2.5 Scale Validation
      2.6 Example of Scale Development Project
      2.7 Summary of This Chapter
    Chapter 3 SCALE DEVELOPMENT
      3.1 Background Information of the Data
        3.1.1 Description of the data
        3.1.2 Essay writers
        3.1.3 The writing task
        3.1.4 Data rating and data selection
      3.2 Procedures
        3.2.1 Discourse analysis
        3.2.2 Statistical analysis
        3.2.3 Comparison with the methodological design in Knoch (2007)
      3.3 Selection of Criteria – Statistical Results
        3.3.1 Descriptive statistics
        3.3.2 Factor analysis
        3.3.3 Analysis of variance
        3.3.4 Multiple regression analysis
        3.3.5 Discussion of the analysis results
      3.4 New Scale
        3.4.1 Characteristics of the scale
        3.4.2 Writing the scale descriptions
        3.4.3 Use of the new scale to assign essay scores
        3.4.4 First revision
      3.5 Summary of This Chapter
    Chapter 4 SCALE VALIDATION
      4.1 Pilot Study
        4.1.1 Participants
        4.1.2 Materials
        4.1.3 Objectives and plans of the pilot study
        4.1.4 Trial results
      4.2 Main Study
        4.2.1 Participants
        4.2.2 Materials
        4.2.3 Data collection procedures
        4.2.4 Data analysis
        4.2.5 Comparison with the validation procedure in Knoch (2007)
      4.3 Results and Discussion of the Main Study
        4.3.1 Inter-rater reliability
        4.3.2 Questionnaire results – Validity arguments for the TOEFL scale vs. the new scale
        4.3.3 Interview results – Evaluation of the scales
        4.3.4 The final words: Raters' choice of the scale
        4.3.5 Summary and discussion of the validation results
      4.4 Comparison with Knoch's findings
      4.5 Summary of This Chapter
    Chapter 5 CONCLUSION
      5.1 Summary of the Study
      5.2 Limitations
      5.3 Significance of this Study
      5.4 Implications
      5.5 Suggestions for Future Work
    References
    Appendices
      Appendix A – TOEFL iBT Independent Writing Rubrics
      Appendix B – Knoch's analytic scale for the diagnostic test
      Appendix C – Knoch's analytic scale for the diagnostic test
      Appendix D – Guidelines for T-units
      Appendix E – Guidelines for Clauses
      Appendix F – Questionnaire for Validation
      Appendix G – Training Manual for the TOEFL Scale
      Appendix H – Training Manual for the New Scale

    Alderson, C. J. (1991). Bands and scores. In J. C. Alderson, & B. North (Eds.), Language testing in the 1990s (pp. 71-86). London: Modern English Publications and the British Council.
    Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
    Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. New York: Oxford University Press.
    Bachman, L., & Palmer, A. (2010). Language assessment in practice: Developing language assessments and justifying their use in the real world. Oxford: Oxford University Press.
    Bardovi-Harlig, K., & Bofman, T. (1989). Attainment of syntactic and morphological accuracy by advanced language learners. Studies in Second Language Acquisition, 11, 17-34.
    Beers, S. F., & Nagy, W. F. (2009). Syntactic complexity as a predictor of adolescent writing quality: Which measure? Which genre? Reading and Writing, 22, 185-200.
    Bereiter, C., & Scardamalia, M. (1987). The psychology of written composition. NJ: Lawrence Erlbaum Associates.
    Brindley, G. (1994). Task-centered assessment in language learning: The promise and the challenge. In N. Bird, P. Falvey, A. Tsui, D. Allison, & A. McNeill (Eds.), Language and learning: Papers presented at the Annual International Language in Education Conference (Hong Kong, 1993) (pp. 73-94). Hong Kong: Hong Kong Education Department.
    Brown, J. D. (1991). Do English and ESL faculties rate writing samples differently? TESOL Quarterly, 25(4), 587-603.
    Center for Advanced Research on Language Acquisition (CARLA). (n.d.). Types of rubrics: Primary trait and multiple trait. Retrieved August 28, 2010, from http://www.carla.umn.edu/assessment/VAC/Evaluation/rubrics/types/traitRubrics.html
    Cobb, T. (2002). Web VocabProfile. Retrieved May 26, 2009, from http://www.lextutor.ca/vp/
    Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge: Cambridge University Press.
    Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213-238.
    Cumming, A., Kantor, R., Baba, K., Erdosy, E., Eouanzoui, K., & James, M. (2005). Differences in written discourse in independent and integrated prototype tasks for next generation TOEFL. Assessing Writing, 10, 5-43.
    Cumming, A., Kantor, R., Powers, D., Santos, T., & Taylor, C. (2000). TOEFL 2000 writing framework: A working paper. Princeton, NJ: Educational Testing Service.
    Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (1999). Dictionary of language testing. Cambridge: Cambridge University Press.
    de Haan, P., & van Esch, K. (2008). Measuring and assessing the development of foreign language writing competence. Porta Linguarum, 9, 7-21.
    DeCoster, J. (1998). Overview of factor analysis. Retrieved November 20, 2009, from http://www.stat-help.com/notes.html
    Diederich, P. B. (1964). Problems and possibilities of research in the teaching of written composition. In Research design and the teaching of English. Champaign, IL: National Council of Teachers of English.
    Educational Testing Service. (n.d.). TOEFL iBT tips: How to prepare for the TOEFL iBT. Retrieved March 15, 2010, from http://www.ets.org/toefl
    Educational Testing Service. (n.d.). TOEFL iBT independent writing rubrics. Retrieved from http://www.ets.org/Media/Tests/TOEFL/pdf/Writing_Rubrics.pdf
    English Language Institute of University of Michigan. (n.d.). How is ECPE scored? Retrieved January 9, 2010, from English Language Institute, University of Michigan: http://www.lsa.umich.edu/UMICH/eli/Home/Test%20Programs/ECPE/ECPE09%20NewWritingRubric.pdf
    Foster, P., & Skehan, P. (1996). The influence of planning and task type on second language performance. Studies in Second Language Acquisition, 18, 299-323.
    Freedman, S. W. (1979a). Why do teachers give the grades they do? College Composition and Communication, 30(2), 161-164.
    Freedman, S. W. (1979b). How characteristics of student essays influence teachers' evaluations. Journal of Educational Psychology, 71(3), 328-338.
    Fulcher, G. (1996). Does thick description lead to smart tests? A data-based approach to rating scale construction. Language Testing, 13(2), 208-238.
    Fulcher, G. (2003). Testing second language speaking. London: Pearson Longman.
    Grabe, W. (2001). Notes toward a theory of second language writing. In T. Silva, & P. K. Matsuda (Eds.), On second language writing (pp. 39-56). Mahwah, NJ: Lawrence Erlbaum Associates.
    Grabe, W., & Kaplan, R. B. (1996). Theory and practice of writing: An applied linguistic perspective. NY: Longman.
    Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London: Longman.
    Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp. 241-276). NJ: Ablex Publishing.
    Hatch, E., & Lazaraton, A. (1991). The research manual: Design and statistics for applied linguistics. New York: Newbury House.
    Hayes, J. R. (1996). A new framework for understanding cognition and affect in writing. In C. M. Levy, & S. Ransdell (Eds.), The science of writing. NJ: Lawrence Erlbaum Associates.
    Henning, G. (1987). A guide to language testing. Cambridge, MA: Newbury House Publishers.
    Hughes, A. (2003). Testing for language teachers (2nd ed.). Cambridge: Cambridge University Press.
    Hunt, K. (1965). Grammatical structures written at three grade levels. In NCTE Research Report, No. 3. Champaign, IL: NCTE.
    Huot, B. (1996). Toward a new theory of writing assessment. College Composition and Communication, 47(4), 549-566.
    Jacobs, H., Zinkgraf, S., Wormuth, D., Hartfiel, V., & Hughey, J. (1981). Testing ESL composition: A practical approach. Rowley, MA: Newbury House.
    Johns, A. M. (1986). Coherence and academic writing: Some definitions and suggestions for teaching. TESOL Quarterly, 20(2), 247-266.
    Knoch, U. (2007). Diagnostic writing assessment: The development and validation of a rating scale. Unpublished doctoral dissertation, The University of Auckland, New Zealand.
    Lloyd-Jones, R. (1977). Primary trait scoring. In C. R. Cooper, & L. Odell (Eds.), Evaluating writing (pp. 33-69). NY: National Council of Teachers of English.
    Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really mean to the raters? Language Testing, 19(3), 246-276.
    Luoma, S. (2004). Assessing speaking. Cambridge: Cambridge University Press.
    Luoma, S., & Tarnanen, M. (2003). Creating a self-rating instrument for second language writing: From idea to implementation. Language Testing, 20, 440-465.
    McNamara, T. (1996). Measuring second language performance. NY: Addison-Wesley Longman.
    McNamara, T. (2000). Language testing. Oxford: Oxford University Press.
    Mendelsohn, D., & Cumming, A. (1987). Professors' ratings of language use and rhetorical organizations in ESL compositions. TESL Canada Journal, 5(1), 9-26.
    Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (pp. 13-103). New York: Macmillan / American Council on Education.
    Morris, L., & Cobb, T. (2004). Vocabulary profiles as predictors of the academic performance of Teaching English as a Second Language trainees. System, 32, 75-87.
    National Capital Language Resource Center (NCLRC). (n.d.). Assessing Learning. Retrieved August 28, 2010, from The Essentials of Language Teaching: http://www.nclrc.org/essentials/assessing/alternative.htm
    North, B. (1995). The development of a common framework scale of descriptors of language proficiency based on a theory of measurement. System, 23(4), 445-465.
    North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency scales. Language Testing, 15(2), 217-263.
    Oregon Department of Education. (2005). Six Trait Analytic Writing Rubric. Retrieved January 9, 2010, from AIMS Information Center: http://www.ade.state.az.us/standards/6traits/
    Ortega, L. (2003). Syntactic complexity measures and their relationship to L2 proficiency: A research synthesis of college-level L2 writing. Applied Linguistics, 24(4), 492-518.
    Perkins, K. (1983). On the use of composition scoring techniques, objective measures, and objective tests to evaluate ESL writing ability. TESOL Quarterly, 17(4), 651-671.
    Polio, C. G. (1997). Measures of linguistic accuracy in second language writing research. Language Learning, 47(1), 101-143.
    Sakyi, A. A. (2000). Validation of holistic scoring for ESL writing assessment: How raters evaluate compositions. In A. J. Kunnan (Ed.), Fairness and validation in language assessment: Selected papers from the 19th Language Testing Research Colloquium, Orlando, Florida (pp. 129-152). Cambridge: Cambridge University Press.
    Schmidt, R. (1992). Psychological mechanisms underlying second language fluency. Studies in Second Language Acquisition, 14, 357-385.
    Song, B., & Caruso, I. (1996). Do English and ESL faculty differ in evaluating the essays of native-English speaking and ESL students? Journal of Second Language Writing, 5(2), 163-182.
    Turner, C. E., & Upshur, J. A. (2002). Rating scales derived from student samples: Effects of the scale marker and the student sample on scale content and student scores. TESOL Quarterly, 36(1), 49-70.
    Upshur, J. A., & Turner, C. E. (1995). Constructing rating scales for second language tests. ELT Journal, 49, 3-12.
    Vaughan, C. (1991). Holistic assessment: What goes on in the rater's mind? In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp. 111-126). NJ: Ablex Publishing.
    Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University Press.
    Weir, C. J. (1990). Communicative language testing. NJ: Prentice Hall.
    White, E. M. (1984). Holisticism. College Composition and Communication, 35(4), 400-409.
    White, E. M. (1991). Teaching and assessing writing. San Francisco: Jossy-Bass Publishers.
    White, E. M. (1995). An apologia for the timed impromptu essay test. College Composition and Communication, 46, 30-45.
    Witte, S. P., & Faigley, L. (1981). Coherence, cohesion, and writing quality. College Composition and Communication, 32(2), 189-204.
    Wolfe-Quintero, K., Inagaki, S., & Kim, H.-Y. (1998). Second language development in writing: Measures of fluency, accuracy, and complexity. Honolulu: University of Hawai'i, Second Language Teaching and Curriculum Center.

    Full Text Availability: Not authorized for public release (campus and off-campus networks)
