Graduate Student: Hu, Bo Sien (胡柏先)
Thesis Title: Comparison of Different Matching Strategies Using the Mantel-Haenszel Method to Detect DIF in the BIB Booklet Design (應用Mantel-Haenszel於BIB設計情境中進行DIF檢測時總分配對策略選擇之比較)
Advisor: Chen, Cheng Te (陳承德)
Committee Members: Chen, Cheng Te (陳承德); Tzou, Hue Ying (鄒慧英); Shih, Ching Lin (施慶麟)
Degree: Master
Department: Tsing Hua College - Institute of Learning Sciences
Year of Publication: 2016
Academic Year of Graduation: 104
Language: Chinese
Number of Pages: 97
Chinese Keywords: Mantel-Haenszel, 差異試題功能, BIB設計, 總分配對策略, 等化組合題本
English Keywords: Mantel-Haenszel, DIF, BIB design, matching strategy, equated pooled booklet
Chinese Abstract (translated):
In large-scale assessments, booklet designs are commonly used to administer a large number of items so that the latent traits of the examinee population can be sampled and described as broadly as possible; however, this also reduces the number of examinees responding to each item, which undermines the detection of differential item functioning (DIF). As such assessments have become widespread, non-parametric item analysis methods with a low barrier to entry have grown popular among practitioners. In past research on applying the non-parametric Mantel-Haenszel method to booklet-design contexts, the total-score matching strategy has been identified as the key factor governing DIF detection performance, yet few studies have compared the matching strategies developed in recent years within more complex booklet designs. Moreover, previous studies typically applied scale purification procedures either not at all or only in restricted forms, and manipulated the differences in mean booklet difficulty only to a small degree.
To clarify whether equated and unequated matching strategies differ in detection performance as the difference in mean difficulty between booklets grows, to examine how adding an iterative scale purification procedure based on current DIF results affects detection performance under different matching strategies, and to compare DIF detection performance under the main- and sub-dimension booklet designs of PISA 2012, this study manipulates sample size, the group difference in mean latent trait (impact), the percentage of DIF items, and the range of mean booklet difficulty. Three total-score matching strategies (block level, percent pooled booklet, and equated pooled booklet) are compared, under the PISA 2012 main- and sub-dimension booklet design, in terms of the power and Type I error rate of DIF detection when combined with a scale purification procedure based on current DIF results.
The results show that, for all three matching strategies, power in the sub-dimension design exceeded that in the main-dimension design. Within the same dimension and at comparable sample sizes, the detection performance of the block-level strategy declined as the percentage of DIF items increased, while the performance of the percent pooled booklet strategy was degraded by the three-way interaction of impact, the range of mean booklet difficulty, and the percentage of DIF items. The equated pooled booklet strategy outperformed the other two in both Type I error control and power across all conditions. The equated pooled booklet strategy with scale purification was further applied to real PISA 2012 Taiwan mathematics data in a gender-based DIF analysis, and about 30% of the items were flagged as exhibiting DIF. Based on these results, when DIF detection is required in a BIB booklet design, the equated pooled booklet strategy with scale purification is recommended, so that reliable results can still be obtained when differences in mean booklet difficulty are large or the test contains a high percentage of DIF items.
English Abstract:
In large-scale assessment programs, booklet designs are commonly adopted to sample a large number of test items while describing the latent traits of student participants. The structural missingness introduced by a booklet design not only reduces the number of examinees responding to each item but also leaves gaps in participants' responses, which in turn harms DIF assessment. Recently, increasing attention has been drawn to a non-parametric DIF assessment method, the Mantel-Haenszel (MH) test, owing to its simplicity. Although past studies have found that the matching strategy is crucial to the effectiveness of MH DIF assessment in tests adopting booklet designs, research comparing the effectiveness of recently developed matching strategies within a more authentic but complex booklet design context is relatively rare. Furthermore, DIF assessments in previous studies were often conducted with a limited scale purification procedure, or none at all, and the differences in mean difficulty between booklets were too small to influence the DIF assessment results.
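The MH test mentioned above stratifies examinees by a matching total score and aggregates, over score levels, the 2x2 tables of group (reference/focal) by item response (correct/incorrect). As a minimal sketch of the classical statistics (the function name and data layout are illustrative, not taken from the thesis):

```python
import numpy as np

def mantel_haenszel_dif(ref_resp, foc_resp, ref_score, foc_score):
    """Classical Mantel-Haenszel DIF statistics for one dichotomous item.

    ref_resp, foc_resp: 0/1 item responses for the reference and focal groups.
    ref_score, foc_score: matching total scores used to stratify examinees.
    Returns (continuity-corrected MH chi-square, common odds ratio, ETS delta).
    """
    A = EA = VA = num = den = 0.0
    for k in np.union1d(ref_score, foc_score):      # one stratum per score level
        a = np.sum(ref_resp[ref_score == k])        # reference correct
        b = np.sum(ref_score == k) - a              # reference incorrect
        c = np.sum(foc_resp[foc_score == k])        # focal correct
        d = np.sum(foc_score == k) - c              # focal incorrect
        T = a + b + c + d
        if T < 2:
            continue                                # stratum too thin to contribute
        A += a
        EA += (a + b) * (a + c) / T                 # E(A_k) under no DIF
        VA += (a + b) * (c + d) * (a + c) * (b + d) / (T**2 * (T - 1))
        num += a * d / T                            # odds-ratio numerator term
        den += b * c / T                            # odds-ratio denominator term
    chi2 = (abs(A - EA) - 0.5) ** 2 / VA            # MH chi-square, df = 1
    alpha = num / den                               # MH common odds ratio
    delta = -2.35 * np.log(alpha)                   # ETS delta metric
    return chi2, alpha, delta
```

Under the null hypothesis of no DIF, the chi-square statistic has one degree of freedom and the common odds ratio is close to 1.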
In this study, three research questions were raised. First, can the differences in MH DIF results between the equated pooled booklet matching strategy and the other matching strategies be amplified by increasing the difference in mean difficulty between booklets? Second, if the matching variable is iteratively purified according to presumed DIF assessment results rather than the true DIF items, does this purification procedure affect the performance of DIF assessment across the various matching strategies? Third, how do the DIF assessment results of the various matching strategies differ between the main dimension and sub-dimension of the PISA 2012 booklet design?
In this study, sample size, impact, percentage of DIF items, and the range of mean item difficulty between booklets were manipulated. The booklet design followed the authentic PISA 2012 design, and the matching variable was iteratively purified based on presumed DIF assessment results. The Type I error rate and power of DIF assessment under three matching strategies, namely block level, percent pooled booklet, and equated pooled booklet, were recorded.
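The equated pooled booklet strategy links each booklet's total score onto a common scale before pooling matching scores across booklets. As one hedged illustration only (the thesis's actual linking method may differ, e.g., equipercentile equating), a mean-sigma linear equating of one booklet's scores onto a reference booklet's scale can be sketched as:

```python
import numpy as np

def linear_equate(x_scores, ref_scores):
    """Mean-sigma linear equating: map total scores from one booklet onto
    the scale of a reference booklet, assuming randomly equivalent groups.

    eq(x) = (sigma_ref / sigma_x) * (x - mean_x) + mean_ref
    """
    mx, sx = np.mean(x_scores), np.std(x_scores)
    mr, sr = np.mean(ref_scores), np.std(ref_scores)
    return sr / sx * (np.asarray(x_scores, dtype=float) - mx) + mr
```

After each booklet's scores are transformed this way, the equated totals can be pooled into a single matching variable for the MH strata.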
The findings indicated that the power rates in the sub-dimension were higher than those in the main dimension for all three matching strategies. Controlling for dimension and sample size, the power rates of the block-level matching strategy decreased as the number of DIF items in the test increased. The power rates of the percent pooled booklet strategy were affected by the three-way interaction of impact, range of mean item difficulty between booklets, and percentage of DIF items. Compared with the previous two strategies, the equated pooled booklet strategy yielded the best Type I error rate and the highest power rate in all scenarios. Furthermore, a real data example drawn from the PISA 2012 mathematics test in Taiwan was analyzed for gender DIF using the equated pooled booklet strategy; approximately 30% of the items were deemed to be DIF. Based on these results, the equated pooled booklet strategy with an iterative purification procedure is strongly recommended for DIF assessment, especially when there are large differences in mean difficulty between booklets or when many DIF items are expected in the test.
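The iterative purification procedure recommended above can be sketched as a fixed-point loop: items currently flagged as DIF are removed from the matching total, the MH tests are rerun with the purified total, and the process repeats until the flagged set stabilizes. In this sketch, `flag_dif` is a hypothetical callback standing in for a full MH rerun over the given matching items:

```python
def iterative_purification(items, flag_dif, max_iter=10):
    """Iteratively purify the matching variable for DIF assessment.

    items: all item identifiers in the test.
    flag_dif: callback taking the list of items currently used to build the
        matching total and returning the set of items flagged as DIF
        (hypothetical stand-in for rerunning the MH tests).
    Returns the flagged set once it stops changing (or after max_iter passes).
    """
    flagged = set()
    for _ in range(max_iter):
        matching = [i for i in items if i not in flagged]  # purified total
        new_flagged = set(flag_dif(matching))
        if new_flagged == flagged:                         # converged
            break
        flagged = new_flagged
    return flagged
```

Because flagging is based on presumed rather than true DIF items, the loop can change its verdicts between passes; iterating to convergence is what distinguishes this from a single-pass purification.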