

Optimized Approach to TEM Band 4 Test Equating: A Tentative Research

【Author】 Jin Weimin (金微敏)

【Supervisor】 Lu Zhihong (卢志鸿)

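【Author information】 Beijing University of Posts and Telecommunications (北京邮电大学), Foreign Linguistics and Applied Linguistics, 2008, Master's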

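【Abstract (translated from the Chinese)】 A difficulty that large-scale language tests often face is how to ensure that scores on different test forms are comparable, so that examinees who take different forms are not treated unfairly because of differences in difficulty, reliability and score distribution. Equating safeguards test fairness. It is also needed to overcome the limitations of test measurement, to make scores on different forms comparable and interchangeable, to keep test results consistent and stable, to keep test-related decisions objective, and to be fair to examinees in high-stakes testing. However, equating methods have not yet been adopted as widely as they should be in many large-scale tests in China, and research on them remains very limited. This thesis first introduces the mainstream equating theories, the principal equating designs and the important equating methods in the international testing community. It then briefly reviews representative equating practices in large-scale language tests in China and other countries, such as the equating designs long used in TOEFL, GRE, the Chinese Proficiency Test (汉语水平考试), the English Proficiency Test (英语水平考试) and the College English Test Bands 4 and 6. Particular attention is given to the main equating methods built on classical test theory and item response theory (IRT), including mean equating, linear equating, the equipercentile method and IRT equating. Focusing on the usefulness, feasibility and applicability of equating in the design of the Test for English Majors Band 4 (TEM Band 4), the thesis uses an experimental study to look for equating methods that TEM Band 4 could draw on. In a small-scale simulated test, the author selected a number of common items as the shared part of two experimental tests and administered the two different tests to two groups of students under a common-item non-equivalent groups design. The raw data were computed and analyzed, and separate equating models were built on mean equating, the Tucker and Levine observed-score methods for non-equivalent groups, equipercentile equating and Rasch one-parameter IRT equating. The resulting equating data are instructive for the equating design of TEM Band 4 and lay the groundwork for finding an equating method suited to the realities of the test. On the basis of this experiment, data analysis and comparison lead to the following conclusions. First, to improve the objectivity, consistency and reliability of TEM Band 4 and the comparability of scores across forms, equating should be introduced into the test so that scores on different forms become comparable and interchangeable, reliability stays stable and results stay consistent. Second, the experiment shows that TEM Band 4 can achieve better reliability and validity through equating and can draw on the equating methods already used in large-scale language tests in China and abroad. Third, although the best equating method for TEM Band 4 still requires further research, the necessity and feasibility of introducing equating into the test are beyond doubt. Fourth, the results indicate that the Tucker observed-score linear method and the equipercentile method, both based on classical test theory, performed no worse in many respects than the IRT-based Rasch one-parameter method, and that the common-item non-equivalent groups design proved highly reliable and therefore deserves consideration in the equating design of TEM Band 4. Finally, classical test theory alone cannot be the sole basis for judging whether an item is suitable for a TEM Band 4 form; the thesis recommends that the equating design combine classical and IRT methods effectively, drawing on the strengths of each while avoiding their weaknesses.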

【Abstract】 Large-scale language tests constantly face the difficulty of guaranteeing that scores on different test forms are comparable or interchangeable. Test equating, the statistical process of adjusting scores on different test forms so that scores from the two forms are directly equivalent after conversion, is therefore needed to overcome measurement limitations, to make different test forms interchangeable, to ensure test consistency and objective decision-making, and to be fair to examinees in high-stakes tests. This thesis starts with an introduction to prevalent test equating theories, typical equating designs and representative equating practices applied in large-scale language tests in China and other countries. Special emphasis is laid on the application of CTT (Classical Test Theory) and IRT (Item Response Theory) in the major equating approaches, including mean equating, linear equating, the equipercentile method and the IRT equating method. With an eye on the usefulness, feasibility and applicability of test equating approaches in TEM Band 4 (Test for English Majors Band 4), the thesis sets out to explore, through a tentative empirical study, an equating design that can be recommended for TEM Band 4. Two groups of students in a relatively small sample took two separate experimental tests that shared common items. The scores were computed, analyzed and discussed in statistical models built on the mean equating approach, the Tucker and Levine observed-score methods for non-equivalent groups, the equipercentile equating method and the Rasch one-parameter IRT equating approach. The equating results thus obtained shed light on an equating design that suits the realities of the TEM Band 4 test. On the basis of the empirical study, the thesis concludes that, to improve the validity, interchangeability, objectivity and consistency of the TEM Band 4 test, efforts to make its test forms interchangeable are worthwhile and long overdue. Although the optimum approach still merits further empirical study, the TEM Band 4 test can be equated well by borrowing from existing equating practices widely accepted by the measurement community. The thesis also recommends that the common-item non-equivalent groups design be considered for TEM Band 4 because of its reliability, since the experimental data show that the CTT-based Tucker observed-score linear equating method and the equipercentile method are both as effective as, if not better than, the Rasch one-parameter equating method in a number of respects. The thesis further contends that CTT alone cannot be the sole basis for judging whether a particular item is suitable for a formal TEM Band 4 test, and that both CTT and IRT equating approaches should be considered in the equating design. It is therefore essential for the TEM Band 4 test to combine the two approaches effectively, minimizing the shortcomings of both while maximizing their respective strengths.
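
As a rough illustration of the two simplest approaches named in the abstract, the sketch below applies the textbook mean equating and linear equating formulas to two small score lists. It is a minimal sketch only: the function names and the scores are invented for illustration and are not taken from the thesis, and the formulas assume the two groups are comparable. The Tucker and Levine methods used in the thesis additionally rely on common-item scores to estimate the means and standard deviations for a combined population, which is not shown here.

```python
from statistics import mean, pstdev


def mean_equate(x, scores_x, scores_y):
    """Mean equating: shift a Form X score by the difference in form means."""
    return x - mean(scores_x) + mean(scores_y)


def linear_equate(x, scores_x, scores_y):
    """Linear equating: match both the mean and the standard deviation."""
    slope = pstdev(scores_y) / pstdev(scores_x)
    return mean(scores_y) + slope * (x - mean(scores_x))


# Hypothetical raw scores, invented for illustration (not data from the thesis).
form_x = [52, 60, 64, 70, 74]
form_y = [58, 64, 70, 76, 82]

# Form X score 60 expressed on the Form Y scale under each method.
print(mean_equate(60, form_x, form_y))
print(linear_equate(60, form_x, form_y))
```

Mean equating assumes the two forms differ in difficulty by a constant amount across the whole score range, whereas linear equating also allows the spread of scores to differ between forms.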

  • 【Classification No.】 H310.4
  • 【Downloads】 217