
Text Difficulty Measurement for English Learning

【Author】 吴锦霞

【Supervisor】 刘秉权

【Author Information】 Harbin Institute of Technology, Computer Science and Technology, 2007, Master's degree

【Abstract】 Measuring the difficulty of English texts is an important topic in applied linguistics and information processing, with wide applications in teaching, publishing, and search engines. Given the abundance of online resources, the greatest challenge for text difficulty measurement is how to efficiently and accurately provide English learners at different proficiency levels with reading materials suited to their level. This thesis first introduces an internationally widespread approach: judging text difficulty with readability formulas. Such formulas typically estimate difficulty from lexical difficulty, measured by word frequency and word length, and syntactic difficulty, measured by sentence length. Although more than a hundred readability formulas exist, this thesis validates three representative ones, Flesch Reading Ease, the Gunning Fog Index, and the Automated Readability Index, on a corpus of texts. Readability formulas are easy to apply, but their scores cluster too tightly to support classification into levels. This thesis therefore builds a broadly applicable model for judging text difficulty. The vector space model is a typical text representation: it ignores word order and represents a text as a vector in a vector space, so that similarity between texts can be computed conveniently via the inner product or the cosine of the angle between vectors. Based on the vector space model, this thesis treats text difficulty measurement as a classification problem. This approach has several advantages: its output is not a binary value but a probability over the training set's classes, and it provides additional information such as term distributions. Several common feature selection methods are analyzed and evaluated experimentally, including document frequency, information gain, mutual information, the χ² statistic, expected cross entropy, weight of evidence for text, and odds ratio; the results show that odds ratio performs best and mutual information worst. The shortcomings of the TF-IDF weighting scheme are analyzed, and an improved weighting algorithm combining TF-IDF with inter-class and intra-class distribution information is proposed; experiments show that the improved weighting raises the classification F1 score. Finally, three classification algorithms, Rocchio's algorithm, K-nearest neighbors, and naive Bayes, are examined and compared experimentally; the multinomial Bayes method achieves the highest classification F1, above 80%.
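
The three readability formulas named in the abstract have standard published definitions. A minimal sketch of them in Python, where syllable counting uses a rough vowel-group heuristic (real implementations use dictionaries, so scores here are approximate):

```python
import re

def counts(text):
    """Return (sentences, words, characters, syllables, complex_words)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    chars = sum(len(w) for w in words)

    def syllables(word):
        # Crude heuristic: count vowel groups, discount a silent trailing 'e'.
        n = len(re.findall(r"[aeiouy]+", word.lower()))
        if word.lower().endswith("e") and n > 1:
            n -= 1
        return max(1, n)

    syl = [syllables(w) for w in words]
    complex_words = sum(1 for s in syl if s >= 3)
    return sentences, n_words, chars, sum(syl), complex_words

def flesch_reading_ease(text):
    s, w, _, syl, _ = counts(text)
    return 206.835 - 1.015 * (w / s) - 84.6 * (syl / w)

def gunning_fog(text):
    s, w, _, _, cw = counts(text)
    return 0.4 * ((w / s) + 100 * cw / w)

def automated_readability_index(text):
    s, w, chars, _, _ = counts(text)
    return 4.71 * (chars / w) + 0.5 * (w / s) - 21.43
```

Flesch scores run roughly 0–100 with higher meaning easier; Fog and ARI approximate a US school grade level. All three depend only on word length/syllables and sentence length, which is why, as the abstract notes, their scores cluster and discriminate poorly between learner levels.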

【Abstract】 English text difficulty measurement is an important topic in applied linguistics and information processing, widely used in teaching, publishing, search engines, and other fields. Because the web offers very rich reading materials, efficiently finding materials at different levels is a challenge for text difficulty measurement. This paper first introduces an internationally widespread method based on readability formulas. Typically, such formulas use only two variables: word length or word frequency, and average sentence length. We chose three formulas, Flesch Reading Ease, the Gunning Fog Index, and the Automated Readability Index, and tested them on data at different levels; the results were poor, so text difficulty cannot be measured reliably this way. We therefore focus on building a broadly applicable model of text to measure difficulty. The vector space model is a typical text representation that ignores term order and expresses a text as a vector in a vector space; a text can be scored by computing its cosine similarity to labeled samples, which is easy to implement. Based on the vector space model, this paper treats text difficulty measurement as a classification problem. This method has several advantages: its result is not a binary value but a probability over the entire training set, and it provides additional information, such as term distributions. For feature selection, the paper analyzes several commonly used methods, such as document frequency, information gain, mutual information, the χ² statistic, expected cross entropy, weight of evidence for text, and odds ratio. The results show that odds ratio is the best of these methods and mutual information the worst.
This paper also discusses the traditional term weighting algorithm, TF-IDF, and introduces inter-class and intra-class distribution factors into term weighting. Experimental results show that the improved algorithm outperforms the traditional method in F1. Finally, the paper examines three classification algorithms: Rocchio's algorithm, K-nearest neighbors, and naive Bayes. Experiments indicate that the multinomial Bayes method achieves the highest classification F1, above 80%.
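
The classification pipeline described above can be illustrated with a minimal multinomial naive Bayes sketch over bag-of-words documents. This is not the thesis's actual implementation (feature selection and the improved TF-IDF weighting are omitted), just the core technique reported to score best:

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Minimal multinomial naive Bayes over tokenized documents,
    with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.vocab = {t for d in docs for t in d}
        self.class_docs = Counter(labels)            # documents per class
        self.word_counts = defaultdict(Counter)      # per-class term counts
        for doc, y in zip(docs, labels):
            self.word_counts[y].update(doc)
        n, v = len(labels), len(self.vocab)
        self.log_prior = {y: math.log(c / n) for y, c in self.class_docs.items()}
        self.log_like = {}
        for y, wc in self.word_counts.items():
            denom = sum(wc.values()) + v             # add-one smoothing
            self.log_like[y] = {t: math.log((wc[t] + 1) / denom)
                                for t in self.vocab}
        return self

    def predict(self, doc):
        # Terms never seen in training are skipped: they carry no signal here.
        def score(y):
            return self.log_prior[y] + sum(
                self.log_like[y][t] for t in doc if t in self.vocab)
        return max(self.log_prior, key=score)
```

A toy usage, treating difficulty levels as classes (the hypothetical "easy"/"hard" labels stand in for learner levels):

```python
docs = [["the", "cat", "sat"], ["a", "dog", "ran"],
        ["quantum", "entanglement", "phenomena"],
        ["stochastic", "gradient", "optimization"]]
labels = ["easy", "easy", "hard", "hard"]
clf = MultinomialNB().fit(docs, labels)
clf.predict(["the", "dog", "sat"])   # classified as "easy"
```

Because the per-class scores are (log-)probabilities rather than a hard binary decision, the model can also report how confident it is in each level, which is the advantage the abstract highlights.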

  • 【CLC Number】TP391.1
  • 【Citations】1
  • 【Downloads】220