基于改进哈希算法的快速KNN文本分类方法
Fast KNN Text Categorization Method Based on Improved Hash Algorithm
【Author】 夏青松;
【Advisor】 郑诚;
【Author information】 Anhui University, Computer Software and Theory, 2012, Master's thesis
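【Abstract, translated from the Chinese】 As networks become ubiquitous and people rely increasingly on technology, more and more data is stored electronically on computers. In today's fast-paced society, finding the required data quickly and effectively, whether in large enterprise data stores or on the web, has become an important topic. Researchers in China and abroad have proposed a variety of techniques in response, such as database technology, keyword matching, and text classification. Classifying texts can effectively reduce the time spent searching for content of interest and improve the accuracy of the results, which improves the user experience to some extent. Commonly used classification techniques such as Bayesian classification, support vector machines, and decision trees need a large amount of time to train the classifier, and the classifier must be retrained whenever the training corpus is updated. A major advantage of the traditional KNN classifier is that it does not need retraining when the corpus grows, and its classification accuracy is also relatively high, so it has long been popular. However, KNN has a bottleneck of its own: it must compute the similarity between the text to be classified and every training text, which costs a great deal of time. This thesis proposes an improved KNN text classification method. Text lists are built on the several features with the smallest variance. When searching for neighbors, the method first determines the approximate range of values that the neighbors of the text to be classified take on these features, and then uses a hash algorithm to discard the vast majority of texts directly. Similarity to the text to be classified is computed only for the remaining texts, from which the nearest neighbors are selected; if fewer than K are found, the feature value ranges are widened until the requirement is met. This greatly speeds up text retrieval. In addition, based on the distance overflow rate between the class of a training text and the text to be classified, the similarity between the texts of that class and the text to be classified is reweighted, which improves classification accuracy. For feature selection, the traditional tf-idf algorithm is improved, and feature weights are adjusted according to the feature's part of speech, its grammatical role in the sentence, the article title, the abstract, the position of its paragraph, the position of its sentence, and prompt words in the sentence. Experimental results show that these measures improve text classification accuracy very effectively.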
【Abstract】 The growing popularity of networks and people's increasing dependence on technology mean that more and more data is stored in electronic form on computers. In today's fast-paced society, whether in large enterprise data stores or on the network, how to find the needed data quickly and efficiently has become an important topic. Experts at home and abroad have therefore proposed a variety of techniques, such as database technology, keyword matching, and text classification. Text classification can effectively reduce the time spent searching for content of interest, and it improves the accuracy of search results and, to a certain extent, the user experience. Commonly used text classification techniques such as Bayesian classification, support vector machines, and decision trees require a lot of time to train the classifier, and if the training texts are updated the classifier must be retrained. One major advantage of the traditional KNN classifier is that the classifier does not have to be retrained when training texts are added. Its classification accuracy is also relatively high, so it has long been popular. However, the KNN algorithm has a bottleneck: it must compute the similarity between the text to be classified and every text in the training set, which wastes a lot of time. This thesis proposes an improved algorithm: build text lists based on some of the features, compare those features with the corresponding features of the text to be classified, and, based on the result, hash to the subset of texts most likely to be needed. This greatly improves the speed of text retrieval. Based on the overflow rate, a quotient computed from the distance between a class and the text to be classified, the similarity between the texts of that class and the text to be classified is adjusted, which noticeably improves classification accuracy.
Features are selected with an improved version of the traditional tf-idf algorithm, and feature weights are adjusted according to part of speech, role in the sentence, the article title and summary, the position of the paragraph, the position of the sentence, and prompt words in the sentence. The experimental results indicate that these measures improve the accuracy of text classification very effectively.
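The candidate-filtering idea in the abstract can be illustrated with a short sketch. This is not the thesis's implementation: the class name `BucketKNN`, the fixed `bucket_width`, the use of cosine similarity, the majority vote, and the rule of widening the search by one bucket at a time are all assumptions made for illustration. The sketch indexes the lowest-variance features (as the abstract states), looks up only the training texts whose indexed feature values fall near the query's, and widens the range until at least K candidates remain. The overflow-rate reweighting and the tf-idf adjustments are not modeled.

```python
import math
from collections import Counter, defaultdict


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class BucketKNN:
    """Hypothetical sketch: KNN with a hash-bucket pre-filter."""

    def __init__(self, vectors, labels, n_index_features=2, bucket_width=0.1):
        self.vectors = vectors
        self.labels = labels
        self.width = bucket_width  # assumed fixed bucket size
        dims = range(len(vectors[0]))
        # Index the features with the smallest variance over the training set.
        by_variance = sorted(dims, key=lambda j: variance([v[j] for v in vectors]))
        self.index_features = by_variance[:n_index_features]
        # buckets[j][b] is the set of training ids whose feature j hashes to b.
        self.buckets = {j: defaultdict(set) for j in self.index_features}
        for doc_id, vec in enumerate(vectors):
            for j in self.index_features:
                self.buckets[j][self._bucket(vec[j])].add(doc_id)

    def _bucket(self, value):
        return math.floor(value / self.width)

    def candidates(self, query, radius):
        """Ids whose indexed features all lie within `radius` buckets of the query."""
        result = None
        for j in self.index_features:
            center = self._bucket(query[j])
            ids = set()
            for b in range(center - radius, center + radius + 1):
                ids |= self.buckets[j].get(b, set())
            result = ids if result is None else result & ids
        return result or set()

    def classify(self, query, k=3):
        k = min(k, len(self.vectors))
        radius = 0
        found = self.candidates(query, radius)
        # Widen the value range until at least k candidates survive the filter.
        while len(found) < k:
            radius += 1
            found = self.candidates(query, radius)
        # Exact similarity is computed only for the surviving candidates.
        ranked = sorted(found, key=lambda i: cosine(query, self.vectors[i]),
                        reverse=True)[:k]
        votes = Counter(self.labels[i] for i in ranked)
        return votes.most_common(1)[0][0]
```

In this sketch the similarity computation touches only the texts that share (or neighbor) the query's buckets, which is where the claimed speedup would come from. One caveat of the assumed bucketing: a feature with near-zero variance places almost every text in the same bucket and so filters very little.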
【Key words】 text classification; KNN; weighted feature; part-of-speech tagging; prompt words;