信息过滤系统中特征选择算法的研究

Research on Feature Selection Methods in Information Filtering System

【作者 (Author)】 王美方

【导师 (Supervisor)】 刘培玉

【作者基本信息 (Author Information)】 山东师范大学 (Shandong Normal University), Computer Software and Theory, 2008, Master's thesis

【摘要 (Abstract)】 With the rapid development and growing popularity of the Internet, electronic text information is expanding rapidly. How to organize and manage this information effectively, and how to find the information users need quickly, accurately, and comprehensively, is a major challenge currently facing information science and technology. As a key technology for processing and organizing the enormous volume of online information, network information filtering can resolve information clutter to a large extent and helps users locate the information they need accurately. At present, most research on information filtering concentrates on studying and improving various classification methods; feature selection, however, has always been fundamental to network information filtering and remains a bottleneck technology, so research on feature selection algorithms is also necessary. The feature selection algorithms in common use rely directly on the assumption of conditional independence among features: they construct an evaluation function and score each feature in the feature set individually. Because they consider neither the class relevance of features nor the redundancy within the selected subset, the subsets they choose are often redundant in their class-discriminating power, which degrades the final classification performance. This thesis studies and discusses feature selection algorithms for information filtering systems in the following respects:

1. The strengths and weaknesses of commonly used feature selection methods are analyzed, and directions for improvement are pointed out for the identified shortcomings. The thesis first gives a comprehensive analysis of feature selection techniques, with emphasis on the general framework of feature selection. The commonly used methods each have their strong and weak points; their advantages and disadvantages are analyzed in terms of computational complexity and classification performance, and the likely causes are indicated. In addition, comparative experimental conclusions for common feature selection algorithms are collected from the relevant literature; these conclusions are broadly consistent with the experimental results at the end of the thesis.

2. Starting from the definitions of feature relevance and redundancy, a feature selection framework, FSBC (feature selection based on correlation), is proposed. It splits feature selection into two steps: the first step selects a class-relevant feature subset; the second step removes redundant features from the candidate subset through redundancy analysis, yielding an optimized feature subset. For selecting class-relevant features, the evaluation function is built on the following principle: if a feature term t appears frequently in the documents of one class but rarely in the other classes, then t represents that class well; such a term should be given a higher weight and chosen as a feature word of that class to distinguish it from documents of other classes. The idea of TFIDF weighting is also introduced, combining term frequency and document frequency as the joint basis for evaluating feature terms. For the redundancy analysis, the K-Means algorithm, widely used in clustering, is adopted as the core redundancy-removal algorithm; the selection of the initial cluster centers and the setting of the initial number of clusters are improved so that this K-Means-style algorithm reduces the redundancy of the feature set more effectively.

3. Finally, the proposed feature selection strategy is tested experimentally on a network information filtering platform with satisfactory results. The FSBC framework is applied to a network information filtering system and compared experimentally with information gain (IG) and the CHI statistic. The experiments show that FSBC outperforms the other two methods in accuracy and recall, and performs especially well when the feature dimensionality is high.
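To make the first FSBC step above concrete, the following is a minimal sketch of a class-relevance score that rewards terms occurring frequently inside one class and rarely in the others, combining term frequency and document frequency in a TFIDF-like way. The function name `class_relevance_scores` and the exact scoring formula are illustrative assumptions and do not reproduce the thesis's actual evaluation function.

```python
import math
from collections import Counter, defaultdict

def class_relevance_scores(docs, labels):
    """Illustrative class-relevance score for candidate feature terms.

    Follows the principle stated in the abstract: a term that occurs often
    in the documents of one class (term frequency) and in many of that
    class's documents (document frequency), but rarely in the other
    classes, gets a high score for that class.  The exact combination
    below is an assumption for illustration only.
    """
    tf = defaultdict(Counter)   # tf[c][t]: occurrences of term t inside class c
    df = defaultdict(Counter)   # df[c][t]: number of class-c documents containing t
    n_docs = Counter(labels)    # number of documents per class

    for doc, c in zip(docs, labels):
        tf[c].update(doc)
        df[c].update(set(doc))

    classes = set(labels)
    scores = defaultdict(dict)
    for c in classes:
        n_other = sum(n_docs[o] for o in classes if o != c) or 1
        for t, freq in tf[c].items():
            in_df = df[c][t] / n_docs[c]                                 # document frequency inside c
            out_df = sum(df[o][t] for o in classes if o != c) / n_other  # document frequency elsewhere
            # TFIDF-like combination: reward in-class frequency,
            # penalise presence in the other classes.
            scores[c][t] = freq * in_df * math.log((1.0 + in_df) / (1.0 + out_df) + 1.0)
    return scores

# Toy usage: "free" and "offer" should score high for the "spam" class.
docs = [["free", "offer", "click"], ["meeting", "agenda", "notes"], ["free", "prize", "offer"]]
labels = ["spam", "ham", "spam"]
print(sorted(class_relevance_scores(docs, labels)["spam"].items(), key=lambda kv: -kv[1]))
```

The highest-scoring terms per class would form the candidate subset that the second step then prunes for redundancy.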

【Abstract】 With the rapid development and spread of the Internet, electronic text information is increasing rapidly. How to organize and process this large amount of document data, and how to find the information users are interested in quickly, exactly, and fully, is a great challenge for information science and technology. As the key technology for organizing and processing large amounts of document data, network information filtering can solve the problem of information disorder to a great extent and makes it convenient for users to find the required information quickly. Recently, most research on information filtering has focused on the exploration and improvement of different classification algorithms. However, feature selection has always been basic work in network information filtering and, furthermore, a bottleneck technology, so it is necessary to study feature selection algorithms. The commonly used feature selection algorithms directly use the assumption of conditional independence among features and evaluate each feature in the feature set separately through a constructed evaluation function. Because they do not take into account the class relevance of features or the redundancy of the feature subset, the feature subsets selected by these methods are often redundant in their ability to distinguish between categories, which leads to poor final classification results. In this paper, the following aspects of feature selection algorithms in information filtering systems are studied and discussed:

1. The strengths and weaknesses of commonly used feature selection methods are analyzed, and directions for improvement are pointed out for the weaknesses. The paper first gives a comprehensive analysis of feature selection technology and introduces the framework of feature selection in particular. The feature selection methods in common use each have their strong and weak points; we analyze their advantages and disadvantages in terms of computational complexity and classification performance and point out the likely causes. In addition, drawing on the related literature, we summarize comparative experimental conclusions for the common algorithms; these conclusions agree broadly with our final experimental results.

2. A feature selection framework, FSBC (feature selection based on correlation), is proposed from the definitions of feature relevance and redundancy. The feature selection process is separated into two steps: first, selecting the feature subset that is relevant to the categories; second, removing the redundant feature terms from the chosen candidate subset through redundancy analysis, finally obtaining the optimized feature subset. For selecting class-relevant features, this paper constructs an evaluation function to select feature terms according to the principle: if a feature term t appears frequently in the documents belonging to one category but rarely in the other categories, then t can represent this category well; it should be given a higher weight and selected as a feature word of the category, to distinguish it from documents of the other categories. In addition, this paper introduces the idea of TFIDF weight computation and combines the term frequency and the document frequency as the basis for evaluating features. For the redundancy analysis, this paper adopts the K-Means algorithm, commonly used in clustering, as the core algorithm for removing redundancy. The selection of the initial cluster centers and the setting of the initial number of clusters are improved so that the K-Means-style algorithm reduces the redundancy of the feature set more effectively.

3. Finally, the proposed feature selection strategy was applied on a network information filtering platform and achieved satisfactory experimental results. This paper applied the FSBC feature selection framework to a network information filtering system and carried out an experimental comparison with information gain (IG) and the CHI statistic. The experiments show that the FSBC method is better than the other two methods in accuracy and recall, and it performs especially well when the feature dimensionality is high.
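The second FSBC step is described above only at a high level: cluster the candidate features with a K-Means variant whose initial centers and cluster count are chosen in an improved way, and discard redundant features. The sketch below shows the general shape of such a step under assumed choices: each candidate term is represented by a vector (for example, its distribution over classes), terms are clustered with scikit-learn's standard KMeans (k-means++ initialization standing in for the thesis's improved initialization), and only the highest-scoring term in each cluster is kept. The function and parameter names (`remove_redundant_terms`, `term_vectors`, `term_scores`) are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def remove_redundant_terms(term_vectors, term_scores, n_clusters, random_state=0):
    """Keep one representative term per cluster of mutually similar terms.

    term_vectors: dict mapping term -> 1-D numpy array describing the term
        (e.g. its distribution over classes or documents).
    term_scores:  dict mapping term -> relevance score from the first step.
    n_clusters:   number of clusters; the thesis improves how this number and
        the initial centers are chosen, which is not reproduced here.
    """
    terms = list(term_vectors)
    X = np.vstack([term_vectors[t] for t in terms])

    km = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10,
                random_state=random_state).fit(X)

    best_per_cluster = {}
    for term, cluster in zip(terms, km.labels_):
        # Terms that fall into the same cluster are treated as redundant:
        # keep only the one with the best first-step relevance score.
        current = best_per_cluster.get(cluster)
        if current is None or term_scores[term] > term_scores[current]:
            best_per_cluster[cluster] = term
    return sorted(best_per_cluster.values(), key=lambda t: -term_scores[t])
```

The surviving terms form the optimized feature subset handed to the filtering system's classifier; terms grouped into the same cluster are treated as carrying overlapping class-discriminating information.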

  • 【分类号 (CLC Number)】 TP311.52
  • 【被引频次 (Cited by)】 5
  • 【下载频次 (Downloads)】 214