
基于信息熵的特征选择算法研究

A Study on Feature Selection Algorithms Using Information Entropy

【Author】 刘华文 (Liu Huawen)

【Supervisor】 刘磊 (Liu Lei)

【Author Information】 Jilin University, Computer Software and Theory, 2010, PhD

【Abstract (translated from the Chinese)】 With the continual emergence of new technologies, real-world datasets are growing ever larger while exhibiting few samples and high dimensionality, which poses great challenges to traditional classification learning; the presence of redundant features indirectly aggravates these adverse effects. How to remove redundant or irrelevant features from high-dimensional data, so as to avoid the curse of dimensionality and allow traditional learning algorithms to still be trained on such data, has therefore become a difficult problem. Feature selection was proposed in this context: it refers to the process of selecting an optimal feature subset from the original features of the data such that the subset carries all or most of the classification information of the original features. Feature selection is currently one of the important research topics in data mining, statistical pattern recognition, machine learning and related fields, and it is also widely applied in practice. This thesis first introduces the classification problem in data mining, then reviews the state of the art and research hotspots of feature selection, with emphasis on information-based measures. It then addresses the elimination of redundant features: taking information entropy and mutual information from information theory as the foundation, and centering on how to measure the relevance between features, a series of studies is carried out to solve different classification and prediction problems. The main contributions and research content of this thesis are as follows:

1) Following the idea of hierarchical clustering in data mining, a new filter feature selection algorithm, ISFS, is proposed. It uses mutual information and a correlation coefficient to represent the "inter-class distance" and "intra-class distance" between features, respectively, and selects important features according to hierarchical clustering analysis, so that the selected feature subset has minimal redundancy and maximal relevance, ultimately improving classification performance.

2) A generalized representation of the different information measures used in existing feature selection algorithms is given, and its relationship to other information criteria is discussed in detail. In addition, to address the inaccurate estimation of mutual information in existing selection algorithms, the concept of dynamic mutual information is proposed to describe the relevance between features accurately. On this basis, two new feature selection algorithms, based on dynamic mutual information (DMIFS) and conditional dynamic mutual information (CDMI), are proposed to overcome the shortcoming of traditional mutual-information-based selection, namely that it cannot accurately reflect how relevance changes dynamically during the selection process.

3) To address the fact that sample weights remain fixed in existing feature selection algorithms, a new Boosting-based feature selection framework is proposed using data sampling techniques. It captures how the importance of data samples keeps changing during the selection process, thereby avoiding the sensitivity of dynamic mutual information to noisy data. Other issues within this framework, such as sample-weight updating schemes and error functions, are also discussed.

4) For the few-sample, high-dimensional gene expression data in bioinformatics, genes (i.e., features) are grouped by similarity using the information correlation coefficient and the approximate Markov blanket technique, and, combined with ensemble learning, an ensemble gene selection algorithm, EGSG, is proposed to improve the recognition and diagnostic ability of the classification model.
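All of the contributions above rest on the same information-theoretic quantities, information entropy and mutual information. As a point of reference only (not code from the thesis), the following minimal Python sketch shows how these quantities are typically estimated from discrete feature and label columns.

```python
# A minimal sketch of the information measures the algorithms build on:
# empirical entropy H(X) and mutual information I(X; Y) estimated from
# discrete feature/label columns. Illustrative only, not the thesis' code.
from collections import Counter
from math import log2

def entropy(xs):
    """Empirical entropy H(X) in bits of a discrete sequence."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Toy usage: a feature that fully determines the label has I(X; Y) = H(Y).
feature = [0, 0, 1, 1, 2, 2]
label   = ['a', 'a', 'b', 'b', 'a', 'a']
print(mutual_information(feature, label))   # about 0.918 bits here
```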

【Abstract】 With the emergence and rapid development of new technologies, a large volume of data, which plays an increasingly important role in our everyday life, has been accumulated in various fields. Nowadays, how to extract important and useful information from such massive data, and then take full advantage of it to achieve practical goals, has become a pressing problem. In this context, the concept of data mining was introduced. It refers to the process of discovering hidden and potentially useful information or knowledge from large volumes of noisy data. Data classification is one of the most extensively studied issues in data mining. Its main purpose is to learn patterns or behavior from known data so as to assist users in predicting or determining the patterns of unknown data. Since the features of these patterns carry much important information or knowledge, they form one of the theoretical bases for predicting the behavior patterns of unknown data.

However, many features in databases are redundant or irrelevant. They pose great challenges to traditional classification algorithms, such as reducing the efficiency of learning algorithms, distracting the learning process, and ultimately yielding models of poor quality that over-fit. Additionally, the more features a database has, the more noise it tends to contain, which places learning algorithms in even more adverse situations. To tackle this problem, the concept of feature selection was introduced. It refers to the process of identifying, according to some given measurement, an optimal feature subset that embodies most of the information in the data from the original feature space. Feature selection has become a hot topic in fields such as data mining, statistical pattern recognition and machine learning; it attracts much scholarly attention and is also widely used in practice.

This study mainly focuses on removing redundant or useless features from databases. In this thesis, the degree of relevance between features is measured by information measurements whose basis is information entropy and its related concepts. Moreover, the thesis proposes five feature selection algorithms using such information measurements to deal with prediction tasks in different classification settings. Specifically, the research work of this thesis is carried out in the following aspects.

Firstly, a new filter feature selection algorithm, called ISFS, is proposed. Its idea mainly originates from data clustering analysis. Unlike traditional clustering algorithms, however, a data item in ISFS is a single feature rather than a sample in the dataset. In addition, the class feature (i.e., the classification labels) of the dataset is taken as a special group, called the label group, while the other features are assigned to candidate groups and a selected group, where the candidate groups and the selected group denote single candidate features and the already-selected features, respectively. In this way, the feature selection problem is transformed into a problem of agglomerative hierarchical clustering. During each clustering (selection) step, the selected group is merged with the candidate group that has the greatest "distance" to the label group and the smallest "distance" to the selected group. This clustering procedure is repeated until the number of features in the selected group exceeds a specified threshold. To measure the "distance" between data items, ISFS adopts two information criteria, mutual information and correlation coefficients, to measure the distance to the label group (inter-class distance) and to the selected group (intra-class distance), respectively. The purpose of the clustering is therefore to keep the selected group with the minimal intra-class distance and the maximal inter-class distance to the label group at the same time, so that features with greater discriminative power are selected with higher priority during the whole clustering process.
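For illustration only, the following Python sketch shows a greedy selection loop of the kind described above. The combined relevance-minus-redundancy score, the use of Pearson correlation for the intra-class distance, and all names are assumptions made for the sketch, not the exact ISFS criterion.

```python
# A simplified sketch of an ISFS-like greedy filter loop: add the candidate
# feature with the largest "inter-class distance" (mutual information with the
# class label) and the smallest "intra-class distance" (here approximated by
# mean absolute Pearson correlation with the already-selected features).
# X is a (samples x features) array of discrete features, y the class labels.
import numpy as np
from sklearn.metrics import mutual_info_score

def isfs_like(X, y, k):
    selected, candidates = [], list(range(X.shape[1]))
    while candidates and len(selected) < k:
        def score(j):
            relevance = mutual_info_score(X[:, j], y)            # inter-class
            if not selected:
                return relevance
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                  for s in selected])             # intra-class
            return relevance - redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy usage with random discrete data (illustrative only).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100, 8))
y = (X[:, 0] + X[:, 3]) % 2
print(isfs_like(X, y, k=3))
```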
Secondly, a generalized representation of information metrics is introduced. It brings most information metrics used in feature selection algorithms into a unified form. The relationships between this measurement and others are discussed in detail, and a unified framework for feature selection using information measurements is then given.

Because information entropy and mutual information can effectively measure the degree of uncertainty of a feature, they have been extensively studied in feature selection. However, the entropy or mutual information values used in existing selection algorithms may contain "false" information and therefore cannot exactly represent the degree of relevance between features. To this end, dynamic mutual information is introduced to accurately represent the relevance between features, which changes dynamically during the selection process. Unlike traditional mutual information, its value is estimated not on the whole sample space but only on the unrecognized samples. Based on this concept, two feature selection algorithms, using dynamic mutual information (DMIFS) and conditional dynamic mutual information (CDMI) respectively, are proposed. These algorithms work much like decision tree induction in classification: the samples that can be identified by the newly selected feature are removed from the sample space, and the mutual information of the candidate features is then re-estimated on the remaining samples.

Thirdly, based on data sampling techniques, a new framework for feature selection using the boosting technique is presented, and several related issues, such as weight updating strategies and error functions, are discussed. The purpose of using boosting is to overcome the negative effects of removing recognized samples when estimating the dynamic mutual information of features, and to alleviate the impact of noise on the whole selection process. During selection, the proposed method estimates mutual information by dynamically adjusting the weights of samples, so that the estimated values can accurately measure the correlation between features as it changes along with the selection process. As a result, the feature selected at each step can exactly capture the characteristics of the classification task. Unlike other boosting-based selection algorithms, the result of this framework is a feature subset rather than one or several classifiers. In addition, its updating strategy is not based on the error of a base classifier; that is, the information measurement used in the method is independent of base classifiers.
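The following sketch illustrates, under simplifying assumptions, the dynamic-mutual-information idea: after each feature is selected, the samples it already "recognizes" are dropped and the mutual information of the remaining candidates is re-estimated on the remaining samples (the boosting framework described above softens this by reweighting rather than removing samples). Treating a sample as recognized when its value of the selected feature co-occurs with a single class label is an illustrative assumption, not necessarily the exact rule used in the thesis.

```python
# A rough sketch of DMIFS-style selection: mutual information is re-estimated
# only on the samples not yet "recognized" by the selected features. The
# purity-based notion of "recognized" below is an illustrative assumption.
import numpy as np
from sklearn.metrics import mutual_info_score

def dmifs_like(X, y, k):
    remaining = np.arange(len(y))             # indices of unrecognized samples
    selected, candidates = [], list(range(X.shape[1]))
    while candidates and len(selected) < k and len(remaining) > 0:
        # Dynamic MI: estimated on the remaining (unrecognized) samples only.
        best = max(candidates,
                   key=lambda j: mutual_info_score(X[remaining, j], y[remaining]))
        selected.append(best)
        candidates.remove(best)
        # Drop samples whose value of the selected feature is class-pure.
        keep = []
        for v in np.unique(X[remaining, best]):
            idx = remaining[X[remaining, best] == v]
            if len(np.unique(y[idx])) > 1:
                keep.extend(idx.tolist())
        remaining = np.array(keep, dtype=int)
    return selected

# Toy usage: feature 2 determines y here, so once it is selected every sample
# is recognized and the selection may stop early.
rng = np.random.default_rng(1)
X = rng.integers(0, 4, size=(200, 10))
y = (X[:, 2] > 1).astype(int)
print(dmifs_like(X, y, k=3))
```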
Finally, for the small-sample, high-dimensional gene expression datasets in bioinformatics, an ensemble gene selection algorithm named EGSG is proposed. It uses the information correlation coefficient to measure the relevance between genes and the approximate Markov blanket technique to divide genes into several groups, which represent different biological functions. After this partition stage, a gene subset is formed by picking one representative gene from each group. Since the genes in the subset come from different groups, the subset has prediction capability similar to that of the original genes. Because a single gene subset has limited capability on high-dimensional gene expression data, EGSG takes advantage of multiple gene subsets to further improve prediction capability in disease diagnosis through ensemble techniques. That is, EGSG obtains several gene subsets in the same way and then aggregates the base classifiers trained on these subsets into an overall classifier by majority voting. Compared with other ensemble methods, the diversity of EGSG comes from combining different gene subsets, each of which has similar biological function and high discriminative capability.

Simulation experiments carried out on public datasets demonstrate the performance and effectiveness of the selection algorithms proposed in this thesis: in most cases they effectively improve the efficiency and performance of learning algorithms for classification. Nevertheless, several problems remain. For example, ISFS is relatively inefficient, CDMI and DMIFS are sensitive to noise, the performance of the boosting algorithm depends on specific parameter values, and the results of EGSG are difficult to interpret. Future work will therefore address these issues in order to further improve the performance and efficiency of the algorithms.
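As a rough illustration of the grouping-and-voting scheme behind EGSG described above, the sketch below groups correlated "genes" with a plain correlation threshold (standing in for the information correlation coefficient and approximate Markov blanket used in the thesis), draws one representative per group to form each gene subset, and combines base classifiers by majority voting. All names, thresholds and the choice of base classifier are assumptions, not the thesis' implementation.

```python
# Illustrative EGSG-like sketch: group correlated genes, pick one gene per
# group for each subset, train a base classifier per subset, majority-vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def group_genes(X, threshold=0.6):
    """Greedy grouping: a gene joins the first group whose seed it correlates with."""
    groups = []
    for j in range(X.shape[1]):
        for g in groups:
            if abs(np.corrcoef(X[:, j], X[:, g[0]])[0, 1]) >= threshold:
                g.append(j)
                break
        else:
            groups.append([j])
    return groups

def egsg_like_predict(X_train, y_train, X_test, n_subsets=5, seed=0):
    """Assumes integer class labels in y_train."""
    rng = np.random.default_rng(seed)
    groups = group_genes(X_train)
    votes = []
    for _ in range(n_subsets):
        subset = [g[rng.integers(len(g))] for g in groups]   # one gene per group
        clf = DecisionTreeClassifier(random_state=0).fit(X_train[:, subset], y_train)
        votes.append(clf.predict(X_test[:, subset]))
    # Majority vote across the base classifiers.
    return np.array([np.bincount(col).argmax() for col in np.array(votes).T])

# Toy usage: 30 samples, 12 "genes" where columns are noisy copies of 4 bases.
rng = np.random.default_rng(2)
base = rng.normal(size=(30, 4))
X = np.hstack([base + 0.1 * rng.normal(size=base.shape) for _ in range(3)])
y = (base[:, 0] > 0).astype(int)
print(egsg_like_predict(X[:20], y[:20], X[20:]))
```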

  • 【Online Publication Contributor】 Jilin University
  • 【Online Publication Year and Issue】 2010, Issue 08