
基于特征评价的模式识别算法研究

Research on Pattern Recognition Algorithm Based on Feature Evaluation

【Author】 王丽娟

【Supervisor】 王晓龙

【Author Information】 哈尔滨工业大学 (Harbin Institute of Technology), Computer Application Technology, 2007, Doctoral dissertation

【Abstract (Chinese)】 The Euclidean distance is the similarity measure most commonly used in pattern recognition algorithms. When computing the similarity between data, it assigns the same importance to every feature, which does not match reality. In particular, when the feature dimensionality is high, a large number of irrelevant features degrade the accuracy of the Euclidean distance and hence the performance of pattern recognition algorithms, giving rise to the curse of dimensionality. The curse of dimensionality is usually addressed by feature selection, but feature selection is only suitable when a feature is either highly correlated with the class or completely irrelevant to it. This thesis uses feature evaluation to handle the curse of dimensionality when features are correlated with the class to different degrees.

To address the curse of dimensionality in fuzzy c-means, a feature weight learning algorithm based on the index CFuzziness is proposed. The algorithm assigns each feature a weight that distinguishes its contribution to clustering. With appropriate weight values, similar data move closer together and dissimilar data move farther apart, which yields better clustering. Minimizing CFuzziness with gradient descent produces a suitable weight for each feature. Applying the weights to fuzzy c-means gives the weighted fuzzy c-means algorithm, which emphasizes important features and suppresses redundant ones, and therefore clusters better. Experiments show that weighted fuzzy c-means outperforms fuzzy c-means in clustering.

To address the curse of dimensionality in the nearest neighbor classifier, two feature subset partition algorithms are proposed and corresponding multiple classifier fusion systems are built. First, the feature set is partitioned into several feature subsets; then each subset is classified by a component classifier; finally, the decisions of the component classifiers are fused into the output. Because each subset has lower dimensionality, the curse of dimensionality for each component classifier is alleviated. A well-chosen partition algorithm guarantees the accuracy and diversity of the component classifiers, and fusing their decisions yields better classification performance. This thesis constructs a feature subset partition algorithm based on the genetic algorithm and another based on mutual information. The genetic algorithm searches globally for the optimal partition according to the fusion accuracy of the multiple classifier system; it is a wrapper-type partition algorithm and can select the feature subset best suited to each component classifier. Mutual information selects a feature subset for each component classifier by forward greedy search according to the relevance between feature and class; it is a filter-type partition algorithm with low time complexity.

A fuzzy nearest neighbor classifier is proposed and used as the component classifier. The nearest neighbor classifier only outputs the class a datum belongs to, whereas the fuzzy nearest neighbor classifier outputs its membership degree in every class, which reflects the output more informatively. The decisions of the component classifiers are fused by the fuzzy integral to obtain the final classification. The fuzzy integral is a fusion algorithm based on a fuzzy measure; the fuzzy measure quantifies the importance of each component classifier and is learned from the training data. Compared with other fusion algorithms, the fuzzy integral considers not only the actual outputs of the component classifiers but also their importance, so it fuses well. Experiments show that both fuzzy nearest neighbor fusion algorithms, with feature subsets partitioned by the genetic algorithm and by mutual information, outperform the nearest neighbor classifier.

The three proposed algorithms are applied to recognizing the Corel image database. For each image, four feature files are extracted with the color histogram, color coherence vector, PWT and Hu moments, and serve as the input to the image recognition experimental system. Weighted fuzzy c-means clusters the images better than fuzzy c-means. For image classification, the fuzzy nearest neighbor fusion algorithms with GA-based and mutual-information-based feature subset partition are used; both clearly outperform the nearest neighbor classifier. Because the two partition algorithms adopt different strategies, the classification performance of the fusion algorithms varies with the dataset.
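The abstract describes weighted fuzzy c-means with feature weights learned by gradient descent on the CFuzziness index, whose exact form is not given here. The following is a minimal Python sketch of that idea, not the thesis's implementation: fuzzy c-means with a feature-weighted Euclidean distance, plus a weight learning loop that minimizes a generic partition-fuzziness criterion by finite-difference gradient descent as a stand-in for the analytic CFuzziness gradient. The function names, the stand-in criterion, and the clipping of weights to [0, 1] are illustrative assumptions.

```python
import numpy as np

def weighted_fcm(X, n_clusters, w, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Fuzzy c-means where the distance is feature-weighted:
    d(x, v) = sum_j w_j * (x_j - v_j)**2."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)                 # random fuzzy partition
    for _ in range(n_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]      # cluster centres
        diff = X[:, None, :] - V[None, :, :]          # shape (n, c, d)
        D = np.maximum(np.einsum('j,ncj->nc', w, diff ** 2), 1e-12)
        # Standard FCM membership update with the weighted distances.
        U_new = 1.0 / ((D[:, :, None] / D[:, None, :]) ** (1.0 / (m - 1))).sum(axis=2)
        if np.abs(U_new - U).max() < tol:
            return U_new, V
        U = U_new
    return U, V

def partition_fuzziness(U):
    """Generic fuzziness of a partition: 0 for crisp memberships, maximal
    when every membership equals 1/c (a stand-in for CFuzziness)."""
    return float(np.mean(U * (1.0 - U)))

def learn_feature_weights(X, n_clusters, lr=0.1, epochs=30, delta=1e-3):
    """Learn one weight per feature by gradient descent on the fuzziness
    criterion; the finite-difference gradient is a crude stand-in for the
    analytic gradient used in the thesis."""
    d = X.shape[1]
    w = np.full(d, 1.0)                               # all features start equal
    for _ in range(epochs):
        base = partition_fuzziness(weighted_fcm(X, n_clusters, w)[0])
        grad = np.zeros(d)
        for j in range(d):
            w_pert = w.copy()
            w_pert[j] += delta
            grad[j] = (partition_fuzziness(weighted_fcm(X, n_clusters, w_pert)[0]) - base) / delta
        w = np.clip(w - lr * grad, 0.0, 1.0)          # keep weights in [0, 1]
    return w
```

Under these assumptions, features whose removal makes the fuzzy partition crisper receive smaller weights, and the learned weights are then passed back to `weighted_fcm` to obtain the weighted clustering described in the abstract.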

【Abstract】 The Euclidean distance is the similarity measure most commonly used in pattern recognition algorithms. It assigns every feature the same importance, which does not hold in practice. When the feature dimensionality is high, the Euclidean distance may be dominated by irrelevant features, so the performance of pattern recognition algorithms based on it deteriorates; this is the curse of dimensionality. The problem can be lessened by feature selection, but feature selection works best when each feature is either highly correlated with the class or completely irrelevant. In this study, feature evaluation is used to deal with features whose relevance to the class varies in degree.

For the curse of dimensionality in fuzzy c-means, a feature weight learning algorithm based on the index CFuzziness is proposed. It assigns each feature an importance degree that denotes its role in clustering. Appropriate feature weights make data within a class more similar and data in different classes more separated, and the clustering performance improves accordingly. The weights are learned by minimizing CFuzziness with the gradient descent technique. Incorporating the weights into fuzzy c-means yields the weighted fuzzy c-means algorithm, which emphasizes the roles of important features and lessens those of irrelevant features. Experimental results show that weighted fuzzy c-means outperforms fuzzy c-means in clustering.

For the curse of dimensionality in the nearest neighbor classifier, two multiple classifier systems based on different feature subset partition methods are proposed. First, the feature set is decomposed into several feature subsets; then each subset is classified by one component classifier; finally, the decisions of the component classifiers are combined. Because each feature subset has low dimensionality, the curse of dimensionality is lessened, and if the partition method produces component classifiers that are both accurate and diverse, the multiple classifier system achieves better performance. A genetic algorithm (GA) and mutual information are used to partition the feature set. The GA performs the partition automatically with a global search strategy guided by the accuracy of the multiple classifier system; it is a wrapper method and can select the feature subset best suited to each component classifier. Mutual information selects salient feature subsets according to the relevance between feature and class with a forward greedy search strategy; it is a filter method and is computationally efficient.

A fuzzy nearest neighbor classifier is proposed and adopted as the component classifier. The nearest neighbor classifier outputs only the class of a datum, whereas the fuzzy nearest neighbor classifier outputs its membership degree in each class. The fuzzy integral is adopted to combine the decisions of the component classifiers with respect to a fuzzy measure; the fuzzy measure quantifies the importance degree of each feature subset and is learned from the training data. In comparison with other combination methods, the fuzzy integral considers both the actual outputs of the component classifiers and their importance degrees, and therefore outperforms them. Experimental results show that both multiple fuzzy nearest neighbor classifier systems, with feature subsets produced by the GA and by mutual information, achieve better classification performance than the nearest neighbor classifier.

The three proposed methods are applied to recognizing the Corel image database. Four feature datasets are extracted from the Corel images with the color histogram, color coherence vector, PWT and Hu moments, and serve as the input of the image recognition experimental system. Image clustering results show that weighted fuzzy c-means is superior to fuzzy c-means. Image classification uses the multiple fuzzy nearest neighbor classifier systems with GA-based and mutual-information-based feature subset partition; both improve on the classification performance of the nearest neighbor classifier. Because the GA and mutual information adopt different partition strategies, the performance of the multiple classifier system depends on the dataset.
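As a companion sketch, the classification pipeline can be illustrated as follows. This is a simplified, hypothetical rendering and not the thesis's code: features are ranked by mutual information and dealt round-robin into subsets (a filter-style simplification of the forward greedy partition), each subset feeds a Keller-style fuzzy k-nearest-neighbor component classifier, and the per-class supports are fused with a Choquet fuzzy integral over a Sugeno λ-measure whose densities come from resubstitution accuracy (a stand-in for the fuzzy measure learned from training data). `mutual_info_classif` from scikit-learn is used only to estimate mutual information; all other names are illustrative.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def partition_by_mi(X, y, n_subsets):
    """Rank features by mutual information with the class and deal them
    round-robin into n_subsets groups."""
    mi = mutual_info_classif(X, y, random_state=0)
    order = np.argsort(mi)[::-1]               # most informative first
    return [order[k::n_subsets] for k in range(n_subsets)]

def fuzzy_knn(X_train, y_train, X_test, n_classes, k=5, m=2.0):
    """Keller-style fuzzy k-NN: class memberships weighted by 1/d**(2/(m-1)).
    y_train holds integer labels 0..n_classes-1."""
    out = np.zeros((len(X_test), n_classes))
    for i, x in enumerate(X_test):
        d = np.linalg.norm(X_train - x, axis=1)
        nn = np.argsort(d)[:k]
        wts = 1.0 / np.maximum(d[nn], 1e-12) ** (2.0 / (m - 1.0))
        for wt, idx in zip(wts, nn):
            out[i, y_train[idx]] += wt
        out[i] /= out[i].sum()
    return out

def sugeno_lambda(g):
    """Root of prod(1 + lam*g_i) = 1 + lam with lam > -1, lam != 0 (bisection)."""
    g = np.asarray(g, float)
    if abs(g.sum() - 1.0) < 1e-9:
        return 0.0
    f = lambda lam: np.prod(1.0 + lam * g) - (1.0 + lam)
    if g.sum() > 1.0:                           # root lies in (-1, 0)
        lo, hi = -1.0 + 1e-9, -1e-9
    else:                                       # root lies in (0, inf)
        lo, hi = 1e-9, 1.0
        while f(hi) < 0.0 and hi < 1e12:
            hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def choquet_fuse(scores, densities, lam):
    """Choquet fuzzy integral of per-classifier supports for one class."""
    order = np.argsort(scores)[::-1]            # largest support first
    g_prev, total = 0.0, 0.0
    for idx in order:
        g_cur = densities[idx] + g_prev + lam * densities[idx] * g_prev
        total += scores[idx] * (g_cur - g_prev)
        g_prev = g_cur
    return total

def classify(X_train, y_train, X_test, n_subsets=4, n_classes=None):
    n_classes = n_classes or int(y_train.max()) + 1
    subsets = partition_by_mi(X_train, y_train, n_subsets)
    # Class supports from each component classifier: (n_subsets, n_test, n_classes).
    supports = np.stack([fuzzy_knn(X_train[:, s], y_train, X_test[:, s], n_classes)
                         for s in subsets])
    # Fuzzy densities: resubstitution accuracy of each component, kept inside (0, 1).
    dens = np.clip([
        (fuzzy_knn(X_train[:, s], y_train, X_train[:, s], n_classes).argmax(1) == y_train).mean()
        for s in subsets], 0.05, 0.95)
    lam = sugeno_lambda(dens)
    fused = np.array([[choquet_fuse(supports[:, i, c], dens, lam)
                       for c in range(n_classes)] for i in range(len(X_test))])
    return fused.argmax(axis=1)
```

The λ-measure lets the fusion reward agreement among the more reliable component classifiers rather than averaging them uniformly, which is the property the abstract attributes to the fuzzy integral; a GA-driven wrapper partition could be swapped in for `partition_by_mi` without changing the rest of the pipeline.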
