

Multi-valued and Multi-labeled Data Classification

【Author】 郭跃健

【Advisor】 李宏

【Author Information】 Central South University, Information and Communication Engineering, 2010, Master's

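【Abstract (translated from Chinese)】 With the rapid development of computer, network, and database technology, a growing number of real-world applications are closely tied to multi-valued attributes and multi-labeled data, so classification algorithms for such data have become a research hotspot in data mining and machine learning. Current research concentrates on classification algorithms for multi-labeled data and does not consider multi-valued attributes, and most algorithms fail to fully learn the correlations among labels. Together with the scarcity of multi-labeled samples in practice and the difficulty of labeling them, this poses many new challenges to traditional classification algorithms. The main work of this thesis has three parts: (1) it proposes five multi-valued attribute decomposition algorithms and combines them with existing multi-label classification algorithms to build a learning framework for multi-valued, multi-labeled classification; experiments compare the decomposition algorithms and verify that decomposing by value order gives the best learning performance; (2) it improves existing Bayesian network algorithms and proposes a multi-label learning algorithm that combines the General Bayesian Network (GBN) and the Multi-net Bayesian Network (MBN), which effectively captures the correlations among multiple labels and considerably improves classification accuracy; (3) to address the shortage of labeled samples in multi-labeled data, it analyzes in depth the shortcomings of algorithms based on label combination, builds a hierarchical model of label combinations, and proposes uncertainty-based active learning and confidence-based semi-supervised learning, which alternately select the most informative samples for learning and finally yield a hierarchical multi-label classifier model; experiments verify that this method greatly improves the effectiveness and robustness of multi-label classifiers. The results of this thesis provide effective methods for learning the correlations among labels and for multi-label classification with few labeled samples, and, by incorporating multi-valued attribute decomposition, establish a new learning framework for classifying multi-valued, multi-labeled data.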

【Abstract】 With the rapid development of computer technology, the Internet, and database systems, more and more applications involve multi-valued and multi-labeled datasets. Hence, multi-valued and multi-labeled classification has become a hot topic in data mining and machine learning. At present, most existing research addresses multi-labeled classification without considering the multi-valued problem. Meanwhile, the correlations between different labels have not been studied adequately. Moreover, the lack of labeled samples leaves insufficient information to learn from during the training stage. All of these pose new challenges to traditional classifiers. This thesis makes three contributions. First, it puts forward a new learning framework for multi-valued and multi-labeled classification by combining multi-value decomposition with multi-labeled classification algorithms. Five decomposition methods are proposed, and experiments show that the Rank Order method performs best. Second, building on existing Bayesian network algorithms, this thesis constructs a multi-labeled Bayesian network that combines the General Bayesian Network (GBN) and the Multi-net Bayesian Network (MBN). The proposed algorithm learns the correlations among labels more effectively and considerably improves classification accuracy. Third, to address the lack of labeled samples, active learning and semi-supervised multi-labeled classification are performed alternately on a hierarchical model. Experimental results demonstrate that this algorithm greatly improves the effectiveness and robustness of the classifier. This thesis provides an effective way to learn correlations between different labels and to construct a robust classifier from a limited number of labeled samples. By combining multi-valued decomposition with multi-label classification algorithms, it builds a new learning framework for multi-valued and multi-labeled datasets.
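
The alternation between active learning and semi-supervised learning mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: the entropy-based uncertainty measure, the fixed confidence threshold, the input format, and the names `entropy` and `split_unlabeled_pool` are all assumptions chosen for illustration. Each unlabeled sample is assumed to come with a predicted probability distribution over label combinations; the most uncertain samples are sent for manual labeling, and confidently predicted samples receive a pseudo-label.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def split_unlabeled_pool(pool_probs, n_query=1, confidence_threshold=0.9):
    """Pick samples to query (active step) and to pseudo-label (semi-supervised step).

    pool_probs maps a sample id to predicted probabilities over label
    combinations. This input format is a hypothetical choice for the sketch.
    """
    # Active learning step: the highest-entropy (most uncertain) samples
    # are handed to a human annotator.
    by_uncertainty = sorted(
        pool_probs, key=lambda s: entropy(pool_probs[s]), reverse=True
    )
    to_query = by_uncertainty[:n_query]

    # Semi-supervised step: samples whose top prediction is confident enough
    # receive the index of that label combination as a pseudo-label.
    pseudo_labeled = {}
    for sample_id, probs in pool_probs.items():
        if sample_id in to_query:
            continue
        best = max(range(len(probs)), key=lambda i: probs[i])
        if probs[best] >= confidence_threshold:
            pseudo_labeled[sample_id] = best
    return to_query, pseudo_labeled


# Toy pool: "a" is nearly uniform, "b" is confident, "c" is in between.
pool = {
    "a": [0.34, 0.33, 0.33],
    "b": [0.95, 0.03, 0.02],
    "c": [0.60, 0.30, 0.10],
}
to_query, pseudo_labeled = split_unlabeled_pool(pool, n_query=1)
print(to_query)        # sample ids to send for manual labeling
print(pseudo_labeled)  # sample id -> index of the pseudo-labeled combination
```

In a full training loop, both groups would be added to the labeled set, the classifier retrained, and the selection repeated on the remaining pool; samples such as "c" that are neither uncertain enough nor confident enough simply stay in the pool for a later round.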

  • 【Online Publication Contributor】 Central South University
  • 【Online Publication Issue】 2011, Issue 03

