节点文献

基于构造性学习的覆盖算法的发展及应用

The Development and Application of Covering Algorithm Based on Constructive Learning

【作者】 段震

【导师】 程家兴;

【作者基本信息】 安徽大学 , 计算机应用技术, 2010, 博士

【摘要】 机器学习通过使用机器来模拟人类的学习活动,从已知事物中发现规律、获取知识,从而建立对未知事物的预测模型,根据经验不断提高自身的水平。研究者经过多年的探索,提出了如支持向量机、决策树、神经网络等多种优秀的学习方法,并将这些方法推广到机器学习中的各个领域。中国学者在基于覆盖思想的学习方法上进行了很多工作,张铃和张钹所提出的基于构造性学习的覆盖算法被认为是一种具有代表性的方法。覆盖算法能够根据样本的自身特点来构造神经网络,克服了传统神经网络中的一些缺陷,如网络结构难以确定、速度慢等。该方法形式直观,能够有效对多类分类问题和海量数据进行处理,在一些实际应用中表现出良好的性能。相关研究者围绕着该算法的改进和应用进行了大量的研究工作。目前已有的对覆盖算法的各项工作都是针对单示例单标记的学习方式来进行的,但随着机器学习的发展,不断出现一些新的学习问题。本文结合机器学习中出现的一些新模型,对覆盖算法进行了发展和应用,主要体现在如下几个方面:(1)对覆盖算法进行了全面研究,并将算法应用于实际分类问题的解决。本文对覆盖算法的基本模型以及近年来所取得的各项理论和应用成果进行了全面研究,探讨了如何将算法应用于文本分类和垃圾邮件过滤等问题的解决。在应用过程中,针对实际问题的特点,设计了不同的改进策略。在文本分类中,通过引入维数调节的策略,使不同类别文本的特征能够在特征向量中均等出现,提高文本分类的准确率。在垃圾邮件过滤中,将邮件的各类附加信息与正文内容一起构成复合特征,提高过滤器的分类效果,并针对垃圾邮件过滤中正常邮件的风险最小化问题进行了讨论。(2)对核覆盖算法进行了细致分析,将算法加强为模糊核覆盖算法。支持向量机方法通过将样本映射到高维特征空间后构造最优分类超平面,取得了优秀的分类性能。将核函数引入到覆盖算法后所得到的核覆盖算法能够有效提高分类能力,但仍存在不足之处。本文对核覆盖算法中的半径选择策略和分类原则进行了细致分析,指出现有处理方式所存在的缺陷。通过改变领域半径的确定原则,并对拒识样本引入新的隶属度函数来描述样本对各个类别的隶属度,将算法加强为模糊核覆盖算法,明确了隶属度函数的物理意义。引入几种性能不同的覆盖约简方法,结合模糊核覆盖算法,能够在保持识别性能的前提下,有效降低覆盖数量,提高分类效率。在一些数据集上的测试和对比表明了方法的有效性。(3)研究了多标记学习下的覆盖算法。传统学习问题中,每个样本只属于一个类别,即仅有一个标记。而在实际应用中,一个样本可能同时属于多个类别,如文本分类和场景分类等。本文对多标记学习中的样本集分解和算法改造两种策略进行了研究,针对多标记学习的特点和评价指标,探讨如何使用覆盖算法来解决多标记问题。实验表明,多标记覆盖算法的性能达到了同类算法中的先进水平,并且在时空开销上具有优势。由于多标记学习中对训练数据的标记需要更多的人力和物力,因此数据集中的已标记样本数量一般较少。为了能利用大量未标记样本来辅助学习,本文采用半监督学习中的自训练策略,结合已标记样本和未标记样本来训练分类器,提高分类性能,取得了一定成效。(4)讨论了如何在多示例学习中使用覆盖算法进行分类。多示例学习与传统的监督学习、无监督学习和强化学习都存在差异,是机器学习中的第四种框架,起源于药物分子活性预测的研究。在多示例学习中,学习的对象是由多个示例所构成的包,包的标记已知,示例的标记未知,但包的标记是由某些示例决定的。多示例学习的难度比带噪声的监督学习难度更大。本文对现有的各类多示例学习方法进行了研究,对如何将覆盖算法应用于多示例学习进行了探讨,根据不同的解决思路,给出几种多示例学习覆盖算法,算法效果达到大多数同类方法的水平。多示例多标记学习结合了多示例学习和多标记学习两种问题,是分类问题中的最一般情况,能够描述输入空间和输出空间中所具有的歧义性。本文探讨了如何将覆盖算法与其它方法相结合来解决该问题的思路,并给出初步的解决方案。在本文的研究工作中,进行了如下创新:(1)将覆盖算法应用于文本分类和垃圾邮件过滤等实际分类问题,并针对具体应用的特点分别提出不同的调整策略,提高分类器的性能。(2)对核覆盖算法中如何确定领域半径给出新的策略;对拒识点引入新的隶属度函数;对隶属度函数的物理意义给予明确的解释;将核覆盖算法加强为模糊核覆盖算法;结合模糊核覆盖算法给出几种覆盖精简的方法。(3)将覆盖算法推广到多标记学习,结合样本集分解和算法改造两种策略,提出多标记覆盖算法,其性能达到同类算法的先进水平;以自训练策略为指导,提出一种半监督学习方式下的多标记覆盖算法。(4)将覆盖算法推广到多示例学习,提出几种不同思路的多示例覆盖算法;针对多示例多标记学习给出初步的解决思路。

【Abstract】 Machine learning is a subject of acquiring knowledge and rules from known material to build a forecasting model for unseen problems. It simulates human being’s learning behavior and can improve on itself through continuous learning. After years of research, many outstanding learning methods, such as Support Vector Machine, Decision Tree and Neural Networks have been proposed, and applied to numerous machine learning areas. Chinese scholars have done substantial research work on covering-based learning methods, among which the covering algorithm based on constructive learning proposed by Zhang Ling and Zhang Bo is a good representative.Covering algorithm can construct neural networks based on samples’ own characteristics and overcomes some general drawbacks of traditional neural networks, like learning is too slow, and the structure of network is hard to determine. Covering algorithm is very straightforward and can effectively handle multi-category classification and large-scale data, and performs well in many real applications. Many researches have been done to improve on this method and apply it to different domain problems. Current work focuses on single-instance single-label problems and can not solve some new learning questions. This dissertation extends covering algorithm in the following ways:(1) It does comprehensive research over covering algorithm and applies it to real classification problems.This dissertation researches the basic learning model of covering algorithm and the recent progress in theory and application comprehensively. It applies covering algorithm to text categorization and spam-filtering. And different strategies have been proposed according to specific-matters. In text categorization, dimension regulation is introduced to make different text categories get evenly represented in the feature vector which enhances the precision. In spam-filtering, extra information of every email is combined with body text to create the compounded feature to improve the accuracy. It also discusses how to minimize the risk of filtering out regular emails.(2) It analyzes kernel covering algorithm and extends it to fuzzy kernel covering algorithm.Support vector machine maps samples to high dimension space to construct optimal classification space and achieves excellent performance. Kernel covering algorithm utilizes kernel function and improves the accuracy effectively. But there are still some drawbacks. This dissertation analyzes the influence of proximity principle used to judge rejection points on classifier’s effect. FKCA, i.e. fuzzy kernel covering algorithm, is proposed to improve the performance of classifier. The main improvement of FKCA is the change of radius selection and introduction of membership function. The physical explanation of the membership function is also discussed. A couple of reduction methods are introduced to improve on classifier which keep the number of covering down. Experiments show that the performance of these methods is effective.(3) It studies multi-label learning covering algorithm.In classic machine learning, each sample belongs to a single category, i.e., one label. But in real world, a sample can belong to multi categories, for instance, the text categorization and scene classification. This dissertation researches the decomposing of sample set and algorithm improvement, and explores applying covering algorithm to multi-label learning. Experiments show that multi-label covering algorithm performs at par with other multi-label learning algorithms and has the advantage of lower time/space cost. Since more efforts are required to label the multi-label training data, generally much data in training set is not labeled. To overcome this weakness, we adopt semi-supervised learning to improve the accuracy and it works well.(4) It discusses how to extend covering algorithm to multi-instance learning.Multi-instance learning is different from traditional supervised learning, unsupervised learning and reinforcement learning. It originates from predicting the molecules’ activity and is regarded as the fourth learning framework. The learning object is the bag consisting of multiple instances. The labels of bags are known while labels of instances are unknown. And bag’s label is determined by instances. Multi-instance learning is even harder than supervised learning with noise. This dissertation explores applying covering algorithm to multi-instance learning and proposes several algorithms which have comparable performance. We also discuss how to combine covering algorithm and other methods to solve the multi-instance multi-label learning and present the initial solution.The innovations of this dissertation are as follows:(1) Covering algorithm is applied to text categorization and spam filtering and different strategies are applied to improve the overall performance.(2) Presents the new method of determining the covering’s radius; Introduces the new membership function for rejection samples; Gives physical explanation for membership function; Extends kernel covering algorithm to fuzzy covering algorithm; Presents several reduction methods.(3) Extends covering algorithm to multi-label learning, and new algorithm is proposed. The performance of MICA is at par with other well-known works, a multi-label covering algorithm based on semi-supervised learning is proposed.(4) Extends covering algorithm to multi-instance learning and proposes several methods. And an initial solution to multi-instance multi-label learning problem is provided.

  • 【网络出版投稿人】 安徽大学
  • 【网络出版年期】2010年 10期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络