
Object Classification Based on Semi-supervised Learning

【Author】 褚镇飞

【Supervisor】 杨小康

【Author Information】 Shanghai Jiao Tong University, Signal and Information Processing, 2010, Master's thesis

【Abstract】 Object classification is a fundamental problem in machine learning, concerned with classifying and recognizing text, image, and video data. When only a small amount of data is involved, traditional machine learning methods already perform well. However, as the volume of information grows exponentially, obtaining labels for such large amounts of data has become practically infeasible, leaving traditional methods inadequate for these problems. Semi-supervised learning addresses this situation: it takes the information in a small set of labeled data and extends it to the unlabeled data, bridging the severe quantity mismatch between labeled and unlabeled examples. This thesis studies a semi-supervised learning problem in which a small number of hard-to-obtain accurate labels coexist with a large number of easily obtained coarse labels, and examines the robustness of co-training, that is, how errors in the given initial labels affect co-training performance. Building on this robustness analysis, the thesis combines the information bottleneck principle with posterior-probability estimation and proposes a method that generates pseudo-labels through unsupervised learning. Compared with existing methods, this approach needs less label information and has lower computational complexity. To exploit the pseudo-labels, the thesis further proposes a pseudo-label-aided co-training method built on a re-ranking framework; compared with existing methods, it is more robust to errors in the initial labels and can still train well-performing classifiers when the initial labeled data contains many errors. Finally, the thesis gives a statistical analysis of pseudo-label-aided co-training, studies mathematically why the method improves the robustness of co-training, and discusses the theoretical similarity between naive Bayes classification and the information bottleneck method.
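For illustration only, the sketch below shows a minimal two-view co-training loop in the spirit of the standard Blum–Mitchell procedure, not the thesis's pseudo-label-aided re-ranking method. It assumes two feature views of the same samples, uses scikit-learn's MultinomialNB as the base learner for each view (echoing the naive Bayes discussion above), and the parameters `rounds` and `per_round` are hypothetical knobs controlling how many confident pseudo-labels are promoted from the unlabeled pool per iteration.

```python
# Minimal co-training sketch (illustrative; not the thesis's algorithm).
# Assumes non-negative count-style features, as required by MultinomialNB.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1_l, X2_l, y_l, X1_u, X2_u, rounds=10, per_round=5):
    """Grow the labeled set by letting two view-specific classifiers
    pseudo-label the unlabeled pool for each other."""
    X1_l, X2_l, y_l = X1_l.copy(), X2_l.copy(), y_l.copy()
    pool = np.arange(len(X1_u))                  # indices still unlabeled
    clf1, clf2 = MultinomialNB(), MultinomialNB()
    for _ in range(rounds):
        if len(pool) == 0:
            break
        clf1.fit(X1_l, y_l)
        clf2.fit(X2_l, y_l)
        for clf in (clf1, clf2):
            if len(pool) == 0:
                break
            X_view = X1_u if clf is clf1 else X2_u
            proba = clf.predict_proba(X_view[pool])
            top = np.argsort(proba.max(axis=1))[-per_round:]   # most confident
            picked = pool[top]
            pseudo = clf.classes_[proba[top].argmax(axis=1)]   # pseudo-labels
            # Both views share the same growing labeled set.
            X1_l = np.vstack([X1_l, X1_u[picked]])
            X2_l = np.vstack([X2_l, X2_u[picked]])
            y_l = np.concatenate([y_l, pseudo])
            pool = np.setdiff1d(pool, picked)
    clf1.fit(X1_l, y_l)
    clf2.fit(X2_l, y_l)
    return clf1, clf2
```

In the thesis's setting, the pseudo-labels would instead come from unsupervised information bottleneck clustering combined with posterior probabilities, and a re-ranking step would guard against errors in the initial labels; the loop above only shows the generic confidence-based label exchange between two views.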

  • 【CLC Number】 TP181; TP391.4
  • 【Cited By】 2
  • 【Downloads】 213