Research of Sampling Strategy in Active Learning Algorithms

【Author】 Wu Weining (吴伟宁)

【Advisor】 Guo Maozu (郭茂祖)

【Author information】 Harbin Institute of Technology, Artificial Intelligence and Information Processing, 2013, PhD

【Abstract】 In application areas such as text mining, speech recognition, bioinformatics data mining, and visual object classification, a practical problem is that unlabeled examples are plentiful and easy to obtain, while labeled examples are scarce and hard to obtain. As an important research direction in machine learning, active learning can use labeled and unlabeled examples together to build a high-accuracy classification model. This thesis therefore studies sampling strategies in active learning in depth and applies the proposed strategies to visual object classification.

Semantic understanding of visual objects is an important problem in computer vision. The rapid development of web technology makes it possible to collect large numbers of images in a short time, but classifying the visual objects contained in these unsupervised or weakly supervised images is a difficult and challenging task. A growing number of researchers work on machine learning algorithms that build a model on a labeled image set and then use the learned knowledge to decide which category a visual object belongs to. This approach usually requires many labeled images for training, and annotating those images precisely costs considerable labor and resources. It is therefore important to make full use of annotator resources and reduce manual labeling cost, with the aim of building an accurate model at the lowest possible labeling cost.

Active learning offers one way to collect and use image annotations more effectively. The algorithm first selects a small number of images at random and obtains their annotations. Then, through interaction between the model and the annotator, it uses the semantic information in the labeled images collected so far to select the unlabeled images that are most helpful for training and submits them to the annotator for labeling. Letting the learning system pose queries to the annotator reduces the annotator's workload; this makes full use of scarce annotator resources and also transfers human knowledge into the learning system more effectively. Developing efficient active learning algorithms therefore has both theoretical value and practical significance for visual object classification and retrieval.

Some active learning algorithms have already been used to reduce labeling cost in object classification and retrieval, with good results. However, these algorithms often rest on idealized assumptions, so they do not adapt well to learning tasks with noise or with large amounts of unlabeled image data. This thesis takes active learning as its subject and, building on existing sampling strategies and statistical theory, develops example selection algorithms that remain effective under noise or with large amounts of unlabeled image data. The aim is to obtain an accurate classification model at the lowest possible labeling and time cost and, on that basis, to build active learning models for visual object classification and retrieval in practice. The main contributions are as follows:

(1) A weighted example selection algorithm based on model risk. Active learning algorithms commonly make the idealized assumption that training data and test data have the same distribution. The proposed algorithm addresses the drop in sampling performance caused by distribution differences, and the resulting poor training of the classification model under a given labeling budget. It assigns a weight to each example, estimates that weight from the expected error of the model risk on the training data and the unlabeled data, and selects the examples most helpful for training according to the weight. In comparisons with similar methods, the experimental results show that the accuracy of the classification model is effectively improved.

(2) A training set construction method that selects examples in batch mode. Visual object databases contain many objects but few of any one category, which leaves positive and negative examples heavily imbalanced. The proposed method aims to overcome the adverse effect of the many negative examples and to improve classification accuracy under the same labeling cost. It uses the risk of the classification model, constructs a training distribution by minimizing the variance of the model risk, and selects a batch of examples according to that distribution to build the training set. In comparisons with similar methods, the experimental results show that the method needs less labeling cost to reach the same classification accuracy.

(3) A multi-annotator probabilistic model of active learning. Active learning algorithms commonly make the idealized assumption that a single annotator provides error-free labels. The proposed probabilistic model handles multiple annotators under annotation noise and aims to reduce the effect of annotation quality on active learning. It reduces labeling cost and improves model accuracy at the same time by choosing highly accurate annotators to label the selected example and by estimating the example's correct label. The experimental results show that, compared with similar methods, the model effectively reduces the impact of annotation noise and improves the performance of the classification model.

(4) A hash-based example selection algorithm for active learning. When the amount of unlabeled data is large, selecting examples takes a long time. The proposed algorithm aims to return the selected examples quickly and to reduce the time needed to obtain a classification model. It uses hashing to select the important weights in the parameter vector of the classification model, uses them to estimate the approximate distance between each unlabeled example and the classification boundary, and selects examples for training according to that distance. In comparisons with similar methods, the experimental results show that the algorithm effectively reduces the time spent on training.
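
The general pool-based procedure described in the abstract (label a small random seed set, then repeatedly query the unlabeled example judged most helpful) can be illustrated with a minimal sketch. The sketch makes several assumptions that are not taken from the thesis: it uses plain margin-based uncertainty sampling as a stand-in for the proposed strategies, a small logistic-regression classifier trained by gradient descent, and synthetic two-class data. The names `train_logreg`, `margin`, `active_learning`, and `oracle` are hypothetical.

```python
import numpy as np

def train_logreg(X, y, steps=200, lr=0.5):
    """Fit logistic-regression weights (last entry is the bias) by gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w -= lr * (Xb.T @ (p - y)) / len(y)
    return w

def margin(X, w):
    """Absolute decision value; a small value means the example is near the boundary."""
    return np.abs(X @ w[:-1] + w[-1])

def active_learning(X_pool, oracle, n_init=4, n_queries=10, seed=0):
    """Pool-based loop: random seed set, then query the example closest to the boundary."""
    rng = np.random.default_rng(seed)
    labeled = [int(i) for i in rng.choice(len(X_pool), size=n_init, replace=False)]
    labels = {i: oracle(i) for i in labeled}  # oracle stands in for the human annotator
    for _ in range(n_queries):
        w = train_logreg(X_pool[labeled], np.array([labels[i] for i in labeled]))
        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
        query = int(unlabeled[np.argmin(margin(X_pool[unlabeled], w))])
        labels[query] = oracle(query)
        labeled.append(query)
    # Retrain once more so the returned model uses every collected label.
    w = train_logreg(X_pool[labeled], np.array([labels[i] for i in labeled]))
    return w, labeled

# Synthetic pool (an assumption for illustration): two Gaussian clusters.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(-2, 1, (100, 2)), data_rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

w, labeled = active_learning(X, oracle=lambda i: y[i])
print(len(labeled), "labels requested out of", len(X))
```

Only the selection line (`np.argmin(margin(...))`) would change if a different sampling strategy were plugged in; the surrounding loop of training, querying, and retraining stays the same.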
