
领域自适应学习算法及其应用研究

A Study on Domain Adaptation Algorithm and Its Application

【Author】 Xu Min (许敏)

【Supervisor】 Wang Shitong (王士同)

【Author Information】 Jiangnan University (江南大学), Light Industry Information Technology and Engineering, 2014, PhD dissertation

【Abstract】 Traditional machine learning assumes that the training domain and the test domain are independently and identically distributed, so the model obtained from the training set is applied directly to the test set. In practice this assumption does not always hold: when the distributions of the training and test domains differ, the performance of traditional machine learning degrades sharply. Domain adaptation learning was therefore proposed. Its goal is to build a bridge between domains and improve predictive performance on the test domain, and it is widely used to solve real-world machine learning problems such as classification, regression and probability density estimation. Many researchers at home and abroad have studied domain adaptation in depth and obtained important results that are widely applied in practice, but many problems still require further exploration. This dissertation studies domain adaptation learning from four aspects: probability density estimation, support vector domain description, classification and regression. The main contents are as follows.

1. Domain adaptation learning based on the minimum enclosing ball (MEB). Within the same application area, the data collected at different times, at different places or by different devices are not necessarily complete. To address the problem of transferring knowledge from the source domain to the target domain, and starting from the fact that support vector domain description, classification and regression can all be equivalently formulated as center-constrained minimum enclosing ball problems, a theorem is proved for the first time stating that the difference between the probability densities of two similar domains can be expressed by the centers of the two domains' minimum enclosing balls, and that its upper bound is independent of the radii. Based on this theorem, a novel domain adaptation algorithm is proposed: the mathematical model of each algorithm is first converted into its equivalent minimum enclosing ball formulation, and the center of the source-domain ball is then used to calibrate the center of the target-domain ball, thereby improving the performance of learning in the target domain. Because only the center point, i.e. the source-domain knowledge, is transferred, the algorithm protects the privacy of the source-domain data, and the proof that the new algorithm remains equivalent to a center-constrained minimum enclosing ball allows core-set techniques to be used on large-scale data sets. Experimental results show that this domain adaptation algorithm compensates for missing data in the target domain and greatly improves performance.

2. A transfer learning algorithm between domains based on SVM. When a new domain related to an existing one appears, labeling samples of the new domain may be expensive, while discarding all the old-domain data would be wasteful. A new transfer learning algorithm, TL-SVM, is therefore proposed. Its main idea is that an SVM classifier is determined by (w, b); if two domains are related, the w values of the two domains' classifiers should be close. By training on a small number of labeled target-domain samples while learning the source-domain knowledge w_s, a high-quality classification model is built for the target domain, realizing knowledge transfer between domains (see the sketch after this abstract). The method inherits the advantages of the maximum-margin SVM based on empirical risk minimization and remedies the inability of the traditional SVM to transfer knowledge. These results are further applied to the L2 kernel classifier, which is based on the difference of density (DOD) idea. The L2 kernel classifier has good classification performance and sparsity, but its assumption that the training and test domains are independently and identically distributed limits its range of application. Given that the L2 kernel classifier is mathematically equivalent to a modified SVM, knowledge transfer is carried out through this equivalent modified SVM, yielding an L2 kernel classifier with cross-domain transfer learning ability. The new classifier retains the good classification performance of the L2 kernel classifier and handles the inconsistency between the distributions of the training set and future test sets caused by slowly changing data or by training data collected under specific constraints.

3. Domain adaptation learning based on reduced set density estimation (RSDE). RSDE is a kernel-based density estimator that expresses the density estimate as a linear combination of only a small fraction of the data samples; compared with the traditional Parzen window estimator it greatly reduces the computational complexity while condensing the data, but it requires the training set and the test set to be independently and identically distributed. This dissertation proposes a novel RSDE-based domain-adaptive density estimation method, A-RSDE: by learning the source-domain (training-domain) density p(x; θ1), the target-domain (test-domain) density estimate q(x; θ2) is made to best approximate the true density q(x) while also best approximating the source-domain density p(x; θ1), achieving domain adaptation; a fast core-set algorithm based on the approximate minimum enclosing ball is used to solve A-RSDE, so that it can be applied to density estimation on large data sets. The densities above can all be regarded as estimates in a linear combination space of densities. On this basis, the concept of a linear combination space of densities is introduced, and it is shown that a density in this space can be approximated, under the integrated squared error (ISE) criterion, by a linear combination of Gaussian basis functions; an approximation framework for this space is then proposed. The advantages of the framework are that it estimates the linear combination of densities directly without estimating the density of each domain in turn and achieves better accuracy than traditional density estimation methods; the amount of data involved in the computation is l, which is far smaller than the total number of samples, so the framework suits large-scale data sets; it can be applied to classification, data condensation, testing independence between random variables, variable selection in regression models, conditional density estimation and so on; and if the linear combination space is made to approximate a known space, it can be used to estimate the similarity between the source and target domains, which suits multi-source domain adaptation learning.
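Point 2 of the abstract above describes TL-SVM as a maximum-margin SVM whose regularizer pulls the target weight vector toward the source-domain weights w_s. The following Python sketch illustrates that idea only; the sub-gradient solver, the function names, and the exact objective 0.5*||w - w_s||^2 + C*Σ hinge(y_i(w·x_i + b)) are our own assumptions for illustration, not the dissertation's formulation or implementation.

```python
import numpy as np

def tl_svm_fit(X, y, w_s, C=1.0, lr=0.01, epochs=500):
    """Sketch of the TL-SVM idea: regularize the target weights toward the
    source-domain weights w_s instead of toward zero, so that a few labelled
    target samples plus the transferred knowledge w_s define the classifier.
    Assumed objective: 0.5*||w - w_s||^2 + C * sum_i hinge(y_i * (w.x_i + b)).
    X: (n, d) target-domain samples, y: labels in {-1, +1}, w_s: (d,) source weights.
    """
    w, b = w_s.astype(float).copy(), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        v = margins < 1                                    # margin violators (hinge term is active)
        grad_w = (w - w_s) - C * (y[v, None] * X[v]).sum(axis=0)
        grad_b = -C * y[v].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

With w_s set to the zero vector, the update reduces to an ordinary linear SVM trained by sub-gradient descent, which makes the role of the transferred term explicit.
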

【Abstract】 Traditional machine learning algorithms assume that the training data and the test data are drawn from the same distribution, so models trained purely on the training data are applied directly to the test data. Unfortunately, in practice this assumption is often too strong. When the instances in the two domains are drawn from different distributions, traditional machine learning cannot achieve high performance on the new domain. Domain adaptation algorithms are therefore designed to build a bridge between the training data and the test data in order to improve prediction performance on the test domain, and they are widely used to solve real-world classification, regression and probability density estimation problems. Many experts and scholars have conducted in-depth studies of domain adaptation, obtained a number of important research results and applied them widely in practice, yet many issues still require further exploration. This dissertation addresses domain adaptation from four aspects: probability density estimation, support vector domain description, classification and regression. The main contents are as follows.

1. A novel domain adaptation algorithm based on the minimum enclosing ball. For many machine learning problems, incomplete data collection leads to low prediction performance, which gives rise to the need for domain adaptation. Many kernel methods, such as support vector domain description (SVDD), the support vector machine (SVM) and support vector regression (SVR), can be equivalently formulated as minimum enclosing ball (MEB) or center-constrained minimum enclosing ball (CC-MEB) problems in computational geometry. To solve the problem of how to transfer knowledge effectively between the two domains, a new theorem is established showing that the difference between the probability distributions of two similar domains depends only on the centers of the two domains' minimum enclosing balls. Based on these results, fast adaptive algorithms are proposed for large-scale domain adaptation. They use the center of the source domain's MEB or CC-MEB to calibrate the center of the target domain's ball, thereby improving the performance of machine learning algorithms on the target domain. Experimental results show that the proposed domain adaptation algorithms can compensate for missing data and greatly improve performance on the target domain.
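As a rough illustration of the center-calibration idea in point 1, the sketch below approximates each domain's MEB center with the Badoiu-Clarkson core-set iteration and then pulls the target center toward the source center. It works in the input space and uses a simple convex-combination update with an assumed mixing weight lam; the dissertation's kernelized CC-MEB formulation and its actual calibration rule are not reproduced here.

```python
import numpy as np

def meb_center(X, n_iter=100):
    """Approximate minimum-enclosing-ball center of the rows of X using the
    Badoiu-Clarkson core-set iteration (shown in input space for simplicity)."""
    c = X[0].astype(float).copy()
    for k in range(1, n_iter + 1):
        far = np.argmax(np.linalg.norm(X - c, axis=1))  # farthest point from the current center
        c += (X[far] - c) / (k + 1)                     # step toward it with a shrinking step size
    return c

def calibrate_center(X_target, X_source, lam=0.5):
    """Hypothetical calibration rule: mix the (possibly incomplete) target-domain
    center with the source-domain center; lam is an assumed trade-off weight."""
    c_t, c_s = meb_center(X_target), meb_center(X_source)
    return (1.0 - lam) * c_t + lam * c_s
```
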
2. A novel transfer learning algorithm based on SVM. When a task from a new but related domain arrives, labeling the new domain's samples is costly, while discarding all the old-domain data would be wasteful. A new algorithm, TL-SVM, is therefore proposed. Its main idea is that an SVM classifier is determined by (w, b); if two domains are related, the w values of their respective classifiers should be similar. A high-performance classification model can thus be built from a small number of labeled target-domain samples together with the source-domain knowledge w_s, accomplishing transfer learning between the two domains. The method inherits the advantages of the maximum-margin SVM based on empirical risk minimization and makes up for the inability of the traditional SVM to transfer knowledge. These results are further applied to the L2 kernel classifier, which is based on the difference of density (DOD). The L2 kernel classifier has good classification performance and sparsity, but the premise that the training and test domains are independently and identically distributed severely constrains its usefulness. To overcome this shortcoming, knowledge is transferred through the modified SVM to which the L2 kernel classifier is mathematically equivalent, and a new classifier named transfer learning L2 kernel classification (TL-L2KC) is proposed. It can handle the inconsistency between the training-set and test-set distributions caused by slowly changing data or by training sets collected under specific constraints, while inheriting the good performance of the L2 kernel classifier.

3. Domain adaptation based on reduced set density estimation (RSDE). RSDE provides a kernel-based density estimator that employs only a small percentage of the available data samples and is optimal in the L2 sense. It attains accuracy comparable to that of the full-sample Parzen density estimator with much lower computational cost, but it does not work well when the training set and the test set are not independently and identically distributed. To address this limitation, a novel algorithm, A-RSDE, is proposed for adaptive probability density estimation. It makes full use of the source-domain (training set) knowledge p(x; θ1) of the probability density so that the target-domain (test set) density estimate q(x; θ2) comes closer to the true density q(x). A fast core-set-based minimum enclosing ball (MEB) approximation algorithm is also introduced to solve A-RSDE efficiently. The RSDE and A-RSDE estimators can both be viewed as probability density estimation in a linear combination space of densities, for which an approximation framework based on a linear combination of Gaussian basis functions under the integrated squared error (ISE) criterion is developed. The framework has three advantages. First, it directly estimates the density of the linear combination space without estimating the density of each domain separately, and its approximation accuracy is at least comparable to, and often better than, that of traditional density estimation methods. Second, the amount of data involved in the computation is l, which is generally much smaller than the sample size, so it is well suited to large data sets. Third, it can be used to develop alternative approaches to classification, data condensation, testing the independence of random variables, conditional density estimation, and identifying the similarity between multiple source domains and a target domain; if the linear combination space of densities is used to approximate a known space, it can estimate how well the source domain approximates the target domain for multi-source domain adaptation learning.
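To make the A-RSDE idea concrete, the sketch below writes down one possible adaptive objective: the integrated squared error (ISE) between a reduced-set estimate q(x) = Σ_i γ_i K_h(x, x_i) and the full Parzen estimate on the target sample, plus a λ-weighted ISE to a source density represented as a kernel mixture with weights w_s. The closed-form Gaussian convolution makes every term computable from kernel matrices. The decomposition, the parameter lam, and the function names are illustrative assumptions; the dissertation solves its model through an equivalent core-set MEB problem, which is not shown here.

```python
import numpy as np

def gauss_kernel(A, B, h):
    """Pairwise isotropic Gaussian densities N(a; b, h^2 I) between rows of A and B."""
    d = A.shape[1]
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * h * h)) / ((2.0 * np.pi * h * h) ** (d / 2.0))

def ise(w1, X1, w2, X2, h):
    """Integrated squared error between two Gaussian mixtures with common
    bandwidth h; cross terms use the Gaussian convolution identity, i.e.
    kernels of bandwidth sqrt(2)*h evaluated at the component centers."""
    h2 = np.sqrt(2.0) * h
    return (w1 @ gauss_kernel(X1, X1, h2) @ w1
            - 2.0 * w1 @ gauss_kernel(X1, X2, h2) @ w2
            + w2 @ gauss_kernel(X2, X2, h2) @ w2)

def a_rsde_objective(gamma, X_t, w_s, X_s, h, lam=0.1):
    """Assumed adaptive objective: fit the reduced-set estimate on the target
    sample X_t to the Parzen estimate of X_t, while staying close to the
    source density given by weights w_s on the source sample X_s."""
    parzen_w = np.full(len(X_t), 1.0 / len(X_t))
    return ise(gamma, X_t, parzen_w, X_t, h) + lam * ise(gamma, X_t, w_s, X_s, h)
```

Minimizing such an objective over a sparse, non-negative gamma would yield a reduced-set estimate that balances fidelity to the target sample against closeness to the source density, which is the trade-off the abstract describes.
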

  • 【Online Publication Contributor】 Jiangnan University
  • 【Online Publication Year and Issue】 2014, Issue 12