
Kernel Based Learning Algorithm and Application

【Author】 Jian Ling (渐令)

【Advisor】 Xia Zunquan (夏尊铨)

【Author information】 Dalian University of Technology, Operations Research and Cybernetics, 2012, PhD

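【Abstract (translated from the Chinese)】 The kernel trick is a powerful tool for solving nonlinear problems, and kernel-based learning theory and algorithms are an active research area in machine learning. This thesis studies the design of kernel learning algorithms and several of their applications to the blast furnace ironmaking process and to protein identification.

On algorithm design, a binary coding support vector machine (SVM) algorithm is designed that converts an N-class problem into ⌈log₂ N⌉ binary subproblems. The traditional one-against-one method needs O(N²) sub-classifiers and the one-against-all method needs O(N), so the binary coding SVM uses its sub-classifiers far more efficiently. The multiple kernel learning (MKL) problem for the least squares SVM (LS-SVM) is converted into a semidefinite programming (SDP) problem, so that the kernel coefficients and the regularization parameter are optimized within a unified MKL framework, which advances the automatic selection of the kernel and the regularization parameter. Compared with SVM MKL, LS-SVM MKL greatly reduces computational complexity while maintaining accuracy. Numerical experiments on UCI benchmark data sets verify the effectiveness of the LS-SVM MKL algorithm.

Furnace temperature prediction and trend classification for the blast furnace ironmaking process is one of the application problems studied. Taking the silicon content in hot metal ([Si]), an important indicator of the thermal state inside the furnace, as the object of study, a sliding window (SW) mechanism is introduced into the smooth support vector regression (SSVR) model to build the SW-SSVR model. By continually updating the learning samples, the model tracks changes in the system promptly. Applied to the numerical prediction of [Si], the SW-SSVR model shows a high prediction success rate and a short computation time in numerical experiments, which makes it suitable for online use. The [Si] trend prediction problem is converted into a 4-class problem (sharp rise, slight rise, slight fall, sharp fall), and the binary coding SVM is applied to [Si] trend prediction for two blast furnaces in China. This model lets blast furnace foremen decide the magnitude of the control action as well as the direction of furnace temperature control. MKL is used to integrate the heterogeneous data arising in the blast furnace process, which improves prediction accuracy, and MKL is applied to feature reduction over the collected blast furnace variables, which strengthens the interpretability of the black-box model.

Peptide identification based on tandem mass spectrometry (MS/MS) is the other application problem studied. Proteomics is a frontier topic of the post-genomic era, and high-throughput experimental techniques such as tandem mass spectrometry and protein chips have greatly advanced it. Identifying peptide sequences by tandem mass spectrometry, and from them proteins, is a common method in current proteomics research. Because protein samples and biological experiments are complex, mass spectra are noisy and the peptide matches returned by database search contain many negative identifications. Many algorithms have been proposed to improve peptide identification, but they still cannot perfectly separate positive from negative identifications. This thesis therefore applies the De-Noise algorithm, based on MKL SVM, which converts MS/MS peptide identification into a special classification problem: the positive-class samples are heavily contaminated and unreliable, while the negative-class samples are fully reliable. De-Noise first performs denoising based on distance relations, then trains an SVM classifier on the denoised sample set and carries out two rounds of refinement, and finally integrates the enzymatic cleavage information of the peptides to give the identification result. On the SEQUEST search results of three protein data sets, Yeast (LCQ mass spectrometer), UPS1 (LTQ mass spectrometer), and Ta108 (Orbit mass spectrometer), the peptide identifications of De-Noise were compared with those of PeptideProphet and Percolator. At a given expected false discovery rate (FDR), De-Noise significantly improved the sensitivity and specificity of peptide identification.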
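As an illustration of the binary coding idea summarized above, the following is a minimal Python sketch, not the thesis's implementation. It compares the sub-classifier counts of the three schemes and shows how class indices map to bit codes, with one binary sub-problem per bit. The function names, the assignment of codes to the four [Si] trend classes, the 0.1 threshold, and the two threshold rules standing in for trained SVMs are all assumptions made for the example.

```python
def code_length(n_classes):
    """Number of binary sub-classifiers, ceil(log2(N)), using integer arithmetic."""
    return max(1, (n_classes - 1).bit_length())


def classifier_counts(n_classes):
    """Sub-classifier counts for the three multiclass schemes."""
    return {
        "one_against_one": n_classes * (n_classes - 1) // 2,
        "one_against_all": n_classes,
        "binary_coding": code_length(n_classes),
    }


def encode(label, n_bits):
    """Class index -> list of bits, most significant bit first."""
    return [(label >> k) & 1 for k in reversed(range(n_bits))]


def decode(bits):
    """List of bits (most significant first) -> class index."""
    label = 0
    for bit in bits:
        label = (label << 1) | bit
    return label


def bit_targets(labels, n_classes):
    """For each bit position, the 0/1 training targets of one binary sub-problem."""
    n_bits = code_length(n_classes)
    codes = [encode(y, n_bits) for y in labels]
    return [[code[k] for code in codes] for k in range(n_bits)]


def predict(x, bit_classifiers):
    """Run one binary classifier per bit and decode the resulting code word."""
    return decode([clf(x) for clf in bit_classifiers])


# Assumed code assignment: first bit = direction, second bit = magnitude.
TRENDS = ["slight descent", "sharp descent", "slight ascent", "sharp ascent"]


# Hypothetical stand-ins for two trained binary SVMs, acting on a [Si] change.
def direction_bit(delta):
    return 1 if delta > 0 else 0


def magnitude_bit(delta):
    return 1 if abs(delta) > 0.1 else 0  # 0.1 is an arbitrary threshold


print(classifier_counts(4))
print(TRENDS[predict(-0.3, [direction_bit, magnitude_bit])])  # sharp descent
```

With four classes the scheme needs two binary classifiers, against six for one-against-one and four for one-against-all. Under the assumed code assignment the two bits can be read separately as control direction and control magnitude.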

【Abstract】 The kernel trick is a powerful tool for solving nonlinear problems, and kernel-based learning theory and algorithms are a research focus in machine learning. This thesis focuses on the design of kernel-based learning algorithms and their application to the blast furnace ironmaking process and to protein identification.

The work on algorithm design has two parts. First, a novel binary coding SVM algorithm is proposed that treats an N-class classification task as multiple binary classification problems and requires only ⌈log₂ N⌉ binary classifiers, far fewer than the O(N²) of the conventional one-against-one method and the O(N) of the one-against-all method. Second, multiple kernel learning (MKL) for LS-SVM is formulated as a semidefinite program to obtain the globally optimal solution, and the regularization parameter is optimized together with the kernel coefficients in a unified framework, which leads to an automatic process for model selection. The computational complexity of LS-SVM MKL is much lower than that of SVM MKL at comparable precision, which makes LS-SVM MKL suitable for large-scale data sets. Extensive validation experiments are performed.

As one application problem, this thesis studies prediction and trend-classification models of temperature in the blast furnace (BF) ironmaking process.
Focusing on the silicon content in hot metal ([Si]), a chief indicator of furnace temperature, this thesis exploits the nonlinear approximation ability of SVM and constructs data-based models for [Si] prediction. The sliding windows scheme is incorporated into smooth support vector regression to construct the sliding windows smooth support vector regression (SW-SSVR) model, which updates the learning samples and tracks state changes of the studied system in time. Applied to the [Si] prediction problem, the SW-SSVR model performs well, with a high percentage of successful trend predictions, competitive computational speed, and timely online service. Through the proposed binary coding SVM algorithm, a four-class problem (sharp descent, slight descent, slight ascent, and sharp ascent of [Si]) is reduced to two binary classification problems. The four-class results can in turn guide blast furnace operators in determining the control span together with the control direction in advance. For the prediction of the [Si] change trend, MKL is employed to integrate heterogeneous data, which improves prediction accuracy. MKL is further used for feature reduction, which helps explain which variables matter in black-box modeling.

Peptide identification by tandem mass spectrometry (MS/MS) is the other application problem of this thesis. Proteomics has become a hot subject in the post-genomic era, and peptide identification by MS/MS is widely used for high-throughput identification of proteins in complex biological samples. A flexible algorithm based on MKL SVM, named De-Noise, is proposed to transform the peptide identification problem into a special binary classification problem.
The De-Noise algorithm starts with a pre-processing step in which some of the noisy target peptide-spectrum matches (PSMs) are eliminated from the target PSM data set to provide a more reliable training set. Noisy PSMs are identified by computing their distance to the centroid of the decoy PSMs. Once the noisy target PSMs have been discarded, two rounds of refinement are carried out to distinguish correct PSMs from incorrect ones. Finally, proteolytic information is integrated to validate the PSMs.

The De-Noise algorithm is tested on three data sets from multiple mass spectrometry platforms, Yeast (LCQ), UPS1 (LTQ), and Ta108 (Orbit), and compared with PeptideProphet and Percolator. De-Noise is shown to be superior in sensitivity and specificity on all data sets searched. Thus, the De-Noise algorithm can validate database search results effectively.
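The pre-processing step described above can be sketched roughly as follows. This is a hedged illustration only: the abstract does not state the distance measure or the cut-off rule, so the Euclidean distance, the `quantile` parameter, the function names, and the two-feature toy PSM vectors are all assumptions. The SVM training, the two refinement rounds, and the proteolytic step are not shown.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]


def denoise_targets(targets, decoys, quantile=0.5):
    """Split target PSMs into (kept, removed) by distance to the decoy centroid.

    Assumed rule: a target PSM is treated as noisy when it lies no farther
    from the decoy centroid than the chosen quantile of the decoys' own
    distances to that centroid.
    """
    c = centroid(decoys)
    decoy_dists = sorted(math.dist(d, c) for d in decoys)
    idx = min(len(decoy_dists) - 1, int(quantile * len(decoy_dists)))
    radius = decoy_dists[idx]
    kept = [t for t in targets if math.dist(t, c) > radius]
    removed = [t for t in targets if math.dist(t, c) <= radius]
    return kept, removed


# Toy data: each PSM is a pair of hypothetical score features.
decoys = [(0, 0), (1, 0), (0, 1), (1, 1)]
targets = [(0.6, 0.4), (5, 5), (0.4, 0.7), (4, 6)]

kept, removed = denoise_targets(targets, decoys)
print(kept)     # [(5, 5), (4, 6)]
print(removed)  # [(0.6, 0.4), (0.4, 0.7)]
```

In this toy example the two target PSMs that sit inside the decoy cloud are discarded, and the two far from it are kept as more trustworthy positive training samples.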
