节点文献

稀疏优化在机器学习中的若干应用

Some Applications of Sparse Optimization in Machine Learning

【作者】 梁锡军

【导师】 张宏伟;

【作者基本信息】 大连理工大学 , 运筹学与控制论, 2013, 博士

【摘要】 近年来,利用解的稀疏性和其他内在结构成为众多计算和工程领域中共同关注的问题.稀疏的内含不仅是指“只有很少的非零分量”,它蕴含着“具有一种简单结构”.本文对机器学习中不同问题的稀疏结构进行建模,并在必要时改进经典的稀疏优化算法进行求解.论文的主要工作可概括如下:1.第2章给出了本文在解决不同的机器学习问题中所提出的稀疏优化模型及算法.所提出的稀疏优化模型有同样的抽象结构,即在一个具有某种简单或特定结构的假设空间上极小化某个损失泛函.本文中给出的盒子约束的Lasso模型及块PCA模型均具有这一结构.该章给出了求解盒子约束的Lasso模型的同伦算法及求解块PCA模型的Splitting算法.2.第3章研究了求解盒子约束的Lasso模型的同伦算法的收敛性并检验了该算法的数值性能.该章的工作指出同伦算法收敛性不是显然成立.在无退化指标假设和其它较弱的条件下,该章证明了同伦算法具有有限终止性.另外,该章讨论了退化和循环的问题.当前已有众多算法可求解该模型,但数值实验证明同伦算法具有特别的优势:适于最优解非常稀疏的问题及需要计算整条正则化路径的情形.这是第4章协同过滤数据可预测性问题的计算中所采用的关键技术.3.第4章研究了协同过滤问题中评分数据的可预测性问题.当前协同过滤方面的大部分工作主要研究算法性能的改进.该章指出,受评分数据自身的限制,评分矩阵中有一部分未知评分是难于给出准确预测的.第4章提出了一个新的度量——相关性,以度量用户在某个商品上的评分能被准确预测的可能性.一个用户一商品对的相关性由相关的用户和商品构成的社区所确定.作为相关性度量的应用,提出了基于数据的组合方法(DOC)以应用于推荐系统.4.第5章研究从时间序列基因表达数据中推断基因正则化网络(GRN).由于计算复杂度较大,大部分GRN重建方法仅限于推断较低连通性的单个网络.该章提出了网络和社区识别方法,结合社区结构信息,从基因表达数据中推断多个子网络.其中的块PCA模型,通过第2章给出的Splitting算法,可有效求解网络中的社区结构.5.第6章研究了作为蛋白质鉴别关键步骤的肽段识别问题.序列数据库搜索是当前肽段识别的主流方法.但搜索引擎给出的大量的匹配是不正确的.现有方法大多基于半监督或监督学习框架,充分利用了诱骗PSM的样本及标签信息,但目标PSM样本点自身信息没有被充分利用.该章提出了一个称为FC-Ranker的新的评分方法,给每个目标PSM赋予一个非负权重,反映其匹配正确的可能性.特别地,FC-Ranker通过模糊支持向量机分类模型和所提出的模糊Silhouette指标迭代更新该权重.FC-Ranker在ROC指标、相同FDR水平下鉴别目标PSM的数目等方面的性能表现超过了主流后验数据库搜索方法.

【Abstract】 In recent years, exploring the sparsity of the solutions and other special structure has become a common issue in many computational and engineering areas. Sparsity is much broader than "having few nonzero elements". It roughly means "having a simple structure". In this thesis, we utilize the sparsity structures to construct models for various practical problems in machine learning and improve the corresponding classical algorithms to solve them when necessary. The main work can be summarized as follows.1. In Chapter2, we present the proposed sparse optimization models and algorithms for various practical problems in machine learning. All the models have a common abstract struc-ture, i.e., minimizing a loss functional over a hypothesis space with a certain simple or special structure. The proposed box-constrained Lasso model and Block PCA model possess this struc-ture. We present the Splitting algorithm for solving the Block PCA model and the Homotopy algorithm for solving the Lasso model with box constraints in this chapter.2. In Chapter3we analyze the convergence of the Homotopy algorithm for solving the box-constrained Lasso model and evaluate its numerical performance. It is shown that the con-vergence is not as apparent as it looks. Under a non-degeneration index assumption and some other weak conditions, it is proved that Homotopy terminates and finds an optimal solution after finite iterations. Moreover, the issue of degeneration and recursion is discussed. Although there exist many algorithms for solving this model, the numerical experiments have shown the special advantage of the proposed algorithm, and illustrated that the Homotopy algorithm is particularly useful when the underlying solution is quite sparse or the entire regularization path is required. This is the key technique for the calculation of measuring the prediction capability of data for collaborative filtering in Chapter4.3. Chapter4is devoted to the prediction capability of data for collaborative filtering. Most work of collaborative filtering has focused on improving the performance of the algorithms. In this chapter we point out that part of unknown ratings could not be accurately predicted, limited by the information of known ratings. We propose a metric,"relatedness", to measure the potential that a user’s preference on an item can be accurately predicted. The relatedness of a user-item pair is determined by a community which consists of users and items related to the pair. As an application of the relatedness metric, we develop the Data-Oriented Combination (DOC) method for recommender systems. 4. Chapter5is devoted to the problem of identifying gene regulatory network (GRN) from time course gene expression data. Due to the computational complexity, most approaches for GRN reconstruction can only identify a single network of low connectivity for a given set of genes. We propose the network and community identification (NCI) method for identifying multiple sub-networks from gene expression data by incorporating community structure infor-mation into GRN inference. The Block PCA model can search communities effeciently in GRNs by employing the Splitting algorithm proposed in Chapter2.5. Chapter6is devoted to the peptide identification problem which is the key procedure for protein identification. The sequence database searching has been the dominant method for peptide identification. However, lots of the peptide spectrum matches (PSMs) output by the searching engine are not correct. Based on supervised or semi-supervised learning, most of the current methods make use of the decoy PSMs fully, but the information of target PSM samples themselves has not been explored entirely. A novel scoring method named FC-Ranker is de-veloped by assigning a nonnegative weight to each target PSM based on the possibility of its correctness. Particularly, the weights of PSMs are updated by using a fuzzy SVM classification model and a fuzzy Silhouette index iteratively. In terms of ROC and the number of identified PSMs under common FDR level, FC-Ranker outperforms other popular post-database search algorithms.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络