基于组合特征的高阶因子分解机模型研究

Research on the High-Order Factorization Machine Based on Combined Features

【Author】 刘晨征

【Advisor】 曾庆田

【Author Information】 Shandong University of Science and Technology, Computer Science and Technology, 2019, Master's degree

【摘要】 因子分解机(Factorization Machine)是近几年被提出的,主要用于解决大规模稀疏数据中特征组合问题的算法,它是一种结合矩阵分解和支持向量机的机器学习算法。因子分解机对交叉项系数采用一种因子分解的方式,其在稀疏数据中也能很好的学到隐含数据中变量间的相互关系。组合特征是通过将单特征进行组合而形成的高阶特征,有助于表示数据中的非线性关系,可以表达比单特征更多的数据底层语义。本文立足于自定义特征组合,对面向分类和序数回归任务的因子分解机进行研究,具体成果如下:(1)基于频繁模式,提出一种面向分类的组合特征提取方法。首先,挖掘数据中有关类别的频繁模式,作为组合特征依据;其次,为了使提取的组合特征对类别区分有帮助,本文使用K-L散度度量频繁模式的类别区分能力;最后,给出了特征组合方式,利用最有区分能力的前m项频繁模式进行特征组合。实验结果表明,使用该方法提取的组合特征,对多数分类模型的效果都有提升。(2)针对序数回归问题,提出一种面向序数回归的组合特征提取方法。为了使提取的组合特征包含标签的序数信息,提出一种有序二元分解的方法,把序数回归有序分解为多个二元子问题。在每个二元子问题上,挖掘有关类别的频繁模式,并计算相关K-L散度。考虑到在不同子问题中,频繁模式K-L散度的不平衡性,提出一种循环选择频繁模式的方法,平衡选择区分不同等级的频繁模式,利用最后选择出的频繁模式进行特征组合。在公开数据集和自有数据集上,使用多种序数回归模型进行了实验论证。实验结果表明,使用最有区分能力的频繁模式组合特征,能够有效提升大多数序数回归模型的训练效果。(3)提出一种基于自定义高阶特征的因子分解机(CHOFM)。因子分解机只能学习特征之间的二阶关系,属于二阶多项式模型。高阶因子分解机通过穷举的方式,列举了全部特征组合项,这导致模型过于复杂,不易求解。本文提出一种基于自定义高阶特征的因子分解机,使用一组自定义的高阶特征组合规则集代替原始的高阶组合。这种方式既减少了无效的特征组合,同时保留高阶组合特征的表达能力。本文给出了基于SGD的CHOFM模型训练方法。实验结果表明,CHOFM模型效果相对FM模型更优。此外,CHOFM模型具有更好的收敛性。

【Abstract】 The Factorization Machine (FM), proposed in recent years, is a machine learning algorithm that combines matrix factorization with support vector machines and is mainly used to address feature combination in large-scale sparse data. FM factorizes the coefficients of the cross terms, so it can learn the relationships between variables well even when the data are sparse. A combined feature is a high-order feature formed by combining single features; it helps represent nonlinear relationships in the data and can express more of the data's underlying semantics than a single feature. Building on custom feature combinations, this thesis studies factorization machines for classification and ordinal regression tasks. The specific results are as follows:

(1) A classification-oriented combined feature extraction method based on frequent patterns is proposed. First, class-related frequent patterns are mined from the data as the basis for combined features. Second, so that the extracted combined features help distinguish between classes, K-L divergence is used to measure the class-discriminating ability of each frequent pattern. Finally, a feature combination scheme is given that combines features using the top m most discriminative frequent patterns. Experimental results show that the combined features extracted by this method improve the performance of most classification models.

(2) A combined feature extraction method for ordinal regression is proposed. So that the extracted combined features carry the ordinal information of the labels, an ordered binary decomposition method is proposed that decomposes the ordinal regression problem, in order, into multiple binary subproblems. On each binary subproblem, class-related frequent patterns are mined and their K-L divergences are computed. Because the K-L divergences of frequent patterns are imbalanced across subproblems, a cyclic selection method is proposed that selects frequent patterns distinguishing the different ranks in a balanced way, and the finally selected frequent patterns are used for feature combination. Experiments were carried out with a variety of ordinal regression models on public and private datasets. The results show that combined features built from the most discriminative frequent patterns effectively improve the training performance of most ordinal regression models.

(3) A factorization machine based on custom high-order features (CHOFM) is proposed. The standard factorization machine can learn only second-order relationships between features and is therefore a second-order polynomial model. The high-order factorization machine enumerates all feature combinations exhaustively, which makes the model overly complex and hard to solve. CHOFM replaces the original high-order combinations with a set of custom high-order feature combination rules. This reduces invalid feature combinations while preserving the expressive power of high-order combined features. A training method for CHOFM based on SGD is given. Experimental results show that CHOFM performs better than FM and, in addition, converges better.
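
The abstract describes how FM factorizes the cross-term coefficients but does not give the model equations. As a minimal sketch, assuming the standard second-order FM formulation (score = w0 + Σ_i w_i x_i + Σ_{i<j} ⟨v_i, v_j⟩ x_i x_j), the Python snippet below computes the score in two ways: with an explicit loop over feature pairs, and with the well-known sum-of-squares identity that avoids enumerating pairs. This is not the thesis's CHOFM model; the function names, the toy input, and the random parameters are all assumptions made for illustration.

```python
import numpy as np


def fm_predict_naive(x, w0, w, V):
    """Second-order FM score with an explicit loop over feature pairs.

    x: (n,) feature vector, w0: bias, w: (n,) linear weights,
    V: (n, k) latent factors, one k-dimensional vector per feature.
    """
    n = x.shape[0]
    score = w0 + float(w @ x)
    for i in range(n):
        for j in range(i + 1, n):
            # The pairwise coefficient is the inner product <v_i, v_j>.
            score += float(V[i] @ V[j]) * x[i] * x[j]
    return score


def fm_predict_fast(x, w0, w, V):
    """Same score in O(k * n) time, without enumerating pairs.

    Uses: sum_{i<j} <v_i, v_j> x_i x_j
          = 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
    """
    sum_vx = V.T @ x                      # (k,): sum_i v_if * x_i
    sum_v2x2 = (V ** 2).T @ (x ** 2)      # (k,): sum_i v_if^2 * x_i^2
    pairwise = 0.5 * float(np.sum(sum_vx ** 2 - sum_v2x2))
    return w0 + float(w @ x) + pairwise


# Hypothetical toy example: a sparse, one-hot style input with 6 features.
rng = np.random.default_rng(0)
n, k = 6, 3
x = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0])
w0 = 0.1
w = rng.normal(size=n)
V = rng.normal(scale=0.1, size=(n, k))

naive_score = fm_predict_naive(x, w0, w, V)
fast_score = fm_predict_fast(x, w0, w, V)
print(naive_score, fast_score)
```

The two functions should agree up to floating-point error. Because each pairwise coefficient is an inner product of latent vectors and not a free parameter, a pair of features that never co-occurs in the training data still receives a coefficient, which is why the factorized form copes with sparse inputs.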
