节点文献

结构化数据核函数的研究

Research on Kernels for Structured Data

【作者】 尹传环

【导师】 田盛丰;

【作者基本信息】 北京交通大学 , 计算机应用技术, 2008, 博士

【摘要】 核函数是支持向量机中非常重要的一个研究方向,尽管在统计学习理论出现之前,核函数的概念与技术早已存在,但它在机器学习中真正的成功应用,是从支持向量机开始的。正是支持向量机与核函数技术的结合,才使得以支持向量机为代表的核机器学习得到了快速的发展和广泛的应用。本论文的所有工作正是基于支持向量机与核函数的结合而展开的,主要包括三个方面的内容:核函数的构造、核函数的实现以及核函数的应用。支持向量机的输入数据一般定义在向量空间,常用的核函数如多项式核、径向基核等都能取得很好的效果。但是,还有很多机器学习问题在解决的时候涉及到一些含有结构信息的数据(我们称为结构化数据),如字符串和图像等,采用这些常用的核函数往往无法取得满意的效果,因为这些数据在转换成向量时将会丢失一些结构信息。因此针对这些结构化数据的核函数构造问题,已经提出了许多新型的核函数以及实现算法。本论文以结构化数据的核函数作为研究对象,提出了一些新的核函数以及它们的实现算法;并且对已有核函数的实现进行改进,降低计算复杂度;然后将一些新型字符串核函数应用于入侵检测领域。(1)核函数的构造。在归纳和总结了现有的字符串核函数的基础上,本论文将字符串核函数划分为基于序列的以及基于概率的两大类字符串核函数。基于序列的字符串核函数比较常见,包括间隙加权核以及谱核等常用的核函数。谱核没有考虑不连续的子序列对核函数的影响,而间隙加权核函数则惩罚长度较大的子序列,实际上,在有些应用中我们应该奖励长度较大的子序列,而非惩罚。在详细分析之后,本论文提出了一种基于序列的字符串核函数,叫做长度加权核函数,在这个核函数中长度越大的子序列所占的权重越大。另外,提出了一种变种——长度加权一次核函数,在这个核函数中重复出现的子序列我们只考虑一次。基于序列的字符串核只计算在两个字符串中出现的匹配子序列对核值的贡献,而没有考虑依次出现的字符之间的依赖关系。为了在核函数中体现字符之间的依赖关系,我们依据马尔可夫模型提出了基于概率的混合阶马尔可夫核函数,它也是一种字符串核函数。(2)核函数的实现。已经有许多算法用来实现字符串核函数,包括基于动态规划的、基于后缀树的以及基于后缀核的算法。在分析了后缀核的概念之后,本论文提出了一系列基于后缀核的实现算法,能够用来解决目前的间隙核函数以及本论文提出的长度加权核函数。另外,我们将位并行算法应用于核函数的实现算法中,分析表明这种处理在一定条件下能够加快定长度加权核函数的计算。为了快速实现混合阶马尔可夫核函数,本论文采用了后缀树存储结构,并利用它的匹配统计量计算混合阶马尔可夫核函数,能够在线性时间内求出核函数的值。(3)入侵检测是信息安全中很重要的一个环节。支持向量机作为一种分类算法已经被应用于基于网络的入侵检测中,但是在基于主机的入侵检测中,由于输入数据大部分为命令序列或者系统调用序列,采用常见的径向基或者多项式核函数的支持向量机并不合适。针对基于主机的入侵检测系统,我们利用训练数据构造了基于字符串核函数的1-类支持向量机,包括现有的以及本论文提出的字符串核函数,并用这个1-类支持向量机对测试数据进行测试,实验结果表明本论文提出的一些字符串核函数比现有的一些字符串核函数更加适用于基于主机的入侵检测系统。

【Abstract】 Kernel function is an important idea in Machine Learning, especially in Support Vector Machine (SVM). Kernel function had been introduced into the context of machine learning before SVM was presented. However, the first successful application of kernel function is SVM. The development of kernel machine learning depends on the research of the combination of kernel function and SVM. This dissertation is based on the combination of kernel function and SVM, including constructions, implementations, and applications of the kernel function.In general, SVM takes a feature vector representation as its input, where some usual kernel functions such as polynomial kernel and RBF kernel are used. However, in the real world, many machine learning problems often work on special data types, such as strings and images. These special data contain some information of their structure, which might be useful in problem solving. We denote these special data by structured data. The usual kernel function may be not applicable for the problem involving structured data, because some information of these structured data may be lost in the mapping to feature vector. Therefore, many efforts were made to construct kernel functions for structured data and to compute these functions efficiently.This dissertation takes the kernel functions for string (denoted by string kernels) as research object, and presents some novel string kernels and algorithms for these kernels. Moreover, a bit-parallel technique is introduced to speed up the computation of fixed length string kernels. Finally, these string kernels are used in the context of intrusion detection.(1) The construction of new string kernels. String kernels can be divided into two kinds, one is subsequence-based and another is probability-based. The existing gap-weighted kernels and spectrum kernels belong to the former. The spectrum kernels did not consider the contribution of non-contiguous subsequences, and the gap-weighted kernels penalized the longer subsequences rather than encouraged them. By observing these disadvantages of existing string kernels in certain context, we presented a subsequence-based string kernel, which is called length-weighted kernel, in which the longer subsequence can contribute more to the kernel value. Moreover, a variant of length-weighted kernel is proposed, which is called length-weighted once kernel, in which all subsequences will be considered only once regardless of whether they occur only once or many times in a string. Other than subsequence-based string kernels, the probability-based string kernels take the dependence of characters occur in string into account. In order to consider the dependence of characters, we proposed a probability-based string kernel, which is based on Markov model and called Markov kernel.(2) The implementation of string kernels. Dynamic programming, suffix kernel, and suffix tree are three general approaches to compute the value of string kernels. After analyzing the idea of suffix kernels, we presented a series of algorithms based on suffix kernels to compute existing gapped kernels and the length-weighted kernel as well as its variant. Moreover, a bit-parallel technique is introduced to accelerate the computation of kernel functions. In order to compute the Markov kernel, a suffix tree is used to store the string and its matching statistics are used in the process of computation.(3) Intrusion detection is an important facet of information security. As a classification algorithm, SVM was used widely in the network-based intrusion detection. Other than network-based intrusion detection, most of the input data of host-based intrusion detection are command sequences or system call sequences. Therefore, the usual polynomial and RBF kernels are not applicable for host-based intrusion detection. In the host-based intrusion detection system, we construct a 1 -SVM classifier based on the training data by embedding string kernels, and detect intrusions in the testing data using this 1-SVM classifier. The experimental results reveal that the new string kernels outperform the existing string kernels in the host-based intrusion detection system.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络