基于核方法的说话人辨认模型研究

Research of Speaker Identification Models Based on Kernel Methods

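【Author】 Zheng Jianwei (郑建炜)

【Advisor】 Wang Wanliang (王万良)

【Author information】 Zhejiang University of Technology, Control Theory and Control Engineering, 2010, PhD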

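【Abstract (translated from the Chinese)】 Owing to its convenience, accuracy, and low cost, speaker recognition is regarded as the most natural biometric authentication technology, with broad application prospects in security monitoring, forensic identification, electronic surveillance, and financial services. Development of speaker recognition systems has been shifting from theoretical research to practical application, and requirements have risen as application environments change: a system is expected not only to reach a very high recognition rate but also to respond in near real time, and ease of system construction and model extensibility cannot be neglected. Over the past decade or so, kernel-based classification algorithms have become a research focus in pattern recognition. They overcome the local minima and incomplete statistical analysis of traditional pattern recognition methods and handle nonlinearity well, while the speech feature parameters fed to a speaker recognition system are nonlinear with complex local structure. Kernel classification models can therefore perform well in speaker recognition. This thesis addresses the identification task in speaker recognition, targets small-sample corpora, concentrates on improvements in the model domain, and proposes kernel classification methods with different strengths. The main work is as follows:

1. In-depth analysis of the currently dominant speaker recognition models, GMM-UBM and SVM. The generative Gaussian mixture model (GMM) has long served as the baseline technique for speaker recognition, but applied directly it requires too much training data; the universal background model (UBM) reduces the input data needed for a target speaker and performs better than a plain GMM. The discriminative support vector machine (SVM) offers a maximum classification margin, a globally optimal solution, and sparsity, and outperforms GMM-UBM in small-sample speaker recognition. The thesis analyzes both models in detail in terms of principles, strengths and weaknesses, fusion strategies, and application details. Speaker identification experiments show that GMM-UBM is somewhat slow at test time, while the binary SVM extends poorly to multiple classes.

2. Speaker identification combining the relevance vector machine with the Gaussian mixture model. The relevance vector machine (RVM) classifier shares the decision function of the SVM, generalizes equally well, and is sparser; in addition, its probabilistic output overcomes the SVM's binary-only result, and it needs no laborious tuning of the penalty factor C. In text-independent speaker identification, however, building an RVM model is too slow. The thesis introduces the binary RVM into speaker identification and uses a fast training algorithm for frame-based speaker recognition. To speed up model construction further, GMM statistical feature parameters are taken as the RVM input vectors. This distills speaker-specific information effectively, solves the RVM training problem on large data sets, and combines the robustness of statistical models with the discriminative power of discriminative models. Experiments show that the RVM matches the SVM in model extensibility and recognition rate, while its test-time speed is clearly better.

3. A multi-class kernel logistic regression (MKLR) speaker identification method. Although both RVM and SVM offer excellent recognition performance and test-time speed, their binary nature prevents either from being applied directly to speaker identification. For frame-based, text-independent speaker identification, with its multi-class target and large training sets, the thesis generalizes the classical kernel logistic regression (KLR) model to the multi-class case and adds an L2 penalty to improve generalization. The negative log-likelihood objective is converted to its dual and the model is trained with a sequential minimal optimization algorithm, which preserves the strong discriminative power of the original KLR model and speeds up model construction. Experiments show that MKLR needs no cumbersome multi-class extension for the identification task and reaches a recognition rate of 99.5%.

4. A probabilistic sparse multi-class kernel logistic speaker identification method (SMKLC). The weakness of MKLR is its slow test speed. After extending the classical logistic regression model to multiple classes, the thesis therefore places on the parameters the sparsity-inducing prior used in the RVM, which yields a sparse model without introducing new informative prior parameters. Training uses a bottom-up greedy algorithm, which avoids inverting large matrices and reduces the computational cost of training. Speaker identification experiments show that SMKLC maintains high recognition performance while needing only 0.0057 seconds per phrase at test time.

5. A locality-preserving kernel Fisher discriminant speaker identification method (LWFDA). Combining the strengths of kernel Fisher discriminant (KFD) analysis and locality preserving projection (LPP), the method introduces an affinity factor into the within-class scatter matrix of KFD. This retains KFD's globally optimal projection while bringing out LPP's locality preservation, so that effective discriminative projections are obtained both for overlapping (outlying) samples and for multimodal, clustered samples. A fast solution algorithm is also given, which resolves the out-of-memory problem in large-sample training and makes the method suitable for speaker identification. Experiments show that LWFDA matches MKLR in recognition rate and cuts test time by 9.25% relative to MKLR.

6. An enhanced data domain description speaker identification method (EDDD). To handle open-set identification, the method builds on support vector domain description (SVDD) and introduces a data-density factor in a simple form, so that data in different regions no longer contribute equally to the classifier: data in high-density regions contribute more to the support domain, while sparsely scattered data in low-density regions contribute less. The classification hypersphere therefore moves toward the high-density region of the data, which improves recognition performance. Speaker identification experiments show that EDDD outperforms the GMM across the board.

The thesis concentrates on the model component of speaker identification systems and proposes or improves several kernel-based classification methods, including binary, multi-class, one-class, and dimensionality-reducing classifiers. Each has its own strengths and suits speaker identification systems with different requirements.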
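As a rough illustration of the multi-class kernel logistic regression with an L2 penalty described in item 3, the following Python sketch fits a softmax model over RBF kernel expansions. It is a minimal sketch under stated assumptions, not the thesis's method: it uses plain gradient descent on the primal objective rather than the SMO-style dual solver the abstract mentions, and the function names (`train_mklr`, `predict`), the hyperparameters, and the toy 2-D "frames" are all hypothetical.

```python
import math

def rbf(x, z, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def softmax(scores):
    """Numerically stable softmax over one row of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_mklr(X, y, n_classes, lam=0.01, lr=0.05, epochs=200, gamma=1.0):
    """Fit alpha so that f_c(x) = sum_i alpha[i][c] * k(x_i, x).

    Minimises -sum_i log p(y_i | x_i) + (lam / 2) * sum_c alpha_c^T K alpha_c
    by plain gradient descent (an assumption of this sketch; the abstract
    describes an SMO-style solver on the dual instead).
    """
    n = len(X)
    K = [[rbf(X[i], X[j], gamma) for j in range(n)] for i in range(n)]
    alpha = [[0.0] * n_classes for _ in range(n)]
    for _ in range(epochs):
        # Residual matrix P - Y + lam * alpha; the gradient is K times it.
        resid = []
        for i in range(n):
            scores = [sum(K[i][j] * alpha[j][c] for j in range(n))
                      for c in range(n_classes)]
            probs = softmax(scores)
            resid.append([probs[c] - (1.0 if y[i] == c else 0.0)
                          + lam * alpha[i][c] for c in range(n_classes)])
        for i in range(n):
            for c in range(n_classes):
                grad = sum(K[i][j] * resid[j][c] for j in range(n))
                alpha[i][c] -= lr * grad
    return alpha

def predict_proba(x, X, alpha, gamma=1.0):
    """Class posterior probabilities for a new vector x."""
    k = [rbf(xi, x, gamma) for xi in X]
    n_classes = len(alpha[0])
    return softmax([sum(k[i] * alpha[i][c] for i in range(len(X)))
                    for c in range(n_classes)])

def predict(x, X, alpha, gamma=1.0):
    """Index of the most probable class (speaker) for x."""
    probs = predict_proba(x, X, alpha, gamma)
    return max(range(len(probs)), key=probs.__getitem__)

# Toy data: three "speakers", four made-up 2-D frames each.
centres = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
offsets = [(-0.2, -0.2), (0.2, -0.2), (-0.2, 0.2), (0.2, 0.2)]
X_toy = [(cx + dx, cy + dy) for cx, cy in centres for dx, dy in offsets]
y_toy = [c for c in range(3) for _ in offsets]
alpha_toy = train_mklr(X_toy, y_toy, n_classes=3)
print(predict((4.9, 0.1), X_toy, alpha_toy))
```

Because the softmax handles all classes in one model, no one-vs-rest or one-vs-one scheme is needed, which is the property the abstract highlights. Every training point keeps a nonzero coefficient here, which is consistent with the slow test speed that item 4 sets out to fix through sparsity.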

【Abstract】 Owing to its flexibility, accuracy, and economy, speaker recognition is regarded as the most natural biometric technology, with broad application prospects in secure access, forensic evaluation, electronic surveillance, and financial services. Research on speaker recognition systems has recently moved from theory to practice, and requirements have grown as application conditions change. A high recognition rate is no longer the only criterion: real-time performance, convenience, and extensibility of the system model cannot be neglected. Over the past decade, many classification algorithms based on kernel functions have been proposed. They overcome the local minima and incomplete statistical analysis of traditional pattern recognition models, and their strong nonlinear capacity suits the demands of speech features, so kernel-based speaker recognition systems such as those built on the support vector machine have proved very successful.

In this thesis, we focus on improving the model domain and propose several kernel classification methods for speaker identification on small-sample speech corpora. The main contributions are as follows:

1. An analysis of the leading speaker recognition models, GMM-UBM and SVM. The generative GMM has been the baseline technology for the last decade, but it needs too much input speech data. GMM-UBM reduces the amount of input data required for a target speaker and performs better than the GMM. The discriminative SVM has many merits, including a maximum classification margin, a global solution, and sparsity, and on small-sample speaker recognition it performs even better than GMM-UBM. We analyze the principles, performance, fusion strategies, and application details of these models. Experiments show that GMM-UBM is slow at test time and that SVM extends poorly to multi-class classification.

2. A hybrid GMM and RVM strategy for speaker identification. The relevance vector machine (RVM) uses probabilistic output to overcome a shortcoming of the SVM and is also sparser. However, the RVM has excessive computational complexity and memory requirements in text-independent speaker identification because of the large number of training samples. To solve this problem, a hybrid GMM/RVM approach is proposed that extracts speaker feature vectors effectively and removes the storage problem. Furthermore, the hybrid approach combines the robustness of the generative model with the classification power of the discriminative model to improve the performance and robustness of identification. Experiments show that the method has a lower error rate than the GMM system and is sparser than the state-of-the-art GMM/SVM system.

3. A multi-class kernel logistic regression speaker identification model. The traditional logistic regression model is transformed into a multi-class kernel logistic model for text-independent speaker identification, a problem that is nonlinear and involves more than two classes. An L2 penalty is added to enhance generalization. A new iterative algorithm is then proposed that solves a dual problem using ideas similar to those of the Sequential Minimal Optimization algorithm for SVM. Experiments show that the algorithm is robust and fast, and that its recognition rate in text-independent speaker identification is as good as that of widely used methods such as SVM.

4. A probabilistic sparse kernel logistic speaker identification model. A truly sparse multi-class formulation is introduced based on multinomial logistic regression. It uses weighted sums of basis functions with sparsity-promoting priors that encourage the weight estimates to be either significantly large or exactly zero. A bottom-up training algorithm is then adopted that controls the capacity of the learned classifier by minimizing the number of basis functions used, which gives better generalization and faster computation. Experimental results on standard benchmark data sets and on speaker identification show that the proposed method has the best real-time performance.

5. A local within-class feature preserving kernel Fisher discriminant algorithm, applied to speaker identification. Dimensionality reduction that does not lose intrinsic information in the original data is an important technique for subsequent tasks such as classification. A novel algorithm is proposed after a close analysis of the relationship between the kernel Fisher discriminant (KFD) and kernel locality preserving projection (LPP). The new method keeps the global projection ability of KFD and adds the locality preservation of LPP, so it works well on overlapped or multimodal labeled data. The training algorithm is improved to resolve the out-of-memory problem on large samples. The speaker identification application shows that the proposed algorithm is more adaptable and achieves a high recognition rate and speed.

6. An enhanced data domain description speaker identification method. The purpose of data description is to give a compact description of the target data that represents most of its characteristics. In support vector data description (SVDD), the compact description is a hyperspherical model determined by a small portion of the data called support vectors. Despite its usefulness, conventional SVDD may fail to find the optimal description of the target, especially when the support vectors do not reflect the overall characteristics of the target data. To address this, an enhanced SVDD is proposed that introduces new distance measurements based on a relative density degree for each data point, so as to reflect the distribution of the data set. The enhanced SVDD is compared with the GMM because both apply to open-set speaker identification; the results show that the enhanced SVDD outperforms the GMM in recognition rate, real-time performance, and training-sample requirements.
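To make the density idea in item 6 concrete, the sketch below computes a per-point relative density from k-nearest-neighbour distances and shows how weighting by it pulls a centre estimate toward the dense region. This is only an illustration under stated assumptions: the particular density formula (mean k-NN distance divided by each point's own k-NN distance), the function names, and the toy points are hypothetical, and a weighted centroid stands in for the hypersphere centre instead of solving the SVDD quadratic program.

```python
import math

def kth_neighbour_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour, excluding itself."""
    dists = sorted(math.dist(points[i], p)
                   for j, p in enumerate(points) if j != i)
    return dists[k - 1]

def relative_density(points, k=2):
    """Assumed density factor: values above 1 mark denser-than-average points."""
    d = [kth_neighbour_distance(points, i, k) for i in range(len(points))]
    mean_d = sum(d) / len(d)
    # Guard against division by zero when duplicate points are present.
    return [mean_d / max(di, 1e-12) for di in d]

def weighted_centre(points, weights):
    """Weighted mean of the points, used here as a stand-in for the sphere centre."""
    total = sum(weights)
    dim = len(points[0])
    return tuple(sum(w * p[a] for w, p in zip(weights, points)) / total
                 for a in range(dim))

# Five tightly clustered points plus one isolated point (made-up data).
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (10.0, 10.0)]
rho = relative_density(pts, k=2)
plain = weighted_centre(pts, [1.0] * len(pts))
dense = weighted_centre(pts, rho)

cluster_mid = (0.05, 0.05)
print(rho[-1] < min(rho[:-1]))
print(math.dist(dense, cluster_mid) < math.dist(plain, cluster_mid))
```

The isolated point receives the smallest density factor, so it barely moves the weighted centre, whereas it drags the unweighted mean far from the cluster. In a density-weighted SVDD the same factors would instead rescale each point's distance to the centre inside the optimization.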
