语音识别中的说话人自适应研究
Research on Speaker Adaptation in Speech Recognition
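【Author】 Wang Jian (王坚);
【Advisor】 Guo Jun (郭军);
【Author information】 Beijing University of Posts and Telecommunications, Signal and Information Processing, 2007, PhD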
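【Abstract (translated from the Chinese)】 Efficient, fast algorithms have made real-time continuous speech recognition systems feasible, but in practical use system performance degrades when the speaker changes. Speaker adaptation uses a small amount of adaptation data to improve system performance and handles this kind of acoustic mismatch well. Built on a large vocabulary continuous speech recognition platform, this thesis studies speaker adaptation techniques. The specific work and contributions are as follows:
1. Comparison of the MAP and MLLR algorithms
Starting from a discussion of speaker-induced acoustic variability, the thesis studies two model-based adaptation algorithms: maximum likelihood linear regression (MLLR) and maximum a posteriori (MAP) estimation. Experiments show that either adaptation method raises recognition accuracy to some degree. The two differ in that MAP has good asymptotic behavior but converges slowly, whereas MLLR greatly improves convergence but its asymptotic behavior is inferior to that of MAP. The thesis examines how prior knowledge of the initial model parameters affects MAP adaptation and how the regression classes affect MLLR adaptation. It further studies the cumulative effect of applying both algorithms; the results show that combining MAP and MLLR outperforms either one used alone. The thesis also discusses the equivalence between feature-level normalization algorithms and the acoustic-model-based MLLR algorithm, and gives a unified algorithmic framework.
2. Improved clustering-based speaker adaptation
The thesis proposes a speaker clustering adaptation framework that uses the weighted cross likelihood ratio between models as the distance measure. During recognition, finding the correlation between the training speakers and the test speaker, and making full use of the available adaptation and training data, is an effective way to improve speaker adaptation performance. Here, Gaussian mixture models represent the speakers, and speaker clustering reduces the number of reference models, giving a coarse classification. On this basis, reference speakers are selected according to the acoustic characteristics of the test speaker, which enables rapid speaker adaptation. The thesis also uses a universal background model as the baseline for every speaker model in order to increase the coupling between models. When generating the target speaker model, the acoustic statistics produced during model training are reused to obtain the required model parameters quickly. Experiments show that after coarse classification of the reference speakers by speaker clustering, recognition accuracy improves substantially over the baseline system, and that the coarse-classification, fine-recognition approach performs well across different numbers of mixture components.
3. Dynamic selection of reference speakers and its improvement
Based on an analysis of reference speaker selection techniques, the thesis proposes a dynamic reference speaker selection technique based on support vector machines (Speaker Support Vector Selection, SSVS). Good adaptation depends on the number of reference speakers and on whether their data suffice to describe the distribution of all reference speakers. A support vector machine automatically finds the support vectors that discriminate well between classes, so the thesis proposes treating reference speakers as support vectors and selecting them as part of SVM training, which meets the requirements of optimality and dynamic selection. SSVS turns reference speaker selection from a manual into an automatic process while satisfying the requirements of acoustic model completeness and acoustic similarity. Experiments show that this method achieves good adaptation results. Building on this, the thesis improves SSVS by directly selecting the support vectors that represent reference speakers (Reference Support Speaker Selection, RSSS). The key to dynamic reference speaker selection is finding the support vectors that represent the reference speakers. The thesis uses the SVM kernel function to compute the distance between two samples in the high-dimensional feature space; traversing the training set yields the set of samples near the optimal separating surface, and these samples are the required reference speaker support vectors. A confidence measure is also used to constrain the support vector selection. Experimental data show that RSSS-based speaker selection effectively improves system performance.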
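The contrast between MAP and MLLR drawn in item 1 can be illustrated with a small sketch. It is not taken from the thesis: it uses the standard textbook forms of the two mean updates, and the function names, the prior weight `tau`, and the toy values of `A` and `b` are assumptions made only for illustration. MAP interpolates each Gaussian mean between its prior value and the adaptation data, so Gaussians with no data stay unchanged; MLLR applies one shared affine transform, so it also moves Gaussians that were never observed.

```python
def map_mean_update(prior_mean, frames, gammas, tau=10.0):
    """MAP re-estimate of a single Gaussian mean (standard form).

    prior_mean: speaker-independent mean (list of floats)
    frames:     adaptation feature vectors aligned to this Gaussian
    gammas:     occupation probability of each frame
    tau:        weight of the prior (illustrative value, an assumption)
    """
    occ = sum(gammas)
    dim = len(prior_mean)
    weighted = [sum(g * x[k] for g, x in zip(gammas, frames)) for k in range(dim)]
    return [(tau * prior_mean[k] + weighted[k]) / (tau + occ) for k in range(dim)]


def mllr_mean_transform(A, b, mean):
    """Apply a shared MLLR mean transform: new_mean = A @ mean + b."""
    return [sum(a * m for a, m in zip(row, mean)) + bk for row, bk in zip(A, b)]


# Toy setup: one Gaussian has adaptation data, the other has none.
seen_prior = [0.0, 0.0]
unseen_prior = [2.0, 3.0]
frames = [[1.0, 1.0]] * 10
gammas = [1.0] * 10

map_seen = map_mean_update(seen_prior, frames, gammas)   # [0.5, 0.5]
map_unseen = map_mean_update(unseen_prior, [], [])       # [2.0, 3.0], unchanged

# A hypothetical transform, assumed to have been estimated for one regression class.
A = [[1.0, 0.0], [0.0, 1.0]]
b = [1.0, -1.0]
mllr_unseen = mllr_mean_transform(A, b, unseen_prior)    # [3.0, 2.0]
```

As more frames arrive, the MAP estimate approaches the mean of the data, which matches the good asymptotic behavior noted above. The shared transform explains why MLLR helps quickly with little data but is limited by the number of regression classes.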
【Abstract】 Today, a range of efficient, fast algorithms makes continuous speech recognition systems feasible. However, when the test speaker and the training speakers are mismatched, recognition performance degrades severely. Speaker adaptation techniques aim to improve recognition performance in the test environment using a small amount of data. This thesis focuses on speaker adaptation based on our large vocabulary continuous speech recognition (LVCSR) system. The research and innovations are described in detail as follows:
1. Comparison of MAP and MLLR Algorithms
Two classical model-based adaptation algorithms, MLLR and MAP, are discussed in the thesis, and the experimental results show that either method improves recognition results over the baseline system. The difference between the two algorithms is that MAP has desirable asymptotic properties, while MLLR has better convergence properties. In MAP, prior knowledge of the model parameters affects the adaptation result, and in MLLR the regression classes also influence the final results, so both are discussed in the thesis. Further research focuses on an adaptation policy that combines the two algorithms; the experimental results show that the combined method is better than either one alone. A unified view of the feature-space normalization algorithm and the acoustic-model-based MLLR algorithm is also presented.
2. An improvement of clustering-based speaker adaptation
A new measure for speaker clustering using the cross likelihood ratio is proposed. During recognition, an effective means of improving adaptation is to take advantage of the correlation between the test speaker and the training speakers, and to make full use of the available adaptation data and training data.
In this thesis, GMM-based speaker clustering is adopted to reduce the number of reference models. On that basis, appropriate reference speakers are chosen according to the acoustic features of the test speaker, which enables rapid speaker adaptation. In the clustering process, the model CLR is used as the distance measure, and a universal background model is used to provide a tighter coupling between the speaker models. The adapted model can be calculated from previously stored hidden Markov model (HMM) statistics, which makes quick adaptation possible. Using speaker clustering to perform speaker classification yields better performance across different numbers of model mixture components.
3. Dynamic selection of reference speakers and a related improvement
This thesis proposes a new method for dynamic selection of reference speakers using a support vector machine (SVM), named SSVS, together with an improved variant named RSSS. Good adaptation performance depends not only on the number of selected speakers but also on whether their statistics are sufficient to describe the distribution of the reference speakers. How to select them is still a tricky problem that relies on experiments. Selecting a dynamic rather than fixed number of close speakers can trade off good coverage against small variance among the cohorts. We try to find the subset of training speakers who are acoustically close to the test speaker using an SVM, which outperforms general speaker selection methods because it chooses an optimal set of reference models and also saves computation time. Experimental results show that the SSVS algorithm can obtain a relatively accurate model. It can be concluded that the dynamic selection of reference speakers depends on finding appropriate support vectors.
In the thesis, the kernel function is used to compute the distance between two samples in the high-dimensional feature space. Traversing the training set yields the set of samples near the optimal classification surface, and these samples are the ones needed to represent the reference speakers. A confidence measure is also applied to the selection process. The experimental results show that the proposed method improves recognition accuracy effectively.
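The kernel-based distance mentioned above can be sketched as follows. This is a minimal illustration, not the thesis's RSSS procedure: it uses the standard identity d(x, y)^2 = k(x, x) - 2 k(x, y) + k(y, y) for the distance between two samples in the feature space induced by a kernel k. The RBF kernel, its `gamma` value, the fixed-length speaker vectors, the threshold rule, and the names `kernel_distance` and `select_close_speakers` are all assumptions made for the example.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an illustrative value."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)


def kernel_distance(x, y, kernel=rbf_kernel):
    """Distance between x and y in the kernel-induced feature space."""
    d2 = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(d2, 0.0))  # guard against tiny negative round-off


def select_close_speakers(test_vec, references, max_dist):
    """Hypothetical selection rule: keep every reference speaker whose
    kernel distance to the test speaker is within max_dist, so the number
    of selected speakers varies with the test speaker."""
    return [name for name, vec in references.items()
            if kernel_distance(test_vec, vec) <= max_dist]


# Toy speaker vectors (assumed to be fixed-length speaker representations).
references = {"spk_a": [1.0, 0.0], "spk_b": [3.0, 4.0]}
selected = select_close_speakers([0.0, 0.0], references, max_dist=1.0)
```

The distance is computed entirely from kernel evaluations, so the high-dimensional feature mapping never has to be formed explicitly.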
【Key words】 Continuous Speech Recognition; Speaker Adaptation; Reference Speaker Model; Speaker Clustering; Support Vector Machine;