
Research on Feature Extraction and Robust Technology for Speaker Identification

【Author】 Li Yanping (李燕萍)

【Supervisor】 Tang Zhenmin (唐振民)

【Author information】 Nanjing University of Science and Technology, Pattern Recognition and Intelligent Systems, 2009, PhD

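【Abstract (translated from the Chinese)】 Speech is one of the main sources from which people obtain information, and it is also the most convenient, effective, and natural means of communication. Speech recognition studies how to make a machine accurately recognize the content of human speech, with the aim of easing communication between people and machines. Speaker recognition is a special form of speech recognition: its goal is not to recognize what the speaker says but to recognize who the speaker is. Speaker recognition has made great progress over the past thirty-odd years, and its application has brought considerable convenience to daily life. However, as the technology moves further into practical use, different application domains place ever higher demands on it. On the one hand, the variability of a speaker's pronunciation requires features suited to speaker recognition in order to guarantee system performance; on the other hand, noisy environments, the duration of training and test data, and communication channel distortion all seriously affect the performance of speaker recognition systems in real applications. This dissertation addresses the text-independent speaker identification task and studies two aspects, speaker-specific feature extraction and noise robustness. The main contents are:

1. An identification algorithm based on feature transformation and a fuzzy least-squares support vector machine (LS-SVM) is proposed. To address the limitations of the LS-SVM model when it is fed large samples of speech data, the conventional Mel-frequency cepstral coefficient (MFCC) features are first transformed with a Gaussian mixture model (GMM), which solves the problem that the number of variables in the linear system solved during LS-SVM training is tightly coupled to the number of features. In addition, a fuzzy membership function is introduced to handle the unclassifiable data that arise when the LS-SVM is extended from binary classification to the multi-class problem of speaker identification. As a classical generative model, the GMM not only reduces the amount of data effectively, acting as data compression, but also represents the speaker's characteristics well and highlights speaker information, because the result of the clustering transformation is the set of GMM mean vectors. The algorithm combines the strength of the GMM in fitting data with the strength of the LS-SVM in discriminative classification, thereby improving system performance.

2. A noise-robust algorithm of perceptual feature compensation transformation based on the GMM is proposed. Starting from the characteristics of human auditory perception, the perceptual linear prediction (PLP) model simulates the auditory properties of the human ear at several levels. With the fine detail of the speech spectrum in mind, the critical-band spectral analysis, which smooths out speaker information, is removed, and the modified perceptual log area ratio (MPLAR) coefficients are extracted as speaker features with good separability. On this basis, according to the acoustic characteristics of speaker recognition and considering the matching scores as a whole, a nonlinear transformation is applied to the likelihood scores output by the models. It widens the score ratio between the target model and non-target models and draws the frame scores of the same model closer together, so that each model's score depends not only on the likelihood at the current instant but also on the likelihoods at the previous K instants, which resolves the noise robustness problem of MPLAR under different types of noise. The speaker identification algorithm based on perceptual features and model compensation not only provides features with better separability but also, by exploiting the statistical properties of the overall scores at the model matching stage, obtains stable model scores and strengthens the system's recognition ability in noisy environments.

3. A robust identification algorithm based on adaptive frequency warping is proposed. The classical MFCC and PLP features start from the mechanism of human auditory perception and simulate the way the human auditory system perceives sound frequency, which improves speaker recognition performance. However, this processing does not distinguish between semantic features and speaker-specific features, and instead indiscriminately reduces the weight of high-frequency information at the feature extraction stage. The adaptive frequency warping algorithm rests on the principle that speaker information is distributed unevenly across frequency bands. It analyzes, from the physiological viewpoint of speech production, the structural changes that occur during articulation, obtains from them the physiological features that carry speaker information, and then quantifies, at the level of spectral analysis, the contribution of each frequency band to speaker information. This guides the design of an adaptive frequency scale transformation that differs from the Mel frequency scale: regions that contribute more speaker information are allocated more filters, with narrower bandwidth and higher frequency resolution, while regions that contribute less are allocated fewer filters, with wider bandwidth and lower frequency resolution. Adaptive spectral filtering is then performed to extract the discriminative feature DFCC. In addition, to address the mismatch between training speech and test speech that arises in real operating environments, the speech spectrum is pre-enhanced frame by frame and frequency bin by frequency bin to remove noise interference, further improving the robustness of the system.

4. A speaker identification method based on Chinese vowel mapping is proposed. The method starts from the characteristics of Chinese speech and studies speaker recognition for Chinese. Chinese has a relatively stable syllable structure in which the vowel part accounts for most of the energy and duration. On this basis, the structure and articulation of Hanyu Pinyin are analyzed, and, through vowel spectrum comparison, phoneme sliding analysis, final decomposition experiments, and formant analysis, the vowel part of each final is decomposed at the short-time frame level into a combination of single-vowel phonemes. Combined with extensive phonetic knowledge, a Chinese vowel mapping table is constructed. Chinese vowel mapping can effectively separate the semantic information in the speech signal from the speaker identity information, turning the text-independent speaker recognition problem into a recognition problem tied to a finite number of single-vowel phonemes. This gives rise to a new speaker modeling method and a new recognition framework, which raise the recognition rate while reducing the dependence on the duration of training and test data. Within the new framework, a speaker identification algorithm based on biomimetic pattern recognition is proposed: in the training stage, an improved nearest-neighbor covering algorithm builds an effective cover for each single-vowel phoneme; in the recognition stage, the decision is made according to whether the vowel frame under test falls into the corresponding covered region. The algorithm discriminates impostors well under open-set test conditions.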
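As a rough illustration of the filter allocation described in item 3, the Python sketch below places filter center frequencies so that each filter covers an equal share of a per-band speaker-information score. The abstract does not say how band contributions are quantified, so the `contribution` values, the band edges, and the function name `adaptive_center_frequencies` are all assumptions made for illustration and are not the thesis's actual method.

```python
from bisect import bisect_left
from itertools import accumulate


def adaptive_center_frequencies(band_edges_hz, contribution, n_filters):
    """Place filter centers so each filter covers an equal share of contribution.

    band_edges_hz: ascending band edges, one more entry than `contribution`.
    contribution:  non-negative, hypothetical speaker-information score per band.
    """
    if len(band_edges_hz) != len(contribution) + 1:
        raise ValueError("need exactly one more edge than bands")
    total = float(sum(contribution))
    if total <= 0:
        raise ValueError("contribution must have positive total mass")

    # Cumulative share of contribution at each band edge, from 0.0 up to 1.0.
    # This plays the role of the warped frequency axis.
    cdf = [0.0] + [c / total for c in accumulate(contribution)]

    centers = []
    for k in range(1, n_filters + 1):
        target = k / (n_filters + 1)      # equal steps on the warped axis
        i = bisect_left(cdf, target)      # first edge whose share reaches target
        i = min(max(i, 1), len(cdf) - 1)
        lo, hi = cdf[i - 1], cdf[i]
        frac = 0.0 if hi == lo else (target - lo) / (hi - lo)
        width = band_edges_hz[i] - band_edges_hz[i - 1]
        # Linear interpolation inside the band maps back to hertz.
        centers.append(band_edges_hz[i - 1] + frac * width)
    return centers


if __name__ == "__main__":
    # Hypothetical scores: the 2000-3000 Hz band carries the most speaker information.
    edges = [0, 1000, 2000, 3000, 4000]
    print(adaptive_center_frequencies(edges, [1, 1, 4, 2], n_filters=7))
```

With these made-up scores, most of the seven centers land in the 2000 to 3000 Hz band, so that band gets the densest, narrowest filters, while the low-scoring bands get sparse, wide ones.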

【Abstract】 Speech is a major source of information for people, and it is also the most convenient, effective, and natural communication tool. Speech recognition aims to identify the content of speech in order to facilitate communication between people and machines. Speaker recognition is a special form of speech recognition in which a machine recognizes a person from a spoken phrase. Speaker recognition technology has made great progress over the past thirty years; at the same time, as practical applications have developed, they demand higher performance. On the one hand, the variability of speaker pronunciation makes the extraction of discriminative features a key factor in ensuring system performance. On the other hand, many disturbing factors, such as environmental noise, the length of the training and testing data, and communication channel mismatch, seriously degrade the performance of speaker recognition in practical applications. This dissertation focuses on text-independent speaker identification, covering the extraction of speaker characteristics and noise robustness. The main research results fall into four aspects:

1. A speaker identification algorithm based on feature transformation and a fuzzy least-squares support vector machine (LS-SVM) is presented to overcome the limitations of the LS-SVM on large samples of speech data. Training an LS-SVM requires solving a set of linear equations whose number of variables equals the number of training data points, so this dissertation proposes a feature transformation based on the Gaussian mixture model (GMM). It also introduces a fuzzy membership function into the LS-SVM to deal with the unclassifiable regions that arise in multi-class problems. The GMM is a classical generative model that can effectively reduce the amount of feature data and highlight speaker characteristics, because the clustering result is the set of Gaussian mean vectors. The proposed algorithm combines the advantages of the generative model and the discriminative model. Experimental results demonstrate that the fuzzy LS-SVM has better discriminative and generalization ability.

2. A noise-robust method of perceptual feature compensation transformation based on the GMM is proposed. Drawing on the analysis of human auditory perception, the perceptual linear prediction (PLP) model takes three steps to reflect human perception of sound. This dissertation modifies PLP at the feature extraction stage by removing the critical-band spectral resolution analysis, and then extracts the modified perceptual log area ratio (MPLAR). Furthermore, according to the acoustic characteristics of speaker recognition, a nonlinear transformation is applied to the output likelihood scores, which widens the score ratio between the target model and non-target models and, by considering the whole distribution of scores, keeps the frame scores of the same model close to one another. Each model score therefore depends not only on the current likelihood score but also on the scores of the previous K frames, which overcomes the limited robustness of the MPLAR feature under different noise environments. The method, based on perceptual features and model compensation, provides discriminative features, stabilizes the model scores, and improves the recognition rate and robustness of the recognition system.

3. A robust algorithm based on self-adaptive frequency warping is introduced. Although the Mel-frequency feature and the PLP feature take the characteristics of human auditory perception into account and improve recognition performance to some extent, they cannot treat semantic information and speaker-specific characteristics differently, and they indiscriminately de-emphasize high-frequency information. This dissertation presents a new discriminative feature based on adaptive frequency warping. We analyze the relationship between frequency components and individual characteristics and quantify this dependency. The new feature is extracted by non-uniform sub-band filters designed according to the adaptive frequency warping in different frequency bands. Furthermore, we apply pre-enhancement before the feature extraction module. A series of controlled experiments shows that the warping algorithm is reasonable and interpretable, and that the proposed feature is insensitive to spoken content and thus more discriminative and robust. The experimental results demonstrate that combining pre-enhancement with the proposed feature leads to a noticeable improvement in speaker recognition rate and robustness.

4. A novel framework for speaker recognition based on a Chinese vowel mapping technique is proposed. The basis of this framework is the decomposition of Chinese multi-vowels into single-vowel phonemes. In Chinese pronunciation, all syllables have a simple and stable phonetic structure, and the vowel part they contain accounts for most of the energy and duration. We find that, from the viewpoint of short-term analysis, the diphthongs and multi-vowels of Chinese can be approximately regarded as combinations of vowels and transitional parts, and we build a new Chinese vowel mapping table from multi-vowels to single-vowel phonemes. Based on this mapping table, we succeed in separating personal identity information from semantic information, which is a novel way to transform a text-independent system into a text-dependent speaker recognition system and can be reused by industry or other researchers. In the new framework, we propose a new Chinese speaker identification system based on biomimetic pattern recognition and improve the nearest neighbor algorithm to find an effective cover of each phoneme in the feature space for every speaker. During the identification phase, the final decision is made according to the relation between the cover and the feature vector. Experimental results demonstrate that the Chinese vowel mapping theory is valid and meaningful, and that the new system can effectively reduce the amount of data required and resist interference from impostors.
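To make the score compensation in item 2 more concrete, the following Python sketch lets each model's frame score depend on the current frame and up to K previous frames, then applies a nonlinear normalization across models before accumulating. The abstract does not give the actual transformation, so the moving average, the softmax, and the names `smoothed_frame_scores`, `softmax`, and `identify` are stand-ins chosen for illustration only.

```python
import math


def smoothed_frame_scores(log_liks, k):
    """Average each model's log-likelihood over the current and up to k previous frames.

    log_liks[t][m] is the log-likelihood of frame t under speaker model m.
    """
    smoothed = []
    for t in range(len(log_liks)):
        window = log_liks[max(0, t - k): t + 1]
        n_models = len(log_liks[t])
        smoothed.append(
            [sum(frame[m] for frame in window) / len(window) for m in range(n_models)]
        )
    return smoothed


def softmax(row):
    """Nonlinear normalization across models (assumed here, not from the thesis)."""
    peak = max(row)                       # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def identify(log_liks, k):
    """Return the index of the model with the largest accumulated transformed score."""
    n_models = len(log_liks[0])
    totals = [0.0] * n_models
    for row in smoothed_frame_scores(log_liks, k):
        for m, p in enumerate(softmax(row)):
            totals[m] += p
    return max(range(n_models), key=totals.__getitem__)


if __name__ == "__main__":
    # Made-up scores: model 0 is the target; frame 2 is corrupted by noise.
    frames = [[-1.0, -2.0], [-1.0, -2.0], [-5.0, -1.0], [-1.0, -2.0]]
    print(identify(frames, k=2))
```

Averaging over the window pulls the frame scores of one model toward each other, and the normalization bounds how much any single noisy frame can contribute to the final decision. Both effects are in the spirit of the description above, though the real method may differ substantially.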
