
Speaker Identification of Whispered Speech Based on Joint Factor Analysis

【Author】 Gong Chenghui (龚呈卉)

【Supervisor】 Zhao Heming (赵鹤鸣)

【Author Information】 Soochow University, Signal and Information Processing, 2014, Ph.D.

【Abstract】 Speaker identification, an important branch of biometric recognition, is widely applied in public security and forensics, biomedical engineering, military security systems, and other fields. With the rapid development of computer and network technology, speaker identification has made great progress. Whispering is a special mode of speech communication used in many situations. Because whispered speech differs substantially from normal (phonated) speech, speaker identification for whispers cannot simply reuse methods developed for normal speech, and many problems remain open. This dissertation presents an in-depth study of text-independent speaker identification of whispered speech.

The main challenges are as follows. First, whispered-speech databases are incomplete: for normal speech, NIST provides standard corpora for speaker-identification research, whereas comparable resources for whispered speech are scarce. Second, feature representation is difficult: owing to the peculiarities of whispered articulation, some commonly used feature parameters cannot be extracted, and spectral parameters are harder to obtain than for normal speech. Third, whispered speech is produced with a breathy excitation at a low sound level, which makes it susceptible to noise; it is also often used in mobile-phone conversations and is therefore easily affected by the channel. Finally, when whispering, the speaker is constrained by the surroundings: emotional expression is limited, and articulation and psychological state both vary, so whispers are more strongly affected by the speaker's psychological factors, mood, and speaking state. In short, compared with normal phonation, the main difficulties of speaker identification for whispered speech are that feature parameters are harder to extract, the speech is more affected by the speaker's own state, and it is more sensitive to channel variation. To address these problems, this dissertation makes the following contributions:

1. Parameter-extraction algorithms that capture the characteristics of whispered speakers. Whispered speech has no fundamental frequency and its glottal-source features are hard to observe, so the reliability of formant extraction, which characterizes the vocal tract, is essential. A formant-extraction algorithm for whispered speech based on spectral segmentation is proposed: the spectrum is segmented dynamically, the filter parameters are obtained by selective linear prediction, and the formants are obtained by parallel inverse-filter control. The method offers an effective way to handle the formant shifting, merging, and flattening caused by whispered articulation. In addition, based on the statistical property that centroid and flatness measure the stability of a signal, and combined with an auditory model, the Bark sub-band spectral centroid and Bark sub-band spectral flatness are defined; together with other spectral variables they form a feature set that effectively characterizes speakers in whisper mode.

2. A speaker-identification method for whispered speech under atypical emotion based on feature mapping and speaker model synthesis (SMS), which alleviates the mismatch in the speaker's emotional state between training and test speech. Because whispered speech conveys emotion less effectively than normal speech and cannot be clearly classified by emotion, the speaker's state is categorized by arousal and valence (A, V) factors, relaxing the one-to-one mapping to emotion categories. In the test stage, as a front-end processing step, the speaker's state is identified for each utterance, and compensation is then applied in the feature or model domain. Experiments show that state compensation based on feature mapping and SMS not only reflects the distinctive characteristics of whispered speech but also effectively improves identification accuracy under atypical emotion.

3. A latent-factor-analysis method for whispered-speaker identification under atypical emotion, which provides an effective means of speaker-state compensation. Factor analysis does not concern itself with the physical meaning of the common factors; it merely finds representative factors among many variables, and the algorithm's complexity can be adjusted by increasing or decreasing the number of factors. According to latent-factor theory, the whispered-speech feature supervector can be decomposed into a speaker supervector and a speaker-state supervector; the speaker and state spaces are estimated separately from balanced training speech, and in the test stage the speaker factors of each utterance are estimated before the decision is made. Latent factor analysis avoids speaker-state classification at test time and, compared with compensation algorithms that depend on such classification, further improves identification accuracy.

4. A joint-factor-analysis (JFA) method for whispered-speaker identification under atypical emotion and multiple channels, achieving simultaneous channel and speaker-state compensation. Under JFA, the speech feature supervector is decomposed into speaker, speaker-state, and channel supervectors. Because whispered training data are insufficient to estimate the speaker, state, and channel spaces simultaneously, the method first obtains a UBM, computes the Baum-Welch statistics of the speech, estimates the speaker space, and then estimates the state and channel spaces in parallel. In the test stage, the channel and state offsets are subtracted from the feature vectors, and the transformed features are used for identification. Experimental results show that the JFA-based method compensates for channel and speaker state simultaneously and achieves better identification performance than the other algorithms.
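The Bark sub-band spectral centroid and flatness features described in contribution 1 can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the dissertation's implementation: it uses Zwicker-style Bark band edges and a plain FFT power spectrum per frame, whereas the thesis's exact banding, windowing, and normalization are not specified here.

```python
import numpy as np

# Illustrative Bark (critical-band) edges in Hz up to ~8 kHz; the exact
# banding used in the dissertation may differ.
BARK_EDGES = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
              1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700]

def bark_subband_features(frame, sr):
    """Per-frame Bark sub-band spectral centroid (SCB) and flatness (SFMB)."""
    spec = np.abs(np.fft.rfft(frame)) ** 2           # power spectrum
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    centroids, flatness = [], []
    for lo, hi in zip(BARK_EDGES[:-1], BARK_EDGES[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        band, f = spec[mask], freqs[mask]
        if band.size == 0 or band.sum() == 0:
            centroids.append(0.0)
            flatness.append(1.0)
            continue
        # centroid: power-weighted mean frequency within the sub-band
        centroids.append(float((f * band).sum() / band.sum()))
        # flatness: geometric mean / arithmetic mean, in (0, 1];
        # near 1 for noise-like bands, near 0 for tonal/peaky bands
        gm = np.exp(np.mean(np.log(band + 1e-12)))
        flatness.append(float(gm / (band.mean() + 1e-12)))
    return np.array(centroids), np.array(flatness)
```

A pure tone placed inside one sub-band yields a centroid near the tone frequency and a very low flatness there, matching the intuition that flatness measures how noise-like (stable) the band's spectrum is.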

【Abstract】 Speaker identification (SI), an important part of biometric identification technology, is widely used in public safety, the judicial system, biomedical engineering, etc. It has made great progress with the rapid development of computer science and network technology. Nowadays, research on whispered speech covers not only fundamentals but also applications. Speaker identification of whispered speech is an interesting yet challenging task, and many issues remain to be resolved because of its particular articulation.

This dissertation focuses on text-independent speaker identification of whispered speech. The difficulties are as follows. First, the whispered-speech database is incomplete, unlike that for normal speech, for which NIST provides corpora for SI research. Second, owing to the characteristics of whisper, some parameters are unavailable and others are harder to extract. Moreover, since the excitation of whispered speech is exhalation, it is more easily affected by noise; and because whispered speech is often encountered in mobile communication, it is often influenced by the channel. Finally, when whispering, the speaker may be restricted by the surroundings, which changes the speaking mode and psychological factors, so whispered speech is more likely to be affected by the speaker's state. In short, the obstacles to SI for whispered speech are the difficulty of obtaining the parameters, and the effects of the channel and of the speaker's state. The contributions of this dissertation to speaker identification of whispered speech are as follows:

1. Algorithms for extracting parameters that represent the characteristics of whispered speakers. As whispered speech has no fundamental frequency, reliable formant extraction is essential. Formant estimation of whispered speech based on spectral segmentation is proposed: the algorithm dynamically segments the spectrum and obtains the parameters of the inverse filters by selective linear prediction. It handles the merged and shifted formants often encountered in whispered speech. In addition, the Bark sub-band spectral flatness (SFMB) and centroid (SCB) are defined to represent speaker traits in whispered speech, based on the property that centroid and flatness measure the stability of a signal.

2. Speaker identification of whispered speech based on feature mapping and speaker model synthesis (SMS). These methods resolve the mismatch in speaker state between the training and test sets. As whispered speech is weaker than normal speech at conveying emotion, a classification by A (arousal) and V (valence) factors for whispers is proposed, which also serves as a pre-processing step for SI. Experimental results show that the algorithms based on feature mapping and SMS are effective for SI of whispered speech with perceptible mood.

3. Speaker identification of whispered speech with perceptible mood based on latent factor analysis, which offers an effective route to speaker-state compensation. Factor analysis does not concern itself with the physical meaning of each factor; it is a mathematical way to find representative factors among many variables, and the complexity of the algorithm can be adjusted by changing the number of factors. Under latent-factor theory, the supervector of whispered speech is decomposed into speaker and speaker-state supervectors, whose spaces are trained on balanced data. In the test stage, the speaker factors are estimated from each session. By avoiding speaker-state classification, the latent-factor-based algorithm achieves better recognition.

4. Speaker identification of whispered speech based on joint factor analysis (JFA), a compensation algorithm for SI of whispers with perceptible mood over different channels. According to JFA theory, the supervector of a speech signal is decomposed into speaker, speaker-state, and channel supervectors. As the training set is not large enough to estimate these spaces simultaneously, the procedure is: train the UBM, compute the Baum-Welch statistics, estimate the speaker space, and then estimate the speaker-state and channel spaces in parallel. At test time, the extracted features are transformed by eliminating the channel and speaker-state factors. Experimental results show the superiority of this algorithm over other compensation methods.
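The supervector decomposition and compensation in contribution 4 can be sketched numerically. This toy example is not the dissertation's implementation: the loading matrices below are random placeholders (in the thesis they are trained from a UBM via Baum-Welch statistics), the factors are estimated by ordinary least squares rather than JFA point estimates, and all dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

D, r_spk, r_state, r_ch = 60, 4, 3, 2   # toy supervector dim and factor ranks

# Hypothetical, pre-estimated quantities standing in for trained JFA spaces.
m = rng.normal(size=D)                  # UBM mean supervector
V = rng.normal(size=(D, r_spk))         # speaker space
S = rng.normal(size=(D, r_state))       # speaker-state space
U = rng.normal(size=(D, r_ch))          # channel space

def decompose(M):
    """Jointly estimate speaker, state, and channel factors by least squares,
    under the model M ~ m + V @ y + S @ s + U @ x."""
    A = np.hstack([V, S, U])
    coef, *_ = np.linalg.lstsq(A, M - m, rcond=None)
    y, s, x = np.split(coef, [r_spk, r_spk + r_state])
    return y, s, x

def compensate(M):
    """Subtract the state and channel offsets, keeping the speaker part."""
    y, s, x = decompose(M)
    return M - S @ s - U @ x            # approximately m + V @ y

# Synthetic session with known factors, then recover them.
y0 = rng.normal(size=r_spk)
s0 = rng.normal(size=r_state)
x0 = rng.normal(size=r_ch)
M_obs = m + V @ y0 + S @ s0 + U @ x0
y, s, x = decompose(M_obs)
```

With noiseless synthetic data and full-column-rank spaces, least squares recovers the factors exactly, so the compensated supervector reduces to the speaker-only part, which is the quantity used for the identification decision.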

  • 【Online Publication Contributor】 Soochow University
  • 【Online Publication Year/Issue】 2014, No. 09