语音信号鲁棒特征提取及可视化技术研究

Research on Robust Feature Extraction and Visualization of Speech Signals

【Author】 韩志艳 (Han Zhiyan)

【Supervisor】 王旭 (Wang Xu)

【Author Information】 Northeastern University (东北大学), Detection Technology and Automatic Equipment, 2009, Doctoral dissertation

【Abstract】 Speech is the acoustic realization of language. It is the most natural, effective and convenient means by which people exchange information, and a vehicle for human thought. For the hearing-impaired, however, spoken communication is very difficult. Some deaf people cannot speak because their auditory organs are damaged and speech information cannot be delivered to the brain, even though their articulatory organs are intact. With the help of visual training systems and a period of dedicated practice, such people can learn to speak and communicate with hearing people. Speech visualization technology, which compensates for hearing loss non-invasively, arose to meet this need. Building on this idea, this dissertation extracts feature parameters from the speech signal and maps them to images that carry acoustic meaning, which hearing-impaired users can study and recognize as an aid to "hearing" speech. Feature extraction is a key determinant of the performance of both speech recognition and visualization systems: current speech features are robust in quiet environments, but their performance degrades sharply in noise. This dissertation therefore concentrates on feature extraction at low signal-to-noise ratios (SNR) and on applying these features to speech visualization. The main contributions are the following.

(1) To improve the accuracy of speech endpoint detection at low SNR, a new endpoint detection algorithm is proposed. Its core is the complementary use of the short-time energy-zero product and discrimination information: the energy-zero product makes the initial decision, and when a transition frame between noise and speech is encountered, a recheck based on sub-band energy discrimination information is applied, avoiding the false detections caused by abrupt changes in noise amplitude. A method for dynamically updating the noise-energy threshold is also proposed so that changes in noise energy are tracked more accurately. Simulations show that the method detects the start and end points of speech quickly and accurately even when the SNR changes sharply, laying a solid foundation for the subsequent work.
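To make the two-stage idea in (1) concrete, here is a minimal sketch, in Python, of frame-level short-time energy-zero products with a dynamically updated noise threshold. It follows only the outline above: the frame size, the decision factor `k`, the smoothing factor `alpha` and the assumption that the first frames are pure noise are illustrative choices, and the sub-band energy recheck at transition frames is omitted.

```python
import numpy as np

def short_time_energy_zero_product(x, frame_len=256, hop=128):
    """Per-frame product of short-time energy and zero-crossing count."""
    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        energy = np.sum(frame ** 2)
        zero_crossings = np.sum(np.abs(np.diff(np.sign(frame)))) / 2
        feats.append(energy * zero_crossings)
    return np.array(feats)

def detect_speech_frames(x, frame_len=256, hop=128,
                         init_noise_frames=10, k=3.0, alpha=0.95):
    """Flag speech frames against an adaptively updated noise threshold.

    The first `init_noise_frames` frames are assumed to contain noise only;
    `alpha` controls how quickly the noise floor tracks non-speech frames.
    """
    ezp = short_time_energy_zero_product(x, frame_len, hop)
    noise_level = ezp[:init_noise_frames].mean()
    is_speech = np.zeros(len(ezp), dtype=bool)
    for i, v in enumerate(ezp):
        if v > k * noise_level:
            is_speech[i] = True        # candidate speech frame
        else:
            # dynamic noise-threshold update on frames judged to be noise
            noise_level = alpha * noise_level + (1 - alpha) * v
    return is_speech
```

In the full algorithm, frames flagged at a noise/speech transition would additionally be rechecked with the sub-band energy discrimination measure before being accepted.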

(2) The learning performance of a wavelet neural network depends strongly on the number of hidden nodes, the initial weights (including thresholds), the dilation and translation factors, and the learning rate and momentum factor. As a result its global search ability is weak: it falls easily into local minima, converges slowly, or fails to converge at all. A genetic algorithm (GA), being highly parallel, stochastic and adaptive in its search, has a clear advantage on complex, nonlinear problems that traditional search methods cannot handle. The two are therefore combined: the GA selects the initial values, and the wavelet neural network then completes training to the required accuracy. Simulations show that the hybrid model raises the speech recognition rate while shortening recognition time, a gain on both fronts that lays the groundwork for practical use.

(3) To improve the robustness of speech recognition and visualization in noisy environments, Multiple Signal Classification (MUSIC) spectrum estimation is introduced into feature extraction: the MUSIC spectrum is estimated from the speech signal and perceptual information is incorporated directly into the spectrum estimate. The resulting feature parameter, PMUSIC-MFCC, improves both robustness and computational efficiency over the baseline Mel-Frequency Cepstral Coefficients (MFCC).
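The abstract does not spell out the PMUSIC-MFCC pipeline, so the following is only a plausible sketch of its general shape: a MUSIC pseudospectrum is estimated for each frame and then passed through mel filtering and a DCT, as in ordinary MFCC extraction. The correlation-matrix size, the subspace order and the externally supplied `mel_filterbank` (shape `n_mels x n_freq`, with `n_mels >= n_ceps`) are all assumptions, not values from the thesis.

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.fftpack import dct

def music_pseudospectrum(frame, order=12, n_freq=257):
    """MUSIC pseudospectrum of one frame via its autocorrelation matrix."""
    m = 2 * order                                   # correlation matrix size
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:][:m]
    eigvals, eigvecs = np.linalg.eigh(toeplitz(r))  # ascending eigenvalues
    noise_subspace = eigvecs[:, :m - order]         # smallest-eigenvalue vectors
    omega = np.linspace(0, np.pi, n_freq)
    steering = np.exp(-1j * np.outer(np.arange(m), omega))
    denom = np.sum(np.abs(noise_subspace.conj().T @ steering) ** 2, axis=0)
    return 1.0 / np.maximum(denom, 1e-12)           # peaks at signal frequencies

def pmusic_mfcc_like(frame, mel_filterbank, n_ceps=13):
    """Mel-filter a MUSIC pseudospectrum and decorrelate it with a DCT."""
    spectrum = music_pseudospectrum(frame, n_freq=mel_filterbank.shape[1])
    log_mel = np.log(np.maximum(mel_filterbank @ spectrum, 1e-12))
    return dct(log_mel, norm='ortho')[:n_ceps]
```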
(4) Dynamic behaviour is part of the diversity of speech. Unlike a stationary random process, speech is temporally correlated, with close dependencies between preceding, following and neighbouring segments. Difference (delta) and acceleration parameters do not mine this dynamic information fully and so reflect the dynamics of speech only imperfectly. The modulation spectrum, by contrast, exhibits time-frequency concentration: it captures the dynamics of speech well while remaining relatively insensitive to the acoustic environment. Accordingly, exploiting the different signatures that interference and speech leave in the modulation domain, the effective speech components of the modulation information are extracted and converted to cepstral features in a manner analogous to MFCC extraction. Simulations show that the resulting features are markedly more robust.

(5) Signals at different frequencies, within the corresponding critical bands, excite different positions on the basilar membrane of the human ear, and the constant-Q behaviour of the wavelet transform in each analysis band matches the way human hearing processes signals. Building on an analysis of the MFCC extraction procedure, the frequency axis is partitioned hierarchically with the wavelet packet transform and the relevant bands are selected adaptively according to the perceptual bands of human hearing, yielding a feature parameter based on the wavelet packet transform (WPTC) whose strong robustness is confirmed experimentally.

(6) To select a small number of mutually complementary parameters from a large pool of candidate features, a systematic and practical optimization method is proposed: orthogonal experimental design based on variance. Factors (the speech feature parameters) and their levels are chosen first; then, following the principles of mathematical statistics and orthogonality, a modest number of representative points are picked from the large space of possible experiments to construct an orthogonal table and run the orthogonal experiments; finally the results are analysed numerically to identify the optimal feature combination. Compared with the simple parameter combinations currently in use, the new method greatly reduces both the error rate and the response time.
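The exact WPTC definition is likewise not given in the abstract; a minimal sketch consistent with its description computes log sub-band energies from a wavelet packet decomposition and decorrelates them with a DCT. The `db4` wavelet and the fixed 4-level (16-band) decomposition are assumptions; the thesis instead selects bands adaptively to match the ear's perceptual bands.

```python
import numpy as np
import pywt                          # PyWavelets
from scipy.fftpack import dct

def wavelet_packet_cepstra(frame, wavelet='db4', level=4, n_ceps=13):
    """Cepstrum-like features from wavelet packet sub-band energies."""
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet,
                            mode='symmetric', maxlevel=level)
    leaves = wp.get_level(level, order='freq')     # frequency-ordered sub-bands
    energies = np.array([np.sum(np.square(leaf.data)) for leaf in leaves])
    log_energy = np.log(np.maximum(energies, 1e-12))
    return dct(log_energy, norm='ortho')[:n_ceps]  # final steps mirror MFCC
```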
(7) Exploiting the comparatively strong visual discrimination of deaf users and their strong visual memory for colour, two visualization methods are proposed. The first combines locally linear embedding (LLE) with kernel fuzzy clustering: an improved LLE proposed in the dissertation performs nonlinear dimensionality reduction on the features, and a kernel fuzzy clustering algorithm then analyses them, using a Mercer kernel to map the original space nonlinearly into a high-dimensional feature space in which the fuzzy clustering is carried out. The kernel mapping brings out structure that is not apparent in the original space and thus better supports position-based speech visualization; experiments confirm its effectiveness. The second method is based on position and pattern: it creates readable representations of the speech signal by integrating different speech features into a single image. The speech signal is first preprocessed and its features extracted; three formant features are mapped to the image's principal colours and tonal features to its pattern, while the 23 features selected by orthogonal experimental design are fed into a second neural network whose outputs supply the position information, and the visualization image is then synthesized. In a preliminary test of the system, compared with the earlier spectrogram approach, the method proved an effective and highly robust learning aid for deaf users.
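As a rough sketch of the first, position-based method, the code below substitutes scikit-learn's standard LLE and plain k-means for the dissertation's improved LLE and kernel fuzzy clustering, and adds a toy formant-to-colour mapping in place of its principal-colour scheme; every parameter and range is an illustrative assumption.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.cluster import KMeans

def feature_positions(features, n_classes=10, seed=0):
    """Map high-dimensional speech features to 2-D display positions."""
    lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2,
                                 random_state=seed)
    positions = lle.fit_transform(features)        # 2-D screen coordinates
    labels = KMeans(n_clusters=n_classes, n_init=10,
                    random_state=seed).fit_predict(positions)
    return positions, labels

def formants_to_rgb(f1, f2, f3):
    """Toy mapping of three formant frequencies (Hz) to an RGB colour."""
    def scale(f, lo, hi):
        return int(255 * np.clip((f - lo) / (hi - lo), 0.0, 1.0))
    return scale(f1, 200, 1000), scale(f2, 500, 3000), scale(f3, 1500, 4000)
```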

  • 【Online Publication Contributor】 Northeastern University
  • 【Online Publication Year/Issue】 2012, Issue 06