
Research on Key Technologies of Speech Dynamic Feature Analysis and Speech Visualization

【Author】 薛丽芳 (Xue Lifang)

【Supervisor】 王旭 (Wang Xu)

【Author Information】 东北大学 (Northeastern University), Detection Technology and Automatic Equipment, 2010, Ph.D.

【Abstract】 The transfer of information by speech is the most convenient and natural means of human communication. Some deaf-mute people cannot speak because their auditory organs are damaged and cannot deliver speech information to the brain, even though their articulatory organs are intact. With the aid of a visual training system and a period of dedicated training, such people can learn to speak and communicate with hearing people. Systems that convert speech into visually recognizable images to assist speech training for the deaf have been studied at home and abroad since the mid-1960s, but to date most of them use a single speech-feature representation: their recognition rates are low, and the displayed information is too specialized for deaf users to understand and accept. This thesis focuses on the mechanisms of speech production and perception, in particular on how speech information is transmitted and processed in the brain. Exploiting the strengths of existing techniques in speech analysis (the wavelet transform, auditory models, neural networks and manifold learning), it proposes a parametric description of speech within the brain's perceptual system, together with a new speech recognition method that displays this description graphically. Compared with traditional speech recognition, the principle is easy to understand and the computational load is small; the work also attempts to confirm that the perception of speech (at least of vowels) is a simple topological mapping. The resulting patterns are easy to recognize: after only simple training, deaf users can identify speech by exploiting the brain's own feedback and its strong capacity for visual compensation. The innovations of this thesis are as follows (runnable sketches of contributions (2) to (5) are given after the English abstract below):

(1) The research status of traditional speech recognition and of speech-training aids for the deaf is reviewed in detail, and the feasibility and applicability of converting human speech into visual information is demonstrated through a systematic study of speech production and perception. The speech spectrum representations and visualization methods currently used in speech analysis are examined in depth, analysing the principle, scope of application, advantages and shortcomings of each. Finally, after a brief account of traditional hand-crafted feature extraction methods (LPCC, MFCC, PLP, etc.), the concept and method of automatic feature extraction for speech signals are proposed on the basis of neural networks and manifold learning.

(2) A new speech visualization method is proposed. Using the multiresolution idea of wavelet theory (WT), an auditory-model filter bank is built to simulate the auditory system, overcoming the drawback of the traditional short-time Fourier transform (STFT), which imposes the same time and frequency resolution on high and low frequency bands; the resulting behaviour closely matches how the human ear perceives sound. The wavelet-filtered speech is feature-coded into a combined feature that represents the regularities of the speech signal, and this combined feature is displayed as a simple pattern, letting deaf users recognize speech with their own brains and realizing, to some extent, the idea of turning speech into images.

(3) A readable speech pattern based on a temporal self-organizing map (TSOM) is created and described. A time-enhancement mechanism is introduced on top of the self-organizing map (SOM) to improve system performance, remedying the SOM's fixed spatial topology and its neglect of the temporal factor, which is crucial for speech signals. The TSOM is particularly effective for visualizing time-varying speech spectra: consecutive short-time spectra form a trajectory on the two-dimensional map plane, so the dynamics of the speech signal can be observed over time.

(4) A speech visualization method based on temporal linear embedding (TLE) is proposed. Locally linear embedding (LLE) is an unsupervised learning algorithm for feature extraction, whose goal is to reduce the dimensionality of the speech signal while preserving most of its key information. If speech variability can be described by a small number of continuous features, speech data can be viewed as a low-dimensional manifold embedded in the high-dimensional space of all possible waveforms. This thesis applies manifold learning to speech data, analysing the basic LLE algorithm and its limitations in detail, and on that basis proposes the improved TLE algorithm, which extracts as much useful low-dimensional structure from high-dimensional speech as possible. The algorithm's ability to separate vowels in the low-dimensional space is evaluated and compared with classical linear dimensionality reduction (PCA); the results show that manifold learning outperforms the classical method in low-dimensional space and can discover useful manifold structure in speech data.

(5) A speech visualization method based on an auditory model is proposed. A Gammatone auditory filter bank and the Meddis inner-hair-cell firing model are used to obtain an auditory correlogram characterizing the activity of the auditory nerve, and the amplitudes of the frequency components in each band of the correlogram are feature-coded into a vector characterizing that band. Compared with traditional speech processing methods such as the spectrogram, this method reveals more of the frequency characteristics of speech.

【Abstract】 Information transfer by voice is the most convenient and natural means of communication between people. Some deaf-mute people cannot talk because their auditory organs are damaged and cannot deliver speech information to the brain, although their articulatory organs are intact. Such people can learn to speak and communicate with hearing people if they receive special training through a visual training system for a period of time.

Visual speech-training systems that help the deaf learn to speak have been widely researched at home and abroad since the mid-1960s, but most of them use a single speech feature to produce the displayed image. These methods not only give low recognition rates but are also hard for the deaf to accept because the displays are too specialized.

Based on the principles of speech production and perception, especially the way speech information is transferred within the human brain, and drawing on the strengths of current speech-processing techniques including the wavelet transform, auditory models, artificial neural networks and manifold learning, this thesis puts forward a parametric description of speech in the brain's perceptual system and a novel speech recognition method whose result is displayed as an image. Compared with traditional methods, the principle of the new method is easy to understand and cheap to compute. The thesis also attempts to prove that the perception of speech (at least of vowels) is a simple topological map. The resulting figures are easy to distinguish, and the deaf can recognize speech after only simple training by exploiting their strong visual compensation ability. The innovations of the thesis are as follows (illustrative code sketches follow this abstract):

(a) The thesis reviews in detail the research status of traditional speech recognition technology and of speech-training technology for the hearing impaired, and demonstrates the feasibility and applicability of converting speech to images through a systematic study of speech production and perception. The speech spectrum representations in current use are investigated in depth, and the principles, advantages and disadvantages of each method are given. Finally, building on traditional speech feature extraction methods including LPCC, MFCC and PLP, the thesis puts forward a concept and method for automatic speech feature extraction based on the principles of artificial neural networks and manifold learning.

(b) The thesis describes a new speech visualization method that creates readable patterns by integrating a combined feature into a single image. The system uses time-frequency analysis based on the wavelet transform to simulate the band-pass filtering property of the basilar membrane. The auditory feature is displayed on screen as a plotted pattern, and the deaf can use their own brains to identify different speech sounds, training their oral ability effectively.

(c) The thesis describes a novel speech visualization method that creates a readable pattern based on a temporal self-organizing map (TSOM). Building on the SOM, the TSOM introduces a time-enhancement mechanism to improve system performance, remedying the defect that the SOM provides only a spatial topographic map and ignores the temporal factor, which is extremely important for speech signals. Consecutive short-time spectra form a trajectory on the map, from which changes over time can be observed.

(d) The thesis describes a novel speech visualization method that creates a readable pattern based on temporal linear embedding (TLE). LLE is an unsupervised learning algorithm for feature extraction. If speech variability can be described by a small number of continuous features, the data can be imagined as lying on a low-dimensional manifold in the high-dimensional space of speech waveforms; the goal of feature extraction is then to reduce the dimensionality of the speech signal while preserving its informative signatures. The thesis presents results from the analysis and visualization of speech data using PCA and LLE, and observes that the nonlinear embeddings of LLE separate certain phonemes better than the linear projections of PCA.

(e) The thesis describes a novel speech visualization method that creates a readable pattern based on an auditory model. The method uses a Gammatone auditory filter bank and the Meddis inner-hair-cell model to obtain an auditory correlogram expressing the activity of the auditory nerve; the frequency-component amplitudes of each band of the correlogram are then coded as a feature vector characterizing that band. The auditory model extracts the critical information of the speech signal and presents more frequency information than conventional acoustic processing techniques such as the spectrogram.
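The wavelet filter bank of contribution (2)/(b) can be sketched briefly. The snippet below is a minimal illustration, not the thesis's implementation: it assumes a dyadic wavelet decomposition (Daubechies-4, six levels, via the PyWavelets package) as a stand-in for the auditory-model filter bank, and log subband energies as a stand-in for the combined feature code. The dyadic structure gives finer time resolution at high frequencies, which is exactly the STFT drawback the thesis addresses.

```python
# Sketch: dyadic wavelet decomposition as a crude stand-in for the
# auditory-model filter bank of contribution (2).
# Assumptions (not from the thesis): Daubechies-4 wavelet, 6 levels,
# log subband energy per frame as the "combined feature".
import numpy as np
import pywt

def wavelet_subband_features(frame, wavelet="db4", level=6):
    """Log energy of each wavelet subband of one speech frame."""
    coeffs = pywt.wavedec(frame, wavelet, level=level)
    # coeffs[0] is the approximation (lowest band); the rest are detail
    # bands from coarse (low frequency) to fine (high frequency), so
    # time resolution grows with frequency, as in the auditory system.
    return np.array([np.log(np.sum(c ** 2) + 1e-12) for c in coeffs])

if __name__ == "__main__":
    fs = 16000
    t = np.arange(0, 0.032, 1 / fs)          # one 32 ms frame
    frame = np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 2500 * t)
    print(wavelet_subband_features(frame))   # 7 subband log energies
```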
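For the TSOM of contribution (3)/(c), the abstract does not spell out the time-enhancement mechanism, so the sketch below assumes one plausible variant: the winner search is penalized by grid distance from the previous winner, so consecutive short-time spectra trace a smooth trajectory on the 2-D map. Map size, input dimension, training schedule and penalty weight are illustrative choices, and random vectors stand in for real spectra.

```python
# Sketch: a toy temporal self-organizing map (TSOM) trajectory.
# Assumption (not from the thesis): temporal enhancement is modelled by
# biasing the winner search toward grid neighbours of the last winner.
import numpy as np

GRID, DIM = 10, 33                     # map size and spectrum size (assumed)
coords = np.stack(np.meshgrid(np.arange(GRID), np.arange(GRID),
                              indexing="ij"), axis=-1)

def train_som(weights, frames, epochs=20, lr0=0.5, sigma0=3.0):
    """Plain SOM training: shrinking learning rate and neighbourhood."""
    for e in range(epochs):
        lr = lr0 * (1 - e / epochs)
        sigma = max(sigma0 * (1 - e / epochs), 0.5)
        for x in frames:
            d = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(d), d.shape)
            g = np.exp(-np.sum((coords - bmu) ** 2, axis=-1) / (2 * sigma ** 2))
            weights += lr * g[..., None] * (x - weights)

def tsom_trajectory(weights, frames, penalty=0.2):
    """Winner sequence with a temporal-continuity penalty (assumption)."""
    path, prev = [], None
    for x in frames:
        d = np.linalg.norm(weights - x, axis=-1)
        if prev is not None:                  # bias toward the last winner
            d = d + penalty * np.sqrt(np.sum((coords - prev) ** 2, axis=-1))
        prev = np.unravel_index(np.argmin(d), d.shape)
        path.append(prev)
    return path

rng = np.random.default_rng(0)
w = rng.normal(size=(GRID, GRID, DIM))
spectra = rng.normal(size=(200, DIM))         # stand-in short-time spectra
train_som(w, spectra)
print(tsom_trajectory(w, spectra[:10]))       # trajectory on the map plane
```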
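Contribution (4)/(d) compares PCA with LLE embeddings of speech frames. A minimal scikit-learn sketch follows; because the abstract does not define the TLE algorithm, stacking each frame with its temporal neighbours before embedding is used here as a rough stand-in for incorporating temporal structure, and random MFCC-like frames stand in for real speech.

```python
# Sketch: PCA vs. locally linear embedding (LLE) on speech frames.
# Assumption (not from the thesis): temporal context is supplied by
# concatenating each frame with its +/- 2 neighbours before embedding.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding

def context_windows(frames, width=2):
    """Stack each frame with +/- width neighbours (temporal context)."""
    pad = np.pad(frames, ((width, width), (0, 0)), mode="edge")
    return np.hstack([pad[i:i + len(frames)] for i in range(2 * width + 1)])

rng = np.random.default_rng(1)
frames = rng.normal(size=(300, 13))           # stand-in MFCC-like frames
X = context_windows(frames)

pca2 = PCA(n_components=2).fit_transform(X)   # classical linear baseline
lle2 = LocallyLinearEmbedding(n_neighbors=12,
                              n_components=2).fit_transform(X)
print(pca2.shape, lle2.shape)                 # (300, 2) (300, 2)
```

With real vowel frames in place of the random data, plotting the two 2-D embeddings side by side reproduces the kind of comparison the thesis reports, where the nonlinear embedding separates certain phonemes better than the linear projection.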
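Contribution (5)/(e) derives an auditory correlogram from a Gammatone filter bank and the Meddis inner-hair-cell model. The sketch below uses the standard fourth-order gammatone impulse response with ERB-scaled bandwidths, but, as a simplification, replaces the Meddis model with half-wave rectification plus a low-pass filter; channel spacing and lag range are illustrative.

```python
# Sketch: gammatone filter bank and a crude auditory correlogram.
# Assumption (not from the thesis): half-wave rectification + low-pass
# stands in for the Meddis inner-hair-cell model.
import numpy as np
from scipy.signal import butter, lfilter

def gammatone_ir(fc, fs, dur=0.025, order=4):
    """Impulse response t^(n-1) exp(-2*pi*b*t) cos(2*pi*fc*t), b = 1.019*ERB."""
    t = np.arange(0, dur, 1 / fs)
    b = 1.019 * (24.7 * (4.37 * fc / 1000 + 1))
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def correlogram(x, fs, centre_freqs, max_lag=320):
    lp_b, lp_a = butter(2, 1000 / (fs / 2))        # 1 kHz envelope low-pass
    rows = []
    for fc in centre_freqs:
        y = lfilter(gammatone_ir(fc, fs), [1.0], x)    # basilar-membrane band
        env = lfilter(lp_b, lp_a, np.maximum(y, 0.0))  # crude hair-cell stage
        ac = np.correlate(env, env, mode="full")[len(env) - 1:]
        rows.append(ac[:max_lag] / (ac[0] + 1e-12))    # normalized autocorr.
    return np.array(rows)                              # one row per band

fs = 16000
t = np.arange(0, 0.05, 1 / fs)
x = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 600 * t)
cfs = np.geomspace(100, 4000, 16)                      # 16 ERB-like channels
print(correlogram(x, fs, cfs).shape)                   # (16, 320)
```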

  • 【Online Publication Submitter】 东北大学 (Northeastern University)
  • 【Online Publication Issue】 2010, No. 08
  • 【CLC Numbers】 TN912.3; TP391.41
  • 【Times Cited】 1
  • 【Downloads】 722