基于语音信号的情感识别研究

A Study on Recognition of Emotions in Speech

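【Author】 Jin Xuecheng (金学成)

【Supervisor】 Wang Zengfu (汪增福)

【Author information】 University of Science and Technology of China, Pattern Recognition and Intelligent Systems, 2007, PhD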

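【Summary】 Speech is an important means of human communication and the most convenient, basic, and direct way for people to pass information to one another. Besides semantic content, a speech signal also carries emotional information, and emotion plays an important role in everyday life and communication. With the rapid development of human-computer interaction technology, the emotional information in speech is therefore drawing growing attention from researchers. As an important research direction within emotional information processing for speech, speech emotion recognition is key to a computer's understanding of human emotion and a prerequisite for intelligent human-computer interaction. However, research on human emotion is still exploratory, and there is as yet no agreed definition or representation of emotion. Emotion is also strongly social and cultural, and the speech signal is itself complex; together these factors make speech emotion recognition very difficult. The field is still at an early stage, and emotional speech corpora, emotional features, and emotion modeling and recognition methods all need deeper study.

With the goal of building a speech emotion recognition system that does not depend on the speaker or the text content, this thesis investigates emotional speech databases, extraction of acoustic feature parameters, analysis and selection of emotional features, emotion dimension spaces, and speech emotion modeling and recognition. Based on an analysis of a large amount of emotional speech material, it proposes two speech emotion modeling methods, which provide a theoretical and technical framework for speech emotion recognition and lay some groundwork for natural human-computer interaction. Using these two emotion models, the thesis develops two speech emotion recognition algorithms and builds a speaker- and text-independent emotion recognition system for Chinese speech. The innovations and main contributions are as follows:

(1) Motivated by the needs of emotional feature extraction, a fundamental frequency (F0) estimation algorithm based on a modified cepstrum and dynamic programming is proposed. Using the different behavior of the cepstrum, short-time energy, and short-time zero-crossing rate in unvoiced and voiced segments, the algorithm constructs a voiced/unvoiced decision function, which simplifies the voicing decision and greatly improves its accuracy. To obtain realistic F0 estimates with a smooth trajectory, dynamic programming is used for F0 tracking. Because the continuity of F0 is fully taken into account, the algorithm effectively avoids pitch-doubling and pitch-halving errors and offers high accuracy and a smooth F0 contour.

(2) The relationships between emotional states and acoustic features of speech, such as prosody and vocal-tract formants, are analyzed in detail, both qualitatively and quantitatively, and several conclusions of practical significance are drawn. The analysis shows that short-time energy helps somewhat in distinguishing emotional states but has clear shortcomings, whereas the distribution of signal energy across frequency bands matters a great deal; in particular, the proportion of total energy that lies below 250 Hz is an important feature for distinguishing emotional states. The thesis also analyzes how the F0 contour and the derivative of the F0 trajectory relate to emotional states. The analysis reveals large differences between male and female speakers in the distribution of emotional feature parameters. On this basis, a gender discrimination method is proposed that uses the mean, range, and variance of F0 as features and a Fisher linear discriminant function as the classifier. Experiments show that, after training, the method achieves a very high rate of correct discrimination.

(3) A three-dimensional emotion space model is proposed. Listening experiments are used to determine the positions of several basic emotions in the emotion space, and the correlations between the prosodic and voice quality features of the speech signal and the different emotion dimensions are analyzed quantitatively.

(4) From the standpoint of emotion modeling, and given that emotion is both continuous and discrete, the concept of a data field is introduced into emotion modeling. The concepts of emotion field and emotional potential are proposed, together with improvements to the computation of the potential function. The centers of the basic emotion classes in the emotion space are located by optimizing the potential function. The emotion at any point in the space is then regarded as a composite of several basic emotions: the contribution of each basic emotion to that point is determined by the emotional potential its center produces there, and the magnitude of that potential gives the degree to which the emotion at the point belongs to that basic emotion. Based on this idea, an emotion-field-based recognition method for Chinese speech is developed, which achieves a higher recognition rate than traditional speech emotion recognition methods.

(5) Based on the correlations between prosodic features and arousal, and between voice quality features and valence (pleasure), an emotion modeling method based on emotion dimensions is proposed. Prosodic and voice quality features are used to build an arousal probability model and a valence probability model, respectively, for each emotion; the probability outputs of each emotional speech sample on the 12 dimension models are then used as features to train the emotion category models. Gaussian mixture models (GMMs) are used to build the emotion dimension models, and a method is proposed for estimating the initial GMM parameters by cluster analysis of the training samples. For final recognition, a support vector machine (SVM) is used to construct a six-class emotion recognizer. Experiments on Chinese speech emotion recognition with this emotion dimension model achieve a higher recognition rate than the emotion field method.

As a new attempt, the two speech emotion modeling methods proposed in this thesis have a sound theoretical basis and good practical results, and they lay a good foundation for future research on speech emotion modeling and recognition.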
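Contribution (2) singles out the proportion of signal energy below 250 Hz as a feature. The record does not say how the thesis computes it, so the following is only a minimal sketch under stated assumptions: one frame of samples, a naive one-sided DFT, and no windowing or pre-emphasis. The function name `low_band_energy_ratio` and its parameters are hypothetical.

```python
import cmath
import math


def low_band_energy_ratio(samples, sample_rate, cutoff_hz=250.0):
    """Return the fraction of spectral energy at or below cutoff_hz.

    Assumption: a plain one-sided DFT of a single frame. The thesis may
    use a different frame length, window, or spectral estimator.
    """
    n = len(samples)
    low = 0.0
    total = 0.0
    for k in range(n // 2 + 1):
        # Naive DFT coefficient for bin k (O(n^2) overall, fine for short frames).
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        power = abs(coeff) ** 2
        total += power
        # Bin k is centered at k * sample_rate / n Hz.
        if k * sample_rate / n <= cutoff_hz:
            low += power
    # A silent frame has no energy; report 0.0 instead of dividing by zero.
    return low / total if total > 0.0 else 0.0
```

In practice an FFT routine would replace the naive transform; the quadratic loop is used here only to keep the sketch dependency-free.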

【Abstract】 Speech is one of the most convenient means of communication between people, and it is one of the fundamental ways of conveying emotion as well as semantic information. Emotion, moreover, plays an important role in communication. Emotion information processing in speech signals has therefore gained increasing attention in recent years, as the need for machines to understand humans well in human-machine interaction has grown. As one of the most important branches of emotion information processing in speech, emotion recognition in speech is fundamental to natural human-machine communication. However, research on human emotion is still at an exploratory stage, and there is still no accepted definition of human emotion. Emotion also has strong social and cultural characteristics, and speech signals contain complex information. All of these factors pose great challenges for emotion recognition in human speech, which is in its infancy.

In order to establish a speaker-independent speech emotion recognition system that does not rely on context or linguistic information, this thesis focuses on emotional speech corpus construction, acoustic feature extraction from speech, analysis and selection of emotional features, the emotion dimension space, emotion modeling, and emotion recognition. Based on the analysis of an adequate number of emotional speech samples, two methods of emotion modeling are presented, which provide a theoretical and technical framework for emotion recognition in spoken language. Based on these studies, two emotion recognition algorithms are implemented and a speaker- and content-independent Mandarin emotion recognition system is completed. The innovative points and main contributions of this thesis are as follows:

(1) An algorithm based on the modified cepstrum is presented for estimating the fundamental frequency (F0) of speech signals. Voicing decisions are made using a decision function composed of the cepstral peak, zero-crossing rate, and energy of short-time segments of the speech signal. An accurate voiced/unvoiced classification is obtained from this decision function. A dynamic programming method is then used for pitch tracking, with the continuity of F0 fully considered in the cost function. The proposed algorithm effectively avoids pitch-doubling and pitch-halving errors while preserving legitimate doubling and halving of F0. It offers high accuracy and a smooth F0 contour that needs no further smoothing.

(2) The thesis analyzes the relationships between emotional states and acoustic features of speech, including prosody and voice quality. It points out the shortcomings of short-time energy for distinguishing emotional states. On the other hand, it finds that the proportion of energy below 250 Hz relative to the total energy is a promising feature for emotion recognition in speech. The characteristics of the pitch contour and the pitch derivative are also analyzed for the purpose of emotion recognition. In addition, differences in emotional acoustic features between male and female speech are identified, and a gender discrimination method is developed based on these findings. In this method, the mean, range, and variance of F0 are used as features, and a Fisher linear discriminant function is used to distinguish male speech from female speech. Experimental results show that the proposed method achieves high accuracy.

(3) A concept of an emotion space model based on results from psychological research is presented, and a perceptual experiment is reported. In the experiment, we studied how the six basic emotions of Mandarin speech are located in the emotion space. Furthermore, we studied the relationships between the prosodic and voice quality features and the mean ratings in the two-dimensional space of arousal and valence.

(4) From the point of view of emotion modeling, the thesis uses an emotion field and emotional potency to describe the emotion space, by introducing the concepts of data field and potential function into emotion modeling. With this method, any emotion in the emotion space can be seen as a composite of all the basic emotions considered in this research. The contribution of each basic emotion to a given emotion is determined by the emotional potency that the former produces at the latter. The center of each basic emotion is found with a hill-climbing algorithm. The emotion recognition algorithm based on this model performs better than the traditional methods.

(5) A dimension-based emotion model is presented according to the relationships between the acoustic features of speech and the emotion dimensions. In this modeling method, prosodic features are used to construct the statistical arousal models and voice quality features are used to construct the statistical valence models. The probability outputs of all these dimension models are then used as the features for building the emotion category models. GMMs are selected to construct the emotion dimension models, and a new algorithm for estimating the initial parameters of the GMMs is proposed based on a clustering method. An SVM is used to build the emotion category models. Experimental results indicate that the emotion recognition algorithm based on this model achieves better performance than the emotion field method.

The two emotion modeling methods proposed in this thesis, which have a scientific foundation and perform well, provide a direction for future work on emotion recognition in spoken language.
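The emotion field idea in contribution (4), where a point in the emotion space is a composite of basic emotions weighted by the potency each center produces there, can be illustrated with a small sketch. The abstract does not give the thesis's potential function, so this uses a Gaussian-like potential of the kind common in data-field methods as an assumption; the function names, the `sigma` and `mass` parameters, and the two-dimensional center coordinates are all hypothetical.

```python
import math


def emotional_potency(point, center, mass=1.0, sigma=1.0):
    """Potency produced at `point` by one basic-emotion center.

    Assumption: a Gaussian-like data-field potential,
    mass * exp(-(distance / sigma)^2). The thesis describes an improved
    potential function that is not specified in the abstract.
    """
    squared_distance = sum((p - c) ** 2 for p, c in zip(point, center))
    return mass * math.exp(-squared_distance / sigma ** 2)


def emotion_memberships(point, centers, sigma=1.0):
    """Normalize the potencies so they read as degrees of membership."""
    potencies = {
        name: emotional_potency(point, center, sigma=sigma)
        for name, center in centers.items()
    }
    total = sum(potencies.values())
    if total == 0.0:
        # The point is too far from every center for the potency to register;
        # fall back to equal membership rather than dividing by zero.
        return {name: 1.0 / len(centers) for name in centers}
    return {name: value / total for name, value in potencies.items()}


# Hypothetical (arousal, valence) centers, for illustration only.
EXAMPLE_CENTERS = {
    "anger": (1.0, -1.0),
    "joy": (1.0, 1.0),
    "sadness": (-1.0, -1.0),
}

if __name__ == "__main__":
    memberships = emotion_memberships((0.8, 0.6), EXAMPLE_CENTERS)
    # Classify by the basic emotion with the largest membership.
    print(max(memberships, key=memberships.get), memberships)
```

Taking the emotion with the largest membership gives a hard label, while the full dictionary keeps the graded, composite view the abstract describes.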

  • 【Classification code】TN912.34
  • 【Citation count】38
  • 【Download count】2214