语音情感识别的研究与应用

Research and Application of Speech Emotion Recognition

【Author】 Liu Jia

【Supervisor】 Chen Chun

【Author Information】 Zhejiang University, Computer Science and Technology, 2009, Ph.D.

【Abstract】 With the development of human-computer interaction technology, research on human-computer interfaces has gradually moved from the era of mechanization into the era of multimedia user interfaces. As one of the key technologies of intelligent human-computer interaction, speech emotion analysis and recognition has become a research hot spot. Researchers from many fields are concerned with how to make computers automatically recognize a speaker's emotional state from speech signals and respond in a more targeted and more human way. This paper first summarizes the research significance of speech emotion recognition and the main contents of the work, and then reviews several key issues in current studies of speech emotion, including the categorization of emotional states, an overview of emotional corpora, acoustic features of speech signals, feature dimensionality reduction, classification algorithms, and semi-supervised speech emotion classification.

This paper presents several models for feature selection and feature extraction. Speech emotion recognition based on a fusion of all-class and pairwise-class feature selection is a new model structure: it focuses on the discrimination between every pair of emotional states while also taking the overall distribution of the samples into account, so both all-class and pairwise-class feature selection are involved. This structure is suitable for many classification algorithms and effectively improves the performance of the recognition system. Feature selection based on a feature projection matrix uses the projection matrix obtained during feature extraction to evaluate the importance of each initial acoustic feature, and then selects a feature subset accordingly. Experimental results show that this feature selection algorithm outperforms feature extraction methods that simply use the projection matrix to map the data.
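The projection-matrix ranking described above can be sketched as follows. This is an illustrative reconstruction, not the dissertation's code: it assumes a projection matrix `W` of shape (features × components), as produced by e.g. PCA or LDA, and scores each original acoustic feature by the norm of its row in `W`:

```python
import numpy as np

def rank_features_by_projection(W, k):
    """Score each original feature by the L2 norm of its row in the
    projection matrix W (n_features x n_components), then keep the
    top-k features. Illustrative sketch of projection-matrix-based
    feature selection; names and details are assumptions."""
    scores = np.linalg.norm(W, axis=1)       # importance of each feature
    selected = np.argsort(scores)[::-1][:k]  # indices of the top-k features
    return np.sort(selected), scores

# Toy example: 4 acoustic features projected onto 2 components.
W = np.array([[0.9, 0.1],
              [0.0, 0.0],   # this feature contributes nothing to the projection
              [0.3, 0.8],
              [0.1, 0.2]])
selected, scores = rank_features_by_projection(W, k=2)
```

Unlike plain feature extraction, which keeps the mapped components, this keeps a subset of the original interpretable acoustic features, which is the advantage the abstract refers to.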
Through analysis of the data, a hierarchical feature-extraction framework for speech emotion recognition selects different dimensionality reduction algorithms for corpora of different genders and different emotional states. This idea can be extended to other corpora: constructing a suitable recognition system based on hierarchical dimensionality reduction improves overall recognition performance. The enhanced Lipschitz embedding algorithm based on manifold learning is a nonlinear dimensionality reduction algorithm: by computing geodesic distances, it maps high-dimensional feature vectors into a low-dimensional subspace. The algorithm dramatically improves recognition accuracy in speaker-dependent and speaker-independent speech emotion recognition under a controlled laboratory environment, as well as in speaker-dependent recognition under Gaussian white noise and sinusoidal noise.

In traditional speech emotion recognition systems, the acoustic features are simply concatenated into a feature vector that serves as the classifier input. Speech emotion recognition based on covariance descriptors and the Riemannian manifold instead considers the correlations between different acoustic features. Experimental results show that these correlations reflect the emotional information in speech, and that a recognition system built on them is highly stable and robust to noise.

With only a small number of labeled samples and a large number of unlabeled samples available, this paper presents an enhanced co-training algorithm that builds a classification model based on semi-supervised learning. It improves the standard co-training algorithm by introducing a consistency restriction on label predictions.
This algorithm reduces classification noise and improves the performance of the classifiers. Finally, considering the practical use of speech emotion research, this paper proposes an AdaBoost+C4.5 classification model to analyze the emotional states of real-time speech signals. We realize a fully real-time emotion recognition model and apply it in a real-time facial animation system driven by emotional speech.
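The prediction-consistency restriction behind the enhanced co-training algorithm above can be illustrated with a minimal sketch. This is not the thesis's implementation: it uses nearest-centroid classifiers as stand-ins for the real view learners, a `-1` placeholder for unlabeled samples, and admits an unlabeled sample to the training set only when the classifiers on both feature views agree on its label:

```python
import numpy as np

def centroid_fit(X, y):
    """Nearest-centroid 'view classifier' (stand-in for the real learners)."""
    return {int(c): X[y == c].mean(axis=0) for c in np.unique(y)}

def centroid_predict(model, X):
    classes = sorted(model)
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[d.argmin(axis=0)]

def co_train(X1, X2, y, labeled, rounds=5):
    """Co-training with a prediction-consistency restriction: an unlabeled
    sample is pseudo-labeled only when the classifiers trained on the two
    feature views agree on its label. Sketch of the idea only."""
    labeled = set(labeled)
    labels = y.copy()                      # -1 marks unlabeled samples
    for _ in range(rounds):
        idx = sorted(labeled)
        m1 = centroid_fit(X1[idx], labels[idx])
        m2 = centroid_fit(X2[idx], labels[idx])
        unlabeled = [i for i in range(len(X1)) if i not in labeled]
        if not unlabeled:
            break
        p1 = centroid_predict(m1, X1[unlabeled])
        p2 = centroid_predict(m2, X2[unlabeled])
        added = False
        for i, a, b in zip(unlabeled, p1, p2):
            if a == b:                     # consistency restriction
                labels[i] = a
                labeled.add(i)
                added = True
        if not added:                      # no agreed samples left to add
            break
    return labels

# Toy example: two well-separated classes seen through two feature views.
X1 = np.array([[0.0], [0.2], [0.1], [5.0], [5.1], [4.9]])
X2 = np.array([[1.0], [1.1], [0.9], [8.0], [7.9], [8.2]])
y  = np.array([0, -1, -1, 1, -1, -1])      # only samples 0 and 3 are labeled
labels = co_train(X1, X2, y, labeled=[0, 3])
```

Requiring agreement between the two views is what filters out the noisy pseudo-labels that a standard co-training round would otherwise inject into the training set.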

  • 【Online Publisher】 Zhejiang University
  • 【Online Publication Year/Issue】 2011, Issue 03