Research on Context-Based Audio and Video Annotation

【Author】 钟岑岑 (Zhong Cencen)

【Supervisor】 苗振江 (Miao Zhenjiang)

【Author Information】 Beijing Jiaotong University, Signal and Information Processing, 2014, Doctoral dissertation

【Abstract】 With the rapid development of computer and network technologies, multimedia data such as audio and video are growing explosively. To make these massive data easier to organize and exploit, their content is commonly described at several levels: low-level features, structural information, and semantic features. Among these, the semantic feature, being closest to human understanding, has received the most attention, and machine learning-based audio and video annotation, as a fast and effective way to obtain such descriptions, has become an active research topic. However, owing to the well-known "semantic gap" between low-level features and high-level semantics, satisfactory annotation performance is hard to achieve by improving the learning algorithms alone. Making full use of the contextual cues underlying the rich content of audio and video, such as semantic correlation, temporal correlation, and multi-modal correlation, helps to narrow this gap and improve annotation accuracy. Taking context-based audio and video annotation as its starting point, this thesis analyzes several key problems in current annotation methods and studies the mining, modeling, and exploitation of the three kinds of context mentioned above. The main contributions are as follows:

(1) To address the insufficient use of semantic-correlation context in audio annotation, we propose the Correlated-Aspect Gaussian Mixture Model for multi-label audio concept detection and explore a topic feedback-based keyword spotting method. As semantic descriptions of audio-visual content, annotation units exhibit contextual relations such as co-occurrence and mutual constraint. Starting from generic audio and from speech as a special case, the thesis discusses how to mine and exploit this semantic-correlation context. For multi-label concept detection on generic audio, conventional methods ignore the correlation among semantic concepts; our algorithm embeds this correlation into the Gaussian Mixture Model framework to guide detection, so that concepts that are hard to detect are reinforced by those that are easy to detect, improving detection accuracy. For speech, starting from how speech is produced, we build text categorization-based topic models of the speaker's original intention and use them as high-level semantic context to reject false alarms in the initial keyword spotting results; the approach is validated in a spoken document retrieval application.

(2) We analyze the limitation of the generic concept correlation commonly used in video annotation and propose a data-specific two-view concept correlation estimation algorithm. Concept correlation within the semantic-correlation context plays a guiding role in annotation, but a generic correlation applied uniformly to all data cannot correctly describe the concept distribution of each individual item, so annotation guided by it often falls short of expectations. We therefore estimate the spatial and temporal concept correlations implied by a specific shot and shot pair, formulating the estimation as a problem of data decomposition and reconstruction. Within a probability calculation-based video annotation refinement scheme, experiments on the TRECVID 2006-2008 datasets and comparisons with other methods show that the estimated correlations reflect the semantic content of the data themselves and refine the initial results of individual concept detectors more effectively.

(3) Starting from the modeling of video temporal consistency, we propose graph regularized probabilistic Latent Semantic Analysis with Gaussian Mixtures (GRGM-pLSA) and a feature conversion algorithm for video concept detection. The temporal nature of video implies that temporally consecutive segments tend to share similar visual and semantic content. Based on this temporal-consistency context, GRGM-pLSA uses graph-based manifold regularization to model the interdependence between terms that is ignored in the original pLSA with Gaussian Mixtures (GM-pLSA). Besides feature mapping, the model also serves as a generative model; the resulting visual-to-textual feature conversion algorithm exploits the contextual information implied by video structure and overcomes the limitations of pLSA-based probabilistic annotation methods when applied to video. Experiments on YouTube and TRECVID datasets demonstrate the effectiveness of the model and the feature conversion algorithm.

(4) To make effective use of multi-modal correlation context, we propose multi-modal pLSA with Gaussian Mixtures (MMGM-pLSA) and its generalized form, graph regularized MMGM-pLSA (GRMMGM-pLSA). The audio, visual, and other modality features describing the same video segment are correlated and complementary, so a reasonable fusion scheme should preserve the characteristics of each modality while maintaining the interdependence between them. MMGM-pLSA casts multi-modal fusion as the modeling of multi-modal terms within the GM-pLSA framework, assigning each modality its own Gaussian mixture to describe its feature distribution, and performs audio-visual fusion effectively in classification-based video annotation. Building on it, GRMMGM-pLSA additionally models the intrinsic correlation among multi-modal terms; as the generalization of GM-pLSA, MMGM-pLSA, and GRGM-pLSA, it models the multi-modal and temporal-consistency contexts of video simultaneously.
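The abstract only outlines the proposed algorithms, so the following sketch is a loose illustration rather than a reproduction of the thesis method. It shows, in Python, the general pattern behind contributions (1) and (2): one Gaussian mixture detector per concept, with the initial scores refined through a concept-correlation (co-occurrence) matrix so that correlated concepts reinforce each other. The function names, the row-normalized co-occurrence matrix, and the linear blending step are illustrative assumptions; they are not the Correlated-Aspect GMM or the two-view correlation estimation described above.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_concept_gmms(features_per_concept, n_components=4, seed=0):
    # Fit one GMM per concept on the pooled frame-level features of its positive clips.
    gmms = {}
    for concept, feats in features_per_concept.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=seed)
        gmm.fit(feats)                                   # feats: (n_frames, feat_dim)
        gmms[concept] = gmm
    return gmms

def detect(gmms, clip_feats, correlation, alpha=0.7):
    # Score one clip with every concept GMM, then blend the normalized scores with
    # scores propagated through a row-normalized concept co-occurrence matrix.
    concepts = list(gmms)
    raw = np.array([gmms[c].score(clip_feats) for c in concepts])  # mean log-likelihoods
    scores = np.exp(raw - raw.max())
    scores = scores / scores.sum()                       # crude normalization
    refined = alpha * scores + (1.0 - alpha) * correlation.dot(scores)
    return dict(zip(concepts, refined))

In this sketch, correlation[i, j] would be estimated from how often concept j co-occurs with concept i in the training annotations; setting alpha to 1 recovers plain, independent per-concept detectors.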
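As a rough guide to contributions (3) and (4), the objective of a graph-regularized pLSA with Gaussian mixtures can be written in the following generic form; the notation and the particular regularizer are assumptions of this sketch and may differ from the exact formulation in the thesis:

\mathcal{L} = \sum_{d}\sum_{i} \log \sum_{z} P(z \mid d)\, p(x_{d,i} \mid z),
\qquad
p(x \mid z) = \sum_{m} \pi_{z m}\, \mathcal{N}(x;\, \mu_{z m}, \Sigma_{z m}),

where each latent topic z generates continuous features x_{d,i} through its own Gaussian mixture (one mixture per modality in the multi-modal variant). The graph-regularized model then maximizes

\mathcal{L} \;-\; \lambda \sum_{i,j} W_{ij}\, \bigl\| P(\cdot \mid x_i) - P(\cdot \mid x_j) \bigr\|^2,

with W_{ij} encoding the temporal (or cross-modal) affinity between elements x_i and x_j, so that temporally adjacent segments are driven toward similar topic posteriors; setting \lambda = 0 recovers the unregularized GM-pLSA objective.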
