Video Semantic Understanding with Multi-modality Feature Fusion and Variable Selection
(多模态特征融合和变量选择的视频语义理解)

【Author】 刘亚楠

【Supervisors】 庄越挺; 吴飞

【Author Information】 Zhejiang University, Computer Science and Technology, 2010, Ph.D.

【Abstract (Chinese)】 With the rapid development of computer technology and Internet applications, multimedia data, and video data in particular, are growing at a massive rate. How to effectively store, manage, transmit, retrieve and use these multimedia data is a major challenge and an urgent research problem. Video data carry rich semantics and are temporally ordered; a video contains three kinds of media, namely image, audio and text, which exhibit temporal associated co-occurrence. Focusing on the temporal associations among the multiple modalities in video data, this thesis performs video semantic analysis and understanding through feature fusion and variable selection.

In video semantic understanding and mining, fully exploiting the interaction and correlation among multi-modality media such as image, audio and text is an important research direction. Considering the multi-modality and temporal associated co-occurrence characteristics of video, a semantic concept detection method based on multi-modality subspace correlation propagation is proposed to mine video semantics. Starting from the multi-modality low-level features extracted from video shots, the method derives shot-to-shot similarity relations by propagating correlations across the multi-modality subspaces with co-occurrence data embedding and similarity fusion; it then reduces the dimensionality of the original data with locality preserving projections to obtain coordinates in a low-dimensional semantic space, and finally trains a classification model with the annotation information, so that semantic concepts can be detected for test data outside the training set and video semantic mining is achieved. Experiments show that this method attains high accuracy.

Apart from the "curse of dimensionality" caused by the high-dimensional vectors produced by conventional vector-based video representations, concatenating different types of features also causes an "over-compression" problem during dimensionality reduction when the feature vectors are too long and the training samples are insufficient, so that a large amount of information is lost. Moreover, simply concatenating different types of features into one vector weakens or ignores the temporal associated co-occurrence among the multiple modality features in video. To address this, a video semantic analysis and understanding framework based on a higher-order tensor representation is proposed. In this framework, a video shot is first represented as a 3rd-order tensor built from the textual, visual and auditory multi-modality data contained in the video. Next, based on this 3rd-order tensor representation and the temporal associated co-occurrence of video, a subspace embedding and dimensionality reduction method called the "tensor shot" is designed. Finally, since semi-supervised learning can learn and recognize particular unknown samples from known samples, a transductive support tensor machine algorithm based on tensor shots and two active-learning-based post-refinement strategies are proposed within this framework; they not only preserve the intrinsic structure of the manifold space in which the tensor shots lie, but also map out-of-sample data directly into the manifold subspace, while making full use of unlabeled samples to improve the classifier's performance. Experimental results show that this method can effectively detect semantic concepts in video shots.

To make more effective use of labeled samples, a classification method based on (non-negative) group sparse representation is proposed for image and video classification, building on compressive sensing and sparse representation theory and combining sparse representation, non-negative matrix factorization and supervised learning. Its basic idea is to represent a test sample as a weighted linear combination of the training samples: under a non-negative l1 regularization constraint, a regression coefficient is estimated for every training sample and a weighting coefficient for every class, so that during training all samples within a class are selected or discarded simultaneously according to the sparse coefficients. In addition, the non-negative regression weights make the video and image understanding process more interpretable. The advantage of the (non-negative) group sparse representation classification method is that it effectively uses class information to perform variable selection for video and image data, which not only improves semantic classification accuracy but also makes the process more interpretable.
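
For reference, the following is a minimal sketch of the standard locality preserving projection (LPP) objective underlying the dimensionality-reduction step of the first method above; the notation (data matrix X, affinity matrix W, projection vector a) is generic and the exact variant used in the thesis may differ.

\[
\min_{a}\;\sum_{i,j}\bigl(a^{\top}x_i - a^{\top}x_j\bigr)^{2} W_{ij}
\quad \text{s.t.} \quad a^{\top} X D X^{\top} a = 1,
\]

which reduces to the generalized eigenvalue problem \(X L X^{\top} a = \lambda\, X D X^{\top} a\) with \(L = D - W\) and \(D_{ii} = \sum_{j} W_{ij}\). In this setting, \(W_{ij}\) would be the fused shot-to-shot similarity obtained from the co-occurrence data embedding and similarity fusion steps.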

【Abstract】 With the recent advances in computer technologies and Internet applications, the number of multimedia files and archives has increased dramatically, and video data constitute the majority. Efficient content-based video storage, management, indexing, browsing and retrieval have therefore become important research topics. Video data carry rich semantics such as people, objects, events and stories. In general, video data are composed of three low-level modalities, namely image, audio and text, and these multiple modalities are in essence characterized by temporal associated co-occurrence (TAC). Considering the TAC of the multiple modalities of video data, this thesis proposes effective feature fusion and variable selection schemes to better analyze video semantic content.

Interaction and integration of multi-modality media types such as visual, audio and textual data in video are the essence of video content analysis, and a great deal of research has focused on utilizing multi-modality features for better understanding of video semantics. We propose a new approach to detect semantic concepts in video using Co-Occurrence Data Embedding (CODE), SimFusion, and Locality Preserving Projections (LPP) on the temporally associated, co-occurring multi-modality media data in video. CODE is a method for embedding objects of different types into the same low-dimensional Euclidean space based on their co-occurrence statistics. SimFusion is an effective algorithm for reinforcing or propagating the similarity relations between multiple modalities. LPP is a dimensionality reduction method that combines the benefits of linear and nonlinear techniques. Our experiments show that by employing these key techniques, we can improve the performance of video semantic concept detection and obtain better video semantics mining results.

Traditionally, the multi-modality media features in video are represented merely by concatenated vectors, whose high dimensionality causes the "curse of dimensionality" problem. In addition, an over-compression problem occurs when the sample vectors are very long and the number of training samples is small, which results in loss of information during dimensionality reduction. This thesis therefore proposes a higher-order tensor framework for video analysis and understanding. In this framework, we represent the image frame, audio and text modalities of a video shot as a data point given by a 3rd-order tensor. We then propose a novel video representation and dimensionality reduction method, called the TensorShot approach, which explicitly considers the manifold structure of the tensor space formed by the temporally sequenced, co-occurring multi-modality media data. Semi-supervised learning uses a large amount of unlabeled data together with the labeled data to build better classifiers; we propose a new transductive support tensor machine algorithm to train an effective classifier, together with an active-learning-based contextual and temporal post-refinement strategy to enhance detection accuracy. Our algorithm preserves the intrinsic structure of the submanifold from which tensor shots are sampled and is also able to map out-of-sample data points directly, while the utilization of unlabeled data builds better classifiers. Experimental results show that our method improves the performance of video semantic concept detection.
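
As a point of reference only, below is a minimal sketch of the standard multilinear projection on which tensor-based dimensionality reduction of this kind is usually built; the notation is generic, and the thesis's actual TensorShot embedding and transductive support tensor machine are its own constructions that may differ from this sketch.

\[
\mathcal{Y} \;=\; \mathcal{X} \times_1 U_1^{\top} \times_2 U_2^{\top} \times_3 U_3^{\top} \;\in\; \mathbb{R}^{d_1 \times d_2 \times d_3},
\qquad
\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times I_3},\;
U_k \in \mathbb{R}^{I_k \times d_k},\; d_k \ll I_k,
\]

where \(\times_k\) denotes the mode-\(k\) product and \(\mathcal{X}\) is the 3rd-order tensor representing one shot. Manifold-preserving variants choose the projection matrices \(U_k\) to minimize a locality-weighted objective such as \(\sum_{i,j} \lVert \mathcal{Y}_i - \mathcal{Y}_j \rVert_F^{2} W_{ij}\) over neighboring shots \(i, j\).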
Based on compressive sensing and sparse representation theories, together with the ideas of non-negative matrix factorization and supervised learning, this thesis develops a novel approach to image and video representation, classification and retrieval, which we call group sparse representation. The basic idea is to represent a test image as a weighted combination of all the training images. In particular, we introduce two sets of weight coefficients, one for each training image and one for each class; the latter performs variable selection at the class level. Moreover, owing to the non-negative nature of image and video features, we impose non-negativity constraints on the coefficients, which makes the classifier an interpretable, additive model. Specifically, we formulate the problem as a group non-negative garrote model. The resulting representations are sparse and well suited to discriminant analysis.
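
As an illustration only, one plausible way to write the group non-negative garrote model described above is given below; the symbols (\(X_c\) for the training samples of class \(c\), \(\hat{\beta}_c\) for initial within-class coefficients, \(d_c\) for the per-class weights, \(\lambda\) for the regularization parameter) are assumptions made for this sketch rather than the thesis's exact notation.

\[
\min_{d_1,\dots,d_C \,\ge\, 0}\;\;
\frac{1}{2}\Bigl\lVert\, y \;-\; \sum_{c=1}^{C} d_c\, X_c \hat{\beta}_c \,\Bigr\rVert_2^{2}
\;+\; \lambda \sum_{c=1}^{C} d_c ,
\]

so that a zero \(d_c\) discards every training sample of class \(c\) at once, while a non-zero \(d_c\) retains the whole class with non-negative weight. The test sample \(y\) can then be assigned to the class whose retained samples reconstruct it with the smallest residual.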

  • 【Online Publication Institution】 Zhejiang University
  • 【Online Publication Year/Issue】 2010, Issue 12
  • 【CLC Classification Number】 TP391.41
  • 【Cited By】 13
  • 【Downloads】 1344