层次化视频语义标注与检索

Multi-level Video Annotation and Retrieval

【Author】 袁勋

【Supervisor】 吴秀清

【Author Information】 University of Science and Technology of China, Signal and Information Processing, 2008, Ph.D.

【摘要 (Abstract)】 With the development of multimedia, computing, and networks, video data is growing rapidly. To store, manage, and index such massive video collections, efficient content-based retrieval methods are required, and video annotation is the foundation of video indexing and video search. This dissertation studies how to use machine learning techniques and video features to perform multi-level, content-based video annotation. A video is structured into four levels: video, scene, shot, and frame. Annotation is usually performed at the video level and the shot level. Video-level annotation assigns a genre to an entire video clip; shot-level annotation labels each shot with semantic concepts based on key frames extracted from the shot, and is further divided into image-level and object-level annotation according to whether the annotated concept corresponds to a whole frame or to an object. This dissertation investigates the key problems of video annotation at the video level, the image (frame) level, and the object level. The main contributions and innovations are summarized as follows:

1. Existing work on video genre annotation usually labels only a few simple genres, or only sub-genres within one specific genre such as movies or sports, and the classifiers used are often too simple. This dissertation defines a relatively complete hierarchical representation of video genres, analyzes and extracts a set of genre-related spatial and temporal features, and proposes a locally and globally optimized multi-class SVM binary tree to improve classification accuracy. Experiments show that the proposed tree achieves higher accuracy than two other typical multi-class SVM algorithms and the classifiers used in existing video genre classification work.

2. Existing video genre annotation relies on passive supervised learning, which requires large amounts of training data and laborious manual labeling. This dissertation introduces active learning into video genre annotation and proposes using posterior probabilities to measure the classifier's confidence on unlabeled samples; the samples the classifier is least certain about, i.e. the most "informative" ones, are selected for the user to label. Comparable classification performance can thus be obtained with far fewer training samples, reducing the user's labeling burden. Experiments show that the proposed posterior-probability-based selection strategy slightly outperforms the existing version-space-based active learning strategy as well as passive sample selection.

3. For image-level (key-frame) annotation, this dissertation considers a common practical scenario: only a small number of relevant positive examples are available for learning a target concept. Two problems arise: first, with positive-only training data, traditional discriminative classifiers such as SVMs cannot be applied directly; second, the low-level features that discriminate different semantic concepts vary greatly, so a single fixed feature set cannot adapt to all concepts. This dissertation proposes a manifold-ranking-based framework for key-frame image-level annotation: manifold ranking copes with positive-only data while exploiting the distribution of unlabeled data, and a feature selection criterion is defined to choose different features for different semantic concepts. The framework also supports newly defined target concepts and newly introduced features.

4. In object-level annotation, the conventional multiple-instance learning (MIL) formulation ignores the semantic dependencies among concepts. This dissertation proposes an existence-based MIL formulation to model these dependencies and designs a new MIL algorithm, MI-AdaBoost, on top of it. The algorithm first maps each training bag into a feature vector in a bag-level feature space, turning the MIL problem into standard supervised learning; because the mapping yields a high-dimensional, noisy feature vector for each bag, AdaBoost is used to select features and build the classifier.

5. The effective low-level features differ greatly across semantic concepts, so feature selection is a crucial problem for video annotation. Previous work applying MIL to video annotation ignored feature selection under the MIL setting, and conventional single-instance feature selection algorithms usually cannot be applied directly to MIL. This dissertation proposes EBMIL, a feature selection algorithm for the MIL setting, which selects the mapped bag-level features and, at the same time, the raw feature sources (color, texture, etc.), leading to better annotation performance.
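As an illustration of the posterior-probability-based sample selection in contribution 2, the following is a minimal sketch of uncertainty sampling on top of an SVM with probability outputs. The margin-style uncertainty score and the helper name select_most_uncertain are illustrative assumptions, not the dissertation's exact formulation.

```python
# Hypothetical sketch: pick the unlabeled samples the SVM is least confident
# about, using its posterior probability estimates (contribution 2).
import numpy as np
from sklearn.svm import SVC

def select_most_uncertain(clf, X_unlabeled, k):
    """Indices of the k unlabeled samples with the smallest posterior margin."""
    proba = clf.predict_proba(X_unlabeled)      # P(class | x) for every class
    top2 = np.sort(proba, axis=1)[:, -2:]       # two largest class posteriors
    margin = top2[:, 1] - top2[:, 0]            # small margin = low confidence
    return np.argsort(margin)[:k]               # the "most valuable" samples

# One assumed active-learning round:
#   clf = SVC(probability=True).fit(X_labeled, y_labeled)
#   query = select_most_uncertain(clf, X_unlabeled, k=20)
#   -> ask the user to label X_unlabeled[query], then retrain the SVM.
```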

【Abstract】 With the development of multimedia, computers, and the Internet, video data is growing explosively. For efficient storage, management, and indexing of these massive video data, we need to investigate efficient Content-Based Video Retrieval (CBVR) algorithms, and video annotation is a preliminary step for video retrieval and search. In this dissertation, we investigate how to utilize machine learning techniques and video features to perform content-based video annotation at different video levels. There are four levels in video structure: video, scene, shot, and frame. Typically, video annotation is performed at the video level and the shot level. Video-level annotation assigns genre information to each video clip; shot-level annotation assigns the corresponding semantic concept to each shot, based on the key frame extracted from the shot. Shot-level annotation is further classified into image-level and object-level annotation, according to whether the annotated concept belongs to the image level or the object level. In this dissertation, we investigate several problems of video annotation at the video level, the image level, and the object level. The main contributions and innovations can be summarized as follows:

1. For video-level annotation, current research usually annotates only a few genres, or the sub-genres within a certain genre, and the classifiers used are often too simple. We define a relatively comprehensive video genre ontology, analyze and extract a series of spatial and temporal features related to video genre, and propose a locally and globally optimized SVM binary tree for multi-class SVM classification to improve accuracy.

2. Current research in video-level annotation usually adopts passive learning, which demands large-scale training data and time-consuming human labeling. We incorporate active learning into video genre classification and propose an SVM active learning algorithm based on posterior probability. We first use the posterior probability output by the SVM classifier to compute the confidence on each unlabeled sample, and then select the "most unconfident" samples, which are also the most valuable ones for the classifier, for users to label. Through this active learning strategy we can use fewer training samples to obtain classification accuracy comparable to that of large-scale training sets, thus alleviating users' labeling effort.

3. For key-frame image-level annotation, we discuss a typical case in video annotation: learning a target concept from only a small number of positive samples. A novel manifold-ranking-based scheme is proposed to tackle this problem. However, video annotation needs large-scale video data and a large feature pool to achieve good performance, and in this situation applying manifold ranking induces two problems: intractable computation cost and the curse of dimensionality. We incorporate two modules, pre-filtering and feature selection, to tackle these two problems respectively. The scheme is extensible and flexible in terms of adding new features to the feature pool, introducing human interaction in selecting features, and defining new concepts.

4. In object-level annotation, because the training data are usually labeled at the image level while the semantic concepts are at the region level, typical single-instance supervised learning cannot learn the target concept directly. If we regard each image as a labeled bag of multiple instances and the objects in the image as the instances in the bag, object-level annotation becomes a typical multiple-instance learning (MIL) problem. However, conventional MIL formulations in video annotation neglect the concept dependencies, i.e. the relationship between positive and negative concepts. We therefore propose an existence-based MIL formulation to model the concept dependencies, and present a MIL algorithm, MI-AdaBoost, built on it. MI-AdaBoost first maps each training bag into a feature vector in a new bag-level feature space, translating the MIL problem into a standard single-instance problem. Since this mapping induces a high-dimensional, noisy feature vector for each bag, we use AdaBoost to perform feature selection and build the final classifier.

5. As the effective features usually differ greatly across semantic concepts, feature selection is a key problem in video annotation, yet typical feature selection algorithms under single-instance settings cannot be adapted directly to multi-instance settings, and previous MIL work on video annotation often neglects this problem. We propose a feature selection algorithm named EBMIL for the MIL setting. EBMIL is able to select different raw feature sources (color, texture, etc.) while selecting the mapped bag-level features, thus achieving better performance in video annotation.
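For contribution 3, the closed-form manifold-ranking step on key-frame features can be sketched as below, following the standard formulation f = (I - alpha * S)^(-1) y. The pre-filtering and per-concept feature selection modules described above are omitted, and the Gaussian affinity and parameter values are assumptions made only for illustration.

```python
# Minimal manifold-ranking sketch (contribution 3): rank all key frames by
# relevance to a concept, given only a few positive examples.
import numpy as np

def manifold_rank(X, positive_idx, alpha=0.99, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.exp(-d2 / (2 * sigma ** 2))                     # Gaussian affinity (assumed)
    np.fill_diagonal(W, 0.0)                               # no self-loops
    D = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    S = D @ W @ D                                          # normalized affinity matrix
    y = np.zeros(len(X))
    y[list(positive_idx)] = 1.0                            # positive-only supervision
    f = np.linalg.solve(np.eye(len(X)) - alpha * S, y)     # f = (I - alpha*S)^-1 y
    return np.argsort(-f)                                  # frames sorted by relevance
```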
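For contribution 4, the idea of mapping bags to a bag-level feature space and letting AdaBoost select features can be sketched as follows. The dissertation's existence-based mapping is not reproduced here; mapping each bag onto its best-match similarity to a pool of prototype instances is only an illustrative stand-in, and all names below are hypothetical.

```python
# Illustrative MIL sketch (contribution 4): map each bag of instances to one
# bag-level feature vector, then let AdaBoost select features and build the
# classifier over the mapped space.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def map_bag(bag, prototypes, sigma=1.0):
    """Map a bag (n_instances x dim) to its best-match similarity to each prototype."""
    d2 = ((bag[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2).max(axis=0)            # one value per prototype

def train_mil_adaboost(bags, labels, prototypes):
    X = np.vstack([map_bag(b, prototypes) for b in bags])  # bag-level feature space
    # Boosted decision stumps act as a feature selector over the mapped,
    # high-dimensional and noisy bag-level features.
    return AdaBoostClassifier(n_estimators=100).fit(X, labels)
```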

  • 【CLC Number】 TP391.41
  • 【Cited by】 9
  • 【Downloads】 688