面向结构化数据的视频检索研究

Research on Video Retrieval with Structural Data

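【Author】 Gu Zhiwei (顾志伟)

【Advisor】 Wu Xiuqing (吴秀清)

【Author Information】 University of Science and Technology of China, Signal and Information Processing, 2008, PhD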

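【Abstract, translated from the Chinese】 Video data has grown explosively in recent years and occupies an increasingly important place in daily life, and video sharing will remain a hot topic for years or even decades to come. This has made video content analysis and video retrieval a focus of current video research. Content-based video retrieval (CBVR) is a technology that is at once theoretical, practical, and challenging. After more than a decade of research it has made great progress, and several prototype systems have been developed and put to use in small commercial search engines. In CBVR, video structuring in the broad sense plays a critical role. Because raw video is an unstructured data stream, retrieval first requires organizing the video into structured data with a suitable model, and then analyzing, indexing, and querying the video according to that structured organization. The main goal of this thesis is to study the structural properties of video data and to exploit them fully in designing efficient machine learning algorithms for high-level semantic understanding, algorithms that can narrow the "semantic gap" between low-level features and high-level semantics automatically or with little human involvement, and thereby improve video retrieval performance. Taking video structure as its main thread, the thesis studies the image-level, shot-level, and scene-level structures in turn and proposes machine learning algorithms for each. The main work and contributions are summarized as follows.

1. For image-level retrieval based on global information, the thesis proposes combining AdaBoost with SVM to perform repeated sampling of the samples, using classification accuracy as the criterion of feature performance for feature selection. A small number of features that benefit retrieval are selected and the weak classifiers are boosted into a strong classifier, so that multiple features are fused well.

2. For image-level retrieval based on regional information, the thesis models the problem with multiple-instance learning and uses multiple-instance active learning to reduce manual annotation effort and address the shortage of labeled samples. It analyzes the characteristics of multiple-instance active learning in detail and groups it into three modes: bag-level, instance-level, and mixed-level. For bag-level multiple-instance active learning, it proposes a sample selection strategy that combines statistical features of the instance count with uncertainty; experiments confirm the effectiveness of the method.

3. The shot is the basic physical unit of video, so video retrieval is usually carried out at the shot level. The thesis analyzes the multi-layer structure inherent in video and proposes, for the first time, a multi-layer multi-instance learning framework that combines the characteristics of structural learning and multiple-instance learning and models video content effectively. It examines the key problems that multi-layer multi-instance learning must solve and designs several algorithms that together form a complete framework. First, a multi-layer multi-instance kernel is designed to measure the similarity of samples with this particular structure. Then, the idea of marginalized kernels is used to improve it into a marginalized multi-layer multi-instance kernel, which resolves the weighting of instance contributions. Finally, a multi-layer multi-instance regularization framework is proposed that introduces multiple constraints to express the multi-layer structure and the multi-instance relations explicitly, and that solves the multi-layer multi-instance learning problem well.

4. The scene is the semantic unit of video and offers a higher level of abstraction and generalization than the shot; combining scene information effectively during video semantic understanding supports semantic-level applications such as video retrieval and management. The thesis proposes an energy-minimization-based method for scene segmentation (EMS) that combines global distribution properties with local similarity constraints. It also proposes a method that fuses scene segmentation results with automatic speech recognition (ASR) results for video retrieval, which achieves better performance.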

【Abstract】 In recent years the amount of video data has surged to an unprecedented level. Video plays an increasingly important role in daily life, and internet video sharing is likely to remain a major trend for years, perhaps decades. As a result, video content analysis and video retrieval have become central topics in video research. Content-based video retrieval (CBVR) is at once theoretical, practical, and challenging. It has made tremendous progress over more than a decade of research, and some prototype systems have been developed and used in small commercial search engines. Video structuring, in a broad sense, plays a key role in CBVR: because raw video is an unstructured data stream, it must first be organized into structured data with an appropriate model, and analysis, indexing, and querying are then performed on that structure. The objective of this thesis is to study the structural characteristics of video content and to exploit them in designing efficient machine learning algorithms for high-level semantic understanding. These algorithms aim to narrow the "semantic gap" between low-level features and high-level semantics, automatically or with little manual effort, and ultimately to improve retrieval performance.

In this thesis, the hierarchical structure of video serves as the main thread for analyzing video semantics. We propose learning algorithms for three levels of that structure: the image level, the shot level, and the scene level. The main contributions are summarized as follows.

1. For image-level retrieval based on global information, we combine AdaBoost with SVM to perform repeated sampling, and use classification accuracy as the criterion for feature selection so that a small number of features useful for retrieval are chosen. The weak classifiers are boosted into a strong classifier, which fuses multiple features effectively.

2. For image-level retrieval based on regional information, we model image structure with multiple-instance learning, which belongs to the structural learning framework, and introduce multiple-instance active learning (MIAL) to reduce manual labeling and address the shortage of labeled samples. We analyze the characteristics of MIAL and categorize it into three paradigms: bag-level, instance-level, and mixture-level active learning. For bag-level MIAL, we propose a sample selection strategy that combines statistics of the instance count with sample uncertainty. Experimental results demonstrate the effectiveness of the proposed algorithm.

3. Because the shot is the basic physical unit of video, video retrieval is usually performed at the shot level. We study the intrinsic hierarchical structure of video content and propose the multi-layer multi-instance (MLMI) learning framework, which combines structural learning with multiple-instance learning and models video content in a natural way. We discuss the key problems that MLMI learning must solve and design a complete framework of several algorithms to address them. First, an MLMI kernel is constructed to measure the similarity between samples with this particular structure. Second, to weight the contributions of individual instances, we apply the idea of marginalized kernels and propose the marginalized MLMI kernel. Third, to deal with the ambiguity propagation problem introduced by weak labeling and the multi-layer structure, we propose a regularization framework with several explicit constraints: hyper-bag prediction error, sub-layer prediction error, an inter-layer inconsistency measure, and classifier complexity. Together these address the MLMI learning problem well.

4. The scene is regarded as the basic semantic unit of video and is more abstract and more general than the shot, so using scene information in semantic understanding can benefit semantic-level applications such as video retrieval and management. We propose an energy-minimization-based scene segmentation (EMS) algorithm that takes into account both the global distribution of time and content and local temporal continuity. In addition, we propose a scheme that fuses scene segmentation results with automatic speech recognition (ASR) results for video retrieval, which yields better retrieval performance.
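
The abstract does not give the form of the MLMI kernel, so the following is only a minimal sketch of the general idea of comparing nested bags, not the thesis's method. It assumes a plain recursive set kernel: instances are compared with a Gaussian RBF kernel, and bags at every layer are compared by averaging the kernel over all pairs of children. The data layout (tuples for instances, lists for bags), the function names, and the `gamma` parameter are all hypothetical choices made for illustration.

```python
import math

def rbf(x, y, gamma=0.5):
    """Gaussian RBF kernel between two instance feature vectors (tuples)."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(x, y)))

def set_kernel(a, b, gamma=0.5):
    """Kernel between two nodes at the same layer of a nested structure.

    Assumed representation (not from the thesis): an instance is a tuple of
    floats, and a bag at any layer is a non-empty list of child nodes.
    Bags are compared by the mean kernel over all child pairs, so every
    child contributes with equal weight.
    """
    a_leaf, b_leaf = isinstance(a, tuple), isinstance(b, tuple)
    if a_leaf and b_leaf:
        return rbf(a, b, gamma)
    if a_leaf or b_leaf:
        raise ValueError("nodes must sit at the same layer")
    total = sum(set_kernel(x, y, gamma) for x in a for y in b)
    return total / (len(a) * len(b))

def normalized_kernel(a, b, gamma=0.5):
    """Cosine-style normalization so that k(a, a) equals 1 for any node."""
    return set_kernel(a, b, gamma) / math.sqrt(
        set_kernel(a, a, gamma) * set_kernel(b, b, gamma)
    )

# Hypothetical toy data: shot -> key frames -> region feature vectors.
shot_a = [[(0.1, 0.2), (0.9, 0.8)], [(0.2, 0.1)]]
shot_b = [[(0.1, 0.3)], [(0.8, 0.9), (0.7, 0.7)]]
shot_c = [[(5.0, 5.0)], [(6.0, 5.5)]]

print(normalized_kernel(shot_a, shot_b))  # similar content, closer to 1
print(normalized_kernel(shot_a, shot_c))  # distant content, closer to 0
```

Equal weighting of all children is the simplest choice here. The marginalized MLMI kernel mentioned in contribution 3 is described as weighting instance contributions, which this sketch does not attempt.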

  • 【Classification No.】TP391.41
  • 【Times Cited】8
  • 【Downloads】739