
面向视频挖掘的视觉内容分析

A Research on Visual Content Analysis Towards Video Mining

【Author】 罗青山 (Luo Qingshan)

【Supervisor】 曾贵华 (Zeng Guihua)

【Author Information】 Shanghai Jiao Tong University, Communication and Information Systems, 2009, PhD

【Abstract (translated from the Chinese)】 Video mining has broad application prospects: by analyzing the content of raw video data, it carries out data mining tasks for different purposes and uses. With video mining we can discover interesting patterns hidden in video content and obtain useful knowledge to support intelligence analysis and decision-making. However, the difficulty computers have in understanding video content severely limits the development of video mining. This dissertation tackles the content-understanding problems that video mining faces, seeking solutions for syntactic segmentation and semantic extraction of video content, and thereby building a bridge between video mining and video data. Several algorithms are proposed or improved for visual content analysis to realize video content understanding. The main work includes the following.

First, shot detection is the first step of video content understanding; it performs syntactic segmentation of the video. For frame description and matching, the concept of the continuous color histogram is proposed: the histogram is built by distance interpolation, which overcomes the "interval effect" of simple quantization. Spatial pyramid matching is also introduced, neatly adding geometric constraints to color-histogram-based image matching. For deciding shot boundaries, a similarity evolution matrix is proposed to characterize boundary patterns. With a small set of matrix templates and the mature dynamic time warping algorithm, a unified detector is obtained that handles both abrupt cuts and gradual transitions.

Second, to annotate visual objects in video streams automatically, a grid-based mean-shift search algorithm is proposed that casts video recognition as detecting and tracking histogram features. Each object is represented by a set of image exemplars, and the detector scans whole frames at multiple scales and rotation angles. Tracking runs alongside detection: objects found in the previous frame have their state and features updated in the current frame. Fusing the detection and tracking information yields continuous recognition of visual objects in video.

Third, a holistic approach based on spatio-temporal volumes is adopted to annotate visual actions automatically. The detection problem studied here is not confined to controlled conditions such as static backgrounds and stable illumination, but addresses human actions in real scenes. To obtain an effective representation that is robust to background motion and appearance variation, actions are described using motion information only. Based on the computation and statistics of optical flow fields, three types of local motion histograms are designed to describe an action's spatio-temporal volume. GentleAdaBoost is then applied to select discriminative features and learn action models, giving effective classification of action volumes.

In addition, a part-based approach built on spatio-temporal cuboids is adopted for automatic action annotation. A fast algorithm is designed for extracting the spatio-temporal cuboid parts of an action from a video stream; by choosing different combinations of frequency parameters, the density and number of cuboid parts can be controlled as the application requires, which makes the action description scalable. To exploit the temporal and spatial structure of visual actions fully, the concept of a "part triplet" is proposed: an explicit shape model that describes the relative positions of an action's cuboid parts. Combined with classical pLSA, this realizes part-based detection of human actions in video.

Finally, based on low-level feature analysis, a method is proposed for localizing "irregular" behaviors in surveillance video. Without predefining or learning explicit models of irregular and regular behaviors, irregularity detection is formulated as querying newly observed spatio-temporal cuboids against a database built from several video clips that contain only regular behaviors. A cuboid descriptor is designed that fuses appearance, motion, and position information into a comprehensive description. To infer "irregular" behaviors, a "K-best" probabilistic inference algorithm is proposed that computes a maximum-likelihood estimate for each cuboid and decides whether the current part of the behavior is irregular. Experiments on real-life surveillance videos confirm the effectiveness of the K-best algorithm.

【Abstract】 Video mining has bright application prospects: it realizes data mining for different goals and tasks through automatic analysis of the content of raw videos. In particular, hidden patterns of interest can be discovered and useful knowledge obtained, both of which are helpful for information analysis and decision-making. However, the difficulties of video content understanding limit the development of video mining.

This paper aims at solving the key problems of video content understanding towards video mining. We seek solutions for video syntax segmentation and semantic information extraction, bridging the gap between data mining and video sequences. By proposing or improving several algorithms, this paper realizes video content understanding through visual information analysis. The main contributions include the following.

Firstly, automatic shot detection is the first step toward video content understanding; it realizes syntax segmentation. The concept of a continuous color histogram is proposed, based on the idea of distance interpolation; the resulting histogram avoids the interval effect. In addition, Spatial Pyramid Matching is introduced to add geometric restrictions to frame matching. For determining a shot boundary, a similarity evolution matrix is proposed to characterize potential shot boundaries, and Dynamic Time Warping is introduced to match candidate matrices against several matrix templates. The resulting method is a unified detector for both abrupt and gradual boundaries.

Secondly, to achieve automatic annotation of visual objects in videos, a grid-based Mean-Shift method is proposed that treats video recognition as a problem of detecting and tracking histogram features. A set of exemplars represents each object, and an efficient detector scans whole video frames at multiple scales and rotations. Detection runs together with tracking, and previously obtained objects are updated frame by frame. Continuous video recognition is achieved by combining the results of detection and tracking.

Thirdly, a holistic approach based on spatio-temporal volumes is proposed to realize automatic annotation of visual actions. The detection problem is not limited to controlled settings such as stationary backgrounds or invariant illumination, but is studied in real scenarios. To develop an effective representation that remains resistant to background motion, only motion information is exploited to define descriptors for action volumes. Based on the calculation of optical flow, three types of local motion histograms are designed to describe the action inside a spatio-temporal volume. Action models are then learned with boosting techniques, which select discriminative features for efficient classification.

Additionally, a part-based approach founded on spatio-temporal cuboids is also proposed for automatic annotation of visual actions. To ensure that enough cuboids can be extracted, an improved detector finds interest points at multiple frequencies; the density and number of interest points can be adjusted through different combinations of frequencies, which yields a scalable description of an action. To make full use of the structural information among cuboids, the concept of a word triplet is presented, which builds an explicit shape model describing the relative positions of cuboids. Classic probabilistic Latent Semantic Analysis is introduced to achieve part-based action detection.

Finally, using low-level features, an approach is proposed that detects and localizes irregularities in surveillance video. Without predefining rules or learning explicit models of regularities and irregularities, the detection problem is formulated as querying newly observed cuboids against a database built from several video clips containing only regular behaviors. A descriptor is designed to characterize a spatio-temporal cuboid, fusing appearance, motion, and spatio-temporal configuration. To infer irregular cuboids from videos, a "K-best" probabilistic inference algorithm finds the maximum-likelihood estimate for each cuboid and checks whether the current part of the behavior is irregular. Experiments on real-world videos validate the approach quantitatively.
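The abstracts describe the continuous color histogram only as "distance interpolation" that removes the interval effect of hard binning. Below is a minimal sketch of one plausible reading: each pixel splits its vote linearly between its two nearest bin centers. The function name, bin count, and NumPy realization are illustrative assumptions, not the dissertation's code.

```python
import numpy as np

def continuous_histogram(channel, n_bins=16):
    """Soft-binned (interpolated) histogram of one color channel.

    Instead of hard quantization, each pixel splits its vote between
    the two nearest bin centers, weighted by distance, which smooths
    out the 'interval effect' of plain binning.
    """
    values = channel.astype(np.float64).ravel() / 255.0  # normalize to [0, 1]
    pos = values * (n_bins - 1)           # fractional bin position
    lo = np.floor(pos).astype(int)        # lower neighboring bin
    hi = np.minimum(lo + 1, n_bins - 1)   # upper neighboring bin
    w_hi = pos - lo                       # weight going to the upper bin
    hist = np.zeros(n_bins)
    np.add.at(hist, lo, 1.0 - w_hi)       # distance-weighted votes
    np.add.at(hist, hi, w_hi)
    return hist / hist.sum()              # normalize to a distribution
```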
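Spatial Pyramid Matching supplies the geometric constraint the abstracts mention: histograms are compared cell by cell at several grid resolutions, so color statistics must agree in roughly the same places. The sketch reuses continuous_histogram from the previous snippet; for brevity it sums weighted per-level intersections rather than counting only the matches new to each level, so it approximates the canonical formulation.

```python
import numpy as np

def spatial_pyramid_similarity(frame_a, frame_b, levels=2, n_bins=16):
    """Pyramid match between two grayscale/single-channel frames.

    Level l splits each frame into 2^l x 2^l cells; cell histograms are
    compared by histogram intersection, with finer levels weighted more.
    """
    h, w = frame_a.shape[:2]
    total = 0.0
    for l in range(levels + 1):
        cells = 2 ** l
        weight = 1.0 / 2 ** (levels - l)  # coarser levels count less
        for i in range(cells):
            for j in range(cells):
                ys = slice(i * h // cells, (i + 1) * h // cells)
                xs = slice(j * w // cells, (j + 1) * w // cells)
                ha = continuous_histogram(frame_a[ys, xs], n_bins)
                hb = continuous_histogram(frame_b[ys, xs], n_bins)
                total += weight * np.minimum(ha, hb).sum()  # intersection
    return total
```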
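The similarity evolution matrix is matched against a few boundary templates with dynamic time warping. The abstracts do not specify how a matrix is fed to DTW; the sketch below takes one plausible route, treating the rows of the local similarity pattern as a sequence of feature vectors. The templates dictionary and threshold are hypothetical.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping between two sequences of feature vectors.

    Used here to compare a candidate similarity-evolution pattern around
    a suspected boundary against templates for cuts and gradual changes.
    """
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # local cost
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return cost[n, m]

def classify_boundary(window, templates, threshold):
    """Label a frame window by its nearest boundary template, if close enough."""
    best_name, best_d = None, np.inf
    for name, tpl in templates.items():
        d = dtw_distance(window, tpl)
        if d < best_d:
            best_name, best_d = name, d
    return best_name if best_d < threshold else None   # None => no boundary
```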
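The grid-based mean-shift detector-tracker is described only at a high level. As a rough stand-in for the tracking half, this snippet runs OpenCV's stock mean shift on a hue-histogram backprojection; the exemplar set, the grid-based multi-scale and multi-rotation detection scan, and the detection/tracking fusion are omitted.

```python
import cv2
import numpy as np

def track_object(frames, init_box, n_bins=16):
    """Mean-shift tracking of a color histogram (OpenCV built-in).

    The initial region's hue histogram plays the role of the model;
    each frame is backprojected to a likelihood map and the window
    shifts to its local mode.
    """
    x, y, w, h = init_box
    hsv0 = cv2.cvtColor(frames[0], cv2.COLOR_BGR2HSV)
    roi = hsv0[y:y + h, x:x + w]
    hist = cv2.calcHist([roi], [0], None, [n_bins], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)
    boxes = [init_box]
    for frame in frames[1:]:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        back = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        _, box = cv2.meanShift(back, boxes[-1], term)   # shift to the mode
        boxes.append(box)
    return boxes
```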
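For the holistic action representation, the abstracts specify three types of local motion histograms computed from optical flow but not their exact definitions. The sketch builds one representative variant, a magnitude-weighted histogram of flow directions over a spatio-temporal volume, using OpenCV's Farnebäck flow as an assumed stand-in for whatever flow method the dissertation uses.

```python
import cv2
import numpy as np

def motion_orientation_histogram(frames, n_bins=8, mag_thresh=1.0):
    """Histogram of optical-flow directions inside a spatio-temporal volume.

    Appearance is ignored entirely: the volume is summarized purely by
    where its pixels move, which is robust to background texture and
    clothing. frames: list of grayscale images.
    """
    hist = np.zeros(n_bins)
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        moving = mag > mag_thresh                    # drop near-static pixels
        bins = (ang[moving] / (2 * np.pi) * n_bins).astype(int) % n_bins
        np.add.at(hist, bins, mag[moving])           # magnitude-weighted votes
    return hist / (hist.sum() + 1e-9)
```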
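Action models are learned with GentleAdaBoost over such motion-histogram features. Gentle AdaBoost itself is standard: each round fits the regression stump with least weighted squared error and reweights the samples multiplicatively. The exhaustive feature/threshold search below favors clarity over speed.

```python
import numpy as np

def gentle_adaboost(X, y, n_rounds=50):
    """Gentle AdaBoost with regression stumps (labels y in {-1, +1}).

    The selected feature indices indicate which histogram bins are the
    most discriminative for the action class.
    """
    n, d = X.shape
    w = np.ones(n) / n
    stumps = []
    for _ in range(n_rounds):
        best = None
        for j in range(d):
            for thr in np.unique(X[:, j]):
                mask = X[:, j] > thr
                # least-squares outputs on each side of the split
                a = np.average(y[mask], weights=w[mask]) if mask.any() else 0.0
                b = np.average(y[~mask], weights=w[~mask]) if (~mask).any() else 0.0
                pred = np.where(mask, a, b)
                err = np.sum(w * (y - pred) ** 2)
                if best is None or err < best[0]:
                    best = (err, j, thr, a, b)
        _, j, thr, a, b = best
        pred = np.where(X[:, j] > thr, a, b)
        w *= np.exp(-y * pred)            # Gentle AdaBoost weight update
        w /= w.sum()
        stumps.append((j, thr, a, b))
    return stumps

def boosted_score(x, stumps):
    """Sum of stump outputs; the sign gives the predicted label."""
    return sum(a if x[j] > thr else b for j, thr, a, b in stumps)
```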
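The "improved detector" for interest points at multiple frequencies is not specified in the abstracts. Detectors of this family are commonly built on a spatial Gaussian plus a quadrature pair of temporal Gabor filters (as in Dollár et al.'s periodic detector), where each (sigma, tau) combination is one frequency setting; the sketch below is that baseline, offered as an assumed approximation. More settings produce denser responses, matching the density/number control the abstracts describe.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def periodic_response(video, sigma=2.0, tau=2.0):
    """Spatio-temporal interest response at one frequency setting.

    video: (T, H, W) float array. Spatial Gaussian smoothing, then a
    quadrature pair of 1-D temporal Gabor filters; strong responses mark
    salient local motion, and cuboids are cut around the local maxima.
    """
    t = np.arange(-int(4 * tau), int(4 * tau) + 1)
    omega = 4.0 / tau                                  # coupled frequency
    env = np.exp(-t ** 2 / (2 * tau ** 2))
    h_even = np.cos(2 * np.pi * omega * t) * env
    h_odd = np.sin(2 * np.pi * omega * t) * env
    smoothed = gaussian_filter(video, sigma=(0, sigma, sigma))
    r_even = convolve1d(smoothed, h_even, axis=0)
    r_odd = convolve1d(smoothed, h_odd, axis=0)
    return r_even ** 2 + r_odd ** 2                    # quadrature energy

def detect_cuboids(video, settings=((2.0, 2.0), (2.0, 4.0), (4.0, 2.0))):
    """Pool responses over several (sigma, tau) combinations; more
    settings yield denser interest points, fewer yield sparser ones."""
    return sum(periodic_response(video, s, t) for s, t in settings)
```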
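Part-based detection combines the triplet shape model with classic pLSA. The triplet model is specific to the dissertation and cannot be reconstructed from the abstracts, but the pLSA half is standard EM over a word-document count matrix, sketched below; "words" would be quantized cuboid descriptors (or triplets) and "documents" the video clips.

```python
import numpy as np

def plsa(counts, n_topics, n_iters=100, seed=0):
    """Plain pLSA via EM on a word-by-document count matrix.

    counts: (n_words, n_docs) array of co-occurrence counts. Returns
    P(w|z) and P(z|d); topics then align with latent action classes.
    """
    rng = np.random.default_rng(seed)
    n_w, n_d = counts.shape
    p_w_z = rng.random((n_w, n_topics)); p_w_z /= p_w_z.sum(0)
    p_z_d = rng.random((n_topics, n_d)); p_z_d /= p_z_d.sum(0)
    for _ in range(n_iters):
        # E-step: posterior P(z | w, d), shape (n_w, n_d, n_topics)
        joint = p_w_z[:, None, :] * p_z_d.T[None, :, :]
        post = joint / (joint.sum(2, keepdims=True) + 1e-12)
        # M-step: re-estimate conditionals from expected counts
        expected = counts[:, :, None] * post
        p_w_z = expected.sum(1)
        p_w_z /= p_w_z.sum(0, keepdims=True) + 1e-12
        p_z_d = expected.sum(0).T
        p_z_d /= p_z_d.sum(0, keepdims=True) + 1e-12
    return p_w_z, p_z_d
```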
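The irregularity detector queries each new cuboid, described by fused appearance, motion, and position cues, against a database of regular-behavior cuboids. The abstracts name a "K-best" probabilistic inference with per-cuboid maximum-likelihood estimation; the snippet below gives one hedged nearest-neighbor reading, scoring a query by kernel-weighted support from its k closest database entries. The threshold and bandwidth are hypothetical.

```python
import numpy as np

def build_descriptor(appearance, motion, position):
    """Concatenate the three cues the dissertation fuses: appearance,
    motion, and the cuboid's spatio-temporal position."""
    return np.concatenate([appearance, motion, position])

def k_best_irregularity(query, database, k=5, sigma=1.0, threshold=1e-4):
    """Flag a new cuboid as irregular if its best database support is weak.

    query: descriptor vector; database: (n, dim) array of descriptors
    from clips containing only regular behaviors. The likelihood of the
    query is approximated from its k best (closest) matches under a
    Gaussian kernel.
    """
    d2 = np.sum((database - query) ** 2, axis=1)   # squared distances
    k_best = np.sort(d2)[:k]                       # the k closest matches
    likelihood = np.mean(np.exp(-k_best / (2 * sigma ** 2)))
    return likelihood < threshold                  # True => irregular part
```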
