Fusion of Multiple Visual Objects for Action Recognition (融合多视觉对象的行为识别研究)
【Author】 Liu Jing (刘婧);
【Author information】 Beijing Institute of Technology, Computer Science and Technology, 2015, Master's thesis
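【Abstract (translated from the Chinese)】 Action recognition is an active research topic in computer vision and pattern recognition, with broad application prospects in intelligent surveillance, virtual reality, advanced human-computer interaction, and related fields. In real, unconstrained environments, however, it remains challenging because of variations in human appearance and motion, complex and changing backgrounds, occlusion, and camera motion. For actions that occur in real environments, contextual information such as objects and scenes is often critical. This thesis studies how to fuse multiple visual objects, jointly modeling motion, object, and scene information to improve recognition accuracy.
It proposes a joint modeling method that fuses information from multiple visual objects, introducing a structural support vector machine framework with latent variables to model the co-occurrence relationships among motion, objects, and scenes. The model captures both the direct relationship between each visual object and the action class label and the co-occurrence relationships among the visual objects themselves; the object class label and the scene class label are treated as latent variables. The model predicts the action, object, and scene classes and also localizes the object in the scene. Experimental results demonstrate the effectiveness of multi-visual-object fusion, which further improves action recognition accuracy in real scenes.
In addition, the thesis describes the visual objects with mid-level class-correlation features and proposes training the pre-classifiers that generate these features by transfer learning. A class-correlation feature carries some semantic information: it consists of the decision values of a set of pre-classifiers and measures how well the input video matches each corresponding class. Most training videos have low resolution, which leaves objects blurry and scenes hard to identify, and training videos annotated with contextual information are very scarce, which increases the burden of manual annotation. For training the object and scene pre-classifiers, the thesis therefore proposes an image-to-video transfer learning method: object and scene classifiers are first trained on labeled Web images, and an unsupervised domain adaptation algorithm then addresses the difference in data distribution between the image source domain and the video target domain. Experiments demonstrate the good discriminative power of the mid-level features and the effectiveness of the transfer algorithm.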
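The joint model in the abstract above can be illustrated with a minimal sketch of its prediction step. The sketch assumes a score made of one unary term per visual object plus pairwise co-occurrence weights, and it predicts the action by maximizing over the latent object and scene labels. The function names, the weight arrays `w_ao`, `w_as`, `w_os`, and the exact form of the score are hypothetical; the abstract does not give the thesis's potential functions, and training and object localization are omitted.

```python
import numpy as np

def joint_scores(act_feat, obj_feat, scn_feat, w_ao, w_as, w_os):
    """Score every (action, object, scene) label triple.

    act_feat, obj_feat, scn_feat: 1-D arrays of per-class responses
    (assumed stand-ins for the mid-level features).
    w_ao, w_as, w_os: pairwise co-occurrence weights (hypothetical parameters).
    Returns an array of shape (n_actions, n_objects, n_scenes).
    """
    unary = (act_feat[:, None, None]
             + obj_feat[None, :, None]
             + scn_feat[None, None, :])
    pairwise = (w_ao[:, :, None]
                + w_as[:, None, :]
                + w_os[None, :, :])
    return unary + pairwise

def predict(act_feat, obj_feat, scn_feat, w_ao, w_as, w_os):
    """Return the best (action, object, scene) triple.

    The object and scene labels are latent: they are never observed,
    only maximized over together with the action label.
    """
    scores = joint_scores(act_feat, obj_feat, scn_feat, w_ao, w_as, w_os)
    a, o, s = np.unravel_index(np.argmax(scores), scores.shape)
    return int(a), int(o), int(s)

# Toy example: 2 actions, 2 objects, 2 scenes (all numbers invented).
act = np.array([0.5, 0.4])
obj = np.array([0.1, 0.2])
scn = np.array([0.3, 0.1])
zeros = np.zeros((2, 2))
w_ao = np.array([[0.0, 0.0],
                 [0.0, 1.0]])  # action 1 co-occurs strongly with object 1

print(predict(act, obj, scn, zeros, zeros, zeros))  # (0, 1, 0)
print(predict(act, obj, scn, w_ao, zeros, zeros))   # (1, 1, 0)
```

In the toy example the action response alone favors action 0, while the co-occurrence weight between action 1 and object 1 flips the prediction. That flip is the effect that modeling contextual objects and scenes is meant to capture.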
【Abstract】 Action recognition is a highly active research topic in computer vision and pattern recognition and has many applications, such as surveillance, virtual reality, and human-computer interaction. However, recognizing actions in realistic videos from unconstrained environments remains challenging because of large variations in human appearance, background clutter, and camera movement. In realistic environments, objects and scenes provide a rich source of contextual information for analyzing human actions, since actions often occur in particular scene settings with certain related objects. This thesis therefore uses contextual objects and scenes to improve the performance of action recognition.
This thesis proposes a method that fuses multiple visual objects by modeling the relationships among action, object, and scene. Specifically, a latent structural SVM is introduced to model the co-occurrence relationships among action, object, and scene, with the object and scene class labels treated as latent variables. In this framework, action, object, and scene class labels can all be predicted, and the object location is estimated simultaneously as a by-product. Experimental results demonstrate that the proposed method improves action recognition performance.
Moreover, a mid-level discriminative feature is used to describe each visual object, and this thesis proposes training the pre-learned classifiers behind that feature with transfer learning. The mid-level class-correlation feature is a set of decision values from the pre-learned classifiers of all classes, measuring the likelihood that the input video belongs to each corresponding class. Because objects and scenes are blurry in the limited video training data, and labeling training samples is time-consuming and labor-intensive, the classifiers are trained by transferring from images to videos: labeled Web images are used to train the initial classifiers, and an unsupervised domain adaptation method handles the distribution difference between the source image domain and the target video domain. Experimental results demonstrate the discriminative power of the mid-level feature and the effectiveness of the transfer learning method.
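The image-to-video transfer described above can be sketched in a few lines. This is an illustration under loud assumptions, not the thesis's algorithm: the abstract does not name its domain adaptation method, so per-dimension mean and variance matching is swapped in as a simple unsupervised stand-in, and nearest-centroid scoring stands in for the pre-learned classifiers. All function names and the toy data are hypothetical.

```python
import numpy as np

def align_to_target(source, target, eps=1e-8):
    """Rescale each source feature dimension to the target's mean and std.

    Uses no target labels, so the adaptation is unsupervised.
    (Assumed stand-in for the thesis's domain adaptation method.)
    """
    src_mean, src_std = source.mean(axis=0), source.std(axis=0)
    tgt_mean, tgt_std = target.mean(axis=0), target.std(axis=0)
    return (source - src_mean) / (src_std + eps) * tgt_std + tgt_mean

def train_centroids(features, labels, n_classes):
    """One centroid per class: a deliberately simple pre-classifier stand-in."""
    return np.stack([features[labels == c].mean(axis=0)
                     for c in range(n_classes)])

def class_correlation_feature(x, centroids):
    """Mid-level feature: one decision value per class.

    Here the decision value is the negative distance to the class centroid,
    so a larger value means a better match.
    """
    return -np.linalg.norm(centroids - x, axis=1)

# Labeled "image" features (source domain) for two classes.
images = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
labels = np.array([0, 0, 1, 1])
# Unlabeled "video" features (target domain): same structure, shifted and scaled.
videos = images * 2.0 + 10.0

aligned = align_to_target(images, videos)
centroids = train_centroids(aligned, labels, n_classes=2)

feature = class_correlation_feature(videos[0], centroids)
print(feature.shape)            # (2,)
print(int(np.argmax(feature)))  # 0
```

Without the alignment step, centroids trained on the raw image features would assign `videos[0]` to class 1, because the two domains occupy different regions of feature space. That mismatch is the distribution difference the abstract refers to.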
【Key words】 action recognition; context modeling; latent structural SVM; mid-level feature; transfer learning;