
图像序列中人的行为分析和识别方法

Human Action Analysis and Recognition from Image Sequences

【Author】 Han Lei

【Supervisor】 Jia Yunde

【Author Information】 Beijing Institute of Technology, Computer Application Technology, 2009, PhD

【Abstract (Chinese)】 Human action analysis and recognition is a research hotspot in computer vision and pattern recognition, with broad application prospects in intelligent surveillance, virtual reality, and motion analysis. This thesis studies human action analysis and recognition from image sequences, exploring hand tracking and dynamic gesture recognition, single-person action recognition, and two-person interaction recognition along three lines: action feature extraction, feature representation, and action recognition and modeling. The thesis proposes a 3D hand tracking algorithm in a hierarchical latent variable space. Unlike other tracking methods based on manifold learning, it partitions the hand state space into several hand-part state spaces and employs a Hierarchical Gaussian Process Latent Variable Model to learn a tree-structured low-dimensional manifold space that better captures the intrinsic structure of hand motion. Particle filters track the hand and each hand part in this low-dimensional space, reducing the number of particles needed to track the hand effectively; radial basis function interpolation constructs a nonlinear mapping from the low-dimensional manifold space to the image space, so low-dimensional particles can be mapped directly into the image space for measurement. Experiments show that the method tracks the articulated hand robustly and with smaller tracking error. The thesis then proposes a Hierarchical Conditional Random Field (Hierarchical CRF) to model dynamic gestures; the model predicts an action label for every frame and can therefore recognize continuous dynamic gestures. Experimental results demonstrate its effectiveness. Most existing single-person action recognition methods rely on whole-body motion features. This thesis presents a single-person action recognition method in a hierarchical latent variable space: based on the physiological structure of the human body, it builds a hierarchical latent variable space of body motion and extracts the motion patterns of each body part by clustering in that space. A cascade conditional random field (Cascade CRF) models the probabilistic mapping from input data to motion patterns, and a discriminative classifier estimates the final action label. Recognition results on motion capture data demonstrate the method's effectiveness, and results on synthetic images verify its robustness. Finally, the thesis studies the recognition and modeling of two-person interactions, proposing an interaction recognition method based on spatio-temporal words. The method extracts dense spatio-temporal interest points from action videos, assigns them to the two bodies using connectivity analysis of the human silhouettes and the interest points' history information, and clusters the interest-point samples into a spatio-temporal codebook. For a given set of interest points, voting yields the spatio-temporal words that represent each person's atomic actions. Conditional random fields model single-person atomic actions; to model the semantics of two-person interactions, a first-order logic knowledge base encoding domain knowledge is built manually, and a Markov Logic Network is trained to perform interaction inference. Experimental results on a two-person interaction dataset demonstrate the method's effectiveness.
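The tracking idea above — propagating particles in a learned low-dimensional latent space and weighting them only after mapping back to the observation space — can be sketched as follows. This is a minimal toy illustration assuming a one-dimensional latent space: `latent_to_obs`, the random-walk dynamics, and the Gaussian likelihood are hypothetical stand-ins for the learned RBF mapping and the image-space measurement, not the thesis implementation.

```python
import math
import random

random.seed(0)

def latent_to_obs(z):
    # Toy stand-in for the learned RBF mapping from the latent
    # manifold to the observation (image-feature) space.
    return [math.sin(z), math.cos(z)]

def likelihood(obs, measured, sigma=0.3):
    # Gaussian observation likelihood computed in the mapped space.
    d2 = sum((a - b) ** 2 for a, b in zip(obs, measured))
    return math.exp(-d2 / (2 * sigma ** 2))

def particle_filter_step(particles, measured, noise=0.15):
    # 1) Propagate each 1-D latent particle with random-walk dynamics.
    moved = [z + random.gauss(0.0, noise) for z in particles]
    # 2) Map every particle into the observation space and weight it.
    weights = [likelihood(latent_to_obs(z), measured) for z in moved]
    total = sum(weights) or 1.0
    weights = [w / total for w in weights]
    # 3) Resample proportionally to the normalized weights.
    return random.choices(moved, weights=weights, k=len(moved))

# Track a latent trajectory drifting from 0.0 toward 1.0, with two
# extra frames at the end so the particle cloud settles.
particles = [random.gauss(0.0, 0.5) for _ in range(200)]
for true_z in [0.2, 0.4, 0.6, 0.8, 1.0, 1.0, 1.0]:
    particles = particle_filter_step(particles, latent_to_obs(true_z))

estimate = sum(particles) / len(particles)
```

Because the latent space is low-dimensional (here a single dimension), a couple of hundred particles are enough — which is the point of tracking in the learned manifold space rather than in the full hand-pose space.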

【Abstract】 Human action analysis and recognition is a hot topic in computer vision and pattern recognition, with promising applications in intelligent surveillance, virtual reality and motion analysis. The key problems in this task are feature extraction, feature representation and action recognition. In this thesis, we focus on human action analysis and recognition from image sequences and investigate hand tracking, gesture recognition, and human action and interaction recognition. This thesis proposes an algorithm for 3D hand tracking in a learned hierarchical latent variable space, which employs a Hierarchical Gaussian Process Latent Variable Model (HGPLVM) to learn the hierarchical latent space of hand motion and, simultaneously, the nonlinear mapping from that latent space to the pose space. Nonlinear mappings from the hierarchical latent space to the space of hand images are constructed using radial basis function interpolation; with these mappings, particles can be projected into hand images and measured directly in the image space. Particle filters with fewer particles then suffice to track the hand in the learned hierarchical low-dimensional space. A Hierarchical Conditional Random Field (Hierarchical CRF), which can capture extrinsic class dynamics and simultaneously learn the relationship between hand-part motions and different hand gestures, is presented to model continuous hand gestures. Experimental results show that the proposed method tracks the articulated hand robustly, and promising recognition performance is achieved on a user-defined hand gesture dataset. Most research on human action recognition is based on features of whole-body motion. This thesis presents a hierarchical discriminative approach to recognizing human actions based on limb motion. The approach consists of feature extraction with mutual motion pattern analysis and discriminative action modeling in the hierarchical manifold space. HGPLVM is employed to learn the hierarchical manifold space in which motion patterns are extracted. A cascade CRF is introduced to estimate the motion patterns in the corresponding manifold subspace, and a trained SVM classifier predicts the action label for the current observation. Results on motion capture data demonstrate the significance of body-part motion analysis, and results on synthetic image sequences demonstrate the robustness of the proposed algorithm. This thesis also explores a hierarchical approach to recognizing person-to-person interactions in an indoor scenario from a single view. It detects dense space-time interest points in action videos and divides them exclusively into two sets according to the history information and the connectivity of the two silhouettes. K-means clustering is then performed on the combined set of interest points from all training interactions to learn a spatio-temporal codebook. For a given set of interest points, a spatio-temporal word is built by letting each point vote softly into the few centers nearest to it and accumulating the scores of all points. A CRF whose inputs are the spatio-temporal words is used to model the primitive actions of each person. Domain knowledge and weighted first-order logic production rules are employed to learn the structure and parameters of a Markov Logic Network (MLN). The MLN naturally integrates common-sense reasoning with uncertainty analysis, which makes it capable of handling the uncertainty produced by the CRF. Experimental results on our interaction dataset demonstrate the effectiveness and robustness of the approach.
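The soft-voting step described above — each interest point casting weighted votes into its few nearest codebook centers, accumulated into a spatio-temporal word — can be sketched as follows. The 2-D descriptors and the 3-entry codebook are hypothetical toy values; real descriptors would come from space-time interest points and the codebook from K-means over training descriptors, as the abstract describes.

```python
def soft_vote_word(points, codebook, n_nearest=2):
    """Build a spatio-temporal word (a normalized histogram over
    codebook entries) by letting each descriptor vote softly into
    its n_nearest cluster centers."""
    word = [0.0] * len(codebook)
    for p in points:
        # Euclidean distance from this descriptor to every center.
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5
                 for c in codebook]
        nearest = sorted(range(len(codebook)), key=dists.__getitem__)[:n_nearest]
        # Inverse-distance soft weights over the selected centers,
        # normalized so each point contributes one unit of mass.
        weights = [1.0 / (dists[i] + 1e-9) for i in nearest]
        total = sum(weights)
        for i, w in zip(nearest, weights):
            word[i] += w / total
    # Normalize so the word sums to 1 regardless of point count.
    s = sum(word)
    return [v / s for v in word] if s else word

# Hypothetical 2-D descriptors near each of three codebook centers.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
points = [(0.1, 0.0), (0.9, 0.1), (0.05, 0.95)]
word = soft_vote_word(points, codebook)
```

Soft voting spreads each point's evidence over several centers, so the resulting word degrades gracefully when a descriptor falls between clusters instead of flipping between hard bin assignments.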
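The per-frame labeling that the CRF models perform — assigning one action label to every frame while respecting label dynamics across frames — reduces at inference time to Viterbi decoding over a chain. The sketch below shows that decoding step with hypothetical hand-set emission and transition scores; a trained CRF would supply these scores from its learned feature weights.

```python
def viterbi(emissions, transitions):
    """Most likely per-frame label sequence under a chain model.
    emissions: one list of log-space label scores per frame;
    transitions[i][j]: log-space score of moving from label i to j."""
    n_labels = len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in label j
    back = []                   # back-pointers, one list per transition
    for frame in emissions[1:]:
        prev_choice = []
        new_score = []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            prev_choice.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + frame[j])
        back.append(prev_choice)
        score = new_score
    # Trace the best path backwards from the best final label.
    path = [max(range(n_labels), key=score.__getitem__)]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return path[::-1]

# Toy example: 2 labels, 4 frames. The transition scores favour
# staying in the same label, which smooths the noisy third frame
# (whose emissions weakly prefer label 1) back to label 0.
emissions = [[2.0, 0.0], [1.5, 0.5], [0.4, 0.6], [1.8, 0.2]]
transitions = [[1.0, -1.0], [-1.0, 1.0]]
labels = viterbi(emissions, transitions)  # -> [0, 0, 0, 0]
```

This is why per-frame CRF labeling suits continuous gesture and action streams: the transition scores let context override a single ambiguous frame without any explicit segmentation step.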
