基于视觉的大范围头部姿态跟踪关键技术研究

Research on Key Techniques of Vision-based Large Head Pose Tracking

【Author】 Zhao Gangqiang (赵刚强)

【Advisors】 Chen Gencai (陈根才); Chen Ling (陈岭)

【Author information】 Zhejiang University, Computer Science and Technology, 2009, PhD

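【Abstract (translated from the Chinese original)】 3D head pose tracking is an important problem in computer vision and human-computer interaction, and a research direction that has attracted growing attention in recent years. Its main goal is to determine the pose parameters of the head in 3D space by analyzing an input image sequence. The technique has broad application prospects in human-computer interaction, intelligent surveillance, video compression coding, face recognition, expression recognition, fatigue detection, and body-controlled games and entertainment.

Commonly used head pose estimation methods fall into two categories: statistical learning-based methods and registration-based tracking methods. Statistical learning-based methods assume that a correspondence exists between the head pose parameters and certain facial features, and determine this correspondence by training on a large number of sample images with different poses. These methods are sensitive to how the features are defined and often have to interpolate the pose parameters, so their results are not very precise. Registration-based tracking methods usually assume that the head is a rigid object and compute the pose parameters by tracking feature points from frame to frame. The features chosen differ considerably between implementations. One approach tracks salient feature points such as the mouth corners, nose tip and eye corners; the tracking result suffers when the chosen points are occluded. Another approach selects feature points dynamically during tracking and automatically replenishes them when some are lost; this approach is more robust. Overall, registration-based tracking methods are easy to implement and achieve high tracking accuracy.

Most existing head pose tracking algorithms assume that the tracked subject has no body movement or only slight body movement, for example a user sitting in a chair. In daily life people often use head pose to express their direction of attention, attitude and feelings, and during these activities they may be seated in a fixed position or may be moving. Here we define head pose tracking under body movement as large-range head pose tracking. Compared with traditional small-range head pose tracking, large-range head pose tracking can be applied more conveniently in human-computer interaction, intelligent surveillance, behavior recognition and other fields.

This thesis adopts a registration-based method to solve the large-range head pose tracking problem. However, when the body moves over a large range, excessive changes in the pose parameters reduce the accuracy of the registration algorithm, frame-by-frame tracking accumulates error over long sequences, and computing the 3D pose parameters requires depth information for the corresponding head feature points. The thesis therefore proposes a tracking method that combines a registration algorithm based on local feature descriptors with a view-based appearance model. The method divides the pose tracking process into three main parts: first, acquire the video and the corresponding depth information, which can come either from a stereo camera or from stereo matching; second, compute the change in pose parameters between two frames with the local-descriptor-based registration algorithm; third, use the appearance model to eliminate the error accumulated during tracking. Compared with previous work, the main contributions of this thesis are as follows:

1. A registration algorithm based on Scale-Invariant Feature Transform (SIFT) descriptors. Matching SIFT feature points are first found in two grayscale frames, and the depth of the matched points is then obtained from a stereo camera or by stereo matching. To overcome the effect of mismatched points, head motion is finally computed with a motion estimation method based on random sample consensus (RANSAC). The SIFT-based registration algorithm achieves high tracking accuracy and can still track when a certain scale change occurs between the two frames, which makes it suitable for large-range head pose tracking. It is the first registration algorithm proposed specifically for large-range head pose tracking and has had some influence in the field; our paper describing it has been cited by several international peers.

2. A compact feature descriptor, KPB-SIFT (Kernel Projection Based SIFT). The SIFT detector is first used to compute the position, scale and dominant orientation of each feature point, and a low-dimensional descriptor is then obtained by applying a kernel projection to the oriented gradient information in the neighborhood of the feature point. Compared with SIFT, KPB-SIFT significantly increases descriptor matching speed while remaining highly distinctive, and it is robust under illumination changes and geometric deformation.

3. A view-based appearance model. The model eliminates the error accumulated in frame-by-frame tracking through multiple registrations: besides being registered against its previous frame, the current frame can also be registered against one or two key frames. Specifically, some key frames are selected from the input sequence to form an appearance model of the head. Each key frame is annotated with its pose parameters, and the head region of each key frame is precisely extracted as its head view. When the tracked subject moves over a large range, the current frame is registered against a key frame whenever its head view is close to that key frame's head view. The results of the multiple registrations are smoothed with a Kalman filter to obtain the final pose parameters. The view-based appearance model not only reduces error accumulation during tracking, but can also be used to recover the head pose parameters quickly when the head enters or leaves the camera's field of view or is far from the camera.

4. A fast local feature descriptor suited to dense stereo matching, the Speeded-Up Local Descriptor (SULD), used to find correspondences during stereo matching. To generate the descriptor, the image is first filtered with Haar functions, the filter response maps are then smoothed several times with Gaussians, sample points are computed and sample vectors obtained, and finally the sample vectors are normalized and the descriptor is assembled. Because it uses Haar response information and a compact descriptor form, SULD is fast in both descriptor generation and matching. Using the SULD descriptor as the similarity measure makes it possible to solve the stereo matching problem for weakly textured images such as faces and thus to generate the corresponding depth information, which provides depth constraints for head pose tracking with a monocular camera. Head depth information can also be widely used in human-computer interaction, expression recognition, games and entertainment.

In the course of the research, a prototype head pose tracking system, HPObserver, was also implemented; it integrates video capture, depth acquisition, pose computation and result evaluation. HPObserver provides a complete and convenient test platform for validating the key techniques and for follow-up research. Experiments on multiple sets of head motion sequences show that the proposed method can track head motion robustly even when the body moves over a large range, the head enters and leaves the camera's field of view, the face is partially occluded, or the facial expression changes markedly. The thesis concludes by analyzing the main problems of the proposed method and outlining future research directions.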
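The RANSAC-based motion estimation in contribution 1 can be illustrated with a minimal sketch. It assumes the matched feature points already have 3D coordinates in both frames, and it fits the rigid motion with the standard SVD-based least-squares (Kabsch) solution, which the abstract does not confirm as the thesis's own solver. The function names, the iteration count and the inlier threshold `thresh` (taken here to be in metres) are hypothetical.

```python
import numpy as np


def fit_rigid(P, Q):
    """Least-squares rotation R and translation t such that Q ~ P @ R.T + t."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)              # 3x3 cross-covariance of centred points
    U, _, Vt = np.linalg.svd(H)
    # Force det(R) = +1 so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t


def ransac_rigid(P, Q, iters=200, thresh=0.01, seed=0):
    """Estimate rigid motion from 3D matches P -> Q while rejecting mismatches.

    P, Q: (N, 3) arrays of matched 3D points in the two frames.
    iters, thresh, seed: illustrative settings, not values from the thesis.
    """
    rng = np.random.default_rng(seed)
    best = np.zeros(len(P), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(P), size=3, replace=False)   # minimal sample
        R, t = fit_rigid(P[idx], Q[idx])
        err = np.linalg.norm(P @ R.T + t - Q, axis=1)     # per-match residual
        inliers = err < thresh
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 3:
        raise ValueError("not enough inliers to estimate a rigid motion")
    R, t = fit_rigid(P[best], Q[best])                    # refit on all inliers
    return R, t, best
```

Refitting on the full inlier set after the sampling loop is a common design choice: the minimal three-point samples only serve to identify consistent matches, and the final estimate uses all of them.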

【Abstract】 3D head pose tracking is an important research problem in computer vision and human-computer interaction, and it has recently become an increasingly attractive research direction. Its principal objective is to estimate the 3D pose parameters of the head by analyzing an input image sequence. Head pose information can be widely employed in human-computer interaction, intelligent surveillance, video compression coding, face recognition, expression recognition, fatigue detection, body-controlled games and entertainment.

Most existing head pose tracking methods fall into two categories: statistical learning-based methods and registration-based methods. Statistical learning-based methods assume that a relationship exists between certain facial features and the 3D head pose, and use a large number of training images to determine this relationship. These methods are sensitive to how the facial features are selected and usually need to interpolate the recovered pose parameters, so their results are not very accurate. Registration-based methods commonly assume that the head is a rigid object and estimate the pose parameters from feature correspondences between two frames. The features selected vary from one implementation to another. One approach selects distinctive features such as the mouth corners, nose tip and eye corners; its tracking results become less precise when the selected features are occluded. The other approach selects facial features dynamically: it automatically picks new features during tracking when some are lost, and gives more robust results. Generally speaking, registration-based methods are easy to implement and give more precise results. However, when the head moves over a large range, it is difficult to register two frames if the pose change between them is large. In addition, drift accumulates over long periods of frame-by-frame tracking. To estimate the 3D head pose parameters, registration also requires the corresponding 3D information of the facial features.

Existing head pose tracking methods generally assume that the subject has no body movement or only small body movement, for example a subject sitting in a chair. However, when people use head pose to express interest, attitude and feeling in daily life, they may either sit in one place or move over a large range. In this thesis, large-range head pose tracking is defined as head pose tracking in the presence of body movement. Compared with conventional head pose tracking, large-range head pose tracking can be applied more widely in human-computer interaction, intelligent surveillance, action recognition and other fields.

This thesis addresses large-range head pose tracking by using local descriptors to detect and match facial features. The whole process consists of three steps. First, acquire the image and the corresponding depth map; the depth map is obtained either from a stereo camera or by 3D reconstruction techniques. Second, register two frames and estimate the head pose change. Third, reduce the drift accumulated in frame-by-frame tracking by employing an appearance model, which also helps to recover pose tracking automatically. Compared with existing work, our main contributions are as follows:

1. We propose a novel registration algorithm based on the Scale Invariant Feature Transform (SIFT). Salient SIFT features are first detected and matched between two images, and the 3D points corresponding to these features are then obtained from a stereo camera or from 3D reconstruction. With these 3D points, a registration algorithm in a RANSAC framework is employed to detect outliers and estimate the head pose. With the SIFT-based algorithm, two frames can be registered accurately even when the scale changes between them, so the algorithm is appropriate for large-range head pose tracking. It is the first registration algorithm designed for large-range head pose tracking, and the related paper has been cited by other researchers.

2. We propose a new compact feature descriptor called Kernel Projection Based SIFT (KPB-SIFT). It first detects interest points with the SIFT feature detector and then applies kernel projection techniques to the oriented gradient information in each feature point's neighborhood. KPB-SIFT is significantly faster in the descriptor matching stage and shows advantages in distinctiveness, invariance to scale, and tolerance of geometric distortion.

3. To reduce the drift accumulated during large-range tracking, we propose a view-based appearance model that selects key frames online as the head undergoes different motions. These key frames are annotated with both their poses and their head regions, and collectively represent the appearance of the subject as viewed from the estimated poses. To bound drift, our tracker first registers the current frame against its previous frame using the SIFT-based registration algorithm. It then selects a key frame as the base frame if that key frame's head view is similar enough to the head view of the current frame, and registers the current frame against the base frame using the SIFT-based registration algorithm again. Finally, the pose of the current frame is obtained by merging the results of the two registrations with a Kalman filter.

4. We present a novel local image descriptor for dense wide-baseline matching, named SULD (Speeded-Up Local Descriptor). SULD is built in four stages. First, the input image is convolved with Haar wavelet filters. Second, the response maps are smoothed with Gaussian kernels. Third, sample locations are calculated and the corresponding sample vectors are read from the smoothed response maps. Finally, the sample vectors are normalized and concatenated into the SULD descriptor. By employing efficient Haar wavelet filters and integral image techniques, SULD can be computed and matched much faster. SULD can be used for dense matching of weakly textured face image pairs, and the resulting depth information can be supplied to head pose tracking with a monocular camera. Face depth information is also widely employed in human-computer interaction, expression recognition, body-controlled games and entertainment.

During the research, we designed and implemented a head pose tracking demonstration system named HPObserver. HPObserver supports video capture, depth generation, pose estimation and performance evaluation, and is helpful for both ongoing research and future work.

To evaluate the performance of the proposed approach, we conducted experiments on dozens of image sequences. The experiments show that the proposed approach obtains robust results even when the body moves over a large range, the subject returns to the camera's field of view after leaving abruptly, the subject's facial expression varies, or occlusion occurs. Finally, we analyze the remaining problems and discuss future directions.
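The merging step in contribution 3 can be illustrated with a small sketch of a Kalman measurement update. The abstract does not give the filter's state model or noise settings, so everything below is an assumption: the pose is taken to be a 6-vector (three rotation angles, three translations), each component is filtered independently with a scalar variance, and the previous pose is used as the prediction. Both registrations are assumed to be expressed as absolute pose estimates. The names `fuse_registrations`, `r_prev`, `r_key` and `q` are hypothetical, and angle wrap-around is ignored.

```python
def kalman_update(x, p, z, r):
    """Scalar Kalman measurement update.

    x, p: predicted state and its variance; z, r: measurement and its variance.
    """
    k = p / (p + r)                      # Kalman gain
    return x + k * (z - x), (1.0 - k) * p


def fuse_registrations(pose, var, z_prev, r_prev, z_key=None, r_key=None, q=1e-3):
    """Merge the frame-to-frame and key-frame registration results.

    pose, var: last pose estimate and per-component variance (lists of floats).
    z_prev: pose implied by registering against the previous frame.
    z_key:  pose implied by registering against a key frame, or None if no
            key frame was similar enough to the current view.
    q: process noise added in the prediction step (illustrative value).
    """
    new_pose, new_var = [], []
    for i in range(len(pose)):
        x, p = pose[i], var[i] + q       # predict: keep the pose, inflate variance
        x, p = kalman_update(x, p, z_prev[i], r_prev)
        if z_key is not None:
            x, p = kalman_update(x, p, z_key[i], r_key)
        new_pose.append(x)
        new_var.append(p)
    return new_pose, new_var
```

With an uninformative prior and equal measurement variances, the fused pose is close to the average of the two registration results. Giving the key-frame registration a smaller variance than the frame-to-frame one pulls the estimate toward the key frame, which is how a filter of this kind would bound drift.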

  • 【Submitted for online publication by】 Zhejiang University
  • 【Online publication issue】 2011, Issue 03