

Key Techniques of Content-based Intelligent Video Surveillance and the Applications in Public Security

【Author】 Zhang Jian (张剑)

【Advisor】 Zhuang Yueting (庄越挺)

【Author Information】 Zhejiang University, Computer Science and Technology, 2007, Ph.D.

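【Abstract (Chinese version)】 Intelligent video surveillance technology can automatically interpret surveillance video, detect unusual events, and issue early warnings. Monitoring the behavior of suspicious persons is the most important class of such applications, and this thesis conducts its research within that scope. Specifically, centered on human behavior monitoring and taking learning-based methods as its main thread, the thesis studies motion detection, action recognition, face super-resolution, facial expression synthesis, and 3D reconstruction in video.

The thesis first discusses motion detection and extraction using incremental background modeling. It proposes an adaptive weight selection mechanism that automatically assigns each frame a weight according to the motion information the frame contains, and it updates the background model with the weighted frames. The thesis provides a reasonable weight representation and an adaptive method for computing the weights. The method adapts well to dynamic scene changes and generates high-quality background images even when the video contains complex background motion, which improves the robustness of background modeling.

To automatically interpret human behavior in surveillance video, the thesis studies automatic human action recognition for recognizing several kinds of harmful abnormal actions in specific settings. The method is a view-independent human action recognition approach based on template matching. The action template consists of several hyperspheres in a subspace and integrates the features of multiple human actions observed from multiple viewpoints; recognition is performed by similarity matching between the action to be recognized and the sample actions in the template. Because the action hyperspheres fuse human motion features from multiple viewpoints, they support view-independent action recognition; in addition, hypersphere-based recognition is more computationally efficient than the traditional kNN algorithm.

Face regions in surveillance video are usually very small and hard to distinguish, which makes face identification by human observers difficult. The thesis therefore proposes a two-phase, example-based face super-resolution technique. In the first phase, a Locality Preserving Hallucination algorithm synthesizes a global high-resolution face image; in the second phase, an image residue synthesis technique based on neighbor reconstruction compensates the global high-resolution face image with local detail. The proposed method can synthesize high-resolution face images with different visual appearances from a low-resolution face image, eliminating the effect of excessively low resolution on face recognition.

To address the problems that changes in facial expression cause for face recognition, the thesis proposes, in turn, an image-based and a video-based facial expression synthesis method. The former uses an eigen-associative learning algorithm to synthesize face images with other expressions from a single face image with a neutral expression. The latter is a two-level fusion method combining local linear and global nonlinear subspace analysis: local linear subspace learning uses an eigen-representation technique to compress the video samples in the temporal domain, and global nonlinear subspace learning produces optimized facial expressions in the spatial domain. The synthesized expressions are realistic and natural, which helps face recognition under expression changes.

To reduce the effect of face pose variation in surveillance video on face recognition, the thesis proposes, in turn, image-based and video-based 3D face reconstruction to provide auxiliary information for face recognition. Image-based reconstruction uses an adaptive locally linear embedding algorithm to sample the example space non-parametrically and synthesizes a specific 3D face model from the sampling result; constraint-based texture mapping then produces a realistic face. Video-based reconstruction first automatically labels a number of feature points on the first frame of an uncalibrated monocular video, tracks these points robustly with affine-rectified optical flow, and recovers the camera projection matrices and the 3D coordinates of the feature points with an SFM (structure-from-motion) algorithm. It then deforms a generic face according to the facial features to obtain a personalized face model and expressions, and finally synthesizes a realistic appearance with dynamic texture mapping.

This thesis provides an overall solution for intelligent video surveillance, studies its key techniques, and obtains preliminary results. The conclusion and outlook point out that many problems in intelligent surveillance still require in-depth research.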
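The adaptive per-frame weighting described above can be illustrated with a minimal running-average sketch. The thesis's exact weight formula is not given in this abstract, so the rule used here (a frame's weight shrinks linearly with the fraction of pixels flagged as moving) and the names `update_background`, `alpha`, and `diff_threshold` are hypothetical assumptions, not the author's method.

```python
import numpy as np

def update_background(background, frame, alpha=0.05, diff_threshold=25.0):
    """One update step of a motion-weighted running-average background model.

    Hypothetical weighting rule (assumption, not taken from the thesis): the
    frame weight is alpha * (1 - motion_ratio), so frames dominated by motion
    change the background model very little.
    """
    frame = frame.astype(np.float64)
    # Pixels far from the current background estimate are flagged as moving.
    moving = np.abs(frame - background) > diff_threshold
    motion_ratio = float(moving.mean())      # fraction of moving pixels, in [0, 1]
    weight = alpha * (1.0 - motion_ratio)    # adaptive per-frame weight
    # Blend static pixels into the model; moving pixels keep their old value.
    blended = (1.0 - weight) * background + weight * frame
    updated = np.where(moving, background, blended)
    return updated, weight, moving
```

Calling `update_background` once per incoming frame yields the incremental model; a frame in which every pixel is flagged as moving receives weight zero and leaves the background unchanged.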

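A toy version of hypersphere-based template matching shows why it can be cheaper than kNN: a query is compared with one center and radius per sphere instead of with every training sample. This is a sketch under stated assumptions: one sphere per class, center equal to the class mean, radius equal to the mean distance of the class samples to that center, and a score equal to the distance to the sphere surface. The function names are hypothetical.

```python
import numpy as np

def fit_hyperspheres(features, labels):
    """Summarize each action class by one hypersphere (center, radius).

    Assumption: one sphere per class. The thesis builds several spheres per
    template to cover multiple viewpoints; this sketch does not model that.
    """
    spheres = {}
    for label in np.unique(labels):
        pts = features[labels == label]
        center = pts.mean(axis=0)
        radius = np.linalg.norm(pts - center, axis=1).mean()
        spheres[label] = (center, radius)
    return spheres

def classify(x, spheres):
    """Assign x to the sphere whose surface is nearest (negative score = inside)."""
    best_label, best_score = None, np.inf
    for label, (center, radius) in spheres.items():
        score = np.linalg.norm(x - center) - radius
        if score < best_score:
            best_label, best_score = label, score
    return best_label
```

Each query costs one distance computation per sphere, whereas brute-force kNN needs one per training sample, which is the source of the efficiency gain the abstract claims.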
【Abstract】 Intelligent video surveillance can automatically interpret video content, detect unusual events, and raise alerts. One of its most important applications is automatically monitoring the behavior of suspects. This thesis focuses on this application and uses learning-based methods to address motion detection, action recognition, face super-resolution, expression synthesis, and 3D reconstruction.

First, we discuss motion detection and extraction based on incremental background modeling. We put forward an adaptive weight selection mechanism that automatically determines a weight for each frame according to the motion the frame contains, and we update the background model with these weighted frames. The resulting background model adapts well to dynamic scenes and produces good background images even when the background contains complex motion.

Second, to automatically interpret human behavior in video, we propose a human action recognition approach that recognizes several kinds of harmful abnormal actions in specific settings. The approach is view-independent and based on template matching. The template consists of several action hyperspheres in a subspace that encode multi-view information about the actions, and recognition is achieved by comparing the test action with the sample actions in the template. The action hyperspheres enable view-independent action recognition, and hypersphere-based recognition is more computationally efficient than kNN classification.

Third, tiny faces in surveillance video are an obstacle to face recognition, so we propose a two-phase face super-resolution approach. In the first phase, the Locality Preserving Hallucination (LPH) algorithm synthesizes a global high-resolution face. In the second phase, we use neighbor reconstruction to synthesize the image residue, which compensates the global face with detailed facial features. Our approach efficiently synthesizes distinct high-resolution faces with various facial appearances, which helps to eliminate the effects of tiny faces.

Then, we provide image-based and video-based expression synthesis approaches to tackle the problems that varied facial expressions cause for face recognition. The former uses an Eigen-associative Learning algorithm to synthesize various facial expressions from a face image with a neutral expression. The latter is a two-level fusion approach that combines local linear and global nonlinear subspace learning: the local linear subspace learning adopts an eigen-representation technique to compress video samples in the temporal domain, and the global nonlinear subspace learning synthesizes optimized facial expressions in the spatial domain. The synthesized facial expressions are close to ground-truth expressions, which improves face recognition under varied facial expressions.

Finally, to diminish the influence of pose variation on face recognition, we introduce image-based and video-based 3D face reconstruction in turn. The image-based reconstruction first uses adaptive LLE to sample the training space nonparametrically and reconstructs a 3D face model from the sampling result; constraint-based texture mapping then synthesizes a realistic appearance. The video-based reconstruction uses affine-rectified optical flow to track feature points that are automatically aligned on the first frame of an uncalibrated monocular video sequence. An SFM algorithm then recovers the camera projection matrix and the 3D coordinates of the facial feature points. We use the facial feature points to deform a generic face model and obtain a personalized 3D face model and facial expressions.

This thesis not only provides a holistic solution to intelligent video surveillance, but also explores related key techniques and obtains preliminary results. In the conclusion, we point out that many open problems in intelligent video surveillance still need in-depth study.
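The second super-resolution phase (neighbor reconstruction of the image residue) can be sketched in the style of locally linear embedding: express a low-resolution patch as a weighted combination of its nearest training patches, then apply the same weights to the paired high-resolution residue patches, and add the result to the global face from the first phase. The LLE-style constrained least-squares weights, the regularization constant `reg`, the neighbor count `k`, and the function names are assumptions for illustration; the thesis's exact formulation may differ.

```python
import numpy as np

def reconstruction_weights(x, neighbors, reg=1e-3):
    """LLE-style weights: minimize ||x - sum_j w_j * n_j||^2 subject to sum_j w_j = 1."""
    k = len(neighbors)
    diff = neighbors - x                   # (k, d) offsets of each neighbor from x
    gram = diff @ diff.T                   # local Gram matrix, (k, k)
    # Regularization keeps the system solvable when k > d or neighbors coincide.
    gram = gram + (reg * np.trace(gram) + 1e-12) * np.eye(k)
    w = np.linalg.solve(gram, np.ones(k))
    return w / w.sum()                     # enforce the sum-to-one constraint

def synthesize_residue(x_low, train_low, train_residue, k=5, reg=1e-3):
    """Transfer neighbor weights from low-res patches to their high-res residues."""
    dists = np.linalg.norm(train_low - x_low, axis=1)
    idx = np.argsort(dists)[:k]            # k nearest low-resolution training patches
    w = reconstruction_weights(x_low, train_low[idx], reg)
    return w @ train_residue[idx]          # weighted sum of the paired residue patches
```

Because the same weights are reused on the residue side, the sketch assumes that low-resolution patches and their residues share local neighborhood geometry, which is the usual premise of neighbor-embedding super-resolution.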

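The abstract does not state which SFM variant recovers the camera matrices and 3D feature points. As an illustrative stand-in, the classic Tomasi-Kanade affine factorization below stacks the tracked 2D points into a measurement matrix, subtracts per-row centroids, and splits the rank-3 SVD into motion and shape. The function name and array layout are assumptions; the result is defined only up to an affine ambiguity, and no metric upgrade or perspective model is attempted.

```python
import numpy as np

def affine_factorization(tracks):
    """Tomasi-Kanade style affine structure from motion.

    tracks: array (F, P, 2) of tracked 2D feature positions over F frames.
    Returns motion M (2F x 3), shape S (3 x P), and row translations t (2F,),
    so that M @ S + t[:, None] reproduces the measurements for noise-free data.
    """
    F, P, _ = tracks.shape
    # Rows 2f and 2f+1 hold the x and y coordinates of all points in frame f.
    W = tracks.transpose(0, 2, 1).reshape(2 * F, P)
    t = W.mean(axis=1)                     # image of the 3D centroid in each row
    W0 = W - t[:, None]                    # registered matrix, rank <= 3 if noise-free
    U, s, Vt = np.linalg.svd(W0, full_matrices=False)
    root = np.sqrt(s[:3])
    M = U[:, :3] * root                    # stacked 2x3 affine cameras
    S = root[:, None] * Vt[:3]             # 3D points (up to an affine transform)
    return M, S, t
```

With noisy tracks, truncating to the top three singular values gives the least-squares rank-3 approximation of the registered measurements.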
  • 【Online Publication Submitted By】 Zhejiang University
  • 【Online Publication Year/Issue】 2008, Issue 06

