基于多模态信息的新闻视频内容分析技术研究

Research on News Video Content Analysis Based on Multimodality Information

【Author】 Ji Zhong (冀中)

【Advisor】 Zhang Chuntian (张春田)

【Author information】 Tianjin University, Signal and Information Processing, 2007, Ph.D.

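【Abstract (translated from the Chinese)】 With the rapid growth of video data, effective processing, browsing, retrieval and management of that data has become a pressing practical problem. Video content analysis aims to give structure to unstructured video data and extract its semantic content, bridging low-level features and high-level semantics, and ultimately to build applications such as video summarization, indexing and retrieval that give users convenient access to video content. Taking news video as its subject, this dissertation uses multimodal information (audio, captions and visual content) and its effective fusion as the means of study, and models from pattern recognition theory as tools, to study video content analysis in some depth. The main contributions fall into three areas: (1) A novel fast anchorperson shot detection algorithm in the MPEG compressed domain is proposed. In the preprocessing stage, an improved method for detecting faces from compressed-domain information is introduced; in the shot clustering stage, a novel metric feature is constructed so that anchorperson shots can be grouped by hierarchical clustering, and fuzzy C-means clustering is used to solve the problem of determining an adaptive threshold during clustering. The algorithm speeds up anchorperson shot detection while maintaining high detection performance. (2) A decision-tree-based shot classification algorithm is proposed that classifies news video shots, in order, into six types: commercial, "other", still image, anchorperson, reporter and monologue. Commercial, "other" and still-image shots are detected using features such as black frames, motion, time and faces; anchorperson shots are detected by clustering; reporter and monologue shots, which are harder to tell apart, are handled by recasting their detection as a text sequence labeling problem modeled with conditional random fields (CRFs). The algorithm effectively fuses multimodal information including audio, face and context, distinguishes the important shots in news video, and achieves good classification results. (3) A news story unit segmentation algorithm is proposed that fuses multimodal information from audio, captions and visual content. High-level content features such as caption changes, audio type and shot type are linked and processed jointly, and the news shot sequence is converted into multiple keyword sequences, which turns news story segmentation into a text sequence segmentation problem. The algorithm is modeled with CRFs, makes full use of the context within each sequence and between sequences, and achieves good segmentation performance. In addition, the dissertation surveys video content analysis techniques, constructs a layered audio classification method based on rules and hidden Markov models, implements a fairly complete caption extraction framework for news video, and finally designs and implements a COM-based video content analysis and summarization system. In summary, the dissertation analyzes news video content from the perspectives of audio, captions, visual content and their effective fusion, and the experimental results demonstrate the effectiveness of the algorithms.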
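
The abstract states that fuzzy C-means clustering is used to determine an adaptive threshold during clustering, but it does not give the procedure. The following is a minimal sketch of one way such a threshold could be derived, not the dissertation's method. It assumes the threshold separates "small" from "large" pairwise shot dissimilarities, runs standard two-cluster fuzzy C-means on those scalar values, and takes the midpoint of the two cluster centers as the cut. The function name `fcm_threshold`, its parameters and the sample values are all hypothetical.

```python
def fcm_threshold(distances, m=2.0, iters=200, tol=1e-9):
    """Two-cluster fuzzy C-means on scalar values (illustrative sketch).

    Returns the midpoint between the two cluster centers, which is the
    point where a value has equal membership in both clusters.
    The fuzzifier m must be greater than 1.
    """
    lo, hi = min(distances), max(distances)
    if lo == hi:
        return lo  # all values identical: nothing to separate
    centers = [lo, hi]  # deterministic initialisation at the extremes
    power = 2.0 / (m - 1.0)
    for _ in range(iters):
        # Membership update: u_k = 1 / sum_j (d_k / d_j)^(2 / (m - 1))
        memberships = []
        for x in distances:
            d = [abs(x - v) for v in centers]
            if min(d) == 0.0:
                # The value sits exactly on a center: full membership there.
                hits = [1.0 if dk == 0.0 else 0.0 for dk in d]
                row = [h / sum(hits) for h in hits]
            else:
                row = [1.0 / sum((dk / dj) ** power for dj in d) for dk in d]
            memberships.append(row)
        # Center update: mean of the values weighted by membership^m
        new_centers = []
        for k in range(2):
            w = [row[k] ** m for row in memberships]
            new_centers.append(sum(wi * x for wi, x in zip(w, distances)) / sum(w))
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return sum(centers) / 2.0


# Hypothetical dissimilarities between candidate shots
threshold = fcm_threshold([0.10, 0.12, 0.15, 0.90, 0.95, 1.00])
```

Under this assumption, shot pairs whose dissimilarity falls below `threshold` would be merged into the same cluster, so the cut adapts to each video instead of being fixed in advance.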

【Abstract】 Semantic video management, including video browsing, indexing and retrieval, is necessary for the effective use of video repositories. Video content analysis aims to bridge the semantic gap between low-level features and high-level concepts and to provide an accessible way to organize and manage video data. This dissertation concentrates on audio, caption and visual content analysis and on multimodal information fusion for news video, using pattern recognition models. The three main contributions are as follows: (1) A novel anchorperson shot detection algorithm in the MPEG compressed domain is proposed, which introduces an improved compressed-domain face detection method and a new dissimilarity metric for clustering. The proposed algorithm is effective and computationally efficient. (2) A new video shot classification method based on a decision tree is proposed. Six semantic types are distinguished: Commercial, Others, Still Image, Anchorperson, Reporter and Monologue. The first three types are identified using black-frame, motion-activity, shot-duration and face features. Anchorperson shots are detected by clustering. Reporter and monologue shots are distinguished by a conditional random fields (CRFs) model, with detection recast as a sequence labeling problem that uses audio, face, motion and temporal information. The experimental results demonstrate the effectiveness of the method. (3) A novel news story segmentation method is proposed that fuses multimodal information from the results of audio classification, caption extraction and video shot classification. The video shot sequence is transformed into several keyword sequences so that news story segmentation is treated as a sequence segmentation problem. A CRF model is employed to fuse the context information within and between the keyword sequences. Experiments show that the approach is feasible and achieves good results. In addition, the dissertation surveys video content analysis techniques, develops a layered audio classification method based on rules and hidden Markov models (HMMs), designs and implements a caption extraction framework for news video, and designs and implements a COM-based video content analysis and abstraction system. Overall, the dissertation provides an in-depth investigation of semantic concept detection and multimodal information fusion.
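
The third contribution turns a shot sequence into several keyword sequences and treats story segmentation as a sequence problem. The sketch below illustrates that data transformation only, under stated assumptions: the `Shot` fields, the label vocabulary (for example `"speech"`), the padding token and the feature naming are hypothetical, and the abstract does not specify the actual feature templates used with the CRF model.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    shot_type: str         # e.g. "anchorperson", "reporter", "monologue"
    audio_type: str        # hypothetical audio class, e.g. "speech"
    caption_changed: bool  # whether the caption changed in this shot


def to_keyword_sequences(shots):
    """Turn a shot list into three parallel keyword sequences."""
    return {
        "shot": [s.shot_type for s in shots],
        "audio": [s.audio_type for s in shots],
        "caption": ["change" if s.caption_changed else "same" for s in shots],
    }


def token_features(sequences, i, window=1):
    """Features for position i: context within each sequence and across them."""
    n = len(sequences["shot"])
    feats = {}
    # Context within each keyword sequence (neighbouring shots)
    for name, seq in sequences.items():
        for off in range(-window, window + 1):
            j = i + off
            feats[f"{name}[{off:+d}]"] = seq[j] if 0 <= j < n else "<pad>"
    # Context between sequences (co-occurring keywords at the same shot)
    feats["shot|caption"] = sequences["shot"][i] + "|" + sequences["caption"][i]
    return feats


shots = [
    Shot("anchorperson", "speech", True),
    Shot("reporter", "speech", False),
]
sequences = to_keyword_sequences(shots)
features = [token_features(sequences, i) for i in range(len(shots))]
```

Per-position feature dictionaries of this kind are a common input format for linear-chain CRF toolkits; each position would then be labeled, for instance, as a story boundary or not.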

  • 【Online publication contributor】 Tianjin University
  • 【Online publication year and issue】 2009, Issue 07