Research on Video Text Information Extraction Based on Feature Fusion

【Author】 黄晓冬 (Huang Xiaodong)

【Supervisor】 马华东 (Ma Huadong)

【Author Information】 Beijing University of Posts and Telecommunications, Computer Science and Technology, 2010, Ph.D.

【摘要】 Video text provides important semantic information for video retrieval and video summarization. Video text is generally divided into two types: superimposed text and scene text. Superimposed text mainly comprises titles and captions in the video and supplies important auxiliary information about video semantics. Scene text is text that exists in the natural scene and can be used to infer scene information. Acquiring video text is therefore important for video semantic analysis. Video text extraction comprises four steps: text detection, localization, extraction, and recognition. This thesis studies video text extraction from three aspects: superimposed text detection and localization, scene text detection and localization, and text extraction, and proposes algorithms for each. The specific research problems are: how to detect and locate superimposed text against complex video backgrounds; how to detect and locate scene text under uneven illumination and irregular text alignment; and how to extract text completely from detected text regions. The main contributions of this thesis are as follows:

(1) A video superimposed text detection and localization algorithm based on the motion perception field, which provides an effective method for detecting and locating superimposed text against complex backgrounds. Because the same superimposed text keeps its position unchanged across consecutive video frames, we define the motion field obtained by fusing multi-frame motion vectors as the motion perception field and use it to capture the motion pattern of superimposed text. On top of shot segmentation, we extract 30 frames from a single shot and fuse them into one synthesized frame, on which the motion-perception-field-based detection and localization algorithm is run. Finally, multi-frame verification based on the motion perception field confirms the text regions.

(2) A scene text detection and localization algorithm based on the stroke map, which provides an effective method for scene text under uneven illumination and various text alignments. Scene text is easily affected by illumination changes and text alignment; we define the stroke map by fusing character stroke features and detect scene text on it. First, text lines are detected on the stroke map of a video frame using texture-coarseness statistics, and text regions are located with corner detection and morphological operations. Finally, for texture regions that resemble text regions, wavelet moment features, Laws texture features, and wavelet co-occurrence matrix features are extracted as feature vectors to train an SVM classifier, which then separates text regions from non-text regions among the candidates.

(3) A superimposed text extraction algorithm based on the edge map and color cluster analysis, and a scene text extraction algorithm based on an improved Niblack method, which together provide effective text extraction in complex backgrounds. For superimposed text extraction, we fuse image gradient information into a text-line edge map using a color gradient algorithm, and propose a text extraction algorithm based on character segmentation on the edge map and K-means color clustering. The algorithm first splits the text line into single characters by the vertical projection of the edge map. Each character image is then partitioned into several cluster images with K-means clustering, and the cluster images are processed with improved dam-point labeling and inward filling. For scene text, we propose an extraction algorithm based on an improved Niblack method.

(4) Design and implementation of a video text extraction system. To validate the algorithms in this thesis, we designed and implemented a verification system for video text extraction. Extensive experiments on the system show that the proposed methods acquire video text accurately, and the acquired text can be applied to video retrieval and scene understanding.

In summary, this thesis studies video text extraction algorithms and proposes effective solutions for acquiring superimposed and scene text. The proposed video text detection and extraction algorithms have practical value for video content understanding, and their effectiveness has been verified through real system applications.
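The character segmentation step in contribution (3) — cutting a text line at empty columns of its binary edge map — can be sketched as follows. This is our own minimal illustration, not the thesis implementation; `split_characters` and the toy data are assumptions:

```python
import numpy as np

def split_characters(edge_map):
    """Split a binary edge map of one text row into character spans
    by vertical projection: columns whose sum is zero are gaps."""
    proj = edge_map.sum(axis=0)
    spans, start = [], None
    for x, v in enumerate(proj):
        if v > 0 and start is None:
            start = x                      # a character begins
        elif v == 0 and start is not None:
            spans.append((start, x))       # a character ends
            start = None
    if start is not None:                  # character touching the right edge
        spans.append((start, len(proj)))
    return spans

# Toy row: two "characters" separated by an empty column band.
row = np.zeros((5, 12), dtype=int)
row[:, 1:4] = 1
row[:, 7:10] = 1
print(split_characters(row))  # [(1, 4), (7, 10)]
```

Real edge maps have noise, so a small projection threshold would replace the strict `v == 0` test in practice.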

【Abstract】 Video text brings important semantic clues for video indexing and summarization. There are two kinds of textual information in video: superimposed text and scene text. Superimposed text (e.g., captions in broadcast news programs) is added by video editors and can normally be used to infer the semantic content of a video. Scene text is text inherent in the scene captured by the camera and can be used to infer scene information. Therefore, video text information extraction is important for video semantic analysis. Extraction of text information involves detection, localization, extraction, and recognition of text in video. This thesis focuses on three aspects: superimposed text detection and localization, scene text detection and localization, and text extraction. We discuss several important problems in these areas and provide solutions: how to detect and locate superimposed text on complex backgrounds, how to detect and locate scene text under uneven illumination and various text alignments, and how to extract text efficiently. Our major contributions are as follows: (1) We propose a superimposed text detection and localization algorithm based on the motion perception field (MPF), which provides an effective method for superimposed text detection and localization on complex backgrounds. The same superimposed text keeps the same position across consecutive frames, so we define the MPF to retrieve text motion patterns and build a detection and localization method on it. First, based on shot segmentation, we extract the MPF over 30 consecutive frames of a single shot. Then we perform multi-frame integration to obtain a synthesized frame, on which we detect and locate candidate text regions using the MPF.
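The intuition behind multi-frame integration — static superimposed text survives frame fusion while moving background content blurs out — can be illustrated with a minimal sketch. A plain mean stands in here for the thesis's fusion scheme; `integrate_frames` and the synthetic frames are our assumptions:

```python
import numpy as np

def integrate_frames(frames):
    """Average a stack of frames: pixels that stay constant (static
    overlay text) keep their value, while changing background pixels
    are smoothed toward the background mean."""
    stack = np.stack([f.astype(np.float64) for f in frames])
    return stack.mean(axis=0)

# 30 synthetic 8x8 grayscale frames: one static bright "text" pixel
# at (2, 2) over random background noise.
rng = np.random.default_rng(0)
frames = [rng.integers(0, 100, (8, 8)) for _ in range(30)]
for f in frames:
    f[2, 2] = 255  # static superimposed-text pixel

synth = integrate_frames(frames)
# synth[2, 2] stays at 255; noisy background pixels average toward ~50.
```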
Finally, multi-frame verification based on the MPF filters the candidate text regions. (2) We propose a scene text detection and localization algorithm based on the stroke map, which provides an effective method for scene text detection and localization under uneven illumination and various text alignments. Scene text detection in video presents many difficulties due to uneven illumination and various text alignments. We define the stroke map, which integrates character stroke features along certain orientations, and propose a scene text detection method based on it. First, we produce a stroke map using 2D Log-Gabor filters. Second, we compute a texture feature on every line of the stroke map to detect text lines. Then we perform Harris corner detection and morphological operations to locate the text regions. Finally, a trained SVM verifies the candidate text regions. (3) We propose a superimposed text extraction algorithm based on the edge map and color clustering, and a scene text extraction algorithm based on an improved Niblack method; both provide effective text extraction in complicated backgrounds. For superimposed text, we use a color gradient method to integrate gradient information into the edge map of a text row, and propose a text extraction algorithm based on character segmentation on the edge map and color clustering. First, we produce the edge map from gradient amplitude and orientation. Second, we segment the text row into single characters based on the vertical projection of the edge map. Third, we use K-means to cluster each character image into several cluster images. Then we use dam-point labeling and inward filling to extract the binary character image. For scene text, we propose a text extraction approach based on an improved Niblack method. (4) We design and implement a video text information extraction system.
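The baseline Niblack rule underlying the scene-text extraction in contribution (3) computes a local threshold T = m + k·s (window mean plus k times window standard deviation) and labels pixels darker than T as text. The thesis uses an improved variant; this sketch shows only the classic formula, and all names in it are our assumptions:

```python
import numpy as np

def niblack_threshold(img, window=15, k=-0.2):
    """Classic Niblack local binarization: for each pixel, compute
    T = mean + k * std over a sliding window and mark pixels below T
    as (dark) text. Edge padding extends the image at the borders."""
    img = img.astype(np.float64)
    pad = window // 2
    padded = np.pad(img, pad, mode="edge")
    h, w = img.shape
    text_mask = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + window, x:x + window]
            t = patch.mean() + k * patch.std()
            text_mask[y, x] = img[y, x] < t
    return text_mask
```

A practical implementation would use integral images instead of the per-pixel loop, and the improved variants adjust k or add a global term to suppress noise in low-contrast windows.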
To verify the effectiveness of our methods, we design and implement a video text information extraction system. Experimental results demonstrate that the proposed methods can efficiently detect, locate, and extract text, which can be applied to video search and scene understanding. In this thesis, we focus on video text information extraction and propose efficient methods for superimposed text and scene text. The proposed video text detection and extraction algorithms have practical significance for video content understanding. Experimental results show that our approaches are robust and can be effectively applied to real video.
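The K-means color-clustering step of the superimposed text extraction in contribution (3) — splitting a character image's pixels into a few color clusters so the text-colored cluster can be kept — can be sketched with a tiny Lloyd's iteration. This is our own illustration, not the thesis code; the deterministic initialization from distinct colors is an assumption and requires at least k distinct pixel colors:

```python
import numpy as np

def kmeans_colors(pixels, k=3, iters=10):
    """Minimal Lloyd's k-means over an (N, 3) array of color pixels.
    Returns a cluster label per pixel and the final cluster centers."""
    # Initialize centers from the first k distinct colors (deterministic).
    centers = np.unique(pixels, axis=0)[:k].astype(float)
    for _ in range(iters):
        # Assign each pixel to its nearest center (Euclidean in RGB).
        d = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned pixels.
        for j in range(len(centers)):
            if (labels == j).any():
                centers[j] = pixels[labels == j].mean(axis=0)
    return labels, centers
```

Given a character image `char` of shape (H, W, 3), `kmeans_colors(char.reshape(-1, 3), k=3)` yields per-pixel labels that reshape back to (H, W); each label then defines one cluster image for the subsequent dam-point labeling and inward-filling steps.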
