
数字视频中的文本分割的研究

Research on Text Segmentation in Digital Video

【Author】 许剑峰 (Xu Jianfeng)

【Supervisor】 黎绍发 (Li Shaofa)

【Author Information】 South China University of Technology, Computer Application, 2005, Ph.D.

【Abstract】 Multimedia information is now used ever more widely. Library collections that once consisted almost entirely of text-based books have grown into multimedia libraries holding images, video, and audio. An essential step in building such libraries is indexing the massive volume of multimedia material so that users can retrieve it efficiently. With major technical advances in the production, storage, and distribution of multimedia data, digital video is being applied in ever more fields and has become a routine part of most people's daily lives, so finding the desired information within large collections of video has become a pressing need. Digital images and video are also at the core of digital-library initiatives, which require that all kinds of information be digitized for storage, retrieval, and manipulation. How to manage and retrieve massive amounts of video data has been one of the most challenging and active topics in academia and industry worldwide over the past decade.

In recent years a number of video retrieval systems have been studied. Some are based on low-level features, such as object shape, region intensity, color, texture, motion descriptors, and audio features; others are based on high-level features, such as face detection, speaker identification, and text recognition. Among these, extracting text from video has attracted particular attention and is an important source of indexing information. Text carries important content information, and detecting and recognizing it plays a major role in video analysis. Text can serve as a content label and index for a video segment: a news summary superimposed on a news video, for example, describes that news item and can be used to retrieve news footage. Text can mark segmentation points: the appearance of an anchor's name or of the credits can indicate the start of a news story or the end of a film. Text can also indicate the importance of video content: frames containing prominent text can be extracted as representative frames of the corresponding segment, or the portions in which prominent text appears can be kept as part of an automatically generated video summary. Analyzing and processing text is therefore an important part of video analysis, and detecting the presence and precise position of text and segmenting it from complex, changing backgrounds is the foundation of that analysis.

Extracting and recognizing text in video has many applications. Extracted text can serve as an index and annotation: for a video of a basketball game, for instance, the jersey numbers, player names, and team names on the players' uniforms can be extracted as annotations and indexes, at a far lower computational cost than indexing based on other content such as object shape. In business, manual registration of multimedia documents consumes considerable manpower; automatically reading specific text from commercial multimedia archives would save substantial human resources. Compared with text detection and recognition in scanned document images, text in video requires different methods: a scanned document generally has a single text color and a single background color, so a simple threshold is enough to separate text from background, whereas video frames typically contain various kinds of noise, the background behind the text is usually in motion, neither the text nor the background has a uniform color, and the resolution is comparatively low.

【Abstract】 Information is becoming increasingly enriched by multimedia components. Libraries that were originally pure text are continuously adding images, videos, and audio clips to their repositories, and large digital image and video libraries are emerging as well. They all need an automatic means to efficiently index and retrieve multimedia components. Most of the information available today is either on paper or in the form of still photographs and videos. The rapid growth of video data leads to an urgent demand for efficient and truly content-based browsing and retrieval systems. To construct such systems, both low-level features, such as object shape, region intensity, color, texture, motion descriptors, and audio measurements, and high-level techniques, such as human face detection, speaker identification, and character recognition, have been studied for indexing and retrieving image and video information in recent years.

Among these techniques, video caption based methods have attracted particular attention due to the rich content information contained in caption text. Caption text routinely provides such valuable indexing information as scene locations, speaker names, program introductions, sports scores, special announcements, dates, and times. Compared to other video features, the information in caption text is highly compact and structured, and is thus more suitable for efficient video indexing. Text detection and recognition in videos can greatly assist video content analysis and understanding, since text provides a concise and direct description of the stories presented in the videos. In digital news videos, the superimposed captions usually present the name of the person involved and a summary of the news event; hence, the recognized text can become part of the index in a video retrieval system. Systems that automatically extract and recognize text from images with general backgrounds are also useful in many situations. For example, text found in images or videos can be used to annotate and index those materials: video sequences of events such as a basketball game can be annotated and indexed by extracting a player’s number, name, and the name of the team that appear on the player’s uniform. In contrast, image indexing based on image content such as the shape of an object is difficult and computationally expensive. Systems that automatically register stock certificates and other financial documents by reading specific text information in the documents are also in demand, because manual registration of the large volume of documents generated by daily trading requires tremendous manpower.

Current OCR technology is largely restricted to finding text printed against clean backgrounds and cannot handle text printed against shaded or textured backgrounds or embedded in images. More sophisticated text reading systems usually employ document analysis (page segmentation) schemes to identify text regions before applying OCR, so that the OCR engine does not spend time trying to interpret non-text items. However, most such schemes require clean binary input; some assume specific document layouts such as newspapers and technical journals; others utilize domain-specific knowledge such as mail address blocks or configurations of chess games. However, extracting captions embedded in video frames is not a trivial task. In
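To make the contrast drawn in both abstracts concrete, below is a minimal sketch of the simple global-threshold binarization that suffices for clean scanned documents but, as argued above, breaks down on video frames with noisy, moving, low-resolution backgrounds. The use of OpenCV, Otsu's method for picking the threshold, and the file name document.png are illustrative assumptions, not part of the dissertation.

```python
# Minimal sketch: global (Otsu) thresholding, the baseline that works for
# clean scanned documents but fails on typical video frames.
# Assumes OpenCV and NumPy are installed; "document.png" is a hypothetical
# scan with dark text on a light, uniform background.
import cv2

# Load the scan as a single-channel grayscale image.
gray = cv2.imread("document.png", cv2.IMREAD_GRAYSCALE)

# Otsu's method selects one global threshold separating the two dominant
# intensity modes (ink vs. paper). THRESH_BINARY_INV maps the darker text
# to white (255) and the background to black (0).
_, binary = cv2.threshold(gray, 0, 255,
                          cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

cv2.imwrite("document_binarized.png", binary)
```

On a video frame, a single global threshold like this typically merges caption pixels with textured or moving background regions, which is why the dissertation pursues dedicated text detection and segmentation methods instead.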
