
Research on a Fast Classification Algorithm for Violent Videos Based on a Bag of Audio Words and MPEG-7 Features

Research on Violent Video Detection Algorithm Based on Bag of Audio Words and MPEG-7 Features

【Author】 Li Rongjie (李荣杰)

【Advisor】 Jiang Xinghao (蒋兴浩)

【Author Information】 Shanghai Jiao Tong University, Communication and Information Systems, 2010, Master's thesis

【Abstract (translated)】 With the popularity of online video, the Internet now hosts videos of every kind. In recent years computer vision has attracted growing attention: by analyzing a video's binary data, a computer can determine the category each video belongs to. Traditional content-based video classification consists of two parts, video feature extraction and audio feature extraction. Video features are mainly global image features such as color, texture, and shape; comparing the similarity of these visual features allows images matching a user's query to be retrieved automatically. Audio features are extracted from the audio stream, e.g. pitch-frequency bandwidth, spectral flux, Mel-frequency cepstral coefficients, and audio power. After classifier training, these video and audio features enable fairly accurate recognition of video categories. On the other hand, the Internet is flooded with unhealthy videos; horror and violent videos in particular pose considerable harm to children's development, so such videos need to be labeled and regulated, and demand for online-video regulation has risen in recent years.

To address this need, this thesis proposes two classification methods for violent videos. It introduces a "bag of audio words" feature that combines MPEG-7 audio features with the bag-of-words model. First, the audio stream is extracted from an online video and its MPEG-7 audio features are computed; by classifying and clustering the audio signature features, "audio words" specific to violent scenes are constructed, and a dedicated weighting scheme then yields the new "bag of audio words" feature. Experiments show that this method achieves good recall and can be applied to real-time monitoring of online video. The thesis also combines visual and audio features to propose two filtering models tailored to violent video: a structure-tensor filtering model and a fast audio filtering model. The structure-tensor model filters the video with a structure-tensor feature (a motion-detection feature) to obtain frames with intense motion, then applies face detection and audio-scene matching. The fast audio filtering model first extracts audio features to match common violent scenes, then performs precise classification of the resulting candidate shots using image features. Experiments show that the fast audio model classifies faster than the structure-tensor model, while the structure-tensor model is more accurate; both can be applied effectively to filtering violent videos on the Internet.
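The abstract does not specify how the "audio words" are clustered or how the weighting scheme works. A minimal sketch of the general bag-of-audio-words pipeline it describes, assuming k-means for codebook construction and an optional IDF-style reweighting (the function names, the choice of k-means, and the weighting are all illustrative assumptions, not the thesis's actual method):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_audio_vocabulary(frame_features, n_words=64, seed=0):
    """Cluster per-frame audio feature vectors (e.g. MPEG-7 low-level
    descriptors) into an 'audio word' codebook via k-means.
    frame_features: (n_frames, n_dims) array."""
    km = KMeans(n_clusters=n_words, n_init=10, random_state=seed)
    km.fit(frame_features)
    return km

def bag_of_audio_words(km, clip_features, weights=None):
    """Quantize each frame of a clip to its nearest audio word and build a
    normalized word histogram; 'weights' stands in for the thesis's
    (unspecified) weight-assignment mechanism."""
    words = km.predict(clip_features)
    hist = np.bincount(words, minlength=km.n_clusters).astype(float)
    if weights is not None:  # hypothetical per-word reweighting
        hist *= weights
    total = hist.sum()
    return hist / total if total > 0 else hist
```

The resulting fixed-length histogram can then be fed to a binary classifier (the English abstract names a support vector machine) to separate violent from non-violent clips.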

【Abstract】 With the flourishing of the movie industry and the development of multimedia, many types of movies are available through the Internet. A human viewer can easily distinguish movie genres after watching them; for a computer, however, automatically recognizing the theme of a video is a complicated task. In recent years, increasing attention has been paid to the computer vision research area. A computer can distinguish video genres by comparing the binary data of video and audio features. Traditional content-based video classification mainly includes two parts: audio features and video features. The visual features include color, texture, and motion, while the audio features mainly include low-level features such as audio bandwidth, frequency, and Mel-frequency features. On the other hand, some films contain many violent and horror scenes that are unsuitable for children to watch, and governments now pay more attention to video regulation on the network. For this reason, two methods of classifying violent videos are presented in this thesis. First, a new method of identifying violent videos using a bag of audio words is introduced. The MPEG-7 audio descriptors are extracted, including low-level features such as AudioSpectrumCentroid and AudioSpectrumSpread. The audio words are then built from the MPEG-7 high-level descriptor AudioSignature, which serves as the "fingerprint" of the audio stream. A support vector machine is used to classify the feature vectors into two classes, i.e. violent and non-violent videos. The experimental results demonstrate that the method achieves good recall.

Combined with the video features, two filtering models are then introduced: a visual structure-tensor filtering model and a fast audio filtering model. In the structure-tensor model, the structure-tensor features are extracted first, and the candidate shots are then classified by face detection and violent-audio-event detection. In the fast audio model, the audio features are extracted first and the candidate shots are classified by visual features. The experimental results show that the visual structure-tensor model achieves higher classification accuracy, while the audio model runs faster. Both models can be applied to violent-video filtering on the Internet.
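The structure-tensor feature is described only as a motion-detection filter that keeps frames with intense motion. One common spatio-temporal formulation, sketched here under stated assumptions (the trace-of-tensor motion score and the thresholding rule are illustrative choices, not details given in the abstract):

```python
import numpy as np

def motion_energy(frames):
    """Spatio-temporal structure-tensor trace as a simple motion score.

    frames: (T, H, W) grayscale array. For each frame transition, the
    gradient vector g = (Ix, Iy, It) defines the 3x3 structure tensor
    J = g g^T; trace(J) = Ix^2 + Iy^2 + It^2, averaged over the frame,
    is large when motion is intense."""
    f = frames.astype(float)
    it = np.diff(f, axis=0)                    # temporal gradient
    iy, ix = np.gradient(f[:-1], axis=(1, 2))  # spatial gradients
    return (ix**2 + iy**2 + it**2).mean(axis=(1, 2))

def filter_candidate_shots(frames, threshold):
    """Indices of frame transitions whose motion score exceeds the
    threshold; these would proceed to face detection and violent-audio
    matching in the thesis's pipeline."""
    energy = motion_energy(frames)
    return np.nonzero(energy > threshold)[0]
```

This ordering explains the reported speed/accuracy trade-off: the structure-tensor model runs the cheap motion filter over every frame before the expensive visual checks, whereas the fast audio model prunes with audio matching first.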
