Research on Middle Semantic Representation Based Image Scene Classification

【Author】 解文杰 (Xie Wenjie)

【Supervisor】 须德 (Xu De)

【Author Information】 Beijing Jiaotong University, Computer Science and Technology, 2011, PhD

【Abstract (Chinese)】 With the development of multimedia and computer network technology, the amount of image data people encounter is growing at an unprecedented rate. Faced with such massive image resources, how to effectively analyze, organize, and manage image data and realize content-based image retrieval has become a research focus in multimedia technology. The scene classification task arose against this background. Scene classification automatically annotates an image database according to a given set of semantic categories, providing effective contextual semantic information to guide higher-level image understanding tasks such as object recognition. The difficulty lies in enabling the computer to understand the scene semantics of an image from the perspective of human cognition and to effectively distinguish the intra-class diversity and inter-class similarity of image scenes. Building on mid-level semantic representations of scenes, this thesis focuses on how to extract effective visual features from scene images and bridge the semantic gap between low-level image features and high-level semantics. Around this problem, the thesis achieves the following research results.

First, it proposes a scene classification algorithm that constructs class-specific visual dictionaries, using mutual information as the feature selection method. According to each visual word's contribution to a given category, the words contributing most to that category are selected from the universal visual dictionary to form its class-specific dictionary, from which a class-specific histogram is generated. The final fused histogram is produced by adaptively weighting and merging the global histogram, based on the universal dictionary, and the class-specific histogram, based on the class-specific dictionary; this weighted merging lets the two histograms describe the image in mutual competition. The fused histogram retains the discriminative power of the global histogram, while the class-specific histogram strengthens the ability to distinguish similar scenes of different categories, overcoming inter-class similarity and improving classification accuracy.

Second, it proposes a multi-scale, multi-level scene classification model based on different feature granularities (Multi-Scale Multi-Level pLSA, MSML-pLSA). The model consists of two parts: the multi-scale part extracts visual details from scene images at different scales to build a multi-scale histogram, and the multi-level part linearly concatenates the scene representations corresponding to different numbers of semantic topics into the final scene representation, the multi-scale multi-level histogram. MSML-pLSA integrates visual and semantic information of different granularities within a unified framework, yielding a more complete scene description.

Third, it proposes a scene classification algorithm that extracts contextual information by unsupervised learning, extending local visual words to contextual visual words. A contextual visual word encodes not only the local visual information of a given region of interest (ROI) at the current scale, but also the visual information contained in the ROI's neighboring regions and in the region co-centered with the ROI at the adjacent coarser scale. By introducing the ROI's contextual information, contextual visual words describe the semantics of an image scene more effectively, reducing semantic ambiguity and thus the scene classification error rate.

Finally, it studies how the number of interest points affects classification accuracy under the bag-of-words (BoW) representation. In building a BoW model, selecting interest points that better characterize the visual information of an image is very important. A widely held view in scene classification is that a larger number of interest points yields higher classification accuracy, yet this view had never been verified. Within the BoW framework, this thesis conducts extensive experiments to test it, adopting four feature selection methods and three different SIFT (Scale-Invariant Feature Transform) descriptors to vary the number of interest points. The experimental results demonstrate that the number of interest points markedly affects scene classification accuracy.
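The first contribution above reduces to a scoring-and-selection step that is easy to illustrate. The following is a minimal Python sketch, not the thesis's implementation: it scores each word of the universal dictionary by mutual information with a target class (with word occurrence binarized, a simplification), keeps the top-scoring words as the class-specific dictionary, and fuses the global and class-specific histograms. The fixed weight `alpha` merely stands in for the adaptive weighting described above; all function names are illustrative.

```python
import numpy as np

def word_class_mutual_information(counts, labels, target):
    """MI between each visual word's occurrence and membership in `target`.
    counts: (n_images, n_words) bag-of-words count matrix (NumPy array);
    labels: (n_images,) array of class labels.
    Word occurrence is binarized here -- an illustrative simplification."""
    present = counts > 0
    in_class = labels == target
    mi = np.zeros(counts.shape[1])
    for w in range(counts.shape[1]):
        for x in (True, False):            # word present / absent
            for c in (True, False):        # image in class / not in class
                p_xc = np.mean((present[:, w] == x) & (in_class == c))
                p_x = np.mean(present[:, w] == x)
                p_c = np.mean(in_class == c)
                if p_xc > 0:               # joint > 0 implies both marginals > 0
                    mi[w] += p_xc * np.log(p_xc / (p_x * p_c))
    return mi

def fused_histogram(hist, selected, alpha=0.5):
    """Concatenate the global histogram with the class-specific histogram
    restricted to `selected` words. `alpha` stands in for the adaptive
    weight the thesis learns."""
    def normalize(h):
        s = h.sum()
        return h / s if s > 0 else h
    return np.concatenate([alpha * normalize(hist),
                           (1.0 - alpha) * normalize(hist[selected])])

# Usage sketch: keep the k highest-MI words as the class-specific dictionary.
# mi = word_class_mutual_information(train_counts, train_labels, target=3)
# selected = np.argsort(mi)[::-1][:200]
# rep = fused_histogram(image_hist, selected)
```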

【Abstract】 With the development of multimedia technology and computer networks, content-based image retrieval (CBIR) has become increasingly important for organizing, indexing, and retrieving massive image collections across many application domains, and has emerged as a hot research topic in recent years. Scene classification arose in this context: it automatically annotates images with a given set of semantic labels, providing effective higher-level contextual information for image understanding tasks such as object recognition. The key challenge lies in training the computer to understand the semantic content of scenes from a human cognitive perspective, and to recognize the similarities and diversities among scenes of different categories. Based on mid-level scene representations, our work focuses on extracting effective visual information from scene images and narrowing the well-known semantic gap between low-level visual features and high-level semantic concepts. This thesis achieves the following research results.

First, we propose a multiple class-specific visual dictionary framework for scene classification, in which the class-specific dictionaries are constructed using mutual information as the feature selection method. According to the contribution of each visual word to classification, the universal visual dictionary is tailored to form a class-specific codebook for each category. An image is then characterized by a set of combined histograms generated by concatenating the traditional histogram, based on the universal codebook, with the class-specific histogram, based on the class-specific codebook. We also propose a practical adaptive weighting method that introduces competition between the traditional and class-specific histograms. The proposed method provides much more effective information to overcome the similarity between images of different categories and improves categorization performance.

Second, we propose a novel and practical scene classification algorithm called the Multi-Scale Multi-Level pLSA model (MSML-pLSA). It consists of two parts: a multi-scale part, in which the image is decomposed into multiple scales and diverse visual details are extracted from the layers of different scales to construct the multi-scale histogram; and a multi-level part, in which the representations corresponding to different numbers of topics are linearly concatenated to form the multi-level histogram. The resulting representation captures the scene at varying visual and semantic granularities. The MSML-pLSA model creates a more complete representation of the scene by jointly including fine and coarse visual detail, and a comparative study shows the superiority of the proposed method.

Third, we present a scene classification approach that learns contextual information in an unsupervised manner, extending the 'bag of visual words' model to a 'bag of contextual visual words' model. A contextual visual word represents both the local property of a region of interest (ROI) and its contextual properties (from the coarser scale and from neighboring regions) simultaneously. By considering the contextual information of the ROI, the contextual visual word gives a richer representation of the scene image, reducing ambiguities and errors.

Finally, we study the relationship between the number of interest points and classification accuracy in scene classification. A common belief holds that more interest points yield higher accuracy, but little effort had been made to verify it. To validate this viewpoint, we conducted extensive experiments within the bag-of-words framework; in particular, three different SIFT descriptors and four feature selection methods were adopted to vary the number of interest points. Experimental results show that the number of interest points significantly affects classification accuracy.
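The MSML-pLSA construction described above is essentially a double loop over scales and topic counts, with the per-image topic mixtures concatenated at the end. The sketch below illustrates that structure only; it substitutes scikit-learn's NMF with a Kullback-Leibler loss for pLSA (the two are closely related, but the thesis fits pLSA proper), and the scale decomposition, topic counts, and function name are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

def msml_histogram(bow_per_scale, topic_counts=(10, 20, 40)):
    """Concatenate topic-space image representations over scales (multi-scale
    part) and over topic counts (multi-level part).
    bow_per_scale: list of (n_images, n_words) BoW count matrices, one per
    image scale. NMF with KL loss is used here as a stand-in for pLSA."""
    parts = []
    for bow in bow_per_scale:                  # multi-scale part
        for k in topic_counts:                 # multi-level part
            model = NMF(n_components=k, init='nndsvda', solver='mu',
                        beta_loss='kullback-leibler', max_iter=300)
            z = model.fit_transform(bow)       # per-image topic mixtures
            z /= np.maximum(z.sum(axis=1, keepdims=True), 1e-12)
            parts.append(z)
    return np.hstack(parts)                    # final MSML representation
```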

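The contextual visual word can likewise be sketched compactly. The fragment below is an illustrative reading of the description above, not the thesis's exact formulation: it concatenates the ROI's local descriptor, the mean of its same-scale neighbours, and the co-centred descriptor from the next coarser scale, then quantizes the result against a learned codebook. The dense-grid layout, the mean pooling, and the helper names are assumptions.

```python
import numpy as np

def contextual_descriptor(fine, r, c, coarse):
    """Build a contextual descriptor for the ROI at grid cell (r, c).
    fine:   (H, W, D) local descriptors on the fine-scale grid.
    coarse: (H//2, W//2, D) descriptors on the next coarser scale, so the
            co-centred coarse cell is roughly (r//2, c//2) -- an assumption
            about the grid layout, not the thesis's exact scheme."""
    H, W, _ = fine.shape
    local = fine[r, c]
    # Mean-pool the valid 8-neighbourhood at the same scale.
    neigh = [fine[i, j] for i in range(max(r - 1, 0), min(r + 2, H))
                        for j in range(max(c - 1, 0), min(c + 2, W))
                        if (i, j) != (r, c)]
    context_same_scale = np.mean(neigh, axis=0)
    # Clamp indices so odd grid sizes stay in bounds.
    cr = min(r // 2, coarse.shape[0] - 1)
    cc = min(c // 2, coarse.shape[1] - 1)
    return np.concatenate([local, context_same_scale, coarse[cr, cc]])

def quantize(descriptor, codebook):
    """Assign the contextual descriptor to its nearest contextual visual word."""
    return int(np.argmin(np.linalg.norm(codebook - descriptor, axis=1)))
```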
  • 【CLC Number】TP391.41
  • 【Times Cited】23
  • 【Downloads】1012