基于机器学习的图像检索若干问题研究
Research on Image Retrieval with Machine Learning Techniques
【Author】 Zhang Lei (张磊);
【Supervisor】 Ma Jun (马军);
【Author Information】 Shandong University, Computer System Architecture, 2011, PhD
【Abstract (摘要)】 Over the past decade, with the spread of digital cameras, camera phones and camera-equipped mobile computers, digital images have emerged in huge numbers, and with the development of Internet technologies, Web 2.0 in particular, spreading and sharing images has become ever easier. How to organize and manage this massive amount of image information quickly and effectively has become a hot topic of common concern to both academia and industry. In recent years, as research has deepened, machine learning techniques have been widely applied to image retrieval, for example to image annotation, image content classification, user feedback modeling, image search result ranking and image dataset acquisition. Within the framework of machine learning for image retrieval, this thesis studies three problems: image annotation, image re-ranking and object detection. The main work and innovations of the thesis are as follows:

1. Image annotation aims to determine a textual semantic description of an image from its visual content. This thesis proposes an image annotation method that embeds the semantic relations between words into multi-class support vector machines. First, each image is divided into five fixed-size blocks; for the training images, the block corresponding to each annotation word is assigned manually, and the semantic relations between words are computed from a co-occurrence matrix. Then MPEG-7 visual descriptors are used to represent the visual features of each block, and the mRMR (minimal redundancy maximum relevance) feature selection method is applied to reduce the feature dimensionality. A multi-class support vector machine classifier is trained for 80 semantic words from the Corel 5000 dataset. Finally, the posterior probability outputs of the classifier and the semantic relations between words are integrated to obtain the annotation words of an image. Experiments on the Corel 5000 dataset show that the method is effective.

2. Image re-ranking reorders the original search results to improve user satisfaction by exploiting image content, mining data associations, or drawing on domain knowledge and user interaction. Although current commercial search engines have made great progress in semantic relevance, they make little use of the image content itself, so the ranked results lack visual diversity; purely cluster-based methods proposed by some researchers achieve visual diversity but risk placing irrelevant images at the top. This thesis proposes an image re-ranking method that balances semantic relevance and visual diversity. It is a hybrid algorithm that combines the reciprocal voting algorithm of Leuken et al. with the greedy algorithm of Deselaers et al. so as to obtain the advantages of both. First, each image votes for other images according to visual similarity, and the images with the most votes become candidates; then a bounded, lightweight greedy algorithm finds the most relevant and most novel images as cluster centers. When computing visual similarity, different visual features are mixed, including color, texture and topic features. pLSA and LDA are used as latent topic models for dimensionality reduction; the two topic models are compared experimentally, and the benefits of integrating topic features are discussed. The harmonic mean of cluster recall and NDCG is introduced for the first time as a criterion for measuring ranking performance. Extensive re-ranking experiments on the initial results of Google and Bing, together with comparisons against leading methods, show in terms of cluster recall, F1 score and the harmonic mean of cluster recall and NDCG that the proposed method is practical.

3. The goal of object detection is not only to decide whether an image contains a given object but also to indicate where in the image the object is located. Current leading object detection techniques mostly use supervised machine learning and combine multiple features. These supervised methods require large amounts of training data, but annotating training data for object detection is very time-consuming and labor-intensive. Although some researchers have proposed building object image collections from web images or with semi-supervised learning, such collections contain no information about object positions and can generally be used only for object classification. This thesis is the first to propose acquiring object detection datasets from the notes data in Flickr. The goal is to provide training data for object detection with little human effort while guaranteeing its high quality, which is achieved by mining the Flickr notes data. A note is a region of interest (a bounding box) added to an image by a user, together with its metadata, which includes the position, size and text of the bounding box. The method first uses text mining to find an initial set of images semantically related to the object, then manually selects high-quality images from this initial set as a seed set, and finally expands the seed set with an incremental active learning algorithm. Experiments on the PASCAL VOC 2007 and NUS-WIDE datasets show that the dataset acquired by this method can complement or even replace conventional datasets.
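The fusion step in the first contribution above (combining the SVM posterior outputs with the word co-occurrence statistics) is not spelled out in the abstract. The following is a minimal sketch of one plausible reading of that step; the function name, the anchor-word heuristic and the weight `alpha` are illustrative assumptions rather than the thesis's actual formulation.

```python
import numpy as np

def annotate(posterior, cooccurrence, num_keywords=5, alpha=0.7):
    """Combine SVM posteriors with word co-occurrence (illustrative sketch).

    posterior    : (V,) array, P(word | image blocks) from the multi-class SVM
    cooccurrence : (V, V) row-normalized word co-occurrence matrix
    alpha        : assumed weight balancing visual and lexical evidence
    """
    # Start from the purely visual evidence.
    scores = posterior.copy()
    # Take the most confident word as an anchor, then boost words that
    # frequently co-occur with it in the training annotations.
    anchor = int(np.argmax(posterior))
    scores = alpha * scores + (1.0 - alpha) * cooccurrence[anchor]
    # Return the indices of the top-ranked annotation words.
    return np.argsort(scores)[::-1][:num_keywords]
```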
【Abstract】 Over the past ten years, digital images have been springing up rapidly due to the rising popularity of digital cameras, camera phones and mobile PCs with cameras. Meanwhile, with the development of Internet technologies, especially Web 2.0, the sharing and dissemination of images have become easier day by day. How to organize these massive images efficiently and effectively has become a hot topic in both the academic and industrial communities. As research has progressed, machine learning techniques have been widely used in the image retrieval area, for tasks such as image annotation, image classification, user feedback modeling, image re-ranking and image dataset acquisition. This thesis investigates three specific image retrieval problems with machine learning techniques: image annotation, image re-ranking and object detection. The main research contents and innovations of this thesis are as follows:

1. Image annotation refers to labeling an image according to its visual content. An image annotation approach that incorporates word correlations into multi-class support vector machines (SVMs) is proposed. First, each image is segmented into five fixed-size blocks instead of relying on time-consuming object segmentation. Every keyword of a training image is manually assigned to the corresponding block, and word correlations are computed from a co-occurrence matrix. Then, MPEG-7 visual descriptors are applied to these blocks to represent visual features, and the mRMR (minimal redundancy maximum relevance) method is used to reduce the feature dimension. A block-feature based multi-class SVM classifier is trained for 80 semantic concepts from the Corel 5000 dataset. Finally, the probabilistic outputs of the SVM and the word correlations are integrated to obtain the final annotation keywords. Experiments on the Corel 5000 dataset demonstrate that this approach is effective and efficient.

2. Image re-ranking improves user satisfaction by reordering the images based on multimodal cues extracted from the initial search results (including image content and data associations), auxiliary knowledge, user feedback, etc. Even though current commercial search engines have made noticeable improvements in retrieving relevant images, their results lack visual diversity because the visual content of the images is rarely analyzed. Some studies improve visual diversity with purely cluster-based methods, at the risk of irrelevant images appearing at the top ranks. An image re-ranking approach that takes both semantic relevance and visual diversity into consideration is proposed. It is a hybrid approach designed to capture the benefits of the reciprocal election algorithm proposed by R. van Leuken et al. and the greedy search algorithm proposed by T. Deselaers et al. First, each image casts votes for other images according to visual similarity, and the images with the highest vote counts are selected as candidate representatives. Then a bounded greedy selection algorithm is employed to select the most novel and relevant candidates as cluster representatives. The approach fuses different visual features to calculate image similarity, including color, texture and especially topic features. pLSA and LDA are evaluated as dimension reduction approaches for the task of web image re-ranking, and the benefits of integrating topic distribution features are discussed. This thesis introduces, for the first time, the harmonic mean of cluster recall and NDCG as a criterion to evaluate re-ranking performance.
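The hybrid re-ranking procedure described above is only outlined at a high level. The sketch below follows that outline: images first vote for their most similar neighbours, and a bounded greedy pass then picks representatives that are both well voted (relevant) and novel. All names and parameters (`similarity`, `num_candidates`, `lambda_novelty`) are assumptions made for illustration; the thesis's actual scoring and bounds may differ.

```python
import numpy as np

def rerank(similarity, num_candidates=20, num_representatives=10, lambda_novelty=0.5):
    """Hybrid re-ranking sketch: reciprocal voting + bounded greedy selection.

    similarity : (N, N) symmetric visual-similarity matrix in [0, 1], built from
                 color, texture and topic (pLSA/LDA) features.
    """
    n = similarity.shape[0]
    # Voting stage: each image gives one vote to its most similar image.
    votes = np.zeros(n)
    for i in range(n):
        sims = similarity[i].astype(float).copy()
        sims[i] = -np.inf                      # an image does not vote for itself
        votes[int(np.argmax(sims))] += 1
    # Keep the best-voted images as candidate representatives.
    candidates = list(np.argsort(votes)[::-1][:num_candidates])

    # Bounded greedy stage: pick representatives that are relevant (well voted)
    # yet novel with respect to those already selected.
    selected = []
    while candidates and len(selected) < num_representatives:
        def score(c):
            novelty = 1.0 if not selected else 1.0 - max(similarity[c, s] for s in selected)
            return votes[c] + lambda_novelty * novelty
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

For reference, the evaluation criterion introduced above is the harmonic mean of cluster recall and NDCG at some cut-off k. Written with the standard NDCG definition (the cut-off and the relevance scale actually used in the thesis are not stated in the abstract):

\[
\mathrm{NDCG@}k = \frac{1}{\mathrm{IDCG@}k}\sum_{i=1}^{k}\frac{2^{rel_i}-1}{\log_2(i+1)},
\qquad
H@k = \frac{2 \cdot \mathrm{CR@}k \cdot \mathrm{NDCG@}k}{\mathrm{CR@}k + \mathrm{NDCG@}k},
\]

where CR@k is the cluster recall (the fraction of ground-truth clusters represented in the top k results) and IDCG@k is the DCG of the ideal ranking.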
Extensive experiments and comparisons with state-of-the-art methods demonstrate that using this approach to re-rank the initial result sets returned by the Google and Bing search engines is a practical way to improve user satisfaction in terms of cluster recall, F1 score and the harmonic mean of NDCG and cluster recall.

3. Object detection systems aim to decide not only whether an image contains a specific object but also where the object is. Most state-of-the-art object detection systems combine multiple features with machine learning techniques. For these supervised learning methods to work well, large amounts of labeled training data are needed, but labeling images for object detection is very time-consuming and requires substantial human effort. Some studies gather object images by exploiting images from the web or by using semi-supervised techniques; however, these images cannot be used for object detection because they carry no information about object size and position. A low-effort training data acquisition approach for object detection, based on active learning over the notes data in Flickr, is proposed. The motivation is to provide a high-quality training dataset for object localization with minimal human effort, which is achieved by mining the notes data in Flickr. Notes are user-defined regions of interest (bounding boxes) in an image; the metadata of a note includes the position, size and text of the bounding box. In this approach, a text mining method is first applied to gather semantically related images for a specific class. Then a handful of images are manually selected as seed images, forming the initial training set. Finally, the training set is expanded by an incremental active learning framework. This approach requires significantly less manual supervision than standard methods. Experimental results on the PASCAL VOC 2007 and NUS-WIDE datasets show that the training data acquired by this approach can complement or even substitute for conventional training data for object localization.
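The incremental active-learning expansion in the third contribution is likewise only outlined in the abstract. The loop below is a minimal sketch under assumed interfaces: `train_detector`, `oracle` and the confidence thresholds are hypothetical stand-ins for whatever detector and verification step the thesis actually uses.

```python
def expand_training_set(seed_set, candidate_pool, train_detector, oracle,
                        confidence_threshold=0.8, max_rounds=10):
    """Incremental active-learning expansion of a seed set (illustrative sketch).

    seed_set       : (image, bounding_box, label) triples chosen manually
    candidate_pool : images with Flickr notes semantically related to the class
    train_detector : callable that trains a detector from labeled examples
    oracle         : callable that confirms or rejects an uncertain example
    """
    labeled = list(seed_set)
    for _ in range(max_rounds):
        detector = train_detector(labeled)
        newly_added = []
        for example in list(candidate_pool):
            score = detector(example)              # detector confidence on the noted region
            if score >= confidence_threshold:
                labeled.append(example)            # confident: accept automatically
                newly_added.append(example)
            elif score >= confidence_threshold / 2 and oracle(example):
                labeled.append(example)            # uncertain: ask the oracle
                newly_added.append(example)
        for example in newly_added:
            candidate_pool.remove(example)
        if not newly_added:                        # stop once no new examples are accepted
            break
    return labeled
```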
【Key words】 Image Retrieval; Image Annotation; Image Re-ranking; Object Detection; SVM; Machine Learning;