基于图学习的Web信息检索技术研究

Research on Graph Learning Based Information Retrieval Techniques

【Author】 Guan Ziyu (管子玉)

【Advisors】 Chen Chun (陈纯); Bu Jiajun (卜佳俊)

【Author Information】 Zhejiang University, Computer Science and Technology, 2010, Ph.D.

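【Abstract (translated from Chinese)】 With the rapid growth of the Internet and the World Wide Web (WWW), the Web has become an indispensable source of information in people's daily lives. It has brought great opportunities and challenges to information retrieval technology. Over the past decade or so, information retrieval has changed from a purely academic discipline into the technical foundation of information acquisition for most people. As the concept of Web 2.0 has spread, the Web is no longer merely a huge information repository; it has increasingly become a platform for user participation and communication. The vigorous growth of Web 2.0 sites will again drive innovation in information retrieval technology. This thesis argues that, in the Web 2.0 era, information retrieval technology shows three main trends:
1) More flexible personalized information services. With the sharp increase in users, Web 2.0 sites urgently need to satisfy users' personalized information needs. However, traditional Web information retrieval techniques are not well suited to the complex structured data of Web 2.0 applications. Web 2.0 needs more flexible personalized information services, such as recommender systems.
2) More effective multimedia retrieval techniques. With the spread of Web 2.0, users can easily upload and share multimedia content. The rapid growth of multimedia data has made multimedia information retrieval a focus of attention.
3) Specialization of retrieval services. User-generated data in Web 2.0 applications has become an important part of the Web's information. The sheer diversity of Web data is pushing Web information retrieval toward domain-specific, specialized retrieval.
Much Web data exhibits complex intrinsic relational structure. The thesis points out that, to better solve retrieval problems on such data and improve retrieval effectiveness, the knowledge contained in this structure must be fully exploited. Graph learning techniques can model complex relational structure well and capture the knowledge it contains. Therefore, in light of the trends above, the thesis centers on graph learning based Web information retrieval, studies the following four related problems in depth, and proposes novel graph learning algorithms:
1) Personalized tag recommendation in Web 2.0 social tagging applications. Users of social tagging applications can tag resources freely, and the resulting tagging data can be modeled naturally as a graph. The thesis proposes a new graph-based query ranking algorithm for multi-type interrelated objects to solve the personalized tag recommendation problem in social tagging applications.
2) Personalized document recommendation in Web 2.0 social tagging applications. Traditional recommender systems focus on rating data, whereas the tagging data in social tagging applications is a different kind of data with a special graph structure. The thesis proposes a new graph-based dimensionality reduction (semantic space learning) algorithm for multi-type interrelated objects that maps users, tags, and documents into the same semantic space and then recommends documents according to the Euclidean distance between users and documents.
3) Face image retrieval and recognition. Traditional research on face retrieval and recognition uses dimensionality reduction (subspace learning) to obtain high-level feature representations of face images. A recently proposed graph-based second-order tensor subspace learning algorithm performs well on face images, but its time complexity is high. The thesis proposes a new, efficient graph-based second-order tensor subspace learning algorithm that lowers the time complexity of learning the subspace mapping function while maintaining acceptable retrieval and recognition performance.
4) Crawling high-quality specialized Web resources. Focused crawling is an important technique for harvesting topic-relevant resources from the Web. For vertical search engines, one of the most important research problems is how to find high-quality relevant resources on the Web. The thesis proposes a new Web graph based online algorithm for evaluating the topical quality of Web pages and, building on it, designs a focused crawler that obtains high-quality topic-relevant Web resources.
Finally, the thesis summarizes its work and discusses the prospects of graph learning based Web information retrieval.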

【Abstract】 With the proliferation and evolution of the Internet and the World Wide Web (WWW), the Web has gradually become an important information source in people’s daily life. It has brought new challenges as well as opportunities to information retrieval technology. In the last decade, Web information retrieval technology has undergone significant development. Information retrieval has changed from an academic discipline into the technical foundation of information acquisition for most people in the world. The widespread idea of Web 2.0 has made the Web not only a huge database but also a platform on which users can participate and communicate with others. The rapid proliferation of Web 2.0 applications will lead to a new round of evolution in Web information retrieval technology. This thesis argues that, in the age of Web 2.0, Web information retrieval technology shows three main evolutionary trends:
1) More flexible personalized information services. With the rapid increase in users, Web 2.0 sites urgently need to satisfy users’ personalized information needs. However, traditional Web information retrieval techniques are not well suited to the complex data structures in Web 2.0 applications. Web 2.0 applications need more flexible personalized information services, such as recommender systems.
2) More effective multimedia information retrieval techniques. Many Web 2.0 sites allow users to upload and share multimedia files, such as pictures and videos. This leads to the rapid growth of multimedia information on the Web. Thus, multimedia information retrieval has become a popular research area.
3) Domain or topic specific retrieval. User-generated data in Web 2.0 applications has become a significant part of the data on the Web. Huge and topically diverse Web data is forcing Web information retrieval to focus on domain or topic specific retrieval.
Web data usually has intrinsic complex relational structure. The thesis points out that, in order to better address related retrieval problems and improve retrieval effectiveness on such data, we need to exploit this intrinsic complex relational structure. Graph-based learning techniques can properly model complex relational structure and capture the knowledge contained in it. Thus, considering the evolutionary trends mentioned above, this thesis focuses on graph learning based Web information retrieval. Specific research topics include:
1) Personalized tag recommendation in social tagging services. In social tagging services, users can add tags to resources, and the tagging data can be modeled naturally as graphs. This thesis proposes a novel graph-based ranking algorithm for multi-type interrelated objects in order to solve the personalized tag recommendation problem in social tagging services.
2) Personalized document recommendation in social tagging services. Traditional recommender systems focus on rating data, while social tagging data is different from rating data. This thesis proposes a novel graph-based semantic space learning algorithm which projects users, tags, and documents into the same semantic space. Documents are recommended to users according to Euclidean distance.
3) Face image retrieval and recognition. Dimension reduction (subspace learning) techniques have been used to learn a high-level representation for face image retrieval and recognition. Recently, a graph-based tensor subspace learning algorithm showed good performance; however, its time complexity is high. This thesis proposes a novel, efficient graph-based second-order tensor subspace learning algorithm.
4) Focused crawling for high-quality topical Web resources. Focused crawlers are designed for harvesting topical Web pages. For vertical search engines, a key problem is how to find high-quality related Web resources. This thesis proposes a novel Web graph based online algorithm for estimating Web pages’ topical quality and, based on it, designs a focused crawler for harvesting high-quality topical resources.
Finally, the thesis summarizes this work and discusses future work on graph learning based Web information retrieval.
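
The second research topic recommends documents by Euclidean distance once users and documents sit in the same semantic space. The following is a minimal Python sketch of that final ranking step only, assuming the embeddings have already been learned. The function name `recommend_documents`, the 2-D coordinates, and the document ids are hypothetical illustrations and do not come from the thesis; the graph-based semantic space learning algorithm itself is not shown.

```python
import math

def recommend_documents(user_vec, doc_vecs, k=2):
    """Return the ids of the k documents nearest to the user (Euclidean distance)."""
    # Sort document ids by their distance to the user's position in the space.
    ranked = sorted(doc_vecs, key=lambda doc_id: math.dist(user_vec, doc_vecs[doc_id]))
    return ranked[:k]

# Hypothetical 2-D coordinates standing in for a learned semantic space.
user = (0.0, 0.0)
docs = {
    "doc_a": (3.0, 4.0),  # distance 5.0 from the user
    "doc_b": (1.0, 0.0),  # distance 1.0 from the user
    "doc_c": (0.0, 2.0),  # distance 2.0 from the user
}

print(recommend_documents(user, docs, k=2))  # ['doc_b', 'doc_c']
```

In this sketch a smaller distance means a stronger recommendation, so the nearest documents are returned first.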

  • 【Online Publication Contributor】 Zhejiang University
  • 【Online Publication Issue】 2010, Issue 08