
Research on Information Retrieval of Heterogeneous Information Networks (异构信息网络检索技术研究)

【Author】 Liu Yufeng (刘钰峰)

【Advisor】 Li Renfa (李仁发)

【Author information】 Hunan University, Computer Science and Technology, 2014, Ph.D.

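【Summary】 In the real world, information objects influence and interact with the objects around them in different aspects, at different levels, and in different ways, forming complex information networks. Information networks not only help us represent and store the essential information of the real world; by analyzing the link information they contain, they also serve as a useful tool for mining information hidden in the real world. Mining information and acquiring knowledge from information networks has therefore become an active research topic. Building on a survey of research on information networks, heterogeneous ones in particular, this dissertation constructs heterogeneous information networks by analyzing the relationships between documents and their related objects, and studies key information retrieval techniques: semi-supervised learning, document clustering, cluster label extraction for search results, and query suggestion. The main work and contributions are as follows.

(1) A framework is proposed for constructing a heterogeneous information network from the content features of queries and documents and the click relationships between them, and for semi-supervised learning on that network. Feature-based similarity graphs are built separately from the content features of queries and of documents, a query-document bipartite graph is built from the click relationships, and discriminative information from labeled samples is introduced to strengthen the network structure. A regularization framework and a label propagation algorithm are proposed for semi-supervised learning on the query-document heterogeneous information network. Given only a few labels, the method makes fuller use of the content of queries and documents and propagates labels across the relationships between them. Experiments show that it outperforms traditional semi-supervised learning methods.

(2) A graph-regularized semi-supervised learning framework is established for high-order heterogeneous information networks containing multiple object types and link types. Within the framework, graph regularization distinguishes the semantics of the different link types. A cost function is proposed that preserves the smoothness of the structure jointly revealed by labeled and unlabeled samples, and its closed-form solution is derived. A label propagation algorithm for high-order heterogeneous information networks is proposed in which label information spreads from labeled nodes to neighboring nodes until a steady state is reached, and the algorithm is proved to converge to the closed-form solution of the cost function. Several classic semi-supervised learning algorithms arise as special cases of the framework.

(3) Two topic propagation models, TP-TS and TP-Unify, are proposed for rich-text query-document heterogeneous information networks. TP-TS treats topic modeling and the random walk as two independent processes: it first builds a topic model of the text content with probabilistic latent semantic analysis (PLSA), then propagates the topic information across the heterogeneous query-document bipartite graph, revealing the topics of the nodes and assigning them to categories. TP-Unify introduces consistency constraints between heterogeneous nodes into the topic analysis, so that topic modeling and network structure analysis are carried out together.

(4) A new method for extracting cluster labels is proposed. Its basic idea is to recast label extraction as the problem of ranking the queries associated with a cluster, which avoids extracting topic words from the clustered web documents. An algorithm is proposed that jointly ranks queries and web pages by fusing the query-page click graph, the page similarity graph, and the link graph. It integrates the assessments of web pages made by users, page creators, and page authors.

(5) Log-based and semantics-based query suggestion techniques are combined. A heterogeneous Term-Query-URL information network is constructed to analyze log information and semantic information together, and query suggestions are generated by a query-based random walk with restart. Collaborative suggestion from click logs works well for high-frequency queries, while training the semantic relationships between terms and queries with a document-based method improves suggestions for sparse queries. Experiments on the query log of a large-scale commercial search engine show that the method outperforms existing query suggestion methods.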
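Contribution (2) describes a label propagation procedure in which label information spreads from labeled nodes to their neighbors until a steady state is reached, and which converges to the closed-form minimizer of a cost function. As a rough illustration only, the sketch below runs the classic single-graph scheme F ← αSF + (1 − α)Y, where S is the symmetrically normalized weight matrix; its fixed point is (1 − α)(I − αS)^{-1}Y. This is not the dissertation's algorithm, which regularizes each link type separately. The toy graph, the edge weights, and the value of `alpha` are assumptions made for the example.

```python
import math

def normalize(W):
    """Return S = D^{-1/2} W D^{-1/2} for a symmetric weight matrix W."""
    deg = [sum(row) for row in W]
    n = len(W)
    return [[W[i][j] / math.sqrt(deg[i] * deg[j]) if W[i][j] else 0.0
             for j in range(n)] for i in range(n)]

def propagate(W, Y, alpha=0.8, tol=1e-9, max_iter=1000):
    """Iterate F <- alpha * S F + (1 - alpha) * Y until F stops changing."""
    S = normalize(W)
    n, k = len(Y), len(Y[0])
    F = [row[:] for row in Y]
    for _ in range(max_iter):
        F_new = [[alpha * sum(S[i][j] * F[j][c] for j in range(n))
                  + (1 - alpha) * Y[i][c]
                  for c in range(k)] for i in range(n)]
        delta = max(abs(F_new[i][c] - F[i][c])
                    for i in range(n) for c in range(k))
        F = F_new
        if delta < tol:
            break
    return F

# Hypothetical toy network: two queries (q0, q1) and four documents (d0..d3).
names = ["q0", "q1", "d0", "d1", "d2", "d3"]
edges = [(0, 2, 1.0), (0, 3, 1.0),   # clicks from q0 on d0 and d1
         (1, 4, 1.0), (1, 5, 1.0),   # clicks from q1 on d2 and d3
         (3, 4, 0.2)]                # weak content similarity between d1 and d2
W = [[0.0] * 6 for _ in range(6)]
for i, j, w in edges:
    W[i][j] = W[j][i] = w

# Only d0 (class 0) and d3 (class 1) carry labels; all other rows stay zero.
Y = [[0.0, 0.0] for _ in range(6)]
Y[2][0] = 1.0
Y[5][1] = 1.0

F = propagate(W, Y)
labels = [max(range(2), key=lambda c: F[i][c]) for i in range(6)]
print(dict(zip(names, labels)))
```

In this toy case the unlabeled query q0 and document d1 end up in the class of d0, and q1 and d2 in the class of d3, because the click edges are stronger than the single similarity edge that bridges the two groups.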

【Abstract】 Heterogeneous information networks, composed of multiple types of objects and links, are ubiquitous in real life. This level of abstraction is powerful not only for representing and storing essential information about the real world, but also as a tool for mining knowledge from it by exploiting the links. Effective analysis of large-scale heterogeneous information networks has therefore attracted substantial interest. Following a discussion of the development history and current research of heterogeneous information networks, this dissertation focuses on several key topics in information retrieval by constructing heterogeneous information networks: semi-supervised learning, document clustering, cluster description, and query suggestion. The main results and contributions of this dissertation are as follows.

(1) We consider the semi-supervised classification problem on a query-document heterogeneous information network that combines the bipartite graph with the content information from both sides. To strengthen the network structure, we introduce the class information of sample nodes. We investigate semi-supervised learning algorithms under two frameworks: a graph-based regularization framework and an iterative framework. In the regularization framework, we develop a cost function that accounts for both the direct relationship between the two entity sets and the content information from both sides, which leads to a significant improvement over the baseline methods.

(2) We consider the semi-supervised classification problem on heterogeneous information networks with an arbitrary schema consisting of a number of object and link types. By applying graph regularization to preserve consistency over each relation graph, one per link type, we develop a classifying function that is sufficiently smooth with respect to the intrinsic structure collectively revealed by labeled and unlabeled points. We propose an iterative framework on heterogeneous information networks in which the information of labeled data spreads to adjacent nodes iteratively until a steady state is reached. The class memberships of unlabeled data can then be inferred from those of labeled data according to their proximities in the network. Some classic semi-supervised learning algorithms can be seen as special cases of the algorithm.

(3) Two topic propagation models, TP-TS and TP-Unify, are proposed for rich-text query-document heterogeneous information networks. TP-TS treats topic modeling and the random walk as two independent stages: PLSA provides a simplified solution for modeling the topics of documents and queries, and the topic information then propagates over the query-document bipartite graph. TP-Unify uses a joint regularization framework that incorporates the heterogeneous information network directly into topic modeling by regularizing a statistical topic model. Its improvement over TP-TS is due to the direct optimization of heterogeneous information analysis and topic modeling in a unified regularization framework.

(4) A new method for extracting category labels is proposed. The basic idea is to convert cluster description into the ranking of queries within a cluster, thus avoiding the extraction of keywords from web documents. We present a ranking algorithm that combines the query-document click graph, the document affinity graph, and the web link graph, and so effectively integrates the evaluations of users, web page creators, and web page writers.

(5) A Term-Query bipartite graph is trained by extracting semantic relationships from the snippets clicked for each query. Combining it with the Query-URL graph and the Query-Flow graph yields a heterogeneous Term-Query-URL information network. Random walk with restart (RWR) is performed on this network for query suggestion. The relevance of suggestions for long-tail queries can be greatly improved by taking both semantic information and log information into account. For new queries, a term vector of the query is constructed from a probabilistic language model. The experimental results show that our approach outperforms three baseline methods.
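The random walk with restart (RWR) used in contribution (5) can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the dissertation's implementation: the tiny Term-Query-URL graph, the node names with their `q:`, `t:` and `u:` prefixes, the unit edge weights, and the restart probability of 0.15 are all made up for the example, and the Query-Flow edges are omitted.

```python
from collections import defaultdict

def rwr(adj, seed, restart=0.15, tol=1e-10, max_iter=1000):
    """Random walk with restart from `seed` on a weighted undirected graph.

    adj maps each node to a dict {neighbor: weight}. Returns the
    stationary visiting probability of every node.
    """
    nodes = list(adj)
    p = {v: 1.0 if v == seed else 0.0 for v in nodes}
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to the seed.
        nxt = {v: (restart if v == seed else 0.0) for v in nodes}
        # Otherwise it follows an edge chosen in proportion to its weight.
        for u in nodes:
            total = sum(adj[u].values())
            for v, w in adj[u].items():
                nxt[v] += (1.0 - restart) * p[u] * w / total
        delta = max(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if delta < tol:
            break
    return p

def add_edge(adj, a, b, w=1.0):
    """Insert an undirected weighted edge."""
    adj[a][b] = w
    adj[b][a] = w

# Hypothetical toy network: queries (q:), terms (t:) and clicked URLs (u:).
adj = defaultdict(dict)
add_edge(adj, "q:cheap flights", "u:flights.example")
add_edge(adj, "q:cheap flights", "t:flights")
add_edge(adj, "q:cheap flights", "t:cheap")
add_edge(adj, "q:flight deals", "u:flights.example")
add_edge(adj, "q:flight deals", "t:flights")
add_edge(adj, "q:cheap laptops", "t:cheap")
add_edge(adj, "q:cheap laptops", "u:shop.example")

seed = "q:cheap flights"
scores = rwr(adj, seed)
# Rank the other query nodes by their visiting probability.
suggestions = sorted(
    (v for v in scores if v.startswith("q:") and v != seed),
    key=lambda v: scores[v],
    reverse=True,
)
print(suggestions)
```

Here "q:flight deals" ranks above "q:cheap laptops" because it shares both a clicked URL and a term with the seed query, while "q:cheap laptops" shares only one term. A query with no click history would still be reachable through its term nodes, which is the intuition behind the improvement for long-tail queries.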

  • 【Online publication contributor】 Hunan University
  • 【Online publication year and issue】 2014, Issue 09