
搜索引擎中命名实体查询处理相关技术研究

Relevant Techniques of Named Entity Query Processing for Search Engine

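【Author】 Wu Dayong (伍大勇)

【Advisors】 Liu Ting (刘挺); Zhang Yu (张宇)

【Author Information】 Harbin Institute of Technology (哈尔滨工业大学), Computer Application Technology, 2012, Ph.D.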

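【Abstract (translated from the Chinese)】 The Internet has become an important platform for obtaining information and conducting transactions. With the rapid growth of data and application resources on the Internet, search engines have become a necessary tool for retrieving information quickly and accurately from massive online resources. Users express their information needs by submitting queries to a search engine, and the search engine returns the results users need based on its analysis of those queries; queries are the necessary channel of communication between users and search engines. For a search engine to understand accurately the information needs expressed in queries, research on automatic query analysis and processing is required. Named entity queries are an important class of queries: they make up a high proportion of search engine queries and have characteristics of their own. Studying techniques for processing them enables a search engine to analyze users' search intent better, return accurate results, and improve the search experience. Named entity query processing generally covers obtaining the semantic segments in a query, recognizing the entities a query contains, and analyzing the search intent of named entity queries. Accordingly, this dissertation studies named entity query processing from the following aspects.

1. Unsupervised automatic query segmentation based on a monolingual word alignment model. Query segmentation is a basic and necessary query processing task: it splits a query from a character sequence into semantic units such as words or phrases. Because the vocabulary appearing in queries is huge and contains many irregular words, supervised methods require large amounts of manually annotated training data and therefore do not adapt well to the task. This dissertation proposes an unsupervised query segmentation method based on a monolingual word alignment model. The method trains the segmentation model automatically using only the query log, and the model combines character co-occurrence, position, and fertility information, yielding good segmentation results. Building on term-level segmentation, the dissertation further performs hierarchical segmentation, representing a query as a tree of segments that shows which segments are more closely related. Experiments show that the method outperforms existing segmentation methods.

2. Mining named entities from query logs based on a random walk model on a graph. A query log is a data resource containing a large number of named entities. Named entities mined from query logs better match the way users employ entity names when constructing queries, and because query logs are continually updated, they record newly emerging entity names. This makes named entity mining from query logs especially practical for search engines that process named entity queries. The dissertation adopts a weakly supervised method that uses a small number of named entity names from the target category as seeds. It builds a tripartite graph from the candidate named entities extracted from the query log, the context templates of named entities in queries, and the URLs users clicked, and applies a random walk algorithm on the graph to obtain named entities of the target category. Experiments show that the method effectively combines the entity-related information in the query log and improves the accuracy of named entity extraction.

3. Acquiring synonymous attribute phrases of named entities from online encyclopedias. Among the attribute phrases of named entities, phrases that describe the same attribute of an entity in different forms are called synonymous attribute phrases. Acquiring them helps in analyzing the search intent of named entity queries, because users usually build such queries with attribute phrases to express a need for the value of an entity attribute. The dissertation obtains attribute phrases of named entities from online encyclopedias and uses a classification framework that combines multiple features to identify the synonymous ones. To our knowledge, this is the first study to propose using online encyclopedias to acquire synonymous attribute phrases. Experiments show that online encyclopedias are an effective resource for this purpose and that the proposed method acquires a large number of synonymous attribute phrases.

4. Recognizing the search intent of named entity queries. This part consists of classification-based intent recognition and finer-grained intent recognition based on query search patterns. Query intent classification restricts the category space of the search results and improves retrieval accuracy. The classification fuses information from multiple resources: effective intent classification features are obtained by analyzing the query text, the query log, and Internet search results. From the informational and transactional named entity queries identified by the intent classification model, the dissertation then extracts the search patterns users frequently employ and clusters patterns with similar search intent. Search patterns can be matched against submitted queries to help the search engine analyze their intent accurately. A graph-model-based method and a similarity-based method are applied in cascade to acquire the search patterns. Experiments show that the method acquires search patterns effectively across multiple entity categories.

In summary, this dissertation studies several key techniques of named entity query processing. For the sake of broader applicability, some of these techniques target not only named entity queries but can also be applied to other queries. The research has reached some preliminary conclusions and results, which we hope will benefit the task of named entity query processing in search engines.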

【Abstract】 The Internet has become an important platform on which people access information and carry out transactions. As the information and application resources on the Internet grow explosively, search engines have become an indispensable tool that helps people reach the information they need quickly and precisely. Users issue queries to a search engine to express their information needs, and the search engine returns results based on its analysis of those queries. Queries are thus the medium through which users' information needs are delivered to a search engine. For a search engine to understand these needs better, research on techniques for processing and analyzing queries is necessary.

Named entity queries are an important type of query: they account for a high percentage of search engine queries and have distinctive features of their own. Research on named entity query processing helps a search engine better understand the search intent behind users' queries, provide more precise search results, and give users a better search experience. Relevant work on named entity query processing includes acquiring semantic segments in queries, recognizing the named entities in queries, and analyzing the search intent of queries. The main contents of our research can be summarized as follows:

1. Unsupervised query segmentation based on a monolingual word alignment model. Query segmentation, a fundamental and essential query processing task, obtains a sequence of words or phrases by segmenting a sequence of characters. Queries contain a very large vocabulary, including a great number of informal words. Supervised segmentation methods need a large amount of manually annotated training data, which makes them poorly suited to query segmentation. We therefore propose an unsupervised approach in which the query segmentation model is trained using only the query log. By effectively combining information about character co-occurrence, position, and fertility in queries, the model achieves good performance. We further study multilevel query segmentation, in which a query is parsed into a tree structure that shows which segments in the query are closely related to each other. The experimental results show that our approach achieves higher accuracy than existing methods.

2. Mining named entities in query logs based on a random walk on a graph. The queries in a query log contain many named entities. Named entities mined from a query log match the way users actually write queries, and because the query log of a search engine is constantly updated, it contains newly emerging named entities. Mining named entities from query logs is therefore useful for processing named entity queries. This work proposes a weakly supervised mining method. First, a few manually selected named entities are used as seeds for a given named entity category. Then the context patterns, the candidate named entities, and the users' clicked URLs are extracted from the query log using the seeds in a bootstrapping process and are used to construct a tripartite graph. Finally, the named entities belonging to the given category are extracted using a random walk algorithm on the graph. The experimental results show that the algorithm effectively exploits the entity-related information in a query log to improve the performance of named entity mining.

3. Acquiring synonymous attribute phrases for named entities via online encyclopedias. A named entity has a number of attributes that describe its properties or features. Synonymous attribute phrases are phrases that refer to the same attribute of a named entity category with different surface forms. In named entity queries, attribute phrases are usually used to express a need for the corresponding attribute value, so synonymous attribute phrases are beneficial for analyzing the search intent of named entity queries. This work exploits online encyclopedias to acquire the attribute phrases of named entities and identifies the synonymous ones among them using a classification framework that combines multiple features. To our knowledge, this is the first attempt to acquire synonymous attribute phrases using online encyclopedias. The experimental results show that online encyclopedias are a rich resource for this purpose and that our approach can effectively acquire a large number of synonymous attribute phrases.

4. Recognizing the intents of named entity queries. This work has two parts: coarse-grained intent analysis, which identifies query intents by classification, and fine-grained intent analysis, which acquires the search patterns of named entity queries. For query intent classification, we adopt an approach that combines multiple effective features acquired from different resources, including semantic and syntactic analysis of the query text, information obtained from the query log, and the contents of results returned by a search engine. Query intent classification can limit the search space of a search engine to the classified categories and thus improve the precision of the search results. We then use the informational and transactional named entity queries recognized by the classification model to extract the query patterns that users often use. The query patterns are clustered into groups whose members share the same search intent. When a query pattern matches a query issued to the search engine, the search engine can accurately capture the search intent of that query. This work proposes a cascade method in which a graph-based method and a similarity-based method are applied successively to extract query patterns from named entity queries. The experimental results demonstrate that our method can effectively acquire query patterns for multiple named entity categories.

In summary, this dissertation describes research on some crucial techniques of named entity query processing, some of which can be applied not only to named entity queries but also to general queries. The research has achieved some preliminary results, which we hope will be helpful to the task of named entity query processing in search engines.
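The weakly supervised mining step in item 2 can be illustrated with a minimal sketch. The abstract does not give the transition weights, restart behavior, or stopping rule of the random walk, so everything below is an assumption: the function name `random_walk_with_restart`, the uniform edge weights, the restart probability, the fixed iteration count, and the toy nodes (prefixed `e:` for candidate entities, `p:` for context patterns, `u:` for clicked URLs) are all hypothetical and are not taken from the dissertation.

```python
from collections import defaultdict


def random_walk_with_restart(edges, seeds, restart=0.15, iters=50):
    """Score graph nodes by a random walk that restarts at the seed nodes.

    edges: iterable of (node_a, node_b) undirected links, here between
           candidate entities and the patterns or URLs they occur with.
    seeds: seed entity nodes for the target category (assumed input).
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seeds_in_graph = [s for s in seeds if s in neighbors]
    if not seeds_in_graph:
        raise ValueError("no seed appears in the graph")

    # Restart distribution: uniform over the seeds present in the graph.
    start = {n: 0.0 for n in neighbors}
    for s in seeds_in_graph:
        start[s] = 1.0 / len(seeds_in_graph)

    score = dict(start)
    for _ in range(iters):
        # Each step keeps `restart` of the mass at the seeds and spreads
        # the rest evenly from every node to its neighbors.
        nxt = {n: restart * start[n] for n in neighbors}
        for n, s in score.items():
            share = (1.0 - restart) * s / len(neighbors[n])
            for m in neighbors[n]:
                nxt[m] += share
        score = nxt
    return score


# Toy tripartite graph (hypothetical data, for illustration only).
edges = [
    ("e:harry potter", "p:# movie"), ("e:harry potter", "u:imdb.com"),
    ("e:twilight", "p:# movie"), ("e:twilight", "u:imdb.com"),
    ("e:beijing", "p:# weather"), ("e:beijing", "u:weather.com"),
]
scores = random_walk_with_restart(edges, seeds={"e:harry potter"})

# Rank candidate entities only; patterns and URLs act as bridges.
ranked = sorted(
    (n for n in scores if n.startswith("e:")),
    key=lambda n: scores[n],
    reverse=True,
)
```

In this toy graph the candidate that shares a pattern and a URL with the seed ranks above the candidate that shares nothing with it, which is the intuition behind propagating category evidence from seeds through patterns and clicks.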
