Research on Related Entity Finding and Homepage Finding

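【Author】 Zhou Wenyuan (周文渊)

【Advisor】 Xu Weiran (徐蔚然)

【Author Information】 Beijing University of Posts and Telecommunications, Electronics and Communication Engineering (professional degree), 2013, Master's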

【Abstract】 REF (Related Entity Finding) is a highly promising research topic in entity retrieval at TREC (Text Retrieval Conference), and research on it stands to change both search engines and the way people process information on the web. Given the information in a topic, the REF task requires extracting, from the Internet and related databases, the related entities that answer the topic together with the homepage of each entity. This thesis surveys the state of research at home and abroad and several cutting-edge algorithms, and analyzes each stage of the pipeline in turn: keyword extraction and expansion, text retrieval, paragraph segmentation and relevance computation, named entity recognition, entity ranking, and supporting-document retrieval. The improvements and innovations made to the implementation are as follows: (1) Where earlier approaches processed the full text of each web page, processing of short texts (paragraphs) was added. This discards a large amount of irrelevant text, reduces the size of the returned text, and improves the processing efficiency of the system. (2) Based on the structural characteristics of Wikipedia, its synonyms and hypernyms were used to build a Wikipedia-based category dictionary, which is applied in the entity extraction stage. This accommodates the many fine-grained entity types in this year's REF task and also improves the accuracy of entity extraction. (3) A word-density-based algorithm was added to check the results of the DCM (document-centered model), with fairly good results. In addition, the parameters in the DCM formula were adjusted according to last year's answers, improving the model.
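Contribution (1) above, paragraph-level filtering, can be illustrated with a minimal sketch. The abstract does not give the thesis's actual relevance formula, so the keyword-density score, the blank-line paragraph splitter, the `threshold` value, and all function names below are assumptions made purely for illustration.

```python
import re


def split_paragraphs(text):
    """Split raw page text into paragraphs on blank lines (assumed delimiter)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def keyword_density(paragraph, keywords):
    """Fraction of the paragraph's tokens that are topic keywords.

    This is a stand-in relevance score, not the measure used in the thesis.
    """
    tokens = re.findall(r"\w+", paragraph.lower())
    if not tokens:
        return 0.0
    wanted = {k.lower() for k in keywords}
    hits = sum(1 for t in tokens if t in wanted)
    return hits / len(tokens)


def filter_paragraphs(text, keywords, threshold=0.1):
    """Keep only paragraphs whose keyword density reaches the threshold."""
    return [
        p for p in split_paragraphs(text)
        if keyword_density(p, keywords) >= threshold
    ]


page = (
    "Boeing supplies aircraft to many airlines.\n\n"
    "Cookie policy and site navigation links.\n\n"
    "Airlines operating Boeing 747 aircraft include several carriers."
)
kept = filter_paragraphs(page, ["boeing", "airlines", "aircraft"])
```

Here the navigation paragraph contains no topic keywords and is dropped, so only the two topical paragraphs are passed on to later stages such as entity extraction.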

  • 【Classification Number】TP391.3
  • 【Downloads】38