Keyword and Key Concept Extraction Technique Based on Web Pages

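【Author】 Wang Mingyan (王明燕)

【Advisor】 Chen Xinxiang (陈信祥)

【Author information】 Beijing University of Technology, Computer Software and Theory, 2003, Master's thesis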

【Abstract】 Keyword extraction is an important technique in text information processing. Because computers still fall well short in natural language understanding, keyword extraction is a key technique in any task that involves understanding text, such as automatic summarization, automatic text classification, subject-term extraction, and topic extraction. This thesis describes in detail a technique for extracting keywords and key concepts from Web pages, presents the design and implementation of an experimental system, and discusses the application of the technique in search engines. The core of the thesis comprises three parts.

First, the keyword extraction system. Starting from the ways in which Web pages differ from ordinary text, the thesis introduces a keyword extraction technique for Web pages. The implementation makes full use of the various tags in a Web page.

Second, the key concept extraction system. Language is a continually evolving culture: new concepts emerge constantly, and there are many proper nouns such as person names, place names, and organization names. The presence of these concepts degrades the quality of keyword extraction. Starting from the commonly used N-gram method, the thesis analyzes its main problem, the N-gram truncation effect, and proposes a key concept extraction method based on context and mutual information. The method overcomes the truncation effect of the N-gram algorithm and extracts concepts of variable length. The thesis also combines it with rule-based word selection to optimize the extraction results, and obtains fairly good experimental results.

Finally, the thesis briefly discusses, at a theoretical level, the application of the technique in search engines. By analyzing the problem of "relevance" in search engines (relevance from the system's perspective and relevance from the user's perspective), it proposes an improved system-side relevance model and outlines a design for implementing the model.
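
The abstract does not give the thesis's actual formulas, so the following is only a minimal sketch of the general idea it names: using mutual information between adjacent tokens to grow candidates of variable length instead of cutting them at a fixed N. The function names `pmi_table` and `extract_concepts`, the thresholds `min_pmi` and `min_count`, and the greedy chaining rule are all assumptions made for illustration. The context-dependency test and the rule-based filtering mentioned in the abstract are not modeled here.

```python
import math
from collections import Counter


def pmi_table(tokens):
    """Pointwise mutual information (base 2) for each adjacent token pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    table = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


def extract_concepts(tokens, min_pmi=1.0, min_count=2):
    """Greedily chain adjacent tokens whose pair count and PMI pass the
    (hypothetical) thresholds, yielding variable-length candidates."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    pmi = pmi_table(tokens)
    concepts = Counter()
    current = [tokens[0]] if tokens else []
    for a, b in zip(tokens, tokens[1:]):
        if bigrams[(a, b)] >= min_count and pmi[(a, b)] >= min_pmi:
            current.append(b)  # strong association: extend the candidate
        else:
            if len(current) > 1:
                concepts[tuple(current)] += 1
            current = [b]  # weak association: start a new candidate
    if len(current) > 1:
        concepts[tuple(current)] += 1
    return concepts


# Toy input, already tokenized; real Web text would need segmentation first.
sample = ("support vector machine works well but support vector machine "
          "training is slow and vector length matters").split()
print(extract_concepts(sample))
```

On this toy input the only candidate returned is the three-token unit `("support", "vector", "machine")`, counted twice. A fixed bigram model would instead split it into `support vector` and `vector machine`, which is the kind of truncation the abstract refers to.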

【Key words】 keyword; key concept; search engine
  • 【Classification code】 TP393.092
  • 【Times cited】 8
  • 【Downloads】 491