

Knowledge Acquisition from Text

【Author】 王菁华

【Advisor】 钟义信

【Author information】 Beijing University of Posts and Telecommunications, Signal and Information Processing, 2008, PhD

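【Summary】 People use written language to describe the world and express their ideas, and text is an important medium for passing on human knowledge. With the arrival of the knowledge economy, document knowledge management has attracted wide attention in both academia and industry. Document knowledge management systems nevertheless face several important problems: how to identify the topic of a document and its central terms; how to give personalized highlights of the content a user cares about; and how to return precisely the information the user wants. Keyword extraction and information extraction are important text processing technologies that can address these problems to some extent. This thesis studies single-document keyword extraction based on semantic dictionaries and the rule generation mechanism in information extraction. The main work and results are as follows.

1) Word sense disambiguation based on semantic networks and UW-PageRank. A knowledge-based word sense disambiguation algorithm combining semantic networks with UW-PageRank is proposed. It can disambiguate, in real time, any word in a document that is also contained in the knowledge base, and it requires neither a corpus nor training. For Chinese text, HowNet serves as the semantic knowledge base: an undirected weighted network is built with sememes as nodes and the relatedness between sememes as edge weights, and this network represents the text content. UW-PageRank scores the sememes, and the weight of each sense is then computed from the sememe scores; for each word, the sense with the highest weight is taken as its meaning. The algorithm was evaluated with a full-text annotation experiment and on the SENSEVAL-3 evaluation set. For English text, WordNet serves as the semantic knowledge base: an undirected weighted network is built with synsets as nodes and the relatedness between synsets as edge weights. UW-PageRank scores the synsets, and disambiguation combines the synset weights with factors such as sense coreference and sense frequency. The algorithm was evaluated on the SemCor data set.

2) Keyword extraction based on semantic networks and UW-PageRank. A single-document keyword extraction algorithm based on semantic networks and UW-PageRank is proposed. After word sense disambiguation every word in the text has a definite sense, so the semantic network is pruned by removing each word's other senses; the remaining nodes are the senses the words actually carry in the text. The UW-PageRank formula is then used to find the important senses, and the words corresponding to them are the keywords of the text. On manually annotated data sets of Chinese and English scientific papers, a comparison with the Tf method shows that the algorithm is effective.

3) RGA-CIE, a heuristic rule generation algorithm for Chinese information extraction. A heuristic rule generation algorithm for Chinese information extraction systems, RGA-CIE (Rule Generation Algorithm for Chinese Information Extraction), is proposed. It uses a supervised, bottom-up rule learning process that generalizes rules step by step with heuristics suited to the characteristics of Chinese, and it uses the Laplacian~* operator to evaluate the generated rules. The Laplacian~* operator balances the conflict between coverage and precision well, and semantic extension further improves rule coverage. RGA-CIE was evaluated on a financial news information extraction system developed by the author: the generated rules reach a precision of 0.84 and a recall of 0.82, outperforming manually written rules. In addition, information extraction was applied to ontology instance acquisition and played an important role in building the domain ontology of the Beijing travel information query system (Traveling in Beijing, TBJ).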
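
The graph ranking step used in items 1) and 2) can be illustrated with a minimal sketch. The text here does not give the UW-PageRank formula itself, so the code below assumes the common weighted PageRank update for undirected graphs, in which each node passes its score to its neighbours in proportion to edge weight. The function name `uw_pagerank`, the damping value, the iteration count, and the toy node labels and weights are all illustrative assumptions, not values taken from the thesis.

```python
def uw_pagerank(edges, damping=0.85, iterations=50):
    """Score the nodes of an undirected weighted graph.

    edges maps a pair (u, v) to a positive relatedness weight.
    Assumed update (not confirmed by the source):
        S(n) = (1 - d) + d * sum over neighbours m of w(m, n) / strength(m) * S(m)
    """
    # Build a symmetric adjacency map, since the graph is undirected.
    neighbors = {}
    for (u, v), w in edges.items():
        neighbors.setdefault(u, {})[v] = w
        neighbors.setdefault(v, {})[u] = w

    # Strength of a node = total weight of its incident edges.
    strength = {n: sum(adj.values()) for n, adj in neighbors.items()}

    scores = {n: 1.0 for n in neighbors}
    for _ in range(iterations):
        scores = {
            n: (1 - damping) + damping * sum(
                w / strength[m] * scores[m] for m, w in neighbors[n].items()
            )
            for n in neighbors
        }
    return scores


# Hypothetical toy graph: nodes stand in for sememes or synsets,
# weights stand in for their relatedness.
edges = {
    ("bank", "money"): 0.9,
    ("bank", "loan"): 0.8,
    ("money", "loan"): 0.7,
    ("river", "bank"): 0.1,
}
scores = uw_pagerank(edges)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy graph the weakly connected node ends up with the lowest score. For disambiguation, the highest-scoring sense of each word would be selected; for keyword extraction, the words behind the top-ranked senses would be reported.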

【Abstract】 Text is one of the most important media through which people describe the world, express their thoughts, and pass on knowledge. With the arrival of the knowledge economy, text knowledge management has drawn growing attention from researchers and engineers. Text knowledge management systems still face several problems: How can the subject of a text be identified? How can its topic words be extracted? How can important information be highlighted in a personalized way for different users? How can users be given exactly the information they need? Keyword extraction and information extraction, two important technologies in text processing, can help solve these problems. This thesis focuses on keyword extraction from a single document and on rule generation for information extraction. The main achievements are as follows.

1) Word sense disambiguation based on semantic networks and UW-PageRank. This thesis proposes a word sense disambiguation method based on semantic networks and UW-PageRank, which can disambiguate all the words in a text at once, without a corpus or training. For Chinese, we use HowNet as the knowledge base and build an undirected weighted graph with sememes as vertices and the relatedness between sememes as edge weights. UW-PageRank is then applied to the graph to score the importance of the sememes. The score of each definition of a word is computed from the scores of the sememes it contains, and the highest-scoring definition is assigned to the word. This algorithm is tested with a text indexing experiment and on SENSEVAL-3. For English, we use WordNet as the knowledge base and build an undirected weighted graph with synsets as vertices and the relatedness between synsets as edge weights. UW-PageRank is then applied to score the importance of the synsets, and the highest-scoring synset is assigned to the word. This algorithm is tested on the SemCor corpus.

2) Keyword extraction based on semantic networks and UW-PageRank. This thesis proposes a keyword extraction method based on semantic networks and UW-PageRank. After word sense disambiguation each word has exactly one sense, so the semantic graph can be pruned to keep only the "right" senses. UW-PageRank is then applied to find the most important senses, and the words they correspond to are the keywords. We test the algorithm on manually tagged Chinese and English papers; it performs better than the Tf algorithm.

3) Heuristic rule generation algorithm for Chinese information extraction: RGA-CIE. This thesis proposes RGA-CIE, a heuristic rule generation algorithm for Chinese information extraction that is domain independent and works on free Chinese text. RGA-CIE applies supervised learning with a bottom-up strategy: rules are generalized step by step, a heuristic method chooses the generalization path, and the Laplacian~* formula evaluates the performance of the rules. Semantic extension is also applied to improve the flexibility of the rules. The learned rules were tested on the Commercial News Information Extraction System and achieve a precision of 0.84 and a recall of 0.82, which is better than manually written rules. We also applied information extraction to ontology instance learning, which contributed substantially to the Traveling in Beijing system.
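
The abstract does not define the Laplacian~* formula, so it is not reproduced here. As a point of reference only, the sketch below shows the standard Laplace-corrected precision estimate that is widely used in rule learning, which rewards rules that are both accurate and supported by many examples. The function name and the example counts are hypothetical, and the thesis's own variant may differ.

```python
def laplace_precision(correct, covered):
    """Standard Laplace estimate of rule precision: (correct + 1) / (covered + 2).

    correct: number of covered examples the rule extracts correctly.
    covered: total number of examples the rule matches.
    This is the textbook estimate, not the thesis's Laplacian~* formula.
    """
    if correct < 0 or covered < correct:
        raise ValueError("need 0 <= correct <= covered")
    return (correct + 1) / (covered + 2)


# A very specific rule: perfect on the single example it matches.
specific = laplace_precision(correct=1, covered=1)
# A more general rule: a few errors, but far more supporting examples.
general = laplace_precision(correct=40, covered=45)
print(specific, general)
```

Raw precision would prefer the specific rule (1/1 against 40/45), while the corrected estimate prefers the general one, which is the kind of trade-off between coverage and precision that a rule evaluation measure in bottom-up generalization has to handle.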


