
Domain Knowledge Acquisition (领域知识的获取)

【Author】 Li Wei (李卫)

【Supervisor】 Zhong Yixin (钟义信)

【Author Information】 Beijing University of Posts and Telecommunications, Signal and Information Processing, 2008, PhD

【Abstract (Chinese)】 The knowledge base is the foundation of a natural language processing system, supplying the knowledge that enables the system to "understand" natural language and accomplish its tasks. This dissertation studies domain knowledge acquisition and proposes several new processing techniques and models. The main contributions are:

1. To address redundant web information during the acquisition of domain knowledge sources, a keyword-sequence-based duplicate removal algorithm for web text, KSM, is proposed (a sketch follows this abstract). Grounded in comprehensive information theory, KSM uses a document's keyword sequence to describe its structural and intensional features, and decides whether redundancy exists by comparing the keyword-sequence overlap of topically similar documents. In experiments on various kinds of implicit duplication, KSM achieved an overall precision of 99.2% and a recall of 97.7%, showing good performance.

2. To address the low recall of low-frequency term extraction, an automatic Chinese term extraction algorithm based on language cognition theory is proposed. Exploiting the discourse markers of scientific papers, it introduces a weighted term-frequency factor for candidate terms into the C-value and SCP_f measures, yielding the MC-SCP measure for the combined evaluation of a candidate term's unithood and termhood. In term extraction experiments in the license plate recognition domain, the MC-SCP-based algorithm reached 96.5% recall and 77.8% precision overall, and 96.2% recall and 79.3% precision on low-frequency terms, markedly improving low-frequency term extraction while preserving overall performance.

3. To cope with the diversity of term relation types, a multi-strategy model for automatic term relation acquisition is proposed. Drawing on the linguistic characteristics of scientific papers, it combines the internal and external features of terms to discover relations between terms at several levels: rule-based acquisition of synonymy, acquisition of hierarchical relations based on structural similarity, acquisition of non-hierarchical relations based on all-weighted association rules, and particle-swarm-based term clustering. For non-hierarchical relations, an all-weighted association rule mining algorithm based on multiple pruning with infrequent itemsets, AWARM-MPIS, is proposed for frequent itemset generation and pruning, with good results. For term grouping, a particle-swarm-based term clustering algorithm is proposed that evaluates the semantic similarity of terms from their structural similarity (internal feature) and degree of association (external feature); experiments show that its average running time and iteration count improve on K-Means by two orders of magnitude.

4. To ease the conflict between the large volume of multi-domain scientific papers and editors' limited specialist knowledge, a domain-knowledge-guided model of a first-review assistant system for scientific papers is proposed. Based on the publication requirements of scientific journals, the characteristics of scientific papers, and editors' working experience, the editorial first review is refined into judgments on four aspects; a prototype system was developed accordingly and tested on 2,365 manuscripts submitted to Computer Engineering and Applications (《计算机工程与应用》) and Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》). The results show that the system can help editors screen out about 35% of low-quality manuscripts, improving the efficiency of the first review.
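Below is a minimal sketch of the keyword-sequence overlap idea behind KSM. The abstract does not give the exact sequence representation, overlap measure, or decision threshold, so the longest-common-subsequence ratio and the 0.8 threshold used here are illustrative assumptions rather than the author's actual formulation.

```python
# Illustrative sketch only: compare two documents' keyword sequences and flag
# redundancy when their overlap is high. The LCS-based overlap ratio and the
# 0.8 threshold are assumptions, not the KSM algorithm as defined in the thesis.

def lcs_length(a, b):
    """Length of the longest common subsequence of two keyword sequences."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def is_duplicate(keywords_a, keywords_b, threshold=0.8):
    """Flag two topically similar documents as redundant when the overlap of
    their keyword sequences exceeds the (assumed) threshold."""
    if not keywords_a or not keywords_b:
        return False
    overlap = lcs_length(keywords_a, keywords_b) / min(len(keywords_a), len(keywords_b))
    return overlap >= threshold

# Example: two near-duplicate snippets described by their keyword sequences.
doc1 = ["车牌", "识别", "算法", "图像", "分割", "字符"]
doc2 = ["车牌", "识别", "图像", "分割", "字符", "定位"]
print(is_duplicate(doc1, doc2))  # True under the illustrative 0.8 threshold
```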

【Abstract】 The knowledge base is the "brain" of a natural language processing system, enabling it to "understand" and process natural language. This dissertation explores new technologies for domain knowledge acquisition. The main contributions are as follows:

1. To remove redundant web information during the acquisition of domain knowledge sources, a web document duplicate removal algorithm based on keyword sequences (KSM) is presented. Drawing on comprehensive information theory, KSM uses the keyword sequence of a web document to represent its structural and intensional features, and then detects redundancy by comparing the keyword sequences of topically similar documents. In various obscure-duplicate detection experiments, the overall precision and recall of KSM are 99.2% and 97.7%, respectively.

2. To improve the recall of low-frequency terms, an automatic Chinese term extraction algorithm based on language cognition theory is presented. Using discourse markers in research papers, the algorithm introduces a "weighted frequency" factor into the C-value and SCP_f measures and proposes the MC-SCP measure to evaluate both the unithood and the termhood of candidate terms (a C-value sketch is given after this abstract). In term extraction for the license plate recognition domain, the overall recall and precision are 96.5% and 77.8%, while the recall and precision for low-frequency terms are 96.2% and 79.3%, respectively.

3. To acquire the various relations among terms, a multi-strategy relation acquisition model is designed, comprising a) rule-based acquisition of synonymy, b) hierarchical relation acquisition based on terms' morphological similarity, c) non-hierarchical relation acquisition based on all-weighted association rules, and d) PSO-based term clustering.

4. To ease the conflict between the growing volume of multi-domain research papers and the limits of editors' domain knowledge, a domain-knowledge-guided first-review assistant system is presented. Based on editors' experience, the first review is refined into four judgments. In an experiment on 2,365 research papers, the system helped editors filter out about 35% of unqualified manuscripts.
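For context on the term-scoring step, here is a minimal sketch of the classic C-value measure that MC-SCP builds on. The abstract does not specify the weighted-frequency factor or the exact combination with SCP_f, so only the standard C-value (Frantzi and Ananiadou) is shown; the candidate terms and frequencies are invented for illustration.

```python
# Sketch of the standard C-value termhood measure; MC-SCP's weighted-frequency
# factor and its combination with SCP_f are NOT reproduced here.
import math
from collections import defaultdict

def c_value(candidates):
    """candidates: dict mapping a candidate term (tuple of tokens) to its corpus
    frequency. Returns a dict of C-value scores."""
    # For every candidate, collect the longer candidates that contain it
    # as a contiguous token subsequence.
    nested_in = defaultdict(list)
    for longer in candidates:
        for shorter in candidates:
            if shorter != longer and len(shorter) < len(longer):
                n, m = len(shorter), len(longer)
                if any(longer[i:i + n] == shorter for i in range(m - n + 1)):
                    nested_in[shorter].append(longer)

    scores = {}
    for term, freq in candidates.items():
        weight = math.log2(max(len(term), 2))  # length factor; clamp 1-grams to keep it positive
        parents = nested_in.get(term, [])
        if parents:
            # Nested terms are penalised by the average frequency of their parents.
            scores[term] = weight * (freq - sum(candidates[p] for p in parents) / len(parents))
        else:
            scores[term] = weight * freq
    return scores

# Toy example loosely themed on the license-plate-recognition domain.
candidates = {
    ("license", "plate"): 40,
    ("license", "plate", "recognition"): 25,
    ("plate", "recognition"): 30,
    ("character", "segmentation"): 12,
}
for term, score in sorted(c_value(candidates).items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 2))
```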
