中文词汇知识获取算法和语义计算研究及应用
The Research of Knowledge Acquisition Algorithm and Semantics Computation for Chinese Vocabulary and Its Applications
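【Author】 Liu Xinglin (刘兴林);
【Supervisor】 Zheng Qilun (郑启伦);
【Author information】 South China University of Technology (华南理工大学), Computer Application Technology, 2012, Ph.D.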
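【Abstract, translated from the Chinese】 The rapid growth of the Internet has made it the most important resource for disseminating and sharing information worldwide, and its data is growing geometrically. Extracting useful knowledge from it, however, remains very difficult, and “data explosion, knowledge scarcity” has become a problem that many researchers urgently need to solve. Most current work on knowledge acquisition takes a purely computational view and mines knowledge at the level of grammatical and logical structure, using devices such as rules and sentence patterns. Meanwhile, the constant emergence of new concepts leads to many newly coined words. These words are composed of several morphemes or several words, and until a word segmentation system has them in its lexicon it splits them into those components. Existing knowledge acquisition methods therefore cannot recognize them correctly, let alone compare them at the semantic level. This poses a new difficulty for knowledge acquisition. It is also why search engines, whose core technology is information retrieval, process web pages by “non-semantic” keyword matching, which gives low accuracy in locating content. Introducing semantic computation is expected to improve this situation.

The research in this thesis has two main parts: knowledge acquisition algorithms for Chinese vocabulary and semantic computation methods for Chinese vocabulary. Working on top of a word segmentation system, the thesis recognizes compound words, which addresses the failure to recognize out-of-vocabulary words. It builds a part-of-speech tagging model for compound words, tags them, and resolves part-of-speech ambiguity. This addresses the fact that current tagging models cannot be applied directly to compound words, and at the same time corrects the segmentation output. Building on compound-word recognition, the thesis extracts thematic terms from text and establishes a lexical semantic computation model that makes words comparable with one another. Replacing traditional keyword matching with semantic computation is a fundamental route to intelligent information retrieval. It is also key groundwork for building a lexical semantic knowledge base and for knowledge reasoning, and so has significant research value. Finally, the thesis implements a platform for Chinese vocabulary knowledge acquisition and semantic computation. By applying the algorithms above, it builds an integrated system covering both, which validates the value of the research and the effectiveness of the algorithms.

The main contributions of the thesis are as follows.

1. For the difficult problem of out-of-vocabulary word recognition, the thesis proposes CWRWCDG, a compound-word recognition algorithm based on part-of-speech detection and a word co-occurrence directed graph. The algorithm first uses part-of-speech detection to obtain word strings from the text and then builds a word co-occurrence directed graph from those strings. Borrowing the idea of the Bellman-Ford algorithm, it searches the graph from multiple source vertices for the longest paths whose weights satisfy a given condition. The word string corresponding to each such path is taken to be a compound word. Experimental results show that the algorithm outperforms comparable algorithms.

2. The difficulty in tagging Chinese compound words lies in determining the part of speech. For this problem, the thesis proposes a part-of-speech tagging algorithm for Chinese compound words based on head-feature percolation theory. The theory was first proposed by Lieber in 1980, who held that in English the part of speech of a compound word is determined by its head component. The thesis applies the theory to Chinese compound words and offers two modes, explicit tagging and implicit tagging, to suit practical needs.

3. Current thematic term extraction algorithms work mainly from word frequency, based on TF/IDF values, and perform poorly on texts in which words are fairly evenly distributed. For this case, the thesis proposes TTEITS, a thematic term extraction algorithm based on word position weight and incremental term set frequency. The algorithm holds that occurrences of the same word at different positions in a text contribute differently to whether that word becomes a thematic term. In addition, when deciding whether a candidate really is a thematic term, it computes not only the weight (frequency) of that single word but also the incremental weight (frequency) the word adds to the whole thematic term set. If the increment exceeds a given threshold, the word is accepted as a thematic term; otherwise the algorithm terminates. The advantage of the algorithm is that it can still extract the most suitable thematic terms when all candidates occur rarely and about equally often.

4. To study the use of thematic term sets in automatic summarization, the thesis proposes CASTTS, a Chinese automatic summarization algorithm based on thematic term sets. The algorithm first extracts the thematic terms of a text with TTEITS and then uses the thematic term weights to compute a weighted score for the sentences containing each term, which gives every sentence a total weight with respect to the thematic term set. Finally, according to the summary ratio, it selects the sentences with the largest weights and outputs them in their original order as the summary. Experimental results show that the summaries are of high quality and close to the reference summaries.

5. To address shortcomings in existing lexical semantic computation and text similarity computation, the thesis uses HowNet (知网) to convert text similarity computation into the computation of similarity between thematic term sets, and proposes TSCTTS, a text similarity method based on thematic term sets. The method first extracts thematic terms with TTEITS, then obtains the semantic distance between two words from the HowNet sememe hierarchy, and converts that distance into a semantic similarity through a conversion formula. Text similarity is finally derived from the semantic similarity of the thematic term sets. Applied to a text classification experiment, the algorithm shows good classification performance.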
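The selection rule of TTEITS (contribution 3) can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the position weights `TITLE_WEIGHT` and `BODY_WEIGHT`, the reading of "term set frequency" as the share of sentences that contain at least one selected term, and the names `term_weights`, `set_frequency`, `extract_thematic_terms` and `threshold`. It is not the thesis's implementation.

```python
from collections import defaultdict

# Hypothetical position weights: an occurrence in the title counts more than
# one in the body. The abstract does not state any actual values.
TITLE_WEIGHT = 3.0
BODY_WEIGHT = 1.0


def term_weights(title, sentences):
    """Position-weighted frequency of every term.

    `title` is a list of terms; `sentences` is a list of term lists.
    """
    weights = defaultdict(float)
    for term in title:
        weights[term] += TITLE_WEIGHT
    for sentence in sentences:
        for term in sentence:
            weights[term] += BODY_WEIGHT
    return dict(weights)


def set_frequency(terms, sentences):
    """Assumed definition: share of sentences containing any term of `terms`."""
    if not sentences:
        return 0.0
    covered = sum(1 for sentence in sentences if terms & set(sentence))
    return covered / len(sentences)


def extract_thematic_terms(title, sentences, threshold=0.1):
    """Greedy selection by descending weight.

    A candidate is accepted only if it raises the set frequency by more than
    `threshold`; the first candidate that fails ends the loop.
    """
    weights = term_weights(title, sentences)
    # Sort by weight, then alphabetically so that ties are deterministic.
    candidates = sorted(weights, key=lambda t: (-weights[t], t))
    selected = set()
    ordered = []
    for term in candidates:
        before = set_frequency(selected, sentences)
        after = set_frequency(selected | {term}, sentences)
        if after - before <= threshold:
            break
        selected.add(term)
        ordered.append(term)
    return ordered
```

Because acceptance depends on the increment to the whole set rather than on a single word's frequency, a term that occurs only a few times can still be selected when it covers sentences that the earlier terms miss.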
【Abstract】 The Web has become the most important resource for information dissemination and sharing due to its rapid development. However, with the exponential growth of data, it is not easy to find useful knowledge on the Web. “Full of data, but lacking in knowledge” has become an urgent problem for many researchers.

Most research on knowledge acquisition is based solely on computer technology, for example extracting knowledge at the level of grammatical and logical structure using rules or sentence patterns. But the emergence of new concepts creates many new words, which consist of several words or morphemes. Until such words are added to their lexicons, existing word segmentation systems split them into several single words or morphemes. As a result, existing knowledge acquisition methods cannot recognize them correctly, let alone compare them semantically. This brings new problems to knowledge acquisition. It also forces search engines, which use information retrieval as their main technique for handling web pages, to rely on “non-semantic” keyword matching, so that the precision of the content found is low. The application of semantic computation is expected to improve the situation.

This thesis focuses on the research and application of vocabulary knowledge acquisition and vocabulary semantic computation. In particular, to solve the problem of out-of-vocabulary word recognition, it recognizes compound words on top of a word segmentation system. Moreover, it builds a part-of-speech tagging model for compound words to eliminate part-of-speech ambiguity. The model not only solves the problem that existing part-of-speech tagging models cannot be applied directly to compound words, but also corrects the word segmentation results. Based on compound-word recognition, the thesis extracts thematic terms from text and builds a vocabulary semantic computation model, so that words can be compared with each other. Replacing the traditional keyword matching approach with a semantic computation approach is fundamental to intelligent information retrieval, to building a vocabulary semantic knowledge base, and to knowledge reasoning.

Finally, a platform for vocabulary knowledge acquisition and semantic computation is implemented. Based on the proposed algorithms, it forms an integrated system containing vocabulary knowledge acquisition, vocabulary semantic computation and a vocabulary semantic knowledge base, and validates the significance and effectiveness of the proposed algorithms.

The main contributions of this thesis include:

1. A Chinese compound-word recognition algorithm, CWRWCDG, based on part-of-speech detection and a word co-occurrence directed graph, is proposed to solve the out-of-vocabulary recognition problem. The algorithm first extracts word sequences from a text using part-of-speech detection, and then generates a word co-occurrence directed graph from these sequences. After that, inspired by the Bellman-Ford algorithm, it searches the graph from multiple starting points for the longest paths whose weights satisfy a given condition. The word strings corresponding to these paths are taken to be compound words. Experimental results show that the proposed algorithm outperforms comparable algorithms.

2. The key problem in tagging Chinese compound words is identifying the part of speech. To solve this problem, a part-of-speech tagging algorithm for Chinese compound words based on head-feature percolation theory is proposed. Lieber first introduced the theory in 1980, holding that the part of speech of an English compound word is determined by its head component. This thesis applies the theory to part-of-speech tagging of Chinese compound words, and provides two tagging methods: explicit and implicit.

3. Existing thematic term extraction algorithms are mostly based on word frequency, such as the TF/IDF value, and do not work well on texts with an even word distribution. To solve this problem, a thematic term extraction algorithm, TTEITS, based on word position weight and incremental term set frequency, is proposed. The algorithm holds that occurrences of a word at different positions in a document indicate different degrees of importance. Moreover, when deciding whether a candidate is a thematic term, it calculates not only the weight of the single word but also the incremental weight that the word contributes to the term set. As a result, the algorithm can still extract the most suitable thematic terms even when the candidates all occur rarely and about equally often.

4. Building on thematic term extraction, an automatic summarization algorithm for Chinese texts, CASTTS, based on the thematic term set, is proposed. The algorithm first uses TTEITS to extract thematic terms, and then calculates the weights of the sentences that contain thematic terms to obtain the total weight of each sentence with respect to the thematic term set. Finally, it selects a certain number of sentences with the largest weights to form the summary. Experimental results show that the algorithm generates high-quality summaries that are very close to the reference summaries.

5. A text similarity calculation method, TSCTTS, based on the thematic term set, is proposed. Using HowNet, it transforms text similarity calculation into the calculation of similarity between thematic term sets. The method first extracts thematic terms using TTEITS, and then calculates the semantic distance between two words within the sememe (primitive) hierarchy of HowNet. After that, it calculates text similarity from the semantic similarity between thematic terms. The method was applied to text classification, and the experimental results show good classification performance.
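The distance-to-similarity step of TSCTTS (contribution 5) can be sketched in Python under stated assumptions. The abstract does not give the conversion formula, so the form `ALPHA / (d + ALPHA)`, a shape often seen in HowNet-based similarity work, and the value of `ALPHA` are assumptions. The tiny `PARENT` hierarchy is invented for illustration and is not HowNet data, and the best-match averaging used to compare two term sets is one plausible choice rather than the thesis's definition. Each word is treated as a single node, whereas real HowNet entries are described by several sememes.

```python
# Toy hierarchy (child -> parent). Purely illustrative; these entries are
# NOT taken from HowNet.
PARENT = {
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
}

ALPHA = 1.6  # assumed tuning constant of the conversion formula


def ancestors(node):
    """Path from `node` up to the root, starting with `node` itself."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def distance(a, b):
    """Number of edges between two nodes of the hierarchy."""
    path_a, path_b = ancestors(a), ancestors(b)
    depth_in_b = {node: i for i, node in enumerate(path_b)}
    for i, node in enumerate(path_a):
        if node in depth_in_b:  # first common ancestor
            return i + depth_in_b[node]
    raise ValueError(f"{a!r} and {b!r} are not in the same hierarchy")


def word_similarity(a, b):
    """Map a distance d to a similarity in (0, 1]: ALPHA / (d + ALPHA)."""
    return ALPHA / (distance(a, b) + ALPHA)


def directed_similarity(source, target):
    """Average, over `source`, of the best match found in `target`."""
    best = [max(word_similarity(s, t) for t in target) for s in source]
    return sum(best) / len(best)


def text_similarity(terms_a, terms_b):
    """Symmetric similarity between two non-empty thematic term sets."""
    forward = directed_similarity(terms_a, terms_b)
    backward = directed_similarity(terms_b, terms_a)
    return (forward + backward) / 2
```

Averaging both directions keeps the score symmetric, so a short term set that happens to be contained in a long one does not score as a perfect match.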
【Key words】 Vocabulary Knowledge; Compound-word; Thematic Term; Semantic Computation; Text Similarity;