文本信息度量研究

Research on Information Metric for Text

【Author】 布凡

【Supervisor】 朱小燕

【Author Information】 Tsinghua University, Computer Science and Technology, 2013, PhD

【Abstract (Chinese)】 A metric is a quantitative description that characterizes the relationships between objects. In text information processing, research on information metrics at different linguistic granularities has important theoretical value and broad applications. In recent years, the rapid development of Web 2.0 has posed new challenges to text information metrics: the complexity and diversity of web data, together with the informal writing of web text, make many classical natural-language information metrics unsuitable for the internet environment. For example, dictionary-based lexical similarity metrics cannot handle rapidly emerging new words, and sentence similarity metrics based on syntactic trees cannot handle informally written user queries or web document titles. The informality of Chinese web language makes this challenge especially pronounced for Chinese natural language processing. In addition, classical relatedness metrics based on hyperlink analysis do not exploit the structural features of socially collaborative encyclopedias, and therefore cannot explain the relatedness between concepts. Addressing these characteristics of modern text data, this thesis proposes and applies new information metrics at four levels of information objects, as follows.

At the phrase level, we propose a non-compositionality metric grounded in information distance theory, with a solid theoretical foundation, which can judge whether a given word sequence is compositional (in a given context). Because the required statistics are drawn from the entire internet, the metric is highly applicable and robust, and can be used for post-processing in question answering and for complex named entity recognition.

At the concept level, we propose a new relatedness metric for concepts in online encyclopedias (e.g. Wikipedia). Unlike earlier methods based on hyperlink analysis, it fully exploits Wikipedia's structural features, so it can not only measure the relatedness between concepts but also explain their relationship using the encyclopedia's categories.

At the sentence level, we propose a pattern-set-based metric for computing the similarity between natural-language questions. Reflecting the different roles of function words and content words in questions, hard patterns and soft patterns handle them respectively. The metric captures long-range dependencies between words without relying on syntactic trees, and is effectively applied to question classification.

At the sentence-relation level, we propose a kernel-based analogical similarity metric for sentence pairs, which maps sentence relations into a space of rewriting rules and represents their similarity by the inner product in that space. The method captures the structural, analogical similarity of sentence relations without relying on syntactic trees, and achieves state-of-the-art accuracy on paraphrase identification and sentence entailment recognition.

【Abstract】 A metric is used to characterize the relationships between objects. In natural language processing (NLP), research on information metrics over different linguistic units has essential research value and wide application backgrounds. Recently, the rapid development of Web 2.0 has posed great challenges to natural language processing: classical NLP information metrics cannot handle complex and dynamic internet data or informally written web text. For example, lexical similarity metrics based on local dictionaries are not suitable for processing new words that emerge on the internet, and sentence-level similarity metrics based on syntactic trees are not sound for measuring the similarity between user queries and document titles, especially in Chinese. Moreover, classical metrics based on link analysis cannot make full use of the structural features of social collaborative data. To address these challenges, we propose new information metrics on four different kinds of information objects, as follows.

On the phrase level, we propose a non-compositionality metric for n-grams, based on information distance and with a solid theoretical background. It can be used to measure the non-compositionality of a given n-gram (under certain contexts). Since the metric is approximately computed from frequency counts on the internet, it is robust and widely applicable, and can be used for post-processing in question answering and for complex named entity recognition.

On the concept level, we propose a new algorithm for measuring the semantic relatedness between concepts in a social collaborative encyclopedia (e.g. Wikipedia). Different from classical metrics based on link analysis, our method takes full advantage of the structural features of the encyclopedia: it can not only measure relatedness, but also interpret the relatedness using categories.

On the sentence level, we propose a question similarity metric based on a pattern set. To utilize function words and content words in questions, we build hard patterns and soft patterns on them respectively. The metric can model long-range dependencies between words without using syntactic trees, and can be applied to question classification.

On the sentence-relation level, we propose a sentence-relation similarity metric based on a kernel method, which maps sentence pairs onto a rewriting-rule space and uses the inner product on this space to represent similarity. The method can capture structural similarity between sentence pairs without using syntactic analysis tools, while achieving state-of-the-art accuracy on paraphrase identification and recognizing textual entailment.
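The phrase-level metric is computed approximately from frequency counts on the internet. A minimal sketch in the spirit of such web-count distances is the normalized Google distance, which turns page-hit counts into a distance; the function name, the counts, and the index size below are illustrative assumptions, not the thesis's actual formulation:

```python
import math

def ngd(f_x, f_y, f_xy, n):
    """Normalized Google distance from page-hit counts.

    f_x, f_y: hit counts for each term alone;
    f_xy:     hit count for the two terms together;
    n:        total number of indexed pages.
    Returns 0 when the terms always co-occur; larger values
    mean the terms are more weakly associated on the web.
    """
    lx, ly, lxy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))
```

For a candidate phrase, comparing such a distance between the whole n-gram and its component words gives one hedged way to flag non-compositional sequences: a phrase whose parts rarely co-occur outside the phrase scores as an unusual unit.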
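The sentence-relation metric maps each sentence pair into a rewriting-rule space and takes an inner product there. A toy sketch of that idea, assuming a deliberately crude rule space (word-level keep/delete/insert operations; the thesis's actual rule space is richer), with all names hypothetical:

```python
from collections import Counter

def rewrite_rules(sent_a, sent_b):
    """Bag of word-level rewriting rules for one sentence pair:
    words kept, deleted, or inserted going from sent_a to sent_b."""
    a, b = set(sent_a.split()), set(sent_b.split())
    rules = Counter()
    for w in a & b:
        rules[("keep", w)] += 1
    for w in a - b:
        rules[("del", w)] += 1
    for w in b - a:
        rules[("ins", w)] += 1
    return rules

def pair_similarity(pair1, pair2):
    """Kernel value: inner product of the two rule-count vectors."""
    r1, r2 = rewrite_rules(*pair1), rewrite_rules(*pair2)
    return sum(r1[k] * r2[k] for k in r1.keys() & r2.keys())
```

Two sentence pairs that rewrite their sentences in analogous ways share many rules and get a large inner product, which is the structural analogy the kernel is meant to capture, here without any syntactic parsing.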

  • 【Online Publication Contributor】 Tsinghua University
  • 【Online Publication Year/Issue】 2014, Issue 07