

Dynamic Circulation Corpus (DCC) Based Automatic Unlisted Term Extraction in the Field of Information Technology

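【Author】 Wang Qiangjun (王强军)

【Supervisor】 Zhang Pu (张普)

【Author information】 Beijing Language and Culture University (北京语言文化大学), Linguistics and Applied Linguistics, 2003, PhD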

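【Abstract (translated from the Chinese)】 Guided by the theory of Dynamic Updating of Language and Knowledge and taking the information technology (IT) domain as its test case, this dissertation studies term extraction based on a large-scale Dynamic Circulation Corpus (DCC). It proposes using a Concatenation Index to judge the unithood of a character string, implements a technical route and workflow for term extraction that combines "Concatenation Index + TFIDF + Domain Subtraction", and yields a preliminary system for extracting unlisted terms in the IT domain from the DCC.

The dissertation introduces the theoretical system of Dynamic Updating of Language and Knowledge and the DCC-based research framework. It proposes an extension scheme for building the DCC that widens the scope and depth of research while remaining fully compatible with the existing system and reasonably scalable.

An unlisted term is first of all a term, so it has the three basic features of terms: it generally appears in only one or a few specific domains; it is a high-circulation word in its own domain; and its circulation in other domains is close to 0. On this basis, the basic approach is to study how existing terms are distributed in the corpus in order to determine how unlisted terms are likely to be distributed, and to analyze the extraction results for existing terms under various threshold conditions in order to determine the best thresholds for extracting unlisted terms.

Unlisted terms are often unlisted words, so every difficulty of unlisted word recognition also arises in unlisted term extraction. Text processed by conventional automatic word segmentation is as hard to mine for unlisted terms as it is for unlisted words. To retain as many unlisted terms as possible, the dissertation therefore preprocesses the corpus with full segmentation (全切分).

Two indicators determine whether a character string is a term in a given context: unithood and termhood. The dissertation proposes the Concatenation Index as a measure of the unithood of a character string. Experiments show that the Concatenation Index is fairly effective in judging whether a string is a complete word.

For extraction, the dissertation proposes the "Concatenation Index + TFIDF + Domain Subtraction" method: the Concatenation Index judges the unithood of a string, and "TFIDF + Domain Subtraction" judges its termhood. The method was tested on part of the DCC (a target corpus of 17 million characters and a reference corpus of 600 million characters). The results indicate that, for automatic term extraction from a large-scale corpus, the corpus processing and extraction techniques adopted here have a fairly marked effect on discovering unlisted terms: they extract a relatively large number of unlisted terms with little manual intervention and partly accomplish a task that conventional segmentation methods struggle to complete.

The dissertation also discusses two working modes for term extraction, "file + index + statistical results" and "file + database", analyzes the advantages and disadvantages of each, and argues that the latter is the better application of Dynamic Updating of Language and Knowledge to language monitoring.

In summary, the contributions of the dissertation are: 1. It proposes the concept of the Concatenation Index. 2. It uses the Concatenation Index to measure the unithood of a character string. 3. It proposes the "Concatenation Index + TFIDF + Domain Subtraction" method for term extraction. The preliminary term extraction system built in this research can serve as a prototype and reference for term extraction in specialized domains and for building the DCC.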
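The full segmentation step mentioned above can be illustrated with a small sketch. The abstract does not describe the implementation, so everything below is an assumption made for illustration: the function name `candidate_counts`, the length limits `min_len` and `max_len`, and the choice to stop candidates at punctuation and whitespace are all hypothetical. The idea shown is only that every substring within a length range is kept as a candidate, so an unlisted term cannot be lost to a segmenter's dictionary.

```python
import re
from collections import Counter

# Assumption: candidates never cross punctuation or whitespace, so the
# text is first split on runs of non-word characters.
_SPLIT = re.compile(r"[^\w]+")

def candidate_counts(text, min_len=2, max_len=4):
    """Count every substring of min_len..max_len characters (hypothetical helper)."""
    counts = Counter()
    for chunk in _SPLIT.split(text):
        n = len(chunk)
        for start in range(n):
            # Emit all substrings that begin at `start` and fit the length limits.
            for end in range(start + min_len, min(start + max_len, n) + 1):
                counts[chunk[start:end]] += 1
    return counts

counts = candidate_counts("数据库系统,数据库管理", max_len=3)
print(counts["数据库"])  # the string occurs once in each clause
```

Exhaustive enumeration produces many non-words (for example "据库系"), which is why a unithood filter such as the Concatenation Index is needed afterwards.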

【Abstract】 This dissertation studies the automatic extraction of unlisted terms in the field of Information Technology (IT) from the large-scale DCC (Dynamic Circulation Corpus), under the theory of Dynamic Updating of Language and Knowledge. It proposes the concept of the Concatenation Index to decide whether a character string is a word or phrase, and it presents a method named "Concatenation Index + TFIDF + Domain Subtraction" for extracting unlisted terms. The IT domain was chosen as the experimental object in order to establish a preliminary research workflow based on the theory of Dynamic Updating of Language and Knowledge.

The dissertation introduces the framework of Dynamic Updating of Language and Knowledge and proposes a scheme for extending the DCC. The scheme makes it possible to enlarge the DCC in both content and structure while remaining compatible with the existing system.

Terms have three basic characteristics: they usually appear in only one or a few specialized domains; they have a high degree of circulation in their own domain; and their circulation in other domains is close to 0. Unlisted terms are terms, so they share these three characteristics. On this basis, the research examines how listed terms are distributed in the corpus in order to determine the likely distribution of unlisted terms, and it analyzes the extraction results under different thresholds in order to set the best threshold for extracting unlisted terms.

Unlisted terms are usually unlisted words, so the difficulties of recognizing unlisted words also arise in extracting unlisted terms. A corpus processed by traditional word segmentation is as hard to mine for unlisted terms as it is for unlisted words. This research therefore preprocesses the corpus with full (exhaustive) segmentation.

Two indices indicate whether a character string is a term in a given context: unithood and termhood. This research proposes the Concatenation Index as a measure of the unithood of a character string, and the experiments show that it is fairly effective in determining whether a character string is a complete word or phrase.

In the "Concatenation Index + TFIDF + Domain Subtraction" method, the Concatenation Index decides the unithood of a character string and "TFIDF + Domain Subtraction" decides its termhood. The method was tested on part of the DCC (a 17-million-character target corpus and a 600-million-character reference corpus). The results show that the corpus processing and extraction techniques adopted here have a marked effect on discovering unlisted terms: a relatively large number of unlisted terms are extracted with little human intervention, and the method partly accomplishes a task that traditional word segmentation finds hard to complete.

The dissertation also discusses two processing modes for extracting unlisted terms, the "text-index-statistics mode" and the "text-database mode", together with their strengths and weaknesses. It argues that the "text-database mode" is the better application of Dynamic Updating of Language and Knowledge to language monitoring.

The main innovations of this research can be summed up as follows: (1) it proposes the concept of the Concatenation Index; (2) it applies the Concatenation Index to measuring the unithood of a character string; (3) it presents a method named "Concatenation Index + TFIDF + Domain Subtraction" for extracting unlisted terms. The preliminary extraction system can serve as a prototype and a reference for extracting unlisted terms in other domains and for building and updating the DCC.
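The termhood step ("TFIDF + Domain Subtraction") can be sketched as follows. The abstract gives no formulas, so this is one plausible reading and not the author's implementation: the function name `termhood_scores`, the IDF computed over target and reference documents together, and the subtraction of relative frequencies are all assumptions. Input is assumed to be one `Counter` of candidate strings per document.

```python
import math
from collections import Counter

def termhood_scores(target_docs, reference_docs):
    """Score candidates by TF-IDF, keeping only those more frequent in the target domain.

    target_docs / reference_docs: lists of Counter (candidate -> count in one document).
    Hypothetical helper; the weighting scheme is an assumption.
    """
    all_docs = target_docs + reference_docs
    n_docs = len(all_docs)

    # Document frequency over both corpora: each candidate counted once per document.
    df = Counter()
    for doc in all_docs:
        df.update(doc.keys())

    # Total frequency of each candidate in each domain.
    target_tf, ref_tf = Counter(), Counter()
    for doc in target_docs:
        target_tf.update(doc)
    for doc in reference_docs:
        ref_tf.update(doc)
    target_total = sum(target_tf.values()) or 1
    ref_total = sum(ref_tf.values()) or 1

    scores = {}
    for cand, tf in target_tf.items():
        # Domain subtraction (assumed form): relative frequency in the target
        # domain minus relative frequency in the reference domains.
        diff = tf / target_total - ref_tf[cand] / ref_total
        if diff > 0:
            scores[cand] = tf * math.log(n_docs / df[cand])
    return scores

it_docs = [Counter({"数据库": 3, "我们": 2}), Counter({"数据库": 2, "我们": 1})]
other_docs = [Counter({"我们": 4, "今天": 2}), Counter({"我们": 3})]
print(termhood_scores(it_docs, other_docs))  # only "数据库" survives the subtraction
```

In this toy input the common word "我们" is dropped because it is relatively more frequent in the reference documents, which mirrors the stated requirement that a term's circulation outside its own domain be close to 0.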


