

A Research on the Extraction of the Valid String: Based on the Dynamic Circulating Corpus

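【Author】 Sui Yan (隋岩)

【Advisor】 Zhang Pu (张普)

【Author information】 Beijing Language and Culture University, Linguistics and Applied Linguistics, 2004, PhD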

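【Abstract (translated from the Chinese)】 This dissertation proposes a new unit of language processing, the "valid string" (VSt), and presents a preliminary study of VSt extraction that relies on the Dynamic Circulating Corpus (DCC) and takes the theory of "degree of circulation" as its pivot.

The VSt defined here is a unit of language understanding rather than a purely grammatical unit. From a pragmatic point of view, the units at every level of grammatical study (words, phrases, chunks and so on) can each carry out a task of understanding and communication on their own under suitable pragmatic conditions, so they are in essence one form of VSt. Because these traditional grammatical units have already been studied in depth with rich results, this dissertation concentrates on extracting VSts that span more text than those units.

Formally, the VSts to be extracted are also built from these traditional grammatical units and cover every possible unit of "expression/understanding" from the word up to the chunk. The difference is that these strings are closer to pragmatic requirements: they are not static grammatical units held in reserve but dynamic pragmatic units available for use. Monitoring how VSts are used in large-scale real text indirectly monitors language use itself ("pragmatic monitoring"), which serves the ultimate goal of "dynamically updating language knowledge".

To this end, the study built a DCC with a "sentence fragment" database at its core, took the theory of degree of circulation as the guide for the whole study, and, starting from VSt extraction, attempted to explore the processing of large-scale real text from a new angle.

Along the way, the dissertation reviews and draws on existing related work. Drawing on theories from cognitive psychology, mass communication studies and other fields, it gives a strict definition of the VSt and offers a preliminary analysis and summary of the trend patterns of the "frequency, degree of use, degree of circulation" curves of strings, preparing the ground for automatic VSt extraction.

In processing the corpus, the dissertation introduces a "full bundling" strategy: "candidate strings" are "bundled" out of the word-segmented sentence fragment database and matched against the curve trend patterns, and the VSts are extracted from the matches.

The DCC built for this study contains all text from ten newspapers for January to June 2003: 8,687,925 records with an average sentence fragment length of 16 characters, for a total of 8,687,925 × 16 = 139,006,800 characters. All data are stored in chronological order.

To process the corpus and extract VSts, we developed the DCC processing software system, which comprises a sentence fragment splitting and word segmentation module, an "X-string" stripping module, a candidate string bundling module, a VSt extraction module and a VSt post-processing module.

Using a corpus of this size, the study ran an extraction experiment covering 157,661 VSts, with an accuracy of 80.21%.

The dissertation makes four main contributions:
1. It defines the VSt, a unit of language understanding and communication, from a cognitive perspective.
2. It analyzes and identifies three trend patterns for VSt curves.
3. It proposes a curve-trend-based method of evaluating degree of circulation and uses it to extract VSts.
4. It builds a DCC based on a sentence fragment database.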
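As a rough illustration of the "full bundling" step, the sketch below enumerates every contiguous run of words in one word-segmented sentence fragment and joins each run into a candidate string. This is only one plausible reading of the strategy: the abstract does not specify how bundling is bounded, so the function name `bundle_candidates` and the `min_words` and `max_words` limits are assumptions made for the example, not details of the DCC software.

```python
from typing import List


def bundle_candidates(words: List[str],
                      min_words: int = 2,
                      max_words: int = 5) -> List[str]:
    """Join every contiguous run of min_words..max_words words.

    `words` is one sentence fragment that has already been segmented.
    The length limits are hypothetical; the source gives no bounds.
    """
    candidates = []
    n = len(words)
    for start in range(n):
        for length in range(min_words, max_words + 1):
            end = start + length
            if end > n:
                break  # the run would extend past the fragment
            candidates.append("".join(words[start:end]))
    return candidates


# A toy three-word fragment, already segmented.
fragment = ["动态", "流通", "语料库"]
print(bundle_candidates(fragment))
# ['动态流通', '动态流通语料库', '流通语料库']
```

Under this reading, each candidate would then be counted across the time-ordered corpus and compared with the curve patterns; that matching step is not shown because the abstract does not define the patterns.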

【Abstract】 The goal of this dissertation is to study the extraction of valid strings from natural language corpora. The study is based on the new concept of the valid string and the theory of the degree of circulation, and is supported by the Dynamic Circulating Corpus.

A valid string is not a unit of grammar but a unit of language communication and understanding. Most grammatical units, such as a word, a phrase or a chunk, may be used independently in communication and be understood as valid strings. There are also valid strings that are combinations of these basic grammatical units.

On the surface, a valid string is a grammatical unit or a combination of several units. A valid string is not a static item waiting to be used but a dynamic unit in actual language use. By monitoring the use of valid strings in a large-scale, real-time natural language corpus, actual language use can be monitored indirectly, and the goal of dynamically updating language knowledge can eventually be reached.

The concept of the valid string is defined in terms of not only grammar but also cognitive psychology and the study of mass media. It is based on the curves of the frequency, distribution and circulation of valid strings.

A sentence fragment corpus was built for this study, and all potential strings were extracted using an all-round combination strategy. The combined strings were then compared with a circulation curve model to determine their validity.

The Dynamic Circulating Corpus built for this study consists of data from ten newspapers (January to June 2003), with 8,687,925 entries that have an average length of 16 characters, for a total of 8,687,925 × 16 = 139,006,800 characters. The data are stored in date order.

Software for processing the Dynamic Circulating Corpus was designed for the study; it consists of several modules for identifying and combining potential valid strings.

A total of 157,661 valid strings were extracted from the corpus, with an accuracy of 80.21%.

The contributions of this dissertation are:
1. to have defined the concept of the valid string on the basis of cognition;
2. to have analyzed and posited three models of the curve for valid strings;
3. to have established a method for the extraction and evaluation of valid strings; and
4. to have built a Dynamic Circulating Corpus based on the sentence fragment corpus.
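Because the corpus is stored by date, the raw material for the frequency and distribution curves can be pictured as a per-month tally. The sketch below is a minimal illustration under stated assumptions: the record layout (month, newspaper, fragment text), the function name `monthly_curves`, and the toy data are all hypothetical, and the degree-of-circulation measure and the three curve models are left out because the abstract does not define them.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical record layout: (month "YYYY-MM", newspaper name, fragment text).
Record = Tuple[str, str, str]


def monthly_curves(records: Iterable[Record],
                   candidate: str,
                   months: List[str]) -> Tuple[List[int], List[int]]:
    """Return (frequency per month, number of newspapers per month)."""
    freq: Dict[str, int] = defaultdict(int)
    papers: Dict[str, Set[str]] = defaultdict(set)
    for month, paper, text in records:
        hits = text.count(candidate)  # non-overlapping occurrences
        if hits:
            freq[month] += hits
            papers[month].add(paper)
    frequency = [freq.get(m, 0) for m in months]
    spread = [len(papers.get(m, set())) for m in months]
    return frequency, spread


# Toy data, invented for the example.
records = [
    ("2003-01", "Paper A", "valid string extraction"),
    ("2003-01", "Paper B", "a valid string"),
    ("2003-02", "Paper A", "no match here"),
    ("2003-03", "Paper A", "valid string and valid string"),
]
months = ["2003-01", "2003-02", "2003-03"]
print(monthly_curves(records, "valid string", months))
# ([2, 0, 2], [2, 0, 1])
```

The two lists give one point per month, so they can be plotted or compared against whatever curve shapes the validity test uses.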

  • 【Classification number】H08
  • 【Times cited】6
  • 【Downloads】652

