基于现代汉语动态流通语料库的通用词汇自动提取方法研究

A Study on Extraction Method of Contemporary Chinese Commonly Used Words for Language Engineering Based on Dynamic Circulating Corpus

【Author】 唐长宁

【Advisor】 赵小兵

【Author Information】 Inner Mongolia Normal University, Computer Application Technology, 2008, Master's thesis

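【Summary】 Human society is moving from an industrial society to an information society, and the main carrier of information is natural language, the language people use to communicate with one another. Natural language processing studies how to make computers understand human language and how to build practical systems on that basis. The common vocabulary of a natural language consists of the words that are most common and most frequently used in a nation's language system. Delimiting the scope of the Chinese common vocabulary accurately matters for Chinese language teaching, for dictionary compilation, and for computer information processing. Within a given period, the common vocabulary is both a relatively stable set and an open one, both relatively dynamic and static; traditional statistical methods and linguists' rules of thumb cannot delimit it accurately. Applying computers to common vocabulary extraction therefore has clear practical value, and studying language with corpus-based, data-driven methods has become an inevitable trend and a necessary tool.

This thesis uses a dynamic circulating corpus (DCC) built from mainstream Chinese newspapers. Dynamism and circulation are its essential properties. The dynamism of the corpus embodies the principle that "the diachronic contains the synchronic" and "the synchronic contains the diachronic": the corpus supports both synchronic and diachronic description of the language. Circulation means that the newspapers should carry as wide a variety of columns as possible, should be distributed in as many regions as possible, and should provide sufficiently large coverage.

The main work of the thesis is as follows:

1. Domain classification of the raw corpus (self-written program). A program assigns the raw corpus to 10 categories according to the column information of the newspapers; the classification results are given in Table 4-3.
2. Format conversion of the raw corpus (self-written program). The downloaded raw corpus is in HTML\HML web page format. It is converted to plain text, organized by domain category / media / year and month, while junk information in the page markup is filtered out and only the meaningful text content is retained. The converted files are in XML format.
3. Word segmentation (third-party program) and database import (self-written program). The text files are segmented into words by domain category / media / year and month, and the segmented files are imported into a database word by word for further processing. The experiments used SQL Server 7.0.
4. Proofreading. A self-developed manual proofreading system (written in Java) is used to check the output and correct the unavoidable segmentation errors, making the results more accurate.
5. Vocabulary statistics. For each word, the monthly "word frequency", "domain commonness", and "temporal commonness" are computed. The experiments used Microsoft Excel 2003.
6. Common vocabulary extraction. Words are sorted in descending order of their yearly "vocabulary commonness Ok", and the common vocabulary list is extracted so that the total token count of its words covers 85-95% of the total token count of the corpus.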
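
Step 2 above (stripping web page markup and keeping only the text, then storing it as XML keyed by domain, media, and year-month) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the program written for the thesis. The function names `html_to_text` and `to_xml_record`, the element name `doc`, and the attribute names are all hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextExtractor(HTMLParser):
    """Collect visible text and skip the bodies of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML page, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def to_xml_record(text, domain, media, year_month):
    """Wrap plain text in a hypothetical <doc> element (assumed schema)."""
    doc = ET.Element("doc", {"domain": domain, "media": media,
                             "year_month": year_month})
    doc.text = text
    return ET.tostring(doc, encoding="unicode")
```

A call such as `to_xml_record(html_to_text(page), "sports", "media-A", "2005-01")` would produce one record per article. The domain and media values here are placeholders, not categories taken from the thesis.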

【Abstract】 Human society is moving from an industrial society into an information society, and the main carrier of information is natural language, which human beings use to communicate. Natural language processing studies how to make computers understand human language and how to develop suitable systems. The common vocabulary of a natural language is the set of words used most frequently in a national language system. Whether in Chinese language teaching, in dictionary compilation, or in computer information processing, a clear delimitation of the Chinese common vocabulary has far-reaching significance. Within a certain period of time, the common vocabulary is both a relatively stable set and an open one, both relatively dynamic and static. Traditional statistical methods and the experience of linguists cannot delimit the common vocabulary accurately. Applying computer technology to the problem, that is, automatic extraction of the common vocabulary based on a dynamic circulating corpus (DCC), therefore has value and significance. Using the scientific data of a corpus to study language has become an inevitable trend and a necessary means in language research.

This thesis is based on a DCC of mainstream newspapers in China. Dynamism and circulation are the essential characteristics of the DCC. The dynamism of the DCC embodies a rule of language change: "the diachronic contains the synchronic" and "the synchronic contains the diachronic". In other words, the corpus can provide both a synchronic and a diachronic description of the language. The circulation of the DCC is reflected in the choice of newspapers, which should have varied columns, be distributed in diverse regions, and give the corpus sufficient coverage.

Main contents of this thesis:

1. Classification of the original corpus. Using a self-written program, the author divides the corpus into 10 categories according to the columns of the newspapers; the classification results appear in Table 4-3.
2. Format conversion of the original corpus. The original corpus is in HTML\HML format and is converted, by domain category / media / year and month, into plain text. Useless information in the web page format is removed and only the effective text content is retained. After conversion, the documents are in XML format.
3. Segmentation and import of the text files into the database. The author segments the text files into words by domain category / media / year and month and imports the segmented output, word by word, into a database for further processing. The database software used in the experiments is SQL Server 7.0.
4. Checking. Using a self-developed manual proofreading system (written in Java), the author checks and corrects the inevitable errors of the segmentation step, making the results more accurate.
5. Vocabulary statistics. The monthly "word frequency", "domain commonness", and "temporal commonness" of each word are calculated. The software used in the experiments is Microsoft Excel 2003.
6. Extraction of the common vocabulary. The words are sorted in descending order of their yearly "vocabulary commonness Ok" and the common vocabulary list is extracted so that its words cover 85-95% of all word tokens in the corpus.
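
The extraction rule in step 6 (sort by the commonness score in descending order, then keep words until their tokens cover a target share of the corpus) can be sketched as below. This is an illustration under assumptions, not the thesis's implementation: the abstract does not define how Ok is computed, so the score is taken as a precomputed input, and the function name `extract_common_words` and the tuple layout are hypothetical.

```python
def extract_common_words(stats, coverage=0.90):
    """Select top-scoring words until they cover `coverage` of all tokens.

    stats: iterable of (word, commonness_score, token_count) tuples, where
    commonness_score stands in for the thesis's Ok value (assumed to be
    computed elsewhere).
    Returns (selected_words, achieved_coverage).
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")

    # Highest commonness first, as in step 6.
    rows = sorted(stats, key=lambda row: row[1], reverse=True)
    total = sum(count for _, _, count in rows)
    if total == 0:
        return [], 0.0

    selected = []
    covered = 0
    for word, _score, count in rows:
        # Stop as soon as the selected words reach the coverage target.
        if covered / total >= coverage:
            break
        selected.append(word)
        covered += count
    return selected, covered / total


if __name__ == "__main__":
    # Illustrative numbers only; they are not results from the thesis.
    sample = [("的", 0.9, 50), ("是", 0.8, 30), ("在", 0.7, 10),
              ("语料库", 0.2, 6), ("通用度", 0.1, 4)]
    words, ratio = extract_common_words(sample, coverage=0.85)
    print(words, ratio)
```

Setting `coverage` anywhere between 0.85 and 0.95 corresponds to the 85-95% range stated in the abstract. Because whole words are added one at a time, the achieved coverage can slightly overshoot the target.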

  • 【Classification No.】 TP391.1
  • 【Times Cited】 5
  • 【Downloads】 152