

A Research on the Extraction and Analysis of the Newspaper Theme Words Group Based on the Dynamic Circulating Corpus

【Author】 Shi Yanlan (史艳岚)

【Advisor】 Zhang Pu (张普)

【Author information】 Beijing Language and Culture University (北京语言大学), Linguistics and Applied Linguistics, 2006, PhD

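【Abstract, translated from the Chinese】 Motivated by the reform of newspaper-reading instruction in teaching Chinese as a foreign language, and building on the Dynamic Circulating Corpus of mainstream Chinese newspapers, this thesis undertakes the initial construction of a newspaper news resource database and arrives at a classified resource database in basic form. The database classifies newspaper texts by domain; the texts are word-segmented and counted with computational language-processing techniques to produce a word list for each domain. General words shared across domains are extracted from the domain word lists by domain intersection. The general word list is then used, through vocabulary separation, to extract first-level theme word groups for each domain, second-level theme word groups for the subdomains within each domain, and third-level theme word groups for the more specific themes below each subdomain. Theme word groups are thus extracted at several levels, and the extracted groups carry strong thematic features. In experiments on extracting theme word groups from single texts, these theme feature words proved effective for judging an article's relevance to a theme. The thesis also discusses the relationship between theme word groups and theme-based newspaper teaching, tests the accuracy of theme word group extraction, and makes a preliminary exploration of methods for measuring the theme relevance and difficulty of newspaper texts. The study of theme word groups provides a scientific and practical research platform for newspaper teaching and explores a new approach and method for vocabulary research.

Research route: newspaper news resource database -- general words -- theme word group extraction and related research -- theme-based teaching

Centered on theme word group extraction, the thesis reports the following results:

1. A Chinese newspaper news resource database was built on the Dynamic Circulating Corpus of mainstream newspapers. The database currently holds 170 million characters in 33,545 texts. Large-scale authentic material was processed by computer to establish a preliminary resource database for teaching newspaper news in Chinese as a foreign language, so that newspaper material can be updated dynamically and promptly. It also provides a scientific and practical platform for research on newspaper teaching and fills a gap in research on teaching Chinese as a foreign language.

2. A classification system for newspaper news teaching was established in preliminary form on the basis of the resource database. Various authoritative classification schemes were consulted, web page text classification was examined, and the theme classifications of several existing newspaper textbooks for Chinese as a foreign language were reviewed; a classification framework for the resource database was then proposed by weighing these factors together. Within the database, a domain classification system for newspaper teaching was set up with 19 domains, 91 subdomains, and 189 subordinate themes. It basically covers the main domains of newspaper news and supports the teaching of newspaper courses and other courses.

3. A general word list for newspaper news was extracted from the classified word lists of the nineteen domains. The focus of this research is the extraction of theme word groups, and the purpose of the general word list is to allow theme word groups to be extracted effectively by vocabulary separation. The general word list therefore serves the domain classification of words. We extracted the words that are common to all nineteen domains in the resource database. Because the general word list is derived from a large-scale corpus of mainstream Chinese newspapers, it is domain-general and dynamically updatable, and it works well for extracting theme word groups.

4. Theme word groups at different levels were extracted by vocabulary separation. Using vocabulary separation, the general words and the specialized words in the domain word lists and subdomain word lists were ...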
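The vocabulary separation step can be illustrated with a minimal Python sketch. It assumes that separation amounts to removing general words from a domain or subdomain word list, and that theme relevance of a single text is the share of its tokens found in a theme word group. Both are simplifying assumptions; the abstract does not give the thesis's exact procedure or formula. The function names `separate_vocabulary` and `theme_relevance` and all word frequencies are hypothetical.

```python
from collections import Counter

def separate_vocabulary(word_list, general_words):
    """Keep only the words (with frequencies) that are not general words."""
    return Counter({w: c for w, c in word_list.items() if w not in general_words})

def theme_relevance(tokens, theme_words):
    """Share of tokens that belong to the theme word group (0.0 for empty text)."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in theme_words)
    return hits / len(tokens)

# Invented toy data: a general word list plus one domain and one subdomain list.
general_words = {"的", "是", "发展", "中国"}
sports_list = Counter({"的": 50, "是": 20, "比赛": 12, "球队": 9, "发展": 3, "冠军": 7})
football_list = Counter({"的": 20, "球队": 9, "射门": 5, "中国": 2})

# First-level (domain) and second-level (subdomain) theme word groups.
sports_theme = separate_vocabulary(sports_list, general_words)
football_theme = separate_vocabulary(football_list, general_words)

# Relevance of an already segmented single text to the football theme.
tokens = ["中国", "球队", "的", "射门"]
score = theme_relevance(tokens, football_theme)
print(sorted(sports_theme), sorted(football_theme), score)
```

The same `separate_vocabulary` call could be applied to a third-level word list; whether the thesis also removes higher-level theme words at lower levels is not stated in the abstract.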

【Abstract】 Teaching Chinese to foreigners is a great undertaking for the Chinese nation. More and more foreigners come to China and seek the latest information from mainstream newspapers and other media. This research was driven by the needs of the teaching reform of the Newspaper Reading Course at BLCU. The thesis discusses how to build a newspaper resource database and extract theme word groups from it, based on the large-scale Dynamic Circulation Corpus (DCC) of mainstream Chinese newspapers. The whole study is framed by the theory of Dynamic Updating of Language and Knowledge. First, we established a classified newspaper resource database on the DCC and obtained 19 domain word lists from the natural-language texts in the database. We then extracted the general words by intersecting the 19 domain word lists. The most important part of the research is the extraction of theme word groups by means of vocabulary separation. The theme word groups are stratified into layers: (A) domain theme word groups; (B) subdomain theme word groups; (C) subordinate theme word groups; and single-text theme word groups. In the experiments, the theme words strongly reflected the features of the domain, subdomain, subordinate theme, and single text. These feature words at different layers can be used to measure the degree of semantic relevance to a theme, and we also explore ways to gauge text difficulty. The research on theme word groups benefits actual teaching in the Newspaper Reading Course. It provides a scientific and practical research platform for teaching Chinese to foreigners, and it also opens a new perspective on vocabulary study.

Research route: newspaper resource database -- general word lists -- extraction of theme word groups and related research -- theme-centered teaching

This thesis focuses on the extraction of theme word groups and related research, as follows:

1 Built a newspaper resource database based on the large-scale Dynamic Circulation Corpus of mainstream Chinese newspapers
The dynamic information resource system derives from the processing of instructional material. Dynamic information is another kind of educational information, and it is very significant for studying and teaching. Its content is wide-ranging and its forms of representation are diverse. The resource database contains 170,633,995 characters in 33,545 text files. It fills a gap in research on teaching Chinese to foreigners.

2 Built a classified newspaper teaching system based on the newspaper resource database
After studying many authoritative classification systems and several newspaper teaching materials, we built a layered, classified newspaper teaching framework. The framework contains 19 domains, 91 subdomains, and 189 subordinate themes, basically covering all the main domains of newspapers and the press. It benefits the teaching of newspaper courses and other courses.

3 Extracted a general word list for newspapers and the press from the 19 domain word lists
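Extracting general words by intersecting domain word lists can be sketched in a few lines of Python. The sketch assumes that a word counts as general if it occurs in every domain word list, with no frequency threshold; the abstract does not say whether the thesis applies additional criteria. The function name `extract_general_words`, the three toy domains, and all frequencies are invented for illustration (the thesis uses 19 domains).

```python
def extract_general_words(domain_word_lists):
    """Return the set of words that occur in every domain word list."""
    if not domain_word_lists:
        return set()
    vocabularies = [set(words) for words in domain_word_lists.values()]
    return set.intersection(*vocabularies)

# Invented toy data: word -> frequency for three hypothetical domains.
domain_word_lists = {
    "politics": {"的": 900, "政府": 40, "发展": 30, "会议": 25},
    "sports": {"的": 700, "比赛": 35, "发展": 5, "球队": 20},
    "economy": {"的": 800, "市场": 45, "发展": 60, "政府": 10},
}

general_words = extract_general_words(domain_word_lists)
print(sorted(general_words))
```

Here "政府" occurs in two of the three domains and is therefore not treated as general. A strict intersection of this kind is sensitive to small domains, so a real pipeline might require a minimum frequency per domain; that choice is left open here.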


