A Study of the Design Principles and Application of a Text Annotation Platform (文本标注平台的设计原理与应用研究)

【Author】 Yang Xiaomei (杨小梅)

【Advisors】 Jiang Di (江荻); Pan Wuyun (潘悟云)

【Author information】 Shanghai Normal University, Chinese Minority Languages and Literature, 2014, PhD

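【Abstract (translated from the Chinese original)】 Large-scale database construction for the languages of China has concentrated mainly on phonetics and vocabulary, while text-based grammatical research has progressed slowly. Three factors account for this. First, prevailing academic views have led text-based language resources to be neglected. Second, limitations in research methods have hindered grammatical research based on text annotation. Third, researchers are in short supply: China has many ethnic languages but few people working on them. Today a growing number of linguists recognize the importance of authentic text corpora, and grammatical research based on text annotation has gradually produced results. However, the text-processing methods and tools currently used for grammatical annotation and analysis are complicated and cumbersome, and they do not generalize well to the languages of China, especially tonal languages. It is therefore both necessary and urgent to design and develop, with the support of computer technology, a new research platform that processes text in order to carry out grammatical annotation.
The main goal of this thesis is to build a grammar research platform that is better suited to annotating texts in the languages of China and that is practical and efficient, so that linguists can turn raw corpora into annotated corpora efficiently and accurately and build the efficient interlinear-format language resources that the linguistics community has long hoped for. The thesis concentrates on two aspects. One is to improve the channels through which corpus resources are obtained, enriching text resources and enlarging researchers' self-built corpora. The other is to improve the methods for processing text resources, so that grammatical annotation is completed accurately and efficiently. The underlying technology consists of three parts: input technology, text processing technology, and output technology. The design principles and solution strategies of these three parts together form the overall framework of the platform, which offers researchers a grammar research platform better suited to the languages of China for grammatical analysis and text annotation.
The thesis has eight chapters:
Chapter 1 analyzes the current state of language resources and grammatical annotation, and from this establishes the necessity and importance of the study.
Chapter 2 introduces the overall framework of the text annotation platform and the design principles of the main technical methods.
Chapter 3 shows how the input technology gives access to text resources from multiple sources, and proposes a new approach in which rapid speech entry produces text.
Chapter 4 covers the dictionary, which runs through the whole platform: its importance and configuration, with emphasis on the interaction between text and dictionary and on how interlinear alignment, jump-and-insert word entry, and dictionary editing are implemented.
Chapter 5 covers syntactic analysis. For multiple languages, it proposes an improved matching algorithm that raises the efficiency and accuracy of text segmentation and match-based annotation, with emphasis on the importance of text segmentation and the strategies for implementing it.
Chapter 6 covers morphological analysis. For multiple languages, it implements text annotation of phonetic, grammatical, and semantic phenomena (inflection, agglutination, tone change, reduplication, and polysemy) and offers reasonable and feasible solutions.
Chapter 7 presents the output options for the resulting resources, including corpora, example sentences, 勘拷灯, dictionaries, and word lists. The interlinear format can be typeset and search results can be filtered on output, which makes the output highly practical.
Chapter 8 summarizes the innovations of the thesis and outlines the next stage of work.
The study presents reasonable and feasible methods for sourcing text resources, efficient and practical methods for grammatical annotation, and varied, typesettable methods for outputting the results. Using dictionary strategies, text segmentation, interlinear alignment, match-based annotation, morphological processing, deep and surface forms, and word rules, the thesis completes the grammatical annotation of a large body of text resources. It improves the methods for mining and studying the language resources of China, advances the grammatical annotation of authentic texts in minority languages and Chinese dialects, and plays an important role in protecting and preserving endangered languages and intangible cultural heritage.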

【Abstract】 Large-scale database construction for the languages of China has advanced rapidly in phonetics and vocabulary, while text-based grammatical research has developed slowly. Three factors explain this. First, text-based language resources have been neglected because of prevailing academic views. Second, grammatical research based on text annotation has been held back by the available research methods. Third, there are too few researchers for the large number of minority languages in China. More and more linguists are now aware of the importance of authentic text resources, and grammatical research based on text annotation has produced some results. However, the methods and tools currently used for grammatical annotation and analysis are cumbersome and do not suit the languages of China well, especially tonal languages. It is therefore necessary to design and develop, with the support of computer technology, a research platform that processes texts in order to carry out grammatical annotation.
The main objective of this study is to design a grammar research platform that is suitable for annotating texts in the languages of China and that is practical and efficient, so that linguists can turn raw materials into annotated materials efficiently and accurately and thereby build efficient interlinear corpora.
The thesis focuses on two aspects: on the one hand, expanding the corpora that researchers build themselves by improving the sources of data; on the other hand, completing grammatical annotation accurately and efficiently by improving the methods for processing text resources.
The underlying technology has three components: input technology, text processing technology, and output technology. The design principles and strategies of these three parts together form the overall framework of the platform, giving researchers a grammar research platform that is better suited to the languages of China and can be used for grammatical analysis and text annotation. The thesis is divided into eight chapters:
Chapter One analyzes the current state of language resources and grammatical annotation, and thereby establishes the necessity and importance of the study.
Chapter Two introduces the overall framework of the text annotation platform and the design principles of the main technical methods.
Chapter Three shows how the input technology provides a variety of text resources, and introduces a new way of producing texts through quick speech entry.
Chapter Four introduces the importance of dictionaries in the research platform and the interaction between text and dictionaries: interlinear alignment, the jump-insert method, and dictionary editing.
Chapter Five covers syntactic analysis: an improved matching algorithm for multilingual text raises the efficiency and accuracy of text segmentation and match-based annotation. The chapter also discusses the importance of word segmentation and the strategies for implementing it.
Chapter Six covers morphological analysis: it introduces feasible solutions for annotating phonetic, grammatical, and semantic phenomena, namely inflection, agglutination, tone change, reduplication, and polysemy.
Chapter Seven presents ways of outputting the resulting resources, including corpora, example sentences, collate copy lights (勘拷灯), dictionaries, and word lists.
Chapter Eight summarizes the main conclusions and innovations of the thesis and outlines future work.
This study describes methods for sourcing text resources, methods for grammatical annotation, and output technologies for the resulting resources.
In this thesis, a large number of text resources are grammatically annotated using dictionary strategies, text segmentation, interlinear alignment, match tagging, morphological processing, deep and surface forms, and word rules. The study improves the methods for researching the language resources of China, promotes the grammatical study of minority languages and Chinese dialects, and helps to protect endangered languages and intangible cultural heritage.
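
The abstract names dictionary-driven text segmentation, match tagging, and interlinear output without giving the algorithm itself. As a rough illustration only, the sketch below uses plain forward maximum matching, a standard textbook technique that stands in for the thesis's improved matching algorithm, which is not described here. The lexicon, the word forms, the glosses, and the function names `segment`, `annotate`, and `interlinear` are all invented for this example and are not taken from the platform.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: surface form -> gloss.
# The forms are invented for illustration, not data from the thesis.
LEXICON: Dict[str, str] = {
    "nga": "1SG",
    "kha": "mouth",
    "khalag": "food",
    "za": "eat",
    "gi": "PROG",
}
UNKNOWN_GLOSS = "?"  # placeholder gloss for material not in the lexicon


def segment(text: str, lexicon: Dict[str, str]) -> List[str]:
    """Forward maximum matching: at each position take the longest
    lexicon entry; fall back to a single character if nothing matches."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens: List[str] = []
    pos = 0
    while pos < len(text):
        longest = min(max_len, len(text) - pos)
        for size in range(longest, 0, -1):
            candidate = text[pos:pos + size]
            if candidate in lexicon:
                break
        else:
            candidate = text[pos]  # no entry matched at this position
        tokens.append(candidate)
        pos += len(candidate)
    return tokens


def annotate(text: str, lexicon: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair every segmented token with its gloss (match tagging)."""
    return [(tok, lexicon.get(tok, UNKNOWN_GLOSS))
            for tok in segment(text, lexicon)]


def interlinear(pairs: List[Tuple[str, str]]) -> str:
    """Render tokens and glosses as two column-aligned lines."""
    widths = [max(len(tok), len(gloss)) for tok, gloss in pairs]
    top = "  ".join(tok.ljust(w) for (tok, _), w in zip(pairs, widths))
    bottom = "  ".join(gloss.ljust(w) for (_, gloss), w in zip(pairs, widths))
    return top.rstrip() + "\n" + bottom.rstrip()


if __name__ == "__main__":
    print(interlinear(annotate("ngakhalagzagi", LEXICON)))
```

On the toy input, maximum matching prefers the longer entry `khalag` over its prefix `kha`, which is the kind of ambiguity a dictionary-based segmenter has to resolve. Tokens missing from the lexicon are kept and glossed `?`, so that an annotator could add them to the dictionary later.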
