西里尔和传统蒙古文的形态和转换系统研究

Research on Cyrillic and Mongolian Script’s Morphology and Conversion System

【Author】 奥干巴特尔

【Advisor】 高光来

【Author information】 Inner Mongolia University, Computer Application Technology, 2014, PhD

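【Summary】 The Mongolian people have used several scripts over their history, but today they mainly use the Traditional Mongolian script, Cyrillic Mongolian, and the Tod script. This thesis studies information processing for Traditional Mongolian and Cyrillic Mongolian, which covers two things: script conversion between Traditional Mongolian and Cyrillic Mongolian, and the morphology (word formation) of the two scripts. The introduction describes the significance, purpose, and objectives of the research in detail. Combining computer technology with Mongolian studies has become an inevitable trend in Mongolian computational linguistics. Although companies and individuals in Mongolia have carried out related research and developed some applications, the level of that work does not yet match the level of research in developed countries. For this reason, the author set out to study information processing for Cyrillic Mongolian and Traditional Mongolian. In this work we analyze Cyrillic Mongolian and Traditional Mongolian morphologically and use Mongolian word-formation rules to study conversion between the two scripts. The process has two steps: first, a Cyrillic Mongolian or Traditional Mongolian word is analyzed morphologically to identify its stem and suffixes; then these are converted into the corresponding Traditional Mongolian or Cyrillic Mongolian stem and suffixes, and word-formation rules are used to generate the corresponding word in the target script. The main work completed in this thesis is as follows.
1. The thesis studies the characteristics of Cyrillic Mongolian and Traditional Mongolian in order to apply the Two Level Morphology Model to Mongolian. From the standpoint of computational linguistics, the two scripts have many similarities and some differences. The orthographic rules of Cyrillic Mongolian currently fall into 66 categories, whereas Traditional Mongolian has only 3 orthographic rules: the vowel harmony rule, the consonant rule, and the connecting-sound rule. Mongolian is an agglutinative language in which new word forms are generated by adding suffixes to a stem. Cyrillic Mongolian and Traditional Mongolian also differ in how stems and suffixes are joined, and this difference comes from their different orthographic rules. On this basis, the author studied generation and parsing models for nouns and verbs, worked out the rules for attaching inflectional suffixes to stems, and identified all possible combinations of a stem with multiple inflectional suffixes.
2. Once this work was complete, building the corresponding resource libraries became urgent, because they are the foundation for conversion between Cyrillic Mongolian and Traditional Mongolian. The resources comprise a stem library, a morphology library, and a supplementary library. Because Mongolian stems combined with inflectional suffixes can generate a very large number of word forms, the author chose the stem as the basic unit of the library. The main advantages are that the library does not grow too large, applications run faster, and word generation rules can be fixed so that all possible forms of a given word are covered. The stem library contains 3 sub-libraries: corresponding Cyrillic Mongolian and Traditional Mongolian stems with word explanations (72000 entries); corresponding Cyrillic Mongolian and Traditional Mongolian stems with part-of-speech tags (61000 entries); and a library of stem codes together with word generation and word parsing codes (48000 entries). The morphology library contains 2 sub-libraries: corresponding Cyrillic Mongolian and Traditional Mongolian inflectional suffixes (86 entries), and the conditions under which multiple inflectional suffixes may be combined (876 entries). The supplementary library contains 2 sub-libraries: proper nouns (9135 entries) and abbreviations (1100 entries).
3. Based on the two-level morphology model and finite automata, the author built models of the orthographic rules of Cyrillic Mongolian and Traditional Mongolian, used them to analyze word structure, and carried out conversion experiments between the two scripts. PC-Kimmo is an open-source system for morphological analysis that consists of two components, a lexicon and a set of rules. The conversion model was built with PC-Kimmo as the tool. Words were divided into two major classes, nouns and verbs, and a generation model was built for each. The author modeled the orthographic rules of Cyrillic Mongolian and Traditional Mongolian separately and, using these models and the resource libraries, built a conversion system between the two scripts named KIM_MON (first version). The system parses, evaluates, and generates words for the user and reports the final result.
4. Finally, experiments on Mongolian morphological analysis were carried out with the KIM_MON system. The results show that the accuracy of morphological analysis of Cyrillic Mongolian and Traditional Mongolian reaches 97.6%. Given a correct morphological analysis, KIM_MON joins the word correctly in 100% of cases. Building on the morphological work, conversion experiments between the two scripts show that the accuracy of conversion from Cyrillic Mongolian to Traditional Mongolian reaches 91.3%, and from Traditional Mongolian to Cyrillic Mongolian 89.1%. In the conversion experiment on Cyrillic Mongolian words that are written the same but differ in meaning, accuracy reaches 86.9%. The experiments also show that the conversion accuracy for such words improves as the amount of training data increases.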
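The two-step process described above (analyze a word into stem and suffix, then convert each part and rejoin them) can be illustrated with a minimal Python sketch. This is not the KIM_MON implementation, which the record says is built on PC-Kimmo rule and lexicon files. The names `STEM_MAP`, `SUFFIX_MAP`, `analyze`, and `convert` are hypothetical, the few lexicon entries are illustrative assumptions rather than entries from the thesis's resource libraries, and Traditional Mongolian forms are shown in Latin transliteration with a hyphen before the suffix.

```python
from typing import Optional, Tuple

# Toy lexicon (assumption): Cyrillic stem -> transliterated Traditional stem.
STEM_MAP = {"ном": "nom", "гар": "γar"}
# Toy suffix table (assumption): Cyrillic dative-locative variants only.
SUFFIX_MAP = {"": "", "д": "du", "т": "tu"}


def analyze(word: str) -> Optional[Tuple[str, str]]:
    """Step 1: split a word into (stem, suffix), preferring the longest stem."""
    for cut in range(len(word), 0, -1):
        stem, suffix = word[:cut], word[cut:]
        if stem in STEM_MAP and suffix in SUFFIX_MAP:
            return stem, suffix
    return None  # the word is not covered by the toy lexicon


def convert(word: str) -> Optional[str]:
    """Step 2: map stem and suffix to the target script and rejoin them."""
    parts = analyze(word)
    if parts is None:
        return None
    stem, suffix = parts
    target_stem, target_suffix = STEM_MAP[stem], SUFFIX_MAP[suffix]
    return f"{target_stem}-{target_suffix}" if target_suffix else target_stem
```

A real system would replace the single-suffix lookup with rule-driven analysis so that stems carrying several inflectional suffixes, and the orthographic changes at their boundaries, are handled as well.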

【Abstract】 Although the Mongolian people have used several scripts over their history, three are in main use today: the Traditional Mongolian script, Cyrillic Mongolian, and the Tod script. This thesis presents morphological analysis of, and script conversion between, two forms of written Mongolian: the Traditional Mongolian script and Cyrillic Mongolian. The introduction explains the significance of the research in detail and states its aim and objectives. Countries that recognize the language processing industry as critical to the next generation of knowledge-based, knowledge-processing computers have supported it strongly through public policy, established national research centers, and carried out many capital-intensive national projects. Combining Mongolian studies with modern technology and developing Mongolian computational linguistics are pressing needs. Recognizing Mongolian words and sentences by computer helps to reveal and study the principles and features of Mongolian with modern approaches and technologies, so this work also prepares the ground for our further research. Although some Mongolian companies and individuals have done research and analysis and have created applications in this field, the results are unsatisfactory compared with the level reached in other countries, and no unified system has yet been created for the field. I therefore chose computer processing of Mongolian as the main subject of the thesis. In this work we perform morphological analysis of both Cyrillic Mongolian and the Traditional Mongolian script and define, by computer, how affixes are attached in accordance with the orthographic rules. The aim is to convert Cyrillic Mongolian text to the Traditional Mongolian script and vice versa.
The process runs in the following steps. First, a Cyrillic Mongolian or Traditional Mongolian word is analyzed morphologically to find its stem and affixes, which are then converted to the other script. Then the converted stem is joined with the converted affixes to generate the word in the target script. This combined process belongs to the morphology branch of computational linguistics. Words that are written differently in the Traditional Mongolian script according to their meaning may be written identically in the Cyrillic script, so we also set out to determine the meaning of a word. Within this research we carried out the following activities.
1. We described the features of both Cyrillic Mongolian and the Traditional Mongolian script, the Mongolian parts of speech, and word structure. The Traditional Mongolian script is a phonetic script, and it has many words that share the same sounds. Its orthography follows morphological and traditional principles. Cyrillic Mongolian orthography follows the phonetic principle and has the disadvantage of not observing the other principles. From the standpoint of computational linguistics, the two scripts share some features but also differ in many. Their orthographies may be similar in some respects, because the scholars who created the Cyrillic orthographic rules stated that those rules were based on the rules of the Traditional Mongolian script. The Cyrillic Mongolian orthography in use today consists of 66 articles, whereas the Traditional Mongolian script, inherited over a thousand years, has only 3 rules: vowel harmony, the rule for syllable-closing consonants, and connecting vowels. Mongolian is an agglutinative language, and words are generated and inflected by attaching suffixes to the word stem.
However, Cyrillic Mongolian and the Traditional Mongolian script follow different rules for attaching suffixes to the stem. This is not a feature of the Mongolian language but of the orthographic rules each script follows. When Mongolian nouns, adjectives, and pronouns occur in a sentence, they are inflected with plural, case, and possessive suffixes. Verbs are inflected with voice, state, temporal-ending, possessive-ending, subordinating conjunctive, and determining suffixes. We then developed models of noun and verb inflection, calculated the possible suffix sequences, and formulated suffix combination rules.
2. After this research, we needed to build a database. I therefore created Mongolian morphological and inflectional-suffix databases that meet the requirements of the Mongolian language and of my own research. This database will underpin many of our future tasks in computational linguistics. Using it, we will first complete the research on Mongolian morphology and the script conversion system. Storing every grammatically inflected form alongside the stems as separate entries would be the simplest but crudest method. We therefore defined the word stem as the database unit.
The main advantages are: the number of words stored in the database does not become artificially large; program speed increases; and word forms are derived from the grammar, so all possible inflections can be covered. The basic database consists of the following 3 parts: a primary database of primary keys, Cyrillic Mongolian and Traditional Mongolian headwords, and explanations (72210); a word-class database (53294); and an inflection database with codes that indicate grammatical inflection (48000). We also created a vocabulary of abbreviations containing 1100 words and a vocabulary of proper nouns containing 9135 words. According to this research, Mongolian has 86 suffixes, such as the instrumental, directive, dative-locative, plural, and negative. We numbered these suffixes and created a suffix vocabulary giving their Cyrillic Mongolian and Traditional Mongolian forms. The order in which suffixes stack follows definite principles: each morpheme in the word structure has its own position and sequence, and morpheme boundaries are clear. There are, however, some exceptions that break the rule of fixed position and sequence. For both scripts, we created a database of the suffix sequences that were determined to be valid.
3. With these activities completed, I was able to pursue the goal of Mongolian morphological analysis using two-level morphology on top of the database. We modeled the rules of the Traditional Mongolian script and Cyrillic Mongolian using finite-state automata and two-level morphology, and we conducted experiments on parsing word structure and generating words with this model. We studied the approach in depth, put it to practical use, and carried out the following activities. In the process it became clear that two-level finite-state morphology can be applied to Mongolian morphology.
This makes word generation and parsing available for further research. In computational morphology, both parsing and generating words with inflectional affixes need to be based on finite-state automata, so it is important to describe the design of the automata that inflect the database units. We classified the database into inflecting and non-inflecting words and divided the inflecting words into nouns and verbs, so the inflection automata correspond to noun inflection and verb inflection. For a description to be processed in PC-KIMMO, all rules must be written correctly and checked in turn. We also discuss approaches to writing rules. We modeled the Mongolian rules and performed morphological analysis; to do so, we modeled Cyrillic Mongolian and the Traditional Mongolian script separately and created a suitable rule file for each. Using the rule files and lexicon files, we developed morphological analysis software for Cyrillic Mongolian and the Traditional Mongolian script and tested it successfully. The word automaton has to parse the text entered by the user, process it, generate the correct word by attaching the appropriate affixes to the stem, and show the result or the structure of the word. Processing Unicode text requires some additional work. As a result, Cyrillic Mongolian and Traditional Mongolian texts can be processed, and the first version of the KIM_MON program was developed. The result of text processing does not depend on the character encoding (Latin, Cyrillic, etc.) but depends directly on how complete and well classified the lexicon file is and on how correctly the rules are defined.
4. I conducted experiments on Mongolian morphological analysis using the KIM_MON program and the database. In brief, the results are as follows. When the morphology of a text is parsed, 97.6% of the analyses are correct. For word generation, every word that had been analyzed correctly was also joined correctly.
When conversion is performed with the developed algorithm, the results are as follows. a) When converting from Cyrillic Mongolian to the Traditional Mongolian script, word-sense recognition reaches 91.3%. b) When converting from the Traditional Mongolian script to Cyrillic Mongolian, word-sense recognition reaches 89.1%. In the separate experiment on recognizing word sense, the recognition rate is 86.9%. The experiments indicate that building a larger training database can increase the recognition rate.
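The abstract states that stacked suffixes occupy fixed positions and that the valid sequences were stored in a database. A minimal Python sketch of such a check for nominal stems follows, assuming the slot order plural, then case, then possessive. The names `SLOT_ORDER` and `is_valid_sequence` are hypothetical, the strict-order test is a simplification, and the exceptions the abstract mentions are not modeled.

```python
from typing import Sequence

# Assumed slot positions for nominal suffixes (simplification, not the thesis's table).
SLOT_ORDER = {"plural": 0, "case": 1, "possessive": 2}


def is_valid_sequence(categories: Sequence[str]) -> bool:
    """Return True if the suffix categories appear in strictly increasing slot order."""
    if any(c not in SLOT_ORDER for c in categories):
        return False  # unknown category: reject rather than guess
    slots = [SLOT_ORDER[c] for c in categories]
    # Strictly increasing means no slot is repeated and none is out of order.
    return all(a < b for a, b in zip(slots, slots[1:]))
```

For example, `is_valid_sequence(["plural", "case"])` accepts a plural suffix followed by a case suffix, while `is_valid_sequence(["case", "plural"])` rejects the reverse order. A table-driven check like this keeps the ordering knowledge in data, which matches the abstract's choice of storing suffix sequences in a database rather than hard-coding them.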

  • 【Online publication contributor】 Inner Mongolia University
  • 【Online publication year and issue】 2014, Issue 09