

Research on High-Precision Machine Translation of Chinese Organization Name and Address

【Author】 Miao Wenyan (苗文彦)

【Advisor】 Zhao Tiejun (赵铁军)

【Author Information】 Harbin Institute of Technology, Computer Science and Technology, 2009, Master's thesis

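【Abstract, translated from the Chinese】 Put simply, machine translation uses a computer to translate one natural language into another. Because named entities carry much of the information in a text, the quality of their translation strongly affects the overall quality of the output, and named entity translation has become a focus of research. Now that the translation of person names and place names has largely been handled by transliteration techniques, the translation of non-transliterated information such as organization names and addresses has become the main open problem in named entity translation. Chinese-English bilingual corpora of organization names and addresses are extremely scarce, so the currently dominant statistical machine translation techniques cannot play to their strengths. In response, this thesis builds an organization name translation system whose core is a high-precision segmentation method based on representation patterns, and a Chinese organization address translation system that combines a machine-translation-oriented address segmentation method with a translation mechanism based on address units. Specifically, the thesis studies the following:
1. By analyzing a large number of data instances, a context-free grammar is used to abstract representation patterns that match the way organization names are composed, and a high-precision segmentation method based on these patterns is designed. Ambiguities in organization names are eliminated by fusing the two segmentation results produced by the independent organization segmentation mode and the independent address segmentation mode.
2. The composition of Chinese addresses is studied in depth, a definition of a valid address unit is given, an address recognition knowledge base that matches the composition of Chinese addresses is built, and an organization address segmentation method oriented to machine translation is implemented. Experiments show that the method is very effective for the specific task of organization address translation.
3. Once a Chinese organization address has been segmented into a sequence of address units, a matching translation mechanism is needed to complete the Chinese-English translation task. The thesis therefore develops a translation method based on address units that translates the different types of address unit. Automatic acquisition of CTR resolves the rule conflicts that are widespread in rule-based translation systems.
4. A Chinese organization name translation system and a Chinese organization address translation system are designed and implemented. Experiments show that, with only a few thousand standard Chinese-English bilingual entries, the two systems reach translation accuracies of 97.28% and 91.26% respectively under a 5-point scoring standard, a level suitable for practical use.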

【Abstract】 Machine translation is the use of a computer to translate one natural language into another. Because named entities carry most of the information in a text, their translation quality has a major impact on the quality of the translated text, and named entity translation has become a research focus. With person names and place names largely handled by transliteration, the translation of organization names and addresses is the next problem to be solved. Existing Chinese-English bilingual corpora of organization names and addresses are extremely scarce, so statistical machine translation (SMT), the current mainstream approach, cannot play to its strengths. To address this, we propose a Chinese organization name translation system built on a pattern-based high-precision segmentation method, and a Chinese organization address translation system that combines an MT-oriented organization address segmentation method with a unit-based translation mechanism. In detail, the thesis covers the following:
1. A context-free grammar is used to describe the structure of Chinese organization names, and a high-precision segmentation method based on this grammar is designed. Ambiguities in organization names are eliminated by combining the results of organization segmentation and address segmentation.
2. Based on a thorough study of how organization addresses are composed, the relevant structural features were identified and a knowledge base was built, and an MT-oriented segmentation approach was proposed. The experimental results show that the method is effective for this task.
3. After a Chinese organization address has been divided into a sequence of address units, a corresponding translation mechanism is needed to complete the translation task. We therefore propose a unit-based translation approach that translates the different types of address unit. Automatic acquisition of CTR resolves the rule conflicts that are widespread in rule-based translation systems.
4. A Chinese organization name translation system and a Chinese organization address translation system were designed and implemented. The experiments show that, with only several thousand bilingual pairs, the two systems reach accuracies of 97.28% and 91.26% respectively on a 5-point scoring scale.
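The idea in points 2 and 3, segmenting an address into units and then translating unit by unit, can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the suffix table, the toy lexicon, the function names, and the simple reversal of unit order for English are all assumptions made for illustration, and the knowledge base and automatically acquired CTR described in the abstract are not modelled here.

```python
import re

# Hypothetical suffix table: the final character of a unit marks its type.
SUFFIXES = {
    "省": "Province",
    "市": "City",
    "区": "District",
    "街": "Street",
    "路": "Road",
    "号": "No.",
}

# Hypothetical toy lexicon for the name part of each unit.
NAMES = {
    "黑龙江": "Heilongjiang",
    "哈尔滨": "Harbin",
    "南岗": "Nangang",
    "西大直": "West Dazhi",
}

# A unit is the shortest run of characters that ends in a known suffix.
UNIT_RE = re.compile(".+?[" + "".join(SUFFIXES) + "]")


def segment_address(address):
    """Split an address into units; fail if any text is left unsegmented."""
    units = UNIT_RE.findall(address)
    if "".join(units) != address:
        raise ValueError("could not segment the whole address: " + address)
    return units


def translate_unit(unit):
    """Translate one unit from its name part and its type suffix."""
    name, suffix = unit[:-1], unit[-1]
    if suffix == "号":
        # House numbers put the marker first in English.
        return "No. " + name
    # Names missing from the toy lexicon are passed through unchanged.
    return NAMES.get(name, name) + " " + SUFFIXES[suffix]


def translate_address(address):
    """Chinese order is large-to-small; this sketch reverses it for English."""
    units = segment_address(address)
    return ", ".join(translate_unit(u) for u in reversed(units))


print(segment_address("黑龙江省哈尔滨市南岗区西大直街92号"))
print(translate_address("黑龙江省哈尔滨市南岗区西大直街92号"))
```

A purely suffix-driven split like this breaks as soon as a suffix character appears inside a name, which is the kind of ambiguity that a dedicated address recognition knowledge base is meant to handle.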


