
面向WI输入法的新词发现技术研究与实现

The Research and Implementation of Discovery of New Words for WI Input Method

【Author】 Zhou Chunbo (周春波)

【Advisor】 Guan Yi (关毅)

【Author Information】 Harbin Institute of Technology, Computer Technology, 2011, Master's thesis

【Abstract】 A Pinyin input method converts a string of Pinyin letters into a string of Chinese characters. The accuracy of this conversion depends largely on whether the dictionary covers commonly used words, especially newly emerging ones. Adding new words to the dictionary by hand is time-consuming and laborious, whereas new word discovery mines new words automatically from large-scale text and readily picks up popular terms. This thesis studies new word discovery and adds the mined words to the input method dictionary in order to improve the accuracy of Pinyin-to-character conversion. The thesis first examines mining methods for two classes of new words: sentiment words and commodity words. For sentiment words, it proposes an iterative Chinese sentiment word mining method based on the max-flow min-cut principle; experimental results show that the method is effective at mining subjective words and outperforms traditional subjective word mining based on statistical models. For commodity words, it uses users' search logs from shopping websites as the data source. Exploiting the characteristics of such logs, it starts from the natural segmentation of user queries, applies an N-gram incremental stepwise algorithm together with string frequency statistics to compute the conditional probability of each candidate string, and then selects the candidate commodity words. Finally, the thesis describes the development process of an input method for Apple's iOS platform and shows the role that new word discovery plays in the WI input method. The WI input method is a sentence-level Chinese input method for Apple platforms, developed by the Web Intelligence research lab of the Language Technology Center in the School of Computer Science at Harbin Institute of Technology. Its first version was released on November 11, 2010. It now has more than 120,000 users, and its accuracy and fluency have been well received by them.
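
The commodity word step summarized in the abstract can be illustrated with a minimal sketch. The abstract names only the ingredients (natural segmentation of queries, incremental N-gram growth, string frequency, conditional probability), so everything else here is an assumption made for illustration: the function names `natural_segments`, `count_ngrams` and `mine_candidates`, the character-level N-grams, the use of whitespace as the natural segmentation, and the thresholds `min_freq` and `min_cond_prob`. It is not the thesis's actual implementation.

```python
from collections import Counter


def natural_segments(query):
    """Split a query on whitespace (assumed form of 'natural segmentation')."""
    return query.split()


def count_ngrams(segments, n, allowed_prefixes=None):
    """Count character n-grams inside each segment.

    If allowed_prefixes is given, only n-grams whose first n-1 characters
    survived the previous round are counted (the incremental step).
    """
    counts = Counter()
    for seg in segments:
        for i in range(len(seg) - n + 1):
            gram = seg[i:i + n]
            if allowed_prefixes is None or gram[:-1] in allowed_prefixes:
                counts[gram] += 1
    return counts


def mine_candidates(queries, max_n=4, min_freq=2, min_cond_prob=0.7):
    """Return {candidate string: conditional probability}.

    The conditional probability of a string c1..cn is estimated as
    count(c1..cn) / count(c1..c(n-1)). Thresholds are hypothetical.
    """
    segments = [s for q in queries for s in natural_segments(q)]
    prev_counts = count_ngrams(segments, 1)  # single-character counts
    survivors = None                         # no prefix filter for n = 2
    candidates = {}
    for n in range(2, max_n + 1):
        counts = count_ngrams(segments, n, survivors)
        survivors = set()
        for gram, freq in counts.items():
            if freq < min_freq:
                continue                     # too rare to keep or extend
            survivors.add(gram)              # frequent enough to extend
            cond_prob = freq / prev_counts[gram[:-1]]
            if cond_prob >= min_cond_prob:
                candidates[gram] = cond_prob
        if not survivors:
            break                            # nothing left to grow
        prev_counts = counts
    return candidates


queries = ["苹果手机 保护壳", "苹果手机 贴膜", "苹果电脑"]
result = mine_candidates(queries)
print(sorted(result, key=len))
```

On this toy log the string "苹果手机" is kept because it recurs and its last character is fully predictable from its prefix, while "保护壳" is dropped for occurring only once. Substrings of a frequent word can pass the same thresholds, so a practical system would likely need a further filtering step before adding candidates to the dictionary.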
