

Research on Intelligence-Aided Quality Control of Text in Chinese Newspaper Publishing

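【Author】 Hou Feng (侯锋)

【Advisor】 Li Guohui (李国辉)

【Author information】 National University of Defense Technology (国防科学技术大学), Control Science and Engineering, 2010, PhD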

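【Abstract (translated from the Chinese)】 Since laser phototypesetting for Chinese characters came into use, the level of informatization in Chinese news publishing has advanced rapidly. In recent years the scale of Chinese-language newspaper publishing in China has kept growing, and production stages such as reporting and editing, page layout, printing, finance, and distribution have all been computerized. The quality control stage of the production workflow, however, still handles each day's news copy and pages entirely by hand. It is inefficient and costly, and it has become the bottleneck of newspaper production. Starting from the current state and problems of newspaper publishing, this thesis takes optimization of the production workflow as its entry point and uses automatic text error checking and detection of repeated articles (repetition detection) as its means, with the aim of achieving intelligence-aided text quality control in newspaper publishing. The main results are as follows:

1. The existing newspaper production workflow and related software are integrated and optimized, and a conceptual framework and a technical framework for digital, intelligence-aided text quality control are proposed. The optimized workflow provides a digital platform on which people and computers carry out quality control together. It also gives the computer a closed-loop learning environment in which it keeps learning new words and linguistic knowledge from historical articles. That knowledge is in turn applied to the error checking and repetition detection algorithms based on lexical semantic classes, so the computer can assist manual quality control with a relatively high degree of intelligence.

2. To use lexical semantics for error checking at the semantic level, a method for designing a semantic classification system for Chinese content words, oriented toward error checking, is proposed together with a method for obtaining seed words. A seed-word-based algorithm for automatically acquiring the semantic classes of Chinese content words is also proposed. Using two kinds of features, syntactic features and word-formation morphemes, it acquires semantic class labels for content words from a large unsegmented corpus. The algorithm can acquire multiple semantic classes for a polysemous word and can identify sentiment words. A Chinese lexical analysis process based on lexical semantic classes is given, in which a conditional random field model labels the semantic class of each word and identifies noun phrase boundaries.

3. Based on the types of text errors found in news copy and their causes, and targeting the grammatical, semantic, and contextual inconsistency errors left unsolved by research on automatic Chinese proofreading, four error checking algorithms for different error types are proposed. The semantic error checking algorithm based on semantic-class 3-grams uses anomalous adjacency between lexical semantic classes to find real-word substitution errors that ordinary checkers cannot find, as well as some grammatical and semantic errors. The error checking algorithm based on selectional preference uses a verb's selectional preferences for its subject and object to find long-distance verb-object or subject-predicate collocation errors. The algorithm for checking complex-sentence structure and punctuation, based on pointwise mutual information, uses the co-occurrence probability of complex-sentence conjunctions and punctuation marks to find grammatical and punctuation errors. Name-title inconsistency detection compares name-title pairs to find places where a person's name or title is inconsistent across the text.

4. To meet the need for automatically updating the historical articles used in repetition detection, a repetition detection workflow and a concrete algorithm are proposed. The algorithm first classifies historical articles by general topic and then clusters the articles within each general topic. During online repetition detection, the article under test is first assigned to an event class according to its first paragraph; full-text features are then used within that event class to decide whether it is a repeated article. The algorithm updates the historical articles automatically and detects repetitions at the same time, and it improves detection precision by comparing similarity between paragraphs.

The application system built on the optimized production workflow went live at Changjiang Daily (《长江日报》) and has been running for more than two years, and its advantages in efficiency and cost have been demonstrated. The great majority of the automatic error checking and repetition detection algorithms proposed in this thesis have also been applied in the system.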
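The two-stage repetition detection in item 4 (assign by first paragraph, then decide on the full text within the event class) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: character-bigram Jaccard similarity stands in for the unspecified features, the event classes are a plain dictionary with integer keys, and every function name and threshold here is a hypothetical choice made for illustration.

```python
def char_ngrams(text, n=2):
    """Character n-grams (assumed feature); needs no Chinese word segmenter."""
    text = "".join(text.split())
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of the character-bigram sets of two strings."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def paragraph_similarity(doc_a, doc_b):
    """Mean, over paragraphs of doc_a, of the best-matching paragraph in doc_b."""
    if not doc_a or not doc_b:
        return 0.0
    return sum(max(jaccard(p, q) for q in doc_b) for p in doc_a) / len(doc_a)


def detect_repetition(candidate, clusters, assign_threshold=0.2, dup_threshold=0.8):
    """Return (event_id, is_repetition) and update `clusters` in place.

    candidate: list of paragraph strings (first paragraph at index 0).
    clusters:  dict mapping integer event ids to lists of articles.
    Both thresholds are illustrative values, not taken from the thesis.
    """
    if not candidate:
        raise ValueError("candidate must contain at least one paragraph")

    # Stage 1: pick the event class whose first paragraphs best match.
    best_event, best_score = None, 0.0
    for event_id, docs in clusters.items():
        score = max((jaccard(candidate[0], d[0]) for d in docs), default=0.0)
        if score > best_score:
            best_event, best_score = event_id, score

    if best_event is None or best_score < assign_threshold:
        # No close event class: start a new one with this article.
        new_id = max(clusters, default=-1) + 1
        clusters[new_id] = [candidate]
        return new_id, False

    # Stage 2: paragraph-level comparison inside the chosen event class.
    is_dup = any(
        paragraph_similarity(candidate, d) >= dup_threshold
        for d in clusters[best_event]
    )
    if not is_dup:
        clusters[best_event].append(candidate)  # history is updated online
    return best_event, is_dup
```

Restricting the full-text comparison to one event class keeps the expensive paragraph-level step small, and appending non-repeated articles to their class is one simple way to update the history while detecting.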

【Abstract】 The level of informatization in Chinese newspaper publishing has risen sharply since laser phototypesetting systems for Chinese characters came into use. In recent years Chinese newspaper publishing has continued to scale up, and production processes such as reporting and editing, typesetting, printing, financial management, and circulation management have been digitized. However, the quality control process, which checks news text and newspaper pages for errors and repetitions, is still entirely manual. Because of its low efficiency and high cost, it has become the bottleneck of newspaper publishing. Based on an analysis of the problems in the current publishing process, this thesis adapts that process and proposes several automatic error checking and repetition detection algorithms in order to achieve intelligence-aided quality control of newspaper publishing. The primary contributions are as follows:

1. The current production process and related software are integrated and optimized, and a conceptual and technical framework for intelligence-aided quality control of Chinese newspaper publishing is presented. The adapted and optimized production process provides a digital platform on which users and computers carry out quality control together. It also provides a closed-loop learning environment in which the computer learns new words and linguistic knowledge. This knowledge is then applied in the error checking and repetition detection algorithms based on lexical semantic classes, so the computer can aid quality control with a high degree of intelligence.

2. To find semantic errors in text by using lexical semantics, a method for building a semantic classification taxonomy of content words is proposed, together with a seed-word-based algorithm for automatically acquiring the semantic classes of Chinese content words. The algorithm learns the semantic classes of content words from an unsegmented Chinese corpus, acquires multiple semantic classes for words with multiple senses, and identifies sentiment words. A Chinese lexical analysis process based on semantic classes is presented, in which a conditional random field model is used to label the semantic class of each segmented Chinese word and to identify noun phrase boundaries.

3. According to error types and their causes, four algorithms are proposed to detect syntactic, semantic, and inconsistency errors, which traditional automatic Chinese proofreading has not solved. The semantic-class trigram error checking algorithm detects real-word substitution errors and some syntactic and semantic errors. The selectional-preference error checking algorithm detects subject-predicate and verb-object collocation errors by using the selectional preferences of verbs. The pointwise mutual information error checking algorithm detects syntactic and punctuation errors by using the pointwise mutual information between complex-sentence conjunctions and punctuation marks. The inconsistency checking algorithm detects inconsistencies between a person's name and title within a text.

4. To organize historical news texts automatically for repetition detection, a repetition detection algorithm is proposed. The historical news texts are first classified by general topic and then clustered by event. In online repetition detection, the input text is first classified into a general topic and assigned to an event using its first paragraph, and the whole text is then used to decide whether the input is a repetition. The algorithm both organizes the historical texts automatically and detects repetitions, and the precision of repetition detection is improved by computing similarity between the paragraphs of different texts.

The application system based on the adapted and optimized production process has been in use at Changjiang Daily (《长江日报》) for more than two years, and its advantages in efficiency and cost have been demonstrated. Most of the error checking and repetition detection algorithms have been applied in the system.
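The mutual information check between conjunctions and punctuation in item 3 rests on the standard pointwise mutual information formula, PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The Python sketch below shows that computation over observed (conjunction, punctuation) pairs. It is only an illustration under assumptions: the pair representation, the function names `pmi_table` and `is_suspicious`, the zero threshold, and the choice to flag unseen pairs are all hypothetical and are not described in the abstract.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """Estimate PMI for every observed (conjunction, punctuation) pair.

    pairs: list of (conjunction, punctuation) tuples taken from a corpus
    that is assumed to be correct, for example archived articles.
    """
    pair_counts = Counter(pairs)
    conj_counts = Counter(c for c, _ in pairs)
    punct_counts = Counter(p for _, p in pairs)
    total = len(pairs)
    table = {}
    for (conj, punct), n in pair_counts.items():
        # p(x, y) / (p(x) * p(y)) simplifies to n * total / (count_x * count_y)
        ratio = n * total / (conj_counts[conj] * punct_counts[punct])
        table[(conj, punct)] = math.log2(ratio)
    return table


def is_suspicious(conj, punct, table, threshold=0.0):
    """Flag a pair whose PMI is below the (illustrative) threshold.

    Pairs never seen in training get negative infinity and are flagged;
    that is a design choice of this sketch.
    """
    return table.get((conj, punct), float("-inf")) < threshold


# Toy training data: "although" clauses usually end with a comma,
# "therefore" clauses usually end with a full stop.
training = (
    [("although", ",")] * 9 + [("although", ".")]
    + [("therefore", ".")] * 9 + [("therefore", ",")]
)
table = pmi_table(training)
```

With this toy data, `is_suspicious("although", ".", table)` is true because the pair occurs less often than chance would predict, while `is_suspicious("although", ",", table)` is false.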
