网络双语语料挖掘关键技术研究
Research on Key Technology in Mining Web Bilingual Corpora
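【Author】 Zhu Zede (朱泽德);
【Supervisor】 Li Miao (李淼);
【Author information】 University of Science and Technology of China, Pattern Recognition and Intelligent Systems, 2014, PhD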
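【Abstract, translated from the Chinese】 With the rapid development of statistical methods, large-scale bilingual corpora have become an indispensable basic resource for cross-language information processing. Bilingual corpora are widely used to mine finer-grained translation equivalents such as bilingual terms, named entities and bilingual dictionaries, in support of fields such as statistical machine translation and cross-language information retrieval. However, existing bilingual corpus resources are very scarce, and those for low-density languages are especially so. In recent years, raw bilingual resources on the web have grown rapidly and offer fresh content and broad sources, so research on methods for mining bilingual corpora from the web has become a focus of attention. This thesis takes web bilingual corpus mining as its subject, designs systems for mining parallel corpora and comparable corpora, and studies four key technologies: parallel webpage identification, webpage body-text extraction, keyword extraction, and cross-language text similarity computation. They are described below.

1) Parallel webpage identification based on new feature information. To address the asymmetric page structure encountered when mining parallel corpora from the web, this thesis proposes using as features, among others, the similarity of the pages' HTML tag sequences, computed with an improved edit distance, and the similarity of their digit sequences, computed with maximum matching. A support vector machine then identifies parallel webpages from these features. The method reduces dependence on page structure information and adapts better to the existing web resources of low-density languages.

2) Webpage body-text extraction based on a text density model. To address boundary misjudgment when extracting body text from webpages with differing structural layouts, this thesis proposes a body-text extraction method for news webpages based on a text density model. A statistical model that fuses page structure and language features converts the page document, text line by text line, into a sequence of positive and negative text densities. Gaussian smoothing then corrects the sequence according to the continuity of content in neighbouring lines. Finally, an improved maximum subsequence method segments the density sequence to extract the body text. The method preserves the completeness of the body text while excluding noise, and needs neither manual intervention nor repeated training.

3) Document keyword extraction based on the LDA model. To address the failure of existing keyword extraction methods to reflect jointly the salience, readability and coverage of a text's topics, this thesis proposes TFITF, a new keyword extraction algorithm based on the latent topics of documents. It uses a large-scale corpus to build a latent topic model for computing the TFITF weight of each word with respect to each topic, derives from these the weight of each word with respect to the document, merges adjacent words using co-occurrence information to form candidate keyphrases, and finally uses similarity to exclude redundant phrases whose latent topics are close. The method effectively improves the precision and recall of document keyword extraction.

4) Cross-language document similarity based on the Bi-LDA model. Matching cross-language documents by features such as mutually translated words cannot measure the topical relevance of a document pair. To address this, the thesis proposes analysing documents in different languages with a cross-language topic model based on Bi-LDA, and gives three methods for computing the similarity of documents in different languages: the KL divergence of document-topic distributions, the cosine of topic frequency-inverse document frequency vectors, and the conditional probability of documents. These provide a basis for selecting similar text pairs and building comparable corpora automatically. The method strengthens the understanding of document semantics, overcomes the superficiality of matching documents by mutually translated words, and can effectively match documents in different languages that share a topic.

In this thesis, the main technologies for parallel corpus mining are parallel webpage identification and body-text extraction; those for comparable corpus mining are body-text extraction, keyword extraction and cross-language text similarity. Experiments show that the proposed methods improve the utilization of web resources and the quality of web bilingual corpus mining.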
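The smoothing and segmentation steps of item 2) can be illustrated with a minimal Python sketch. It assumes the signed per-line text densities have already been computed; the abstract does not define the density model, so the input values below are made up. It also swaps in plain Gaussian smoothing and the standard maximum-sum contiguous segment (Kadane's algorithm) for the thesis's "improved" variants, whose details are not given. The names `gaussian_smooth`, `max_sum_segment` and `extract_body_range`, and the parameters `sigma` and `radius`, are hypothetical.

```python
import math


def gaussian_smooth(values, sigma=1.0, radius=2):
    """Smooth a sequence with a truncated Gaussian kernel.

    The kernel is renormalised at the edges so that a constant
    sequence stays constant after smoothing.
    """
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in offsets]
    smoothed = []
    for i in range(len(values)):
        total = 0.0
        weight = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < len(values):
                total += w * values[j]
                weight += w
        smoothed.append(total / weight)
    return smoothed


def max_sum_segment(values):
    """Kadane's algorithm.

    Returns (start, end), inclusive indices of the contiguous run
    with the largest sum. Assumes a non-empty sequence.
    """
    best_sum = float("-inf")
    best_range = (0, 0)
    current_sum = 0.0
    current_start = 0
    for i, v in enumerate(values):
        if current_sum <= 0:
            # A non-positive running sum cannot help; restart here.
            current_sum = v
            current_start = i
        else:
            current_sum += v
        if current_sum > best_sum:
            best_sum = current_sum
            best_range = (current_start, i)
    return best_range


def extract_body_range(densities, sigma=1.0, radius=2):
    """Smooth the signed densities, then pick the max-sum line range."""
    return max_sum_segment(gaussian_smooth(densities, sigma, radius))


# Made-up densities: positive = body-like line, negative = noise-like line.
line_densities = [-2, -1, 3, 4, -0.5, 5, 2, -3, -2]
start, end = extract_body_range(line_densities)
print(f"body text spans lines {start}..{end}")
```

In this toy input the slightly negative line at index 4 sits between strongly positive neighbours, which is the situation where smoothing followed by a maximum-sum segment keeps a short low-density line inside the extracted body instead of cutting the body in two.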
【Abstract】 With the development of statistical techniques, large-scale bilingual corpora have become an indispensable resource for cross-language information processing. Bilingual corpora are used to mine fine-grained translation equivalents, such as bilingual terminology, named entities and bilingual lexicons, in support of statistical machine translation and cross-language information retrieval. However, existing bilingual corpora are scarce in practice, especially for low-density languages. In recent years, raw bilingual resources on the web have grown rapidly and offer novel content and a wide range of sources, so mining bilingual corpora from the web has become a focus of attention. To study the mining of web bilingual corpora, this thesis designs two systems, one for mining parallel corpora and one for mining comparable corpora, and investigates four key technologies: parallel webpage identification, content extraction, keyphrase extraction and cross-language document similarity. The main work is as follows.

1) Parallel Webpage Identification Based on New Heuristic Information. To handle the heterogeneous page structure encountered when mining parallel corpora from the web, this thesis introduces two new heuristics: the alignment of tag structures, computed with an improved edit distance, and the similarity of co-occurring number sequences, computed from maximal common subsequences. A support vector machine then combines these heuristics to classify page pairs as parallel or not. This approach reduces dependence on page structure information and thus adapts better to low-density languages.

2) Web Content Extraction Based on a Text Density Model. To avoid misjudged content boundaries and to obtain useful content from webpages with different layouts, this thesis proposes a content extraction approach based on a text density model. The model integrates page structure features with language features to convert the text lines of a page into a sequence of positive and negative densities. Gaussian smoothing is then applied to revise the density sequence, taking the content continuity of adjacent lines into account. Finally, an improved maximum subsequence segmentation splits the sequence and extracts the web content. Without human intervention or repeated training, this approach preserves the integrity of the content and eliminates noise.

3) Keyphrase Extraction Based on the LDA Model. Existing methods lack a comprehensive treatment of the significance, readability and coverage of document topics. To address this, the thesis presents TFITF, a new keyphrase extraction algorithm based on a latent topic model. The algorithm uses a large-scale corpus to build the latent topic model, computes the TFITF weight of each word with respect to each topic, and from these derives the weight of each word with respect to the document. Adjacent words are then merged into candidate keyphrases based on co-occurrence information. Lastly, redundant phrases are excluded according to the similarity of their topics. The method effectively improves the precision and recall of keyphrase extraction.

4) Cross-language Document Similarity Based on the Bi-LDA Model. Existing methods that rely on mutually translated words and related features cannot evaluate the topical relation between cross-language document pairs. To address this, the thesis adopts the Bi-LDA model to analyze document topic structure and computes the similarity of cross-language documents in three ways, in order to construct comparable corpora: the KL divergence between document-topic distributions, the cosine similarity between Topic Frequency-Inverse Document Frequency values, and the conditional probability between documents. This method enhances the understanding of document semantics, overcomes the superficiality of vocabulary matching, and retrieves similar documents with consistent topics.

The system for mining parallel corpora mainly uses parallel webpage identification and content extraction. The system for mining comparable corpora mainly uses content extraction, keyphrase extraction and cross-language document similarity. Experimental results show that the methods of this thesis effectively improve the utilization of web resources and the quality of the mined bilingual corpora.
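Two of the similarity measures named in item 4) can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis's implementation: it assumes each document has already been mapped to a probability distribution over a topic space shared by both languages (the role Bi-LDA plays in the thesis), and the topic vectors below are made up. The symmetrised KL divergence is one common choice and is an assumption here, since the abstract does not say which form is used. The Topic Frequency-Inverse Document Frequency weighting and the conditional-probability measure are not reproduced because the abstract does not define them. All function names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same topics.

    `eps` guards against division by zero when q has a zero entry.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same topics")
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p); lower means more similar."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def cosine_similarity(u, v):
    """Cosine of the angle between two topic-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(source_topics, candidates):
    """Sort candidate document ids by increasing symmetric KL divergence."""
    return sorted(
        candidates,
        key=lambda doc_id: symmetric_kl(source_topics, candidates[doc_id]),
    )


# Made-up topic distributions over three shared topics.
source_doc = [0.70, 0.20, 0.10]
target_docs = {
    "en_doc_a": [0.65, 0.25, 0.10],
    "en_doc_b": [0.10, 0.20, 0.70],
}
for doc_id in rank_candidates(source_doc, target_docs):
    print(doc_id, round(cosine_similarity(source_doc, target_docs[doc_id]), 3))
```

Because both measures operate on topic distributions rather than on surface vocabulary, two documents can score as similar without sharing any directly translatable words, which is the property the abstract contrasts with translation-word matching.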