Mining Bilingual Parallel Corpora from Web Automatically and Its Application in Statistical Machine Translation

【Author】 Lin Zheng

【Author information】 Tianjin Normal University, Computer Application Technology, 2010, Master's degree

【Abstract】 Bilingual parallel corpora have many important applications in natural language processing: they provide essential training data for statistical machine translation and serve as a basic resource for lexicography and cross-language information retrieval. However, large-scale bilingual parallel corpora are not easy to obtain, and existing parallel corpora cannot meet practical needs in terms of scale, timeliness, and domain balance. With the spread and rapid development of the Internet, more and more bilingual sites have been created and more and more information is published in multiple languages, which offers a rich source for building bilingual and multilingual corpora. Some researchers have proposed methods for automatically mining bilingual or multilingual parallel corpora from the Web, which provide an effective way to build such corpora. This paper aims to build a large-scale, Web-based system for automatically acquiring bilingual parallel corpora. The main contributions are as follows:

1. Automatic discovery and acquisition of mixed-language Web pages. Bilingual parallel resources on the Internet fall into two categories. In the first, the bilingual resource is distributed across two pages that express the same content in different languages; these are called bilingual parallel pages. In the second, the bilingual resource is located within a single page; these are called mixed-language pages. Previous systems are mainly based on the first category, but we observed that the Web contains a large number of mixed-language pages, whose parallel texts are more neatly aligned and of higher translation quality, which makes them a very valuable source of bilingual data. Bilingual parallel pages are similar in address or structure, and the methods for handling them are already mature, but these methods cannot be applied to mixed-language pages. The distribution of candidate mixed-language pages is usually uncertain, and the lack of common heuristic information makes them harder to discover. This paper presents a method for automatically discovering mixed-language pages based on a tentative-download strategy; the candidate mixed-language pages obtained with this method reach an accuracy close to 100%.

2. Extraction of bilingual parallel sentence pairs from mixed-language pages. The task can be divided into three parts: Web noise filtering, mixed-language page identification, and sentence alignment. We implemented two methods for filtering Web noise: a dedicated template-based approach and a general approach based on the HTML tag tree. Mixed-language pages are identified in two steps, the first based on the ratio of character counts and the second based on the translation ratio. Finally, we convert the parallel passages into parallel sentences using an alignment method based on hybrid information. This paper solves these three difficult problems and implements an automatic mining system based on mixed-language pages.

3. Application of the Web bilingual parallel corpus. We apply the bilingual parallel sentences obtained from the Web to the training of a statistical machine translation model, and propose two strategies for loading the Web corpus into the training set: a sentence-quality sorting method and an information-retrieval method. The results show that both strategies improve the performance of the translation system. Experiments on the IWSLT tasks show gains of +2 to +5 BLEU over the baseline.
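
As an illustration of the coarse first step in contribution 2 (the check based on the ratio of character counts), the following is a minimal Python sketch. It assumes the language pair is Chinese and English, which the abstract does not state. The function names `cjk_ratio` and `looks_mixed` and the threshold values are hypothetical and are not taken from the paper.

```python
def cjk_ratio(text: str) -> float:
    """Share of CJK ideographs among CJK ideographs plus ASCII letters.

    Assumption: the two languages are Chinese and English, so counting
    characters in the CJK Unified Ideographs block (U+4E00..U+9FFF) and
    ASCII letters is enough to tell the two scripts apart.
    """
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    total = cjk + latin
    return cjk / total if total else 0.0


def looks_mixed(text: str, low: float = 0.1, high: float = 0.6) -> bool:
    """Coarse filter: keep a page only if both scripts are well represented.

    The thresholds are illustrative placeholders; in practice they would
    have to be tuned on labelled pages.
    """
    return low <= cjk_ratio(text) <= high


if __name__ == "__main__":
    sample = "机器翻译 machine translation"
    print(cjk_ratio(sample), looks_mixed(sample))
```

A page that passes this cheap filter would still need the second, finer check described in the abstract before it is treated as a mixed-language page.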

【Key words】 Web Mining; Parallel Corpora; Sentence Alignment; Statistical Machine Translation

【Online publication contributor】 Tianjin Normal University

【Classification number】 TP391.2
【Times cited】 2
【Downloads】 291