中文词法分析的研究及其应用

The Research and Applications of Chinese Lexical Analysis

【Author】 Sun Xiao (孙晓)

【Author information】 Dalian University of Technology, Computer Application Technology, 2010, PhD

【Abstract (translated from the Chinese)】 In machine translation and other natural language processing tasks, identifying and processing words is a critical foundational step for Asian languages such as Chinese and Japanese. The problems it raises have not yet been fully solved, which limits the accuracy and efficiency of machine translation and other natural language processing tasks. Besides word segmentation, Chinese lexical analysis also includes basic steps such as part-of-speech (POS) tagging and the identification and POS tagging of unknown words (or new words); these are also the main obstacles to improving its performance and accuracy. First, to address these problems, a new word-lattice-based Chinese lexical analysis method that fuses word-level and character-level information is proposed. The method uses the system lexicon to build a word lattice containing all candidate segmentation and POS tagging paths. Candidate unknown words and their POS tags are identified simultaneously and added to the lattice, which reduces the computational complexity of unknown word identification. A word-based conditional random field (CRF) model, combined with global feature templates defined over the entire input path, then selects the final segmentation and POS tagging result from the lattice. The word-based CRF decodes faster than the character-based CRF and reduces the influence of the label bias and length bias problems; tests on open and closed corpora such as SIGHAN-6 gave satisfactory results. In addition, for comparison, the character-based Chinese word segmentation model was studied further: several external dictionaries and corresponding features were introduced, which further improved its segmentation accuracy. To meet the demand for high-efficiency Chinese lexical analysis, an integrated Chinese lexical analysis method based on the longest and second-longest matching algorithm is also proposed; because encoding and decoding are based on a hidden Markov model, it is fast in both training and analysis. Second, for unknown word identification and tagging in Chinese lexical analysis, a hidden-state semi-Markov conditional random field model (Hidden semi-CRF) is proposed, which can identify unknown words and their POS tags simultaneously. The Hidden semi-CRF model combines the advantages of the latent-dynamic conditional random field model (LDCRF) and the semi-Markov conditional random field model (semi-CRF), and has lower computational cost and higher identification accuracy than the semi-CRF model. Unknown words and their POS tags identified by the Hidden semi-CRF model are added to the word lattice and take part in overall path selection, which improves the overall accuracy of lexical analysis. Finally, the results of Chinese lexical analysis are applied directly to a Super Function-based Chinese-Japanese machine translation system, and the original Super Function is extended in three ways: Super Functions are extended into sentence-oriented and phrase-oriented Super Functions, the range of variables in Super Functions is extended, and an efficient matching algorithm for searching similar Super Functions is proposed. The extended Super Functions reduce the size of the Super Function library, speed up the retrieval of matching Super Functions, and improve translation accuracy and quality.
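
The lattice-based selection described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy lexicon, its counts and POS tags, the function names `build_lattice` and `best_path`, and above all the scoring, which uses unigram log-probabilities in place of the thesis's word-based CRF with global features. It shows only the general idea of enumerating lexicon matches as lattice edges and picking the best-scoring path by dynamic programming.

```python
import math

# Hypothetical toy lexicon: word -> (POS tag, count). Not taken from the thesis.
LEXICON = {
    "研究": ("v", 50),
    "研究生": ("n", 20),
    "生命": ("n", 30),
    "命": ("n", 5),
    "生": ("v", 5),
    "起源": ("n", 20),
    "的": ("u", 200),
}
TOTAL = sum(count for _, count in LEXICON.values())
MAX_WORD_LEN = max(len(word) for word in LEXICON)
# Assumed penalty for a single character that is not in the lexicon.
UNKNOWN_LOGP = math.log(1 / (TOTAL * 10))


def build_lattice(text):
    """Return edges[start] = list of (end, word, pos, log_prob) for lexicon matches."""
    edges = [[] for _ in text]
    for start in range(len(text)):
        for end in range(start + 1, min(len(text), start + MAX_WORD_LEN) + 1):
            word = text[start:end]
            if word in LEXICON:
                pos, count = LEXICON[word]
                edges[start].append((end, word, pos, math.log(count / TOTAL)))
        # Add a single-character fallback edge so every position stays connected.
        if not any(edge[0] == start + 1 for edge in edges[start]):
            edges[start].append((start + 1, text[start], "unk", UNKNOWN_LOGP))
    return edges


def best_path(text):
    """Pick the highest-scoring (word, pos) sequence through the lattice."""
    edges = build_lattice(text)
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: best score of any path ending at i
    back = [None] * (n + 1)        # back[i]: (start, word, pos) of the last edge
    best[0] = 0.0
    for start in range(n):
        for end, word, pos, logp in edges[start]:
            score = best[start] + logp
            if score > best[end]:
                best[end] = score
                back[end] = (start, word, pos)
    # Follow back-pointers from the end of the text to recover the path.
    result = []
    i = n
    while i > 0:
        start, word, pos = back[i]
        result.append((word, pos))
        i = start
    return result[::-1]


print(best_path("研究生命的起源"))
```

With these toy counts the path 研究 / 生命 / 的 / 起源 outscores 研究生 / 命 / 的 / 起源, because both candidate paths sit in the same lattice and are compared as whole sequences rather than word by word.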

【Abstract】 Words are the smallest meaningful units that can be used independently, and lexical analysis is the basic step for syntactic tagging, semantic tagging, and other deep corpus processing. Most natural language processing systems, such as machine translation, speech synthesis, information extraction, and document retrieval, treat the word as the basic processing unit, so correct lexical analysis is of great significance. In machine translation and other natural language processing tasks, the identification of words has been, and still is, problematic in Chinese and other Asian languages such as Japanese. Since written Chinese does not use blank spaces to indicate word boundaries, segmenting Chinese text (Chinese word segmentation) becomes an essential task for Chinese language processing. In Chinese lexical analysis, besides Chinese word segmentation, we also need to identify the part-of-speech (POS) tags of the words and detect unknown words. First, we propose a pragmatic Chinese lexical analyzer that integrates word-level and character-level information, based on the conditional random fields (CRFs) model. The word lattice, which represents all candidate outputs, is built using the system lexicon. A linear-chain CRF selects the final token sequence from the word lattice using rich and flexible predefined features. This pragmatic method based on hybrid CRF models offers a solution to long-standing problems in corpus-based or statistical, word-based or character-based Chinese lexical analysis. For comparison, we also extend the character-based Chinese lexical analysis: several extended dictionaries are added to the system and corresponding features are introduced. We used this model to take part in the SIGHAN-6 bakeoff and obtained satisfactory results. To meet the demand for efficiency, we build an integrative Chinese lexical analyzer based on the maximum matching and second-maximum matching algorithm, which is encoded and decoded using the HMM model; the integrative model therefore has higher training and testing speed. Second, for unknown words in real-world text, we propose a hidden semi-CRF model, which combines the strengths of the Latent-Dynamic CRF (LDCRF) and the semi-CRF. The proposed hidden semi-CRF, which incorporates character-level and word-level features, is invoked when no matching word can be found in the lexicon, and it can detect unknown words and their corresponding POS tags simultaneously. Third, based on the results of the pragmatic Chinese lexical analyzer, we build an extended Super Function-based Chinese-Japanese machine translator. We extend the original Super Function in three ways: first, the Super Function is divided into Super Functions for sentences and Super Functions for phrases; second, the scope of the variables is extended; and third, a matching algorithm for Super Functions is proposed. With the extended Super Function, fewer Super Functions are stored in the database, and the precision of Chinese-Japanese machine translation is maintained.
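
As a rough illustration of the maximum matching and second-maximum matching idea mentioned in the abstract, the sketch below collects the longest and second-longest lexicon matches at a position and then segments greedily with the longest one. The function names, the toy lexicon, and the greedy choice are assumptions for illustration; the abstract does not say how the thesis combines the two candidates with its HMM, so that step is not shown.

```python
def longest_two_matches(text, start, lexicon, max_len):
    """Return up to two lexicon words beginning at `start`, longest first."""
    matches = []
    for end in range(min(len(text), start + max_len), start, -1):
        word = text[start:end]
        if word in lexicon:
            matches.append(word)
            if len(matches) == 2:
                break
    return matches


def forward_maximum_matching(text, lexicon):
    """Greedy baseline: always take the longest match, else a single character."""
    max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        candidates = longest_two_matches(text, i, lexicon, max_len)
        word = candidates[0] if candidates else text[i]
        words.append(word)
        i += len(word)
    return words


# Hypothetical toy lexicon, not taken from the thesis.
lexicon = {"研究", "研究生", "生命", "命", "的", "起源"}
print(longest_two_matches("研究生命的起源", 0, lexicon, 3))
print(forward_maximum_matching("研究生命的起源", lexicon))
```

On this input the greedy longest match commits to 研究生 and is then forced to take 命, while the second-longest candidate 研究 would allow 生命. Keeping both candidates gives a later statistical model, such as an HMM, a small set of alternatives to choose from.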

【Key words】 Chinese Information Processing; Chinese Lexical Analysis; Conditional Random Fields; Super Function; Machine Translation

【Online publication submitter】 Dalian University of Technology

【Classification number】 TP391.1
【Times cited】 2
【Downloads】 667