
单类中心学习及其在二元关系抽取中的应用

The Learning of Single Class Center and the Application in Binary Relation Extraction


【Author】 Li Zhisheng (李志圣)

【Author information】 Tianjin University, Computer Application Technology, 2008, PhD

【Abstract】 Extracting binary relations from the web is an important research direction in information extraction. To exploit the large amount of unlabeled text on the web, many studies have proposed learning methods based on a self-training mechanism: an initial system is trained on a small labeled data set, and while it runs, the system automatically labels reliable candidates and retrains on them to improve its performance. Practice has shown that these methods are effective for binary relation extraction, but the existing literature lacks a theoretical analysis of the learning process.

This thesis first transforms the pattern-learning problem in binary relation extraction into the problem of learning the center of a single text class. In the text vector space, once an initial center is given, the text vectors within a sufficiently small neighborhood of that center can be treated as automatically labeled data. The core question addressed here is: what properties must a data set have so that using automatically labeled data is certain to improve the learning of the single-class center?

To answer this question, the thesis studies the distribution properties of the text vector space. To overcome the shortcomings of the Gaussian mixture model in describing data with hard-clustering characteristics, it proposes the TGMK model, which is based on the regions partitioned by the k-means algorithm, and reveals the close relationship among the TGMK model, the k-means algorithm, and the Gaussian mixture model. Experimental results show that the TGMK model is well suited to describing multi-class text data.

Building on the k-means algorithm, the thesis proposes the single-mean algorithm. It proves that when a multi-class data set is suitably described by the 1-TGMR model, a generalization of the 1-TGMK model, the new algorithm, starting from an initial center of the target class, converges to the actual center. This completes the answer to the core question. Experiments show that the single-mean algorithm is effective on text data, which in turn supports the effectiveness of the self-training mechanism in binary relation extraction.

The thesis establishes a formal learning model for binary relation extraction based on the single-mean algorithm and, to address the particular characteristics of extracting binary relations from the web, proposes a new candidate scoring method and a new automatic labeling method. The learning model is applied to the extraction of Chinese question-answer pairs and Chinese-English term pairs. Unlike previous work, the thesis introduces the self-training mechanism into the learning of Chinese question-answer patterns and Chinese-English terminology patterns, which minimizes the system's dependence on manually labeled data, and it uses heuristic rules to improve the scoring of patterns and candidates. Experiments show that, compared with similar systems, the new systems achieve better performance with smaller labeled data sets.
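The abstract describes the self-training idea only in outline: start from an initial center, treat the vectors in a small neighborhood as automatically labeled, and re-estimate the center. The following is a minimal sketch of that loop, not the thesis's single-mean algorithm, whose exact definition is not given here. The function names `learn_single_center` and `make_demo_points`, the fixed `radius` neighborhood, the Euclidean distance, and the synthetic 2-D data are all assumptions made for illustration.

```python
import math
import random


def learn_single_center(points, initial_center, radius, max_iter=100, tol=1e-9):
    """Self-training style estimate of one class center (illustrative only)."""
    center = tuple(initial_center)
    for _ in range(max_iter):
        # "Automatically label" every vector inside the neighborhood of the current center.
        labeled = [p for p in points if math.dist(p, center) <= radius]
        if not labeled:
            break  # nothing reliable to learn from; keep the current estimate
        # Re-estimate the center as the mean of the automatically labeled vectors.
        new_center = tuple(sum(coords) / len(labeled) for coords in zip(*labeled))
        shift = math.dist(new_center, center)
        center = new_center
        if shift <= tol:
            break  # the estimate has stopped moving
    return center


def make_demo_points(seed=0, n_per_class=200):
    """Two well-separated synthetic 2-D classes: target near (0, 0), other near (6, 6)."""
    rng = random.Random(seed)
    target = [(rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)) for _ in range(n_per_class)]
    other = [(rng.gauss(6.0, 0.5), rng.gauss(6.0, 0.5)) for _ in range(n_per_class)]
    return target + other


points = make_demo_points()
estimate = learn_single_center(points, initial_center=(0.8, -0.5), radius=2.0)
print("estimated center:", estimate)
```

On this synthetic data the classes are far apart, so the neighborhood never reaches the second class and the estimate moves toward the target class's mean. Whether such a loop improves the estimate on real text vectors depends on the distribution of the data, which is the question the thesis analyzes.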

【Keywords】 self-training; single-class center learning; binary relation extraction; Gaussian mixture model; question-answer pattern learning; terminology pattern learning

【Online publication submitter】 Tianjin University

【Classification number】 TP391.1
【Download count】 110