
Research on Techniques of Automatic Semantic Role Labeling of Chinese FrameNet


【Author】 Li Jihong

【Author information】 Shanxi University, Computer Application Technology, 2010, PhD

【Abstract】 To provide an automatic labeling tool for building a large-scale Chinese frame-semantic resource, this thesis studies automatic semantic role (frame element) labeling on the Chinese FrameNet (CFN) knowledge base developed at Shanxi University. Given a target word in a sentence and the frame that the target word belongs to, the labeling task is converted by an IOB strategy into a word-level sequence labeling problem over the whole sentence. The labeling model is a conditional random field (CRF), and the experiments are designed with orthogonal arrays from statistics.

The experimental corpus consists of 6692 annotated example sentences from 25 frames selected from the current CFN. The corpus is divided evenly into four parts, on which three groups of 2-fold cross-validation are run; the average F1 over the three groups is the system performance measure. The thesis gives a variance estimator for this measure and a significance test for the performance difference between two labeling systems.

With the word as the basic labeling unit, labeling is divided into three steps: (1) boundary identification, (2) role classification, and (3) post-processing. Two strategies are compared: identifying boundaries and classifying roles jointly, and identifying boundaries first and then classifying roles. In post-processing, the output tag sequence must satisfy the IOB legality constraint over the whole sentence, and the legal sequence with the highest probability is taken as the final output.

In total, 26 features are extracted. Each feature is given several candidate windows, and combinations of features and windows form the feature templates of the CRF model. To select good templates, the thesis proposes a selection method based on orthogonal arrays and evaluates it under three experimental schemes.
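As an illustration of the IOB conversion and the legality constraint described above, here is a minimal Python sketch. It is not the thesis's implementation: the role names (`Agent`, `Theme`), the span representation, and the idea of choosing from a scored candidate list are assumptions made for the example, since the abstract does not say how the highest-probability legal sequence is searched for.

```python
from typing import List, Optional, Tuple

# A frame element span: (start index, end index exclusive, role name).
# Role names used below are hypothetical placeholders.
Span = Tuple[int, int, str]


def spans_to_iob(n_words: int, spans: List[Span]) -> List[str]:
    """Turn non-overlapping frame element spans into one IOB tag per word."""
    tags = ["O"] * n_words
    for start, end, role in spans:
        tags[start] = f"B-{role}"
        for i in range(start + 1, end):
            tags[i] = f"I-{role}"
    return tags


def is_legal_iob(tags: List[str]) -> bool:
    """An I-X tag is legal only directly after B-X or I-X."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            role = tag[2:]
            if prev not in (f"B-{role}", f"I-{role}"):
                return False
        prev = tag
    return True


def best_legal(candidates: List[Tuple[float, List[str]]]) -> Optional[List[str]]:
    """Pick the most probable legal sequence from (probability, tags) pairs.

    Assumption: the candidates come from some n-best decoder; returns None
    if no candidate is legal.
    """
    legal = [(p, t) for p, t in candidates if is_legal_iob(t)]
    if not legal:
        return None
    return max(legal, key=lambda c: c[0])[1]


if __name__ == "__main__":
    print(spans_to_iob(5, [(0, 2, "Agent"), (3, 5, "Theme")]))
    print(best_legal([(0.6, ["O", "I-Agent"]), (0.3, ["B-Agent", "I-Agent"])]))
```

In this sketch the more probable candidate is rejected because it opens a role with an `I-` tag, so the lower-scored legal sequence is returned instead.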
Scheme I uses 11 word-level features, including the word, its part of speech, its position relative to the target word, and the target word itself, arranged in the orthogonal array L32(4^9 × 2^4). Scheme II uses all 26 features, i.e. the 11 word-level features plus 15 base-chunk features such as syntactic and structural tags, arranged in the orthogonal array L54(2^1 × 3^25); the base-chunk features are extracted automatically with the parser developed by Zhou Qiang of Tsinghua University. Scheme III is a batched orthogonal array experiment: L32(4^9 × 2^4) is first used to select the best template over the 11 word-level features, and the 15 base-chunk features are then added using L54(2^1 × 3^25), with the levels of the array chosen so that performance is not lower than in the first batch. The experiments under each scheme are analyzed in detail.

The thesis also compares orthogonal array template selection with the traditional greedy algorithm, compares word-level sequence labeling with a method based on full syntactic parse trees, and compares the CRF with other labeling models, namely the support vector machine (SVM) and maximum entropy (ME) models.

The experimental results show:
(1) With the 11 word-level features (Scheme I), the best average F1 reaches 61.61%, significantly higher than the method based on full syntactic parse trees, which treats role labeling as classification of syntactic constituents. Compared with traditional greedy feature selection, orthogonal array template selection shows no significant difference in labeling performance, but it is computationally simpler and better suited to choosing a general template across frames.
(2) Adding the 15 base-chunk features (Scheme II) significantly improves the performance of the labeling model. These features have a significant effect mainly on role classification; their effect on boundary identification is not significant.
(3) The batched orthogonal array experiment (Scheme III) performs significantly better than Scheme II.
(4) With one model trained per frame, joint boundary identification and role classification shows no significant performance difference from the two-step strategy of identifying boundaries first and then classifying roles, but the joint strategy has a smaller variance in the performance measure.
(5) The CRF and SVM models show no significant difference in labeling results, but the CRF is significantly better than the ME model.
(6) Over all experiments on the 25 frames, the best average F1 for semantic role boundary identification is 71.68%; given the role boundaries, the best average accuracy for role classification is 84.08%; given the target word in the sentence and its frame, the best average F1 for the full task is 63.26%.

The main contributions are the first systematic study of automatic semantic role labeling models for Chinese FrameNet and a template selection method based on orthogonal arrays, which is computationally simpler than template selection based on the greedy algorithm. The orthogonal array method also applies to feature selection in general sequence labeling problems. In labeling performance, the results are better than those of semantic role labeling based on syntactic parse trees.
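To make the orthogonal array idea concrete, the following Python sketch checks the defining property of an orthogonal array and picks the best level of each factor from per-level mean scores. It is a sketch under stated assumptions, not the thesis's procedure: it uses the small textbook array L4(2^3) rather than L32 or L54, the scores are made-up placeholders rather than reported results, and treating each column as a feature whose levels are window options is an assumed reading of the abstract.

```python
from collections import Counter
from itertools import combinations
from typing import List

# The standard L4(2^3) array: 4 runs, 3 two-level factors.
# Assumed reading: column = feature, level = window option for that feature.
L4 = [
    [0, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]


def is_orthogonal(array: List[List[int]]) -> bool:
    """Every pair of columns must contain each level pair equally often."""
    n_cols = len(array[0])
    for a, b in combinations(range(n_cols), 2):
        counts = Counter((row[a], row[b]) for row in array)
        levels_a = {row[a] for row in array}
        levels_b = {row[b] for row in array}
        if len(counts) != len(levels_a) * len(levels_b):
            return False  # some level pair never occurs
        if len(set(counts.values())) != 1:
            return False  # level pairs occur unequally often
    return True


def best_levels(array: List[List[int]], scores: List[float]) -> List[int]:
    """For each factor, return the level with the highest mean score."""
    best = []
    for col in range(len(array[0])):
        totals, counts = {}, {}
        for row, score in zip(array, scores):
            totals[row[col]] = totals.get(row[col], 0.0) + score
            counts[row[col]] = counts.get(row[col], 0) + 1
        means = {level: totals[level] / counts[level] for level in totals}
        best.append(max(means, key=means.get))
    return best


if __name__ == "__main__":
    # Placeholder F1 scores, one per run; purely illustrative numbers.
    placeholder_scores = [0.50, 0.54, 0.58, 0.60]
    print(is_orthogonal(L4))
    print(best_levels(L4, placeholder_scores))
```

With these placeholder scores the selected combination is level 1 for all three factors, a run that does not appear in the array itself. This is where the computational saving comes from: only 4 of the 8 possible combinations are trained, whereas a greedy search retrains a model for each candidate it considers.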

【Keywords】 Chinese FrameNet; semantic role labeling; orthogonal array; feature selection; conditional random fields

【Online publication contributor】 Shanxi University

【Classification number】 TP391.1
【Times cited】 7
【Downloads】 339