Knowledge Acquisition from Text

【Author】 Wang Jinghua

【Author information】 Beijing University of Posts and Telecommunications, Signal and Information Processing, 2008, PhD

【Abstract, translated from the Chinese】 People use written language to describe the world and express ideas, and text is an important medium for passing on human knowledge. With the arrival of the knowledge economy, document knowledge management has attracted wide attention in academia and industry. Document knowledge management systems nevertheless face several important problems: how to identify a document's topic and its central words; how to highlight, in a personalized way, the content a user cares about; and how to return exactly the information a user wants. Keyword acquisition and information extraction are important text-processing techniques that can address these problems to some extent. This thesis studies single-document keyword acquisition based on a semantic lexicon, and the rule generation mechanism in information extraction. The main work and results are as follows.

1) Word sense disambiguation based on semantic networks and the UW-PageRank algorithm. A knowledge-based word sense disambiguation algorithm that combines semantic networks with UW-PageRank is proposed. It can disambiguate, in real time, any word that appears in the document and is also contained in the knowledge base, and it requires neither a corpus nor training. For Chinese text, HowNet serves as the semantic knowledge base: an undirected weighted network representing the text content is built with sememes as nodes and the relatedness between sememes as edge weights. UW-PageRank scores the sememes, and the sememe scores are then used to compute the weight of each sense; for each word, the sense with the highest weight is taken as its meaning. The algorithm was evaluated with a full-text annotation experiment and with the SENSEVAL-3 evaluation set. For English text, WordNet serves as the semantic knowledge base: an undirected weighted network representing the text content is built with synsets as nodes and the relatedness between synsets as edge weights. UW-PageRank scores the synsets, and disambiguation combines the synset weights with factors such as co-referring senses and how commonly a sense is used. The algorithm was evaluated on the SemCor dataset.

2) Keyword extraction based on semantic networks and the UW-PageRank algorithm. A single-document keyword extraction algorithm based on semantic networks and UW-PageRank is proposed. After word sense disambiguation every word in the text has a definite sense, so the semantic network is pruned by removing each word's other senses; the nodes that remain are the senses the words actually carry in the text. The UW-PageRank formula is then used to find the important senses, and the words they correspond to are the keywords of the text. On a manually annotated dataset of Chinese and English scientific papers, a comparison with the Tf method shows that the algorithm is effective.

3) A heuristic rule generation algorithm for Chinese information extraction: RGA-CIE. A heuristic rule generation algorithm for Chinese information extraction systems, RGA-CIE (Rule Generation Algorithm for Chinese Information Extraction), is proposed. It uses a supervised, bottom-up rule learning process that generalizes rules step by step, guided by heuristics based on the characteristics of Chinese, and it uses the Laplacian* operator to evaluate the generated rules. The Laplacian* operator balances the conflict between coverage and accuracy well, and semantic extension further improves the coverage of the rules. RGA-CIE was evaluated on a financial news information extraction system developed in-house: the generated rules reach a precision of 0.84 and a recall of 0.82, outperforming manually written rules. In addition, information extraction was applied to ontology instance acquisition, where it played an important role in building the domain ontology of the Beijing tourism information query system (Traveling in Beijing, TBJ).
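
The disambiguation and keyword steps described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the abstract does not give the UW-PageRank formula, so a TextRank-style weighted PageRank update on an undirected graph is used as a stand-in, and the function names, the damping value of 0.85, the toy sense identifiers, and `toy_relatedness` are all assumptions made for the example. A real system would take senses and relatedness from HowNet or WordNet.

```python
from collections import defaultdict


def weighted_pagerank(edges, damping=0.85, iterations=50):
    """Score the nodes of an undirected weighted graph.

    edges: list of (u, v, weight) with weight > 0.
    Stand-in for UW-PageRank (TextRank-style update); the exact
    formula used in the thesis is not given in the abstract.
    """
    neighbors = defaultdict(dict)
    for u, v, w in edges:
        neighbors[u][v] = w
        neighbors[v][u] = w  # undirected: the edge counts in both directions
    strength = {n: sum(adj.values()) for n, adj in neighbors.items()}
    scores = {n: 1.0 for n in neighbors}
    for _ in range(iterations):
        scores = {
            n: (1 - damping) + damping * sum(
                w / strength[m] * scores[m] for m, w in neighbors[n].items()
            )
            for n in neighbors
        }
    return scores


def build_edges(word_senses, relatedness):
    """Connect senses that belong to different words and are related."""
    edges = []
    for i, (w1, s1) in enumerate(word_senses):
        for w2, s2 in word_senses[i + 1:]:
            if w1 != w2:
                r = relatedness(s1, s2)
                if r > 0:
                    edges.append((s1, s2, r))
    return edges


def disambiguate(candidates, relatedness):
    """Pick, for every word, the candidate sense with the highest score."""
    pairs = [(w, s) for w, senses in candidates.items() for s in senses]
    scores = weighted_pagerank(build_edges(pairs, relatedness))
    return {
        w: max(senses, key=lambda s: scores.get(s, 0.0))
        for w, senses in candidates.items()
    }


def extract_keywords(chosen, relatedness, top_k=2):
    """Rank words by the score of their chosen sense in the pruned graph."""
    scores = weighted_pagerank(build_edges(list(chosen.items()), relatedness))
    ranked = sorted(chosen, key=lambda w: scores.get(chosen[w], 0.0), reverse=True)
    return ranked[:top_k]


def toy_relatedness(s1, s2):
    # Hypothetical relatedness: 1.0 when two senses share a domain tag.
    return 1.0 if s1.split(".")[1] == s2.split(".")[1] else 0.0


# Hypothetical sense inventory; sense ids are assumed unique across words.
candidates = {
    "bank": ["bank.finance", "bank.river"],
    "loan": ["loan.finance"],
    "interest": ["interest.finance", "interest.curiosity"],
}

chosen = disambiguate(candidates, toy_relatedness)
keywords = extract_keywords(chosen, toy_relatedness)
print(chosen)
print(keywords)
```

In the toy data the finance senses reinforce one another, while `bank.river` and `interest.curiosity` have no related neighbor, so they are never selected. Pruning to the chosen senses before the second ranking pass mirrors the order of steps in the abstract.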

【Abstract】 Text is one of the most important media through which people describe the world, express their thoughts, and spread knowledge. With the rise of the knowledge economy, text knowledge management has drawn growing attention from researchers and engineers. Text knowledge management systems still face several problems: how to identify the subject of a text, how to extract its topic words, how to highlight personally relevant information for different readers, and how to return exactly the information a user wants. Keyword extraction and information extraction, two important text-processing technologies, can help solve these problems. This thesis focuses on keyword extraction from a single document and on rule generation for information extraction. The main contributions are as follows.

1) Word sense disambiguation based on semantic networks and UW-PageRank. The thesis proposes a word sense disambiguation method based on semantic networks and UW-PageRank that disambiguates all the words in a text at once and needs neither a corpus nor training. For Chinese, HowNet is the knowledge base, and an undirected weighted graph is built with sememes as vertices and the relatedness between sememes as edge weights. UW-PageRank is applied to the graph to score the importance of each sememe, the score of each definition of a word is computed from the scores of the sememes it contains, and the highest-scoring definition is assigned to the word. The algorithm was tested in a full-text annotation experiment and on SENSEVAL-3. For English, WordNet is the knowledge base, and an undirected weighted graph is built with synsets as vertices and the relatedness between synsets as edge weights. UW-PageRank is applied to score the importance of each synset, and the highest-scoring synset is assigned to the word. The algorithm was tested on the SemCor corpus.

2) Keyword extraction based on semantic networks and UW-PageRank. The thesis proposes a keyword extraction method based on semantic networks and UW-PageRank. After word sense disambiguation each word has exactly one sense, so the semantic graph is pruned to keep only the selected ("right") senses. UW-PageRank is then applied to find the most important senses, and the words they correspond to are taken as keywords. On manually tagged Chinese and English papers, the algorithm performs better than the Tf algorithm.

3) Heuristic rule generation algorithm for Chinese information extraction: RGA-CIE. The thesis proposes RGA-CIE, a heuristic rule generation algorithm for Chinese information extraction that is domain independent for Chinese free text. RGA-CIE uses supervised, bottom-up learning: rules are generalized step by step, a heuristic chooses the generalization path, and the Laplacian* formula evaluates rule performance. Semantic extension is also applied to make the rules more flexible. The learned rules were tested on the Commercial News Information Extraction System and achieved a precision of 0.84 and a recall of 0.82, better than manually written rules. Information extraction was also applied to ontology instance learning, where it contributed substantially to the Traveling in Beijing system.
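
To illustrate how a Laplacian-style score can trade coverage against accuracy when rules are compared, here is a small sketch. The abstract does not define the Laplacian* formula, so the code uses the plain Laplace-smoothed error estimate (e + 1) / (n + 1) from the rule-learning literature as a stand-in, where n is the number of extractions a rule makes on the training data and e is how many of them are wrong. The function names, rule names, and counts are hypothetical.

```python
def laplacian_error(errors, extractions):
    """Laplace-smoothed expected error of a rule: (e + 1) / (n + 1).

    Assumption: stand-in for the thesis's Laplacian* formula,
    which is not specified in the abstract.
    """
    if errors < 0 or extractions < errors:
        raise ValueError("need 0 <= errors <= extractions")
    return (errors + 1) / (extractions + 1)


def best_rule(stats):
    """Return the rule name with the lowest smoothed error.

    stats: dict mapping rule name -> (errors, extractions).
    """
    return min(stats, key=lambda name: laplacian_error(*stats[name]))


# Hypothetical training statistics for two candidate rules.
rule_stats = {
    "specific": (0, 1),   # never wrong, but fires only once
    "general": (2, 40),   # two mistakes over forty extractions
}

for name, (e, n) in rule_stats.items():
    print(name, round(laplacian_error(e, n), 3))

best = best_rule(rule_stats)
print(best)
```

Raw precision alone would favor the rule that fired once without error. The smoothing penalizes rules backed by little evidence, so the more general rule with wide coverage and few mistakes scores better, which is the kind of balance between coverage and accuracy the abstract attributes to its evaluation formula.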

【Keywords】 Keyword Extraction; Information Extraction; Word Sense Disambiguation; WordNet; HowNet; PageRank

【Submitted for online publication by】 Beijing University of Posts and Telecommunications

【Classification number】 TP391.1
【Times cited】 3
【Downloads】 905