基于词共现的文本主题挖掘模型和算法研究

Research on Terms Co-occurrence Based Models and Algorithms for Text Mining

【Author】 Chang Peng (常鹏)

【Author information】 Tianjin University, Management Science and Engineering, 2010, PhD

【Abstract (translated from the Chinese)】 With the development of information technology and the accelerating informatization of society, digital information is growing explosively and has far exceeded the human capacity to understand and summarize it. Using computers to automatically discover valuable knowledge and information in large volumes of text is an effective way to address this problem. Building on data mining theory, this thesis focuses on models and algorithms for text topic mining. The main research contents are as follows. First, the representation model of text is studied. By analyzing the term co-occurrence phenomenon, the correlation between term co-occurrence and topics is demonstrated theoretically, and a document representation model based on co-occurring term combinations, the Co-occurrence Term Vector Space Model (CTVSM), is proposed. Association rule mining is used to extract the set of co-occurring term combinations from the text collection, from which the CTVSM-based text representation vector and a text similarity measure are defined. Second, text clustering is studied on the basis of CTVSM, and a CTVSM-based hierarchical document clustering method is proposed. Documents and document clusters are represented as vectors of co-occurring term combinations, and a similarity measure between document clusters is designed from the text similarity measure. To identify quickly the optimal partition level in the hierarchical clustering process, the center point of a document cluster is defined and a criterion based on clustering entropy is proposed for judging the optimal partition level. Experiments show that CTVSM-based document clustering achieves good results. Third, word clustering in the text space is studied. From the set of co-occurring term combinations extracted from the text collection, a term co-occurrence graph is defined: words are mapped to nodes, and the degree of co-occurrence between two words is mapped to the edge connecting them, so that word clustering becomes the problem of partitioning the graph into clusters of nodes. A graph-density-based word clustering method is proposed in which a word joins a word cluster only if doing so significantly increases that cluster's graph density; the process continues until every word has been assigned to a cluster. Experimental results show that the proposed method improves significantly on conventional methods in both algorithmic complexity (measured as experiment running time) and clustering quality. Finally, the application of topics mined from a text collection to information recommendation and information retrieval is studied. Taking topic extraction from text as an example, topic information from the text space is used to improve extraction quality. The topic of a document is predicted to determine the topic domain it belongs to, which fixes the range of domain terms for subject-word extraction; term weights in the document are then adjusted accordingly so that topic-domain vocabulary receives higher weights, ensuring the topical precision of the extracted subject words. Experiments show that the algorithm improves the quality of subject-word extraction, most notably for short texts in which term-frequency weights discriminate poorly.
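
The co-occurrence representation described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: association rule mining is reduced here to counting the document support of term pairs, and the binary vector weights, the cosine similarity, the function names `frequent_pairs`, `to_vector` and `cosine`, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def frequent_pairs(docs, min_support):
    """Return the term pairs that co-occur in at least `min_support` documents.

    Assumption: this stands in for association rule mining by keeping
    only frequent 2-item sets.
    """
    counts = Counter()
    for doc in docs:
        terms = sorted(set(doc))          # each pair counted once per document
        counts.update(combinations(terms, 2))
    return sorted(pair for pair, n in counts.items() if n >= min_support)


def to_vector(doc, pairs):
    """Binary vector: 1.0 where both terms of a frequent pair occur in `doc`."""
    terms = set(doc)
    return [1.0 if a in terms and b in terms else 0.0 for a, b in pairs]


def cosine(u, v):
    """Cosine similarity (an assumed choice of similarity measure)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus, already tokenized.
docs = [
    ["text", "mining", "topic", "cluster"],
    ["text", "mining", "topic"],
    ["graph", "density", "cluster"],
    ["graph", "density", "topic"],
]
pairs = frequent_pairs(docs, min_support=2)
vectors = [to_vector(d, pairs) for d in docs]
```

In this toy corpus the first two documents share every frequent pair and so get a similarity close to 1, while the first and third share none and get 0, even though both contain the single term "cluster".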

【Abstract】 Information has grown phenomenally over the past decades, and making sense of it at this scale has become a hopeless task for human beings. Obtaining information automatically from text has therefore become a key problem in information research. This thesis studies text mining models and algorithms based on statistical machine learning methods that exploit term co-occurrence. The main contents are as follows. First, a novel document model built from co-occurrence terms, the co-occurrence term vector space model (CTVSM), is presented. An association rule mining algorithm is used to extract the co-occurrence terms in the document space; the document model is then defined in terms of them, together with a measure of similarity between two documents. Experimental results show that, compared with Euclidean distance under the vector space model (VSM), less similar documents lie farther apart and more similar documents lie closer together. Second, a novel document clustering algorithm is proposed on the basis of CTVSM. In this algorithm, documents and clusters are both represented by CTVSM, and the similarity measure between clusters is derived from the measure between documents. To decide the optimal number of clusters, clustering gain is introduced as a measure of clustering optimality. Experimental evidence shows that the algorithm performs well and produces intuitively reasonable clustering configurations. Third, CTVSM is used to cluster large numbers of terms in the document space. A co-occurrence map of terms is defined in which words are mapped to nodes and co-occurrence relationships between words are mapped to edges, and a word clustering algorithm is proposed on this map. A word is joined to a cluster according to the change this causes in the cluster's density. The algorithm is shown to be better than conventional word clustering methods in both performance and efficiency. Finally, an application of the topic map extracted from the document space is proposed: a subject-word extraction algorithm is improved by using the topic map. The topics of a document are identified by estimating a statistical topic model, which in turn identifies the document's topic term fields. Term weights are then adjusted according to these fields. Experimental results indicate that the proposed method significantly outperforms methods that combine existing techniques.
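
The density-driven word clustering step can be sketched as follows. This is a greedy, order-dependent simplification under stated assumptions, not the algorithm from the thesis: density is taken to be the mean edge weight over all word pairs in a cluster, a single-word cluster has density 0, and a word joins the cluster where it yields the largest density gain of at least `min_gain`. The names `density`, `cluster_words` and `min_gain`, and the edge weights, are hypothetical.

```python
from itertools import combinations


def density(words, weight):
    """Mean co-occurrence weight over all word pairs; 0.0 for fewer than two words."""
    n = len(words)
    if n < 2:
        return 0.0
    total = sum(weight.get(frozenset(p), 0.0) for p in combinations(words, 2))
    return total / (n * (n - 1) / 2)


def cluster_words(words, weight, min_gain=0.0):
    """Assign each word to the cluster whose density it improves the most.

    Assumption: a word that gives no existing cluster a gain of at least
    `min_gain` (with a positive resulting density) starts a new cluster.
    """
    clusters = []
    for w in words:
        best, best_gain = None, None
        for c in clusters:
            new = density(c + [w], weight)
            gain = new - density(c, weight)
            if new > 0 and gain >= min_gain and (best_gain is None or gain > best_gain):
                best, best_gain = c, gain
        if best is None:
            clusters.append([w])
        else:
            best.append(w)
    return clusters


# Hypothetical co-occurrence graph: two dense triangles joined by one weak edge.
edges = {
    ("text", "mining"): 1.0, ("text", "topic"): 1.0, ("mining", "topic"): 1.0,
    ("graph", "density"): 1.0, ("graph", "node"): 1.0, ("density", "node"): 1.0,
    ("topic", "graph"): 0.2,
}
weight = {frozenset(pair): w for pair, w in edges.items()}
words = sorted({w for pair in edges for w in pair})
clusters = cluster_words(words, weight)
```

On this toy graph the two triangles come out as separate clusters, because adding a word across the weak edge would lower a cluster's density. A real implementation would also need a principled threshold for what counts as a significant change in density.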

【Key words】 Text Topic Mining; Terms Co-occurrence; Document Clustering; Terms Clustering; Keyword Extraction

【Online publication contributor】 Tianjin University

【Classification number】 TP391.1
【Times cited】 3
【Downloads】 1087