
文本挖掘若干关键技术研究

The Key Techniques Research on Text Mining


【Author】 Chen Xiaoyun (陈晓云);

【Author information】 Fudan University, Computer Software and Theory, 2005, PhD

【Abstract (Chinese version, translated)】 Faced with a vast sea of electronic information, how to help people collect and select the information that interests them, and how to help users discover potentially useful knowledge in an ever-growing body of information, have become prominent problems in information technology. Data mining is the research field that arose to address them. Since it emerged in the 1990s, data mining has been studied in considerable depth, covering association analysis, classification, clustering, trend analysis, and other areas. Because the great majority of real-world information resources exist as unstructured data, while data mining generally targets structured data such as the contents of relational databases, mining unstructured information has become the next research topic after data mining.
Among common kinds of unstructured data such as text, images, and video, text is the most widely used form; it appears in digital libraries, product catalogs, newsgroups, medical reports, and organizational and personal home pages. Text mining techniques are widely applied in natural language understanding, automatic text summarization, information extraction, information filtering, and information retrieval, and therefore have higher commercial value than data mining.
This dissertation takes text data as its object of study and investigates several key techniques of text mining, mainly text feature extraction and feature selection, text association analysis, and text association classification, and proposes more effective text mining algorithms. Its research work and contributions include the following:
(1) A document-frequency feature evaluation function with a minimum term frequency threshold is used to reduce the proportion of noise features and improve the quality of text classification. At present, text feature selection generally relies on feature evaluation functions, which differ in whether they use term frequency or document frequency. Observing that noise features generally have low term frequency, we propose a document-frequency method with a minimum term frequency threshold for feature selection. Experiments applying this method to three feature evaluation functions (mutual information, information gain, and the χ² statistic) show that the minimum term frequency threshold effectively reduces the proportion of noise features in the feature set, and that as the threshold increases, the feature sets produced by the different evaluation functions tend to converge.
(2) To address the difficulty of choosing a minimum support threshold in text association analysis, an algorithm for mining the N most frequent itemsets is proposed. Frequent itemset mining is an important step in text association analysis, but users have always found it hard to define a suitable minimum support threshold. This dissertation proposes an algorithm for mining the N most frequent itemsets based on a strategy of dynamically adjusting the minimum support threshold; the algorithm controls the size of the result by specifying the number N of frequent itemsets to produce. During mining, the minimum support threshold is raised continually according to the results found so far, which shrinks the search space and improves mining performance. Following this strategy, two algorithms are proposed for mining the top N frequent itemsets: an Apriori-like algorithm and IntvMatrix, an algorithm based on an inverted matrix.
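The threshold-raising strategy described in contribution (2) can be illustrated with a small Apriori-style sketch. This is not the dissertation's algorithm, whose details are not given on this page; the function name `top_n_frequent_itemsets`, the min-heap bookkeeping, and the toy documents are all assumptions made for illustration. The idea shown is only that once N itemsets have been found, the support of the weakest one becomes the new minimum support, so weaker candidates and their supersets can be pruned.

```python
import heapq
from itertools import combinations


def top_n_frequent_itemsets(transactions, n):
    """Return up to n (support, itemset) pairs with the highest support.

    Hypothetical sketch: level-wise (Apriori-like) search in which the
    minimum support threshold starts at 1 and is raised to the support
    of the n-th best itemset found so far.
    """
    transactions = [frozenset(t) for t in transactions]
    heap = []    # min-heap holding the best n itemsets seen so far
    min_sup = 1  # dynamically raised threshold

    def offer(itemset, support):
        nonlocal min_sup
        entry = (support, tuple(sorted(itemset)))
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif support > heap[0][0]:
            heapq.heapreplace(heap, entry)
        if len(heap) == n:
            # The weakest kept itemset defines the new threshold.
            min_sup = max(min_sup, heap[0][0])

    # Level 1: support of every single item.
    level = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            level[key] = level.get(key, 0) + 1

    k = 1
    while level:
        for itemset, support in level.items():
            if support >= min_sup:
                offer(itemset, support)
        # Only itemsets meeting the raised threshold can be extended.
        frequent = {s for s, sup in level.items() if sup >= min_sup}
        k += 1
        candidates = {a | b for a, b in combinations(frequent, 2)
                      if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t)
            if support >= min_sup:
                level[c] = support

    return sorted(heap, key=lambda e: (-e[0], e[1]))


docs = [
    {"text", "mining", "data"},
    {"text", "mining"},
    {"text", "data"},
    {"text", "mining", "rule"},
    {"data", "rule"},
]
for support, itemset in top_n_frequent_itemsets(docs, 4):
    print(support, itemset)
```

The caller supplies only N; no support threshold has to be guessed in advance. When several itemsets tie for the N-th place, this sketch keeps whichever it met first, which is a simplification.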

【Abstract】 With the rapid development and spread of the Internet, the volume of electronic information has grown enormously. How to collect and find the information that interests a user, and how to discover latent, useful knowledge quickly, accurately, and completely, has become a prominent problem in information science and technology. Data mining (DM) is a research field that emerged to address this problem. Since the concept was introduced in the 1990s, research on DM has become quite deep, covering association analysis, classification, cluster analysis, trend analysis, and more. DM mainly targets structured data such as relational databases, but most real-world information exists as unstructured data; some figures indicate that unstructured data accounts for 80% of existing information sources. Mining unstructured information has therefore followed DM as a new challenge.
Among common unstructured data such as text, images, and video, text is the most widely used form. It is common in digital libraries, product catalogs, newsgroups, medical reports, and organizational or personal home pages, and text mining is applied broadly in natural language understanding, text summarization, information extraction, information filtering, and information retrieval. Its commercial value is therefore higher than that of DM.
This dissertation studies key techniques of text mining, including text feature extraction and feature selection, text association analysis, and text association classification. Several methods and techniques are presented with the aim of improving speed, precision, and stability. The primary contributions are as follows.
(1) A feature evaluation function based on document frequency with a minimum term frequency threshold is presented to reduce the proportion of noise features and improve the quality of text categorization. At present, feature evaluation functions are the main means of selecting text features for text categorization. These functions differ in that some use term frequency and others use document frequency. The experimental results show that mutual information, information gain, and the χ² statistic are each more effective with a minimum term frequency threshold than with document frequency alone.
(2) Mining the top N most frequent itemsets in a text collection. Frequent itemset mining is an important step in text association analysis, but it is very difficult to determine a suitable minimum support threshold. The paper presents a strategy ...
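As a rough illustration of contribution (1), the sketch below computes the χ² statistic from document counts, where a document counts as containing a term only if the term occurs at least `min_tf` times in it. That reading of "document frequency with a minimum term frequency threshold" is an assumption, since this page does not define the function precisely; the function names and the four toy documents are also made up for the example. The χ² formula is the standard 2×2 contingency form, N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)).

```python
from collections import Counter


def thresholded_doc_freq(docs, min_tf=2):
    """Count, per term, the documents in which it occurs >= min_tf times.

    Hypothetical helper: `docs` is a list of token lists.
    """
    df = Counter()
    for tokens in docs:
        tf = Counter(tokens)
        df.update(term for term, count in tf.items() if count >= min_tf)
    return df


def chi_square(docs, labels, term, target, min_tf=1):
    """Chi-square score of `term` for class `target`.

    A document is treated as containing the term only when the term's
    frequency in that document reaches min_tf (assumed interpretation).
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = tokens.count(term) >= min_tf
        in_class = label == target
        if present and in_class:
            a += 1  # in class, term present
        elif present:
            b += 1  # other class, term present
        elif in_class:
            c += 1  # in class, term absent
        else:
            d += 1  # other class, term absent
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


docs = [
    ["goal", "goal", "match", "the"],
    ["goal", "team", "goal", "stock"],
    ["stock", "stock", "market", "goal"],
    ["market", "stock", "stock", "the"],
]
labels = ["sport", "sport", "finance", "finance"]

print(thresholded_doc_freq(docs, min_tf=2))
print(chi_square(docs, labels, "goal", "sport", min_tf=1))
print(chi_square(docs, labels, "goal", "sport", min_tf=2))
```

In this toy data the single stray occurrence of "goal" in a finance document lowers the term's score when `min_tf=1`; with `min_tf=2` that occurrence is ignored, and one-off terms such as "match" and "the" no longer count toward document frequency at all.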

【Keywords (translated from Chinese)】 text mining; feature selection; association analysis; text association classification; rule weighting; sample weighting;
【Key words】 Text Mining; Feature Selection; Text Association Analysis; Text Association Categorization; Rule Intensity; Boosting Technique;

【Online publication submitter】 Fudan University

【Classification number】 TP311.13
【Times cited】 76
【Downloads】 4423