Text Clustering with Noise and Application in Anti-spam

【Author】 Zhou Xin

【Author information】 Guangdong University of Technology, Computer Application Technology, 2012, Master's degree

【Abstract】 With the rapid development of Internet technology, text data is growing exponentially, and text mining has emerged to uncover the intrinsic relationships and implicit information in such data. Cluster analysis, an important function of data mining, plays a major role in text mining. This thesis discusses methods for clustering text that contains interference information (noise).

Traditional text mining methods first represent text in a vector space model, convert each document to a vector using TF-IDF weights, and then compute text similarity in that vector space. Because the traditional vector space model does not consider conceptual similarity between words, clustering accuracy suffers. To address this for Chinese text, a similarity measure based on the HowNet model and a semantic inner product is proposed.

However, this measure is not suited to clustering spam. After composing a message, spam senders often use something like find-and-replace to swap standard sensitive keywords for obfuscated variants that readers can still understand, produced by inserting symbols, reordering characters, or substituting pinyin, in order to evade mail filters. Traditional methods apply a series of preprocessing steps that filter out this interference information, which lowers the accuracy of similarity computation for spam and ultimately degrades clustering quality.

To address the poor similarity measurement caused by the heavy noise in spam, this thesis takes a non-metric approach and applies the Needleman-Wunsch algorithm to text similarity computation. Using this similarity measure, a clustering algorithm based on Needleman-Wunsch is devised to complete the text clustering. Compared with the vector space model, computing similarity with Needleman-Wunsch avoids word segmentation, reduces semantic loss, and retains all of the text information, which safeguards clustering quality. In addition, preprocessing splits the document content into Chinese characters, English strings, and symbol strings, which alleviates data sparseness and reduces the number of character comparisons, thereby speeding up processing.

Simulation experiments comparing the method with traditional clustering algorithms show that both clustering quality and efficiency are greatly improved. This indicates that the proposed algorithm is suitable for spam clustering and thus provides an effective spam filtering technique. The idea is to cluster spam and legitimate mail together using the proposed method, group the messages into categories according to their document similarity values, and thereby distinguish spam from legitimate mail.

【Key words】 text similarity; text clustering; Needleman-Wunsch algorithm; non-metric method; spam

【Online publication contributor】 Guangdong University of Technology

【Classification number】 TP391.1; TP393.098
【Download count】 62