垃圾邮件过滤中的敌手分类问题研究 (Research on Adversarial Classification Problems in Spam Filtering)

Adversarial Classification for Email Spam Filtering

【Author】 Deng Wei (邓蔚)

【Author information】 University of Electronic Science and Technology of China, Computer System Architecture, 2011, Ph.D.

【Abstract】 Machine learning is an important intelligent information processing technology and is widely used in spam filtering systems. In real adversarial environments, however, spam filters face never-ending malicious attacks by spammers, so machine learning algorithms that perform well in experimental settings may perform badly in practice. Adversarial classification was proposed to meet this challenge. It is now an active research topic in machine learning, with significant theoretical and practical value.

This dissertation studies adversarial classification problems in spam filtering from three angles: the game between attacker and defender in adversarial classification, resistance to Chinese good word attacks in spam filtering, and robust classification based on Kolmogorov complexity. Its five contributions are as follows.

1. A Stackelberg game model with reaction-time delay is proposed for adversarial classification. Previous Stackelberg game models of adversarial classification could not explain why the spammer continues to launch attacks after the Nash equilibrium is reached. This model introduces the reaction-time delay of the follower (the data miner) into the Stackelberg game and analyzes how that delay affects the payoffs of the leader and the follower. The Nash equilibrium is obtained with a genetic algorithm, and the model's correctness is verified by simulation experiments. The model shows that the spammer has a first-mover advantage and obtains extra payoffs during the data miner's reaction-time delay, which is why the spammer keeps launching new attacks.

2. A Stackelberg game model with uncertainties is proposed for adversarial classification. Existing Stackelberg game models for adversarial classification usually assume that the follower acts optimally and rationally, which is unrealistic in practical spam filtering. The proposed model introduces the follower's bounded rationality and limited observation of the spammer's strategy, and analyzes the influence of the uncertainty parameters on classifier performance. The model's effectiveness is verified on a real email dataset.

3. A multiple-instance logistic regression model for combating Chinese good word attacks is proposed. Little research exists so far on Chinese good word attacks. The model uses Chinese word segmentation and feature selection for preprocessing, then uses a multiple-instance learning mechanism and logistic regression for learning and classification. Experimental results on large Chinese spam corpora show that the model can effectively combat Chinese good word attacks, and that it is more robust than single-instance logistic regression and single-instance support vector machine models.

4. A Kolmogorov complexity based spam image classification model is proposed. Traditional spam image classification algorithms suffer from poor robustness and from image features that are sensitive to the particular dataset. The model uses data compression and a Kolmogorov complexity classification mechanism to classify spam images. Experimental results on a spam image dataset show that the model classifies spam images accurately. The security of the model's update mechanism is also analyzed. The model needs neither text extraction from images nor definition and selection of image features; it is a data-driven, parameter-free classification method.

5. A Kolmogorov complexity based malware detection framework is proposed. Spam is an effective way to spread malware, and traditional signature-based approaches have difficulty detecting new or obfuscated malware. The framework provides a general malware detection method that uses dynamic Markov compression to classify code samples. Experimental results show that the framework classifies malware accurately. It is simple to implement, requires no signature extraction, and can effectively identify new and obfuscated malware.
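The compression-based idea behind contributions 4 and 5 can be illustrated with a minimal sketch. The dissertation's actual pipeline is not given in this record. The sketch assumes a common formulation: compressed size approximates Kolmogorov complexity, and a sample is assigned to the class whose training corpus it extends most cheaply. `zlib` stands in for the compressor here, because dynamic Markov compression is not in the Python standard library. The names `compressed_size`, `extra_cost`, `classify`, and the toy corpora are all hypothetical.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a rough proxy for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def extra_cost(corpus: bytes, sample: bytes) -> int:
    """Extra compressed bytes needed to describe `sample` once `corpus` is known."""
    return compressed_size(corpus + sample) - compressed_size(corpus)


def classify(sample: str, corpora: Mapping[str, str]) -> str:
    """Return the label whose corpus makes the sample cheapest to compress."""
    encoded = sample.encode("utf-8")
    return min(
        corpora,
        key=lambda label: extra_cost(corpora[label].encode("utf-8"), encoded),
    )


# Toy training corpora (illustrative only, not data from the dissertation).
CORPORA = {
    "spam": "buy cheap pills now, limited offer, click here to win a free prize. " * 10,
    "ham": "please find the meeting agenda attached, see you at the project review on monday. " * 10,
}

if __name__ == "__main__":
    for text in ("click here to win a free prize", "the meeting agenda attached"):
        print(text, "->", classify(text, CORPORA))
```

No features are defined or selected anywhere in the sketch: the compressor's own model of each corpus does the work, which is the sense in which such methods are called data-driven and parameter-free.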

【Key words】 spam filtering; adversarial classification; Stackelberg games; Kolmogorov complexity; Chinese good word attacks
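A good word attack pads a spam message with words typical of legitimate mail so that a classifier's overall score drops below the spam threshold. The toy sketch below shows why a multiple-instance view can resist this. It is not the dissertation's model: the word weights, the bias, the fixed-size splitting into instances, and the rule "a message is spam if any instance is spam" are all assumptions made for illustration. Chinese word segmentation is skipped by starting from already tokenized words.

```python
import math
from typing import Dict, List

# Hypothetical weights of an already-trained logistic regression spam model.
WEIGHTS: Dict[str, float] = {
    "click": 1.5, "free": 2.0, "prize": 2.5,           # spam-indicative words
    "meeting": -2.0, "agenda": -2.0, "thanks": -1.5,   # "good" words
}
BIAS = -1.0


def spam_probability(words: List[str]) -> float:
    """Single-instance logistic regression over a bag of words."""
    z = BIAS + sum(WEIGHTS.get(word, 0.0) for word in words)
    return 1.0 / (1.0 + math.exp(-z))


def split_instances(words: List[str], size: int) -> List[List[str]]:
    """Split a message into consecutive fixed-size segments (instances)."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def multi_instance_probability(words: List[str], size: int = 3) -> float:
    """Score a message by its most spam-like instance."""
    instances = split_instances(words, size)
    if not instances:
        return spam_probability([])
    return max(spam_probability(instance) for instance in instances)


spam = ["click", "free", "prize"]
attacked = spam + ["meeting", "agenda", "thanks"]  # spam padded with good words

if __name__ == "__main__":
    print("single-instance, original:", round(spam_probability(spam), 3))
    print("single-instance, attacked:", round(spam_probability(attacked), 3))
    print("multi-instance, attacked: ", round(multi_instance_probability(attacked), 3))
```

Under these assumed weights, the padded message falls below 0.5 for the single-instance scorer, while the multi-instance scorer still sees the unpadded spam segment and stays above 0.5.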

【Online publication submitter】 University of Electronic Science and Technology of China

【Classification number】 TP393.098
【Download count】 288