Clustering and Classification Algorithms for Data Stream (数据流上的聚类与分类算法)

【Author】 Yang Chunyu (杨春宇)

【Author information】 Tsinghua University, Control Science and Engineering, 2009, PhD

【Abstract】 In modern society, more and more data is generated in streaming form. A data stream differs from traditional static data in two main ways: it grows without bound, and the concept underlying it drifts over time. These characteristics make many data mining algorithms designed for the static data model unsuitable for the streaming model, so algorithms designed specifically for data stream mining have become an important research direction. This thesis studies clustering and classification on evolving data streams. Its main contributions are as follows:

1. An algorithm for clustering heterogeneous (mixed-attribute) data streams. A Poisson process models the arrival of samples in the stream. A distance metric between heterogeneous samples is defined by considering continuous and categorical attributes simultaneously. Based on this definition, a two-step clustering algorithm with online and off-line steps is realized.

2. An algorithm that produces probabilistic outputs for Support Vector Machines (SVM) using a generative model. A univariate normal distribution approximates the class-conditional density of the unthresholded SVM outputs. Using this model, the classifier's outputs on the test set are adjusted to compensate for the loss of classification accuracy caused by the disparity between the class priors of the training set and those of the test set. The proposed algorithm achieved higher classification accuracy than the classic algorithm on some data sets.

3. A general classifier output adjustment algorithm for data streams whose class priors evolve. The algorithm uses exponential smoothing and an AR model to forecast the class priors along the data stream dynamically, and adjusts the classifier outputs accordingly. Experimental results show that the proposed algorithm handles changing class priors well. In addition, the algorithm is modified to exploit the periodicity of seasonal class prior evolution. The modified algorithm has been successfully applied to vehicle classification in a smart video traffic surveillance system.

4. An incremental updating algorithm for linear classifiers that addresses the general concept drift problem. Under the self-training framework, a second-order Taylor expansion approximates the log conditional likelihood of newly observed samples under a logistic regression model. With this approximation, the approximated log conditional likelihood can be updated incrementally, and the classifier parameters are then obtained by maximizing it. Compared with the self-training method that optimizes the log conditional likelihood directly by gradient descent, the proposed algorithm is more robust when handling complex concept drift.
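As a rough illustration of the idea in contribution 3, the sketch below reweights a classifier's posterior by the ratio of a forecast class prior to the training prior, and updates the forecast with simple exponential smoothing. It is an assumption-laden sketch, not the thesis's algorithm: the function names, the smoothing factor `alpha`, and the choice to feed the adjusted posteriors back in as the smoothing observations are all hypothetical, and the AR model and the seasonal variant are omitted.

```python
def adjust_posterior(posterior, train_prior, new_prior):
    """Reweight class posteriors by new_prior / train_prior and renormalize.

    Assumes every training prior is strictly positive.
    """
    weighted = [p * n / t for p, t, n in zip(posterior, train_prior, new_prior)]
    total = sum(weighted)
    return [w / total for w in weighted]


def smooth_prior(prior, observation, alpha):
    """Simple exponential smoothing: blend the old estimate with a new observation."""
    return [(1 - alpha) * p + alpha * o for p, o in zip(prior, observation)]


def run_stream(posteriors, train_prior, alpha=0.1):
    """Adjust each posterior in a stream using a running class prior forecast.

    Hypothetical design choice: the adjusted posterior of each sample serves as
    the observation for the smoothing step, since true labels are assumed to be
    unavailable on the stream.
    """
    prior = list(train_prior)  # start from the training prior
    adjusted_stream = []
    for posterior in posteriors:
        adjusted = adjust_posterior(posterior, train_prior, prior)
        adjusted_stream.append(adjusted)
        prior = smooth_prior(prior, adjusted, alpha)
    return adjusted_stream, prior


if __name__ == "__main__":
    # A stream on which the classifier keeps leaning toward class 1.
    stream = [[0.2, 0.8]] * 50
    adjusted, final_prior = run_stream(stream, train_prior=[0.5, 0.5])
    print(final_prior)
```

Because the forecast is driven by the classifier's own adjusted outputs, this sketch can reinforce its own errors. A practical version would need a safeguard such as a small `alpha` or occasional labeled samples.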

【Keywords】 Data Stream; Concept Drift; Classification; Clustering; Expectation Maximization

【Online publication submitter】 Tsinghua University

【Classification number】 TP311.13
【Times cited】 1
【Downloads】 897