Research on Outlier Detection Method and Its Key Techniques

【Author】 Chen Bin (陈斌)

【Author information】 Nanjing University of Aeronautics and Astronautics, Computer Application Technology, 2013, Ph.D.

【Abstract】 Outlier detection aims to detect and discover abnormal data patterns in observed data that do not conform to normal (expected) behavior. Depending on the application domain, these abnormal patterns are also called outliers, discordant points, novelties, or contaminants. In recent years, outlier detection has been widely applied to fault diagnosis, disease detection, intrusion detection, credit card (or insurance) fraud detection, and identity recognition. In these areas, an abnormal pattern often carries significant behavioral information that is usually highly harmful or even fatal. For instance, abnormal network traffic (behavior) on the Internet may indicate the leakage of sensitive information from an attacked host, and credit card fraud can lead to great economic loss. Research on outlier detection therefore has considerable theoretical significance and practical value, has attracted wide attention, and has become a very active research direction in pattern recognition.

Unlike other learning tasks, outlier detection typically has access only to data patterns that conform to expected behavior (the target class), while patterns that violate expected behavior (the outlier class) are rare or unknown. This extreme imbalance between the two classes (outlier samples are far fewer than target samples) makes outlier detection very difficult. Current research therefore focuses mainly on unsupervised learning frameworks and on supervised learning methods that use a very small number of labeled outlier samples. Based on an in-depth study of the principles of various outlier detection methods, their robustness, and the embedding of prior knowledge, the main contributions of this thesis are as follows:

1. One-cluster Clustering based Data Description (OCCDD) is proposed. It employs the Possibilistic C-Means (PCM) algorithm with a single cluster, that is, P1M (PCM, C=1), to compute sample weights, and then obtains an enclosing hypersphere by weighted averaging. OCCDD thereby overcomes the sensitivity of the hypersphere center to outliers that arises in Support Vector Data Description (SVDD), which uses minimax estimation to find a hypersphere enclosing most target samples, and it avoids the high training complexity of solving SVDD's quadratic program. It is also proved theoretically that P1M has a global optimality property that PCM with C>1 generally lacks. Furthermore, OCCDD is extended to a multiview outlier detection method that exploits the multiview nature of observed data arising naturally in applications such as text classification. Unlike training separately on each single view, multiview OCCDD learns from all views simultaneously, so that the views reinforce one another.

2. An SVDD regularized by the area under the ROC curve (AUC) is proposed for the situation in which outliers are distributed around the target samples. Exploiting the insensitivity of the AUC metric to class distribution and misclassification cost, the method embeds AUC as a regularization term in the SVDD objective, and thus simultaneously optimizes the volume of the minimum enclosing ball and the AUC performance. This addresses the extremely imbalanced setting with very few outlier samples, which general outlier detectors handle poorly. Two acceleration schemes are then proposed to reduce the high training complexity introduced by AUC regularization.

3. A design framework for manifold learning algorithms is proposed: mXXX≈ISOMAP+XXX, where XXX can be any learning algorithm based on Euclidean distance. The framework only requires approximating the Euclidean distance in the ISOMAP-reduced space by the geodesic distance in the input space, so no explicit ISOMAP dimensionality reduction is needed; the original XXX algorithm is in effect executed in the implicit ISOMAP-reduced space, which embeds the manifold structure. When the observed data lie on or near a low-dimensional nonlinear manifold, Euclidean distance cannot faithfully capture their geometric structure and SVDD performance degrades. Taking SVDD as an example, the framework is used to derive an SVDD with manifold embedding (mSVDD), which has the following advantages: (1) by approximating the Euclidean distances in the ISOMAP-reduced space, it solves the problem that a geodesic-distance-based SVDD cannot be optimized directly; (2) it avoids actually performing the Multidimensional Scaling (MDS) step of ISOMAP and selecting the dimension of the embedded manifold; (3) unlike the original SVDD based on Euclidean distance in the input space, mSVDD is based on geodesic distance and implicitly performs ISOMAP, so it achieves a manifold embedding.

4. The relationship between domain-based outlier detectors and density estimation is revealed. Building on a survey of current outlier detection methods, the thesis focuses on two domain-based one-class classifiers, the One-Class Support Vector Machine (OCSVM) and SVDD, and reveals the essential relation between them and kernel density estimation when they are kernelized with the Gaussian kernel. First, domain-based one-class classifiers are unified under the framework of density estimation. Second, it is proved that the density estimator induced by these classifiers is consistent with the true density, and that optimizing them also reduces the Integrated Squared Error (ISE).
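
As a rough illustration of the first contribution, the following Python sketch computes a robust center with a one-cluster possibilistic C-means iteration and then forms an enclosing ball around it. It is not the thesis's OCCDD implementation: the function names (`p1m_center`, `fit_ball`, `is_outlier`), the fuzzifier `m = 2`, the fixed scale parameter `eta` (set here to the mean squared distance to the initial mean), and the quantile rule for the radius are all assumptions made for this example.

```python
import numpy as np


def _typicality(d2, eta, m):
    # Standard PCM typicality: close to 1 near the center, decays with distance.
    return 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1.0)))


def p1m_center(X, m=2.0, n_iter=100, tol=1e-8):
    """One-cluster possibilistic C-means (C = 1): iterated weighted mean.

    Returns the center and the typicality weight of every sample.
    """
    X = np.asarray(X, dtype=float)
    center = X.mean(axis=0)  # start from the plain (non-robust) mean
    d2 = ((X - center) ** 2).sum(axis=1)
    # Assumption: eta is fixed to the initial mean squared distance.
    eta = max(d2.mean(), 1e-12)
    for _ in range(n_iter):
        d2 = ((X - center) ** 2).sum(axis=1)
        w = _typicality(d2, eta, m) ** m
        new_center = (w[:, None] * X).sum(axis=0) / w.sum()
        shift = np.linalg.norm(new_center - center)
        center = new_center
        if shift < tol:
            break
    d2 = ((X - center) ** 2).sum(axis=1)
    return center, _typicality(d2, eta, m)


def fit_ball(X, quantile=0.95):
    """Enclosing ball: P1M center plus a quantile-based radius (assumed rule)."""
    X = np.asarray(X, dtype=float)
    center, _ = p1m_center(X)
    radius = np.quantile(np.linalg.norm(X - center, axis=1), quantile)
    return center, radius


def is_outlier(x, center, radius):
    """Flag a point that falls outside the fitted ball."""
    return np.linalg.norm(np.asarray(x, dtype=float) - center) > radius
```

Because each sample's weight decays with its squared distance from the current center, distant points contribute little to the weighted mean, which is the intuition behind the robustness the abstract claims for OCCDD. No quadratic program is solved at any step.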

【Key words】 outlier detection; support vector data description; robustness; weighted averaging; possibilistic C-means; multiview learning; AUC metric; manifold embedding; AUC regularization

【Online publication contributor】 Nanjing University of Aeronautics and Astronautics

【Classification number】 TP274
【Download count】 793