多源环境中数据预处理与模式挖掘的研究

Data Preprocessing and Pattern Mining in Multiple Data Sources

【Author】 Lin Yaojin

【Author information】 Hefei University of Technology, Computer Application Technology, 2014, PhD

【Abstract (translated from the Chinese)】 With the rapid development of databases, networks, and other information technologies, the descriptive information attached to data in many practical application areas, such as sensor networks, commercial transactions, and social media analysis, keeps growing, producing data that are massive, multi-source, and heterogeneous in form. These multi-source heterogeneous data contain rich knowledge and useful information. However, because multiple data sources are heterogeneous, autonomous, complex, and inconsistent, traditional data mining techniques face great challenges. Research on knowledge discovery in multi-source environments, including label propagation, data source quality assessment, and pattern mining, therefore has significant research and application value. The main research content of this dissertation is as follows: 1) Because the structures of data sources are inconsistent, it is difficult to merge multiple data sources directly into a single source for learning. By making full use of the label information of labeled sources and the internal structure information of unlabeled sources, two label propagation methods, global consensus and local consensus, are proposed to assign class labels to the samples of unlabeled sources. On this basis, a multiple-data-source ensemble learning method is constructed, and the effectiveness of the proposed algorithms is verified in terms of classification accuracy, robustness, and scalability. In addition, experimental results show that when there are many unlabeled sources, local-consensus label propagation outperforms global-consensus label propagation. 2) When learning from multiple data sources, some sources may be irrelevant or redundant. Starting from the significance of a data source and the redundancy among data sources, a data source quality assessment and selection algorithm based on maximum significance and minimum redundancy is designed. Here, significance denotes the degree to which a data source contributes to classification, and redundancy denotes the degree of overlap between the information contained in different sources. Finally, the top p% of the data sources are selected for multiple-data-source ensemble learning. Experimental results show that the measure can effectively select task-relevant data sources. 3) As sales volumes grow, stores accumulate large amounts of time-related transactional sales data. The sales data are divided by time into multiple time-stamped databases. For the multiple correlated databases formed by these time-stamped databases, an efficient algorithm represented by the mining of stable patterns is proposed. The algorithm first defines stable items by means of two constraints, minsupp and varivalue, and then measures the similarity between stable items using gray relational analysis. On this basis, a hierarchical gray clustering method is proposed to mine stable patterns composed of stable items. The effectiveness of the proposed algorithm is verified in terms of pattern validity, time efficiency, and scalability.
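As an illustration of the source-selection idea in item 2) of the abstract, the following is a minimal Python sketch of a greedy max-significance-min-redundancy selection. It is not the dissertation's algorithm: the abstract does not say how significance or redundancy are computed, so both are assumed to be given as numbers, and the scoring rule (significance minus mean redundancy with the sources already chosen, in the style of mRMR feature selection) is an assumption. The names `select_sources`, `significance`, `redundancy`, and `p` are hypothetical.

```python
import math


def select_sources(significance, redundancy, p):
    """Greedy max-significance-min-redundancy selection (illustrative sketch).

    significance: list of floats, one precomputed score per data source (assumed input).
    redundancy:   square symmetric matrix; redundancy[i][j] is the assumed
                  information overlap between sources i and j.
    p:            percentage of sources to keep (0 < p <= 100).
    Returns the indices of the selected sources, in the order they were chosen.
    """
    n = len(significance)
    k = max(1, math.ceil(n * p / 100))  # number of sources to keep
    selected = []
    remaining = set(range(n))

    def score(i):
        # The first pick uses significance alone; later picks are penalised
        # by their mean redundancy with the sources already selected.
        if not selected:
            return significance[i]
        mean_red = sum(redundancy[i][j] for j in selected) / len(selected)
        return significance[i] - mean_red

    while len(selected) < k:
        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical scores: sources 0 and 1 are both significant but highly redundant.
significance = [0.9, 0.85, 0.5, 0.2]
redundancy = [
    [0.0, 0.8, 0.1, 0.1],
    [0.8, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
chosen = select_sources(significance, redundancy, p=50)
```

With these made-up numbers the second pick skips source 1, despite its high significance, because it overlaps heavily with source 0.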

【Abstract】 With the rapid development of database, network, and other information technologies, multiple data sources with large volumes and heterogeneity have become ubiquitous in many practical applications, such as sensor networks, supermarket transactions, and social media analysis. These databases contain plenty of useful information and valuable knowledge, but they are also heterogeneous, autonomous, complex, and inconsistent, which is challenging for traditional mining algorithms. Thus, knowledge discovery from multiple data sources, including label propagation, source quality evaluation, and pattern mining, is a significant problem with practical value in real-world applications. The main contributions of this dissertation are as follows. 1) It is difficult to merge multiple data sources into a centralized database for learning because of the inconsistency between different data sources. We present two label propagation methods, referred to as global consensus and local consensus, which infer the labels of training objects from unlabeled sources by making full use of class label information from labeled sources and internal structure information from unlabeled sources. We test the classification accuracy, robustness, and scalability of the proposed methods by constructing a multiple-data-source ensemble learning model. Experimental results show that local consensus outperforms global consensus when there are many unlabeled sources. 2) Some sources might be irrelevant or redundant when learning from multiple data sources. Thus, it is meaningful to select a set of good information sources that could help improve the learning performance. We present an algorithm for source assessment and selection based on max-significance-min-redundancy, in which significance represents the degree to which an information source contributes to classification, and redundancy represents the information overlap among different information sources. Finally, we select the top p percent of sources to construct multiple-data-source ensemble learning. Experimental results show that the metric can effectively select sources related to the target mining task. 3) Every time a customer interacts with a business, there is an opportunity to gain strategic knowledge. Transactional data collected over time contain a wealth of information about customers and their purchasing patterns. We divide transactional data into multiple time-stamped databases according to their sale periods. We present an efficient algorithm for mining patterns, represented here by stable patterns. First, we define the notion of stable items according to two constraint conditions: minsupp and varivalue. We then measure the similarity between stable items based on gray relational analysis, and propose a hierarchical gray clustering method for mining stable patterns consisting of stable items. Finally, experimental results show that the proposed algorithm is effective, efficient, and scalable.
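The first two steps of item 3), filtering stable items and measuring their similarity, can be sketched in Python as below. This is a hedged illustration, not the dissertation's implementation. The abstract names the constraints minsupp and varivalue but does not define them, so the sketch assumes that an item is stable when its mean support across the time-stamped databases is at least `minsupp` and the population standard deviation of its supports is at most `varivalue`. The similarity uses the standard gray relational coefficient with distinguishing coefficient `rho = 0.5`, simplified to a single pair of series. The function names and sample data are hypothetical.

```python
from statistics import mean, pstdev


def stable_items(supports, minsupp, varivalue):
    """Return items whose support is high on average and varies little.

    supports: dict mapping item -> list of supports, one per time-stamped database.
    Assumption: "stable" means mean support >= minsupp and population standard
    deviation <= varivalue; the dissertation may define variation differently.
    """
    return [item for item, s in supports.items()
            if mean(s) >= minsupp and pstdev(s) <= varivalue]


def gray_relational_grade(x, y, rho=0.5):
    """Pairwise gray relational grade between two equal-length support series.

    Each coefficient is (d_min + rho * d_max) / (d_k + rho * d_max), where d_k
    is the absolute difference at position k; the grade is their mean.
    Simplification: d_min and d_max are taken over this pair only.
    """
    deltas = [abs(a - b) for a, b in zip(x, y)]
    d_min, d_max = min(deltas), max(deltas)
    if d_max == 0:  # identical series are maximally related
        return 1.0
    return mean((d_min + rho * d_max) / (d + rho * d_max) for d in deltas)


# Hypothetical supports of four items in three time-stamped databases.
supports = {
    "milk": [0.30, 0.32, 0.31],
    "bread": [0.29, 0.33, 0.30],
    "umbrella": [0.05, 0.40, 0.10],
    "caviar": [0.01, 0.01, 0.01],
}
stable = stable_items(supports, minsupp=0.2, varivalue=0.05)
grade = gray_relational_grade(supports["milk"], supports["bread"])
```

The resulting pairwise grades could then serve as the similarity input to a hierarchical clustering step, which is omitted here because the abstract gives no details of the hierarchical gray clustering method.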

【Key words】 Multiple Data Sources; Quality Assessment; Label Propagation; Pattern Mining

【Online publication submitter】 Hefei University of Technology

【Classification number】 TP311.13
【Times cited】 1
【Downloads】 455