
对等计算系统中的相似查询处理研究

Similarity Search in Peer-to-Peer Systems


【Author】 Xu Linhao (徐林昊)

【Author information】 Fudan University, Computer Software and Theory, 2005, Ph.D.


【Abstract】 Peer-to-peer computing (P2P) has become an extremely popular topic in computer science. In a P2P system, each peer is fully autonomous, has equal responsibility, and plays the role of both a service consumer and a service provider. Moreover, peers can enter or leave the P2P network at any time. A P2P system is thus a fully dynamic distributed system without any central administration. The P2P computing paradigm has many potential advantages, such as scalability, robustness, and resource availability, and is especially suitable for distributed applications characterized by geographical distribution, heterogeneous resources, high scalability requirements, and local autonomy. Hence, the P2P computing paradigm is driving the evolution from the host-centric Internet to the data-centric future Internet, and is regarded as a promising technology for rebuilding Internet-based applications.

Despite current achievements in P2P query processing, many problems remain to be studied. To this end, this thesis is devoted to similarity search in P2P systems. It studies similarity search techniques based on routing indices, space partitioning, collaborative caching, and probabilistic models, in order to support semantic or similarity-based query processing on top of existing P2P systems. The major contributions of this thesis are:

1. Similarity search in multi-dimensional data space is introduced to unstructured P2P systems. A simple yet effective routing index structure, called EVA-Index, is designed by combining vector approximation and routing index techniques. With the aid of EVA-Index, each peer can process a similarity query against its local dataset and route the query to promising peers that hold the desired data objects. Furthermore, each peer can reconfigure its neighboring peers to keep relevant peers close by, which improves system performance. Simulation shows that the proposed approach is both efficient and effective for similarity query processing in unstructured P2P environments.

2. An efficient space partitioning strategy is introduced to structured P2P systems. The whole data space is first partitioned based on a set of pre-generated reference points. Each reference point has an associated subspace, and no two subspaces overlap. By building a one-to-one mapping between nodes and reference points (and hence subspaces), similarity search in high-dimensional space can be implemented with traditional high-dimensional indexing techniques and a distributed hash table (DHT)-based resource location mechanism. Moreover, the routing cost of similarity search can be greatly reduced by capturing the physical proximity of subspaces, and load balance among nodes can be obtained by tuning the granularity of subspaces. Simulation shows that the proposed method remains efficient as dimensionality and network size increase.

3. A negotiation-based collaborative caching technique is studied in support of SQL query processing, and a cost model is given for evaluating the network transmission cost of query plans. The cost of a query plan is estimated from the costs of its sub-query plans. The cost model is combined with a collaborative overlap network: through negotiation between requesters and coordinators, the system determines the logical expression to cache and the nodes that physically maintain the cached data. Based on this collaborative caching technique, a P2P system can implement semantic-based query processing. Simulation and real experiments show that the proposed method usually obtains near-optimal cache plans and reduces the cost of query processing, especially when each node can contribute only limited storage space.

4. A similarity search technique based on probabilistic information is investigated for P2P file sharing applications with hierarchical topics. This approach uses a probabilistic model to describe the overlap between topics and the coverage of topics by nodes, and from these builds routing indices at the nodes. Based on the probabilistic information already held at a node, the similarity search algorithm can deduce how well each neighboring node covers the queried topic, so that a promising routing path can be determined. Further, using feedback from previous queries, the nodes that routed a topic query can update their probabilistic information to guide future queries more accurately. Simulation shows that the proposed approach can gradually improve search efficiency and effectiveness in a feedback-based way.

In summary, this thesis gives a detailed description of the algorithms, implementation, and experimental evaluation of the above four similarity query processing approaches. These approaches complement and improve on current query processing methods in P2P environments. The work is based on a thorough survey and theoretical analysis of existing techniques, along with extensive experimental evaluation. The experiments and analysis show that, compared with existing query processing methods for P2P systems, the approaches described above have advantages in both query processing efficiency and resource utilization.
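
The reference-point partitioning in contribution 2 can be illustrated with a small sketch. The abstract does not say how a subspace is derived from its reference point, so the code below assumes nearest-reference-point assignment under Euclidean distance, which is one simple way to get non-overlapping subspaces. The function names (`owner`, `build_stores`, `candidate_subspaces`, `range_search`) and the sample data are hypothetical, and a plain dictionary stands in for the DHT nodes. It is not the thesis's algorithm, only a sketch of the general idea.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def owner(point: Point, refs: Sequence[Point]) -> int:
    """Return the index of the reference point nearest to `point`.

    Assumption: a subspace is the set of points closest to its reference
    point. Ties go to the lowest index, so each point has exactly one
    owner and subspaces never overlap.
    """
    return min(range(len(refs)), key=lambda i: math.dist(point, refs[i]))


def build_stores(data: Sequence[Point], refs: Sequence[Point]) -> Dict[int, List[Point]]:
    """Place each data point on the (simulated) node of its subspace."""
    stores: Dict[int, List[Point]] = {i: [] for i in range(len(refs))}
    for p in data:
        stores[owner(p, refs)].append(p)
    return stores


def candidate_subspaces(query: Point, radius: float, refs: Sequence[Point]) -> List[int]:
    """Subspaces that may contain a point within `radius` of `query`.

    By the triangle inequality, subspace i can hold such a point only if
    dist(query, refs[i]) <= d_min + 2 * radius, where d_min is the
    distance from the query to its nearest reference point.
    """
    d = [math.dist(query, r) for r in refs]
    d_min = min(d)
    return [i for i, di in enumerate(d) if di <= d_min + 2 * radius]


def range_search(query: Point, radius: float, refs: Sequence[Point],
                 stores: Dict[int, List[Point]]) -> List[Point]:
    """Visit only the candidate subspaces and filter their local data."""
    hits: List[Point] = []
    for i in candidate_subspaces(query, radius, refs):
        hits.extend(p for p in stores[i] if math.dist(p, query) <= radius)
    return hits


# Hypothetical sample data: three reference points, four data points.
refs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
data = [(1.0, 1.0), (9.0, 1.0), (4.9, 0.0), (1.0, 9.0)]
stores = build_stores(data, refs)
print(candidate_subspaces((5.0, 0.0), 1.0, refs))
print(range_search((5.0, 0.0), 1.0, refs, stores))
```

In a real structured P2P system the integer index would be replaced by a DHT key derived from the reference point, and each `stores[i]` would be a local high-dimensional index on the responsible node. The pruning test only reduces how many nodes a query contacts; it never excludes a subspace that could contain a result.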

【Key words】 peer-to-peer computing; similarity search; index; cache; probability

【Online publication contributor】 Fudan University

【Classification number】 TP393.01; TP301
【Times cited】 1
【Downloads】 251