åŸºäºŽRSSçš„èšç„¦ç½‘ç»œçˆ¬è™«åœ¨é«˜æ ¡ç½‘ç«™ç¾¤ä¸çš„ç ”ç©¶

Research of Focused Crawler about Group of University Website Based on RSS

【Author】 Zhang Ruihan (张睿涵)

【Author information】 Nanchang University, Computer Application Technology, 2012, Master's thesis

【Abstract】 The Internet is growing rapidly and the number of web pages keeps increasing, so people often have to read through a large number of pages to find the information they need. This wastes time and energy, and still does not guarantee that they obtain the latest and most complete information. Publishers of online information, for their part, want more users to read their content in real time. Much research has arisen to meet this demand, such as search engines supported by web crawlers and RSS information push technology. Each has its limitations, however. Suppose we need the latest notices from all the websites of a university by category, for example all of its latest research-related notices. A typical search engine does not return satisfactory results. RSS can push the latest information by category, but only from websites that provide an RSS feed, so it is of no help for targets such as university website groups that were built without RSS support. This thesis therefore studies an RSS-based focused web crawler to solve the above problem and applies it to a university website group, where it achieved good results. The principle is to use a focused web crawler to crawl, analyze, and process the data of the target website group and then provide an RSS feed. In this way, users can subscribe by category through an RSS reader to the latest information even from websites that do not provide an RSS feed, which saves the trouble of browsing many pages and avoids missing information through oversight. The main contents of the study are as follows:

(1) An RSS-based focused web crawler is proposed, so that users can subscribe to and read, in an RSS reader, the latest information from websites that do not provide an RSS feed. It filters out advertisements and other junk information and removes the trouble of searching for information.

(2) The text of crawled pages is classified with the TF-IDF algorithm. The step that uses TF-IDF to extract the feature vector of each category is improved according to the characteristics of web pages, so that the extracted feature vectors represent their categories better and the classification is more accurate.

(3) The incremental crawling of the web crawler is improved. Building on the traditional incremental crawling algorithm, a new algorithm for computing the predicted update time is proposed, which brings the prediction closer to the actual update time, reduces system overhead, and improves efficiency.

(4) The RSS-based focused web crawler is applied to a university website group, and the PageRank algorithm is improved according to the characteristics of the university website group to raise the recall of the crawler.
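
The principle stated in the abstract (classify crawled page text, then republish it as an RSS feed) could be sketched roughly as below. This is only an illustration under stated assumptions, not the thesis's implementation: it uses plain TF-IDF with a smoothed IDF and cosine similarity in place of the author's improved feature extraction, it omits crawling, incremental update prediction, and PageRank, and every function name, sample token, and `example.edu` URL is hypothetical.

```python
import math
from collections import Counter
import xml.etree.ElementTree as ET


def weigh(counts, idf):
    """TF-IDF weights for the terms that appear in the IDF table."""
    total = sum(counts.values()) or 1
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


def train(labeled_docs):
    """labeled_docs: {category: [token list, ...]} -> (idf, category vectors)."""
    # Treat all training pages of one category as a single merged document.
    merged = {cat: Counter(t for doc in docs for t in doc)
              for cat, docs in labeled_docs.items()}
    n = len(merged)
    df = Counter(term for counts in merged.values() for term in counts)
    # Smoothed IDF (an assumption) so terms shared by all categories keep weight.
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}
    vectors = {cat: weigh(counts, idf) for cat, counts in merged.items()}
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(tokens, idf, vectors):
    """Return the category whose feature vector is closest to the page."""
    page = weigh(Counter(tokens), idf)
    return max(vectors, key=lambda cat: cosine(page, vectors[cat]))


def build_feed(title, link, items):
    """items: [(item title, item link, category), ...] -> RSS 2.0 XML string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Notices collected by a crawler"
    for item_title, item_link, category in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_link
        ET.SubElement(item, "category").text = category
    return ET.tostring(rss, encoding="unicode")


# Hypothetical training data: already-tokenized notices per category.
training = {
    "research": [["grant", "project", "funding", "application"],
                 ["research", "project", "deadline"]],
    "teaching": [["course", "exam", "schedule"],
                 ["course", "registration", "deadline"]],
}
idf, vectors = train(training)
label = classify(["project", "funding", "notice"], idf, vectors)
feed = build_feed("Example University notices", "http://example.edu/",
                  [("Funding call", "http://example.edu/n/1", label)])
```

Each feed item carries its predicted category in a `<category>` element, so an RSS reader could filter or group notices by class. Real Chinese page text would first need word segmentation, which this sketch assumes has already been done.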

【Key words】 focused web crawler; RSS; PageRank algorithm; TF-IDF algorithm; incremental crawling

【Online publication contributor】 Nanchang University

【Classification number】 TP393.092
【Citation count】 2
【Download count】 177