基于粗糙集的“规则+例外”网页分类研究

Study on Web-Pages Classification Based on Rough Set and "Rule+Exception"

【Author】 Liu Yunxia (刘云霞)

【Author information】 Taiyuan University of Technology, Computer Software and Theory, 2007, Master's degree

【Abstract (translated from the Chinese)】 With the rapid development of information technology, the volume of information on the network keeps growing. How to make that information serve people better has become a research focus for the coming years. On one side, people want to obtain information quickly, accurately, and comprehensively; on the other, network information is vast and disorganized. Bridging the two is a major challenge. Automatic web page classification offers a reasonable and effective way of organizing information to address this problem. To organize and analyze web page information effectively and help users quickly obtain what they need, this thesis extracts rules that correspond to the differing needs of different users and, following the learning theory that rules and exceptions in knowledge complement each other, analyzes the exceptions that arise, so that Chinese web page text can be classified precisely. From both theoretical and applied perspectives, the thesis studies classification techniques for Chinese web page text in depth, proposes applying rough sets together with the rule-and-exception learning theory from natural language processing to Chinese web page classification, and implements a rough-set-based "rule + exception" Chinese web page classification system. It systematically studies and describes in detail the key techniques of Chinese web page classification, the main content of rough set theory, rule induction, and exception analysis, and under the guidance of this theory designs a Chinese web page text classifier that addresses user needs. The main research work is as follows. First, web page text classification requires collecting web text, preprocessing it, and saving the text content it contains. For this part, the thesis implements a preemptive multithreaded Chinese web page collector that uses a depth-first algorithm to fetch pages of a specific type and then, based on the characteristics of HTML tag text, implements a web text preprocessor based on non-recursive matching, which extracts the text content of a page together with a defined set of page tags. Second, building on a study of text representation and the characteristics of web page information, the thesis improves the weight calculation method used to represent Chinese web page text and designs a user-requirement-oriented attribute reduction algorithm, which performs well in the text classification system. In addition, drawing on rough set theory, the thesis analyzes how rules and exceptions are formed and proposes a reduct-based method for identifying exceptions. Finally, the thesis designs the overall scheme of a Chinese web page text classification system and, following that scheme, implements the rough-set-based "rule + exception" Chinese web page text classification system. For evaluation, two groups of experiments were run and their results compared. The experimental data show that the classifier designed in this thesis improves the efficiency of web page text classification and has some practical value.

【Abstract】 With the rapid development of information technology, the amount of information on the network is growing explosively. Making that information easier and more efficient to use has become an active research topic. Information on the Internet is poorly organized and spread across a huge number of pages, while people want to retrieve information quickly and accurately. Automatic web page classification is a promising approach to this problem. To organize and analyze the massive body of web information effectively and to help users promptly get the knowledge and information they need, this thesis extracts different rules according to users' different requirements and analyzes the exceptions that arise, aiming at accurate classification on the basis of the learning theory that rules and exceptions are complementary. The thesis studies Chinese web text mining techniques in depth, in both theory and application, proposes applying rough sets and the "rule + exception" learning theory from natural language processing to Chinese web text mining, and implements a classifier for Chinese web page text. It systematically introduces the key techniques of Chinese web page classification and the main theory of rough sets, rule induction, and exception analysis, and designs a Chinese web page classifier under the guidance of that theory. The contributions of the thesis are as follows. Unlike general text classification, web page classification first requires collecting Chinese web pages, preprocessing them, and saving their text content. First, a preemptive multithreaded web text collector is implemented, which uses a depth-first algorithm to collect web pages of a specific category. Second, a web text preprocessor is implemented, which strips meaningless HTML tags and extracts the web text by non-recursive matching. Third, the weight calculation algorithm is improved by taking into account the characteristics of text information and web page information. More importantly, an attribute reduction algorithm oriented to users' requirements is proposed and shown to be effective in the text classification system, and a reduct-based exception analysis method is proposed, grounded in rough set theory, by analyzing why rules and exceptions appear in web page text classification. Finally, the design of the Chinese web page text classification system is presented, and a Chinese web page text classifier based on rough set theory and "rule plus exception" is implemented accordingly. To evaluate the classifier, two experiments were run and their results compared. The results show that the system improves both the efficiency and the accuracy of web page text classification, and the work may serve as a reference for research in text classification.
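
The abstract does not give the thesis's algorithms, so the following is only a minimal sketch of the general "rule + exception" idea in rough-set terms, not the author's method. It assumes a toy decision table in which each page is described by keyword-presence attributes and a category label. Pages that agree on the chosen attributes form an indiscernibility block; the majority label of a block is taken as its rule, and pages in the block with a different label are treated as exceptions. The function names, attribute names, and data are all hypothetical.

```python
from collections import Counter, defaultdict


def indiscernibility_blocks(rows, attrs):
    """Group rows that share the same values on every attribute in `attrs`."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[tuple(row[a] for a in attrs)].append(row)
    return blocks


def induce_rules_with_exceptions(rows, attrs, decision):
    """Majority label per block becomes a rule; disagreeing rows are exceptions."""
    rules, exceptions = {}, []
    for key, block in indiscernibility_blocks(rows, attrs).items():
        label, _ = Counter(r[decision] for r in block).most_common(1)[0]
        rules[key] = label
        exceptions.extend(r for r in block if r[decision] != label)
    return rules, exceptions


def lower_approximation(rows, attrs, decision, target):
    """Rows whose entire block carries the label `target` (certain members)."""
    return [
        r
        for block in indiscernibility_blocks(rows, attrs).values()
        if all(x[decision] == target for x in block)
        for r in block
    ]


# Hypothetical toy decision table: keyword presence -> page category.
PAGES = [
    {"id": 1, "goal": 1, "stock": 0, "label": "sports"},
    {"id": 2, "goal": 1, "stock": 0, "label": "sports"},
    {"id": 3, "goal": 1, "stock": 0, "label": "finance"},  # conflicts with 1 and 2
    {"id": 4, "goal": 0, "stock": 1, "label": "finance"},
    {"id": 5, "goal": 0, "stock": 1, "label": "finance"},
]

rules, exceptions = induce_rules_with_exceptions(PAGES, ["goal", "stock"], "label")
print(rules)
print([r["id"] for r in exceptions])
```

In this toy table, page 3 shares its attribute values with pages 1 and 2 but carries a different label, so it is stored as an exception to the majority rule for that block. Because that block is inconsistent, none of its pages belong to the lower approximation of either class.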

【Keywords】 text classification; feature extraction; rough sets; rule induction; exception analysis

【Online publication contributor】 Taiyuan University of Technology

【Classification number】 TP391.1
【Times cited】 1
【Downloads】 115
