
åŸºäºŽæ–‡æœ¬å±‚æ¬¡æ¨¡åž‹çš„Webæ¦‚å¿µæŒ–æŽ˜ç ”ç©¶

Web Concept Mining Based on Text Layer Model


【Author】 Zhang Chengzhi (章成志)

【Author information】 Nanjing Agricultural University, Agricultural Economics and Management, 2002, Master's degree

【Subtitle】 Research on Automatic Indexing and Automatic Classification Based on a Concept Semantic Network

【Abstract (translated from the Chinese)】 To address the shortcomings of current Web text mining tools, this thesis combines automatic indexing and automatic classification of document information with data mining, pattern recognition, database technology, and mathematical statistics to build a simple, practical information extraction model, the Text Layer Model. For the three structural types of data found on the Internet, it studies automatic indexing and automatic classification based on a knowledge base, that is, on a concept semantic network. The work has the following significance: it makes the construction of the classification knowledge base systematic and process-driven; it provides a scheme for selecting indexing sources in Web pages and plain text, together with a weighting scheme for topic extraction; it improves synonym recognition; and it strengthens the mining of unregistered words.

The text classification knowledge base is built mainly with data mining techniques and mathematical statistics. To overcome the defects of earlier ways of measuring the relatedness between keywords and classification numbers, the Dice measure is introduced. To determine the size of the knowledge base, the actual output of the Web concept mining system was sampled and analyzed, and a classification knowledge base with good overall performance was selected. Knowledge drawn from document titles was also introduced to refine the classification knowledge base further.

For topic extraction from Web text, the thesis distinguishes how well the different indexing sources of a page express its topic. Based on a data survey of a certain scale, it determines a weighting scheme grounded in document evidence. Tests on the different indexing sources of a text yield indexing-source selection schemes for Web pages and plain text, followed by a preliminary study of multi-topic mining.

For synonym recognition, the thesis introduces Tongyici Cilin (《同义词词林》) as the semantic system for the first time and proposes a synonym recognition algorithm based on it. The algorithm recognizes synonyms by measuring the semantic similarity between words, which improves the recognition performance of the synonym recognition system. In addition, during automatic text classification, semantic similarity matching replaces literal (surface-string) similarity matching, which improves automatic classification.

To solve the problem of mining unregistered words, the thesis proposes an unregistered word recognition method based on forward expansion of characters and words. Unlike the N-Gram model, this method does not need a large corpus: it uses only local statistical information to recognize unregistered words that are meaningful for retrieval.

Finally, the thesis reports the actual evaluation results of the system, which demonstrate the feasibility of the system as a whole. The Web concept mining system was developed with Borland Delphi 6.0, Microsoft Visual C++ 6.0, and Microsoft Visual FoxPro 6.0.

【Abstract】 To improve the performance of Web text mining tools, this thesis applies automatic indexing and automatic classification techniques, data mining, pattern recognition, and mathematical statistics to build a practical model, the Text Layer Model, which can extract information from three kinds of data on the Internet. The significance of the work is as follows: it provides a new method for building the knowledge database used in automatic classification, provides a location weighting algorithm for information extraction, and presents new methods for improving the recognition of Chinese synonyms and unregistered words.

The knowledge database used for automatic classification is built with data mining technology and mathematical statistics. The Dice measure, support, and confidence are used to create four databases of different sizes by applying different thresholds of correlation degree and interest degree. One of these databases is then selected through testing with the concept mining system.

To distinguish the subject-expression ability of the different parts of a text, a statistical survey covering 1800 Web pages was carried out, and a location weighting algorithm for information extraction is provided.

To improve synonym recognition, a synonym dictionary is used as the semantic system, and a new synonym recognition algorithm based on that dictionary is provided. The algorithm is used to calculate the degree of similarity between words and to match subjects in automatic classification.

A new method is provided to improve the mining of unregistered words: a recognition method based on character or word expansion. Unlike the N-Gram model, this method uses the location information of the text to recognize unregistered words. At the end of the thesis, the concept mining system is tested and evaluated, and its deficiencies are also described objectively.
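The abstract names the Dice measure as the way relatedness between keywords and classification numbers is scored, but gives no formula or data. The sketch below is only an illustration of the standard Dice coefficient, 2·n(k, c) / (n(k) + n(c)), computed from document counts. The function name `dice_table`, the record layout, and the toy records are assumptions made for this example and are not taken from the thesis.

```python
from collections import Counter


def dice_table(records):
    """Return {(keyword, class_no): Dice score} for co-occurring pairs.

    `records` is an iterable of (keywords, class_no) pairs, one per
    document. This layout is a hypothetical choice for the sketch.
    """
    kw_count = Counter()    # documents containing each keyword
    cls_count = Counter()   # documents carrying each classification number
    pair_count = Counter()  # documents where keyword and class co-occur

    for keywords, class_no in records:
        cls_count[class_no] += 1
        for kw in set(keywords):  # count each keyword once per document
            kw_count[kw] += 1
            pair_count[(kw, class_no)] += 1

    # Standard Dice coefficient: 2 * |A and B| / (|A| + |B|)
    return {
        (kw, c): 2.0 * n / (kw_count[kw] + cls_count[c])
        for (kw, c), n in pair_count.items()
    }


# Hypothetical toy data: three documents with keywords and a class number.
records = [
    (["network", "protocol"], "TP393"),
    (["network", "mining"], "TP393"),
    (["mining", "database"], "TP311"),
]
scores = dice_table(records)
```

In a setting like the one the abstract describes, pairs whose score falls below a chosen threshold would presumably be dropped from the knowledge database; the thresholds actually used are not given in this record.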

【Keywords (translated from the Chinese)】 text layer model; Web concept mining; weighted indexing; automatic indexing; automatic classification; Dice measure; synonym recognition; forward expansion of characters and words; unregistered word recognition
【Key words】 web concept mining; text layer model; knowledge database; recognition of synonyms; recognition of unregistered words; automatic indexing; automatic classification

【Online publication submitter】 Nanjing Agricultural University

【Classification number】 TP393
【Times cited】 13
【Downloads】 460
