
Design and Implementation of NNTCS, a Neural-Network-Based Text Classification System (基于神经网络的文本分类系统NNTCS的设计和实现)

【Author】 刘钢 (Liu Gang)

【Advisor】 范植华 (Fan Zhihua)

【Author Information】 Graduate School of the Chinese Academy of Sciences (Institute of Software), Computer Application Technology, 2003, Master's degree

【Abstract】 Text classification is the basis and core of text mining and has become a research hotspot in data mining and web mining in recent years. It plays an important role in traditional information retrieval, in building web-site index architectures, and in Web information retrieval.

This thesis first studies the common solutions to several key problems in text classification, describes the core techniques and system architecture of typical text classification systems, and outlines the range of applications. It then presents NNTCS, an automatic text classification system based on neural networks, and discusses in detail the implementation of feature extraction, dimension reduction, hierarchical classification, and classifier training.

The first step in NNTCS is Chinese word segmentation: feature terms are extracted from each document and their term frequencies are counted. The system uses a neural network as the classifier; the term frequencies form the original feature vector, whose components correspond one to one with the neurons of the network's input layer. During training, the labeled training documents are fed to the network and the error back-propagation (BP) algorithm adjusts the weights; the final fixed weights are stored as classification knowledge. During classification, the feature vector of an unlabeled document is fed to the fixed-weight network and its output is compared with a predefined threshold to decide the class.

NNTCS also adopts Latent Semantic Indexing (LSI), a technique widely used in information retrieval, to map the original vector space into an abstract k-dimensional semantic space, which greatly reduces the dimensionality and improves training speed and performance. Neural networks are common in general pattern recognition but rarely used in text classification, mainly because the huge vector space limits network performance; LSI's dimension reduction removes this obstacle, so the two techniques complement each other.

The training stage also incorporates a genetic algorithm (GA) to optimize the initial weights of the network. Thanks to its global-search property, the GA helps the network avoid local convergence, so the strengths of both methods are fully exploited.

Finally, an open test of NNTCS shows that the system achieves high average recall and high average precision.
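The abstract describes building the original feature vector from the term frequencies of segmented Chinese documents, one vector component per input-layer neuron. The thesis's own code is not reproduced here; the following is only a minimal Python sketch of that step, in which the vocabulary handling and the assumption that documents arrive as already-segmented token lists are illustrative choices of this summary, not details taken from NNTCS.

```python
from collections import Counter
from typing import Dict, List

def build_vocabulary(segmented_docs: List[List[str]]) -> Dict[str, int]:
    """Map every feature term seen in the corpus to a vector index."""
    vocab: Dict[str, int] = {}
    for tokens in segmented_docs:
        for term in tokens:
            if term not in vocab:
                vocab[term] = len(vocab)
    return vocab

def term_frequency_vector(tokens: List[str], vocab: Dict[str, int]) -> List[float]:
    """Raw term-frequency vector: one component per vocabulary term,
    matching the one-to-one correspondence with input-layer neurons."""
    vec = [0.0] * len(vocab)
    for term, count in Counter(tokens).items():
        idx = vocab.get(term)
        if idx is not None:
            vec[idx] = float(count)
    return vec

# Toy, already-segmented documents; a real system would first run a
# Chinese word segmenter over the raw text (not shown here).
docs = [["神经网络", "文本", "分类", "文本"], ["遗传", "算法", "优化"]]
vocab = build_vocabulary(docs)
vectors = [term_frequency_vector(d, vocab) for d in docs]
```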

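The LSI step maps this high-dimensional term space into an abstract k-dimensional semantic space. A common way to realize LSI is a truncated singular value decomposition of the term-document matrix; the NumPy sketch below illustrates that general idea (the matrix layout, the value of k, and the fold-in of new documents are assumptions of the sketch, not specifics of NNTCS).

```python
import numpy as np

def lsi_project(term_doc: np.ndarray, k: int):
    """Truncated SVD of the term-document matrix A (terms x documents).

    Returns k-dimensional document vectors plus the factors needed to
    fold new documents into the same semantic space.
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    doc_vectors = (np.diag(s_k) @ Vt_k).T   # one k-dim row per training document
    return doc_vectors, U_k, s_k

def fold_in(query_tf: np.ndarray, U_k: np.ndarray, s_k: np.ndarray) -> np.ndarray:
    """Project a new term-frequency vector into the k-dim semantic space."""
    return query_tf @ U_k / s_k

# Toy example: 5 terms, 4 documents, reduced to k = 2 dimensions.
A = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 3., 0., 1.],
              [0., 0., 2., 2.],
              [1., 0., 0., 1.]])
doc_vecs, U_k, s_k = lsi_project(A, k=2)
new_doc = fold_in(np.array([1., 0., 2., 0., 1.]), U_k, s_k)
```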
  • 【CLC Number】 TP311.52
  • 【Cited by】 3
  • 【Downloads】 388
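For the classifier itself, the abstract describes a neural network trained with error back-propagation whose output is compared against a threshold to decide the class, with a genetic algorithm supplying good initial weights. The sketch below is a deliberately small stand-in: one hidden layer, plain random initialization in place of the GA, and a single output neuron, none of which reflect the actual topology of NNTCS.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyBPClassifier:
    """One-hidden-layer network trained with error back-propagation.

    In NNTCS the initial weights would come from a genetic algorithm;
    here they are simply drawn at random (an assumption of this sketch).
    """

    def __init__(self, n_in, n_hidden=8):
        self.W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
        self.W2 = rng.normal(scale=0.1, size=(n_hidden, 1))

    def _forward(self, X):
        self.h = sigmoid(X @ self.W1)      # hidden-layer activations
        return sigmoid(self.h @ self.W2)   # single output neuron

    def fit(self, X, y, lr=0.5, epochs=500):
        y = np.asarray(y, dtype=float).reshape(-1, 1)
        for _ in range(epochs):
            out = self._forward(X)
            # Propagate the output error backwards and adjust both weight layers.
            delta_out = (out - y) * out * (1.0 - out)
            delta_hid = (delta_out @ self.W2.T) * self.h * (1.0 - self.h)
            self.W2 -= lr * self.h.T @ delta_out
            self.W1 -= lr * X.T @ delta_hid
        return self

    def classify(self, X, threshold=0.5):
        """Run the fixed-weight network and compare its output with a threshold."""
        return (self._forward(X) >= threshold).astype(int).ravel()

# Hypothetical usage with the k-dimensional LSI vectors from the previous sketch:
#   clf = TinyBPClassifier(n_in=doc_vecs.shape[1]).fit(doc_vecs, labels)
#   predictions = clf.classify(new_doc.reshape(1, -1))
```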