
文本分类及其相关技术研究

Research on Text Classification and Its Related Technologies

【Author】 李荣陆

【Supervisor】 胡运发

【Author Information】 Fudan University, Computer Software and Theory, 2005, Doctorate

【Abstract】 With the rapid development and growing popularity of the Internet, electronic text information is expanding quickly. How to organize and manage this information effectively, and to find the information a user needs quickly, accurately, and comprehensively, is a major challenge facing information science and technology. As a key technology for organizing and processing large amounts of text data, text classification can alleviate the problem of information disorder to a large extent, helping users locate and route the information they need. As the technical basis of information filtering, information retrieval, search engines, text databases, digital libraries, and related fields, text classification also has broad application prospects.

This dissertation studies text classification and its related technologies. Aiming to improve the speed, accuracy, and stability of classification methods, it proposes several effective solutions and improvements. It also investigates text genre classification, a new research direction in text classification, and text information filtering, an important application area of text classification. The main research content and contributions are as follows.

(1) Selection of training samples. The selection of training samples strongly affects classifier construction: atypical samples not only increase training time but also tend to introduce noise into the training set. For KNN, a widely used text classification method, the dissertation analyzes what constitutes a typical sample and proposes a density-based sample selection algorithm. The number of samples within a sample's ε-neighborhood is used to estimate the local density, and the number of samples from different classes within that neighborhood is used to locate class boundaries. Samples in high-density regions are pruned to reduce the number of atypical samples, while samples near class boundaries are retained as far as possible to preserve classification accuracy.

(2) Chinese text classification based on the maximum entropy model. Chinese and English text classification differ in many respects, including how text features are generated and how sparse they are, so classification results also differ; this is especially true for the maximum entropy model, because the entropy of Chinese is higher than that of English. Starting from feature generation for Chinese text, the dissertation uses two methods, word segmentation and N-grams, applies absolute discounting to smooth feature probabilities, and compares the maximum entropy model with Naive Bayes, KNN, and SVM. Experiments show that the maximum entropy model is not sufficiently stable, so Bagging is combined with it to improve its stability.

(3) Using hierarchical classification to improve the performance of flat classification. Unlike previous hierarchical classification approaches, the dissertation uses a hierarchy that is essentially a graph and applies it to the flat classification problem in order to improve precision and recall. In an ordinary category hierarchy, the confusion relation between sibling categories under the same parent is symmetric, whereas in reality the confusion between categories is not symmetric. Starting from the classifier's confusion matrix, the dissertation introduces the concept of confusion categories; the category hierarchy built from confusion categories considers the relations between categories in terms of precision and recall and captures the asymmetry of the confusion relation.
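Point (1) above describes a density-based procedure for pruning atypical KNN training samples. The following is a minimal sketch of that idea, not the dissertation's implementation: the function name, the ε radius, and the density threshold are illustrative assumptions, and a real implementation would keep a representative subset of each dense region rather than discarding its interior entirely.

```python
import numpy as np

def select_training_samples(X, y, epsilon=1.0, density_threshold=10):
    """Return indices of training samples kept for a KNN classifier."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(X)
    # Pairwise Euclidean distances (adequate for a small illustrative corpus).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    keep = []
    for i in range(n):
        # Samples inside the epsilon-neighborhood of sample i (excluding i itself).
        neighbors = np.where((dists[i] <= epsilon) & (np.arange(n) != i))[0]
        density = len(neighbors)
        # Sample i lies near a class boundary if its neighborhood mixes classes.
        on_boundary = density > 0 and np.any(y[neighbors] != y[i])
        # Keep boundary samples; prune interior samples of high-density regions.
        if on_boundary or density < density_threshold:
            keep.append(i)
    return np.array(keep, dtype=int)
```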
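Point (2) mentions absolute discounting for smoothing feature probabilities. The sketch below shows the standard form of that technique with a uniform distribution over unseen features as the backoff; the dissertation's exact backoff distribution and discount value are not stated here, so those choices are assumptions, as is the non-empty `counts` input.

```python
def absolute_discount(counts, vocab_size, discount=0.5):
    """
    Absolute-discounting smoothing: subtract a fixed discount from every
    observed count and redistribute the freed probability mass uniformly
    over unseen features. `counts` maps observed features to their counts.
    """
    total = sum(counts.values())
    seen = len(counts)
    unseen = vocab_size - seen
    # Probability mass freed by discounting, spread over unseen features.
    leftover = discount * seen / total

    def prob(feature):
        if feature in counts:
            return (counts[feature] - discount) / total
        return leftover / unseen if unseen > 0 else 0.0

    return prob
```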
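Point (2) also stabilizes the maximum entropy classifier with Bagging. Below is a hedged sketch of that combination, assuming scikit-learn is available: multinomial logistic regression stands in for the maximum entropy model, character N-grams stand in for the dissertation's Chinese feature generation, and the class name `BaggedMaxEnt` and all parameter values are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

class BaggedMaxEnt:
    """Bagging ensemble of maximum entropy (logistic regression) classifiers."""

    def __init__(self, n_estimators=10, random_state=0):
        self.n_estimators = n_estimators
        self.rng = np.random.default_rng(random_state)
        self.models = []

    def fit(self, X, y):
        y = np.asarray(y)
        n = X.shape[0]
        self.models = []
        for _ in range(self.n_estimators):
            # Train each member model on a bootstrap resample of the data.
            idx = self.rng.integers(0, n, size=n)
            model = LogisticRegression(max_iter=1000)
            model.fit(X[idx], y[idx])
            self.models.append(model)
        return self

    def predict(self, X):
        # Majority vote over the member models for each document.
        votes = np.array([m.predict(X) for m in self.models])
        preds = []
        for column in votes.T:
            values, counts = np.unique(column, return_counts=True)
            preds.append(values[counts.argmax()])
        return np.array(preds)

# Character uni- and bi-grams approximate N-gram feature generation for Chinese.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 2))
# Usage on a toy corpus (texts and labels are placeholders):
# X_train = vectorizer.fit_transform(train_texts)
# clf = BaggedMaxEnt(n_estimators=10).fit(X_train, train_labels)
# predictions = clf.predict(vectorizer.transform(test_texts))
```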
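Point (3) builds a category hierarchy from the classifier's confusion matrix so that asymmetric confusion between categories can be expressed. The sketch below only shows how such asymmetric "confusion category" edges might be extracted from a confusion matrix; the threshold and the dictionary-of-lists graph representation are assumptions, not the dissertation's construction.

```python
import numpy as np

def confusion_categories(conf_matrix, threshold=0.1):
    """
    conf_matrix[i, j] counts documents of true class i predicted as class j.
    Returns a directed graph: class i -> classes j that i is often mistaken for.
    The relation is asymmetric: i may be confused with j without the reverse.
    """
    conf = np.asarray(conf_matrix, dtype=float)
    # Row-normalize so entry (i, j) is the rate at which class i is labeled j.
    rates = conf / conf.sum(axis=1, keepdims=True)
    graph = {}
    for i in range(len(rates)):
        graph[i] = [j for j in range(len(rates))
                    if j != i and rates[i, j] >= threshold]
    return graph

# Example: class 0 is often mislabeled as class 1, but not vice versa.
cm = np.array([[80, 15, 5],
               [ 2, 95, 3],
               [ 4,  6, 90]])
print(confusion_categories(cm))  # {0: [1], 1: [], 2: []}
```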

  • 【Online Publication Contributor】 Fudan University
  • 【Online Publication Year/Issue】 2005, No. 07
  • 【CLC Number】 TP391.1
  • 【Cited By】 202
  • 【Downloads】 4532