
Applied Research on Text Classification Based on a Concept Space

A Study on Concept-VSM And Its Application in Text Classification

【Author】 Huang Haiying

【Advisor】 Lin Shimin

【Author Information】 Guangxi Normal University, Computer Software and Theory, 2002, Master's degree

【Abstract】 With the rapid growth of textual information, and especially of online information on the Internet, text (web page) classification is becoming increasingly important. Text classification helps users selectively read and process massive volumes of text, can to a large extent alleviate the current disorder of online information, and makes it easier for users to locate and route the information they need. Automatic text classification has therefore become a key technique of considerable practical value and a powerful means of organizing and managing data.

Text classification methods fall into two categories: knowledge-based methods and statistics-based methods. A knowledge-based text classification system targets a specific domain and requires a knowledge base of that domain as support; because of the many difficulties in knowledge extraction, updating, maintenance, and self-learning, its range of application is narrow. Statistics-based methods, by contrast, rely on purely mathematical computation, do not demand complex linguistic or domain knowledge, and have shown good results in practice, which has made them the prevailing approach to text classification. The widely used statistical models include the vector space model, the Naive Bayes model, the example-mapping model, and the support vector machine model.

The Vector Space Model (VSM), proposed by G. Salton et al. in the 1960s, reduces a document to a vector whose components are term weights and reduces the classification process to operations on vectors, greatly lowering the complexity of the problem. Moreover, the VSM prescribes no particular term-weighting scheme or similarity measure; it only provides a theoretical framework, so different weighting functions and similarity computations can be plugged in, which gives the model wide adaptability. However, the model generally represents documents by index terms, and classification is achieved through character- and word-level matching between documents. This is shallow lexical matching rather than deep semantic matching, and it is inaccurate: synonymy and polysemy of characters and words adversely affect the recall and precision of text classification, respectively.

LSI (Latent Semantic Indexing) is an algebraic model for information retrieval proposed by S. T. Dumais et al. in 1988. Its basic idea is that the words in a text are related to one another, that is, a latent semantic structure exists; statistical methods are therefore used to discover this structure and to represent words and texts in terms of it, with the result that correlations among words are eliminated and text vectors are simplified. LSI performs retrieval using concept indices derived by statistical computation rather than traditional index characters and words. It rests on the assertion that a document collection contains an implicit semantic structure of word usage, one that is partly obscured by the semantic and formal variety of the words in the documents. LSI computes the singular value decomposition (SVD) of the term-document matrix of the original collection and takes the k largest singular values and their corresponding singular vectors to form a new matrix that approximates the original term-document matrix. Because the new matrix reduces the ambiguity of the semantic relations between words and documents, it is more favorable to information retrieval. Compared with traditional retrieval models, LSI has two advantages: the meaning of each dimension of the vector space changes fundamentally, reflecting not simple word frequency and distribution but strengthened semantic relations; and replacing the original word and document vectors with low-dimensional ones makes large document collections tractable.

Taking the LSI method as its foundation and inspired by papers [1] and [2], this thesis investigates computational methods for text classification based on a concept space. Since text classification is a branch of computerized information retrieval, the thesis first briefly introduces the meaning, history, and trends of information retrieval and computerized information retrieval; the basic theory, objects of study, and methods of computerized information retrieval; and the key techniques of text classification. It then discusses the ideas and theoretical foundations of latent semantic indexing (LSI), illustrates them with figures and a small example, and explains the advantages of the method. The main work of the thesis is to construct a concept space for text classification on the basis of the VSM and LSI and to propose methods for computing, within that space, word similarity, document similarity, and the similarity between a document to be classified and a class. Concepts are acquired from a large training set, documents are converted into document vectors, class base vectors are constructed, and finally each document vector is matched against the class base vectors in the concept space to complete the classification. Classification-learning problems that remain to be explored in the concept space are also discussed. Experiments confirm that concept-space-based text classification achieves good results.

Because synonymy and polysemy are pervasive in language, text classification methods based on word matching are inherently limited. The concept-space-based method proposed in this thesis replaces the original document vector space, built on independent word indices, with a smaller but more robust statistically derived concept space, and shows a clear performance advantage. Through more systematic future study of the computational methods of concept-space-based text classification, we hope to find a classification method that is both theoretically rigorous and practically feasible.
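The SVD step at the heart of the LSI method described above can be sketched as follows. This is a minimal illustration using a hypothetical toy term-document matrix and NumPy's general-purpose SVD routine; it is not the thesis's actual data or implementation.

```python
import numpy as np

# Hypothetical toy term-document matrix A (rows = terms, columns = documents).
# The thesis builds this from a large training corpus; here it is illustrative only.
A = np.array([
    [1, 1, 0, 0],   # term 1
    [1, 0, 0, 0],   # term 2
    [0, 1, 1, 0],   # term 3
    [0, 0, 1, 1],   # term 4
], dtype=float)

# Singular value decomposition: A = U * diag(s) * Vt,
# with singular values in s sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values and their singular vectors.
# The rank-k matrix A_k approximates A; its k dimensions form the "concept space".
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Each document is now represented by a k-dimensional concept vector
# instead of a sparse vector over all index terms.
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T   # shape: (n_docs, k)
print(doc_vectors.shape)  # (4, 2)
```

For realistic corpora the term-document matrix is large and sparse, so a truncated sparse SVD (e.g. `scipy.sparse.linalg.svds`) would be used rather than the dense decomposition shown here.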

【Abstract】 As the volume of information available on the Internet and corporate intranets continues to increase, there is growing interest in helping people better find, filter, and manage these resources. Text classification, the assignment of natural language texts to one or more predefined categories based on their content, is an important component of many information organization and management tasks. Its most widespread application has been assigning subject categories to documents to support text retrieval, routing, and filtering. In many contexts, trained professionals are employed to categorize new items; this process is very time-consuming and costly, which limits its applicability. Rule-based approaches similar to those used in expert systems are common, but they generally require manual construction of the rules, make rigid binary decisions about category membership, and are typically difficult to modify. Another strategy is to use statistical analysis to automatically construct classifiers from labeled training data. The resulting classifiers, by contrast, have many advantages: they are easy to construct and update, they depend only on information that is easy for people to provide, they can be customized to the categories of interest to individuals, and they allow users to smoothly trade off precision against recall depending on the task. A growing number of statistical classification methods have been applied to text categorization, including the Vector Space Model, the Naive Bayes model, and the Support Vector Machine model. The Vector Space Model (VSM) was proposed by G. Salton in the 1960s. In this model, each document is represented as a vector of words, as is typically done in the popular vector representation for information retrieval.
Text classification, however, is essentially semantic categorization, while the VSM represents the contents of documents and queries with a set of index terms, which can lead to poor classification performance. Latent Semantic Indexing (LSI), proposed by S. T. Dumais in 1988, is an algebraic model that has achieved good results in information retrieval. It maps document and query vectors into a lower-dimensional space by singular value decomposition, so that the inherent vagueness of a retrieval process based on keyword sets is considerably reduced and the semantic associations among documents are highlighted. LSI is useful for finding relations between terms where human effort does not yield good results; thus synonymy can be resolved, and polysemy can be partially resolved. Guided by LSI and VSM theory and building on papers [1] and [2], this thesis probes into text classification based on a concept-VSM. First, the thesis gives a brief introduction to information, information retrieval, and computerized information retrieval and their development. It then discusses the types of information retrieval models and the basic theory, objects, and methods of computerized information retrieval. Third, it introduces the fundamental principles of LSI and uses an illustration and a small example to elucidate its advantages. The focus of this work is on building a concept space based on the VSM and LSI; presenting methods for computing word similarity and text similarity in the concept space; acquiring concepts from a large training set; converting texts into text vectors; and constructing class base vectors. Finally, the thesis discusses future work: the classification-learning problem in the concept space.
At the end of the thesis, theoretical analysis and experimental results both show that classification based on the concept-VSM can improve categorization performance significantly, with high average classification precision and recall. Because synonymy and polysemy are pervasive, text classification based on word matching is inherently limited; this thesis therefore presents a text classification method based on a concept-VSM, with a small but more robust concept space in place of the text vector space based on independent index terms.
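The matching step summarized above, in which a document vector is compared against class base vectors in the concept space, can be sketched as follows. This is a hedged illustration: the 2-dimensional concept vectors, the class names, and the use of the class centroid as the base vector are assumptions for the sake of a runnable example; the thesis defines its own similarity computations.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two concept vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical concept vectors for labeled training documents
# (in practice these come from the SVD-derived concept space).
train = {
    "sports":  [np.array([0.9, 0.1]), np.array([0.8, 0.2])],
    "finance": [np.array([0.1, 0.9]), np.array([0.2, 0.7])],
}

# Class base vector: here taken as the centroid of each class's
# training document vectors (one plausible construction).
centroids = {c: np.mean(vs, axis=0) for c, vs in train.items()}

def classify(doc_vec):
    # Assign the document to the class whose base vector it is most similar to.
    return max(centroids, key=lambda c: cosine(doc_vec, centroids[c]))

print(classify(np.array([0.85, 0.15])))  # -> sports
```

A new document would first be folded into the concept space (projected onto the retained singular vectors) before being passed to `classify`.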

  • 【CLC Number】TP391.3
  • 【Cited By】1
  • 【Downloads】249