
WEBæ–‡æœ¬æŒ–æŽ˜ä¸å…³é”®é—®é¢˜çš„ç ”ç©¶

Research on Key Problems in WEB Text Mining


【Author】 He Hui (何慧)

【Author information】 Beijing University of Posts and Telecommunications, Signal and Information Processing, 2009, PhD

【Abstract, translated from the Chinese】 With the rapid development of the Internet and communication networks, web text has become the main carrier of information and an indispensable source of information in people's lives, and the research significance and practical value of text mining technology are increasingly prominent. At the same time, with the arrival of the Web 2.0 era, more and more user-created digital content has appeared on the web. The large-scale production and spread of this content has gradually made short text computing, Web text information extraction, and text sentiment analysis hot topics in Web text mining research. To address these problems, this dissertation carries out the following research:

(1) Short text computing based on statistical language models. Given that short texts contain few characters, use non-standard language, and exist in huge numbers, this dissertation proposes a short text clustering algorithm based on N-gram feature extraction and RPCL (Rival Penalized Competitive Learning). First, character-level N-gram feature extraction is performed, that is, Chinese chunks are extracted from the unsegmented corpus. A Chinese chunk may be a single Chinese character, a word, or a character string, so chunks not only express the semantic information of a short text but also preserve word order and the dependencies between characters. Next, statistical substring reduction and mutual information filtering yield the set of candidate Chinese chunks. Finally, RPCL, a neural network clustering algorithm, is used to cluster the short texts. Experimental results show that the algorithm clusters short texts effectively and reduces the feature dimensionality.

(2) Web text information extraction for advertisement recommendation and sentiment analysis. For the compound word extraction problem in advertisement recommendation, this dissertation proposes a semi-supervised Chinese compound word extraction algorithm based on the hidden Markov model. Starting from a small number of seed compounds and a BEMI (Begin, End, Middle, Independent) template, the hidden Markov model identifies compounds that carry the same or similar information as the seed compounds. The algorithm uses bootstrapping, continually enlarging the compound list through self-learning. Experimental results show that the algorithm meets the information extraction needs of keyword recommendation in advertising systems, with high precision and acceptable recall. For the problem of sentiment word extraction in text analysis, this dissertation proposes a Chinese sentiment word extraction algorithm based on maximum entropy and an LMR (Left, Middle, Right) template. A sliding window is set over the text, the LMR template marks the position of each word, and words, their relative positions, and their parts of speech are used as features to identify and extract sentiment words. Experimental results show that the algorithm achieves high recall and precision, and that sentiment word extraction is robust under certain feature combinations.

(3) Text sentiment classification based on supervised and semi-supervised learning. For the large volume of pop music and of music created or adapted by web users, this dissertation proposes a method for classifying the sentiment of song lyrics. First, word statistics on the lyrics corpus show that the distribution basically follows Zipf's law but differs slightly from the word distribution in a general-purpose Chinese classification corpus (the 863 Program text classification test data). Because classifying the emotion expressed by lyrics differs from classifying ordinary text by topic, more features that convey emotional coloring need to be extracted. Within the N-gram model framework, this dissertation applies three preprocessing methods (different N-gram templates, stop-word removal, and part-of-speech filtering) to extract more emotional-semantic features from lyrics, and proposes a classification algorithm that uses a maximum entropy model with Gaussian and exponential priors to model those features. Experimental results show that the maximum entropy model with Gaussian and exponential priors is very well suited to lyric sentiment analysis. For the shortage of labeled data in practical sentiment classification, this dissertation proposes a text sentiment classification algorithm based on semi-supervised learning. It assumes that a sentiment manifold structure exists in the space and treats the texts to be classified as points sampled from this manifold. First, a graph is built from the neighborhood information of these points, with the weights of the edges between each point and its neighbors taken from the linear weighted representation of the point by its neighbors. Then the graph is treated as a probability transition matrix, and the class labels diffuse over this matrix to complete the sentiment classification. Experiments on movie review and Chinese lyrics corpora show that the algorithm performs well in text sentiment classification.

(4) Text opinion retrieval. Built around the topic-oriented Chinese text opinion retrieval task of COAE2008, in which the author took part in 2008, this part introduces the submitted system, PRIS-SAS. The system works in two stages. After preprocessing such as encoding conversion and word segmentation, PRIS-SAS first uses the Indri retrieval system to index the corpus and runs ad-hoc retrieval with the topic terms from the task. It then uses the text sentiment classification algorithm from this dissertation to build a sentiment orientation model and a polarity model, judges the orientation of the retrieved relevant texts, and reranks the retrieval results. Evaluation metrics on the COAE2008 dataset show that the opinion retrieval system designed in this dissertation achieves a high level of performance.
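The chunk extraction in item (1) can be illustrated with a minimal Python sketch. It counts character-level N-grams over unsegmented text and then applies a simple form of statistical substring reduction. The function names, the `max_n` and `min_freq` parameters, and the equal-frequency reduction rule are assumptions made for illustration; the abstract does not state the exact criterion used in the dissertation, and the mutual information filter is omitted here.

```python
from collections import Counter


def char_ngrams(texts, max_n=3):
    """Count every character n-gram of length 1..max_n in unsegmented texts."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts


def substring_reduction(counts, min_freq=2):
    """Drop rare n-grams, then drop any n-gram that occurs exactly as often
    as a longer n-gram containing it (assumed equal-frequency rule)."""
    frequent = {g: c for g, c in counts.items() if c >= min_freq}
    kept = {}
    for gram, freq in frequent.items():
        absorbed = any(
            len(other) > len(gram) and gram in other and other_freq == freq
            for other, other_freq in frequent.items()
        )
        if not absorbed:
            kept[gram] = freq
    return kept


chunks = substring_reduction(char_ngrams(["abc", "abc", "abd"]))
print(chunks)  # {'ab': 3, 'abc': 2}
```

In this toy run, "a" and "b" are absorbed by "ab" because they never occur outside it, which is how the reduction step shrinks the feature set before clustering.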

【Abstract】 With the rapid development of the Internet and communication networks, web documents have become one of the major modern information media as well as an indispensable information source in people's lives, and text mining has become a technology of great research and practical significance. With the arrival of Web 2.0, more and more users are involved in generating information, and the Internet is filling with personal, opinionated content. Such content is meaningful and valuable for many applications, such as e-commerce, online communities, network information security, and web search engines. However, processing these texts with traditional text mining poses enormous challenges. This dissertation investigates three problems: short text computing, web text information extraction, and text sentiment analysis. The main contributions are summarized as follows:

(1) Short text computing based on a statistical language model. We introduce an algorithm that clusters Chinese short texts based on N-gram feature extraction. Targeting the characteristics of Chinese short texts, the algorithm employs N-gram feature extraction, statistical substring reduction, and mutual information filtering to capture Chinese chunks from texts; these chunks reflect the semantic structure of the text and the dependencies between characters. The RPCL algorithm, which does not need to know the exact number of clusters in advance, is then applied to cluster the texts with high precision. Experimental results show that, compared with traditional methods, this approach markedly reduces the feature dimensionality and effectively improves the performance of Chinese short text clustering.

(2) Web text information extraction for keyword recommendation and sentiment analysis. For keyword recommendation in advertising systems, we propose a semi-supervised approach to Chinese compound extraction based on an HMM with bootstrapping. First, we define a set of tags, BEMI {beginning, end, middle, independent}, which denote the position of a word in a compound. We then employ the HMM to extract compounds automatically with the BEMI tagging algorithm. We rank the compounds extracted from the corpus by word frequency and length in descending order and add the top N compounds to the seed compound list. Through bootstrapping, the algorithm learns more Chinese compounds from the corpus. Experimental results show that this approach performs much better than an unsupervised one. Unlike compounds extracted by traditional methods, these Chinese compounds carry category information, so they can serve as features in text classification and clustering. Because of its extensibility and versatility, the approach can also be applied to keyword recommendation for different kinds of advertisers. For word-level sentiment analysis, we propose an algorithm based on the Maximum Entropy (ME) model and an LMR template. The LMR template tags word positions, and words, word positions, and part-of-speech (POS) tags are used as ME features. A window slides over the text, and the sentiment of the word in the M position is labeled. Experimental results show that the algorithm performs well in sentiment word extraction and is robust under some feature combinations.

(3) Text sentiment classification based on supervised and semi-supervised learning. Most pop songs have accompanying lyrics, which play an essential role in understanding songs semantically. Analysis of lyrics is therefore a natural complement to acoustic methods for music retrieval. One basic aspect of music retrieval is music emotion classification by learning from lyrics. This problem differs from traditional text classification in that more linguistic or semantic information is required for good emotion analysis. We examine the lyrics corpus against Zipf's law using the word as the unit and find that the distribution roughly obeys it. We then study three kinds of preprocessing (different N-gram templates, stop-word removal, and POS-based filtering) and a series of language grams within the well-known N-gram language model framework to extract more semantic features. We also improve the Maximum Entropy model with Gaussian and exponential priors to model features for music emotion classification. Experimental results show that the feature extraction methods improve music emotion classification accuracy and that ME with priors obtains the best results. Labeled data for sentiment classification is often scarce, so we also address that setting with a novel semi-supervised learning algorithm. We assume that there is a sentiment manifold structure and that documents are sampled from this manifold. We build a graph over both labeled and unlabeled data, constructed linearly from each data point's neighborhood information. Labels are then spread through the graph, which is treated as a probabilistic transition matrix during propagation. The algorithm is capable of learning sentiment manifold structures within texts. Promising experimental results are shown on lyrics and movie review data.

(4) Opinion retrieval. Following the Chinese Opinion Analysis Evaluation (COAE2008), we discuss text opinion retrieval. Our sentiment analysis system, PRIS-SAS, employs a two-stage approach. After preprocessing, the corpus provided by COAE2008 is indexed with the Indri retrieval system, which is used for ad-hoc retrieval. A sentiment model and a polarity model trained by ME with priors are then used to classify the texts returned by Indri, and the retrieval results are reranked according to the classification results. Experiments on the COAE2008 datasets show that the system proposed in this dissertation is a state-of-the-art opinion retrieval system.
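The graph-based label spreading described in item (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `propagate_labels`, the parameters `k`, `alpha`, and `iters`, and the inverse-distance edge weights are hypothetical choices. The dissertation derives its edge weights from a linear neighborhood representation, which this sketch does not implement.

```python
import math


def propagate_labels(points, labels, k=2, alpha=0.9, iters=100):
    """Spread labels over a kNN graph used as a row-stochastic transition matrix.

    points: list of feature vectors (tuples of floats).
    labels: class label per point, or None where the point is unlabeled.
    Returns a predicted class label for every point.
    """
    n = len(points)
    classes = sorted({y for y in labels if y is not None})
    # Assumed weighting: inverse distance to each of the k nearest neighbors.
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        for d, j in dists[:k]:
            P[i][j] = 1.0 / (1.0 + d)
        total = sum(P[i])
        P[i] = [w / total for w in P[i]]  # each row now sums to 1
    # One-hot rows for labeled points, all zeros for unlabeled ones.
    Y = [[1.0 if labels[i] == c else 0.0 for c in classes] for i in range(n)]
    F = [row[:] for row in Y]
    for _ in range(iters):
        # Mix the neighbors' current scores with the original labels.
        F = [
            [alpha * sum(P[i][j] * F[j][c] for j in range(n)) + (1 - alpha) * Y[i][c]
             for c in range(len(classes))]
            for i in range(n)
        ]
    return [classes[max(range(len(classes)), key=lambda c: F[i][c])] for i in range(n)]


points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
labels = [0, None, None, 1, None, None]
print(propagate_labels(points, labels))  # [0, 0, 0, 1, 1, 1]
```

With one labeled point per cluster, the labels diffuse to the unlabeled neighbors, which is the behavior the semi-supervised setting relies on when labeled data is scarce.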

【Key words】 short text computing; compound word extraction; sentiment word extraction; text sentiment classification; opinion retrieval

【Online publication contributor】 Beijing University of Posts and Telecommunications

【Classification number】 TP311.13
【Times cited】 15
【Downloads】 2364