Text Subtopic-field Segmentation and Unsupervised Feature Extraction

【Author】 Wang Xiaofang (王小芳)

【Supervisors】 Zhang Shugong (张树功); Zhang Wenyi (张文燚)

【Author information】 Jilin University, Computational Mathematics, 2009, PhD

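【Summary】 This thesis addresses several related problems in text clustering, namely subtopic-field segmentation and unsupervised feature extraction. At present there are few methods for dividing a text into subtopic fields, and the existing ones are constrained by their reliance on domain knowledge bases and by non-global optimization, which severely limits their generality and segmentation quality. The thesis builds a new, globally optimal subtopic-field segmentation model that is independent of any specific application domain. The construction of the model focuses on the within-subtopic-field distance, the between-subtopic-field distance, the within-subtopic-field angle, and the between-subtopic-field angle, and the optimal segmentation pattern is obtained by solving the optimization model. Feature extraction and weight computation are the most important steps in text clustering, and the thesis proposes a new method for both. First, the semantic quantum is defined, and semantic quanta are divided into latent quanta and obvious quanta according to how much each type contributes to expressing the topic of the text. Next, obvious quanta are given a structured representation using an improved vector space model, and latent quanta using an improved word-sequence model, which yields a new text representation model based on a topic-concept model. Finally, obvious-quantum weights are computed with an obvious-quantum distribution model, and latent-quantum weights are computed with a co-occurrence model of latent quanta within an effective region. The algorithm needs no domain knowledge base and supports subsequent incremental text clustering, laying a foundation for applying text clustering on the Internet.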
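As a rough illustration of weighting terms by co-occurrence within an effective region, the sketch below counts, for each token, how many different tokens appear within a fixed window around each of its occurrences, then normalizes the counts. The record gives no formula for the thesis's co-occurrence model, so the function name `cooccurrence_weights`, the fixed-size window standing in for the "effective region", and the normalization are all assumptions made for illustration only.

```python
from collections import Counter


def cooccurrence_weights(tokens, window=3):
    """Weight each token by its co-occurrence count inside a sliding window.

    Assumption: a symmetric window of `window` positions stands in for the
    "effective region"; the thesis's actual model is not given in the record.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            # Count neighbours that are a different token from the centre one.
            if j != i and tokens[j] != tok:
                counts[tok] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # Normalize so the weights sum to 1 and can be sorted or compared.
    return {tok: c / total for tok, c in counts.items()}


if __name__ == "__main__":
    weights = cooccurrence_weights(["a", "b", "a", "c"], window=1)
    for tok, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        print(tok, round(w, 3))
```

Sorting the resulting weights in descending order gives a ranked feature sequence of the kind the summary describes for latent quanta.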

【Abstract】 With the rapid development and popularization of the Internet, online information resources keep growing, and society has moved from the information age into an age of abundant digital information. Faced with this deluge of online resources, it has become difficult to find the information one actually needs quickly and efficiently. How to organize, manage, and use such information rationally and effectively has therefore gradually become an important field of study in information processing. Traditional information processing relies mainly on manual classification and selection: professionals analyze the content of web pages and assign each page to one or more appropriate categories. With the rapid growth in the volume of Web information, this manual approach has clearly become impractical.

Text clustering is a powerful tool for organizing and managing information. It can help bring order to the currently chaotic state of information on the Internet and make it easier for users to locate the information they need accurately. Continued study of text clustering is therefore necessary, and it has become an increasingly important area of research; combined with search engines and information filtering technologies, it is gradually becoming an important means of obtaining web-based information.

Text clustering is a classic problem in natural language processing. To turn text clustering into a general pattern recognition problem, several problems must be solved. First, a multi-topic text must be divided into a number of single-topic subtopic fields. Then appropriate feature units are selected according to the characteristics of natural language, and their weights are calculated and sorted. Finally, the feature units are clustered using a clustering strategy.
To resolve the problems of current subtopic-field segmentation and feature extraction, the main work of this thesis is as follows:

1. The text representation model was studied. The semantic quantum was defined based on the key elements of features, and semantic quanta were divided into obvious quanta and latent quanta according to their contribution to expressing the topic and the concept. An obvious quantum directly indicates the topic of the text, while latent quanta express the details of the text through their co-occurrence within an effective area. An improved vector space model was used for the structured representation of obvious quanta and an improved word-sequence model for the structured representation of latent quanta, establishing a new text representation model based on the topic and the concept.

2. A subtopic-field segmentation technique based on an optimal control model was proposed. It rests on the basic assumption that the best segmentation pattern is the one in which distances and angles within each subtopic field are small and distances and angles between subtopic fields are large. The objective function of the optimal control model was constructed from the within-subtopic-field distance, the between-subtopic-field distance, the within-subtopic-field angle, and the between-subtopic-field angle. Solving the optimal control model yields the optimal subtopic-field segmentation. The method is globally optimal and independent of specific applications, so it can be applied not only to specific applications but also to Internet information retrieval and processing.

3. A new unsupervised feature extraction model based on the text conceptual model is presented. First, the weight of each obvious quantum is computed from the obvious quantum entanglement intensity; then the weight of each latent quantum is computed from the window function of the latent quantum; finally, the obvious quantum feature sequence and the latent quantum feature sequence are obtained by sorting the respective weights.

If the goal is only to categorize the text set, the categories can be obtained by clustering the obvious quantum features. To reflect the details of the categories, the latent quantum features must then be clustered on the basis of the obvious-quantum clustering results. In practice, features can be selected according to the task at hand, which greatly reduces both the computational complexity and the redundancy between features.
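The segmentation criterion in the abstract (small distances and angles within a subtopic field, large ones between fields) can be illustrated with a small sketch. The abstract names the four ingredients but does not give the objective function or the solver, so everything concrete here is an assumption: sentences are represented as plain numeric vectors, each field is summarized by its centroid, the four terms are combined with equal weights as `between - within`, and a brute-force search over contiguous cut positions stands in for the thesis's optimal control model. All function names are hypothetical.

```python
import math
from itertools import combinations


def centroid(vecs):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]


def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def angle(u, v):
    """Angle in radians between two vectors (0 if either is the zero vector)."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    cos = sum(a * b for a, b in zip(u, v)) / (nu * nv)
    return math.acos(max(-1.0, min(1.0, cos)))


def score(vecs, cuts):
    """Assumed objective: between-field spread minus within-field spread."""
    bounds = [0, *cuts, len(vecs)]
    segs = [vecs[a:b] for a, b in zip(bounds, bounds[1:])]
    cents = [centroid(s) for s in segs]
    # Within-field terms: each vector compared with its own field centroid.
    within = sum(dist(v, c) + angle(v, c)
                 for s, c in zip(segs, cents) for v in s)
    # Between-field terms: centroids of adjacent fields compared pairwise.
    between = sum(dist(c1, c2) + angle(c1, c2)
                  for c1, c2 in zip(cents, cents[1:]))
    return between - within


def best_segmentation(vecs, n_fields):
    """Exhaustively try every way to cut vecs into n_fields contiguous fields."""
    candidates = combinations(range(1, len(vecs)), n_fields - 1)
    return max(candidates, key=lambda cuts: score(vecs, cuts))


if __name__ == "__main__":
    sentences = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
    print(best_segmentation(sentences, 2))
```

Exhaustive search is globally optimal only in the trivial sense that it checks every candidate, and its cost grows combinatorially with the number of sentences, so it is useful for illustrating the criterion on small inputs rather than as a practical solver.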

  • 【Online publication contributor】 Jilin University
  • 【Online publication issue】 2010, No. 07