基于软件人的情境主题分析及应用研究

Research on Contextual Topic Analysis Based on Softman and Its Application

【Author】 Zhou Yipeng (周亦鹏)

【Advisors】 Yang Yang (杨扬); Tu Xuyan (涂序彦)

【Author information】 University of Science and Technology Beijing (北京科技大学), Computer Application Technology, 2012, Ph.D.

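【Abstract, translated from the Chinese】 The growth of Internet social networks will drive the Internet's evolution from a network of information into a network of people. The Softman (软件人) will become the virtual entity through which a person takes part in Internet social network activities, and the person's personal data space will be mapped into the Softman's virtual brain. In this new network environment, monitoring and topic analysis of Internet information depend on the distinct topic patterns that each Softman holds for different topic content under different contexts, that is, on its personalized and contextualized language model.
This dissertation is supported by the National Natural Science Foundation of China project "Cross-media Data Mining on Emergencies" (No. 91024001), the Beijing Natural Science Foundation project "Data Mining and Intelligent Prediction for Tourism Emergencies" (No. 4082021), and the Beijing Natural Science Foundation project "Fusion of 'Softman' and Linux" (No. 4072018). Taking Internet information monitoring of food safety incidents and tourism information services as the application background, it studies the theory, key technologies and system implementation of Softman-based contextual topic modeling, contextual topic analysis of text, automatic labeling of contextual topic models, and cross-media topic analysis. The main contributions are as follows:
1) A contextual topic model is proposed for information-monitoring Softmen in an Internet social network monitoring environment made up of Softmen. A formal definition of a contextual topic is given, and context variables are introduced into a mixture probabilistic topic model to build a contextual topic model that realizes the Softman's cognition. The model uses the conditional distributions of topics and other contexts to analyze how, and how strongly, topic content changes across contexts, and it integrates prior knowledge through a general component. The model's effectiveness is verified experimentally in both text and cross-media topic analysis.
2) Building on the contextual topic model, spatiotemporal context is introduced and a spatiotemporal contextual topic analysis method is proposed. The method transforms text from the word feature space into the topic space, associates multi-topic distributions with spatiotemporal context, and describes the period and strength of topics. Contextual topics are discovered and tracked in the topic space by means of improved temporal clustering and the EM algorithm. Experiments show that the method outperforms topic discovery and tracking in the word space.
3) A topic labeling method is proposed. Topic-associated word sets are built through semantic classification to label topic models. By selecting topic words with high semantic coverage and discrimination, the method automatically generates understandable labels for contextual topic models, explains the probabilistic characteristics of each context, and addresses the difficulty ordinary users have in understanding probabilistic language models. Experiments show that the method outperforms labeling with high-probability topic words; for food safety topics in particular, it approaches the accuracy of manual labeling.
4) A method is proposed that uses a visual topic model for cross-media topic analysis. Image semantics are expressed as visual words, and a visual topic learning method is given. A mapping between text topics and image semantics is established so that text topics are also described in terms of visual semantics, which yields a unified description and contextual topic modeling of cross-media data. Experiments show that the method improves the poor topic discovery accuracy otherwise seen on short texts.
5) On the basis of the above research, a food safety incident monitoring system and an intelligent tourism information push-pull system are implemented. They are applied to monitoring Internet information on food safety incidents and to personalized tourism information services, respectively.
The results of the dissertation help in analyzing topics in increasingly complex Internet information, monitoring information in specific domains or on specific topics, and identifying hot topics, so that one can respond effectively or provide targeted, personalized information services.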

【Abstract】 The development of Internet social networks will promote the evolution of the Internet from a network of information to a network of people. The Softman will play the role of the virtual entity through which people participate in social network activities, and the personal data space will be mapped into the virtual brain of the Softman. In this new network environment, the monitoring and topic analysis of Internet information rely on the topic patterns that each Softman holds for different contexts, which is also called a personal and contextual topic model.
The dissertation is based on research funded by the National Natural Science Foundation of China (The Research of Cross-media Data Mining on Emergency Information, No. 91024001), the Beijing Natural Science Foundation (The Research of Data Mining on Tourism Emergency Information and Intelligent Prediction, No. 4082021), and the Beijing Natural Science Foundation (The Research of Fusion of 'Softman' and Linux, No. 4072018). With Internet information monitoring of emergencies and tourism information services as the application background, the dissertation studies the theory of Softman contextual topic modeling, topic pattern extraction and cross-media topic analysis. Solutions to the key technologies are also proposed and implemented in application systems. The main results of the dissertation are summarized as follows.
1) A contextual topic model for a Softman-based Internet social network monitoring environment is presented. A formal definition of context is given to describe the background of a topic. The cognition of the Softman is implemented by establishing a mixture contextual topic model. The contextual model is built by introducing contextual variables into the probabilistic topic model, and the changes of a topic in content and intensity under different scenarios are analyzed through the conditional distribution of the topic and other contexts. Moreover, prior knowledge is incorporated into the model through a general component. The effectiveness of the model is verified by experiments on text and cross-media topic analysis.
2) A spatiotemporal topic analysis method is presented by introducing spatial and temporal information into the contextual topic analysis framework. In this way, the distribution of the mixture of subtopics is associated with spatiotemporal context to describe the lifecycle and strength of topics. An improved temporal clustering and EM algorithm is given to achieve contextual topic discovery and tracking. Experimental results show that this method performs better than topic discovery and tracking in the word space.
3) An automatic topic labeling method is presented to make the probabilistic topic model understandable to ordinary users. An associated topic word extraction method based on semantic classification is proposed to build the candidate label set. A label selection method is also given to automatically select topic words with high semantic coverage and distinction to explain the characteristics of various contextual models. Experimental results show that this method is better than labeling with high-probability topic words. In particular, for food safety topics, the accuracy is close to that of manual annotation.
4) A cross-media topic analysis method using a visual topic model is presented, in which the meaning of an image is described with visual words. A visual topic learning algorithm is given to establish relationships between texts and images by mapping the topics of texts to those of images, so that the topics of texts are also described with visual words, achieving a uniform description and topic modeling of cross-media data. Experimental results show that this method improves the accuracy of topic detection for short text data.
5) On the basis of the above research, a food safety information monitoring system and an intelligent tourism information push-pull system are designed, which are applied to food safety incident monitoring and personal tourism information services, respectively.
This work contributes to topic analysis and monitoring of increasingly complex Internet information. It can be used to discover hot topics and helps to improve decision-making or to provide personalized information services.
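The mixture-model idea in item 1) can be illustrated with a small sketch. The code below is not the dissertation's algorithm: it is a plain EM fit of a mixture topic model with one fixed "general" component, taken here to be the collection-wide word distribution, and it leaves out explicit context variables. As a simplifying assumption, each row of the count matrix may stand for all text observed under one context (for example one time slice or location), so that the row-topic weights can be read as topic strength per context. The function name `fit_background_mixture` and the parameters `bg_weight`, `n_iter` and `seed` are hypothetical.

```python
import numpy as np


def fit_background_mixture(counts, n_topics, bg_weight=0.5, n_iter=50, seed=0):
    """EM for a mixture topic model with a fixed general (background) component.

    counts: (n_rows, n_words) term counts; a row is one document or one context.
    Returns (pi, theta, log_likelihoods), where pi[r, k] is the weight of topic k
    in row r and theta[k, w] is the probability of word w under topic k.
    """
    counts = np.asarray(counts, dtype=float)
    n_rows, n_words = counts.shape
    rng = np.random.default_rng(seed)
    eps = 1e-12

    # General component: collection-wide relative word frequencies (assumption).
    background = counts.sum(axis=0) / counts.sum()

    # Random topic-word distributions, uniform row-topic weights.
    theta = rng.random((n_topics, n_words)) + 1e-3
    theta /= theta.sum(axis=1, keepdims=True)
    pi = np.full((n_rows, n_topics), 1.0 / n_topics)

    log_likelihoods = []
    for _ in range(n_iter):
        # E-step: responsibilities of topics and of the general component.
        joint = pi[:, :, None] * theta[None, :, :]        # (rows, topics, words)
        topic_mix = joint.sum(axis=1)                     # (rows, words)
        p_word = bg_weight * background + (1.0 - bg_weight) * topic_mix
        log_likelihoods.append(float((counts * np.log(p_word + eps)).sum()))
        resp_topic = joint / (topic_mix[:, None, :] + eps)            # p(topic | row, word)
        resp_not_bg = (1.0 - bg_weight) * topic_mix / (p_word + eps)  # p(not general | row, word)

        # M-step: re-estimate from counts not explained by the general component.
        weighted = counts[:, None, :] * resp_not_bg[:, None, :] * resp_topic
        pi = weighted.sum(axis=2) + eps
        pi /= pi.sum(axis=1, keepdims=True)
        theta = weighted.sum(axis=0) + eps
        theta /= theta.sum(axis=1, keepdims=True)
    return pi, theta, log_likelihoods


if __name__ == "__main__":
    # Toy counts: 4 rows (contexts) over a 6-word vocabulary.
    toy = np.array([
        [5, 4, 0, 0, 1, 1],
        [4, 6, 1, 0, 1, 1],
        [0, 0, 5, 6, 1, 1],
        [0, 1, 4, 5, 1, 1],
    ])
    pi, theta, ll = fit_background_mixture(toy, n_topics=2)
    print(np.round(pi, 3))
```

Holding the general component fixed lets it absorb words that are common everywhere, so the learned topics concentrate on words that distinguish one context from another. A fuller model in the spirit of the abstract would condition the topic-word distributions on context as well.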

【Key words】 Softman (软件人); topic analysis (主题分析); context (情境); language model (语言模型)