Topic Detection and Tracking Based on the Dirichlet Process Mixture Model

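【Author】 Wang Chan (王婵)

【Supervisor】 Wang Xiaojie (王小捷)

【Author information】 Beijing University of Posts and Telecommunications, Computer Science and Technology, 2013, PhD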

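【Abstract (translated from the Chinese)】 The Internet has become an important channel through which people obtain news. Classifying existing news stories by topic, then tracking new stories on specific topics and returning them to the user, not only saves users time in obtaining relevant news but also provides an effective way to organize online news data by topic; there is broad practical demand for it. Achieving this requires solving two key problems: first, how to automatically group the news stories initially presented to the user according to whether they concern the same or different topics; second, how to automatically judge whether a newly arrived story belongs to a known topic or to a new one. These two problems are topic detection and topic tracking, respectively. Research on topic detection and tracking has a history of nearly twenty years and has made considerable progress, but some problems remain: for example, how to determine the number of topics in topic detection, and the data sparseness, topic excursion, and topic deviation problems faced in topic tracking. Addressing these problems, this thesis studies topic detection and topic tracking techniques separately, proposes a series of solutions within the unified framework of the Dirichlet process mixture model (DPMM), and finally combines these solutions into a system scheme that meets application requirements such as saving users time in obtaining news and organizing Internet news data by topic. The main work and results are as follows:
(1) To address the difficulty of fixing the number of topics in advance when prior knowledge is lacking, DPMM is introduced into topic detection and a DPMM-based topic detection model is proposed. The model does not need the number of topics to be given in advance; it determines the number automatically from the input news stories. The model assumes that every story corresponds to a distribution over topics and takes the topic with the highest probability as the story's topic label. Experiments show that the DPMM-based model achieves better detection performance than existing methods, with a lowest detection cost of 0.0981, more than 50% lower than topic detection models based on traditional clustering algorithms.
(2) A context-aware Gibbs sampling method (C_Gibbs) is proposed. When computing the sampling probability for a word, it also takes into account the other words in that word's context, so as to model correlations among words within the same story. Experiments show that, compared with standard Gibbs sampling, parameter inference based on C_Gibbs substantially improves the performance of the detection system.
(3) A DPMM that effectively incorporates information about the target topics is proposed for static topic tracking. During Gibbs-sampling-based parameter inference, the model incorporates the target-topic information to obtain the relevance between each story and each target topic. The final tracking result is then determined by voting over the results of multiple Gibbs sampling runs. Experimental results show that the model needs only a few seed stories to improve tracking performance significantly, with a lowest tracking cost of 0.0723, 45% lower than a topic tracking model based on a unigram language model. The voting method also keeps performance stable.
(4) To address topic excursion in topic tracking and the topic deviation observed in existing adaptive methods, a new adaptive topic tracking method is proposed on top of the DPMM-based static tracking model. Its basic idea is to take tracking feedback into account during tracking and, when computing topic-story relevance, to assign the feedback a parameter M_reli that controls the error introduced by feedback from off-topic stories. Experimental results show that the method not only alleviates topic excursion to some extent but also effectively suppresses the topic deviation seen in existing adaptive algorithms. The model's lowest tracking cost is 0.0677, 6% lower than the static topic tracking model.
(5) Combining the topic detection and tracking techniques proposed in the thesis, a topic detection and tracking system scheme is designed to meet the application requirements described above. The system first uses topic detection and tracking to organize the news story stream into story clusters, each corresponding to one topic, while extracting from the stream tags that describe each topic's content; it then presents the related stories and tags together to the user, thereby saving users time in obtaining news and organizing Internet news data by topic.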
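
How a DPMM lets the number of topics grow with the data, rather than fixing it in advance, can be illustrated with a small collapsed Gibbs sampler. This is a sketch and not the thesis's model: it assumes one topic per story, a Dirichlet-multinomial likelihood over word counts, and hypothetical names and hyperparameters (`dpmm_gibbs`, `alpha`, `beta`) chosen only for illustration.

```python
import math
import random
from collections import Counter


def log_dm_predictive(doc, cluster_counts, cluster_total, vocab_size, beta):
    """Log probability of a bag-of-words doc (word -> count) given the word
    counts already in a cluster, under a symmetric Dirichlet(beta) prior.
    The multinomial coefficient is omitted; it is the same for every cluster."""
    logp = 0.0
    for word, n in doc.items():
        base = cluster_counts.get(word, 0) + beta
        for j in range(n):
            logp += math.log(base + j)
    for j in range(sum(doc.values())):
        logp -= math.log(cluster_total + vocab_size * beta + j)
    return logp


def dpmm_gibbs(docs, vocab_size, alpha=1.0, beta=0.1, sweeps=50, seed=0):
    """Assign each doc (a list of word ids) to a cluster. The number of
    clusters is not given; new clusters open with weight alpha (assumed)."""
    rng = random.Random(seed)
    bags = [Counter(d) for d in docs]
    assign = [None] * len(bags)
    counts = {}  # cluster id -> Counter of word counts
    sizes = {}   # cluster id -> number of docs in the cluster
    next_id = 0
    for _ in range(sweeps):
        for i, bag in enumerate(bags):
            # Remove doc i from its current cluster, dropping empty clusters.
            k_old = assign[i]
            if k_old is not None:
                counts[k_old].subtract(bag)
                sizes[k_old] -= 1
                if sizes[k_old] == 0:
                    del counts[k_old], sizes[k_old]
            # Score every existing cluster plus one brand-new cluster.
            labels, logw = [], []
            for k in sizes:
                total = sum(counts[k].values())
                labels.append(k)
                logw.append(math.log(sizes[k]) +
                            log_dm_predictive(bag, counts[k], total,
                                              vocab_size, beta))
            labels.append(next_id)
            logw.append(math.log(alpha) +
                        log_dm_predictive(bag, {}, 0, vocab_size, beta))
            # Sample a label in proportion to the (stabilized) weights.
            m = max(logw)
            weights = [math.exp(w - m) for w in logw]
            k_new = rng.choices(labels, weights=weights)[0]
            if k_new == next_id:
                counts[k_new] = Counter()
                sizes[k_new] = 0
                next_id += 1
            counts[k_new].update(bag)
            sizes[k_new] += 1
            assign[i] = k_new
    return assign


# Toy usage: four short "stories" over a vocabulary of four word ids.
toy_docs = [[0, 0, 1], [0, 1, 1], [2, 3, 3], [2, 2, 3]]
toy_labels = dpmm_gibbs(toy_docs, vocab_size=4, sweeps=20, seed=1)
```

A larger `alpha` makes opening a new cluster more likely, which is the only place where the expected number of topics is influenced; no topic count is passed in.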

【Abstract】 The Internet has become an important way of obtaining news. Grouping large volumes of news stories by their latent topics and tracking news on a specific topic not only reduces the time users spend finding news of interest but also provides an efficient topic-oriented organization of information. Two key problems must be solved to implement such an organization: how to automatically group initial news stories according to the latent topics they discuss, and how to automatically associate incoming stories with topics that are known in advance or cluster them into new topics. These two problems correspond to topic detection and topic tracking. Much progress has been made in research on topic detection and tracking, but some problems remain: for instance, how to determine the number of topics in topic detection, and how to deal with severe data sparseness, topic excursion, and topic deviation in topic tracking. To overcome these problems, this thesis investigates a Bayesian nonparametric approach, the Dirichlet Process Mixture Model (DPMM). First, DPMM is applied to topic detection and topic tracking separately. Then DPMM is refined to resolve the two tasks simultaneously and is verified to be effective under various data settings. Finally, by integrating topic detection and tracking, a system scheme is designed to reduce the time users spend finding news of interest and to meet the application requirement of topic-oriented organization of Internet information. The main research work and achievements are as follows:
(1) To overcome the subjectivity of choosing the number of topics when prior knowledge about the topics is lacking, a topic detection model based on DPMM is proposed. The model does not fix the number of topics but determines it automatically while processing the news stories. DPMM assumes that every story corresponds to a topic distribution and assigns to the story the topic with the maximum probability. The experimental results indicate that the DPMM-based topic detection model outperforms several existing methods. The lowest detection error cost is 0.0981, more than 50% lower than that of traditional clustering-based topic detection models.
(2) To relax the word independence assumption in DPMM, contextual information is introduced into Gibbs sampling during parameter inference. The improved sampling method takes contextual words into account when computing the sampling probability of a word, which reflects the word correlations found in natural language. The experimental results show that the improved parameter inference method yields better topic detection performance.
(3) To alleviate the shortage of on-topic stories in the static topic tracking task, prior knowledge of the known topics is exploited in the Gibbs sampling procedure. The tracking results are then obtained by voting over the Gibbs sampling results. The experiments indicate that the prior knowledge improves tracking performance significantly even with only a few on-topic stories. The lowest tracking error cost is 0.0723, 45% lower than that of the topic tracking method based on the unigram model. Moreover, the voting method ensures stable performance.
(4) To overcome topic excursion, and the topic deviation brought about by existing adaptive learning mechanisms, the thesis presents a new adaptive tracking method based on DPMM. Its basic idea is to assign the tracking feedback a metric, M_reli, that controls the errors introduced by feedback from off-topic stories. The experimental results show that the adaptive DPMM model, without large-scale in-domain data, significantly reduces both topic excursion in the tracking task and the topic deviation caused by existing adaptive learning mechanisms. The lowest tracking error cost is 0.0677, 6% lower than that of the static topic tracking model.
(5) Based on the above topic detection and topic tracking techniques, a topic detection and tracking system is designed to meet practical application requirements. The system first organizes the news story stream into story clusters, each corresponding to one topic, and extracts tags describing each topic from the stream. Story clusters and topic tags are then presented to users. The scheme achieves the goals of reducing the time users spend finding news of interest and organizing Internet news stories by their latent topics.
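
The voting step in (3) and the error costs quoted above can be made concrete with a short sketch. The abstract does not define the cost or the voting rule, so two assumptions are made here: the cost is taken to be the normalized detection cost used in the TDT evaluations, with the customary constants C_miss = 1, C_fa = 0.1 and P_target = 0.02, and the vote is a strict majority over runs. The function names are hypothetical.

```python
def majority_vote(runs):
    """Combine on-topic decisions from several sampling runs.
    runs: list of runs, each a list of booleans (one per story).
    A story is on-topic if a strict majority of runs says so (assumed rule)."""
    n_runs = len(runs)
    return [2 * sum(column) > n_runs for column in zip(*runs)]


def normalized_cost(decisions, truth, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Normalized detection cost, assuming the TDT-style definition:
    C_det = C_miss * P_miss * P_target + C_fa * P_fa * (1 - P_target),
    divided by min(C_miss * P_target, C_fa * (1 - P_target))."""
    n_target = sum(1 for t in truth if t)
    n_nontarget = len(truth) - n_target
    misses = sum(1 for d, t in zip(decisions, truth) if t and not d)
    false_alarms = sum(1 for d, t in zip(decisions, truth) if d and not t)
    p_miss = misses / n_target if n_target else 0.0
    p_fa = false_alarms / n_nontarget if n_nontarget else 0.0
    c_det = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    return c_det / min(c_miss * p_target, c_fa * (1.0 - p_target))


# Toy usage: three runs, three stories; the second and third stories each
# receive only one on-topic vote out of three, so they are rejected.
toy_runs = [
    [True, False, True],
    [True, True, False],
    [True, False, False],
]
toy_decisions = majority_vote(toy_runs)
toy_cost = normalized_cost(toy_decisions, truth=[True, False, True])
```

Under this definition a system that rejects every story scores 1.0, so the reported values below 0.1 would indicate few misses and few false alarms, provided the thesis uses the same normalization.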
