基于内容分析的Blog话题检测方法研究

Research on Topic Detection in Blogosphere Based on Content Analysis

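【Author】 He Jinyan (何金艳)

【Advisors】 Huang Zhexue (黄哲学); Ye Yunming (叶允明);

【Author information】 Harbin Institute of Technology, Computer Science and Technology, 2010, Master's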

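【Abstract (from the Chinese original)】 Topic detection is an information-processing technique that identifies previously unknown topics in a stream of text, and it is an important component of topic detection and tracking. The technique aims to expand an event that occurs at a specific time and place into a topic with a broader set of related content, and it has considerable practical value for information extraction and public-opinion monitoring. Most common topic detection algorithms currently target news-site corpora, which follow patterns of burstiness and continuity, whereas algorithms designed specifically for the blogosphere are not yet mature. The reason is that blogs are a personal medium and, compared with news corpora, are characterized by very large data volume and diverse forms. Through an in-depth analysis of the structure of blog data, this thesis clarifies the main technical requirements of topic detection on blog data. To address the diversity of blog data, the necessary properties are selected and converted into a new topic model, built mainly around a topic centroid and a keyword sequence. On the basis of this model, a topic detection algorithm, a topic keyword extraction algorithm, and a special topic extraction algorithm are designed. The main contributions of this thesis are as follows:
(1) A topic model suited to the characteristics of blog data is designed. The model consists of several features: topic name, keyword sequence, topic centroid, set of blog posts, and topic start time. The model runs through the three core algorithms of the thesis: the topic detection algorithm and the topic keyword extraction algorithm generate the topic model from blog posts, and the special topic extraction algorithm further organizes topics on the basis of the topic model.
(2) After an analysis of commonly used text clustering algorithms, incremental clustering is chosen as the basis of the topic detection algorithm. Three optimization strategies are introduced to improve detection quality: topic centroid updating, text filtering, and topic model selection. Comparative experiments demonstrate the effectiveness of the topic detection algorithm.
(3) A topic keyword extraction algorithm is designed to extract a set of representative words for each topic. The algorithm mainly applies the mutual information principle from text feature selection and adds an optimization strategy that gives extra weight to words appearing in blog post titles. Experiments demonstrate the effectiveness of the keyword extraction algorithm.
(4) A special topic extraction algorithm is implemented on top of the topic model. The algorithm is based on hierarchical clustering and uses three of the topic model's features: keyword set, topic centroid, and topic start time. A separate similarity formula is defined for each feature in order to compute the similarity between topic models. Experiments demonstrate the effectiveness of the special topic extraction algorithm.
Based on these results, the thesis designs a blog topic detection system consisting of five modules: a database module, a data preprocessing module, a topic detection module, a topic model feature extraction module, and a special topic extraction module. A prototype Blog topic detection system has been implemented, laying a solid foundation for further research on blog topic detection.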
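The incremental clustering with topic centroid updating named in contribution (2) can be illustrated with a minimal single-pass sketch. The abstract does not give the similarity measure, the threshold, or the update rule, so the cosine similarity over raw term counts, the fixed `threshold`, the running-mean centroid, and all names (`Topic`, `detect_topics`) are assumptions made for illustration only. The text filtering and topic model selection strategies are not shown.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Hypothetical topic record holding a few of the features listed in the abstract."""
    centroid: dict                              # topic centroid: term -> mean weight
    posts: list = field(default_factory=list)   # indices of member posts
    start_time: int = 0                         # time of the post that started the topic


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_topics(posts, threshold=0.3):
    """Single-pass incremental clustering over (time, tokens) pairs.

    Each post joins the most similar existing topic if the similarity
    reaches `threshold` (an assumed value); otherwise it starts a new topic.
    """
    topics = []
    for idx, (time, tokens) in enumerate(posts):
        vec = dict(Counter(tokens))
        best, best_sim = None, 0.0
        for topic in topics:
            sim = cosine(vec, topic.centroid)
            if sim > best_sim:
                best, best_sim = topic, sim
        if best is not None and best_sim >= threshold:
            # Centroid update (assumed rule): running mean over member posts.
            n = len(best.posts)
            terms = set(best.centroid) | set(vec)
            best.centroid = {
                t: (best.centroid.get(t, 0.0) * n + vec.get(t, 0.0)) / (n + 1)
                for t in terms
            }
            best.posts.append(idx)
        else:
            topics.append(Topic(centroid=vec, posts=[idx], start_time=time))
    return topics
```

Because each post is compared only with the current topic centroids, the cost per post grows with the number of topics rather than the number of posts, which is one plausible reason to prefer an incremental scheme for a large blog stream.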

【Abstract】 Topic detection is a technique for identifying previously unknown topics in a stream of text, and it is an important component of topic detection and tracking. It seeks to expand an event that occurs at a particular time and place into a topic with broader related content, which has great practical value in information extraction and public-opinion monitoring. At present, most common topic detection algorithms are designed for news-website corpora, while algorithms for the blogosphere are not yet mature. This is because blogs are a personal medium: compared with news, a blog corpus is far larger and more varied in form. This thesis analyses the structure of blog data in depth and identifies the main requirements of topic detection on blog data. It designs a topic model based on the characteristics of blog data, with the topic center and the keyword set as its main features. The topic detection algorithm, the keyword extraction algorithm, and the special topic extraction algorithm are all based on this topic model. The main contributions of this thesis are as follows:
1. A topic model is designed based on the characteristics of blog data. The model contains five features: topic name, keyword set, topic center, posts of the topic, and start time of the topic. All algorithms in this thesis build on the model: the topic detection algorithm and the keyword extraction algorithm create its features, and the special topic extraction algorithm operates on the resulting topic models.
2. Various types of text clustering algorithms are analysed, and incremental clustering is chosen as the main component of the topic detection algorithm. Three optimization strategies are introduced: topic center update, text filtering, and topic model selection. Experiments demonstrate the effectiveness of the topic detection algorithm.
3. A topic keyword extraction algorithm is designed to extract keywords for each topic. The words contained in each topic are weighted by the mutual information formula, and words that appear in post titles receive extra weight because they are more important for describing the topic.
4. A special topic extraction algorithm is built on the topic model. It uses three features of the model: keyword set, topic center, and start time of the topic, with a separate formula for each to calculate the similarity between topic models. Experiments demonstrate the effectiveness of the special topic extraction algorithm.
Based on these results, the thesis designs a topic detection system for the blogosphere. The system is composed of five modules: a database module, a data preprocessing module, a topic detection module, a topic feature extraction module, and a special topic extraction module. This system provides a basis for further topic detection research in the blogosphere.
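The keyword extraction step in contribution 3 can be sketched as follows. This is only an illustration under stated assumptions: the abstract names mutual information and title weighting but gives no formula, so the sketch uses pointwise mutual information computed from document frequencies and an additive `title_bonus`. The function name `topic_keywords`, the post fields `topic`, `title`, and `body`, and the bonus value are all hypothetical.

```python
import math


def topic_keywords(posts, topic_id, top_k=3, title_bonus=0.5):
    """Rank the terms of one topic by pointwise mutual information (PMI).

    posts: list of dicts with hypothetical keys 'topic', 'title', 'body',
    where 'title' and 'body' are lists of tokens.
    """
    n_total = len(posts)
    in_topic = [p for p in posts if p["topic"] == topic_id]
    n_topic = len(in_topic)
    if n_topic == 0:
        return []

    # Document frequency of each term over the whole collection.
    df_all = {}
    for p in posts:
        for term in set(p["title"]) | set(p["body"]):
            df_all[term] = df_all.get(term, 0) + 1

    # Document frequency inside the topic, plus the terms seen in its titles.
    df_topic, title_terms = {}, set()
    for p in in_topic:
        title_terms |= set(p["title"])
        for term in set(p["title"]) | set(p["body"]):
            df_topic[term] = df_topic.get(term, 0) + 1

    scores = {}
    for term, a in df_topic.items():
        # PMI(term, topic) = log( P(term, topic) / (P(term) * P(topic)) )
        pmi = math.log(a * n_total / (df_all[term] * n_topic))
        # Assumed title weighting: a fixed additive bonus for title terms.
        scores[term] = pmi + (title_bonus if term in title_terms else 0.0)

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]
```

The bonus is additive rather than multiplicative because PMI can be negative, and multiplying a negative score by a factor greater than one would push a title word down the ranking instead of up.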

  • 【Classification code】TP393.092
  • 【Times cited】5
  • 【Downloads】335