

Exploring Temporal Text Mining for News Content Anatomy and Recommendation

【Author】 Chen Wei (陈伟)

【Advisors】 Chen Chun (陈纯); Bu Jiajun (卜佳俊)

【Author information】 Zhejiang University (浙江大学), Computer Science and Technology, 2010, PhD

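【Abstract (translated from the Chinese)】 The birth and growth of the Internet have greatly promoted the spread of information. As an important channel for information dissemination, Web news plays a very important role on the Internet and has become one of the applications most frequently used by Internet users. Web news is "reporting of recently occurred facts" published on the Web. Compared with traditional news media, it has large advantages in timeliness, capacity, richness, ease of interaction, ease of retrieval, and multimedia presentation, and it brings great convenience to people's lives. At the same time, the massive volume of Web news causes an information overload problem. To better meet the needs of different kinds of Web users and to improve their experience of obtaining news, research on automatic understanding and recommendation of Web news content is of great significance. News content understanding means extracting previously unknown, understandable, and ultimately usable knowledge from large amounts of news data, and using that knowledge to organize news so that users can obtain the information more easily. News recommendation analyzes users' news reading behavior to learn their preferences and, combined with an understanding of news content, recommends news that may interest them. These problems mostly deal with temporal text and involve many aspects of temporal text mining. Building on temporal text mining techniques, this thesis studies several problems in news content understanding and recommendation and proposes solutions. The specific work is as follows.

First, for event detection in temporal news corpora, the thesis proposes a bursty news event detection method based on bursty feature analysis. Feature trails are introduced to represent the features of a temporal news corpus as time series. A wavelet-domain representation of feature trails is proposed, and a multi-scale burst analysis algorithm is introduced to detect bursty features and their burst spans. A bursty event detection algorithm based on affinity propagation clustering is proposed: it takes as input the similarity of feature burst patterns, the overlap of the news documents containing the features, and feature energy (the burst strength of a feature), clusters the bursty features into events, and uses event energy to measure the burst level of each event.

For online bursty event detection in temporal news, the thesis proposes an online method for detecting bursty news events and analyzing their evolution. A multi-scale sliding window monitors feature trails in real time, and an online multi-scale bursty feature detection method finds bursty features with different burst spans in the current time window. An exponential decay factor is applied to the feature trails, and the association between bursty features is computed on that basis. Affinity propagation clustering is again used to cluster bursty features into bursty events, with energy measuring the burst level of each event. Finally, an information retrieval method based on cosine similarity is proposed to discover how events evolve along the timeline.

To address the shortcomings of bursty event detection algorithms for temporal news in real-time performance and accuracy, the thesis further proposes an online bursty event detection method based on hypothesis testing. A feature stream representation based on random processes is proposed, and a goodness-of-fit test and a left-sided test are used to detect bursty features. The correlation among bursty features is analyzed, and an evolutionary spectral clustering algorithm clusters highly correlated bursty features into events. The algorithm has better real-time performance and detects certain bursty features and events more accurately.

To help people better understand temporal news, the thesis proposes a topic decomposition and summarization method for temporal news. Non-negative Matrix Factorization (NMF) is applied to the keyword-sentence association matrix of the temporal news to obtain sub-topic information. By analyzing the encoding vectors produced by NMF, the events belonging to each sub-topic are discovered, and summaries are generated for the sub-topics and the events they contain. Sentences are ranked based on the encoding matrix, and the top-ranked sentences of each sub-topic are selected as the summary of the temporal news.

For the Web news access needs of visually impaired and elderly users, the thesis proposes and implements a personalized audio Web news recommendation and integrated mining platform. An architecture for personalized audio Web news recommendation is proposed that lets various terminals obtain personalized audio news over HTTP. The architecture supports two levels of personalization: it provides adaptive navigation of news channels, and it automatically pushes relevant news according to a user's interests in multiple topics. Finally, the system (the 网络搜音机 service system) is designed and implemented. Beyond these functions, and building on the news content understanding work above, the system integrates hot event detection, user interest discovery, and visualization of hot events and user interests, providing users with effective information access services.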
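The feature-trail idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's wavelet-based method: it assumes a feature trail is the per-day fraction of documents containing a term, and it flags a window as bursty when its sum exceeds the mean plus `k` standard deviations of all windows of the same size. The function names `feature_trail` and `bursty_windows`, the window sizes, and the threshold `k` are hypothetical choices made for illustration only.

```python
from statistics import mean, pstdev


def feature_trail(docs_by_day, term):
    """Return, for each day, the fraction of documents that contain `term`."""
    trail = []
    for docs in docs_by_day:
        if not docs:
            trail.append(0.0)
            continue
        hits = sum(1 for doc in docs if term in doc.lower().split())
        trail.append(hits / len(docs))
    return trail


def bursty_windows(trail, window_sizes=(1, 2, 4), k=2.0):
    """Naive multi-window burst check (assumed rule, not the thesis's algorithm).

    A window (start, size) is reported when its sum exceeds
    mean + k * std of the sums of all windows of that size.
    """
    bursts = []
    for size in window_sizes:
        sums = [sum(trail[i:i + size]) for i in range(len(trail) - size + 1)]
        if len(sums) < 2:
            continue  # not enough windows to estimate a threshold
        threshold = mean(sums) + k * pstdev(sums)
        bursts.extend((i, size) for i, s in enumerate(sums) if s > threshold)
    return bursts


# Toy corpus: three days of headlines, already grouped by day.
days = [
    ["storm hits coast", "market rallies"],
    ["storm warning issued", "storm damage grows"],
    [],
]
print(feature_trail(days, "storm"))

# A synthetic trail with one spike on day 5.
spiky = [0.1] * 10
spiky[5] = 0.9
print(bursty_windows(spiky))
```

Checking several window sizes is what lets a detector of this kind report bursts of different durations; a real system would replace the fixed threshold with the statistical model it actually trusts.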

【Abstract】 The rapid growth of the Internet has greatly accelerated information propagation. Web news plays a very important role on the Internet and has already become one of the most widely used Web applications. Web news is the reporting of recent events published on the Web. Compared with traditional news media, Web news has many advantages, such as freshness, capacity, richness, interactivity, and searchability. It greatly helps users obtain information from the outside world. However, the massive amount of Web news also brings an information overload problem.

News content anatomy and recommendation can help meet users' requirements for Web news. News content anatomy is the process of extracting previously unknown, understandable, and usable patterns from news content. Based on an analysis of users' usage patterns of Web news, a recommendation system automatically pushes preferred news to users. Both news content anatomy and recommendation deal with temporal text, and temporal text mining techniques are key to both. By exploring temporal text mining, we study multiple problems in news content anatomy and recommendation, as follows.

We first propose a bursty event detection method that analyzes bursty features in a temporal news corpus. The features in the corpus are represented as feature trails and are then transformed to the wavelet domain. We introduce an elastic burst detection algorithm to identify multi-scale bursty features and model them as vectors. With the preference set to each feature's power (bursty level), the affinity propagation clustering algorithm groups bursty features that have high document overlap and identical distributions in bursty time windows. Events are then returned to users in order of their power.

We then study a particular news stream monitoring task: timely detection of bursty events that have happened recently and discovery of their evolutionary patterns along the timeline.
We use a multi-resolution sliding window to monitor the feature trails and apply an online multi-resolution burst detection method to identify bursty features with different bursty durations within the recent time window. We cluster bursty features to form bursty events and associate each event with a power value that reflects its bursty level. An information retrieval method based on cosine similarity is used to discover each event's evolution along the timeline.

We further introduce an online event detection algorithm for news streams. First, we represent a feature stream as a random process and apply a goodness-of-fit test to find features with obvious changes in the distribution of term frequency in a news document. A left-sided significance test is then used to validate bursty features. Finally, an evolutionary spectral clustering algorithm groups highly correlated bursty features into bursty events.

To help users understand various aspects of a temporal news stream, we study topic decomposition and summarization for a temporally sequenced text corpus on a specific topic. We derive sub-topics by applying Non-negative Matrix Factorization (NMF) to the terms-by-sentences matrix of the temporal news stream. We then detect the incidents of each sub-topic and generate summaries for both the sub-topic and its incidents by examining the constitution of its encoding vector generated by NMF. Finally, we rank the sentences based on the encoding matrix and select the top-ranked sentences of each sub-topic as the summary of the temporal news corpus.

Finally, we present an architecture for providing personalized phonic Web news on Internet-connected consumer electronics. It provides two types of personalization. An adaptive channel navigation method is introduced to help users reach relevant channels quickly. In addition, a news recommendation strategy is proposed to track multiple threads of users' interests and provide users with preferred news.
We implement this system, named EagleRadio. EagleRadio not only provides personalized phonic news but also integrates some news content anatomy functions, such as bursty event detection, user interest modeling, and visualization.
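The NMF-based summarization step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: it uses the standard multiplicative update rules for the squared-error objective, a tiny hand-made terms-by-sentences matrix, and hypothetical helper names (`nmf`, `top_sentences`). The abstract does not specify the update rule, the initialization, or the ranking formula, so those are assumptions here.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (terms x sentences) as w @ h.

    w is the basis matrix (terms x k); h is the encoding matrix
    (k x sentences). Multiplicative updates keep all entries non-negative.
    """
    rng = random.Random(seed)
    n_terms, n_sents = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n_terms)]
    h = [[rng.random() + 0.1 for _ in range(n_sents)] for _ in range(k)]
    for _ in range(iters):
        # h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_sents)]
             for i in range(k)]
        # w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n_terms)]
    return w, h


def top_sentences(h, per_topic=1):
    """For each sub-topic (row of h), return the highest-weighted sentence indices."""
    ranked = []
    for row in h:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append(order[:per_topic])
    return ranked


# Toy matrix: terms 0-1 occur only in sentences 0-1, terms 2-3 only in sentences 2-3.
V = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
W, H = nmf(V, k=2)
print(top_sentences(H, per_topic=2))
```

Each row of `H` plays the role of one sub-topic, and each column is the encoding vector of one sentence, so sorting a row by weight gives a simple per-sub-topic sentence ranking. Which sub-topic ends up in which row depends on the random initialization.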

  • 【Online publication contributor】 Zhejiang University (浙江大学)
  • 【Online publication year and issue】 2011, Issue 08