
互联网文本信息挖掘与个性化推荐的研究

Research on Internet Text Mining and Personalized Recommendation

【Author】 Wen Yuan (温源)

【Advisor】 Liu Yun (刘云)

【Author information】 Beijing Jiaotong University, Communication and Information Systems, 2014, PhD

【摘要】 随着互联网技术的发展,网站的普及以及大量文本数据的出现,互联网已经成为了人们获取信息资源的一条重要渠道。但是网络数据成千上万,一个人无论如何用多久的时间也不可能完成对整个互联网的探索。因此简化对网络的探索过程,提高网络信息的检索效率就成为了当前网络时代的研究方向。好的信息挖掘方法可以提高人们的信息检索效率,能够提供准确、及时、可靠的网络信息汇总,提供适合人们阅读的摘要。同时,随着网络技术的发展,越来越多的网站出现了不需要人工搜索,就可获得信息的新途径,这些新途径就是信息推荐。在合适的时机,给合适的对象提供相关信息或相关产品推荐,能够提升用户浏览兴趣,提高网站的服务体验,并且增加用户对网站的粘度。推荐方法是继搜索引擎之后的又一大信息获取方法,该方法在未来有着很大的应用前景,不但对于互联网新闻消息、相关文本推荐有帮助,而且在电子商务、公司产品推广以及新产品扩展和传播等领域均具有重要的应用价值。鉴于此,本论文结合交叉学科的研究方法,针对现有互联网文本信息的特点提出网络热点话题发现算法以及网络自动摘要生成模型,并且通过研究网络用户之间的兴趣联系和用户偏好进而提出个性化推荐算法。本文分别从互联网文本数据采集与处理、文本信息聚类算法、热点信息挖掘、网络新闻摘要提取方法、协同过滤推荐算法、基于社团关系的信息推荐等方向和角度,对互联网的文本数据挖掘及个性化推荐进行了研究。论文的主要研究内容如下:1.研究了互联网文本信息采集与预处理技术,中文分词与聚类方法,并针对互联网文本信息的特点,提出了一种网络热点事件的发现算法。该方法通过引入文本词语的突发度量值,结合词语位置对权重的影响因素,完善了词语权重计算的准确度。此外,本文提出一种基于预设密度的聚类算法,该算法通过以相似的文本为核心的类簇,获得合理划分的文本主题。从而在不需要事先指定事件数的情况下,自动发现该时间段内的热点事件。实验结果表明,该算法在发现互联网热点事件的检测中有较好的效果。2.研究了对网络文本信息自动生成摘要的方法。该方法使得文本信息得以压缩,使用摘要的形式来表示文本,从而可以提供用户快速获取文本的主要内容。通过分析了互联网新闻自动摘要的特殊情况,针对多文本信息的摘要,提出了摘要主题的概念。局部主题就是在把互联网新闻划分成句子后,根据分层聚类形成的结果,产生的信息集合。其次,利用互联网新闻常附有人工评论信息的条件,进一步提高文本摘要的准确度。通过将新闻正文及评论的语句映射为网络节点,再引入网络中分析节点权重的HITS算法,来计算处于不同位置的句子的影响力。根据评论信息对新闻正文语句的影响程度,改进传统算法中计算这些语句的权重大小,进而影响了摘要句的选取。实验表明,使用评论信息的摘要算法比没有使用评论信息的摘要算法的效果更好。该研究为互联网条件下的信息抽取和自动摘要以及未来进一步的文本信息压缩提供了基础。3.研究了基于协同过滤的推荐算法。在传统的协同过滤基础上,改进了协同过滤推荐算法中的用户相似度计算,进而提高了推荐的准确度。通过考虑不同用户的共同喜好,以及他们各自偏好对相似度的影响,进而提出一种基于对数的相似度计算公式。并且在实际应用中,使用微博数据检验了改进后的推荐算法。对微博聚类形成不同的话题类,然后获得用户与这些话题类的关系网络,从而利用改进的协同过滤算法做推荐。实验的结果表明,基于微博数据的推荐能够有效的命中验证集中的数据,具有良好的推荐效果。新的推荐算法与传统的协同过滤算法相比,较大幅度的提高了推荐准确率,具有更好的个性化推荐效果。4.从推荐系统的角度出发,通过提出了两种不同社团形成模型,研究在不同社团形成条件下的适合的推荐方法。对此,提出了两种适合社团内相似度计算的模型,并与传统相似度模型对比,测试了几种相似度计算模型在以社团为推荐条件下的实际应用效果。实测中,以公认的Movielens数据集为验证数据,验证了基于社团形成的模型不但在推荐的准确度,以及推荐的多样性等方面都优于传统的热传导模型及概率传递模型。通过比较两种社团形成的模型,发现非严格划分的社团模型,与严格划分社团模型相比,拥有更高的推荐准确度与推荐多样性值。因此该种模型更适合推荐系统,尤其适合为个性化推荐提供服务。

【Abstract】 With the development of Internet technology, the spread of websites, and the emergence of large volumes of text data, the Internet has become an important channel through which people obtain information. However, the amount of data online is so vast that no individual could explore the entire Internet, however much time they spent. Simplifying the process of exploring the network and improving the efficiency of information retrieval have therefore become active research directions. Good information mining methods improve retrieval efficiency: they can provide accurate, timely, and reliable aggregation of online information, together with summaries suited to human reading. Meanwhile, as network technology develops, more and more websites offer new ways of obtaining information that do not require manual searching; these new ways are information recommendation. Providing relevant information or product recommendations to the right audience at the right time can raise users' interest in browsing, improve a website's service experience, and increase user stickiness. Recommendation is the next major means of information access after search engines. It has broad prospects and is valuable not only for recommending Internet news and related texts, but also for e-commerce, corporate product promotion, and the expansion and dissemination of new products. In view of this, the thesis combines interdisciplinary research methods and, based on the characteristics of Internet text, proposes an algorithm for discovering hot topics online and a model for automatic summary generation. It also proposes personalized recommendation algorithms by studying the interest links between users and their preferences.
This thesis studies Internet text mining and personalized recommendation from the perspectives of Internet text data acquisition and processing, text clustering algorithms, hot information mining, online news summarization, collaborative filtering recommendation, and community-based recommendation. The main contributions are as follows:
1) The thesis studies techniques for collecting and preprocessing Internet text, Chinese word segmentation, and clustering methods, and proposes a hot event discovery algorithm tailored to the characteristics of Internet text. By introducing a burst metric for words and taking into account the effect of word position on weight, it improves the accuracy of term weighting. It also proposes a preset-density-based maximum-link clustering algorithm, which builds clusters around similar texts to obtain a reasonable partition of text topics. The hot events of a given period can thus be discovered automatically, without specifying the number of events in advance. Experimental results show that the algorithm performs well in detecting Internet hot events.
2) The thesis studies automatic summarization of online text. Summarization compresses text and represents it in summary form, giving users quick access to its main content. After analyzing the particularities of automatic summarization of Internet news, the thesis introduces the concept of summary topics for multi-document summarization. A summary topic is the set of sentences obtained by splitting Internet news into sentences and grouping them by hierarchical clustering. Next, the thesis exploits the reader comments that often accompany Internet news to further improve summarization accuracy. Sentences from the news body and from the comments are mapped to network nodes, and the HITS algorithm, which analyzes node weights in a network, is then used to compute the influence of sentences in different positions.
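As a rough illustration of the HITS step just described, the following Python sketch treats comments as hubs and body sentences as authorities. The linking rule (Jaccard word overlap between a comment and a sentence), the function names, and whitespace tokenization are assumptions made for illustration; the abstract does not say how the thesis builds or weights the graph.

```python
import math


def tokenize(text):
    # Assumption: whitespace tokenization; Chinese text would need a word segmenter.
    return set(text.lower().split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hits_sentence_scores(sentences, comments, iterations=50):
    """Return one authority score per body sentence.

    Assumption: every comment links to every body sentence with a weight
    equal to their Jaccard word overlap. Comments play the hub role and
    body sentences play the authority role.
    """
    s_tok = [tokenize(s) for s in sentences]
    c_tok = [tokenize(c) for c in comments]
    # weights[j][i] is the link strength from comment j to sentence i.
    weights = [[jaccard(c, s) for s in s_tok] for c in c_tok]

    authority = [1.0] * len(sentences)
    hub = [1.0] * len(comments)
    for _ in range(iterations):
        # Authority update: weighted sum of the hubs pointing at the sentence.
        authority = [
            sum(weights[j][i] * hub[j] for j in range(len(comments)))
            for i in range(len(sentences))
        ]
        # Hub update: weighted sum of the authorities the comment points at.
        hub = [
            sum(weights[j][i] * authority[i] for i in range(len(sentences)))
            for j in range(len(comments))
        ]
        # L2-normalize both vectors so the iteration stays bounded.
        a_norm = math.sqrt(sum(a * a for a in authority)) or 1.0
        h_norm = math.sqrt(sum(h * h for h in hub)) or 1.0
        authority = [a / a_norm for a in authority]
        hub = [h / h_norm for h in hub]
    return authority
```

In this sketch a body sentence that many comments echo ends up with a high authority score, which could then be combined with a conventional sentence weight when choosing summary sentences.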
The weights of body sentences are then adjusted, relative to the traditional algorithm, according to how strongly the comments bear on them, which in turn affects the selection of summary sentences. Experimental results show that the summarization algorithm that uses comments outperforms the one that does not. This work provides a basis for information extraction and automatic summarization on the Internet, and for further text compression in future.
3) The thesis studies collaborative filtering recommendation. Building on traditional collaborative filtering, it improves the computation of user similarity and thereby the accuracy of recommendation. Considering both the preferences that different users share and the effect of each user's own preferences on similarity, it proposes a logarithm-based similarity formula. The improved algorithm is tested on real microblog data: microblog posts are clustered into topic classes, a relationship network between users and these topic classes is built, and the improved collaborative filtering algorithm is applied to make recommendations. Experimental results show that the recommendations effectively hit the items in the validation set. Compared with traditional collaborative filtering, the new algorithm raises recommendation accuracy by a large margin and gives better personalized recommendations.
4) From the perspective of recommender systems, the thesis proposes two models of community formation and studies which recommendation methods suit each. It proposes two models for computing similarity within communities, compares them with the traditional similarity model, and tests how these similarity models perform in practice when recommendation is based on communities.
Experiments on the widely used Movielens dataset verify that the community-based models outperform the traditional heat conduction model and probabilistic transmission model in both recommendation accuracy and recommendation diversity. A comparison of the two community formation models shows that the model with non-strict community division achieves higher accuracy and diversity than the model with strict division. The non-strictly divided community model is therefore better suited to recommender systems, and especially to personalized recommendation.
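For readers unfamiliar with the two baselines named above, the Python sketch below shows the commonly published two-step formulations of probabilistic transmission (often called probabilistic spreading or mass diffusion) and heat conduction on a user-item bipartite network. It is not the thesis's community-based method, which the abstract does not specify. The function names, the `mode` values, and the toy data are assumptions for illustration only.

```python
from collections import defaultdict


def build_index(pairs):
    """pairs: iterable of (user, item) tuples, one per item a user collected."""
    items_of = defaultdict(set)
    users_of = defaultdict(set)
    for user, item in pairs:
        items_of[user].add(item)
        users_of[item].add(user)
    return items_of, users_of


def diffuse(target, items_of, users_of, mode="probs"):
    """Spread unit resource from the target user's items over the network.

    mode="probs": each node splits its resource equally among its neighbours.
    mode="heats": each node takes the average of its neighbours' resource.
    """
    initial = {item: 1.0 for item in items_of.get(target, set())}
    # Step 1: items -> users.
    user_res = {}
    for user, items in items_of.items():
        if mode == "probs":
            user_res[user] = sum(
                initial.get(i, 0.0) / len(users_of[i]) for i in items
            )
        else:
            user_res[user] = sum(initial.get(i, 0.0) for i in items) / len(items)
    # Step 2: users -> items.
    scores = {}
    for item, users in users_of.items():
        if mode == "probs":
            scores[item] = sum(user_res[u] / len(items_of[u]) for u in users)
        else:
            scores[item] = sum(user_res[u] for u in users) / len(users)
    return scores


def recommend(target, pairs, mode="probs", top_n=3):
    """Rank the items the target user has not collected, best first."""
    items_of, users_of = build_index(pairs)
    scores = diffuse(target, items_of, users_of, mode)
    seen = items_of.get(target, set())
    candidates = [item for item in scores if item not in seen]
    return sorted(candidates, key=lambda item: (-scores[item], item))[:top_n]


# Toy data (hypothetical): "P" is a popular item, "N" is a niche item.
toy_pairs = [
    ("t", "A"),
    ("u1", "A"), ("u1", "N"),
    ("u2", "A"), ("u2", "P"),
    ("u3", "A"), ("u3", "P"),
    ("u4", "P"),
    ("u5", "P"),
]
```

On the toy data, `recommend("t", toy_pairs, mode="probs")` ranks the popular item `P` first, while `mode="heats"` ranks the niche item `N` first. This reflects the usual trade-off between the two baselines: the probabilistic model tends to favour accuracy on popular items, and heat conduction tends to favour diversity.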
