
Research on Key Issues in Recommender Systems and Their Applications (面向推荐系统的关键问题研究及应用)

Research on the Key Issues for the Recommender Systems

【Author】 Liu Shichen (刘士琛)

【Supervisor】 Xiong Yan (熊焰)

【Author information】 University of Science and Technology of China, Computer Application Technology, 2014, Ph.D.

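【Abstract (translated from Chinese)】 With the explosive growth of the Internet in China and abroad in recent years, the volume of data and information online has increased at an unprecedented rate. Finding the desired content in this mass of data has become a major difficulty for a growing number of users and an active research topic for many scholars.

Users have discovered and obtained information on the Internet in roughly three stages. First came portal sites such as sina, sohu, and yahoo, which sort and organize commonly used and popular resources for users to discover and browse. However, the curated information is limited and may not cover a user's needs, and as data grows explosively, too much content makes a portal cluttered and bloated, so these sites can offer only relatively important information for retrieval. Second came search engines such as google and baidu, through which users retrieve the content they want. The accuracy of the results depends heavily on how the user describes the problem, and ordinary users' descriptions are often imprecise, which biases the results and makes it hard to find exactly what is needed. Third, and most recently, came recommender systems: users no longer need to search actively, because the system uses their attribute information and history to recommend information they may need. For example, taobao and netflix recommend products and movies, which narrows down the information when a user's needs are not yet clear. Notably, these three stages are not successive steps of an evolution; they complement and cooperate with one another.

Because recommender systems address Internet "information overload" well, they are popular with users and are adopted by more and more websites and companies, and the corresponding recommendation algorithms have drawn growing academic attention and become an important research field. Faced with different kinds of data and increasingly complex application scenarios, however, recommender systems run into several problems: conventional ones such as cold start and scalability; differences in application scenarios and inconsistent data distributions, which make the same algorithm produce very different results on different scenarios and data; and newer ones such as the difficulty of solving certain recommendation algorithm problems. To address these problems, this thesis studies recommender systems in depth and makes the following contributions.

(1) A similarity model based on non-parametric statistics. Collaborative filtering is the most basic and most widely used recommendation algorithm and has been applied successfully in many commercial models. It consists of two main steps, of which similarity computation is the first and the most critical. However, (i) data from different application scenarios have their own characteristics and clearly different distributions, so applying the same similarity measure to all of them is inaccurate; (ii) traditional Euclidean distance, Pearson correlation, and cosine similarity each have limitations and can no longer be applied directly to increasingly complex scenarios; and (iii) for sparse data, the confidence of the computed similarity is very low, so using it directly lowers recommendation accuracy. The thesis therefore proposes a similarity model based on non-parametric statistics that maps data from different scenarios into a unified space, removing the differences between data sets and bringing them to a common standard. Because the projected space has good linearity, similarity can be computed with linear similarity measures, which addresses the problems above and improves recommendation accuracy.

(2) A feature prediction model based on time backtracking. Insufficient data is often one of the biggest problems facing machine learning models, and a large body of research indicates that data matters far more to model results than the algorithm does. In recommender systems, users' historical behavior is the main source of model data. Traditional recommender systems can predict user attributes (such as hobbies, age, and gender) from historical behavior, or use that behavior directly to find similar users and make recommendations. Existing research, however, has used historical behavior in a naive and simple way that neglects its time dimension. The thesis proposes a feature prediction model based on time backtracking that greatly increases the utilization of historical data, in a sense multiplying the amount of data several times over, and thereby improves prediction accuracy. Applied to real taobao data to predict the age of users' children, the method achieves much higher prediction accuracy than traditional methods.

(3) A global optimization algorithm based on evolutionary games. Many recommendation algorithm problems, and data mining problems more generally, reduce to a global optimization problem when the model is solved. Global optimization is therefore both a key problem and a hard problem in recommender systems. Commonly used algorithms such as gradient descent, stochastic gradient descent, and Newton's method are suited only to convex optimization. The algorithm proposed in the thesis attempts to solve global optimization problems over continuous domains, removing the strong restriction to convex functions. Within the solution process, the thesis also proposes an adaptive parameter adjustment scheme based on evolutionary games, which greatly improves the accuracy of the algorithm and reduces its convergence time to some extent.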

【Abstract】 With the rapid development of the Internet around the world, the data and information on the Internet have been increasing at a dramatic speed. As a result, more and more users face the problem of discovering the content they want in an overwhelming mass of data, and this problem has become a popular research topic that attracts attention from many researchers.

Generally, users have obtained information from the Internet in three stages. First, various portal sites were established, such as sina, sohu, and yahoo. They help users by filtering and organizing a variety of popular resources and information to discover and browse. However, the organized information cannot always meet users' needs, and with the explosive growth of data, overwhelming amounts of content make a website bloated, so the information offered for retrieval is incomplete. Second, search engines such as google and baidu emerged, so that users could retrieve the content they wanted. But the accuracy of search results depends heavily on how the user describes the question, and that description is usually not precise; the resulting bias makes it difficult for users to identify exactly the results they need. Third, recommender systems have been developed in recent years. Without any search operation by the user, they intelligently recommend information the user is likely to need, based on the user's profile and history. For instance, taobao and netflix recommend items and movies to users, which narrows down information when the user's requirements are not yet clear. Notably, these three stages are not an evolutionary sequence but complement and cooperate with one another.

Recommender systems deal well with the information overload problem on the Internet, so they are welcomed by users and adopted by a large number of websites and companies. Recommendation algorithms have therefore attracted attention from academia and become a significant research area. However, with varied kinds of data and complicated application environments, recommender systems face different problems: conventional problems such as cold start and scalability; differences in application environment and inconsistencies in data distribution, which make the same algorithm produce very different results on different scenarios and data; and new problems such as the difficulty of solving certain recommendation algorithm problems. To address these problems, this thesis studies recommender systems intensively and completes the following research work.

(a) Similarity model research based on non-parametric statistics. Collaborative filtering algorithms, which have been applied successfully in practice, are the most fundamental and popular algorithms in recommender system research. They consist of two steps, of which similarity computation is the first and the most critical. However, first, data from different application environments have individual characteristics and obviously different distributions, so it is inaccurate to employ the same similarity measure for all of them; second, the traditional Euclidean distance, Pearson correlation, and cosine similarity measures are no longer suitable for complicated environments; third, similarity computed from sparse data has extremely low confidence, and using it directly reduces recommendation accuracy. For these reasons, this thesis proposes a similarity model based on non-parametric statistics that maps data from different environments into a uniform space and standardizes them. Moreover, because the projected space has good linearity, similarity is easy to compute with linear similarity measures, which addresses the above problems and improves recommendation accuracy.

(b) Demographic prediction with time backtracking. Lack of data is one of the biggest problems for machine learning models, and much research shows that data is far more important to a model than the algorithm. In recommender systems, users' historical behavior is the main source of model data. Traditional recommender systems can predict user profile attributes such as hobbies, age, and gender from historical behavior, or use that behavior to identify similar users for recommendation. However, historical behavior has so far been used in a naive and simple way that ignores its time dimension. This thesis therefore proposes a time backtracking model that makes better use of historical data and effectively increases the data volume, so as to improve prediction accuracy. In addition, the thesis applies the model to real-world data from taobao to predict the age of users' children, and the experimental results show that the prediction accuracy is much higher than that of traditional methods.

(c) Evolutionary game theory inspired algorithm for global optimization. When the model is solved, many recommendation algorithm and data mining problems are transformed into global optimization problems. Global optimization is therefore an important and challenging task in recommender systems. The algorithms in frequent use, such as gradient descent, stochastic gradient descent, and Newton's method, are suited only to convex optimization problems. This thesis proposes an evolutionary game theory inspired algorithm that attempts to solve global optimization problems in a continuous domain without the restriction to convex functions. Within the solution process, a self-adaptive parameter adjustment method is also proposed, which significantly improves the accuracy of the algorithm and reduces its convergence time to some extent.
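
As an illustration of the idea in contribution (a), the sketch below maps each user's ratings to empirical-CDF (mid-rank) scores, so that users who rate on different scales land in the same (0, 1) space, and then applies a linear (cosine) similarity there. The abstract does not specify the thesis's actual model, so this rank-based transform is only an assumed stand-in for "a similarity model based on non-parametric statistics"; the function names `ecdf_scores`, `cosine_on_common`, and `rank_similarity` are hypothetical.

```python
from math import sqrt


def ecdf_scores(ratings):
    """Map each rating to its mid-rank empirical CDF value in (0, 1).

    `ratings` is a dict {item: rating} for a single user. Only the ordering
    of that user's ratings matters, so the rating scale drops out.
    (Assumed transform; not taken from the thesis.)
    """
    values = list(ratings.values())
    n = len(values)
    scores = {}
    for item, r in ratings.items():
        below = sum(1 for v in values if v < r)
        equal = sum(1 for v in values if v == r)
        scores[item] = (below + 0.5 * equal) / n
    return scores


def cosine_on_common(u, v):
    """Cosine similarity of scores centered at 0.5, over co-rated items."""
    common = set(u) & set(v)
    if len(common) < 2:
        return 0.0  # too little overlap to say anything
    a = [u[i] - 0.5 for i in common]
    b = [v[i] - 0.5 for i in common]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similarity(ratings_u, ratings_v):
    """Similarity of two users after the rank-based transform."""
    return cosine_on_common(ecdf_scores(ratings_u), ecdf_scores(ratings_v))


if __name__ == "__main__":
    five_star = {"x": 1, "y": 3, "z": 5}      # rates on a 1-5 scale
    percent = {"x": 10, "y": 50, "z": 90}     # rates on a 0-100 scale
    print(rank_similarity(five_star, percent))
```

In this toy case the two users order the items identically, so they come out as maximally similar despite using different rating scales; a raw Euclidean distance on the same data would treat them as far apart. Returning 0.0 for users with fewer than two co-rated items is one simple way to avoid trusting similarities computed from very sparse overlap.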
