Probabilistic Latent Semantic Analysis and Its Applications
【Author】 Liu Sen (刘森);
【Author Information】 Zhejiang University, Computer Application Technology, 2011, Master's thesis
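【Abstract (translated from the Chinese)】 Many information retrieval applications need to uncover the meaning hidden behind characters and words. Because synonymy and polysemy are widespread, simple literal matching often fails to return results that match the query precisely in meaning. Probabilistic Latent Semantic Analysis (PLSA) builds a probabilistic model that links latent variables to co-occurrence data pairs (such as terms and documents). It uses statistical methods to establish the probability distributions relating "document-latent semantics-term", and performs statistical semantic analysis on that basis, yielding the distribution of words within each topic and the distribution of topics within each document. Documents can therefore be represented and understood at the semantic level rather than purely at the literal level. In the semantic space, operations such as matching, ranking and relevance queries on documents can be carried out more accurately. This thesis studies a sparse representation framework for PLSA and its parallel extension. The main contributions are:
● An efficient method for introducing sparse representation into the PLSA framework. Sparsity control is added to the two model parameters to address the overfitting of conventional PLSA and its inability to extract local features. The experiments in this thesis show that the method exceeds existing PLSA algorithms in accuracy and also performs very well in running performance.
● Methods for training PLSA models efficiently under distributed processing frameworks. A multithreaded PLSA algorithm for multi-core processors and parallel PLSA algorithms based on Hadoop and on MPI are designed and implemented, practical details and problems are discussed, and experiments and a performance evaluation are carried out on a cluster.
● An exploratory method for applying PLSA to personalized RSS article ranking, which estimates a user's interest in an article from the time the user spends reading it.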
【Abstract】 Many applications in information retrieval rely on discovering the meaning hidden behind the text itself. Because of polysemy and synonymy, however, matching queries on literal terms is often inaccurate. Probabilistic Latent Semantic Analysis (PLSA) is a topic modeling technique that discovers hidden structure by relating observed data to assumed hidden variables; for a text corpus this relation is "document-topic-term". It uses statistical learning to estimate the model parameters, namely the multinomial distribution of terms within each topic and the multinomial distribution of topics within each document. Documents are then represented in a semantic space instead of the term space, so that matching, ranking and relevance estimation can be done more accurately. This thesis makes the following contributions:
● We present an efficient approach that provides direct control over sparsity during the expectation-maximization process. It addresses two problems of PLSA: overfitting and the inability to produce local features. Experiments on face databases give a visual illustration of the local features obtained, and clustering experiments show improvements over the original PLSA procedure.
● We design multithreaded PLSA training as well as distributed training under the MPI and MapReduce frameworks. Implementation details are discussed, and the evaluations are analyzed for the strengths and weaknesses of each approach.
● We propose a method for ranking RSS documents that uses reading time as implicit feedback for modeling user preference.
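For orientation, the "document-topic-term" model and the expectation-maximization estimation mentioned in the abstract can be written out as follows. This is the commonly used asymmetric formulation of standard PLSA, not the sparse variant proposed in the thesis, whose exact form the abstract does not give. The notation is an assumption introduced here: n(d, w) is the count of term w in document d, and z ranges over latent topics.

```latex
\begin{align}
P(d, w) &= P(d) \sum_{z} P(w \mid z)\, P(z \mid d) \\
\mathcal{L} &= \sum_{d} \sum_{w} n(d, w) \log P(d, w) \\
\text{E-step:}\quad P(z \mid d, w) &= \frac{P(w \mid z)\, P(z \mid d)}{\sum_{z'} P(w \mid z')\, P(z' \mid d)} \\
\text{M-step:}\quad P(w \mid z) &= \frac{\sum_{d} n(d, w)\, P(z \mid d, w)}{\sum_{d} \sum_{w'} n(d, w')\, P(z \mid d, w')} \\
P(z \mid d) &= \frac{\sum_{w} n(d, w)\, P(z \mid d, w)}{\sum_{w'} n(d, w')}
\end{align}
```

The two quantities updated in the M-step, P(w | z) and P(z | d), are the two multinomial parameter sets the abstract refers to.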
【Key words】 Topic Model; Sparse Representation; PLSA; Distributed System; Matrix Factorization;
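The estimation procedure summarized in the abstract can be illustrated with a minimal sketch of standard PLSA trained by plain EM on a small document-term count matrix. It does not reproduce the sparse, multithreaded or distributed variants developed in the thesis. The function name `plsa_em`, its parameters and the toy matrix are hypothetical and chosen only for illustration; NumPy is assumed to be available.

```python
import numpy as np


def plsa_em(counts, n_topics, n_iter=50, seed=0):
    """Illustrative standard PLSA (not the thesis's sparse variant).

    counts: (n_docs, n_words) array of term counts n(d, w).
    Returns P(w|z) with shape (n_topics, n_words), P(z|d) with shape
    (n_docs, n_topics), and the log-likelihood measured before each update.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    eps = 1e-12  # guards against division by zero and log(0)

    # Random starting distributions; each row is normalised to sum to 1.
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    log_likelihoods = []
    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]  # (docs, topics, words)
        p_w_d = joint.sum(axis=1)                      # P(w|d), (docs, words)
        log_likelihoods.append(float((counts * np.log(p_w_d + eps)).sum()))
        posterior = joint / (p_w_d[:, None, :] + eps)

        # M-step: re-estimate both multinomials from expected counts.
        weighted = counts[:, None, :] * posterior      # n(d,w) * P(z|d,w)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps

    return p_w_z, p_z_d, log_likelihoods


# Toy corpus (made up): two documents use words 0-1, two use words 2-3.
toy_counts = np.array(
    [[5, 4, 0, 0], [4, 6, 0, 0], [0, 0, 5, 5], [0, 1, 4, 6]], dtype=float
)
topic_word, doc_topic, history = plsa_em(toy_counts, n_topics=2)
print(np.round(doc_topic, 3))
```

The dense (docs, topics, words) array keeps the sketch short but grows quickly with corpus size, which is one reason practical implementations iterate over nonzero counts or split the work across threads or machines, as the thesis does.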