

Citation Context Based Analysis Technologies on Scientific Literature Retrieval

【Author】 Zhang Jinsong (张金松)

【Advisor】 Chen Yan (陈燕)

【Author information】 Dalian Maritime University, Management Science and Engineering, 2013, PhD

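【Abstract (translated from the Chinese)】 With the arrival of the big data era, scientific literature increasingly exists on the web as electronic documents. This not only promotes the dissemination of literature but also advances scientific research, letting researchers "stand on the shoulders of giants". However, the large volume of electronic academic literature is of uneven quality and poses new challenges for literature management; how to represent, filter, and apply literature effectively has become one of the active research topics in knowledge management. This thesis therefore applies text mining, information retrieval, and related methods to literature retrieval. Building on citation analysis and using the semantic information in citation contexts, it combines topic models, ranking algorithms, language models, and network graphs to visualize literature knowledge domains, study literature ranking algorithms, and construct literature retrieval models, and it validates each component experimentally on selected academic paper data. The main research content is as follows: 1. Based on citation analysis, a method for computing the distance between citation probability distributions is proposed and applied to the visualization of literature knowledge domains. 2. Text is extracted from citation contexts, and the Labeled-LDA topic model is used to obtain two prior probabilities in a directed, weighted citation network, namely vertex weights and edge weights; these are used to improve the traditional PageRank algorithm, yielding a citation-context-based literature ranking method (Context-Based Ranking Algorithm, CBRA). 3. The citation-context-based ranking method is applied to experiments on author authority: an author authority ranking is built for each topic, and this authority information is used to improve literature ranking, so that the ranking depends not only on network links but also on author authority. 4. The citation-context-based ranking method is used to improve the traditional information retrieval model based on language modeling, and a topic-oriented literature retrieval system is built following a system development approach. 5. The citation-context-based ranking method is applied to passage retrieval, and a topic-based passage retrieval model is built, improving the accuracy and effectiveness of traditional literature retrieval.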

【Abstract】 With the arrival of the Big Data era, more and more scientific literature is available on the Internet as electronic documents. This promotes the dissemination of literature, accelerates the development of scientific research, and helps researchers "stand on the shoulders of giants". At the same time, literature of good and poor quality is intermingled in the large volume of electronic academic documents, and this problem is becoming more conspicuous. It creates new challenges in literature visualization, retrieval, management, and application, which have become a research hotspot in bibliometrics and knowledge management. This thesis focuses on methods for scientific literature retrieval based on citation analysis, text mining, and information retrieval. The methods considered include topic models, ranking algorithms, language models, and graph theory. First, a method for domain knowledge visualization is presented. Next, a ranking algorithm for scientific literature is developed by analyzing the semantic knowledge in citation contexts. Finally, a scientific literature retrieval model is implemented. All of these methods are validated experimentally. The main research content is as follows:
1. A new method for computing the distance between citation probability distributions is proposed on the basis of citation analysis and applied to the visualization of literature knowledge domains.
2. The text of citation contexts is extracted, and the Labeled-LDA topic model is used to generate two prior probabilities (vertex weight and edge weight) in the directed, weighted citation network. On this basis a Context-Based Ranking Algorithm (CBRA) is proposed that improves the traditional PageRank algorithm.
3. The CBRA is applied to experiments on author authority ranking. An author authority ranking is set up for each topic and used to improve the literature ranking, so that the literature ranking is based not only on network links but also takes the authority of the author into account.
4. The CBRA is used to improve the traditional information retrieval model based on language models, and a topic-based literature retrieval system is built using system development methods.
5. The CBRA is applied to passage retrieval, and a topic-based passage retrieval system is set up, which can improve the accuracy and relevance of literature retrieval.
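
The abstract does not state the CBRA update rule, so the following is only a minimal sketch of the general idea named in item 2: a PageRank variant on a directed, weighted citation network in which vertex weights act as a prior (teleport) distribution and edge weights bias the transitions. The function name `weighted_pagerank`, the damping value, and all numbers are assumptions made for illustration; in the thesis the two sets of weights come from Labeled-LDA applied to citation contexts, which is not reproduced here.

```python
import numpy as np

def weighted_pagerank(edge_weights, vertex_prior, damping=0.85,
                      tol=1e-10, max_iter=200):
    """PageRank with edge weights and a vertex prior (illustrative only).

    edge_weights[i, j] > 0 means paper i cites paper j with that weight.
    vertex_prior[i] is a non-negative prior weight for paper i.
    """
    W = np.asarray(edge_weights, dtype=float)
    prior = np.asarray(vertex_prior, dtype=float)
    prior = prior / prior.sum()              # normalise prior to a distribution

    out = W.sum(axis=1)[:, None]             # total outgoing weight per paper
    safe_out = np.where(out > 0, out, 1.0)   # avoid division by zero
    # Row-normalise edge weights; papers that cite nothing follow the prior.
    P = np.where(out > 0, W / safe_out, prior[None, :])

    rank = prior.copy()
    for _ in range(max_iter):
        new_rank = damping * (rank @ P) + (1.0 - damping) * prior
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank

# Hypothetical 3-paper network: papers 0 and 1 both cite paper 2.
edges = np.array([
    [0.0, 0.0, 0.9],   # paper 0 cites paper 2
    [0.2, 0.0, 0.8],   # paper 1 cites papers 0 and 2
    [0.0, 0.0, 0.0],   # paper 2 cites nothing
])
prior = np.array([1.0, 1.0, 1.0])            # uniform prior, made up
scores = weighted_pagerank(edges, prior)
print(scores)
```

With a uniform prior and equal edge weights this reduces to ordinary PageRank, so any topic sensitivity in such a scheme comes entirely from how the two sets of weights are estimated.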


