基于拉普拉斯谱分析的科学论文甄别方法研究

Scientific Paper Discrimination Method Research Based on Laplacian Spectrum Analysis

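【Author】 Wang Qian (王谦)

【Advisor】 Sun Wenjun (孙文俊)

【Author information】 Harbin Institute of Technology, Management Science and Engineering, 2010, Master's degree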

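【Abstract (Chinese version, translated)】 Academic papers, written in natural and formal languages, are the most important tools by which humanity preserves and disseminates knowledge. Today, however, many poor-quality or even fabricated academic papers pass themselves off as genuine, consuming publication resources and polluting the body of human knowledge. Whether produced by hand or generated automatically by algorithms, these poor-quality or pseudo papers share one feature: their grammar and formatting are sound, yet semantically they are obscure or entirely meaningless. Such papers should differ in essence from serious, academically valuable, high-level papers. Finding this essential difference and using it for a preliminary screening of academic papers is the main subject of this thesis. The research also gives a deeper understanding of the structural characteristics of human knowledge as expressed mainly in natural language. From a practical standpoint, a reasonably reliable preliminary screening of the very large number of submitted manuscripts would keep reviewers' valuable time from being wasted on pseudo papers, which would be practical and valuable work. Chinese and foreign scholars have shown that language networks, as real complex networks, have small-world and scale-free properties. From the complex-network characteristics of language networks, one can conjecture that the word co-occurrence networks of pseudo papers and of genuine papers differ markedly in those characteristics. In studying the structure of complex networks, some scholars have applied Laplacian spectrum distribution plots from spectral graph theory, analyzing the networks from a geometric perspective, and found that the Laplacian spectrum distributions of random, small-world, and scale-free networks differ significantly. This thesis takes the word co-occurrence networks of scientific papers as its object of study and uses Laplacian spectrum analysis to examine their structure. By comparing the Laplacian spectral features of genuine and pseudo scientific papers (the Laplacian eigenvalue distribution, the spectral density distribution, and the extreme eigenvalues), it identifies the essential differences between the two classes of papers as characterized by the Laplacian spectrum, and on that basis designs a Laplacian spectrum discrimination method for automatically distinguishing genuine from pseudo scientific papers. The method was applied to four classes of collected samples, 100 papers each: MIS Quarter papers, accepted papers and rejected papers of the International Conference on Management Science and Engineering, and pseudo papers randomly generated by SCI engine. Laplacian spectrum plots were drawn and analyzed in depth for each class. The Laplacian spectrum distributions of genuine and pseudo papers were found to differ significantly, which demonstrates that the Laplacian spectral features of a paper's word co-occurrence network can be used to distinguish genuine papers from pseudo papers.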

【Abstract】 Academic papers, expressed in natural and formal languages, are the most important tools by which humans preserve and disseminate knowledge. Today, however, many poor-quality and even inauthentic academic papers take up publication resources and pollute the body of human knowledge. Whether produced by hand or generated automatically by algorithms, these papers share a common feature: their grammar and formatting are sound, but they are semantically obscure or even meaningless. Such papers should differ in essence from serious, high-level academic papers. Identifying these essential differences and using them for a preliminary discrimination of papers is the main content of this thesis. This research allows a deeper understanding of the structural features of human knowledge expressed mainly in natural language. From a practical point of view, a reasonably reliable preliminary screening of large numbers of manuscripts would keep reviewers' valuable time from being wasted on inauthentic papers, which would be work of real practical value.
As real complex networks, language networks have small-world and scale-free characteristics, as Chinese and foreign scholars have shown. Given these characteristics, we can presume that the word co-occurrence networks of inauthentic papers are closer to random networks, while those of real papers are closer to small-world or scale-free networks. In studying the characteristics of complex networks, some scholars have applied the Laplacian spectrum distribution from spectral graph theory to network topology from a geometric point of view, and found that the Laplacian spectrum distributions of random, small-world, and scale-free networks differ significantly.
This thesis takes the word co-occurrence networks of scientific papers as its object of study and uses Laplacian spectrum analysis to study their structure. By comparing the Laplacian spectral characteristics of real and inauthentic scientific papers (the Laplacian eigenvalue distribution, the Laplacian spectral density distribution, and the extreme Laplacian eigenvalues), we find the essential differences between the two types of papers as characterized by the Laplacian spectrum, and use these differences to design a Laplacian spectrum screening method for the automatic screening of scientific papers.
We apply this method to the collected samples of real and inauthentic scientific papers, plotting and analyzing in depth their Laplacian spectra. The samples are MIS Quarter papers, accepted and rejected papers of the International Conference on Management Science and Engineering, and pseudo papers randomly generated by SCI engine. We select 100 papers from each of the four sample types and compare their Laplacian spectra. The study finds significant differences in the spectral distributions, which shows that the Laplacian spectral characteristics of the word co-occurrence network can be used to identify the authenticity of scientific papers.
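
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the tokenization, the co-occurrence window (`window=2`), the choice of unweighted edges, and all function names are assumptions made for illustration. It builds a word co-occurrence network from raw text, forms the graph Laplacian L = D - A, and returns its eigenvalues, from which a spectral density histogram and the extreme eigenvalues can be read off.

```python
import re
import numpy as np


def cooccurrence_adjacency(text, window=2):
    """Build an unweighted, undirected word co-occurrence adjacency matrix.

    Two distinct words are linked if they occur within `window` tokens of
    each other inside the same sentence (an assumed linking rule).
    """
    vocab = {}      # word -> node index
    edges = set()
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        for i, word in enumerate(tokens):
            vocab.setdefault(word, len(vocab))
            # Link to the preceding `window` tokens, which are already indexed.
            for prev in tokens[max(0, i - window):i]:
                if prev != word:
                    edges.add((vocab[prev], vocab[word]))
    n = len(vocab)
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return A, vocab


def laplacian_spectrum(A):
    """Eigenvalues of the graph Laplacian L = D - A, in ascending order."""
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.eigvalsh(L)  # L is symmetric, so eigvalsh applies


def spectral_density(eigenvalues, bins=10):
    """Normalized histogram of the eigenvalues as a simple density estimate."""
    return np.histogram(eigenvalues, bins=bins, density=True)


# Toy input; a real experiment would use the full text of each paper.
text = "the cat sat on the mat. the dog sat on the log."
A, vocab = cooccurrence_adjacency(text)
eig = laplacian_spectrum(A)
density, bin_edges = spectral_density(eig, bins=5)
smallest, largest = eig[0], eig[-1]  # extreme eigenvalues
```

The smallest Laplacian eigenvalue of any graph is zero, so in practice the second-smallest eigenvalue and the largest one are the informative extremes. Comparing the histograms produced for different classes of papers would be one way to look for the kind of distributional difference the abstract reports.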
