节点文献

维数约减算法研究及其在大规模文本数据挖掘中的应用

Research of Dimensionality Reduction and Its Appliacation on Data Mining of Large-Scale Text

【作者】 于瑞国

【导师】 何丕廉;

【作者基本信息】 天津大学 , 计算机应用技术, 2008, 博士

【摘要】 随着网络的快速发展,人们处在这个“信息爆炸”的时代,常常面对海量数据分析和处理的任务,且这样的数据仍在以几何级数增长。同时,在现实中这些海量数据往往又是高维而稀疏的,且存在着大量的冗余。因而能对高维海量数据做压缩处理,且保持其内在属性的有效处理方法成为人工智能、机器学习、数据挖掘等领域的重要研究课题之一。高效的维数约减算法是对高维海量数据处理的一种有效方法,且具有一定的实际应用价值。本文的关注点集中在适用于高维海量数据的快速维数约减算法的研究及其具体应用。本文分别提出了两种新的维数约减算法:(1)基于期望扰动的直接随机映像算法(On the Expected Distortion Bound of Direct Random Projection,简称DRP);(2)基于锚点集的最小平方误差等距嵌入算法(Anchor points based Isometric Embedding under least square error criterion,简称AIE)。基于期望扰动的直接随机映像算法DRP具有O ( dn )的时间复杂性,这样的性能评价是建立在对期望扰动分析的基础上的。并证明了1)DRP算法的期望扰动的界。2)在适当的给定条件下,可在O (1)的随机时间内找到一个将期望扰动限定在一个合适范围之内的DRP映像。进而提出了一种获得中肯DRP的启发式算法。此算法具有稳固的渐进加速比,相对于其他随机映像算法具有更好的稳定性。而且在流数据模式下,可采用增量策略,DRP算法的时间复杂性为O ( d log d )。基于锚点集的最小平方误差等距嵌入算法AIE具有O ( n log( n ))的时间复杂性,而且在获得测地线距离后的计算时间复杂度达到对嵌入点数的线性关系,且可以完全并行实现。与Isomap、LLE等非线性维数约减算法相比较,具有更优化的时间复杂性。当前主流的搜索引擎根据查询词在网页中的出现频率,辅以网页权威性等信息,生成查询结果。但用户提供的查询词往往非常简单,在许多情况下,搜索引擎难以确定用户的查询意图。本文提出了一种利用Web日志中的海量点击数据进行网页内容相关性挖掘的方法,在此基础上给出了一种反馈式搜索引擎(Feedback Search Engine ,简称FSE)框架及相关算法。FSE根据网页相关性动态生成查询结果,以期提供给用户更中肯和个性化的信息。

【Abstract】 With the rapid development of Internet, people often need to face massive data to analysis and process in the age of“information explosion”, and this large amount of data is still increasing in a geometrical rate. In real world, the massive data always is high dimensional and sparse, and redundancy often exists in the massive data. Compressing on massive data and keeping the internal properties becomes one of the important research topics in artificial intelligence、machine learning、data mining and other fields. High Efficient dimension reduction algorithm is a method of processing high dimensional mass data and has certain practical application value. This paper focuses on research and application of the rapid dimension reduction algorithm, which is applicable to the massive data.The paper proposes two new dimension reduction algorithms: The First is On the Expected Distortion Bound of Direct Random Projection (DRP). The second is Anchor points based Isometric Embedding under least square error criterion (AIE). On the Expected Distortion Bound of Direct Random Projection (DRP) has a time complexity of O ( dn ). The performance of DRP is investigated in terms of expected distortion analysis. We prove: 1) an expected distortion bound of DRP; and 2) given moderate conditions, the DRP with appropriate expected distortion can be found in O (1) random time. Furthermore, we propose a simple heuristic to facilitate finding an appropriate DRP. By experiments, DRP might be more stable than the other two random projection algorithms. Using an incremental strategy, the total time cost of DRP is O ( d log d ) in flow data mode.Anchor points based Isometric Embedding under least square error criterion (AIE) has a time complexity of O ( n log( n )), and after obtained geodesic distances it has linear time complexity for embedded points and can be fully realized in parallel. Compared with Isomap, LLE etc. nonlinear dimension reduction algorithms, AIE have better time complexity.Current mainstream search engines generate search results by analyzing statistical information such as the frequency of queries in web pages and the ranking of web pages. In many situations, search engines can not determine what kind of information users want. This paper describes a web content relevance mining method using large amounts of clickthrough data in web log. Furthermore, based on this method, we present a framework of Feedback Search Engine (FSE) and associated algorithms. According to page-to-page relevance, FSE generate search results dynamically and provide its users more accurate and personalized information.

  • 【网络出版投稿人】 天津大学
  • 【网络出版年期】2009年 07期
节点文献中: