多源环境中数据预处理与模式挖掘的研究

Data Preprocessing and Pattern Mining in Multiple Data Sources

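【Author】 Lin Yaojin (林耀进)

【Advisors】 Hu Xuegang (胡学钢); Wu Xindong (吴信东)

【Author information】 Hefei University of Technology (合肥工业大学), Computer Application Technology, 2014, PhD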

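【Abstract (translated from the Chinese original)】 With the rapid development of databases, networks, and other information technologies, data in many application domains, such as sensor networks, commercial transactions, and social media analysis, are described by ever more information, producing data that are massive, multi-source, and heterogeneous in form. These multi-source heterogeneous data contain rich knowledge and useful information. However, because multiple data sources are heterogeneous, autonomous, complex, and inconsistent, traditional data mining techniques face great challenges. Research on knowledge mining in multi-source environments, including label propagation, data source quality assessment, and pattern mining, therefore has significant research and application value. The main contents of this thesis are as follows:

1) Because data sources are structurally inconsistent, it is difficult to integrate multiple data sources directly into a single source for learning. Making full use of the label information in labeled sources and the internal structure information of unlabeled sources, the thesis proposes two label propagation methods, global consensus and local consensus, which assign class labels to the samples of unlabeled sources. On this basis, a multi-source ensemble learning method is constructed, and the proposed algorithms are validated in terms of classification accuracy, robustness, and scalability. The experiments also show that local consensus label propagation outperforms global consensus label propagation when there are many unlabeled sources.

2) When learning from multiple data sources, some of the sources may be irrelevant or redundant. Starting from the significance of each source and the redundancy between sources, the thesis designs a data source quality assessment and selection algorithm based on maximum significance and minimum redundancy. Here, significance denotes how much a source contributes to classification, and redundancy denotes the degree to which the information in different sources overlaps. Finally, the top p% of sources are selected for multi-source ensemble learning. Experimental results show that the measure effectively selects sources relevant to the task.

3) As sales volumes grow, stores accumulate large amounts of time-related transactional sales data. The sales data are divided by time into multiple time-stamped databases. For the multiple related databases formed by these time-stamped databases, the thesis proposes an effective mining algorithm, typified by the mining of stable patterns. The algorithm first defines stable items through two constraints, minsupp and varivalue, and then measures the similarity between stable items using gray relational analysis. On this basis, a hierarchical gray clustering method is proposed to mine stable patterns composed of stable items. The algorithm is validated in terms of pattern effectiveness, time efficiency, and scalability.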
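The source selection idea in contribution 2) can be illustrated with a minimal Python sketch. The abstract does not state how significance and redundancy are computed or combined, so everything here is an assumption made for illustration: the function name `select_sources`, the precomputed score dictionaries, and the greedy rule "significance minus mean redundancy with the sources already chosen" (the common difference form of max-relevance-min-redundancy selection) are not taken from the thesis.

```python
import math

def select_sources(significance, redundancy, p):
    """Greedily rank sources and keep the top p percent.

    significance: dict mapping source name -> contribution score (assumed given).
    redundancy:   dict mapping frozenset({a, b}) -> overlap score (assumed given).
    p:            percentage of sources to keep (at least one source is kept).
    """
    k = max(1, math.ceil(len(significance) * p / 100))
    selected = []
    remaining = set(significance)

    def score(source):
        # Hypothetical criterion: significance penalised by the mean
        # redundancy with the sources selected so far.
        if not selected:
            return significance[source]
        overlap = sum(redundancy[frozenset((source, t))] for t in selected)
        return significance[source] - overlap / len(selected)

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy scores: B is nearly as significant as A but largely redundant with it.
significance = {"A": 0.9, "B": 0.85, "C": 0.5, "D": 0.1}
redundancy = {
    frozenset(("A", "B")): 0.8, frozenset(("A", "C")): 0.1,
    frozenset(("A", "D")): 0.0, frozenset(("B", "C")): 0.1,
    frozenset(("B", "D")): 0.0, frozenset(("C", "D")): 0.0,
}
print(select_sources(significance, redundancy, p=50))
```

In this toy example the redundant source B is passed over in favour of C, which is the behaviour a significance-versus-redundancy trade-off is meant to produce.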

【Abstract】 With the rapid development of database, network, and other information technologies, multiple data sources with large volumes and heterogeneity have become ubiquitous in many practical applications, such as sensor networks, supermarket transactions, and social media analysis. These sources contain plenty of useful information and valuable knowledge, but they are also heterogeneous, autonomous, complex, and inconsistent, which makes them challenging for traditional mining algorithms. Thus, knowledge discovery from multiple data sources, including label propagation, source quality evaluation, and pattern mining, is a significant problem with real-world application value. The main contributions of this dissertation are as follows.

1) It is difficult to merge multiple data sources into a centralized database for learning because of the inconsistency between different data sources. We present two label propagation methods, referred to as global consensus and local consensus, that infer the labels of training objects in unlabeled sources by making full use of the class label information from labeled sources and the internal structure information of unlabeled sources. We test the classification accuracy, robustness, and scalability of the proposed methods by constructing a multiple-data-source ensemble learning model. Experimental results show that local consensus outperforms global consensus when there are many unlabeled sources.

2) Some sources may be irrelevant or redundant when learning from multiple data sources, so it is worthwhile to select a set of good information sources that can help improve learning performance. We present a source assessment and selection algorithm based on max-significance-min-redundancy, in which significance represents the degree to which an information source contributes to classification, and redundancy represents the information overlap among different information sources. Finally, we select the top p percent of sources to construct a multiple-data-source ensemble learner. Experimental results show that the metric can effectively select sources related to the target mining task.

3) Every time a customer interacts with a business, there is an opportunity to gain strategic knowledge. Transactional data collected over time contain a wealth of information about customers and their purchasing patterns. We divide transactional data into multiple time-stamped databases according to their sale periods, and present an efficient algorithm for mining patterns, typified by stable patterns, from the multiple related databases that these time-stamped databases form. First, we define the notion of stable items according to two constraint conditions: minsupp and varivalue. We then measure the similarity between stable items based on gray relational analysis, and propose a hierarchical gray clustering method for mining stable patterns consisting of stable items. Finally, experimental results show that the proposed algorithm is effective, efficient, and scalable.
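The first two steps of contribution 3), filtering stable items and measuring their similarity, can be sketched in Python under explicit assumptions. The abstract names the constraints minsupp and varivalue but does not define them, so this sketch assumes that an item is stable when its mean support across the time-stamped databases is at least `minsupp` and the population standard deviation of its support is at most `varivalue`. The similarity function uses the standard gray relational coefficient with distinguishing coefficient `rho = 0.5`, computed for a single pair of series; this is a simplification and is not claimed to be the thesis's exact formulation. The function names are hypothetical, and the hierarchical clustering step is not shown.

```python
from statistics import mean, pstdev

def stable_items(supports, minsupp, varivalue):
    """Return items whose support is high enough and varies little over time.

    supports: dict mapping item -> list of supports, one per time-stamped database.
    Assumed definition: mean support >= minsupp and std. deviation <= varivalue.
    """
    return sorted(
        item for item, series in supports.items()
        if mean(series) >= minsupp and pstdev(series) <= varivalue
    )

def gray_relational_grade(x, y, rho=0.5):
    """Pairwise gray relational grade between two equal-length series."""
    deltas = [abs(a - b) for a, b in zip(x, y)]
    d_min, d_max = min(deltas), max(deltas)
    if d_max == 0:
        return 1.0  # identical series are maximally related
    # Gray relational coefficient at each time point, then averaged.
    return mean((d_min + rho * d_max) / (d + rho * d_max) for d in deltas)

# Toy supports over three time-stamped databases.
supports = {
    "milk": [0.30, 0.32, 0.31],
    "bread": [0.31, 0.33, 0.32],
    "umbrella": [0.05, 0.40, 0.10],  # popular only in one period
    "caviar": [0.01, 0.01, 0.01],    # steady but rarely bought
}
items = stable_items(supports, minsupp=0.2, varivalue=0.05)
print(items)
print(gray_relational_grade([1, 2, 3], [1, 2, 5]))
```

A pairwise grade of this kind could serve as the similarity input to an agglomerative clustering routine, which is one plausible reading of the "hierarchical gray clustering" step.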
