Research on Alternative Clustering on Stream Data (流数据上的可置换聚类研究)

Alternative Clustering on Stream Data

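【Author】 张婧媛 (Zhang Jingyuan)

【Advisor】 江贺 (Jiang He)

【Author information】 Dalian University of Technology, Computer Application Technology, 2011, Master's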

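【Abstract (translated from the Chinese)】 With advances in hardware technology, massive stream data has grown rapidly and has become a research hotspot in recent years. Stream data clustering, a fundamental task in stream mining, has been extensively studied and applied. However, existing methods return only a single clustering to the user at any given moment, whereas in practice stream data can be viewed and interpreted from different perspectives; a more reasonable approach is to return multiple alternative clusterings. This thesis takes the stream-data perspective and designs a new dynamic alternative clustering method suited to stream data, so that users can explore stream data from multiple aspects. The algorithm, named AltStream, consists of an online component and an offline component. Its goal is to find, in a given data stream, two high-quality alternative macro-clusterings with low mutual similarity. Accordingly, the online component simultaneously maintains two alternative sets of statistical information, so that the stream is continually summarized along two dissimilar directions. To preserve the alternativeness between micro-clusters, we propose a new measure, the SOBD measure, which approximately evaluates the similarity of two clusterings when they contain different data points. When the user requests two alternative macro-clusterings, the offline component is invoked. First, two corresponding sets of micro-clusters are obtained for the specified time horizon. Then, given the known number of clusters, the unsupervised algorithm dec-kmeans is applied to the first set of micro-clusters to obtain two uncorrelated macro-clusterings. The one with better quality is returned to the user as the first final result, while the centroids of the other, slightly lower-quality macro-clustering are extracted as semi-supervised information to guide the second set of micro-clusters, on which a weight-based k-means algorithm produces the second macro-clustering. Extensive experiments on real data sets show that the new algorithm outperforms several comparison algorithms in both quality and dissimilarity. The method can therefore help users analyze data better in many practical applications, such as text streams, credit card transaction streams, web logs, and web page click streams.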

【Abstract】 In recent years, data streams have attracted considerable research interest. As an essential task in mining data streams, stream clustering has become a hot topic in this area. Existing stream clustering algorithms usually produce only a single clustering within a given time period. However, data streams can often be interpreted from multiple perspectives, and alternative clusterings are preferred in many real-world applications. In this thesis, we pose the new problem of alternative stream clustering, which aims to find two high-quality and mutually dissimilar macro-clusterings in a given data stream. We propose a new algorithm named AltStream, consisting of an online component and an offline component. The online component simultaneously maintains two alternative groups of micro-clusters, which record statistical information about the evolving stream. For the online procedure, we develop a new measure, the SOBD measure, to approximately evaluate the dissimilarity between two clusterings that do not contain exactly the same data points. When the user requests two alternative macro-clusterings, the offline component is invoked. After the two sets of micro-clusters are retrieved for the specified time horizon and number of clusters, an unsupervised alternative clustering algorithm, dec-kmeans, is applied to one set of micro-clusters to find two alternative macro-clusterings. The one with better quality is output as the first resulting macro-clustering, whereas the centroids of the other macro-clustering are extracted as semi-supervised information. Guided by these centroids, a weighted k-means algorithm creates the second resulting macro-clustering. Experimental results on real-world streams show that the new algorithm performs better than several comparison methods in terms of both quality and dissimilarity. AltStream could therefore be applied to text streams, credit card transaction flows, web logs, and web page click streams, among others. In each of these applications, it is important for users to be able to explore data streams from multiple perspectives.
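
The last offline step described in the abstract, a weighted k-means guided by previously extracted centroids, can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not say how the centroids guide the clustering, so the sketch assumes they serve as the initial seeds, and it assumes each micro-cluster is summarized by a center and a point count used as its weight. The function name `weighted_kmeans` and all parameter names are hypothetical.

```python
from typing import List, Sequence, Tuple


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(
    centers: Sequence[Sequence[float]],  # assumed: one center per micro-cluster
    weights: Sequence[float],            # assumed: point count of each micro-cluster
    seeds: Sequence[Sequence[float]],    # assumed: guiding centroids used as initial seeds
    max_iter: int = 100,
) -> Tuple[List[int], List[List[float]]]:
    """Lloyd-style k-means over micro-cluster centers with weighted centroid updates."""
    centroids = [list(s) for s in seeds]
    labels = [-1] * len(centers)
    for _ in range(max_iter):
        # Assignment step: each micro-cluster goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(c, centroids[j]))
            for c in centers
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        # Update step: each centroid becomes the weighted mean of its members.
        for j in range(len(centroids)):
            members = [i for i, lab in enumerate(labels) if lab == j]
            total = sum(weights[i] for i in members)
            if total > 0:  # an empty cluster keeps its previous centroid
                centroids[j] = [
                    sum(weights[i] * centers[i][d] for i in members) / total
                    for d in range(len(centroids[j]))
                ]
    return labels, centroids


# Toy input: four micro-clusters on a line, two guiding centroids.
labels, centroids = weighted_kmeans(
    centers=[(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)],
    weights=[1, 3, 2, 2],
    seeds=[(0.0, 0.0), (10.0, 0.0)],
)
print(labels, centroids)
```

Weighting by point count means a micro-cluster that summarizes many stream points pulls its macro-cluster centroid harder than a sparse one, which is the usual reason to cluster summaries rather than raw points.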

【Keywords (translated from the Chinese)】 Stream Data; Clustering; Alternative
【Key words】 Data Stream; Alternative Clustering; AltStream