Clustering Data Stream Has Attracted A Lot Computer Science Essay

Published: November 9, 2015 Words: 4346

In the last few years clustering data stream has attracted a lot of attention by growing the volume of existing data and constrained memory capacity which leads us to process data dynamically [1]. Finding arbitrary shape cluster from data stream in real time is important issue and difficult in the real time applications [2]. In the real world problems, clustering data can perform in different size, shapes degrees of separation and data sparseness. More, noise and outlier in the data can affect the true structure of cluster [3].

Traditional algorithms approach clustering is applied to optimize problem by minimizing certain quality metrics such as squared error. The result of clusters is convex shape in d- dimensional space. Therefore, for clustering with arbitrary shape, this strategy does not work properly. We can see arbitrary shape in area of science such as weather satellite, image segmentation spatial data which is collected from geographic information systems (GIS) and even sensor data that is posed as arbitrary shape. These applications generate enormous volume of data. Those set of algorithm which find irregular shaped cluster are called as "shape-base clustering algorithms" [4].

In this Proposal, our focus is on the arbitrary-shape clustering task. We use the term shape-based clustering for all algorithmic techniques that capture clusters with arbitrary shapes, varying densities and sizes. Various algorithms have been applied for finding Shape-based clustering. These algorithms are categorized into: Hierarchical clustering, prototype-base algorithm, spectral-base clustering, kernel-base method, density-base method and manifold-base clustering.

Among these algorithms, prototype algorithm can only model cluster with specific shape, therefore cannot be applied for arbitrary shape. In the density method like DBScan, finding appropriate parameters is difficult and this difficulty makes use of these algorithm limited. Spectral algorithm for choosing a good similarity graph is not trivial and can be quite unstable under different choices of the parameters for neighbouring graphs [5]. Kernel-base clustering which has been used for to distinguish non-linear clustering boundaries, closely related to spectral algorithm. Kernel clustering has some major problems; its computational is expensive due to difficulty in scaling large-scale dataset because of the number of parameters is quadric respect to the number of examples. Second, it is sensitive to choose of kernel functions. Third, it needs data pre-processing to ensure that any clustering boundary will pass through the origins which makes it unstable for clustering unbalanced dataset. Spectral, density-based (DBSCAN), and nearest-neighbour graph based (Chameleon) approaches are the most successful among the many shape-based clustering methods. But when they turns to the complexity time and space limitation in data stream, aforementioned algorithms have high computational in comparison to linear distance-base clustering [5].

Some works have been done in the area of data stream clustering for arbitrary shape clustering. This continues effort has one common goal to achieve and produce an accurate data stream clustering method. Despite these studies still cannot lead the users to obtain correct result.

Problem statement

Traditional clustering algorithms have focused on the following objectives: 1) Improving the clustering quality, 2) improving efficiency, and 3) designing algorithms that can scale to larger datasets [4]. The last consideration is gaining prominence as the growth of volume of existing data and insufficiency of data storage capacity lead us to process data dynamically in extracting knowledge. In distance-base clustering method data have been considered as a data stream which comes in from one side and exit from another side and the data not available to visit as a second time. Besides, shape-based clustering becomes challenging for large volume of data stream since standard distance measures and parametric models are unable to capture arbitrary shaped clusters. Although density-based algorithms have been proposed to overcome this limitation, they suffer from a high computational complexity and sensitivity to parameters. Similarly, spectral clustering algorithms suffer from scalability issues. The three main problems in this area which are related to this property include:

The Processing time in data stream which is only available once and there is no track back.

Detecting change in evolutionary stream data and detecting concept drift during the unexpected time.

Finding arbitrary shape clustering using distance-based clustering in real time application. There are some considerations need to look into during the process:

Splitting the data stream in sub-cluster accurately.

A subspace cluster can have very sparsely distributed points in dimensions other than the subspace dimensions.

Volume of existence data and complexity in clustering.

Similarity measurement for non-convex shape (arbitrary shape).

Distinguish between novelty micro clusters and uncertain data.

Ensemble the generated sub-clusters to macro-cluster (Final cluster)

Some works have been done in the area of data stream clustering to cluster arbitrary shape. This continues effort has one common goal to achieve and produce a scalable and an accurate stream clustering method in better complexity time and space. However, these studies still cannot lead the users to obtain correct result. For the current methods for data stream clustering, the quality of results still does not satisfy users in some applications such as GIS, spatial data. This kind of application generates data with shape-based clustering in high volume of data in real time. Therefore, due to this fact, clustering of such data stream leads to high computation time for clustering. Furthermore most researches in data stream clustering haven't focused on distance-based clustering algorithm of arbitrary shape data stream.

One category of clustering is distance-base clustering which is usually applied for convex shape cluster. There is no reference on monitoring cluster on data stream with non-convex shape in distance base clustering. Some of the clustering algorithms which are developed for data streams are distance based. Although, they are appropriate to find convex-shaped clusters. Nevertheless, there is weakness in distance-base cluster to find the true clusters for non-convex clusters efficiently.

Another issue is uncertain data streams; they are often possible to be present in many real applications such as GIS and geometric data. Uncertain data occurred due to incorrect recording mechanisms. In the data stream, the uncertainty meaningfully affects the micro- clustering process, because of the changing behaviour of the different attributes and its effect on the distance computations. In addition, the current data stream clustering algorithms cannot mine true clusters efficiently because of their weakness in determining important parameters such as k number of exact cluster in stream of data and considering the novelty and concept drift in the stream of data especially in real time era. This is also related to problem in finding good scalability in data stream clustering algorithm. therefore a novel approach should be designed to overcome the difficulty in data stream clustering with consideration on the above mentioned problems to achieve more accurate result in high speed and scalable.

Research objective

The primary objective to this research is to improve the effectiveness of uncertain data stream clustering in terms of accuracy, speed and scalability. In particular , this research focus on designing methods to overcome main difficulties in data stream clustering including scan data once, concept drift, uncertain and non-convex shape of data stream. In achieving the objective, the following objectives are adopted:

To propose an algorithm to detect concept drift and simplified fading function for removing expired micro clusters and to detect novel micro-clusters in batch processing.

To improve the measurement method for clustering data stream for non-convex shape clustering. A new unified similarity metric need to be defined that takes into account the dimensionality of the seed clusters along with the distance similarity.

In order to fulfil the objective (1) and (2) a new framework is proposed that consist of online and offline component to cluster high scale data sets in real time processing as well as scaling algorithm to higher dimensions using a method such as Locality sensitive hashing ( LSH).

Scope of Study

This proposal focused on the uncertain data stream clustering which creates clusters to extract knowledge from unlabelled stream data. As a result, it should be applied without any prior knowledge about data. Moreover, concept drift, uncertain data and data with arbitrary shape is presumed. Regarding memory constrains, one pass over entire data sets is considered but several pass for one fragment of data in the memory is possible.

There are some critical assumptions which is important to be mentioned in this research proposal. Those assumptions are followed:

The uncertain data should be included in the dataset to show the efficiency and robustness of the method behaviour for handling uncertainty.

The data should be in irrelevant shape (non-convex shape) to show the effectiveness of the method to handle arbitrary shape clustering.

In the dataset with d-dimensional attributes, there is no correlation among their properties.

Data is numeric in the dataset. Sequential and time series are not considered in this scope of study.

Numbers of dimensions for all attributes is equal which means pre-processing step is not included in this research.

The performance of the proposed method is compared with most prominent method such as DCSTREAM and ConStream.

Two important groups of parameters are applied to measure the quality of data stream clustering which are named Cluster's Purity and Silhouette.

In this research proposal, the real-world data (UCI repository) and synthetic datasets are carried out in the batch processing to emphasize method performance (Experiments part).

Related work

As referred to data stream clustering which is discussed in previous section, comes with some difficulties in managing uncertain data stream due to generating ambiguous data from diversity streaming application in high scale of data in real time application and dealing with arbitrary shape clusters that traditional clustering algorithms does not work well on such non-convex shape. The following sections will provide the related works which have been done in uncertain data stream clustering and arbitrary shape clusters.

Uncertain Data Stream

Clustering data stream is one the challenging task in data stream mining which the rate of stream is very fast and the volume of data stream is huge. Hence space efficient once pass algorithm can be devised to cope the online query.

The first work done on clustering data stream was STREAM by [6]. STREAM is the extended of k-mean clustering for data stream computation. Afterwards, a CluStream was proposed by [3] for clustering analysis using micro-cluster methodology. This framework was showed more flexible cluster for data stream [6].

Zhou proposed CluWin method for clustering data stream. It is a solution for slide window model by merging exponential histogram with cluster feature [7].

Lately, a probabilistic data steam clustering became more significant due to uncertainty in real applications. Most recently work focused on static uncertain data which can be visited multiple times. Consequently, static uncertain data clustering were unable to manage and process the uncertain data stream.

To manage the uncertain data, some works were done such as FDBSCAN [8] and FOPTICS [9]. These algorithms are applied to cluster uncertain data using density based cluster. In uncertain data, Uk-Mean clustering also was developed which is extended of k-mean clustering [10].

Michael Chau et al [11] proposed UK-mean clustering to cluster uncertain data by minimizing the expected sum of squared error E(SSE). In UK-Mean clustering data object is determined by uncertainty region with probability density base (pdf). Therefore, the difference between UK-Mean clustering and K-Mean clustering is computation distance and clusters. UK-Mean clustering is incapable of implementing with data stream in real time application [11].

Yu Zhiwen et al [12] proposed fuzzy dimensional induced clustering (FDIC) algorithm which applies fuzzy distance clustering with fractal correlation dimension to find appropriate cluster in low dimensional subspace in the uncertain database. In order to find kth nearest neighbour in clustering, Yu ZHiwen et al in [12] proposed the FKNN algorithm to retrieve cluster efficiently. But the weakness of this work cannot be applicable in data stream clustering [12].

These works which were done it can only be applicable for uncertain data in deterministic environment.

Therefore, there should be developed methods which are capable of managing uncertain data stream. Some works were done in this area. The first work in uncertain data stream which is done by Aggarwal and Yu was named UMicro [13]. This method used a general model of uncertainty that it is presumed only standard error of data points is available. Aggarwal and Yu in [13] did show that by using even such a modest uncertainty in mining process, it is appropriate to improve the quality of cluster efficiently. Nevertheless, this method is not applicable for probabilistic point model.

Chen Zhang and et al proposed PMicro with the use of micro-clustering framework to cluster uncertain data stream. In comparison with UMicro which is used Error-Based Cluster Feature (ECF), PMicro applied probability cluster feature (PCF) because UMicro cannot be applied for point probability model directly. PMicor for similarity in clustering uses both distance metric and probability of data point. In PMicro clustering, weighted decay was proposed in order to reduce the effectiveness of old micro cluster in clustering [1].

PMicro resulted high commutation cost. To tackle with this problem, Chen Zhang and et al [1] in the same article, applied heuristic model which is consist of two phases: First it choose small part of cluster that is close to the point. Second it selects the candidate cluster from small dataset. Obviously, this method cannot get the optimal cluster theoretically [1].

There are some works which were done for different kind of uncertainty of data stream such as HCluStream. In order to tackle the problem of heterogeneous data stream in deterministic data stream environment, Heterogeneous CluStream (HCluStream) was proposed by Yang Chunyu [8]. HCluStream is designed and modified based on combination of Clustream and HPStream version which is used for heterogeneous data. In real application, there is both continuous and categorical attributes. Therefore, data stream with both continues and categorical attribute, is called heterogeneous data stream. HCluStream mainly defines the micro-cluster structure with heterogeneous feature (m-HCluster) and the distance metric between the sample and the sample under the heterogeneous attributes [14].

Zhou Aoying in [15] proposed Low uncertainty Micro clustering (LuMicro) which employed online micro cluster and and offline macro cluster same as UMicro that was done by Aggarwal [13]. UMicro does not consider the influence of uncertainty in data stream clustering. The LuMicro cluster improved the parameters of micro-cluster to get better quality of cluster. LuMicro in comparison with Umicro clustering works more faster and scalable due to setting some parameters and using probability cluster feature metric instead of error base cluster feature metric [15].

GUO-YAN HUANG et al [16] proposed HU-Stream clustering to manage heterogeneous data stream with uncertainty in categorical attributes.HU-Stream uses heterogeneous uncertainty clustering feature (H-UCF) for tracking the statistical categorical attribute.HU-Stream employs two main mechanisms same as LuStream in [15]. In the first stage compute the similarity measurement to get uncertainty information, the second phase is to get information of new arriving tuple by using H-UCF for categorical attributes. The experimental result showed that the quality of cluster in HU-Stream is more in comparison with UMicro [16].

There are some related works which were done in classification and association rule mining in uncertain data stream. Such as FMU, REZA AKBARINIA AND FLORENT MASSEGLIA proposed Fast Mining of Uncertain data streams (FMU) algorithm for probabilistic frequent itemsets in uncertain data stream with sliding window. This method monitors the probabilistic of itemsets in real time. The result showed the efficiency of algorithm [17].

CLARO and PODS models were proposed by Thanh T. L. and et al in [18, 19] for continuous random variables. Due to two reasons the models were proposed: first computationally is difficult to obtain result distributions when input tuples are modelled using continuous random variables. Such computation often involves multivariate integrals or requires new algorithms to be designed. Second, such computation must be performed for high-volume data streams [18, 19].

But this method has some difficulties with query optimization and correlation among across tuple and multiple aggregations. The other main drawback of this approach is that the formula of the result distribution involves an unresolved parameterized integral. To get sufficient knowledge of the result distribution (e.g., calculating its mean and variance), one needs to repeat the inverse transformation for a large number of points. This is expensive when the integration has no closed-form solution and needs resorting to numerical solutions [18, 19].

Chandima HewaNadungodage et al in [20] proposed a method to find frequent pattern mining from uncertain data stream which are named UHS-Stream and TFUHS-Stream. UHS-Stream was designed based on hyper structure false-positive oriented algorithm to find whole frequent itemsets until the current moment. TFUHS-Stream was designed based on a false-positive oriented TF model in time fading manner to find frequent itemsets in an uncertain data stream. the method showed the effectiveness in terms of memory usage, accuracy and runtime [20].

In data stream application with uncertainty to collect label sample is very hard and sometime is impossible. Chunquan Liang and et al [21] proposed concept-adapting very fast decision tree (CVFDT) algorithm for uncertain data stream in unlabelled classification. [21] proposed an algorithm namely puuCVFDT (CVFDT for positive and unlabelled uncertain data) to handle concept drift with uncertainty under unlabelled and positive learning. But in real application, the proposed method is incapable to observe the probability of the target concept or it is very low [21].

Arbitrary Shape Data Stream

There are some related works which have been done to cluster data with arbitrary shape in order to improve better quality of clustering.

Eric W et al. [22] proposed Agglomerative Partitional Clustering (APC) for arbitrary cluster shape. APC attempts to solve the situation that the centroids of clusters are joined with region of high density with line segment. All interconnected centroids preserved as single cluster. APC estimates the density area of centroids clusters, if two centroids of clusters share the area of continuous high density, these two centroids of clusters are replaced with line segment joining of two original centroids. The computation of the method is fast with low memory and robust against noisy data [22].

Hang Wang et al proposed OPCluStream to solve the arbitrary shape in clustering and global parameters setting in density base method to discovering overlap clustering. [2] proposed tree topology for density base structure. In this tree every point direct to its father node, except noise which has no father. When a new point is arrived in on a data stream and is transformed into structure of cluster whenever is required. Clusters partitions could be regained through clustering structure. The method did show high efficiency and linear complexity on large dataset [2].

Amr Magdy et al proposed MDStream to solve the arbitrary shape and densities which other algorithm didn't consider the arbitrary density. The threshold of density in density-base clustering is static. [23] improved the threshold to dynamic. MDStream used grid structure in online component. Grouping the neighbouring dense grid cells into clusters was done in offline component. MDStream is more scalable and efficient in terms of memory usage and running than DStream [23].

Amin Namadchian et al proposed DSCLU clustering to solve arbitrary shape in multi density environment. DSCLU contains two phases online and offline phases. In online phase, a buffer is used that number of micro-cluster shouldn't exceed a threshold which is determined as an input parameter. Offline phase is similar to DBSCAN which is begun with dense micro cluster and combine the dominant neighbours. MintPts and gamma variables are input parameters for this phase. DSCLU can find cluster better than D-Stream and Den-Stream. But the weakness of the DSCLU is more reply on input parameters [24].

Hongbin GAO and et al proposed GRDCluStream to solve the description of current data stream which wasn't considered the dynamic different case of streaming data in CluStream. CluStream doesn't support arbitrary shape clustering. GRDCluStream was designed based on relative density and grid which involves two parts, online and offline process. Online process record the current feature of data point stream and offline process assessing specified time window for users.in online process data stream is accepted rapidly and saved the result as pyramid time frame. Quality of clustering by solving the problem of boundary point, threshold sensitivity of density using absolutely density was improved by GRDCluStream. The weakness of proposed method which is the effect of clustering in high dimension data steam is not noticeable and distinguishing noises are very hard and specially in high dimension [25].

C. Pöelitz et al suggested a method that consider both time and space for an even. The method comprise two phases: first, extracting spatial cluster using means of DBSCAN with distance threshold which is specified by user. Second, Applying DBSCAN again to each spatial cluster with a specified temporal distance threshold that is selected based on features of the spatial cluster [26].

Vineet Chaoji et al proposed a method to solve the arbitrary shape clusters in data using ABACUS method. The proposed method is based on intrinsic structure for each cluster that is referred as a backbone of that cluster. The backbone involves of smaller set of data points that allows method to scale the data in large dataset. ABACUS is contained two phases. First phase is detecting the backbone of each cluster using iterative process which is made up of point merging and point movement process. Second phase is discovering of true clusters. The proposed method cannot be applicable in data stream which is data come in one side and exit from other side very fast and needs one-pass algorithm [27].

Zhijun Yao et al proposed neighbour-expanding approach which is combination of non-greedy growth (NGG) and maximum-likelihood-first (MLF) algorithms. This approach was applied to the murine typhus spatial data in South Texas to solve the arbitrary shape in this case study [28].

Huanliang Sun et al proposed CDS-Tree clustering method to solve the arbitrary shape in clustering based on space partition. But the issue in space partition that cannot be applicable in data stream due to large memory space which data stream clustering is constrained to memory space and low efficiency in high dimensional space. The CDS-Tree is improved method of space partition for data stream. The proposed method is an index structure to cluster arbitrary shape in high dimensional with high accuracy. Compact structure of CDS-Tree costs small memory due to saving only non-empty cells, hence, its efficiency is high. For changing data stream, CDS-Tree used Data Skew factor (DSF) which is a measure for data skew to adjust partition granularity [29].

M. Soleymani Baghshah et al [3] proposed an algorithm that can be applied for arbitrary shape. The proposed algorithm has three parts of clustering: first step is grouping data point in sub-cluster based on Gustafson-Kessel (GK), which considers covariance and centre vector for each cluster. Second step is forming cluster from sub-cluster using Bhattacharya distances between sub-clusters. Final step is finding final cluster based on hierarchical algorithm. But before applying hierarchical algorithm, a minimum spanning forest (MST) of the graph is used the Kruskal's algorithm. The proposed method showed low computational complexity in comparison with other methods which are applied for arbitrary shape [3].

Bidyut Kr. Patra et al [30] proposed leader single-link (l-SL) which is distance-based clustering to cluster arbitrary shape. L-SL is a combination of leader clustering and single-link clustering. For Each new data point is measured based on two parameters h (inter cluster distance, distance between a pair of clusters is more than h) and t (distance criteria) and leader (l) (e.g. || x-l||<= t). l-sl has some difficulty to finalize the clustering result. To tackle this problem, augmented l-SL (al-SL) was proposed to solve the issue. The complexity time and space of al-SL is slightly better than l-sl. But it is much better than single link (SL). But the complexity of the proposed method is still high and it is depend on some parameters setting [30].

Vineet Chaoji et al proposed SPARCL algorithm to solve arbitrary shape. [31] modified k-mean clustering to be applicable in arbitrary shape. The complexity and space of k-mean is linear. This algorithm has two stages. First, runs on whole dataset to find many small seed cluster. Second, SPARCL iteratively combines the generated small cluster (from first stage) to attain the final cluster of shape-based. The SPARCL algorithm is more efficient than other shape-based cluster algorithms. But it has some difficulty in data stream and high dimensional environment [31].

Research Methodology

In this proposal in order to tackle problem statements and cover objectives, a novel framework is proposed for clustering uncertain data stream. The proposed method in real time environment can be divided into two components: Online and online component. In proposed method, primary task is converting a large amount of data to data stream with window of data for online processing as a window size. In the proposed method, data are active during window time stamp T1 and it will be expired during window time stamp T2. Therefore, visiting data is only possible in time window time stamp over whole data. Figure 1 is presents the proposed method which is as follow:

Dataset

Convert to Data Stream with Window size h

Producing Sub-Set of data

Seed-Cluster

Generator

Repository of Seed-Clusters

Detecting Expired Micro-Cluster using Fading Function

Ensemble Seed-Cluster based on user defined

(Macro-Cluster)

Handling Uncertain Data

Online Component

Offline Component

Figure 1: Framework of Proposed Method

Applying distance-based clustering such as k-mean in data stream for arbitrary shape clustering is main challenges of this proposal. As it can be seen in Figure 1, the proposed method is consisted of two phases: online component and offline component.

Pre-processing and Converting data to form of data stream with window size h is performed outside of online component. In order to generate sub-set of data with windows size h, k-mean clustering is applied.as mentioned in related works, uncertain data influence on quality of clusters. Therefore, before generating seed-clutters, uncertain data should be detected based of uncertainty boundary in order to improve the quality of cluster, time and space. These steps are done in online component.

Generating macro-clusters is performed in offline component. By storing seed-clusters, we are able to detect an expired seed cluster using fading function and merging seed clusters to macro-cluster based on user defined (number of cluster which is defined from user).