Data mining is the process of analyzing data from diverse perspectives and summarizing it into useful information; in simple words, it is the extraction of knowledge from large amounts of data. The information obtained through data mining can increase revenue and cut costs. Data mining software allows you to analyze data from different perspectives, classify it, and summarize the relationships found. Data mining is thus a method for discovering correlations and patterns among the many fields of a large relational database.
Cluster and clustering: Clustering is the process of partitioning data into groups of similar objects (clusters). It is a data-modeling technique that yields concise summaries of the data.
There are several clustering techniques, which are discussed in the sections that follow.
AN OVERVIEW OF CLUSTER ANALYSIS
Clustering is the process of grouping a set of physical or abstract objects into classes of related objects. A cluster is a group of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. The process of clustering involves first partitioning the data set into groups based on similarity and then assigning labels to the groups. Clustering is advantageous because it is adaptable to changes and helps single out the features that distinguish different groups.
Cluster analysis is similar to the human activity of distinguishing cats from dogs, or bikes from cars. It is widely used in fields such as market research, pattern recognition, data analysis, and image processing. In business, for example, clustering helps marketers discover distinct groups in their customer bases and characterize each group by its purchasing patterns. Because clustering partitions large data sets into groups according to similarity, it is also called data segmentation. Cluster analysis can be used as a tool to gain insight into the distribution of data, to examine the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. Data clustering contributes to areas such as data mining, statistics, machine learning, spatial databases, and marketing.
Cluster analysis tools are available in software packages such as S-Plus, SPSS, and SAS. Clustering is an example of unsupervised learning in machine learning; for this reason it is a form of learning by observation rather than learning by examples. Efforts in data mining have concentrated on discovering methods for effective cluster analysis in large databases. This research has produced active themes that focus on the scalability of clustering methods, the effectiveness of methods for clustering complex shapes and types of data, high-dimensional clustering techniques, and the clustering of mixed numerical and categorical data in large databases.
Clustering: A Challenging Field
Clustering is a challenging field of research, in which each prospective application poses its own special requirements.
The typical requirements of clustering in data mining include scalability, the ability to deal with different types of attributes, the discovery of clusters of arbitrary shape, minimal requirements for domain knowledge to determine input parameters, the ability to deal with noise and outliers, insensitivity to the order of input records, support for high dimensionality, and the interpretability and usability of results.
Clustering methods:
The major clustering methods available are
There are two types of classification: supervised and unsupervised.
Supervised classification: the available data and their amount are known in advance.
Unsupervised classification: both the availability and the amount of data are unknown. This is called CLUSTERING.
Intra-cluster similarity (among objects within a cluster) should be high compared with inter-cluster similarity (between objects of different clusters).
Clustering Techniques
1. Partitioning algorithms
In general, a database of n data objects is partitioned into a set of k clusters. The criteria for partitioning are:
Heuristic methods: k-means, k-medoids, CLARANS
Global optimum: exhaustively enumerate all possible partitions
k-means:
Each cluster is represented by its center (the mean of its members).
The k-means algorithm proceeds as follows: (1) select k objects as initial cluster centers; (2) assign each object to the cluster whose center is nearest; (3) recompute each center as the mean of the objects assigned to it; (4) repeat steps 2-3 until the assignments no longer change.
The k-means method has several disadvantages: the number of clusters must be predefined, the method cannot handle outliers, and a single parameter (the mean point) is used to characterize each cluster, which is often insufficient.
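As an illustration of the method described above, the following is a minimal k-means sketch in Python/NumPy (the sample data, the choice of k, and the convergence test are all illustrative assumptions; note that k must be supplied up front, which is the first drawback listed):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # initial centers
        for _ in range(n_iter):
            # assignment step: each point goes to its nearest center
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # update step: each center becomes the mean of its cluster
            # (keep the old center if a cluster happens to become empty)
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers

    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
    labels, centers = kmeans(X, k=2)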
k-medoids or PAM (Partitioning Around Medoids):
Each cluster is represented by one of the objects in the cluster
The methodology is improved by using medoids (the most centrally located object in a cluster) as reference points instead of the mean values used earlier.
PAM:
Initially a set of medoids is selected; then, iteratively, one of the medoids is replaced by a non-medoid whenever the replacement improves the total distance of the resulting clustering. This method works effectively for small data sets but does not scale well to large ones.
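A rough Python sketch of this swap-based idea follows (assumptions: a small numeric data set so the full distance matrix fits in memory; this greedy swap loop is illustrative, not the published PAM algorithm):

    import numpy as np

    def pam(X, k, seed=0):
        n = len(X)
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
        medoids = list(np.random.default_rng(seed).choice(n, size=k, replace=False))
        cost = lambda meds: D[:, meds].min(axis=1).sum()  # total distance to nearest medoid
        best = cost(medoids)
        improved = True
        while improved:
            improved = False
            for i in range(k):                       # try replacing each medoid...
                for o in range(n):                   # ...with each non-medoid
                    if o in medoids:
                        continue
                    trial = medoids[:i] + [o] + medoids[i + 1:]
                    c = cost(trial)
                    if c < best:                     # keep the swap if it lowers total cost
                        medoids, best, improved = trial, c, True
        return medoids, D[:, medoids].argmin(axis=1)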
CLARA (Clustering Large Applications):
Several data samples are chosen and PAM is applied to each sample; the best resulting clustering is returned as the output. CLARA handles larger data sets than PAM, but its efficiency and quality depend on the sample size.
CLARANS ("Randomized sampling" CLARA):
This technique is more efficient and scalable than both PAM and CLARA. In this method the neighboring samples are drawn dynamically. The clustering process can be viewed as a search through a graph in which every node is a potential solution, that is, a set of k medoids. When a local optimum is found, CLARANS restarts from a new, randomly selected node to search for a new local optimum.
2. Hierarchical approach:
Based on some criterion, the set of data objects is decomposed hierarchically.
Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
In this approach a distance matrix is used as the clustering criterion. The number of clusters k is not required; instead, a termination condition is needed.
AGNES (Agglomerative Nesting)
A bottom-up approach that merges clusters, using the dissimilarity matrix, in non-descending order of dissimilarity. Merging is performed so that one cluster is merged into another, larger cluster, then into a still larger cluster, and so on until all the objects are merged into a single cluster. This method is not scalable; its time complexity is O(n2), where n is the total number of objects.
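For illustration, this bottom-up merging can be reproduced with SciPy's hierarchical-clustering routines (the 'average' linkage and the cut into 3 clusters are arbitrary choices for this sketch):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(20, 2)
    Z = linkage(X, method='average')                 # bottom-up merge tree (dendrogram data)
    labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters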
DIANA (Divisive Analysis)
Top-down decomposition, the inverse of agglomerative nesting. In this method a cluster is split into smaller clusters, and so on, until each object forms its own cluster. This method is rarely applied.
Integration of hierarchical with distance-based clustering:
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): uses a CF (Clustering Feature) tree and incrementally adjusts the quality of sub-clusters. This approach is the basis for multi-phase clustering.
Step 1: scan the database to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve its inherent clustering structure)
Step 2: an arbitrary clustering algorithm is used to cluster the leaf nodes of the CF tree
This method handles only numeric data and is sensitive to the order of the data records.
CF-Tree in BIRCH
Clustering feature: concisely summarizes the statistics of a given sub-cluster: from the statistical point of view, the 0th, 1st, and 2nd moments of the sub-cluster. It uses storage efficiently and stores the crucial measurements needed for cluster computation.
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering. A non-leaf node in the tree has "children", and it stores the sums of the CFs of its children.
A CF tree has two parameters: the branching factor (the maximum number of children per non-leaf node) and the threshold (the maximum diameter of the sub-clusters stored at the leaf nodes).
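The following small Python sketch illustrates the CF triple CF = (N, LS, SS) and its additivity, which is what lets BIRCH merge sub-clusters cheaply (the points are made up for the example):

    import numpy as np

    def cf(points):
        pts = np.asarray(points, dtype=float)
        # CF = (N, LS, SS): count, linear sum, and sum of squared norms
        return len(pts), pts.sum(axis=0), (pts ** 2).sum()

    def merge(a, b):
        # additivity: the CF of a merged sub-cluster is the sum of the two CFs
        return a[0] + b[0], a[1] + b[1], a[2] + b[2]

    n, ls, ss = merge(cf([[1, 2], [2, 3]]), cf([[8, 8]]))
    centroid = ls / n                                           # from the 1st moment
    radius = np.sqrt(max(ss / n - (centroid ** 2).sum(), 0.0))  # from the 2nd moment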
ROCK (RObust Clustering using linKs):
Clusters categorical data by neighbor and link analysis; links are used to measure similarity/proximity, so it is not a distance-based clustering method. Its computational complexity is of order O(n2 + n*mm*ma + n2 log n), where mm and ma are the maximum and average numbers of neighbors of a point.
Algorithm: sampling-based clustering
CHAMELEON:
Hierarchical clustering based on dynamic modeling; a method for clustering complex objects.
The similarity between clusters is measured using a dynamic model: two clusters are merged only if the interconnectivity and closeness (proximity) between them are high relative to the internal interconnectivity of each cluster and the closeness of items within it.
By contrast, CURE ignores information about the interconnectivity of objects, while ROCK ignores information about the closeness of two clusters.
A two-phase algorithm: first, a graph-partitioning phase groups the objects into a large number of relatively small sub-clusters; second, an agglomerative phase repeatedly merges these sub-clusters based on their relative interconnectivity and closeness.
3. Density-based approach:
In this approach clustering is based on density (a local cluster criterion), such as density-connected points.
Typical methods: DBSCAN, OPTICS, DENCLUE
Properties: discovers clusters of arbitrary shape; handles noise; needs one scan of the data; requires density parameters as a termination condition.
Two parameters: Eps, the maximum radius of the neighborhood, and MinPts, the minimum number of points in an Eps-neighborhood of a point.
Density-reachable:
A point p is density-reachable from a point q w.r.t. Eps and MinPts if there is a chain of points p1, ..., pn, where p1 = q and pn = p, such that each p(i+1) is directly density-reachable from p(i).
Density-connected:
A point p is density-connected to a point q w.r.t. Eps and MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.
DBSCAN:
Density-Based Spatial Clustering of Applications with Noise. It relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points. It discovers clusters of arbitrary shape in spatial databases with noise.
Algorithm: arbitrarily select a point p; retrieve all points density-reachable from p w.r.t. Eps and MinPts; if p is a core point, a cluster is formed; if p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database; continue until all points have been processed.
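As a quick usage sketch, scikit-learn provides a DBSCAN implementation whose eps and min_samples arguments correspond to the Eps and MinPts parameters above (the data and parameter values here are illustrative):

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 6])
    labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X)
    # label -1 marks noise points that belong to no density-connected cluster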
OPTICS (Ordering Points To Identify the Clustering Structure): instead of producing a single clustering explicitly, it computes an ordering of the points from which clusters at different density levels can be extracted.
DENCLUE (DENsity-based CLUstEring):
Clustering using statistical density functions: the overall density of the data space is modeled as the sum of the influence functions (for example, Gaussian kernels) of all data points, and clusters are determined by density attractors, the local maxima of this overall density function.
Properties: a solid mathematical foundation; good results for data sets with large amounts of noise; significantly faster than DBSCAN on many data sets, but sensitive to the choice of the density parameter and the noise threshold.
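As a sketch of such a statistical density function, the commonly used Gaussian influence function and the resulting overall density can be written as follows (in LaTeX notation; sigma is a user-chosen smoothing parameter, an assumption of this sketch):

    \[
    f_{\mathrm{Gauss}}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}},
    \qquad
    f^{D}_{\mathrm{Gauss}}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}
    \]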
4. Grid-based approach:
Based on a multi-level granularity structure (a multi-resolution grid data structure).
Typical methods: STING, WaveCluster, CLIQUE
STING: A Statistical Information Grid Approach
The spatial area is divided into rectangular cells.
There are several levels of cells corresponding to different levels of resolution
Each cell at a high level is partitioned into a number of smaller cells in the next lower level
Statistical info of each cell is calculated and stored beforehand and is used to answer queries
Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells:
count, mean, standard deviation (s), min, max
type of distribution: normal, uniform, etc.
A top-down approach is used to answer spatial data queries:
Start from a pre-selected layer, typically one with a small number of cells.
For each cell in the current level, compute the confidence interval.
Remove irrelevant cells from further consideration.
When the current layer has been examined, proceed to the next lower level.
Repeat this process until the bottom layer is reached.
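A tiny Python sketch of the bottom-up statistics aggregation described above (an assumed two-level grid; each higher-level cell's count derives from its 2x2 block of children):

    import numpy as np

    pts = np.random.rand(1000, 2)
    low = 8  # the low-level grid has low x low cells
    counts, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=low, range=[[0, 1], [0, 1]])
    # a parent cell's count is simply the sum over its 2x2 block of children;
    # means, min, max, etc. aggregate in the same bottom-up fashion
    parent_counts = counts.reshape(low // 2, 2, low // 2, 2).sum(axis=(1, 3))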
Advantages: query-independent, easy to parallelize, and supports incremental updating; query processing takes O(K) time, where K is the number of grid cells at the lowest level (usually much smaller than the number of data points).
Disadvantages: all cluster boundaries are either horizontal or vertical, so diagonal boundaries are never detected.
WaveCluster: Clustering by Wavelet Analysis
A multi-resolution clustering approach that applies a wavelet transform to the feature space.
How the wavelet transform is applied to find clusters:
Summarize the data by imposing a multidimensional grid structure onto the data space.
The multidimensional spatial data objects are represented in an n-dimensional feature space.
Apply the wavelet transform to the feature space to find the dense regions in it.
Applying the wavelet transform multiple times yields clusters at different scales, from fine to coarse.
Wavelet transform: a signal-processing technique that decomposes a signal into different frequency sub-bands (it can be applied to n-dimensional signals).
The data are transformed so as to preserve the relative distances between objects at different levels of resolution.
This allows natural clusters to become more distinguishable.
Algorithm
Input parameters:
the number of grid cells for each dimension
the wavelet, and the number of applications of the wavelet transform
Why is wavelet transformation useful for clustering?
It uses hat-shaped filters to emphasize regions where points cluster while simultaneously suppressing weaker information at their boundaries.
Effective removal of outliers; multi-resolution; cost-effective.
Major features:
Complexity O(N)
Detects arbitrarily shaped clusters at different scales
Insensitive to noise and to the order of input
Applicable only to low-dimensional data
Both grid-based and density-based
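A hedged end-to-end sketch of the WaveCluster idea in Python, assuming 2-D data, the PyWavelets package for the transform, and a hand-picked density threshold (a simplification, not the published algorithm):

    import numpy as np
    import pywt
    from scipy.ndimage import label

    pts = np.vstack([np.random.randn(300, 2), np.random.randn(300, 2) + 8])
    # step 1: quantize the feature space onto a grid (point counts per cell)
    grid, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=64)
    # step 2: one level of 2-D wavelet transform; the approximation band is a
    # smoothed, lower-resolution view of the point density
    approx, _ = pywt.dwt2(grid, 'haar')
    # step 3: threshold the smoothed density and label connected dense regions
    dense = approx > approx.mean() + approx.std()
    clusters, n_clusters = label(dense)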
CLIQUE (Clustering In QUEst)
Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space.
CLIQUE can be considered as both density-based and grid-based
It partitions each dimension into the same number of equal-length intervals.
It thereby partitions an m-dimensional data space into non-overlapping rectangular units.
A unit is dense if the fraction of the total data points contained in it exceeds an input model parameter.
A cluster is a maximal set of connected dense units within a subspace.
Partition the data space and find the number of points that lie inside each cell of the partition.
Identify the subspaces that contain clusters, using the Apriori principle.
Identify the clusters:
Determine dense units in all subspaces of interest.
Determine connected dense units in all subspaces of interest.
Generate a minimal description for the clusters:
Determine the maximal regions that cover each cluster of connected dense units.
Determine a minimal cover for each cluster.
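A toy Python sketch of the bottom-up dense-unit search with the Apriori-style pruning described above (the xi intervals per dimension and the density threshold tau are illustrative assumptions; only 1-D and 2-D subspaces are shown):

    import numpy as np
    from itertools import combinations

    X = np.random.rand(500, 3)
    xi, tau = 10, 0.02                              # intervals per dimension, density threshold
    units = (X * xi).astype(int).clip(0, xi - 1)    # unit index of each point, per dimension

    def dense_units(dims):
        # count points per unit in the subspace spanned by dims; keep the dense ones
        u, c = np.unique(units[:, dims], axis=0, return_counts=True)
        return {tuple(row) for row, cnt in zip(u, c) if cnt / len(X) > tau}

    d1 = {d: dense_units([d]) for d in range(X.shape[1])}
    d2 = {}
    for a, b in combinations(range(X.shape[1]), 2):
        # Apriori pruning: a 2-D unit can be dense only if both 1-D projections are
        d2[(a, b)] = {u for u in dense_units([a, b])
                      if (u[0],) in d1[a] and (u[1],) in d1[b]}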
5. Model-based approach:
A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to each model.
Typical methods: EM, SOM, COBWEB
EM - Expectation Maximization
EM - A popular iterative refinement algorithm
An extension of k-means
Each object is assigned to a cluster according to a weight (probability distribution)
New means are computed based on these weighted measures
General idea
Start with an initial estimate of the parameter vector
Iteratively rescore the patterns against the mixture density produced by the parameter vector
The rescored patterns are then used to update the parameter estimates
Patterns belonging to the same cluster are those placed by their scores in the same mixture component
The algorithm converges quickly, but may not reach the global optimum
Initially, randomly assign k cluster centers
Iteratively refine the clusters based on the following two steps:
Expectation step: assign each data point Xi to cluster Ck with probability P(Ck | Xi) = P(Ck) p(Xi | Ck) / p(Xi) (Bayes' rule under the current mixture model)
Maximization step: re-estimate the model parameters so as to maximize the expected likelihood; for example, each new mean is the weighted average mk = (sum over i of P(Ck | Xi) Xi) / (sum over i of P(Ck | Xi))
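As a usage sketch, scikit-learn's GaussianMixture implements this EM procedure; predict_proba exposes the expectation step's soft assignments (the data and the choice of n_components here are illustrative):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.vstack([np.random.randn(200, 2), np.random.randn(200, 2) + 4])
    gm = GaussianMixture(n_components=2, random_state=0).fit(X)  # runs EM internally
    hard_labels = gm.predict(X)          # most probable component per point
    soft_weights = gm.predict_proba(X)   # the expectation step's soft assignments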
Conceptual clustering
A form of clustering in machine learning
Produces a classification scheme for a set of unlabeled objects
Finds a characteristic description for each concept (class)
COBWEB (Fisher'87)
A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form of a classification tree
Each node refers to a concept and contains a probabilistic description of that concept
6. Frequent pattern-based:
Based on the analysis of frequent patterns
Typical methods: pCluster
7. User-guided or constraint-based:
Clustering by considering user-specified or application-specific constraints
Typical methods: COD (clustering with obstacle objects), constrained clustering
In applications it is often desirable to have user-guided (i.e., constrained) cluster analysis.
Different constraints in cluster analysis:
Constraints on individual objects (do selection first)
Example: cluster only houses worth over $300K
Constraints on distance or similarity functions
Weighted functions, obstacles (e.g., rivers, lakes)
Constraints on the selection of clustering parameters
# of clusters, MinPts, etc.
User-specified constraints
Example: each cluster must contain at least 500 valued customers and 5,000 ordinary ones
Semi-supervised: a small training set is given as "constraints" or hints