A Study on Data Mining and Clustering


Data mining is the process of analyzing data from diverse perspectives and summarizing it into useful information. In simple words, it is the extraction of knowledge from large amounts of data. The information obtained through data mining can increase revenue and cut costs. Data mining software allows users to analyze data from different perspectives, classify it, and summarize the relationships found. Thus data mining is a method for discovering correlations and patterns among numerous fields in huge relational databases.

Cluster & clustering: Clustering is the process of partitioning data into groups of similar objects (clusters). It is a data modeling technique that gives concise summaries of the data.

There are several clustering techniques, including partitioning, hierarchical, density-based, grid-based, and model-based methods; these are surveyed below.

AN OVERVIEW OF CLUSTER ANALYSIS

Clustering is the process of grouping a set of abstract or physical objects into classes of related objects. A cluster is a group of data objects that are similar to one another within the same cluster and dissimilar to the data objects in other clusters. The process of clustering involves first partitioning the set of data into groups based on data similarity and then assigning labels to the groups. Clustering is advantageous because it is adaptable to changes and helps distinguish between clusters.

Cluster analysis is similar to the human activity of distinguishing cats from dogs, or telling apart different things such as bikes and cars. It is widely used in several fields, such as market research, pattern recognition, data analysis, and image processing. In business, for example, clustering helps marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns. Because clustering partitions large data sets into groups according to similarity, it is also called data segmentation. Cluster analysis can be used as a tool to gain insight into the distribution of data, to examine the distinctiveness of each cluster, and to focus on a particular set of clusters for further analysis. Data clustering contributes to areas such as data mining, statistics, machine learning, spatial databases, and marketing.

Cluster analysis tools have been built into software packages such as S-Plus, SPSS, and SAS. Clustering is an example of unsupervised machine learning; for this reason, clustering is a form of learning by observation rather than learning by examples. Efforts in data mining are concentrated on finding methods for effective cluster analysis in large databases. This research has resulted in active themes that focus on the scalability of clustering methods, the effectiveness of methods for clustering complex shapes and types of data, high-dimensional clustering techniques, and the clustering of mixed numerical and categorical data in large databases.

Clustering: a challenging field

Clustering is a challenging field of research in which its potential applications pose their own special requirements.

The typical requirements of clustering in data mining are: scalability; the ability to deal with different types of attributes; the discovery of clusters with arbitrary shape; minimal requirements for domain knowledge to determine input parameters; the ability to deal with noisy data; insensitivity to the order of input records; support for high dimensionality; constraint-based clustering; and interpretability and usability.

Clustering methods:

The major clustering methods available are partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, frequent pattern-based methods, and constraint-based (user-guided) methods; each is described below.

There are two types of classification: supervised and unsupervised.

Supervised classification - the class labels (and the number of classes) of the training data are known in advance.

Unsupervised classification - neither the class labels nor the number of classes is known in advance. This is called CLUSTERING.

A good clustering has high intra-cluster similarity (objects within the same cluster are similar) and low inter-cluster similarity (objects in different clusters are dissimilar).
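To make this concrete, a tiny NumPy illustration with made-up points (all names are illustrative):

```python
import numpy as np

# Two illustrative clusters of 2-D points (made-up data)
a = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
b = np.array([[9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])

def mean_pairwise_distance(X, Y):
    # average Euclidean distance over all pairs (x, y), x in X, y in Y
    return np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2).mean()

print("intra-cluster:", mean_pairwise_distance(a, a))  # small
print("inter-cluster:", mean_pairwise_distance(a, b))  # large
```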

Applications of Clustering

As noted above, clustering is applied in market research, pattern recognition, data analysis, image processing, spatial databases, and marketing.

Clustering Techniques

1. Partitioning algorithms

Generally, a partitioning method organizes a database of n data objects into m clusters. The criteria for partitioning are:

Heuristic methods: k-means, k-medoids, CLARANS

Global optimal: exhaustively enumerate all partitions

k-means:

Each cluster is represented by the center (mean) of the cluster.

The k-means algorithm: (1) arbitrarily choose k objects as the initial cluster centers; (2) assign each object to the cluster whose center is nearest; (3) recompute the mean of each cluster; (4) repeat steps 2 and 3 until the assignments no longer change.

There are various disadvantages of the k-means method: the number of clusters must be specified in advance; the method cannot handle outliers; and the mean alone is not always a sufficient representative of a cluster.
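As a concrete sketch, a minimal k-means in Python (NumPy only; the function name, seeding, and empty-cluster handling are illustrative choices):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # (1) choose k data points as initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # (2) assign each object to the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (3) recompute the mean of each cluster (keep old center if empty)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # (4) stop when the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```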

k-medoids or PAM (Partitioning Around Medoids):

Each cluster is represented by one of the objects in the cluster

We improve on k-means by using medoids (the most centrally located object in a cluster) as reference points instead of the mean values used earlier.

PAM :

Initially a set of medoids is selected; then, iteratively, one of the medoids is replaced by a non-medoid object whenever the swap improves the total distance of the resulting clustering. This method works effectively for small data sets but does not scale well to large data sets.
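A naive PAM sketch, assuming a precomputed (n, n) pairwise distance matrix D; the exhaustive swap search below is exactly what makes PAM expensive on large data sets:

```python
import numpy as np

def pam(D, k, max_iters=50):
    """Naive PAM: D is an (n, n) pairwise distance matrix."""
    D = np.asarray(D)
    n = len(D)
    medoids = list(range(k))  # arbitrary initial medoids
    def total_cost(meds):
        # each object contributes its distance to the nearest medoid
        return D[:, meds].min(axis=1).sum()
    for _ in range(max_iters):
        best_cost, best_swap = total_cost(medoids), None
        # try replacing every medoid by every non-medoid object
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                cost = total_cost(candidate)
                if cost < best_cost:
                    best_cost, best_swap = cost, candidate
        if best_swap is None:
            break  # no improving swap: local optimum reached
        medoids = best_swap
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels
```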

CLARA (Clustering Large Applications):

Multiple samples of the data set are chosen and PAM is applied to each sample; the best clustering obtained is returned as the output. CLARA deals with larger data sets than PAM, but its efficiency depends on the sample size and on how representative the samples are.

CLARANS ("Randomized sampling" CLARA):

This technique is more efficient and scalable than both PAM and CLARA. In this method the neighboring samples are drawn dynamically. The clustering process can be viewed as searching a graph in which every node is a potential solution, that is, a set of k medoids. When a local optimum is found, CLARANS starts from a new randomly selected node in search of a new local optimum.
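A compact CLARANS-style sketch under the same distance-matrix assumption; numlocal (number of restarts) and maxneighbor (neighbors examined before declaring a local optimum) are the algorithm's two parameters:

```python
import numpy as np

def clarans(D, k, numlocal=5, maxneighbor=100, seed=0):
    """CLARANS sketch: randomized search over sets of k medoids."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D)
    n = len(D)
    def cost(meds):
        return D[:, list(meds)].min(axis=1).sum()
    best, best_cost = None, np.inf
    for _ in range(numlocal):            # restart from several random nodes
        current = set(rng.choice(n, size=k, replace=False).tolist())
        current_cost = cost(current)
        j = 0
        while j < maxneighbor:           # examine random neighboring nodes
            m = rng.choice(list(current))                       # medoid out
            h = rng.choice([x for x in range(n) if x not in current])
            neighbor = (current - {m}) | {h}
            c = cost(neighbor)
            if c < current_cost:         # move to the better neighbor
                current, current_cost, j = neighbor, c, 0
            else:
                j += 1
        if current_cost < best_cost:     # keep the best local optimum found
            best, best_cost = current, current_cost
    return sorted(best), best_cost
```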

2. Hierarchical approach:

The set of data (or objects) is decomposed hierarchically based on some criterion.

Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON

In this approach a distance matrix is used as the clustering criterion. The number of clusters k is not required; instead, a termination condition is needed.

AGNES (Agglomerative Nesting)

A bottom-up approach that merges clusters, based on a dissimilarity matrix, in non-descending order of dissimilarity. Merging is performed so that one cluster is merged into another, larger cluster, that into a yet larger cluster, and so on, until all the objects are merged into a single cluster. This method is not scalable; its time complexity is at least O(n^2), where n is the total number of objects.
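As a usage illustration, an AGNES-style bottom-up clustering can be run with SciPy's hierarchical clustering routines (toy data; the linkage choice is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))  # toy data

# Agglomerative clustering: 'average' linkage repeatedly merges the two
# clusters with the smallest average pairwise distance
Z = linkage(X, method='average')

# Cut the resulting dendrogram into 3 flat clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```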

DIANA (Divisive Analysis)

Top-down decomposition; this is the inverse of agglomerative nesting. In this method a cluster is split into smaller clusters, and so on, until each object forms its own cluster. This method is rarely applied.

Integration of hierarchical with distance-based clustering:

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): uses a CF (Clustering Feature) tree and incrementally adjusts the quality of sub-clusters. This approach is the basis for multi-phase clustering.

Step 1: scan the database to build an initial in-memory CF-tree (a multi-level compression of the data that tries to preserve its inherent clustering structure).

Step 2: an arbitrary clustering algorithm is used to cluster the leaf nodes of the CF-tree.

This method handles only numeric data and is sensitive to the order of the data records.

CF-Tree in BIRCH

Clustering feature: it concisely summarizes the statistics of a given sub-cluster: from the statistical point of view, the 0th, 1st, and 2nd moments of the sub-cluster. Concretely, CF = (N, LS, SS), where N is the number of points, LS their linear sum, and SS their square sum. This uses storage efficiently while recording the crucial measurements needed for cluster computation.

A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering. A non-leaf node in the tree has "children" and stores the sums of the CFs of its children.

A CF-tree has two parameters: the branching factor B (the maximum number of children per non-leaf node) and the threshold T (the maximum diameter of the sub-clusters stored at the leaf nodes).
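A small sketch of the CF statistics and their additivity property (function names are illustrative):

```python
import numpy as np

def cf(points):
    """Clustering Feature of a set of d-dimensional points: (N, LS, SS)."""
    X = np.asarray(points, dtype=float)
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

def merge_cf(cf1, cf2):
    # CF additivity: the CF of two merged sub-clusters is the sum of their CFs
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return n1 + n2, ls1 + ls2, ss1 + ss2

def centroid(cf_entry):
    n, ls, _ = cf_entry
    return ls / n  # the centroid follows from N and LS alone
```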

ROCK:

RObust Clustering using linKs: clusters categorical data by neighbor and link analysis. Links (the number of common neighbors between two points) are used to measure similarity/proximity, so it is not a distance-based clustering. Its worst-case computational complexity is O(n^2 + n*m_m*m_a + n^2 log n), where m_a and m_m are the average and maximum numbers of neighbors.

The algorithm is sampling-based: draw a random sample, cluster it using links, and then label the remaining data.

CHAMELEON:

Hierarchical clustering using dynamic modeling; a method for clustering complex objects.

The similarity between clusters is measured based on a dynamic model: two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters.

(By contrast, CURE ignores information about the interconnectivity of the objects, while ROCK ignores information about the closeness of two clusters.)

It is a two-phase algorithm: (1) a graph partitioning algorithm clusters the objects into a large number of relatively small sub-clusters; (2) an agglomerative hierarchical clustering algorithm finds the genuine clusters by repeatedly combining these sub-clusters.

3. Density-based approach:

In this approach, clustering is based on density (a local cluster criterion), such as density-connected points.

Typical methods: DBSCAN, OPTICS, DENCLUE

Properties: discovers clusters of arbitrary shape, handles noise, requires only one scan of the data, and needs density parameters as a termination condition.

Two parameters: Eps, the maximum radius of the neighborhood, and MinPts, the minimum number of points required in the Eps-neighborhood of a point.

Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if p is within the Eps-neighborhood of q and q is a core point, i.e., the Eps-neighborhood of q contains at least MinPts points.

Density-reachable:

A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn with p1 = q and pn = p such that each p(i+1) is directly density-reachable from p(i).

Density-connected:

A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.

DBSCAN:

Density-Based Spatial Clustering of Applications with Noise. It relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points. It discovers clusters of arbitrary shape in spatial databases with noise.

Algorithm: arbitrarily select a point p; retrieve all points density-reachable from p w.r.t. Eps and MinPts; if p is a core point, a cluster is formed; if p is a border point, no points are density-reachable from p, so the next point of the database is visited; continue until all points have been processed.
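A minimal brute-force DBSCAN sketch (label -1 marks noise; a real implementation would use a spatial index for the neighborhood queries):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch. Returns cluster labels; -1 marks noise."""
    n = len(X)
    labels = np.full(n, -1)            # -1 = noise / not yet in a cluster
    visited = np.zeros(n, dtype=bool)
    def region(i):                     # indices within the Eps-neighborhood
        return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)
    cid = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        seeds = region(i)
        if len(seeds) < min_pts:
            continue                   # not a core point; stays noise for now
        labels[i] = cid
        seeds = list(seeds)
        while seeds:                   # expand the cluster from core points
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                nbrs = region(j)
                if len(nbrs) >= min_pts:     # j is itself a core point
                    seeds.extend(nbrs)
            if labels[j] == -1:              # claim border/noise points
                labels[j] = cid
        cid += 1
    return labels
```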

OPTICS:

OPTICS (Ordering Points To Identify the Clustering Structure) does not produce a clustering explicitly; instead it computes an ordering of the points, augmented with reachability distances, from which density-based clusterings for a range of parameter settings can be extracted. It has a structure similar to DBSCAN but is less sensitive to the choice of a single global density parameter.

DENCLUE:

DENCLUE (DENsity-based CLUstEring) clusters using statistical density functions: the overall density at a point is modeled as the sum of influence functions of the surrounding data points, and clusters are determined by density attractors (local maxima of the overall density function).

Properties: a solid mathematical foundation, good behavior on data sets with large amounts of noise, a compact description of arbitrarily shaped clusters in high-dimensional data, and speed significantly higher than that of algorithms such as DBSCAN; however, it requires a careful choice of its density and noise parameters.

4. Grid-based approach:

Based on a multiple-level granularity structure (a multi-resolution grid data structure).

Typical methods: STING, WaveCluster, CLIQUE

STING: A Statistical Information Grid Approach

The spatial area is divided into rectangular cells

There are several levels of cells corresponding to different levels of resolution

Each cell at a high level is partitioned into a number of smaller cells in the next lower level

Statistical info of each cell is calculated and stored beforehand and is used to answer queries

Parameters of higher-level cells can easily be calculated from the parameters of the lower-level cells:

count, mean, standard deviation, min, max

type of distribution (normal, uniform, etc.)
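As a small illustration of this bottom-up computation, the count, mean, and standard deviation of a parent cell follow from its children's statistics alone (the cell representation is an illustrative assumption):

```python
import numpy as np

def merge_cells(cells):
    """Combine (count, mean, std) statistics of child cells into the parent.

    Each cell is a (count, mean, std) tuple for one attribute; the parent's
    statistics are computed without revisiting the raw data points.
    """
    counts = np.array([c for c, _, _ in cells], dtype=float)
    means = np.array([m for _, m, _ in cells])
    stds = np.array([s for _, _, s in cells])
    n = counts.sum()
    mean = (counts * means).sum() / n
    # parent variance = weighted mean of (child variance + squared offset)
    var = (counts * (stds ** 2 + (means - mean) ** 2)).sum() / n
    return n, mean, np.sqrt(var)
```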

A top-down approach is used to answer spatial data queries:

Start from a pre-selected layer, typically one with a small number of cells.

For each cell in the current level, compute the confidence interval.

Remove irrelevant cells from further consideration.

When the current layer has been examined, proceed to the next lower level.

Repeat this process until the bottom layer is reached.

Advantages: the precomputed statistics are query-independent, the grid structure facilitates parallel processing and incremental updating, and query processing takes O(K) time, where K is the number of grid cells at the lowest level.

Disadvantages: all cluster boundaries are either horizontal or vertical; no diagonal boundaries are detected.

WaveCluster: Clustering by Wavelet Analysis

A multi-resolution clustering approach that applies a wavelet transform to the feature space.

How to apply the wavelet transform to find clusters:

Summarize the data by imposing a multidimensional grid structure onto the data space.

The multidimensional spatial data objects are represented in an n-dimensional feature space.

Apply the wavelet transform on the feature space to find the dense regions in it.

Applying the wavelet transform multiple times yields clusters at different scales, from fine to coarse.

Wavelet transform: a signal processing technique that decomposes a signal into different frequency sub-bands (it can be applied to n-dimensional signals).

The data are transformed so as to preserve the relative distance between objects at different levels of resolution.

This allows natural clusters to become more distinguishable.

The algorithm's input parameters:

# of grid cells for each dimension

the wavelet, and the # of applications of the wavelet transform

Why is the wavelet transform useful for clustering?

It uses hat-shaped filters to emphasize regions where points cluster while simultaneously suppressing weaker information at their boundaries.

This yields effective removal of outliers, multi-resolution analysis, and cost effectiveness.

Major features:

Complexity O(N)

Detects arbitrarily shaped clusters at different scales

Not sensitive to noise or to the order of input

Only applicable to low-dimensional data

Both grid-based and density-based
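A rough, WaveCluster-flavored sketch: quantize the points onto a grid, apply a Haar-style averaging (low-pass) step, and threshold the smoothed grid. The grid size g, number of levels, and threshold tau are illustrative parameters; a real implementation would use a proper wavelet filter bank and then label connected dense cells:

```python
import numpy as np

def wavecluster_sketch(X, g=32, levels=2, tau=2.0):
    """Grid counts + Haar-style low-pass, then threshold dense cells."""
    # 1. Quantize 2-D points onto a g x g grid (feature-space summary)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    idx = np.minimum(((X - mins) / (maxs - mins + 1e-12) * g).astype(int), g - 1)
    grid = np.zeros((g, g))
    np.add.at(grid, (idx[:, 0], idx[:, 1]), 1)
    # 2. Haar low-pass: average 2x2 blocks; repeating coarsens the scale
    for _ in range(levels):
        grid = 0.25 * (grid[0::2, 0::2] + grid[1::2, 0::2]
                       + grid[0::2, 1::2] + grid[1::2, 1::2])
    # 3. Dense cells in the transformed space indicate clusters
    return grid > tau
```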

CLIQUE (Clustering In QUEst)

CLIQUE automatically identifies the subspaces of a high-dimensional data space that allow better clustering than the original space.

CLIQUE can be considered both density-based and grid-based:

It partitions each dimension into the same number of equal-length intervals.

It thereby partitions an m-dimensional data space into non-overlapping rectangular units.

A unit is dense if the fraction of the total data points contained in the unit exceeds an input model parameter.

A cluster is a maximal set of connected dense units within a subspace.

The algorithm proceeds as follows:

1. Partition the data space and find the number of points that lie inside each cell of the partition.

2. Identify the subspaces that contain clusters using the Apriori principle.

3. Identify the clusters: determine the dense units in all subspaces of interest, then determine the connected dense units in those subspaces.

4. Generate a minimal description for each cluster: determine the maximal regions that cover the cluster of connected dense units, then determine a minimal cover for the cluster.
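A sketch of the first steps for the one- and two-dimensional cases (xi is the number of intervals per dimension, tau the density threshold; all names are illustrative):

```python
import numpy as np
from itertools import combinations
from collections import Counter

def dense_units_1d(X, xi, tau):
    """Find dense 1-D units: xi equal-length intervals per dimension;
    a unit is dense if it holds more than a fraction tau of all points."""
    n, d = X.shape
    mins, maxs = X.min(axis=0), X.max(axis=0)
    # bin index of every point in every dimension (0 .. xi-1)
    bins = np.minimum(((X - mins) / (maxs - mins + 1e-12) * xi).astype(int),
                      xi - 1)
    dense = {}
    for dim in range(d):
        counts = Counter(bins[:, dim])
        dense[dim] = {b for b, c in counts.items() if c / n > tau}
    return bins, dense

def candidate_2d(bins, dense, tau):
    """Apriori step: a dense 2-D unit must project onto dense 1-D units."""
    n = len(bins)
    out = set()
    for d1, d2 in combinations(dense, 2):
        counts = Counter(zip(bins[:, d1], bins[:, d2]))
        for (b1, b2), c in counts.items():
            if b1 in dense[d1] and b2 in dense[d2] and c / n > tau:
                out.add(((d1, b1), (d2, b2)))
    return out
```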

5. Model-based approach:

A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model.

Typical methods: EM, SOM, COBWEB

EM - Expectation Maximization

EM is a popular iterative refinement algorithm and can be viewed as an extension of k-means:

Each object is assigned to each cluster according to a weight (a probability of membership).

New means are computed based on these weighted measures.

General idea:

EM starts with an initial estimate of the parameter vector.

It iteratively rescores the patterns against the mixture density produced by the parameter vector.

The rescored patterns are then used to update the parameter estimates.

Patterns belong to the same cluster when their scores place them in the same mixture component.

The algorithm converges quickly, but it may not reach the global optimum.

Initially, k cluster centers are assigned at random; the clusters are then refined iteratively in two steps:

Expectation step: assign each data point Xi to cluster Ck with a probability computed from the current model (the posterior probability of the cluster given the point).

Maximization step: re-estimate the model parameters using the current, probabilistically weighted cluster memberships.
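A minimal EM sketch for a one-dimensional Gaussian mixture (illustrative; it omits a convergence test and any handling of degenerate variances):

```python
import numpy as np

def em_gmm_1d(x, k, iters=100, seed=0):
    """Minimal EM for a 1-D Gaussian mixture: x is a 1-D array of samples."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)   # initial means
    var = np.full(k, x.var())                   # initial variances
    pi = np.full(k, 1.0 / k)                    # mixing weights
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        dens = (pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                / np.sqrt(2 * np.pi * var))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the weighted points
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return mu, var, pi
```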

Conceptual clustering

A form of clustering in machine learning

Produces a classification scheme for a set of unlabeled objects

Finds a characteristic description for each concept (class)

COBWEB (Fisher, 1987)

A popular and simple method of incremental conceptual learning.

Creates a hierarchical clustering in the form of a classification tree

Each node refers to a concept and contains a probabilistic description of that concept

6. Frequent pattern-based:

Based on the analysis of frequent patterns

Typical methods: pCluster

7. User-guided or constraint-based:

Clustering is performed while taking into account user-specified or application-specific constraints.

Typical methods: COD (clustering with obstacle objects), constrained clustering

In applications it is often desirable to have user-guided (i.e., constrained) cluster analysis.

Different constraints in cluster analysis:

Constraints on individual objects (do selection first)

Cluster on houses worth over $300K

Constraints on distance or similarity functions

Weighted functions, obstacles (e.g., rivers, lakes)

Constraints on the selection of clustering parameters

# of clusters, MinPts, etc.

User-specified constraints

Example: each cluster must contain at least 500 valued customers and 5000 ordinary ones

Semi-supervised: small training sets are given as "constraints" or hints