Data Mining Is a Relatively New Innovation: Information Technology Essay


Although data mining is a relatively new innovation, the improvements it offers over traditional data analysis have caused the field to expand rapidly. Efficient and accurate extraction of useful information is a critical requirement in today's information-rich environment, so research on the topic continues.

Clustering is one of the most important techniques adopted by data-mining tools across a variety of applications. There are several algorithms that can assess large databases and group data points according to certain parameters. In this study, two commonly used clustering algorithms, k-means and k-medoids, are compared with each other and related to other well-known techniques. Preliminary testing of each technique uses the standard implementation of each algorithm. In addition, experimental tests and potential improvements to these methods are proposed, and the k-means and k-medoids algorithms are presented. Key applications of clustering methods are also described in detail.

Weka is software that provides machine-learning algorithms for data-mining work. The algorithms can be applied directly to a dataset or called from Java source code. Weka is in fact well suited to developing new kinds of learning schemes. Weka is open-source software, so anyone can use it. There is similar software, such as RapidMiner, but it is not open source or as user friendly. Weka does not provide a k-medoids clusterer with which to test that algorithm; this is one of Weka's drawbacks.

Introduction to Data Mining:

Data mining is the practice of automatically searching large stores of data to find patterns and trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms that segment the data and assess the probability of future events. Data mining is also known as Knowledge Discovery in Data (KDD). The efficiency and accuracy of data-mining results depend directly on the selection of a suitable algorithm.

It allows users to analyse data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Data mining tasks are:

Classification

Clustering

Association rule discovery

Data pre-processing

Visualization

Regression

Clustering:

Clustering means finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. A cluster is a collection of data objects that are similar to one another within the same cluster. Clustering of data is a method by which large sets of data are grouped into clusters of smaller sets of similar data. Clustering is unsupervised classification: there are no predefined classes. Clustering is used because it supports useful concept construction over the data.

Types of Clustering:

There are two types of clustering:

Hierarchical clustering: Produces a multi-level nested decomposition. It creates a hierarchical decomposition of the set of data (or objects) using some criterion, and it depends on the clustering paradigm: bottom-up (agglomerative) clustering builds from singletons up to the whole set, while top-down (divisive) clustering splits the whole set down to singletons. Typical methods: DIANA, AGNES, BIRCH, CURE, CHAMELEON.

Partitioning clustering:

Produces a single-level clustering result. Variants include density-based, mixture-model, and graph-theoretic clustering. It constructs various partitions and then evaluates them by some criterion, e.g., minimizing the sum of squared errors.

Typical methods: k-means, k-medoids, CLARANS.

Partitioning algorithm, basic concept: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances, ∑i=1..k ∑p∈Ci d(p, mi)², is minimized, where mi is the representative of cluster Ci.

Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion

Global optimal: exhaustively enumerate all partitions

Heuristic methods: k-means and k-medoids algorithms

k-means (MacQueen'67): Each cluster is represented by the centre of the cluster.

k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): Each cluster is represented by one of the objects in the cluster.
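To make the squared-distance criterion above concrete, the following minimal Python sketch (the toy points and the fixed cluster assignment are invented purely for illustration) computes the within-cluster sum of squared distances that a partitioning method tries to minimize:

import numpy as np

# A toy 2-D data set with a fixed assignment into k = 2 clusters
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
labels = np.array([0, 0, 1, 1])

sse = 0.0
for c in np.unique(labels):
    members = points[labels == c]
    centroid = members.mean(axis=0)           # cluster mean m_c
    sse += ((members - centroid) ** 2).sum()  # squared distances to m_c

print(sse)  # the quantity the partitioning criterion minimizes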

k-Means:

K-Means clustering generates a specific number of disjoint, flat (non-hierarchical) clusters. It is well suited to generating globular clusters. The K-Means method is numerical, unsupervised, non-deterministic and iterative.

A non-hierarchical approach to clustering specifies a desired number of clusters, say k, and then assigns each object to one of the k clusters so as to minimize a measure of the variability within the clusters. A very common measure is the sum of distances, or the sum of squared Euclidean distances, from the mean of each cluster. The problem can be posed as an integer programming problem, but because integer programs with a large number of variables are time-consuming to solve, clusters are often computed using fast heuristic methods, which generally lead to good (but not necessarily optimal) solutions. The k-means algorithm is one such method.

k-Means Clustering Method: K-means clustering finds the centroids, where the coordinate of each centroid is the mean of the coordinates of the objects in the cluster, and assigns every object to the nearest centroid. The algorithm can be summarized as follows.

Step 1: Select k objects randomly. These objects represent the initial group centroids.

Step 2: Assign each object to the group that has the closest centroid.

Step 3: When all objects have been assigned, recalculate the positions of the k centroids.

Step 4: Repeat Steps 2 and 3 until the centroids no longer move.
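A minimal Python sketch of these four steps (plain NumPy rather than Weka's SimpleKMeans; the function and variable names here are our own) might look like this:

import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select k objects at random as the initial group centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate the position of each centroid
        new_centroids = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        # Step 4: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

For example, k_means(np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]]), k=2) should group the first two and the last two points together.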

Problems with k-Means:

There are some problems with k-means. A recurrent problem, mentioned in both current and earlier studies, has to do with the initialization of the algorithm. For example, work on clustering patterns of protein sequences includes improving the initialization process to increase the rate at which series are assigned to clusters with structural similarity. The need for such additions suggests that simple initialization methods are far from ideal, especially given the number of articles that mention this. Furthermore, improved initialization can detect relatively small and subtle series that would otherwise not be detected by the traditional k-means algorithm.
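Because of this sensitivity to initialization, a common mitigation (a sketch, assuming the k_means function from the previous listing is in scope) is to run the algorithm from several random starting points and keep the solution with the lowest within-cluster sum of squared errors:

def best_of_restarts(points, k, restarts=10):
    best = None
    for seed in range(restarts):
        labels, centroids = k_means(points, k, seed=seed)
        # Within-cluster sum of squared errors for this run
        sse = sum(((points[labels == c] - centroids[c]) ** 2).sum() for c in range(k))
        if best is None or sse < best[0]:
            best = (sse, labels, centroids)
    return best  # (sse, labels, centroids) of the best run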


Advantages of k-means Technique

With a large number of variables, K-Means may be computationally faster than hierarchical clustering (if K is small).

K-Means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.

Disadvantages of k-means Technique

It is difficult to compare the quality of the clusters produced (e.g., different initial partitions or values of K affect the outcome).

The number of clusters is fixed in advance, which can make it difficult to predict what K should be (see the sketch after this list).

It does not work well with non-globular clusters.

Different initial partitions can result in different final clusters. It is helpful to rerun the program using the same as well as different K values, to compare the results achieved.
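Since K must be fixed in advance, one common heuristic for choosing it is the "elbow" method: run k-means for several values of K and pick the point where the total within-cluster error stops dropping sharply. The sketch below uses scikit-learn and synthetic data purely for illustration; the essay's own experiments used Weka:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 2))  # synthetic data, for illustration only

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    # inertia_ is the within-cluster sum of squared distances
    print(k, km.inertia_)
# Choose the K at which the decrease in inertia levels off (the "elbow").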

Example:

Using the data (learning set), we will classify the objects into particular classes. For example, if our class (decision) attribute is Tumor Type and its values are malignant, benign, etc., these will be the classes. They will be represented by cluster1, cluster2, etc. However, the class information is never provided to the algorithm. The class information can be used later on, to evaluate how accurately the algorithm classified the objects.

Sample  Curvature  Texture  Blood Consump  Tumor Type
x1      0.8        1.2      A              Benign
x2      0.75       1.4      B              Benign
x3      0.23       0.4      D              Malignant
...
x4      0.23       0.5      D              Malignant


[Figure: the same objects without the class attribute (Curvature, Texture, Blood Consump), as they are presented to the clustering algorithm]

The way we do that is by plotting the objects from the database into space, where each attribute is one dimension:

[Figure: the objects plotted in attribute space (Curvature, Texture, Blood Consump), falling into Cluster 1 (benign) and Cluster 2 (malignant)]

k-Means on the Iris Data Set:

The Iris data set from the UCI repository is used here to illustrate the performance of the algorithm.

The Iris flower data set is a multivariate data set introduced by Sir Ronald Aylmer Fisher. It is a classic example for discriminant analysis.

The data set contains 150 samples, divided equally among three species (Iris setosa, Iris virginica and Iris versicolor).

Each sample is described by four variables, the length and the width of the sepal and petal, measured in centimetres.

The clustering result as a confusion matrix (rows: actual species; columns: assigned cluster):

            Setosa  Versicolor  Virginica
Setosa      50      0           0
Versicolor  0       47          14
Virginica   0       3           36
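A result of this kind can be reproduced outside Weka as well. The sketch below uses scikit-learn's built-in copy of the Iris data and cross-tabulates the true species against the discovered clusters; because k-means is non-deterministic, the printed counts may differ from the table above:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)

# Cross-tabulate the true species against the assigned clusters
species = pd.Series(iris.target).map(dict(enumerate(iris.target_names)))
print(pd.crosstab(species, pd.Series(km.labels_, name="cluster")))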

k-Medoids:

Cluster analysis is a descriptive task that seeks to identify homogeneous groups of objects based on the values of their attributes. This section discusses a k-medoids clustering algorithm that works like k-means and tests several methods for choosing the initial medoids. The algorithm calculates the distance matrix once and uses it to search for new medoids at each iteration step. The algorithms are evaluated on real and synthetic data and the results compared with those of other algorithms. The proposed algorithm has reduced computation time, with performance comparable to Partitioning Around Medoids.

k-Medoids Clustering Methods:

PAM (Partitioning Around Medoids, 1987)

Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering

PAM works effectively for small data sets, but does not scale well for large data sets.

CLARA (Kaufmann & Rousseeuw, 1990).

CLARANS (Ng & Han, 1994): Randomized sampling.

Basic k-medoids Algorithm:

Step 1: Randomly select k medoids (representative objects) from the n objects.

Step 2: After the medoid set is found, assign each object of the data set to the nearest medoid, based on some distance measure.

Step 3: Now, for each cluster previously formed, swap each data point of the cluster with its respective medoid point and compute the total cost of the resulting configuration.

Step 4: After the swapping process is completed, select the configuration with the lowest cost.

Step 5: Repeat steps 2 to 4 until the medoids do not change or the maximum number of iterations is reached.
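A minimal Python sketch of these five steps (a simple PAM-style implementation of our own, not an optimized library routine) precomputes the pairwise distance matrix once and then greedily accepts whichever medoid swap lowers the total cost the most:

import numpy as np

def k_medoids(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(points)
    # Precompute the pairwise distance matrix once
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Step 1: randomly select k objects as the initial medoids
    medoids = list(rng.choice(n, size=k, replace=False))
    for _ in range(max_iter):
        # Step 2: total cost with every object assigned to its nearest medoid
        cost = dist[:, medoids].min(axis=1).sum()
        # Steps 3-4: consider swapping each medoid with each non-medoid
        # and keep the configuration with the lowest total cost
        best = (cost, None)
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                trial_cost = dist[:, trial].min(axis=1).sum()
                if trial_cost < best[0]:
                    best = (trial_cost, trial)
        # Step 5: stop when no swap improves the configuration
        if best[1] is None:
            break
        medoids = best[1]
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels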

k-Medoids Algorithm (PAM):

PAM (Kaufman and Rousseeuw, 1987), built into S-PLUS.

It uses real objects to represent the clusters.

1. Select k representative (medoid) objects arbitrarily.

2. For each pair of a non-selected object h and a selected (medoid) object i,

calculate the total swapping cost TCih.

If TCih < 0,

- i is replaced by h; and

- each non-selected object is assigned to the most similar representative object.

3. Repeat step 2 until there is no change.

PAM clustering: the total swapping cost is TCih = ∑j Cjih, where Cjih is the contribution of object j to the cost of swapping medoid i with non-medoid h.
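Expressed in code, the swapping cost falls out directly from a precomputed distance matrix. The helper below (a hypothetical name of our own) returns the change in total assignment cost when medoid i is replaced by candidate object h, which corresponds to TCih summed over all objects j:

import numpy as np

def swap_cost(dist, medoids, i, h):
    # dist: precomputed n x n pairwise distance matrix
    # medoids: current list of medoid indices; i: medoid to replace; h: candidate
    before = dist[:, medoids].min(axis=1).sum()
    trial = [h if m == i else m for m in medoids]
    after = dist[:, trial].min(axis=1).sum()
    return after - before  # negative means the swap improves the clustering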

Problem with PAM:

PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. PAM works efficiently for small data sets but does not scale well for large data sets.

CLARA (Clustering Large Applications) (1990):

CLARA (Kaufmann and Rousseeuw in 1990)

It draws multiple samples of the data set, applies PAM on each sample, and returns the best clustering as the output. A common sample size is 40 + 2k (a sketch follows the weaknesses below).

It deals with larger data sets than PAM.

Weakness:

-Efficiency depends on the sample size

-A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
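The CLARA idea can be sketched in a few lines (reusing the k_medoids function from the earlier listing as a stand-in for PAM; the sample size follows the 40 + 2k rule of thumb quoted above):

import numpy as np

def clara(points, k, samples=5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(points)
    sample_size = min(n, 40 + 2 * k)  # common sample size from the text
    best = None
    for _ in range(samples):
        # Draw a sample and run PAM (here: the k_medoids sketch) on it
        idx = rng.choice(n, size=sample_size, replace=False)
        medoids, _ = k_medoids(points[idx], k)
        centers = points[idx][medoids]
        # Score the candidate medoids on the WHOLE data set, not just the sample
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        cost = dist.min(axis=1).sum()
        if best is None or cost < best[0]:
            best = (cost, centers)
    return best[1]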

CLARANS ("Randomized" CLARA) (1994):

CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han, 1994) draws a sample of neighbours dynamically. The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids. It is more efficient and scalable than both PAM and CLARA.
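The graph search can be sketched as follows (the parameter names numlocal and maxneighbor follow the CLARANS paper; the rest is our own simplification):

import numpy as np

def clarans(points, k, numlocal=2, maxneighbor=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    cost = lambda meds: dist[:, meds].min(axis=1).sum()
    best = None
    for _ in range(numlocal):  # restart the search numlocal times
        current = [int(x) for x in rng.choice(n, size=k, replace=False)]
        tried = 0
        while tried < maxneighbor:
            # A random neighbour differs from `current` in exactly one medoid
            i = int(rng.integers(k))
            h = int(rng.integers(n))
            if h in current:
                continue
            neighbour = current[:i] + [h] + current[i + 1:]
            if cost(neighbour) < cost(current):
                current, tried = neighbour, 0  # move to the better node
            else:
                tried += 1
        if best is None or cost(current) < cost(best):
            best = current
    return best  # indices of the chosen medoids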

Advantages of k-Medoids:

k-Medoids is more advantageous and robust than k-means in the presence of noise and outliers.

The algorithm for computing the best set of medoids is more demanding than the k-means algorithm.

k-Medoids is effective and accurate for small data sets.

Disadvantages of k-Medoids

k-Medoids is more costly than k-means.

k-Medoids is effective and accurate only for small data sets and does not scale well.

The major disadvantage of both methods is that the user needs to specify k.

k-Medoids Data Set

A. Title Description

Given: {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2, cluster using the k-medoids algorithm.

B. Algorithm Description

The k-medoids algorithm selects the most centrally located object of each cluster as its centre (the medoid). Objects are then assigned so as to minimize their dissimilarity to these reference points.

The basic strategy of k-medoids is as follows. First, a representative object is randomly selected for each cluster; the remaining objects are assigned to the nearest cluster according to their distance from these representatives. Then the algorithm repeatedly tries to replace representative objects with non-representative objects to improve the quality of the clustering. The quality of a clustering result is estimated with a cost function that measures the average dissimilarity between an object and the representative of its cluster. Certain rules then determine whether a non-representative object is a good replacement for a current representative.

When objects are reallocated, the value of the cost function changes. The total cost of a replacement is the sum of the cost changes incurred by all non-centre objects. If the total cost is positive, the current centre is considered acceptable, and no change occurs in that iteration.

The method is described as follows:

Algorithm: K-Medoids. A typical k-centre algorithm that partitions around central (medoid) objects.

Input: the number of clusters k and a database containing n objects.

Output: k clusters such that the sum of the dissimilarities of all objects to their nearest centre is minimized.

Method: Call K-Medoids (k, SD)

Begin

(1) Randomly select k objects as the initial centres;

(2) Repeat

(3) Assign each remaining object to the cluster of its nearest centre;

(4) Randomly select a non-centre object O_random;

(5) Calculate the total cost S of replacing centre O_j with O_random;

(6) If S < 0, replace O_j with O_random to form a new set of k centres;

(7) Until no change;

End
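Applying the k_medoids sketch from the earlier section to this data set (reshaped into a column vector, since the sketch expects two-dimensional input) looks like the following; note that the medoids found can depend on the random initialization:

import numpy as np

data = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float).reshape(-1, 1)
medoids, labels = k_medoids(data, k=2)  # k_medoids: sketch from the earlier section

print("medoid values:", data[medoids].ravel())
for c in range(2):
    print("cluster", c, "members:", data[labels == c].ravel())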

C. Analysis

From the fifth iteration onward the results remain the same, with centres 3 and 12.

The k-medoids algorithm repeatedly tries to find better centres. All possible objects are analysed, where each object of a cluster is considered as a candidate centre and the quality of the resulting clustering is estimated. When the values of n and k are large, this computational cost is considerable.

Robustness comparison between the k-means and k-medoids algorithms:

When there is "noise" and isolated data points, K-Medoids algorithm is better than K-Means algorithm is more robust, because the centre is not so easily as the average impact of extreme data. However, K-Medoids is more costly than that of K-Means.

Output Screen:

Conclusion: