Medical Dataset Using Data Mining Algorithms Computer Science Essay

Published: November 9, 2015 Words: 4649

A lot of information about patients, doctors and diseases is available in medical databases. Extracting useful knowledge from these databases, and providing scientific support for decisions on the diagnosis and treatment of diseases, is increasingly necessary. Medical databases therefore need data mining algorithms to find the hidden facts. Knowledge discovery and data mining methods have been applied to discover hidden patterns and relations in complex datasets using intelligent agents, and the use of Artificial Intelligence (AI) algorithms adds further intelligence to software agents. There are a number of algorithms in data mining for prediction, classification, interpretation and visualization of datasets, but 'k-means clustering', 'decision trees', 'neural networks' and 'data visualization' (2D or 3D scattered graphs) are commonly accepted as the most powerful data mining tools. This research paper describes a multiagent system for prediction, classification, interpretation and visualization of a medical database using data mining algorithms.

Modern medical science generates a great deal of information that is stored in medical databases; vast quantities of data are produced through the health care process. Among them, clinical databases have accumulated large quantities of information about patients and their medical conditions, the doctors and the diseases. Data mining techniques help to find the relationships between multiple predictor variables and the outcomes they influence. The methods and applications of medical data mining are based on computational intelligence techniques such as artificial neural networks, k-means clustering, decision trees and data visualization [1][2][3][4][5].

Data mining serves both to verify hypotheses prepared by the user and to discover or uncover new patterns in large databases. It draws on concepts and techniques from other established disciplines, such as Artificial Intelligence, Database Technology, Machine Learning and Statistics. There are two keys to success in data mining: first, coming up with a precise formulation of the problem to be solved, and second, using the right data. Data mining is used in bioinformatics, genetics and the medical sciences. Data is available everywhere now; the problem is how to produce knowledge from this mountain of data. This is an era of knowledge, and knowledge can be obtained from data and information using data mining techniques, i.e. Data + Information = Knowledge [8][13].

There are a number of algorithms in data mining for prediction, classification, interpretation and visualization of datasets, but 'k-means clustering', 'decision trees', 'neural networks' and 'data visualization' (2D or 3D scattered graphs) are commonly accepted as the most powerful data mining tools. In the medical sciences, tasks such as classifying medicines, or patient records according to their doses, can be performed by applying clustering algorithms. The issue is how to interpret the resulting clusters: it is not easy for all users to interpret and extract the required results from these clusters unless visualization tools are used. Therefore, our proposed MedicalMiner combines the three data mining algorithms in such a way that the user only has to give the dataset as input; the rest is done by MedicalMiner. Figure 1 depicts the proposed MedicalMiner [1].

Figure - 1 A MedicalMiner for Prediction, Classification, Interpretation & Visualization

The medical dataset, prepared under the direction of doctors and reflecting the problems they pose, is the basic required input for MedicalMiner. The selection of intelligent data mining algorithms appropriate to the given dataset is another input. A multiagent system is prepared for these inputs. The outputs of MedicalMiner are the prediction, classification, interpretation and visualization of the given dataset. The architecture of this MAS is shown in figure 2.

Intelligent mobile agents are used with data mining algorithms because of their robustness, intelligence and suitability for heterogeneous environments. The architecture of the multiagent system called MedicalMiner proposed in this research paper, which generates initial centroids from actual sample datapoints, is shown in figure 2. This multiagent system has three agents. The first agent runs the k-means clustering algorithm: it creates the clusters of the given dataset for prediction and classification (a neural network algorithm is also used for prediction and classification). The second agent runs the decision tree algorithm: it uses the output of the first agent and generates the decision rules of each cluster for interpretation. The third agent runs the data visualization algorithm: it also uses the output of the first agent and generates 2D graphs of each cluster for visualization. The user can directly access the clusters. This multiagent system is a cascade, i.e. the output of the first agent is an input for the other two agents.

Figure - 2 The Architecture of MAS for MedicalMiner

In section 2 we present an overview of the data mining algorithms used in MedicalMiner. Section 3 describes the methodology. In section 4 we present the results and discussion, and finally section 5 presents the conclusion.

2- Overview of Data Mining Algorithms used in MedicalMiner

There are many algorithms available in data mining for prediction, classification, interpretation and visualization of datasets, but neural networks, decision trees, k-means clustering and data visualization are very popular.

2.1- Neural Networks

Neural networks are used for discovering complex or unknown relationships in data. They detect patterns in large datasets for prediction or classification. Neural networks are used in systems performing image and signal processing, pattern recognition, robotics, automatic navigation, prediction and forecasting, and simulations. NNs are better suited to learning on small to medium sized datasets. The data must first be used to train the NN, and the process it goes through is considered by most to be hidden and therefore left unexplained [13].

The neural network starts with an input layer, where each node corresponds to a predictor variable. These input nodes are connected to a number of nodes in a hidden layer. Each input node is connected to every node in the hidden layer. The nodes in the hidden layer may be connected to nodes in another hidden layer, or to an output layer. The output layer consists of one or more response variables. This is illustrated in figure 3 [13].

Figure - 3 A Neural Network with one hidden layer
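The layered structure just described can be sketched as a single forward pass; this is a minimal sketch in Python/NumPy in which the layer sizes, random weights and activation functions are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass through a network with a single hidden layer."""
    h = np.tanh(x @ w_hidden + b_hidden)               # hidden-layer activations
    return 1.0 / (1.0 + np.exp(-(h @ w_out + b_out)))  # sigmoid response node

rng = np.random.default_rng(0)
x = rng.random(8)              # one record with 8 predictor attributes
w_h = rng.normal(size=(8, 5))  # every input node connected to 5 hidden nodes
b_h = np.zeros(5)
w_o = rng.normal(size=(5, 1))  # hidden layer connected to one response node
b_o = np.zeros(1)
y = forward(x, w_h, b_h, w_o, b_o)   # a single value in (0, 1)
```

In a trained network the weights would be fitted to the data rather than drawn at random; only the data flow from input layer to hidden layer to output layer is shown here.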

2.2- Decision tree algorithm

The decision tree algorithm is an efficient method for producing classifiers from data. The goal of supervised learning is to create a classification model, known as a classifier, which predicts the class of an entity from the values of its available input attributes. In other words, classification is the process of dividing samples into pre-defined groups, and the output takes the form of decision rules. In order to mine with decision trees, the attributes may have continuous or discrete values, the target attribute values must be provided in advance, and the data must be sufficient for prediction of the results to be possible. Decision trees are fast to use, generate understandable rules, and are simple to explain, since any decision that is made can be understood by viewing the path of the decision. They also help to form an accurate, balanced picture of the risks and rewards that can result from a particular choice. The decision rules are obtained in if-then-else form and can be used for decision support systems, classification and prediction. Figure 3 illustrates how decision rules are obtained from the decision tree algorithm [6].

Figure - 3 Decision Rules from a Decision Tree Algorithm

The different steps of decision tree (ID3) algorithm are given below:

Step 1: Let 'S' be a training set. If all instances in 'S' are positive, create a 'YES' node and halt. If all instances in 'S' are negative, create a 'NO' node and halt. Otherwise select a feature 'F' with values v1, ..., vn and create a decision node.

Step 2: Partition the training instances in 'S' into subsets S1, S2, ..., Sn according to the values of 'F'.

Step 3: Apply the algorithm recursively to each of the sets Si.

The decision tree algorithm generates understandable rules, performs classification without requiring much computation, is suitable for handling both continuous and categorical variables, and provides an indication of which attributes matter for prediction or classification [6].
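The three steps above can be sketched as a recursive procedure. This is a simplified sketch: the first untried feature is selected in Step 1 instead of using an information-gain heuristic, and the two-record dataset is illustrative.

```python
def id3(samples, features):
    """samples: list of (attrs: dict, label: 'YES'/'NO'). Returns a nested rule tree."""
    labels = {label for _, label in samples}
    if labels == {"YES"}:            # Step 1: all instances positive -> 'YES' node
        return "YES"
    if labels == {"NO"}:             # Step 1: all instances negative -> 'NO' node
        return "NO"
    f = features[0]                  # Step 1: select a feature F (no gain heuristic here)
    node = {}
    values = {attrs[f] for attrs, _ in samples}
    for v in values:                 # Step 2: partition S into subsets by value of F
        subset = [(a, l) for a, l in samples if a[f] == v]
        node[(f, v)] = id3(subset, features[1:])   # Step 3: recurse on each subset
    return node

data = [({"PGC": "high"}, "YES"), ({"PGC": "low"}, "NO")]
tree = id3(data, ["PGC"])
```

Each path from the root of the returned tree to a 'YES'/'NO' leaf corresponds to one if-then-else decision rule.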

2.3- K-means clustering algorithm

The 'k' in the k-means algorithm stands for the number of clusters, given as an input, and the 'means' stands for the average, i.e. the mean location of all members of a particular cluster. The algorithm is popular for finding similar patterns due to its simplicity and fast execution. It uses the square-error criterion in equation 1: a sample is re-assigned from one cluster to another when doing so decreases the total squared error [7].

E = Σ (F - C)^2        (1)

where the sum runs over all clusters and over each datapoint F in a cluster, C is the centroid of that cluster, and (F - C)^2 is the squared distance between them. The algorithm is easy to implement, and its time and space complexity are relatively small. Figure 4 illustrates the working of the clustering algorithms.

Figure - 4 The Function of the Clustering Algorithms

The different steps of k-means clustering algorithm are given below:

Step 1: Select the value of 'k', the number of clusters.

Step 2: Calculate the initial centroids from the actual sample of dataset. Divide datapoints into 'k' clusters.

Step 3: Move datapoints into clusters using the Euclidean distance formula in equation 2. Recalculate the new centroids, which are computed as the average or mean of each cluster's members.

d(xi, xj) = sqrt( (xi1 - xj1)^2 + (xi2 - xj2)^2 + ... + (xiN - xjN)^2 )        (2)

Step 4: Repeat step 3 until no datapoint is to be moved.

where d(xi, xj) is the distance between the datapoints xi and xj, xik and xjk are the values of their k-th attribute, and the index k runs from 1 to N, where N is the total number of attributes of the given object [7].

The algorithm has linear time complexity in the size of the dataset, and the access time to all elements is fast. It is an order-independent algorithm: it generates the same partition of the data irrespective of the order of the samples. It can be applied in a number of areas, such as marketing, libraries, insurance, city planning, earthquake studies, the WWW and the medical sciences [11].
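The four steps above can be sketched as follows. This is a minimal sketch: the toy 2-D points are illustrative, and the initial centroids are taken from actual sample datapoints, as the paper proposes.

```python
import math

def kmeans(points, k, n_iter=50):
    """Plain k-means on tuples of numbers."""
    centroids = list(points[:k])                  # Step 2: centroids from actual samples
    clusters = []
    for _ in range(n_iter):                       # Step 4: repeat until no point moves
        clusters = [[] for _ in range(k)]
        for p in points:                          # Step 3: assign by Euclidean distance
            dists = [math.dist(p, c) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        new = [tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]  # recalculate centroids as means
        if new == centroids:                      # stable: no datapoint would move
            break
        centroids = new
    return centroids, clusters

pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.8)]
centroids, clusters = kmeans(pts, k=2)            # two well-separated groups
```

On this toy input the algorithm converges in a few iterations, separating the three points near (1, 1) from the two near (8, 8).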

2.4- Data visualization

This method provides a better understanding of the data to the users. Graphics and visualization tools illustrate the relationships among data well, and their importance in data analysis cannot be overemphasized. Distributions of values can be displayed using histograms or box plots; 2D or 3D scattered graphs can also be used. Visualization works because it conveys broader information than text or numbers alone. Missing and exceptional values, and the relationships and patterns within the data, are easier to identify when graphically displayed, allowing the user to focus easily and see the patterns and trends in the data. The major issues in data visualization are: as the volume of data increases it becomes difficult to distinguish patterns from datasets, and displaying multi-dimensional or multi-variable models is hard because only two dimensions can be shown on a computer screen or on paper [13].

3- Methodology

The methodology of this research paper is: for the classification of the dataset, 'clusters' are created; for the prediction and interpretation of these clusters, 'decision rules' are created; and for visualization, '2D scattered graphs' are used. We first apply the k-means clustering algorithm to a medical dataset, 'Diabetes', a dataset/testbed of 790 records. Before the k-means clustering algorithm is applied to this dataset, the data is pre-processed, a step called data standardization: the interval-scaled data is properly cleansed by applying the range method. The attributes of the dataset/testbed 'Diabetes' are:

Number of times pregnant (NTP)

Plasma glucose concentration at 2 hours in an oral glucose tolerance test (PGC)

Diastolic blood pressure (mm Hg) (DBP)

Triceps skin fold thickness (mm) (TSFT)

2-Hour serum insulin (m U/ml) (2HSI)

Body mass index (weight in kg/(height in m)^2) (BMI)

Diabetes pedigree function (DPF)

Age (min. age = 21, max. age = 81)

Class (whether diabetes is cat 1 or cat 2) [10]
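The range-method standardization mentioned above can be sketched as min-max scaling; the sample values below are illustrative, not rows from the actual dataset.

```python
def range_standardize(column):
    """Range-method standardization of an interval-scaled attribute:
    x' = (x - min) / (max - min), mapping all values into [0, 1]."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

ages = [21, 31, 50, 81]             # illustrative AGE values
scaled = range_standardize(ages)    # smallest value maps to 0.0, largest to 1.0
```

Scaling every interval attribute into the same [0, 1] range stops attributes with large units (e.g. serum insulin) from dominating the Euclidean distances used by k-means.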

There are two main types of data distribution: the centralized data source and the distributed data source. A distributed data source admits two approaches to data partitioning. The first is horizontally partitioned data, where the same set of attributes is on each node; this is also called the homogeneous case. The second is vertically partitioned data, where different attributes are observed at different nodes; this is also called the heterogeneous case. In a vertical partition each node must contain a unique identifier to facilitate matching [1][9].

In this paper we use vertical partitioning of the dataset 'Diabetes'. We create two vertical partitions: one selects attributes on the basis of their values, and the other is an ordinary selection of attributes. These two different vertical partitions are created to observe the impact of the values of the attributes on the clusters. Tables 1 to 4 show the first vertical partition, an ordinary selection of attributes. The attribute 'class' is the unique identifier in all these partitions.
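Vertical partitioning with 'class' replicated as the unique identifier can be sketched as follows; the two records use attribute values that appear in tables 1 and 2, but are otherwise illustrative.

```python
# Two illustrative records built from the example values in tables 1 and 2.
records = [
    {"NTP": 4, "PGC": 148, "DBP": 72, "TSFT": 35, "Class": "-ive"},
    {"NTP": 2, "PGC": 85,  "DBP": 66, "TSFT": 29, "Class": "+ive"},
]

def vertical_partition(rows, attribute_groups, identifier="Class"):
    """Split records column-wise; every node keeps the identifier column
    so that rows can be matched across nodes."""
    return [
        [{a: r[a] for a in group + [identifier]} for r in rows]
        for group in attribute_groups
    ]

node1, node2 = vertical_partition(records, [["NTP", "PGC"], ["DBP", "TSFT"]])
```

Each node receives only its own attribute columns plus the replicated 'Class' identifier, which is what allows the per-node clustering results to be matched back together.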

Table - 1 Vertically distributed Diabetes dataset at node 1

NTP    PGC    Class
4      148    -ive
2      85     +ive
2      185    -ive

Table - 2 Vertically distributed Diabetes dataset at node 2

DBP    TSFT    Class
72     35      -ive
66     29      +ive
64     0       -ive

Table - 3 Vertically distributed Diabetes dataset at node 3

2HSI    BMI     Class
0       33.6    -ive
94      28.1    +ive
168     43.1    -ive

Table - 4 Vertically distributed Diabetes dataset at node 4

DPF      AGE    Class
0.627    50     -ive
0.351    31     +ive
2.288    33     -ive

The second partition is based on the values of the attributes, as shown in tables 5 to 8. The attribute 'class' is again the unique identifier in all these partitions.

Table - 5 Vertically distributed Diabetes dataset at node 1

NTP    DPF      Class
4      0.627    -ive
2      0.351    +ive
2      2.288    -ive

Table - 6 Vertically distributed Diabetes dataset at node 2

DBP    AGE    Class
72     50     -ive
66     31     +ive
64     33     -ive

Table - 7 Vertically distributed Diabetes dataset at node 3

TSFT    BMI     Class
35      33.6    -ive
29      28.1    +ive
0       43.1    -ive

Table - 8 Vertically distributed Diabetes dataset at node 4

PGC    2HSI    Class
148    0       -ive
85     94      +ive
185    168     -ive

Each partitioned table is a dataset of 790 records; only three example records are shown in each table.

We first apply the k-means clustering algorithm to the vertical partitions created above. The number of clusters 'k' is set to 4, and the number of iterations 'n' in each case is 50. The decision rules for the resulting clusters are then created using the decision tree (ID3) algorithm. For further interpretation and visualization of these clusters, 2D scattered graphs are drawn using data visualization.

The partitioned datasets are placed on different nodes of the distributed network, as shown in figure 5. Traditional centralized data analysis does not scale well in distributed applications. In a distributed environment, analyzing distributed data is a non-trivial problem because of many constraints, such as limited bandwidth, privacy-sensitive data and distributed compute nodes. Thanks to their adaptive and deliberative reasoning features, intelligent mobile agents are well suited to cope with the problems of distributed systems. An intelligent, learning, autonomous agent is capable of capturing and applying domain-specific knowledge, learning, information and reasoning to take actions in pursuit of a goal. Distributed problem-solving environments fit well with a multiagent system (MAS), since the solution requires autonomous behavior, collaboration and reasoning. The agents perform the underlying data analysis tasks very efficiently in a distributed manner. A MAS offers an architecture for distributed problem solving and deals with complex applications that require it. Since a MAS is itself a distributed system, combining data mining algorithms with a MAS for data analysis further enhances the processing power of the application [12].

Figure - 5 A Distributed Network using MAS

The multiagent system is used for the extraction of results from these nodes. The agents can roam freely from one node to another and can be stored at any node in the distributed network; their results can also be stored anywhere in the network. The architecture of the mobile intelligent agents is shown in figure 2. This multiagent system is capable of performing classification, interpretation and visualization of large datasets and comprises three intelligent mobile agents. The first agent performs the classification of the given dataset using the k-means clustering algorithm and provides clusters as output. The second and third agents perform the interpretation and visualization of these clusters using the decision tree algorithm and data visualization, respectively. The user can directly access the clusters output by the k-means algorithm and can interpret them using 2D graphs from data visualization and decision rules derived from the decision tree algorithm. The study could be extended to large-scale distributed databases so as to validate the effectiveness of the proposed methodology. Further investigation in this direction will undoubtedly have to take into account parameters such as data caching and the validity of the agent framework.

4- Results and Discussion

Pattern discovery from a large dataset is a three-step process. In the first step, one seeks to enumerate all of the associations that occur at least 'a' times in the dataset. In the second step, the clusters of the dataset are created, and in the third and last step, 'decision rules' (if-then statements) are constructed from the valid pattern pairs. Association analysis: association mining is concerned with whether the co-joint event (A, B, C, ...) occurs more or less often than would be expected on a chance basis; if it occurs only as often as chance predicts (within a pre-specified margin), then the rule is not considered interesting. Predictive analysis: 'decision rules' are generated from the diabetes medical dataset using logical operations, and the result of applying these rules to a patient record is either 'true' or 'false'.

After applying the k-means clustering algorithm at these four nodes, a total of sixteen clusters is obtained using the ordinary selection of attributes in the vertical partitions, four for each node. Among these sixteen clusters, the interesting clusters, which show some useful patterns, are shown in figures 6, 7 and 8.

Figure - 6 A Scattered Graph of cluster 1 of node 2 between TSFT and DBP attributes of Diabetes dataset

The graph shows that there is variable distance between 'TSFT' and 'DBP' for cluster 1 of node 2.

Figure - 7 A Scattered Graph for cluster 2 of node 2 between TSFT and DBP attributes of Diabetes dataset

Compared with the first cluster of node 2, this cluster has a structure more or less similar to that of figure 6. The distance between 'TSFT' and 'DBP' varies.

Figure - 8 A Scattered Graph for cluster 3 of node 3 between 2HSI and BMI attributes of Diabetes dataset

As the graph shows, the distance between 'BMI' and '2HSI' is variable for cluster 3 of node 3. We can see a high density of data situated between 0 and 350.

After applying the k-means clustering algorithm at these four nodes to the second partition, based on the similar values of the attributes of the given dataset 'Diabetes', a total of sixteen clusters is again obtained, four for each node. The 2D scattered graphs of the interesting clusters are shown in figures 9, 10, 11 and 12.

Figure - 9 A Scattered Graph for cluster 1 of node 4 between PGC and 2HSI attributes of Diabetes dataset

The graph shows that the distance between the attributes 'PGC' and '2HSI' is variable; it varies from 0 to 36.

Figure - 10 A Scattered Graph for cluster 3 of node 1 between NTP and DPF attributes of Diabetes dataset

The graph shows that at the beginning the distance between the attributes 'NTP' and 'DPF' is constant, then the distance varies, and the distance becomes constant again at the end.

Figure - 11 A Scattered Graph for cluster 4 of node 4 between PGC and 2HSI attributes of Diabetes dataset

The graph shows that there is variable distance between 'PGC' and '2HSI' for cluster 4 of node 4.

Figure - 12 A Scattered Graph for cluster 4 of node 3 between TSFT and BMI attributes of Diabetes dataset

The graph shows that there is variable distance between 'TSFT' and 'BMI' for cluster 4 of node 3.

In total 32 decision rules are generated, one for each cluster. Two interesting sets of decision rules, used for the interpretation of the clusters, are given below.

The decision rules for cluster 1 of node 4 are:

Rule 1: if PGC = "165" then Class = "Cat2"
else Rule 2: if PGC = "153" then Class = "Cat2"
else Rule 3: if PGC = "157" then Class = "Cat2"
else Rule 4: if PGC = "139" then Class = "Cat2"
else Rule 5: if 2HSI = "545" then Class = "Cat2"
else Rule 6: if 2HSI = "744" then Class = "Cat2"
else Class = "Cat1"

There are only 6 decision rules for this cluster, so decision making and interpretation are very easy.
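Such a rule chain translates directly into executable form. This is a hypothetical rendering in Python: the function name is invented, the string-valued comparisons mirror the rules as printed, and rules with the same outcome are collapsed into membership tests, which is equivalent because each group maps to a single class.

```python
def classify_cluster1_node4(pgc, insulin_2hsi):
    """Decision rules for cluster 1 of node 4 as an if/else chain."""
    if pgc in ("165", "153", "157", "139"):    # Rules 1-4
        return "Cat2"
    if insulin_2hsi in ("545", "744"):         # Rules 5-6
        return "Cat2"
    return "Cat1"                              # default class
```

Rendered this way, each rule set becomes a simple predicate that can be evaluated against a patient record, in the spirit of the paper's observation that the rules can serve as queries.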

The decision rules for cluster 3 of node 1 are:

Rule 1: if DPF = "1.32" then Class = "Cat1"
else Rule 2: if DPF = "2.29" then Class = "Cat1"
else Rule 3: if NTP = "2" then Class = "Cat2"
else Rule 4: if DPF = "2.42" then Class = "Cat1"
else Rule 5: if DPF = "2.14" then Class = "Cat1"
else Rule 6: if DPF = "1.39" then Class = "Cat1"
else Rule 7: if DPF = "1.29" then Class = "Cat1"
else Rule 8: if DPF = "1.26" then Class = "Cat1"

There are only 8 decision rules for this cluster; interpretation and prediction through these rules are easy.

The next discussion concerns the importance of all the attributes under the three data mining algorithms applied to the given dataset 'Diabetes'.

Figure - 13 Graph between the Attributes and the percentage Value using K-means clustering Algorithm

The graph shows that 'PGC' is the most important attribute and 'DBP' the least important attribute of this dataset for prediction using the k-means clustering algorithm.

Figure - 14 Graph between the Attributes and the percentage Value using Neural Networks Algorithm

The graph shows that almost all attributes of the dataset play an important role in prediction using the neural networks, due to their high values.

Figure - 15 Graph between the Attributes and the percentage Value using Decision tree Algorithm

The graph shows that 'PGC' is the most important attribute and 'NTP' the least important attribute of this dataset for prediction using the decision tree algorithm.

Table - 9 The % Importance of Diabetes Dataset Attributes in three Data Mining Algorithms

Sr. #   Attributes   K-Means   Decision Tree   Neural Networks
1       PGC          100.00    100.00          99.13
2       AGE          51.57     36.47           96.59
3       BMI          50.24     52.71           99.53
4       NTP          49.15     4.05            69.90
5       TSFT         33.82     9.92            90.01
6       2HSI         28.45     5.88            74.53
7       DPF          27.86     30.86           100.00
8       DBP          12.34     27.10           95.66

Table 9 summarizes the % importance values of all attributes of the dataset 'Diabetes' under the k-means clustering, neural networks and decision tree algorithms.

Figure - 16 Graph between the Variables of Diabetes Dataset and % Importance Values for all three Data Mining Algorithms

The graph shows that the % values of all attributes of the given dataset 'Diabetes' are highest for the neural networks compared with the decision tree and k-means clustering algorithms, and lowest for the decision tree algorithm; the k-means clustering algorithm gives the intermediate % values shown in the graph. The neural networks indicate that all the attributes of this dataset are very important in prediction, but when we draw a comparison across all three algorithms, the attributes 'PGC', 'BMI', 'AGE' and 'DPF' are the most important in predicting diabetes of category 1 or 2 in patients.

Table 10 shows the performance metrics generated by the neural networks. Prediction quality can be assessed easily through this table.

Table - 10 Performance Metrics

CLASS   R      Net-R   Avg. Abs.   Max. Abs.   RMS    Accuracy (20%)   Conf. Interval (95%)
All     0.66   0.66    0.26        0.95        0.35   0.52             0.69
Train   0.65   0.65    0.26        0.95        0.36   0.52             0.70
Test    0.68   0.68    0.25        0.89        0.35   0.52             0.68

The most important and useful metrics for a prediction model are usually the R (Pearson R) value, RMS (Root Mean Square) error and Avg. Abs. (Average Absolute) error, although Max. Abs. (Maximum Absolute) error may sometimes be important. The R value and RMS error indicate how "close" one data series is to another; in our case, the data series are the target (actual) output values and the corresponding predicted output values generated by the model.

R values range from -1.0 to +1.0. A larger absolute R value indicates a higher correlation. The sign of the R value indicates whether the correlation is positive (when a value in one series changes, its corresponding value in the other series changes in the same direction) or negative (the corresponding value changes in the opposite direction). An R value of 0.0 means there is no correlation between the two series. In general, larger positive R values indicate "better" models. RMS error is a measure of the error between corresponding pairs of values in two series of values; smaller RMS error values are better.

Finally, another key to using performance metrics is to compare the same metric computed for different datasets. Note the R values for the Train and Test sets in table 10. The relatively small difference between the values (0.65 and 0.68) suggests that the model generalizes well and is likely to make accurate predictions when it processes new data (data not obtained from the Train or Test dataset).
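The R and RMS measures described here can be computed directly; the target/predicted series below are illustrative, not the paper's data.

```python
import math

def pearson_r(target, predicted):
    """Pearson R: covariance of the two series divided by the product
    of their spreads; ranges from -1.0 to +1.0."""
    n = len(target)
    mt, mp = sum(target) / n, sum(predicted) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(target, predicted))
    st = math.sqrt(sum((t - mt) ** 2 for t in target))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    return cov / (st * sp)

def rms_error(target, predicted):
    """Root mean square error between corresponding pairs of values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target))

target = [1.0, 0.0, 1.0, 1.0, 0.0]      # illustrative actual outputs
predicted = [0.9, 0.2, 0.8, 0.7, 0.1]   # illustrative model outputs
```

For these illustrative series the R value is close to +1.0 and the RMS error is small, which is the pattern one would expect from a model whose predictions track the targets closely.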

A graph is drawn between the target output and the predicted output as shown in figure 17.

Figure - 17 A Graph between the Target Output and the Predicted Output using Neural Networks

The graph shows that the predicted outputs and the target outputs are close to each other. Two conclusions are drawn from this graph: the data in the dataset is properly cleansed, and the prediction is accurate.

5- Conclusion

In this research paper we presented the prediction, classification and interpretation of the dataset 'Diabetes' using three data mining algorithms, namely k-means clustering, decision trees and neural networks; for the visualization of the results, 2D scattered graphs were drawn. We first created two vertical partitions of the given dataset: the first was based on an arbitrary selection of attributes and the second on similar values of the attributes. The results obtained from the second partition are very helpful for prediction. For the discovery of interesting patterns in the given dataset, we combined two data mining algorithms, k-means clustering and the decision tree, in a cascade, i.e. the output of the k-means clustering algorithm was used as input for the decision tree algorithm. The decision rules obtained from the decision tree algorithm can further be used as simple queries against any medical database. The results were further verified using the neural networks. The interesting pattern discovered from the given dataset is that diabetes of category 1 or 2 depends upon the 'Plasma Glucose Concentration', 'Body Mass Index', 'Diabetes Pedigree Function' and 'Age' attributes. We conclude that the attributes 'PGC', 'BMI', 'DPF' and 'AGE' of the given dataset 'Diabetes' play an important role in predicting whether a patient is diabetic of category 1 or category 2.