Hypothyroid Detection And Classification Health And Social Care Essay


The thyroid gland produces thyroid hormones that help regulate the body's metabolism. Abnormalities in thyroid hormone production are divided into two categories, hypothyroidism and hyperthyroidism. This paper presents Multi-class Support Vector Machine (MCSVM) techniques for the detection and classification of hypothyroid disease in humans.

Support Vector Machines (SVMs), proposed by Vapnik [1], are a set of related supervised learning methods used for classification and regression. SVMs are classification and prediction tools that use machine learning theory as a principled and robust way to maximize predictive accuracy for detection and classification. SVMs can be described as systems that use a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory [1, 2].

The SVM method was developed to construct separating hyperplanes for classification problems. In the 1990s it was generalized to constructing nonlinear separating functions and to approximating real-valued functions. Applications of SVMs include text categorization, character recognition, bioinformatics, bankruptcy prediction, spam categorization and so forth [5].

The rest of the paper is organized as follows. Section II introduces SVM classification and provides the formulation of the one-class case. Section III presents how the Multi-class SVM problem is solved through binary SVMs using different methods. Experimental results showing the effectiveness of Multi-class SVM over decision-tree-based classifiers are given in Section IV, followed by the conclusion in Section V.

SVMs Overview

SVM classification is based on the concept of decision planes that define decision boundaries in the input space or a feature space. An SVM constructs a linear function (a hyperplane, either in the input space or in a feature space) from a set of labeled training data. This hyperplane attempts to split the positive samples from the negative samples. The linear separator is normally designed to have the largest distance from the hyperplane to the nearest positive and negative samples. Intuitively, this leads to correct classification of test data that are close to, but not identical with, the training data.

During the training phase the SVM takes as input a data matrix in which each sample is labeled as either belonging to a given class (positive) or not (negative). The SVM treats each sample in the matrix as a point in a high-dimensional feature space, where the number of attributes determines the dimensionality of the space. The SVM learning algorithm then determines the best hyperplane separating the positive and negative training samples. The trained SVM can be used to predict whether a test sample is a member of the class.

Briefly, a nonlinear SVM maps the N-dimensional input space into a high-dimensional feature space, where a linear classifier is constructed that acts as a nonlinear classifier in the input space. Most mathematical concepts used as background in this paper are taken from Support Vector Machines for classification [3], Introduction to Data Mining [6] and the articles listed in the references [1, 4, 7, 8].

Linearly Separable SVMs Classifier

Consider a binary classification problem with n training samples. Each sample is denoted by a tuple (xi, yi), i = 1, 2, …, n, where xi = (xi1, xi2, …, xid) is the attribute set of the ith sample. Conventionally, yi ∈ {-1, 1} is taken as its class label. The decision boundary of a linear classifier (separator) can be written as follows:

wTx + b = 0, (1)

where w is the weight vector and b is a bias term.

There are many possible linear separators, but the SVM design goal is to define a decision boundary that is maximally far from any data point. The distance from the decision boundary to the closest data point determines the margin of the classifier. This method of construction necessarily means that the decision boundary for an SVM is fully specified by a (usually small) subset of the data points, which defines the position of the separator. These points are referred to as the support vectors. Fig. 1 shows the margin and support vectors for a two-class problem.

Fig. 1. Decision boundary and margin of an SVM.

If the training data are linearly separable then there exists a pair (w, b) such that:

wTxi + b ≥ 1 if yi = 1, (2)

wTxi + b ≤ -1 if yi = -1, (3)

The linear classifier is defined as:

f(x) = sign(wTx + b). (4)

For a given data set and decision hyperplane, the functional margin of the ith sample xi with respect to a hyperplane (w, b) is defined as in (5):

gi = yi (wTxi + b). (5)

The functional margin of a data set with respect to a decision boundary is then twice the functional margin of the points in the data set with minimal functional margin (the factor of 2 comes from measuring across the whole width of the margin, as in Fig. 1). It is known that the shortest distance between a point and a hyperplane is perpendicular to the plane and hence parallel to w. A unit vector in this direction is w/‖w‖. The geometric margin of the classifier is the maximum width of the band that can be drawn separating the support vectors of the two classes, shown as r in Fig. 1. The distance from a support vector xi to the separator is equal to:

g = yi (wTxi + b)/‖w‖ = 1/‖w‖. (6)

Designing the linear separator thus amounts to maximizing this geometric margin (6) in order to find the best w and b, as formulated below:

r = 2/‖w‖ is maximized,

subject to yi (wTxi + b) ≥ 1 for all (xi, yi), i = 1, …, n. (7)

This is a convex quadratic optimization problem: a quadratic objective subject to linear constraints. Quadratic optimization problems are a standard, well-known class of mathematical optimization problems, and many algorithms exist for solving them, so SVM classifiers can be built using standard quadratic programming (QP) libraries. The problem above can be reformulated as a minimization as follows:

Φ(w) = ½ wTw = ½‖w‖² is minimized,

subject to yi (wTxi + b) ≥ 1 for all (xi, yi), i = 1, …, n. (8)

The solution to the above problem involves constructing a dual problem, where a Lagrange multiplier αi is associated with every inequality constraint yi (wTxi + b) ≥ 1 of the primal problem:

Find α1, …, αn such that

Q(α) = Σi αi - ½ ΣiΣj αi αj yi yj xiTxj

is maximized subject to:

Σi αi yi = 0 and αi ≥ 0 for all i. (9)

Then the solution to the primal is:

w = Σi αi yi xi and b = yk - wTxk for any k with αk > 0. (10)

In this solution, most αi are zero; they correspond to samples that are not support vectors. Each non-zero αi indicates that the corresponding xi is a support vector. The classification function is then given below:

f(x) = sign(Σi αi yi xiTx + b). (11)

We do not need w explicitly, since (11) relies only on inner products between the test point x and the support vectors xi. Finding the class label for any data point xj involves computing the inner products xiTxj.
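The following Python sketch is an illustration only (scikit-learn and toy data are assumptions, not anything used in this paper): it checks numerically that a trained linear SVM classifies a test point using only inner products with its support vectors, as in (11).

# Minimal sketch: linear SVM decision value computed from support vectors only.
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy data, classes labeled -1 and +1.
X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 2.0],
              [4.0, 5.0], [5.0, 4.5], [4.5, 5.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin formulation (7)-(8).
clf = SVC(kernel="linear", C=1e6).fit(X, y)

sv = clf.support_vectors_          # the support vectors xi
alpha_y = clf.dual_coef_.ravel()   # alpha_i * y_i for the support vectors only
b = clf.intercept_[0]

x_test = np.array([3.0, 3.0])
# f(x) = sum_i alpha_i y_i <xi, x> + b, using only the support vectors.
f_manual = np.dot(alpha_y, sv @ x_test) + b
f_sklearn = clf.decision_function(x_test.reshape(1, -1))[0]
print(clf.support_)                # indices of the support vectors
print(f_manual, f_sklearn)         # the two decision values agree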

Linearly Non-Separable SVMs Classifier

The formulation above assumes the data set is linearly separable; we now consider data that are not linearly separable. The standard approach is to allow the fat decision boundary shown in Fig. 2 to make a few mistakes (some points, outliers or noisy samples, lie inside or on the wrong side of the margin). We then pay a cost for each misclassified sample, which depends on how far it is from meeting the margin requirement given in (8). To implement this, slack variables ξi are introduced below.

The formulation of the SVM optimization problem with slack variables is given below:

Fig. 2. Slack variables for linearly non-separable data.

Find w, b and ξi ≥ 0 such that

Φ(w, ξ) = ½ wTw + C Σi ξi is minimized,

subject to yi (wTxi + b) ≥ 1 - ξi for all (xi, yi), i = 1, …, n. (12)

The parameter C is a regularization term, which can be viewed as a way to control overfitting: it trades off the relative importance of maximizing the margin against fitting the training data.

The dual problem is identical to the separable case (it would not be identical if the 2-norm penalty C Σi ξi² for the slack variables were used in the primal objective).

Designing the SVM in this case amounts to finding α1, …, αn such that

Q(α) = Σi αi - ½ ΣiΣj αi αj yi yj xiTxj

is maximized under the following constraints with respect to α:

Σi αi yi = 0 and 0 ≤ αi ≤ C for all i. (13)

Neither the slack variables ξi nor their Lagrange multipliers appear in the dual problem. All that remains is the constant C, which bounds the possible size of the Lagrange multipliers of the support vector points. As before, the xi with non-zero αi are the support vectors. The solution of the dual problem is of the form:

w = Σi αi yi xi and b = yk - wTxk for any k with 0 < αk < C. (14)

Again, we do not need to compute w explicitly for the classification function, as shown in (15):

f(x) = sign(Σi αi yi xiTx + b). (15)
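As a small illustration of the role of the regularization term C in (12)-(13), the scikit-learn sketch below (an assumed library and generated toy data, not part of the original experiments) shows how a smaller C widens the margin and admits more support vectors on overlapping data.

# Minimal sketch: effect of C on the soft-margin linear SVM.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping (linearly non-separable) toy data.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)   # geometric margin 2/||w||
    print(f"C={C:>7}: margin={margin:.3f}, "
          f"support vectors={clf.n_support_.sum()}, "
          f"train accuracy={clf.score(X, y):.3f}")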

Nonlinear SVMs Classifier

In SVMs the optimal hyperplane is defined so as to maximize the generalization ability. But if the training data are not linearly separable, the obtained classifier may not have high generalization ability, even though the hyperplane is determined optimally. To enhance linear separability, the original input space is therefore mapped into a high-dimensional dot product space called the feature space [3].

SVMs provide an easy and efficient way of performing this mapping to a higher-dimensional space, referred to as the kernel trick [2]. The SVM linear classifier relies on dot products between data point vectors. Suppose we transform the data into some (possibly infinite-dimensional) space H via a mapping function Φ, so that the algorithm depends on the data only through dot products of the form Φ(xi)·Φ(xj). A linear operation in H is then equivalent to a nonlinear operation in the input space. For example, if x = (x1, x2), a possible mapping is φ(x) = (x1², x2², √2·x1x2). Fig. 3 illustrates this mapping.

Fig. 3. Nonlinear SVM decision boundary.

Let K(xi, xj) = Φ(xi)·Φ(xj) be the kernel function; all inner products over the training data are then replaced by kernel evaluations.
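To make the kernel trick concrete, the short check below (plain NumPy, an illustration only) verifies that the explicit mapping φ(x) = (x1², x2², √2·x1x2) mentioned above reproduces the simple polynomial kernel (x·z)² without ever forming the feature vectors during training.

# Minimal sketch: feature-space dot product equals a kernel in input space.
import numpy as np

def phi(x):
    # Explicit feature map into the 3-dimensional space H.
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = phi(x) @ phi(z)        # dot product after explicit mapping
rhs = (x @ z) ** 2           # kernel evaluated directly in the input space
print(lhs, rhs)              # both equal 1.0: (1*3 + 2*(-1))^2 = 1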

Designing the SVM then amounts to finding α1, …, αn such that

Q(α) = Σi αi - ½ ΣiΣj αi αj yi yj K(xi, xj)

is maximized under the following constraints with respect to α:

Σi αi yi = 0 and 0 ≤ αi ≤ C for all i. (16)

Then the classifier is as below:

f(x) = sign(Σi αi yi K(xi, x) + b). (17)

The four commonly used families of kernels are:

Linear kernel

k(xi, xj) = xiTxj

Polynomial kernel with degree d

k(xi, xj) = (xiTxj + 1)d

Radial basis function (RBF) kernel (g is a positive parameter for controlling the radius)

k(xi, xj) = exp(-g ‖xi - xj‖²)

Sigmoid kernel (g is a positive parameter)

k(xi, xj) = tanh(g·xiTxj + r)
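For reference, the four kernel families above can be instantiated through scikit-learn's SVC, as in the sketch below (an illustration only; gamma plays the role of g, coef0 the role of r and degree the polynomial degree d, with gamma = 1 chosen so that the polynomial kernel matches the form (xiTxj + 1)d given above).

# Minimal sketch: the four kernel families expressed as SVC parameters.
from sklearn.svm import SVC

classifiers = {
    "linear":     SVC(kernel="linear"),
    "polynomial": SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0),
    "rbf":        SVC(kernel="rbf", gamma=0.5),
    "sigmoid":    SVC(kernel="sigmoid", gamma=0.5, coef0=1.0),
}
# Each classifier is trained the same way, e.g. classifiers["rbf"].fit(X, y),
# where X and y are the training attributes and labels.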

One-class SVMs (OCSVM)

An OCSVM can be used when only one class of training samples is available. Although SVMs were originally designed for binary classification, Schölkopf proposed an approach to estimate, for a given class, the spatial region that includes most of the class patterns [9].

The algorithm of OCSVM is similar to the binary SVM algorithm in that it applies kernel functions to perform implicit mappings and dot products. It also uses the same kind of hyperplane for the decision boundary, and the solution relies only on the support vectors. The decision boundary is determined by solving an optimization problem that selects the best hyperplane under a set of criteria [11].

One-class classification attempts to model a single class of objects and to distinguish it from all other possible objects. Samples from that class are classified correctly by the classifier, whereas all other samples are classified as outliers. An OCSVM should therefore reject such an object and label it as an outlier.

Assume the feature space points Φ(x1), …, Φ(xn) are separable from the origin. The separating hyperplane is determined by solving:

½‖w‖² + (1/(νn)) Σi ξi - ρ is minimized,

subject to (w·Φ(xi)) ≥ ρ - ξi, ξi ≥ 0, i = 1, …, n, (18)

where 0 < ν < 1 is a parameter that controls the trade-off between maximizing the distance from the origin and containing most of the data in the region constructed by the hyperplane; it corresponds to the expected ratio of anomalies in the data set, and the ξi are slack variables.

By solving the optimization problem, the decision function for each point x is:

f(x) = (w·Φ(x)) - ρ. (19)

Let K(xi, xj) = Φ(xi)·Φ(xj) be the kernel function. The corresponding dual problem is:

½ ΣiΣj αi αj K(xi, xj) is minimized, (20)

subject to 0 ≤ αi ≤ 1/(νn) and Σi αi = 1, so as to give the following:

The final classifier is then of the form:

f(x) = sgn(Σi αi K(xi, x) - ρ). (21)
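A one-class SVM of this form is available, for example, as scikit-learn's OneClassSVM; the sketch below (illustrative toy data, not the paper's experiments) trains it on samples from a single class and flags points far from that class as outliers, with nu playing the role of the parameter ν above.

# Minimal sketch: one-class SVM for outlier detection.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(200, 2)                  # samples from the target class
X_test = np.vstack([0.3 * rng.randn(10, 2),        # more target-class samples
                    rng.uniform(-4, 4, (10, 2))])  # samples far from the class

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(X_train)
# predict() returns +1 for points assigned to the target class, -1 for outliers.
print(ocsvm.predict(X_test))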

Multi-class SVMs (MCSVM)

In Multi-class classification problems the issue becomes more complex, because the output can belong to any of N mutually exclusive classes rather than just two. There are several ways to solve Multi-class classification problems with SVMs, such as One-Against-One (OAO) and One-Against-All (OAA) classifiers.

One-Against-All Support Vector Machine (OAASVM)

For an M-class problem (M > 2), M binary SVM classifiers are constructed [12]. The ith SVM is trained with the samples of the ith class labeled as positive and all remaining samples labeled as negative. In the recognition stage, a test sample is presented to all M SVMs and is assigned the label corresponding to the maximum output among the M classifiers.

One-Against-One Support Vector Machine (OAOSVM)

OAOSVM constructs M × (M-1)/2 binary classifiers, using all binary pair-wise combinations of the M classes. Each classifier is trained using the samples of the first class as positive samples and the samples of the second class as negative samples. To merge these classifiers, the Max Wins algorithm is adopted: it finds the resulting class by choosing the class voted for by the majority of the classifiers. The number of samples used for training each OAOSVM classifier is smaller, since only samples from two of the M classes are taken into account. The smaller number of samples causes weaker nonlinearity, resulting in shorter training times [13].
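Both decomposition schemes can be reproduced with scikit-learn's meta-estimators, as the sketch below shows on a generic 3-class dataset (an illustration only; the iris data is a stand-in, not the hypothyroid data). For M classes, OneVsRestClassifier builds M binary SVMs and OneVsOneClassifier builds M × (M-1)/2; note that SVC itself already uses a one-against-one decomposition internally.

# Minimal sketch: OAA and OAO decompositions around a binary SVM.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # 3-class stand-in dataset

oaa = OneVsRestClassifier(SVC(kernel="poly", degree=3, C=60)).fit(X, y)
oao = OneVsOneClassifier(SVC(kernel="poly", degree=3, C=60)).fit(X, y)

print(len(oaa.estimators_))   # M binary classifiers (3 here)
print(len(oao.estimators_))   # M*(M-1)/2 binary classifiers (3 here)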

The Multi-class SVM (MCSVM) problem is to construct a decision function given n samples, typically with noise: (x1, y1), …, (xn, yn), where xi is an attribute vector and yi ∈ {1, …, k} represents the class of the sample. The classical approach to solving MCSVM classification problems is to consider the problem as a collection of binary classification problems. In the OAA method one constructs k classifiers, one for each class; the mth classifier constructs a hyperplane between class m and the k - 1 remaining classes, and a test sample is assigned to the class for which the distance from the margin, in the positive direction, is maximal. The classifiers are obtained by minimizing:

½ Σm wmTwm + C Σi Σm≠yi ξim (22)

s.t. (wyi·xi) + byi ≥ (wm·xi) + bm + 2 - ξim,

ξim ≥ 0, for i = 1, …, n and m ∈ {1, …, k} \ {yi}. (23)

The final classifier is of the form:

f(x) = arg maxm [(wm·x) + bm]. (24)
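For completeness, scikit-learn's LinearSVC also offers a single-machine multi-class formulation (Crammer and Singer's), which is related to, though not identical with, the joint optimization in (22)-(24); the sketch below is illustrative only.

# Minimal sketch: a single-machine multi-class linear SVM.
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
clf = LinearSVC(multi_class="crammer_singer", C=1.0, max_iter=10000).fit(X, y)
# coef_ holds one weight vector w_m per class; prediction takes the argmax of
# w_m . x + b_m over the classes, as in (24).
print(clf.coef_.shape)   # (number of classes, number of attributes)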

Experimental Evaluation

In this section we provide experimental evidence for our results on the hypothyroid dataset from the UCI repository [15]. To compare the CPU time and accuracy of MCSVM with other algorithms, we ran Weka [15] on Windows Vista on a PC with an Intel P4 processor (2.6 GHz) and 2 GB of RAM.

The hypothyroid dataset has 4 classes and 30 attributes. The data comprises 3772 samples, divided into 3167 training samples and 605 test samples. We experimented on this data set to compare the performance and accuracy of the MCSVM classifier with different kinds of kernels and with other classifiers.

Effect Of Kernel Functions

The performance of an SVM depends to a great extent on the choice of the kernel function used to transform the data from the input space to a higher-dimensional feature space. There are no definite rules for this selection other than satisfactory performance in a simulation study.

Table I presents the results of MCSVM with three kernel functions. The degree of the polynomial kernel in our experiment is 3 and the trade-off parameter is chosen as C = 60. Despite the better accuracy of the polynomial kernel, the time spent classifying the samples with the RBF kernel is about half that of the polynomial kernel.

Table I. Comparison of three different kernels

Kernel        # Kernel   Training accuracy   Test accuracy   Training time   Test time
Linear        943        97.2%               96.4%           40.74 s         40.21 s
Polynomial    1879       99.1%               96.9%           139.54 s        134.5 s
RBF           1595       95.5%               96.1%           50.48 s         79.49 s
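As an illustration only (the experiments reported in Table I were run in Weka, not scikit-learn), the Python sketch below shows how a similar kernel comparison could be set up. The file name hypothyroid.csv, the "class" column name and the preprocessing are assumptions; with C = 60 and a degree-3 polynomial kernel as described above, the results will not match Table I exactly.

# Minimal sketch: comparing kernels on the hypothyroid data (assumed CSV export).
import time
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("hypothyroid.csv")                 # hypothetical file name
X = pd.get_dummies(df.drop(columns=["class"]))      # encode nominal attributes
y = df["class"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=3167, random_state=0)

for name, clf in [
    ("Linear",     SVC(kernel="linear", C=60)),
    ("Polynomial", SVC(kernel="poly", degree=3, C=60)),
    ("RBF",        SVC(kernel="rbf", C=60)),
]:
    model = make_pipeline(StandardScaler(), clf)
    start = time.time()
    model.fit(X_tr, y_tr)
    print(name, model.score(X_tr, y_tr), model.score(X_te, y_te),
          f"{time.time() - start:.1f}s")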

Effect Of Constructing Method In MCSVM

We experimented with MCSVMs using both OAO and OAA with polynomial kernels. The classification results are shown in Table II. The test accuracy of OAA is better than that of OAO, but it takes a longer time to construct the classifier.

Table II. Robustness of constructing method in MCSVM

Method    Training accuracy   Test accuracy   Training time   Test time
OAASVM    95.6%               95.3%           212.4 s         227.8 s
OAOSVM    96.1%               94.48%          79.3 s          81 s

Effect Of Classifier Selection

In order to verify the effectiveness and robustness of the MCSVM classification approach, we compared its classification results with those of two other classifiers, Decision Tree and AdaBoost. The results are presented in Table III; the accuracy of MCSVM is better than that of the other classifiers.

Table III. Average classification accuracy of the three algorithms

Classifier            Training accuracy   Test accuracy   Training time   Test time
MCSVM                 99.1%               96.9%           13.82 s         14.6 s
Decision Tree (J48)   95%                 94.1%           0.11 s          0.22 s
AdaBoost              96.4%               95.5%           0.29 s          0.39 s
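Again as an illustration only (the paper's comparison used Weka's J48 and AdaBoost implementations), a comparable experiment could be sketched in scikit-learn as below, reusing the X_tr, X_te, y_tr, y_te split prepared in the previous sketch; DecisionTreeClassifier and AdaBoostClassifier are the assumed stand-ins for the Weka classifiers.

# Minimal sketch: MCSVM against decision tree and AdaBoost baselines.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "MCSVM (poly)":  SVC(kernel="poly", degree=3, C=60),
    "Decision Tree": DecisionTreeClassifier(),
    "AdaBoost":      AdaBoostClassifier(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, model.score(X_tr, y_tr), model.score(X_te, y_te))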

Conclusion

This paper presented a comparative study of hypothyroid classification using an MCSVM classifier with different kinds of kernels. The experimental results showed the superior performance and accuracy of MCSVM compared with AdaBoost and Decision Tree; they also showed that the accuracy of OAASVM with a polynomial kernel is better than that of the other configurations.