Information Fusion Ensemble System in Data Accounting


A model's predictions or forecasts traditionally reflect a deterministic approach: prediction by one model has been the way prediction was done, but that scenario has now changed. The generalization ability of classifiers can be improved using a technique called Information Fusion: the combining of the diverse, independent outcomes of many methods (models, hypotheses), and the interactions between them, by some model applied to the test data for decision-making. The uncertainties involved are often very significant. For a single prediction using a single model, uncertainty in the model, the initial conditions, or the climate (environment) means that small differences in the initialization, well within observational error, can have large impacts on longer predictions. Similarly, uncertainty in the model physics can result in large forecast differences and errors [23].

By running a collection (ensemble) of predictions, each starting from a different initial state or with different conditions, and observing the variation in the resulting predictions, the uncertainty of the prediction can be estimated. Information Fusion (an ensemble of classifiers) is being used to provide a new generation of products whose reliability can itself be estimated.
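As a toy illustration of estimating uncertainty from ensemble spread (a hypothetical chaotic model, not the system developed later in this thesis), one can perturb the initial state within observational error and examine the spread of the resulting predictions:

import numpy as np

rng = np.random.default_rng(0)

def toy_model(x0, steps=20):
    # A stand-in predictive model: the logistic map, which is highly
    # sensitive to initial conditions.
    x = x0
    for _ in range(steps):
        x = 3.9 * x * (1.0 - x)
    return x

x0 = 0.52
# Ensemble: perturb the initial state within a small "observational error".
members = [toy_model(x0 + rng.normal(scale=1e-4)) for _ in range(50)]

print("ensemble mean prediction:", np.mean(members))
print("ensemble spread (uncertainty estimate):", np.std(members))

The spread of the members, rather than any single run, conveys how much the prediction can be trusted.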

Forecast skill can itself be predicted: when the individual forecasts are similar, forecast skill tends to be higher. Predictive accuracy is substantially improved when multiple predictors are joined together. Combining models may, however, add complexity and make the predictions difficult to characterize and anticipate. Once all the numerical simulation and processing is done, human interpretation still plays an important role: evaluating the model output, attempting to consider features the model cannot handle (feature selection is used to manage this), making adjustments where needed, tuning parameters, and communicating the results to the users. Identifying the best model requires identifying the proper "model complexity". Diversity can be achieved from different algorithms, or from different algorithm parameters [4].

The present state of the art has focused attention on Information Fusion with multiple classifier systems. This thesis presents a multiple (ensemble) classifier system [22] with feature selection, in which Neural Networks (multilayer feed-forward networks with back-propagation learning) are boosted for scalable (high-dimensional) datasets. The proposed system uses Genetic Algorithms for feature selection with various evaluation techniques (evaluators), namely correlation-based subset evaluation, consistency subset evaluation and wrapper subset evaluation, to filter the whole feature set of the data and to enhance the performance of feature selection and, in turn, of the overall system. The two schemes (a search technique combined with evaluation schemes) complement each other's limitations. The combined algorithm performs well in several domains (experiments) and is a useful method for selecting features in classification problems.

Feature selection is a crucial aspect of the supervised learning process. The attributes (variables) relevant to the prediction of the target attribute (variable) must be determined, so an efficient attribute selection strategy is required. Its aim is to detect the most relevant, non-redundant input variables for the prediction of the target attribute (the classes).

Input feature selection methods are used to identify input columns that are not useful and do not contribute significantly to the performance of a multiple classifier system. They remove insignificant inputs and improve the generalization performance of an ensemble, even at the cost of some input information.

2. Problem Formulation

Technology alone does not deliver a solution; a model or hypothesis must be built. Models are described by equations, and simple models examine only linear relationships and require many assumptions about the data. The obvious question then is: how does the model work out its predictions? A predictive model, once built, should be validated on fresh data, where it must perform predictions based on the knowledge acquired: a set of input fields is used to predict the values of an output field. Data mining is used to improve the quality of data and to predict outcomes from historic data for better decision making. Scalability is still an important issue for many real-world data mining applications, and a systematic method for designing classifier ensembles is still an open topic. Combining multiple classifiers with search and evaluation techniques is therefore a promising topic [1, 2, 3].

Some of the challenges to the use of these methodologies include the following:

Efficient integration of soft computing tools.

Massive data sets and high dimensionality, i.e. the problem of scaling to extremely large datasets.

Feature evaluation and dimensionality reduction to improve prediction accuracy.

Avoiding over-fitting and assessing statistical significance.

Choice of evaluation techniques to handle dynamic changes in data.

Incorporation of domain knowledge and user interaction with prior knowledge.

Quantitative evaluation of performance.

Increasing the understandability of the patterns.

Managing changing data and knowledge.

3. Proposed Approach

The current method uses a Genetic Algorithm for feature selection with various evaluation schemes. The Genetic Algorithm is used to select relevant features from large datasets; since Genetic Algorithms deal well with large solution spaces, tuning the GA to the requirements of the ensemble yields near-optimal feature selection. Finally, a boosting algorithm completes the system, so that the whole works efficiently over high-dimensional datasets. Feature selection by Genetic Algorithms with evaluation schemes and boosting of the Neural Networks address the accuracy and execution-time requirements, respectively. Genetic search with neural networks is found to provide good results and to take acceptable time to build the classifier model, with acceptable computational complexity; this can be further enhanced by a multi-classifier approach. The performance of a neural network ensemble built on feature sets generated by feature selection improves, and it can be optimized using an efficient feature selection method such as a Genetic Algorithm with evaluation schemes. The central objective is to develop a system that provides approximately 3-5% performance improvement over similar existing techniques. Ensemble classifiers are very efficient in terms of accuracy compared with single classifiers, and ensembles of NNs are also very robust, but when dealing with high-dimensional databases the execution time increases considerably. To reduce this execution time, techniques such as feature selection and evaluation of model complexity can be employed. The current research is based on the AdaBoost.M1 and AttributeSelectedClassifier algorithms; AdaBoost.M1 is used to boost a series of NNs.

The Proposed System consists of three main steps:

GA-based feature selection with Evaluation Schemes,

Ensemble (of Neural Networks) training, and

Ensemble (combined Classifier) testing.

Within the ensemble training step, boosting follows the familiar loop (sketched in code after this list):

Use a poor learner (slightly better than random);

Build a Model;

Boost those training instances modeled incorrectly;

Build a New Model;

Repeat.
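A minimal sketch of this loop, with a decision stump standing in for the weak learner (the thesis itself boosts neural networks, so this is only an outline of the reweighting idea, not the exact system):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boosting_loop(X, y, rounds=10):
    """Build weak models repeatedly, up-weighting the instances modeled incorrectly."""
    n = len(y)
    w = np.full(n, 1.0 / n)                           # equal instance weights to start
    models = []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1)   # a poor learner: a one-split tree
        stump.fit(X, y, sample_weight=w)              # build a model
        wrong = stump.predict(X) != y                 # instances modeled incorrectly
        models.append(stump)
        if not wrong.any():                           # nothing left to boost
            break
        w[wrong] *= 2.0                               # boost those instances
        w /= w.sum()                                  # keep the weights a distribution
    return models                                     # then a new model is built; repeat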

Neural network ensembles, and other classifier ensembles alike, are typically designed heuristically in two steps:

Generating individual classifiers,

Combining the outputs of the individual classifiers, by simply averaging [26].

Figure 3.1: Preprocessing with GA and various Evaluators in Multiple Classifier System

In the feature selection step, GA-based feature selection is run m times to obtain m sets of features, each of which is used for one individual neural network. Each GA run uses a different data set in addition to a different random initial population. The data sets are obtained by randomly sampling, with replacement, the original data set n times, where n is the number of examples in the training set. In the training step, each neural network, with a different number of hidden neurons, is trained independently using the entire training data.

The number of inputs for each individual neural network is the number of features that the GA selected. The trained networks are then tested using the testing data set, and the outputs of the networks for each case are averaged to arrive at the ensemble output for that case. By varying a decision threshold and applying it to the ensemble outputs, a set of TPR-FPR pairs is obtained.
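Assuming each trained network exposes a score in [0, 1] for the positive class (an illustrative interface, not necessarily the one used here), the averaging and threshold sweep look like this:

import numpy as np

def ensemble_roc_points(member_scores, y_true, thresholds=np.linspace(0.0, 1.0, 21)):
    """member_scores: array of shape (m, n_cases), one row of scores per network."""
    ensemble_score = np.mean(member_scores, axis=0)   # average the m network outputs
    pos, neg = (y_true == 1), (y_true == 0)
    points = []
    for thr in thresholds:                            # vary the decision threshold
        pred = ensemble_score >= thr
        tpr = (pred & pos).sum() / pos.sum()          # true positive rate
        fpr = (pred & neg).sum() / neg.sum()          # false positive rate
        points.append((fpr, tpr))
    return points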

4. Generalized Algorithm of AdaBoost

Input:

Given: data set D = {(x1, y1), ..., (xm, ym)}

where xi ∈ X, yi ∈ Y = {-1, +1}

m = total number of examples

T = number of learning rounds

L = base learning algorithm

Initialize:

D1(i) = 1/m for all i ∈ {1, 2, ..., m} (the uniform distribution)

Process:

For t = 1, 2, ..., T:

1. Train the base learner using distribution Dt: ht = L(D, Dt)

2. Calculate the error of ht: εt = Σ{i : ht(xi) ≠ yi} Dt(i)

3. If εt > 1/2, set T = t - 1 and abort the loop

4. Choose the weight updating parameter: αt = (1/2) ln((1 - εt) / εt)

5. Update the distribution: Dt+1(i) = Dt(i) exp(-αt yi ht(xi)) / Zt, where Zt is a normalization factor chosen so that Dt+1 is a proper distribution

Output:

The final classifier: H(x) = sign( Σt αt ht(x) )
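A compact NumPy rendering of the pseudocode above for labels in {-1, +1}, with a decision stump as the base learner L (a sketch; any weak learner that accepts instance weights could be substituted):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=20):
    """AdaBoost for labels y in {-1, +1} (NumPy array), following the steps above."""
    m = len(y)
    D = np.full(m, 1.0 / m)                             # initialize: uniform distribution
    learners, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1)         # base learner L (a stump)
        h.fit(X, y, sample_weight=D)                    # step 1: train on D_t
        pred = h.predict(X)
        eps = D[pred != y].sum()                        # step 2: weighted error
        if eps >= 0.5:                                  # step 3: abort
            break
        eps = max(eps, 1e-12)                           # guard against a perfect fit
        alpha = 0.5 * np.log((1 - eps) / eps)           # step 4: weight parameter
        D = D * np.exp(-alpha * y * pred)               # step 5: update distribution
        D = D / D.sum()                                 # normalization factor Z_t
        learners.append(h)
        alphas.append(alpha)

    def H(X_new):                                       # final weighted-vote classifier
        votes = sum(a * h.predict(X_new) for a, h in zip(alphas, learners))
        return np.sign(votes)

    return H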

Feature Selection

In present-day applications, data sets with hundreds or even thousands of features are no exception. This poses a challenge to classical techniques, since the number of samples is often still limited relative to the feature size. These data sets often have many completely redundant or noisy features, or many features that are highly correlated (containing the same information). Methods suited to high dimensionality, and robust techniques, are therefore needed. Feature selection and ensemble learning are two effective methods to deal with this problem. Feature selection methods with various evaluation techniques seek to remove the redundant features, reduce the dimensionality of the feature space, and improve the accuracy of learning. In general, feature selection can be seen as an optimization problem in the feature space, and many search algorithms have been proposed to reduce its computational complexity, such as branch and bound, heuristic, exhaustive and genetic search. The idea of ensemble learning is to employ multiple learners and combine their predictions; previous research has shown that an ensemble is often more accurate than any of the single classifiers in the ensemble. The intrinsic (true) data dimensionality is usually smaller than the dimensionality of the whole feature space in which the data objects are described, and it is one of the important characteristics of a data set that may influence the performance of classifiers. Consequently, redundancy in the data feature set (and the intrinsic data dimensionality) may affect the performance of the combining techniques, as their performance depends on the training sample size relative to the data dimension. It is therefore meaningful to carry out feature selection first and combine the classifiers subsequently when processing high-dimensional data.

Here we propose an improved GA-based feature selection method (with evaluators) to reduce the dimensionality of high-dimensional datasets, and then evaluate AttributeSelectedClassifier and AdaBoost.M1 constructed on the reduced feature set. Research suggests that feature selection and ensemble learning perform quite well in increasing the accuracy of the predictor and decreasing the complexity of learning [1, 2, 3].

Algorithm: A Simple Genetic Algorithm

Step 1: Set t = 1. Randomly generate N solutions to form the first population, P1. Evaluate the fitness of solutions in P1.

Step 2: Crossover: Generate an offspring population Qt as follows:

2.1. Choose two solutions x and y from Pt based on the fitness values.

2.2. Using a crossover operator, generate offspring and add them to Qt.

Step 3: Mutation: Mutate each solution with a predefined mutation rate.

Step 4: Fitness assignment: Evaluate and assign a fitness value to each solution based on its objective function value and infeasibility.

Step 5: Selection: Select N solutions from Qt based on their fitness and copy them to Pt+1.

Step 6: If the stopping criterion is satisfied, terminate the search and return the current population; else, set t = t + 1 and go to Step 2.

Though robust, the computational time of GA is quite high. Thus the combination of GA with classifier error rate as the fitness function, especially for neural classifiers with complex learning algorithms, is time-consuming when the sample set is very big or the number of features is large. A classifier-independent filter approach with GA is considered more appropriate.

GA is a powerful feature selection tool, especially when the dimensions of the original feature set are large.

Using GA-based feature selection, the goal is to evolve the population toward the best selection sequence over the feature vector, i.e. the feature subset that minimizes the classification error rate.

Design of GA

Design of chromosome

Design of fitness function

Selection

Crossover and Mutation
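Under this design, a chromosome is naturally a bit mask over the features; the fitness scores the masked subset (a filter measure such as a correlation- or consistency-based merit, or a wrapper's cross-validated accuracy); and selection, crossover and mutation act on the masks. A minimal sketch (the fitness function is a placeholder, assumed positive, to be swapped for one of the evaluators above):

import numpy as np

rng = np.random.default_rng(1)

def ga_select(n_features, fitness, pop_size=30, generations=40,
              crossover_rate=0.8, mutation_rate=0.02):
    """Simple GA over bit-mask chromosomes; fitness(mask) scores a feature subset."""
    pop = rng.integers(0, 2, size=(pop_size, n_features))
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        probs = scores / scores.sum()                # fitness-proportional selection
        new_pop = []
        while len(new_pop) < pop_size:
            i, j = rng.choice(pop_size, size=2, p=probs)
            a, b = pop[i].copy(), pop[j].copy()
            if rng.random() < crossover_rate:        # one-point crossover
                cut = rng.integers(1, n_features)
                a[cut:], b[cut:] = pop[j][cut:], pop[i][cut:]
            for child in (a, b):
                flip = rng.random(n_features) < mutation_rate
                child[flip] ^= 1                     # bit-flip mutation
                new_pop.append(child)
        pop = np.array(new_pop[:pop_size])
    scores = np.array([fitness(ind) for ind in pop])
    return pop[scores.argmax()]                      # best mask in the final population

The fitness function is deliberately abstract: plugging in a correlation- or consistency-based merit gives a filter method, while plugging in a classifier's cross-validated accuracy gives a wrapper.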

5. Description of Current Algorithm

The algorithm maintains a weight distribution Dt(i) on the training instances xi, i = 1, ..., N, from which training data subsets St are chosen for each consecutive classifier (hypothesis) ht. The distribution is initialized to be uniform, so that all instances have equal likelihood of being selected into the first training dataset. The training error εt of classifier ht is also weighted by this distribution, such that εt is the sum of the distribution weights of the instances misclassified by ht. As before, we require that this error be less than 1/2. A normalized error βt = εt / (1 - εt) is then obtained, such that for 0 < εt < 1/2 we have 0 < βt < 1.

In the distribution update rule, the distribution weights of the instances correctly classified by the current hypothesis are reduced by a factor of βt, whereas the weights of the misclassified instances are unchanged. When the updated weights are renormalized, so that Dt+1 is a proper distribution, the weights of the misclassified instances are effectively increased. Hence, iteration by iteration, AdaBoost focuses on increasingly difficult instances. Note that AdaBoost raises the weights of instances misclassified by ht so that they add up to 1/2, and lowers the weights of correctly classified instances so that they too add up to 1/2. Since the base learning algorithm (here a NN) is required to have an error less than 1/2, it is guaranteed to correctly classify at least one previously misclassified training example. Once a preset number T of classifiers has been generated, AdaBoost.M1 is ready to classify unlabeled test instances. Unlike bagging, AdaBoost.M1 uses a rather undemocratic voting scheme, called weighted majority voting. The idea is intuitive: classifiers that have shown good performance during training are rewarded with higher voting weights than the others. Recall that a normalized error βt was calculated; the reciprocal 1/βt is therefore a measure of performance and can be used to weight the classifiers. Furthermore, since βt derives from the training error, it is often close to zero, and 1/βt can therefore be a very large number. To avoid the potential instability caused by asymptotically large numbers, the logarithm of 1/βt is usually used as the voting weight of ht. At the end, the class that receives the highest total vote from all classifiers is the ensemble decision.

The algorithm is sequential: classifier CK is created before classifier CK+1, which in turn requires that βK and the current distribution DK be available. Freund and Schapire also showed that the training error of AdaBoost.M1 is bounded above by

E < 2^T ∏t √( εt (1 - εt) )

where E is the ensemble error. Since εt < 1/2, each factor 2√(εt(1 - εt)) is less than 1, so E is guaranteed to decrease as new classifiers are added; for example, εt = 0.3 gives a per-round factor of about 0.92. In most practical cases, the error decreases very rapidly in the first few iterations and approaches zero as new classifiers are added. While this is remarkable on its own account, the surprising resistance of AdaBoost to over-fitting is particularly noteworthy. Over-fitting is a commonly observed phenomenon in which the classifier performs poorly on test data despite achieving a very small training error; it is usually attributed to memorizing the data or learning the noise in the data. As a classifier's capacity increases, so does its tendency to memorize the training data and/or learn the noise. Since the capacity of an ensemble increases with each added classifier, one would expect AdaBoost to suffer from over-fitting, particularly if its complexity exceeds what is necessary to learn the underlying data distribution. Yet AdaBoost performance usually levels off as the number of classifiers increases, with no indication of over-fitting.

The feature selection component initializes a random population according to the design described above and follows the simple genetic search.

6. Experiments and Results

Each classifier in the cascade is trained using the original positive examples and the same number of false positives from the previous stage (or negative examples, at the first stage). The resulting classifier of the previous stage is used as the input of the current stage, which builds a new classifier with a lower false positive rate. The detection threshold is set using a validation set of image pairs.
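The cascade logic can be sketched as follows (a hypothetical stage_learner factory returning a scikit-learn-style classifier; in practice each stage would be a boosted classifier with its threshold tuned on the validation set):

import numpy as np

def train_cascade(stage_learner, pos_X, neg_X, n_stages=5):
    """Each stage trains on all positives plus the negatives that survived the
    previous stages, then passes its remaining false positives onward."""
    stages, survivors = [], neg_X
    for _ in range(n_stages):
        if len(survivors) == 0:
            break                                  # every negative already rejected
        neg_batch = survivors[: len(pos_X)]        # same number of negatives as positives
        X = np.vstack([pos_X, neg_batch])
        y = np.r_[np.ones(len(pos_X)), np.zeros(len(neg_batch))]
        stage = stage_learner().fit(X, y)          # e.g. a small boosted classifier
        still_wrong = stage.predict(survivors) == 1
        survivors = survivors[still_wrong]         # false positives feed the next stage
        stages.append(stage)
    return stages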

For boosting: variance and bias reduction, with little over-fitting.

Small changes in the dataset produce big changes in the predictor (instability).

Choose a Weak (unstable) Classifier for Boosting

Classifiers that are too weak do not provide good results.

Results are shown in the following tables.

Method        | WEKA Classifier                | Time Taken (s) | Correctly Classified (%) | Incorrectly Classified (%) | Kappa Statistic | Mean Absolute Error | Root Mean Square Error
ANN           | functions.MultilayerPerceptron | 0.24           | 82.20                    | 17.80                      | 0.60            | 0.18                | 0.33
kNN           | lazy.IBk                       | 0.02           | 82.45                    | 17.54                      | 0.68            | 0.17                | 0.30
SVM           | functions.SMO                  | 0.17           | 87.71                    | 12.28                      | 0.84            | 0.07                | 0.16
Bayes         | bayes.NaiveBayes               | 0.02           | 89.47                    | 10.52                      | 0.86            | 0.08                | 0.17
Decision Tree | trees.J48                      | 0.02           | 78.60                    | 21.40                      | 0.51            | 0.28                | 0.40

Table 6.1: Performance parameters, based on results obtained using RWeka on the labor dataset (ASC + GA + CfsSubsetEval, 10-fold cross validation)

Criteria                             | Multilayer Perceptron | NaiveBayes | Decision Tree | SMO
Time Taken to Build Model (seconds)  | Negligible            | Negligible | 0.12          | 0.01
Correctly classified instances (%)   | 96.93                 | 95.53      | 94.73         | 96.27
Incorrectly classified instances (%) | 3.07                  | 4.47       | 5.27          | 3.73
Mean Absolute Error (0-1)            | 0.03                  | 0.04       | 0.04          | 0.23
Kappa Statistic                      | 0.95                  | 0.93       | 0.92          | 0.94
TP Rate (0-1)                        | 1.0                   | 1.0        | 0.98          | 1.0
FP Rate (0-1)                        | 0                     | 0          | 0             | 0
Precision (0-1)                      | 1                     | 1          | 1             | 1

Table 6.2: Comparison of the performance of different classifiers applied after feature selection using GA, in RGUI with the RWeka package, on Iris.arff (10-fold cross validation)

Figure 6.1: Performance comparison of classifiers using GA on Iris.arff

Criteria (AdaBoostM1)                | Multilayer Perceptron | NaiveBayes | Decision Tree
Iterations                           | 10                    | 10         | 10
Time Taken to Build Model (seconds)  | 0.01                  | 0.02       | 0.02
Correctly classified instances (%)   | 96.66                 | 95.33      | 95.33
Incorrectly classified instances (%) | 3.33                  | 4.66       | 4.66
Mean Absolute Error (0-1)            | 0.0083                | 0.0329     | 0.0318
TP Rate (0-1)                        | 0.96                  | 0.95       | 0.95
FP Rate (0-1)                        | 0.04                  | 0.033      | 0.02
Kappa Statistic                      | 0.94                  | 0.93       | 0.93
Precision (0-1)                      | 0.96                  | 0.934      | 0.96

Table 6.3: Comparison of the performance of AdaBoostM1 over different base classifiers applied after feature selection using a Genetic Algorithm, in RGUI with the RWeka package, on Iris.arff (10-fold cross validation, full data set)

Criteria (AttributeSelectedClassifier, CFS + GA) | Multilayer Perceptron | NaiveBayes | Decision Tree
Iterations                           | 10     | 10     | 10
Time Taken to Build Model (seconds)  | 0.19   | 0.03   | 0.02
Correctly classified instances (%)   | 96.00  | 96.66  | 96.00
Incorrectly classified instances (%) | 4.00   | 3.33   | 4.00
Mean Absolute Error (0-1)            | 0.0492 | 0.0275 | 0.0355
Kappa Statistic                      | 0.94   | 0.95   | 0.94
TP Rate (0-1)                        | 1      | 1      | 0.98
FP Rate (0-1)                        | 0      | 0      | 0
Precision (0-1)                      | 1      | 1      | 1

Table 6.4: AttributeSelectedClassifier (CFS + GA) on Iris.arff

Criteria (MultiBoostAB)              | Multilayer Perceptron | NaiveBayes | Decision Tree
Iterations                           | 10     | 10     | 10
Time Taken to Build Model (seconds)  | 1.97   | 0.02   | 0.01
Correctly classified instances (%)   | 96.66  | 96.66  | 95.33
Incorrectly classified instances (%) | 3.33   | 3.33   | 4.67
Mean Absolute Error (0-1)            | 0.0206 | 0.0224 | 0.0326
Kappa Statistic                      | 0.95   | 0.95   | 0.93
TP Rate (0-1)                        | 1      | 1.00   | 0.98
FP Rate (0-1)                        | 0      | 0.00   | 0.00
Precision (0-1)                      | 1      | 1.00   | 1.00

Table 6.5: MultiBoostAB on Iris.arff

Criteria (Stacking)                  | Multilayer Perceptron | NaiveBayes | Decision Trees
Iterations                           | 10     | --     | --
Time Taken to Build Model (seconds)  | 1.38   | 0      | 0
Correctly classified instances (%)   | 33.33  | 33.33  | 33.33
Incorrectly classified instances (%) | 66.66  | 66.66  | 66.66
Mean Absolute Error (0-1)            | 0.4489 | 0.4598 | 0.4603
Kappa Statistic                      | 0      | 0      | 0
TP Rate (0-1)                        | 0.37   | 0.63   | 0
FP Rate (0-1)                        | 0.37   | 0.63   | 0
Precision (0-1)                      | 0.12   | 0.21   | 0

Table 6.6: Stacking on Iris.arff

Criteria                             | Multilayer Perceptron | NaiveBayes | Decision Tree | SMO
Time Taken to Build Model (seconds)  | 10.89   | 0       | 0.06    | 0.55
Correctly classified instances (%)   | 74.5863 | 49.6454 | 82.7423 | 59.3381
Incorrectly classified instances (%) | 25.4137 | 50.3546 | 17.2577 | 40.6619
Mean Absolute Error (0-1)            | 0.1678  | 0.279   | 0.1054  | 0.2996
Kappa Statistic                      | 0.6614  | 0.3337  | 0.7701  | 0.4583
TP Rate (0-1)                        | 0.746   | 0.496   | 0.827   | 0.593
FP Rate (0-1)                        | 0.084   | 0.161   | 0.058   | 0.135
Precision (0-1)                      | 0.733   | 0.606   | 0.853   | 0.577

Table 6.7: Comparison of the performance of different classifiers applied after feature selection using a Genetic Algorithm, in RGUI with the RWeka package, on vehicle.arff (10-fold cross validation, CfsSubsetEval)

Figure 6.2: Performance comparison of classifiers using GA + CFS on vehicle.arff

Criteria (AdaBoostM1)                | Multilayer Perceptron | NaiveBayes | Decision Tree
Iterations                           | 10      | --      | 10
Time Taken to Build Model (seconds)  | 150.19  | 0.05    | 1.03
Correctly classified instances (%)   | 99.2908 | 46.3357 | 100
Incorrectly classified instances (%) | 0.7092  | 53.6643 | 0
Mean Absolute Error (0-1)            | 0.0108  | 0.2793  | 0
TP Rate (0-1)                        | 0.993   | 0.463   | 1
FP Rate (0-1)                        | 0.993   | 0.172   | 0
Kappa Statistic                      | 0.9905  | 0.2901  | 1
Precision (0-1)                      | 0.993   | 0.534   | 1

Table 6.8: Comparison of the performance of different meta classifiers applied after feature selection using a Genetic Algorithm, in RGUI with the RWeka package, on vehicle.arff (10-fold cross validation, full data set)

Criteria (AttributeSelectedClassifier, CFS + GA) | Multilayer Perceptron | NaiveBayes | Decision Tree
Iterations                           | --      | --      | --
Time Taken to Build Model (seconds)  | 13.7    | 0.05    | 0.14
Correctly classified instances (%)   | 74.5863 | 49.6454 | 82.7423
Incorrectly classified instances (%) | 25.4137 | 50.3546 | 17.2577
Mean Absolute Error (0-1)            | 0.1678  | 0.279   | 0.1054
Kappa Statistic                      | 0.6614  | 0.3337  | 0.7701
TP Rate (0-1)                        | 0.746   | 0.496   | 0.827
FP Rate (0-1)                        | 0.084   | 0.161   | 0.058
Precision (0-1)                      | 0.733   | 0.606   | 0.853

Table 6.9: AttributeSelectedClassifier (CFS + GA) on vehicle.arff

Criteria (MultiBoostAB)              | Multilayer Perceptron | NaiveBayes | Decision Tree
Iterations                           | --      | --      | --
Time Taken to Build Model (seconds)  | 235.2   | 0.09    | 2.33
Correctly classified instances (%)   | 95.9811 | 46.3357 | 100
Incorrectly classified instances (%) | 4.0189  | 53.664  | 0
Mean Absolute Error (0-1)            | 0.0213  | 0.2793  | 0.0003
Kappa Statistic                      | 0.9464  | 0.290   | 1
TP Rate (0-1)                        | 0.96    | 0.463   | 1
FP Rate (0-1)                        | 0.013   | 0.172   | 0
Precision (0-1)                      | 0.96    | 0.534   | 1

Table 6.10: MultiBoostAB on vehicle.arff

Criteria (Stacking)                  | Multilayer Perceptron | NaiveBayes | Decision Trees
Iterations                           | --      | --      | --
Time Taken to Build Model (seconds)  | 100.39  | 92.3    | 90.52
Correctly classified instances (%)   | 88.0615 | 89.5981 | 86.8794
Incorrectly classified instances (%) | 11.9385 | 10.4019 | 13.1206
Mean Absolute Error (0-1)            | 0.0842  | 0.0609  | 0.0906
Kappa Statistic                      | 0.8408  | 0.8613  | 0.825
TP Rate (0-1)                        | 0.881   | 0.896   | 0.869
FP Rate (0-1)                        | 0.04    | 0.035   | 0.045
Precision (0-1)                      | 0.882   | 0.895   | 0.87

Table 6.11: Stacking on vehicle.arff

Feature selection algorithms compared: GA + CFS + ASC; GA + ConsistencySubsetEval + ASC; GA + Wrapper + ASC.

Sr. No. | Dataset            | #Attributes (Features) | CFS: #Selected | CFS: Time Taken (s) | Consistency: #Selected | Consistency: Time Taken (s) | Wrapper: #Selected
1       | Iris               | 5                      | 2              | 4.95                | 2                      | 2.61                        | 1
2       | Splice             | 62                     | 24             | 0.5                 | 4                      | 1                           | 3
3       | Letter Recognition | 17                     | 12             | 2                   | 14                     | 50                          | 1
4       | Arrhythmia         | 280                    | 99             | 1                   | 155                    | 2                           | 2
5       | Ionosphere         | 35                     | 15             | 35                  | 12                     | 60                          | 2

Table 6.12: Performance of feature selection algorithms with AttributeSelectedClassifier (ASC) and the MultilayerPerceptron classifier, in RGUI with the RWeka package

Figure 6.3: Performance of classifiers with ASC (number of features selected) + MLP

7. Discussion on Accepting Results

Results depend on the particular data set on which the analysis method is used; they change with different data and analyses, so contrary findings should be identified along with the critical assumptions made in the analysis.

RGUI (with the RWeka package) and WEKA, a Java package, provide an environment for the implementation and simulation of a large number of machine learning and statistical algorithms. In the present research, the performance of AdaBoost.M1, AttributeSelectedClassifier, etc. is compared over various base classifiers (NNs and others). Different feature selection strategies, such as genetic search, are used along with various evaluation schemes.

The existing algorithm for feature selection in RGUI with WEKA performs a search using the simple genetic algorithm. The fitness of a feature subset is evaluated using one of several techniques, including the following:

CfsSubsetEval

ConsistencySubsetEval

WrapperSubsetEval

FilteredSubsetEval

All of the above methods are applied to evaluate the selected feature subset. Taking consistency-based evaluation as an example, the computation proceeds as follows:

The inconsistency rate specifies to what extent the reduced data still represent the original dataset; it can be considered a measure of how inconsistent the data become when only a subset of attributes is considered. Inconsistency is introduced into the data when the number of attributes is reduced, and the rate measures how much inconsistency arises when only a certain feature subset is considered.

The rate is computed as follows:

Two items of the given dataset are considered inconsistent if they match on the subset of features considered but differ in their class labels.

For all matching instances, the inconsistency count is the number n of matching instances minus the largest number of instances sharing the most frequent class label; for example, if there are two class labels c1 and c2 with respectively n1 and n2 instances (n1 + n2 = n), then the inconsistency count equals n - max(n1, n2).

The inconsistency rate is computed as the sum of all the inconsistency counts divided by the total number of instances.
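A direct rendering of this computation (a sketch using pandas; the column names `features` and `label` are hypothetical placeholders for the candidate subset and the class attribute):

import pandas as pd

def inconsistency_rate(df: pd.DataFrame, features: list, label: str) -> float:
    """Sum over matching groups of (n - count of the most frequent class),
    divided by the total number of instances."""
    total = len(df)
    count = 0
    for _, group in df.groupby(features):
        n = len(group)
        count += n - group[label].value_counts().max()   # n - max(n1, n2, ...)
    return count / total

For instance, inconsistency_rate(data, ["f1", "f2"], "class") returns 0 when those two features alone already determine the class.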

8. Performance Results

The results show the performance comparison of various classifiers on following strategies of Feature Selection:

Without Feature Selection

Feature Selection using GA, subset and consistency based Evaluator

Feature Selection using GA, Correlation based Evaluator

Feature Selection using GA, Both Consistency and Correlation based Evaluators

The following measures are being used to compare the performance of different algorithms:

Iterations: Number of iterations performed (wherever applicable)

Time Taken to build model (Training time + Construction time)

Correctly Classified instances in percentage of total number of instances.

Incorrectly Classified instances in percentage of total number of instances.

Kappa Statistic: measures the agreement between the predicted and observed categorizations of a dataset while correcting for agreement that occurs by chance. Chance successes are deducted from the predictor's successes, and the result is expressed as a proportion of the total for a perfect predictor: κ = (Po - Pe) / (1 - Pe), where Po is the observed agreement and Pe the agreement expected by chance.

Mean Absolute Error

Root Mean Squared Error

Relative Absolute Error

Root Relative Squared Error

TP Rate or Recall

The true positives (TP) and true negatives (TN) are correct classifications. A false positive (FP) occurs when the outcome is incorrectly predicted as yes (or positive) when it is actually no (negative). A false negative (FN) occurs when the outcome is incorrectly predicted as negative when it is actually positive.

Consider a two-class problem; the true positives, true negatives, false positives and false negatives can then be represented as:

            | Predicted: Yes | Predicted: No
Actual: Yes | True Positive  | False Negative
Actual: No  | False Positive | True Negative

Table 8.1: Positives and negatives in prediction [3]
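These measures can be reproduced from raw predictions; a sketch for the two-class (0/1) case, using scikit-learn as a convenient analogue to the WEKA output:

from sklearn.metrics import cohen_kappa_score, confusion_matrix, precision_score

def summarize(y_true, y_pred):
    """Compute the performance measures used in the tables above."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "TP Rate (recall)": tp / (tp + fn),
        "FP Rate": fp / (fp + tn),
        "Precision": precision_score(y_true, y_pred),
        "Kappa": cohen_kappa_score(y_true, y_pred),  # agreement corrected for chance
    }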

Strengths of AdaBoost

It has no parameters to tune (except for the number of rounds)

It is fast, simple and easy to program.

It comes with a set of theoretical guarantees (e.g., bounds on training error and test error).

Instead of trying to design a learning algorithm that is accurate over the entire space, focus on finding base learning algorithms that only need to be better than random.

It can identify outliers, i.e. examples that are either mislabeled or that are inherently ambiguous and hard to categorize.

Weakness of AdaBoost

The actual performance of boosting depends on the data and the base learner.

Boosting seems to be especially susceptible to noise.

When the number of outliers is very large, the emphasis placed on the hard examples can hurt the performance.

In the early iterations, boosting is primarily a bias-reducing method.

In later iterations, it appears to be primarily a variance-reducing method

Combines multiple learned models to construct better generalizations

Classifiers that always agree won't give new information

Combining the predictions of an ensemble will often be more accurate than any single prediction.

Ideal learning ensemble: individually accurate classifiers with a high level of disagreement.

Results

Ada-boosting results are strongly correlated even across different algorithms (boosting depends more on the data set than on the type of classifier algorithm).

At most 30 networks in ensemble are sufficient, in general.

Neural Networks better than Classification Trees, in general.

Boosted classifiers can reject many of the negative sub-windows while detecting all positive instances.

A series of such simple classifiers can achieve good detection performance while eliminating the need for further processing of negative sub-windows.

9. Observations

The accuracy of classifiers is better with feature selection, i.e. on the feature set selected by the current system, than on the full feature set (without feature selection).

The time taken by the system to find the optimal feature set is less than that taken by the correlation-based or consistency-based evaluators; it is a good characteristic that the time taken does not grow to a great or infeasible extent.

The number of features selected in the optimal set by the current system is equal to or more than for both correlation-based GA and consistency-based GA, except on the Splice dataset. On Splice, the number of features selected by the system is less than for correlation-based GA; however, this does not decrease the accuracy of the classifiers.

Also, the accuracy of classifiers is better on the feature set selected by the current system than on feature sets selected by either consistency-based GA or correlation-based GA. The accuracy improvement ranges from 3% to 5%.

The proposed algorithm works efficiently over high-dimensional datasets without compromising accuracy or running time on low-dimensional datasets (Iris and Splice).

Feature selection tends to reduce classifier error rates.

Test data has already been obtained, and evaluation on it proves the approach feasible.

Model complexity and expression analysis has been carried out, and the system is viable.

The system works well, with the desired accuracy and a reduction in FPs and FNs.

Across different environmental setups, the system is robust and accurate with minimal error.

10. Conclusion

Various search and evaluation techniques are used to speed up the feature selection.

Multiple classifier and Genetic Algorithms are used for improving the classification and efficient feature selection.

A cascade of classifiers is used to minimize the computation without sacrificing the classification performance.

Boosting (and all its variants) is a practical tool for classification and other learning problems, and provides more accuracy than a single base classifier.

AdaBoost, and the current research work built upon it, can improve classifier accuracy for many problems.

AdaBoost (and all its variants) is a versatile learning algorithm based on the synergistic performance of an ensemble of weak classifiers (learners), and it is grounded in rich theory. The complexity of the weak learner is important.

AdaBoost performs well theoretically and experimentally and is resistant to over-fitting (though not always).

NNs are powerful tools for modelling in a wide variety of domains, and boosting NNs forms an ensemble that retains the classification strengths of NNs while increasing accuracy. However, the execution time may become somewhat higher.

Ensemble learning using feature selection is more accurate than the methods without feature selection in general and more efficient than single classifier when dealing with high-dimensional datasets.

Here the current research (with feature selection and evaluation schemes) is presented, which trains each base model using a training-example weight vector based on the performance of the previous base model.

This system works well even on datasets with noise.

A more detailed empirical analysis has been performed, including the performance of the base models and the ranges of the parameters for regular and noisy examples.

The standard GA method can be combined with evaluation schemes to obtain feature sets on which higher accuracy can be achieved; GA plays the central role of a selector that chooses a subset of features.

Using two criteria (a search method plus evaluation schemes) to find an optimal feature set gives better results than a single criterion.

Feature subset selection with correlation-based evaluation yields better results than consistency-based evaluation with GA, in general.