Non Redundant Association Rules Using Min Max Approximation Accounting Essay


In the area of association rule mining, the primary focus has been on developing novel algorithms to aid efficient computation of such rules. However, the quality of the extracted rules has not drawn adequate attention. One big problem in association rule mining is the huge number of extracted rules, which makes them difficult for end users to comprehend and significantly reduces the effectiveness of association rule mining. If the extracted knowledge cannot be effectively used in solving real-world problems, the effort spent extracting it becomes unproductive. Moreover, many of the extracted rules can be replaced by other rules and may therefore be considered redundant.

The main technique is to construct concise representations of association rules. A concise representation contains a much smaller number of rules and is considered lossless if all association rules can be derived from it. Association rules are derived from frequent itemsets. For a large dataset, the number of frequent itemsets is often so huge that generating all frequent itemsets requires resources (memory and time) [12] that are difficult to provide.

The proposed work is to design concise representations of frequent itemsets. A concise representation of frequent itemsets is a proper subset of the frequent itemsets from which, with the help of Min-Max approximation, all frequent itemsets and their supports can be derived without any further access to the dataset.

RELATED WORK

Various strategies have evolved to extract Non-Redundant Frequent Itemset patterns. This has led to work on problems such as mining of Closed Itemsets [44], Maximal Itemsets [16], Non-Derivable Itemsets [55], and pattern profiles [70].

Closed Itemsets [44] and Non-Derivable Itemsets (NDI) [55] are lossless forms of compressing frequent itemset patterns: the frequent itemsets and their associated frequency counts can be derived from the compressed representation. Calders and Goethals [55] found that for some datasets and support thresholds the number of Non-Derivable Itemsets is less than the number of Closed Itemsets, while for other datasets and support thresholds the number of Closed Itemsets is less than the number of NDI [55]. Maximal Itemsets allow greater compression than Closed Itemsets, but the representation is lossy: the frequent itemsets themselves are recovered exactly, but the exact frequency counts associated with them cannot be determined. There are other lossy representations besides Maximal Itemsets. The top-k patterns approach by Han et al. [27] presents the k most frequent Closed Itemsets to the end user. Xin et al. [66] extended this work to extract redundancy-aware top-k frequent itemset patterns. Error-tolerant patterns by Yang et al. [71] and Pei et al. [48] allow a certain amount of fluctuation in evaluating the supports of itemset patterns. The approach of Afrati et al. [10] uses k itemsets to recover a collection of frequent itemsets; however, it is difficult to recover the support information with their approach. Yan et al. [70] demonstrated that the pattern-profile approach can effectively summarize itemsets for non-redundancy, but from an efficiency perspective it is not clear how well this approach scales to large datasets, and the pattern profiles are not themselves itemset patterns. A clear way to address this problem has yet to evolve.

For these reasons, the proposed work is to find Non-Redundant Association Rules for Non-Taxonomy Datasets using Min-Max Approximation.

Proposed work for Concise Representation of Association Rules in Non-Taxonomy Data Sets using MinMax Rules

The development of a concise and lossless representation is a promising way to improve the quality of discovered associations. In this proposed work, two condensed-basis algorithms are proposed, which represent Non-Redundant Association Rules. These algorithms are applied to the Census-Income dataset taken from the UCI KDD Repository.

Association Rules

Let I = {I1, I2, …, Im} be a set of m unique items, let t be a transaction containing a set of items from I, so that t ⊆ I, and let T be a database or dataset containing a set of identifiable transactions t. As mentioned, an association rule is an implication of the form A → B, where A, B ⊂ I are itemsets and A ∩ B = ∅. If an item i appears in a transaction t, then i and t are related by a binary relation, denoted (t, i) ∈ R. The following mappings define the Galois connection of this binary relation, where X ⊆ I and Y ⊆ T [73].

τ(X) = {t ∈ T | ∀ i ∈ X, (t, i) ∈ R} (1)

γ(Y) = {i ∈ I | ∀ t ∈ Y, (t, i) ∈ R} (2)

Here τ(X) is known as the transaction mapping of X, while γ(Y) is known as the item mapping of Y.

As known from formal concept analysis, the Galois connection between the sets T and I is a pair (τ, γ), where τ: 2^I → 2^T and γ: 2^T → 2^I, satisfying, for X, X1, X2 ⊆ I and Y, Y1, Y2 ⊆ T:

Monotonicity: X1 ⊆ X2 ⟹ τ(X2) ⊆ τ(X1) (3)

Monotonicity: Y1 ⊆ Y2 ⟹ γ(Y2) ⊆ γ(Y1) (4)

X ⊆ γ(τ(X)), i.e. (γ∘τ)(X) ⊇ X (5)

Y ⊆ τ(γ(Y)), i.e. (τ∘γ)(Y) ⊇ Y (6)
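To make the two mappings concrete, the following minimal Python sketch (not part of the original study; the toy transactions are purely illustrative) implements τ and γ and checks properties (3) and (5):

    # Minimal sketch (illustrative only) of the Galois mappings tau (1) and gamma (2)
    # on a hypothetical toy transaction database; item and transaction names are made up.
    toy_db = {                       # transaction id -> set of items
        "t1": {"a", "b", "c"},
        "t2": {"a", "c"},
        "t3": {"a", "b", "c", "d"},
        "t4": {"b", "d"},
    }

    def tau(X, db=toy_db):
        """Transaction mapping: transactions containing every item of X."""
        return {t for t, items in db.items() if set(X) <= items}

    def gamma(Y, db=toy_db):
        """Item mapping: items shared by every transaction of Y."""
        if not Y:
            return set.union(*db.values())   # convention for the empty transaction set
        return set.intersection(*(db[t] for t in Y))

    X = {"a", "c"}
    assert X <= gamma(tau(X))                        # extension, property (5)
    assert tau({"a", "b", "c"}) <= tau({"a", "c"})   # monotonicity, property (3)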

Based on the Galois connection, the association rule measures support and confidence can be defined as follows.

Support: The support of an itemset A, denoted supp(A), is the percentage of transactions that contain A. Therefore [73]

supp(A) = |τ(A)| / |T|

where τ(A) is the set of transactions shared by all items of A, |τ(A)| is the number of such transactions, and |T| is the total number of transactions. This is similar to the probability of A, P(A).

Confidence: The confidence of an association rule of the form A → B, denoted conf(A → B), where A, B ⊆ I, is the percentage of the transactions containing A that also contain B. Therefore [73]

conf(A → B) = |τ(A ∪ B)| / |τ(A)|

where τ(A ∪ B) is the set of transactions shared by all items of A and B, and |τ(A ∪ B)| is the number of such transactions. This is similar to the conditional probability P(B|A).

Closed itemset: Let A be a subset of I. A is a closed itemset iff (γ∘τ)(A) = A.

Generator: An itemset g ∈ 2^I is considered to be a generator of a closed itemset c ∈ 2^I if and only if c = (γ∘τ)(g); it is a minimal generator if, in addition, there is no g′ ⊂ g such that (γ∘τ)(g′) = c.

From the definition of a generator it follows that g ⊆ c holds for any generator g and its closed itemset c. The Galois connection satisfies the following properties [16].

Property 1: Let A ∈ 2^I and let c be the closed itemset of A. Then supp(A) = supp(c).

Property 2: Let X1, X2 ∈ 2^I and Y1, Y2 ∈ 2^T. Then

X1 ⊆ X2 ⟹ τ(X2) ⊆ τ(X1)

Y1 ⊆ Y2 ⟹ γ(Y2) ⊆ γ(Y1)

Property 2 indicates that the transaction mapping of an itemset is larger than the transaction mapping of its super-itemset; similarly, the item mapping of a transaction set is larger than the item mapping of its super transaction set.
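The closure operator γ∘τ, the closed-itemset test and the minimal-generator test can be sketched on the same toy data as follows (again an illustration, not the tooling used in this work):

    # Closure operator gamma∘tau on the toy data above, with tests for closed itemsets
    # and minimal generators (illustrative only).
    from itertools import combinations

    def closure(X, db=toy_db):
        return gamma(tau(X, db), db)

    def is_closed(X, db=toy_db):
        return closure(X, db) == set(X)

    def is_minimal_generator(g, db=toy_db):
        c = closure(g, db)
        # minimal: no proper subset of g has the same closure
        return all(closure(set(s), db) != c
                   for r in range(len(g)) for s in combinations(g, r))

    print(closure({"a"}))               # {'a', 'c'}: {a} is not closed
    print(is_closed({"a", "c"}))        # True
    print(is_minimal_generator({"a"}))  # True: {a} is a minimal generator of {a, c}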

Association rule mining is usually decomposed into two sub-problems: first, find the frequent itemsets whose support is larger than or equal to the predefined minimum support; second, from those frequent itemsets, generate the association rules that satisfy the minimum support and minimum confidence. This is applied to the popular Mushroom dataset, derived from the UCI KDD repository, with minimum support 0.7 and minimum confidence 0.7. The association rules are generated in the following steps (a programmatic sketch is given after the steps):

The frequent itemsets are generated using the Clementine Workbench tool.

From these, the generators and closed itemsets are derived using the XLMiner tool.

Then, the association rules that satisfy the user-specified minsupp = 0.7 and minconf = 0.7 are formed.
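As an aside, a rough open-source equivalent of this two-step process can be sketched with pandas and mlxtend; the chapter itself relies on the Clementine Workbench and XLMiner tools, and the file name below is a hypothetical local copy of the UCI Mushroom data:

    # Rough open-source equivalent of the two steps above (assumes pandas and mlxtend
    # are installed; "mushroom.csv" is a hypothetical local copy of the UCI data).
    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    df = pd.read_csv("mushroom.csv")                     # categorical attributes
    onehot = pd.get_dummies(df).astype(bool)             # one item per attribute=value pair

    frequent = apriori(onehot, min_support=0.7, use_colnames=True)                # step 1
    rules = association_rules(frequent, metric="confidence", min_threshold=0.7)   # step 2
    print(rules[["antecedents", "consequents", "support", "confidence"]].head(20))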

The above process generated 88 association rules. Table 4.1 displays 20 of the 88 association rules.

S.No. | Association rule | Supp | Conf
1 | gill-attachment = f → veil-type = p | 0.975 | 1.0
2 | veil-color = w → veil-type = p | 0.975 | 1.0
3 | gill-attachment = f and veil-color = w → veil-type = p | 0.973 | 1.0
4 | gill-attachment = f and ring-number = o → veil-type = p | 0.898 | 1.0
5 | gill-spacing = c and veil-color = w → veil-type = p | 0.814 | 1.0
6 | gill-attachment = f and gill-spacing = c → veil-type = p and veil-color = w | 0.812 | 1.0
7 | gill-attachment = f and gill-spacing = c → veil-type = p | 0.812 | 1.0
8 | gill-attachment = f and gill-spacing = c and veil-type = p → veil-color = w | 0.812 | 1.0
9 | gill-attachment = f → veil-type = p and veil-color = w | 0.973 | 1.0
10 | gill-attachment = f → veil-type = p and ring-number = o | 0.898 | 0.92
11 | veil-color = w → gill-spacing = c and veil-type = p | 0.814 | 0.83
12 | veil-color = w → gill-attachment = f and gill-spacing = c and veil-type = p | 0.812 | 0.83
13 | gill-attachment = f and veil-color = w → gill-spacing = c and veil-type = p | 0.812 | 0.83
14 | gill-attachment = f and veil-color = w → veil-type = p and ring-number = o | 0.897 | 0.92
15 | gill-attachment = f and ring-number = o → veil-type = p and veil-color = w | 0.897 | 0.99
16 | gill-spacing = c and veil-color = w → gill-attachment = f and veil-type = p | 0.813 | 0.99
17 | gill-attachment = f → veil-color = w | 0.973 | 0.99
18 | gill-attachment = f → ring-number = o | 0.898 | 0.92
19 | gill-attachment = f and veil-color = w → gill-spacing = c | 0.812 | 0.83
20 | gill-attachment = f and ring-number = o → veil-color = w | 0.897 | 0.99

Table 4.1: Association rules (Mushroom dataset, minsupp = 0.8, minconf = 0.8).

With the help of the proposed concept, the closed itemsets and their minimal generators are given in Table 4.2.

Closed itemset | Minimal generators | Support
{veil-type = p} |  | 1.0
{gill-attachment = f and veil-type = p} | {gill-attachment = f} | 0.974
{gill-spacing = c and veil-type = p} | {gill-spacing = c} | 0.838
{veil-type = p and veil-color = w} | {veil-color = w} | 0.975
{veil-type = p and ring-number = o} | {ring-number = o} | 0.921
{gill-attachment = f and veil-type = p and veil-color = w} | {gill-attachment = f and veil-color = w} | 0.973
{gill-attachment = f and veil-type = p and ring-number = o} | {gill-attachment = f and ring-number = o} | 0.898
{gill-spacing = c and veil-type = p and veil-color = w} | {gill-spacing = c and veil-color = w} | 0.814
{gill-attachment = f and gill-spacing = c and veil-type = p and veil-color = w} | {gill-attachment = f and gill-spacing = c} | 0.812
{gill-attachment = f and veil-type = p and veil-color = w and ring-number = o} | {veil-color = w and ring-number = o} | 0.897

Table 4.2: Closed itemsets and minimal generators for the Mushroom dataset.

Redundancy in Association Rules

A challenge to association mining is the huge number of extracted rules. Using closed itemsets and generators to extract association rules can greatly reduce the number of extracted rules [29, 36]. However, a considerable amount of redundancy still exists in the rules extracted based on closed itemsets. Therefore, techniques are needed to remove the redundancy in order to generate high-quality association rules. The scope of the redundancy must be carefully and fairly defined so that the reduction does not cause information loss in the resulting rules. Any information loss causes quality deterioration of the extracted rules, which makes the redundancy reduction not worthwhile.

Redundancy definition

The rules in Table 4.1 are considered useful based on the predefined minimum support and confidence. However, some of the rules do not actually contribute new information: the consequent concluded by some rules can be obtained via other rules with the same or higher confidence but without requiring more conditions to be satisfied. For example, consider rule 9, say R9, in Table 4.1:

R9: gill-attachment = f → veil-type = p and veil-color = w

This rule can be obtained from rule 1 and rule 17, say R1 and R17, which are

R1: gill-attachment = f → veil-type = p

R17: gill-attachment = f → veil-color = w

R9 can be obtained by applying the distributive law to R1 and R17; thus, R9 can be considered a redundant rule.

Similarly, the rules R5: gill-spacing = c and veil-color = w → veil-type = p,

R8: gill-attachment = f and gill-spacing = c and veil-type = p → veil-color = w, and

R13: gill-attachment = f and veil-color = w → gill-spacing = c and veil-type = p are also redundant.

The rules R1, R5, R8, R13 and R17 conclude the same or fewer results than can be produced by other rules. These rules have a longer or equal antecedent and a shorter or equal consequent, and their confidence is not larger than that of the remaining rules. Such rules give no new information and are removed to obtain accurate, high-quality rules.

For these reasons, in the proposed work, the following definition is used to define this type of redundant rules.

Redundant rules: Let X → Y and X′ → Y′ be two association rules with confidence cf and cf′, respectively. X → Y is said to be redundant with respect to X′ → Y′ if X′ ⊆ X, Y ⊆ Y′, and cf ≤ cf′.

Based on this definition, an association rule X → Y is non-redundant if there does not exist any other rule X′ → Y′ with X′ ⊆ X and Y ⊆ Y′ whose confidence is the same as or larger than that of X → Y.
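The definition can be applied mechanically. The sketch below (an illustration, with rules represented as (antecedent, consequent, confidence) triples of frozensets, which is an assumed layout) filters out rules that are redundant with respect to some other rule in the same set:

    # Direct check of the redundancy definition above.
    def is_redundant(rule, others):
        X, Y, cf = rule
        return any(Xp <= X and Y <= Yp and cf <= cfp
                   for (Xp, Yp, cfp) in others if (Xp, Yp) != (X, Y))

    def non_redundant(rules):
        return [r for r in rules if not is_redundant(r, rules)]

    # Rules R1, R9 and R17 from Table 4.1, encoded as attribute=value items:
    R1  = (frozenset({"gill-attachment=f"}), frozenset({"veil-type=p"}), 1.0)
    R9  = (frozenset({"gill-attachment=f"}), frozenset({"veil-type=p", "veil-color=w"}), 1.0)
    R17 = (frozenset({"gill-attachment=f"}), frozenset({"veil-color=w"}), 0.99)
    print(non_redundant([R1, R9, R17]))   # only R9 survives: R1 and R17 add no new information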

Safely eliminating redundancy without damaging the capacity of the remaining rules is an essential issue, and defining a boundary between redundancy and non-redundancy is crucial to ensure safe redundancy elimination. Several approaches have been proposed for redundancy elimination [10, 28, 36], but none of them specifically discusses this boundary. The proposal here is to use the Certainty Factor (CF) as the criterion for determining the boundary: if the deletion of a rule does not reduce the CF value of the remaining rules, the deletion is considered safe.

The concept of the certainty factor was first proposed in MYCIN [52] to express the level of accuracy and truth behind an association rule and to determine how reliable the antecedent of the given rule is. The certainty factor is founded on two functions, the measure of belief MB(X, Y) and the measure of disbelief MD(X, Y), for a rule of the form X → Y. The functions MB and MD are given as follows:

MB(X, Y) = 1 if P(Y) = 1; otherwise MB(X, Y) = (max(P(Y|X), P(Y)) - P(Y)) / (1 - P(Y)) (10)

MD(X, Y) = 1 if P(Y) = 0; otherwise MD(X, Y) = (P(Y) - min(P(Y|X), P(Y))) / P(Y) (11)

Here P(Y|X) and P(Y) represent the confidence of the rule and the support of the consequent, respectively. Both MB and MD take values between 0 and 1 and measure the strength of the belief or disbelief in the consequent Y given the antecedent X. Thus, MB(X, Y) weighs how much the antecedent X increases the possibility of the consequent Y occurring, while MD(X, Y) weighs how much X decreases the possibility of Y occurring. If P(Y|X) equals 1, the antecedent completely supports the consequent and MB(X, Y) is 1. On the other hand, if P(Y|X) equals 0, the antecedent completely denies the consequent and MD(X, Y) is 1. The total strength of the belief or disbelief captured by the association rule is measured by the certainty factor, which is defined as follows [52]:

CF(X, Y) = MB(X, Y) - MD(X, Y) (12)

The value of the certainty factor will be between 1 and -1, where negative values represent the cases where the antecedent is against (denying) the consequent. Positive values indicate that the antecedent supports the consequent. The certainty factor 0 means that the antecedent does not influence (neither supporting nor denying) the consequent. Association rules with a high certainty factor value are the most useful as they represent strong positive associations between the rule's antecedent and consequent. The aim of association rule mining is to discover these rules that have strong positive associations. It is therefore proposed that the certainty factor can be used to measure the strength of discovered association rules.
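A direct computation of CF from a rule's confidence P(Y|X) and the support P(Y) of its consequent, following equations (10) to (12), can be sketched as follows; the numeric inputs are illustrative, not taken from the experiments:

    # Certainty factor of a rule X -> Y from its confidence and consequent support.
    def certainty_factor(conf_xy, supp_y):
        if conf_xy >= supp_y:                              # belief case: MD = 0
            mb = 1.0 if supp_y == 1.0 else (conf_xy - supp_y) / (1.0 - supp_y)
            md = 0.0
        else:                                              # disbelief case: MB = 0
            mb = 0.0
            md = 1.0 if supp_y == 0.0 else (supp_y - conf_xy) / supp_y
        return mb - md

    print(certainty_factor(0.90, 0.80))   #  0.5  : antecedent supports the consequent
    print(certainty_factor(0.70, 0.80))   # -0.125: antecedent works against the consequent
    print(certainty_factor(0.80, 0.80))   #  0.0  : antecedent does not influence the consequent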

Proposed Exact and Reliable bases to Represent Non-Redundant Association Rules

Developing concise and lossless representations is a promising way to improve the quality of the discovered associations. For this, the proposed work defines the following bases.

Proposed Exact Bases

MinMax Exact Basis: Let C be the set of frequent closed itemsets and, for each frequent closed itemset c, let Gc be the set of minimal generators of c; r denotes a rule. The MinMax exact basis is:

MinMaxExact = {r: g → (c \ g) such that c ∈ C, g ∈ Gc and g ≠ c}

MinMax Approximate Basis: Let C be the set of frequent closed itemsets and let G be the set of minimal generators of the frequent closed itemsets in C. The MinMax approximate basis is:

MinMaxApprox = {r: g → (c \ g) such that c ∈ C, g ∈ G and (γ∘τ)(g) ⊂ c}

Rules which have a confidence of less than 1 are known as approximate rules, while those with a confidence equal to 1 are known as exact rules.

The above proposed work is applied to the Mushroom dataset obtained from the UCI KDD Repository. First, the Apriori algorithm is applied to the dataset using the Clementine workbench tool to get the frequent itemsets. From the frequent itemsets, the generators and closed itemsets are derived with the proposed work. Then the MinMax approaches are applied, using the XLMiner tool, to get the exact and approximate rules. Of the 88 rules extracted from the Mushroom dataset, there are 9 exact rules and 71 approximate rules. Based on the Min-Max exact basis and the Min-Max approximate basis, all exact rules and 25 of the approximate rules are displayed in Tables 4.3 and 4.4.

S.No. | Rule | Supp | Conf
1 | gill-attachment = f → veil-type = p | 0.97 | 1.0
2 | gill-spacing = c → veil-type = p | 0.83 | 1.0
3 | veil-color = w → veil-type = p | 0.97 | 1.0
4 | ring-number = o → veil-type = p | 0.92 | 1.0
5 | gill-attachment = f and veil-color = w → veil-type = p | 0.97 | 1.0
6 | gill-attachment = f and ring-number = o → veil-type = p | 0.89 | 1.0
7 | gill-spacing = c and veil-color = w → veil-type = p | 0.81 | 1.0
8 | gill-attachment = f and gill-spacing = c → veil-type = p and veil-color = w | 0.81 | 1.0
9 | veil-color = w and ring-number = o → gill-attachment = f and veil-type = p | 0.89 | 1.0

Table 4.3: Non-redundant exact rules extracted using the Min-Max exact basis for the Mushroom dataset.

S.No. | Association rule | Supp | Conf
1 | veil-type = p → gill-attachment = f | 0.97 | 0.97
2 | veil-type = p → gill-spacing = c | 0.83 | 0.83
3 | veil-type = p → veil-color = w | 0.97 | 0.97
4 | veil-type = p → ring-number = o | 0.92 | 0.92
5 | veil-type = p → gill-attachment = f and veil-color = w | 0.97 | 0.97
6 | veil-type = p → gill-attachment = f and ring-number = o | 0.89 | 0.89
7 | veil-type = p → gill-spacing = c and veil-color = w | 0.81 | 0.81
8 | veil-type = p → gill-attachment = f and gill-spacing = c and veil-color = w | 0.81 | 0.81
9 | veil-type = p → gill-attachment = f and veil-color = w and ring-number = o | 0.89 | 0.89
10 | gill-attachment = f → veil-type = p and veil-color = w | 0.97 | 0.99
11 | gill-attachment = f → veil-type = p and ring-number = o | 0.89 | 0.92
12 | gill-attachment = f → gill-spacing = c and veil-type = p and veil-color = w | 0.81 | 0.83
13 | gill-attachment = f → veil-type = p and veil-color = w and ring-number = o | 0.89 | 0.92
14 | gill-spacing = c → veil-type = p and veil-color = w | 0.81 | 0.97
15 | gill-spacing = c → gill-attachment = f and veil-type = p and veil-color = w | 0.81 | 0.96
16 | veil-color = w → gill-attachment = f and veil-type = p | 0.97 | 0.99
17 | veil-color = w → gill-spacing = c and veil-type = p | 0.81 | 0.83
18 | veil-color = w → gill-attachment = f and gill-spacing = c and veil-type = p | 0.81 | 0.83
19 | veil-color = w → gill-attachment = f and veil-type = p and ring-number = o | 0.89 | 0.91
20 | ring-number = o → gill-attachment = f and veil-type = p | 0.89 | 0.97
21 | ring-number = o → gill-attachment = f and veil-type = p and veil-color = w | 0.89 | 0.97
22 | gill-attachment = f and veil-color = w → gill-spacing = c and veil-type = p | 0.81 | 0.83
23 | gill-attachment = f and veil-color = w → veil-type = p and ring-number = o | 0.89 | 0.92
24 | gill-attachment = f and ring-number = o → veil-type = p and veil-color = w | 0.89 | 0.99
25 | gill-spacing = c and veil-color = w → gill-attachment = f and veil-type = p | 0.81 | 0.99

Table 4.4: Non-redundant approximate rules extracted using the Min-Max approximate basis for the Mushroom dataset.

However, under the redundancy definition above, some of the rules extracted from the Min-Max bases are still redundant.

For example, consider rules 1, 3 and 5 from Table 4.4, say R1, R3 and R5, which are

R1: veil-type = p → gill-attachment = f

R3: veil-type = p → veil-color = w

R5: veil-type = p → gill-attachment = f and veil-color = w

R5 can be obtained by applying the distributive law to R1 and R3; thus R5 is a derived rule, and its support and confidence are the same as those of R1 and R3. So R5 is a redundant rule.

Similarly, from Table 4.4:

Rule 6 is derived from rules 1 and 4; the confidence of rule 6 is 0.89, while rules 1 and 4 have confidences 0.97 and 0.92, respectively. Since the confidence of rule 6 is less than that of rules 1 and 4, rule 6 is redundant.

Rule 7 is derived from rules 2 and 3, and its confidence is less than that of rules 2 and 3, so rule 7 is redundant.

Rules 22 to 25 in Table 4.4 are redundant to rules 17, 11, 10 and 16, respectively.

Reliable bases

Corresponding to the two Min-Max bases, another two concise bases called Reliable bases are defined and proposed. Using the Reliable bases, more redundant rules can be eliminated.

Reliable Approximate Basis: Let C be the set of frequent closed itemsets and G be the set of minimal generators of the frequent closed itemsets in C. The Reliable approximate basis is:

Reliable Exact Basis: Let C be the set of frequent closed itemsets. For each frequent closed itemset c, let Gc be the set of minimal generators of c. The Reliable exact basis is:

Non-Redundant Association Rule Mining Algorithms

The generic representation that results from coupling the MinMaxExact basis with the MinMaxApproximate basis yields a more concise set of association rules that is non-redundant and lossless. Here, the algorithms for non-redundant association rule mining in single-level datasets are presented.

Algorithm 1: MinMax Exact Rule

Input: the set C of frequent closed itemsets and, for each c ∈ C, its set of minimal generators Gc

Output: set of non-redundant exact basis rules

exactRules := ∅

for all c ∈ C

for all g ∈ Gc

if g ⊂ c then

exactRules := exactRules ∪ {g → c \ g}

end

end

return exactRules

Algorithm 2: MinMax Approximate Rule

Input: the set C of frequent closed itemsets and the set G of their minimal generators

Output: set of non-redundant approximate basis rules

approxRules := ∅

for all c ∈ C

for all g ∈ G such that (γ∘τ)(g) ⊂ c

if ∄ c′ ∈ C and g′ ∈ G such that (c ⊆ (γ∘τ)(g′) and g′ ⊂ g) or (conf(g′ → c′) > conf(g → c))

then approxRules := approxRules ∪ {g → c \ g}

end

end

return approxRules
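A runnable Python sketch of the two bases is given below, assuming the frequent closed itemsets and their minimal generators have already been mined; the data layout is an assumption, and the extra redundancy-pruning condition of Algorithm 2 is omitted for brevity:

    # `closed` maps each frequent closed itemset (a frozenset of attribute=value items)
    # to (list of minimal generators, support).
    def minmax_exact(closed):
        """MinMax exact basis: g -> c \\ g for every minimal generator g with g a proper subset of c."""
        rules = set()
        for c, (gens, _supp) in closed.items():
            for g in gens:
                if g < c:                      # g is a proper subset of its closed itemset
                    rules.add((g, c - g))      # exact rule, confidence 1
        return rules

    def minmax_approx(closed):
        """MinMax approximate basis: g -> c \\ g where closure(g) is a proper subset of c
        (the extra pruning step of Algorithm 2 is not applied here)."""
        rules = set()
        for c, (_gens_c, supp_c) in closed.items():
            for cg, (gens, supp_cg) in closed.items():
                if cg < c:                     # closure(g) = cg is a proper subset of c
                    for g in gens:
                        conf = supp_c / supp_cg    # supp(g) equals the support of its closure
                        rules.add((g, c - g, conf))
        return rules

    # Two rows of Table 4.2 as a tiny illustration:
    closed = {
        frozenset({"gill-attachment=f", "veil-type=p"}):
            ([frozenset({"gill-attachment=f"})], 0.974),
        frozenset({"gill-attachment=f", "veil-type=p", "veil-color=w"}):
            ([frozenset({"gill-attachment=f", "veil-color=w"})], 0.973),
    }
    print(minmax_exact(closed))    # includes gill-attachment=f -> veil-type=p
    print(minmax_approx(closed))   # gill-attachment=f -> veil-type=p, veil-color=w (conf about 0.999)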

The above Min-Max approaches are applied to the Mushroom and Census-Income datasets to illustrate the proposed concept. The rules derived from the Census-Income dataset are given in the experiments section.

Experiments

All experiments were performed using the Clementine Workbench and XLMiner tools, on two datasets, Census-Income and Mushroom, obtained from the UCI Machine Learning Repository. The Mushroom dataset contains 8125 rows and 23 attributes, and the Census-Income dataset contains 32562 rows and 15 attributes. The proposed model is applied to these datasets to identify the number of frequent itemsets, the number of closed frequent itemsets and the number of non-redundant frequent itemsets.

S.No. | Association rule | Supp | Conf | CF
1 | Race=White and Capital_Gain=0 and Capital_Loss=0 → Native_Community=US | 0.74 | 0.92 | 0.03
2 | Class=<=50K and Capital_Loss=0 → Native_Community=US | 0.74 | 0.89 | -0.06
3 | Class=<=50K and Capital_Gain=0 → Native_Community=US | 0.73 | 0.89 | -0.06
4 | Class=<=50K and Capital_Gain=0 and Capital_Loss=0 → Native_Community=US | 0.70 | 0.89 | -0.07
5 | Native_Community=US and Capital_Gain=0 and Capital_Loss=0 → Race=White | 0.78 | 0.87 | 0.02
6 | Class=<=50K and Capital_Loss=0 → Race=White | 0.74 | 0.84 | -0.12
7 | Class=<=50K and Capital_Gain=0 → Race=White | 0.73 | 0.84 | -0.13
8 | Class=<=50K and Capital_Gain=0 and Capital_Loss=0 → Race=White | 0.70 | 0.84 | -0.13
9 | Native_Community=US and Capital_Gain=0 and Capital_Loss=0 → Class=<=50K | 0.78 | 0.81 | 0.06
10 | Race=White and Capital_Gain=0 and Capital_Loss=0 → Class=<=50K | 0.74 | 0.80 | 0.05
11 | Native_Community=US and Capital_Gain=0 → Class=<=50K | 0.82 | 0.79 | 0.04
12 | Race=White and Native_Community=US and Capital_Gain=0 → Class=<=50K | 0.72 | 0.77 | 0.02
13 | Race=White and Native_Community=US and Capital_Loss=0 → Class=<=50K | 0.75 | 0.75 | -0.03
14 | Race=White and Native_Community=US → Class=<=50K | 0.79 | 0.74 | -0.09
15 | Class=<=50K and Capital_Gain=0 and Capital_Loss=0 → WorkClass=Private | 0.70 | 0.72 | 0.03
16 | Class=<=50K and Capital_Gain=0 → WorkClass=Private | 0.73 | 0.72 | 0.03
17 | Class=<=50K and Capital_Loss=0 → WorkClass=Private | 0.74 | 0.72 | 0.03
18 | Class=<=50K → WorkClass=Private | 0.76 | 0.72 | 0.03
19 | Race=White and Capital_Gain=0 and Capital_Loss=0 → WorkClass=Private | 0.74 | 0.71 | 0.01
20 | Capital_Gain=0 and Capital_Loss=0 → WorkClass=Private | 0.87 | 0.71 | 0.01
21 | Race=White and Capital_Gain=0 → WorkClass=Private | 0.78 | 0.70 | 0.01

Table 4.5: Non-redundant rules extracted using Min-Max approximate rules for the Census-Income dataset.

The number of rules generated here is 21, whereas in the previous approach there were 38; 17 redundant rules are eliminated with the Min-Max and Reliable approaches. A summary is given in the following table.

Minconf | Census-Income FCI | Census-Income Non-Redundant | Mushroom FCI | Mushroom Non-Redundant
0.1 | 191 | 21 | 1709 | 367
0.5 | 32 | 10 | 331 | 72
0.9 | 4 | 3 | 56 | 21

Table 4.6: The non-redundant rules using the Min-Max approach.

Figure 4.1. The Non Redundant Rules using Min-Max approach for Mushroom & Census-Income Dataset
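The comparison shown in Figure 4.1 can be recreated from the counts in Table 4.6 with a small matplotlib sketch (values copied from the table above; matplotlib is assumed to be available):

    import matplotlib.pyplot as plt

    minconf = [0.1, 0.5, 0.9]
    mushroom_fci, mushroom_nr = [1709, 331, 56], [367, 72, 21]
    census_fci, census_nr = [191, 32, 4], [21, 10, 3]

    plt.plot(minconf, mushroom_fci, marker="o", label="Mushroom: FCI and rules")
    plt.plot(minconf, mushroom_nr, marker="o", label="Mushroom: non-redundant rules")
    plt.plot(minconf, census_fci, marker="s", label="Census-Income: FCI and rules")
    plt.plot(minconf, census_nr, marker="s", label="Census-Income: non-redundant rules")
    plt.xlabel("Minimum confidence")
    plt.ylabel("Number of rules")
    plt.legend()
    plt.show()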

Data analysis using Mushroom Dataset

In general, when the confidence value is low, the number of generated closed itemsets is large, and this number decreases as the confidence value increases. This is confirmed by the proposed approach, which also generated more closed itemsets at confidence values from 0.1 to 0.5.

When more rules are generated, there are more redundant rules. The figure above shows that at the 0.1 confidence level there are 1709 closed itemsets and rules, many of which are redundant; these are eliminated by the proposed work, which produced 367 non-redundant rules. This confirms that redundancy grows with the number of generated rules.

As the confidence values increase, the number of rules decreases and the redundancy decreases as well. From the figure, at confidence = 0.2 there are 1312 closed itemsets and rules and 321 non-redundant rules; similarly, at confidence = 0.5 there are 331 closed itemsets and rules and 71 non-redundant rules.

From confidence 0.1 to 0.5, the numbers of closed itemsets and rules are large and there is a huge number of redundant rules at these confidence values, which are successfully eliminated by the proposed work.

As the confidence values increase, the chance of redundancy decreases. This is confirmed for confidence values from 0.5 to 0.9: the numbers of closed itemsets and rules are still considerable, but very few redundant rules remain, so only a few redundant rules are generated at confidence = 0.9. When the confidence values decrease, the number of redundant rules increases linearly and the number of closed itemsets and rules increases exponentially.

Confidence = 0.5 acts as a critical value. Below confidence = 0.5, many more redundant rules and closed itemsets are generated; above confidence = 0.5, redundant rules are still generated but far fewer. Hence a confidence of 0.5 can act as a critical, or average, point of redundancy.

Data analysis using Census-Income Dataset

Sometimes a low confidence value yields fewer rules (compared with the Mushroom dataset). This holds for the Census-Income dataset: at confidence = 0.1, fewer rules are generated than for the Mushroom dataset, but redundant rules still exist at this confidence and are eliminated by the proposed work.

As the figure shows, when the confidence values increase, the numbers of closed itemsets and rules decrease.

From confidence values 0.1 to 0.5, the number of rules decreases linearly, but there is very little redundancy among these rules; hence little redundancy is removed between confidence values 0.1 and 0.5.

The numbers of closed itemsets and rules decrease from 0.5 to 0.9, but the redundancy among these rules stays almost constant for the Census-Income dataset, which is not the case for the Mushroom dataset. This shows that quality, lossless rules are generated at different confidence values.

Confidence = 0.5 again acts as a critical value. Below confidence = 0.5, more redundant rules and closed itemsets are generated; above confidence = 0.5, the number of redundant rules is roughly constant compared with below confidence = 0.5. Hence a confidence of 0.5 can act as a critical, or average, point of redundancy.

The analysis on the Census-Income dataset shows that redundancy does not behave uniformly across datasets and confidence values: some datasets (like Mushroom) generate a large number of redundant rules at different confidence values, while others (like Census-Income) generate very few, or a nearly constant number of, redundant rules at different confidence values.

Figure 4.2. Supp, Conf, CF values for Non Redundant Rules for Census - Income Dataset

From the above figure, the proposed work concludes the following.

When the confidence values decrease, with the present method the number of redundant rules decreases and the number of non-redundant rules gradually increases, yielding more quality, lossless rules.

From the CF values it can be noticed that some rules are independent, while others have positive or negative CF values. A positive CF value indicates that the antecedent supports the consequent, and a negative CF value indicates that the antecedent denies the consequent.

The support values of the non-redundant rules vary for different confidence values. The support is high when more non-redundant rules are generated.

So it can be concluded that when the confidence values decrease to a point, say confidence = 0.5, the number of non-redundant rules increases. Hence, for non-redundant rule generation the confidence measure is more suitable than the support measure. From confidence values 0.7 to 0.5, the presented work generated rules with positive CF, i.e. the antecedents of the generated rules support the consequents, which indicates that the discovered rules are quality, lossless non-redundant rules on single-level datasets.

Summary

A challenging problem in association rule mining is the redundancy present in the extracted association rules, which affects the quality of the association rules and of frequent pattern generation in non-taxonomy datasets. With the above approach, the proposed work can remove redundant frequent patterns using concept lattices. The certainty factor is used to decide whether the generated rules can be safely removed. The MinMax rules are then applied to efficiently remove redundancy and obtain quality frequent patterns and association rules in non-taxonomy datasets.

Note: Related to this chapter, a research paper has been published:

R. Vijaya Prakash, Dr. A. Govardhan and Prof. SSVN. Sarma, "Discovering Non-Redundant Association Rules using MinMax Approximate Rules", Indian Journal of Computer Science and Engineering (IJCSE), Vol. 3, No. 6, pp. 796, ISSN 0976-5166.