An Efficient Algorithm For Mining Association Rules Accounting Essay

Category: Accounting

Abstract- Identifying frequent itemsets is one of the most important issues faced by the knowledge discovery and data mining community. There have been a number of excellent algorithms developed for extracting frequent itemsets in very large databases. Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional or relational datasets. A problem with such a process is that the solution of interesting patterns has to be performed only on frequent itemsets. Pushing constraints in frequent itemsets mining can help pruning the search space. In this paper, an efficient algorithm is proposed to integrate confidence measure during the process of mining frequent itemsets, which generates confident frequent itemsets. Consequently, the suggested algorithm generates strong association rules from these confident frequent itemsets. This technique has been implemented and the experimental results show the usefulness and effectiveness of the proposed scheme.

Keywords- KDD; data mining; confident frequent itemsets; Apriori algorithm.

Introduction

The advances in data collection have generated an urgent need for techniques that can intelligently and automatically analyze and mine knowledge from huge amounts of data. The Knowledge Discovery in Databases (KDD) is the process to exploit the possibilities of extracting the knowledge implicit in the collected data. The field of KDD integrates techniques from artificial intelligence, mathematics and statistics for the discovery of interesting, previously unknown and potentially useful information from large datasets [16]. Data mining is a step of KDD in which patterns or models are extracted from data by using some automated techniques [15].

Association rule mining is receiving increasing attention. It has been mainly developed to identify the relationships strongly associated among itemsets that have high-frequency and strong-correlation. Association rule mining is an unsupervised learning because it extracts rules without any prior class information. Association rules enable us to detect the items that frequently occur together in an application [15]. The task of association rules mining usually performed into two steps [12]. The first step aims at finding all frequent itemsets that satisfy the minimum support (minsup) constraint with the frequent itemset property (any subset of a frequent itemset is frequent; if an itemset is not frequent, none of its supersets can be frequent) for efficiency reasons. The second step involves generating association rules that satisfy the minimum confidence (minconf) constraint from the frequent itemsets.

Fundamentally, all association rules meet a minsup threshold that defines a frequent itemset. If these association rules further meet a minconf threshold, then they are called strong association rules. Several algorithms have been proposed to address association rules mining [12], [13], [14]. Association rule mining can produce a huge amount of patterns that are most of the time not useful to the users. It is, hence, impossible for an expert to assess these patterns. This is the case with the well-known Apriori algorithm [12], [13]. One of the methods used to cope with such an amount of output depends on using constrained association rules mining [3], [6], [11], [17] that helps to reduce the number of uninteresting discovered rules.

Constrained itemsets mining is a hot research theme in data mining. It shows that constraint pushing may substantially improve the performance of frequent pattern mining. It has been observed that every often user wants to restrict the set of frequent itemsets to be discovered by adding extra constraints like Lift measure [4], and J measure [5] or based on some type of knowledge either before the mining (pre-processing) or after the mining (post-processing) [6], [7], [8], [9], [10]. The pre-processing approaches limit the potential for discovering 'surprising' information in the data. It is clear that additional constraints for itemsets can be verified in a post-processing step, after all itemsets exceeding a given minsup threshold have been discovered. The post-processing approaches, on the other hand, sacrifice processing speed for many rules are generated which are then pruned [17]. Nevertheless, such a solution cannot be considered satisfactory since users providing advanced selection criteria may expect that the data mining system will exploit them in the mining process to improve performance [17].

As mentioned above, the Apriori algorithm works into two steps. But in the proposed algorithm, the confidence measure is pushed into the mining of frequent itemsets. During the frequent itemsets generation, the proposed algorithm computes the confidence and prunes the items that have confidence less than the minconf threshold. Further iteration of the algorithm is performed only for the itemsets that have support and confidence higher than a user specified thresholds. These itemsets are called confident frequent itemsets. The method described in this paper pushes the confidence measure within the mining algorithm of frequent itemsets in order to eliminate search space that is uninteresting to the user or to emphasize the search space that is interesting to the user.

Roddick and Rice [2] investigate various ways to set thresholds that define interestingness in rules, such as content dependent and independent ones and links the interest in a pattern to its low probability to occur. The work suggested by Wang, He, Cheung and Chin[1] eliminate the problem of setting minsup and find directly association rules with a minconf threshold with the drawback that from some databases too many rules might be generated. Pei, Han and Lakshmanan [3] proposed a convertible constraint, which can be pushed deep into frequent pattern mining. Dense Miner [11] applies all of minconf and minimum improvement to constraint the search space. An incremental association rules mining algorithm that integrates shocking interestingness criterion during the process of building the model is proposed in [18].

The main challenge in this paper is how to integrate the constraint of confidence measure earlier in the mining procedure(push this constraint deeply into the frequent itemsets generation process) rather than using the simple approach of running a traditional algorithm then using a post-processing pass to filter the generated rules. This paper is organized as follows. In Section II we briefly recall the basic concepts of association rules. The proposed algorithm for discovering strong association rules is presented in Section III. The experimental results are shown in Section IV to evaluate the performance of the proposed algorithm. And, in Section V the conclusion and future work are described.

Association rule

An association rule is defined as follows:

Let I = {i1, …, in} be a set of items, and T = {t1, …, tm} a set of transactions, where each transaction ti consists of a subset of items in I. An association rule is then an implication of the form:

A → B, A ∈ I, B ∈ I, A ∩ B = Ø.

The support for an itemset is defined as the ratio of the total number of transactions which support this itemset to the total number of transactions in the database. An itemset A has support s in T if s% of the transactions in T contains A. An itemset A is frequent if its support is higher than the user specified minsup.

The confidence of rule A → B is the probability that when itemset A occurs in a transaction in T, itemset B also occurs in the same transaction. The rule A → B holds in T with confidence c if c% of transactions in T that contain A also contain B. An example of association rules is: 60% of transactions that contain coffee also contain sugar; 5% of all transactions contain both of these items. Here 60% is called the confidence of the rule, and 5% the support of the rule.

The problem of mining association rules is to generate all association rules that consist of frequent itemsets and the confidence greater than the user-specified minconf. The discovery of association rules is thus important in understanding the underlying relationships between a large numbers of possible combinations of items [15].

THE PROPOSED ALGORITHM

The problem of mining frequent itemsets plays an essential role in mining association rules, but it is not sufficient to mine all frequent itemsets. Instead, it is sufficient to mine the set of confident frequent itemsets. Recent studies show that constraint pushing may substantially improve the performance of frequent itemsets mining.

The proposed algorithm uses confidence measure as a constraint during the model building in order to discover association rules. Instead of using the confidence measure as a post-processing step as in the Apriori algorithm, this measure is pushed into the mining of frequent itemsets to form a constraint in order to discover only confident frequent itemsets. For every stage of frequent itemests generation, frequent sub- itemsests are generated from every frequent itemset at that stage. These frequent sub- itemsests are evaluated using confidence (conf) measure (use "(1)") and prune the frequent sub- itemsests that do not satisfy this measure resulting in a set of confident frequent itemsets in this stage.

where sup.count.frequent itemset and sup.count.frequent sub-itemset is the number of transactions containing frequent itemset and frequent sub-itemset respectively.

The proposed algorithm expands the current confident frequent itemsets to the next level frequent itemsets like a normal Apriori algorithm. This approach approves that only confident frequent itemsets are eligible to be candidate during the next iteration of frequent itemsets generation. The confident frequent itemsets in the proposed algorithm can substantially reduce the number of patterns generated in frequent itemset mining while preserving the complete information regarding the set of frequent itemsets. That is, from the set of confident frequent itemsets, we can easily derive the set of frequent itemsets and their support in further iterations.

Based on this, many of frequent itemsets will relate to unconfident frequent itemsets. If only confident frequent itemsets are extracted, the search space can be greatly reduced. As a result, the proposed algorithm uses the confident frequent itemsets to generate strong association rules. This can improve the quality of the extracted association rules and make them more interesting and easier to understand. Figure 1 shows the proposed algorithm.

create L1 = set of frequent(supported) itemsets of cardinality one

set k to 2

while (Lk−1 ≠Ø)

{

create Ck from Lk−1

prune all the itemsets in Ck that are not frequent, to create Lk

for every frequent itemset in Lk

{

generate all frequent sub-itemets

compute confidence for every frequent sub-itemets

if confidence for all frequent sub-itemets < minconf then

The result for the Voting dataset

minconf

minsup

Mind strong association rules

No.

0.81

0.51

duty_free_exports =n → crime =y

1

0.79

0.51

crime =y → duty_free_exports=n

2

0.79

0.50

duty_free_exports =n → regligious _groups_in_schools=y

3

0.77

0.50

regligious_groups_in_schools =y → duty_free_exports =n

4 delete that frequent itemset from Lk

}//for

increase k by 1

}// while

The set of all confident frequent itemsets is L1 ð-´ L2 ð-´ · · · ð-´ Lk. Use these confident frequent itemsets to generate strong association rules.

Figure 1. The pseudo-code of the proposed algorithm.

EXPERIMINTAL RESULTS

The proposed approach is implemented, and to evaluate the performance of the proposed algorithm, the algorithm is applied to some real-world datasets from the UCI datasets repository [19]. We evaluated the performance of the proposed algorithm and compared it with the Apriori algorithm which was implemented in a public domain tool called Weka: http://www.cs.waikato.ac.nz/ml/weka/ index.html. WEKA is a collection of machine learning algorithms for data mining tasks. We used the default parameters of the Apriori algorithm to make the comparison fair. The performance of the proposed algorithm on different datasets is demonstrated below:

Experiment 1

Mushroom dataset was used for this experiment. This dataset has 8124 examples, and 23 nominal attributes. Table I presents the final strong association rules discovered by the proposed algorithm with the following thresholds: minsup=0.82 and minconf=1.00.

The result for the Mushroom dataset

minconf

minsup

Mind strong association rules

No.

1.00

1.00

gill_spacing=f  ring_number=w → veil_color=p

1

1.00

0.82

gill_spacing=f  gill_size=c → veil_color =p

2

1.00

0.87

gill_color=b  gill_spacing=f → veil_color =p

3

1.00

0.88

gill_color=b  veil_color=p → ring_number =w

4

1.00

0.97

ring_type=o  ring_number=w → veil_color=p

5

1.00

0.97

ring_type=o → gill_spacing=f  veil_color=p

6

1.00

0.85

ring_type=o  gill_color=b → veil_color =p

7The Apriori algorithm with minsup=0.82 and minconf=1.00 would generate 11 strong association rules for the Mushroom dataset.

Experiment 2

Voting dataset was used for this experiment. This dataset has 435 examples, and 17 nominal attributes. From this dataset, the proposed scheme discovered 4 strong association rules with the following thresholds: minsup=0.50 and minconf=0.70, which are given in Table II.

The Apriori algorithm with minsup=0.50 and minconf=0.70 would generate 16 strong association rules from the Voting dataset.

Experiment 3

Zoo dataset was used for this experiment. This dataset has 101 examples, and 18 attributes. The attributes were nominal. Table III presents the final strong association rules discovered by the proposed algorithm with the following thresholds: minsup=0.65 and minconf=0.90.

The result for the Zoo dataset

minconf

minsup

Mind strong association rules

No.

0.97

0.66

breathes=1  backbone=1→ venomous =0

1

0.95

0.70

breathes=1  venomous=0 → fins=0

2

0.93

0.70

breathes=1  fins=0 → venomous=0

3

0.92

0.70

venomous=0  fins=0 → breathes=1

4

0.92

0.66

venomous=0  fins=0 → airbone=0

5

0.92

0.60

airbone=0  feathers=0 → venomous=0

6

0.94

0.66

venomous=0  airbone=0 → feathers=0

7

The Apriori algorithm with minsup=0.65 and minconf=0.90 would generate 24 strong association rules for the Zoo dataset.

Experiment 4

In this experiment Weather dataset is used. This dataset has 14 examples, and 5 nominal attributes. The proposed algorithm generates 5 strong association rules with the following thresholds: minsup=0.20 and minconf=0.70 as shown in Table IV.

The result for the weather dataset

minconf

minsup

Mind strong association rules

No.

1.00

0.29

outlook=overcast → play=yes

1

0.80

0.29

play=no → humidity=high

2

0.86

0.43

humidity=normal → play=yes

3

1.00

0.21

outlook=sunny  play=yes → humidity=high

4

1.00

0.21

tempreture=cool  play=yes → humidity=normal

5

In the same constraints the Apriori algorithm would generate 17 strong association rules form the Weather dataset.

The experimental results show that the proposed algorithm is powerful and outperforms rather than, substantially, the Apriori algorithm. The large number of rules generated by the Apriori algorithm makes manual inspection of the rules very difficult. Hence, automated assistance is needed. In the proposed algorithm the confidence measure is pushed in association rule mining to reduce the huge search space. So, the proposed algorithm would decrease the number of extracted association rules. Figure 2 depicts the comparative performance of the two algorithms.

Figure 2. Number of discovered rules by the proposed algorithm and the Apriori algorithm.

CONCLUSTION AND FUTURE WORK

The task of data mining is to produce interesting patterns and extract useful knowledge for human users from a database. Frequent itemsets mining is one of the most important areas of data mining. The proposed algorithm shows that integrating confidence measure during the process of mining frequent itemsets may substantially improve the performance of association rules mining by reducing the search space. The integration of confidence measure and frequent pattern growth mining into one unified framework, leads to further improvement of mining efficiency. The experimental results show the effectiveness of the proposed algorithm in reducing the number of discovered rules comparing with the Apriori algorithm.

One of the most important future research directions would be the automated discovery of interesting association rules from large datasets using multi-objective genetic algorithm.