imsl.data_mining.FrequentItemSets

class FrequentItemSets

Aggregate frequent itemsets and compute association rules.
Notes

Function frequent_itemsets() and the methods of class FrequentItemSets perform the Apriori algorithm for association rule discovery. Association rules are statements of the form “if X, then Y”, given with some measure of confidence.

1. Application Fields of Apriori
The main application for association rule discovery is market basket analysis, where X and Y are products or groups of products, and the occurrences are individual transactions, or “market baskets”. The results help sellers learn the relationships between the different products they sell, supporting better marketing decisions. Association rule discovery also has applications in other problem areas, such as text mining and bioinformatics. The Apriori algorithm ([1]) is one of the most popular algorithms for association rule discovery in transactional data sets.
2. Application to a Single Set of Transactions
The Apriori algorithm consists of two stages when applied to a single transaction set. In the first and most critical stage, which is executed via function frequent_itemsets(), the Apriori algorithm mines the transactions for frequent itemsets. An itemset is frequent if it appears in more than a minimum number of transactions. The number of transactions containing an itemset is known as its “support”, and the minimum support (as a percentage of transactions) is a control parameter of the algorithm. The algorithm begins by finding the frequent single itemsets. It then generates all two-item itemsets from the frequent single items and determines which among them are frequent. From the collection of frequent pairs, Apriori forms candidate three-item subsets and determines which are frequent, and so on. The algorithm stops when either a maximum itemset size is reached or none of the candidate itemsets are frequent. In this way, the Apriori algorithm exploits the apriori property: for an itemset to be frequent, all of its proper subsets must also be frequent. At each step, the problem is reduced to only the frequent subsets.

In the second stage, executed via method
association_rules(), the algorithm generates association rules. These are of the form \(X \Rightarrow Y\) (read, “if X, then Y”), where X and Y are disjoint frequent itemsets. The confidence measure associated with the rule is defined as the proportion of transactions containing X that also contain Y. Denote the support of X (the number of transactions containing X) as \(S_X\), and the support of \(Z=X \cup Y\) as \(S_Z\). The confidence of the rule \(X \Rightarrow Y\) is the ratio \(S_Z/S_X\). Note that the confidence ratio is the conditional probability

\[P[Y|X] =\frac{P[XY]}{P[X]}\]

where \(P[XY]\) denotes the probability of both X and Y occurring. The probability of an itemset X is estimated by \(S_X/N\), where N is the total number of transactions.
Another measure of the strength of the association is known as the lift, which is the ratio \((S_ZN)/(S_XS_Y)\). Lift values close to 1.0 suggest that X and Y are independent and occur together only by chance, while large lift values indicate a strong association. A minimum confidence threshold and a lift threshold can be specified.
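The two stages can be illustrated with a short pure-Python sketch. This is not the library implementation; the function names below are hypothetical and the data set is a toy example, but the level-wise search and the confidence/lift formulas match the description above.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=4):
    """Level-wise Apriori search; returns {itemset: support count}."""
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent and k <= max_size:
        # Candidates of size k: unions of frequent (k-1)-itemsets, kept
        # only if every (k-1)-subset is frequent (the apriori property).
        candidates = set()
        for a, b in combinations(frequent, 2):
            u = a | b
            if len(u) == k and all(frozenset(s) in frequent
                                   for s in combinations(u, k - 1)):
                candidates.add(u)
        # Count each surviving candidate against the transactions.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            items = set(t)
            for c in candidates:
                if c <= items:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

def confidence_and_lift(s_x, s_y, s_z, n):
    """conf = S_Z / S_X;  lift = (S_Z * N) / (S_X * S_Y)."""
    return s_z / s_x, (s_z * n) / (s_x * s_y)

transactions = [[1, 2, 3], [1, 2], [1, 3], [2, 3], [1, 2, 3]]
support = frequent_itemsets(transactions, min_support=3)
# Rule {1} => {2}: S_X = 4, S_Y = 4, S_Z = supp({1, 2}) = 3, N = 5.
conf, lift = confidence_and_lift(4, 4, support[frozenset({1, 2})], 5)
```

With five transactions, the rule {1} ⇒ {2} has confidence 3/4 = 0.75 and lift (3·5)/(4·4) ≈ 0.94, so it would fail typical confidence and lift thresholds despite both itemsets being frequent.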
3. Application to Blocks of Transactions (Aggregation)
Since the Apriori algorithm can operate on subsets of data, it can be used with distributed data or data sets larger than physical memory. It also lends itself to parallel computing environments, where nodes can be programmed to calculate intermediate results on separate blocks simultaneously.
For each data set or block of transactions, call function
frequent_itemsets()
to obtain the frequent itemsets for each block. The same parameter settings, such as the minimum support percentage, must be used in each call.

Then, call method
union()
to obtain the union of all frequent itemsets. The resulting sets serve as the “candidate” itemsets for the global set of transactions.

An itemset which is frequent in one transaction set may or may not be frequent in the larger collection. To find the itemsets that are frequent over the entire set of transactions, use method
frequency()
to perform another pass through the individual blocks, this time counting the occurrences of each itemset in each transaction set. This step can be done in parallel.

The next step is to sum the individual counts before filtering for the frequent itemsets. This is achieved by adding the frequency vectors returned by method
frequency()
. After this step, the frequencies of each candidate itemset over all of the transactions are known.

In the final step, call method
update()
to determine all itemsets that meet the threshold to be considered “frequent”. This step completes the aggregation of the transactional data sets.

Once the frequent itemsets are known, the strong association rules can be found using method
association_rules()
. The aggregation method is due to [3] and is also summarized and compared with other approaches in [2].
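The counting-and-summing steps of this partition approach can be sketched in plain Python. This is an illustration only, not the library API: the helper function and the toy data below are hypothetical, and the candidate itemsets are hand-picked rather than produced by a real first-pass mining step.

```python
def count_itemsets(candidates, transactions):
    """Count how many transactions contain each candidate itemset."""
    freq = {c: 0 for c in candidates}
    for t in transactions:
        items = set(t)
        for c in candidates:
            if c <= items:
                freq[c] += 1
    return freq

# Two blocks of transactions (toy data).
block1 = [[1, 2, 3], [1, 2], [2, 3]]
block2 = [[1, 3], [1, 2, 3], [2]]

# Candidates: the union of itemsets found frequent in either block.
candidates = [frozenset({1}), frozenset({2}), frozenset({1, 2})]

# Count per block -- these passes are independent and could run in
# parallel -- then sum the frequency vectors.
f1 = count_itemsets(candidates, block1)
f2 = count_itemsets(candidates, block2)
total = {c: f1[c] + f2[c] for c in candidates}

# Filter by the global minimum support (e.g. 30% of 6 transactions -> 2).
min_support = 2
frequent = {c: n for c, n in total.items() if n >= min_support}
```

An itemset can be frequent in one block but infrequent overall (or vice versa), which is why the second counting pass over all blocks is required before filtering.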
References

[1] Agrawal, R., and Srikant, R. (1994), Fast Algorithms for Mining Association Rules, Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, Chile, August 29 - September 1, 1994.

[2] Rajaraman, A., and Ullman, J. D. (2011), Mining of Massive Datasets, Cambridge University Press, Cambridge, UK.

[3] Savasere, A., Omiecinski, E., and Navathe, S. (1995), An Efficient Algorithm for Mining Association Rules in Large Databases, Proceedings of the 21st International Conference on Very Large Data Bases, Zurich, Switzerland, 1995.

Examples
Example 1:
This example applies the Apriori algorithm to a data set consisting of 50 transactions involving five different product IDs. The minimum support percentage is set to 0.30, giving a minimum required support of 15 transactions. The frequent itemsets and strong association rules are printed.
>>> import imsl.data_mining.apriori as apriori
>>> max_num_products = 5
>>> max_set_size = 10
>>> min_pct_support = 0.30
>>> confidence = 0.8
>>> lift = 2.0
>>> x = [[1, 3], [1, 2], [1, 1], [2, 1], [2, 2], [2, 4], [2, 5],
...      [3, 3], [4, 4], [4, 3], [4, 5], [4, 1], [5, 5], [6, 1],
...      [6, 2], [6, 3], [7, 5], [7, 3], [7, 2], [8, 3], [8, 4],
...      [8, 1], [8, 5], [8, 2], [9, 4], [10, 5], [10, 3], [11, 2],
...      [11, 3], [12, 4], [13, 4], [14, 2], [14, 3], [14, 1], [15, 3],
...      [15, 5], [15, 1], [16, 2], [17, 3], [17, 5], [17, 1], [18, 5],
...      [18, 1], [18, 2], [18, 3], [19, 2], [20, 4], [21, 1], [21, 4],
...      [21, 2], [21, 5], [22, 5], [22, 4], [23, 2], [23, 5], [23, 3],
...      [23, 1], [23, 4], [24, 3], [24, 1], [24, 5], [25, 3], [25, 5],
...      [26, 1], [26, 4], [26, 2], [26, 3], [27, 2], [27, 3], [27, 1],
...      [27, 5], [28, 5], [28, 3], [28, 4], [28, 1], [28, 2], [29, 4],
...      [29, 5], [29, 2], [30, 2], [30, 4], [30, 3], [31, 2], [32, 5],
...      [32, 1], [32, 4], [33, 4], [33, 1], [33, 5], [33, 3], [33, 2],
...      [34, 3], [35, 5], [35, 3], [36, 3], [36, 5], [36, 4], [36, 1],
...      [36, 2], [37, 1], [37, 3], [37, 2], [38, 4], [38, 2], [38, 3],
...      [39, 3], [39, 2], [39, 1], [40, 2], [40, 1], [41, 3], [41, 5],
...      [41, 1], [41, 4], [41, 2], [42, 5], [42, 1], [42, 4], [43, 3],
...      [43, 2], [43, 4], [44, 4], [44, 5], [44, 2], [44, 3], [44, 1],
...      [45, 4], [45, 5], [45, 3], [45, 2], [45, 1], [46, 2], [46, 4],
...      [46, 5], [46, 3], [46, 1], [47, 4], [47, 5], [48, 2], [49, 1],
...      [49, 4], [49, 3], [50, 3], [50, 4]]
>>> itemsets = apriori.frequent_itemsets(max_num_products, x,
...                                      max_set_size=max_set_size,
...                                      min_pct_support=min_pct_support)
>>> # Print frequent itemsets
>>> print(itemsets)
Frequent Itemsets (Out of 50 Transactions):
 Size  Support  Itemset
   1      27    { 1 }
   1      30    { 2 }
   1      33    { 3 }
   1      27    { 4 }
   1      27    { 5 }
   2      20    { 1 2 }
   2      22    { 1 3 }
   2      16    { 1 4 }
   2      19    { 1 5 }
   2      22    { 2 3 }
   2      16    { 2 4 }
   2      15    { 2 5 }
   2      16    { 3 4 }
   2      19    { 3 5 }
   2      17    { 4 5 }
   3      17    { 1 2 3 }
   3      15    { 1 3 5 }
>>> # Print association rules
>>> assoc_rules = itemsets.association_rules(confidence, lift)
>>> for rule in assoc_rules:
...     print(rule)
Association Rule (itemset X implies itemset Y):
X = {1} ==> Y = {3}
  supp(X)=27, supp(Y)=33, supp(X U Y)=22
  conf=0.81, lift=1.23
Association Rule (itemset X implies itemset Y):
X = {1 2} ==> Y = {3}
  supp(X)=20, supp(Y)=33, supp(X U Y)=17
  conf=0.85, lift=1.29
Example 2:
This example shows how to apply the Apriori algorithm to separate blocks of data and combine results. The data are two separate blocks of 50 transactions involving five different product IDs. The minimum support percentage is set to 0.30, providing a minimum required support of 30 transactions overall. The frequent itemsets and strong association rules are printed.
>>> import imsl.data_mining.apriori as apriori
>>> max_num_products = 5
>>> max_set_size = 4
>>> min_pct_support = 0.30
>>> confidence = 0.8
>>> lift = 2.0
>>> x1 = [[1, 3], [1, 2], [1, 1], [2, 1], [2, 2], [2, 4], [2, 5],
...       [3, 3], [4, 4], [4, 3], [4, 5], [4, 1], [5, 5], [6, 1],
...       [6, 2], [6, 3], [7, 5], [7, 3], [7, 2], [8, 3], [8, 4],
...       [8, 1], [8, 5], [8, 2], [9, 4], [10, 5], [10, 3], [11, 2],
...       [11, 3], [12, 4], [13, 4], [14, 2], [14, 3], [14, 1], [15, 3],
...       [15, 5], [15, 1], [16, 2], [17, 3], [17, 5], [17, 1], [18, 5],
...       [18, 1], [18, 2], [18, 3], [19, 2], [20, 4], [21, 1], [21, 4],
...       [21, 2], [21, 5], [22, 5], [22, 4], [23, 2], [23, 5], [23, 3],
...       [23, 1], [23, 4], [24, 3], [24, 1], [24, 5], [25, 3], [25, 5],
...       [26, 1], [26, 4], [26, 2], [26, 3], [27, 2], [27, 3], [27, 1],
...       [27, 5], [28, 5], [28, 3], [28, 4], [28, 1], [28, 2], [29, 4],
...       [29, 5], [29, 2], [30, 2], [30, 4], [30, 3], [31, 2], [32, 5],
...       [32, 1], [32, 4], [33, 4], [33, 1], [33, 5], [33, 3], [33, 2],
...       [34, 3], [35, 5], [35, 3], [36, 3], [36, 5], [36, 4], [36, 1],
...       [36, 2], [37, 1], [37, 3], [37, 2], [38, 4], [38, 2], [38, 3],
...       [39, 3], [39, 2], [39, 1], [40, 2], [40, 1], [41, 3], [41, 5],
...       [41, 1], [41, 4], [41, 2], [42, 5], [42, 1], [42, 4], [43, 3],
...       [43, 2], [43, 4], [44, 4], [44, 5], [44, 2], [44, 3], [44, 1],
...       [45, 4], [45, 5], [45, 3], [45, 2], [45, 1], [46, 2], [46, 4],
...       [46, 5], [46, 3], [46, 1], [47, 4], [47, 5], [48, 2], [49, 1],
...       [49, 4], [49, 3], [50, 3], [50, 4]]
>>> x2 = [[1, 2], [1, 1], [1, 4], [1, 3], [2, 2], [2, 5], [2, 3],
...       [2, 1], [2, 4], [3, 5], [3, 4], [4, 2], [5, 4], [5, 2],
...       [5, 3], [5, 5], [6, 3], [6, 5], [7, 2], [7, 5], [7, 4],
...       [7, 1], [7, 3], [8, 2], [9, 2], [9, 4], [10, 4], [10, 2],
...       [11, 4], [11, 1], [12, 3], [12, 1], [12, 5], [12, 2], [13, 2],
...       [14, 3], [14, 4], [14, 2], [15, 2], [16, 5], [16, 2], [16, 4],
...       [17, 1], [18, 2], [18, 3], [18, 4], [19, 3], [19, 1], [19, 2],
...       [19, 4], [20, 5], [20, 1], [21, 5], [21, 4], [21, 1], [21, 3],
...       [22, 4], [22, 1], [22, 5], [23, 1], [23, 2], [24, 4], [25, 4],
...       [25, 3], [26, 5], [26, 2], [26, 3], [26, 4], [26, 1], [27, 2],
...       [27, 1], [27, 5], [27, 3], [28, 1], [28, 2], [28, 3], [28, 4],
...       [29, 5], [29, 2], [29, 1], [30, 5], [30, 3], [30, 2], [30, 4],
...       [31, 4], [31, 1], [32, 1], [32, 2], [32, 3], [32, 4], [32, 5],
...       [33, 3], [33, 2], [33, 4], [33, 5], [33, 1], [34, 3], [34, 4],
...       [34, 5], [34, 2], [35, 2], [35, 3], [36, 3], [36, 5], [36, 4],
...       [37, 1], [37, 4], [37, 2], [37, 3], [37, 5], [38, 5], [38, 3],
...       [38, 1], [38, 2], [39, 2], [39, 5], [40, 4], [40, 2], [41, 4],
...       [42, 4], [43, 5], [43, 4], [44, 5], [44, 4], [44, 3], [44, 2],
...       [44, 1], [45, 1], [45, 2], [45, 3], [45, 5], [45, 4], [46, 3],
...       [46, 4], [47, 4], [47, 5], [47, 2], [47, 3], [48, 5], [48, 3],
...       [48, 2], [48, 1], [48, 4], [49, 4], [49, 5], [50, 4], [50, 1]]
>>> # Find frequent itemsets in x1 and x2.
>>> itemsets1 = apriori.frequent_itemsets(max_num_products, x1,
...                                       max_set_size=max_set_size,
...                                       min_pct_support=min_pct_support)
>>> itemsets2 = apriori.frequent_itemsets(max_num_products, x2,
...                                       max_set_size=max_set_size,
...                                       min_pct_support=min_pct_support)
>>> # Take the union of itemsets1 and itemsets2.
>>> cand_itemsets = itemsets1.union(itemsets2)
>>> # Count the frequencies of each candidate itemset in each data set.
>>> freq1 = cand_itemsets.frequency(x1)
>>> freq2 = cand_itemsets.frequency(x2)
>>> # Sum the frequencies.
>>> freq = freq1 + freq2
>>> # Determine which of the candidate itemsets are frequent.
>>> itemsets = cand_itemsets.update(freq)
>>> # Print the aggregated frequent itemsets.
>>> print(itemsets)
Frequent Itemsets (Out of 100 Transactions):
 Size  Support  Itemset
   1      51    { 1 }
   1      63    { 2 }
   1      60    { 3 }
   1      63    { 4 }
   1      54    { 5 }
   2      37    { 1 2 }
   2      38    { 1 3 }
   2      33    { 1 4 }
   2      35    { 1 5 }
   2      44    { 2 3 }
   2      38    { 2 4 }
   2      34    { 2 5 }
   2      38    { 3 4 }
   2      38    { 3 5 }
   2      37    { 4 5 }
   3      32    { 1 2 3 }
   3      31    { 2 3 4 }
>>> # Generate and print the strong association rules.
>>> assoc_rules = itemsets.association_rules(confidence, lift)
>>> for rule in assoc_rules:
...     print(rule)
Association Rule (itemset X implies itemset Y):
X = {1 2} ==> Y = {3}
  supp(X)=37, supp(Y)=60, supp(X U Y)=32
  conf=0.86, lift=1.44
Association Rule (itemset X implies itemset Y):
X = {1 3} ==> Y = {2}
  supp(X)=38, supp(Y)=63, supp(X U Y)=32
  conf=0.84, lift=1.34
Association Rule (itemset X implies itemset Y):
X = {2 4} ==> Y = {3}
  supp(X)=38, supp(Y)=60, supp(X U Y)=31
  conf=0.82, lift=1.36
Association Rule (itemset X implies itemset Y):
X = {3 4} ==> Y = {2}
  supp(X)=38, supp(Y)=63, supp(X U Y)=31
  conf=0.82, lift=1.29
Methods

association_rules(confidence, lift)
    Compute strong association rules among itemsets.
frequency(x)
    Count the frequency of each itemset in a transaction data set.
union(*itemsets)
    Compute the union of itemsets from different transaction sets.
update(frequencies)
    Update the set of frequent itemsets in the candidate itemsets.

Attributes

item_sets
    Return the itemsets.
max_num_products
    Return the maximum number of products.
max_set_size
    Return the maximum itemset size.
min_pct_support
    Return the minimum percentage of transaction support.
n_trans
    Return the number of transactions used to construct the itemsets.
support
    Return the support for each itemset.