aggrApriori

Computes the frequent itemsets in a transaction set using aggregation.

Synopsis

aggrApriori ( )

Required Arguments

One optional argument must be supplied to indicate which calculation step is to be performed.

Optional Argument Description
frequentItemsets Compute the frequent itemsets.
union Compute the union of two itemsets.
count Count the occurrence of each itemset in the transaction data set.
sum Add the counts of two itemsets.
updateFreqItemsets Update the set of frequent itemsets.
associationRules Compute the strong association rules.

Optional Arguments

frequentItemsets, float x, int maxNumProducts, int maxSetSize, float minPctSupport, structure itemsets (Input/Output)
Computes the frequent itemsets in a transaction set.
float x[[]] (Input)
Array of size n × 2, each row of which represents a transaction id and item id pair. The algorithm assumes that an individual transaction is complete within a single dataset. That is, there is no matching of transaction ids between different data sets.
int maxNumProducts (Input)
Maximum number of items or products that may be present in the aggregation of all transactions.
int maxSetSize (Input)
Maximum size of an itemset. Only frequent itemsets with maxSetSize or fewer items are considered in the analysis.
float minPctSupport (Input)
Minimum percentage of transactions in which an item or itemset must be present to be considered frequent. minPctSupport must be in the interval [0,1].
structure itemsets (Output)
A data structure containing the frequent itemsets in the transaction set x. If no value can be computed, then None is returned. To release this space, use freeAprioriItemsets.

or

union, structure itemsets1, structure itemsets2 (Input/Output)
Computes the union of two itemsets.
structure itemsets1 (Input)
An structure data structure containing the frequent itemsets for the union.
structure itemsets2 (Input)
An structure data structure containing the frequent itemsets for the union.

or

count, structure candItemsets, float x (Input/Output)
Counts the frequency of each itemset in a transaction data set.
structure candItemsets (Input)
Candidate itemsets and the corresponding number of transactions.
float x[] (Input)
Array of size n × 2, each row of which represents a transaction id and item id pair. The algorithm assumes that an individual transaction is complete within a single dataset. That is, there are no matching of transaction ids between different data sets.

or

sum, int prevFreq1,*int* prevFreq2 (Input)
Sum up the itemset frequencies in prevFreq1 and prevFreq2 and return in freq.
int prevFreq1[] (Input)
Array of length nItemsets containing the itemset frequencies counted over one or more blocks of transaction data.
int prevFreq2[] (Input)
Array of length nItemsets containing the itemset frequencies counted over a second set of blocks of transaction data.

or

updateFreqItemsets, structure candItemsets, int freq (Input/Output)
Updates the set of frequent items.
structure candItemsets (Input)
Candidate itemsets and the corresponding number of transactions.
int freq[] (Input)
Array of length nItemsets containing the frequencies for each itemset in candItemsets.

or

associationRules, structure itemsets, float confidence, float lift (Input/Output)
Computes the strong association rules among itemsets.
structure itemsets (Input)
A data structure containing the itemsets.
float confidence (Input)
The minimum confidence used to determine the strong association rules. confidence must be in the interval [0,1]. lift is the other criterion that determines whether an association is “strong.” If either criterion, confidence or lift is exceeded, the association rule is considered “strong.”
float lift (Input)
The minimum lift used to determine the strong association rules. lift must be non-negative. confidence is the other criterion which determines whether an association is “strong.” If either criterion, confidence or lift is exceeded, the association rule will be considered “strong.”

Description

The function aggrApriori performs the Apriori algorithm for association rule discovery. Association rules are statements of the form, “if X, then Y”, given with some measure of confidence. The main application for association rule discovery is market basket analysis, where X and Y are products or groups of products, and the occurrences are individual transactions, or “market baskets.” The results help sellers learn relationships between the different products they sell, supporting better marketing decisions. There are other applications for association rule discovery, such as the problem areas of text mining and bioinformatics. The Apriori algorithm (Agrawal and Srikant, 1994) is one of the most popular algorithms for association rule discovery in transactional datasets. For a full description of the Apriori algorithm, see apriori.

The interface to the function aggrApriori is designed to complete the analysis over a series of steps, with each step requiring a call to the function. With this design, Apriori can be applied to separate blocks of transactions. For each dataset or block, call aggrApriori with frequentItemsets to obtain the frequent itemsets from each block.The same parameter settings, such as minimum support percentage, must be used in each separate call. Then, call aggrApriori with the keyword union sequentially to obtain the union of all the frequent itemsets. The resulting set serves as the “candidate” itemsets for the global set of transactions.

An itemset which is frequent in one transaction set may or may not be frequent in the larger collection. To find the itemsets that are frequent over the entire set of transactions, aggrApriori performs another pass through the individual blocks, this time counting the occurrences of each of the itemsets in each of the transaction sets. This step can be done in parallel, using keyword count. The next step is then to sum up the individual counts before filtering for the frequent itemsets. This is achieved with the keyword sum, applied successively to pairs of previous counts. After this step, the frequencies of each itemset over all of the transactions are known and it remains to be seen if any meet the threshold to be considered “frequent”. The final step in determining the frequent itemsets is updateFreqItemsets. Once the frequent itemsets are known, the strong association rules can be found using the step, associationRules , although this is not a special step in the aggregation. The method is due to Savasere, Omiecinski, and Navathe (1995) and is also summarized and compared with other approaches in Rajaraman and Ullman (2011).

Since aggrApriori can operate on subsets of data, it can be used when physical memory cannot hold the entire data set. Additionally, this design may be useful in parallel computing environments where nodes can be programmed to calculate intermediate results in parallel.

Example

This example shows how to apply Apriori to separate blocks of data and combine results. The data are two separate blocks of 50 transactions involving five different product IDs. The minimum support percentage is set to 0.30, providing a minimum required support of 30 transactions overall.

from numpy import *
from pyimsl.stat.aggrApriori import aggrApriori
from pyimsl.stat.writeAprioriItemsets import writeAprioriItemsets
from pyimsl.stat.writeAssociationRules import writeAssociationRules

x1 = array([[1, 3], [1, 2], [1, 1], [2, 1], [2, 2], [2, 4], [2, 5],
            [3, 3], [4, 4], [4, 3], [4, 5], [4, 1], [5, 5], [6, 1],
            [6, 2], [6, 3], [7, 5], [7, 3], [7, 2], [8, 3], [8, 4],
            [8, 1], [8, 5], [8, 2], [9, 4], [10, 5], [10, 3], [11, 2],
            [11, 3], [12, 4], [13, 4], [14, 2], [14, 3], [14, 1], [15, 3],
            [15, 5], [15, 1], [16, 2], [17, 3], [17, 5], [17, 1], [18, 5],
            [18, 1], [18, 2], [18, 3], [19, 2], [20, 4], [21, 1], [21, 4],
            [21, 2], [21, 5], [22, 5], [22, 4], [23, 2], [23, 5], [23, 3],
            [23, 1], [23, 4], [24, 3], [24, 1], [24, 5], [25, 3], [25, 5],
            [26, 1], [26, 4], [26, 2], [26, 3], [27, 2], [27, 3], [27, 1],
            [27, 5], [28, 5], [28, 3], [28, 4], [28, 1], [28, 2], [29, 4],
            [29, 5], [29, 2], [30, 2], [30, 4], [30, 3], [31, 2], [32, 5],
            [32, 1], [32, 4], [33, 4], [33, 1], [33, 5], [33, 3], [33, 2],
            [34, 3], [35, 5], [35, 3], [36, 3], [36, 5], [36, 4], [36, 1],
            [36, 2], [37, 1], [37, 3], [37, 2], [38, 4], [38, 2], [38, 3],
            [39, 3], [39, 2], [39, 1], [40, 2], [40, 1], [41, 3], [41, 5],
            [41, 1], [41, 4], [41, 2], [42, 5], [42, 1], [42, 4], [43, 3],
            [43, 2], [43, 4], [44, 4], [44, 5], [44, 2], [44, 3], [44, 1],
            [45, 4], [45, 5], [45, 3], [45, 2], [45, 1], [46, 2], [46, 4],
            [46, 5], [46, 3], [46, 1], [47, 4], [47, 5], [48, 2], [49, 1],
            [49, 4], [49, 3], [50, 3], [50, 4]])
x2 = array([[1, 2], [1, 1], [1, 4], [1, 3], [2, 2], [2, 5], [2, 3],
            [2, 1], [2, 4], [3, 5], [3, 4], [4, 2], [5, 4], [5, 2],
            [5, 3], [5, 5], [6, 3], [6, 5], [7, 2], [7, 5], [7, 4],
            [7, 1], [7, 3], [8, 2], [9, 2], [9, 4], [10, 4], [10, 2],
            [11, 4], [11, 1], [12, 3], [12, 1], [12, 5], [12, 2], [13, 2],
            [14, 3], [14, 4], [14, 2], [15, 2], [16, 5], [16, 2], [16, 4],
            [17, 1], [18, 2], [18, 3], [18, 4], [19, 3], [19, 1], [19, 2],
            [19, 4], [20, 5], [20, 1], [21, 5], [21, 4], [21, 1], [21, 3],
            [22, 4], [22, 1], [22, 5], [23, 1], [23, 2], [24, 4], [25, 4],
            [25, 3], [26, 5], [26, 2], [26, 3], [26, 4], [26, 1], [27, 2],
            [27, 1], [27, 5], [27, 3], [28, 1], [28, 2], [28, 3], [28, 4],
            [29, 5], [29, 2], [29, 1], [30, 5], [30, 3], [30, 2], [30, 4],
            [31, 4], [31, 1], [32, 1], [32, 2], [32, 3], [32, 4], [32, 5],
            [33, 3], [33, 2], [33, 4], [33, 5], [33, 1], [34, 3], [34, 4],
            [34, 5], [34, 2], [35, 2], [35, 3], [36, 3], [36, 5], [36, 4],
            [37, 1], [37, 4], [37, 2], [37, 3], [37, 5], [38, 5], [38, 3],
            [38, 1], [38, 2], [39, 2], [39, 5], [40, 4], [40, 2], [41, 4],
            [42, 4], [43, 5], [43, 4], [44, 5], [44, 4], [44, 3], [44, 2],
            [44, 1], [45, 1], [45, 2], [45, 3], [45, 5], [45, 4], [46, 3],
            [46, 4], [47, 4], [47, 5], [47, 2], [47, 3], [48, 5], [48, 3],
            [48, 2], [48, 1], [48, 4], [49, 4], [49, 5], [50, 4], [50, 1]])

maxNumProducts = 5
maxSetSize = 4
minPctSupport = 0.30
confidence = 0.8
lift = 2.0

# Find frequent itemsets in x1 and x2.
frequentItemsets1 = {"x": x1, "maxNumProducts": maxNumProducts,
                     "maxSetSize": maxSetSize, "minPctSupport": minPctSupport}
aggrApriori(frequentItemsets=frequentItemsets1)
itemsets1 = frequentItemsets1["itemsets"]

frequentItemsets2 = {"x": x2, "maxNumProducts": maxNumProducts,
                     "maxSetSize": maxSetSize, "minPctSupport": minPctSupport}
aggrApriori(frequentItemsets=frequentItemsets2)
itemsets2 = frequentItemsets2["itemsets"]

# Take the union of itemsets1 and itemsets2.
union = {"itemsets1": itemsets1, "itemsets2": itemsets2}
aggrApriori(union=union)
candItemsets = union["candItemsets"]

# Count the frequencies of each candidate itemset in
# each of the data sets
count1 = {"candItemsets": candItemsets, "x": x1}
aggrApriori(count=count1)
prevFreq1 = count1["freq"]

count2 = {"candItemsets": candItemsets, "x": x2}
aggrApriori(count=count2)
prevFreq2 = count2["freq"]

# Sum the frequencies.
sum = {"prevFreq1": prevFreq1, "prevFreq2": prevFreq2}
aggrApriori(sum=sum)
freq = sum["freq"]

# Determine which of the candidate itemsets are frequent.
updateFreqItemsets = {"candItemsets": candItemsets, "freq": freq}
aggrApriori(updateFreqItemsets=updateFreqItemsets)
itemsets = updateFreqItemsets["itemsets"]

# Generate the strong association rules.
associationRules = {"itemsets": itemsets,
                    "confidence": confidence, "lift": lift}
aggrApriori(associationRules=associationRules)
assocRules = associationRules["assocRules"]

writeAprioriItemsets(itemsets)
writeAssociationRules(assocRules)

Output

Frequent Itemsets (Out of 100  Transactions):
Size   Support  Itemset
  1        51   { 1 }
  1        63   { 2 }
  1        60   { 3 }
  1        63   { 4 }
  1        54   { 5 }
  2        37   { 1  2 }
  2        38   { 1  3 }
  2        33   { 1  4 }
  2        35   { 1  5 }
  2        44   { 2  3 }
  2        38   { 2  4 }
  2        34   { 2  5 }
  2        38   { 3  4 }
  2        38   { 3  5 }
  2        37   { 4  5 }
  3        32   { 1  2  3 }
  3        31   { 2  3  4 }

Association Rules (itemset X implies itemset Y):
X = {1 2} ==> Y = {3}
  supp(X)=37, supp(Y)=60, supp(X U Y)=32
  conf= 0.86, lift=1.44
X = {1 3} ==> Y = {2}
  supp(X)=38, supp(Y)=63, supp(X U Y)=32
  conf= 0.84, lift=1.34
X = {2 4} ==> Y = {3}
  supp(X)=38, supp(Y)=60, supp(X U Y)=31
  conf= 0.82, lift=1.36
X = {3 4} ==> Y = {2}
  supp(X)=38, supp(Y)=63, supp(X U Y)=31
  conf= 0.82, lift=1.29

Warning Errors

IMSLS_MIN_SUPPORT_NOT_MET No items met minimum support of #.

Fatal Errors

IMSLS_NEED_IARG_GE "name" = #. "name" must be greater than or equal to #.
IMSLS_NEED_IARG_GT "name" = #. "name" must be greater than #.
IMSLS_INEQUALITY_VIOLATION_12 "name1" = # must equal "name2" = #.