aggrApriori¶
Computes the frequent itemsets in a transaction set using aggregation.
Synopsis¶
aggrApriori ( )
Required Arguments¶
One optional argument must be supplied to indicate which calculation step is to be performed.
Optional Argument | Description |
---|---|
frequentItemsets |
Compute the frequent itemsets. |
union |
Compute the union of two itemsets. |
count |
Count the occurrence of each itemset in the transaction data set. |
sum |
Add the counts of two itemsets. |
updateFreqItemsets |
Update the set of frequent itemsets. |
associationRules |
Compute the strong association rules. |
Optional Arguments¶
frequentItemsets
, floatx
, intmaxNumProducts
, intmaxSetSize
, floatminPctSupport
, structureitemsets
(Input/Output)- Computes the frequent itemsets in a transaction set.
- float
x[[]]
(Input) - Array of size
n
× 2, each row of which represents a transaction id and item id pair. The algorithm assumes that an individual transaction is complete within a single dataset. That is, there is no matching of transaction ids between different data sets. - int
maxNumProducts
(Input) - Maximum number of items or products that may be present in the aggregation of all transactions.
- int
maxSetSize
(Input) - Maximum size of an itemset. Only frequent itemsets with
maxSetSize
or fewer items are considered in the analysis. - float
minPctSupport
(Input) - Minimum percentage of transactions in which an item or itemset must be
present to be considered frequent.
minPctSupport
must be in the interval [0,1]. - structure
itemsets
(Output) - A data structure containing the frequent itemsets in the transaction set
x
. If no value can be computed, thenNone
is returned. To release this space, use freeAprioriItemsets.
or
union
, structureitemsets1
, structureitemsets2
(Input/Output)- Computes the union of two itemsets.
- structure
itemsets1
(Input) - An
structure
data structure containing the frequent itemsets for the union. - structure
itemsets2
(Input) - An
structure
data structure containing the frequent itemsets for the union.
or
count
, structurecandItemsets
, floatx
(Input/Output)- Counts the frequency of each itemset in a transaction data set.
- structure
candItemsets
(Input) - Candidate itemsets and the corresponding number of transactions.
- float
x[]
(Input) - Array of size
n
× 2, each row of which represents a transaction id and item id pair. The algorithm assumes that an individual transaction is complete within a single dataset. That is, there are no matching of transaction ids between different data sets.
or
sum
, intprevFreq1
,*int*prevFreq2
(Input)- Sum up the itemset frequencies in
prevFreq1
andprevFreq2
and return infreq
. - int
prevFreq1[]
(Input) - Array of length
nItemsets
containing the itemset frequencies counted over one or more blocks of transaction data. - int
prevFreq2[]
(Input) - Array of length
nItemsets
containing the itemset frequencies counted over a second set of blocks of transaction data.
or
updateFreqItemsets
, structurecandItemsets
, intfreq
(Input/Output)- Updates the set of frequent items.
- structure
candItemsets
(Input) - Candidate itemsets and the corresponding number of transactions.
- int
freq[]
(Input) - Array of length
nItemsets
containing the frequencies for each itemset incandItemsets
.
or
associationRules
, structureitemsets
, floatconfidence
, floatlift
(Input/Output)- Computes the strong association rules among itemsets.
- structure
itemsets
(Input) - A data structure containing the itemsets.
- float
confidence
(Input) - The minimum confidence used to determine the strong association rules.
confidence
must be in the interval [0,1].lift
is the other criterion that determines whether an association is “strong.” If either criterion,confidence
orlift
is exceeded, the association rule is considered “strong.” - float
lift
(Input) - The minimum lift used to determine the strong association rules.
lift
must be non-negative.confidence
is the other criterion which determines whether an association is “strong.” If either criterion,confidence
orlift
is exceeded, the association rule will be considered “strong.”
Description¶
The function aggrApriori
performs the Apriori algorithm for association
rule discovery. Association rules are statements of the form, “if X, then
Y”, given with some measure of confidence. The main application for
association rule discovery is market basket analysis, where X and Y are
products or groups of products, and the occurrences are individual
transactions, or “market baskets.” The results help sellers learn
relationships between the different products they sell, supporting better
marketing decisions. There are other applications for association rule
discovery, such as the problem areas of text mining and bioinformatics. The
Apriori algorithm (Agrawal and Srikant, 1994)
is one of the most popular algorithms for association rule discovery in
transactional datasets. For a full description of the Apriori algorithm, see
apriori.
The interface to the function aggrApriori
is designed to complete the
analysis over a series of steps, with each step requiring a call to the
function. With this design, Apriori can be applied to separate blocks of
transactions. For each dataset or block, call aggrApriori
with
frequentItemsets
to obtain the frequent itemsets from each block.The
same parameter settings, such as minimum support percentage, must be used in
each separate call. Then, call aggrApriori
with the keyword union
sequentially to obtain the union of all the frequent itemsets. The resulting
set serves as the “candidate” itemsets for the global set of transactions.
An itemset which is frequent in one transaction set may or may not be
frequent in the larger collection. To find the itemsets that are frequent
over the entire set of transactions, aggrApriori
performs another pass
through the individual blocks, this time counting the occurrences of each of
the itemsets in each of the transaction sets. This step can be done in
parallel, using keyword count
. The next step is then to sum up the
individual counts before filtering for the frequent itemsets. This is
achieved with the keyword sum
, applied successively to pairs of previous
counts. After this step, the frequencies of each itemset over all of the
transactions are known and it remains to be seen if any meet the threshold
to be considered “frequent”. The final step in determining the frequent
itemsets is updateFreqItemsets
. Once the frequent itemsets are known,
the strong association rules can be found using the step,
associationRules
, although this is not a special step in the
aggregation. The method is due to
Savasere, Omiecinski, and Navathe (1995) and
is also summarized and compared with other approaches in
Rajaraman and Ullman (2011).
Since aggrApriori
can operate on subsets of data, it can be used when
physical memory cannot hold the entire data set. Additionally, this design
may be useful in parallel computing environments where nodes can be
programmed to calculate intermediate results in parallel.
Example¶
This example shows how to apply Apriori to separate blocks of data and combine results. The data are two separate blocks of 50 transactions involving five different product IDs. The minimum support percentage is set to 0.30, providing a minimum required support of 30 transactions overall.
from numpy import *
from pyimsl.stat.aggrApriori import aggrApriori
from pyimsl.stat.writeAprioriItemsets import writeAprioriItemsets
from pyimsl.stat.writeAssociationRules import writeAssociationRules
x1 = array([[1, 3], [1, 2], [1, 1], [2, 1], [2, 2], [2, 4], [2, 5],
[3, 3], [4, 4], [4, 3], [4, 5], [4, 1], [5, 5], [6, 1],
[6, 2], [6, 3], [7, 5], [7, 3], [7, 2], [8, 3], [8, 4],
[8, 1], [8, 5], [8, 2], [9, 4], [10, 5], [10, 3], [11, 2],
[11, 3], [12, 4], [13, 4], [14, 2], [14, 3], [14, 1], [15, 3],
[15, 5], [15, 1], [16, 2], [17, 3], [17, 5], [17, 1], [18, 5],
[18, 1], [18, 2], [18, 3], [19, 2], [20, 4], [21, 1], [21, 4],
[21, 2], [21, 5], [22, 5], [22, 4], [23, 2], [23, 5], [23, 3],
[23, 1], [23, 4], [24, 3], [24, 1], [24, 5], [25, 3], [25, 5],
[26, 1], [26, 4], [26, 2], [26, 3], [27, 2], [27, 3], [27, 1],
[27, 5], [28, 5], [28, 3], [28, 4], [28, 1], [28, 2], [29, 4],
[29, 5], [29, 2], [30, 2], [30, 4], [30, 3], [31, 2], [32, 5],
[32, 1], [32, 4], [33, 4], [33, 1], [33, 5], [33, 3], [33, 2],
[34, 3], [35, 5], [35, 3], [36, 3], [36, 5], [36, 4], [36, 1],
[36, 2], [37, 1], [37, 3], [37, 2], [38, 4], [38, 2], [38, 3],
[39, 3], [39, 2], [39, 1], [40, 2], [40, 1], [41, 3], [41, 5],
[41, 1], [41, 4], [41, 2], [42, 5], [42, 1], [42, 4], [43, 3],
[43, 2], [43, 4], [44, 4], [44, 5], [44, 2], [44, 3], [44, 1],
[45, 4], [45, 5], [45, 3], [45, 2], [45, 1], [46, 2], [46, 4],
[46, 5], [46, 3], [46, 1], [47, 4], [47, 5], [48, 2], [49, 1],
[49, 4], [49, 3], [50, 3], [50, 4]])
x2 = array([[1, 2], [1, 1], [1, 4], [1, 3], [2, 2], [2, 5], [2, 3],
[2, 1], [2, 4], [3, 5], [3, 4], [4, 2], [5, 4], [5, 2],
[5, 3], [5, 5], [6, 3], [6, 5], [7, 2], [7, 5], [7, 4],
[7, 1], [7, 3], [8, 2], [9, 2], [9, 4], [10, 4], [10, 2],
[11, 4], [11, 1], [12, 3], [12, 1], [12, 5], [12, 2], [13, 2],
[14, 3], [14, 4], [14, 2], [15, 2], [16, 5], [16, 2], [16, 4],
[17, 1], [18, 2], [18, 3], [18, 4], [19, 3], [19, 1], [19, 2],
[19, 4], [20, 5], [20, 1], [21, 5], [21, 4], [21, 1], [21, 3],
[22, 4], [22, 1], [22, 5], [23, 1], [23, 2], [24, 4], [25, 4],
[25, 3], [26, 5], [26, 2], [26, 3], [26, 4], [26, 1], [27, 2],
[27, 1], [27, 5], [27, 3], [28, 1], [28, 2], [28, 3], [28, 4],
[29, 5], [29, 2], [29, 1], [30, 5], [30, 3], [30, 2], [30, 4],
[31, 4], [31, 1], [32, 1], [32, 2], [32, 3], [32, 4], [32, 5],
[33, 3], [33, 2], [33, 4], [33, 5], [33, 1], [34, 3], [34, 4],
[34, 5], [34, 2], [35, 2], [35, 3], [36, 3], [36, 5], [36, 4],
[37, 1], [37, 4], [37, 2], [37, 3], [37, 5], [38, 5], [38, 3],
[38, 1], [38, 2], [39, 2], [39, 5], [40, 4], [40, 2], [41, 4],
[42, 4], [43, 5], [43, 4], [44, 5], [44, 4], [44, 3], [44, 2],
[44, 1], [45, 1], [45, 2], [45, 3], [45, 5], [45, 4], [46, 3],
[46, 4], [47, 4], [47, 5], [47, 2], [47, 3], [48, 5], [48, 3],
[48, 2], [48, 1], [48, 4], [49, 4], [49, 5], [50, 4], [50, 1]])
maxNumProducts = 5
maxSetSize = 4
minPctSupport = 0.30
confidence = 0.8
lift = 2.0
# Find frequent itemsets in x1 and x2.
frequentItemsets1 = {"x": x1, "maxNumProducts": maxNumProducts,
"maxSetSize": maxSetSize, "minPctSupport": minPctSupport}
aggrApriori(frequentItemsets=frequentItemsets1)
itemsets1 = frequentItemsets1["itemsets"]
frequentItemsets2 = {"x": x2, "maxNumProducts": maxNumProducts,
"maxSetSize": maxSetSize, "minPctSupport": minPctSupport}
aggrApriori(frequentItemsets=frequentItemsets2)
itemsets2 = frequentItemsets2["itemsets"]
# Take the union of itemsets1 and itemsets2.
union = {"itemsets1": itemsets1, "itemsets2": itemsets2}
aggrApriori(union=union)
candItemsets = union["candItemsets"]
# Count the frequencies of each candidate itemset in
# each of the data sets
count1 = {"candItemsets": candItemsets, "x": x1}
aggrApriori(count=count1)
prevFreq1 = count1["freq"]
count2 = {"candItemsets": candItemsets, "x": x2}
aggrApriori(count=count2)
prevFreq2 = count2["freq"]
# Sum the frequencies.
sum = {"prevFreq1": prevFreq1, "prevFreq2": prevFreq2}
aggrApriori(sum=sum)
freq = sum["freq"]
# Determine which of the candidate itemsets are frequent.
updateFreqItemsets = {"candItemsets": candItemsets, "freq": freq}
aggrApriori(updateFreqItemsets=updateFreqItemsets)
itemsets = updateFreqItemsets["itemsets"]
# Generate the strong association rules.
associationRules = {"itemsets": itemsets,
"confidence": confidence, "lift": lift}
aggrApriori(associationRules=associationRules)
assocRules = associationRules["assocRules"]
writeAprioriItemsets(itemsets)
writeAssociationRules(assocRules)
Output¶
Frequent Itemsets (Out of 100 Transactions):
Size Support Itemset
1 51 { 1 }
1 63 { 2 }
1 60 { 3 }
1 63 { 4 }
1 54 { 5 }
2 37 { 1 2 }
2 38 { 1 3 }
2 33 { 1 4 }
2 35 { 1 5 }
2 44 { 2 3 }
2 38 { 2 4 }
2 34 { 2 5 }
2 38 { 3 4 }
2 38 { 3 5 }
2 37 { 4 5 }
3 32 { 1 2 3 }
3 31 { 2 3 4 }
Association Rules (itemset X implies itemset Y):
X = {1 2} ==> Y = {3}
supp(X)=37, supp(Y)=60, supp(X U Y)=32
conf= 0.86, lift=1.44
X = {1 3} ==> Y = {2}
supp(X)=38, supp(Y)=63, supp(X U Y)=32
conf= 0.84, lift=1.34
X = {2 4} ==> Y = {3}
supp(X)=38, supp(Y)=60, supp(X U Y)=31
conf= 0.82, lift=1.36
X = {3 4} ==> Y = {2}
supp(X)=38, supp(Y)=63, supp(X U Y)=31
conf= 0.82, lift=1.29
Warning Errors¶
IMSLS_MIN_SUPPORT_NOT_MET |
No items met minimum support of #. |
Fatal Errors¶
IMSLS_NEED_IARG_GE |
"name" = # . "name" must
be greater than or equal to
# . |
IMSLS_NEED_IARG_GT |
"name" = # . "name" must
be greater than # . |
IMSLS_INEQUALITY_VIOLATION_12 |
"name1" = # must equal
"name2" = # . |