aggr_apriori
Computes the frequent itemsets in a transaction set using aggregation.
Synopsis
#include <imsls.h>
void imsls_f_aggr_apriori (Imsls_keyword step, …, 0)
The type double function is imsls_d_aggr_apriori.
Required Arguments
Imsls_keyword step (Input)
One optional argument must be supplied to indicate which calculation step is to be performed. step is the name of the optional argument, defined as follows:
step |
Description |
IMSLS_FREQUENT_ITEMSETS |
Compute the frequent itemsets. |
IMSLS_UNION |
Compute the union of two itemsets. |
IMSLS_COUNT |
Count the occurrence of each itemset in the transaction data set. |
IMSLS_SUM |
Add the counts of two itemsets. |
IMSLS_UPDATE_FREQ_ITEMSETS |
Update the set of frequent itemsets. |
IMSLS_ASSOCIATION_RULES |
Compute the strong association rules. |
Synopsis with Optional Arguments
#include <imsls.h>
void imsls_f_aggr_apriori (
IMSLS_FREQUENT_ITEMSETS, int n, int x[], int max_num_products, int max_set_size,
double min_pct_support, Imsls_f_apriori_itemsets **itemsets, or
IMSLS_UNION, Imsls_f_apriori_itemsets *itemsets1,
Imsls_f_apriori_itemsets *itemsets2, Imsls_f_apriori_itemsets **cand_itemsets, or
IMSLS_COUNT, Imsls_f_apriori_itemsets *cand_itemsets, int n, int x[], int **freq, or
IMSLS_SUM, int n_itemsets,int prev_freq1[], int prev_freq2[], int freq[],or
IMSLS_UPDATE_FREQ_ITEMSETS, Imsls_f_apriori_itemsets *cand_itemsets, int n_itemsets, int freq[], Imsls_f_apriori_itemsets **itemsets, or
IMSLS_ASSOCIATION_RULES, Imsls_f_apriori_itemsets *itemsets,float confidence, float lift, Imsls_f_association_rules **assoc_rules,
0)
Optional Arguments
IMSLS_FREQUENT_ITEMSETS, int n, int x[], int max_num_products, int max_set_size, double min_pct_support, Imsls_f_apriori_itemsets **itemsets (Input/Output)
Computes the frequent itemsets in a transaction set.
int n (Input)
Number of (transaction, item) pairs in x.
int x[] (Input)
Array of size n x 2, each row of which represents a transaction id and item id pair. The algorithm assumes that an individual transaction is complete within a single dataset. That is, there is no matching of transaction ids between different data sets.
int max_num_products (Input)
Maximum number of items or products that may be present in the aggregation of all transactions.
int max_set_size (Input)
Maximum size of an itemset. Only frequent itemsets with max_set_size or fewer items are considered in the analysis.
double min_pct_support (Input)
Minimum percentage of transactions in which an item or itemset must be present to be considered frequent. min_pct_support must be in the interval [0,1].
Imsls_f_apriori_itemsets **itemsets (Output)
Address of a pointer to an Imsls_f_apriori_itemsets data structure containing the frequent itemsets in the transaction set x. If no value can be computed, then NULL is returned. To release this space, use imsls_free_apriori_itemsets.
or
IMSLS_UNION, Imsls_f_apriori_itemsets *itemsets1, Imsls_f_apriori_itemsets *itemsets2, Imsls_f_apriori_itemsets **cand_itemsets (Input/Output)
Computes the union of two itemsets.
Imsls_f_apriori_itemsets *itemsets1 (Input)
Pointer to an Imsls_f_apriori_itemsets data structure containing the frequent itemsets for the union.
Imsls_f_apriori_itemsets *itemsets2 (Input)
Pointer to an Imsls_f_apriori_itemsets data structure containing the frequent itemsets for the union.
Imsls_f_apriori_itemsets **cand_itemsets (Output)
Address of a pointer to an Imsls_f_apriori_itemsets data structure containing the union of two itemsets. If no value can be computed, then NULL is returned. To release this space, use imsls_free_apriori_itemsets.
or
IMSLS_COUNT, Imsls_f_apriori_itemsets *cand_itemsets, int n, int x[], int **freq (Input/Output)
Counts the frequency of each itemset in a transaction data set.
Imsls_f_apriori_itemsets cand_itemsets (Input)
Candidate itemsets and the corresponding number of transactions.
int n (Input)
Number of transaction/item pairs in x.
int x[] (Input)
Array of size n x 2, each row of which represents a transaction id and item id pair. The algorithm assumes that an individual transaction is complete within a single dataset. That is, there are no matching of transaction ids between different data sets.
int **freq (Output)
Address of an internally allocated array of length cand_itemsets->n_itemsets containing the number of occurrences of each itemset in x. To release this space, use imsls_free.
or
IMSLS_SUM, int n_itemsets, int prev_freq1[],int prev_freq2[], int **freq (Input/Output)
Sum up the itemset frequencies in prev_freq1 and prev_freq2 and return in freq.
int n_itemsets (Input)
Length of prev_freq1 and prev_freq2 which corresponds to the number of itemsets.
int prev_freq1[] (Input)
Array of length n_itemsets containing the itemset frequencies counted over one or more blocks of transaction data.
int prev_freq2[] (Input)
Array of length n_itemsets containing the itemset frequencies counted over a second set of blocks of transaction data.
int **freq (Output)
Array of length n_itemsets containing the sum of the frequencies.
or
IMSLS_UPDATE_FREQ_ITEMSETS, Imsls_f_apriori_itemsets *cand_itemsets, int n_itemsets, int freq[], Imsls_f_apriori_itemsets **itemsets (Input/Output)
Updates the set of frequent items.
Imsls_f_apriori_itemsets *cand_itemsets (Input)
Candidate itemsets and the corresponding number of transactions.
int n_itemsets (Input)
Length of freq.
int freq[] (Input)
Array of length n_itemsets containing the frequencies for each itemset in cand_itemsets.
Imsls_f_apriori_itemsets **itemsets (Output)
Address of a pointer to an Imsls_f_apriori_itemsets data structure containing the frequent itemsets. If no value can be computed, then NULL is returned. To release this space, use imsls_free_apriori_itemsets.
or
IMSLS_ASSOCIATION_RULES, Imsls_f_apriori_itemsets *itemsets, float confidence, float lift, Imsls_f_association_rules **assoc_rules (Input/Output)
Computes the strong association rules among itemsets.
Imsls_f_apriori_itemsets *itemsets (Input)
A pointer to an Imsls_f_itemsets data structure containing the itemsets.
float confidence (Input)
The minimum confidence used to determine the strong association rules. confidence must be in the interval [0,1]. lift is the other criterion that determines whether an association is “strong.” If either criterion, confidence or lift is exceeded, the association rule is considered “strong.”
float lift (Input)
The minimum lift used to determine the strong association rules. lift must be non-negative. confidence is the other criterion which determines whether an association is “strong.” If either criterion, confidence or lift is exceeded, the association rule will be considered “strong.”
Imsls_f_association_rules **assoc_rules (Output)
Address of a pointer to an Imsls_f_association_rules data structure containing the strong association rules among the itemsets. If no value can be computed, then NULL is returned. To release this space, use imsls_free_association_rules.
Description
The function imsls_f_aggr_apriori performs the Apriori algorithm for association rule discovery. Association rules are statements of the form, "if X, then Y", given with some measure of confidence. The main application for association rule discovery is market basket analysis, where X and Y are products or groups of products, and the occurrences are individual transactions, or "market baskets." The results help sellers learn relationships between the different products they sell, supporting better marketing decisions. There are other applications for association rule discovery, such as the problem areas of text mining and bioinformatics. The Apriori algorithm (Agrawal and Srikant, 1994) is one of the most popular algorithms for association rule discovery in transactional datasets. For a full description of the Apriori algorithm, see imsls_f_apriori.
The interface to the function imsls_f_aggr_apriori is designed to complete the analysis over a series of steps, with each step requiring a call to the function. With this design, Apriori can be applied to separate blocks of transactions. For each dataset or block, call imsls_f_aggr_apriori with IMSLS_FREQUENT_ITEMSETS to obtain the frequent itemsets from each block. The same parameter settings, such as minimum support percentage, must be used in each separate call. Then, call imsls_f_aggr_apriori with the keyword IMSLS_UNION sequentially to obtain the union of all the frequent itemsets. The resulting set serves as the "candidate" itemsets for the global set of transactions.
An itemset which is frequent in one transaction set may or may not be frequent in the larger collection. To find the itemsets that are frequent over the entire set of transactions, imsls_f_aggr_apriori performs another pass through the individual blocks, this time counting the occurrences of each of the itemsets in each of the transaction sets. This step can be done in parallel, using keyword IMSLS_COUNT. The next step is then to sum up the individual counts before filtering for the frequent itemsets. This is achieved with the keyword IMSLS_SUM, applied successively to pairs of previous counts. After this step, the frequencies of each itemset over all of the transactions are known and it remains to be seen if any meet the threshold to be considered "frequent". The final step in determining the frequent itemsets is IMSLS_UPDATE_FREQ_ITEMSETS. Once the frequent itemsets are known, the strong association rules can be found using the step, IMSLS_ASSOCIATION_RULES , although this is not a special step in the aggregation. The method is due to Savasere, Omiecinski, and Navathe (1995) and is also summarized and compared with other approaches in Rajaraman and Ullman(2011).
Since imsls_f_aggr_apriori can operate on subsets of data, it can be used when physical memory cannot hold the entire data set. Additionally, this design may be useful in parallel computing environments where nodes can be programmed to calculate intermediate results in parallel.
Data Structures
The data structures used by imsls_f_aggr_apriori are described below. (For imsls_d_aggr_apriori, the structure names are Imsls_d_apriori_itemsets, Imsls_d_association_rules, and Imsls_d_rule_components where type float becomes double).
Structure definitions are provided for informational purposes and may be subject to change.
Field |
Description |
int n_itemsets |
Length of array itemsets containing the Imsls_apriori_items structures. |
Imsls_apriori_items *itemsets |
Array of Imsls_apriori_items structures containing the set of frequent items and the support for that set. |
int n_trans |
Number of transactions. |
int max_num_products |
Maximum number of products. |
int max_set_size |
Maximum itemset size. |
double min_pct_support |
Minimum percentage of transactions. |
Field |
Description |
int n_items |
Length of items. |
int *items |
Array containing the set of frequent items. |
int support |
Number of transactions in which the item appears. |
Field |
Description |
int n_rules |
Length of array rules containing the Imsls_f_rule_components structures. |
Imsls_f_rule_components *rules |
Array containing the association rules. |
Field |
Description |
int n_x |
Length of x. |
int *x |
Array containing the X components of the association rule. |
int n_y |
Length of y. |
int *y |
Array containing the Y components of the association rule. |
int support[3] |
Support for Z, X and Y components of the association rule. |
float confidence |
Confidence of the association rule. |
float lift |
Lift of the association rule. |
Example
This example shows how to apply Apriori to separate blocks of data and combine results. The data are two separate blocks of 50 transactions involving five different product IDs. The minimum support percentage is set to 0.30, providing a minimum required support of 30 transactions overall.
#include <imsls.h>
#define N1 144
#define N2 147
int main() {
int i;
int max_num_products = 5, max_set_size = 4;
float confidence = 0.8, lift = 2.0;
double min_pct_support = 0.30;
int x1[N1][2] = {
{1, 3}, {1, 2}, {1, 1}, {2, 1}, {2, 2}, {2, 4}, {2, 5},
{3, 3}, {4, 4}, {4, 3}, {4, 5}, {4, 1}, {5, 5}, {6, 1},
{6, 2}, {6, 3}, {7, 5}, {7, 3}, {7, 2}, {8, 3}, {8, 4},
{8, 1}, {8, 5}, {8, 2}, {9, 4}, {10, 5}, {10, 3}, {11, 2},
{11, 3}, {12, 4}, {13, 4}, {14, 2}, {14, 3}, {14, 1}, {15, 3},
{15, 5}, {15, 1}, {16, 2}, {17, 3}, {17, 5}, {17, 1}, {18, 5},
{18, 1}, {18, 2}, {18, 3}, {19, 2}, {20, 4}, {21, 1}, {21, 4},
{21, 2}, {21, 5}, {22, 5}, {22, 4}, {23, 2}, {23, 5}, {23, 3},
{23, 1}, {23, 4}, {24, 3}, {24, 1}, {24, 5}, {25, 3}, {25, 5},
{26, 1}, {26, 4}, {26, 2}, {26, 3}, {27, 2}, {27, 3}, {27, 1},
{27, 5}, {28, 5}, {28, 3}, {28, 4}, {28, 1}, {28, 2}, {29, 4},
{29, 5}, {29, 2}, {30, 2}, {30, 4}, {30, 3}, {31, 2}, {32, 5},
{32, 1}, {32, 4}, {33, 4}, {33, 1}, {33, 5}, {33, 3}, {33, 2},
{34, 3}, {35, 5}, {35, 3}, {36, 3}, {36, 5}, {36, 4}, {36, 1},
{36, 2}, {37, 1}, {37, 3}, {37, 2}, {38, 4}, {38, 2}, {38, 3},
{39, 3}, {39, 2}, {39, 1}, {40, 2}, {40, 1}, {41, 3}, {41, 5},
{41, 1}, {41, 4}, {41, 2}, {42, 5}, {42, 1}, {42, 4}, {43, 3},
{43, 2}, {43, 4}, {44, 4}, {44, 5}, {44, 2}, {44, 3}, {44, 1},
{45, 4}, {45, 5}, {45, 3}, {45, 2}, {45, 1}, {46, 2}, {46, 4},
{46, 5}, {46, 3}, {46, 1}, {47, 4}, {47, 5}, {48, 2}, {49, 1},
{49, 4}, {49, 3}, {50, 3}, {50, 4}
};
int x2[N2][2] = {
{1, 2}, {1, 1}, {1, 4}, {1, 3}, {2, 2}, {2, 5}, {2, 3},
{2, 1}, {2, 4}, {3, 5}, {3, 4}, {4, 2}, {5, 4}, {5, 2},
{5, 3}, {5, 5}, {6, 3}, {6, 5}, {7, 2}, {7, 5}, {7, 4},
{7, 1}, {7, 3}, {8, 2}, {9, 2}, {9, 4}, {10, 4}, {10, 2},
{11, 4}, {11, 1}, {12, 3}, {12, 1}, {12, 5}, {12, 2}, {13, 2},
{14, 3}, {14, 4}, {14, 2}, {15, 2}, {16, 5}, {16, 2}, {16, 4},
{17, 1}, {18, 2}, {18, 3}, {18, 4}, {19, 3}, {19, 1}, {19, 2},
{19, 4}, {20, 5}, {20, 1}, {21, 5}, {21, 4}, {21, 1}, {21, 3},
{22, 4}, {22, 1}, {22, 5}, {23, 1}, {23, 2}, {24, 4}, {25, 4},
{25, 3}, {26, 5}, {26, 2}, {26, 3}, {26, 4}, {26, 1}, {27, 2},
{27, 1}, {27, 5}, {27, 3}, {28, 1}, {28, 2}, {28, 3}, {28, 4},
{29, 5}, {29, 2}, {29, 1}, {30, 5}, {30, 3}, {30, 2}, {30, 4},
{31, 4}, {31, 1}, {32, 1}, {32, 2}, {32, 3}, {32, 4}, {32, 5},
{33, 3}, {33, 2}, {33, 4}, {33, 5}, {33, 1}, {34, 3}, {34, 4},
{34, 5}, {34, 2}, {35, 2}, {35, 3}, {36, 3}, {36, 5}, {36, 4},
{37, 1}, {37, 4}, {37, 2}, {37, 3}, {37, 5}, {38, 5}, {38, 3},
{38, 1}, {38, 2}, {39, 2}, {39, 5}, {40, 4}, {40, 2}, {41, 4},
{42, 4}, {43, 5}, {43, 4}, {44, 5}, {44, 4}, {44, 3}, {44, 2},
{44, 1}, {45, 1}, {45, 2}, {45, 3}, {45, 5}, {45, 4}, {46, 3},
{46, 4}, {47, 4}, {47, 5}, {47, 2}, {47, 3}, {48, 5}, {48, 3},
{48, 2}, {48, 1}, {48, 4}, {49, 4}, {49, 5}, {50, 4}, {50, 1}
};
Imsls_f_apriori_itemsets *itemsets1 = NULL, *itemsets2 = NULL,
*cand_itemsets = NULL, *itemsets = NULL;
int *prev_freq1 = NULL, *prev_freq2 = NULL, *freq = NULL;
Imsls_f_association_rules *assoc_rules = NULL;
/* Find frequent itemsets in x1 and x2. */
imsls_f_aggr_apriori(IMSLS_FREQUENT_ITEMSETS,
N1, &x1[0][0], max_num_products,
max_set_size, min_pct_support, &itemsets1,
0);
imsls_f_aggr_apriori(IMSLS_FREQUENT_ITEMSETS,
N2, &x2[0][0], max_num_products,
max_set_size, min_pct_support, &itemsets2,
0);
/* Take the union of itemsets1 and itemsets2. */
imsls_f_aggr_apriori(IMSLS_UNION,
itemsets1, itemsets2, &cand_itemsets,
0);
/* Count the frequencies of each candidate itemset in
each of the data sets */
imsls_f_aggr_apriori(IMSLS_COUNT, cand_itemsets,
N1, &x1[0][0], &prev_freq1,
0);
imsls_f_aggr_apriori(IMSLS_COUNT, cand_itemsets,
N2, &x2[0][0], &prev_freq2,
0);
/* Sum the frequencies. */
imsls_f_aggr_apriori(IMSLS_SUM, cand_itemsets->n_itemsets,
prev_freq1, prev_freq2, &freq,
0);
/* Determine which of the candidate itemsets are frequent. */
imsls_f_aggr_apriori(IMSLS_UPDATE_FREQ_ITEMSETS,
cand_itemsets, cand_itemsets->n_itemsets, freq, &itemsets,
0);
/* Generate the strong association rules. */
imsls_f_aggr_apriori(IMSLS_ASSOCIATION_RULES,
itemsets, confidence, lift, &assoc_rules,
0);
imsls_f_write_apriori_itemsets(itemsets);
imsls_f_write_association_rules(assoc_rules);
imsls_f_free_apriori_itemsets(itemsets1);
imsls_f_free_apriori_itemsets(itemsets2);
imsls_f_free_apriori_itemsets(cand_itemsets);
imsls_f_free_apriori_itemsets(itemsets);
imsls_f_free_association_rules(assoc_rules);
imsls_free(prev_freq1);
imsls_free(prev_freq2);
imsls_free(freq);
}
Output
Frequent Itemsets (Out of 100 Transactions):
Size Support Itemset
1 51 { 1 }
1 63 { 2 }
1 60 { 3 }
1 63 { 4 }
1 54 { 5 }
2 37 { 1 2 }
2 38 { 1 3 }
2 33 { 1 4 }
2 35 { 1 5 }
2 44 { 2 3 }
2 38 { 2 4 }
2 34 { 2 5 }
2 38 { 3 4 }
2 38 { 3 5 }
2 37 { 4 5 }
3 32 { 1 2 3 }
3 31 { 2 3 4 }
Association Rules (itemset X implies itemset Y):
X = {1 2} ==> Y = {3}
supp(X)=37, supp(Y)=60, supp(X U Y)=32
conf= 0.86, lift=1.44
X = {1 3} ==> Y = {2}
supp(X)=38, supp(Y)=63, supp(X U Y)=32
conf= 0.84, lift=1.34
X = {2 4} ==> Y = {3}
supp(X)=38, supp(Y)=60, supp(X U Y)=31
conf= 0.82, lift=1.36
X = {3 4} ==> Y = {2}
supp(X)=38, supp(Y)=63, supp(X U Y)=31
conf= 0.82, lift=1.29
Warning Errors
IMSLS_MIN_SUPPORT_NOT_MET |
No items met minimum support of #. |
Fatal Errors
IMSLS_NEED_IARG_GE |
"name" = #. "name" must be greater than or equal to #. |
IMSLS_NEED_IARG_GT |
"name" = #. "name" must be greater than #. |
IMSLS_INEQUALITY_VIOLATION_12 |
"name1" = # must equal "name2" = #. |