CNL Stat : Data Mining : apriori
apriori
Computes the frequent itemsets in a transaction set.
Synopsis
#include <imsls.h>
Imsls_f_apriori_itemsets *imsls_f_apriori (int n, int x[], int max_num_products, , 0)
The type double function is imsls_d_apriori.
Required Arguments
int n (Input)
Number of (transaction, item) pairs in x.
int x[] (Input)
Array of size n x 2, each row of which represents a transaction id and item id pair.
int max_num_products (Input)
Maximum number of unique items or products that may be present in the transactions. max_num_products must be greater than or equal to the number of items in x.
Return Value
Pointer to an Imsls_f_apriori_itemsets data structure containing the frequent itemsets in the transaction set x. If no value can be computed, then NULL is returned. To release this space, use imsls_free_apriori_itemsets. Please see Data Structures for a description of this data structure.
Synopsis with Optional Arguments
#include <imsls.h>
Imsls_f_apriori_itemsets *imsls_f_apriori (int n, int x[], int max_num_products,
IMSLS_MAX_SET_SIZE, int max_set_size,
IMSLS_MIN_SUPPORT, double min_pct_support,
IMSLS_ASSOCIATION_RULES, float confidence, float lift, Imsls_f_association_rules **assoc_rules,
0)
Optional Arguments
IMSLS_MAX_SET_SIZE, int max_set_size (Input)
Maximum size of an itemset. Only frequent itemsets with max_set_size or fewer items are considered in the analysis.
Default: max_set_size = 5.
IMSLS_MIN_SUPPORT, double min_pct_support (Input)
Minimum percentage of transactions in which an item or itemset must be present to be considered frequent. min_pct_support must be in the interval [0,1].
Default: min_pct_support= 0.1.
IMSLS_ASSOCIATION_RULES, float confidence, float lift, Imsls_f_association_rules **assoc_rules (Input/Output)
Computes the strong association rules among itemsets.
float confidence (Input)
The minimum confidence used to determine the strong association rules. confidence must be in the interval [0,1]. lift is the other criterion that determines whether an association is "strong." If either criterion, confidence or lift, is exceeded, the association rule is considered "strong."
float lift (Input)
The minimum lift used to determine the strong association rules. lift must be non-negative. confidence is the other criterion that determines whether an association is "strong." If either criterion, confidence or lift, is exceeded, the association rule is considered "strong."
Imsls_f_association_rules **assoc_rules (Output)
Address of a pointer to an Imsls_f_association_rules data structure containing the strong association rules among the itemsets. If no value can be computed, then NULL is returned. To release this space, use imsls_f_free_association_rules.
Description
The function imsls_f_apriori performs the Apriori algorithm for association rule discovery. Association rules are statements of the form, "if X, then Y", given with some measure of confidence. The main application for association rule discovery is market basket analysis, where X and Y are products or groups of products, and the occurrences are individual transactions, or “market baskets." The results help sellers learn relationships between the different products they sell, supporting better marketing decisions. There are other applications for association rule discovery, such as the problem areas of text mining and bioinformatics. The Apriori algorithm (Agrawal and Srikant, 1994) is one of the most popular algorithms for association rule discovery in transactional datasets.
For distributed data or data larger than physical memory, see imsls_f_aggr_apriori.
In the first and most critical stage, the Apriori algorithm mines the transactions for frequent itemsets. An itemset is frequent if it appears in more than a minimum number of transactions. The number of transactions containing an itemset is known as its “support”, and the minimum support (as a percentage of transactions) is a control parameter in the algorithm. The algorithm begins by finding the frequent single items. Then the algorithm generates all two-item sets from the frequent single items and determines which among them are frequent. From the collection of frequent pairs, Apriori forms candidate three-item subsets and determines which are frequent, and so on. The algorithm stops when either a maximum itemset size is reached, or when none of the candidate itemsets are frequent. In this way, the Apriori algorithm exploits the apriori-property: for an itemset to be frequent, all of its proper subsets must also be frequent. At each step the problem is reduced to only the frequent subsets.
In the second stage, the algorithm generates association rules. These are of the form, X Y (read, "if X, then Y"), where Y and X are disjoint frequent itemsets. The confidence measure associated with the rule is defined as the proportion of transactions containing X that also contain Y. Denote the support of X (the number of transactions containing X) as SX, and SZ is the support of Z = X Y. The confidence of the rule X Y is the ratio, SZ/SX. Note that the confidence ratio is the conditional probability
where P[XY] denotes the probability of both X and Y. The probability of an itemset X is estimated by SX/N, where N is the total number of transactions.
Another measure of the strength of the association is known as the lift, which is the ratio (SZN) / (SXSY). Lift values close to 1.0 suggest the sets are independent, and that they occur together by chance. Large lift values indicate a strong association. A minimum confidence threshold and a lift threshold can be specified.
Data Structures
The data structures output by imsls_f_apriori are described below. (For imsls_d_apriori, the structure names are Imsls_d_apriori_itemsets, Imsls_d_association_rules, and Imsls_d_rule_components where type float becomes double).
Structure definitions are provided for informational purposes and may be subject to change.
Table 35 – Imsls_f_apriori_itemsets
Field
Description
int n_itemsets
Length of array itemsets containing the Imsls_apriori_items structures.
Imsls_apriori_items *itemsets
Array of Imsls_apriori_items structures containing the set of frequent items and the support for that set.
int n_trans
Number of transactions.
int max_num_products
Maximum number of products.
int max_set_size
Maximum itemset size.
double min_pct_support
Minimum percentage of transactions.
Table 36 – Imsls_apriori_items
Field
Description
int n_items
Length of items.
int *items
Array containing the set of frequent items.
int support
The number of transactions in which the item appears.
Table 37 – Imsls_f_association_rules
Field
Description
int n_rules
Length of array rules containing the Imsls_f_rule_components structures.
Imsls_f_rule_components *rules
Array containing the association rules.
Table 38 – Imsls_f_rule_components
Field
Description
int n_x
Length of x.
int *x
Array containing the X components of the association rule.
int n_y
Length of y.
int *y
Array containing the Y components of the association rule.
int support[3]
Support for Z, X and Y components of the association rule.
float confidence
Confidence of the association rule.
float lift
Lift of the association rule.
Example
This example applies Apriori to find the frequent itemsets and strong association rules. The data are 50 transactions involving five different product IDs. The minimum support percentage is set to 0.30, giving a minimum required support of 15 transactions.
 
#include <imsls.h>
 
#define N 144
 
int main() {
int max_num_products = 5, max_set_size = 10;
float confidence = 0.8, lift = 2.0;
double min_pct_support = 0.3;
int x[N][2] = {
{1, 3}, {1, 2}, {1, 1}, {2, 1}, {2, 2}, {2, 4}, {2, 5},
{3, 3}, {4, 4}, {4, 3}, {4, 5}, {4, 1}, {5, 5}, {6, 1},
{6, 2}, {6, 3}, {7, 5}, {7, 3}, {7, 2}, {8, 3}, {8, 4},
{8, 1}, {8, 5}, {8, 2}, {9, 4}, {10, 5}, {10, 3}, {11, 2},
{11, 3}, {12, 4}, {13, 4}, {14, 2}, {14, 3}, {14, 1}, {15, 3},
{15, 5}, {15, 1}, {16, 2}, {17, 3}, {17, 5}, {17, 1}, {18, 5},
{18, 1}, {18, 2}, {18, 3}, {19, 2}, {20, 4}, {21, 1}, {21, 4},
{21, 2}, {21, 5}, {22, 5}, {22, 4}, {23, 2}, {23, 5}, {23, 3},
{23, 1}, {23, 4}, {24, 3}, {24, 1}, {24, 5}, {25, 3}, {25, 5},
{26, 1}, {26, 4}, {26, 2}, {26, 3}, {27, 2}, {27, 3}, {27, 1},
{27, 5}, {28, 5}, {28, 3}, {28, 4}, {28, 1}, {28, 2}, {29, 4},
{29, 5}, {29, 2}, {30, 2}, {30, 4}, {30, 3}, {31, 2}, {32, 5},
{32, 1}, {32, 4}, {33, 4}, {33, 1}, {33, 5}, {33, 3}, {33, 2},
{34, 3}, {35, 5}, {35, 3}, {36, 3}, {36, 5}, {36, 4}, {36, 1},
{36, 2}, {37, 1}, {37, 3}, {37, 2}, {38, 4}, {38, 2}, {38, 3},
{39, 3}, {39, 2}, {39, 1}, {40, 2}, {40, 1}, {41, 3}, {41, 5},
{41, 1}, {41, 4}, {41, 2}, {42, 5}, {42, 1}, {42, 4}, {43, 3},
{43, 2}, {43, 4}, {44, 4}, {44, 5}, {44, 2}, {44, 3}, {44, 1},
{45, 4}, {45, 5}, {45, 3}, {45, 2}, {45, 1}, {46, 2}, {46, 4},
{46, 5}, {46, 3}, {46, 1}, {47, 4}, {47, 5}, {48, 2}, {49, 1},
{49, 4}, {49, 3}, {50, 3}, {50, 4}
};
Imsls_f_apriori_itemsets *itemsets = NULL;
Imsls_f_association_rules *assoc_rules = NULL;
 
/* Compute and print the strong association rules. */
itemsets = imsls_f_apriori(N, &x[0][0], max_num_products,
IMSLS_MAX_SET_SIZE, max_set_size,
IMSLS_MIN_SUPPORT, min_pct_support,
IMSLS_ASSOCIATION_RULES, confidence, lift, &assoc_rules,
0);
imsls_f_write_apriori_itemsets(itemsets);
imsls_f_write_association_rules(assoc_rules);
 
imsls_f_free_apriori_itemsets(itemsets);
imsls_f_free_association_rules(assoc_rules);
}
Output
 
Frequent Itemsets (Out of 50 Transactions):
Size Support Itemset
1 27 { 1 }
1 30 { 2 }
1 33 { 3 }
1 27 { 4 }
1 27 { 5 }
2 20 { 1 2 }
2 22 { 1 3 }
2 16 { 1 4 }
2 19 { 1 5 }
2 22 { 2 3 }
2 16 { 2 4 }
2 15 { 2 5 }
2 16 { 3 4 }
2 19 { 3 5 }
2 17 { 4 5 }
3 17 { 1 2 3 }
3 15 { 1 3 5 }
 
Association Rules (itemset X implies itemset Y):
X = {1} ==> Y = {3}
supp(X)=27, supp(Y)=33, supp(X U Y)=22
conf= 0.81, lift=1.23
X = {1 2} ==> Y = {3}
supp(X)=20, supp(Y)=33, supp(X U Y)=17
conf= 0.85, lift=1.29
Warning Errors
IMSLS_MIN_SUPPORT_NOT_MET
No items met minimum support of #.
Fatal Errors
IMSLS_NEED_IARG_GE
"name" = #. “name“ must be greater than or equal to #.