com.imsl.datamining.PrefixSpan

All Implemented Interfaces:: Serializable

public class PrefixSpan extends Object implements Serializable

Performs the PrefixSpan algorithm for sequential pattern mining.

Sequential pattern mining (SPM) involves finding significant subsequences within a set of sequences. A subsequence is significant if it occurs frequently enough, according to a minimum threshold.

Let $I = \{i_1, i_2, ..., i_n\} $ be a set of $ n $ items of interest. Such items could be products, genes, behaviors, symptoms of a disease, or virtually any observable, discrete event. A subset of items is called an itemset. A sequence s is an ordered list of itemsets, denoted by $\{s_1,s_2,...,s_m\}$ where $s_j$ is an itemset. $s_j$ is also referred to as an element of the sequence s.

A sequence $ \alpha $ = <$\alpha_1, \alpha_2, ..., \alpha_k$> is a subsequence of another sequence $ \beta $ = <$\beta_1, \beta_2, ..., \beta_l$> denoted as $ \alpha \subseteq \beta $, if there exist indexes $ \{ j_1 \lt j_2 \lt⋯\lt j_k \}$ such that $ \alpha_1 \subseteq \beta_{j_1}, \alpha_2 \subseteq \beta_{j_2}, …, \alpha_k \subseteq \beta_{j_k} $. For example, if the set of items $I = \{a, b, c, d\} $ and a sequence $s$ = <$(abd)(bd)$>, (items occurring in the same event are enclosed in parentheses) then the length-1 subsequences of $s$ are <$a$>, <$b$>, <$d$>, and the length-2 subsequences are <$ab$>, <$ad$>, <$bb$>, <$bd$>, <$db$>, <$dd$>,<$(ab)$>, <$(ad)$>, <$(bd)$>. Similarly, there are many length-3 subsequences, e.g., <$(ab)b$>.

Whereas a transaction database includes transaction id’s and the itemsets that occur in that transaction, a sequence database is organized by sequence id, together with the corresponding sequence of transactions. Sequential pattern mining involves searching a sequence database for frequently occurring subsequences, where a subsequence is frequent if it occurs a specified minimum number of times, referred to as the support threshold. Frequent subsequences are then called sequential patterns. For example, a sequence of a customer’s purchases over time involves combinations of products purchased either together at the same time or over time in separate transactions, or repeatedly together over time. The separate transactions for a customer might look like: $$ \begin{array}{|c|c|c|} \hline \textbf{Customer Id} & \textbf{Transaction Id} & \textbf{Itemset} \\ \hline 1 & 10 & \lt a,b,d \gt \\ \hline 1 & 20 & \lt b,d \gt \\ \hline 1 & 30 & \lt a,c \gt \\ \hline 1 & 40 & \lt a,b,c \gt \\ \hline \end{array} $$ The corresponding sequence is $$ \begin{array}{|c|c|} \hline \textbf{Sequence Id} & \textbf{Sequence} \\ \hline 1 & \lt (abd)(bd)(ac)(abc) \gt \\ \hline \end{array} $$ Within a transaction, denoted by parenthesis, it is assumed that an item occurs at most once, and since <$(bc)$> is equivalent to <$(cb)$>, <$(ab)$> is equivalent to <$(ba)$>, etc., items should be listed in alphabetical or numerical order within a transaction. The sequence with id 1 has 4 elements, <$(abd)$>,<$ (bd)$>, <$(ac)$>, <$(abc)$>, and a length of 10 since there are 10 instances of a product appearing in it. The item <$a$> appears 3 times but contributes only 1 to the support of a.

One approach to mining for frequent subsequences is a "generate and test" approach. Suppose we have a sequence database with 4 sequences. $$ \begin{array}{|c|c|} \hline \textbf{Sequence Id} & \textbf{Sequence} \\ \hline 1 & \lt (abd)(bd)(ac)(abc) \gt \\ \hline 2 & \lt (a)(bc)bcd \gt\\ \hline 3 & \lt (ab)cab \gt \\ \hline 4 & \lt a(b)bac \gt \\ \hline \end{array} $$ The single item subsequences and their support are <$a$>: 4, <$b$>: 4, <$c$>: 4, and <$d$>: 2. If minimum support is set to 3, <$a$>, <$b$>, <$c$> are the frequent length-1 subsequences. The a priori principle (Agrawal, R. and Srikant, R. (1994)), allows us to ignore <$d$> and only consider longer subsequences beginning with, and appended by, the frequent length-1 items. From 3 frequent length-1 subsequences there are 12 length-2 subsequences, shown below. $$ \begin{array}{|c|c|} \hline \textbf{Subsequence} & \textbf{Support} \\ \hline \lt aa \gt& 3 \\ \hline \lt ab \gt & 4 \\ \hline \lt ac \gt & 4 \\ \hline \lt bb \gt & 4 \\ \hline \lt ba \gt & 3 \\ \hline \lt bc \gt & 4 \\ \hline \lt cc \gt & 2 \\ \hline \lt ca \gt & 2 \\ \hline \lt cb \gt & 3 \\ \hline \lt (ab) \gt & 2 \\ \hline \lt (ac) \gt & 1 \\ \hline \lt (bc) \gt & 2 \\ \hline \end{array} $$ With a minimum threshold of 3, there are 7 frequent length-2 subsequences. To find candidate length-3 sequential patterns, we append frequent length-2 subsequences with the frequent single items, <$a$>, <$b$>, <$c$>, and then scan the database to obtain their support. At each step, infrequent subsequences are discarded and not considered further. This "generate and test" approach would stop when there are no more frequent subsequences. Earlier algorithms (e.g., GSP ) use the generate and test approach. A different more efficient strategy, employed by the PrefixSpan algorithm, is a depth-first search that “discovers” or grows sequential patterns by way of projection and recursion.

PrefixSpan Algorithm

Input: A sequence database, S and a minimum support threshold, $m$.
Output: All sequential patterns in $S$ determined by $m$.
Step 1: Scan $S$ for frequent length-1 subsequences.
Step 2: For each length-1 sequential pattern $\alpha$ (those that meet minimum support), find the $\alpha$-projected database, $S|\alpha$. Repeat step 1 with $S$ replaced with the $\alpha$-projected database, $S|\alpha$. Continue until no more sequential patterns are found for $\alpha$.

The $\alpha$-projected database $S|\alpha$ is found by scanning each sequence in $S$ for the first occurrence of $\alpha$, and removing all items before and including $\alpha$, leaving the remainder of the sequence. These projected sequences are then scanned for frequent single item subsequences, β. Projecting S|α on each β is equivalent to projecting S on $\alpha$ appended by each β. For each $\alpha$ the repeated projections continue until no more sequential patterns are found.

References

Agrawal, R. and Srikant, R. (1994) Fast Algorithms for Mining Association Rules in Large Databases. Proceedings of the 20th International Conference on Very Large Data Bases, Santiago de Chile, 12-15 September 1994, 487-499.

J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, M. Hsu (2004) Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach. IEEE Trans. Knowl. Data Eng. 16(11): 1424-1440.

See Also:

Constructor Summary

Constructors

Constructor

Description

PrefixSpan(double[][] transactionDB, int[] columnIndex)

Constructs a PrefixSpan object from a transaction database.

PrefixSpan(SequenceDatabase sequenceDB)

Constructs a PrefixSpan object from a SequenceDatabase.
Method Summary

Modifier and Type

Method

Description

SequenceDatabase

getFrequentSequences(double minSupportPct)

Returns the frequent subsequences (sequential patterns) determined by the given minimum support percentage.

void

setMaxNumberOfSequences(int maxNumberOfSequences)

Sets the maximum number of sequences.

void

setMaxSequenceLength(int maxSequenceLength)

Sets the maximum sequence length.

void

setPrintLevel(int printLevel)

Sets the print level.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- PrefixSpan
  
  public PrefixSpan(double[][] transactionDB, int[] columnIndex)
  
  Constructs a PrefixSpan object from a transaction database.
  
  Parameters:
  
  transactionDB - a double array containing transactions.
  transactionDB must contain unique customer ids in column columnIndex[0], unique transaction ids in column columnIndex[1], and unique product ids in column columnIndex[2]. Customer, transaction, and product id's must be integers, $\ge 0$.
  
  columnIndex - an int array of length 3 containing the column indices
- PrefixSpan
  
  public PrefixSpan(SequenceDatabase sequenceDB)
  
  Constructs a PrefixSpan object from a SequenceDatabase.
  
  Parameters:
  
  sequenceDB - a SequenceDatabase
Method Details
- getFrequentSequences
  
  public SequenceDatabase getFrequentSequences(double minSupportPct)
  
  Returns the frequent subsequences (sequential patterns) determined by the given minimum support percentage.
  This method runs the PrefixSpan algorithm.
  
  Parameters:
  
  minSupportPct - a double, the minimum support percentage
  
  Returns:
  
  a SequenceDatabase containing the sequential patterns
- setMaxSequenceLength
  
  public void setMaxSequenceLength(int maxSequenceLength)
  
  Sets the maximum sequence length. The algorithm will search for sequential patterns of length only up to and including maxSequenceLength.
  
  Parameters:
  
  maxSequenceLength - an int, the maximum sequence length
  Default: maxSequenceLength = 20.
- setMaxNumberOfSequences
  
  public void setMaxNumberOfSequences(int maxNumberOfSequences)
  
  Sets the maximum number of sequences. The algorithm will search for up to and including maxNumberOfSequences sequential patterns.
  
  Parameters:
  
  maxNumberOfSequences - an int, the maximum number of sequences
  Default: maxNumberOfSequences = 1000
- setPrintLevel
  
  public void setPrintLevel(int printLevel)
  
  Sets the print level.
  When printLevel > 0, intermediate steps of the algorithm are printed to the standard output.
  
  Parameters:
  
  printLevel - an int, the desired print level
  Default: printLevel = 0. No printing is performed.

Class PrefixSpan

PrefixSpan Algorithm

Constructor Summary

Method Summary

Methods inherited from class java.lang.Object

Constructor Details

PrefixSpan

PrefixSpan

Method Details

getFrequentSequences

setMaxSequenceLength

setMaxNumberOfSequences

setPrintLevel