The sequential pattern mining problem

Let's move on to formalizing the third and last pattern matching question we tackle in this chapter. To do so, we need to look at sequences in more detail. A sequence differs from the transactions we looked at before in that the order now matters. For a given item set $I$, a sequence $s$ in $I$ of length $l$ is defined as follows:

$s = \langle s_1, s_2, \ldots, s_l \rangle$

Here, each individual $s_i$ is a concatenation of items, that is, $s_i = (a_{i1} \ldots a_{im})$, where each $a_{ij}$ is an item in $I$. Note that we do care about the order of the sequence items $s_i$, but not about the internal ordering of the individual $a_{ij}$ within $s_i$. A sequence database $S$ consists of pairs of sequence IDs and sequences, analogous to what we had before. An example of such a database can be found in the following table, in which the letters represent the same items as in our previous shopping cart example:

Sequence ID	Sequence
1	<a(abc)(ac)d(cf)>
2	<(ad)c(bc)(ae)>
3	<(ef)(ab)(df)cb>
4	<eg(af)cbc>

Table 2: A small sequence database with four short sequences.
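To make the notation concrete, the following is a minimal sketch of how Table 2 could be held in memory. It is only an illustration: the helper name `parse_sequence` and the variable `database` are hypothetical, and the parser assumes that every item is a single character and that the input is well formed. Each sequence item becomes a `frozenset`, which reflects that the internal ordering of items does not matter, while the surrounding list keeps the order of the sequence items.

```python
def parse_sequence(text):
    """Parse a string such as '<a(abc)(ac)d(cf)>' into a list of frozensets.

    Assumes single-character items and well-formed brackets.
    """
    body = text.strip()
    if body.startswith("<") and body.endswith(">"):
        body = body[1:-1]
    elements = []
    group = None  # collects items while we are inside round brackets
    for ch in body:
        if ch == "(":
            group = set()
        elif ch == ")":
            elements.append(frozenset(group))
            group = None
        elif group is not None:
            group.add(ch)
        else:
            # A bare item is a sequence item of size one.
            elements.append(frozenset(ch))
    return elements


# Table 2 as a mapping from sequence ID to sequence.
database = {
    1: parse_sequence("<a(abc)(ac)d(cf)>"),
    2: parse_sequence("<(ad)c(bc)(ae)>"),
    3: parse_sequence("<(ef)(ab)(df)cb>"),
    4: parse_sequence("<eg(af)cbc>"),
}
```

With this representation, `(af)` and `(fa)` parse to the same sequence item, whereas swapping two sequence items yields a different sequence.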

In the example sequences, note the round brackets that group individual items into a sequence item. Also note that we drop these redundant brackets if the sequence item consists of a single item. Importantly, the notion of a subsequence requires a little more care than for unordered structures. We call $u = \langle u_1, \ldots, u_n \rangle$ a subsequence of $s = \langle s_1, \ldots, s_l \rangle$, and write $u < s$, if there are indices $1 \le i_1 < i_2 < \ldots < i_n \le l$ so that we have the following:

$u_1 < s_{i_1}, \ldots, u_n < s_{i_n}$

Here, the < signs in the last line mean that $u_j$ is a subpattern of $s_{i_j}$. Roughly speaking, $u$ is a subsequence of $s$ if all the elements of $u$ are subpatterns of elements of $s$, in their given order. Equivalently, we call $s$ a supersequence of $u$. In the preceding example, we see that <a(ab)ac> and <a(cb)(ac)dc> are examples of subsequences of <a(abc)(ac)d(cf)> and that <(fa)c> is an example of a subsequence of <eg(af)cbc>.
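The subsequence definition can be sketched as a short check. This is an illustrative helper, not part of any library: the name `is_subsequence` is hypothetical, and sequences are assumed to be Python lists of sets, one set per sequence item. The function matches each element of `u` against the earliest remaining element of `s` that contains it, which is sufficient to decide whether suitable indices exist.

```python
def is_subsequence(u, s):
    """Return True if u is a subsequence of s (both are lists of sets)."""
    pos = 0  # next position in s that may still be used
    for element in u:
        # Skip elements of s that do not contain every item of `element`.
        while pos < len(s) and not element <= s[pos]:
            pos += 1
        if pos == len(s):
            return False  # ran out of s before matching all of u
        pos += 1  # indices must be strictly increasing
    return True


# Sequences 1 and 4 from Table 2.
s1 = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]
s4 = [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}]

# The three examples from the text.
u1 = [{"a"}, {"a", "b"}, {"a"}, {"c"}]            # <a(ab)ac>
u2 = [{"a"}, {"c", "b"}, {"a", "c"}, {"d"}, {"c"}]  # <a(cb)(ac)dc>
u3 = [{"f", "a"}, {"c"}]                          # <(fa)c>

print(is_subsequence(u1, s1), is_subsequence(u2, s1), is_subsequence(u3, s4))
# Order matters: <ca> is not a subsequence of sequence 4.
print(is_subsequence([{"c"}, {"a"}], s4))
```

The set comparison `element <= s[pos]` is the subpattern test, so the order of items inside a sequence item plays no role, while the advancing `pos` enforces the order of the sequence items themselves.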

With the help of the notion of supersequences, we can now define the support of a sequence s in a given sequence database S as follows:

$\mathrm{supp}_S(s) = \mathrm{supp}(s) = \dfrac{|\{ s' \in S \mid s < s' \}|}{|S|}$

Note that, structurally, this is the same definition as for plain unordered patterns, but the < symbol now means something else, namely the subsequence relation. As before, we drop the database subscript in the notation of support if the database is clear from the context. Equipped with a notion of support, the definition of sequential patterns is completely analogous to the previous one. Given a minimum support threshold $t$, a sequence $s$ is said to be a sequential pattern in $S$ if $\mathrm{supp}(s)$ is greater than or equal to $t$. The formalization of the third question is called the sequential pattern mining problem, that is, find the full set of sequences that are sequential patterns in $S$ for a given threshold $t$.

Even in our little example with just four sequences, it can already be challenging to manually inspect all the sequential patterns. To give just one example of a sequential pattern of support 1.0, <ac> is a subsequence of length 2 of all four sequences. Finding all the sequential patterns is an interesting problem, and in the following section we will learn about the so-called PrefixSpan algorithm that Spark employs to address it.
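The support of a candidate sequence, and the claim about <ac>, can be checked with a few lines of plain Python. This is a sketch under stated assumptions and not how Spark computes support: the names `is_subsequence`, `support`, and `database` are hypothetical, items are single characters, and the threshold of 0.5 is chosen purely for illustration.

```python
def is_subsequence(u, s):
    """Return True if u is a subsequence of s (both are lists of sets)."""
    remaining = iter(s)
    # For each element of u, consume s until an element containing it is found.
    return all(any(element <= candidate for candidate in remaining)
               for element in u)


def support(pattern, database):
    """Fraction of sequences in the database that are supersequences of pattern."""
    matches = sum(1 for s in database.values() if is_subsequence(pattern, s))
    return matches / len(database)


# Table 2; set("abc") builds the sequence item (abc).
database = {
    1: [set("a"), set("abc"), set("ac"), set("d"), set("cf")],
    2: [set("ad"), set("c"), set("bc"), set("ae")],
    3: [set("ef"), set("ab"), set("df"), set("c"), set("b")],
    4: [set("e"), set("g"), set("af"), set("c"), set("b"), set("c")],
}

candidates = {
    "<ac>": [set("a"), set("c")],
    "<(ab)c>": [set("ab"), set("c")],
    "<eg>": [set("e"), set("g")],
}

threshold = 0.5  # illustrative minimum support
for name, pattern in candidates.items():
    value = support(pattern, database)
    print(name, value, "pattern" if value >= threshold else "not a pattern")
```

Checking a handful of hand-picked candidates like this is easy; the hard part of the mining problem is that the set of possible candidates grows very quickly, which is why a dedicated algorithm is needed.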
