Analyzing supermarket shopping baskets

As an example, we will look at a dataset consisting of anonymous transactions at a supermarket in Belgium. This dataset was made available by Tom Brijs at Hasselt University. Because of privacy concerns, the data has been anonymized, so we only have a number for each product, and each basket therefore consists of a set of numbers. The data file is available from several online sources (including this book's companion website).

We begin by loading the dataset and looking at some statistics (this is always a good idea):

from collections import defaultdict 
from itertools import chain 
 
# File is downloaded as a compressed file 
import gzip 
# file format is a line per transaction 
# of the form '12 34 342 5...' 
dataset = [[int(tok) for tok in line.strip().split()] 
        for line in gzip.open('retail.dat.gz')] 
# It is more convenient to work with sets 
dataset = [set(d) for d in dataset] 
# count how often each product was purchased: 
counts = defaultdict(int) 
for elem in chain(*dataset): 
    counts[elem] += 1 

We can see the resulting counts summarized in the following table:

Number of times bought    Number of products
Just once                 2,224
2 or 3                    2,438
4 to 7                    2,508
8 to 15                   2,251
16 to 31                  2,182
32 to 63                  1,940
64 to 127                 1,523
128 to 511                1,225
512 or more                 179

There are many products that have only been bought a few times. For example, 33 percent of products were bought four or fewer times. However, this represents only 1 percent of purchases. This phenomenon, where many products are only purchased a small number of times, is sometimes labeled the long tail, and has only become more prominent as the internet makes it cheaper to stock and sell niche items. In order to be able to provide recommendations for these products, we would need a lot more data.
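To see where figures like these come from, we can bin the values of the counts dictionary ourselves. The following sketch uses a tiny invented counts dictionary in place of the real one, but the same binning applies to the full dataset:

```python
from collections import defaultdict

# Toy stand-in for the real `counts` dictionary (product id -> number
# of baskets containing it); these values are invented for illustration.
counts = {0: 1, 1: 3, 2: 5, 3: 9, 4: 20, 5: 40, 6: 100, 7: 300, 8: 600}

# Bucket boundaries matching the table above
buckets = [(1, 1), (2, 3), (4, 7), (8, 15), (16, 31),
           (32, 63), (64, 127), (128, 511), (512, float('inf'))]
histogram = defaultdict(int)
for c in counts.values():
    for lo, hi in buckets:
        if lo <= c <= hi:
            histogram[(lo, hi)] += 1
            break

# The long tail: which fraction of products is rarely bought, and
# which fraction of all purchases do they account for?
rare = [c for c in counts.values() if c <= 4]
frac_products = len(rare) / len(counts)
frac_purchases = sum(rare) / sum(counts.values())
```

On the real data, frac_products comes out much larger than frac_purchases, which is exactly the long-tail effect described above.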

There are a few open source implementations of basket analysis algorithms out there, but none that are well integrated with scikit-learn or any of the other packages we have been using. Therefore, we are going to implement one classic algorithm ourselves. This algorithm is called the Apriori algorithm, and it is a bit old (it was published in 1994 by Rakesh Agrawal and Ramakrishnan Srikant), but it still works (algorithms, of course, never stop working; they just get superseded by better ideas).

The Apriori algorithm takes a collection of sets (that is, your shopping baskets) and returns sets that are very frequent as subsets (that is, items that together are part of many shopping baskets).

The algorithm works using a bottom-up approach: starting with the smallest candidates (those composed of a single element), it builds up, adding one element at a time. The algorithm takes a set of baskets and the minimum support that should be considered (a parameter we will call minsupport). The first step is to consider all itemsets with just one element (singletons) that have minimal support. Then, these are combined in all possible ways to build up two-element itemsets. These are filtered in order to keep only those that have minimal support. Then, all possible three-element itemsets are considered, those with minimal support are kept, and so on. The trick of Apriori is that when building a larger itemset, it only needs to combine smaller sets that are themselves frequent: any itemset containing an infrequent subset cannot be frequent, so it can be pruned without ever being counted.
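The pruning idea is easiest to see on a toy example: an itemset can only be frequent if all of its subsets are frequent, so each level is built only from the survivors of the previous one. Here is a minimal sketch with invented baskets (not the real dataset):

```python
# Invented toy baskets (not the real retail data)
baskets = [{'bread', 'milk'}, {'bread', 'butter'},
           {'bread', 'milk', 'butter'}, {'milk', 'butter'}]
minsupport = 2

def support(itemset):
    # Number of baskets that contain every item in `itemset`
    return sum(1 for b in baskets if b.issuperset(itemset))

# Level 1: singletons with minimal support
items = {i for b in baskets for i in b}
frequent1 = [frozenset([i]) for i in items
             if support(frozenset([i])) >= minsupport]

# Level 2: candidates are built only from frequent singletons;
# anything containing an infrequent item is never even generated
frequent2 = []
for a in frequent1:
    for b in frequent1:
        c = a | b
        if len(c) == 2 and c not in frequent2 and support(c) >= minsupport:
            frequent2.append(c)
```

In this toy run all three items survive level 1, so all three pairs get tested at level 2; with a higher minsupport, fewer singletons would survive and the candidate space would shrink accordingly.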

The following diagram presents a schematic view of the algorithm:

We shall now implement this algorithm in code. We need to define the minimum support we are looking for:

minsupport = 100 

Support is the number of times a set of products was purchased together.

The goal of Apriori is to find itemsets with high support. Logically, any itemset with at least minimal support can only be composed of items that themselves have at least minimal support:

valid = set(k for k,v in counts.items() 
          if (v >= minsupport)) 

Our initial itemsets are singletons (sets with a single element). In particular, all singletons that have at least minimal support are frequent itemsets:

itemsets = [frozenset([v]) for v in valid] 

We need to set up an index to speed up computation using the following code:

baskets = defaultdict(set) 
 
for i, ds in enumerate(dataset):                   
    for ell in ds: 
        baskets[ell].add(i) 
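To see what the index buys us, here is a small illustration with an invented toy dataset: for any candidate itemset, the union of the index entries of its elements gives a shortlist of transactions to check, so we never scan baskets that contain none of the candidate's items:

```python
from collections import defaultdict

# Invented toy dataset of baskets (sets of product ids)
dataset = [{1, 2, 3}, {2, 3}, {1, 3}, {2, 4}]

# Inverted index: product id -> indices of transactions containing it
baskets = defaultdict(set)
for i, ds in enumerate(dataset):
    for ell in ds:
        baskets[ell].add(i)

# Shortlist of transactions that could contain the candidate {2, 3}:
# the union of the index entries of its elements
candidate = frozenset([2, 3])
shortlist = set()
for elem in candidate:
    shortlist.update(baskets[elem])
# Only the shortlisted transactions need the full superset check
support = sum(1 for d in shortlist if dataset[d].issuperset(candidate))
```

Intersecting the index entries instead of taking their union would narrow the shortlist down to exact matches; the union is cheaper to build, but then requires the final issuperset check.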

That is, baskets[i] contains the indices of all the transactions in the dataset that include product i. Now, our loop is given as follows:

itemsets = [frozenset([v]) for v in valid] 
freqsets = [] 
for i in range(16): 
    print(i) 
    nextsets = [] 
    tested = set() 
    for it in itemsets: 
        for v in valid: 
            if v not in it: 
                # Create a new candidate set by adding v to it 
                c = (it | frozenset([v])) 
                # check if we have tested it already 
                if c in tested: 
                    continue 
                tested.add(c) 
 
                candidates = set() 
                for elem in c: 
                    candidates.update(baskets[elem]) 
                support_c = sum(1 for d in candidates 
                                if dataset[d].issuperset(c)) 
                if support_c > minsupport: 
                    nextsets.append(c) 
    freqsets.extend(nextsets) 
    itemsets = nextsets 
    if not len(itemsets): 
        break 
print("Finished!") 

The Apriori algorithm returns frequent itemsets—that is, baskets that are present above a certain threshold (given by the minsupport variable in the code).
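Once the loop finishes, freqsets contains all the frequent itemsets found across the levels. To inspect the results, one option is to recompute each set's support and sort by it; the following is a sketch on an invented toy dataset, not the real retail data:

```python
# Invented toy data: frozen baskets and some frequent itemsets found
dataset = [frozenset(b) for b in
           ({1, 2}, {1, 2, 3}, {1, 3}, {2, 3}, {1, 2, 3})]
freqsets = [frozenset([1]), frozenset([2]),
            frozenset([1, 2]), frozenset([1, 2, 3])]

def support(c):
    # Recompute how many baskets contain the whole itemset
    return sum(1 for d in dataset if d.issuperset(c))

# Most frequent itemsets first
ranked = sorted(freqsets, key=support, reverse=True)
```

Larger itemsets necessarily rank at or below their subsets, since every basket containing a set also contains all of its subsets.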
