Analyzing supermarket shopping baskets

As an example, we will look at a dataset consisting of anonymous transactions at a supermarket in Belgium. This dataset was made available by Tom Brijs at Hasselt University. Because of privacy concerns, the data has been anonymized, so we only have a number for each product, and each basket therefore consists of a set of numbers. The data file is available from several online sources (including this book's companion website).

We begin by loading the dataset and looking at some statistics (this is always a good idea):

from collections import defaultdict 
from itertools import chain 
 
# File is downloaded as a compressed file 
import gzip 
# file format is a line per transaction 
# of the form '12 34 342 5...' 
dataset = [[int(tok) for tok in line.strip().split()] 
        for line in gzip.open('retail.dat.gz')] 
# It is more convenient to work with sets 
dataset = [set(d) for d in dataset] 
# count how often each product was purchased: 
counts = defaultdict(int) 
for elem in chain(*dataset): 
    counts[elem] += 1 

We can see the resulting counts summarized in the following table:

Number of times bought    Number of products
Just once                 2,224
2 or 3                    2,438
4 to 7                    2,508
8 to 15                   2,251
16 to 31                  2,182
32 to 63                  1,940
64 to 127                 1,523
128 to 511                1,225
512 or more                 179

There are many products that have only been bought a few times. For example, 33 percent of products were bought four or fewer times. However, this represents only 1 percent of purchases. This phenomenon, where many products are only purchased a small number of times, is sometimes labeled the long tail, and has only become more prominent as the internet makes it cheaper to stock and sell niche items. In order to be able to provide recommendations for these products, we would need a lot more data.
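To see where figures like these come from, we can bin the values of the counts dictionary ourselves. The following sketch uses a tiny invented counts dictionary in place of the real one, but the same binning applies to the full dataset:

```python
from collections import defaultdict

# Toy stand-in for the real `counts` dictionary (product id -> number
# of baskets containing it); these values are invented for illustration.
counts = {0: 1, 1: 3, 2: 5, 3: 9, 4: 20, 5: 40, 6: 100, 7: 300, 8: 600}

# Bucket boundaries matching the table above
buckets = [(1, 1), (2, 3), (4, 7), (8, 15), (16, 31),
           (32, 63), (64, 127), (128, 511), (512, float('inf'))]
histogram = defaultdict(int)
for c in counts.values():
    for lo, hi in buckets:
        if lo <= c <= hi:
            histogram[(lo, hi)] += 1
            break

# The long tail: which fraction of products is rarely bought, and
# which fraction of all purchases do they account for?
rare = [c for c in counts.values() if c <= 4]
frac_products = len(rare) / len(counts)
frac_purchases = sum(rare) / sum(counts.values())
```

On the real data, frac_products comes out much larger than frac_purchases, which is exactly the long-tail effect described above.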

There are a few open source implementations of basket analysis algorithms out there, but none that are well integrated with scikit-learn or any of the other packages we have been using. Therefore, we are going to implement one classic algorithm ourselves. This algorithm is called the Apriori algorithm, and it is a bit old (it was published in 1994 by Rakesh Agrawal and Ramakrishnan Srikant), but it still works (algorithms, of course, never stop working; they just get superseded by better ideas).

The Apriori algorithm takes a collection of sets (that is, your shopping baskets) and returns sets that are very frequent as subsets (that is, items that together are part of many shopping baskets).

The algorithm works using a bottom-up approach: starting with the smallest candidates (those composed of a single element), it builds up, adding one element at a time. The algorithm takes a set of baskets and the minimum support that should be considered (a parameter we will call minsupport). The first step is to consider all itemsets with just one element (singletons) that have minimal support. Then, these are combined in all possible ways to build up two-element itemsets. These are filtered in order to keep only those that have minimal support. Then, all possible three-element itemsets are considered, those with minimal support are kept, and so on. The trick of Apriori is that when building a larger itemset, it only needs to combine smaller sets that are themselves frequent: any itemset containing an infrequent subset cannot be frequent, so it can be pruned without ever being counted.
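The pruning idea is easiest to see on a toy example: an itemset can only be frequent if all of its subsets are frequent, so each level is built only from the survivors of the previous one. Here is a minimal sketch with invented baskets (not the real dataset):

```python
# Invented toy baskets (not the real retail data)
baskets = [{'bread', 'milk'}, {'bread', 'butter'},
           {'bread', 'milk', 'butter'}, {'milk', 'butter'}]
minsupport = 2

def support(itemset):
    # Number of baskets that contain every item in `itemset`
    return sum(1 for b in baskets if b.issuperset(itemset))

# Level 1: singletons with minimal support
items = {i for b in baskets for i in b}
frequent1 = [frozenset([i]) for i in items
             if support(frozenset([i])) >= minsupport]

# Level 2: candidates are built only from frequent singletons;
# anything containing an infrequent item is never even generated
frequent2 = []
for a in frequent1:
    for b in frequent1:
        c = a | b
        if len(c) == 2 and c not in frequent2 and support(c) >= minsupport:
            frequent2.append(c)
```

In this toy run all three items survive level 1, so all three pairs get tested at level 2; with a higher minsupport, fewer singletons would survive and the candidate space would shrink accordingly.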

The following diagram presents a schematic view of the algorithm:

We shall now implement this algorithm in code. We need to define the minimum support we are looking for:

minsupport = 100 

Support is the number of times a set of products was purchased together.

The goal of Apriori is to find itemsets with high support. Logically, any itemset with at least minimal support can only be composed of items that themselves have at least minimal support:

valid = set(k for k,v in counts.items() 
          if (v >= minsupport)) 

Our initial itemsets are singletons (sets with a single element). In particular, all singletons that have at least minimal support are frequent itemsets:

itemsets = [frozenset([v]) for v in valid] 

We need to set up an index to speed up computation using the following code:

baskets = defaultdict(set) 
 
for i, ds in enumerate(dataset):                   
    for ell in ds: 
        baskets[ell].add(i) 
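To see what the index buys us, here is a small illustration with an invented toy dataset: for any candidate itemset, the union of the index entries of its elements gives a shortlist of transactions to check, so we never scan baskets that contain none of the candidate's items:

```python
from collections import defaultdict

# Invented toy dataset of baskets (sets of product ids)
dataset = [{1, 2, 3}, {2, 3}, {1, 3}, {2, 4}]

# Inverted index: product id -> indices of transactions containing it
baskets = defaultdict(set)
for i, ds in enumerate(dataset):
    for ell in ds:
        baskets[ell].add(i)

# Shortlist of transactions that could contain the candidate {2, 3}:
# the union of the index entries of its elements
candidate = frozenset([2, 3])
shortlist = set()
for elem in candidate:
    shortlist.update(baskets[elem])
# Only the shortlisted transactions need the full superset check
support = sum(1 for d in shortlist if dataset[d].issuperset(candidate))
```

Intersecting the index entries instead of taking their union would narrow the shortlist down to exact matches; the union is cheaper to build, but then requires the final issuperset check.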

That is, baskets[i] contains the indices of all the transactions in the dataset that include product i. Now, our loop is given as follows:

itemsets = [frozenset([v]) for v in valid] 
freqsets = [] 
for i in range(16): 
    print(i) 
    nextsets = [] 
    tested = set() 
    for it in itemsets: 
        for v in valid: 
            if v not in it: 
                # Create a new candidate set by adding v to it 
                c = (it | frozenset([v])) 
                # check if we have tested it already 
                if c in tested: 
                    continue 
                tested.add(c) 
 
                candidates = set() 
                for elem in c: 
                    candidates.update(baskets[elem]) 
                support_c = sum(1 for d in candidates 
                                if dataset[d].issuperset(c)) 
                if support_c > minsupport: 
                    nextsets.append(c) 
    freqsets.extend(nextsets) 
    itemsets = nextsets 
    if not len(itemsets): 
        break 
print("Finished!") 

The Apriori algorithm returns frequent itemsets—that is, baskets that are present above a certain threshold (given by the minsupport variable in the code).
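Once the loop finishes, freqsets contains all the frequent itemsets found across the levels. To inspect the results, one option is to recompute each set's support and sort by it; the following is a sketch on an invented toy dataset, not the real retail data:

```python
# Invented toy data: frozen baskets and some frequent itemsets found
dataset = [frozenset(b) for b in
           ({1, 2}, {1, 2, 3}, {1, 3}, {2, 3}, {1, 2, 3})]
freqsets = [frozenset([1]), frozenset([2]),
            frozenset([1, 2]), frozenset([1, 2, 3])]

def support(c):
    # Recompute how many baskets contain the whole itemset
    return sum(1 for d in dataset if d.issuperset(c))

# Most frequent itemsets first
ranked = sorted(freqsets, key=support, reverse=True)
```

Larger itemsets necessarily rank at or below their subsets, since every basket containing a set also contains all of its subsets.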
