Modeling and evaluation

We will start by mining the data for the overall association rules before moving on to rules specific to beer. Throughout the modeling process, we will use the Apriori algorithm, available through the appropriately named apriori() function in the arules package. The two main things that we need to specify in the function are the dataset and the parameters. As for the parameters, you will need to apply judgment when specifying the minimum support and confidence and the minimum and/or maximum number of items in an itemset. Using the item frequency plots along with trial and error, let's set the minimum support at 1 in 1,000 transactions and the minimum confidence at 90 percent. Additionally, let's cap the number of items in a rule at four. The following is the code to create the object that we will call rules:

> rules = apriori(Groceries, parameter = list(supp = 0.001, conf = 0.9, maxlen=4))

Calling the object shows how many rules the algorithm produced:

> rules
set of 67 rules
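
To make the support and confidence thresholds concrete, here is a minimal base R sketch that computes support, confidence, and lift by hand for a single rule. The five baskets and the support() helper are made up for illustration; they are not taken from the Groceries data, and this is not how arules computes the measures internally.

```r
# Toy baskets (hypothetical data, not taken from Groceries)
baskets <- list(
  c("beer", "wine", "liquor"),
  c("beer", "wine"),
  c("milk", "bread"),
  c("beer", "milk"),
  c("wine", "bread")
)

# Fraction of baskets that contain every item in `items`
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- "wine"
rhs <- "beer"

supp_rule <- support(c(lhs, rhs))      # share of baskets with both sides of the rule
conf_rule <- supp_rule / support(lhs)  # share of lhs baskets that also contain rhs
lift_rule <- conf_rule / support(rhs)  # confidence relative to the baseline rate of rhs

c(support = supp_rule, confidence = conf_rule, lift = lift_rule)
```

A lift above 1 means the left-hand side makes the right-hand side more likely than its baseline frequency would suggest, which is why the rules are sorted by lift in what follows.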

There are a number of ways to examine the rules. The first thing that I recommend is to set the number of displayed digits to only two with the options() function in base R. Then, sort and inspect the top five rules based on the lift that they provide, as follows:

> options(digits=2)

> rules = sort(rules, by="lift", decreasing=TRUE)

> inspect(rules[1:5])
  lhs                     rhs                support confidence lift
1 {liquor,                                                          
   red/blush wine}     => {bottled beer}      0.0019       0.90 11.2
2 {root vegetables,                                                 
   butter,                                                          
   cream cheese }      => {yogurt}            0.0010       0.91  6.5
3 {citrus fruit,                                                    
   root vegetables,                                                 
   soft cheese}        => {other vegetables}  0.0010       1.00  5.2
4 {pip fruit,                                                       
   whipped/sour cream,                                              
   brown bread}        => {other vegetables}  0.0011       1.00  5.2
5 {butter,                                                          
   whipped/sour cream,                                              
   soda}               => {other vegetables}  0.0013       0.93  4.8

Lo and behold, the rule that provides the best overall lift is the one where a purchase of liquor and red/blush wine raises the probability of purchasing bottled beer. I have to admit that this is pure chance and not intended on my part. As I always say, it is better to be lucky than good. It is not a very common transaction, though, with a support of only 1.9 per 1,000.

You can also sort by support or confidence, so let's have a look at the first five rules sorted by confidence in descending order, as follows:

> rules = sort(rules, by="confidence", decreasing=TRUE)
> inspect(rules[1:5])
  lhs                     rhs                support confidence lift
1 {citrus fruit,                                                    
   root vegetables,                                                 
   soft cheese}        => {other vegetables}  0.0010          1  5.2
2 {pip fruit,                                                       
   whipped/sour cream,                                              
   brown bread}        => {other vegetables}  0.0011          1  5.2
3 {rice,                                                            
   sugar}              => {whole milk}        0.0012          1  3.9
4 {canned fish,                                                     
   hygiene articles}   => {whole milk}        0.0011          1  3.9
5 {root vegetables,                                                 
   butter,                                                          
   rice}               => {whole milk}        0.0010          1  3.9

You can see in the table that the confidence for these rules is 100 percent. Moving on to our specific study of beer, we can use the crossTable() function in arules to develop cross tabulations and then examine whatever suits our needs. The first step is to create a table with our dataset:

> table = crossTable(Groceries)

With table created, we can now examine the joint occurrences between the items. Here, we will look at just the first three rows and columns:

> table[1:3, 1:3]
            frankfurter sausage liver loaf
frankfurter         580      99          7
sausage              99     924         10
liver loaf            7      10         50

As you might imagine, shoppers selected liver loaf in only 50 of the 9,835 transactions. Additionally, of the 924 times that people gravitated toward sausage, 10 times they felt compelled to grab liver loaf as well. (Desperate times call for desperate measures!) If you want to look at a specific item, you can either specify the row and column number or just spell that item out:

> table["bottled beer","bottled beer"]
[1] 792

This tells us that 792 transactions included bottled beer. Let's see what the joint occurrence is between bottled beer and canned beer:

> table["bottled beer","canned beer"]
[1] 26

I expected this to be low, and it supports my idea that people lean toward drinking beer from either a bottle or a can; I know I do.
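
To see where numbers like these come from, the following base R sketch builds the same kind of co-occurrence matrix from a small binary incidence matrix. The four transactions are hypothetical, and crossprod() is used here only to mirror the structure of the output above; it is not presented as the actual implementation of crossTable().

```r
# Hypothetical incidence matrix: rows are transactions, columns are items,
# 1 means the item is in the basket (made-up data, not Groceries)
m <- matrix(
  c(1, 1, 0,
    1, 0, 0,
    0, 1, 1,
    1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("bottled beer", "canned beer", "liquor"))
)

# t(m) %*% m: the diagonal holds each item's transaction count,
# the off-diagonal cells hold the joint occurrences of two items
co <- crossprod(m)

co["bottled beer", "bottled beer"]  # transactions containing bottled beer
co["bottled beer", "canned beer"]   # transactions containing both beers
```

The matrix is symmetric, so the row and column order of a lookup does not matter, just as with the table object above.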

We can now move on and derive specific rules for bottled beer. We will again use the apriori() function, but this time, we will add the appearance argument. With it, we specify that bottled beer must be on the right-hand side of each rule, while any other item may appear on the left-hand side as something that increases the probability of a bottled beer purchase. In the following code, notice that I've adjusted the support and confidence numbers. Feel free to experiment with your own settings.

> beer.rules = apriori(data=Groceries,parameter=list(support=0.0015,confidence=0.3), appearance =list(default="lhs",rhs="bottled beer"))

> beer.rules
set of 4 rules

So, we find ourselves with only four association rules. We saw one of them already; now let's have a look at all four in descending order by lift:

> beer.rules = sort(beer.rules, decreasing=TRUE,by="lift")

> inspect(beer.rules)
  lhs                   rhs            support confidence lift
1 {liquor,                                                    
   red/blush wine}   => {bottled beer}  0.0019       0.90 11.2
2 {liquor}           => {bottled beer}  0.0047       0.42  5.2
3 {soda,                                                      
   red/blush wine}   => {bottled beer}  0.0016       0.36  4.4
4 {other vegetables,                                          
   red/blush wine}   => {bottled beer}  0.0015       0.31  3.8
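
As a sanity check, the lift of the two liquor rules can be approximately reproduced from figures already reported: 792 bottled beer transactions out of 9,835 in total, and the rounded confidences printed in the table. This is a sketch that assumes lift is confidence divided by the baseline frequency of the right-hand side; the variable names are my own, and because the printed confidences are rounded to two digits, the last digit of such a check can differ for other rules.

```r
n_transactions <- 9835  # total number of transactions
n_bottled_beer <- 792   # transactions that include bottled beer

# Baseline probability of bottled beer, roughly 8 percent
baseline <- n_bottled_beer / n_transactions

# Rounded confidences copied from the inspect() output
confidence <- c(liquor_wine = 0.90, liquor = 0.42)

# Lift: how much more likely bottled beer is given the left-hand side
lift <- confidence / baseline
round(lift, 1)
```

In other words, a shopper buying liquor and red/blush wine is roughly 11 times more likely to buy bottled beer than a shopper picked at random.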

In all of the instances, the purchase of bottled beer is associated with booze, either liquor, red/blush wine, or both, which is no surprise to anyone. What is interesting is that white wine is not in the mix here. Let's take a closer look at this and compare the joint occurrences of bottled beer and types of wine:

> table["bottled beer", "red/blush wine"]
[1] 48

> table["red/blush wine", "red/blush wine"]
[1] 189

> 48/189
[1] 0.25

> table["white wine", "white wine"]
[1] 187

> table["bottled beer", "white wine"]
[1] 22

> 22/187
[1] 0.12

It's interesting that 25 percent of the time when someone purchased red wine, they also purchased bottled beer, but with white wine, a joint purchase happened in only 12 percent of the instances. We certainly don't know the why from this analysis, but this could help us determine how we should position our product in this grocery store. One other thing before we move on is to look at a plot of the rules. This is done with the plot() function in the arulesViz package. There are many graphic options available. For this example, let's specify that we want a graph that shows the lift the rules provide, shaded by confidence. The following syntax will provide this accordingly:

> plot(beer.rules, method="graph", measure="lift",shading="confidence")

The following is the output of the preceding command:

[Figure: graph plot of beer.rules, with circle size showing lift and shading showing confidence]

This graph shows that liquor/red wine provides the best lift and the highest level of confidence, indicated by the size of the circle and its shading, respectively.

This simple analysis has shown how easy it is to conduct a market basket analysis with R. It doesn't take much imagination to figure out the analytical possibilities of this technique, for example, incorporating customer segmentation or longitudinal purchase history, or using the results in ad displays, copromotions, and so on. Now, let's move on to a situation where the customers are rating the items.
