We will start by mining the data for the overall association rules before moving on to rules for beer specifically. Throughout the modeling process, we will use the apriori algorithm, which is implemented in the appropriately named apriori() function in the arules package. The two main things that we need to specify in the function are the dataset and the parameters. As for the parameters, you will need to apply judgment in specifying the minimum support and confidence and the minimum and/or maximum number of items in an itemset. Using the item frequency plots along with trial and error, let's set the minimum support at 1 in 1,000 transactions and the minimum confidence at 90 percent. Additionally, let's set the maximum number of items to be associated at four. The following is the code to create the object that we will call rules:
> rules = apriori(Groceries, parameter = list(supp = 0.001, conf = 0.9, maxlen=4))
Calling the object shows how many rules the algorithm produced:
> rules
set of 67 rules
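To get a feel for what a minimum support of 0.001 means in absolute terms, the threshold can be converted into a basket count. The following is a minimal sketch in base R, assuming the 9,835 transactions in the Groceries dataset; the variable names are illustrative and not part of arules:

```r
# Translate the relative support threshold into an absolute basket count.
n_transactions <- 9835   # number of transactions in Groceries
min_support    <- 0.001  # same value passed to apriori() above

# An itemset must appear in at least this many baskets to qualify.
min_count <- ceiling(min_support * n_transactions)
print(min_count)
```

In other words, the algorithm only considers combinations of items that show up together in about ten baskets or more.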
There are a number of ways to examine the rules. The first thing that I recommend is to set the number of displayed digits to only two with the options() function in base R. Then, sort and inspect the top five rules based on the lift that they provide, as follows:
> options(digits=2)
> rules = sort(rules, by="lift", decreasing=TRUE)
> inspect(rules[1:5])
  lhs                                             rhs                support confidence lift
1 {liquor, red/blush wine}                     => {bottled beer}      0.0019       0.90 11.2
2 {root vegetables, butter, cream cheese }     => {yogurt}            0.0010       0.91  6.5
3 {citrus fruit, root vegetables, soft cheese} => {other vegetables}  0.0010       1.00  5.2
4 {pip fruit, whipped/sour cream, brown bread} => {other vegetables}  0.0011       1.00  5.2
5 {butter, whipped/sour cream, soda}           => {other vegetables}  0.0013       0.93  4.8
Lo and behold, the rule that provides the best overall lift is the effect of purchasing liquor and red wine on the probability of purchasing bottled beer. I have to admit that this is pure chance and not intended on my part. As I always say, it is better to be lucky than good. It is not a very common transaction, though, with a support of only 1.9 per 1,000.
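Lift is a rule's confidence divided by the overall support of its right-hand side, so the figure for the top rule can be checked by hand. The following is a rough check in base R, assuming the rounded confidence printed above and the count of 792 bottled beer transactions out of 9,835 that is reported later in this section. Because the inputs are rounded, the result only approximately matches the inspect() output:

```r
# Lift = confidence(lhs => rhs) / support(rhs)
n_transactions <- 9835
beer_count     <- 792    # transactions containing bottled beer
confidence     <- 0.90   # rounded value printed by inspect()

beer_support <- beer_count / n_transactions   # share of all baskets with bottled beer
lift <- confidence / beer_support
print(round(lift, 1))
```

A lift well above 1 means that baskets with liquor and red wine contain bottled beer far more often than a randomly chosen basket does.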
You can also sort by support and confidence, so let's have a look at the first five rules by="confidence" in descending order, as follows:
> rules = sort(rules, by="confidence", decreasing=TRUE)
> inspect(rules[1:5])
  lhs                                             rhs                support confidence lift
1 {citrus fruit, root vegetables, soft cheese} => {other vegetables}  0.0010          1  5.2
2 {pip fruit, whipped/sour cream, brown bread} => {other vegetables}  0.0011          1  5.2
3 {rice, sugar}                                => {whole milk}        0.0012          1  3.9
4 {canned fish, hygiene articles}              => {whole milk}        0.0011          1  3.9
5 {root vegetables, butter, rice}              => {whole milk}        0.0010          1  3.9
You can see in the table that the confidence for these rules is 100 percent. Moving on to our specific study of beer, we can use the crossTable() function in arules to develop cross tabulations and then examine whatever suits our needs. The first step is to create a table with our dataset:
> table = crossTable(Groceries)
With table created, we can now examine the joint occurrences between the items. Here, we will look at just the first three rows and columns:
> table[1:3, 1:3]
            frankfurter sausage liver loaf
frankfurter         580      99          7
sausage              99     924         10
liver loaf            7      10         50
As you might imagine, shoppers selected liver loaf only 50 times out of the 9,835 transactions. Additionally, of the 924 times that people gravitated toward sausage, 10 times they also felt compelled to grab liver loaf. (Desperate times call for desperate measures!) If you want to look at a specific item, you can either specify the row and column number or just spell that item out:
> table["bottled beer","bottled beer"]
[1] 792
This tells us that there were 792 transactions that included bottled beer. Let's see what the joint occurrence is between bottled beer and canned beer:
> table["bottled beer","canned beer"]
[1] 26
I would expect this to be low as it supports my idea that people lean toward drinking beer from either a bottle or a can; I know I do.
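Raw joint counts are easier to compare once they are turned into shares of each item's own total. The following is a small sketch in base R that rebuilds the three-by-three corner of the cross table printed earlier and divides each row by its diagonal entry; the object names are illustrative, and only the counts quoted in the text are used:

```r
# Rebuild the 3 x 3 corner of the cross table shown earlier.
items  <- c("frankfurter", "sausage", "liver loaf")
counts <- matrix(c(580,  99,  7,
                    99, 924, 10,
                     7,  10, 50),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(items, items))

# Diagonal = baskets containing the item; off-diagonal = baskets with both.
# Dividing each row by its diagonal gives the share of the row item's
# baskets that also contain the column item.
cond_share <- counts / diag(counts)
print(round(cond_share, 3))

# Same idea for the beer counts: share of bottled beer baskets
# that also contain canned beer.
canned_given_bottled <- 26 / 792
print(round(canned_given_bottled, 3))
```

Reading across a row, for example, shows what fraction of sausage baskets also held liver loaf, which is the same quantity as the confidence of a single-item rule.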
We can now move on and derive specific rules for bottled beer. We will again use the apriori() function, but this time, we will add the appearance argument. With it, we specify that we want the left-hand side to contain the items that increase the probability of a purchase of bottled beer, which will be on the right-hand side. In the following code, notice that I've adjusted the support and confidence numbers. Feel free to experiment with your own settings.
> beer.rules = apriori(data=Groceries, parameter=list(support=0.0015, confidence=0.3), appearance=list(default="lhs", rhs="bottled beer"))
> beer.rules
set of 4 rules
So, we find ourselves with only 4 association rules. We saw one of them already; now let's have a look at it alongside the other three rules in descending order by lift:
> beer.rules = sort(beer.rules, decreasing=TRUE, by="lift")
> inspect(beer.rules)
  lhs                                   rhs            support confidence lift
1 {liquor, red/blush wine}           => {bottled beer}  0.0019       0.90 11.2
2 {liquor}                           => {bottled beer}  0.0047       0.42  5.2
3 {soda, red/blush wine}             => {bottled beer}  0.0016       0.36  4.4
4 {other vegetables, red/blush wine} => {bottled beer}  0.0015       0.31  3.8
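Since confidence is the rule's support divided by the support of its left-hand side, the printed values also reveal how often each left-hand side occurs on its own. The following is a back-of-the-envelope sketch in base R, assuming the rounded support and confidence values from the output above and the 9,835 transactions in Groceries; the data frame and its column names are illustrative, and the resulting counts are estimates rather than exact figures:

```r
# Values copied from the inspect(beer.rules) output (rounded).
beer_rules <- data.frame(
  lhs        = c("liquor, red/blush wine", "liquor",
                 "soda, red/blush wine", "other vegetables, red/blush wine"),
  support    = c(0.0019, 0.0047, 0.0016, 0.0015),
  confidence = c(0.90, 0.42, 0.36, 0.31)
)

# confidence = support(lhs and rhs) / support(lhs), so
# support(lhs) = support / confidence.
beer_rules$lhs_support <- beer_rules$support / beer_rules$confidence

# Approximate number of baskets that contain the left-hand side.
beer_rules$lhs_count <- round(beer_rules$lhs_support * 9835)
print(beer_rules)
```

The estimate for the top rule comes out at only a couple of dozen baskets, which is a useful reminder that a high lift can rest on very few transactions.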
In all of the instances, the purchase of bottled beer is associated with booze, either liquor and/or red wine, which is probably no surprise to anyone. What is interesting is that white wine is not in the mix here. Let's take a closer look at this and compare the joint occurrences of bottled beer and the types of wine:
> table["bottled beer", "red/blush wine"]
[1] 48
> table["red/blush wine", "red/blush wine"]
[1] 189
> 48/189
[1] 0.25
> table["white wine", "white wine"]
[1] 187
> table["bottled beer", "white wine"]
[1] 22
> 22/187
[1] 0.12
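The two ratios just computed are the confidences of the single-item rules from each wine to bottled beer. Dividing them by the overall share of baskets with bottled beer turns them into lift, which puts both wines on the same scale as the rules above. The following is a minimal sketch in base R, assuming the counts from the cross table lookups and the 792 bottled beer transactions out of 9,835; wine_lift() is a hypothetical helper, not an arules function:

```r
# Counts taken from the cross table lookups above.
n_transactions <- 9835
beer_count     <- 792

# Lift of {wine} => {bottled beer}:
# P(beer | wine) / P(beer), computed from joint and marginal counts.
wine_lift <- function(joint, wine_count) {
  (joint / wine_count) / (beer_count / n_transactions)
}

red_lift   <- wine_lift(48, 189)
white_lift <- wine_lift(22, 187)
print(round(c(red = red_lift, white = white_lift), 2))
```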
It's interesting that 25 percent of the time when someone purchased red wine, they also purchased bottled beer, but with white wine, a joint purchase happened in only 12 percent of the instances. We certainly don't know the why in this analysis, but this could potentially help us determine how we should position our product in this grocery store.
One other thing before we move on is to look at a plot of the rules. This is done with the plot() function in the arulesViz package. There are many graphic options available. For this example, let's specify that we want a graph that shows the lift that the rules provide, shaded by confidence. The following syntax will provide this accordingly:
> plot(beer.rules, method="graph", measure="lift",shading="confidence")
The following is the output of the preceding command:
This graph shows that liquor/red wine provides the best lift and the highest level of confidence, as indicated by both the size of the circle and its shading.
What we've just done in this simple analysis is show how easy it is to conduct a market basket analysis with R. It doesn't take much imagination to figure out the analytical possibilities of this technique, for example, incorporating customer segmentation and longitudinal purchase history, or using it in ad displays, co-promotions, and so on. Now, let's move on to a situation where the customers are rating the items.