How it works...

The purpose of association mining is to discover associations among items in a transactional database. Typically, the process proceeds by finding all itemsets whose support exceeds the minimum support. Next, it uses these frequent itemsets to generate strong rules (for example, milk => bread: a customer who buys milk is likely to also buy bread) whose confidence exceeds the minimum confidence. By definition, an association rule is expressed in the form X => Y, where X and Y are disjoint itemsets. We can measure the strength of an association with two metrics: support and confidence. Support shows the percentage of transactions to which a rule applies within the dataset, while confidence indicates the probability that a transaction containing X also contains Y:

  • Support(X => Y) = frequency(X ∪ Y) / N
  • Confidence(X => Y) = frequency(X ∪ Y) / frequency(X)

Here, frequency(X ∪ Y) denotes the number of transactions that contain all the items of X and Y together, frequency(X) denotes the number of transactions that contain X, and N denotes the total number of transactions.

As support and confidence measure only the strength of a rule, you might still obtain many redundant rules with high support and confidence. Therefore, we can use a third measure, lift, to evaluate the quality (ranking) of a rule. By definition, lift indicates the strength of a rule relative to the random co-occurrence of X and Y, so we can formulate lift in the following form:

Lift(X => Y) = Support(X ∪ Y) / (Support(X) × Support(Y)) = Confidence(X => Y) / Support(Y)
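The three measures can be computed by hand on a toy transaction database. The following base R sketch (the transactions and item names are invented for illustration) evaluates the rule {milk} => {bread}:

```r
# Toy transaction database: 5 transactions over the items milk, bread, butter.
transactions <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("bread"),
  c("milk", "butter"),
  c("milk", "bread")
)

# Support of an itemset: the fraction of transactions containing all its items.
support <- function(itemset) {
  mean(sapply(transactions, function(t) all(itemset %in% t)))
}

# Measures for the rule {milk} => {bread}.
supp_rule <- support(c("milk", "bread"))   # frequency(X ∪ Y) / N      = 3/5 = 0.6
conf_rule <- supp_rule / support("milk")   # Support(X ∪ Y)/Support(X) = 0.6/0.8 = 0.75
lift_rule <- conf_rule / support("bread")  # Confidence / Support(Y)   = 0.75/0.8 = 0.9375
```

A lift below 1, as here, means milk and bread co-occur less often than expected if they were independent, so the rule would rank poorly despite its reasonable confidence.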

Apriori is the best-known algorithm for mining association rules; it performs a level-wise, breadth-first search to count candidate itemsets. The process starts by finding frequent itemsets (sets of items that meet the minimum support) level by level. For example, it first finds the frequent 1-itemsets, then uses the frequent 1-itemsets to find the frequent 2-itemsets. It iteratively discovers frequent (k+1)-itemsets from the frequent k-itemsets until no new frequent itemsets are found.
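The level-wise search can be sketched in a few lines of base R. This is a simplified illustration of the idea, not the optimized implementation used by the arules package; the function name and the toy transactions are invented for this example:

```r
# Minimal level-wise (Apriori-style) frequent itemset search:
# find all itemsets appearing in at least `minsup` transactions.
frequent_itemsets <- function(transactions, minsup = 2) {
  count <- function(itemset) {
    sum(sapply(transactions, function(t) all(itemset %in% t)))
  }
  # Level 1: frequent single items.
  items  <- sort(unique(unlist(transactions)))
  Lk     <- Filter(function(s) count(s) >= minsup, as.list(items))
  result <- Lk
  # Level k+1: join pairs of frequent k-itemsets, keep candidates
  # that still meet the minimum support, and repeat.
  while (length(Lk) > 1) {
    k <- length(Lk[[1]])
    candidates <- list()
    for (i in seq_along(Lk)) {
      for (j in seq_along(Lk)) {
        if (i < j) {
          u <- sort(union(Lk[[i]], Lk[[j]]))
          if (length(u) == k + 1) candidates <- c(candidates, list(u))
        }
      }
    }
    Lk     <- Filter(function(s) count(s) >= minsup, unique(candidates))
    result <- c(result, Lk)
  }
  result
}

# Toy database: frequent itemsets at minimum support 2 are
# {a}, {b}, {c}, {a,b}, {b,c}.
tx  <- list(c("a", "b"), c("b", "c"), c("a", "b", "c"), c("a", "b"))
fis <- frequent_itemsets(tx, minsup = 2)
```

The key Apriori insight is that any superset of an infrequent itemset is itself infrequent, so each level only needs to extend the survivors of the previous one.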

Finally, the process utilizes frequent itemsets to generate association rules:

[Figure: An illustration of the Apriori algorithm (minimum support = 2)]

In this recipe, we use the Apriori algorithm to find association rules within transactions. We use the built-in Groceries dataset, which contains one month of real-world point-of-sale transaction data from a typical grocery outlet. We then use the summary function to obtain the summary statistics of the Groceries dataset. The summary statistics show that the dataset contains 9,835 transactions over 169 item categories. In addition, the summary shows information such as the most frequent items, the itemset distribution, and example extended item information. We can then use itemFrequencyPlot to visualize the five most frequent items with support over 0.1.
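The exploration steps above look roughly as follows; this assumes the arules package is installed, and the exact plot parameters are one reasonable choice rather than the only one:

```r
# Assumes install.packages("arules") has been run.
library(arules)

# Load the built-in grocery point-of-sale data: 9,835 transactions,
# 169 item categories.
data(Groceries)

# Summary statistics: most frequent items, itemset (transaction length)
# distribution, and extended item information.
summary(Groceries)

# Visualize the most frequent items; support = 0.1 keeps only items
# appearing in at least 10% of transactions, topN = 5 limits the bars.
itemFrequencyPlot(Groceries, support = 0.1, topN = 5)
```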

Next, we apply the Apriori algorithm to search for rules with support over 0.001 and confidence over 0.5. We then use the summary function to inspect detailed information on the generated rules. From the output summary, we find that the Apriori algorithm generates 5,668 rules with support over 0.001 and confidence over 0.5. Furthermore, we can examine the rule length distribution, the summary of the quality measures, and the mining information. In the summary of the quality measures, we find descriptive statistics of three measures: support, confidence, and lift. Support is the proportion of transactions containing a given itemset. Confidence is the proportion of transactions containing X that also contain Y. Lift is the ratio of the rule's observed support to the support expected if X and Y were independent.
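The mining step itself is a single call; a sketch, assuming Groceries is already loaded via the arules package:

```r
# Mine association rules with minimum support 0.001 and
# minimum confidence 0.5.
rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.5))

# Rule count, rule length distribution, quality-measure summary
# (support, confidence, lift), and mining information.
summary(rules)
```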

To explore the generated rules, we can use the inspect function to view the first six of the 5,668 rules. Lastly, we can sort the rules by confidence and list those with the highest confidence. We find that {rice, sugar} => {whole milk} is the most confident rule, with support equal to 0.001220132, confidence equal to 1, and lift equal to 3.913649.
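The inspection and sorting steps can be sketched as follows, assuming `rules` holds the output of the apriori call above:

```r
# View the first six generated rules with their quality measures.
inspect(head(rules))

# Sort rules by confidence (descending) and show the most
# confident rules at the top.
rules.sorted <- sort(rules, by = "confidence")
inspect(head(rules.sorted))
```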
