One interesting task in unsupervised learning is customer profiling, also called customer clustering. Given a dataset of customer information, the goal is to find groups of customers that share similar characteristics or buy the same products. This benefits business owners because it tells them which groups of customers they have, enabling a more strategic approach to customer relationships.
Customer information can contain both numerical and categorical data. Whenever we face an unscaled categorical variable, we need to split it into as many variables as the number of values it may take. For example, suppose that we have the following list of customer purchase transactions:
Transaction ID | Customer ID | Products | Discount | Total |
---|---|---|---|---|
1399 | 56 | Milk, Bread, Butter | 0.00 | 4.30 |
1400 | 991 | Cheese, Milk | 2.30 | 5.60 |
1401 | 406 | Bread, Sausage | 0.00 | 8.80 |
1402 | 239 | Chipotle Sauce, Spice | 0.00 | 6.70 |
1403 | 33 | Turkey | 0.00 | 4.50 |
1404 | 406 | Turkey, Butter, Spice | 1.00 | 9.00 |
It is easy to see that the Products column holds unscaled categorical data and that each transaction contains an arbitrary number of products: the customer may purchase just one of these products or several. To transform this dataset into a numerical one, preprocessing is needed. For each product, a variable is added to the dataset, resulting in the following:
Customer ID | Milk | Bread | Butter | Cheese | Sausage | Chipotle Sauce | Spice | Turkey |
---|---|---|---|---|---|---|---|---|
56 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
991 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
406 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
239 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
To save space, we omitted the numerical variables (Discount and Total) and encoded the presence of a product among a customer's purchases as 1 and its absence as 0. Each row describes one customer, so the two transactions of customer 406 are merged into a single row. An alternative preprocessing could record the number of occurrences of each value instead, in which case the variables are no longer binary but discrete counts.
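The preprocessing described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not a reference implementation: the function name `encode_customers`, the `binary` flag, and the tuple layout of `transactions` are assumptions made for this example, and the data are copied from the transaction table with the Discount and Total columns left out.

```python
from collections import Counter

# Transactions from the table above: (transaction ID, customer ID, products).
transactions = [
    (1399, 56, ["Milk", "Bread", "Butter"]),
    (1400, 991, ["Cheese", "Milk"]),
    (1401, 406, ["Bread", "Sausage"]),
    (1402, 239, ["Chipotle Sauce", "Spice"]),
    (1403, 33, ["Turkey"]),
    (1404, 406, ["Turkey", "Butter", "Spice"]),
]


def encode_customers(transactions, binary=True):
    """Hypothetical helper: build one numerical row per customer.

    Returns (columns, rows), where columns is the list of product names
    and rows maps each customer ID to a list aligned with columns.
    """
    columns = []  # product names, in order of first appearance
    counts = {}   # customer ID -> Counter of purchased products
    for _, customer_id, products in transactions:
        customer_counts = counts.setdefault(customer_id, Counter())
        for product in products:
            if product not in columns:
                columns.append(product)
            customer_counts[product] += 1

    rows = {}
    for customer_id, customer_counts in counts.items():
        # A Counter returns 0 for products the customer never bought.
        row = [customer_counts[product] for product in columns]
        if binary:
            # Presence/absence encoding: cap every count at 1.
            row = [min(value, 1) for value in row]
        rows[customer_id] = row
    return columns, rows


columns, rows = encode_customers(transactions)
print(columns)
for customer_id, row in rows.items():
    print(customer_id, row)
```

Calling `encode_customers(transactions, binary=False)` keeps the raw counts, which corresponds to the alternative preprocessing mentioned above. On this particular dataset both variants give the same rows, because no customer buys the same product in more than one transaction.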