Chapter 2: | Segmenting Customer Transactions Using a Pattern-Based Clustering Approach |
This is a limited free preview of this book. Please buy full access.
In the next section we describe GHIC, an algorithm for pattern-based clustering.
2.3 The Clustering Algorithm
The ideal algorithm will be one that maximizes M (defined in previous section). However, for the objective function defined above, if there are n transactions and two clusters that we are interested in learning, the number of possible clustering schemes to examine is 2n. Hence, a heuristic approach is called for.
After we obtain FIS using association rule discovery algorithms, as described in Section 2, we convert the initial transactions into the format shown in Table 3, where rows represent transactions to be clustered and columns represent itemsets. A “1” in a cell indicates that a certain transaction contains a certain itemset. We use T′ = {T′1, T′2, … , T′n} to represent this new transaction set. Instead of clustering the original transactions, we convert the problem into clustering these binary vectors and present a divisive hierarchical algorithm. The entire transaction set is first divided into two clusters. Each cluster is further divided into two clusters if it has more than a predefined number of transactions. This process is repeated until no cluster is big enough to be divided further. In addition to cluster size, we assume that the stopping conditions are user specified and can contain additional criteria such that if a cluster’s size is bigger than the threshold but its quality is very good (contains very similar transactions), we can stop dividing that cluster.
Table 3. Transactions Represented Initemsets
is1 | is2 | … | isp | |
T1 | 1 | 0 | … | 0 |
T2 | 1 | 0 | … | 1 |
… | … | … | … | … |
Tn | 0 | 1 | … | 0 |