Chapter 2: | Segmenting Customer Transactions Using a Pattern-Based Clustering Approach |
In order to generate balanced clusters, we introduce another component to M. Also because each division creates two clusters, the revised M is specified as follows:
M (C1, C2) = K1 · D(C1, C2) + K2 · S(C1) + K3 · S(C2) + ƒBALANCE (N1, N2)
K1, K2, K3are user-specified weights used to bring the difference and similarity to relatively compatible scales and can be decided upon according to simulation. For example, in the simulation, the difference component ranges between 1000 and 2000, and the two similarity components range between 100 and 200. If the difference and similarity need to be considered equally, we can set K1 as 1 and set K2 and K3 as 10.
D(C1, C2) represents the inter-cluster difference, S(C1) and S(C2) are the intra-cluster similarity for cluster-1 and cluster-2 respectively, and N1and N2 are the number of transactions in cluster-1 and cluster-2 respectively.fBALANCE(N1, N2) can take one of the following formats.
ƒBALANCE(N1, N2) serves as a tool to balance the size of the two clusters generated. It can either be a linear function of the absolute difference between the size of the two clusters (A), or a discrete function of the relative difference of the size of the two clusters (B). In (B), if the size of cluster-1 is very different from the size of cluster-2 (case 2 in (B)), this balancing component of M will drive M to – ∞ meaning that this cluster will not be considered. Kl and Kh are the desired lower bound and upper bound of N1/N2. For a given clustering, if N1/N2 is out of these two bounds (clusters generated are unbalanced), this clustering will not be considered. GHIC is presented in Figure 1.