```python
#Initialize test array (the example joint distribution from the ITCC paper)
import numpy as np

test_arr = np.array([.05, .05, .05, 0, 0, 0,
                     .05, .05, .05, 0, 0, 0,
                     0, 0, 0, .05, .05, .05,
                     0, 0, 0, .05, .05, .05,
                     .04, .04, 0, .04, .04, .04,
                     .04, .04, .04, 0, .04, .04]).reshape(6,6)
```
Abstract
Clustering is a useful tool in data mining whereby data points are grouped together based on some predefined similarity metric. The result is a more digestible data model that provides insight into how closely related different attributes of the data are. In this post, we discuss the information theoretic co-clustering algorithm (Dhillon, Mallela, and Modha 2003) and provide a Python implementation of it. We also discuss the usefulness of co-clustering and propose potential future projects.
See this link for the paper.
Introduction
Co-clustering can be defined as a family of algorithms that simultaneously cluster the rows and columns of a matrix. Suppose we have a matrix \(A\) with n rows and m columns, as shown below.
\[A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1m} \\ a_{21} & a_{22} & \cdots & a_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nm} \end{bmatrix}\]
Our objective with \(A\) is to identify distinct groups of data that exhibit strong similarity to one another. One way to expose these groupings is through a clustering algorithm like K-means. However, the downside of K-means is its single-modality nature. That is, rows and columns are clustered independently of each other. In real-world scenarios, there are generally relationships between the rows and columns that we’d want to capture as part of our analysis. For example, suppose the rows of \(A\) were customers and the columns of \(A\) were available products. \(a_{ij}\) would then represent the quantity of the \(j\)th product purchased by the \(i\)th customer. We’d want our analysis to capture the relationships between customers and products. This is where co-clustering comes in.
Continuing with the example above of rows representing customers and columns representing products, suppose we have the following data.
\[\begin{pmatrix} 3 & 3 & 0 \\ 4 & 4 & 0 \\ 0 & 0 & 1 \end{pmatrix}\]
Here \(a_{ij}\) once again represents the purchases of the \(j\)th product by the \(i\)th customer. In standard clustering approaches (such as K-means), we would ignore the column structure of this matrix and focus only on the row structure. We would probably end up with two clusters: the first two rows in one cluster and the last row in another. This approach, while helpful in identifying data points that closely resemble each other, does not take into account the complete picture offered by the matrix.
In co-clustering, we analyze both the row structure and the column structure. We would keep the row clusters from the previous example, but we would also form column clusters, most likely grouping the first two columns together and leaving the last column by itself. Thinking back to what this data set represents, we can now show at a more granular level which customers have similar purchase histories, as well as which products are purchased together. Thus, we have a better understanding of the customer journey and their interactions with our products.
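To make this concrete before diving into ITCC, here is a small sketch of what co-clustering output looks like on the purchase matrix above, using scikit-learn’s SpectralCoclustering. Note that this is a different (spectral) co-clustering algorithm, not ITCC, and the exact labels may be permuted; it is shown only to illustrate the simultaneous row/column grouping.

```python
#Illustration only: co-cluster the toy purchase matrix with scikit-learn's
#SpectralCoclustering (a spectral algorithm, not the ITCC algorithm discussed below)
import numpy as np
from sklearn.cluster import SpectralCoclustering

purchases = np.array([[3, 3, 0],
                      [4, 4, 0],
                      [0, 0, 1]])

model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(purchases)

#Expected grouping: customers {0, 1} vs {2} and products {0, 1} vs {2}
print(model.row_labels_)     #e.g. [0 0 1] (cluster labels may be permuted)
print(model.column_labels_)  #e.g. [0 0 1]
```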
Information Theoretic Co-Clustering
Now that we have a solid foundation on what co-clustering is and its potential uses, we explore the “Information Theoretic Co-Clustering” algorithm (Dhillon, Mallela, and Modha 2003). For a more detailed explanation, please refer to the paper linked above or the referenced paper in the works cited.
The ITCC algorithm poses the clustering problem as an optimization problem over relative entropy (the Kullback-Leibler divergence), shown below in Equation 3.1.
\[\begin{equation} D_{KL}(P||Q) = \sum_{x} \sum_{y} P(x, y) \log \left( \frac{P(x, y)}{Q(x, y)} \right) \label{eq:kl_joint_discrete} \end{equation} \tag{3.1}\]
Essentially, we are trying to find a prototype, or approximate joint distribution, \(Q(x,y)\) that minimizes the distance from \(P(x,y)\). To do this, the ITCC algorithm computes the approximate joint distribution \(Q(x,y)\) by monotonically decreasing the objective function, which is the loss in mutual information; minimizing this loss minimizes the information lost between the true joint distribution and the approximation.
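As a small numerical illustration of Equation 3.1 (a toy example of my own, not from the paper), the relative entropy between two joint distributions can be computed directly with scipy.special.rel_entr, which returns the elementwise \(P \log(P/Q)\) terms:

```python
#Toy example of Equation 3.1: D_KL between a joint distribution P and an approximation Q
import numpy as np
from scipy.special import rel_entr

P = np.array([[0.3, 0.2],
              [0.1, 0.4]])
Q = np.array([[0.25, 0.25],
              [0.25, 0.25]])

#rel_entr returns the elementwise P * log(P / Q) terms; summing gives D_KL(P || Q)
print(np.sum(rel_entr(P, Q)))  #~0.106 nats
```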
To calculate this approximate joint distribution, the number of row clusters and column clusters is fixed in advance. Rows and columns are then assigned to specific clusters, up to \(k\) row clusters and \(l\) column clusters. A joint cluster distribution is then calculated, which we denote \(P(\hat{X},\hat{Y})\). The approximate joint distribution \(Q(x,y)\) is then calculated using \(P(\hat{X},\hat{Y})\) and other conditional distributions. \(Q(x,y)\) is obtained by finding row and column cluster assignments that minimize Equation 3.1, relative to the conditionals \(P(Y|\hat{X})\) and \(P(X|\hat{Y})\). A more structured run-through of the algorithm is given below.
1. Initialize the row co-clusters (\(C_X\)) and the column co-clusters (\(C_Y\)).
2. Calculate \(q^{(0)}(\hat{X}, \hat{Y})\), \(q^{(0)}(X|\hat{X})\), \(q^{(0)}(Y|\hat{Y})\), and \(q^{(0)}(Y|\hat{x})\).
3. Compute new row clusters: for each row \(x\), set \(C^{t+1}_{X}(x) = \text{argmin}_{\hat{x}} \, D(p(Y|x) \,||\, q^{(t)}(Y|\hat{x}))\).
4. Compute the distributions \(q^{(t+1)}(\hat{X}, \hat{Y})\), \(q^{(t+1)}(X|\hat{X})\), \(q^{(t+1)}(Y|\hat{Y})\), and \(q^{(t+1)}(X|\hat{y})\).
5. Compute new column clusters: for each column \(y\), set \(C^{t+2}_{Y}(y) = \text{argmin}_{\hat{y}} \, D(p(X|y) \,||\, q^{(t+1)}(X|\hat{y}))\).
6. Compute the distributions \(q^{(t+2)}(\hat{X}, \hat{Y})\), \(q^{(t+2)}(X|\hat{X})\), \(q^{(t+2)}(Y|\hat{Y})\), and \(q^{(t+2)}(Y|\hat{x})\).
7. If the change in the objective function value is small, stop and return the current row and column clusters (\(C_X, C_Y\)). Otherwise, repeat from step 2.
Essentially, the algorithm forms row and column cluster prototypes, calculates the appropriate distributions to build the approximate joint distribution \(q(X,Y)\), and then measures the distance between \(p(X,Y)\) and \(q(X,Y)\) using Equation 3.1. The objective decreases monotonically because we always select the cluster that minimizes Equation 3.1 (see steps 3 and 5 above).
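For reference, the paper defines the approximation explicitly: for \(x \in \hat{x}\) and \(y \in \hat{y}\),
\[q(x, y) = p(\hat{x}, \hat{y}) \, p(x|\hat{x}) \, p(y|\hat{y}).\]
Equivalently, \(q(x,y) = p(x)\,q(y|\hat{x})\), which is the form the co_cluster function later in this post uses when it multiplies the row marginals of \(P\) by \(Q(Y|\hat{X})\).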
Code Walk-Through
In this section, we’ll walk through the code chunks that make up our ITCC implementation. Full disclosure: this is a fairly crude implementation. There are far better implementations of this algorithm and of other co-clustering algorithms (e.g., scikit-learn). This is merely an educational exercise to practice implementing clustering algorithms.
To begin the code walk-through, we initialize our joint distribution \(P(X,Y)\): this is the test array defined at the top of this post, and it is the same joint distribution used in the ITCC paper.
Our \(P(X,Y)\) is a 6x6 matrix. We will attempt to find 3 row clusters and 2 column clusters using our ITCC implementation, just as in the ITCC paper. To do this, we implement various functions that calculate the distributions needed to eventually compare \(P(X,Y)\) and \(Q(X,Y)\).
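As a quick sanity check (not part of the original walk-through), the entries of the test matrix form a proper joint distribution:

```python
#Sanity check: the test matrix sums to one, so it is a valid joint distribution
print(test_arr.shape)  #(6, 6)
print(test_arr.sum())  #1.0 (up to floating point rounding)
```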
Our first function in the code walk-through calculates the joint distribution \(Q(\hat{X},\hat{Y})\), which is the joint distribution of the row and column clusters. The code is found below.
```python
#Define function for calculating joint distribution
def calc_joint(x: dict, y: dict, a: np.array) -> np.array:
    joint_arr = np.zeros((len(x), len(y)))

    for i, j in x.items():
        for z, p in y.items():
            ind = np.ix_(j, p)
            joint_arr[i,z] = a[ind].sum()

    return joint_arr
```
The function accepts the row clusters, the column clusters, and the joint distribution \(P(X,Y)\). We chose dictionaries to represent the co-cluster prototypes, with the cluster number as the key and a list of the row (or column) indices assigned to that cluster as the value. The function returns an mxn array, where m is the number of row clusters and n is the number of column clusters.
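For example, calling calc_joint with the row and column clusters that the algorithm eventually recovers for the test matrix (see the output at the end of this post) gives the co-cluster joint distribution; the cluster dictionaries below are written out by hand for illustration.

```python
#Example call: co-cluster joint distribution for the final clusters on test_arr
x_clusters = {0: [0, 1], 1: [2, 3], 2: [4, 5]}
y_clusters = {0: [0, 1, 2], 1: [3, 4, 5]}

q_hat = calc_joint(x_clusters, y_clusters, test_arr)
print(q_hat)
#approximately:
#[[0.3 0. ]
# [0.  0.3]
# [0.2 0.2]]
```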
The next function we define calculates \(Q(X|\hat{X})\). It accepts the joint distribution \(P(X,Y)\) and the dictionary of row cluster prototypes, and it returns an mxn matrix where m is the number of rows in the original matrix (i.e., the total number of rows across all row clusters) and n is the number of row clusters.
```python
#Define function for calculating conditional of x given x_hat
def calc_x_cond(jd: np.array, x_ind: dict) -> np.array:
    x_mar = jd.sum(axis=1)
    xhat_mar = calc_x_mar(jd, x_ind)
    cond_dist = np.zeros((len(x_mar), len(x_ind)))

    for key, val in x_ind.items():
        for idx, prb in enumerate(x_mar):
            if idx in val:
                cond_dist[idx,key] = prb / xhat_mar[key]
            else:
                cond_dist[idx,key] = 0

    return cond_dist
```
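Note that calc_x_cond (and calc_y_cond later) call helpers calc_x_mar and calc_y_mar that are not shown in this walk-through. Based on how they are used, they should return the marginal probability of each row (or column) cluster; a minimal sketch, which is my reconstruction rather than the original code, might look like this:

```python
#Hypothetical reconstructions of the cluster-marginal helpers (not the original code)
def calc_x_mar(jd: np.array, x_ind: dict) -> np.array:
    #Marginal probability of each row cluster: sum the row marginals over its members
    x_mar = jd.sum(axis=1)
    return np.array([x_mar[x_ind[key]].sum() for key in sorted(x_ind)])

def calc_y_mar(jd: np.array, y_ind: dict) -> np.array:
    #Marginal probability of each column cluster: sum the column marginals over its members
    y_mar = jd.sum(axis=0)
    return np.array([y_mar[y_ind[key]].sum() for key in sorted(y_ind)])
```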
\(Q(\hat{X}|\hat{Y})\) is the conditional distribution of \(\hat{X}\) given \(\hat{Y}\). This is just the joint distribution \(Q(\hat{X},\hat{Y})\) divided by the marginal \(Q(\hat{Y})\). The function returns an mxn matrix where m is the number of row clusters and n is the number of column clusters.
```python
#Define function for calculating conditional of x_hat given y_hat
def calc_xhat_cond(jd: np.array) -> np.array:
    cond_dist = np.zeros((jd.shape[0], jd.shape[1]))

    for j in range(jd.shape[1]):
        y_mar = jd[:,j].sum()
        for i in range(jd.shape[0]):
            if y_mar == 0:
                cond_dist[i,j] = 0
            else:
                cond_dist[i,j] = jd[i,j] / y_mar

    return cond_dist
```
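Continuing the calc_joint example above, conditioning that co-cluster joint distribution on the column clusters simply divides each column by its column sum (assuming the q_hat computed earlier):

```python
#Each column of q_hat is normalized by its column sum, i.e. by q(y_hat)
print(calc_xhat_cond(q_hat))
#approximately:
#[[0.6 0. ]
# [0.  0.6]
# [0.4 0.4]]
```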
\(Q(X|\hat{Y})\) is calculated as the product \(Q(X|\hat{X})\,Q(\hat{X}|\hat{Y})\). This function therefore accepts the previously calculated \(Q(X|\hat{X})\) and \(Q(\hat{X}|\hat{Y})\) distributions, along with the row and column cluster dictionaries. It returns an mxn matrix where m is the number of rows of the original joint distribution \(P(X,Y)\) and n is the number of columns of that same distribution.
```python
#Define function for calculating x given y_hat
def calc_x_cond_yhat(x_xhat: np.array, xhat_yhat: np.array, x_ind: dict, y_ind: dict) -> np.array:
    full_arr = []

    for y_key, y_val in y_ind.items():
        for _ in range(len(y_val)):
            row = []
            for x_key, x_val in x_ind.items():
                row.extend(xhat_yhat[x_key,y_key] * x_xhat[x_val, x_key])
            full_arr.append(row)

    return np.array(full_arr)
```
We now arrive at the column distributions, starting with \(Q(Y|\hat{Y})\). This function accepts the primary joint distribution and the \(\hat{Y}\) dictionary, and it returns an mxn matrix where m is the number of columns in the original joint distribution and n is the number of column clusters.
```python
#Define function for calculating conditional of y given y_hat
def calc_y_cond(jd: np.array, y_ind: dict) -> np.array:
    y_mar = jd.sum(axis=0)
    yhat_mar = calc_y_mar(jd, y_ind)
    cond_dist = np.zeros((len(y_mar), len(y_ind)))

    for key, val in y_ind.items():
        for idx, prb in enumerate(y_mar):
            if idx in val:
                cond_dist[idx,key] = prb / yhat_mar[key]
            else:
                cond_dist[idx,key] = 0

    return cond_dist
```
\(Q(\hat{Y}|\hat{X})\) is calculated very similarly to \(Q(\hat{X}|\hat{Y})\) and, like the previous function, returns a matrix with the same dimensions as the joint co-cluster distribution.
```python
#Define function for calculating conditional of y_hat given x_hat
def calc_yhat_cond(jd: np.array) -> np.array:
    cond_dist = np.zeros((jd.shape[0], jd.shape[1]))

    for i in range(jd.shape[0]):
        x_mar = jd[i,:].sum()
        for j in range(jd.shape[1]):
            if x_mar == 0:
                cond_dist[i,j] = 0
            else:
                cond_dist[i,j] = jd[i,j] / x_mar

    return cond_dist
```
\(Q(Y|\hat{X})\) is equal to the product \(Q(Y|\hat{Y})\,Q(\hat{Y}|\hat{X})\). This function accepts the \(Q(Y|\hat{Y})\) distribution, the \(Q(\hat{Y}|\hat{X})\) distribution, the column cluster dictionary, and the row cluster dictionary. It returns an mxn matrix with the same dimensions as the original joint distribution.
```python
#Define function for calculating y given x_hat
def calc_y_cond_xhat(y_yhat: np.array, yhat_xhat: np.array, y_ind: dict, x_ind: dict) -> np.array:
    full_arr = []

    for x_key, x_val in x_ind.items():
        for _ in range(len(x_val)):
            row = []
            for y_key, y_val in y_ind.items():
                row.extend(yhat_xhat[x_key,y_key] * y_yhat[y_val, y_key])
            full_arr.append(row)

    return np.array(full_arr)
```
\(C_X\) represents our current row clusters. The function below takes the distributions \(Q(Y|\hat{X})\) and \(P(Y|X)\), along with the mapping of each row to its row cluster in the x_ind dictionary. It uses Equation 3.1 to calculate the distance between these distributions, then selects the row cluster prototype that minimizes the distance between a row of the true distribution and the corresponding row of the approximate distribution. The function returns a dictionary with the new row cluster assignments.
```python
from scipy.special import kl_div

#Define function for calculating next c_x
def next_cx(y_xhat: np.array, y_x: np.array, x_ind: dict) -> dict:
    new_x_hat = {key: [] for key in x_ind}

    for i in range(y_x.shape[0]):
        temp_idx = []

        for key, val in x_ind.items():
            #Row-cluster prototype: q(Y|x_hat), averaged over the rows currently in the cluster
            proto = np.mean(y_xhat[val,:], axis=0)
            kl_res = np.nan_to_num(kl_div(y_x[i,:], proto), posinf=10, neginf=-10)
            temp_idx.append(np.sum(kl_res))

        #Assign row i to the cluster whose prototype minimizes Equation 3.1
        temp_val = np.argmin(np.array(temp_idx))
        new_x_hat[temp_val].append(i)

    return new_x_hat
```
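One detail worth flagging: scipy.special.kl_div is not the same elementwise function as rel_entr. kl_div(p, q) returns \(p \log(p/q) - p + q\), while rel_entr(p, q) returns \(p \log(p/q)\); when both arguments are proper distributions, the extra terms cancel in the sum, so either choice yields Equation 3.1 here.

```python
#kl_div(p, q) = p*log(p/q) - p + q elementwise; rel_entr(p, q) = p*log(p/q).
#Summed over two proper distributions, the -p + q terms cancel.
from scipy.special import kl_div, rel_entr

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(np.sum(kl_div(p, q)), np.sum(rel_entr(p, q)))  #both ~0.511
```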
Similarly to the function above, \(C_Y\) represents our current column clusters. The function uses \(Q(X|\hat{Y})\) and \(P(X|Y)\) along with the mappings found in y_ind, which maps columns to column clusters. It uses Equation 3.1 to perform the same calculations as for \(C_X\), and it returns a dictionary with the new column cluster assignments.
```python
#Define function for calculating next c_y
def next_cy(x_yhat: np.array, x_y: np.array, y_ind: dict) -> dict:
    new_y_hat = {key: [] for key in y_ind}

    for i in range(x_y.shape[1]):
        temp_idx = []

        for key, val in y_ind.items():
            #Column-cluster prototype: q(X|y_hat), averaged over the columns currently in the cluster
            proto = np.mean(x_yhat[:,val], axis=1)
            kl_res = np.nan_to_num(kl_div(x_y[:,i], proto), posinf=10, neginf=-10)
            temp_idx.append(np.sum(kl_res))

        #Assign column i to the cluster whose prototype minimizes Equation 3.1
        temp_val = np.argmin(np.array(temp_idx))
        new_y_hat[temp_val].append(i)

    return new_y_hat
```
This function is where we formally define the co-clustering algorithm, using the functions defined above and following the steps outlined in the “Information Theoretic Co-Clustering” section. The function accepts the target distribution we wish to co-cluster, the number of row clusters \(k\), the number of column clusters \(l\), and the number of iterations. It returns a tuple of the form (dict, dict), where the first dictionary holds the row clusters and the second the column clusters.
```python
#Define co-clustering algorithm
def co_cluster(joint_dist: np.array, k: int, l: int, num_iter: int) -> (dict, dict):
    #Initialize x_hat
    x_hat = get_x_hat(joint_dist, k, new=True)
    #Initialize y_hat
    y_hat = get_y_hat(joint_dist, l, new=True)
    #Initialize min kl val: the smallest improvement worth continuing for
    max_kl = .0001
    print(f"init_x_hat: {x_hat}, init_y_hat: {y_hat}")

    #Enter loop
    for _ in range(num_iter):
        #Calculate q(x_hat, y_hat)
        q_joint_hat = calc_joint(x_hat, y_hat, joint_dist)

        #Calculate q(x|x_hat)
        q_x_cond_xhat = calc_x_cond(joint_dist, x_hat)
        #Calculate q(y|y_hat)
        q_y_cond_yhat = calc_y_cond(joint_dist, y_hat)
        #Calculate q(y_hat|x_hat)
        q_yhat_cond_xhat = calc_yhat_cond(q_joint_hat)

        #Calculate q(y|x_hat)
        q_y_cond_xhat = calc_y_cond_xhat(q_y_cond_yhat, q_yhat_cond_xhat, y_hat, x_hat)
        #Calculate p(y|x)
        p_y_x = joint_dist / joint_dist.sum(axis=1).reshape(-1,1)

        #Get next c_x
        x_hat_2 = next_cx(q_y_cond_xhat, p_y_x, x_hat)
        #Check if x_hat_2 is valid
        x_hat_2 = valid_cluster(x_hat_2)

        #Calculate qt+1(x_hat, y_hat)
        q_joint_hat_2 = calc_joint(x_hat_2, y_hat, joint_dist)
        #Calculate qt+1(x|x_hat)
        q_x_cond_xhat_2 = calc_x_cond(joint_dist, x_hat_2)
        #Calculate qt+1(x_hat|y_hat)
        q_xhat_cond_yhat_2 = calc_xhat_cond(q_joint_hat_2)

        #Calculate qt+1(x|y_hat)
        q_x_cond_yhat = calc_x_cond_yhat(q_x_cond_xhat_2, q_xhat_cond_yhat_2, x_hat_2, y_hat)
        #Calculate p(x|y)
        p_x_y = joint_dist / joint_dist.sum(axis=0).reshape(-1,1)

        #Get next c_y
        y_hat_2 = next_cy(q_x_cond_yhat, p_x_y, y_hat)
        #Check if y_hat_2 is valid
        y_hat_2 = valid_cluster(y_hat_2)

        #Calculate qt+2(x_hat, y_hat)
        q_joint_hat_3 = calc_joint(x_hat_2, y_hat_2, joint_dist)
        #Calculate qt+2(y|y_hat)
        q_y_cond_yhat_2 = calc_y_cond(joint_dist, y_hat_2)
        #Calculate qt+2(y_hat|x_hat)
        q_yhat_cond_xhat_2 = calc_yhat_cond(q_joint_hat_3)

        #Calculate qt+2(y|x_hat)
        q_y_cond_xhat_2 = calc_y_cond_xhat(q_y_cond_yhat_2, q_yhat_cond_xhat_2, y_hat_2, x_hat_2)

        #Calculate q(x,y) under the previous clusters
        joint_first = joint_dist.sum(axis=1).reshape(-1,1) * q_y_cond_xhat
        #Calculate qt+2(x,y) under the updated clusters
        joint_second = joint_dist.sum(axis=1).reshape(-1,1) * q_y_cond_xhat_2

        #Objective value (Equation 3.1) before and after the update
        kl_res_1 = np.sum(np.nan_to_num(kl_div(joint_dist, joint_first), posinf=10, neginf=-10))
        kl_res_2 = np.sum(np.nan_to_num(kl_div(joint_dist, joint_second), posinf=10, neginf=-10))

        kl = kl_res_1 - kl_res_2

        if kl > max_kl:
            #Keep iterating with the improved clusters
            x_hat = x_hat_2
            y_hat = y_hat_2
        else:
            #Improvement is negligible: we have converged
            print(f"kl={kl} on iteration {_}")
            return x_hat, y_hat

    return x_hat, y_hat
```
```
init_x_hat: {0: array([0, 3]), 1: array([1, 5]), 2: array([2, 4])}, init_y_hat: {0: array([2, 5, 1]), 1: array([0, 4, 3])}
kl=0.0 on iteration 2
({0: [0, 1], 1: [2, 3], 2: [4, 5]}, {0: [0, 1, 2], 1: [3, 4, 5]})
```
Putting all the pieces together from our code walk-through, the output above shows that we have successfully implemented an ITCC algorithm on the test case from the paper: it recovers the three row clusters and two column clusters reported there.
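For completeness, the helpers get_x_hat, get_y_hat, and valid_cluster used inside co_cluster are not shown above. Based on how they are called (a random initial partition of rows and columns, plus a guard against empty clusters), minimal sketches consistent with the printed initialization might look like the following; these are my reconstructions, as is the example driver call.

```python
#Hypothetical reconstructions of the remaining helpers (not the original code)
def get_x_hat(jd: np.array, k: int, new: bool = True) -> dict:
    #Randomly partition the row indices into k initial row clusters
    idx = np.random.permutation(jd.shape[0])
    return {i: part for i, part in enumerate(np.array_split(idx, k))}

def get_y_hat(jd: np.array, l: int, new: bool = True) -> dict:
    #Randomly partition the column indices into l initial column clusters
    idx = np.random.permutation(jd.shape[1])
    return {i: part for i, part in enumerate(np.array_split(idx, l))}

def valid_cluster(clusters: dict) -> dict:
    #Guard against degenerate assignments: every cluster must keep at least one member
    assert all(len(val) > 0 for val in clusters.values()), "empty cluster produced"
    return clusters

#Example driver call (the num_iter used for the output above is not shown; 10 is a guess)
#row_clusters, col_clusters = co_cluster(test_arr, 3, 2, 10)
```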
Conclusion
Though this tutorial was short and used only one test case, we illustrated the usefulness of co-clustering algorithms in a general sense. Additionally, we walked through the steps to implement the information theoretic co-clustering algorithm, along with example code. Further test cases should be evaluated before anything like this is used in a production setting. Nonetheless, we hope this post encourages others to discover for themselves the usefulness of co-clustering in their data mining work.