get_independent_GMM_using_CEM - Finds a Gaussian mixture model (GMM) by
automatically choosing an optimal number of clusters.

Example compile flags (system dependent):
-DLINUX_X86_64 -DLINUX_X86_64_OPTERON -DGNU_COMPILER
-lKJB -lfftw3 -lgsl -lgslcblas -ljpeg -lSVM -lstdc++ -lpthread -lSLATEC -lg2c -lacml -lacml_mv -lblas -lg2c -lncursesw

Arguments (partial list):
    const Matrix *feature_mp,
    const Vector *initial_a_vp,
    const Matrix *initial_u_mp,
    const Matrix *initial_var_mp,
This routine finds a Gaussian mixture model (GMM) for the data on the
assumption that the features are independent. It starts with an initial
number of clusters and possibly proceeds through a series of split-merge
operations on the clusters to choose an optimal number of clusters. The
optimizing criterion for choosing the final number of clusters is based on a
trade-off between the log-likelihood and an information theoretic criterion
as described in the following paper:
Baibo Zhang, Changshui Zhang, Xing Yi, "Competitive EM Algorithm for Finite
Mixture Models", Pattern Recognition, PR(37), No. 1, Jan. 2004, pp. 131-144.
Some features of this routine are controlled via the set facility.
In particular, it fits:
p(x) = sum a-sub-i * g(u-sub-i, v-sub-i, x)
where a-sub-i is the prior probability of mixture component (cluster) i,
u-sub-i is the mean vector for component i, v-sub-i is the variance vector
for component i, and g(u,v,x) is a Gaussian with diagonal covariance (i.e.,
the features are assumed to be independent, given the cluster).
The argument initial_num_clusters is the initial number of mixture
components. The data matrix feature_mp is an N by M matrix, where N is the
number of data points and M is the number of features.
The means, variances and priors of the initial clusters can be specified
through the arguments initial_u_mp, initial_var_mp, and initial_a_vp
respectively. If they are all NULL, then a random initialization method is used
for the clusters.
The model parameters are put into *a_vpp, *u_mpp, and *v_mpp. Any of
a_vpp, u_mpp, or v_mpp may be NULL if that value is not needed.
Both u-sub-i and v-sub-i are vectors, and they are put into the i'th row of
*u_mpp and *v_mpp, respectively. The matrices are thus K by M, where K is the
final number of clusters.
If P_mpp is not NULL, then the soft clustering (cluster membership) for each
data point is returned. In that case, *P_mpp will be N by K.
The parameters gamma and beta control the cluster split-merge operations.
These need to be determined experimentally. However, the following
observations may serve as a rough guide when selecting values:
The higher the value of gamma, the more frequent the split-merge operations,
and hence the greater the ability to jump in the parameter space (see
reference). A value on the order of 0.1L to L, where L is the adjusted
log-likelihood, has been observed to give reasonable results. The order of L
can be determined from the logs printed by the routine.
The higher the value of beta, the greater the chance of merge operations
being chosen relative to split operations (see reference) while proceeding
through the split-merge sequence. A value in the range 0.1R to 10R, where
R = min(KL_divergence_merge^2, KL_divergence_split^2), has been observed to
give reasonable results. R has to be determined experimentally by observing
the KL divergence values printed by the routine.
If the routine fails, then ERROR is returned and an error message is set.
Otherwise NO_ERROR is returned.
This routine might leak memory if it fails to execute completely.

This software is not adequately tested. It is recommended that results be
checked independently where appropriate.