get_independent_GMM_using_CEM - Finds a Gaussian mixture model (GMM) by
automatically choosing an optimal number of clusters.

Example compile flags (system dependent):
-DLINUX_X86_64 -DLINUX_X86_64_OPTERON -DGNU_COMPILER
-lKJB -lfftw3 -lgsl -lgslcblas -ljpeg -lSVM -lstdc++ -lpthread -lSLATEC -lg2c -lacml -lacml_mv -lblas -lg2c -lncursesw

Arguments (partial list):
    const Matrix *feature_mp,
    const Vector *initial_a_vp,
    const Matrix *initial_u_mp,
    const Matrix *initial_var_mp,
This routine finds a Gaussian mixture model (GMM) for the data on the
assumption that the features are independent. It starts with an initial
number of clusters and possibly proceeds through a series of split-merge
operations on the clusters to choose an optimal number of clusters. The
optimizing criterion for choosing the final number of clusters is based on a
trade-off between the log-likelihood and an information theoretic criterion
as described in the following paper:
Baibo Zhang, Changshui Zhang, Xing Yi, "Competitive EM Algorithm for Finite
Mixture Models", Pattern Recognition, PR(37), No. 1, Jan. 2004, pp. 131-144.
Some features of this routine are controlled via the set facility.
In particular, it fits:
p(x) = sum a-sub-i * g(u-sub-i, v-sub-i, x)
where a-sub-i is the prior probability of mixture component (cluster) i,
u-sub-i is the mean vector for component i, v-sub-i is the variance vector
for component i, and g(u,v,x) is a Gaussian with diagonal covariance (i.e.,
the features are assumed to be independent, given the cluster).
The argument initial_num_clusters is the initial number of mixture
components. The data matrix feature_mp is an N by M matrix, where N is the
number of data points and M is the number of features.
The means, variances and priors of the initial clusters can be specified
through the arguments initial_u_mp, initial_var_mp, and initial_a_vp
respectively. If they are all NULL, then a random initialization method is used
for the clusters.
The model parameters are put into *a_vpp, *u_mpp, and *v_mpp. Any of
a_vpp, u_mpp, or v_mpp may be NULL if that value is not needed.
Both u-sub-i and v-sub-i are vectors, and they are put into the i'th row of
*u_mpp and *v_mpp, respectively. The matrices are thus K by M, where K is the
final number of clusters.
If P_mpp is not NULL, then the soft clustering (cluster membership) for each
data point is returned. In that case, *P_mpp will be N by K.
The parameters gamma and beta control the cluster split-merge operations.
These need to be determined experimentally. However, the following
observations may serve as a rough guide when selecting values:
The higher the value of gamma, the more frequent the split-merge operations,
and hence the greater the ability to jump in the parameter space (see
reference). A value on the order of 0.1L to L, where L is the adjusted
log-likelihood, has been observed to give reasonable results. The order of L
can be determined from the logs printed by the routine.
The higher the value of beta, the greater the chance of merge operations
being chosen relative to split operations (see reference) while proceeding
through the split-merge sequence. A value in the range 0.1R to 10R, where
R = min(KL_divergence_merge^2, KL_divergence_split^2), has been observed to
give reasonable results. R has to be determined experimentally by observing
the KL divergence values printed by the routine.
If the routine fails, then ERROR is returned and an error message is set.
Otherwise NO_ERROR is returned.
This routine might leak memory if it fails to execute completely.

This software is not adequately tested. It is recommended that results be
checked independently where appropriate.