get_independent_GMM_with_shift_2 - Finds a Gaussian mixture model (GMM) for data possibly containing discrete random global shifts in feature dimensions.


#include "r/r_cluster.h"

Example compile flags (system dependent):
   -L/home/kobus/misc/load/linux_x86_64_opteron -L/usr/lib/x86_64-linux-gnu
   -lKJB -lfftw3 -lgsl -lgslcblas -ljpeg -lSVM -lstdc++ -lpthread -lSLATEC -lg2c -lacml -lacml_mv -lblas -lg2c -lncursesw

int get_independent_GMM_with_shift_2
(
	int max_left_shift,
	int max_right_shift,
	int num_clusters,
	const Matrix *feature_mp,
	const Vector *initial_delta_vp,
	const Vector *initial_a_vp,
	const Matrix *initial_u_mp,
	const Matrix *initial_var_mp,
	Vector **delta_vpp,
	Matrix **P_shift_mpp,
	Vector **a_vpp,
	Matrix **u_mpp,
	Matrix **var_mpp,
	Matrix **P_cluster_mpp
);


This routine finds a Gaussian mixture model (GMM) for data that may contain random global discrete shifts in the feature dimensions, on the assumption that the features are independent. It allows for the possibility that a data point is shifted by a random discrete amount after being generated from its Gaussian. The shifts are assumed to be independent of the Gaussians from which the data points are generated. Unlike the counterpart routine, the shifts are not necessarily assumed to wrap around. Instead, a shift may leave arbitrary values in the feature dimensions that it frees up. The model is fit with EM. Some features are controlled via the set facility.

This routine performs subspace clustering: it considers only the subset of feature dimensions that, given the assumed nature of the shifts, is guaranteed not to be corrupted by the noise a shift introduces. The subspace is determined by the max_left_shift and max_right_shift parameters, as explained below. In particular, it fits:
         p(x) = sum sum  a-sub-i * delta-sub-j * g(u-sub-i, v-sub-i, x(- s-sub-j))
                 i   j
where a-sub-i is the prior probability for mixture component (cluster) i, u-sub-i is the mean vector for component i, v-sub-i is the variance vector for component i, and g(u,v,x) is a Gaussian with diagonal covariance (i.e., the features are assumed to be independent, given the cluster). delta-sub-j is the prior probability of shift j, and x(- s-sub-j) indicates a global reverse (negative sign) shift of x by the amount corresponding to s-sub-j.

max_left_shift and max_right_shift specify the maximum amount of global discrete random left and right shift, respectively, that a data point can experience after being generated from its Gaussian. Unlike in the counterpart routine, each of these parameters must be non-negative. The total number of possible shifts for any data point is S = (max_left_shift + max_right_shift + 1), including the zero shift. Given max_left_shift and max_right_shift, there is a subspace of the full feature space that is guaranteed to be unaffected by the arbitrary noise that a random shift introduces. It has dimension T = M - (max_left_shift + max_right_shift), where M is the dimensionality of the full feature space. The EM procedure therefore determines clusters in this subspace rather than in the full space.

The argument num_clusters is the number of requested mixture components (clusters), K. The data matrix feature_mp is an N by M matrix, where N is the number of data points and M is the number of features.

The model parameters are put into *delta_vpp, *a_vpp, *u_mpp, and *var_mpp. Any of delta_vpp, a_vpp, u_mpp, or var_mpp can be NULL if that value is not needed. The vector *delta_vpp contains the inferred probability distribution over shifts, computed using all the training data points. It is of size S. The elements of *delta_vpp can be viewed as shift priors.
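The size relations above can be sketched in plain C. This is only an illustration: shift_model_sizes and unshifted_window are hypothetical helpers, not part of the library, and the sign convention (negative for a left shift toward lower feature indices, positive for a right shift) is an assumption made for this sketch rather than something the routine documents.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a library function). Computes the number of
 * possible shifts S and the subspace dimension T from the number of
 * features M. Returns 0 on success, -1 if a shift bound is negative or if
 * no unaffected feature dimension would remain.
 */
static int shift_model_sizes(int M, int max_left_shift, int max_right_shift,
                             int *S_ptr, int *T_ptr)
{
    assert(S_ptr != NULL && T_ptr != NULL);

    if (max_left_shift < 0 || max_right_shift < 0) return -1;
    if (M <= max_left_shift + max_right_shift) return -1;

    *S_ptr = max_left_shift + max_right_shift + 1;     /* includes zero shift */
    *T_ptr = M - (max_left_shift + max_right_shift);   /* unaffected dims */
    return 0;
}

/*
 * Hypothetical helper (not a library function). Under the assumed sign
 * convention, the unshifted point keeps its unaffected features at indices
 * max_left_shift .. max_left_shift + T - 1. After a shift by signed_shift,
 * those same T values sit in x starting at max_left_shift + signed_shift,
 * so undoing the shift amounts to reading that window.
 * signed_shift must lie in [-max_left_shift, max_right_shift].
 */
static const double *unshifted_window(const double *x, int max_left_shift,
                                      int signed_shift)
{
    return x + max_left_shift + signed_shift;
}
```

For example, with M = 10, max_left_shift = 2 and max_right_shift = 3, the sketch gives S = 6 and T = 5, and the window for the zero shift starts at feature index 2.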
The assumed order of shifts in *delta_vpp, and in any other output pertaining to shifts, is: (max_left_shift, max_left_shift-1, ..., 0, ..., max_right_shift-1, max_right_shift), that is, from the largest left shift, through the zero shift, to the largest right shift.

The vector *a_vpp contains the inferred cluster priors. It is of size K. Both u-sub-i and v-sub-i are vectors, and they are put into the i'th row of *u_mpp and *var_mpp, respectively. These matrices are thus K by T.

If P_cluster_mpp is not NULL, then the soft clustering (cluster membership) for each data point is returned. In that case, *P_cluster_mpp will be N by K. If P_shift_mpp is not NULL, then the posterior probability distribution over the possible discrete shifts for each data point is returned. In that case, *P_shift_mpp will be N by S.

Initial values of the parameters, to be used as the starting point for the EM iterations, can be specified using initial_delta_vp, initial_a_vp, initial_u_mp, and initial_var_mp. If they are all NULL, then a random initialization scheme is used. The initial parameters are assumed to be specified either in the full feature space or in the reduced space in which the final clusters are sought. If they are given in the full space, the routine extracts the parameters corresponding to the target subspace.
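To make the shift ordering and the two posterior outputs concrete, here is a minimal C sketch for a single data point. It is not the library's implementation: signed_shift_of_index and shift_gmm_posteriors are hypothetical names, the negative-means-left sign convention is an assumption, and the table lik of Gaussian values g(u-sub-i, v-sub-i, x(- s-sub-j)) is taken as given by the caller. The posteriors follow from applying Bayes' rule to the mixture p(x) defined above; whether the routine computes them in exactly this way (for instance, in log space for numerical stability) is not stated here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a library function). Maps a position j in a
 * shift-indexed output (0 .. S-1) to a signed shift, following the
 * documented order: j = 0 is the largest left shift, j = max_left_shift is
 * the zero shift, and j = S-1 is the largest right shift.
 * Assumed convention: negative = left, positive = right.
 */
static int signed_shift_of_index(int j, int max_left_shift)
{
    return j - max_left_shift;
}

/*
 * Hypothetical helper (not a library function). For one data point:
 *   a      K cluster priors
 *   delta  S shift priors
 *   lik    K x S row-major table, lik[i*S + j] = Gaussian value for
 *          cluster i after undoing shift j (supplied by the caller)
 * Returns p(x) = sum_i sum_j a[i] * delta[j] * lik[i*S + j].
 * If non-NULL, p_cluster (length K) and p_shift (length S) receive the
 * posterior over clusters and over shifts; they are left as unnormalized
 * sums (all zero) if p(x) is zero.
 */
static double shift_gmm_posteriors(int K, int S,
                                   const double *a, const double *delta,
                                   const double *lik,
                                   double *p_cluster, double *p_shift)
{
    double p = 0.0;
    int i, j;

    assert(K > 0 && S > 0 && a != NULL && delta != NULL && lik != NULL);

    if (p_cluster != NULL) for (i = 0; i < K; i++) p_cluster[i] = 0.0;
    if (p_shift != NULL)   for (j = 0; j < S; j++) p_shift[j] = 0.0;

    for (i = 0; i < K; i++) {
        for (j = 0; j < S; j++) {
            double w = a[i] * delta[j] * lik[i * S + j];  /* joint weight */
            p += w;
            if (p_cluster != NULL) p_cluster[i] += w;     /* sum over shifts */
            if (p_shift != NULL)   p_shift[j] += w;       /* sum over clusters */
        }
    }

    if (p > 0.0) {
        if (p_cluster != NULL) for (i = 0; i < K; i++) p_cluster[i] /= p;
        if (p_shift != NULL)   for (j = 0; j < S; j++) p_shift[j] /= p;
    }
    return p;
}
```

Filling p_cluster and p_shift for every data point would produce arrays with the same shapes as *P_cluster_mpp (N by K) and *P_shift_mpp (N by S).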


If the routine fails (due to storage allocation), then ERROR is returned and an error message is set. Otherwise, NO_ERROR is returned.


This software is not adequately tested. It is recommended that results be checked independently where appropriate.


Prasad Gabbur, Kobus Barnard.


Kobus Barnard


set_em_cluster_options, get_independent_GMM, get_independent_GMM_using_CEM, get_independent_GMM_with_shift, get_GMM_blk_compound_sym_cov, get_GMM_blk_compound_sym_cov_1