get_independent_GMM_with_shift_2 - Finds a Gaussian mixture model (GMM) for data possibly containing discrete
random global shifts in feature dimensions.
Example compile flags (system dependent):
-DLINUX_X86_64 -DLINUX_X86_64_OPTERON -DGNU_COMPILER
-lKJB -lfftw3 -lgsl -lgslcblas -ljpeg -lSVM -lstdc++ -lpthread -lSLATEC -lg2c -lacml -lacml_mv -lblas -lg2c -lncursesw
const Matrix *feature_mp,
const Vector *initial_delta_vp,
const Vector *initial_a_vp,
const Matrix *initial_u_mp,
const Matrix *initial_var_mp,
This routine finds a Gaussian mixture model (GMM) for the data on the
assumption that the features are independent. It allows for the possibility
of a data point being shifted by a random discrete amount after having been
generated from its Gaussian. The shifts are assumed to be independent of the
Gaussians from which the data points are generated. Unlike the counterpart
routine, the shifts are not necessarily assumed to wrap around. Instead, the
feature dimensions that a shift frees up may be filled with arbitrary
values. The model is fit with EM. Some features are controlled via the set
facility.
This routine performs subspace clustering in that it considers only the
subset of feature dimensions that is guaranteed not to be corrupted by the
noise that the assumed kind of shift introduces. The subspace of feature
dimensions is determined by the max_left_shift and max_right_shift
parameters, as described below.
In particular, it fits:
p(x) = sum sum a-sub-i * delta-sub-j * g(u-sub-i, v-sub-i, x(- s-sub-j))
where a-sub-i is the prior probability for mixture component (cluster) i,
u-sub-i is the mean vector for component i, v-sub-i is the variance vector
for component i, and g(u,v,x) is a Gaussian with diagonal covariance (i.e.,
the features are assumed to be independent, given the cluster). delta-sub-j
is the prior probability of shift j, and x(- s-sub-j) indicates a global
reverse (negative sign) shift of x by the amount corresponding to s-sub-j.
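Written out with explicit indices, the density above can be read as follows.
This is a restatement rather than a definition taken from the library: it
assumes that the outer sum runs over the K mixture components, that the inner
sum runs over the S possible shifts, and that g is a product of T univariate
Gaussians, one per dimension of the subspace in which the clusters are sought.

```latex
p(x) = \sum_{i=1}^{K} \sum_{j=1}^{S} a_i \, \delta_j \,
       g\bigl(u_i, v_i, x(-s_j)\bigr),
\qquad
g(u, v, x) = \prod_{t=1}^{T} \frac{1}{\sqrt{2 \pi v_t}}
             \exp\!\left( -\frac{(x_t - u_t)^2}{2 v_t} \right)
```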
max_left_shift and max_right_shift specify the maximum global discrete
random shift, to the left and to the right respectively, that a data point
can experience after being generated from its Gaussian. Unlike the
counterpart routine, each of these parameters can take only non-negative
values. The total number of possible shifts for any data point is
S = (max_left_shift + max_right_shift + 1), including the zero shift.
Based on max_left_shift and max_right_shift, a subspace of the entire
feature space exists that is guaranteed to be unaffected by the arbitrary
noise that a random shift introduces. It is of dimension
T = M - (max_left_shift + max_right_shift), where M is the dimensionality of
the full feature space. So, the EM procedure determines clusters in this
subspace rather than the full space.
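The two size formulas can be checked with a few lines of C. The helper below
is hypothetical and is not part of the library; it only evaluates S and T as
defined in this text. Treating a non-positive T as an error is an assumption
made for the sketch, since the text does not say how the routine handles it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a library routine): computes the number of
 * possible shifts S and the subspace dimension T from the documented
 * formulas. Returns 0 on success and -1 if a shift bound is negative or
 * if no subspace dimension would remain (the latter is an assumption). */
int shift_model_sizes(int num_features, int max_left_shift,
                      int max_right_shift, int *num_shifts,
                      int *subspace_dim)
{
    assert(num_shifts != NULL && subspace_dim != NULL);

    if (max_left_shift < 0 || max_right_shift < 0) return -1;

    /* S = max_left_shift + max_right_shift + 1, including the zero shift. */
    *num_shifts = max_left_shift + max_right_shift + 1;

    /* T = M - (max_left_shift + max_right_shift). */
    *subspace_dim = num_features - (max_left_shift + max_right_shift);

    return (*subspace_dim > 0) ? 0 : -1;
}
```

For example, M = 10 features with max_left_shift = 2 and max_right_shift = 3
give S = 6 and T = 5.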
The argument num_clusters is the number of requested mixture components
(clusters), K.
The data matrix feature_mp is an N by M matrix where N is the number of data
points, and M is the number of features.
The model parameters are put into *delta_vpp, *a_vpp, *u_mpp, and *var_mpp.
Any of delta_vpp, a_vpp, u_mpp, or var_mpp may be NULL if that value is not
needed.
The vector *delta_vpp contains the inferred probability distribution over
shifts computed using all the training data points. It is of size S. The
elements of *delta_vpp can be viewed as shift priors. The assumed order of
shifts in this vector or any other output pertaining to shifts is:
(max_left_shift, max_left_shift-1, ..., 0, ..., max_right_shift-1, max_right_shift)
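The ordering above implies a simple mapping between a position in the shift
vector and the shift it stands for. The helpers below are hypothetical (they
are not library routines) and assume a sign convention that this text does
not state: left shifts are reported as negative numbers and right shifts as
positive numbers.

```c
#include <assert.h>

/* Hypothetical helper: converts a position in the shift vector into a
 * signed shift. ASSUMPTION: negative means left, positive means right. */
int shift_index_to_signed_shift(int index, int max_left_shift,
                                int max_right_shift)
{
    assert(max_left_shift >= 0 && max_right_shift >= 0);
    assert(index >= 0 && index <= max_left_shift + max_right_shift);

    /* Position 0 holds the largest left shift, so subtract it. */
    return index - max_left_shift;
}

/* Hypothetical helper: the inverse mapping, from signed shift to position. */
int signed_shift_to_index(int shift, int max_left_shift, int max_right_shift)
{
    assert(max_left_shift >= 0 && max_right_shift >= 0);
    assert(shift >= -max_left_shift && shift <= max_right_shift);

    return shift + max_left_shift;
}
```

With max_left_shift = 2 and max_right_shift = 3, position 0 corresponds to a
left shift of 2, position 2 to the zero shift, and position 5 to a right
shift of 3.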
The vector *a_vpp contains the inferred cluster priors. It is of size K.
Both u-sub-i and v-sub-i are vectors, and they are put into the i'th row of
*u_mpp and *var_mpp, respectively. The matrices are thus K by T.
If P_cluster_mpp is not NULL, then the soft clustering (cluster membership) for each
data point is returned. In that case, *P_cluster_mpp will be N by K.
If P_shift_mpp is not NULL, then the posterior probability distribution over
the possible discrete shifts for each data point is returned. In that case,
*P_shift_mpp will be N by S.
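A common way to use these posteriors is to pick the most probable cluster or
shift for each data point. The sketch below shows that step on a plain
row-major array of doubles. Both the function name and the array layout are
assumptions made for illustration; the library's own Matrix type is not used
here, so the values would first have to be copied out of it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the column index of the largest entry in one
 * row of a row-major array of probabilities with num_cols columns. Ties are
 * resolved in favour of the lowest column index. */
size_t argmax_in_row(const double *probs, size_t num_cols, size_t row)
{
    const double *p;
    size_t best = 0;
    size_t c;

    assert(probs != NULL && num_cols > 0);

    p = probs + row * num_cols;

    for (c = 1; c < num_cols; c++) {
        if (p[c] > p[best]) best = c;
    }

    return best;
}
```

Applied to a row of an N by K cluster posterior it yields a hard cluster
assignment; applied to a row of an N by S shift posterior it yields the
position of the most probable shift in the shift ordering.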
Initial values of the parameters to be used as the starting values for the EM
iterations can be specified using initial_delta_vp, initial_a_vp,
initial_u_mp and initial_var_mp. If they are all NULL, then a
random initialization scheme is used. It is assumed that the initial
parameters are specified either in the full feature space or the reduced
space in which the final clusters are sought. In case of full space, the
routine retrieves the parameters corresponding to the target subspace.
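The reduction from full-space to subspace parameters can be pictured as
copying T of the M columns of each parameter row. The sketch below is only
one plausible reading and not the library's code: it assumes that the target
subspace is the contiguous block of columns starting at index
max_left_shift, which is the block whose original content survives every
allowed shift. The function name and the use of plain arrays are likewise
assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copies the T = M - (max_left_shift + max_right_shift)
 * columns assumed to form the target subspace out of one full-space
 * parameter row (for example, one cluster mean).
 * ASSUMPTION: the subspace is the contiguous block of columns starting at
 * max_left_shift; this text does not state the library's actual choice. */
void full_row_to_subspace(const double *full_row, size_t num_features,
                          size_t max_left_shift, size_t max_right_shift,
                          double *sub_row)
{
    size_t subspace_dim;
    size_t t;

    assert(full_row != NULL && sub_row != NULL);
    assert(max_left_shift + max_right_shift < num_features);

    subspace_dim = num_features - (max_left_shift + max_right_shift);

    for (t = 0; t < subspace_dim; t++) {
        sub_row[t] = full_row[max_left_shift + t];
    }
}
```

With M = 6, max_left_shift = 1 and max_right_shift = 2, the sketch keeps
columns 1, 2 and 3 of the full-space row.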
If the routine fails (due to a storage allocation failure), then ERROR is
returned and an error message is set. Otherwise NO_ERROR is returned.
This software is not adequately tested. It is recommended that
results are checked independently where appropriate.
Prasad Gabbur, Kobus Barnard.