NAME

get_kmeans_clusters - Clusters data using the k-means method.

SYNOPSIS

#include "lsm/lsm_cluster.h"

Example compile flags (system dependent):
  -DLINUX_X86_64 -DLINUX_X86_64_OPTERON -DGNU_COMPILER
  -I/home/kobus/include
  -L/home/kobus/misc/load/linux_x86_64_opteron -L/usr/lib/x86_64-linux-gnu
  -lKJB -lfftw3 -lgsl -lgslcblas -ljpeg -lSVM -lstdc++ -lpthread -lSLATEC -lg2c -lacml -lacml_mv -lblas -lg2c -lncursesw


int get_kmeans_clusters
(
	const Matrix *data_mp,
	int num_clusters,
	int (*distance_fn)(Vector *,Vector *,double *),
	Matrix **output_cluster_mpp,
	Vector **output_weights_vpp,
	Vector **output_classes_vpp
);

DESCRIPTION

Clusters data into a set number of cluster centres using the k-means algorithm. Returns the cluster centres in a matrix, the number of data points per cluster centre (normalized over the [0, 1.0] range), and the cluster centre each input data point is assigned to.

The input data to be clustered is contained in "data_mp", an N x D matrix, where N is the number of rows (data points) and D is the number of columns (dimensions).

The argument "num_clusters" indicates how many cluster centres to compute from the input data. This is the "k" in the k-means algorithm.

The "distance_fn" argument allows the user to specify their own distance metric that operates on vectors. If the "distance_fn" argument is NULL, then the Euclidean distance will be used. To specify your own distance function, you must use the following prototype:

    int my_distance(Vector* v1_vp, Vector* v2_vp, double* distance_ptr)

Your distance function should return NO_ERROR on success, and ERROR on failure.

The computed cluster centres are returned in "output_cluster_mpp", which is a double pointer to a Matrix of size "num_clusters" x D. If the matrix does not exist (*output_cluster_mpp == NULL) or is the wrong size, it will be created or resized as appropriate.

The normalized number of data points assigned to each cluster is returned in "output_weights_vpp", a Vector of length "num_clusters". If this argument is NULL, it will be ignored. If the vector does not exist (*output_weights_vpp == NULL) or is the wrong length, it will be created or resized. Note that this vector has been normalized to sum to 1.0.

The index of the cluster centre that each input data point has been assigned to is returned in "output_classes_vpp", a Vector of length N. Each entry is the index of the row of the "output_cluster_mpp" Matrix to which the corresponding data point has been assigned. If this argument is NULL, it will be ignored. If the vector does not exist or is the wrong length, it will be created or resized.

Finding the clusters is an iterative process. The iterations are controlled by two parameters: the maximum number of iterations allowed, and the difference between a computed cluster centre and its value at the previous iteration. See the man pages for "set_kmeans_max_iterations" and "set_kmeans_epsilon" for more information.
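
As an illustration of the "distance_fn" prototype described above, the following is a minimal sketch of a city-block (L1) distance function. It is not part of the library. The Vector layout and the values of NO_ERROR and ERROR are stand-ins assumed here only so that the sketch compiles on its own; in real code they come from the library headers, and the stand-ins should be removed. The name "manhattan_distance" is hypothetical.

```c
#include <stddef.h>  /* NULL */

/* ASSUMED stand-ins so this sketch compiles without the library.
 * Remove them and include "lsm/lsm_cluster.h" in real code. */
typedef struct Vector
{
    int     length;
    int     max_length;
    double* elements;
} Vector;

#define NO_ERROR 0
#define ERROR    (-1)

/* City-block (L1) distance, written to match the distance_fn prototype:
 * the result is stored through distance_ptr and the return value only
 * reports success or failure. */
int manhattan_distance(Vector* v1_vp, Vector* v2_vp, double* distance_ptr)
{
    double sum = 0.0;
    int    i;

    /* Reject missing arguments and vectors of different dimension. */
    if (v1_vp == NULL || v2_vp == NULL || distance_ptr == NULL) return ERROR;
    if (v1_vp->length != v2_vp->length) return ERROR;

    for (i = 0; i < v1_vp->length; i++)
    {
        double diff = v1_vp->elements[i] - v2_vp->elements[i];

        sum += (diff < 0.0) ? -diff : diff;
    }

    *distance_ptr = sum;

    return NO_ERROR;
}
```

With the library headers in place of the stand-ins, such a function would be passed as the third argument, for example get_kmeans_clusters(data_mp, 5, manhattan_distance, &centres_mp, &weights_vp, &classes_vp), where the variable names are illustrative only.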

RETURNS

NO_ERROR on success, or ERROR on failure, with an appropriate error message being set.

SEE ALSO

set_kmeans_options, set_kmeans_max_iterations, get_kmeans_max_iterations, set_kmeans_epsilon, get_kmeans_epsilon, free_kmeans_allocated_static_data.

DISCLAIMER

This software is not adequately tested. It is recommended that results are checked independently where appropriate.

AUTHOR

Lindsay Martin

DOCUMENTER