This program creates a synthetic data set by selecting cluster centroids and then generating samples. Each sample is produced by randomly picking a centroid and sampling from a spherical Gaussian with that centroid as its mean and a user-specified standard deviation. This data generator has been used to evaluate the VFKM and VFEM systems.
The centroids are placed uniformly at random in an N-dimensional unit hypercube (where N is the number of continuous dimensions), except that if any centroid is placed closer than
(sqrt(N) / (num centroids + 1)) * std deviation
to an already placed one, its location is resampled. (If a centroid cannot be placed after 1000 resamples, an error is reported.) Note that 'unit hypercube' means that each dimension ranges from 0 to 1.0.
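The placement rule above can be sketched in Python as follows. This is only an illustrative sketch, not the program's actual source: the function name place_centroids, its parameters, the use of Euclidean distance, and the exact way resamples are counted are all assumptions. It also assumes that the concept seed is what drives centroid placement. Python's random number generator differs from the program's, so the same seed will not reproduce the program's centroids.

```python
import math
import random


def place_centroids(num_clusters, dims, stdev, concept_seed, max_resamples=1000):
    """Place centroids uniformly in the unit hypercube with a minimum spacing.

    Hypothetical helper: names and resample accounting are assumptions.
    """
    rng = random.Random(concept_seed)
    # Minimum allowed distance between any two centroids.
    min_dist = (math.sqrt(dims) / (num_clusters + 1)) * stdev
    centroids = []
    for _ in range(num_clusters):
        # One initial draw plus up to max_resamples further draws.
        for _attempt in range(max_resamples + 1):
            candidate = [rng.random() for _ in range(dims)]
            # Accept only if no existing centroid is closer than min_dist.
            if all(math.dist(candidate, c) >= min_dist for c in centroids):
                centroids.append(candidate)
                break
        else:
            raise RuntimeError(
                "could not place centroid after %d resamples" % max_resamples
            )
    return centroids


if __name__ == "__main__":
    for centroid in place_centroids(num_clusters=3, dims=20, stdev=0.05, concept_seed=21):
        print(centroid[:3])  # first three coordinates of each centroid
```

Because the minimum distance scales with the standard deviation, a large standard deviation combined with many clusters in few dimensions can make placement impossible, which is when the error path is reached.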
Finally, training samples are generated by randomly selecting a centroid and, for each dimension, sampling from a Gaussian whose mean is the centroid's value in that dimension and whose standard deviation is the one given as a parameter to the program. Note that the value of a sample's dimension may fall outside the 0 to 1.0 range.
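The sampling step can be sketched in Python as follows. Again this is an illustrative sketch under stated assumptions rather than the program's implementation: the function name generate_samples, its parameters, and the choice to return the centroid index alongside each point are hypothetical, and picking centroids with equal probability is assumed. The centroids are passed in as plain lists of coordinates.

```python
import random


def generate_samples(centroids, stdev, num_samples, seed):
    """Draw samples from a mixture of spherical Gaussians.

    Hypothetical helper: returns (centroid_index, point) pairs.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        # Pick one centroid at random (equal probability is an assumption).
        idx = rng.randrange(len(centroids))
        # Sample each dimension independently around the centroid's coordinate.
        point = [rng.gauss(mu, stdev) for mu in centroids[idx]]
        samples.append((idx, point))
    return samples


if __name__ == "__main__":
    # Two hand-picked centroids in 2 dimensions, for illustration only.
    demo_centroids = [[0.2, 0.2], [0.8, 0.8]]
    for idx, point in generate_samples(demo_centroids, stdev=0.05, num_samples=5, seed=1234):
        print(idx, point)
```

Nothing clips the sampled values, so a centroid near a face of the hypercube will produce some coordinates below 0 or above 1.0, as noted above.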
clusterdata -continuous 20 -clusters 3 -stdev 0.05 -conceptSeed 21 -seed 1234 -train 1000
This command creates 1000 samples in 20 dimensions by sampling from a mixture of 3 Gaussians with a standard deviation of 0.05. The same data set can be recreated by using the same values for the seed and conceptSeed flags.