"Matching Words and Pictures", by Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, and Michael I. Jordan, Journal of Machine Learning Research, Vol 3, pp 1107-1135.
(The appropriate archival reference for this data).
(gzipped tar ball with README file)
Note that the download is about 160 Megs. If you need it in smaller packages, try clicking here.
(The following description is contained in the README file)
This directory contains the data used for the JMLR paper "Matching Words and Pictures", by Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, and Michael I. Jordan, Journal of Machine Learning Research, Vol 3, pp 1107-1135. This is the appropriate reference if you use this data.
The data is very much cobbled together and has some anomalies. More carefully prepared data will be made available in the future.
Each image segment is represented by 46 features. Since each image has a different number of segments, we list the number of segments used in separate files, so that the entire set of image segments can be read into a single Matlab matrix. The segments for a given image are listed in order of descending size and are separated from those for the next image by several spaces. For the JMLR paper we used the 10 largest segments, or all the segments if there were fewer than 10. The data in a different format, which includes the unused segments, is available upon request.
To compute the color features the images were linearized on the basis that they were PCD images, and then for convenience they were scaled up by (255/107), a somewhat arbitrary factor which has some justification based on the PCD format. (In hindsight, a factor of 2 would make more sense, but using this, or any other factor, would not change anything). Note that the features are redundant. Note also that the RGB and L*a*b features were duplicated to increase their weight for a specific experiment (long since finished), and we did not subsequently remove the duplicated columns. I do not know if this duplication inadvertently helps, hinders, or has no effect on these experiments. However, if you need a non-singular feature matrix, you will have to remove them. The 46 features are:
area, x, y, boundary/area, convexity, moment-of-inertia (6)
ave RGB (3)
ave RGB (3, yes, duplicated!)
RGB stdev (3)
ave rgS (3)
rgS stdev (3)
ave L*a*b (3)
ave L*a*b (3, yes, duplicated!)
lab stdev (3)
mean oriented energy, 30 degree increments (12)
mean difference of gaussians, 4 sigmas (4)
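As noted above, the duplicated RGB and L*a*b columns make the 46-column feature matrix singular. A minimal sketch of removing them, assuming the columns appear in exactly the order listed (the 0-based indices below are inferred from that list, not stated in the README):

```python
import numpy as np

# Inferred 0-based column layout (from the feature list above):
#   6-8   ave RGB          9-11  ave RGB (duplicate)
#   21-23 ave L*a*b        24-26 ave L*a*b (duplicate)
DUPLICATE_COLS = [9, 10, 11, 24, 25, 26]

def drop_duplicate_features(blobs):
    """blobs: (n_segments, 46) array -> (n_segments, 40) array
    with the duplicated RGB and L*a*b columns removed."""
    return np.delete(np.asarray(blobs), DUPLICATE_COLS, axis=1)
```

If the duplicated columns actually sit elsewhere in your copy of the data, adjust DUPLICATE_COLS accordingly; the point is only that six columns are exact copies and must go before inverting the feature matrix.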
The data is organized into 10 different samples of roughly 16,000 images. For
each sample there are three disjoint subsets corresponding to training data,
held out data, and a harder held-out set referred to as "novel" in the
paper. The files for each of these have either no prefix (training),
the prefix "test_1_" (held out), or the prefix "test_3_" (novel).
(There is no test_2.) The file "words" applies to
all three groups.
The files are as follows.
words
The vocabulary used. We count the words starting at 1, so "city" is
word 1.
document_words
test_1_document_words
test_3_document_words
The words for the images. Each line has a list of numbers which are
indices into the vocabulary file "words". Counting starts at 1. If
the image has fewer words than the maximum, the row is padded with
-99's so that the file can be read as a Matlab matrix.
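A minimal sketch of reading one of these files outside Matlab, dropping the -99 padding (the file layout and padding value are as described above; the function name is ours):

```python
def read_document_words(path):
    """Read a document_words-style file: one image per line,
    1-based vocabulary indices, padded with -99's.
    Returns a list of word-index lists, padding removed."""
    docs = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # skip any blank lines
            words = [int(tok) for tok in line.split()]
            docs.append([w for w in words if w != -99])
    return docs
```

Remember that the indices are 1-based, so subtract 1 before indexing a 0-based vocabulary array.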
word_counts
test_1_word_counts
test_3_word_counts
The number of words for each image. These files contain the same
information as the document word files.
blob_counts
test_1_blob_counts
test_3_blob_counts
One number per line giving the number of blobs used for that image.
blobs
test_1_blobs
test_3_blobs
The features for the blobs for the images, listed in order of images,
then decreasing blob size. In order to tell which blob goes with
which image, you need either the file blob_counts, or the file
document_blobs. Note that there are some spaces between the blobs for
each image.
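A hedged sketch of pairing the blobs file with blob_counts to recover per-image feature matrices. It assumes rows of 46 whitespace-separated numbers, and reads every number before reshaping, so the extra whitespace between images is harmless:

```python
import numpy as np

N_FEATURES = 46  # per the feature list in this README

def read_blobs_by_image(blobs_path, counts_path):
    """Return a list with one (n_blobs_i, 46) array per image,
    splitting the rows of `blobs_path` by the per-image counts
    in `counts_path`."""
    with open(counts_path) as f:
        counts = [int(tok) for tok in f.read().split()]
    with open(blobs_path) as f:
        feats = np.array(f.read().split(), dtype=float)
    feats = feats.reshape(-1, N_FEATURES)
    assert len(feats) == sum(counts), "counts do not match blob rows"
    return np.split(feats, np.cumsum(counts)[:-1])
```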
document_blobs
test_1_document_blobs
test_3_document_blobs
(EDITED April 4, 2004: The original writing suggested that these
files supplied the blob tokens. However, these files simply point
to the actual blobs. To get the tokens that were used for the ECCV
2002 paper, consult the files cluster_membership and
test_1_cluster_membership.)
The blobs for the images. This data is only relevant to the
discrete translation method. Each line has a list of numbers
representing indices into the file "blobs". If the image has fewer
blobs than the maximum, the row is padded with -99's so that the
file can be read as a Matlab matrix.
(The names of these files are somewhat misleading because they are
not exactly analogous with the files document_words and
test_1_document_words. These files do not give you any more
information than what is available in blob_counts and
test_1_blob_counts.)
cluster_membership
test_1_cluster_membership
test_3_cluster_membership
The blob token associated with each line of the files blobs,
test_1_blobs, and test_3_blobs. This data is only relevant for
discrete translation approaches to the problem.
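For the discrete translation representation, each image then becomes a list of blob tokens. A sketch of building that list by combining cluster_membership (one token per blob row) with blob_counts (blobs per image); both file layouts are as described in this README:

```python
def read_tokens_by_image(membership_path, counts_path):
    """Return one list of blob tokens per image, splitting the
    flat token stream in `membership_path` by the per-image
    counts in `counts_path`."""
    with open(membership_path) as f:
        tokens = [int(tok) for tok in f.read().split()]
    with open(counts_path) as f:
        counts = [int(tok) for tok in f.read().split()]
    images, start = [], 0
    for c in counts:
        images.append(tokens[start:start + c])
        start += c
    return images
```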
image_nums
test_1_image_nums
test_3_image_nums
The Corel image numbers. We are unable to distribute the actual
images due to copyright restrictions. The data can be used to some
extent without the images. We provide the image numbers for those who
have access to the Corel images. There are a number of versions of
the Corel data, and so far it seems that the image numbers are
consistent across versions. Thus if you have a different version of
the data it is possible to construct a subset which is the
intersection of our data and your data.
seg_masks
In the directory seg_masks we include the segmentation masks for the
Corel images used for the JMLR paper. Again, we are unable to
distribute the actual images due to copyright restrictions.