"Matching Words and Pictures", by Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, and Michael I. Jordan, Journal of Machine Learning Research, Vol 3, pp 1107-1135.
(The appropriate archival reference for this data).
(gzipped tar ball with README file)
Note that the download is about 160 MB. If you need it in smaller packages, try clicking here.
(The following description is contained in the README file)
This directory contains the data used for the JMLR paper "Matching Words and Pictures", by Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, and Michael I. Jordan, Journal of Machine Learning Research, Vol 3, pp 1107-1135. This is the appropriate reference if you use this data.
The data is very much cobbled together and has some anomalies. More carefully prepared data will be made available in the future.
Each image segment is represented by 46 features. Since each image has a different number of segments, we list the number of segments used in separate files, so that the entire set of image segments can be read as a single Matlab matrix. The segments for a given image are listed in order of descending size and are separated from those for the next image by several spaces. For the JMLR paper we used the 10 largest segments (if there were 10 or more), or all the segments if there were 10 or fewer. The data in a different format, which includes the unused segments, is available upon request.
To compute the color features the images were linearized on the basis that they were PCD images, and then for convenience they were scaled up by (255/107), a somewhat arbitrary factor which has some justification based on the PCD format. (In hindsight, a factor of 2 would make more sense, but using this, or any other factor, would not change anything). Note that the features are redundant. Note also that the RGB and L*a*b features were duplicated to increase their weight for a specific experiment (long since finished), and we did not subsequently remove the duplicated columns. I do not know if this duplication inadvertently helps, hinders, or has no effect on these experiments. However, if you need a non-singular feature matrix, you will have to remove them. The 46 features are:
area, x, y, boundary/area, convexity, moment-of-inertia (6)
ave RGB (3)
ave RGB (3, yes, duplicated!)
RGB stdev (3)
ave rgS (3)
rgS stdev (3)
ave L*a*b (3)
ave L*a*b (3, yes, duplicated!)
L*a*b stdev (3)
mean oriented energy, 30 degree increments (12)
mean difference of Gaussians, 4 sigmas (4)

The data is organized into 10 different samples of roughly 16,000 images. For each sample there are three disjoint subsets corresponding to training data, held out data, and a harder held out set referred to as "novel" in the paper. The files for each of these have either no prefix (training), the prefix "test_1_" (held out), or the prefix "test_3_" (novel). (There is no test_2.) The file "words" applies to all three groups.
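As a sketch of how one might remove the duplicated columns (needed for a non-singular feature matrix), the snippet below assumes the 46 features appear in exactly the order listed above, so that the duplicated ave RGB occupies 0-based columns 9-11 and the duplicated ave L*a*b occupies columns 24-26; this column layout is an inference from the list, not something stated explicitly in the README.

```python
import numpy as np

# Assumed 0-based layout, inferred from the feature list above:
# 0-5 shape, 6-8 ave RGB, 9-11 duplicated ave RGB, 12-14 RGB stdev,
# 15-17 ave rgS, 18-20 rgS stdev, 21-23 ave L*a*b,
# 24-26 duplicated ave L*a*b, 27-29 L*a*b stdev,
# 30-41 oriented energy, 42-45 difference of Gaussians.
DUPLICATED = [9, 10, 11, 24, 25, 26]

def drop_duplicated_features(blobs):
    """Return the blob matrix with the duplicated columns removed."""
    keep = [c for c in range(blobs.shape[1]) if c not in DUPLICATED]
    return blobs[:, keep]

# Demonstration on a fake one-row matrix with 46 columns.
demo = np.arange(46, dtype=float).reshape(1, 46)
reduced = drop_duplicated_features(demo)
print(reduced.shape)  # (1, 40)
```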
The files are as follows.
words
    The vocabulary used. We count the words starting at 1, so "city" is word 1.

document_words
test_1_document_words
test_3_document_words
    The words for the images. Each line has a list of numbers which are indices into the vocabulary file "words". Counting starts at 1. If the image has fewer words than the maximum, the row is padded with -99's so that the file can be read as a Matlab matrix.

word_counts
test_1_word_counts
test_3_word_counts
    The number of words for each image. These files contain the same information as the document word files.

blob_counts
test_1_blob_counts
test_3_blob_counts
    One number per line giving the number of blobs used for that image.

blobs
test_1_blobs
test_3_blobs
    The features for the blobs for the images, listed in order of images, then decreasing blob size. To tell which blob goes with which image, you need either the file blob_counts or the file document_blobs. Note that there are some spaces between the blobs for each image.

document_blobs
test_1_document_blobs
test_3_document_blobs
    (EDITED April 4, 2004: The original wording suggested that these files supplied the blob tokens. However, these files simply point to the actual blobs. To get the tokens that were used for the ECCV 2002 paper, consult the files cluster_membership and test_1_cluster_membership.) The blobs for the images. This data is only relevant to the discrete translation method. Each line has a list of numbers representing indices into the file "blobs". If the image has fewer blobs than the maximum, the row is padded with -99's so that the file can be read as a Matlab matrix. (The names of these files are somewhat misleading because they are not exactly analogous to the files document_words and test_1_document_words. They do not give you any more information than what is available in blob_counts and test_1_blob_counts.)
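The padding and per-image grouping conventions above can be sketched in Python. This is an illustration of the stated format (-99 padding, one blob count per line), not code distributed with the data; the function names are hypothetical.

```python
def parse_padded_row(line):
    """Parse one line of a padded file such as document_words:
    whitespace-separated 1-based indices, padded with -99."""
    return [int(tok) for tok in line.split() if int(tok) != -99]

def group_blobs(blob_rows, counts):
    """Split the flat list of blob feature rows into per-image groups,
    using the per-image counts from blob_counts."""
    groups, start = [], 0
    for c in counts:
        groups.append(blob_rows[start:start + c])
        start += c
    return groups

print(parse_padded_row("3 7 -99 -99"))        # [3, 7]
print(group_blobs(list(range(5)), [2, 3]))    # [[0, 1], [2, 3, 4]]
```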
cluster_membership
test_1_cluster_membership
test_3_cluster_membership
    The blob token associated with each line of the files blobs, test_1_blobs, and test_3_blobs. This data is only relevant for discrete translation approaches to the problem.

image_nums
test_1_image_nums
test_3_image_nums
    The Corel image numbers. We are unable to distribute the actual images due to copyright restrictions, but the data can be used to some extent without them. We provide the image numbers for those who have access to the Corel images. There are a number of versions of the Corel data, and so far it seems that the image numbers are consistent across versions. Thus if you have a different version of the data, it is possible to construct a subset which is the intersection of our data and your data.

seg_masks
    In the directory seg_masks we include the segmentation masks for the Corel images used for the JMLR paper. Again, we are unable to distribute the actual images due to copyright restrictions.
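Constructing the intersection with your own Corel version might look like the following sketch, assuming you have loaded the image numbers from image_nums into a list (the function name is hypothetical).

```python
def common_images(our_nums, your_nums):
    """Return the positions in our data whose Corel image number
    also appears in your version of the Corel collection."""
    yours = set(your_nums)
    return [i for i, n in enumerate(our_nums) if n in yours]

# Demonstration with made-up image numbers.
print(common_images([104001, 104002, 104003], [104003, 104001]))  # [0, 2]
```

The returned positions can then be used to index the corresponding rows of blob_counts, document_words, and so on.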