"Matching Words and Pictures", by Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, and Michael I. Jordan, Journal of Machine Learning Research, Vol 3, pp 1107-1135.
(The appropriate archival reference for this data).
(gzipped tar ball with README file)
Note that the download is about 160 Megs. If you need it in smaller packages, try clicking here.
(The following description is contained in the README file)
This directory contains the data used for the JMLR paper "Matching Words and Pictures", by Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, and Michael I. Jordan, Journal of Machine Learning Research, Vol 3, pp 1107-1135. This is the appropriate reference if you use this data.
The data is very much cobbled together and has some anomalies. More carefully prepared data will be made available in the future.
Each image segment is represented by 46 features. Since each image has a different number of segments, we list the number of segments used in separate files, so that the entire set of image segments can be read into a single Matlab matrix. The segments for a given image are listed in order of descending size and are separated from those for the next image by several spaces. For the JMLR paper we used the 10 largest segments, or all the segments if there were fewer than 10. The data in a different format, which includes the unused segments, is available upon request.
To compute the color features the images were linearized on the basis that they were PCD images, and then for convenience they were scaled up by (255/107), a somewhat arbitrary factor which has some justification based on the PCD format. (In hindsight, a factor of 2 would make more sense, but using this, or any other factor, would not change anything). Note that the features are redundant. Note also that the RGB and L*a*b features were duplicated to increase their weight for a specific experiment (long since finished), and we did not subsequently remove the duplicated columns. I do not know if this duplication inadvertently helps, hinders, or has no effect on these experiments. However, if you need a non-singular feature matrix, you will have to remove them. The 46 features are:
area, x, y, boundary/area, convexity, moment-of-inertia (6)
ave RGB (3)
ave RGB (3, yes, duplicated!)
RGB stdev (3)
ave rgS (3)
rgS stdev (3)
ave L*a*b (3)
ave L*a*b (3, yes, duplicated!)
lab stdev (3)
mean oriented energy, 30 degree increments (12)
mean difference of gaussians, 4 sigmas (4)
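As noted above, the duplicated RGB and L*a*b columns make the 46-column feature matrix singular. A minimal sketch of removing them, assuming the columns appear in exactly the order listed (the 0-based indices below are inferred from that list, not stated in the README):

```python
import numpy as np

# Inferred 0-based column layout (from the feature list above):
#   6-8   ave RGB          9-11  ave RGB (duplicate)
#   21-23 ave L*a*b        24-26 ave L*a*b (duplicate)
DUPLICATE_COLS = [9, 10, 11, 24, 25, 26]

def drop_duplicate_features(blobs):
    """blobs: (n_segments, 46) array -> (n_segments, 40) array
    with the duplicated RGB and L*a*b columns removed."""
    return np.delete(np.asarray(blobs), DUPLICATE_COLS, axis=1)
```

If the duplicated columns actually sit elsewhere in your copy of the data, adjust DUPLICATE_COLS accordingly; the point is only that six columns are exact copies and must go before inverting the feature matrix.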
The data is organized into 10 different samples of roughly 16,000 images. For
each sample there are three disjoint subsets corresponding to training data,
held out data, and a harder held-out set referred to as "novel" in the
paper. The files for each of these have either no prefix (training),
the prefix "test_1_" (held out), or the prefix "test_3_" (novel).
(There is no test_2.) The file "words" applies to
all three groups.
The files are as follows.
words
The vocabulary used. We count the words starting at 1, so "city" is
word 1.
document_words
test_1_document_words
test_3_document_words
The words for the images. Each line has a list of numbers which are
indices into the vocabulary file "words". Counting starts at 1. If
the image has fewer words than the maximum, the row is padded with
-99's so that the file can be read as a Matlab matrix.
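A minimal sketch of reading one of these files outside Matlab, dropping the -99 padding (the file layout and padding value are as described above; the function name is ours):

```python
def read_document_words(path):
    """Read a document_words-style file: one image per line,
    1-based vocabulary indices, padded with -99's.
    Returns a list of word-index lists, padding removed."""
    docs = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # skip any blank lines
            words = [int(tok) for tok in line.split()]
            docs.append([w for w in words if w != -99])
    return docs
```

Remember that the indices are 1-based, so subtract 1 before indexing a 0-based vocabulary array.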
word_counts
test_1_word_counts
test_3_word_counts
The number of words for each image. These files contain the same
information as the document word files.
blob_counts
test_1_blob_counts
test_3_blob_counts
One number per line giving the number of blobs used for that image.
blobs
test_1_blobs
test_3_blobs
The features for the blobs for the images, listed in order of images,
then decreasing blob size. In order to tell which blob goes with
which image, you need either the file blob_counts, or the file
document_blobs. Note that there are some spaces between the blobs for
each image.
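A hedged sketch of pairing the blobs file with blob_counts to recover per-image feature matrices. It assumes rows of 46 whitespace-separated numbers, and reads every number before reshaping, so the extra whitespace between images is harmless:

```python
import numpy as np

N_FEATURES = 46  # per the feature list in this README

def read_blobs_by_image(blobs_path, counts_path):
    """Return a list with one (n_blobs_i, 46) array per image,
    splitting the rows of `blobs_path` by the per-image counts
    in `counts_path`."""
    with open(counts_path) as f:
        counts = [int(tok) for tok in f.read().split()]
    with open(blobs_path) as f:
        feats = np.array(f.read().split(), dtype=float)
    feats = feats.reshape(-1, N_FEATURES)
    assert len(feats) == sum(counts), "counts do not match blob rows"
    return np.split(feats, np.cumsum(counts)[:-1])
```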
document_blobs
test_1_document_blobs
test_3_document_blobs
(EDITED April 4, 2004: The original writing suggested that these
files supplied the blob tokens. However, these files simply point
to the actual blobs. To get the tokens that were used for the ECCV
2002 paper, consult the files cluster_membership and
test_1_cluster_membership.)
The blobs for the images. This data is only relevant to the
discrete translation method. Each line has a list of numbers
representing indices into the file "blobs". If the image has fewer
blobs than the maximum, the row is padded with -99's so that the
file can be read as a Matlab matrix.
(The names of these files are somewhat misleading because they are
not exactly analogous with the files document_words and
test_1_document_words. These files do not give you any more
information than what is available in blob_counts and
test_1_blob_counts.)
cluster_membership
test_1_cluster_membership
test_3_cluster_membership
The blob token associated with each line of the files blobs,
test_1_blobs, and test_3_blobs. This data is only relevant for
discrete translation approaches to the problem.
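For the discrete translation representation, each image then becomes a list of blob tokens. A sketch of building that list by combining cluster_membership (one token per blob row) with blob_counts (blobs per image); both file layouts are as described in this README:

```python
def read_tokens_by_image(membership_path, counts_path):
    """Return one list of blob tokens per image, splitting the
    flat token stream in `membership_path` by the per-image
    counts in `counts_path`."""
    with open(membership_path) as f:
        tokens = [int(tok) for tok in f.read().split()]
    with open(counts_path) as f:
        counts = [int(tok) for tok in f.read().split()]
    images, start = [], 0
    for c in counts:
        images.append(tokens[start:start + c])
        start += c
    return images
```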
image_nums
test_1_image_nums
test_3_image_nums
The Corel image numbers. We are unable to distribute the actual
images due to copyright restrictions. The data can be used to some
extent without the images. We provide the image numbers for those who
have access to the Corel images. There are a number of versions of
the Corel data, and so far it seems that the image numbers are
consistent across versions. Thus if you have a different version of
the data it is possible to construct a subset which is the
intersection of our data and your data.
seg_masks
In the directory seg_masks we include the segmentation masks for the
Corel images used for the JMLR paper. Again, we are unable to
distribute the actual images due to copyright restrictions.