Kobus Barnard, Pinar Duygulu, and David Forsyth, "Clustering Art", Computer Vision and Pattern Recognition, pp. II:434-439, 2001.
We extend a recently developed method for learning the semantics of image databases using text and pictures. We incorporate statistical natural language processing in order to deal with free text. We demonstrate the current system on a difficult dataset, namely 10,000 images of work from the Fine Arts Museum of San Francisco. The images include line drawings, paintings, and pictures of sculpture and ceramics. Many of the images have associated free text whose content varies greatly, from physical description to interpretation and mood.
We use WordNet to provide semantic grouping information and to help disambiguate word senses, as well as to emphasize the hierarchical nature of semantic relationships. This allows us to impose a natural structure on the image collection that reflects semantics to a considerable degree. Our method produces a joint probability distribution for words and picture elements. We demonstrate that this distribution can be used (a) to provide illustrations for given captions and (b) to generate words for images outside the training set. Results from this annotation process provide a quantitative evaluation of our method. Finally, our annotation process can be seen as a form of object recognizer that has been learned through a partially supervised process.
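As a rough illustration of how a joint distribution over words and picture elements can serve both uses (a) and (b), the sketch below builds a toy distribution from a small cluster mixture. Everything in it is an assumption made for illustration: the vocabulary, the region labels, the flat three-cluster mixture, and all probabilities are hypothetical and hand-picked, and the functions `joint`, `annotate`, and `illustrate` are not taken from the paper. It is not the authors' hierarchical model.

```python
# Toy sketch only: a flat mixture over clusters c with
#   P(word, region) = sum_c P(c) * P(word | c) * P(region | c).
# All names and numbers below are hypothetical.

WORDS = ["sky", "water", "portrait", "vase"]
REGIONS = ["blue_region", "skin_region", "ceramic_region"]

# P(c) for three hypothetical clusters.
P_CLUSTER = [0.5, 0.3, 0.2]

# P(word | c): one dict per cluster, each summing to 1.
P_WORD = [
    {"sky": 0.5, "water": 0.4, "portrait": 0.05, "vase": 0.05},
    {"sky": 0.05, "water": 0.05, "portrait": 0.8, "vase": 0.1},
    {"sky": 0.05, "water": 0.05, "portrait": 0.1, "vase": 0.8},
]

# P(region | c): one dict per cluster, each summing to 1.
P_REGION = [
    {"blue_region": 0.8, "skin_region": 0.1, "ceramic_region": 0.1},
    {"blue_region": 0.1, "skin_region": 0.8, "ceramic_region": 0.1},
    {"blue_region": 0.1, "skin_region": 0.1, "ceramic_region": 0.8},
]


def joint(word, region):
    """Joint probability P(word, region), marginalizing over clusters."""
    return sum(
        pc * pw[word] * pr[region]
        for pc, pw, pr in zip(P_CLUSTER, P_WORD, P_REGION)
    )


def annotate(region, top_k=2):
    """Use (b): rank words by P(word | region) for an observed region."""
    scores = {w: joint(w, region) for w in WORDS}
    total = sum(scores.values())
    ranked = sorted(WORDS, key=lambda w: scores[w], reverse=True)
    return [(w, scores[w] / total) for w in ranked[:top_k]]


def illustrate(word):
    """Use (a): pick the region type most probable given a caption word."""
    return max(REGIONS, key=lambda r: joint(word, r))


if __name__ == "__main__":
    print(annotate("ceramic_region"))
    print(illustrate("portrait"))
```

In this toy setup, annotation and illustration are the two conditionals of the same joint table, which is why a single learned distribution can support both directions.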
Keywords: image features, text and images, image semantics, learning, statistical models, latent semantic analysis, browsing art with science, WordNet, language and vision
Full text (gzipped postscript)
Full text (pdf)