
Learning the Semantics of Words and Pictures


Talks using roughly this set of slides have been given at INRIA Rhône-Alpes (Grenoble); Xerox Research Centre Europe; IBM Almaden; HP Research Labs, Palo Alto; and most recently, the MIS department of the University of Arizona.

Abstract (below)

Slides (PDF)

Associated on-line demo


Relevant publications

Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, and Michael I. Jordan, Matching Words and Pictures, Journal of Machine Learning Research, vol. 3, pp. 1107-1135, 2003.

Pinar Duygulu, Kobus Barnard, Nando de Freitas, and David Forsyth, Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary, Seventh European Conference on Computer Vision, pp. IV:97-112, 2002 (awarded best paper in cognitive computer vision).

Kobus Barnard, Pinar Duygulu, Nando de Freitas, and David Forsyth, Object Recognition as Machine Translation - Part 2: Exploiting Image Database Clustering Models, unpublished manuscript.

Nando de Freitas, Kobus Barnard, Pinar Duygulu, and David Forsyth, Bayesian Models for Massive Multimedia Databases: A New Frontier, 7th Valencia International Meeting on Bayesian Statistics / 2002 ISBA International Meeting, June 2-6, 2002, Spain.

Kobus Barnard, Pinar Duygulu, and David Forsyth, Modeling the Statistics of Image Features and Associated Text, Proc. Document Recognition and Retrieval IX, Electronic Imaging, 2002.

Kobus Barnard, Pinar Duygulu, and David Forsyth, Clustering Art, Computer Vision and Pattern Recognition, 2001, pp. II:434-439.

Nando de Freitas and Kobus Barnard, Bayesian Latent Semantic Analysis of Multimedia Databases, UBC TR 2001-15 (Postscript).

Kobus Barnard and David Forsyth, Learning the Semantics of Words and Pictures, International Conference on Computer Vision, vol. 2, pp. 408-415, 2001.

Kobus Barnard and David Forsyth, Exploiting Image Semantics for Picture Libraries, The First ACM/IEEE-CS Joint Conference on Digital Libraries, 2001, p. 469.


Abstract:

Large datasets are useful only to the extent that they can be browsed and searched, but doing so effectively requires learning and representing semantics. For example, a user searching for a tiger image will not be satisfied with images that have orange and black pieces unless those pieces are arranged into a tiger; the human searcher is interested in both the semantics and the appearance. Since high-level semantics are not normally available, satisfying the user requires an understanding of the relationships between these two modes. Thus improving access to unstructured data is strongly linked to learning relationships among its components. In the case of images and text, this leads to a new approach to recognition, in which we recognize things by translating visual information into semantic correlates (words). In this approach, learning to recognize is a specific case of mining multi-modal data.

Our strategy is to model the joint statistics of image region features and associated text. Several of our models also cluster the images into groups, which we exploit to support browsing (demonstrated on 10,000 images of art from the Fine Arts Museum of San Francisco). I will discuss a number of other applications implemented using probabilistic inference. These include probabilistic (soft) search and two novel applications: suggesting images for text passages (auto-illustrate) and predicting words for images outside the training set (auto-annotate). Besides being useful for indexing, auto-annotation provides an important performance measure: we compare the predicted words with the words supplied for images held out from training. Such a measure can be exploited for model selection, feature selection, and comparing segmentation methods, and I will demonstrate this with a principled comparison of two segmentation methods. A minimal sketch of the measure follows.
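As a concrete illustration (a hypothetical sketch, not code from the papers above), one can score a model by how well its top predicted words match the keywords supplied with held-out images; annotation_score and predict_words below are illustrative names standing in for any trained annotation model:

from typing import Callable, Dict, List, Set

def annotation_score(held_out: Dict[str, Set[str]],
                     predict_words: Callable[[str], List[str]],
                     n: int = 5) -> float:
    # Average, over held-out images, of the fraction of the top-n
    # predicted words that appear among the keywords supplied with
    # the image. A model that always emits the corpus's most
    # frequent words gives a useful baseline for this score.
    total = 0.0
    for image_id, true_words in held_out.items():
        predicted = predict_words(image_id)[:n]
        total += sum(1 for w in predicted if w in true_words) / n
    return total / len(held_out)

Because the same score can be computed with different features or different segmentations feeding the model, it supports the kind of principled comparison mentioned above.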

Auto-annotation is successful when good words are predicted for an image, but recognition requires those words to be attached to the appropriate image regions. To tackle the recognition problem further, we explicitly model the correspondence between image regions and words. This is analogous to the statistical machine translation problem, but instead of translating from French to English, we translate from image regions to semantic correlates (words). Looking to the future, I will suggest some ways in which word prediction could be used to further integrate higher and lower levels of analysis, and discuss other categories of data.
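To make the translation analogy concrete, here is a minimal EM sketch in the spirit of IBM Model 1, assuming image regions have already been vector quantized into discrete "blob" tokens; learn_lexicon and its arguments are illustrative names, and this is not necessarily the exact algorithm of the papers above:

from collections import defaultdict

def learn_lexicon(pairs, iterations=20):
    # pairs: one (blobs, words) pair per training image, where blobs
    # and words are lists of discrete tokens. Learns a translation
    # table t[(word, blob)] approximating p(word | blob) by EM.
    t = defaultdict(lambda: 1.0)  # uniform (unnormalized) start
    for _ in range(iterations):
        counts = defaultdict(float)  # expected co-occurrence counts
        totals = defaultdict(float)  # expected counts per blob
        for blobs, words in pairs:
            for w in words:
                # E-step: each word aligns softly to every blob in
                # its own image, in proportion to the current table.
                norm = sum(t[(w, b)] for b in blobs)
                for b in blobs:
                    frac = t[(w, b)] / norm
                    counts[(w, b)] += frac
                    totals[b] += frac
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float)
        for (w, b), c in counts.items():
            t[(w, b)] = c / totals[b]
    return t

Given such a table, a new image can be annotated by pooling the most probable words for its blobs, and, more importantly for recognition, a word can be attached to each individual region.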


Keywords: image features, text and images, image semantics, learning, statistical models, latent semantic analysis, multimedia translation, object recognition, segmentation

