Presented at Designing Tomorrow's Category-Level 3D Object Recognition Systems: An International Workshop, September 8-10, 2003, Taormina, Sicily.
Kobus Barnard, Matthew Johnson, and David Forsyth, "Word sense disambiguation with pictures," Workshop on Learning Word Meaning from Non-Linguistic Data, held in conjunction with the Human Language Technology Conference, Edmonton, Canada, May 27-June 1, 2003. [ PDF ]
Kobus Barnard, Pinar Duygulu, Raghavendra Guru, Prasad Gabbur, and David Forsyth, "The effects of segmentation and feature choice in a translation model of object recognition," Computer Vision and Pattern Recognition, pp. 675-684, 2003. [ PDF ]
Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, and Michael I. Jordan, "Matching words and pictures," Journal of Machine Learning Research, Vol. 3, pp. 1107-1135, 2003.
Pinar Duygulu, Kobus Barnard, Nando de Freitas, and David Forsyth, "Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary," Seventh European Conference on Computer Vision, pp. IV:97-112, 2002 (awarded best paper in cognitive computer vision).
Kobus Barnard, Pinar Duygulu, Nando de Freitas, and David Forsyth, "Object recognition as machine translation - Part 2: Exploiting image database clustering models," unpublished manuscript.
Nando de Freitas, Kobus Barnard, Pinar Duygulu, and David Forsyth, "Bayesian models for massive multimedia databases: A new frontier," 7th Valencia International Meeting on Bayesian Statistics / 2002 ISBA International Meeting, June 2-6, 2002, Spain.
Kobus Barnard, Pinar Duygulu, and David Forsyth, "Modeling the Statistics of Image Features and Associated Text," Proc. Document Recognition and Retrieval IX, Electronic Imaging 2002.
Kobus Barnard, Pinar Duygulu, and David Forsyth, "Clustering art," Computer Vision and Pattern Recognition, 2001, pp. II:434-439.
Nando de Freitas and Kobus Barnard, "Bayesian latent semantic analysis of multimedia databases," UBC TR 2001-15. [ Postscript ]
Kobus Barnard and David Forsyth, "Learning the semantics of words and pictures," International Conference on Computer Vision, Vol. 2, pp. 408-415, 2001.
Kobus Barnard and David Forsyth, "Exploiting image semantics for picture libraries," The First ACM/IEEE-CS Joint Conference on Digital Libraries, 2001, p. 469.
I will present computer vision as a process that translates from visual representations (images) to semantic ones (words). This translation can be learned automatically from large unstructured data sets, suggesting that computer vision is a data mining activity focused on the relationships between data elements of different modes. More specifically, we link image regions to semantically appropriate words. Importantly, we do not require that the correspondence between elements be identified in the training data. For example, we may have the keywords "tiger" and "grass" for an image, but we do not know whether "tiger" goes with the green region of the image or with the region of orange and black stripes. I will use an analogy with statistical machine translation between languages to explain how some of this ambiguity can be resolved given sufficient training data.
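To make the translation analogy concrete, the following is a minimal sketch of an EM lexicon learner in the style of IBM Model 1, applied to vector-quantized image regions ("blobs") paired with keywords, in the spirit of the ECCV 2002 paper listed above. The toy data, vocabulary sizes, and function names are illustrative assumptions, not our actual implementation.

import numpy as np

def learn_lexicon(images, n_blobs, n_words, n_iters=20):
    """EM for t[w, b] = p(word w | blob b) from loosely labeled images."""
    t = np.full((n_words, n_blobs), 1.0 / n_words)  # uniform start
    for _ in range(n_iters):
        counts = np.zeros_like(t)
        for blobs, words in images:
            for w in words:
                # E-step: spread credit for word w over this image's blobs
                # in proportion to the current translation probabilities.
                p = t[w, blobs]
                counts[w, blobs] += p / p.sum()
        # M-step: renormalize so each blob's word distribution sums to one.
        t = counts / np.maximum(counts.sum(axis=0, keepdims=True), 1e-12)
    return t

# Toy corpus: the first image pairs {tiger, grass} with two regions, but
# which word goes with which region is unknown.
TIGER, GRASS, SKY = 0, 1, 2
images = [
    (np.array([0, 1]), [TIGER, GRASS]),  # striped region 0, green region 1
    (np.array([1, 2]), [GRASS, SKY]),    # green region 1, blue region 2
]
t = learn_lexicon(images, n_blobs=3, n_words=3)
print(t.argmax(axis=0))  # most probable word per blob; expected [0 1 2]

On this toy corpus, the second image shares only the green blob with the "grass" keyword, and that overlap is what lets EM resolve the tiger/grass ambiguity in the first image; this is the sense in which sufficient training data disambiguates correspondence.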
Replacing recognition with a similar but more easily characterized activity (word prediction) finesses many long-standing problems. We do not prescribe in advance what kinds of things are to be learned; this follows automatically from the data, features, and segmentation. The system learns the relationships it can, and we do not have to construct a new model by hand in order to recognize a new kind of thing. Since we can measure system performance by how well it predicts words for images held out from training, we can evaluate image segmenters and feature choices in a principled manner. Finally, and of great interest in our current work, the approach can be used to integrate high- and low-level vision processes. For example, we use word prediction to propose region merges. Using only low-level features, it is not possible to merge the black and white halves of a penguin. However, if these regions have similar probability distributions over words, we can propose a region merge. If such a grouping leads to better overall word prediction, then it can be proposed as a (better) visual model for the word "penguin".
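As a hedged sketch of the merging idea, the code below proposes merging adjacent regions whose distributions over words agree; the symmetric KL divergence test and the threshold are assumptions made for illustration, not a fixed part of the method.

import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two word distributions."""
    p, q = p + eps, q + eps
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def propose_merges(word_dists, adjacency, threshold=0.5):
    """Return pairs of adjacent regions whose word distributions agree.

    word_dists: (n_regions, n_words) rows of p(word | region).
    adjacency:  iterable of (i, j) pairs of spatially adjacent regions.
    """
    return [(i, j) for i, j in adjacency
            if sym_kl(word_dists[i], word_dists[j]) < threshold]

# Both halves of a penguin predict "penguin" strongly despite very
# different low-level features, so the pair (0, 1) is proposed; the
# grass-like region 2 is not merged with either half.
word_dists = np.array([
    [0.8, 0.1, 0.1],   # black half:  mostly "penguin"
    [0.7, 0.2, 0.1],   # white half:  mostly "penguin"
    [0.1, 0.8, 0.1],   # background:  mostly "grass"
])
print(propose_merges(word_dists, adjacency=[(0, 1), (1, 2)]))  # [(0, 1)]

A proposed merge would then be accepted or rejected by the criterion in the text: whether the merged region improves held-out word prediction.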
Keywords: image features, text and images, image semantics, learning, statistical models, latent semantic analysis, multimedia translation, object recognition, segmentation