Kobus Barnard, Pinar Duygulu, and David Forsyth, "Recognition as Translating Images into Text," Internet Imaging IX, Electronic Imaging 2003 (Invited paper).
We present an overview of a new paradigm for tackling long-standing computer vision problems. Specifically, our approach is to build statistical models that translate from visual representations (images) to semantic ones (associated text). As providing optimal text for training is difficult at best, we propose working with whatever associated text is available in large quantities. Examples include large image collections with keywords, museum image collections with descriptive text, news photos, and images on the web.
In this paper we discuss how the translation approach can give a handle on difficult questions such as: What counts as an object? Which objects are easy to recognize and which are hard? Which objects are indistinguishable using our features? How can low-level vision processes, such as feature-based segmentation, be integrated with high-level processes such as grouping? We also summarize some of the models proposed for translating from visual information to text, and some of the methods used to evaluate their performance.
Keywords: Object recognition, machine translation, learning image semantics, hierarchical clustering, aspect model