Kobus Barnard, Pinar Duygulu, and David Forsyth, "Exploiting Text and Image Feature Co-occurrence Statistics in Large Datasets," to appear as a chapter in Trends and Advances in Content-Based Image and Video Retrieval (tentative title).
Building tools for accessing image data is hard because users are typically interested in the semantics of the image content. For example, a user searching for a tiger image will not be satisfied with images that merely have plausible color histograms; tiger semantics are required. The requirement that image features be linked to semantics means that real progress in image data access is fundamentally bound to traditional problems in computer vision. In this paper we outline recent work on learning such relationships from large datasets of images with associated text (e.g., keywords, captions, metadata, or descriptions). Fundamental to our approach is that images and associated text are both compositional: images are composed of regions and objects, and text is composed of words or, more abstractly, topics or concepts. An important problem we consider is how to learn the correspondence between the components across the two modes. Training data with the correspondences identified is rare and expensive to collect. By contrast, there are large amounts of data available for training with weak correspondence information (e.g., Corel, with 40,000 images; captioned news photographs on the web, 20,000 images per month; web images embedded in text; video with captioning or speech recognition). The statistical models learned from such data support browsing and searching by text, image features, or both, as well as novel applications such as suggesting images to illustrate text passages (auto-illustrate), attaching words to images (auto-annotate), and attaching words to specific image regions (recognition).
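The shared computational step behind these applications is estimating which words go with which image components from weakly labeled data. The sketch below is a rough illustration only, not the model developed in the chapter: it fits a simple translation-style table approximating p(word | region token) with EM, assuming regions have already been segmented and quantized into discrete tokens; the function name, toy data, and vocabulary sizes are all hypothetical.

    import numpy as np

    def learn_word_given_blob(images, n_blobs, n_words, n_iters=25, seed=0):
        """Toy EM for a translation-style region-word correspondence model.

        images: list of (blob_ids, word_ids) pairs, where blob_ids are the
        quantized region tokens of one image and word_ids its keywords.
        Returns a table t with t[b, w] approximating p(word w | blob b).
        """
        rng = np.random.default_rng(seed)
        t = rng.random((n_blobs, n_words))
        t /= t.sum(axis=1, keepdims=True)
        for _ in range(n_iters):
            counts = np.zeros((n_blobs, n_words))
            for blob_ids, word_ids in images:
                blobs = np.asarray(blob_ids)
                for w in word_ids:
                    # E-step: softly assign keyword w to the regions of this
                    # image in proportion to the current table entries.
                    resp = t[blobs, w]
                    total = resp.sum()
                    if total > 0:
                        np.add.at(counts, (blobs, w), resp / total)
            # M-step: renormalize accumulated responsibilities into
            # conditional word probabilities for each blob token.
            t = counts + 1e-12
            t /= t.sum(axis=1, keepdims=True)
        return t

    # Hypothetical toy data: blob 0 co-occurs with word 0 ("tiger"),
    # blob 1 with word 1 ("grass"), blob 2 with word 2 ("water").
    toy = [([0, 1], [0, 1]), ([0, 2], [0, 2]), ([1, 2], [1, 2])]
    table = learn_word_given_blob(toy, n_blobs=3, n_words=3)
    print(np.round(table, 2))  # each row should favor its matching word

Even though no image in the toy set labels a region directly, the co-occurrence statistics across images are enough to pull each blob token toward its own word, which is the sense in which weak correspondence data can support region-level annotation.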
Keywords: Object recognition, multi-media translation, statistical models, segmentation, correspondence, auto-annotate, auto-illustrate, browsing, image retrieval