

Image understanding as multi-media translation

This work has now spanned more than a decade, and has been picked up in one form or another by many others. Research in linking words and pictures has two main motivations:

  • Improving access to image and video data through browsing, searching, and data mining.
  • Exploiting large image collections as data for classical computer vision problems such as scene understanding and object recognition.

    These two problems are closely related because images have both semantic and visual content. Our approach is to build statistical models which "explain" the data in a collection. The data consists of image segment features and words associated with the images. The words may be carefully chosen keywords as in the Corel data set, or free-form text in conjunction with natural language pre-processing.

    Associated words provide semantic content which is difficult to derive using standard computer vision methods. Conversely, the image features provide information which is often omitted when humans provide the words because it is clearly a visual element. For example, an image of a red rose will not normally have the keyword "red". Thus image features and associated words can complement and even disambiguate each other.

    Once a statistical model has been built from the data, it can be queried in a probabilistic sense. Specifically, we can ask which images have high probability given the query items, which can be any combination of words and image features. An extreme use of such queries is what we refer to as "auto-illustrate": given a text passage, we can query an image database for appropriate images. Going even further, the model can be queried to attach words to pictures ("auto-annotate"), which has clear links to recognition.
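
    As a rough illustration of such a probabilistic query, the sketch below ranks images by the likelihood of the query words under a small mixture model. The cluster names, image identifiers, and all probabilities are hypothetical, and the flat two-cluster mixture is a simplifying assumption rather than the model used in this work.

```python
import math

# Hypothetical toy model: each cluster has a word distribution P(word | cluster),
# and each image has a posterior over clusters P(cluster | image).
WORD_GIVEN_CLUSTER = {
    "jungle_cats": {"tiger": 0.5, "grass": 0.3, "river": 0.1, "water": 0.1},
    "waterways": {"river": 0.5, "water": 0.4, "tiger": 0.05, "grass": 0.05},
}
CLUSTER_GIVEN_IMAGE = {
    "img_a": {"jungle_cats": 0.9, "waterways": 0.1},
    "img_b": {"jungle_cats": 0.2, "waterways": 0.8},
    "img_c": {"jungle_cats": 0.5, "waterways": 0.5},
}


def query_likelihood(image_id, query_words):
    """P(query | image) = sum over clusters of P(c | image) * prod_w P(w | c)."""
    total = 0.0
    for cluster, p_cluster in CLUSTER_GIVEN_IMAGE[image_id].items():
        word_probs = WORD_GIVEN_CLUSTER[cluster]
        # Unknown words get a tiny floor probability instead of zero.
        p_words = math.prod(word_probs.get(w, 1e-6) for w in query_words)
        total += p_cluster * p_words
    return total


def rank_images(query_words):
    """Return image ids sorted from most to least probable for the query."""
    scores = {img: query_likelihood(img, query_words) for img in CLUSTER_GIVEN_IMAGE}
    return sorted(scores, key=scores.get, reverse=True)


ranking = rank_images(["river", "tiger"])
```

    Because the score is computed from cluster-level word distributions, an image can rank well for "river" and "tiger" together even if no single training caption contained both words.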

    The next step towards recognition is to attach the words to specific image regions. To do this we must solve a correspondence problem while training, because which image parts the words refer to is not known a priori. The situation is analogous to learning machine translation from aligned bi-texts. Here the sentence pairs are assumed to have the same content, but which word in one corresponds to which word in the other is not given. Similarly, we approach recognition as translating from visual cues to semantic correlates (words).

    A key observation exploited in this approach is that as the learning system builds support for which parts of the image are, say, "grass", this reduces the correspondence ambiguity between the other blobs and words. The information to do this is present when the database is analyzed as a whole.
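
    The way evidence for one word reduces ambiguity for the others can be sketched with a small expectation-maximization loop in the spirit of simple word-alignment models from machine translation. This is an illustrative assumption, not a description of the exact model used here: regions are assumed to be already quantized into discrete tokens, and the token names and toy captions are hypothetical.

```python
from collections import defaultdict


def learn_lexicon(pairs, iterations=20):
    """Estimate t(word | blob) from (blob_tokens, caption_words) pairs with EM."""
    words = {w for _, ws in pairs for w in ws}
    blobs = {b for bs, _ in pairs for b in bs}
    # Start with no preference: every blob translates to every word equally.
    t = {b: {w: 1.0 / len(words) for w in words} for b in blobs}
    for _ in range(iterations):
        counts = defaultdict(lambda: defaultdict(float))
        for bs, ws in pairs:
            for w in ws:
                # E-step: split responsibility for word w among this image's blobs.
                norm = sum(t[b][w] for b in bs)
                for b in bs:
                    counts[b][w] += t[b][w] / norm
        # M-step: renormalize the expected counts for each blob.
        for b in blobs:
            total = sum(counts[b].values())
            t[b] = {w: counts[b][w] / total for w in words}
    return t


# Hypothetical training set: each image is a bag of blob tokens plus caption words.
training_pairs = [
    (["green_blob", "striped_blob"], ["grass", "tiger"]),
    (["green_blob", "blue_blob"], ["grass", "sky"]),
    (["blue_blob", "white_blob"], ["sky", "cloud"]),
]
lexicon = learn_lexicon(training_pairs)
best_word = {b: max(probs, key=probs.get) for b, probs in lexicon.items()}
```

    In the first image alone, "tiger" could belong to either blob. Because the green blob also co-occurs with "grass" in the second image, the loop shifts "grass" toward the green blob and leaves "tiger" for the striped one.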

    Spatial reasoning. Additional work has looked at constructing more complex models for both the image and language side, and linking the entities in these models. For example, we assume that effective object models include a notion of parts that need to be grouped or analyzed together for recognition. In preliminary work (CVPR 03) we have shown that our approach is suitable for proposing groupings of dissimilar regions. For example, a low-level segmentation algorithm cannot merge the black and white regions of a penguin, but if both regions are associated with "penguin", we can posit that a better model for "penguin" should incorporate several regions. In collaboration with Peter Carbonetto and Nando de Freitas (ECCV 04), we developed a spatial context model for image grid regions.

    Language processing. Similarly, more intelligence on the language side can be helpful. The system described above is limited to simple "stuff" nouns. However, if we know that certain words are adjectives and prepositions and further understand their role in sentence structure, then, rather than confuse the learning system, we can use language parsing to help deal with the correspondence ambiguity (TCOR 06).

    Visual adjectives. One problem that arises in exploiting adjectives in free-form captions is that most of them do not correspond to particularly salient visual attributes (e.g., "religious"). Hence we have developed a method to determine the visualness of adjectives, and we use it to mine web data for candidate useful adjectives (ACM MM 05).

    Multi-modal disambiguation. Similarly, images can help ground the meaning of language. We have introduced an approach for using images to disambiguate words (AIJ 05). A word like "bank" has many meanings, including a financial institution and a break in the terrain, as in "river bank". Standard approaches disambiguate words using textual context. We augment text disambiguation methods to use image information from accompanying illustrations.

    Aligning image elements and caption words is an alternative to simultaneously dealing with two difficult problems: 1) image element and caption word correspondence; and 2) building models of semantic units in images. In alignment, we focus on linking image elements with caption words. In other words, we focus on labeling the training data. Having achieved this, we can build models for the semantic units, perhaps with additional iterations of correspondence estimation. While this is implicit in other systems, breaking the problem up in this way has some advantages. For example, we have shown that it can lead to better learning of visual attributes associated with rare words (CVPR 07).

    Exploiting object detectors. The systems described so far do best in learning the appearance of "stuff" (e.g., sky, water, grass), or certain animals and objects that can be recognized largely by color and texture (e.g., tigers and zebras). Handling shape, while feasible using multi-part models as mentioned above, has proven difficult. Another approach is to exploit the significant work on discriminative models for object category recognition that are trained using images of examples where the objects are prominent. We integrated such detectors into an alignment system, and have verified that they can help. To do this we needed to convert binary (yes/no) detector outputs to probabilities. We further found that WordNet can be used to let, say, a bird detector help resolve sub-categories such as eagle.
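
    The text above does not say how the binary detector outputs were converted to probabilities. One simple possibility, assumed here purely for illustration, is Bayes' rule with a hit rate and false alarm rate measured on held-out data. The function name and the example rates are hypothetical.

```python
def posterior_object_present(fired, prior, hit_rate, false_alarm_rate):
    """P(object present | detector output) via Bayes' rule.

    fired            -- True if the detector said "yes" for the region
    prior            -- P(object present) before seeing the detector output
    hit_rate         -- P(detector fires | object present), assumed known
    false_alarm_rate -- P(detector fires | object absent), assumed known
    """
    if fired:
        present = hit_rate * prior
        absent = false_alarm_rate * (1.0 - prior)
    else:
        present = (1.0 - hit_rate) * prior
        absent = (1.0 - false_alarm_rate) * (1.0 - prior)
    return present / (present + absent)


# Hypothetical detector: fires on 80% of true instances and 10% of background.
p_yes = posterior_object_present(True, prior=0.5, hit_rate=0.8, false_alarm_rate=0.1)
p_no = posterior_object_present(False, prior=0.5, hit_rate=0.8, false_alarm_rate=0.1)
```

    A soft score of this kind can be combined with color and texture evidence for a region, whereas a hard yes/no decision cannot be weighed against other cues.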

  • Browsing: Multi-modal modeling for browsing large image collections.

    Search: Probabilistic query "river and tiger". The two words do not appear together.
    Auto-illustration: Choosing pictures for text. Words from a passage of "Moby Dick":
    "large importance attached fact old dutch century more command whale ship ..."
    And images from the Fine Arts Museums of San Francisco found to illustrate the passage.
    Auto-annotation: Providing keywords to images that don't have them.

    Keywords: HIPPO BULL mouth walk
    Predicted words: water hippos rhino river grass reflection

    Recognition: Attaching words to image regions.
    Alignment: Linking caption words to visual features.

    To the right are some results of region labeling where the caption is provided. We only show labels for the 10 largest regions. The left images show the result using color and texture features. The right images show the result of combining that with object detection results for caption words that we have detectors for. Note that in the bottom right image, we automatically linked the caption word "man" to "person" using WordNet so that we could use a person detector, as we did not have a specific detector for man.

    Mining for visual adjectives.

    A text-based web image query retrieves many images, only some of which have yellow regions, and even if they do have yellow regions, we do not know which ones they are. However, we can cluster the regions with dominant appearance vs others which fit better with a generic model. Having created a cluster of yellow regions, we can compute the entropy of yellow region appearance statistics. Since they have low entropy, we say that yellow is a "visual" adjective relative to our features.
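
    The entropy comparison can be illustrated with a toy calculation. The sketch below uses the Shannon entropy of a histogram over a single scalar feature, which is a simplifying assumption (the estimator used in this work is not described here), and the hue values are made up.

```python
import math
from collections import Counter


def appearance_entropy(values, num_bins=8, lo=0.0, hi=1.0):
    """Shannon entropy in bits of a histogram of scalar appearance values."""
    width = (hi - lo) / num_bins
    # Clamp to the last bin so a value equal to `hi` is still counted.
    bins = Counter(min(int((v - lo) / width), num_bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in bins.values())


# Hypothetical hue values (0..1) of regions clustered for two adjectives.
yellow_hues = [0.15, 0.16, 0.14, 0.17, 0.15, 0.16, 0.18, 0.26]
religious_hues = [0.05, 0.20, 0.35, 0.50, 0.62, 0.77, 0.90, 0.99]

yellow_entropy = appearance_entropy(yellow_hues)
religious_entropy = appearance_entropy(religious_hues)
```

    The tightly clustered "yellow" values give a much lower entropy than the scattered "religious" values, which is the sense in which the first adjective is more "visual" for this feature.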

    Exploiting adjectives for reducing correspondence ambiguity.

    In the left hand image we aligned caption nouns with the image based on a strong prior (90%) that the words come from the caption (and 10% that they are omitted from the caption). In the right hand image we further integrate the label "red" which is attached to the noun "car". This improves the alignment with respect to that noun.

    Word sense disambiguation with pictures.

    Here we show the determined senses of words in a paragraph with the word "plant." In this example, a text-based word sense disambiguation method provides the wrong sense of "plant" (plant_1, factory). We then provided the image on the right as an illustration for the paragraph, and computed the sense-specific word probabilities based on the image, using an annotation system trained on image data with sense-specified keywords. Combining the text-based probabilities with the image-based probabilities led to the correct sense (plant_2, botanical).
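
    One simple way to combine the two sources, assumed here for illustration only, is to multiply the per-sense probabilities and renormalize, which treats text and image evidence as independent given the sense. The probabilities below are hypothetical and chosen to mirror the "plant" example.

```python
def combine_sense_probabilities(text_probs, image_probs):
    """Multiply per-sense probabilities from two sources and renormalize."""
    combined = {sense: text_probs[sense] * image_probs[sense] for sense in text_probs}
    total = sum(combined.values())
    return {sense: p / total for sense, p in combined.items()}


# Hypothetical numbers: text alone slightly prefers the factory sense,
# while the illustration strongly suggests the botanical sense.
text_probs = {"plant_1": 0.6, "plant_2": 0.4}
image_probs = {"plant_1": 0.2, "plant_2": 0.8}

combined = combine_sense_probabilities(text_probs, image_probs)
best_sense = max(combined, key=combined.get)
```

    With these numbers the text-only choice is plant_1, but the combined distribution favors plant_2.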


    Kobus Barnard
    David Blei
    Nando de Freitas
    Pinar Duygulu
    David Forsyth
    Michael Jordan
    Robert Wilensky
    Keiji Yanai
    Peter Carbonetto
    Luca del Pero
    Quanfu Fan
    Prasad Gabbur
    Nikhil Shirahatti
    Ranjini Swaminathan
    Matthew Johnson
    Philip Lee
    James Magahern
    Emily Hartley
    Roderic Collins
    Niels Haering
    Anthony Hoogs
    Atul Kanaujia
    John Kaufhold
    Ping Wang


    Luca del Pero, James Magahern, Philip Lee, Emily Hartley, Ping Wang, Atul Kanaujia, Niels Haering, and Kobus Barnard,
    "Fusing object detection and region appearance for image-text alignment," ACM Multimedia short paper, 2011 (to appear).

    Kobus Barnard, Quanfu Fan, Ranjini Swaminathan, Anthony Hoogs, Roderic Collins, Pascale Rondot, John Kaufhold,
    "Evaluation of localized semantics: data, methodology, and experiments," International Journal of Computer Vision, Vol. 77, pp 199-217, 2008.    [ PDF]

    Kobus Barnard and Quanfu Fan,
    "Reducing correspondence ambiguity in loosely labeled training data," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June, 2007.       [PDF]

    Keiji Yanai and Kobus Barnard,
    "An Analysis of 'Visualness' of Word Concepts," IPSJ Transactions on Computer Vision and Image Media, Vol.48, No.SIG1(CVIM17) (2007/02) (in Japanese)

    Kobus Barnard, Keiji Yanai, Matthew Johnson, and Prasad Gabbur,
    "Cross modal disambiguation," in Toward Category-Level Object Recognition, Jean Ponce, Martial Hebert, Cordelia Schmid, eds., Springer-Verlag LNCS Vol. 4170, 2006.

    Kobus Barnard and Keiji Yanai,
    "Mutual information of words and pictures," Information Theory and Applications Inaugural Workshop, February 6-10, 2006.    [PDF]

    Keiji Yanai and Kobus Barnard,
    "Finding Visual Concept by Web Image Mining", Proc. of the Fifteenth International World Wide Web Conference, Edinburgh, Scotland, 2006.

    Kobus Barnard, Quanfu Fan, Ranjini Swaminathan, Anthony Hoogs, Roderic Collins, Pascale Rondot, John Kaufhold,
    "Evaluation of localized semantics: data, methodology, and experiments," University of Arizona, Computing Science, Technical Report, TR-05-08, 2005 (revised October 2006).    [ PDF]

    Keiji Yanai and Kobus Barnard,
    "Image Region Entropy: A Measure of 'Visualness' of Web Images Associated with One Concept", Proc. of ACM Multimedia, Singapore, November, 2005.    [PDF]

    Keiji Yanai and Kobus Barnard,
    "Probabilistic Web Image Gathering", Proc. of ACM Multimedia Workshop on Multimedia Information Retrieval (MIR), Singapore, November, 2005.    [PDF]

    Keiji Yanai, Nikhil V. Shirahatti, Prasad Gabbur and Kobus Barnard,
    "Evaluation Strategies for Image Understanding and Retrieval", Proc. of ACM Multimedia Workshop on Multimedia Information Retrieval (MIR), Singapore, November, 2005 (Invited paper).    [PDF]

    Kobus Barnard and Matthew Johnson,
    "Word Sense Disambiguation with Pictures", Artificial Intelligence, Volume 167, pp. 13-30, 2005.    [PDF]

    Nikhil V Shirahatti and Kobus Barnard,
    "Evaluating Image Retrieval"   Computer Vision and Pattern Recognition, (CVPR), San Diego, CA, pp. I:955-961, June 2005.    [PDF]

    Peter Carbonetto, Nando de Freitas, and Kobus Barnard,
    " A Statistical Model for General Contextual Object Recognition,"   European Conference on Computer Vision, 2004.   (Copyright Springer-Verlag. Published in the Springer-Verlag Lecture Notes in Computer Science )    [ PDF ]

    Kobus Barnard, Pinar Duygulu, and David Forsyth,
    "Exploiting Text and Image Feature Co-occurrence Statistics in Large Datasets,"   to appear as a chapter in Trends and Advances in Content-Based Image and Video Retrieval (tentative title).    [ PDF ]

    Kobus Barnard and Prasad Gabbur,
    "Color and Color Constancy in a Translation Model for Object Recognition,"   Eleventh Color Imaging Conference, pp 364-369.    [ PDF ]

    Kobus Barnard, Matthew Johnson, and David Forsyth,
    "Word sense disambiguation with pictures," Workshop on learning word meaning from non-linguistic data , held in conjunction with The Human Language Technology Conference, Edmonton, Canada, May 27-June 1, 2003.    [ PDF ]

    Kobus Barnard, Pinar Duygulu, Raghavendra Guru, Prasad Gabbur, and David Forsyth,
    "The effects of segmentation and feature choice in a translation model of object recognition," Computer Vision and Pattern Recognition, pp II: 675-682, 2003.    [ PDF ]

    Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, and Michael I. Jordan,
    "Matching Words and Pictures," Journal of Machine Learning Research, Vol 3, pp 1107-1135, 2003.    [ PDF ]

    Kobus Barnard and Nikhil V. Shirahatti,
    " A method for comparing content based image retrieval methods", Internet Imaging IX, Electronic Imaging 2003.    [ PDF ]

    Kobus Barnard, Pinar Duygulu, and David Forsyth,
    " Recognition as Translating Images into Text" Internet Imaging IX, Electronic Imaging 2003 (Invited paper).    [ PDF ]

    Pinar Duygulu, Kobus Barnard, Nando de Freitas, and David Forsyth
    " Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary," Seventh European Conference on Computer Vision, pp IV:97-112, 2002 (Awarded best paper in cognitive computer vision).    [ PDF ]

    Kobus Barnard, Pinar Duygulu, Nando de Freitas, and David Forsyth
    " Object Recognition as Machine Translation - Part 2: Exploiting Image Database Clustering Models ," unpublished manuscript, 2002. [ PDF ]

    Nando de Freitas, Kobus Barnard, Pinar Duygulu and David Forsyth,
    "Bayesian Models for Massive Multimedia Databases: a New Frontier," 7th Valencia International Meeting on Bayesian Statistics/2002 ISBA International Meeting, June 2nd - June 6th, 2002, Spain.

    Kobus Barnard, Pinar Duygulu, and David Forsyth,
    "Modeling the Statistics of Image Features and Associated Text," Proc. Document Recognition and Retrieval IX, Electronic Imaging 2002.

    Kobus Barnard, Pinar Duygulu, and David Forsyth,
    "Clustering Art," Computer Vision and Pattern Recognition, 2001, pp. II:434-439.

    Nando de Freitas and Kobus Barnard,
    "Bayesian Latent Semantic Analysis of Multimedia Databases," UBC TR 2001-15.

    Kobus Barnard and David Forsyth,
    "Learning the Semantics of Words and Pictures," International Conference on Computer Vision, vol 2, pp. 408-415, 2001.

    Kobus Barnard and David Forsyth,
    "Exploiting Image Semantics for Picture Libraries," The First ACM/IEEE-CS Joint Conference on Digital Libraries, 2001, page 469.