

WORDS and PICTURES

Image understanding as multi-media translation

(AKA Computer Vision Meets Digital Libraries)

Browsing: A Browser for Large Image Collections (DEMO!)


Search: Probabilistic query "river and tiger". The two words do not appear together.

Auto-illustration: Choosing pictures for text.

Words from a passage of "Moby Dick":

"large importance attached fact old dutch century more command whale ship ..."

And images from the Fine Arts Museums of San Francisco found to illustrate the passage.


Auto-annotation: Attaching words to images.

Keywords: HIPPO BULL mouth walk
Predicted words: water hippos rhino river grass reflection


Recognition: Attaching words to image regions.

Computer vision meets digital libraries when there are lots and lots of images. Our interest is twofold:

  • Provide access to large image datasets through browsing and search.
  • Use large image collections as data for classical computer vision problems such as scene understanding and object recognition.

    These two problems are closely related because images have both semantic and visual content. Our approach is to build statistical models which "explain" the data in a collection. The data consists of image segment features and words associated with the images. The words may be carefully chosen keywords as in the Corel data set, or free-form text in conjunction with natural language pre-processing.

    Associated words provide semantic content which is difficult to derive using standard computer vision methods. Conversely, the image features provide information which is often omitted when humans provide the words because it is clearly a visual element. For example, an image of a red rose will not normally have the keyword "red". Thus image features and associated words can complement and even disambiguate each other.

    Once a statistical model has been built from the data, it can be queried in a probabilistic sense. Specifically, we can ask which images have high probability given the query items, which can be any combination of words and image features. An extreme use of such queries is what we refer to as "auto-illustrate": given a text passage, we can query an image database for appropriate images. Going even further, the model can be queried to attach words to pictures ("auto-annotate"), which has clear links to recognition.
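
    The querying idea above can be illustrated with a minimal sketch. It assumes a flat mixture model in which each image has a distribution over latent clusters and each cluster has a distribution over words; this is a simplification chosen for illustration, not necessarily the model used in the papers listed below. All cluster names, image names, and probabilities are hypothetical toy values.

```python
import math

# Hypothetical toy model: p(word | cluster) for two latent clusters.
WORD_GIVEN_CLUSTER = {
    "jungle_cat": {"tiger": 0.5, "grass": 0.3, "river": 0.1, "water": 0.1},
    "waterside": {"river": 0.4, "water": 0.4, "grass": 0.1, "tiger": 0.1},
}

# Hypothetical p(cluster | image), as if inferred from each image's features and words.
CLUSTER_GIVEN_IMAGE = {
    "img_tiger_in_grass": {"jungle_cat": 0.9, "waterside": 0.1},
    "img_tiger_by_stream": {"jungle_cat": 0.5, "waterside": 0.5},
    "img_lake": {"jungle_cat": 0.05, "waterside": 0.95},
}


def word_prob(image, word):
    """p(word | image) = sum over clusters of p(cluster | image) * p(word | cluster)."""
    return sum(
        weight * WORD_GIVEN_CLUSTER[cluster].get(word, 1e-9)
        for cluster, weight in CLUSTER_GIVEN_IMAGE[image].items()
    )


def query_log_prob(image, query_words):
    """Log-probability of the query words given the image (words treated as independent)."""
    return sum(math.log(word_prob(image, w)) for w in query_words)


def rank_images(query_words):
    """Search: order images from most to least probable under the query."""
    return sorted(
        CLUSTER_GIVEN_IMAGE,
        key=lambda image: query_log_prob(image, query_words),
        reverse=True,
    )


def predict_words(image, k):
    """Auto-annotation: the k most probable words for an image."""
    vocabulary = sorted({w for dist in WORD_GIVEN_CLUSTER.values() for w in dist})
    return sorted(vocabulary, key=lambda w: word_prob(image, w), reverse=True)[:k]


if __name__ == "__main__":
    print(rank_images(["river", "tiger"]))
    print(predict_words("img_tiger_in_grass", 2))
```

    Because the query is scored through the latent clusters rather than by matching keywords, an image can rank highly for "river" and "tiger" even if no single annotation contains both words.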

    The next step towards recognition is to attach the words to specific image regions. To do this we must solve a correspondence problem while training, because which image parts the words refer to is not known a priori. The situation is analogous to learning machine translation from aligned bi-texts. Here you have sentences assumed to have the same content, but the n'th word in one does not necessarily translate to the n'th word in the other. Similarly, we approach recognition as translating from visual cues to semantic correlates (words).

    A key observation exploited in this approach is that as the learning system builds support for which parts of the image are, say, "grass", the correspondence ambiguity between the other blobs (image regions) and words is reduced. The information to do this is present when the database is analyzed as a whole.
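
    The way correspondence ambiguity shrinks across a collection can be sketched with expectation-maximization in the style of a simple lexical translation model (IBM Model 1 without a null token). This is an assumption made for illustration: blobs are taken to be already quantized into discrete tokens, and the blob names and training pairs are hypothetical.

```python
from collections import defaultdict


def train_translation_table(pairs, iterations=20):
    """Estimate t[blob][word] = p(word | blob) from (blobs, words) pairs with EM.

    Each word in an image is assumed to be generated by exactly one of the
    blobs in that image; which one is the hidden correspondence.
    """
    words = {w for _, image_words in pairs for w in image_words}
    blobs = {b for image_blobs, _ in pairs for b in image_blobs}

    # Start with no preference: every blob translates to every word equally.
    t = {b: {w: 1.0 / len(words) for w in words} for b in blobs}

    for _ in range(iterations):
        counts = {b: defaultdict(float) for b in blobs}
        # E-step: share each word among the blobs of its image in
        # proportion to the current translation probabilities.
        for image_blobs, image_words in pairs:
            for w in image_words:
                norm = sum(t[b][w] for b in image_blobs)
                for b in image_blobs:
                    counts[b][w] += t[b][w] / norm
        # M-step: renormalize the expected counts for each blob.
        for b in blobs:
            total = sum(counts[b].values())
            t[b] = {w: counts[b][w] / total for w in words}
    return t


if __name__ == "__main__":
    # Hypothetical toy collection: no single image says which blob is "grass".
    training_pairs = [
        (["green_blob", "striped_blob"], ["grass", "tiger"]),
        (["green_blob", "blue_blob"], ["grass", "water"]),
        (["striped_blob", "blue_blob"], ["tiger", "water"]),
    ]
    table = train_translation_table(training_pairs)
    for blob in sorted(table):
        best = max(table[blob], key=table[blob].get)
        print(blob, "->", best, round(table[blob][best], 3))
```

    In this toy collection each image alone is ambiguous, but "grass" co-occurs with the green blob in two images, so support for that pairing grows and the remaining words are pushed towards the remaining blobs.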

    Current work is focused on constructing more complex models for both the image and language side, and linking the entities in these models. For example, we assume that effective object models include a notion of parts that need to be grouped or analyzed together for recognition. In preliminary work we have shown that our approach is suitable for proposing groupings of dissimilar regions. For example, a low-level segmentation algorithm cannot merge the black and the white regions of a penguin, but if both regions are associated with "penguin", we can posit that a better model for "penguin" should incorporate several regions.
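
    As a rough illustration of proposing such groupings, the sketch below merges adjacent regions whose most probable word is the same. The region identifiers, the adjacency list, and the rule of grouping on the single best word are all assumptions made for this example, not the method used in the preliminary work.

```python
def propose_groupings(region_words, adjacency):
    """Group adjacent regions that share the same most probable word.

    region_words: dict mapping region id -> most probable word for that region.
    adjacency: iterable of (region id, region id) pairs that touch in the image.
    Returns a sorted list of proposed groups (each a sorted list of region ids).
    """
    parent = {region: region for region in region_words}

    def find(region):
        # Union-find lookup with path halving.
        while parent[region] != region:
            parent[region] = parent[parent[region]]
            region = parent[region]
        return region

    for a, b in adjacency:
        if region_words[a] == region_words[b]:
            parent[find(a)] = find(b)

    groups = {}
    for region in region_words:
        groups.setdefault(find(region), []).append(region)
    # Only groups with more than one region are proposals for merging.
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


if __name__ == "__main__":
    # Hypothetical segmentation: r1 is the black region and r2 the white region.
    words = {"r1": "penguin", "r2": "penguin", "r3": "snow", "r4": "sky"}
    touching = [("r1", "r2"), ("r2", "r3"), ("r3", "r4")]
    print(propose_groupings(words, touching))
```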

    Similarly, more intelligence on the language side can be helpful. The system described above is limited to simple "stuff" nouns. However, if we know that certain words are adjectives and prepositions and further understand their role in sentence structure, then we should be able to simultaneously learn the visual meaning of these words and exploit them to reduce ambiguity in the training data.

    Finally, going the other way, images can help ground the meaning of language. In recent work we have shown that our image analysis can be used to help disambiguate words. A word like "bank" has many meanings, including a financial institution and a break in the terrain, as in "river bank". Word sense disambiguation usually relies on textual context alone. We augment text disambiguation methods to use image information from accompanying illustrations.
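
    One simple way to picture combining the two sources of evidence is sketched below. It assumes that the text and the image are conditionally independent given the word sense, so the two posteriors can be multiplied and divided by the sense prior. This combination rule, the sense labels, and the numbers are assumptions for illustration and are not taken from the papers.

```python
def combine_sense_evidence(p_sense_given_text, p_sense_given_image, prior=None):
    """Combine text-based and image-based sense posteriors.

    Under the assumed conditional independence of text and image given the
    sense: p(sense | text, image) is proportional to
    p(sense | text) * p(sense | image) / p(sense).
    """
    senses = list(p_sense_given_text)
    if prior is None:
        # With no prior supplied, assume all senses are equally likely.
        prior = {sense: 1.0 / len(senses) for sense in senses}
    unnormalized = {
        sense: p_sense_given_text[sense] * p_sense_given_image[sense] / prior[sense]
        for sense in senses
    }
    total = sum(unnormalized.values())
    return {sense: value / total for sense, value in unnormalized.items()}


if __name__ == "__main__":
    # Hypothetical case: the text barely favors the financial sense of "bank",
    # but the accompanying illustration looks like a river scene.
    from_text = {"financial": 0.55, "river": 0.45}
    from_image = {"financial": 0.10, "river": 0.90}
    print(combine_sense_evidence(from_text, from_image))
```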


    People

    Kobus Barnard
    David Blei
    Peter Carbonetto
    Pinar Duygulu
    Jaety Edwards
    Quanfu Fan
    Nando de Freitas
    David Forsyth
    Prasad Gabbur
    Matthew Johnson
    Michael Jordan
    Nikhil Shirahatti
    Ranjini Swaminathan
    Robert Wilensky
    Keiji Yanai


    Publications

    Kobus Barnard, Keiji Yanai, Matthew Johnson, and Prasad Gabbur,
    "Cross modal disambiguation," in Toward Category-Level Object Recognition, Jean Ponce, Martial Hebert, Cordelia Schmid, eds., Springer-Verlag LNCS Vol. 4170, 2006.

    Kobus Barnard and Keiji Yanai,
    "Mutual information of words and pictures," Information Theory and Applications Inaugural Workshop, February 6-10, 2006.    [PDF]

    Keiji Yanai and Kobus Barnard,
    "Finding Visual Concept by Web Image Mining", Proc. of the Fifteenth International World Wide Web Conference, Edinburgh, Scotland, 2006.

    Kobus Barnard, Quanfu Fan, Ranjini Swaminathan, Anthony Hoogs, Roderic Collins, Pascale Rondot, John Kaufhold,
    "Evaluation of localized semantics: data, methodology, and experiments," University of Arizona, Computing Science, Technical Report, TR-05-08, 2005 (revised October 2006).    [PDF]

    Keiji Yanai and Kobus Barnard,
    "Image Region Entropy: A Measure of 'Visualness' of Web Images Associated with One Concept", Proc. of ACM Multimedia, Singapore, November, 2005.    [PDF]

    Keiji Yanai and Kobus Barnard,
    "Probabilistic Web Image Gathering", Proc. of ACM Multimedia Workshop on Multimedia Information Retrieval (MIR), Singapore, November, 2005.    [PDF]

    Keiji Yanai, Nikhil V. Shirahatti, Prasad Gabbur and Kobus Barnard,
    "Evaluation Strategies for Image Understanding and Retrieval", Proc. of ACM Multimedia Workshop on Multimedia Information Retrieval (MIR), Singapore, November, 2005 (Invited paper).    [PDF]

    Kobus Barnard and Matthew Johnson,
    "Word Sense Disambiguation with Pictures", Artificial Intelligence, Volume 167, pp. 13-30, 2005.    [PDF]

    Nikhil V Shirahatti and Kobus Barnard,
    "Evaluating Image Retrieval," Computer Vision and Pattern Recognition (CVPR), San Diego, CA, pp. I:955-961, June 2005.    [PDF]

    Peter Carbonetto, Nando de Freitas, and Kobus Barnard,
    "A Statistical Model for General Contextual Object Recognition," European Conference on Computer Vision, 2004. (Copyright Springer-Verlag. Published in the Springer-Verlag Lecture Notes in Computer Science.)    [PDF]

    Kobus Barnard, Pinar Duygulu, and David Forsyth,
    "Exploiting Text and Image Feature Co-occurrence Statistics in Large Datasets," to appear as a chapter in Trends and Advances in Content-Based Image and Video Retrieval (tentative title).    [PDF]

    Kobus Barnard and Prasad Gabbur,
    "Color and Color Constancy in a Translation Model for Object Recognition," Eleventh Color Imaging Conference, pp. 364-369.    [PDF]

    Kobus Barnard, Matthew Johnson, and David Forsyth,
    "Word sense disambiguation with pictures," Workshop on learning word meaning from non-linguistic data, held in conjunction with The Human Language Technology Conference, Edmonton, Canada, May 27-June 1, 2003.    [PDF]

    Kobus Barnard, Pinar Duygulu, Raghavendra Guru, Prasad Gabbur, and David Forsyth,
    "The effects of segmentation and feature choice in a translation model of object recognition," Computer Vision and Pattern Recognition, pp. II:675-682, 2003.    [PDF]

    Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, and Michael I. Jordan,
    "Matching Words and Pictures," Journal of Machine Learning Research, Vol 3, pp. 1107-1135, 2003.    [PDF]

    Kobus Barnard and Nikhil V. Shirahatti,
    "A method for comparing content based image retrieval methods," Internet Imaging IX, Electronic Imaging 2003.    [PDF]

    Kobus Barnard, Pinar Duygulu, and David Forsyth,
    "Recognition as Translating Images into Text," Internet Imaging IX, Electronic Imaging 2003 (Invited paper).    [PDF]

    Pinar Duygulu, Kobus Barnard, Nando de Freitas, and David Forsyth,
    "Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary," Seventh European Conference on Computer Vision, pp. IV:97-112, 2002 (Awarded best paper in cognitive computer vision).    [PDF]

    Kobus Barnard, Pinar Duygulu, Nando de Freitas, and David Forsyth,
    "Object Recognition as Machine Translation - Part 2: Exploiting Image Database Clustering Models," unpublished manuscript, 2002.    [PDF]

    Nando de Freitas, Kobus Barnard, Pinar Duygulu and David Forsyth,
    "Bayesian Models for Massive Multimedia Databases: a New Frontier," 7th Valencia International Meeting on Bayesian Statistics/2002 ISBA International Meeting, June 2nd - June 6th, 2002, Spain.

    Kobus Barnard, Pinar Duygulu, and David Forsyth,
    "Modeling the Statistics of Image Features and Associated Text," Proc. Document Recognition and Retrieval IX, Electronic Imaging 2002.

    Kobus Barnard, Pinar Duygulu, and David Forsyth,
    "Clustering Art," Computer Vision and Pattern Recognition, 2001, pp. II:434-439.

    Nando de Freitas and Kobus Barnard,
    "Bayesian Latent Semantic Analysis of Multimedia Databases," UBC TR 2001-15.    [Postscript]

    Kobus Barnard and David Forsyth,
    "Learning the Semantics of Words and Pictures," International Conference on Computer Vision, vol 2, pp. 408-415, 2001.

    Kobus Barnard and David Forsyth,
    "Exploiting Image Semantics for Picture Libraries," The First ACM/IEEE-CS Joint Conference on Digital Libraries, 2001, page 469.


    Some Early Results (needs to be updated!)

    Results on images of art (May 2001)

    Preliminary Bayesian work (December 2000)

    Results with Corel data and Blobworld features (Fall 2000)