WORDS and PICTURES
Image understanding as multi-media translation
(AKA Computer Vision Meets Digital Libraries)
Browsing:
A Browser for Large Image Collections
(DEMO!)
Search: Probabilistic query "river and tiger". The two words do not appear together.
Auto-illustration: Choosing pictures for text. Words from a passage of "Moby Dick" ("large importance attached fact old dutch century more command whale ship ...") and images from the Fine Art Museum of San Francisco found to illustrate the passage.
Auto-annotation: Attaching words to images.
Keywords:
HIPPO BULL mouth walk
Recognition: Attaching words to image regions.
Computer vision meets digital libraries when there are lots and lots of
images.
Our interest is twofold.
These two problems are closely related because images have both semantic and visual content. Our approach is to build statistical models which "explain" the data in a collection. The data consists of image segment features and words associated with the images. The words may be carefully chosen keywords, as in the Corel data set, or free-form text processed with natural language pre-processing. Associated words provide semantic content which is difficult to derive using standard computer vision methods. Conversely, the image features provide information which humans often omit when they supply the words, because it is so clearly a visual element. For example, an image of a red rose will not normally have the keyword "red". Thus image features and associated words can complement and even disambiguate each other.
Once a statistical model has been built from the data, it can be queried in a probabilistic sense. Specifically, we can ask which images have high probability given the query items, which can be any combination of words and image features. An extreme use of such queries is what we refer to as "auto-illustrate": given a text passage, we can query an image database for appropriate images. Going even further, the model can be queried to attach words to pictures ("auto-annotate"), which has clear links to recognition.
The next step towards recognition is to attach the words to specific image regions. To do this we must solve a correspondence problem during training, because which image parts the words refer to is not known a priori. The situation is analogous to learning machine translation from aligned bi-texts. There, you have sentences assumed to have the same content, but the n'th word in one does not necessarily translate to the n'th word in the other. Similarly, we approach recognition as translating from visual cues to semantic correlates (words).
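The probabilistic querying described above can be illustrated with a small sketch. It assumes a simple latent-cluster model in which each image has a distribution over clusters and each cluster has a distribution over words; the cluster names, image names, and all probabilities below are hypothetical values chosen for illustration, not results or parameters from this project.

```python
from math import prod

# Hypothetical P(word | cluster) for two latent clusters (illustrative values).
WORD_GIVEN_CLUSTER = {
    "cats_outdoors": {"tiger": 0.5, "grass": 0.3, "river": 0.2},
    "water_scenes": {"river": 0.6, "boat": 0.3, "tiger": 0.1},
}

# Hypothetical P(cluster | image), as if inferred from each image's
# region features and associated words.
CLUSTER_GIVEN_IMAGE = {
    "img_a": {"cats_outdoors": 0.9, "water_scenes": 0.1},
    "img_b": {"cats_outdoors": 0.5, "water_scenes": 0.5},
    "img_c": {"cats_outdoors": 0.0, "water_scenes": 1.0},
}


def query_likelihood(image, query_words):
    """P(query | image) = product over words of sum_c P(word | c) P(c | image)."""
    posterior = CLUSTER_GIVEN_IMAGE[image]
    return prod(
        sum(WORD_GIVEN_CLUSTER[c].get(word, 0.0) * p for c, p in posterior.items())
        for word in query_words
    )


def rank_images(query_words):
    """Return image names sorted from most to least probable for the query."""
    scores = {img: query_likelihood(img, query_words) for img in CLUSTER_GIVEN_IMAGE}
    return sorted(scores, key=scores.get, reverse=True)


ranking = rank_images(["river", "tiger"])
```

In this toy setup, the query "river" and "tiger" ranks `img_b` first. Because the score goes through the latent clusters rather than exact keyword matches, an image can score well even when the two query words never appear together in its annotation.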
A key observation exploited in this approach is that as the learning system builds support for which parts of the image are, say, "grass", the correspondence ambiguity between the other blobs and words is reduced. The information needed to do this is present when the database is analyzed as a whole.
Current work is focused on constructing more complex models for both the image and language side, and linking the entities in these models. For example, we assume that effective object models include a notion of parts that need to be grouped or analyzed together for recognition. In preliminary work we have shown that our approach is suitable for proposing groupings of dissimilar regions. For example, a low-level segmentation algorithm cannot merge the black and white regions of a penguin, but if both regions are associated with "penguin", we can posit that a better model for "penguin" should incorporate several regions.
Similarly, more intelligence on the language side can be helpful. The system described above is limited to simple "stuff" nouns. However, if we know that certain words are adjectives and prepositions, and further understand their role in sentence structure, then we should be able to simultaneously learn the visual meaning of these words and exploit them to reduce ambiguity in the training data.
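The way evidence for one word reduces the correspondence ambiguity for the others can be sketched with expectation-maximization in the style of IBM Model 1 from statistical machine translation. This is a minimal illustration of the translation analogy, not the project's actual model: it assumes regions have already been quantized into discrete blob tokens, and the function name, blob names, and training pairs are hypothetical.

```python
from collections import defaultdict


def train_translation_table(pairs, iterations=20):
    """Estimate t(word | blob) with EM, in the style of IBM Model 1.

    pairs: list of (blobs, words); each image contributes its blob tokens
    and its associated words, with no alignment between them.
    Assumes every pair has at least one blob and one word.
    """
    blob_vocab = {b for blobs, _ in pairs for b in blobs}
    word_vocab = {w for _, words in pairs for w in words}
    # Start from a uniform table: every blob is equally likely to mean any word.
    t = {b: {w: 1.0 / len(word_vocab) for w in word_vocab} for b in blob_vocab}

    for _ in range(iterations):
        counts = defaultdict(lambda: defaultdict(float))
        # E-step: share each word among the blobs of its image,
        # in proportion to the current translation probabilities.
        for blobs, words in pairs:
            for w in words:
                norm = sum(t[b][w] for b in blobs)
                for b in blobs:
                    counts[b][w] += t[b][w] / norm
        # M-step: renormalize the expected counts for each blob.
        for b in blob_vocab:
            total = sum(counts[b].values())
            for w in word_vocab:
                t[b][w] = counts[b][w] / total
    return t


# Hypothetical training data: blob tokens and keywords for three images.
pairs = [
    (["green_blob", "striped_blob"], ["grass", "tiger"]),
    (["green_blob", "grey_blob"], ["grass", "elephant"]),
    (["blue_blob", "striped_blob"], ["water", "tiger"]),
]
table = train_translation_table(pairs)
best_word = {b: max(probs, key=probs.get) for b, probs in table.items()}
```

In this toy data no single image says which blob is "grass", but `green_blob` co-occurs with "grass" in two images. As its probability for "grass" grows, the remaining blobs in those images are pushed towards the remaining words, which is the effect described in the text.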
Finally, going the other way, images can help ground the meaning of language. In recent work we have shown that our image analysis can be used to help disambiguate words. A word like "bank" has many meanings, including a financial institution and a break in the terrain, as in "river bank". Existing methods disambiguate words using textual context; we augment these text disambiguation methods to use image information from accompanying illustrations.
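One simple way to picture combining textual and visual evidence for word senses is to multiply per-sense scores from the two sources and renormalize. This assumes the two sources are conditionally independent given the sense and that the prior over senses is uniform. It is only a sketch of the idea: the function name, sense labels, and scores are hypothetical, and the project's actual method may differ.

```python
def combine_sense_evidence(text_scores, image_scores):
    """Multiply per-sense scores from text and image, then renormalize.

    Senses missing from image_scores are treated as having zero image support.
    If the two sources share no support at all, fall back to the text scores.
    """
    combined = {
        sense: score * image_scores.get(sense, 0.0)
        for sense, score in text_scores.items()
    }
    total = sum(combined.values())
    if total == 0:
        return dict(text_scores)
    return {sense: value / total for sense, value in combined.items()}


# Hypothetical scores: the text alone is nearly undecided about "bank",
# while the accompanying illustration looks like an outdoor water scene.
text_scores = {"financial_institution": 0.55, "river_bank": 0.45}
image_scores = {"financial_institution": 0.2, "river_bank": 0.8}

posterior = combine_sense_evidence(text_scores, image_scores)
best_sense = max(posterior, key=posterior.get)
```

With these illustrative numbers the text slightly favors the financial sense, but the image evidence tips the combined estimate towards "river_bank".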
Preliminary Bayesian work (December 2000)
Results with Corel data and Blobworld features (Fall 2000)