The key contribution of the thesis is a knowledge-rich, graph-based approach that automatically captures the message (gist) of images, using the content and structure of a knowledge base to bridge the gap between text and image understanding. Congratulations, Lydia!
We investigate the problem of understanding the message (gist) conveyed by images and their captions as found, for instance, on websites or in news articles. To this end, we propose a methodology for capturing the meaning of image-caption pairs on the basis of large amounts of machine-readable knowledge that have previously been shown to be highly effective for text understanding. Our method identifies the connotation of objects beyond their denotation: where most approaches to image or image-text understanding focus on the denotation of objects, i.e., their literal meaning, our work addresses the identification of connotations, i.e., iconic meanings of objects, to understand the message of images. We view image understanding as the task of representing an image-caption pair on the basis of a wide-coverage vocabulary of concepts such as the one provided by Wikipedia, and cast gist detection as a concept-ranking problem with image-caption pairs as queries. Specifically, we approach the problem with a pipeline that: i) links detected object labels in the image and concept mentions in the caption to nodes of the knowledge base; ii) builds a semantic graph out of these ‘seed’ concepts; iii) applies a series of graph expansion and clustering steps to the original semantic graph to include additional concepts and topics in the semantic representation; iv) combines several graph-based and text-based features into a concept-ranking model that pinpoints the gist concepts. Understanding the gist is useful for tasks such as image search and recommending images for texts.
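The four pipeline stages can be sketched on a toy knowledge graph. This is a minimal illustration, not the thesis implementation: the concepts, edges, and the simple seed-connectivity score stand in for the learned combination of graph- and text-based features.

```python
# i) 'Seed' concepts: object labels from the image and entity mentions
#    from the caption, already linked to knowledge-base nodes.
seeds = {"Dove", "Olive_branch"}

# Toy knowledge-base adjacency (edges between related concepts);
# all entries are illustrative.
kb_edges = {
    "Dove": {"Bird", "Peace_symbols"},
    "Olive_branch": {"Peace_symbols", "Olea_europaea"},
    "Peace_symbols": {"Peace", "Dove", "Olive_branch"},
    "Bird": {"Animal"},
    "Olea_europaea": {"Tree"},
}

# ii)+iii) Build a semantic graph: one hop of graph expansion around
#          the seeds to pull in additional candidate concepts.
expanded = set(seeds)
for concept in seeds:
    expanded |= kb_edges.get(concept, set())

# iv) Rank candidates by a simple graph feature: how many seeds each
#     expanded concept is connected to.
def seed_connectivity(concept):
    neighbors = kb_edges.get(concept, set())
    return sum(
        1 for s in seeds
        if s in neighbors or concept in kb_edges.get(s, set())
    )

ranking = sorted(expanded - seeds, key=seed_connectivity, reverse=True)
print(ranking[0])  # prints "Peace_symbols"
```

Note how the top-ranked concept is a connotation (the iconic meaning shared by dove and olive branch) rather than a literal object label, which is exactly the behavior gist detection targets.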
As gist detection is a novel task, to the best of our knowledge there is no dataset available. We therefore create a dataset that allows for the simultaneous evaluation of literal and non-literal image-caption pairs. The gold-standard gist concepts come from a common knowledge base (Wikipedia), and the provided relevance ranks are graded on levels 0 to 5, which supports various benchmarking tasks, e.g., ranking at different levels of granularity and classification. Furthermore, as our proposed gist-detection pipeline touches on different research areas, we provide a detailed gold standard for each pipeline step, such as entity linking or object detection in images. Our gist-detection pipeline is evaluated in a detailed ablation study that investigates twelve research questions. These are elaborated in the evaluation section via human assessment or cross-validation and provide detailed insights into the gist of image-caption pairs. Furthermore, we show in an end-to-end setting the feasibility of combining state-of-the-art methods with our gist-detection pipeline, and point to future research directions.
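A short sketch of how graded gold-standard labels support both kinds of benchmark. The concepts and scores below are made up for illustration; they are not taken from the dataset.

```python
# Graded gold-standard relevance, levels 0 (irrelevant) to 5 (core gist).
graded = {"Peace": 5, "Dove": 4, "Bird": 2, "Tree": 0}

# Ranking benchmark: order concepts by graded relevance.
ideal_ranking = sorted(graded, key=graded.get, reverse=True)

# Classification benchmark: binarize at a chosen threshold, e.g. level >= 3.
is_gist = {concept: score >= 3 for concept, score in graded.items()}

print(ideal_ranking[0], is_gist["Bird"])  # prints "Peace False"
```

The same annotations thus serve graded ranking evaluation and, after thresholding, binary classification.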
Our experiments show that the candidate selection and ranking of gist concepts is a harder problem for non-literal image-caption pairs than for literal ones. Furthermore, we demonstrate that using features and concepts from both modalities (image and caption) improves performance for all types of pairs, a finding in line with results from research on multimodal approaches to other related tasks. Additionally, a feature ablation study shows the complementary nature and usefulness of different types of features, collected from semantic graphs of increasing richness. Finally, we experimented with a state-of-the-art image object detector and caption generator to evaluate the performance of an end-to-end solution for our task. The results indicate that state-of-the-art open-domain image understanding provides input that is good enough to detect the gist concepts of image-caption pairs, with nearly half of the predicted gist concepts being relevant. However, they also show that improved object detectors could avoid a drop of 38% in mean average precision. Additionally, the caption contains useful hints, especially for non-literal pairs.
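For readers unfamiliar with the metric behind the reported 38% drop, here is a hedged sketch of mean average precision (MAP) over ranked gist predictions. The system rankings and gold sets are illustrative, not experimental data.

```python
def average_precision(ranked, relevant):
    """AP for one image-caption pair: mean of precision@k at each hit."""
    hits, precisions = 0, []
    for k, concept in enumerate(ranked, start=1):
        if concept in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: average AP across all evaluated image-caption pairs."""
    return sum(average_precision(r, gold) for r, gold in runs) / len(runs)

# Two toy pairs: (system's ranked gist concepts, gold gist concepts).
runs = [
    (["Peace", "Bird", "Olive_branch"], {"Peace", "Olive_branch"}),
    (["Tree", "Peace"], {"Peace"}),
]
print(round(mean_average_precision(runs), 3))  # prints 0.667
```

Because AP rewards relevant concepts ranked near the top, noisy object detections that push gist concepts down the ranking directly depress MAP, which is why better detectors recover so much of the metric.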
Gist identification for images is a small yet arguably crucial part of the much larger task of interpreting images beyond their denotation. In a use-case scenario built on an established research problem, we show that gist detection in the form of concept ranking is useful for downstream tasks such as multimedia indexing, where it outperforms both shallow and deep approaches. Finally, we conclude that it could also be useful for image search and recommendation.