The key contributions of the theses are novel methods for fusing time-dependent web table data as well as methods for extending knowledge bases with previously unknown long-tail entities. Dr. Oubabi's experiments proofed the importance of considering the time stamp location within web pages for data fusion. His research on entity expansion showed that it is possible to augment knowledge bases with descriptions of thousands of previously unknown entities using web table data together a rather small amount of manual supervision in the form of domain rules.
The examination committee consisted of Prof. Heiner Stuckenschmidt, Prof. Heiko Paulheim, Prof. Simone Ponzetto, and Prof. Christian Bizer.
Cross-domain knowledge bases are increasingly used for a large variety of applications. As the usefulness of a knowledge base for many of these applications increases with its completeness, augmenting knowledge bases with new knowledge is an important task. A source for this new knowledge could be in the form of web tables, which are relational HTML tables extracted from the Web.
This thesis researches data integration methods for cross-domain knowledge base augmentation from web tables. Existing methods have focused on the task of slot filling static data. We research methods that additionally enable augmentation in the form of slot filling time-dependent data and entity expansion.
When augmenting knowledge bases using time-dependent web table data, we require time-aware fusion methods. They identify from a set of conflicting web table values the one that is valid given a certain temporal scope. A primary concern of time-aware fusion is therefore the estimation of temporal scope annotations, which web table data lacks. We introduce two time-aware fusion approaches. In the first, we extract timestamps from the table and its context to exploit as temporal scopes, additionally introducing approaches to reduce the sparsity and noisiness of these timestamps. We introduce a second time-aware fusion method that exploits a temporal knowledge base to propagate temporal scopes to web table data, reducing the dependence on noisy and sparse timestamps.
Entity expansion enriches a knowledge base with previously unknown long-tail entities. It is a task that to our knowledge has not been researched before. We introduce the Long-Tail Entity Extraction Pipeline, the first system that can perform entity expansion from web table data. The pipeline works by employing identity resolution twice, once to disambiguate between entity occurrences within web tables, and once between entities created from web tables and existing entities in the knowledge base. In addition to identifying new long-tail entities, the pipeline also creates their descriptions according to the knowledge base schema.
By running the pipeline on a large-scale web table corpus, we profile the potential of web tables for the task of entity expansion. We find, that given certain classes, we can enrich a knowledge base with tens and even hundreds of thousands new entities and corresponding facts.
Finally, we introduce a weak supervision approach for long-tail entity extraction, where supervision in the form of a large number of manually labeled matching and non-matching pairs is substituted with a small set of bold matching rules build using the knowledge base schema. Using this, we can reduce the supervision effort required to train our pipeline to enable cross-domain entity expansion at web-scale.
In the context of this research, we created and published two datasets. The Time-Dependent Ground Truth contains time-dependent knowledge with more than one million temporal facts and corresponding temporal scope annotations. It could potentially be employed for a large variety of tasks that consider the temporal aspect of data. We also built the Web Tables for Long-Tail Entity Extraction gold standard, the first benchmark for the task of entity expansion from web tables.
Full Text of the PhD Thesis: