Paper accepted at SemEval 2019: Unsupervised Frame Induction using Contextualized Word Embeddings

The paper “HHMM at SemEval-2019 Task 2: Unsupervised Frame Induction using Contextualized Word Embeddings” by Saba Anwar, Dmitry Ustalov, Nikolay Arefyev, Chris Biemann, Simone Paolo Ponzetto, and Alexander Panchenko has been accepted for publication at SemEval 2019.


We present our system for semantic frame induction that showed the best performance in Subtask B.1 and finished as the runner-up in Subtask A of the SemEval 2019 Task 2 on unsupervised semantic frame induction (QasemiZadeh et al., 2019). Our approach separates this task into two independent steps: verb clustering using word embeddings and their context embeddings, and role labeling by combining these embeddings with syntactic features. A simple combination of these steps shows very competitive results and can be extended to process other datasets and languages.

HHMM is an abbreviation for Hansestadt Hamburg, Mannheim, and Moscow. The name was chosen to avoid confusion with hidden Markov models (HMMs).

AI group celebrates the 100th supervised Master thesis
Article accepted at Computational Linguistics: Watset: Local-Global Graph Clustering with Applications in Sense and Frame Induction

The article “Watset: Local-Global Graph Clustering with Applications in Sense and Frame Induction” by Dmitry Ustalov, Alexander Panchenko, Chris Biemann, and Simone Paolo Ponzetto has been accepted for publication at the Computational Linguistics (CL) journal by MIT Press.


We present a detailed theoretical and computational analysis of the Watset meta-algorithm for fuzzy graph clustering, which has been found to be widely applicable in a variety of domains. This algorithm creates an intermediate representation of the input graph that reflects the “ambiguity” of its nodes. It then uses hard clustering to discover clusters in this “disambiguated” intermediate graph. After outlining the approach and analyzing its computational complexity, we demonstrate that Watset shows competitive results in three applications: unsupervised synset induction from a synonymy graph, unsupervised semantic frame induction from dependency triples, and unsupervised semantic class induction from a distributional thesaurus. Our algorithm is generic and can also be applied to other networks of linguistic data.

Article accepted at ACM TODS: A Unified Framework for Frequent Sequence Mining with Subsequence Constraints

The article “A Unified Framework for Frequent Sequence Mining with Subsequence Constraints” by Kaustubh Beedkar, Rainer Gemulla, and Wim Martens has been accepted for publication in ACM Transactions on Database Systems (TODS).


Frequent sequence mining methods often make use of constraints to control which subsequences should be mined. A variety of such subsequence constraints has been studied in the literature, including length, gap, span, regular-expression, and hierarchy constraints. In this article, we show that many subsequence constraints (including and beyond those considered in the literature) can be unified in a single framework. A unified treatment allows researchers to study many types of subsequence constraints jointly (instead of each one individually) and helps to improve the usability of pattern mining systems for practitioners. In more detail, we propose a set of simple and intuitive “pattern expressions” to describe subsequence constraints and explore algorithms for efficiently mining frequent subsequences under such general constraints. Our algorithms translate pattern expressions to succinct finite state transducers, which we use as a computational model, and simulate these transducers in a way suitable for frequent sequence mining. Our experimental study on real-world datasets indicates that our algorithms, although more general, are efficient and, when used for sequence mining with prior constraints studied in the literature, competitive with (and in some cases superior to) state-of-the-art specialized methods.
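
To make the notion of a subsequence constraint concrete, the following minimal Python sketch mines frequent subsequences under a gap constraint and a length constraint by brute-force enumeration. It is only an illustration of what such constraints mean: it does not use the pattern expressions or finite state transducers proposed in the article, and the function names (`constrained_subsequences`, `mine_frequent`) and parameters (`max_gap`, `max_length`, `min_support`) are hypothetical names chosen for this example.

```python
from collections import Counter


def constrained_subsequences(sequence, max_gap, max_length):
    """Enumerate distinct subsequences with 2..max_length items in which
    at most `max_gap` input items are skipped between consecutive picks.

    Brute-force illustration only; not the article's algorithm.
    """
    results = set()
    n = len(sequence)

    def extend(last_index, current):
        if len(current) >= 2:
            results.add(tuple(current))
        if len(current) == max_length:
            return  # length constraint reached, stop extending
        # Gap constraint: the next item may skip at most `max_gap` positions.
        upper = min(n, last_index + max_gap + 2)
        for j in range(last_index + 1, upper):
            extend(j, current + [sequence[j]])

    for i in range(n):
        extend(i, [sequence[i]])
    return results


def mine_frequent(database, min_support, max_gap=1, max_length=3):
    """Return {subsequence: support} for all constrained subsequences
    that occur in at least `min_support` input sequences."""
    support = Counter()
    for seq in database:
        # Each input sequence contributes at most once per subsequence.
        support.update(constrained_subsequences(seq, max_gap, max_length))
    return {p: c for p, c in support.items() if c >= min_support}


if __name__ == "__main__":
    # Toy database (made up for this example).
    db = [["a", "b", "c"], ["a", "c"], ["a", "x", "y", "c"]]
    print(mine_frequent(db, min_support=2, max_gap=1, max_length=2))
    print(mine_frequent(db, min_support=2, max_gap=2, max_length=2))
```

In this toy database, loosening `max_gap` from 1 to 2 lets the third sequence also support the pattern `("a", "c")`, which shows how a constraint directly controls which subsequences are counted.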

VW Foundation support for project on societal impact of AI

The Volkswagen Foundation provides seed funding to prepare a proposal on the desolidarization effects of smart city applications. The corresponding consortium is led by Prof. Kai Eckert from Stuttgart Media University. On the Mannheim side, the Chair of Artificial Intelligence (Prof. Stuckenschmidt) will provide expertise in AI technologies for smart city applications, and the Chair of Statistics and Methodology (Prof. Kreuter) will design the research methodology. The full proposal has to be submitted in October.

Open Position: Assistant Professor (W1) for “Artificial Intelligence Methods”

The Faculty of Business Informatics and Business Mathematics of the University of Mannheim has an opening for a Junior Professorship (W1) for “Artificial Intelligence Methods”. The position is funded by the KI-BW program of the state of Baden-Württemberg and comes with attractive resources.

Paper accepted at ICDE 2019: Scalable Frequent Sequence Mining With Flexible Subsequence Constraints

The paper “Scalable Frequent Sequence Mining With Flexible Subsequence Constraints” by Alexander Renz-Wieland, Matthias Bertsch, and Rainer Gemulla has been accepted at the 2019 IEEE International Conference on Data Engineering (ICDE).


We study scalable algorithms for frequent sequence mining under flexible subsequence constraints. Such constraints enable applications to specify concisely which patterns are of interest and which are not. We focus on the bulk synchronous parallel model with one round of communication; this model is suitable for platforms such as MapReduce or Spark. We derive a general framework for frequent sequence mining under this model and propose the D-SEQ and D-CAND algorithms within this framework. The algorithms differ in what data are communicated and how computation is split up among workers. To the best of our knowledge, D-SEQ and D-CAND are the first scalable algorithms for frequent sequence mining with flexible constraints. We conducted an experimental study on multiple real-world datasets that suggests that our algorithms scale nearly linearly, outperform common baselines, and offer acceptable generalization overhead over existing, less general mining algorithms.

31.5 billion quads of Microdata, embedded JSON-LD, RDFa, and Microformat data originating from 9.6 million websites published

The DWS group is happy to announce the new release of the WebDataCommons Microdata, JSON-LD, RDFa and Microformat data corpus. The data has been extracted from the November 2018 version of the Common Crawl covering 2.5 billion HTML pages which originate from 32 million websites (pay-level domains).

WDC Training Dataset and Gold Standard for Large-Scale Product Matching released

The research focus in the field of entity resolution (also known as link discovery or duplicate detection) is moving from traditional symbolic matching methods to matching based on embeddings and deep neural networks. A problem with evaluating deep-learning-based matchers is that they require large amounts of training data, while the benchmark datasets traditionally used for comparing matching methods are often too small to properly evaluate this new family of methods.