Article accepted at Computational Linguistics: Watset: Local-Global Graph Clustering with Applications in Sense and Frame Induction

The article “Watset: Local-Global Graph Clustering with Applications in Sense and Frame Induction” by Dmitry Ustalov, Alexander Panchenko, Chris Biemann, and Simone Paolo Ponzetto has been accepted for publication in the Computational Linguistics (CL) journal, published by MIT Press.

Abstract:

We present a detailed theoretical and computational analysis of the Watset meta-algorithm for fuzzy graph clustering, which has been found to be widely applicable in a variety of domains. This algorithm creates an intermediate representation of the input graph that reflects the “ambiguity” of its nodes. It then uses hard clustering to discover clusters in this “disambiguated” intermediate graph. After outlining the approach and analyzing its computational complexity, we demonstrate that Watset shows competitive results in three applications: unsupervised synset induction from a synonymy graph, unsupervised semantic frame induction from dependency triples, and unsupervised semantic class induction from a distributional thesaurus. Our algorithm is generic and can also be applied to other networks of linguistic data.
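As a rough illustration of the local-global idea, here is a minimal Python sketch (not the authors' implementation): the graph is a plain adjacency dictionary, and connected components stand in for the harder local and global clustering algorithms (such as Chinese Whispers) that Watset can plug in.

```python
def connected_components(nodes, adj):
    """Return the connected components of `nodes` under adjacency dict `adj`."""
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.add(u)
            stack.extend(v for v in adj.get(u, ()) if v in nodes)
        comps.append(comp)
    return comps

def watset(graph):
    """Toy local-global clustering in the spirit of Watset.

    graph: dict mapping each node to the set of its neighbours (undirected).
    Local step: split every node's neighbourhood into "senses".
    Global step: hard-cluster the sense graph and map senses back to nodes,
    so an ambiguous node can appear in several output clusters.
    """
    # Local step: induce senses by clustering each node's ego network.
    senses = {}  # (node, i) -> set of neighbours forming the i-th sense context
    for node, nbrs in graph.items():
        ego = {n: graph[n] & nbrs for n in nbrs}
        for i, comp in enumerate(connected_components(nbrs, ego)):
            senses[(node, i)] = comp
    # Disambiguation: link each sense to the most overlapping sense of
    # every neighbour in its context (ties broken arbitrarily).
    sense_adj = {s: set() for s in senses}
    for (node, i), ctx in senses.items():
        for nbr in ctx:
            best = max(
                (s for s in senses if s[0] == nbr),
                key=lambda s: len(senses[s] & (ctx | {node})),
            )
            sense_adj[(node, i)].add(best)
            sense_adj[best].add((node, i))
    # Global step: hard clustering of the disambiguated sense graph.
    return [sorted({w for (w, _) in c})
            for c in connected_components(set(sense_adj), sense_adj)]
```

On a small synonymy graph where "bank" connects to both {river, shore} and {money, finance}, the sketch puts "bank" into two separate clusters, which is the fuzzy-clustering behaviour the abstract describes.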

Article accepted at ACM TODS: A Unified Framework for Frequent Sequence Mining with Subsequence Constraints

The article “A Unified Framework for Frequent Sequence Mining with Subsequence Constraints” by Kaustubh Beedkar, Rainer Gemulla, and Wim Martens has been accepted for publication in ACM Transactions on Database Systems (TODS).

Abstract:

Frequent sequence mining methods often make use of constraints to control which subsequences should be mined. A variety of such subsequence constraints has been studied in the literature, including length, gap, span, regular-expression, and hierarchy constraints. In this article, we show that many subsequence constraints, including and beyond those considered in the literature, can be unified in a single framework. A unified treatment allows researchers to study many types of subsequence constraints jointly (instead of each one individually) and helps to improve the usability of pattern mining systems for practitioners. In more detail, we propose a set of simple and intuitive “pattern expressions” to describe subsequence constraints and explore algorithms for efficiently mining frequent subsequences under such general constraints. Our algorithms translate pattern expressions into succinct finite state transducers, which we use as a computational model, and simulate these transducers in a way suitable for frequent sequence mining. Our experimental study on real-world datasets indicates that our algorithms, although more general, are efficient and, when used for sequence mining with prior constraints studied in the literature, competitive with (and in some cases superior to) state-of-the-art specialized methods.
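To give a flavour of one simple member of this constraint family, the following Python sketch mines frequent subsequences under a maximum-gap and maximum-length constraint. It is a toy illustration only; the framework in the article expresses far richer constraints via pattern expressions compiled to finite state transducers.

```python
from collections import Counter

def mine_gapped(sequences, min_support, max_gap=0, max_len=3):
    """Mine subsequences of length <= max_len whose consecutive items are
    at most max_gap positions apart, keeping those that occur in at least
    min_support input sequences (each sequence counts a pattern once)."""
    counts = Counter()
    for seq in sequences:
        found = set()

        def extend(pattern, last):
            if len(pattern) >= max_len:
                return
            # the next item may be at most max_gap positions after `last`
            for j in range(last + 1, min(len(seq), last + max_gap + 2)):
                found.add(pattern + (seq[j],))
                extend(pattern + (seq[j],), j)

        for i, item in enumerate(seq):
            found.add((item,))
            extend((item,), i)
        counts.update(found)
    return {p: c for p, c in counts.items() if c >= min_support}
```

With max_gap=0 this degenerates to mining contiguous n-grams; larger gap values admit more subsequences, which is exactly the kind of trade-off that subsequence constraints let applications control.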

Paper accepted at ICDE 2019: Scalable Frequent Sequence Mining With Flexible Subsequence Constraints

The paper “Scalable Frequent Sequence Mining With Flexible Subsequence Constraints” by Alexander Renz-Wieland, Matthias Bertsch, and Rainer Gemulla has been accepted at the 2019 IEEE International Conference on Data Engineering (ICDE).

Abstract:

We study scalable algorithms for frequent sequence mining under flexible subsequence constraints. Such constraints enable applications to specify concisely which patterns are of interest and which are not. We focus on the bulk synchronous parallel model with one round of communication; this model is suitable for platforms such as MapReduce or Spark. We derive a general framework for frequent sequence mining under this model and propose the D-SEQ and D-CAND algorithms within this framework. The algorithms differ in what data are communicated and how computation is split up among workers. To the best of our knowledge, D-SEQ and D-CAND are the first scalable algorithms for frequent sequence mining with flexible constraints. We conducted an experimental study on multiple real-world datasets that suggests that our algorithms scale nearly linearly, outperform common baselines, and offer acceptable generalization overhead over existing, less general mining algorithms.

31.5 billion quads of Microdata, Embedded JSON-LD, RDFa, and Microformat data originating from 9.6 million websites published

The DWS group is happy to announce the new release of the WebDataCommons Microdata, JSON-LD, RDFa and Microformat data corpus. The data has been extracted from the November 2018 version of the Common Crawl covering 2.5 billion HTML pages which originate from 32 million websites (pay-level domains).

Photo credit: Anna Logue
WDC Training Dataset and Gold Standard for Large-Scale Product Matching released

The research focus in the field of entity resolution (also known as link discovery or duplicate detection) is moving from traditional symbolic matching methods to matching based on embeddings and deep neural networks. A problem with evaluating deep-learning-based matchers is that they require rather large amounts of training data and that the benchmark datasets traditionally used for comparing matching methods are often too small to properly evaluate this new family of methods.

Paper accepted at EDBT 2019

Our systems and applications paper

Extending Cross-Domain Knowledge Bases with Long Tail Entities using Web Table Data (Yaser Oulabi, Christian Bizer)

got accepted at the 22nd International Conference on Extending Database Technology (EDBT 2019), one of the top-tier conferences in the data management field!

Abstract of the paper:

 Cross-domain knowledge bases such as YAGO, DBpedia, or the Google Knowledge Graph are being used as background knowledge within an increasing range of applications including web search, data integration, natural language understanding, and question answering. The usefulness of a knowledge base for these applications depends on its completeness. Relational HTML tables that are published on the Web cover a wide range of topics and describe very specific long tail entities, such as small villages, less-known football players, or obscure songs. This systems and applications paper explores the potential of web table data for the task of completing cross-domain knowledge bases with descriptions of formerly unknown entities. We present the first system that handles all steps that are necessary for this task: schema matching, row clustering, entity creation, and new detection. The evaluation of the system using a manually labeled gold standard shows that it can construct formerly unknown instances and their descriptions from table data with an average F1 score of 0.80. In a second experiment, we apply the system to a large corpus of web tables extracted from the Common Crawl. This experiment allows us to get an overall impression of the potential of web tables for augmenting knowledge bases with long tail entities. The experiment shows that we can augment the DBpedia knowledge base with descriptions of 14 thousand new football players as well as 187 thousand new songs. The accuracy of the facts describing these instances is 0.90.
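Two of the pipeline steps, row clustering and new detection, can be illustrated with a deliberately simplified Python sketch. The system in the paper uses learned matching methods; the naive label normalisation below is an invented stand-in for illustration only.

```python
import re
from collections import defaultdict

def norm(label):
    """Toy label normalisation: lowercase, strip punctuation and spaces."""
    return re.sub(r"[^a-z0-9 ]", "", label.lower()).strip()

def cluster_rows(rows):
    """Row clustering: group web-table rows that (after normalisation)
    carry the same entity label."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[norm(row["label"])].append(row)
    return clusters

def detect_new(clusters, kb_labels):
    """'New detection': keep only clusters whose label does not match
    any entity already in the knowledge base."""
    known = {norm(l) for l in kb_labels}
    return {label: rows for label, rows in clusters.items()
            if label not in known}
```

In this toy version, rows about "FC Bayern" and "fc bayern!" collapse into one cluster, while a row about an entity unknown to the knowledge base survives new detection as a candidate long tail entity.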

PDF of the paper:

Yaser Oulabi, Christian Bizer: Extending Cross-Domain Knowledge Bases with Long Tail Entities using Web Table Data. EDBT 2019.

WInte.r Web Data Integration Framework Version 1.3 released

We are happy to announce the release of Version 1.3 of the Web Data Integration Framework (WInte.r).

WInte.r is a Java framework for end-to-end data integration. The framework implements a wide variety of different methods for data pre-processing, schema matching, identity resolution, data fusion, and result evaluation. The methods are designed to be easily customizable by exchanging pre-defined building blocks, such as blockers, matching rules, similarity functions, and conflict resolution functions.
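To illustrate how these building blocks fit together, here is a hedged Python sketch (WInte.r itself is a Java framework; the function names below are invented for illustration): a blocker limits which record pairs are compared, and a matching rule built from a similarity function decides which pairs within a block match.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def blocker(records, key):
    """Standard blocking: only records sharing a blocking key are compared."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    return blocks

def name_similarity(a, b):
    """A simple similarity function over the 'name' attribute."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def match(records_a, records_b, key, rule, threshold=0.8):
    """Identity resolution: block the second dataset, then apply a matching
    rule (here a single similarity function) within each block."""
    blocks_b = blocker(records_b, key)
    pairs = []
    for a in records_a:
        for b in blocks_b.get(key(a), []):
            score = rule(a, b)
            if score >= threshold:
                pairs.append((a["id"], b["id"], round(score, 2)))
    return pairs
```

Exchanging the blocking key, the similarity function, or the rule (e.g. a learned linear combination of several similarities) corresponds to swapping the pre-defined building blocks the framework provides.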

The following features have been added to the framework for the new release:

Value Normalization: A new ValueNormaliser class for normalising quantifiers and units of measurement, and a new DataSetNormaliser class for detecting data types and transforming complete datasets into a normalised base format.

External Rule Learning: In addition to learning matching rules directly inside of WInte.r, the new release also supports learning matching rules using external tools such as RapidMiner and importing the learned rules back into WInte.r.

Debug Reporting: The new release features detailed reports about the application of matching rules, blockers, and data fusion methods, which lay the foundation for fine-tuning these methods.

Step-by-Step Tutorial: In order to get users started with the framework, we have written a step-by-step tutorial on how to use WInte.r for identity resolution and data fusion and how to debug and fine-tune the different steps of the integration process.
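As a rough idea of what value normalisation involves, the following Python sketch parses strings such as "2.5 million km" into a base unit. The conversion table and parsing rules are invented for illustration and do not mirror the actual ValueNormaliser API.

```python
import re

# Toy conversion table to base units (metres, kilograms); illustration only.
UNIT_FACTORS = {
    "km": ("m", 1000.0), "m": ("m", 1.0), "cm": ("m", 0.01),
    "t": ("kg", 1000.0), "kg": ("kg", 1.0), "g": ("kg", 0.001),
}
QUANTIFIERS = {"thousand": 1e3, "million": 1e6, "billion": 1e9}

def normalise_value(text):
    """Parse strings like '2.5 million km' into (value_in_base_unit, unit)."""
    m = re.fullmatch(r"([\d.]+)\s*(?:([A-Za-z]+)\s+)?([A-Za-z]+)", text.strip())
    if not m:
        raise ValueError(f"unparseable value: {text!r}")
    number, quant, unit = m.groups()
    value = float(number) * QUANTIFIERS.get((quant or "").lower(), 1.0)
    base_unit, factor = UNIT_FACTORS[unit]
    return value * factor, base_unit
```

Normalising values into a common base format like this is what makes quantities from different sources comparable during matching and fusion.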

The WInte.r framework forms a foundation for our research on large-scale web data integration. The framework is used by the T2K Match algorithm for matching millions of Web tables against a central knowledge base, as well as within our work on Web table stitching for improving matching quality. The framework is also used in the context of the DS4DM research project for matching tabular data for data search.

Besides being used for research, we also use the WInte.r framework for teaching. The students of our Web Data Integration course use the framework to solve case studies and implement their term projects.

Detailed information about the WInte.r framework can be found at

github.com/olehmberg/winter

The WInte.r framework can be downloaded from the same web site. The framework can be used under the terms of the Apache 2.0 License.

Lots of thanks to Alexander Brinkmann and Oliver Lehmberg for their work on the new release as well as on the tutorial and extended documentation in the WInte.r wiki.

André Melo has defended his PhD thesis

André Melo has defended his PhD thesis on “Automatic Refinement of Large-Scale Cross-Domain Knowledge Graphs”, supervised by Prof. Heiko Paulheim.

In his thesis, André has developed different methods to improve large-scale, cross-domain knowledge graphs along various dimensions. His contributions include, among others, a benchmarking suite for knowledge graph completion and correction, an effective method for type prediction using hierarchical classification, and a machine-learning-based method for detecting wrong relation assertions. Moreover, he has proposed methods for error correction in knowledge graphs and for distilling high-level tests from the individual errors identified.

As of September, André will start a new job as a knowledge engineer for Babylon Health in London. We wish him all the best!

Photo credit: Data Mining Cup/prudsys AG
Mannheim Students Score Second Place at Data Mining Cup

The Data Mining Cup is an annual data mining competition for students from all over the world. Since 2014, students from Mannheim have taken part in the competition as an integral part of the Data Mining 2 lecture, taught by Prof. Paulheim. In the course of the competition, the students have to solve a data mining task based on real e-commerce data.

This year, the data was provided by an online sports apparel retailer, and the task was to predict the sellout date for individual articles. Students had six weeks to develop their solutions. In the course of the lecture, they worked in different teams and had regular discussions about solution approaches and results.

One of the student teams from Mannheim qualified for the final round of the 10 best teams in May and was invited to present their solution in Berlin at the prudsys personalization & pricing summit. In the final ranking, they scored second out of 197 solutions in total. Overall, teams from 148 universities in 47 countries took part in the 2018 Data Mining Cup.

The DWS group would like to congratulate the team:

Nele Ecker, Thilo Habrich, Andreea Iana, Adrian Kochsiek, Alexander Luetke, Laurien Theresa Lummer, Nils Richter, and Fabian Oliver Schmitt

Picture: Members of the team in Berlin. Left to right: Nele Ecker, Laurien Lummer, Adrian Kochsiek, Alexander Lütke

JCDL 2018 - Vannevar Bush Best Paper Award

Our paper “Entity-Aspect Linking: Providing Fine-Grained Semantics of Entities in Context” has recently won the Vannevar Bush Best Paper Award at the 2018 Joint Conference on Digital Libraries (JCDL), the top conference in the field of digital libraries!

The work, co-authored by Federico Nanni, Simone Paolo Ponzetto, and Laura Dietz, is part of a collaboration between the DWS group and the University of New Hampshire in the context of an Elite Post-Doc grant of the Baden-Württemberg Stiftung recently awarded to Laura.

Congratulations also to Myriam Traub, Thaer Samar, Jacco van Ossenbruggen, and Lynda Hardman, whose paper shares the 2018 best paper award with ours!