Paper accepted at ICDE 2019: Scalable Frequent Sequence Mining With Flexible Subsequence Constraints

The paper "Scalable Frequent Sequence Mining With Flexible Subsequence Constraints" by Alexander Renz-Wieland, Matthias Bertsch, and Rainer Gemulla has been accepted at the 2019 IEEE International Conference on Data Engineering (ICDE).

Abstract:

We study scalable algorithms for frequent sequence mining under flexible subsequence constraints. Such constraints enable applications to specify concisely which patterns are of interest and which are not. We focus on the bulk synchronous parallel model with one round of communication; this model is suitable for platforms such as MapReduce or Spark. We derive a general framework for frequent sequence mining under this model and propose the D-SEQ and D-CAND algorithms within this framework. The algorithms differ in what data are communicated and how computation is split up among workers. To the best of our knowledge, D-SEQ and D-CAND are the first scalable algorithms for frequent sequence mining with flexible constraints. We conducted an experimental study on multiple real-world datasets that suggests that our algorithms scale nearly linearly, outperform common baselines, and offer acceptable generalization overhead over existing, less general mining algorithms.
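
To give a flavour of what a flexible subsequence constraint expresses, here is a minimal single-machine Java sketch (a toy illustration, not the paper's D-SEQ or D-CAND algorithms): it counts length-2 subsequences whose two items occur at most maxGap positions apart and reports those that reach minimum support.

    import java.util.*;

    // Toy illustration of frequent sequence mining under a subsequence
    // constraint (here: a maximum gap between the matched items). This is
    // NOT the paper's D-SEQ or D-CAND, which run distributed over workers.
    public class ConstrainedMiner {
        public static void main(String[] args) {
            List<String[]> db = List.of(
                new String[] {"a", "b", "c"},
                new String[] {"a", "x", "b"},
                new String[] {"b", "a", "b"});
            int maxGap = 1, minSupport = 2;
            Map<String, Integer> support = new HashMap<>();
            for (String[] seq : db) {
                Set<String> patterns = new HashSet<>(); // count once per input sequence
                for (int i = 0; i < seq.length; i++)
                    for (int j = i + 1; j < seq.length && j - i - 1 <= maxGap; j++)
                        patterns.add(seq[i] + " " + seq[j]);
                for (String p : patterns) support.merge(p, 1, Integer::sum);
            }
            // Report patterns that satisfy both the constraint and min support.
            support.forEach((p, s) -> {
                if (s >= minSupport) System.out.println(p + " (support " + s + ")");
            });
        }
    }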

31.5 billion quads of Microdata, Embedded JSON-LD, RDFa, and Microformat data originating from 9.6 million websites published

The DWS group is happy to announce the new release of the WebDataCommons Microdata, JSON-LD, RDFa and Microformat data corpus. The data has been extracted from the November 2018 version of the Common Crawl and covers 2.5 billion HTML pages originating from 32 million websites (pay-level domains).
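
The corpus is published in the N-Quads format, where each statement carries the URL of the page it was extracted from as its fourth component. Schematically, a single quad extracted from Microdata markup looks like this (all identifiers and values are hypothetical):

    _:node1 <http://schema.org/name> "Acme Phone X" <http://example.com/shop/item42> .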

WDC Training Dataset and Gold Standard for Large-Scale Product Matching released

The research focus in the field of entity resolution (also known as link discovery or duplicate detection) is moving from traditional symbolic matching methods to matching based on embeddings and deep neural networks. A problem with evaluating deep-learning-based matchers is that they require rather large amounts of training data, while the benchmark datasets traditionally used for comparing matching methods are often too small to properly evaluate this new family of methods.
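
To make the data hunger concrete: training data for such matchers typically consists of pairs of product offers together with a match/non-match label. A single training instance looks schematically like this (all values hypothetical):

    { "offer_left":  { "title": "Acme Phone X 64GB black" },
      "offer_right": { "title": "ACME PhoneX (64 GB, Black)" },
      "label": "match" }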

Paper accepted at EDBT 2019

Our systems and applications paper

Extending Cross-Domain Knowledge Bases with Long Tail Entities using Web Table Data (Yaser Oulabi, Christian Bizer)

got accepted at the 22nd International Conference on Extending Database Technology (EDBT 2019), one of the top-tier conferences in the data management field!

Abstract of the paper:

 Cross-domain knowledge bases such as YAGO, DBpedia, or the Google Knowledge Graph are being used as background knowledge within an increasing range of applications including web search, data integration, natural language understanding, and question answering. The usefulness of a knowledge base for these applications depends on its completeness. Relational HTML tables that are published on the Web cover a wide range of topics and describe very specific long tail entities, such as small villages, less-known football players, or obscure songs. This systems and applications paper explores the potential of web table data for the task of completing cross-domain knowledge bases with descriptions of formerly unknown entities. We present the first system that handles all steps that are necessary for this task: schema matching, row clustering, entity creation, and new detection. The evaluation of the system using a manually labeled gold standard shows that it can construct formerly unknown instances and their descriptions from table data with an average F1 score of 0.80. In a second experiment, we apply the system to a large corpus of web tables extracted from the Common Crawl. This experiment allows us to get an overall impression of the potential of web tables for augmenting knowledge bases with long tail entities. The experiment shows that we can augment the DBpedia knowledge base with descriptions of 14 thousand new football players as well as 187 thousand new songs. The accuracy of the facts describing these instances is 0.90.
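
To illustrate the "new detection" step in isolation, here is a minimal hypothetical Java sketch (invented names and thresholds, not the system described in the paper): a clustered entity description is flagged as new if no knowledge base entity is sufficiently similar to it.

    import java.util.*;

    // Hypothetical sketch of "new detection": a cluster of web table rows
    // is considered a NEW entity if its label's best match in the knowledge
    // base stays below a similarity threshold. Not the paper's system.
    public class NewDetection {
        // Jaccard similarity over lower-cased token sets.
        static double jaccard(String a, String b) {
            Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
            Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
            Set<String> inter = new HashSet<>(ta);
            inter.retainAll(tb);
            Set<String> union = new HashSet<>(ta);
            union.addAll(tb);
            return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
        }

        static boolean isNew(String clusterLabel, List<String> kbLabels, double threshold) {
            return kbLabels.stream()
                .mapToDouble(l -> jaccard(clusterLabel, l))
                .max().orElse(0.0) < threshold;
        }

        public static void main(String[] args) {
            List<String> kb = List.of("John Smith footballer", "Jane Doe singer");
            System.out.println(isNew("John Smith footballer", kb, 0.7)); // false: known entity
            System.out.println(isNew("FC Obscure Village", kb, 0.7));    // true: candidate new entity
        }
    }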

PDF of the paper:

Yaser Oulabi, Christian Bizer: Extending Cross-Domain Knowledge Bases with Long Tail Entities using Web Table Data. EDBT 2019.

New Champion of the "Bohnenspiel WM"

In the 2018 edition of the annual Bohnenspiel WM (world championship), a new champion emerged! As an integral part of the AI bachelor lecture, students compete every year in small groups in a tournament called the Bohnen-WM. For this tournament, the students have to design and implement an AI that is capable of playing the game Bohnenspiel on a high level. This year, eight groups participated with AIs implementing different techniques and tactics. In the end, the AI of the group of Selia Bati, Yves Mike Ekspenszid, and Sophia Isabel Maguin won the cup. The winning AI was named "DieWilde9".
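
Techniques of the kind typically used for such game-playing AIs include minimax search with alpha-beta pruning over the game tree. A minimal game-agnostic Java sketch follows (the GameState interface and its methods are invented for illustration, not taken from the lecture):

    import java.util.List;

    // Game-agnostic alpha-beta search of the kind typically used for
    // turn-based games such as Bohnenspiel. GameState is hypothetical.
    interface GameState {
        boolean isTerminal();
        int evaluate();               // heuristic score for the maximizing player
        List<GameState> successors(); // states reachable with one legal move
    }

    public class AlphaBeta {
        public static int search(GameState s, int depth, int alpha, int beta, boolean maximizing) {
            if (depth == 0 || s.isTerminal()) return s.evaluate();
            if (maximizing) {
                int best = Integer.MIN_VALUE;
                for (GameState next : s.successors()) {
                    best = Math.max(best, search(next, depth - 1, alpha, beta, false));
                    alpha = Math.max(alpha, best);
                    if (beta <= alpha) break; // prune: opponent avoids this branch
                }
                return best;
            } else {
                int best = Integer.MAX_VALUE;
                for (GameState next : s.successors()) {
                    best = Math.min(best, search(next, depth - 1, alpha, beta, true));
                    beta = Math.min(beta, best);
                    if (beta <= alpha) break; // prune symmetrically
                }
                return best;
            }
        }
    }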

New DFG Project on joining graph- and vector-based sense representations for semantic end-user information access

We are happy to announce that the Deutsche Forschungsgemeinschaft (DFG) has accepted our proposal for extending a joint research project on hybrid semantic representations together with our friends and colleagues at the Language Technology Group of the University of Hamburg.


The project, titled "Joining graph- and vector-based sense representations for semantic end-user information access" (JOIN-T 2), builds upon our JOIN-T project (also funded by the DFG) and aims at bringing it one step forward. Our vision for the next three years is to explore ways to produce semantic representations that combine the interpretability of manually crafted resources and sparse representations with the accuracy and high coverage of dense neural embeddings.
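
One simple way to link the two worlds, shown here only as an illustrative sketch (not the project's actual method): derive a dense vector for each sense in a hand-crafted inventory by averaging the embeddings of the words attached to that sense, so that every dense sense vector stays connected to an interpretable inventory entry.

    import java.util.*;

    // Illustrative sketch (not the JOIN-T 2 method): build dense sense
    // vectors by averaging the word embeddings of each sense's lemmas.
    // The resulting vectors remain linked to interpretable sense entries.
    public class SenseVectors {
        public static void main(String[] args) {
            Map<String, double[]> wordVec = Map.of(
                "bank",  new double[] {0.9, 0.1},
                "river", new double[] {0.1, 0.9},
                "money", new double[] {0.8, 0.0});
            Map<String, List<String>> senseLemmas = Map.of(
                "bank#finance",   List.of("bank", "money"),
                "bank#riverside", List.of("bank", "river"));
            for (Map.Entry<String, List<String>> sense : senseLemmas.entrySet()) {
                double[] v = new double[2];
                for (String lemma : sense.getValue())
                    for (int i = 0; i < v.length; i++)
                        v[i] += wordVec.get(lemma)[i] / sense.getValue().size();
                System.out.println(sense.getKey() + " -> " + Arrays.toString(v));
            }
        }
    }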

Stay tuned for forthcoming research papers and resources!

WInte.r Web Data Integration Framework Version 1.3 released

We are happy to announce the release of Version 1.3 of the Web Data Integration Framework (WInte.r).

WInte.r is a Java framework for end-to-end data integration. The framework implements a wide variety of methods for data pre-processing, schema matching, identity resolution, data fusion, and result evaluation. The methods are designed to be easily customizable by exchanging pre-defined building blocks, such as blockers, matching rules, similarity functions, and conflict resolution functions.
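
The building-block idea roughly follows this pattern, shown as a self-contained hypothetical sketch (invented names, not the actual WInte.r API): a blocker cuts down the candidate record pairs, and a matching rule scores the remaining pairs with exchangeable similarity functions.

    import java.util.*;
    import java.util.function.BiFunction;

    // Hypothetical sketch of the building-block pattern (invented names,
    // not the actual WInte.r API): a blocker generates candidate pairs,
    // a matching rule scores them with an exchangeable similarity function.
    public class MatchingSketch {
        record Product(String name, String brand) {}

        // Blocker: only compare records that share a blocking key (the brand).
        static List<Product[]> block(List<Product> a, List<Product> b) {
            List<Product[]> candidates = new ArrayList<>();
            for (Product x : a)
                for (Product y : b)
                    if (x.brand().equals(y.brand()))
                        candidates.add(new Product[] {x, y});
            return candidates;
        }

        // Similarity function: an exchangeable building block.
        static final BiFunction<String, String, Double> NAME_SIM =
            (s, t) -> s.equalsIgnoreCase(t) ? 1.0 : 0.0;

        public static void main(String[] args) {
            List<Product> left  = List.of(new Product("Phone X", "Acme"));
            List<Product> right = List.of(new Product("phone x", "Acme"),
                                          new Product("Tablet Y", "Acme"));
            for (Product[] pair : block(left, right))
                if (NAME_SIM.apply(pair[0].name(), pair[1].name()) >= 0.5) // matching rule
                    System.out.println("match: " + pair[0] + " <-> " + pair[1]);
        }
    }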

The following features have been added to the framework for the new release:

- Value Normalization: A new ValueNormaliser class for normalizing quantifiers and units of measurement, and a new DataSetNormaliser class for detecting data types and transforming complete datasets into a normalised base format (see the sketch after this list).
- External Rule Learning: In addition to learning matching rules directly inside of WInte.r, the new release also supports learning matching rules using external tools such as RapidMiner and importing the learned rules back into WInte.r.
- Debug Reporting: The new release features detailed reports about the application of matching rules, blockers, and data fusion methods, which lay the foundation for fine-tuning these methods.
- Step-by-Step Tutorial: In order to get users started with the framework, we have written a step-by-step tutorial on how to use WInte.r for identity resolution and data fusion, and on how to debug and fine-tune the different steps of the integration process.
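
In the spirit of the value normalization feature, a normalizer might parse quantified strings and convert them to a base unit. The following is an invented sketch of the idea, not the actual ValueNormaliser class:

    import java.util.Map;

    // Invented sketch of value normalization (not the actual WInte.r
    // ValueNormaliser): parse "<number> <unit>" strings and convert the
    // value to the base unit metre.
    public class LengthNormaliser {
        static final Map<String, Double> TO_METRES =
            Map.of("mm", 0.001, "cm", 0.01, "m", 1.0, "km", 1000.0);

        static double normalise(String value) {
            String[] parts = value.trim().split("\\s+");
            return Double.parseDouble(parts[0]) * TO_METRES.get(parts[1]);
        }

        public static void main(String[] args) {
            System.out.println(normalise("1.5 km")); // 1500.0
            System.out.println(normalise("250 cm")); // 2.5
        }
    }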

The WInte.r framework forms a foundation for our research on large-scale web data integration. The framework is used by the T2K Match algorithm for matching millions of Web tables against a central knowledge base, as well as within our work on Web table stitching for improving matching quality. The framework is also used in the context of the DS4DM research project for matching tabular data for data search.

Besides being used for research, the WInte.r framework is also used for teaching. The students of our Web Data Integration course use the framework to solve case studies and to implement their term projects.

Detailed information about the WInte.r framework can be found at

github.com/olehmberg/winter

The WInte.r framework can be downloaded from the same website. The framework can be used under the terms of the Apache 2.0 License.

Many thanks to Alexander Brinkmann and Oliver Lehmberg for their work on the new release as well as on the tutorial and the extended documentation in the WInte.r wiki.

Paper accepted at EMNLP 2018

Our long paper submission

"Investigating the Role of Argumentation in the Rhetorical Analysis of Scientific Publications with Neural Multi-Task Learning Models" (Anne Lauscher, Goran Glavaš, Kai Eckert, and Simone Paolo Ponzetto)

got accepted at the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), one of the top-tier conferences in natural language processing!

André Melo has defended his PhD thesis

André Melo has defended his PhD thesis on "Automatic Refinement of Large-Scale Cross-Domain Knowledge Graphs", supervised by Prof. Heiko Paulheim.

In his thesis, André developed different methods to improve large-scale, cross-domain knowledge graphs along various dimensions. His contributions include, among others, a benchmarking suite for knowledge graph completion and correction, an effective method for type prediction using hierarchical classification, and a machine-learning-based method for detecting wrong relation assertions. Moreover, he proposed methods for correcting errors in knowledge graphs and for distilling high-level tests from individual identified errors.
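
A minimal sketch of the idea behind hierarchical type prediction (purely illustrative, not the method from the thesis): instead of choosing among all types at once, the prediction descends the class hierarchy and decides only among the children of the class chosen so far.

    import java.util.*;

    // Illustrative sketch of hierarchical type prediction (not the thesis
    // method): choose the best class level by level instead of among all
    // types at once. Scores are toy stand-ins for classifier confidences.
    public class HierarchicalTyping {
        static final Map<String, List<String>> CHILDREN = Map.of(
            "Thing",  List.of("Person", "Place"),
            "Person", List.of("Athlete", "Artist"),
            "Place",  List.of("City", "Village"));

        // Toy confidences for one entity; a real system would obtain these
        // from a trained classifier at each level of the hierarchy.
        static final Map<String, Double> SCORES = Map.of(
            "Person", 0.2, "Athlete", 0.5, "Artist", 0.1,
            "Place",  0.8, "City",    0.3, "Village", 0.7);

        static String predict() {
            String current = "Thing";
            while (CHILDREN.containsKey(current)) {
                // Descend to the highest-scoring child of the current class.
                current = CHILDREN.get(current).stream()
                    .max(Comparator.comparingDouble(c -> SCORES.getOrDefault(c, 0.0)))
                    .orElse(current);
            }
            return current;
        }

        public static void main(String[] args) {
            System.out.println(predict()); // Village
        }
    }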

As of September, André will start a new job as a knowledge engineer for Babylon Health in London. We wish him all the best!

Paper accepted at COLING 2018

The position paper "Automatic Assessment of Conceptual Text Complexity Using Knowledge Graphs" by Sanja Štajner and Ioana Hulpus has been accepted at the 27th International Conference on Computational Linguistics (COLING 2018), the premier international conference on computational linguistics.