31.5 billion quads of Microdata, embedded JSON-LD, RDFa, and Microformat data from 9.6 million websites published

The DWS group is happy to announce the new release of the WebDataCommons Microdata, JSON-LD, RDFa, and Microformat data corpus. The data has been extracted from the November 2018 version of the Common Crawl, which covers 2.5 billion HTML pages originating from 32 million websites (pay-level domains).

Photo credit: Anna Logue
WDC Training Dataset and Gold Standard for Large-Scale Product Matching released

The research focus in the field of entity resolution (also known as link discovery or duplicate detection) is moving from traditional symbolic matching methods to matching based on embeddings and deep neural networks. A problem with evaluating deep-learning-based matchers is that they require large amounts of training data, while the benchmark datasets traditionally used for comparing matching methods are often too small to properly evaluate this new family of methods.
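The core idea behind embedding-based matching can be illustrated with a minimal sketch: each record is mapped to a dense vector, and record pairs whose vectors are close (e.g. by cosine similarity) are considered matches. The sketch below is a toy illustration, not any particular system's method; it substitutes deterministic hashed pseudo-embeddings for the learned word vectors (word2vec, fastText, or fine-tuned transformer embeddings) that a real deep matcher would train from labeled pairs — which is exactly why such matchers are training-data hungry.

```python
import hashlib
import math
import random

DIM = 50

def token_vector(token):
    # Deterministic pseudo-embedding: a real matcher would use vectors
    # *learned* from training data; here each token is hashed to a
    # reproducible random vector so the example is self-contained.
    seed = int(hashlib.md5(token.encode()).hexdigest(), 16) % (2 ** 32)
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]

def embed(record):
    # Represent a record by the average of its token vectors.
    vecs = [token_vector(t) for t in record.lower().split()]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = "apple iphone 8 64gb space grey"
b = "iphone 8 64gb grey smartphone"
c = "samsung galaxy s9 128gb black"

print(cosine(embed(a), embed(b)))  # high similarity: many shared tokens
print(cosine(embed(a), embed(c)))  # low similarity: hardly any overlap
```

With learned embeddings the same pipeline also scores synonymous but lexically different descriptions as similar — the property that makes neural matchers attractive in the first place.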

Paper accepted at EDBT 2019

Our systems and applications paper

Extending Cross-Domain Knowledge Bases with Long Tail Entities using Web Table Data (Yaser Oulabi, Christian Bizer)

got accepted at the 22nd International Conference on Extending Database Technology (EDBT 2019), one of the top-tier conferences in the data management field!

Abstract of the paper:

 Cross-domain knowledge bases such as YAGO, DBpedia, or the Google Knowledge Graph are being used as background knowledge within an increasing range of applications including web search, data integration, natural language understanding, and question answering. The usefulness of a knowledge base for these applications depends on its completeness. Relational HTML tables that are published on the Web cover a wide range of topics and describe very specific long tail entities, such as small villages, less-known football players, or obscure songs. This systems and applications paper explores the potential of web table data for the task of completing cross-domain knowledge bases with descriptions of formerly unknown entities. We present the first system that handles all steps that are necessary for this task: schema matching, row clustering, entity creation, and new detection. The evaluation of the system using a manually labeled gold standard shows that it can construct formerly unknown instances and their descriptions from table data with an average F1 score of 0.80. In a second experiment, we apply the system to a large corpus of web tables extracted from the Common Crawl. This experiment allows us to get an overall impression of the potential of web tables for augmenting knowledge bases with long tail entities. The experiment shows that we can augment the DBpedia knowledge base with descriptions of 14 thousand new football players as well as 187 thousand new songs. The accuracy of the facts describing these instances is 0.90.

PDF of the paper:

Yaser Oulabi, Christian Bizer: Extending Cross-Domain Knowledge Bases with Long Tail Entities using Web Table Data. EDBT 2019.

New DFG Project on joining graph- and vector-based sense representations for semantic end-user information access

We are happy to announce that the Deutsche Forschungsgemeinschaft (DFG) has accepted our proposal for extending a joint research project on hybrid semantic representations together with our friends and colleagues of the Language Technology Group of the University of Hamburg.


The project, titled „Joining graph- and vector-based sense representations for semantic end-user information access“ (JOIN-T 2), builds upon our JOIN-T project (also funded by the DFG) and aims at taking it one step forward. Our vision for the next three years is to explore ways to produce semantic representations that combine the interpretability of manually crafted resources and sparse representations with the accuracy and high coverage of dense neural embeddings.

Stay tuned for forthcoming research papers and resources!

WInte.r Web Data Integration Framework Version 1.3 released

We are happy to announce the release of Version 1.3 of the Web Data Integration Framework (WInte.r).

WInte.r is a Java framework for end-to-end data integration. The framework implements a wide variety of methods for data pre-processing, schema matching, identity resolution, data fusion, and result evaluation. The methods are designed to be easily customizable by exchanging pre-defined building blocks, such as blockers, matching rules, similarity functions, and conflict resolution functions.
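To give an intuition for how these building blocks fit together, here is a minimal sketch of an identity-resolution pipeline with a blocker, two similarity functions, and a linear-combination matching rule. This is a generic Python illustration of the concepts, not WInte.r's actual Java API; the records, blocking key, weights, and threshold are all invented for the example.

```python
from difflib import SequenceMatcher

# Toy records from two sources, to be matched on title and year.
left = [
    {"id": "l1", "title": "The Matrix", "year": 1999},
    {"id": "l2", "title": "Heat", "year": 1995},
]
right = [
    {"id": "r1", "title": "Matrix, The", "year": 1999},
    {"id": "r2", "title": "Heat", "year": 1996},
]

def blocking_key(record):
    # Blocker: only record pairs sharing a blocking key are compared,
    # which avoids the quadratic comparison of all pairs.
    return record["year"] // 10  # block by decade

def title_sim(a, b):
    # String similarity on the title attribute.
    return SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()

def year_sim(a, b):
    return 1.0 if a["year"] == b["year"] else 0.0

def matching_rule(a, b):
    # Linear-combination matching rule: weighted attribute similarities
    # compared against a threshold (weights/threshold are illustrative).
    score = 0.7 * title_sim(a, b) + 0.3 * year_sim(a, b)
    return score >= 0.65

matches = [
    (a["id"], b["id"])
    for a in left for b in right
    if blocking_key(a) == blocking_key(b) and matching_rule(a, b)
]
print(matches)  # → [('l1', 'r1'), ('l2', 'r2')]
```

In a framework like WInte.r, each of these pieces (blocker, similarity function, matching rule) is a separate exchangeable component, so e.g. the hand-tuned rule can be replaced by a learned one without touching the rest of the pipeline.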

The following features have been added to the framework for the new release:

- Value Normalization: New ValueNormaliser class for normalizing quantifiers and units of measurement, and a new DataSetNormaliser class for detecting data types and transforming complete datasets into a normalised base format.
- External Rule Learning: In addition to learning matching rules directly inside of WInte.r, the new release also supports learning matching rules using external tools such as RapidMiner and importing the learned rules back into WInte.r.
- Debug Reporting: The new release features detailed reports about the application of matching rules, blockers, and data fusion methods, which lay the foundation for fine-tuning the methods.
- Step-by-Step Tutorial: In order to get users started with the framework, we have written a step-by-step tutorial on how to use WInte.r for identity resolution and data fusion and how to debug and fine-tune the different steps of the integration process.
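The kind of value normalization described above — parsing a quantity string and converting it to a common base unit so that values from different sources become comparable — can be sketched as follows. This is a hypothetical Python illustration of the idea, not the actual ValueNormaliser implementation; the `normalise` function and the conversion table are invented for the example.

```python
import re

# Illustrative conversion table: unit -> (base unit, factor to base).
TO_BASE = {
    "mm": ("m", 0.001), "cm": ("m", 0.01), "m": ("m", 1.0), "km": ("m", 1000.0),
    "g": ("kg", 0.001), "kg": ("kg", 1.0), "t": ("kg", 1000.0),
}

def normalise(value):
    # Parse "<number> <unit>" and convert the amount to the base unit.
    match = re.fullmatch(r"\s*([0-9]+(?:\.[0-9]+)?)\s*([a-zA-Z]+)\s*", value)
    if not match:
        raise ValueError(f"unparseable quantity: {value!r}")
    amount, unit = float(match.group(1)), match.group(2).lower()
    base_unit, factor = TO_BASE[unit]
    return amount * factor, base_unit

print(normalise("2.5 km"))  # → (2500.0, 'm')
print(normalise("750 g"))   # → (0.75, 'kg')
```

After this step, "2.5 km" from one source and "2500 m" from another compare as equal, which is what makes subsequent matching and fusion of values from heterogeneous sources possible.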

The WInte.r framework forms a foundation for our research on large-scale web data integration. The framework is used by the T2K Match algorithm for matching millions of Web tables against a central knowledge base, as well as within our work on Web table stitching for improving matching quality. The framework is also used in the context of the DS4DM research project for matching tabular data for data search.

Besides being used for research, we also use the WInte.r framework for teaching. The students of our Web Data Integration course use the framework to solve case studies and to implement their term projects.

Detailed information about the WInte.r framework can be found at

github.com/olehmberg/winter

The WInte.r framework can be downloaded from the same web site. The framework can be used under the terms of the Apache 2.0 License.

Lots of thanks to Alexander Brinkmann and Oliver Lehmberg for their work on the new release as well as on the tutorial and extended documentation in the WInte.r wiki.

André Melo has defended his PhD thesis

André Melo has defended his PhD thesis on „Automatic Refinement of Large-Scale Cross-Domain Knowledge Graphs“, supervised by Prof. Heiko Paulheim.

In his thesis, André has developed different methods to improve large-scale, cross-domain knowledge graphs along various dimensions. His contributions include, among others, a benchmarking suite for knowledge graph completion and correction, an effective method for type prediction using hierarchical classification, and a machine-learning-based method for detecting wrong relation assertions. Moreover, he has proposed methods for correcting errors in knowledge graphs and for distilling high-level tests from individually identified errors.

As of September, André will start a new job as a knowledge engineer for Babylon Health in London. We wish him all the best!

Data Science Conference LWDA 2018 in Mannheim

The Data and Web Science Group is hosting the Data Science Conference LWDA 2018 in Mannheim on August 22-24, 2018.

LWDA, which expands to „Lernen, Wissen, Daten, Analysen“ („Learning, Knowledge, Data, Analytics“), covers recent research in areas such as knowledge discovery, machine learning & data mining, knowledge management, database management & information systems, and information retrieval.

The LWDA conference is organized by and brings together the various special interest groups of the Gesellschaft für Informatik (German Computer Science Society) in this area. The program comprises joint research sessions and keynotes as well as workshops organized by each special interest group.

Further information can be found on the conference website: https://www.uni-mannheim.de/lwda-2018/.

Download the conference poster.

Photo credit: Data Mining Cup/prudsys AG
Mannheim Students Score Second Place at Data Mining Cup

The Data Mining Cup is an annual data mining competition for students from all over the world. Since 2014, students from Mannheim have taken part in the competition as an integral part of the Data Mining 2 lecture, held by Prof. Paulheim. In the course of the competition, the students have to solve a data mining task based on real e-commerce data.

This year, the data was provided by an online sports apparel retailer, and the task was to predict the sellout date for individual articles. Students had six weeks to develop their solutions. In the course of the lecture, they worked in different teams and had regular discussions about solution approaches and results.

One of the student teams from Mannheim qualified for the final round of the 10 best teams in May and was invited to present their solution in Berlin at the prudsys personalization & pricing summit. In the final ranking, they scored second out of 197 solutions in total. Overall, teams from 148 universities in 47 countries took part in the 2018 Data Mining Cup.

The DWS group wants to congratulate the winning team:

Nele Ecker, Thilo Habrich, Andreea Iana, Adrian Kochsiek, Alexander Luetke, Laurien Theresa Lummer, Nils Richter, Fabian Oliver Schmitt

Picture: Members of the winning team in Berlin. Left to right: Nele Ecker, Laurien Lummer, Adrian Kochsiek, Alexander Lütke

JCDL 2018 - Vannevar Bush Best Paper Award

Our paper „Entity-Aspect Linking: Providing Fine-Grained Semantics of Entities in Context“ has recently won the Vannevar Bush best paper award at the 2018 Joint Conference on Digital Libraries (JCDL), the top conference in the field of digital libraries!

The work, co-authored by Federico Nanni, Simone Paolo Ponzetto and Laura Dietz, is part of a collaboration between the DWS group and the University of New Hampshire in the context of an Elite Post-Doc grant of the Baden-Württemberg Stiftung recently awarded to Laura.

Congratulations also to Myriam Traub, Thaer Samar, Jacco van Ossenbruggen and Lynda Hardman, whose work shares the 2018 best paper award with ours!

Papers accepted at ACL 2018

We have three papers to be presented at the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), the premier international conference on Computational Linguistics and Natural Language Processing.

Two short papers prepared in collaboration with our colleagues from the University of Cambridge, the University of Hamburg and the University of Oslo have been accepted at the main conference track:

Goran Glavaš, Ivan Vulić: Explicit Retrofitting of Distributional Word Vectors.

Dmitry Ustalov, Alexander Panchenko, Andrei Kutuzov, Chris Biemann, Simone Paolo Ponzetto: Unsupervised Semantic Frame Induction using Triclustering.

One paper has been accepted at the 3rd Workshop on Representation Learning for NLP (RepL4NLP) hosted by ACL 2018:

Samuel Broscheit: Learning Distributional Token Representations from Visual Features.