31.5 billion quads of Microdata, Embedded JSON-LD, RDFa, and Microformat data originating from 9.6 million websites published
The DWS group is happy to announce the new release of the WebDataCommons Microdata, JSON-LD, RDFa and Microformat data corpus. The data has been extracted from the November 2018 version of the Common Crawl covering 2.5 billion HTML pages which originate from 32 million websites (pay-level domains).
WDC Training Dataset and Gold Standard for Large-Scale Product Matching released
The research focus in the field of entity resolution (aka link discovery or duplicate detection) is moving from traditional symbolic matching methods to embeddings and deep neural network based matching. A problem with evaluating deep learning based matchers is that they are rather training data ...
Paper accepted at EDBT 2019
Our systems and applications paper “Extending Cross-Domain Knowledge Bases with Long Tail Entities using Web Table Data” (Yaser Oulabi, Christian Bizer) got accepted at the 22nd International Conference on Extending Database Technology (EDBT 2019), one of the top-tier conferences in the data ...
New DFG Project on joining graph- and vector-based sense representations for semantic end-user information access
We are happy to announce that the Deutsche Forschungsgemeinschaft accepted our proposal for extending a joint research project on hybrid semantic representations together with our friends and colleagues of the Language Technology Group of the University of Hamburg. The project, titled “Joining ...
WInte.r Web Data Integration Framework Version 1.3 released
We are happy to announce the release of Version 1.3 of the Web Data Integration Framework (WInte.r). WInte.r is a Java framework for end-to-end data integration. The framework implements a wide variety of methods for data pre-processing, schema matching, identity resolution, data fusion, ...
Paper accepted at EMNLP 2018
Our long paper submission “Investigating the Role of Argumentation in the Rhetorical Analysis of Scientific Publications with Neural Multi-Task Learning Models” (Anne Lauscher, Goran Glavaš, Kai Eckert, and Simone Paolo Ponzetto) got accepted at the 2018 Conference on Empirical Methods in ...
André Melo has defended his PhD thesis
André Melo has defended his PhD thesis on “Automatic Refinement of Large-Scale Cross-Domain Knowledge Graphs”, supervised by Prof. Heiko Paulheim. In his thesis, André has developed different methods to improve large-scale, cross-domain knowledge graphs along various dimensions. His contributions ...
Paper accepted at COLING 2018
The position paper “Automatic Assessment of Conceptual Text Complexity Using Knowledge Graphs” by Sanja Štajner and Ioana Hulpus has been accepted at the 27th International Conference on Computational Linguistics (COLING 2018), the premier international conference on Computational Linguistics. 
Mannheim Students Score Second Place at Data Mining Cup
The Data Mining Cup is an annual data mining competition for students from all over the world. Since 2014, students from Mannheim have taken part in the competition as an integral part of the Data Mining 2 lecture, held by Prof. Paulheim. In the course of the competition, the students have to solve a data ...
JCDL 2018 – Vannevar Bush Best Paper Award
Our paper “Entity-Aspect Linking: Providing Fine-Grained Semantics of Entities in Context” has recently won the Vannevar Bush best paper award at the 2018 Joint Conference on Digital Libraries (JCDL), the top conference in the field of digital libraries! The work, coauthored by Federico Nanni, ...