OPIEC: An Open Information Extraction Corpus

OPIEC is an Open Information Extraction (OIE) corpus constructed from the entire English Wikipedia. It contains more than 341M triples. Each triple comes with rich metadata: the tokens of the subject, relation, and object along with their NLP annotations (POS tag, NER tag, ...), the provenance sentence (along with its dependency parse and its order within the article), the original (golden) links contained in the Wikipedia articles, space and time annotations, etc. (for a more detailed explanation of the metadata, see here).
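To make the shape of a triple more concrete, the following is a minimal Python sketch of how one triple and its metadata could be held in memory. It is not the actual OPIEC schema: the class names `Token` and `Triple`, all field names, and the example values are assumptions made for illustration only. The code for reading the data defines the real format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    """One token with its NLP annotations (hypothetical field names)."""
    word: str
    pos: str                     # POS tag, e.g. "NNP"
    ner: str = "O"               # NER tag; "O" when the token is not an entity
    link: Optional[str] = None   # original Wikipedia link target, if any


@dataclass
class Triple:
    """Illustrative in-memory view of one OIE triple (not the real schema)."""
    subject: List[Token]
    relation: List[Token]
    object: List[Token]
    sentence: str                # provenance sentence
    sentence_order: int          # position of the sentence within the article
    space: Optional[str] = None  # spatial annotation, if present
    time: Optional[str] = None   # temporal annotation, if present

    def surface(self) -> Tuple[str, str, str]:
        """Return the (subject, relation, object) surface strings."""
        def join(tokens: List[Token]) -> str:
            return " ".join(t.word for t in tokens)
        return join(self.subject), join(self.relation), join(self.object)


# Made-up example values, used only to exercise the sketch.
triple = Triple(
    subject=[
        Token("Albert", "NNP", "PERSON", "Albert_Einstein"),
        Token("Einstein", "NNP", "PERSON", "Albert_Einstein"),
    ],
    relation=[Token("was", "VBD"), Token("born", "VBN"), Token("in", "IN")],
    object=[Token("Ulm", "NNP", "LOCATION", "Ulm")],
    sentence="Albert Einstein was born in Ulm.",
    sentence_order=0,
)
print(triple.surface())
```

Keeping the annotations on each token, instead of on the argument as a whole, mirrors the description above, where every token of the subject, relation, and object carries its own POS and NER tags.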

 

Publication

Kiril Gashteovski, Sebastian Wanner, Sven Hertling, Samuel Broscheit, Rainer Gemulla
OPIEC: An Open Information Extraction Corpus [pdf, author's version, OpenReview]
Conference on Automated Knowledge Base Construction (AKBC), 2019

Data

Links for downloading the OPIEC corpus: 

  • OPIEC: the full corpus
  • OPIEC-Clean: triples with arguments that are considered to be "clean"
  • OPIEC-Link: triples with fully linked arguments

As a bonus corpus, we offer WikipediaNLP: the entire English Wikipedia with NLP annotations (dependency parse, POS tags, NER tags, ...).

Code

  • Code for reading the data [GitHub]
  • Code for the whole corpus construction [GitHub]

Licenses

All the code is licensed under the GPL-3.0 License. All released data is licensed under the Creative Commons Attribution-ShareAlike license (CC BY-SA) and the GNU Free Documentation License.