OPIEC is an Open Information Extraction (OIE) corpus constructed from the entire English Wikipedia. It contains more than 341M triples. Each triple in the corpus comes with rich meta-data: the tokens of its subject, relation, and object along with their NLP annotations (POS tag, NER tag, ...), the provenance sentence (along with its dependency parse and its order within the article), the original (golden) links contained in the Wikipedia articles, space / time annotations, etc. (for a more detailed explanation of the meta-data, see here).
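To make the shape of a triple and its meta-data more concrete, here is a minimal Python sketch of how one record could be modeled in memory. It is an illustration only: the class names, field names, and the example sentence are assumptions made for this sketch and do not reflect the actual schema or file format of the released corpus; the meta-data documentation remains the authoritative reference.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Token:
    """One token with a subset of its NLP annotations (hypothetical fields)."""
    word: str
    pos: str  # POS tag, e.g. "NNP"
    ner: str  # NER tag, e.g. "PERSON", or "O" for none


@dataclass
class Triple:
    """One OIE triple with meta-data (hypothetical layout, not the real schema)."""
    subject: List[Token]
    relation: List[Token]
    object: List[Token]
    sentence: str        # provenance sentence
    sentence_index: int  # order of the sentence within the article
    # Golden links: surface text -> Wikipedia article title (assumed mapping).
    links: Dict[str, str] = field(default_factory=dict)
    time: Optional[str] = None   # temporal annotation, if any
    space: Optional[str] = None  # spatial annotation, if any


def surface(tokens: List[Token]) -> str:
    """Join the token words back into a plain string."""
    return " ".join(t.word for t in tokens)


def as_text(triple: Triple) -> Tuple[str, str, str]:
    """Return the (subject, relation, object) strings of a triple."""
    return (
        surface(triple.subject),
        surface(triple.relation),
        surface(triple.object),
    )


# A made-up record for illustration; it is not taken from the corpus.
example = Triple(
    subject=[Token("Albert", "NNP", "PERSON"), Token("Einstein", "NNP", "PERSON")],
    relation=[Token("was", "VBD", "O"), Token("born", "VBN", "O"), Token("in", "IN", "O")],
    object=[Token("Ulm", "NNP", "LOCATION")],
    sentence="Albert Einstein was born in Ulm.",
    sentence_index=0,
    links={"Albert Einstein": "Albert Einstein", "Ulm": "Ulm"},
)

print(as_text(example))
```

Keeping the annotations on each token, rather than on the triple as a whole, mirrors the description above, where the subject, relation, and object each carry their own POS and NER tags.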
Kiril Gashteovski, Sebastian Wanner, Sven Hertling, Samuel Broscheit, Rainer Gemulla
OPIEC: An Open Information Extraction Corpus [pdf, author's version, OpenReview]
In Proc. of Conference on Automated Knowledge Base Construction (AKBC), 2019
Kiril Gashteovski, Rainer Gemulla, Bhushan Kotnis, Sven Hertling, Christian Meilicke
On Aligning OpenIE Extractions with Knowledge Bases: A Case Study [pdf, slides, resources]
In Proc. of Workshop on Evaluation and Comparison of NLP Systems (Eval4NLP at EMNLP), 2020
Links for downloading the OPIEC corpus:
As a bonus corpus, we offer WikipediaNLP: the entire English Wikipedia with NLP annotations (dependency parse, POS tags, NER tags, ...).
All the code is licensed under the GPL-3.0 License. All released data is licensed under the Creative Commons Attribution-ShareAlike license (CC-BY-SA) and the GNU Free Documentation License.