OPIEC: An Open Information Extraction Corpus
OPIEC is an Open Information Extraction (OIE) corpus constructed from the entire English Wikipedia. It contains more than 341M triples. Each triple comes with rich meta-data: the tokens of the subject, relation, and object together with their NLP annotations (POS tag, NER tag, ...), the provenance sentence (along with its dependency parse and its position within the article), the original (golden) links contained in the Wikipedia article, space / time annotations, etc. (for a more detailed explanation of the meta-data, see here).
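To make the shape of a triple more concrete, the following is a minimal Python sketch of the meta-data described above. It is not the corpus's actual schema: the class names `Token` and `Triple`, every field name, and the example sentence are assumptions made for illustration only, and the dependency parse and space / time annotations are left out for brevity. Consult the meta-data documentation for the real field names and file format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    """One token of a subject, relation, or object (hypothetical layout)."""
    word: str
    pos: str                    # part-of-speech tag
    ner: str                    # named-entity tag
    link: Optional[str] = None  # original Wikipedia link target, if any


@dataclass
class Triple:
    """A (subject, relation, object) extraction plus provenance (hypothetical layout)."""
    subject: List[Token]
    relation: List[Token]
    object: List[Token]
    sentence: str        # provenance sentence
    sentence_index: int  # position of the sentence within its article

    def surface(self) -> Tuple[str, str, str]:
        """Return the plain-text form of the triple."""
        def join(tokens: List[Token]) -> str:
            return " ".join(t.word for t in tokens)
        return (join(self.subject), join(self.relation), join(self.object))


# An invented example, not taken from the corpus.
example = Triple(
    subject=[Token("Mannheim", "NNP", "LOCATION", link="Mannheim")],
    relation=[Token("is", "VBZ", "O"), Token("in", "IN", "O")],
    object=[Token("Germany", "NNP", "LOCATION", link="Germany")],
    sentence="Mannheim is in Germany.",
    sentence_index=0,
)
print(example.surface())  # ('Mannheim', 'is in', 'Germany')
```

Keeping annotations at the token level, as the description above suggests, means that the surface form, the tags, and the links of an argument can all be recovered from the same list of tokens.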
Publications
Kiril Gashteovski, Sebastian Wanner, Sven Hertling, Samuel Broscheit, Rainer Gemulla
OPIEC: An Open Information Extraction Corpus [pdf, author's version, OpenReview]
In Proc. of Conference on Automated Knowledge Base Construction (AKBC), 2019
Kiril Gashteovski, Rainer Gemulla, Bhushan Kotnis, Sven Hertling, Christian Meilicke
On Aligning OpenIE Extractions with Knowledge Bases: A Case Study [pdf, slides, resources]
In Proc. of Workshop on Evaluation and Comparison of NLP Systems (Eval4NLP at EMNLP), 2020
Data
Links for downloading the OPIEC corpus:
- OPIEC: the full corpus
  - Full data (compressed: ~67 GB, uncompressed: ~928.7 GB)
  - Example file (~129 MB)
- OPIEC-Clean: containing arguments which are considered to be “clean”
  - Full data (compressed: ~35 GB, uncompressed: ~292.4 GB)
  - Example file (~40 MB)
- OPIEC-Link: containing fully linked arguments
  - Full data (compressed: ~2.8 GB, uncompressed: ~19.8 GB)
  - Example file (~2.4 MB)
As a bonus corpus, we offer WikipediaNLP: the entire English Wikipedia with NLP annotations (dependency parse, POS tags, NER tags, ...).
- Full data (compressed: ~47 GB, uncompressed: 155 GB)
- Example file (~327 MB)
Licenses
All code is licensed under the GPL-3.0 License. All released data is licensed under the Creative Commons Attribution Share-Alike license (CC-BY-SA) and the GNU Free Documentation License.