Mannheim Data Integration Benchmark (MaDI-Bench) released

We are happy to announce the release of the Mannheim Data Integration Benchmark (MaDI-Bench). Data integration combines heterogeneous data from multiple sources into a single, coherent dataset. It involves a sequence of interdependent tasks, including schema matching, value normalization, entity blocking, entity matching, and data fusion. Existing benchmarks either evaluate these steps in isolation or cover only part of the data integration pipeline, omitting specific steps. The lack of public end-to-end data integration benchmarks hinders research on methods that address the integration process as a whole. MaDI-Bench fills this gap by providing the first benchmark for end-to-end integration of relational tables covering all steps of the integration process.

 

MaDI-Bench consists of five end-to-end data integration tasks: Companies, Games, Music, Products, and Scientific Papers. Each task takes several heterogeneous source tables from one domain and asks a system to return a single fused target table by performing schema matching, value normalization, entity matching, and data fusion. MaDI-Bench supports the calculation of step-wise as well as end-to-end evaluation metrics. The benchmark provides ground truth for scoring every step of the integration process: a gold schema mapping for schema matching, the target schema's constraints and taxonomies for value normalization, record pairs for evaluating entity matchers, and hand-verified records for assessing the output of data fusion. To prevent quick saturation of the benchmark as data integration systems progress, we introduce a generic variant-generation method for deriving harder as well as easier variants from the base tasks.
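To give a sense of what a step-wise metric can look like, the following minimal sketch scores the entity matching step by comparing predicted matches against gold matching pairs. It is an illustration only: the function names, the record identifiers, and the choice of pairwise precision, recall, and F1 are assumptions made for this example, not the metrics or data formats defined by MaDI-Bench.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical representation: a match is a pair of record identifiers.
Pair = Tuple[str, str]


def normalize_pairs(pairs: Iterable[Pair]) -> Set[Pair]:
    """Sort each pair so that (a, b) and (b, a) count as the same match."""
    return {tuple(sorted(pair)) for pair in pairs}


def pairwise_prf(predicted: Iterable[Pair], gold: Iterable[Pair]) -> Dict[str, float]:
    """Pairwise precision, recall, and F1 of predicted matches against gold matches.

    Assumption: `gold` contains only matching pairs. Any predicted pair
    that is not in `gold` is counted as a false positive.
    """
    pred_set = normalize_pairs(predicted)
    gold_set = normalize_pairs(gold)
    true_positives = len(pred_set & gold_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up identifiers, purely for illustration.
gold_matches = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
predicted_matches = [("b1", "a1"), ("a2", "b9")]
scores = pairwise_prf(predicted_matches, gold_matches)
```

In this toy example one of the two predicted pairs is correct and one of the three gold pairs is found. An end-to-end metric would instead compare the fused target table with the hand-verified records.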

We validate the benchmark using human-engineered pipelines, a best-of-breed pipeline, and an LLM-based pipeline. The validation demonstrates the utility of the benchmark for measuring the step-wise as well as the end-to-end performance of data integration pipelines. All benchmark artefacts are available for public download.

More information about MaDI-Bench can be found at:

We hope that the benchmark will prove useful for the community and will support the development of fully automatic as well as human-in-the-loop end-to-end data integration systems.
