Paper Abstract
The goal of entity resolution, also known as duplicate detection or record linkage, is to identify all records in one or more data sets that refer to the same real-world entity. To achieve this goal, matching rules that encode the matching patterns in the data can be learned from manually annotated record pairs. Active learning for entity resolution aims to minimize the human labeling effort by including the human in the learning loop and selecting the most informative pairs for labeling. While active learning methods are quite successful at reducing the labeling effort, we show that their performance decreases on data sets with a large number of sparse properties. We evaluate an existing active learning method on e-commerce data sets with such characteristics and observe that it is prone to converging to suboptimal points, which produces highly varying results across different runs of the same experiment. In this paper, we present our ongoing work on a robust active learning method that tackles the observed instability. Our method is based on unsupervised matching of the record pairs in a preprocessing step. The unsupervised matching results are then used to bootstrap the active learning process and to prevent it from converging to suboptimal matching rules. The evaluation shows that the proposed method increases the robustness of the active learning process, as it minimizes the variation in results across different runs.
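
The bootstrapping idea can be illustrated with a minimal sketch. The abstract does not specify how the unsupervised matching is computed, so everything below is an assumption made for illustration: the function names `unsupervised_score` and `bootstrap_seed`, the averaging of per-property similarities, the threshold of 0.5, and the toy similarity values are all hypothetical and are not taken from the paper. Missing property values are modeled as `None` to mimic sparse properties.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical sketch: none of these names or parameters come from the paper.

def unsupervised_score(similarities: Sequence[Optional[float]]) -> float:
    """Average the available per-property similarities of one record pair.

    Sparse (missing) properties are given as None and are ignored.
    A pair with no comparable property gets a score of 0.0 (an assumption).
    """
    present = [s for s in similarities if s is not None]
    return sum(present) / len(present) if present else 0.0


def bootstrap_seed(
    pairs: Sequence[Sequence[Optional[float]]],
    threshold: float = 0.5,  # assumed decision threshold
    k: int = 2,              # assumed number of seed pairs per class
) -> Tuple[List[int], List[int]]:
    """Return indices of the k most confident pseudo-matches and
    the k most confident pseudo-non-matches."""
    # Sort pair indices by their unsupervised score, lowest first.
    scored = sorted((unsupervised_score(f), i) for i, f in enumerate(pairs))
    non_matches = [i for s, i in scored[:k] if s < threshold]
    matches = [i for s, i in scored[::-1][:k] if s >= threshold]
    return matches, non_matches


# Toy similarity vectors (one per record pair, one entry per property).
pairs = [
    [0.9, None, 0.8],
    [0.1, 0.2, None],
    [None, None, 0.95],
    [0.4, None, None],
    [0.0, 0.1, 0.05],
]

matches, non_matches = bootstrap_seed(pairs)
print("seed matches:", matches)
print("seed non-matches:", non_matches)
```

In this sketch, the seed pairs would serve as the initial labeled set for the active learner, so that the first learned rule starts from both classes instead of from a random draw. How the unsupervised results are then used to keep the learner away from suboptimal rules is not detailed in the abstract and is left out here.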