The goal of the ACM SIGMOD Programming Contest was to develop a blocking system for Entity Resolution. The WBSG solution to the task involved embedding entity descriptions using a specifically pre-trained transformer model, indexing the embeddings using FAISS, performing a nearest neighbour search in the embedding space and re-ranking retrieved entity pairs using a symbolic similarity metric. The transformer model was pre-trained using supervised contrastive learning and an additional pre-training set that was extracted from the CommonCrawl relying on schema.org annotations.
The goal of the Programming Contest was to develop a blocking system for Entity Resolution. Entity Resolution (ER) is the problem of identifying and matching different tuples that refer to the same real-world entity in a dataset, such as tuples Georgia Tech and Georgia Institute of Technology. When performing ER on a table A, considering all quadratic number (A x A) of tuple pairs can be computationally expensive. Thus, a filtering step, referred to as blocking step, is used first to quickly filter out obvious non-matches and to obtain a much smaller candidate set of tuple pairs, for which a matching step can then be applied to find matches.
For this task, the teams were asked to perform blocking on the two product datasets D1 and D2, each one containing different types of products with different distributions of data and noise. The challenge was to design a blocking system for each dataset that can quickly filter out non-matches in a limited time to generate a small candidate set with high recall (not losing too many true matches).
The WBSG blocking system applies a mixture of modern neural and traditional symbolic blocking techniques. One of the main challenges for the blocking system was the restriction on computational power imposed by the organizers of the contest, as all submissions were evaluated using a CPU-only machine with 16 cores and 32GB RAM and had to finish within 35 minutes. Implementing a neural approach using Transformers given these restrictions is challenging. We overcame this challenge by using an extremely small distilled Transformer in combination with an approximate nearest neighbour search using Facebook's FAISS library as presented by Johnson et al. in their paper Billion-scale similarity search with GPUs.
Our blocking system performs 7 steps. In the first step, domain-specific preprocessing is applied to normalize the product descriptions. In the second step, tuples with the same description are grouped. The descriptions are embedded using task-specific fine-tuned Transformer models and the embeddings are indexed using FAISS during steps three and four. An approximate nearest neighbour search for all groups of product descriptions over the FAISS indexes is executed in step five. The output of step five is a ranked list of candidate group pairs. This ranked list of group pairs is re-ranked using the average similarity score of the neural similarity and the Jaccard similarity of the candidate group pairs in step six. The reranking smoothes the bias of the embedding model. In the seventh step, the final tuple pairs are generated by combining tuple identifiers from the most similar candidate group pairs until the amount of 1 million tuple pairs for D1 and 2 million tuples pairs for D2 is created. The amount of pairs was restricted by the organizers of the contest.
The key step of the whole blocking system is the third step during which the surface forms of all grouped products are embedded. For the embedding, the xtremedistil-l6-h256-uncased transformer model proposed by Microsoft is used. For performance reasons, the maximum sequence length of the tokenizer, which is used to encode the product surface forms, is limited to 28 and 24 tokens for the datasets D1 and D2. Additionally, the model is forced to focus on the most important features of the model output through a simple feed-forward layer that projects the 256 dimensions of the transformer model’s output to 32 dimensions.
The transformer model is trained using the architecture presented by Peeters and Bizer in the paper Supervised Contrastive Learning for Product Matching, which itself uses a supervised contrastive loss function presented by Khosla et al. in the paper Supervised Contrastive Learning. In addition to the training data provided by the challenge organisers, the model is also pre-trained using a large set of product offers, which were extracted from the Common Crawl by relying on schema.org annotations. The offers in this pre-training set are grouped by product IDs and belong to the product category “Computers and Accessories”. The offers were already used for the intermediate training for the experiments documented in Intermediate Training of BERT for Product Matching.