# Identity Resolution with PyDI

## Learning Objectives

In this exercise, you will learn how to:

1. **Implement and Evaluate Blocking Strategies**: Reduce the search space using blocking techniques
2. **Build Rule-Based Matchers**: Create matching rules using similarity comparators
3. **Debug and Refine**: Analyze errors and iteratively improve matching quality
4. **Apply LLM-Based Matching**: Use large language models for entity resolution

## Entity Matching Problem

You have two movie datasets with the same schema (after schema mapping):
- **Academy Awards Dataset**: Contains movies that won or were nominated for Academy Awards
- **Actors Dataset**: Contains movies with detailed actor information

Your goal is to identify which records in these two datasets refer to the same real-world movie. This is challenging because:
- Movie titles may have spelling variations
- Dates may be slightly different
- Director and actor information may be incomplete
- Some movies have very similar titles but are different films

## 1. Setup

Install the PyDI package. We regularly push fixes to PyDI, so it does not hurt to re-run this cell from time to time!

In addition to this notebook, check out the [PyDI tutorial notebook](docs/tutorial/PyDI_Tutorial.ipynb) and the [PyDI Wiki](https://github.com/wbsg-uni-mannheim/PyDI/blob/main/docs/wiki/Home.md) for additional methods/information!

In [1]:
#!pip install -qU uma-pydi

PyDI supports two levels of logging: INFO and DEBUG.

Use the more verbose DEBUG level while experimenting and iterating: it shows what is going on behind the scenes and which parameters you may want or need to change.

In [2]:
import logging

import os
os.makedirs('logs/', exist_ok=True)

# choose either default logging or debug logging

# # Configure logging for INFO level
# logging.basicConfig(
#     level=logging.INFO,
#     format='[%(levelname)-5s] %(name)s - %(message)s',
#     handlers=[
#           logging.FileHandler('logs/pydi.log'),  # Save to file
#           logging.StreamHandler()                      # Display on console
#       ],
#     force=True
# )

# Configure logging for DEBUG level
logging.basicConfig(
    level=logging.DEBUG,
    format='[%(levelname)-5s] %(name)s - %(message)s',
    handlers=[
          logging.FileHandler('logs/pydi.log'),  # Save to file
          logging.StreamHandler()                      # Display on console
      ],
    force=True
)
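With DEBUG enabled globally, third-party libraries can become noisy as well. A minimal sketch using only the standard `logging` module, assuming the logger name prefixes that appear in this notebook's output (`PyDI`, `numexpr`, `httpx`), keeps PyDI verbose while quieting the rest:

```python
import logging

# Keep PyDI verbose while quieting third-party loggers.
# The logger names are taken from the log prefixes shown in this notebook.
logging.getLogger("PyDI").setLevel(logging.DEBUG)
for noisy in ("numexpr", "httpx"):
    logging.getLogger(noisy).setLevel(logging.WARNING)

# Child loggers inherit the level of their closest configured ancestor.
child_level = logging.getLogger("PyDI.entitymatching.blocking").getEffectiveLevel()
print(logging.getLevelName(child_level))
```

Because levels are inherited along the dotted logger name, setting the level once on `PyDI` is enough for all of its submodules.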

In [3]:
from pathlib import Path

# Setup directories
DATA_DIR = Path("data")
INPUT_DIR = DATA_DIR / "input"
OUTPUT_DIR = DATA_DIR / "output"
SPLITS_DIR = DATA_DIR / "splits"

# Create output directory if it doesn't exist
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Define file paths
ACADEMY_AWARDS_FILE = INPUT_DIR / "academy_awards.xml"
ACTORS_FILE = INPUT_DIR / "actors.xml"
TRAIN_FILE = SPLITS_DIR / "train.csv"
VALIDATION_FILE = SPLITS_DIR / "validation.csv"
TEST_FILE = SPLITS_DIR / "test.csv"

print(f"Data directory: {DATA_DIR.absolute()}")
print(f"Output directory: {OUTPUT_DIR.absolute()}")

Data directory: c:\Users\Ralph\dev\pydi\IR_ex\data
Output directory: c:\Users\Ralph\dev\pydi\IR_ex\data\output


## 2. Data Loading and Exploration

PyDI supports various data loading methods as part of its IO module. As MapForce outputs XML files, we use load_xml.

You can use the DataFrame.attrs attribute to add meta information to the dataset, e.g. the source of the dataset.

In [4]:
from PyDI.io import load_xml

# Load the datasets
df_academy = load_xml(ACADEMY_AWARDS_FILE, name="academy_awards")

# Add an example source to the dataframe attributes
df_academy.attrs['source'] = 'academyawards'

df_actors = load_xml(ACTORS_FILE, name="actors")
df_actors.attrs['source'] = 'imdb'

print(f"\nAcademy Awards dataset: {len(df_academy)} records")
print(f"Actors dataset: {len(df_actors)} records")
print(f"\nTotal possible pairs (cartesian product): {len(df_academy) * len(df_actors):,}")

[INFO ] numexpr.utils - Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[INFO ] numexpr.utils - NumExpr defaulting to 16 threads.



Academy Awards dataset: 4580 records
Actors dataset: 149 records

Total possible pairs (cartesian product): 682,420


In [5]:
# Display sample records from Academy Awards
print("\n=== Sample records from Academy Awards dataset ===")
display(df_academy.head())

print("\n=== Missing values ===")
print(df_academy.isnull().sum())


=== Sample records from Academy Awards dataset ===


| | id | title | actors_actor_name | date | director_name | oscar |
|---|---|---|---|---|---|---|
| 0 | academy_awards_1 | Biutiful | Javier Bardem | 2010-01-01 | | |
| 1 | academy_awards_2 | True Grit | [Jeff Bridges, Hailee Steinfeld] | 2010-01-01 | Joel Coen and Ethan Coen | |
| 2 | academy_awards_3 | The Social Network | Jesse Eisenberg | 2010-01-01 | David Fincher | yes |
| 3 | academy_awards_4 | The King's Speech | [Colin Firth, Geoffrey Rush, Helena Bonham Car... | 2010-01-01 | Tom Hooper | yes |
| 4 | academy_awards_5 | 127 Hours | James Franco | 2010-01-01 | | |



=== Missing values ===
id                      0
title                  12
actors_actor_name    3531
date                    0
director_name        4172
oscar                3313
dtype: int64


We can see that title and date are the only attributes that are available across most records in the academy_awards dataset. These two attributes will thus be highly important for most matching decisions.
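To make "available across most records" concrete, the missing-value counts printed above can be turned into shares. The numbers below are copied from that output; the variable names are chosen only for this sketch:

```python
# Missing-value counts copied from the output above (4580 records in total).
total_records = 4580
missing = {
    "title": 12,
    "actors_actor_name": 3531,
    "date": 0,
    "director_name": 4172,
    "oscar": 3313,
}

# Share of records that lack each attribute.
missing_share = {col: count / total_records for col, count in missing.items()}

for col, share in sorted(missing_share.items(), key=lambda kv: kv[1]):
    print(f"{col:<18} {share:6.1%} missing")
```

Roughly nine out of ten records have no director, whereas the title is missing in well under 1% of the records.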

In [6]:
# Display sample records from Actors
print("\n=== Sample records from Actors dataset ===")
display(df_actors.head())

print("\n=== Missing values ===")
print(df_actors.isnull().sum())


=== Sample records from Actors dataset ===


| | id | title | actors_actor_name | actors_actor_birthday | actors_actor_birthplace | date |
|---|---|---|---|---|---|---|
| 0 | actors_1 | 7th Heaven | Janet Gaynor | 1906-01-01 | Pennsylvania | 1929-01-01 |
| 1 | actors_2 | Coquette | Mary Pickford | 1892-01-01 | Canada | 1930-01-01 |
| 2 | actors_3 | The Divorcee | Norma Shearer | 1902-01-01 | Canada | 1931-01-01 |
| 3 | actors_4 | Min and Bill | Marie Dressler | 1868-01-01 | Canada | 1932-01-01 |
| 4 | actors_5 | The Sin of Madelon Claudet | Helen Hayes | 1900-01-01 | Washington DC | 1933-01-01 |



=== Missing values ===
id                         0
title                      0
actors_actor_name          0
actors_actor_birthday      0
actors_actor_birthplace    0
date                       0
dtype: int64


### Load Test and Development Set

To evaluate identity resolution, we need labeled matching and non-matching record pairs. For tips on creating these labeled sets, have a look at the slide sets of this exercise and the lecture.

Each labeled set contains labeled pairs of records:
- **TRUE**: The two records refer to the same movie
- **FALSE**: The two records refer to different movies

We have three files for this exercise:
- **Training set**: used to train supervised matchers or to select few-shot examples for LLMs
- **Validation set**: used to tune parameters and evaluate design choices
- **Test set**: used only for the final evaluation

In [7]:
import pandas as pd

TRAIN_FILE = SPLITS_DIR / "train.csv"
VALIDATION_FILE = SPLITS_DIR / "validation.csv"
TEST_FILE = SPLITS_DIR / "test.csv"

# Load labeled splits
df_train = pd.read_csv(TRAIN_FILE)
df_validation = pd.read_csv(VALIDATION_FILE)
df_test = pd.read_csv(TEST_FILE)

print(f"Training set: {len(df_train)} pairs")
print(f"Validation set: {len(df_validation)} pairs")
print(f"Test set: {len(df_test)} pairs")

Training set: 150 pairs
Validation set: 147 pairs
Test set: 150 pairs


## 3. Blocking Strategies

Comparing all possible pairs would require comparing every record in dataset A with every record in dataset B. For large datasets, this is computationally infeasible.

**Blocking** reduces the search space by only comparing records (candidate pairs) that share some common property (e.g., same first letter of title, similar year). This dramatically reduces the number of comparisons but brings the additional challenge of maintaining high recall (we don't want to miss [m]any true matches).

Key metrics:
- **Pair Completeness (Recall)**: What % of true matches are included in the candidate pairs?
- **Reduction Ratio**: What % of all possible pairs are eliminated?
- **Pairs Quality**: What % of candidate pairs are true matches?
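The three metrics can be sketched in a few lines of plain Python. This is an illustration of the definitions above, not PyDI's evaluator; the function name `blocking_metrics` and the toy ids are assumptions made for this example:

```python
def blocking_metrics(candidates, true_matches, n_left, n_right):
    """Return (pair completeness, reduction ratio, pairs quality)."""
    candidates = set(candidates)
    true_matches = set(true_matches)
    found = candidates & true_matches          # true matches that survived blocking
    total_pairs = n_left * n_right             # size of the cartesian product

    pair_completeness = len(found) / len(true_matches) if true_matches else 0.0
    reduction_ratio = 1.0 - len(candidates) / total_pairs if total_pairs else 0.0
    pairs_quality = len(found) / len(candidates) if candidates else 0.0
    return pair_completeness, reduction_ratio, pairs_quality


# Toy example: 3 x 4 records, 3 known matches, the blocker proposes 4 pairs.
toy_candidates = [("a1", "b1"), ("a1", "b2"), ("a2", "b2"), ("a3", "b4")]
toy_matches = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
pc, rr, pq = blocking_metrics(toy_candidates, toy_matches, n_left=3, n_right=4)
print(f"PC={pc:.3f} RR={rr:.3f} PQ={pq:.3f}")
```

In the toy example the blocker loses the match `("a3", "b3")`, so pair completeness drops to 2/3 even though two thirds of all comparisons are saved.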

PyDI supports the following Blockers:

1. **StandardBlocker** - Equality-based blocking on one or more key columns.
2. **SortedNeighbourhoodBlocker** - Sorted neighbourhood blocking using a sliding window over a sort key.
3. **TokenBlocker** - Token-based blocking using token overlap on a string column.
4. **EmbeddingBlocker** - Embedding-based blocking using nearest neighbor search over text embeddings.
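As an illustration of the sliding-window idea in the list above, here is a minimal sorted-neighbourhood sketch in plain Python. It is not PyDI's implementation; the function name, the tuple layout, and the toy titles are assumptions made for this example:

```python
def sorted_neighbourhood_pairs(left, right, window=3):
    """Return (left_id, right_id) pairs whose sort keys fall into one sliding window.

    `left` and `right` are lists of (record_id, sort_key) tuples.
    """
    # Merge both datasets, remember the origin of each record, and sort by key.
    merged = [(key, "L", rid) for rid, key in left]
    merged += [(key, "R", rid) for rid, key in right]
    merged.sort()

    pairs = set()
    for i, (_, side_i, id_i) in enumerate(merged):
        # Compare each record with the next window-1 records in sort order.
        for _, side_j, id_j in merged[i + 1 : i + window]:
            if side_i == side_j:
                continue  # only cross-dataset pairs are candidates
            pairs.add((id_i, id_j) if side_i == "L" else (id_j, id_i))
    return pairs


left = [("a1", "casablanca"), ("a2", "titanic"), ("a3", "gandhi")]
right = [("b1", "casablanka"), ("b2", "titanik")]
sn_pairs = sorted_neighbourhood_pairs(left, right, window=2)
print(sorted(sn_pairs))
```

Unlike equality-based blocking, near-identical keys such as `casablanca` and `casablanka` end up next to each other in sort order and are therefore compared. A larger window raises recall at the cost of more candidate pairs.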

### Standard Blocking

Let's try a Standard Blocker first. The blocking key used here concatenates the first two characters of each of the first three words of the title, in upper case:

In [8]:
from PyDI.entitymatching.blocking import StandardBlocker

# First, we define a function to generate blocking keys
def generate_blocking_key(title):
    """
    Generate blocking key from first 3 words of title.
    Takes first 2 chars of each word, concatenates and uppercases.
    """
    if not isinstance(title, str):
        return None
    
    tokens = title.split()
    blocking_key = ""
    
    for i in range(min(3, len(tokens))):
        blocking_key += tokens[i][:2].upper()
    
    return blocking_key if blocking_key else None

# Apply to DataFrames
df_academy['blocking_key'] = df_academy['title'].apply(generate_blocking_key)
df_actors['blocking_key'] = df_actors['title'].apply(generate_blocking_key)

# Apply standard blocking
blocker_standard = StandardBlocker(
    df_academy, df_actors,
    on=['blocking_key'],
    output_dir=OUTPUT_DIR / "blocking-evaluation",
    id_column='id'
)

# Materialize all candidate pairs
candidates_standard = blocker_standard.materialize()

print()
print(f"  Generated: {len(candidates_standard):,} candidates")

# Display sample candidates
display(candidates_standard.head(10))

[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocker - Creating blocking key values for dataset1: 4580 records
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocker - Creating blocking key values for dataset2: 149 records
[INFO ] PyDI.entitymatching.blocking.standard.StandardBlocker - created 3586 blocking keys for first dataset
[INFO ] PyDI.entitymatching.blocking.standard.StandardBlocker - created 145 blocking keys for second dataset
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocker - Joining blocking key values: 3586 x 145 blocks
[INFO ] PyDI.entitymatching.blocking.standard.StandardBlocker - created 142 blocks from blocking keys
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocker - Block size distribution:
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocker - Size Frequency
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocker - 92          1
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocker - 14          2
[DEBUG


  Generated: 396 candidates


| | id1 | id2 |
|---|---|---|
| 0 | academy_awards_2 | actors_120 |
| 1 | academy_awards_2187 | actors_120 |
| 2 | academy_awards_23 | actors_17 |
| 3 | academy_awards_733 | actors_17 |
| 4 | academy_awards_892 | actors_17 |
| 5 | academy_awards_1532 | actors_17 |
| 6 | academy_awards_2349 | actors_17 |
| 7 | academy_awards_3624 | actors_17 |
| 8 | academy_awards_31 | actors_79 |
| 9 | academy_awards_3091 | actors_79 |
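To get a feel for how this blocking key behaves, the sketch below restates the key function from the cell above in compact form (the name `blocking_key` and its parameters are introduced only for this example) and applies it to a few titles from the sample output:

```python
def blocking_key(title, n_words=3, n_chars=2):
    """Compact restatement of generate_blocking_key from the cell above."""
    if not isinstance(title, str):
        return None
    key = "".join(token[:n_chars].upper() for token in title.split()[:n_words])
    return key or None


examples = ["True Grit", "The Social Network", "The King's Speech", "127 Hours", None]
keys = {title: blocking_key(title) for title in examples}
for title, key in keys.items():
    print(f"{title!r:25} -> {key}")

# A reordered variant of the same title lands in a different block.
reordered_key = blocking_key("King's Speech, The")
print(reordered_key)
```

Word order matters for this key: a variant such as "King's Speech, The" yields a key that differs from the one for "The King's Speech". This is one way in which true matches can be lost with key-based blocking.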


### Blocking Evaluation

PyDI provides evaluation methods for blocking with pair completeness, pair quality, and reduction ratio as part of the `EntityMatchingEvaluator` class:
- **`evaluate_blocking()`**: Evaluates blocking given an already materialized set of pairs.
- **`evaluate_blocking_batched()`**: Evaluates blocking by iterating over batches and storing results. Useful for very large datasets.

Let's first evaluate our materialized blocking results against our validation set.

In [None]:
from PyDI.entitymatching.evaluation import EntityMatchingEvaluator

# Evaluate blocking quality using validation set
blocking_metrics_standard = EntityMatchingEvaluator.evaluate_blocking(
    candidate_pairs=candidates_standard,
    blocker=blocker_standard,
    test_pairs=df_validation,
    out_dir=OUTPUT_DIR / "blocking-evaluation"
)

[INFO ] root -   Pair Completeness: 0.957
[INFO ] root -   Pair Quality:      0.114
[INFO ] root -   Reduction Ratio:   0.999
[INFO ] root -   True Matches Found: 45/47
[INFO ] root - Blocking evaluation complete!


### **Task 1**

Experiment with different blockers:

- First, use the NoBlocker to see the maximum runtime
- Then, try different blocking keys with the StandardBlocker
- Finally, try the TokenBlocker

You can skip blocking entirely if your datasets are small enough, or if you want a baseline to compare your blockers against, by using PyDI's NoBlocker object:

In [None]:
from PyDI.entitymatching.blocking import NoBlocker, TokenBlocker

blocker_none = NoBlocker(
    df_academy, df_actors,
    id_column='id'  # specify the ID column for both datasets
)

# For small datasets, we can materialize the full set of pairs
candidates_noblocker = blocker_none.materialize()

# Evaluate blocking quality using validation set
blocking_metrics_none = EntityMatchingEvaluator.evaluate_blocking(
    candidate_pairs=candidates_noblocker,
    blocker=blocker_none,
    test_pairs=df_validation,
    out_dir=OUTPUT_DIR / "blocking-evaluation"
)

[INFO ] root -   Pair Completeness: 1.000
[INFO ] root -   Pair Quality:      0.000
[INFO ] root -   Reduction Ratio:   0.000
[INFO ] root -   True Matches Found: 47/47
[INFO ] root - Blocking evaluation complete!


### Token Blocking

Now, let's try token blocking:

Token blocking creates blocks based on individual words or ngrams (tokens) in the title. This is more flexible than standard blocking because:
- Movies with different word order can still be compared
- Multi-word titles are more likely to match
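The idea can be sketched with an inverted index over character n-grams. This is a simplified illustration, not PyDI's `TokenBlocker`; the helper names and the toy titles are assumptions made for this example:

```python
from collections import defaultdict


def char_ngrams(text, n=3):
    """Return the set of lowercase character n-grams of a string."""
    text = text.lower()
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def token_block(left, right, n=3):
    """Pair up records from two {id: title} dicts that share at least one n-gram."""
    index = defaultdict(set)  # n-gram -> ids of right-hand records containing it
    for rid, title in right.items():
        for gram in char_ngrams(title, n):
            index[gram].add(rid)

    pairs = set()
    for lid, title in left.items():
        for gram in char_ngrams(title, n):
            pairs.update((lid, rid) for rid in index.get(gram, ()))
    return pairs


left_titles = {"a1": "Seventh Heaven", "a2": "Coquette"}
right_titles = {"b1": "7th Heaven", "b2": "Min and Bill"}
token_pairs = token_block(left_titles, right_titles)
print(token_pairs)
```

"Seventh Heaven" and "7th Heaven" share several trigrams and become a candidate pair, although a key built from the leading characters of each word would differ for them. The price is a much larger candidate set, because frequent n-grams connect many unrelated titles.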

In [None]:
from PyDI.entitymatching.blocking import TokenBlocker

# Apply token blocking
print("Applying token blocking on title...")
blocker_token = TokenBlocker(
    df_academy, df_actors,
    column='title',      # Tokenize titles
    output_dir=OUTPUT_DIR / "blocking-evaluation",
    id_column='id',
    ngram_size=3,
    ngram_type='character'
)

candidates_token = blocker_token.materialize()

print(f"\nGenerated {len(candidates_token):,} candidate pairs")

# Evaluate blocking quality using validation set
blocking_metrics_token = EntityMatchingEvaluator.evaluate_blocking(
    candidate_pairs=candidates_token,
    blocker=blocker_token,
    test_pairs=df_validation,
    out_dir=OUTPUT_DIR / "blocking-evaluation"
)

[DEBUG] PyDI.entitymatching.blocking.token_blocking.TokenBlocker - Creating token index for dataset1: 4580 records
[DEBUG] PyDI.entitymatching.blocking.token_blocking.TokenBlocker - Creating token index for dataset2: 149 records
[INFO ] PyDI.entitymatching.blocking.token_blocking.TokenBlocker - created 5557 token keys for first dataset
[INFO ] PyDI.entitymatching.blocking.token_blocking.TokenBlocker - created 1036 token keys for second dataset
[DEBUG] PyDI.entitymatching.blocking.token_blocking.TokenBlocker - Joining token keys: 5557 x 1036 tokens
[INFO ] PyDI.entitymatching.blocking.token_blocking.TokenBlocker - created 1034 blocks from token keys
[DEBUG] PyDI.entitymatching.blocking.token_blocking.TokenBlocker - Token frequency distribution:
[DEBUG] PyDI.entitymatching.blocking.token_blocking.TokenBlocker - Size Frequency
[DEBUG] PyDI.entitymatching.blocking.token_blocking.TokenBlocker - 38          3
[DEBUG] PyDI.entitymatching.blocking.token_blocking.TokenBlocker - 37          2
[D

Applying token blocking on title...

Generated 110,905 candidate pairs


[INFO ] root -   Pair Completeness: 1.000
[INFO ] root -   Pair Quality:      0.000
[INFO ] root -   Reduction Ratio:   0.837
[INFO ] root -   True Matches Found: 47/47
[INFO ] root - Blocking evaluation complete!


Depending on your use case and data, different blockers may be more appropriate. Feel free to also try out other blockers like the embedding-based blocker of PyDI!


### Some important things to keep in mind for blocking:

- Strive for **high reduction ratio** but keep **pair completeness over 97%**!

- **Your evaluation is only as good as your evaluation set!** You may see 100% pair completeness and assume you do not lose any matches. But that is only true if your evaluation set is "perfect"! Do not blindly believe the metrics but use common sense and manually verify that your evaluation set is representative of your matching problem!

**Important:** While PyDI outputs the "pair quality" measure, this measure is only valid if your evaluation set contains ALL possible positive matches. As this is highly unlikely due to the required manual labeling effort, do not try to optimize this metric and do not report it!

## 4. Rule-Based Matching with Iterative Refinement

Now we'll build a rule-based matcher that computes a weighted similarity score for each candidate pair.

We'll start with a simple matcher using:
- **Title similarity**: Levenshtein similarity on titles
- **Date similarity**: Numeric tolerance (within 2 years)

Additional comparators are defined but commented out. You can uncomment them during refinement to try to increase the matching score.
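Conceptually, such a rule is a weighted average of per-attribute similarities that is compared against a threshold. The sketch below shows the idea in plain Python. It is not PyDI's implementation: the normalization of the edit distance and the linear decay of the date similarity are assumptions made for this example.

```python
from datetime import date


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def title_similarity(a, b):
    # Assumption: normalize the distance by the length of the longer string.
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def date_similarity(d1, d2, max_days_difference=730):
    # Assumption: similarity decays linearly and reaches 0 at the tolerance limit.
    diff = abs((d1 - d2).days)
    return max(0.0, 1.0 - diff / max_days_difference)


def rule_score(similarities, weights):
    """Weighted average of the individual similarity values."""
    return sum(w * s for w, s in zip(weights, similarities)) / sum(weights)


sims = [
    title_similarity("True Grit", "true grit"),
    date_similarity(date(2010, 1, 1), date(2011, 1, 1)),
]
score = rule_score(sims, weights=[0.5, 0.5])
print(sims, score, score >= 0.6)
```

With identical titles and dates one year apart, the two similarities are 1.0 and 0.5, so equal weights give a score of 0.75, which passes a threshold of 0.6.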

In [12]:
from PyDI.entitymatching.comparators import StringComparator, DateComparator

# Define comparators
comparators = [
    StringComparator(column="title",
                      similarity_function="levenshtein",
                        preprocess=str.lower
    ),
    DateComparator(
        column="date", 
        max_days_difference=730  # 2 year tolerance (2*365 days)
    ),
]

# define the weights for the comparators
weights = [0.5, 0.5]
# define the similarity threshold of the rule for deciding on a match
threshold = 0.6

# Uncomment these and add them to the list during refinement based on debug analysis
# Note: the loaded data uses the column names director_name and actors_actor_name;
# director_name is not present in the Actors dataset shown above.
# StringComparator(column="title", similarity_function="jaccard", tokenization="word", preprocess=str.lower), # Token similarity
# StringComparator(column="director_name", similarity_function="jaccard", tokenization="word", preprocess=str.lower),  # Director name
# StringComparator(column="director_name", similarity_function="levenshtein", preprocess=str.lower),  # Director with edit distance
# StringComparator(column="actors_actor_name", similarity_function="jaccard", tokenization="word", list_strategy="best_match", preprocess=str.lower),  # Best actor match
# DateComparator(column="date", max_days_difference=365),  # less lenient date matching (1 year)

Next, we initialize the rule-based matcher and activate the Debug mode of the matcher. When Debug mode is active, PyDI writes detailed debug logs into the output, which we can subsequently use to refine the matching results.

Another important parameter of all matchers is `candidates`. We can either pass a set of already materialized candidates (e.g. `candidates_standard` from before) or pass the blocker directly. As datasets are usually large, passing the blocker makes more sense: the matcher then runs the blocking internally on batches instead of fully materializing all pairs at once. You can control the size of the batches with the `batch_size` parameter when initializing the blocker.
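Why batching helps can be sketched with a generator that never holds more than one batch of pairs in memory. The function `candidate_batches` is hypothetical and only illustrates the principle; how PyDI batches internally is not shown in this notebook:

```python
from itertools import islice, product


def candidate_batches(left_ids, right_ids, batch_size):
    """Lazily yield lists of at most batch_size candidate pairs."""
    pairs = product(left_ids, right_ids)  # iterator, yields pairs one at a time
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return
        yield batch


batch_sizes = [
    len(batch)
    for batch in candidate_batches(["a1", "a2", "a3"], ["b1", "b2"], batch_size=4)
]
print(batch_sizes)
```

Each batch can be scored and discarded before the next one is generated, so peak memory depends on `batch_size` rather than on the total number of candidate pairs.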

In [13]:
from PyDI.entitymatching import RuleBasedMatcher

# Initialize matcher
matcher = RuleBasedMatcher()

# Run matching with debug mode enabled
correspondences, debug_info = matcher.match(
    df_left=df_academy,
    df_right=df_actors,
    candidates=blocker_standard,
    id_column='id',
    comparators=comparators,
    weights=weights,
    threshold=threshold,
    debug=True
)

[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Blocking 4580 x 149 elements
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocker - Creating candidate record pairs from 142 blocks
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Matching 4580 x 149 elements after 0:00:0.003; 396 blocked pairs (reduction ratio: 0.9994197122006975)
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocker - Creating candidate record pairs from 142 blocks
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Entity Matching finished after 0:00:0.189; found 134 correspondences.


We can evaluate the result of our entity matching with this method of the EntityMatchingEvaluator:
- **`evaluate_matching()`**: Evaluates matching given an evaluation set and the predicted correspondences. 

In [14]:
debug_output_dir = OUTPUT_DIR / "debug_results_entity_matching"
debug_output_dir.mkdir(parents=True, exist_ok=True)

eval_results = EntityMatchingEvaluator.evaluate_matching(
    correspondences=correspondences,
    test_pairs=df_validation,
    out_dir=debug_output_dir,
    debug_info=debug_info, # add debug info (optional)
    matcher_instance=matcher # add matcher instance for context for debug files (optional)
)

[DEBUG] root - Individual correspondence evaluations:
[DEBUG] root - [correct] academy_awards_2187,actors_120,TRUE,sim:0.7500
[DEBUG] root - [correct] academy_awards_2732,actors_110,TRUE,sim:0.7500
[DEBUG] root - [correct] academy_awards_2040,actors_46,TRUE,sim:0.7493
[DEBUG] root - [correct] academy_awards_503,actors_148,TRUE,sim:1.0000
[DEBUG] root - [correct] academy_awards_1430,actors_132,TRUE,sim:0.7493
[DEBUG] root - [correct] academy_awards_608,actors_146,TRUE,sim:0.7500
[DEBUG] root - [correct] academy_awards_618,actors_73,TRUE,sim:0.7500
[DEBUG] root - [correct] academy_awards_723,actors_71,TRUE,sim:0.7500
[DEBUG] root - [correct] academy_awards_2892,actors_29,TRUE,sim:0.7493
[DEBUG] root - [correct] academy_awards_902,actors_141,TRUE,sim:0.7083
[DEBUG] root - [correct] academy_awards_910,actors_68,TRUE,sim:0.7500
[DEBUG] root - [correct] academy_awards_3244,actors_101,TRUE,sim:0.7500
[DEBUG] root - [correct] academy_awards_1272,actors_135,TRUE,sim:0.7500
[DEBUG] root - [corre

### Cluster Consistency Analysis

Let's analyze the **cluster structure** to identify any inconsistencies our imperfect evaluation set may miss. The EntityMatchingEvaluator offers the `create_cluster_size_distribution` method for this purpose.

In [15]:
# Create cluster size distribution from our matches
cluster_distribution = EntityMatchingEvaluator.create_cluster_size_distribution(
    correspondences=correspondences,
    out_dir=OUTPUT_DIR / "cluster_analysis"
)

[INFO ] root - Cluster Size Distribution of 132 clusters:
[INFO ] root - 	Cluster Size	| Frequency	| Percentage
[INFO ] root - 	──────────────────────────────────────────────────
[INFO ] root - 		2	|	130	|	98.48%
[INFO ] root - 		3	|	2	|	1.52%
[INFO ] root - Cluster size distribution written to data\output\cluster_analysis\cluster_size_distribution.csv


- Remember: **Your evaluation is only as good as your evaluation set!** This is also true for the matching step! You may see an F1 of 95% and be happy, but if the evaluation set does not accurately represent your data, that number is misleading. The cluster size distribution is one way to spot this. Seeing many clusters with a size larger than 2 when you are sure your source datasets are deduplicated should make you question your evaluation set and manually check what is going on!

- In this stage of the project, you will **iteratively refine your evaluation sets** not only based on the metrics you see but also manual inspection of debug logs and cluster size distribution.
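A cluster here is a connected component in the graph whose edges are the correspondences. The following sketch computes the size distribution with a small union-find structure. It illustrates the concept only; the function name and the toy ids are assumptions, and PyDI's own method may work differently:

```python
from collections import Counter


def cluster_size_distribution(correspondences):
    """Count connected-component sizes for a list of (id1, id2) matches."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Merge the components of both ends of every correspondence.
    for a, b in correspondences:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    component_sizes = Counter(find(node) for node in list(parent))
    return Counter(component_sizes.values())  # cluster size -> frequency


toy = [("a1", "b1"), ("a2", "b2"), ("a3", "b2")]
distribution = cluster_size_distribution(toy)
print(dict(distribution))
```

In the toy data, `b2` is matched by two different left-hand records, which produces one cluster of size 3. The same arithmetic is consistent with the output above: 130 clusters of size 2 and 2 clusters of size 3 account for 130 + 2 × 2 = 134 correspondences.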

### Post-Processing: Global Matching

#### The One-to-One Constraint

In identity resolution, we often want to enforce a **one-to-one constraint**: each record in dataset A should match at most one record in dataset B, and vice versa. It only makes sense to enforce this constraint if you are sure that your source datasets are themselves duplicate-free!

Our current results may violate this constraint (one movie in the Academy Awards dataset might match multiple movies in the Actors dataset), so enforcing it can still improve the results.

PyDI offers the following methods for global matching:

- **GreedyOneToOneMatchingAlgorithm**: Ensures one-to-one matching by greedily selecting highest-scoring correspondences while avoiding conflicts.
- **MaximumBipartiteMatching**: Finds optimal one-to-one matching using maximum weight bipartite matching algorithms.
- **StableMatching**: Finds stable matches where records are matched to mutually preferred partners, ensuring no record would prefer to switch.
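The greedy strategy is the simplest of the three and can be sketched in a few lines. This is an illustration of the idea, not PyDI's `GreedyOneToOneMatchingAlgorithm`; the tuple layout is an assumption, and the ids and scores are borrowed from debug output in this notebook:

```python
def greedy_one_to_one(correspondences):
    """Keep the highest-scoring pairs such that every id is used at most once.

    correspondences: iterable of (left_id, right_id, score) tuples.
    """
    used_left, used_right, kept = set(), set(), []
    for left, right, score in sorted(correspondences, key=lambda c: c[2], reverse=True):
        if left in used_left or right in used_right:
            continue  # conflicts with a better-scoring pair that was already kept
        used_left.add(left)
        used_right.add(right)
        kept.append((left, right, score))
    return kept


scored = [
    ("academy_awards_2", "actors_120", 0.70),
    ("academy_awards_2187", "actors_120", 0.85),
    ("academy_awards_503", "actors_148", 1.00),
]
kept = greedy_one_to_one(scored)
print(kept)
```

Greedy selection is fast but not guaranteed to maximize the total score; a maximum weight bipartite matching may choose a different set of pairs when several conflicts interact.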

In [16]:
from PyDI.entitymatching import GreedyOneToOneMatchingAlgorithm, StableMatching, MaximumBipartiteMatching

one_to_one_algorithm = GreedyOneToOneMatchingAlgorithm()
refined_correspondences = one_to_one_algorithm.cluster(correspondences)

# Create cluster size distribution from our matches
cluster_distribution = EntityMatchingEvaluator.create_cluster_size_distribution(
    correspondences=refined_correspondences,
    out_dir=OUTPUT_DIR / "cluster_analysis"
)

[INFO ] root - Filtered correspondences: 134 -> 134 (threshold=0.0)
[INFO ] root - Greedy matching: 134 -> 132 correspondences (264 entities matched)
[INFO ] root - GreedyOneToOneMatchingAlgorithm: 134 -> 132 correspondences
[INFO ] root - GreedyOneToOneMatchingAlgorithm: 266 -> 264 entities
[INFO ] root - Cluster Size Distribution of 132 clusters:
[INFO ] root - 	Cluster Size	| Frequency	| Percentage
[INFO ] root - 	──────────────────────────────────────────────────
[INFO ] root - 		2	|	132	|	100.00%
[INFO ] root - Cluster size distribution written to data\output\cluster_analysis\cluster_size_distribution.csv


### **Task 2**

1. Understand the results
    - Inspect the log files in data\output to see which errors were made
2. Try different combinations of comparators or weights in your matching rule
    - Can you improve the performance?
    - Can you improve the performance using global matching?

When inspecting the debug log, we can see that the year information in both datasets is not reliable and we are missing many actual matches due to a low overall similarity caused by the year comparator. Let's try giving more weight to the title comparator based on our inspection:

In [17]:
# Run matching with adjusted weights
correspondences, debug_info = matcher.match(
    df_left=df_academy,
    df_right=df_actors,
    candidates=blocker_standard,
    id_column='id',
    comparators=comparators,
    weights=[0.7, 0.3], # increase weight on title based on debug analysis
    threshold=threshold,
    debug=True
)

eval_results = EntityMatchingEvaluator.evaluate_matching(
    correspondences=correspondences,
    test_pairs=df_validation,
    out_dir=debug_output_dir,
    debug_info=debug_info, # add debug info (optional)
    matcher_instance=matcher # add matcher instance for context for debug files (optional)
)

[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Blocking 4580 x 149 elements
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocker - Creating candidate record pairs from 142 blocks
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Matching 4580 x 149 elements after 0:00:0.002; 396 blocked pairs (reduction ratio: 0.9994197122006975)
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocker - Creating candidate record pairs from 142 blocks
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Entity Matching finished after 0:00:0.160; found 154 correspondences.
[DEBUG] root - Individual correspondence evaluations:
[DEBUG] root - [wrong] academy_awards_2,actors_120,FALSE,sim:0.7000
[DEBUG] root - [correct] academy_awards_2187,actors_120,TRUE,sim:0.8500
[DEBUG] root - [correct] academy_awards_4529,actors_2,TRUE,sim:0.7000
[DEBUG] root - [correct] academy_awards_2732,actors_1
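The new false positive for `academy_awards_2` and `actors_120` (score 0.7000) can be explained with a short calculation. The component similarities below are an assumption, namely identical titles and dates outside the tolerance; this is one decomposition consistent with the logged score, not a value read from PyDI:

```python
def weighted_score(title_sim, date_sim, weights):
    """Weighted sum of the two comparator outputs (weights sum to 1)."""
    w_title, w_date = weights
    return w_title * title_sim + w_date * date_sim


threshold = 0.6
# Assumed component similarities: identical titles, dates far apart.
title_sim, date_sim = 1.0, 0.0

before = weighted_score(title_sim, date_sim, (0.5, 0.5))
after = weighted_score(title_sim, date_sim, (0.7, 0.3))
print(f"weights 0.5/0.5: {before:.2f} -> match: {before >= threshold}")
print(f"weights 0.7/0.3: {after:.2f} -> match: {after >= threshold}")
```

Under these assumptions the pair stays below the threshold with equal weights but passes it once the title weight reaches 0.7. Shifting weight to the title therefore recovers matches with unreliable dates, yet it also admits different films that share a title.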

In [18]:
# Create cluster size distribution from our matches
cluster_distribution = EntityMatchingEvaluator.create_cluster_size_distribution(
    correspondences=correspondences,
    out_dir=OUTPUT_DIR / "cluster_analysis"
)

[INFO ] root - Cluster Size Distribution of 144 clusters:
[INFO ] root - 	Cluster Size	| Frequency	| Percentage
[INFO ] root - 	──────────────────────────────────────────────────
[INFO ] root - 		2	|	135	|	93.75%
[INFO ] root - 		3	|	8	|	5.56%
[INFO ] root - 		4	|	1	|	0.69%
[INFO ] root - Cluster size distribution written to data\output\cluster_analysis\cluster_size_distribution.csv


In [19]:
refined_correspondences = one_to_one_algorithm.cluster(correspondences)

eval_results = EntityMatchingEvaluator.evaluate_matching(
    correspondences=refined_correspondences,
    test_pairs=df_validation,
    out_dir=debug_output_dir
)

[INFO ] root - Filtered correspondences: 154 -> 154 (threshold=0.0)
[INFO ] root - Greedy matching: 154 -> 144 correspondences (288 entities matched)
[INFO ] root - GreedyOneToOneMatchingAlgorithm: 154 -> 144 correspondences
[INFO ] root - GreedyOneToOneMatchingAlgorithm: 298 -> 288 entities
[DEBUG] root - Individual correspondence evaluations:
[DEBUG] root - [correct] academy_awards_503,actors_148,TRUE,sim:1.0000
[DEBUG] root - [correct] academy_awards_2187,actors_120,TRUE,sim:0.8500
[DEBUG] root - [correct] academy_awards_910,actors_68,TRUE,sim:0.8500
[DEBUG] root - [correct] academy_awards_723,actors_71,TRUE,sim:0.8500
[DEBUG] root - [correct] academy_awards_618,actors_73,TRUE,sim:0.8500
[DEBUG] root - [correct] academy_awards_608,actors_146,TRUE,sim:0.8500
[DEBUG] root - [correct] academy_awards_2732,actors_110,TRUE,sim:0.8500
[DEBUG] root - [correct] academy_awards_4140,actors_89,TRUE,sim:0.8500
[DEBUG] root - [correct] academy_awards_3300,actors_100,TRUE,sim:0.8500
[DEBUG] root -

In [20]:
# Create cluster size distribution from our matches
cluster_distribution = EntityMatchingEvaluator.create_cluster_size_distribution(
    correspondences=refined_correspondences,
    out_dir=OUTPUT_DIR / "cluster_analysis"
)

[INFO ] root - Cluster Size Distribution of 144 clusters:
[INFO ] root - 	Cluster Size	| Frequency	| Percentage
[INFO ] root - 	──────────────────────────────────────────────────
[INFO ] root - 		2	|	144	|	100.00%
[INFO ] root - Cluster size distribution written to data\output\cluster_analysis\cluster_size_distribution.csv


Once we have optimized all parameters on the validation set, we can verify the performance of our rule on our held-out test set:

In [31]:
eval_results = EntityMatchingEvaluator.evaluate_matching(
    correspondences=refined_correspondences,
    test_pairs=df_test,
    out_dir=debug_output_dir
)

[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  49
[INFO ] root -   True Negatives:  100
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 1
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.993
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    0.980
[INFO ] root -   F1-Score:  0.990
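The reported metrics follow directly from the confusion matrix. The short calculation below reproduces them from the four counts in the output above:

```python
# Confusion matrix counts copied from the evaluation output above.
tp, tn, fp, fn = 49, 100, 0, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (
    2 * precision * recall / (precision + recall)
    if (precision + recall)
    else 0.0
)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-Score:  {f1:.3f}")
```

Note that accuracy is dominated by the 100 true negatives; on imbalanced matching problems, precision, recall, and F1 are the more informative numbers.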


### **Task 3 (Optional)**

Check out the MLBasedMatcher in the [PyDI tutorial notebook](docs/tutorial/PyDI_Tutorial.ipynb) and apply scikit-learn models to try to achieve a better result by automatically learning the weights of the matching rule.

## 6. LLM-Based Matching

Large Language Models (LLMs) can perform entity resolution by:
- Understanding semantic similarity beyond string matching
- Handling variations in formatting, spelling, and abbreviations
- Reasoning about context (e.g., actor names, directors, dates)

We'll use **GPT-5-nano** via PyDI's `LLMBasedMatcher` to match the same candidate pairs and compare the results to our RuleBasedMatcher. The LLM-based matcher internally uses a pre-defined prompt for entity matching. You can supply your own prompt via the `system_prompt` parameter.

**Note**: This will make API calls and incur costs.
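To give an idea of what an LLM sees for each candidate pair, the sketch below serializes two records into a prompt. The wording and the helper names are hypothetical and chosen for illustration only; the actual pre-defined prompt of `LLMBasedMatcher` is not shown in this notebook and may look different:

```python
import json


def serialize_record(record, attributes):
    """Keep only the selected, non-missing attributes and render them as JSON."""
    kept = {a: record[a] for a in attributes if record.get(a) is not None}
    return json.dumps(kept, ensure_ascii=False)


def build_pair_prompt(left, right, attributes):
    # Hypothetical prompt wording, for illustration only.
    return (
        "Do the following two records refer to the same movie? "
        "Answer with TRUE or FALSE.\n"
        f"Record A: {serialize_record(left, attributes)}\n"
        f"Record B: {serialize_record(right, attributes)}"
    )


left_record = {
    "id": "academy_awards_2",
    "title": "True Grit",
    "date": "2010-01-01",
    "director_name": "Joel Coen and Ethan Coen",
}
right_record = {"id": "actors_1", "title": "7th Heaven", "date": "1929-01-01"}

prompt = build_pair_prompt(left_record, right_record, attributes=["title", "date"])
print(prompt)
```

Restricting the prompt to a few informative attributes keeps the number of tokens per pair small, which matters because every candidate pair results in one API call.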

In [None]:
# Let's set up our chat model
#from langchain_groq import ChatGroq
from langchain_openai import ChatOpenAI

from dotenv import load_dotenv
import os

load_dotenv()

# Check for OpenAI API key
api_key = os.getenv('OPENAI_API_KEY')
if api_key:
    print("OPENAI_API_KEY found in environment")
    print(f"   Key starts with: {api_key[:10]}...")
else:
    print("OPENAI_API_KEY not found in environment")

# set back to info logging to avoid spam from other libraries
logging.getLogger().setLevel(logging.INFO) 


# Initialize OpenAI chat model
chat_model = ChatOpenAI(
    model="gpt-5-nano",  
    max_tokens=500,        # Reasonable limit for structured output
    temperature=0.0,      # Deterministic output
    reasoning_effort="minimal",  
)

chat_model.invoke("How are you doing today?")

OPENAI_API_KEY found in environment
   Key starts with: sk-proj-Q4...


[INFO ] httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


AIMessage(content='I’m doing well, thanks! How about you? What can I help you with today?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 28, 'prompt_tokens': 12, 'total_tokens': 40, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-5-nano-2025-08-07', 'system_fingerprint': None, 'id': 'chatcmpl-CTPeACIY0yAQGFa18IPwundREjcv6', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='run--071caffd-fdb5-4e39-8acb-7fd9ea5e9640-0', usage_metadata={'input_tokens': 12, 'output_tokens': 28, 'total_tokens': 40, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})

In [22]:
from PyDI.entitymatching import LLMBasedMatcher

# Initialize LLM-based matcher
llm_matcher = LLMBasedMatcher()

# Define fields to include in LLM prompts
attributes_for_llm = ['title', 'date']

In [23]:
# Run LLM-based matching
# IMPORTANT: This cell will make API calls and incur costs
# Comment out if you don't want to run LLM matching

print("Running LLM-based matching...\n")
print("(This may take several minutes depending on the number of candidates)\n")

correspondences_llm = llm_matcher.match(
    df_left=df_academy,
    df_right=df_actors,
    candidates=blocker_standard,
    chat_model=chat_model,
    fields=attributes_for_llm,
    id_column='id',
    out_dir= OUTPUT_DIR / "llm",
    debug=True
)

[INFO ] root - Entity matching: academy_awards (4580 records) <-> actors (149 records)
[INFO ] root - Processing 1 candidate batches


Running LLM-based matching...

(This may take several minutes depending on the number of candidates)



[INFO ] httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[INFO ] PyDI.utils.llm - LLM call: {"timestamp": "2025-10-22T09:42:28.325169Z", "row_index": 0, "attempt": 0, "provider_class": "ChatOpenAI", "model": "gpt-5-nano", "duration_ms": 2519.7436809539795, "temperature": null, "max_tokens": null, "usage": {"input_tokens": 212, "output_tokens": 47, "total_tokens": 259, "input_token_details": {"audio": 0, "cache_read": 0}, "output_token_details": {"audio": 0, "reasoning": 0}}, "request_messages": [{"type": "system", "content": "You are an expert entity resolver. Your task is to decide if two records refer to the same real-world entity.\n\nAnalyze the provided records carefully and return your decision as strict JSON in this format:\n{\"match\": true|false, \"score\": <float between 0.0 and 1.0>, \"explanation\": \"<brief explanation>\"}\n\nGuidelines:\n- score should reflect your confidence (1.0 = definitely same entity, 0.0 = definitely different)\

In [24]:
# Evaluate LLM-based matching results

eval_results = EntityMatchingEvaluator.evaluate_matching(
    correspondences=correspondences_llm,
    test_pairs=df_validation,
    out_dir= OUTPUT_DIR / "llm"
)

[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  26
[INFO ] root -   True Negatives:  100
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 21
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.857
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    0.553
[INFO ] root -   F1-Score:  0.712


When inspecting the log files in data/output/llm, we can see that the LLM often decides on a non-match because of differences in release year, the same issue that already affected our earlier RuleBasedMatcher.
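
One way to quantify this issue is to look at the year gap between the two sides of known matching pairs. The sketch below uses only the standard library and a handful of toy date pairs (hypothetical values, not read from the notebook's data); `year_gap` is a helper introduced here for illustration.

```python
from collections import Counter
from datetime import date


def year_gap(date_left: str, date_right: str) -> int:
    """Absolute difference in release year between two ISO-formatted dates."""
    return abs(date.fromisoformat(date_left).year - date.fromisoformat(date_right).year)


# Toy (left date, right date) pairs standing in for known matches
matching_pairs = [
    ("1927-01-01", "1929-01-01"),
    ("1934-01-01", "1935-01-01"),
    ("1988-01-01", "1988-01-01"),
]

gaps = [year_gap(left, right) for left, right in matching_pairs]
gap_distribution = Counter(gaps)

# Show how many matching pairs fall into each year gap
for gap, count in sorted(gap_distribution.items()):
    print(f"{gap} year(s) apart: {count} pair(s)")
```

If most true matches differ by only one or two years, a strict date comparison is likely to hurt recall, which motivates the options discussed next.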

We could either

- remove the date attribute
- make the LLM aware of problems with the date attribute by passing a custom prompt that overrides PyDI's standard prompt via the `system_prompt` parameter
- provide few-shot examples from which the LLM can learn this data quality issue
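
The first two options could be sketched as follows. The prompt text is a hypothetical example: its opening mirrors the part of PyDI's standard prompt that is visible in the call logs, and the date guidelines are an addition made here. The commented-out call only illustrates an assumed usage of `system_prompt`; check the PyDI documentation for where the parameter is accepted.

```python
# Hypothetical replacement prompt with explicit guidance on date discrepancies
CUSTOM_SYSTEM_PROMPT = (
    "You are an expert entity resolver. Your task is to decide if two records "
    "refer to the same real-world entity.\n\n"
    "Return your decision as strict JSON in this format:\n"
    '{"match": true|false, "score": <float between 0.0 and 1.0>, '
    '"explanation": "<brief explanation>"}\n\n'
    "Guidelines:\n"
    "- score should reflect your confidence (1.0 = definitely same entity, "
    "0.0 = definitely different)\n"
    "- release dates in these datasets can differ by one or two years for the "
    "same movie; do not reject a pair for a small date gap alone\n"
    "- a gap of many years between identical titles usually indicates a remake\n"
)

# Alternative option: drop the unreliable attribute instead of explaining it
attributes_without_date = ["title"]

# Assumed call shape (not executed here):
# correspondences_llm_prompt = llm_matcher.match(
#     ...,
#     fields=attributes_for_llm,
#     system_prompt=CUSTOM_SYSTEM_PROMPT,
# )
print(CUSTOM_SYSTEM_PROMPT)
```

Dropping the date entirely is the simplest change, but it also removes the only signal that separates a remake from the original film, so a prompt-based or few-shot approach keeps more information available to the model.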

Let's try using few-shot prompting to show this functionality. The following examples are selected from our training set saved in **df_train**. They mimic the exact output format that is expected as defined in PyDI's standard prompt.

In [27]:
few_shot_examples = [
    # POSITIVE 1: Same movie "7th Heaven" but 2 years difference
    (
        {"title": "7th Heaven", "date": "1927-01-01"},
        {"title": "7th Heaven", "date": "1929-01-01"},
        {"match": "true", "score": 0.95, "explanation": "Same movie with identical title despite 2-year date discrepancy"}
    ),
    
    # POSITIVE 2: Same movie "It Happened One Night" but 1 year difference
    (
        {"title": "It Happened One Night", "date": "1934-01-01"},
        {"title": "It Happened One Night", "date": "1935-01-01"},
        {"match": "true", "score": 0.95, "explanation": "Same movie with identical title despite 1-year date difference"}
    ),
    
    # NEGATIVE 1: "Goodbye, Mr. Chips" (1969) vs "Goodbye, Mr. Chips" (1940) - identical titles but different films (remake)
    (
        {"title": "Goodbye, Mr. Chips", "date": "1969-01-01"},
        {"title": "Goodbye, Mr. Chips", "date": "1940-01-01"},
        {"match": "false", "score": 0.1, "explanation": "Different movie versions despite identical title - 29-year gap indicates remake"}
    ),
    
    # NEGATIVE 2: "Back Street" vs "Wall Street" - both titles end in "Street" but are different movies
    (
        {"title": "Back Street", "date": "1961-01-01"},
        {"title": "Wall Street", "date": "1988-01-01"},
        {"match": "false", "score": 0.2, "explanation": "Different movies despite both having Street - Back Street vs Wall Street"}
    )
]
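
Since malformed examples would only surface once API calls are already being made, it can help to check their structure first. The following sketch is a hypothetical helper, not part of PyDI; it assumes the `(left_record, right_record, expected_output)` layout used above and the three output keys named in the standard prompt.

```python
REQUIRED_OUTPUT_KEYS = {"match", "score", "explanation"}


def validate_few_shots(examples) -> int:
    """Raise ValueError if an example deviates from the (left, right, output) layout."""
    for index, example in enumerate(examples):
        if len(example) != 3:
            raise ValueError(f"example {index}: expected (left, right, output)")
        left, right, output = example
        # Both records should expose the same attributes as the matcher's fields
        if set(left) != set(right):
            raise ValueError(f"example {index}: left/right use different fields")
        missing = REQUIRED_OUTPUT_KEYS - set(output)
        if missing:
            raise ValueError(f"example {index}: output lacks {sorted(missing)}")
        if not 0.0 <= float(output["score"]) <= 1.0:
            raise ValueError(f"example {index}: score outside [0, 1]")
    return len(examples)


# A single example in the same layout as the list above
sample_examples = [
    (
        {"title": "7th Heaven", "date": "1927-01-01"},
        {"title": "7th Heaven", "date": "1929-01-01"},
        {"match": "true", "score": 0.95, "explanation": "Same title, small date gap"},
    )
]
print(validate_few_shots(sample_examples))
```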

In [28]:
# Run LLM-based matching with few-shot examples
print("Running LLM-based matching with few-shot examples...\n")
print("(This may take several minutes depending on the number of candidates)\n")

correspondences_llm_fewshot = llm_matcher.match(
    df_left=df_academy,
    df_right=df_actors,
    candidates=blocker_standard,
    chat_model=chat_model,
    fields=attributes_for_llm,
    id_column='id',
    few_shots=few_shot_examples,
    out_dir=OUTPUT_DIR / "llm_fewshot",
    debug=True
)


# Evaluate the few-shot results on the validation set
eval_results_fewshot = EntityMatchingEvaluator.evaluate_matching(
    correspondences=correspondences_llm_fewshot,
    test_pairs=df_validation,
    out_dir=OUTPUT_DIR / "llm_fewshot"
)

[INFO ] root - Entity matching: academy_awards (4580 records) <-> actors (149 records)
[INFO ] root - Processing 1 candidate batches


Running LLM-based matching with few-shot examples...

(This may take several minutes depending on the number of candidates)



[INFO ] httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[INFO ] PyDI.utils.llm - LLM call: {"timestamp": "2025-10-22T09:57:34.280107Z", "row_index": 0, "attempt": 0, "provider_class": "ChatOpenAI", "model": "gpt-5-nano", "duration_ms": 1649.5249271392822, "temperature": null, "max_tokens": null, "usage": {"input_tokens": 563, "output_tokens": 49, "total_tokens": 612, "input_token_details": {"audio": 0, "cache_read": 0}, "output_token_details": {"audio": 0, "reasoning": 0}}, "request_messages": [{"type": "system", "content": "You are an expert entity resolver. Your task is to decide if two records refer to the same real-world entity.\n\nAnalyze the provided records carefully and return your decision as strict JSON in this format:\n{\"match\": true|false, \"score\": <float between 0.0 and 1.0>, \"explanation\": \"<brief explanation>\"}\n\nGuidelines:\n- score should reflect your confidence (1.0 = definitely same entity, 0.0 = definitely different)\

Perfect! We can see a strong improvement when using few-shot examples, even giving us the best result yet! Let's verify on the test set:

In [32]:
eval_results = EntityMatchingEvaluator.evaluate_matching(
    correspondences=correspondences_llm_fewshot,
    test_pairs=df_test,
    out_dir=debug_output_dir
)

[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  46
[INFO ] root -   True Negatives:  100
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 4
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.973
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    0.920
[INFO ] root -   F1-Score:  0.958


Now let's save our generated correspondences so we can use them later in data fusion. You can either save them as pickle files or use the standard pandas `to_csv` method.

In [29]:
correspondences_output_dir = OUTPUT_DIR / "correspondences"
correspondences_output_dir.mkdir(parents=True, exist_ok=True)

correspondences_llm_fewshot.to_pickle(correspondences_output_dir / "correspondences_llm_fewshot.pkl")

correspondences_llm_fewshot.to_csv(correspondences_output_dir / "correspondences_llm_fewshot.csv", index=False)
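
To reuse the saved files in a later notebook, they can be read back with pandas. The sketch below demonstrates the round trip on a toy DataFrame written to a temporary directory; the column names `id1`, `id2` and `score` and the record IDs are assumptions made for illustration and may differ from the columns PyDI actually produces.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy correspondences with hypothetical column names and IDs
toy_correspondences = pd.DataFrame(
    {"id1": ["academy_awards_1"], "id2": ["actors_1"], "score": [0.95]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = Path(tmp_dir) / "correspondences"
    out_dir.mkdir(parents=True, exist_ok=True)

    # Save in both formats
    toy_correspondences.to_pickle(out_dir / "toy.pkl")
    toy_correspondences.to_csv(out_dir / "toy.csv", index=False)

    # Load them back
    from_pickle = pd.read_pickle(out_dir / "toy.pkl")
    from_csv = pd.read_csv(out_dir / "toy.csv")

print(from_pickle.equals(toy_correspondences))
print(list(from_csv.columns))
```

Pickle preserves the exact dtypes of the DataFrame, while CSV is easier to inspect by hand and to share with other tools. Only unpickle files that you created yourself.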

That's it for identity resolution! Now go ahead, create your evaluation sets and match your datasets!

Keep in mind that you have to perform identity resolution at least twice:

If you have datasets A, B, and C, you need to create correspondences between at least two pairs of datasets, e.g. A-B and B-C. Think about which dataset should serve as the "connector" for your use case. The largest dataset is often the best choice.
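
To illustrate why a connector dataset is enough, the sketch below derives A-C correspondences by joining A-B and B-C correspondences on the shared B identifier. The DataFrames, column names and IDs are toy values invented for this example.

```python
import pandas as pd

# Toy correspondences; B acts as the connector dataset
corr_ab = pd.DataFrame({"id_a": ["a1", "a2"], "id_b": ["b1", "b2"]})
corr_bc = pd.DataFrame({"id_b": ["b1", "b3"], "id_c": ["c1", "c3"]})

# Records of A and C that match the same record in B are linked transitively
corr_ac = corr_ab.merge(corr_bc, on="id_b", how="inner")[["id_a", "id_c"]]
print(corr_ac)
```

Only `a1` and `c1` end up linked here, because `b1` is the sole B record that appears in both correspondence sets. A connector with broad coverage therefore links more records across the other datasets.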