Best Open-Source Entity Resolution Tools in 2026
Published February 2026 · 12 min read
If you need to match records across datasets — deduplication, record linkage, identity resolution — you have more open-source options in 2026 than ever before. But the tools differ significantly in approach, maturity, and what they actually produce.
We evaluated the five most actively used tools in this space: Kanoniv, Splink, Zingg, Dedupe, and Senzing (community edition; free to use, but the engine is proprietary). This post compares them on the dimensions that matter for production use: matching approach, golden record support, scalability, ease of use, and licensing.
The Quick Comparison
| Tool | Approach | Golden Records | Scale | License | Active? |
|---|---|---|---|---|---|
| Kanoniv | Declarative rules (YAML) | Yes | 100K+ local, unlimited cloud | MIT | Yes |
| Splink | Probabilistic (Fellegi-Sunter) | No | 100M+ (Spark/DuckDB) | MIT | Yes |
| Zingg | ML + active learning | Enterprise only | Large (Spark) | AGPL-3.0 | Yes |
| Dedupe | ML + active learning | No | Small-medium | MIT | Inactive since Aug 2024 |
| Senzing | Principle-based AI | No | Billions (claimed) | Proprietary (free tier) | Yes |
1. Kanoniv
Best for: Teams that want golden records, local development, and declarative configuration.
Kanoniv takes a different approach from the other tools on this list. Instead of training a model or configuring a probabilistic framework, you write a YAML spec that declares your matching rules, survivorship strategy, and decision thresholds.
entity:
  name: customer
sources:
  - name: crm
    adapter: csv
    location: contacts.csv
    primary_key: id
  - name: billing
    adapter: csv
    location: stripe.csv
    primary_key: id
rules:
  - name: email_exact
    type: exact
    field: email
    weight: 1.0
  - name: name_fuzzy
    type: jaro_winkler
    field: name
    threshold: 0.9
    weight: 0.8
survivorship:
  strategy: source_priority
  priority: [crm, billing]
decision:
  thresholds:
    match: 0.85

from kanoniv import Spec, Source, reconcile, validate
spec = Spec.from_file("customer-spec.yaml")
validate(spec).raise_on_error()
sources = [
    Source.from_csv("crm", "contacts.csv"),
    Source.from_csv("billing", "stripe.csv"),
]
result = reconcile(sources, spec)
print(f"Golden records: {len(result.golden_records)}")

What makes it different:
- Golden records out of the box. Every other tool on this list stops at matching in its free or open-source edition (Zingg offers golden records only in Enterprise). Kanoniv includes survivorship: the logic that chooses which field values survive into the canonical record.
- Offline-first. The Python SDK runs entirely on your machine. No API keys, no accounts, no data leaves your environment.
- Spec-as-code. The YAML spec is version-controlled, diff-able, and serves as living documentation of your matching logic.
- Rust engine. The reconciliation engine is written in Rust and compiled into a native Python extension via PyO3, so the matching loop runs as compiled code rather than interpreted Python.
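To make survivorship concrete, here is a minimal sketch of a source-priority strategy in plain Python. It is not Kanoniv's implementation: the helper name `merge_by_source_priority`, the record layout, and the sample values are all assumptions made for illustration.

```python
from typing import Optional


def merge_by_source_priority(
    records: list[dict[str, Optional[str]]],
    priority: list[str],
) -> dict[str, Optional[str]]:
    """Build one golden record from a cluster of matched records.

    Hypothetical helper: for each field, keep the first non-empty value
    found when sources are visited in priority order.
    """
    rank = {source: i for i, source in enumerate(priority)}
    # Sources missing from the priority list sort last.
    ordered = sorted(records, key=lambda r: rank.get(r["source"], len(priority)))

    golden: dict[str, Optional[str]] = {}
    for record in ordered:
        for field, value in record.items():
            if field == "source":
                continue
            if value and not golden.get(field):
                # Higher-priority sources were visited first, so this only
                # fills fields they left empty.
                golden[field] = value
            else:
                golden.setdefault(field, None)
    return golden


# Two records already judged to be the same customer (made-up data).
matched = [
    {"source": "billing", "name": "Bob Smith", "email": "bob@example.com", "phone": "555-0100"},
    {"source": "crm", "name": "Robert Smith", "email": "bob@example.com", "phone": None},
]
golden = merge_by_source_priority(matched, priority=["crm", "billing"])
print(golden)
```

In this sketch the CRM name wins because `crm` comes first in the priority list, while the phone number falls back to billing because the CRM value is empty.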
Limitations:
- No ML-based matching (by design — Kanoniv favors explicit rules over learned models)
- Newer project, smaller community than Splink or Dedupe
- Local SDK handles 100K+ records well; larger datasets need Kanoniv Cloud
License: Free SDK (local use), paid Cloud platform
2. Splink
Best for: Data scientists who want probabilistic matching at scale with DuckDB or Spark.
Splink is the most technically rigorous open-source matching tool. Built by the UK Ministry of Justice, it implements the Fellegi-Sunter probabilistic model with proper m/u probability estimation via the EM algorithm.
# Splink 3.x API (Splink 4 reorganized these imports and the Linker constructor)
import splink.duckdb.comparison_library as cl
from splink.duckdb.linker import DuckDBLinker
settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        cl.exact_match("email"),
        cl.jaro_winkler_at_thresholds("first_name", [0.9, 0.7]),
        cl.levenshtein_at_thresholds("surname", [1, 2]),
        cl.exact_match("city"),
    ],
    "blocking_rules_to_generate_predictions": [
        "l.email = r.email",
        "l.surname = r.surname and l.city = r.city",
    ],
}
# df: a pandas DataFrame of input records with a unique_id column
linker = DuckDBLinker(df, settings)
# Estimate u from random pairs, then m via EM within an email block
linker.estimate_u_using_random_sampling(max_pairs=1e6)
linker.estimate_parameters_using_expectation_maximisation("l.email = r.email")
pairwise = linker.predict(threshold_match_probability=0.9)
clusters = linker.cluster_pairwise_predictions_at_threshold(pairwise, 0.95)

Strengths:
- Proper probabilistic model with calibrated match probabilities
- Scales to 100M+ records on Spark or DuckDB
- Excellent visualization tools (waterfall charts, comparison viewers)
- Well-documented with detailed tutorials
- Active development (1,900+ GitHub stars)
Limitations:
- No golden records or survivorship — output is match clusters only
- Steeper learning curve (must understand m/u probabilities, EM, blocking)
- No real-time matching API
- Requires understanding of probabilistic theory to tune effectively
License: MIT
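If the m/u terminology is new, a small worked example helps. The sketch below applies the standard Fellegi-Sunter arithmetic by hand: each field contributes a match weight of log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees, and the weights are added to the prior log-odds. The m, u, and prior values are invented for illustration, and the function names are ours, not Splink's.

```python
import math

# Illustrative parameters (assumptions, not estimates from real data).
# m = P(field agrees | the two records are a true match)
# u = P(field agrees | the two records are not a match)
PARAMS = {
    "email": {"m": 0.95, "u": 0.001},
    "surname": {"m": 0.90, "u": 0.01},
    "city": {"m": 0.85, "u": 0.10},
}


def match_weight(field: str, agrees: bool) -> float:
    """Log2 Bayes factor contributed by one field comparison."""
    m, u = PARAMS[field]["m"], PARAMS[field]["u"]
    ratio = m / u if agrees else (1 - m) / (1 - u)
    return math.log2(ratio)


def match_probability(agreement: dict[str, bool], prior: float) -> float:
    """Combine the prior with per-field weights, assuming independent fields."""
    prior_weight = math.log2(prior / (1 - prior))
    total = prior_weight + sum(match_weight(f, a) for f, a in agreement.items())
    odds = 2 ** total
    return odds / (1 + odds)


# Email and surname agree, city disagrees; 1 in 1,000 random pairs is a match.
p = match_probability({"email": True, "surname": True, "city": False}, prior=0.001)
print(f"match probability: {p:.3f}")
```

Agreement on a rare value such as an email address carries far more weight than agreement on city, which is why the pair still scores high despite the city mismatch.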
3. Zingg
Best for: Teams with Spark infrastructure who want ML-based matching with minimal labeling.
Zingg uses active learning — it picks the most informative record pairs for you to label, then trains a classifier. You label 30-50 pairs and get a trained model.
# Define the field types in a JSON config
# Find candidate pairs to label, then label them interactively
zingg --phase findTrainingData --conf zingg.conf
zingg --phase label --conf zingg.conf
# After labeling ~40 pairs, train the model
zingg --phase train --conf zingg.conf
# Run matching
zingg --phase match --conf zingg.conf

Strengths:
- Active learning reduces labeling effort to 30-50 pairs
- Handles messy data where rules are hard to articulate
- Runs on Spark for large-scale processing
- Good for cases where you can't express matching logic as rules
Limitations:
- AGPL-3.0 license — restrictive for commercial SaaS use (must open-source your code or buy a commercial license)
- Requires Spark infrastructure
- No golden records in the open-source version (Enterprise only)
- Less predictable than rule-based approaches — model behavior can surprise you
License: AGPL-3.0 (commercial license available)
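The idea of "most informative pairs" can be sketched with uncertainty sampling, one common active-learning strategy: ask the human about the pairs the current model is least sure of. This is a generic illustration, not Zingg's actual selection algorithm; the function name and the scores are hypothetical.

```python
def most_informative_pairs(
    scores: dict[tuple[str, str], float], k: int
) -> list[tuple[str, str]]:
    """Return the k pairs whose predicted match probability is closest to 0.5."""
    return sorted(scores, key=lambda pair: abs(scores[pair] - 0.5))[:k]


# Hypothetical model scores for unlabeled candidate pairs.
scores = {
    ("r1", "r2"): 0.98,  # confident match: labeling adds little
    ("r3", "r4"): 0.52,  # near the decision boundary
    ("r5", "r6"): 0.03,  # confident non-match
    ("r7", "r8"): 0.41,  # also uncertain
}
to_label = most_informative_pairs(scores, k=2)
print(to_label)
```

Labeling the borderline pairs first is what lets a few dozen labels move the decision boundary more than hundreds of randomly chosen ones would.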
4. Dedupe
Best for: Python developers who want a simple, well-documented matching library.
Dedupe was the original Python entity resolution library. It uses active learning with a clean API that feels natural to Python developers.
import dedupe
# Dict-style field definitions (dedupe 2.x); dedupe 3.x replaced these with dedupe.variables objects
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
    {"field": "city", "type": "ShortString"},
    {"field": "phone", "type": "String", "has missing": True},
]
# data: dict mapping each record ID to a dict of field values
deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)
dedupe.console_label(deduper)  # Interactive labeling
deduper.train()
clusters = deduper.partition(data, threshold=0.5)

Strengths:
- Clean, Pythonic API
- Good documentation and tutorials
- Active learning with interactive console labeling
- Well-understood by the data science community (4,400+ GitHub stars)
- MIT licensed
Limitations:
- Effectively unmaintained — last release August 2024, no commits since
- No golden records or survivorship
- Doesn't scale well beyond a few hundred thousand records
- Single-machine only (no distributed computing)
- No real-time matching
License: MIT
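A back-of-the-envelope calculation shows why any pairwise matcher runs into a ceiling on a single machine, and why blocking matters. The sketch below counts all possible pairs for n records, then shows how grouping on a blocking key shrinks the candidate set. The helper names and sample records are assumptions for illustration; this is not Dedupe's internal blocking logic.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterator


def all_pairs_count(n: int) -> int:
    """Number of unordered record pairs among n records: n choose 2."""
    return n * (n - 1) // 2


def blocked_pairs(records: list[dict], key: str) -> Iterator[tuple[int, int]]:
    """Yield only pairs of record IDs that share the same blocking key value."""
    blocks: dict[str, list[int]] = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record["id"])
    for ids in blocks.values():
        yield from combinations(ids, 2)


records = [
    {"id": 1, "city": "Leeds"},
    {"id": 2, "city": "Leeds"},
    {"id": 3, "city": "York"},
    {"id": 4, "city": "York"},
    {"id": 5, "city": "Bath"},
]
candidate_pairs = list(blocked_pairs(records, "city"))

print(f"All pairs for 5 records: {all_pairs_count(5)}")
print(f"Pairs after blocking on city: {len(candidate_pairs)}")
print(f"All pairs for 300,000 records: {all_pairs_count(300_000):,}")
```

The trade-off is recall: a true match whose records disagree on the blocking key is never compared, which is why tools typically combine several blocking rules.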
5. Senzing (Community Edition)
Best for: Developers who want zero-configuration matching and can accept a proprietary engine.
Senzing takes a unique approach — it uses "principle-based AI" that requires zero configuration. You load data, and it matches. No rules, no training, no thresholds to set.
from senzing import G2Engine
engine = G2Engine()
# config: engine configuration as a JSON string
engine.init("my_app", config)
# Add records; Senzing matches automatically
engine.addRecord("CRM", "1001", '{"NAME": "Robert Smith", "EMAIL": "[email protected]"}')
engine.addRecord("BILLING", "42", '{"NAME": "Bob Smith", "EMAIL": "[email protected]"}')
# Query: are these the same entity?
# The v3 Python SDK writes the JSON result into a bytearray you pass in
response = bytearray()
engine.getEntityByRecordID("CRM", "1001", response)

Strengths:
- Zero configuration — genuinely works out of the box for many use cases
- Real-time matching (add a record, get instant resolution)
- Handles diverse data types without field-specific tuning
- Claims billions of records on large clusters
Limitations:
- Not truly open source — the engine is proprietary; the community edition is free but closed-source
- No golden records or survivorship
- Self-hosted only (no managed cloud offering)
- Commercial license starts at ~$37K/year for production
- Black box — you can't inspect or modify the matching logic
License: Proprietary (free community tier, commercial production license)
Feature Matrix
| Feature | Kanoniv | Splink | Zingg | Dedupe | Senzing |
|---|---|---|---|---|---|
| Deterministic rules | Yes | Limited | No | No | Built-in |
| Fuzzy matching | Yes | Yes | Yes (ML) | Yes (ML) | Built-in |
| Probabilistic model | No | Yes (Fellegi-Sunter) | No | No | Proprietary |
| Golden records | Yes | No | Enterprise only | No | No |
| Survivorship | Yes | No | Enterprise only | No | No |
| Real-time API | Yes (Cloud) | No | No | No | Yes |
| Local development | Yes | Yes | Yes (Spark) | Yes | Yes |
| Distributed computing | Cloud | Spark (DuckDB is single-node) | Spark | No | Multi-node |
| Active learning | No | No | Yes | Yes | No |
| Config format | YAML | Python dict | JSON | Python API | Auto |
| Visualization | No | Yes (excellent) | No | No | No |
Decision Framework
Choose Kanoniv if:
- You need golden records with survivorship out of the box
- You want to develop and test locally before deploying to production
- You prefer declarative YAML configuration over writing code
- You want a free SDK with no vendor lock-in for local matching
- Your team values explainability and auditability over ML-based matching
Choose Splink if:
- You have a data science team comfortable with probabilistic models
- You need to match 10M+ records and have Spark/DuckDB infrastructure
- You want calibrated match probabilities (not just scores)
- You need excellent visualization tools for model debugging
- You don't need golden records (you'll build survivorship yourself)
Choose Zingg if:
- Your data is too messy to express matching rules
- You have Spark infrastructure
- AGPL licensing is acceptable (or you'll buy a commercial license)
- You want ML-based matching with minimal labeling effort
Choose Dedupe if:
- You have a small dataset (< 100K records)
- You want a simple Python library with interactive labeling
- You're comfortable with a library that's no longer actively maintained
- You're prototyping and need something quick
Choose Senzing if:
- You want zero-configuration matching
- You can accept a proprietary engine
- You need real-time record-by-record resolution
- Budget allows for commercial licensing ($37K+/year)
The Trend: Declarative Over Learned
The entity resolution landscape is shifting. Early tools (Dedupe, Zingg) bet on ML — train a model, let it learn matching patterns. Newer tools (Kanoniv, Splink) favor explicit configuration — declare your rules, understand exactly what matches and why.
This mirrors a broader trend in data engineering: teams want deterministic, version-controlled, auditable pipelines. A YAML spec or Python configuration is reviewable in a PR, testable in CI, and explainable to compliance. A trained model is much harder to review, diff, or explain in the same way.
The best approach for most teams in 2026: start with deterministic rules on strong identifiers, add fuzzy matching for names and addresses, and reserve ML for the subset of data where rules genuinely can't capture the pattern.
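That layering can be sketched in a few lines of standard-library Python. This is an illustration of the tiering idea only: `classify_pair` is a hypothetical helper, the 0.9 threshold is arbitrary, and `difflib.SequenceMatcher` stands in for a purpose-built similarity measure such as Jaro-Winkler.

```python
from difflib import SequenceMatcher
from typing import Optional


def normalize(value: Optional[str]) -> str:
    """Lowercase and collapse whitespace; missing values become ''."""
    return " ".join(value.lower().split()) if value else ""


def classify_pair(a: dict, b: dict, name_threshold: float = 0.9) -> str:
    # Tier 1: deterministic rule on a strong identifier.
    email_a, email_b = normalize(a.get("email")), normalize(b.get("email"))
    if email_a and email_a == email_b:
        return "match:email_exact"

    # Tier 2: fuzzy comparison on names, skipped when either name is missing.
    name_a, name_b = normalize(a.get("name")), normalize(b.get("name"))
    if name_a and name_b:
        similarity = SequenceMatcher(None, name_a, name_b).ratio()
        if similarity >= name_threshold:
            return "match:name_fuzzy"

    # Tier 3: leave the pair for a learned model or manual review.
    return "undecided"


print(classify_pair(
    {"name": "Robert Smith", "email": "Bob@Example.com"},
    {"name": "Bob Smith", "email": "bob@example.com "},
))
print(classify_pair({"name": "Jon Smith"}, {"name": "John Smith"}))
print(classify_pair({"name": "Alice Jones"}, {"name": "Bob Smith"}))
```

Returning the name of the rule that fired, rather than a bare boolean, keeps every match decision traceable to a specific line of configuration.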
