Best Open-Source Entity Resolution Tools in 2026

Published February 2026 · 12 min read

If you need to match records across datasets — deduplication, record linkage, identity resolution — you have more open-source options in 2026 than ever before. But the tools differ significantly in approach, maturity, and what they actually produce.

We evaluated five widely used entity resolution tools: Kanoniv, Splink, Zingg, Dedupe, and Senzing (community edition). Senzing's engine is proprietary with a free tier; the rest are open source. This post compares them on the dimensions that matter for production use: matching approach, golden record support, scalability, ease of use, and licensing.

The Quick Comparison

Tool	Approach	Golden Records	Scale	License	Active?
Kanoniv	Declarative rules (YAML)	Yes	100K+ local, unlimited cloud	MIT	Yes
Splink	Probabilistic (Fellegi-Sunter)	No	100M+ (Spark/DuckDB)	MIT	Yes
Zingg	ML + active learning	Enterprise only	Large (Spark)	AGPL-3.0	Yes
Dedupe	ML + active learning	No	Small-medium	MIT	Inactive since Aug 2024
Senzing	Principle-based AI	No	Billions (claimed)	Proprietary (free tier)	Yes

1. Kanoniv

Best for: Teams that want golden records, local development, and declarative configuration.

Kanoniv takes a different approach from the other tools on this list. Instead of training a model or configuring a probabilistic framework, you write a YAML spec that declares your matching rules, survivorship strategy, and decision thresholds.

yaml

entity:
  name: customer
sources:
  - name: crm
    adapter: csv
    location: contacts.csv
    primary_key: id
  - name: billing
    adapter: csv
    location: stripe.csv
    primary_key: id
rules:
  - name: email_exact
    type: exact
    field: email
    weight: 1.0
  - name: name_fuzzy
    type: jaro_winkler
    field: name
    threshold: 0.9
    weight: 0.8
survivorship:
  strategy: source_priority
  priority: [crm, billing]
decision:
  thresholds:
    match: 0.85

python

from kanoniv import Spec, Source, reconcile, validate

spec = Spec.from_file("customer-spec.yaml")
validate(spec).raise_on_error()

sources = [
    Source.from_csv("crm", "contacts.csv"),
    Source.from_csv("billing", "stripe.csv"),
]
result = reconcile(sources, spec)
print(f"Golden records: {len(result.golden_records)}")

What makes it different:

Golden records out of the box. The other open-source tools on this list stop at matching (Zingg offers golden records only in its Enterprise edition). Kanoniv includes survivorship: the logic that chooses which field values survive into the canonical record.
Offline-first. The Python SDK runs entirely on your machine. No API keys, no accounts, no data leaves your environment.
Spec-as-code. The YAML spec is version-controlled, diff-able, and serves as living documentation of your matching logic.
Rust engine. The reconciliation engine is written in Rust and compiled into a native Python extension via PyO3, so matching runs as native code rather than in the Python interpreter.
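To make the survivorship idea concrete, here is a minimal sketch of a source-priority strategy in plain Python. It does not use Kanoniv's engine or API; the function name `build_golden_record` and the sample records are hypothetical, and the rule "first non-empty value from the highest-priority source wins" is an assumption about what such a strategy typically does.

```python
from typing import Any


def build_golden_record(
    records: dict[str, dict[str, Any]],
    priority: list[str],
) -> dict[str, Any]:
    """For each field, keep the first non-empty value in source-priority order."""
    golden: dict[str, Any] = {}
    fields = {field for record in records.values() for field in record}
    for field in sorted(fields):
        for source in priority:
            value = records.get(source, {}).get(field)
            if value not in (None, ""):
                golden[field] = value
                break  # a higher-priority source already supplied this field
    return golden


# Two records already judged to be the same customer (made-up sample data).
matched = {
    "crm": {"name": "Robert Smith", "email": "", "phone": "555-0100"},
    "billing": {"name": "Bob Smith", "email": "bob@example.com", "phone": None},
}

golden = build_golden_record(matched, priority=["crm", "billing"])
print(golden)
```

Here the CRM name and phone survive, while the email falls through to billing because the CRM value is empty.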

Limitations:

No ML-based matching (by design — Kanoniv favors explicit rules over learned models)
Newer project, smaller community than Splink or Dedupe
Local SDK handles 100K+ records well; larger datasets need Kanoniv Cloud

License: MIT (SDK, free for local use); paid Cloud platform

2. Splink

Best for: Data scientists who want probabilistic matching at scale with DuckDB or Spark.

Splink is the most technically rigorous open-source matching tool. Built by the UK Ministry of Justice, it implements the Fellegi-Sunter probabilistic model with proper m/u probability estimation via the EM algorithm.

python

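import pandas as pd
import splink.duckdb.comparison_library as cl
from splink.duckdb.linker import DuckDBLinker

# This uses the Splink 3.x API; Splink 4 reorganized these imports and methods.
# The input needs a unique_id column plus the compared fields.
# The file name is a placeholder.
df = pd.read_csv("customers.csv")

settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        cl.exact_match("email"),
        cl.jaro_winkler_at_thresholds("first_name", [0.9, 0.7]),
        cl.levenshtein_at_thresholds("surname", [1, 2]),
        cl.exact_match("city"),
    ],
    "blocking_rules_to_generate_predictions": [
        "l.email = r.email",
        "l.surname = r.surname and l.city = r.city",
    ],
}

linker = DuckDBLinker(df, settings)
linker.estimate_u_using_random_sampling(max_pairs=1e6)
# An EM session cannot estimate m values for the columns in its own blocking rule,
# so run two sessions with different rules to cover every comparison.
linker.estimate_parameters_using_expectation_maximisation("l.email = r.email")
linker.estimate_parameters_using_expectation_maximisation("l.surname = r.surname")
pairwise = linker.predict(threshold_match_probability=0.9)
clusters = linker.cluster_pairwise_predictions_at_threshold(pairwise, 0.95)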
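For readers new to Fellegi-Sunter, the sketch below shows how per-field m and u probabilities could combine into a single match probability. It is a plain-Python illustration of the textbook model, not Splink's implementation. The function names and all m, u, and prior values are made up for illustration, and the fields are assumed to be conditionally independent.

```python
import math


def match_weight(m: float, u: float) -> float:
    """Log2 Bayes factor for one observed comparison outcome.

    m = P(outcome | records match), u = P(outcome | records do not match).
    """
    return math.log2(m / u)


def match_probability(prior: float, weights: list[float]) -> float:
    """Add the prior log-odds to the field weights and convert back to a probability."""
    prior_weight = math.log2(prior / (1 - prior))
    total = prior_weight + sum(weights)
    odds = 2 ** total
    return odds / (1 + odds)


# Illustrative values only; a real model estimates these from the data.
weights = [
    match_weight(m=0.95, u=0.001),  # email agrees: strong evidence for a match
    match_weight(m=0.90, u=0.05),   # surname agrees: moderate evidence
    match_weight(m=0.10, u=0.90),   # city disagrees: evidence against a match
]

prob = match_probability(prior=0.001, weights=weights)
print(f"match probability: {prob:.3f}")
```

Agreement on a rare value such as an email address carries a large positive weight, while disagreement on city pulls the total back down.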

Strengths:

Proper probabilistic model with calibrated match probabilities
Scales to 100M+ records on Spark or DuckDB
Excellent visualization tools (waterfall charts, comparison viewers)
Well-documented with detailed tutorials
Active development (1,900+ GitHub stars)

Limitations:

No golden records or survivorship — output is match clusters only
Steeper learning curve (must understand m/u probabilities, EM, blocking)
No real-time matching API
Requires understanding of probabilistic theory to tune effectively
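Blocking is the part of that learning curve that most affects runtime, so a small illustration may help. The sketch below is generic Python, not Splink code: it compares only records that share a blocking key instead of every possible pair. The helper name `candidate_pairs` and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records: list[dict], key: str) -> set[tuple[int, int]]:
    """Return id pairs whose records share the same non-empty value for `key`."""
    blocks: dict[str, list[int]] = defaultdict(list)
    for record in records:
        value = record.get(key)
        if value:  # records with a missing key never enter a block
            blocks[value.lower()].append(record["id"])
    pairs: set[tuple[int, int]] = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = [
    {"id": 1, "surname": "Smith", "email": "bob@example.com"},
    {"id": 2, "surname": "smith", "email": "robert@example.com"},
    {"id": 3, "surname": "Jones", "email": "bob@example.com"},
    {"id": 4, "surname": "Patel", "email": None},
]

# Without blocking, every pair is compared: n * (n - 1) / 2.
all_pairs = len(records) * (len(records) - 1) // 2

# Taking the union of several blocking rules recovers pairs a single rule would miss.
blocked = candidate_pairs(records, "surname") | candidate_pairs(records, "email")
print(f"{len(blocked)} of {all_pairs} possible pairs are compared")
```

The trade-off is recall: a true match that shares no blocking key is never scored, which is why several rules are usually combined.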

License: MIT

3. Zingg

Best for: Teams with Spark infrastructure who want ML-based matching with minimal labeling.

Zingg uses active learning — it picks the most informative record pairs for you to label, then trains a classifier. You label 30-50 pairs and get a trained model.

bash

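# Define the field types in a JSON config
# Sample candidate pairs for labeling, then label them interactively
# (these two phases are typically repeated for a few rounds)
zingg --phase findTrainingData --conf zingg.conf
zingg --phase label --conf zingg.conf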

# After labeling ~40 pairs, train the model
zingg --phase train --conf zingg.conf

# Run matching
zingg --phase match --conf zingg.conf
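To show why so few labels can go a long way, here is a minimal sketch of uncertainty sampling, one common active-learning strategy. It is an assumption that this resembles how pairs get chosen; Zingg's actual selection logic is not shown here. The function name and the scores are hypothetical.

```python
def most_uncertain_pairs(
    scores: dict[tuple[str, str], float],
    k: int,
) -> list[tuple[str, str]]:
    """Return the k pairs whose predicted match probability is closest to 0.5."""
    return sorted(scores, key=lambda pair: abs(scores[pair] - 0.5))[:k]


# Made-up model scores for four candidate pairs.
scores = {
    ("a1", "b1"): 0.98,  # model is already confident this is a match
    ("a2", "b2"): 0.52,  # model is unsure
    ("a3", "b3"): 0.07,  # model is already confident this is not a match
    ("a4", "b4"): 0.41,  # model is unsure
}

to_label = most_uncertain_pairs(scores, k=2)
print(to_label)
```

Labeling the pairs the model is least sure about tends to move the decision boundary more than labeling pairs it already classifies confidently.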

Strengths:

Active learning reduces labeling effort to 30-50 pairs
Handles messy data where rules are hard to articulate
Runs on Spark for large-scale processing
Good for cases where you can't express matching logic as rules

Limitations:

AGPL-3.0 license — restrictive for commercial SaaS use (must open-source your code or buy a commercial license)
Requires Spark infrastructure
No golden records in the open-source version (Enterprise only)
Less predictable than rule-based approaches — model behavior can surprise you

License: AGPL-3.0 (commercial license available)

4. Dedupe

Best for: Python developers who want a simple, well-documented matching library.

Dedupe was the original Python entity resolution library. It uses active learning with a clean API that feels natural to Python developers.

python

import dedupe

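# Dict-style field definitions follow the dedupe 2.x API; dedupe 3.x expects
# variable objects instead, e.g. dedupe.variables.String("name").
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
    {"field": "city", "type": "ShortString"},
    {"field": "phone", "type": "String", "has missing": True},
]

# data maps each record ID to a dict of field values,
# e.g. {1: {"name": ..., "address": ..., "city": ..., "phone": ...}}
deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)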
dedupe.console_label(deduper)  # Interactive labeling
deduper.train()
clusters = deduper.partition(data, threshold=0.5)
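The output of `partition` is a list of `(record_ids, scores)` tuples, which usually needs flattening before it can be joined back to the source data. The sketch below shows one way to do that. The sample clusters are hand-written to mimic that shape rather than produced by a trained model, and `to_cluster_map` is a hypothetical helper.

```python
# Hand-written sample in the shape partition() returns (not real model output).
clusters = [
    ((101, 205), (0.97, 0.97)),  # two records judged to be the same entity
    ((310,), (1.0,)),            # a singleton cluster
]


def to_cluster_map(clusters) -> dict:
    """Flatten clusters into {record_id: {"cluster_id", "confidence"}}."""
    mapping = {}
    for cluster_id, (record_ids, scores) in enumerate(clusters):
        for record_id, score in zip(record_ids, scores):
            mapping[record_id] = {
                "cluster_id": cluster_id,
                "confidence": float(score),
            }
    return mapping


cluster_map = to_cluster_map(clusters)
print(cluster_map[205])
```

From here, survivorship is left to you: the library tells you which records belong together, not which field values to keep.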

Strengths:

Clean, Pythonic API
Good documentation and tutorials
Active learning with interactive console labeling
Well-understood by the data science community (4,400+ GitHub stars)
MIT licensed

Limitations:

Effectively unmaintained — last release August 2024, no commits since
No golden records or survivorship
Doesn't scale well beyond a few hundred thousand records
Single-machine only (no distributed computing)
No real-time matching

License: MIT

5. Senzing (Community Edition)

Best for: Developers who want zero-configuration matching and can accept a proprietary engine.

Senzing takes a unique approach — it uses "principle-based AI" that requires zero configuration. You load data, and it matches. No rules, no training, no thresholds to set.

python

from senzing import G2Engine

# config is a JSON string of engine settings (resource paths, database connection)
engine = G2Engine()
engine.init("my_app", config)

# Add records — Senzing matches automatically
engine.addRecord("CRM", "1001", '{"NAME": "Robert Smith", "EMAIL": "[email protected]"}')
engine.addRecord("BILLING", "42", '{"NAME": "Bob Smith", "EMAIL": "[email protected]"}')

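# Query: which entity did this record resolve to?
# The v3 Python SDK writes the JSON result into a bytearray you pass in.
response = bytearray()
engine.getEntityByRecordID("CRM", "1001", response)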

Strengths:

Zero configuration — genuinely works out of the box for many use cases
Real-time matching (add a record, get instant resolution)
Handles diverse data types without field-specific tuning
Claims billions of records on large clusters

Limitations:

Not truly open source — the engine is proprietary; the community edition is free but closed-source
No golden records or survivorship
Self-hosted only (no managed cloud offering)
Commercial license starts at ~$37K/year for production
Black box — you can't inspect or modify the matching logic

License: Proprietary (free community tier, commercial production license)

Feature Matrix

Feature	Kanoniv	Splink	Zingg	Dedupe	Senzing
Deterministic rules	Yes	Limited	No	No	Built-in
Fuzzy matching	Yes	Yes	Yes (ML)	Yes (ML)	Built-in
Probabilistic model	No	Yes (Fellegi-Sunter)	No	No	Proprietary
Golden records	Yes	No	Enterprise only	No	No
Survivorship	Yes	No	Enterprise only	No	No
Real-time API	Yes (Cloud)	No	No	No	Yes
Local development	Yes	Yes	Yes (Spark)	Yes	Yes
Distributed computing	Cloud	Spark	Spark	No	Multi-node
Active learning	No	No	Yes	Yes	No
Config format	YAML	Python dict	JSON	Python API	Auto
Visualization	No	Yes (excellent)	No	No	No

Decision Framework

Choose Kanoniv if:

You need golden records with survivorship out of the box
You want to develop and test locally before deploying to production
You prefer declarative YAML configuration over writing code
You want a free SDK with no vendor lock-in for local matching
Your team values explainability and auditability over ML-based matching

Choose Splink if:

You have a data science team comfortable with probabilistic models
You need to match 10M+ records and can run DuckDB or Spark
You want calibrated match probabilities (not just scores)
You need excellent visualization tools for model debugging
You don't need golden records (you'll build survivorship yourself)

Choose Zingg if:

Your data is too messy to express matching rules
You have Spark infrastructure
AGPL licensing is acceptable (or you'll buy a commercial license)
You want ML-based matching with minimal labeling effort

Choose Dedupe if:

You have a small dataset (< 100K records)
You want a simple Python library with interactive labeling
You're comfortable with a library that's no longer actively maintained
You're prototyping and need something quick

Choose Senzing if:

You want zero-configuration matching
You can accept a proprietary engine
You need real-time record-by-record resolution
Budget allows for commercial licensing ($37K+/year)

The Trend: Declarative Over Learned

The entity resolution landscape is shifting. Early tools (Dedupe, Zingg) bet on ML: train a model and let it learn matching patterns. Newer tools (Kanoniv, Splink) favor explicit configuration: you declare rules or comparisons in a reviewable config and can trace why a given pair matched.

This mirrors a broader trend in data engineering: teams want deterministic, version-controlled, auditable pipelines. A YAML spec or Python configuration is reviewable in a PR, testable in CI, and explainable to compliance. A trained model is none of those things.

The best approach for most teams in 2026: start with deterministic rules on strong identifiers, add fuzzy matching for names and addresses, and reserve ML for the subset of data where rules genuinely can't capture the pattern.
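That tiered approach could be sketched roughly as follows. This is an illustration in standard-library Python, not any listed tool's API: it uses `difflib.SequenceMatcher` as a stand-in for a proper name-similarity metric such as Jaro-Winkler, and the function names, labels, and 0.85 threshold are assumptions.

```python
from difflib import SequenceMatcher


def normalize(value) -> str:
    """Lowercase and trim a value; treat None as an empty string."""
    return (value or "").strip().lower()


def classify_pair(a: dict, b: dict, name_threshold: float = 0.85) -> str:
    """Apply the cheapest, most reliable rule first; fall through when unsure."""
    # Tier 1: deterministic rule on a strong identifier.
    email_a, email_b = normalize(a.get("email")), normalize(b.get("email"))
    if email_a and email_a == email_b:
        return "match:exact_email"

    # Tier 2: fuzzy comparison on a weaker field.
    similarity = SequenceMatcher(
        None, normalize(a.get("name")), normalize(b.get("name"))
    ).ratio()
    if similarity >= name_threshold:
        return "match:fuzzy_name"

    # Tier 3: leave the pair for an ML model or manual review.
    return "unresolved"


print(classify_pair(
    {"name": "Robert Smith", "email": "rsmith@example.com"},
    {"name": "Bob Smith", "email": "RSmith@example.com "},
))
print(classify_pair(
    {"name": "Jon Smith", "email": None},
    {"name": "John Smith", "email": ""},
))
print(classify_pair({"name": "Robert Smith"}, {"name": "Alice Jones"}))
```

Because each result names the rule that fired, every match decision stays auditable, which is the property the declarative approach is after.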
