Best Open-Source Entity Resolution Tools in 2026

Published February 2026 · 12 min read

If you need to match records across datasets — deduplication, record linkage, identity resolution — you have more open-source options in 2026 than ever before. But the tools differ significantly in approach, maturity, and what they actually produce.

We evaluated five widely used entity resolution tools that are open source or free to use: Kanoniv, Splink, Zingg, Dedupe, and Senzing (community edition, which is free but closed-source). This post compares them on the dimensions that matter for production use: matching approach, golden record support, scalability, ease of use, and licensing.

The Quick Comparison

| Tool | Approach | Golden Records | Scale | License | Active? |
|---|---|---|---|---|---|
| Kanoniv | Declarative rules (YAML) | Yes | 100K+ local, unlimited cloud | MIT | Yes |
| Splink | Probabilistic (Fellegi-Sunter) | No | 100M+ (Spark/DuckDB) | MIT | Yes |
| Zingg | ML + active learning | Enterprise only | Large (Spark) | AGPL-3.0 | Yes |
| Dedupe | ML + active learning | No | Small-medium | MIT | Inactive since Aug 2024 |
| Senzing | Principle-based AI | No | Billions (claimed) | Proprietary (free tier) | Yes |

1. Kanoniv

Best for: Teams that want golden records, local development, and declarative configuration.

Kanoniv takes a different approach from the other tools on this list. Instead of training a model or configuring a probabilistic framework, you write a YAML spec that declares your matching rules, survivorship strategy, and decision thresholds.

yaml
entity:
  name: customer
sources:
  - name: crm
    adapter: csv
    location: contacts.csv
    primary_key: id
  - name: billing
    adapter: csv
    location: stripe.csv
    primary_key: id
rules:
  - name: email_exact
    type: exact
    field: email
    weight: 1.0
  - name: name_fuzzy
    type: jaro_winkler
    field: name
    threshold: 0.9
    weight: 0.8
survivorship:
  strategy: source_priority
  priority: [crm, billing]
decision:
  thresholds:
    match: 0.85

python
from kanoniv import Spec, Source, reconcile, validate

spec = Spec.from_file("customer-spec.yaml")
validate(spec).raise_on_error()

sources = [
    Source.from_csv("crm", "contacts.csv"),
    Source.from_csv("billing", "stripe.csv"),
]
result = reconcile(sources, spec)
print(f"Golden records: {len(result.golden_records)}")
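
For intuition about what the 0.9 threshold on the name_fuzzy rule measures, here is a minimal standard-library sketch of Jaro-Winkler similarity. It is a textbook version of the metric, not Kanoniv's Rust implementation; the function names and sample strings are illustrative assumptions.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


for left, right in [("MARTHA", "MARHTA"), ("DIXON", "DICKSONX")]:
    score = jaro_winkler(left, right)
    print(left, right, round(score, 4), "match" if score >= 0.9 else "below threshold")
```

A single transposition inside an otherwise identical name clears a 0.9 threshold, while a name with several inserted characters does not.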

What makes it different:

  • Golden records out of the box. The other open-source options on this list stop at matching (Zingg offers golden records only in its Enterprise edition). Kanoniv includes survivorship: the logic that chooses which field values survive into the canonical record.
  • Offline-first. The Python SDK runs entirely on your machine. No API keys, no accounts, no data leaves your environment.
  • Spec-as-code. The YAML spec is version-controlled, diff-able, and serves as living documentation of your matching logic.
  • Rust engine. The reconciliation engine is written in Rust and compiled into a native Python extension via PyO3, so matching runs as compiled code rather than interpreted Python.
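
To make the survivorship idea concrete, the sketch below shows one plausible reading of a source_priority strategy in plain Python: for each field, take the first non-empty value in priority order. It is an illustration rather than Kanoniv's engine; merge_by_source_priority, the sample records, and the rule that empty strings count as missing are all assumptions.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def merge_by_source_priority(matched: Dict[str, Record], priority: List[str]) -> Record:
    """Build one golden record from records already matched to the same entity.

    `matched` maps a source name to that source's record. For each field, the
    value from the highest-priority source with a non-empty value wins.
    """
    # Collect field names in a stable order, highest-priority source first.
    fields: List[str] = []
    for source in priority:
        for field in matched.get(source, {}):
            if field not in fields:
                fields.append(field)

    golden: Record = {}
    for field in fields:
        golden[field] = None
        for source in priority:
            value = matched.get(source, {}).get(field)
            if value not in (None, ""):
                golden[field] = value
                break
    return golden


# Hypothetical records that a matcher has already linked together.
matched = {
    "crm": {"name": "Robert Smith", "email": "", "phone": "555-0100"},
    "billing": {"name": "Bob Smith", "email": "bob@example.com", "phone": None},
}
golden = merge_by_source_priority(matched, priority=["crm", "billing"])
print(golden)
```

Here the CRM name wins because crm ranks first, while the email falls through to billing because the CRM value is empty.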

Limitations:

  • No ML-based matching (by design — Kanoniv favors explicit rules over learned models)
  • Newer project, smaller community than Splink or Dedupe
  • Local SDK handles 100K+ records well; larger datasets need Kanoniv Cloud

License: MIT (SDK, free for local use); Kanoniv Cloud is a paid platform

2. Splink

Best for: Data scientists who want probabilistic matching at scale with DuckDB or Spark.

Splink is the most technically rigorous open-source matching tool. Built by the UK Ministry of Justice, it implements the Fellegi-Sunter probabilistic model with proper m/u probability estimation via the EM algorithm.

python
# Written against the Splink 3.x API; Splink 4 reorganized these imports.
import splink.duckdb.comparison_library as cl
from splink.duckdb.linker import DuckDBLinker

settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        cl.exact_match("email"),
        cl.jaro_winkler_at_thresholds("first_name", [0.9, 0.7]),
        cl.levenshtein_at_thresholds("surname", [1, 2]),
        cl.exact_match("city"),
    ],
    "blocking_rules_to_generate_predictions": [
        "l.email = r.email",
        "l.surname = r.surname and l.city = r.city",
    ],
}

# df: a pandas DataFrame with a unique_id column plus the columns compared above
linker = DuckDBLinker(df, settings)
linker.estimate_u_using_random_sampling(max_pairs=1e6)
linker.estimate_parameters_using_expectation_maximisation("l.email = r.email")
pairwise = linker.predict(threshold_match_probability=0.9)
clusters = linker.cluster_pairwise_predictions_at_threshold(pairwise, 0.95)
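
Conceptually, the clustering step treats every pair at or above the threshold as an edge and groups records into connected components. The sketch below shows that idea with a standard-library union-find and made-up pairs; cluster_pairs is a hypothetical helper, not part of Splink.

```python
from typing import Dict, Iterable, List, Set, Tuple


def cluster_pairs(pairs: Iterable[Tuple[str, str, float]], threshold: float) -> List[Set[str]]:
    """Group record ids into connected components over pairs scoring >= threshold."""
    parent: Dict[str, str] = {}

    def find(record_id: str) -> str:
        # Register unseen ids as their own root, then walk up with path halving.
        parent.setdefault(record_id, record_id)
        while parent[record_id] != record_id:
            parent[record_id] = parent[parent[record_id]]
            record_id = parent[record_id]
        return record_id

    for left, right, probability in pairs:
        root_left, root_right = find(left), find(right)
        if probability >= threshold and root_left != root_right:
            parent[root_right] = root_left

    clusters: Dict[str, Set[str]] = {}
    for record_id in list(parent):
        clusters.setdefault(find(record_id), set()).add(record_id)
    return list(clusters.values())


# Hypothetical pairwise predictions: (left id, right id, match probability).
predictions = [
    ("a", "b", 0.99),
    ("b", "c", 0.96),
    ("c", "d", 0.50),
    ("e", "f", 0.97),
]
for cluster in cluster_pairs(predictions, threshold=0.95):
    print(sorted(cluster))
```

Note the transitivity: a and c land in the same cluster through b even though they were never compared directly, which is why the clustering threshold deserves as much care as the pairwise one.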

Strengths:

  • Proper probabilistic model with calibrated match probabilities
  • Scales to 100M+ records on Spark or DuckDB
  • Excellent visualization tools (waterfall charts, comparison viewers)
  • Well-documented with detailed tutorials
  • Active development (1,900+ GitHub stars)

Limitations:

  • No golden records or survivorship — output is match clusters only
  • Steeper learning curve: tuning it effectively requires understanding m/u probabilities, EM, and blocking
  • No real-time matching API
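
For readers new to m/u probabilities, the sketch below works through the core Fellegi-Sunter arithmetic under the usual conditional-independence assumption: each field contributes a log2 Bayes factor, and the sum updates a prior into a match probability. The m/u values and the prior are invented for illustration and are not Splink output.

```python
import math
from typing import Dict, Tuple

# Hypothetical parameters, for illustration only.
# m = P(field agrees | records match), u = P(field agrees | records do not match)
PARAMS: Dict[str, Tuple[float, float]] = {
    "email": (0.95, 0.001),
    "surname": (0.90, 0.01),
    "city": (0.85, 0.10),
}


def match_weight(agreement: Dict[str, bool]) -> float:
    """Sum of log2 Bayes factors over the compared fields."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = PARAMS[field]
        bayes_factor = m / u if agrees else (1 - m) / (1 - u)
        total += math.log2(bayes_factor)
    return total


def match_probability(agreement: Dict[str, bool], prior: float = 0.001) -> float:
    """Convert the match weight into a posterior probability via the odds form of Bayes' rule."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** match_weight(agreement)
    return posterior_odds / (1 + posterior_odds)


all_agree = {"email": True, "surname": True, "city": True}
city_only = {"email": False, "surname": False, "city": True}
print(round(match_weight(all_agree), 2), round(match_probability(all_agree), 4))
print(round(match_weight(city_only), 2), round(match_probability(city_only), 6))
```

Agreement on a rare value such as email carries far more weight than agreement on a common one such as city, because its u probability is much smaller.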

License: MIT

3. Zingg

Best for: Teams with Spark infrastructure who want ML-based matching with minimal labeling.

Zingg uses active learning — it picks the most informative record pairs for you to label, then trains a classifier. You label 30-50 pairs and get a trained model.

bash
# Define the field types in a JSON config
# Then label pairs interactively
zingg --phase label --conf zingg.conf

# After labeling ~40 pairs, train the model
zingg --phase train --conf zingg.conf

# Run matching
zingg --phase match --conf zingg.conf
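
One common way an active learner picks "the most informative" pairs is uncertainty sampling: ask the human about the pairs the current model is least sure of. The sketch below illustrates that strategy in plain Python; it is an assumption about the general technique, not a description of Zingg's internal selection logic, and the scores are made up.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]


def most_uncertain(pairs: List[Pair], score: Callable[[Pair], float], k: int) -> List[Pair]:
    """Return the k pairs whose predicted match score is closest to 0.5."""
    return sorted(pairs, key=lambda pair: abs(score(pair) - 0.5))[:k]


# Hypothetical scores from a partially trained classifier.
model_scores: Dict[Pair, float] = {
    ("Robert Smith", "Bob Smith"): 0.55,
    ("Ann Lee", "Ann Lee"): 0.99,
    ("Ann Lee", "Raj Patel"): 0.02,
    ("J. Doe", "John Doe"): 0.48,
}

to_label = most_uncertain(list(model_scores), lambda pair: model_scores[pair], k=2)
for left, right in to_label:
    print(f"Label this pair: {left!r} vs {right!r}")
```

Confident predictions near 0 or 1 are skipped, which is how a few dozen labels can go further than a random sample of the same size.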

Strengths:

  • Active learning reduces labeling effort to 30-50 pairs
  • Handles messy data where rules are hard to articulate
  • Runs on Spark for large-scale processing
  • Good for cases where you can't express matching logic as rules

Limitations:

  • AGPL-3.0 license — restrictive for commercial SaaS use (must open-source your code or buy a commercial license)
  • Requires Spark infrastructure
  • No golden records in the open-source version (Enterprise only)
  • Less predictable than rule-based approaches — model behavior can surprise you

License: AGPL-3.0 (commercial license available)

4. Dedupe

Best for: Python developers who want a simple, well-documented matching library.

Dedupe was the original Python entity resolution library. It uses active learning with a clean API that feels natural to Python developers.

python
import dedupe

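# Dedupe 3.x declares fields as variable objects; 2.x used dicts such as
# {"field": "name", "type": "String"}.
fields = [
    dedupe.variables.String("name"),
    dedupe.variables.String("address"),
    dedupe.variables.ShortString("city"),
    dedupe.variables.String("phone", has_missing=True),
]

# data: dict mapping each record id to a dict of field values
deduper = dedupe.Dedupe(fields)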
deduper.prepare_training(data)
dedupe.console_label(deduper)  # Interactive labeling
deduper.train()
clusters = deduper.partition(data, threshold=0.5)
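
The result of partition is typically consumed as a list of clusters, each pairing a tuple of record ids with a matching sequence of confidence scores. Assuming that shape, the sketch below flattens it into a per-record lookup; to_assignments and the sample clusters are hypothetical and not part of the dedupe library.

```python
from typing import Dict, List, Sequence, Tuple

Cluster = Tuple[Sequence[str], Sequence[float]]


def to_assignments(clusters: List[Cluster]) -> Dict[str, dict]:
    """Map each record id to its cluster number and confidence score."""
    rows: Dict[str, dict] = {}
    for cluster_id, (record_ids, scores) in enumerate(clusters):
        for record_id, score in zip(record_ids, scores):
            rows[record_id] = {"cluster_id": cluster_id, "confidence": float(score)}
    return rows


# Hypothetical output in the assumed (record_ids, scores) shape.
example_clusters: List[Cluster] = [
    (("crm-1", "billing-42"), (0.97, 0.97)),
    (("crm-2",), (1.0,)),
]
assignments = to_assignments(example_clusters)
for record_id, row in assignments.items():
    print(record_id, row)
```

A lookup like this is the usual starting point for writing a cluster id column back to the source table.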

Strengths:

  • Clean, Pythonic API
  • Good documentation and tutorials
  • Active learning with interactive console labeling
  • Well-understood by the data science community (4,400+ GitHub stars)
  • MIT licensed

Limitations:

  • Effectively unmaintained — last release August 2024, no commits since
  • No golden records or survivorship
  • Doesn't scale well beyond a few hundred thousand records
  • Single-machine only (no distributed computing)
  • No real-time matching

License: MIT

5. Senzing (Community Edition)

Best for: Developers who want zero-configuration matching and can accept a proprietary engine.

Senzing takes a unique approach — it uses "principle-based AI" that requires zero configuration. You load data, and it matches. No rules, no training, no thresholds to set.

python
from senzing import G2Engine

engine = G2Engine()
engine.init("my_app", config)

# Add records — Senzing matches automatically
engine.addRecord("CRM", "1001", '{"NAME": "Robert Smith", "EMAIL": "[email protected]"}')
engine.addRecord("BILLING", "42", '{"NAME": "Bob Smith", "EMAIL": "[email protected]"}')

# Query: are these the same entity?
response = engine.getEntityByRecordID("CRM", "1001")

Strengths:

  • Zero configuration — genuinely works out of the box for many use cases
  • Real-time matching (add a record, get instant resolution)
  • Handles diverse data types without field-specific tuning
  • Claims billions of records on large clusters

Limitations:

  • Not truly open source — the engine is proprietary; the community edition is free but closed-source
  • No golden records or survivorship
  • Self-hosted only (no managed cloud offering)
  • Commercial license starts at ~$37K/year for production
  • Black box — you can't inspect or modify the matching logic

License: Proprietary (free community tier, commercial production license)

Feature Matrix

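| Feature | Kanoniv | Splink | Zingg | Dedupe | Senzing |
|---|---|---|---|---|---|
| Deterministic rules | Yes | Limited | No | No | Built-in |
| Fuzzy matching | Yes | Yes | Yes (ML) | Yes (ML) | Built-in |
| Probabilistic model | No | Yes (Fellegi-Sunter) | No | No | Proprietary |
| Golden records | Yes | No | Enterprise only | No | No |
| Survivorship | Yes | No | Enterprise only | No | No |
| Real-time API | Yes (Cloud) | No | No | No | Yes |
| Local development | Yes | Yes | Yes (Spark) | Yes | Yes |
| Distributed computing | Cloud | Spark, DuckDB | Spark | No | Multi-node |
| Active learning | No | No | Yes | Yes | No |
| Config format | YAML | Python dict | JSON | Python API | Auto |
| Visualization | No | Yes (excellent) | No | No | No |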

Decision Framework

Choose Kanoniv if:

  • You need golden records with survivorship out of the box
  • You want to develop and test locally before deploying to production
  • You prefer declarative YAML configuration over writing code
  • You want a free SDK with no vendor lock-in for local matching
  • Your team values explainability and auditability over ML-based matching

Choose Splink if:

  • You have a data science team comfortable with probabilistic models
  • You need to match 10M+ records and have Spark/DuckDB infrastructure
  • You want calibrated match probabilities (not just scores)
  • You need excellent visualization tools for model debugging
  • You don't need golden records (you'll build survivorship yourself)

Choose Zingg if:

  • Your data is too messy to express matching rules
  • You have Spark infrastructure
  • AGPL licensing is acceptable (or you'll buy a commercial license)
  • You want ML-based matching with minimal labeling effort

Choose Dedupe if:

  • You have a small dataset (< 100K records)
  • You want a simple Python library with interactive labeling
  • You're comfortable with a library that's no longer actively maintained
  • You're prototyping and need something quick

Choose Senzing if:

  • You want zero-configuration matching
  • You can accept a proprietary engine
  • You need real-time record-by-record resolution
  • Budget allows for commercial licensing ($37K+/year)

The Trend: Declarative Over Learned

The entity resolution landscape is shifting. Early tools (Dedupe, Zingg) bet on ML — train a model, let it learn matching patterns. Newer tools (Kanoniv, Splink) favor explicit configuration — declare your rules, understand exactly what matches and why.

This mirrors a broader trend in data engineering: teams want deterministic, version-controlled, auditable pipelines. A YAML spec or Python configuration is reviewable in a PR, testable in CI, and explainable to compliance. A trained model is none of those things.

The best approach for most teams in 2026: start with deterministic rules on strong identifiers, add fuzzy matching for names and addresses, and reserve ML for the subset of data where rules genuinely can't capture the pattern.
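
The tiered approach described above can be sketched in a few lines of standard-library Python: an exact check on a strong identifier first, then fuzzy name matching gated by a second field, with everything ambiguous routed to review (or, later, to a model). The field names, thresholds, and the decide function are assumptions, and difflib's ratio stands in for a metric such as Jaro-Winkler to keep the example dependency-free.

```python
from difflib import SequenceMatcher
from typing import Dict


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(value.lower().split())


def decide(a: Dict[str, str], b: Dict[str, str],
           fuzzy_threshold: float = 0.9, review_threshold: float = 0.7) -> str:
    # Tier 1: deterministic rule on a strong identifier.
    email_a, email_b = normalize(a.get("email", "")), normalize(b.get("email", ""))
    if email_a and email_a == email_b:
        return "match:email_exact"

    # Tier 2: fuzzy name similarity, accepted only when the city also agrees.
    similarity = SequenceMatcher(
        None, normalize(a.get("name", "")), normalize(b.get("name", ""))
    ).ratio()
    city_a, city_b = normalize(a.get("city", "")), normalize(b.get("city", ""))
    if similarity >= fuzzy_threshold and city_a and city_a == city_b:
        return "match:name_fuzzy"

    # Tier 3: ambiguous pairs go to a human or a model instead of being auto-merged.
    if similarity >= review_threshold:
        return "review"
    return "no_match"


# Hypothetical records.
jon = {"name": "Jon Smith", "email": "jon@example.com", "city": "Leeds"}
john_same_email = {"name": "John Smith", "email": "JON@example.com ", "city": "Leeds"}
john_no_email = {"name": "John Smith", "email": "", "city": "leeds"}
john_other_city = {"name": "John Smith", "email": "", "city": "York"}
raj = {"name": "Raj Patel", "email": "raj@example.com", "city": "Leeds"}

for other in (john_same_email, john_no_email, john_other_city, raj):
    print(other["name"], other["city"], "->", decide(jon, other))
```

Because each decision returns the name of the rule that fired, the output stays auditable in the same way a declarative spec is.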
