Rules

Rules define how records are compared and scored for matching. Each rule specifies a type, the fields to compare, and a weight that contributes to the overall match score. Rules are the core of your identity resolution logic: they encode your domain knowledge about what makes two records represent the same entity.

Rule Structure

Every rule requires four fields:

yaml
rules:
  - name: email_exact        # Unique identifier for this rule
    type: exact               # How values are compared
    field: email              # Which field to compare
    weight: 1.0               # Score contribution (0.0 - 1.0)

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Unique rule identifier (alphanumeric + underscore) |
| type | string | Yes | One of exact, similarity, range, composite, ml |
| field | string | Yes | Field this rule compares (must exist in at least one source). Alias: fields |
| weight | float | Yes | Score contribution, between 0.0 and 1.0 |
| algorithm | string | Conditional | Required for similarity type |
| threshold | float | Conditional | Required for similarity and ml types |
| tolerance | float | Conditional | Required for range type |
| operator | string | Conditional | and or or, required for composite type |
| children | array | Conditional | Sub-rules, required for composite type |
| model | string | Conditional | Model identifier, required for ml type (Cloud) |

Rule Types

Exact Match

The simplest and most performant rule type. Records match if values are identical after normalization.

yaml
- name: email_exact
  type: exact
  field: email
  weight: 1.0

Behavior:

  • String comparison is case-insensitive (email addresses that differ only in letter case match)
  • Leading and trailing whitespace is trimmed before comparison
  • Null values never match (even against other nulls)
  • Numeric fields are compared by value, not string representation (100 matches 100.0)
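
The behavior listed above can be sketched in a few lines of Python. This is an illustration only, not the engine's implementation; the helper names `normalize` and `exact_match` are hypothetical.

```python
from typing import Any


def normalize(value: Any) -> Any:
    """Normalize a value as the bullets describe (hypothetical helper)."""
    if isinstance(value, str):
        return value.strip().lower()   # trim whitespace, ignore case
    return value                       # numbers are compared by value


def exact_match(a: Any, b: Any) -> bool:
    """Return True when two values are identical after normalization."""
    if a is None or b is None:
        return False                   # nulls never match, even each other
    return normalize(a) == normalize(b)
```

For example, `exact_match("  ABC ", "abc")` and `exact_match(100, 100.0)` are both true, while `exact_match(None, None)` is false.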

Multi-field exact match:

To match on multiple fields, use a composite rule with and:

yaml
- name: phone_and_zip
  type: composite
  operator: and
  children:
    - name: phone_exact
      type: exact
      field: phone
      weight: 0.4
    - name: zip_exact
      type: exact
      field: zip_code
      weight: 0.4

This rule only fires when both phone and zip_code are identical across two records.


Similarity Match

Compares string values using a similarity algorithm and fires when the similarity score meets or exceeds the threshold.

yaml
- name: name_fuzzy
  type: similarity
  field: name
  algorithm: jaro_winkler
  threshold: 0.88
  weight: 0.6

Required fields for similarity rules:

| Field | Type | Description |
| --- | --- | --- |
| algorithm | string | One of jaro_winkler, levenshtein, soundex, metaphone, cosine |
| threshold | float | Minimum similarity score to consider a match (0.0 - 1.0) |

How it works:

  1. Both values are normalized (trimmed, lowercased)
  2. The similarity algorithm produces a score between 0.0 and 1.0
  3. If the score >= threshold, the rule fires and contributes its weight to the overall score
  4. If the score < threshold, the rule contributes 0.0
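
The four steps can be sketched as a small Python function. The `ratio` function below uses `difflib.SequenceMatcher` purely as a stand-in scorer; it is not one of the documented algorithms, and `similarity_rule` is a hypothetical name.

```python
from difflib import SequenceMatcher
from typing import Callable


def ratio(a: str, b: str) -> float:
    """Stand-in similarity in [0.0, 1.0]; not one of the documented algorithms."""
    return SequenceMatcher(None, a, b).ratio()


def similarity_rule(a: str, b: str, threshold: float,
                    algorithm: Callable[[str, str], float] = ratio) -> float:
    """Return the similarity score if the rule fires, otherwise 0.0."""
    a, b = a.strip().lower(), b.strip().lower()   # step 1: normalize
    score = algorithm(a, b)                       # step 2: score in [0.0, 1.0]
    return score if score >= threshold else 0.0   # steps 3 and 4: apply threshold
```

Passing the scorer as a parameter keeps the threshold logic independent of which algorithm is configured.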

Example with Levenshtein:

yaml
- name: company_name_fuzzy
  type: similarity
  field: company_name
  algorithm: levenshtein
  threshold: 0.80
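  weight: 0.5

"Acme Corporation" vs "Acme Corporaton" differs by one edit across 16 characters, giving a Levenshtein similarity of ~0.94, which exceeds the 0.80 threshold. An abbreviation such as "Acme Corp" is 7 edits away (~0.56) and does not fire at this threshold.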


Range Match

Compares numeric or date values and fires when they fall within a specified tolerance. Useful for financial reconciliation, date matching, and measurement comparisons.

yaml
- name: amount_close
  type: range
  field: amount
  tolerance: 0.05    # 5% tolerance
  weight: 0.5

Tolerance modes:

| tolerance value | Mode | Example |
| --- | --- | --- |
| 0.05 | Percentage (5%) | 100.00 matches 95.00 - 105.00 |
| 5.0 | Absolute | 100.00 matches 95.00 - 105.00 |

How tolerance mode is determined

Values less than or equal to 1.0 are treated as percentages. Values greater than 1.0 are treated as absolute tolerances. To specify an absolute tolerance of 1.0 or less, use a composite rule with custom logic.
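
A minimal Python sketch of the mode rule follows. It assumes the percentage is taken relative to the first value, which the documentation does not state; `within_tolerance` is a hypothetical helper.

```python
def within_tolerance(a: float, b: float, tolerance: float) -> bool:
    """Apply the mode rule: tolerance <= 1.0 is a percentage, > 1.0 is absolute."""
    if tolerance <= 1.0:
        # Assumption: the percentage is relative to the first value.
        allowed = abs(a) * tolerance
    else:
        allowed = tolerance
    return abs(a - b) <= allowed
```

With this sketch, 100.00 vs 97.00 passes under both `0.05` and `5.0`, while 100.00 vs 106.00 fails under both.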

Date range example:

yaml
- name: transaction_date_close
  type: range
  field: transaction_date
  tolerance: 3       # Within 3 days
  weight: 0.4
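
For dates, the comparison can be sketched as follows, assuming the tolerance counts whole calendar days and the boundary is inclusive (neither is confirmed by the documentation). `dates_within` is a hypothetical helper.

```python
from datetime import date


def dates_within(a: date, b: date, tolerance_days: int) -> bool:
    """True when two dates are at most tolerance_days apart (inclusive)."""
    return abs((a - b).days) <= tolerance_days
```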

Composite Match

Combine multiple sub-rules using and or or operators. Composite rules let you express complex matching logic that cannot be captured by a single rule.

and operator: all sub-rules must fire:

yaml
- name: address_match
  type: composite
  operator: and
  children:
    - name: street_fuzzy
      type: similarity
      field: street
      algorithm: jaro_winkler
      threshold: 0.85
      weight: 0.4
    - name: zip_exact
      type: exact
      field: zip_code
      weight: 0.3

This rule only contributes to the score when both the street name is similar and the zip code is an exact match. Use and when you need multiple signals to corroborate each other.

or operator: at least one sub-rule must fire:

yaml
- name: contact_match
  type: composite
  operator: or
  children:
    - name: email_exact
      type: exact
      field: email
      weight: 1.0
    - name: phone_exact
      type: exact
      field: phone
      weight: 0.9
    - name: name_and_zip
      type: composite
      operator: and
      children:
        - name: name_fuzzy
          type: similarity
          field: name
          algorithm: jaro_winkler
          threshold: 0.88
          weight: 0.6
        - name: zip_exact
          type: exact
          field: zip_code
          weight: 0.3

This rule fires when any one of the following is true: email matches exactly, phone matches exactly, or both name is similar and zip code matches. Note that composite rules can be nested; the third sub-rule is itself a composite and.

Scoring behavior:

  • and: All children must fire for any of them to contribute their weights. The composite score is min(child_scores).
  • or: At least one child must fire. The composite score is max(child_scores).
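
The min/max behavior, including nesting, can be sketched recursively. This is an illustration under assumptions: rules are plain dictionaries shaped like the YAML above, and `leaf_scores` is a hypothetical mapping from leaf rule names to their scores, with 0.0 meaning the rule did not fire.

```python
from typing import Dict


def composite_score(rule: dict, leaf_scores: Dict[str, float]) -> float:
    """Score a rule tree; a score of 0.0 means the rule did not fire."""
    if rule.get("type") != "composite":
        return leaf_scores.get(rule["name"], 0.0)
    scores = [composite_score(child, leaf_scores) for child in rule["children"]]
    if rule["operator"] == "and":
        return min(scores)   # 0.0 as soon as any child fails to fire
    return max(scores)       # "or": the strongest child wins


contact_match = {
    "name": "contact_match", "type": "composite", "operator": "or",
    "children": [
        {"name": "email_exact", "type": "exact"},
        {"name": "phone_exact", "type": "exact"},
        {"name": "name_and_zip", "type": "composite", "operator": "and",
         "children": [{"name": "name_fuzzy", "type": "similarity"},
                      {"name": "zip_exact", "type": "exact"}]},
    ],
}
```

Because `min` returns 0.0 whenever one child scores 0.0, the and branch needs no separate "did every child fire" check.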

Algorithm Comparison

Choose the right algorithm for your data:

| Algorithm | Best For | Speed | Handles Typos | Handles Transpositions | Unicode Support |
| --- | --- | --- | --- | --- | --- |
| jaro_winkler | Person names | Fast | Good | Excellent | Yes |
| levenshtein | Short strings (<50 chars) | Medium | Excellent | Good | Yes |
| soundex | Phonetic name matching | Fast | Poor | N/A | English only |
| metaphone | English name variants | Fast | Moderate | N/A | English only |
| cosine | Long strings, addresses | Medium | Good | Good | Yes |

When to use each algorithm

jaro_winkler is the default choice for person names. It gives extra weight to matching prefixes, which aligns with how name typos typically occur (errors are more common later in a string). Recommended threshold: 0.85 - 0.92.

yaml
# Good: Matches "Robert" vs "Robret", "Katherine" vs "Catherine"
algorithm: jaro_winkler
threshold: 0.88
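
To make the prefix bonus concrete, here is a textbook Jaro-Winkler sketch in Python. It uses the conventional prefix scale of 0.1 and a maximum prefix length of 4; the engine's own parameters are not documented here, so treat those values as assumptions.

```python
def jaro(a: str, b: str) -> float:
    """Textbook Jaro similarity in [0.0, 1.0]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_flags = [False] * len(a)
    b_flags = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_flags[j] and b[j] == ch:
                a_flags[i] = b_flags[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transpositions.
    a_matched = [ch for ch, flag in zip(a, a_flags) if flag]
    b_matched = [ch for ch, flag in zip(b, b_flags) if flag]
    transpositions = sum(x != y for x, y in zip(a_matched, b_matched)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1) -> float:
    """Jaro score boosted for a shared prefix of up to 4 characters."""
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)
```

On lowercased input, "robert" vs "robret" scores about 0.96 (shared prefix "rob") and "katherine" vs "catherine" about 0.93 (no shared prefix), so both clear the 0.88 threshold.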

levenshtein counts the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another, then normalizes by the longer string length. Best for short, structured strings. Recommended threshold: 0.75 - 0.90.

yaml
# Good: Matches "123 Main Street" vs "123 Main Stret" (one edit, ~0.93)
algorithm: levenshtein
threshold: 0.80
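
The normalization described above (edit distance divided by the longer string's length) can be sketched in Python as follows. The function names are hypothetical and the implementation is the standard dynamic-programming one, not necessarily the engine's.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """1.0 minus the edit distance normalized by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest
```

For instance, "kitten" vs "sitting" has an edit distance of 3 over 7 characters, a similarity of about 0.57.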

soundex encodes strings by their English pronunciation. Two names that sound alike produce the same code. Useful when data entry is done by ear (phone agents, medical intake). There is no meaningful threshold; it is a binary match.

yaml
# Good: Matches "Smith" vs "Smyth", "Stephen" vs "Steven"
algorithm: soundex
threshold: 1.0    # Soundex is binary: 1.0 = same code, 0.0 = different
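
For reference, a sketch of classic American Soundex in Python shows why the match is binary: each name collapses to a four-character code that is either equal or not. The engine's exact variant is not specified, so this is an assumption.

```python
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Classic four-character Soundex code, e.g. 'Smith' -> 'S530'."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "hw":          # h and w do not separate equal codes
            previous = digit
    return (code + "000")[:4]
```

Both "Smith" and "Smyth" encode to S530, and both "Stephen" and "Steven" encode to S315.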

metaphone is an improved phonetic algorithm that handles English pronunciation rules more accurately than Soundex. Better with names that have silent letters or complex consonant clusters.

yaml
# Good: Matches "Wright" vs "Right", "Knight" vs "Nite"
algorithm: metaphone
threshold: 1.0    # Also binary like soundex

cosine splits strings into overlapping character bigrams and computes the cosine similarity of the resulting frequency vectors. Effective for longer strings like addresses or company names where word order may vary.

yaml
# Good: Matches "123 Main Street, Apt 4" vs "Apt 4, 123 Main St"
algorithm: cosine
threshold: 0.65
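
The bigram approach can be sketched with the standard library. This is a minimal illustration of the technique as described; details such as whitespace handling are assumptions.

```python
from collections import Counter
from math import sqrt


def bigrams(text: str) -> Counter:
    """Frequency vector of overlapping character bigrams."""
    text = text.strip().lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two bigram frequency vectors."""
    va, vb = bigrams(a), bigrams(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = (sqrt(sum(c * c for c in va.values()))
            * sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

Because only bigram counts matter, "main street north" and "north main street" share 14 of their 16 bigrams and score 0.875 despite the reordering.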

Weight Guidelines

Weights encode your confidence that a field match indicates a true entity match. Higher weights mean stronger signals.

| Field Type | Rule Type | Recommended Weight | Rationale |
| --- | --- | --- | --- |
| Email | exact | 0.9 - 1.0 | Emails are near-unique identifiers |
| Phone | exact | 0.8 - 0.9 | Phones can be shared (households, businesses) |
| SSN / National ID | exact | 1.0 | Truly unique, but verify data quality first |
| Full name | similarity | 0.5 - 0.7 | Common names cause false positives |
| First name only | similarity | 0.2 - 0.3 | Low discriminative power alone |
| Last name only | similarity | 0.3 - 0.4 | More discriminative than first name |
| Address (full) | similarity | 0.3 - 0.5 | Formatting varies widely across sources |
| Zip / Postal code | exact | 0.1 - 0.2 | Many people share the same zip code |
| Date of birth | exact | 0.3 - 0.5 | Strong signal when combined with name |
| Transaction amount | range | 0.3 - 0.5 | Useful for financial reconciliation |
| Account number | exact | 0.9 - 1.0 | Near-unique like email |

Weight tuning tips

TIP

  • Start with the recommended weights and adjust based on your data
  • Run a reconciliation on a labeled sample and inspect false positives and false negatives
  • If you see too many false positives, reduce weights on low-discriminative fields (name, zip)
  • If you see too many false negatives, lower the similarity threshold or increase weights on fuzzy rules
  • The sum of all weights does not need to equal 1.0. The decision thresholds are relative to the weighted sum

Blocking Strategies

Without blocking, the engine must compare every record against every other record, resulting in O(n^2) comparisons. For 100,000 records, that is roughly 5 billion pairwise comparisons (n(n-1)/2). Blocking reduces this by grouping records into buckets and only comparing within each bucket.

yaml
blocking:
  strategy: exact
  keys: [email, zip_code]
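
The saving is easy to quantify. The sketch below counts pairwise comparisons with and without blocking; it assumes non-overlapping buckets, and the bucket sizes in the usage note are hypothetical.

```python
from math import comb
from typing import Iterable


def pairs_without_blocking(n: int) -> int:
    """Number of unordered record pairs: n * (n - 1) / 2."""
    return comb(n, 2)


def pairs_with_blocking(bucket_sizes: Iterable[int]) -> int:
    """Pairs compared when records are only compared within their bucket."""
    return sum(comb(size, 2) for size in bucket_sizes)
```

For 100,000 records, `pairs_without_blocking(100_000)` is 4,999,950,000, while splitting the same records into 1,000 buckets of 100 leaves 4,950,000 comparisons.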

Strategies

| Strategy | How It Groups | Best For | Trade-off |
| --- | --- | --- | --- |
| exact | Records must share at least one identical blocking key value | Clean data with reliable key fields | Fast, but misses matches when blocking keys differ |
| phonetic | Records grouped by phonetic encoding (Soundex) of blocking keys | Name-based matching with spelling variations | Catches more matches, larger comparison windows |
| ngram | Records grouped by overlapping character n-gram buckets | Dirty data where no single field is reliable | Broadest recall, but slowest |

Exact blocking

The default strategy. Records are placed into buckets by the exact value of each blocking key. Two records are candidates for comparison if they share the same value for at least one blocking key.

yaml
blocking:
  strategy: exact
  keys: [email, zip_code]

Given records:

Record A: email=(address 1), zip=10001
Record B: email=(address 1), zip=90210
Record C: email=(address 2), zip=10001

  • A and B are candidates (same email)
  • A and C are candidates (same zip_code)
  • B and C are not candidates (no shared blocking key values)
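
The candidate logic above can be sketched in Python. The record layout and the `example.com` addresses are hypothetical placeholders; the engine's internal representation is not documented here.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records: dict, keys: list) -> set:
    """Pairs of record ids that share an identical value for at least one key."""
    buckets = defaultdict(list)
    for record_id, record in records.items():
        for key in keys:
            value = record.get(key)
            if value is not None:          # records missing a key are not bucketed on it
                buckets[(key, value)].append(record_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = {
    "A": {"email": "x@example.com", "zip_code": "10001"},
    "B": {"email": "x@example.com", "zip_code": "90210"},
    "C": {"email": "y@example.com", "zip_code": "10001"},
}
```

Calling `candidate_pairs(records, ["email", "zip_code"])` yields the pairs (A, B) and (A, C), and never (B, C).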

Phonetic blocking

Groups records by the phonetic encoding of each blocking key. Useful when names are spelled differently but sound alike.

yaml
blocking:
  strategy: phonetic
  keys: [last_name]

"Smith", "Smyth", and "Smithe" all produce the same Soundex code and would be grouped together for comparison.

N-gram blocking

Groups records by overlapping character n-grams. Records that share a sufficient number of n-grams are placed in the same bucket. This is the broadest strategy and catches the most potential matches, but creates the largest comparison windows.

yaml
blocking:
  strategy: ngram
  keys: [name, address]

WARNING

N-gram blocking can significantly increase processing time on large datasets. Use it only when data quality is too low for exact or phonetic blocking. Consider limiting blocking keys to 1-2 fields.
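
As a rough illustration of why n-gram blocking widens the comparison window, the sketch below builds an inverted index from character trigrams to record ids and keeps pairs that share at least `min_shared` trigrams. The trigram size and the `min_shared` cutoff are assumptions; the documentation does not specify how many shared n-grams count as sufficient.

```python
from collections import Counter, defaultdict
from itertools import combinations


def ngrams(text: str, n: int = 3) -> set:
    """Set of overlapping character n-grams of the normalized text."""
    text = text.strip().lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_candidates(values: dict, n: int = 3, min_shared: int = 2) -> set:
    """Pairs of record ids whose values share at least min_shared n-grams."""
    index = defaultdict(set)
    for record_id, text in values.items():
        for gram in ngrams(text, n):
            index[gram].add(record_id)
    shared = Counter()
    for ids in index.values():
        for pair in combinations(sorted(ids), 2):
            shared[pair] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}
```

Every n-gram a record contains places it in another bucket, which is where the extra candidate pairs and the extra processing time come from.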

Choosing a blocking strategy

| Scenario | Recommended Strategy | Keys |
| --- | --- | --- |
| Clean CRM data with reliable email | exact | [email] |
| Healthcare patient matching | phonetic | [last_name, date_of_birth] |
| Messy address data | ngram | [address] |
| Financial transactions | exact | [account_number, transaction_date] |
| Lead deduplication | exact | [email, phone] |

Field References

Rules reference fields by name. These field names must exist in the attributes list of at least one source definition. If a source does not have a field referenced by a rule, records from that source are skipped for that rule (they are not penalized).

Validation

The spec validator catches field reference errors at validation time:

python
from kanoniv import Spec, validate

spec = Spec.from_file("spec.yaml")
result = validate(spec)
# Error: Rule "name_fuzzy" references field "full_name" which does not
#        exist in any source. Did you mean "name"?

The validator provides:

  • Missing field detection: fields referenced in rules but not present in any source
  • Typo suggestions: "Did you mean...?" suggestions based on edit distance from available fields
  • Unused field warnings: fields declared in sources but never referenced by any rule
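
A "Did you mean...?" suggestion can be approximated with the standard library. The sketch below uses `difflib.get_close_matches`, which ranks by a sequence-matching ratio rather than true edit distance, so it is a stand-in for the validator's logic and not its implementation. `suggest_field` and the 0.5 cutoff are hypothetical.

```python
from difflib import get_close_matches
from typing import List, Optional


def suggest_field(unknown: str, available: List[str]) -> Optional[str]:
    """Return the closest available field name, or None if nothing is close."""
    matches = get_close_matches(unknown, available, n=1, cutoff=0.5)
    return matches[0] if matches else None
```

With available fields `["name", "email", "zip_code"]`, `suggest_field("full_name", ...)` returns `"name"`.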

Evaluation Order

Rules are evaluated in declaration order, but all rules contribute to the final weighted sum score. The evaluation order matters for two reasons:

  1. Short-circuiting: If early rules produce a score above the match threshold, later rules may be skipped for performance (the result would be the same). This optimization is automatic.
  2. Readability: Declaring high-weight, high-confidence rules first makes the spec easier to understand.

The final score for a candidate pair is computed as:

score = sum(rule.weight * rule_match for each rule)

Where rule_match is 1.0 for exact/range rules that fire, the similarity score for similarity rules that meet or exceed their threshold, and the model confidence for ML rules that meet or exceed their threshold. A rule that does not fire contributes 0.0.

The decision thresholds then classify the pair:

  • score >= match --> automatic match
  • review <= score < match --> sent to review queue
  • score < review --> rejected (no match)
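
The formula and the three decision bands can be sketched as follows. The function names and the label strings are hypothetical; the thresholds in the usage note are example values.

```python
from typing import Iterable, Tuple


def weighted_sum(results: Iterable[Tuple[float, float]]) -> float:
    """Sum weight * rule_match over (weight, rule_match) pairs."""
    return sum(weight * rule_match for weight, rule_match in results)


def classify(score: float, match: float, review: float) -> str:
    """Map a score onto the three decision bands."""
    if score >= match:
        return "match"
    if score >= review:
        return "review"
    return "no_match"
```

With `match=0.9` and `review=0.7`, a score of 1.0 is an automatic match, 0.85 goes to the review queue, and 0.4 is rejected.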

Complete Example

A production-ready rules section for a Customer 360 use case, matching customer records across CRM, billing, and support systems:

yaml
rules:
  # High-confidence identifier match.
  # Email is near-unique -- if two records share an email,
  # they are almost certainly the same person.
  - name: email_exact
    type: exact
    field: email
    weight: 1.0

  # Phone number as a secondary strong signal.
  # Slightly lower weight than email because phone numbers
  # can be shared within households.
  - name: phone_exact
    type: exact
    field: phone
    weight: 0.85

  # Fuzzy name matching to catch typos and abbreviations.
  # jaro_winkler works well for person names because it
  # rewards matching prefixes (e.g., "Rob" in "Robert"/"Roberto").
  - name: name_fuzzy
    type: similarity
    field: name
    algorithm: jaro_winkler
    threshold: 0.88
    weight: 0.6

  # Address matching requires both street similarity and zip match.
  # Using a composite 'and' because a street name alone is ambiguous
  # (e.g., "Main Street" exists in every city).
  - name: address_composite
    type: composite
    operator: and
    children:
      - name: street_fuzzy
        type: similarity
        field: street_address
        algorithm: levenshtein
        threshold: 0.80
        weight: 0.4
      - name: zip_exact
        type: exact
        field: zip_code
        weight: 0.3

  # Date of birth as a corroborating signal.
  # Not high weight on its own, but valuable when combined
  # with name similarity to break ties.
  - name: dob_exact
    type: exact
    field: date_of_birth
    weight: 0.4

blocking:
  strategy: exact
  keys: [email, phone, zip_code]

decision:
  scoring: weighted_sum
  thresholds:
    match: 0.9
    review: 0.7

This configuration:

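  • Automatically matches when email matches (1.0 meets the 0.9 match threshold)
  • Sends to review when only phone matches (0.85 falls between the 0.7 review threshold and the 0.9 match threshold)
  • Typically sends to review when name is similar and the address composite fires: name contributes 0.6 × its similarity score (about 0.53 to 0.6) and, assuming child scores are weighted, the and composite adds min(child_scores) = 0.3, for a total of roughly 0.83 to 0.9
  • Rejects when only a weak signal fires (date of birth alone scores 0.4, below the 0.7 review threshold; a zip code match alone contributes nothing because zip_exact sits inside the and composite)
  • Blocks on email, phone, and zip to keep comparisons manageable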

Validation Limits

| Constraint | Limit |
| --- | --- |
| Max rules per spec | 50 |
| Max sub-rules per composite | 10 |
| Max composite nesting depth | 3 |
| Weight range | 0.0 - 1.0 |
| Threshold range | 0.0 - 1.0 |
| Max fields per rule | 5 |
| Max blocking keys | 5 |
