Rules

Rules define how records are compared and scored for matching. Each rule specifies a type, the fields to compare, and a weight that contributes to the overall match score. Rules are the core of your identity resolution logic: they encode your domain knowledge about what makes two records represent the same entity.

Rule Structure

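Every non-composite rule requires four fields: `name`, `type`, `field`, and `weight`. (The composite examples later in this section take `operator` and `children` in place of `field` and `weight`.)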

yaml

rules:
  - name: email_exact        # Unique identifier for this rule
    type: exact               # How values are compared
    field: email              # Which field to compare
    weight: 1.0               # Score contribution (0.0 - 1.0)

Field	Type	Required	Description
`name`	`string`	Yes	Unique rule identifier (alphanumeric + underscore)
`type`	`string`	Yes	One of `exact`, `similarity`, `range`, `composite`, `ml`
`field`	`string`	Yes	Field this rule compares (must exist in at least one source). Alias: `fields`
`weight`	`float`	Yes	Score contribution, between 0.0 and 1.0
`algorithm`	`string`	Conditional	Required for `similarity` type
`threshold`	`float`	Conditional	Required for `similarity` and `ml` types
`tolerance`	`float`	Conditional	Required for `range` type
`operator`	`string`	Conditional	`and` or `or`, required for `composite` type
`children`	`array`	Conditional	Sub-rules, required for `composite` type
`model`	`string`	Conditional	Model identifier, required for `ml` type (Cloud)

Rule Types

Exact Match

The simplest and most performant rule type. Records match if values are identical after normalization.

yaml

- name: email_exact
  type: exact
  field: email
  weight: 1.0

Behavior:

String comparison is case-insensitive (User@Example.com matches user@example.com)
Leading and trailing whitespace is trimmed before comparison
Null values never match (even against other nulls)
Numeric fields are compared by value, not string representation (100 matches 100.0)
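
The four behaviors above can be sketched in a few lines of Python. This is an illustration only, not the engine's implementation; `normalize` and `exact_match` are hypothetical names.

```python
from numbers import Number

def normalize(value):
    """Trim and lowercase strings; leave other values unchanged."""
    if isinstance(value, str):
        return value.strip().lower()
    return value

def exact_match(a, b):
    """Return True when two field values match under the behaviors listed above."""
    if a is None or b is None:
        return False  # nulls never match, even against other nulls
    if isinstance(a, Number) and isinstance(b, Number):
        return a == b  # compared by value, so 100 equals 100.0
    return normalize(a) == normalize(b)
```

For example, `exact_match(" User@Example.com ", "user@example.com")` is true, while `exact_match(None, None)` is false.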

Multi-field exact match:

To match on multiple fields, use a composite rule with `and`:

yaml

- name: phone_and_zip
  type: composite
  operator: and
  children:
    - name: phone_exact
      type: exact
      field: phone
      weight: 0.4
    - name: zip_exact
      type: exact
      field: zip_code
      weight: 0.4

This rule only fires when both phone and zip_code are identical across two records.

Similarity Match

Compares string values using a similarity algorithm and fires when the similarity score meets or exceeds the threshold.

yaml

- name: name_fuzzy
  type: similarity
  field: name
  algorithm: jaro_winkler
  threshold: 0.88
  weight: 0.6

Required fields for similarity rules:

Field	Type	Description
`algorithm`	`string`	One of `jaro_winkler`, `levenshtein`, `soundex`, `metaphone`, `cosine`
`threshold`	`float`	Minimum similarity score to consider a match (0.0 - 1.0)

How it works:

Both values are normalized (trimmed, lowercased)
The similarity algorithm produces a score between 0.0 and 1.0
If the score >= threshold, the rule fires and contributes its weight to the overall score
If the score < threshold, the rule contributes 0.0

Example with Levenshtein:

yaml

- name: company_name_fuzzy
  type: similarity
  field: company_name
  algorithm: levenshtein
  threshold: 0.80
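  weight: 0.5

"Acme Corporation" vs "Acme Corporaton" (one dropped letter) produces a Levenshtein similarity of 1 - 1/16 ≈ 0.94 when normalized by the longer string's length, which exceeds the 0.80 threshold. An abbreviation such as "Acme Corp" requires 7 deletions, giving 1 - 7/16 ≈ 0.56, so it would not fire at this threshold.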

Range Match

Compares numeric or date values and fires when they fall within a specified tolerance. Useful for financial reconciliation, date matching, and measurement comparisons.

yaml

- name: amount_close
  type: range
  field: amount
  tolerance: 0.05    # 5% tolerance
  weight: 0.5

Tolerance modes:

`tolerance` value	Mode	Example
`0.05`	Percentage (5%)	`100.00` matches `95.00` - `105.00`
`5.0`	Absolute	`100.00` matches `95.00` - `105.00`

How tolerance mode is determined

Values less than or equal to 1.0 are treated as percentages. Values greater than 1.0 are treated as absolute tolerances. To specify an absolute tolerance of 1.0 or less, use a composite rule with custom logic.
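
A minimal sketch of the mode selection described above, for numeric values only. The function name is hypothetical, and treating the first value as the reference for the percentage is an assumption; the text does not say which record the percentage is relative to.

```python
def within_tolerance(a, b, tolerance):
    """Return True when b falls within the tolerance of a."""
    diff = abs(a - b)
    if tolerance <= 1.0:
        # Percentage mode. Assumption: the percentage is relative to a.
        return diff <= tolerance * abs(a)
    # Absolute mode: tolerance is in the field's own units.
    return diff <= tolerance
```

Note the consequence of the cutoff: a tolerance of `1.0` means 100%, not "within 1 unit".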

Date range example:

yaml

- name: transaction_date_close
  type: range
  field: transaction_date
  tolerance: 3       # Within 3 days
  weight: 0.4

Composite Match

Combine multiple sub-rules using the `and` or `or` operator. Composite rules let you express complex matching logic that cannot be captured by a single rule.

`and` operator: all sub-rules must fire:

yaml

- name: address_match
  type: composite
  operator: and
  children:
    - name: street_fuzzy
      type: similarity
      field: street
      algorithm: jaro_winkler
      threshold: 0.85
      weight: 0.4
    - name: zip_exact
      type: exact
      field: zip_code
      weight: 0.3

This rule only contributes to the score when both the street name is similar and the zip code is an exact match. Use `and` when you need multiple signals to corroborate each other.

`or` operator: at least one sub-rule must fire:

yaml

- name: contact_match
  type: composite
  operator: or
  children:
    - name: email_exact
      type: exact
      field: email
      weight: 1.0
    - name: phone_exact
      type: exact
      field: phone
      weight: 0.9
    - name: name_and_zip
      type: composite
      operator: and
      children:
        - name: name_fuzzy
          type: similarity
          field: name
          algorithm: jaro_winkler
          threshold: 0.88
          weight: 0.6
        - name: zip_exact
          type: exact
          field: zip_code
          weight: 0.3

This rule fires when any one of the following is true: email matches exactly, phone matches exactly, or both name is similar and zip code matches. Note that composite rules can be nested; the third sub-rule is itself a composite `and`.

Scoring behavior:

`and`: All children must fire for any of them to contribute their weights. The composite score is `min(child_scores)`.
`or`: At least one child must fire. The composite score is `max(child_scores)`.
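
The min/max behavior, including nesting, can be sketched recursively. This is an illustration under stated assumptions: a leaf is represented by its already-computed score (0.0 when it did not fire), a composite by an `(operator, children)` tuple, and `composite_score` is a hypothetical name.

```python
def composite_score(node):
    """Score a leaf (a number) or a composite given as (operator, children)."""
    if isinstance(node, (int, float)):
        return float(node)
    operator, children = node
    scores = [composite_score(child) for child in children]
    if operator == "and":
        # A child that did not fire scores 0.0, so min() is 0.0 unless all fire.
        return min(scores)
    if operator == "or":
        return max(scores)
    raise ValueError(f"unknown operator: {operator}")
```

With the nested `contact_match` shape shown earlier, `composite_score(("or", [0.0, 0.0, ("and", [0.91, 1.0])]))` evaluates the inner `and` first and then takes the maximum across the three children.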

Algorithm Comparison

Choose the right algorithm for your data:

Algorithm	Best For	Speed	Handles Typos	Handles Transpositions	Unicode Support
`jaro_winkler`	Person names	Fast	Good	Excellent	Yes
`levenshtein`	Short strings (<50 chars)	Medium	Excellent	Good	Yes
`soundex`	Phonetic name matching	Fast	Poor	N/A	English only
`metaphone`	English name variants	Fast	Moderate	N/A	English only
`cosine`	Long strings, addresses	Medium	Good	Good	Yes

When to use each algorithm

jaro_winkler is the default choice for person names. It gives extra weight to matching prefixes, which aligns with how name typos typically occur (errors are more common later in a string). Recommended threshold: 0.85 - 0.92.

yaml

# Good: Matches "Robert" vs "Robret", "Katherine" vs "Catherine"
algorithm: jaro_winkler
threshold: 0.88
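
For intuition about the prefix bonus, here is a textbook Jaro-Winkler sketch in Python. It is not the engine's code; the prefix scale of 0.1 and the 4-character prefix cap are the conventional defaults and are assumed here.

```python
def jaro(s1, s2):
    """Plain Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted by the length of the shared prefix (max 4)."""
    s1, s2 = s1.strip().lower(), s2.strip().lower()
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)
```

A transposition late in the string ("Robert" vs "Robret") keeps the full three-character prefix bonus, while a first-letter difference ("Katherine" vs "Catherine") gets no bonus and relies on the plain Jaro score.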

levenshtein counts the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another, then normalizes by the longer string length. Best for short, structured strings. Recommended threshold: 0.75 - 0.90.

yaml

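# Good: Matches "123 Main Stret" vs "123 Main Street" (1 edit, similarity ~0.93)
# Note: "123 Main St" vs "123 Main Street" needs 4 edits (~0.73), below this threshold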
algorithm: levenshtein
threshold: 0.80
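
The normalization described above (edit distance divided by the longer string's length) can be sketched as follows. The function names are hypothetical, and the lowercase/trim step mirrors the normalization described for similarity rules.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]

def levenshtein_similarity(a, b):
    """1.0 for identical strings, approaching 0.0 as more edits are needed."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest
```

Because the score is normalized by length, a fixed number of edits costs more on short strings: one edit between "Jon" and "John" already drops the similarity to 0.75.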

soundex encodes strings by their English pronunciation. Two names that sound alike produce the same code. Useful when data entry is done by ear (phone agents, medical intake). There is no meaningful threshold; it is a binary match.

yaml

# Good: Matches "Smith" vs "Smyth", "Stephen" vs "Steven"
algorithm: soundex
threshold: 1.0    # Soundex is binary: 1.0 = same code, 0.0 = different
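
To see why these pairs collapse to the same code, here is a sketch of the classic American Soundex encoding. Which Soundex variant the engine uses is not stated, so treat this as an assumption.

```python
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Return the 4-character Soundex code, or "" if name has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = [letters[0]]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            result.append(code)
        previous = code  # vowels reset the run, so repeats after them count
    return ("".join(result) + "000")[:4]
```

Both "Smith" and "Smyth" encode to `S530`, and both "Stephen" and "Steven" encode to `S315`, which is why the comparison is binary.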

metaphone is an improved phonetic algorithm that handles English pronunciation rules more accurately than Soundex. Better with names that have silent letters or complex consonant clusters.

yaml

# Good: Matches "Wright" vs "Right", "Knight" vs "Nite"
algorithm: metaphone
threshold: 1.0    # Also binary like soundex

cosine splits strings into overlapping character bigrams and computes the cosine similarity of the resulting frequency vectors. Effective for longer strings like addresses or company names where word order may vary.

yaml

# Good: Matches "123 Main Street, Apt 4" vs "Apt 4, 123 Main St"
algorithm: cosine
threshold: 0.65
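
A minimal sketch of bigram cosine similarity as described above. The names are hypothetical, and details such as whitespace handling or padding in the real engine are not specified, so this version simply lowercases and trims.

```python
import math
from collections import Counter

def bigrams(text):
    """Count overlapping two-character substrings."""
    text = text.strip().lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine_similarity(a, b):
    """Cosine of the angle between the two bigram frequency vectors."""
    va, vb = bigrams(a), bigrams(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

Because the vectors ignore where each bigram occurs, reordered tokens still share most of their bigrams, which is what makes this measure tolerant of word order.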

Weight Guidelines

Weights encode your confidence that a field match indicates a true entity match. Higher weights mean stronger signals.

Recommended weights by field type

Field Type	Rule Type	Recommended Weight	Rationale
Email	`exact`	0.9 - 1.0	Emails are near-unique identifiers
Phone	`exact`	0.8 - 0.9	Phones can be shared (households, businesses)
SSN / National ID	`exact`	1.0	Truly unique, but verify data quality first
Full name	`similarity`	0.5 - 0.7	Common names cause false positives
First name only	`similarity`	0.2 - 0.3	Low discriminative power alone
Last name only	`similarity`	0.3 - 0.4	More discriminative than first name
Address (full)	`similarity`	0.3 - 0.5	Formatting varies widely across sources
Zip / Postal code	`exact`	0.1 - 0.2	Many people share the same zip code
Date of birth	`exact`	0.3 - 0.5	Strong signal when combined with name
Transaction amount	`range`	0.3 - 0.5	Useful for financial reconciliation
Account number	`exact`	0.9 - 1.0	Near-unique like email

Weight tuning tips

TIP

Start with the recommended weights and adjust based on your data
Run a reconciliation on a labeled sample and inspect false positives and false negatives
If you see too many false positives, reduce weights on low-discriminative fields (name, zip)
If you see too many false negatives, lower the similarity threshold or increase weights on fuzzy rules
The sum of all weights does not need to equal 1.0. The decision thresholds are relative to the weighted sum

Blocking Strategies

Without blocking, the engine must compare every record against every other record, resulting in O(n^2) comparisons. For 100,000 records, that is 5 billion comparisons. Blocking reduces this by grouping records into buckets and only comparing within each bucket.

yaml

blocking:
  strategy: exact
  keys: [email, zip_code]
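
The comparison count quoted above is n(n-1)/2 unordered pairs. A quick sketch of the arithmetic, with hypothetical function names and an illustrative assumption of disjoint, equally sized buckets:

```python
def pair_count(n):
    """Number of unordered pairs among n records."""
    return n * (n - 1) // 2

def blocked_pair_count(bucket_sizes):
    """Pairs compared when records are only compared within their bucket."""
    return sum(pair_count(size) for size in bucket_sizes)

full = pair_count(100_000)                    # 4,999,950,000 pairs
blocked = blocked_pair_count([100] * 1_000)   # 1,000 buckets of 100 records
```

Under that assumption, blocking cuts the work from about 5 billion comparisons to 4.95 million. Real buckets overlap and vary in size, so actual savings depend on the data.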

Strategies

Strategy	How It Groups	Best For	Trade-off
`exact`	Records must share at least one identical blocking key value	Clean data with reliable key fields	Fast, but misses matches when blocking keys differ
`phonetic`	Records grouped by phonetic encoding (Soundex) of blocking keys	Name-based matching with spelling variations	Catches more matches, larger comparison windows
`ngram`	Records grouped by overlapping character n-gram buckets	Dirty data where no single field is reliable	Broadest recall, but slowest

Exact blocking

The default strategy. Records are placed into buckets by the exact value of each blocking key. Two records are candidates for comparison if they share the same value for at least one blocking key.

yaml

blocking:
  strategy: exact
  keys: [email, zip_code]

Given records:

Record A: email=x@example.com, zip=10001
Record B: email=x@example.com, zip=90210
Record C: email=y@example.com, zip=10001

A and B are candidates (same email)
A and C are candidates (same zip_code)
B and C are not candidates (no shared blocking key values)
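
The candidate selection above can be sketched as one bucket per (key, value) pair. This is an illustration, not the engine's code: the function name is hypothetical, the email addresses are placeholders, and skipping null key values is an assumption.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, keys):
    """Return the set of record-id pairs that share any blocking key value."""
    buckets = defaultdict(list)
    for record_id, record in records.items():
        for key in keys:
            value = record.get(key)
            if value is not None:  # assumption: nulls are not bucketed
                buckets[(key, value)].append(record_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

records = {
    "A": {"email": "x@example.com", "zip_code": "10001"},
    "B": {"email": "x@example.com", "zip_code": "90210"},
    "C": {"email": "y@example.com", "zip_code": "10001"},
}
pairs = candidate_pairs(records, ["email", "zip_code"])
```

Here `pairs` contains A-B (shared email) and A-C (shared zip code) but not B-C.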

Phonetic blocking

Groups records by the phonetic encoding of each blocking key. Useful when names are spelled differently but sound alike.

yaml

blocking:
  strategy: phonetic
  keys: [last_name]

"Smith", "Smyth", and "Smithe" all produce the same Soundex code and would be grouped together for comparison.

N-gram blocking

Groups records by overlapping character n-grams. Records that share a sufficient number of n-grams are placed in the same bucket. This is the broadest strategy and catches the most potential matches, but creates the largest comparison windows.

yaml

blocking:
  strategy: ngram
  keys: [name, address]

WARNING

N-gram blocking can significantly increase processing time on large datasets. Use it only when data quality is too low for exact or phonetic blocking. Consider limiting blocking keys to 1-2 fields.

Choosing a blocking strategy

Scenario	Recommended Strategy	Keys
Clean CRM data with reliable email	`exact`	`[email]`
Healthcare patient matching	`phonetic`	`[last_name, date_of_birth]`
Messy address data	`ngram`	`[address]`
Financial transactions	`exact`	`[account_number, transaction_date]`
Lead deduplication	`exact`	`[email, phone]`

Field References

Rules reference fields by name. These field names must exist in the attributes list of at least one source definition. If a source does not have a field referenced by a rule, records from that source are skipped for that rule (they are not penalized).

Validation

The spec validator catches field reference errors at validation time:

python

from kanoniv import Spec, validate

spec = Spec.from_file("spec.yaml")
result = validate(spec)
# Error: Rule "name_fuzzy" references field "full_name" which does not
#        exist in any source. Did you mean "name"?

The validator provides:

Missing field detection: fields referenced in rules but not present in any source
Typo suggestions: "Did you mean...?" suggestions based on edit distance from available fields
Unused field warnings: fields declared in sources but never referenced by any rule
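
To illustrate how a "Did you mean...?" suggestion can be produced, the sketch below uses Python's standard `difflib`. Note that `difflib` ranks candidates by a sequence-matching ratio rather than edit distance, so it is a stand-in for the validator's approach, and `suggest_field` is a hypothetical helper, not part of the kanoniv API.

```python
import difflib

def suggest_field(unknown, available, cutoff=0.5):
    """Return the closest available field name, or None if nothing is close."""
    matches = difflib.get_close_matches(unknown, available, n=1, cutoff=cutoff)
    return matches[0] if matches else None

suggestion = suggest_field("full_name", ["name", "email", "phone"])
```

The `cutoff` value of 0.5 is an arbitrary choice for this example; a stricter cutoff produces fewer but safer suggestions.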

Evaluation Order

Rules are evaluated in declaration order, but all rules contribute to the final weighted sum score. The evaluation order matters for two reasons:

Short-circuiting: If early rules produce a score above the match threshold, later rules may be skipped for performance (the result would be the same). This optimization is automatic.
Readability: Declaring high-weight, high-confidence rules first makes the spec easier to understand.

The final score for a candidate pair is computed as:

score = sum(rule.weight * rule_match for each rule)

Where rule_match is 1.0 for exact/range rules that fire, the similarity score for similarity rules that exceed their threshold, and the model confidence for ML rules that exceed their threshold.

The decision thresholds then classify the pair:

score >= match --> automatic match
review <= score < match --> sent to review queue
score < review --> rejected (no match)
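
The formula and the three decision bands can be sketched together. The function names and the sample weights are illustrative; each rule result is given as a `(weight, rule_match)` pair, with `rule_match` set to 0.0 when the rule did not fire.

```python
def pair_score(results):
    """Weighted sum over (weight, rule_match) pairs."""
    return sum(weight * rule_match for weight, rule_match in results)

def classify(score, match, review):
    """Map a score to one of the three decision bands."""
    if score >= match:
        return "match"
    if score >= review:
        return "review"
    return "reject"

# A single exact rule with weight 0.85 fires; nothing else does.
decision = classify(pair_score([(0.85, 1.0), (0.6, 0.0)]), match=0.9, review=0.7)
```

With thresholds of 0.9 and 0.7, that pair scores 0.85 and lands in the review queue.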

Complete Example

A production-ready rules section for a Customer 360 use case, matching customer records across CRM, billing, and support systems:

yaml

rules:
  # High-confidence identifier match.
  # Email is near-unique -- if two records share an email,
  # they are almost certainly the same person.
  - name: email_exact
    type: exact
    field: email
    weight: 1.0

  # Phone number as a secondary strong signal.
  # Slightly lower weight than email because phone numbers
  # can be shared within households.
  - name: phone_exact
    type: exact
    field: phone
    weight: 0.85

  # Fuzzy name matching to catch typos and abbreviations.
  # jaro_winkler works well for person names because it
  # rewards matching prefixes (e.g., "Rob" in "Robert"/"Roberto").
  - name: name_fuzzy
    type: similarity
    field: name
    algorithm: jaro_winkler
    threshold: 0.88
    weight: 0.6

  # Address matching requires both street similarity and zip match.
  # Using a composite 'and' because a street name alone is ambiguous
  # (e.g., "Main Street" exists in every city).
  - name: address_composite
    type: composite
    operator: and
    children:
      - name: street_fuzzy
        type: similarity
        field: street_address
        algorithm: levenshtein
        threshold: 0.80
        weight: 0.4
      - name: zip_exact
        type: exact
        field: zip_code
        weight: 0.3

  # Date of birth as a corroborating signal.
  # Not high weight on its own, but valuable when combined
  # with name similarity to break ties.
  - name: dob_exact
    type: exact
    field: date_of_birth
    weight: 0.4

blocking:
  strategy: exact
  keys: [email, phone, zip_code]

decision:
  scoring: weighted_sum
  thresholds:
    match: 0.9
    review: 0.7

This configuration:

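Automatically matches when email matches (1.0 is at or above the 0.9 match threshold)
Sends to review when only phone matches (0.85 falls between the 0.7 review threshold and the 0.9 match threshold); phone plus any other firing rule reaches the match threshold
Automatically matches when a similar name is corroborated by date of birth (at least 0.6 * 0.88 + 0.4 = 0.928)
Rejects when only one weak signal fires (date of birth alone = 0.4, a similar name alone = at most 0.6, both below the 0.7 review threshold); zip code never scores on its own because zip_exact is a child of the `and` composite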
Blocks on email, phone, and zip to keep comparisons manageable

Validation Limits

Constraint	Limit
Max rules per spec	50
Max sub-rules per composite	10
Max composite nesting depth	3
Weight range	0.0 - 1.0
Threshold range	0.0 - 1.0
Max fields per rule	5
Max blocking keys	5
