Rules
Rules define how records are compared and scored for matching. Each rule specifies a type, the fields to compare, and a weight that contributes to the overall match score. Rules are the core of your identity resolution logic: they encode your domain knowledge about what makes two records represent the same entity.
Rule Structure
Every rule requires four fields (name, type, field, and weight):
rules:
  - name: email_exact   # Unique identifier for this rule
    type: exact         # How values are compared
    field: email        # Which field to compare
    weight: 1.0         # Score contribution (0.0 - 1.0)

| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Unique rule identifier (alphanumeric + underscore) |
| type | string | Yes | One of exact, similarity, range, composite, ml |
| field | string | Yes | Field this rule compares (must exist in at least one source). Alias: fields |
| weight | float | Yes | Score contribution, between 0.0 and 1.0 |
| algorithm | string | Conditional | Required for similarity type |
| threshold | float | Conditional | Required for similarity, range, and ml types |
| tolerance | float | Conditional | Required for range type |
| operator | string | Conditional | and or or, required for composite type |
| children | array | Conditional | Sub-rules, required for composite type |
| model | string | Conditional | Model identifier, required for ml type (Cloud) |
Rule Types
Exact Match
The simplest and most performant rule type. Records match if values are identical after normalization.
- name: email_exact
  type: exact
  field: email
  weight: 1.0

Behavior:
- String comparison is case-insensitive ([email protected] matches [email protected], where the two addresses differ only in letter case)
- Leading and trailing whitespace is trimmed before comparison
- Null values never match (even against other nulls)
- Numeric fields are compared by value, not string representation (100 matches 100.0)
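The behavior listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the engine's implementation: the function name `exact_match` is hypothetical, and values are assumed to arrive already typed (strings as `str`, numbers as `int` or `float`).

```python
from typing import Any


def exact_match(a: Any, b: Any) -> bool:
    """Hypothetical sketch of the exact-match behavior described above."""
    # Null values never match, even against other nulls.
    if a is None or b is None:
        return False
    # Numeric values are compared by value, so 100 matches 100.0.
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a == b
    # Everything else is compared as a trimmed, case-insensitive string.
    return str(a).strip().lower() == str(b).strip().lower()
```

For example, `exact_match("  ACME ", "acme")` is true, while `exact_match(None, None)` is false.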
Multi-field exact match:
To match on multiple fields, use a composite rule with and:
- name: phone_and_zip
  type: composite
  operator: and
  children:
    - name: phone_exact
      type: exact
      field: phone
      weight: 0.4
    - name: zip_exact
      type: exact
      field: zip_code
      weight: 0.4

This rule only fires when both phone and zip_code are identical across two records.
Similarity Match
Compares string values using a similarity algorithm and fires when the similarity score meets or exceeds the threshold.
- name: name_fuzzy
  type: similarity
  field: name
  algorithm: jaro_winkler
  threshold: 0.88
  weight: 0.6

Required fields for similarity rules:
| Field | Type | Description |
|---|---|---|
| algorithm | string | One of jaro_winkler, levenshtein, soundex, metaphone, cosine |
| threshold | float | Minimum similarity score to consider a match (0.0 - 1.0) |
How it works:
- Both values are normalized (trimmed, lowercased)
- The similarity algorithm produces a score between 0.0 and 1.0
- If the score >= threshold, the rule fires and contributes its weight to the overall score
- If the score < threshold, the rule contributes 0.0
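The steps above amount to a normalize, score, and compare pipeline. The sketch below illustrates that flow only. The names `similarity` and `rule_fires` are hypothetical, and `difflib.SequenceMatcher` from the Python standard library is used as a stand-in scorer; it is not one of the algorithms the engine supports.

```python
from difflib import SequenceMatcher
from typing import Tuple


def similarity(a: str, b: str) -> float:
    # Stand-in scorer returning a value in [0.0, 1.0]. An assumption for
    # illustration; a real rule would use jaro_winkler, levenshtein, etc.
    return SequenceMatcher(None, a, b).ratio()


def rule_fires(a: str, b: str, threshold: float) -> Tuple[bool, float]:
    # Step 1: normalize both values (trim and lowercase).
    a_norm, b_norm = a.strip().lower(), b.strip().lower()
    # Step 2: compute a similarity score between 0.0 and 1.0.
    score = similarity(a_norm, b_norm)
    # Steps 3-4: the rule fires only when the score meets the threshold.
    return score >= threshold, score
```

Calling `rule_fires("  Robert ", "robert", 0.88)` fires with a score of 1.0, because normalization removes the only differences.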
Example with Levenshtein:
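- name: company_name_fuzzy
  type: similarity
  field: company_name
  algorithm: levenshtein
  threshold: 0.80
  weight: 0.5

"Acme Corporation" vs "Acme Corp" differ by 7 single-character edits. Normalized by the longer string's 16 characters, that is a Levenshtein similarity of 1 - 7/16, or about 0.56, which falls below the 0.80 threshold, so this rule would not fire on that pair. Abbreviations that drop many characters need a lower threshold or a different algorithm; small typos in the same name score well above 0.80.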
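To check a threshold against sample values before committing to it, a length-normalized Levenshtein similarity can be computed by hand. The sketch below assumes the normalization this page describes (edits divided by the longer string's length); the function names are hypothetical and the engine's own implementation may differ in detail.

```python
def levenshtein_distance(a: str, b: str) -> int:
    # Dynamic-programming edit distance over insertions, deletions,
    # and substitutions, keeping only the previous row in memory.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    # Assumed normalization: 1 - distance / length of the longer string.
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest
```

Running candidate pairs from your own data through `levenshtein_similarity` is a quick way to see whether a threshold such as 0.80 is realistic for them.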
Range Match
Compares numeric or date values and fires when they fall within a specified tolerance. Useful for financial reconciliation, date matching, and measurement comparisons.
- name: amount_close
  type: range
  field: amount
  tolerance: 0.05 # 5% tolerance
  weight: 0.5

Tolerance modes:
| tolerance value | Mode | Example |
|---|---|---|
| 0.05 | Percentage (5%) | 100.00 matches 95.00 - 105.00 |
| 5.0 | Absolute | 100.00 matches 95.00 - 105.00 |
How tolerance mode is determined
Values less than or equal to 1.0 are treated as percentages. Values greater than 1.0 are treated as absolute tolerances. To specify an absolute tolerance of 1.0 or less, use a composite rule with custom logic.
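The mode selection can be sketched as a single comparison. This is an illustrative sketch, not the engine's code: the function name `within_tolerance` is hypothetical, and in percentage mode the allowed difference is assumed to be relative to the first value, which is consistent with the table above but not stated explicitly.

```python
def within_tolerance(a: float, b: float, tolerance: float) -> bool:
    if tolerance <= 1.0:
        # Percentage mode: the tolerance is a fraction of the first value
        # (assumption: the first value is the reference).
        allowed = abs(a) * tolerance
    else:
        # Absolute mode: the tolerance is used as-is.
        allowed = tolerance
    return abs(a - b) <= allowed
```

With this sketch, `within_tolerance(100.0, 96.0, 0.05)` and `within_tolerance(100.0, 96.0, 5.0)` both hold, while a difference of 6.0 fails in either mode.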
Date range example:
- name: transaction_date_close
  type: range
  field: transaction_date
  tolerance: 3 # Within 3 days
  weight: 0.4

Composite Match
Combine multiple sub-rules using and or or operators. Composite rules let you express complex matching logic that cannot be captured by a single rule.
and operator: all sub-rules must fire:
- name: address_match
  type: composite
  operator: and
  children:
    - name: street_fuzzy
      type: similarity
      field: street
      algorithm: jaro_winkler
      threshold: 0.85
      weight: 0.4
    - name: zip_exact
      type: exact
      field: zip_code
      weight: 0.3

This rule only contributes to the score when both the street name is similar and the zip code is an exact match. Use and when you need multiple signals to corroborate each other.
or operator: at least one sub-rule must fire:
- name: contact_match
  type: composite
  operator: or
  children:
    - name: email_exact
      type: exact
      field: email
      weight: 1.0
    - name: phone_exact
      type: exact
      field: phone
      weight: 0.9
    - name: name_and_zip
      type: composite
      operator: and
      children:
        - name: name_fuzzy
          type: similarity
          field: name
          algorithm: jaro_winkler
          threshold: 0.88
          weight: 0.6
        - name: zip_exact
          type: exact
          field: zip_code
          weight: 0.3

This rule fires when any one of the following is true: email matches exactly, phone matches exactly, or both name is similar and zip code matches. Note that composite rules can be nested; the third sub-rule is itself a composite and.
Scoring behavior:
- and: All children must fire for any of them to contribute their weights. The composite score is min(child_scores).
- or: At least one child must fire. The composite score is max(child_scores).
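The min/max behavior can be sketched as follows. The function name `composite_score` is hypothetical, and a child that did not fire is represented as `None`. What exactly counts as a child's score (for instance, whether the child's weight is already applied) is not specified here, so the sketch treats the scores as opaque numbers.

```python
from typing import List, Optional


def composite_score(operator: str, child_scores: List[Optional[float]]) -> float:
    """child_scores holds one entry per child; None means the child did not fire."""
    fired = [score for score in child_scores if score is not None]
    if operator == "and":
        # Every child must fire; the weakest child determines the score.
        if not fired or len(fired) != len(child_scores):
            return 0.0
        return min(fired)
    if operator == "or":
        # One firing child is enough; the strongest child determines the score.
        return max(fired) if fired else 0.0
    raise ValueError(f"unknown operator: {operator!r}")
```

For nested composites, the inner composite's result would be passed in as one of the outer composite's child scores.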
Algorithm Comparison
Choose the right algorithm for your data:
| Algorithm | Best For | Speed | Handles Typos | Handles Transpositions | Unicode Support |
|---|---|---|---|---|---|
| jaro_winkler | Person names | Fast | Good | Excellent | Yes |
| levenshtein | Short strings (<50 chars) | Medium | Excellent | Good | Yes |
| soundex | Phonetic name matching | Fast | Poor | N/A | English only |
| metaphone | English name variants | Fast | Moderate | N/A | English only |
| cosine | Long strings, addresses | Medium | Good | Good | Yes |
When to use each algorithm
jaro_winkler is the default choice for person names. It gives extra weight to matching prefixes, which aligns with how name typos typically occur (errors are more common later in a string). Recommended threshold: 0.85 - 0.92.
# Good: Matches "Robert" vs "Robret", "Katherine" vs "Catherine"
algorithm: jaro_winkler
threshold: 0.88

levenshtein counts the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another, then normalizes by the longer string length. Best for short, structured strings. Recommended threshold: 0.75 - 0.90.
# Good: catches small typos, e.g. 1 edit over 15 characters scores ~0.93
# Note: "123 Main St" vs "123 Main Street" is 4 edits over 15 characters (~0.73), below this 0.80 threshold
algorithm: levenshtein
threshold: 0.80

soundex encodes strings by their English pronunciation. Two names that sound alike produce the same code. Useful when data entry is done by ear (phone agents, medical intake). There is no meaningful threshold; it is a binary match.
# Good: Matches "Smith" vs "Smyth", "Stephen" vs "Steven"
algorithm: soundex
threshold: 1.0 # Soundex is binary: 1.0 = same code, 0.0 = different

metaphone is an improved phonetic algorithm that handles English pronunciation rules more accurately than Soundex. Better with names that have silent letters or complex consonant clusters.
# Good: Matches "Wright" vs "Right", "Knight" vs "Nite"
algorithm: metaphone
threshold: 1.0 # Also binary like soundex

cosine splits strings into overlapping character bigrams and computes the cosine similarity of the resulting frequency vectors. Effective for longer strings like addresses or company names where word order may vary.
# Good: Matches "123 Main Street, Apt 4" vs "Apt 4, 123 Main St"
algorithm: cosine
threshold: 0.65

Weight Guidelines
Weights encode your confidence that a field match indicates a true entity match. Higher weights mean stronger signals.
Recommended weights by field type
| Field Type | Rule Type | Recommended Weight | Rationale |
|---|---|---|---|
| Email | exact | 0.9 - 1.0 | Emails are near-unique identifiers |
| Phone | exact | 0.8 - 0.9 | Phones can be shared (households, businesses) |
| SSN / National ID | exact | 1.0 | Truly unique, but verify data quality first |
| Full name | similarity | 0.5 - 0.7 | Common names cause false positives |
| First name only | similarity | 0.2 - 0.3 | Low discriminative power alone |
| Last name only | similarity | 0.3 - 0.4 | More discriminative than first name |
| Address (full) | similarity | 0.3 - 0.5 | Formatting varies widely across sources |
| Zip / Postal code | exact | 0.1 - 0.2 | Many people share the same zip code |
| Date of birth | exact | 0.3 - 0.5 | Strong signal when combined with name |
| Transaction amount | range | 0.3 - 0.5 | Useful for financial reconciliation |
| Account number | exact | 0.9 - 1.0 | Near-unique like email |
Weight tuning tips
TIP
- Start with the recommended weights and adjust based on your data
- Run a reconciliation on a labeled sample and inspect false positives and false negatives
- If you see too many false positives, reduce weights on low-discriminative fields (name, zip)
- If you see too many false negatives, lower the similarity threshold or increase weights on fuzzy rules
- The sum of all weights does not need to equal 1.0. The decision thresholds are relative to the weighted sum
Blocking Strategies
Without blocking, the engine must compare every record against every other record, resulting in O(n^2) comparisons. For 100,000 records, that is 5 billion comparisons. Blocking reduces this by grouping records into buckets and only comparing within each bucket.
blocking:
  strategy: exact
  keys: [email, zip_code]

Strategies
| Strategy | How It Groups | Best For | Trade-off |
|---|---|---|---|
| exact | Records must share at least one identical blocking key value | Clean data with reliable key fields | Fast, but misses matches when blocking keys differ |
| phonetic | Records grouped by phonetic encoding (Soundex) of blocking keys | Name-based matching with spelling variations | Catches more matches, larger comparison windows |
| ngram | Records grouped by overlapping character n-gram buckets | Dirty data where no single field is reliable | Broadest recall, but slowest |
Exact blocking
The default strategy. Records are placed into buckets by the exact value of each blocking key. Two records are candidates for comparison if they share the same value for at least one blocking key.
blocking:
  strategy: exact
  keys: [email, zip_code]

Given records:
Record A: [email protected], zip=10001
Record B: [email protected], zip=90210
Record C: [email protected], zip=10001

- A and B are candidates (same email)
- A and C are candidates (same zip_code)
- B and C are not candidates (no shared blocking key values)
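The candidate selection above can be reproduced with a small bucket index. This is a minimal sketch, not the engine's implementation: `candidate_pairs` is a hypothetical helper, and the email addresses are placeholders chosen for illustration so that A and B share an email while C does not.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def candidate_pairs(records: Dict[str, dict], keys: List[str]) -> Set[Tuple[str, str]]:
    # One bucket per (blocking key, value); records with a missing value are skipped.
    buckets = defaultdict(set)
    for record_id, record in records.items():
        for key in keys:
            value = record.get(key)
            if value is not None:
                buckets[(key, value)].add(record_id)
    # Two records are candidates if they land in at least one common bucket.
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = {
    "A": {"email": "pat@example.com", "zip_code": "10001"},  # placeholder data
    "B": {"email": "pat@example.com", "zip_code": "90210"},
    "C": {"email": "sam@example.com", "zip_code": "10001"},
}
pairs = candidate_pairs(records, ["email", "zip_code"])
```

Here `pairs` contains `("A", "B")` and `("A", "C")` but not `("B", "C")`, so only two of the three possible comparisons are performed.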
Phonetic blocking
Groups records by the phonetic encoding of each blocking key. Useful when names are spelled differently but sound alike.
blocking:
  strategy: phonetic
  keys: [last_name]

"Smith", "Smyth", and "Smithe" all produce the same Soundex code and would be grouped together for comparison.
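To see why those three spellings share a bucket, here is a compact sketch of the standard American Soundex encoding followed by grouping on the resulting code. The function and variable names are hypothetical, and the engine's encoder may handle edge cases (non-English letters, prefixes) differently.

```python
from collections import defaultdict

# Letter-to-digit table for American Soundex; vowels, h, w, and y have no digit.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name: str) -> str:
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate consonants with the same code
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            encoded += digit
        previous = digit  # a vowel resets this, so a repeated code after it counts again
    return (encoded + "000")[:4]  # pad or truncate to one letter plus three digits


# Group last names into blocking buckets keyed by their Soundex code.
buckets = defaultdict(list)
for last_name in ["Smith", "Smyth", "Smithe", "Jones"]:
    buckets[soundex(last_name)].append(last_name)
```

All three Smith variants encode to S530 and share a bucket, while "Jones" encodes to J520 and is never compared against them.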
N-gram blocking
Groups records by overlapping character n-grams. Records that share a sufficient number of n-grams are placed in the same bucket. This is the broadest strategy and catches the most potential matches, but creates the largest comparison windows.
blocking:
  strategy: ngram
  keys: [name, address]

WARNING
N-gram blocking can significantly increase processing time on large datasets. Use it only when data quality is too low for exact or phonetic blocking. Consider limiting blocking keys to 1-2 fields.
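One way to picture n-gram blocking is an inverted index from each n-gram to the records that contain it. The sketch below is an assumption-laden illustration: the helper names, the choice of trigrams, and the `min_shared` cutoff of 3 are all hypothetical, since this page does not state the n-gram size or what counts as "a sufficient number" of shared n-grams.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, Set, Tuple


def ngrams(text: str, n: int = 3) -> Set[str]:
    # Overlapping character n-grams of the trimmed, lowercased text.
    normalized = text.strip().lower()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def ngram_candidates(values: Dict[str, str], n: int = 3,
                     min_shared: int = 3) -> Set[Tuple[str, str]]:
    # Inverted index: n-gram -> ids of the records containing it.
    index = defaultdict(set)
    for record_id, text in values.items():
        for gram in ngrams(text, n):
            index[gram].add(record_id)
    # Count shared n-grams per pair, visiting only pairs that co-occur in a bucket.
    shared = Counter()
    for ids in index.values():
        for pair in combinations(sorted(ids), 2):
            shared[pair] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}
```

Each extra blocking key adds another index of this kind, which is one reason the warning above suggests limiting n-gram blocking to one or two fields.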
Choosing a blocking strategy
| Scenario | Recommended Strategy | Keys |
|---|---|---|
| Clean CRM data with reliable email | exact | [email] |
| Healthcare patient matching | phonetic | [last_name, date_of_birth] |
| Messy address data | ngram | [address] |
| Financial transactions | exact | [account_number, transaction_date] |
| Lead deduplication | exact | [email, phone] |
Field References
Rules reference fields by name. These field names must exist in the attributes list of at least one source definition. If a source does not have a field referenced by a rule, records from that source are skipped for that rule (they are not penalized).
Validation
The spec validator catches field reference errors at validation time:
from kanoniv import Spec, validate
spec = Spec.from_file("spec.yaml")
result = validate(spec)
# Error: Rule "name_fuzzy" references field "full_name" which does not
# exist in any source. Did you mean "name"?

The validator provides:
- Missing field detection: fields referenced in rules but not present in any source
- Typo suggestions: "Did you mean...?" suggestions based on edit distance from available fields
- Unused field warnings: fields declared in sources but never referenced by any rule
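The "Did you mean...?" behavior can be approximated with the Python standard library. This sketch is not the validator's implementation: `suggest_field` is a hypothetical helper, and `difflib.get_close_matches` ranks candidates by a sequence-matching ratio rather than by edit distance, so its suggestions may differ from the validator's.

```python
from difflib import get_close_matches
from typing import List, Optional


def suggest_field(unknown: str, available: List[str]) -> Optional[str]:
    # Return the closest known field name, or None if nothing is similar enough.
    matches = get_close_matches(unknown, available, n=1, cutoff=0.6)
    return matches[0] if matches else None
```

With available fields `["name", "email", "phone", "zip_code"]`, the unknown field "full_name" yields the suggestion "name", and an unrelated string yields no suggestion.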
Evaluation Order
Rules are evaluated in declaration order, but all rules contribute to the final weighted sum score. The evaluation order matters for two reasons:
- Short-circuiting: If early rules produce a score above the match threshold, later rules may be skipped for performance (the result would be the same). This optimization is automatic.
- Readability: Declaring high-weight, high-confidence rules first makes the spec easier to understand.
The final score for a candidate pair is computed as:
score = sum(rule.weight * rule_match for each rule)

Where rule_match is 1.0 for exact/range rules that fire, the similarity score for similarity rules that exceed their threshold, and the model confidence for ML rules that exceed their threshold.
The decision thresholds then classify the pair:
- score >= match --> automatic match
- review <= score < match --> sent to review queue
- score < review --> rejected (no match)
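The formula and the three-way classification can be sketched directly. The function names are hypothetical; each contribution is a (weight, rule_match) pair as defined above, with rule_match set to 0.0 for rules that did not fire.

```python
from typing import List, Tuple


def weighted_sum(contributions: List[Tuple[float, float]]) -> float:
    # score = sum(rule.weight * rule_match for each rule)
    return sum(weight * rule_match for weight, rule_match in contributions)


def classify(score: float, match: float, review: float) -> str:
    if score >= match:
        return "match"   # automatic match
    if score >= review:
        return "review"  # sent to the review queue
    return "reject"      # no match


# A fuzzy name rule (weight 0.6, similarity 0.9) plus an exact rule (weight 0.4).
example_score = weighted_sum([(0.6, 0.9), (0.4, 1.0)])
example_decision = classify(example_score, match=0.9, review=0.7)
```

In this example the score is about 0.94, so with thresholds of 0.9 and 0.7 the pair is classified as a match.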
Complete Example
A production-ready rules section for a Customer 360 use case, matching customer records across CRM, billing, and support systems:
rules:
  # High-confidence identifier match.
  # Email is near-unique -- if two records share an email,
  # they are almost certainly the same person.
  - name: email_exact
    type: exact
    field: email
    weight: 1.0

  # Phone number as a secondary strong signal.
  # Slightly lower weight than email because phone numbers
  # can be shared within households.
  - name: phone_exact
    type: exact
    field: phone
    weight: 0.85

  # Fuzzy name matching to catch typos and abbreviations.
  # jaro_winkler works well for person names because it
  # rewards matching prefixes (e.g., "Rob" in "Robert"/"Roberto").
  - name: name_fuzzy
    type: similarity
    field: name
    algorithm: jaro_winkler
    threshold: 0.88
    weight: 0.6

  # Address matching requires both street similarity and zip match.
  # Using a composite 'and' because a street name alone is ambiguous
  # (e.g., "Main Street" exists in every city).
  - name: address_composite
    type: composite
    operator: and
    children:
      - name: street_fuzzy
        type: similarity
        field: street_address
        algorithm: levenshtein
        threshold: 0.80
        weight: 0.4
      - name: zip_exact
        type: exact
        field: zip_code
        weight: 0.3

  # Date of birth as a corroborating signal.
  # Not high weight on its own, but valuable when combined
  # with name similarity to break ties.
  - name: dob_exact
    type: exact
    field: date_of_birth
    weight: 0.4

blocking:
  strategy: exact
  keys: [email, phone, zip_code]

decision:
  scoring: weighted_sum
  thresholds:
    match: 0.9
    review: 0.7

This configuration:
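- Automatically matches when email matches (weight 1.0 >= the 0.9 match threshold)
- Sends to review when only phone matches (weight 0.85 falls between the 0.7 review and 0.9 match thresholds)
- Sends to review when name is similar and the address composite fires (the raw weights sum to 0.6 + 0.4 + 0.3 = 1.3, but the and composite's capped contribution keeps the pair in the review range)
- Rejects when only weak signals align (a zip code match alone does not fire the address composite, so zip code + date of birth contributes just 0.4, which is below review)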
- Blocks on email, phone, and zip to keep comparisons manageable
Validation Limits
| Constraint | Limit |
|---|---|
| Max rules per spec | 50 |
| Max sub-rules per composite | 10 |
| Max composite nesting depth | 3 |
| Weight range | 0.0 - 1.0 |
| Threshold range | 0.0 - 1.0 |
| Max fields per rule | 5 |
| Max blocking keys | 5 |
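Several of these limits can be checked before submitting a spec. The sketch below covers only the rule-count, sub-rule, nesting-depth, weight, and threshold limits. The helper names are hypothetical, rules are assumed to be plain dictionaries loaded from YAML, and the way nesting depth is counted (a top-level composite has depth 1) is an assumption.

```python
from typing import List


def nesting_depth(rule: dict) -> int:
    # Non-composite rules have depth 0; each composite level adds 1 (assumed convention).
    if rule.get("type") != "composite":
        return 0
    children = rule.get("children", [])
    return 1 + max((nesting_depth(child) for child in children), default=0)


def check_limits(rules: List[dict]) -> List[str]:
    errors = []
    if len(rules) > 50:
        errors.append("more than 50 rules in spec")

    def visit(rule: dict) -> None:
        # Weight and threshold, when present, must lie in 0.0 - 1.0.
        for key in ("weight", "threshold"):
            value = rule.get(key)
            if value is not None and not 0.0 <= value <= 1.0:
                errors.append(f"{rule.get('name')}: {key} {value} outside 0.0 - 1.0")
        children = rule.get("children", [])
        if len(children) > 10:
            errors.append(f"{rule.get('name')}: more than 10 sub-rules")
        for child in children:
            visit(child)

    for rule in rules:
        if nesting_depth(rule) > 3:
            errors.append(f"{rule.get('name')}: composite nesting deeper than 3")
        visit(rule)
    return errors
```

An empty list from `check_limits` means none of the checked limits were exceeded; it does not replace the spec validator.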
