This project maps noisy / misspelled / dialectal romanized input words (e.g., from Hindi, English, South Indian transliterations) to their canonical correct forms from a reference dictionary.
It does not use pretrained models; matching is based on normalization, Levenshtein edit distance, a BK-tree for fast nearest-neighbor lookup, and a k-gram fallback for robustness.
Example:
`Raaam`, `ramm`, `shakttii` → `Ram`, `Shakti`, etc.
- Language-aware normalization (vowel collapsing, transliteration tweaks like w→v, ph→f, etc.)
- Efficient approximate string matching using a BK-tree (edit-distance metric)
- Fallback candidate seeding via k-gram overlap when BK-tree misses
- Deterministic, thresholdable matching (no ML/fine-tuning)
- Easy to evaluate: outputs gold vs predicted for accuracy proof
- Lightweight: per-word lookup is fast (sub-millisecond on typical dictionaries)
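The language-aware normalization above could be sketched roughly as follows. Note this is a minimal illustration: the rule set (`DIALECT_MAP`, the exact regexes) is an assumption, not the rules actually used in `matcher.py`.

```python
import re

# Hypothetical dialect replacements -- illustrative, not the project's exact set.
DIALECT_MAP = {"w": "v", "ph": "f"}

def normalize(word: str) -> str:
    """Lowercase, strip non-alphanumerics, apply dialect tweaks, collapse repeated vowels."""
    w = word.lower()
    w = re.sub(r"[^a-z0-9]", "", w)        # strip non-alphanumerics
    for src, dst in DIALECT_MAP.items():   # transliteration tweaks (w -> v, ph -> f)
        w = w.replace(src, dst)
    w = re.sub(r"([aeiou])\1+", r"\1", w)  # collapse repeated vowels: "raaam" -> "ram"
    return w

print(normalize("Raaam"))  # ram
```

Applying both reference and query words through the same function is what makes the later edit-distance comparison meaningful.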
- `reference.txt`: Canonical vocabulary (one word per line); the source dictionary to match against.
- `errors.txt` (optional for basic usage): Noisy inputs / test inputs to correct (can be used in batch scripts).
- `matcher.py`: Core implementation; builds the BK-tree and k-gram index, normalizes inputs, and returns the best match.
- `evaluation_script.py` (or the corresponding Colab cell): Generates synthetic test pairs, runs matching, and computes exact-match accuracy, top-3 recall, average normalized edit distance, and timing (efficiency). Saves `gold_*.tsv` and `predicted_*.tsv` for proof.
- `corrected_output.tsv` (generated): Mapping of input → predicted canonical word, with scores/confidence if enabled.
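As an illustration, `corrected_output.tsv` might be written like this; the column names and row shape here are assumptions, not the script's actual schema:

```python
import csv

# Hypothetical rows: (input word, predicted canonical word, edit distance).
rows = [("Raaam", "Ram", 1), ("shakttii", "Shakti", 1)]

with open("corrected_output.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["input", "predicted", "edit_distance"])  # assumed header
    writer.writerows(rows)
```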
No heavy dependencies are required for the core BK-tree version; just standard Python 3.
```bash
# Optional: create a virtualenv
python -m venv venv
source venv/bin/activate

# If using the fuzzy/phonetic extended version:
pip install rapidfuzz jellyfish
```

Populate `reference.txt` with your canonical words (verbs, names, tokens, etc.), one per line.
Import and use the matching logic (example from `matcher.py`):

```python
from matcher import build_structure, match_best

reference = open("reference.txt").read().splitlines()
structure = build_structure(reference, bk_radius=2, k=3)  # tune radius/k as needed

input_word = "Raaam"
best_match, distance = match_best(input_word, structure)
print(f"{input_word} → {best_match} (edit distance: {distance})")
```

Run the evaluation script to:
- Generate synthetic noisy variants from the reference
- Match them back
- Compute metrics and save `gold_bktest.tsv` and `predicted_bktest.tsv`
Example metrics output:
```
Exact-match accuracy: 93.5%
Top-3 recall: 98.7%
Avg normalized edit distance: 0.05
Per-word latency: ~2ms
```

- Normalize: input and reference words are cleaned (lowercased, repeated vowels collapsed, dialectal replacements applied, non-alphanumerics stripped).
- BK-tree: built over normalized reference forms for fast approximate-match retrieval using Levenshtein distance.
- Query:
  - First attempt: search the BK-tree within a small radius.
  - Fallback: if there are no close BK-tree hits, use k-gram overlap to seed candidates, then compute edit distances.
- Selection: pick the candidate with the smallest edit distance as the best match.
- Optional fallback: if both fail, do a brute-force scan (rare).
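The BK-tree retrieval described above can be sketched as a minimal standalone version. This is illustrative only; the actual structure and API in `matcher.py` may differ:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    """BK-tree over a discrete metric (here: Levenshtein distance)."""
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})  # node = (word, {edge distance: child node})
        for w in it:
            self._insert(w)

    def _insert(self, word):
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return  # duplicate word
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, radius):
        """Return (word, distance) pairs within `radius`, sorted by distance."""
        results, stack = [], [self.root]
        while stack:
            node = stack.pop()
            d = levenshtein(word, node[0])
            if d <= radius:
                results.append((node[0], d))
            # Triangle inequality: only children whose edge distance lies in
            # [d - radius, d + radius] can contain matches.
            for edge, child in node[1].items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(results, key=lambda t: t[1])

tree = BKTree(["ram", "shakti", "sita", "lakshman"])
print(tree.search("ramm", radius=2))  # [('ram', 1)]
```

The pruning rule is what makes queries sublinear in practice: most subtrees are skipped without computing any distances inside them.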
- `bk_radius`: how tolerant the BK-tree search is to edit differences. A larger radius finds more distant matches but may slow queries.
- `k` (k-gram size): controls the granularity of fallback overlap. Commonly 2 or 3.
- Thresholding (if added): matches with edit distance above a cutoff can be rejected or flagged for manual review.
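A minimal sketch of the k-gram fallback index, parameterized by `k` as above. Function names and the padding scheme are illustrative assumptions, not the exact implementation:

```python
from collections import defaultdict

def kgrams(word: str, k: int = 3):
    """Set of character k-grams, with '#' boundary padding so short words still yield grams."""
    padded = f"#{word}#"
    return {padded[i:i + k] for i in range(max(1, len(padded) - k + 1))}

def build_kgram_index(words, k=3):
    """Inverted index: k-gram -> set of words containing it."""
    index = defaultdict(set)
    for w in words:
        for g in kgrams(w, k):
            index[g].add(w)
    return index

def seed_candidates(query, index, k=3, min_overlap=1):
    """Words sharing at least `min_overlap` k-grams with the query."""
    counts = defaultdict(int)
    for g in kgrams(query, k):
        for w in index.get(g, ()):
            counts[w] += 1
    return [w for w, c in counts.items() if c >= min_overlap]

index = build_kgram_index(["ram", "shakti", "sita"], k=3)
print(seed_candidates("shaktti", index))  # ['shakti']
```

When the BK-tree returns nothing within its radius, the seeded candidates are re-ranked by edit distance, so the fallback only trades a wider candidate pool for a few extra distance computations.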
- Preprocessing (once): normalization + BK-tree / k-gram index build.
- Query time: expected sublinear via the BK-tree; common cases resolve within a few comparisons.
- Empirical per-word latency is typically 1–5ms on dictionaries of size 10k–100k in Python.
To prove accuracy to stakeholders:
- Provide `gold_*.tsv` (ground-truth noisy → correct pairs).
- Provide `predicted_*.tsv` (system output).
- Show metrics:
  - Exact-match accuracy (how often the top prediction equals the correct word)
  - Top-K recall (the correct word appears among the top-k candidates)
  - Average normalized edit distance (magnitude of the distance error)
  - Sample mismatches with alternatives
- Show timing to demonstrate efficiency.
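The three metrics above are straightforward to compute from the gold/predicted pairs. A self-contained sketch with toy data (the function name and input shapes are assumptions, not those of `evaluation_script.py`):

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def evaluate(gold, predicted_topk):
    """gold: list of correct words; predicted_topk: parallel list of ranked candidates."""
    n = len(gold)
    exact = sum(preds[0] == g for g, preds in zip(gold, predicted_topk)) / n
    topk = sum(g in preds for g, preds in zip(gold, predicted_topk)) / n
    avg_ned = sum(levenshtein(preds[0], g) / max(len(preds[0]), len(g))
                  for g, preds in zip(gold, predicted_topk)) / n
    return exact, topk, avg_ned

gold = ["ram", "shakti", "sita"]
preds = [["ram"], ["shaktti", "shakti"], ["sita"]]
exact, topk, avg_ned = evaluate(gold, preds)
```

Normalizing edit distance by the longer word's length keeps the metric in [0, 1], so it stays comparable across short and long words.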
```
Input: shakttii
Normalized → shakti
Matched: Shakti (edit dist = 1 or 2 depending on corruption)
```
- Add confidence scoring / labeling (HIGH/MEDIUM/LOW) based on distance thresholds.
- Integrate into a REST API for real-time correction.
- Log low-confidence or unmatched inputs to expand reference dictionary (bootstrapping).
- Combine with lightweight phonetic boosting (optional) for improved recall on sound-alikes.
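The confidence-labeling extension could be as simple as thresholding on normalized edit distance. The cutoffs below are illustrative assumptions and would need tuning on real data:

```python
def confidence_label(distance: int, length: int,
                     high: float = 0.2, medium: float = 0.4) -> str:
    """Label a match HIGH/MEDIUM/LOW by normalized edit distance (thresholds are illustrative)."""
    ned = distance / max(length, 1)
    if ned <= high:
        return "HIGH"
    if ned <= medium:
        return "MEDIUM"
    return "LOW"

print(confidence_label(1, 6))  # HIGH
```

LOW-labeled matches are natural candidates for the logging/bootstrapping loop mentioned above.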
Specify your license here (e.g., MIT, proprietary, etc.)
Include a short demo script or link if needed. For detailed evaluation report generation, run the provided evaluation script and share the TSVs.