Description
What happened?
Chinese text is incorrectly deduplicated in openmemory_store due to ASCII-only tokenization
Bug description
openmemory_store incorrectly deduplicates different Chinese memories into the same record.
For example, these two different inputs may resolve to the same memory id:
- 我喜欢健身 ("I like working out")
- 我喜欢普洱茶 ("I like pu-erh tea")
This causes unrelated Chinese memories to be treated as duplicates.
Environment
- OpenMemory: latest (observed on 2026-03-03)
- MCP tool: openmemory_store
- Language: Chinese (zh-CN)
Steps to reproduce
- Store memory A with content: 我喜欢健身 ("I like working out")
- Store memory B with content: 我喜欢普洱茶 ("I like pu-erh tea")
- Check the returned IDs / list the stored memories.
Actual behavior
Different Chinese texts are often deduplicated to the same memory id (or old memory is reused), so no independent record is created.
Expected behavior
Semantically different Chinese memories should not be collapsed into one by default.
Suspected root cause
The tokenizer used by simhash appears to be ASCII-only:
- JS: packages/openmemory-js/src/utils/text.ts uses tok_pat = /[a-z0-9]+/gi
- Similar logic exists in the Python implementation.
For Chinese text, the token set becomes empty, producing a near-constant simhash and causing false deduplication.
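A minimal sketch of the failure mode (tok_pat is the pattern quoted above; the helper name is illustrative, not the project's actual function):

```typescript
// Reproduces the suspected failure: the ASCII-only pattern from
// packages/openmemory-js/src/utils/text.ts matches nothing in Chinese text,
// so every CJK-only input yields the same (empty) token set.
const tok_pat = /[a-z0-9]+/gi;

// Illustrative helper name, not the actual OpenMemory function.
function asciiTokens(text: string): string[] {
  return text.match(tok_pat) ?? [];
}

console.log(asciiTokens("我喜欢健身"));   // []
console.log(asciiTokens("我喜欢普洱茶")); // []
// Identical (empty) token sets -> identical simhash -> false deduplication.
```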
Suggested fix
- Guardrail: if token set is empty, skip simhash dedup for that input.
- Improve the tokenizer to support Unicode letters/numbers (\p{L} and \p{N} with the u flag).
- Add CJK-specific n-gram tokenization (e.g., bi-grams) for Chinese/Japanese/Korean.
- (Optional) add config: OM_DEDUP_MODE=simhash|none and OM_DEDUP_HAMMING_MAX=<int>
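The guardrail and tokenizer suggestions above could be sketched as follows; the identifiers and CJK character ranges are assumptions for illustration, not the project's actual code:

```typescript
// Sketch of the suggested fix: Unicode-aware tokenization with CJK bi-grams.
const WORD_RUN = /[\p{L}\p{N}]+/gu;                            // Unicode letters/digits, u flag
const HAS_CJK = /[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]/u;  // kana, han, hangul (assumed ranges)

function tokenize(text: string): string[] {
  const out: string[] = [];
  for (const run of text.match(WORD_RUN) ?? []) {
    if (HAS_CJK.test(run)) {
      // Character bi-grams for CJK runs; a single char passes through as-is.
      if (run.length === 1) { out.push(run); continue; }
      for (let i = 0; i < run.length - 1; i++) out.push(run.slice(i, i + 2));
    } else {
      out.push(run.toLowerCase());
    }
  }
  return out;
}

// Guardrail: if tokenization still yields nothing, skip simhash dedup entirely.
function shouldDedup(text: string): boolean {
  return tokenize(text).length > 0;
}

console.log(tokenize("我喜欢健身"));   // ["我喜", "喜欢", "欢健", "健身"]
console.log(tokenize("我喜欢普洱茶")); // ["我喜", "喜欢", "欢普", "普洱", "洱茶"]
```

With distinct, non-empty token sets, the two example memories would produce different simhashes and should no longer be collapsed into one record.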
Impact
This affects multilingual users and makes Chinese memory storage unreliable in contextual mode.
Component
Backend (API/Server)