[BUG] Chinese text is incorrectly deduplicated in openmemory_store due to ASCII-only tokenization #147

@dgod15210-ship-it

Bug description

openmemory_store incorrectly deduplicates different Chinese memories into the same record.

For example, these two different inputs may resolve to the same memory id:

  • 我喜欢健身 ("I like working out")
  • 我喜欢普洱茶 ("I like pu-erh tea")

This causes unrelated Chinese memories to be treated as duplicates.

Environment

  • OpenMemory: latest (observed on 2026-03-03)
  • MCP tool: openmemory_store
  • Language: Chinese (zh-CN)

Steps to reproduce

  1. Store memory A:
    • content: 我喜欢健身
  2. Store memory B:
    • content: 我喜欢普洱茶
  3. Check returned IDs / list memories.

Actual behavior

Distinct Chinese texts are often deduplicated to the same memory ID (or an existing memory is reused), so no separate record is created.

Expected behavior

Semantically different Chinese memories should not be collapsed into one by default.

Suspected root cause

The tokenizer used by simhash appears to be ASCII-only:

  • JS: packages/openmemory-js/src/utils/text.ts
    • tok_pat = /[a-z0-9]+/gi
  • Similar logic exists in the Python implementation as well.

For Chinese text, the token set can be empty, which produces a near-constant simhash and causes false deduplication.
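
To illustrate the suspected failure mode, here is a minimal TypeScript sketch. Only the `/[a-z0-9]+/gi` pattern is taken from this report; `asciiTokens`, `fnv1a32`, and `toySimhash` are hypothetical stand-ins written for this example, not OpenMemory's actual code, and the 32-bit simhash is a toy. The sketch assumes that a simhash over zero tokens leaves every bit at its default value, so all token-less inputs end up with the same fingerprint.

```typescript
// Pattern quoted in this report (packages/openmemory-js/src/utils/text.ts).
const tokPat = /[a-z0-9]+/gi;

// Hypothetical helper for illustration; not the project's real tokenizer.
function asciiTokens(text: string): string[] {
  return text.toLowerCase().match(tokPat) ?? [];
}

// FNV-1a style 32-bit hash over UTF-16 code units (toy hash for the demo).
function fnv1a32(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Toy 32-bit simhash: each token votes +1/-1 per bit; positive sums set the bit.
function toySimhash(tokens: string[]): number {
  const weights = new Array<number>(32).fill(0);
  for (const t of tokens) {
    const h = fnv1a32(t);
    for (let bit = 0; bit < 32; bit++) {
      weights[bit] += (h >>> bit) & 1 ? 1 : -1;
    }
  }
  let out = 0;
  for (let bit = 0; bit < 32; bit++) {
    if (weights[bit] > 0) out |= 1 << bit;
  }
  return out >>> 0;
}

const a = asciiTokens("我喜欢健身");
const b = asciiTokens("我喜欢普洱茶");

console.log(a.length, b.length);                 // 0 0
console.log(toySimhash(a) === toySimhash(b));    // true: no tokens, no votes
console.log(asciiTokens("I like pu-erh tea"));   // ["i", "like", "pu", "erh", "tea"]
```

With no tokens there are no votes, so both Chinese inputs hash to 0 and their Hamming distance is 0, which any distance threshold would treat as a duplicate.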

Suggested fix

  1. Guardrail: if the token set is empty, skip simhash dedup for that input.
  2. Improve the tokenizer to support Unicode letters/numbers (\p{L}\p{N} with the u flag).
  3. Add CJK-specific n-gram tokenization (e.g., bi-grams) for Chinese/Japanese/Korean.
  4. (Optional) add config:
    • OM_DEDUP_MODE=simhash|none
    • OM_DEDUP_HAMMING_MAX=<int>
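
A rough sketch of how suggestions 1 to 3 could fit together, in TypeScript. The names `tokenizeUnicode` and `canSimhashDedup` are hypothetical, and the choice of scripts treated as CJK (Han, Hiragana, Katakana, Hangul) is an assumption of this example rather than anything the project defines. The patterns are built with `new RegExp` so the character classes can be shared.

```typescript
// CJK scripts for this sketch (assumption: Han, Hiragana, Katakana, Hangul).
const CJK = "\\p{Script=Han}\\p{Script=Hiragana}\\p{Script=Katakana}\\p{Script=Hangul}";

// Runs of letters/digits in any script (suggestion 2; requires the `u` flag).
const runPat = new RegExp("[\\p{L}\\p{N}]+", "gu");
// Splits a run into alternating CJK and non-CJK segments.
const segPat = new RegExp(`[${CJK}]+|[^${CJK}]+`, "gu");
// True when a segment starts with a CJK character.
const cjkStart = new RegExp(`^[${CJK}]`, "u");

// Hypothetical replacement tokenizer: whole words for non-CJK text,
// overlapping character bigrams for CJK segments (suggestion 3).
function tokenizeUnicode(text: string): string[] {
  const tokens: string[] = [];
  for (const run of text.toLowerCase().match(runPat) ?? []) {
    for (const seg of run.match(segPat) ?? []) {
      if (!cjkStart.test(seg)) {
        tokens.push(seg); // non-CJK segment: keep as one token
        continue;
      }
      const chars = Array.from(seg); // split by code point, not UTF-16 unit
      if (chars.length === 1) {
        tokens.push(seg); // a lone CJK character has no bigram
        continue;
      }
      for (let i = 0; i + 1 < chars.length; i++) {
        tokens.push(chars[i] + chars[i + 1]);
      }
    }
  }
  return tokens;
}

// Guardrail (suggestion 1): only run simhash dedup when there are tokens.
function canSimhashDedup(tokens: string[]): boolean {
  return tokens.length > 0;
}

console.log(tokenizeUnicode("我喜欢健身"));   // ["我喜", "喜欢", "欢健", "健身"]
console.log(tokenizeUnicode("我喜欢普洱茶")); // ["我喜", "喜欢", "欢普", "普洱", "洱茶"]
console.log(canSimhashDedup(tokenizeUnicode("！？"))); // false
```

The two example inputs still share the bigrams 我喜 and 喜欢, so how far apart their fingerprints land depends on the hash function and the Hamming threshold. That is one reason a configurable threshold such as the proposed OM_DEDUP_HAMMING_MAX may still be useful alongside the tokenizer change.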

Impact

This affects multilingual users and makes Chinese memory storage unreliable in contextual mode.

Component

Backend (API/Server)

Labels

bug (Something isn't working)
