Description
What happened?
Chinese text is incorrectly deduplicated in openmemory_store due to ASCII-only tokenization
Bug description
openmemory_store incorrectly deduplicates different Chinese memories into the same record.
For example, these two different inputs may resolve to the same memory id:
- 我喜欢健身 ("I like working out")
- 我喜欢普洱茶 ("I like pu-erh tea")
This causes unrelated Chinese memories to be treated as duplicates.
Environment
- OpenMemory: latest (observed on 2026-03-03)
- MCP tool: openmemory_store
- Language: Chinese (zh-CN)
Steps to reproduce
- Store memory A with content: 我喜欢健身 ("I like working out")
- Store memory B with content: 我喜欢普洱茶 ("I like pu-erh tea")
- Check the returned IDs / list the stored memories.
Actual behavior
Different Chinese texts are often deduplicated to the same memory id (or old memory is reused), so no independent record is created.
Expected behavior
Semantically different Chinese memories should not be collapsed into one by default.
Suspected root cause
The tokenizer used by simhash appears to be ASCII-only:
- JS: packages/openmemory-js/src/utils/text.ts uses tok_pat = /[a-z0-9]+/gi
- Similar logic exists in the Python implementation.
For Chinese text, the token set becomes empty, producing a near-constant simhash and causing false deduplication.
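A minimal sketch of the failure mode (tok_pat is the pattern quoted above; the helper name is illustrative, not the project's actual function):

```typescript
// Reproduces the suspected failure: the ASCII-only pattern from
// packages/openmemory-js/src/utils/text.ts matches nothing in Chinese text,
// so every CJK-only input yields the same (empty) token set.
const tok_pat = /[a-z0-9]+/gi;

// Illustrative helper name, not the actual OpenMemory function.
function asciiTokens(text: string): string[] {
  return text.match(tok_pat) ?? [];
}

console.log(asciiTokens("我喜欢健身"));   // []
console.log(asciiTokens("我喜欢普洱茶")); // []
// Identical (empty) token sets -> identical simhash -> false deduplication.
```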
Suggested fix
- Guardrail: if token set is empty, skip simhash dedup for that input.
- Improve the tokenizer to support Unicode letters/numbers (\p{L} and \p{N} with the u flag).
- Add CJK-specific n-gram tokenization (e.g., bi-grams) for Chinese/Japanese/Korean.
- (Optional) add config: OM_DEDUP_MODE=simhash|none and OM_DEDUP_HAMMING_MAX=<int>
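The guardrail and tokenizer suggestions above could be sketched as follows; the identifiers and CJK character ranges are assumptions for illustration, not the project's actual code:

```typescript
// Sketch of the suggested fix: Unicode-aware tokenization with CJK bi-grams.
const WORD_RUN = /[\p{L}\p{N}]+/gu;                            // Unicode letters/digits, u flag
const HAS_CJK = /[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]/u;  // kana, han, hangul (assumed ranges)

function tokenize(text: string): string[] {
  const out: string[] = [];
  for (const run of text.match(WORD_RUN) ?? []) {
    if (HAS_CJK.test(run)) {
      // Character bi-grams for CJK runs; a single char passes through as-is.
      if (run.length === 1) { out.push(run); continue; }
      for (let i = 0; i < run.length - 1; i++) out.push(run.slice(i, i + 2));
    } else {
      out.push(run.toLowerCase());
    }
  }
  return out;
}

// Guardrail: if tokenization still yields nothing, skip simhash dedup entirely.
function shouldDedup(text: string): boolean {
  return tokenize(text).length > 0;
}

console.log(tokenize("我喜欢健身"));   // ["我喜", "喜欢", "欢健", "健身"]
console.log(tokenize("我喜欢普洱茶")); // ["我喜", "喜欢", "欢普", "普洱", "洱茶"]
```

With distinct, non-empty token sets, the two example memories would produce different simhashes and should no longer be collapsed into one record.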
Impact
This affects multilingual users and makes Chinese memory storage unreliable in contextual mode.
Component
Backend (API/Server)