
feat: Add Dimension Mismatch Handling for ChromaDB (#157) #207

Merged
kevin-mindverse merged 3 commits into mindverse:master from PStarH:master
Apr 22, 2025

PStarH (Contributor) commented Apr 11, 2025

Summary

This PR implements the dimension mismatch handling feature for ChromaDB in the Second-Me project, as outlined in Issue #157. This feature enables seamless transitions between embedding models with different output dimensions, enhancing flexibility and reducing manual intervention.


Key Features

1. Automatic Dimension Detection

  • Embedding dimensions are inferred from the model name using a mapping defined in chroma_utils.py.
  • Supported models:
    • OpenAI: text-embedding-ada-002 (1536), text-embedding-3-large (3072), etc.
    • Ollama: e.g., mistral (768), llama2 (1024)
  • Defaults to 1536 dimensions if the model is unrecognized.
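
The lookup described above can be sketched roughly as follows. This is an illustration only, not the code in chroma_utils.py: the names `KNOWN_DIMENSIONS`, `DEFAULT_DIMENSION`, and `detect_embedding_dimension` are hypothetical, and the dimension values are copied from the list above without being checked against the models themselves.

```python
from typing import Dict

# Dimensions copied from the PR description; the real mapping in
# chroma_utils.py may contain more models or different values.
KNOWN_DIMENSIONS: Dict[str, int] = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-large": 3072,
    "mistral": 768,
    "llama2": 1024,
}

DEFAULT_DIMENSION = 1536  # fallback for unrecognized models, per the PR


def detect_embedding_dimension(model_name: str) -> int:
    """Return the embedding dimension for a model name (hypothetical helper)."""
    key = model_name.strip().lower()  # tolerate stray whitespace and casing
    return KNOWN_DIMENSIONS.get(key, DEFAULT_DIMENSION)


print(detect_embedding_dimension("text-embedding-3-large"))
print(detect_embedding_dimension("some-unknown-model"))
```

Falling back to a default keeps startup from failing on a new model name, at the cost of a possible mismatch if that model does not actually produce 1536-dimensional vectors.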

2. Dimension Validation

  • Validates consistency between configured model dimensions and existing ChromaDB collection dimensions.
  • Validation logic integrated into embedding_service.py and init_chroma.py.
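
A minimal sketch of such a check, assuming the collection's dimension is recorded under a `"dimension"` key in its metadata. That storage convention, the `StubCollection` stand-in, and both function names are assumptions for illustration; the actual logic in embedding_service.py and init_chroma.py may read the dimension differently.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class StubCollection:
    """Stand-in for a ChromaDB collection; only name and metadata are modeled."""
    name: str
    metadata: Dict[str, int] = field(default_factory=dict)


def stored_dimension(collection: StubCollection) -> Optional[int]:
    # Assumption: the dimension was written to metadata["dimension"] at creation.
    return collection.metadata.get("dimension")


def check_dimension(collection: StubCollection, expected: int) -> bool:
    """Return True when the collection can accept vectors of size `expected`."""
    actual = stored_dimension(collection)
    # A collection with no recorded dimension is treated as compatible here.
    return actual is None or actual == expected


docs = StubCollection("documents", {"dimension": 1536})
print(check_dimension(docs, 1536))
print(check_dimension(docs, 3072))
```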

3. Automatic Reinitialization

  • On detection of a mismatch, collections are safely reinitialized to match the new embedding dimension.
  • Function: reinitialize_chroma_collections
    • Deletes and recreates collections.
    • Verifies new dimensions post-reinit.
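
The delete, recreate, and verify sequence might look like the sketch below. It runs against a toy in-memory client rather than ChromaDB, so `InMemoryClient`, its methods, the `reinitialize` function, and the collection names are all hypothetical and do not reflect the signature of `reinitialize_chroma_collections`. Note that deleting a collection discards whatever was stored in it.

```python
from typing import Dict, Iterable, List


class InMemoryClient:
    """Toy stand-in for a vector-store client; not ChromaDB's real API."""

    def __init__(self) -> None:
        self._collections: Dict[str, Dict[str, int]] = {}

    def list_names(self) -> List[str]:
        return list(self._collections)

    def create(self, name: str, dimension: int) -> None:
        self._collections[name] = {"dimension": dimension}

    def delete(self, name: str) -> None:
        del self._collections[name]

    def dimension_of(self, name: str) -> int:
        return self._collections[name]["dimension"]


def reinitialize(client: InMemoryClient, names: Iterable[str], dimension: int) -> bool:
    """Delete and recreate each collection, then verify the new dimension."""
    names = list(names)
    for name in names:
        # Only delete collections that exist, so a fresh database does not fail.
        if name in client.list_names():
            client.delete(name)
        client.create(name, dimension)
    # Post-reinit check: every collection must report the requested dimension.
    return all(client.dimension_of(n) == dimension for n in names)


client = InMemoryClient()
client.create("documents", 1536)  # existing collection with the old dimension
ok = reinitialize(client, ["documents", "chunks"], 3072)
print(ok, client.dimension_of("documents"))
```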

4. Robust Error Handling and Logging

  • Detailed logs are generated for:
    • Detected mismatches
    • Successful or failed reinitializations
  • Clear error messages to aid debugging and user understanding.
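
As a rough illustration of this logging pattern, the sketch below uses Python's standard `logging` module. The logger name, the `handle_mismatch` function, and the message wording are assumptions and are not taken from `_handle_dimension_mismatch` in embedding_service.py.

```python
import logging
from typing import Callable

logger = logging.getLogger("chroma_dimension_demo")  # hypothetical logger name


def handle_mismatch(expected: int, actual: int, reinit: Callable[[], bool]) -> bool:
    """Log a detected mismatch, run `reinit`, and log the outcome."""
    logger.warning(
        "Dimension mismatch: collection has %d, model needs %d", actual, expected
    )
    try:
        succeeded = reinit()
    except Exception:
        # logger.exception records the traceback alongside the message.
        logger.exception("Reinitialization raised an error")
        return False
    if succeeded:
        logger.info("Collections reinitialized with dimension %d", expected)
    else:
        logger.error(
            "Collections still do not have dimension %d after reinitialization",
            expected,
        )
    return succeeded


logging.basicConfig(level=logging.INFO)
handle_mismatch(3072, 1536, lambda: True)
```

Returning a boolean instead of re-raising lets the caller decide whether a failed reinitialization should abort startup.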

5. User Documentation

  • Embedding Model Switching.md added:
    • Table of commonly used models and their embedding dimensions.
    • Step-by-step explanation of the automatic handling mechanism.
    • Troubleshooting guide for edge cases.

Impact

This implementation significantly improves user experience by:

  • Removing the need for manual database resets.
  • Ensuring reliability during embedding model transitions.
  • Offering clear diagnostics when issues arise.

Related Issue

#157

PStarH added 2 commits April 11, 2025 12:18
Add chroma_utils.py to manage chromaDB and added docs for explanation
- Enhanced the `reinitialize_chroma_collections` function in `chroma_utils.py` to properly check if collections exist before attempting to delete them, preventing potential errors when collections don't exist.
- Improved error handling in the `_handle_dimension_mismatch` method in `embedding_service.py` by adding more robust exception handling and verification steps after reinitialization.
- Enhanced the collection initialization process in `embedding_service.py` to provide more detailed error messages and better handle cases where collections still have incorrect dimensions after reinitialization.
- Added additional verification steps to ensure that collection dimensions match the expected dimension after creation or retrieval.
- Improved logging throughout the code to provide more context in error messages, making debugging easier.

CLAassistant commented Apr 11, 2025

CLA assistant check
All committers have signed the CLA.

@yingapple (Contributor)

please resolve the conflict. Thanks!

@kevin-mindverse (Contributor)

Hey there, thank you for your contribution, but there are still a few conflicts to resolve :)

PStarH (Contributor, Author) commented Apr 17, 2025

I think I resolved the conflict.

kevin-mindverse (Contributor) left a comment

Good job

kevin-mindverse merged commit 5868f94 into mindverse:master on Apr 22, 2025
1 check passed
kevin-mindverse pushed a commit that referenced this pull request Apr 22, 2025
* Fix Issue #157 Add chroma_utils.py to manage chromaDB and added docs for explanation
* Add logging and debugging process
- Enhanced the `reinitialize_chroma_collections` function in `chroma_utils.py` to properly check if collections exist before attempting to delete them, preventing potential errors when collections don't exist.
- Improved error handling in the `_handle_dimension_mismatch` method in `embedding_service.py` by adding more robust exception handling and verification steps after reinitialization.
- Enhanced the collection initialization process in `embedding_service.py` to provide more detailed error messages and better handle cases where collections still have incorrect dimensions after reinitialization.
- Added additional verification steps to ensure that collection dimensions match the expected dimension after creation or retrieval.
- Improved logging throughout the code to provide more context in error messages, making debugging easier.
lazy-ape pushed a commit to lazy-ape/Second-Me that referenced this pull request Apr 22, 2025
kevin-mindverse added a commit that referenced this pull request Apr 24, 2025
* fix: modify thinking_model loading configuration
* feat: realize thinkModel ui
* feat:store
* feat: add combined_llm_config_dto
* add thinking_model_config & database migration
* directly add thinking model to user_llm_config
* delete thinking model repo dto service
* delete thinkingmodel table migration
* add is_cot config
* feat: allow define is_cot
* feat: simplify logs info
* feat: add training model
* feat: fix is_cot problem
* fix: fix chat message
* fix: fix progress error
* fix: disable no settings thinking
* feat: add thinking warning
* fix: fix start service error
* feat:fix init trainparams problem
* feat: change playGround prompt
* feat: Add Dimension Mismatch Handling for ChromaDB (#157) (#207)
* Fix Issue #157 Add chroma_utils.py to manage chromaDB and added docs for explanation
* Add logging and debugging process
- Enhanced the `reinitialize_chroma_collections` function in `chroma_utils.py` to properly check if collections exist before attempting to delete them, preventing potential errors when collections don't exist.
- Improved error handling in the `_handle_dimension_mismatch` method in `embedding_service.py` by adding more robust exception handling and verification steps after reinitialization.
- Enhanced the collection initialization process in `embedding_service.py` to provide more detailed error messages and better handle cases where collections still have incorrect dimensions after reinitialization.
- Added additional verification steps to ensure that collection dimensions match the expected dimension after creation or retrieval.
- Improved logging throughout the code to provide more context in error messages, making debugging easier.
* Change topics_generator timeout to 30 (#263)
* quick fix
* fix: shade -> shade_merge_info (#265)
* fix: shade -> shade_merge_info
* add convert array
* quick fix import error
* add log
* add heartbeat
* new strategy
* sse version
* add heartbeat
* zh to en
* optimize code
* quick fix convert function
* Feat/new branch management (#267)
* feat: new branch management
* feat: fix multi-upload
* optimize contribute management
---------
Co-authored-by: Crabboss Mr <1123357821@qq.com>
Co-authored-by: Ye Xiangle <yexiangle@mail.mindverse.ai>
Co-authored-by: Xinghan Pan <sampan090611@gmail.com>
Co-authored-by: doubleBlack2 <108928143+doubleBlack2@users.noreply.github.com>
Co-authored-by: kevin-mindverse <kevin@mindverse.ai>
Co-authored-by: KKKKKKKevin <115385420+kevin-mindverse@users.noreply.github.com>
Heterohabilis pushed a commit to Heterohabilis/Second-Me that referenced this pull request May 29, 2025
Heterohabilis pushed a commit to Heterohabilis/Second-Me that referenced this pull request May 29, 2025
EOMZON pushed a commit to EOMZON/Second-Me that referenced this pull request Feb 1, 2026
EOMZON pushed a commit to EOMZON/Second-Me that referenced this pull request Feb 1, 2026


4 participants