Name	Name	Last commit message	Last commit date
parent directory ..
.github/workflows	.github/workflows
docs	docs
examples	examples
guides	guides
matrixone	matrixone
scripts	scripts
tests	tests
.editorconfig	.editorconfig
.flake8	.flake8
.gitignore	.gitignore
LICENSE	LICENSE
MANIFEST.in	MANIFEST.in
Makefile	Makefile
README.md	README.md
README_USER.md	README_USER.md
READTHEDOCS.md	READTHEDOCS.md
RELEASE_WORKFLOW.md	RELEASE_WORKFLOW.md
TESTING.md	TESTING.md
mo_diag.py	mo_diag.py
pyproject.toml	pyproject.toml
pytest.ini	pytest.ini
requirements-sqlalchemy14.txt	requirements-sqlalchemy14.txt
requirements-sqlalchemy20.txt	requirements-sqlalchemy20.txt
requirements.txt	requirements.txt
setup.py	setup.py
tox.ini	tox.ini

MatrixOne Python SDK

A comprehensive Python SDK for MatrixOne that provides SQLAlchemy-like interface for database operations, vector search, fulltext search, snapshot management, PITR, restore operations, table cloning, and mo-ctl integration.

📚 Complete Documentation | 🚀 Quick Start Guide | 🧠 Vector Search & IVF Monitoring | 📖 API Reference

Note: This SDK implementation is generated and optimized through AI-assisted development, leveraging advanced code generation techniques to ensure comprehensive API coverage, type safety, and modern Python best practices.

✨ Features

🚀 High Performance: Optimized for MatrixOne database operations with connection pooling
🔄 Async Support: Full async/await support with AsyncClient for non-blocking operations
🧠 Vector Search: Advanced vector similarity search with HNSW and IVF indexing algorithms
- Support for f32 and f64 precision vectors
- Multiple distance metrics (L2, Cosine, Inner Product)
- ⭐ IVF Index Health Monitoring with get_ivf_stats() - Critical for production!
- ⭐ IVF LIMIT BY RANK - Fine-grained ranking control (pre/post/force modes)
- Configurable index parameters for performance tuning
🔍 Fulltext Search: Powerful fulltext indexing and search with BM25 and TF-IDF algorithms
- Natural language and boolean search modes
- Multi-column fulltext indexes
- Relevance scoring and ranking
📊 Metadata Analysis: Comprehensive table and column metadata analysis with statistics
- Table size and row count statistics
- Column data distribution analysis
- Index usage metrics
🔍 Secondary Index Verification: Verify consistency of secondary indexes with main table
- Get all secondary index table names
- Get specific index table by index name
- Verify row counts across main table and all indexes in single query
📂 Stage Management: Comprehensive external stage operations for centralized data storage
- Create and manage stages (file system, S3, cloud storage)
- Load data from stages with automatic path resolution
- Stage-scoped load operations for convenience
- Transaction support for atomic stage operations
📥 Data Loading: Pandas-style bulk data loading with intuitive interface
- read_csv(), read_json(), read_parquet() methods (pandas-compatible API)
- Load from local files or external stages (stage:// protocol)
- Pandas-compatible parameters (sep, quotechar, skiprows, names, encoding)
- Support for inline data loading and compression
- Transaction-aware loading for data consistency
📤 Data Export: Pandas-style data export with intuitive interface
- to_csv() and to_jsonl() methods (pandas-compatible API)
- Export to local files or external stages (stage:// protocol)
- Support for raw SQL, SQLAlchemy select(), and MatrixOne queries
- Customizable CSV options (sep, quotechar, lineterminator)
- Transaction-aware exports for data consistency
📸 Snapshot Management: Create and manage database snapshots at multiple levels
⏰ Point-in-Time Recovery: PITR functionality for precise data recovery
🔄 Table Cloning: Clone databases and tables efficiently with data replication
- ⭐ SQLAlchemy-style statement builders (clone_table(), clone_database())
- Snapshot-based point-in-time cloning
- Cross-tenant cloning with TO ACCOUNT support
- IF NOT EXISTS for idempotent operations
🌿 Branch Management: Git-style version control for databases and tables
- Create isolated branches for development and testing
- Compare differences between branches with count-only optimization
- Merge branches with conflict resolution strategies
- Point-in-time branching from snapshots
- ⭐ SQLAlchemy-style statement builders (create_table_branch(), diff_table_branch(), etc.)
- Advanced DIFF output: COUNT, LIMIT, FILE export, snapshot on both sides
- Cross-tenant branching with TO ACCOUNT support
👥 Account Management: Comprehensive user, role, and permission management
📊 Pub/Sub: Real-time publication and subscription support
🔧 Version Management: Automatic backend version detection and compatibility checking
🛡️ Type Safety: Full type hints support with comprehensive documentation
📚 SQLAlchemy Integration: Seamless SQLAlchemy integration with enhanced ORM features
🔗 Enhanced Query Building: Advanced query building with logical operations (logical_and, logical_or, logical_not)
🪝 Connection Hooks: Pre/post connection hooks for custom initialization logic
🛠️ mo-diag CLI Tool: Interactive diagnostic tool for MatrixOne database maintenance
- Index health monitoring and verification
- IVF/HNSW vector index status inspection
- Table statistics and metadata analysis
- Interactive shell with Tab completion and command history
- Non-interactive mode for scripting and automation
- Batch operations on tables and indexes

⚠️ Important: Column Naming Convention

🚨 CRITICAL: Always use lowercase with underscores (snake_case) for column names!

MatrixOne does not support SQL standard double-quoted identifiers in queries, which causes issues with camelCase column names when using SQLAlchemy ORM.

# ❌ DON'T: CamelCase column names (will fail in SELECT queries) class User(Base): userName = Column(String(50)) # CREATE succeeds, SELECT fails! userId = Column(Integer) # Will cause SQL syntax errors # ✅ DO: Use lowercase with underscores (snake_case) class User(Base): user_name = Column(String(50)) # Works perfectly user_id = Column(Integer) # All operations succeed

Why this matters:

✅ CREATE TABLE works with both styles (uses backticks)
✅ INSERT works with both styles
❌ SELECT fails with camelCase (uses double quotes, not supported by MatrixOne)

Example of the problem:

# CamelCase generates: SELECT "userName" FROM user ❌ Fails! # snake_case generates: SELECT user_name FROM user ✅ Works!

🚀 Installation

From PyPI (Stable Release)

pip install matrixone-python-sdk

From test.pypi (Latest Pre-release)

pip install \ --index-url https://test.pypi.org/simple/ \ --extra-index-url https://pypi.org/simple/ \ matrixone-python-sdk

Note: The --extra-index-url is required to install dependencies (PyMySQL, SQLAlchemy, etc.) from the official PyPI.

Development Installation

Using Virtual Environment (Recommended)

git clone https://github.com/matrixorigin/matrixone.git cd matrixone/clients/python # Create virtual environment python -m venv venv # Activate virtual environment # On macOS/Linux: source venv/bin/activate # On Windows: # venv\Scripts\activate # Quick setup with Makefile (includes all dev dependencies) make dev-setup # Or manual setup pip install -e '.[dev]' # Includes pyarrow, Faker, pytest, etc.

Using Conda Environment

git clone https://github.com/matrixorigin/matrixone.git cd matrixone/clients/python # Create conda environment conda create -n matrixone-dev python=3.10 conda activate matrixone-dev # Quick setup with Makefile make dev-setup # Or manual setup pip install -e '.[dev]'

Direct Installation (Not Recommended)

git clone https://github.com/matrixorigin/matrixone.git cd matrixone/clients/python # Quick setup with Makefile make dev-setup # Or manual setup pip install -e '.[dev]'

Testing Environment Setup

For Testing and Examples

# Install test dependencies (includes pyarrow for Parquet support) pip install -e '.[test]' # Or use requirements files pip install -r requirements-sqlalchemy20.txt # SQLAlchemy 2.x pip install -r requirements-sqlalchemy14.txt # SQLAlchemy 1.4.x

For Load Data Features

The SDK includes comprehensive LOAD DATA functionality with support for:

CSV/TSV files: Basic delimited file loading
JSONLINE files: JSON Lines format with object/array structures
Parquet files: Columnar format loading (requires pyarrow)
Inline data: Direct string data loading
Stage loading: Loading from named external stages

Dependencies for Load Data:

# Required for Parquet file support pip install pyarrow>=10.0.0 # For generating test data pip install Faker>=10.0.0 # Or install everything at once pip install -e '.[dev]' # Includes pyarrow, Faker, pytest, etc.

Quick Start

Basic Usage

from matrixone import Client # Create and connect to MatrixOne client = Client() client.connect( host='localhost', port=6001, user='root', password='111', database='test' ) # Execute queries result = client.execute("SELECT 1 as test") print(result.fetchall()) # Get backend version (auto-detected) version = client.get_backend_version() print(f"MatrixOne version: {version}") client.disconnect()

📝 Connection Parameters

The connect() method requires keyword arguments (not positional):

database - Required, no default value

host - Default: 'localhost'

port - Default: 6001

user - Default: 'root'

password - Default: '111'

Minimal connection (uses all defaults):
client.connect(database='test')
By default, all features (IVF, HNSW, fulltext) are automatically enabled via on_connect=[ConnectionAction.ENABLE_ALL].

Transaction Management (Recommended)

Use client.session() for atomic transactions. All operations within a session succeed or fail together:

from matrixone import Client from sqlalchemy import select, insert, update, delete client = Client() client.connect(database='test') # Basic transaction with automatic commit/rollback with client.session() as session: # All operations are atomic session.execute(insert(User).values(name='Alice', age=30)) session.execute(update(User).where(User.age < 18).values(status='minor')) # Query within transaction stmt = select(User).where(User.age > 25) result = session.execute(stmt) users = result.scalars().all() # Commits automatically on success, rolls back on error client.disconnect()

Key Benefits:

✅ Atomic operations - all succeed or fail together
✅ Automatic rollback on errors
✅ Access to all MatrixOne managers (snapshots, clones, load_data, etc.)
✅ Full SQLAlchemy ORM support

Data Loading (Pandas-Style)

Load bulk data from CSV, JSON, or Parquet files with pandas-compatible API:

from matrixone import Client from matrixone.orm import declarative_base, Column, Integer, String Base = declarative_base() class User(Base): __tablename__ = 'users' id = Column(Integer, primary_key=True) name = Column(String(100)) email = Column(String(255)) client = Client() client.connect(database='test') # Create table client.create_table(User) # Basic CSV load (pandas-style) client.load_data.read_csv('users.csv', table=User) # CSV with header (pandas-style) client.load_data.read_csv('users.csv', table=User, skiprows=1) # Custom separator (pandas-style) client.load_data.read_csv('users.txt', table=User, sep='|') # Tab-separated (pandas-style) client.load_data.read_csv('users.tsv', table=User, sep='\t') # JSON Lines (pandas-style) client.load_data.read_json('events.jsonl', table='events', lines=True, orient='records') # Parquet (pandas-style) client.load_data.read_parquet('data.parquet', table=User) # Load from stage client.load_data.read_csv('stage://s3_stage/users.csv', table=User) client.load_data.read_csv_stage('s3_stage', 'users.csv', table=User) # Inline data csv_data = "1,Alice,alice@example.com\\n2,Bob,bob@example.com\\n" client.load_data.read_csv(csv_data, table=User, inline=True) client.disconnect()

Key Features:

✅ Pandas-compatible API (read_csv(), read_json(), read_parquet())
✅ Pandas-style parameters (sep, quotechar, skiprows, names, encoding)
✅ Load from local files or external stages (stage:// protocol)
✅ Support for compression, parallel loading, and inline data

Data Export (Pandas-Style)

Export query results to CSV or JSONL with pandas-compatible API:

from matrixone import Client from sqlalchemy import select client = Client() client.connect(database='test') # Basic CSV export (pandas-style) client.export.to_csv('/tmp/users.csv', "SELECT * FROM users") # With custom separator client.export.to_csv('/tmp/users.tsv', "SELECT * FROM users", sep='\t') # Export with SQLAlchemy stmt = select(User).where(User.age > 25) client.export.to_csv('/tmp/adults.csv', stmt, sep='|') # Export to JSONL client.export.to_jsonl('/tmp/users.jsonl', "SELECT * FROM users") # Export to external stage (using stage:// protocol) client.export.to_csv('stage://s3_stage/backup.csv', stmt) # Export to external stage (using convenience method) client.export.to_csv_stage('s3_stage', 'backup2.csv', stmt) # Export within transaction with client.session() as session: stmt = select(Product).where(Product.stock > 0) session.export.to_csv('/tmp/in_stock.csv', stmt) client.disconnect()

Key Features:

✅ Pandas-compatible API (to_csv(), to_jsonl())
✅ Convenience methods for stage exports (to_csv_stage(), to_jsonl_stage())
✅ Support for raw SQL, SQLAlchemy, and MatrixOne queries
✅ Export to local files or external stages (stage:// protocol)
✅ Customizable CSV options (sep, quotechar, lineterminator)

Wrapping Existing SQLAlchemy Sessions (For Legacy Projects)

If you have existing SQLAlchemy code, you can wrap your sessions with MatrixOne features without refactoring:

from sqlalchemy import create_engine from sqlalchemy.orm import sessionmaker from matrixone import Client from matrixone.session import Session as MatrixOneSession # Your existing SQLAlchemy setup engine = create_engine('mysql+pymysql://root:111@localhost:6001/test') SessionFactory = sessionmaker(bind=engine) sqlalchemy_session = SessionFactory() # Create MatrixOne client mo_client = Client() mo_client.connect(host='localhost', port=6001, user='root', password='111', database='test') # Wrap your existing session with MatrixOne features mo_session = MatrixOneSession( client=mo_client, wrap_session=sqlalchemy_session ) try: # Your existing SQLAlchemy operations still work result = mo_session.execute("SELECT * FROM users") # Now you can also use MatrixOne-specific features mo_session.stage.create_s3('backup_stage', bucket='my-backups', path='') mo_session.snapshots.create('daily_backup', level='database') mo_session.load_data.read_csv('/data/users.csv', table='users') mo_session.commit() finally: mo_session.close()

Use Cases:

🔄 Gradual migration from pure SQLAlchemy to MatrixOne
🏢 Adding MatrixOne features to existing enterprise applications
📦 Legacy code modernization without complete refactoring
🔧 Testing MatrixOne features alongside existing code

SQLAlchemy ORM Integration

The SDK provides seamless SQLAlchemy integration with ORM-style operations:

from matrixone import Client from matrixone.orm import Base, Column, Integer, String, select, insert, update, delete # Define ORM models class User(Base): __tablename__ = 'users' id = Column(Integer, primary_key=True) name = Column(String(100)) email = Column(String(255)) age = Column(Integer) client = Client() client.connect(database='test') # Create table from model client.create_table(User) # ORM-style INSERT stmt = insert(User).values(name='John', email='john@example.com', age=30) client.execute(stmt) # ORM-style SELECT stmt = select(User).where(User.age > 25) result = client.execute(stmt) for user in result.scalars(): print(f"User: {user.name}, Age: {user.age}") # ORM-style UPDATE stmt = update(User).where(User.id == 1).values(email='newemail@example.com') client.execute(stmt) # ORM-style DELETE stmt = delete(User).where(User.age < 18) client.execute(stmt) client.disconnect()

Recommended Practices:

✅ Use session() for multi-statement transactions
✅ Use ORM-style statements (select, insert, update, delete)
✅ Use client.execute() for single-statement operations
✅ Prefer SQLAlchemy statements over raw SQL strings

Async Usage

Full async/await support with AsyncClient and async sessions:

import asyncio from matrixone import AsyncClient from sqlalchemy import select, insert, update async def main(): client = AsyncClient() await client.connect(database='test') # Basic async query result = await client.execute(select(User).where(User.age > 25)) print(result.fetchall()) # Async transaction async with client.session() as session: # All operations are atomic await session.execute(insert(User).values(name='Alice', age=30)) await session.execute(update(User).where(User.age < 18).values(status='minor')) # Concurrent queries with asyncio.gather user_result, order_result = await asyncio.gather( session.execute(select(User)), session.execute(select(Order)) ) # Commits automatically await client.disconnect() asyncio.run(main())

Async Benefits:

✅ Non-blocking I/O operations
✅ Concurrent execution with asyncio.gather()
✅ Perfect for FastAPI, aiohttp, and async frameworks
✅ Same transaction guarantees as sync version

Snapshot Management

Create and manage database snapshots. Use session() for atomic snapshot + clone operations:

from matrixone import SnapshotLevel # Single snapshot operation snapshot = client.snapshots.create( name='my_snapshot', level=SnapshotLevel.DATABASE, database='production' ) # List snapshots snapshots = client.snapshots.list() for snap in snapshots: print(f"Snapshot: {snap.name}, Created: {snap.created_at}") # Atomic snapshot + clone in transaction with client.session() as session: # Create snapshot snapshot = session.snapshots.create( name='daily_backup', level=SnapshotLevel.DATABASE, database='production' ) # Clone from snapshot atomically session.clone.clone_database( target_db='production_copy', source_db='production', snapshot_name='daily_backup' ) # Both operations commit together

Vector Search

from matrixone.sqlalchemy_ext import create_vector_column from sqlalchemy import Column, BigInteger, String, Text from matrixone.orm import declarative_base import numpy as np # Define vector table model Base = declarative_base() class Document(Base): __tablename__ = 'documents' # IMPORTANT: HNSW index requires BigInteger primary key id = Column(BigInteger, primary_key=True, autoincrement=True) title = Column(String(200)) content = Column(Text) embedding = create_vector_column(384, precision='f32') # 384-dim vector # Create table using client API client.create_table(Document) # Enable and create HNSW index client.vector_ops.enable_hnsw() client.vector_ops.create_hnsw( 'documents', # table name - positional argument name='idx_embedding', column='embedding', m=16, ef_construction=200 ) # Insert document with vector using client API client.insert(Document, { 'title': 'AI Research Paper', 'content': 'Advanced machine learning techniques...', 'embedding': np.random.rand(384).tolist() }) # Vector similarity search query_vector = np.random.rand(384).tolist() results = client.vector_ops.similarity_search( 'documents', # table name - positional argument vector_column='embedding', query_vector=query_vector, limit=5, distance_type='cosine' ) # ⭐ CRITICAL: Monitor IVF index health for production stats = client.vector_ops.get_ivf_stats('documents', 'embedding') counts = stats['distribution']['centroid_count'] balance_ratio = max(counts) / min(counts) if min(counts) > 0 else float('inf') print(f"Index Health: {len(counts)} centroids, balance ratio: {balance_ratio:.2f}") if balance_ratio > 2.5: print("⚠️ Index needs rebuilding for optimal performance!")

Fulltext Search

# Enable and create fulltext index using client API client.fulltext_index.enable_fulltext() client.fulltext_index.create( 'documents', # table name - positional argument name='ftidx_content', columns=['title', 'content'] ) # Boolean search with encourage (like natural language search) from matrixone.sqlalchemy_ext.fulltext_search import boolean_match results = client.query( Document.title, Document.content, boolean_match('title', 'content').encourage('machine learning techniques') ).execute() # Boolean search with must/should/must_not operators results = client.query( Document.title, Document.content, boolean_match('title', 'content') .must('machine') .must('learning') .must_not('basic') ).execute()

Data Loading with Stages

Load data from external sources using stages. Use session() for atomic multi-file loading:

from matrixone.orm import Base, Column, Integer, String # Define ORM model class User(Base): __tablename__ = 'users' id = Column(Integer, primary_key=True) name = Column(String(100)) email = Column(String(255)) # Create S3 stage using simple interface client.stage.create_s3( name='data_stage', bucket='my-bucket', path='data/', aws_key_id='your_key', aws_secret_key='your_secret', region='us-east-1' ) # Load data from stage using ORM model (pandas-style) client.load_data.read_csv_stage('data_stage', 'users.csv', table=User) # Atomic multi-file loading in transaction with client.session() as session: # Create local stage session.stage.create_local('import_stage', '/data/imports/') # Load multiple files atomically (pandas-style) session.load_data.read_csv('/data/users.csv', table=User) session.load_data.read_csv('/data/orders.csv', table=Order) # All loads commit together

Stage Benefits:

✅ Centralized data source configuration
✅ Support for S3, local filesystem, and cloud storage
✅ Automatic path resolution
✅ Transaction support for atomic operations

Metadata Analysis

# Get table metadata with statistics metadata_rows = client.metadata.scan( dbname='test', tablename='documents', columns='*' # Get structured MetadataRow objects ) for row in metadata_rows: print(f"Column: {row.col_name}") print(f" Nulls: {row.null_cnt}") print(f" Rows: {row.rows_cnt}") print(f" Size: {row.origin_size}") # Get table statistics brief_stats = client.metadata.get_table_brief_stats( dbname='test', tablename='documents' ) table_stats = brief_stats['documents'] print(f"Total rows: {table_stats['row_cnt']}") print(f"Original size: {table_stats['original_size']} bytes") print(f"Compressed size: {table_stats['compress_size']} bytes")

Secondary Index Verification

# Get all secondary index tables index_tables = client.get_secondary_index_tables('my_table') print(f"Found {len(index_tables)} secondary indexes") # Get specific index by name physical_table = client.get_secondary_index_table_by_name('my_table', 'idx_name') print(f"Index 'idx_name' -> {physical_table}") # Verify all indexes have same count as main table try: count = client.verify_table_index_counts('my_table') print(f"✓ Verification passed! Row count: {count}") except ValueError as e: print(f"✗ Index mismatch detected:") print(e) # Error includes all count details for debugging

Version Management

# Check if feature is available if client.is_feature_available('snapshot_creation'): snapshot = client.snapshots.create('my_snapshot', 'cluster') else: hint = client.get_version_hint('snapshot_creation') print(f"Feature not available: {hint}") # Check version compatibility if client.check_version_compatibility('3.0.0', '>='): print("Backend supports 3.0.0+ features")

MatrixOne Version Support

The SDK automatically detects MatrixOne backend versions and handles compatibility:

Development Version: 8.0.30-MatrixOne-v → 999.0.0 (highest priority)
Release Version: 8.0.30-MatrixOne-v3.0.0 → 3.0.0
Legacy Format: MatrixOne 3.0.1 → 3.0.1

# Check if running development version if client.is_development_version(): print("Running development version - all features available") else: print(f"Running release version: {client.get_backend_version()}")

🛠️ mo-diag - Interactive Diagnostic Tool

The mo-diag command-line tool provides an interactive shell for diagnosing and maintaining MatrixOne databases, with a special focus on vector indexes, secondary indexes, and table statistics.

Installation

After installing the SDK, mo-diag is automatically available as a command:

pip install matrixone-python-sdk mo-diag --help

Quick Start

Interactive Mode

Launch the interactive shell to execute multiple diagnostic commands:

# Connect to default localhost mo-diag --database test # Connect to remote database mo-diag --host 192.168.1.100 --port 6001 --user admin --password secret --database production

Interactive Features:

🔍 Tab Completion: Press Tab to auto-complete commands, table names, and database names
⬆️⬇️ Command History: Use arrow keys to browse command history
🔎 History Search: Press Ctrl+R to search command history
🎨 Colored Output: Clear visual feedback with syntax highlighting
💾 Persistent History: Command history saved to ~/.mo_diag_history

Non-Interactive Mode

Execute single commands directly for scripting and automation:

# Check IVF index status mo-diag -d test -c "show_ivf_status" # Get detailed table statistics mo-diag -d test -c "show_table_stats my_table -a -d" # Execute SQL query mo-diag -d test -c "sql SELECT COUNT(*) FROM my_table" # Flush table and all indexes mo-diag -d test -c "flush_table my_table"

CDC Task Shortcuts

Manage Change Data Capture directly from the shell or scripts. The interactive commands (cdc_tasks, cdc_task, cdc_create, cdc_drop) now have matching non-interactive helpers via mo-diag cdc.

# Summaries (same as running cdc_tasks inside the shell) mo-diag cdc show # Inspect a task with watermark threshold control mo-diag cdc show nightly_sync --details --threshold=5m # Launch the guided creator with a forced mode mo-diag cdc create --table-level # Drop a task with explicit confirmation override mo-diag cdc drop nightly_sync --force

Available Commands

Index Management

show_indexes <table> [database]

Display all indexes for a table including IVF, HNSW, Fulltext, and regular indexes
Shows physical table names, index types, and statistics
Includes object counts, row counts, and sizes for vector/fulltext indexes

MO-DIAG[test]> show_indexes ivf_health_demo_docs 📊 Secondary Indexes for 'test.ivf_health_demo_docs' *************************** 1. row *************************** Index Name: idx_embedding_ivf_v2 Algorithm: ivfflat Table Type: metadata Physical Table: __mo_index_secondary_0199e725-0a7a-77b8-b689-ccdd0a33f581 Columns: embedding Statistics: - Objects: 1 - Rows: 7 - Compressed Size: 940 B - Original Size: 1.98 KB *************************** 2. row *************************** Index Name: idx_embedding_ivf_v2 Algorithm: ivfflat Table Type: centroids Physical Table: __mo_index_secondary_0199e725-0a7b-706e-8f0a-a50edc3621a1 Columns: embedding Statistics: - Objects: 1 - Rows: 17 - Compressed Size: 3.09 KB - Original Size: 6.83 KB *************************** 3. row *************************** Index Name: idx_embedding_ivf_v2 Algorithm: ivfflat Table Type: entries Physical Table: __mo_index_secondary_0199e725-0a7c-77f4-8d0b-48fd8258098a Columns: embedding Statistics: - Objects: 1 - Rows: 1,000 - Compressed Size: 156.34 KB - Original Size: 176.07 KB Total: 3 index tables (1 ivfflat with 3 physical tables)

Note: IVF indexes have 3 physical tables (metadata, centroids, entries)

show_all_indexes [database]

Health report for all tables with secondary indexes in the database
Row count consistency checks
IVF/HNSW/Fulltext index status

MO-DIAG[test]> show_all_indexes 📊 Index Health Report for Database 'test': ======================================================================================================================== ✓ HEALTHY (3 tables) ------------------------------------------------------------------------------------------------------------------------ Table Name | Indexes | Row Count | Notes ------------------------------------------------------------------------------------------------------------------------ cms_all_content_chunk_info | 3 | ✓ 32,712 rows | - demo_mixed_indexes | 4 | ✓ 20 rows | - ivf_health_demo_docs | 1 | 300 rows | IVF: 17 centroids, 300 vectors ======================================================================================================================== Summary: ✓ 3 healthy tables Total: 3 tables with indexes 💡 Tip: Use 'verify_counts <table>' or 'show_ivf_status' for detailed diagnostics

verify_counts <table> [database]

Verify row count consistency between main table and all its secondary indexes
Highlights any mismatches

MO-DIAG[test]> verify_counts my_table 📊 Row Count Verification for 'test.my_table' ════════════════════════════════════════════════════════════════════════════ Main table: 10,000 rows ──────────────────────────────────────────────────────────────────────────── ✓ __mo_index_secondary_xxx: 10,000 rows ✓ __mo_index_unique_yyy: 10,000 rows ════════════════════════════════════════════════════════════════════════════ ✅ PASSED: All index tables match (10,000 rows)

Vector Index Monitoring

show_ivf_status [database] [-v] [-t table]

Display IVF index building status and centroid distribution
-v: Verbose mode with detailed centroid information
-t <table>: Filter by specific table

MO-DIAG[test]> show_ivf_status 📊 IVF Index Status in 'test': ════════════════════════════════════════════════════════════════════════════ Table | Index | Column | Centroids | Vectors | Balance | Status ──────────────────────────────────────────────────────────────────────────── ivf_health_demo_docs | idx_embedding_ivf_v2 | embedding | 17 | 1,000 | 2.35 | ✓ active

Table Statistics

show_table_stats <table> [database] [-t] [-a] [-d]

Display table metadata and statistics
-t: Include tombstone statistics
-a: Include all indexes (hierarchical view with -d)
-d: Show detailed object lists

MO-DIAG[test]> show_table_stats ivf_health_demo_docs -a -d 📊 Detailed Table Statistics for 'test.ivf_health_demo_docs': ====================================================================================================================================================== Table: ivf_health_demo_docs Objects: 1 | Rows: 1,000 | Null: 0 | Original: 176.03 KB | Compressed: 156.24 KB Objects: Object Name | Rows | Null Cnt | Original Size | Compressed Size ---------------------------------------------------------------------------------------------------------------------------------------------------- 0199e729-642e-71e0-b338-67c4980ee294_00000 | 1000 | 0 | 176.03 KB | 156.24 KB Index: idx_embedding_ivf_v2 └─ (metadata): __mo_index_secondary_0199e725-0a7a-77b8-b689-ccdd0a33f581:272851 Objects: 1 | Rows: 7 | Null: 0 | Original: 1.98 KB | Compressed: 940 B Objects: Object Name | Rows | Null Cnt | Original Size | Compressed Size ---------------------------------------------------------------------------------------------------------------------------------------------------- 0199e729-642a-7d36-ac37-0ae17325f7ec_00000 | 7 | 0 | 1.98 KB | 940 B └─ (centroids): __mo_index_secondary_0199e725-0a7b-706e-8f0a-a50edc3621a1:272852 Objects: 1 | Rows: 17 | Null: 0 | Original: 6.83 KB | Compressed: 3.09 KB Objects: Object Name | Rows | Null Cnt | Original Size | Compressed Size ---------------------------------------------------------------------------------------------------------------------------------------------------- 0199e729-6431-794d-9c13-d10f2e4a59e7_00000 | 17 | 0 | 6.83 KB | 3.09 KB └─ (entries): __mo_index_secondary_0199e725-0a7c-77f4-8d0b-48fd8258098a:272853 Objects: 1 | Rows: 1,000 | Null: 0 | Original: 696.11 KB | Compressed: 626.37 KB Objects: Object Name | Rows | Null Cnt | Original Size | Compressed Size ---------------------------------------------------------------------------------------------------------------------------------------------------- 0199e729-6438-7db1-bcb6-7b74c59b0aee_00000 | 1000 | 0 | 696.11 KB | 626.37 KB

Note:

Hierarchical structure: Table → Index → Physical Tables → Objects
IVF index has 3 physical tables: metadata (7 rows), centroids (17 rows), entries (1,000 rows)
Physical table name format: table_name:table_id
Use -t flag to include tombstone statistics for deleted/updated data

Database Operations

flush_table <table> [database]

Flush main table and all its secondary index physical tables
Includes IVF metadata/centroids/entries, HNSW, Fulltext, and regular indexes
Requires sys user privileges

MO-DIAG[test]> flush_table ivf_health_demo_docs 🔄 Flushing table: test.ivf_health_demo_docs ✓ Main table flushed: ivf_health_demo_docs 📋 Found 3 index physical tables Index: idx_embedding_ivf_v2 ✓ metadata: __mo_index_secondary_0199e725-0a7a-77b8-b689-ccdd0a33f581 ✓ centroids: __mo_index_secondary_0199e725-0a7b-706e-8f0a-a50edc3621a1 ✓ entries: __mo_index_secondary_0199e725-0a7c-77f4-8d0b-48fd8258098a 📊 Summary: Main table: ✓ flushed Index tables: 3/3 flushed successfully

Note:

Automatically discovers and flushes all index physical tables
For IVF indexes: flushes metadata, centroids, and entries tables
Requires sys user privileges
Recommended before backups or after bulk operations

tables [database]

List all tables in current or specified database

databases

List all databases (highlights current database)

use <database>

Switch to a different database

sql <SQL statement>

Execute arbitrary SQL query

MO-DIAG[test]> sql SELECT COUNT(*) FROM my_table col0 ---- 10000 1 row(s) returned

Utility Commands

history [n | -c]

Show last n commands (default: 20)
-c: Clear command history

help [command]

Show help for all commands or specific command

exit / quit

Exit the interactive shell

Command-Line Options

usage: mo-diag [-h] [--host HOST] [--port PORT] [--user USER] [--password PASSWORD] [--database DATABASE] [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--command COMMAND] options: --host HOST Database host (default: localhost) --port PORT Database port (default: 6001) --user USER Database user (default: root) --password PASSWORD Database password (default: 111) --database DATABASE Database name (optional) -d DATABASE Short form of --database --log-level LEVEL Logging level (default: ERROR) --command COMMAND Execute single command and exit -c COMMAND Short form of --command

Use Cases

1. Monitor IVF Index Health in Production

# Quick check on index status mo-diag -d production -c "show_ivf_status -v" # Detailed index inspection with physical table stats mo-diag -d production -c "show_indexes my_vector_table" # Verify centroid distribution balance mo-diag -d production -c "show_ivf_status -t my_vector_table -v"

2. Debug Index Count Mismatches

# Check row consistency mo-diag -d test -c "verify_counts my_table" # Get health report for all indexes mo-diag -d test -c "show_all_indexes" # Flush to sync if needed mo-diag -d test -c "flush_table my_table"

3. Analyze Table Storage

# Get table statistics with all indexes mo-diag -d test -c "show_table_stats my_table -a" # Detailed object-level analysis mo-diag -d test -c "show_table_stats my_table -a -d" # Include tombstone statistics mo-diag -d test -c "show_table_stats my_table -a -t -d"

4. Automated Maintenance Scripts

#!/bin/bash # daily_index_check.sh DATABASES=("prod_db1" "prod_db2" "prod_db3") for db in "${DATABASES[@]}"; do echo "Checking $db..." mo-diag -d "$db" -c "show_all_indexes" >> /var/log/mo_diag_daily.log mo-diag -d "$db" -c "show_ivf_status" >> /var/log/mo_diag_daily.log done

Tips and Best Practices

Regular Health Checks: Run show_all_indexes daily to catch index issues early
Monitor IVF Balance: Use show_ivf_status -v to ensure even centroid distribution
Before Major Operations: Always verify_counts before bulk updates or migrations
Production Debugging: Use non-interactive mode for logging and monitoring
Tab Completion: Leverage Tab completion to avoid typos in table/database names
Command History: Use Ctrl+R to quickly find and re-execute previous diagnostic commands
Flush Regularly: If you notice index count mismatches, flush_table can help sync

Troubleshooting

Issue: Tab completion not working

Solution: Ensure prompt_toolkit>=3.0.0 is installed: pip install prompt_toolkit

Issue: "Unknown database" error

Solution: Create the database first: mo-diag -d test -c "sql CREATE DATABASE IF NOT EXISTS test"

Issue: Permission denied for flush operations

Solution: Connect as sys user or a user with sufficient privileges

Issue: IVF index shows 0 centroids

Solution: The index might be empty or still building. Check with show_table_stats <table> -a

pitr = client.pitr.create_cluster_pitr( name='cluster_pitr', range_value=7, range_unit='d' )

Restore to specific time

restore_point = datetime(2024, 1, 15, 10, 30, 0) client.pitr.restore_to_time(restore_point)

 ### Account Management ```python # Create user user = client.account.create_user( username='newuser', password='password123', description='New user account' ) # Create role role = client.account.create_role( role_name='analyst', description='Data analyst role' ) # Grant privileges client.account.grant_privilege( user='newuser', role='analyst', privileges=['SELECT', 'INSERT'] )

Pub/Sub Operations

# Create publication publication = client.pubsub.create_publication( name='data_changes', tables=['users', 'orders'], description='User and order changes' ) # Create subscription subscription = client.pubsub.create_subscription( name='data_sync', publication_name='data_changes', target_tables=['users_backup', 'orders_backup'] )

Stage Management

# Create a local file system stage (simplified) stage = client.stage.create_local('my_data_stage', './data/', comment='Main data import stage') # Or use the full method with explicit URL stage = client.stage.create('my_data_stage', 'file:///path/to/data/', comment='Main data import stage') # Create an S3 stage with simplified credentials s3_stage = client.stage.create_s3( 'production_stage', 'my-bucket/data/', aws_key='AKIAIOSFODNN7EXAMPLE', aws_secret='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY', aws_region='us-east-1', comment='Production data stage' ) # Or use environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) s3_stage = client.stage.create_s3('production_stage', 'my-bucket/data/') # List all stages stages = client.stage.list() for stage in stages: print(f"{stage.name}: {stage.url}") # Load data from stage (Method 1: via client.load_data, pandas-style) client.load_data.read_csv_stage('my_data_stage', 'users.csv', table=User) # Load data from stage (Method 2: via stage object - simpler!) stage = client.stage.get('my_data_stage') stage.load_csv('users.csv', User) stage.load_json('events.jsonl', Event) stage.load_parquet('orders.parq', Order) # Batch load multiple files results = stage.load_files({ 'users.csv': User, 'events.jsonl': Event, 'orders.parq': Order }) # Modify stage client.stage.alter('my_data_stage', comment='Updated comment') # Drop stage client.stage.drop('my_data_stage')

Advanced Features

PITR (Point-in-Time Recovery)

# Create PITR for cluster ## Configuration ### Connection Parameters ```python client = Client( connection_timeout=30, query_timeout=300, auto_commit=True, charset='utf8mb4', sql_log_mode='auto', # 'off', 'simple', 'auto', 'full' slow_query_threshold=1.0 )

Logging Configuration

from matrixone import MatrixOneLogger logger = MatrixOneLogger( level=logging.INFO, sql_log_mode='auto', # 'off', 'simple', 'auto', 'full' slow_query_threshold=1.0, max_sql_display_length=500 ) client = Client(logger=logger)

Error Handling

The SDK provides comprehensive error handling with helpful messages:

from matrixone.exceptions import ( ConnectionError, QueryError, VersionError, SnapshotError ) try: snapshot = client.snapshots.create('test', 'cluster') except VersionError as e: print(f"Version compatibility error: {e}") except SnapshotError as e: print(f"Snapshot operation failed: {e}")

Examples

Check out the examples/ directory for comprehensive usage examples:

Basic Examples:

example_01_basic_connection.py - Basic database operations and connectivity
example_02_account_management.py - User and role management
example_03_async_operations.py - Async/await operations with AsyncClient
example_04_transaction_management.py - Transaction handling and rollback
example_05_snapshot_restore.py - Snapshot creation and restore operations
example_06_sqlalchemy_integration.py - SQLAlchemy ORM integration
example_07_advanced_features.py - Advanced SDK features
example_08_pubsub_operations.py - Pub/Sub operations
example_09_logger_integration.py - Logging configuration and integration
example_10_version_management.py - Version compatibility management
example_11_matrixone_version_demo.py - MatrixOne version detection

Vector Search Examples:

example_12_vector_basics.py - Vector data types and basic distance functions
example_13_vector_indexes.py - IVF and HNSW index management and performance
example_14_vector_search.py - Vector similarity search operations
example_15_vector_advanced.py - Advanced vector operations and optimization

ORM and Advanced Examples:

example_18_snapshot_orm.py - Snapshot operations with ORM
example_19_sqlalchemy_style_orm.py - SQLAlchemy-style ORM patterns
example_20_sqlalchemy_engine_integration.py - SQLAlchemy engine integration
example_21_advanced_orm_features.py - Advanced ORM features
example_22_unified_sql_builder.py - Unified SQL query builder
example_24_query_update.py - Query and update operations
example_25_metadata_operations.py - Table metadata analysis and statistics

Load Data Examples:

example_23_load_data_operations.py - Comprehensive LOAD DATA operations (CSV, TSV, JSONLINE, Parquet, Inline, Stage)

Stage Management Examples:

example_26_stage_operations.py - External stage management and data loading from stages

Branch Builder Examples:

example_branch_builder.py - SQLAlchemy-style branch statement builders (create, diff, merge)

Clone Builder Examples:

example_clone_builder.py - SQLAlchemy-style clone statement builders (table, database)

Specialized Examples:

example_connection_hooks.py - Connection hooks for custom initialization
example_dynamic_logging.py - Dynamic logging configuration
example_ivf_stats_complete.py - IVF index statistics and analysis

Running Examples

Run all examples:

make examples

Run specific examples:

make example-basic # Basic connection and database operations make example-async # Async/await operations make example-vector # Vector search and similarity

Run individual example files:

# Basic examples python examples/example_01_basic_connection.py python examples/example_03_async_operations.py # Vector examples python examples/example_12_vector_basics.py python examples/example_13_vector_indexes.py python examples/example_14_vector_search.py # Load Data examples python examples/example_23_load_data_operations.py # Stage management examples python examples/example_26_stage_operations.py # Metadata and advanced examples python examples/example_25_metadata_operations.py python examples/example_22_unified_sql_builder.py

Customizing Connection Parameters

You can override the default connection parameters using environment variables:

# Set custom connection parameters export MATRIXONE_HOST=localhost export MATRIXONE_PORT=6001 export MATRIXONE_USER=root export MATRIXONE_PASSWORD=111 export MATRIXONE_DATABASE=test # Run examples with custom parameters make examples

Or set them inline:

MATRIXONE_HOST=localhost MATRIXONE_USER=admin make example-basic

Available environment variables:

MATRIXONE_HOST: Database host (default: 127.0.0.1)
MATRIXONE_PORT: Database port (default: 6001)
MATRIXONE_USER: Database username (default: root)
MATRIXONE_PASSWORD: Database password (default: 111)
MATRIXONE_DATABASE: Database name (default: test)
MATRIXONE_CHARSET: Character set (default: utf8mb4)
MATRIXONE_CONNECT_TIMEOUT: Connection timeout (default: 30)
MATRIXONE_AUTOCOMMIT: Auto-commit mode (default: true)

🧪 Testing & Development

Testing Overview

The MatrixOne Python SDK includes comprehensive testing with support for multiple Python and SQLAlchemy versions.

📖 For detailed testing documentation, see TESTING.md

Quick Test Commands

# Check environment and dependencies make check-env # Install test dependencies make install-sqlalchemy14 # For SQLAlchemy 1.4.x make install-sqlalchemy20 # For SQLAlchemy 2.0.x # Run tests make test-offline # Fast mock-based tests make test-online # Database integration tests (requires MatrixOne) make test-sqlalchemy14 # Full test suite with SQLAlchemy 1.4.x make test-sqlalchemy20 # Full test suite with SQLAlchemy 2.0.x make test-matrix # Test both SQLAlchemy versions

Test Types

Offline Tests: Mock-based unit tests (450+ tests, ~3-5 seconds)
Online Tests: Database integration tests (100+ tests, ~20-30 seconds)
Matrix Tests: Cross-version compatibility testing (Python 3.10/3.11 + SQLAlchemy 1.4/2.0)

Development Setup

# Check environment requirements make check-env # Setup development environment (includes all dev dependencies) make dev-setup # Install specific dependency sets make install-deps-dev # Install development dependencies make install-deps-test # Install test dependencies  make install-sqlalchemy14 # Install SQLAlchemy 1.4.x dependencies make install-sqlalchemy20 # Install SQLAlchemy 2.0.x dependencies # Run all tests make test # Run examples make examples

Dependency Management:

make install-deps-dev: Installs all development dependencies including pyarrow, Faker, pytest, etc.
make install-deps-test: Installs testing dependencies including pyarrow (required for Parquet)
make install-sqlalchemy14/20: Installs version-specific SQLAlchemy dependencies

Important: pyarrow>=10.0.0 is required for LOAD DATA Parquet functionality. Install via make dev-setup, make install-deps-dev, or make install-deps-test. For detailed testing instructions, environment setup, and troubleshooting, see TESTING.md.

📚 Documentation

📖 Online Documentation

Full Documentation on ReadTheDocs ⭐

Complete documentation with examples, API reference, and production best practices.

Local Documentation Guides

Getting Started:

Quick Start: Get started quickly with common use cases
Installation Guide: Installation and setup instructions
Configuration Guide: Connection and client configuration

Core Features:

Vector Guide: Vector search, indexing, and ⭐ IVF health monitoring
Fulltext Guide: Fulltext search and indexing
ORM Guide: SQLAlchemy ORM integration and patterns
Metadata Guide: Table metadata analysis and statistics

Advanced Features:

Snapshot & Restore: Snapshot and PITR operations
Clone Guide: Database and table cloning
Account Guide: User and role management
Pub/Sub Guide: Publication and subscription operations
MoCTL Guide: MoCTL integration

Production Guide:

Best Practices: Production-ready patterns using SDK APIs
Connection Hooks: Pre/post connection hooks

Reference:

API Reference: Complete API reference
Examples: Comprehensive usage examples
Testing Guide: Testing instructions and environment setup

Generate HTML documentation:

make docs # Generate HTML documentation make docs-serv # Start documentation server at http://localhost:8000

🤝 Contributing

We welcome contributions!

Development Workflow

Fork the repository

Set up development environment

git clone <your-fork-url> cd matrixone/clients/python # Create virtual environment python -m venv venv source venv/bin/activate # On macOS/Linux # Install development dependencies make dev-setup

Create a feature branch

git checkout -b feature/your-feature-name

Make your changes
Add tests for new functionality
Run pre-commit checks
```
make pre-commit # format + lint + test
```
Submit a pull request

Virtual Environment for Contributors

# Always work in virtual environment source venv/bin/activate # Before making changes make test # Ensure all tests pass # After making changes make format # Format code make lint # Check code quality make test # Run tests make examples # Test examples # Before submitting PR make release-check # Complete pre-release checks

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Support

📧 Email: contact@matrixorigin.cn
🐛 Issues: GitHub Issues
💬 Discussions: GitHub Discussions
📖 Documentation:
- MatrixOne Docs (English)
- MatrixOne Docs (中文)

Changelog

See CHANGELOG.md for version history and changes.

MatrixOne Python SDK - Making MatrixOne database operations simple and powerful in Python.

FilesExpand file tree

python

Directory actions

More options

Directory actions

More options

Latest commit

History

python

Folders and files

parent directory

README.md

MatrixOne Python SDK

✨ Features

⚠️ Important: Column Naming Convention

🚀 Installation

From PyPI (Stable Release)

From test.pypi (Latest Pre-release)

Development Installation

Using Virtual Environment (Recommended)

Using Conda Environment

Direct Installation (Not Recommended)

Testing Environment Setup

For Testing and Examples

For Load Data Features

Quick Start

Basic Usage

Transaction Management (Recommended)

Data Loading (Pandas-Style)

Data Export (Pandas-Style)

Wrapping Existing SQLAlchemy Sessions (For Legacy Projects)

SQLAlchemy ORM Integration

Async Usage

Snapshot Management

Vector Search

Fulltext Search

Data Loading with Stages

Metadata Analysis

Secondary Index Verification

Version Management

MatrixOne Version Support

🛠️ mo-diag - Interactive Diagnostic Tool

Installation

Quick Start

Interactive Mode

Non-Interactive Mode

CDC Task Shortcuts

Available Commands

Index Management

Vector Index Monitoring

Table Statistics

Database Operations

Utility Commands

Command-Line Options

Use Cases

1. Monitor IVF Index Health in Production

2. Debug Index Count Mismatches

3. Analyze Table Storage

4. Automated Maintenance Scripts

Tips and Best Practices

Troubleshooting

Restore to specific time

Pub/Sub Operations

Stage Management

Advanced Features

PITR (Point-in-Time Recovery)

Logging Configuration

Error Handling

Examples

Running Examples

Customizing Connection Parameters

🧪 Testing & Development

Testing Overview

Quick Test Commands

Test Types

Development Setup

📚 Documentation

📖 Online Documentation

Local Documentation Guides

🤝 Contributing