incdbscan-rs

A high-performance Rust implementation of IncrementalDBSCAN with Python bindings.

IncrementalDBSCAN maintains DBSCAN clustering incrementally as data points are inserted or deleted one at a time, without re-running DBSCAN from scratch. After each update, the result is identical to running DBSCAN on the full updated dataset.

This is a complete rewrite of incdbscan (Python) in Rust, using PyO3 for Python bindings. The algorithm and correctness are preserved; the implementation language and data structures change for dramatically better performance and stability.

Based on: Ester et al. 1998. Incremental Clustering for Mining in a Data Warehousing Environment. VLDB 1998.

Installation

pip install incdbscan-rs

From source (requires Rust toolchain)

# Install Rust if needed curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh # Build and install pip install maturin maturin develop --release

Usage

import numpy as np from incdbscan_rs import IncrementalDBSCAN # Create the model db = IncrementalDBSCAN(eps=1.5, min_pts=5, p=2.0) # Insert data points (numpy 2D array) data = np.array([ [0.0, 0.0], [1.0, 0.0], [0.5, 0.5], [10.0, 10.0], ]) db.insert(data) # Get cluster labels # Returns: cluster IDs (>= 0), -1 for noise, NaN for unknown points labels = db.get_cluster_labels(data) # array([0., 0., 0., -1.]) # Insert more points incrementally new_points = np.array([[10.5, 10.0], [10.0, 10.5], [11.0, 11.0], [10.5, 10.5]]) db.insert(new_points) # Labels update incrementally - no need to recluster labels = db.get_cluster_labels(np.array([[10.0, 10.0]])) # Now part of a cluster instead of noise # Delete points deleted = db.delete(np.array([[0.0, 0.0]])) # Returns [True] - point was found and removed # Returns [False] if the point didn't exist

API

`IncrementalDBSCAN(eps=1.0, min_pts=5, p=2.0)`

Parameter	Type	Default	Description
`eps`	`float`	`1.0`	Radius for neighborhood queries. Two points are neighbors if their distance is <= `eps`.
`min_pts`	`int`	`5`	Minimum number of neighbors required for a point to be a core point.
`p`	`float`	`2.0`	Minkowski distance parameter. `p=2.0` is Euclidean, `p=1.0` is Manhattan, `p=inf` is Chebyshev.

Methods

Method	Input	Output	Description
`insert(X)`	`ndarray (n, d)`	`None`	Insert points and update clustering.
`delete(X)`	`ndarray (n, d)`	`list[bool]`	Delete points. Returns whether each point was found.
`get_cluster_labels(X)`	`ndarray (n, d)`	`ndarray (n,)`	Get labels: `>= 0` = cluster, `-1` = noise, `NaN` = not found.

Important: All input arrays must be float64 (np.float64). If your data comes from pandas or another source as float32 or int, convert it first:

data = data.astype(np.float64)

Performance

Benchmarks vs Python incdbscan

Measured on the same machine with identical data (random 2D points, eps=2.0, min_pts=5). Each benchmark inserts all points, then deletes half.

Insertion speed

Dataset size	Python	Rust	Speedup
200 points	0.296s	0.001s	210x
500 points	0.494s	0.003s	147x
1000 points	1.087s	0.011s	100x
500 pts, 10D	0.484s	0.001s	425x

The Python version rebuilds a KD-tree (sklearn.NearestNeighbors.fit()) on every single insertion -- O(n log n) per insert. The Rust version uses a flat Vec with O(1) append and O(n) brute-force query, which wins massively because the tree rebuild is the bottleneck.

Deletion speed

Dataset size	Python	Rust	Speedup
200 pts, delete 100	0.003s	0.0002s	14x
500 pts, delete 250	0.098s	0.014s	7x
1000 pts, delete 500	1.478s	0.240s	6x

Deletion involves BFS-based split detection, which has similar algorithmic complexity in both versions. Gains come from Rust's tight loops, no Python object overhead, and no FFI callback crossing.

Overall (insert + delete)

Dataset size	Python	Rust	Speedup
200 points	0.299s	0.002s	184x
500 points	0.592s	0.017s	34x
1000 points	2.566s	0.251s	10x

High-dimensional scaling (v0.2.0)

Simulates a real embedding workload: 996-dimensional L2-normalized vectors (resembling OpenAI/Gemini text embeddings), eps=1.2, min_pts=5, inserted in batches of 1,500 points. Measured on a single machine.

Batch	Total Points	v0.1.0 (s)	v0.2.0 (s)	Speedup
1	1,500	1.617	0.547	3.0x
3	4,500	8.053	1.397	5.8x
5	7,500	14.466	2.273	6.4x
7	10,500	20.898	3.039	6.9x
10	15,000	29.606	4.347	6.8x
Total		160.2s	24.9s	6.4x

v0.2.0 introduces two optimizations that dramatically improve performance for high-dimensional data:

Early termination in distance computation. For Euclidean distance (p=2.0), squared differences are accumulated in chunks of 4 dimensions. If the partial sum exceeds eps² at any checkpoint, the remaining dimensions are skipped. This is exact -- no approximation, bit-for-bit identical results to a full computation. For high-dimensional embeddings with a tight eps, most non-neighbor pairs are rejected after computing only 5-15% of dimensions.
Parallel spatial index scan. When the dataset exceeds 1,000 points, the brute-force neighbor scan is parallelized across CPU cores using rayon. Below 1,000 points, sequential scan avoids thread pool overhead.

Stress test: 10 batches of 500 points

Simulates a real workload: insert 500 points per batch, delete 100 per batch, 10 batches total (5000 inserts, 1000 deletes).

Python incdbscan

Batch 1: insert=1.85s delete=0.21s mem=810KB Batch 5: insert=2.19s delete=13.06s mem=4,987KB Batch 10: insert=3.13s delete=58.22s mem=14,009KB

Deletion time grows from 0.2s to 58s. Memory grows linearly at ~1.4 MB per batch. The Python version may crash with RecursionError: maximum recursion depth exceeded at larger scales due to circular object references and callback-based BFS (see Stability).

incdbscan-rs

Batch 1: insert=0.003s delete=0.01s mem=12KB Batch 5: insert=0.027s delete=0.60s mem=12KB Batch 10: insert=0.086s delete=2.65s mem=13KB

Deletion time grows from 0.01s to 2.6s (inherent to the algorithm), but remains 22x faster than Python throughout. Python-side memory stays flat at 12-13 KB because all data lives in Rust's heap.

Correctness verification

The Rust version produces identical results to both the Python incdbscan and sklearn's DBSCAN:

All benchmarks above show matching cluster counts and noise counts between Python and Rust
Cross-validation against sklearn.cluster.DBSCAN confirms label assignments are isomorphic (same clustering, potentially different label numbering)
Tested scenarios: cluster creation, absorption, merge, 2-way split, 3-way split, duplicate handling, noise detection, multi-dimensional data (2D through 100D)

Stability

Why the Python version crashes on long-running tasks

The Python incdbscan can hit RecursionError: maximum recursion depth exceeded after several batches of insertions/deletions. There are two root causes:

Circular object references. Each Object stores self.neighbors = {self}, and neighbors cross-reference each other. After thousands of insertions, Python's cyclic garbage collector must trace these chains, which can exceed the default recursion limit (1000) during GC sweeps.
BFS via Python callbacks. Split detection uses rustworkx.bfs_search() which invokes a Python BFSVisitor callback for every node and edge. In dense graphs, these callbacks accumulate on the call stack.
Quadratic memory operations. numpy.insert() copies the entire coordinate array on every single point insertion, causing O(n^2) total memory operations and heavy GC pressure.

Why incdbscan-rs is immune

Concern	Python	Rust
Recursion	BFS via visitor callbacks, GC cycle tracing	Zero recursion -- all traversals are iterative loops
Memory model	Circular `Object` references, cyclic GC	`u64` IDs in `HashMap` and `petgraph` -- no reference cycles, no GC
Spatial index	KD-tree rebuild + numpy array copy per insert	Flat `Vec` with O(1) append -- no copies
Stack growth	Proportional to graph size via callbacks	Constant -- heap-allocated `VecDeque` for BFS
Python-side memory	14 MB at batch 10, growing ~1.4 MB/batch	13 KB flat -- all data lives in Rust heap

The Rust version has no call stack growth proportional to data size. Every graph traversal uses while let Some(node) = queue.pop_front() { ... } with a heap-allocated queue.

Architecture

src/ ├── lib.rs # PyO3 module + IncrementalDBSCAN pyclass ├── engine.rs # Pure-Rust IncrementalDbscan entry point ├── types.rs # ObjectId (u64), ClusterLabel (i64), constants, hash function ├── distance.rs # Minkowski distance family (p=2 optimized with early termination) ├── object.rs # ObjectData struct (id, count, neighbor_count, core status) ├── spatial_index.rs # Brute-force spatial index (Vec-based, O(1) insert, parallel O(n) query) ├── labels.rs # LabelHandler (bidirectional HashMap mapping) ├── objects.rs # Central manager (petgraph StableGraph + spatial index + labels) ├── inserter.rs # Insertion algorithm (creation / absorption / merge) ├── deleter.rs # Deletion algorithm (split detection, border reassignment) └── bfs_split.rs # Multi-source BFS for cluster split detection

Key design decisions

petgraph::StableGraph instead of a plain graph. Stable node indices survive node removal, which is critical since we store NodeIndex values in hash maps.
No neighbor set on objects. The Python version stores obj.neighbors as a set. Rust queries graph.neighbors(node_idx) directly, avoiding duplicated state and circular references.
DeletedObjectInfo pattern. Python accesses a deleted object's neighbors after deletion (the object persists in memory via GC). Rust snapshots neighbor data into a struct before removal.
Brute-force spatial index with early termination. O(1) insert + O(n) query per insert beats Python's O(n log n) tree rebuild + O(log n + k) query, because the rebuild dominates. For Euclidean distance, the query uses early termination to skip dimensions once the partial squared distance exceeds eps², and parallelizes across CPU cores via rayon for datasets above 1,000 points.
Feature-gated PyO3. PyO3 bindings are behind the extension-module Cargo feature, so cargo test runs pure Rust tests without requiring a Python interpreter.

Running tests

Rust unit tests (25 tests)

cargo test

Tests cover: distance calculations, early termination correctness, hashing, spatial index operations, label management, object data structures.

Python tests (36 tests)

pip install incdbscan-rs[dev] pytest

Tests cover: construction, noise, cluster creation, absorption, merge, duplicates, deletion, 2-way/3-way splits, reinsert, multi-dimensional (1D-50D), distance metrics, sklearn cross-validation, stress testing.

Benchmarks

cargo bench --bench batch_scaling

Runs the high-dimensional batch insertion benchmark (10 batches of 1,500 points, 996 dimensions).

Differences from Python incdbscan

Feature	Python incdbscan	incdbscan-rs
Distance metrics	Any sklearn metric	Minkowski family only (p=1, 2, inf, or any p >= 1)
`delete()` return value	Returns `self`	Returns `list[bool]` (whether each point was found)
Warnings for missing objects	`IncrementalDBSCANWarning`	`NaN` in labels, `False` in delete results
Dependencies	numpy, scikit-learn, rustworkx, sortedcontainers, xxhash	numpy (Python side only)
Minimum Python	3.9	3.9

License

BSD-3-Clause. See LICENSE.

Based on the incdbscan project by Arpad Fulop.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
.github/workflows		.github/workflows
benches		benches
python/incdbscan_rs		python/incdbscan_rs
src		src
tests		tests
.gitignore		.gitignore
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

incdbscan-rs

Installation

From source (requires Rust toolchain)

Usage

API

`IncrementalDBSCAN(eps=1.0, min_pts=5, p=2.0)`

Methods

Performance

Benchmarks vs Python incdbscan

Insertion speed

Deletion speed

Overall (insert + delete)

High-dimensional scaling (v0.2.0)

Stress test: 10 batches of 500 points

Python incdbscan

incdbscan-rs

Correctness verification

Stability

Why the Python version crashes on long-running tasks

Why incdbscan-rs is immune

Architecture

Key design decisions

Running tests

Rust unit tests (25 tests)

Python tests (36 tests)

Benchmarks

Differences from Python incdbscan

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

incdbscan-rs

Installation

From source (requires Rust toolchain)

Usage

API

IncrementalDBSCAN(eps=1.0, min_pts=5, p=2.0)

Methods

Performance

Benchmarks vs Python incdbscan

Insertion speed

Deletion speed

Overall (insert + delete)

High-dimensional scaling (v0.2.0)

Stress test: 10 batches of 500 points

Python incdbscan

incdbscan-rs

Correctness verification

Stability

Why the Python version crashes on long-running tasks

Why incdbscan-rs is immune

Architecture

Key design decisions

Running tests

Rust unit tests (25 tests)

Python tests (36 tests)

Benchmarks

Differences from Python incdbscan

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`IncrementalDBSCAN(eps=1.0, min_pts=5, p=2.0)`

Packages