# incdbscan-rs

A high-performance Rust implementation of IncrementalDBSCAN with Python bindings.
IncrementalDBSCAN maintains DBSCAN clustering incrementally as data points are inserted or deleted one at a time, without re-running DBSCAN from scratch. After each update, the result is identical to running DBSCAN on the full updated dataset.
This is a complete rewrite of incdbscan (Python) in Rust, using PyO3 for Python bindings. The algorithm and correctness are preserved; the implementation language and data structures change for dramatically better performance and stability.
Based on: Ester et al., "Incremental Clustering for Mining in a Data Warehousing Environment," VLDB 1998.
## Installation

```bash
pip install incdbscan-rs
```

### From source

```bash
# Install Rust if needed
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Build and install
pip install maturin
maturin develop --release
```

## Usage

```python
import numpy as np
from incdbscan_rs import IncrementalDBSCAN

# Create the model
db = IncrementalDBSCAN(eps=1.5, min_pts=5, p=2.0)

# Insert data points (numpy 2D array)
data = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.5, 0.5],
    [10.0, 10.0],
])
db.insert(data)

# Get cluster labels
# Returns: cluster IDs (>= 0), -1 for noise, NaN for unknown points
labels = db.get_cluster_labels(data)
# array([0., 0., 0., -1.])

# Insert more points incrementally
new_points = np.array([[10.5, 10.0], [10.0, 10.5], [11.0, 11.0], [10.5, 10.5]])
db.insert(new_points)

# Labels update incrementally - no need to recluster
labels = db.get_cluster_labels(np.array([[10.0, 10.0]]))
# Now part of a cluster instead of noise

# Delete points
deleted = db.delete(np.array([[0.0, 0.0]]))
# Returns [True] - point was found and removed
# Returns [False] if the point didn't exist
```

## Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `eps` | float | 1.0 | Radius for neighborhood queries. Two points are neighbors if their distance is <= `eps`. |
| `min_pts` | int | 5 | Minimum number of neighbors required for a point to be a core point. |
| `p` | float | 2.0 | Minkowski distance parameter. `p=2.0` is Euclidean, `p=1.0` is Manhattan, `p=inf` is Chebyshev. |
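The effect of `p` can be illustrated with plain numpy. This is a sketch for intuition only, not the library's internal distance code; the helper name `minkowski` is ours:

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance between two vectors; p=inf gives Chebyshev."""
    d = np.abs(np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64))
    if np.isinf(p):
        return float(d.max())
    return float((d ** p).sum() ** (1.0 / p))

a, b = [0.0, 0.0], [3.0, 4.0]
minkowski(a, b, 1.0)     # Manhattan:  7.0
minkowski(a, b, 2.0)     # Euclidean:  5.0
minkowski(a, b, np.inf)  # Chebyshev:  4.0
```

Two points are neighbors exactly when this distance is at most `eps`.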
| Method | Input | Output | Description |
|---|---|---|---|
| `insert(X)` | ndarray (n, d) | None | Insert points and update clustering. |
| `delete(X)` | ndarray (n, d) | list[bool] | Delete points. Returns whether each point was found. |
| `get_cluster_labels(X)` | ndarray (n, d) | ndarray (n,) | Get labels: >= 0 = cluster, -1 = noise, NaN = not found. |
**Important:** all input arrays must be float64 (`np.float64`). If your data comes from pandas or another source as float32 or int, convert it first:

```python
data = data.astype(np.float64)
```

## Benchmarks

Measured on the same machine with identical data (random 2D points, eps=2.0, min_pts=5). Each benchmark inserts all points, then deletes half.

### Insertion
| Dataset size | Python | Rust | Speedup |
|---|---|---|---|
| 200 points | 0.296s | 0.001s | 210x |
| 500 points | 0.494s | 0.003s | 147x |
| 1000 points | 1.087s | 0.011s | 100x |
| 500 pts, 10D | 0.484s | 0.001s | 425x |
The Python version rebuilds a KD-tree (`sklearn.neighbors.NearestNeighbors.fit()`) on every single insertion -- O(n log n) per insert. The Rust version uses a flat Vec with O(1) append and O(n) brute-force query, which wins decisively because the tree rebuild is the bottleneck.

### Deletion
| Dataset size | Python | Rust | Speedup |
|---|---|---|---|
| 200 pts, delete 100 | 0.003s | 0.0002s | 14x |
| 500 pts, delete 250 | 0.098s | 0.014s | 7x |
| 1000 pts, delete 500 | 1.478s | 0.240s | 6x |
Deletion involves BFS-based split detection, which has similar algorithmic complexity in both versions. The gains come from Rust's tight loops, the absence of Python object overhead, and no FFI callback crossings.

### Total (insert + delete)
| Dataset size | Python | Rust | Speedup |
|---|---|---|---|
| 200 points | 0.299s | 0.002s | 184x |
| 500 points | 0.592s | 0.017s | 34x |
| 1000 points | 2.566s | 0.251s | 10x |
### High-dimensional embeddings (v0.1.0 vs v0.2.0)

Simulates a realistic embedding workload: 996-dimensional L2-normalized vectors (resembling OpenAI/Gemini text embeddings), eps=1.2, min_pts=5, inserted in batches of 1,500 points. Measured on a single machine.
| Batch | Total Points | v0.1.0 (s) | v0.2.0 (s) | Speedup |
|---|---|---|---|---|
| 1 | 1,500 | 1.617 | 0.547 | 3.0x |
| 3 | 4,500 | 8.053 | 1.397 | 5.8x |
| 5 | 7,500 | 14.466 | 2.273 | 6.4x |
| 7 | 10,500 | 20.898 | 3.039 | 6.9x |
| 10 | 15,000 | 29.606 | 4.347 | 6.8x |
| Total | — | 160.2 | 24.9 | 6.4x |
v0.2.0 introduces two optimizations that dramatically improve performance for high-dimensional data:
- Early termination in distance computation. For Euclidean distance (`p=2.0`), squared differences are accumulated in chunks of 4 dimensions. If the partial sum exceeds eps² at any checkpoint, the remaining dimensions are skipped. This is exact -- no approximation, bit-for-bit identical results to a full computation. For high-dimensional embeddings with a tight eps, most non-neighbor pairs are rejected after computing only 5-15% of dimensions.
- Parallel spatial index scan. When the dataset exceeds 1,000 points, the brute-force neighbor scan is parallelized across CPU cores using rayon. Below 1,000 points, a sequential scan avoids thread-pool overhead.
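The early-termination idea can be sketched in Python (the real implementation is Rust; the helper name `within_eps_euclidean` is illustrative):

```python
import numpy as np

def within_eps_euclidean(a, b, eps, chunk=4):
    """Exact early-terminating check: is ||a - b||_2 <= eps?

    Squared differences are accumulated in chunks; we bail out as soon
    as the partial sum already exceeds eps**2, since adding the
    remaining (non-negative) terms can only make it larger.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    eps_sq = eps * eps
    acc = 0.0
    for start in range(0, len(a), chunk):
        d = a[start:start + chunk] - b[start:start + chunk]
        acc += float(np.dot(d, d))  # partial sum of squared differences
        if acc > eps_sq:            # remaining dimensions cannot reduce it
            return False
    return True
```

Because the check compares a monotonically growing sum against a fixed threshold, the early exit never changes the answer -- it only skips work for pairs that are clearly not neighbors.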
### Long-running workload

Simulates a real workload: insert 500 points per batch, delete 100 per batch, 10 batches total (5,000 inserts, 1,000 deletes).
Python:

```text
Batch 1:  insert=1.85s  delete=0.21s   mem=810KB
Batch 5:  insert=2.19s  delete=13.06s  mem=4,987KB
Batch 10: insert=3.13s  delete=58.22s  mem=14,009KB
```

Deletion time grows from 0.2s to 58s. Memory grows linearly at ~1.4 MB per batch. The Python version may crash with `RecursionError: maximum recursion depth exceeded` at larger scales due to circular object references and callback-based BFS (see Stability).
Rust:

```text
Batch 1:  insert=0.003s  delete=0.01s  mem=12KB
Batch 5:  insert=0.027s  delete=0.60s  mem=12KB
Batch 10: insert=0.086s  delete=2.65s  mem=13KB
```

Deletion time grows from 0.01s to 2.6s (inherent to the algorithm), but remains 22x faster than Python throughout. Python-side memory stays flat at 12-13 KB because all data lives in Rust's heap.
## Correctness

The Rust version produces identical results to both the Python incdbscan and sklearn's DBSCAN:
- All benchmarks above show matching cluster counts and noise counts between Python and Rust
- Cross-validation against `sklearn.cluster.DBSCAN` confirms label assignments are isomorphic (same clustering, potentially different label numbering)
- Tested scenarios: cluster creation, absorption, merge, 2-way split, 3-way split, duplicate handling, noise detection, multi-dimensional data (2D through 100D)
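Label isomorphism (same partition of the points, different integer names for the clusters) can be checked with a small helper like this hypothetical `labels_isomorphic`:

```python
def labels_isomorphic(a, b):
    """True if two label sequences induce the same partition, regardless
    of which integer names each cluster.

    Note: this treats noise (-1) as just another label; for a strict
    DBSCAN comparison you may additionally require -1 to map to -1.
    """
    if len(a) != len(b):
        return False
    fwd, rev = {}, {}
    for x, y in zip(a, b):
        # require a consistent bijection between label sets
        if fwd.setdefault(x, y) != y or rev.setdefault(y, x) != x:
            return False
    return True
```

For example, `[0, 0, 1, -1]` and `[1, 1, 0, -1]` are isomorphic, while `[0, 0, 1]` and `[0, 1, 1]` are not.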
## Stability

The Python incdbscan can hit `RecursionError: maximum recursion depth exceeded` after several batches of insertions/deletions. There are three root causes:
- Circular object references. Each `Object` stores `self.neighbors = {self}`, and neighbors cross-reference each other. After thousands of insertions, Python's cyclic garbage collector must trace these chains, which can exceed the default recursion limit (1000) during GC sweeps.
- BFS via Python callbacks. Split detection uses `rustworkx.bfs_search()`, which invokes a Python `BFSVisitor` callback for every node and edge. In dense graphs, these callbacks accumulate on the call stack.
- Quadratic memory operations. `numpy.insert()` copies the entire coordinate array on every single point insertion, causing O(n^2) total memory operations and heavy GC pressure.
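The quadratic-copy behavior is easy to demonstrate: `numpy.insert` returns a brand-new array on every call, while an amortized-growth container (like a Python list, or Rust's Vec) copies each element O(1) times on average. A minimal sketch:

```python
import numpy as np

# Growing an array with np.insert copies all existing rows every time:
coords = np.empty((0, 2))
for i in range(5):
    # allocates a new (i+1, 2) array and copies `coords` into it
    coords = np.insert(coords, coords.shape[0], [float(i), 0.0], axis=0)

# Total elements copied over n insertions is 0 + 1 + ... + (n-1) = O(n^2).
# Appending to a list and converting once avoids the repeated copies:
rows = []
for i in range(5):
    rows.append([float(i), 0.0])  # amortized O(1) append
coords_fast = np.asarray(rows)
```

Both loops produce the same array; only the copy cost differs, and the difference compounds over thousands of single-point insertions.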
| Concern | Python | Rust |
|---|---|---|
| Recursion | BFS via visitor callbacks, GC cycle tracing | Zero recursion -- all traversals are iterative loops |
| Memory model | Circular Object references, cyclic GC | u64 IDs in HashMap and petgraph -- no reference cycles, no GC |
| Spatial index | KD-tree rebuild + numpy array copy per insert | Flat Vec with O(1) append -- no copies |
| Stack growth | Proportional to graph size via callbacks | Constant -- heap-allocated VecDeque for BFS |
| Python-side memory | 14 MB at batch 10, growing ~1.4 MB/batch | 13 KB flat -- all data lives in Rust heap |
The Rust version has no call stack growth proportional to data size. Every graph traversal uses `while let Some(node) = queue.pop_front() { ... }` with a heap-allocated queue.
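The same iterative pattern, sketched in Python for illustration (`bfs_reachable` and the dict-based adjacency graph are hypothetical, not the library's API):

```python
from collections import deque

def bfs_reachable(graph, start):
    """Iterative BFS over an adjacency-dict graph.

    No recursion: the frontier lives in a heap-allocated deque
    (mirroring Rust's VecDeque), so stack usage stays constant
    regardless of component size.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen
```

Split detection works on this principle: after a core point is deleted, if a traversal started from one of its former neighbors never reaches another, the component has split into separate clusters.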
## Architecture

```text
src/
├── lib.rs            # PyO3 module + IncrementalDBSCAN pyclass
├── engine.rs         # Pure-Rust IncrementalDbscan entry point
├── types.rs          # ObjectId (u64), ClusterLabel (i64), constants, hash function
├── distance.rs       # Minkowski distance family (p=2 optimized with early termination)
├── object.rs         # ObjectData struct (id, count, neighbor_count, core status)
├── spatial_index.rs  # Brute-force spatial index (Vec-based, O(1) insert, parallel O(n) query)
├── labels.rs         # LabelHandler (bidirectional HashMap mapping)
├── objects.rs        # Central manager (petgraph StableGraph + spatial index + labels)
├── inserter.rs       # Insertion algorithm (creation / absorption / merge)
├── deleter.rs        # Deletion algorithm (split detection, border reassignment)
└── bfs_split.rs      # Multi-source BFS for cluster split detection
```

Key design choices:

- `petgraph::StableGraph` instead of a plain graph. Stable node indices survive node removal, which is critical since we store `NodeIndex` values in hash maps.
- No neighbor set on objects. The Python version stores `obj.neighbors` as a set. Rust queries `graph.neighbors(node_idx)` directly, avoiding duplicated state and circular references.
- `DeletedObjectInfo` pattern. Python accesses a deleted object's neighbors after deletion (the object persists in memory via GC). Rust snapshots neighbor data into a struct before removal.
- Brute-force spatial index with early termination. O(1) insert + O(n) query per insert beats Python's O(n log n) tree rebuild + O(log n + k) query, because the rebuild dominates. For Euclidean distance, the query uses early termination to skip dimensions once the partial squared distance exceeds eps², and parallelizes across CPU cores via rayon for datasets above 1,000 points.
- Feature-gated PyO3. PyO3 bindings are behind the `extension-module` Cargo feature, so `cargo test` runs pure Rust tests without requiring a Python interpreter.
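The snapshot-before-delete pattern can be illustrated in Python (all names here are illustrative sketches, not the crate's actual types):

```python
from dataclasses import dataclass

@dataclass
class DeletedObjectInfo:
    """Snapshot of everything the deletion logic needs after removal."""
    object_id: int
    was_core: bool
    neighbor_ids: list

def delete_with_snapshot(graph, is_core, node):
    """Capture neighbor data *before* removing the node, so later steps
    (split detection, border reassignment) never touch removed state."""
    info = DeletedObjectInfo(
        object_id=node,
        was_core=is_core(node),
        neighbor_ids=list(graph.get(node, ())),
    )
    # remove the node and its back-edges from the adjacency dict
    for nbr in info.neighbor_ids:
        graph[nbr].remove(node)
    graph.pop(node, None)
    return info
```

This avoids the Python version's reliance on the deleted object lingering in memory via the garbage collector: the snapshot makes the dependency explicit instead of accidental.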
## Testing

Rust tests:

```bash
cargo test
```

Tests cover: distance calculations, early termination correctness, hashing, spatial index operations, label management, object data structures.
Python tests:

```bash
pip install "incdbscan-rs[dev]"
pytest
```

Tests cover: construction, noise, cluster creation, absorption, merge, duplicates, deletion, 2-way/3-way splits, reinsert, multi-dimensional (1D-50D), distance metrics, sklearn cross-validation, stress testing.
Benchmarks:

```bash
cargo bench --bench batch_scaling
```

Runs the high-dimensional batch insertion benchmark (10 batches of 1,500 points, 996 dimensions).
## Differences from the Python version

| Feature | Python incdbscan | incdbscan-rs |
|---|---|---|
| Distance metrics | Any sklearn metric | Minkowski family only (p=1, 2, inf, or any p >= 1) |
| `delete()` return value | Returns `self` | Returns `list[bool]` (whether each point was found) |
| Warnings for missing objects | IncrementalDBSCANWarning | NaN in labels, False in delete results |
| Dependencies | numpy, scikit-learn, rustworkx, sortedcontainers, xxhash | numpy (Python side only) |
| Minimum Python | 3.9 | 3.9 |
## License

BSD-3-Clause. See LICENSE.
Based on the incdbscan project by Arpad Fulop.