Fast Python with Rust: a data-oriented approach

Published 2026-02-28

How fast can Rust make Python? With a bit of data-oriented design, really fast:

| Library | Operations | Collection Size | Time (seconds) |
|---------|------------|-----------------|----------------|
| Redbear | 100000 × 5 | 10 | 0.056 |
| Redbear | 1000000 × 5 | 10 | 0.583 |
| Redbear | 10000 × 5 | 1000 | 0.015 |
| Redbear | 100000 × 5 | 1000 | 0.145 |
| NumPy | 100000 × 5 | 10 | 0.285 |
| NumPy | 1000000 × 5 | 10 | 2.436 |
| NumPy | 10000 × 5 | 1000 | 0.039 |
| NumPy | 100000 × 5 | 1000 | 0.371 |

You can find all the benchmark scripts in the Redbear benchmarks folder.

Despite the impressive numbers above, this post holds no delusion that a small library will replace the monumental work behind libraries like NumPy and Polars. Redbear only achieves this performance by focusing on a narrow use case and making functionality concessions in the name of speed. If you have a data manipulation problem, you should, without question, use one of the de facto libraries.

This post shows how Redbear achieves this level of performance by focusing on element-wise operations on dictionaries, demonstrating Rust performance optimizations along the way.

Is Python really that slow?

As you can see above, NumPy has fantastic performance across the board, but it doesn't provide all the functionality you might need. Most data processing pipelines rely on pandas or Polars. These libraries have a different performance story: they incur a higher performance penalty in exchange for a lot more functionality, and they specialize in working with large data sets.

The authors of libraries like Blackbear noticed that performance degrades dramatically when running lots of operations on small data sets. Blackbear in particular implemented common operations on dictionaries in plain Python, with great success:

| Library | Operations | Collection Size | Time (seconds) |
|---------|------------|-----------------|----------------|
| Blackbear | 100000 × 5 | 10 | 0.313 |
| Blackbear | 1000000 × 5 | 10 | 3.072 |
| Polars | 100000 × 5 | 10 | 1.072 |
| Polars | 1000000 × 5 | 10 | 11.147 |
| pandas | 100000 × 5 | 10 | 13.466 |
| pandas | 1000000 × 5 | 10 | 130.837 |

That performance gain starts to fall off as the data sets grow, so pandas and Polars still play an important part:

| Library | Operations | Collection Size | Time (seconds) |
|---------|------------|-----------------|----------------|
| Blackbear | 10000 × 5 | 1000 | 3.643 |
| Blackbear | 100000 × 5 | 1000 | 36.307 |
| Polars | 10000 × 5 | 1000 | 0.132 |
| Polars | 100000 × 5 | 1000 | 1.219 |
| pandas | 10000 × 5 | 1000 | 1.339 |
| pandas | 100000 × 5 | 1000 | 13.432 |

This fragmented performance, and the need to mix libraries to get the best of each, made me look for an alternative. That's when I started looking at Rust's Python integration.

Is Rust enough?

Rust can work with Python thanks to the fantastic PyO3 library. And since Python dictionaries can perform so well, would we see a performance boost from using Rust hash maps instead?

We will only look at addition operations, since we can extrapolate the rest from there, and we won't go through the full project setup: maturin makes it trivial, so if you want to follow along, you can follow their instructions.

Let's start with scalar addition, the simplest operation we can perform. It takes a dictionary and a value, and adds the value to every value in the dictionary.

```rust
use std::collections::HashMap;

use pyo3::prelude::*;

#[pyfunction]
fn add_scalar(d: HashMap<String, f64>, value: f64) -> HashMap<String, f64> {
    d.into_iter().map(|(k, v)| (k, v + value)).collect()
}
```

To turn a Rust function into a Python function, we just have to apply the pyfunction macro. PyO3 takes care of converting the arguments and return values for us.

We want to keep this API immutable so we can continue building calculations without modifying the original values, and to match the implementation of Blackbear. That means we can implement this calculation as a quick iterator without even worrying about Python.
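With the PyO3 attribute stripped, the same function runs as plain Rust, which makes its behaviour easy to check. A minimal sketch (standalone, not part of Redbear's public API):

```rust
use std::collections::HashMap;

// `add_scalar` without the PyO3 attribute, so it runs standalone.
// Consuming the map moves the keys into the result instead of cloning them.
fn add_scalar(d: HashMap<String, f64>, value: f64) -> HashMap<String, f64> {
    d.into_iter().map(|(k, v)| (k, v + value)).collect()
}
```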

The next addition operation we need is addition of two dictionaries. We will only consider the keys of the first dictionary and add the values of the matching keys together. If we can't find a matching key, we will use a fill value with the default of 0.0.

```rust
#[pyfunction]
#[pyo3(signature = (d1, d2, fill=0.0))]
fn add(d1: HashMap<String, f64>, d2: HashMap<String, f64>, fill: f64) -> HashMap<String, f64> {
    d1.iter()
        .map(|(k, v)| (k.clone(), v + d2.get(k).unwrap_or(&fill)))
        .collect()
}
```

We need another macro here to provide a default for the fill argument. Rust function signatures can't declare default values, while Python uses them all the time, so PyO3 provides the signature attribute to simulate this functionality while keeping the function a good citizen in both languages.

We can proceed in almost the same way as before. We only need to switch to using .iter() so we can work with references, and look up the value to add in d2 using .get(k).unwrap_or(&fill).
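The fill semantics are easy to verify in plain Rust. A sketch of the same logic, borrowing the maps instead of consuming them so the caller can reuse them (names illustrative, no PyO3 involved):

```rust
use std::collections::HashMap;

// Keys come from `d1`; any key missing from `d2` contributes `fill` instead,
// and keys that exist only in `d2` are dropped.
fn add(d1: &HashMap<String, f64>, d2: &HashMap<String, f64>, fill: f64) -> HashMap<String, f64> {
    d1.iter()
        .map(|(k, v)| (k.clone(), v + d2.get(k).unwrap_or(&fill)))
        .collect()
}
```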

So... Did we get our performance boost by directly porting the Python functions?

| Library | Operations | Collection Size | Time (seconds) |
|---------|------------|-----------------|----------------|
| Blackbear | 100000 × 5 | 10 | 0.313 |
| Blackbear | 1000000 × 5 | 10 | 3.072 |
| Blackbear | 10000 × 5 | 1000 | 3.643 |
| Blackbear | 100000 × 5 | 1000 | 36.307 |
| Redbear | 100000 × 5 | 10 | 1.066 |
| Redbear | 1000000 × 5 | 10 | 10.931 |
| Redbear | 10000 × 5 | 1000 | 13.467 |
| Redbear | 100000 × 5 | 1000 | 133.243 |

Those numbers don't look great. We see 3-4x worse performance than the native Python implementation, which really shows the cost of converting data. For every function call, we have to convert the data structures from Python to Rust and back to Python. All that data movement quickly adds up and erases the performance improvements Rust can give us.

It's clear that just switching to Rust isn't enough.

Using a dedicated structure

To start the optimization, let's take inspiration from all the other libraries that already handle this problem well.

The other libraries we compared against, except Blackbear, use a dedicated data structure. Thanks to that data structure, they can convert the data once, run a bunch of operations, and convert it back at the end. That's exactly what we should do.

```rust
#[pyclass]
struct RedDict {
    values: HashMap<String, f64>,
}

#[pymethods]
impl RedDict {
    #[new]
    fn new(dict: HashMap<String, f64>) -> Self {
        Self { values: dict }
    }
}
```

We use three new macros here. pyclass lets us expose a struct as a Python class, and pymethods converts all the Rust methods inside an impl block into Python methods on the class. These two macros work together to seamlessly translate Rust constructs to Python, but they don't work for all methods. Some Python magic methods, like __init__, require special handling. Luckily, PyO3 covers the most common cases with macros like the new macro we use here.

With the above declaration, we get a Python class that we can initialize naturally with RedDict(py_dict).

Now we have to migrate our add functions to methods that can work on RedDict directly.

```rust
#[pymethods]
impl RedDict {
    #[pyo3(signature = (other, fill=0.0))]
    fn add(&self, other: &Self, fill: f64) -> Self {
        let values = self
            .values
            .iter()
            .map(|(k, v)| (k.clone(), v + other.values.get(k).unwrap_or(&fill)))
            .collect();

        Self { values }
    }
}
```

In the interest of space, I will only show the add implementation from now on. It's straightforward enough to derive the scalar implementation from there.

We only need to add the .values accessor to our RedDict instances, and construct a new RedDict instance before returning the results. And with that change, we gain a massive performance increase:

| Library | Operations | Collection Size | Time (seconds) |
|---------|------------|-----------------|----------------|
| Before | 100000 × 5 | 10 | 1.066 |
| Before | 1000000 × 5 | 10 | 10.931 |
| Before | 10000 × 5 | 1000 | 13.467 |
| Before | 100000 × 5 | 1000 | 133.243 |
| After | 100000 × 5 | 10 | 0.240 |
| After | 1000000 × 5 | 10 | 2.532 |
| After | 10000 × 5 | 1000 | 3.253 |
| After | 100000 × 5 | 1000 | 32.645 |

We have now caught up to the Python implementation and gained a nice chaining syntax that lets us write expressions like:

```python
RedDict(py_dict).add_scalar(2).multiply_scalar(10)
```

Optimizing copying with mutation

We now have a short and clean implementation, but it's not optimal. Building up a HashMap through an iterator has poor performance characteristics, because the structure has to hash every key and repeatedly grow as we fill it. We can remedy this by leaning on highly optimized built-in types and methods.

First we wrap values in an Arc smart pointer so we only copy around the pointer instead of the whole HashMap:

```rust
use std::sync::Arc;

#[pyclass]
struct RedDict {
    values: Arc<HashMap<String, f64>>,
}
```

Then, instead of building up the result through iteration, we clone the existing map wholesale: Arc::make_mut falls back to HashMap's optimized Clone implementation, which copies the structure in one pass instead of growing it entry by entry. Once we have the clone, we mutate only the values, so the map's structure never changes:

```rust
#[pymethods]
impl RedDict {
    #[pyo3(signature = (other, fill=0.0))]
    fn add(&self, other: &Self, fill: f64) -> Self {
        let mut values = self.values.clone();

        Arc::make_mut(&mut values)
            .iter_mut()
            .for_each(|(key, val)| *val += other.values.get(key).unwrap_or(&fill));

        Self { values }
    }
}
```
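The copy-on-write behaviour of Arc::make_mut is worth seeing in isolation: cloning the Arc copies a pointer, and the deep copy of the map only happens when another owner still exists. A standalone sketch (the bump helper is illustrative, not Redbear API):

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Clone the Arc (a pointer copy), then mutate through Arc::make_mut,
// which deep-copies the map because `shared` still owns a reference.
fn bump(shared: &Arc<HashMap<String, f64>>, by: f64) -> Arc<HashMap<String, f64>> {
    let mut copy = shared.clone(); // refcount is now 2
    Arc::make_mut(&mut copy) // map is shared, so this deep-copies it
        .iter_mut()
        .for_each(|(_, v)| *v += by);
    copy
}
```

The original map is left untouched, which is exactly the immutability guarantee the Python API promises.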

These changes might seem small, but they nearly double our throughput:

| Library | Operations | Collection Size | Time (seconds) |
|---------|------------|-----------------|----------------|
| Before | 100000 × 5 | 10 | 0.240 |
| Before | 1000000 × 5 | 10 | 2.532 |
| Before | 10000 × 5 | 1000 | 3.253 |
| Before | 100000 × 5 | 1000 | 32.645 |
| After | 100000 × 5 | 10 | 0.157 |
| After | 1000000 × 5 | 10 | 1.609 |
| After | 10000 × 5 | 1000 | 1.730 |
| After | 100000 × 5 | 1000 | 17.941 |
| NumPy | 100000 × 5 | 10 | 0.285 |
| NumPy | 1000000 × 5 | 10 | 2.436 |
| NumPy | 10000 × 5 | 1000 | 0.039 |
| NumPy | 100000 × 5 | 1000 | 0.371 |

These changes are transparent to the user and make the library surpass NumPy on small collections. But we can still do better.

Ditching the hash map

Hash maps excel at random element access, but we only perform element-wise operations, and for element-wise operations, vectors are the performance king. CPUs are heavily optimized for marching over contiguous memory, so if we can tap into that, we will use close to the most efficient operations possible.

To take advantage of vectors, we need to update RedDict one more time:

```rust
#[pyclass(from_py_object)]
#[derive(Clone)]
struct RedDict {
    /// Mapping from key -> index into `values`.
    index: Arc<HashMap<String, usize>>,
    /// Packed numeric values.
    values: Arc<Vec<f64>>,
}
```

We store all the values in a Vec, while still using a hash map for the key-to-index mapping. Storing the key-to-index relationship means we can perform all the operations on the values vector and only consult the hash map when we need to convert back to a Python dictionary.

This layout also brings out the true power of Arc. Since we leave the index unchanged, we only clone the pointer to index. If we ever do need to change it, Arc::make_mut will clone the underlying data and give us a fresh copy to mutate. That's the power of smart pointers.
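We can verify that pointer sharing with Arc::ptr_eq, using a stripped-down stand-in for this layout (the Columns struct and add_scalar method are illustrative, no PyO3 involved):

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-in for the two-Arc layout: index map plus packed values.
#[derive(Clone)]
struct Columns {
    index: Arc<HashMap<String, usize>>,
    values: Arc<Vec<f64>>,
}

impl Columns {
    fn add_scalar(&self, x: f64) -> Self {
        let mut new = self.clone(); // clones both Arcs: two pointer copies
        // Only `values` is deep-copied here; `index` stays shared.
        Arc::make_mut(&mut new.values).iter_mut().for_each(|v| *v += x);
        new
    }
}
```

After an operation, the result's index Arc points at the very same allocation as the original's, while the values vector is a fresh copy.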

As often happens, achieving the best performance means dealing with additional complexity, starting with the RedDict initialization:

```rust
#[pymethods]
impl RedDict {
    #[new]
    fn new(dict: HashMap<String, f64>) -> Self {
        let mut values = Vec::with_capacity(dict.len());
        let mut index = HashMap::with_capacity(dict.len());

        for (pos, (k, v)) in dict.into_iter().enumerate() {
            values.push(v);
            index.insert(k, pos);
        }

        Self {
            index: Arc::new(index),
            values: Arc::new(values),
        }
    }
}
```

Instead of PyO3 handling most of the conversion for us, we have to manually construct both the values vector and the index hash map based on the passed in hash map.
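The same construction can be sketched in plain Rust, without the PyO3 wrapper, to check that the index round-trips correctly (the split helper is illustrative):

```rust
use std::collections::HashMap;

// Pack the values into a Vec and remember each key's position in an
// index map, mirroring the RedDict constructor above.
fn split(dict: HashMap<String, f64>) -> (HashMap<String, usize>, Vec<f64>) {
    let mut values = Vec::with_capacity(dict.len());
    let mut index = HashMap::with_capacity(dict.len());

    for (pos, (k, v)) in dict.into_iter().enumerate() {
        values.push(v);
        index.insert(k, pos);
    }

    (index, values)
}
```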

The addition between two dictionaries becomes more complex. We need to:

  • Iterate over the index to retrieve the keys
  • Remember the position each key maps to in our left-hand values vector
  • Look up the position of the same key in the right-hand vector, or fall back to the fill value
  • Update the value at that position in the left-hand vector by adding the right-hand value
```rust
#[pyo3(signature = (other, fill=0.0))]
fn add(&self, other: Self, fill: f64) -> Self {
    let mut new = self.clone();
    let new_vals = Arc::make_mut(&mut new.values);

    for (key, &i) in self.index.iter() {
        let rhs = other
            .index
            .get(key)
            .map(|&j| other.values[j])
            .unwrap_or(fill);
        new_vals[i] += rhs;
    }

    new
}
```

And we get a good performance increase, but still fall far short of the NumPy performance for large collections:

| Library | Operations | Collection Size | Time (seconds) |
|---------|------------|-----------------|----------------|
| Redbear | 100000 × 5 | 10 | 0.111 |
| Redbear | 1000000 × 5 | 10 | 1.156 |
| Redbear | 10000 × 5 | 1000 | 0.604 |
| Redbear | 100000 × 5 | 1000 | 5.877 |
| NumPy | 100000 × 5 | 10 | 0.285 |
| NumPy | 1000000 × 5 | 10 | 2.436 |
| NumPy | 10000 × 5 | 1000 | 0.039 |
| NumPy | 100000 × 5 | 1000 | 0.371 |

The issue comes down to doing a lot more work than we need to, at least in some cases.

Minimizing unnecessary look-ups

Since we perform a hash look-up for every value in a collection, we pay a lot of overhead. It would be great if we could get rid of all those look-ups.

It turns out that a lot of the time we start with a dictionary, transform it in some way, and then perform more operations on the intermediate derivations:

```python
rd = RedDict(py_dict)
rd2 = rd.add_scalar(2).add(rd)
```

In these cases, our self and other dictionaries share an index. And if they share an index, it means that values have the same order, so we don't need to do any look-ups.

```rust
#[pyo3(signature = (other, fill=0.0))]
fn add(&self, other: Self, fill: f64) -> Self {
    let mut new = self.clone();
    let new_vals = Arc::make_mut(&mut new.values);

    if new.index == other.index {
        for (lhs, rhs) in new_vals.iter_mut().zip(other.values.iter()) {
            *lhs += rhs;
        }
    } else {
        for (key, &i) in self.index.iter() {
            let rhs = other
                .index
                .get(key)
                .map(|&j| other.values[j])
                .unwrap_or(fill);
            new_vals[i] += rhs;
        }
    }

    new
}
```

Here we check the indexes for equality. If they are equal, we use zip to traverse both value vectors in the same order and add the elements together.
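The fast/slow split can be sketched in plain Rust over raw index maps and value slices (the add_values helper is illustrative, not Redbear's API):

```rust
use std::collections::HashMap;

// If the indexes match, the value orderings match, so a straight zip
// replaces every per-key hash look-up; otherwise fall back to look-ups.
fn add_values(
    index: &HashMap<String, usize>,
    lhs: &[f64],
    other_index: &HashMap<String, usize>,
    rhs: &[f64],
    fill: f64,
) -> Vec<f64> {
    let mut out = lhs.to_vec();

    if index == other_index {
        // Fast path: element-wise addition, no hashing at all.
        for (l, r) in out.iter_mut().zip(rhs) {
            *l += r;
        }
    } else {
        // Slow path: resolve each key through both index maps.
        for (key, &i) in index {
            out[i] += other_index.get(key).map(|&j| rhs[j]).unwrap_or(fill);
        }
    }

    out
}
```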

With these changes, we reach our goal! We surpass NumPy across the board:

| Library | Operations | Collection Size | Time (seconds) |
|---------|------------|-----------------|----------------|
| Redbear | 100000 × 5 | 10 | 0.065 |
| Redbear | 1000000 × 5 | 10 | 0.653 |
| Redbear | 10000 × 5 | 1000 | 0.016 |
| Redbear | 100000 × 5 | 1000 | 0.160 |
| NumPy | 100000 × 5 | 10 | 0.285 |
| NumPy | 1000000 × 5 | 10 | 2.436 |
| NumPy | 10000 × 5 | 1000 | 0.039 |
| NumPy | 100000 × 5 | 1000 | 0.371 |

Trade-offs, and when to use Redbear

By applying data-oriented design in Rust, we achieved our goal of surpassing the general-purpose libraries in Python. This came with the caveats of needing to reuse derivations of a base instance and only supporting floats, but it's an impressive achievement for less than 50 lines of code.