Hashing a pandas dataframe for calculated column caching

Question

I am using composition to create a class that contains a pandas dataframe, as shown below. I create a derived property by performing some operation on the base columns.

    import numpy as np
    import pandas as pd

    class myclass:
        def __init__(self, *args, **kwargs):
            self.df = pd.DataFrame(*args, **kwargs)

        @property
        def derived(self):
            return self.df.sum(axis=1)

    myobj = myclass(np.random.randint(100, size=(100, 6)))
    d = myobj.derived

The calculation of derived is an expensive step, so I would like to cache it. I want to use functools.lru_cache for this. However, it requires that the original object be hashable. I tried creating a __hash__ function for the object as detailed in this answer: https://stackoverflow.com/a/47800021/3679377.

Now I run into a new problem: the hashing function is itself an expensive step! Is there any way to get around this problem? Or have I reached a dead end?

Is there any better way to check if a dataframe has been modified and if not, keep returning the same hash?

'I am creating a custom class by extending a pandas dataframe as shown below.' - You are not extending. You have a class that contains a dataframe. See packetflow.co.uk/python-inheritance-vs-composition
– balderman, Commented Aug 19, 2020 at 9:38
True, I'm using composition. I'll reframe my question like that. It's just that I went by the title of pandas' help page. pandas.pydata.org/pandas-docs/stable/development/extending.html
– najeem, Commented Aug 19, 2020 at 9:43
Do you want to avoid the calculation of derived in the case where self.df was not changed?
– balderman, Commented Aug 19, 2020 at 9:53
Do you want to handle only the derived operation, or do you wish to have a system that you can extend to some other operations on this dataframe?
– efont, Commented Aug 19, 2020 at 10:00

efont · Accepted Answer · 2020-08-20 16:48:03Z

If hashing doesn't work for you, you can try to take advantage of the internal state of your class.

Cache one method

Use an attribute of the instance as a cache: on the first call of the method, store the result in this attribute, and retrieve it on subsequent calls. Assigning a new dataframe through the setter empties the cache.

    import pandas as pd

    class MyClass:
        def __init__(self, *args, **kwargs):
            self._df = pd.DataFrame(*args, **kwargs)
            self._cached_value = None

        @property
        def df(self):
            return self._df

        @df.setter
        def df(self, value):
            self._cached_value = None  # a new dataframe invalidates the cache
            self._df = value

        @property
        def derived(self):
            if self._cached_value is None:
                self._cached_value = self._df.sum(axis=1)
            return self._cached_value

    my_new_df_value = pd.DataFrame({"a": [5, 6], "b": [7, 8]})  # any new dataframe

    cl = MyClass({"a": [1, 2], "b": [3, 4]})
    cl.derived  # compute
    cl.derived  # return cached value
    cl.df = my_new_df_value  # cache is emptied
    cl.derived  # compute

Cache several methods

You can then extend this principle to several methods by using a dict to store the result of each operation. You can use the method names as the keys of this dict (using the inspect module; see this response for an example).

    import inspect
    import pandas as pd

    class MyClass:
        def __init__(self, *args, **kwargs):
            self.df = pd.DataFrame(*args, **kwargs)
            self._cached_values = {}

        @property
        def derived(self):
            method_name = self._get_method_name()
            if method_name not in self._cached_values:
                self._cached_values[method_name] = self.df.sum(axis=1)
            return self._cached_values[method_name]

        @property
        def derived_bis(self):
            method_name = self._get_method_name()
            if method_name not in self._cached_values:
                # stand-in for your_expensive_op
                self._cached_values[method_name] = self.df.max(axis=1)
            return self._cached_values[method_name]

        def _get_method_name(self):
            return inspect.stack()[1].function  # name of this method's caller

    cl = MyClass({"a": [1, 2], "b": [3, 4]})
    cl.derived      # compute --> self._cached_values = {'derived': your_result}
    cl.derived      # return cached value
    cl.derived_bis  # compute --> self._cached_values = {'derived': your_result, 'derived_bis': your_other_result}
    cl.derived_bis  # return cached value

You can factor out the shared body of the two properties to respect the DRY principle, but be sure to modify _get_method_name accordingly, since it relies on being called directly from the property.
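One way to do that refactoring is sketched below. It is only an illustration, not part of the original answer: the decorator name cached_by_name is hypothetical, it reads the key from method.__name__ instead of inspecting the call stack, and derived_bis uses max(axis=1) as a stand-in for a second expensive operation.

```python
import functools

import numpy as np
import pandas as pd


def cached_by_name(method):
    """Cache a method's result in self._cached_values under the method's name.

    Hypothetical helper: it replaces the inspect-based lookup with method.__name__.
    """
    @functools.wraps(method)
    def wrapper(self):
        name = method.__name__
        if name not in self._cached_values:
            self._cached_values[name] = method(self)
        return self._cached_values[name]
    return wrapper


class MyClass:
    def __init__(self, *args, **kwargs):
        self._df = pd.DataFrame(*args, **kwargs)
        self._cached_values = {}

    @property
    def df(self):
        return self._df

    @df.setter
    def df(self, value):
        self._cached_values.clear()  # replacing the frame drops every cached result
        self._df = value

    @property
    @cached_by_name
    def derived(self):
        return self._df.sum(axis=1)

    @property
    @cached_by_name
    def derived_bis(self):
        return self._df.max(axis=1)  # stand-in for another expensive operation


cl = MyClass(np.arange(12).reshape(3, 4))
cl.derived      # computed, stored under 'derived'
cl.derived      # served from the cache
cl.derived_bis  # computed, stored under 'derived_bis'
```

Like the setter-based version, this only notices a change when a new value is assigned to cl.df. Changes made directly on the frame returned by cl.df are not detected.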

This will not work if the dataframe was changed between subsequent calls to derived! For example: c1 = MyClass(); c1.derived; c1.df*=10; c1.derived will give me the already cached data, which is wrong. The code should know enough to throw away the cache when I modify the df.
Ah yes, I had not understood this was a requirement, my bad. But it is still possible to make it work if you empty the cache when updating the value of the dataframe. This can be done as a first step of your setter :)
Exactly. So when does the class know 'now my dataframe has changed'? I was trying to achieve all this using lru_cache. However, it requires that I compute a hash value. I can set up a hash value for the dataframe based on something easily computable, for example the sum of all values in the df. But it's not foolproof. Any decent hash value takes as much time as the derived property itself.
I have edited the first part of my answer to depict the full mechanism. Does it resemble what you were looking for? If yes, I will edit the rest of the answer accordingly. Also, I don't think the derived methods should be properties, so I will remove them to be clearer.
I think the solution above plus a hash check, i.e. pandas.util.hash_pandas_object(df), on each call to derived would work. Hashing every single time is a modest overhead, but if you need to detect changes I don't see another way. I don't think dataframes have an event model.
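The hash check suggested in the last comment could look roughly like the sketch below. The class name HashCheckedCache and the helper frame_digest are made up for illustration. The digest is built from pd.util.hash_pandas_object and hashlib.sha1, and it is recomputed on every access, so this only pays off when the derived calculation is clearly more expensive than hashing the frame.

```python
import hashlib

import numpy as np
import pandas as pd


def frame_digest(df):
    """Return a hex digest built from the per-row hashes of df (index included)."""
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    return hashlib.sha1(row_hashes).hexdigest()


class HashCheckedCache:
    def __init__(self, *args, **kwargs):
        self.df = pd.DataFrame(*args, **kwargs)
        self._digest = None
        self._cached_value = None

    @property
    def derived(self):
        digest = frame_digest(self.df)  # this cost is paid on every access
        if digest != self._digest:
            # The frame content changed (or this is the first call): recompute.
            self._cached_value = self.df.sum(axis=1)
            self._digest = digest
        return self._cached_value


obj = HashCheckedCache(np.arange(6).reshape(2, 3))
first = obj.derived   # computed
second = obj.derived  # digest unchanged, cached object returned
obj.df *= 10          # modifies the frame, so the digest changes
third = obj.derived   # recomputed
```

Unlike the setter-based cache, this detects changes no matter how the dataframe was modified, at the price of one hash pass per call.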

Bastien Harkins · Accepted Answer · 2020-08-26 10:12:06Z

If you know which methods are likely to update your df, you could override them in your custom class, and keep a flag. I'm not going into details here, but here is the basic principle:

    import numpy as np
    import pandas as pd

    class myclass:
        def __init__(self, *args, **kwargs):
            self.df = pd.DataFrame(*args, **kwargs)
            self.derived_is_calculated = False
            self._derived = None  # holds the last computed result

        @property
        def derived(self):
            if not self.derived_is_calculated:
                self._derived = self.df.sum(axis=1)
                self.derived_is_calculated = True
            return self._derived

        def update(self, other, **kwargs):
            """
            Implements the normal update method, and sets a flag
            to track if df has changed
            """
            old_df = self.df.copy()  # Make a copy for comparison
            pd.DataFrame.update(self.df, other, **kwargs)  # Call the base 'update' method
            if not self.df.equals(old_df):  # Compare before and after update
                self.derived_is_calculated = False

    random_array = np.random.randint(100, size=(2, 10))
    myobj = myclass(random_array)
    print(myobj.derived)  # Prints the summed df
    print(myobj.derived)  # Prints the cached summed df
    myobj.update([1, 2, 3])
    print(myobj.derived)  # Prints the new summed df

There is probably a deeper method of DataFrame or pandas that is called on every change in the DataFrame content, I'll keep looking.

But you could set up a list of the methods that your program will use, and make a decorator that does basically what I did in update, then apply it to each of the listed methods...
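A minimal sketch of such a decorator follows, under a few assumptions: the names invalidates_cache, FlaggedFrame and set_column are hypothetical, and the decorator simply resets the flag after every wrapped call instead of comparing the dataframe before and after as update does above. That simplification trades a few unnecessary recomputations for not copying the frame.

```python
import functools

import numpy as np
import pandas as pd


def invalidates_cache(method):
    """Mark the cached value as stale after the wrapped method has run."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        result = method(self, *args, **kwargs)
        self.derived_is_calculated = False  # assume the frame may have changed
        return result
    return wrapper


class FlaggedFrame:
    def __init__(self, *args, **kwargs):
        self.df = pd.DataFrame(*args, **kwargs)
        self.derived_is_calculated = False
        self._derived = None

    @property
    def derived(self):
        if not self.derived_is_calculated:
            self._derived = self.df.sum(axis=1)
            self.derived_is_calculated = True
        return self._derived

    # Every method that is allowed to modify the frame gets the decorator.
    @invalidates_cache
    def update(self, other, **kwargs):
        self.df.update(other, **kwargs)

    @invalidates_cache
    def set_column(self, name, values):
        self.df[name] = values


obj = FlaggedFrame(np.ones((2, 3)))
before = obj.derived              # computed
obj.set_column(0, [5.0, 5.0])     # flag reset by the decorator
after = obj.derived               # recomputed
```

This only covers modifications that go through the wrapped methods. Anything done directly on obj.df bypasses the flag.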

Thanks, but I don't know how the user will modify the dataframe. It's a regular pandas dataframe and I believe there are quite a lot of ways in which it can be modified.
Actually I believe all updates to a pd.DataFrame go through the __setitem__ method (though I didn't check thoroughly).

ignis · Accepted Answer · 2023-10-31 19:41:22Z

This question is among the Google results for searching how to hash a DataFrame.

For the use case from your example code, caching the result is the best approach, as noted in efont's answer.

To answer the literal question on how to hash a DataFrame and work around the fact that "the hashing function is an expensive step", see this answer by Roko Mijic:

hashlib.sha1(pd.util.hash_pandas_object(df).values).hexdigest()

Here is the reference for pd.util.hash_pandas_object().
