
I want to type-check pandas DataFrames, i.e., specify which column labels a DataFrame must have and which data type (dtype) each column stores. A crude implementation (inspired by this question) would work like this:

from collections import namedtuple

Col = namedtuple('Col', 'label, type')

def dataframe_check(*specification):
    def check_accepts(f):
        assert len(specification) <= f.__code__.co_argcount
        def new_f(*args, **kwds):
            for (df, specs) in zip(args, specification):
                spec_columns = [spec.label for spec in specs]
                assert (df.columns == spec_columns).all(), \
                    "Columns don't match specs {}".format(spec_columns)
                spec_dtypes = [spec.type for spec in specs]
                assert (df.dtypes == spec_dtypes).all(), \
                    "Dtypes don't match specs {}".format(spec_dtypes)
            return f(*args, **kwds)
        new_f.__name__ = f.__name__
        return new_f
    return check_accepts

I don't mind the complexity of the checking function, but it adds a lot of boilerplate code.

@dataframe_check([Col('a', int), Col('b', int)],    # df1
                 [Col('a', int), Col('b', float)])  # df2
def f(df1, df2):
    return df1 + df2

f(df, df)

Is there a more Pythonic way of type-checking DataFrames? Something that looks more like the new Python 3.6 static type-checking?

Is it possible to implement it in mypy?
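For reference, here is a hedged sketch of what annotation-style checking can and cannot do here. `IntFrame` is a hypothetical alias made up for this example; a static checker such as mypy would only verify the nominal type, not the actual column labels or dtypes, which exist at runtime:

```python
from typing import NewType

import pandas as pd

# Hypothetical alias: a DataFrame intended to hold int columns 'a' and 'b'.
# mypy treats IntFrame as a distinct name, but it cannot inspect the runtime
# columns or dtypes -- that part still requires a runtime check.
IntFrame = NewType('IntFrame', pd.DataFrame)

def f(df1: IntFrame, df2: IntFrame) -> pd.DataFrame:
    return df1 + df2

# NewType's constructor is an identity function at runtime; it only tags
# the value for the static checker.
df = IntFrame(pd.DataFrame({'a': [1], 'b': [2]}))
result = f(df, df)
```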

2 Answers


Perhaps not the most Pythonic way, but using a dict for your specs might do the trick (with keys as column names and values as data types):

import pandas as pd

df = pd.DataFrame(columns=['col1', 'col2'])
df['col1'] = df['col1'].astype('int')
df['col2'] = df['col2'].astype('str')

cols_dtypes_req = {'col1': 'int', 'col2': 'object'}  # 'str' dtype is 'object' in pandas

def check_df(dataframe, specs):
    for colname in specs:
        if colname not in dataframe:
            return 'Column missing.'
        elif dataframe[colname].dtype != specs[colname]:
            return 'Data type incorrect.'
    for dfcol in dataframe:
        if dfcol not in specs:
            return 'Unexpected dataframe column.'
    return 'Dataframe meets specifications.'

print(check_df(df, cols_dtypes_req))
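To see the failure branches fire, here is a short usage sketch (repeating `check_df` so the snippet runs standalone; the offending frames are made up for illustration):

```python
import pandas as pd

cols_dtypes_req = {'col1': 'int', 'col2': 'object'}

def check_df(dataframe, specs):
    for colname in specs:
        if colname not in dataframe:
            return 'Column missing.'
        elif dataframe[colname].dtype != specs[colname]:
            return 'Data type incorrect.'
    for dfcol in dataframe:
        if dfcol not in specs:
            return 'Unexpected dataframe column.'
    return 'Dataframe meets specifications.'

# col1 holds strings, so its dtype is object rather than int.
bad_dtype = pd.DataFrame({'col1': ['x'], 'col2': ['y']})

# col3 is not in the spec at all.
extra_col = pd.DataFrame({'col1': pd.Series(dtype='int'),
                          'col2': pd.Series(dtype='object'),
                          'col3': pd.Series(dtype='float')})

print(check_df(bad_dtype, cols_dtypes_req))   # Data type incorrect.
print(check_df(extra_col, cols_dtypes_req))   # Unexpected dataframe column.
```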

1 Comment

If you implement it with OrderedDict you can also check the order of the columns.
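A hedged sketch of the commenter's idea, with a hypothetical `check_df_ordered` that also compares column order (on Python 3.7+ a plain dict would work too, since it preserves insertion order):

```python
from collections import OrderedDict

import pandas as pd

spec = OrderedDict([('col1', 'int64'), ('col2', 'object')])

def check_df_ordered(dataframe, specs):
    # Compare the labels *and* their order first, then the dtypes.
    if list(dataframe.columns) != list(specs):
        return 'Columns or their order do not match.'
    for colname, dtype in specs.items():
        if dataframe[colname].dtype != dtype:
            return 'Data type incorrect.'
    return 'Dataframe meets specifications.'

df = pd.DataFrame({'col1': pd.Series(dtype='int64'),
                   'col2': pd.Series(dtype='object')})

print(check_df_ordered(df, spec))                    # Dataframe meets specifications.
print(check_df_ordered(df[['col2', 'col1']], spec))  # Columns or their order do not match.
```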

Try pandera. It's powerful and easy to adopt.

Example:

from pandera import Field, SchemaModel, check_types
from pandera.typing import DataFrame, Index, Series, Float64

class RawPriceSchema(SchemaModel):
    index: Index[int] = Field(unique=True)
    symbol: Series[str]
    price: Series[Float64] = Field(nullable=True)

RawPrice = DataFrame[RawPriceSchema]

# ...

@check_types
def foo(price: RawPrice):
    ...

A more detailed example can be found in this demo repo and in this video.
