
I want to type-check pandas DataFrames, i.e., specify which column labels a DataFrame must have and which data type (dtype) each column stores. A crude implementation (inspired by this question) would work like this:

from collections import namedtuple

Col = namedtuple('Col', 'label, type')

def dataframe_check(*specification):
    def check_accepts(f):
        assert len(specification) <= f.__code__.co_argcount
        def new_f(*args, **kwds):
            for (df, specs) in zip(args, specification):
                spec_columns = [spec.label for spec in specs]
                assert (df.columns == spec_columns).all(), \
                    "Columns don't match specs {}".format(spec_columns)
                spec_dtypes = [spec.type for spec in specs]
                assert (df.dtypes == spec_dtypes).all(), \
                    "Dtypes don't match specs {}".format(spec_dtypes)
            return f(*args, **kwds)
        new_f.__name__ = f.__name__
        return new_f
    return check_accepts

I don't mind the complexity of the checking function, but it adds a lot of boilerplate code.

@dataframe_check([Col('a', int), Col('b', int)],    # df1
                 [Col('a', int), Col('b', float)])  # df2
def f(df1, df2):
    return df1 + df2

f(df, df)

Is there a more Pythonic way of type-checking DataFrames? Something that looks more like the new Python 3.6 static type-checking?

Is it possible to implement it in mypy?
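For reference, here is a hedged sketch of what annotation-style checking can and cannot do here. `IntFrame` is a hypothetical alias made up for this example; a static checker such as mypy would only verify the nominal type, not the actual column labels or dtypes, which exist at runtime:

```python
from typing import NewType

import pandas as pd

# Hypothetical alias: a DataFrame intended to hold int columns 'a' and 'b'.
# mypy treats IntFrame as a distinct name, but it cannot inspect the runtime
# columns or dtypes -- that part still requires a runtime check.
IntFrame = NewType('IntFrame', pd.DataFrame)

def f(df1: IntFrame, df2: IntFrame) -> pd.DataFrame:
    return df1 + df2

# NewType's constructor is an identity function at runtime; it only tags
# the value for the static checker.
df = IntFrame(pd.DataFrame({'a': [1], 'b': [2]}))
result = f(df, df)
```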

2 Answers


Perhaps not the most Pythonic way, but using a dict for your specs might do the trick (with keys as column names and values as data types):

import pandas as pd

df = pd.DataFrame(columns=['col1', 'col2'])
df['col1'] = df['col1'].astype('int')
df['col2'] = df['col2'].astype('str')

cols_dtypes_req = {'col1': 'int', 'col2': 'object'}  # 'str' dtype is 'object' in pandas

def check_df(dataframe, specs):
    for colname in specs:
        if colname not in dataframe:
            return 'Column missing.'
        elif dataframe[colname].dtype != specs[colname]:
            return 'Data type incorrect.'
    for dfcol in dataframe:
        if dfcol not in specs:
            return 'Unexpected dataframe column.'
    return 'Dataframe meets specifications.'

print(check_df(df, cols_dtypes_req))
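To see the failure branches fire, here is a short usage sketch (repeating `check_df` so the snippet runs standalone; the offending frames are made up for illustration):

```python
import pandas as pd

cols_dtypes_req = {'col1': 'int', 'col2': 'object'}

def check_df(dataframe, specs):
    for colname in specs:
        if colname not in dataframe:
            return 'Column missing.'
        elif dataframe[colname].dtype != specs[colname]:
            return 'Data type incorrect.'
    for dfcol in dataframe:
        if dfcol not in specs:
            return 'Unexpected dataframe column.'
    return 'Dataframe meets specifications.'

# col1 holds strings, so its dtype is object rather than int.
bad_dtype = pd.DataFrame({'col1': ['x'], 'col2': ['y']})

# col3 is not in the spec at all.
extra_col = pd.DataFrame({'col1': pd.Series(dtype='int'),
                          'col2': pd.Series(dtype='object'),
                          'col3': pd.Series(dtype='float')})

print(check_df(bad_dtype, cols_dtypes_req))   # Data type incorrect.
print(check_df(extra_col, cols_dtypes_req))   # Unexpected dataframe column.
```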

1 Comment

If you implement it with OrderedDict you can also check the order of the columns.
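A hedged sketch of the commenter's idea, with a hypothetical `check_df_ordered` that also compares column order (on Python 3.7+ a plain dict would work too, since it preserves insertion order):

```python
from collections import OrderedDict

import pandas as pd

spec = OrderedDict([('col1', 'int64'), ('col2', 'object')])

def check_df_ordered(dataframe, specs):
    # Compare the labels *and* their order first, then the dtypes.
    if list(dataframe.columns) != list(specs):
        return 'Columns or their order do not match.'
    for colname, dtype in specs.items():
        if dataframe[colname].dtype != dtype:
            return 'Data type incorrect.'
    return 'Dataframe meets specifications.'

df = pd.DataFrame({'col1': pd.Series(dtype='int64'),
                   'col2': pd.Series(dtype='object')})

print(check_df_ordered(df, spec))                    # Dataframe meets specifications.
print(check_df_ordered(df[['col2', 'col1']], spec))  # Columns or their order do not match.
```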

Try pandera. It's powerful and easy to adopt.

Example:

from pandera import Field, SchemaModel, check_types
from pandera.typing import DataFrame, Index, Series, Float64

class RawPriceSchema(SchemaModel):
    index: Index[int] = Field(unique=True)
    symbol: Series[str]
    price: Series[Float64] = Field(nullable=True)

RawPrice = DataFrame[RawPriceSchema]

# ...

@check_types
def foo(price: RawPrice):
    ...

A more detailed example can be found in this demo repo and in this video.
