
Is there a way to hint about a pandas DataFrame's schema "statically" so that we can get code completion, static type checking, and just general predictability during coding?

I wouldn't mind duplicating the schema info in code and in the type annotation for this to work.

So maybe something roughly like mypy comment type annotations:

df = pd.DataFrame({'a': [1.0, 2.4, 4.5], 'B': [1, 2, 3]})  # pd.schema: {'a': np.dtype(float), 'B': np.dtype(int)}

(or better yet have the schema specified in some external JSON file or such)

Then you can imagine things like df.<Tab> auto-completing during coding to df.a or df.B, or mypy (and any other static code analyzer) being able to infer the type of df.B[0] and the like.

Although hopeful, I'm guessing this isn't really possible (or desired...). If so, what would be a good standard for writing reusable code that returns pd.DataFrames with specific columns? So imagine there's a function get_data() -> pd.DataFrame that returns data with columns that are known in advance - how would you make this transparent to a user of this function? Is there anything smarter / more standardized than just spelling it out in the function's docstring?

  • So you want to select the columns based on the datatype? Commented Apr 22, 2019 at 0:27
  •
    No, I want to statically annotate the types of the columns Commented Apr 22, 2019 at 17:44
  •
    there seems to be some related wip in mypy: github.com/pandas-dev/pandas/issues/26792 also see some kind of workaround in stackoverflow.com/questions/46412821/… Commented Jan 11, 2020 at 17:36
  •
    Related to this, is there a way to type a single Series that mypy will understand? Something like pd.Series[str], pd.Series[int], etc Commented Feb 17, 2021 at 22:53
  •
    dataenforce wraps the DataFrame to allow exactly the kind of type hinting you describe. Commented Mar 15, 2021 at 12:31

2 Answers


pandera should be what you need.

A data validation library for scientists, engineers, and analysts seeking correctness.


1 Comment

This is precisely what I was looking for. Thank you!

This may be something you already know, but a reliable way to get the auto-completion you are after is to develop code "live" in Jupyter notebooks. This is very common in data science work. In your case it might be appropriate to instantiate a version of the DataFrame with the types you are looking for at the top of the notebook; Jupyter will then provide autocomplete for the columns and types as you code. It has an obvious advantage over a static IDE in terms of knowing what is in scope, because the DataFrame is actually loaded into memory while you are developing.
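A minimal sketch of the tip above (column names borrowed from the question's example): creating an empty DataFrame that carries only the intended schema is cheap, and once it exists in the kernel, Jupyter's completion knows the column names.

```python
import pandas as pd

# Empty DataFrame that documents the expected schema up front.
# With this object live in the kernel, df.<Tab> completes to df.a / df.B.
df = pd.DataFrame({
    "a": pd.Series(dtype=float),
    "B": pd.Series(dtype=int),
})
```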

Per above_c_level's comment, dataenforce looks promising for its integration with pytest (i.e. testing after the code is written), but unless there are some fancy integrations with your IDE, I don't think it will match Jupyter's "live" knowledge of the object.

2 Comments

Not the direct answer to the question, but insightful nonetheless! Thanks @DaveB
That's true for scripting and prototyping but not for developing production level code.
