
Is there a way to hint about a pandas DataFrame's schema "statically" so that we can get code completion, static type checking, and just general predictability during coding?

I wouldn't mind duplicating the schema info in code and in the type annotation for this to work.

So maybe something roughly like mypy comment type annotations:

df = pd.DataFrame({'a': [1.0, 2.4, 4.5], 'B': [1, 2, 3]})  # pd.schema: {'a': np.dtype(float), 'B': np.dtype(int)}

(or better yet have the schema specified in some external JSON file or such)

Then you can imagine things like df.<Tab> auto-completing during coding to df.a or df.B, or mypy (and any other static code analyzer) being able to infer the type of df.B[0] and the like.

Although hopeful, I'm guessing this isn't really possible (or desired...). If so, what would be a good standard for writing reusable code that returns pd.DataFrames with specific columns? So imagine there's a function get_data() -> pd.DataFrame that returns data with columns that are known in advance - how would you make this transparent to a user of this function? Is there anything smarter / more standardized than just spelling it out in the function's docstring?

  • So you want to select the columns based on the datatype? Commented Apr 22, 2019 at 0:27
  •
    No, I want to statically annotate the types of the columns Commented Apr 22, 2019 at 17:44
  •
    there seems to be some related wip in mypy: github.com/pandas-dev/pandas/issues/26792 also see some kind of workaround in stackoverflow.com/questions/46412821/… Commented Jan 11, 2020 at 17:36
  •
    Related to this, is there a way to type a single Series that mypy will understand? Something like pd.Series[str], pd.Series[int], etc Commented Feb 17, 2021 at 22:53
  •
    dataenforce wraps the DataFrame to allow exactly the kind of type hinting you describe. Commented Mar 15, 2021 at 12:31

2 Answers


pandera should be what you need.

A data validation library for scientists, engineers, and analysts seeking correctness.


1 Comment

This is precisely what I was looking for. Thank you!

This may be something you already know, but a reliable way to get the auto-completion you are after is to develop code "live" in Jupyter notebooks. This is very common in data science work. In your case it might be appropriate to instantiate a version of the DataFrame with the types you are looking for at the top of the notebook; Jupyter will then provide autocomplete for the columns and types as you code. It has an obvious advantage over a static IDE in terms of knowing what is in scope, because the DataFrame is actually loaded into memory while you are developing.
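A minimal sketch of the tip above (column names borrowed from the question's example): creating an empty DataFrame that carries only the intended schema is cheap, and once it exists in the kernel, Jupyter's completion knows the column names.

```python
import pandas as pd

# Empty DataFrame that documents the expected schema up front.
# With this object live in the kernel, df.<Tab> completes to df.a / df.B.
df = pd.DataFrame({
    "a": pd.Series(dtype=float),
    "B": pd.Series(dtype=int),
})
```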

Per above_c_level's comment, dataenforce looks promising for its integration with pytest (i.e. testing after the code is written), but unless there are some fancy integrations with your IDE, I don't think it will match Jupyter's "live" knowledge of the object.

2 Comments

Not the direct answer to the question, but insightful nonetheless! Thanks @DaveB
That's true for scripting and prototyping but not for developing production level code.
