175

Let's take a simple function that takes a str and returns a dataframe:

    import pandas as pd

    def csv_to_df(path):
        return pd.read_csv(path, skiprows=1, sep='\t', comment='#')

What is the recommended pythonic way of adding type hints to this function?

If I ask Python for the type of a DataFrame it returns pandas.core.frame.DataFrame. The following won't work, though, as it'll tell me that pandas is not defined.

    def csv_to_df(path: str) -> pandas.core.frame.DataFrame:
        return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
  • But you're using the pd alias, and you can probably define custom types. Commented May 10, 2017 at 11:15
  • @MosesKoledoye If I try pd.core.frame.DataFrame I'll get an AttributeError instead of a NameError. Commented May 10, 2017 at 11:16
  • I am not an authority on "pythonicity", but I would recommend docstrings (using ''' this function takes an inputType and returns an outputType '''); this is also what will be shown if someone calls help(yourFunction) on your function. Commented May 10, 2017 at 11:22
  • The library dataenforce allows checking the data types inside the data frame: github.com/CedricFR/dataenforce Commented Apr 21, 2020 at 13:49
  • Related on r/learnpython: How to specify pandas type-hint with columns Commented Apr 15, 2023 at 2:23

7 Answers

256

Why not just use pd.DataFrame?

    import pandas as pd

    def csv_to_df(path: str) -> pd.DataFrame:
        return pd.read_csv(path, skiprows=1, sep='\t', comment='#')

Result is the same:

    > help(csv_to_df)
    Help on function csv_to_df in module __main__:

    csv_to_df(path:str) -> pandas.core.frame.DataFrame
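If you also want a static checker to verify the annotation, mypy needs type stubs for pandas. A minimal sketch, assuming the separately installed pandas-stubs package (the file name and the data path are illustrative, not part of this answer):

    # example.py
    import pandas as pd

    def csv_to_df(path: str) -> pd.DataFrame:
        return pd.read_csv(path, skiprows=1, sep='\t', comment='#')

    df: pd.DataFrame = csv_to_df("data.tsv")  # OK
    s: pd.Series = csv_to_df("data.tsv")      # mypy flags an incompatible assignment

With pip install pandas-stubs followed by mypy example.py, the second assignment is reported; without stubs, older mypy/pandas combinations instead emit the "No library stub file for module 'pandas'" error mentioned in the comments below.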

8 Comments

it also won't allow specifying dtypes for specific columns, which could be extremely useful
@Philipp_Kats Currently there is no way to specify dtypes for DataFrame columns in type hints, and I haven't seen any work done in this direction (correct me if I'm wrong). Linking a related question on type hints with NumPy and dtypes: Type hint for NumPy ndarray dtype?. You will see that it's also not implemented there yet.
This gives an error in mypy: error: No library stub file for module 'pandas'
pd.DataFrame doesn't tell much unfortunately. The underlying df could have literally any shape and you wouldn't know.
@Nesha25 It is also similar to the difference between a list[int] and list[str]. Without the type parameter telling you what's "inside" the list, you don't really know what you can legally do with the contents. The same applies to dataframes. Additionally, your example of an "int greater than 3" is unusual indeed, but such "value constraints" are not so unusual in type systems - consider for example a "non-null pointer", a "non-zero divisor" or an "object with validated email address" (see the sketch after these comments). Such types are used in many places.
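To make the "value constraint" idea from the last comment concrete (a hedged sketch in plain typing, not part of the original thread; the names are made up for illustration), typing.NewType lets you record that a value has passed some runtime check, even though the checker never verifies the constraint itself:

    from typing import NewType

    # Hypothetical constrained type: an int that callers promise is non-zero.
    NonZeroInt = NewType("NonZeroInt", int)

    def as_non_zero(n: int) -> NonZeroInt:
        if n == 0:
            raise ValueError("expected a non-zero int")
        return NonZeroInt(n)

    def divide(numerator: float, divisor: NonZeroInt) -> float:
        # Passing a plain int here is flagged by a static checker such as mypy.
        return numerator / divisor

The runtime check lives in as_non_zero; the annotation only records that it happened, which is the same spirit as typed dataframe schemas.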
28

I'm currently doing the following:

    import pandas as pd
    from typing import TypeVar

    PandasDataFrame = TypeVar('pandas.core.frame.DataFrame')

    def csv_to_df(path: str) -> PandasDataFrame:
        return pd.read_csv(path, skiprows=1, sep='\t', comment='#')

Which gives:

    > help(csv_to_df)
    Help on function csv_to_df in module __main__:

    csv_to_df(path:str) -> ~pandas.core.frame.DataFrame

Don't know how pythonic that is, but it's understandable enough as a type hint, I find.

6 Comments

@Azat Ibrakov would you mind elaborating on your comment? Sometimes I'm not sure what is and isn't 'pythonic'.
I see people downvoting this answer. For context, this was the solution I found for my own question, and for all intents and purposes it works just fine. The more pythonic solution above, which I accepted as correct answer (but does have its own perks, see comments), was only provided 8 months afterwards.
It's not pythonic since it is less clear and harder to maintain than the accepted answer for this question. Since the type path here is not verified by the compiler it won't raise errors if it's wrong. This could happen from a typo in your TypeVar arg or change to the module itself.
I receive a warning when I use this: The argument to 'TypeVar()' must be a string equal to the variable name to which it is assigned
This is not the correct use of a type variable. A TypeVar exists to link two types together (mypy docs). You probably meant a type alias: PandasDataFrame = pandas.core.frame.DataFrame (see the sketch below).
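For reference, the alias suggested in the last comment looks like this (a minimal sketch, not part of the original answer). Unlike the string passed to TypeVar, it refers to the real class, so a typo fails immediately and type checkers can verify it:

    import pandas as pd

    # A plain type alias: just another name for the DataFrame class.
    PandasDataFrame = pd.DataFrame

    def csv_to_df(path: str) -> PandasDataFrame:
        return pd.read_csv(path, skiprows=1, sep='\t', comment='#')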
27

Update Nov 2023

Since the maintainer of dataenforce discontinued its development, I thought I'd update this answer. According to him:

Alternatively, I recommend using pandera which provides a similar functionality (& much more!)


See this answer for an example of how to use it.

Original answer

Now there is a pip package that can help with this: https://github.com/CedricFR/dataenforce

You can install it with pip install dataenforce and use very pythonic type hints like:

    from dataenforce import Dataset

    def preprocess(dataset: Dataset["id", "name", "location"]) -> Dataset["location", "count"]:
        pass


15

Take a look at pandera.

pandera provides a flexible and expressive API for performing data validation on dataframe-like objects to make data processing pipelines more readable and robust. Dataframes contain information that pandera explicitly validates at runtime. This is useful in production-critical or reproducible research settings.


The advantage of pandera is that you can also specify dtypes of individual DataFrame columns. The following example uses pandera to enforce, at runtime, a DataFrame containing a single column of integers:

    import pandas as pd
    import pandera
    from pandera.typing import DataFrame, Series

    class Integers(pandera.SchemaModel):
        number: Series[int]

    @pandera.check_types
    def my_fn(a: DataFrame[Integers]) -> None:
        pass

    # This works
    df = pd.DataFrame({"number": [2002, 2003]})
    my_fn(df)

    # Raises an exception
    df = pd.DataFrame({"number": [2002.0, 2003]})
    my_fn(df)

    # Raises an exception
    df = pd.DataFrame({"number": ['2002', 2003]})
    my_fn(df)
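As a side note, newer pandera releases (roughly 0.14 and later; this is an assumption about versions not covered by the answer) rename SchemaModel to DataFrameModel, and Field can attach value checks to a column in addition to its dtype. A hedged sketch:

    import pandas as pd
    import pandera as pa
    from pandera.typing import DataFrame, Series

    # Same idea as above, with the newer class name and an extra value constraint.
    class PositiveIntegers(pa.DataFrameModel):
        number: Series[int] = pa.Field(gt=0)  # integer dtype, and values must be > 0

    @pa.check_types
    def my_fn(a: DataFrame[PositiveIntegers]) -> None:
        pass

    my_fn(pd.DataFrame({"number": [2002, 2003]}))  # passes
    my_fn(pd.DataFrame({"number": [-1, 2003]}))    # raises a schema error at runtime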


7

Check out the answer given here which explains the usage of the package data-science-types.

    pip install data-science-types

Demo

    # program.py
    import pandas as pd

    df: pd.DataFrame = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})  # OK
    df1: pd.DataFrame = pd.Series([1, 2, 3])  # error: Incompatible types in assignment

Run using mypy the same way:

    $ mypy program.py

3 Comments

Unfortunately, this is buried at the bottom. In 2021 this is the best answer. Note too the comment by Daniel Malachov following the linked answer (stackoverflow.com/a/63446142/8419574).
@user3897315 - I disagree that this is the best answer in 2021. If you visit data-science-types on GitHub you'll find the repository has been archived, and the README updated (on Feb 16 2021) with the following note: "⚠️ this project has mostly stopped development ⚠️ The pandas team and the numpy team are both in the process of integrating type stubs into their codebases, and we don't see the point of competing with them."
I agree, but even so I don't see a timeline or roadmap ETA for when pandas or numpy will ship these stubs.
1

This is straying from the original question, but building off @dangom's answer using TypeVar and @Georgy's comment that there is no way to specify datatypes for DataFrame columns in type hints, you could use a simple workaround like this to specify datatypes in a DataFrame:

    import pandas as pd
    from typing import TypeVar

    DataFrameStr = TypeVar("pandas.core.frame.DataFrame(str)")

    def csv_to_df(path: str) -> DataFrameStr:
        return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
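Since the string passed to TypeVar is never checked (see the comments on the TypeVar answer above), an alternative sketch uses typing.Annotated, assuming Python 3.9 or later; the column note here is purely informational metadata that checkers ignore, and the alias name is made up for illustration:

    import pandas as pd
    from typing import Annotated

    # Hypothetical alias: checkers treat it as a plain pd.DataFrame, while the
    # metadata documents the intended column dtype for readers and tooling.
    DataFrameStr = Annotated[pd.DataFrame, "columns: str"]

    def csv_to_df(path: str) -> DataFrameStr:
        return pd.read_csv(path, skiprows=1, sep='\t', comment='#')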


1

A more concise way is either

    import pandas as pd

    def csv_to_df(path: str) -> pd.DataFrame:
        return pd.read_csv(path, skiprows=1, sep='\t', comment='#')

    # as for a parameter type hint
    def handle_df(df: pd.DataFrame):
        pass

or

    import pandas as pd
    from pandas import DataFrame

    def csv_to_df(path: str) -> DataFrame:
        return pd.read_csv(path, skiprows=1, sep='\t', comment='#')

    # as for a parameter type hint
    def handle_df(df: DataFrame):
        pass

But the drawback of this kind of type hint is that you cannot add constraints on the column names and column types.

I am still looking for a solution to that. If anyone knows one, please tell me.

