
I'm having trouble making sense of why a call to pandas' dataframe.apply method is not returning the expected result. Could someone please shed some light on why the first call to apply shown below doesn't return an expected result, while the second one does?

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "x": [1, 2, np.nan],
        "y": ["hi", "there", np.nan]
    })

    print(df)
    #>      x      y
    #> 0  1.0     hi
    #> 1  2.0  there
    #> 2  NaN    NaN

    print(df.dtypes)
    #> x    float64
    #> y     object
    #> dtype: object

    # why would something like this not return the expected result (which should
    # be TRUE, FALSE):
    print(df.apply(lambda x: np.issubdtype(x, np.number)))
    #> x    False
    #> y    False
    #> dtype: bool

    # but something like this returns the expected result (i.e., median imputation
    # is used if the series is a number, otherwise NULLs are replaced with "MISSING"):
    def replace_nulls(s):
        is_numeric = np.issubdtype(s, np.number)
        missing_value = s.median() if is_numeric else "MISSING"
        return np.where(s.isnull(), missing_value, s)

    print(df.apply(replace_nulls))
    #>      x        y
    #> 0  1.0       hi
    #> 1  2.0    there
    #> 2  1.5  MISSING

Created on 2019-10-03 by the reprexpy package

  • That seems broken to me. pd.Series({k: np.issubdtype(v, np.number) for k, v in df.items()}) works but yours doesn't. Commented Oct 3, 2019 at 15:54
  • Hmm, yeah, not sure why a comprehension like that would work where apply doesn't. Commented Oct 3, 2019 at 16:15
  • apply does a lot of checking and things that make it safe. This is a bug and I don't have the patience to unwind it at the moment. The comprehension is exactly what it seems to be and therefore no surprises. Let me know if you want to submit a bug report. Otherwise, I will. Commented Oct 3, 2019 at 16:25
  • I opened github.com/pandas-dev/pandas/issues/28773 Commented Oct 3, 2019 at 17:32
  • As of pandas 1.3.4, this works as OP intended. Commented Apr 5, 2022 at 17:29
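
For anyone who wants to try the comprehension from the first comment, here is a minimal sketch. It differs slightly from the comment: it passes each column's dtype explicitly, and it builds the "y" column with dtype=object (my assumption, so that the example behaves the same on pandas versions that infer a dedicated string dtype). The name is_numeric is just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, np.nan],
    # dtype=object is set explicitly (an assumption for portability across pandas versions)
    "y": pd.Series(["hi", "there", np.nan], dtype=object),
})

# Check each column's dtype directly, without going through DataFrame.apply.
is_numeric = pd.Series(
    {name: np.issubdtype(col.dtype, np.number) for name, col in df.items()}
)
print(is_numeric)
```

Because the comprehension only iterates over df.items(), there is no apply machinery between the column and the dtype check.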

1 Answer


The confusion arises from a misunderstanding about what is passed to the function in DataFrame.apply and how np.issubdtype is used.

  • DataFrame.apply passes each column as a pandas Series to the lambda function, not the dtype of the column directly.
  • np.issubdtype expects two arguments: a dtype (or something NumPy can coerce to one) and the dtype or abstract type, such as np.number, to check against.

Passing a Series (which is what x is) to np.issubdtype is therefore not the intended usage: the function is designed to work with dtypes, not with Series objects directly, and on the pandas version used in the question it returns False for both columns.
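
To see the intended usage in isolation, here is a small sketch that calls np.issubdtype on dtypes only, with no pandas involved. The variable names are mine, purely for illustration.

```python
import numpy as np

# float64 sits under np.number in NumPy's type hierarchy.
float_is_number = np.issubdtype(np.dtype("float64"), np.number)

# The generic object dtype does not.
object_is_number = np.issubdtype(np.dtype("object"), np.number)

# The second argument can be any abstract type, e.g. np.integer.
float_is_integer = np.issubdtype(np.float64, np.integer)

print(float_is_number, object_is_number, float_is_integer)
```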

You should pass the column's dtype instead:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "x": [1, 2, np.nan],
        "y": ["hi", "there", np.nan]
    })

    print(df.apply(lambda x: np.issubdtype(x.dtype, np.number)))

which gives

    x     True
    y    False
    dtype: bool

In other words, the only change needed in your first case is to pass x.dtype rather than x to np.issubdtype.

The same change applies to the function in your second case:

    def replace_nulls(s):
        # Check the dtype of the Series rather than the Series itself
        is_numeric = np.issubdtype(s.dtype, np.number)
        missing_value = s.median() if is_numeric else "MISSING"
        return np.where(s.isnull(), missing_value, s)
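
If you would rather not call np.issubdtype at all, one possible alternative is sketched below. Note that this is a swap, not the code from the question: it uses pandas.api.types.is_numeric_dtype for the check and Series.fillna in place of np.where, so that each column comes back as a Series. Treat it as a sketch under those assumptions.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def replace_nulls(s: pd.Series) -> pd.Series:
    # Decide based on the column's dtype, not on the Series object itself.
    if is_numeric_dtype(s.dtype):
        # Median imputation for numeric columns (median skips NaN by default).
        return s.fillna(s.median())
    # Any other column gets a placeholder string.
    return s.fillna("MISSING")


df = pd.DataFrame({
    "x": [1, 2, np.nan],
    "y": ["hi", "there", np.nan],
})

result = df.apply(replace_nulls)
print(result)
```

Returning a Series from the function keeps the original index and column labels in the result of apply.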