Skip to content

Conversation

@NumberPiOso
Copy link
Contributor

@NumberPiOso NumberPiOso commented Apr 25, 2022

@NumberPiOso
Copy link
Contributor Author

Special attention to line 4139 in

pandas/pandas/core/frame.py

Lines 4137 to 4141 in b8e1aa4

if not is_bool_dtype(res):
# Special condition to check when dealing with higher dimensions
if not (res.ndim > 1 and (res.dtypes == bool).all()):
msg = f"expr must evaluate to boolean not {res.dtypes}"
raise ValueError(msg)

This condition was added in order to pass some special test that used higher dimensions. These tests were

pandas/tests/frame/test_query_eval.py::TestDataFrameQueryNumExprPandas::test_nested_scope
pandas/tests/frame/test_query_eval.py::TestDataFrameQueryPythonPandas::test_nested_scope

This is because these tests produce a res that evaluates to false in first condition (line 4137)

res = 0 1 2 0 True False False 1 False False False 2 False False True 3 True True False 4 False True True

So in this case we check for column to be bool type

@rhshadrach rhshadrach added Bug expressions pd.eval, query labels Apr 25, 2022
if res.ndim == 1:
is_bool_result = is_bool_dtype(res)
elif res.ndim > 1:
is_bool_result = all(is_bool_dtype(x) for x in res.dtypes)
Copy link
Contributor Author

@NumberPiOso NumberPiOso Apr 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not completely sure if this generalizes to higher dimensions or I can fix the condition to res.ndim == 2 for improved readibility

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only offer up to 2 dimensional objects in pandas so should be fine, though there also might be a more explicit way with typing to branch here. @simonjayhawkins

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need any of list, you can call common.is_bool_indexer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback
It correctly raises when they are not bool. But we go back to problems with tests
pandas/tests/frame/test_query_eval.py::TestDataFrameQueryNumExprPandas::test_nested_scope
pandas/tests/frame/test_query_eval.py::TestDataFrameQueryPythonPandas::test_nested_scope

The first one produces an output like the following res variable.

import pandas as pd from pandas.core.dtypes.common import is_bool_dtype from pandas.core import common as com res = pd.DataFrame([ [False, False, False], [False, True, False], [False, False, False], [False, False, False], [False, False, False], ]) if res.ndim == 1: is_bool_result = is_bool_dtype(res) elif res.ndim > 1: is_bool_result = all(is_bool_dtype(x) for x in res.dtypes) new_is_bool_result = com.is_bool_indexer(res) # is_bool_result: True # new_is_bool: False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok when how about we use it then handle those cases. I really do not like re-inventing the wheel.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only cases is_bool_indexer handles are Series, ndarray, Index, and lists. I don't see anything in dtypes or common that handles DataFrames. I'm not sure if adding DataFrame to is_bool_indexer with its current uses is safe. Perhaps we could add is_bool_frame? The implementation above appears to be correct to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new implementation that takes into account #46862 (comment) I feel we need this if/else.

In general, it seems logical to me that these cases are separated.

But if you do not agree, I want to understand a little more the proposed modification. Is this the implementation you imagined?:

 if not is_bool_frame(res): msg = f"expr must evaluate to boolean not {res.dtypes}" raise ValueError(msg) try: result = self.loc[res] except ValueError: # when res is multi-dimensional loc raises, but this is sometimes a # valid query result = self[res]
@NumberPiOso NumberPiOso requested a review from rhshadrach April 26, 2022 21:51
if res.ndim == 1:
is_bool_result = is_bool_dtype(res)
elif res.ndim > 1:
is_bool_result = all(is_bool_dtype(x) for x in res.dtypes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need any of list, you can call common.is_bool_indexer

@NumberPiOso NumberPiOso requested a review from jreback April 28, 2022 15:00
is_bool_result = all(is_bool_dtype(x) for x in res.dtypes)
if not is_bool_result:
msg = f"expr must evaluate to boolean not {res.dtypes}"
raise ValueError(msg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of curiosity, as not looked at this in detail, but if we are ensuring a bool indexer, is the following try/except still needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it is a common problem with the multidimensional res, so we can share the solution.

@NumberPiOso
Copy link
Contributor Author

Sorry for the long delay, I had a couple of hard weeks. But now on I can and will iterate a lot quicker.

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jul 11, 2022
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Jul 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

7 participants