-
- Notifications
You must be signed in to change notification settings - Fork 19.4k
BUG: Raise error when expr does not evaluate to bool in df.query #46862
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
daf9245 to b8e1aa4 Compare | Special attention to line 4139 in Lines 4137 to 4141 in b8e1aa4
This condition was added in order to pass some special test that used higher dimensions. These tests were pandas/tests/frame/test_query_eval.py::TestDataFrameQueryNumExprPandas::test_nested_scope This is because these tests produce a res that evaluates to false in first condition (line 4137) res = 0 1 2 0 True False False 1 False False False 2 False False True 3 True True False 4 False True TrueSo in this case we check for column to be bool type |
| if res.ndim == 1: | ||
| is_bool_result = is_bool_dtype(res) | ||
| elif res.ndim > 1: | ||
| is_bool_result = all(is_bool_dtype(x) for x in res.dtypes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not completely sure if this generalizes to higher dimensions or I can fix the condition to res.ndim == 2 for improved readibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only offer up to 2 dimensional objects in pandas so should be fine, though there also might be a more explicit way with typing to branch here. @simonjayhawkins
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need any of list, you can call common.is_bool_indexer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback
It correctly raises when they are not bool. But we go back to problems with tests
pandas/tests/frame/test_query_eval.py::TestDataFrameQueryNumExprPandas::test_nested_scope
pandas/tests/frame/test_query_eval.py::TestDataFrameQueryPythonPandas::test_nested_scope
The first one produces an output like the following res variable.
import pandas as pd from pandas.core.dtypes.common import is_bool_dtype from pandas.core import common as com res = pd.DataFrame([ [False, False, False], [False, True, False], [False, False, False], [False, False, False], [False, False, False], ]) if res.ndim == 1: is_bool_result = is_bool_dtype(res) elif res.ndim > 1: is_bool_result = all(is_bool_dtype(x) for x in res.dtypes) new_is_bool_result = com.is_bool_indexer(res) # is_bool_result: True # new_is_bool: FalseThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok when how about we use it then handle those cases. I really do not like re-inventing the wheel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only cases is_bool_indexer handles are Series, ndarray, Index, and lists. I don't see anything in dtypes or common that handles DataFrames. I'm not sure if adding DataFrame to is_bool_indexer with its current uses is safe. Perhaps we could add is_bool_frame? The implementation above appears to be correct to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the new implementation that takes into account #46862 (comment) I feel we need this if/else.
In general, it seems logical to me that these cases are separated.
But if you do not agree, I want to understand a little more the proposed modification. Is this the implementation you imagined?:
if not is_bool_frame(res): msg = f"expr must evaluate to boolean not {res.dtypes}" raise ValueError(msg) try: result = self.loc[res] except ValueError: # when res is multi-dimensional loc raises, but this is sometimes a # valid query result = self[res]| if res.ndim == 1: | ||
| is_bool_result = is_bool_dtype(res) | ||
| elif res.ndim > 1: | ||
| is_bool_result = all(is_bool_dtype(x) for x in res.dtypes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need any of list, you can call common.is_bool_indexer
| is_bool_result = all(is_bool_dtype(x) for x in res.dtypes) | ||
| if not is_bool_result: | ||
| msg = f"expr must evaluate to boolean not {res.dtypes}" | ||
| raise ValueError(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of curiosity, as not looked at this in detail, but if we are ensuring a bool indexer, is the following try/except still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed it is a common problem with the multidimensional res, so we can share the solution.
| Sorry for the long delay, I had a couple of hard weeks. But now on I can and will iterate a lot quicker. |
| This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
| Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
doc/source/whatsnew/v1.5.0.rstfile if fixing a bug or adding a new feature.