Skip to content

Conversation

@jmarintur
Copy link
Contributor

Ensure that when constructing a DataFrame with a list of Series with different CategoricalIndexes, the resulting columns are categorical.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix is too specific. This should be fixed where the Index[object] is being created

@jmarintur
Copy link
Contributor Author

Hi @mroeschke, Unit Tests failures seem unrelated (when running them locally everything is ok).

if isinstance(self, CategoricalIndex) and isinstance(other, CategoricalIndex):
both_categories = self.categories
# if ordered and unordered, we set categories to be unordered
ordered = False if self.ordered != other.ordered else None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do tests fail if you just do:

both_categories = union_categorical([self.categories, other.categories]) self = self.set_categories(both_categories) other = other.set_categories(both_categories) 
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mroeschke, switching to using that piece of code, you get:
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mroeschke, any suggestion on how you think we should handle the PR at this point? Thank you!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I guess it should have been

both_categories = union_categorical([self, other]) self = self.set_categories(both_categories) other = other.set_categories(both_categories)

The main point here is that ideally there shouldn't be an re-invention of categorical union logic in union

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mroeschke, thanks for your guidance. Taking into account your suggestion, and my previous comments, we'd need to change union and _union. Does it sound good to you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mroeschke, I've just tried your suggestion:

if isinstance(self.dtype, CategoricalDtype) and isinstance( other.dtype, CategoricalDtype ): both_categories = union_categoricals([self, other]) self = self.set_categories(both_categories) other = other.set_categories(both_categories) 

If I run it with this change with the following example:

s1 = pd.Series([1, 2], index=pd.CategoricalIndex(["a", "b"], ordered=False)) s2 = pd.Series([3, 4], index=pd.CategoricalIndex(["b", "c"], ordered=False)) pd.DataFrame([s1, s2]).columns 

union_categoricals returns:

['a', 'b', 'b', 'c'] Categories (3, object): ['a', 'b', 'c'] 

which is the expected behaviour of that function. However, it triggers the following error when setting the categories in self.set_categories(both_categories):

ValueError: Categorical categories must be unique

We only need to change the code to:

if isinstance(self.dtype, CategoricalDtype) and isinstance( other.dtype, CategoricalDtype ): both_categories = union_categoricals([self, other]).categories self = self.set_categories(both_categories) other = other.set_categories(both_categories) 

to make it work for the example above, however, some tests won't pass, e.g.:

FAILED pandas/tests/reshape/concat/test_append.py::TestAppend::test_append_same_columns_type[CategoricalIndex1] - TypeError: Cannot use sort_categories=True with ordered Categoricals FAILED pandas/tests/reshape/concat/test_append.py::TestAppend::test_append_different_columns_types[CategoricalIndex-CategoricalIndex] - TypeError: Categorical.ordered must be the same 

leading to the need of handling the order as I did in the last commits.

Any thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I've just pushed another commit with, what I think, it is a better way on handling it. It does follow the original logic when both CategoricalIndex differ. Please let me know what you think.

@jmarintur jmarintur requested a review from mroeschke April 4, 2024 09:30
@mroeschke mroeschke added Categorical Categorical Data Type DataFrame DataFrame data structure Constructors Series/DataFrame/Index/pd.array Constructors labels Apr 23, 2024
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this May 31, 2024
@jmarintur
Copy link
Contributor Author

Dear @mroeschke, could you please give me some feedback? I'd like to understand what's missing in my last changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Categorical Categorical Data Type Constructors Series/DataFrame/Index/pd.array Constructors DataFrame DataFrame data structure

2 participants