ENH: support CategoricalIndex (GH7629) #9741
Conversation
| cc @TomAugspurger So the main point to note here is that we don't have the concept of This is actually a good thing, it makes the entire discussion we had w.r.t. the groupby sort issue moot (well if you are grouping by a |
| are there operations that currently return say an |
| @jreback What happens if I try to insert new values into a categorical index that aren't already in the categories? |
| hmm, I think
|
This I could make work, but IIRC we decided to have this merge the categories (though in this case they are the same)....hmmm |
so the appending operation works, but converts you to a |
pandas/core/index.py Outdated
Does CategoricalIndex(['a', 'b'], categories=['a', 'b']) == CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c']) return True? i.e. the values are the same but the categories (possible values) differ.
I just checked on Categoricals and we raise a TypeError if the categories aren't identical.
    In [1]: c1 = pd.Categorical(['a', 'b'], categories=['a', 'b'])
    In [2]: c2 = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
    In [5]: c1 == c2
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-5-d8d43a43a02a> in <module>()
    ----> 1 c1 == c2

    /Users/tom.augspurger/Envs/py3/lib/python3.4/site-packages/pandas-0.16.0_19_g8d2818e-py3.4-macosx-10.10-x86_64.egg/pandas/core/categorical.py in f(self, other)
         38         if (len(self.categories) != len(other.categories)) or \
         39            not ((self.categories == other.categories).all()):
    ---> 40             raise TypeError("Categoricals can only be compared if 'categories' are the same")
         41         if not (self.ordered == other.ordered):
         42             raise TypeError("Categoricals can only be compared if 'ordered' is the same")

    TypeError: Categoricals can only be compared if 'categories' are the same

We should probably raise here too. Ohh, and maybe that's handled in self._data == other._data?
9c22b53 to c1730ef Compare
| I now dispatch to Categorical for comparisons, providing conversions for Index and Categorical, but they must match in categories/ordered |
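The comparison rule discussed in this thread can be sketched as below. This is a minimal illustration, assuming a reasonably recent pandas; the exact error message may differ between versions, and `raises_type_error` is a hypothetical helper written for this example, not a pandas function.

```python
import pandas as pd

# Same values, but c2 allows an extra category 'c'.
c1 = pd.Categorical(["a", "b"], categories=["a", "b"])
c2 = pd.Categorical(["a", "b"], categories=["a", "b", "c"])


def raises_type_error(left, right):
    """Hypothetical helper: True if `left == right` raises TypeError."""
    try:
        left == right
    except TypeError:
        return True
    return False


# Categorical refuses to compare when the categories differ.
categorical_mismatch = raises_type_error(c1, c2)

# CategoricalIndex dispatches to Categorical, so it follows the same rule.
index_mismatch = raises_type_error(pd.CategoricalIndex(c1), pd.CategoricalIndex(c2))

# With matching categories the comparison is element-wise.
same_categories = pd.CategoricalIndex(c1) == pd.CategoricalIndex(c1.copy())
```

Raising rather than returning False keeps a silent mismatch in the set of possible values from being mistaken for an ordinary inequality.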
| I fixed the concat issues to behave the same as concatenating columns (e.g. non-matching categories make this raise). |
379eda0 to 918c01a Compare
pandas/core/categorical.py Outdated
I don't like this, particularly because the dtype of the returned array will usually not be object.
Why don't you simply do return np.array(self, dtype=dtype) and allow any valid numpy dtype?
that will cause numpy to error if dtype=='category';
or are you just talking about where I have is_object_dtype (and just make the return np.array(self, dtype=dtype))? That would seem to be ok
Yes, just for the cases when we already know it's not dtype='category'
fixed
pandas/core/index.py Outdated
__array__ actually is supposed to take an optional dtype argument, not result, which should be passed on to the np.array call below and eventually on to Categorical, which should return an array of the appropriate type.
hmm, ok, we don't do this for Series, i'll fix for Categorical/CategoricalIndex here, maybe make a separate issue for this
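To illustrate the `__array__` point from this thread: NumPy forwards a requested dtype to an object's `__array__` method, so the conversion can return an array of the appropriate type. A small sketch, assuming a recent pandas and NumPy; it shows the observable behavior only, not the code that was changed in this PR.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical([1, 2, 1])

# Without a dtype, the result takes the dtype of the categories (integers here).
default_values = np.asarray(cat)

# With a dtype, NumPy passes it through to __array__.
float_values = np.asarray(cat, dtype="float64")

# The same conversion works through a CategoricalIndex.
index_values = np.asarray(pd.CategoricalIndex(cat), dtype="float64")
```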
| What about Categorical levels in a MultiIndex? |
@jreback you are my hero :)
| did another read through -- looking pretty good to me! |
Should there also be a test for identical? (or is it already somewhere else?)
162dee9 to d44e812 Compare
| @shoyer @jorisvandenbossche @JanSchulz any other comments....going to merge |
| Looks good to me! On Sun, Apr 12, 2015 at 10:06 AM, jreback notifications@github.com
|
doc/source/advanced.rst Outdated
small issue: you removed the label of the "Float64Index" section below this
fixed
| Another thing: you added the
|
| Ah, but I see that you added |
beac7d3 to 2f0953e Compare
| ok, marked |
| @jreback I know we already had some similar discussion before, but even if the docstring says that it is an "internal, non-public method", nevertheless, it will appear in tab completion, and there will be an api page for those methods in the documentation (as this is done automatically), making them de facto public. But I know it is a difficult discussion, and the line between a "public for other parts of pandas" and "really an internal helper function" is not always clear and easy to draw. |
| ok I made I left any more comments.......? |
raise KeyError when accessing invalid elements
setting elements not in the categories is equiv of .append() (which coerces to an Index)
ENH: support CategoricalIndex (GH7629)
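The lookup behavior named in the commit message above can be sketched as follows, assuming a recent pandas. The frame and its labels are made up for illustration. Only the KeyError case is shown, since the result of setting a label outside the categories may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3]},
    index=pd.CategoricalIndex(["a", "b", "a"], categories=["a", "b", "c"]),
)

# Label lookup works as with a duplicated Index: every matching row is returned.
rows_a = df.loc["a"]

# Accessing a label that is not in the index raises KeyError.
try:
    df.loc["z"]
    missing_raises = False
except KeyError:
    missing_raises = True
```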
| bombs away! |
small leftover from rebasing
fixed
| just a small issue in the whatsnew. But thanks a lot! It was an extensive, but good, discussion! |
| Woot! Thanks @jreback |
closes #7629
xref #8613
xref #8074

- auto-create a CategoricalIndex when grouping by a Categorical (this doesn't ATM)
- df2.loc['d'] = 5 should do what? (currently will coerce to an Index)
- pd.concat([df2,df]) should STILL have a CategoricalIndex (yep)?
- min/max Categorical wrapper methods

A CategoricalIndex is essentially a drop-in replacement for Index that works nicely for non-unique values. It uses a Categorical to represent itself. The behavior is very similar to using a duplicated Index (for, say, indexing).

Groupby works naturally (and returns another CategoricalIndex). The only real departure is that .sort_index() works as you would expect (which is a good thing :). Clearly this will provide idempotency for set/reset index w.r.t. Categoricals, and thus memory savings by its representation.

This doesn't change the API at all. IOW, this is not turned on by default; you have to either use set/reset, assign an index, or pass a Categorical to Index.
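
The description above can be illustrated with a short sketch, assuming a recent pandas. The frame, column names and category order are made up for illustration and are not taken from the PR.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": range(6),
        # Categories are deliberately declared in the order c, a, b.
        "B": pd.Categorical(list("aabbca"), categories=list("cab")),
    }
)

# Opting in: setting a Categorical column as the index gives a CategoricalIndex.
df2 = df.set_index("B")

# Indexing works like a duplicated Index: all rows labelled 'a' come back.
rows_a = df2.loc["a"]

# sort_index() follows the category order (c, a, b), not lexical order.
sorted_labels = list(df2.sort_index().index)

# Grouping by the index level returns another CategoricalIndex.
grouped_index = df2.groupby(level=0, observed=True).sum().index

# set_index / reset_index round-trips the categorical dtype.
round_trip = df2.reset_index()

# Passing a Categorical to Index is another way to opt in.
from_categorical = pd.Index(pd.Categorical(["a", "b", "a"]))
```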