Skip to content

Conversation

@jreback
Copy link
Contributor

@jreback jreback commented Feb 28, 2015

closes #8613

In [1]: df = DataFrame(np.random.randn(5,4), columns=list('ABCD'), index=date_range('20130101',periods=5)) In [2]: df Out[2]: A B C D 2013-01-01 -1.380065 1.221596 0.279703 -0.119258 2013-01-02 1.684202 -0.202251 -0.961145 -1.595015 2013-01-03 -0.623634 -2.377577 -3.024554 -0.298809 2013-01-04 -1.251699 0.456356 1.257002 1.465878 2013-01-05 0.687818 -2.125079 1.454911 1.914207 In [5]: s = Series(range(5),[-2,-1,1,2,3]) In [6]: s Out[6]: -2 0 -1 1 1 2 2 3 3 4 dtype: int64 In [7]: df.loc['2013-01-02':'2013-01-10'] Out[7]: A B C D 2013-01-02 1.684202 -0.202251 -0.961145 -1.595015 2013-01-03 -0.623634 -2.377577 -3.024554 -0.298809 2013-01-04 -1.251699 0.456356 1.257002 1.465878 2013-01-05 0.687818 -2.125079 1.454911 1.914207 In [9]: s.loc[-10:3] Out[9]: -2 0 -1 1 1 2 2 3 3 4 dtype: int64 
# this used to be a KeyError In [15]: df.loc[2:3] TypeError: Cannot index datetime64 with <type 'int'> keys 

Slicing with float indexers; now works for ix (loc worked before)

In [3]: s.loc[-1.0:2] Out[3]: -1 1 1 2 2 3 dtype: int64 In [4]: s.ix[-1.0:2] Out[4]: -1 1 1 2 2 3 dtype: int64 # [] raises In [5]: s[-1.0:2] TypeError: cannot do slice start value indexing on <class 'pandas.core.index.Int64Index'> with these indexers [-1.0] of <type 'float'> 
@jreback jreback added Bug Indexing Related to indexing on series/frames, not to indexes themselves API Design labels Feb 28, 2015
@jreback jreback added this to the 0.16.0 milestone Feb 28, 2015
@jreback jreback force-pushed the loc branch 4 times, most recently from f5b9f7e to d61b9b5 Compare February 28, 2015 22:42
@jreback
Copy link
Contributor Author

jreback commented Feb 28, 2015

cc @immerrr
@jorisvandenbossche
@shoyer

I think this should resolve things and make .loc strict, while matching .ix (notably in that .ix tries for positional access when it can, but .loc will refuse to guess).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would here also explain it in words, something like "slicing where the start and/or stop bound is not found in the index is now allowed (previously gave KeyError), as is the case for ix. The change is only for slicing, not when indecxing with a single label."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jreback jreback force-pushed the loc branch 2 times, most recently from 1141055 to 2c2cf61 Compare March 1, 2015 16:06
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 to deleting lines from core.indexing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoo hoo!

@shoyer
Copy link
Member

shoyer commented Mar 1, 2015

First of all, big thanks for working on this, @jreback!

However, I don't think we need to enforce distinctions between float and int for slice bounds, because there is no ambiguity in the ordering between integers and floats (whereas the ordering with respect to strings, for example, is indeed problematic). Users shouldn't have to check the type of their index (if they know it's numeric) before doing indexing.

In particular:

  1. I think it should be OK to index an Int64Index with floats, like .loc[0.5:2.5] or .loc[-1.0:2].
  2. I think it should be OK to index a Float64Index with integers, like .loc[0:10]. (Of course, these are still to be interpreted as labels, not integer positions)

The binary search logic for determining slice bound locations should already work for both of these cases.

@shoyer
Copy link
Member

shoyer commented Mar 1, 2015

To give a more concrete example from my experience: an index corresponding to a variable like longitude might variously consist of integer (e.g., if it's in multiples of whole degrees) or as floating point numbers (e.g., if it's in multiples of half degrees). It's a bad user experience to need to cast their slice bounds to the type of index. Code like s.loc[min_lon:max_lon] should work regardless of dtypes.

@jreback
Copy link
Contributor Author

jreback commented Mar 1, 2015

@shoyer

I added in these checks, mainly because numpy is deprecating float indexers (IIRC it has a warning now), and we have been warning on it.

Its easy to take out, but I think that indexing with a float on an integer index is really not in the .loc domain. This is moving back to the .ix functionaility (which actually doesn't work in 0.15.2 or atm), though .loc did (which is odd).

though it IS convenient.

currently we don't allow this for [],.ix (but allow for .loc in 0.15.2.

simple example from 0.15.2

In [1]: s = Series(range(5)) In [2]: s[1.5:3] TypeError: the slice start value [1.5] is not a proper indexer for this index type (Int64Index) In [3]: s.ix[1.5:3] TypeError: the slice start value [1.5] is not a proper indexer for this index type (Int64Index) In [4]: s.loc[1.5:3] Out[4]: 2 2 3 3 dtype: int64 

These all fail with
TypeError: cannot do slice start value indexing on <class 'pandas.core.index.Int64Index'> with these indexers [1.5] of <type 'float'> in master/0.16.0

@jreback
Copy link
Contributor Author

jreback commented Mar 1, 2015

@jorisvandenbossche what do you think?

@shoyer
Copy link
Member

shoyer commented Mar 1, 2015

I agree that using float indexers is a based idea for position based indexing (.iloc). Position based indexing should always use integers. We should be consistent with numpy for this.

But indexing with .loc uses labels. Here (e.g., with searchsorted), numpy is more flexible -- the important thing is that there is a well defined sort order.

@jreback
Copy link
Contributor Author

jreback commented Mar 1, 2015

this should then also work for .ix/[]?

@shoyer
Copy link
Member

shoyer commented Mar 1, 2015

To elaborate:

There is never a good reason to use floats for position based indexing. Numpy warns and/or raises because using a float is an indication that your code probably has a bug.

In contrast, there certainly are good reasons to use floats for label based indexing, because the labels might include floats themselves (or might not, depending on dtypes).

@shoyer
Copy link
Member

shoyer commented Mar 1, 2015

.ix and [] are an unpredictable mess! We hadn't even written down the rules for exactly how indexing with [] works last time I checked.

If I had my way, we would make some hard choices and deprecate/remove fall-back indexing all together (#9213). In the mean-time, I suspect the best approach is to simply try to break/change as little existing code as possible.

@jreback
Copy link
Contributor Author

jreback commented Mar 1, 2015

ok, here is the change (plus making some testing more robust)

In [3]: index = tm.makeIntIndex() In [4]: s = Series(np.arange(len(index))+10,index+5) In [5]: s Out[5]: 5 10 6 11 7 12 8 13 9 14 10 15 11 16 12 17 13 18 14 19 dtype: int64 In [6]: s.loc[6.0:8.5] Out[6]: 6 11 7 12 8 13 dtype: int64 In [7]: s.loc[5.5:8.5] Out[7]: 6 11 7 12 8 13 dtype: int64 In [8]: s.loc[5.5:8.0] Out[8]: 6 11 7 12 8 13 dtype: int64 

I could enable for []/ix, but then would s[5.0:8.0] become a fallback scenario (yes I think)
so what would s[5.5,8.5] do (non-fallback I think), so they become very sensitive to the actual values, hence bad

@jorisvandenbossche
Copy link
Member

I agree with the logic of allowing float indexers for integer indexes when using loc (so as you now implemented)

Ideally, I think it should also work for ix (to keep ix as much as possible to the logic of try: s.loc[]; except: s.iloc[]. In principle , we say that for an integer index axis, we do never integer location fallback (http://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing), so following the loc behaviour should not be ambiguous in this case I think.

For [], it is something else. For that, we should first have a better documented overview of it behaviour (I started with that last week, will try to put it up as a PR)

@jreback
Copy link
Contributor Author

jreback commented Mar 2, 2015

ok, I added the .ix support for float indexers on an Int index. It is trivial to make this work for []. I am thinking that this should have the same semantics as .ix (e.g. label based with an int index). It works the same.

@jorisvandenbossche
Copy link
Member

But for [] slicing is 'integer-location' based (an exception on the other types of slicing), so this should follow the semantics of iloc in this case, disallowing floats.

@jorisvandenbossche
Copy link
Member

Example:

In [46]: s = Series(range(4), index=[-1,0,1,2]) In [47]: s Out[47]: -1 0 0 1 1 2 2 3 dtype: int64 In [48]: s[0:2] # <--- slicing = integer location based Out[48]: -1 0 0 1 dtype: int64 In [49]: s[0] # <--- single label = label based Out[49]: 1 
@jreback
Copy link
Contributor Author

jreback commented Mar 2, 2015

ok, is updated, any more comments
@jorisvandenbossche @shoyer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic here seems OK, but a little confusing in context. To me it would make more sense to put it on the subclass _maybe_cast_slice_bound methods, even if that means a little more duplication. Otherwise, things will break, e.g., as soon as we add IntervalIndex.

@shoyer
Copy link
Member

shoyer commented Mar 3, 2015

Generally seems fine by me.

@jreback
Copy link
Contributor Author

jreback commented Mar 4, 2015

indexing is such a rabbit hole :)

I cleaned up _convert_slice_indexer a bit. I tried to use _maybe_cast_slice_bound directly there (I added the kind argument which is the type of indexing, e.g. ix/loc/iloc/getitem). But so many cases are handled that its really hard to get this right. Because the different kinds are handled in different places

its tricky because you have the kinds of indexers (ix/loc/iloc/getitem) x index_types (Index,Float,Int,Datetime....) x kinds of indexing (slice/scalar)

So its better (meaning slightly cleaner), with the semantics we discussed above.

rename typ -> kind for _convert_*_indexer add kind argument to _maybe_cast_slice_bound cleaned up _convert_slice_indexer a bit
jreback added a commit that referenced this pull request Mar 4, 2015
API: consistency with .ix and .loc for getitem operations (GH8613)
@jreback jreback merged commit 8992129 into pandas-dev:master Mar 4, 2015
@jorisvandenbossche
Copy link
Member

Thanks a lot for this!

@shoyer
Copy link
Member

shoyer commented Mar 4, 2015

Indeed, greatly appreciated!

@immerrr
Copy link
Contributor

immerrr commented Mar 13, 2015

That's great that you guys (and especially @jreback) actually went through this 👍

@jreback
Copy link
Contributor Author

jreback commented Mar 13, 2015

trying to make indexing slightly less of a rabbit hole :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

API Design Bug Indexing Related to indexing on series/frames, not to indexes themselves

4 participants