Skip to content

Conversation

@mahepe
Copy link
Contributor

@mahepe mahepe commented May 22, 2019

With master at 6d2398a, issue #25775 seems resolved. Added validation tests.

@gfyoung gfyoung added Indexing Related to indexing on series/frames, not to indexes themselves Testing pandas testing functions or related to the test suite labels May 22, 2019
Copy link
Member

@jschendel jschendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! A couple comments on simplifying the implementation.

@pytest.mark.parametrize('seed', list(range(100)))
def test_sort_multi_index(self, seed):
# GH 25775
pd.np.random.seed(seed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a single deterministic example is probably easier to read and maintain.

I've found that the following example reproduces the issue on 0.23.0:

In [1]: import pandas as pd; pd.__version__ Out[1]: '0.23.0' In [2]: df = pd.DataFrame({'a': [0.4, 0.1, 0.3, 0.2], 'b': [0, 1, 1, 1], 'c': range(4), 'd': list('abcd')}) In [3]: df.set_index(list('abc')).sort_index(level=list('ba')) Out[3]: d a b c 0.4 0 0 a 0.1 1 1 b 0.3 1 2 c 0.2 1 3 d

And works on master:

In [1]: import pandas as pd; pd.__version__ Out[1]: '0.25.0.dev0+596.g20d0ad159a' In [2]: df = pd.DataFrame({'a': [0.4, 0.1, 0.3, 0.2], 'b': [0, 1, 1, 1], 'c': range(4), 'd': list('abcd')}) In [3]: df.set_index(list('abc')).sort_index(level=list('ba')) Out[3]: d a b c 0.4 0 0 a 0.1 1 1 b 0.2 1 3 d 0.3 1 2 c
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! I updated to a single example. I'm using the one given in #25775.

def test_sort_multi_index(self, seed):
# GH 25775
pd.np.random.seed(seed)
df1 = (pd.DataFrame(pd.np.random.rand(10, 5),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFrame is already imported, so you don't need the pd. prefix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, updated

@codecov
Copy link

codecov bot commented May 22, 2019

Codecov Report

Merging #26492 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@ Coverage Diff @@ ## master #26492 +/- ## ========================================== - Coverage 91.88% 91.87% -0.01%  ========================================== Files 174 174 Lines 50692 50692 ========================================== - Hits 46576 46571 -5  - Misses 4116 4121 +5
Flag Coverage Δ
#multiple 90.41% <ø> (ø) ⬆️
#single 41.78% <ø> (-0.11%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.81% <0%> (-0.11%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8d124ea...5eb692b. Read the comment docs.

@mahepe mahepe force-pushed the validation-test-for-sorting branch from 7efc4b8 to 58d8782 Compare May 25, 2019 06:16
@pep8speaks
Copy link

pep8speaks commented May 25, 2019

Hello @mahepe! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-03 18:32:17 UTC
@mahepe mahepe force-pushed the validation-test-for-sorting branch 4 times, most recently from a89216a to 2a3bf94 Compare May 26, 2019 12:37
assert_frame_equal(sorted_df, expected)

def test_sort_multi_index(self):
# GH 25775
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am not clear by this example what you are even trying to test here. can you add some comments on the test and make the example a little more fluid (maybe add some comments).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to the simple example by the other reviewer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here on what this is testing, example is nice

'5,0.639,1,0.944,0.521,0.414' + \
'6,0.264,7,0.456,0.568,0.018' + \
'7,0.617,6,0.616,0.943,0.681' + \
'8,0.359,4,0.697,0.060,0.666' + \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason we are using a fixed string here rather than a generated frame with a seed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other reviewer requested "a single deterministic example" in contrast to iterating through a range of seeds. I wasn't sure what they exactly meant by "deterministic" i.e. whether fixing a seed is good enough or should the dataframe be explicitly given

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I meant a single explicit DataFrame that minimally reproduces the error. Setting a random seed and the generating random floats had nothing to do with the actual error in question, so IMO it made it less obvious as to what's going on. It also resulted in there be a lot more data than necessary to reproduce the error, which makes it harder to follow what's being tested.

So basically something like:

df = DataFrame({'a': [3, 1, 2], 'b': [0, 0, 0], 'c': [0, 1, 2], 'd': list('abc')}) result = df.set_index(list('abc')).sort_index(level=list('ba')) expected = DataFrame({'a': [1, 2, 3], 'b': [0, 0, 0], 'c': [1, 2, 0], 'd': list('bca')}) expected = expected.set_index(list('abc')) tm.assert_frame_equal(result, expected)

Note that df as defined above does indeed not sort correctly on 0.23.4:

In [1]: import numpy as np; import pandas as pd; pd.__version__ Out[1]: '0.23.4' In [2]: df = pd.DataFrame({'a' : [3, 1, 2], 'b': [0, 0, 0,], 'c': [0, 1, 2], 'd': list('abc')}) In [3]: df.set_index(list('abc')).sort_index(level=list('ba')) Out[3]: d a b c 3 0 0 a 1 0 1 b 2 0 2 c

I suppose you could parametrize over level if you wanted to, as adding 'c' still results in invalid sorting:

In [4]: df.set_index(list('abc')).sort_index(level=list('bac')) Out[4]: d a b c 3 0 0 a 1 0 1 b 2 0 2 c
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jschendel, switched to your simpler example.

'9,0.670,2,0.128,0.315,0.363'
data = StringIO(s)
df1 = (pd.read_csv(data)
.assign(b=lambda df: (df.b * 10).astype(int)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these df1, df2 are not clear at all can you rename to something more descriptive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to the simple example by the other reviewer

@mahepe mahepe force-pushed the validation-test-for-sorting branch from 2a3bf94 to 1763b74 Compare May 29, 2019 04:47
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls merge master and ping on green.

assert_frame_equal(sorted_df, expected)

def test_sort_multi_index(self):
# GH 25775
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here on what this is testing, example is nice

@mahepe mahepe force-pushed the validation-test-for-sorting branch from 1763b74 to 5eb692b Compare June 3, 2019 18:32
@mahepe
Copy link
Contributor Author

mahepe commented Jun 4, 2019

all green @jreback

@jreback jreback added this to the 0.25.0 milestone Jun 5, 2019
@jreback jreback merged commit 8ef9a63 into pandas-dev:master Jun 5, 2019
@jreback
Copy link
Contributor

jreback commented Jun 5, 2019

thanks @mahepe

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Indexing Related to indexing on series/frames, not to indexes themselves Testing pandas testing functions or related to the test suite

5 participants