-
- Notifications
You must be signed in to change notification settings - Fork 19.4k
API: DataFrame.to_csv formatting parameters for float indexes #11681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| And I've corrected the fact that the |
pandas/core/index.py Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is non-performant. What exactly is the issue here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hu, I realize my comment is not only not-clear, but also non-accurate
Let's consider the example in my tests:
df = DataFrame({'a': [0, np.NaN], 'b': [0, 1], 'c': [2, 3]}).set_index(['a', 'b']) The index is a multi-index, with the following levels:
Float64Index([0.0], dtype='float64', name=u'a') Int64Index([0, 1], dtype='int64', name=u'b') After the calls to _format_native_types, the variable levels contains:
['0.0'] ['0', '1'] and afterwards mi.values contain:
[('0.0', '0'), (nan, '1')] Here, nan is not a string "nan", but numpy.NaN (which is printed as the string nan)
Then, for that reason, in the tests I've introduced, the following test fails:
Traceback (most recent call last): File "/home/nicolas/Git/pandas/pandas/tests/test_format.py", line 2965, in test_to_csv_na_rep self.assertEqual(df.set_index(['a', 'b']).to_csv(na_rep='_'), expected) AssertionError: 'a,b,c\n0.0,0,2\nnan,1,3\n' != 'a,b,c\n0.0,0,2\n_,1,3\n' There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the way to fix this is set both the levels & labels; e.g. you append a new value to the level (which will represent the nans), then change the -1 in the labels to that value (the position of that value, e.g. 1 in this case). This will then reformat the MultiIndex to work correctly.
You can do this in the MultiIndex constructor of _format_native_types
In [20]: df.index.set_levels([0.0,'_'],level=0).set_labels([0,1],level=0).values Out[20]: array([(0.0, 0), ('_', 1)], dtype=object) | @jreback all set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this is basically identical to core/internals/FloatBlock/to_native_types. let's pull both those out and put it in a function in core/format.py/FloatArrayFormatter and call it .get_formatted_data(). See how that works out. These are the routines for screen printing (which are necessarily different from to_csv / index formatting).
| ok, looks pretty good. I think we can take this opportunity to re-factor a bit as I have noted above. |
| @nbonnotte if you can update / refactor as above would be great |
| I will, just had not the time yet. If you'd like this to be done quickly, because of the schedule for the 0.18.0 release, let me know. |
| no, just pinging :) |
| lmk when you can update |
| @jreback I just pushed the changes. Let me know if other changes are needed. Happy Holidays! |
pandas/tests/test_format.py Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank line here
| lgtm. some very minor formatting changes. in general like to have blank lines between different sub-tests and to format code nicely. ping when pushed and green. |
pandas/core/format.py Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a doc-string here describing what this is doing. Some comments in the code as well (to describe the purposes of the if clauses)
| If you and Travis agree, that should be it. I have added some comments as you suggested, to make the code clearer. But to be honest I'm not quite sure about how some bits of the code fit with the rest, for instance how and where the Also, I suppose |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, are there tests for quoting? if not can you add a couple. thxs.
| @nbonnotte code looks great. just see if we have some tests for the explanations are fine. Just want to have a note instead of a mass of code to explain a bit what is happening. as far as refactoring, you got what I was looking for (integration between ping when pushed / green (or if tests ok, lmk) |
| Yeah, the |
API: DataFrame.to_csv formatting parameters for float indexes
| thanks! |
Fix issue #11553
Two things:
I've created a
Float64Index._format_native_typesmethod which is a copy-paste ofFloatBlock.to_native_types. I would have preferred to call the latter directly, but I'm not sure what theplacementparameter of theFloatBlockconstructor means. I guess I doesn't really matters, since I could put whatever value and it should work (I think), and my hesitation a bit unfounded, but I don't know if it would be really a clean solution. Maybe someone can think of a more elegant way?Since a
Float64Indexcontaining only NaNs collapses when part of a multi-index, its NaNs values would not be converted usingna_rep, so I had to hack a solution. I put a comment in the relevant part. I'm not quite convinced myself of the elegance of the solution, though.What do you think?