@Ircama Ircama commented Jan 16, 2020

Fix DataFrame.info() range summary for DatetimeIndex

This PR aims to provide a better description of DatetimeIndex within DataFrame.info when the index is unsorted, by replacing the displayed first and last element values with "min date" to "max date" values.

Fixes #31116

This PR fixes range display errors that might occur when [DataFrame.info](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) describes a [DatetimeIndex](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html) of big dataframes: depending on the dataframe structure, related *"min date" to "max date"* values can be wrong.

Example of output produced by DataFrame.info():

    import pandas as pd
    idx = pd.date_range('2020-01-15 01:03:05', periods=1000004, freq='ms')
    df = pd.DataFrame({'count': range(len(idx))}, index=idx)
    df
    df.info()

df output:

                               count
    2020-01-15 01:03:05.000        0
    2020-01-15 01:03:05.001        1
    2020-01-15 01:03:05.002        2
    2020-01-15 01:03:05.003        3
    2020-01-15 01:03:05.004        4
    ...                          ...
    2020-01-15 01:19:44.999   999999
    2020-01-15 01:19:45.000  1000000
    2020-01-15 01:19:45.001  1000001
    2020-01-15 01:19:45.002  1000002
    2020-01-15 01:19:45.003  1000003

df.info() output:

    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 1000004 entries, 2020-01-15 01:03:05 to 2020-01-15 01:19:45.003000
    Freq: L
    Data columns (total 1 columns):
    count    1000004 non-null int64
    dtypes: int64(1)
    memory usage: 15.3 MB

This PR fixes possible errors in the summary line "DatetimeIndex: ... entries, date_a to date_b", where date_a and date_b should show the minimum and maximum dates of the index, respectively.
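A minimal sketch of the case this description has in mind, assuming a small hand-built frame with a deliberately unsorted index (the three dates are illustrative and not taken from the original report). It compares the first and last index elements with the minimum and maximum:

```python
import pandas as pd

# Hypothetical three-row frame whose DatetimeIndex is deliberately unsorted.
idx = pd.DatetimeIndex(["2020-01-03", "2020-01-01", "2020-01-02"])
df = pd.DataFrame({"count": range(len(idx))}, index=idx)

# Positional ends of the index versus its actual date range.
first, last = df.index[0], df.index[-1]
lowest, highest = df.index.min(), df.index.max()

print("first/last:", first, "to", last)
print("min/max:   ", lowest, "to", highest)

# The summary line of df.info() can be compared against the two pairs above.
df.info()
```

On a sorted index the two pairs coincide, which is why the example in this description does not show a difference.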

@MarcoGorelli
Member

Thanks for your PR, and welcome :)

I'm not sure I see the issue in your example - df.info() shows

2020-01-15 01:03:05 to 2020-01-15 01:19:45.003000 

but isn't that what we were expecting to see?

Could you please raise an issue first, in which you show an example of where the output is wrong, along with your expected output?

@WillAyd
Member

WillAyd commented Jan 16, 2020

Please add test(s) for the issue you are trying to fix - we typically ask for these as the first part of any PR

@WillAyd WillAyd left a comment

note above comment on tests

@WillAyd WillAyd added the "Index" label (Related to the Index class or subclasses) Jan 16, 2020
@Ircama
Author

Ircama commented Jan 18, 2020

Ok, thanks. I'll first create an issue and relate it to this PR.

**Fix `DataFrame.info()` range summary for DatetimeIndex** This PR aims to provide a better description of [DatetimeIndex](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html) within [DataFrame.info](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) when the index is unsorted, by replacing the displayed first and last element values with "min date" to "max date" values.
@Ircama Ircama force-pushed the ircama-dtl-summary branch from 5c7b0db to 5c109f6 Compare January 18, 2020 09:28
@Ircama Ircama requested a review from WillAyd January 18, 2020 09:37
@MarcoGorelli
Member

MarcoGorelli commented Jan 18, 2020

Thanks for opening the issue.

However, before this PR can be accepted, you need to write tests / make sure you don't break existing ones - see contributing to the code base:

pandas is serious about testing and strongly encourages contributors to embrace test-driven development (TDD). This development process “relies on the repetition of a very short development cycle: first the developer writes an (initially failing) automated test case that defines a desired improvement or new function, then produces the minimum amount of code to pass that test.” So, before actually writing any code, you should write your tests. Often the test can be taken from the original GitHub issue. However, it is always worth considering additional use cases and writing corresponding tests.

Adding tests is one of the most common requests after code is pushed to pandas. Therefore, it is worth getting in the habit of writing tests ahead of time so this is never an issue.
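As an illustration of what such a test could look like here, the sketch below captures the `df.info()` output through its `buf` parameter and checks the DatetimeIndex summary line. The helper `info_summary_line`, the test name, and the three dates are hypothetical, and the assertions pin down the current first-to-last behaviour; a test for the change proposed in this PR would assert the minimum and maximum dates instead.

```python
import io

import pandas as pd


def info_summary_line(df):
    """Return the index summary line written by DataFrame.info() (hypothetical helper)."""
    buf = io.StringIO()
    df.info(buf=buf)
    return next(
        line for line in buf.getvalue().splitlines()
        if line.startswith("DatetimeIndex")
    )


def test_info_summary_uses_first_and_last():
    # Unsorted index: first is 2020-01-03, last is 2020-01-02, min is 2020-01-01.
    idx = pd.DatetimeIndex(["2020-01-03", "2020-01-01", "2020-01-02"])
    df = pd.DataFrame({"count": range(len(idx))}, index=idx)

    summary = info_summary_line(df)

    # Current behaviour: the summary shows the first and last elements,
    # so the minimum date does not appear in it.
    assert "2020-01-03" in summary
    assert "2020-01-02" in summary
    assert "2020-01-01" not in summary
```

The assertions only check which dates appear in the line, not its exact formatting, so the test does not depend on how timestamps are rendered.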

@MarcoGorelli
Member

@WillAyd just checking - is this something pandas would like to support?

If so, @Ircama, is this something you're still working on?

@MarcoGorelli
Member

On second thought, I think I might prefer the current behaviour. Otherwise, if I ran df.info() and saw the min and max values instead of the first and last ones, I might get the (possibly false) impression that my index was already sorted.

@Ircama
Author

Ircama commented Feb 5, 2020

> On second thought, I think I might prefer the current behaviour.

Yes, I did some further analysis and in the end I agree.

Even if the wording df.info() uses for DatetimeIndex entries might lead one to believe that the two values shown are the index range, they are actually the first and last elements of the DataFrame index.

Sorting the index is optional and the shown values could help the developer check whether sorting has been applied to the index.
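For an explicit check rather than reading it off the summary line, a minimal sketch using `Index.is_monotonic_increasing` and `DataFrame.sort_index` (the frame below is a hypothetical example, not data from the issue):

```python
import pandas as pd

# Hypothetical frame with an unsorted DatetimeIndex.
idx = pd.DatetimeIndex(["2020-01-03", "2020-01-01", "2020-01-02"])
df = pd.DataFrame({"count": range(len(idx))}, index=idx)

# False here: the index is not in ascending order.
sorted_before = df.index.is_monotonic_increasing

# sort_index() returns a new frame; the original is left untouched.
sorted_df = df.sort_index()
sorted_after = sorted_df.index.is_monotonic_increasing

print("sorted before:", sorted_before)
print("sorted after: ", sorted_after)
```

Once the index is sorted, its first and last elements are also its minimum and maximum, so the summary shown by df.info() then matches the date range.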

I will therefore close #31069 and #31116.

Thanks for the support so far.

@Ircama Ircama closed this Feb 5, 2020