
I have a Pandas DataFrame indexed by date. There are a number of columns, but many of them are only populated for part of the time series. I'd like to find where the first and last non-NaN values are located so that I can extract the dates and see how long the time series is for a particular column.

Could somebody point me in the right direction as to how I could go about doing something like this?

2 Comments

  • first_valid_index and last_valid_index – behzad.nouri
  • Is there a solution for this when the missing values can be "0" (i.e. find the first non-zero value per group/time series)? One possible approach is sketched just below this list.
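For the zero-valued variant asked about above, one sketch (not from the thread) is to mask zeros as NaN first and then reuse the same machinery; the series below is invented for illustration:

import pandas as pd
import numpy as np

# Hypothetical series where missing values are encoded as 0 rather than NaN.
s = pd.Series([0, 0, 5, 7, 0], index=pd.date_range('2014-01-01', periods=5))

# Mask the zeros, then the usual NaN-based tools apply.
s.replace(0, np.nan).first_valid_index()
# Timestamp('2014-01-03 00:00:00')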

3 Answers


@behzad.nouri's solution worked perfectly to return the first and last non-NaN values using Series.first_valid_index and Series.last_valid_index, respectively.
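As a minimal sketch of what that looks like (the frame and column names here are invented, not taken from the question):

import pandas as pd
import numpy as np

# Hypothetical date-indexed frame where column 'B' is only partially populated.
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                   'B': [np.nan, 10.0, 20.0, np.nan]},
                  index=pd.date_range('2014-01-01', periods=4))

first = df['B'].first_valid_index()  # Timestamp('2014-01-02 00:00:00')
last = df['B'].last_valid_index()    # Timestamp('2014-01-03 00:00:00')

# Because the index holds dates, the length of the populated stretch
# falls straight out as a Timedelta.
last - first
# Timedelta('1 days 00:00:00')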


2 Comments

@KorayTugay A pandas Series is 1D (i.e. a single column). If you want to check more columns in a df you can iterate over it, e.g. for col_name, data in df.items(): print("First valid index for column {} is at {}".format(col_name, data.first_valid_index()))
Instead of iterating over the columns of a DataFrame, you can use df.apply(Series.first_valid_index).

Here are some helpful examples.

Series

import pandas as pd
import numpy as np

s = pd.Series([np.nan, 1, np.nan, 3, np.nan], index=list('abcde'))
s

a    NaN
b    1.0
c    NaN
d    3.0
e    NaN
dtype: float64

# first valid index
s.first_valid_index()
# 'b'

# first valid position
s.index.get_loc(s.first_valid_index())
# 1

# last valid index
s.last_valid_index()
# 'd'

# last valid position
s.index.get_loc(s.last_valid_index())
# 3

Alternative solution using notna and idxmax:

# first valid index
s.notna().idxmax()
# 'b'

# last valid index
s.notna()[::-1].idxmax()
# 'd'

DataFrame

df = pd.DataFrame({
    'A': [np.nan, 1, np.nan, 3, np.nan],
    'B': [1, np.nan, np.nan, np.nan, np.nan]})
df

     A    B
0  NaN  1.0
1  1.0  NaN
2  NaN  NaN
3  3.0  NaN
4  NaN  NaN

first_valid_index and last_valid_index operate on a whole DataFrame at once rather than per column, but you can apply them to each column using apply.

# first valid index for each column
df.apply(pd.Series.first_valid_index)

A    1
B    0
dtype: int64

# last valid index for each column
df.apply(pd.Series.last_valid_index)

A    3
B    0
dtype: int64

As before, you can also use notna and idxmax. This is slightly more natural syntax.

# first valid index
df.notna().idxmax()

A    1
B    0
dtype: int64

# last valid index
df.notna()[::-1].idxmax()

A    3
B    0
dtype: int64
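To get back to the original question (how long the populated stretch of each column is), the first and last valid indices can be combined; a sketch, reusing the df defined above:

first = df.apply(pd.Series.first_valid_index)
last = df.apply(pd.Series.last_valid_index)

# Number of index positions spanned by each column's valid stretch
# (with a date index this would come out as a Timedelta instead).
last - first + 1

# A    3
# B    1
# dtype: int64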

3 Comments

The issue with idxmax() is that it will return 0 for a full-NaN column. I would expect a NaN in that case, so I would rather always use .apply(Series.first_valid_index).
For the whole dataframe, df.apply(pd.Series.first_valid_index).max() gives the first index by which every column has started reporting values.
For everyone who came here because of the same question in xarray (like me): the "alternative solution" works, but the functions are named differently in xarray. Use da.notnull().idxmax(dim_name) to get the first valid position, where da is a DataArray and dim_name the name of the dimension along which to search for NaN values. For the last valid position, add .isel(dim=slice(None, None, -1)). A runnable version is sketched just below.
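Expanding that xarray comment into a runnable sketch (the array and dimension name are invented; da and dim_name follow the comment's notation):

import numpy as np
import xarray as xr

da = xr.DataArray([np.nan, 1.0, np.nan, 3.0, np.nan], dims='time',
                  coords={'time': [0, 1, 2, 3, 4]})

# First valid label along 'time': NaN-mask, then take the first True.
da.notnull().idxmax('time')   # time = 1

# Last valid label: reverse along 'time' first, then do the same.
da.isel(time=slice(None, None, -1)).notnull().idxmax('time')   # time = 3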

A convenience function based on behzad.nouri's comment and cs95's earlier answer. Any errors or misunderstandings are mine.

import pandas as pd
import numpy as np

df = pd.DataFrame([["2022-01-01", np.nan, np.nan, 1],
                   ["2022-01-02", 2, np.nan, 2],
                   ["2022-01-03", 3, 3, 3],
                   ["2022-01-04", 4, 4, 4],
                   ["2022-01-05", np.nan, 5, 5]],
                  columns=['date', 'A', 'B', 'C'])
df['date'] = pd.to_datetime(df['date'])
df

#         date    A    B  C
# 0 2022-01-01  NaN  NaN  1
# 1 2022-01-02  2.0  NaN  2
# 2 2022-01-03  3.0  3.0  3
# 3 2022-01-04  4.0  4.0  4
# 4 2022-01-05  NaN  5.0  5

We want to start at the earliest date common to A and B and end at the latest date common to A and B (for whatever reason, we do not filter by column C).

# filter data to minimum/maximum common available dates
def get_date_range(df, cols):
    """Return a tuple of the earliest and latest dates at which
    every column in `cols` has valid data."""
    a = df[cols].apply(pd.Series.first_valid_index).max()
    b = df[cols].apply(pd.Series.last_valid_index).min()
    return (df.loc[a, 'date'], df.loc[b, 'date'])

a, b = get_date_range(df, cols=['A', 'B'])
a
# Timestamp('2022-01-03 00:00:00')
b
# Timestamp('2022-01-04 00:00:00')

Now filter the data:

df.loc[(df.date >= a) & (df.date <= b)]

#         date    A    B  C
# 2 2022-01-03  3.0  3.0  3
# 3 2022-01-04  4.0  4.0  4
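Equivalently, once 'date' is made the (sorted) index, label-based .loc slicing is inclusive on both ends, so the same filter can be written as a slice:

df.set_index('date').loc[a:b]

#             A    B  C
# date
# 2022-01-03  3.0  3.0  3
# 2022-01-04  4.0  4.0  4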

