1477

I have this DataFrame and want only the records whose EPS column is not NaN:

                 STK_ID  EPS  cash
STK_ID RPT_Date
601166 20111231  601166  NaN   NaN
600036 20111231  600036  NaN    12
600016 20111231  600016  4.3   NaN
601009 20111231  601009  NaN   NaN
601939 20111231  601939  2.5   NaN
000001 20111231  000001  NaN   NaN

...i.e. something like df.drop(....) to get this resulting dataframe:

                 STK_ID  EPS  cash
STK_ID RPT_Date
600016 20111231  600016  4.3   NaN
601939 20111231  601939  2.5   NaN

How do I do that?

4
  • 29
    dropna: pandas.pydata.org/pandas-docs/stable/generated/… Commented Nov 16, 2012 at 9:29
  • 287
    df.dropna(subset = ['column1_name', 'column2_name', 'column3_name']) Commented Sep 5, 2014 at 23:53
  • 7
    df.dropna(subset = ['EPS']) Commented Dec 29, 2021 at 8:48
  • 3
    Another ruthless way, if you hate NaN so much: df = df.dropna(subset=df.columns.values), and you'll find there are no NaNs anywhere. Commented Oct 1, 2022 at 18:55

17 Answers

1649

Don't drop, just take the rows where EPS is not NA:

df = df[df['EPS'].notna()] 

5 Comments

Is there any advantage to indexing and copying over dropping?
@wes-mckinney could you please let me know if dropna() is a better choice over pandas.notnull in this case? If so, why?
This does not catch line 3 where EPS is 4.3 (valid) and cash is NaN. I expect OP to want to drop that one too.
we can also use df.dropna(subset=['EPS'])
dropna is actually faster if there are multiple columns.
1224

This question is already resolved, but...

...also consider the solution suggested by Wouter in his original comment. The ability to handle missing data, including dropna(), is built into pandas explicitly. Aside from potentially improved performance over doing it manually, these functions also come with a variety of options which may be useful.

In [24]: df = pd.DataFrame(np.random.randn(10,3))

In [25]: df.iloc[::2,0] = np.nan; df.iloc[::4,1] = np.nan; df.iloc[::3,2] = np.nan;

In [26]: df
Out[26]:
          0         1         2
0       NaN       NaN       NaN
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN

In [27]: df.dropna()  # drop all rows that have any NaN values
Out[27]:
          0         1         2
1  2.677677 -1.466923 -0.750366
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295

In [28]: df.dropna(how='all')  # drop only if ALL columns are NaN
Out[28]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN

In [29]: df.dropna(thresh=2)  # drop a row if it does not have at least two values that are **not** NaN
Out[29]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295
9 -0.310130  0.078891       NaN

In [30]: df.dropna(subset=[1])  # drop only if NaN in a specific column (as asked in the question)
Out[30]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
9 -0.310130  0.078891       NaN

There are also other options (see the docs at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html), including dropping columns instead of rows.
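For instance, a minimal sketch of the column-wise variant, using the same df as above (both comments follow from the output of In [26]):

df.dropna(axis=1)             # drops every column here, since each contains at least one NaN
df.dropna(axis=1, how='all')  # drops nothing here, since no column is entirely NaN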

Pretty handy!

6 Comments

you can also use df.dropna(subset = ['column_name']). Hope that saves at least one person the extra 5 seconds of 'what am I doing wrong'. Great answer, +1
@JamesTobin, I just spent 20 minutes writing a function for that! The official documentation was very cryptic: "Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include". I was unable to understand what they meant...
df.dropna(subset = ['column_name']) is exactly what I was looking for! Thanks!
This answer is super helpful but in case it isn't clear to anyone reading what options are useful in which situations, I've put together a dropna FAQ post here. Hope this helps people who are struggling to apply dropna to their specific need.
+1 this answer also seems to help avoid having SettingWithCopyWarning later when you use df.dropna(subset = ['column_name'], inplace=True)
167

You can use this:

df.dropna(subset=['EPS'], how='all', inplace=True) 

3 Comments

how='all' is redundant here, because you are subsetting the dataframe with only one field, so both 'all' and 'any' will have the same effect.
@AntonProtopopov IMPORTANT: how='all' is NOT redundant. Define a simple dataframe: df = pd.DataFrame({"a": [10, None], "b": [None, 10]}). Doing df.dropna(subset=['a', 'b'], how='all') leaves the dataframe intact (as there aren't rows where both columns are NaN), while dropping that parameter returns an empty dataframe.
@EnriqueOrtizCasillas we were talking about that specific case. In the comment I mentioned that it's only about one field. For that 'all' and 'any' are the same. In general case it depends on what is your ultimate goal. In your example you are selecting by two columns - that's a different case.
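To make the thread above concrete, here is a small sketch of both cases, using the two-column frame from the comment:

import pandas as pd

df = pd.DataFrame({"a": [10, None], "b": [None, 10]})

# Single-column subset: 'any' and 'all' agree (both drop the second row).
df.dropna(subset=['a'], how='any')
df.dropna(subset=['a'], how='all')

# Two-column subset: 'all' keeps both rows (no row is NaN in both columns),
# while the default 'any' drops both rows and returns an empty frame.
df.dropna(subset=['a', 'b'], how='all')
df.dropna(subset=['a', 'b'], how='any')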
150

I know this has already been answered, but just for the sake of a purely pandas solution to this specific question as opposed to the general description from Aman (which was wonderful) and in case anyone else happens upon this:

import pandas as pd

df = df[pd.notnull(df['EPS'])]

5 Comments

Actually, the specific answer would be: df.dropna(subset=['EPS']) (based on the general description from Aman; of course this also works)
notnull is also what Wes (author of Pandas) suggested in his comment on another answer.
This may be a noob question, but when I do df[pd.notnull(...)] or df.dropna, the index gets dropped. So if there was a null value at row index 10 in a df of length 200, the dataframe after running the drop function has index values from 1 to 9 and then 11 to 200. Is there any way to "re-index" it?
you could also do df[pd.notnull(df[df.columns[INDEX]])] where INDEX would be the numbered column if you don't know name
For some reason this answer worked for me and df.dropna(subset=['column name']) didn't.
69

How to drop rows of Pandas DataFrame whose value in a certain column is NaN

This is an old question which has been beaten to death but I do believe there is some more useful information to be surfaced on this thread. Read on if you're looking for the answer to any of the following questions:

  • Can I drop rows if any of their values are NaN? What about if all of them are NaN?
  • Can I only look at NaNs in specific columns when dropping rows?
  • Can I drop rows with a specific count of NaN values?
  • How do I drop columns instead of rows?
  • I tried all of the options above but my DataFrame just won't update!

DataFrame.dropna: Usage and Examples

It's already been said that df.dropna is the canonical method to drop NaNs from DataFrames, but there's nothing like a few visual cues to help along the way.

# Setup
df = pd.DataFrame({
    'A': [np.nan, 2, 3, 4],
    'B': [np.nan, np.nan, 2, 3],
    'C': [np.nan]*3 + [3]})

df

     A    B    C
0  NaN  NaN  NaN
1  2.0  NaN  NaN
2  3.0  2.0  NaN
3  4.0  3.0  3.0

Below is a breakdown of the most important arguments and how they work, arranged in an FAQ format.


Can I drop rows if any of their values are NaN? What about if all of them are NaN?

This is where the how=... argument comes in handy. It can be one of

  • 'any' (default) - drops rows if at least one column has NaN
  • 'all' - drops rows only if all of its columns have NaNs


# Removes all but the last row, the only one with no NaNs
df.dropna()

     A    B    C
3  4.0  3.0  3.0

# Removes the first row only
df.dropna(how='all')

     A    B    C
1  2.0  NaN  NaN
2  3.0  2.0  NaN
3  4.0  3.0  3.0

Note
If you just want to see which rows are null (IOW, if you want a boolean mask of rows), use isna:

df.isna()

       A      B      C
0   True   True   True
1  False   True   True
2  False  False   True
3  False  False  False

df.isna().any(axis=1)

0     True
1     True
2     True
3    False
dtype: bool

To get the inversion of this result, use notna instead.
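For instance, a quick sketch with the same df, using notna to build a mask of fully non-null rows:

df.notna().all(axis=1)

0    False
1    False
2    False
3     True
dtype: bool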


Can I only look at NaNs in specific columns when dropping rows?

This is a use case for the subset=[...] argument.

Specify a list of columns (or indexes with axis=1) to tell pandas you only want to look at these columns (or rows with axis=1) when dropping rows (or columns with axis=1).

# Drop all rows with NaNs in A
df.dropna(subset=['A'])

     A    B    C
1  2.0  NaN  NaN
2  3.0  2.0  NaN
3  4.0  3.0  3.0

# Drop all rows with NaNs in A OR B
df.dropna(subset=['A', 'B'])

     A    B    C
2  3.0  2.0  NaN
3  4.0  3.0  3.0

Can I drop rows with a specific count of NaN values?

This is a use case for the thresh=... argument. Specify the minimum number of NON-NULL values as an integer.

df.dropna(thresh=1)

     A    B    C
1  2.0  NaN  NaN
2  3.0  2.0  NaN
3  4.0  3.0  3.0

df.dropna(thresh=2)

     A    B    C
2  3.0  2.0  NaN
3  4.0  3.0  3.0

df.dropna(thresh=3)

     A    B    C
3  4.0  3.0  3.0

The thing to note here is you need to specify how many NON-NULL values you want to keep, rather than how many NULL values you want to drop. This is a pain point for new users.

Luckily the fix is easy: to drop rows with at least some number of NULL values, subtract that count from the column size and add one to get the correct thresh argument for the function.

required_min_null_values_to_drop = 2  # drop rows with at least 2 NaNs

df.dropna(thresh=df.shape[1] - required_min_null_values_to_drop + 1)

     A    B    C
2  3.0  2.0  NaN
3  4.0  3.0  3.0

How do I drop columns instead of rows?

Use the axis=... argument. It can be axis=0 or axis=1, and it tells the function whether you want to drop rows (axis=0) or drop columns (axis=1).

df.dropna()

     A    B    C
3  4.0  3.0  3.0

# Every column contains at least one NaN, so all of them are dropped
# and the result is empty.
df.dropna(axis=1)

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]

# Here's a different example requiring a column to be all NaN before
# it is dropped. In this case no columns satisfy the condition.
df.dropna(axis=1, how='all')

     A    B    C
0  NaN  NaN  NaN
1  2.0  NaN  NaN
2  3.0  2.0  NaN
3  4.0  3.0  3.0

# Here's a different example requiring a column to have at least 2 NON-NULL
# values. Column C has fewer than 2 NON-NULL values, so it should be dropped.
df.dropna(axis=1, thresh=2)

     A    B
0  NaN  NaN
1  2.0  NaN
2  3.0  2.0
3  4.0  3.0

I tried all of the options above but my DataFrame just won't update!

dropna, like most other functions in the pandas API, returns a new DataFrame (a copy of the original with changes) as the result, so you should assign it back if you want to see the changes.

df.dropna(...)                # wrong
df.dropna(..., inplace=True)  # right, but not recommended
df = df.dropna(...)           # right

Reference

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

DataFrame.dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)


Comments

43

Simplest of all solutions:

filtered_df = df[df['EPS'].notnull()] 

The above solution is preferable to using np.isfinite(), which only works on numeric data.
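As a brief sketch of why (the Series below is illustrative): np.isfinite() raises on non-numeric data, while notnull() works on any dtype:

import numpy as np
import pandas as pd

s = pd.Series(['a', None, 'b'])  # object (string) dtype
s.notnull()                      # True, False, True -- works fine
# np.isfinite(s)                 # raises TypeError on non-numeric data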

Comments

32

A simple and easy way:

df.dropna(subset=['EPS'], inplace=True)

source: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html

2 Comments

inplace=True is a bizarre topic, and has no effect on DataFrame.dropna(). See: github.com/pandas-dev/pandas/issues/16529
How does this answer differ from @Joe's answer? Also, inplace will be deprecated eventually; best not to use it at all.
26

You could use the DataFrame method notnull, the inverse of isnull, or numpy.isnan:

In [332]: df[df.EPS.notnull()]
Out[332]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN

In [334]: df[~df.EPS.isnull()]
Out[334]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN

In [347]: df[~np.isnan(df.EPS)]
Out[347]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN

Comments

15

Yet another solution, which uses the fact that np.nan != np.nan (NaN is the only value that does not compare equal to itself):

In [149]: df.query("EPS == EPS")
Out[149]:
                 STK_ID  EPS  cash
STK_ID RPT_Date
600016 20111231  600016  4.3   NaN
601939 20111231  601939  2.5   NaN

Comments

4

Another version:

df[~df['EPS'].isna()] 

1 Comment

Why use this over Series.notna()?
4

The following method worked for me; it may help if none of the above methods work:

df[df['column_name'].str.len() >= 1]

The basic idea is that you keep a record only if its string length is at least 1. This is especially useful if you are dealing with string data.
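As a sketch of why this works (the Series is illustrative): .str.len() returns NaN for missing values, and any comparison against NaN evaluates to False, so those rows are filtered out. Note that it also drops empty strings, whose length is 0:

import pandas as pd

s = pd.Series(['abc', None, ''])
s.str.len()        # 3.0, NaN, 0.0
s.str.len() >= 1   # True, False, False -- both NaN and '' are dropped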

Best!

1 Comment

This only works for object columns: you get AttributeError: Can only use .str accessor with string values! if your column is float or int.
3

It may be added that '&' can be used to chain additional conditions, e.g.

df = df[(df.EPS > 2.0) & (df.EPS < 4.0)]

Notice that when evaluating the statements, pandas needs parentheses.

1 Comment

Sorry, but the OP wants something else. Btw, your code is wrong; it returns ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). You need to add parentheses - df = df[(df.EPS > 2.0) & (df.EPS < 4.0)] - but also it is not an answer to this question.
3

You can also use notna inside query:

In [4]: df.query('EPS.notna().values')
Out[4]:
                 STK_ID.1  EPS  cash
STK_ID RPT_Date
600016 20111231    600016  4.3   NaN
601939 20111231    601939  2.5   NaN

Comments

2

In datasets with a large number of columns, it's even better to see how many columns contain null values and how many don't.

print("No. of columns containing null values") print(len(df.columns[df.isna().any()])) print("No. of columns not containing null values") print(len(df.columns[df.notna().all()])) print("Total no. of columns in the dataframe") print(len(df.columns)) 

For example, my dataframe contained 82 columns, of which 19 contained at least one null value.

Further, you can also automatically remove columns and rows depending on which has more null values.
Here is the code which does this intelligently:

df = df.drop(df.columns[df.isna().sum() > len(df.columns)], axis=1)
df = df.dropna(axis=0).reset_index(drop=True)

Note: the above code removes all of your null values. If you need the null values, process them beforehand.

2 Comments

There is Another Question link
This question has really been squeezed out of questioning, get it? :)
2

Those who want to make dropna part of a feature-engineering / scikit-learn pipeline can use DropMissingData from Feature-engine.

The following will drop all rows with NaN in a dataframe:

import pandas as pd
import numpy as np
from feature_engine.imputation import DropMissingData

X = pd.DataFrame(dict(
    x1=[np.nan, 1, 1, 0, np.nan],
    x2=["a", np.nan, "b", np.nan, "a"],
))

dmd = DropMissingData()
dmd.fit(X)
dmd.transform(X)

The result of the block above is:

    x1 x2
2  1.0  b

To drop rows with nan only in a specific column, for example x2:

dmd = DropMissingData(variables="x2")
dmd.fit(X)
dmd.transform(X)

This returns the following:

    x1 x2
0  NaN  a
2  1.0  b
4  NaN  a

Finally, from within a Pipeline:

from sklearn.linear_model import Lasso
from sklearn.preprocessing import OrdinalEncoder
from feature_engine.imputation import DropMissingData
from feature_engine.pipeline import Pipeline

pipe = Pipeline(
    [
        ("drop", DropMissingData()),
        ("enc", OrdinalEncoder()),
        ("lasso", Lasso(random_state=10)),
    ]
).set_output(transform="pandas")

pipe.fit(X, y)
preds_pipe = pipe.predict(X)

More details in Feature-engine's dropna documentation

Comments

0

dropna vs boolean indexing

If we look at the source code, under the hood, dropna() is precisely notna() + boolean indexing. Depending on what was passed to how=, all() or any() is called to reduce the notna mask into a Series.

The main difference is that with dropna(), you specify the rows to drop, while with boolean indexing, you specify the rows to keep, which is logically the opposite problem. So depending on the use case, it might be more intuitive to approach the problem of dropping rows with NaN values from the perspective of keeping non-NaN rows or dropping NaN rows.

To sum up, the following are True for any dataframe df:

df = pd.DataFrame({"A": [1, 2, pd.NA], "B": [pd.NA, 'a', 'b'], "C": [pd.NA, 10, 20]}) cols = ['A', 'B'] x1 = df.dropna(subset=cols, how='any') # specify which rows to drop y1 = df[df[cols].notna().all(axis=1)] # specify which rows to keep assert x1.equals(y1) x2 = df.dropna(subset=cols, how='all') y2 = df[df[cols].notna().any(axis=1)] assert x2.equals(y2) 

Also, the thresh= argument is equivalent to checking that the number of non-NaN values in each row is not less than the thresh value; in other words, the following is True:

thresh = 2
x3 = df[df[cols].count(axis=1) >= thresh]
y3 = df.dropna(subset=cols, thresh=thresh)
assert x3.equals(y3)

Now, if the task is to simply drop rows with NaN values, then dropna() is most intuitive and should be used. However, since mask + boolean indexing is more general, you can define a more complex mask and filter using it.

For example, say you want to drop rows where either the column A value is NaN or there is more than one NaN value. This requires two function calls using dropna. However, with boolean indexing, you can filter using a single mask.

msk = (df.isna().sum(axis=1) > 1) | df['A'].isna()
df = df[~msk]

On a side note, if you get SettingWithCopyWarning when you modify a dataframe constructed via boolean indexing, consider setting copy-on-write mode to True (read more about it here).

pd.set_option('mode.copy_on_write', True)  # turn on copy-on-write

msk = (df.isna().sum(axis=1) > 1) | df['A'].isna()
df1 = df[~msk]
df1['new_col'] = 1  # <--- no SettingWithCopyWarning

Comments

-4

You can try:

df['EPS'].dropna() 

Comments
