
I have a csv file. I read it:

```python
import pandas as pd

data = pd.read_csv('my_data.csv', sep=',')
data.head()
```

It has output like:

```
   id city department   sms  category
0  01  khi    revenue   NaN         0
1  02  lhr    revenue  good         1
2  03  lhr    revenue   NaN         0
```

I want to remove all the rows where the sms column is empty/NaN. What is an efficient way to do it?

  • I reopened the question because the OP needs the most efficient method. Commented Sep 7, 2017 at 9:17

2 Answers


Use `dropna` with the `subset` parameter to specify which column to check for NaNs:

```python
data = data.dropna(subset=['sms'])
print (data)

   id city department   sms  category
1   2  lhr    revenue  good         1
```

Another solution uses boolean indexing with `notnull`:

```python
data = data[data['sms'].notnull()]
print (data)

   id city department   sms  category
1   2  lhr    revenue  good         1
```

An alternative with `query`:

```python
print (data.query("sms == sms"))

   id city department   sms  category
1   2  lhr    revenue  good         1
```
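For reference, a minimal, self-contained sketch showing that all three approaches keep the same rows (the sample frame below is a reconstruction of the question's data, not the OP's actual CSV):

```python
import pandas as pd
import numpy as np

# reconstructed sample data mirroring the question's frame
data = pd.DataFrame({
    'id': [1, 2, 3],
    'city': ['khi', 'lhr', 'lhr'],
    'department': ['revenue', 'revenue', 'revenue'],
    'sms': [np.nan, 'good', np.nan],
    'category': [0, 1, 0],
})

a = data.dropna(subset=['sms'])      # drop rows where 'sms' is NaN
b = data[data['sms'].notnull()]      # boolean indexing on the same condition
c = data.query("sms == sms")         # NaN != NaN, so this filters NaNs too

# all three keep only the row with a non-missing sms value
print(a)
```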

Timings

```python
#[300000 rows x 5 columns]
data = pd.concat([data]*100000).reset_index(drop=True)

In [123]: %timeit (data.dropna(subset=['sms']))
100 loops, best of 3: 19.5 ms per loop

In [124]: %timeit (data[data['sms'].notnull()])
100 loops, best of 3: 13.8 ms per loop

In [125]: %timeit (data.query("sms == sms"))
10 loops, best of 3: 23.6 ms per loop
```

4 Comments

Please correct me if I am wrong: so the third one is the most efficient way?
@Danish - hmm, it is best to test on your own data, but I would use dropna. query is the slowest.
What is the logic behind the equal sign in the query one?
@Caterina - if you check the docs, and note that `np.nan == np.nan` returns `False`, it means that comparing the sms column with itself generates False for missing values and True for non-missing values.
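A minimal demonstration of that NaN self-comparison trick:

```python
import numpy as np
import pandas as pd

# NaN is defined as not equal to itself (IEEE 754), so x == x is False only for NaN
print(np.nan == np.nan)  # False

# applied element-wise to a Series, s == s is a mask that is False exactly at NaNs
s = pd.Series(['good', np.nan, 'bad'])
mask = s == s
print(mask.tolist())  # [True, False, True]
```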

You can use the method dropna for this:

```python
data.dropna(axis=0, subset=('sms', ))
```

See the documentation for more details on the parameters.

Of course there are multiple ways to do this, and there are some slight performance differences. Unless performance is critical, I would prefer the use of dropna() as it is the most expressive.

```python
import pandas as pd
import numpy as np

i = 10000000

# generate dataframe with a few columns
df = pd.DataFrame(dict(
    a_number=np.random.randint(0, 1e6, size=i),
    with_nans=np.random.choice([np.nan, 'good', 'bad', 'ok'], size=i),
    letter=np.random.choice(list('abcdefghijklmnop'), size=i))
)

# using notebook
%%timeit
a = df.dropna(subset=['with_nans'])
# 1.29 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# using notebook
%%timeit
b = df[~df.with_nans.isnull()]
# 890 ms ± 59.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# using notebook
%%timeit
c = df.query('with_nans == with_nans')
# 1.71 s ± 100 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
