replacing NaN values in dataframe with pandas

Question

I want to create a function that takes a dataframe and replaces NaN with the mode in categorical columns, and replaces NaN in numerical columns with the mean of that column. If there are more than one mode in the categorical columns, then it should use the first mode.

I have managed to do it with following code:

def exercise4(df): df1 = df.select_dtypes(np.number) df2 = df.select_dtypes(exclude = 'float') mode = df2.mode() df3 = df1.fillna(df.mean()) df4 = df2.fillna(mode.iloc[0,:]) new_df = [df3,df4] df5 = pd.concat(new_df,axis=1) new_cols = list(df.columns) df6 = df5[new_cols] return df6

But i am sure there is a far easier method to do this?

Have you solved your problem?

Massifox
– Massifox

2019-10-26 11:31:51 +00:00
Commented Oct 26, 2019 at 11:31 — Massifox
– Massifox, Commented Oct 26, 2019 at 11:31

jezrael · Accepted Answer · 2019-10-26 08:27:35Z

You can use:

df = pd.DataFrame({ 'A':list('abcdec'), 'B':[4,5,4,5,5,4], 'C':[7,8,9,4,2,3], 'D':[1,3,5,7,1,0], 'E':list('bbcdeb'), }) df.iloc[[1,3], [1,2,0,4]] = np.nan print (df) A B C D E 0 a 4.0 7.0 1 b 1 NaN NaN NaN 3 NaN 2 c 4.0 9.0 5 c 3 NaN NaN NaN 7 NaN 4 e 5.0 2.0 1 e 5 c 4.0 3.0 0 b

Idea is use DataFrame.select_dtypes for non numeric columns with DataFrame.mode and select first row by DataFrame.iloc for positions, then count means - non numeric are expluded by default, so possible use Series.append for Series with all values for replacement passed to DataFrame.fillna:

modes = df.select_dtypes(exclude=np.number).mode().iloc[0] means = df.mean() both = modes.append(means) print (both) A c E b B 4.25 C 5.25 D 2.83333 dtype: object df.fillna(both, inplace=True) print (df) A B C D E 0 a 4.00 7.00 1 b 1 c 4.25 5.25 3 b 2 c 4.00 9.00 5 c 3 c 4.25 5.25 7 b 4 e 5.00 2.00 1 e 5 c 4.00 3.00 0 b

Passed to function with DataFrame.pipe:

def exercise4(df): modes = df.select_dtypes(exclude=np.number).mode().iloc[0] means = df.mean() both = modes.append(means) df.fillna(both, inplace=True) return df df = df.pipe(exercise4) #alternative #df = exercise4(df) print (df) A B C D E 0 a 4.00 7.00 1 b 1 c 4.25 5.25 3 b 2 c 4.00 9.00 5 c 3 c 4.25 5.25 7 b 4 e 5.00 2.00 1 e 5 c 4.00 3.00 0 b

Another idea is use DataFrame.apply, but is necessary result_type='expand' parameter with test dtypes by types.is_numeric_dtype:

from pandas.api.types import is_numeric_dtype f = lambda x: x.mean() if is_numeric_dtype(x.dtype) else x.mode()[0] df.fillna(df.apply(f, result_type='expand'), inplace=True) print (df) A B C D E 0 a 4.00 7.00 1 b 1 c 4.25 5.25 3 b 2 c 4.00 9.00 5 c 3 c 4.25 5.25 7 b 4 e 5.00 2.00 1 e 5 c 4.00 3.00 0 b

Passed to function:

from pandas.api.types import is_numeric_dtype def exercise4(df): f = lambda x: x.mean() if is_numeric_dtype(x.dtype) else x.mode()[0] df.fillna(df.apply(f, result_type='expand'), inplace=True) return df df = df.pipe(exercise4) #alternative #df = exercise4(df) print (df)

Massifox · Accepted Answer · 2019-10-25 22:24:47Z

You can use the _get_numeric_data() method to get the numeric columns (and consequently the categorical ones):

numerical_col = df._get_numeric_data().columns

At this point you only need one line of code using an apply function that runs through the columns:

fixed_df = df.apply(lambda col: col.fillna(col.mean()) if col.name in numerical_col else col.fillna(col.mode()[0]), axis=0)

gosuto · Accepted Answer · 2019-10-26 07:44:14Z

Actually you have all the ingredients already there! Some of your steps can be chained though making some others obsolete.

Looking at these two lines for example:

mode = df2.mode() df4 = df2.fillna(mode.iloc[0,:])

You could just replace them with df4 = df2.fillna(df2.mode().iloc[0,:]. Then instead of constantly reassigning new (sub)dataframes to variables, altering them and concatenating them you can make these alterations inplace, meaning they are applied directly to the dataframe in question. Lastly exclude='float' might work in your particular (example) case, but what if there are even more datatypes in the dataframe? A string column maybe?

My suggestion:

def mean_mode(df): df.select_dtypes(np.number).fillna(df.mean(), inplace=True) df.select_dtypes('category').fillna(df.mode()[0], inplace=True) return df

Code only answers are generally considered low quality. Could you add an explanation to your answer.

George Pipis · Accepted Answer · 2021-03-12 13:40:14Z

You can work as follows:

df = df.apply(lambda x: x.fillna(x.mode()[0]) if (x.dtypes==category) else x.fillna(x.mean()) )

Collectives™ on Stack Overflow

replacing NaN values in dataframe with pandas

4 Answers 4

Comments

Comments

1 Comment

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

Comments

Comments

1 Comment

Comments

Linked

Related