I think it's also worth mention and explain the parameters configuration of fillna() like Method, Axis, Limit, etc.
From the documentation we have:
Series.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None) Fill NA/NaN values using the specified method.
Parameters
value [scalar, dict, Series, or DataFrame] Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list. method [{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None] Method to use for filling holes in reindexed Series pad / ffill: propagate last valid observation forward to next valid backfill / bfill: use next valid observation to fill gap axis [{0 or ‘index’}] Axis along which to fill missing values. inplace [bool, default False] If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame). limit [int,defaultNone] If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None. downcast [dict, default is None] A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).
Ok. Let's start with the method= Parameter this have forward fill (ffill) and backward fill(bfill) ffill is doing copying forward the previous non missing value.
e.g. :
import pandas as pd import numpy as np inp = [{'c1':10, 'c2':np.nan, 'c3':200}, {'c1':np.nan,'c2':110, 'c3':210}, {'c1':12,'c2':np.nan, 'c3':220},{'c1':12,'c2':130, 'c3':np.nan},{'c1':12,'c2':np.nan, 'c3':240}] df = pd.DataFrame(inp) c1 c2 c3 0 10.0 NaN 200.0 1 NaN 110.0 210.0 2 12.0 NaN 220.0 3 12.0 130.0 NaN 4 12.0 NaN 240.0
Forward fill:
df.fillna(method="ffill") c1 c2 c3 0 10.0 NaN 200.0 1 10.0 110.0 210.0 2 12.0 110.0 220.0 3 12.0 130.0 220.0 4 12.0 130.0 240.0
Backward fill:
df.fillna(method="bfill") c1 c2 c3 0 10.0 110.0 200.0 1 12.0 110.0 210.0 2 12.0 130.0 220.0 3 12.0 130.0 240.0 4 12.0 NaN 240.0
The Axis Parameter help us to choose the direction of the fill:
Fill directions:
ffill:
Axis = 1 Method = 'ffill' -----------> direction df.fillna(method="ffill", axis=1) c1 c2 c3 0 10.0 10.0 200.0 1 NaN 110.0 210.0 2 12.0 12.0 220.0 3 12.0 130.0 130.0 4 12.0 12.0 240.0 Axis = 0 # by default Method = 'ffill' | | # direction | V e.g: # This is the ffill default df.fillna(method="ffill", axis=0) c1 c2 c3 0 10.0 NaN 200.0 1 10.0 110.0 210.0 2 12.0 110.0 220.0 3 12.0 130.0 220.0 4 12.0 130.0 240.0
bfill:
axis= 0 method = 'bfill' ^ | | | df.fillna(method="bfill", axis=0) c1 c2 c3 0 10.0 110.0 200.0 1 12.0 110.0 210.0 2 12.0 130.0 220.0 3 12.0 130.0 240.0 4 12.0 NaN 240.0 axis = 1 method = 'bfill' <----------- df.fillna(method="bfill", axis=1) c1 c2 c3 0 10.0 200.0 200.0 1 110.0 110.0 210.0 2 12.0 220.0 220.0 3 12.0 130.0 NaN 4 12.0 240.0 240.0 # alias: # 'fill' == 'pad' # bfill == backfill
limit parameter:
df c1 c2 c3 0 10.0 NaN 200.0 1 NaN 110.0 210.0 2 12.0 NaN 220.0 3 12.0 130.0 NaN 4 12.0 NaN 240.0
Only replace the first NaN element across columns:
df.fillna(value = 'Unavailable', limit=1) c1 c2 c3 0 10.0 Unavailable 200.0 1 Unavailable 110.0 210.0 2 12.0 NaN 220.0 3 12.0 130.0 Unavailable 4 12.0 NaN 240.0 df.fillna(value = 'Unavailable', limit=2) c1 c2 c3 0 10.0 Unavailable 200.0 1 Unavailable 110.0 210.0 2 12.0 Unavailable 220.0 3 12.0 130.0 Unavailable 4 12.0 NaN 240.0
downcast parameter:
df.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 5 entries, 0 to 4 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 c1 4 non-null float64 1 c2 2 non-null float64 2 c3 4 non-null float64 dtypes: float64(3) memory usage: 248.0 bytes df.fillna(method="ffill",downcast='infer').info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 5 entries, 0 to 4 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 c1 5 non-null int64 1 c2 4 non-null float64 2 c3 5 non-null int64 dtypes: float64(1), int64(2) memory usage: 248.0 bytes