Sycorax

When there are NaN values for a column of data, why is it okay to fill the values with the median or mean of that column?

Suppose I have a dataset with 100 rows, and in one column, titled 'Age', 14 of the rows are NaN. A common way to deal with this is to fill those NaN values with the median or mean of the column, but what is the justification for this? I can agree that the median or mean age is the most 'likely' age for a random datapoint if the age histogram looks vaguely Gaussian. But why shouldn't I fill each NaN with a random number drawn from a normal distribution centered at that 'most likely' age? Wouldn't that be more realistic? If the age is missing for 14 people, it seems unrealistic to assume they are all the same age, even if it is the most common age. It seems more likely to me that their ages would vary around that most likely age, just as they would under a normal distribution.
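
To make the two options concrete, here is a minimal sketch of what I mean. The data is synthetic (ages drawn from a normal distribution with an arbitrary mean of 40 and standard deviation of 12), the 14 missing rows are picked at random, and all variable names are hypothetical. It uses only the Python standard library.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: 100 ages, 14 of which we pretend are missing.
ages = [random.gauss(40, 12) for _ in range(100)]
missing_idx = set(random.sample(range(100), 14))
observed = [a for i, a in enumerate(ages) if i not in missing_idx]

mu = statistics.mean(observed)
sigma = statistics.stdev(observed)

# Option 1: fill every missing age with the observed mean.
mean_filled = [mu if i in missing_idx else a for i, a in enumerate(ages)]

# Option 2: fill each missing age with its own draw from N(mu, sigma).
random_filled = [
    random.gauss(mu, sigma) if i in missing_idx else a
    for i, a in enumerate(ages)
]

sd_observed = sigma
sd_mean_filled = statistics.stdev(mean_filled)
sd_random_filled = statistics.stdev(random_filled)

print(f"sd of observed ages:     {sd_observed:.2f}")
print(f"sd after mean fill:      {sd_mean_filled:.2f}")
print(f"sd after random fill:    {sd_random_filled:.2f}")
```

In this sketch the mean fill leaves the column mean unchanged but shrinks the sample standard deviation, because 14 rows sit exactly at the mean and contribute no spread. The random fill always gives a larger spread than the mean fill, which is the behaviour I was expecting to be "more realistic".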