I have a dataset in which one of the categorical columns has a considerable number of missing values. The interesting thing about this column is that it has values only for one particular category in another column.
For example:

column 1    column 2
========================================
Google      -
Google      -
Google      -
Google      -
Facebook    Image
Facebook    Video
Facebook    Image

My column of interest (column 2) has values only for one category (Facebook) of the other column. Therefore, the missing values for Google cannot be imputed with an average, cannot be predicted, and those rows cannot be ignored either.
In such a situation, is it wise to consider the missing values '-' as a separate category in one-hot encoding? Or will this affect my machine learning model badly?
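To make the question concrete, here is a minimal sketch of what treating '-' as its own category would produce on the toy table above. It uses plain Python rather than any particular library, and the helper `one_hot` and the variable names are hypothetical, introduced only for illustration.

```python
# Toy rows mirroring the example table: (column 1, column 2).
rows = [
    ("Google", "-"),
    ("Google", "-"),
    ("Google", "-"),
    ("Google", "-"),
    ("Facebook", "Image"),
    ("Facebook", "Video"),
    ("Facebook", "Image"),
]


def one_hot(values):
    """Hypothetical helper: one 0/1 indicator per distinct value."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


# Encode each column separately, keeping "-" as an ordinary category.
col1_cats, col1_encoded = one_hot([r[0] for r in rows])
col2_cats, col2_encoded = one_hot([r[1] for r in rows])

# Pull out the "-" indicator of column 2 and the "Google" indicator of column 1.
missing_idx = col2_cats.index("-")
google_idx = col1_cats.index("Google")
missing_flag = [row[missing_idx] for row in col2_encoded]
google_flag = [row[google_idx] for row in col1_encoded]

# On this toy data the two indicators are the same column.
redundant = missing_flag == google_flag
```

On this data the '-' indicator carries exactly the same information as the Google indicator, so the encoding is valid but redundant. Whether that redundancy matters depends on the model, which the sketch does not settle.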
column 1 and column 2 variables? (In your example, you could make 3 variables: Google, FacebookImage and FacebookVideo.) That's another thing you can try to avoid having 2 highly correlated columns.
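The combined-variable idea could be sketched roughly as follows, in plain Python. The function `combine` and the assumption that '-' is the only missing marker are mine, not part of the original suggestion.

```python
# Toy rows mirroring the example table: (column 1, column 2).
rows = [
    ("Google", "-"),
    ("Google", "-"),
    ("Google", "-"),
    ("Google", "-"),
    ("Facebook", "Image"),
    ("Facebook", "Video"),
    ("Facebook", "Image"),
]


def combine(platform, content_type, missing="-"):
    """Hypothetical helper: merge both columns into one label.

    Rows whose second column is the missing marker keep the platform name only.
    """
    return platform if content_type == missing else platform + content_type


combined = [combine(p, t) for p, t in rows]

# One-hot encode the single combined variable.
categories = sorted(set(combined))
encoded = [[1 if label == c else 0 for c in categories] for label in combined]
```

This yields the three variables named above (FacebookImage, FacebookVideo, Google) in place of the two separately encoded columns, so no indicator duplicates another.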