First, most of the time there is no "missing text": there is an empty string (0 sentences, 0 words), and this is a valid text value. The distinction is important, because the former usually means that the information was not captured, whereas the latter means that the information was intentionally left blank. For example, a user not entering a review is not missing information: the user chose not to enter any text, and it cannot be assumed that this choice is equivalent to whatever text is most common.
To the best of my knowledge, there is no imputation in NLP. Imputation can make sense in some cases with numerical values (and even then it should be used cautiously), but text is generally too diverse (unstructured data) for the concept of a "most frequent text" to make any sense. More generally, substituting real text (or the absence of text) with artificially generated data is frowned upon from the point of view of evaluation.
Thus in my opinion the main design options are the following:
- Leave the text empty. Most of the time an empty text can be represented like any other text value, e.g. as a TFIDF vector made of zeros.
- Discard instances which have no text. For example in text classification no text means no input data at all, so there's no point performing the task for such cases.
- Treat instances with no text as special cases, based on the specifics of the task. For example, such instances could systematically be assigned the majority class, if that makes sense for the task.
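The first option above can be illustrated with a minimal sketch: a toy pure-Python TF-IDF vectorizer (the corpus, tokenization, and weighting scheme are illustrative assumptions, not a reference implementation) in which an empty review is simply encoded as a vector of zeros, like any other text value.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Build TF-IDF vectors over the corpus vocabulary.

    An empty text is a valid input: it contributes no terms and is
    encoded as a zero vector, not treated as missing data.
    """
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(texts)
    # document frequency: number of documents containing each term
    df = Counter(w for toks in tokenized for w in set(toks))
    # smoothed IDF (one of several common variants)
    idf = {w: math.log(n / df[w]) + 1.0 for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # empty text -> all-zero vector (guard avoids division by zero)
        vectors.append([tf[w] / len(toks) * idf[w] if toks else 0.0
                        for w in vocab])
    return vocab, vectors

docs = ["good product", "bad product", ""]  # third review left blank
vocab, vecs = tfidf_vectors(docs)
print(vecs[2])  # the blank review: [0.0, 0.0, 0.0]
```

The other two options then reduce to simple filtering or branching on these instances, e.g. dropping every row whose vector is all zeros, or routing such rows straight to a default (majority-class) prediction instead of the model.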