$\begingroup$

I have a dataset of patient survey questionnaires. About 10 questions are asked at several stages of treatment: on the first day of the visit, after one week, after two weeks, and so on up to 3 months. Some patients drop out partway through treatment. The dataset has around 50 columns (10 questions repeated 5 times over the course of treatment), but data are missing for the patients who dropped out.

My questions are:

How do I handle the missing data, given that the patient never filled it in?

Should I impute the missing values with the mean, or is there another way?

P.S.: I am new to survival analysis, so any help would be appreciated. Thanks in advance.

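| id  | age | sex | dropout | s1_q1 | s1_q2 | s1_q3 | s1_q4 | s1_q5 | ... | s5_q10 |
|-----|-----|-----|---------|-------|-------|-------|-------|-------|-----|--------|
| 217 | 50  | m   | 0       | 2     | 3     | 3     | 3     | 2     | ... | 3      |
| 202 | 58  | f   | 0       | 4     | 9     | 10    | 10    | 10    | ... | N/A    |
| 222 | 72  | m   | 1       | 3     | 8     | 9     | 10    | 9     | ... | N/A    |
| 207 | 50  | m   | 0       | 2     | 7     | 6     | 7     | 7     | ... | 6      |
| 277 | 55  | f   | 0       | 2     | 4     | 5     | 5     | 5     | ... | 6      |
| 281 | 62  | m   | 0       | 4     | 10    | 10    | 10    | 10    | ... | 10     |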
$\endgroup$

1 Answer

$\begingroup$

There are several different approaches here.

One (which you've already described) is to impute each missing value with the mean of the observed values in that column. You could then add an extra column that records whether the value was originally missing. In the example you provide, we would end up with:

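| id  | age | sex | dropout | s1_q1 | s1_q2 | s1_q3 | s1_q4 | s1_q5 | ... | s5_q10 | s5_q10_missing |
|-----|-----|-----|---------|-------|-------|-------|-------|-------|-----|--------|----------------|
| 217 | 50  | m   | 0       | 2     | 3     | 3     | 3     | 2     | ... | 3      | 0              |
| 202 | 58  | f   | 0       | 4     | 9     | 10    | 10    | 10    | ... | 6.25   | 1              |
| 222 | 72  | m   | 1       | 3     | 8     | 9     | 10    | 9     | ... | 6.25   | 1              |
| 207 | 50  | m   | 0       | 2     | 7     | 6     | 7     | 7     | ... | 6      | 0              |
| 277 | 55  | f   | 0       | 2     | 4     | 5     | 5     | 5     | ... | 6      | 0              |
| 281 | 62  | m   | 0       | 4     | 10    | 10    | 10    | 10    | ... | 10     | 0              |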
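As a rough illustration, mean imputation with a missingness indicator might look like this in pandas. This is only a sketch: the frame holds just the `id` and `s5_q10` values from the example rows, and the variable name `observed_mean` is my own.

```python
import numpy as np
import pandas as pd

# Toy frame with the s5_q10 values from the example rows (N/A -> NaN).
df = pd.DataFrame({
    "id": [217, 202, 222, 207, 277, 281],
    "s5_q10": [3, np.nan, np.nan, 6, 6, 10],
})

# Record which values were missing *before* filling them in.
df["s5_q10_missing"] = df["s5_q10"].isna().astype(int)

# Mean of the observed values only; pandas skips NaN by default.
observed_mean = df["s5_q10"].mean()

# Replace the missing entries with that mean.
df["s5_q10"] = df["s5_q10"].fillna(observed_mean)

print(df)
```

The indicator has to be created before `fillna`, otherwise the information about which entries were imputed is lost.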

Another is to drop the rows that have missing values, if you have enough data (which in your case you probably don't).

A slightly more involved approach is to predict the missing values from the non-missing data: build a model that predicts s5_q10, train it on the rows where s5_q10 is observed, and then use that model to fill in s5_q10 for the rows where it is missing.
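A minimal sketch of that model-based idea, using the example rows. The choice of `age` and `s1_q5` as predictors, and of a plain least-squares linear model fitted with NumPy, are assumptions made purely for illustration; any regressor (for example one from scikit-learn) could be substituted, and with real data you would want many more rows than this.

```python
import numpy as np
import pandas as pd

# Toy frame built from the example rows (N/A -> NaN).
df = pd.DataFrame({
    "age":    [50, 58, 72, 50, 55, 62],
    "s1_q5":  [2, 10, 9, 7, 5, 10],
    "s5_q10": [3, np.nan, np.nan, 6, 6, 10],
})

predictors = ["age", "s1_q5"]  # hypothetical choice of predictors
target = "s5_q10"

# Boolean mask of rows where the target was actually observed.
observed = df[target].notna()

def design_matrix(frame):
    """Predictor values with a leading column of ones for the intercept."""
    X = frame[predictors].to_numpy(dtype=float)
    return np.column_stack([np.ones(len(X)), X])

# Fit the linear model on the rows where the target is observed.
y_obs = df.loc[observed, target].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(design_matrix(df[observed]), y_obs, rcond=None)

# Keep track of which values are imputed, then fill them with predictions.
df["s5_q10_missing"] = (~observed).astype(int)
df.loc[~observed, target] = design_matrix(df[~observed]) @ coef

print(df)
```

Only the rows flagged in `s5_q10_missing` are changed; the observed answers are left as they were.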

$\endgroup$
  • $\begingroup$ Thanks for the response. Adding an extra column to flag the missing data would increase the feature space (I already have 50 columns). Do you think it's a good idea? $\endgroup$ Commented Sep 26, 2018 at 8:39
  • $\begingroup$ It might be. I would try both approaches, evaluate them via cross-validation, and see which works best. $\endgroup$ Commented Sep 26, 2018 at 8:40
  • $\begingroup$ Hello @marco_gorelli, I have a question about your answer. You added an extra column to track whether the values were originally missing. How do you test this? $\endgroup$ Commented Nov 16, 2018 at 10:07
