Let's say I have a dataset of 50,000 entries, about 2% of which were already missing from the beginning. From what I have learned, we need an indicator (a metric) that compares the values produced by an imputation model with the ground-truth values to check the accuracy of the imputation. But since my raw dataset already has some missing values, how can I calculate the accuracy of different imputation models and select the best one?
How can I compare the accuracy of imputation models if the dataset already has missing data?
1 Answer
There are two possibilities:
- You doubt the library and want to check its accuracy: write a custom function and check, for some of the data, whether the imputation is accurate. In general, it is rare for a standard library implementation to be wrong and impute something other than what it was intended to.
- You want to compare the suitability of the imputed values from different methods: the best way is to train the same model (same architecture and configuration) on each differently imputed dataset. Whichever gives you better performance is the better one.
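The second option could be sketched roughly as follows. This is only an illustration under assumptions that are not in the thread: the data is synthetic, scikit-learn's `SimpleImputer` and `KNNImputer` stand in for "model A" and "model B", and a plain linear regression stands in for the downstream model. The imputer sits inside the pipeline so that it is refit on each training fold.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption): 3 features and a linear target.
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

# Remove roughly 2% of the feature values to mimic the missing entries.
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.02] = np.nan

# Candidate imputation methods; the downstream model is identical for both.
imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
}

scores = {}
for name, imputer in imputers.items():
    model = make_pipeline(imputer, LinearRegression())
    cv = cross_val_score(
        model, X_missing, y, cv=5, scoring="neg_mean_squared_error"
    )
    scores[name] = -cv.mean()  # flip the sign back to a plain MSE

for name, mse in scores.items():
    print(f"{name}: downstream MSE = {mse:.4f}")
```

The method with the lower downstream error would be preferred under this criterion. Note that this measures how useful the imputed data is for the prediction task, not how close the imputed values are to the true ones.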
- And what should the indicator be for comparing those different models? Since some data are already missing, we can't use mean squared error to compare the imputed values with the ground-truth values to select the best model, right? – Amisha Dhimal, Jun 29, 2023 at 13:31
- We can use overall model accuracy. If you want, you can use mean squared error for all the different models: the model with the best overall performance in predicting the target variable had the best imputation. – Ansh, Jun 29, 2023 at 16:51
- But let's say I have the following raw data:

  | X  | Y |
  |----|---|
  | 2  | 3 |
  | 3  | 3 |
  | 4  | 3 |
  | 5  | 2 |
  | 6  | ? |
  | 7  | 8 |
  | 8  | 5 |
  | 9  | ? |
  | 10 | 3 |

  Now, since my raw data already has missing values, how can I calculate the MSE for models A and B to choose one of them as the best for the imputation process? – Amisha Dhimal, Jul 1, 2023 at 18:06
- You take the data as imputed by each method and use it to calculate the metric in the respective scenario. – Ansh, Jul 3, 2023 at 10:37
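For the small X/Y table from the comments, one common way to get an MSE against ground truth despite the existing gaps is to hide some of the values that are observed, impute them, and score only those hidden cells. The sketch below assumes that approach; it is not something the answer itself proposes. `SimpleImputer` and `KNNImputer` are stand-ins for "model A" and "model B", and `masked_mse` is a hypothetical helper name.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# The raw data from the comment; "?" is encoded as NaN.
data = np.array([
    [2, 3], [3, 3], [4, 3], [5, 2], [6, np.nan],
    [7, 8], [8, 5], [9, np.nan], [10, 3],
], dtype=float)

# Rows where Y is actually known; only these can serve as ground truth.
observed_rows = np.flatnonzero(~np.isnan(data[:, 1]))


def masked_mse(make_imputer, n_repeats=20, n_hide=2, seed=0):
    """Hide n_hide known Y values, impute them, and average the MSE."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_repeats):
        hidden = rng.choice(observed_rows, size=n_hide, replace=False)
        masked = data.copy()
        masked[hidden, 1] = np.nan  # pretend these values are missing too
        imputed = make_imputer().fit_transform(masked)
        # Score only the cells whose true value we deliberately hid.
        errors.append(np.mean((imputed[hidden, 1] - data[hidden, 1]) ** 2))
    return float(np.mean(errors))


# The same seed gives both methods the same hidden cells.
results = {
    "mean": masked_mse(lambda: SimpleImputer(strategy="mean")),
    "knn": masked_mse(lambda: KNNImputer(n_neighbors=2)),
}

for name, mse in results.items():
    print(f"{name}: MSE on artificially hidden values = {mse:.3f}")
```

The values that were missing from the start never enter the score, because their true values are unknown. With only seven known Y values the estimate is very noisy; on the full dataset the same idea applies with many more hidden cells.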