How to fill missing values in a dataset where some properties can be inputs and outputs?

I have a dataset with missing values that I would like to fill using machine learning methods. In more detail, there are $n$ individuals, for each of which up to 10 properties are provided, all numerical. The catch is that there is no individual for which all properties are given. The first rows (each row contains the data for one individual) look like this:

\begin{bmatrix} 1 & NA & 3.6 & 12.1 & NA \\ 1.2 & NA & NA & 4 & NA \\ NA & 4 & 5 & NA & 7 \end{bmatrix}
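To make the missingness pattern concrete, here is a minimal sketch of the three example rows as a NumPy array, with NA encoded as `np.nan`. The variable names (`A`, `observed`, and so on) are only illustrative and are not part of the original data description.

```python
import numpy as np

# The three example rows from above, with NA encoded as np.nan.
A = np.array([
    [1.0,    np.nan, 3.6,    12.1,   np.nan],
    [1.2,    np.nan, np.nan, 4.0,    np.nan],
    [np.nan, 4.0,    5.0,    np.nan, 7.0],
])

observed = ~np.isnan(A)                      # True where a value is given
complete_rows = observed.all(axis=1)         # rows with every property present
observed_per_column = observed.sum(axis=0)   # number of known values per column

print(complete_rows.any())     # is there any fully observed individual?
print(observed_per_column)
```

In this small excerpt no row is complete, so there is no subset of individuals on which a conventional model could be trained with all properties as features.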

What methods could be applicable in general?

I have some basic experience with classifiers and Random Forests. Apart from the obvious difference that this is not a classification problem, what I struggle with most is that the same variable (say, the one in the $n$-th column) is both an input and an output. Say I want to predict the value $A_{2,3}$ in the dataset above. In this case, all the values in the third column could be used as input, excluding of course $A_{2,3}$ itself, which would be an output.

This seems to me to be different from the more conventional set-up of predicting one property given a set of other properties (e.g., predicting income given education, work sector, seniority, etc.). Here, income is sometimes to be predicted and sometimes used to predict another variable. I am aware of methods which, given a vector $X_i$, approximate a function $F$ and predict responses $Y_i$ with

$$ Y_i = F(X_i)$$

In the scenario I described though, it looks like some implicit function $\Phi$ is to be found, a function of all the variables $Z_i$ (columns in the dataset above)

$$ \Phi (Z_i) = 0$$
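To illustrate what I mean by a column alternating between input and output, here is a rough sketch of a naive round-robin scheme: start from column means, then regress each column in turn on all the others and overwrite only its missing entries. This is an assumption about how the problem could be set up, not a method I know to be appropriate; the helper name `round_robin_fill`, the linear model, and the number of sweeps are all hypothetical choices.

```python
import numpy as np

def round_robin_fill(A, n_sweeps=10):
    """Hypothetical helper: fill NaNs by regressing each column on the others in turn."""
    A = np.asarray(A, dtype=float)
    missing = np.isnan(A)
    filled = A.copy()

    # Initial guess: replace each missing entry by the mean of its column.
    col_means = np.nanmean(A, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])

    for _ in range(n_sweeps):
        for j in range(A.shape[1]):
            miss_j = missing[:, j]
            if not miss_j.any():
                continue  # nothing to predict in this column
            # Column j is the output; all other columns (currently filled) are inputs.
            others = np.delete(filled, j, axis=1)
            X = np.column_stack([np.ones(len(others)), others])
            obs_j = ~miss_j
            # Least-squares fit on the rows where column j is actually observed.
            coef, *_ = np.linalg.lstsq(X[obs_j], filled[obs_j, j], rcond=None)
            # Overwrite only the entries that were missing in the original data.
            filled[miss_j, j] = X[miss_j] @ coef
    return filled

# The three example rows from the question, with NA encoded as np.nan.
A = np.array([
    [1.0,    np.nan, 3.6,    12.1,   np.nan],
    [1.2,    np.nan, np.nan, 4.0,    np.nan],
    [np.nan, 4.0,    5.0,    np.nan, 7.0],
])
filled = round_robin_fill(A)
print(filled)
```

With only three rows the fits are badly underdetermined, so the numbers themselves mean nothing here; the point is only that each column plays the role of $Y$ in one step and part of $X$ in the next.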

What methods could handle this aspect? I understand the question is probably too general, but I could not find much and could do with a starting point. I would already be content with some hints for further reading, but anything more would be gratefully welcomed. Thanks.



edited May 15, 2020 at 6:53

Smerdjakov

