0

I am starting with programming. I would like to treat missing data by replacing it for the mean value of its column instead of its row-mean-value.

I'm trying to pre-process some data, from an Excel datasheet, similar to the one written as follows (the first column, the one with no title, is just the numbering/index; is not a real column):

 A B C D E F 0 100 NaN 5.0 1.0 5 a 1 200 5.0 NaN 3.0 3 a 2 300 4.0 NaN 5.0 6 a 3 400 5.0 4.0 7.0 9 b 4 500 5.0 2.0 NaN 2 b 5 600 4.0 3.0 0.0 4 b 

The point is that I would like to replace NaN for the mean value of its column. I am using pandas to extract the values from the datasheet. Then I try with sklearn.impute.SimpleImputer to treat these missing data but it just gives me the mean value of the rows.

dataset = pd.read_excel(io=file_name, sheet_name=sheet) y = dataset.iloc[:, 0].values X = dataset.iloc[:, 1:-1].values from sklearn.impute import SimpleImputer imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean') imputer = imputer.fit(X[:, :]) X[:, :] = imputer.transform(X[:, :]) 

Does anyone know how could I do that in an efficient way? Is there any method of a class that treats this case? (I do not mind getting into another library).

Thank you very much

1
  • Before converting to y and X it's as simple as dataset.fillna(dataset.mean()). The index in pandas is a very powerful tool. Commented Oct 11, 2019 at 15:49

1 Answer 1

0

If you want to fill NaN values with the mean of the column, you can simply use:

data = {'A':[100,200,300,400,500,600],'B':[np.nan,5,4,5,5,4],'C':[5,np.nan,np.nan,4,2,3],'D':[1,3,5,7,np.nan,0]} df = pd.DataFrame(data) df = df.fillna(df.mean()) print(df) 

Output:

 A B C D 0 100 4.6 5.0 1.0 1 200 5.0 3.5 3.0 2 300 4.0 3.5 5.0 3 400 5.0 4.0 7.0 4 500 5.0 2.0 3.2 5 600 4.0 3.0 0.0 
Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.