$\begingroup$

I am trying to one-hot encode my train and test datasets. For the train dataset, I have two dataframes with different numbers of columns but the same number of rows: A (the encoded features) has shape (34164, 293) and B (numerical features only) has shape (34164, 7). I need a final dataframe C that contains both the encoded and the numerical features, with shape (34164, 300).

When I use pd.concat with axis=1, I get a dataframe of shape (44845, 300) that also contains NaN values. Why does the row count increase when both input dataframes have the same number of rows, and where do the NaN values come from? My code is below.

ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
train_x_encoded = pd.DataFrame(ohe.fit_transform(train_x[['model', 'vehicleType', 'brand']]))
train_x_encoded.columns = ohe.get_feature_names(['model', 'vehicleType', 'brand'])
train_x.drop(['model', 'vehicleType', 'brand'], axis=1, inplace=True)
train_x_final = pd.concat([train_x_encoded, train_x], axis=1)

I also tried train_x.join, which returned a dataframe of shape (34164, 300), but it still contained NaN values.

train_x_final1 = train_x.join(train_x_encoded) 
$\endgroup$
  • $\begingroup$ Does it still occur if you reset the index of both dataframes before concatenation? $\endgroup$ Commented Jul 17, 2021 at 17:28
  • $\begingroup$ @Sammy It still gives me nan values in one of the columns which is weird because I performed imputation before encoding! $\endgroup$ Commented Jul 18, 2021 at 6:41
  • $\begingroup$ @Sammy It finally worked after 3 hours of debugging!! Just a silly mistake I made during preprocessing! XD. Man I am dumb! You want to post your answer so that I can mark it as best answer? $\endgroup$ Commented Jul 18, 2021 at 12:46
  • $\begingroup$ Glad you managed to solve it! I've posted my comment as an answer. $\endgroup$ Commented Jul 18, 2021 at 13:12

1 Answer

$\begingroup$

pd.concat with axis=1 aligns rows by index label, not by position. If the two dataframes have different indexes, the result contains the union of both indexes. This produces extra rows, with NaNs in the first dataframe's columns for labels that exist only in the second dataframe, and NaNs in the second dataframe's columns for labels that exist only in the first. In that case, reset the index of both dataframes before concatenating:

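# drop=True discards the old index instead of inserting it as an extra 'index' column
train_x_final = pd.concat([train_x_encoded.reset_index(drop=True), train_x.reset_index(drop=True)], axis=1)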
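
To see why the row count grows, here is a minimal sketch on made-up data. The frames `numeric` and `encoded` and their values are hypothetical stand-ins for `train_x` and `train_x_encoded`. It assumes the numerical frame kept a non-default index (as it would after a shuffled train/test split) while the encoded frame got a fresh 0..n-1 index. It also shows a possible alternative to resetting: giving the encoded frame the original index, so the original row labels are preserved.

```python
import pandas as pd

# Hypothetical toy data: `numeric` keeps a non-default index, while
# `encoded` was built from a plain array and so has index 0, 1, 2.
numeric = pd.DataFrame({"price": [10.0, 20.0, 30.0]}, index=[7, 2, 9])
encoded = pd.DataFrame({"brand_a": [1.0, 0.0, 1.0]})

# concat aligns on index labels: the result has the union of both
# indexes (0, 1, 2, 7, 9), with NaN wherever a label is missing on one side.
misaligned = pd.concat([encoded, numeric], axis=1)
print(misaligned.shape)

# Fix 1: reset both indexes so rows are matched by position.
fixed = pd.concat(
    [encoded.reset_index(drop=True), numeric.reset_index(drop=True)],
    axis=1,
)
print(fixed.shape)

# Fix 2 (alternative): give the encoded frame the original index,
# which keeps the original row labels in the result.
encoded_aligned = pd.DataFrame(
    encoded.to_numpy(), columns=encoded.columns, index=numeric.index
)
kept = pd.concat([encoded_aligned, numeric], axis=1)
print(kept.shape)
```

Both fixes rely on the two frames having their rows in the same order, which holds here because the encoded frame was produced directly from the rows of the numerical one.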
$\endgroup$
