Invalid literal for Float error in Python

Question

I am trying to perform linear regression in Python using the sklearn library.

This is the code I used to train and fit the model. I get the error when I run the predict function call.

    train, test = train_test_split(h1, test_size=0.5, random_state=0)
    my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']
    trainInp = train[my_features]
    target = ['price']
    trainOut = train[target]
    regr = LinearRegression()
    # Train the model using the training sets
    regr.fit(trainInp, trainOut)
    print('Coefficients: \n', regr.coef_)
    testPred = regr.predict(test)

After fitting the model, when I try to predict using the test data, it throws the following error

    Traceback (most recent call last):
      File "C:/Users/gouta/PycharmProjects/MLCourse1/Python.py", line 52, in <module>
        testPred = regr.predict(test)
      File "C:\Users\gouta\Anaconda2\lib\site-packages\sklearn\linear_model\base.py", line 200, in predict
        return self._decision_function(X)
      File "C:\Users\gouta\Anaconda2\lib\site-packages\sklearn\linear_model\base.py", line 183, in _decision_function
        X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
      File "C:\Users\gouta\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 393, in check_array
        array = array.astype(np.float64)
    ValueError: invalid literal for float(): 20140604T000000

The coefficients for the Linear Regression Model are

('Coefficients: \n', array([[ -5.04902429e+04, 5.23550164e+04, 2.90631319e+02, -1.19010351e-01, -1.25257545e+04, 6.52414059e+02]]))

The following is the first five lines of the test dataset

Is the error being caused because of the large value of coefficients? How can I fix this?

Why is there the letter T in the value? Also, consider showing some of your code...
– David Zemens, Commented Feb 18, 2016 at 17:50
"Is the error being caused because of the large value of coefficients?" <- No, the error is almost certainly because you've got something that looks like a date/time column in your test data, when the model is expecting just an array of floats. Please show us the first few rows of the test data!
– Mark Dickinson, Commented Feb 18, 2016 at 18:08
I am sorry. The mistake I made was that I had selected a subset of columns for the training input and target, but kept all the columns in the test dataset, so the additional variables in the test dataset caused the problem.
– goutam, Commented Feb 18, 2016 at 20:09

Mark Dickinson · Accepted Answer · 2016-02-18 20:06:38Z

Your problem is that you're fitting the model on a selected set of features from the whole dataframe (you do trainInp = train[my_features]), but you're trying to predict on the complete set of features (regr.predict(test)), including non-numeric features like date.

So instead of doing regr.predict(test), you should do regr.predict(test[my_features]). More generally, remember that whatever preprocessing you apply to the training set (normalization, feature selection, PCA, ...), you should also apply to the test set.
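As a minimal sketch of that fix: the small dataframe below is a made-up stand-in for h1 (the real data isn't shown in the question), and the import path `sklearn.model_selection` assumes a reasonably recent scikit-learn (older releases exposed `train_test_split` elsewhere). It fits on the selected columns, shows that predicting on the full frame fails, and then predicts on the same columns used for fitting.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the asker's h1 dataframe (values are made up).
h1 = pd.DataFrame({
    "date": ["20140604T000000", "20140605T000000", "20140606T000000",
             "20140607T000000", "20140608T000000", "20140609T000000"],
    "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0, 257500.0],
    "bedrooms": [3, 3, 2, 4, 3, 3],
    "sqft_living": [1180, 2570, 770, 1960, 1680, 1715],
})
my_features = ["bedrooms", "sqft_living"]
target = ["price"]

train, test = train_test_split(h1, test_size=0.5, random_state=0)

regr = LinearRegression()
regr.fit(train[my_features], train[target])

# Predicting on the full frame fails: it still contains the date string
# column (and price), which the model was never fitted on.
try:
    regr.predict(test)
    failed = False
except ValueError as exc:
    failed = True
    print("predict(test) raised:", exc)

# Predicting on the same columns used for fitting works.
testPred = regr.predict(test[my_features])
print(testPred.shape)
```

The exact wording of the error depends on the scikit-learn version, but the cause is the same: the columns passed to predict don't match the columns passed to fit.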

Alternatively, you could cut down to the set of features of interest before you do the train-test split:

    my_features = ['bedrooms', 'bathrooms', ...]
    train, test = train_test_split(h1[my_features], test_size=0.5, random_state=0)
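One thing to watch with this variant: the target column has to survive the split too, so in practice you'd probably keep the features plus price. A rough sketch, again using a made-up stand-in for h1 and assuming the `sklearn.model_selection` import path of recent scikit-learn versions:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the asker's h1 dataframe (values are made up).
h1 = pd.DataFrame({
    "date": ["20140604T000000", "20140605T000000", "20140606T000000",
             "20140607T000000", "20140608T000000", "20140609T000000"],
    "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0, 257500.0],
    "bedrooms": [3, 3, 2, 4, 3, 3],
    "sqft_living": [1180, 2570, 770, 1960, 1680, 1715],
})
my_features = ["bedrooms", "sqft_living"]
target = ["price"]

# Keep only the columns the model needs (features plus target) before
# splitting, so neither half ever contains the date column.
subset = h1[my_features + target]
train, test = train_test_split(subset, test_size=0.5, random_state=0)

regr = LinearRegression()
regr.fit(train[my_features], train[target])

# Dropping the target leaves exactly the feature columns, in fit order.
testPred = regr.predict(test.drop(columns=target))
print(testPred.shape)
```

With this layout the train and test frames always share the same columns, so the mismatch from the question can't happen by accident.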
