I am trying to perform linear regression in Python using the sklearn library.

This is the code I used to train and fit the model. I get the error when I call the predict function.

    train, test = train_test_split(h1, test_size = 0.5, random_state=0)
    my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']
    trainInp = train[my_features]
    target = ['price']
    trainOut = train[target]
    regr = LinearRegression()
    # Train the model using the training sets
    regr.fit(trainInp, trainOut)
    print('Coefficients: \n', regr.coef_)
    testPred = regr.predict(test)

After fitting the model, when I try to predict using the test data, it throws the following error:

    Traceback (most recent call last):
      File "C:/Users/gouta/PycharmProjects/MLCourse1/Python.py", line 52, in <module>
        testPred = regr.predict(test)
      File "C:\Users\gouta\Anaconda2\lib\site-packages\sklearn\linear_model\base.py", line 200, in predict
        return self._decision_function(X)
      File "C:\Users\gouta\Anaconda2\lib\site-packages\sklearn\linear_model\base.py", line 183, in _decision_function
        X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
      File "C:\Users\gouta\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 393, in check_array
        array = array.astype(np.float64)
    ValueError: invalid literal for float(): 20140604T000000

The coefficients for the Linear Regression Model are

('Coefficients: \n', array([[ -5.04902429e+04, 5.23550164e+04, 2.90631319e+02, -1.19010351e-01, -1.25257545e+04, 6.52414059e+02]])) 

The following are the first five rows of the test dataset:

Test dataset

Is the error being caused because of the large value of coefficients? How can I fix this?

  • Why is there the letter T in the value? Also, consider showing some of your code... Commented Feb 18, 2016 at 17:50
  • Please show us the code that actually throws the error. Commented Feb 18, 2016 at 17:58
  • Can you show the first few rows of test? Commented Feb 18, 2016 at 18:04
  • "Is the error being caused because of the large value of coefficients?" <- No, the error is almost certainly because you've got something that looks like a date/time column in your test data, when the model is expecting just an array of floats. Please show us the first few rows of the test data! Commented Feb 18, 2016 at 18:08
  • I am sorry. My mistake was that I selected a subset of columns for the training input and target but kept all the columns in the test dataset, so the additional variables in the test dataset caused the problem. Commented Feb 18, 2016 at 20:09

1 Answer

Your problem is that you're fitting the model on a selected set of features from the whole dataframe (you do trainInp = train[my_features]), but you're trying to predict on the complete set of features (regr.predict(test)), including non-numeric features like date.

So instead of doing regr.predict(test), you should do regr.predict(test[my_features]). More generally, remember that whatever preprocessing you apply to the training set (normalization, feature selection, PCA, ...), you should also apply to the test set.
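
To make the fix concrete, here is a minimal sketch on synthetic data. The small `h1` frame, its values, and the shortened feature list are assumptions made for illustration and are not the asker's dataset. The exact exception type and message also depend on the scikit-learn version, so the sketch only checks that the call on the full frame fails.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the housing data, with a non-numeric date column.
rng = np.random.default_rng(0)
n = 40
h1 = pd.DataFrame({
    "date": ["20140604T000000"] * n,
    "bedrooms": rng.integers(1, 6, size=n),
    "bathrooms": rng.integers(1, 4, size=n),
    "sqft_living": rng.integers(500, 4000, size=n),
})
h1["price"] = (100.0 * h1["sqft_living"] + 5000.0 * h1["bedrooms"]
               + rng.normal(0, 1000, size=n))

my_features = ["bedrooms", "bathrooms", "sqft_living"]
train, test = train_test_split(h1, test_size=0.5, random_state=0)

regr = LinearRegression()
regr.fit(train[my_features], train["price"])

# Passing the whole test frame fails: it contains columns the model never
# saw, including the date strings that cannot be converted to float.
try:
    regr.predict(test)
    full_frame_failed = False
except (ValueError, TypeError) as exc:
    full_frame_failed = True
    print("predict(test) failed:", exc)

# Selecting the same columns that were used for fitting works.
testPred = regr.predict(test[my_features])
print(testPred.shape)
```

The same column list is used for both `fit` and `predict`, so the model always sees the features in the order it was trained on.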

Alternatively, you could cut down to the set of features of interest before you do the train-test split:

    my_features = ['bedrooms', 'bathrooms', ...]
    train, test = train_test_split(h1[my_features], test_size = 0.5, random_state=0)
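
One thing to watch with this approach is that slicing down to `my_features` alone also drops the `price` column, so the target has to be carried along separately. A possible variation, sketched here on made-up data (the `h1` frame below is a hypothetical stand-in), is to pass the features and the target to `train_test_split` together so both are split with the same row assignment:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the housing data.
rng = np.random.default_rng(1)
n = 30
h1 = pd.DataFrame({
    "date": ["20140604T000000"] * n,
    "bedrooms": rng.integers(1, 6, size=n),
    "sqft_living": rng.integers(500, 4000, size=n),
})
h1["price"] = 120.0 * h1["sqft_living"] + rng.normal(0, 1000, size=n)

my_features = ["bedrooms", "sqft_living"]
X = h1[my_features]  # numeric features only, the date column is left out
y = h1["price"]      # target kept separately

# Splitting X and y in one call keeps their rows aligned.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

regr = LinearRegression().fit(X_train, y_train)
split_pred = regr.predict(X_test)
print(list(X_test.columns))
print(split_pred.shape)
```

Because the test features are derived from the same `X`, there is no way to accidentally pass extra columns to `predict`.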

1 Comment

Thank you Mark. Just realised that and commented that.
