Scikit-learn : Input contains NaN, infinity or a value too large for dtype ('float64')

Question

I'm using Python scikit-learn for simple linear regression on data loaded from a CSV file.

    import numpy as np
    import pandas
    from sklearn import linear_model
    from sklearn.model_selection import train_test_split

    reader = pandas.io.parsers.read_csv("data/all-stocks-cleaned.csv")
    stock = np.array(reader)

    openingPrice = stock[:, 1]
    closingPrice = stock[:, 5]

    print((np.min(openingPrice)))
    print((np.min(closingPrice)))
    print((np.max(openingPrice)))
    print((np.max(closingPrice)))

    openingPriceTrain, openingPriceTest, closingPriceTrain, closingPriceTest = \
        train_test_split(openingPrice, closingPrice, test_size=0.25, random_state=42)

    openingPriceTrain = np.reshape(openingPriceTrain,(openingPriceTrain.size,1))
    openingPriceTrain = openingPriceTrain.astype(np.float64, copy=False)
    # openingPriceTrain = np.arange(openingPriceTrain, dtype=np.float64)

    closingPriceTrain = np.reshape(closingPriceTrain,(closingPriceTrain.size,1))
    closingPriceTrain = closingPriceTrain.astype(np.float64, copy=False)

    openingPriceTest = np.reshape(openingPriceTest,(openingPriceTest.size,1))
    closingPriceTest = np.reshape(closingPriceTest,(closingPriceTest.size,1))

    regression = linear_model.LinearRegression()
    regression.fit(openingPriceTrain, closingPriceTrain)
    predicted = regression.predict(openingPriceTest)

The min and max values are printed as:

    0.0
    0.6
    41998.0
    2593.9

Yet I'm getting this error:

    ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

How can I get rid of this error? Judging from the output above, the data doesn't seem to contain infinities or NaN values.

What's the solution for this?

Edit: all-stocks-cleaned.csv is available at http://www.sharecsv.com/s/cb31790afc9b9e33c5919cdc562630f3/all-stocks-cleaned.csv

@iled all-stocks-cleaned.csv is available at sharecsv.com/s/cb31790afc9b9e33c5919cdc562630f3/…
– Vishwajeet Vatharkar, Commented Jan 14, 2016 at 9:34

Sergey Bushmanov · Accepted Answer · 2016-01-14 13:30:31Z

The problem with your regression is that NaNs have somehow sneaked into your data. This is easy to check with the following code snippet:

    import pandas as pd
    import numpy as np
    from sklearn import linear_model
    # sklearn.cross_validation was removed in scikit-learn 0.20;
    # train_test_split now lives in sklearn.model_selection
    from sklearn.model_selection import train_test_split

    reader = pd.io.parsers.read_csv("./data/all-stocks-cleaned.csv")
    stock = np.array(reader)

    openingPrice = stock[:, 1]
    closingPrice = stock[:, 5]

    openingPriceTrain, openingPriceTest, closingPriceTrain, closingPriceTest = \
        train_test_split(openingPrice, closingPrice, test_size=0.25, random_state=42)

    openingPriceTrain = openingPriceTrain.reshape(openingPriceTrain.size,1)
    openingPriceTrain = openingPriceTrain.astype(np.float64, copy=False)

    closingPriceTrain = closingPriceTrain.reshape(closingPriceTrain.size,1)
    closingPriceTrain = closingPriceTrain.astype(np.float64, copy=False)

    openingPriceTest = openingPriceTest.reshape(openingPriceTest.size,1)
    openingPriceTest = openingPriceTest.astype(np.float64, copy=False)

    np.isnan(openingPriceTrain).any(), np.isnan(closingPriceTrain).any(), np.isnan(openingPriceTest).any()

    (True, True, True)
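Since the error message also mentions infinity, a slightly broader check can be sketched with np.isfinite. The helper name report_non_finite and the toy array below are illustrative assumptions, not part of the original code:

```python
import numpy as np

def report_non_finite(values):
    """Count NaN, infinite, and finite entries in an array-like of numbers."""
    # Convert to float64 first: np.isnan / np.isfinite do not accept object arrays,
    # which is what np.array(DataFrame) yields when the columns have mixed types.
    arr = np.asarray(values, dtype=np.float64)
    return {
        "nan": int(np.isnan(arr).sum()),
        "inf": int(np.isinf(arr).sum()),
        "finite": int(np.isfinite(arr).sum()),
    }

# Made-up object array mimicking a column sliced from a mixed-type DataFrame.
sample = np.array([1.5, np.nan, 3.0, np.inf], dtype=object)
counts = report_non_finite(sample)
print(counts)
```

Running such a check on each array before calling fit should show which input triggers the ValueError.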

If you impute the missing values, for example with the median of the non-missing values as below:

    openingPriceTrain[np.isnan(openingPriceTrain)] = np.median(openingPriceTrain[~np.isnan(openingPriceTrain)])
    closingPriceTrain[np.isnan(closingPriceTrain)] = np.median(closingPriceTrain[~np.isnan(closingPriceTrain)])
    openingPriceTest[np.isnan(openingPriceTest)] = np.median(openingPriceTest[~np.isnan(openingPriceTest)])

your regression will run smoothly without a problem:

    regression = linear_model.LinearRegression()
    regression.fit(openingPriceTrain, closingPriceTrain)
    predicted = regression.predict(openingPriceTest)
    predicted[:5]

    array([[ 13598.74748173],
           [ 53281.04442146],
           [ 18305.4272186 ],
           [ 50753.50958453],
           [ 14937.65782778]])

In short: you have missing values in your data, as the error message said.
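As an alternative to imputing each array by hand, the imputation can be bundled with the model so that the median is learned from the training data only and then reused for the test data. This is a minimal sketch using scikit-learn's SimpleImputer in a pipeline; the toy arrays are made up and stand in for the opening and closing prices:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Made-up data: one feature column with a missing value in train and in test.
X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0])  # the target itself must not contain NaN
X_test = np.array([[3.0], [np.nan]])

# The imputer learns the median of X_train during fit and applies
# that same value to X_test during predict.
model = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
model.fit(X_train, y_train)

predicted = model.predict(X_test)
print(predicted)
```

Note that the pipeline only imputes the features. Rows with a missing target value would still need to be dropped or filled before fitting.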

EDIT:

Perhaps an easier and more straightforward approach is to check for missing data right after you read the file with pandas:

    data = pd.read_csv('./data/all-stocks-cleaned.csv')
    data.isnull().any()

    Date                    False
    Open                     True
    High                     True
    Low                      True
    Last                     True
    Close                    True
    Total Trade Quantity     True
    Turnover (Lacs)          True
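To see how much is missing rather than just whether anything is, the same idea extends to per-column counts and to listing the affected rows. A small sketch on a made-up frame (the values are invented; only the column names are borrowed from the output above):

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for the stock data.
data = pd.DataFrame({
    "Date": ["2016-01-11", "2016-01-12", "2016-01-13"],
    "Open": [10.0, np.nan, 12.0],
    "Close": [10.5, 11.5, np.nan],
})

# Number of missing values in each column.
missing_per_column = data.isnull().sum()

# Rows that have at least one missing value.
rows_with_missing = data[data.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```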

and then impute the data with either of the two lines below:

    # fillna does not accept a function; pass the per-column medians instead
    data = data.fillna(data.median(numeric_only=True))

or

    # forward fill; fillna(method='ffill') is deprecated in recent pandas versions
    data = data.ffill()
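The two strategies behave differently, which a toy example may make clearer. The frame below is made up for illustration. Median filling replaces every gap in a numeric column, while forward filling copies the previous row's value and therefore leaves a gap in the first row untouched:

```python
import numpy as np
import pandas as pd

# Made-up data with one gap in the middle and one in the very first row.
data = pd.DataFrame({
    "Date": ["2016-01-11", "2016-01-12", "2016-01-13"],
    "Open": [10.0, np.nan, 14.0],
    "Close": [np.nan, 11.0, 15.0],
})

# Fill each numeric column with its own median (the Date column is skipped).
median_filled = data.fillna(data.median(numeric_only=True))

# Propagate the last valid observation forward.
forward_filled = data.ffill()

print(median_filled)
print(forward_filled)
print(forward_filled.isnull().sum())
```

Checking isnull() again after imputing, as in the last line, is a cheap way to confirm that nothing slipped through before fitting the model.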

The np.isnan(openingPriceTrain).any(), np.isnan(closingPriceTrain).any(), np.isnan(openingPriceTest).any() check returning (True, True, True) helped me determine the issue, thanks a ton.
