
I am trying to implement a linear regression model in Python. See the code below, which I've used to fit the regression.

    import pandas
    salesPandas = pandas.DataFrame.from_csv('home_data.csv')

    # check the shape of the DataFrame (rows, columns)
    salesPandas.shape
    (21613, 20)

    from sklearn.cross_validation import train_test_split
    train_dataPandas, test_dataPandas = train_test_split(salesPandas, train_size=0.8, random_state=1)

    from sklearn.linear_model import LinearRegression
    reg_model_Pandas = LinearRegression()

    print type(train_dataPandas)
    print train_dataPandas.shape
    <class 'pandas.core.frame.DataFrame'>
    (17290, 20)

    print type(train_dataPandas['price'])
    print train_dataPandas['price'].shape
    <class 'pandas.core.series.Series'>
    (17290L,)

    X = train_dataPandas
    y = train_dataPandas['price']
    reg_model_Pandas.fit(X, y)

After executing the Python code above, the following error appears:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-11-dc363e199032> in <module>()
          3 X = train_dataPandas
          4 y = train_dataPandas['price']
    ----> 5 reg_model_Pandas.fit(X, y)

    C:\Users\...\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\linear_model\base.py in fit(self, X, y, n_jobs)
        374         n_jobs_ = self.n_jobs
        375         X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
    --> 376                          y_numeric=True, multi_output=True)
        377
        378         X, y, X_mean, y_mean, X_std = self._center_data(

    C:\Users\...\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric)
        442     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
        443                     ensure_2d, allow_nd, ensure_min_samples,
    --> 444                     ensure_min_features)
        445     if multi_output:
        446         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

    C:\Users\...\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features)
        342         else:
        343             dtype = None
    --> 344     array = np.array(array, dtype=dtype, order=order, copy=copy)
        345     # make sure we actually converted to numeric:
        346     if dtype_numeric and array.dtype.kind == "O":

    ValueError: invalid literal for float(): 20140610T000000

Output from train_dataPandas.info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 17290 entries, 4058200630 to 1762600320
    Data columns (total 20 columns):
    date             17290 non-null object
    price            17290 non-null int64
    bedrooms         17290 non-null int64
    bathrooms        17290 non-null float64
    sqft_living      17290 non-null int64
    sqft_lot         17290 non-null int64
    floors           17290 non-null float64
    waterfront       17290 non-null int64
    view             17290 non-null int64
    condition        17290 non-null int64
    grade            17290 non-null int64
    sqft_above       17290 non-null int64
    sqft_basement    17290 non-null int64
    yr_built         17290 non-null int64
    yr_renovated     17290 non-null int64
    zipcode          17290 non-null int64
    lat              17290 non-null float64
    long             17290 non-null float64
    sqft_living15    17290 non-null int64
    sqft_lot15       17290 non-null int64
    dtypes: float64(4), int64(15), object(1)
    memory usage: 2.8+ MB
  • The error is clear-ish: you have an invalid dtype for one of your columns, it looks like a string. They need to be numeric in order to be compatible with sklearn. Can you post the output from train_dataPandas.info()? You may need to convert the dtypes. Commented Nov 20, 2015 at 9:21
  • The train_dataPandas.info() output has been posted Commented Nov 20, 2015 at 9:24
  • The 'date' column is likely to be a string dtype, you need to convert it, try df['date'] = pd.to_datetime(df['date'], format='%Y%m%dT%H%M%s') Commented Nov 20, 2015 at 9:34
  • Typo in format string should be: df['date'] = pd.to_datetime(df['date'], format='%Y%m%dT%H%M%S') Commented Nov 20, 2015 at 9:41
  • OK, wasn't sure if sklearn supported datetime64 or not, you could pass the total time perhaps so df['date'] = df['date'].dt.nanoseconds Commented Nov 20, 2015 at 10:03

3 Answers


Another possible solution, based on your data, could be to specify parse_dates when reading the file, like so:

    import pandas
    salesPandas = pandas.read_csv('home_data.csv', parse_dates=['date'])

The reason this is helpful is that, when you pass your data to be fitted, you can break the date up into month, day, and hour. This assumes most of the variation in your data is concentrated at that level and not across years (i.e. the total number of unique years is only about 3-4).

From here you can use the Datetimelike Properties and get the month with salesPandas['date'].dt.month; for the day and hour, just replace the attribute accordingly.
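For example, here is a minimal sketch along those lines (assuming the same home_data.csv and that price is the target, as in the question) that turns the parsed date into plain numeric month/day/hour columns sklearn can consume:

    import pandas

    # parse the 'date' column while reading the file
    salesPandas = pandas.read_csv('home_data.csv', parse_dates=['date'])

    # break the datetime up into plain numeric columns
    salesPandas['month'] = salesPandas['date'].dt.month
    salesPandas['day'] = salesPandas['date'].dt.day
    salesPandas['hour'] = salesPandas['date'].dt.hour

    # drop the raw datetime column (and the target) before fitting
    X = salesPandas.drop(['date', 'price'], axis=1)
    y = salesPandas['price']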


So, thanks to EdChum, the solution so far is the following:

  1. First I loaded the data.
  2. salesPandas.info() shows me that
    Int64Index: 21613 entries, 7129300520 to 1523300157
    Data columns (total 20 columns):
    date    21613 non-null object

This isn't good, because sklearn cannot use the date as an object dtype.

  3. If I do salesPandas.head(), the date of the first tuple is

20141013T000000

You see the T? That's bad.

  4. sklearn.linear_model.LinearRegression().fit() wants NumPy arrays (pandas is built on NumPy, so a DataFrame is backed by NumPy arrays).

  5. So first convert the object column to datetime, and then convert it to numeric:

salesPandas['date'] = pandas.to_datetime(salesPandas['date'], format='%Y%m%dT%H%M%S')

salesPandas['date'] = pandas.to_numeric(salesPandas['date'])

  6. If you then call

    reg_model_Pandas.fit(X, y)

it works. A consolidated sketch of the whole flow follows below.
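A minimal end-to-end sketch of these steps, with a few assumptions on my part: it uses read_csv/model_selection instead of the deprecated DataFrame.from_csv and sklearn.cross_validation, drops price from the features since it is the target, and on recent pandas versions to_numeric on a datetime column may need to be replaced by .astype('int64'):

    import pandas
    from sklearn.model_selection import train_test_split  # sklearn.cross_validation in old versions
    from sklearn.linear_model import LinearRegression

    # read_csv with index_col=0 replaces the deprecated DataFrame.from_csv
    salesPandas = pandas.read_csv('home_data.csv', index_col=0)

    # convert the 'date' strings (e.g. '20141013T000000') to datetime, then to a number
    salesPandas['date'] = pandas.to_datetime(salesPandas['date'], format='%Y%m%dT%H%M%S')
    salesPandas['date'] = pandas.to_numeric(salesPandas['date'])  # or .astype('int64') on recent pandas

    train_dataPandas, test_dataPandas = train_test_split(
        salesPandas, train_size=0.8, random_state=1)

    # use every column except the target as a feature
    X = train_dataPandas.drop('price', axis=1)
    y = train_dataPandas['price']

    reg_model_Pandas = LinearRegression()
    reg_model_Pandas.fit(X, y)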

1 Comment

If someone has an easier or better solution, feel free to reply :)

These days, the recommended solution is to make a pipeline with sklearn's ColumnTransformer or (easier) skrub's TableVectorizer:

    import pandas as pd

    # Some toy data
    y = [1, 1, 0, 0, 1, 0]
    data = {'Country': ['Germany', 'Turkey', 'England', 'Turkey', 'Germany', 'Turkey'],
            'Age': [44, 32, 27, 29, 31, 25],
            'Salary': [5400, 8500, 7200, 4800, 6200, 10850],
            'Purchased': ['yes', 'yes', 'no', 'yes', 'no', 'yes']}
    df = pd.DataFrame(data)

    # The table vectorizer, to transform the input data:
    from skrub import TableVectorizer
    tv = TableVectorizer()

    # A pipeline chaining the table vectorizer with any scikit-learn learner
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    model = make_pipeline(tv, LogisticRegression())
    model.fit(df, y).predict(df)

The TableVectorizer will encode the date-time columns automatically.
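For comparison, here is a minimal ColumnTransformer sketch for the same toy frame; this is my own assumption-level example (one-hot encoding the string columns and passing the numeric ones through), not something taken from the skrub docs:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # same toy data as above
    y = [1, 1, 0, 0, 1, 0]
    df = pd.DataFrame({'Country': ['Germany', 'Turkey', 'England', 'Turkey', 'Germany', 'Turkey'],
                       'Age': [44, 32, 27, 29, 31, 25],
                       'Salary': [5400, 8500, 7200, 4800, 6200, 10850],
                       'Purchased': ['yes', 'yes', 'no', 'yes', 'no', 'yes']})

    # one-hot encode the string columns, pass the numeric ones through unchanged
    preprocess = ColumnTransformer(
        [('cat', OneHotEncoder(handle_unknown='ignore'), ['Country', 'Purchased'])],
        remainder='passthrough')

    model = make_pipeline(preprocess, LogisticRegression())
    model.fit(df, y).predict(df)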

