Can't do linear regression in scikit-learn due to "reshaping" issue

Question

I have a simple CSV with two columns:

ErrorWeek (a number for the week number in the year)
ErrorCount (for the number of errors in a given week)

I read the CSV data into a pandas dataframe, like this:

import pandas as pd
df = pd.read_csv("Errors.csv", sep=",")

df.head() shows:

   ErrorWeek  ErrorCount
0          1          80
1          2         118
2          3         249
3          4         397
4          5         159

So far so good.

Then, I create a test/train split, like this:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['ErrorWeek'], df['ErrorCount'], random_state=0)

No errors so far.

But, I then create a linear regression object and try to fit the data.

from sklearn import linear_model

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

Here I do get an error: "Reshape your data either using array.reshape(-1, 1)"

--

Looking at the shapes of X_train and y_train, I get what look like two one-dimensional "arrays":

X_train shape: (36,)
y_train shape: (36,)

--

I have spent many hours trying to figure this out, but I'm new to Pandas, Python, and to scikit-learn.

I'm reading in two-dimensional data, but Pandas isn't seeing it that way.

What do I need to do, specifically?

Thanks,

Vivek Kumar · Accepted Answer · 2017-12-12 05:25:31Z

Doing:

X_train, X_test, y_train, y_test = train_test_split( df['ErrorWeek'], df['ErrorCount'], random_state=0)

will make all the output arrays one-dimensional, because df['ErrorWeek'] and df['ErrorCount'] each select a single column (a pandas Series) for X and y.

Now, when you pass a one-dimensional array of shape (n,), scikit-learn cannot decide whether you have passed one row of data with multiple columns or multiple samples with a single column. That is, sklearn cannot infer from the X data alone whether it is n_samples=n and n_features=1 or the other way around (n_samples=1 and n_features=n).

Hence it asks you to reshape the 1-D data you provided into 2-D data of shape [n_samples, n_features].
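
The ambiguity can be reproduced with a small sketch. The values are the five rows shown in the question's df.head(); treating those five rows as the whole dataset is an assumption made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Five rows taken from the question's df.head() output.
weeks = np.array([1, 2, 3, 4, 5])
errors = np.array([80, 118, 249, 397, 159])

regr = LinearRegression()

# A 1-D X of shape (5,) is rejected: scikit-learn cannot tell
# whether it is 5 samples x 1 feature or 1 sample x 5 features.
try:
    regr.fit(weeks, errors)
    fit_1d_failed = False
except ValueError:
    fit_1d_failed = True

# Stating the layout explicitly as (n_samples, n_features) works.
X = weeks.reshape(-1, 1)
regr.fit(X, errors)

print(fit_1d_failed)     # whether the 1-D call raised ValueError
print(X.shape)           # shape of the reshaped feature matrix
print(regr.coef_.shape)  # one coefficient per feature
```

Note that only X is reshaped here; the target stays one-dimensional.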

Now there are multiple ways of doing this.

You can do what scikit-learn says:

# .values converts each pandas Series to a NumPy array, which has reshape()
X_train = X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)

The 1 in the second position tells reshape that there is a single column only, and the -1 tells it to work out the number of rows automatically for that single column.

Or do as suggested in the other answers by @MaxU and @Wen.
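
As a rough illustration of how -1 is resolved, here is a NumPy-only sketch. The array of six integers is an arbitrary stand-in, not data from the question.

```python
import numpy as np

a = np.arange(6)          # shape (6,)

col = a.reshape(-1, 1)    # 6 rows, 1 column: 6 samples, 1 feature
row = a.reshape(1, -1)    # 1 row, 6 columns: 1 sample, 6 features
pairs = a.reshape(-1, 2)  # -1 is inferred as 3, because 6 / 2 = 3

print(col.shape, row.shape, pairs.shape)
```

At most one dimension can be given as -1, since NumPy derives it from the total number of elements and the other dimensions.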

Thanks. A variation of this DID work! :) I basically did X_train = np.array(X_train).reshape(-1,1) for all of X_train, y_train, and y_test. I appreciate your answer.
@Morkus You don't need to reshape y_train and y_test. They should stay 1-D arrays.
If I comment out those two lines, I get: "ValueError: Expected 2D array, got 1D array instead:". No idea why, but the error happens on the predict method call. Thx.
@Morkus y_train and y_test are not passed to the predict() method. Please update the question with the code and the error. X_train and X_test need to be reshaped, but not y_train and y_test.
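
The point made in these comments can be sketched as follows. The data is synthetic (an exact straight line invented for this example, not the asker's CSV), and the manual slicing stands in for train_test_split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the CSV: an exact line y = 10 * week + 5.
weeks = np.arange(1, 11)
counts = 10 * weeks + 5

X_train, X_test = weeks[:8], weeks[8:]
y_train, y_test = counts[:8], counts[8:]

regr = LinearRegression()
# Only the features are reshaped; y_train stays 1-D.
regr.fit(X_train.reshape(-1, 1), y_train)

# predict() validates its input the same way fit() does,
# so a 1-D X_test raises ValueError here as well.
try:
    regr.predict(X_test)
    predict_1d_failed = False
except ValueError:
    predict_1d_failed = True

y_pred = regr.predict(X_test.reshape(-1, 1))
print(predict_1d_failed, y_pred.shape, y_test.shape)
```

Because predict() returns a 1-D array for a 1-D target, y_pred can be compared with the unreshaped y_test directly.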

BENY · Accepted Answer · 2017-12-11 21:35:20Z

Change your fit call:

# .values gives a NumPy array; [:, None] then adds a second axis
regr.fit(X_train.values[:, None], y_train)

Morkus Over a year ago

Thanks, but this didn't work. What does the colon mean when you have something like [:, None]?

BENY Over a year ago

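@Morkus It just converts the 1-D array to a 2-D ndarray: the colon selects every element along the existing axis and None adds a new axis, so shape (n,) becomes (n, 1).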
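
A minimal NumPy sketch of what [:, None] does. The four-element array is an arbitrary example, not the asker's data.

```python
import numpy as np

a = np.array([1, 2, 3, 4])   # shape (4,)

# ":" keeps every element along the existing axis;
# None (an alias for np.newaxis) inserts a new axis of length 1.
b = a[:, None]

print(a.shape, b.shape)
print(np.array_equal(b, a.reshape(-1, 1)))
```

This indexing is a NumPy feature. Applying it directly to a pandas Series may not be supported, depending on the pandas version, so converting with .values first is the safer assumption.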

MaxU - stand with Ukraine · Accepted Answer · 2017-12-11 21:39:06Z

Try this:

X_train, X_test, y_train, y_test = train_test_split( df[['ErrorWeek']], df['ErrorCount'], random_state=0)

PS: note the additional square brackets in df[['ErrorWeek']]. They select a one-column DataFrame (2-D) instead of a Series (1-D).
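
The difference between single and double brackets can be checked with a short sketch. The DataFrame is built from the five rows in the question's df.head(); using only those rows is an assumption for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Five rows copied from the question's df.head() output.
df = pd.DataFrame({
    "ErrorWeek": [1, 2, 3, 4, 5],
    "ErrorCount": [80, 118, 249, 397, 159],
})

single = df["ErrorWeek"]     # Series, 1-D
double = df[["ErrorWeek"]]   # DataFrame with one column, 2-D

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)

# The split preserves dimensionality, so X_train is already 2-D.
X_train, X_test, y_train, y_test = train_test_split(
    double, df["ErrorCount"], random_state=0
)
print(X_train.shape, y_train.shape)
```

The inner brackets form a list of column names, and indexing a DataFrame with a list always returns a DataFrame, even when the list has one entry.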

Morkus Over a year ago

Thank you for your reply. I'm a bit confused about the double square brackets and what they mean, but this too didn't work.

Caterina De Franco · Accepted Answer · 2019-10-10 13:45:54Z

Apparently sklearn wants x to be two-dimensional, for example a pandas.core.frame.DataFrame, because with one-dimensional input it cannot distinguish between a single feature with n samples and n features with one sample. y, on the other hand, can be a single column, that is, a pandas.core.series.Series. Therefore, in your example, you should turn x into a pandas.core.frame.DataFrame.

As already pointed out by @MaxU:

x = df[['ErrorWeek']]  # double brackets
y = df['ErrorCount']   # single brackets
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

Daniel · Accepted Answer · 2021-05-03 05:38:30Z

X_train.reshape(-1,1) won't work because X_train is a pandas Series, not a NumPy array. You'll need to use X_train = X_train.values.reshape(-1,1) instead.
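
A small sketch of the conversion. The Series below is a hypothetical stand-in for the question's X_train.

```python
import pandas as pd

# Hypothetical Series standing in for the question's X_train.
X_train = pd.Series([1, 2, 3, 4, 5], name="ErrorWeek")

# .values returns the underlying NumPy array, which has reshape().
X_2d = X_train.values.reshape(-1, 1)

# to_numpy() is the equivalent spelling in newer pandas releases.
X_2d_alt = X_train.to_numpy().reshape(-1, 1)

print(X_2d.shape, (X_2d == X_2d_alt).all())
```

X_test would need the same treatment before being passed to predict().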

answered May 3, 2021 at 4:13 by Anirudh Sharma; edited May 3, 2021 at 5:38 by Daniel
