"Value Error: x and y must be same size" error. Multiple Linear Regression

Question

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset=pd.read_csv("Marketing_Data.csv")
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
lin_reg = LinearRegression()
lin_reg.fit(X,y)
y_pred = regressor.predict(X_test)
np.set_printoptions(precision = 2)
plt.scatter(X, y, color = 'red')
plt.plot(X, lin_reg.predict(X), color = 'blue')
plt.title("Sales")
plt.show()

I am trying to write a multiple linear regression. There are three independent variables and one dependent variable. I get a

ValueError: x and y must be the same size

and an empty matplotlib graph.

Traceback:

  File "||file path comes here||\untitled0.py", line 20, in <module>
    plt.scatter(X, y, color = 'red')
  File "C:\Anaconda\lib\site-packages\matplotlib\pyplot.py", line 2890, in scatter
    __ret = gca().scatter(
  File "C:\Anaconda\lib\site-packages\matplotlib\__init__.py", line 1438, in inner
    return func(ax, *map(sanitize_sequence, args), **kwargs)
  File "C:\Anaconda\lib\site-packages\matplotlib\cbook\deprecation.py", line 411, in wrapper
    return func(*inner_args, **inner_kwargs)
  File "C:\Anaconda\lib\site-packages\matplotlib\axes\_axes.py", line 4441, in scatter
    raise ValueError("x and y must be the same size")

I don't know exactly how to write it down, but it starts like this:

youtube,facebook,newspaper,sales
84.72,19.2,48.96,12.6
...
...
– cagatay.e.sahin, Commented Apr 6, 2021 at 18:55

Viktor Åberg · Accepted Answer · 2021-04-06 18:56:28Z

1

plt.scatter expects both x and y to be of shape (n, ), so if your X is 2-or-higher dimensional it won't work.

Since you are doing multiple linear regression and your X has many dimensions, you'll need something other than a scatterplot. (Or just pick one dimension of X to be the x-axis for the plot.)
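
As a minimal sketch of the "pick one dimension" option: the snippet below uses the ten rows the asker quoted from the CSV (not the full file), shows why the sizes differ, and plots a single column against y. The output file name is a made-up placeholder, and the non-interactive backend is only there so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

# The ten rows quoted in the question: youtube, facebook, newspaper, sales
data = np.array([
    [84.72, 19.2, 48.96, 12.6],
    [351.48, 33.96, 51.84, 25.68],
    [135.48, 20.88, 46.32, 14.28],
    [116.64, 1.8, 36, 11.52],
    [318.72, 24, 0.36, 20.88],
    [114.84, 1.68, 8.88, 11.4],
    [348.84, 4.92, 10.2, 15.36],
    [320.28, 52.56, 6, 30.48],
    [89.64, 59.28, 54.84, 17.64],
    [51.72, 32.04, 42.12, 12.12],
])
X = data[:, :-1]  # all columns except the last
y = data[:, -1]   # last column only

print(X.shape, y.shape)  # (10, 3) (10,)
print(X.size, y.size)    # 30 10 -> scatter(X, y) sees 30 x-values and 10 y-values

# Picking one column of X gives a 1-D array with the same size as y
youtube = X[:, 0]
fig, ax = plt.subplots()
points = ax.scatter(youtube, y, color="red")
ax.set_xlabel("youtube")
ax.set_ylabel("sales")
ax.set_title("Sales vs. youtube")
fig.savefig("sales_vs_youtube.png")  # hypothetical output file name
```

The same indexing, X[:, 1] or X[:, 2], selects the other two columns.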

edited Apr 6, 2021 at 18:56

answered Apr 6, 2021 at 18:39


5 Comments

cagatay.e.sahin Over a year ago

Thank you. There are three independent variables, so I assume that there need to be three arrays (?). What would a proper plot look like for multiple linear regression?

Viktor Åberg Over a year ago

Plotting with 3 independent variables sounds pretty difficult, since it would require 4 dimensions to show the data. Here is someone else who seems to have the same problem; you could try some of the solutions in the answer, or simply plot each variable on its own so you get 3 plots.
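
One way the "3 plots" idea might look, as a rough sketch: a helper that draws one scatter panel per column of X. The function name, figure size, and output file name are illustrative choices, and the data is again just the ten rows quoted in the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np


def plot_features_separately(X, y, feature_names, target_name="sales"):
    """Draw one scatter panel per column of X against y and return the figure."""
    n_features = X.shape[1]
    fig, axes = plt.subplots(
        1, n_features, figsize=(4 * n_features, 3.5), sharey=True, squeeze=False
    )
    for j, ax in enumerate(axes[0]):
        ax.scatter(X[:, j], y, color="red")  # X[:, j] is 1-D, same size as y
        ax.set_xlabel(feature_names[j])
    axes[0][0].set_ylabel(target_name)
    return fig


# The ten rows quoted in the question: youtube, facebook, newspaper, sales
data = np.array([
    [84.72, 19.2, 48.96, 12.6],
    [351.48, 33.96, 51.84, 25.68],
    [135.48, 20.88, 46.32, 14.28],
    [116.64, 1.8, 36, 11.52],
    [318.72, 24, 0.36, 20.88],
    [114.84, 1.68, 8.88, 11.4],
    [348.84, 4.92, 10.2, 15.36],
    [320.28, 52.56, 6, 30.48],
    [89.64, 59.28, 54.84, 17.64],
    [51.72, 32.04, 42.12, 12.12],
])
fig = plot_features_separately(
    data[:, :-1], data[:, -1], ["youtube", "facebook", "newspaper"]
)
fig.savefig("sales_by_channel.png")  # hypothetical output file name
```

Sharing the y-axis keeps the sales scale identical across panels, so the three channels can be compared side by side.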

cagatay.e.sahin Over a year ago

Thank you for the answer. If I want to predict for new data, would something like this work? print(regressor.predict([[20,20,20]]))

Viktor Åberg Over a year ago

Yup! That's what you are doing with this line: y_pred = regressor.predict(X_test)
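
A small sketch of that call, assuming scikit-learn is installed. The model here is fitted on only the ten rows quoted in the question, so the predicted number itself is not meaningful; the point is the input shape. predict expects a 2-D array with one row per sample and one column per feature, which is why the double brackets matter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The ten rows quoted in the question: youtube, facebook, newspaper, sales
data = np.array([
    [84.72, 19.2, 48.96, 12.6],
    [351.48, 33.96, 51.84, 25.68],
    [135.48, 20.88, 46.32, 14.28],
    [116.64, 1.8, 36, 11.52],
    [318.72, 24, 0.36, 20.88],
    [114.84, 1.68, 8.88, 11.4],
    [348.84, 4.92, 10.2, 15.36],
    [320.28, 52.56, 6, 30.48],
    [89.64, 59.28, 54.84, 17.64],
    [51.72, 32.04, 42.12, 12.12],
])
X, y = data[:, :-1], data[:, -1]

regressor = LinearRegression()
regressor.fit(X, y)

# One new sample with three features -> shape (1, 3)
new_spend = np.array([[20, 20, 20]])
prediction = regressor.predict(new_spend)
print(prediction.shape)  # (1,) -> one predicted sales value per input row

# A flat list has no sample dimension, so scikit-learn rejects it
try:
    regressor.predict([20, 20, 20])
    rejected_1d = False
except ValueError:
    rejected_1d = True
print(rejected_1d)  # True
```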

cagatay.e.sahin Over a year ago

Thanks! I think I'm getting a better grasp of it now. When I get 15 rep, I'll come back and give an upvote.

nahar · Accepted Answer · 2021-04-06 19:02:40Z

I think you are getting the error because of how you are using .iloc for the variables X and y. I don't know what your CSV data looks like, so apologies if this is not what you are looking for...

Your X .iloc returns a 2-D array of shape (N, M), kind of like a matrix (pandas/numpy treats it as an array): all N rows of your dataset and every column except the last (you are telling it to leave out the last column with :-1).

Your y .iloc returns a 1-D array of shape (N,) holding the last column of your dataset.

It looks like:

x = dataset.iloc[:, :-1].values
>>> [['Col1Row1_val', 'Col2Row1_val']
     ['Col1Row2_val', 'Col2Row2_val']]
y = dataset.iloc[:, -1].values
>>> ['lastColRow1_val', 'lastColRow2_val']

The x and y .iloc selections should be composed so that x and y have the same size, for example two 1-D arrays of length N each.

Or use a plot other than scatter.
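
To check those shapes directly, here is a hedged sketch that reads the ten rows the asker quoted from an in-memory string instead of Marketing_Data.csv (the real file is not available here, so N is 10 in this example).

```python
import io

import pandas as pd

# The header and ten rows quoted by the asker, standing in for the real CSV file
csv_text = """youtube,facebook,newspaper,sales
84.72,19.2,48.96,12.6
351.48,33.96,51.84,25.68
135.48,20.88,46.32,14.28
116.64,1.8,36,11.52
318.72,24,0.36,20.88
114.84,1.68,8.88,11.4
348.84,4.92,10.2,15.36
320.28,52.56,6,30.48
89.64,59.28,54.84,17.64
51.72,32.04,42.12,12.12
"""
dataset = pd.read_csv(io.StringIO(csv_text))

X = dataset.iloc[:, :-1].values  # every column except the last
y = dataset.iloc[:, -1].values   # the last column only

print(X.shape)  # (10, 3) -> 2-D: N rows, 3 feature columns
print(y.shape)  # (10,)   -> 1-D: N values

# Selecting a single column by position gives a 1-D array that matches y
first_col = dataset.iloc[:, 0].values
print(first_col.shape == y.shape)  # True
```

Printing X.shape and y.shape before plotting is a quick way to confirm whether the two arguments to plt.scatter can line up.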

Thanks. I took the dataset from the internet for practice. It looks like this:

youtube,facebook,newspaper,sales
84.72,19.2,48.96,12.6
351.48,33.96,51.84,25.68
135.48,20.88,46.32,14.28
116.64,1.8,36,11.52
318.72,24,0.36,20.88
114.84,1.68,8.88,11.4
348.84,4.92,10.2,15.36
320.28,52.56,6,30.48
89.64,59.28,54.84,17.64
51.72,32.04,42.12,12.12
...
...

So, the last column is the result, the dependent variable. My goal is to train the model so that it can predict the sales from how much advertising is put into each medium.
