import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv("Marketing_Data.csv")
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
lin_reg = LinearRegression()
lin_reg.fit(X, y)
y_pred = regressor.predict(X_test)
np.set_printoptions(precision = 2)

plt.scatter(X, y, color = 'red')
plt.plot(X, lin_reg.predict(X), color = 'blue')
plt.title("Sales")
plt.show()

I am trying to fit a multiple linear regression with three independent variables and one dependent variable. I get

ValueError: x and y must be the same size

and an empty matplotlib graph.

Traceback:

  File "||file path comes here||\untitled0.py", line 20, in <module>
    plt.scatter(X, y, color = 'red')
  File "C:\Anaconda\lib\site-packages\matplotlib\pyplot.py", line 2890, in scatter
    __ret = gca().scatter(
  File "C:\Anaconda\lib\site-packages\matplotlib\__init__.py", line 1438, in inner
    return func(ax, *map(sanitize_sequence, args), **kwargs)
  File "C:\Anaconda\lib\site-packages\matplotlib\cbook\deprecation.py", line 411, in wrapper
    return func(*inner_args, **inner_kwargs)
  File "C:\Anaconda\lib\site-packages\matplotlib\axes\_axes.py", line 4441, in scatter
    raise ValueError("x and y must be the same size")
  • Do you have an example of what your csv looks like? Commented Apr 6, 2021 at 18:42
  • I don't know exactly how to write it down but it starts like this: "youtube,facebook,newspaper,sales 84.72,19.2,48.96,12.6 ... ..." Commented Apr 6, 2021 at 18:55
  • I am using a dataset from the internet for exercising. Commented Apr 6, 2021 at 18:59

2 Answers

1

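plt.scatter requires x and y to contain the same number of elements, typically two 1D arrays of shape (n,). Your X is 2D with shape (n, 3) while y has shape (n,), so their sizes differ and the call raises the ValueError.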

Since you are doing multiple linear regression and your X has many dimensions, you'll need something other than a scatterplot. (Or just pick one dimension of X to be the x-axis for the plot.)
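One way to "pick one dimension of X" is to draw a separate scatter plot per feature. The following is only a sketch under assumptions: the data is synthetic (made-up values standing in for the CSV), and the helper name plot_each_feature is hypothetical, not part of matplotlib.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_each_feature(X, y, feature_names):
    """Scatter y against each column of X in its own subplot (hypothetical helper)."""
    n_features = X.shape[1]
    fig, axes = plt.subplots(1, n_features, figsize=(4 * n_features, 4), sharey=True)
    for i, ax in enumerate(np.atleast_1d(axes)):
        # X[:, i] has shape (n,), the same as y, so scatter accepts it.
        ax.scatter(X[:, i], y, color="red")
        ax.set_xlabel(feature_names[i])
    np.atleast_1d(axes)[0].set_ylabel("sales")
    fig.suptitle("Sales vs. each advertising channel")
    return fig


# Synthetic stand-in for the dataset: 30 rows, 3 features (values are made up).
rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=(30, 3))
y = 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1, size=30)

feature_names = ["youtube", "facebook", "newspaper"]
fig = plot_each_feature(X, y, feature_names)
```

Calling plt.show() afterwards displays the figure. Each panel shows only the marginal relationship between one feature and the target, so it is a visual aid rather than a picture of the fitted multiple regression.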


5 Comments

Thank you. There are three independent variables, so I assume that there need to be three arrays (?). What would a proper plot look like for multiple linear regression?
Plotting with 3 independent variables sounds pretty difficult, since it would require 4 dimensions to show the data. Here is someone else who seems to have the same problem; you could try some of the solutions in the answer, or simply plot each variable on its own so you get 3 plots.
Thank you for the answer. If I want to predict for new data, would something like this work?: print(regressor.predict([[20,20,20]]))
Yup! That's what you are doing with this line: y_pred = regressor.predict(X_test)
Thanks! I think I'm getting a better grasp of it now. When I get 15 rep, I'll come back and give an upvote.
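To illustrate the exchange above about predicting for a new observation, here is a minimal sketch. It assumes synthetic, noise-free data with made-up coefficients in place of the real CSV; only the shape of the input passed to predict is the point.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the advertising data (made-up values, 3 features).
rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=(50, 3))
# Noise-free linear target with arbitrary, assumed coefficients.
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + 0.01 * X[:, 2]

regressor = LinearRegression()
regressor.fit(X, y)

# predict expects a 2D array of shape (n_samples, n_features),
# so a single observation is wrapped in an outer list.
new_sample = [[20, 20, 20]]
prediction = regressor.predict(new_sample)

print(prediction.shape)  # (1,)
print(prediction)
```

Passing a flat list such as [20, 20, 20] instead would fail, because scikit-learn would see a 1D array rather than one row with three features.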
1

I think you are getting the error because of how you are using .iloc for the variables X and y. I do not know what your CSV data looks like, so apologies if this is not what you are looking for.

Your X .iloc returns a 2D array of shape (N, M), kind of like a matrix (pandas/numpy treats it as an array): all N rows of your dataset and every column except the last (you are telling it to ignore the last column with :-1).

Your y .iloc returns a 1D array of length N containing the last column of your dataset.

It looks like:

x = dataset.iloc[:, :-1].values
>>> [['Col1Row1_val', 'Col2Row1_val']
     ['Col1Row2_val', 'Col2Row2_val']]
y = dataset.iloc[:, -1].values
>>> ['lastColRow1_val', 'lastColRow2_val']

For scatter, the x and y selections need to contain the same number of elements: for example, both 1D arrays of length N, or both 2D arrays of the same shape.

Or use a plot other than scatter
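The shapes described above can be checked directly. This sketch uses the first few rows quoted in the comments on this question, loaded from an in-memory string instead of the real file; the variable name x_youtube is just for illustration.

```python
import io

import pandas as pd

# First rows of the dataset as quoted in the comments.
csv_text = """youtube,facebook,newspaper,sales
84.72,19.2,48.96,12.6
351.48,33.96,51.84,25.68
135.48,20.88,46.32,14.28
116.64,1.8,36,11.52
"""
dataset = pd.read_csv(io.StringIO(csv_text))

X = dataset.iloc[:, :-1].values  # every column except the last
y = dataset.iloc[:, -1].values   # only the last column

print(X.shape, X.size)  # (4, 3) 12
print(y.shape, y.size)  # (4,) 4

# A single feature column has the same size as y, so scatter accepts it.
x_youtube = X[:, 0]
print(x_youtube.shape)  # (4,)
```

Since X holds three times as many values as y, plt.scatter(X, y) cannot pair them up, whereas plt.scatter(x_youtube, y) can.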

1 Comment

Thanks. I took the dataset from the internet for practice. It looks like this: "youtube,facebook,newspaper,sales 84.72,19.2,48.96,12.6 351.48,33.96,51.84,25.68 135.48,20.88,46.32,14.28 116.64,1.8,36,11.52 318.72,24,0.36,20.88 114.84,1.68,8.88,11.4 348.84,4.92,10.2,15.36 320.28,52.56,6,30.48 89.64,59.28,54.84,17.64 51.72,32.04,42.12,12.12 ... ..." So, the last column is the result, the dependent variable. My goal is to train the model so that it can predict the sales from how much advertising is spent on each medium.
