
I'm currently doing the Andrew Ng machine learning course on Coursera, and in Week 2 he discusses feature scaling.

I have seen the lecture and read many posts; I understand the reasoning behind feature scaling (basically to make gradient descent converge faster by representing all the features on roughly the same scale).

My problem arises when I try to do it. I'm using Octave, and I have the code for gradient descent with linear regression set up: it computes the 'theta' vector for the hypothesis just fine for unscaled values, giving accurate predictions.

When I use scaled values of the input matrix X and the output vector Y, the computed values of theta and the cost function J(theta) are different from those for the unscaled values. Is this normal? How do I 'undo' the scaling, so that when I test my hypothesis with real data, I get accurate results?

For reference, here is the scaling function I am using (in Octave):

function [scaledX, avgX, stdX] = feature_scale(X)
  is_first_column_ones = 0;            % flag: is the first column all ones?
  if sum(X(:, 1) == 1) == size(X, 1)   % the first column is ones
    is_first_column_ones = 1;
    X = X(:, 2:end);                   % strip away the first column
  end
  stdX = std(X);
  avgX = mean(X);
  scaledX = (X - avgX) ./ stdX;
  if is_first_column_ones              % add back the column of ones; it requires no scaling
    scaledX = [ones(size(X, 1), 1), scaledX];
  end
end

Do I scale my test input, scale my theta, or both?

I should also note that I'm scaling as such:

scaledX = feature_scale(X);
scaledY = feature_scale(Y);

where X and Y are my input and output respectively. Each column of X represents a different feature (the first column is always 1 for the bias term theta0) and each row of X represents a different training example. Y is a column vector where each row is an output example, corresponding to the same row of X.

e.g. X = [1, x, x^2]:

   1.00000   18.78152   352.74566
   1.00000    0.61030     0.37246
   1.00000   21.41895   458.77124
   1.00000    3.83865    14.73521

Y =

    99.8043
     1.8283
   168.9060
   -29.0058

This data comes from the function y = x^2 - 14x + 10.

  • Multivariate regression is not a synonym of multiple regression: "multivariate" means several response variables. – Commented Nov 30, 2015 at 8:00
  • I'm sorry, I'm fairly new to stats (I'm coming from a programming background) and I thought they were synonymous. – Commented Nov 30, 2015 at 8:30
  • Are the theta values obtained from gradient descent equal to those obtained from the normal equation? – Commented Aug 31, 2017 at 16:14

3 Answers


You should perform feature normalization only on the features, i.e. only on your input vector $x$, not on the output $y$ or on $\theta$. Once you have trained a model with feature normalization, you must apply that same normalization (using the training-set mean and standard deviation) every time you make a prediction. It is also expected that $\theta$ and the cost function $J(\theta)$ differ between the scaled and unscaled fits: the parameters live on a different scale, but the predictions agree. There is never a need to undo feature scaling.
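As a minimal sketch of this workflow (the variable names and tiny dataset are my own, and a closed-form least-squares solve stands in for gradient descent at convergence), in Python with NumPy:

```python
import numpy as np

# Hypothetical example: z-score the inputs only, remember the statistics,
# and reuse them for every later prediction. y is never scaled.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])        # exactly y = 2x + 1

mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_scaled = (X_train - mu) / sigma

# Fit on the scaled features (least squares stands in for gradient descent).
A = np.column_stack([np.ones(len(X_scaled)), X_scaled])
theta, *_ = np.linalg.lstsq(A, y_train, rcond=None)

# Predict: apply the SAME mu/sigma to the new input; nothing is "undone".
x_new = np.array([5.0])
x_new_scaled = (x_new - mu) / sigma
y_pred = theta[0] + theta[1] * x_new_scaled[0]
print(y_pred)  # 11.0, since y = 2x + 1 at x = 5
```

Note that theta itself differs from the unscaled fit (here theta[0] is the mean of y, not the intercept 1), yet the prediction is still correct.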


EDIT: I've made a few changes to the code based on @Yurii's answer.

Okay it seems that after a little fiddling about and checking out this answer, I got it to work:

As mentioned, I used

[scaledX,avgX,stdX]=feature_scale(X)

to scale the features. I then used gradient descent to get the vector theta, which is used to make predictions.

Concretely, I had:

>> Y   % y = a^2 - 14a + 10 as the equation of my data
Y =
   -38.4218
   -31.8576
    74.2568
    38.2865
   -10.7453
    36.6208
   -35.3849
    -9.2554
   137.4463
     3.0049
>> X   % [1, a, a^2] as my features
X =
    1.00000    6.23961   38.93277
    1.00000    4.32748   18.72707
    1.00000   17.64222  311.24782
    1.00000    7.84469   61.53913
    1.00000    1.68448    2.83749
    1.00000    5.45754   29.78479
    1.00000    5.09865   25.99620
    1.00000    1.54614    2.39056
    1.00000   20.28331  411.41264
    1.00000    0.51888    0.26924
>> [scaledX, avgX, stdX] = feature_scale(X)
scaledX =
    1.00000  -0.12307  -0.35196
    1.00000  -0.40844  -0.49038
    1.00000   1.57863   1.51342
    1.00000   0.11646  -0.19711
    1.00000  -0.80287  -0.59922
    1.00000  -0.23979  -0.41463
    1.00000  -0.29335  -0.44058
    1.00000  -0.82352  -0.60228
    1.00000   1.97278   2.19956
    1.00000  -0.97683  -0.61681
avgX =
    7.0643   90.3138
stdX =
    6.7007  145.9833
>> % No need to scale Y
>> [theta, costs_vector] = LinearRegression_GradientDescent(scaledX, Y, alpha=1, number_of_iterations=2000);
FINAL THETA:
theta =
   -16.390
   -70.020
    75.617
FINAL COST: 3.41577e-28

Now, the model has been trained.

When I did not use feature scaling, the theta vector came out to approximately theta = [10, -14, 1], which reflects the function y = x^2 - 14x + 10 that we are trying to fit.

With feature scaling, as you can see, theta is completely different. However, we still use it to make predictions, as follows:

>> test_input = 15;
>> testX = [1, test_input, test_input^2]
testX =
      1    15   225
>> scaledTestX = testX;
>> scaledTestX(2) = (scaledTestX(2) - avgX(1)) / stdX(1);
>> scaledTestX(3) = (scaledTestX(3) - avgX(2)) / stdX(2);
>> scaledTestX
scaledTestX =
    1.00000   1.18431   0.92261
>> final_predicted = (theta') * (scaledTestX')
final_predicted = 25.000
>> % 25 is the correct value:
>> % f(a) = a^2 - 14a + 10, at a = 15 (our input value), is 25
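The same experiment can be sketched in Python with NumPy (my own reconstruction: random inputs on a similar range, Octave-style standard deviation with `ddof=1`, and a least-squares solve standing in for the gradient descent routine above):

```python
import numpy as np

# Build a quadratic dataset y = a^2 - 14a + 10, scale only the non-bias
# feature columns, fit, then predict at a new point using the SAME scaling.
rng = np.random.default_rng(0)
a = rng.uniform(0, 21, size=10)
X = np.column_stack([np.ones_like(a), a, a**2])   # [1, a, a^2]
Y = a**2 - 14*a + 10

avgX = X[:, 1:].mean(axis=0)
stdX = X[:, 1:].std(axis=0, ddof=1)               # Octave's std divides by n-1
scaledX = np.column_stack([X[:, 0], (X[:, 1:] - avgX) / stdX])

# Least squares stands in for gradient descent at convergence.
theta, *_ = np.linalg.lstsq(scaledX, Y, rcond=None)

test_input = 15.0
testX = np.array([1.0, test_input, test_input**2])
scaled_test = testX.copy()
scaled_test[1:] = (testX[1:] - avgX) / stdX
pred = theta @ scaled_test
print(pred)  # ~25.0, i.e. f(15) = 15^2 - 14*15 + 10
```

The fitted theta differs from [10, -14, 1] because it applies to the scaled features, but the prediction at a = 15 is still 25.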

I understand the concepts explained in the previous two answers: after we do feature scaling and compute the intercept (θ0) and the slope (θ1), we get a hypothesis function h(x) that uses the scaled features (assuming univariate, i.e. single-variable, linear regression):

h(x) = θ0 + θ1x' -- (1)

where

x' = (x-μ)/σ -- (2)

(μ = mean of the feature set x; σ = standard deviation of feature set x)

As Yurii said above, we don't scale the target y when doing feature scaling. So to predict y for some input x_m, we simply scale the new value using (2) and feed it into the hypothesis function (1):

x'_m = (x_m - μ)/σ -- (3)

Then use x'_m in (1) to get the estimated y. This works perfectly well in practice.

But I wanted to plot the regression line against the original i.e. unscaled features and target values. So I needed a way to scale it back. Coordinate geometry to the rescue! :)

Equation (1) gives us the hypothesis with the scaled feature x'. And we know (2) is the relation between the scaled feature x' and the original feature x. So we substitute (2) in (1) and we get (after simplification):

h(x) = (θ0 - θ1*μ/σ) + (θ1/σ)x -- (4)

So to plot a line with the original i.e. unscaled features, we just use the intercept as (θ0 - θ1*μ/σ) and the slope as (θ1/σ).

Here is my complete R code which does the same and plots the regression line:

if (exists("dev")) { dev.off() }  # close the plot
rm(list = c("f", "X", "Y", "m", "alpha", "theta0", "theta1", "i"))  # clear variables
f <- read.csv("slr12.csv")        # read in source data (data from here: http://goo.gl/fuOV8m)
mu  <- mean(f$X)                  # mean
sig <- sd(f$X)                    # standard deviation
X <- (f$X - mu) / sig             # feature scaled
Y <- f$Y                          # no scaling of target
m <- length(X)
alpha  <- 0.05
theta0 <- 0.5
theta1 <- 0.5
for (i in 1:350) {
  theta0 <- theta0 - alpha/m * sum(theta0 + theta1*X - Y)
  theta1 <- theta1 - alpha/m * sum((theta0 + theta1*X - Y) * X)
  print(c(theta0, theta1))
}
plot(f$X, f$Y)                          # plot original data
theta0p <- theta0 - theta1*mu/sig       # "unscale" the intercept
theta1p <- theta1 / sig                 # "unscale" the slope
abline(theta0p, theta1p, col = "green") # plot regression line

You can test that theta0p and theta1p are correct above by running lm(f$Y~f$X) to use the inbuilt linear regression function. The values are the same!

> print(c(theta0p, theta1p))
[1] 867.6042128   0.3731579
> lm(f$Y ~ f$X)

Call:
lm(formula = f$Y ~ f$X)

Coefficients:
(Intercept)          f$X
   867.6042       0.3732

[Plot: the original data points with the fitted regression line drawn in green]

