$\begingroup$

I ran a regression analysis predicting salary from gender. In the data, female was coded as 2 and male was coded as 1. Then I was asked to change female to -1 and male to 5. In the analysis, a, b, t, and SEb changed.

Why? What is the reasoning behind the coding system here?

$\endgroup$
  • $\begingroup$ We need more information. Why were you asked to change the coding? Is this part of an assignment? It does not make sense to code a dummy variable like this. $\endgroup$ Commented Feb 12, 2017 at 3:12
  • $\begingroup$ Thank you for responding. It is part of an assignment and I think the whole point is explaining why some values changed when the coding changed. I am assuming it has to do with the numbers used for coding. I have read that it is common to use Male= 0 and Female= 1 (where male is the reference group). $\endgroup$ Commented Feb 12, 2017 at 3:25
  • $\begingroup$ Yes, that is the common practice. If the sole aim of the assignment is to show the reason why we use 0 and 1, then I think the question is now redundant. $\endgroup$ Commented Feb 12, 2017 at 3:29
  • $\begingroup$ This question might be of interest: stats.stackexchange.com/questions/16689/… $\endgroup$ Commented Feb 12, 2017 at 3:30

1 Answer

$\begingroup$

It is hard to see without further information why one would like to code a binary variable as $(-1,5)$, but it is fairly easy to see how the coefficient changes with a simple experiment:

Let's create a random data.frame in R with 100 observations, where salary has a mean of 60K and a standard deviation of 15K:

    set.seed(10)
    df <- data.frame(salary = rnorm(100, mean = 60000, sd = 15000),
                     gender = rbinom(100, 1, 0.42))
    df$gender5 <- ifelse(df$gender == 0, -1, 5)

Now gender is coded $(0,1)$ and gender5 is coded $(-1,5)$. Let's regress salary on gender with the original coding and with the new one:

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)    61291       1994  30.736   <2e-16 ***
    gender         -6421       2765  -2.322   0.0223 *

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  60220.7     1692.1  35.589   <2e-16 ***
    gender5      -1070.2      460.9  -2.322   0.0223 *
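The answer does not show the commands behind these two tables. A minimal sketch, assuming they came from plain `lm()` fits on the simulated data (the object names `fit01`, `fit5`, and `slope_ratio` are made up here, and the exact numbers may depend on the R version):

```r
# Simulated data as described in the answer: 0/1 coding plus the -1/5 recode
set.seed(10)
df <- data.frame(salary = rnorm(100, mean = 60000, sd = 15000),
                 gender = rbinom(100, 1, 0.42))
df$gender5 <- ifelse(df$gender == 0, -1, 5)

# Fit the same simple regression under each coding
fit01 <- lm(salary ~ gender,  data = df)
fit5  <- lm(salary ~ gender5, data = df)

# Coefficient tables: Estimate, Std. Error, t value, Pr(>|t|)
print(summary(fit01)$coefficients)
print(summary(fit5)$coefficients)

# The two codes are 6 units apart, so the 0/1 slope is 6 times the -1/5 slope
slope_ratio <- unname(coef(fit01)["gender"] / coef(fit5)["gender5"])
print(slope_ratio)
```

The ratio of the two slopes is what matters here: it equals the gap between the two codes regardless of the simulated values.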

So:

Coefficient: The slope always has a simple meaning: it is the average difference in salary per one-unit difference in the predictor. With the $(0,1)$ coding the two categories are exactly one unit apart, so the slope is the average salary difference between them, which is why this coding is so intuitive and so widely used. The regression equation makes this clear: $\hat{salary}=61,291-6,421\times gender$. If males are coded $1$ and females $0$, then the predicted average salary for males is $61,291-6,421\times 1=54,870$, or simply $6,421$ less than for females.

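When the coding changes, so does the meaning of the slope. Instead of a gap of $1$ between the two codes, we now have a gap of $6$. To predict the average salary for men we compute $\hat{salary}=60,220.7-1,070.2\times 5 \approx 54,870$, which is the same as before up to rounding. Multiplying the new slope by $6$ gives $-1,070.2\times 6\approx-6,421$, and we arrive back at the slope under the first coding scheme $(0,1)$. The intercept changes too, because it is the predicted salary when the predictor equals $0$, and under the $(-1,5)$ coding neither group has that value; the prediction for women is now $60,220.7-1,070.2\times(-1)\approx 61,291$. The model is the same, just much less intuitive to read.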
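The same reasoning can be sketched for any linear recoding. The symbols $x'$, $c$, and $d$ are introduced here for illustration and are not part of the original answer:

$$
\begin{aligned}
x' &= c + d\,x \qquad (d \neq 0) \\
\hat{y} &= a + b\,x = a + b\,\frac{x'-c}{d} = \left(a - \frac{bc}{d}\right) + \frac{b}{d}\,x' \\
a' &= a - \frac{bc}{d}, \qquad b' = \frac{b}{d}, \qquad SE_{b'} = \frac{SE_b}{|d|}, \qquad t' = \frac{b'}{SE_{b'}} = \operatorname{sign}(d)\,t
\end{aligned}
$$

In the simulation $c=-1$ and $d=6$, so $b'=-6,421/6\approx-1,070.2$ and $a'=61,291-(-6,421)(-1)/6\approx 60,220.8$, matching the output up to rounding. For the coding in the question (male $1\to 5$, female $2\to -1$) the map is $x'=11-6x$, so $d=-6$: the slope and its $t$ value also flip sign, while the $p$-value stays the same.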

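Standard Error: Same shtick. The $s.e.$ of the slope is measured in the same units as the slope itself, so stretching the gap between the codes by a factor of $6$ shrinks it by the same factor: $2765/6\approx 461$, which matches the reported $460.9$ up to rounding.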

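T and significance value: For the slope these should not change, and in the output above they do not ($t=-2.322$ and $p=0.0223$ under both codings). If they did, there is probably a problem somewhere. Re-coding the variable rescales the slope and its standard error by the same factor, so their ratio stays the same. The intercept's $t$ value does change ($30.736$ versus $35.589$), because the intercept now describes a different point.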
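These claims can be checked numerically. The following is a sketch on the same simulated data, with hypothetical object names (`s01`, `s5`, `se_ratio`, and so on); it compares the slope rows of the two coefficient tables and the fitted values of the two models:

```r
# Same simulated data as in the answer
set.seed(10)
df <- data.frame(salary = rnorm(100, mean = 60000, sd = 15000),
                 gender = rbinom(100, 1, 0.42))
df$gender5 <- ifelse(df$gender == 0, -1, 5)

fit01 <- lm(salary ~ gender,  data = df)
fit5  <- lm(salary ~ gender5, data = df)
s01 <- summary(fit01)$coefficients
s5  <- summary(fit5)$coefficients

# Standard error of the slope shrinks by the gap between the codes (6)
se_ratio <- s01["gender", "Std. Error"] / s5["gender5", "Std. Error"]

# Slope t value and p-value are identical under both codings
t_same <- isTRUE(all.equal(s01["gender", "t value"], s5["gender5", "t value"]))
p_same <- isTRUE(all.equal(s01["gender", "Pr(>|t|)"], s5["gender5", "Pr(>|t|)"]))

# Both models give the same predicted salary for every observation
fitted_same <- isTRUE(all.equal(fitted(fit01), fitted(fit5)))

print(c(se_ratio = se_ratio, t_same = t_same,
        p_same = p_same, fitted_same = fitted_same))
```

Identical fitted values are the key point: the two codings describe the same model, only parameterised differently.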

$\endgroup$
  • $\begingroup$ @SLeca you are very welcome. If you feel this is satisfactory, feel free to accept this answer which will mark the question as resolved :) $\endgroup$ Commented Feb 17, 2017 at 13:38
