7
$\begingroup$

I'm having difficulty finding a definitive way to determine whether I can use a percentage measure as an independent variable in multiple linear regression or not.

From my understanding, the percentage can't be considered a true continuous measure for some reason and violates some assumptions of the regression model.

Edit 1: For example, I have seen the argument that percentage data is discrete because the underlying data that the percentages are calculated from is discrete.

Can someone explain why percentages aren't true continuous measures and in what cases I could use a percentage as an independent variable?

Edit 2: For further clarity, I will explain specifically what I'm hoping to accomplish. The goal is to take a dependent variable (length of time) and claim it is explained by several independent variables (some dummies, and one percentage that isn't restricted to any particular values for any observation). I know the assumption for linear regression is that the independent variables will be continuous measures, which is why I use dummy variables for the dichotomous categorical variables. I'm just trying to make sure I don't need a different analytical technique altogether because the percentages are technically discrete (is this even necessarily true?).

Edit 3: In the interest of complete specificity,

DV: length of maternity leave taken. IVs: percentage of normal salary paid by employer during leave, and other dummies not relevant to the question.

$\endgroup$
6
  • 2
    $\begingroup$ Can you point us to, and ideally edit into your question, the reasons why people have told you not to do this so someone here can argue against them? $\endgroup$ Commented Nov 26, 2016 at 16:19
  • $\begingroup$ If you have a particular research problem or data set in mind, please provide some details. You are more likely to get a useful answer if your question is more specific. Also, please follow the advice from @mdewey about editing your question to include the arguments you've heard against percentages as independent variables. $\endgroup$ Commented Nov 26, 2016 at 16:25
  • 1
    $\begingroup$ Why can't percentages be continuous? Also, why the restriction to continuous IVs? $\endgroup$ Commented Nov 26, 2016 at 16:25
  • $\begingroup$ I've edited the question to be more specific. The general rule I'm attempting not to violate is this one from my textbook: Linear regression is based off of three assumptions. 1. Linearity 2. Normality (This one being the one I'm most worried I'm violating in some way.) 3. Homoscedasticity Under normality it states that both the dependent variable and independent variables should be continuous and normally distributed. Categorical independent variables, however, may be incorporated as dummy variables. I get the feeling that percentages aren't capable of being dummies, though. $\endgroup$ Commented Nov 26, 2016 at 16:36
  • 1
    $\begingroup$ Oops, actually that's if it is your outcome variable. Still length of time cannot be <0 so still a problem. Time to event is asked for. $\endgroup$ Commented Nov 27, 2016 at 7:28

3 Answers

8
$\begingroup$

Percentages can be treated as continuous: on the interval [0, 100], or on [0, 1] when expressed as proportions. There is no reason why percentages can't be independent variables in linear regression. In fact, there is no requirement that independent variables be continuous at all. Indicator variables are often used as independent variables in regressions.
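
To make this concrete, here is a minimal sketch on simulated data, assuming NumPy is available. The variable names (`pct_salary`, `has_partner`) and the coefficient values are made up for illustration and are not taken from the question's data set. It fits ordinary least squares with one percentage predictor and one 0/1 indicator and shows that the known coefficients are recovered.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical predictors: a percentage on [0, 100] and a 0/1 indicator.
pct_salary = rng.uniform(0, 100, size=n)
has_partner = rng.integers(0, 2, size=n)

# Simulated outcome with known coefficients (values invented for illustration).
true_beta = np.array([10.0, 0.2, 3.0])  # intercept, percentage, dummy
X = np.column_stack([np.ones(n), pct_salary, has_partner])
y = X @ true_beta + rng.normal(0, 2, size=n)

# Ordinary least squares fit.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients:", beta_hat)
```

Neither predictor is normally distributed, yet the estimates land close to the values used to simulate the data.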

$\endgroup$
4
  • $\begingroup$ Thank you so much for your help! This certainly makes sense. $\endgroup$ Commented Nov 26, 2016 at 17:27
  • $\begingroup$ Do you happen to have any sources that corroborate this? It does seem to contend with my text book, and I'm sure my professor will want confirmation from some citation. $\endgroup$ Commented Nov 26, 2016 at 17:34
  • $\begingroup$ Why would anyone give me a negative vote for this reply? $\endgroup$ Commented Nov 27, 2016 at 7:59
  • $\begingroup$ I know a lot of books on regression. I do not think they specifically address the issue of using proportions as predictor variables. $\endgroup$ Commented Nov 29, 2016 at 3:26
13
$\begingroup$

The assumption of normality to which you refer does not apply to any of the predictors (after all, how could a binary predictor be normal?), nor does it apply to the marginal distribution of the outcome. It applies to the model errors, which you assess through the residuals from your fitted model. So at this stage, before you have fitted the model, you do not know whether it holds or not. Similarly, the usual check for homoscedasticity is based on plotting the residuals against the fitted values. The question of continuity is more subtle, but no measured variable, even if theoretically continuous, is truly continuous when actually measured to finite precision.
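
As a rough illustration of checking the residuals rather than the variables themselves, here is a sketch on simulated data, assuming NumPy. The predictors, coefficients, and the `skewness` helper are hypothetical, and the numeric summaries stand in for the usual residual plots.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical predictors: neither a uniform percentage nor a dummy is normal.
pct = rng.uniform(0, 100, size=n)
dummy = rng.integers(0, 2, size=n).astype(float)
X = np.column_stack([np.ones(n), pct, dummy])
y = 5.0 + 0.1 * pct + 2.0 * dummy + rng.normal(0, 1, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
resid = y - fitted

def skewness(v):
    """Sample skewness: close to 0 for a symmetric distribution."""
    z = (v - v.mean()) / v.std()
    return float(np.mean(z ** 3))

# Normality is judged on the residuals, not on the predictors.
print("residual skewness:", skewness(resid))

# Crude homoscedasticity check: residual spread for low vs. high fitted values.
cut = np.median(fitted)
low, high = resid[fitted <= cut], resid[fitted > cut]
print("residual SD (low, high fitted):", low.std(), high.std())
```

In practice you would look at a residuals-versus-fitted plot and a normal Q-Q plot; the two printed summaries are only a quick numeric stand-in for those.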

If I were modelling length of stay, I would be more worried about the skew, and also about whether some observations are censored because the women have not returned to work yet. Have you considered using a time-to-event model (for example, the Cox proportional hazards model)?

Another concern, depending on the rules in your jurisdiction, is that if maternity pay is at a certain level for $j$ months, a lower level for $k$ months, and then stops, you will get bunching of values at $j$ and $k$ (I would have thought).

$\endgroup$
3
  • $\begingroup$ Thank you so much for your help! All of these points are quite valid. Unfortunately, I'm unfamiliar with the Cox model, and at a glance, it seemed above my current knowledge of statistics. Fortunately, I'm not gunning for publication just yet. This is merely a practice in developing basic research designs. $\endgroup$ Commented Nov 26, 2016 at 17:30
  • 3
    $\begingroup$ @SeanConner : analyzing time-to-event is an important basic research design. It allows you to handle cases where maternity leave (in your example) is still in progress and hasn't ended yet. How could you handle such cases with linear regression? It's important to match the correct analysis design to the particular problem at hand. $\endgroup$ Commented Nov 26, 2016 at 17:52
  • $\begingroup$ @EdM, definitely agree with you there. The particular data set I'd be using would strictly sample from women where leave has concluded. Furthermore, I'd be strictly sampling women who have already returned to work or resolved to not return for at least one year. I don't think I've violated the temporal precedence requisite of causal inference. Correct me if I'm wrong, please. $\endgroup$ Commented Nov 26, 2016 at 17:57
0
$\begingroup$

Suppose you have a model

$$Y = B_1 X_1 + B_2 X_2 + E,$$

where $E \sim N(0,1)$.

Let $S_1$, $S_2$ be the column totals of $X_1$, $X_2$ respectively, and let $X_3$, $X_4$ be the corresponding percentages, so that $X_3 = 100\,X_1/S_1$ and $X_4 = 100\,X_2/S_2$.

Then the model with percentages is $Y = B_3 X_3 + B_4 X_4 + E'$.

The estimates are $B = (X'X)^{-1}X'Y$ in each case. Because each percentage is just the original variable multiplied by the constant $100/S_i$, the fitted values are unchanged and the estimates are related by

$B_3 = B_1 S_1/100$ and $B_4 = B_2 S_2/100$.
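
The effect of rescaling can be checked numerically. The sketch below assumes NumPy and uses simulated data with invented coefficients. It fits the no-intercept model once on the raw predictors and once on their percentages of the column totals, then compares the estimates: multiplying a predictor by the constant $100/S_i$ divides its coefficient by that same constant, and the fitted values do not change.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Hypothetical positive-valued predictors and a simulated outcome.
X = rng.uniform(1, 10, size=(n, 2))
y = X @ np.array([1.5, -0.7]) + rng.normal(0, 1, size=n)

# Express each column as a percentage of its column total.
S = X.sum(axis=0)
X_pct = X / S * 100

b_raw, *_ = np.linalg.lstsq(X, y, rcond=None)
b_pct, *_ = np.linalg.lstsq(X_pct, y, rcond=None)

# Each predictor was multiplied by 100 / S, so its coefficient is
# divided by 100 / S, i.e. multiplied by S / 100.
print("coefficients on percentages:", b_pct)
print("raw coefficients times S/100:", b_raw * S / 100)
```

The two printed vectors agree up to floating-point error, so switching to percentages changes only the units of the coefficients, not the fit.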

$\endgroup$
