0
$\begingroup$

I am working with panel data to estimate spatial spillover effects. Three of my variables, which are in percentage form, have skewness greater than 1. I am not sure whether I should transform them, using a log or some other transformation, or keep them as they are. Since they are already percentages, I am not sure what the best practice is in this case.

Also, one variable, cement production, has a lot of zeros. These zeros are meaningful, because I look at how cement production in one region affects pollution in another region, and some regions clearly do not produce cement at all. Should I transform it, given that it also shows high skewness?

$\endgroup$
1
  • $\begingroup$ Have you seen our more popular answers to this question? For more ideas, also see our popular answers that reference regression diagnostics. $\endgroup$ Commented Jul 30 at 13:47

2 Answers

2
$\begingroup$

Whether or not to apply a log transformation depends on both the characteristics of your variables and the assumptions of your model. If your model assumes normally distributed errors and log-transforming your variables brings them closer to normality, that can be a strong justification for using logs. A more detailed discussion of when and why to log-transform variables can be found here.

In your case, however, you are working with proportion variables bounded between 0 and 1. For such data, the logit transformation is often more appropriate than a standard log. The logit function is defined as:

$$ \text{logit}(x) = \log\left( \frac{x}{1 - x} \right) $$

This transformation maps the interval $(0, 1)$ onto the entire real line, which can help stabilize variance and better satisfy the assumptions of linear models. Just be sure to handle values very close to 0 or 1 carefully. These can be adjusted slightly, for example by replacing 0 with a small number like 0.001, to avoid undefined values.

However, you should be careful with replacing zeros. If you have many of them, I don't think this is a good idea. A recent and very useful paper on this topic is the following.
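As a rough illustration of the logit transformation described above, here is a minimal Python sketch. The function name `logit`, the clipping constant `eps=0.001`, and the example shares are assumptions made for illustration only. It assumes the variables are stored as percentages and divides by 100 first. The clipping step is exactly the "replace 0 with a small number" adjustment, so the same caution about many zeros applies.

```python
import math

def logit(p, eps=0.001):
    """Logit of a proportion p in [0, 1].

    Values are clipped to [eps, 1 - eps] so that exact 0 or 1
    does not produce an undefined logarithm. eps is an assumed,
    arbitrary choice; results can be sensitive to it.
    """
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

# Hypothetical shares in percentage form (e.g. share of forest area).
shares_percent = [0.0, 12.5, 50.0, 87.5, 100.0]

# Convert percentages to proportions before applying the logit.
shares_logit = [logit(s / 100.0) for s in shares_percent]
```

It is worth rerunning the analysis with a different `eps` to see how much the conclusions depend on that choice.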

$\endgroup$
1
  • $\begingroup$ thank you so much for your answer! I am working with spatial durbin model to understand spillover effects. $\endgroup$ Commented Jul 30 at 11:38
0
$\begingroup$

One step further than the suggestion made by Stan would be to use logistic or Poisson regression. These models are feasible if you know the counts and totals underlying the percentages. You would not have to transform your data for these models.

EDIT

Because the question concerns independent variables (predictors), my answer above is not very helpful. Instead, here is an approach that might be helpful for cement production, where about half of the countries have zero production. The following trick could be handy.

Create a dummy variable $D$, being 0 for "not producing cement" and 1 for "producing cement". I understood that your $Cement$ production variable is zero for the non producing countries. Now, calculate the $Cement$ mean for the producing countries only, and subtract this mean from their actual $Cement$ value; for the non producing countries, the $Cement$ variable remains 0.

Now, say the following model holds (I omit the error term):

$Y = b_0+b_1D+b_2Cement$

For non producing countries, this model leads to:

$Y = b_0$

meaning that $b_0$ is mean $Y$ of non producing countries.

For producing countries, we get:

$Y = b_0 + b_1 + b_2Cement = b^* + b_2Cement$

Here $b^* = b_0 + b_1$. Because $Cement$ is centered around the mean for these countries, the value of $b^*$ is the $Y$ mean of the producing countries. In other words, $b_1$ is the difference between the $Y$ means of the producing and non producing countries. $b_2$ has the usual interpretation of a regression coefficient: if $Cement$ increases by 1 (%, that is), the $Y$ variable increases by $b_2$ for the cement producing countries.
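The construction above can be checked numerically. The following is a minimal sketch in plain Python on made-up toy data. The data values, the helper `ols`, and all variable names are hypothetical and only serve to illustrate the interpretation of $b_0$, $b_1$ and $b_2$. It fits an ordinary least squares model with no other predictors, not a spatial model.

```python
from statistics import mean

def ols(X, y):
    """Least squares coefficients via the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

# Hypothetical toy data: four non producing and four producing units.
cement = [0.0, 0.0, 0.0, 0.0, 2.0, 4.0, 6.0, 8.0]
y = [1.0, 2.0, 1.5, 1.5, 5.0, 6.5, 7.0, 9.5]

# Dummy D: 1 for producers, 0 otherwise.
D = [1.0 if c > 0 else 0.0 for c in cement]

# Center Cement within producers only; non producers stay at 0.
producer_mean = mean(c for c in cement if c > 0)
cement_c = [c - producer_mean if c > 0 else 0.0 for c in cement]

# Design matrix columns: intercept, D, centered Cement.
X = [[1.0, d, cc] for d, cc in zip(D, cement_c)]
b0, b1, b2 = ols(X, y)

mean_y_non = mean(yi for yi, d in zip(y, D) if d == 0.0)
mean_y_prod = mean(yi for yi, d in zip(y, D) if d == 1.0)
```

In this toy setup `b0` matches `mean_y_non` and `b0 + b1` matches `mean_y_prod`, because the centered `cement_c` column is orthogonal to both the intercept and `D`. With additional predictors in the model, these equalities would hold only conditionally on those predictors.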

The model would be meaningful if the non producing countries have a mean $Y$ that differs substantially from the prediction one would obtain by extrapolating the regression line (of $Y$ on $Cement$) of the producing countries to value 0 for $Cement$.

In case there are no other predictor variables, one could consider using two separate analyses instead of the above model: (1) a two independent groups t test comparing the $Y$ means of producing and non producing countries and (2) a linear regression of $Y$ on $Cement$ for the data of the producing countries only. The results would be similar to the ones obtained by the above model. However, if there are more predictors involved, having the same effect on $Y$ for both types of countries, the above model can be extended with these additional predictors.

Note that one could still transform the $Cement$ variable for the producing countries before centering it, e.g. by taking the logarithm and then centering the log-transformed values. This would not change the value and interpretation of $b_1$. For $b_2$ you would have a multiplicative interpretation: if you use 2 as the base of the logarithm, adding 1 unit to the log of $Cement$ means multiplying $Cement$ by 2 (doubling cement production).
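A small sketch of the log-then-center variant, again on hypothetical numbers (the values and variable names are assumptions for illustration). The base-2 logarithm is taken for producers only, then centered on the producers' mean. Non producers keep the value 0 and are distinguished by the dummy $D$.

```python
import math
from statistics import mean

# Hypothetical cement production; zeros are non producers.
cement = [0.0, 0.0, 3.0, 6.0, 12.0, 24.0]

# Mean of log2(Cement) over producers only.
log_mean = mean(math.log2(c) for c in cement if c > 0)

# Centered log2(Cement) for producers; 0 for non producers.
log_cement_c = [math.log2(c) - log_mean if c > 0 else 0.0 for c in cement]

# Doubling production (3 -> 6) shifts the transformed value by one unit.
step = log_cement_c[3] - log_cement_c[2]
```

Here `step` is 1 up to floating-point error, which is why $b_2$ can be read as the change in $Y$ associated with a doubling of cement production.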

$\endgroup$
19
  • $\begingroup$ It seems like the transformations under consideration would apply to the features, not to the outcome. $\endgroup$ Commented Jul 30 at 11:48
  • $\begingroup$ @Dave, you may be right! Turkana, could you please clarify this in your question, that would be helpful. Thanks in advance. $\endgroup$ Commented Jul 30 at 12:36
  • $\begingroup$ When the OP says, there is a lot of zeros in the data, this could also be an indicator that a hurdle model—that combines a binary model for the zero/non-zero decision and a truncated count model for the positive counts—might be more appropriate. $\endgroup$ Commented Jul 30 at 17:51
  • $\begingroup$ @BenP Thank you for your answers! I am estimating spatial spillover effects of cement production and I have spatial panel data. I plan to use SLM and SDM. And these percentages, like share of forest area, are the control variables. $\endgroup$ Commented Jul 30 at 19:38
  • $\begingroup$ Also regarding zeros, this is about my variable, cement production. Around 50 percent of regions produce cement and the rest do not, which is why I have many zeros. But I think these zeros are meaningful because I look at how regions that produce cement cause pollution in neighboring regions. $\endgroup$ Commented Jul 30 at 20:42
