$\begingroup$

My data is based on survey responses. The independent variable is an aggregate of three Likert items rated $1$ to $5$ (the IV sums the three items and divides by $3$, i.e. takes their mean). The four dependent variables are single Likert items rated $1$ to $5$.

The obvious (but perhaps controversial) solution is to run four simple linear regression models, one per dependent variable. However, my concern is finding significant results from chance alone.
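If the four separate regressions are run anyway, the chance-finding concern can be mitigated by adjusting the four p-values for multiple comparisons. A minimal sketch with base R's p.adjust (the p-values below are hypothetical, purely for illustration):

```r
# Hypothetical p-values from four separate simple regressions
pvals <- c(0.012, 0.030, 0.041, 0.200)

# Holm's method controls the family-wise error rate and is
# uniformly more powerful than a plain Bonferroni correction
p.adjust(pvals, method = "holm")
```

This does not address the correlation among the four outcomes, which is what the answers below take up.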

I looked into multivariate multiple linear regression, but this analysis requires more than one IV, which does not help my case.

$\endgroup$
    $\begingroup$ Multivariate regression with 1 IV is possible; see analysis of multiple flower characteristics as a function of species in an Appendix to the Fox and Weisberg text. A bigger issue is using ordinary linear regression on Likert-item outcomes. See this page and its links. Ordinal regression is probably better; the R mvord package handles multiple outcomes. $\endgroup$ Commented Dec 26, 2023 at 21:13

2 Answers

$\begingroup$

Your independent variable appears to be a composite of three items that is supposed to represent something intangible (e.g. anxiety, socioeconomic status), and your four DVs may also jointly represent an intangible. Specifically, the items are manifest variables because they are directly observed, and the intangibles (the constructs you are trying to capture) are latent variables because they are observed only indirectly, via the manifest variables.

I feel the most obvious solution to your problem is some kind of structural equation model (SEM), in which the DV latent variable is regressed on the IV latent variable. Here is an overly simplified simulation of your data (some of what I write here is a bit lazy, but it still demonstrates what I am trying to convey). I simulate the data in R and fit the model with the lavaan package, a common package for SEM.

```r
#### Simulated Latents ####
set.seed(123)
n <- 1000                  # number of observations
IV <- rnorm(n)             # latent variable IV
DV <- 0.5*IV + rnorm(n)    # latent variable DV

#### Simulated Manifests ####
IV1 <- IV + rnorm(n)
IV2 <- IV + rnorm(n)
IV3 <- IV + rnorm(n)
DV1 <- DV + rnorm(n)
DV2 <- DV + rnorm(n)
DV3 <- DV + rnorm(n)
DV4 <- DV + rnorm(n)

#### Combine Data ####
dat <- data.frame(IV1, IV2, IV3, DV1, DV2, DV3, DV4)

#### Construct SEM Model ####
model <- '
  # latent variable definitions
  IV =~ IV1 + IV2 + IV3
  DV =~ DV1 + DV2 + DV3 + DV4

  # regression
  DV ~ IV
'

#### Fit Model ####
library(lavaan)
fit <- sem(model, data = dat)
summary(fit, fit.measures = TRUE)

#### Plot ####
semPlot::semPaths(fit)
```

The plot below shows the constructed model (unlabeled):

[Path diagram: latent IV and DV as circles, their manifest indicators as squares]

The circles represent the latent variables (IV and DV) and the squares are the items that represent them, the manifest variables. The lines drawn between the circles and squares are paths whose coefficients estimate how much each item "loads" onto the latent variable, essentially how strongly it relates to that latent variable. The semicircular arrows are the variances of each variable.

That information isn't easy to see in the default plot, so I adjust the plotting code here. The standardized loadings are shown below, where each number on an arrow represents how much that item "loads" onto its latent variable. The arrow between the IV and DV is the regression path, which shows that the relationship between the two is $\beta = .51$, very close to what we specified in our simulated data:

```r
#### Plot ####
semPlot::semPaths(
  fit,
  "std",
  layout = "spring",
  label.cex = 1,
  edge.label.cex = 1.5
)
```

[Standardized path diagram: item loadings on each latent, with the IV → DV path of .51]

The full model summary can be printed with summary(fit, fit.measures = TRUE). I do not go into detail on it here, but examining these fit measures is a necessary part of fitting such models:

```
lavaan 0.6.16 ended normally after 29 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        15
  Number of observations                          1000

Model Test User Model:
  Test statistic                                23.797
  Degrees of freedom                                13
  P-value (Chi-square)                           0.033

Model Test Baseline Model:
  Test statistic                              2476.670
  Degrees of freedom                                21
  P-value                                        0.000

User Model versus Baseline Model:
  Comparative Fit Index (CFI)                    0.996
  Tucker-Lewis Index (TLI)                       0.993

Loglikelihood and Information Criteria:
  Loglikelihood user model (H0)             -11432.560
  Loglikelihood unrestricted model (H1)     -11420.661
  Akaike (AIC)                               22895.119
  Bayesian (BIC)                             22968.736
  Sample-size adjusted Bayesian (SABIC)      22921.095

Root Mean Square Error of Approximation:
  RMSEA                                          0.029
  90 Percent confidence interval - lower         0.008
  90 Percent confidence interval - upper         0.047
  P-value H_0: RMSEA <= 0.050                    0.975
  P-value H_0: RMSEA >= 0.080                    0.000

Standardized Root Mean Square Residual:
  SRMR                                           0.020

Parameter Estimates:
  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)
  IV =~
    IV1               1.000
    IV2               1.014    0.059   17.161    0.000
    IV3               1.031    0.060   17.130    0.000
  DV =~
    DV1               1.000
    DV2               1.097    0.049   22.448    0.000
    DV3               1.055    0.049   21.604    0.000
    DV4               1.093    0.050   21.909    0.000

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  DV ~
    IV                0.565    0.048   11.664    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .IV1               0.929    0.062   14.940    0.000
   .IV2               0.961    0.064   14.978    0.000
   .IV3               1.014    0.067   15.143    0.000
   .DV1               1.007    0.058   17.476    0.000
   .DV2               0.913    0.058   15.833    0.000
   .DV3               1.051    0.061   17.143    0.000
   .DV4               1.048    0.063   16.729    0.000
    IV                0.972    0.087   11.115    0.000
   .DV                0.886    0.075   11.787    0.000
```

This is just scratching the surface but gives you at least a conceptual introduction to what you can do for your case. To learn more, a good starting place is either Kline's book for conceptual knowledge or Beaujean's book for programming it in R.

$\endgroup$
$\begingroup$

Useful comments and an answer have already been given that should make you reconsider whether you really should do what you planned. Still, if you would nevertheless like to continue, I'll briefly describe below how it can be done.

You could arrange your data in long format: four records per person, one for each dependent variable, each containing the value of that dependent variable and the value of the independent variable (which is the same across all four records). Also add an indicator, 1–4, specifying which dependent variable the record refers to.

With these data you could run a "multivariate response model" as described, e.g., here in chapter 14. In SPSS you can fit such a model with procedure MIXED in combination with the REPEATED option and an unstructured covariance matrix. In R you can do this with, e.g., package glmmTMB in combination with the option "dispformula = ~ 0" and, again, an unstructured covariance matrix; gls from package nlme can also be used, likewise with an unstructured covariance matrix.

The unstructured covariance matrix is important here because you would probably want a different variance for each of your dependents, and different covariances (or correlations) between them as well. Simpler structures can be used too.
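A minimal sketch of the glmmTMB approach just described, using simulated data (the column names id, IV, and DV1–DV4 are hypothetical stand-ins for your survey variables):

```r
# Hypothetical wide data: one row per person
set.seed(1)
n <- 200
dat <- data.frame(
  id  = 1:n,
  IV  = rnorm(n),
  DV1 = rnorm(n), DV2 = rnorm(n), DV3 = rnorm(n), DV4 = rnorm(n)
)

# Reshape to long format: four records per person, with an
# indicator "item" (1-4) for which DV the record refers to
long <- reshape(
  dat,
  direction = "long",
  varying   = c("DV1", "DV2", "DV3", "DV4"),
  v.names   = "y",
  timevar   = "item",
  idvar     = "id"
)
long$item <- factor(long$item)

# Multivariate response model: dispformula = ~ 0 removes the
# usual residual variance, so the us() term supplies the full
# unstructured 4x4 covariance matrix of the four DVs
library(glmmTMB)
fit <- glmmTMB(
  y ~ 0 + item + item:IV + us(0 + item | id),
  dispformula = ~ 0,
  data = long
)
summary(fit)
```

The fixed-effects part `0 + item + item:IV` gives each dependent variable its own intercept and its own slope on the IV, so the four regressions are fitted jointly while the covariance among the outcomes is modeled.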

$\endgroup$
