
I have collected hundreds of electrophysiological indices, along with several demographic variables, on about 400 healthy subjects. The indices are measured repeatedly at six time points, and I would like to describe the variation and trends from Time0 to Time5. Basically, I'm trying to look at the mean response of each physiological index at the different time points while accounting for between-subject variability, so I am fitting linear mixed models (LMMs), one per index. The problem is that this approach is time-consuming for hundreds of variables (one dependent variable at a time, i.e. hundreds of separate models) and is also hard to summarize in a reporting document. An alternative to fitting hundreds of models could be dimensionality reduction, but that would misrepresent the original variables.
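
For concreteness, here is a minimal sketch of the kind of per-index loop I mean, in Python with statsmodels (the data frame, effect sizes, and column names are illustrative, not my real data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Illustrative long-format data: 400 subjects x 6 time points,
# with a few example index columns (the real data has hundreds).
n_subj, n_time = 400, 6
df = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subj), n_time),
    "time": np.tile(np.arange(n_time), n_subj),
})
index_cols = ["idx_001", "idx_002", "idx_003"]
for col in index_cols:
    subj_effect = rng.normal(0, 1, n_subj)[df["subject"]]
    df[col] = 0.1 * df["time"] + subj_effect + rng.normal(0, 1, len(df))

# One LMM per index: fixed effect of time, random intercept per subject.
fits = {}
for col in index_cols:
    model = smf.mixedlm(f"{col} ~ C(time)", data=df, groups="subject")
    fits[col] = model.fit(reml=True)

print(fits["idx_001"].summary())
```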

My question is the following: is it even acceptable, from a scientific perspective, to build hundreds of models? I highly doubt that a data analyst would do that, so I would appreciate any comments or thoughts on this.

  • It's hard to answer this question without knowing more about the fundamental scientific hypothesis or hypotheses you wish to test. Please edit the question to provide more background about what the "hundreds of physiological indexes" represent and what you hope to learn from/about them. It would also help to say more about the scope of your data: how many individuals, typical number of measurements/time points per index per individual, whether all individuals are assessed on all the indices, and so forth. (Commented Oct 18, 2023 at 16:34)
  • @EdM the variables are electrophysiological indices. The sample size is about 400, and I have 6 time points, so I would like to see the variation and trends from Time0 to Time5. In total I have more than 300 variables, all measured 6 times on each individual. (Commented Oct 18, 2023 at 18:51)
  • Is there some treatment or intervention distinguishing individuals or occurring at some point during the 6 sets of observations, or are you just trying to describe "variation and trends" over time? (Commented Oct 18, 2023 at 18:54)
  • No, they all fall in the control group, so I am just describing the variation and trends over time. (Commented Oct 19, 2023 at 13:43)
  • Please edit the question to include the information that you provided in comments. Comments are easily overlooked and can be deleted; editing the question will put it back on top of the list of "active" questions and might elicit an answer before I can get back to this next week. If this is purely descriptive then you don't have to worry so much about "statistical significance" and the like. (Commented Oct 19, 2023 at 14:08)

2 Answers

Score: 6

I agree with @rep_ho's "do whatever is commonly done with the data modality you are working with".

However, if you are going to quote p-values (for example), you almost certainly need some form of multiple-comparisons correction. In the "'omics" world (microbiome data, RNA-seq, etc.), people almost always use false discovery rate (FDR) corrections, which are popular both because they do what people actually want (control the fraction of significant results that are false positives, rather than the overall probability of any false positive) and because they are much less severe than methods such as Holm or Bonferroni that control the overall (family-wise) error rate.
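
For example, if you end up with one p-value per index (say, for the overall effect of time), a Benjamini-Hochberg FDR adjustment in Python/statsmodels is one line; a minimal sketch (the p-values below are made up for illustration):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Illustrative p-values, e.g. one per index for the overall time effect.
pvals = np.array([0.0005, 0.004, 0.02, 0.08, 0.30, 0.77])

# Benjamini-Hochberg false discovery rate correction at the 5% level.
reject, pvals_fdr, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print(reject)     # which indices remain "significant" after FDR control
print(pvals_fdr)  # FDR-adjusted p-values
```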

It would be most elegant to fit a single model to all of your indices at once, with both index and subject as random effects (a rough sketch of such a model is given after the list below); the reasons not to are:

  • it would make the full model very large (and harder to parallelize than fitting every index separately)
  • it would be cool to model the correlation among indices, but that would almost certainly require advanced techniques (reduced-rank/factor-analytic models for the covariance)
  • it might make significance testing of individual indices more difficult
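
For what it's worth, here is a rough sketch of what such a single "everything at once" model could look like in Python's statsmodels, with crossed random intercepts for subject and index expressed as variance components (simulated data, kept deliberately small; with 400 subjects and hundreds of indices this would be slow):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Very-long format: one row per subject x time point x index.
n_subj, n_time, n_idx = 30, 6, 5
df = pd.DataFrame(
    [(s, t, i) for s in range(n_subj) for t in range(n_time) for i in range(n_idx)],
    columns=["subject", "time", "idx"],
)
df["value"] = (
    0.1 * df["time"]
    + rng.normal(0, 1, n_subj)[df["subject"]]   # subject-level shift
    + rng.normal(0, 1, n_idx)[df["idx"]]        # index-level shift
    + rng.normal(0, 1, len(df))                 # residual noise
)

# Crossed random intercepts for subject and index, written as variance
# components inside a single dummy group that covers all rows.
df["all"] = 1
model = smf.mixedlm(
    "value ~ C(time)",
    data=df,
    groups="all",
    vc_formula={"subject": "0 + C(subject)", "idx": "0 + C(idx)"},
)
fit = model.fit(reml=True)
print(fit.summary())
```

(In practice the indices would also need to be put on a common scale before being stacked like this.)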

Score: 5

To answer your question: yes, from a scientific perspective it is OK to build hundreds of models. For example, in fMRI data analysis we fit one model per voxel (3D pixel) per subject in the brain image, which can be around 500k voxels, and it can be much more depending on the resolution of the scanner. What makes this acceptable or unacceptable is whether and how you correct for multiple comparisons.

The other things you've mentioned are also true. Yes, if you want to fit many models it will be time-consuming, but you can easily parallelize it, since each model can be fitted separately. Dimensionality reduction such as PCA/ICA/NNMF is also fine, and commonly done in these situations.
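
As a rough sketch of the dimensionality-reduction route, PCA on a subjects-by-indices matrix with scikit-learn might look like the following (random numbers stand in for one time point's measurements):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Illustrative matrix: 400 subjects x 300 indices at one time point.
X = rng.normal(size=(400, 300))

# Keep as many components as needed to explain ~90% of the variance.
pca = PCA(n_components=0.90, svd_solver="full")
scores = pca.fit_transform(X)

print(scores.shape)                       # (400, number of retained components)
print(pca.explained_variance_ratio_[:5])  # variance explained by the leading PCs
```

In practice you would standardize the indices first (they are on different scales), and the component scores, rather than the original indices, would then go into the longitudinal models.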

What I suggest is to do whatever is commonly done with the data modality you are working with and go from there.
