40 percent of fMRI signals do not correspond to actual brain activity (tum.de)
503 points by geox 4 months ago
My previous job was at a startup doing BMI (brain-machine interfaces) for research. For the first time I had the chance to work with expensive neural signal measurement tools (mainly EEG for us, but some teams used fMRI), and I quickly learned how absolutely horrible the signal-to-noise ratio (SNR) is in this field.
And how it was almost impossible to reproduce many published and well-cited results. It was both exciting and jarring to talk with the neuroscientists, because they of course knew about this and knew how to read the papers, but the people on the funding/business side of course didn't spend much time putting emphasis on that.
One of the teams presented an accepted paper that basically used deep learning (attention) to predict images a person was thinking of from their fMRI signals. When I asked "but DL is proven to be able to find patterns even in random noise, so how can you be sure this is not just overfitting to artefacts?", there wasn't really any answer (or rather, the publication didn't take that into account, although it can be determined experimentally). Still, a month later I saw TechXplore or some other tech news site write an article about it, something like "AI can now read your brain", with the 1984 implications yada yada.
So this is indeed something most practitioners, masters and PhD students alike, probably realize relatively early.
So now when someone says "you know mindfulness is proven to change your brainwaves?", I always add my story: "yes, but the study was done with EEG, so I don't trust the scientific backing of it" (anecdotally, though, it helps me).
There is a lot of reliable science done using EEG and fMRI; I believe you learned the wrong lesson here. The important thing is to treat motion and physiological sources of noise as a first-order problem that must be taken very seriously and requires strict data-quality inclusion criteria. As for deep learning in fMRI/EEG, your response about overfitting is too sweepingly broad to apply to the entire field.
To put it succinctly, I think you have overfit your conclusions on the amount of data you have seen
I would argue that in fact almost all fMRI research is unreliable, and formally so: test-retest reliabilities are quite miserable (see my post below).
https://news.ycombinator.com/item?id=46289133
EDIT: The reason being, with reliabilities as bad as these, it is obvious that almost all fMRI studies are massively underpowered; you really need hundreds or even up to a thousand participants to detect effects with any statistical reliability. Very few fMRI studies come even close to these numbers (https://www.nature.com/articles/s42003-018-0073-z).
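A back-of-the-envelope version of the power argument, using the standard normal-approximation formula for a two-sample comparison (the effect sizes below are illustrative, though d ~ 0.2 is roughly what large brain-wide association studies report for between-subject effects):

```python
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group N for a two-sample test at standardized
    effect size d: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2."""
    z = NormalDist().inv_cdf
    return 2 * (z(1 - alpha / 2) + z(power)) ** 2 / d ** 2

# Large / medium / small effects in Cohen's d terms:
for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: ~{n_per_group(d):.0f} participants per group")
```

At d = 0.2 this lands around 390 participants per group, i.e. roughly 800 total for a two-group design, which is the "hundreds to a thousand" ballpark and far beyond a typical 20-30 subject fMRI study.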
That depends immensely on the type of effect you're looking for.
Within-subject effects (this happens when one does A, but not when doing B) can be fine with small sample sizes, especially if you can repeat variations on A and B many times. This is pretty common in task-based fMRI. Indeed, I'm not sure why you need >2 participants except to show that the principle is relatively generalizable.
Between-subject comparisons (type A people have this feature, type B people don't) are the problem because people differ in lots of ways and each contributes one measurement, so you have no real way to control for all that extra variation.
Precisely, and agreed 100%. We need far more within-subject designs.
You would still, in general, need many subjects to show the same basic within-subject pattern if you want to claim it is "generalizable" in the sense of "may generalize to most people". But depending on exactly what you are looking at, and on the strength of the effect, you may of course not need nearly as many participants as in strictly between-subject designs.
With the low test-retest reliability of task fMRI, in general, even in adults, this also means that strictly one-off within-subject designs are also not enough, for certain claims. One sort of has to demonstrate that even the within-subject effect is stable too. This may or may not be plausible for certain things, but it really needs to be considered more regularly and explicitly.
Between-subject heterogeneity is a major challenge in neuroimaging. As a developmental researcher, I've found that in structural volumetrics, even after controlling for total brain size, individual variance remains so large that age-brain associations are often difficult to detect and frequently differ between moderately sized cohorts (n=150-300). However, with longitudinal data where each subject serves as their own control, the power to detect change increases substantially—all that between-subject variance disappears with random intercept/slope mixed models. It's striking.
Task-based fMRI has similar individual variability, but with an added complication: adaptive cognition. Once you've performed a task, your brain responds differently the second time. This happens when studies reuse test questions—which is why psychological research develops parallel forms. But adaptation occurs even with parallel forms (commonly used in fMRI for counterbalancing and repeated assessment) because people learn the task type itself. Adaptation even happens within a single scanning session, where BOLD signal amplitude for the same condition typically decreases over time.
These adaptation effects contaminate ICC test-retest reliability estimates when applied naively, as if the brain weren't an organ designed to dynamically respond to its environment. Therefore, some apparent "unreliability" may not reflect the measurement instrument (fMRI) at all, but rather highlights the failures in how we analyze and conceptualize task responses over time.
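The contamination described above is easy to reproduce numerically. Below is a toy sketch (synthetic numbers, one-way absolute-agreement ICC for two sessions) showing how a perfectly stable trait can look "unreliable" simply because everyone's response shifts at retest:

```python
import random
import statistics

rng = random.Random(2)

def icc1(pairs):
    """One-way random, absolute-agreement ICC(1,1) for two sessions per subject."""
    n, k = len(pairs), 2
    grand = statistics.mean(v for pair in pairs for v in pair)
    subj_means = [statistics.mean(pair) for pair in pairs]
    msb = k * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    msw = sum((v - m) ** 2
              for pair, m in zip(pairs, subj_means) for v in pair) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

n, subj_sd, noise_sd, adaptation = 200, 1.0, 0.5, 1.0
truth = [rng.gauss(0, subj_sd) for _ in range(n)]  # stable individual trait

# No adaptation: both sessions measure the same thing.
stable = [(t + rng.gauss(0, noise_sd), t + rng.gauss(0, noise_sd)) for t in truth]

# Same trait, but everyone's response drops at retest (habituation/learning).
adapted = [(t + rng.gauss(0, noise_sd), t - adaptation + rng.gauss(0, noise_sd))
           for t in truth]

icc_stable = icc1(stable)
icc_adapted = icc1(adapted)
print(f"ICC, stable:  {icc_stable:.2f}")   # near the true value of 0.8
print(f"ICC, adapted: {icc_adapted:.2f}")  # looks 'unreliable'
```

The underlying individual differences are identical in both datasets; only the systematic session effect differs. An absolute-agreement ICC books that shift as error, which is exactly the sense in which a naive test-retest estimate can indict the analysis rather than the scanner.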
Yeah, when you start getting into this stuff and see your first dataset with over a hundred MRIs, and actually start manually inspecting things like skull-stripping and stuff, it is shocking how dramatically and obviously different people's brains are from each other. The nice clean little textbook drawings and other things you see in a lot of education materials really hide just how crazy the variation is.
And yeah, part of why we need more within-subject and longitudinal designs is to get at precisely the things you mention. There is no way to know if the low ICCs we see now are in fact adaptation to the task or task generalities, if they reflect learning that isn't necessarily task-relevant adaptation (e.g. the subject is in a different mood on a later test, and this just leads to a different strategy), if the brain just changes far more than we might expect, or all sorts of other possibilities. I suspect if we ever want fMRI to yield practical or even just really useful theoretical insights, we definitely need to suss out within-subject effects that have high test-retest reliability, regardless of all these possible confounds. Likely finding such effects will involve more than just changes to analysis, but also far more rigorous experimental designs (both in terms of multi-modal data and tighter protocols, etc).
FWIW, we've also noticed a lot of magic can happen too when you suddenly have proper longitudinal data that lets you control things at the individual level.
Yes on many of those fronts, although not all of those papers support your conclusion. The field did/does too often use tasks with too few trials and too few participants. That always frustrated me, as my advisor rightly insisted we collect hundreds of participants for each study, while others would collect 20 and publish 10x faster than us.
Yes, well, "almost all" is vague and needs to be qualified. Sample sizes have improved over the past decade for sure. I'm not sure they have grown meaningfully at the median, because there are still way too many low-N studies, but you do now see studies that are at least plausibly "large enough" more frequently. More open data has also helped here.
EDIT: And kudos to you and your advisor here.
EDIT2: I will also say that a lot of the research on fMRI methods is very solid and often quite reproducible. I.e. papers that pioneer new analytic methods and/or investigate pipelines and such. There is definitely a lot of fMRI research telling us a lot of interesting and likely reliable things about fMRI, but there is very little fMRI research that is telling us anything reliably generalizable about people or cognition.
I remember when resting-state had its oh-shit moment, when Power et al (e.g. https://pubmed.ncbi.nlm.nih.gov/22019881/) showed that major findings in the literature, many of which JD Power himself helped build, were based on residual motion artifacts. Kudos to JD Power and others like him.
Yes, and a great example of how so much research in fMRI methodology is just really good science working as it should.
Small sample sizes are a rational response from scientists in the face of a) funding levels and b) unreasonable expectations from hiring/promotion committees.
Cog neuro labs need to start organizing their research programs more like giant physics projects: lots of PIs pooling funding and resources into one big experiment rather than lots of little underpowered independent labs. But it's difficult to set up that kind of institutional structure unless there's a big shift in how we measure career advancement/success.
+1 to pooling funding and resources. This is desperately needed in fMRI (although site and other demographic / cultural effects make this much harder than in physics, I suspect).
I'm not an expert, but my hunch would be that a similar Big(ger) Science approach is also needed in areas like nutrition and (non-neurological) experimental psychology where (apparently) often group sizes are just too small. There are obvious drawbacks to having the choice of experiments controlled by consensus and bureaucracy, but if the experiments are otherwise not worthwhile what else is there to do?
I think the problems in nutrition are far, far deeper (we cannot properly control diet in most cases, and certainly not over long timeframes; we cannot track enough people long enough to measure most effects; we cannot trust the measurement i.e. self-report of what is consumed; industry biases are extremely strong; most nutrition effects are likely small and weak and/or interact strongly with genetics, making the sample size requirements larger still).
I'm not sure what you mean by "experimental psychology" though. There are areas like psychophysics that are arguably experimental and have robust findings, and there are some decent-ish studies in clinical psychology too. Here the group sizes are probably actually mostly not too bad.
Areas like social psychology have serious sample size problems, so might benefit, but this field also has serious measurement and reproducibility problems, weak experimental designs, and particularly strong ideological bias among the researchers. I'm not sure larger sample sizes would fix much of the research here.
> Areas like social psychology have serious sample size problems, so might benefit, but this field also has serious measurement and reproducibility problems, weak experimental designs, and particularly strong ideological bias among the researchers. I'm not sure larger sample sizes would fix much of the research here.
I can believe it; but a change doesn't have to be sufficient to be necessary.