Effect size is the quantitative magnitude of a difference between groups — how much a supplement moved an outcome, not merely whether the movement cleared a p-value. The same trial can be statistically significant and practically meaningless if the size of the difference is too small to matter. Common metrics include Cohen's d (standardized mean difference), relative risk, odds ratio, and number needed to treat. Unfair reports effect sizes directly on ingredient cards so users see the magnitude alongside the evidence tier rather than collapsing both into a single star.
Cohen's d benchmarks
Cohen's d expresses the mean difference between two groups in units of pooled standard deviation. It is the most useful single number for comparing the size of continuous supplement outcomes across trials.
| d value | Label | What a user would likely notice | Typical supplement example |
|---|---|---|---|
| 0.1 | Trivial | Invisible inside daily noise | Most single-trial mood effects |
| 0.2 | Small | Hard to feel individually; visible in trend | Creatine on strength output |
| 0.5 | Medium | A real shift a consistent logger can see over weeks | Ashwagandha on perceived stress |
| 0.8 | Large | Clear subjective and objective change | Caffeine on reaction time |
| ≥ 1.0 | Very large | Uncommon outside pharmaceuticals or acute interventions | Rare for supplement endpoints |
Most supplementation effect sizes for chronic outcomes — mood, cognition, inflammation — land between d = 0.2 and d = 0.5. Setting that expectation before reading any trial abstract prevents a 0.3 effect from feeling like a disappointment and a single 1.1 effect from looking like a breakthrough.
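The benchmark table can be read as a simple lookup. The sketch below is illustrative only: the function name `label_effect_size` is hypothetical, and treating each table value as the lower bound of its band is an assumption the table does not state.

```python
def label_effect_size(d: float) -> str:
    """Map a Cohen's d value to the benchmark label from the table.

    Assumption: each table value is the lower bound of its band, and
    anything below 0.2 is reported as "Trivial". The sign is ignored,
    so harmful and beneficial effects get the same magnitude label.
    """
    magnitude = abs(d)
    bands = [
        (1.0, "Very large"),
        (0.8, "Large"),
        (0.5, "Medium"),
        (0.2, "Small"),
    ]
    for floor, label in bands:
        if magnitude >= floor:
            return label
    return "Trivial"


print(label_effect_size(0.3))  # Small
print(label_effect_size(1.1))  # Very large
```

Under this reading, the 0.3 and 1.1 effects mentioned above fall in the "Small" and "Very large" bands respectively.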
A toy calculation
A 60-person, 8-week L-theanine trial reports an improvement in a morning alertness score. The treatment arm mean rises from 6.0 to 7.1 (SD 1.4). The placebo arm mean rises from 6.0 to 6.3 (SD 1.4). The between-arm difference in change is 1.1 − 0.3 = 0.8 points on a 10-point scale.
Cohen's d here is approximately 0.8 / 1.4 = 0.57, a medium effect. The same result reads very differently depending on how it is reported:
- As "alertness improved 18%" — technically true, emotionally large, and the wrong unit.
- As "p < 0.01" — correct, tells you nothing about the size.
- As "d ≈ 0.57 with a 95% CI from 0.15 to 0.98" — the useful framing, because the confidence interval includes both trivial and large values, which is typical for a 60-person trial.
Until a second trial shrinks that confidence interval, the right mental model is "probably a small-to-medium effect, with a non-trivial chance it will look smaller on replication."
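The point estimate in the toy calculation can be reproduced in a few lines. This is a minimal sketch: the 30/30 split between arms is an assumption (the example only gives the total of 60), and the function names are illustrative.

```python
import math


def pooled_sd(sd_a: float, n_a: int, sd_b: float, n_b: int) -> float:
    """Pooled standard deviation of two independent groups."""
    numerator = (n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2
    return math.sqrt(numerator / (n_a + n_b - 2))


def cohens_d(mean_a: float, mean_b: float, sd_a: float, n_a: int,
             sd_b: float, n_b: int) -> float:
    """Standardized mean difference: (mean_a - mean_b) / pooled SD."""
    return (mean_a - mean_b) / pooled_sd(sd_a, n_a, sd_b, n_b)


# Numbers from the toy trial; 30 participants per arm is assumed.
treatment_change = 7.1 - 6.0  # change in the treatment arm
placebo_change = 6.3 - 6.0    # change in the placebo arm

d = cohens_d(treatment_change, placebo_change, 1.4, 30, 1.4, 30)
print(round(d, 2))  # 0.57
```

Because both arms share the same SD, the pooled SD is simply 1.4 and the arm sizes do not affect the point estimate. They do affect the width of the confidence interval.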
Relative vs. absolute risk
Headlines routinely report relative risk reductions because they look larger. If a supplement reduces cardiovascular event risk from 2% to 1.5% over five years:
- Relative risk reduction — 25% ("supplement slashes risk by a quarter")
- Absolute risk reduction — 0.5 percentage points
- Number needed to treat — 200 (treat 200 people for five years to prevent one event)
All three are arithmetically correct. For a personal decision, the absolute reduction and NNT are the honest framing: they make it obvious that an individual user is most likely among the 199 whose outcome would have been the same with or without the supplement.
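The three figures come from the same two risks. A minimal sketch of the arithmetic follows; the function name `risk_summary` is illustrative, and rounding NNT up to a whole person is a common convention rather than something stated here.

```python
import math


def risk_summary(control_risk: float, treated_risk: float) -> dict:
    """Relative risk reduction, absolute risk reduction, and NNT.

    Risks are proportions (0.02 means 2%). Assumes treated_risk is
    lower than control_risk, so the reduction is positive.
    """
    arr = control_risk - treated_risk  # absolute risk reduction
    rrr = arr / control_risk           # relative risk reduction
    # Round before taking the ceiling so float noise cannot add a person.
    nnt = math.ceil(round(1 / arr, 6))
    return {"rrr": rrr, "arr": arr, "nnt": nnt}


summary = risk_summary(control_risk=0.02, treated_risk=0.015)
print(f"RRR: {summary['rrr']:.0%}")                     # RRR: 25%
print(f"ARR: {summary['arr'] * 100:.1f} percentage points")  # ARR: 0.5 percentage points
print(f"NNT: {summary['nnt']}")                         # NNT: 200
```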
Why effect size and tier are separate axes
A compound can hold the top evidence tier with only a small effect size. Creatine's strength effect is d ≈ 0.3 averaged across dozens of studies: modest per person, very reliable across populations. Another compound can post d = 1.2 in a single small trial and drop toward 0.4 when a proper replication arrives. Confidence in the claim and magnitude of the claim move independently, and both belong on the ingredient card. Collapsing them into a single score is how users end up with "highly recommended" stacks that quietly deliver near-zero benefit.
How effect size feeds the ranking
Unfair's recommendation ranking uses effect size as a direct contributor to score, gated by tier. A robust-tier d = 0.3 outranks a preliminary-tier d = 0.9 once replication penalties are applied, because the preliminary number is expected to shrink. When a trial reports only a relative metric, Unfair converts it to absolute terms before the result enters the ranking, so NNT and d stay comparable across ingredients.
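One way to picture tier gating is a multiplier applied to the reported effect size. The sketch below is purely hypothetical: the weights, the `Claim` type, and the scoring formula are invented for illustration and are not Unfair's actual ranking logic. They are chosen only so that the example in the text holds.

```python
from dataclasses import dataclass

# Hypothetical replication penalties, NOT Unfair's real values.
# A preliminary estimate is discounted because it is expected to shrink.
TIER_WEIGHT = {"robust": 1.0, "preliminary": 0.3}


@dataclass
class Claim:
    ingredient: str
    d: float   # reported Cohen's d
    tier: str  # evidence tier name


def ranking_score(claim: Claim) -> float:
    """Effect size discounted by the (assumed) tier weight."""
    return abs(claim.d) * TIER_WEIGHT[claim.tier]


def rank(claims: list[Claim]) -> list[Claim]:
    """Sort claims from highest to lowest discounted score."""
    return sorted(claims, key=ranking_score, reverse=True)


claims = [
    Claim("compound_b", 0.9, "preliminary"),
    Claim("compound_a", 0.3, "robust"),
]
ranked = rank(claims)
print([c.ingredient for c in ranked])  # ['compound_a', 'compound_b']
```

With these assumed weights the preliminary d = 0.9 is discounted to 0.27, just below the robust d = 0.3. Any preliminary weight under one third would produce the same ordering.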
How this appears in Unfair
Each ingredient card shows a point estimate for effect size per claim, a confidence interval when the source trials support one, and a plain-language label ("plausible benefit at the margin," "meaningful shift for most users," "large effect in the cited population"). Review cycles compare the user's own observed effect to the trial-level range, and the app flags personal effects that look much larger than the literature as likely expectancy until a second cycle confirms.
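The expectancy flag can be sketched as a comparison against the trial-level range. This is an assumption-laden illustration: the text does not define "much larger," so the sketch treats any personal effect above the upper confidence bound as qualifying, and the function and parameter names are hypothetical.

```python
def looks_like_expectancy(observed_d: float, trial_ci_upper: float,
                          completed_cycles: int) -> bool:
    """Flag a personal effect as likely expectancy.

    Assumptions: "much larger than the literature" means above the
    upper bound of the trial confidence interval, and the flag clears
    once a second review cycle has been completed.
    """
    exceeds_literature = observed_d > trial_ci_upper
    return exceeds_literature and completed_cycles < 2


# A personal d of 1.4 against a trial CI upper bound of 0.98.
print(looks_like_expectancy(1.4, 0.98, completed_cycles=1))  # True
print(looks_like_expectancy(1.4, 0.98, completed_cycles=2))  # False
```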
Clinical safety note
Small mean effects at the trial level can hide meaningful heterogeneity — some users respond strongly, most not at all. An n-of-1 experiment is the correct move when a trial-level effect is modest but a personal response might still be worth chasing. Effect size on efficacy does not speak to safety; a d = 0.1 benefit and a serious adverse event can live in the same compound, and only the safety signal belongs in a stop-the-stack decision.