Effect Size

Effect size is the quantitative magnitude of a difference between groups — how much a supplement moved an outcome, not merely whether the movement cleared a p-value. The same trial can be statistically significant and practically meaningless if the size of the difference is too small to matter. Common metrics include Cohen's d (standardized mean difference), relative risk, odds ratio, and number needed to treat. Unfair reports effect sizes directly on ingredient cards so users see the magnitude alongside the evidence tier rather than collapsing both into a single star.

Cohen's d benchmarks

Cohen's d expresses the mean difference between two groups in units of pooled standard deviation. It is the most useful single number for comparing the size of continuous supplement outcomes across trials.
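The definition above can be sketched in a few lines of Python. This is a minimal illustration for two independent groups summarized by mean, standard deviation, and sample size. The function names and the example numbers are placeholders chosen for this sketch, not values from any trial.

```python
import math


def pooled_sd(sd_a: float, n_a: int, sd_b: float, n_b: int) -> float:
    """Pooled standard deviation of two independent groups."""
    numerator = (n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2
    return math.sqrt(numerator / (n_a + n_b - 2))


def cohens_d(mean_a: float, sd_a: float, n_a: int,
             mean_b: float, sd_b: float, n_b: int) -> float:
    """Mean difference (a minus b) in units of pooled standard deviation."""
    return (mean_a - mean_b) / pooled_sd(sd_a, n_a, sd_b, n_b)


# Illustrative numbers only: a 1-point difference with SD 2 in both groups.
d_example = cohens_d(10.0, 2.0, 20, 9.0, 2.0, 20)
print(round(d_example, 2))
```

When both groups have the same standard deviation, the pooled value equals that standard deviation, so d reduces to the mean difference divided by the SD.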

d value	Label	What a user would likely notice	Typical supplement example
0.1	Trivial	Invisible inside daily noise	Most single-trial mood effects
0.2	Small	Hard to feel individually; visible in trend	Creatine on strength output
0.5	Medium	A real shift a consistent logger can see over weeks	Ashwagandha on perceived stress
0.8	Large	Clear subjective and objective change	Caffeine on reaction time
≥ 1.0	Very large	Uncommon outside pharmaceuticals or acute interventions	Rare for supplement endpoints
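
One way to apply the table programmatically is to treat each benchmark as the lower edge of a band. That banding is an assumption of this sketch, since the table lists point benchmarks rather than ranges, and the function name `label_effect` is hypothetical.

```python
def label_effect(d: float) -> str:
    """Map a Cohen's d value to the table's label, ignoring direction.

    Assumption: each benchmark is treated as the lower bound of its band,
    and anything below 0.2 is labeled Trivial.
    """
    size = abs(d)
    if size >= 1.0:
        return "Very large"
    if size >= 0.8:
        return "Large"
    if size >= 0.5:
        return "Medium"
    if size >= 0.2:
        return "Small"
    return "Trivial"


for value in (0.1, 0.3, 0.57, 0.85, 1.2):
    print(value, label_effect(value))
```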

Most supplementation effect sizes for chronic outcomes — mood, cognition, inflammation — land between d = 0.2 and d = 0.5. Setting that expectation before reading any trial abstract prevents a 0.3 effect from feeling like a disappointment and a single 1.1 effect from looking like a breakthrough.

A toy calculation

A 60-person 8-week L-theanine trial reports a morning alertness score improvement. The treatment arm mean changes from 6.0 to 7.1 (SD 1.4). The placebo arm mean changes from 6.0 to 6.3 (SD 1.4). The treatment arm gains 1.1 points and the placebo arm 0.3, so the between-arm difference is 0.8 points on a 10-point scale.

Cohen's d here is approximately 0.8 / 1.4 = 0.57, a medium effect. The same result reads very differently depending on which of three ways it is reported:

As "alertness improved 18%" — technically true, emotionally large, and the wrong unit.
As "p < 0.01" — correct, tells you nothing about the size.
As "d ≈ 0.57 with a 95% CI from 0.15 to 0.98" — the useful framing, because the confidence interval includes both trivial and large values, which is typical for a 60-person trial.

Until a second trial shrinks that confidence interval, the right mental model is "probably a small-to-medium effect, with a non-trivial chance it will look smaller on replication."
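
The arithmetic can be reproduced with a short sketch. It assumes an even 30/30 split between arms and uses a common large-sample approximation for the standard error of d. Neither detail is stated in the example, so the interval it prints is not expected to match the quoted one exactly. The function name `d_with_ci` is hypothetical.

```python
import math


def d_with_ci(diff: float, sd: float, n_treat: int, n_placebo: int,
              z: float = 1.96) -> tuple[float, float, float]:
    """Return (d, lower, upper) using a normal-approximation interval.

    Assumption: SE(d) ~ sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))).
    """
    d = diff / sd
    total = n_treat + n_placebo
    se = math.sqrt(total / (n_treat * n_placebo) + d**2 / (2 * total))
    return d, d - z * se, d + z * se


# Toy trial: 0.8-point between-arm difference, SD 1.4, assumed 30 per arm.
d, lower, upper = d_with_ci(diff=0.8, sd=1.4, n_treat=30, n_placebo=30)
print(f"d = {d:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Under these assumptions the interval comes out roughly 0.06 to 1.09, somewhat wider than the one quoted above. The conclusion is the same: a single 60-person trial cannot distinguish a trivial effect from a large one.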

Relative vs. absolute risk

Headlines routinely report relative risk reductions because they look larger. If a supplement reduces cardiovascular event risk from 2% to 1.5% over five years:

Relative risk reduction — 25% ("supplement slashes risk by a quarter")
Absolute risk reduction — 0.5 percentage points
Number needed to treat — 200 (treat 200 people for five years to prevent one event)

All three are arithmetically correct. For a personal decision, the absolute reduction and NNT are the honest framing: they make it obvious that an individual user is probably one of the 199 whose outcome the supplement does not change.
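
The three figures follow from two inputs, the event risk without and with the supplement. A minimal sketch, with `risk_summary` as a hypothetical helper name:

```python
def risk_summary(control_risk: float, treated_risk: float) -> dict[str, float]:
    """Compute relative risk reduction, absolute risk reduction, and NNT.

    Risks are proportions (0.02 means 2%). NNT is only defined when the
    treated risk is lower than the control risk.
    """
    arr = control_risk - treated_risk
    if arr <= 0:
        raise ValueError("treated risk must be lower than control risk")
    return {
        "relative_risk_reduction": arr / control_risk,
        "absolute_risk_reduction": arr,
        "number_needed_to_treat": 1 / arr,
    }


# The example above: risk falls from 2% to 1.5% over five years.
summary = risk_summary(control_risk=0.02, treated_risk=0.015)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The NNT is left as a float here. It is conventionally rounded up to a whole person when reported, and floating-point subtraction can leave the raw value a hair off 200.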

Why effect size and tier are separate axes

A compound can hold the top evidence tier with only a small effect size. Creatine's strength effect is d ≈ 0.3 averaged across dozens of studies — modest per person, very reliable across populations. Another compound can post d = 1.2 in a single small trial and drop toward 0.4 when a proper replication arrives. Confidence in the claim and magnitude of the claim move independently, and both belong on the ingredient card. Collapsing them into a single score is how users end up with "highly recommended" stacks that quietly deliver near-zero.

How effect size feeds the ranking

Unfair's recommendation ranking uses effect size as a direct contributor to score, gated by tier. A robust-tier d = 0.3 outranks a preliminary-tier d = 0.9 once replication penalties are applied, because the preliminary number is expected to shrink. When a trial reports only a relative metric, Unfair converts that metric to absolute terms before it enters the ranking, so NNT and d stay comparable across ingredients.
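
The gating idea can be illustrated with a toy scoring function. Unfair's actual formula and penalty values are not given here, so everything in this sketch is hypothetical: the `TIER_WEIGHT` table, the `Claim` type, and the multiplicative form of the score. The weights are chosen only so that the ordering described above holds.

```python
from dataclasses import dataclass

# Hypothetical replication penalties: preliminary effects are expected to shrink.
TIER_WEIGHT = {"robust": 1.0, "preliminary": 0.3}


@dataclass
class Claim:
    name: str
    d: float
    tier: str


def score(claim: Claim) -> float:
    """Effect size discounted by a tier-specific weight (illustrative only)."""
    return abs(claim.d) * TIER_WEIGHT[claim.tier]


claims = [
    Claim("preliminary claim", d=0.9, tier="preliminary"),
    Claim("robust claim", d=0.3, tier="robust"),
]
ranked = sorted(claims, key=score, reverse=True)
for claim in ranked:
    print(claim.name, round(score(claim), 2))
```

With these placeholder weights, the robust-tier d = 0.3 scores 0.30 and the preliminary-tier d = 0.9 scores 0.27, so the smaller but better-replicated effect ranks first.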

How this appears in Unfair

Each ingredient card shows a point estimate for effect size per claim, a confidence interval when the source trials support one, and a plain-language label ("plausible benefit at the margin," "meaningful shift for most users," "large effect in the cited population"). Review cycles compare the user's own observed effect to the trial-level range, and the app flags personal effects that look much larger than the literature as likely expectancy effects until a second cycle confirms them.
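
A sketch of how such a flag might work follows. The threshold for "much larger", the function name, and its parameters are all assumptions made for illustration, not a description of the app's actual logic.

```python
def is_likely_expectancy(observed_d: float, trial_ci_upper: float,
                         cycles_observed: int, margin: float = 1.5) -> bool:
    """Flag a personal effect that exceeds the literature by a wide margin.

    Assumptions: "much larger" means above margin * the trial-level upper
    bound, and the flag clears once the effect has been seen in two cycles.
    """
    exceeds_literature = observed_d > margin * trial_ci_upper
    return exceeds_literature and cycles_observed < 2


# A personal d of 1.5 against a trial-level upper bound of 0.98.
print(is_likely_expectancy(1.5, 0.98, cycles_observed=1))  # first cycle
print(is_likely_expectancy(1.5, 0.98, cycles_observed=2))  # confirmed
```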

Clinical safety note

Small mean effects at the trial level can hide meaningful heterogeneity — some users respond strongly, most not at all. An n-of-1 experiment is the correct move when a trial-level effect is modest but a personal response might still be worth chasing. Effect size on efficacy does not speak to safety; a d = 0.1 benefit and a serious adverse event can live in the same compound, and only the safety signal belongs in a stop-the-stack decision.

