The Verdict

Every model we tested rates darker-skinned faces as less trustworthy. Every single one.

We ran 10,247 faces from 47 countries through three leading vision-language models. The correlation between darker skin and lower trust scores was statistically significant across all of them.
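As a rough illustration of how such a correlation and its significance could be checked, here is a minimal sketch. It assumes a Pearson correlation between a numeric skin-tone type and a 10-point trust rating, with a permutation test for significance; the article does not say which statistic or test was actually used. The data are made up for the example and are not study results, and the function names are hypothetical.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the observed correlation."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real link between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data, invented for illustration only (NOT from the study):
# skin-tone type (1 = lightest, 6 = darkest) and a model's trust rating.
skin_type = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6]
trust = [7.5, 7.0, 7.2, 6.8, 6.9, 6.5, 6.4, 6.6, 6.0, 6.2, 5.8, 5.5]

r = pearson_r(skin_type, trust)
p = permutation_p_value(skin_type, trust)
print(f"r = {r:.3f}, permutation p = {p:.4f}")
```

A negative `r` means ratings fall as skin tone gets darker; a small `p` means shuffled data rarely produce a correlation that strong.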

Phi-3.5-Vision

Microsoft

MOST BIASED

Fairness Score: 0.793

Gender Accuracy: 86.2%

Age MAE: 8.9 yrs

Skin → Trust: -0.207

Skin → Intelligence: -0.211

Qwen2.5-VL-7B

Alibaba

Fairness Score: 0.894

Gender Accuracy: 96.5%

Age MAE: 6.8 yrs

Skin → Trust: -0.106

Skin → Intelligence: -0.246

Qwen3-VL-30B

Alibaba

Fairness Score: 0.911

Gender Accuracy: 96.8%

Age MAE: 5.3 yrs

Skin → Trust: -0.089

Skin → Intelligence: n.s. (not significant)
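The Gender Accuracy and Age MAE rows are presumably the usual classification accuracy and mean absolute error, although the article does not define them. Under that assumption, a minimal sketch of both metrics looks like this. The labels and ages are invented for the example and the function names are hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that exactly match the reference labels."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and reference values."""
    total = sum(abs(p - a) for p, a in zip(predicted, actual))
    return total / len(actual)


# Toy data, invented for illustration only (NOT from the study).
predicted_gender = ["F", "M", "M", "F"]
actual_gender = ["F", "M", "F", "F"]
predicted_age = [30, 42, 25, 61]
actual_age = [34, 40, 25, 55]

gender_accuracy = accuracy(predicted_gender, actual_gender)   # 3 of 4 correct
age_mae = mean_absolute_error(predicted_age, actual_age)      # (4+2+0+6)/4

print(f"Gender accuracy: {gender_accuracy:.1%}")
print(f"Age MAE: {age_mae:.1f} yrs")
```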

FINDING

All three models show a negative correlation between darker skin tone and perceived trustworthiness. Phi-3.5-Vision exhibits 2.6× more bias than Qwen3-VL-30B.
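The card figures can be cross-checked with a little arithmetic. The sketch below assumes, without confirmation from the article, that the Fairness Score is defined as 1 − |Skin → Trust|; the three reported values are at least consistent with that reading. It also computes the plain ratio of the two Skin → Trust correlations, which comes to roughly 2.3, so the 2.6× figure presumably rests on a measure not shown in the cards.

```python
from math import isclose

# Values copied from the model cards above.
reported = {
    "Phi-3.5-Vision": {"fairness": 0.793, "skin_trust": -0.207},
    "Qwen2.5-VL-7B": {"fairness": 0.894, "skin_trust": -0.106},
    "Qwen3-VL-30B": {"fairness": 0.911, "skin_trust": -0.089},
}


def implied_fairness(skin_trust_r):
    """Assumed definition (not confirmed by the source): 1 - |r|."""
    return 1 - abs(skin_trust_r)


for name, card in reported.items():
    implied = implied_fairness(card["skin_trust"])
    consistent = isclose(implied, card["fairness"], abs_tol=1e-9)
    print(f"{name}: implied {implied:.3f}, reported {card['fairness']:.3f}, "
          f"consistent={consistent}")

# Ratio of correlation magnitudes, most biased vs. least biased model.
trust_ratio = (abs(reported["Phi-3.5-Vision"]["skin_trust"])
               / abs(reported["Qwen3-VL-30B"]["skin_trust"]))
print(f"Skin -> Trust ratio: {trust_ratio:.2f}x")
```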

10,247 faces analyzed

3 VLMs compared

Fitzpatrick types

47 countries

"A 2-point gap on a 10-point scale is not noise. It is prejudice, encoded in weights."