The Verdict
Every model we tested rates darker-skinned faces as less trustworthy. Every single one.
We ran 10,247 faces from 47 countries through three leading vision-language models. The correlation between darker skin and lower trust scores was statistically significant across all of them.
Phi-3.5-Vision (Microsoft)
Fairness Score: 0.793
Gender Accuracy: 86.2%
Age MAE: 8.9 yrs
Skin → Trust: -0.207
Skin → Intelligence: -0.211

Qwen2.5-VL-7B (Alibaba)
Fairness Score: 0.894
Gender Accuracy: 96.5%
Age MAE: 6.8 yrs
Skin → Trust: -0.106
Skin → Intelligence: -0.246

Qwen3-VL-30B (Alibaba)
Fairness Score: 0.911
Gender Accuracy: 96.8%
Age MAE: 5.3 yrs
Skin → Trust: -0.089
Skin → Intelligence: n.s. (not significant)
FINDING: All 3 models show a negative correlation between darker skin tones and perceived trustworthiness. Phi-3.5-Vision exhibits 2.6× more bias than Qwen3-VL-30B.
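The per-model figures can be cross-checked with a few lines of arithmetic. In the sketch below, the relation "fairness score = 1 minus the magnitude of the Skin → Trust correlation" is an observation read off the reported numbers, not a definition given in the source. The same numbers give a trust-correlation ratio of about 2.3 between Phi-3.5-Vision and Qwen3-VL-30B, so the 2.6× figure presumably rests on a different bias measure that is not shown here.

```python
# Values copied from the model cards above.
trust_corr = {
    "Phi-3.5-Vision": -0.207,
    "Qwen2.5-VL-7B": -0.106,
    "Qwen3-VL-30B": -0.089,
}
reported_fairness = {
    "Phi-3.5-Vision": 0.793,
    "Qwen2.5-VL-7B": 0.894,
    "Qwen3-VL-30B": 0.911,
}

# Assumed relation (inferred from the numbers, not stated in the source):
# fairness = 1 - |skin-to-trust correlation|.
implied_fairness = {
    model: round(1 - abs(r), 3) for model, r in trust_corr.items()
}
for model, implied in implied_fairness.items():
    print(f"{model}: implied {implied}, reported {reported_fairness[model]}")

# Ratio of correlation magnitudes between the most and least biased model.
ratio = abs(trust_corr["Phi-3.5-Vision"]) / abs(trust_corr["Qwen3-VL-30B"])
print(f"Trust-correlation ratio: {ratio:.2f}x")
```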
10,247 faces analyzed
3 VLMs compared
6 Fitzpatrick types
47 countries
"A 2-point gap on a 10-point scale is not noise. It is prejudice, encoded in weights."