Philosophical Studies (IF 1.1), Pub Date: 2025-05-05, DOI: 10.1007/s11098-025-02322-y
Lily Hu
Discussions of statistical criteria for fairness commonly convey the normative significance of calibration within groups by invoking what risk scores “mean.” On the Same Meaning picture, group-calibrated scores “mean the same thing” (on average) across individuals from different groups and, accordingly, guard against disparate treatment of individuals based on group membership. My contention is that calibration guarantees no such thing. Since concrete actual people belong to many groups, calibration cannot ensure the kind of consistent score interpretation that the Same Meaning picture implies matters for fairness, unless calibration is met within every group to which an individual belongs. Alas, only perfect predictors may meet this bar. The Same Meaning picture thus commits a reference class fallacy by inferring from calibration within some group to the “meaning” or evidential value of an individual’s score, because they are a member of that group. The reference class answer it presumes does not only lack justification; it is very likely wrong. I then show that the reference class problem besets not just calibration but other group statistical criteria that claim a close connection to fairness. Reflecting on the origins of this oversight opens a wider lens onto the predominant methodology in algorithmic fairness based on stylized cases.
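The central claim, that calibration within one grouping says little about what a score "means" for a person who also belongs to other groups, can be illustrated with a toy dataset. The following is a minimal sketch and not the author's own construction: the records, the group labels (`A`/`B` for one grouping, `X`/`Y` for a cross-cutting one), and the helpers `observed_rates` and `is_calibrated` are all hypothetical, chosen only to make the point concrete.

```python
from collections import defaultdict

# Hypothetical toy records: (first_grouping, second_grouping, risk_score, outcome).
# Every person receives the same score of 0.5.
people = [
    ("A", "X", 0.5, 1),
    ("A", "Y", 0.5, 0),
    ("B", "X", 0.5, 1),
    ("B", "Y", 0.5, 0),
]


def observed_rates(records, group_index):
    """Return the observed outcome rate for each (group, score) cell."""
    cells = defaultdict(list)
    for rec in records:
        cells[(rec[group_index], rec[2])].append(rec[3])
    return {cell: sum(ys) / len(ys) for cell, ys in cells.items()}


def is_calibrated(records, group_index, tol=1e-9):
    """True if, in every group of the chosen grouping, the observed
    outcome rate among people with a given score equals that score."""
    rates = observed_rates(records, group_index)
    return all(abs(rate - score) <= tol for (_, score), rate in rates.items())


by_first = observed_rates(people, 0)   # cells keyed by ("A" or "B", score)
by_second = observed_rates(people, 1)  # cells keyed by ("X" or "Y", score)

print("calibrated within A/B:", is_calibrated(people, 0))
print("calibrated within X/Y:", is_calibrated(people, 1))
print("rates by X/Y:", by_second)
```

In this sketch a score of 0.5 matches the observed rate within both `A` and `B`, yet the same score corresponds to an observed rate of 1.0 within `X` and 0.0 within `Y`. Which of these rates bears on an individual who is a member of both `A` and `X` is exactly the reference class question the abstract raises.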
Title: Does calibration mean what they say it means; or, the reference class problem rises again