Algorithmic Fairness and Its Discontents with Sharad Goel

Amber Cazzell
Feb 25, 2020
Updated: Mar 2, 2020

Dr. Sharad Goel is a professor of Management Science and Engineering, as well as a professor of Computer Science and Law at Stanford University. He is the founder and executive director of the Stanford Computational Policy Lab, where he uses advanced data science techniques to examine the effects of social and political policies, and how those policies might be improved upon. In this episode, we discuss the intractability of algorithmic fairness. We explore how decision systems are being used and implemented in unsettling ways, and the mathematical reasons that three common goals for achieving algorithmic fairness are mutually exclusive.

APA citation: Cazzell, A. R. (Host). (2020, February 15). Algorithmic Fairness and Its Discontents with Sharad Goel [Audio Podcast]. Retrieved from https://www.ambercazzell.com/post/msp-ep28-sharadgoel

NOTE: This transcript was generated automatically. Please excuse typos and errors.

Sharad Goel (01:13):

So I had a pretty standard path, I would say, to my current academic position. I started out in math. I did my undergraduate degree and my PhD in math. I came to Stanford, I did a postdoc, and I started getting interested in applications. I mean, the type of work that I was doing before was pretty theoretical, theoretical probability. And then I came to Stanford and I realized as a postdoc that there are a lot of applications out there in the world, specifically political applications, that could draw on those types of skills. And so I got pretty interested in that. I did that for a couple of years and then in a way, I guess I kind of burned out of academia and I took what I call a seven-year postdoc after that, and went to industry.

Sharad Goel (02:01):

I went first to Yahoo Research and then Microsoft Research for a couple of years. And that was pretty exciting for me because this was at a time when social science was starting to become of interest to technology companies. I mean, at this point, you know, in retrospect, it seems like that's always been the case. I mean, you think of companies like Facebook and you're like, you know, these are social networks. It's like, you need social scientists to be involved. But at that point there wasn't that much. It was just an inkling that these ideas matter, that it wasn't just that you're running big machine learning algorithms and that's all there is to it. And so this was exciting for me and I got to interact a lot with people at this bridge of computer science and social science, and in starting up this new field, or contributing to this new field, of computational social science.

Sharad Goel (02:52):

So I did that for a while and then I became increasingly interested in policy, specifically criminal justice related topics. And yeah, at that point I was also sort of looking for a change and came back to academia. And I think there are lots of advantages to doing that type of policy work in academia. I mean, for one, you know, people will usually return my phone calls now that I'm here. So that's one big bonus.

Amber Cazzell (03:27):

They didn't when you were in industry?

Sharad Goel (03:28):

They always thought I was trying to make money. I mean, if you call someone and you're at Microsoft, it's like, hmm, you know, why are you interested in this? Like, what is the angle? And you know, they probably should be asking me the same question when I call them up now, but for some reason they don't.

Sharad Goel (03:48):

Okay. So I think there's just a different perception of what the incentives are when you're at a company or at a university. And then there's just a lot more breadth of expertise and of interests, you know, specifically at a place like Stanford. But I think in universities more generally, I can walk a quarter mile and be at the law school or be at the sociology department or, you know, the political science department. Which I think is great, especially for the work that I'm doing now. And before I was mostly surrounded by computer scientists, which makes sense when a lot of the work that we were doing was related to computer science, but it also changed the way that we can think about problems.

Amber Cazzell (04:34):

Yeah. Interesting. So, I'm sure you had to sign NDAs and everything with your work in industry before, but can you say generally what sorts of issues you were working with back then?

Sharad Goel (04:46):

Yeah. I mean, I think everything I did was basically public. But that probably means that people didn't actually care that much about the stuff that I was doing. So let me see. A lot of the work that I was doing was related to information diffusion. And so the idea is, how do you know what types of information are going to go viral? You know, what does that word even mean? It's a word that people throw out a lot, but how do we rigorously make sense of that? And those are the types of questions that I was looking at. And also, you know, the substantive question was about information adoption and diffusion, and the technical questions were about distributed computing and how do you answer social science questions at scale. And so it's that particular intersection that I was quite interested in, like, how could you push social science into this new generation of techniques that could help us answer questions that we could barely even conceive of 10, 15 years ago.

Amber Cazzell (05:58):

Yeah. So you also have a joint appointment with sociology and law here, right? So when did that happen? Did that happen after your interest in criminal justice in particular came up?

Sharad Goel (06:13):

Sociology I've always been interested in, at least as a spectator. So I think that's always been there. And you know, if I were to identify with some social science discipline, sociology is probably the discipline that I would identify with most strongly. The law connection came much more recently, and that was really connected to some work that we started doing in criminal justice maybe, I don't know, six, seven years ago. And then just realizing that, you know, my interest was really in inequality, and that's where the sociology connection comes in. But then realizing that a lot of these questions aren't just broad policy questions, but specifically they have legal implications. And so understanding how these broader questions of inequality intersect with potential legal remedies, that's one of my areas of interest now.

Amber Cazzell (07:09):

Yeah, that's really interesting. So how did you move from information diffusion into your specific interest in criminal justice? I mean, it was mostly,

Sharad Goel (07:19):

Yeah, you know, in some ways, again, this was a bit of a long-running personal interest of mine, but not something that I had looked at academically or in a kind of research-oriented way. But I was living in New York, I don't know, I guess maybe seven, eight years ago. And stop and frisk was in the news quite a bit. And so I was reading about the case, and one of the comments that the judge in the case made kind of stuck with me. So the case, for those who don't know about it, you know, there's a lot going on in the case. But in a nutshell the issue was that people were being stopped by the police and there were two potential violations there.

Sharad Goel (08:02):

One is that people were being targeted unconstitutionally because of their race. The second is that people, even regardless of their race, were being stopped without what's called reasonable suspicion of criminal activity. This means that, you know, you can't just stop somebody on the street if you don't believe that they're engaged in criminal activity. And so that's what's considered a Fourth Amendment constitutional violation. And so I was reading about this case and the judge made this comment saying that, you know, we have reason to believe that a lot of people were stopped unconstitutionally, but we'll never know how many people were. And, you know, I just kind of thought, why is it that we're never going to know how many people were stopped, or even like a ballpark of how many people were stopped.

Sharad Goel (08:47):

Unconstitutionally. And the reason that this comment was made was because the standard way of understanding these types of violations is to interview the individuals involved in every specific instance: the individuals themselves who were stopped, the police officers who were involved, you know, potential witnesses to the incident. And in New York, at the height of this practice, there were something like 500,000 to 600,000 stops a year. So this is a huge number of stops a year, and the practice itself was going on for decades. And so, yeah, I mean, from that perspective you can see you can't possibly carry out an in-depth interview of all these people to really understand what's going on. And so the approach that we ended up taking was one that was very much driven by statistical analysis, and the idea was very simple.

Sharad Goel (09:43):

We were just trying to estimate, at the moment that somebody is stopped, based on all the information available to an officer right before they were stopped, what is the likelihood that they have a weapon on them? And this was one of the main reasons for stop and frisk in the first place, trying to get weapons off the street. And what we estimated is that something like 40% of people who were being stopped on suspicion of criminal possession of a weapon in fact had less than a 1% chance of having a weapon. Now, I mean, to me that strikes me as low. Now, you know, I think everything is nuanced, because, you know, is 1% below the constitutional level for reasonable suspicion? I don't know. To me it feels like it's bad public policy to stop people with that kind of rate of potentially having a weapon. But you know, this is a difficult question. I mean, we search people in airports when they, you know, presumably are very, very unlikely to have a weapon. So there are carve-outs in certain areas where we believe that it's justifiable to search people on the basis of very, very little evidence. But you know, on the street it struck me personally as not good policy. And we made the argument that this in fact was unconstitutional.

Amber Cazzell (11:07):

Yeah. So, and those rates were, when you're saying they had a less than 1% chance, you mean by actual reports of having found weapons and then using that to backtrack?

Sharad Goel (11:16):

Yeah, so we knew. So New York, you know, it's kind of interesting that they in some ways are the poster child for having problematic stop and frisk practices. But in part because of this, they also have very good data collection. Over the years they've been the subject of various court actions and now they have quite good data collection. So in theory, every time someone is stopped a police report is filed and that information becomes public. And so we know many of the factors that the stop was predicated on and also the outcome of those stops, whether or not a weapon or drugs or something else was found on that individual.

Amber Cazzell (11:56):

Yeah. Are you, I had read somewhere, I, I think this might've been your work of like archiving video footage from stop and frisk or, or police practices now or something like that. Is that you or is this

Sharad Goel (12:10):

No, that's not us. We've archived a lot of structured data, but no, no,

Amber Cazzell (12:20):

Not the footage. Okay. Yeah. Okay. Well, I want to go ahead and shift gears a bit to get into the algorithmic fairness stuff because I think it could take some time. Really interesting. So for listeners who aren't familiar with algorithmic fairness in general, could you give me an overview of what that even means?

Sharad Goel (12:37):

Yeah, I mean I don't think there is a simple answer to this. It's a word that's thrown out a fair bit in machine learning. Algorithmic fairness, these are phrases that are thrown out, I think, very broadly to capture the idea that decisions are often guided by algorithms. These are high-stakes decisions in medicine and criminal justice and employment. And there's a sense that these are not fair or not equitable. You know, what does that mean exactly? I think that's a complicated question. But it's the understanding that just because something is automated, and perhaps because it is automated, it could create inequity. So I think that's at least the spirit of what this new subfield is trying to address.

Amber Cazzell (13:20):

Okay. And so I want to talk about sort of the three definitions of fairness that were in that paper that I had mentioned I had been reading. So could you tell me about, you know, what the debates are in the field as far as what definitions of fairness should be reigning in this area?

Sharad Goel (13:40):

Yeah, I mean, so first, just to make the field a little bit narrower, computer scientists have exerted some influence on this field recently, and the dominant view in computer science is that, even if all the aspects of fairness can't be captured formally, there is value in creating mathematical definitions of fairness. And, you know, the reason I am kind of pointing that out in the way that I did is I actually think that's problematic, but maybe I won't get into that right now. But at least kind of playing by the rules of that particular game, there's this interest in creating these mathematical definitions of fairness. Now, three classes of mathematical definitions of fairness have become particularly prominent these days. And I can go through each of these one by one, if that's useful. And you know, again, I want to give the caveat that this gets pretty hairy. But, you know, I can at least try to talk a little bit about it.

Amber Cazzell (14:45):

Yeah. Let's talk a little bit about it, and I do think it gets complicated. I loved some of the examples because they brought it home for me. So if you could also talk about the examples of where each of these definitions does really well and where they have shortcomings, that would be wonderful.

Sharad Goel (15:02):

Yeah. Yeah, so first, just to ground ourselves a bit, one specific context in which algorithms are used is in the criminal justice system, in so-called pretrial decisions. So in the U.S., usually within 24 to 48 hours after someone is arrested, a judge has to make a high-stakes decision about whether or not to release that individual on their own recognizance, or to require them to post bail, or otherwise to detain them. And traditionally this is made based on intuition, and that has all sorts of issues. I mean, it's not obviously problematic, but there are all sorts of potential issues with that procedure. You know, the possibility for implicit bias to creep in, just kind of inconsistencies across individuals. So, you know, especially for the last 10 years or so, there's been a movement to aid these judicial decisions with so-called algorithmic risk assessments.

Sharad Goel (16:01):

So "algorithm," this really sounds fancy. It's just a weighted checklist in practice. And so this might mean there's something like five factors. You might look at something like how many times an individual has failed to appear in the past, what the nature is of their current offense, whether or not they have any other current pending charges. Each of these factors comes with points. All these points are added up and then that gets transformed into a risk score. And then judges see that risk score and then make some determination. Okay. So that's the kind of simple setup, and versions of that show up in all sorts of disciplines, in medicine, in employment. But these are some very basic so-called risk assessments. So,
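
To make the "weighted checklist" idea concrete, here is a minimal sketch in Python. The factors echo the ones mentioned above, but the point values, the cap on prior failures to appear, and the score bands are all hypothetical; they are not taken from any real risk assessment tool.

```python
# Hypothetical point values for illustration only; no real tool is being reproduced.
POINTS_PER_PRIOR_FTA = 2        # points for each prior failure to appear
MAX_FTAS_COUNTED = 2            # assumed cap on how many priors are counted
POINTS_VIOLENT_OFFENSE = 3      # points if the current offense is violent
POINTS_PENDING_CHARGE = 1       # points if there is another pending charge


def risk_points(prior_ftas, violent_current_offense, pending_charge):
    """Add up the checklist points for one individual."""
    points = POINTS_PER_PRIOR_FTA * min(prior_ftas, MAX_FTAS_COUNTED)
    if violent_current_offense:
        points += POINTS_VIOLENT_OFFENSE
    if pending_charge:
        points += POINTS_PENDING_CHARGE
    return points


def risk_level(points):
    """Transform raw points into the coarse score a judge would see (assumed bands)."""
    if points <= 2:
        return "low"
    if points <= 5:
        return "medium"
    return "high"


example_points = risk_points(prior_ftas=1, violent_current_offense=False, pending_charge=True)
example_level = risk_level(example_points)
```

The judge would see only the final level (or score), which is why the choice of factors and weights matters so much.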

Amber Cazzell (16:44):

So for, for criminal, for, for making these decisions, like what percent of the time is an algorithm being or a weighted checklist being used for that, would you say?

Sharad Goel (16:53):

Yeah, I mean it's a good question. I don't know the answer to this. It's becoming increasingly popular. It's likely that California will require that the entire state uses risk assessment tools within the next couple of years. And so it's it's already used in the federal government. Okay. Yes,

Amber Cazzell (17:13):

What is sort of the thinking behind that? Is the government assuming that these weighted checklists are going to be better than humans?

Sharad Goel (17:21):

Yeah, exactly. I mean, I think the idea is precisely that these types of risk assessment tools are better, more consistent, more accurate than unaided human judges. And again, I want to highlight here that it's not that these algorithms are replacing human judgment, they're guiding human judgment, you know, and so it's supposed to be a tool, not a replacement for human discretion.

Amber Cazzell (17:46):

Okay. And just, it seems like, is the government wanting this to be the case, would you say for like economic reasons or just for like PR reasons?

Sharad Goel (17:58):

Yeah, I mean, I guess again, it probably depends on who you ask. I would say that at least a publicly stated reason for this, and one that I generally believe, is that it's primarily for equity and consistency. So there is a real worry that there is a lot of capriciousness in the criminal justice system, and I don't think any of us want that. But you know, consistency in and of itself is not the end goal. I mean, I could just lock up everybody; that's perfectly consistent. That's not, I think, an end goal here. But there is this idea of consistency and equity, of saying that if we are going to detain individuals sensibly, we should only detain the people who are at highest risk. And again, I'm phrasing this as an "if we should detain anyone" because I think that's a big open question here that we're not addressing right now. I'm just using this example as one in which an algorithm is used, but I think there is this more existential question of should we be detaining anybody pretrial, and personally, I think that in the vast majority of cases we should not be. But conditional on detaining people, then the question is how should we do that? Should we use an unaided human decision maker or should we use a risk assessment tool to guide those decisions?

Amber Cazzell (19:17):

Okay, great. Okay. So with that as the background. So, so then what, what are some of the arguments for how to implement these weighted checklist? Yes.

Sharad Goel (19:27):

So one way to evaluate the fairness of these is by examining error rates across groups. And so this is a particularly popular definition. It might be the most popular definition right now in computer science circles. You know, let me give you some actual numbers here. So in Broward County, Florida, a particular algorithm was being used. This is called the COMPAS risk assessment algorithm. And what a previous investigation found was that the error rate for black defendants was about twice as high as the error rate for white defendants in this algorithm. So let me say this one more time. So among black defendants who ultimately did not go on to re-offend, about 30% were deemed high risk by the algorithm, and among white defendants who did not go on to re-offend, about 15% were deemed high risk by the algorithm. And so this is what's called a false positive rate. So these people ultimately did not do the behavior that they were predicted to do, but the algorithm flagged them as being high risk of doing that. And this false positive rate was about twice as high for black defendants as white defendants.
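
As a rough illustration of the false positive rate just described, here is a short Python sketch. The records are synthetic and were constructed only so that the two rates come out near the 30% and 15% figures quoted above; they are not the Broward County data, and the group names are placeholders.

```python
def false_positive_rate(flagged_high_risk, reoffended):
    """Among people who did NOT re-offend, the share who were flagged high risk."""
    flags_for_non_reoffenders = [
        flag for flag, outcome in zip(flagged_high_risk, reoffended) if not outcome
    ]
    return sum(flags_for_non_reoffenders) / len(flags_for_non_reoffenders)


# Synthetic group A: 100 non-re-offenders (30 flagged) and 50 re-offenders (40 flagged).
group_a_flags = [True] * 30 + [False] * 70 + [True] * 40 + [False] * 10
group_a_outcomes = [False] * 100 + [True] * 50

# Synthetic group B: 100 non-re-offenders (15 flagged) and 30 re-offenders (20 flagged).
group_b_flags = [True] * 15 + [False] * 85 + [True] * 20 + [False] * 10
group_b_outcomes = [False] * 100 + [True] * 30

fpr_a = false_positive_rate(group_a_flags, group_a_outcomes)  # 0.30
fpr_b = false_positive_rate(group_b_flags, group_b_outcomes)  # 0.15
```

Note that the re-offenders in each group do not enter the calculation at all; the false positive rate is conditioned only on people who did not re-offend.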

Amber Cazzell (20:41):

Yeah. Which seems very uncomfortable to just, yeah,

Sharad Goel (20:45):

It seems uncomfortable and you know, it, it feels like something has gone terribly wrong when, when you see those types of statistics.

Amber Cazzell (20:52):

Yeah. So a lot of people, I mean, my understanding is that a lot of people think, okay, this is something that needs to be rectified. We need to make sure the error rates are equal across certain groups.

Sharad Goel (21:03):

Yeah. So this is the feeling. I guess maybe there are a couple of different ways to think about it. The first is like, oh, there must be something wrong with the algorithm, that the people who created this didn't exactly know what they were doing. And if you apply proper statistics, you get a good data set of, you know, representative data across different race groups, then these error rates are going to be similar. It turns out that that's probably not going to be the case. We tried to replicate this analysis. We did replicate the analysis, and we tried to use clean data and the best available statistical techniques, and we still found that error rates were about twice as high for black defendants as white defendants. So then the second way that you can look at this issue is that the issue isn't that people aren't applying the best standard statistical techniques; it's that you need to do something new to force your error rates to be the same across race groups. And here you can just constrain a standard machine learning algorithm to say, make sure my error rates are comparable across white and black defendants.

Amber Cazzell (22:25):

Yeah. So what is the problem with taking that approach?

Sharad Goel (22:30):

This is where the debate starts. So many people do not think that that is a problematic approach to take. And again, I think it's probably safe to say that that's a dominant view in computer science right now, that this is a fine design principle to take, that there's social value, normative value, in equal error rates in and of themselves. So a policymaker, a social planner, should just build that into any algorithm that they create. Now, my view on this is that it actually is quite problematic to constrain an algorithm to have that. And it's for subtle statistical reasons. So this is where it gets a little bit hairy. In many kind of popular understandings of fairness, including legal and, you know, philosophical understandings of fairness, a common theme that emerges is that we apply the same standard to everybody.

Sharad Goel (23:31):

And so meaning that if I did that, my rule might be something like, I'm going to detain everybody who has at least a 60% chance of committing an offense if released. Now if I take that rule at face value, the statistical problem becomes, let me try to estimate everybody's chance to recidivate, their risk. And now the rule just tells me, once I have that estimate, the best estimate that I have available to me, I already know what I'm going to do. I mean, I detain the people who are deemed high risk and release the people otherwise. And it turns out that if you apply that rule on real data, you're going to end up with higher false positive rates for individuals with a higher base rate of re-offending. So could you put that into an example for me? Yeah. So, and again this is pretty close to what's happening in the Broward County example that I explained before, if we assume that risk estimates are perfect, and you know, this is an assumption, but let's pretend for a moment that risk assessments are perfect.

Sharad Goel (24:39):

So I'm omniscient. I know everybody's likelihood of re-offending. I don't know for sure if they're going to re-offend or not, but I know in probability the likelihood that any one person is going to re-offend, regardless of their race. Now, suppose I apply this rule of detaining everybody above a certain threshold, for example, people who are above a 60% threshold of failing to appear. So in the empirical example, what we see is that black defendants have an overall higher base rate of recidivating. Again, not because of their race, but because of all of these complicated associations with race and socioeconomics that are likely associated with recidivating. But you know, regardless of the exact mechanism, just empirically what we see is that black defendants are more likely to re-offend than white defendants.

Sharad Goel (25:34):

And so given that fact, if you were to apply a uniform threshold to the entire population, you're going to end up just mechanically seeing higher false positive rates for black defendants in that population.

Amber Cazzell (25:48):

Yeah. Just because the distribution of risk is different.
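
The mechanical effect described here can be sketched with a toy calculation in Python. It assumes, as in the thought experiment above, that every person's risk is known exactly; the two risk distributions are invented for illustration, and `group_x` and `group_y` are hypothetical groups that differ only in how many people sit at each risk level.

```python
def expected_fpr(risk_counts, threshold):
    """Expected false positive rate when everyone at or above `threshold` is detained.

    `risk_counts` maps a true re-offense probability to a number of people.
    A person with risk r turns out to be a non-re-offender with probability 1 - r.
    """
    non_reoffenders = sum(n * (1 - r) for r, n in risk_counts.items())
    detained_non_reoffenders = sum(
        n * (1 - r) for r, n in risk_counts.items() if r >= threshold
    )
    return detained_non_reoffenders / non_reoffenders


# Invented distributions: group_x has more people at the high-risk level.
group_x = {0.2: 300, 0.5: 300, 0.7: 400}   # base rate of re-offending: 0.49
group_y = {0.2: 600, 0.5: 300, 0.7: 100}   # base rate of re-offending: 0.34

THRESHOLD = 0.6  # the same standard applied to everybody

fpr_x = expected_fpr(group_x, THRESHOLD)   # about 0.24
fpr_y = expected_fpr(group_y, THRESHOLD)   # about 0.05
```

Nobody is treated differently at any given risk level in this sketch, yet the group with the higher base rate ends up with the higher false positive rate, purely because more of its members sit above the threshold.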

Sharad Goel (25:53):

And this is where it's very hard. It's hard to think about risk. It's even harder to think about distributions of risk. Yeah. So economists have a word for this. It's called the problem of infra-marginality. It's just a fancy word, but you know, you can Google it and maybe I'll give some more information. And this is one of these ideas where I think a picture is worth a thousand words, and a podcast probably doesn't do it exact justice. And I also want to flag that, so I am pretty convinced of this argument. Not very many people are, so I don't want people walking away thinking that this is either simple or uncontroversial. So I think it's both complex and controversial. Yeah. I just want to kind of flag that.

Amber Cazzell (26:47):

Yeah. You'd also mentioned in that paper that another potential shortcoming of using that sort of method of achieving fairness was that stakeholders are not uniformly affected. So if somebody is released, well, I should let you explain because you're much more eloquent than I am on these issues.

Sharad Goel (27:11):

Well, I don't know about that, but I can try. So one way to achieve equal false positive rates is by setting different thresholds for different individuals. So for example, I might detain white defendants as long as they have at least a 40% chance of recidivating, but I might detain black defendants only if they have a 60% chance of recidivating. Now, what we actually show in our paper is that, you know, under some reasonably general mathematical conditions, this turns out to be an optimal way to balance false positive rates. Now if you say that directly, that I'm going to set two different thresholds for two different groups, almost certainly that would be deemed unconstitutional. You know, again, for better or for worse, but just as a matter of law, that would almost certainly be deemed unconstitutional.

Sharad Goel (28:08):

And so one of the things that we point out in our work is that these black box algorithms that are constrained to equalize false positive rates, they're not telling you how they're doing this. And really what's happening under the hood, if you really peek, is that they're setting different thresholds for different groups. Now, once you realize that, I think it changes one's perspective about the equity of that approach. Now, you know, in theory I think it's fine. I mean, in practice it'll never be implemented. But in theory, I think one can debate the merits of setting different thresholds for different groups. And in some contexts that might be totally fine. Like affirmative action is a context where we're effectively setting two different thresholds for different groups. But in criminal justice, it's a hard argument to make.
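
To see why equalizing false positive rates tends to imply group-specific thresholds, here is a small Python sketch. It reuses the assumption of perfectly known risk and the same kind of invented distributions as a toy setup; the simple search over candidate thresholds is only an illustration and is not the method from the paper discussed in this episode.

```python
def expected_fpr(risk_counts, threshold):
    """Expected false positive rate when everyone at or above `threshold` is detained."""
    non_reoffenders = sum(n * (1 - r) for r, n in risk_counts.items())
    detained_non_reoffenders = sum(
        n * (1 - r) for r, n in risk_counts.items() if r >= threshold
    )
    return detained_non_reoffenders / non_reoffenders


def closest_threshold(risk_counts, target_fpr):
    """Pick the risk level whose use as a threshold gives an FPR closest to the target."""
    candidates = sorted(risk_counts)  # the distinct risk levels in this group
    return min(candidates, key=lambda t: abs(expected_fpr(risk_counts, t) - target_fpr))


# Invented distributions (true risk -> number of people); not real data.
group_x = {0.2: 300, 0.5: 300, 0.7: 400}   # higher base rate
group_y = {0.2: 600, 0.5: 300, 0.7: 100}   # lower base rate

threshold_x = 0.6                                   # policy threshold for group_x
target = expected_fpr(group_x, threshold_x)         # about 0.24
threshold_y = closest_threshold(group_y, target)    # 0.5 for this toy data
fpr_y_matched = expected_fpr(group_y, threshold_y)  # about 0.27
```

In this toy data, bringing the two false positive rates close together requires detaining members of the lower-base-rate group at a lower risk level than members of the other group, which is the "different thresholds under the hood" point made above.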

Sharad Goel (28:59):

And one of the reasons that I think it's a hard argument to make is that if you really believe that people above a certain risk threshold should be detained, and again, I'm not saying that one should believe it, but once we kind of say that this is what we believe, is it fair, is it equitable to say that in certain communities we're going to release people who we otherwise think are too dangerous to be on the street? Right. And it's a funny thing to think about, that, you know, in the example that I was giving you, somehow we believe that it's not okay to release relatively high risk white individuals for fear that they're going to commit crimes, but it is okay to release people who we think are relatively high risk black individuals. And the kind of irony is that a lot of crime in the United States involves individuals of the same race.

Sharad Goel (30:02):

And so by releasing individuals who might be, you know, relatively high risk of one race group, you might actually create problems in that same community. And so it's tricky. And again, I think this all comes with a big caveat that maybe the best public policy is not to detain anybody. And I think that's where a lot of people's intuition is coming from. Where it's like, well, we shouldn't have any pretrial detention. Or we should have very, very limited pretrial detention. And so if we're going to detain some people, maybe it's better for the world to at least set a higher threshold for black defendants, because they shouldn't have been detained in the first place. And it's too bad we still have to detain the white defendants. But if we can at least move the needle a little bit in the direction that is better for society, maybe that's okay. To me, that feels too narrow. At least, if that's the argument you want to make, I would like it to be made very explicitly, which I have not heard made explicitly. But even better to me is just saying that this is not an equitable practice, and it's not that we should have different thresholds for different groups, it's that we just shouldn't be detaining individuals.

Amber Cazzell (31:24):

Yeah. Yeah. So what about it sounds like anti classification or this idea of just ignoring protected group categories is a, is a technique for trying to achieve fairness that people have largely moved on from,

Sharad Goel (31:39):

Yeah. So, just to throw out some jargon here, this idea of fairness that we were just talking about, equalizing error rates, is sometimes called classification parity. Okay. And then this other kind of old notion of fairness is what we call anti-classification. And this ties into this legal notion of what's also called anti-classification, that we don't base decisions on, or we heavily scrutinize decisions that explicitly take into account, protected characteristics like race and gender. And so the idea with anti-classification, the second idea of fairness, is that we shouldn't base decisions directly on these protected characteristics like race and gender. So here, I mean, it's an interesting idea. In the abstract it feels really good. It's like, you know, if you were to ask somebody, is my algorithm fair?

Sharad Goel (32:38):

You know, one common response is, well, yeah, because it doesn't use race, it doesn't use gender. And in fact, in these pretrial settings, in many settings, race in particular is not used for almost all the algorithms that I know of, in part because there's this big stigma to basing decisions even in part on race. Although I want to again highlight the fact that in affirmative action, we do do this. And so it's just that in some circumstances we have this stigma, but not in all circumstances. Okay, so is this good, is this bad? You know, like everything in this field, it's complicated. So let me give you another example. So back to Broward County, and, you know, these are real stats that I'm telling you about.

Sharad Goel (33:29):

So in Broward County, after adjusting for the usual traits that one might want to adjust for, like age and someone's past criminal history, it turns out that women are less likely to recidivate than men with that same profile. Okay. Why is that? I don't know. I mean, I can think of all sorts of reasons, but it's just a fact that is out there in Broward County. It's a fact that repeats itself in many jurisdictions across the country. So it's just a common pattern that we see. Now, if one were to have a gender-blind risk tool, what is going to happen? What's going to happen is that we're going to overestimate the risk of women and we're going to underestimate the risk of men. That's because we're lumping everybody together. We're not accounting for the fact that risk differs across the gender groups.

Sharad Goel (34:28):

We're just lumping everybody together, and it's not only that we misestimate risk. If we take these risk scores seriously, that means we're going to end up detaining people, detaining women in particular, who we know statistically are in fact relatively low risk. And so not including gender can create this kind of profound inequity: detaining women who are known to be low risk.
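
To make the gender-blind example concrete, here is a minimal Python sketch. Every number in it is invented for illustration (the cohort counts, the 0.40 detention threshold, and the names `cohort`, `pooled_rate`, and `DETAIN_THRESHOLD` are all assumptions, not Broward County data). It only shows the mechanism described above: a pooled score sits between the two group rates, so it overstates risk for the lower-risk group.

```python
# Invented counts for defendants who share the same age and criminal history.
# These are NOT Broward County figures; they only illustrate the mechanism.
cohort = {
    "men": {"n": 800, "reoffended": 360},
    "women": {"n": 200, "reoffended": 50},
}

DETAIN_THRESHOLD = 0.40  # hypothetical cutoff for recommending detention


def group_rate(group):
    """Observed reoffense rate within one group."""
    return group["reoffended"] / group["n"]


def pooled_rate(groups):
    """Rate a gender-blind tool would assign: everyone lumped together."""
    total = sum(g["n"] for g in groups.values())
    reoffended = sum(g["reoffended"] for g in groups.values())
    return reoffended / total


blind_score = pooled_rate(cohort)

for name, group in cohort.items():
    actual = group_rate(group)
    print(
        f"{name}: blind score {blind_score:.2f}, actual rate {actual:.2f}, "
        f"detained by blind tool: {blind_score >= DETAIN_THRESHOLD}, "
        f"detained by gender-aware tool: {actual >= DETAIN_THRESHOLD}"
    )
```

With these made-up counts, the pooled score clears the threshold for everyone, while the women's own rate is well below it, so the blind tool recommends detaining women whom a gender-aware estimate would release.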

Amber Cazzell (34:57):

Yeah. So what is the legality of including like gender and these sorts of things?

Sharad Goel (35:02):

Yeah, so it's a good question. I think it hasn't been fully resolved yet. There are a couple of cases. Maybe the best-known case here is what's called State of Wisconsin versus Loomis. It went all the way up to the state Supreme Court in Wisconsin, and it was a slightly different context. This was in sentencing, not in pretrial, but the idea is quite similar.

Sharad Goel (35:27):

And there, Loomis, a man, was charging that he should not have received this high risk score in part because of his gender. In Wisconsin, for sentencing, gender is one of the factors that's used in assessing risk. He's saying that's not fair. It's like, why should I get an extra two points just because I'm a man? So he claimed what's called a due process violation, and the court, while sympathetic, I would say, ultimately concluded that this was a fair thing to do, to include gender, because not doing so would subject women to this harsher standard. And that is not, I think, a position that many jurisdictions have taken, and there's relatively little court action on this yet. And my guess is this is going to go all the way up to the U.S. Supreme Court in the next few years. And so we'll see what's going on. But, you know, it's complicated. Again, purely as a statistical, you know, as a public policy matter, I think there's a strong argument to include something like gender in these risk scores, but at the same time, a perception of legitimacy is important. And this is how a lot of the criminal justice system runs. And if we dramatically reduce that perception of legitimacy by including gender in a risk score, I think that's a consideration that one has to make.

Amber Cazzell (36:59):

Yeah. Okay. And so what, what other techniques are being used or argued for as a better technique for achieving fairness?

Sharad Goel (37:10):

Yeah, so I mean, really the two big techniques for achieving fairness are the two that we talked about. One is excluding, in some form or another, these protected characteristics like race and gender, or potentially proxies of those characteristics. This is complicated because in some cases you might want to include them, particularly if excluding them could create these unforeseen consequences. Then the second one is equalizing error rates as a design principle, forcing the algorithm to have equal performance across groups. Okay. Now, a third way of thinking about fairness is not so much a design principle, but one of evaluating algorithms. And so here the idea is that individuals with similar risk scores should re-offend at similar rates. And so risk scores should mean the same thing across all individuals. So if I say that somebody has a 60% chance of recidivating, I shouldn't have to ask the follow-up question of, is this 60% for a white defendant or is this 60% for a black defendant? When you tell me it's 60%, I should be able to interpret that as 60%.
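
The third idea, that a given score should correspond to the same observed re-offense rate in every group, can be checked with a small tabulation. The sketch below is only an illustration under stated assumptions: the function name `calibration_table`, the record layout `(score, group, reoffended)`, and the ten-plus-five invented records are all hypothetical, not anything from a real risk tool.

```python
from collections import defaultdict


def calibration_table(records):
    """Map (score, group) to the observed re-offense rate.

    `records` is an iterable of (score, group, reoffended) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # [number who re-offended, total]
    for score, group, reoffended in records:
        counts[(score, group)][0] += int(reoffended)
        counts[(score, group)][1] += 1
    return {key: hits / total for key, (hits, total) in counts.items()}


# Invented records: everyone was assigned a score of 0.6.
records = (
    [(0.6, "A", True)] * 6 + [(0.6, "A", False)] * 4
    + [(0.6, "B", True)] * 3 + [(0.6, "B", False)] * 2
)

table = calibration_table(records)
for (score, group), rate in sorted(table.items()):
    print(f"score {score}, group {group}: observed rate {rate:.2f}")
```

In this toy data both groups re-offend at the stated 60%, so a score of 0.6 means the same thing for each; a gap between the rows would be the signal that the follow-up question about group membership matters.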

Amber Cazzell (38:17):

Yeah, that seems like what we're trying to get at anyway. So

Sharad Goel (38:20):

Yeah, so that's the idea. Now it turns out that that idea is incompatible with equalizing error rates, and also, in many cases, with the idea of excluding characteristics from the risk scores themselves.

Amber Cazzell (38:39):

Okay. Can you tell me more about that? Like why, why is it incompatible or does that get too hairy?

Sharad Goel (38:44):

Yeah, so I mean, it's exactly like what we were talking about with the Broward County gender example. So if I look at a gender-blind tool, so I've excluded gender from my risk assessment score, and I say, okay, here are two individuals that have a 60% chance of re-offending. Well, if you also tell me their gender, I know, well, if it's a man, it's more like a 65% chance, and if it's a woman, it's more like a 50% chance of re-offending. And so there is this extra information that I'm getting by using gender, and that's a direct result of excluding that protected characteristic when I created the risk score. Okay.
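
The round numbers in this example can be tied together with one line of arithmetic. As a hedged sketch, assume the gender-blind 60% is the true pooled re-offense rate for that score, and let $m$ (a symbol introduced here, not in the conversation) be the share of men among people with that score:

```latex
\[
0.60 = 0.65\,m + 0.50\,(1 - m)
\quad\Longrightarrow\quad
0.10 = 0.15\,m
\quad\Longrightarrow\quad
m = \tfrac{2}{3}.
\]
```

So the quoted figures are mutually consistent if about two-thirds of that score group are men. The blind score is right on average, yet it is wrong for each gender taken separately, which is why it fails the "60% should mean 60% for everyone" test.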

Amber Cazzell (39:24):

Cool. So where is the field headed now, would you say?

Sharad Goel (39:28):

Yeah, so I mean, again, I think it's a good question, and there are kind of two answers here. I think there is, where is the technical field of computer science heading, and then where are policymakers and other individuals who are directly trying to, you know, improve algorithms heading? Now, on the computer science side, I think there still is a lot of interest in formalizing mathematical definitions of fairness and applying these things as design principles to create new algorithms. You know, I think there's a value to thinking hard and in using math as a language to formalize some of these ideas. But I also worry that this misses the forest for the trees, that it's very narrowly focused on the things that, as a community, we computer scientists are pretty good at.

Sharad Goel (40:29):

You know, writing down an objective function and then optimizing that objective function. I think that's what a lot of machine learning is about. That's what a lot of statistics is about. But I worry that that's not the right tack to take in these types of policy domains. So the second way in which I think this field is moving is to directly engage with policymakers, directly engage with practitioners, and try to understand, in the real world, what are some of the issues that we're finding. So this might be, for example, that the data that were collected to train algorithms were just inappropriate for the task in question. So one good example of this is face recognition, where there's pretty strong evidence that the previous generation of face recognition tools, and in part even the current generation of face recognition tools, were trained on a dataset that did not include enough people with dark skin.

Sharad Goel (41:35):

Yeah. And because of that, you were seeing all sorts of avoidable errors that were being made. And so once that realization was made, and in retrospect it seems, you know, pretty obvious, but I think at the time it was not obvious, people just didn't have that on their radar as something to worry about. And so at the time, even to me, this was, you know, a big insight, to be like, oh yeah, just because you train on millions, tens of millions of images, you still might not have very good representation. You know, and this could have all sorts of bad downstream consequences. And so just this very, very simple realization, I think, raised awareness of collecting the right data for the task. You know, that's one way in which I see positive movement in this field.

Amber Cazzell (42:23):

Okay. And so what are you personally working on for your future directions?

Sharad Goel (42:29):

Yeah, so I think that's a lot of what we're doing now: we're trying to understand, in practice, when people use these types of algorithms, what are the issues that come up? How are these algorithms trained? How are they actually applied? Are people using the recommendations from the algorithms? Are they overriding the recommendations in ways that themselves could be prejudicial? So it's very close to the practice and sort of less close to the formalism of these types of algorithmic systems.

Amber Cazzell (43:04):

Okay. So, I mean, moving in more of an applied direction, do you work with organizations on this? What does that look like?

Sharad Goel (43:14):

Yeah, so we work with a lot of organizations. I mean, one thing that we did recently is work with the San Francisco District Attorney's office in helping create an algorithm to guide charging decisions. So the charging decision is a slightly different decision point in the criminal justice system than what we were talking about before with these pretrial decisions. This comes very quickly after arrest, before you even know whether or not the DA is going to carry this forward. They have to make a decision about whether or not they should charge. In practice, they really only charge like half of the cases that come in front of them. And so the way that this was happening before is that intake attorneys would read the case files, a police report, and then make an up or down decision.

Sharad Goel (44:05):

These case files contained all sorts of details that arguably did not matter so much for making a decision. And so this included, for example, the race of the individual involved, maybe the exact location, their hairstyle, all sorts of other things that were either directly related to race or were proxies for race, and that didn't really have that much to do with the kind of criminal nature of the case in front of them. So all we did is we built a tool that stripped out these types of proxies that we didn't think were important for making an informed decision in that case. So here we're using natural language processing, and this is now running in San Francisco. So all of the felony cases that are being reviewed in San Francisco first get sanitized through the system, and it's kind of free text.

Sharad Goel (45:03):

And so the police narrative gets rewritten to exclude this type of information that we don't think is super important and could lead to implicit bias. Intake attorneys make an initial decision, then they get to look at the original unredacted version of these narratives. They make a second decision, and they can override their first decision. But if they do so, they have to give some reason for why they're doing that. So again, the idea is to guide decisions and to structure decisions in a way that we think would reduce the type of implicit biases that might creep in.
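
The workflow described here (redact, decide, reveal, decide again, and justify any override) can be sketched in a few lines of Python. This is a toy illustration only, not the San Francisco system: the keyword list `RACE_TERMS`, the `[REDACTED]` placeholder, and the `Review` class are hypothetical names invented for this sketch, and plain keyword matching is a stand-in for the natural language processing mentioned in the conversation.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical term list. Keyword matching is a crude stand-in: it would also
# mask phrases like "white car", which a real system would need to handle.
RACE_TERMS = ["black", "white", "hispanic", "asian"]
_PATTERN = re.compile(r"\b(" + "|".join(RACE_TERMS) + r")\b", re.IGNORECASE)


def redact(narrative: str) -> str:
    """Replace each listed term with a neutral placeholder."""
    return _PATTERN.sub("[REDACTED]", narrative)


@dataclass
class Review:
    first_decision: bool   # charge or not, made from the redacted narrative
    final_decision: bool   # made after reading the original narrative
    override_reason: Optional[str] = None

    def __post_init__(self):
        # Changing the first decision is allowed, but only with a stated reason.
        if self.first_decision != self.final_decision and not self.override_reason:
            raise ValueError("an override must include a reason")


print(redact("A Black male was seen leaving the store."))
review = Review(first_decision=True, final_decision=False,
                override_reason="unredacted report shows key context")
print(review)
```

The design point is the second half: the reviewer keeps discretion, but any departure from the blind decision leaves a recorded reason that can be audited later.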

Amber Cazzell (45:39):

Yeah. So how did you decide what things should be stripped out? What didn't really matter?

Sharad Goel (45:46):

Yeah, so this is a very hard question. And again, it's a policy question, and this is how I think about all these: there's a technical question, but really the hard thing is the policy question. Once we know what we want to do, we can often do it, but we often don't know what we want to do. And so here we just kind of went through the list of things that we thought were either directly related to race, like skin tone or explicit mentions of race, and then things that were associated with race, like someone's hairstyle, where, again, it's hard to imagine circumstances where that would play some sort of substantive role in the case itself. Now, there are corner cases, and this is why, at the end of the day, we're not saying you have to make your final decision based on the redacted version. You know, in some cases maybe race was really an important factor here.

Sharad Goel (46:43):

I mean, maybe this was a racially charged assault, and that really matters. And so once you redact that information, you lose the substance. So it's not that this is uniformly or universally unimportant. But in many, many cases, I think this information doesn't add a lot. And so this is why we have this phased approach of looking at the redacted information and making an initial decision, which we think works much of the time, and then looking at the unredacted version to figure out what is,

Amber Cazzell (47:18):

So the difference between, because we were talking about how there are certain contexts in which these protected categories statistically do seem to matter. In the implementation of what you're doing, what is the difference such that you think that the specific types of protected categories that are getting removed don't matter? Is it because it's already at the phase where, like, something bad has happened and now charging needs to happen, or,

Sharad Goel (47:43):

Yeah, so it's a good question. So the difference here, really, fundamentally, is that race, we don't think, matters much. So even in our risk assessment tools, the ones we were talking about in Broward County, if we were to include race, you don't really change the predictions. Gender really statistically matters a lot, and so then you have to make a harder decision about whether to include it or exclude it. With race, it doesn't really change your statistical prediction very much. And so including it, you know, in theory, might only increase implicit biases.

Amber Cazzell (48:21):

Okay. And for the project that you were doing, it was focused specifically on stripping like racial categories or proxies out?

Sharad Goel (48:26):

Yeah, exactly. Yeah. Gender is a more nuanced trait to strip out. And so this is tricky, because again, does it matter? If you're trying to determine whether or not to charge an assault, does the gender of the alleged perpetrator matter? You know, I think it can. So for example, you know, one mitigating circumstance for violence is a kind of fear. And so if somebody allegedly committed an assault, but also we know something about their gender, we know that this person was a woman, we know that she may have been protecting herself, or acting in a way that could be interpreted as protecting herself, against a male attacker, you know, maybe we take that into consideration when we decide whether or not to charge the case. And it's hard to know. I mean, I think this is a tricky policy question of what we should do. It's harder, in my mind, to come up with an analogous example for race. Right. And so we started with kind of the lowest common denominator here and said, well, we all agree that race, you know, 95% of the time, doesn't really play a role in these situations. And so we're going to take that out.

Amber Cazzell (50:06):

Yeah. It seems like with race it's maybe a bit easier to have other things that could account for differences but happen to be correlated with race, such that you can get closer to, like, the meat of what actually distinguishes people above or below thresholds. With gender, yeah, that seems difficult, because it does seem that there are certain differences between genders that are going to be hard to just strip out. Is this an area that you think social sciences could help with in any way? I mean, part of the problem is just what data is easily available. Right. And it's not like internal.

Sharad Goel (50:51):

Yeah, I mean, I think so. I think in some ways all these questions are social science questions. And so this is sort of my big, you know, I don't know if I'd call it a complaint. My concern with the field is that it's dominated by technical data scientists, computer scientists, machine learning folks. In the idea of algorithmic fairness, and just the word algorithm, it feels like this is a technical question. I view almost all the hard questions in this field as not technical; they are policy. And, you know, this isn't saying there aren't technical aspects. I think there are, but I think that the hard normative questions are really areas where it would be incredibly valuable to have more people trained in the social sciences become involved.

Amber Cazzell (51:36):

Okay. And so like what are some example areas in which you think that the social sciences have really not jumped in but could use improvement?

Sharad Goel (51:48):

So in almost all of the examples that I talk about or that we work on, we work with social scientists. So, for example, in this question of what to strip out of these police reports, that was a difficult conversation that we had with, you know, legal experts, with people who are familiar with the criminal justice system, with practitioners. And at the end of the day, it's not a technical question, it's a normative question of how you should design the system. It's the same thing when we look at risk assessment tools. A narrow technical question is how do you create the risk assessment tool? In my mind, the much harder question is what you do with that risk estimate. And there, again, is the social science question of what types of services you should provide to people based on their risk.

Sharad Goel (52:41):

And so that's something that we're looking into now. And, you know, one of the social science questions is, if I see somebody who's at high risk of failing to appear, why is it that they're at high risk of failing to appear? So it's not simply, they're high risk, you can't do anything about it, so let's lock them up. It's, they're high risk, let's understand why that is, and now what can we do to lower their risk and help support them? And so one thing that we're doing now is trying to understand that. You know, when people are deemed high risk of failing to appear at court, in many cases it's because they can't find childcare, or they can't get time off of work, or they can't find a ride to court. And so once we understand that, we can intervene and we can say, okay, we're going to get a car that'll take you to court, we're going to help you find the types of services that you need to make sure that you can fulfill these legal obligations.

Amber Cazzell (53:35):

Yeah. Really interesting. This is fascinating. So in general, do you think that starting to use algorithms in these weighted scales, has that improved upon human decision making? I mean, maybe this is a complicated question, but you're interdisciplinary, so

Sharad Goel (53:51):

Yeah, I mean it is, I mean this is, yeah, sorry, go on.

Amber Cazzell (53:53):

Well, I was just going to say, or do algorithms ever make whatever biases humans have worse, or is it just that it's more uncomfortable because it was automated?

Sharad Goel (54:03):

Yeah, I mean, I think certainly it's possible. Yes, it's certainly possible to make decisions worse. And so algorithms, you know, I think this should be clear, but algorithms are not a panacea, and they certainly might not even make things better. Okay. The hope, though, is that a well-designed algorithm will improve on unaided human decision making. I think there is a lot of evidence for that, but it is complicated, you know, in part because there's this tension between human discretion and then prejudicially overriding algorithmic recommendations. So there is a feeling that we want these algorithms to be guides, not the final word. At the same time, whenever we say that algorithms are guides, implicitly that means that humans can do whatever they want at the end of the day. And humans might directly be overriding the situations where algorithms were improving the decision, and might be reintroducing those types of biases that they were designed to mitigate.

Sharad Goel (55:07):

So, you know, that's one example. Another example is that algorithms could create feedback loops. They could provide some veneer of objectivity and mask more serious investigation into what might be happening. Yeah. So I think there are all sorts of problems with blindly applying algorithms, and I think that's where a lot of the pushback is coming from. There's a fear, and this is also my fear, that people who are pretty technically sophisticated but might not have the type of social science background, might not have the type of policy background or domain knowledge to thoughtfully interact with an application, that that is a potentially quite problematic way to engage.

Amber Cazzell (55:56):

Yeah. It is interesting, what you're saying about the human dynamic of being able to override, that it's both a good and a potentially dangerous thing as well. I'm wondering what some of your thoughts are with these big tech companies like Google and Facebook that have algorithms running all the time, where usually there's not a person who's sitting around saying, oh yes, I'm going to give this search result right now, and things. So when it's completely automated, what is being done to try to understand the scope of the problem inside these big corporations?

Sharad Goel (56:31):

Yeah. So, I mean, I think this is definitely of interest, you know, in the last few years. It probably should have been of interest in the last 20 years, but, you know, we are where we are. So I think people know that this is something to worry about. But I think in those cases, like deciding how to place ads, I don't think the fundamental question is algorithmic. I think it's policy. So how are you going to design a system? Like, what types of properties do you want your ad-serving platform to have when you're rolling this out to, you know, a couple billion people? You know, there are serious technical questions underneath that. But really the first-order question is, what do you want?

Amber Cazzell (57:19):

Yeah. So do you ever consult with policymakers on some of these issues?

Sharad Goel (57:24):

Yeah, so, you know, we work pretty regularly with policymakers on understanding these things. Yeah.

Amber Cazzell (57:31):

Is that all with recidivism or other things as well or

Sharad Goel (57:34):

In other things as well. We're looking at some applications in healthcare, and education, insurance, all over the place. So these algorithms are everywhere now, and there is a realization that they can have all sorts of inequitable impacts. Yeah.

Amber Cazzell (57:56):

That is so cool. You do such cool work. It must feel good to, like, be making an applied impact outside of the ivory tower. That's really cool.

Sharad Goel (58:05):

Yeah. You know, definitely. This is the big reason that I came back to the university: to have this type of impact and to raise awareness about these types of issues. So I'm happy to be able to do it.

Amber Cazzell (58:19):

That's awesome. Well, thank you. I think we're out of time here. Thanks so much, Sharad. I really appreciate your time and this is super interesting.

Sharad Goel (58:28):

Thanks for having me on.
