1 00:00:10,080 --> 00:00:17,985 *applause* 2 00:00:17,985 --> 00:00:22,900 Thank you very much, can you… You can hear me? Yes! 3 00:00:22,900 --> 00:00:27,620 I’ve been at this now 23 years. We worked, with… My colleagues and I, 4 00:00:27,620 --> 00:00:31,390 we worked in about 30 countries, we’ve advised 9 Truth Commissions, 5 00:00:31,390 --> 00:00:36,410 official Truth Commissions, 4 UN missions, 6 00:00:36,410 --> 00:00:40,150 4 international criminal tribunals. We have testified in 4 different cases 7 00:00:40,150 --> 00:00:44,240 – 2 internationally, 2 domestically – and we’ve advised dozens and dozens 8 00:00:44,240 --> 00:00:49,120 of non-governmental Human Rights groups around the world. The point of this stuff 9 00:00:49,120 --> 00:00:54,180 is to figure out how to bring the knowledge of the people who’ve suffered 10 00:00:54,180 --> 00:00:58,770 human rights violations to bear, on demanding accountability 11 00:00:58,770 --> 00:01:04,960 from the perpetrators. Our job is to figure out how we can tell the truth. 12 00:01:04,960 --> 00:01:09,240 It is one of the moral foundations of the international Human Rights movement 13 00:01:09,240 --> 00:01:14,220 that we speak Truth to Power. We look in the face of the powerful 14 00:01:14,220 --> 00:01:19,299 and we tell them what we believe they have done that is wrong. 15 00:01:19,299 --> 00:01:23,639 If that’s gonna work, we have to speak the truth. 16 00:01:23,639 --> 00:01:29,470 We have to be right, we have to get the analysis on. 17 00:01:29,470 --> 00:01:33,979 That’s not always easy and to get there, 18 00:01:33,979 --> 00:01:37,209 there are sort of 3 themes that I wanna try to touch in this talk. 
19 00:01:37,209 --> 00:01:40,379 Since the talk is pretty short I’m really gonna touch on 2 of them, so 20 00:01:40,379 --> 00:01:43,619 at the very end of the talk I’ll invite people who’d like to talk more about 21 00:01:43,619 --> 00:01:49,270 the specifically technical aspects of this work, about classifiers, about clustering, 22 00:01:49,270 --> 00:01:53,620 about statistical estimation, about database techniques. People who wanna talk 23 00:01:53,620 --> 00:01:56,990 about that I’d love to gather and we’ll try to find a space. I’ve been fighting 24 00:01:56,990 --> 00:02:00,460 with the Wiki for 2 days; I think I’m probably not the only one. 25 00:02:00,460 --> 00:02:04,959 We can gather, we can talk about that stuff more in detail. So today, 26 00:02:04,959 --> 00:02:09,990 in the next 25 minutes I’m going to focus specifically on 27 00:02:09,990 --> 00:02:14,520 the trial of General José Efraín Ríos Montt 28 00:02:14,520 --> 00:02:20,200 who ruled Guatemala from March 1982 until August 1983. 29 00:02:20,200 --> 00:02:25,180 That’s General Ríos, there in the upper corner in the red tie. 30 00:02:25,180 --> 00:02:30,600 During the government of General Ríos Montt 31 00:02:30,600 --> 00:02:35,610 tens of thousands of people were killed by the army of Guatemala. And the question 32 00:02:35,610 --> 00:02:39,610 that has been facing Guatemalans since that time is: 33 00:02:39,610 --> 00:02:44,080 “Did the pattern of killing that the army committed 34 00:02:44,080 --> 00:02:49,690 constitute acts of genocide?”. Now genocide is a very specific crime 35 00:02:49,690 --> 00:02:54,420 in International Law. It does not mean you killed a lot of people. 36 00:02:54,420 --> 00:02:58,910 There are other war crimes for mass killing. Genocide specifically means 37 00:02:58,910 --> 00:03:03,930 that you picked out a particular group; and to the exclusion of other groups 38 00:03:03,930 --> 00:03:08,460 nearby them you focused on eliminating that group. 
39 00:03:08,460 --> 00:03:14,240 That’s key because for a statistician that gives us a hypothesis we can test 40 00:03:14,240 --> 00:03:18,860 which is: “What is the relative risk, what is the differential probability 41 00:03:18,860 --> 00:03:22,820 of people in the target group being killed relative to their neighbours 42 00:03:22,820 --> 00:03:28,150 who are not in the target group?” So without further ado, 43 00:03:28,150 --> 00:03:31,970 let’s look at the relative risk of being killed for indigenous people 44 00:03:31,970 --> 00:03:36,880 in the 3 rural counties of Chajul, Cotzal and Nebaj 45 00:03:36,880 --> 00:03:41,400 relative to their non-indigenous neighbours. 46 00:03:41,400 --> 00:03:45,960 We have – and I’ll talk in a moment about how we have this – we have information, 47 00:03:45,960 --> 00:03:51,490 and evidence, and estimations of the deaths of about 2150 indigenous people. 48 00:03:51,490 --> 00:03:58,550 People killed by the army in the period of the government of General Ríos. 49 00:03:58,550 --> 00:04:02,550 The population, the total number of people alive who were indigenous 50 00:04:02,550 --> 00:04:07,370 in those counties in the census of 1981 is about 39,000. 51 00:04:07,370 --> 00:04:14,500 So the approximate crude mortality rate due to homicide by the army 52 00:04:14,500 --> 00:04:18,710 is 5.5% for indigenous people in that period. Now that’s relative 53 00:04:18,710 --> 00:04:22,890 to the homicide rate for non-indigenous people in the same place 54 00:04:22,890 --> 00:04:27,200 of approximately 0.7%. So what we ask is: “What is the ratio 55 00:04:27,200 --> 00:04:30,530 between those 2 numbers?” And the ratio between those 2 numbers 56 00:04:30,530 --> 00:04:35,600 is the relative risk. It’s approximately 8. 
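The arithmetic behind that ratio can be sketched in a few lines of Python. The figures are the rounded ones quoted in the talk (about 2150 deaths, a census population of about 39,000, and a non-indigenous rate of roughly 0.7%); the variable names are illustrative and not taken from the original analysis.

```python
# Rounded figures as quoted in the talk; names are illustrative only.
indigenous_deaths = 2150         # estimated killings of indigenous people by the army
indigenous_population = 39_000   # indigenous population of the 3 counties, 1981 census
non_indigenous_rate = 0.007      # approx. 0.7% homicide rate for non-indigenous people

# Crude mortality rate due to homicide by the army, for indigenous people.
indigenous_rate = indigenous_deaths / indigenous_population

# Relative risk: ratio of the two rates.
relative_risk = indigenous_rate / non_indigenous_rate

print(f"indigenous rate: {indigenous_rate:.1%}")
print(f"relative risk:   {relative_risk:.1f}")
```

With these rounded inputs the ratio comes out a little under 8, which matches the "approximately 8" stated in the talk.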
We interpret that as: if you were 57 00:04:35,600 --> 00:04:41,339 an indigenous person alive in one of those 3 counties in 1982, 58 00:04:41,339 --> 00:04:46,939 your probability of being killed by the army was 8 times greater 59 00:04:46,939 --> 00:04:51,069 than that of a person also living in those 3 counties 60 00:04:51,069 --> 00:04:56,179 who was not indigenous. Eight times, 8 times! 61 00:04:56,179 --> 00:05:00,250 To put that in relative terms: the probability… the relative risk of being 62 00:05:00,250 --> 00:05:04,720 a Bosniac relative to being Serb in Bosnia during the war in Bosnia 63 00:05:04,720 --> 00:05:09,800 was a little less than 3. So your relative risk of being indigenous 64 00:05:09,800 --> 00:05:13,310 was more than twice, nearly 3 times, as much as your relative risk 65 00:05:13,310 --> 00:05:19,200 of being Bosniac in the Bosnian War. It’s an astonishing level of focus. 66 00:05:19,200 --> 00:05:23,809 It shows a tremendous planning and coherence, I believe. 67 00:05:23,809 --> 00:05:29,469 So, again coming back to the statistical conclusion, how do we come to that? 68 00:05:29,469 --> 00:05:32,849 How do we find that information? How do we make that conclusion? First, we’re only 69 00:05:32,849 --> 00:05:35,470 looking at homicides committed by the army. We’re not looking at homicides 70 00:05:35,470 --> 00:05:39,409 committed by other parties, by the guerrillas, by private actors. 71 00:05:39,409 --> 00:05:44,499 We’re not looking at excess mortality, the mortality that we might find 72 00:05:44,499 --> 00:05:47,709 in conflict that is in excess of normal peacetime mortality. 73 00:05:47,709 --> 00:05:51,470 We’re not looking at any of that, only homicide. And the percentage 74 00:05:51,470 --> 00:05:55,330 relates the number of people killed by the army with the population that was alive. 75 00:05:55,330 --> 00:05:58,650 That’s crucial here. 
We’re looking at rates and we’re comparing the rate 76 00:05:58,650 --> 00:06:02,430 of the indigenous people shown in the blue bar to non-indigenous people 77 00:06:02,430 --> 00:06:06,869 shown in the green bar. The width of the bars show the relative populations 78 00:06:06,869 --> 00:06:11,829 in each of those 2 communities. So clearly there are many more indigenous people, 79 00:06:11,829 --> 00:06:14,980 but a higher fraction of them are also killed. The bars also show something else. 80 00:06:14,980 --> 00:06:18,049 And that’s what I’ll focus on for the rest of the talk. There are 2 sections 81 00:06:18,049 --> 00:06:22,159 to each of the 2 bars, a dark section on the bottom, a lighter section on top. 82 00:06:22,159 --> 00:06:27,779 And what that indicates is what we know in terms of being able to name people 83 00:06:27,779 --> 00:06:31,249 with their first and last name, their location and dates of death, and 84 00:06:31,249 --> 00:06:35,560 what we must infer statistically. Now I’m beginning to touch on the second theme 85 00:06:35,560 --> 00:06:40,949 of my talk: Which is that when we are studying mass violence and war crimes, 86 00:06:40,949 --> 00:06:48,749 we cannot do statistical or pattern analysis with raw information. 87 00:06:48,749 --> 00:06:51,950 We must use the tools of mathematical statistics to understand 88 00:06:51,950 --> 00:06:56,080 what we don’t know! The information which cannot be observed directly. 89 00:06:56,080 --> 00:07:00,649 We have to estimate that in order to control for the process of the production 90 00:07:00,649 --> 00:07:04,989 of information. Information doesn’t just fall out of the sky, the way it does 91 00:07:04,989 --> 00:07:10,359 for industry. If I’m running an ISP I know every packet that runs through my routers. 92 00:07:10,359 --> 00:07:14,959 That’s not how the social world works. 
In order to find information about killings 93 00:07:14,959 --> 00:07:17,889 we have to hear about that killing from someone, we have to investigate, 94 00:07:17,889 --> 00:07:22,119 we have to find the human remains. And if we can’t observe the killing 95 00:07:22,119 --> 00:07:28,130 we won’t hear about it and many killings are hidden. In my team we have a kind of 96 00:07:28,130 --> 00:07:33,760 catch phrase: that the world… if a lawyer is killed in a big city at high noon 97 00:07:33,760 --> 00:07:38,259 the world knows about it before dinner time. Every single time. 98 00:07:38,259 --> 00:07:41,850 But when a rural peasant is killed 3-days walk from a road in the dead of night, 99 00:07:41,850 --> 00:07:45,489 we’re unlikely to ever hear. And technology is not changing this. 100 00:07:45,489 --> 00:07:48,899 I’ll talk later about that technology is actually making the problem worse. 101 00:07:48,899 --> 00:07:53,470 So, let’s get back to Guatemala and just conclude 102 00:07:53,470 --> 00:07:57,950 that the little vertical bars, little vertical lines at the top of each bar 103 00:07:57,950 --> 00:08:03,079 indicate the confidence interval. Which is similar to what lay people sometimes call 104 00:08:03,079 --> 00:08:07,199 a margin of error. It is our level of uncertainty about each of those estimates 105 00:08:07,199 --> 00:08:10,960 and you’ll notice that the uncertainty is much, much smaller than 106 00:08:10,960 --> 00:08:14,509 the difference between the 2 bars. The uncertainty does not affect our ability 107 00:08:14,509 --> 00:08:17,970 to draw the conclusion that there was a spectacular difference 108 00:08:17,970 --> 00:08:21,900 in the mortality rates between the people who were the hypothesized 109 00:08:21,900 --> 00:08:26,630 target of genocide and those who were not. 110 00:08:26,630 --> 00:08:30,520 Now the data: first we had the census of 1981, 111 00:08:30,520 --> 00:08:35,339 this was a crucial piece. 
I think there’s very interesting questions to ask 112 00:08:35,339 --> 00:08:39,609 about why the Government of Guatemala conducted a census on the eve of 113 00:08:39,609 --> 00:08:44,540 committing a genocide. There is excellent work done by historical demographers 114 00:08:44,540 --> 00:08:47,950 about the use of censuses in mass violence. It has been common 115 00:08:47,950 --> 00:08:52,880 throughout history. Similarly, or excuse me, in parallel 116 00:08:52,880 --> 00:08:57,420 there were 4 very large projects. First, the CIIDH 117 00:08:57,420 --> 00:09:01,600 – a group of non-Governmental Human Rights groups – 118 00:09:01,600 --> 00:09:06,610 collected 1240 records of deaths in this three-county region. 119 00:09:06,610 --> 00:09:11,750 Next, the Catholic Church collected a bit fewer than 800 deaths. 120 00:09:11,750 --> 00:09:16,539 The truth commission – the Comisión para el Esclarecimiento Histórico (CEH) – 121 00:09:16,539 --> 00:09:22,000 conducted a really big research project in the late 1990s and 122 00:09:22,000 --> 00:09:25,810 of that we got information about a little bit more than a thousand deaths. 123 00:09:25,810 --> 00:09:30,450 And then the National Program for Compensation is very, very large 124 00:09:30,450 --> 00:09:35,370 and gave us about 4700 records of deaths. 125 00:09:35,370 --> 00:09:40,659 Now, this is interesting but this is not unique. 126 00:09:40,659 --> 00:09:45,769 Many of the deaths are reported in common across those data sources and so… 127 00:09:45,769 --> 00:09:49,490 we think about this in terms of a Venn diagram. We think of: how did these 128 00:09:49,490 --> 00:09:54,329 different data sets intersect with each other or collide with each other. And 129 00:09:54,329 --> 00:09:59,130 we can diagram that as in the sense of these 3 white circles intersecting. 130 00:09:59,130 --> 00:10:05,610 But as I mentioned earlier we’re also interested in what we have not observed. 
131 00:10:05,610 --> 00:10:09,490 And this is crucial for us because when we’re thinking about 132 00:10:09,490 --> 00:10:13,420 how much information we have, we have to distinguish between the world on the left, 133 00:10:13,420 --> 00:10:17,200 in which our intersecting circles cover about a third of the reality, 134 00:10:17,200 --> 00:10:21,829 versus the world on the right where our intersecting circles cover all of reality. 135 00:10:21,829 --> 00:10:26,390 These are very different worlds; and the reason they’re so different is not simply 136 00:10:26,390 --> 00:10:29,710 because we want to know the magnitude, not simply because we want to know 137 00:10:29,710 --> 00:10:34,490 the total number of killings. That’s important – but even more important: 138 00:10:34,490 --> 00:10:40,160 we have to know that we’ve covered, we’ve estimated in equal proportions 139 00:10:40,160 --> 00:10:44,430 the two parties. We have to estimate in equal proportions the number of deaths 140 00:10:44,430 --> 00:10:48,340 of non-indigenous people and the number of deaths of indigenous people. 141 00:10:48,340 --> 00:10:51,510 Because if we don’t get those estimates correct our comparison 142 00:10:51,510 --> 00:10:56,080 of their mortality rates will be biased. Our story will be wrong. We will fail 143 00:10:56,080 --> 00:11:01,840 to speak Truth to Power. We can’t have that. So what do we do? Algebra! 144 00:11:01,840 --> 00:11:06,390 Algebra is our friend. So I’m gonna give you just a tiny taste of how we 145 00:11:06,390 --> 00:11:09,650 solve this problem and I’m going to introduce a series of assumptions. 146 00:11:09,650 --> 00:11:13,279 Those of you who would like to debate those assumptions: I invite you to join me 147 00:11:13,279 --> 00:11:18,359 after the talk and we will talk endlessly and tediously about capture heterogeneity. 
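Why unequal coverage biases the comparison can be illustrated with a small Python sketch. All numbers here are hypothetical and chosen only to make the effect visible; `observed_relative_risk` is an illustrative helper, not something from the original analysis.

```python
def observed_relative_risk(deaths_a, pop_a, coverage_a,
                           deaths_b, pop_b, coverage_b):
    """Relative risk of group A vs. group B computed from documented deaths only.

    coverage_x is the fraction of group x's true deaths that were documented.
    """
    documented_a = deaths_a * coverage_a
    documented_b = deaths_b * coverage_b
    return (documented_a / pop_a) / (documented_b / pop_b)

# Hypothetical world: 500 deaths among 10,000 people in group A,
# 100 deaths among 10,000 people in group B. True relative risk is 5.
full = observed_relative_risk(500, 10_000, 1.0, 100, 10_000, 1.0)

# Equal coverage of both groups: the ratio is preserved.
equal = observed_relative_risk(500, 10_000, 0.4, 100, 10_000, 0.4)

# Unequal coverage (A well documented, B poorly documented): the ratio is inflated.
unequal = observed_relative_risk(500, 10_000, 0.8, 100, 10_000, 0.2)

print(full, equal, unequal)
```

In this toy setup the ratio survives when both groups are documented at the same rate, and is distorted by a factor of four when they are not, which is the bias the comparison has to be protected against.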
148 00:11:18,359 --> 00:11:22,240 But in the short term, 149 00:11:22,240 --> 00:11:27,940 we have a universe N of total killings in a specific time/space/ethnicity/location. 150 00:11:27,940 --> 00:11:30,690 And of that we have 2 projects A and B. 151 00:11:30,690 --> 00:11:34,619 A captures some number of deaths from the universe N, 152 00:11:34,619 --> 00:11:40,169 and the probability with which a death is captured by project A from the universe N 153 00:11:40,169 --> 00:11:44,600 is by elementary probability theory the number of deaths documented by A 154 00:11:44,600 --> 00:11:48,740 divided by the unknown number of deaths in the population N. 155 00:11:48,740 --> 00:11:52,969 Similarly, the probability with which a death from N is documented by project B 156 00:11:52,969 --> 00:11:58,149 is B over N, and this is the cool part: the probability with which a death 157 00:11:58,149 --> 00:12:01,949 is documented by both A and B is M over N. 158 00:12:01,949 --> 00:12:05,579 Now we can put the 2 databases together, we can compare them. Let’s talk about 159 00:12:05,579 --> 00:12:09,370 the use of random forest classifiers and clustering to do that later. 160 00:12:09,370 --> 00:12:12,489 But we can put the 2 databases together, compare them, determine the deaths 161 00:12:12,489 --> 00:12:17,429 that are in M – that is, in both A and B – and divide M by N. 162 00:12:17,429 --> 00:12:23,060 But, also by probability theory, the probability that a death occurs in M 163 00:12:23,060 --> 00:12:27,740 is equal to the product of the individual probabilities. 164 00:12:27,740 --> 00:12:31,619 The probability of any compound event, an event made up of two independent events, is 165 00:12:31,619 --> 00:12:36,410 equal to the product of the probabilities of those two events, so M over N is equal to 166 00:12:36,410 --> 00:12:41,420 A over N times B over N. Solve for N. 
167 00:12:41,420 --> 00:12:45,140 Multiply it through by N squared, divide by M, and we have an estimate of N 168 00:12:45,140 --> 00:12:49,360 which is equal to AB over M. Now, the lights in my eyes, I can’t see, but I saw 169 00:12:49,360 --> 00:12:52,740 a few light bulbs go off over people’s heads. And when I showed this proof 170 00:12:52,740 --> 00:12:57,180 to the judge in the trial of General Ríos 171 00:12:57,180 --> 00:13:01,529 I saw a light bulb go on over her head. 172 00:13:01,529 --> 00:13:04,379 It’s a beautiful thing, it’s a beautiful thing. 173 00:13:04,379 --> 00:13:09,509 *applause* 174 00:13:09,509 --> 00:13:12,660 So we don’t do it in 2 systems because that takes a lot of assumptions. 175 00:13:12,660 --> 00:13:16,069 We do it in 4. You will recall that we have 4 data sources. We organize 176 00:13:16,069 --> 00:13:21,530 the data sources in this format such that we have an inclusion 177 00:13:21,530 --> 00:13:26,249 and an exclusion pattern in the table on the left, which… for which we can define 178 00:13:26,249 --> 00:13:29,810 the number of deaths which fall into each of these intersecting patterns. 179 00:13:29,810 --> 00:13:33,729 And I’ll give you a very quick metaphor here. The metaphor is: 180 00:13:33,729 --> 00:13:38,239 imagine that you have 2 dark rooms and you want to assess the size of those 2 rooms 181 00:13:38,239 --> 00:13:42,049 – which room is larger? And the only tool that you have to assess the size 182 00:13:42,049 --> 00:13:46,359 of those rooms is a handful of little rubber balls. The little rubber balls 183 00:13:46,359 --> 00:13:50,400 have a property that when they hit each other they make a sound. *makes CLICK sound* 184 00:13:50,400 --> 00:13:53,390 So we throw the balls into the first room and we listen, and we hear 185 00:13:53,390 --> 00:13:57,190 *makes several CLICK sounds*. 
We collect the balls, go to the second room, 186 00:13:57,190 --> 00:14:00,490 throw them with equal force – imagining a spherical cow of uniform density! 187 00:14:00,490 --> 00:14:03,950 We throw the balls into the second room with equal force and we hear 188 00:14:03,950 --> 00:14:07,799 *makes one CLICK sound* So which room is larger? 189 00:14:07,799 --> 00:14:12,070 The second room, because we hear fewer collisions, right? Well, the estimation, 190 00:14:12,070 --> 00:14:15,620 the toy example I gave in the previous slide is the mathematical formalization 191 00:14:15,620 --> 00:14:20,070 of the intuition that fewer collisions mean a larger space. 192 00:14:20,070 --> 00:14:23,329 And so what we’re doing here is laying out the pattern of collisions. 193 00:14:23,329 --> 00:14:26,679 Not just the collisions, the pairwise collisions, but the three-way and 194 00:14:26,679 --> 00:14:31,409 four-way collisions. And that allows us to make the estimate 195 00:14:31,409 --> 00:14:37,439 that was shown in the bar graph of the light part of each of the bars. So 196 00:14:37,439 --> 00:14:41,460 we can come back to our conclusion and put a confidence interval on the estimates. 197 00:14:41,460 --> 00:14:45,910 And the confidence intervals are shown there. 
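The two-list toy example (N estimated as AB over M) can be written as a short Python sketch. The counts below are hypothetical, and `two_system_estimate` is an illustrative name. The estimates presented at trial used four sources and further modeling that this sketch does not attempt to reproduce; it also relies on the independence assumption stated in the derivation.

```python
def two_system_estimate(a: int, b: int, m: int) -> float:
    """Estimate the total N from two lists of sizes a and b with m records in common.

    Implements N = a * b / m, assuming the two lists capture deaths independently.
    """
    if m <= 0:
        raise ValueError("no overlap between the lists: the estimate is undefined")
    if m > min(a, b):
        raise ValueError("overlap cannot exceed the size of either list")
    return a * b / m

# Hypothetical counts: list A has 300 deaths, list B has 200, 60 appear in both.
a, b, m = 300, 200, 60
n_hat = two_system_estimate(a, b, m)

# Deaths documented by at least one list, and the estimated undocumented remainder.
documented = a + b - m
undocumented = n_hat - documented

print(n_hat, documented, undocumented)

# Fewer "collisions" between the same two lists imply a larger universe.
print(two_system_estimate(300, 200, 20))
```

The last line mirrors the dark-room metaphor: with the same two lists, a smaller overlap yields a larger estimate of the total.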
Now I’m gonna move through this 198 00:14:45,910 --> 00:14:50,850 somewhat more quickly to get to the end of the talk but I wanna put up one more slide 199 00:14:50,850 --> 00:14:56,240 that was used in the testimony and that is that we divided time 200 00:14:56,240 --> 00:15:01,220 into 16-month periods and compared the 16-month period of 201 00:15:01,220 --> 00:15:04,580 General Ríos’s governance – now it’s only 16 months ’cause we went April to July, 202 00:15:04,580 --> 00:15:07,679 because it’s only a few days in August, a few days in March, so we shaved those off, 203 00:15:07,679 --> 00:15:12,310 okay… – 16-month period of General Ríos’s Government and compared it 204 00:15:12,310 --> 00:15:17,110 to several periods before and after. And I think that the key observation here 205 00:15:17,110 --> 00:15:21,809 is that the rate of killing against indigenous people 206 00:15:21,809 --> 00:15:26,729 is substantially higher under General Ríos’s Government than under previous 207 00:15:26,729 --> 00:15:33,280 or succeeding governments. But more importantly the ratio between the two, 208 00:15:33,280 --> 00:15:37,950 the relative risk of being killed as an indigenous person, was at its peak 209 00:15:37,950 --> 00:15:42,639 during the government of General Ríos. 210 00:15:42,639 --> 00:15:46,709 Have we proven genocide? No. 211 00:15:46,709 --> 00:15:49,870 This is evidence consistent with the hypothesis that acts of genocide 212 00:15:49,870 --> 00:15:53,539 were committed. The finding of genocide is a legal finding, not so much 213 00:15:53,539 --> 00:15:58,580 a scientific one. So as scientists, our job is to provide evidence that 214 00:15:58,580 --> 00:16:02,870 the finders of fact – the judges in this case – can use in their determination. 215 00:16:02,870 --> 00:16:05,219 This is evidence consistent with that hypothesis. 
216 00:16:05,219 --> 00:16:08,189 Were this evidence otherwise, as scientists we would say we would 217 00:16:08,189 --> 00:16:11,480 reject the hypothesis that genocide was committed. However, with this evidence 218 00:16:11,480 --> 00:16:15,370 we find that the evidence, the data is consistent with 219 00:16:15,370 --> 00:16:18,080 the prosecution’s hypothesis. 220 00:16:18,080 --> 00:16:25,320 So, it worked! 221 00:16:25,320 --> 00:16:29,049 Ríos Montt was convicted on genocide charges. *applause* 222 00:16:29,049 --> 00:16:31,359 You can clap! *applause* 223 00:16:31,359 --> 00:16:36,359 *applause* 224 00:16:36,359 --> 00:16:39,499 For a week! *mumbled, surprised laughter* 225 00:16:39,499 --> 00:16:42,279 Then the Constitutional Court intervened, 226 00:16:42,279 --> 00:16:44,959 there I know a couple of experts on Guatemala here in the audience 227 00:16:44,959 --> 00:16:47,839 who can tell you more about why that happened and exactly what happened. 228 00:16:47,839 --> 00:16:52,669 However, the Constitutional Court ordered a new trial, 229 00:16:52,669 --> 00:16:59,160 which is at this time scheduled for the very beginning of 2015. 230 00:16:59,160 --> 00:17:02,970 And I look forward to testifying again, 231 00:17:02,970 --> 00:17:06,820 and again, and again, and again! 232 00:17:06,820 --> 00:17:12,680 *applause* 233 00:17:12,680 --> 00:17:16,989 Look, but I wanna come back to this point. Because as a bunch of technologists… 234 00:17:16,989 --> 00:17:21,589 – there is a lot of folks who really like technology here, I really like it too! 235 00:17:21,589 --> 00:17:25,559 Technology doesn’t get us to science – you have to have science 236 00:17:25,559 --> 00:17:28,770 to get you to science. Technology helps you organize the data. It helps you do 237 00:17:28,770 --> 00:17:32,050 all kinds of extremely great and cool things without which we wouldn’t be able 238 00:17:32,050 --> 00:17:36,480 to even do the science. But you can’t have just technology! 
239 00:17:36,480 --> 00:17:40,970 You can’t just have a bunch of data and make conclusions. That’s naive, 240 00:17:40,970 --> 00:17:44,529 and you will get the wrong conclusions. ‘The point of rigorous statistics is 241 00:17:44,529 --> 00:17:48,100 to be right’, and there is a little bit of a caveat there – or to at least know 242 00:17:48,100 --> 00:17:51,620 how uncertain you are. Statistics is often called the ‘Science of Uncertainty’. 243 00:17:51,620 --> 00:17:55,960 That is actually my favorite definition of it. So, 244 00:17:55,960 --> 00:18:01,509 I’m going to assume that we care about getting it right. 245 00:18:01,509 --> 00:18:05,489 No one laughed, that’s good. 246 00:18:05,489 --> 00:18:08,890 Not everyone does, to my distress. 247 00:18:08,890 --> 00:18:11,320 So if you only have some of the data 248 00:18:11,320 --> 00:18:15,490 – and I will argue that we always only have some of the data – 249 00:18:15,490 --> 00:18:20,449 you need some kind of model that will tell you the relationship between your data 250 00:18:20,449 --> 00:18:23,989 and the real world. Statisticians call that an inference. 251 00:18:23,989 --> 00:18:26,200 In order to get from here to there you’re gonna need some kind of 252 00:18:26,200 --> 00:18:30,469 probability model that tells you why your data is like the world, 253 00:18:30,469 --> 00:18:33,960 or in what sense you have to tweak, twiddle and do algebra with your data 254 00:18:33,960 --> 00:18:39,309 to get from what you can observe to what is actually true. 255 00:18:39,309 --> 00:18:42,690 And statistics is about comparisons. Yeah, we get a big number and 256 00:18:42,690 --> 00:18:46,169 journalists love the big number; but it’s really about these relationships 257 00:18:46,169 --> 00:18:50,609 and patterns! 
So to get those relationships and patterns, 258 00:18:50,609 --> 00:18:53,560 in order for them to be right, in order for our answer to be correct, 259 00:18:53,560 --> 00:18:57,439 every one of the estimates we make for every point in the pattern 260 00:18:57,439 --> 00:19:01,700 has to be right. It’s a hard problem. It’s a hard problem. 261 00:19:01,700 --> 00:19:05,070 And what I worry about is that we have come into this world 262 00:19:05,070 --> 00:19:09,400 in which people throw the notion of Big Data around as though the data allows us 263 00:19:09,400 --> 00:19:14,230 to make an end-run around problems of sampling and modeling. It doesn’t. 264 00:19:14,230 --> 00:19:19,120 So as technologists, the reason I’m, you know, ranting at you guys about it 265 00:19:19,120 --> 00:19:24,540 is that it’s very tempting to have a lot of data and think you have an answer! 266 00:19:24,540 --> 00:19:30,580 And it’s even more tempting because in an industry context you might be right. 267 00:19:30,580 --> 00:19:36,739 Not so much in Human Rights, not so much. Violence is a hidden process. 268 00:19:36,739 --> 00:19:39,960 The people who commit violence have an enormous commitment to hiding it, 269 00:19:39,960 --> 00:19:44,420 distorting it, explaining it in different ways. All of those things dramatically 270 00:19:44,420 --> 00:19:48,350 affect the information that is produced from the violence that we’re going to use 271 00:19:48,350 --> 00:19:53,730 to do our analysis. So we usually don’t know what we don’t know 272 00:19:53,730 --> 00:19:58,220 in Human Rights data collection. And that means that we don’t know 273 00:19:58,220 --> 00:20:03,829 if what we don’t know is systematically different from what we do know. 274 00:20:03,829 --> 00:20:06,270 Maybe we know about all the lawyers and we don’t know about the people 275 00:20:06,270 --> 00:20:10,070 in the countryside. 
Maybe we know about all the indigenous people 276 00:20:10,070 --> 00:20:14,130 and not the non-indigenous people. If that were true, the argument 277 00:20:14,130 --> 00:20:17,980 that I just made would be merely an artifact of the reporting process 278 00:20:17,980 --> 00:20:21,740 rather than some true analysis. Now we did the estimations, which is why I believe 279 00:20:21,740 --> 00:20:25,009 we can reject that critique, but that’s what we have to worry about. 280 00:20:25,009 --> 00:20:28,860 And let’s go back to the Venn diagram and say: which of these is accurate? 281 00:20:28,860 --> 00:20:32,840 It’s not just for one of the points in our pattern analysis. 282 00:20:32,840 --> 00:20:36,500 The problem is that we’re going to compare things. 283 00:20:36,500 --> 00:20:40,890 As in Peru where we compared killings committed by the Peruvian army against 284 00:20:40,890 --> 00:20:44,860 killings committed by the Maoist guerrillas of the Sendero Luminoso. And we found 285 00:20:44,860 --> 00:20:51,460 there that in fact we knew very little about what the Sendero Luminoso had done. 286 00:20:51,460 --> 00:20:55,779 Whereas we knew almost everything about what the Peruvian army had done. 287 00:20:55,779 --> 00:20:57,970 This is called the coverage rate. The rate between what we know and 288 00:20:57,970 --> 00:21:02,750 what we don’t know. And raw data, however big, 289 00:21:02,750 --> 00:21:07,510 does not get us to patterns. And here is a bunch of… 290 00:21:07,510 --> 00:21:11,569 kinds of raw data that I’ve used and that I really enjoy using. 291 00:21:11,569 --> 00:21:14,270 You know – truth commission testimonies, UN investigations, press articles, 292 00:21:14,270 --> 00:21:18,309 SMS messages, crowdsourcing, NGO documentation, social media feeds, 293 00:21:18,309 --> 00:21:21,180 perpetrator records, government archives, state agency registries – I know those 294 00:21:21,180 --> 00:21:23,570 sound all the same but they actually turn out to be slightly different. 
295 00:21:23,570 --> 00:21:28,340 Happy to talk in tedious detail! Refugee Camp records, any non-random sample. 296 00:21:28,340 --> 00:21:31,990 All of those are gonna take some kind of probability model 297 00:21:31,990 --> 00:21:36,070 and we don’t have that many probability models to use. So 298 00:21:36,070 --> 00:21:40,330 raw data is great for cases – but it doesn’t get you to patterns. 299 00:21:40,330 --> 00:21:45,120 And patterns – again – patterns are the thing that allow us to do analysis. 300 00:21:45,120 --> 00:21:49,289 They are the thing… the patterns are what get us to something that we can use 301 00:21:49,289 --> 00:21:53,629 to help prosecutors, advocates and the… 302 00:21:53,629 --> 00:21:56,409 and the victims themselves. 303 00:21:56,409 --> 00:22:00,589 I gave a version of this talk, a much earlier version of this talk 304 00:22:00,589 --> 00:22:04,630 several years ago in Medellín, Colombia. I’ve worked a lot in Colombia, 305 00:22:04,630 --> 00:22:07,670 it’s really… it’s a great place to work. There’s really terrific 306 00:22:07,670 --> 00:22:13,569 Victims Rights groups there. And a woman from a township, 307 00:22:13,569 --> 00:22:17,310 smaller than a county, near to Medellín came up to me after the talk and she said: 308 00:22:17,310 --> 00:22:21,150 “You know, a lot of people… you know I’m a Human Rights activist, 309 00:22:21,150 --> 00:22:25,309 my job is to collect data, I tell stories about people who have suffered. 310 00:22:25,309 --> 00:22:28,210 But there are people in my village I know who have had 311 00:22:28,210 --> 00:22:32,910 people in their families disappeared and they’re never gonna talk about it, ever. 312 00:22:32,910 --> 00:22:38,090 We’re never going to be able to use their names, because they are afraid.” 313 00:22:38,090 --> 00:22:45,349 We can’t name the victims. At least we’d better count them. 314 00:22:45,349 --> 00:22:49,520 So about that counting: there’s 3 ways to do it right. 
You can have 315 00:22:49,520 --> 00:22:54,430 a perfect census – you can have all the data. Yeah it’s nice, good luck with that. 316 00:22:54,430 --> 00:22:58,910 You can have a random sample of the population – that’s hard! 317 00:22:58,910 --> 00:23:03,029 Sometimes doable but very hard. In my experience we rarely interview 318 00:23:03,029 --> 00:23:07,140 victims of homicide, very rarely. *Laughing* 319 00:23:07,140 --> 00:23:09,640 And that means there’s a complicated probability relationship between 320 00:23:09,640 --> 00:23:13,670 the person you sampled, the interview and the death that they talk to you about. 321 00:23:13,670 --> 00:23:17,300 Or you can do some kind of posterior modeling of the sampling process which is… 322 00:23:17,300 --> 00:23:21,260 which is in essence what I proposed in the earlier slide. 323 00:23:21,260 --> 00:23:25,020 So what can we do with raw data, guys? We can collect a bunch of… 324 00:23:25,020 --> 00:23:28,930 We can say that a case exists. Ok – that’s actually important! We can say: 325 00:23:28,930 --> 00:23:34,409 “Something happened” with raw data. We can say: “We know something about that case”. 326 00:23:34,409 --> 00:23:38,250 We can say: “There were 100 victims in that case or at least 100 victims 327 00:23:38,250 --> 00:23:41,570 in that case”, if we can name 100 people. 328 00:23:41,570 --> 00:23:46,390 But we can’t do comparisons: “This is the biggest massacre this year”. 329 00:23:46,390 --> 00:23:48,350 We don’t really know. Because we don’t know about the massacres 330 00:23:48,350 --> 00:23:53,910 we don’t know about. No patterns. Don’t talk about the hot spot of violence. 331 00:23:53,910 --> 00:23:59,420 No, we don’t know that. Happy to talk more about that if we gather after, 332 00:23:59,420 --> 00:24:06,439 but I wanna come to a close here with the importance of getting it right. 333 00:24:06,439 --> 00:24:11,380 I’ve talked about one case today. 
This is another case, the case of this man: 334 00:24:11,380 --> 00:24:16,320 Edgar Fernando García. Mr. García was a student Labor leader in Guatemala 335 00:24:16,320 --> 00:24:19,800 early in the 1980s. He left his office in February 1984 336 00:24:19,800 --> 00:24:24,470 – did not come home. People reported later that they saw someone 337 00:24:24,470 --> 00:24:28,810 shoving Mr. García into a vehicle and driving away. 338 00:24:28,810 --> 00:24:33,900 His widow became a very important Human Rights activist in Guatemala 339 00:24:33,900 --> 00:24:38,570 and now she’s a very important, and in my opinion impressive politician. 340 00:24:38,570 --> 00:24:42,240 And there’s her infant daughter. She continued to struggle to find out 341 00:24:42,240 --> 00:24:46,130 what had happened to Mr. García for decades. 342 00:24:46,130 --> 00:24:50,400 And in 2006 documents came to light in the National Archives of the… 343 00:24:50,400 --> 00:24:54,429 excuse me, the Historical Archives of the National Police, showing that 344 00:24:54,429 --> 00:24:59,320 the Police had realized an operation in the area of Mr. García’s office 345 00:24:59,320 --> 00:25:01,930 and it was very likely that they had disappeared him. 346 00:25:01,930 --> 00:25:07,400 These 2 guys up here in the upper right were Police officers in that area; 347 00:25:07,400 --> 00:25:11,359 they were arrested, charged with the disappearance of Mr. García and 348 00:25:11,359 --> 00:25:15,620 convicted. Part of the evidence used to convict them was communications metadata 349 00:25:15,620 --> 00:25:19,510 showing that documents flowed through the archive. 350 00:25:19,510 --> 00:25:23,699 I mean paper communications! We coded it by hand. We went through and read 351 00:25:23,699 --> 00:25:28,459 the ‘From’ and ‘To’ lines from every Memo. And 352 00:25:28,459 --> 00:25:34,229 they were convicted in 2010 and after that conviction 353 00:25:34,229 --> 00:25:38,699 Mr.
García’s infant daughter – now a grown woman – was clearly joyful. 354 00:25:38,699 --> 00:25:42,730 Justice brings closure to a family that never knows when to start talking 355 00:25:42,730 --> 00:25:48,059 about someone in the past tense. Perhaps even more powerfully: 356 00:25:48,059 --> 00:25:52,319 those guys’ grand boss, their boss’s boss, Colonel Héctor Bol de la Cruz, 357 00:25:52,319 --> 00:25:58,439 this man here, was convicted of Mr. García’s disappearance 358 00:25:58,439 --> 00:26:02,069 in September this year [2013]. *applause* 359 00:26:02,069 --> 00:26:07,610 *applause* 360 00:26:07,610 --> 00:26:10,789 I don’t know if any of you have ever been dissident students, 361 00:26:10,789 --> 00:26:15,330 but if you’ve been dissident students demonstrating in the street think about 362 00:26:15,330 --> 00:26:19,300 how you would feel if your friends and comrades were disappeared, 363 00:26:19,300 --> 00:26:23,419 and take a long look at Colonel Bol de la Cruz. Here is the rest of the stuff 364 00:26:23,419 --> 00:26:25,626 that we will talk about if we gather afterwards. Thank you very much 365 00:26:25,626 --> 00:26:29,086 for your attention. I really have enjoyed CCC. 366 00:26:29,086 --> 00:26:36,086 *applause* 367 00:26:36,086 --> 00:26:47,203 *Subtitles created by c3subtitles.de in the year 2016. Join and help us!*