1
00:00:10,080 --> 00:00:17,985
*applause*

2
00:00:17,985 --> 00:00:22,900
Thank you very much, can you…
You can hear me? Yes!

3
00:00:22,900 --> 00:00:27,620
I’ve been at this now for 23 years.
My colleagues and I,

4
00:00:27,620 --> 00:00:31,390
we worked in about 30 countries,
we’ve advised 9 Truth Commissions,

5
00:00:31,390 --> 00:00:36,410
official Truth Commissions, 4 UN missions,

6
00:00:36,410 --> 00:00:40,150
4 international criminal tribunals.
We have testified in 4 different cases

7
00:00:40,150 --> 00:00:44,240
– 2 internationally, 2 domestically – and
we’ve advised dozens and dozens

8
00:00:44,240 --> 00:00:49,120
of non-governmental Human Rights groups
around the world. The point of this stuff

9
00:00:49,120 --> 00:00:54,180
is to figure out how to bring the
knowledge of the people who’ve suffered

10
00:00:54,180 --> 00:00:58,770
human rights violations to bear,
on demanding accountability

11
00:00:58,770 --> 00:01:04,960
from the perpetrators. Our job is to
figure out how we can tell the truth.

12
00:01:04,960 --> 00:01:09,240
It is one of the moral foundations of the
international Human Rights movement

13
00:01:09,240 --> 00:01:14,220
that we speak Truth to Power. We
look in the face of the powerful

14
00:01:14,220 --> 00:01:19,299
and we tell them what we believe
they have done that is wrong.

15
00:01:19,299 --> 00:01:23,639
If that’s gonna work, we
have to speak the truth.

16
00:01:23,639 --> 00:01:29,470
We have to be right, we
have to get the analysis right.

17
00:01:29,470 --> 00:01:33,979
That’s not always easy and to get there,

18
00:01:33,979 --> 00:01:37,209
there are sort of 3 themes that
I wanna try to touch on in this talk.

19
00:01:37,209 --> 00:01:40,379
Since the talk is pretty short I’m
really gonna touch on 2 of them, so

20
00:01:40,379 --> 00:01:43,619
at the very end of the talk I’ll invite
people who’d like to talk more about

21
00:01:43,619 --> 00:01:49,270
the specifically technical aspects of this
work, about classifiers, about clustering,

22
00:01:49,270 --> 00:01:53,620
about statistical estimation, about
database techniques. People who wanna talk

23
00:01:53,620 --> 00:01:56,990
about that I’d love to gather and we’ll
try to find a space. I’ve been fighting

24
00:01:56,990 --> 00:02:00,460
with the Wiki for 2 days; I think
I’m probably not the only one.

25
00:02:00,460 --> 00:02:04,959
We can gather, we can talk about
that stuff more in detail. So today,

26
00:02:04,959 --> 00:02:09,990
in the next 25 minutes I’m
going to focus specifically on

27
00:02:09,990 --> 00:02:14,520
the trial of General
José Efraín Ríos Montt

28
00:02:14,520 --> 00:02:20,200
who ruled Guatemala from
March 1982 until August 1983.

29
00:02:20,200 --> 00:02:25,180
That’s General Ríos, there in
the upper corner in the red tie.

30
00:02:25,180 --> 00:02:30,600
During the government
of General Ríos Montt

31
00:02:30,600 --> 00:02:35,610
tens of thousands of people were killed by
the army of Guatemala. And the question

32
00:02:35,610 --> 00:02:39,610
that has been facing Guatemalans
since that time is:

33
00:02:39,610 --> 00:02:44,080
“Did the pattern of killing
that the army committed

34
00:02:44,080 --> 00:02:49,690
constitute acts of genocide?”. Now
genocide is a very specific crime

35
00:02:49,690 --> 00:02:54,420
in International Law. It does not
mean you killed a lot of people.

36
00:02:54,420 --> 00:02:58,910
There are other war crimes for mass
killing. Genocide specifically means

37
00:02:58,910 --> 00:03:03,930
that you picked out a particular group;
and to the exclusion of other groups

38
00:03:03,930 --> 00:03:08,460
nearby them you focused
on eliminating that group.

39
00:03:08,460 --> 00:03:14,240
That’s key because for a statistician
that gives us a hypothesis we can test

40
00:03:14,240 --> 00:03:18,860
which is: “What is the relative risk,
what is the differential probability

41
00:03:18,860 --> 00:03:22,820
of people in the target group being
killed relative to their neighbours

42
00:03:22,820 --> 00:03:28,150
who are not in the target group?”
So without further ado,

43
00:03:28,150 --> 00:03:31,970
let’s look at the relative risk of
being killed for indigenous people

44
00:03:31,970 --> 00:03:36,880
in the 3 rural counties of
Chajul, Cotzal and Nebaj

45
00:03:36,880 --> 00:03:41,400
relative to their
non-indigenous neighbours.

46
00:03:41,400 --> 00:03:45,960
We have – and I’ll talk in a moment about
how we have this – we have information,

47
00:03:45,960 --> 00:03:51,490
and evidence, and estimations of the
deaths of about 2150 indigenous people.

48
00:03:51,490 --> 00:03:58,550
People killed by the army in the period
of the government of General Ríos.

49
00:03:58,550 --> 00:04:02,550
The population, the total number of
people alive who were indigenous

50
00:04:02,550 --> 00:04:07,370
in those counties in the census
of 1981 is about 39,000.

51
00:04:07,370 --> 00:04:14,500
So the approximate crude mortality
rate due to homicide by the army

52
00:04:14,500 --> 00:04:18,710
is 5.5% for indigenous people in
that period. Now that’s relative

53
00:04:18,710 --> 00:04:22,890
to the homicide rate for non-indigenous
people in the same place

54
00:04:22,890 --> 00:04:27,200
of approximately 0.7%. So what
we ask is: “What is the ratio

55
00:04:27,200 --> 00:04:30,530
between those 2 numbers?” And
the ratio between those 2 numbers

56
00:04:30,530 --> 00:04:35,600
is the relative risk. It’s approximately
8. We interpret that as: if you were

57
00:04:35,600 --> 00:04:41,339
an indigenous person alive in
one of those 3 counties in 1982,

58
00:04:41,339 --> 00:04:46,939
your probability of being killed
by the army was 8 times greater

59
00:04:46,939 --> 00:04:51,069
than a person also living
in those 3 counties

60
00:04:51,069 --> 00:04:56,179
who was not indigenous.
Eight times, 8 times!
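As a sketch, the arithmetic just described can be reproduced with the approximate figures quoted in the talk (the non-indigenous rate is used directly, since its underlying counts are not given here):

```python
# Approximate figures quoted for Chajul, Cotzal and Nebaj, 1982-83.
indigenous_deaths = 2150      # estimated killings of indigenous people by the army
indigenous_pop = 39_000       # indigenous population in the 1981 census
rate_indigenous = indigenous_deaths / indigenous_pop   # ~0.055, i.e. 5.5%
rate_non_indigenous = 0.007   # ~0.7%, as quoted in the talk

# The relative risk is the ratio of the two crude mortality rates.
relative_risk = rate_indigenous / rate_non_indigenous
print(f"indigenous homicide rate: {rate_indigenous:.1%}")
print(f"relative risk: {relative_risk:.1f}")   # roughly 8
```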

61
00:04:56,179 --> 00:05:00,250
To put that in relative terms:
the relative risk of being

62
00:05:00,250 --> 00:05:04,720
a Bosniac relative to being a Serb
during the war in Bosnia

63
00:05:04,720 --> 00:05:09,800
was a little less than 3. So your
relative risk of being indigenous

64
00:05:09,800 --> 00:05:13,310
was more than twice – nearly 3 times –
as much as your relative risk

65
00:05:13,310 --> 00:05:19,200
of being Bosniac in the Bosnian War.
It’s an astonishing level of focus.

66
00:05:19,200 --> 00:05:23,809
It shows tremendous planning
and coherence, I believe.

67
00:05:23,809 --> 00:05:29,469
So, again coming back to the statistical
conclusion, how do we come to that?

68
00:05:29,469 --> 00:05:32,849
How do we find that information? How do we
make that conclusion? First, we’re only

69
00:05:32,849 --> 00:05:35,470
looking at homicides committed by the
army. We’re not looking at homicides

70
00:05:35,470 --> 00:05:39,409
committed by other parties, by
the guerrillas, by private actors.

71
00:05:39,409 --> 00:05:44,499
We’re not looking at excess mortality,
the mortality that we might find

72
00:05:44,499 --> 00:05:47,709
in conflict that is in excess of
normal peacetime mortality.

73
00:05:47,709 --> 00:05:51,470
We’re not looking at any of that,
only homicide. And the percentage

74
00:05:51,470 --> 00:05:55,330
relates the number of people killed by the
army with the population that was alive.

75
00:05:55,330 --> 00:05:58,650
That’s crucial here. We’re looking at
rates and we’re comparing the rate

76
00:05:58,650 --> 00:06:02,430
of the indigenous people shown in the
blue bar to non-indigenous people

77
00:06:02,430 --> 00:06:06,869
shown in the green bar. The width of
the bars show the relative populations

78
00:06:06,869 --> 00:06:11,829
in each of those 2 communities. So clearly
there are many more indigenous people,

79
00:06:11,829 --> 00:06:14,980
but a higher fraction of them are also
killed. The bars also show something else.

80
00:06:14,980 --> 00:06:18,049
And that’s what I’ll focus on for the
rest of the talk. There are 2 sections

81
00:06:18,049 --> 00:06:22,159
to each of the 2 bars, a dark section
on the bottom, a lighter section on top.

82
00:06:22,159 --> 00:06:27,779
And what that indicates is what we know
in terms of being able to name people

83
00:06:27,779 --> 00:06:31,249
with their first and last name, their
location and dates of death, and

84
00:06:31,249 --> 00:06:35,560
what we must infer statistically. Now I’m
beginning to touch on the second theme

85
00:06:35,560 --> 00:06:40,949
of my talk: Which is that when we are
studying mass violence and war crimes,

86
00:06:40,949 --> 00:06:48,749
we cannot do statistical or pattern
analysis with raw information.

87
00:06:48,749 --> 00:06:51,950
We must use the tools of mathematical
statistics to understand

88
00:06:51,950 --> 00:06:56,080
what we don’t know! The information
which cannot be observed directly.

89
00:06:56,080 --> 00:07:00,649
We have to estimate that in order to
control for the process of the production

90
00:07:00,649 --> 00:07:04,989
of information. Information doesn’t just
fall out of the sky, the way it does

91
00:07:04,989 --> 00:07:10,359
for industry. If I’m running an ISP I know
every packet that runs through my routers.

92
00:07:10,359 --> 00:07:14,959
That’s not how the social world works. In
order to find information about killings

93
00:07:14,959 --> 00:07:17,889
we have to hear about that killing from
someone, we have to investigate,

94
00:07:17,889 --> 00:07:22,119
we have to find the human remains.
And if we can’t observe the killing

95
00:07:22,119 --> 00:07:28,130
we won’t hear about it and many killings
are hidden. In my team we have a kind of

96
00:07:28,130 --> 00:07:33,760
catch phrase: if a lawyer
is killed in a big city at high noon

97
00:07:33,760 --> 00:07:38,259
the world knows about it before
dinner time. Every single time.

98
00:07:38,259 --> 00:07:41,850
But when a rural peasant is killed 3 days’
walk from a road in the dead of night,

99
00:07:41,850 --> 00:07:45,489
we’re unlikely to ever hear. And
technology is not changing this.

100
00:07:45,489 --> 00:07:48,899
I’ll argue later that technology is
actually making the problem worse.

101
00:07:48,899 --> 00:07:53,470
So, let’s get back to Guatemala
and just conclude

102
00:07:53,470 --> 00:07:57,950
that the little vertical bars, little
vertical lines at the top of each bar

103
00:07:57,950 --> 00:08:03,079
indicate the confidence interval. Which is
similar to what lay people sometimes call

104
00:08:03,079 --> 00:08:07,199
a margin of error. It is our level of
uncertainty about each of those estimates

105
00:08:07,199 --> 00:08:10,960
and you’ll notice that the uncertainty
is much, much smaller than

106
00:08:10,960 --> 00:08:14,509
the difference between the 2 bars. The
uncertainty does not affect our ability

107
00:08:14,509 --> 00:08:17,970
to draw the conclusion that there
was a spectacular difference

108
00:08:17,970 --> 00:08:21,900
in the mortality rates between the
people who were the hypothesized

109
00:08:21,900 --> 00:08:26,630
target of genocide and those who were not.

110
00:08:26,630 --> 00:08:30,520
Now the data: first we
had the census of 1981,

111
00:08:30,520 --> 00:08:35,339
this was a crucial piece. I think there’s
very interesting questions to ask

112
00:08:35,339 --> 00:08:39,609
about why the Government of Guatemala
conducted a census on the eve of

113
00:08:39,609 --> 00:08:44,540
committing a genocide. There is excellent
work done by historical demographers

114
00:08:44,540 --> 00:08:47,950
about the use of censuses in mass
violence. It has been common

115
00:08:47,950 --> 00:08:52,880
throughout history. In parallel,

116
00:08:52,880 --> 00:08:57,420
there were 4 very large
projects. First, the CIIDH

117
00:08:57,420 --> 00:09:01,600
– a group of non-Governmental
Human Rights groups –

118
00:09:01,600 --> 00:09:06,610
collected 1240 records of deaths
in this three-county region.

119
00:09:06,610 --> 00:09:11,750
Next, the Catholic Church collected
a bit fewer than 800 deaths.

120
00:09:11,750 --> 00:09:16,539
The truth commission – the Comisión
para el Esclarecimiento Histórico (CEH) –

121
00:09:16,539 --> 00:09:22,000
conducted a really big research
project in the late 1990s and

122
00:09:22,000 --> 00:09:25,810
of that we got information about a little
bit more than a thousand deaths.

123
00:09:25,810 --> 00:09:30,450
And then the National Program for
Compensation is very, very large

124
00:09:30,450 --> 00:09:35,370
and gave us about 4700
records of deaths.

125
00:09:35,370 --> 00:09:40,659
Now, this is interesting
but this is not unique.

126
00:09:40,659 --> 00:09:45,769
Many of the deaths are reported in common
across those data sources and so…

127
00:09:45,769 --> 00:09:49,490
we think about this in terms of a Venn
diagram. We think of: how did these

128
00:09:49,490 --> 00:09:54,329
different data sets intersect with each
other or collide with each other. And

129
00:09:54,329 --> 00:09:59,130
we can diagram that as in the sense
of these 3 white circles intersecting.

130
00:09:59,130 --> 00:10:05,610
But as I mentioned earlier we’re also
interested in what we have not observed.

131
00:10:05,610 --> 00:10:09,490
And this is crucial for us because
when we’re thinking about

132
00:10:09,490 --> 00:10:13,420
how much information we have, we have to
distinguish between the world on the left,

133
00:10:13,420 --> 00:10:17,200
in which our intersecting circles
cover about a third of the reality,

134
00:10:17,200 --> 00:10:21,829
versus the world on the right where our
intersecting circles cover all of reality.

135
00:10:21,829 --> 00:10:26,390
These are very different worlds; and the
reason they’re so different is not simply

136
00:10:26,390 --> 00:10:29,710
because we want to know the magnitude,
not simply because we want to know

137
00:10:29,710 --> 00:10:34,490
the total number of killings. That’s
important – but even more important:

138
00:10:34,490 --> 00:10:40,160
we have to know that we’ve covered,
we’ve estimated in equal proportions

139
00:10:40,160 --> 00:10:44,430
the two parties. We have to estimate in
equal proportions the number of deaths

140
00:10:44,430 --> 00:10:48,340
of non-indigenous people and the
number of deaths of indigenous people.

141
00:10:48,340 --> 00:10:51,510
Because if we don’t get those
estimates correct our comparison

142
00:10:51,510 --> 00:10:56,080
of their mortality rates will be biased.
Our story will be wrong. We will fail

143
00:10:56,080 --> 00:11:01,840
to speak Truth to Power. We can’t have
that. So what do we do? Algebra!

144
00:11:01,840 --> 00:11:06,390
Algebra is our friend. So I’m gonna
give you just a tiny taste of how we

145
00:11:06,390 --> 00:11:09,650
solve this problem and I’m going to
introduce a series of assumptions.

146
00:11:09,650 --> 00:11:13,279
Those of you who would like to debate
those assumptions: I invite you to join me

147
00:11:13,279 --> 00:11:18,359
after the talk and we will talk endlessly
and tediously about capture heterogeneity.

148
00:11:18,359 --> 00:11:22,240
But in the short term,

149
00:11:22,240 --> 00:11:27,940
we have a universe N of total killings in
a specific time/space/ethnicity/location.

150
00:11:27,940 --> 00:11:30,690
And of that we have 2 projects A and B.

151
00:11:30,690 --> 00:11:34,619
A captures some number of
deaths from the universe N,

152
00:11:34,619 --> 00:11:40,169
and the probability with which a death is
captured by project A from the universe N

153
00:11:40,169 --> 00:11:44,600
is by elementary probability theory the
number of deaths documented by A

154
00:11:44,600 --> 00:11:48,740
divided by the unknown number
of deaths in the population N.

155
00:11:48,740 --> 00:11:52,969
Similarly, the probability with which a
death from N is documented by project B

156
00:11:52,969 --> 00:11:58,149
is B over N, and this is the cool part:
the probability with which a death

157
00:11:58,149 --> 00:12:01,949
is documented by both A and B is M over N.

158
00:12:01,949 --> 00:12:05,579
Now we can put the 2 databases together,
we can compare them. Let’s talk about

159
00:12:05,579 --> 00:12:09,370
the use of random forest classifiers
and clustering to do that later.

160
00:12:09,370 --> 00:12:12,489
But we can put the 2 databases together,
compare them, determine the deaths

161
00:12:12,489 --> 00:12:17,429
that are in M – that is, in both
A and B – and divide M by N.

162
00:12:17,429 --> 00:12:23,060
But, also by probability theory, the
probability that a death occurs in M

163
00:12:23,060 --> 00:12:27,740
is equal to the product of
the individual probabilities.

164
00:12:27,740 --> 00:12:31,619
The probability of any compound event, an
event made up of two independent events is

165
00:12:31,619 --> 00:12:36,410
equal to the product of the probabilities of
those two events, so M over N is equal to

166
00:12:36,410 --> 00:12:41,420
A over N times B over N. Solve for N.

167
00:12:41,420 --> 00:12:45,140
Multiply it through by N squared, divide
by M, and we have an estimate of N

168
00:12:45,140 --> 00:12:49,360
which is equal to AB over M. Now, the
lights in my eyes, I can’t see, but I saw

169
00:12:49,360 --> 00:12:52,740
a few light bulbs go off over people’s
heads. And when I showed this proof

170
00:12:52,740 --> 00:12:57,180
to the judge in the trial of General Ríos

171
00:12:57,180 --> 00:13:01,529
I saw a light bulb go on over her head.

172
00:13:01,529 --> 00:13:04,379
It’s a beautiful thing,
it’s a beautiful thing.

173
00:13:04,379 --> 00:13:09,509
*applause*
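The derivation just shown – M over N equals A over N times B over N, solved for N – can be written as a tiny function. The counts in the example are invented purely for illustration, not the Guatemalan figures:

```python
def two_system_estimate(a, b, m):
    """Two-system (Lincoln-Petersen) estimate: N-hat = A*B / M.

    a, b: number of deaths documented by projects A and B;
    m: deaths documented by both projects (the overlap).
    Assumes the two projects capture deaths independently of
    each other - the assumption the talk flags for later debate."""
    if m == 0:
        raise ValueError("no overlap between lists: estimate undefined")
    return a * b / m

# Hypothetical counts, purely for illustration.
print(two_system_estimate(1000, 800, 400))  # -> 2000.0
```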

174
00:13:09,509 --> 00:13:12,660
So we don’t do it in 2 systems because
that takes a lot of assumptions.

175
00:13:12,660 --> 00:13:16,069
We do it in 4. You will recall that we
have 4 data sources. We organize

176
00:13:16,069 --> 00:13:21,530
the data sources in this format
such that we have an inclusion

177
00:13:21,530 --> 00:13:26,249
and an exclusion pattern in the table on
the left, for which we can define

178
00:13:26,249 --> 00:13:29,810
the number of deaths which fall into
each of these intersecting patterns.
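A minimal sketch of building that inclusion/exclusion table for four overlapping lists; the record IDs and list contents are invented for illustration (the real matching step uses the classifiers and clustering mentioned earlier):

```python
# Four hypothetical lists of matched record IDs (invented data).
lists = {
    "CIIDH":  {1, 2, 3, 5},
    "Church": {2, 3, 6},
    "CEH":    {1, 3, 6, 7},
    "PNR":    {3, 5, 7, 8},
}
names = list(lists)
all_ids = set().union(*lists.values())

# Tally the 2^4 - 1 observable inclusion/exclusion patterns:
# each record gets a 0/1 tuple saying which lists captured it.
counts = {}
for rec in sorted(all_ids):
    pattern = tuple(int(rec in lists[n]) for n in names)
    counts[pattern] = counts.get(pattern, 0) + 1

for pattern, c in sorted(counts.items()):
    print(pattern, c)
# The all-zeros cell (deaths appearing on no list) is the unknown
# to be estimated, e.g. by fitting a model to these counts.
```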

179
00:13:29,810 --> 00:13:33,729
And I’ll give you a very quick
metaphor here. The metaphor is:

180
00:13:33,729 --> 00:13:38,239
imagine that you have 2 dark rooms and you
want to assess the size of those 2 rooms

181
00:13:38,239 --> 00:13:42,049
– which room is larger? And the only
tool that you have to assess the size

182
00:13:42,049 --> 00:13:46,359
of those rooms is a handful of little
rubber balls. The little rubber balls

183
00:13:46,359 --> 00:13:50,400
have a property that when they hit each
other they make a sound. *makes CLICK sound*

184
00:13:50,400 --> 00:13:53,390
So we throw the balls into the first
room and we listen, and we hear

185
00:13:53,390 --> 00:13:57,190
*makes several CLICK sounds*. We
collect the balls, go to the second room,

186
00:13:57,190 --> 00:14:00,490
throw them with equal force – imagining
a spherical cow of uniform density!

187
00:14:00,490 --> 00:14:03,950
We throw the balls into the second
room with equal force and we hear

188
00:14:03,950 --> 00:14:07,799
*makes one CLICK sound*
So which room is larger?

189
00:14:07,799 --> 00:14:12,070
The second room, because we hear fewer
collisions, right? Well, the estimation,

190
00:14:12,070 --> 00:14:15,620
the toy example I gave in the previous
slide is the mathematical formalization

191
00:14:15,620 --> 00:14:20,070
of the intuition that fewer
collisions mean a larger space.

192
00:14:20,070 --> 00:14:23,329
And so what we’re doing here is
laying out the pattern of collisions.

193
00:14:23,329 --> 00:14:26,679
Not just the collisions, the pairwise
collisions, but the three-way and

194
00:14:26,679 --> 00:14:31,409
four-way collisions. And that
allows us to make the estimate

195
00:14:31,409 --> 00:14:37,439
that was shown in the bar graph as
the light part of each of the bars. So

196
00:14:37,439 --> 00:14:41,460
we can come back to our conclusion and put
a confidence interval on the estimates.

197
00:14:41,460 --> 00:14:45,910
And the confidence intervals are shown
there. Now I’m gonna move through this

198
00:14:45,910 --> 00:14:50,850
somewhat more quickly to get to the end of
the talk but I wanna put up one more slide

199
00:14:50,850 --> 00:14:56,240
that was used in the testimony
and that is that we divided time

200
00:14:56,240 --> 00:15:01,220
into 16-month periods and
compared the 16-month period of

201
00:15:01,220 --> 00:15:04,580
General Ríos’s governance – now it’s only
16 months ’cause we went April to July,

202
00:15:04,580 --> 00:15:07,679
because it’s only a few days in August, a
few days in March, so we shaved those off,

203
00:15:07,679 --> 00:15:12,310
okay… – 16-month period of General
Ríos’s Government and compared it

204
00:15:12,310 --> 00:15:17,110
to several periods before and after. And
I think that the key observation here

205
00:15:17,110 --> 00:15:21,809
is that the rate of killing
against indigenous people

206
00:15:21,809 --> 00:15:26,729
is substantially higher under General
Ríos’s Government than under previous

207
00:15:26,729 --> 00:15:33,280
or succeeding governments. But more
importantly the ratio between the two,

208
00:15:33,280 --> 00:15:37,950
the relative risk of being killed as an
indigenous person, was at its peak

209
00:15:37,950 --> 00:15:42,639
during the government of General Ríos.

210
00:15:42,639 --> 00:15:46,709
Have we proven genocide? No.

211
00:15:46,709 --> 00:15:49,870
This is evidence consistent with the
hypothesis that acts of genocide

212
00:15:49,870 --> 00:15:53,539
were committed. The finding of genocide
is a legal finding, not so much

213
00:15:53,539 --> 00:15:58,580
a scientific one. So as scientists,
our job is to provide evidence that

214
00:15:58,580 --> 00:16:02,870
the finders of fact – the judges in this
case – can use in their determination.

215
00:16:02,870 --> 00:16:05,219
This is evidence consistent
with that hypothesis.

216
00:16:05,219 --> 00:16:08,189
Were this evidence otherwise, as
scientists we would say we would

217
00:16:08,189 --> 00:16:11,480
reject the hypothesis that genocide was
committed. However, with this evidence

218
00:16:11,480 --> 00:16:15,370
we find that the evidence,
the data is consistent with

219
00:16:15,370 --> 00:16:18,080
the prosecution’s hypothesis.

220
00:16:18,080 --> 00:16:25,320
So, it worked!

221
00:16:25,320 --> 00:16:29,049
Ríos Montt was convicted on
genocide charges. *applause*

222
00:16:29,049 --> 00:16:31,359
You can clap!
*applause*

223
00:16:31,359 --> 00:16:36,359
*applause*

224
00:16:36,359 --> 00:16:39,499
For a week!
*mumbled, surprised laughter*

225
00:16:39,499 --> 00:16:42,279
Then the Constitutional Court intervened,

226
00:16:42,279 --> 00:16:44,959
there I know a couple of experts on
Guatemala here in the audience

227
00:16:44,959 --> 00:16:47,839
who can tell you more about why that
happened and exactly what happened.

228
00:16:47,839 --> 00:16:52,669
However, the Constitutional
Court ordered a new trial,

229
00:16:52,669 --> 00:16:59,160
which is at this time scheduled
for the very beginning of 2015.

230
00:16:59,160 --> 00:17:02,970
And I look forward to testifying again,

231
00:17:02,970 --> 00:17:06,820
and again, and again, and again!

232
00:17:06,820 --> 00:17:12,680
*applause*

233
00:17:12,680 --> 00:17:16,989
Look, but I wanna come back to this point.
Because as a bunch of technologists…

234
00:17:16,989 --> 00:17:21,589
– there is a lot of folks who really like
technology here, I really like it too!

235
00:17:21,589 --> 00:17:25,559
Technology doesn’t get us to science
– you have to have science

236
00:17:25,559 --> 00:17:28,770
to get you to science. Technology helps
you organize the data. It helps you do

237
00:17:28,770 --> 00:17:32,050
all kinds of extremely great and cool
things without which we wouldn’t be able

238
00:17:32,050 --> 00:17:36,480
to even do the science. But you
can’t have just technology!

239
00:17:36,480 --> 00:17:40,970
You can’t just have a bunch of data
and make conclusions. That’s naive,

240
00:17:40,970 --> 00:17:44,529
and you will get the wrong conclusions.
‘The point of rigorous statistics is

241
00:17:44,529 --> 00:17:48,100
to be right’, and there is a little bit of
a caveat there – or to at least know

242
00:17:48,100 --> 00:17:51,620
how uncertain you are. Statistics is often
called the ‘Science of Uncertainty’.

243
00:17:51,620 --> 00:17:55,960
That is actually my favorite
definition of it. So,

244
00:17:55,960 --> 00:18:01,509
I’m going to assume that we
care about getting it right.

245
00:18:01,509 --> 00:18:05,489
No one laughed, that’s good.

246
00:18:05,489 --> 00:18:08,890
Not everyone does, to my distress.

247
00:18:08,890 --> 00:18:11,320
So if you only have some of the data

248
00:18:11,320 --> 00:18:15,490
– and I will argue that we always
only have some of the data –

249
00:18:15,490 --> 00:18:20,449
you need some kind of model that will tell
you the relationship between your data

250
00:18:20,449 --> 00:18:23,989
and the real world.
Statisticians call that an inference.

251
00:18:23,989 --> 00:18:26,200
In order to get from here to there
you’re gonna need some kind of

252
00:18:26,200 --> 00:18:30,469
probability model that tells you
why your data is like the world,

253
00:18:30,469 --> 00:18:33,960
or in what sense you have to tweak,
twiddle and do algebra with your data

254
00:18:33,960 --> 00:18:39,309
to get from what you can
observe to what is actually true.

255
00:18:39,309 --> 00:18:42,690
And statistics is about comparisons.
Yeah, we get a big number and

256
00:18:42,690 --> 00:18:46,169
journalists love the big number; but
it’s really about these relationships

257
00:18:46,169 --> 00:18:50,609
and patterns! So to get those
relationships and patterns,

258
00:18:50,609 --> 00:18:53,560
in order for them to be right, in order
for our answer to be correct,

259
00:18:53,560 --> 00:18:57,439
every one of the estimates we make
for every point in the pattern

260
00:18:57,439 --> 00:19:01,700
has to be right. It’s a hard
problem. It’s a hard problem.

261
00:19:01,700 --> 00:19:05,070
And what I worry about is that
we have come into this world

262
00:19:05,070 --> 00:19:09,400
in which people throw the notion of Big
Data around as though the data allows us

263
00:19:09,400 --> 00:19:14,230
to make an end-run around problems
of sampling and modeling. It doesn’t.

264
00:19:14,230 --> 00:19:19,120
So as technologists, the reason I’m,
you know, ranting at you guys about it

265
00:19:19,120 --> 00:19:24,540
is that it’s very tempting to have a lot
of data and think you have an answer!

266
00:19:24,540 --> 00:19:30,580
And it’s even more tempting because
in industry context you might be right.

267
00:19:30,580 --> 00:19:36,739
Not so much in Human Rights, not so
much. Violence is a hidden process.

268
00:19:36,739 --> 00:19:39,960
The people who commit violence have
an enormous commitment to hiding it,

269
00:19:39,960 --> 00:19:44,420
distorting it, explaining it in different
ways. All of those things dramatically

270
00:19:44,420 --> 00:19:48,350
affect the information that is produced
from the violence that we’re going to use

271
00:19:48,350 --> 00:19:53,730
to do our analysis. So we usually
don’t know what we don’t know

272
00:19:53,730 --> 00:19:58,220
in Human Rights data collection.
And that means that we don’t know

273
00:19:58,220 --> 00:20:03,829
if what we don’t know is systematically
different from what we do know.

274
00:20:03,829 --> 00:20:06,270
Maybe we know about all the lawyers
and we don’t know about the people

275
00:20:06,270 --> 00:20:10,070
in the countryside. Maybe we know
about all the indigenous people

276
00:20:10,070 --> 00:20:14,130
and not the non-indigenous people.
If that were true, the argument

277
00:20:14,130 --> 00:20:17,980
that I just made would be merely
an artifact of the reporting process

278
00:20:17,980 --> 00:20:21,740
rather than a true analysis. Now
we did the estimations, which is why I believe

279
00:20:21,740 --> 00:20:25,009
we can reject that critique, but that’s
what we have to worry about.

280
00:20:25,009 --> 00:20:28,860
And let’s go back to the Venn diagram
and say: which of these is accurate?

281
00:20:28,860 --> 00:20:32,840
It’s not just for one of the
points in our pattern analysis.

282
00:20:32,840 --> 00:20:36,500
The problem is that we’re
going to compare things.

283
00:20:36,500 --> 00:20:40,890
As in Peru where we compared killings
committed by the Peruvian army against

284
00:20:40,890 --> 00:20:44,860
killings committed by the Maoist guerrillas
of the Sendero Luminoso. And we found

285
00:20:44,860 --> 00:20:51,460
there that in fact we knew very little
about what the Sendero Luminoso had done.

286
00:20:51,460 --> 00:20:55,779
Whereas we knew almost everything
that the Peruvian army had done.

287
00:20:55,779 --> 00:20:57,970
This is called the coverage rate.
The ratio between what we know and

288
00:20:57,970 --> 00:21:02,750
what we don’t know. And
raw data, however big,

289
00:21:02,750 --> 00:21:07,510
does not get us to patterns.
And here is a bunch of…

290
00:21:07,510 --> 00:21:11,569
kinds of raw data that I’ve used
and that I really enjoy using.

291
00:21:11,569 --> 00:21:14,270
You know – truth commission testimonies,
UN investigations, press articles,

292
00:21:14,270 --> 00:21:18,309
SMS messages, crowdsourcing, NGO
documentation, social media feeds,

293
00:21:18,309 --> 00:21:21,180
perpetrator records, government archives,
state agency registries – I know those

294
00:21:21,180 --> 00:21:23,570
sound all the same but they actually
turn out to be slightly different.

295
00:21:23,570 --> 00:21:28,340
Happy to talk in tedious detail! Refugee
Camp records, any non-random sample.

296
00:21:28,340 --> 00:21:31,990
All of those are gonna take
some kind of probability model

297
00:21:31,990 --> 00:21:36,070
and we don’t have that many
probability models to use. So

298
00:21:36,070 --> 00:21:40,330
raw data is great for cases – but
it doesn’t get you to patterns.

299
00:21:40,330 --> 00:21:45,120
And patterns – again – patterns are
the thing that allow us to do analysis.

300
00:21:45,120 --> 00:21:49,289
The patterns are what
get us to something that we can use

301
00:21:49,289 --> 00:21:53,629
to help prosecutors, advocates and the…

302
00:21:53,629 --> 00:21:56,409
and the victims themselves.

303
00:21:56,409 --> 00:22:00,589
I gave a version of this talk, a
much earlier version of this talk

304
00:22:00,589 --> 00:22:04,630
several years ago in Medellín, Colombia.
I’ve worked a lot in Colombia,

305
00:22:04,630 --> 00:22:07,670
it’s really… it’s a great place to
work. There’s really terrific

306
00:22:07,670 --> 00:22:13,569
Victims Rights groups there.
And a woman from a township,

307
00:22:13,569 --> 00:22:17,310
smaller than a county, near to Medellín
came up to me after the talk and she said:

308
00:22:17,310 --> 00:22:21,150
“You know, a lot of people… you
know I’m a Human Rights activist,

309
00:22:21,150 --> 00:22:25,309
my job is to collect data, I tell stories
about people who have suffered.

310
00:22:25,309 --> 00:22:28,210
But there are people in my
village I know who have had

311
00:22:28,210 --> 00:22:32,910
people in their families disappeared and
they’re never gonna talk about it, ever.

312
00:22:32,910 --> 00:22:38,090
We’re never going to be able to use
their names, because they are afraid.”

313
00:22:38,090 --> 00:22:45,349
If we can’t name the victims, at
least we’d better count them.

314
00:22:45,349 --> 00:22:49,520
So about that counting: there’s
3 ways to do it right. You can have

315
00:22:49,520 --> 00:22:54,430
a perfect census – you can have all the
data. Yeah it’s nice, good luck with that.

316
00:22:54,430 --> 00:22:58,910
You can have a random sample
of the population - that’s hard!

317
00:22:58,910 --> 00:23:03,029
Sometimes doable but very hard.
In my experience we rarely interview

318
00:23:03,029 --> 00:23:07,140
victims of homicide, very rarely.
*Laughing*

319
00:23:07,140 --> 00:23:09,640
And that means there’s a complicated
probability relationship between

320
00:23:09,640 --> 00:23:13,670
the person you sampled, the interview
and the death that they talk to you about.

321
00:23:13,670 --> 00:23:17,300
Or you can do some kind of posterior
modeling of the sampling process which is…

322
00:23:17,300 --> 00:23:21,260
which is in essence what
I proposed in the earlier slide.

323
00:23:21,260 --> 00:23:25,020
So what can we do with raw data,
guys? We can collect a bunch of…

324
00:23:25,020 --> 00:23:28,930
We can say that a case exists. Ok
– that’s actually important! We can say:

325
00:23:28,930 --> 00:23:34,409
“Something happened” with raw data. We can
say: “We know something about that case".

326
00:23:34,409 --> 00:23:38,250
We can say: “There were 100 victims
in that case or at least 100 victims

327
00:23:38,250 --> 00:23:41,570
in that case”, if we can name 100 people.

328
00:23:41,570 --> 00:23:46,390
But we can’t do comparisons: “This
is the biggest massacre this year”.

329
00:23:46,390 --> 00:23:48,350
We don’t really know, because we
don’t know about the massacres

330
00:23:48,350 --> 00:23:53,910
we don’t know about. No patterns. Don’t
talk about the hot spots of violence.

331
00:23:53,910 --> 00:23:59,420
No, we don’t know that. Happy to talk
more about that if we gather after,

332
00:23:59,420 --> 00:24:06,439
but I wanna come to a close here with
the importance of getting it right.

333
00:24:06,439 --> 00:24:11,380
I’ve talked about one case today. This
is another case, the case of this man:

334
00:24:11,380 --> 00:24:16,320
Edgar Fernando García. Mr. García was
a student labor leader in Guatemala

335
00:24:16,320 --> 00:24:19,800
early in the 1980s. He left
his office in February 1984

336
00:24:19,800 --> 00:24:24,470
– did not come home. People reported
later that they saw someone

337
00:24:24,470 --> 00:24:28,810
shoving Mr. García into a
vehicle and driving away.

338
00:24:28,810 --> 00:24:33,900
His widow became a very important
Human Rights activist in Guatemala

339
00:24:33,900 --> 00:24:38,570
and now she’s a very important and,
in my opinion, impressive politician.

340
00:24:38,570 --> 00:24:42,240
And there’s her infant daughter. She
continued to struggle to find out

341
00:24:42,240 --> 00:24:46,130
what had happened to
Mr. García for decades.

342
00:24:46,130 --> 00:24:50,400
And in 2006 documents came to light
in the National Archives of the…

343
00:24:50,400 --> 00:24:54,429
excuse me, the Historical Archive
of the National Police, showing that

344
00:24:54,429 --> 00:24:59,320
the Police had carried out an operation
in the area of Mr. García’s office

345
00:24:59,320 --> 00:25:01,930
and it was very likely that
they had disappeared him.

346
00:25:01,930 --> 00:25:07,400
These 2 guys up here in the upper
right were Police officers in that area;

347
00:25:07,400 --> 00:25:11,359
they were arrested, charged with the
disappearance of Mr. García and

348
00:25:11,359 --> 00:25:15,620
convicted. Part of the evidence used to
convict them was communications metadata

349
00:25:15,620 --> 00:25:19,510
showing that documents
flowed through the archive.

350
00:25:19,510 --> 00:25:23,699
I mean paper communications! We coded
it by hand. We went through and read

351
00:25:23,699 --> 00:25:28,459
the ‘From’ and ‘To’ lines
of every memo. And

352
00:25:28,459 --> 00:25:34,229
they were convicted in 2010
and after that conviction

353
00:25:34,229 --> 00:25:38,699
Mr. García’s infant daughter – now
a grown woman – was clearly joyful.

354
00:25:38,699 --> 00:25:42,730
Justice brings closure to a family
that never knows when to start talking

355
00:25:42,730 --> 00:25:48,059
about someone in the past tense.
Perhaps even more powerfully:

356
00:25:48,059 --> 00:25:52,319
those guys’ grand boss, their boss’s
boss, Colonel Héctor Bol de la Cruz,

357
00:25:52,319 --> 00:25:58,439
this man here, was convicted
of Mr. García’s disappearance

358
00:25:58,439 --> 00:26:02,069
in September this year [2013].
*applause*

359
00:26:02,069 --> 00:26:07,610
*applause*

360
00:26:07,610 --> 00:26:10,789
I don’t know if any of you have
ever been dissident students,

361
00:26:10,789 --> 00:26:15,330
but if you’ve been dissident students
demonstrating in the street think about

362
00:26:15,330 --> 00:26:19,300
how you would feel if your friends
and comrades were disappeared,

363
00:26:19,300 --> 00:26:23,419
and take a long look at Colonel Bol
de la Cruz. Here is the rest of the stuff

364
00:26:23,419 --> 00:26:25,626
that we will talk about if we gather
afterwards. Thank you very much

365
00:26:25,626 --> 00:26:29,086
for your attention. I really
have enjoyed CCC.

366
00:26:29,086 --> 00:26:36,086
*applause*

367
00:26:36,086 --> 00:26:47,203
*Subtitles created by c3subtitles.de
in the year 2016. Join and help us!*
