The title of our next talk is quite descriptive, so I won't give too much of an introduction. Basically, we live in a world in which algorithms make more and more decisions about our daily lives. Many people are working on improving these algorithms; not as many are actually thinking about the implications. "Say Hi to Your New Boss: How Algorithms Might Soon Control Our Lives" is the title of our next talk. Please give a warm welcome to Andrew Stevens.

So hello, everyone. I have to say I'm really quite excited to be here again, and terrified as well, but mostly excited. First and foremost, I want to thank the organizers for inviting me again and for letting me speak at this really, really great conference.

As was said before, the title of my talk is "Say Hi to Your Boss", and I'm going to talk about algorithms and about shifting decision power from humans to machines. In case you were asking yourself why this is important: well, let's just ask a friend. I usually like to do that with Google autocomplete, and normally it gives me somewhat controversial statements like "algorithms are stupid" or "algorithms will never work". But in this case it seems pretty unambiguous: algorithms will play a very large role in this world. As I said, this is a big chance, because algorithms can improve our lives a lot. But it is also a problem, because we are shifting a lot of the decisions that are made by people today into the hands of machines, and in many cases we don't understand very well how these machines work and how exactly they make their decisions.
So I would say my main qualification for giving this talk is that I have shot myself in the foot with data a lot of times, and that made me interested in how and why algorithms do so many things we don't anticipate, and why they sometimes behave in ways that seem strange and contradictory to what we actually want to achieve with them. That is what I want to talk about, and we are going to do it like this. First, I will give you some theory: what an algorithm actually is, how machine learning algorithms make decisions, and how the whole big data thing and the new data-driven society play into this affair. Then I will show you some of the use cases for algorithms in our daily lives today. After that, we will be equipped with everything we need in order to start with some experiments. I come from physics, and when I try to understand something I usually do an experiment and try to break the thing, make it explode or whatever. So we are going to do the same here with our algorithms. I have picked two case studies to present: one about discrimination through algorithms, and another one about de-anonymization. Finally, I want to end with some proposals and ideas on how we can actually make the most of algorithms in this kind of setting, and also how to control and better understand what algorithms are doing.

OK. So first, as I said, I want to talk a bit about algorithms, and I just want to give you a very, very basic overview of machine learning and decision-making algorithms.
So please excuse me if there are any experts in the audience; I am probably making a lot of simplifications here.

OK, what is an algorithm? Here is an example. Basically, an algorithm is just a recipe that can be followed by a computer or a human being: it gives the computer or the human step-by-step instructions to achieve a certain goal. In this case, we want to activate a trapdoor, and we want to do that only if I am standing on the trapdoor. So the algorithm has to decide whether it's me; if it is, it can open the trapdoor, otherwise it has to wait. This is already a pretty fancy algorithm, because it needs some information about me, and it needs a reasonably intelligent way to decide whether the right person is standing on the trapdoor or not.

So how does the algorithm get that information? Well, it uses machine learning. Machine learning is a way to automatically generate a model that we can check against some training data, which we can then use to explain that data and, in addition, to predict unknown data. As you might know from school, just memorizing data and reproducing what you already know can get you through tests, but normally it won't make you pass with flying colors. Ideally we want something that can, in addition to memorizing data, also make predictions about data we have never seen before, and that is what machine learning helps us do. A bit more formalized, we can look at it as a model plus some data. Here on the right I show you several possible models that we can choose from.
Normally we can write them as explaining some variable y as a function M, for model, which takes some attributes or variables x and some parameters p, and returns a value for the quantity we want to predict: y = M(x, p) + ε. Now we can use data to train our models: we keep the models that are compatible with our training data and eliminate the ones that are not. Here on the right, we have eliminated all the models shown in red, whereas the models shown in green are compatible with our data. We can then use those surviving models to make predictions about unknown data points as well, which is shown here. Usually there is some error, some discrepancy, between the model and the data we try to explain, and this error is usually called epsilon.

Epsilon can be decomposed further into several parts. There is a systematic error, which is mostly due to miscalibration, an offset we make each time we measure a given variable; think of the speedometer in your car, which intentionally gives you a reading that is a bit too low to make sure you don't overstep the speed limit. In addition to the systematic error, there is noise in our data, which comes from the process that generated the data and from the measurement apparatus we used to capture it. And finally, there are hidden-variable errors, which are not random noise but errors due to variables that do have an impact on the outcome, but which we don't know and therefore cannot use to model the data.
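To make this a bit more concrete, here is a minimal sketch (my own illustration, not from the talk) of this kind of model selection: we keep the candidate models that are compatible with the noisy training data and use the survivors to predict an unseen point:

```python
import numpy as np

rng = np.random.default_rng(0)

# training data: y = M(x, p) + epsilon, with a linear "true" model and noise
x_train = np.linspace(0, 10, 20)
y_train = 1.5 * x_train + rng.normal(0, 1.0, size=x_train.size)

# candidate models: straight lines y = p * x with different parameters p
candidate_p = np.linspace(0.5, 2.5, 21)

def mse(p):
    """Mean squared error of candidate model p on the training data."""
    return np.mean((y_train - p * x_train) ** 2)

# keep the "green" models that are compatible with the data, drop the "red" ones
compatible = [p for p in candidate_p if mse(p) < 2.0]

# use the surviving models to predict an unseen data point
x_new = 12.0
predictions = [p * x_new for p in compatible]
print("compatible parameters:", [round(p, 2) for p in compatible])
print("prediction range at x=12:", round(min(predictions), 1), "to", round(max(predictions), 1))
```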
So that's the basics of model generation. Now, you have probably all heard about big data and the data-driven society, and the effect this has on model generation is threefold. For one, here you see roughly the data volume in 2000 compared to the data volume in 2015: today we have a lot more data on our hands to make predictions and train models. And we also have data of a much greater variety than before.

To understand the first effect, we can look at this graph, which shows some random data measured with pretty large noise, as you can see. The data also contains some information, and I doubt any of you can tell me whether the green points or the red points have the higher value. I guess not. But what we can do is take the data and average it, and by doing that we reduce the amount of noise. When we have enough samples to look at, we can make the noise so small that we can really detect a signal in the data; in this case the signal is just 0.01 high. So having more data in our hands allows us to train models that can take smaller effects into account. A small numerical sketch of this point follows below.
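As a small numerical sketch of this point (my own illustration, not the plot from the talk): a difference of 0.01 is invisible in a handful of noisy samples, but becomes clearly detectable once we average over enough of them:

```python
import numpy as np

rng = np.random.default_rng(1)

signal = 0.01   # true difference between the two groups
noise = 1.0     # standard deviation of the measurement noise

for n in (10, 1_000, 1_000_000):
    green = rng.normal(signal, noise, size=n)
    red = rng.normal(0.0, noise, size=n)
    measured = green.mean() - red.mean()
    # the uncertainty of the averaged difference shrinks like 1/sqrt(n)
    uncertainty = noise * np.sqrt(2.0 / n)
    print(f"n={n:>9}: measured difference = {measured:+.4f} (+/- {uncertainty:.4f})")
```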
Also, as I said, big data does not only give us more of the same data, it gives us different kinds of data. Think, for example, of all the smart devices you have in your home, like your smart fridge, your door, maybe an automated smoke detector, which all collect data about you and your interactions. We can incorporate that data into our models to make better predictions. This moves some of the noise that used to sit in hidden variables into the model, where we can use it for predictions.

Now, interpreting models can be hard, or it can be very simple, depending on the model. Some machine learning algorithms, like decision tree classifiers, are pretty easy to interpret, because we can just follow this graph and see exactly how the algorithm makes its decision about a given data point. Other models, like the neural network on the right side, are really hard to interpret: we can't get an intuitive feeling for how such a model actually makes its decisions. In fact, you have maybe seen these pictures; they basically show a neural network working in reverse, and they give us an idea of how the neural network understands a picture, in this case. You can see several structures that emerge at different places in the image, generated by the neural network while it is recognizing the features of the image. This method was developed precisely because it is really, really difficult to understand what a neural network is doing otherwise; the only handle we have is to look at what kind of input the network would produce for a given output.
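To illustrate the interpretability point, here is a small sketch (my own, assuming scikit-learn is available): the rules learned by a decision tree can simply be printed and read, which is not possible for a neural network:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# train a small decision tree on a toy data set
data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# the learned model is a human-readable set of if/else rules
print(export_text(tree, feature_names=list(data.feature_names)))
```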
So, now, what can you do with algorithms? Here I try to classify the uses of algorithms in our daily life into three different risk groups. There is a low-risk group, which affects our lives only superficially; if the algorithms making the decisions there go wrong or misbehave, it is only mildly annoying. Then there is the medium-risk area, where failure or misbehavior of an algorithm would be a bit more severe for our life, but not fatal. That is reserved for the high-risk area, where algorithms can take decisions that affect human lives and that can really be life-changing.

A few examples for the first group would be personalization services. Whenever you go to a website like Facebook or Amazon or Netflix, the website shows you some content and tries to show you content that is interesting to you. It uses an algorithm to do that, trying to predict from the articles you have viewed before which articles you will find interesting. This is the so-called recommendation engine, and it is in wide use in all kinds of services today. Then there is, of course, individualized ad targeting. You might have noticed that if you look at a product on some website and afterwards surf around on the web, ads for that kind of product seem to haunt you everywhere you go. This is also due to machine learning algorithms that try to predict which kind of ads you will find interesting and then show those ads to you on all kinds of different websites. And of course there are algorithms that do customer ratings.
For example, if you want to order a product online, they can estimate how likely it is that you will pay the invoice for that article, and if that likelihood is not very high, the system will only send you the article if you pay in advance. And there are things like customer demand prediction: the holy grail here would be to know what you want to buy before you know it yourself, and then send it to your door. After reading a patent, I think this is also what Amazon is trying to do in some cases. So these things affect our lives only superficially, and if something goes wrong it does not affect us in a very deep way.

Then there are other uses of algorithms in our lives. For example, a big topic coming up now, with big data and more data that we can collect about individuals, is personalized health: making decisions about possible treatments and lifestyle based on data collected about you, for example your heart rate, your pulse, how much you move around, how many stairs you climb each day. There is large potential here for improving areas such as medicine, but also others, and it uses the same or similar classification algorithms as the applications I showed you before. Another thing is person classification. Here we want to predict, for example, how likely it is that a person will commit a crime or will be a terrorist. These kinds of algorithms are already in use today, for example by governments, to issue restricted travel permits and to mark people that the algorithm gives a high risk profile for screening.
I think there are many talks here that deal with this particular problem as well. And of course there are autonomous cars, planes and machines, which are currently being developed or already in service, and which will take over driving from people in a few years, or maybe a few decades. And finally there is automated trading, which is mostly invisible to us but which also has a huge impact, because 95 or even 99 percent of all trades today are actually performed by algorithms and machines, not people.

Finally, there is the high-risk area, where we have things like military intelligence and intervention. There are already governments that use algorithms to predict targets, for example for drone strikes. We can also have governments that use machine learning and algorithms for political oppression, for example training firewall systems with heuristic algorithms to detect traffic that should be filtered out. And there are critical infrastructure services, like the electricity grid and other things that are critical to us, which are sometimes already governed or controlled by algorithms.

So as you can see, already today there are many areas of our life where algorithms, and not humans, make the decisions. If we plot this on the graph again, you would see that most things algorithms decide today actually sit in the green or the yellow area, with some things touching the critical part of our lives.
What big data and advanced machine learning will probably do in the coming years is, on the one hand, widen the applicability of algorithms, so we can use them in domains where we couldn't use them before, like speech recognition, customer service and many other things, and on the other hand let them penetrate deeper into our lives, making decisions that affect us on a more personal, more intimate and more critical level.

Good. So this is all I wanted to show you in terms of theory, and now I want to use the remaining time to show you two experiments which I did. There are lots of things that can go wrong when you use algorithms, but I picked two topics that I find especially important. The first topic we are looking at is discrimination through algorithms. The question here is: can an algorithm that is trained by a human, or by an earlier manual decision process, actually discriminate against certain groups of people as well? Discrimination is still a very big problem in our society, and we have fought for many, many years to push it back. And the question now is, of course: as we shift so much decision power from humans to machines, can we actually eliminate the discrimination that is still in the system, or are we going to carry it over into this automated decision making?

The definition of discrimination, to recap, is a treatment or consideration of a person that is based on his or her group, class or category, and not on his or her individual merit.
That means we prefer, or we put at a disadvantage, certain kinds of people according to their group or some protected attribute, which can be, for example, the ethnicity, the gender or the sexual orientation of the person.

Now we need, of course, a way to measure this discrimination, and the measure I use here has been developed in the US and is called disparate impact. It is quite nice because it uses a very clear and simple mathematical model to describe discrimination. Basically, the model says we have a process C which acts on people that either have a given attribute X or don't have it, for example men and women, and we measure the outcome of this process. We are interested in the probability of the decision being "yes" for a member of group X versus the probability of it being "yes" for a member of the other group. So we look at the ratio of the two conditional probabilities, τ = P(C = yes | X = 0) / P(C = yes | X = 1): the probability of making it through the process as a member of one group, divided by the probability of making it through as a member of the other group. This parameter τ describes the amount of discrimination in the system. For practical purposes we can choose a threshold, for example 80 percent, and if τ is smaller than that, we say the process contains discrimination. This is nice because it measures discrimination not only when it is done intentionally, but also when it happens inadvertently, without anyone wanting it.
So it doesn't really matter whether the people in the process want to discriminate; if they do it nevertheless, maybe unconsciously, this measure can give us an idea about it. Of course, in practice we don't deal with the probabilities directly: we measure the number of people in each category and then estimate the parameter τ by dividing the acceptance rate of one group, accepted divided by total, by the acceptance rate of the other group. So it is pretty easy and very straightforward; a small sketch of this computation follows below.
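As a quick sketch of that estimate (the counts here are made up for illustration):

```python
def disparate_impact(accepted_x, rejected_x, accepted_other, rejected_other):
    """Estimate tau: acceptance rate of group X divided by that of the other group."""
    rate_x = accepted_x / (accepted_x + rejected_x)
    rate_other = accepted_other / (accepted_other + rejected_other)
    return rate_x / rate_other

# made-up counts: 25 of 100 group-X candidates invited, 50 of 100 of the others
tau = disparate_impact(25, 75, 50, 50)
print(f"tau = {tau:.2f}")            # 0.50
print("discrimination:", tau < 0.8)  # fails the 80 percent rule
```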
Now I want to show how we can use this to test a process where we take decision power away from people and give it to an algorithm. The example I chose is an HR process, a hiring process: we want to select candidates based on the data they submit to a potential employer, for example their CV and other data about themselves. The benefits are, of course, saving time in the screening process and, hopefully, improving the choice of candidates. I chose this example because it is something that is already widely done: chances are that if you have applied for a job recently, you have been subjected to this kind of process, and several startups in the US, but also in Europe, are trying to implement this kind of data-driven hiring. So it is something that is really already happening.

OK, the setup is pretty simple. We have some information about the candidate that we submit to a human reviewer, and the human reviewer decides whether to invite the candidate or not. We also give that information to an algorithm as training data, and the algorithm then tries to replicate the human's decision about whether to hire the candidate. As input we use CVs, work samples and other publicly available information about the candidate that we can get; a human makes the decision about a given candidate, either yes or no, and we train the algorithm on this data. The approach here is a so-called big data approach: we basically try to get as much data about every candidate as we can, put it all into the algorithm, and let the algorithm figure out what to do with it.

The decision model for this is rather simple, and I show it here. In order to decide whether we hire a given candidate, we define a score S that has several parts. One part measures the merit of the candidate, based on his or her abilities. Another part is a discrimination malus or bonus, which increases or decreases the total score based on his or her membership in a given group. And then there is also some element of luck, which we have set to 20 percent here. We add these components together, and if the total is larger than a given bar, we invite the candidate; if not, we don't. As you can see, the bar effectively has a different height depending on the group of the candidate if there is discrimination in the system.
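As a sketch of such a scoring model (my own toy reconstruction with made-up weights, not the exact model from the talk):

```python
import random

LUCK_WEIGHT = 0.2           # the "element of luck" set to 20 percent
DISCRIMINATION_MALUS = 0.2  # hypothetical score penalty for one group
BAR = 0.6                   # threshold above which a candidate is invited

def invite(merit, in_protected_group):
    """Toy hiring decision: merit between 0 and 1, plus luck, minus a group malus."""
    score = (1 - LUCK_WEIGHT) * merit + LUCK_WEIGHT * random.random()
    if in_protected_group:
        score -= DISCRIMINATION_MALUS  # effectively a higher bar for this group
    return score > BAR

# same merit, different group membership
print(invite(0.7, in_protected_group=False))
print(invite(0.7, in_protected_group=True))
```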
OK, now we can train a predictor for that model. We give it the information about the candidate and also a lot of other information, everything else we can find, for example in public records or wherever else we can get our hands on it. Then we train the predictor to predict the outcome of the hiring process and look at the results. Since it is pretty hard to get hold of real-world data for this, what I did instead was to simulate 10,000 samples of an agent-based model, where we choose a decision function C and some disparate impact and generate training data with that. Then we use a standard machine learning algorithm, in this case a support vector machine, to fit that data and measure the discrimination that the algorithm itself produces. A sketch of this experiment is shown below.
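Here is a self-contained sketch of that experiment (my own reconstruction, not the speaker's code: the weights are made up and the sample is smaller than the 10,000 in the talk to keep it fast), assuming scikit-learn is available:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)
N = 2_000  # smaller than the 10,000 samples used in the talk, for speed

# agent-based model: candidates with a protected attribute, merit and luck
protected = rng.integers(0, 2, size=N)   # 1 = member of group X
merit = rng.uniform(0, 1, size=N)
luck = rng.uniform(0, 1, size=N)

# biased "human" decision: group X effectively faces a higher bar (disparate impact)
invited = (0.8 * merit + 0.2 * luck - 0.15 * protected > 0.5).astype(int)

def leaky_attribute(gamma):
    """Attribute as seen by the algorithm: the true value with probability gamma,
    otherwise a random guess (so gamma=0 carries no information)."""
    keep = rng.random(N) < gamma
    return np.where(keep, protected, rng.integers(0, 2, size=N))

def tau(decision, group):
    """Disparate impact of a set of decisions with respect to the true group."""
    return decision[group == 1].mean() / decision[group == 0].mean()

print("tau of the training data:", round(tau(invited, protected), 2))

for gamma in (0.0, 0.5, 1.0):
    features = np.column_stack([merit, leaky_attribute(gamma)])
    clf = SVC(kernel="rbf").fit(features, invited)
    predicted = clf.predict(features)
    fidelity = (predicted == invited).mean()
    print(f"gamma={gamma:.1f}: fidelity={fidelity:.2f}, "
          f"tau of the algorithm={tau(predicted, protected):.2f}")
```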
The result is shown in this graph. It is a bit complicated, so let's go through it one by one. On the x axis is the amount of information our algorithm has about the protected attribute X of the candidate, the attribute whose information we don't want to give away. At zero, the algorithm has no information at all about it; at one, it has full information about the protected attribute; at 0.5, it has the correct information in about 50 percent of the cases. Then there is our parameter τ, the disparate impact, which we set here to 0.5; this means that the chance of making it through the process is twice as high for one group as for the other. At the top we see the prediction fidelity of our algorithm, which lies between 86 and about 90 percent and which increases as we increase the rate γ at which the information leaks into the system. And finally we have τ again, the amount of discrimination done by the algorithm, measured as a function of the information leakage.

What this means is that the more information about the protected attribute we provide to the algorithm, the better it is able to discriminate against people in that group. If we are over here, and the algorithm has no information at all about the protected attribute, it cannot discriminate against those people, so the ratio of success between the groups is one. And this is actually great, because it means that if we can build an algorithm that has no idea about these protected attributes, we can eliminate all the discrimination that is in the system. On the other hand, if by some accident the algorithm gets full information about these attributes, it can discriminate just as well as a human against people in either group. That means that if we give too much information to our algorithm, we will have the same problem in the hiring process as before: discrimination against people, no longer by humans, but by a machine.

And now you are probably saying: OK, this is stupid.
Why would we give information about the protected group to the algorithm in the first place? And of course, the answer is that normally we don't. But the problem with big data, with having a lot of different data types and data sources at hand, is that even if we don't give that information to the algorithm explicitly, some amount of information about the attribute leaks through with all the other information that we provide. And this is basically the essence of the dilemma of having too much data on our hands: it is always very hard to keep information about sensitive things from leaking into our data set.

Of course, so far this is a purely theoretical formulation, but I actually tried to validate it using publicly available data. What I did was to get GitHub user data, which we can obtain through an API and which gives us information about the people on GitHub. First we need, of course, information about the protected attribute; in this case I chose gender, so man or woman. To do that, I had to manually classify the people I put into the study: I simply looked at about 5,000 profile pictures on GitHub and classified the people into men and women. This gave us the training labels for the analysis. I then retrieved additional information about each user, for example the number of projects on the site, the number of stargazers, the number of followers, and so on: basically everything I could get my hands on through the API. A small sketch of such a query is shown below.
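For illustration, here is a minimal sketch of querying such per-user numbers through the public GitHub API ('octocat' is just a placeholder account, and unauthenticated requests are heavily rate-limited):

```python
import requests

def user_features(username):
    """Fetch a few public profile numbers for a GitHub user."""
    resp = requests.get(f"https://api.github.com/users/{username}")
    resp.raise_for_status()
    user = resp.json()
    return {
        "public_repos": user["public_repos"],
        "public_gists": user["public_gists"],
        "followers": user["followers"],
        "following": user["following"],
    }

print(user_features("octocat"))
```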
I then used that data to try to predict the gender of the user, just based on the information I put into the system. And I want to say again, this is only a proof of concept: I used a very small data set and didn't do any optimization, I just wanted to see how easy it actually is to get this kind of contamination into our algorithm.

At first, when I used only very basic things like the number of stargazers or followers of each user, I couldn't get any prediction of the gender of the person at all. And in a way this is already great, because if a colleague says, oh, you know, women are not good programmers, you can now show them this data and basically disprove it, since gender cannot be predicted from that publicly available data. For me it was of course a bit disappointing, because I wanted to show that we can discriminate against these people, so I needed more data.

Luckily, GitHub helps us out there by providing an events API, which contains the full event stream of almost every action a given user has performed on the site. Every time you open a pull request, make a comment or do something else on GitHub, an event is created for it, and you can download all the public events on the site through the website shown here, process them and use them for data analysis. That is what I did: for all the users in my sample, I downloaded this event data and tried to extract more information that I could use to discriminate the gender of the people. A small sketch of fetching and summarizing a user's events is shown below.
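And a similar sketch for the events API, counting event types and activity hours from a user's recent public events (again my own illustration; the public endpoint only returns a limited window of recent events per user):

```python
from collections import Counter
from datetime import datetime

import requests

def event_stats(username):
    """Count event types and activity hours from a user's recent public events."""
    resp = requests.get(f"https://api.github.com/users/{username}/events/public")
    resp.raise_for_status()
    events = resp.json()

    types = Counter(e["type"] for e in events)
    hours = Counter(
        datetime.strptime(e["created_at"], "%Y-%m-%dT%H:%M:%SZ").hour for e in events
    )
    return types, hours

types, hours = event_stats("octocat")
print(types.most_common(5))   # e.g. PushEvent, IssueCommentEvent, ...
print(sorted(hours.items()))  # activity histogram by hour of day
```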
861 00:29:28,840 --> 00:29:31,119 Here we see the event 862 00:29:31,120 --> 00:29:33,279 frequency so averaged over all the 863 00:29:33,280 --> 00:29:35,169 events as a function of the hour. 864 00:29:35,170 --> 00:29:37,359 And you can see that now there seem 865 00:29:37,360 --> 00:29:39,519 to be some significant differences 866 00:29:39,520 --> 00:29:41,279 between men and women in our data set. 867 00:29:41,280 --> 00:29:43,119 So that's something that the algorithm 868 00:29:43,120 --> 00:29:44,649 could use to make a prediction about the 869 00:29:44,650 --> 00:29:46,839 gender and likewise, in 870 00:29:46,840 --> 00:29:48,969 the type of events that 871 00:29:48,970 --> 00:29:49,929 we have in our data set. 872 00:29:49,930 --> 00:29:52,159 There are also differences in 873 00:29:52,160 --> 00:29:54,399 the frequency of individual event 874 00:29:54,400 --> 00:29:56,019 types. So that's also something that the 875 00:29:56,020 --> 00:29:57,639 algorithm can use to make a decision 876 00:29:57,640 --> 00:29:58,640 about gender. 877 00:29:59,560 --> 00:30:01,659 Now, for the last thing, I went a bit 878 00:30:01,660 --> 00:30:03,849 crazy and I did something that 879 00:30:03,850 --> 00:30:05,929 you normally do in spam detection 880 00:30:05,930 --> 00:30:08,319 that is taking like to commit messages of 881 00:30:08,320 --> 00:30:10,569 individual contributors 882 00:30:10,570 --> 00:30:12,639 and just like inputting them into like a 883 00:30:12,640 --> 00:30:14,439 support vector classifier. 884 00:30:14,440 --> 00:30:16,539 And that basically looks 885 00:30:16,540 --> 00:30:18,399 at the frequencies of individual words 886 00:30:18,400 --> 00:30:19,719 and each commit and tries to find a 887 00:30:19,720 --> 00:30:22,149 difference in the text 888 00:30:22,150 --> 00:30:23,679 between man and woman. 889 00:30:23,680 --> 00:30:25,690 And this already gave me some 890 00:30:26,950 --> 00:30:29,169 good, like good fidelity of 891 00:30:29,170 --> 00:30:30,879 predicting the gender and combining it 892 00:30:30,880 --> 00:30:32,919 with the other information that I had. 893 00:30:32,920 --> 00:30:35,329 I could in fact, achieve 50 894 00:30:35,330 --> 00:30:37,449 percent better chance 895 00:30:37,450 --> 00:30:39,609 of like predicting the gender than by 896 00:30:39,610 --> 00:30:41,739 just guessing. So this is 897 00:30:41,740 --> 00:30:44,529 not very impressive and 898 00:30:44,530 --> 00:30:45,939 we can probably do much better. 899 00:30:45,940 --> 00:30:47,529 But again, this was only like a proof of 900 00:30:47,530 --> 00:30:49,659 concept to see how easy it actually is 901 00:30:49,660 --> 00:30:51,489 to get get this kind of information 902 00:30:51,490 --> 00:30:52,869 leaking into the system. 903 00:30:52,870 --> 00:30:56,019 And so this basically means that if we 904 00:30:56,020 --> 00:30:58,059 can make a predictor for the gender of 905 00:30:58,060 --> 00:31:00,159 the person, GitHub and the algorithm 906 00:31:00,160 --> 00:31:01,929 that we used to like make the decision 907 00:31:01,930 --> 00:31:03,849 about the hiring process can also 908 00:31:03,850 --> 00:31:05,349 generate this kind of information. 909 00:31:05,350 --> 00:31:07,689 If we give him if you give it this data 910 00:31:07,690 --> 00:31:09,789 and use it against the 911 00:31:09,790 --> 00:31:10,790 people. 
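The commit-message step is essentially text classification; a minimal sketch of how it might look with scikit-learn, assuming a hypothetical labelled table of commit messages per user (the exact features and classifier settings from the talk are not given).

```python
# Sketch: predict a protected attribute from commit messages, spam-filter
# style. Assumes a hypothetical CSV with columns: commit_text, gender.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("commits_labelled.csv")                     # hypothetical file
clf = make_pipeline(TfidfVectorizer(min_df=2), LinearSVC())  # word frequencies -> SVM
acc = cross_val_score(clf, df["commit_text"], df["gender"], cv=5).mean()

# "50 percent better than guessing" on a balanced two-class sample would
# correspond to roughly 0.75 accuracy against a 0.5 chance level.
print(f"cross-validated accuracy: {acc:.2f}")
```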
912 00:31:11,820 --> 00:31:13,409 So the take away from this is that the 913 00:31:13,410 --> 00:31:15,479 algorithm will readily 914 00:31:15,480 --> 00:31:17,159 learn discrimination from us if we 915 00:31:17,160 --> 00:31:19,289 provide them with the right training 916 00:31:19,290 --> 00:31:21,869 data and also information 917 00:31:21,870 --> 00:31:24,239 leakage so that getting information 918 00:31:24,240 --> 00:31:25,889 about protected attributes in our data 919 00:31:25,890 --> 00:31:27,839 sets that we don't want to have, there is 920 00:31:27,840 --> 00:31:30,389 actually a pretty easy and can happen 921 00:31:30,390 --> 00:31:31,390 if we are not careful. 922 00:31:33,480 --> 00:31:34,769 How can we fix this? 923 00:31:34,770 --> 00:31:36,719 Well, it's actually harder than you might 924 00:31:36,720 --> 00:31:38,939 think because often we don't 925 00:31:38,940 --> 00:31:40,649 even have the information about the 926 00:31:40,650 --> 00:31:42,239 protected attributes in our data sets 927 00:31:42,240 --> 00:31:44,549 because we don't want to to take the data 928 00:31:44,550 --> 00:31:47,009 from the user. I mean, imagine if 929 00:31:47,010 --> 00:31:48,599 you would apply for a job and your 930 00:31:48,600 --> 00:31:50,639 employer or potential employer would ask 931 00:31:50,640 --> 00:31:52,739 you for information about 932 00:31:52,740 --> 00:31:55,829 your sexual preferences, your gender, 933 00:31:55,830 --> 00:31:57,059 your ethnicity and everything. 934 00:31:57,060 --> 00:31:59,159 And plenty of other things probably 935 00:31:59,160 --> 00:32:00,839 wouldn't go down so well. 936 00:32:00,840 --> 00:32:02,639 But this is actually the kind of 937 00:32:02,640 --> 00:32:04,529 information that we would need in order 938 00:32:04,530 --> 00:32:06,359 to see if there is some disparate impact 939 00:32:06,360 --> 00:32:07,799 in our data. Because if you don't have 940 00:32:07,800 --> 00:32:09,929 that attribute information, we cannot 941 00:32:09,930 --> 00:32:12,059 like, um, calculate 942 00:32:12,060 --> 00:32:14,339 any fidelity or any like like measure 943 00:32:14,340 --> 00:32:15,569 of the discrimination that is in the 944 00:32:15,570 --> 00:32:17,639 process. And this is what is 945 00:32:17,640 --> 00:32:19,889 so dangerous about this, because our 946 00:32:19,890 --> 00:32:21,779 algorithm can discriminate against people 947 00:32:21,780 --> 00:32:23,309 without us even noticing. 948 00:32:25,370 --> 00:32:27,469 OK, this 949 00:32:27,470 --> 00:32:29,569 is already the first case 950 00:32:29,570 --> 00:32:31,759 study that I wanted to show, and we 951 00:32:31,760 --> 00:32:32,760 have seen that, 952 00:32:34,940 --> 00:32:37,039 that getting information into our 953 00:32:37,040 --> 00:32:39,259 dataset that we shouldn't have is 954 00:32:39,260 --> 00:32:40,009 pretty bad. 955 00:32:40,010 --> 00:32:42,199 And like the 956 00:32:42,200 --> 00:32:44,059 worst kind of information leakage that 957 00:32:44,060 --> 00:32:46,699 you can imagine is if you can identify 958 00:32:46,700 --> 00:32:48,169 someone from the data that you have 959 00:32:48,170 --> 00:32:49,969 obtained for them earlier. 960 00:32:49,970 --> 00:32:52,039 And I mean, again, if 961 00:32:52,040 --> 00:32:54,139 we ask Google about its opinion 962 00:32:54,140 --> 00:32:56,299 on privacy, it's the 963 00:32:56,300 --> 00:32:57,509 picture is rather bleak. 
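For the disparate impact point made a moment ago: the usual way to quantify it, once an audit sample that does contain the protected attribute exists, is to compare selection rates between groups. A small sketch with made-up numbers:

```python
# Sketch: disparate impact as a ratio of selection rates between groups.
import pandas as pd

audit = pd.DataFrame({               # made-up audit sample
    "gender": ["f", "f", "f", "m", "m", "m", "m", "m"],
    "hired":  [0,   1,   0,   1,   1,   0,   1,   1],
})
rates = audit.groupby("gender")["hired"].mean()
ratio = rates.min() / rates.max()

# A common rule of thumb (the "four-fifths rule") flags ratios below 0.8.
print(rates.to_dict(), f"disparate impact ratio: {ratio:.2f}")
```

Without the protected attribute column, this check simply cannot be computed, which is exactly the dilemma described above.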
964 00:32:57,510 --> 00:32:59,779 And it seems 965 00:32:59,780 --> 00:33:01,399 that many people are already getting 966 00:33:01,400 --> 00:33:03,709 used to the idea that we are in the 967 00:33:03,710 --> 00:33:05,449 post-privacy era right now. 968 00:33:05,450 --> 00:33:07,129 And so with the second experiment here, I 969 00:33:07,130 --> 00:33:09,229 want to show how easy it is actually 970 00:33:09,230 --> 00:33:11,929 to de-anonymize 971 00:33:11,930 --> 00:33:14,059 a given user's data even without wanting 972 00:33:14,060 --> 00:33:14,989 it. 973 00:33:14,990 --> 00:33:17,149 And what actually is 974 00:33:17,150 --> 00:33:19,729 de-anonymization? De-anonymization 975 00:33:19,730 --> 00:33:22,249 means that we have some 976 00:33:22,250 --> 00:33:24,199 information recorded about an individual 977 00:33:24,200 --> 00:33:27,319 or person and we use that information 978 00:33:27,320 --> 00:33:29,029 to predict the identity of that 979 00:33:29,030 --> 00:33:30,859 individual in another data set. 980 00:33:30,860 --> 00:33:33,319 So it's kind of like your data is following 981 00:33:33,320 --> 00:33:35,059 you around: even if you, for 982 00:33:35,060 --> 00:33:37,219 example, change the devices which you're 983 00:33:37,220 --> 00:33:39,529 working on, or you change 984 00:33:39,530 --> 00:33:40,579 your user accounts, 985 00:33:40,580 --> 00:33:43,309 the system is still able to identify 986 00:33:43,310 --> 00:33:45,499 you just by using the data that you have 987 00:33:45,500 --> 00:33:47,659 put into the system earlier or that 988 00:33:47,660 --> 00:33:49,939 was measured about you earlier. 989 00:33:49,940 --> 00:33:52,039 And de-anonymization becomes an 990 00:33:52,040 --> 00:33:54,169 increasing risk as the data sets 991 00:33:54,170 --> 00:33:55,849 that we have about individual users 992 00:33:55,850 --> 00:33:57,230 get bigger and bigger, actually. 993 00:34:01,640 --> 00:34:03,609 So I hope it's working. 994 00:34:03,610 --> 00:34:05,779 OK, now let's 995 00:34:05,780 --> 00:34:08,019 have a look at the math here. 996 00:34:08,020 --> 00:34:10,069 De-anonymization is a pretty big 997 00:34:10,070 --> 00:34:12,109 subject, and the math is rather fun, I assure 998 00:34:12,110 --> 00:34:13,579 you. 999 00:34:13,580 --> 00:34:16,759 And you maybe have played 1000 00:34:16,760 --> 00:34:18,948 this game with some of your friends where 1001 00:34:18,949 --> 00:34:20,779 you just think of some famous person 1002 00:34:20,780 --> 00:34:22,879 and your friend 1003 00:34:22,880 --> 00:34:24,649 has to guess who that is by just asking 1004 00:34:24,650 --> 00:34:26,749 you a series of yes or no questions. 1005 00:34:26,750 --> 00:34:28,669 And this actually works pretty 1006 00:34:28,670 --> 00:34:30,738 efficiently, so that after maybe 10 1007 00:34:30,739 --> 00:34:32,869 or 20 questions, you can 1008 00:34:32,870 --> 00:34:34,999 know exactly which person your 1009 00:34:35,000 --> 00:34:36,259 friend was thinking of. 1010 00:34:36,260 --> 00:34:38,448 And this works so well 1011 00:34:38,449 --> 00:34:40,849 because if we have 1012 00:34:40,850 --> 00:34:42,948 several 1013 00:34:42,949 --> 00:34:45,408 buckets 1014 00:34:45,409 --> 00:34:47,479 that are either true or false for a given 1015 00:34:47,480 --> 00:34:49,190 user, we can 1016 00:34:50,570 --> 00:34:52,819 create a unique fingerprint for the 1017 00:34:52,820 --> 00:34:54,319 user in our system.
1018 00:34:54,320 --> 00:34:56,629 And if you look at the probability 1019 00:34:56,630 --> 00:34:58,099 of like having a collision, so like 1020 00:34:58,100 --> 00:35:00,409 having two users that have exactly 1021 00:35:00,410 --> 00:35:03,409 the same true false values, 1022 00:35:03,410 --> 00:35:05,749 this is getting increasingly unlikely 1023 00:35:05,750 --> 00:35:07,969 the more buckets or the more different 1024 00:35:07,970 --> 00:35:10,189 types of information we can put into 1025 00:35:10,190 --> 00:35:11,599 our system. 1026 00:35:11,600 --> 00:35:13,279 And so, like the exact number or the 1027 00:35:13,280 --> 00:35:14,989 exact probability for finding a 1028 00:35:14,990 --> 00:35:17,239 correlation between users is depending 1029 00:35:17,240 --> 00:35:19,429 on the actual distribution of 1030 00:35:19,430 --> 00:35:21,249 the information in the buckets. 1031 00:35:21,250 --> 00:35:22,939 So if you have a uniform distribution, we 1032 00:35:22,940 --> 00:35:24,289 can calculate that number. 1033 00:35:24,290 --> 00:35:25,909 And as you can see, it decreases 1034 00:35:25,910 --> 00:35:28,069 exponentially, which is why this game 1035 00:35:28,070 --> 00:35:30,009 that I talked about earlier is working so 1036 00:35:30,010 --> 00:35:32,149 well since, for example, if you 1037 00:35:32,150 --> 00:35:33,619 assume that you have like one million 1038 00:35:35,360 --> 00:35:36,889 famous people that you can think of, then 1039 00:35:36,890 --> 00:35:38,779 it would probably be sufficient to have 1040 00:35:38,780 --> 00:35:40,819 like thirty two bits of information to 1041 00:35:40,820 --> 00:35:42,439 uniquely identify them all. 1042 00:35:42,440 --> 00:35:44,539 And we can imagine that with 1043 00:35:44,540 --> 00:35:47,569 big data, we have much we have many more 1044 00:35:47,570 --> 00:35:49,219 buckets that we can actually use so we 1045 00:35:49,220 --> 00:35:51,229 can identify not only a few million 1046 00:35:51,230 --> 00:35:52,879 people, but easily a few billion 1047 00:35:52,880 --> 00:35:55,549 different people using that technique. 1048 00:35:55,550 --> 00:35:57,619 And most real world data sets 1049 00:35:57,620 --> 00:36:00,169 are, of course, not uniformly distributed 1050 00:36:00,170 --> 00:36:02,269 so that we have more the case that 1051 00:36:02,270 --> 00:36:04,519 that many users are in the same bucket. 1052 00:36:04,520 --> 00:36:06,439 So, for example, many there are many 1053 00:36:06,440 --> 00:36:08,419 people that like the same kind of music. 1054 00:36:08,420 --> 00:36:10,489 And so they are all like have the same 1055 00:36:10,490 --> 00:36:12,559 information or the same attribute. 1056 00:36:12,560 --> 00:36:15,109 And using that attribute 1057 00:36:15,110 --> 00:36:16,969 to the anonymized to the users wouldn't 1058 00:36:16,970 --> 00:36:18,829 give us much to wouldn't do as much good 1059 00:36:18,830 --> 00:36:20,449 because it wouldn't help us to like 1060 00:36:20,450 --> 00:36:22,429 narrow down the number of users in our 1061 00:36:22,430 --> 00:36:24,469 system. But there are also many 1062 00:36:24,470 --> 00:36:26,689 attributes that are pretty unique to each 1063 00:36:26,690 --> 00:36:28,759 one of us. For example, the place that we 1064 00:36:28,760 --> 00:36:30,769 are living or the combination of that 1065 00:36:30,770 --> 00:36:32,239 place with the work, where we going. 
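The arithmetic behind the guessing game is easy to check: each uniformly distributed yes/no attribute halves the candidate set, so roughly log2(N) of them are enough to single out one person among N. A sketch:

```python
# Sketch: how many uniform binary "buckets" are needed to single out one
# person among N, and how quickly a collision becomes unlikely.
import math

def bits_needed(population: int) -> int:
    """Smallest k with 2**k >= population."""
    return math.ceil(math.log2(population))

def collision_prob(k: int) -> float:
    """Chance that one other fixed person matches you on all k uniform bits."""
    return 0.5 ** k

for n in (1_000_000, 8_000_000_000):
    print(f"{n:>13,} people -> {bits_needed(n)} bits")
for k in (4, 16, 32):
    print(f"k={k:2d}: collision probability {collision_prob(k):.1e}")
```

Twenty bits already cover a million famous people, and thirty-two cover a few billion, which is why the figure given in the talk is more than sufficient.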
1066 00:36:32,240 --> 00:36:34,459 So having a few of those quite 1067 00:36:34,460 --> 00:36:35,959 unique data points for each user is 1068 00:36:35,960 --> 00:36:38,119 usually already enough to de-anonymize 1069 00:36:38,120 --> 00:36:39,499 us with a very high fidelity. 1070 00:36:42,290 --> 00:36:44,089 And again, I wanted to see if this is 1071 00:36:44,090 --> 00:36:46,279 actually working in practice, so 1072 00:36:46,280 --> 00:36:48,409 what I did was to get a data set, 1073 00:36:48,410 --> 00:36:50,119 in this case from Microsoft Research 1074 00:36:50,120 --> 00:36:52,399 Asia, which contains GPS 1075 00:36:52,400 --> 00:36:54,589 data of about 1076 00:36:54,590 --> 00:36:56,809 200 people who tracked their whole 1077 00:36:56,810 --> 00:36:59,089 activity for sometimes 1078 00:36:59,090 --> 00:37:00,889 several years, sometimes several months, 1079 00:37:00,890 --> 00:37:03,079 and I used 1080 00:37:03,080 --> 00:37:04,819 the data to create a movement profile, so 1081 00:37:04,820 --> 00:37:06,079 to say. 1082 00:37:06,080 --> 00:37:07,789 I also have an animated version of 1083 00:37:07,790 --> 00:37:10,639 that, where you can see here 1084 00:37:10,640 --> 00:37:13,369 the different trajectories 1085 00:37:13,370 --> 00:37:15,469 of individual users. 1086 00:37:15,470 --> 00:37:17,329 I don't know if anyone recognizes the 1087 00:37:17,330 --> 00:37:18,330 city. 1088 00:37:20,020 --> 00:37:22,119 It's Beijing, actually, 1089 00:37:22,120 --> 00:37:24,279 and if you're wondering what 1090 00:37:24,280 --> 00:37:26,589 this square is, 1091 00:37:26,590 --> 00:37:28,269 I looked at Google Maps and it seems to 1092 00:37:28,270 --> 00:37:30,669 be the university, so I guess 1093 00:37:30,670 --> 00:37:32,169 it's like in any other field of study: 1094 00:37:32,170 --> 00:37:34,119 whenever you need some guinea pigs 1095 00:37:34,120 --> 00:37:36,459 to take data for you, you go and ask 1096 00:37:36,460 --> 00:37:37,630 students. So. 1097 00:37:39,600 --> 00:37:42,419 OK, so this is a pretty rich data set. 1098 00:37:42,420 --> 00:37:44,399 We have, in some cases, hundreds 1099 00:37:44,400 --> 00:37:45,659 of thousands of data points per 1100 00:37:45,660 --> 00:37:47,759 individual, and I wanted to see how 1101 00:37:47,760 --> 00:37:49,439 easy it would be with this data to 1102 00:37:49,440 --> 00:37:51,149 actually de-anonymize users. 1103 00:37:51,150 --> 00:37:53,459 So what I did was to first look 1104 00:37:53,460 --> 00:37:54,759 at individual trajectories. 1105 00:37:54,760 --> 00:37:57,119 So here we have the GPS traces 1106 00:37:57,120 --> 00:37:59,819 of the individuals, color coded, 1107 00:37:59,820 --> 00:38:02,039 and then to apply a very simple 1108 00:38:02,040 --> 00:38:04,139 grid, so in this 1109 00:38:04,140 --> 00:38:06,299 case a four by four grid, and 1110 00:38:06,300 --> 00:38:09,239 just measure the frequency 1111 00:38:09,240 --> 00:38:11,399 with which a given individual 1112 00:38:11,400 --> 00:38:13,529 has some data points 1113 00:38:13,530 --> 00:38:15,239 in a given square. 1114 00:38:15,240 --> 00:38:17,609 So doing this for 1115 00:38:17,610 --> 00:38:19,289 the two hundred people gives me something 1116 00:38:19,290 --> 00:38:21,479 like this. So this is the four by 1117 00:38:21,480 --> 00:38:23,729 four grid. And you can see the colors 1118 00:38:23,730 --> 00:38:26,279 represent the number of times a given 1119 00:38:26,280 --> 00:38:27,989 person has been in a given square.
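A minimal sketch of the grid fingerprint just described, assuming a trajectory is simply an array of (latitude, longitude) points; the bounding box below is a made-up Beijing-sized rectangle, and numpy's 2-D histogram does the bucketing.

```python
# Sketch: turn a GPS trace into an n-by-n grid of visit frequencies.
import numpy as np

LAT_RANGE = (39.6, 40.2)    # hypothetical bounding box around the city
LON_RANGE = (116.0, 116.8)

def fingerprint(trace: np.ndarray, n: int = 4) -> np.ndarray:
    """trace: array of shape (num_points, 2) with (lat, lon) rows."""
    counts, _, _ = np.histogram2d(trace[:, 0], trace[:, 1],
                                  bins=n, range=[LAT_RANGE, LON_RANGE])
    return counts / counts.sum()   # normalise so traces of different length compare

# toy trace: 1000 points clustered around two places
rng = np.random.default_rng(0)
trace = np.concatenate([rng.normal([39.9, 116.3], 0.02, size=(700, 2)),
                        rng.normal([40.0, 116.5], 0.02, size=(300, 2))])
print(fingerprint(trace, n=4))
```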
1120 00:38:27,990 --> 00:38:29,939 So white would mean that the person has 1121 00:38:29,940 --> 00:38:32,009 been there very often; black would mean the 1122 00:38:32,010 --> 00:38:33,179 person has never been in this given 1123 00:38:33,180 --> 00:38:34,079 square. 1124 00:38:34,080 --> 00:38:36,239 And you can already see, 1125 00:38:36,240 --> 00:38:38,249 in the 60 examples that I show you 1126 00:38:38,250 --> 00:38:40,379 here, that many of them 1127 00:38:40,380 --> 00:38:42,119 seem to be quite unique, for example, 1128 00:38:42,120 --> 00:38:43,749 this one and this one. 1129 00:38:43,750 --> 00:38:46,110 So it should be possible to 1130 00:38:47,310 --> 00:38:49,049 kind of make a fingerprint for a given 1131 00:38:49,050 --> 00:38:50,669 user using that data. 1132 00:38:50,670 --> 00:38:53,249 And of course, if we need more resolution 1133 00:38:53,250 --> 00:38:55,559 to, for example, disambiguate users 1134 00:38:55,560 --> 00:38:57,269 like these here, where we have more or 1135 00:38:57,270 --> 00:38:59,369 less the same data and we can't 1136 00:38:59,370 --> 00:39:01,529 decide which user we have, 1137 00:39:01,530 --> 00:39:03,119 we can just increase the resolution, for 1138 00:39:03,120 --> 00:39:05,369 example, to eight by eight or to 16 1139 00:39:05,370 --> 00:39:07,499 by 16, as here. 1140 00:39:07,500 --> 00:39:09,629 And now, 1141 00:39:09,630 --> 00:39:11,519 coming back to our buckets, if we 1142 00:39:11,520 --> 00:39:13,289 measure the distribution of the 1143 00:39:13,290 --> 00:39:15,569 attributes that we have here, we 1144 00:39:15,570 --> 00:39:17,969 can get an idea of how good our choice 1145 00:39:17,970 --> 00:39:20,339 actually is. And you can see that 1146 00:39:20,340 --> 00:39:22,049 the choice that we have made is 1147 00:39:22,050 --> 00:39:24,419 actually pretty bad, because in the first 1148 00:39:24,420 --> 00:39:25,979 bucket, the bucket with the most 1149 00:39:25,980 --> 00:39:28,049 data points, we have about 10 to 1150 00:39:28,050 --> 00:39:30,299 the six, or 1,000,000, points. 1151 00:39:30,300 --> 00:39:32,489 But the interesting part of this curve, 1152 00:39:32,490 --> 00:39:34,679 which is, by the way, logarithmic, 1153 00:39:34,680 --> 00:39:36,809 is here: in this 1154 00:39:36,810 --> 00:39:39,089 long tail of the distribution, where 1155 00:39:39,090 --> 00:39:41,249 we sometimes have only 1156 00:39:41,250 --> 00:39:43,679 one or sometimes a couple of individual 1157 00:39:43,680 --> 00:39:45,179 persons in a given bucket. 1158 00:39:45,180 --> 00:39:46,829 So if we can get some information in 1159 00:39:46,830 --> 00:39:49,529 these buckets, it's easy to use that 1160 00:39:49,530 --> 00:39:51,989 to de-anonymize our users. 1161 00:39:51,990 --> 00:39:54,419 OK, and how do we do that? 1162 00:39:54,420 --> 00:39:56,819 Again, we use a very simple measure: 1163 00:39:56,820 --> 00:39:59,399 we just take the fingerprint 1164 00:39:59,400 --> 00:40:01,589 of one user or one trace 1165 00:40:01,590 --> 00:40:03,569 and multiply it with the fingerprint of 1166 00:40:03,570 --> 00:40:05,549 another trace, pixel by pixel, 1167 00:40:05,550 --> 00:40:07,349 which gives us the values on the 1168 00:40:07,350 --> 00:40:09,539 right. And then we take these individual 1169 00:40:09,540 --> 00:40:11,009 values and sum them up.
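The similarity measure just described is simply the cell-by-cell product of two fingerprints, summed; with normalised grids this is a dot product. A sketch, reusing the fingerprint() helper from the previous block:

```python
# Sketch: score = sum over cells of fingerprint_a * fingerprint_b.
import numpy as np

def similarity(fp_a: np.ndarray, fp_b: np.ndarray) -> float:
    return float(np.sum(fp_a * fp_b))

fp_home = np.array([[0.5, 0.5], [0.0, 0.0]])   # toy 2x2 fingerprints
fp_other = np.array([[0.0, 0.0], [0.5, 0.5]])
print(similarity(fp_home, fp_home), similarity(fp_home, fp_other))  # 0.5 vs 0.0
```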
1170 00:40:11,010 --> 00:40:13,169 And this gives us kind of a score of how 1171 00:40:13,170 --> 00:40:15,119 similar to use as our two trajectories 1172 00:40:15,120 --> 00:40:16,379 are. 1173 00:40:16,380 --> 00:40:18,449 So doing this, 1174 00:40:18,450 --> 00:40:20,639 we can take 75 percent 1175 00:40:20,640 --> 00:40:22,709 of our data as a training set. 1176 00:40:22,710 --> 00:40:24,689 So we just like teach our algorithm to 1177 00:40:24,690 --> 00:40:26,789 like, recognize individual users and 1178 00:40:26,790 --> 00:40:28,709 then we can use the remaining 25 percent 1179 00:40:28,710 --> 00:40:30,959 to test how good our algorithm is at 1180 00:40:30,960 --> 00:40:33,269 recognizing the users now 1181 00:40:33,270 --> 00:40:35,669 and then we look at the average 1182 00:40:35,670 --> 00:40:38,249 probability of identification and 1183 00:40:38,250 --> 00:40:40,439 also of the rank that the user has 1184 00:40:40,440 --> 00:40:42,569 in this and this prediction. 1185 00:40:42,570 --> 00:40:43,679 And this is shown here. 1186 00:40:48,110 --> 00:40:50,039 So what a show is the, um, 1187 00:40:51,110 --> 00:40:53,239 the probability of, 1188 00:40:53,240 --> 00:40:55,429 like, finding the right user 1189 00:40:55,430 --> 00:40:57,779 within, for example, the first two, 1190 00:40:57,780 --> 00:41:00,379 the first for the first six users 1191 00:41:00,380 --> 00:41:02,089 that have the highest scores for a given 1192 00:41:02,090 --> 00:41:04,729 chase. And you can see that 1193 00:41:04,730 --> 00:41:07,819 for even 16, 1194 00:41:07,820 --> 00:41:09,859 16 squares. So the four by four grid that 1195 00:41:09,860 --> 00:41:12,109 I showed you in the beginning, the ID 1196 00:41:12,110 --> 00:41:13,909 rate is already 20 percent here. 1197 00:41:13,910 --> 00:41:16,039 So we can identify uniquely 1198 00:41:16,040 --> 00:41:18,499 one fifth of our user by just using 1199 00:41:18,500 --> 00:41:20,179 16 data points. 1200 00:41:20,180 --> 00:41:21,739 And the more data points we use, 1201 00:41:21,740 --> 00:41:23,869 actually, the more the better we can 1202 00:41:23,870 --> 00:41:25,939 identify the users in our data set. 1203 00:41:25,940 --> 00:41:28,579 And with 1024 1204 00:41:28,580 --> 00:41:30,319 individual data points, which would be 1205 00:41:30,320 --> 00:41:32,449 quite easy to get in a real 1206 00:41:32,450 --> 00:41:34,639 world setting, we can uniquely 1207 00:41:34,640 --> 00:41:37,339 identify almost 30 percent of the users. 1208 00:41:37,340 --> 00:41:38,839 And again, I want to state that this is 1209 00:41:38,840 --> 00:41:40,069 just a proof of concept. 1210 00:41:40,070 --> 00:41:42,109 And so there has been no optimization 1211 00:41:42,110 --> 00:41:43,849 done and no like fine tuning of 1212 00:41:43,850 --> 00:41:45,230 parameters or anything. 1213 00:41:46,710 --> 00:41:48,809 And we can also use that technique to not 1214 00:41:48,810 --> 00:41:50,549 only identify single users, but also to 1215 00:41:50,550 --> 00:41:52,629 find similarities between users. 1216 00:41:52,630 --> 00:41:54,089 So this could be interesting, for 1217 00:41:54,090 --> 00:41:56,519 example, to see who is 1218 00:41:56,520 --> 00:41:58,649 related to whom and who are you 1219 00:41:58,650 --> 00:42:00,599 visiting, who are your friends may be. 1220 00:42:00,600 --> 00:42:02,789 And this is what I did here. 1221 00:42:02,790 --> 00:42:04,259 So I used the same metric as before. 
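Putting the pieces together, the evaluation described above (75 percent of each user's points for training, 25 percent for testing, then a top-k hit rate) might look roughly like this; fingerprint() and similarity() are the helpers sketched earlier, and the `traces` dictionary mapping user ids to point arrays is assumed.

```python
# Sketch: top-k identification rate over a dict {user_id: (lat, lon) array}.
import numpy as np

def top_k_rate(traces: dict, k: int = 1, n: int = 4, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    train_fp, test_fp = {}, {}
    for user, pts in traces.items():
        pts = rng.permutation(pts)                # shuffle the points
        split = int(0.75 * len(pts))              # 75/25 split
        train_fp[user] = fingerprint(pts[:split], n)
        test_fp[user] = fingerprint(pts[split:], n)
    hits = 0
    for user, fp in test_fp.items():
        ranked = sorted(train_fp, key=lambda u: similarity(fp, train_fp[u]), reverse=True)
        hits += user in ranked[:k]                # is the true user among the k best scores?
    return hits / len(traces)

# top_k_rate(traces, k=1, n=4) would correspond to the roughly 20 percent
# unique identification rate reported above for the four by four grid.
```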
1222 00:42:04,260 --> 00:42:06,209 And I just told the system to give me the 1223 00:42:06,210 --> 00:42:07,559 users that are most similar to each 1224 00:42:07,560 --> 00:42:09,659 other. And you can see here in 1225 00:42:09,660 --> 00:42:12,149 green the trajectories of one user 1226 00:42:12,150 --> 00:42:14,309 and in red the trajectories 1227 00:42:14,310 --> 00:42:15,239 of the other user. 1228 00:42:15,240 --> 00:42:16,919 And the areas that are yellow are 1229 00:42:16,920 --> 00:42:19,829 actually where the two of them coincide. 1230 00:42:19,830 --> 00:42:21,419 You can see there are some hits of the 1231 00:42:21,420 --> 00:42:23,429 system which don't seem too good, but 1232 00:42:23,430 --> 00:42:24,579 there are also 1233 00:42:24,580 --> 00:42:26,609 some hits where you can see 1234 00:42:26,610 --> 00:42:28,709 a really big agreement 1235 00:42:28,710 --> 00:42:30,209 between the two data sets. 1236 00:42:30,210 --> 00:42:32,759 And I mean, I don't know 1237 00:42:32,760 --> 00:42:34,739 who was taking this data, because it's 1238 00:42:34,740 --> 00:42:36,779 anonymized, but I would guess in this 1239 00:42:36,780 --> 00:42:38,579 case that it's either a taxi driver 1240 00:42:38,580 --> 00:42:40,739 or maybe a bus driver, because 1241 00:42:40,740 --> 00:42:42,719 you can see that we cover almost the 1242 00:42:42,720 --> 00:42:45,419 whole Beijing area with these two traces. 1243 00:42:45,420 --> 00:42:46,979 And so this technique makes it really, 1244 00:42:46,980 --> 00:42:49,049 really easy to identify 1245 00:42:49,050 --> 00:42:51,449 users and also to find out 1246 00:42:51,450 --> 00:42:53,459 who they are related to and which 1247 00:42:53,460 --> 00:42:55,739 other users are similar to them. 1248 00:42:55,740 --> 00:42:59,139 And we can, of course, improve 1249 00:42:59,140 --> 00:43:01,229 the identification rate of the 1250 00:43:01,230 --> 00:43:03,449 system by, for example, taking 1251 00:43:03,450 --> 00:43:04,859 into account not only the spatial 1252 00:43:04,860 --> 00:43:07,469 information, but also the temporal 1253 00:43:07,470 --> 00:43:09,119 information, for example, the day-night 1254 00:43:09,120 --> 00:43:10,679 cycle, which you see here in the 1255 00:43:10,680 --> 00:43:12,749 background. So here the green points 1256 00:43:12,750 --> 00:43:14,279 have been taken at night and the red 1257 00:43:14,280 --> 00:43:16,439 ones have been taken during the day. 1258 00:43:16,440 --> 00:43:18,599 So data like, for example, 1259 00:43:18,600 --> 00:43:20,399 going to work in the morning and coming 1260 00:43:20,400 --> 00:43:23,069 back in the evening can 1261 00:43:23,070 --> 00:43:25,259 then be used to increase the prediction 1262 00:43:25,260 --> 00:43:27,359 fidelity for identifying a given 1263 00:43:27,360 --> 00:43:28,360 user. 1264 00:43:28,840 --> 00:43:30,969 And of course, we could also change 1265 00:43:30,970 --> 00:43:33,069 the choice of our buckets and 1266 00:43:33,070 --> 00:43:34,899 change the way we do the 1267 00:43:34,900 --> 00:43:36,459 fingerprinting in order to increase the 1268 00:43:36,460 --> 00:43:37,579 fidelity of the algorithm. 1269 00:43:37,580 --> 00:43:38,979 So there's plenty of room for 1270 00:43:38,980 --> 00:43:40,749 optimization. And as I said, this is only 1271 00:43:40,750 --> 00:43:42,039 a proof of principle.
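One way to fold in the day/night information mentioned above is to build one spatial grid per time-of-day bucket and concatenate them. This is only a sketch of that idea, assuming each point also carries an hour of day and reusing the bounding box from the earlier block.

```python
# Sketch: spatio-temporal fingerprint = night grid + day grid, concatenated.
import numpy as np

def spatiotemporal_fingerprint(lat, lon, hour, n=4):
    """lat, lon, hour: 1-D arrays of equal length; hour in 0..23."""
    night = (hour < 6) | (hour >= 22)
    grids = []
    for mask in (night, ~night):                  # one grid per time bucket
        counts, _, _ = np.histogram2d(lat[mask], lon[mask],
                                      bins=n, range=[LAT_RANGE, LON_RANGE])
        grids.append(counts.ravel())
    fp = np.concatenate(grids)
    return fp / max(fp.sum(), 1.0)                # guard against empty traces
```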
1272 00:43:42,040 --> 00:43:44,409 But there are other similar 1273 00:43:44,410 --> 00:43:46,599 works in the literature which show that 1274 00:43:46,600 --> 00:43:48,249 even with a very simple method, you can 1275 00:43:48,250 --> 00:43:50,439 achieve quite good identification rates 1276 00:43:50,440 --> 00:43:51,440 in such a data set. 1277 00:43:53,510 --> 00:43:55,699 Now, to summarize this: this 1278 00:43:55,700 --> 00:43:58,129 means that the more data we have about 1279 00:43:58,130 --> 00:44:00,469 a given entity, a person, the more 1280 00:44:00,470 --> 00:44:02,299 difficult it is actually to keep 1281 00:44:02,300 --> 00:44:04,429 algorithms from directly learning and 1282 00:44:04,430 --> 00:44:06,769 using the identity of that object 1283 00:44:06,770 --> 00:44:08,779 for a prediction instead of an attribute. 1284 00:44:08,780 --> 00:44:10,699 That means, as I said before, that the 1285 00:44:10,700 --> 00:44:12,949 data which a given user or a given 1286 00:44:12,950 --> 00:44:15,199 person generates follows 1287 00:44:15,200 --> 00:44:16,889 him or her around for their whole life. 1288 00:44:16,890 --> 00:44:19,069 So even if you would change all 1289 00:44:19,070 --> 00:44:21,319 of your smartphones, all of your devices, 1290 00:44:21,320 --> 00:44:22,519 some parts of your behavior would 1291 00:44:22,520 --> 00:44:23,539 probably stay the same. 1292 00:44:23,540 --> 00:44:25,669 And this could be used to 1293 00:44:25,670 --> 00:44:27,469 identify you later in the process, again, 1294 00:44:27,470 --> 00:44:28,999 with a pretty high fidelity. 1295 00:44:29,000 --> 00:44:31,039 So that's one of the biggest risks of big 1296 00:44:31,040 --> 00:44:33,439 data for me, because it's 1297 00:44:33,440 --> 00:44:35,659 very easy, if we don't avoid 1298 00:44:35,660 --> 00:44:37,729 it, to destroy 1299 00:44:37,730 --> 00:44:38,989 the privacy of our users. 1300 00:44:40,100 --> 00:44:41,780 OK, yeah, thanks. 1301 00:44:49,220 --> 00:44:50,459 So what can we do about this? 1302 00:44:51,620 --> 00:44:52,999 I don't have all the answers, of course, 1303 00:44:53,000 --> 00:44:55,489 but I have a few ideas, and I mean, 1304 00:44:55,490 --> 00:44:56,490 there are lots of people 1305 00:44:57,830 --> 00:44:58,999 working on 1306 00:45:00,210 --> 00:45:01,849 political and societal and 1307 00:45:01,850 --> 00:45:03,199 technological solutions for this. 1308 00:45:03,200 --> 00:45:04,459 So here I just want to give a brief 1309 00:45:04,460 --> 00:45:07,009 overview of things that can 1310 00:45:07,010 --> 00:45:09,169 be important in order to avoid these 1311 00:45:09,170 --> 00:45:11,629 two scenarios that I have shown here. 1312 00:45:11,630 --> 00:45:13,999 So 1313 00:45:14,000 --> 00:45:15,529 the group of people that we probably have 1314 00:45:15,530 --> 00:45:17,899 to educate the most urgently 1315 00:45:17,900 --> 00:45:20,179 about this is, of course, data 1316 00:45:20,180 --> 00:45:21,529 scientists.
So the people that actually 1317 00:45:21,530 --> 00:45:23,179 work with the data and create these 1318 00:45:23,180 --> 00:45:25,759 algorithms. Because today, 1319 00:45:25,760 --> 00:45:28,099 in Germany, for example, 1320 00:45:28,100 --> 00:45:30,739 you need like a three-year apprenticeship 1321 00:45:30,740 --> 00:45:32,869 in order to bake a cheesecake, but there's 1322 00:45:32,870 --> 00:45:34,819 nothing comparable in order to become an 1323 00:45:34,820 --> 00:45:36,409 algorithm scientist and to develop 1324 00:45:36,410 --> 00:45:37,699 these kinds of algorithms that have a 1325 00:45:37,700 --> 00:45:40,069 large influence on our daily lives. 1326 00:45:40,070 --> 00:45:42,559 So there probably should be 1327 00:45:42,560 --> 00:45:44,299 a better curriculum in universities and 1328 00:45:44,300 --> 00:45:46,759 even in schools, maybe, to educate 1329 00:45:46,760 --> 00:45:48,529 people not only about the possibilities 1330 00:45:48,530 --> 00:45:50,689 of data analysis and about 1331 00:45:50,690 --> 00:45:53,119 scraping even the last few percent 1332 00:45:53,120 --> 00:45:55,189 of fidelity from a given algorithm, 1333 00:45:55,190 --> 00:45:56,869 but also about the risks and the 1334 00:45:56,870 --> 00:45:58,069 dangers of using these kinds of 1335 00:45:58,070 --> 00:46:00,199 technologies, especially when other 1336 00:46:00,200 --> 00:46:01,909 people are involved. 1337 00:46:01,910 --> 00:46:03,739 And another thing that we should be 1338 00:46:03,740 --> 00:46:06,049 careful with is collecting 1339 00:46:06,050 --> 00:46:07,639 data without actually needing it. 1340 00:46:07,640 --> 00:46:09,739 Today, 1341 00:46:09,740 --> 00:46:12,019 one of the most popular approaches 1342 00:46:12,020 --> 00:46:14,719 in big data is just to take everything 1343 00:46:14,720 --> 00:46:16,069 that you can get, 1344 00:46:16,070 --> 00:46:17,749 so all the data that we can get our hands 1345 00:46:17,750 --> 00:46:19,789 on, to give it to the algorithm and to let 1346 00:46:19,790 --> 00:46:22,129 it decide how it uses 1347 00:46:22,130 --> 00:46:24,199 it. And this is good 1348 00:46:24,200 --> 00:46:26,329 because it increases the fidelity of 1349 00:46:26,330 --> 00:46:28,099 our predictions, but as I explained 1350 00:46:28,100 --> 00:46:29,659 earlier, it can be also very dangerous, 1351 00:46:29,660 --> 00:46:31,759 because maybe the algorithm can learn 1352 00:46:31,760 --> 00:46:33,919 things which it isn't supposed to learn. 1353 00:46:33,920 --> 00:46:36,499 So we should really be 1354 00:46:36,500 --> 00:46:37,999 more careful with the data that we put 1355 00:46:38,000 --> 00:46:39,000 into these systems. 1356 00:46:40,260 --> 00:46:42,029 And of course, another thing that we 1357 00:46:42,030 --> 00:46:43,979 can do is to try to remove 1358 00:46:43,980 --> 00:46:46,209 discrimination and disparate impact, and 1359 00:46:46,210 --> 00:46:48,299 there's also a lot of academic 1360 00:46:48,300 --> 00:46:50,909 work giving techniques 1361 00:46:50,910 --> 00:46:52,919 and methods that we can use for 1362 00:46:52,920 --> 00:46:54,629 doing this.
But here, the problem again, 1363 00:46:54,630 --> 00:46:56,669 is that most people that actually work in 1364 00:46:56,670 --> 00:46:58,589 the fields where these algorithms are put 1365 00:46:58,590 --> 00:47:00,809 into practice either don't know 1366 00:47:00,810 --> 00:47:02,339 about these things, are not interested in 1367 00:47:02,340 --> 00:47:04,499 those. So I think here we have a big 1368 00:47:04,500 --> 00:47:06,629 potential for like improving 1369 00:47:06,630 --> 00:47:08,729 the education of data scientists and data 1370 00:47:08,730 --> 00:47:11,699 analysts as citizens. 1371 00:47:11,700 --> 00:47:13,859 We can also do something, of course. 1372 00:47:13,860 --> 00:47:16,169 So the first thing is to not blindly 1373 00:47:16,170 --> 00:47:17,879 trust the decisions made by algorithms. 1374 00:47:17,880 --> 00:47:19,949 So if most people have kind 1375 00:47:19,950 --> 00:47:22,139 of a bias to think that a decision made 1376 00:47:22,140 --> 00:47:24,209 by a computer, by algorithm is maybe more 1377 00:47:24,210 --> 00:47:26,609 fair than a decision made by a human. 1378 00:47:26,610 --> 00:47:27,899 And I think this is something we have to 1379 00:47:27,900 --> 00:47:30,209 get rid of because algorithms as a show 1380 00:47:30,210 --> 00:47:32,219 can be just as discriminating against 1381 00:47:32,220 --> 00:47:33,360 people as humans can. 1382 00:47:34,810 --> 00:47:37,569 So, um, and if we can't, 1383 00:47:37,570 --> 00:47:40,089 like, question their 1384 00:47:40,090 --> 00:47:42,879 decisions, we can at least test them 1385 00:47:42,880 --> 00:47:43,989 and see if there's actually 1386 00:47:43,990 --> 00:47:45,169 discrimination in the system. 1387 00:47:45,170 --> 00:47:47,019 And now this sounds pretty easy, but it's 1388 00:47:47,020 --> 00:47:49,569 actually very hard because the algorithms 1389 00:47:49,570 --> 00:47:51,789 are mostly like in the hands of big 1390 00:47:51,790 --> 00:47:54,069 organizations or corporations and are, 1391 00:47:54,070 --> 00:47:55,689 of course, like a closely guarded trade 1392 00:47:55,690 --> 00:47:57,159 secrets in most times. 1393 00:47:57,160 --> 00:47:59,229 And this means that we have to 1394 00:47:59,230 --> 00:48:00,519 use techniques such as reverse 1395 00:48:00,520 --> 00:48:02,589 engineering in order to to like find 1396 00:48:02,590 --> 00:48:04,659 out how the internals of 1397 00:48:04,660 --> 00:48:05,889 the algorithm might work. 1398 00:48:05,890 --> 00:48:07,629 And I have to say, I'm a bit pessimistic 1399 00:48:07,630 --> 00:48:10,059 about this because, um, whereas 1400 00:48:10,060 --> 00:48:12,039 where the companies or the organizations 1401 00:48:12,040 --> 00:48:14,379 could use like like huge buckets and huge 1402 00:48:14,380 --> 00:48:15,519 amounts of data to train these 1403 00:48:15,520 --> 00:48:17,559 algorithms, the amount of data that we 1404 00:48:17,560 --> 00:48:20,199 can use for reverse engineering then 1405 00:48:20,200 --> 00:48:21,849 is minuscule, is very small in 1406 00:48:21,850 --> 00:48:22,449 comparison. 1407 00:48:22,450 --> 00:48:24,519 So it's really not very likely that we 1408 00:48:24,520 --> 00:48:26,049 would be able to make a good decision 1409 00:48:26,050 --> 00:48:28,479 based on these kinds of techniques. 1410 00:48:28,480 --> 00:48:31,119 And of course, we can also one thing 1411 00:48:31,120 --> 00:48:32,709 that we can do is to fight back with 1412 00:48:32,710 --> 00:48:34,899 data. 
So by collecting 1413 00:48:34,900 --> 00:48:37,029 data about decisions that are made 1414 00:48:37,030 --> 00:48:39,159 of about us from the algorithms and 1415 00:48:39,160 --> 00:48:41,889 by centralizing that we can like 1416 00:48:41,890 --> 00:48:43,989 create a lot of opportunities for other 1417 00:48:43,990 --> 00:48:46,329 researchers and other people to analyze 1418 00:48:46,330 --> 00:48:48,189 these data sets and to like find 1419 00:48:48,190 --> 00:48:50,259 discrimination and other things in them. 1420 00:48:50,260 --> 00:48:53,409 And so I would encourage you to, um, 1421 00:48:53,410 --> 00:48:55,779 if you like, are reluctant to like 1422 00:48:55,780 --> 00:48:56,899 like give away your data. 1423 00:48:56,900 --> 00:48:59,439 I can, of course, understand it, but, um, 1424 00:48:59,440 --> 00:49:01,539 in some cases, it's really the only way 1425 00:49:01,540 --> 00:49:03,939 to make sure that someone can actually 1426 00:49:03,940 --> 00:49:06,129 work with the data and detect 1427 00:49:06,130 --> 00:49:08,739 also like like find injustices 1428 00:49:08,740 --> 00:49:09,639 that are caused by it. 1429 00:49:09,640 --> 00:49:12,669 So we have to really think about 1430 00:49:12,670 --> 00:49:14,669 differently of giving away our data and 1431 00:49:14,670 --> 00:49:16,869 like like also creating data and machine 1432 00:49:16,870 --> 00:49:18,369 learning against machine learning. 1433 00:49:21,410 --> 00:49:22,410 So. 1434 00:49:24,380 --> 00:49:26,749 As a society, we can, of course, 1435 00:49:26,750 --> 00:49:28,309 create better regulations for algorithm, 1436 00:49:28,310 --> 00:49:29,479 and this is actually something that has 1437 00:49:29,480 --> 00:49:30,679 been done. 1438 00:49:30,680 --> 00:49:33,049 I mean, the beginning of the year, 1439 00:49:33,050 --> 00:49:35,209 our minister of justice was demanding of 1440 00:49:35,210 --> 00:49:37,519 Facebook to to open 1441 00:49:37,520 --> 00:49:39,589 up their algorithm. And this 1442 00:49:39,590 --> 00:49:41,029 was much ridiculed at the time. 1443 00:49:41,030 --> 00:49:42,769 But I think it actually has some merit, 1444 00:49:42,770 --> 00:49:44,899 because if we can't 1445 00:49:44,900 --> 00:49:47,389 understand how corporations 1446 00:49:47,390 --> 00:49:48,919 or companies are using algorithms, we 1447 00:49:48,920 --> 00:49:51,139 can't know if they're discriminating 1448 00:49:51,140 --> 00:49:52,399 against certain people or if they're 1449 00:49:52,400 --> 00:49:53,629 treating us fairly. 1450 00:49:53,630 --> 00:49:55,849 So having as an auditing 1451 00:49:55,850 --> 00:49:58,009 system in place, that allows at least 1452 00:49:58,010 --> 00:49:59,479 a group of people to have a look at this 1453 00:49:59,480 --> 00:50:01,729 algorithm and to see how they're working 1454 00:50:01,730 --> 00:50:03,529 would be a first step in the direction of 1455 00:50:03,530 --> 00:50:06,619 making these things more transparent. 1456 00:50:06,620 --> 00:50:08,959 And of course, making access to the data 1457 00:50:08,960 --> 00:50:10,969 more easy and a safe way is also 1458 00:50:10,970 --> 00:50:13,429 important to be able to to detect 1459 00:50:13,430 --> 00:50:15,259 any problems that we have with it. 
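One simple form such an audit or outside test could take, even without access to the internals, is paired probing: query the scoring system with inputs that differ only in a proxy attribute and compare the outcomes. This is only a sketch of the idea; `score_candidate` and `flip_proxy` stand in for whatever black-box interface and proxy attribute are actually available, and are purely hypothetical.

```python
# Sketch: paired-probe audit of a black-box scorer.
def audit_pairs(score_candidate, profiles, flip_proxy):
    """score_candidate: hypothetical black-box scoring function.
    profiles: list of feature dicts describing applicants.
    flip_proxy: returns a copy of a profile with only the proxy attribute
    changed (for example a different neighbourhood or first name)."""
    gaps = [score_candidate(p) - score_candidate(flip_proxy(p)) for p in profiles]
    return sum(gaps) / len(gaps)   # mean score difference across the pairs

# A mean gap far from zero suggests the proxy influences the decision; with
# only a few probes the estimate is weak, which is the limitation noted above.
```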
1460 00:50:15,260 --> 00:50:17,329 And finally, of course, I 1461 00:50:17,330 --> 00:50:19,129 mean, this is maybe already too late, but 1462 00:50:19,130 --> 00:50:20,809 we should do our best to impede, like, 1463 00:50:20,810 --> 00:50:22,429 the creation of so-called data 1464 00:50:22,430 --> 00:50:24,979 monopolies, because if one organization 1465 00:50:24,980 --> 00:50:26,689 or one sector has all the data in its 1466 00:50:26,690 --> 00:50:28,879 hands, we have already lost. 1467 00:50:28,880 --> 00:50:30,559 Because even if you have the same 1468 00:50:30,560 --> 00:50:32,629 algorithms, the same technologies 1469 00:50:32,630 --> 00:50:33,859 are at our hands. 1470 00:50:33,860 --> 00:50:35,839 Most of the value and data analysis isn't 1471 00:50:35,840 --> 00:50:37,669 the amount of the data that we can can 1472 00:50:37,670 --> 00:50:39,739 have. So if there's an adversary or 1473 00:50:39,740 --> 00:50:41,899 like an organization that has like orders 1474 00:50:41,900 --> 00:50:43,609 of magnitude more data to work with than 1475 00:50:43,610 --> 00:50:45,709 we, it's really unlikely that we will 1476 00:50:45,710 --> 00:50:48,019 be able to, like, compete 1477 00:50:48,020 --> 00:50:50,479 with that adversary on the same scale. 1478 00:50:50,480 --> 00:50:51,480 So. 1479 00:50:52,660 --> 00:50:53,660 As a final word, 1480 00:50:54,820 --> 00:50:57,009 I would say that algorithms 1481 00:50:57,010 --> 00:50:59,919 are probably a lot like children, 1482 00:50:59,920 --> 00:51:01,359 so they're very smart and they're really 1483 00:51:01,360 --> 00:51:02,769 eager to learn things. 1484 00:51:02,770 --> 00:51:04,869 And we, as the 1485 00:51:04,870 --> 00:51:06,759 data analyst, as the programmers, we have 1486 00:51:06,760 --> 00:51:08,709 to teach them to behave in the right way 1487 00:51:08,710 --> 00:51:10,779 and we should try to raise them to be 1488 00:51:10,780 --> 00:51:12,669 responsible adults. 1489 00:51:12,670 --> 00:51:14,170 OK, so thanks. 1490 00:51:20,910 --> 00:51:22,469 His father didn't like being poor. 1491 00:51:24,420 --> 00:51:26,519 We do have a few minutes left 1492 00:51:26,520 --> 00:51:28,649 for Q&A, I would like to ask you 1493 00:51:28,650 --> 00:51:31,319 to queue up at the microphones at the CIA 1494 00:51:31,320 --> 00:51:33,389 if you're watching at home. 1495 00:51:33,390 --> 00:51:34,390 We also 1496 00:51:35,520 --> 00:51:37,589 have a human computer interface 1497 00:51:37,590 --> 00:51:39,269 to relay questions to us. 1498 00:51:39,270 --> 00:51:41,309 I'd say we begin with that. 1499 00:51:41,310 --> 00:51:42,569 Do you have a question for us? 1500 00:51:42,570 --> 00:51:43,619 Yes. 1501 00:51:43,620 --> 00:51:45,749 Rootie is asking, what discrimination 1502 00:51:45,750 --> 00:51:47,969 number would you guess for discrimination 1503 00:51:47,970 --> 00:51:50,759 from politicians over people's choice 1504 00:51:50,760 --> 00:51:53,429 in one or several countries? 1505 00:51:53,430 --> 00:51:55,859 Um, politicians 1506 00:51:55,860 --> 00:51:57,929 about people's choice. 1507 00:51:57,930 --> 00:51:59,099 You mean can you 1508 00:52:00,480 --> 00:52:01,859 be a bit more precise on that? 1509 00:52:01,860 --> 00:52:03,329 I think it's difficult to. 1510 00:52:04,510 --> 00:52:06,429 We'll get back to that question. 1511 00:52:06,430 --> 00:52:08,789 We have one question in song 1512 00:52:08,790 --> 00:52:09,790 number two, please. 1513 00:52:10,920 --> 00:52:12,659 Thank you for your talk. 
1514 00:52:12,660 --> 00:52:14,249 Does it make any sense, or is there any 1515 00:52:14,250 --> 00:52:16,619 hope, that I as an individual 1516 00:52:16,620 --> 00:52:18,989 can fake my 1517 00:52:18,990 --> 00:52:21,599 data patterns, or 1518 00:52:21,600 --> 00:52:22,600 can I disturb 1519 00:52:23,590 --> 00:52:25,749 the pattern recognition 1520 00:52:25,750 --> 00:52:27,879 in a sensible way? 1521 00:52:27,880 --> 00:52:30,219 Yeah, yes, I think you surely 1522 00:52:30,220 --> 00:52:32,259 can. The question is only if this 1523 00:52:32,260 --> 00:52:33,879 would be effective to, for example, 1524 00:52:33,880 --> 00:52:36,279 protect you against de-anonymization, 1525 00:52:36,280 --> 00:52:38,619 because, as I said, faking 1526 00:52:38,620 --> 00:52:40,839 90 percent of your data can be useless 1527 00:52:40,840 --> 00:52:43,149 if 10 percent of your data points 1528 00:52:43,150 --> 00:52:45,489 are in buckets or in 1529 00:52:45,490 --> 00:52:47,259 attributes that are unique or almost 1530 00:52:47,260 --> 00:52:48,159 unique to your person. 1531 00:52:48,160 --> 00:52:50,289 So if you want this measure to 1532 00:52:50,290 --> 00:52:51,609 be effective, I think you would have to 1533 00:52:51,610 --> 00:52:52,659 be really convincing. 1534 00:52:52,660 --> 00:52:54,729 And I mean, I haven't 1535 00:52:54,730 --> 00:52:56,799 had a look at a very big data set, so 1536 00:52:56,800 --> 00:52:58,179 I really can't give a quantitative 1537 00:52:58,180 --> 00:53:00,279 answer, but I'm rather pessimistic about 1538 00:53:00,280 --> 00:53:00,909 this approach, 1539 00:53:00,910 --> 00:53:02,979 I have to say. OK, we 1540 00:53:02,980 --> 00:53:04,269 do have a few more questions. 1541 00:53:04,270 --> 00:53:06,639 I would ask the people in the room, if 1542 00:53:06,640 --> 00:53:08,559 you have to change rooms right now, 1543 00:53:08,560 --> 00:53:10,629 please do so in a quiet manner 1544 00:53:10,630 --> 00:53:13,089 so we can do the Q&A 1545 00:53:13,090 --> 00:53:14,559 without yelling. 1546 00:53:14,560 --> 00:53:16,329 We do have another question from the IRC, 1547 00:53:16,330 --> 00:53:18,609 and after that, it's number four. IRC, 1548 00:53:18,610 --> 00:53:19,419 please. 1549 00:53:19,420 --> 00:53:21,939 Atomic engineer is asking 1550 00:53:21,940 --> 00:53:24,249 if a human is generally able 1551 00:53:24,250 --> 00:53:26,049 to create an algorithm which is not 1552 00:53:26,050 --> 00:53:27,429 discriminating. 1553 00:53:27,430 --> 00:53:29,529 And he's making an analogy to 1554 00:53:29,530 --> 00:53:31,689 random numbers, where a human cannot 1555 00:53:31,690 --> 00:53:33,819 really create truly random 1556 00:53:33,820 --> 00:53:35,559 numbers because he or she would 1557 00:53:35,560 --> 00:53:37,099 always have a preference. 1558 00:53:37,100 --> 00:53:37,929 Hmm. 1559 00:53:37,930 --> 00:53:39,789 Yeah, that's a very interesting question. 1560 00:53:39,790 --> 00:53:41,919 Um, I mean, it really 1561 00:53:41,920 --> 00:53:43,989 comes down to the algorithm 1562 00:53:43,990 --> 00:53:46,239 having the information about a protected 1563 00:53:46,240 --> 00:53:48,369 class or not having it. So if it doesn't 1564 00:53:48,370 --> 00:53:49,659 have the information,
1565 00:53:49,660 --> 00:53:51,999 It can't be discriminating 1566 00:53:52,000 --> 00:53:54,069 by definition, because it can 1567 00:53:54,070 --> 00:53:56,169 only randomly guess if a person belongs 1568 00:53:56,170 --> 00:53:57,459 to a given group or not. 1569 00:53:57,460 --> 00:53:59,649 So in that sense, algorithm can be 1570 00:53:59,650 --> 00:54:01,449 perfectly unbiased, but only if they 1571 00:54:01,450 --> 00:54:03,399 don't have any information that that 1572 00:54:03,400 --> 00:54:05,829 gives away the protected status 1573 00:54:05,830 --> 00:54:07,689 of an object or person that they're 1574 00:54:07,690 --> 00:54:09,069 making a decision about. 1575 00:54:09,070 --> 00:54:10,440 So it's definitely possible that. 1576 00:54:12,900 --> 00:54:15,089 OK, the next question by number four, 1577 00:54:15,090 --> 00:54:16,499 please. 1578 00:54:16,500 --> 00:54:17,879 Thank you for your talk. 1579 00:54:17,880 --> 00:54:20,069 You say that algorithms discriminate in 1580 00:54:20,070 --> 00:54:22,379 the same way that humans can, 1581 00:54:22,380 --> 00:54:24,209 but I wonder if the real challenges that 1582 00:54:24,210 --> 00:54:25,979 algorithms discriminate in a slightly 1583 00:54:25,980 --> 00:54:27,989 different way than humans do. 1584 00:54:27,990 --> 00:54:29,729 And for example, you gave the example 1585 00:54:29,730 --> 00:54:32,219 that we can person we can 1586 00:54:32,220 --> 00:54:34,859 identify gender or other markers 1587 00:54:34,860 --> 00:54:36,059 from the data set. 1588 00:54:36,060 --> 00:54:38,189 Yeah, but what if these attributes 1589 00:54:38,190 --> 00:54:40,469 that identify that correlate with gender, 1590 00:54:40,470 --> 00:54:42,779 class, race, etc. 1591 00:54:42,780 --> 00:54:44,519 also correlate with other positive 1592 00:54:44,520 --> 00:54:46,020 attributes, such as 1593 00:54:47,220 --> 00:54:49,349 the study that you're more efficient 1594 00:54:49,350 --> 00:54:51,599 work when you live closer to your 1595 00:54:51,600 --> 00:54:53,279 the side of your employer. 1596 00:54:53,280 --> 00:54:54,959 But if you have a very segregated 1597 00:54:54,960 --> 00:54:57,029 society, that means that those who are 1598 00:54:57,030 --> 00:54:59,429 richer are also then classified 1599 00:54:59,430 --> 00:55:01,709 as more efficient workers and 1600 00:55:01,710 --> 00:55:03,749 when in the scoring of potential 1601 00:55:03,750 --> 00:55:04,959 employees. 1602 00:55:04,960 --> 00:55:07,349 And so the question is, 1603 00:55:07,350 --> 00:55:09,569 if such a thing occurs, 1604 00:55:09,570 --> 00:55:12,059 it's not just that discrimination can 1605 00:55:12,060 --> 00:55:14,459 can be an unintended outcome, 1606 00:55:14,460 --> 00:55:16,079 but also if the company wants to 1607 00:55:16,080 --> 00:55:18,359 discriminate, you cannot prove it because 1608 00:55:18,360 --> 00:55:20,759 you say we just hired the most qualified 1609 00:55:20,760 --> 00:55:23,069 candidate, but in fact, you just hired 1610 00:55:23,070 --> 00:55:24,629 certain kinds of people. 1611 00:55:24,630 --> 00:55:26,459 Yes. Yes. 
I mean, that's exactly the 1612 00:55:26,460 --> 00:55:29,219 argument about discrimination, 1613 00:55:29,220 --> 00:55:31,439 because if 1614 00:55:31,440 --> 00:55:33,989 you don't have the information about 1615 00:55:33,990 --> 00:55:36,209 how many people of a given 1616 00:55:36,210 --> 00:55:38,399 class, of a given protected status, 1617 00:55:38,400 --> 00:55:40,259 applied, for example, for a given job, 1618 00:55:40,260 --> 00:55:42,179 you can't figure out if there is any 1619 00:55:42,180 --> 00:55:43,349 discrimination in the process. 1620 00:55:43,350 --> 00:55:45,419 And so that means that you 1621 00:55:45,420 --> 00:55:47,009 have to somehow get that information into 1622 00:55:47,010 --> 00:55:48,929 the system in order to make an audit and 1623 00:55:48,930 --> 00:55:51,359 actually see if there's some unfair 1624 00:55:51,360 --> 00:55:52,469 bias in there. 1625 00:55:52,470 --> 00:55:53,879 And the other question, if I 1626 00:55:53,880 --> 00:55:55,769 understood it correctly, is whether 1627 00:55:55,770 --> 00:55:56,770 you 1628 00:55:58,470 --> 00:56:00,539 can infer information 1629 00:56:00,540 --> 00:56:03,149 about the gender from other things, 1630 00:56:03,150 --> 00:56:04,469 and this is certainly the 1631 00:56:04,470 --> 00:56:07,109 case, because as I said in the talk, 1632 00:56:07,110 --> 00:56:09,089 many things, like, for example, the 1633 00:56:09,090 --> 00:56:10,289 neighborhood that you live in, as you 1634 00:56:10,290 --> 00:56:12,509 said, give away information about the 1635 00:56:12,510 --> 00:56:13,799 protected attributes as well. 1636 00:56:15,000 --> 00:56:17,039 All right. We have a few more questions. I 1637 00:56:17,040 --> 00:56:19,199 would ask you to keep it short, 1638 00:56:19,200 --> 00:56:21,299 please. Microphone number five 1639 00:56:21,300 --> 00:56:22,300 in the back. 1640 00:56:23,430 --> 00:56:23,789 An often 1641 00:56:23,790 --> 00:56:26,099 heard statement is that 1642 00:56:26,100 --> 00:56:27,599 the more data you actually collect, the 1643 00:56:27,600 --> 00:56:29,549 less you can actually do with it, because 1644 00:56:29,550 --> 00:56:30,659 it's just too much. 1645 00:56:30,660 --> 00:56:32,669 Is there any scenario where this 1646 00:56:32,670 --> 00:56:34,019 statement makes any sense? 1647 00:56:35,100 --> 00:56:36,269 Yeah, there definitely is. 1648 00:56:36,270 --> 00:56:38,549 I mean, giving an algorithm 1649 00:56:38,550 --> 00:56:40,349 more data to train with is not always a 1650 00:56:40,350 --> 00:56:41,339 good thing. 1651 00:56:41,340 --> 00:56:43,589 It's pretty easy to overtrain 1652 00:56:43,590 --> 00:56:45,899 algorithms, that is, 1653 00:56:45,900 --> 00:56:48,089 to make a model that is 1654 00:56:48,090 --> 00:56:50,099 perfectly fitting the data that you give 1655 00:56:50,100 --> 00:56:52,049 it, but that has very little predictive 1656 00:56:52,050 --> 00:56:53,739 power for new data that you see. 1657 00:56:53,740 --> 00:56:55,829 But in general, increasing 1658 00:56:55,830 --> 00:56:57,659 the number of data points 1659 00:56:58,680 --> 00:57:00,959 is always improving the quality 1660 00:57:00,960 --> 00:57:03,029 of the model, if the data that you have 1661 00:57:03,030 --> 00:57:05,129 is from the same model as 1662 00:57:05,130 --> 00:57:06,869 well.
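The overtraining point in this answer is easy to demonstrate: a model flexible enough to fit its training data almost perfectly can still generalise poorly. A small self-contained illustration with a toy data set of my own, not from the talk:

```python
# Sketch: overtraining. An unconstrained decision tree nails the training
# data but does worse on held-out data than a constrained one.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)    # noisy target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for depth in (None, 3):    # None = grow until pure (overfits); 3 = constrained
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train R^2={tree.score(X_tr, y_tr):.2f}, "
          f"test R^2={tree.score(X_te, y_te):.2f}")
```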
So it could also happen that the 1663 00:57:06,870 --> 00:57:08,639 data that you have is not homogeneous, so 1664 00:57:08,640 --> 00:57:10,829 that one part of the data 1665 00:57:10,830 --> 00:57:12,389 is fitting well with one model, but the 1666 00:57:12,390 --> 00:57:13,679 other part of the data is fitting well 1667 00:57:13,680 --> 00:57:15,419 with another one. So in that case, it 1668 00:57:15,420 --> 00:57:17,249 might be difficult training a large 1669 00:57:17,250 --> 00:57:19,709 amount of data on a single model, but 1670 00:57:19,710 --> 00:57:21,359 it depends on the individual case, I 1671 00:57:21,360 --> 00:57:23,279 would say. So it's really not easy to 1672 00:57:23,280 --> 00:57:24,280 answer in that sense. 1673 00:57:25,140 --> 00:57:27,209 Thank you. We have time for two more 1674 00:57:27,210 --> 00:57:28,169 short questions. 1675 00:57:28,170 --> 00:57:30,359 I would ask one question from 1676 00:57:30,360 --> 00:57:31,409 the again. 1677 00:57:31,410 --> 00:57:33,869 Yes, Anayansi Lucas' asking, 1678 00:57:33,870 --> 00:57:36,149 isn't the black box nature of the machine 1679 00:57:36,150 --> 00:57:38,099 learning algorithms one of the biggest 1680 00:57:38,100 --> 00:57:40,199 problems can be 1681 00:57:40,200 --> 00:57:42,119 solved by better visualization or 1682 00:57:42,120 --> 00:57:43,829 understanding and what it really is 1683 00:57:43,830 --> 00:57:44,830 doing? 1684 00:57:45,530 --> 00:57:47,959 Yeah, for me, having algorithms that 1685 00:57:47,960 --> 00:57:49,969 are not open to scrutiny and that we 1686 00:57:49,970 --> 00:57:51,739 can't understand is one of the biggest 1687 00:57:51,740 --> 00:57:54,649 problems, of course. And, uh, um, 1688 00:57:54,650 --> 00:57:56,809 visualizing data can help, of course. 1689 00:57:56,810 --> 00:57:59,209 But as I said briefly in the talk, 1690 00:57:59,210 --> 00:58:01,459 since the space of possible parameters 1691 00:58:01,460 --> 00:58:02,929 in the space space of possible data 1692 00:58:02,930 --> 00:58:05,539 points is so enormous, 1693 00:58:05,540 --> 00:58:07,939 even for very small and 1694 00:58:07,940 --> 00:58:09,589 learning problems, that it's really 1695 00:58:09,590 --> 00:58:11,329 difficult to produce a given 1696 00:58:11,330 --> 00:58:13,099 visualization that would you that would 1697 00:58:13,100 --> 00:58:14,900 give you a high confidence and 1698 00:58:15,950 --> 00:58:17,419 a good information about, for example, 1699 00:58:17,420 --> 00:58:18,979 discrimination in the data set. 1700 00:58:18,980 --> 00:58:20,119 So it can certainly help. 1701 00:58:20,120 --> 00:58:22,189 But, uh, I think it's not a 1702 00:58:22,190 --> 00:58:23,190 perfect answer either. 1703 00:58:24,440 --> 00:58:26,719 OK, we have time for one more question. 1704 00:58:26,720 --> 00:58:29,269 Microphone number one, please. 1705 00:58:29,270 --> 00:58:30,349 Thank you. 1706 00:58:30,350 --> 00:58:32,449 In the beginning, you displayed the 1707 00:58:32,450 --> 00:58:35,029 green, yellow and red, 1708 00:58:35,030 --> 00:58:37,039 somehow agreed to give me more damaging. 1709 00:58:37,040 --> 00:58:39,169 The example you made about the green was 1710 00:58:39,170 --> 00:58:41,419 about some kind of algorithm that gave 1711 00:58:41,420 --> 00:58:42,329 to you information. 1712 00:58:42,330 --> 00:58:44,119 Don't you think that at the time of 1713 00:58:44,120 --> 00:58:46,189 exposure influence, 1714 00:58:46,190 --> 00:58:47,539 how much is damaging? 
1715 00:58:47,540 --> 00:58:49,999 Because if I get to influence the to here 1716 00:58:50,000 --> 00:58:51,760 is worse than just two days. 1717 00:58:52,900 --> 00:58:54,979 Mm hmm. Can you say that again for the 1718 00:58:54,980 --> 00:58:57,109 time of exposure, with time of 1719 00:58:57,110 --> 00:58:59,509 exposure to an algorithm to influence 1720 00:58:59,510 --> 00:59:01,789 your behavior has to be considered as 1721 00:59:01,790 --> 00:59:03,949 a factor to understand 1722 00:59:03,950 --> 00:59:04,369 if it's true. 1723 00:59:04,370 --> 00:59:05,869 Yeah, that's a very important point. 1724 00:59:05,870 --> 00:59:08,059 I mean, I also had like an experiment and 1725 00:59:08,060 --> 00:59:09,829 where I look at the interaction of the 1726 00:59:09,830 --> 00:59:12,799 algorithms with a person that he is like, 1727 00:59:12,800 --> 00:59:14,509 for example, showing articles, too. 1728 00:59:14,510 --> 00:59:16,579 And, um, this is like a topic of itself, 1729 00:59:16,580 --> 00:59:18,469 I would say. So there's definitely very 1730 00:59:18,470 --> 00:59:20,299 rich interaction. The results are not 1731 00:59:20,300 --> 00:59:21,539 captured by most models. 1732 00:59:21,540 --> 00:59:23,059 So like the algorithm influencing the 1733 00:59:23,060 --> 00:59:24,889 behavior of the person and then that 1734 00:59:24,890 --> 00:59:26,449 again influencing the actions of the 1735 00:59:26,450 --> 00:59:28,789 person and influencing the machine 1736 00:59:28,790 --> 00:59:30,619 learning of the further data. 1737 00:59:30,620 --> 00:59:32,059 So there's definitely some feedback in 1738 00:59:32,060 --> 00:59:33,419 the system. Absolutely. 1739 00:59:33,420 --> 00:59:34,420 Yeah. 1740 00:59:34,770 --> 00:59:37,069 OK, that's all the time we have. 1741 00:59:37,070 --> 00:59:39,050 Thanks again for the great talk.