The title of our next talk is quite descriptive, so I won't give too much of an introduction. Basically, we live in a world in which algorithms make more and more decisions about our daily lives. Many people are working on improving these algorithms; not as many are actually thinking about the implications. "Say Hi to Your New Boss: How Algorithms Might Soon Control Our Lives" is the title of our next talk. Please give a warm welcome to Andrew Stevens.

So hello, everyone. I have to say I'm really quite excited to be here again, and terrified as well, but mostly excited. First and foremost, I want to thank the organizers for inviting me again and for letting me speak at this really, really great conference.

As was said before, the title of my talk is "Say Hi to Your Boss", and I'm going to talk about algorithms and about shifting decision power from humans to machines. In case you were asking yourself why this is important: well, let's just ask a friend. I usually like to do that with Google autocomplete, and normally it gives me somewhat controversial statements like "algorithms are stupid" or "algorithms will never work". But in this case it seems pretty unambiguous: algorithms will play a very large role in this world. As I said, this is a big chance, because algorithms can improve our lives a lot. But it is also a problem, because we are shifting a lot of the decisions that are made by people today into the hands of machines, and in many cases we don't understand very well how these machines work and how exactly they make their decisions.
So I would say my main qualification for giving this talk is that I have shot myself in the foot with data a lot of times, and that made me interested in how and why algorithms do so many things we don't anticipate, and why they sometimes behave in ways that seem strange and contradictory to what we actually want to achieve with them. That is what I want to talk about, and we are going to do it like this. First, I will give you some theory: what an algorithm actually is, how machine learning algorithms make decisions, and how the whole big data thing and the new data-driven society play into this affair. Then I will show you some of the use cases for algorithms in our daily lives today. After that, we will be equipped with everything we need in order to start with some experiments. I come from physics, and when I try to understand something I usually do an experiment and try to break the thing, make it explode or whatever. So we are going to do the same here with our algorithms. I have picked two case studies to present: one about discrimination through algorithms, and another one about de-anonymization. Finally, I want to end with some proposals and ideas on how we can actually make the most of algorithms in this kind of setting, and also how to control and better understand what algorithms are doing.

OK. So first, as I said, I want to talk a bit about algorithms, and I just want to give you a very, very basic overview of machine learning and decision-making algorithms.
So please excuse me if there are any experts in the audience; I am probably making a lot of simplifications here.

OK, what is an algorithm? Here is an example. Basically, an algorithm is just a recipe that can be followed by a computer or a human being: it gives the computer or the human step-by-step instructions to achieve a certain goal. In this case, we want to activate a trapdoor, and we want to do that only if I am standing on the trapdoor. So the algorithm has to decide whether it's me; if it is, it can open the trapdoor, otherwise it has to wait. This is already a pretty fancy algorithm, because it needs some information about me, and it needs a reasonably intelligent way to decide whether the right person is standing on the trapdoor or not.

So how does the algorithm get that information? Well, it uses machine learning. Machine learning is a way to automatically generate a model that we can check against some training data, which we can then use to explain that data and, in addition, to predict unknown data. As you might know from school, just memorizing data and reproducing what you already know can get you through tests, but normally it won't make you pass with flying colors. Ideally we want something that can, in addition to memorizing data, also make predictions about data we have never seen before, and that is what machine learning helps us do. A bit more formalized, we can look at it as a model plus some data. Here on the right I show you several possible models that we can choose from.
Normally we can write them as explaining some variable y as a function M, for model, which takes some attributes or variables x and some parameters p, and returns a value for the quantity we want to predict: y = M(x, p) + ε. Now we can use data to train our models: we keep the models that are compatible with our training data and eliminate the ones that are not. Here on the right, we have eliminated all the models shown in red, whereas the models shown in green are compatible with our data. We can then use those surviving models to make predictions about unknown data points as well, which is shown here. Usually there is some error, some discrepancy, between the model and the data we try to explain, and this error is usually called epsilon.

Epsilon can be decomposed further into several parts. There is a systematic error, which is mostly due to miscalibration, an offset we make each time we measure a given variable; think of the speedometer in your car, which intentionally gives you a reading that is a bit too low to make sure you don't overstep the speed limit. In addition to the systematic error, there is noise in our data, which comes from the process that generated the data and from the measurement apparatus we used to capture it. And finally, there are hidden-variable errors, which are not random noise but errors due to variables that do have an impact on the outcome, but which we don't know and therefore cannot use to model the data.
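To make this a bit more concrete, here is a minimal sketch (my own illustration, not from the talk) of this kind of model selection: we keep the candidate models that are compatible with the noisy training data and use the survivors to predict an unseen point:

```python
import numpy as np

rng = np.random.default_rng(0)

# training data: y = M(x, p) + epsilon, with a linear "true" model and noise
x_train = np.linspace(0, 10, 20)
y_train = 1.5 * x_train + rng.normal(0, 1.0, size=x_train.size)

# candidate models: straight lines y = p * x with different parameters p
candidate_p = np.linspace(0.5, 2.5, 21)

def mse(p):
    """Mean squared error of candidate model p on the training data."""
    return np.mean((y_train - p * x_train) ** 2)

# keep the "green" models that are compatible with the data, drop the "red" ones
compatible = [p for p in candidate_p if mse(p) < 2.0]

# use the surviving models to predict an unseen data point
x_new = 12.0
predictions = [p * x_new for p in compatible]
print("compatible parameters:", [round(p, 2) for p in compatible])
print("prediction range at x=12:", round(min(predictions), 1), "to", round(max(predictions), 1))
```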
So that's the basics of model generation. Now, you have probably all heard about big data and the data-driven society, and the effect this has on model generation is threefold. For one, here you see roughly the data volume in 2000 compared to the data volume in 2015: today we have a lot more data on our hands to make predictions and train models. And we also have data of a much greater variety than before.

To understand the first effect, we can look at this graph, which shows some random data measured with pretty large noise, as you can see. The data also contains some information, and I doubt any of you can tell me whether the green points or the red points have the higher value. I guess not. But what we can do is take the data and average it, and by doing that we reduce the amount of noise. When we have enough samples to look at, we can make the noise so small that we can really detect a signal in the data; in this case the signal is just 0.01 high. So having more data in our hands allows us to train models that can take smaller effects into account. A small numerical sketch of this point follows below.
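As a small numerical sketch of this point (my own illustration, not the plot from the talk): a difference of 0.01 is invisible in a handful of noisy samples, but becomes clearly detectable once we average over enough of them:

```python
import numpy as np

rng = np.random.default_rng(1)

signal = 0.01   # true difference between the two groups
noise = 1.0     # standard deviation of the measurement noise

for n in (10, 1_000, 1_000_000):
    green = rng.normal(signal, noise, size=n)
    red = rng.normal(0.0, noise, size=n)
    measured = green.mean() - red.mean()
    # the uncertainty of the averaged difference shrinks like 1/sqrt(n)
    uncertainty = noise * np.sqrt(2.0 / n)
    print(f"n={n:>9}: measured difference = {measured:+.4f} (+/- {uncertainty:.4f})")
```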
Also, as I said, big data does not only give us more of the same data, it gives us different kinds of data. Think, for example, of all the smart devices you have in your home, like your smart fridge, your door, maybe an automated smoke detector, which all collect data about you and your interactions. We can incorporate that data into our models to make better predictions. This moves some of the noise that used to sit in hidden variables into the model, where we can use it for predictions.

Now, interpreting models can be hard, or it can be very simple, depending on the model. Some machine learning algorithms, like decision tree classifiers, are pretty easy to interpret, because we can just follow this graph and see exactly how the algorithm makes its decision about a given data point. Other models, like the neural network on the right side, are really hard to interpret: we can't get an intuitive feeling for how such a model actually makes its decisions. In fact, you have maybe seen these pictures; they basically show a neural network working in reverse, and they give us an idea of how the neural network understands a picture, in this case. You can see several structures that emerge at different places in the image, generated by the neural network while it is recognizing the features of the image. This method was developed precisely because it is really, really difficult to understand what a neural network is doing otherwise; the only handle we have is to look at what kind of input the network would produce for a given output.
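To illustrate the interpretability point, here is a small sketch (my own, assuming scikit-learn is available): the rules learned by a decision tree can simply be printed and read, which is not possible for a neural network:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# train a small decision tree on a toy data set
data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# the learned model is a human-readable set of if/else rules
print(export_text(tree, feature_names=list(data.feature_names)))
```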
So, now, what can you do with algorithms? Here I try to classify the uses of algorithms in our daily life into three different risk groups. There is a low-risk group, which affects our lives only superficially; if the algorithms making the decisions there go wrong or misbehave, it is only mildly annoying. Then there is the medium-risk area, where failure or misbehavior of an algorithm would be a bit more severe for our life, but not fatal. That is reserved for the high-risk area, where algorithms can take decisions that affect human lives and that can really be life-changing.

A few examples for the first group would be personalization services. Whenever you go to a website like Facebook or Amazon or Netflix, the website shows you some content and tries to show you content that is interesting to you. It uses an algorithm to do that, trying to predict from the articles you have viewed before which articles you will find interesting. This is the so-called recommendation engine, and it is in wide use in all kinds of services today. Then there is, of course, individualized ad targeting. You might have noticed that if you look at a product on some website and afterwards surf around on the web, ads for that kind of product seem to haunt you everywhere you go. This is also due to machine learning algorithms that try to predict which kind of ads you will find interesting and then show those ads to you on all kinds of different websites. And of course there are algorithms that do customer ratings.
For example, if you want to order a product online, they can estimate how likely it is that you will pay the invoice for that article, and if that likelihood is not very high, the system will only send you the article if you pay in advance. And there are things like customer demand prediction: the holy grail here would be to know what you want to buy before you know it yourself, and then send it to your door. After reading a patent, I think this is also what Amazon is trying to do in some cases. So these things affect our lives only superficially, and if something goes wrong it does not affect us in a very deep way.

Then there are other uses of algorithms in our lives. For example, a big topic coming up now, with big data and more data that we can collect about individuals, is personalized health: making decisions about possible treatments and lifestyle based on data collected about you, for example your heart rate, your pulse, how much you move around, how many stairs you climb each day. There is large potential here for improving areas such as medicine, but also others, and it uses the same or similar classification algorithms as the applications I showed you before. Another thing is person classification. Here we want to predict, for example, how likely it is that a person will commit a crime or will be a terrorist. These kinds of algorithms are already in use today, for example by governments, to issue restricted travel permits and to mark people that the algorithm gives a high risk profile for screening.
I think there are many talks here that deal with this particular problem as well. And of course there are autonomous cars, planes and machines, which are currently being developed or already in service, and which will take over driving from people in a few years, or maybe a few decades. And finally there is automated trading, which is mostly invisible to us but which also has a huge impact, because 95 or even 99 percent of all trades today are actually performed by algorithms and machines, not people.

Finally, there is the high-risk area, where we have things like military intelligence and intervention. There are already governments that use algorithms to predict targets, for example for drone strikes. We can also have governments that use machine learning and algorithms for political oppression, for example training firewall systems with heuristic algorithms to detect traffic that should be filtered out. And there are critical infrastructure services, like the electricity grid and other things that are critical to us, which are sometimes already governed or controlled by algorithms.

So as you can see, already today there are many areas of our life where algorithms, and not humans, make the decisions. If we plot this on the graph again, you would see that most things algorithms decide today actually sit in the green or the yellow area, with some things touching the critical part of our lives.
What big data and advanced machine learning will probably do in the coming years is, on the one hand, widen the applicability of algorithms, so we can use them in domains where we couldn't use them before, like speech recognition, customer service and many other things, and on the other hand let them penetrate deeper into our lives, making decisions that affect us on a more personal, more intimate and more critical level.

Good. So this is all I wanted to show you in terms of theory, and now I want to use the remaining time to show you two experiments which I did. There are lots of things that can go wrong when you use algorithms, but I picked two topics that I find especially important. The first topic we are looking at is discrimination through algorithms. The question here is: can an algorithm that is trained by a human, or by an earlier manual decision process, actually discriminate against certain groups of people as well? Discrimination is still a very big problem in our society, and we have fought for many, many years to push it back. And the question now is, of course: as we shift so much decision power from humans to machines, can we actually eliminate the discrimination that is still in the system, or are we going to carry it over into this automated decision making?

The definition of discrimination, to recap, is a treatment or consideration of a person that is based on his or her group, class or category, and not on his or her individual merit.
That means we prefer, or we put at a disadvantage, certain kinds of people according to their group or some protected attribute, which can be, for example, the ethnicity, the gender or the sexual orientation of the person.

Now we need, of course, a way to measure this discrimination, and the measure I use here has been developed in the US and is called disparate impact. It is quite nice because it uses a very clear and simple mathematical model to describe discrimination. Basically, the model says we have a process C which acts on people that either have a given attribute X or don't have it, for example men and women, and we measure the outcome of this process. We are interested in the probability of the decision being "yes" for a member of group X versus the probability of it being "yes" for a member of the other group. So we look at the ratio of the two conditional probabilities, τ = P(C = yes | X = 0) / P(C = yes | X = 1): the probability of making it through the process as a member of one group, divided by the probability of making it through as a member of the other group. This parameter τ describes the amount of discrimination in the system. For practical purposes we can choose a threshold, for example 80 percent, and if τ is smaller than that, we say the process contains discrimination. This is nice because it measures discrimination not only when it is done intentionally, but also when it happens inadvertently, without anyone wanting it.
So it doesn't really matter whether the people in the process want to discriminate; if they do it nevertheless, maybe unconsciously, this measure can give us an idea about it. Of course, in practice we don't deal with the probabilities directly: we measure the number of people in each category and then estimate the parameter τ by dividing the acceptance rate of one group, accepted divided by total, by the acceptance rate of the other group. So it is pretty easy and very straightforward; a small sketch of this computation follows below.
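As a quick sketch of that estimate (the counts here are made up for illustration):

```python
def disparate_impact(accepted_x, rejected_x, accepted_other, rejected_other):
    """Estimate tau: acceptance rate of group X divided by that of the other group."""
    rate_x = accepted_x / (accepted_x + rejected_x)
    rate_other = accepted_other / (accepted_other + rejected_other)
    return rate_x / rate_other

# made-up counts: 25 of 100 group-X candidates invited, 50 of 100 of the others
tau = disparate_impact(25, 75, 50, 50)
print(f"tau = {tau:.2f}")            # 0.50
print("discrimination:", tau < 0.8)  # fails the 80 percent rule
```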
Now I want to show how we can use this to test a process where we take decision power away from people and give it to an algorithm. The example I chose is an HR process, a hiring process: we want to select candidates based on the data they submit to a potential employer, for example their CV and other data about themselves. The benefits are, of course, saving time in the screening process and, hopefully, improving the choice of candidates. I chose this example because it is something that is already widely done: chances are that if you have applied for a job recently, you have been subjected to this kind of process, and several startups in the US, but also in Europe, are trying to implement this kind of data-driven hiring. So it is something that is really already happening.

OK, the setup is pretty simple. We have some information about the candidate that we submit to a human reviewer, and the human reviewer decides whether to invite the candidate or not. We also give that information to an algorithm as training data, and the algorithm then tries to replicate the human's decision about whether to hire the candidate. As input we use CVs, work samples and other publicly available information about the candidate that we can get; a human makes the decision about a given candidate, either yes or no, and we train the algorithm on this data. The approach here is a so-called big data approach: we basically try to get as much data about every candidate as we can, put it all into the algorithm, and let the algorithm figure out what to do with it.

The decision model for this is rather simple, and I show it here. In order to decide whether we hire a given candidate, we define a score S that has several parts. One part measures the merit of the candidate, based on his or her abilities. Another part is a discrimination malus or bonus, which increases or decreases the total score based on his or her membership in a given group. And then there is also some element of luck, which we have set to 20 percent here. We add these components together, and if the total is larger than a given bar, we invite the candidate; if not, we don't. As you can see, the bar effectively has a different height depending on the group of the candidate if there is discrimination in the system.
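As a sketch of such a scoring model (my own toy reconstruction with made-up weights, not the exact model from the talk):

```python
import random

LUCK_WEIGHT = 0.2           # the "element of luck" set to 20 percent
DISCRIMINATION_MALUS = 0.2  # hypothetical score penalty for one group
BAR = 0.6                   # threshold above which a candidate is invited

def invite(merit, in_protected_group):
    """Toy hiring decision: merit between 0 and 1, plus luck, minus a group malus."""
    score = (1 - LUCK_WEIGHT) * merit + LUCK_WEIGHT * random.random()
    if in_protected_group:
        score -= DISCRIMINATION_MALUS  # effectively a higher bar for this group
    return score > BAR

# same merit, different group membership
print(invite(0.7, in_protected_group=False))
print(invite(0.7, in_protected_group=True))
```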
OK, now we can train a predictor for that model. We give it the information about the candidate and also a lot of other information, everything else we can find, for example in public records or wherever else we can get our hands on it. Then we train the predictor to predict the outcome of the hiring process and look at the results. Since it is pretty hard to get hold of real-world data for this, what I did instead was to simulate 10,000 samples of an agent-based model, where we choose a decision function C and some disparate impact and generate training data with that. Then we use a standard machine learning algorithm, in this case a support vector machine, to fit that data and measure the discrimination that the algorithm itself produces. A sketch of this experiment is shown below.
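Here is a self-contained sketch of that experiment (my own reconstruction, not the speaker's code: the weights are made up and the sample is smaller than the 10,000 in the talk to keep it fast), assuming scikit-learn is available:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)
N = 2_000  # smaller than the 10,000 samples used in the talk, for speed

# agent-based model: candidates with a protected attribute, merit and luck
protected = rng.integers(0, 2, size=N)   # 1 = member of group X
merit = rng.uniform(0, 1, size=N)
luck = rng.uniform(0, 1, size=N)

# biased "human" decision: group X effectively faces a higher bar (disparate impact)
invited = (0.8 * merit + 0.2 * luck - 0.15 * protected > 0.5).astype(int)

def leaky_attribute(gamma):
    """Attribute as seen by the algorithm: the true value with probability gamma,
    otherwise a random guess (so gamma=0 carries no information)."""
    keep = rng.random(N) < gamma
    return np.where(keep, protected, rng.integers(0, 2, size=N))

def tau(decision, group):
    """Disparate impact of a set of decisions with respect to the true group."""
    return decision[group == 1].mean() / decision[group == 0].mean()

print("tau of the training data:", round(tau(invited, protected), 2))

for gamma in (0.0, 0.5, 1.0):
    features = np.column_stack([merit, leaky_attribute(gamma)])
    clf = SVC(kernel="rbf").fit(features, invited)
    predicted = clf.predict(features)
    fidelity = (predicted == invited).mean()
    print(f"gamma={gamma:.1f}: fidelity={fidelity:.2f}, "
          f"tau of the algorithm={tau(predicted, protected):.2f}")
```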
The result is shown in this graph. It is a bit complicated, so let's go through it one by one. On the x axis is the amount of information our algorithm has about the protected attribute X of the candidate, the attribute whose information we don't want to give away. At zero, the algorithm has no information at all about it; at one, it has full information about the protected attribute; at 0.5, it has the correct information in about 50 percent of the cases. Then there is our parameter τ, the disparate impact, which we set here to 0.5; this means that the chance of making it through the process is twice as high for one group as for the other. At the top we see the prediction fidelity of our algorithm, which lies between 86 and about 90 percent and which increases as we increase the rate γ at which the information leaks into the system. And finally we have τ again, the amount of discrimination done by the algorithm, measured as a function of the information leakage.

What this means is that the more information about the protected attribute we provide to the algorithm, the better it is able to discriminate against people in that group. If we are over here, and the algorithm has no information at all about the protected attribute, it cannot discriminate against those people, so the ratio of success between the groups is one. And this is actually great, because it means that if we can build an algorithm that has no idea about these protected attributes, we can eliminate all the discrimination that is in the system. On the other hand, if by some accident the algorithm gets full information about these attributes, it can discriminate just as well as a human against people in either group. That means that if we give too much information to our algorithm, we will have the same problem in the hiring process as before: discrimination against people, no longer by humans, but by a machine.

And now you are probably saying: OK, this is stupid.
Why would we give information about the protected group to the algorithm in the first place? And of course, the answer is that normally we don't. But the problem with big data, with having a lot of different data types and data sources at hand, is that even if we don't give that information to the algorithm explicitly, some amount of information about the attribute leaks through with all the other information that we provide. And this is basically the essence of the dilemma of having too much data on our hands: it is always very hard to keep information about sensitive things from leaking into our data set.

Of course, so far this is a purely theoretical formulation, but I actually tried to validate it using publicly available data. What I did was to get GitHub user data, which we can obtain through an API and which gives us information about the people on GitHub. First we need, of course, information about the protected attribute; in this case I chose gender, so man or woman. To do that, I had to manually classify the people I put into the study: I simply looked at about 5,000 profile pictures on GitHub and classified the people into men and women. This gave us the training labels for the analysis. I then retrieved additional information about each user, for example the number of projects on the site, the number of stargazers, the number of followers, and so on: basically everything I could get my hands on through the API. A small sketch of such a query is shown below.
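For illustration, here is a minimal sketch of querying such per-user numbers through the public GitHub API ('octocat' is just a placeholder account, and unauthenticated requests are heavily rate-limited):

```python
import requests

def user_features(username):
    """Fetch a few public profile numbers for a GitHub user."""
    resp = requests.get(f"https://api.github.com/users/{username}")
    resp.raise_for_status()
    user = resp.json()
    return {
        "public_repos": user["public_repos"],
        "public_gists": user["public_gists"],
        "followers": user["followers"],
        "following": user["following"],
    }

print(user_features("octocat"))
```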
I then used that data to try to predict the gender of the user, just based on the information I put into the system. And I want to say again, this is only a proof of concept: I used a very small data set and didn't do any optimization, I just wanted to see how easy it actually is to get this kind of contamination into our algorithm.

At first, when I used only very basic things like the number of stargazers or followers of each user, I couldn't get any prediction of the gender of the person at all. And in a way this is already great, because if a colleague says, oh, you know, women are not good programmers, you can now show them this data and basically disprove it, since gender cannot be predicted from that publicly available data. For me it was of course a bit disappointing, because I wanted to show that we can discriminate against these people, so I needed more data.

Luckily, GitHub helps us out there by providing an events API, which contains the full event stream of almost every action a given user has performed on the site. Every time you open a pull request, make a comment or do something else on GitHub, an event is created for it, and you can download all the public events on the site through the website shown here, process them and use them for data analysis. That is what I did: for all the users in my sample, I downloaded this event data and tried to extract more information that I could use to discriminate the gender of the people. A small sketch of fetching and summarizing a user's events is shown below.
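And a similar sketch for the events API, counting event types and activity hours from a user's recent public events (again my own illustration; the public endpoint only returns a limited window of recent events per user):

```python
from collections import Counter
from datetime import datetime

import requests

def event_stats(username):
    """Count event types and activity hours from a user's recent public events."""
    resp = requests.get(f"https://api.github.com/users/{username}/events/public")
    resp.raise_for_status()
    events = resp.json()

    types = Counter(e["type"] for e in events)
    hours = Counter(
        datetime.strptime(e["created_at"], "%Y-%m-%dT%H:%M:%SZ").hour for e in events
    )
    return types, hours

types, hours = event_stats("octocat")
print(types.most_common(5))   # e.g. PushEvent, IssueCommentEvent, ...
print(sorted(hours.items()))  # activity histogram by hour of day
```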
861 00:29:28,840 --> 00:29:31,119 Here we see the event 862 00:29:31,120 --> 00:29:33,279 frequency so averaged over all the 863 00:29:33,280 --> 00:29:35,169 events as a function of the hour. 864 00:29:35,170 --> 00:29:37,359 And you can see that now there seem 865 00:29:37,360 --> 00:29:39,519 to be some significant differences 866 00:29:39,520 --> 00:29:41,279 between men and women in our data set. 867 00:29:41,280 --> 00:29:43,119 So that's something that the algorithm 868 00:29:43,120 --> 00:29:44,649 could use to make a prediction about the 869 00:29:44,650 --> 00:29:46,839 gender and likewise, in 870 00:29:46,840 --> 00:29:48,969 the type of events that 871 00:29:48,970 --> 00:29:49,929 we have in our data set. 872 00:29:49,930 --> 00:29:52,159 There are also differences in 873 00:29:52,160 --> 00:29:54,399 the frequency of individual event 874 00:29:54,400 --> 00:29:56,019 types. So that's also something that the 875 00:29:56,020 --> 00:29:57,639 algorithm can use to make a decision 876 00:29:57,640 --> 00:29:58,640 about gender. 877 00:29:59,560 --> 00:30:01,659 Now, for the last thing, I went a bit 878 00:30:01,660 --> 00:30:03,849 crazy and I did something that 879 00:30:03,850 --> 00:30:05,929 you normally do in spam detection 880 00:30:05,930 --> 00:30:08,319 that is taking like to commit messages of 881 00:30:08,320 --> 00:30:10,569 individual contributors 882 00:30:10,570 --> 00:30:12,639 and just like inputting them into like a 883 00:30:12,640 --> 00:30:14,439 support vector classifier. 884 00:30:14,440 --> 00:30:16,539 And that basically looks 885 00:30:16,540 --> 00:30:18,399 at the frequencies of individual words 886 00:30:18,400 --> 00:30:19,719 and each commit and tries to find a 887 00:30:19,720 --> 00:30:22,149 difference in the text 888 00:30:22,150 --> 00:30:23,679 between man and woman. 889 00:30:23,680 --> 00:30:25,690 And this already gave me some 890 00:30:26,950 --> 00:30:29,169 good, like good fidelity of 891 00:30:29,170 --> 00:30:30,879 predicting the gender and combining it 892 00:30:30,880 --> 00:30:32,919 with the other information that I had. 893 00:30:32,920 --> 00:30:35,329 I could in fact, achieve 50 894 00:30:35,330 --> 00:30:37,449 percent better chance 895 00:30:37,450 --> 00:30:39,609 of like predicting the gender than by 896 00:30:39,610 --> 00:30:41,739 just guessing. So this is 897 00:30:41,740 --> 00:30:44,529 not very impressive and 898 00:30:44,530 --> 00:30:45,939 we can probably do much better. 899 00:30:45,940 --> 00:30:47,529 But again, this was only like a proof of 900 00:30:47,530 --> 00:30:49,659 concept to see how easy it actually is 901 00:30:49,660 --> 00:30:51,489 to get get this kind of information 902 00:30:51,490 --> 00:30:52,869 leaking into the system. 903 00:30:52,870 --> 00:30:56,019 And so this basically means that if we 904 00:30:56,020 --> 00:30:58,059 can make a predictor for the gender of 905 00:30:58,060 --> 00:31:00,159 the person, GitHub and the algorithm 906 00:31:00,160 --> 00:31:01,929 that we used to like make the decision 907 00:31:01,930 --> 00:31:03,849 about the hiring process can also 908 00:31:03,850 --> 00:31:05,349 generate this kind of information. 909 00:31:05,350 --> 00:31:07,689 If we give him if you give it this data 910 00:31:07,690 --> 00:31:09,789 and use it against the 911 00:31:09,790 --> 00:31:10,790 people. 
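The commit-message step is essentially text classification; a minimal sketch of how it might look with scikit-learn, assuming a hypothetical labelled table of commit messages per user (the exact features and classifier settings from the talk are not given).

```python
# Sketch: predict a protected attribute from commit messages, spam-filter
# style. Assumes a hypothetical CSV with columns: commit_text, gender.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("commits_labelled.csv")                     # hypothetical file
clf = make_pipeline(TfidfVectorizer(min_df=2), LinearSVC())  # word frequencies -> SVM
acc = cross_val_score(clf, df["commit_text"], df["gender"], cv=5).mean()

# "50 percent better than guessing" on a balanced two-class sample would
# correspond to roughly 0.75 accuracy against a 0.5 chance level.
print(f"cross-validated accuracy: {acc:.2f}")
```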
912 00:31:11,820 --> 00:31:13,409 So the take away from this is that the 913 00:31:13,410 --> 00:31:15,479 algorithm will readily 914 00:31:15,480 --> 00:31:17,159 learn discrimination from us if we 915 00:31:17,160 --> 00:31:19,289 provide them with the right training 916 00:31:19,290 --> 00:31:21,869 data and also information 917 00:31:21,870 --> 00:31:24,239 leakage so that getting information 918 00:31:24,240 --> 00:31:25,889 about protected attributes in our data 919 00:31:25,890 --> 00:31:27,839 sets that we don't want to have, there is 920 00:31:27,840 --> 00:31:30,389 actually a pretty easy and can happen 921 00:31:30,390 --> 00:31:31,390 if we are not careful. 922 00:31:33,480 --> 00:31:34,769 How can we fix this? 923 00:31:34,770 --> 00:31:36,719 Well, it's actually harder than you might 924 00:31:36,720 --> 00:31:38,939 think because often we don't 925 00:31:38,940 --> 00:31:40,649 even have the information about the 926 00:31:40,650 --> 00:31:42,239 protected attributes in our data sets 927 00:31:42,240 --> 00:31:44,549 because we don't want to to take the data 928 00:31:44,550 --> 00:31:47,009 from the user. I mean, imagine if 929 00:31:47,010 --> 00:31:48,599 you would apply for a job and your 930 00:31:48,600 --> 00:31:50,639 employer or potential employer would ask 931 00:31:50,640 --> 00:31:52,739 you for information about 932 00:31:52,740 --> 00:31:55,829 your sexual preferences, your gender, 933 00:31:55,830 --> 00:31:57,059 your ethnicity and everything. 934 00:31:57,060 --> 00:31:59,159 And plenty of other things probably 935 00:31:59,160 --> 00:32:00,839 wouldn't go down so well. 936 00:32:00,840 --> 00:32:02,639 But this is actually the kind of 937 00:32:02,640 --> 00:32:04,529 information that we would need in order 938 00:32:04,530 --> 00:32:06,359 to see if there is some disparate impact 939 00:32:06,360 --> 00:32:07,799 in our data. Because if you don't have 940 00:32:07,800 --> 00:32:09,929 that attribute information, we cannot 941 00:32:09,930 --> 00:32:12,059 like, um, calculate 942 00:32:12,060 --> 00:32:14,339 any fidelity or any like like measure 943 00:32:14,340 --> 00:32:15,569 of the discrimination that is in the 944 00:32:15,570 --> 00:32:17,639 process. And this is what is 945 00:32:17,640 --> 00:32:19,889 so dangerous about this, because our 946 00:32:19,890 --> 00:32:21,779 algorithm can discriminate against people 947 00:32:21,780 --> 00:32:23,309 without us even noticing. 948 00:32:25,370 --> 00:32:27,469 OK, this 949 00:32:27,470 --> 00:32:29,569 is already the first case 950 00:32:29,570 --> 00:32:31,759 study that I wanted to show, and we 951 00:32:31,760 --> 00:32:32,760 have seen that, 952 00:32:34,940 --> 00:32:37,039 that getting information into our 953 00:32:37,040 --> 00:32:39,259 dataset that we shouldn't have is 954 00:32:39,260 --> 00:32:40,009 pretty bad. 955 00:32:40,010 --> 00:32:42,199 And like the 956 00:32:42,200 --> 00:32:44,059 worst kind of information leakage that 957 00:32:44,060 --> 00:32:46,699 you can imagine is if you can identify 958 00:32:46,700 --> 00:32:48,169 someone from the data that you have 959 00:32:48,170 --> 00:32:49,969 obtained for them earlier. 960 00:32:49,970 --> 00:32:52,039 And I mean, again, if 961 00:32:52,040 --> 00:32:54,139 we ask Google about its opinion 962 00:32:54,140 --> 00:32:56,299 on privacy, it's the 963 00:32:56,300 --> 00:32:57,509 picture is rather bleak. 
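For the disparate impact point made a moment ago: the usual way to quantify it, once an audit sample that does contain the protected attribute exists, is to compare selection rates between groups. A small sketch with made-up numbers:

```python
# Sketch: disparate impact as a ratio of selection rates between groups.
import pandas as pd

audit = pd.DataFrame({               # made-up audit sample
    "gender": ["f", "f", "f", "m", "m", "m", "m", "m"],
    "hired":  [0,   1,   0,   1,   1,   0,   1,   1],
})
rates = audit.groupby("gender")["hired"].mean()
ratio = rates.min() / rates.max()

# A common rule of thumb (the "four-fifths rule") flags ratios below 0.8.
print(rates.to_dict(), f"disparate impact ratio: {ratio:.2f}")
```

Without the protected attribute column, this check simply cannot be computed, which is exactly the dilemma described above.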
964 00:32:57,510 --> 00:32:59,779 And it seems 965 00:32:59,780 --> 00:33:01,399 that many people are already getting 966 00:33:01,400 --> 00:33:03,709 used to the idea that we are in the 967 00:33:03,710 --> 00:33:05,449 post-privacy era right now. 968 00:33:05,450 --> 00:33:07,129 And so with the second experiment here, I 969 00:33:07,130 --> 00:33:09,229 want to show how easy it is actually 970 00:33:09,230 --> 00:33:11,929 to de-anonymize 971 00:33:11,930 --> 00:33:14,059 a given user's data even without wanting 972 00:33:14,060 --> 00:33:14,989 it. 973 00:33:14,990 --> 00:33:17,149 And what actually is 974 00:33:17,150 --> 00:33:19,729 de-anonymization? De-anonymization 975 00:33:19,730 --> 00:33:22,249 means that we have some 976 00:33:22,250 --> 00:33:24,199 information recorded about an individual 977 00:33:24,200 --> 00:33:27,319 or person and we use that information 978 00:33:27,320 --> 00:33:29,029 to predict the identity of that 979 00:33:29,030 --> 00:33:30,859 individual in another data set. 980 00:33:30,860 --> 00:33:33,319 So it's kind of like your data is following 981 00:33:33,320 --> 00:33:35,059 you around: even if you, for 982 00:33:35,060 --> 00:33:37,219 example, change the devices which you're 983 00:33:37,220 --> 00:33:39,529 working on, or you change 984 00:33:39,530 --> 00:33:40,579 your user accounts, 985 00:33:40,580 --> 00:33:43,309 the system is still able to identify 986 00:33:43,310 --> 00:33:45,499 you just by using the data that you have 987 00:33:45,500 --> 00:33:47,659 put into the system earlier or that 988 00:33:47,660 --> 00:33:49,939 was measured about you earlier. 989 00:33:49,940 --> 00:33:52,039 And de-anonymization becomes an 990 00:33:52,040 --> 00:33:54,169 increasing risk as the data sets 991 00:33:54,170 --> 00:33:55,849 that we have about individual users 992 00:33:55,850 --> 00:33:57,230 get bigger and bigger, actually. 993 00:34:01,640 --> 00:34:03,609 So I hope it's working. 994 00:34:03,610 --> 00:34:05,779 OK, now let's 995 00:34:05,780 --> 00:34:08,019 have a look at the math here. 996 00:34:08,020 --> 00:34:10,069 De-anonymization is a pretty big 997 00:34:10,070 --> 00:34:12,109 subject, and the math is rather fun, I assure 998 00:34:12,110 --> 00:34:13,579 you. 999 00:34:13,580 --> 00:34:16,759 And you maybe have played 1000 00:34:16,760 --> 00:34:18,948 this game with some of your friends where 1001 00:34:18,949 --> 00:34:20,779 you just think of some famous person 1002 00:34:20,780 --> 00:34:22,879 and your friend 1003 00:34:22,880 --> 00:34:24,649 has to guess who that is by just asking 1004 00:34:24,650 --> 00:34:26,749 you a series of yes or no questions. 1005 00:34:26,750 --> 00:34:28,669 And this actually works pretty 1006 00:34:28,670 --> 00:34:30,738 efficiently, so that after maybe 10 1007 00:34:30,739 --> 00:34:32,869 or 20 questions, you can 1008 00:34:32,870 --> 00:34:34,999 know exactly which person your 1009 00:34:35,000 --> 00:34:36,259 friend was thinking of. 1010 00:34:36,260 --> 00:34:38,448 And this works so well 1011 00:34:38,449 --> 00:34:40,849 because if we have 1012 00:34:40,850 --> 00:34:42,948 several 1013 00:34:42,949 --> 00:34:45,408 buckets 1014 00:34:45,409 --> 00:34:47,479 that are either true or false for a given 1015 00:34:47,480 --> 00:34:49,190 user, we can 1016 00:34:50,570 --> 00:34:52,819 create a unique fingerprint for the 1017 00:34:52,820 --> 00:34:54,319 user in our system.
1018 00:34:54,320 --> 00:34:56,629 And if you look at the probability 1019 00:34:56,630 --> 00:34:58,099 of like having a collision, so like 1020 00:34:58,100 --> 00:35:00,409 having two users that have exactly 1021 00:35:00,410 --> 00:35:03,409 the same true false values, 1022 00:35:03,410 --> 00:35:05,749 this is getting increasingly unlikely 1023 00:35:05,750 --> 00:35:07,969 the more buckets or the more different 1024 00:35:07,970 --> 00:35:10,189 types of information we can put into 1025 00:35:10,190 --> 00:35:11,599 our system. 1026 00:35:11,600 --> 00:35:13,279 And so, like the exact number or the 1027 00:35:13,280 --> 00:35:14,989 exact probability for finding a 1028 00:35:14,990 --> 00:35:17,239 correlation between users is depending 1029 00:35:17,240 --> 00:35:19,429 on the actual distribution of 1030 00:35:19,430 --> 00:35:21,249 the information in the buckets. 1031 00:35:21,250 --> 00:35:22,939 So if you have a uniform distribution, we 1032 00:35:22,940 --> 00:35:24,289 can calculate that number. 1033 00:35:24,290 --> 00:35:25,909 And as you can see, it decreases 1034 00:35:25,910 --> 00:35:28,069 exponentially, which is why this game 1035 00:35:28,070 --> 00:35:30,009 that I talked about earlier is working so 1036 00:35:30,010 --> 00:35:32,149 well since, for example, if you 1037 00:35:32,150 --> 00:35:33,619 assume that you have like one million 1038 00:35:35,360 --> 00:35:36,889 famous people that you can think of, then 1039 00:35:36,890 --> 00:35:38,779 it would probably be sufficient to have 1040 00:35:38,780 --> 00:35:40,819 like thirty two bits of information to 1041 00:35:40,820 --> 00:35:42,439 uniquely identify them all. 1042 00:35:42,440 --> 00:35:44,539 And we can imagine that with 1043 00:35:44,540 --> 00:35:47,569 big data, we have much we have many more 1044 00:35:47,570 --> 00:35:49,219 buckets that we can actually use so we 1045 00:35:49,220 --> 00:35:51,229 can identify not only a few million 1046 00:35:51,230 --> 00:35:52,879 people, but easily a few billion 1047 00:35:52,880 --> 00:35:55,549 different people using that technique. 1048 00:35:55,550 --> 00:35:57,619 And most real world data sets 1049 00:35:57,620 --> 00:36:00,169 are, of course, not uniformly distributed 1050 00:36:00,170 --> 00:36:02,269 so that we have more the case that 1051 00:36:02,270 --> 00:36:04,519 that many users are in the same bucket. 1052 00:36:04,520 --> 00:36:06,439 So, for example, many there are many 1053 00:36:06,440 --> 00:36:08,419 people that like the same kind of music. 1054 00:36:08,420 --> 00:36:10,489 And so they are all like have the same 1055 00:36:10,490 --> 00:36:12,559 information or the same attribute. 1056 00:36:12,560 --> 00:36:15,109 And using that attribute 1057 00:36:15,110 --> 00:36:16,969 to the anonymized to the users wouldn't 1058 00:36:16,970 --> 00:36:18,829 give us much to wouldn't do as much good 1059 00:36:18,830 --> 00:36:20,449 because it wouldn't help us to like 1060 00:36:20,450 --> 00:36:22,429 narrow down the number of users in our 1061 00:36:22,430 --> 00:36:24,469 system. But there are also many 1062 00:36:24,470 --> 00:36:26,689 attributes that are pretty unique to each 1063 00:36:26,690 --> 00:36:28,759 one of us. For example, the place that we 1064 00:36:28,760 --> 00:36:30,769 are living or the combination of that 1065 00:36:30,770 --> 00:36:32,239 place with the work, where we going. 
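The arithmetic behind the guessing game is easy to check: each uniformly distributed yes/no attribute halves the candidate set, so roughly log2(N) of them are enough to single out one person among N. A sketch:

```python
# Sketch: how many uniform binary "buckets" are needed to single out one
# person among N, and how quickly a collision becomes unlikely.
import math

def bits_needed(population: int) -> int:
    """Smallest k with 2**k >= population."""
    return math.ceil(math.log2(population))

def collision_prob(k: int) -> float:
    """Chance that one other fixed person matches you on all k uniform bits."""
    return 0.5 ** k

for n in (1_000_000, 8_000_000_000):
    print(f"{n:>13,} people -> {bits_needed(n)} bits")
for k in (4, 16, 32):
    print(f"k={k:2d}: collision probability {collision_prob(k):.1e}")
```

Twenty bits already cover a million famous people, and thirty-two cover a few billion, which is why the figure given in the talk is more than sufficient.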
1066 00:36:32,240 --> 00:36:34,459 So having a few of those quite 1067 00:36:34,460 --> 00:36:35,959 unique data points for each user is 1068 00:36:35,960 --> 00:36:38,119 usually already enough to de-anonymize 1069 00:36:38,120 --> 00:36:39,499 us with a very high fidelity. 1070 00:36:42,290 --> 00:36:44,089 And again, I wanted to see if this is 1071 00:36:44,090 --> 00:36:46,279 actually working in practice, so 1072 00:36:46,280 --> 00:36:48,409 what I did was to get a data set, 1073 00:36:48,410 --> 00:36:50,119 in this case from Microsoft Research 1074 00:36:50,120 --> 00:36:52,399 Asia, which contains GPS 1075 00:36:52,400 --> 00:36:54,589 data of about 1076 00:36:54,590 --> 00:36:56,809 200 people who tracked their whole 1077 00:36:56,810 --> 00:36:59,089 activity for sometimes 1078 00:36:59,090 --> 00:37:00,889 several years, sometimes several months, 1079 00:37:00,890 --> 00:37:03,079 and I used 1080 00:37:03,080 --> 00:37:04,819 the data to create a movement profile, so 1081 00:37:04,820 --> 00:37:06,079 to say. 1082 00:37:06,080 --> 00:37:07,789 I also have an animated version of 1083 00:37:07,790 --> 00:37:10,639 that, where you can see here 1084 00:37:10,640 --> 00:37:13,369 the different trajectories 1085 00:37:13,370 --> 00:37:15,469 of individual users. 1086 00:37:15,470 --> 00:37:17,329 I don't know if anyone recognizes the 1087 00:37:17,330 --> 00:37:18,330 city. 1088 00:37:20,020 --> 00:37:22,119 It's Beijing, actually, 1089 00:37:22,120 --> 00:37:24,279 and if you're wondering what 1090 00:37:24,280 --> 00:37:26,589 this square is, 1091 00:37:26,590 --> 00:37:28,269 I looked at Google Maps and it seems to 1092 00:37:28,270 --> 00:37:30,669 be the university, so I guess 1093 00:37:30,670 --> 00:37:32,169 it's like in any other field of study: 1094 00:37:32,170 --> 00:37:34,119 whenever you need some guinea pigs 1095 00:37:34,120 --> 00:37:36,459 to take data for you, you go and ask 1096 00:37:36,460 --> 00:37:37,630 students. So. 1097 00:37:39,600 --> 00:37:42,419 OK, so this is a pretty rich data set. 1098 00:37:42,420 --> 00:37:44,399 We have, in some cases, hundreds 1099 00:37:44,400 --> 00:37:45,659 of thousands of data points per 1100 00:37:45,660 --> 00:37:47,759 individual, and I wanted to see how 1101 00:37:47,760 --> 00:37:49,439 easy it would be with this data to 1102 00:37:49,440 --> 00:37:51,149 actually de-anonymize users. 1103 00:37:51,150 --> 00:37:53,459 So what I did was to first look 1104 00:37:53,460 --> 00:37:54,759 at individual trajectories. 1105 00:37:54,760 --> 00:37:57,119 So here we have the GPS traces 1106 00:37:57,120 --> 00:37:59,819 of the individuals, color coded, 1107 00:37:59,820 --> 00:38:02,039 and then to apply a very simple 1108 00:38:02,040 --> 00:38:04,139 grid, so in this 1109 00:38:04,140 --> 00:38:06,299 case a four by four grid, and 1110 00:38:06,300 --> 00:38:09,239 just measure the frequency 1111 00:38:09,240 --> 00:38:11,399 with which a given individual 1112 00:38:11,400 --> 00:38:13,529 has some data points 1113 00:38:13,530 --> 00:38:15,239 in a given square. 1114 00:38:15,240 --> 00:38:17,609 So doing this for 1115 00:38:17,610 --> 00:38:19,289 the two hundred people gives me something 1116 00:38:19,290 --> 00:38:21,479 like this. So this is the four by 1117 00:38:21,480 --> 00:38:23,729 four grid. And you can see the colors 1118 00:38:23,730 --> 00:38:26,279 represent the number of times a given 1119 00:38:26,280 --> 00:38:27,989 person has been in a given square.
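A minimal sketch of the grid fingerprint just described, assuming a trajectory is simply an array of (latitude, longitude) points; the bounding box below is a made-up Beijing-sized rectangle, and numpy's 2-D histogram does the bucketing.

```python
# Sketch: turn a GPS trace into an n-by-n grid of visit frequencies.
import numpy as np

LAT_RANGE = (39.6, 40.2)    # hypothetical bounding box around the city
LON_RANGE = (116.0, 116.8)

def fingerprint(trace: np.ndarray, n: int = 4) -> np.ndarray:
    """trace: array of shape (num_points, 2) with (lat, lon) rows."""
    counts, _, _ = np.histogram2d(trace[:, 0], trace[:, 1],
                                  bins=n, range=[LAT_RANGE, LON_RANGE])
    return counts / counts.sum()   # normalise so traces of different length compare

# toy trace: 1000 points clustered around two places
rng = np.random.default_rng(0)
trace = np.concatenate([rng.normal([39.9, 116.3], 0.02, size=(700, 2)),
                        rng.normal([40.0, 116.5], 0.02, size=(300, 2))])
print(fingerprint(trace, n=4))
```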
1120 00:38:27,990 --> 00:38:29,939 So white would mean that the person has 1121 00:38:29,940 --> 00:38:32,009 been there very often; black would mean the 1122 00:38:32,010 --> 00:38:33,179 person has never been in this given 1123 00:38:33,180 --> 00:38:34,079 square. 1124 00:38:34,080 --> 00:38:36,239 And you can already see, 1125 00:38:36,240 --> 00:38:38,249 in the 60 examples that I show you 1126 00:38:38,250 --> 00:38:40,379 here, that many of them 1127 00:38:40,380 --> 00:38:42,119 seem to be quite unique, for example, 1128 00:38:42,120 --> 00:38:43,749 this one and this one. 1129 00:38:43,750 --> 00:38:46,110 So it should be possible to 1130 00:38:47,310 --> 00:38:49,049 kind of make a fingerprint for a given 1131 00:38:49,050 --> 00:38:50,669 user using that data. 1132 00:38:50,670 --> 00:38:53,249 And of course, if we need more resolution 1133 00:38:53,250 --> 00:38:55,559 to, for example, disambiguate users 1134 00:38:55,560 --> 00:38:57,269 like these here, where we have more or 1135 00:38:57,270 --> 00:38:59,369 less the same data and we can't 1136 00:38:59,370 --> 00:39:01,529 decide which user we have, 1137 00:39:01,530 --> 00:39:03,119 we can just increase the resolution, for 1138 00:39:03,120 --> 00:39:05,369 example, to eight by eight or to 16 1139 00:39:05,370 --> 00:39:07,499 by 16, as here. 1140 00:39:07,500 --> 00:39:09,629 And now, 1141 00:39:09,630 --> 00:39:11,519 coming back to our buckets, if we 1142 00:39:11,520 --> 00:39:13,289 measure the distribution of the 1143 00:39:13,290 --> 00:39:15,569 attributes that we have here, we 1144 00:39:15,570 --> 00:39:17,969 can get an idea of how good our choice 1145 00:39:17,970 --> 00:39:20,339 actually is. And you can see that 1146 00:39:20,340 --> 00:39:22,049 the choice that we have made is 1147 00:39:22,050 --> 00:39:24,419 actually pretty bad, because in the first 1148 00:39:24,420 --> 00:39:25,979 bucket, the bucket with the most 1149 00:39:25,980 --> 00:39:28,049 data points, we have about 10 to 1150 00:39:28,050 --> 00:39:30,299 the six, or 1,000,000, points. 1151 00:39:30,300 --> 00:39:32,489 But the interesting part of this curve, 1152 00:39:32,490 --> 00:39:34,679 which is, by the way, logarithmic, 1153 00:39:34,680 --> 00:39:36,809 is here: in this 1154 00:39:36,810 --> 00:39:39,089 long tail of the distribution, where 1155 00:39:39,090 --> 00:39:41,249 we sometimes have only 1156 00:39:41,250 --> 00:39:43,679 one or sometimes a couple of individual 1157 00:39:43,680 --> 00:39:45,179 persons in a given bucket. 1158 00:39:45,180 --> 00:39:46,829 So if we can get some information in 1159 00:39:46,830 --> 00:39:49,529 these buckets, it's easy to use that 1160 00:39:49,530 --> 00:39:51,989 to de-anonymize our users. 1161 00:39:51,990 --> 00:39:54,419 OK, and how do we do that? 1162 00:39:54,420 --> 00:39:56,819 Again, we use a very simple measure: 1163 00:39:56,820 --> 00:39:59,399 we just take the fingerprint 1164 00:39:59,400 --> 00:40:01,589 of one user or one trace 1165 00:40:01,590 --> 00:40:03,569 and multiply it with the fingerprint of 1166 00:40:03,570 --> 00:40:05,549 another trace, pixel by pixel, 1167 00:40:05,550 --> 00:40:07,349 which gives us the values on the 1168 00:40:07,350 --> 00:40:09,539 right. And then we take these individual 1169 00:40:09,540 --> 00:40:11,009 values and sum them up.
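The similarity measure just described is simply the cell-by-cell product of two fingerprints, summed; with normalised grids this is a dot product. A sketch, reusing the fingerprint() helper from the previous block:

```python
# Sketch: score = sum over cells of fingerprint_a * fingerprint_b.
import numpy as np

def similarity(fp_a: np.ndarray, fp_b: np.ndarray) -> float:
    return float(np.sum(fp_a * fp_b))

fp_home = np.array([[0.5, 0.5], [0.0, 0.0]])   # toy 2x2 fingerprints
fp_other = np.array([[0.0, 0.0], [0.5, 0.5]])
print(similarity(fp_home, fp_home), similarity(fp_home, fp_other))  # 0.5 vs 0.0
```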
1170 00:40:11,010 --> 00:40:13,169 And this gives us kind of a score of how 1171 00:40:13,170 --> 00:40:15,119 similar to use as our two trajectories 1172 00:40:15,120 --> 00:40:16,379 are. 1173 00:40:16,380 --> 00:40:18,449 So doing this, 1174 00:40:18,450 --> 00:40:20,639 we can take 75 percent 1175 00:40:20,640 --> 00:40:22,709 of our data as a training set. 1176 00:40:22,710 --> 00:40:24,689 So we just like teach our algorithm to 1177 00:40:24,690 --> 00:40:26,789 like, recognize individual users and 1178 00:40:26,790 --> 00:40:28,709 then we can use the remaining 25 percent 1179 00:40:28,710 --> 00:40:30,959 to test how good our algorithm is at 1180 00:40:30,960 --> 00:40:33,269 recognizing the users now 1181 00:40:33,270 --> 00:40:35,669 and then we look at the average 1182 00:40:35,670 --> 00:40:38,249 probability of identification and 1183 00:40:38,250 --> 00:40:40,439 also of the rank that the user has 1184 00:40:40,440 --> 00:40:42,569 in this and this prediction. 1185 00:40:42,570 --> 00:40:43,679 And this is shown here. 1186 00:40:48,110 --> 00:40:50,039 So what a show is the, um, 1187 00:40:51,110 --> 00:40:53,239 the probability of, 1188 00:40:53,240 --> 00:40:55,429 like, finding the right user 1189 00:40:55,430 --> 00:40:57,779 within, for example, the first two, 1190 00:40:57,780 --> 00:41:00,379 the first for the first six users 1191 00:41:00,380 --> 00:41:02,089 that have the highest scores for a given 1192 00:41:02,090 --> 00:41:04,729 chase. And you can see that 1193 00:41:04,730 --> 00:41:07,819 for even 16, 1194 00:41:07,820 --> 00:41:09,859 16 squares. So the four by four grid that 1195 00:41:09,860 --> 00:41:12,109 I showed you in the beginning, the ID 1196 00:41:12,110 --> 00:41:13,909 rate is already 20 percent here. 1197 00:41:13,910 --> 00:41:16,039 So we can identify uniquely 1198 00:41:16,040 --> 00:41:18,499 one fifth of our user by just using 1199 00:41:18,500 --> 00:41:20,179 16 data points. 1200 00:41:20,180 --> 00:41:21,739 And the more data points we use, 1201 00:41:21,740 --> 00:41:23,869 actually, the more the better we can 1202 00:41:23,870 --> 00:41:25,939 identify the users in our data set. 1203 00:41:25,940 --> 00:41:28,579 And with 1024 1204 00:41:28,580 --> 00:41:30,319 individual data points, which would be 1205 00:41:30,320 --> 00:41:32,449 quite easy to get in a real 1206 00:41:32,450 --> 00:41:34,639 world setting, we can uniquely 1207 00:41:34,640 --> 00:41:37,339 identify almost 30 percent of the users. 1208 00:41:37,340 --> 00:41:38,839 And again, I want to state that this is 1209 00:41:38,840 --> 00:41:40,069 just a proof of concept. 1210 00:41:40,070 --> 00:41:42,109 And so there has been no optimization 1211 00:41:42,110 --> 00:41:43,849 done and no like fine tuning of 1212 00:41:43,850 --> 00:41:45,230 parameters or anything. 1213 00:41:46,710 --> 00:41:48,809 And we can also use that technique to not 1214 00:41:48,810 --> 00:41:50,549 only identify single users, but also to 1215 00:41:50,550 --> 00:41:52,629 find similarities between users. 1216 00:41:52,630 --> 00:41:54,089 So this could be interesting, for 1217 00:41:54,090 --> 00:41:56,519 example, to see who is 1218 00:41:56,520 --> 00:41:58,649 related to whom and who are you 1219 00:41:58,650 --> 00:42:00,599 visiting, who are your friends may be. 1220 00:42:00,600 --> 00:42:02,789 And this is what I did here. 1221 00:42:02,790 --> 00:42:04,259 So I used the same metric as before. 
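Putting the pieces together, the evaluation described above (75 percent of each user's points for training, 25 percent for testing, then a top-k hit rate) might look roughly like this; fingerprint() and similarity() are the helpers sketched earlier, and the `traces` dictionary mapping user ids to point arrays is assumed.

```python
# Sketch: top-k identification rate over a dict {user_id: (lat, lon) array}.
import numpy as np

def top_k_rate(traces: dict, k: int = 1, n: int = 4, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    train_fp, test_fp = {}, {}
    for user, pts in traces.items():
        pts = rng.permutation(pts)                # shuffle the points
        split = int(0.75 * len(pts))              # 75/25 split
        train_fp[user] = fingerprint(pts[:split], n)
        test_fp[user] = fingerprint(pts[split:], n)
    hits = 0
    for user, fp in test_fp.items():
        ranked = sorted(train_fp, key=lambda u: similarity(fp, train_fp[u]), reverse=True)
        hits += user in ranked[:k]                # is the true user among the k best scores?
    return hits / len(traces)

# top_k_rate(traces, k=1, n=4) would correspond to the roughly 20 percent
# unique identification rate reported above for the four by four grid.
```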
1222 00:42:04,260 --> 00:42:06,209 And I just told the system to give me the 1223 00:42:06,210 --> 00:42:07,559 users that are most similar to each 1224 00:42:07,560 --> 00:42:09,659 other. And you can see here in 1225 00:42:09,660 --> 00:42:12,149 green the trajectories of one user 1226 00:42:12,150 --> 00:42:14,309 and in red the trajectories 1227 00:42:14,310 --> 00:42:15,239 of the other user. 1228 00:42:15,240 --> 00:42:16,919 And the areas that are yellow are 1229 00:42:16,920 --> 00:42:19,829 actually where the two of them coincide. 1230 00:42:19,830 --> 00:42:21,419 You can see there are some hits of the 1231 00:42:21,420 --> 00:42:23,429 system which don't seem too good, but 1232 00:42:23,430 --> 00:42:24,579 there are also 1233 00:42:24,580 --> 00:42:26,609 some hits where you can see 1234 00:42:26,610 --> 00:42:28,709 a really big agreement 1235 00:42:28,710 --> 00:42:30,209 between the two data sets. 1236 00:42:30,210 --> 00:42:32,759 And I mean, I don't know 1237 00:42:32,760 --> 00:42:34,739 who was taking this data, because it's 1238 00:42:34,740 --> 00:42:36,779 anonymized, but I would guess in this 1239 00:42:36,780 --> 00:42:38,579 case that it's either a taxi driver 1240 00:42:38,580 --> 00:42:40,739 or maybe a bus driver, because 1241 00:42:40,740 --> 00:42:42,719 you can see that we cover almost the 1242 00:42:42,720 --> 00:42:45,419 whole Beijing area with these two traces. 1243 00:42:45,420 --> 00:42:46,979 And so this technique makes it really, 1244 00:42:46,980 --> 00:42:49,049 really easy to identify 1245 00:42:49,050 --> 00:42:51,449 users and also to find out 1246 00:42:51,450 --> 00:42:53,459 who they are related to and which 1247 00:42:53,460 --> 00:42:55,739 other users are similar to them. 1248 00:42:55,740 --> 00:42:59,139 And we can, of course, improve 1249 00:42:59,140 --> 00:43:01,229 the identification rate of the 1250 00:43:01,230 --> 00:43:03,449 system by, for example, taking 1251 00:43:03,450 --> 00:43:04,859 into account not only the spatial 1252 00:43:04,860 --> 00:43:07,469 information, but also the temporal 1253 00:43:07,470 --> 00:43:09,119 information, for example, the day-night 1254 00:43:09,120 --> 00:43:10,679 cycle, which you see here in the 1255 00:43:10,680 --> 00:43:12,749 background. So here the green points 1256 00:43:12,750 --> 00:43:14,279 have been taken at night and the red 1257 00:43:14,280 --> 00:43:16,439 ones have been taken during the day. 1258 00:43:16,440 --> 00:43:18,599 So data like, for example, 1259 00:43:18,600 --> 00:43:20,399 going to work in the morning and coming 1260 00:43:20,400 --> 00:43:23,069 back in the evening can 1261 00:43:23,070 --> 00:43:25,259 then be used to increase the prediction 1262 00:43:25,260 --> 00:43:27,359 fidelity for identifying a given 1263 00:43:27,360 --> 00:43:28,360 user. 1264 00:43:28,840 --> 00:43:30,969 And of course, we could also change 1265 00:43:30,970 --> 00:43:33,069 the choice of our buckets and 1266 00:43:33,070 --> 00:43:34,899 change the way we do the 1267 00:43:34,900 --> 00:43:36,459 fingerprinting in order to increase the 1268 00:43:36,460 --> 00:43:37,579 fidelity of the algorithm. 1269 00:43:37,580 --> 00:43:38,979 So there's plenty of room for 1270 00:43:38,980 --> 00:43:40,749 optimization. And as I said, this is only 1271 00:43:40,750 --> 00:43:42,039 a proof of principle.
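One way to fold in the day/night information mentioned above is to build one spatial grid per time-of-day bucket and concatenate them. This is only a sketch of that idea, assuming each point also carries an hour of day and reusing the bounding box from the earlier block.

```python
# Sketch: spatio-temporal fingerprint = night grid + day grid, concatenated.
import numpy as np

def spatiotemporal_fingerprint(lat, lon, hour, n=4):
    """lat, lon, hour: 1-D arrays of equal length; hour in 0..23."""
    night = (hour < 6) | (hour >= 22)
    grids = []
    for mask in (night, ~night):                  # one grid per time bucket
        counts, _, _ = np.histogram2d(lat[mask], lon[mask],
                                      bins=n, range=[LAT_RANGE, LON_RANGE])
        grids.append(counts.ravel())
    fp = np.concatenate(grids)
    return fp / max(fp.sum(), 1.0)                # guard against empty traces
```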
1272 00:43:42,040 --> 00:43:44,409 But there are other similar 1273 00:43:44,410 --> 00:43:46,599 works in the literature which show that 1274 00:43:46,600 --> 00:43:48,249 even with a very simple method, you can 1275 00:43:48,250 --> 00:43:50,439 achieve quite good identification rates 1276 00:43:50,440 --> 00:43:51,440 in such a data set. 1277 00:43:53,510 --> 00:43:55,699 Now, to summarize this: this 1278 00:43:55,700 --> 00:43:58,129 means that the more data we have about 1279 00:43:58,130 --> 00:44:00,469 a given entity, a person, the more 1280 00:44:00,470 --> 00:44:02,299 difficult it is actually to keep 1281 00:44:02,300 --> 00:44:04,429 algorithms from directly learning and 1282 00:44:04,430 --> 00:44:06,769 using the identity of that object 1283 00:44:06,770 --> 00:44:08,779 for a prediction instead of an attribute. 1284 00:44:08,780 --> 00:44:10,699 That means, as I said before, that the 1285 00:44:10,700 --> 00:44:12,949 data which a given user or a given 1286 00:44:12,950 --> 00:44:15,199 person generates follows 1287 00:44:15,200 --> 00:44:16,889 him or her around for their whole life. 1288 00:44:16,890 --> 00:44:19,069 So even if you would change all 1289 00:44:19,070 --> 00:44:21,319 of your smartphones, all of your devices, 1290 00:44:21,320 --> 00:44:22,519 some parts of your behavior would 1291 00:44:22,520 --> 00:44:23,539 probably stay the same. 1292 00:44:23,540 --> 00:44:25,669 And this could be used to 1293 00:44:25,670 --> 00:44:27,469 identify you later in the process, again, 1294 00:44:27,470 --> 00:44:28,999 with a pretty high fidelity. 1295 00:44:29,000 --> 00:44:31,039 So that's one of the biggest risks of big 1296 00:44:31,040 --> 00:44:33,439 data for me, because it's 1297 00:44:33,440 --> 00:44:35,659 very easy, if we don't avoid 1298 00:44:35,660 --> 00:44:37,729 it, to destroy 1299 00:44:37,730 --> 00:44:38,989 the privacy of our users. 1300 00:44:40,100 --> 00:44:41,780 OK, yeah, thanks. 1301 00:44:49,220 --> 00:44:50,459 So what can we do about this? 1302 00:44:51,620 --> 00:44:52,999 I don't have all the answers, of course, 1303 00:44:53,000 --> 00:44:55,489 but I have a few ideas, and I mean, 1304 00:44:55,490 --> 00:44:56,490 there are lots of people 1305 00:44:57,830 --> 00:44:58,999 working on 1306 00:45:00,210 --> 00:45:01,849 political and societal and 1307 00:45:01,850 --> 00:45:03,199 technological solutions for this. 1308 00:45:03,200 --> 00:45:04,459 So here I just want to give a brief 1309 00:45:04,460 --> 00:45:07,009 overview of things that can 1310 00:45:07,010 --> 00:45:09,169 be important in order to avoid these 1311 00:45:09,170 --> 00:45:11,629 two scenarios that I have shown here. 1312 00:45:11,630 --> 00:45:13,999 So 1313 00:45:14,000 --> 00:45:15,529 the group of people that we probably have 1314 00:45:15,530 --> 00:45:17,899 to educate the most urgently 1315 00:45:17,900 --> 00:45:20,179 about this is, of course, data 1316 00:45:20,180 --> 00:45:21,529 scientists.
So the people that actually 1317 00:45:21,530 --> 00:45:23,179 work with the data and create these 1318 00:45:23,180 --> 00:45:25,759 algorithms. Because today, 1319 00:45:25,760 --> 00:45:28,099 in Germany, for example, 1320 00:45:28,100 --> 00:45:30,739 you need like a three-year apprenticeship 1321 00:45:30,740 --> 00:45:32,869 in order to bake a cheesecake, but there's 1322 00:45:32,870 --> 00:45:34,819 nothing comparable in order to become an 1323 00:45:34,820 --> 00:45:36,409 algorithm scientist and to develop 1324 00:45:36,410 --> 00:45:37,699 these kinds of algorithms that have a 1325 00:45:37,700 --> 00:45:40,069 large influence on our daily lives. 1326 00:45:40,070 --> 00:45:42,559 So there probably should be 1327 00:45:42,560 --> 00:45:44,299 a better curriculum in universities and 1328 00:45:44,300 --> 00:45:46,759 even in schools, maybe, to educate 1329 00:45:46,760 --> 00:45:48,529 people not only about the possibilities 1330 00:45:48,530 --> 00:45:50,689 of data analysis and about 1331 00:45:50,690 --> 00:45:53,119 scraping even the last few percent 1332 00:45:53,120 --> 00:45:55,189 of fidelity from a given algorithm, 1333 00:45:55,190 --> 00:45:56,869 but also about the risks and the 1334 00:45:56,870 --> 00:45:58,069 dangers of using these kinds of 1335 00:45:58,070 --> 00:46:00,199 technologies, especially when other 1336 00:46:00,200 --> 00:46:01,909 people are involved. 1337 00:46:01,910 --> 00:46:03,739 And another thing that we should be 1338 00:46:03,740 --> 00:46:06,049 careful with is collecting 1339 00:46:06,050 --> 00:46:07,639 data without actually needing it. 1340 00:46:07,640 --> 00:46:09,739 Today, 1341 00:46:09,740 --> 00:46:12,019 one of the most popular approaches 1342 00:46:12,020 --> 00:46:14,719 in big data is just to take everything 1343 00:46:14,720 --> 00:46:16,069 that you can get, 1344 00:46:16,070 --> 00:46:17,749 so all the data that we can get our hands 1345 00:46:17,750 --> 00:46:19,789 on, to give it to the algorithm and to let 1346 00:46:19,790 --> 00:46:22,129 it decide how it uses 1347 00:46:22,130 --> 00:46:24,199 it. And this is good 1348 00:46:24,200 --> 00:46:26,329 because it increases the fidelity of 1349 00:46:26,330 --> 00:46:28,099 our predictions, but as I explained 1350 00:46:28,100 --> 00:46:29,659 earlier, it can be also very dangerous, 1351 00:46:29,660 --> 00:46:31,759 because maybe the algorithm can learn 1352 00:46:31,760 --> 00:46:33,919 things which it isn't supposed to learn. 1353 00:46:33,920 --> 00:46:36,499 So we should really be 1354 00:46:36,500 --> 00:46:37,999 more careful with the data that we put 1355 00:46:38,000 --> 00:46:39,000 into these systems. 1356 00:46:40,260 --> 00:46:42,029 And of course, another thing that we 1357 00:46:42,030 --> 00:46:43,979 can do is to try to remove 1358 00:46:43,980 --> 00:46:46,209 discrimination and disparate impact, and 1359 00:46:46,210 --> 00:46:48,299 there's also a lot of academic 1360 00:46:48,300 --> 00:46:50,909 work giving techniques 1361 00:46:50,910 --> 00:46:52,919 and methods that we can use for 1362 00:46:52,920 --> 00:46:54,629 doing this.
But here, the problem again, 1363 00:46:54,630 --> 00:46:56,669 is that most people that actually work in 1364 00:46:56,670 --> 00:46:58,589 the fields where these algorithms are put 1365 00:46:58,590 --> 00:47:00,809 into practice either don't know 1366 00:47:00,810 --> 00:47:02,339 about these things, are not interested in 1367 00:47:02,340 --> 00:47:04,499 those. So I think here we have a big 1368 00:47:04,500 --> 00:47:06,629 potential for like improving 1369 00:47:06,630 --> 00:47:08,729 the education of data scientists and data 1370 00:47:08,730 --> 00:47:11,699 analysts as citizens. 1371 00:47:11,700 --> 00:47:13,859 We can also do something, of course. 1372 00:47:13,860 --> 00:47:16,169 So the first thing is to not blindly 1373 00:47:16,170 --> 00:47:17,879 trust the decisions made by algorithms. 1374 00:47:17,880 --> 00:47:19,949 So if most people have kind 1375 00:47:19,950 --> 00:47:22,139 of a bias to think that a decision made 1376 00:47:22,140 --> 00:47:24,209 by a computer, by algorithm is maybe more 1377 00:47:24,210 --> 00:47:26,609 fair than a decision made by a human. 1378 00:47:26,610 --> 00:47:27,899 And I think this is something we have to 1379 00:47:27,900 --> 00:47:30,209 get rid of because algorithms as a show 1380 00:47:30,210 --> 00:47:32,219 can be just as discriminating against 1381 00:47:32,220 --> 00:47:33,360 people as humans can. 1382 00:47:34,810 --> 00:47:37,569 So, um, and if we can't, 1383 00:47:37,570 --> 00:47:40,089 like, question their 1384 00:47:40,090 --> 00:47:42,879 decisions, we can at least test them 1385 00:47:42,880 --> 00:47:43,989 and see if there's actually 1386 00:47:43,990 --> 00:47:45,169 discrimination in the system. 1387 00:47:45,170 --> 00:47:47,019 And now this sounds pretty easy, but it's 1388 00:47:47,020 --> 00:47:49,569 actually very hard because the algorithms 1389 00:47:49,570 --> 00:47:51,789 are mostly like in the hands of big 1390 00:47:51,790 --> 00:47:54,069 organizations or corporations and are, 1391 00:47:54,070 --> 00:47:55,689 of course, like a closely guarded trade 1392 00:47:55,690 --> 00:47:57,159 secrets in most times. 1393 00:47:57,160 --> 00:47:59,229 And this means that we have to 1394 00:47:59,230 --> 00:48:00,519 use techniques such as reverse 1395 00:48:00,520 --> 00:48:02,589 engineering in order to to like find 1396 00:48:02,590 --> 00:48:04,659 out how the internals of 1397 00:48:04,660 --> 00:48:05,889 the algorithm might work. 1398 00:48:05,890 --> 00:48:07,629 And I have to say, I'm a bit pessimistic 1399 00:48:07,630 --> 00:48:10,059 about this because, um, whereas 1400 00:48:10,060 --> 00:48:12,039 where the companies or the organizations 1401 00:48:12,040 --> 00:48:14,379 could use like like huge buckets and huge 1402 00:48:14,380 --> 00:48:15,519 amounts of data to train these 1403 00:48:15,520 --> 00:48:17,559 algorithms, the amount of data that we 1404 00:48:17,560 --> 00:48:20,199 can use for reverse engineering then 1405 00:48:20,200 --> 00:48:21,849 is minuscule, is very small in 1406 00:48:21,850 --> 00:48:22,449 comparison. 1407 00:48:22,450 --> 00:48:24,519 So it's really not very likely that we 1408 00:48:24,520 --> 00:48:26,049 would be able to make a good decision 1409 00:48:26,050 --> 00:48:28,479 based on these kinds of techniques. 1410 00:48:28,480 --> 00:48:31,119 And of course, we can also one thing 1411 00:48:31,120 --> 00:48:32,709 that we can do is to fight back with 1412 00:48:32,710 --> 00:48:34,899 data. 
So by collecting 1413 00:48:34,900 --> 00:48:37,029 data about decisions that are made 1414 00:48:37,030 --> 00:48:39,159 of about us from the algorithms and 1415 00:48:39,160 --> 00:48:41,889 by centralizing that we can like 1416 00:48:41,890 --> 00:48:43,989 create a lot of opportunities for other 1417 00:48:43,990 --> 00:48:46,329 researchers and other people to analyze 1418 00:48:46,330 --> 00:48:48,189 these data sets and to like find 1419 00:48:48,190 --> 00:48:50,259 discrimination and other things in them. 1420 00:48:50,260 --> 00:48:53,409 And so I would encourage you to, um, 1421 00:48:53,410 --> 00:48:55,779 if you like, are reluctant to like 1422 00:48:55,780 --> 00:48:56,899 like give away your data. 1423 00:48:56,900 --> 00:48:59,439 I can, of course, understand it, but, um, 1424 00:48:59,440 --> 00:49:01,539 in some cases, it's really the only way 1425 00:49:01,540 --> 00:49:03,939 to make sure that someone can actually 1426 00:49:03,940 --> 00:49:06,129 work with the data and detect 1427 00:49:06,130 --> 00:49:08,739 also like like find injustices 1428 00:49:08,740 --> 00:49:09,639 that are caused by it. 1429 00:49:09,640 --> 00:49:12,669 So we have to really think about 1430 00:49:12,670 --> 00:49:14,669 differently of giving away our data and 1431 00:49:14,670 --> 00:49:16,869 like like also creating data and machine 1432 00:49:16,870 --> 00:49:18,369 learning against machine learning. 1433 00:49:21,410 --> 00:49:22,410 So. 1434 00:49:24,380 --> 00:49:26,749 As a society, we can, of course, 1435 00:49:26,750 --> 00:49:28,309 create better regulations for algorithm, 1436 00:49:28,310 --> 00:49:29,479 and this is actually something that has 1437 00:49:29,480 --> 00:49:30,679 been done. 1438 00:49:30,680 --> 00:49:33,049 I mean, the beginning of the year, 1439 00:49:33,050 --> 00:49:35,209 our minister of justice was demanding of 1440 00:49:35,210 --> 00:49:37,519 Facebook to to open 1441 00:49:37,520 --> 00:49:39,589 up their algorithm. And this 1442 00:49:39,590 --> 00:49:41,029 was much ridiculed at the time. 1443 00:49:41,030 --> 00:49:42,769 But I think it actually has some merit, 1444 00:49:42,770 --> 00:49:44,899 because if we can't 1445 00:49:44,900 --> 00:49:47,389 understand how corporations 1446 00:49:47,390 --> 00:49:48,919 or companies are using algorithms, we 1447 00:49:48,920 --> 00:49:51,139 can't know if they're discriminating 1448 00:49:51,140 --> 00:49:52,399 against certain people or if they're 1449 00:49:52,400 --> 00:49:53,629 treating us fairly. 1450 00:49:53,630 --> 00:49:55,849 So having as an auditing 1451 00:49:55,850 --> 00:49:58,009 system in place, that allows at least 1452 00:49:58,010 --> 00:49:59,479 a group of people to have a look at this 1453 00:49:59,480 --> 00:50:01,729 algorithm and to see how they're working 1454 00:50:01,730 --> 00:50:03,529 would be a first step in the direction of 1455 00:50:03,530 --> 00:50:06,619 making these things more transparent. 1456 00:50:06,620 --> 00:50:08,959 And of course, making access to the data 1457 00:50:08,960 --> 00:50:10,969 more easy and a safe way is also 1458 00:50:10,970 --> 00:50:13,429 important to be able to to detect 1459 00:50:13,430 --> 00:50:15,259 any problems that we have with it. 
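One simple form such an audit or outside test could take, even without access to the internals, is paired probing: query the scoring system with inputs that differ only in a proxy attribute and compare the outcomes. This is only a sketch of the idea; `score_candidate` and `flip_proxy` stand in for whatever black-box interface and proxy attribute are actually available, and are purely hypothetical.

```python
# Sketch: paired-probe audit of a black-box scorer.
def audit_pairs(score_candidate, profiles, flip_proxy):
    """score_candidate: hypothetical black-box scoring function.
    profiles: list of feature dicts describing applicants.
    flip_proxy: returns a copy of a profile with only the proxy attribute
    changed (for example a different neighbourhood or first name)."""
    gaps = [score_candidate(p) - score_candidate(flip_proxy(p)) for p in profiles]
    return sum(gaps) / len(gaps)   # mean score difference across the pairs

# A mean gap far from zero suggests the proxy influences the decision; with
# only a few probes the estimate is weak, which is the limitation noted above.
```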
1460 00:50:15,260 --> 00:50:17,329 And finally, of course, I 1461 00:50:17,330 --> 00:50:19,129 mean, this is maybe already too late, but 1462 00:50:19,130 --> 00:50:20,809 we should do our best to impede, like, 1463 00:50:20,810 --> 00:50:22,429 the creation of so-called data 1464 00:50:22,430 --> 00:50:24,979 monopolies, because if one organization 1465 00:50:24,980 --> 00:50:26,689 or one sector has all the data in its 1466 00:50:26,690 --> 00:50:28,879 hands, we have already lost. 1467 00:50:28,880 --> 00:50:30,559 Because even if you have the same 1468 00:50:30,560 --> 00:50:32,629 algorithms, the same technologies 1469 00:50:32,630 --> 00:50:33,859 are at our hands. 1470 00:50:33,860 --> 00:50:35,839 Most of the value and data analysis isn't 1471 00:50:35,840 --> 00:50:37,669 the amount of the data that we can can 1472 00:50:37,670 --> 00:50:39,739 have. So if there's an adversary or 1473 00:50:39,740 --> 00:50:41,899 like an organization that has like orders 1474 00:50:41,900 --> 00:50:43,609 of magnitude more data to work with than 1475 00:50:43,610 --> 00:50:45,709 we, it's really unlikely that we will 1476 00:50:45,710 --> 00:50:48,019 be able to, like, compete 1477 00:50:48,020 --> 00:50:50,479 with that adversary on the same scale. 1478 00:50:50,480 --> 00:50:51,480 So. 1479 00:50:52,660 --> 00:50:53,660 As a final word, 1480 00:50:54,820 --> 00:50:57,009 I would say that algorithms 1481 00:50:57,010 --> 00:50:59,919 are probably a lot like children, 1482 00:50:59,920 --> 00:51:01,359 so they're very smart and they're really 1483 00:51:01,360 --> 00:51:02,769 eager to learn things. 1484 00:51:02,770 --> 00:51:04,869 And we, as the 1485 00:51:04,870 --> 00:51:06,759 data analyst, as the programmers, we have 1486 00:51:06,760 --> 00:51:08,709 to teach them to behave in the right way 1487 00:51:08,710 --> 00:51:10,779 and we should try to raise them to be 1488 00:51:10,780 --> 00:51:12,669 responsible adults. 1489 00:51:12,670 --> 00:51:14,170 OK, so thanks. 1490 00:51:20,910 --> 00:51:22,469 His father didn't like being poor. 1491 00:51:24,420 --> 00:51:26,519 We do have a few minutes left 1492 00:51:26,520 --> 00:51:28,649 for Q&A, I would like to ask you 1493 00:51:28,650 --> 00:51:31,319 to queue up at the microphones at the CIA 1494 00:51:31,320 --> 00:51:33,389 if you're watching at home. 1495 00:51:33,390 --> 00:51:34,390 We also 1496 00:51:35,520 --> 00:51:37,589 have a human computer interface 1497 00:51:37,590 --> 00:51:39,269 to relay questions to us. 1498 00:51:39,270 --> 00:51:41,309 I'd say we begin with that. 1499 00:51:41,310 --> 00:51:42,569 Do you have a question for us? 1500 00:51:42,570 --> 00:51:43,619 Yes. 1501 00:51:43,620 --> 00:51:45,749 Rootie is asking, what discrimination 1502 00:51:45,750 --> 00:51:47,969 number would you guess for discrimination 1503 00:51:47,970 --> 00:51:50,759 from politicians over people's choice 1504 00:51:50,760 --> 00:51:53,429 in one or several countries? 1505 00:51:53,430 --> 00:51:55,859 Um, politicians 1506 00:51:55,860 --> 00:51:57,929 about people's choice. 1507 00:51:57,930 --> 00:51:59,099 You mean can you 1508 00:52:00,480 --> 00:52:01,859 be a bit more precise on that? 1509 00:52:01,860 --> 00:52:03,329 I think it's difficult to. 1510 00:52:04,510 --> 00:52:06,429 We'll get back to that question. 1511 00:52:06,430 --> 00:52:08,789 We have one question in song 1512 00:52:08,790 --> 00:52:09,790 number two, please. 1513 00:52:10,920 --> 00:52:12,659 Thank you for your talk. 
1514 00:52:12,660 --> 00:52:14,249 Does it make any sense, or is there any 1515 00:52:14,250 --> 00:52:16,619 hope, that I as an individual 1516 00:52:16,620 --> 00:52:18,989 can fake my 1517 00:52:18,990 --> 00:52:21,599 data patterns, or 1518 00:52:21,600 --> 00:52:22,600 can I disturb 1519 00:52:23,590 --> 00:52:25,749 the pattern recognition 1520 00:52:25,750 --> 00:52:27,879 in a sensible way? 1521 00:52:27,880 --> 00:52:30,219 Yeah, yes, I think you surely 1522 00:52:30,220 --> 00:52:32,259 can. The question is only if this 1523 00:52:32,260 --> 00:52:33,879 would be effective to, for example, 1524 00:52:33,880 --> 00:52:36,279 protect you against de-anonymization, 1525 00:52:36,280 --> 00:52:38,619 because, as I said, faking 1526 00:52:38,620 --> 00:52:40,839 90 percent of your data can be useless 1527 00:52:40,840 --> 00:52:43,149 if 10 percent of your data points 1528 00:52:43,150 --> 00:52:45,489 are in buckets or in 1529 00:52:45,490 --> 00:52:47,259 attributes that are unique or almost 1530 00:52:47,260 --> 00:52:48,159 unique to your person. 1531 00:52:48,160 --> 00:52:50,289 So if you want this measure to 1532 00:52:50,290 --> 00:52:51,609 be effective, I think you would have to 1533 00:52:51,610 --> 00:52:52,659 be really convincing. 1534 00:52:52,660 --> 00:52:54,729 And I mean, I haven't 1535 00:52:54,730 --> 00:52:56,799 had a look at a very big data set, so 1536 00:52:56,800 --> 00:52:58,179 I really can't give a quantitative 1537 00:52:58,180 --> 00:53:00,279 answer, but I'm rather pessimistic about 1538 00:53:00,280 --> 00:53:00,909 this approach, 1539 00:53:00,910 --> 00:53:02,979 I have to say. OK, we 1540 00:53:02,980 --> 00:53:04,269 do have a few more questions. 1541 00:53:04,270 --> 00:53:06,639 I would ask the people in the room, if 1542 00:53:06,640 --> 00:53:08,559 you have to change rooms right now, 1543 00:53:08,560 --> 00:53:10,629 please do so in a quiet manner 1544 00:53:10,630 --> 00:53:13,089 so we can do the Q&A 1545 00:53:13,090 --> 00:53:14,559 without yelling. 1546 00:53:14,560 --> 00:53:16,329 We do have another question from the IRC, 1547 00:53:16,330 --> 00:53:18,609 and after that, it's number four. IRC, 1548 00:53:18,610 --> 00:53:19,419 please. 1549 00:53:19,420 --> 00:53:21,939 Atomic engineer is asking 1550 00:53:21,940 --> 00:53:24,249 if a human is generally able 1551 00:53:24,250 --> 00:53:26,049 to create an algorithm which is not 1552 00:53:26,050 --> 00:53:27,429 discriminating. 1553 00:53:27,430 --> 00:53:29,529 And he's making an analogy to 1554 00:53:29,530 --> 00:53:31,689 random numbers, where a human cannot 1555 00:53:31,690 --> 00:53:33,819 really create truly random 1556 00:53:33,820 --> 00:53:35,559 numbers because he or she would 1557 00:53:35,560 --> 00:53:37,099 always have a preference. 1558 00:53:37,100 --> 00:53:37,929 Hmm. 1559 00:53:37,930 --> 00:53:39,789 Yeah, that's a very interesting question. 1560 00:53:39,790 --> 00:53:41,919 Um, I mean, it really 1561 00:53:41,920 --> 00:53:43,989 comes down to the algorithm 1562 00:53:43,990 --> 00:53:46,239 having the information about a protected 1563 00:53:46,240 --> 00:53:48,369 class or not having it. So if it doesn't 1564 00:53:48,370 --> 00:53:49,659 have the information,
1565 00:53:49,660 --> 00:53:51,999 It can't be discriminating 1566 00:53:52,000 --> 00:53:54,069 by definition, because it can 1567 00:53:54,070 --> 00:53:56,169 only randomly guess if a person belongs 1568 00:53:56,170 --> 00:53:57,459 to a given group or not. 1569 00:53:57,460 --> 00:53:59,649 So in that sense, algorithm can be 1570 00:53:59,650 --> 00:54:01,449 perfectly unbiased, but only if they 1571 00:54:01,450 --> 00:54:03,399 don't have any information that that 1572 00:54:03,400 --> 00:54:05,829 gives away the protected status 1573 00:54:05,830 --> 00:54:07,689 of an object or person that they're 1574 00:54:07,690 --> 00:54:09,069 making a decision about. 1575 00:54:09,070 --> 00:54:10,440 So it's definitely possible that. 1576 00:54:12,900 --> 00:54:15,089 OK, the next question by number four, 1577 00:54:15,090 --> 00:54:16,499 please. 1578 00:54:16,500 --> 00:54:17,879 Thank you for your talk. 1579 00:54:17,880 --> 00:54:20,069 You say that algorithms discriminate in 1580 00:54:20,070 --> 00:54:22,379 the same way that humans can, 1581 00:54:22,380 --> 00:54:24,209 but I wonder if the real challenges that 1582 00:54:24,210 --> 00:54:25,979 algorithms discriminate in a slightly 1583 00:54:25,980 --> 00:54:27,989 different way than humans do. 1584 00:54:27,990 --> 00:54:29,729 And for example, you gave the example 1585 00:54:29,730 --> 00:54:32,219 that we can person we can 1586 00:54:32,220 --> 00:54:34,859 identify gender or other markers 1587 00:54:34,860 --> 00:54:36,059 from the data set. 1588 00:54:36,060 --> 00:54:38,189 Yeah, but what if these attributes 1589 00:54:38,190 --> 00:54:40,469 that identify that correlate with gender, 1590 00:54:40,470 --> 00:54:42,779 class, race, etc. 1591 00:54:42,780 --> 00:54:44,519 also correlate with other positive 1592 00:54:44,520 --> 00:54:46,020 attributes, such as 1593 00:54:47,220 --> 00:54:49,349 the study that you're more efficient 1594 00:54:49,350 --> 00:54:51,599 work when you live closer to your 1595 00:54:51,600 --> 00:54:53,279 the side of your employer. 1596 00:54:53,280 --> 00:54:54,959 But if you have a very segregated 1597 00:54:54,960 --> 00:54:57,029 society, that means that those who are 1598 00:54:57,030 --> 00:54:59,429 richer are also then classified 1599 00:54:59,430 --> 00:55:01,709 as more efficient workers and 1600 00:55:01,710 --> 00:55:03,749 when in the scoring of potential 1601 00:55:03,750 --> 00:55:04,959 employees. 1602 00:55:04,960 --> 00:55:07,349 And so the question is, 1603 00:55:07,350 --> 00:55:09,569 if such a thing occurs, 1604 00:55:09,570 --> 00:55:12,059 it's not just that discrimination can 1605 00:55:12,060 --> 00:55:14,459 can be an unintended outcome, 1606 00:55:14,460 --> 00:55:16,079 but also if the company wants to 1607 00:55:16,080 --> 00:55:18,359 discriminate, you cannot prove it because 1608 00:55:18,360 --> 00:55:20,759 you say we just hired the most qualified 1609 00:55:20,760 --> 00:55:23,069 candidate, but in fact, you just hired 1610 00:55:23,070 --> 00:55:24,629 certain kinds of people. 1611 00:55:24,630 --> 00:55:26,459 Yes. Yes. 
I mean, that's exactly the 1612 00:55:26,460 --> 00:55:29,219 argument about discrimination, 1613 00:55:29,220 --> 00:55:31,439 because if 1614 00:55:31,440 --> 00:55:33,989 you don't have the information about 1615 00:55:33,990 --> 00:55:36,209 how many people of a given 1616 00:55:36,210 --> 00:55:38,399 class, of a given protected status, 1617 00:55:38,400 --> 00:55:40,259 applied, for example, for a given job, 1618 00:55:40,260 --> 00:55:42,179 you can't figure out if there is any 1619 00:55:42,180 --> 00:55:43,349 discrimination in the process. 1620 00:55:43,350 --> 00:55:45,419 And so that means that you 1621 00:55:45,420 --> 00:55:47,009 have to somehow get that information into 1622 00:55:47,010 --> 00:55:48,929 the system in order to make an audit and 1623 00:55:48,930 --> 00:55:51,359 actually see if there's some unfair 1624 00:55:51,360 --> 00:55:52,469 bias in there. 1625 00:55:52,470 --> 00:55:53,879 And the other question, if I 1626 00:55:53,880 --> 00:55:55,769 understood it correctly, is whether 1627 00:55:55,770 --> 00:55:56,770 you 1628 00:55:58,470 --> 00:56:00,539 can infer information 1629 00:56:00,540 --> 00:56:03,149 about the gender from other things, 1630 00:56:03,150 --> 00:56:04,469 and this is certainly the 1631 00:56:04,470 --> 00:56:07,109 case, because as I said in the talk, 1632 00:56:07,110 --> 00:56:09,089 many things, like, for example, the 1633 00:56:09,090 --> 00:56:10,289 neighborhood that you live in, as you 1634 00:56:10,290 --> 00:56:12,509 said, give away information about the 1635 00:56:12,510 --> 00:56:13,799 protected attributes as well. 1636 00:56:15,000 --> 00:56:17,039 All right. We have a few more questions. I 1637 00:56:17,040 --> 00:56:19,199 would ask you to keep it short, 1638 00:56:19,200 --> 00:56:21,299 please. Microphone number five 1639 00:56:21,300 --> 00:56:22,300 in the back. 1640 00:56:23,430 --> 00:56:23,789 An often 1641 00:56:23,790 --> 00:56:26,099 heard statement is that 1642 00:56:26,100 --> 00:56:27,599 the more data you actually collect, the 1643 00:56:27,600 --> 00:56:29,549 less you can actually do with it, because 1644 00:56:29,550 --> 00:56:30,659 it's just too much. 1645 00:56:30,660 --> 00:56:32,669 Is there any scenario where this 1646 00:56:32,670 --> 00:56:34,019 statement makes any sense? 1647 00:56:35,100 --> 00:56:36,269 Yeah, there definitely is. 1648 00:56:36,270 --> 00:56:38,549 I mean, giving an algorithm 1649 00:56:38,550 --> 00:56:40,349 more data to train with is not always a 1650 00:56:40,350 --> 00:56:41,339 good thing. 1651 00:56:41,340 --> 00:56:43,589 It's pretty easy to overtrain 1652 00:56:43,590 --> 00:56:45,899 algorithms, that is, 1653 00:56:45,900 --> 00:56:48,089 to make a model that is 1654 00:56:48,090 --> 00:56:50,099 perfectly fitting the data that you give 1655 00:56:50,100 --> 00:56:52,049 it, but that has very little predictive 1656 00:56:52,050 --> 00:56:53,739 power for new data that you see. 1657 00:56:53,740 --> 00:56:55,829 But in general, increasing 1658 00:56:55,830 --> 00:56:57,659 the number of data points 1659 00:56:58,680 --> 00:57:00,959 is always improving the quality 1660 00:57:00,960 --> 00:57:03,029 of the model, if the data that you have 1661 00:57:03,030 --> 00:57:05,129 is from the same model as 1662 00:57:05,130 --> 00:57:06,869 well.
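The overtraining point in this answer is easy to demonstrate: a model flexible enough to fit its training data almost perfectly can still generalise poorly. A small self-contained illustration with a toy data set of my own, not from the talk:

```python
# Sketch: overtraining. An unconstrained decision tree nails the training
# data but does worse on held-out data than a constrained one.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)    # noisy target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for depth in (None, 3):    # None = grow until pure (overfits); 3 = constrained
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train R^2={tree.score(X_tr, y_tr):.2f}, "
          f"test R^2={tree.score(X_te, y_te):.2f}")
```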
So it could also happen that the 1663 00:57:06,870 --> 00:57:08,639 data that you have is not homogeneous, so 1664 00:57:08,640 --> 00:57:10,829 that one part of the data 1665 00:57:10,830 --> 00:57:12,389 is fitting well with one model, but the 1666 00:57:12,390 --> 00:57:13,679 other part of the data is fitting well 1667 00:57:13,680 --> 00:57:15,419 with another one. So in that case, it 1668 00:57:15,420 --> 00:57:17,249 might be difficult training a large 1669 00:57:17,250 --> 00:57:19,709 amount of data on a single model, but 1670 00:57:19,710 --> 00:57:21,359 it depends on the individual case, I 1671 00:57:21,360 --> 00:57:23,279 would say. So it's really not easy to 1672 00:57:23,280 --> 00:57:24,280 answer in that sense. 1673 00:57:25,140 --> 00:57:27,209 Thank you. We have time for two more 1674 00:57:27,210 --> 00:57:28,169 short questions. 1675 00:57:28,170 --> 00:57:30,359 I would ask one question from 1676 00:57:30,360 --> 00:57:31,409 the again. 1677 00:57:31,410 --> 00:57:33,869 Yes, Anayansi Lucas' asking, 1678 00:57:33,870 --> 00:57:36,149 isn't the black box nature of the machine 1679 00:57:36,150 --> 00:57:38,099 learning algorithms one of the biggest 1680 00:57:38,100 --> 00:57:40,199 problems can be 1681 00:57:40,200 --> 00:57:42,119 solved by better visualization or 1682 00:57:42,120 --> 00:57:43,829 understanding and what it really is 1683 00:57:43,830 --> 00:57:44,830 doing? 1684 00:57:45,530 --> 00:57:47,959 Yeah, for me, having algorithms that 1685 00:57:47,960 --> 00:57:49,969 are not open to scrutiny and that we 1686 00:57:49,970 --> 00:57:51,739 can't understand is one of the biggest 1687 00:57:51,740 --> 00:57:54,649 problems, of course. And, uh, um, 1688 00:57:54,650 --> 00:57:56,809 visualizing data can help, of course. 1689 00:57:56,810 --> 00:57:59,209 But as I said briefly in the talk, 1690 00:57:59,210 --> 00:58:01,459 since the space of possible parameters 1691 00:58:01,460 --> 00:58:02,929 in the space space of possible data 1692 00:58:02,930 --> 00:58:05,539 points is so enormous, 1693 00:58:05,540 --> 00:58:07,939 even for very small and 1694 00:58:07,940 --> 00:58:09,589 learning problems, that it's really 1695 00:58:09,590 --> 00:58:11,329 difficult to produce a given 1696 00:58:11,330 --> 00:58:13,099 visualization that would you that would 1697 00:58:13,100 --> 00:58:14,900 give you a high confidence and 1698 00:58:15,950 --> 00:58:17,419 a good information about, for example, 1699 00:58:17,420 --> 00:58:18,979 discrimination in the data set. 1700 00:58:18,980 --> 00:58:20,119 So it can certainly help. 1701 00:58:20,120 --> 00:58:22,189 But, uh, I think it's not a 1702 00:58:22,190 --> 00:58:23,190 perfect answer either. 1703 00:58:24,440 --> 00:58:26,719 OK, we have time for one more question. 1704 00:58:26,720 --> 00:58:29,269 Microphone number one, please. 1705 00:58:29,270 --> 00:58:30,349 Thank you. 1706 00:58:30,350 --> 00:58:32,449 In the beginning, you displayed the 1707 00:58:32,450 --> 00:58:35,029 green, yellow and red, 1708 00:58:35,030 --> 00:58:37,039 somehow agreed to give me more damaging. 1709 00:58:37,040 --> 00:58:39,169 The example you made about the green was 1710 00:58:39,170 --> 00:58:41,419 about some kind of algorithm that gave 1711 00:58:41,420 --> 00:58:42,329 to you information. 1712 00:58:42,330 --> 00:58:44,119 Don't you think that at the time of 1713 00:58:44,120 --> 00:58:46,189 exposure influence, 1714 00:58:46,190 --> 00:58:47,539 how much is damaging? 
1715 00:58:47,540 --> 00:58:49,999 Because if I get to influence the to here 1716 00:58:50,000 --> 00:58:51,760 is worse than just two days. 1717 00:58:52,900 --> 00:58:54,979 Mm hmm. Can you say that again for the 1718 00:58:54,980 --> 00:58:57,109 time of exposure, with time of 1719 00:58:57,110 --> 00:58:59,509 exposure to an algorithm to influence 1720 00:58:59,510 --> 00:59:01,789 your behavior has to be considered as 1721 00:59:01,790 --> 00:59:03,949 a factor to understand 1722 00:59:03,950 --> 00:59:04,369 if it's true. 1723 00:59:04,370 --> 00:59:05,869 Yeah, that's a very important point. 1724 00:59:05,870 --> 00:59:08,059 I mean, I also had like an experiment and 1725 00:59:08,060 --> 00:59:09,829 where I look at the interaction of the 1726 00:59:09,830 --> 00:59:12,799 algorithms with a person that he is like, 1727 00:59:12,800 --> 00:59:14,509 for example, showing articles, too. 1728 00:59:14,510 --> 00:59:16,579 And, um, this is like a topic of itself, 1729 00:59:16,580 --> 00:59:18,469 I would say. So there's definitely very 1730 00:59:18,470 --> 00:59:20,299 rich interaction. The results are not 1731 00:59:20,300 --> 00:59:21,539 captured by most models. 1732 00:59:21,540 --> 00:59:23,059 So like the algorithm influencing the 1733 00:59:23,060 --> 00:59:24,889 behavior of the person and then that 1734 00:59:24,890 --> 00:59:26,449 again influencing the actions of the 1735 00:59:26,450 --> 00:59:28,789 person and influencing the machine 1736 00:59:28,790 --> 00:59:30,619 learning of the further data. 1737 00:59:30,620 --> 00:59:32,059 So there's definitely some feedback in 1738 00:59:32,060 --> 00:59:33,419 the system. Absolutely. 1739 00:59:33,420 --> 00:59:34,420 Yeah. 1740 00:59:34,770 --> 00:59:37,069 OK, that's all the time we have. 1741 00:59:37,070 --> 00:59:39,050 Thanks again for the great talk.