Introducing the next talk, just getting started: looking at the stories behind the numbers. Mr. Stefan Wehrmeyer will talk to us about computing numbers, with an application to the problems of our society. Please give him a warm applause.

Yeah, thank you. Obviously, the title is a reference to a Turing paper. How many of you got it? Thumbs up.

This talk will basically combine computer science and journalism. I'm currently a data journalist. I joined a newsroom about one and a half years ago. It's called Correctiv; it's a nonprofit newsroom based in Berlin. We do long-term investigations, we are member-based and currently foundation-funded, and we are doing investigative journalism.

Investigative journalism: what is that? One good example from popular culture would be this one. This is the movie Spotlight, which just came out in the U.S. and will come out soon in Germany. It's the story of child abuse by Catholic priests in the Boston metropolitan area, and this is the team that basically uncovered it. Or rather, it's not: these are the actors who play the investigative journalists. The whole film is actually quite a good representation of how investigative journalists work. It depicts a story from about 2001, I think. It's slightly overdramatized, of course, because it's a Hollywood film, but it depicts investigative journalists quite accurately.
Of course, it also depicts the gender balance quite accurately, as you can see here, but it is getting better. In Germany, for example, the leading data teams at Spiegel and at SRF in Switzerland are led by women, and the organization that represents investigative journalists in Germany has mostly women on its board. Still, many investigative journalists, too many, are men; but women are getting into the field as well.

And the Spotlight team: what they did is they got a tip, and then they looked at data. They collected data about priests who were moving between the different parishes, the different districts in the metropolitan area of Boston. Every time there was an abuse scandal, a priest got a sick leave or something similar and got moved to a different district, to cover up the scandal and to make it appear to be just a single case. The truth, as they discovered, was that many more cases were present, and they basically uncovered that it was a systemic problem. This is actually one of the core pieces of investigative journalism: you don't just show that a single thing is wrong, that this man did something bad, but that the whole system is set up in a way that many people do many bad things.

So they used books that recorded where each priest was stationed in which year, like a book for every year. They went through them, typed the data into a computer, and actually had a nice spreadsheet in the end which displayed where priests were moving. So investigative journalists and computers are a perfect match.
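To give a rough idea of what such a spreadsheet enables, here is a minimal sketch in Python with pandas, using an invented table of assignments (one row per priest per year; all names and numbers are made up for illustration):

```python
import pandas as pd

# Invented assignment data: one row per priest per year.
assignments = pd.DataFrame({
    "priest": ["A", "A", "A", "B", "B", "B"],
    "year":   [1990, 1991, 1992, 1990, 1991, 1992],
    "parish": ["St. Mary", "St. Paul", "St. John",
               "St. Mary", "St. Mary", "St. Mary"],
})

# Count how often each priest changed parish between consecutive years.
assignments = assignments.sort_values(["priest", "year"])
moves = (
    assignments.groupby("priest")["parish"]
    .apply(lambda s: (s != s.shift()).sum() - 1)  # first row is not a move
    .rename("moves")
)

# Priests with unusually many moves are leads for reporters, not proof.
print(moves.sort_values(ascending=False))
```

The point is not the code itself, but that once the books are typed in, a question like "who was moved unusually often?" becomes a one-liner instead of weeks of page-turning.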
Of course, computers are used in many other areas of journalism. Every major newspaper also has a website, of course, and there is now robot journalism coming up, where sports events are covered by computer programs, not by humans anymore. But what I'm specifically talking about is investigative journalism with computers, a field that started as precision journalism and was also called database journalism and computer-assisted reporting; data-driven journalism is the current term, and there is also computational journalism. All these terms basically mean that you use a computer to do an investigative story. Philip Meyer, one of the first investigative journalists who used a computer, said a journalist has to be a database manager. We can't quite compare a database with a journalist yet, but it's getting closer. A journalist has to have facts, and of course there are too many facts to keep just in your mind, so you have to put them in the computer. Now I will present a couple of fields in computer science that investigative journalists use to make their stories happen.

One of the big ones is, of course, natural language processing. You know the Snowden leaks, or you might remember the Offshore Leaks and a couple of leaks that followed. When you have a big leak of data, or you get a big leak of documents, or you get documents via a Freedom of Information request, these are thousands, maybe hundreds of thousands, maybe even more documents, either in paper form or as PDF. What do you do with them? You can't possibly read them all. Current newsrooms don't have enough staff.
They don't have enough time to spend on these investigations, so they have to use computers to make this job a bit easier. Natural language processing is perfect for that. You put all the documents you have into the computer, you possibly have to OCR them, and then a couple of things might work in your favor. There is entity extraction, of course, which finds out which entities these documents contain. It's not only Mr. Obama, but also President Obama and Barack Obama; you can extract these entities and know which documents talk about which entities. You can extract, for example in an email dump, who is talking with whom, or also company names. This is easily done with entity extraction techniques. Then there is deduplication, and topic modeling: you learn that these documents talk about these topics, so you don't have to read all of them. If you want to focus your story on a specific topic, you go down that path and only look at the documents that were automatically categorized under it. And part-of-speech tagging is often also quite useful: when you look at documents, for example parliamentary debates, and you want to find out who is talking about what, and in what kind of way, you can find that out with part-of-speech tagging.
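As a minimal sketch of what entity extraction looks like in practice, here is the spaCy library with its small English model; the two documents are invented for illustration:

```python
from collections import Counter

import spacy

# Small English pipeline; install with:
#   pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

documents = [
    "President Obama met executives of Acme Corp. in Washington.",
    "Barack Obama later commented on the Acme Corp. deal.",
]

# Count which people and organizations appear across the document stack.
entities = Counter()
for text in documents:
    for ent in nlp(text).ents:
        if ent.label_ in ("PERSON", "ORG"):
            entities[(ent.text, ent.label_)] += 1

print(entities.most_common())
```

A real pipeline would then merge variants like "Obama", "Barack Obama", and "President Obama" into one entity, which is exactly the deduplication step mentioned above.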
And of course there is basic search. Search is always quite useful, and there are many advanced ways to search; journalists have to use them to make sense of these big document stacks. Full-text search has been a part of computer science since the 70s and 80s, and nowadays there are Solr or Elasticsearch or other search engines that do that quite easily. But these are made for computer programmers, right? We use them as developers, set them up, configure them, and build our own back end and front end on top, so that other people can actually use the search behind it. Journalists want to use a couple more features there, and we have a couple of applications that help us. There is DocumentCloud, a service where you can upload lots of documents; they are automatically made searchable, entities are extracted, and you can also publish them for your readers to look at. There is Overview, which does topic modeling so you can dive into your documents more easily. There is the project Blacklight, which is basically a Solr front end that I can give to my journalist colleagues, so they can use a Solr search in an easy way. And there is, of course, Google Refine, which is usually used for tabular data, but it also has a very good reconciliation back end and clustering, with which you can do deduplication. So if you have a list of company names and they are very dirty, you can reconcile them, basically deduplicate them, and make all the company names match again.
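To illustrate the clustering idea behind that kind of deduplication, here is a minimal sketch of the fingerprint method that Refine popularized (lowercase, strip punctuation, sort and deduplicate the tokens); the company names are invented, and real data needs more care:

```python
import re
from collections import defaultdict

def fingerprint(name: str) -> str:
    """Normalize a name roughly the way Refine's fingerprint keyer does."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(sorted(set(tokens)))

companies = [
    "Acme Corp.", "ACME Corp", "acme corp",
    "Globex Ltd.", "Globex, Ltd",
]

# Names that share a fingerprint are probably the same company.
clusters = defaultdict(list)
for name in companies:
    clusters[fingerprint(name)].append(name)

for key, variants in clusters.items():
    print(key, "->", variants)
```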
There is also proprietary software, namely Nuix and IBM Watson Analytics. These are very expensive; most journalists have never seen them and probably can't use them, because it's very difficult to get your hands on them. That is quite sad, because it means journalists have to rely on the open source tools, and only a few of those are actually made for investigations. As to the computer science part: I'm mostly talking about English language models here. It's very difficult to find good German models that are already integrated in some kind of software, so that you can use them in the German-speaking world, and I hope that changes soon.

Then there's machine learning, another big field in computer science, which in statistical analysis is mostly used for classification tasks: finding out what belongs to which category. And of course there's a bit of neural net deep learning coming up now; you can see there has been some deep dreaming in this picture. But I haven't seen any journalistic piece that has used that yet. I worked a bit on something that uses neural nets to create captions for some databases, to scrape them better, but that is still in the making.

One story that actually used natural language processing and machine learning analyzed police reports of the Los Angeles Police Department, which had misclassified over 25,000 crimes. When a police officer comes to a crime scene, he writes down a report; it later gets put into a database and classified by a clerk, who, based on the description, classifies the crime that happened as minor, serious, or another category. The Los Angeles Times wrote a machine learning classifier that looked at the description of the crime and the proper classification, trained it on a training data set, and then ran it over the whole data set to check whether all the other crimes were properly classified. Apparently over twenty-five thousand of them were misclassified.
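A minimal sketch of that kind of classifier, assuming a hypothetical set of labeled report descriptions (the actual story used a support vector machine; this sketch uses scikit-learn's LinearSVC on TF-IDF features):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training data: report descriptions with known categories.
descriptions = [
    "victim struck with fist during argument",
    "suspect threatened victim with a knife",
    "shoplifting of items under 50 dollars",
    "vehicle window smashed, radio stolen",
]
labels = ["serious", "serious", "minor", "minor"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(descriptions, labels)

# Flag reports where the model disagrees with the official label;
# these are leads for human review, not findings by themselves.
official = {"suspect stabbed victim in parking lot": "minor"}
for text, label in official.items():
    predicted = model.predict([text])[0]
    if predicted != label:
        print(f"possible misclassification: {text!r} "
              f"(official: {label}, model: {predicted})")
```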
And of course, you can't go through all these records and classify them by hand; a machine can do that much more easily. It has also been confirmed by the police department that this misclassification was going on. The result is that the crime statistics come out much lower and less serious, with more minor crimes and fewer serious crimes; you can basically cover that up through misclassification, and the Los Angeles Times could uncover it through machine learning.

Then there's the big field of social network analysis, a favorite topic of Mr. Lindback in the second row. Social network analysis is basically the bread and butter of every journalist's work: we are collecting information about certain entities and we are trying to find the connections. You can put that into a network graph, like this one. And the problem with that is that the result is mostly not journalism; it's just a research database. You collected some facts, and then you can display them like that. But it's also very subjective data collection, because you only cover the connections you think are important, and you possibly don't see any others. It's more like a knowledge management tool where you can collect everything you know, to better collaborate with your fellow journalists. The result, though, might not be journalism. You can't say: OK, I've got this big graph, now I compute some eigenvector centrality measure, and then I've found the bad guy. It doesn't work like that; you can't compute the bad guy out of such a graph. What you need to do is proper journalism on top.
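The mechanical part is easy enough; it's the interpretation that needs reporting. A minimal sketch with the networkx library, on an invented graph of documented connections:

```python
import networkx as nx

# Invented knowledge graph: edges are documented connections
# (meetings, payments, board seats) between entities.
G = nx.Graph()
G.add_edges_from([
    ("Mayor", "Contractor A"),
    ("Mayor", "Contractor B"),
    ("Contractor A", "Shell Co"),
    ("Contractor B", "Shell Co"),
    ("Shell Co", "Offshore Trust"),
])

# Centrality only ranks who is well connected in the data you chose
# to collect; it surfaces leads to investigate, not culprits.
centrality = nx.eigenvector_centrality(G)
for node, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{score:.2f}  {node}")
```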
So you have a knowledge graph, you can look at it, and then you can interview people; you can find out more through proper, old-school investigative work, let's say. What you see in the background here is LobbyRadar, which has now been shut down; it used to be run by ZDF. But this is more like a piece of art than something that gives you actual insight. It's difficult to make social networks visible and understandable, let's say.

Then there's the brand new field of algorithmic accountability. We heard, for example, the talk about the VW Dieselgate scandal; that is a topic of algorithmic accountability. More and more algorithms are put into every device we know, making decisions that affect all of our lives. Now we have some hackers who do some reverse engineering, and that is great, and they present it at Congress. But of course this is basically journalistic work, and we need to bring these techniques into the newsrooms. The newsrooms, the investigative journalists, need to understand how this stuff works and how to reverse engineer it. Nick Diakopoulos, a researcher in Washington, I think, did a lot of work on that. One example was the preplanned stock trading plans of executives: you can analyze how a plan works and whether insider trading is behind it. Or, for example, how does the iPhone autocorrect work? You can observe the output, you can observe the input, and ask what is happening inside.
Another example would be: how are prices displayed on retail sites for different geographical areas? That is not an easy task, but it's becoming more and more important, especially when there's not much transparency around how these things work.

So journalism gets closer to science, let's say. Investigations in journalism use the investigative method, and like in science you have a hypothesis. You come up with something like: these kids are underprivileged because of corruption going on in the school system. And then you have to prove that hypothesis. So it's very similar to science, and science is also moving now towards a more reproducible and transparent way of working. The story I told you earlier about the LAPD: this is the code that was used for that story. You have a machine learning classifier, a support vector machine, and you can basically run the code yourself to train the classifier and then classify some of these reports. They only published a tiny training data set and only parts of the data, but they made their methodology transparent. And this is also where science is going: many research papers nowadays are not reproducible, but they should be. This is a Jupyter notebook; you can mix prose and code, execute it, look at the result, and anyone else can reproduce your work. This is Python, but R too is one of the favorite languages of investigative journalists in the data area.
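As a small sketch of that reproducible style, this is the kind of cell you would publish in such a notebook; the file name and columns are invented for illustration:

```python
import pandas as pd

# Load the published data set (hypothetical file with columns
# "year" and "category"), so readers can re-run every step.
reports = pd.read_csv("crime_reports.csv")

# Reproduce the headline number: the share of reports filed as
# "minor", per year, straight from the raw data.
share_minor = (
    reports.groupby("year")["category"]
    .apply(lambda s: (s == "minor").mean())
)
print(share_minor)
```

Publishing the notebook next to the article lets readers challenge the method, not just the conclusion.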
Then one big thing I discovered is that software engineering in the newsroom is not that easy. First of all, of course, there's IT support, and there's the problem of the CMS; the content management system is always a problem. As a software engineer you basically always fight the CMS, and in big organizations like The New York Times they basically create their own hacks just so that they can put their beautiful graphics into the rest of the CMS; there are big hacks going on there. But this is not what I want to talk about. Software engineering in the newsroom is basically also building tools for your fellow journalists, and that practice doesn't have its roots in the newsroom, which is why it's a bit difficult at the moment. Right now, a journalist writes an article, it's published, and then you can forget about it; you never touch it again. So there's no technical debt in articles, unlike in code. Sometimes code in newsrooms is also written for a single story: you write code for that story, you publish it, and then you forget about it. But as engineers we learn that this is not how to do things. We don't want to write the same code again for the next story; we want reusability. We want to fix a bug only once, not a million times across all of our articles. That means we need to clean up a bit and develop some kind of method for writing software in the newsroom. Currently it's quite a hack, as I perceive it.

And then there are computer science papers. I love to read them.
They have very interesting ideas, but mostly they don't come with code, and when they do come with code, it's not running code; it's difficult to actually make it run. And even when I have actually compiled that C library to make some machine learning a bit faster, it's still not usable software: I can't give what I compiled to a colleague to actually use. This is definitely something I care about, so I hope that when you publish something in computer science, you give me something that I can use, so I can bring it to my newsroom and make their lives easier.

Also collaboration, which is something that is basically innate to open source software, is a bit more difficult in newsrooms; there's always competition going on. Investigative journalists especially are used to being perceived as lone wolves. If you are onto a story and someone else has heard of it, you'd better publish soon, because the other guy might scoop you on it; then your story is burned, you can't publish it anymore, and all the work you did for it was in vain. In open source software, on the other hand, it's great if many people collaborate on a piece of software; the more people behind it, the better. So we need to bring this idea of collaboration into the newsroom, and this is also still a problem; it is not quite there yet. There are some collaborations now, between The New York Times and The Washington Post, for example, or between the ProPublica news organization and another bigger publication in the U.S., I think.
And at Correctiv we also collaborate with many other news organizations, so that they publish our stories together with us. We hope that this idea of collaboration, which as I perceive it is basically a software idea, also makes its way into the publishing of news stories.

Another big problem is the hammer and nail problem: we have some software, and we might as well use it; if the story would need some other software, well, I can only use what I have. That definitely exists in the newsroom. Have you ever seen a map in some news article with lots of points on it? That's because the journalist who did the story had this mapping tool where they could put in a bunch of data and it put it on a map, and even though it might not make any sense, it got into the story; they just used the tool that they have. Or, for example, a timeline: there are easy tools to make a timeline, and then you have a timeline, even though it might not be the best way to present your story. It's just the tool that is there, and developing another tool might not fit the deadline or your resources.

So I basically say: we need more applications for our society, and many advances in computer science are quite slow to benefit the public at large. If there is a big jump in, let's say, machine learning, Google knows it first, because they do the research and they develop the applications. And other big companies like Palantir, the NSA, or Internet companies basically use the latest research to do better user tracking or better targeting.
So they benefit more quickly from these developments, because they do their own research and because they have more resources, and much cutting-edge research comes out of these corporations. Google, for example, recently released TensorFlow from Google Brain. It's a machine learning library; there are other machine learning libraries, but this is one that is very usable, and it has the advantage of being better supported, better documented, and easier to use. But it might not exactly fit the journalistic use case.

So journalism needs more resources to develop its own tools. The tools I mentioned, like DocumentCloud and Overview, are quite good; they are targeted at journalists, developed by journalists, and they fit the use case quite well. But it took six-figure amounts, as I recall, to develop them over the years, and it was very difficult to get the use case right. And Google Refine, for example, an invaluable tool for many journalists to work with tabular data and clean it, is really used a lot. But it was developed by Google and then open sourced, and that basically means it hasn't seen a release in two years. It is kind of bad that we don't have the resources to work on the tools that we use in the journalistic trade every day.

So my call to you is: support journalism as a service to the public, and help journalists develop their tools. What we have here with journalism is basically a public good; we try to be in the service of the public. So, for example, join a newsroom if you can.
It's really fun work. I joined the newsroom simply because I think it's basically the best political activism I can do, with the most impact, and it's not only focused on technical topics. Here we hear a lot about data retention and, I don't know, other data topics; but when you work in a newsroom you get a very broad range of topics from all over society, and you can still help with data literacy.

And another hint: if you want to get in touch with journalists, there's a thing called Hacks/Hackers, which is a meetup that exists in every big city. In Germany, I think, it's only in Berlin and Hamburg, but I think there's data journalism in Australia as well, and if you're from any place else, like New York or London, they all have Hacks/Hackers. The journalists, the hacks, and the hackers come together at these meetups and talk about technology and journalism. So if you want to get an idea of what's going on in that world, join a Hacks/Hackers meetup and, I don't know, improve journalism by contributing your ideas. Thank you.

So I think we have time for questions. We have a question from the Internet, please.

Yeah, the Internet is asking many things, and actually the most important question is: is your data mining software available as free software? And please mention some of the names of the tools you have used.

My data mining software... I didn't write the one data mining software; I always write data mining software, basically, and that is a problem: for every story you write a script that does it. That has advantages, because you can customize it.
It has disadvantages, because you have to write the software, and it's not quick and easy. For example, we as an organization publish all our work on GitHub, at github.com/correctiv, and you can have a look at the software that is there. Mostly it's just front end stuff, but we will also publish more back end pieces, data analysis pieces, in the future. And many news organizations have repositories on GitHub that explain how they do their stuff, and you can find their software there. The other question was about tools. I mentioned TensorFlow as a machine learning library, but the many, many tools for journalists are mostly libraries rather than tools. I'm using pandas for Python, but there are also very many R packages that you can use for data analysis. The problem is that nerds are a minority in the newsroom, and that means that if you want your journalists to use these tools and these techniques, you have to write tools to make them usable for, like, normal people.

OK, thank you. Not that nerds are not normal people, but yeah. We have another question, please.

Thank you very much. You've been talking a lot about language processing tools and machine learning tools, and all of those are of course known to fail, to produce errors, to, how do you call it, misclassify.

That's right, yeah.

And even if they classify correctly, it's not always easy to see what the classification actually means. You alluded to that shortly when talking about graphs, saying you don't just look for the central person in the graph and call that the bad guy. So how do you deal with that?
With the risk of misclassification, and also with the illusion that the data could provide you some insight that actually is not in the data and only apparently is there? There's cross-checking, the normal cross-checking you do with data: check your data before you put it in there. Quartz recently published a long list of how you interview your data, to make sure it's up to a certain standard, or that you are at least aware of its flaws; many times the input data is already flawed in many ways. And then your methodology: of course, double check it, and talk to experts who know more about this field than you do. By publishing the methodology you basically make yourself vulnerable, but also transparent, so if there's something bad going on, your readers or any other interested party can basically run what you did and then tell you what you did wrong. And none of these machine learning things lifts the journalistic duty of verification: as a journalist, you still have to validate your findings through a second means, or at least do a check on a bigger sample. The result is not coming out of the computer; the result is coming out of the human mind. It is the result of your research, not simply whatever the machine outputs.

Thank you so much. I think we are done now with the minutes we have for questions. Thank you so much. Thank you, everyone.