And now it is my particular pleasure to announce Professor Rachel Greenstadt, Aylin Caliskan-Islam, and Rebekah Overdorf from the Privacy, Security and Automation Lab at Drexel University. They are quite old hands at speaking at the Congress already. So, please give them a warm round of applause. And there we go.

It would be absolutely wonderful if we could get the presentation up there. Thank you. Yes, that's much better.

Hello, I'm Rachel Greenstadt, a professor at Drexel University, where I lead the Privacy, Security and Automation Lab. This is joint work with my students, Aylin Caliskan-Islam and Becca Overdorf, who will be speaking later.

So we're going to talk today about authorship attribution in source code and in social media.

First, I'm going to talk about stylometry, which is how we usually do authorship attribution in my lab. The theory behind it is that everybody's writing style, and speaking style indeed, is unique, because we all learn language on an individual basis. And each of us, even though we might speak the same language, speaks sort of our own individual dialect of it. For example, in English there are regional differences, where some people may say that one piece of furniture is a couch, whereas other people might say it's a sofa. Furthermore, there are words that have similar meanings but are actually different words, like "although" and "though", and which ones people prefer is a sort of stylistic idiosyncrasy in writing. People may also use the same word with different spellings.
And there are also just many ways to express very similar ideas. Someone might say "the fork is to the left of the plate" versus "the fork is at the plate's left". These differences are how we can often distinguish authors in writing and documents. That's a lot of the work that we do in my lab, the Privacy, Security and Automation Lab. It's a research lab at Drexel University with about ten students, a mixture of graduate and undergraduate students. In general, we study how to have machines help humans make decisions about security, privacy, and trust, often using machine learning and natural language processing techniques.

In particular, we're very interested in what we can learn when we analyze unstructured, or in some ways structured, human textual communication. And this is what we've spoken about at the CCC in the past. Mike Brennan spoke at 26C3 on privacy and stylometry and how authorship recognition techniques can be attacked and deceived, and again at 28C3 with Sadia Afroz. Aylin and Sadia spoke two years ago on applying stylometry to online underground markets. This year we're going to talk about source code and social media stylometry. People always ask us: what about source code, what about tweets, stuff like that. So we're going to answer some of those questions in this talk. In general, in the lab we also do a lot of other work, like social network analysis of online communities, textual analysis, and studying secure machine learning.

Since we're a privacy lab, what is the connection between privacy and stylometry?
Well, there are very good techniques for location privacy that the privacy enhancing technologies community has worked on. You're probably pretty familiar with Tor, here on my T-shirt, and mixes and other types of techniques. They can hide your IP address from people on the Internet. But in some cases, when you're expressing yourself in text online, that might be insufficient. And that's where my research comes in.

Stylometry can be used to identify authors based on their writing. This is important because it's a potential threat to people who are exposing crime and corruption or doing political organizing, especially if they're speaking in first-hand testimonial accounts. And it's also just important for normal people who want to express their opinions or write code and share it online, without necessarily having the thing they wrote follow them forever through their life like a dossier.

So let's go back to stylometry, and let me give a short tutorial on how it works. The methods that are used today use machine learning. Say you have two authors, Cormac McCarthy and Ernest Hemingway. They're both authors with somewhat distinct styles. McCarthy might write: "What's the bravest thing you ever did? He spat in the road a bloody phlegm. Getting up this morning, he said." And then there's Ernest Hemingway: "He no longer dreamed of storms, nor of women, nor of great occurrences, nor of great fish, nor fights, nor contests of strength, nor of his wife." So the question is, how can you tell the difference between these people? We can't just feed the text in straight; we have to extract features from it.
What types of features do we use? An example might be the frequency of function words. Function words are the small little words that don't necessarily mean anything on their own. We also look at, say, the frequency of punctuation, which tells us something about the structure. We use a lot more features in our work, which we'll talk about later, but we feed these into a machine learning model. In many cases we'll use a support vector machine; sometimes we'll use a random forest.

A good model generally needs about 4,500 to 7,500 words of training data and a thousand or more features. These may be many features of the same type, like word n-grams, for example.

Oh, I'm sorry, am I not... Is this better? I hope the system is working. Sorry about that. OK.

To actually use this, say we have an unknown document, which is our test document: "Just remember that the things you put into your head are there forever, he said. You might want to think about that." And we don't know whether this was written by Ernest Hemingway or Cormac McCarthy. So we extract features from this document. Now, this is a very short text snippet; for best results we need about 500 words. And we'd ask the model who wrote it, and it would tell us that it's Cormac McCarthy, which indeed it is.

In general, stylometry methods are pretty good, especially when you're dealing with sets of authors in the range of about a hundred authors as your world of suspects. Then the best methods work at above 90 percent accuracy.
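To make this concrete, here is a minimal sketch of that pipeline in Python with scikit-learn. The function-word list, the punctuation list, and the toy training quotes are illustrative placeholders, not the lab's actual feature set or data.

```python
# Minimal stylometry sketch: frequency features fed into a linear SVM.
import re
from sklearn.svm import LinearSVC

FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "nor", "he"]
PUNCTUATION = [",", ".", ";", "?", "'"]

def features(text):
    words = re.findall(r"[a-z']+", text.lower())
    total = max(len(words), 1)
    # Relative frequency of each function word...
    vec = [words.count(w) / total for w in FUNCTION_WORDS]
    # ...plus punctuation frequencies, normalized by text length.
    vec += [text.count(p) / max(len(text), 1) for p in PUNCTUATION]
    return vec

train_texts = [
    "What's the bravest thing you ever did? He spat in the road a bloody phlegm.",
    "He no longer dreamed of storms, nor of women, nor of great occurrences.",
]
train_labels = ["McCarthy", "Hemingway"]

clf = LinearSVC().fit([features(t) for t in train_texts], train_labels)

test = "Just remember that the things you put into your head are there forever."
print(clf.predict([features(test)]))  # the model's guess at the author
```

A real system would use thousands of features and far more training text per author, as described above; the structure of the pipeline is the same.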
Now, these methods can be scaled. People have done experiments: Koppel et al. with 10,000 authors, and Narayanan et al. with over 100,000 authors. You can see that even in these cases the results are much, much better than random chance, which does allow you to narrow the world of suspects quite a bit.

Another question that we asked previously in my lab was how strong these techniques are when people are actually trying to fool them. We found that people in general were able to reduce the accuracy of these techniques by writing in a specific way to try to hide their writing style or to imitate another author. We actually asked people to imitate Cormac McCarthy in this case.

Now, I wouldn't necessarily recommend just doing that. If you wanted to hide your writing, you'd probably want to verify in some way that you'd actually done it correctly. So we do have some tools in our lab. JStylo is an authorship analysis tool, and Anonymouth is an authorship anonymization tool, which is very much a work in progress. These are available on our GitHub page. We'd love to have your comments, help, thoughts, et cetera, on them.

And we looked at underground forums. This is an excerpt from the Carders forum, where people trade credit card information, and to do this work we had to extend our analysis tool to German, so JStylo now speaks German. These are the types of features we use: the frequency of n-grams, punctuation, special characters, function words, in this case German-specific function words, and parts of speech.
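A rough sketch of what such language-specific feature extraction can look like, assuming scikit-learn; the German function-word list is a tiny illustrative sample, not what JStylo actually ships with.

```python
# Sketch: character n-grams plus German function words as features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

german_function_words = ["der", "die", "das", "und", "aber", "doch", "nur"]

features = FeatureUnion([
    # All character 1- to 3-grams, including punctuation and spacing.
    ("char_ngrams", CountVectorizer(analyzer="char", ngram_range=(1, 3))),
    # Frequencies of a fixed vocabulary of German function words.
    ("function_words", CountVectorizer(vocabulary=german_function_words)),
])

posts = ["Ich habe die Karten, aber nur heute.", "Das ist doch ein Test."]
X = features.fit_transform(posts)
print(X.shape)  # one row per forum post, one column per feature
```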
So that's another case where you need language-specific feature sets.

The question that you might wonder about is whether this is purely an academic concern. Do people actually use stylometry in the real world to identify people who might not want to be identified? And the answer to this is yes.

In a rather sensational case, J.K. Rowling, who as we know is the author of the Harry Potter novels, wrote another book under the pseudonym Robert Galbraith. Juola & Associates is a stylometry firm, and they did some work, using tools that are part of our analysis engine, actually, at the request of a reporter who'd received an anonymous tip over Twitter. After their linguistic analysis, he felt confident enough to run with the story and indeed did expose J.K. Rowling as the author of this book.

And our doppelganger finder code, which we designed to give the probability that two accounts are the same person, is actually used by the FBI. We pointed them at our GitHub, and we don't know what exactly they use it for, but they did tell us that they found it useful.

And there are many expert witnesses who use this in forensic and legal proceedings throughout the world. I know the most about US law, where forensic linguistic evidence is covered under the Van Wyk opinion, which speaks to how it can be used and how it can be considered.

OK, so this is all the stuff that we've done, and it gives you some context, so hopefully you have an idea of what stylometry is and how it works.
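The doppelganger-finder idea mentioned a moment ago can be sketched roughly as follows: score how strongly one account's posts are attributed to a model of the other account, in both directions, and average. This is a simplified illustration of the idea, not the released tool; the feature choice and the combination rule here are assumptions.

```python
# Sketch of the doppelganger idea: if A's posts are attributed to B
# (and B's to A) with high probability, flag the pair as linked.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def attribution_prob(train_posts, label, background, test_posts):
    """Train 'label' vs 'other' and return mean P(label) over test_posts."""
    vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
    X = vec.fit_transform(train_posts + background)
    y = [label] * len(train_posts) + ["other"] * len(background)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    idx = list(clf.classes_).index(label)
    return clf.predict_proba(vec.transform(test_posts))[:, idx].mean()

def link_score(posts_a, posts_b, background):
    # Average both directions: A scored against B's model and vice versa.
    p_a_as_b = attribution_prob(posts_b, "B", background, posts_a)
    p_b_as_a = attribution_prob(posts_a, "A", background, posts_b)
    return (p_a_as_b + p_b_as_a) / 2  # higher = more likely one person
```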
But we're going to talk today about two particularly interesting and difficult cases.

The first one is: what if you have an unknown Twitter feed? Can you learn its author from blogs, or from comments on a news site, or from Reddit? Because you might not have a Twitter feed for that person. And the answer to this is yes. However, if you do have a Twitter feed for the suspect, then you should probably use that instead.

And then we always get this question: what about source code? Can you detect source code authorship from style? And the answer is that, yes, we can do that too. What's particularly neat about this is that even if you run the code through an off-the-shelf obfuscator, it still works. So I'm going to now turn the talk over to Aylin, who's going to talk about that work.

Hi, everyone. So now we'll be looking at source code stylometry, and here we are trying to find out who wrote a piece of anonymous code by looking at their coding style. There are two common scenarios that come to mind when source code authorship attribution comes up. The first one is: let's say that Alice's computer got infected, and she has a piece of source code left from the malware, and Bob has a collection of malware with known authors. So Bob can look at his collection of malware to identify who Alice's adversary was. The second scenario applies to plagiarism. Let's say that Alice got an extension on her programming assignment, and her professor, Bob, has everyone else's submissions. So Bob can look at everyone else's submissions and compare them with Alice's new submission to see if Alice plagiarized.
In these cases, we're talking about serious security-enhancing uses of source code authorship attribution. But unfortunately, sometimes security-enhancing technologies are actually privacy-infringing in other cases. For example, Saeed Malekpour: he's a web programmer who was sentenced to death because he was identified by the Iranian government as the programmer of a porn site. He was held in solitary confinement for one year without legal representation. His family says that he is also a permanent resident of Canada, and that he didn't know the porn site's developers were using his photo-uploading software. He also said that if he had known this was going to be used by a porn site, he would never have put his name on it, because it's illegal in Iran. After that, under pressure, he said that he regrets his actions, and now his death sentence has been cancelled.

When we look at source code authorship attribution, we can define it as a machine learning problem with four main experimental settings. The first one is software forensics. Here we have multiple authors, which corresponds to a multiclass learner, and it's in an open-world setting, which means that we don't know the suspect set. In the regular case of authorship attribution, which we can also call stylometric plagiarism detection, we have the multiclass case with multiple authors, and we do know the suspect set, so it's a closed-world machine learning problem. And we can also apply source code stylometry to a copyright investigation, where we have two parties in a dispute.
So it's a two-class problem, and a closed-world problem, because we know both of the sides in the dispute. And in authorship verification, we would like to answer: this person who claims to have written this piece of source code, did they really write it, or did someone else write it? This is kind of a two-class, one-class formulation, which we will look into in detail later, and it's an open-world problem, because the code was either written by the claimed person or it was written by someone we have no idea about.

And here is a summary table of our main results. You can see that with the 250-class, 250-author test, we get 95 percent accuracy in identifying them. This is a very high accuracy compared to previous work, and it indicates that we introduced a new, principled method with a robust syntactic feature set for performing source code stylometry, which has not been done before at this scale and in this way.

In order to understand coding style, we have to look at programming style features. First of all, we have a piece of source code, and we look at some lexical features, like variable names and the use of C++ keywords. Then we look at layout features, like the spaces and the tabs, and we extract those from the source code. After that, we preprocess the source code to obtain its abstract syntax tree, which reveals structural features; it's the grammar of the code. For that we use the fuzzy abstract syntax tree parser that was provided by our collaborator Fabian Yamaguchi, who presented yesterday. And since it's a fuzzy parser, it can even handle incomplete pieces of code.
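As a hedged sketch of the lexical and layout side of this, the snippet below pulls a few such statistics out of a C++ source string; these particular counts are illustrative stand-ins for the paper's much larger feature set.

```python
# Sketch: a few lexical and layout features from C++ source text.
import re

CPP_KEYWORDS = ["if", "else", "for", "while", "typedef", "struct", "template"]

def lexical_layout_features(code):
    length = max(len(code), 1)
    idents = re.findall(r"\b[A-Za-z_]\w*\b", code)
    feats = {}
    # Lexical: how often each C++ keyword appears, normalized by length.
    for k in CPP_KEYWORDS:
        feats["kw_" + k] = len(re.findall(r"\b%s\b" % k, code)) / length
    # Lexical: average identifier (e.g. variable name) length.
    feats["avg_ident_len"] = sum(map(len, idents)) / max(len(idents), 1)
    # Layout: whitespace habits relative to code size.
    feats["tabs"] = code.count("\t") / length
    feats["spaces"] = code.count(" ") / length
    feats["empty_lines"] = code.count("\n\n") / length
    return feats

print(lexical_layout_features("int main() {\n\tfor (int i = 0; i < n; i++) {}\n}"))
```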
And once we have the abstract syntax tree, we extract syntactic features, such as the node depths, the abstract syntax tree node types, or the node term frequency-inverse document frequency (TF-IDF).

And we saw a recurring subset of features coming up in many of our datasets, with hundreds of authors and thousands of program files. For example, we see that most of the features in this list are syntactic, and these features are the most important features because they have the highest information gain. The syntactic features are mostly the node depths in the abstract syntax tree and the node term frequencies, or TF-IDF. We also see some lexical features, like the C++ keyword typedef, and some layout features, such as the number of tabs that were used.

This slide illustrates our general method across many different experimental settings. In order to do experiments, first of all we need datasets, so we went ahead and scraped the submissions of contestants from Google Code Jam. Google Code Jam is an international annual programming competition, and since 2008 Google has been publishing the correct submissions online. So we went ahead and scraped all the correct C++ submissions from 2008 until 2014, and we ended up with a dataset with more than 100,000 users. Once we have the source code, we preprocess it with the fuzzy parser, and then we extract lexical, syntactic, and layout features. As a classifier, we use a random forest, to avoid overfitting, with three hundred trees, and these trees vote by majority for the final classification, depending on our task.

And I would like to give some statistics about our Google Code Jam dataset.
We saw that in the 2014 dataset, which we used as our main one because it was the largest one in C++, the average solution was 70 lines of code. In this programming contest, everyone is implementing the same problem, the same functionality, at the same time and in a limited time. And whenever we are performing a machine learning task, we always train on the same set of problems that people answered, and then when we are testing, we choose a problem that was not in the training set. That makes it a further, more difficult machine learning problem, because the test question was not seen in the training set before. And here, on the right pie chart, we see that C++ was the most common language, and that was also true for other years.

Now, I will go over some scenarios where we can apply source code authorship attribution, and I'll give examples as I'm talking about the scenarios. I would like to explain the first one, which is regular authorship attribution, by giving the Satoshi example. Everyone is trying to find out who Satoshi is, and we have Satoshi's source code as well, for example from the initial contributions or commits on the Bitcoin repository. We have his code, but we don't know who this anonymous programmer actually is. So we could train on data from a suspect set, and after that we could test on this initial Bitcoin code to see who Satoshi is. For this experimental setup, we took 250 authors, trained on their files, and we had 2,250 anonymous program files. When we trained and tested, we got 95 percent accuracy in correctly identifying these more than 2,000 files.
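An experiment of this shape might look like the following minimal sketch: featurize each submission, train a 300-tree random forest on some problems, and test on a held-out problem. The tiny feature function and the inline data are stand-ins for the real feature set and the scraped Code Jam corpus.

```python
# Sketch of the attribution experiment: train a 300-tree random forest
# on solutions to some problems, test on a held-out problem.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

def feats(code):
    # Tiny stand-in for the real lexical/layout/syntactic feature set.
    return {"tabs": code.count("\t"), "spaces": code.count(" "),
            "newlines": code.count("\n"), "semicolons": code.count(";")}

# Assumed layout: (author, problem, source code) per submission.
solutions = [
    ("alice", "p1", "for(int i=0;i<n;i++){s+=a[i];}"),
    ("alice", "p2", "for(int j=0;j<m;j++){t+=b[j];}"),
    ("bob",   "p1", "while (k < n)\n{\n\ts = s + a[k];\n\tk++;\n}"),
    ("bob",   "p2", "while (p < m)\n{\n\tt = t + b[p];\n\tp++;\n}"),
]

train = [s for s in solutions if s[1] != "p2"]  # held-out problem: p2
test = [s for s in solutions if s[1] == "p2"]

vec = DictVectorizer()
X_train = vec.fit_transform([feats(c) for _, _, c in train])
X_test = vec.transform([feats(c) for _, _, c in test])

clf = RandomForestClassifier(n_estimators=300)  # 300 trees, majority vote
clf.fit(X_train, [a for a, _, _ in train])
print(clf.score(X_test, [a for a, _, _ in test]))
```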
If we only had a suspect set for Satoshi that we could train on: if you had a suspect set for Satoshi, that would be the training part, and then we would use the initial Bitcoin code for testing, and we might be able to predict who the contributor Satoshi might be. Not that we are trying to do this; this is just an example.

In the second case, we will talk about obfuscation. There are several reasons people try to obfuscate their code to make it unrecognizable. First of all, you might have plagiarized, and you might be trying to hide that you copied someone else's work. Or you might have a piece of malware, and you might be trying to make it unrecognizable so that it won't be detected. Or, in other cases, you might just be trying to stay anonymous and hide your coding style. But we saw that our authorship attribution technique is not affected by common off-the-shelf commercial obfuscators. I'll give an example with the obfuscator that we used, which you can buy, I think, for about four hundred dollars. It's called Stunnix. We are not related to it; we just used it because it was the cheapest commercial one, and a widely used one. In this example, we will see how C++ code is obfuscated. We see that variable names are being hashed, and all the spaces and comments are being stripped. If there are any numbers, they are going to be replaced with a combination of hexadecimal, binary, and decimal numbers. And if there are any characters, they are going to be replaced with hexadecimal escapes. You can choose different settings for the hashing and the combinations.
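As a toy illustration of two of these transformations, identifier hashing and comment/whitespace stripping, here is a small sketch; it is not the actual Stunnix tool, just an approximation of the effect.

```python
# Toy obfuscator: hash identifiers, strip comments and extra whitespace.
# (A real commercial obfuscator does much more, e.g. rewriting literals.)
import hashlib
import re

CPP_KEYWORDS = {"int", "for", "if", "else", "return", "while"}

def toy_obfuscate(code):
    # Strip // line comments and /* block */ comments.
    code = re.sub(r"//[^\n]*|/\*.*?\*/", "", code, flags=re.S)
    # Replace identifiers (except keywords) with hashed names.
    def hash_ident(m):
        name = m.group(0)
        if name in CPP_KEYWORDS:
            return name
        return "z" + hashlib.md5(name.encode()).hexdigest()[:8]
    code = re.sub(r"\b[A-Za-z_]\w*\b", hash_ident, code)
    # Collapse all whitespace runs into single spaces.
    return re.sub(r"\s+", " ", code).strip()

print(toy_obfuscate(
    "int total = 0; // running sum\nfor (int i = 0; i < n; i++) total += i;"))
```

Note that none of this touches the program's structure, which is exactly why AST-based features survive it.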
And we see that everything is refactored, but the functionality and the structure of the program remain the same. And as long as the structure is the same, our features are not affected by this obfuscation. As a result, we saw that when we tried authorship attribution on obfuscated code versus original code with 25 authors, we got about 97 percent accuracy on both of them. So our method is impervious to such common off-the-shelf obfuscators. But this is only for this kind of obfuscator, which does not change structure or functionality.

Another case is copyright investigation. I would like to give a copyleft example here. Copyleft software is free, but it still has a license. You can modify it, and you can use it, but you have to make sure that you still include the copyleft license that it came with. And in this example, we would like to detect that a programmer took copyleft code and then made it copyrighted.

There was a very famous case in Northern California a few years ago: Jacobsen versus Katzer. Jacobsen had a Java model railroad interface, JMRI, and put an Artistic License on it, and the Artistic License is less restrictive than a copyleft license. After that, Katzer, who is also interested in railroad models and works as a software developer in that business, took this code, put a copyright on it, and started distributing it commercially. And he also filed a patent using Jacobsen's code. After that, this went to court.
And some people claimed that, since this is just an Artistic License, he can do whatever he wants with it, because it's free code. But that was not the case. Even if it's an Artistic License, you still have to make sure that when you modify it, it still carries the Artistic License, and everyone else can use it the same way the first person intended it to be used.

This can be set up as a two-class machine learning problem. In the first class, we have the copyleft code from Jacobsen, and in the second class, we have the copyrighted code, and we compare them to each other to see if any code was taken from the other one. In this case, we had 20 pairs of authors, which means that we had 40 authors, each with nine files, and we tried to identify their files correctly, and we had 99 percent accuracy in identifying these.

In the fourth case, we look at authorship verification. Here we are trying to find out if this person who claims to have written the code is the real programmer, or whether it was written by someone else. This is a two-class problem, but it's not exactly two-class, because the first class is only Mallory. Mallory claims to have written the test code, and we train on Mallory as the first class. We also train on a second class that's a combination of several other authors, and all of these are solutions to the same problem, each one corresponding to the same problem from different authors. Once we train on these two classes, we have the code that Mallory claims to have written, and we have code from a bunch of other random authors. In this task, we reach 93 percent accuracy across 80 different experimental setups; that means hundreds of different users and thousands of different files.
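A minimal sketch of this verification setup, with character n-grams standing in for the real feature set and toy code snippets standing in for the Code Jam data:

```python
# Sketch of the verification setup: class 1 is only Mallory; class 2
# mixes several other authors' solutions to the same problem.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

mallory = ["for(int i=0;i<n;i++){s+=a[i];}", "for(int j=0;j<m;j++){t+=b[j];}"]
others = ["while (k < n) { s = s + a[k]; k++; }",
          "int r = 0;\nwhile (p < m) { r += b[p]; p++; }"]

# Character n-grams as a crude stand-in for the full feature set.
vec = CountVectorizer(analyzer="char", ngram_range=(2, 3))
X = vec.fit_transform(mallory + others)
y = ["mallory"] * len(mallory) + ["other"] * len(others)

clf = RandomForestClassifier(n_estimators=300).fit(X, y)

claimed = "for(int i=0;i<n;i++){q+=c[i];}"  # code Mallory claims is hers
print(clf.predict(vec.transform([claimed])))  # 'mallory' or 'other'
```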
We also wanted to see if programming style is consistent throughout the years, because if it is, then when we're constructing our datasets, we can mix and match from different years. We found the contestants that competed in both 2012 and 2014, and here is an example, a random example, of their code: the same person in 2012 and 2014. The layout features look extremely similar, the structure is very similar, the for loop comes at the same depth, and we see that the lexical features, such as the variable names, are very similar, except that in 2014 they decided to capitalize the "t".

And as a result, we were able to identify 25 authors that competed in both 2012 and 2014 with 88 percent accuracy. The 88 percent might seem low to you after hearing the previous results with 99 or 93. But in this case, when we took these 25 authors just within 2012, we were able to identify them with 92 percent accuracy. So it's just a four percent drop in accuracy, which shows that coding style is, to some degree, persistent throughout the years.

We also wanted to gain some insights about coding style, so we wanted to see how people implement difficult versus easier functionality. We took a set of 62 authors who were able to answer 14 questions, and we took the seven easy problems and the seven more difficult problems. And we saw that these authors' programming style was more unique when they were implementing harder functionality: with a five percent increase in accuracy, we were able to identify them with 95 percent accuracy.
We also wanted to see the differences between an advanced programmer and a programmer with a smaller skill set, and how this is reflected in their coding style. And we saw that advanced programmers had a much more unique coding style compared to coders with a smaller skill set; the difference here is 15 percent, and this shows a large and very significant difference in coding style.

In the future, source code authorship attribution can be applied to many different areas. For example, we can use this to find the programmers of malicious code. We can look at open source repositories and find anonymous people who are contributing malicious code, and try to identify them by comparing them to other, good contributors. Or we can identify the styles of coders who have a vulnerable style by looking at the number of bugs they have on GitHub or other things. Companies might use this too: let's say they're interested in a particular coding style; they can train on it, and after that they can search for it on GitHub to recruit employees directly from GitHub.

And when we compare our work to previous work, we see a huge increase in accuracy, even though our dataset is larger in magnitude compared to theirs. The last two lines are our results, with 95 percent accuracy and 250 authors. This shows that our method, with the syntactic features, is doing a lot better, and the previous methods did not use any syntactic feature sets.

I would also like to thank our collaborators:
Dr. Richard Harang from the United States Army Research Laboratory, Dr. Clare Voss, also from the United States Army Research Laboratory, Andrew Liu from the University of Maryland, Dr. Arvind Narayanan from Princeton University, and also Fabian Yamaguchi from the University of Göttingen.

I talked about one particular domain, which was source code. Now Becca is going to talk about other domains and cross-domain stylometry.

Thanks. All right, so as you just saw from Aylin's presentation, we're really good at this. We are very good at this in a lot of domains as well. The ones I have up here, for example: source code, of course, but also anything you really put on the Internet, we've looked at as a community. So we have emails and chat messages, and even things you don't put on the Internet, like books or historical documents, have been studied. And in a few slides, you'll see just how good we are at these types of things.

This is Rahm Emanuel, and this is his Twitter feed. Rahm Emanuel is an American politician; he's currently the mayor of Chicago. And while he was running for office, a rogue Twitter feed was developed to imitate his Twitter feed. This is not Rahm Emanuel's Twitter feed. This Twitter feed was written instead by a man named Dan Sinker, and this is a really good example of why we would need to use stylometry in the real world. If we have Twitter feeds, we can test on Twitter feeds, and we do really well. The problem that I'm going to discuss today arises if Dan Sinker here didn't have a Twitter feed to compare it to. He is a writer, so he has a lot of writing. So if we didn't have a Twitter feed, what we could hopefully do instead was take a number of suspect authors.
And during the campaign he was actually named as one of the possible suspects. So we would have some data on a list of suspects, and even if their data weren't all Twitter feeds, if some of them were blog posts or articles they'd written, hopefully we'd still be able to identify the author of the rogue Twitter feed.

So my main problem here is domain adaptation in stylometry: we're given sample text in some domain, and we're trying to identify the author of some other document, which is in a distinct domain.

Some of the features that we use for this analysis are up here. First is bag of words. Bag of words is really popular, not just in stylometry but in natural language processing in general. These are, for example, how many times you use the word "the", how many times you use the word "computer", et cetera. Another popular one in stylometry and natural language processing is character or word n-grams, and there's an example of character bigrams underneath. Other specific features are function words, or stop words, which are non-content words, basically words that don't mean anything on their own, like "for", "to", "the"; and part-of-speech tags. Part-of-speech n-grams are also important and not context specific.

It's also popular to combine a bunch of features into one of what we call a feature set. Here's a popular one that works well within a bunch of domains, known as Writeprints. You can see it's broken up into lexical, syntactic, and content features, and at the bottom of the screen should be misspellings as well, but you can add other features on top of that. (A minimal sketch of this kind of feature extraction follows below.)
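Editorial aside: here is a minimal, self-contained sketch of the feature types just listed, combining bag-of-words, character-bigram, and function-word features and feeding them to a linear support vector machine, the kind of classifier mentioned later in the talk. The toy corpus, the tiny FUNCTION_WORDS list, and all names are invented for the example; the talk's actual tooling is JStylo with the Writeprints feature set, not this code.

```python
# Minimal sketch (not the actual JStylo/Writeprints implementation): extract
# bag-of-words, character-bigram, and function-word features, then train a
# linear SVM to attribute authorship.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus of (text, author) pairs; real experiments want
# something closer to 500 words per test document, as the talk notes.
docs = [
    ("the fork is to the left of the plate", "alice"),
    ("although i would say this couch is comfy", "alice"),
    ("the fork is at the plate's left side", "bob"),
    ("though the sofa is fine, i sit elsewhere", "bob"),
]
texts, authors = zip(*docs)

# Tiny stand-in for a real function-word list (real lists run to hundreds).
FUNCTION_WORDS = ["the", "of", "to", "at", "is", "i", "although", "though"]

features = FeatureUnion([
    # Bag of words: how many times each word is used.
    ("bag_of_words", CountVectorizer()),
    # Character bigrams, counted within word boundaries.
    ("char_bigrams", CountVectorizer(analyzer="char_wb", ngram_range=(2, 2))),
    # Function words only: non-content words that carry style, not topic.
    ("function_words", CountVectorizer(vocabulary=FUNCTION_WORDS)),
])

model = Pipeline([("features", features), ("svm", LinearSVC())])
model.fit(list(texts), list(authors))
print(model.predict(["though the fork sits to the left"]))
```

The bag-of-words component is exactly the part that stops transferring when an author switches domains and topics, which is why the discussion below leans on the non-content features.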
When we're looking at domain adaptation specifically, where we're talking about different domains, the different types of places where people are writing things, it's important to look at non-content features, because if you're writing in different places, you're probably writing about different things. The ones on the screen here are some examples of those.

The ones that have been studied most extensively in this context are function words, that is, stop words. You can see the accuracies are pretty good with these words. The first example up there, with eighty-one percent accuracy, had eight people write different texts in different genres. They were asked, for example, to recreate the story of Little Red Riding Hood and then asked to write an essay on something else, and these were compared. This isn't exactly "domain" in the way we're discussing it today; that's more genre or topic. Similarly, books were analyzed in the second grouping up here and were divided by genre and topic as well, and again function words were used.

So I said we're really good at this, and we are. You can see that within a bunch of domains, for emails, we get eighty-six percent accuracy. The bottom two lines are my own work, getting ninety-eight percent accuracy, almost ninety-nine, with Twitter feeds, and using blog entries we get about ninety-three percent accuracy. So we do pretty well. The lower accuracies for chat messages and Java forum comments are because they use a smaller amount of text for the testing document; as Rachel mentioned at the beginning, you want something closer to five hundred words for your testing documents.

This is a tweet on your left and a blog on your right.
These are from our data set and were written by the same person. The tweet has about, I don't know, three real words in it that aren't misspelled or replaced with something else, but you can see the blog on the other side of the screen is very well constructed: it has correct punctuation, and there aren't replacements for short words; we don't see any of that. So you can really see the challenge here in trying to identify the author of this tweet, or a group of tweets that look like this, from a blog that looks like that. And that's really our challenge.

As for the data we collected for this project, we collected five hundred users with both tweets and blogs, and then thirty-eight Reddit users who also had Twitter feeds. We collected the Twitter-and-blog users by simply querying Twitter for the phrase "wordpress.com", and we were able to collect tons and tons of data linking those two kinds of accounts. And then for the Reddit comments, there's a subreddit called /r/Twitter where people post their Twitter handles in order to gain more followers, and so that was a very easy way to link accounts. However, there wasn't as much data in there, so we were only able to get about thirty-eight users for that data set. But it works well to confirm that our methods work across different domains and not simply for blogs.

Possible solutions for this problem: the first is looking at Writeprints, which I showed in the beginning; kind of throw as many features at it as you can and hope it works. The second is: what if we were very careful about which features we selected, and instead only picked features, for example, that aren't context specific?
So we look at function words, as others have in the past. And the final one is that we look at our own method, called Doppelgänger Finder, which I'll get to later.

These are the in-domain results for blogs, then tweets, then Reddit comments, then tweets again; we have two different Twitter data sets because one was collected with the blogs and the other with the Reddit comments. And you can see that we do really, really well. The purple bars are function words, and they don't do quite as well as Writeprints, which are the bluish bars on the screen. But in general we're doing pretty well with this.

The green bars are the cross-domain results, so you can see that there's a huge gap in accuracy: if we're training on blogs and then testing on Twitter feeds, or training on Reddit comments and testing on Twitter feeds, we do very poorly, and these results are unacceptable, using the first two methods, which are Writeprints and the careful feature selection of function words.

So what do we do about it? Doppelgänger Finder is an algorithm that was presented previously; it was created to link user accounts across cybercriminal forums, and it naturally seems like it would work for our problem, because really what we're trying to do is link accounts across the Web. This method works by calculating the probability that each author wrote each other author's documents; then, for each pair of authors, it combines these probabilities, and every pair whose probability is above a certain input threshold is considered to be the same person, while below it they're considered to be different people. (A minimal sketch of this pairwise scheme appears below.)
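Editorial aside: as a rough, hypothetical illustration of the scheme just described, the sketch below trains a classifier with each candidate account held out in turn and combines the two directed probabilities for every pair. The classifier choice, the hold-out scheme, the averaging, and the toy accounts dictionary are all assumptions made for this sketch; the actual Doppelgänger Finder implementation is the one linked on the slide.

```python
# Minimal sketch of the pairwise-probability idea behind Doppelganger Finder
# (an illustration only; the real implementation is the one linked on GitHub).
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical input: account name -> documents written under that account.
accounts = {
    "blog_A":    ["first blog post text here", "second blog post text here"],
    "twitter_B": ["short tweet number one", "short tweet number two"],
    "twitter_C": ["someone else entirely tweeting", "more unrelated tweets"],
}
names = list(accounts)

def mean_prob(source, target):
    """P(target | source's documents), with source held out of training."""
    train_texts = [d for n in names if n != source for d in accounts[n]]
    train_labels = [n for n in names if n != source for _ in accounts[n]]
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(train_texts), train_labels)
    probs = clf.predict_proba(vec.transform(accounts[source]))
    return probs[:, list(clf.classes_).index(target)].mean()

THRESHOLD = 0.4  # an input parameter of the algorithm; arbitrary here

for a, b in combinations(names, 2):
    # Combine both directions: a's docs scored as b, and b's docs scored as a.
    score = (mean_prob(a, b) + mean_prob(b, a)) / 2
    verdict = "possibly same author" if score > THRESHOLD else "different"
    print(f"{a} <-> {b}: {score:.2f} ({verdict})")
```

In the augmented variant described next, pairs from the same site are skipped entirely, and each account is simply linked to its highest-scoring candidate on the other site, which removes the need for a threshold.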
For example, we have some authors, author A through author I, and we find the probability that author A wrote author E's documents and that author E wrote author A's documents. We do this for all of them, and then for whichever pair probabilities are above a certain threshold, we say they're the same author, and if they're below, we say they're distinct. This code can be found on GitHub; the link is at the bottom of the screen, and it also appears at the end of the presentation if you miss it.

We were actually able to augment this Doppelgänger Finder algorithm to work better in the domain adaptation case as well. Here we had to compare A to E, F, G, all of them; over here, we don't have to compare A to B, C, and D, because they're all in the same place, let's say Twitter. If they're all on Twitter, they're all tweets, so we know they're not written by the same people; they're distinct. So we get a little bit of an advantage here in the algorithm. And also, we don't have to use a threshold, which is definitely a huge advantage: we just take the highest of all the probabilities, because we know that the accounts are somehow linked. In the open-world case, which is one where you don't know the suspect set, so you're not sure it's one of these people, or you're not sure that there's a perfect one-to-one pairing between the two, you'd have to threshold it, and you'd have that same issue again.

Here are the cross-domain results for the blog and Twitter data set. The green bars at the bottom are the same green bars from the earlier slide, so you can see we do very terribly in cross-domain using those methods.
And then the blue bars are the in-domain results, and the bold red line shows the results using our augmented Doppelgänger Finder. So you can see we were able to recover the accuracy to almost as high as some of the in-domain accuracies.

As for the limitations of Doppelgänger Finder: first of all, you need a lot of text, in the training documents and in the testing documents; maybe you would need even more than 500 words of testing documents to make this really work. Additionally, it's made for one specific case, which is account linking, and it may not work for cases more specific than that.

The next question that naturally arises is: what if I'm trying to identify the author of a Twitter feed and I have a bunch of blog data, but I also have some Twitter data? In one of these limited cases, should I use the Twitter data, or should I try to use domain adaptation with the blog data? And the answer really is: if you have Twitter data, you should use the Twitter data. You can see it at the first point there on the screen. That 10 percent means that 10 percent of the training data is Twitter data and the rest are blogs, and this is just using Writeprints with support vector machines for the machine learning. You can see that we get a huge jump in accuracy from having no tweets to having some tweets. So if you have any Twitter data, use it. This is mirrored as well in other domain adaptation methods in natural language processing.

The open problems left in domain adaptation: the first is looking at other domain adaptation solutions, probably from other natural language processing problems like sentiment classification; and also looking at how topic affects style.
So if you are a blogger and you have a Twitter feed, they're probably written on the same thing; but if you are a Redditor and you have a Twitter feed, they're probably written on different things, so how does that affect it? Or even if you have a Reddit account and you write in different subreddits on different topics, can we still identify you if you're not writing about the same thing? Another thing to look at would be other domains. And finally: is it possible for us to change how a document feels, or what its actual content looks like, to make it feel more like the other domain? For example, we had that tweet up there that had barely any real words in it. What if we were able to make it look a little more like plain text and make it look like that blog? Is that changing it too much, or would that work? That's definitely a huge open question that's not very easy to answer right now.

So, anonymity is really hard. Trying to make yourself anonymous, even through a lot of these methods, is difficult, and it's really not only about what you're writing but also about how you write it. So even if you're doing things like monitoring the content of what you write to make sure it can't be traced back to you, or hiding your location through things like Tor, we can probably still identify you through your writing style alone.

So while stylometry can combat online abuses, it's also a huge anonymity threat, and we're surprisingly good at identifying authors across many domains, not just within them. But not all is lost. What can we do about it? Our lab is developing a tool called Anonymouth.
This piece of software helps you anonymize your text as you write it, and it uses JStylo in the background to monitor that you no longer look like the same author. This is definitely a work in progress; it could use a lot of work, analysis, and feedback. So if anyone's interested in playing with it, contributing to it, or helping with it, the GitHub link is at the bottom, and you can contact us with anything else. So thank you all for listening to all three of us.

Special thanks to my contributors Travis Dutko and Sadia Afroz, and we'll take any questions.

Well, thank you very much. And now we have about 20 minutes for questions. Feel free to line up at the microphones, and we'll start with number three.

Thanks for the talk. I've got a question about the cross-domain research. I was wondering if you ever tried to enrich your feature sets with metadata, like activity patterns or links used or something like that.

So, am I on? Can anyone hear me? We've done a little bit. We looked at Twitter specifically, because there's just so much metadata associated with Twitter, and we found that we could improve our Twitter results a little bit. But in the cross-domain case it doesn't particularly help. And our Twitter results are already at ninety-eight point nine, so any improvement there isn't really much of an improvement.

Do you have any idea why that is? Because my expectation would be that it's a very good fingerprint of someone, like at what time that person is writing something, or how many links are in the text, or something like that.
So, we didn't collect any data for when things were posted for the blogs, so we haven't done an analysis with that. When we looked at the metadata, we're talking about hashtags, tags, and links. The hashtags and the tags don't really translate over to blogs, and as far as links go, I just don't think there's enough similarity between them to get any real improvement out of it.

Thank you.

Number four, please.

Is Anonymouth limited to English, or is it independent of the natural language you choose?

So, I think the current implementation is limited to English, but it wouldn't take a lot of work to extend it to, say, German in particular, because we have the analysis back end in German. So it would just be a question of adding a couple of tweaks to the interface. To get extensions to further languages, what you basically need to do is augment the analysis engine to have function words and a part-of-speech tagger for that language. Now, it may be a little more difficult for, say, Asian languages that require segmentation; you would need a segmentation engine for those. But other than that, it shouldn't be that hard. Yeah, like I said, you could already use the analysis for some languages, but the front end of Anonymouth doesn't do that currently. There is an abstraction layer in the code, and there is an API. Yeah.

Thank you.

Well, thanks for the talk. I think it's a fascinating subject.
In the first half of the lecture you were talking about source code analysis, et cetera. I'm trying to understand: since one of your results is that the best features have nothing to do with the surface text of the code, you know, indentation and stuff like that, but with the structure of the program itself, why are you limited to source code analysis? I mean, you could dump a binary into IDA Pro and get a flow graph of the program you can analyze.

So, yeah, you're right. This was the first time a syntactic feature set was tried, because it hadn't been tried before, and we first wanted to see that our intuition was really correct and that this would get us somewhere. Right now, with C++, we saw that this works very well, and as long as we have a parser to get the structure of a program, or of anything, this would be very helpful. So we're willing to extend our work to different languages; we have a lot of other things left to do in the future.

Yeah, and we would like to get to the binary case; that is next on the agenda. But we wanted to confirm this first. The nice thing now is that we can compile these programs and then directly compare the accuracy we get from the source code to the accuracy from the binary, so we can see what the difference is. And as I guess you know, it's a very realistic problem, like in malware research in general.

Thanks. Thanks.

So, you've said that you used code from Google Code Jam to analyze, to check how your approach works. Did you strip out macros?
Because I know that people in such programming contests use quite a lot of macros that they add to their everyday template, because it makes it easier to program later; it can be about 20 lines of such macros. Did you strip those out? Because if you didn't, you might not even be comparing whether it is code by the same author, but whether it is literally the same code.

We looked at the macros. We had a dedicated feature just for macros, and also our AST parser works on a function-by-function basis, so most of the time macros were excluded from the structural information; we kept that separately. But we tried to find out if there are any similarities like that in the code, and we didn't see too many. If we investigate further, just for that specific thing, we might find more similarities. So that's a very good point; I'll check that.

OK, and the second question is: you found that for more advanced problems, the accuracy of attributing code is much higher. Could that be an artifact, because there were fewer solutions for more advanced problems, and because of that, fewer authors? I mean, there were more authors writing solutions for easier problems and fewer authors writing solutions for harder problems, and because there are fewer authors, the accuracy should be higher, right?

The data set sizes were always kept the same so that we can compare the results. So we grouped the problems into two groups, into hard problems and easy problems, and maintained the same size, yes.
And it was a completely random selection from, like, hundreds or thousands of users, to make sure that it represents a real-world scenario.

Thank you for your answers. Interesting. Thanks.

Number two, please.

Um, there is a saying, "lost in translation". Have you tried taking a passage and passing it through Google Translate or some other translation program a few times, and seeing whether, when it comes back, it's still recognizable as being from the original author, perhaps with some corrections to spelling and grammar?

Yeah, a few years ago I had a project on that, where we would take the writing, translate it to German, translate it to Japanese, and translate it back, and we would do that with several different translators, such as Google, Bing, and Language Weaver, and a few others. And we saw that, in most of the cases, depending on the quality of the translator and on that particular language, we were able to identify those people with very good accuracy. But again, the quality of the translator on a particular language has a very big effect here; we were able to observe that.

I mean, the longer the path that you translate through, like if you go through 12 different intermediate languages, the text is going to get almost unrecognizable at the end. Now, if someone was trying to subvert a system like this, could they just do that, end up with the final product after 10 or 20 translations, and then just make simple spelling corrections and simple grammatical or phrasing corrections?
What sort of length did you test this on? How many translations did you run the passage through before bringing it back to the original language?

So, ours in total was three: it was German and Japanese, then back to English, so maybe two in the middle. But a recent paper showed what happens if you do many translations, I think up to 20. And the more translations you did, the more unidentifiable the author became; but at the same time, the text lost its semantics, so there was not much context or meaning left in the text.

One thing that we have experimented with in the Anonymouth program is using this on a sentence-by-sentence basis: translate a sentence to a whole bunch of different languages and back, just one way, but then rank the translations that were produced from the ones that had the most anonymity to the least anonymity, and put the best ones at the top. Then the person can look at them, find ones that have more anonymity but are still close to the meaning, and bring those back in. So that was one thing we experimented with.

OK, thank you.

I was wondering if your work is one-way, and by that I mean: how far away are you from producing a quote-unquote genuine letter from Angela Merkel, or a long-lost play from Shakespeare, with all the information you have?

So, text generation is much, much harder than text analysis. It's sort of, I would argue, an NLP-complete problem. So I don't think we're very close. What we would be able to do is probably help somebody create a letter that was imitating that style. You could have it be sort of a collaboration between the analysis engine and a person.
And that would probably work quite well. But to do it automatically would be much harder.

So it could be used to aid impersonation as well?

Yes.

Um, I have two questions, actually. My first one is: will there be something like an Anonymouth for source code, actually?

Yeah, the analysis code is actually available on GitHub, but I didn't do the licensing yet. If you want to play with it, you can play with it, and I'll fix the documentation and the licensing information as soon as possible.

What she means is her analysis code; we have not written an anonymizer for source code yet. We don't have an anonymizer for source code, which would be called an obfuscator. Maybe you could adapt Anonymouth to work, probably OK to some extent, if you're trying to anonymize source code.

Yeah, but the suggestions would be bad.

Yes.

OK, um, and the second one is: when you try to compare different code, did you also try to compare between different programming languages, or did you always compare source code in the same language?

We always looked at C++ in this case, because our AST parser was for C++.

Yeah, sure. But do you think it's possible to also find the similarities between different source codes from different languages?

Well, since each programming language has a structure and its own grammar, this should be possible as long as you have the parser, so it can be extended to other languages in the same manner.

OK, yeah.

And the cross-domain case might be tricky; we'd have to do some experiments.
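Editorial aside: the answer above comes down to "the syntactic features only need a parser for the language at hand." As a rough, hypothetical illustration, the sketch below uses Python's built-in ast module on Python source as a stand-in for the C++ parser the actual system relies on; the node-type counts are merely the flavor of syntactic feature being discussed, not the paper's exact feature set.

```python
# Minimal sketch: count abstract-syntax-tree node types and parent->child
# node-type bigrams as authorship features. Python's ast module stands in
# for the C++ parser used in the actual work; the idea transfers to any
# language for which a parser exists.
import ast
from collections import Counter

def ast_node_features(source: str) -> Counter:
    """Return counts of AST node types and parent>child node-type pairs."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        counts[type(node).__name__] += 1
        for child in ast.iter_child_nodes(node):
            counts[f"{type(node).__name__}>{type(child).__name__}"] += 1
    return counts

# Two hypothetical snippets that do the same thing in different styles.
snippet_a = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s += x\n"
    "    return s\n"
)
snippet_b = "def total(xs):\n    return sum(x for x in xs)\n"

print(ast_node_features(snippet_a).most_common(5))
print(ast_node_features(snippet_b).most_common(5))
```

Both snippets compute the same sum, but their node-type distributions differ (Assign, For, and AugAssign nodes versus a GeneratorExp inside a Call), which is the intuition behind attributing authors by program structure rather than by layout.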
I actually wanted to ask the question of my predecessor, so I just want to say thank you for a great presentation.

Thanks.

OK, I have so many questions from IRC. OK, if I might ask: the first one was about the case of Katzer versus Jacobsen, where you had this comparison between the free code and the copyrighted code. They wanted to know how you got the source code of the copyrighted code, or whether it was open source code, or which license it was under.

No, we didn't compare it, because it's copyrighted code; we didn't try to access it, because it's not publicly available. It was just an example.

Thank you.

Number four, please.

OK, first of all, thanks for a very interesting talk, and also thanks for doing work on this Anonymouth solution, because it would be more concerning if this research were only being applied to reduce people's privacy, which in some countries can end quite badly. My question is: if people use this tool to make their language less identifiable, can they then be identified as having used that tool, with high confidence? Does it leave a signature if you use Anonymouth? And what's the size of the set of people that use it? Because you're only as anonymous as the number of people using that tool, if it's identifiable.

So, I don't know how many people use Anonymouth; probably not a huge amount, because if you actually try using it, it's kind of difficult. And I don't know if using Anonymouth itself would really create a signature.
My guess is it probably would, given the way these tools tend to work. The experiment that we did do was looking at people whom we just told to imitate someone else's style, or just to try to hide their style, without Anonymouth. And we were able to create a classifier that could distinguish people who had tried to do that from people who hadn't, without necessarily being able to identify the original author.

It just seems to me that if the stakes were high, given the amount of safety you'd get from using this, it would be difficult to calculate when it's safer to start using this tool and obfuscating versus just saying less. It would be nice to have more analysis, so that people can make that decision on an informed basis.

I agree that it would be nice.

OK, number one, please.

Hi. Yes. Many, many development houses and coding houses use style guides, and they're pretty strict about it. You'll run things like RuboCop and similar tools, and they'll say remove spaces and, you know, use single quotes instead of double quotes. Have you taken that into account?

Um, first of all, we thought that people have to implement the functionality in a limited time, so they use the things that they would naturally use; they express their style, because they are limited in time. On the other hand, if you consider that their style has to follow a certain format, that would make everyone more similar, and in that scenario the machine learning problem would become even more difficult, with everyone following a certain style guide.
But there is no way for us to tell, because we don't have ground truth information from these contestants about how they were implementing the functionality at the time of the competition.

What we can say is that it really depends on the style guide, because we know the features that we use. So, in the obvious case, if the style guide really only talks about spacing and layout and variable names and stuff like that, and doesn't affect the deeper structure of the code, like the nesting depths and things like that, then it wouldn't really be relevant. But if it does affect that, then it would be. So it probably depends on the specific style guide itself. But we don't have any data to suggest either way.

It's just that in development, you do a pull request and someone criticizes your code, and you have to change it to make it look like everyone else's. So I was wondering if you could pick out, from three thousand developers here, who actually wrote that code, that sort of thing.

OK.

Hello. Yeah, I think we are going to take the last three questions and then wrap it up. OK, so number two, please. And when we're done, we'll go to the cafeteria area and sit down at a table there, and if people want to ask more questions, they can.

OK. OK, so the next question from IRC is: what about multiple authors, as in open source projects? What happens to the protection of the author in such a case?

OK, so we haven't done anything with source code on this yet, because that's, I think, a difficult problem that we just haven't looked at.
OK, so the next question, from the internet, is: what about multiple authors, as in open source projects? What happens to the protection of the author in such a case?

OK, so we haven't done anything with source code on this yet, because that's, I think, a difficult problem that we just haven't looked at. We are currently looking at documents that are written by multiple authors, which is a similar problem, and we're getting preliminary results. So look forward to that; the results are still early, though.

OK, number two.

My question is quite similar: is it possible to detect whether a text was written by one person or by several people?

I think that's definitely part of the same problem; it may be a first step to solving it. Yeah, it's something we're actively working on, but we don't have any results yet.

Does this also work across domains and across languages? For example, if I'm on one mailing list in German and on one forum in English, would you be able to match these accounts by stylistic features that are independent of the language I'm posting in?

You can use language-independent features, or you can try translating the writing and then doing authorship attribution with an English feature set, and look at whichever one works better.

Yeah, I think probably the best way to do that would be to translate both of them, run the analysis in each of the individual languages, and see what the results are. That's how I would go about it, because the n-grams and so on will be different for the different languages, so you probably want to translate them.

OK, I think that's that.
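A rough sketch of that translate-then-attribute idea. Here `translate_to_english` is a placeholder standing in for any machine translation system, and character n-grams are one illustrative choice of near-language-independent features, not the speakers' exact method:

```python
# Bring both corpora into one language, then compare the two accounts'
# character n-gram profiles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def translate_to_english(text: str) -> str:
    # Identity placeholder: swap in a real MT system before using this.
    return text

german_account = ["Beispieltext von der deutschen Mailingliste ..."]
english_account = ["Sample text from the English forum ..."]

docs = [translate_to_english(t) for t in german_account] + english_account

# Character 3- to 5-grams capture habits like punctuation and word endings.
profiles = TfidfVectorizer(analyzer="char", ngram_range=(3, 5)).fit_transform(docs)

# High cosine similarity between the profiles is weak evidence that the
# accounts share an author; it is a comparison, not an identification.
print(cosine_similarity(profiles[0], profiles[1])[0, 0])
```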
Thank you so much for coming, and we hope that you'll be back next year.