0 00:00:00,000 --> 00:00:30,000 Dear viewer, these subtitles were generated by a machine via the service Trint and therefore are (very) buggy. If you are capable, please help us to create good quality subtitles: https://c3subtitles.de/talk/589 Thanks! 1 00:00:09,540 --> 00:00:11,699 So who of you saw the 2 00:00:11,700 --> 00:00:13,949 talk about politician-speak this 3 00:00:13,950 --> 00:00:14,950 morning? 4 00:00:15,510 --> 00:00:16,589 Nobody. 5 00:00:16,590 --> 00:00:17,519 OK. 6 00:00:17,520 --> 00:00:20,049 Yeah, it wasn't German, so maybe 7 00:00:20,050 --> 00:00:22,169 I wanted to respond something to 8 00:00:22,170 --> 00:00:24,269 the people who did, but yeah, 9 00:00:24,270 --> 00:00:26,519 apparently not. Uh, 10 00:00:26,520 --> 00:00:28,799 talking gibberish in 11 00:00:28,800 --> 00:00:30,539 human-understandable language, 12 00:00:30,540 --> 00:00:32,699 uh, you didn't hear about that 13 00:00:32,700 --> 00:00:35,669 today, but, um, talking gibberish in 14 00:00:35,670 --> 00:00:38,069 electronic languages, 15 00:00:38,070 --> 00:00:40,139 you are probably familiar 16 00:00:40,140 --> 00:00:41,279 with that. So 17 00:00:42,330 --> 00:00:45,389 then, here is 18 00:00:45,390 --> 00:00:47,459 a security researcher with 19 00:00:47,460 --> 00:00:49,619 Check Point, and he will talk 20 00:00:49,620 --> 00:00:52,469 to you today about DGAs. 21 00:00:52,470 --> 00:00:55,229 So, um, algorithms 22 00:00:55,230 --> 00:00:57,059 that produce gibberish. 23 00:00:57,060 --> 00:00:59,069 But, uh, they got a bit smarter in the 24 00:00:59,070 --> 00:01:01,169 past. And 25 00:01:01,170 --> 00:01:03,239 he will tell you something about how 26 00:01:03,240 --> 00:01:05,729 to detect gibberish, 27 00:01:05,730 --> 00:01:07,799 which some people 28 00:01:07,800 --> 00:01:09,389 might want to have for politicians, too. 29 00:01:09,390 --> 00:01:11,609 But you have to use reason for 30 00:01:11,610 --> 00:01:13,799 that.
And he will give you an idea about 31 00:01:13,800 --> 00:01:15,719 how you can do that, for example. 32 00:01:15,720 --> 00:01:18,389 OK, give a warm round of applause for 33 00:01:18,390 --> 00:01:19,390 Ben-Meir. 34 00:01:24,620 --> 00:01:26,790 Um, is this on? It is. 35 00:01:27,800 --> 00:01:29,719 OK, first things first. 36 00:01:29,720 --> 00:01:31,999 If this slide makes any amount 37 00:01:32,000 --> 00:01:33,919 of sense to you, then I'm sorry to have 38 00:01:33,920 --> 00:01:35,269 to tell you this, but you're probably a 39 00:01:35,270 --> 00:01:36,560 robot. Uh, 40 00:01:37,580 --> 00:01:38,929 so, uh, well, the good news... 41 00:01:38,930 --> 00:01:40,219 That's the bad news. The good news is 42 00:01:40,220 --> 00:01:41,449 that you have come to the right lecture, 43 00:01:41,450 --> 00:01:42,889 because once this is done, you'll be able 44 00:01:42,890 --> 00:01:44,599 to detect gibberish just like the rest of 45 00:01:44,600 --> 00:01:46,069 the humans. You'll be able to blend in and 46 00:01:46,070 --> 00:01:47,149 no one will know a thing. 47 00:01:47,150 --> 00:01:49,549 Uh, so first, 48 00:01:49,550 --> 00:01:51,169 I'm going to refresh your memory a bit 49 00:01:51,170 --> 00:01:52,489 about what DGA 50 00:01:52,490 --> 00:01:54,649 is and what the problem is that it 51 00:01:54,650 --> 00:01:56,689 was trying to solve. 52 00:01:56,690 --> 00:01:59,089 Let's look at a regular scenario, 53 00:01:59,090 --> 00:02:01,099 a basic scenario, where a 54 00:02:01,100 --> 00:02:03,169 system has been infected with malware 55 00:02:03,170 --> 00:02:05,299 and it wants to converse with its command 56 00:02:05,300 --> 00:02:06,859 and control server. That's what malware 57 00:02:06,860 --> 00:02:08,508 does nowadays. In the past, it may have 58 00:02:08,509 --> 00:02:10,698 just done its own thing without receiving 59 00:02:10,699 --> 00:02:13,129 any commands.
But today, 60 00:02:13,130 --> 00:02:15,289 uh, malware usually 61 00:02:15,290 --> 00:02:17,569 waits for commands and operates 62 00:02:17,570 --> 00:02:19,399 based on commands that it, uh, receives. 63 00:02:19,400 --> 00:02:21,589 So, uh, in this 64 00:02:21,590 --> 00:02:23,869 basic, uh, usual scenario, the 65 00:02:23,870 --> 00:02:26,059 malware came with a built-in 66 00:02:26,060 --> 00:02:27,739 DNS address. It's hard-coded. 67 00:02:27,740 --> 00:02:30,049 And the malware queries the DNS server 68 00:02:30,050 --> 00:02:32,329 with this hard-coded address and receives 69 00:02:32,330 --> 00:02:34,519 a response. This is the IP address of the 70 00:02:34,520 --> 00:02:36,799 server. Now the infected system contacts 71 00:02:36,800 --> 00:02:39,559 the C&C server over the Internet, and the server, 72 00:02:39,560 --> 00:02:41,569 the C&C server, very excitedly 73 00:02:41,570 --> 00:02:43,699 responds, yes, I have another machine 74 00:02:43,700 --> 00:02:45,949 under my sway, and the connection 75 00:02:45,950 --> 00:02:47,719 is complete. Now the infected system and 76 00:02:47,720 --> 00:02:50,329 the server can converse. 77 00:02:50,330 --> 00:02:52,549 So all of this is fine and good until one 78 00:02:52,550 --> 00:02:55,099 day the powers that be, 79 00:02:55,100 --> 00:02:56,119 maybe the authorities, 80 00:02:56,120 --> 00:02:58,279 I don't know, they 81 00:02:58,280 --> 00:03:00,349 find out about all of this and 82 00:03:00,350 --> 00:03:01,759 they talk to the people in charge of the 83 00:03:01,760 --> 00:03:03,949 DNS server. That's probably right, 84 00:03:03,950 --> 00:03:05,389 but not necessarily. 85 00:03:05,390 --> 00:03:07,669 And they tell them, well, 86 00:03:07,670 --> 00:03:09,469 there's been this shady 87 00:03:09,470 --> 00:03:11,419 activity going on and it's making use of 88 00:03:11,420 --> 00:03:12,829 your DNS servers.
89 00:03:12,830 --> 00:03:14,989 Would you kindly make sure that it stops? 90 00:03:14,990 --> 00:03:17,059 And the people in charge of the DNS server 91 00:03:17,060 --> 00:03:18,649 do not want any trouble. 92 00:03:18,650 --> 00:03:20,719 So they remove the record pointing to 93 00:03:20,720 --> 00:03:22,789 the IP address of the server. 94 00:03:22,790 --> 00:03:24,559 And now the infected system, just as 95 00:03:24,560 --> 00:03:27,139 before, makes the DNS query to the server 96 00:03:27,140 --> 00:03:29,329 and asks, OK, where's the IP server 97 00:03:29,330 --> 00:03:31,429 of... where's the IP address of 98 00:03:31,430 --> 00:03:33,079 my C&C server? 99 00:03:33,080 --> 00:03:35,869 And the server basically responds, 100 00:03:35,870 --> 00:03:36,870 go. 101 00:03:37,430 --> 00:03:39,859 Now the C&C server just stands there, 102 00:03:39,860 --> 00:03:42,139 fully functional, waiting to 103 00:03:42,140 --> 00:03:44,299 send commands, and it stands there and 104 00:03:44,300 --> 00:03:46,429 it waits and it waits and it waits. 105 00:03:46,430 --> 00:03:47,839 And that's not very good for the 106 00:03:47,840 --> 00:03:48,840 campaign. 107 00:03:49,520 --> 00:03:52,309 Now DGA 108 00:03:52,310 --> 00:03:54,619 is basically a mechanism that campaign 109 00:03:54,620 --> 00:03:55,879 managers came up with. 110 00:03:55,880 --> 00:03:57,979 They looked at 111 00:03:57,980 --> 00:04:00,079 this problem, the ease with which 112 00:04:00,080 --> 00:04:02,119 a takedown can happen, and they said, 113 00:04:02,120 --> 00:04:03,679 we want something better that won't be 114 00:04:03,680 --> 00:04:05,299 taken down as easily. 115 00:04:05,300 --> 00:04:07,459 So I could stand here for a lot 116 00:04:07,460 --> 00:04:09,559 of time and talk theoretically about DGA and 117 00:04:09,560 --> 00:04:11,449 how it works.
But I think a practical 118 00:04:11,450 --> 00:04:13,099 walkthrough of just how it works in 119 00:04:13,100 --> 00:04:15,619 practice is going to be more productive. 120 00:04:15,620 --> 00:04:17,778 So let's 121 00:04:17,779 --> 00:04:19,469 see how it actually works. 122 00:04:19,470 --> 00:04:22,069 Our story begins with 123 00:04:22,070 --> 00:04:24,349 this C&C server, and it has access 124 00:04:24,350 --> 00:04:26,929 to a pseudorandom number generator, 125 00:04:26,930 --> 00:04:29,209 which is basically a creature that takes 126 00:04:29,210 --> 00:04:31,369 in a small amount of entropy, randomness, 127 00:04:31,370 --> 00:04:34,169 and outputs a large amount of entropy, 128 00:04:34,170 --> 00:04:35,170 randomness. 129 00:04:36,170 --> 00:04:38,539 Now, this pseudorandom generator 130 00:04:38,540 --> 00:04:40,759 specifically takes in a seed that's publicly 131 00:04:40,760 --> 00:04:43,279 available, such as the date of today 132 00:04:43,280 --> 00:04:45,559 or maybe the headlines of today's 133 00:04:45,560 --> 00:04:46,789 newspaper, I don't know. 134 00:04:46,790 --> 00:04:48,169 The important thing is that it should be 135 00:04:48,170 --> 00:04:50,029 publicly available to everybody. 136 00:04:50,030 --> 00:04:52,009 Now, the DGA is a customized algorithm that 137 00:04:52,010 --> 00:04:54,199 takes in this particular variable, 138 00:04:54,200 --> 00:04:56,399 a small amount of information, and 139 00:04:56,400 --> 00:04:58,609 outputs a large number of 140 00:04:58,610 --> 00:05:00,059 domains. 141 00:05:00,060 --> 00:05:02,269 Uh, these domains are not 142 00:05:02,270 --> 00:05:03,349 very understandable. 143 00:05:03,350 --> 00:05:04,339 And this is typical. 144 00:05:04,340 --> 00:05:06,079 And this is basically what this lecture 145 00:05:06,080 --> 00:05:06,889 is about.
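The seed-to-domains step described above can be sketched roughly as follows. This is a toy illustration only, not the algorithm of any real malware family: the function name `generate_domains`, the use of SHA-256 as the pseudorandom source, the 12-letter label length and the `.com` suffix are all assumptions made for the example.

```python
import hashlib
from datetime import date


def generate_domains(day: date, count: int = 20, tld: str = ".com") -> list[str]:
    """Derive `count` gibberish domain names from a public seed (here: the date).

    Hypothetical sketch: anyone who knows the algorithm and the date
    gets exactly the same list, which is the whole point of a DGA.
    """
    domains = []
    for i in range(count):
        # Hash the public seed plus a counter to get pseudorandom bytes.
        digest = hashlib.sha256(f"{day.isoformat()}-{i}".encode()).digest()
        # Map the first 12 bytes onto the letters a-z.
        label = "".join(chr(ord("a") + b % 26) for b in digest[:12])
        domains.append(label + tld)
    return domains


todays_candidates = generate_domains(date(2015, 12, 28))
```

Both sides would run the same function with the same date: the server side to pick one name to register, the infected side to obtain the list of names to try.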
146 00:05:06,890 --> 00:05:09,799 But now what the C&C server does 147 00:05:09,800 --> 00:05:11,899 is take one 148 00:05:11,900 --> 00:05:13,129 of those domains at random, 149 00:05:13,130 --> 00:05:14,749 typically that's what it does, and 150 00:05:14,750 --> 00:05:16,879 register it with the DNS server to 151 00:05:16,880 --> 00:05:19,069 point at the IP address that's 152 00:05:19,070 --> 00:05:21,619 relevant, the IP address that 153 00:05:21,620 --> 00:05:23,779 infected machines can contact. 154 00:05:23,780 --> 00:05:25,579 Now, what happens on the infected client 155 00:05:25,580 --> 00:05:27,629 side? Something similar. The infected 156 00:05:27,630 --> 00:05:29,449 system also has access to the same 157 00:05:29,450 --> 00:05:31,909 pseudorandom generator, which came bundled 158 00:05:31,910 --> 00:05:34,399 with the malware, and it has access 159 00:05:34,400 --> 00:05:36,829 to the publicly available seed because 160 00:05:36,830 --> 00:05:38,299 it's publicly available. 161 00:05:38,300 --> 00:05:39,829 So it consults the pseudorandom 162 00:05:39,830 --> 00:05:42,169 generator and asks, what are the domains 163 00:05:42,170 --> 00:05:43,969 of today? What are the domains available 164 00:05:43,970 --> 00:05:45,289 to me today? 165 00:05:45,290 --> 00:05:47,539 And it gets a list. How many domains 166 00:05:47,540 --> 00:05:49,939 are there? It varies. Sometimes it's 15, 167 00:05:49,940 --> 00:05:51,589 sometimes it's 200. 168 00:05:51,590 --> 00:05:52,969 There's a lot. That's what I'm trying to 169 00:05:52,970 --> 00:05:55,189 say. And now the infected 170 00:05:55,190 --> 00:05:57,319 system knows that the C&C server 171 00:05:57,320 --> 00:05:59,539 had registered one of 172 00:05:59,540 --> 00:06:02,119 those domains to point at the IP address, 173 00:06:02,120 --> 00:06:04,309 but it doesn't know which one. 174 00:06:05,490 --> 00:06:07,319 So what is it going to do?
175 00:06:07,320 --> 00:06:09,359 There's really only one solution: contact 176 00:06:09,360 --> 00:06:11,609 all the addresses. It's going to iterate 177 00:06:11,610 --> 00:06:13,889 over all the addresses one by one 178 00:06:13,890 --> 00:06:16,469 and make DNS queries 179 00:06:16,470 --> 00:06:18,459 asking for the relevant IP address. 180 00:06:18,460 --> 00:06:20,699 Now, most of those domains 181 00:06:20,700 --> 00:06:22,949 have not actually been registered. 182 00:06:22,950 --> 00:06:25,049 So what results is 183 00:06:25,050 --> 00:06:27,119 a very peculiar sort 184 00:06:27,120 --> 00:06:29,909 of conversation across the DNS protocol, 185 00:06:29,910 --> 00:06:32,069 one that kind of 186 00:06:32,070 --> 00:06:34,289 resembles the Cheese Shop sketch 187 00:06:34,290 --> 00:06:35,819 by Monty Python. 188 00:06:35,820 --> 00:06:37,969 If you're not familiar with it, it's a 189 00:06:37,970 --> 00:06:40,319 sketch that involves a guy walking into 190 00:06:40,320 --> 00:06:42,569 a cheese shop, and he tries to purchase 191 00:06:42,570 --> 00:06:44,249 various kinds of cheese. 192 00:06:44,250 --> 00:06:46,499 And as the sketch progresses, it becomes 193 00:06:46,500 --> 00:06:48,779 increasingly clear that the shop 194 00:06:48,780 --> 00:06:50,639 does not actually hold any kind of cheese 195 00:06:50,640 --> 00:06:51,809 at all. 196 00:06:51,810 --> 00:06:53,909 The guy asks, do you have any Parmesan? The cheese 197 00:06:53,910 --> 00:06:55,049 shop owner says no. 198 00:06:55,050 --> 00:06:57,149 Well, how about Retno, and so 199 00:06:57,150 --> 00:06:58,529 on and so forth. 200 00:06:58,530 --> 00:07:00,719 And the DNS 201 00:07:00,720 --> 00:07:03,269 conversation going on resembles this 202 00:07:03,270 --> 00:07:05,609 exchange greatly, because what happens 203 00:07:05,610 --> 00:07:07,499 is that the infected machine 204 00:07:07,500 --> 00:07:08,459 asks the DNS server:
205 00:07:08,460 --> 00:07:10,709 Well, do you have the IP address for this 206 00:07:10,710 --> 00:07:12,029 gibberish address? 207 00:07:12,030 --> 00:07:14,249 The DNS server responds no. 208 00:07:14,250 --> 00:07:16,139 Well, how about this gibberish address? 209 00:07:16,140 --> 00:07:18,389 No. How about this gibberish address? 210 00:07:18,390 --> 00:07:20,729 No, sir. So how about this gibberish 211 00:07:20,730 --> 00:07:23,069 address? No, not today, sir. 212 00:07:23,070 --> 00:07:24,329 And this goes on and on. 213 00:07:24,330 --> 00:07:26,159 Here you can see a traffic capture 214 00:07:26,160 --> 00:07:27,839 depicting this process. 215 00:07:27,840 --> 00:07:30,029 You can see the repeated no such name, no 216 00:07:30,030 --> 00:07:31,379 such name, no such name. 217 00:07:31,380 --> 00:07:32,909 The DNS server says, what do you want 218 00:07:32,910 --> 00:07:34,889 from me? I have never heard of any one of 219 00:07:34,890 --> 00:07:37,409 those domains. Please stop bothering me. 220 00:07:37,410 --> 00:07:39,869 But recall 221 00:07:39,870 --> 00:07:41,789 that the C&C server had actually 222 00:07:41,790 --> 00:07:43,889 registered one of those domains. 223 00:07:43,890 --> 00:07:46,019 One of those domains is a valid domain 224 00:07:46,020 --> 00:07:48,179 that points to the IP address of 225 00:07:48,180 --> 00:07:50,789 the C&C server. So eventually 226 00:07:50,790 --> 00:07:52,529 the infected system is going to make the 227 00:07:52,530 --> 00:07:54,539 golden query, and the DNS server is going 228 00:07:54,540 --> 00:07:56,129 to excitedly jump up and down: 229 00:07:56,130 --> 00:07:57,749 Oh my God, I know this one. 230 00:07:57,750 --> 00:08:00,119 It reaches down to the drawer and pulls 231 00:08:00,120 --> 00:08:02,249 out the IP response, and the infected 232 00:08:02,250 --> 00:08:03,269 system is elated.
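The query loop described above, including the run of "no such name" answers before the one registered domain is hit, can be mimicked without touching the network. In this sketch the resolver is a plain dictionary standing in for DNS; `find_c2`, the domain names and the address (taken from the 203.0.113.0/24 documentation range) are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple


def find_c2(candidates: Iterable[str],
            resolve: Callable[[str], Optional[str]]) -> Tuple[Optional[str], int]:
    """Try each candidate domain in order.

    Returns (ip, nxdomain_count): the first address that resolves, plus the
    number of failed lookups that happened before it.
    """
    failures = 0
    for domain in candidates:
        ip = resolve(domain)
        if ip is not None:
            return ip, failures
        failures += 1  # the "No, sir. Not today, sir." case
    return None, failures


# Toy stand-in for DNS: only one of the generated names was registered.
registered = {"qxkzvbnt.com": "203.0.113.7"}
candidates = ["wjqpzrkd.com", "hxvtqmzl.com", "qxkzvbnt.com", "zzkqjwpx.com"]

ip, nxdomain_count = find_c2(candidates, registered.get)
# ip == "203.0.113.7", nxdomain_count == 2
```

The failure counter is the same quantity that a defender can observe from the outside: a host that racks up many failed lookups in a short time is behaving like the customer in the cheese shop.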
233 00:08:03,270 --> 00:08:04,919 Now it finally has the IP address of 234 00:08:04,920 --> 00:08:07,979 the C&C server as before, and contacts it 235 00:08:07,980 --> 00:08:09,689 as before. And all is well with the 236 00:08:09,690 --> 00:08:10,799 world. 237 00:08:10,800 --> 00:08:12,899 Well, you'll notice that I kept 238 00:08:12,900 --> 00:08:13,799 saying as before, 239 00:08:13,800 --> 00:08:15,719 as before, as before. What, all this work, 240 00:08:15,720 --> 00:08:18,539 all this bloated mechanism 241 00:08:18,540 --> 00:08:20,549 of DGA, just to get the same result as 242 00:08:20,550 --> 00:08:21,449 before? 243 00:08:21,450 --> 00:08:23,669 Well, not exactly, because think what 244 00:08:23,670 --> 00:08:25,559 you now have to do if you want to try to 245 00:08:25,560 --> 00:08:27,329 take down this 246 00:08:27,330 --> 00:08:28,679 infrastructure. 247 00:08:28,680 --> 00:08:31,529 Now, let's look at the domains generated 248 00:08:31,530 --> 00:08:34,109 in one day by this algorithm. 249 00:08:35,230 --> 00:08:37,418 That's a lot, and if 250 00:08:37,419 --> 00:08:38,739 you're trying to take down this 251 00:08:38,740 --> 00:08:41,079 infrastructure and you do not have access 252 00:08:41,080 --> 00:08:43,029 to the pseudorandom number 253 00:08:43,030 --> 00:08:45,069 generator, the algorithm, basically all 254 00:08:45,070 --> 00:08:45,979 of this is random to you. 255 00:08:45,980 --> 00:08:47,739 So every day you're going to have to 256 00:08:47,740 --> 00:08:49,449 chase down and hunt all of those 257 00:08:49,450 --> 00:08:51,549 addresses being queried 258 00:08:51,550 --> 00:08:52,659 all over the world. 259 00:08:52,660 --> 00:08:54,830 And that's not going to be easy. 260 00:08:56,540 --> 00:08:58,789 Now, if a 261 00:08:58,790 --> 00:09:01,069 takedown happens at all, here's the more 262 00:09:01,070 --> 00:09:03,169 likely scenario of how it's going to play 263 00:09:03,170 --> 00:09:05,209 out.
First, you have your victim, and your 264 00:09:05,210 --> 00:09:07,519 victim gets infected in some 265 00:09:07,520 --> 00:09:09,769 internal enterprise or something, and 266 00:09:09,770 --> 00:09:11,479 it contacts the server. 267 00:09:11,480 --> 00:09:13,549 And there's data exfiltration 268 00:09:13,550 --> 00:09:15,949 going on. And eventually someone 269 00:09:15,950 --> 00:09:18,019 wises up to this, and this hot 270 00:09:18,020 --> 00:09:20,059 potato gets thrown over to incident 271 00:09:20,060 --> 00:09:21,229 response. 272 00:09:21,230 --> 00:09:24,019 Now, incident response, 273 00:09:24,020 --> 00:09:26,329 they do 274 00:09:26,330 --> 00:09:28,279 whatever they can with this to try to put 275 00:09:28,280 --> 00:09:29,299 out this fire. 276 00:09:29,300 --> 00:09:31,429 And maybe, if we 277 00:09:31,430 --> 00:09:34,309 are lucky, they want to 278 00:09:34,310 --> 00:09:36,559 draw conclusions from this 279 00:09:36,560 --> 00:09:38,539 and make sure that the information that 280 00:09:38,540 --> 00:09:40,669 they've obtained is relevant and can be 281 00:09:40,670 --> 00:09:42,739 used later to prevent further 282 00:09:42,740 --> 00:09:44,629 attacks of this nature by the same family 283 00:09:44,630 --> 00:09:45,559 of malware. 284 00:09:45,560 --> 00:09:47,570 So if you're lucky, this 285 00:09:49,070 --> 00:09:51,379 incident response is buddy-buddy with 286 00:09:51,380 --> 00:09:52,999 middle management at some security 287 00:09:53,000 --> 00:09:55,129 vendor. So this gets passed over 288 00:09:55,130 --> 00:09:56,959 to middle management at some security 289 00:09:56,960 --> 00:09:59,749 vendor, where it either gets buried 290 00:09:59,750 --> 00:10:02,239 somewhere where the sun don't shine, or, 291 00:10:02,240 --> 00:10:03,919 if you're lucky, it gets passed on. 292 00:10:05,270 --> 00:10:06,859 Middle
Management passes this over 293 00:10:06,860 --> 00:10:08,959 to some reverse engineer who is 294 00:10:08,960 --> 00:10:11,209 going to spend like a few 295 00:10:11,210 --> 00:10:13,459 weeks or maybe a few months poring 296 00:10:13,460 --> 00:10:16,099 over this file in IDA Pro until, 297 00:10:16,100 --> 00:10:18,619 if you're lucky, this thing results 298 00:10:18,620 --> 00:10:20,929 in a report. The report 299 00:10:20,930 --> 00:10:22,489 either lies down at the bottom of the 300 00:10:22,490 --> 00:10:23,959 Internet, and no one pays attention to 301 00:10:23,960 --> 00:10:25,579 it, or, if you're lucky, a 302 00:10:25,580 --> 00:10:28,159 streamlined process or some kind soul 303 00:10:28,160 --> 00:10:30,379 is going to take this thing and make sure 304 00:10:30,380 --> 00:10:32,209 that the data about how this 305 00:10:32,210 --> 00:10:34,369 pseudorandom number generator works 306 00:10:34,370 --> 00:10:36,289 is incorporated into a firewall 307 00:10:36,290 --> 00:10:37,939 somewhere, which will actually, once this 308 00:10:37,940 --> 00:10:40,189 process is done, block any 309 00:10:40,190 --> 00:10:42,379 future traffic based on the same domain 310 00:10:42,380 --> 00:10:43,549 generation algorithm. 311 00:10:44,850 --> 00:10:46,529 And it's a streamlined process, what could 312 00:10:46,530 --> 00:10:47,530 possibly go wrong? 313 00:10:48,690 --> 00:10:50,969 But suppose that 314 00:10:50,970 --> 00:10:53,039 you had the ability to automatically 315 00:10:53,040 --> 00:10:54,729 detect DGA. 316 00:10:54,730 --> 00:10:57,029 Now you could aggressively cut out 317 00:10:57,030 --> 00:10:58,149 a lot of those middlemen. 318 00:10:58,150 --> 00:11:00,599 Now, a lot of the links here have 319 00:11:00,600 --> 00:11:01,949 better things to do with their time. 320 00:11:01,950 --> 00:11:03,509 Middle management, they 321 00:11:03,510 --> 00:11:04,979 have better things to do with their time.
322 00:11:04,980 --> 00:11:06,989 Reverse engineers, too. 323 00:11:06,990 --> 00:11:09,629 If you had the ability 324 00:11:09,630 --> 00:11:11,699 to automatically detect DGA 325 00:11:11,700 --> 00:11:13,859 traffic, you could theoretically put 326 00:11:13,860 --> 00:11:15,359 it straight in the firewall. 327 00:11:15,360 --> 00:11:17,009 The firewall is going to see the outgoing 328 00:11:17,010 --> 00:11:18,869 DGA traffic, and aggressively it's going 329 00:11:18,870 --> 00:11:20,519 to shut it down after like four or five 330 00:11:20,520 --> 00:11:22,919 queries and say, hey, I'm sorry, 331 00:11:22,920 --> 00:11:24,389 this traffic looks shady. 332 00:11:24,390 --> 00:11:26,969 It looks like you're not getting through. 333 00:11:26,970 --> 00:11:29,069 So automatically detecting DGAs 334 00:11:29,070 --> 00:11:30,449 is useful and it's cool. 335 00:11:30,450 --> 00:11:32,549 And as a consequence, there 336 00:11:32,550 --> 00:11:35,309 have been past attempts 337 00:11:35,310 --> 00:11:37,559 to solve this problem. 338 00:11:37,560 --> 00:11:38,999 And we're going to look at some of the 339 00:11:39,000 --> 00:11:41,099 past features that have been suggested 340 00:11:41,100 --> 00:11:42,419 to identify DGA. 341 00:11:42,420 --> 00:11:45,479 And we're going to talk a bit about 342 00:11:45,480 --> 00:11:47,939 how those features do not necessarily 343 00:11:47,940 --> 00:11:50,129 work 100 percent well 344 00:11:50,130 --> 00:11:52,829 all the time and how they could 345 00:11:52,830 --> 00:11:54,240 be improved, which is what we did. 346 00:11:55,830 --> 00:11:57,959 OK, let's talk about 347 00:11:57,960 --> 00:11:59,849 some ways to detect DGA. 348 00:11:59,850 --> 00:12:01,709 One way involves looking at character 349 00:12:01,710 --> 00:12:03,479 frequency. It's a well-known fact that 350 00:12:03,480 --> 00:12:05,399 some letters are more common in the 351 00:12:05,400 --> 00:12:07,059 English language than others.
352 00:12:07,060 --> 00:12:09,119 So if, let's say, I take the 353 00:12:09,120 --> 00:12:11,729 common letters in English and 354 00:12:11,730 --> 00:12:14,039 color-code them as green, and the borderline 355 00:12:14,040 --> 00:12:15,779 letters I color-code as yellow, and the 356 00:12:15,780 --> 00:12:17,969 rare letters I color-code as 357 00:12:17,970 --> 00:12:20,309 red, and then take five 358 00:12:20,310 --> 00:12:21,869 words that I randomly picked out from the 359 00:12:21,870 --> 00:12:24,239 dictionary and five 360 00:12:24,240 --> 00:12:26,729 gibberish segments of comparable 361 00:12:26,730 --> 00:12:29,039 length, you can at a glance 362 00:12:29,040 --> 00:12:31,109 tell which five words came 363 00:12:31,110 --> 00:12:32,909 from the dictionary and which five words 364 00:12:32,910 --> 00:12:35,789 didn't. So that's a useful feature. 365 00:12:35,790 --> 00:12:37,979 Another useful feature is 366 00:12:37,980 --> 00:12:38,879 along the same lines. 367 00:12:38,880 --> 00:12:40,949 It's based on frequencies 368 00:12:40,950 --> 00:12:43,269 of pairs of letters instead of single letters. 369 00:12:43,270 --> 00:12:45,599 Some pairs of consecutive letters 370 00:12:45,600 --> 00:12:47,309 in English are more common than others. 371 00:12:47,310 --> 00:12:48,419 T is common. 372 00:12:48,420 --> 00:12:50,939 It is not so common. 373 00:12:50,940 --> 00:12:53,009 And 374 00:12:53,010 --> 00:12:55,199 another feature is 375 00:12:55,200 --> 00:12:57,809 called the longest meaningful substring. 376 00:12:57,810 --> 00:12:59,909 It's been suggested in some research 377 00:12:59,910 --> 00:13:01,169 into this problem. 378 00:13:01,170 --> 00:13:02,939 It involves looking at your input, your 379 00:13:02,940 --> 00:13:05,459 domain name, and seeing what's the 380 00:13:05,460 --> 00:13:07,049 longest substring in there that you can 381 00:13:07,050 --> 00:13:08,999 actually find in the dictionary.
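The two frequency features mentioned above (single letters and pairs of letters) could be computed along these lines. This is a rough sketch under stated assumptions: the `FREQ` table holds approximate, commonly quoted English letter percentages rather than values from the talk, the reference word list is a tiny made-up sample, and the function names are hypothetical.

```python
import math
from collections import Counter

# Approximate English letter frequencies in percent (assumed reference values).
FREQ = {
    "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7, "s": 6.3,
    "h": 6.1, "r": 6.0, "d": 4.3, "l": 4.0, "c": 2.8, "u": 2.8, "m": 2.4,
    "w": 2.4, "f": 2.2, "g": 2.0, "y": 2.0, "p": 1.9, "b": 1.5, "v": 1.0,
    "k": 0.8, "j": 0.15, "x": 0.15, "q": 0.10, "z": 0.07,
}


def letter_score(name: str) -> float:
    """Mean log-probability of the letters; higher means more English-like."""
    letters = [c for c in name.lower() if c in FREQ]
    if not letters:
        return float("-inf")
    return sum(math.log(FREQ[c] / 100) for c in letters) / len(letters)


def bigram_counts(words: list[str]) -> Counter:
    """Count adjacent letter pairs in a reference word list."""
    counts: Counter = Counter()
    for word in words:
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts


def seen_bigram_ratio(name: str, counts: Counter) -> float:
    """Fraction of the name's letter pairs that occur in the reference list."""
    pairs = [name[i:i + 2] for i in range(len(name) - 1)]
    return sum(1 for p in pairs if p in counts) / len(pairs) if pairs else 0.0


reference = bigram_counts(["the", "there", "then", "other", "internet", "network"])
english_like = letter_score("internet") > letter_score("xqzjvkwq")  # True
```

A real detector would build the pair counts from a large corpus and would probably weight pairs by how often they occur instead of only checking whether they were seen at all.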
382 00:13:09,000 --> 00:13:11,219 So bugmenot contains the words bug and 383 00:13:11,220 --> 00:13:13,679 not, so the longest 384 00:13:13,680 --> 00:13:15,959 meaningful substring is of length three. 385 00:13:15,960 --> 00:13:18,029 Amazon is actually a word in the 386 00:13:18,030 --> 00:13:19,859 dictionary. So this is definitely going 387 00:13:19,860 --> 00:13:21,839 to register as not gibberish by this 388 00:13:21,840 --> 00:13:23,989 metric. eBay contains the word bay, 389 00:13:23,990 --> 00:13:26,759 so it's like three quarters not gibberish, 390 00:13:26,760 --> 00:13:28,529 and this actual gibberish is really 391 00:13:28,530 --> 00:13:30,209 gibberish. I can't find any word in the 392 00:13:30,210 --> 00:13:32,339 dictionary in this thing. 393 00:13:32,340 --> 00:13:34,439 So, another useful feature. 394 00:13:34,440 --> 00:13:36,569 And the last feature that has 395 00:13:36,570 --> 00:13:38,819 been suggested in the past that 396 00:13:38,820 --> 00:13:40,379 I want to talk about is NXDOMAIN. 397 00:13:40,380 --> 00:13:42,059 Remember, just like the cheese shop, 398 00:13:42,060 --> 00:13:44,069 there are the repeated no, sir, no, sir, 399 00:13:44,070 --> 00:13:45,929 not today, sir. No such name, no such 400 00:13:45,930 --> 00:13:47,249 name. Just counting 401 00:13:47,250 --> 00:13:49,709 these is a useful feature. 402 00:13:49,710 --> 00:13:51,779 So we have all of those 403 00:13:51,780 --> 00:13:53,789 suggested useful features for detecting 404 00:13:53,790 --> 00:13:55,769 DGA, and we are done, right? 405 00:13:55,770 --> 00:13:57,059 The problem is solved and we can go 406 00:13:57,060 --> 00:13:58,060 shopping. 407 00:13:58,950 --> 00:14:01,619 Well, not exactly, 408 00:14:01,620 --> 00:14:03,629 as you might have imagined. 409 00:14:03,630 --> 00:14:05,909 And the first issue 410 00:14:05,910 --> 00:14:07,979 with what I just said is what 411 00:14:07,980 --> 00:14:10,349 I like to call the Tumblr conundrum.
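For reference, the longest meaningful substring feature can be sketched in a few lines. The word set here is a tiny hypothetical stand-in for a real dictionary, and the function name is an assumption; a real word list would also contain many short words, which changes the scores.

```python
def longest_meaningful_substring(name: str, dictionary: set[str]) -> str:
    """Return the longest substring of `name` found in `dictionary` ('' if none)."""
    n = len(name)
    # Try the longest candidates first, so the first hit is the answer.
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            candidate = name[start:start + length]
            if candidate in dictionary:
                return candidate
    return ""


# Toy dictionary, assumed for illustration only.
WORDS = {"bug", "me", "not", "amazon", "bay", "go", "ogle", "red"}

for name in ["bugmenot", "amazon", "ebay", "tumblr", "xkqzvwpj"]:
    match = longest_meaningful_substring(name, WORDS)
    print(name, repr(match), len(match) / len(name))
```

With this toy word set, `ebay` scores 0.75 and `tumblr` scores 0, which is exactly the weakness the speaker goes on to describe: a name that people read without trouble is rated the same as random letters.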
412 00:14:10,350 --> 00:14:12,449 And I mean, let's look at Reddit. 413 00:14:12,450 --> 00:14:14,399 Reddit, like, there's the word red and 414 00:14:14,400 --> 00:14:16,139 there's the word dit, that's, as most of you 415 00:14:16,140 --> 00:14:18,749 know, the little dot from Morse code. 416 00:14:18,750 --> 00:14:20,309 And there's that. 417 00:14:20,310 --> 00:14:21,989 If you look at Google, it contains the 418 00:14:21,990 --> 00:14:24,269 word go and, my God, we lucked 419 00:14:24,270 --> 00:14:26,219 out, ogle is a word. 420 00:14:26,220 --> 00:14:28,619 So the longest meaningful substring 421 00:14:28,620 --> 00:14:30,779 criterion is going to look at Google 422 00:14:30,780 --> 00:14:33,089 and say, OK, go, ogle, 423 00:14:33,090 --> 00:14:35,159 seems legit. But 424 00:14:35,160 --> 00:14:36,809 let's look at Tumblr. 425 00:14:36,810 --> 00:14:38,399 What is Tumblr? 426 00:14:38,400 --> 00:14:39,749 You're not going to find this in the 427 00:14:39,750 --> 00:14:40,829 dictionary. 428 00:14:40,830 --> 00:14:42,409 I mean, Tumblr is not a word. 429 00:14:42,410 --> 00:14:44,319 No substring of it is, whatever. 430 00:14:44,320 --> 00:14:46,649 Tumbl is not a word, umblr, not a word, 431 00:14:46,650 --> 00:14:49,019 mblr, still not a word, whatever. 432 00:14:49,020 --> 00:14:51,479 And so 433 00:14:51,480 --> 00:14:53,609 there you have one issue, because 434 00:14:53,610 --> 00:14:55,889 the longest meaningful substring criterion 435 00:14:55,890 --> 00:14:57,359 is going to look at this and say this is 436 00:14:57,360 --> 00:14:59,579 gibberish. And if you got a human 437 00:14:59,580 --> 00:15:01,109 to take a look at this, the human is not 438 00:15:01,110 --> 00:15:03,599 going to be so hasty to say 439 00:15:03,600 --> 00:15:05,249 that this is gibberish, and we're going to 440 00:15:05,250 --> 00:15:06,329 touch later on 441 00:15:06,330 --> 00:15:08,519 the reason why. That's 442 00:15:08,520 --> 00:15:09,539 one issue.
443 00:15:09,540 --> 00:15:12,059 The second issue is Kwyjibo. 444 00:15:12,060 --> 00:15:14,389 Now, Kwyjibo is 445 00:15:14,390 --> 00:15:16,529 a DGA engine that surfaced 446 00:15:16,530 --> 00:15:17,549 a few years ago. 447 00:15:18,840 --> 00:15:21,419 And it draws its name from an incident 448 00:15:21,420 --> 00:15:23,549 in an episode of The Simpsons, where Bart 449 00:15:23,550 --> 00:15:25,859 is playing Scrabble against Homer 450 00:15:25,860 --> 00:15:28,019 and Bart is stuck with the list 451 00:15:28,020 --> 00:15:29,819 of letters that you see up there at the 452 00:15:29,820 --> 00:15:30,749 top of the slide. 453 00:15:30,750 --> 00:15:32,549 And he doesn't know what to do until he 454 00:15:32,550 --> 00:15:34,079 says, well, you know what? 455 00:15:34,080 --> 00:15:35,899 I'm going to put on the board the word 456 00:15:35,900 --> 00:15:36,989 Kwyjibo. 457 00:15:36,990 --> 00:15:38,189 And he plays that word. 458 00:15:38,190 --> 00:15:39,659 And of course, it's for four billion 459 00:15:39,660 --> 00:15:41,309 points because he used all these letters 460 00:15:41,310 --> 00:15:42,719 and on the triple word square and so 461 00:15:42,720 --> 00:15:44,169 forth and so on. 462 00:15:44,170 --> 00:15:46,279 And Homer is not happy, and 463 00:15:46,280 --> 00:15:47,919 he asks Bart, what is this word? And 464 00:15:47,920 --> 00:15:50,049 Bart, without blinking an eye, he says, 465 00:15:50,050 --> 00:15:52,239 well, Kwyjibo, it means a stupid North 466 00:15:52,240 --> 00:15:53,320 American yellow ape. 467 00:15:54,520 --> 00:15:55,520 So 468 00:15:56,590 --> 00:15:58,899 much like Bart was 469 00:15:58,900 --> 00:16:01,389 able to pass this under the radar because 470 00:16:01,390 --> 00:16:03,369 Kwyjibo sounds like a word, even though 471 00:16:03,370 --> 00:16:04,869 it's not a word, 472 00:16:04,870 --> 00:16:07,179 Kwyjibo,
the generator, passes 473 00:16:07,180 --> 00:16:09,249 domain names under the radar because they 474 00:16:09,250 --> 00:16:12,069 sound like words, but they're not words. 475 00:16:12,070 --> 00:16:13,449 What Kwyjibo does, and this is 476 00:16:13,450 --> 00:16:15,699 stupidly simple, is make sure that 477 00:16:15,700 --> 00:16:18,009 every other letter in its 478 00:16:18,010 --> 00:16:20,869 output domains is a vowel. 479 00:16:20,870 --> 00:16:23,169 Now you're sitting there and thinking, 480 00:16:23,170 --> 00:16:25,689 Ben, look, just this, 481 00:16:25,690 --> 00:16:27,879 every other letter is a vowel, and all of 482 00:16:27,880 --> 00:16:29,319 the features that you talked about 483 00:16:29,320 --> 00:16:31,029 earlier are now suddenly useless? 484 00:16:31,030 --> 00:16:33,309 Well, let's 485 00:16:33,310 --> 00:16:35,749 look at the letter frequencies. 486 00:16:35,750 --> 00:16:37,809 Earlier, the gibberish generated 487 00:16:37,810 --> 00:16:39,699 by domain generation algorithms 488 00:16:39,700 --> 00:16:41,799 contained lots of letters, 489 00:16:41,800 --> 00:16:43,779 X's and Z's and J's. 490 00:16:43,780 --> 00:16:46,539 Now, if you average out the frequencies 491 00:16:46,540 --> 00:16:47,769 of letters you're going to encounter, 492 00:16:47,770 --> 00:16:49,689 suddenly it looks much more peachy 493 00:16:49,690 --> 00:16:51,939 because you have vowels everywhere 494 00:16:51,940 --> 00:16:53,619 and vowels are common letters. 495 00:16:53,620 --> 00:16:55,059 You're telling me, OK, let's look at the 496 00:16:55,060 --> 00:16:55,959 pairs of letters, 497 00:16:55,960 --> 00:16:58,119 the bigrams. It's the same thing. 498 00:16:58,120 --> 00:17:00,279 Pairs of letters with vowels in them 499 00:17:00,280 --> 00:17:02,349 are very, very common. 500 00:17:02,350 --> 00:17:04,568 Not all of them, but a lot of them.
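The "every other letter is a vowel" trick, as the speaker summarizes it, is easy to reproduce. The sketch below follows only that one-sentence description; it is not the actual Kwyjibo algorithm, and `pronounceable_label` and its parameters are hypothetical names.

```python
import random

VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"


def pronounceable_label(length: int, rng: random.Random) -> str:
    """Alternate consonant, vowel, consonant, vowel... for `length` letters."""
    return "".join(
        rng.choice(CONSONANTS if i % 2 == 0 else VOWELS)
        for i in range(length)
    )


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [pronounceable_label(8, rng) + ".com" for _ in range(5)]

# About half of every label is vowels, so average letter frequencies
# look far more English-like than uniformly random letters would.
vowel_share = sum(c in VOWELS for c in samples[0][:8]) / 8  # 0.5
```

Because half of all positions are drawn from five very common letters, single-letter and letter-pair statistics move a long way toward those of real words, which is the effect described above.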
501 00:17:04,569 --> 00:17:06,818 And triples of letters are going to run 502 00:17:06,819 --> 00:17:09,039 into more or less the exact same issue. 503 00:17:09,040 --> 00:17:11,348 So the letter frequencies 504 00:17:11,349 --> 00:17:13,689 approach is now going to be significantly 505 00:17:13,690 --> 00:17:15,009 weaker than it was before. 506 00:17:16,470 --> 00:17:17,939 And how about the longest meaningful 507 00:17:17,940 --> 00:17:18,868 substring? 508 00:17:18,869 --> 00:17:20,969 Well, you can play a game and 509 00:17:20,970 --> 00:17:22,679 you can look at the domains listed here, 510 00:17:22,680 --> 00:17:24,659 which I swear I pulled randomly from 511 00:17:24,660 --> 00:17:26,848 a Kwyjibo page, and start 512 00:17:26,849 --> 00:17:29,039 looking for words. By skimming 513 00:17:29,040 --> 00:17:31,649 this, I found "give" and "nope". 514 00:17:31,650 --> 00:17:33,479 And, I swear I did not plan this in 515 00:17:33,480 --> 00:17:35,069 advance, "gaited", which is very 516 00:17:35,070 --> 00:17:36,390 appropriate for this conference. 517 00:17:38,220 --> 00:17:40,379 So the point is that now 518 00:17:40,380 --> 00:17:43,049 your features that seemed so strong before, 519 00:17:43,050 --> 00:17:45,149 they're now, they're 520 00:17:45,150 --> 00:17:46,439 half useful. 521 00:17:46,440 --> 00:17:48,989 And if you take half useful features 522 00:17:48,990 --> 00:17:50,699 and feed them into a machine learning 523 00:17:50,700 --> 00:17:52,199 algorithm, you're going to get a 524 00:17:52,200 --> 00:17:54,329 result that's a total loss. 525 00:17:54,330 --> 00:17:56,849 So because of 526 00:17:56,850 --> 00:17:57,960 all of those problems. 527 00:18:01,990 --> 00:18:03,130 OK, let's improvise.
528 00:18:04,550 --> 00:18:06,649 Because of all of those problems, we came 529 00:18:06,650 --> 00:18:08,869 up with a pretty idea for 530 00:18:08,870 --> 00:18:10,939 a solution, a pretty nice theoretical 531 00:18:10,940 --> 00:18:13,159 idea that involves looking 532 00:18:13,160 --> 00:18:15,379 at the input and deciding how close 533 00:18:15,380 --> 00:18:17,689 it is to a concatenation 534 00:18:17,690 --> 00:18:19,219 of words from the dictionary. 535 00:18:19,220 --> 00:18:21,349 So now Tumblr, that was 536 00:18:21,350 --> 00:18:23,659 complete nonsense before, can 537 00:18:23,660 --> 00:18:26,299 with just one edit turn into tumbler, 538 00:18:26,300 --> 00:18:27,679 which is actually a word from the 539 00:18:27,680 --> 00:18:28,680 dictionary. 540 00:18:29,180 --> 00:18:30,980 We have several here. 541 00:18:35,150 --> 00:18:36,150 That was Youthwork. 542 00:18:38,860 --> 00:18:40,569 Oh, excellent. 543 00:18:40,570 --> 00:18:42,609 So we inserted just one letter in 544 00:18:42,610 --> 00:18:44,949 Tumblr and got tumbler, which is a word. Google with two 545 00:18:44,950 --> 00:18:46,689 edits becomes googol, a word from the 546 00:18:46,690 --> 00:18:48,579 dictionary. That's not a coincidence. 547 00:18:48,580 --> 00:18:50,049 This is the word that inspired the company 548 00:18:50,050 --> 00:18:52,239 name. Reddit with two edits becomes 549 00:18:52,240 --> 00:18:54,639 read it, and now we can finally make sense 550 00:18:54,640 --> 00:18:56,169 of those domain names. 551 00:18:56,170 --> 00:18:58,059 And as for Kwyjibo, Kwyjibo 552 00:18:58,060 --> 00:18:59,829 generates strings of gibberish. 553 00:18:59,830 --> 00:19:01,959 And sometimes it's going to luck out 554 00:19:01,960 --> 00:19:04,869 and create a word 555 00:19:04,870 --> 00:19:06,489 that's out of the dictionary, but it's 556 00:19:06,490 --> 00:19:07,899 not going to successfully create a 557 00:19:07,900 --> 00:19:09,729 concatenation of actual words.
558 00:19:09,730 --> 00:19:11,949 One word might be in there, but 559 00:19:11,950 --> 00:19:13,209 you're not going to be able to look at 560 00:19:13,210 --> 00:19:15,249 the whole thing and make sense of it 561 00:19:15,250 --> 00:19:18,069 through the lens of this metric. 562 00:19:18,070 --> 00:19:20,439 So the way forward 563 00:19:20,440 --> 00:19:21,189 seems clear. 564 00:19:21,190 --> 00:19:23,359 Step one, we measure the distance of 565 00:19:23,360 --> 00:19:25,539 the input from a 566 00:19:25,540 --> 00:19:27,429 concatenation of dictionary words, any 567 00:19:27,430 --> 00:19:30,039 concatenation, as long as we can reach it, 568 00:19:30,040 --> 00:19:31,959 using the criterion of edit distance. 569 00:19:31,960 --> 00:19:33,849 I skipped over this, but the edit 570 00:19:33,850 --> 00:19:36,009 distance between a word and another 571 00:19:36,010 --> 00:19:38,109 word is the minimum number of insertions 572 00:19:38,110 --> 00:19:39,819 of letters, deletions of letters and 573 00:19:39,820 --> 00:19:41,709 edits of letters that you need to get 574 00:19:41,710 --> 00:19:43,749 from that word to the other one. 575 00:19:43,750 --> 00:19:46,149 So we look at the minimum edit distance 576 00:19:46,150 --> 00:19:48,039 between our input and a concatenation of 577 00:19:48,040 --> 00:19:50,049 words from the dictionary, and then a 578 00:19:50,050 --> 00:19:50,949 miracle occurs here. 579 00:19:50,950 --> 00:19:52,959 And then we profit, because this new 580 00:19:52,960 --> 00:19:54,099 criterion is going to defeat 581 00:19:54,100 --> 00:19:55,689 Kwyjibo and mitigate the Tumblr 582 00:19:55,690 --> 00:19:57,189 problem, because the word Tumblr is now 583 00:19:57,190 --> 00:19:58,989 going to suddenly make sense to us and 584 00:19:58,990 --> 00:20:00,309 we're going to win the day.
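For two fixed strings, the edit distance defined above (insertions, deletions and in-place edits) can be computed with the textbook dynamic-programming recurrence. The sketch below only covers that two-string case and is not taken from the talk; measuring the distance to a whole dictionary of concatenated words is the harder problem the speaker turns to next. The function name is an assumption.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            substitution_cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,                      # delete ca
                curr[j - 1] + 1,                  # insert cb
                prev[j - 1] + substitution_cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("tumblr", "tumbler"))
```

On the example from the talk, a single inserted letter separates "tumblr" from the dictionary word "tumbler".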
585 00:20:00,310 --> 00:20:02,079 So we were super happy and then we 586 00:20:02,080 --> 00:20:03,909 actually tried to implement this thing in 587 00:20:03,910 --> 00:20:04,910 practice, 588 00:20:07,600 --> 00:20:10,059 where we ran into trouble. 589 00:20:10,060 --> 00:20:11,289 Why did we run into trouble? 590 00:20:11,290 --> 00:20:12,669 Well, let's look at how you actually 591 00:20:12,670 --> 00:20:15,279 perform this 592 00:20:15,280 --> 00:20:17,379 computation of edit distance. 593 00:20:17,380 --> 00:20:19,149 The canonical algorithm to do this is 594 00:20:19,150 --> 00:20:20,169 called a flood search. 595 00:20:20,170 --> 00:20:21,699 What you basically do is that you take 596 00:20:21,700 --> 00:20:23,799 your input and you perform a breadth 597 00:20:23,800 --> 00:20:25,899 first search in the space of 598 00:20:25,900 --> 00:20:27,969 possible strings by performing every 599 00:20:27,970 --> 00:20:29,199 edit that you can think of, which is 600 00:20:29,200 --> 00:20:31,419 basically a stupid brute force. 601 00:20:31,420 --> 00:20:33,579 So as the number 602 00:20:33,580 --> 00:20:35,139 of edits that you're willing to search 603 00:20:35,140 --> 00:20:37,449 for grows and grows, the size 604 00:20:37,450 --> 00:20:39,189 of the space that you're going to search 605 00:20:39,190 --> 00:20:40,869 grows exponentially. 606 00:20:40,870 --> 00:20:43,569 If you have input of size eight 607 00:20:43,570 --> 00:20:45,609 and you are actually willing to search it 608 00:20:45,610 --> 00:20:47,379 out exhaustively and see how close to the 609 00:20:47,380 --> 00:20:49,719 dictionary it is, then 610 00:20:49,720 --> 00:20:51,069 you're going to need to do, 611 00:20:51,070 --> 00:20:53,319 I don't remember the number by heart, but 612 00:20:53,320 --> 00:20:55,839 it's a large number of lookups.
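The flood search described above can be sketched as a breadth-first search over strings, one edit per level. This is an illustrative sketch under assumptions, not the speaker's implementation: the lowercase a to z alphabet, the function names, and the `max_edits` cutoff are all hypothetical choices.

```python
import string

ALPHABET = string.ascii_lowercase


def one_edit_neighbors(s):
    """All strings reachable from s by one insertion, deletion or substitution."""
    out = set()
    for i in range(len(s) + 1):
        for ch in ALPHABET:
            out.add(s[:i] + ch + s[i:])      # insertion
    for i in range(len(s)):
        out.add(s[:i] + s[i + 1:])           # deletion
        for ch in ALPHABET:
            out.add(s[:i] + ch + s[i + 1:])  # substitution
    out.discard(s)  # substituting a letter with itself is not an edit
    return out


def flood_search(s, dictionary, max_edits):
    """Smallest number of edits (up to max_edits) from s to a dictionary word, else None."""
    if s in dictionary:
        return 0
    seen = {s}
    frontier = {s}
    for depth in range(1, max_edits + 1):
        next_frontier = set()
        for candidate in frontier:
            for neighbor in one_edit_neighbors(candidate):
                if neighbor in dictionary:
                    return depth
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return None


print(len(one_edit_neighbors("abcdefgh")))
print(flood_search("tumblr", {"tumbler"}, max_edits=2))
```

An eight-letter input already has several hundred neighbors at one edit, and every extra level multiplies the frontier by roughly that factor again, which is the exponential growth the speaker describes.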
613 00:20:55,840 --> 00:20:57,969 So let's say that one lookup 614 00:20:57,970 --> 00:20:59,919 into the dictionary is going to take one 615 00:20:59,920 --> 00:21:01,479 microsecond. I imagine that's in the 616 00:21:01,480 --> 00:21:02,859 right ballpark. 617 00:21:02,860 --> 00:21:04,989 So we take the number of lookups and we 618 00:21:04,990 --> 00:21:07,419 multiply it by the number of seconds 619 00:21:07,420 --> 00:21:08,409 that we can expect 620 00:21:08,410 --> 00:21:09,669 one lookup to take. 621 00:21:09,670 --> 00:21:11,799 And now we reach the conclusion that 622 00:21:11,800 --> 00:21:13,869 in order to get our answer 623 00:21:13,870 --> 00:21:16,419 for how close to the dictionary 624 00:21:16,420 --> 00:21:18,519 is the input that we have 625 00:21:18,520 --> 00:21:20,709 on our hands, one input, one domain 626 00:21:20,710 --> 00:21:22,779 name, we're going to have to simply plug 627 00:21:22,780 --> 00:21:24,939 it into the algorithm and sit and 628 00:21:24,940 --> 00:21:26,859 wait for two and a half days. 629 00:21:29,850 --> 00:21:30,850 Two and a half days. 630 00:21:32,610 --> 00:21:34,349 Oh, by the way, that's a lower bound, the 631 00:21:34,350 --> 00:21:35,639 number of lookups that I mentioned 632 00:21:35,640 --> 00:21:37,109 earlier. It's not the actual number of 633 00:21:37,110 --> 00:21:38,969 lookups. The actual number of lookups is 634 00:21:38,970 --> 00:21:41,129 greater. Every calculation you see 635 00:21:41,130 --> 00:21:42,569 in this presentation, it's a back of 636 00:21:42,570 --> 00:21:44,519 the envelope combinatorics calculation. 637 00:21:44,520 --> 00:21:46,589 Yes, it's not exactly precise, just to 638 00:21:46,590 --> 00:21:48,029 give you an idea. 639 00:21:48,030 --> 00:21:49,409 So that was a lower bound.
640 00:21:49,410 --> 00:21:51,749 Oh, and by the way, we were implicitly 641 00:21:51,750 --> 00:21:53,909 comparing words against a 642 00:21:53,910 --> 00:21:55,979 list, a set of all 643 00:21:55,980 --> 00:21:57,959 possible concatenations of dictionary 644 00:21:57,960 --> 00:22:00,299 words. And there's an infinite number 645 00:22:00,300 --> 00:22:02,609 of them. Laptop paper, 646 00:22:02,610 --> 00:22:04,469 concatenation of words. Bottle 647 00:22:04,470 --> 00:22:06,599 chairperson podium 648 00:22:06,600 --> 00:22:08,669 style, concatenation of words. There's an 649 00:22:08,670 --> 00:22:09,719 infinite number of them. 650 00:22:09,720 --> 00:22:11,999 So the problem is that 651 00:22:12,000 --> 00:22:14,039 Google's infinite servers project isn't 652 00:22:14,040 --> 00:22:15,899 going public for another half a year. 653 00:22:15,900 --> 00:22:17,789 So we're not going to be able to take 654 00:22:17,790 --> 00:22:19,319 this infinite database and fit it 655 00:22:19,320 --> 00:22:20,459 anywhere. 656 00:22:20,460 --> 00:22:22,829 So at this point, we realized that 657 00:22:22,830 --> 00:22:25,829 we're going to have to improvise. 658 00:22:25,830 --> 00:22:27,840 So we needed to 659 00:22:29,580 --> 00:22:31,799 come up with some ugly hacks and cut some 660 00:22:31,800 --> 00:22:34,139 corners on this insane computation to 661 00:22:34,140 --> 00:22:36,299 make it actually feasible to compute 662 00:22:36,300 --> 00:22:37,619 this thing. 663 00:22:37,620 --> 00:22:39,839 So the first idea that we came 664 00:22:39,840 --> 00:22:42,059 up with, it was like a first aid 665 00:22:42,060 --> 00:22:44,429 solution, is to use a greedy 666 00:22:44,430 --> 00:22:46,439 algorithm.
And instead of trying to match 667 00:22:46,440 --> 00:22:48,809 the whole input against the 668 00:22:48,810 --> 00:22:50,879 set of all possible concatenations 669 00:22:50,880 --> 00:22:53,129 of dictionary words, try to look for 670 00:22:53,130 --> 00:22:55,259 prefixes of the input and match them 671 00:22:55,260 --> 00:22:57,719 against the plain old dictionary. 672 00:22:57,720 --> 00:22:59,909 Now, first of all, by using 673 00:22:59,910 --> 00:23:02,009 this imprecise approximation, you 674 00:23:02,010 --> 00:23:04,109 can expect your running time to drop by 675 00:23:04,110 --> 00:23:06,269 a lot because the length 676 00:23:06,270 --> 00:23:08,459 of the input that you feed into the 677 00:23:08,460 --> 00:23:11,219 edit distance algorithm 678 00:23:11,220 --> 00:23:13,319 has an exponential relation, as I said 679 00:23:13,320 --> 00:23:15,749 earlier, to the time you can expect 680 00:23:15,750 --> 00:23:17,249 the search to take. 681 00:23:17,250 --> 00:23:18,809 So that's first of all. You can see the 682 00:23:18,810 --> 00:23:20,279 numbers here. The difference is drastic. 683 00:23:20,280 --> 00:23:21,689 But the more important thing is that now 684 00:23:21,690 --> 00:23:23,769 we're actually comparing against 685 00:23:23,770 --> 00:23:26,069 a finite dictionary and that's 686 00:23:26,070 --> 00:23:28,499 progress. Now, it's not all sunshine 687 00:23:28,500 --> 00:23:30,869 and rainbows because the expected 688 00:23:30,870 --> 00:23:32,339 time using the back of the envelope 689 00:23:32,340 --> 00:23:34,649 calculation here is still, what, 690 00:23:34,650 --> 00:23:36,749 like an hour and something 691 00:23:36,750 --> 00:23:38,489 for one traffic capture. That's 692 00:23:38,490 --> 00:23:40,019 unacceptable.
693 00:23:40,020 --> 00:23:42,389 We're not going to sit there for an hour 694 00:23:42,390 --> 00:23:44,339 and wait for the computation on a single 695 00:23:44,340 --> 00:23:45,630 traffic capture to go through. 696 00:23:46,770 --> 00:23:49,109 So we need more ugly hacks. 697 00:23:49,110 --> 00:23:51,179 The second ugly hack 698 00:23:51,180 --> 00:23:53,639 involves 699 00:23:53,640 --> 00:23:55,409 looking at the classical approach of 700 00:23:55,410 --> 00:23:57,209 symmetric search. What I talked about 701 00:23:57,210 --> 00:23:58,619 earlier, that's just the stupid 702 00:23:58,620 --> 00:24:01,099 breadth first search. 703 00:24:01,100 --> 00:24:03,629 Now, what would happen if I took 704 00:24:03,630 --> 00:24:05,759 the dictionary that we have and I 705 00:24:05,760 --> 00:24:08,009 bloated it to contain 706 00:24:08,010 --> 00:24:10,109 all strings of letters that are 707 00:24:10,110 --> 00:24:12,569 within an edit distance of two 708 00:24:12,570 --> 00:24:14,219 of the original dictionary? 709 00:24:14,220 --> 00:24:15,749 What's going to happen? 710 00:24:15,750 --> 00:24:17,039 Well, you're going to have a larger 711 00:24:17,040 --> 00:24:18,869 dictionary. But the interesting thing 712 00:24:18,870 --> 00:24:20,939 that's going to happen is that you will 713 00:24:20,940 --> 00:24:23,249 now be able to do a much 714 00:24:23,250 --> 00:24:25,679 smaller flood search for the same result, 715 00:24:25,680 --> 00:24:27,779 because think about it, if your input is 716 00:24:27,780 --> 00:24:29,879 at an edit distance of four from the 717 00:24:29,880 --> 00:24:31,979 original dictionary, then it's within 718 00:24:31,980 --> 00:24:34,349 an edit distance of two of 719 00:24:34,350 --> 00:24:36,489 something that's within 720 00:24:36,490 --> 00:24:38,669 an edit distance of two of the original 721 00:24:38,670 --> 00:24:39,599 dictionary.
722 00:24:39,600 --> 00:24:41,699 So with just two edits, you're 723 00:24:41,700 --> 00:24:43,439 going to be able to match up against that 724 00:24:43,440 --> 00:24:45,809 something. Now, your flood search, the 725 00:24:45,810 --> 00:24:47,939 input to the flood search is the 726 00:24:47,940 --> 00:24:50,039 same length, but the number of 727 00:24:50,040 --> 00:24:52,469 edits that you're expected to make is 728 00:24:52,470 --> 00:24:53,849 much smaller. 729 00:24:53,850 --> 00:24:56,009 So you cut the exponent and 730 00:24:56,010 --> 00:24:58,319 now suddenly the running time is 731 00:24:58,320 --> 00:25:00,449 drastically reduced. But the 732 00:25:00,450 --> 00:25:02,189 downside of this is that now your 733 00:25:02,190 --> 00:25:04,469 dictionary is larger, much 734 00:25:04,470 --> 00:25:06,149 larger. The back of the envelope 735 00:25:06,150 --> 00:25:07,859 calculation shows like by a factor of 736 00:25:07,860 --> 00:25:10,439 10000. So it's unwieldy, 737 00:25:10,440 --> 00:25:12,539 but we can still live with it. 738 00:25:12,540 --> 00:25:13,859 So that's still not enough. 739 00:25:13,860 --> 00:25:16,049 Let's go on to the third ugly hack. 740 00:25:16,050 --> 00:25:17,909 The third ugly hack is not so ugly, 741 00:25:17,910 --> 00:25:20,159 actually. It involves looking 742 00:25:20,160 --> 00:25:22,109 at the distance measure that we are 743 00:25:22,110 --> 00:25:24,629 using. It's the edit distance. 744 00:25:24,630 --> 00:25:27,089 Now, what would happen if we disallowed 745 00:25:27,090 --> 00:25:29,939 in-place editing of letters? 746 00:25:29,940 --> 00:25:32,129 We only allowed insertions and 747 00:25:32,130 --> 00:25:33,130 deletions. 748 00:25:34,560 --> 00:25:36,749 Now, on the face of it, the metric that 749 00:25:36,750 --> 00:25:38,909 you're going to get is not much 750 00:25:38,910 --> 00:25:40,709 less legitimate than the edit distance, 751 00:25:40,710 --> 00:25:42,179 actually.
There's no reason to think that 752 00:25:42,180 --> 00:25:43,439 it's less legitimate. 753 00:25:43,440 --> 00:25:45,719 It's actually bounded: the 754 00:25:45,720 --> 00:25:46,619 insertion-deletion 755 00:25:46,620 --> 00:25:48,689 distance between two strings of letters 756 00:25:48,690 --> 00:25:51,029 is bounded between the edit distance and twice 757 00:25:51,030 --> 00:25:52,829 the edit distance. The proof of this is left 758 00:25:52,830 --> 00:25:54,569 as an exercise to the reader. 759 00:25:54,570 --> 00:25:57,089 And now 760 00:25:57,090 --> 00:25:59,279 we switched one criterion 761 00:25:59,280 --> 00:26:00,959 for another criterion. 762 00:26:00,960 --> 00:26:03,179 And the two criteria 763 00:26:03,180 --> 00:26:05,399 are more or less, we think that they're 764 00:26:05,400 --> 00:26:07,289 legitimate to the same degree. 765 00:26:07,290 --> 00:26:09,029 But now a flood search is going to 766 00:26:09,030 --> 00:26:10,859 contain far fewer options to iterate 767 00:26:10,860 --> 00:26:12,479 through because all the in-place edits 768 00:26:12,480 --> 00:26:13,979 are gone. You don't have to worry about 769 00:26:13,980 --> 00:26:14,939 them anymore. 770 00:26:14,940 --> 00:26:17,909 So all of those ugly hacks are nice. 771 00:26:17,910 --> 00:26:20,189 But the nice thing really 772 00:26:20,190 --> 00:26:22,619 is not the individual ugly hacks, 773 00:26:22,620 --> 00:26:24,539 but the way that they come together to 774 00:26:24,540 --> 00:26:26,339 form something greater than the sum of 775 00:26:26,340 --> 00:26:27,899 its parts. 776 00:26:27,900 --> 00:26:29,609 Now, let's look at 777 00:26:29,610 --> 00:26:31,499 the symmetric search that I talked about 778 00:26:31,500 --> 00:26:33,449 earlier and how it can combine with 779 00:26:33,450 --> 00:26:35,609 insertion-deletion distance. Let's 780 00:26:35,610 --> 00:26:38,369 look at the non-word sphux.
781 00:26:38,370 --> 00:26:40,169 We can remove the P and get shux, and 782 00:26:40,170 --> 00:26:42,269 then we can remove the X and get shu. 783 00:26:42,270 --> 00:26:44,399 And if you look at the word shout, we can 784 00:26:44,400 --> 00:26:46,919 remove the T and get shou, 785 00:26:46,920 --> 00:26:48,929 and we can remove the O and get shu. 786 00:26:48,930 --> 00:26:51,359 And those two words, 787 00:26:51,360 --> 00:26:53,279 after a fashion, have now met in the 788 00:26:53,280 --> 00:26:55,349 middle. So what we 789 00:26:55,350 --> 00:26:58,049 have now proved is that the distance, 790 00:26:58,050 --> 00:26:59,609 the insertion-deletion distance between 791 00:26:59,610 --> 00:27:01,739 sphux and shout, is at most 792 00:27:01,740 --> 00:27:03,389 four, because you can take sphux and 793 00:27:03,390 --> 00:27:05,459 remove the P and remove the 794 00:27:05,460 --> 00:27:07,979 X. And now you can go backwards 795 00:27:07,980 --> 00:27:10,079 and add the O and add the 796 00:27:10,080 --> 00:27:12,509 T, and sphux and shout are now connected 797 00:27:12,510 --> 00:27:14,669 through this. Now there's a peculiar 798 00:27:14,670 --> 00:27:16,829 something about this computation. 799 00:27:16,830 --> 00:27:18,869 You may have noticed that we didn't have to 800 00:27:18,870 --> 00:27:20,819 actually insert any letters. 801 00:27:20,820 --> 00:27:22,709 We only deleted letters. 802 00:27:22,710 --> 00:27:24,929 So we deleted letters from sphux and we 803 00:27:24,930 --> 00:27:26,579 deleted letters from shout. 804 00:27:26,580 --> 00:27:28,829 And eventually we reached 805 00:27:28,830 --> 00:27:30,659 this kind of lowest common denominator. 806 00:27:30,660 --> 00:27:32,519 And the truth is that it's possible to do 807 00:27:32,520 --> 00:27:34,679 this for any two inputs.
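The observation that any two strings can meet at a common reduced form by deletions alone is the classic link between insertion-deletion distance and the longest common subsequence: the best meeting point is a longest common subsequence, and every letter outside it has to be deleted on one side. A minimal sketch, assuming the spellings "sphux" and "shout" for the two example strings (the subtitles garble them) and with hypothetical function names:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a, b):
    """Edit distance when only insertions and deletions are allowed."""
    # Delete everything in a outside the common subsequence,
    # then insert everything in b outside it.
    return len(a) + len(b) - 2 * lcs_length(a, b)


print(indel_distance("sphux", "shout"))
```

With these spellings the common reduced form is "shu", reached by two deletions on each side. A single in-place edit costs two under this metric (one deletion plus one insertion), which is where the factor of two in the speaker's bound comes from.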
808 00:27:34,680 --> 00:27:37,079 So really, we can forget about symmetric 809 00:27:37,080 --> 00:27:38,969 insertion-deletion and just talk about 810 00:27:38,970 --> 00:27:40,140 symmetric deletion. 811 00:27:41,270 --> 00:27:42,270 Now, 812 00:27:43,670 --> 00:27:45,859 this is a nice thing, because now 813 00:27:45,860 --> 00:27:47,749 we're only left with deletion out of the 814 00:27:47,750 --> 00:27:49,429 insertion and deletion and editing that we 815 00:27:49,430 --> 00:27:50,659 started with before. 816 00:27:50,660 --> 00:27:52,909 Now, let's combine this 817 00:27:52,910 --> 00:27:55,189 with the bloated dictionary idea 818 00:27:55,190 --> 00:27:56,239 that we saw earlier. 819 00:27:56,240 --> 00:27:58,759 We can now keep a dictionary containing 820 00:27:58,760 --> 00:28:00,979 all the reduced forms that 821 00:28:00,980 --> 00:28:03,259 you can get from a word in the original 822 00:28:03,260 --> 00:28:05,539 dictionary by removing letters. 823 00:28:05,540 --> 00:28:07,789 So now you can take your input and 824 00:28:07,790 --> 00:28:10,219 start deleting letters and comparing 825 00:28:10,220 --> 00:28:11,899 against reduced forms. 826 00:28:11,900 --> 00:28:13,849 And eventually you're going to be able to 827 00:28:13,850 --> 00:28:16,039 meet in the middle just like sphux 828 00:28:16,040 --> 00:28:17,299 and shout did. 829 00:28:17,300 --> 00:28:19,369 And the expected time 830 00:28:19,370 --> 00:28:21,469 to carry out this computation is much, 831 00:28:21,470 --> 00:28:23,389 much lower because now you don't have the 832 00:28:23,390 --> 00:28:26,179 crazy flood search anymore. 833 00:28:26,180 --> 00:28:28,039 You only have to delete some letters.
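The combination of symmetric deletion with a precomputed dictionary of reduced forms can be sketched as follows. This is a hedged illustration, not the speaker's code: the function names, the tiny word list, and capping deletions at `max_deletions` on both the dictionary side and the query side are assumptions made to keep the example small. The example strings use the assumed spellings "sphux" and "shout".

```python
from itertools import combinations


def reduced_forms(word, max_deletions):
    """Map each form reachable by deleting up to max_deletions letters
    to the fewest deletions that produce it."""
    forms = {}
    n = len(word)
    for k in range(min(max_deletions, n) + 1):
        for removed in combinations(range(n), k):
            removed_set = set(removed)
            form = "".join(ch for i, ch in enumerate(word) if i not in removed_set)
            forms.setdefault(form, k)  # k only increases, so the first hit is minimal
    return forms


def build_deletion_index(words, max_deletions):
    """Precompute reduced form -> (deletions, original word), keeping the cheapest."""
    index = {}
    for word in words:
        for form, k in reduced_forms(word, max_deletions).items():
            if k < index.get(form, (k + 1, None))[0]:
                index[form] = (k, word)
    return index


def symmetric_delete_distance(query, index, max_deletions):
    """Best (total deletions, dictionary word) meeting point, or None."""
    best = None
    for form, k_query in reduced_forms(query, max_deletions).items():
        hit = index.get(form)
        if hit is not None:
            total = k_query + hit[0]
            if best is None or total < best[0]:
                best = (total, hit[1])
    return best


index = build_deletion_index(["shout", "tumbler"], max_deletions=3)
print(symmetric_delete_distance("sphux", index, max_deletions=3))
print(symmetric_delete_distance("tumblr", index, max_deletions=3))
```

Each query now costs only a handful of hash lookups, one per reduced form of the input, at the price of a much larger precomputed index.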
834 00:28:28,040 --> 00:28:30,559 Now, to really get 835 00:28:30,560 --> 00:28:32,569 this algorithm all the way through, you 836 00:28:32,570 --> 00:28:34,759 would have to look at all 837 00:28:34,760 --> 00:28:36,769 the deletions that you can perform on 838 00:28:36,770 --> 00:28:37,459 your input. 839 00:28:37,460 --> 00:28:39,559 But we decided to limit it to taking 840 00:28:39,560 --> 00:28:41,959 away three letters because 841 00:28:41,960 --> 00:28:44,029 really the performance gain was very 842 00:28:44,030 --> 00:28:46,519 nice from this. And otherwise, what 843 00:28:46,520 --> 00:28:48,169 word in English do you know that you can 844 00:28:48,170 --> 00:28:50,329 take away three letters from and there's still 845 00:28:50,330 --> 00:28:52,669 something sensible left of the original? 846 00:28:52,670 --> 00:28:54,679 There are some, but we thought that it 847 00:28:54,680 --> 00:28:56,569 was a good tradeoff. 848 00:28:56,570 --> 00:28:58,969 So the limitation 849 00:28:58,970 --> 00:29:00,739 that we got out of this was that if 850 00:29:00,740 --> 00:29:02,389 actually meeting in the middle requires 851 00:29:02,390 --> 00:29:04,429 more deletions than the number of 852 00:29:04,430 --> 00:29:06,169 deletions that we were willing to make, 853 00:29:06,170 --> 00:29:08,239 and we set that to be three, 854 00:29:08,240 --> 00:29:10,399 then we won't be 855 00:29:10,400 --> 00:29:12,109 able to do the meeting in the middle thing, 856 00:29:12,110 --> 00:29:13,489 but otherwise this thing is going to 857 00:29:13,490 --> 00:29:15,289 work. The important thing is that 858 00:29:15,290 --> 00:29:17,599 insertions we get for free. All 859 00:29:17,600 --> 00:29:19,699 insertions that need to be made in 860 00:29:19,700 --> 00:29:21,469 order to get to the word in the 861 00:29:21,470 --> 00:29:24,529 dictionary are going to be detected.
862 00:29:24,530 --> 00:29:26,479 Even if you need like eight insertions or 863 00:29:26,480 --> 00:29:28,819 something, it's still going to work 864 00:29:28,820 --> 00:29:31,279 because you're going to find the reduced 865 00:29:31,280 --> 00:29:33,379 form that you got by deleting those eight 866 00:29:33,380 --> 00:29:36,079 letters from the word in the dictionary. 867 00:29:36,080 --> 00:29:38,369 So how good is this thing? 868 00:29:38,370 --> 00:29:40,219 How quick is this thing now after we have 869 00:29:40,220 --> 00:29:42,379 applied all of those ugly hacks? 870 00:29:42,380 --> 00:29:44,479 Well, obviously, we 871 00:29:44,480 --> 00:29:45,859 need to implement it to find out. 872 00:29:45,860 --> 00:29:47,659 I was going to implement it in Perl, but 873 00:29:47,660 --> 00:29:49,159 just the other day I heard the language is 874 00:29:49,160 --> 00:29:50,719 horrible and you shouldn't write anything 875 00:29:50,720 --> 00:29:51,739 in Perl. 876 00:29:51,740 --> 00:29:54,199 So I 877 00:29:54,200 --> 00:29:55,549 went back to the drawing board to write 878 00:29:55,550 --> 00:29:56,839 everything in Python. 879 00:29:56,840 --> 00:29:59,119 And there's all sorts of 880 00:29:59,120 --> 00:30:01,179 calculations here of how many atomic 881 00:30:01,180 --> 00:30:03,259 lookups in the 882 00:30:03,260 --> 00:30:04,459 deletions dictionary, 883 00:30:04,460 --> 00:30:05,899 the bloated dictionary that we created, 884 00:30:05,900 --> 00:30:07,279 are going to be required for this test 885 00:30:07,280 --> 00:30:09,349 run and how much time the tests 886 00:30:09,350 --> 00:30:11,989 took, including, I'm going to say, 887 00:30:11,990 --> 00:30:14,779 including the 888 00:30:14,780 --> 00:30:16,939 generation of the actual gibberish 889 00:30:16,940 --> 00:30:18,199 to be tested.
890 00:30:18,200 --> 00:30:20,119 And you divide this by that and you 891 00:30:20,120 --> 00:30:22,639 multiply this by that to see 892 00:30:22,640 --> 00:30:24,289 how long it's going to take to do the 893 00:30:24,290 --> 00:30:26,629 same feat that we tried to do before, 894 00:30:26,630 --> 00:30:28,699 which is to see how close 895 00:30:28,700 --> 00:30:30,679 to the dictionary a string of eight 896 00:30:30,680 --> 00:30:32,479 characters is. 897 00:30:32,480 --> 00:30:34,519 And earlier it took us two and a half 898 00:30:34,520 --> 00:30:36,829 days. And now by this calculation, 899 00:30:36,830 --> 00:30:39,109 it's going to take us a quarter 900 00:30:39,110 --> 00:30:41,339 of a second, a quarter of a second. 901 00:30:41,340 --> 00:30:43,219 That's an improvement by a factor of like 902 00:30:43,220 --> 00:30:44,629 nearly a million. 903 00:30:44,630 --> 00:30:46,879 So we're very happy now. 904 00:30:46,880 --> 00:30:49,039 And in the wise words of Hannah 905 00:30:49,040 --> 00:30:50,959 Montana, we can now enjoy the best of 906 00:30:50,960 --> 00:30:52,039 both worlds. 907 00:30:52,040 --> 00:30:54,409 The dictionary size 908 00:30:54,410 --> 00:30:57,079 is hefty, but it's not prohibitive. 909 00:30:57,080 --> 00:30:59,929 And our query time is improved 910 00:30:59,930 --> 00:31:01,369 by a really drastic amount. 911 00:31:01,370 --> 00:31:03,979 OK, yes, we have this deletion, this 912 00:31:03,980 --> 00:31:06,169 limitation on the number of deletions 913 00:31:06,170 --> 00:31:07,129 that we can deal with. 914 00:31:07,130 --> 00:31:08,359 But I explained why. 915 00:31:08,360 --> 00:31:10,069 I think it's worth the tradeoff. 916 00:31:10,070 --> 00:31:12,139 And this feature is now something 917 00:31:12,140 --> 00:31:13,489 that can actually happen in the real 918 00:31:13,490 --> 00:31:14,659 world. 919 00:31:14,660 --> 00:31:16,789 So we decided that 920 00:31:16,790 --> 00:31:18,559 now what we need to do is an experiment. 
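The "nearly a million" figure can be checked directly from the two running times quoted in the talk, two and a half days before and a quarter of a second after. The variable names below are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

brute_force_seconds = 2.5 * SECONDS_PER_DAY  # flood search estimate from the talk
optimized_seconds = 0.25                     # estimate after the ugly hacks

speedup = brute_force_seconds / optimized_seconds
print(f"{speedup:,.0f}x faster")  # prints "864,000x faster"
```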
921 00:31:20,210 --> 00:31:22,549 So let's recap the 922 00:31:22,550 --> 00:31:24,889 three features that we decided to 923 00:31:24,890 --> 00:31:27,409 extract from our pcaps 924 00:31:27,410 --> 00:31:29,629 to see how close this 925 00:31:29,630 --> 00:31:31,789 pcap is to being 926 00:31:31,790 --> 00:31:34,549 DGA-generated traffic. 927 00:31:34,550 --> 00:31:36,829 So there's the maximum number of requests 928 00:31:36,830 --> 00:31:38,449 that got the same response, whether 929 00:31:38,450 --> 00:31:41,399 that's an IP address or an NXDOMAIN. 930 00:31:41,400 --> 00:31:43,429 For ten of the involved requests, 931 00:31:43,430 --> 00:31:45,559 we calculated this feature that I 932 00:31:45,560 --> 00:31:47,809 just spent like 20 minutes explaining 933 00:31:47,810 --> 00:31:50,179 how it works, of how close this domain 934 00:31:50,180 --> 00:31:52,039 is to a concatenation of words from the 935 00:31:52,040 --> 00:31:54,139 dictionary. And not for 936 00:31:54,140 --> 00:31:55,429 all of the involved requests. 937 00:31:55,430 --> 00:31:57,589 Again, this was computed 938 00:31:57,590 --> 00:31:59,299 for ten of the involved requests just to 939 00:31:59,300 --> 00:32:00,379 improve performance. I think that's 940 00:32:00,380 --> 00:32:02,719 enough. And finally, we computed 941 00:32:02,720 --> 00:32:04,909 the frequency based on bigrams. 942 00:32:04,910 --> 00:32:06,799 I explained this feature earlier, mainly 943 00:32:06,800 --> 00:32:08,989 for comparison's sake, to see 944 00:32:08,990 --> 00:32:11,329 how our metric fares against 945 00:32:11,330 --> 00:32:13,459 this and also whether they can work in unison 946 00:32:13,460 --> 00:32:15,799 to maybe detect 947 00:32:15,800 --> 00:32:17,659 jointly things that each one of them 948 00:32:17,660 --> 00:32:19,009 alone could not. 949 00:32:19,010 --> 00:32:21,199 So now we're going to 950 00:32:21,200 --> 00:32:24,589 see the resulting classifier 951 00:32:24,590 --> 00:32:27,050 classify a traffic capture.
952 00:32:28,680 --> 00:32:29,780 Let's hope this works. 953 00:32:33,100 --> 00:32:35,320 OK, now. 954 00:32:36,790 --> 00:32:39,069 We're going to run this on a traffic 955 00:32:39,070 --> 00:32:41,949 capture. Now, I set this 956 00:32:41,950 --> 00:32:44,019 log to basically pause 957 00:32:44,020 --> 00:32:45,049 the first three times 958 00:32:45,050 --> 00:32:46,509 something interesting happens, and then 959 00:32:46,510 --> 00:32:48,669 it's going to just blow past this. 960 00:32:48,670 --> 00:32:49,869 So we're not going to be standing here 961 00:32:49,870 --> 00:32:50,870 forever. 962 00:32:51,370 --> 00:32:53,529 And OK, we 963 00:32:53,530 --> 00:32:55,839 have the traffic capture and we started 964 00:32:55,840 --> 00:32:57,909 to analyze it and we're 965 00:32:57,910 --> 00:32:59,559 going to extract the 966 00:32:59,560 --> 00:33:01,149 features from it. 967 00:33:01,150 --> 00:33:03,279 Obviously, now we find 968 00:33:03,280 --> 00:33:05,589 that the most common DNS response 969 00:33:05,590 --> 00:33:08,049 that was received during this traffic 970 00:33:08,050 --> 00:33:10,609 capture is an NXDOMAIN. 971 00:33:10,610 --> 00:33:13,849 So we're going to look at the 972 00:33:13,850 --> 00:33:15,889 requested domains that got this response 973 00:33:15,890 --> 00:33:18,019 and now I bet you are getting 974 00:33:18,020 --> 00:33:19,070 a little suspicious 975 00:33:20,150 --> 00:33:21,769 looking at those domains. 976 00:33:21,770 --> 00:33:24,679 It's a list of 28 domains, 977 00:33:24,680 --> 00:33:26,929 and now we have our maximum 978 00:33:26,930 --> 00:33:28,039 domain collision feature. 979 00:33:28,040 --> 00:33:29,899 This is the maximum number of domains 980 00:33:29,900 --> 00:33:31,460 that mapped to the same response.
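The maximum domain collision feature amounts to grouping requested domains by the response they received and taking the largest group. A minimal sketch, assuming the capture has already been parsed into (domain, response) pairs; the function name, the input format and the sample domains are hypothetical, and an empty log is not handled.

```python
from collections import defaultdict


def max_domain_collision(dns_log):
    """Return (response, sorted domains) for the response shared by the most distinct domains."""
    domains_by_response = defaultdict(set)
    for domain, response in dns_log:
        domains_by_response[response].add(domain)
    response, domains = max(domains_by_response.items(), key=lambda kv: len(kv[1]))
    return response, sorted(domains)


# Hypothetical parsed capture: each entry is (requested domain, response).
log = [
    ("cibahode.example", "NXDOMAIN"),
    ("dumilora.example", "NXDOMAIN"),
    ("www.example.org", "192.0.2.10"),
    ("cibahode.example", "NXDOMAIN"),  # repeated request counts once
]
response, domains = max_domain_collision(log)
print(response, len(domains), domains)
```

The feature value is the size of the winning group; the domains in that group are the ones the remaining features are then computed on.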
981 00:33:32,580 --> 00:33:35,039 Now we're looking at the longest, 982 00:33:35,040 --> 00:33:36,479 well, longest, because they're actually all the 983 00:33:36,480 --> 00:33:37,769 same length, so they're just sorted 984 00:33:37,770 --> 00:33:40,109 alphabetically, relevant 985 00:33:40,110 --> 00:33:41,429 requests and we're going to start 986 00:33:41,430 --> 00:33:44,069 analyzing them, using the features that 987 00:33:44,070 --> 00:33:45,839 we talked about before. 988 00:33:45,840 --> 00:33:47,699 Now, we're going to start with what's 989 00:33:47,700 --> 00:33:49,169 called the pronunciation deviancy. 990 00:33:49,170 --> 00:33:51,359 That's just a fancy way 991 00:33:51,360 --> 00:33:52,919 of saying that we are going to look at 992 00:33:52,920 --> 00:33:54,659 the pairs of letters, the bigrams, and 993 00:33:54,660 --> 00:33:56,819 see how frequent they are 994 00:33:56,820 --> 00:33:58,739 and how this compares to what we would 995 00:33:58,740 --> 00:33:59,970 expect from English. 996 00:34:01,080 --> 00:34:03,329 And so we're going to start looking 997 00:34:03,330 --> 00:34:05,189 at the first domain name and now the 998 00:34:05,190 --> 00:34:07,469 algorithm looks at this input, 999 00:34:07,470 --> 00:34:08,999 the domain name, and says, OK, it starts 1000 00:34:09,000 --> 00:34:10,079 with C. 1001 00:34:10,080 --> 00:34:12,599 How much am I surprised 1002 00:34:12,600 --> 00:34:14,819 by the fact that it starts with C? 1003 00:34:14,820 --> 00:34:16,799 Well, as you can see, the answer is five 1004 00:34:16,800 --> 00:34:19,769 point six surprised, more or less. 1005 00:34:19,770 --> 00:34:22,049 Don't ask me about the measurement 1006 00:34:22,050 --> 00:34:23,908 units on this, because this is a whole 1007 00:34:23,909 --> 00:34:25,769 talk about probability theory and we 1008 00:34:25,770 --> 00:34:27,329 don't have the time for that.
1009 00:34:27,330 --> 00:34:29,399 And the same 1010 00:34:29,400 --> 00:34:31,468 happens now, because after the 1011 00:34:31,469 --> 00:34:33,419 C there's an I, and the algorithm asks 1012 00:34:33,420 --> 00:34:35,609 itself: OK, so I saw a C; how surprised 1013 00:34:35,610 --> 00:34:38,069 am I seeing an I after the C? 1014 00:34:38,070 --> 00:34:40,229 Well, it's four point three surprised, 1015 00:34:40,230 --> 00:34:42,309 less than before, and 1016 00:34:43,409 --> 00:34:45,329 then this goes on and on. 1017 00:34:45,330 --> 00:34:47,218 Now we are looking at the I, until 1018 00:34:47,219 --> 00:34:49,289 finally we have a bag full of 1019 00:34:49,290 --> 00:34:51,509 surprises and we are very surprised. 1020 00:34:51,510 --> 00:34:53,759 We have a number, and we normalize it by 1021 00:34:53,760 --> 00:34:55,769 dividing by a factor of the length of the 1022 00:34:55,770 --> 00:34:57,959 input. And we get a score of how 1023 00:34:57,960 --> 00:35:00,689 surprised we are generally, 1024 00:35:00,690 --> 00:35:03,899 bigram-wise, pairs-of-letters-wise, 1025 00:35:03,900 --> 00:35:06,389 by this input domain name. 1026 00:35:06,390 --> 00:35:08,819 So we move on to the next domain name. 1027 00:35:08,820 --> 00:35:09,909 The same thing happens. 1028 00:35:09,910 --> 00:35:12,209 We get a general idea of how surprised 1029 00:35:12,210 --> 00:35:13,439 we are by it. 1030 00:35:13,440 --> 00:35:15,719 And then the same for the next domain name. 1031 00:35:15,720 --> 00:35:17,639 We went over four domain names, and now it's 1032 00:35:17,640 --> 00:35:19,709 going to go whoosh. We went over all 1033 00:35:19,710 --> 00:35:21,589 the domain names and we averaged them out.
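One common way to put a number on "how surprised am I" is the negative log-probability of each letter given the letter before it. The sketch below takes that route as an assumption; the talk does not state its exact formula, so the add-one smoothing, the `^` start marker, and plain division by the letter count are illustrative choices, and the values will not match the 5.6, 4.3, or 0.9 shown in the demo.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
START = "^"  # hypothetical marker meaning "beginning of the name"

def train_bigram_model(words):
    """Count letter pairs in a list of reference (for example English) words."""
    pair_counts = Counter()
    context_counts = Counter()
    for word in words:
        prev = START
        for ch in word.lower():
            if ch not in ALPHABET:
                continue
            pair_counts[(prev, ch)] += 1
            context_counts[prev] += 1
            prev = ch
    return pair_counts, context_counts

def surprise(model, prev, ch):
    """Surprise in bits at seeing ch right after prev (add-one smoothed)."""
    pair_counts, context_counts = model
    probability = (pair_counts[(prev, ch)] + 1) / (context_counts[prev] + len(ALPHABET))
    return -math.log2(probability)

def name_surprise(model, name):
    """Sum the per-letter surprises and normalize by the number of letters."""
    letters = [ch for ch in name.lower() if ch in ALPHABET]
    if not letters:
        return 0.0
    total, prev = 0.0, START
    for ch in letters:
        total += surprise(model, prev, ch)
        prev = ch
    return total / len(letters)

def average_surprise(model, names):
    """Average the per-name scores over all relevant domain names."""
    names = list(names)
    if not names:
        return 0.0
    return sum(name_surprise(model, name) for name in names) / len(names)
```

With a model trained on a real word list, names built from common English letter pairs score lower than strings like `xqzj`.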
1034 00:35:21,590 --> 00:35:23,669 Now we've got a measure of how surprised 1035 00:35:23,670 --> 00:35:25,859 we are by the bigrams, the pairs 1036 00:35:25,860 --> 00:35:27,989 of letters, in all of the 1037 00:35:27,990 --> 00:35:30,209 relevant domains here, generally, on 1038 00:35:30,210 --> 00:35:32,609 average, and the answer is zero point 1039 00:35:32,610 --> 00:35:34,289 nine surprised. 1040 00:35:34,290 --> 00:35:36,689 And we 1041 00:35:36,690 --> 00:35:38,549 keep on going, and now we're going to 1042 00:35:38,550 --> 00:35:41,009 calculate the lexical 1043 00:35:41,010 --> 00:35:42,299 deviancy, which is really just a 1044 00:35:42,300 --> 00:35:44,429 fancy word for the feature 1045 00:35:44,430 --> 00:35:46,919 that I spent 20 minutes explaining 1046 00:35:46,920 --> 00:35:48,869 how we're going to put together: the 1047 00:35:48,870 --> 00:35:51,239 closeness to a 1048 00:35:51,240 --> 00:35:52,859 concatenation of words in the dictionary, 1049 00:35:52,860 --> 00:35:54,999 as approximated by a host of ugly 1050 00:35:55,000 --> 00:35:56,000 hacks. 1051 00:35:56,550 --> 00:35:58,619 So we're going 1052 00:35:58,620 --> 00:35:59,699 to start calculating it. 1053 00:35:59,700 --> 00:36:01,349 And in order to calculate it, the 1054 00:36:01,350 --> 00:36:03,149 algorithm is going to iterate over every 1055 00:36:03,150 --> 00:36:05,129 possible prefix and see which prefix 1056 00:36:05,130 --> 00:36:06,989 looks the most promising to take away 1057 00:36:06,990 --> 00:36:08,879 from the input and say, OK, I imagine 1058 00:36:08,880 --> 00:36:10,829 this, more or less, is my most promising 1059 00:36:10,830 --> 00:36:13,049 candidate for something that used 1060 00:36:13,050 --> 00:36:14,699 to be a word in the dictionary but got 1061 00:36:14,700 --> 00:36:15,899 mutilated somehow.
1062 00:36:17,630 --> 00:36:20,119 So we perform the lookup 1063 00:36:20,120 --> 00:36:22,639 on the C, 1064 00:36:22,640 --> 00:36:24,799 and we can delete nothing and 1065 00:36:24,800 --> 00:36:26,869 stay with C, or we can delete 1066 00:36:26,870 --> 00:36:29,149 the C and stay with nothing. 1067 00:36:29,150 --> 00:36:31,309 And it turns out that 1068 00:36:31,310 --> 00:36:33,439 both of those 1069 00:36:33,440 --> 00:36:35,509 options are in the 1070 00:36:35,510 --> 00:36:36,829 dictionary. 1071 00:36:36,830 --> 00:36:39,109 So obviously it's better to 1072 00:36:39,110 --> 00:36:40,819 delete nothing and just say the 1073 00:36:40,820 --> 00:36:42,469 C is in the dictionary, it's a 1074 00:36:42,470 --> 00:36:44,569 candidate, and we can 1075 00:36:44,570 --> 00:36:46,909 just take it away and 1076 00:36:46,910 --> 00:36:48,679 say, OK, that's a word from the 1077 00:36:48,680 --> 00:36:50,419 dictionary. The downside, of course, is 1078 00:36:50,420 --> 00:36:51,739 that C is just one letter. 1079 00:36:51,740 --> 00:36:54,019 So it may be in the dictionary, 1080 00:36:54,020 --> 00:36:56,059 but it may not be the best candidate to 1081 00:36:56,060 --> 00:36:58,039 take away from the input, because we want 1082 00:36:58,040 --> 00:37:00,139 a longer candidate to cut the input 1083 00:37:00,140 --> 00:37:02,089 length. This is how the greedy algorithm 1084 00:37:02,090 --> 00:37:04,159 operates. It wants to take 1085 00:37:04,160 --> 00:37:06,109 away the best candidate, 1086 00:37:06,110 --> 00:37:08,359 and that also depends on the length. 1087 00:37:08,360 --> 00:37:10,939 We want to make the input smaller 1088 00:37:10,940 --> 00:37:12,260 as we work on it.
1089 00:37:14,150 --> 00:37:16,219 Now we look at the prefix C-I, 1090 00:37:16,220 --> 00:37:18,739 and the same process happens basically 1091 00:37:18,740 --> 00:37:21,159 with every prefix 1092 00:37:21,160 --> 00:37:23,479 that is 1093 00:37:23,480 --> 00:37:25,699 in this input, until 1094 00:37:25,700 --> 00:37:27,289 finally the algorithm makes the choice: 1095 00:37:27,290 --> 00:37:29,419 OK, I think taking the first letters, 1096 00:37:29,420 --> 00:37:31,669 C-I, is the best choice here. 1097 00:37:31,670 --> 00:37:34,189 That's the closest thing I have here 1098 00:37:34,190 --> 00:37:35,869 to a word in the dictionary that I can 1099 00:37:35,870 --> 00:37:36,870 take away. 1100 00:37:37,610 --> 00:37:39,709 Now the same thing happens again, 1101 00:37:39,710 --> 00:37:41,809 and it takes away the prefix ally 1102 00:37:41,810 --> 00:37:44,539 and it takes away the prefix Kidwai 1103 00:37:44,540 --> 00:37:46,639 and so forth and so on, 1104 00:37:46,640 --> 00:37:48,949 until finally it 1105 00:37:48,950 --> 00:37:51,499 reaches a score 1106 00:37:51,500 --> 00:37:53,659 for similarity to 1107 00:37:53,660 --> 00:37:55,069 a concatenation of words in the 1108 00:37:55,070 --> 00:37:57,199 dictionary for this domain 1109 00:37:57,200 --> 00:37:58,129 name. 1110 00:37:58,130 --> 00:38:00,199 Now we get the 1111 00:38:00,200 --> 00:38:03,289 same calculation for the next 1112 00:38:03,290 --> 00:38:05,809 domain name and the next domain name 1113 00:38:05,810 --> 00:38:07,549 and the rest of the domain names. 1114 00:38:07,550 --> 00:38:09,629 And it all averages out to 1115 00:38:09,630 --> 00:38:12,199 the final measure of the closeness 1116 00:38:12,200 --> 00:38:15,139 to a concatenation of dictionary words 1117 00:38:15,140 --> 00:38:17,339 of all the domains that were relevant, 1118 00:38:17,340 --> 00:38:18,799 the ten domains that we inspected and 1119 00:38:18,800 --> 00:38:20,750 decided to take a look at.
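The greedy prefix stripping can be sketched as follows. Everything here is a simplified assumption rather than the implementation from the talk: the dictionary is a small in-memory index of words plus their one-deletion variants, a prefix may also lose one character before the lookup, and the choice between candidates uses a made-up gain of prefix length minus twice the edit cost.

```python
def deletion_variants(word):
    """The word itself plus every string made by deleting one character."""
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}

def build_index(words):
    """Map each variant to the fewest deletions that produce it from a word."""
    index = {}
    for word in words:
        for variant in deletion_variants(word.lower()):
            cost = len(word) - len(variant)
            index[variant] = min(cost, index.get(variant, cost))
    return index

def prefix_cost(index, prefix):
    """Cheapest edit count linking prefix to a dictionary word, else None."""
    best = None
    for variant in deletion_variants(prefix):
        if variant in index:
            cost = (len(prefix) - len(variant)) + index[variant]
            if best is None or cost < best:
                best = cost
    return best

def lexical_deviance(index, name, max_word_len=20):
    """Greedily strip the most promising prefix; return total cost / length.

    0.0 means a clean concatenation of dictionary words; values near 1.0
    mean almost nothing in the name resembled a dictionary word.
    """
    name = name.lower()
    if not name:
        return 0.0
    rest, total_cost = name, 0
    while rest:
        # Fallback: throw away one character at a cost of 1 (gain of -1).
        best_len, best_cost, best_gain = 1, 1, -1
        for n in range(1, min(len(rest), max_word_len) + 1):
            cost = prefix_cost(index, rest[:n])
            if cost is None:
                continue
            gain = n - 2 * cost  # hypothetical trade-off: longer and cleaner wins
            if gain > best_gain:
                best_len, best_cost, best_gain = n, cost, gain
        total_cost += best_cost
        rest = rest[best_len:]
    return total_cost / len(name)
```

The per-name scores would then be averaged over the relevant domains, as in the demo.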
1120 00:38:21,800 --> 00:38:23,869 So we now have our 1121 00:38:23,870 --> 00:38:26,029 final list of features. We 1122 00:38:26,030 --> 00:38:28,309 have 28 domains pointing the same way. 1123 00:38:28,310 --> 00:38:30,319 We have lexical deviancy of zero point 1124 00:38:30,320 --> 00:38:32,719 six and pronounceability deviancy 1125 00:38:32,720 --> 00:38:35,089 of zero point nine. Lexical 1126 00:38:35,090 --> 00:38:37,519 deviancy, I remind you, is the feature 1127 00:38:37,520 --> 00:38:38,929 that we built here, and the pronounceability 1128 00:38:38,930 --> 00:38:40,879 deviancy is based on the pairs of 1129 00:38:40,880 --> 00:38:43,209 letters and how frequent they are. 1130 00:38:43,210 --> 00:38:45,619 Now, the algorithm 1131 00:38:45,620 --> 00:38:47,359 is going to be looking closely at those 1132 00:38:47,360 --> 00:38:49,189 features, and it's going to look at the 28 1133 00:38:49,190 --> 00:38:51,529 domains pointing the same way and 1134 00:38:51,530 --> 00:38:54,019 say, well, 1135 00:38:54,020 --> 00:38:55,309 that's too much. 1136 00:38:55,310 --> 00:38:57,019 Twenty-eight domains pointing the same 1137 00:38:57,020 --> 00:38:58,639 way is really too much. 1138 00:38:58,640 --> 00:39:01,009 Everything more than five raises alarms 1139 00:39:01,010 --> 00:39:01,939 already. 1140 00:39:01,940 --> 00:39:04,609 So it says that's excessive. 1141 00:39:04,610 --> 00:39:06,679 Now, as for the 1142 00:39:06,680 --> 00:39:08,719 closeness to a concatenation of words in 1143 00:39:08,720 --> 00:39:10,460 the dictionary, it looks at the value. 1144 00:39:11,810 --> 00:39:13,249 Later, I'm going to tell you where the 1145 00:39:13,250 --> 00:39:15,469 parameters for the classifier came 1146 00:39:15,470 --> 00:39:17,599 from. It looks at it and says that's 1147 00:39:17,600 --> 00:39:19,459 also excessive.
1148 00:39:19,460 --> 00:39:21,669 It's not close to a 1149 00:39:21,670 --> 00:39:23,479 concatenation of words in the dictionary 1150 00:39:23,480 --> 00:39:24,919 at all. 1151 00:39:24,920 --> 00:39:26,779 And finally, it's going to look at the 1152 00:39:26,780 --> 00:39:29,629 pronounceability, how 1153 00:39:29,630 --> 00:39:32,029 likely the pairs of letters that appeared 1154 00:39:32,030 --> 00:39:33,679 seem to be. 1155 00:39:33,680 --> 00:39:35,929 And it's going to say that it's 1156 00:39:35,930 --> 00:39:37,999 actually reasonable, 1157 00:39:38,000 --> 00:39:39,709 because we got the value of zero point 1158 00:39:39,710 --> 00:39:42,049 nine and anything 1159 00:39:42,050 --> 00:39:43,339 up to one point five is actually 1160 00:39:43,340 --> 00:39:45,919 reasonable. So those domains, 1161 00:39:45,920 --> 00:39:48,829 they looked like gibberish to our feature, 1162 00:39:48,830 --> 00:39:51,379 but the bigram feature 1163 00:39:51,380 --> 00:39:53,929 looked at this thing and said, OK, 1164 00:39:53,930 --> 00:39:55,549 seems fine. 1165 00:39:55,550 --> 00:39:56,599 Why is this? 1166 00:39:56,600 --> 00:39:58,249 I imagine some of you have guessed: it's 1167 00:39:58,250 --> 00:40:00,019 quite possible this capture was generated 1168 00:40:00,020 --> 00:40:00,889 by Kwyjibo. 1169 00:40:00,890 --> 00:40:02,959 So now this classifier is going to look 1170 00:40:02,960 --> 00:40:04,349 at the domain collision and lexical 1171 00:40:04,350 --> 00:40:06,409 deviancy and say, 1172 00:40:06,410 --> 00:40:08,719 OK, Mr. Pronounceable, you 1173 00:40:08,720 --> 00:40:10,699 said that this was reasonable, but I have 1174 00:40:10,700 --> 00:40:12,409 another feature now that I can rely on, 1175 00:40:12,410 --> 00:40:15,019 and this is definitely DGA. 1176 00:40:15,020 --> 00:40:16,729 So this 1177 00:40:18,140 --> 00:40:20,269 was the short demo of how 1178 00:40:20,270 --> 00:40:21,270 this thing works.
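The decision in the demo can be written down as a small rule. The collision limit of five and the pronounceability limit of 1.5 are the numbers quoted in the talk; the lexical limit of 0.5 is a placeholder (the talk only says that 0.6 was excessive), and the way the three features are combined is an assumption read off this one example, not the real classifier.

```python
from dataclasses import dataclass

@dataclass
class CaptureFeatures:
    max_domain_collision: int         # domains sharing the most common response
    lexical_deviance: float           # distance from dictionary-word concatenation
    pronounceability_deviance: float  # average bigram surprise

MAX_REASONABLE_COLLISIONS = 5          # quoted in the talk
MAX_REASONABLE_PRONOUNCEABILITY = 1.5  # quoted in the talk
MAX_REASONABLE_LEXICAL = 0.5           # hypothetical placeholder value

def looks_like_dga(features):
    """Flag a capture when many domains collide and the names look like gibberish.

    Either gibberish feature is allowed to overrule the other, which mirrors
    the lexical feature overruling "Mr. Pronounceable" in the demo.
    """
    too_many = features.max_domain_collision > MAX_REASONABLE_COLLISIONS
    gibberish = (features.lexical_deviance > MAX_REASONABLE_LEXICAL
                 or features.pronounceability_deviance > MAX_REASONABLE_PRONOUNCEABILITY)
    return too_many and gibberish
```

For the demo capture, `looks_like_dga(CaptureFeatures(28, 0.6, 0.9))` evaluates to `True` under these assumed limits.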
1179 00:40:22,310 --> 00:40:24,469 And now we're going to look at 1180 00:40:24,470 --> 00:40:26,239 some pretty graphs. 1181 00:40:26,240 --> 00:40:28,309 And now we 1182 00:40:29,450 --> 00:40:31,579 took 10,000 1183 00:40:31,580 --> 00:40:33,829 pcaps from 1184 00:40:33,830 --> 00:40:36,019 Check Point, just to see how 1185 00:40:36,020 --> 00:40:38,029 the data looks if we map it across the 1186 00:40:38,030 --> 00:40:40,789 features that we have created. 1187 00:40:40,790 --> 00:40:43,249 And these are your 10,000 pcaps 1188 00:40:43,250 --> 00:40:45,349 mapped across the closeness 1189 00:40:45,350 --> 00:40:47,599 to a concatenation of dictionary words 1190 00:40:47,600 --> 00:40:49,549 and the maximum domain name collisions, 1191 00:40:49,550 --> 00:40:51,229 the number of DNS requests that got the same 1192 00:40:51,230 --> 00:40:53,599 response. Now, if the lump there 1193 00:40:53,600 --> 00:40:55,579 to the upper right seems suspicious to 1194 00:40:55,580 --> 00:40:58,069 you, then you're probably right, 1195 00:40:58,070 --> 00:41:00,379 because when we took some test samples, 1196 00:41:00,380 --> 00:41:02,479 like a hundred samples just to test the 1197 00:41:02,480 --> 00:41:04,819 waters, and we 1198 00:41:04,820 --> 00:41:06,979 labeled them by hand, the DGA 1199 00:41:06,980 --> 00:41:09,049 samples were in the lump and all the clean 1200 00:41:09,050 --> 00:41:11,299 samples aligned neatly 1201 00:41:11,300 --> 00:41:13,429 along the vertical 1202 00:41:13,430 --> 00:41:15,049 over there to the left. 1203 00:41:15,050 --> 00:41:17,689 And if you want a visualization 1204 00:41:17,690 --> 00:41:20,029 of the classifier itself 1205 00:41:20,030 --> 00:41:22,129 that I promised you, then I'm 1206 00:41:22,130 --> 00:41:24,769 going to tell you how it was generated, 1207 00:41:24,770 --> 00:41:26,959 actually.
I tried all sorts of machine 1208 00:41:26,960 --> 00:41:28,759 learning algorithms. I tried Gaussian 1209 00:41:28,760 --> 00:41:30,949 mixture models and all of that. 1210 00:41:30,950 --> 00:41:33,109 But so far I got the best result 1211 00:41:33,110 --> 00:41:35,179 by just looking at the testing data 1212 00:41:35,180 --> 00:41:37,339 myself. So really, this is a case 1213 00:41:37,340 --> 00:41:39,949 of machine learning, a subclass 1214 00:41:39,950 --> 00:41:41,629 called Ben learning, and I'm kind of 1215 00:41:41,630 --> 00:41:43,849 bummed, because I really wanted a 1216 00:41:43,850 --> 00:41:45,529 proper machine learning algorithm to make 1217 00:41:45,530 --> 00:41:47,599 sense of this data, and I'm still looking 1218 00:41:47,600 --> 00:41:49,939 for it. This is a classifier 1219 00:41:49,940 --> 00:41:52,159 that was not generated based 1220 00:41:52,160 --> 00:41:54,289 on any of the data that you see plotted 1221 00:41:54,290 --> 00:41:55,249 against the classifier. 1222 00:41:55,250 --> 00:41:56,389 That's the test data. 1223 00:41:56,390 --> 00:41:59,029 I mean, it's the 10,000 1224 00:41:59,030 --> 00:42:00,109 pcaps that we took. 1225 00:42:00,110 --> 00:42:02,329 I generated the parameters 1226 00:42:02,330 --> 00:42:04,999 for this classifier based on other 1227 00:42:05,000 --> 00:42:05,809 traffic captures. 1228 00:42:05,810 --> 00:42:08,359 Of course, you don't test your classifier 1229 00:42:08,360 --> 00:42:10,459 on the same samples 1230 00:42:10,460 --> 00:42:12,619 that you used to generate the classifier. 1231 00:42:12,620 --> 00:42:14,119 Then you're going to have overfitting, and 1232 00:42:14,120 --> 00:42:15,299 that's not cool. 1233 00:42:15,300 --> 00:42:17,419 And this is how the 1234 00:42:17,420 --> 00:42:18,739 classifier looks.
1235 00:42:18,740 --> 00:42:21,139 And now I'm going to talk a bit about 1236 00:42:21,140 --> 00:42:23,539 the future of this 1237 00:42:23,540 --> 00:42:25,939 project and what needs 1238 00:42:25,940 --> 00:42:28,339 to be done further on it. 1239 00:42:28,340 --> 00:42:30,139 Now, first of all, more testing, because 1240 00:42:30,140 --> 00:42:32,629 testing on 100 samples is nice, but 1241 00:42:32,630 --> 00:42:35,299 I bet that there's a lot of 1242 00:42:35,300 --> 00:42:37,429 surprises that the DGAs have 1243 00:42:37,430 --> 00:42:39,889 up their sleeves that will require 1244 00:42:39,890 --> 00:42:41,959 this project to evolve and 1245 00:42:41,960 --> 00:42:43,969 get better features and fine-tuned 1246 00:42:43,970 --> 00:42:45,739 features to be better. 1247 00:42:45,740 --> 00:42:47,779 Actually, we're going to touch on one of 1248 00:42:47,780 --> 00:42:50,299 those in like two slides. Next, 1249 00:42:50,300 --> 00:42:52,369 as I said, more machine learning. I 1250 00:42:52,370 --> 00:42:53,370 want 1251 00:42:55,210 --> 00:42:57,309 to be using an actual 1252 00:42:57,310 --> 00:42:58,809 machine learning algorithm, even though 1253 00:42:58,810 --> 00:43:01,329 this finely tuned, tuned-by-hand classifier 1254 00:43:01,330 --> 00:43:03,309 has worked well, I think. You saw 1255 00:43:03,310 --> 00:43:04,929 the graph, you saw that it was a very 1256 00:43:04,930 --> 00:43:06,579 reasonable conclusion to draw from the 1257 00:43:06,580 --> 00:43:07,599 data. 1258 00:43:07,600 --> 00:43:09,459 And finally, there is Gibberish 1259 00:43:09,460 --> 00:43:12,249 Detection 103, because 1260 00:43:12,250 --> 00:43:14,229 as it turns out, domain generation 1261 00:43:14,230 --> 00:43:16,689 algorithms, some of those at least, 1262 00:43:16,690 --> 00:43:18,759 have already anticipated this kind of 1263 00:43:18,760 --> 00:43:20,569 analysis that we have performed here.
1264 00:43:20,570 --> 00:43:22,629 And you have DGAs, such as the one 1265 00:43:22,630 --> 00:43:24,999 used by the Matsnu malware, 1266 00:43:25,000 --> 00:43:27,219 that generate random 1267 00:43:27,220 --> 00:43:29,499 domains by concatenating 1268 00:43:29,500 --> 00:43:30,999 words from the dictionary. 1269 00:43:31,000 --> 00:43:33,579 So the bigram feature in our features 1270 00:43:33,580 --> 00:43:35,709 is going to do nothing to detect 1271 00:43:35,710 --> 00:43:38,049 this kind of thing. How 1272 00:43:38,050 --> 00:43:40,749 could we evolve to defeat 1273 00:43:40,750 --> 00:43:42,849 this? Well, I'm just throwing an 1274 00:43:42,850 --> 00:43:44,739 idea out there. Maybe we could take a 1275 00:43:44,740 --> 00:43:46,359 look at the words from the dictionary 1276 00:43:46,360 --> 00:43:48,129 that we actually found and make some 1277 00:43:48,130 --> 00:43:50,199 kind of semantic comparison to see how 1278 00:43:50,200 --> 00:43:51,789 likely they were to actually appear 1279 00:43:51,790 --> 00:43:53,619 together in the same domain name. 1280 00:43:53,620 --> 00:43:56,319 I mean, "panther" and "asphyxiation"? 1281 00:43:56,320 --> 00:43:57,909 I guess I could see it as a good name for a 1282 00:43:57,910 --> 00:43:58,849 rock band. 1283 00:43:58,850 --> 00:44:02,109 And so 1284 00:44:02,110 --> 00:44:03,909 how about undetectable gibberish? 1285 00:44:03,910 --> 00:44:06,339 Can this arms race eventually end 1286 00:44:06,340 --> 00:44:08,469 in undetectable gibberish?
1287 00:44:08,470 --> 00:44:10,839 Well, my personal opinion 1288 00:44:10,840 --> 00:44:13,059 is that an undetectable lump 1289 00:44:13,060 --> 00:44:15,579 of gibberish is a contradiction in terms, 1290 00:44:15,580 --> 00:44:17,949 because the idea with undetectable 1291 00:44:17,950 --> 00:44:20,139 is that it resembles the legitimate 1292 00:44:20,140 --> 00:44:21,849 distribution of what you expect to see in 1293 00:44:21,850 --> 00:44:23,769 the real world across every conceivable 1294 00:44:23,770 --> 00:44:26,139 feature. Now, if something resembles 1295 00:44:26,140 --> 00:44:28,119 the real world across every conceivable 1296 00:44:28,120 --> 00:44:30,009 feature, it's not going to look like 1297 00:44:30,010 --> 00:44:31,119 gibberish to you. 1298 00:44:31,120 --> 00:44:33,129 But that doesn't mean that we're off the 1299 00:44:33,130 --> 00:44:34,959 hook, because while I think that 1300 00:44:34,960 --> 00:44:37,089 undetectable gibberish is an 1301 00:44:37,090 --> 00:44:39,289 impossibility, undetectable 1302 00:44:39,290 --> 00:44:41,559 auto-generated domain names are very much 1303 00:44:41,560 --> 00:44:42,609 a possibility. 1304 00:44:42,610 --> 00:44:44,949 I think that in theory, 1305 00:44:44,950 --> 00:44:46,989 DGA authors could eventually create an 1306 00:44:46,990 --> 00:44:48,849 algorithm that generates random domain 1307 00:44:48,850 --> 00:44:51,069 names that are very 1308 00:44:51,070 --> 00:44:53,169 much like domains that you see out there 1309 00:44:53,170 --> 00:44:55,239 in the real world across every feature 1310 00:44:55,240 --> 00:44:56,949 that you could conceive of.
1311 00:44:56,950 --> 00:44:59,169 And it's not going to run 1312 00:44:59,170 --> 00:45:01,449 into any issues, because even if 1313 00:45:01,450 --> 00:45:04,119 you put in that set of constraints, 1314 00:45:04,120 --> 00:45:06,309 eventually 1315 00:45:06,310 --> 00:45:08,649 the space of possible domains 1316 00:45:08,650 --> 00:45:11,019 is so large, even 1317 00:45:11,020 --> 00:45:12,789 with all those constraints, that there's 1318 00:45:12,790 --> 00:45:14,949 plenty of space there for generated 1319 00:45:14,950 --> 00:45:17,319 domains to prosper and flourish. 1320 00:45:17,320 --> 00:45:19,449 But, you know, that's all theoretical 1321 00:45:19,450 --> 00:45:21,699 and in the future. 1322 00:45:21,700 --> 00:45:24,159 In practice, you have total 1323 00:45:24,160 --> 00:45:26,229 gibberish just 1324 00:45:26,230 --> 00:45:28,119 running amok today, you have Kwyjibo 1325 00:45:28,120 --> 00:45:29,289 running amok today, 1326 00:45:29,290 --> 00:45:31,359 and at the top end you have 1327 00:45:31,360 --> 00:45:33,219 the dictionary concatenation, basically 1328 00:45:33,220 --> 00:45:34,929 just running amok today. 1329 00:45:34,930 --> 00:45:36,999 You know what? Let's first force all 1330 00:45:37,000 --> 00:45:39,099 the DGAs to actually use undetectable 1331 00:45:39,100 --> 00:45:41,199 gibberish. And if we do that, I believe 1332 00:45:41,200 --> 00:45:43,039 that we will have done enough for that 1333 00:45:43,040 --> 00:45:43,989 thing. Yes. 1334 00:45:43,990 --> 00:45:46,179 Then we can think about what we can do 1335 00:45:46,180 --> 00:45:48,549 from that point on. 1336 00:45:48,550 --> 00:45:50,619 Now, your next question regarding 1337 00:45:50,620 --> 00:45:52,239 this nice classifier that tells you 1338 00:45:52,240 --> 00:45:54,429 whether traffic captures contain a 1339 00:45:54,430 --> 00:45:57,069 DGA or not is: can I have it?
1340 00:45:57,070 --> 00:45:59,439 The answer is yes. 1341 00:45:59,440 --> 00:46:00,969 There is the address of the GitHub 1342 00:46:00,970 --> 00:46:03,039 repo. And I really hope that 1343 00:46:03,040 --> 00:46:05,259 people try to use it and say, 1344 00:46:05,260 --> 00:46:06,759 oh, my God, Ben, it doesn't work for me 1345 00:46:06,760 --> 00:46:08,619 at all, it's useless, improve it, 1346 00:46:08,620 --> 00:46:09,579 because I want to improve it. 1347 00:46:09,580 --> 00:46:11,199 I want this thing to work and be 1348 00:46:11,200 --> 00:46:13,269 available for anyone who wants to 1349 00:46:13,270 --> 00:46:15,879 detect DGAs in their pcaps. 1350 00:46:15,880 --> 00:46:17,949 So we can 1351 00:46:17,950 --> 00:46:20,339 now summarize this whole journey, and 1352 00:46:20,340 --> 00:46:22,599 we can say that DGAs are a pain 1353 00:46:22,600 --> 00:46:24,549 and the automatic detection of DGAs 1354 00:46:24,550 --> 00:46:26,949 helps. But if it is done naively, 1355 00:46:26,950 --> 00:46:28,659 it gets confused by strategic 1356 00:46:28,660 --> 00:46:30,009 placement of vowels. 1357 00:46:30,010 --> 00:46:32,199 If you do it less naively, it gets less 1358 00:46:32,200 --> 00:46:34,509 confused by strategic placement of vowels 1359 00:46:34,510 --> 00:46:36,759 and becomes equipped to handle Web 2.0 domain names 1360 00:46:36,760 --> 00:46:39,609 like Tumblr, and 1361 00:46:39,610 --> 00:46:41,679 undetectable auto-generated domain names 1362 00:46:41,680 --> 00:46:43,509 may be a possibility in the future. 1363 00:46:43,510 --> 00:46:45,129 But you know what? First, let's focus on 1364 00:46:45,130 --> 00:46:46,899 the DGAs of today, and then we can see what 1365 00:46:46,900 --> 00:46:48,779 we can do next. 1366 00:46:48,780 --> 00:46:50,439 So thank you. And are there any 1367 00:46:50,440 --> 00:46:51,440 questions?
1368 00:47:00,250 --> 00:47:02,499 OK, thank you, Ben, 1369 00:47:02,500 --> 00:47:04,659 for your very nice talk, 1370 00:47:04,660 --> 00:47:06,789 very interesting talk. If you have 1371 00:47:06,790 --> 00:47:09,549 to leave now, do so quietly, 1372 00:47:09,550 --> 00:47:12,009 please, so we can have a nice 1373 00:47:12,010 --> 00:47:14,319 and informative Q&A session. 1374 00:47:14,320 --> 00:47:15,669 Thank you very much. 1375 00:47:15,670 --> 00:47:18,759 So we have a few minutes of time, 1376 00:47:18,760 --> 00:47:20,979 not so few minutes; 1377 00:47:20,980 --> 00:47:23,169 we have a lot of time for Q&A. 1378 00:47:23,170 --> 00:47:25,689 So if you have any questions, line up 1379 00:47:25,690 --> 00:47:28,089 at the microphones, 1380 00:47:28,090 --> 00:47:30,579 and we will also take questions from 1381 00:47:30,580 --> 00:47:32,019 the Internet. 1382 00:47:32,020 --> 00:47:34,539 So leaving 1383 00:47:34,540 --> 00:47:37,029 is OK. Talking is not, unless 1384 00:47:37,030 --> 00:47:38,619 you step up to a microphone. 1385 00:47:38,620 --> 00:47:39,669 Thank you. 1386 00:47:39,670 --> 00:47:42,129 So we will start 1387 00:47:42,130 --> 00:47:44,269 at the front, from 1388 00:47:44,270 --> 00:47:45,789 my side left, from your side 1389 00:47:45,790 --> 00:47:47,889 right, microphone. 1390 00:47:47,890 --> 00:47:50,649 Hi. So thanks for your talk. 1391 00:47:50,650 --> 00:47:52,839 I was wondering, you had lots of issues 1392 00:47:52,840 --> 00:47:55,059 with the dictionary because of its size 1393 00:47:55,060 --> 00:47:57,219 and so on, and then 1394 00:47:57,220 --> 00:47:58,389 you built all these hacks.
1395 00:47:58,390 --> 00:48:00,699 But I wondered whether you thought 1396 00:48:00,700 --> 00:48:02,439 about getting rid of the dictionary 1397 00:48:02,440 --> 00:48:04,659 altogether by including syllable 1398 00:48:04,660 --> 00:48:06,999 information, for example, because 1399 00:48:07,000 --> 00:48:09,069 I also think you're rather interested 1400 00:48:09,070 --> 00:48:11,559 in possible words of the English language 1401 00:48:11,560 --> 00:48:12,969 than the words that are listed in the 1402 00:48:12,970 --> 00:48:14,860 dictionary. So, for example, 1403 00:48:16,000 --> 00:48:18,339 Mandalorian or something like that 1404 00:48:18,340 --> 00:48:20,589 is probably in 1405 00:48:20,590 --> 00:48:22,089 a dictionary about Star Wars, but 1406 00:48:22,090 --> 00:48:24,429 probably not in the dictionary you used. 1407 00:48:24,430 --> 00:48:26,529 But it is a possible word of the English 1408 00:48:26,530 --> 00:48:28,659 language by the rules 1409 00:48:28,660 --> 00:48:29,979 for the syllables. 1410 00:48:29,980 --> 00:48:32,139 So, yes, I just thought maybe 1411 00:48:32,140 --> 00:48:34,539 you could get rid of the dictionary by 1412 00:48:34,540 --> 00:48:36,759 using that syllable information. 1413 00:48:36,760 --> 00:48:37,209 So. 1414 00:48:37,210 --> 00:48:39,429 Well, actually, 1415 00:48:39,430 --> 00:48:41,889 there has been one paper 1416 00:48:41,890 --> 00:48:43,899 on the subject that used exactly the 1417 00:48:43,900 --> 00:48:45,479 approach that you described, 1418 00:48:45,480 --> 00:48:47,769 well, more or less, by stemming words 1419 00:48:47,770 --> 00:48:50,439 and trying to look at morphemes 1420 00:48:50,440 --> 00:48:51,879 and so forth and so on. 1421 00:48:51,880 --> 00:48:53,229 It was very interesting. 1422 00:48:54,310 --> 00:48:57,189 Their particular attempt didn't 1423 00:48:57,190 --> 00:48:59,439 go so well, but I think it's 1424 00:48:59,440 --> 00:49:01,359 a good approach.
I don't know about 1425 00:49:01,360 --> 00:49:03,550 getting rid of the dictionary completely, 1426 00:49:04,600 --> 00:49:06,669 because I'm not very convinced 1427 00:49:06,670 --> 00:49:09,069 about how good the 1428 00:49:09,070 --> 00:49:11,279 word stems or the syllables or the 1429 00:49:11,280 --> 00:49:13,449 items that you are proposing are 1430 00:49:13,450 --> 00:49:15,639 going to be at reconstructing words from 1431 00:49:15,640 --> 00:49:16,629 the dictionary. 1432 00:49:16,630 --> 00:49:17,589 But you know what? 1433 00:49:17,590 --> 00:49:19,539 I think it's an avenue worth pursuing. 1434 00:49:19,540 --> 00:49:21,789 If this could actually help 1435 00:49:21,790 --> 00:49:23,649 downsize the dictionary, then it's 1436 00:49:23,650 --> 00:49:25,209 something that I'm interested in looking 1437 00:49:25,210 --> 00:49:26,949 into and doing, because right now the 1438 00:49:26,950 --> 00:49:29,019 dictionary weighs like ten 1439 00:49:29,020 --> 00:49:30,969 point seven gigabytes and anything to 1440 00:49:30,970 --> 00:49:33,519 make it smaller is very welcome. 1441 00:49:33,520 --> 00:49:34,520 So thank you. 1442 00:49:35,530 --> 00:49:37,929 OK, to the left front 1443 00:49:37,930 --> 00:49:39,189 microphone, please. 1444 00:49:39,190 --> 00:49:40,929 I was wondering how you manage domains 1445 00:49:40,930 --> 00:49:42,909 like electricity or things which are 1446 00:49:42,910 --> 00:49:44,139 valid, but which aren't in the 1447 00:49:44,140 --> 00:49:45,140 dictionary. 1448 00:49:46,540 --> 00:49:49,539 Things, you mean, like 1449 00:49:49,540 --> 00:49:51,929 an SkyCity, because it's not 1450 00:49:51,930 --> 00:49:54,059 even like a word in the 1451 00:49:54,060 --> 00:49:55,619 dictionary at all? 1452 00:49:55,620 --> 00:49:57,719 Yeah, that's a 1453 00:49:57,720 --> 00:49:58,929 really good question.
1454 00:49:58,930 --> 00:50:01,319 Not every domain 1455 00:50:01,320 --> 00:50:03,360 name is going to fit into this pattern 1456 00:50:04,530 --> 00:50:06,899 of domain names that are 1457 00:50:06,900 --> 00:50:09,059 like a word in the dictionary, 1458 00:50:09,060 --> 00:50:11,279 which is why this thing still needs 1459 00:50:11,280 --> 00:50:12,449 to be tuned, 1460 00:50:12,450 --> 00:50:14,789 and more approaches 1461 00:50:14,790 --> 00:50:17,249 for recognizing 1462 00:50:17,250 --> 00:50:19,499 valid domain names need to be added 1463 00:50:19,500 --> 00:50:22,289 to it. Looking at the top domains, 1464 00:50:22,290 --> 00:50:24,539 I mean, that's a given. 1465 00:50:24,540 --> 00:50:27,089 And you know what, 1466 00:50:27,090 --> 00:50:28,589 let me ask you a question. 1467 00:50:28,590 --> 00:50:30,899 What sort of approach do you see 1468 00:50:30,900 --> 00:50:33,239 that would have foreseen from first 1469 00:50:33,240 --> 00:50:35,429 principles that exclusivity 1470 00:50:35,430 --> 00:50:37,769 is a legitimate domain name? 1471 00:50:37,770 --> 00:50:39,359 That's kind of an issue. 1472 00:50:39,360 --> 00:50:41,669 So I think things like 1473 00:50:41,670 --> 00:50:43,829 exclusivity will either have 1474 00:50:43,830 --> 00:50:46,199 to be manually 1475 00:50:46,200 --> 00:50:49,169 whitelisted, or you could have a 1476 00:50:49,170 --> 00:50:50,219 specific... I don't know. 1477 00:50:50,220 --> 00:50:52,499 I really don't know how you could see 1478 00:50:52,500 --> 00:50:55,439 in advance that it is legitimate 1479 00:50:55,440 --> 00:50:58,229 except, you know, taking 1480 00:50:58,230 --> 00:51:00,360 this a priori knowledge and applying it. 1481 00:51:02,540 --> 00:51:04,909 OK, the rear right microphone, 1482 00:51:04,910 --> 00:51:05,910 please. 1483 00:51:08,840 --> 00:51:10,639 How are you dealing with Punycode domains, 1484 00:51:10,640 --> 00:51:11,869 as well as 1485 00:51:13,220 --> 00:51:15,169 some domains in the U.S.
1486 00:51:15,170 --> 00:51:17,359 and other countries that look like 1487 00:51:17,360 --> 00:51:18,709 English or gibberish? 1488 00:51:18,710 --> 00:51:20,899 China uses a lot of strings 1489 00:51:20,900 --> 00:51:23,689 that we wouldn't necessarily recognize. 1490 00:51:23,690 --> 00:51:26,059 And have you looked at other properties 1491 00:51:26,060 --> 00:51:28,399 of these domains, such as 1492 00:51:28,400 --> 00:51:29,719 time of first use, you know, 1493 00:51:29,720 --> 00:51:32,149 has anyone gone to these domains before, 1494 00:51:32,150 --> 00:51:34,399 as well as 1495 00:51:34,400 --> 00:51:35,289 how recently 1496 00:51:35,290 --> 00:51:36,439 they were registered? 1497 00:51:36,440 --> 00:51:38,689 OK, so first of all, regarding the 1498 00:51:38,690 --> 00:51:41,299 languages issue, this project 1499 00:51:41,300 --> 00:51:43,189 was specifically about English. 1500 00:51:43,190 --> 00:51:45,229 It is a work in progress and a proof of 1501 00:51:45,230 --> 00:51:47,179 concept. I specifically thought 1502 00:51:47,180 --> 00:51:48,079 about that. 1503 00:51:48,080 --> 00:51:50,329 And if this thing is actually going to 1504 00:51:50,330 --> 00:51:51,619 be used and is going to function 1505 00:51:51,620 --> 00:51:53,779 properly, I am interested in expanding 1506 00:51:53,780 --> 00:51:55,790 it to be able to handle other languages. 1507 00:51:58,550 --> 00:52:00,049 Now, can you remind me of your other 1508 00:52:00,050 --> 00:52:00,829 question? 1509 00:52:00,830 --> 00:52:03,139 So yeah, 1510 00:52:03,140 --> 00:52:05,479 Punycode, or 1511 00:52:05,480 --> 00:52:06,979 domain strings that are English 1512 00:52:06,980 --> 00:52:09,409 characters but are used in countries 1513 00:52:09,410 --> 00:52:11,629 or popular in countries where it's 1514 00:52:11,630 --> 00:52:13,039 not English words. 1515 00:52:13,040 --> 00:52:14,149 That's what I said.
1516 00:52:14,150 --> 00:52:16,609 Then there are other characteristics 1517 00:52:16,610 --> 00:52:17,610 of these domains. 1518 00:52:18,920 --> 00:52:20,329 For instance, had they, had they ever 1519 00:52:20,330 --> 00:52:21,330 been 1520 00:52:22,940 --> 00:52:23,599 right, you know. 1521 00:52:23,600 --> 00:52:24,739 Right. And newly registered. 1522 00:52:24,740 --> 00:52:27,089 Yeah. Yeah, I remember now. 1523 00:52:27,090 --> 00:52:29,179 Well, plenty of projects trying 1524 00:52:29,180 --> 00:52:30,379 to solve this, 1525 00:52:30,380 --> 00:52:32,779 this problem in the past have relied 1526 00:52:32,780 --> 00:52:35,009 on this sort of thing, keeping 1527 00:52:35,010 --> 00:52:37,099 a sort of ongoing intelligence 1528 00:52:37,100 --> 00:52:39,229 operation that will 1529 00:52:39,230 --> 00:52:41,299 be able to tell you whether the 1530 00:52:41,300 --> 00:52:44,239 domain is benign or very suspicious 1531 00:52:44,240 --> 00:52:46,009 based on this sort of intelligence. 1532 00:52:46,010 --> 00:52:47,239 When was it registered? 1533 00:52:47,240 --> 00:52:49,939 When was it accessed, and so forth and so on? 1534 00:52:49,940 --> 00:52:52,429 These are all very useful features. 1535 00:52:52,430 --> 00:52:54,589 I specifically decided that 1536 00:52:54,590 --> 00:52:56,719 they were out of scope for this 1537 00:52:56,720 --> 00:52:58,879 project and this project should join 1538 00:52:58,880 --> 00:53:01,189 hands with any 1539 00:53:01,190 --> 00:53:03,379 sort of engine making use of 1540 00:53:03,380 --> 00:53:05,539 an ongoing intelligence operation instead 1541 00:53:05,540 --> 00:53:07,999 of reinventing the wheel and implementing 1542 00:53:08,000 --> 00:53:09,080 that sort of approach. 1543 00:53:10,260 --> 00:53:11,260 Thank you. 1544 00:53:11,630 --> 00:53:14,159 OK, Angel, other questions 1545 00:53:14,160 --> 00:53:15,579 from the Internet?
1546 00:53:15,580 --> 00:53:17,699 I'm just looking at them right now, 1547 00:53:17,700 --> 00:53:18,700 but it seems 1548 00:53:19,770 --> 00:53:21,420 the Internet had the same question. 1549 00:53:23,370 --> 00:53:24,689 That's convenient. 1550 00:53:24,690 --> 00:53:27,179 So I guess the next question is 1551 00:53:27,180 --> 00:53:29,070 from the microphone right up front. 1552 00:53:31,830 --> 00:53:33,959 It sounds like the 1553 00:53:33,960 --> 00:53:36,329 DGAs will at some point output 1554 00:53:36,330 --> 00:53:38,639 some sort of stick poetry 1555 00:53:38,640 --> 00:53:40,679 that will make your system have to 1556 00:53:40,680 --> 00:53:42,839 identify DGA poetry, but also 1557 00:53:42,840 --> 00:53:45,239 in like 400 different languages, 1558 00:53:45,240 --> 00:53:47,279 such as the dot com 1559 00:53:47,280 --> 00:53:49,919 will allow you to register. 1560 00:53:49,920 --> 00:53:51,059 That would be a problem. 1561 00:53:51,060 --> 00:53:53,279 So my question is, 1562 00:53:53,280 --> 00:53:55,559 what place in the DNS ecosystem 1563 00:53:55,560 --> 00:53:58,109 do you believe that your system will 1564 00:53:58,110 --> 00:54:00,179 have? I mean, where exactly 1565 00:54:00,180 --> 00:54:01,689 do we have a use for this? 1566 00:54:01,690 --> 00:54:04,829 OK, so actually 1567 00:54:04,830 --> 00:54:07,079 this was conceived in an entirely 1568 00:54:07,080 --> 00:54:08,759 different context than what I described 1569 00:54:08,760 --> 00:54:10,739 in the first few slides. 1570 00:54:10,740 --> 00:54:12,989 This was conceived as a feature that 1571 00:54:12,990 --> 00:54:15,239 you can compute on a sample in 1572 00:54:15,240 --> 00:54:17,489 a sandboxing context, so you can 1573 00:54:17,490 --> 00:54:19,559 do machine learning operations and 1574 00:54:19,560 --> 00:54:21,869 so forth and extract a feature, 1575 00:54:21,870 --> 00:54:24,029 so that you can look at the sample and 1576 00:54:24,030 --> 00:54:25,799 say, OK, it's DGA'd.
1577 00:54:25,800 --> 00:54:27,479 That's one of the features that I can 1578 00:54:27,480 --> 00:54:28,709 look at and use. 1579 00:54:28,710 --> 00:54:30,989 But as I mentioned 1580 00:54:30,990 --> 00:54:33,119 in one of the first slides, I think 1581 00:54:33,120 --> 00:54:35,069 that the pinnacle of what this thing could 1582 00:54:35,070 --> 00:54:37,349 do is, if the code were optimized for 1583 00:54:37,350 --> 00:54:38,969 performance in that kind of context, 1584 00:54:38,970 --> 00:54:41,159 which it currently is not, it could 1585 00:54:41,160 --> 00:54:43,679 sit on the firewall and throttle 1586 00:54:43,680 --> 00:54:45,659 traffic before it manages to get out of 1587 00:54:45,660 --> 00:54:46,660 the network. 1588 00:54:48,990 --> 00:54:51,329 OK, and there's another question on the 1589 00:54:51,330 --> 00:54:54,479 rear right side microphone. 1590 00:54:54,480 --> 00:54:57,299 Yes, you mentioned that you tried other 1591 00:54:57,300 --> 00:54:59,099 machine learning models for this. 1592 00:54:59,100 --> 00:55:02,009 Did you try a Markov chain classifier? 1593 00:55:02,010 --> 00:55:03,659 Actually, that one I haven't tried. 1594 00:55:03,660 --> 00:55:05,309 Markov chains, I'm saying, OK, I'm going 1595 00:55:05,310 --> 00:55:06,059 to try that tomorrow. 1596 00:55:06,060 --> 00:55:07,859 I'm aware of some other research that 1597 00:55:07,860 --> 00:55:10,229 does DGA classification using Markov 1598 00:55:10,230 --> 00:55:12,389 chains. And there's a couple 1599 00:55:12,390 --> 00:55:13,359 of advantages to this. 1600 00:55:13,360 --> 00:55:16,259 So if you're concerned about bigrams 1601 00:55:16,260 --> 00:55:18,449 not being enough, you can extend to 1602 00:55:18,450 --> 00:55:19,679 a third-order or fourth-order 1603 00:55:19,680 --> 00:55:21,719 Markov chain assumption with smoothing.
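[Editor's illustration] The questioner's suggestion, a higher-order character Markov chain with smoothing, can be sketched roughly as follows. This is a minimal sketch under stated assumptions and not the speaker's code: the class name `CharMarkov`, the tiny `TRAINING_NAMES` list, the alphabet, and the choice of add-one (Laplace) smoothing are all hypothetical and chosen only for illustration. A real model would be trained on a large corpus of known-legitimate domain names.

```python
import math
from collections import defaultdict

# Hypothetical alphabet for domain labels; "$" is added below as an end marker.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"

# Hypothetical, tiny training set; a real one would be a large list of
# known-legitimate names.
TRAINING_NAMES = ["mail", "server", "cloud", "update", "market",
                  "weather", "online", "network", "service", "store"]


class CharMarkov:
    """Character-level Markov chain of a given order with additive smoothing."""

    def __init__(self, order=2, alpha=1.0):
        self.order = order      # number of preceding characters used as context
        self.alpha = alpha      # smoothing constant (1.0 = Laplace smoothing)
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def _pad(self, name):
        # "^" marks the start of the name, "$" marks the end.
        return "^" * self.order + name.lower() + "$"

    def train(self, names):
        for name in names:
            s = self._pad(name)
            for i in range(self.order, len(s)):
                ctx, ch = s[i - self.order:i], s[i]
                self.counts[ctx][ch] += 1
                self.totals[ctx] += 1

    def avg_log_prob(self, name):
        """Average log-probability per transition; one dict lookup per
        character, so scoring is linear in the length of the name."""
        s = self._pad(name)
        vocab = len(ALPHABET) + 1  # alphabet plus the end marker
        total = 0.0
        for i in range(self.order, len(s)):
            ctx, ch = s[i - self.order:i], s[i]
            num = self.counts.get(ctx, {}).get(ch, 0) + self.alpha
            den = self.totals.get(ctx, 0) + self.alpha * vocab
            total += math.log(num / den)
        return total / (len(s) - self.order)


model = CharMarkov(order=2)
model.train(TRAINING_NAMES)

# A name built from familiar character sequences should score higher
# (closer to zero) than a random-looking string.
print(model.avg_log_prob("mailserver"))
print(model.avg_log_prob("xqzjvkwpfh"))
```

Raising `order` to 3 or 4 gives the third-order or fourth-order model mentioned in the question. Smoothing matters more as the order grows, because most longer contexts never appear in the training data.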
1604 00:55:21,720 --> 00:55:23,819 And you can also, you know, 1605 00:55:23,820 --> 00:55:25,889 lookup time is linear, 1606 00:55:25,890 --> 00:55:27,779 so it's pretty fast. 1607 00:55:27,780 --> 00:55:29,849 OK, I'm going to have to take a look at 1608 00:55:29,850 --> 00:55:30,389 that. 1609 00:55:30,390 --> 00:55:32,219 Yeah. Thank you. 1610 00:55:32,220 --> 00:55:34,289 OK, on the left rear 1611 00:55:34,290 --> 00:55:36,509 microphone there's another question. 1612 00:55:36,510 --> 00:55:38,669 Did you try using Bloom filters 1613 00:55:38,670 --> 00:55:40,529 for n-grams? Because they might reduce 1614 00:55:40,530 --> 00:55:42,659 the lookup time significantly. To 1615 00:55:42,660 --> 00:55:44,759 use what? Bloom filters. 1616 00:55:44,760 --> 00:55:45,449 Bloom filters. 1617 00:55:45,450 --> 00:55:47,819 No, I have not looked into that, too. 1618 00:55:47,820 --> 00:55:49,979 Lots of useful suggestions today. 1619 00:55:49,980 --> 00:55:51,179 I am thankful for that. 1620 00:55:52,660 --> 00:55:53,879 OK. 1621 00:55:53,880 --> 00:55:55,989 Oh, we have time for another 1622 00:55:55,990 --> 00:55:58,779 couple of questions, so I'm looking at... 1623 00:55:58,780 --> 00:55:59,830 Nothing from the Internet. 1624 00:56:01,030 --> 00:56:03,279 OK, last question, maybe to wrap 1625 00:56:03,280 --> 00:56:05,679 it up, from the right 1626 00:56:05,680 --> 00:56:06,969 front microphone. 1627 00:56:06,970 --> 00:56:08,319 Thank you. 1628 00:56:08,320 --> 00:56:10,629 I liked your presentation very much. 1629 00:56:10,630 --> 00:56:12,999 I'm more from data science and 1630 00:56:13,000 --> 00:56:14,619 not experienced with the network 1631 00:56:14,620 --> 00:56:15,769 perspective, OK? 1632 00:56:15,770 --> 00:56:17,499 I think it looks like a cat and mouse 1633 00:56:17,500 --> 00:56:19,569 game. And I think, from the 1634 00:56:19,570 --> 00:56:21,799 methods perspective, I think it's 1635 00:56:21,800 --> 00:56:24,069 a losing game.
So I would like to hear 1636 00:56:24,070 --> 00:56:25,449 more about the motivation at the 1637 00:56:25,450 --> 00:56:26,450 beginning. 1638 00:56:27,370 --> 00:56:29,559 The standard process that 1639 00:56:29,560 --> 00:56:31,629 is described, how it is handled with 1640 00:56:31,630 --> 00:56:33,279 the middle management and the reports, 1641 00:56:33,280 --> 00:56:35,529 that it was done by reverse engineering: 1642 00:56:35,530 --> 00:56:38,049 are there statistical methods 1643 00:56:38,050 --> 00:56:40,149 involved today, so that it is 1644 00:56:40,150 --> 00:56:42,189 the same cat and mouse game, only that you are 1645 00:56:42,190 --> 00:56:43,899 playing it better? 1646 00:56:43,900 --> 00:56:45,969 Or is it just 1647 00:56:45,970 --> 00:56:48,189 the scheme, which I believe personally is 1648 00:56:48,190 --> 00:56:49,419 a solution? 1649 00:56:49,420 --> 00:56:50,799 I agree with you completely. 1650 00:56:50,800 --> 00:56:52,539 That's what I said. In theory, it really 1651 00:56:52,540 --> 00:56:54,849 is a losing game because all forms 1652 00:56:54,850 --> 00:56:56,739 of DGAs are going to evolve 1653 00:56:56,740 --> 00:56:57,909 eventually. 1654 00:56:57,910 --> 00:56:59,319 They haven't yet, but they will. 1655 00:56:59,320 --> 00:57:01,839 And as I said, my personal 1656 00:57:01,840 --> 00:57:03,219 analysis, and it seems that you agree 1657 00:57:03,220 --> 00:57:04,449 with me, is that eventually, 1658 00:57:04,450 --> 00:57:06,159 theoretically, they win this game.
1659 00:57:06,160 --> 00:57:08,229 Now, security is full of 1660 00:57:08,230 --> 00:57:10,239 such games where the attackers 1661 00:57:10,240 --> 00:57:12,789 theoretically eventually win because 1662 00:57:12,790 --> 00:57:15,069 the mission, the burden on the shoulders 1663 00:57:15,070 --> 00:57:17,499 of the defenders is like, 1664 00:57:17,500 --> 00:57:19,719 eventually, in the worst case, 1665 00:57:19,720 --> 00:57:22,119 it's equivalent 1666 00:57:22,120 --> 00:57:24,579 to some NP-complete problem 1667 00:57:24,580 --> 00:57:26,559 or is generally, in theory, in the worst 1668 00:57:26,560 --> 00:57:27,729 case, impossible. 1669 00:57:27,730 --> 00:57:30,459 But while you are correct, 1670 00:57:30,460 --> 00:57:32,739 I believe that we should be focusing 1671 00:57:32,740 --> 00:57:34,179 on what's happening right now. 1672 00:57:34,180 --> 00:57:36,459 As I said, OK, it's a losing game right 1673 00:57:36,460 --> 00:57:39,189 now. But besides 1674 00:57:39,190 --> 00:57:41,649 trying to look aside and think, OK, 1675 00:57:41,650 --> 00:57:43,899 what game changer can we bring in 1676 00:57:43,900 --> 00:57:45,849 here to make this ultimately not a losing 1677 00:57:45,850 --> 00:57:47,829 game, we need to do our best to push 1678 00:57:47,830 --> 00:57:50,109 forward and make it less of a losing game 1679 00:57:50,110 --> 00:57:52,299 temporarily, even if eventually 1680 00:57:52,300 --> 00:57:54,369 we know that in theory 1681 00:57:54,370 --> 00:57:55,509 we're going to lose. 1682 00:57:55,510 --> 00:57:56,510 That's what I believe. 1683 00:57:58,660 --> 00:57:59,830 OK, Ben. 1684 00:58:00,850 --> 00:58:02,379 The rest of us, give a warm round of applause to 1685 00:58:02,380 --> 00:58:03,699 Ben, and thank you very much for your 1686 00:58:03,700 --> 00:58:04,700 talk.