*silent 31C3 preroll*

Dr. Gareth Owen: Hello. Can you hear me? Yes. Okay. So my name is Gareth Owen. I’m from the University of Portsmouth. I’m an academic, and I’m going to talk to you about an experiment that we did on the Tor hidden services, trying to categorize them, estimate how many there were, etc. As we go through the talk I’m going to explain how Tor hidden services work internally, and how the data was collected, so you know what sort of conclusions you can draw from the data based on the way that we’ve collected it. Just so that I get an idea: how many of you use Tor on a regular basis? Could you put your hand up for me? So, quite a big number. Keep your hand up, or put your hand up, if you’re a relay operator. Wow, that’s quite a significant number, isn’t it? And then, put your hand up, or keep it up, if you run a hidden service. Okay, so, a smaller number, but still some people run hidden services. Okay, so, some of you may be very familiar with the way Tor works at a low level. But I am gonna go through it for those who aren’t, so they understand just how it works.
And as we go along, because I’m explaining how the hidden services work, I’m going to tag on information about how the Tor hidden services themselves can be deanonymised, and also how the users of those hidden services can be deanonymised, if you put some strict criteria on what it is you want to do with respect to them. So the things that I’m going to go over: I wanna go over how Tor works, and then specifically how hidden services work. I’m gonna talk about something called the “Tor Distributed Hash Table” for hidden services. If you’ve heard that term and don’t know what it means, don’t worry, I’ll explain what a distributed hash table is and how it works. It’s not as complicated as it sounds. And then I wanna go over darknet data, so, data that we collected from Tor hidden services. And as I say, as we go along I will explain how you do deanonymisation of both the services themselves and of the visitors to the services. And just how complicated it is.
So you may have seen this slide, which I think was from GCHQ, released last year as part of the Snowden leaks, where they said: you can deanonymise some users some of the time, but they’ve had no success in deanonymising someone in response to a specific request. So, given all of you, e.g., I may be able to deanonymise a small fraction of you, but I can’t choose precisely one person I want to deanonymise. That’s what I’m gonna be explaining in relation to the deanonymisation attacks: how you can deanonymise a section, but you can’t necessarily choose which section of the users you will be deanonymising. Tor tries to solve a couple of different problems. On one hand it allows you to bypass censorship. So if you’re in a country like China, which blocks some types of traffic, you can use Tor to bypass their censorship blocks. It tries to give you privacy, so that at one level in the network someone can’t see what you’re doing, and at another point in the network people may be able to see what you’re doing but don’t know who you are. Now the traditional case for this is to look at VPNs. With a VPN you have a single provider. You have lots of users connecting to the VPN. The VPN has a mixing effect from an outsider’s or a server’s point of view. And then out of the VPN you see requests to Twitter, Wikipedia, etc.
And if that traffic isn’t encrypted then the VPN can also read the contents of the traffic. Now of course there is a fundamental weakness with this. You have to trust the VPN provider: the VPN provider knows both who you are and what you’re doing, and can link those two together with absolute certainty. So whilst you do get some of these properties assuming you’ve got a trustworthy VPN provider, you don’t get them in the face of an untrustworthy VPN provider. And of course: how do you trust the VPN provider? What sort of measure do you use? That’s an open question. So Tor tries to solve this problem by distributing the trust. Tor is an open source project, so you can go onto their Git repository, you can download the source code, and change it, improve it, submit patches, etc. As you heard earlier, during Jacob and Roger’s talk, they’re currently partly sponsored by the US Government, which seems a bit paradoxical, but they explained in that talk that this doesn’t affect their judgment. And indeed, they do have some funding from other sources, and they design the system – which I’ll talk about a little bit later – in a way where they don’t have to trust each other. So there’s some redundancy, and they’re trying to minimize these sorts of trust issues.
Now, Tor is a partially decentralized network, which means that it has some centralized components, which are under the control of the Tor Project, and some decentralized components, which are normally the Tor relays. If you run a relay, you’re one of those decentralized components. There is, however, no single authority on the Tor network – no single server which is responsible, which you’re required to trust. So the trust is somewhat distributed, but not entirely. When you establish a circuit through Tor you, the user, download a list of all of the relays inside the Tor network. And you get to pick – and I’ll tell you how you do that – which relays you’re going to use to route your traffic through. So here is a typical example. You’re here on the left hand side as the user. You download a list of the relays inside the Tor network and you select from that list three nodes: a guard node, which is your entry into the Tor network; a relay node, which is a middle node – essentially, it’s going to route your traffic to a third hop; and then the third hop is the exit node, where your traffic essentially exits out onto the internet. Now, looking at the circuit – so this is a circuit through the Tor network through which you’re going to route your traffic – there are three layers of encryption at the beginning, between you and the guard node.
Your traffic is encrypted three times: in the first instance encrypted to the guard, then it’s encrypted again to the relay, and then encrypted again to the exit. And as the traffic moves through the Tor network, each of those layers of encryption is peeled off the data. The guard here in this case knows who you are, and the exit relay knows what you’re doing, but neither knows both. And the middle relay doesn’t really know a lot, except which relay is the guard and which relay is the exit. Who runs an exit relay? If you run an exit relay, all of the traffic which users are sending out onto the internet appears to come from your IP address. So running an exit relay is potentially risky, because someone may do something through your relay which attracts attention. And then, when law enforcement trace that back to an IP address, it’s going to come back to your address. So some relay operators have had trouble with this, with law enforcement coming to them and saying: “Hey, we’ve got this traffic coming through your IP address and you have to explain it.” So if you want to run an exit relay it’s a little bit risky, but we’re thankful for those people that do run exit relays, because ultimately if people didn’t run exit relays you wouldn’t be able to get out of the Tor network, and it wouldn’t be terribly useful from this point of view. So, yes.
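The layered encryption just described can be sketched in a few lines. This is a toy illustration only: the keys are made up, and a hash-based XOR stream cipher stands in for the real AES/TLS cryptography that Tor uses.

```python
# Toy sketch of onion routing's layered encryption (illustration only:
# a hash-based XOR stream cipher stands in for Tor's real crypto).
import hashlib

def keystream_xor(key: bytes, data: bytes) -> bytes:
    # XOR with a hash-derived keystream; applying it twice with the
    # same key decrypts, which is all this sketch needs.
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))

# The client shares one (made-up) key with each hop in the circuit.
keys = {"guard": b"k-guard", "middle": b"k-middle", "exit": b"k-exit"}

# The client wraps the payload in three layers, innermost for the exit.
cell = b"GET / HTTP/1.1"
for hop in ["exit", "middle", "guard"]:
    cell = keystream_xor(keys[hop], cell)

# Each relay peels off exactly one layer as the cell moves along.
for hop in ["guard", "middle", "exit"]:
    cell = keystream_xor(keys[hop], cell)

print(cell)  # only after the exit's layer is removed is the request visible
```

The point of the structure is that the guard only ever sees the outermost layer, and the exit only ever sees the innermost one, matching the "guard knows who, exit knows what" split above.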
*applause*

So, every Tor relay: when you set up a Tor relay you publish something called a descriptor, which describes your Tor relay and how to use it, to a set of servers called the authorities. And the trust in the Tor network is essentially split across these authorities. They’re run by core Tor Project members. And they maintain a list of all of the relays in the network, and they observe them over a period of time. If the relays exhibit certain properties, they give the relays flags. If, e.g., a relay allows traffic to exit from the Tor network, it will get the ‘Exit’ flag. If it has been switched on for a certain period of time, or has carried a certain amount of traffic, it’ll be allowed to become a guard relay, which is the first node in your circuit. So when you build your circuit you download a list of these descriptors from one of the directory authorities. You look at the flags which have been assigned to each of the relays, and then you pick your route based on that. So you’ll pick the guard node from the set of relays which have the ‘Guard’ flag, your exit from the set of relays which have the ‘Exit’ flag, etc. Now, as of a quick count this morning there are about 1500 guard relays, around 1000 exit relays, and six relays flagged as bad exits. What does a ‘bad exit’ mean? *waits for audience to respond* That’s not good! That’s exactly what it means! Yes!
*laughs*

*applause*

So relays which have been flagged as bad exits, your client will never choose to exit traffic through. And examples of things which may get a relay flagged as a bad exit: if they’re fiddling with the traffic which is coming out of the Tor relay, or doing things like man-in-the-middle attacks against SSL traffic. We’ve seen various things: there have been relays man-in-the-middling SSL traffic, and there has very, very recently been an exit relay which was patching binaries that you downloaded from the internet, inserting malware into the binaries. So you can do these things, but the Tor Project tries to scan for them. And if these things are detected, then they’ll be flagged as bad exits. It’s true to say that the scanning mechanism is not 100% fool-proof, by any stretch of the imagination. It tries to pick up common types of attacks, so as a result it won’t pick up unknown attacks, or attacks which haven’t been seen or known about beforehand. So looking at this, how do you deanonymise the traffic travelling through the Tor network? Given some traffic coming out of the exit relay, how do you know which user that corresponds to? What is their IP address?
You can’t actually modify the traffic, because if any of the relays tries to modify the traffic which they’re sending through the network, Tor will tear down the circuit through the relay. So there are integrity checks at each of the hops. Because you can’t decrypt the packet, you can’t modify it in any meaningful way, and because there’s an integrity check at the next hop, any modification is detected. So you can’t insert a marker and try and follow the marker through the network. So instead, what can you do if you control some of the relays? Let me give you two cases. In the worst case the attacker controls all three of the relays that you pick, which is an unlikely scenario: they’d need to control quite a big proportion of the network. Then it should be quite obvious that they can work out who you are and also see what you’re doing, because in that case they can tag the traffic and just discard these integrity checks at each of the following hops. Now, in a different case, if you control the guard relay and the exit relay but not the middle relay, the guard relay can’t tamper with the traffic, because the middle relay will close down the circuit as soon as that happens. The exit relay can’t send stuff back down the circuit to try and identify the user, either.
Because again, the circuit will be closed down. So what can you do? Well, you can count the number of packets going through the guard node, and you can measure the timing differences between packets, and try and spot that pattern at the exit relay. You’re looking at counts of packets and the timing between those packets which are being sent, and essentially trying to correlate them. So if a user happens to pick your guard node, and then happens to pick your exit relay, you can deanonymise them with very high probability using this technique. You’re just correlating the timings of packets and counting the number of packets going through. And the attacks demonstrated in the literature are very reliable for this. We heard earlier, in the Tor talk, about the “relay early” tag, which was the attack discovered by the CERT researchers in the US. That attack didn’t rely on timing attacks. Instead, what they were able to do was send a special type of cell containing the data back down the circuit, essentially marking this data, and saying: “This is the data we’re seeing at the exit relay, or at the hidden service”, encoding into the messages travelling back down the circuit what the data was. And then you could pick those up at the guard relay and say, okay, whether it’s this person that’s doing that.
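The correlation idea above can be sketched very simply. The flows and timings below are made up for illustration: we compare the inter-packet gaps seen at a hostile guard against candidate flows seen at a hostile exit, since network delay shifts absolute times but largely preserves the gaps.

```python
# Toy sketch of a traffic-correlation attack (all data is made up):
# match the inter-packet timing pattern at a guard to flows at an exit.
def similarity(flow_a, flow_b):
    # Pearson correlation of inter-packet gaps; 1.0 = identical pattern.
    gaps_a = [t2 - t1 for t1, t2 in zip(flow_a, flow_a[1:])]
    gaps_b = [t2 - t1 for t1, t2 in zip(flow_b, flow_b[1:])]
    n = min(len(gaps_a), len(gaps_b))
    a, b = gaps_a[:n], gaps_b[:n]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = sum((x - mean_a) ** 2 for x in a) ** 0.5
    sd_b = sum((y - mean_b) ** 2 for y in b) ** 0.5
    return cov / (sd_a * sd_b)

# Packet arrival times observed at a malicious guard for one user...
guard_flow = [0.00, 0.12, 0.15, 0.60, 0.61, 1.30]
# ...and two candidate flows observed at a malicious exit. Delay shifts
# the absolute times, but the gaps between packets survive the trip.
exit_flows = {
    "flow_1": [5.00, 5.12, 5.15, 5.60, 5.61, 6.30],  # same pattern, shifted
    "flow_2": [5.00, 5.50, 5.55, 5.58, 6.90, 6.95],  # unrelated traffic
}

scores = {name: similarity(guard_flow, f) for name, f in exit_flows.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Real attacks are more statistically careful than this, but the principle is the same: no modification of traffic is needed, only passive observation at both ends.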
In fact, although this technique works – and yeah, it was a very nice attack – the traffic correlation attacks are actually just as powerful. So although this bug has been fixed, traffic correlation attacks still work and are still fairly reliable. So the problem still does exist. This is very much an open question: how do we solve this problem? We don’t know, currently, how to tackle traffic correlation. There are a couple of solutions, but they’re not particularly reliable. Let me just go through these, and I’ll skip back over the few things I’ve missed. The first is high-latency networks, so networks where packets are delayed in their transit through the network. That throws away a lot of the timing information, so they promise to potentially solve this problem. But of course, if you want to visit Google’s home page and you have to wait five minutes for it, you’re simply not going to use Tor. The whole point is trying to make this technology usable, and if you’ve got something which is very, very slow then that doesn’t make it attractive to use. But of course, this approach does work slightly better for e-mail.
If you think about it, with e-mail you don’t mind – well, you may not mind, you may mind – you don’t mind if your e-mail is delayed by some period of time, which makes this approach somewhat more workable there. And as Roger said earlier, you can also introduce padding into the circuit, so these are dummy cells. But, with a big caveat: some of the research suggests that you’d actually need to introduce quite a lot of padding to defeat these attacks, and that would overload the Tor network in its current state. So, again, not a particularly practical solution. How does Tor try to solve this problem? Well, Tor makes it very difficult to become a user’s guard relay. If you can’t become a user’s guard relay, then quite simply you don’t know who the user is. And so by making it very hard to become the guard relay, you can’t do this traffic correlation attack. So at the moment the Tor client chooses one guard relay and keeps it for a period of time. So if I want to target just one of you, I would need to control the guard relay that you were using at that particular point in time. And in fact I’d also need to know what that guard relay is. So by making it very unlikely that you would select a particular malicious guard relay – where the number of malicious guard relays is very small – that’s how Tor tries to solve this problem.
And at the moment your guard relay is your barrier of security. If the attacker can’t control the guard relay, then they won’t know who you are. That doesn’t mean they can’t try other sorts of side-channel attacks, by messing with the traffic at the exit relay, etc. – you may, e.g., download dodgy documents and open one on your computer, those sorts of things. Now the alternative, of course, to having a guard relay and keeping it for a very long time would be to have a guard relay and change it on a regular basis. Because you might think, well, just choosing one guard relay and sticking with it is probably a bad idea. But actually, that’s not the case. If you pick a guard relay – and assuming that the chance of picking a malicious guard relay is very low – then, when you first use your guard relay, if you’ve made a good choice, your traffic is safe. If you haven’t made a good choice, your traffic isn’t safe. Whereas if your Tor client chooses a guard relay every few minutes, or every hour, or something along those lines, at some point you’re gonna pick a malicious guard relay. So they’re gonna have some of your traffic, but not all of it. And so currently the trade-off is that we make it very difficult for an attacker to control a guard relay, and the user picks a guard relay and keeps it for a long period of time.
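That trade-off can be made concrete with a rough simulation. The numbers here are made up for illustration (a 2% malicious fraction, 1000 circuit-building sessions); the point is only the qualitative gap between rotating guards and a long-lived guard.

```python
# Rough simulation of the guard trade-off (all parameters are made up):
# rotating guards exposes almost everyone eventually; a long-lived guard
# exposes only the unlucky fraction who picked badly once.
import random

random.seed(31)
MALICIOUS_FRACTION = 0.02   # assumed fraction of guard capacity that is hostile
SESSIONS = 1000             # guard choices a rotating client makes over the period

def fraction_of_users_exposed(rotate: bool, users: int = 2000) -> float:
    exposed = 0
    for _ in range(users):
        if rotate:
            # A fresh guard every session: many draws, so eventually a bad one.
            hit = any(random.random() < MALICIOUS_FRACTION for _ in range(SESSIONS))
        else:
            # One long-lived guard: a single draw decides your fate.
            hit = random.random() < MALICIOUS_FRACTION
        exposed += hit
    return exposed / users

rotating = fraction_of_users_exposed(rotate=True)
long_lived = fraction_of_users_exposed(rotate=False)
print("rotating guards exposed:", rotating)
print("long-lived guard exposed:", long_lived)
```

With rotation, each user faces roughly a 1 − 0.98¹⁰⁰⁰ chance of hitting a malicious guard at least once, which is essentially certain; with a sticky guard it stays near the base rate of 2%.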
And so it’s very difficult for the attackers to be picked as that guard relay when they control a very small proportion of the network. So this, currently, provides those properties I described earlier: the privacy and the anonymity when you’re browsing the web, when you’re accessing websites, etc. But you still know who the website is. So although you’re anonymous and the website doesn’t know who you are, you know who the website is. And there may be some cases where, e.g., the website would also wish to remain anonymous – you want the person accessing the website and the website itself to be anonymous to each other. And you could think about people, e.g., being in countries where running a political blog might be a dangerous activity. If you run that on a regular webserver you’re easily identified, whereas if you’ve got some way in which you, as the webserver, can be anonymous, then that allows you to do that activity without being targeted by your government. So this is what hidden services try to solve. Now, when you first think about the problem you kind of think: “Hang on a second, the user doesn’t know who the website is and the website doesn’t know who the user is. So how on earth do they talk to each other?” Well, that’s essentially what the Tor hidden service protocol tries to set up: how do you identify and connect to each other.
So at the moment this is what happens. We’ve got Bob on the right hand side, who is the hidden service. And we’ve got Alice on the left hand side here, who is the user who wishes to visit the hidden service. Now, when Bob sets up his hidden service, he picks three nodes in the Tor network as introduction points and builds multi-hop circuits to them. So the introduction points don’t know who Bob is; Bob just has circuits to them. And Bob says to each of these introduction points: “Will you relay traffic to me if someone connects to you asking for me?” And those introduction points do that. So then, once Bob has picked his introduction points, he publishes a descriptor describing the list of his introduction points, for anyone who wishes to come to his website. And then Alice, on the left hand side, wishing to visit Bob, will pick a rendezvous point in the network and build a circuit to it. So this “RP” here is the rendezvous point. And she will relay a message via one of the introduction points, saying to Bob: “Meet me at the rendezvous point.” And then Bob will build a 3-hop circuit to the rendezvous point. So now at this stage we’ve got Alice with a multi-hop circuit to the rendezvous point, and Bob with a multi-hop circuit to the rendezvous point. Alice and Bob haven’t connected to one another directly. The rendezvous point doesn’t know who Bob is, and the rendezvous point doesn’t know who Alice is.
All they’re doing is forwarding the traffic. And they can’t inspect the traffic either, because the traffic itself is encrypted. So that’s currently how you solve this problem of trying to communicate with someone when you don’t know who they are, and vice versa.

*drinks from the bottle*

The principal thing I’m going to talk about today is this database. So as I said, Bob, when he picks his introduction points, builds this thing called a descriptor, describing who his introduction points are, and he publishes it to a database. This database itself is distributed throughout the Tor network; it’s not a single server. So both Bob and Alice need to be able to publish information to this database, and also retrieve information from it. And Tor currently uses something called a distributed hash table. I’m gonna give an example of what this means and how it works, and then I’ll talk about specifically how the Tor distributed hash table itself works. So let’s say, e.g., you’ve got a set of servers – here we’ve got 26 servers – and you’d like to store your files across these different servers without having a single server responsible for deciding “okay, that file is stored on that server, and this file is stored on that server”, etc. Now here is my list of files. You could take a very naive approach.
And you could say: “Okay, I’ve got 26 servers, and all of these file names start with a letter of the alphabet.” And I could say: “All of the files that begin with A are gonna go on server A; all the files that begin with B are gonna go on server B, etc.” And then when you want to retrieve a file you say: “Okay, what does my file name begin with?” And then you know which server it’s stored on. Now, of course, you could have a lot of files which begin with a Z, an X or a Y, etc., in which case you’re gonna overload that server. You’re gonna have more files stored on one server than on another server in your set. And if you have a lot of big files beginning with, say, B, then rather than distributing your files across all the servers you’re just gonna be overloading one or two of them. So to solve this problem, what we tend to do is take the file name and run it through a cryptographic hash function. A hash function produces output which looks random: a very small change in the input to a cryptographic hash function produces a very large change in the output, and this change looks random. So if I take all of my file names here – and assuming I have a lot more – I take a hash of them, and then I use that hash to determine which server to store the file on. Then, with high probability, my files will be distributed evenly across all of the servers.
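The hashing idea above fits in a few lines. This is a minimal sketch, not Tor's actual scheme: the server names are invented, and SHA-256 reduced modulo the server count stands in for whatever placement rule a real system would use.

```python
# Minimal sketch of hash-based placement: pick a server by hashing the
# file name instead of by its first letter (server names are made up).
import hashlib

SERVERS = [f"server-{i:02d}" for i in range(26)]  # 26 servers, as in the talk

def server_for(filename: str) -> str:
    # Hash the name, interpret the digest as a number, reduce it onto
    # the server list; the "randomness" of the hash spreads the load.
    digest = hashlib.sha256(filename.encode()).digest()
    return SERVERS[int.from_bytes(digest, "big") % len(SERVERS)]

# Even a pile of files that all start with 'z' spreads out evenly,
# where the first-letter scheme would dump every one on server Z.
files = [f"zebra-{i}.txt" for i in range(1000)]
servers_used = {server_for(f) for f in files}
print(len(servers_used))
```

Retrieval works the same way: hash the name again, and the same arithmetic points you at the same server, with no coordinator deciding placements.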
And then when I want 298 00:21:09,670 --> 00:21:12,990 to go and retrieve one of the files I take my file name, I run it through the 299 00:21:12,990 --> 00:21:15,980 cryptographic hash function, that gives me the hash, and then I use that hash 300 00:21:15,980 --> 00:21:19,740 to identify which server that particular file is stored on. And then I go and 301 00:21:19,740 --> 00:21:25,990 retrieve it. So that’s a loose idea of how a distributed hash table works. 302 00:21:25,990 --> 00:21:29,340 There are a couple of problems with this. What if 303 00:21:29,340 --> 00:21:34,700 the number of servers you’ve got changes, as it does in the Tor network? 304 00:21:34,700 --> 00:21:42,290 That’s a very brief overview of the theory. So how does it apply to the Tor network? 305 00:21:42,290 --> 00:21:47,640 Well, the Tor network has a set of relays and it has a set of hidden services. 306 00:21:47,640 --> 00:21:52,710 Now we take all of the relays, and they have a hash identity which identifies them. 307 00:21:52,710 --> 00:21:57,460 And we map them onto a circle using that hash value as an identifier. So you can 308 00:21:57,460 --> 00:22:03,230 imagine the hash value ranging from zero to a very large number. We’ve got the zero point 309 00:22:03,230 --> 00:22:07,280 at the very top there. And that runs all the way round to the very large number. 310 00:22:07,280 --> 00:22:12,130 So given the identity hash for a relay we can map that to a particular point on 311 00:22:12,130 --> 00:22:19,070 the circle. And then all we have to do is also do this for hidden services. 312 00:22:19,070 --> 00:22:22,320 So there’s a hidden service address, something.onion, so this is 313 00:22:22,320 --> 00:22:27,750 one of the hidden websites that you might visit.
You take the – I’m not gonna describe 314 00:22:27,750 --> 00:22:33,980 in too much detail how this is done but – the value is computed in such a way that 315 00:22:33,980 --> 00:22:38,020 it’s evenly distributed around the circle. So your hidden service will have 316 00:22:38,020 --> 00:22:44,240 a particular point on the circle. And the relays will also be mapped onto this circle. 317 00:22:44,240 --> 00:22:49,640 So there’s the relays. And the hidden service. And in the case of Tor 318 00:22:49,640 --> 00:22:53,460 the hidden service actually maps to two positions on the circle, and it publishes 319 00:22:53,460 --> 00:22:57,850 its descriptor to the three relays to the right of one position, and the three relays 320 00:22:57,850 --> 00:23:01,600 to the right of another position. So there are in total six places where 321 00:23:01,600 --> 00:23:05,060 this descriptor is published on the circle. And then if I want to go and 322 00:23:05,060 --> 00:23:09,450 connect to a hidden service I go and pull this hidden service descriptor 323 00:23:09,450 --> 00:23:13,780 down to identify what its introduction points are. I take the hidden service 324 00:23:13,780 --> 00:23:17,200 address, I find out where it is on the circle, I map all of the relays onto 325 00:23:17,200 --> 00:23:21,110 the circle, and then I identify which relays on the circle are responsible 326 00:23:21,110 --> 00:23:24,031 for that particular hidden service. And I just connect, and I say: “Do you have 327 00:23:24,031 --> 00:23:26,630 a copy of the descriptor for that particular hidden service?” 328 00:23:26,630 --> 00:23:29,620 And if so then we’ve got our list of introduction points. And we can go 329 00:23:29,620 --> 00:23:38,020 through the next steps to connect to our hidden service. So I’m gonna explain how we 330 00:23:38,020 --> 00:23:41,320 sort of set up our experiments.
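That clockwise walk around the circle can be sketched like this. It’s a toy model with made-up identities: the real computation is specified in Tor’s rend-spec, and this ignores how fingerprints and descriptor IDs are actually derived:

```python
import hashlib

def ring_position(identity: bytes) -> int:
    # Map an identity (relay fingerprint or descriptor ID) onto the circle.
    return int.from_bytes(hashlib.sha1(identity).digest(), "big")

def responsible_relays(descriptor_id: bytes, relays: list, n: int = 3):
    """Return the n relays clockwise of the descriptor's position.

    Toy version of the hidden-service directory lookup: real Tor
    publishes two descriptor replicas at two positions, each held by
    the three relays to the right, giving 2 x 3 = 6 directories."""
    point = ring_position(descriptor_id)
    # Sort relays by their position on the circle, then walk clockwise,
    # wrapping past the largest value back round to zero.
    ordered = sorted(relays, key=ring_position)
    after = [r for r in ordered if ring_position(r) >= point]
    return (after + ordered)[:n]
```

Client and service both run the same lookup, so they independently agree on which relays hold the descriptor.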
What we thought, or what we were interested to do, 331 00:23:41,320 --> 00:23:48,181 was collect publications of hidden services. So every time a hidden service 332 00:23:48,181 --> 00:23:51,520 gets set up it publishes to this distributed hash table. What we wanted to do was 333 00:23:51,520 --> 00:23:55,750 collect those publications so that we get a complete list of all of the hidden 334 00:23:55,750 --> 00:23:59,280 services. And what we also wanted to do was find out how many times a particular 335 00:23:59,280 --> 00:24:06,300 hidden service is requested. 336 00:24:06,300 --> 00:24:10,540 Just one more point that will become important later. 337 00:24:10,540 --> 00:24:14,230 The position at which the hidden service appears on the circle changes 338 00:24:14,230 --> 00:24:18,950 every 24 hours. So there’s not a fixed position every single day. 339 00:24:18,950 --> 00:24:24,370 If we run 40 nodes over a long period of time we will occupy positions within 340 00:24:24,370 --> 00:24:29,570 that distributed hash table. And we will be able to collect publications and requests 341 00:24:29,570 --> 00:24:34,300 for hidden services that are located at those positions inside the distributed 342 00:24:34,300 --> 00:24:39,251 hash table. So we ran 40 Tor nodes: we had a student at the university 343 00:24:39,251 --> 00:24:43,950 who said: “Hey, I run a hosting company, I’ve got loads of server capacity”, and 344 00:24:43,950 --> 00:24:46,580 we told him what we were doing, and he said: “Well, you really helped us out 345 00:24:46,580 --> 00:24:49,820 these last couple of years…” and just gave us loads of server capacity 346 00:24:49,820 --> 00:24:55,500 to allow us to do this. So we spun up 40 Tor nodes. Each Tor node was required 347 00:24:55,500 --> 00:24:59,560 to advertise a certain amount of bandwidth to become a part of that distributed 348 00:24:59,560 --> 00:25:02,200 hash table. It’s actually a very small amount, so this didn’t matter too much.
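The daily position change mentioned above comes from folding a time period into the descriptor ID. A simplified sketch, loosely following the old v2 rend-spec (field encodings and the descriptor-cookie case are omitted, and the example identity is made up):

```python
import hashlib
import struct

def descriptor_id(permanent_id: bytes, unix_time: int, replica: int) -> bytes:
    """Where (on the circle) a v2 hidden service publishes today.

    The descriptor ID mixes the service's permanent identity with a
    time period that advances once per day, plus a replica number
    (0 or 1 -- hence the two positions on the circle)."""
    # The time period changes every 86400 seconds, offset by the first
    # byte of the permanent ID so that services rotate at different
    # times of day rather than all at once.
    time_period = (unix_time + (permanent_id[0] * 86400) // 256) // 86400
    secret = hashlib.sha1(struct.pack(">I", time_period) + bytes([replica])).digest()
    return hashlib.sha1(permanent_id + secret).digest()

# The same service lands at different points on different days:
pid = bytes.fromhex("aabbccddeeff00112233445566778899aabbccdd")
today = descriptor_id(pid, 1_400_000_000, 0)
tomorrow = descriptor_id(pid, 1_400_000_000 + 86400, 0)
assert today != tomorrow
```

This is why relays that sit still on the circle see a changing population of services drift past them: the services move, the relays don’t.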
349 00:25:02,200 --> 00:25:06,050 And then, after 25 hours – this has changed in the last few days, 350 00:25:06,050 --> 00:25:10,070 it’s just been increased as a result of one of the 351 00:25:10,070 --> 00:25:14,570 attacks last week, but certainly during our study it was 25 hours – you then 352 00:25:14,570 --> 00:25:18,300 appear at a particular point inside that distributed hash table. And you’re then 353 00:25:18,300 --> 00:25:22,750 in a position to record publications of hidden services and requests for hidden 354 00:25:22,750 --> 00:25:27,810 services. So not only can you get a full list of the onion addresses, you can also 355 00:25:27,810 --> 00:25:32,250 find out how many times each of the onion addresses is requested. 356 00:25:32,250 --> 00:25:38,270 And so this is what we recorded. And then, once 357 00:25:38,270 --> 00:25:41,830 we had run for a long period of time to collect a long list of .onion addresses, 358 00:25:41,830 --> 00:25:46,850 we then built a custom crawler that would visit each of the Tor hidden services 359 00:25:46,850 --> 00:25:51,450 in turn, and pull down the HTML contents, the text content from the web page, 360 00:25:51,450 --> 00:25:54,760 so that we could go ahead and classify the content. Now it’s really important 361 00:25:54,760 --> 00:25:59,250 to note here, and it will become obvious why a little bit later: we only pulled down 362 00:25:59,250 --> 00:26:03,030 HTML content. We didn’t pull down images. And there’s a very, very important reason 363 00:26:03,030 --> 00:26:09,980 for that which will become clear shortly. 364 00:26:09,980 --> 00:26:13,520 We had a lot of questions when we first started this. No one really knew 365 00:26:13,520 --> 00:26:18,000 how many hidden services there were. It had been suggested to us there was a very high 366 00:26:18,000 --> 00:26:21,250 turnover of hidden services. We wanted to confirm whether that was true or not.
367 00:26:21,250 --> 00:26:24,530 And we also wanted to know: what are the hidden services, 368 00:26:24,530 --> 00:26:30,140 how popular are they, etc. etc. So, our estimate for how many hidden services 369 00:26:30,140 --> 00:26:34,770 there are over the period in which we ran our study: this is a graph plotting 370 00:26:34,770 --> 00:26:38,560 our estimate, for each of the individual days, as to how many hidden services 371 00:26:38,560 --> 00:26:44,850 there were on that particular day. Now the data is naturally noisy because we’re only 372 00:26:44,850 --> 00:26:48,590 a very small proportion of that circle. So we’re only observing a very small 373 00:26:48,590 --> 00:26:53,250 proportion of the total publications and requests every single day, for each of 374 00:26:53,250 --> 00:26:57,260 those hidden services. And if you take a long-term average of this, 375 00:26:57,260 --> 00:27:02,720 there’s about 45,000 hidden services that we think were present, on average, 376 00:27:02,720 --> 00:27:07,880 each day, during our entire study. Which is a large number of hidden services. 377 00:27:07,880 --> 00:27:11,070 But over the entire length we collected about 80,000 in total. 378 00:27:11,070 --> 00:27:14,270 Some came and went etc. So the next question after how many 379 00:27:14,270 --> 00:27:17,750 hidden services there are is how long a hidden service exists for. 380 00:27:17,750 --> 00:27:20,620 Does it exist for a very long period of time, does it exist for a very short 381 00:27:20,620 --> 00:27:24,220 period of time, etc.? So what we did was, for every single 382 00:27:24,220 --> 00:27:30,260 .onion address, we plotted how many times we saw a publication for that particular 383 00:27:30,260 --> 00:27:34,160 hidden service during the six months. How many times did we see it?
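The scaling behind those daily estimates can be illustrated with toy numbers. This is my own back-of-the-envelope arithmetic, not the estimator actually used in the study:

```python
def estimate_total(observed: int, positions_held: int, total_positions: int) -> float:
    """Scale up a partial count by the fraction of the DHT we occupy.

    Toy estimator: if our relays hold `positions_held` of the
    `total_positions` directory slots on the circle, then on average
    we see that fraction of all publications, so the observed count
    scales by its inverse.  The study's real estimator also has to
    account for the six replicas per service and day-to-day churn."""
    fraction = positions_held / total_positions
    return observed / fraction

# e.g. seeing 450 distinct services while covering 1% of the ring
# suggests on the order of 45,000 services in total.
print(estimate_total(450, 40, 4000))
```

The smaller the fraction covered, the noisier this gets, which is exactly why the daily figures jump around.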
384 00:27:34,160 --> 00:27:38,100 If we saw it a lot of times, that suggested in general the hidden service existed 385 00:27:38,100 --> 00:27:42,180 for a very long period of time. If we saw a very small number of publications 386 00:27:42,180 --> 00:27:45,760 for each hidden service then that suggests that they were only present 387 00:27:45,760 --> 00:27:51,690 for a very short period of time. This is our graph. By far the largest number 388 00:27:51,690 --> 00:27:55,890 of hidden services we saw only once during the entire study. And we never saw them 389 00:27:55,890 --> 00:28:00,390 again. This suggests that there’s a very high turnover of hidden services; they 390 00:28:00,390 --> 00:28:04,520 don’t tend, on average, to exist for a very long period of time. 391 00:28:04,520 --> 00:28:10,730 And then you can see a sort of tail here. If we plot just those 392 00:28:10,730 --> 00:28:16,390 hidden services which existed for a long time – so e.g. we could take hidden services 393 00:28:16,390 --> 00:28:20,280 which have a high number of hit requests and say: “Okay, those that have a high number 394 00:28:20,280 --> 00:28:24,800 of hits probably existed for a long time.” That’s not absolutely certain, but probably. 395 00:28:24,800 --> 00:28:29,190 Then you see this sort of normal distribution centred around four or five: we saw 396 00:28:29,190 --> 00:28:34,870 most hidden services four or five times on average during the entire six months if they were 397 00:28:34,870 --> 00:28:40,530 popular, and we’re using that as a proxy measure for whether they existed 398 00:28:40,530 --> 00:28:48,160 for the entire time. Now, this study ran over 160 days, so almost six months. 399 00:28:48,160 --> 00:28:51,490 What we also wanted to do was to try to confirm this over a longer period.
400 00:28:51,490 --> 00:28:56,310 So last year, in 2013, about February time, some researchers at the University 401 00:28:56,310 --> 00:29:00,350 of Luxembourg also ran a similar study, but theirs ran over a very short period of time, 402 00:29:00,350 --> 00:29:05,060 about a day. But they did it in such a way that it could collect descriptors 403 00:29:05,060 --> 00:29:08,590 across much of the circle during a single day. That was because of a bug in the way 404 00:29:08,590 --> 00:29:12,020 Tor did some things, which has now been fixed, so we can’t repeat it 405 00:29:12,020 --> 00:29:16,520 in that particular way. So we got a list of .onion addresses from February 2013 406 00:29:16,520 --> 00:29:18,960 from these researchers at the University of Luxembourg. And then we got our list 407 00:29:18,960 --> 00:29:23,670 of .onion addresses from the six months which was March to September of this year. 408 00:29:23,670 --> 00:29:26,700 And we wanted to say, okay, we’re given these two sets of .onion addresses. 409 00:29:26,700 --> 00:29:30,740 Which .onion addresses existed in their set but not ours and vice versa, and which 410 00:29:30,740 --> 00:29:39,740 .onion addresses existed in both sets? 411 00:29:39,740 --> 00:29:45,520 So as you can see, a very small minority of hidden service addresses existed 412 00:29:45,520 --> 00:29:50,000 in both sets. This is over an 18-month period between these two collection points. 413 00:29:50,000 --> 00:29:54,430 A very small number of services existed in both their data set and in 414 00:29:54,430 --> 00:29:58,390 our data set. Which again suggested there’s a very high turnover of hidden 415 00:29:58,390 --> 00:30:02,920 services; they don’t tend to exist for a very long period of time. 416 00:30:02,920 --> 00:30:06,530 So the question is, why is that? Which we’ll come on to a little bit later.
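Comparing the two snapshots is just set arithmetic over the address lists (the onion addresses below are made up for illustration):

```python
# Two snapshots of .onion addresses, 18 months apart.
# The addresses here are fabricated for the example.
feb_2013 = {"aaaa2bcd3efg4hij", "bbbb2bcd3efg4hij", "cccc2bcd3efg4hij"}
mar_sep_2014 = {"cccc2bcd3efg4hij", "dddd2bcd3efg4hij"}

only_2013 = feb_2013 - mar_sep_2014      # existed then, gone now
only_2014 = mar_sep_2014 - feb_2013      # new since the first snapshot
both = feb_2013 & mar_sep_2014           # survived the 18 months

print(len(only_2013), len(only_2014), len(both))
```

A small `both` relative to the two difference sets is what the talk reads as high turnover.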
417 00:30:06,530 --> 00:30:11,120 It’s a very valid question; we can’t answer it 100%, but we have some inklings as to 418 00:30:11,120 --> 00:30:15,560 why that may be the case. So in terms of popularity, which hidden services 419 00:30:15,560 --> 00:30:19,700 did we see, or which .onion addresses did we see requested the most? 420 00:30:19,700 --> 00:30:26,980 Which got the most number of hits? Or the most number of directory requests. 421 00:30:26,980 --> 00:30:30,120 So botnet Command & Control servers – if you’re not familiar with what 422 00:30:30,120 --> 00:30:34,340 a botnet is, the idea is to infect lots of people with a piece of malware. 423 00:30:34,340 --> 00:30:37,630 And this malware phones home to a Command & Control server where 424 00:30:37,630 --> 00:30:41,500 the botnet master can give instructions to each of the bots to do things. 425 00:30:41,500 --> 00:30:46,780 So it might be e.g. to collect passwords, key strokes, banking details. 426 00:30:46,780 --> 00:30:51,010 Or it might be to do things like Distributed Denial of Service attacks, 427 00:30:51,010 --> 00:30:55,220 or to send spam, those sorts of things. And a couple of years ago someone gave 428 00:30:55,220 --> 00:31:00,720 a talk and said: “Well, the problem with running a botnet is your C&C servers 429 00:31:00,720 --> 00:31:05,750 are vulnerable.” Once a C&C server is taken down you no longer have control over 430 00:31:05,750 --> 00:31:10,030 your botnet. So it’s been a sort of arms race between anti-virus companies and 431 00:31:10,030 --> 00:31:15,130 malware authors to try and come up with techniques to run C&C servers in a way 432 00:31:15,130 --> 00:31:18,490 in which they can’t be taken down. And a couple of years ago someone gave a talk 433 00:31:18,490 --> 00:31:22,450 at a conference that said: “You know what?
It would be a really good idea if botnet 434 00:31:22,450 --> 00:31:25,809 C&C servers were run as Tor hidden services because then no one knows 435 00:31:25,809 --> 00:31:29,370 where they are, and in theory they can’t be taken down.” And in fact we see this: 436 00:31:29,370 --> 00:31:33,000 there are loads and loads of these addresses associated with several 437 00:31:33,000 --> 00:31:38,122 different botnets, ‘Sefnit’ and ‘Skynet’. Now Skynet is the one I wanted to talk 438 00:31:38,122 --> 00:31:42,840 to you about, because the guy that ran Skynet had a Twitter account, and he also 439 00:31:42,840 --> 00:31:47,210 did a Reddit AMA. If you’ve not heard of a Reddit AMA before, that’s a Reddit 440 00:31:47,210 --> 00:31:51,500 ask-me-anything. You can go on the website and ask the guy anything. So this guy 441 00:31:51,500 --> 00:31:54,790 wasn’t hiding in the shadows. He’d say: “Hey, I’m running this massive botnet, 442 00:31:54,790 --> 00:31:58,180 here’s my Twitter account which I update regularly, here is my Reddit AMA where 443 00:31:58,180 --> 00:32:01,620 you can ask me questions!” etc. 444 00:32:01,620 --> 00:32:04,590 He was arrested last year, which is not, perhaps, a huge surprise. 445 00:32:04,590 --> 00:32:11,750 *laughter and applause* 446 00:32:11,750 --> 00:32:15,970 But… so he was arrested, and his C&C servers disappeared, 447 00:32:15,970 --> 00:32:21,600 but there were still infected hosts trying to connect to the C&C servers and 448 00:32:21,600 --> 00:32:24,490 request access to the C&C server. 449 00:32:24,490 --> 00:32:27,570 This is why we’re saying: “a large number of hits”. So all of these requests are 450 00:32:27,570 --> 00:32:31,520 failed requests, i.e. we didn’t have a descriptor for them, because 451 00:32:31,520 --> 00:32:34,910 the hidden service had gone away but there were still clients requesting each 452 00:32:34,910 --> 00:32:38,040 of the hidden services.
453 00:32:38,040 --> 00:32:41,980 And the next thing we wanted to do was to try and categorize sites. So, as I said 454 00:32:41,980 --> 00:32:45,960 earlier, we crawled all of the hidden services that we could, and we classified 455 00:32:45,960 --> 00:32:50,230 them into different categories based on what the type of content was 456 00:32:50,230 --> 00:32:53,650 on the hidden service site. The first graph I have is the number of sites 457 00:32:53,650 --> 00:32:58,040 in each of the categories. So you can see down the bottom here we’ve got lots of 458 00:32:58,040 --> 00:33:04,280 different categories. We’ve got drugs, market places, etc. on the bottom. And the graph 459 00:33:04,280 --> 00:33:07,360 shows the percentage of the hidden services that we crawled that fit 460 00:33:07,360 --> 00:33:12,680 into each of these categories. So e.g. looking at this, the largest number of sites 461 00:33:12,680 --> 00:33:16,250 that we crawled were drugs-focused websites, followed by 462 00:33:16,250 --> 00:33:20,970 market places etc. There’s a couple of questions you might have here 463 00:33:20,970 --> 00:33:25,640 about which ones stick out. What does ‘porn’ mean? Well, you know 464 00:33:25,640 --> 00:33:31,060 what ‘porn’ means. There are some very notorious porn sites on the Tor Darknet. 465 00:33:31,060 --> 00:33:34,470 There was one in particular which was focused on revenge porn. It turns out 466 00:33:34,470 --> 00:33:37,520 that youngsters wish to take pictures of themselves, and send them to their 467 00:33:37,520 --> 00:33:45,040 boyfriends or their girlfriends. And when they get dumped they publish them 468 00:33:45,040 --> 00:33:49,750 on these websites. So there were several of these sites on the main internet 469 00:33:49,750 --> 00:33:53,070 which have mostly been shut down. And some of these sites were archived 470 00:33:53,070 --> 00:33:58,220 on the Darknet.
The second one that we should probably wonder about 471 00:33:58,220 --> 00:34:03,430 is ‘abuse’. Every single site we classified in this category 472 00:34:03,430 --> 00:34:07,750 was a child abuse site. So they were in some way facilitating child abuse. 473 00:34:07,750 --> 00:34:10,980 And how do we know that? Well, the data that came back from the crawler 474 00:34:10,980 --> 00:34:14,789 made it completely unambiguous as to what the content was on these sites. It was 475 00:34:14,789 --> 00:34:18,918 completely obvious, from the content from the crawler, what was on these sites. 476 00:34:18,918 --> 00:34:23,449 And this is the principal reason why we didn’t pull down images from sites. 477 00:34:23,449 --> 00:34:26,099 In many countries it would be a criminal offense to do so. 478 00:34:26,099 --> 00:34:29,530 So our crawler only pulled down text content from all of these sites, and that 479 00:34:29,530 --> 00:34:34,470 enabled us to classify them. We didn’t pull down any images. 480 00:34:34,470 --> 00:34:37,880 So of course the next thing we’d like to do is to say: “Okay, well, given each of these 481 00:34:37,880 --> 00:34:42,759 categories, what proportion of directory requests went to each of the categories?” 482 00:34:42,759 --> 00:34:45,489 Now the next graph is going to need some explaining as to precisely what it 483 00:34:45,489 --> 00:34:52,090 means, and I’m gonna give that. This is the proportion of directory requests 484 00:34:52,090 --> 00:34:55,830 which we saw that went to each of the categories of hidden service that we 485 00:34:55,830 --> 00:34:59,740 classified. As you can see, in fact, we saw a very large number going to these 486 00:34:59,740 --> 00:35:05,010 abuse sites. And the rest sort of distributed right there, at the bottom.
487 00:35:05,010 --> 00:35:07,230 And the question is: “What is it we’re collecting here?” 488 00:35:07,230 --> 00:35:12,070 We’re collecting successful hidden service directory requests. What does a hidden 489 00:35:12,070 --> 00:35:16,790 service directory request mean? It probably loosely correlates with 490 00:35:16,790 --> 00:35:22,230 either a visit or a visitor. So somewhere in between those two. Because when you 491 00:35:22,230 --> 00:35:26,790 want to visit a hidden service you make a request for the hidden service descriptor, 492 00:35:26,790 --> 00:35:31,080 and that allows you to connect to it and browse through the web site. 493 00:35:31,080 --> 00:35:34,770 But there are cases where, e.g. if you restart Tor, you’ll go back and 494 00:35:34,770 --> 00:35:40,100 re-fetch the descriptor. So in that case we’d count you twice, for example. 495 00:35:40,100 --> 00:35:43,050 What proportion of these are people, and what proportion of them are 496 00:35:43,050 --> 00:35:46,619 something else? The answer to that is we simply don’t know. 497 00:35:46,619 --> 00:35:50,250 We’ve got directory requests, but that doesn’t tell us what they’re doing on these 498 00:35:50,250 --> 00:35:55,130 sites, what they’re fetching, or who or what they are. 499 00:35:55,130 --> 00:35:58,690 So these could be automated requests, they could be human beings. We can’t 500 00:35:58,690 --> 00:36:03,750 distinguish between those two things. 501 00:36:03,750 --> 00:36:06,420 What are the limitations? 502 00:36:06,420 --> 00:36:12,170 A hidden service directory request correlates exactly with neither a visit nor a visitor. 503 00:36:12,170 --> 00:36:16,380 It’s probably somewhere in between. So you can’t say whether it’s exactly one 504 00:36:16,380 --> 00:36:19,810 or the other. We cannot say whether a hidden service directory request 505 00:36:19,810 --> 00:36:26,230 is a person or something automated. We can’t distinguish between those two.
506 00:36:26,230 --> 00:36:31,890 Any type of site could be targeted by e.g. DoS attacks or by web crawlers, which would 507 00:36:31,890 --> 00:36:40,040 greatly inflate the figures. If you were to do a DoS attack it’s likely you’d only 508 00:36:40,040 --> 00:36:44,700 request a small number of descriptors. You’d actually be flooding the site itself 509 00:36:44,700 --> 00:36:47,740 rather than the directories. But, in theory, you could flood the directories. 510 00:36:47,740 --> 00:36:52,840 But we didn’t see any sort of shutdown of our directories based on flooding, for example. 511 00:36:52,840 --> 00:36:58,720 Whilst we can’t rule that out, it doesn’t seem to fit too well with what we’ve got. 512 00:36:58,720 --> 00:37:02,971 The other question is ‘crawlers’. I obviously talked with the Tor Project 513 00:37:02,971 --> 00:37:08,570 about these results and they’ve suggested that there are groups – the child 514 00:37:08,570 --> 00:37:12,740 protection agencies e.g. – that will crawl these sites on a regular basis. And, 515 00:37:12,740 --> 00:37:15,879 again, that doesn’t necessarily correlate with a human being. And that could 516 00:37:15,879 --> 00:37:19,830 inflate the figures. How many hidden service directory requests would there be 517 00:37:19,830 --> 00:37:24,610 if a crawler was pointed at a site? Typically, if I crawl them on a single day, one request. 518 00:37:24,610 --> 00:37:27,850 But if they’ve got a large number of servers doing the crawling then it could be 519 00:37:27,850 --> 00:37:32,840 a request per day for every single server. So, again, I can’t give you a definitive 520 00:37:32,840 --> 00:37:37,930 “yes, this is human beings” or “yes, this is automated requests”. 521 00:37:37,930 --> 00:37:43,300 The other important point is, these two content graphs cover only hidden services 522 00:37:43,300 --> 00:37:48,550 offering web content. There are hidden services that do other things, e.g. IRC, 523 00:37:48,550 --> 00:37:52,490 instant messaging etc.
Those aren’t included in these figures. We’re only 524 00:37:52,490 --> 00:37:57,990 concentrating on hidden services offering web sites. They’re HTTP services, or HTTPS 525 00:37:57,990 --> 00:38:01,640 services. Because that allows us to classify them easily. And in fact, for some of 526 00:38:01,640 --> 00:38:06,080 the other types, like IRC and Jabber, the results would probably not be directly comparable 527 00:38:06,080 --> 00:38:08,920 with web sites. The use case for using them is probably 528 00:38:08,920 --> 00:38:16,490 slightly different. So I appreciate the last graph is somewhat alarming. 529 00:38:16,490 --> 00:38:20,640 If you have any questions please ask either me or the Tor developers 530 00:38:20,640 --> 00:38:24,810 as to how to interpret these results. It’s not quite as straightforward as it may 531 00:38:24,810 --> 00:38:27,500 look when you look at the graph. You might look at the graph and say: “Hey, 532 00:38:27,500 --> 00:38:30,980 that looks like there’s lots of people visiting these sites”. It’s difficult 533 00:38:30,980 --> 00:38:40,240 to conclude that from the results. 534 00:38:40,240 --> 00:38:45,990 The next slide is gonna be very contentious. I will preface it with: 535 00:38:45,990 --> 00:38:50,970 “I’m not advocating -any- kind of action whatsoever. I’m just trying 536 00:38:50,970 --> 00:38:56,130 to describe technically what could be done. It’s not up to me to make decisions 537 00:38:56,130 --> 00:39:02,869 on these types of things.” So, of course, when we found this out, frankly, I think 538 00:39:02,869 --> 00:39:06,190 we were stunned. It took us several days; frankly, it just stunned us: 539 00:39:06,190 --> 00:39:09,610 “what the hell, this is not what we expected at all.” 540 00:39:09,610 --> 00:39:13,210 So a natural step is – well, most of us think that Tor is a great thing, 541 00:39:13,210 --> 00:39:18,510 it seems. Could this problem be sorted out while still keeping Tor as it is?
542 00:39:18,510 --> 00:39:21,510 And probably the next step is to say: “Well, okay, could we just block this class 543 00:39:21,510 --> 00:39:26,060 of content and not other types of content?” So could we block just hidden services 544 00:39:26,060 --> 00:39:29,630 that are associated with these sites and not other types of hidden services? 545 00:39:29,630 --> 00:39:33,370 We thought there are three ways in which hidden services could be blocked. 546 00:39:33,370 --> 00:39:36,960 And I’ll talk about whether these will still be possible in the coming months 547 00:39:36,960 --> 00:39:39,430 after explaining them. But during our study these would have been possible, 548 00:39:39,430 --> 00:39:43,590 and presently they are possible. 549 00:39:43,590 --> 00:39:48,630 A single individual could shut down a single hidden service by controlling 550 00:39:48,630 --> 00:39:53,640 all of the relays which are responsible for receiving a publication request 551 00:39:53,640 --> 00:39:57,280 in that distributed hash table. It’s possible to place one of your relays 552 00:39:57,280 --> 00:40:01,460 at a particular position on that circle and therefore make yourself 553 00:40:01,460 --> 00:40:04,290 the responsible relay for a particular hidden service. 554 00:40:04,290 --> 00:40:08,500 And if you control all six of the relays which are responsible for a hidden service, 555 00:40:08,500 --> 00:40:11,390 when someone comes to you and says: “Can I have the descriptor for that site?” 556 00:40:11,390 --> 00:40:15,910 you can just say: “No, I haven’t got it”. And provided you control those relays, 557 00:40:15,910 --> 00:40:20,580 users won’t be able to reach those sites. 558 00:40:20,580 --> 00:40:25,010 The second option – you could say: “Okay, the Tor Project are blocking these”, 559 00:40:25,010 --> 00:40:28,941 which I’ll talk about in a second – is as a relay operator.
Could I 560 00:40:28,941 --> 00:40:32,500 say: “Okay, as a relay operator I don’t want to carry 561 00:40:32,500 --> 00:40:35,930 this type of content, and I don’t want to be responsible for serving up this type 562 00:40:35,930 --> 00:40:39,930 of content”? A relay operator could patch his relay and say: “You know what, 563 00:40:39,930 --> 00:40:44,020 if anyone comes to this relay requesting any one of these sites then, again, just 564 00:40:44,020 --> 00:40:48,740 refuse to do it”. The problem is a lot of relay operators would need to do it. So a very, 565 00:40:48,740 --> 00:40:51,990 very large number of the potential relay operators would need to do that 566 00:40:51,990 --> 00:40:56,170 to effectively block these sites. The final option is the Tor Project could 567 00:40:56,170 --> 00:41:00,740 modify the Tor program and actually embed these addresses in the Tor program itself, 568 00:41:00,740 --> 00:41:05,030 so that all relays by default block hidden service directory requests 569 00:41:05,030 --> 00:41:10,560 for these sites, and also clients themselves would say: “Okay, if anyone’s requesting 570 00:41:10,560 --> 00:41:15,000 these, block them at the client level.” Now I hasten to add: I’m not advocating 571 00:41:15,000 --> 00:41:18,230 any kind of action – that is entirely up to other people – because, frankly, I think 572 00:41:18,230 --> 00:41:22,530 if I advocated blocking hidden services I probably wouldn’t make it out alive. 573 00:41:22,530 --> 00:41:27,050 So I’m just saying: this is a description of what technical measures could be used 574 00:41:27,050 --> 00:41:30,730 to block some classes of sites. And of course there’s lots of questions here. 575 00:41:30,730 --> 00:41:35,150 If e.g. the Tor Project themselves decided: “Okay, we’re gonna block these sites”, 576 00:41:35,150 --> 00:41:38,490 that means they are essentially in control of the block list.
577 00:41:38,490 --> 00:41:41,360 The block list would be somewhat public, so everyone would be able to inspect 578 00:41:41,360 --> 00:41:44,930 which sites are being blocked, and they would be in control of some kind 579 00:41:44,930 --> 00:41:54,360 of block list. Which, you know, arguably is against what the Tor Project is after. 580 00:41:54,360 --> 00:41:59,560 *takes a sip, coughs* 581 00:41:59,560 --> 00:42:05,480 So how about deanonymising visitors to hidden service web sites? 582 00:42:05,480 --> 00:42:08,940 In this case we’ve got a user on the left-hand side who is connected to 583 00:42:08,940 --> 00:42:12,630 a Guard node. We’ve got a hidden service on the right-hand side which is connected 584 00:42:12,630 --> 00:42:17,530 to a Guard node, and at the top we’ve got one of those directory servers which is 585 00:42:17,530 --> 00:42:21,850 responsible for serving up those hidden service directory requests. 586 00:42:21,850 --> 00:42:28,660 Now, when you first want to connect to a hidden service you connect through 587 00:42:28,660 --> 00:42:31,619 your Guard node and through a couple of hops up to the hidden service directory, and 588 00:42:31,619 --> 00:42:35,840 you request the descriptor from it. So at this point, if you are the attacker 589 00:42:35,840 --> 00:42:39,440 and you control one of the hidden service directory nodes for a particular site, 590 00:42:39,440 --> 00:42:43,100 you can send back down the circuit a particular pattern of traffic. 591 00:42:43,100 --> 00:42:47,740 And if you control that user’s Guard node – which is a big if – 592 00:42:47,740 --> 00:42:52,110 then you can spot that pattern of traffic at the Guard node. The question is: 593 00:42:52,110 --> 00:42:56,940 “How do you control a particular user’s Guard node?” That’s very, very hard. 594 00:42:56,940 --> 00:43:01,480 But if e.g.
I run a hidden service and all of you visit my hidden service, and 595 00:43:01,480 --> 00:43:05,670 I’m running a couple of dodgy Guard relays, then the probability is that some of you, 596 00:43:05,670 --> 00:43:09,760 certainly not all of you by any stretch, will select my dodgy Guard relay, and 597 00:43:09,760 --> 00:43:13,220 I could deanonymise you, but I couldn’t deanonymise the rest of you. 598 00:43:13,220 --> 00:43:18,260 So what we’re saying here is that you can deanonymise some of the users 599 00:43:18,260 --> 00:43:22,130 some of the time, but you can’t pick which users those are that you’re going to 600 00:43:22,130 --> 00:43:26,609 deanonymise. You can’t deanonymise someone specific, but you can deanonymise a fraction 601 00:43:26,609 --> 00:43:32,170 based on what fraction of the network you control in terms of Guard capacity. 602 00:43:32,170 --> 00:43:36,340 How about… so the attacker controls those two – here’s a picture from research at 603 00:43:36,340 --> 00:43:40,200 the University of Luxembourg, which did this. And these are plots of 604 00:43:40,200 --> 00:43:45,270 taking the IP address of a user visiting a C&C server, geolocating it 605 00:43:45,270 --> 00:43:48,480 and putting it on a map. So “where was the user located when they called one of 606 00:43:48,480 --> 00:43:51,620 the Tor hidden services?” So, again, this is a selection, a percentage 607 00:43:51,620 --> 00:43:58,060 of the users visiting C&C servers, using this technique. 608 00:43:58,060 --> 00:44:03,770 How about deanonymising hidden services themselves? Well, again, you’ve got a problem. 609 00:44:03,770 --> 00:44:08,340 You’re the user. You’re gonna connect through your Guard into the Tor network. 610 00:44:08,340 --> 00:44:12,160 And then, eventually, through the hidden service’s Guard node, and talk to 611 00:44:12,160 --> 00:44:16,740 the hidden service.
As the attacker you need to control the hidden service’s 612 00:44:16,740 --> 00:44:20,859 Guard node to do these traffic correlation attacks. So again, it’s very difficult 613 00:44:20,859 --> 00:44:24,390 to deanonymise a specific Tor hidden service. But if you think about it – okay, 614 00:44:24,390 --> 00:44:30,200 there are 1,000 Tor hidden services; if you can control a percentage of the Guard nodes, 615 00:44:30,200 --> 00:44:34,230 then some hidden services will pick you, and then you’ll be able to deanonymise those. 616 00:44:34,230 --> 00:44:37,330 So provided you don’t care which hidden services you’re going to deanonymise, 617 00:44:37,330 --> 00:44:41,400 then it becomes much more straightforward to control the Guard nodes of some hidden 618 00:44:41,400 --> 00:44:44,910 services, but you can’t pick exactly which those are. 619 00:44:44,910 --> 00:44:51,040 So what sort of data can you see traversing a relay? 620 00:44:51,040 --> 00:44:55,880 This is a modified Tor client which just dumps cells – 621 00:44:55,880 --> 00:44:58,750 essentially packets travelling down a circuit – and the information you can 622 00:44:58,750 --> 00:45:04,020 extract from them at a Guard node. And this is done off the main Tor network. 623 00:45:04,020 --> 00:45:08,590 So I’ve got a client connected to a “malicious” Guard relay, 624 00:45:08,590 --> 00:45:14,040 and it logs every single packet – they’re called ‘cells’ in the Tor protocol – 625 00:45:14,040 --> 00:45:17,619 coming through the Guard relay. We can’t decrypt the packet because it’s encrypted 626 00:45:17,619 --> 00:45:21,780 three times. What we can record, though, is the IP address of the user, 627 00:45:21,780 --> 00:45:25,070 the IP address of the next hop, and we can count packets travelling 628 00:45:25,070 --> 00:45:29,240 in each direction down the circuit. And we can also record the time at which those 629 00:45:29,240 --> 00:45:32,210 packets were sent.
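The "some hidden services will pick you" reasoning above is just probability over bandwidth-weighted guard selection. A back-of-the-envelope sketch in Python (the 5% and 1,000 figures are purely illustrative, not measurements from the study):

```python
def p_picks_malicious_guard(malicious_fraction, n_guards=1):
    """Chance that at least one of a client's or hidden service's
    guard choices lands on attacker relays, assuming guards are
    chosen in proportion to bandwidth."""
    return 1.0 - (1.0 - malicious_fraction) ** n_guards

# An attacker holding 5% of guard capacity, against 1,000 hidden
# services each using a single guard:
expected = 1000 * p_picks_malicious_guard(0.05)   # roughly 50 services
```

This is exactly why the attack deanonymises an unpredictable subset rather than a chosen target: which ~50 services land on the attacker's relays is decided by the services' own random guard selection.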
So of course, if you’re doing the traffic correlation attacks, 630 00:45:32,210 --> 00:45:37,970 you’re using that timing information to try and work out whether you’re seeing 631 00:45:37,970 --> 00:45:42,370 traffic which you’ve sent and which identifies a particular user or not. 632 00:45:42,370 --> 00:45:44,810 Or indeed traffic which they’ve sent which you’ve seen at a different point 633 00:45:44,810 --> 00:45:49,100 in the network. 634 00:45:49,100 --> 00:45:51,980 Moving on to my… 635 00:45:51,980 --> 00:45:55,760 …interesting problems, research questions etc. 636 00:45:55,760 --> 00:45:59,250 Based on what I’ve said, there are these directory authorities which are 637 00:45:59,250 --> 00:46:05,070 controlled by the core Tor members. If e.g. they were malicious then they could 638 00:46:05,070 --> 00:46:08,990 manipulate the Tor… – if a big enough chunk of them are malicious then 639 00:46:08,990 --> 00:46:12,700 they can manipulate the consensus to direct you to particular nodes. 640 00:46:12,700 --> 00:46:15,920 I don’t think that’s the case, and I don’t think anyone thinks that’s the case. 641 00:46:15,920 --> 00:46:19,180 And Tor is designed in a way that you’d have to control 642 00:46:19,180 --> 00:46:22,480 a certain number of the authorities to be able to do anything important. 643 00:46:22,480 --> 00:46:25,270 So the Tor people… I said this to them a couple of days ago: 644 00:46:25,270 --> 00:46:28,780 I find it quite funny that you’d design your system as if you don’t trust 645 00:46:28,780 --> 00:46:31,880 each other. To which their response was: “No, we design our system so that 646 00:46:31,880 --> 00:46:35,620 we don’t have to trust each other.” Which I think is a very good model to have 647 00:46:35,620 --> 00:46:39,430 when you have this type of system. So could we eliminate these sorts of 648 00:46:39,430 --> 00:46:43,240 centralized servers? I think that’s actually a very hard problem.
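The correlation step described above, matching a traffic pattern seen at one point in the network against what a Guard observes, can be caricatured in a few lines. This is a toy sketch, not any deployed tool; the 50 ms window and the sample timestamps are arbitrary choices for illustration.

```python
import bisect

def match_fraction(times_a, times_b, window=0.05):
    """Fraction of cell timestamps in times_a that have a counterpart
    in times_b within `window` seconds: a crude correlation score."""
    b = sorted(times_b)
    hits = 0
    for t in times_a:
        i = bisect.bisect_left(b, t - window)
        if i < len(b) and b[i] <= t + window:
            hits += 1
    return hits / len(times_a) if times_a else 0.0

# A pattern injected at the directory and seen again at the Guard
# scores high; unrelated traffic does not:
sent = [0.00, 0.40, 0.41, 1.30]
seen = [0.01, 0.41, 0.42, 1.31]       # same pattern, slightly delayed
unrelated = [5.0, 6.2, 7.7, 9.9]
```

Real correlation attacks in the literature are considerably more robust (handling jitter, drops, and background traffic), but the underlying idea is this kind of timestamp matching.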
649 00:46:43,240 --> 00:46:46,340 There are lots of attacks which could potentially be deployed against 650 00:46:46,340 --> 00:46:51,250 a decentralized network. At the moment the Tor network is relatively well understood 651 00:46:51,250 --> 00:46:54,490 in terms of what types of attack it is vulnerable to. So if we were to move 652 00:46:54,490 --> 00:46:58,880 to a new architecture then we might open it to a whole new class of attacks. 653 00:46:58,880 --> 00:47:02,000 The Tor network has existed for quite some time and it’s been 654 00:47:02,000 --> 00:47:06,820 very well studied. What about global adversaries like the NSA, which can 655 00:47:06,820 --> 00:47:10,980 monitor network links all across the world? It’s very difficult to defend 656 00:47:10,980 --> 00:47:15,530 against that. If they can identify which Guard relay 657 00:47:15,530 --> 00:47:18,760 you’re using, they can monitor traffic going into and out of the Guard relay, 658 00:47:18,760 --> 00:47:23,259 and log each of the subsequent hops along the way. It’s very, very difficult to defend against 659 00:47:23,259 --> 00:47:26,470 these types of things. Do we know if they’re doing it? The documents that were 660 00:47:26,470 --> 00:47:29,850 released yesterday – I’ve only had a very brief look through them, but they suggest 661 00:47:29,850 --> 00:47:32,480 that they’re not presently doing it and they haven’t had much success. 662 00:47:32,480 --> 00:47:36,450 I don’t know why; there are very powerful attacks described in the academic literature 663 00:47:36,450 --> 00:47:40,830 which are very, very reliable, and most academic literature you can access for free, 664 00:47:40,830 --> 00:47:43,960 so it’s not even as if they have to figure out how to do it. They just have to read 665 00:47:43,960 --> 00:47:47,010 the academic literature and try and implement some of these attacks. 666 00:47:47,010 --> 00:47:52,000 I don’t know why they’re not.
The next question is how to detect malicious 667 00:47:52,000 --> 00:47:57,760 relays. So in my case we were running 40 relays. Our relays were on consecutive 668 00:47:57,760 --> 00:48:01,570 IP addresses – well, most of them were on consecutive 669 00:48:01,570 --> 00:48:04,820 IP addresses in two blocks. So they were running on IP addresses numbered 670 00:48:04,820 --> 00:48:09,280 e.g. 1, 2, 3, 4, … We were running two relays per IP address, 671 00:48:09,280 --> 00:48:12,210 and every single relay had my name plastered across it. 672 00:48:12,210 --> 00:48:14,740 So after I set up these 40 relays in 673 00:48:14,740 --> 00:48:17,420 a relatively short period of time I expected someone from the Tor Project 674 00:48:17,420 --> 00:48:22,260 to come to me and say: “Hey Gareth, what are you doing?” – no one noticed, 675 00:48:22,260 --> 00:48:26,090 no one noticed. So this is presently an open question. The Tor Project 676 00:48:26,090 --> 00:48:28,790 are quite open about this. They acknowledged that, in fact, last year 677 00:48:28,790 --> 00:48:33,210 the CERT researchers launched many more relays than that. The Tor Project 678 00:48:33,210 --> 00:48:36,510 spotted that large number of relays but chose not to do anything about it 679 00:48:36,510 --> 00:48:40,119 and, in fact, they were deploying an attack. But, as you know, it’s often very 680 00:48:40,119 --> 00:48:43,700 difficult to defend against unknown attacks. So at the moment how to detect 681 00:48:43,700 --> 00:48:47,780 malicious relays is a bit of an open question, which I think is being 682 00:48:47,780 --> 00:48:50,720 discussed on the mailing list. 683 00:48:50,720 --> 00:48:54,230 The other one is defending against unknown tampering at exits. If you take 684 00:48:54,230 --> 00:48:57,220 the exit relays – an exit relay can tamper with the traffic. 685 00:48:57,220 --> 00:49:01,040 So we know about particular types of attacks, doing SSL man-in-the-middle etc.
686 00:49:01,040 --> 00:49:05,350 We’ve seen recently binary patching. How do we detect unknown tampering 687 00:49:05,350 --> 00:49:08,970 with other types of traffic? The binary tampering wasn’t detected 688 00:49:08,970 --> 00:49:12,060 e.g. by the Tor Project themselves; it was spotted by someone else, 689 00:49:12,060 --> 00:49:15,609 who notified them. 690 00:49:15,609 --> 00:49:20,500 And then the final open one here is Tor code review. 691 00:49:20,500 --> 00:49:25,400 So the Tor code is open source. We know from OpenSSL that, although everyone 692 00:49:25,400 --> 00:49:29,260 can read source code, people don’t always look at it. And OpenSSL has been 693 00:49:29,260 --> 00:49:32,230 a huge mess, and there’s been lots of stuff disclosed about that 694 00:49:32,230 --> 00:49:35,880 over the last few days. There are lots of eyes on the Tor code, but I think 695 00:49:35,880 --> 00:49:41,519 more eyes are always better. Ideally we’d get people to look 696 00:49:41,519 --> 00:49:45,140 at the Tor code and look for vulnerabilities; I encourage people 697 00:49:45,140 --> 00:49:49,860 to do that. It’s a very useful thing to do. There could be unknown vulnerabilities 698 00:49:49,860 --> 00:49:53,119 in the Tor code, as we’ve seen quite recently with the “relay early” bug, which 699 00:49:53,119 --> 00:49:56,990 could be quite serious. The truth is we just don’t know until people do thorough 700 00:49:56,990 --> 00:50:02,500 code audits, and even then it’s very difficult to know for certain. 701 00:50:02,500 --> 00:50:08,170 So my last point, I think, yes, 702 00:50:08,170 --> 00:50:11,130 is advice to future researchers. So if you ever wanted, or are planning 703 00:50:11,130 --> 00:50:16,349 on doing a study in the future, e.g.
on Tor, do not do what the CERT researchers 704 00:50:16,349 --> 00:50:20,550 did and start deanonymising people on the live Tor network, doing it in a way 705 00:50:20,550 --> 00:50:25,060 which is incredibly irresponsible. I tend, myself, to give them 706 00:50:25,060 --> 00:50:28,510 the benefit of the doubt; I don’t think the CERT researchers set out to be malicious. 707 00:50:28,510 --> 00:50:33,320 I think they were just very naive, 708 00:50:33,320 --> 00:50:36,780 and that was rapidly pointed out to them. In my case we were running 709 00:50:36,780 --> 00:50:43,090 40 relays. Our Tor relays were forwarding traffic; they were acting as good relays. 710 00:50:43,090 --> 00:50:45,970 The only thing that we were doing was logging publication requests 711 00:50:45,970 --> 00:50:50,050 to the directories. Big question whether that’s malicious or not – I don’t know. 712 00:50:50,050 --> 00:50:53,330 One thing that has been pointed out to me is that the .onion addresses themselves 713 00:50:53,330 --> 00:50:58,270 could be considered sensitive information, so the only data we will be retaining 714 00:50:58,270 --> 00:51:01,840 from the study is the aggregated data. We won’t be retaining information 715 00:51:01,840 --> 00:51:05,400 on individual .onion addresses, because that could potentially be considered 716 00:51:05,400 --> 00:51:08,900 sensitive information. Think about someone running a .onion address which 717 00:51:08,900 --> 00:51:11,240 contains something which they don’t want other people knowing about. So we won’t 718 00:51:11,240 --> 00:51:15,060 be retaining that data, and we’ll be destroying it. 719 00:51:15,060 --> 00:51:19,920 So I think that brings me now to starting the questions. 720 00:51:19,920 --> 00:51:22,770 I want to say “Thanks” to a couple of people. The student who donated 721 00:51:22,770 --> 00:51:26,820 the server to us.
Nick Savage, who is one of my colleagues and was a sounding board 722 00:51:26,820 --> 00:51:30,510 during the entire study. Ivan Pustogarov, who is the researcher at the University 723 00:51:30,510 --> 00:51:34,700 of Luxembourg who sent us the large data set of .onion addresses from last year. 724 00:51:34,700 --> 00:51:37,670 He’s also the chap who demonstrated those deanonymisation attacks 725 00:51:37,670 --> 00:51:41,500 that I talked about. A big “Thank you” to Roger Dingledine, who has, frankly, 726 00:51:41,500 --> 00:51:45,230 fielded loads of questions from me over the last couple of days and allowed me 727 00:51:45,230 --> 00:51:49,410 to bounce ideas back and forth. That has been a very useful process. 728 00:51:49,410 --> 00:51:53,640 If you are doing future research I strongly encourage you to contact the Tor Project 729 00:51:53,640 --> 00:51:57,040 at the earliest opportunity. Certainly I found them to be 730 00:51:57,040 --> 00:51:59,460 extremely helpful. 731 00:51:59,460 --> 00:52:04,640 Donncha also did something similar, so both Ivan and Donncha have done 732 00:52:04,640 --> 00:52:09,520 a similar study in trying to classify the types of hidden services or work out 733 00:52:09,520 --> 00:52:13,520 how many hits there are to particular types of hidden service. Ivan Pustogarov 734 00:52:13,520 --> 00:52:17,430 did it on a bigger scale and found similar results to us: 735 00:52:17,430 --> 00:52:21,910 that these abuse sites featured frequently 736 00:52:21,910 --> 00:52:26,740 among the top requested sites. That was done over a year ago and, again, he was seeing 737 00:52:26,740 --> 00:52:31,109 similar sorts of patterns. These abuse sites were being requested frequently. 738 00:52:31,109 --> 00:52:35,450 So that also corroborates what we’re saying.
739 00:52:35,450 --> 00:52:38,540 The data I put online is at this address. There will probably be the slides, 740 00:52:38,540 --> 00:52:41,609 and something called ‘The Tor Research Framework’, which is 741 00:52:41,609 --> 00:52:47,510 an implementation of a Tor client in Java, specifically aimed 742 00:52:47,510 --> 00:52:52,080 at researchers. So if e.g. you wanna pull out data from a consensus you can do. 743 00:52:52,080 --> 00:52:55,290 If you want to build custom routes through the network you can do. 744 00:52:55,290 --> 00:52:58,230 If you want to build routes through the network and start sending padding traffic 745 00:52:58,230 --> 00:53:01,720 down them you can do, etc. The code is designed to be 746 00:53:01,720 --> 00:53:06,000 easily modifiable for testing lots of these things. 747 00:53:06,000 --> 00:53:10,580 There is also a link to the FBI exploit which they deployed against 748 00:53:10,580 --> 00:53:16,230 visitors to some Tor hidden services last year. They exploited a Mozilla Firefox bug 749 00:53:16,230 --> 00:53:20,540 and then ran code on the computers of users who were visiting these 750 00:53:20,540 --> 00:53:24,619 hidden services, to identify them. At this address there is a link to that, 751 00:53:24,619 --> 00:53:29,250 including a copy of the shell code and an analysis of exactly what it was doing. 752 00:53:29,250 --> 00:53:31,670 And then of course a list of references, with papers and things. 753 00:53:31,670 --> 00:53:34,260 So I’m quite happy to take questions now. 754 00:53:34,260 --> 00:53:46,960 *applause* 755 00:53:46,960 --> 00:53:50,880 Herald: Thanks for the nice talk! Do we have any questions 756 00:53:50,880 --> 00:53:57,000 from the internet? 757 00:53:57,000 --> 00:53:59,740 Signal Angel: One question.
It’s very hard to block addresses since creating them 758 00:53:59,740 --> 00:54:03,620 is cheap, and they can be generated for each user, and rotated often. So 759 00:54:03,620 --> 00:54:07,510 can you think of any other way of doing the blocking? 760 00:54:07,510 --> 00:54:09,799 Gareth: That is absolutely true, so, yes. If you were to block a particular .onion 761 00:54:09,799 --> 00:54:13,060 address they can just say: “I want another .onion address.” So I don’t know of 762 00:54:13,060 --> 00:54:16,760 any way to counter that now. 763 00:54:16,760 --> 00:54:18,510 Herald: Another one from the internet? *inaudible answer from Signal Angel* 764 00:54:18,510 --> 00:54:22,030 Okay, then, Microphone 1, please! 765 00:54:22,030 --> 00:54:26,359 Question: Thank you, that’s fascinating research. You mentioned that it is 766 00:54:26,359 --> 00:54:32,200 possible to influence the hash of your relay node, in the sense that you could 767 00:54:32,200 --> 00:54:35,970 choose which service you are advertising, or which hidden service 768 00:54:35,970 --> 00:54:38,050 you are responsible for. Is that right? Gareth: Yeah, correct! 769 00:54:38,050 --> 00:54:40,390 Question: So could you elaborate on how this is possible? 770 00:54:40,390 --> 00:54:44,740 Gareth: So e.g. if you just keep regenerating the public key for your relay, 771 00:54:44,740 --> 00:54:48,140 you’ll get closer and closer to the point where you’ll be the responsible relay 772 00:54:48,140 --> 00:54:51,160 for that particular hidden service. You just keep regenerating your identity 773 00:54:51,160 --> 00:54:54,720 hash until you’re at that particular point in the distributed hash table. That’s not particularly 774 00:54:54,720 --> 00:55:00,490 computationally intensive to do. That was it? 775 00:55:00,490 --> 00:55:04,740 Herald: Okay, next question from Microphone 5, please.
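The key-grinding answer above amounts to a brute-force search over identity keys. Here is a toy Python sketch of the loop, with random bytes standing in for real RSA relay keys and a short hex prefix standing in for "close to the hidden service's descriptor ID" on the SHA-1 ring; a real attack compares ring positions rather than prefixes, but the loop has the same shape.

```python
import hashlib
import os

def grind_identity(target_prefix, max_tries=1_000_000):
    """Keep generating a fresh 'identity key' until its SHA-1
    fingerprint starts with target_prefix, i.e. lands at a chosen
    point on the hash ring. Expected tries: 16 ** len(target_prefix)."""
    target = target_prefix.lower()
    for tries in range(1, max_tries + 1):
        key = os.urandom(128)              # stand-in for an RSA key
        fingerprint = hashlib.sha1(key).hexdigest()
        if fingerprint.startswith(target):
            return key, fingerprint, tries
    raise RuntimeError("no matching key within budget")

key, fp, tries = grind_identity("ab")      # ~256 attempts on average
```

Each extra hex digit of precision multiplies the work by 16, which is why landing "close enough" to one chosen descriptor ID is cheap on commodity hardware.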
776 00:55:04,740 --> 00:55:09,490 Question: Hi, I was wondering, for the attacks where you identify a certain number 777 00:55:09,490 --> 00:55:15,170 of users using a hidden service: have those attacks been used, is there 778 00:55:15,170 --> 00:55:18,880 any evidence of that, and is there any way of protecting against it? 779 00:55:18,880 --> 00:55:22,260 Gareth: That’s a very interesting question: is there any way to detect these types 780 00:55:22,260 --> 00:55:24,970 of attacks? So for some of the attacks, if you’re going to generate particular 781 00:55:24,970 --> 00:55:29,030 traffic patterns, one way to do that is to use the padding cells. The padding cells 782 00:55:29,030 --> 00:55:32,070 aren’t used at the moment by the official Tor client. So the detection of those 783 00:55:32,070 --> 00:55:36,510 could be indicative, but it’s not conclusive evidence on its own. 784 00:55:36,510 --> 00:55:40,050 Question: And is there any way of protecting against a government 785 00:55:40,050 --> 00:55:46,510 or something trying to denial-of-service hidden services? 786 00:55:46,510 --> 00:55:48,180 Gareth: So I… trying to… did not… 787 00:55:48,180 --> 00:55:52,500 Question: Is it possible to protect against this kind of attack? 788 00:55:52,500 --> 00:55:56,180 Gareth: Not that I’m aware of. The Tor Project are currently revising how they 789 00:55:56,180 --> 00:55:59,500 do the hidden service protocol, which will make e.g. what I did – enumerating 790 00:55:59,500 --> 00:56:03,230 the hidden services – much more difficult, and also make it harder to position 791 00:56:03,230 --> 00:56:07,470 yourself on the distributed hash table in advance for a particular hidden service. 792 00:56:07,470 --> 00:56:10,510 So they are at the moment trying to change the way it’s done, and make some of 793 00:56:10,510 --> 00:56:15,270 these things more difficult. 794 00:56:15,270 --> 00:56:20,290 Herald: Good. Next question from Microphone 2, please.
795 00:56:20,290 --> 00:56:27,220 Mic2: Hi. I’m running the Tor2Web abuse desk, so I used to see a lot of abuse requests 796 00:56:27,220 --> 00:56:31,130 concerning Tor hidden services being exposed on the internet through 797 00:56:31,130 --> 00:56:37,270 the Tor2Web.org domain name. And I just wanted to comment on, like you said, 798 00:56:37,270 --> 00:56:45,410 the number of abuse requests. I have spoken with some of the child protection 799 00:56:45,410 --> 00:56:50,070 agencies that reported abuse at Tor2Web.org, and they are effectively 800 00:56:50,070 --> 00:56:55,570 using crawlers that periodically look for changes in order to get new images to be 801 00:56:55,570 --> 00:57:00,190 put in the database. And what I was able to understand is that the German agency 802 00:57:00,190 --> 00:57:07,440 doing that is crawling the same sites that the Italian agencies are crawling, too. 803 00:57:07,440 --> 00:57:11,890 So it’s likely that in most countries the child protection 804 00:57:11,890 --> 00:57:16,790 agencies are crawling that small number of Tor hidden services that 805 00:57:16,790 --> 00:57:22,760 contain child porn. And I also saw from the Tor2Web statistics 806 00:57:22,760 --> 00:57:28,500 that the amount of abuse relating to that kind of content is relatively low. 807 00:57:28,500 --> 00:57:30,000 Just as a contribution! 808 00:57:30,000 --> 00:57:33,500 Gareth: Yes, that’s very interesting, thank you for that! 809 00:57:33,500 --> 00:57:37,260 *applause* 810 00:57:37,260 --> 00:57:39,560 Herald: Next, Microphone 4, please. 811 00:57:39,560 --> 00:57:45,260 Mic4: You deanonymised users with an infected or modified Guard 812 00:57:45,260 --> 00:57:51,810 relay. Is it required to modify the Guard relay if I control the entry point 813 00:57:51,810 --> 00:57:57,360 of the user to the internet? If I’m his ISP?
814 00:57:57,360 --> 00:58:01,900 Gareth: Yes, if you observe traffic travelling into a Guard relay without 815 00:58:01,900 --> 00:58:04,570 controlling the Guard relay itself. Mic4: Yeah. 816 00:58:04,570 --> 00:58:07,500 Gareth: In theory, yes. I wouldn’t be able to tell you how reliable that is 817 00:58:07,500 --> 00:58:10,500 off the top of my head. Mic4: Thanks! 818 00:58:10,500 --> 00:58:13,630 Herald: So another question from the internet! 819 00:58:13,630 --> 00:58:16,339 Signal Angel: Wouldn’t the ability to choose the key hash prefix give 820 00:58:16,339 --> 00:58:19,980 the ability to target specific .onions? 821 00:58:19,980 --> 00:58:23,680 Gareth: So you can only target one .onion address at a time, because of the way 822 00:58:23,680 --> 00:58:28,080 they are generated. So you wouldn’t be able to say e.g. “Pick a key which targets 823 00:58:28,080 --> 00:58:32,339 two or more .onion addresses.” You can only target one .onion address at a time, 824 00:58:32,339 --> 00:58:37,720 by positioning yourself at a particular point on the distributed hash table. 825 00:58:37,720 --> 00:58:40,260 Herald: Another one from the internet? … Okay. 826 00:58:40,260 --> 00:58:43,369 Then Microphone 3, please. 827 00:58:43,369 --> 00:58:47,780 Mic3: Hey. Thanks for this research. I think it strengthens the network. 828 00:58:47,780 --> 00:58:54,300 So in the deem (?) I was wondering whether you can donate these relays to be part of 829 00:58:54,300 --> 00:58:59,500 a non-malicious relay pool, basically use them as regular relays afterwards? 830 00:58:59,500 --> 00:59:02,750 Gareth: Okay, so can I donate the relays to be rerun as Tor capacity (?)? 831 00:59:02,750 --> 00:59:05,490 Unfortunately, as I said, they were run by a student and they were donated for 832 00:59:05,490 --> 00:59:09,510 a fixed period of time. So we’ve given those back to him. We are very grateful 833 00:59:09,510 --> 00:59:14,790 to him, he was very generous.
In fact, without his contribution, donating these, 834 00:59:14,790 --> 00:59:18,700 it would have been much more difficult to collect as much data as we did. 835 00:59:18,700 --> 00:59:21,490 Herald: Good, next, Microphone 5, please! 836 00:59:21,490 --> 00:59:25,839 Mic5: Yeah hi, first of all thanks for your talk. I think you’ve raised 837 00:59:25,839 --> 00:59:29,310 some real issues that need to be considered very carefully by everyone 838 00:59:29,310 --> 00:59:33,950 on the Tor Project. My question: I’d like to go back to the issue of so many 839 00:59:33,950 --> 00:59:38,470 abuse-related web sites running over the Tor network. I think it’s an important 840 00:59:38,470 --> 00:59:41,900 issue that really needs to be considered, because we don’t wanna be associated 841 00:59:41,900 --> 00:59:44,840 with that at the end of the day – anyone who uses Tor, who runs a relay 842 00:59:44,840 --> 00:59:51,250 or an exit node. And I understand it’s a bit of a sensitive issue, and you don’t 843 00:59:51,250 --> 00:59:55,300 really have any say over whether it’s implemented or not. But I’d like to get 844 00:59:55,300 --> 01:00:02,410 your opinion on the implementation of a distributed block-deny system 845 01:00:02,410 --> 01:00:06,980 that would run in very much a similar way to the directory authorities. 846 01:00:06,980 --> 01:00:08,950 I’d just like to see what you think of that. 847 01:00:08,950 --> 01:00:13,200 Gareth: So you’re asking me whether I want to support a particular blocking mechanism 848 01:00:13,200 --> 01:00:14,200 then? 849 01:00:14,200 --> 01:00:16,470 Mic5: I’d like to get your opinion on it.
*Gareth laughs* 850 01:00:16,470 --> 01:00:20,540 I know it’s a sensitive issue but, like I said, I think something… 851 01:00:20,540 --> 01:00:25,700 I think it needs to be considered, because everyone running exit nodes and relays, 852 01:00:25,700 --> 01:00:30,270 and the people of the Tor Project, don’t want to be known for or associated with 853 01:00:30,270 --> 01:00:34,790 this massive amount of abuse web sites that currently exists within the Tor network. 854 01:00:34,790 --> 01:00:40,210 Gareth: I absolutely agree, and I think the Tor Project are horrified as well that 855 01:00:40,210 --> 01:00:43,960 this problem exists; they have, in fact, talked about it in previous years, that 856 01:00:43,960 --> 01:00:48,690 they have a problem with this type of content. As to what, if anything, is 857 01:00:48,690 --> 01:00:52,340 done about it, it’s very much up to them. Could it be done in a distributed fashion? 858 01:00:52,340 --> 01:00:56,240 So the example I gave was a way in which it could be done by relay operators. 859 01:00:56,240 --> 01:00:59,770 That would need the consensus of a large number of relay operators to be 860 01:00:59,770 --> 01:01:02,890 effective. So that is done in a distributed fashion. The question is: 861 01:01:02,890 --> 01:01:06,810 who gives the list of .onion addresses to block to each of the relay operators? 862 01:01:06,810 --> 01:01:09,640 Clearly, the relay operators aren’t going to collect it themselves. It needs to be 863 01:01:09,640 --> 01:01:15,780 supplied by someone like the Tor Project, or someone trustworthy. Yes, it can 864 01:01:15,780 --> 01:01:20,480 be done in a distributed fashion. It can be done in an open fashion. 865 01:01:20,480 --> 01:01:21,710 Mic5: Who knows? Gareth: Okay. 866 01:01:21,710 --> 01:01:23,750 Mic5: Thank you. 867 01:01:23,750 --> 01:01:27,260 Herald: Good. And another question from the internet.
868 01:01:27,260 --> 01:01:31,210 Signal Angel: Apparently there’s an option in the Tor client to collect statistics 869 01:01:31,210 --> 01:01:35,169 on hidden services. Do you know about this, and how it relates to your research? 870 01:01:35,169 --> 01:01:38,551 Gareth: Yes, I believe they’re going to be… the extent to which I know about it 871 01:01:38,551 --> 01:01:41,930 is they’re gonna be trying this next month, to try and estimate how many 872 01:01:41,930 --> 01:01:46,490 hidden services there are. So keep your eye on the Tor Project web site, 873 01:01:46,490 --> 01:01:50,340 I’m sure they’ll be publishing their data in the coming months. 874 01:01:50,340 --> 01:01:55,090 Herald: And, sadly, we are running out of time, so this will be the last question, 875 01:01:55,090 --> 01:01:56,980 so Microphone 4, please! 876 01:01:56,980 --> 01:02:01,250 Mic4: Hi, I’m just wondering if you could sort of outline what ethical clearances 877 01:02:01,250 --> 01:02:04,510 you had to get from your university to conduct this kind of research. 878 01:02:04,510 --> 01:02:07,260 Gareth: So we have to discuss these types of things before undertaking 879 01:02:07,260 --> 01:02:11,970 any research. And we go through the steps to make sure that we’re not e.g. storing 880 01:02:11,970 --> 01:02:16,370 sensitive information about particular people. So yes, we are very mindful 881 01:02:16,370 --> 01:02:19,240 of that. And that’s why I made a particular point of putting on the slides 882 01:02:19,240 --> 01:02:21,510 as to some of the things to consider. 883 01:02:21,510 --> 01:02:26,180 Mic4: So like… you outlined a potential implementation of the traffic correlation 884 01:02:26,180 --> 01:02:29,500 attack. Are you saying that you performed the attack? Or… 885 01:02:29,500 --> 01:02:33,180 Gareth: No, no no, absolutely not. So the link I’m giving… absolutely not. 
886 01:02:33,180 --> 01:02:34,849 We have not engaged in any… 887 01:02:34,849 --> 01:02:36,350 Mic4: It just wasn’t clear from the slides. 888 01:02:36,350 --> 01:02:39,380 Gareth: I apologize. So to be absolutely clear on that: no, we’re not engaging 889 01:02:39,380 --> 01:02:42,860 in any deanonymisation research on the Tor network. The research I showed 890 01:02:42,860 --> 01:02:46,079 is linked in the references, I think, which I put at the end of the slides. 891 01:02:46,079 --> 01:02:52,000 You can read about it. But it’s done in simulation. So e.g. there’s a way 892 01:02:52,000 --> 01:02:54,730 to do simulation of the Tor network on a single computer. I can’t remember 893 01:02:54,730 --> 01:02:58,880 the name of the project, though. Shadow! Yes, it’s a system 894 01:02:58,880 --> 01:03:02,170 called Shadow, where you can run a large number of Tor relays on a single computer 895 01:03:02,170 --> 01:03:04,579 and simulate the traffic between them. If you’re going to do that type of research 896 01:03:04,579 --> 01:03:09,380 then you should use that. Okay, thank you very much, everyone. 897 01:03:09,380 --> 01:03:15,280 *applause* 898 01:03:15,280 --> 01:03:25,750 *silent postroll titles*