And now it is my particular pleasure to announce Professor Rachel Greenstadt, Aylin Caliskan-Islam, and Rebekah Overdorf from the Privacy, Security and Automation Lab at Drexel University. They are quite old hands at speaking at the Congress already. So, please give them a warm round of applause. And there we go.

It would be absolutely wonderful if we could get the presentation up there. Thank you. Yes, that's much better.

Hello, I'm Rachel Greenstadt, a professor at Drexel University, where I lead the Privacy, Security and Automation Lab. This is joint work with my students, Aylin Caliskan-Islam and Becca Overdorf, who will be speaking later.

So we're going to talk today about authorship attribution in source code and in social media.

First, I'm going to talk about stylometry, which is how we usually do authorship attribution in my lab. The theory behind it is that everybody's writing style, and speaking style indeed, is unique, because we all learn language on an individual basis. And each of us, even though we might speak the same language, speaks sort of our own individual dialect of it. For example, in English there are regional differences, where some people may say that one piece of furniture is a couch, whereas other people might say it's a sofa. Furthermore, there are words that have similar meanings but are actually different words, like "although" and "though", and which ones people prefer is a sort of stylistic idiosyncrasy in writing. People may also use the same word with different spellings.
And there are also just many ways to express very similar ideas. Someone might say "the fork is to the left of the plate" versus "the fork is at the plate's left". These differences are how we can often distinguish authors in writing and documents. That's a lot of the work that we do in my lab, the Privacy, Security and Automation Lab. It's a research lab at Drexel University with about ten students, a mixture of graduate and undergraduate students. In general, we study how to have machines help humans make decisions about security, privacy, and trust, often using machine learning and natural language processing techniques.

In particular, we're very interested in what we can learn when we analyze unstructured, or in some ways structured, human textual communication. And this is what we've spoken about at the CCC in the past. Mike Brennan spoke at 26C3 on privacy and stylometry and how authorship recognition techniques can be attacked and deceived, and again at 28C3 with Sadia Afroz. Aylin and Sadia spoke two years ago on applying stylometry to online underground markets. This year we're going to talk about source code and social media stylometry. People always ask us: what about source code, what about tweets, stuff like that. So we're going to answer some of those questions in this talk. In general, in the lab we also do a lot of other work, like social network analysis of online communities, textual analysis, and studying secure machine learning.

Since we're a privacy lab, what is the connection between privacy and stylometry?
Well, there are very good techniques for location privacy that the privacy enhancing technologies community has worked on. You're probably pretty familiar with Tor, here on my T-shirt, and mixes and other types of techniques. They can hide your IP address from people on the Internet. But in some cases, when you're expressing yourself in text online, that might be insufficient. And that's where my research comes in.

Stylometry can be used to identify authors based on their writing. This is important because it's a potential threat to people who are exposing crime and corruption or doing political organizing, especially if they're speaking in first-hand testimonial accounts. And it's also just important for normal people who want to express their opinions or write code and share it online, without necessarily having the thing they wrote follow them forever through their life like a dossier.

So let's go back to stylometry, and let me give a short tutorial on how it works. The methods that are used today use machine learning. Say you have two authors, Cormac McCarthy and Ernest Hemingway. They're both authors with somewhat distinct styles. McCarthy might write: "What's the bravest thing you ever did? He spat in the road a bloody phlegm. Getting up this morning, he said." And then there's Ernest Hemingway: "He no longer dreamed of storms, nor of women, nor of great occurrences, nor of great fish, nor fights, nor contests of strength, nor of his wife." So the question is, how can you tell the difference between these people? We can't just feed the text in straight; we have to extract features from it.
What types of features do we use? An example might be the frequency of function words. Function words are the small little words that don't necessarily mean anything on their own. We also look at, say, the frequency of punctuation, which tells us something about the structure. We use a lot more features in our work, which we'll talk about later, but we feed these into a machine learning model. In many cases we'll use a support vector machine; sometimes we'll use a random forest.

A good model generally needs about 4,500 to 7,500 words of training data and a thousand or more features. These may be many features of the same type, like word n-grams, for example.

Oh, I'm sorry, am I not... Is this better? I hope the system is working. Sorry about that. OK.

To actually use this, say we have an unknown document, which is our test document: "Just remember that the things you put into your head are there forever, he said. You might want to think about that." And we don't know whether this was written by Ernest Hemingway or Cormac McCarthy. So we extract features from this document. Now, this is a very short text snippet; for best results we need about 500 words. And we'd ask the model who wrote it, and it would tell us that it's Cormac McCarthy, which indeed it is.

In general, stylometry methods are pretty good, especially when you're dealing with sets of authors in the range of about a hundred authors as your world of suspects. Then the best methods work at above 90 percent accuracy.
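To make this concrete, here is a minimal sketch of that pipeline in Python with scikit-learn. The function-word list, the punctuation list, and the toy training quotes are illustrative placeholders, not the lab's actual feature set or data.

```python
# Minimal stylometry sketch: frequency features fed into a linear SVM.
import re
from sklearn.svm import LinearSVC

FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "nor", "he"]
PUNCTUATION = [",", ".", ";", "?", "'"]

def features(text):
    words = re.findall(r"[a-z']+", text.lower())
    total = max(len(words), 1)
    # Relative frequency of each function word...
    vec = [words.count(w) / total for w in FUNCTION_WORDS]
    # ...plus punctuation frequencies, normalized by text length.
    vec += [text.count(p) / max(len(text), 1) for p in PUNCTUATION]
    return vec

train_texts = [
    "What's the bravest thing you ever did? He spat in the road a bloody phlegm.",
    "He no longer dreamed of storms, nor of women, nor of great occurrences.",
]
train_labels = ["McCarthy", "Hemingway"]

clf = LinearSVC().fit([features(t) for t in train_texts], train_labels)

test = "Just remember that the things you put into your head are there forever."
print(clf.predict([features(test)]))  # the model's guess at the author
```

A real system would use thousands of features and far more training text per author, as described above; the structure of the pipeline is the same.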
Now, these methods can be scaled. People have done experiments: Koppel et al. with 10,000 authors, and Narayanan et al. with over 100,000 authors. You can see that even in these cases the results are much, much better than random chance, which does allow you to narrow the world of suspects quite a bit.

Another question that we asked previously in my lab was how strong these techniques are when people are actually trying to fool them. We found that people in general were able to reduce the accuracy of these techniques by writing in a specific way to try to hide their writing style or to imitate another author. We actually asked people to imitate Cormac McCarthy in this case.

Now, I wouldn't necessarily recommend just doing that. If you wanted to hide your writing, you'd probably want to verify in some way that you'd actually done it correctly. So we do have some tools in our lab. JStylo is an authorship analysis tool, and Anonymouth is an authorship anonymization tool, which is very much a work in progress. These are available on our GitHub page. We'd love to have your comments, help, thoughts, et cetera, on them.

And we looked at underground forums. This is an excerpt from the Carders forum, where people trade credit card information, and to do this work we had to extend our analysis tool to German, so JStylo now speaks German. These are the types of features we use: the frequency of n-grams, punctuation, special characters, function words, in this case German-specific function words, and parts of speech.
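A rough sketch of what such language-specific feature extraction can look like, assuming scikit-learn; the German function-word list is a tiny illustrative sample, not what JStylo actually ships with.

```python
# Sketch: character n-grams plus German function words as features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

german_function_words = ["der", "die", "das", "und", "aber", "doch", "nur"]

features = FeatureUnion([
    # All character 1- to 3-grams, including punctuation and spacing.
    ("char_ngrams", CountVectorizer(analyzer="char", ngram_range=(1, 3))),
    # Frequencies of a fixed vocabulary of German function words.
    ("function_words", CountVectorizer(vocabulary=german_function_words)),
])

posts = ["Ich habe die Karten, aber nur heute.", "Das ist doch ein Test."]
X = features.fit_transform(posts)
print(X.shape)  # one row per forum post, one column per feature
```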
So that's another case where you need language-specific feature sets.

The question that you might wonder about is whether this is purely an academic concern. Do people actually use stylometry in the real world to identify people who might not want to be identified? And the answer to this is yes.

In a rather sensational case, J.K. Rowling, who as we know is the author of the Harry Potter novels, wrote another book under the pseudonym Robert Galbraith. Juola & Associates is a stylometry firm, and they did some work, using tools that are part of our analysis engine, actually, at the request of a reporter who'd received an anonymous tip over Twitter. After their linguistic analysis, he felt confident enough to run with the story and indeed did expose J.K. Rowling as the author of this book.

And our doppelganger finder code, which we designed to give the probability that two accounts are the same person, is actually used by the FBI. We pointed them at our GitHub, and we don't know what exactly they use it for, but they did tell us that they found it useful.

And there are many expert witnesses who use this in forensic and legal proceedings throughout the world. I know the most about US law, where forensic linguistic evidence is covered under the Van Wyk opinion, which speaks to how it can be used and how it can be considered.

OK, so this is all the stuff that we've done, and it gives you some context, so hopefully you have an idea of what stylometry is and how it works.
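The doppelganger-finder idea mentioned a moment ago can be sketched roughly as follows: score how strongly one account's posts are attributed to a model of the other account, in both directions, and average. This is a simplified illustration of the idea, not the released tool; the feature choice and the combination rule here are assumptions.

```python
# Sketch of the doppelganger idea: if A's posts are attributed to B
# (and B's to A) with high probability, flag the pair as linked.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def attribution_prob(train_posts, label, background, test_posts):
    """Train 'label' vs 'other' and return mean P(label) over test_posts."""
    vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
    X = vec.fit_transform(train_posts + background)
    y = [label] * len(train_posts) + ["other"] * len(background)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    idx = list(clf.classes_).index(label)
    return clf.predict_proba(vec.transform(test_posts))[:, idx].mean()

def link_score(posts_a, posts_b, background):
    # Average both directions: A scored against B's model and vice versa.
    p_a_as_b = attribution_prob(posts_b, "B", background, posts_a)
    p_b_as_a = attribution_prob(posts_a, "A", background, posts_b)
    return (p_a_as_b + p_b_as_a) / 2  # higher = more likely one person
```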
But we're going to talk today about two particularly interesting and difficult cases.

The first one is: what if you have an unknown Twitter feed? Can you learn its author from blogs, or from comments on a news site, or from Reddit? Because you might not have a Twitter feed for that person. And the answer to this is yes. However, if you do have a Twitter feed for the suspect, then you should probably use that instead.

And then we always get this question: what about source code? Can you detect source code authorship from style? And the answer is that, yes, we can do that too. What's particularly neat about this is that even if you run the code through an off-the-shelf obfuscator, it still works. So I'm going to now turn the talk over to Aylin, who's going to talk about that work.

Hi, everyone. So now we'll be looking at source code stylometry, and here we are trying to find out who wrote a piece of anonymous code by looking at their coding style. There are two common scenarios that come to mind when source code authorship attribution comes up. The first one is: let's say that Alice's computer got infected, and she has a piece of source code left from the malware, and Bob has a collection of malware with known authors. So Bob can look at his collection of malware to identify who Alice's adversary was. The second scenario applies to plagiarism. Let's say that Alice got an extension on her programming assignment, and her professor, Bob, has everyone else's submissions. So Bob can look at everyone else's submissions and compare them with Alice's new submission to see if Alice plagiarized.
In these cases, we're talking about serious security-enhancing uses of source code authorship attribution. But unfortunately, sometimes security-enhancing technologies are actually privacy-infringing in other cases. For example, Saeed Malekpour: he's a web programmer who was sentenced to death because he was identified by the Iranian government as the programmer of a porn site. He was held in solitary confinement for one year without legal representation. His family says that he is also a permanent resident of Canada, and that he didn't know the porn site's developers were using his photo-uploading software. He also said that if he had known this was going to be used by a porn site, he would never have put his name on it, because it's illegal in Iran. After that, under pressure, he said that he regrets his actions, and now his death sentence has been cancelled.

When we look at source code authorship attribution, we can define it as a machine learning problem with four main experimental settings. The first one is software forensics. Here we have multiple authors, which corresponds to a multiclass learner, and it's in an open-world setting, which means that we don't know the suspect set. In the regular case of authorship attribution, which we can also call stylometric plagiarism detection, we have the multiclass case with multiple authors, and we do know the suspect set, so it's a closed-world machine learning problem. And we can also apply source code stylometry to a copyright investigation, where we have two parties in a dispute.
So it's a two-class problem, and a closed-world problem, because we know both of the sides in the dispute. And in authorship verification, we would like to answer: this person who claims to have written this piece of source code, did they really write it, or did someone else write it? This is kind of a two-class, one-class formulation, which we will look into in detail later, and it's an open-world problem, because the code was either written by the claimed person or it was written by someone we have no idea about.

And here is a summary table of our main results. You can see that with the 250-class, 250-author test, we get 95 percent accuracy in identifying them. This is a very high accuracy compared to previous work, and it indicates that we introduced a new, principled method with a robust syntactic feature set for performing source code stylometry, which has not been done before at this scale and in this way.

In order to understand coding style, we have to look at programming style features. First of all, we have a piece of source code, and we look at some lexical features, like variable names and the use of C++ keywords. Then we look at layout features, like the spaces and the tabs, and we extract those from the source code. After that, we preprocess the source code to obtain its abstract syntax tree, which reveals structural features; it's the grammar of the code. For that we use the fuzzy abstract syntax tree parser that was provided by our collaborator Fabian Yamaguchi, who presented yesterday. And since it's a fuzzy parser, it can even handle incomplete pieces of code.
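As a hedged sketch of the lexical and layout side of this, the snippet below pulls a few such statistics out of a C++ source string; these particular counts are illustrative stand-ins for the paper's much larger feature set.

```python
# Sketch: a few lexical and layout features from C++ source text.
import re

CPP_KEYWORDS = ["if", "else", "for", "while", "typedef", "struct", "template"]

def lexical_layout_features(code):
    length = max(len(code), 1)
    idents = re.findall(r"\b[A-Za-z_]\w*\b", code)
    feats = {}
    # Lexical: how often each C++ keyword appears, normalized by length.
    for k in CPP_KEYWORDS:
        feats["kw_" + k] = len(re.findall(r"\b%s\b" % k, code)) / length
    # Lexical: average identifier (e.g. variable name) length.
    feats["avg_ident_len"] = sum(map(len, idents)) / max(len(idents), 1)
    # Layout: whitespace habits relative to code size.
    feats["tabs"] = code.count("\t") / length
    feats["spaces"] = code.count(" ") / length
    feats["empty_lines"] = code.count("\n\n") / length
    return feats

print(lexical_layout_features("int main() {\n\tfor (int i = 0; i < n; i++) {}\n}"))
```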
And once we have the abstract syntax tree, we extract syntactic features, such as the node depths, the abstract syntax tree node types, or the node term frequency-inverse document frequency (TF-IDF).

And we saw a recurring subset of features coming up in many of our datasets, with hundreds of authors and thousands of program files. For example, we see that most of the features in this list are syntactic, and these features are the most important features because they have the highest information gain. The syntactic features are mostly the node depths in the abstract syntax tree and the node term frequencies, or TF-IDF. We also see some lexical features, like the C++ keyword typedef, and some layout features, such as the number of tabs that were used.

This slide illustrates our general method across many different experimental settings. In order to do experiments, first of all we need datasets, so we went ahead and scraped the submissions of contestants from Google Code Jam. Google Code Jam is an international annual programming competition, and since 2008 Google has been publishing the correct submissions online. So we went ahead and scraped all the correct C++ submissions from 2008 until 2014, and we ended up with a dataset with more than 100,000 users. Once we have the source code, we preprocess it with the fuzzy parser, and then we extract lexical, syntactic, and layout features. As a classifier, we use a random forest, to avoid overfitting, with three hundred trees, and these trees vote by majority for the final classification, depending on our task.

And I would like to give some statistics about our Google Code Jam dataset.
We saw that in the 2014 dataset, which we used as our main one because it was the largest one in C++, the average solution was 70 lines of code. In this programming contest, everyone is implementing the same problem, the same functionality, at the same time and in a limited time. And whenever we are performing a machine learning task, we always train on the same set of problems that people answered, and then when we are testing, we choose a problem that was not in the training set. That makes it a further, more difficult machine learning problem, because the test question was not seen in the training set before. And here, on the right pie chart, we see that C++ was the most common language, and that was also true for other years.

Now, I will go over some scenarios where we can apply source code authorship attribution, and I'll give examples as I'm talking about the scenarios. I would like to explain the first one, which is regular authorship attribution, by giving the Satoshi example. Everyone is trying to find out who Satoshi is, and we have Satoshi's source code as well, for example from the initial contributions or commits on the Bitcoin repository. We have his code, but we don't know who this anonymous programmer actually is. So we could train on data from a suspect set, and after that we could test on this initial Bitcoin code to see who Satoshi is. For this experimental setup, we took 250 authors, trained on their files, and we had 2,250 anonymous program files. When we trained and tested, we got 95 percent accuracy in correctly identifying these more than 2,000 files.
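An experiment of this shape might look like the following minimal sketch: featurize each submission, train a 300-tree random forest on some problems, and test on a held-out problem. The tiny feature function and the inline data are stand-ins for the real feature set and the scraped Code Jam corpus.

```python
# Sketch of the attribution experiment: train a 300-tree random forest
# on solutions to some problems, test on a held-out problem.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

def feats(code):
    # Tiny stand-in for the real lexical/layout/syntactic feature set.
    return {"tabs": code.count("\t"), "spaces": code.count(" "),
            "newlines": code.count("\n"), "semicolons": code.count(";")}

# Assumed layout: (author, problem, source code) per submission.
solutions = [
    ("alice", "p1", "for(int i=0;i<n;i++){s+=a[i];}"),
    ("alice", "p2", "for(int j=0;j<m;j++){t+=b[j];}"),
    ("bob",   "p1", "while (k < n)\n{\n\ts = s + a[k];\n\tk++;\n}"),
    ("bob",   "p2", "while (p < m)\n{\n\tt = t + b[p];\n\tp++;\n}"),
]

train = [s for s in solutions if s[1] != "p2"]  # held-out problem: p2
test = [s for s in solutions if s[1] == "p2"]

vec = DictVectorizer()
X_train = vec.fit_transform([feats(c) for _, _, c in train])
X_test = vec.transform([feats(c) for _, _, c in test])

clf = RandomForestClassifier(n_estimators=300)  # 300 trees, majority vote
clf.fit(X_train, [a for a, _, _ in train])
print(clf.score(X_test, [a for a, _, _ in test]))
```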
If we only had a suspect set for Satoshi that we could train on: if you had a suspect set for Satoshi, that would be the training part, and then we would use the initial Bitcoin code for testing, and we might be able to predict who the contributor Satoshi might be. Not that we are trying to do this; this is just an example.

In the second case, we will talk about obfuscation. There are several reasons people try to obfuscate their code to make it unrecognizable. First of all, you might have plagiarized, and you might be trying to hide that you copied someone else's work. Or you might have a piece of malware, and you might be trying to make it unrecognizable so that it won't be detected. Or, in other cases, you might just be trying to stay anonymous and hide your coding style. But we saw that our authorship attribution technique is not affected by common off-the-shelf commercial obfuscators. I'll give an example with the obfuscator that we used, which you can buy, I think, for about four hundred dollars. It's called Stunnix. We are not related to it; we just used it because it was the cheapest commercial one, and a widely used one. In this example, we will see how C++ code is obfuscated. We see that variable names are being hashed, and all the spaces and comments are being stripped. If there are any numbers, they are going to be replaced with a combination of hexadecimal, binary, and decimal numbers. And if there are any characters, they are going to be replaced with hexadecimal escapes. You can choose different settings for the hashing and the combinations.
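As a toy illustration of two of these transformations, identifier hashing and comment/whitespace stripping, here is a small sketch; it is not the actual Stunnix tool, just an approximation of the effect.

```python
# Toy obfuscator: hash identifiers, strip comments and extra whitespace.
# (A real commercial obfuscator does much more, e.g. rewriting literals.)
import hashlib
import re

CPP_KEYWORDS = {"int", "for", "if", "else", "return", "while"}

def toy_obfuscate(code):
    # Strip // line comments and /* block */ comments.
    code = re.sub(r"//[^\n]*|/\*.*?\*/", "", code, flags=re.S)
    # Replace identifiers (except keywords) with hashed names.
    def hash_ident(m):
        name = m.group(0)
        if name in CPP_KEYWORDS:
            return name
        return "z" + hashlib.md5(name.encode()).hexdigest()[:8]
    code = re.sub(r"\b[A-Za-z_]\w*\b", hash_ident, code)
    # Collapse all whitespace runs into single spaces.
    return re.sub(r"\s+", " ", code).strip()

print(toy_obfuscate(
    "int total = 0; // running sum\nfor (int i = 0; i < n; i++) total += i;"))
```

Note that none of this touches the program's structure, which is exactly why AST-based features survive it.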
And we see that everything is refactored, but the functionality and the structure of the program remain the same. And as long as the structure is the same, our features are not affected by this obfuscation. As a result, we saw that when we tried authorship attribution on obfuscated code versus original code with 25 authors, we got about 97 percent accuracy on both of them. So our method is impervious to such common off-the-shelf obfuscators. But this is only for this kind of obfuscator, which does not change structure or functionality.

Another case is copyright investigation. I would like to give a copyleft example here. Copyleft software is free, but it still has a license. You can modify it, and you can use it, but you have to make sure that you still include the copyleft license that it came with. And in this example, we would like to detect that a programmer took copyleft code and then made it copyrighted.

There was a very famous case in Northern California a few years ago: Jacobsen versus Katzer. Jacobsen had a Java model railroad interface, JMRI, and put an Artistic License on it, and the Artistic License is less restrictive than a copyleft license. After that, Katzer, who is also interested in railroad models and works as a software developer in that business, took this code, put a copyright on it, and started distributing it commercially. And he also filed a patent using Jacobsen's code. After that, this went to court.
And some people claimed that, since this is just an Artistic License, he can do whatever he wants with it, because it's free code. But that was not the case. Even if it's an Artistic License, you still have to make sure that when you modify it, it still carries the Artistic License, and everyone else can use it the same way the first person intended it to be used.

This can be set up as a two-class machine learning problem. In the first class, we have the copyleft code from Jacobsen, and in the second class, we have the copyrighted code, and we compare them to each other to see if any code was taken from the other one. In this case, we had 20 pairs of authors, which means that we had 40 authors, each with nine files, and we tried to identify their files correctly, and we had 99 percent accuracy in identifying these.

In the fourth case, we look at authorship verification. Here we are trying to find out if this person who claims to have written the code is the real programmer, or whether it was written by someone else. This is a two-class problem, but it's not exactly two-class, because the first class is only Mallory. Mallory claims to have written the test code, and we train on Mallory as the first class. We also train on a second class that's a combination of several other authors, and all of these are solutions to the same problem, each one corresponding to the same problem from different authors. Once we train on these two classes, we have the code that Mallory claims to have written, and we have code from a bunch of other random authors. In this task, we reach 93 percent accuracy across 80 different experimental setups; that means hundreds of different users and thousands of different files.
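A minimal sketch of this verification setup, with character n-grams standing in for the real feature set and toy code snippets standing in for the Code Jam data:

```python
# Sketch of the verification setup: class 1 is only Mallory; class 2
# mixes several other authors' solutions to the same problem.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

mallory = ["for(int i=0;i<n;i++){s+=a[i];}", "for(int j=0;j<m;j++){t+=b[j];}"]
others = ["while (k < n) { s = s + a[k]; k++; }",
          "int r = 0;\nwhile (p < m) { r += b[p]; p++; }"]

# Character n-grams as a crude stand-in for the full feature set.
vec = CountVectorizer(analyzer="char", ngram_range=(2, 3))
X = vec.fit_transform(mallory + others)
y = ["mallory"] * len(mallory) + ["other"] * len(others)

clf = RandomForestClassifier(n_estimators=300).fit(X, y)

claimed = "for(int i=0;i<n;i++){q+=c[i];}"  # code Mallory claims is hers
print(clf.predict(vec.transform([claimed])))  # 'mallory' or 'other'
```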
We also wanted to see if programming style is consistent throughout the years, because if it is, then when we're constructing our datasets, we can mix and match from different years. We found the contestants that competed in both 2012 and 2014, and here is an example, a random example, of their code: the same person in 2012 and 2014. The layout features look extremely similar, the structure is very similar, the for loop comes at the same depth, and we see that the lexical features, such as the variable names, are very similar, except that in 2014 they decided to capitalize the "t".

And as a result, we were able to identify 25 authors that competed in both 2012 and 2014 with 88 percent accuracy. The 88 percent might seem low to you after hearing the previous results with 99 or 93. But in this case, when we took these 25 authors just within 2012, we were able to identify them with 92 percent accuracy. So it's just a four percent drop in accuracy, which shows that coding style is, to some degree, persistent throughout the years.

We also wanted to gain some insights about coding style, so we wanted to see how people implement difficult versus easier functionality. We took a set of 62 authors who were able to answer 14 questions, and we took the seven easy problems and the seven more difficult problems. And we saw that these authors' programming style was more unique when they were implementing harder functionality: with a five percent increase in accuracy, we were able to identify them with 95 percent accuracy.
We also wanted to see the differences between an advanced programmer and a programmer with a smaller skill set, and how this is reflected in their coding style. And we saw that advanced programmers had a much more unique coding style compared to coders with a smaller skill set; the difference here is 15 percent, and this shows a large and very significant difference in coding style.

In the future, source code authorship attribution can be applied to many different areas. For example, we can use this to find the programmers of malicious code. We can look at open source repositories and find anonymous people who are contributing malicious code, and try to identify them by comparing them to other, good contributors. Or we can identify the styles of coders who have a vulnerable style by looking at the number of bugs they have on GitHub or other things. Companies might use this too: let's say they're interested in a particular coding style; they can train on it, and after that they can search for it on GitHub to recruit employees directly from GitHub.

And when we compare our work to previous work, we see a huge increase in accuracy, even though our dataset is larger in magnitude compared to theirs. The last two lines are our results, with 95 percent accuracy and 250 authors. This shows that our method, with the syntactic features, is doing a lot better, and the previous methods did not use any syntactic feature sets.

I would also like to thank our collaborators:
Dr. Richard Harang from the United States Army Research Laboratory, Dr. Clare Voss, also from the United States Army Research Laboratory, Andrew Liu from the University of Maryland, Dr. Arvind Narayanan from Princeton University, and also Fabian Yamaguchi from the University of Göttingen.

I talked about one particular domain, which was source code. Now Becca is going to talk about other domains and cross-domain stylometry.

Thanks. All right, so as you just saw from Aylin's presentation, we're really good at this. We are very good at this in a lot of domains as well. The ones I have up here, for example: source code, of course, but also anything you really put on the Internet, we've looked at as a community. So we have emails and chat messages, and even things you don't put on the Internet, like books or historical documents, have been studied. And in a few slides, you'll see just how good we are at these types of things.

This is Rahm Emanuel, and this is his Twitter feed. Rahm Emanuel is an American politician; he's currently the mayor of Chicago. And while he was running for office, a rogue Twitter feed was developed to imitate his Twitter feed. This is not Rahm Emanuel's Twitter feed. This Twitter feed was written instead by a man named Dan Sinker, and this is a really good example of why we would need to use stylometry in the real world. If we have Twitter feeds, we can test on Twitter feeds, and we do really well. The problem that I'm going to discuss today arises if Dan Sinker here didn't have a Twitter feed to compare it to. He is a writer, so he has a lot of writing. So if we didn't have a Twitter feed, what we could hopefully do instead was take a number of suspect authors.
And during the campaign he was actually named as one of the possible suspects. So we would have some data on a list of suspects, and even if their data weren't all Twitter feeds, if some of them were blog posts or articles they'd written, hopefully we'd still be able to identify the author of the rogue Twitter feed.

So my main problem here is domain adaptation in stylometry: we're given sample text in some domain, and we're trying to identify the author of some other document, which is in a distinct domain.

Some of the features that we use for this analysis are up here. First is bag of words. Bag of words is really popular, not just in stylometry but in natural language processing in general. These are, for example, how many times you use the word "the", how many times you use the word "computer", et cetera. Another popular one in stylometry and natural language processing is character or word n-grams, and there's an example of character bigrams underneath. Other specific features are function words, or stop words, which are non-content words, basically words that don't mean anything on their own, like "for", "to", "the"; and part-of-speech tags. Part-of-speech n-grams are also important and not context specific.

It's also popular to combine a bunch of features into one of what we call a feature set. Here's a popular one that works well within a bunch of domains, known as Writeprints. You can see it's broken up into lexical, syntactic, and content features, and at the bottom of the screen should be misspellings as well, but you can add other features on top of that. (A minimal sketch of this kind of feature extraction follows below.)
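Editorial aside: here is a minimal, self-contained sketch of the feature types just listed, combining bag-of-words, character-bigram, and function-word features and feeding them to a linear support vector machine, the kind of classifier mentioned later in the talk. The toy corpus, the tiny FUNCTION_WORDS list, and all names are invented for the example; the talk's actual tooling is JStylo with the Writeprints feature set, not this code.

```python
# Minimal sketch (not the actual JStylo/Writeprints implementation): extract
# bag-of-words, character-bigram, and function-word features, then train a
# linear SVM to attribute authorship.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus of (text, author) pairs; real experiments want
# something closer to 500 words per test document, as the talk notes.
docs = [
    ("the fork is to the left of the plate", "alice"),
    ("although i would say this couch is comfy", "alice"),
    ("the fork is at the plate's left side", "bob"),
    ("though the sofa is fine, i sit elsewhere", "bob"),
]
texts, authors = zip(*docs)

# Tiny stand-in for a real function-word list (real lists run to hundreds).
FUNCTION_WORDS = ["the", "of", "to", "at", "is", "i", "although", "though"]

features = FeatureUnion([
    # Bag of words: how many times each word is used.
    ("bag_of_words", CountVectorizer()),
    # Character bigrams, counted within word boundaries.
    ("char_bigrams", CountVectorizer(analyzer="char_wb", ngram_range=(2, 2))),
    # Function words only: non-content words that carry style, not topic.
    ("function_words", CountVectorizer(vocabulary=FUNCTION_WORDS)),
])

model = Pipeline([("features", features), ("svm", LinearSVC())])
model.fit(list(texts), list(authors))
print(model.predict(["though the fork sits to the left"]))
```

The bag-of-words component is exactly the part that stops transferring when an author switches domains and topics, which is why the discussion below leans on the non-content features.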
When we're looking at domain adaptation specifically, where we're talking about different domains, the different types of places where people are writing things, it's important to look at non-content features, because if you're writing in different places, you're probably writing about different things. The ones on the screen here are some examples of those.

The ones that have been studied most extensively in this context are function words, that is, stop words. You can see the accuracies are pretty good with these words. The first example up there, with eighty-one percent accuracy, had eight people write different texts in different genres. They were asked, for example, to recreate the story of Little Red Riding Hood and then asked to write an essay on something else, and these were compared. This isn't exactly "domain" in the way we're discussing it today; that's more genre or topic. Similarly, books were analyzed in the second grouping up here and were divided by genre and topic as well, and again function words were used.

So I said we're really good at this, and we are. You can see that within a bunch of domains, for emails, we get eighty-six percent accuracy. The bottom two lines are my own work, getting ninety-eight percent accuracy, almost ninety-nine, with Twitter feeds, and using blog entries we get about ninety-three percent accuracy. So we do pretty well. The lower accuracies for chat messages and Java forum comments are because they use a smaller amount of text for the testing document; as Rachel mentioned at the beginning, you want something closer to five hundred words for your testing documents.

This is a tweet on your left and a blog on your right.
These are from our data set and were written by the same person. The tweet has about, I don't know, three real words in it that aren't misspelled or replaced with something else, but you can see the blog on the other side of the screen is very well constructed: it has correct punctuation, and there aren't replacements for short words; we don't see any of that. So you can really see the challenge here in trying to identify the author of this tweet, or a group of tweets that look like this, from a blog that looks like that. And that's really our challenge.

As for the data we collected for this project, we collected five hundred users with both tweets and blogs, and then thirty-eight Reddit users who also had Twitter feeds. We collected the Twitter-and-blog users by simply querying Twitter for the phrase "wordpress.com", and we were able to collect tons and tons of data linking those two kinds of accounts. And then for the Reddit comments, there's a subreddit called /r/Twitter where people post their Twitter handles in order to gain more followers, and so that was a very easy way to link accounts. However, there wasn't as much data in there, so we were only able to get about thirty-eight users for that data set. But it works well to confirm that our methods work across different domains and not simply for blogs.

Possible solutions for this problem: the first is looking at Writeprints, which I showed in the beginning; kind of throw as many features at it as you can and hope it works. The second is: what if we were very careful about which features we selected, and instead only picked features, for example, that aren't context specific?
So we look at function words, as others have in the past. And the final one is that we look at our own method, called Doppelgänger Finder, which I'll get to later.

These are the in-domain results for blogs, then tweets, then Reddit comments, then tweets again; we have two different Twitter data sets because one was collected with the blogs and the other with the Reddit comments. And you can see that we do really, really well. The purple bars are function words, and they don't do quite as well as Writeprints, which are the bluish bars on the screen. But in general we're doing pretty well with this.

The green bars are the cross-domain results, so you can see that there's a huge gap in accuracy: if we're training on blogs and then testing on Twitter feeds, or training on Reddit comments and testing on Twitter feeds, we do very poorly, and these results are unacceptable, using the first two methods, which are Writeprints and the careful feature selection of function words.

So what do we do about it? Doppelgänger Finder is an algorithm that was presented previously; it was created to link user accounts across cybercriminal forums, and it naturally seems like it would work for our problem, because really what we're trying to do is link accounts across the Web. This method works by calculating the probability that each author wrote each other author's documents; then, for each pair of authors, it combines these probabilities, and every pair whose probability is above a certain input threshold is considered to be the same person, while below it they're considered to be different people. (A minimal sketch of this pairwise scheme appears below.)
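Editorial aside: as a rough, hypothetical illustration of the scheme just described, the sketch below trains a classifier with each candidate account held out in turn and combines the two directed probabilities for every pair. The classifier choice, the hold-out scheme, the averaging, and the toy accounts dictionary are all assumptions made for this sketch; the actual Doppelgänger Finder implementation is the one linked on the slide.

```python
# Minimal sketch of the pairwise-probability idea behind Doppelganger Finder
# (an illustration only; the real implementation is the one linked on GitHub).
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical input: account name -> documents written under that account.
accounts = {
    "blog_A":    ["first blog post text here", "second blog post text here"],
    "twitter_B": ["short tweet number one", "short tweet number two"],
    "twitter_C": ["someone else entirely tweeting", "more unrelated tweets"],
}
names = list(accounts)

def mean_prob(source, target):
    """P(target | source's documents), with source held out of training."""
    train_texts = [d for n in names if n != source for d in accounts[n]]
    train_labels = [n for n in names if n != source for _ in accounts[n]]
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(train_texts), train_labels)
    probs = clf.predict_proba(vec.transform(accounts[source]))
    return probs[:, list(clf.classes_).index(target)].mean()

THRESHOLD = 0.4  # an input parameter of the algorithm; arbitrary here

for a, b in combinations(names, 2):
    # Combine both directions: a's docs scored as b, and b's docs scored as a.
    score = (mean_prob(a, b) + mean_prob(b, a)) / 2
    verdict = "possibly same author" if score > THRESHOLD else "different"
    print(f"{a} <-> {b}: {score:.2f} ({verdict})")
```

In the augmented variant described next, pairs from the same site are skipped entirely, and each account is simply linked to its highest-scoring candidate on the other site, which removes the need for a threshold.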
For example, we have some authors, author A through author I, and we find the probability that author A wrote author E's documents and that author E wrote author A's documents. We do this for all of them, and then for whichever pair probabilities are above a certain threshold, we say they're the same author, and if they're below, we say they're distinct. This code can be found on GitHub; the link is at the bottom of the screen, and it also appears at the end of the presentation if you miss it.

We were actually able to augment this Doppelgänger Finder algorithm to work better in the domain adaptation case as well. Here we had to compare A to E, F, G, all of them; over here, we don't have to compare A to B, C, and D, because they're all in the same place, let's say Twitter. If they're all on Twitter, they're all tweets, so we know they're not written by the same people; they're distinct. So we get a little bit of an advantage here in the algorithm. And also, we don't have to use a threshold, which is definitely a huge advantage: we just take the highest of all the probabilities, because we know that the accounts are somehow linked. In the open-world case, which is one where you don't know the suspect set, so you're not sure it's one of these people, or you're not sure that there's a perfect one-to-one pairing between the two, you'd have to threshold it, and you'd have that same issue again.

Here are the cross-domain results for the blog and Twitter data set. The green bars at the bottom are the same green bars from the earlier slide, so you can see we do very terribly in cross-domain using those methods.
And then the blue bars are the in-domain results, and the bold red line shows the results using our augmented Doppelgänger Finder. So you can see we were able to recover the accuracy to almost as high as some of the in-domain accuracies.

As for the limitations of Doppelgänger Finder: first of all, you need a lot of text, in the training documents and in the testing documents; maybe you would need even more than 500 words of testing documents to make this really work. Additionally, it's made for one specific case, which is account linking, and it may not work for cases more specific than that.

The next question that naturally arises is: what if I'm trying to identify the author of a Twitter feed and I have a bunch of blog data, but I also have some Twitter data? In one of these limited cases, should I use the Twitter data, or should I try to use domain adaptation with the blog data? And the answer really is: if you have Twitter data, you should use the Twitter data. You can see it at the first point there on the screen. That 10 percent means that 10 percent of the training data is Twitter data and the rest are blogs, and this is just using Writeprints with support vector machines for the machine learning. You can see that we get a huge jump in accuracy from having no tweets to having some tweets. So if you have any Twitter data, use it. This is mirrored as well in other domain adaptation methods in natural language processing.

The open problems left in domain adaptation: the first is looking at other domain adaptation solutions, probably from other natural language processing problems like sentiment classification; and also looking at how topic affects style.
So if you are a blogger and you have a Twitter feed, they're probably written on the same thing; but if you are a Redditor and you have a Twitter feed, they're probably written on different things, so how does that affect it? Or even if you have a Reddit account and you write in different subreddits on different topics, can we still identify you if you're not writing about the same thing? Another thing to look at would be other domains. And finally: is it possible for us to change how a document feels, or what its actual content looks like, to make it feel more like the other domain? For example, we had that tweet up there that had barely any real words in it. What if we were able to make it look a little more like plain text and make it look like that blog? Is that changing it too much, or would that work? That's definitely a huge open question that's not very easy to answer right now.

So, anonymity is really hard. Trying to make yourself anonymous, even through a lot of these methods, is difficult, and it's really not only about what you're writing but also about how you write it. So even if you're doing things like monitoring the content of what you write to make sure it can't be traced back to you, or hiding your location through things like Tor, we can probably still identify you through your writing style alone.

So while stylometry can combat online abuses, it's also a huge anonymity threat, and we're surprisingly good at identifying authors across many domains, not just within them. But not all is lost. What can we do about it? Our lab is developing a tool called Anonymouth.
This piece of software helps you anonymize your text as you write it, and it uses JStylo in the background to monitor that you no longer look like the same author. This is definitely a work in progress; it could use a lot of work, analysis, and feedback. So if anyone's interested in playing with it, contributing to it, or helping with it, the GitHub link is at the bottom, and you can contact us with anything else. So thank you all for listening to all three of us.

Special thanks to my contributors Travis Dutko and Sadia Afroz, and we'll take any questions.

Well, thank you very much. And now we have about 20 minutes for questions. Feel free to line up at the microphones, and we'll start with number three.

Thanks for the talk. I've got a question about the cross-domain research. I was wondering if you ever tried to enrich your feature sets with metadata, like activity patterns or links used or something like that.

So, am I on? Can anyone hear me? We've done a little bit. We looked at Twitter specifically, because there's just so much metadata associated with Twitter, and we found that we could improve our Twitter results a little bit. But in the cross-domain case it doesn't particularly help. And our Twitter results are already at ninety-eight point nine, so any improvement there isn't really much of an improvement.

Do you have any idea why that is? Because my expectation would be that it's a very good fingerprint of someone, like at what time that person is writing something, or how many links are in the text, or something like that.
So, we didn't collect any data for when things were posted for the blogs, so we haven't done an analysis with that. When we looked at the metadata, we're talking about hashtags, tags, and links. The hashtags and the tags don't really translate over to blogs, and as far as links go, I just don't think there's enough similarity between them to get any real improvement out of it.

Thank you.

Number four, please.

Is Anonymouth limited to English, or is it independent of the natural language you choose?

So, I think the current implementation is limited to English, but it wouldn't take a lot of work to extend it to, say, German in particular, because we have the analysis back end in German. So it would just be a question of adding a couple of tweaks to the interface. To get extensions to further languages, what you basically need to do is augment the analysis engine to have function words and a part-of-speech tagger for that language. Now, it may be a little more difficult for, say, Asian languages that require segmentation; you would need a segmentation engine for those. But other than that, it shouldn't be that hard. Yeah, like I said, you could already use the analysis for some languages, but the front end of Anonymouth doesn't do that currently. There is an abstraction layer in the code, and there is an API. Yeah.

Thank you.

Well, thanks for the talk. I think it's a fascinating subject.
In the first half of the lecture you were talking about source code analysis, et cetera. I'm trying to understand: since one of your results is that the best features have nothing to do with the surface text of the code, you know, indentation and stuff like that, but with the structure of the program itself, why are you limited to source code analysis? I mean, you could dump a binary into IDA Pro and get a flow graph of the program you can analyze.

So, yeah, you're right. This was the first time a syntactic feature set was tried, because it hadn't been tried before, and we first wanted to see that our intuition was really correct and that this would get us somewhere. Right now, with C++, we saw that this works very well, and as long as we have a parser to get the structure of a program, or of anything, this would be very helpful. So we're willing to extend our work to different languages; we have a lot of other things left to do in the future.

Yeah, and we would like to get to the binary case; that is next on the agenda. But we wanted to confirm this first. The nice thing now is that we can compile these programs and then directly compare the accuracy we get from the source code to the accuracy from the binary, so we can see what the difference is. And as I guess you know, it's a very realistic problem, like in malware research in general.

Thanks. Thanks.

So, you've said that you used code from Google Code Jam to analyze, to check how your approach works. Did you strip out macros?
Because I know that people in such programming contests use quite a lot of macros that they add to their everyday template, because it makes it easier to program later; it can be about 20 lines of such macros. Did you strip those out? Because if you didn't, you might not even be comparing whether it is code by the same author, but whether it is literally the same code.

We looked at the macros. We had a dedicated feature just for macros, and also our AST parser works on a function-by-function basis, so most of the time macros were excluded from the structural information; we kept that separately. But we tried to find out if there are any similarities like that in the code, and we didn't see too many. If we investigate further, just for that specific thing, we might find more similarities. So that's a very good point; I'll check that.

OK, and the second question is: you found that for more advanced problems, the accuracy of attributing code is much higher. Could that be an artifact, because there were fewer solutions for more advanced problems, and because of that, fewer authors? I mean, there were more authors writing solutions for easier problems and fewer authors writing solutions for harder problems, and because there are fewer authors, the accuracy should be higher, right?

The data set sizes were always kept the same so that we can compare the results. So we grouped the problems into two groups, into hard problems and easy problems, and maintained the same size, yes.
And it was a completely random selection from, like, hundreds or thousands of users, to make sure that it represents a real-world scenario.

Thank you for your answers. Interesting. Thanks.

Number two, please.

Um, there is a saying, "lost in translation". Have you tried taking a passage and passing it through Google Translate or some other translation program a few times, and seeing whether, when it comes back, it's still recognizable as being from the original author, perhaps with some corrections to spelling and grammar?

Yeah, a few years ago I had a project on that, where we would take the writing, translate it to German, translate it to Japanese, and translate it back, and we would do that with several different translators, such as Google, Bing, and Language Weaver, and a few others. And we saw that, in most of the cases, depending on the quality of the translator and on that particular language, we were able to identify those people with very good accuracy. But again, the quality of the translator on a particular language has a very big effect here; we were able to observe that.

I mean, the longer the path that you translate through, like if you go through 12 different intermediate languages, the text is going to get almost unrecognizable at the end. Now, if someone was trying to subvert a system like this, could they just do that, end up with the final product after 10 or 20 translations, and then just make simple spelling corrections and simple grammatical or phrasing corrections?
What sort of length did you test this on? How many translations did you run the passage through before bringing it back to the original language?

So, ours in total was three: it was German and Japanese, then back to English, so maybe two in the middle. But a recent paper showed what happens if you do many translations, I think up to 20. And the more translations you did, the more unidentifiable the author became; but at the same time, the text lost its semantics, so there was not much context or meaning left in the text.

One thing that we have experimented with in the Anonymouth program is using this on a sentence-by-sentence basis: translate a sentence to a whole bunch of different languages and back, just one way, but then rank the translations that were produced from the ones that had the most anonymity to the least anonymity, and put the best ones at the top. Then the person can look at them, find ones that have more anonymity but are still close to the meaning, and bring those back in. So that was one thing we experimented with.

OK, thank you.

I was wondering if your work is one-way, and by that I mean: how far away are you from producing a quote-unquote genuine letter from Angela Merkel, or a long-lost play from Shakespeare, with all the information you have?

So, text generation is much, much harder than text analysis. It's sort of, I would argue, an NLP-complete problem. So I don't think we're very close. What we would be able to do is probably help somebody create a letter that was imitating that style. You could have it be sort of a collaboration between the analysis engine and a person.
And that would probably work quite well. But to do it automatically would be much harder.

So it could be used to aid impersonation as well?

Yes.

Um, I have two questions, actually. My first one is: will there be something like an Anonymouth for source code, actually?

Yeah, the analysis code is actually available on GitHub, but I didn't do the licensing yet. If you want to play with it, you can play with it, and I'll fix the documentation and the licensing information as soon as possible.

What she means is her analysis code; we have not written an anonymizer for source code yet. We don't have an anonymizer for source code, which would be called an obfuscator. Maybe you could adapt Anonymouth to work, probably OK to some extent, if you're trying to anonymize source code.

Yeah, but the suggestions would be bad.

Yes.

OK, um, and the second one is: when you try to compare different code, did you also try to compare between different programming languages, or did you always compare source code in the same language?

We always looked at C++ in this case, because our AST parser was for C++.

Yeah, sure. But do you think it's possible to also find the similarities between different source codes from different languages?

Well, since each programming language has a structure and its own grammar, this should be possible as long as you have the parser, so it can be extended to other languages in the same manner.

OK, yeah.

And the cross-domain case might be tricky; we'd have to do some experiments.
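Editorial aside: the answer above comes down to "the syntactic features only need a parser for the language at hand." As a rough, hypothetical illustration, the sketch below uses Python's built-in ast module on Python source as a stand-in for the C++ parser the actual system relies on; the node-type counts are merely the flavor of syntactic feature being discussed, not the paper's exact feature set.

```python
# Minimal sketch: count abstract-syntax-tree node types and parent->child
# node-type bigrams as authorship features. Python's ast module stands in
# for the C++ parser used in the actual work; the idea transfers to any
# language for which a parser exists.
import ast
from collections import Counter

def ast_node_features(source: str) -> Counter:
    """Return counts of AST node types and parent>child node-type pairs."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        counts[type(node).__name__] += 1
        for child in ast.iter_child_nodes(node):
            counts[f"{type(node).__name__}>{type(child).__name__}"] += 1
    return counts

# Two hypothetical snippets that do the same thing in different styles.
snippet_a = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s += x\n"
    "    return s\n"
)
snippet_b = "def total(xs):\n    return sum(x for x in xs)\n"

print(ast_node_features(snippet_a).most_common(5))
print(ast_node_features(snippet_b).most_common(5))
```

Both snippets compute the same sum, but their node-type distributions differ (Assign, For, and AugAssign nodes versus a GeneratorExp inside a Call), which is the intuition behind attributing authors by program structure rather than by layout.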
I actually wanted to ask the question of my predecessor, so I just want to say thank you for a great presentation.

Thanks.

OK, I have so many questions from IRC. OK, if I might ask: the first one was about the case of Katzer versus Jacobsen, where you had this comparison between the free code and the copyrighted code. They wanted to know how you got the source code of the copyrighted code, or whether it was open source code, or which license it was under.

No, we didn't compare it, because it's copyrighted code; we didn't try to access it, because it's not publicly available. It was just an example.

Thank you.

Number four, please.

OK, first of all, thanks for a very interesting talk, and also thanks for doing work on this Anonymouth solution, because it would be more concerning if this research were only being applied to reduce people's privacy, which in some countries can end quite badly. My question is: if people use this tool to make their language less identifiable, can they then be identified as having used that tool, with high confidence? Does it leave a signature if you use Anonymouth? And what's the size of the set of people that use it? Because you're only as anonymous as the number of people using that tool, if it's identifiable.

So, I don't know how many people use Anonymouth; probably not a huge amount, because if you actually try using it, it's kind of difficult. And I don't know if using Anonymouth itself would really create a signature.
My guess is it probably would, given the way these tools tend to work. The experiment that we did do was looking at people whom we just told to imitate someone else's style, or just to try to hide their style, without Anonymouth. And we were able to create a classifier that could distinguish people who had tried to do that from people who hadn't, without necessarily being able to identify the original author.

It just seems to me that if the stakes were high, given the amount of safety you'd get from using this, it would be difficult to calculate when it's safer to start using this tool and obfuscating versus just saying less. It would be nice to have more analysis, so that people can make that decision on an informed basis.

I agree that it would be nice.

OK, number one, please.

Hi. Yes. Many, many development houses and coding houses use style guides, and they're pretty strict about it. You'll run things like RuboCop and similar tools, and they'll say remove spaces and, you know, use single quotes instead of double quotes. Have you taken that into account?

Um, first of all, we thought that people have to implement the functionality in a limited time, so they use the things that they would naturally use; they express their style, because they are limited in time. On the other hand, if you consider that their style has to follow a certain format, that would make everyone more similar, and in that scenario the machine learning problem would become even more difficult, with everyone following a certain style guide.
But there is no way for us to tell, because we don't have ground truth information from these contestants about how they were implementing the functionality at the time of the competition.

What we can say is that it really depends on the style guide, because we know the features that we use. So, in the obvious case, if the style guide really only talks about spacing and layout and variable names and stuff like that, and doesn't affect the deeper structure of the code, like the nesting depths and things like that, then it wouldn't really be relevant. But if it does affect that, then it would be. So it probably depends on the specific style guide itself. But we don't have any data to suggest either way.

It's just that in development, you do a pull request and someone criticizes your code, and you have to change it to make it look like everyone else's. So I was wondering if you could pick out, from three thousand developers here, who actually wrote that code, that sort of thing.

OK.

Hello. Yeah, I think we are going to take the last three questions and then wrap it up. OK, so number two, please. And when we're done, we'll go to the cafeteria area and sit down at a table there, and if people want to ask more questions, they can.

OK. OK, so the next question from IRC is: what about multiple authors, as in open source projects? What happens to the protection of the author in such a case?

OK, so we haven't done anything with source code on this yet, because that's, I think, a difficult problem that we just haven't looked at.
OK, so the next question, from the internet, is: what about multiple authors, as in open source projects? What happens to the protection of the author in such a case?

OK, so we haven't done anything with source code on this yet, because that's, I think, a difficult problem that we just haven't looked at. We are currently looking at documents that are written by multiple authors, which is a similar problem, and we're getting preliminary results. So look forward to that; the results are still early, though.

OK, number two.

My question is quite similar: is it possible to detect whether a text was written by one person or by several people?

I think that's definitely part of the same problem; it may be a first step to solving it. Yeah, it's something we're actively working on, but we don't have any results yet.

Does this also work across domains and across languages? For example, if I'm on one mailing list in German and on one forum in English, would you be able to match these accounts by stylistic features that are independent of the language I'm posting in?

You can use language-independent features, or you can try translating the writing and then doing authorship attribution with an English feature set, and look at whichever one works better.

Yeah, I think probably the best way to do that would be to translate both of them, run the analysis in each of the individual languages, and see what the results are. That's how I would go about it, because the n-grams and so on will be different for the different languages, so you probably want to translate them.

OK, I think that's that.
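A rough sketch of that translate-then-attribute idea. Here `translate_to_english` is a placeholder standing in for any machine translation system, and character n-grams are one illustrative choice of near-language-independent features, not the speakers' exact method:

```python
# Bring both corpora into one language, then compare the two accounts'
# character n-gram profiles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def translate_to_english(text: str) -> str:
    # Identity placeholder: swap in a real MT system before using this.
    return text

german_account = ["Beispieltext von der deutschen Mailingliste ..."]
english_account = ["Sample text from the English forum ..."]

docs = [translate_to_english(t) for t in german_account] + english_account

# Character 3- to 5-grams capture habits like punctuation and word endings.
profiles = TfidfVectorizer(analyzer="char", ngram_range=(3, 5)).fit_transform(docs)

# High cosine similarity between the profiles is weak evidence that the
# accounts share an author; it is a comparison, not an identification.
print(cosine_similarity(profiles[0], profiles[1])[0, 0])
```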
Thank you so much for coming, and we hope that you'll be back next year.