Is the question serious, or is it meant as a joke? As a riposte to the earlier thread, it is fantastic. The user writes "privacy is dead" and here, on Hacker News, and also on jacquesmattheij.com, you have a thread with a lot of intelligent people trying to figure out the person's identity, and failing. Therefore, the user's original point is disproven simply by starting this new thread. If this was deliberate, then this was genius.
"this thread highlights a fundamental property of a networked life: privacy is dead, there is only identity management."
but then this contradicts the original thesis:
"If harnessed properly, these things can be useful, but it requires a mindset and workflow not entirely dissimilar to those of spies or high-end criminals - controlling information by selective disclosure, identity segmentation, disinformation, anonymization, etc. - not for sinister purposes, mind you, but simply to guard what we traditionally call privacy."
I'd say the current thread offers proof that privacy can be defended. After all, here we have all these smart people, failing to identify the earlier user.
Exactly! That's the whole idea here, I was quite surprised that my first solution (with a very high correlation) failed, even more surprised when the second one failed as well (especially since that person had been commenting in the same thread and had a very high correlation as well).
Elsewhere in this thread someone is jumping up and down telling us to stop trying to identify the poster; the funny thing is I think he/she is in no danger at all of being identified, at least not without his/her cooperation.
The only person that could identify this user is PG, and maybe alaskamiller, and I'm pretty sure that our secrets are safe there.
edit: and if that person is the original poster then they're not helping themselves by increasing the sample size :)
However, as I tried pointing out (in a seemingly dead thread: http://news.ycombinator.com/item?id=1200091), identity management is something that only the technically savvy can pull off. And even they are likely to stumble at some point, because being perfect in every way is inhumanly hard. And so, as it stands, privacy is dead in this age.
On /. posting anonymously is as simple as checking a box, even if you have an account.
Plenty of people have done so over the years, checking a box is nothing that only the 'technically savvy' can do.
If you do that rarely then I think your anonymous words are reasonably safe. If you do it regularly then you are open to the kind of attack that I attempted, and then it will have a better chance of success.
Privacy is dead in a general sense, companies like facebook, google and twitter facilitate identified communication and in that sense every letter you wrote using the old postal system was just as revealing, it just wasn't open to be read by the public.
People are slowly coming around about all this stuff being visible online. I can see that with the 'reocities' project, on average two people every day ask for their old account to be wiped because of privacy reasons. That's not much, but it still means that 1,000 non-technical users per year that I happen to have backed up a few pages for realize this. So if you extrapolate that to the internet at large I think that the number of users that are wising up to this is much larger than you'd expect at first glance.
Time will tell if there will be enough support for this, the 'think of the children' and 'war on terror' people seem to have the advantage for now, but laws that are enacted can in due course be repealed.
I've never bothered to hide my identity, there is nothing that I have to say that I wouldn't put my name to, even if not all of it is received equally well, that doesn't bother me (maybe it should).
There are people in positions that are sensitive that have stuff to tell us, in such cases (which are rare) anonymity really serves a purpose and I think this little experiment shows that without at least access to some log files these exercises get a lot harder.
"I'd say the current thread offers proof that privacy can be defended. After all, here we have all these smart people, failing to identify the earlier user."
Imagine the hassle of creating brand new identities for every action taken online. If privacy isn't dead, then it sure as hell is tough to maintain (and to do so would not be very practical).
I'm not sure I'll have the time to do this, but I've had some good results running Latent Semantic Analysis and Latent Dirichlet Allocation on a similar problem. In my case, I have data from people playing a negotiation game and having a conversation with a human actor. I have scores from a human judge ranging from 1 to 5. Using LDA on the transcriptions of the dialog I can predict the results of the human judge to a correlation of .5. There was a previous study with essays graded by a teacher that got .8 with LSA. The LSA study used a much larger training corpus outside the individuals.
For slightly more details, here's a sketch of the algorithm:
Treat each comment as a "document" input to LDA. Use the theta matrix that represents the distribution of topics over each document. Then use the inverse dot product between two document theta vectors and perform k Nearest Neighbors to predict IDs. You should be able to tune the rank and k values from all the labelled data.
When it comes time to infer, I suggest running the whole set through LDA instead of reusing the discovered alpha and beta. For some reason (which I'm not entirely sure of), my results seem much better that way.
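The nearest-neighbour step of the sketch above might look roughly like this. It assumes the theta vectors have already been produced by some LDA implementation (not shown here); `predict_author`, the toy vectors and the author labels are all hypothetical, and "closest" is taken to mean "largest dot product", which is one reading of the inverse dot product as a distance.

```python
from collections import Counter

def dot(a, b):
    """Plain dot product of two equal-length topic vectors."""
    return sum(x * y for x, y in zip(a, b))

def predict_author(query_theta, labelled, k=3):
    """Vote among the k labelled comments whose theta is most similar.

    labelled: list of (theta_vector, author_id) pairs (hypothetical data).
    """
    ranked = sorted(labelled, key=lambda item: dot(query_theta, item[0]),
                    reverse=True)
    votes = Counter(author for _, author in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy topic distributions over three topics; real ones would come from LDA.
labelled = [
    ([0.8, 0.1, 0.1], "alice"),
    ([0.7, 0.2, 0.1], "alice"),
    ([0.1, 0.8, 0.1], "bob"),
    ([0.2, 0.7, 0.1], "bob"),
    ([0.1, 0.1, 0.8], "carol"),
]
guess = predict_author([0.75, 0.15, 0.10], labelled, k=3)
```

The number of topics and `k` are the two knobs that would be tuned on the labelled comments, for example by holding out a few comments per user.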
Simple psychology would lead me to guess that it is you Jacques. If I wanted to tell how easy it was to identify an anonymous comment, then I'd make one. I'd then publicise it, and challenge other people to crack it.
Looking at how quickly you commented after OTToken and, when he commented, how quickly you responded, I could see why someone would think it was you. Just like the yahoo answers "questions" that are obviously setups because they are answered 1 minute after being asked.
Assuming it's not you, we have another area of comparison that is being overlooked. Besides OTToken's text patterns we also have when the comments were left. So we can throw out certain people that never comment during the hours of the day that OTToken did.
Also OTToken is obviously familiar with HN and has an alternative account by his own implication. Likely he was reading/commenting on HN then decided to make the account. OTToken made multiple comments over two hours so we could also try and pick people whose comments adjoin that timeframe. Specifically people who made comments in advance of OTToken's comments, but not at the same time.
He can switch accounts to make alternative comments, but it's unlikely he was making comments from two separate accounts simultaneously.
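A rough sketch of that timing filter, assuming we had comment timestamps per user (the user names, dates and the 60-second "simultaneous" window below are made up for illustration):

```python
from datetime import datetime, timedelta

def hours_of_day(timestamps):
    """Set of clock hours (0-23) in which these comments were posted."""
    return {t.hour for t in timestamps}

def posted_simultaneously(times_a, times_b, window=timedelta(seconds=60)):
    """True if any comment in times_a falls within `window` of one in times_b."""
    return any(abs(a - b) <= window for a in times_a for b in times_b)

def plausible_candidates(target_times, comments_by_user):
    """Drop users who never post in the target's hours, or who posted
    at (nearly) the same moment as the target account."""
    target_hours = hours_of_day(target_times)
    keep = []
    for user, times in comments_by_user.items():
        if not (hours_of_day(times) & target_hours):
            continue  # never active during those hours of the day
        if posted_simultaneously(times, target_times):
            continue  # unlikely to be typing from two accounts at once
        keep.append(user)
    return keep

# Hypothetical timestamps, all in the same time zone.
target = [datetime(2010, 3, 17, 14, 5), datetime(2010, 3, 17, 15, 40)]
users = {
    "early_bird": [datetime(2010, 3, 10, 6, 0), datetime(2010, 3, 11, 7, 30)],
    "afternoon": [datetime(2010, 3, 16, 14, 50), datetime(2010, 3, 17, 13, 55)],
    "simultaneous": [datetime(2010, 3, 17, 14, 5, 20)],
}
candidates = plausible_candidates(target, users)
```

This only narrows the pool; it says nothing about who among the remaining users actually wrote the comments.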
> Besides OTToken's text patterns we also have when the comments were left. So we can throw out certain people that never comment during the hours of the day that OTToken did.
Also, if you could check my comment history (which you can't because it seems to time out on HNs server) you'd see that my comment speed is usually fairly quick in threads where I'm active.
This is merely a naive guess. He's the only other user on hacker news (according to Google) to use the term "people search engines". He also seems to have been working in the data mining business.
From a quick look at his comments, he seems to match the other heuristics seasoup mentions in that thread: meticulously correct spelling, grammar, and punctuation, use of semicolons, and use of dashes.
He also seems to comment heavily on technical issues - programming languages, database technologies, so it might make sense for him to feel heavy non-tech opinions deserve a onetimetoken. Similarly reasoned, the sentiment of "privacy is dead, there is only identity management" seems to be a realization appropriate for someone who recently started working on YC-funded companies. Seems convincing to me, but since they're all reverse-justifications, probably best to take them with a grain of salt: you might be able to draw similar conclusions combing through many other comment histories.
Interesting that this "human [powered] search engine" style of identification might have been faster than devising a machine heuristic.
Yeah it's almost crowd sourced. Reminds me of when the cops put a letter or riddle from a serial killer in the newspaper figuring someONE out there will recognize it, as opposed to a computer recognizing it. Or maybe that just happens in the movies.
To be incredibly picky: Meticulously correct grammar would join "privacy is dead" and "there is only identity management" with a semicolon, dash, or conjunction -- or just split it into two sentences. ;)
I am concerned with online privacy, but not to the same extent as "onetimetoken" (my FB profile is globally viewable). Also my comments are usually short, and I avoid big generalizations.
Looking at the thread in question, though, I'd definitely guess jgrahamc.
It'd be really interesting if we had challenges, both social and technical, posted here on HN on a weekly basis. Some of the solutions and discussions would be pretty brilliant, I think.
Sounds good. Anyone with a challenge, email me at kyro@kyrobeshay.com with title/text of the submission. I'll post them on a weekly, or even bi-weekly, basis and credit the author.
Edit: The intention behind this was to keep it structured and organized, contest-like, and not for karmic purposes, which I take is the reason for the downvotes.
Also donating prizes would give a different metric than pure karma-per-submission to order the challenges. (Though it might be hard to order bragging rights. But we should be able to find a (corporate?) sponsor who hands out 50 dollars for the charity of choice of the winner every week. (Hey, I might even be able to get the money out of my employer, if I asked--or I just do it myself.))
Enough parentheses. I'll just go ahead and pledge 10 Pounds per week to it. Perhaps we should discuss more by email?
Karma doesn't enter into it. What's the difference between posting a challenge yourself vs mailing someone and having them post it for you and credit you?
Seems a bit roundabout without any real advantage.
Oh, there's something to be said for having an "official" challenge of the week. It focusses attention. Though on the other hand, holding the primaries out in the court of HN may be the best approach to picking the most interesting challenges.
Informal is cool with me. Someone could just state their challenge, and the prize, if any (which need not be money), and if others think the challenge is interesting enough they could paypal the author their contribution to the pot, or they could publicly state that they want to up the stakes (or both).
Also posting bets and searching for someone to take the other side (or be the arbiter--in case any is needed) could be interesting. Similar to http://www.longbets.org/, but embedded into HN and not focussed on long-term bets.
My strategy was to look for unusual words and phrases and do a google site search for those phrases.
Additionally, eru's posts in this thread indicate an interest in privacy, and eru's activity pattern is both frequent and recent, which I would expect to be true for the poster.
1) based on his writing ('a one time account as a rhetorical device') I don't think he'd mind, also there is nothing in the comment itself that you would have to be ashamed of
2) you can't be sure, unless the person will confirm using the original 'one time' account.
They did not issue a challenge to be identified--in fact they agree with the notion that privacy is dead, which seems to be what you're trying to prove with this exercise. They may have serious reasons for using a one time account.
If your name is one of the (very random) guesses in this post, please neither confirm nor deny that the user is you, since this could identify that user by elimination.
This item should not have so many points. The post is rubbish. A 275 word sample is long, but likely insufficient given the pool of candidates. The post did not explain what methods were used, what work in authorship identification influenced his approach, nor did he provide his ranked findings. The tries are actually failed guesses, rather than, say, different algorithms attempted. This item has now devolved into a guessing game, rather than a coding exercise.
> The post did not explain what methods were used,
The post didn't but the original thread did, I tried matching the vocabulary of the samples to the corpus of HN comments.
> what work in authorship identification influenced his approach
This is not a scientific paper.
> nor did he provide his ranked findings.
I'm not giving my ranked results because I think two attempts from me is enough.
> The tries are actually failed guesses, rather than, say, different algorithms attempted.
They were the #1 and #2 outputs of my code.
> This item has now devolved into a guessing game, rather than a coding exercise.
No-one said that you had to guess, but human guesses are also powered by computation at some level, even if it would be very hard to figure out exactly what went on.
> Again, stop trying to identify this user.
If that request were posted by 'onetimetoken', who has posted three times, then it would have some credibility.
If you are not him/her, why does this upset you?
The 'one time account used as a rhetorical device' says fairly clearly that it is just a gimmick, not some kind of terrible secret.
And if you are 'onetimetoken' you are increasing the sample size ;)
I assume this upsets the user because using a one time account indicates a desire not to be identified or associated with the posted content, and the user wants this preference to be honored.
I never use any other username besides tokenadult on the forums where I use the username tokenadult. I like to have one consistent identity wherever I post (real name some places, screen name some other places) and I'm sparing in my use of screen names, and nonexistent in my use of sock-puppets. (I have been tempted a few times, but have thus far always resisted the temptation.) Now I will go look at the comment so I can identify what about it does NOT have my writing style, and then post that in an edit to this reply.
Evidence that the comment does not come from my keyboard:
I would never write a ponderous sentence like "I fully agree with the sentiment that inspires your statement" as the opening sentence of a post.
I wouldn't write "Without even noticing it we are whoring out our privacy and intimate patterns," because I consider "whoring out" a crude expression, too crude for the polite, learned conversation I expect on HN.
The phrase "it requires a mindset and workflow not entirely dissimilar to those of spies" reminds me of George Orwell's "One can cure oneself of the not un- formation by memorizing this sentence: A not unblack dog was chasing a not unsmall rabbit across a not ungreen field." I may occasionally write like that, if I am composing a sentence as I type, but I try not to.
You are correct. I first used the screen name on a forum, and then another forum, where the majority of users are teenagers. The screen name doesn't fit well here on HN (where almost everyone is an adult, even though I am older than most participants), but I like to minimize my use of distinct screen names. However, I am sure by screen name searches that other people now use this same screen name.
Sure, and of course since I was wrong this is correct. But my brain thought that using the term token for a username was sufficiently distinct to maybe be a subtle hint as to the original author (especially given the original context of the comment in question).
I don't think tokenadult would mind posting that comment from his own account. The person that posted this sees his HN-identity as not stating strong opinions.
I just searched for "google-facebook" and "identity management" and saw a blog by the title "Google-Facebook: Identity Management in a Brave New Internet"
But, don't know if Clayton Donley is on HN or not..
His Bio, at Oracle:
Clayton Donley, Sr. Director, Development
Currently run the dev organization for some of Oracle's security and identity management products. Landed here after selling OctetString in 2005. Before that held various roles at IBM, Motorola, and as an independent consultant. Also wrote LDAP Programming in 2001.
I think everything matches here except that Mr. Donley capitalizes "F"acebook. The comment in question has these as lowercase, which has already been mentioned.
Well, you're forgetting that most people, well at least I am, are more careful when writing formal blog posts vs simple HN comments. So, the minor differences can be attributed to that.
For example, the comment has "/" in google/facebook, while the blog has "Google-Facebook".
What made me curious was not just the topic, and the identity management, google-facebook, but the fact that he is in the field of security/identity management.
But as I said, I don't know if he is on HN or not :)
I searched for the rarest words used (rare combinations of words used only once or a few times [ORG]):
(i) The first search[1] revealed "gstar", but although they both have a similar writing style, gstar doesn't have an active participation in privacy discussions (based on the query only [2])
(ii) The second search[3] revealed "astine": now this is interesting because this user has a very active participation on privacy discussions[4], especially I think he was inspired by _why[5].
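The same rare-word idea can be run against a local copy of the comments instead of a search engine. The sketch below is only an illustration under that assumption: the function names, the toy comments and the "used by at most one user" threshold are all hypothetical.

```python
import re
from collections import Counter

def words(text):
    """Lower-cased set of word tokens in a piece of text."""
    return set(re.findall(r"[a-z']+", text.lower()))

def rare_word_matches(sample, comments_by_user, max_users=1):
    """Map each user to the sample's rare words that they also use.

    A word counts as rare if at most `max_users` users in the corpus use it.
    """
    user_vocab = {user: set().union(*(words(c) for c in comments))
                  for user, comments in comments_by_user.items()}
    users_per_word = Counter(w for vocab in user_vocab.values() for w in vocab)
    rare = {w for w in words(sample) if 0 < users_per_word[w] <= max_users}
    return {user: sorted(rare & vocab)
            for user, vocab in user_vocab.items() if rare & vocab}

# Toy corpus; a real run would use every user's full comment history.
sample = "privacy is dead, there is only identity management"
corpus = {
    "u1": ["identity management is hard", "I like python"],
    "u2": ["privacy is a big topic", "python is nice"],
    "u3": ["nothing to see here"],
}
matches = rare_word_matches(sample, corpus)
```

Words that everyone uses (like "is") drop out automatically, which is the point of restricting the match to rare ones.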
It's pretty much impossible to identify a user just by 1 anonymous post on a website (without the logs). I mean, sure, you can compare a person's typing style... but unless they always add "jambalaya" to their posts, it'll be next to impossible to be 100% sure.
The way it works in real life, is that you find a person's email address or a long term account on a forum, and then use that info to build up a full profile about that person. The longer the person is on the web, the more personal information they've revealed in the past.
E.g. 6 months ago they might have mentioned their phone #... so you can use whitepages to see their address. Or maybe they posted a link to their site... where they didn't have privacy enabled, so you can get the full name and address using whois. Or maybe they are using the same username on all sites, so you can use google to see all the forums they've ever posted on. Etc.
Obviously, everyone loves a good challenge, but is there any evidence that the find-ee wants to be found?
They seemed interested in the thread as to how they might be found, but don't seem to have given any permission for a site-wide (wo-)manhunt. (This might have happened out of band, though.) I guess they do say that they were just using a one time account for rhetorical emphasis.
Further, if this were to really be a contest, it seems like there should be some sort of rules, such that the result isn't determined just by exhaustion of currently in use usernames by guessers.
Please read the whole original thread, that's exactly how we got to that point, and the response I got to my 'I bet I can identify you' and his/her admission that they thought of obfuscating the text made it pretty clear they would not mind an attempt, but that does not guarantee that there will be a resolution.
It wasn't me. Nice approach, though. Just intersecting word choices has very little to recommend it for industrial scale author identification but for a small-ish community like HN it might work, and of course it is trivial to implement if you already have the data source lying around.
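For what it's worth, intersecting word choices really is only a few lines. This is a guess at the general shape, not the code actually used for the post; the scoring (fraction of the sample's vocabulary a user shares) and all names and data are assumptions.

```python
import re

def vocabulary(text):
    """Lower-cased set of word tokens in a piece of text."""
    return set(re.findall(r"[a-z']+", text.lower()))

def rank_by_shared_vocabulary(sample, comments_by_user):
    """Rank users by the fraction of the sample's vocabulary they also use."""
    sample_vocab = vocabulary(sample)
    if not sample_vocab:
        return []
    scores = []
    for user, comments in comments_by_user.items():
        user_vocab = vocabulary(" ".join(comments))
        shared = len(sample_vocab & user_vocab)
        scores.append((shared / len(sample_vocab), user))
    return sorted(scores, reverse=True)

# Toy data standing in for the HN comment corpus.
sample = "selective disclosure guards privacy"
corpus = {
    "a": ["privacy needs selective disclosure"],
    "b": ["I prefer tabs"],
}
ranking = rank_by_shared_vocabulary(sample, corpus)
```

Prolific users share more words with any sample simply because they have written more, so a real attempt would probably need some normalisation for history length.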
"What a surprise to find a whole thread and blog post dedicated to the search for my identity.
I consent to a benevolent search for my identity or identities. I was quite surprised to see the speed and scale of this development - another symptom of networked life."
This person has said that we can find out his identity... if at all possible. Therefore, you are welcome to search based on his own permission.
So I wrote a bit of code to compare against other HN comments
Do you have a corpus?
Further, and this is an open question, is there an archive/downloadable corpus of HN in part or entirety anywhere? It would be fascinating and I'd love to keep a copy to look back at in years to come.
I'm a big proponent of fair play and I think the author would identify himself when asked, but at the same time only PG can be sure.
Of course you can be paranoid, but I think the bigger chance is the author seeing a chance here at sowing some disinformation. Such as participating in this thread and giving false pointers and / or confusing the issue.
For the really paranoid, of course the last person to participate in this thread is 'the one'...
First I want to say that I don't agree with publicly disclosing the "identity" of people who don't want to be found. I also don't think doing so "originates" from any good personal quality. That being said, I do remember this [1] talk from last CCC to be interesting from a technical standpoint.
For online messages with such short length, when the full set of features are used, a sample size of about 30 messages per author is necessary to predict authorship with an accuracy of 80~90%
One of the strong indicators is the use of italics in the post. Many users will ignore formatting within their posts. I am confident this user has used italic formatting before for emphasis and has done it often within their HN posts. It also indicates a comfort level within HN which means they have likely posted frequently. (At least once a month)
"high-end criminals" isn't unusual. Certainly not to these British eyes. If you Google for "high end criminals" even without the hyphen, about half of the results use the hyphenated version.
The other things you point out encourage me to share your opinion, however.
The problem is context, a common issue with tracking things with Google Trends in particular. Tracking programming language usage with it, for example, has been a nightmare ("ruby" and "python" having far too many meanings, but few write "ruby programming").
"high end" has more uses than "high-end." For example, "I bought a car at the high end of my budget." In that case, "high-end" wouldn't make sense. In "datacenters have been targeted by high-end criminals," however, "high-end" is a compound adjective.
Alternatively, you could drop the hyphen and/or form an entirely new word: "highend." The word "highend" doesn't seem to have caught on yet, though. I suspect that's because "upmarket" covers the same meaning already and is less susceptible to these morphological mishaps.
(On seeing what OS X had to suggest as a correction for "highend," it suggested both "high end" and "high-end.")
That's a good one, another poster above suggested something similar. I believe that HN'ers would play fair in something like this.
On the other hand that isn't proof of anything, but I think the chances are higher that someone who didn't do it will own up, to throw sand in the eyes of the searchers, than the reverse.
But then again, maybe I'm a sucker and I believe that people in general are honest and trustworthy. So far that seems to me to be a better assumption than the reverse.
chime fits a few of the patterns: use of etc. mid-sentence, occasional use of hyphens - in this very pattern - and moderate use of slashes when "or" would do. Also American spelling.
Still, it's easy to get into a sort of confirmation bias looking at this stuff manually, and seeing things that fit while missing things that don't.
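One way to take some of the confirmation bias out is to count the quirks mechanically for every candidate rather than eyeballing a few. A minimal sketch, where the feature list and regexes are my own rough choices and the sample sentence is made up in the spirit of the quoted comment:

```python
import re

def style_features(text):
    """Count a few punctuation habits mentioned in this thread."""
    return {
        "semicolons": text.count(";"),
        # hyphens used as dashes - like this - with spaces on both sides
        "spaced_hyphens": len(re.findall(r"\s-\s", text)),
        # slashes joining two words; note that URLs are not filtered out
        "slashes": len(re.findall(r"\w/\w", text)),
        # "etc." followed by a comma or by more text in the same sentence
        "etc_mid_sentence": len(re.findall(r"\betc\.(?=,|\s+[a-z-])", text)),
    }

sample = ("controlling information by selective disclosure, anonymization, "
          "etc. - not for sinister purposes; google/facebook know a lot.")
features = style_features(sample)
```

Raw counts favour long comments, so they would need to be divided by text length before comparing users.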
I think that we're all treating this as a game, under the assumption that if we figure out who onetimetoken is he'll tell us, as it supports his point that privacy is dead. Also, it wouldn't be fun if we assumed that he will deny it if identified. Never underestimate the importance of having fun.
i tossed a few of the stylistic quirks into a search and took a look at some writing samples. i think his stuff looks the most similar. i'm going purely on my own arbitrary judgment. it just feels right. :) the method itself isn't much different than what has already been talked about.
edit: oh right, i also looked at the fact that he was posting comments on HN around the time that the one in question was posted. a lot of my other candidates didn't meet that data point.
> i also looked at the fact that he was posting comments on HN around the time that the one in question was posted. a lot of my other candidates didn't meet that data point.
Ah, very clever, another angle of attack. Never thought of that one.
I don't think the reason he wants to find out is because he wants to know. I think he wants to prove that it's possible to find out.
He thought it was long enough that he could pretty trivially have a program compare the writing style to other HN comments and determine who it was, but he failed. So it's a challenge for other hackers--can you write a program that can determine who said something simply based on the writing style and knowing that he's a member of a reasonably small sample (HN users)?
You got it. Sorry for not being more clear, I thought it was an interesting challenge, and since I've used up my 'two guesses' I think it is more appropriate to admit failure rather than to keep on hammering away at it until I hit the right user.
It's not 'worthwhile' in a sense that you can't take it to the bank (though it may come to that, see elsewhere in this thread), but I think it is what hackers do, solve puzzles.
This is effectively a puzzle, a reasonably hard one (my attempt failed, but not for lack of trying, I spent a fair number of hours on it before making my guesses, of course it could simply be that I'm stupid), and one that seems fun to solve.
It is exactly the kind of thing that I enjoy doing when it comes to programming in the first place, figuring out how stuff works and/or solving reasonably hard problems. One step above my current competence is my favorite, that way I'm reasonably sure I can solve the problem, if the difference is too big then I tend to get stuck.
It's of course a bit like the question why people climb Mount Everest, the answers are: because they can and because it's there.
edit: Funny, I thought your downmod for asking a valid question was unfair, in return I get downmodded for answering :)
A typical case of me and my big mouth.
I figured that a basic analysis should reveal who wrote this:
http://news.ycombinator.com/item?id=1197027
Because of the size of the sample. So I wrote a bit of code to compare against other HN comments, and figured that that would turn up the user quickly.
But I was wrong, after two tries (Daniel Markham and John Graham-Cumming) I have to admit that my simple analysis has failed.
So, who will take up the challenge, can you identify this user somehow?
He submitted the Human Flesh story from NYTimes that he mentioned in his comment. http://news.ycombinator.com/item?id=1167615
And he's only submitted 2 stories since starting his account 341 days ago. So the story must have meant a lot to him, 1st to submit it, 2nd to mention it in that comment.
Also, (s)he's only made 18 comments in total, and his last comment was 104 days ago, yet the flesh story was submitted 13 days ago. So he's careful with his comments, and probably careful with his identity and privacy too, so much so that CitizenParker has no bio info. And look at the name CitizenParker (sort of like calling yourself John Smith on a sample credit card): generic naming could be important to CitizenParker, something he's conscious about and will write about, whilst also doing that anonymously.
But mainly, his style seems similar, which was what got me thinking.
http://news.ycombinator.com/threads?id=citizenparker
http://searchyc.com/user/citizenparker?only=comments
Does it really matter?
edit: http://citizenparker.com/ Scott Parker: http://citizenparker.com/page/About-Scott-Parker.aspx