Reproducing Hacker News writing style fingerprinting

antirez.com

324 points by grep_it 3 days ago


mtlynch - 3 days ago

>Well, the first problem I had, in order to do something like that, was to find an archive with Hacker News comments. Luckily there was one with apparently everything posted on HN from the start to 2023, for a huge 10GB of total data.

This is actually super easy. The data is available in BigQuery.[0] It's up to date, too. I tried the following query, and the latest comment was from yesterday.

    SELECT 
      id,
      text,
      `by` AS username,
      FORMAT_TIMESTAMP('%Y-%m-%dT%H:%M:%SZ', TIMESTAMP_SECONDS(time)) AS timestamp
    FROM 
      `bigquery-public-data.hacker_news.full`
    WHERE 
      type = 'comment'
      AND EXTRACT(YEAR FROM TIMESTAMP_SECONDS(time)) = 2025
    ORDER BY 
      time DESC
    LIMIT 
      100

[0] https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1s...

Frieren - 3 days ago

It works for me. The accounts I used a long time ago show up in high positions. I guess that my style is very distinctive.

But I have also seen some accounts that seem to belong to other non-native English speakers. They may even have a Latin language as their native one (I just read some of their comments, and, at minimum, some of them also seem to be from the EU). So I guess that it is also grouping people by their native language, when that language is not English.

So, maybe, it is grouping many accounts by the shared bias of a common native language. Probably we make the same type of mistakes when using English.

My guess is that accounts of native speakers of Indian languages or Chinese will also be grouped together, for the same reason. Even more so, as those languages differ more from English and the bias is probably stronger.

It would be cool if Australians, Brits, and Canadians tried the tool. My guess is that the probability of them finding alt accounts is higher, as the population is smaller and the writing more distinctive than that of Americans.

Thanks for sharing the projects. It is really interesting.

Also, do not trust the comments too much. There is an incentive to lie so as not to acknowledge alt accounts if they were created to remain hidden.

hammock - 3 days ago

The "analyze" feature works pretty well.

My comments underindex on "this" - because I have drilled into my communication style never to use pronouns without clear one-word antecedents, meaning I use "this" less frequently than I would otherwise.

They also underindex on "should" - a word I have drilled OUT of my communication style, since it is judgy and triggers a defensive reaction in others when used. (If required, I prefer "ought to")

My comments also underindex on personal pronouns (I, my). Again, my thought on good, interesting writing is that these are to be avoided.

In case anyone cares.

xnorswap - 3 days ago

I wonder how much accuracy would be improved if expanding from single words to the most common pairs or n-tuples.

You would need more computation to hash, but I bet adding frequency of the top 50 word-pairs and top 20 most common 3-tuples would be a strong signal.

(Not that the accuracy isn't already good, of course. I am indeed user eterm. I think I've said on this account or that one before that I don't sync passwords, so they are simply different machines that I use. I try not to cross-contribute or double-vote.)
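
A minimal sketch of the word-pair and 3-tuple idea, assuming a simple regex tokenizer. The names `tokenize` and `top_ngrams` are hypothetical and are not taken from the tool's code; the cutoffs 50 and 20 are the ones suggested in the comment.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase, keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def top_ngrams(comments, n, k):
    """Return the k most common n-word tuples with their relative frequency."""
    counts = Counter()
    for comment in comments:
        words = tokenize(comment)
        # Slide a window of n words over each comment separately,
        # so an n-gram never spans two comments.
        counts.update(zip(*(words[i:] for i in range(n))))
    total = sum(counts.values())
    return [(gram, c / total) for gram, c in counts.most_common(k)]

# Toy data standing in for one user's comment history.
comments = ["I think that this is fine", "I think that works too"]
pairs = top_ngrams(comments, n=2, k=50)
triples = top_ngrams(comments, n=3, k=20)
```

The resulting frequencies could be appended to the single-word features, though whether that actually improves accuracy would have to be measured.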

jedberg - 3 days ago

Maybe I talk too much on HN. :)

When I ran it, it gave me 20 random users, but when I do the analyze, it says my most common words are [they because then that but their the was them had], which is basically just the most common English words.

Probably would be good to exclude those most common words.
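
One way to do that, sketched under the assumption that per-user and whole-corpus word counts are available as `Counter` objects. The function name `drop_most_common`, the cutoff, and the toy counts are all hypothetical.

```python
from collections import Counter

def drop_most_common(user_counts, corpus_counts, n_exclude):
    """Remove the n_exclude globally most frequent words from a user's counts."""
    too_common = {word for word, _ in corpus_counts.most_common(n_exclude)}
    return Counter({w: c for w, c in user_counts.items() if w not in too_common})

# Toy numbers, not real HN statistics.
corpus = Counter({"the": 1000, "that": 800, "but": 600, "kernel": 5, "redis": 3})
user = Counter({"the": 40, "that": 30, "redis": 7, "kernel": 2})

filtered = drop_most_common(user, corpus, n_exclude=3)
```

With the three most frequent corpus words removed, only the less common words remain in `filtered`.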

nomilk - 2 days ago

For visibility, here's the tool where you can enter your hn username:

https://antirez.com/hnstyle?username=pg&threshold=20&action=...

keepamovin - 2 days ago

This is a great example of what's possible and how true anonymity, even online, is only "technological threshold" anonymity. People obsessed with biometrics might not consider that this is another biometric.

Instead of just HN, now do it with the whole internet, imagine what you'd find. Then imagine that it's not being done already.

paxys - 3 days ago

It did find my "alt" (really an old account with a lost password), but the rest of the list – all users with very high match scores (0.8+) – is random.

Taking a look at comments from those users, I think the issue is that the algorithm focuses too much on the topic of discussion rather than style. If you are often in conversations about LLMs or Musk or self driving cars then you will inevitably end up using a lot of similar words as others in the same discussions. There's only so many unique words you can use when talking about a technical topic.

I see in your post that you try to mitigate this by reducing the number of words compared, but I don't think that is enough to do the job.

chrismorgan - 3 days ago

I wonder how much curly quote usage influences things. I type things like curly quotes with my Compose key, and so do most of my top similars; and four or five words with straight quotes show up among the bottom ten in our analyses. (Also etc, because I like to write &c.)

I’m not going to try comparing it with normalising apostrophes, but I’d be interested how much of a difference it made. It could easily be just that the sorts of people who choose to write in curly quotes are more likely to choose words carefully and thus end up more similar.
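
For anyone who does want to try, quote normalisation could look roughly like this. It is a sketch that assumes only the four common typographic quote characters matter; `QUOTE_MAP` and `normalize_quotes` are hypothetical names.

```python
# Map typographic quotes to their ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark, also used as apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def normalize_quotes(text):
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(QUOTE_MAP)
```

Running the comparison once with and once without this step would show how much of the similarity comes from quote style alone.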

keepamovin - 2 days ago

We can improve this. antirez has made a highly compelling poc but it could be refined for authorship attribution judging by the number of misses in the comments here, and how this compares to greater accuracy of the original post to which antirez refers. I’m no expert, but some ideas:

- remove super high frequency non specific words from the comparison bags, because they don’t distinguish much, have less semantic value and may skew the data

- remove stop words (NLP definition of stop words)

- perform stemming/tokenization/depluralization etc (again, NLP standard)

- implement commutativity and transitivity in the similarity function

- consider words as hyperlinks to the sets of people who use them often enough, and do something Pageranky to refine similarity

- consider word bigrams, etc

- weight variations and misspellings higher as distinguishing signals

What are your ideas?
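
To make two of the bullets concrete (stop-word removal and a commutative similarity function), here is a minimal sketch. It assumes relative word frequencies compared with cosine similarity, which is symmetric by construction; it is not a description of the tool's actual scoring. `STOP_WORDS` is a tiny illustrative list, and `style_vector` and `cosine` are hypothetical names.

```python
import math
from collections import Counter

# Tiny illustrative stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "and", "of", "to", "that", "is", "it"}

def style_vector(words):
    """Relative frequency of each non-stop word."""
    counts = Counter(w for w in words if w not in STOP_WORDS)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

def cosine(u, v):
    """Cosine similarity of two sparse vectors; cosine(u, v) equals cosine(v, u)
    up to floating-point rounding."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

a = style_vector("i reckon that the cache is stale".split())
b = style_vector("i reckon the index is stale too".split())
score = cosine(a, b)
```

Transitivity is a different matter: no similarity score of this kind guarantees it, so it would have to come from a clustering step on top.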

declan_roberts - 2 days ago

This is exactly why HN needs to allow us to delete accounts.

MivLives - 2 days ago

Managed to find an alt I forgot I made and gave up using years ago. I do wonder about other high up people. Like what about our mutual histories makes us have similar word usage? Are we from the same areas or did we hang out in similar places online?

weinzierl - 3 days ago

How does it find the high similarity between "dang" and "dangg" when the "dangg" account has no activity (like comments) at all?

https://antirez.com/hnstyle?username=dang&threshold=20&actio...

seabombs - 2 days ago

This is a bit tangential but I've noticed lots of comments aping the style of Matt Walsh. Not just on HN either, but probably more here than other places I visit.

Anyway, I guess this would be useful to cluster the "Matt Walsh"-y commenters together.

qsort - 3 days ago

Have you tried to analyze whether there is a correlation between "closeness" according to this metric and how often users chat in the same thread? I recognize some usernames that are reported as being similar to me, I wonder if there's some kind of self-selection at play.

lnauta - 2 days ago

That makes me wonder two things. Firstly, whether you can use this to find LLM-generated content, which I guess would need similar instructions. Imagine instructing it to talk like a pirate: it would be quite different from a generic response.

Secondly, if you want to make an alt account harder to cross-correlate with your main, would rewriting your comments with an LLM work against this method? And if so, how well?

wild_egg - 3 days ago

Very cool. Also a bit surprising — two of my matches are people I know IRL.

wruza - 2 days ago

Dang's analysis was funny:

don't site comment we here post that users against you're

Quite a stance, man :)

And me clearly inarticulate and less confident than some:

it may but that because or not and even these

I noticed that randomly remembered usernames tend to produce either lots of utility words like the above, or very few of them. Interestingly, it doesn't really correlate with my overall impression about them.

LinuxBender - 3 days ago

I think it would be interesting to run this tool against Reddit, 4chan and Tweeter to find astroturf accounts. Does it look like a real browser to those sites or would it be blocked?

SnorkelTan - 2 days ago

I remember the original post the author is referring to. I was captivated by it and thought it was cool. When I ran the original tool mentioned in the post, it detected one of my alts that I had forgotten about. OP's newer implementation, using different methodologies, did not detect the alt. For reference, the alt was created in 2010 and the last post was in 2012. Perhaps my writing style has changed?

ziddoap - 3 days ago

I noticed that in my top 20 similar users, the similarity rank/score/whatever are all >~0.83. However, randomly sampling from users in this thread, some top 20s are all <~0.75, or all roughly 0.8, etc.

Is there anything that can be inferred from that? Is my writing less unique, so ends up being more similar to more people?

Also, someone like tptacek has a top 20 with matches all >0.87. Would this be a side-effect of his prolific posting, so matches better with a lot more people?

giancarlostoro - 3 days ago

I tried my name, and I don't think a single "match" is any of my (very rarely used) throw away alts ;) I guess I have a few people I talk like?

Boogie_Man - 3 days ago

No matches higher than .7something and no mutual matches let's go boys I'm a special unique snowflake

atiedebee - 3 days ago

It looks like I don't use the word "and" very often. I do notice that I tend to avoid concatenating sentences like that, although it is likely that there just isn't enough data on my account as I haven't been on HN for that long.

GenshoTikamura - 2 days ago

Such a nice scientific way to detect and mute those who go against the agenda's grain, oh I mean don't contribute anything meaningful to the community

morkalork - 3 days ago

I wonder if such an analysis could tease apart the authors of intentionally anonymous publications. Things like peer review notes for papers or legal opinions (afaik in countries that are not the USA, the authors of a dissenting supreme court decision are not named).

0xWTF - 3 days ago

There are some interesting similarities in o.g. accounts aaronsw, pg, and jedberg.

  - aaronsw and jedberg share danielweber
  - aaronsw and jedberg share wccrawford
  - aaronsw and pg share Natsu
  - aaronsw and pg share mcphage

byearthithatius - 3 days ago

This is so cool. The user who talks most like me, and I can confirm he does, is ajb257

nottorp - 3 days ago

Interesting, the top 3 similar accounts to me are two USers and an Australian. I'm Romanian (and living in Romania). I probably read too many books and news in English :)

Well, and worked a lot with americans over text based communication...

jmward01 - 3 days ago

I think an interesting use of this is potentially finding LLMs trained to have the style of a person. Unfortunately now, just because a post has my style it doesn't mean it was me. I promise I am not a bot. Honest.

Uptrenda - 2 days ago

I knew that this was possible but I always thought it took much more... effort? How do we mitigate this, then? Run our posts through an LLM?

throAwOfCou - 2 days ago

I rotate hn accounts every year or two. In my top 4, I found 3 old alts.

This is impressive and scary. Obviously I had to create a throwaway to say this.

formerly_proven - 3 days ago

I'm surprised no one has made this yet with a clustered visualization.

Lerc - 3 days ago

Used More Often by dang.

don't +0.9339

alganet - 3 days ago

Cool tool. It's a shame I don't have other accounts to test it.

It's also a tool for wannabe impersonators to hone their writing-style mimicry skills!

wizzwizz4 - 3 days ago

PhasmaFelis and mikeash have all matches mutual for the top 20, 30, 50 and 100. Are there other users like this? If so, how many? What's the significance of this, in terms of the shape of the graph?

tablespoon is close, but has a missing top 50 mutual (mikeash). In some ways, this is an artefact of the "20, 30, 50, 100" scale. Is there a way to describe the degree to which a user has this "I'm a relatively closer neighbour to them than they are to me" property? Can we make the metric space smaller (e.g. reduce the number of Euclidean dimensions) while preserving this property for the points that have it?
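
One possible way to put a number on that property is the fraction of a user's top-k neighbours that list the user back. The sketch below assumes each user's ranked neighbour list is already available; the usernames, `top_k`, and `mutual_fraction` are hypothetical.

```python
def top_k(neighbors, user, k):
    """First k entries of a user's ranked neighbour list (most similar first)."""
    return neighbors.get(user, [])[:k]

def mutual_fraction(neighbors, user, k):
    """Share of user's top-k neighbours that also have user in their own top-k."""
    mine = top_k(neighbors, user, k)
    if not mine:
        return 0.0
    mutual = sum(1 for other in mine if user in top_k(neighbors, other, k))
    return mutual / len(mine)

# Made-up ranked lists, not real results from the tool.
neighbors = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "carol"],
    "carol": ["bob", "dave"],
    "dave": ["carol", "alice"],
}

alice_score = mutual_fraction(neighbors, "alice", k=2)
bob_score = mutual_fraction(neighbors, "bob", k=2)
```

A value of 1.0 at every k corresponds to the "all matches mutual" case, and computing it over a range of k would avoid depending on the 20, 30, 50, 100 steps.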

rcpt - 3 days ago

Searched my nearest neighbor and found someone who agrees with my political views.

tptacek - 3 days ago

This is an interesting and well-written post but the data in the app seems pretty much random.

srhtftw - 3 days ago

Did not find any of the alt accounts I've used since 2007. Which is good.

LoganDark - 3 days ago

we have Dissociative Identity Disorder, I wonder if our different personalities would also have different fingerprints? we do have different writing styles

johnea - 3 days ago

I wonder if it could help improve my karma? 8-/

brap - 3 days ago

My highest match was ChatGPT. Oh well

Edit: ChatGTP, my bad

konstantinua00 - 2 days ago

so the website processes only comments older than 2023?

not very useful for newer users like me :/

gfd - 2 days ago

I don't mind revealing my alts since none of them seem to link back to my main. But the top 4 results were all correct for me:

https://antirez.com/hnstyle?username=gfd&threshold=20&action...

zawerf (Similarity: 0.7379)

ghj (Similarity: 0.7207)

fyp (Similarity: 0.7197)

uyt (Similarity: 0.7052)

I typically abandon an account once I reach 500 karma since it unlocks the ability to downvote. I'm now very self conscious about the words I overuse...

38 - 3 days ago

this got two accounts that I used to use

tinix - 3 days ago

fun project! but it didn't get any of my alts.

aaron695 - 2 days ago

[dead]

andrewmcwatters - 3 days ago

Well, well, well, cocktailpeanuts. :spiderman_pointing:

I suspect, antirez, that you may have greater success removing some of the most common English words in order to find truly suspicious correlations in the data.

cocktailpeanuts and I for example, mutually share some words like:

because, people, you're, don't, they're, software, that, but, you, want

Unfortunately, this is a forum where people will use words like "because, people, and software."

Because, well, people here talk about software.

<=^)

Edit: Neat work, nonetheless.

scoresomefeed - 2 days ago

The original version nailed all of my accounts with terrifying accuracy. Since then I make a new account every few days or weeks. Against the rules, I know. And I’ve learned a lot about HN IP tracking and the funny shadowbanning-like tricks they play but don’t cop to. For example, I get different error messages depending on which banned IP I use. And I see different behavior and inconsistency with flagged messages (like one that got upvoted a day after it was flagged and not visible to other users).