Show HN: Self-host Reddit – 2.38B posts, works offline, yours forever
github.com | 135 points by 19-84 | 6 hours ago
Reddit's API is effectively dead for archival. Third-party apps are gone. Reddit has threatened to cut off access to the Pushshift dataset multiple times. But 3.28TB of Reddit history exists as a torrent right now, and I built a tool to turn it into something you can browse on your own hardware.
The key point: This doesn't touch Reddit's servers. Ever. Download the Pushshift dataset, run my tool locally, get a fully browsable archive. Works on an air-gapped machine. Works on a Raspberry Pi serving your LAN. Works on a USB drive you hand to someone.
What it does: Takes compressed data dumps from Reddit (.zst), Voat (SQL), and Ruqqus (.7z) and generates static HTML. No JavaScript, no external requests, no tracking. Open index.html and browse. Want search? Run the optional Docker stack with PostgreSQL – still entirely on your machine.
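For a rough idea of the input format, here is a minimal sketch (not the project's actual ingestion code) of iterating over one of the Reddit dumps, assuming the standard Pushshift layout of zstd-compressed, newline-delimited JSON:

```python
# Illustrative only, not redd-archiver's real code: stream records out of a
# Pushshift-style .zst dump (zstd-compressed, one JSON object per line).
import io
import json
import zstandard  # pip install zstandard

def iter_records(path):
    with open(path, "rb") as fh:
        # Pushshift dumps were written with a large zstd window, so raise the limit.
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        with dctx.stream_reader(fh) as reader:
            for line in io.TextIOWrapper(reader, encoding="utf-8", errors="replace"):
                yield json.loads(line)

if __name__ == "__main__":
    # File name is just an example; any submissions dump from the torrent works the same way.
    for post in iter_records("RS_2023-01.zst"):
        print(post.get("subreddit"), post.get("title"))
        break
```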
API & AI Integration: Full REST API with 30+ endpoints – posts, comments, users, subreddits, full-text search, aggregations. Also ships with an MCP server (29 tools) so you can query your archive directly from AI tools.
Self-hosting options:
- USB drive / local folder (just open the HTML files)
- Home server on your LAN
- Tor hidden service (2 commands, no port forwarding needed)
- VPS with HTTPS
- GitHub Pages for small archives
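Because the output is plain static HTML, the LAN option can be as simple as pointing any web server at the generated directory. A minimal sketch using Python's built-in server (the "output" directory name is an assumption; use wherever your generated index.html lives):

```python
# Serve the generated static archive to the local network.
# Directory name "output" is an assumption, not the project's fixed layout.
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

handler = partial(SimpleHTTPRequestHandler, directory="output")
ThreadingHTTPServer(("0.0.0.0", 8080), handler).serve_forever()
```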
Why this matters: Once you have the data, you own it. No API keys, no rate limits, no ToS changes can take it away.
Scale: Tens of millions of posts per instance. PostgreSQL backend keeps memory constant regardless of dataset size. For the full 2.38B post dataset, run multiple instances by topic.
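The reason memory stays flat is that records are streamed from the dump and flushed to PostgreSQL in fixed-size batches, so only one batch is ever held at a time. A hedged sketch of that pattern; the table and column names are hypothetical, not redd-archiver's actual schema:

```python
# Constant-memory ingestion sketch: stream records, insert in fixed-size batches.
# Table "posts" and its columns are hypothetical placeholders.
import psycopg2
from psycopg2.extras import execute_values

BATCH_SIZE = 5000

def _flush(cur, batch):
    execute_values(
        cur,
        "INSERT INTO posts (id, subreddit, title, created_utc) VALUES %s "
        "ON CONFLICT (id) DO NOTHING",
        batch,
    )

def load_posts(records, dsn="dbname=archive user=postgres"):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        batch = []
        for rec in records:
            batch.append((rec.get("id"), rec.get("subreddit"),
                          rec.get("title"), rec.get("created_utc")))
            if len(batch) >= BATCH_SIZE:
                _flush(cur, batch)
                conn.commit()
                batch.clear()
        if batch:
            _flush(cur, batch)
            conn.commit()
```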
How I built it: Python, PostgreSQL, Jinja2 templates, Docker. Used Claude Code throughout as an experiment in AI-assisted development. Learned that the workflow is "trust but verify" – it accelerates the boring parts but you still own the architecture.
Live demo: https://online-archives.github.io/redd-archiver-example/
GitHub: https://github.com/19-84/redd-archiver (Public Domain)
Pushshift torrent: https://academictorrents.com/details/1614740ac8c94505e4ecb9d...
Cool way to self-host archives. What I'd really like is a plugin that automatically pulls from archives somewhere and replaces deleted comments and those bot-overwritten comments with the original context. Reddit is becoming maddening to use because half the old links I click have comments overwritten with garbage out of protest for something. Ironically the original content is available in these archives (which are used for AI training) but now missing for actual users like me just trying to figure out how someone fixed their printer driver 2 years ago.

That would only really be ironic if the reason for people overwriting their comments was out of protest for LLM training, but the main reason that resulted in by far the biggest wave of deletions was Reddit locking down their API. If the result of their protest is that the site is less useful for you, the user, then in fact it served its purpose, as the entire point was an attempt to boycott Reddit, i.e. get people to stop using it by removing the user contributions that give the site its only value in the first place.

> If the result of their protest is that the site is less useful for you, the user, then in fact it served its purpose, as the entire point was an attempt to boycott Reddit, i.e. get people to stop using it by removing the user contributions that give the site its only value in the first place.

In practice I just give them more page views because I have to view more threads before I find the answer. Reddit's DAU numbers have only gone up since the protest.

I did phrase it as "an attempt". In the end the protest probably wasn't as effective as protestors might have hoped, and it didn't get Reddit to change course on their enshittification decisions. I do think it was good that there was an attempt at pushback, at least, when most software users just accept enshittification as normal and continue tolerating whatever abuse their masters throw at them.

Data is available via torrent in this section: https://github.com/19-84/redd-archiver?tab=readme-ov-file#-g... I have also published sub statistics and profiling for each platform. These can be used to help identify which subs to prioritize for archiving.

reddit: https://github.com/19-84/redd-archiver/blob/main/tools/subre...

voat: https://github.com/19-84/redd-archiver/blob/main/tools/subve...

ruqqus: https://github.com/19-84/redd-archiver/blob/main/tools/guild...

I tried spinning up the local approach with docker compose, but it fails. There's no `.env.example` file to copy from. And even if the env vars are set manually, there are issues with the mentioned volumes not existing locally. Seems like this needs more polish.

Thank you for your comment. Some example dot files were not copied in my original repo; they have now been added: https://github.com/19-84/redd-archiver/commit/0bb103952195ae... The docs have been updated with mkdir steps: https://github.com/19-84/redd-archiver/commit/c3754ea3a0238f... Cheers.
I checked the updated steps. This is still missing creating the `output/.postgres-data` dir, without which docker compose refuses to start. After creating that manually, going to http://localhost/ shows a 403 Forbidden page, which makes you believe that something might have gone wrong. This is before running `reddarchiver-builder python reddarc.py` to generate the necessary DB from the input data.

I wonder if this can be hooked up with the now-dead Apollo app in some way, to get back a slice of time that is forever lost now?

_Hacker News collectively grabs the dataset to train their models on how to become effective reddit trolls_

I want to do the same thing for TikTok. I have 5k videos starting from the pandemic downloaded. Want to find a way to use AI to tag and categorize the videos to scroll locally.

Did you pay all the people who created its content?

I have no problem with this being downloaded for personal use, in fact that's a good thing. But of course we both know it'll be used to train AI.

> Voat

Gross. Why would anyone want to have an archive of Reddit For Neonazis?

There are certainly things to be learned from analysis of the dataset. Keep your friends close but your enemies as JSON, or something...

Thank you for your comment. I will support any platform that has a complete dataset available. I will take submissions for any complete datasets through GitHub issues: https://github.com/19-84/redd-archiver/blob/main/.github/ISS...

It seems you have no understanding of the term neo-fascism, and yes, it's not what your propaganda talks about.

Wat? It sold itself as a healthier alternative to Reddit, but by the end of its run virtually every post sitewide was some flavor of virulently racist, misogynistic, anti-semitic, fringe conspiratorial, etc.

So what? It's written by humans, and those people are just as human as you. Just because you don't agree with it doesn't mean they are wrong. There are a billion Muslims that worship a guy that married a prepubescent child. Who are you to say that they are wrong to do so? Or that not liking a group of people (like you are doing yourself) is wrong?

Funny defense to use for a crowd that spent all their time regularly, hatefully dehumanizing people. The front page was routinely plastered with shit like "Why interracial children are an abomination," Hitler-did-nothing-wrong propaganda, usernames that echoed Nazi slogans and fantasized about mass-murdering non-white people, etc. It was an utter cesspool, and preserving and perpetuating that is a really weird use of dev time and effort.

It'll be useful to have when the posters pop up as presidential advisors.

Someday people will look back on most of us and be shocked we were ok aborting babies, cutting off children's genitals and all that sort. I don't think it's fair to go against millions of Germans just because you don't understand them.