Reddit’s API is effectively dead for archival. Third-party apps are gone. Reddit has threatened to cut off access to the Pushshift dataset multiple times. But 3.28 TB of Reddit history exists as a torrent right now, and I built a tool to turn it into something you can browse on your own hardware.

The key point: This doesn’t touch Reddit’s servers. Ever. Download the Pushshift dataset, run my tool locally, get a fully browsable archive. Works on an air-gapped machine. Works on a Raspberry Pi serving your LAN. Works on a USB drive you hand to someone.

What it does: Takes compressed data dumps from Reddit (.zst), Voat (SQL), and Ruqqus (.7z) and generates static HTML. No JavaScript, no external requests, no tracking. Open index.html and browse. Want search? Run the optional Docker stack with PostgreSQL – still entirely on your machine.
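
To give a feel for the pipeline, here is a minimal sketch of the core loop – stream a Pushshift .zst dump (newline-delimited JSON, one object per line) and render a static page per post with Jinja2. The file name, JSON fields, and template are illustrative, not redd-archiver’s actual code:

```python
import io
import json

import zstandard  # pip install zstandard
from jinja2 import Template  # pip install Jinja2

POST_TEMPLATE = Template(
    "<html><body><h1>{{ title }}</h1>"
    "<p>by {{ author }} in r/{{ subreddit }}</p></body></html>"
)

def stream_dump(path):
    """Yield one JSON object per line without decompressing to disk."""
    with open(path, "rb") as fh:
        # Pushshift dumps use a long compression window; streaming
        # decompression fails unless max_window_size is raised.
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        with dctx.stream_reader(fh) as reader:
            for line in io.TextIOWrapper(reader, encoding="utf-8"):
                yield json.loads(line)

for post in stream_dump("RS_2024-12.zst"):  # RS_* files hold submissions
    html = POST_TEMPLATE.render(
        title=post.get("title", ""),
        author=post.get("author", "[deleted]"),
        subreddit=post.get("subreddit", ""),
    )
    with open(f"{post['id']}.html", "w", encoding="utf-8") as out:
        out.write(html)
```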

API & AI Integration: Full REST API with 30+ endpoints – posts, comments, users, subreddits, full-text search, aggregations. Also ships with a Model Context Protocol (MCP) server exposing 29 tools, so AI assistants can query your archive directly.
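
As a rough illustration, a script hitting the local API might look like this – the route, parameter names, and response shape here are assumptions, so check the repo’s API documentation for the real endpoints:

```python
import requests  # pip install requests

# Hypothetical full-text search against a locally hosted archive;
# "/api/search", "q", "subreddit", and "results" are assumed names,
# not redd-archiver's documented interface.
resp = requests.get(
    "http://localhost:8080/api/search",
    params={"q": "kernel panic", "subreddit": "linux", "limit": 10},
)
resp.raise_for_status()
for post in resp.json().get("results", []):
    print(post.get("title"))
```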

Self-hosting options:

  • USB drive / local folder (just open the HTML files)
  • Home server on your LAN
  • Tor hidden service (2 commands, no port forwarding needed – see the sketch after this list)
  • VPS with HTTPS
  • GitHub Pages for small archives
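
For the Tor option, the project’s two commands presumably wrap the standard hidden-service recipe. A minimal sketch, assuming tor is installed, the generated site lives in ./output, and /etc/tor/torrc carries the usual two lines (the actual commands shipped with the tool may differ):

```python
# Assumes /etc/tor/torrc already contains the standard pair:
#     HiddenServiceDir /var/lib/tor/redd-archiver/
#     HiddenServicePort 80 127.0.0.1:8080
# After restarting tor, the .onion hostname appears in
# /var/lib/tor/redd-archiver/hostname. The archive is static HTML,
# so serving it is just a localhost-only file server:
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

handler = partial(SimpleHTTPRequestHandler, directory="./output")  # assumed output dir
HTTPServer(("127.0.0.1", 8080), handler).serve_forever()
```

Binding to 127.0.0.1 is what makes “no port forwarding” work: only the local tor daemon ever reaches the server, and visitors come in through the .onion address.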

Why this matters: Once you have the data, you own it. No API keys, no rate limits, no ToS changes can take it away.

Scale: Tens of millions of posts per instance. The PostgreSQL backend keeps memory use constant regardless of dataset size. For the full 2.38B-post dataset, run multiple instances split by topic.
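
The constant-memory claim comes down to streaming rows rather than materializing them. A sketch of the idea with psycopg2 – table and column names are illustrative, not the tool’s actual schema:

```python
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("dbname=archive user=archive")
# A *named* cursor is server-side: PostgreSQL holds the result set
# and hands Python fixed-size batches, so memory stays flat whether
# the table holds ten thousand rows or tens of millions.
with conn.cursor(name="posts_stream") as cur:
    cur.itersize = 10_000  # rows fetched per round trip
    cur.execute("SELECT id, title FROM posts ORDER BY created_utc")
    for post_id, title in cur:
        ...  # render a page, append to an index, etc.
conn.close()
```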

How I built it: Python, PostgreSQL, Jinja2 templates, Docker. I used Claude Code throughout as an experiment in AI-assisted development. The takeaway: the workflow is “trust but verify” – it accelerates the boring parts, but you still own the architecture.

Live demo: https://online-archives.github.io/redd-archiver-example/
GitHub: https://github.com/19-84/redd-archiver (Public Domain)

Pushshift torrent: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4

  • frongt@lemmy.zip · 18 days ago

    And only a 3.28 TB database? Oh, because it’s compressed. Includes comments too, though.

  • a1studmuffin@aussie.zone · 18 days ago

    This seems especially handy for anyone who wants a snapshot of Reddit from the pre-enshittification, pre-AI era, when content was more authentic and less driven by bots and commercial manipulation of opinion. Just choose the cutoff date you want and stick with that dataset.

  • Tanis Nikana@lemmy.world · 18 days ago

    Reddit is hot stinky garbage but can be useful for stuff like technical support and home maintenance.

    Voat and Ruqqus are straight-up misinformation and fascist propaganda, and if you excise them from your data set, your data will dramatically improve.

    • Gerudo@lemmy.zip · 18 days ago

      Say what you will about Reddit, but there is tons of information on that platform that’s not available anywhere else.

      • UnderpantsWeevil@lemmy.world · 18 days ago

        :-/

        You can definitely mine a bit of gold out of that pile of turds. But you could also go to the library and receive a much higher ratio of signal to noise.

        • pixeltree@lemmy.blahaj.zone · 17 days ago

          This one specific bug in this one niche library has probably not been written about in a book, and even if it has, I doubt that book is in my local library, and even if it is, I doubt I can fucking find it.

          • mirisgaiss@lemmy.world · 17 days ago

            Obscure problems almost always have Reddit comments as search results, and there are no forums or blogs with any of it anymore. It’d be nice to have around solely for that… though I’m sure if shit like /pics or whatever else was removed it could get significantly smaller…

    • communism@lemmy.ml · 17 days ago

      This is just an archive. No different from using the Wayback Machine or any other archive of web content.

      • Appoxo@lemmy.dbzer0.com · 17 days ago

        You still use Reddit in some capacity.

        Or would you deny having watched a movie just because you watched it from your local Jellyfin library instead of on Netflix or at the cinema?

    • 19-84@lemmy.dbzer0.com (OP) · 18 days ago

      2005-06 to 2024-12

      However, the data from 2025-12 has already been released; it just needs to be split and reprocessed for 2025 by watchful1. Once that happens, you can host an archive up to the end of 2025. I will probably add support for importing data from the Arctic Shift dumps instead, so archives can be updated monthly.