Reddit’s API is effectively dead for archival. Third-party apps are gone. Reddit has threatened to cut off access to the Pushshift dataset multiple times. But 3.28 TB of Reddit history exists as a torrent right now, and I built a tool to turn it into something you can browse on your own hardware.

The key point: This doesn’t touch Reddit’s servers. Ever. Download the Pushshift dataset, run my tool locally, get a fully browsable archive. Works on an air-gapped machine. Works on a Raspberry Pi serving your LAN. Works on a USB drive you hand to someone.

What it does: Takes compressed data dumps from Reddit (.zst), Voat (SQL), and Ruqqus (.7z) and generates static HTML. No JavaScript, no external requests, no tracking. Open index.html and browse. Want search? Run the optional Docker stack with PostgreSQL – still entirely on your machine.
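
To give a feel for the pipeline, here is a minimal sketch of the core loop – stream a Pushshift .zst dump (newline-delimited JSON, one object per line) and render a static page per post with Jinja2. The file name, JSON fields, and template are illustrative, not redd-archiver’s actual code:

```python
import io
import json

import zstandard  # pip install zstandard
from jinja2 import Template  # pip install Jinja2

POST_TEMPLATE = Template(
    "<html><body><h1>{{ title }}</h1>"
    "<p>by {{ author }} in r/{{ subreddit }}</p></body></html>"
)

def stream_dump(path):
    """Yield one JSON object per line without decompressing to disk."""
    with open(path, "rb") as fh:
        # Pushshift dumps use a long compression window; streaming
        # decompression fails unless max_window_size is raised.
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        with dctx.stream_reader(fh) as reader:
            for line in io.TextIOWrapper(reader, encoding="utf-8"):
                yield json.loads(line)

for post in stream_dump("RS_2024-12.zst"):  # RS_* files hold submissions
    html = POST_TEMPLATE.render(
        title=post.get("title", ""),
        author=post.get("author", "[deleted]"),
        subreddit=post.get("subreddit", ""),
    )
    with open(f"{post['id']}.html", "w", encoding="utf-8") as out:
        out.write(html)
```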

API & AI Integration: Full REST API with 30+ endpoints – posts, comments, users, subreddits, full-text search, aggregations. Also ships with a Model Context Protocol (MCP) server exposing 29 tools, so AI assistants can query your archive directly.
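
As a rough illustration, a script hitting the local API might look like this – the route, parameter names, and response shape here are assumptions, so check the repo’s API documentation for the real endpoints:

```python
import requests  # pip install requests

# Hypothetical full-text search against a locally hosted archive;
# "/api/search", "q", "subreddit", and "results" are assumed names,
# not redd-archiver's documented interface.
resp = requests.get(
    "http://localhost:8080/api/search",
    params={"q": "kernel panic", "subreddit": "linux", "limit": 10},
)
resp.raise_for_status()
for post in resp.json().get("results", []):
    print(post.get("title"))
```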

Self-hosting options:

  • USB drive / local folder (just open the HTML files)
  • Home server on your LAN
  • Tor hidden service (2 commands, no port forwarding needed – see the sketch after this list)
  • VPS with HTTPS
  • GitHub Pages for small archives
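
For the Tor option, the project’s two commands presumably wrap the standard hidden-service recipe. A minimal sketch, assuming tor is installed, the generated site lives in ./output, and /etc/tor/torrc carries the usual two lines (the actual commands shipped with the tool may differ):

```python
# Assumes /etc/tor/torrc already contains the standard pair:
#     HiddenServiceDir /var/lib/tor/redd-archiver/
#     HiddenServicePort 80 127.0.0.1:8080
# After restarting tor, the .onion hostname appears in
# /var/lib/tor/redd-archiver/hostname. The archive is static HTML,
# so serving it is just a localhost-only file server:
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

handler = partial(SimpleHTTPRequestHandler, directory="./output")  # assumed output dir
HTTPServer(("127.0.0.1", 8080), handler).serve_forever()
```

Binding to 127.0.0.1 is what makes “no port forwarding” work: only the local tor daemon ever reaches the server, and visitors come in through the .onion address.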

Why this matters: Once you have the data, you own it. No API keys, no rate limits, no ToS changes can take it away.

Scale: Tens of millions of posts per instance. The PostgreSQL backend keeps memory use constant regardless of dataset size. For the full 2.38B-post dataset, run multiple instances split by topic.
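
The constant-memory claim comes down to streaming rows rather than materializing them. A sketch of the idea with psycopg2 – table and column names are illustrative, not the tool’s actual schema:

```python
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("dbname=archive user=archive")
# A *named* cursor is server-side: PostgreSQL holds the result set
# and hands Python fixed-size batches, so memory stays flat whether
# the table holds ten thousand rows or tens of millions.
with conn.cursor(name="posts_stream") as cur:
    cur.itersize = 10_000  # rows fetched per round trip
    cur.execute("SELECT id, title FROM posts ORDER BY created_utc")
    for post_id, title in cur:
        ...  # render a page, append to an index, etc.
conn.close()
```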

How I built it: Python, PostgreSQL, Jinja2 templates, Docker. I used Claude Code throughout as an experiment in AI-assisted development. The takeaway: the workflow is “trust but verify” – it accelerates the boring parts, but you still own the architecture.

Live demo: https://online-archives.github.io/redd-archiver-example/
GitHub: https://github.com/19-84/redd-archiver (Public Domain)

Pushshift torrent: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4

  • frongt@lemmy.zip · 18 days ago

    And only a 3.28 TB database? Oh, because it’s compressed. Includes comments too, though.

  • a1studmuffin@aussie.zone · 18 days ago

    This seems especially handy for anyone who wants a snapshot of Reddit from the pre-enshittification, pre-AI era, when content was more authentic and less driven by bots and commercial manipulation of opinion. Just choose the cutoff date you want and stick with that dataset.

  • Tanis Nikana@lemmy.world · 18 days ago

    Reddit is hot stinky garbage but can be useful for stuff like technical support and home maintenance.

    Voat and Ruqqus are straight-up misinformation and fascist propaganda, and if you excise them from your data set, your data will dramatically improve.

    • Gerudo@lemmy.zip · 18 days ago

      Say what you will about Reddit, but there is tons of information on that platform that’s not available anywhere else.

      • UnderpantsWeevil@lemmy.world · 18 days ago

        :-/

        You can definitely mine a bit of gold out of that pile of turds. But you could also go to the library and receive a much higher ratio of signal to noise.

        • pixeltree@lemmy.blahaj.zone · 17 days ago

          This one specific bug in this one niche library has probably not been written about in a book, and even if it has, I doubt that book is in my local library, and even if it is, I doubt I can fucking find it.

          • mirisgaiss@lemmy.world · 17 days ago

            Obscure problems almost always have Reddit comments as search results, and there are no forums or blogs with any of it anymore. It’d be nice to have around solely for that… though I’m sure if shit like /pics or whatever else was removed it could get significantly smaller…

    • communism@lemmy.ml · 17 days ago

      This is just an archive. No different from using the Wayback Machine or any other archive of web content.

      • Appoxo@lemmy.dbzer0.com · 17 days ago

        You still use Reddit in some capacity.

        Or would you deny having watched a movie just because you watched it from your local Jellyfin library instead of on Netflix or at the cinema?

    • 19-84@lemmy.dbzer0.com (OP) · 18 days ago

      2005-06 to 2024-12

      However, the data from 2025-12 has already been released; it just needs to be split and reprocessed for 2025 by watchful1. Once that happens, you can host an archive up to the end of 2025. I will probably add support for importing data from the Arctic Shift dumps instead, so archives can be updated monthly.