Show HN: A free, privacy-preserving archive of public Discord servers
I have been working on this project for a while, and I think this solves a problem that a lot of people here have: not being able to easily search Discord servers.
Currently, I only scrape servers that are marked as "discoverable" on Discord. However, if there's enough interest in the project, I'm open to adding specific servers by request. I'm primarily focused on informational servers, such as open source projects, Minecraft mods, and support communities for tools, services, or platforms (for example, hosting providers), rather than casual hangout spaces.
I have placed restrictions on searching directly by user ID to prevent doxing. I also made the opt out process one click, for those who do not want to be archived.
This is my first large scale project, so I'd love to hear your feedback!
> I have placed restrictions on searching directly by user ID to prevent doxing. I also made the opt out process one click, for those who do not want to be archived.
1) I'd suggest anonymizing the usernames / author ids to something more privacy friendly such as how some image sites were generating 3-4 random words as a human readable unique id. This removes a lot of the reason people would opt out (i.e. posts being tracked down years later)
2) You do not seem to have clear rate limit documentation. If you are asking people to pay for commercial use, I'd suggest making it clear what the rough original limits are as well as the rough price range of what you'd offer.
3) Tbh, the only real thing I want from this project is basically narrative / roleplay / writing content for LLM reasons as I'm trying to build a rules-oriented system that narrates via LLM. If you don't want people using this data for this purpose, I'd suggest making that clear.
Thanks for your suggestions.
> 1) I'd suggest anonymizing the usernames / author ids to something more privacy friendly such as how some image sites were generating 3-4 random words as a human readable unique id. This removes a lot of the reason people would opt out (i.e. posts being tracked down years later)
In the original iteration of Searchcord, it used to work similarly to that. The username was `sha256(userid+guildid)`, truncated to the first 8 characters. Unfortunately, it was pretty hard to follow chats. I will try your idea and see how it works, though.
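For anyone curious how the two schemes compare, here is a minimal sketch. The hex version follows the `sha256(userid+guildid)` formula above; the word version is only one possible reading of the "3-4 random words" suggestion. The function names and the tiny `WORDS` list are hypothetical, not Searchcord's actual code.

```python
import hashlib

# Hypothetical word list for illustration; a real one would need to be far
# larger, since 16 words and 3 slots give only 4096 distinct pseudonyms.
WORDS = ["amber", "birch", "cedar", "delta", "ember", "fjord", "grove", "heron",
         "iris", "jade", "kelp", "lotus", "maple", "nova", "opal", "pine"]


def hex_pseudonym(user_id: str, guild_id: str) -> str:
    """sha256(userid+guildid), truncated to the first 8 hex characters."""
    digest = hashlib.sha256((user_id + guild_id).encode("utf-8")).hexdigest()
    return digest[:8]


def word_pseudonym(user_id: str, guild_id: str, n_words: int = 3) -> str:
    """Map the same hash to a few words so the name is easier to follow in a chat."""
    digest = hashlib.sha256((user_id + guild_id).encode("utf-8")).digest()
    # One digest byte selects one word; the mapping is deterministic per (user, guild).
    return "-".join(WORDS[b % len(WORDS)] for b in digest[:n_words])
```

Both functions give the same user a stable name within one guild and a different input hash in another guild, so a user cannot be trivially followed across servers.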
> 2) You do not seem to have clear rate limit documentation.
This is a good idea. The rate limit varies by endpoint, and I haven't gotten around to documenting each one.
> If you are asking people to pay for commercial use, I'd suggest making it clear what the rough original limits are as well as the rough price range of what you'd offer.
I have absolutely zero idea what industry would be interested in this, in what form, and if anyone would even pay.
> 3) Tbh, the only real thing I want from this project is basically narrative / roleplay / writing content for LLM reasons as I'm trying to build a rules-oriented system that narrates via LLM. If you don't want people using this data for this purpose, I'd suggest making that clear.
I really don't care what people do with the data, as long as they are not spamming requests or using the data for commercial purposes without permission.
I was scrolling through the home page and came across a few where the only channels you're allowed to access are the verify-yourself or welcome channels.
https://archive.org/search?query=subject%3A%22DiscordChatExp...
https://archive.org/search?query=subject%3A%22archiveteam_di...
https://wiki.archiveteam.org/index.php/Discord
This is interesting, I somehow missed this. Unfortunately, those are not full text searchable. Maybe I will download them and import them into Searchcord, with proper credit of course.
Thanks for this!
And related, I'd like to be able to run this locally for exports of guilds that I'm on myself. Is that even possible with the architecture you've built?
This is absolutely something I want to do, but at the guild level. The database itself is over 13TB, which is much too large to create regular exports of. I will probably provide a SQLite export of each guild, regenerated each week/month. Anyone is free to download whatever they want in real time from the API.
Thanks for your question!
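A per-guild SQLite export like the one described above could look roughly like this. It is a sketch only: the `messages` table, its columns, and the `export_guild` helper are assumptions for illustration, not the actual Searchcord schema.

```python
import sqlite3


def export_guild(messages, path):
    """Write one guild's messages to a standalone SQLite file and return the row count.

    `messages` is an iterable of (id, channel_id, author, content, sent_at) tuples.
    The schema is hypothetical.
    """
    conn = sqlite3.connect(path)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """CREATE TABLE IF NOT EXISTS messages (
                   id INTEGER PRIMARY KEY,
                   channel_id INTEGER NOT NULL,
                   author TEXT NOT NULL,
                   content TEXT NOT NULL,
                   sent_at TEXT NOT NULL
               )"""
        )
        # INSERT OR REPLACE keeps a regenerated export idempotent per message id.
        conn.executemany(
            "INSERT OR REPLACE INTO messages VALUES (?, ?, ?, ?, ?)", messages
        )
    count = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    conn.close()
    return count
```

A single file per guild keeps each export small enough to regenerate on a schedule, and anyone can query it locally with the stock `sqlite3` tooling.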
Thanks for your feedback.
For software, I use ScyllaDB and Elasticsearch. It's split across 6 physical nodes (8 including the CDN). Data collection is handled using standard user accounts, accessing only public, discoverable servers. I plan to write a blog post about the technical aspect of how this was done soon.
Admins of these servers weren't contacted, as the content indexed is already publicly accessible, comparable to a forum like this or public subreddit. That said, I understand the sensitivity around data visibility, and I've made it very simple for any user to opt out of indexing at any time. Private or invite-only servers are, of course, completely excluded.
Thanks for your suggestions. However, this does not work for a few reasons:
1. Joining servers is protected by increasingly difficult to solve captchas that have no commercially available solver. This is not a battle I want to fight.
2. There are a LOT of CSAM rings that spam invite links in public servers. This is also not something I want to go anywhere near.
Moreover, after the fallout of spy.pet, I think it is very important that users are able to opt out.
Not exactly. Attachments are only fetched from Discord as the user requests them. This means that the vast majority of attachments are never stored on my server. Right now, I only have about 280TB of attachments locally on my own infrastructure. You can see more stats here: https://searchcord.io/about
Thanks for your question!
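The fetch-on-request idea can be sketched as a small cache in front of an upstream fetcher. Everything here is hypothetical: the class name, the in-memory dict standing in for local storage, and the assumption that an attachment is kept locally once it has been requested.

```python
from typing import Callable, Dict


class LazyAttachmentStore:
    """Fetch an attachment from upstream only the first time it is requested."""

    def __init__(self, fetch: Callable[[str], bytes]):
        self._fetch = fetch                   # e.g. a function that downloads from the origin
        self._cache: Dict[str, bytes] = {}    # stand-in for local disk storage
        self.upstream_fetches = 0

    def get(self, attachment_id: str) -> bytes:
        if attachment_id not in self._cache:
            # Cache miss: go upstream once, then serve from local storage afterwards.
            self._cache[attachment_id] = self._fetch(attachment_id)
            self.upstream_fetches += 1
        return self._cache[attachment_id]
```

Attachments nobody asks for never reach the cache, which is why local storage stays far smaller than the full set of attachments referenced in the archive.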
There's so much stuff locked in Discord now that forums have fallen in popularity. I think this sort of thing really helps unlock that knowledge again.