ArchiveTeam has finished archiving all goo.gl short links
134 points by pentagrama on 8/17/2025, 5:46:04 PM | 34 comments (tracker.archiveteam.org)
Since the servers were mine, I could see what was happening, and I was very impressed. Within, I want to say, two minutes, the instances had been fully provisioned and were actively archiving videos as fast as possible, fully saturating the connection, with each instance knowing to grab only videos the other instances had not already gotten. Basically, they have always struck me as not only having a solid mission, but also being ultra-efficient in how they carry it out.
Edit: Like they kinda seem like an unnecessary middle-man between the archive and archivee, but maybe I'm missing something.
This is in contrast to the Wayback Machine's built-in crawler, which is just a broad-spectrum internet crawler without any specific rules, prioritizations, or supplementary link lists.
For example, one ArchiveTeam project had the goal to save as many obscure Wikis as possible, using the MediaWiki export feature rather than just grabbing page contents directly. This came in handy for thousands of wikis that were affected by Miraheze's disk failure and happened to have backups created by this project. Thanks to the domain-specific technique, the backups were high-fidelity enough that many users could immediately restart their wiki on another provider as if nothing happened.
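For the curious, the export trick looks roughly like this: MediaWiki's standard API has an export mode that returns a full XML dump (wikitext plus revision metadata) that Special:Import on another wiki can consume. This is only a sketch of the general technique, not the project's actual code, and the wiki URL is a placeholder.

    # Rough sketch of the MediaWiki export technique (not ArchiveTeam's actual code).
    # The wiki URL below is a placeholder.
    import urllib.parse
    import urllib.request

    def export_pages(api_url, titles):
        """Fetch an importable XML dump (wikitext + revision metadata) for some pages."""
        params = urllib.parse.urlencode({
            "action": "query",
            "export": 1,        # ask for a Special:Export-style XML dump
            "exportnowrap": 1,  # return the raw XML instead of wrapping it in the API response
            "titles": "|".join(titles),
        })
        with urllib.request.urlopen(f"{api_url}?{params}") as resp:
            return resp.read()  # XML that Special:Import on another wiki can consume

    # dump = export_pages("https://wiki.example.org/w/api.php", ["Main_Page"])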
They also try to "graze the rate limit" when a website announces a shutdown date and there isn't enough time to capture everything. They actively monitor for error responses and adjust the archiving rate accordingly, to get as much as possible as fast as possible, hopefully without crashing the backend or inadvertently archiving a bunch of useless error messages.
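As a rough illustration of what "grazing the rate limit" means in practice (the status codes and thresholds here are my guesses, not ArchiveTeam's actual tuning):

    # Toy version of rate-limit grazing: back off when the backend starts erroring,
    # creep back up when it recovers. All thresholds here are made up.
    import time
    import urllib.error
    import urllib.request

    def fetch_all(urls, min_delay=0.05, max_delay=30.0):
        delay = 1.0
        for url in urls:
            while True:
                time.sleep(delay)
                try:
                    with urllib.request.urlopen(url) as resp:
                        body = resp.read()
                    delay = max(min_delay, delay * 0.9)    # healthy response: speed up a bit
                    yield url, body
                    break
                except urllib.error.HTTPError as e:
                    if e.code in (429, 500, 502, 503):     # rate-limited or struggling backend
                        delay = min(max_delay, delay * 2)  # slow down, then retry the same URL
                    else:
                        yield url, None                    # other errors: record and move on
                        break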
They are the middlemen who collect the data to be archived.
In this example the archivee (goo.gl/Alphabet) is simply shutting the service down and has no interest in archiving it. Archive.org is willing to host the data, but only if somebody brings it to them. ArchiveTeam writes and organises the crawlers that collect the data and send it to Archive.org.
(Source: ran a Warrior)
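For a sense of how the Warriors avoid duplicating work (each instance "knowing to only grab videos the other instances had not already gotten", as described above), here's a toy version of the claim/mark-done pattern. In reality a central tracker hands out items to Warriors over HTTP; this sketch fakes it with a local SQLite table, and the table and column names are invented.

    # Toy claim/mark-done coordination, standing in for the real tracker protocol.
    # Table and column names are invented for illustration.
    import sqlite3

    def claim_item(db_path):
        """Atomically claim one pending item, or return None when nothing is left."""
        con = sqlite3.connect(db_path)
        try:
            con.execute("BEGIN IMMEDIATE")  # lock so two workers can't claim the same row
            row = con.execute(
                "SELECT id, url FROM items WHERE status = 'todo' LIMIT 1").fetchone()
            if row is None:
                con.commit()
                return None
            con.execute("UPDATE items SET status = 'claimed' WHERE id = ?", (row[0],))
            con.commit()
            return row
        finally:
            con.close()

    def mark_done(db_path, item_id):
        con = sqlite3.connect(db_path)
        con.execute("UPDATE items SET status = 'done' WHERE id = ?", (item_id,))
        con.commit()
        con.close()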
If the Internet Archive is a library, ArchiveTeam is the people who run around collecting stuff and hand it to the library for safekeeping. Stuff that is estimated or announced to be disappearing or removed soon tends to be the focus, too.
The list of short links and their target URLs can't be 91 TiB in size, can it? Does anyone know how this works?
Enlisting in the Fight Against Link Rot - https://news.ycombinator.com/item?id=44877021 - Aug 2025 (107 comments)
Google shifts goo.gl policy: Inactive links deactivated, active links preserved - https://news.ycombinator.com/item?id=44759918 - Aug 2025 (190 comments)
Google's shortened goo.gl links will stop working next month - https://news.ycombinator.com/item?id=44683481 - July 2025 (222 comments)
Google URL Shortener links will no longer be available - https://news.ycombinator.com/item?id=40998549 - July 2024 (49 comments)
Ask HN: Google is sunsetting goo.gl on 3/30. What will be your URL shortener? - https://news.ycombinator.com/item?id=19385433 - March 2019 (14 comments)
Tell HN: Goo.gl (Google link Shortener) is shutting down - https://news.ycombinator.com/item?id=16902752 - April 2018 (45 comments)
Google is shutting down its goo.gl URL shortening service - https://news.ycombinator.com/item?id=16722817 - March 2018 (56 comments)
Transitioning Google URL Shortener to Firebase Dynamic Links - https://news.ycombinator.com/item?id=16719272 - March 2018 (53 comments)
Per Google, shortened links "won't work after August 25 and we recommend transitioning to another URL shortener if you haven't already."
Am I missing something, or doesn't this basically obviate the entire gesture of keeping some links active? If your shortened link is embedded in a document somewhere and can't be updated, Google is about to break it, no?
(In addition to the higher-activity ones that the parent link says will now continue to redirect.)
Unless I'm just super smart (I'm not), it's pretty easy to write a URL shortener as a key-value system, and pure key-value stuff is pretty easy to scale. I can't imagine goo.gl isn't doing something at least as efficient as what I did.
Either way, we're talking about a dataset that fits easily in a 1U server with at most half of its SSD slots filled.
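To make the "key-value system" point concrete, here's a minimal sketch of a shortener along those lines: SQLite as the store, a 62-character alphabet like goo.gl's, and a made-up domain in the comment. It's illustrative only, not how goo.gl itself was built.

    # Minimal URL shortener as a pure key-value store (illustrative sketch only).
    import secrets
    import sqlite3
    import string

    ALPHABET = string.ascii_letters + string.digits   # 62 characters, goo.gl-style keys

    def open_db(path="links.db"):
        con = sqlite3.connect(path)
        con.execute("CREATE TABLE IF NOT EXISTS links (key TEXT PRIMARY KEY, url TEXT)")
        return con

    def shorten(con, url, key_len=6):
        while True:
            key = "".join(secrets.choice(ALPHABET) for _ in range(key_len))
            try:
                con.execute("INSERT INTO links VALUES (?, ?)", (key, url))
                con.commit()
                return key                    # serve e.g. https://sho.rt/<key> as a 301 to url
            except sqlite3.IntegrityError:
                pass                          # rare key collision: just roll a new key

    def resolve(con, key):
        row = con.execute("SELECT url FROM links WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None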
Even though all I did was set up the Docker container one day and forget about it.
How would that even work? I mean, did they just loop through every single permutation and record the result, or how exactly did they do it?
In short, yes. Since no one can make new links, it's a pre-defined space to search. They just requested every possible key, and recorded the answer, and then uploaded it to a shared database.
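In sketch form, the brute-force idea looks like the following. The alphabet and key length are assumptions, the real crawl was spread across many Warriors via the tracker and saved full WARC records rather than bare (key, target) pairs, and it obviously won't work once the redirects go dark, so treat it purely as an illustration.

    # Sketch of enumerating a fixed key space and recording where each key redirects.
    # Alphabet and key length are assumptions; the real project was distributed and
    # stored full WARC records, not just (key, target) pairs.
    import itertools
    import string
    import urllib.error
    import urllib.request

    ALPHABET = string.ascii_letters + string.digits   # assumed goo.gl key alphabet

    def resolve(key):
        """Return the URL that https://goo.gl/<key> leads to, or None if unassigned."""
        req = urllib.request.Request(f"https://goo.gl/{key}", method="HEAD")
        try:
            with urllib.request.urlopen(req) as resp:  # urllib follows the redirect
                return resp.geturl()                   # final target after redirection
        except urllib.error.HTTPError:
            return None                                # 404 (unused key) or blocked

    def all_keys(length):
        for combo in itertools.product(ALPHABET, repeat=length):
            yield "".join(combo)

    # for key in all_keys(6):
    #     print(key, resolve(key))   # in practice: batched, rate-limited, uploaded to the shared DB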