Compiler Explorer and the promise of URLs that last forever
340 points by anarazel | 174 comments | 5/28/2025, 4:28:20 PM | xania.org
It's easy to set up, but be warned, it takes up a lot of disk space.
https://github.com/gildas-lormeau/SingleFile

1. find a way to dedup media (a minimal sketch follows after this list)
2. ensure content blockers are working well
3. for news articles, put it through readability and store the markdown instead. if you wanted to be really fancy, you could instead attempt to programmatically create a "template" of the sites you've visited with multiple endpoints, so the style is retained but you're not storing it with every page. alternatively a good compression algo could do this, if you had your directory like /home/andrew/archive/boehs.org.tar.gz and inside of the tar all the boehs.org pages you visited are saved
4. add full-text search and embeddings over the pages
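For item 1, a minimal sketch of what content-hash dedup could look like in Python; the archive path and media extensions are assumptions, and hard links keep duplicate media down to one copy on disk:

    import hashlib
    import os
    from pathlib import Path

    ARCHIVE = Path.home() / "archive"  # hypothetical archive root

    def dedup_media(root: Path) -> None:
        """Replace byte-identical media files with hard links to one canonical copy."""
        seen = {}  # content hash -> first file seen with that content
        for path in root.rglob("*"):
            if not path.is_file():
                continue
            if path.suffix.lower() not in {".png", ".jpg", ".jpeg", ".gif", ".webp", ".mp4"}:
                continue
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            if digest in seen:
                path.unlink()                # drop the duplicate bytes...
                os.link(seen[digest], path)  # ...and hard-link to the canonical copy
            else:
                seen[digest] = path

    if __name__ == "__main__":
        dedup_media(ARCHIVE)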
It is. 1.1TB is both:
- objectively an incredibly huge amount of information
- something that can be stored for the cost of less than a day of this industry's work
Half my reluctance to store big files is just an irrational fear of the effort of managing it.
Far, far less even. You can grab a 1TB external SSD from a good name for less than a day's work at minimum wage in the UK.
I keep getting surprised at just how cheap large storage is every time I need to update stuff.
A couple of questions:
- do you store them compressed or plain?
- what about private info like bank accounts or health insurance?
I guess for privacy one could train oneself to use private browsing mode.
Regarding compression, for thousands of files don't all those self-extraction headers add up? Wouldn't there be space savings by having a global compression dictionary and only storing the encoded data?
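For what it's worth, a shared dictionary is roughly what zstd's trained-dictionary mode gives you. A small sketch using the third-party zstandard package; the paths and dictionary size are made up, and per-file framing remains, but the repeated boilerplate moves into the dictionary:

    import zstandard as zstd
    from pathlib import Path

    # Hypothetical archive of saved pages; paths and dictionary size are made up.
    pages = list(Path("archive").rglob("*.html"))
    samples = [p.read_bytes() for p in pages]

    # Train one shared dictionary on the corpus, then compress each page against it,
    # so repeated HTML boilerplate lives in the dictionary instead of in every file.
    dictionary = zstd.train_dictionary(112_640, samples)
    compressor = zstd.ZstdCompressor(dict_data=dictionary)

    for page, raw in zip(pages, samples):
        page.with_name(page.name + ".zst").write_bytes(compressor.compress(raw))

    # Decompression needs the same dictionary, so store it alongside the archive.
    Path("archive/shared.dict").write_bytes(dictionary.as_bytes())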
Can’t speak to your other issues but I would think the right file system will save you here. Hopefully someone with more insight can provide color here, but my understanding is that file systems like ZFS were specifically built for use cases like this where you have a large set of data you want to store in a space efficient manner. Rather than a compression dictionary, I believe tech like ZFS simply looks at bytes on disk and compresses those.
I haven't put the effort in to make a "bookmark server" that accomplishes what SingleFile does but hosted on the internet, because SingleFile already works so well.
- Do you also archive logged in pages, infinite scrollers, banking sites, fb etc?
- How many entries is that?
- How often do you go back to the archive? Is stuff easy to find?
- Do you have any organization or additional process (eg bookmarks)?
did you try integrating it with llms/rag etc yet?
Or even worse, when a domain parking company does that: https://archive.org/post/423432/domainsponsorcom-erasing-pri...
A technological conundrum, however, is the fact that I have no way to prove that my archive is an accurate representation of a site at a point in time. Hmmm, or maybe I do? Maybe something funky with cert chains could be done.
edit: sorry, that would only prove when it was taken, not that it wasn’t fabricated.
One way to make this work is to have a mechanism like bitcoin (proof of work), where the proof of work is put into the webpage itself as a hash (made by the original author of that page). Then anyone can verify that the contents weren't changed, and if someone wants to make changes to it and claim otherwise, they'd have to put in even more proof of work to do it (so not impossible, but costly).
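A toy sketch of that idea, assuming the "work" is just finding a nonce that pushes the page hash below a target; this isn't any established scheme, it only illustrates the cost asymmetry between creating and verifying:

    import hashlib

    def proof_of_work(page: bytes, difficulty_bits: int = 24) -> int:
        """Find a nonce so sha256(page || nonce) has `difficulty_bits` leading zero bits."""
        target = 1 << (256 - difficulty_bits)
        nonce = 0
        while True:
            digest = hashlib.sha256(page + nonce.to_bytes(8, "big")).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce
            nonce += 1

    def verify(page: bytes, nonce: int, difficulty_bits: int = 24) -> bool:
        """Verification costs one hash; forging a modified page means redoing the search."""
        digest = hashlib.sha256(page + nonce.to_bytes(8, "big")).digest()
        return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))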
IIUC the timestamping service needs to independently download the contents itself in order to hash it, so if you need to be logged in to see the content there might be complications, and if there's a lot of content they'll probably want to charge you.
But you also don't need to do this: all you need is a service which will attest that it saw a particular hashsum at a particular time. It's up to other mechanisms to prove what that means.
Often true in practice unfortunately, but to the extent that it is true, any approach that tries to use hashes to prove things to a third party is sunk. (We could imagine a timestamping service that allows some kind of post-download "normalisation" step to strip out content that varies between queries and then hash the results of that, but that doesn't seem practical to offer as a free service.)
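A sketch of the normalise-then-hash step; the normalisation here is deliberately naive (drop scripts, collapse whitespace) and the attestation service is left out, since the digest is only as meaningful as what went into it:

    import hashlib
    import re
    import urllib.request

    def normalized_hash(url: str) -> str:
        """Fetch a page, strip parts that vary between requests, hash what's left."""
        html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        html = re.sub(r"<script\b.*?</script>", "", html, flags=re.DOTALL | re.IGNORECASE)
        html = re.sub(r"\s+", " ", html).strip()
        return hashlib.sha256(html.encode("utf-8")).hexdigest()

    # This digest is what you would submit to a timestamping/attestation service.
    print(normalized_hash("https://example.com/"))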
> all you need is a service which will attest that it saw a particular hashsum at a particular time
Isn't that what I'm proposing?
What?? I am a heavy user of the Internet Archive services, not just the Wayback Machine, including official and "unofficial" clients and endpoints, and I had absolutely no idea the extension could do this.
To bulk archive I would manually do it via the web interface or batch automate it. The limitations of manually doing it one by one are obvious, and the limitation of doing it in batches is that it requires, well, keeping batches (lists).
As for being still alive, by that measure hardly anything anyone does is important in the modern world. It's pretty hard to fail at thinking or remembering so badly that it becomes a life-or-death thing.
Agreed.
“Why don’t you just” is a red flag now for me.
Half the time people get suggested a better way, it's because they're actually doing it wrong: they got the solution's requirements wrong in the first place, and an outside perspective helps.
"Why don't you just ..." is just lazy idea suggestion from armchair internet warriors.
turns out that unlike most webpages, the pdf version is only a single page of what is visible on screen.
turns out also that opening the warc immediately triggers a js redirect that is planted in the page. i can still extract the text manually - it's embedded there - but i cannot "just open" the warc in my browser and expect an offline "archive" version - i'm interacting with a live webpage! this sucks from all sides - usability, privacy, security.
Admittedly, i don’t use webrecorder - does it solve this problem? did you verify?
Unfortunately there are sites where it does not work.
My phone browser has a "reader view" popup but it only appears sometimes, and usually not on pages that need it!
Edit: Just installed w3m in Termux... the things we can do nowadays!
It's for bibliographies, but it also archives and stores web pages locally with a browser integration.
I'm sure there are bookmark services that also allow notes, but the tagging, linking related things, etc., all in the app is awesome, plus the ability to export BibTeX for writing a paper!
At a fundamental level, broken website links and dangling pointers in C are the same.
Not that I don't think there is some benefit in what you are attempting, of course. A similar thing I still wish I could do is to "archive" someone's phone number from my contact list. Be it a number that used to be ours, or family/friends that have passed.
Any site/company in this world that promises anything will last forever (and most do) is seriously deluded or intentionally lying, unless their theory of time is different from that of the majority.
> url shortening was a fucking awful idea[2]
[1] https://wiki.archiveteam.org/index.php/Goo.gl
[2] https://wiki.archiveteam.org/index.php/URLTeam
Yes. Also not using a url shortener as infrastructure.
But they never became popular and then link shorteners reimplemented the idea, badly.
https://en.m.wikipedia.org/wiki/Uniform_Resource_Name
Domain names often change hands, and a URL that is supposed to last forever can turn into a malicious phishing link over time.
If you serve an SPA via IPFS, the SPA still needs to fetch the data from an endpoint which could go down or change
Even if you put everything on a blockchain, an RPC endpoint to read the data must have a URL
And thus we arrive at the root of the conflict. Many users (that care about this kind of thing) want publications that they've seen to stay where they've seen them; many publishers have become accustomed to being able to memory-hole things (sometimes for very real safety reasons; often for marketing ones). That's on top of all the usual problems of maintaining a space of human-readable names.
This problem was recognized in 1997 and is why the Digital Object Identifier was invented.
The real abusers are the people who use a shortener to hide scam/spam/illegal websites behind a common domain and post it everywhere.
In other words, every shortened url is "using the url shortener as a database" in that sense. Taking a url with a long query parameter and using a url shortener to shorten it does not constitute "abusing a link shortener as a database."
It’s simply a normal use-case for a url shortener. A long url, usually because of some very large query parameter, which gets mapped to a short one.
> Google Go Links (2010–2021)
> Killed about 4 years ago, (also known as Google Short Links) was a URL shortening service. It also supported custom domain for customers of Google Workspace (formerly G Suite (formerly Google Apps)). It was about 11 years old.
Killing the existing ones is much more of a jerk move. Particularly so since Google is still keeping it around in some form for internal use by their own apps.
Edit: Google is using a g.co link on the "Your device is booting another OS" screen that appears when booting up my Pixel running GrapheneOS. Will be awkward when they kill that service and the hard-coded link in the phone's bios is just dead.
Possibly those other ones are just using the domain name and the underlying service is totally different, not sure.
This is the second time today I've seen a disclaimer like this. Looks like we're witnessing the start of a new trend.
Personal blogs, essays, articles, creative writing, "serious work" - please tell us whether LLMs were used and, if so, to what extent. If I read a blog and it seems human and there's no mention of LLMs, I'd like to be able to safely assume a human wrote it. Is that so much to ask?
If the content can stand on its own, then it is sufficient. If the content is slop, then why does it matter that it is an ai generated slop vs human generated slop?
The only reason anyone wants to know/have the disclaimer is if they cannot themselves discern the quality of the contents, and are using AI generation as a proxy for (bad) quality.
And I differentiate between "Matt Godbolt" who is an expert in some areas and in my experience careful about avoiding wrong information and an LLM which may produce additional depth, but may also make up things.
And well, "discern the quality of the contents" - I often read texts to learn new things. On new things I don't have enough knowledge to qualify the statements, but I may have experience with regards to the author or publisher.
(Some researcher's names I know, some institutions published good reports in the past and that I take into consideration on how much I trust it ... and since I'm human I trust it more if it confirms my view and less if it challenges it or put in different words: there are many factors going into subjective trust)
that's a negative credit activity
I /am/ thinking about a foundation or similar though: the single point of failure is not funding but "me".
I think the most valuable long-living compiler explorer links are in bug reports. I like to link to compiler explorer in bug reports for convenience, but I also include the code in the report itself, and specify what compiler I used with what version to reproduce the bug. I don't expect compiler explorer to vanish anytime soon, but making bug reports self-contained like this protects against that.
i also wonder if url death could be a good thing. humanity makes a special effort to keep around the good stuff. the rest goes into the garbage collection of history.
If I could time jump, it would be interesting to see how historians in a thousand years will look back at our period, where a lot of information will just disappear without a trace as digital media rots.
Having more "average" sources certainly helps, and we aren't good judges now of what will be relevant in the future. We can only try to keep some of everything.
The things that make (or fail to make) life mundane at some point in history are themselves subjects of significant academic interest.
(And of course we have no way to tell what things are "curiosities" or not. Preservation can be seen as a way to minimize survivorship bias.)
Maybe we should get a journaling boom going.
But it has to be written, because pen and paper is literally ten times more durable than even good digital storage.
citation needed lol. data replication >>>> paper's single point of failure.
Sure, as long as the media is copied there is a chance of survival, but will this then be "average" material, or only things we now consider interesting? Will the chain hold, or will it become as uninteresting as many other things did over time? Will the organisation doing it be funded? Will the location where this happens be spared from war?
For today's historians the random finds are important artifacts for understanding "average" people's lives, as the well-preserved documents are mostly legends about the mighty.
Having lots of material all over gives a chance for some of it to survive, and from 40 years or so back we were in a good spot: lots of paper all over about everything, and analog vinyl records, which might be readable in the future to learn about our music. But now everything is on storage media, where many forms see data loss, where formats become outdated and (when looking from a thousand years away) data formats change fast, etc.
But that's just survivorship bias. The vast vast vast majority of all written sheets of paper have been lost to history. Those deemed worthy were carefully preserved, some of the rest was preserved by a fluke. The same is happening with digital media.
The storage media. We have evidence to support this:
* original paper works from 1000 years ago are insanely rare
* more recent storage media provide much more content
How many digital copies of Beowulf do we have? Millions?
How many paper copies from 1000 years ago? one
how many other works from 1000 years ago do we have zero copies of thanks to paper's fragility and thus don't even know existed? probably a lot
You can't have a full history without either.
agreed. i previously wrote some thoughts here: https://boehs.org/node/internet-evanescence
>Google (using their web search API)
>GitHub (using their API)
>Our own (somewhat limited) web logs
>The archive.org Stack Overflow data dumps
>Archive.org’s own list of archived webpages
You're an angel Matt
Someone has to foot the bill somewhere and if there isn't a source of income then the project is bound to be unsupported eventually.
So many paid offerings, whether from startups or even from large companies, have been sunset over time, often with frustratingly short migration periods.
If anything, I feel like I can think of more paid services that have given their users short migration periods than free ones.
Sure, one more service to monitor, while for the most part "fix by restart" is a good enough approach. And then once in a while have an intern switch it to the latest backend choice.
The Rust playground uses GitHub Gists as the primary storage location for shared data. I'm dreading the day that I need to migrate everything away from there to something self-maintained.
[1]: https://www.w3.org/Provider/Style/URI
He advocated for /foo/bar with no extension. He was right about not using /foo/bar.php because the implementation might change.
But he was wrong, it should be /foo/bar.html because the end-result will always be HTML when it's served to a browser, whether it's generated by PHP, Node.js or by hand.
It's pointless to prepare for some hypothetical new browser that uses an alternate language other than HTML.
Just use .html for your pages and stop worrying about how to correctly convert foo.md to foo/index.html and configure nginx accordingly.
You're probably thinking of W3C's guidance: https://www.w3.org/Provider/Style/URI
> But he was wrong, it should be /foo/bar.html because the end-result will always be HTML
20 years ago, it wasn't obvious at all that the end-result would always be HTML (in particular, various styled forms of XML were thought to eventually take over). And in any case, there's no reason to have the content-type in the URL; why would the user care about that?
I agree though that I was too harsh, I didn't realize it was written in 1998 when HTML was still new. I probably first read it around 2010.
But now that we have hindsight, I think it's safe to say .html files will continue to be supported for the next 50 years.
https://www.w3.org/Provider/Style/URI
You say the extension is cruft. That's your opinion. I don't share it.
The way I look at it is that yes, the extension can be useful for requesting a particular file format (IMO the Accept header is not particularly accessible, especially if you are just a regular web browser user). But if you have a default/canonical representation, then you should give that representation in response to a URL that has no extension. And when you link to that document in a representation-neutral way, you should link without the extension.
That doesn't stop you from also serving that same content from a URL that includes the extension that describes the default/canonical representation. And people who want to link to you and ensure they get a particular representation can use the extension in their links. But someone who doesn't care, and just wants the document in whatever format the website owner recommends, should be able to get it without needing to know the extension. For those situations, the extension is an implementation detail that is irrelevant to most visitors.
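A minimal sketch of that arrangement using only the standard library; the docroot and file layout are assumptions, and the point is just that /foo/bar and /foo/bar.html can resolve to the same canonical representation (no path sanitising, so not production code):

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from pathlib import Path

    DOCROOT = Path("site")  # hypothetical docroot, e.g. site/foo/bar.html

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            rel = self.path.split("?", 1)[0].strip("/")
            # /foo/bar and /foo/bar.html resolve to the same canonical HTML file.
            candidate = DOCROOT / (rel if rel.endswith(".html") else rel + ".html")
            if candidate.is_file():
                body = candidate.read_bytes()
                self.send_response(200)
                self.send_header("Content-Type", "text/html; charset=utf-8")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_error(404)

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()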
Not at all. He's famous for helping create the initial version of JavaScript, which was a fairly even mixture of great and terrible. Which means his initial contributions to software were not extremely noteworthy, and he just happened to be at the right time and right place, since something like JavaScript was apparently inevitable. Plus, I can't think of any of his major contributions to software in the decades since. So no, I don't even think that's really an appeal to authority.
You may be thinking of Brendan Eich? Berners-Lee is famous for HTML, HTTP, the first web browser, and the World Wide Web in general; as far as I know he had nothing to do with JS.
I never saw any site where the extra flexibility added any value. So, right now I do favor the extension.
Why did I think Joel Spolsky or Jeff Atwood wrote it?
https://github.com/sdegutis/bubbles
https://github.com/sdegutis/bubbles/
No redirect, just two renders!
It bothers me first because it's semantically different.
Second and more importantly, because it's always such a pain to configure that redirect in nginx or whatever. I eventually figure it out each time, after many hours wasted looking it up all over again and trial and error.
[0] https://nvd.nist.gov/vuln/detail/CVE-2024-38475
I'm pretty sure the lore says that a solemn promise from Google carries the exact same value as a prostitute saying she likes you.
Where URLs may last longer is where they are not used for the "RL" bit, but more like a UUID for namespacing, e.g. in XML, Java or Go.
It's become so trite to mention that I'm rolling my eyes at myself just for bringing it up again but... come on! How bad can it be before Google do something about the reputation this behaviour has created?
Was Stadia not an expensive enough failure?
For other obsolete apps and services, you can argue that they require some continual maintenance and upkeep, so keeping them around is expensive and not cost-effective if very few people are using them.
But a URL shortener is super simple! It's just a database, and in this case we don't even need to write to it. It's literally one of the example programs for AWS Lambda, intentionally chosen because it's really simple.
I guess the goo.gl link database is probably really big, but even so, this is Google! Storage is cheap! Shutting it down is such a short-sighted mean-spirited bean-counter decision, I just don't get it.
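For a sense of scale, a frozen shortener really is just a key-to-URL lookup plus a redirect. A sketch under the assumption that the mapping has been exported to a read-only table; the short code and target below are invented:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    # In reality this would be a read-only export of the existing mapping;
    # the short code and target below are made up for illustration.
    LINKS = {
        "abc123": "https://example.com/some/very/long/path?with=lots&of=params",
    }

    class Redirector(BaseHTTPRequestHandler):
        def do_GET(self):
            target = LINKS.get(self.path.lstrip("/"))
            if target:
                self.send_response(301)
                self.send_header("Location", target)
                self.end_headers()
            else:
                self.send_error(404)

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8080), Redirector).serve_forever()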
We should be using more of them.
but perhaps I don't appreciate how much traffic godbolt gets
URIs, however, can be made to last forever! That also comes with the added benefit that if you somehow integrate content-addressing into the identifier, you'll also be able to safely fetch it from any computer, hostile or not.
I still don't know the difference between URI and URL.
I'm starting to think it doesn't matter.
URI is basically a format and nothing else. (foo://bar123 would be a URI but not a URL because nothing defines what foo: is.)
URLs and URNs are thingies using the URI format; https://news.ycombinator.com is a URL (in addition to being a URI) because there's an RFC that specifies what https: means and how to go out and fetch such things.
urn:isbn:0451450523 (example cribbed from Wikipedia) is a URN (in addition to being a URI) that uniquely identifies a book, but doesn't tell you how to go find that book.
Mostly, the difference is pedantic, given that URNs never took off.
[1]: ba dum tss
However, “URL” in the broader sense is used as an umbrella term for URIs and IRIs (internationalized resource identifiers), in particular by WHATWG.
In practice, what matters is the specific URI scheme (“http”, “doi”, etc.).
One is a location, the other one is an ID. Which is which is referenced in the name :)
And sure, it doesn't matter as long as you're fine with referencing locations rather than the actual data, and aware of the tradeoffs.
And a URL is a URI that also tells you how to find the document.
A URN tells you which data to get (usually by hash or by some big centralized registry), but not how to get it. DOIs in academia, for example, or RFC numbers. Magnet links are borderline.
URIs are either URLs or URNs. URNs are rarely used since they're less practical: browsers can't open them. But note that in any case each URL scheme (https) or URN scheme (doi) is unique - there's no universal way to fetch one without specific handling for each supported scheme. So it's not actually that unusual for a browser not to be able to open a certain scheme.
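A quick illustration with the standard library: both strings parse as URIs, but only the scheme tells you whether there's anything to go fetch (the ISBN example is the one cribbed from Wikipedia above):

    from urllib.parse import urlparse

    for uri in ("https://news.ycombinator.com", "urn:isbn:0451450523"):
        scheme = urlparse(uri).scheme
        # Both are syntactically valid URIs; only the https one doubles as a locator.
        kind = "fetchable URL" if scheme in ("http", "https") else "identifier only (URN)"
        print(f"{uri}: scheme={scheme} -> {kind}")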
https://docs.ipfs.tech/
Depends on your use case I suppose. For things I want to ensure I can reference forever (a theoretical forever), using the location for that reference feels less than ideal; I cannot even count the number of dead bookmarks on both hands and feet, so "link rot" is a real issue.
If those bookmarks instead referenced the actual content (via content-addressing for example), rather than the location, then those would still work today.
But again, not everyone cares about things sticking around, not all use cases require the reference to continue being alive, and so on, so if it's applicable to you or not is something only you can decide.
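A minimal sketch of content-addressed bookmarking, assuming a plain local store keyed by SHA-256 rather than a real IPFS CID; the store path is hypothetical:

    import hashlib
    from pathlib import Path

    STORE = Path.home() / ".bookmark-store"  # hypothetical local content store

    def save(content: bytes) -> str:
        """Store content under its own hash; the hash itself is the bookmark."""
        STORE.mkdir(exist_ok=True)
        digest = hashlib.sha256(content).hexdigest()
        (STORE / digest).write_bytes(content)
        return digest

    def load(digest: str) -> bytes:
        """As long as someone still holds bytes matching this hash, the reference resolves."""
        return (STORE / digest).read_bytes()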
I've pondered that a lot in my system design which bears some resemblance to the principles of REST.
I have split resources into ephemeral (and mutable) ones, and immutable, reference-counted (or otherwise GC-ed) ones, which are persistent while referred to, but collected when no one refers to them.
In a distributed system the former is the default, the latter can exist in little islands of isolated context.
You can't track references throughout the entire world. The only thing that works is timeouts, but those are not reliable. Nor can you exist forever, years after no one needs you. A system needs its parts to be useful, or it dies full of useless parts.
And yet… that was a very self-destructive decision.
Considering link permanence was a "founding principle", that's just unbelievably stupid. If I decide one of my "founding principles" is that I'm never going to show up at work with a dirty windshield, then I shouldn't rely on the corner gas station's squeegee and cleaning fluid.
There seemed to be two principles at play here:
1. Links should always work
2. We don't want to store any user data
#2 is a bit complicated, because although it sounds nice, it has two potential justifications:
2a: For privacy reasons, don't store any user data
2b: To avoid having to think through the implications of storing all those things ourselves
I'm not sure how much each played into their thinking; possibly because of a lack of clarity, 2a sounded nice and 2b was the real motivation.
I'd say 2a is a reasonable aspiration; but using a link shortener changed it from "don't store any user data" to "store the user data somewhere we can't easily get at it", which isn't the same thing.
2b, when stated more clearly, is obviously just taking on technical debt and adding dependencies which may come back to bite you -- as it did.