Sometimes CPU cores are odd
83 points by rbanffy 8/28/2025, 9:39:18 PM 101 comments (anubis.techaro.lol)
Another joke from the same era: Having a 2 core processor means that you can now e.g. watch a film at the same time. At the same time with what? At the same time with running Windows Vista!
2^0 = 1
So the logic might make sense in people's heads if they never encounter 6 or 12 core CPUs that are common these days.
[1] https://abrahamjuliot.github.io/creepjs/
I always found it annoying that CPU information was widely available and precise while memory information was not - it's clamped to 0.25, 0.5, 1, 2, 4 or 8 GB. If you're running something memory-bound in the browser you have to be really conservative to avoid locking up the user's device (or ask them to manually specify how much memory to use). https://developer.mozilla.org/en-US/docs/Web/API/Device_Memo...
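The practical upshot for memory-bound code is that the reported bucket is only a coarse upper bound. A minimal sketch of working with it; the divide-by-four budget is my own assumption, not anything from the API docs:

```javascript
// Hedged sketch: picking a conservative heap budget from the coarse
// Device Memory API buckets described above.
function memoryBudgetBytes(deviceMemoryGiB) {
  // The API reports only 0.25, 0.5, 1, 2, 4, or 8 (GiB), and is missing
  // entirely in Firefox and Safari, so fall back to the smallest bucket.
  const gib = deviceMemoryGiB ?? 0.25;
  // Use at most a quarter of the reported bucket: it's an upper bound
  // shared with the OS and every other tab. The 1/4 factor is arbitrary.
  return Math.floor((gib * 1024 ** 3) / 4);
}

// In a browser: const budget = memoryBudgetBytes(navigator.deviceMemory);
```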
> ... a challenge method that requires the client to do a single round of SHA-256 hashing deeply nested into a Preact hook in order to prove that the client is running JavaScript.
Why a single round? Doing the whole proof-of-work challenge inside the proof of React would be even more effective, right?
I have Chrome on mobile configured such that JS and cookies are disabled by default, and then I enable them per site based on my judgement. You might be surprised to learn that this normally works fine, and sites are usually better for it: they stop nagging, and they load faster. This makes some sense in retrospect, as this is what allows search engine crawlers to do their thing and get that SEO score going.
Anubis (and Cloudflare, for that matter) forces me to temporarily enable JS and cookies at least once anyway, completely defeating the purpose of my paranoid settings. I basically never bother to, but I do admit it is annoying. It's up there with sites that have no content at all unless JS is on (high-profile example: AWS docs). At least Cloudflare only spoils the fun every now and then. With Anubis, it's always.
It's definitely my fault, but at the same time, I don't feel this is right. Simple static pages now require allowing arbitrary code execution and statefulness. (Although I do recognize that SVGs and fonts also kind of do so anyhow, much to my further annoyance).
My God, there's two of us!
(Though … you're being privacy conscious on Chrome? Come to Firefox. Ignore the pesky "it's funded by Google" problems, nothing to see, nothing to see, the water is fiiiine.)
> You might be surprised to learn that normally, this actually works fine
I guess I have a different experience there. A huge number of sites just outright crash. (E.g., the HN search.) JavaScript devs, I've learned, do not handle error cases, and the exceptions tend to just propagate out and ruin the rendering. There seems to be some popular framework out there that even destroys the whole DOM just to emit the error. (I forget the text, but it's the same text, always. Always centered. Flash of page, then crash.)
I have a custom extension that fakes the cookie storage for those JS pages: it just lies and says "yeah, cookies are enabled" and then blackholes the writes. But it fails for anything that needs a real cookie … like Anubis.
I'm empathetic towards where Anubis is coming from though. But the "I passed the challenge" cookie is indistinguishable from a tracker … although probably most people running Anubis are inherently trustworthy by a sort of cultural association so long as Anubis remains non-mainstream. I think I might modify it to have the ability to store cookies for a short time frame (like 1h) in some cases, such as Anubis; that's enough to pass the challenge, but weighed against tracking. I'm usually only blocked by Anubis for something like a blog post, so that should suffice.
Where I work our main product is a React-based web site with a JSON back end, you might go to
http://example.com/web/item/88841
and that will load maybe 20MB of stuff (always the same thing) and eventually after the JS boots up a useEffect() gets called that reads '88841' out of the URL and does a GET to
http://example.com/api/item/88841
which gets you nicely formatted JSON. On top of that the public id(s) are sequential integers so you could easily enumerate all the items if you just thought a little bit.
We've had more than one obnoxious crawler that we had reason to believe was targeted specifically at us. It would go to the /web/ URL and, without a cache, download all the HTML, Javascript, and CSS, then run the JS and download the JSON for each page -- at which point they are either saving the generated HTML or looking at the DOM. If they'd spent 10 minutes playing with the browser dev tools they would have seen the /item/ request and probably could have figured out pretty quickly how to interpret the results. As is, they're going to have to figure out how to parse that HTML and turn it back into something like the JSON. Hitting the API directly would probably have saved them 95% of the bandwidth, 95% of the CPU, and whatever time they spent writing parsing code and managing their Rube Goldberg machine -- but I'd take 50% odds any day that they never actually did anything with the data they captured, because crawlers usually don't.
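The "10 minutes in dev tools" shortcut amounts to a one-line URL rewrite. A hedged sketch; example.com and the /web/ vs /api/ layout are the hypothetical paths from the comment above:

```javascript
// Rewrite the React route into the JSON endpoint it fetches anyway.
// http://example.com/web/item/88841 -> http://example.com/api/item/88841
function apiUrlFor(webUrl) {
  return webUrl.replace('/web/', '/api/');
}

// A crawler could then skip the 20MB React boot entirely:
//   const res = await fetch(apiUrlFor('http://example.com/web/item/88841'));
//   const item = await res.json(); // nicely formatted JSON, no headless browser
```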
I know because I've done more than my share of web crawling, and I have crawlers that: capture plain HTTP data, can run Javascript in a limited way, and can run React apps. The last one would blast right past Anubis without any trouble except for the rate limiting, which is not a big problem because when I crawl I hit fast, I hit hard, and I crawl once. [1] (There's a running gag in my pod that I can't visit the state of Delaware because of my webcrawling.)
[1] Ok, sometimes the way you avoid trouble is hit slow, hit soft, but still hit once. It's a judgement call if you can hit them before they knew what hit them or if you can blend in with the rest of the traffic.
I have no problem with bots scraping all my data, I have a problem with poorly-coded bots overloading my server, making it unusable for anybody else. I'm using Anubis on the web interface to an SVN server, so if the bots actually wanted the data, they could just run "svn co" instead of trying to scrape the history pages for 300k files.
> It seems like a whole lot of crap to me. Hostile webcrawlers, not to mention Google, frequently run Javascript these days.
I'm also rather unhappy that I had to deploy Anubis, but it's unfortunately the only thing that seemed to work, and the server load was getting so bad that the alternative was just disabling the SVN web interface altogether.
Anubis has become an annoying denial-of-service layer in front of sites that I would otherwise use. I hope its no-script mode gets enabled by default soon.
Incidentally, I read a short while ago that not having "Mozilla" in your user-agent will bypass Anubis, so give that a try.
https://gitlab.com/zipdox/anubis-bypass
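A hedged sketch of the reported trick: Anubis is said to only challenge clients whose User-Agent contains "Mozilla", so a non-browser UA may skip the interstitial entirely. Whether this works depends on the site's Anubis configuration, and it may have been patched since:

```javascript
// Reported heuristic (from the linked anubis-bypass project): only
// "Mozilla" user agents get the proof-of-work interstitial.
function wouldBeChallenged(userAgent) {
  return userAgent.includes('Mozilla');
}

// So, e.g., from Node:
//   await fetch('https://example.com/page', {
//     headers: { 'User-Agent': 'not-a-browser/1.0' }, // no "Mozilla"
//   });
```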
My options are using custom Chrome, migrating to Firefox, or proxying my traffic and making edits that way (e.g. doing the Anubis PoW there and injecting the cookie required).
Not stoked about any of these, although Firefox is a lot on my mind these days, and option #3 would be a good excuse to dust off my RPi.
I refuse to be boiled slowly by Google. With MV3, it was full-fat ad blockers. With MV4 it could very well be ALL ad-blockers.
And yeah, I concede that sounds conspiratorial - as conspiratorial as Google cracking down on your ability to run the ad-blocker of your choice would've sounded a decade ago.
Making you pay time, power, bandwidth, or money to access content does not significantly impede your browsing, so long as the cost is appropriately small. For the user above reporting thirty seconds of maxed-out CPU, that's excessive for a median normal person (but us hackers are not that).
If giving your unique burned-in crypto-attested device ID is acceptable, there’s an entire standard for that, and when your device is found to misbehave, your device can be banned. Nintendo, Sony, Xbox call this a “console ban”; it’s quite effective because it’s stunningly expensive to replace a device.
If submitting proof of citizenship through whatever attestation protocol is palatable, then Anubis could simply add the digital ID web standard and let users skip the proof of work in exchange for affirming that they have a valid digital ID. But this only works if your specific identity can be banned, or else AI crawlers will just send a valid anonymized digital ID header.
This problem repeats in every suggested outcome: either you make it more difficult for users to access a site, or you require users to waste energy to access a site, or you require identifiable information signed by a dependable third-party authority to be presented such that a ban is possible based on it. IP addresses don’t satisfy this; Apple IDs, trusted-vendor HSM-protected device identifiers, and digital passports do satisfy this.
If you have a solution that only presents barriers to excessive use and allows abusive traffic to be revoked without depending on IP address, browser fingerprint, or paid/state credentials, then you can make billions of dollars in twelve months.
Ideas welcome! This has been a problem since bots started scraping RSS feeds and republishing them as SEO blogs, and we still don’t have a solution besides Cloudflare and/or CPU-burning interstitials.
(ps. I do have a solution for this, but it would require physical builds, be mildly unprofitable over time with no growth potential, and incite government hostility towards privacy-preserving identity systems. A billionaire philanthropist could build it in a year and completely solve this problem. Sigh.)
This might seem contradictory, but I believe it is technically possible? What I don't believe is that this is how these solutions actually work currently. The idea is to prove that I am indeed a unique visitor who's a person according to the govt, without revealing the person's info to the site, and without revealing the site's info to the govt, even if they collude.
Same with the whole 18+ goof. I'd actually quite like to try age-gated communities, like ±5 years of my age. I feel a lot of conflict stems from people coming from a bit too different walks of life sometimes. Could even do high-confidence location-based gating this way, which could also be cool (as well as the exact opposite of cool, because of course).
It’s not difficult to solve this problem — the database schema and queries are dead simple! — it’s just exceedingly difficult to succeed if you're not a passport-issuing entity or an authorized monopoly of such.
In the model I described, the trust anchor would be the govt, so basically a centralized model like domain certs. This resolves the issues you list off, but brings others: what if the trust anchor isn't trustworthy and starts forging identities?
The alternative to that would then be web of trust stuff. But this is why I consider this to be a separate problem. If the core protocol could be laid out and standardized at least, then layering on another that makes this choice between centralized vs web of trust could be done separately.
Scratch that, this happens all the time. With a third party there's no way to revoke; with a government you can usually handle this physically.
Person W is welcome to have thousands of unique IDs if they want to, so long as when site X bans identity Y, that ban is applied to all of Person W’s present and future identities. Whether W has a single Y or a thousand Y makes no difference to me. I suppose some sites will care to restrict participation to a single Y per W, but e.g. in the general browsing a site with crawler/bot/AI shielding such as Anubis today, it’s completely irrelevant to them what your Y is so long as rate limits and bans apply to all Y of W rather than to the presented Y alone.
I'm not super well-versed in crypto though, so I confess this is a lot more conjecture than knowledge.
The only way I can imagine this working is:
1. You go to the government and request to have a digital ID generated.
2. The government generates a random number.
3. The government issues a request to an NGO to generate a new cryptographic object based on the random number, and receives back a retrieval number.
4. The government gives you the retrieval number, which you can use to get your digital ID from the NGO.
This way, the government only has the mapping between your identity and a random number, and the NGO only has the mapping between the random number and the generated object, with no possibility to deanonymize it because you don't present any ID to get it. Obviously, there must be no information exchange between the government and the NGO.
The construction would go basically like this:
pseudonym = VRF(secret_key, site_id)
The expectation is that you would have only one valid secret_key at any time, and it would be unknown to the government. This kind of scheme is called anonymous credentials in the literature, I believe. It can be established that the secret_key is govt-backed, but that's it.
The site_id would be e.g. domain cert public key or similar (domain ownership is a moving target, so just the domain name imo is not sound).
VRF is a verifiable random function. This is the magic ZK part.
Pseudonym is what you present to the site, i.e. the identity you go by.
This way the site can verify that this pseudonym was specifically issued for it (making it site unique), and that it belongs to a govt certified identity (of which there should be only one issued at a time per person). The VRF is deterministic, guaranteeing that it's the same person every time.
Revocation is annoying so I didn't bother thinking that through but should be fairly okay I think?
I believe this is robust to people forging arbitrary IDs, to sites colluding with each other in deanonymization, and colluding with the govt in the same. The only kickers I can think of are secret_key misuse (e.g. via duress) / theft / loss / sharing, and the trust anchor (the govt) being untrustworthy (forging invalid or duplicate identities). Would also need to handle people dying, but that would be pretty much just revocation.
I consider trust anchor issues out of scope. The remainder doesn't sound too bad to try defending for, and I think is also basically out of scope.
Potentially important edit: I'm not accounting for timing side channels here, which might be relevant during revocation or else.
Another: didn't mention but in my humble opinion cryptographically attesting people is unsound. People can't calculate crypto in their head, and can't recall long arbitrary strings of hex. What is appropriate to attest (if anything) is their devices instead. But that's a layer of complication I didn't want to deal with here.
Why, though? If you're the only one who knows it, nothing prevents you from creating as many identities for the same site as you wish.
Authentication works, doesn't it?
There are different metrics for cost, however. Based on CPU utilization and/or time, it's hard to argue that Anubis is a high price.
But if it is important to you to not run javascript for whatever reason, the price of access to a site using Anubis is rather high.
You put stuff on the public Internet, expect it to be read by everyone.
Don't like that? Put it behind a login.
How did the propaganda persuade people into accepting mass surveillance and normalising the invasion of privacy for something that was never really a problem?
I suspect the ones pushing this are the ones running these "crawlers" themselves, much like Cloudflare hosts providers of DDoS services.
What a coincidence that "identity verification" became a hot topic recently.
Don't even try to solve this nonexistent "problem", because that will only drive us further into authoritarian technocracy.
Crying “Conspiracy” in reply to a career Chicken Little is comedic. I’ve been raising warnings about identity verification looming on the horizon for perhaps fifteen years now; thanks to DejaNews for that early realization, I suppose.
> Don't even try to solve this nonexistent "problem", because that will only drive us further into authoritarian technocracy.
I would celebrate and tell all my friends if someone on this thread, on any thread, would explain how we solve this without bankrupting non-business site operators and without a third-party authority. Anubis is a band-aid at best, yet no better solution — not even an idea — is presented alongside your objections.
> You put stuff on the public Internet, expect it to be read by everyone.
My hobbyist forum can barely stay online eight hours a day due to crawler traffic. Someone scraped the entire site last year by spawning one request per page with no fork limit. It was down for a solid week after that, and now has very severe limits in place. I don't know how they can afford to stay running, but certainly "static only" isn't going to solve the CPU and bandwidth costs incurred by incompetent and redundant AI crawlers. So, by making their site public on today's infested internet, they've effectively made their content inaccessible.
> Don't like that? Put it behind a login.
As I noted above, one solution is payment — since free credentials registration is not an obstacle to AI bots, after all. For some reason people don’t like to charge money for hobbyist content if they can avoid it. I recognize why and am trying my best to discover a non-monetary solution on their behalf.
> I suspect the ones pushing this are the ones running these "crawlers" themselves, much like Cloudflare hosts providers of DDoS services.
I do not and have not run crawlers or AI agents, trainers, or other such shit at any time in the past thirty years, and I will continue to abstain from the entire category, which should be quite easy as I'm a retired sysop now attending full-time accounting school and giving the finger to the entire industry to pursue work that benefits humanity. Same reason I bother telling HN "the anonymity sky is falling" every so often: I'd much prefer it if we didn't have to sacrifice anonymity online to defeat scraper bots.
> No, no, no, hell fucking no!!
Please find a way to turn your vehemence and passion into a productive contribution, before it’s too late for all of us. As presented, your argument is neither supported nor persuasive, and your hostility only gives opponents of anonymity more arrows in their quiver to shoot at us.
I checked the value of navigator.hardwareConcurrency on my phone and it returns 9... I guess that explains it.
It looks like setting light performance mode in device optimisations (I don't game on my phone) turns off the S24's sole Cortex-X4.
I'm not sure what generation it is, but I bought it around a decade ago I think.
I'd immediately look into what happens for odd numbers, rounding, implicit type conversions etc. Or at least that's what I was taught when I first started programming.
Also, relying on "well, we know that X is always Y" is almost always a mistake; maybe not at first, but definitely in the future, because X will almost certainly stop being Y at some point. Defensive coding would catch such issues (with at the very least an assert somewhere to ensure X is indeed Y before continuing, so that we get a nice error when that assumption proves to be wrong).
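A sketch of that kind of assert, using the power-of-two core-count assumption from the article as the example:

```javascript
// Don't silently rely on "core count is always a power of two"; check it,
// so the assumption fails loudly instead of mis-sizing a thread pool.
function assertPowerOfTwoCores(cores) {
  // Bit trick: n is a power of two iff n > 0 and (n & (n - 1)) === 0.
  if (cores <= 0 || (cores & (cores - 1)) !== 0) {
    throw new Error(`assumed a power-of-two core count, got ${cores}`);
  }
  return cores;
}

// assertPowerOfTwoCores(8) passes; assertPowerOfTwoCores(9) throws.
```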
Also, they still might not have (though they probably learned). In the article they imply that each type of CPU core (what they call a "tier") will still have a power-of-two count, and one just happened to be 2^0. I'm not sure they were around when the AMD Athlon II X3 was hot.
>>> Today I learned this was possible. This was a total "today I learned" moment. I didn't actually think that hardware vendors shipped processors with an odd number of cores, however if you look at the core geometry of the Pixel 8 Pro, it has three tiers of processor cores. I guess every assumption that developers have about CPU design is probably wrong.
I never thought about it before, but I actually had to look up die shots to make sure they were not the same processor, and if I can trust the internet, they are not. Hell, I had to confirm that yes, the PlayStation 3 (also PPC, cue X-Files theme) only had the one core and its screwball subprocessors, like I remembered.
Yeah that's obviously not true, and believing it shows a marked lack of experience in the field. Of the current Xeon workstation lineup, only 3 of 14 SKUs have power-of-2 core counts. And there are consumer lines of CPUs with 6 cores and that sort of thing.
I realize Anubis was probably never tested on a true single-core machine. They are actually somewhat difficult to find these days outside of microcontrollers.
Javascripters, perhaps. Those who work on schedulers, or kernels in general, would find this completely normal.
Why?
What would the alternative have been?
The first effect is great, because it's a lot more annoying to bring up a full browser environment in your scraper than just run a curl command.
But the actual proof of work only takes about 10ms on a server in native code, while it can take multiple seconds on a low-end phone. Given that the companies in question are building entire data centers to house all their GPUs, an extra 10ms per web page is not a problem for them. They're going to spend orders of magnitude more compute actually training on the content they scraped than solving the challenge.
It's mostly the inconvenience of adapting to Anubis's JS requirements that held them back for a while, but the PoW difficulty mostly slowed down real users.
you can even get a curl fork that drives a browser under the hood
What the Anubis POW system is doing right now is exploiting the fact that there's been no need for crawlers to be anything but naive. But the cost to make them sophisticated enough to defeat the POW system is quite low, and when that happens, the POW will just be annoying legit users for no benefit.
I don't know if "mistake" is the word I'd use for it. It's not a whole lot of code! It's a reasonable first step to force crawlers to emulate a tiny fraction of a real browser. But as it evolves, it should evolve away from burning compute, because that's playing to lose.
However the exact PoW implementation (hash) chosen by Anubis might significantly reduce this asymmetry, because the calculation speed is highly dependent on hardware.
Unfortunately for the user on a low-end phone, the overhead can be several seconds. For the scraper it's only ever 10ms because that's running on a (relatively) powerful server CPU.
Tavis Ormandy went into more detail on the math here, but it's not great!
(1) there's a sharp asymmetry between adversaries and legitimate users (as with password hashes and KDFs, or anti-abuse systems where the marginal adversarial request has value roughly reciprocal to what a legit user gets, as with brute-forcing IDs)
(2) the POW serves as a kind of synchronization clock in a distributed system (as with blockchains)
What's case (3) here?
In an adversarial engineering domain, neither the problems nor the solutions are static. If by some miracle you have a perfect solution at one point in time, the adversaries will quickly adapt, and your solution stops being perfect.
So you’ll mostly be playing the game in this shifting gray area of maybe legit, maybe abusive cases. Since you can’t perfectly classify them (if you could, they wouldn’t be in the gray area), the options are basically to either block all of them, allow all of them, or issue them a challenge that the user must pass to be allowed. The first two options tend to be unacceptable in the gray area, so issuing a challenge that the client must pass is usually the preferred option.
A good counter-abuse challenge is something that has at least one of the following properties:
1. It costs more to pass than the economic value that the adversary can extract from the service, but not so much that the legitimate users won’t be willing to pay it.
2. It proves control of a scarce resource without necessarily having to spend that resource, but at least in such a way that the same scarce resource can’t be used to pass unlimited challenges.
3. It produces additional signals that can be used to meaningfully improve the precision/recall tradeoff.
And proof of work does none of those. The last two fail by construction, since compute is about the most fungible resource in the world. The first doesn't work since it's impossible to balance the difficulty factor such that it imposes a cost the attacker would notice but would still be acceptable to the defender.
If you add 10s to the latency for your worst-case real users (already too long), it'll cost the attacker about $0.01/1k solves. That's not a deterrent to any kind of abuse.
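The arithmetic behind that estimate, with my own assumed numbers for solve time and compute price:

```javascript
// Back-of-envelope cost to an attacker. Both inputs are assumptions,
// not measurements: attackers solve far faster than a throttled phone.
const solveSeconds = 1.0;  // assumed per-challenge solve time on server CPU
const coreHourUsd = 0.04;  // assumed spot/preemptible core-hour price
const costPer1kSolves = (1000 * solveSeconds / 3600) * coreHourUsd;
// ≈ $0.011 per thousand solves: real users eat seconds of latency while
// the attacker's bill rounds to a penny.
```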
So proof of work just is a really bad fit for this specific use case. The only advantage is that it is easy to implement, but that's a very short term benefit.
https://github.com/TecharoHQ/anubis/pull/1038
Could someone explain how this would help stop scrapers? If you’re just running the page JS wouldn’t this run too and let you through?
> how this would help stop scrapers
I think anubis bases its purpose on some flawed assumptions:
- that most scrapers aren't headless browsers
- that they don't have access to millions of different IPs across the world from big/shady proxy companies
- that this can help with a real network-level DDoS
- that scrapers will give up if the requests become 'too expensive'
- that they aren't contributing to warming the planet
I'm sure there does exist some older bots that are not smart and don't use headless browsers, but especially with newer tech/AI crawlers/etc., I don't think this is a realistic majority assumption anymore.