Show HN: I was curious about spherical helix, ended up making this visualization (visualrambling.space)

I've been working on a web crawler and have been trying to make it as friendly as possible. Strictly checking robots.txt, crawling slowly, clear identification in the User Agent string, single IP source address. But I've noticed some anti-bot tricks getting applied to the robot.txt file itself. The latest was a slow loris approach where it takes forever for robots.txt to download. I accidentally treated this as a 404, which then meant I continued to crawl that site. I had to change the code so a robots.txt timeout is treated like a Disallow /.

It feels odd because I find I'm writing code to detect anti-bot tools even though I'm trying my best to follow conventions.

navane · 1h ago

That's like deterring burglars by hiding your doorbell

lwansbrough · 7h ago

We solved a lot of our problems by blocking all Chinese ASNs. Admittedly, not the friendliest solution, but there were so many issues originating from Chinese clients that it was easier to just ban the entire country.

It's not like we can capitalize on commerce in China anyway, so I think it's a fairly pragmatic approach.

sugarpimpdorsey · 7h ago

There's some weird ones you'd never think of that originate an inordinate amount of bad traffic. Like Seychelles. A tiny little island nation in the middle of the ocean inhabited by... bots apparently? Cyprus is another one.

Re: China, their cloud services seem to stretch to Singapore and beyond. I had to blacklist all of Alibaba Cloud and Tencent and the ASNs stretched well beyond PRC borders.

grandinj · 7h ago

There is a Chinese player that has taken effective control of various internet-related entities in the Seychelles. Various ongoing court-cases currently.

So the seychelles traffic is likely really disguised chinese traffic.

supriyo-biswas · 6h ago

I don't think these are "Chinese players" and is linked to [1], although it may be that the hands changed many times that the IP addresses have been leased or bought by Chinese entities.

[1] https://mybroadband.co.za/news/internet/350973-man-connected...

galaxy_gas · 5h ago

this all from Cloud Innovation vpns,proxies,spam,bots CN Seychelles IP holder

sylware · 4h ago

I forgot about that: all the nice game binaries from them running directly on nearly all systems...

lukan · 52m ago

Huh? Who is them in this case?

sylware · 5h ago

omg... that's why my self-hosted servers are getting nasty trafic from SC all the time.

The explanation is that easy??

lwansbrought · 4h ago

> So the seychelles traffic is likely really disguised chinese traffic.

Soon: chineseplayer.io

seanhunter · 7h ago

The Seychelles has a sweetheart tax deal with India such that a lot of corporations who have an India part and a non-India part will set up a Seychelles corp to funnel cash between the two entities. Through the magic of "Transfer Pricing"[1] they use this to reduce the amount of tax they need to pay.

It wouldn't surprise me if this is related somehow. Like maybe these are Indian corporations using a Seychelles offshore entity to do their scanning because then they can offset the costs against their tax or something. It may be that Cyprus has similar reasons. Istr that Cyprus was revealed to be important in providing a storefront to Russia and Putin-related companies and oligarchs.[2]

So Seychelles may be India-related bots and Cyprus Russia-related bots.

[1] https://taxjustice.net/faq/what-is-transfer-pricing/#:~:text...

[2] Yup. My memory originated in the "Panama Papers" leaks https://www.icij.org/investigations/cyprus-confidential/cypr...

johnisgood · 5h ago

Yeah I am banning whole Singapore and China, for one.

I am getting down-voted for saying I ban whole Singapore and China? Oh lord... OK. Please all the down-voters list your public facing websites. I do not care if people from China cannot access my website. They are not the target audience, and they are free to use VPNs if they so wish, or Tor, or whatever works for them, I have not banned them yet, for my OWN PERSONAL SHITTY WEBSITE, inb4 you want to moderate the fuck out of what I can and cannot do on my own server(s). Frankly, fuck off, or be a hero and die a martyr. :D

Bender · 16m ago

Ignore the trolls. Also, if they are upset with you they should focus their vitriol on me. I block nearly all of BRICS especially Brazil as most are hard wired to not follow even the simplest of rules, most data-centers, some VPN's based on MSS, posting from cell phones and much more. I am always happy to give people the satisfaction of down-voting me since I use uBlock to hide karma.

nkrisc · 3h ago

People get weird when you do what you want with your own things.

Want to block an entire country from your site? Sure, it’s your site. Is it fair? Doesn’t matter.

lmz · 3h ago

> be a hero and die a martyr

I believe it's "an hero".

johnisgood · 3h ago

Oh thank you kind sir.

mnw21cam · 3h ago

Uh, no, it's definitely not. Hero begins with a consonant, so it should be preceded by "a", not "an".

lmz · 3h ago

https://knowyourmeme.com/memes/an-hero

nailer · 3h ago

Welcome to British English. The h in hero isn’t pronounced, same as hospital, so you use an before it.

tmp123456au · 2h ago

This is wrong.

Unless maybe you're from the east end of london.

nailer · 2h ago

I’m not claiming everyone pronounces it that way. But he’s an ero, we need to find an ospital, ninety miles an our. You will find government documents and serious newspapers that refer to an hospital.

alistairSH · 3m ago

Generic American English pronounces the 'h' in hospital, hero, heroine, but not hour.

Same is true for RP English.

Therefore, for both accents/dialects, the correct phrases are "a hotel", "a hero", "a heroine", and "an hour".

Cockney, West Country, and a few other English accents "h drop" and would use "an 'our", "an 'otel", etc.

ralferoo · 40m ago

Likewise, when I was at school, many of my older teachers would say things like "an hotel" although I've not heard anyone say anything but "a hotel" for decades now. I think I've heard "an hospital" relatively recently though.

Weirdly, in certain expressions I say "before mine eyes" even though that fell out of common usage centuries ago, and hasn't really appeared in literature for around a century. So while I wouldn't have encountered it in speech, I've come across enough literary references that it somehow still passed into my diction. I only ever use it for "eyes" though, never anything else starting with a vowel. I also wouldn't use it for something mundane like "My eyes are sore", but I'm not too clear on when or why I use the obsolete form at other times - it just happens!

nutjob2 · 2h ago

That's not right. It's:

a hospital

an hour

a horse

It all comes down to how the word is pronounced but it's not consistent. 'H' can sound like it's missing on not. Same with other leading consonants that need an 'an'. Some words can go both ways.

pabs3 · 4h ago

Which site is it?

johnisgood · 4h ago

My own shitty personal website that is so uninteresting that I do not even wish to disclose here. Hence my lack of understanding of the down-votes for me doing what works for my OWN shitty website, well, server.

In fact, I bet it would choke on a small amount of traffic from here considering it has a shitty vCPU with 512 MB RAM.

spacecadet · 2h ago

Downvoted by bots.

johnisgood · 2h ago

Thanks, appreciate it. I would hope so. I do not care about down-votes per se, my main complaint is really the fact that I am somehow in the wrong for doing what I deem is right for my shitty server(s).

sylware · 5h ago

ucloud ("based in HK") has been an issue (much less lately though), and I had to ban the whole digital ocean AS (US). google cloud, aws and microsoft have also some issues...

hostpapa in the US seems to become the new main issue (via what seems a 'ip colocation service'... yes, you read well).

sim7c00 · 6h ago

its not weird .its companies putting themselves in places where regulations favor their business models.

it wont be all chinese companies or ppl doing the scraping. its well known that a lot of countries dont mind such traffic as long as it doesnt target themselves or for the west also some allies.

laws arent the same everywhere and so companies can get away with behavior in one place which seem almost criminal in another.

and what better place to put your scrapers than somewhere where there is no copyright.

russia also had same but since 2012 or so they changed laws and a lot of traffic reduced. companies moved to small islands or small nation states (favoring them with their tax payouts, they dont mind if j bring money for them) or few remaining places like china who dont care for copyrights.

its pretty hard to get really rid of such traffic. you can block stuff but mostly it will just change the response your server gives. flood still knockin at the door.

id hope someday maybe ISPs or so get more creative but maybe they dont have enough access and its hard to do this stuff without the right access into the traffic (creepy kind) or running into accidentally censoring the whole thing.

adzicg · 4h ago

We solved a similar issue by blocking free user traffic from data centres (and whitelisted crawlers for SEO). This eliminated most fraudulent usage over VPNs. Commercial users can still access, but free just users get a prompt to pay.

CloudFront is fairly good at marking if someone is accessing from a data centre or a residential/commercial endpoint. It's not 100% accurate and really bad actors can still use infected residential machines to proxy traffic, but this fix was simple and reduced the problem to a negligent level.

lxgr · 7h ago

Why stop there? Just block all non-US IPs!

If it works for my health insurance company, essentially all streaming services (including not even being able to cancel service from abroad), and many banks, it’ll work for you as well.

Surely bad actors wouldn’t use VPNs or botnets, and your customers never travel abroad?

raffraffraff · 7h ago

And across the water, my wife has banned US IP addresses from her online shop once or twice. She runs a small business making products that don't travel well, and would cost a lot to ship to the US. It's a huge country with many people. Answering pointless queries, saying "No, I can't do that" in 50 different ways and eventually dealing with negative reviews from people you've never sold to and possibly even never talked to... Much easier to mass block. I call it network segmentation. She's also blocked all of Asia, Africa, Australia and half of Europe.

The blocks don't stay in place forever, just a few months.

silisili · 7h ago

Google Shopping might be to blame here, and I don't at all blame the response.

I say that because I can't count how many times Google has taken me to a foreign site that either doesn't even ship to the US, or doesn't say one way or another and treat me like a crazy person for asking.

lxgr · 7h ago

As long as your customer base never travels and needs support, sure, I guess.

The only way of communicating with such companies are chargebacks through my bank (which always at least has a phone number reachable from abroad), so I’d make sure to account for these.

542354234235 · 33m ago

Usually CC companies require email records (another way of communicating with a company) showing you attempted to resolve the problem but could not. I don’t think “I tried to visit the website that I bought X item from while in Africa and couldn’t get to it” is sufficient.

closewith · 6h ago

Chargebacks aren't the panacea you're used to outside the US, so that's a non-issue.

lxgr · 4h ago

Only if your bank isn't competent in using them.

Visa/Mastercard chargeback rules largely apply worldwide (with some regional exceptions, but much less than many banks would make you believe).

closewith · 4h ago

No, outside the US, both Visa and Mastercard regularly side with the retailer/supplier. If you process a chargeback simply because a UK company blocks your IP, you will be denied.

lxgr · 2h ago

Visa and Mastercard aren't even involved in most disputes. Almost all disputes are settled between issuing and acquiring bank, and the networks only step in after some back and forth if the two really can't figure out liability.

I've seen some European issuing banks completely misinterpret the dispute rules and as a result deny cardholder claims that other issuers won without any discussion.

closewith · 48m ago

> Visa and Mastercard aren't even involved in most disputes. Almost all disputes are settled between issuing and acquiring bank, and the networks only step in after some back and forth if the two really can't figure out liability.

Yes, the issuing and acquiring banks perform an arbitration process, and it's generally a very fair process.

We disputed every chargeback and post PSD2 SCA, we won almost all and had a 90%+ net recovery rate. Similar US businesses were lucky to hit 10% and were terrified of chargeback limits.

> I've seen some European issuing banks completely misinterpret the dispute rules and as a result deny cardholder claims that other issuers won without any discussion.

Are you sure? More likely, the vendor didn't dispute the successful chargebacks.

antonkochubey · 2h ago

One of requirements of Visa/Mastercard is for the customer to be able to contact merchant post-purchase.

closewith · 2h ago

Only via the original method of commerce. An online retailer who geoblocks users does not have to open the geoblock for users who move into the geoblocked regions.

I have first-hand experience, as I ran a company that geoblocked US users for legal reasons and successfully defended chargebacks by users who made transactions in the EU and disputed them from the US.

Chargebacks outside the US are a true arbitration process, not the rubberstamped refunds they are there.

silisili · 7h ago

I'm not precisely sure the point you're trying to make.

In my experience running rather lowish traffic(thousands hits a day) sites, doing just that brought every single annoyance from thousands per day to zero.

Yes, people -can- easily get around it via various listed methods, but don't seem to actually do that unless you're a high value target.

lxgr · 7h ago

It definitely works, since you’re externalizing your annoyance to people you literally won’t ever hear from because you blanket banned them based. Most of them will just think your site is broken.

marginalia_nu · 1h ago

This isn't coming from nowhere though. China and Russia don't just randomly happen to have been assigned more bad actors online.

Due to frosty diplomatic relations, there is a deliberate policy to do fuck all to enforce complaints when they come from the west, and at least with Russia, this is used as a means of gray zone cyberwarfare.

China and Russia are being antisocial neighbors. Just like in real life, this does have ramifications for how you are treated.

raincole · 3h ago

In other words, a smart business practice.

aspenmayer · 5h ago

It seems to be a choice they’re making with their eyes open. If folks running a storefront don’t want to associate with you, it’s not personal in that context. It’s business.

motorest · 3h ago

> Why stop there? Just block all non-US IPs!

This is a perfectly good solution to many problems, if you are absolutely certain there is no conceivable way your service will be used from some regions.

> Surely bad actors wouldn’t use VPNs or botnets, and your customers never travel abroad?

Not a problem. Bad actors which are motivated enough to use VPNd or botnets are a different class of attacks that have different types of solutions. If you eliminate 95% of your problems with a single IP filter them you have no good argument to make against it.

calgoo · 3h ago

This. If someone wants to target you, they will target you. What this does is remove the noise and 90%+ of crap.

Basically the same thing as changing the ssh port on a public facing server, reduce the automated crap attacks.

paulcole · 1h ago

> if you are absolutely certain there is no conceivable way your service will be used from some regions.

This isn’t the bar you need to clear.

It’s “if you’re comfortable with people in some regions not being able to use your service.”

lwansbrough · 7h ago

Don't care, works fine for us.

yupyupyups · 7h ago

And that's perfectly fine. Nothing is completely bulletproof anyway. If you manage to get rid of 90% of the problem then that's a good thing.

ruszki · 4h ago

Okay, but this causes me about 90% of my major annoyances. Seriously. It’s almost always these stupid country restrictions.

I was in UK. I wanted to buy a movie ticket there. Fuck me, because I have an Austrian ip address, because modern mobile backends pass your traffic through your home mobile operator. So I tried to use a VPN. Fuck me, VPN endpoints are blocked also.

I wanted to buy a Belgian train ticket still from home. Cloudflare fuck me, because I’m too suspicious as a foreigner. It broke their whole API access, which was used by their site.

I wanted to order something while I was in America at my friend’s place. Fuck me of course. Not just my IP was problematic, but my phone number too. And of course my bank card… and I just wanted to order a pizza.

The most annoying is when your fucking app is restricted to your stupid country, and I should use it because your app is a public transport app. Lovely.

And of course, there was that time when I moved to an other country… pointless country restrictions everywhere… they really helped.

I remember the times when the saying was that the checkout process should be as frictionless as possible. That sentiment is long gone.

42lux · 3h ago

The vpn is probably your problem there mate.

sarchertech · 2h ago

> I wanted to order something while I was in America at my friend’s place. Fuck me of course. Not just my IP was problematic, but my phone number too.

Your mobile provider was routing you through Austria while in the US?

nucleardog · 1h ago

Not OP, but as far as I know that's how it works, yeah.

When I was in China, using a Chinese SIM had half the internet inaccessible (because China). As I was flying out I swapped my SIM back to my North American one... and even within China I had fully unrestricted (though expensive) access to the entire internet.

I looked into it at the time (now that I had access to non-Chinese internet sites!) and forgot the technical details, but seems that this was how the mobile network works by design. Your provider is responsible for your traffic.

lxgr · 7h ago

And if your competitor manages to do so without annoying the part of their customer base that occasionally leaves the country, everybody wins!

yupyupyups · 5h ago

Fair point, that's something to consider.

runroader · 1h ago

Oddly, my bank has no problem with non-US IPs, but my City's municipal payments site doesn't. I always think it's broken for a moment before realizing I have my VPN turned on.

mort96 · 7h ago

The percentage of US trips abroad which are to China must be minuscule, and I bet nobody in the US regularly uses a VPN to get a Chinese IP address. So blocking Chinese IP addresses is probably going to have a small impact on US customers. Blocking all abroad IP addresses, on the other hand, would impact people who just travel abroad or use VPNs. Not sure what your point is or why you're comparing these two things.

mvdtnz · 7h ago

You think all streaming services have banned non US IPs? What world do you live in?

lxgr · 6h ago

This is based on personal experience. At least two did not let me unsubscribe from abroad in the past.

throwawayffffas · 6h ago

Not letting you unsubscribe and blocking your IP are very different things.

There are some that do not provide services in most countries but Netflix, Disney, paramount are pretty much global operations.

HBO and peacock might not be available in Europe but I am guessing they are in Canada.

rtpg · 5h ago

I think a lot of services end up sending you to a sort of generic "not in your country yet!" landing page in an awkward way that can make it hard to "just" get to your account page to do this kind of stuff.

Netflix doesn't have this issue but I've seen services that seem to make it tough. Though sometimes that's just a phone call away.

Though OTOH whining about this and knowing about VPNs and then complaining about the theoretical non-VPN-knower-but-having-subscriptions-to-cancel-and-is-allergic-to-phone-calls-or-calling-their-bank persona... like sure they exist but are we talking about any significant number of people here?

misiek08 · 5h ago

In Europe we have all of them, with only few movies unavailable or additionally paid occasionally. Netflix, Disney, HBO, Prime and others work fine.

Funny to see how narrow perspective some people have…

lxgr · 4h ago

Obligatory side note of "Europe is not a country".

In several European countries, there is no HBO since Sky has some kind of exclusive contract for their content there, and that's where I was accordingly unable to unsubscribe from an US HBO plan.

lxgr · 4h ago

> Not letting you unsubscribe and blocking your IP are very different things.

How so? They did not let me unsubscribe via blocking my IP.

Instead of being able to access at least my account (if not the streaming service itself, which I get – copyright and all), I'd just see a full screen notice along the lines of "we are not available in your market, stay tuned".

sylware · 4h ago

Won't help: I get scans and script kiddy hack attempts from digital ocean, microsoft cloud (azure, stretchoid.com), google cloud, aws, and lately "hostpapa" via its 'IP colocation service'. Ofc it is instant fail-to-ban (it is not that hard to perform a basic email delivery to an existing account...).

Traffic should be "privatize" as much as possible between IPv6 addresses (because you still have 'scanners' doing the whole internet all the time... "the nice guys scanning the whole internet for your protection... never to sell any scan data ofc).

Public IP services are done for: going to be hell whatever you do.

The right answer seems significantly big 'security and availability teams' with open and super simple internet standards. Yep the javascript internet has to go away and the app private protocols have too. No more whatng cartel web engine, or the worst: closed network protocols for "apps".

And the most important: hardcore protocol simplicity, but doing a good enough job. It is common sense, but the planned obsolescence and kludgy bloat lovers won't let you...

thrown-0825 · 7h ago

If you are traveling without a vpn then you are asking for trouble

lxgr · 6h ago

Yes, and I’m arguing that that’s due to companies engaging in silly pseudo-security. I wish that would stop.

ordu · 6h ago

It is not silly pseudo-security, it is economics. Ban Chinese, lower your costs while not losing any revenue. It is capitalism working as intended.

lxgr · 4h ago

Not sure I'd call dumping externalities on a minority of your customer base without recourse "capitalism working as intended".

Capitalism is a means to an end, and allowable business practices are a two-way street between corporations and consumers, mediated by regulatory bodies and consumer protection agencies, at least in most functioning democracies.

ordu · 47m ago

Maybe, but it doesn't change the fact, that no one is going to forbid me to ban IPs. Therefore I will ban IPs and IPs ranges because it is the cheapest solution.

snickerbockers · 5h ago

Lmao I came here to post this. My personal server was making constant hdd grinding noises before I banned the entire nation of China. I only use this server for jellyfin and datahoarding so this was all just logs constantly rolling over from failed ssh auth attempts (PSA: always use public-key, don't allow root, and don't use really obvious usernames like "webadmin" or <literally just the domain>).

Xiol32 · 4h ago

Changing the SSH port also helps cut down the noise, as part of a layered strategy.

dotancohen · 4h ago

Are you familiar with port knocking? My servers will only open port 22, or some other port, after two specific ports have been knocked on in order. It completely eliminates the log files getting clogged.

azthecx · 4h ago

Did you really notice a significant drop off in connection attempts? I tried this some years ago and after a few hours on a random very high port number I was already seeing connections.

Bender · 10m ago

I use a non standard port and have not had an unknown IP hit it in over 25 years.

gessha · 2h ago

I have my jellyfin and obsidian couchdb sync on my Tailscale and they don’t see any public traffic.

johnisgood · 4h ago

Most of the traffic comes from China and Singapore, so I banned both. I might have to re-check and ban other regions who would never even visit my stupid website anyway. The ones who want to are free to, through VPN. I have not banned them yet.

thrown-0825 · 7h ago

Block Russia too, thats where i see most of my bot traffic coming from

debesyla · 2h ago

And usually hackers/malicious actors from that country are not afraid to attack anyone that is not russian, because their local law permits attacking targets in other countries.

(It sometimes comes to funny situations where malware doesn't enable itself on Windows machines if it detects that russian language keyboard is installed.)

imiric · 5h ago

Lately I've been thinking that the only viable long-term solution are allowlists instead of blocklists.

The internet has become a hostile place for any public server, and with the advent of ML tools, bots will make up far more than the current ~50% of all traffic. Captchas and bot detection is a losing strategy as bot behavior becomes more human-like.

Governments will inevitably enact privacy-infringing regulation to deal with this problem, but for sites that don't want to adopt such nonsense, allowlists are the only viable option.

I've been experimenting with a system where allowed users can create short-lived tokens via some out-of-band mechanism, which they can use on specific sites. A frontend gatekeeper then verifies the token, and if valid, opens up the required public ports specifically for the client's IP address, and redirects it to the service. The beauty of this system is that the service itself remains blocked at the network level from the world, and only allowed IP addresses are given access. The only publicly open port is the gatekeeper, which only accepts valid tokens, and can run from a separate machine or network. It also doesn't involve complex VPN or tunneling solutions, just a standard firewall.

This should work well for small personal sites, where initial connection latency isn't a concern, but obviously wouldn't scale well at larger scales without some rethinking. For my use case, it's good enough.

seer · 2h ago

I guess this is what "Identity aware proxy" from GCP can do for you? Outsource all of this to google - where you can connect your own identity servers, and then your service will only be accessed after the identity has been verified.

We have been using that instead of VPN and it has been incredibly nice and performant.

imiric · 35m ago

Yeah, I suppose it's something like that. Except that my solution wouldn't rely on Google, would be open source and self-hostable. Are you aware of a similar project that does this? Would save me some time and effort. :)

There also might be similar solutions for other cloud providers or some Kubernetes-adjacent abomination, but I specifically want something generic and standalone.

phplovesong · 7h ago

We block China and Russia. DDOS attacks and other hack attempts went down by 95%.

We have no chinese users/customers so in theory this does not effect business at all. Also russia is sanctioned and our russian userbase does not actually live in russia, so blocking russia did not effect users at all.

praptak · 6h ago

How did you choose where to get the IP addresses to block? I guess I'm mostly asking where this problem (i.e. "get all IPs for country X") is on the scale from "obviously solved" to "hard and you need to play catch up constantly".

I did a quick search and found a few databases but none of them looks like the obvious winner.

spc476 · 5h ago

I used CYMRU <https://www.team-cymru.com/ip-asn-mapping> to do the mapping for the post.

preinheimer · 2h ago

MaxMind is very common, IPInfo is also good. https://ipinfo.io/developers/database-download

If you want to test your IP blocks, we have servers on both China and Russia, we can try to take a screenshot from there to see what we get (free, no signup) https://testlocal.ly/

tietjens · 6h ago

The common cloud platforms allow you to do geo-blocking.

bakugo · 6h ago

Maxmind's GeoIP database is the industry standard, I believe. You can download a free version of it.

If your site is behind cloudflare, blocking/challenging by country is a built-in feature.

mavamaarten · 7h ago

Same here. It sucks. But it's just cost vs reward at some point.

zelphirkalt · 2h ago

Which will ensure you never get customers from those countries. And so the circle closes ...

maldonad0 · 6m ago

I used to run a forum that blocked IPs from outside Western Europe. I had no interest in users from beyond. It's not all about money.

42lux · 2h ago

The regulatory burden of conducting business with countries like Russia or China is a critical factor that offhand comments like yours consistently overlook.

bob1029 · 5h ago

I think a lot of really smart people are letting themselves get taken for a ride by the web scraping thing. Unless the bot activity is legitimately hammering your site and causing issues (not saying this isn't happening in some cases), then this mostly amounts to an ideological game of capture the flag. The difference being that you'll never find their flag. The only thing you win by playing is lost time.

The best way to mitigate the load from diffuse, unidentifiable, grey area participants is to have a fast and well engineered web product. This is good news, because your actual human customers would really enjoy this too.

phito · 5h ago

My friend has a small public gitea instance, only use by him a a few friends. He's getting thousounds of requests an hour from bots. I'm sorry but even if it does not impact his service, at the very least it feels like harassment

ralferoo · 24m ago

What's worse is when you get bots blasting HTTP traffic at every open port, even well known services like SMTP. Seriously, it's a mail server. It identified itself as soon as the connection was opened, if they waited 100ms-300ms before spamming, they'd know that it wasn't HTTP because the other side wouldn't send anything at all if it was. There's literally no need to bombard a mail server on a well known port by continuing to send a load of junk that's just going to fill someone's log file.

dmesg · 4h ago

Yes and it makes reading your logs needlessly harder. Sometimes I find an odd password being probed, search for it on the web and find an interesting story, that a new backdoor was discovered in a commercial appliance.

In that regard reading my logs led me sometimes to interesting articles about cyber security. Also log flooding may result in your journaling service truncating the log and you miss something important.

rollcat · 2h ago

> Sometimes I find an odd password being probed, search for it on the web and find an interesting story [...].

Yeah, this is beyond irresponsible. You know the moment you're pwned, __you__ become the new interesting story?

For everyone else, use a password manager to pick a random password for everything.

Thorrez · 2h ago

What is beyond irresponsible? Monitoring logs and researching odd things found there?

rollcat · 57m ago

The way to handle a password:

    plaintextPassword = POST["password"]
    ok = bcryptCompare(hashedPassword, plaintextPassword)
    // (now throw away POST and plaintextPassword)
    if (ok) { ... }

Bonus points: on user lookup, when no user is found, fetch a dummy hashedPassword, compare, and ignore the result. This will partially mitigate username enumeration via timing attacks.

JohnFen · 1h ago

How are passwords ending up in your logs? Something is very, very wrong there.

dmesg · 51m ago

Does an attacking bot know your webserver is not a misconfigured router exposing its web interface to the net? I often am baffled what conclusions people come up with from half reading posts. I had bots attack me with SSH 2.0 login attempts on port 80 and 443. Some people underestimate how bad at computer science some skids are.

wvbdmp · 4h ago

You log passwords?

No comments yet

wraptile · 4h ago

> thousounds of requests an hour from bots

That's not much for any modern server so I genuinely don't understand the frustration. I'm pretty certain gitea should be able to handle thousands of read requests per minute (not per hour) without even breaking a sweat.

q3k · 3h ago

Serving file content/diff requests from gitea/forgejo is quite expensive computationally. And these bots tend to tarpit themselves when they come across eg. a Linux repo mirror.

https://social.hackerspace.pl/@q3k/114358881508370524

rollcat · 2h ago

I think at this point every self-hosted forge should block diffs from anonymous users.

Also: Anubis and go-away, but also: some people are on old browsers or underpowered computers.

bob1029 · 4h ago

Thousands of requests per hour? So, something like 1-3 per second?

If this is actually impacting perceived QoS then I think a gitea bug report would be justified. Clearly there's been some kind of a performance regression.

Just looking at the logs seems to be an infohazard for many people. I don't see why you'd want to inspect the septic tanks of the internet unless absolutely necessary.

themafia · 4h ago

The way I get a fast web product is to pay a premium for data. So, no, it's not "lost time" by banning these entities, it's actual saved costs on my bandwidth and compute bills.

The bonus is my actual customers get the same benefits and don't notice any material loss from my content _not_ being scraped. How you see this as me being secretly taken advantage of is completely beyond me.

threeducks · 2h ago

> The best way to mitigate the load from diffuse, unidentifiable, grey area participants is to have a fast and well engineered web product.

I wonder what all those people are doing that their server can't handle the traffic. Wouldn't a simple IP-based rate limit be sufficient? I only pay $1 per month for my VPS, and even that piece of trash can handle 1000s of requests per second.

rollcat · 2h ago

> I only pay $1 per month for my VPS, and even that piece of trash can handle 1000s of requests per second.

Depends on the computational cost per request. If you're serving static content from memory, 10k/s sounds easy. If you constantly have to calculate diffs across ranges of commits, I imagine a couple dozen can bring your box down.

Also: who's your webhost? $1/m sounds like a steal.

johnnyfaehell · 4h ago

While we may be smart, a lot of us are extremely pedantic about tech things. I think for many if they did nothing it would wind them up the wall while doing something the annoyance is smaller.

ricardo81 · 2h ago

>an ideological game of capture the flag

I prefer the whack a mole analogy.

I've seen forums where people spend an inordinate amount of time identifying 'bad bots' for blocking, there'll always be more.

alphazard · 36m ago

I'm always a little surprised to see how many people take robots.txt seriously on HN. It's nice to see so many folks with good intentions.

However, it's obviously not a real solution. It depends on people knowing about it, and adding the complexity of checking it to their crawler. Are there other more serious solutions? It seems like we've heard about "micropayments" and "a big merkle tree of real people" type solutions forever and they've never materialized.

ralferoo · 30m ago

> It depends on people knowing about it, and adding the complexity of checking it to their crawler.

I can't believe any bot writer doesn't know about robots.txt. They're just so self-obsessed and can't comprehend why the rules should apply to them, because obviously their project is special and it's just everyone else's bot that causes trouble.

Bender · 7m ago

(malicious) Bot writers have exactly zero concern for robots.txt. Most bots are malicious. Most bots don't set most of the TCP/IP flags. Their only concern is speed. I block about 99% of bots by simply dropping any TCP SYN packet that is missing MSS or uses a strange value.

    -A PREROUTING -i eth0 -p tcp -m tcp -d $INTERNET_IP --syn -m tcpmss ! --mss 1280:1460 -j DROP

Example rule from the netfilter raw table.

firefoxd · 7h ago

Since I posted an article here about using zip bombs [0], I'm flooded with bots. I'm constantly monitoring and tweaking my abuse detector, but this particular bot mentioned in the article seemed to be pointing to an RSS reader. I white listed it at first. But now that I gave it a second look, it's one of the most rampant bot on my blog.

[0]: https://news.ycombinator.com/item?id=43826798

dmurray · 6h ago

If I had a shady web crawling bot and I implemented a feature for it to avoid zip bombs, I would probably also test it by aggressively crawling a site that is known to protect itself with hand-made zip bombs.

sim7c00 · 3h ago

also protect yourself fromnsucking up fake generated content. i know some folks here like to feed them all sorts of 'data' . fun stuff :D

JdeBP · 3h ago

One of the few manual deny-list entries that I have made was not for a Chinese company, but for the ASes of the U.S.A. subsidiary of a Chinese company. It just kept coming back again and again, quite rapidly, for a particular page that was 404. Not for any other related pages, mind. Not for the favicon, robots.txt, or even the enclosing pseudo-directory. Just that 1 page. Over and over.

The directory structure had changed, and the page is now 1 level lower in the tree, correctly hyperlinked long since, in various sitemaps long since, and long since discovered by genuine HTTP clients.

The URL? It now only exists in 1 place on the WWW according to Google. It was posted to Hacker News back in 2017.

(My educated guess is that I am suffering from the page-preloading fallout from repeated robotic scraping of old Hacker News stuff by said U.S.A. subsidiary.)

Moru · 6h ago

Rule number one: You do not talk about fight club.

popcorncowboy · 5h ago

Dark forest theory taking root.

boris · 5h ago

Yes, I've seen this one in our logs. Quite obnoxious, but at least it identifies itself as a bot and, at least in our case (cgit host), does not generate much traffic. The bulk of our traffic comes from bots that pretend to be real browsers and that use a large number of IP addresses (mostly from Brazil and Asia in our case).

I've been playing cat and mouse trying to block them for the past week and here are a couple of observations/ideas, in case this is helpful to someone:

* As mentioned above, the bulk of the traffic comes from a large number of IPs, each issuing only a few requests a day, and they pretend to be real UAs.

* Most of them don't bother sending the referrer URL, but not all (some bots from Huawei Cloud do, but they currently don't generate much traffic).

* The first thing I tried was to throttle bandwidth for URLs that contain id= (which on a cgit instance generate the bulk of the bot traffic). So I set the bandwidth to 1Kb/s and thought surely most of the bots will not be willing to wait for 10-20s to download the page. Surprise: they didn't care. They just waited and kept coming back.

* BTW, they also used keep alive connections if ones were offered. So another thing I did was disable keep alive for the /cgit/ locations. Failed that enough bots would routinely hog up all the available connections.

* My current solution is to deny requests for all URLs containing id= unless they also contain the `notbot` parameter in the query string (and which I suggest legitimate users add in the custom error message for 403). I also currently only do this if the referrer is not present but I may have to change that if the bots adapt. Overall, this helped with the load and freed up connections to legitimate users, but the bots didn't go away. They still request, get 403, but keep coming back.

My conclusion from this experience is that you really only have two options: either do something ad hoc, very specific to your site (like the notbot in query string) that whoever runs the bots won't bother adapting to or you have to employ someone with enough resources (like Cloudflare) to fight them for you. Using some "standard" solution (like rate limit, Anubis, etc) is not going to work -- they have enough resources to eat up the cost and/or adapt.

palmfacehn · 4h ago

Pick an obscure UA substring like MSIE 3.0 or HP-UX. Preemptively 403 these User Agents, (you'll create your own list). Later in the week you can circle back and distill these 403s down to problematic ASNs. Whack moles as necessary.

JdeBP · 4h ago

I (of course) use the djbwares descendent of Bernstein publicfile. I added a static GEMINI UCSPI-SSL tool to it a while back. One of the ideas that I took from the GEMINI specification and then applied to Bernstein's HTTP server was the prohibition on fragments in request URLs (which the Bernstein original allowed), which I extended to a prohibition on query parameters as well (which the Bernstein original also allowed) in both GEMINI and HTTP.

* https://geminiprotocol.net/docs/protocol-specification.gmi#r...

The reasoning for disallowing them in GEMINI pretty much applies to static HTTP service (which is what publicfile provides) as it does to static GEMINI service. They moreover did not actually work in Bernstein publicfile unless a site administrator went to extraordinary lengths to create multiple oddly-named filenames (non-trivial to handle from a shell on a Unix or Linux-based system, because of the metacharacter) with every possible combination of query parameters, all naming the same file.

* https://jdebp.uk/Softwares/djbwares/guide/publicfile-securit...

* https://jdebp.uk/Softwares/djbwares/guide/commands/httpd.xml

* https://jdebp.uk/Softwares/djbwares/guide/commands/geminid.x...

Before I introduced this, attempted (and doomed to fail) exploits against weak CGI and PHP scripts were a large fraction of all of the file not found errors that httpd had been logging. These things were getting as far as hitting the filesystem and doing namei lookups. After I introduced this, they are rejected earlier in the transaction, without hitting the filesystem, when the requested URL is decomposed into its constituent parts.

Bernstein publicfile is rather late to this party, as there are over 2 decades of books on the subject of static sites versus dynamic sites (although in fairness it does pre-date all of them). But I can report that the wisdom when it comes to queries holds up even today, in 2025, and if anything a stronger position can be taken on them now.

To those running static sites, I recommend taking this good idea from GEMINI and applying it to query parameters as well.

Unless you are brave enough to actually attempt to provide query parameter support with static site tooling. (-:

Jnr · 4h ago

Externally I use Cloudflare proxy and internally I put Crowdsec and Modsecurity CRS middlewares in front of Traefik.

After some fine-tuning and eliminating false positives, it is running smoothly. It logs all the temporarily banned and reported IPs (to Crowdsec) and logging them to a Discord channel. On average it blocks a few dozen different IPs each day.

From what I see, there are far more American IPs trying to access non-public resources and attempting to exploit CVEs than there are Chinese ones.

I don't really mind anyone scraping publicly accessible content and the rest is either gated by SSO or located in intranet.

For me personally there is no need to block a specific country, I think that trying to block exploit or flooding attempts is a better approach.

poisonborz · 3h ago

Crowdsec: the idea is tempting, but giving away all of the server's traffic to a for-profit is a huge liability.

Jnr · 2h ago

You pass all traffic through Cloudflare. You do not pass any traffic to Crowdsec, you detect locally and only report blocked IPs. And with Modsecurity CRS you don't report anything to anyone but configuring and fine tuning is a bit harder.

jrgifford · 2h ago

The more egregious attempts are likely being blocked by Cloudflare WAF / similar.

Jnr · 1h ago

I don't think they are really blocking anything unless you specifically enable it. But it gives some piece of mind knowing that I could probably enable it quickly if it becomes necessary.

Etheryte · 8h ago

One starts to wonder, at what point might it be actually feasible to do it the other way around, by whitelisting IP ranges. I could see this happening as a community effort, similar to adblocker list curation etc.

bobbiechen · 7h ago

Unfortunately, well-behaved bots often have more stable IPs, while bad actors are happy to use residential proxies. If you ban a residential proxy IP you're likely to impact real users while the bad actor simply switches. Personally I don't think IP level network information will ever be effective without combining with other factors.

Source: stopping attacks that involve thousands of IPs at my work.

BLKNSLVR · 6h ago

Blocking a residential proxy doesn't sound like a bad idea to me.

My single-layer thought process:

If they're knowingly running a residential proxy then they'll likely know "the cost of doing business". If they're unknowingly running a residential proxy then blocking them might be a good way for them to find out they're unknowingly running a residential proxy and get their systems deloused.

throwawayffffas · 6h ago

> If you ban a residential proxy IP you're likely to impact real users while the bad actor simply switches.

Are you really? How likely do you think is a legit customer/user to be on the same IP as a residential proxy? Sure residential IPS get reused, but you can handle that by making the block last 6-8 hours, or a day or two.

richardwhiuk · 4h ago

In these days of CGNAT, a residential IP is shared by multiple customers.

jampa · 7h ago

The Pokémon Go company tried that shortly after launch to block scraping. I remember they had three categories of IPs:

- Blacklisted IP (Google Cloud, AWS, etc), those were always blocked

- Untrusted IPs (residential IPs) were given some leeway, but quickly got to 429 if they started querying too much

- Whitelisted IPs (IPV4 addresses are used legitimately by many people), for example, my current data plan tells me my IP is from 5 states over, so anything behind a CGNAT.

You can probably guess what happens next. Most scrapers were thrown out, but the largest ones just got a modem device farm and ate the cost. They successfully prevented most users from scraping locally, but were quickly beaten by companies profiting from scraping.

I think this was one of many bad decisions Pokémon Go made. Some casual players dropped because they didn't want to play without a map, while the hardcore players started paying for scraping, which hammered their servers even more.

aorth · 5h ago

I have an ad hoc system that is similar, comprised of three lists of networks: known good, known bad, and data center networks. These are rate limited using a geo map in nginx for various expensive routes in my application.

The known good list is IPs and ranges I know are good. The known bad list is specific bad actors. The data center networks list is updated periodically based on a list of ASNs belonging to data centers.

There are a lot of problems with using ASNs, even for well-known data center operators. First, they update so often. Second, they often include massive subnets like /13(!), which can apparently overlap with routes announced by other networks, causing false positives. Third, I had been merging networks (to avoid overlaps causing problems in nginx) with something like https://github.com/projectdiscovery/mapcidr but found that it also caused larger overlaps that introduced false positives from adjacent networks where apparently some legitimate users are. Lastly, I had seen suspicious traffic from data center operators like CATO Networks Ltd and ZScaler that are some kind of enterprise security products that route clients through their clouds. Blocking those resulted in some angry users in places I didn't expect...

And none of the accounts for the residential ISPs that bots use to appear like legitimate users https://www.trendmicro.com/vinfo/us/security/news/vulnerabil....

gunalx · 5h ago

This really seems like they did everything they could and still got abused by borderline criminal activity from scrapers. But i do really think it had an impact on scraping, it is just a matter of attrition and raising the cost so it should hurt more to scrape, the problem really never can go away, because at some point the scrapers can just start paying regular users to collect the data.

lxgr · 7h ago

Many US companies do it already.

It should be illegal, at least for companies that still charge me while I’m abroad and don’t offer me any other way of canceling service or getting support.

withinboredom · 6h ago

I'm pretty sure I still owe t-mobile money. When I moved to the EU, we kept our old phone plans for awhile. Then, for whatever reason, the USD didn't make it to the USD account in time and we missed a payment. Then t-mobile cut off the service and you need to receive a text message to login to the account. Obviously, that wasn't possible. So, we lost the ability to even pay, even while using a VPN. We just decided to let it die, but I'm sure in t-mobile's eyes, I still owe them.

thenthenthen · 6h ago

This! Dealing with European services from China is also terrible. As is the other way around. Welcome to the intranet!

friendzis · 7h ago

It's never either/or: you don't have to choose between white and black lists exclusively and most of the traffic is going to come from grey areas anyway.

Say you whitelist an address/range and some systems detect "bad things". Now what? You remove that address/range from whitelist? Doo you distribute the removal to your peers? Do you communicate removal to the owner of unwhitelisted address/range? How does owner communicate dealing with the issue back? What if the owner of the range is hosting provider where they don't proactively control the content hosted, yet have robust anti-abuse mechanisms in place? And so on.

Whitelist-only is a huge can of worms and whitelists works best with trusted partner you can maintain out-of-band communication with. Similarly blacklists work best with trusted partners, however to determine addresses/ranges that are more trouble than they are worth. And somewhere in the middle are grey zone addresses, e.g. ranges assigned to ISPs with CGNATs: you just cannot reliably label an individual address or even a range of addresses as strictly troublesome or strictly trustworthy by default.

Implement blacklists on known bad actors, e.g. the whole of China and Russia, maybe even cloud providers. Implement whitelists for ranges you explicitly trust to have robust anti-abuse mechanisms, e.g. corporations with strictly internal hosts.

partyguy · 7h ago

That's what I'm trying to do here, PRs welcome: https://github.com/AnTheMaker/GoodBots

aorth · 5h ago

Noble effort. I might make some pull requests, though I kinda feel it's futile. I have my own list of "known good" networks.

delusional · 7h ago

At that point it almost sounds like we're doing "peering" agreements at the IP level.

Would it make sense to have a class of ISPs that didn't peer with these "bad" network participants?

JimDabell · 7h ago

If this didn’t happen for spam, it’s not going to happen for crawlers.

shortrounddev2 · 7h ago

Why not just ban all IP blocks assigned to cloud providers? Won't halt botnets but the IP range owned by AWS, GCP, etc is well known

aorth · 5h ago

Tricky to get a list of all cloud providers, all their networks, and then there are cases like CATO Networks Ltd and ZScaler, which are apparently enterprise security products that route clients traffic through their clouds "for security".

jjayj · 6h ago

But my work's VPN is in AWS, and HN and Reddit are sometimes helpful...

Not sure what my point is here tbh. The internet sucks and I don't have a solution

hnlmorg · 6h ago

Because crawlers would then just use a different IP which isn’t owned by cloud vendors.

ygritte · 7h ago

Came here to say something similar. The sheer amount of IP addresses one has to block to keep malware and bots at bay is becoming unmanageable.

worthless-trash · 7h ago

I admin a few local business sites.. I whitelist all the countries isps and the strangeness in the logs and attack counts have gone down.

Google indexes in country, as does a few other search engines..

Would recommend.

coffee_am · 7h ago

Is there a public curated list of "good ips" to whitelist ?

partyguy · 7h ago

> Is there a public curated list of "good ips" to whitelist ?

https://github.com/AnTheMaker/GoodBots

worthless-trash · 7h ago

So, its relatively easy because there is limited ISP's in my country. I imagine its a much harder option for the US.

I looked at all the IP ranges delegated by APNIC, along with every local ISP that I could find, unioned this with

https://lite.ip2location.com/australia-ip-address-ranges

And so far i've not had any complaints. and I think that I have most of them.

At some time in the future, i'll start including https://github.com/ebrasha/cidr-ip-ranges-by-country

ralferoo · 34m ago

Out of spite, I'd ignore their request to filter by IP (who knows what their intent is by saying that - maybe they're connecting from VPNs or tor exit nodes to cause disruption etc), but instead filter by matching for that content in the User-Agent instead and feeding them a zip bomb instead.

zImPatrick · 1h ago

I run an instance of a downloader tool and had lots of chinese IPs mass-download youtube videos with the most generic UA. I started with „just“ blocking their ASNs, but they always came back with another one until I just decided to stop bothering and banned China entirely. I‘m confused on why some chinese ISPs have so many different ASNs - while most major internet providers here have exactly one.

breve · 2h ago

> Alex Schroeder's Butlerian Jihad

That's Frank Herbert's Butlerian Jihad.

flanbiscuit · 16m ago

To be fair, he was referring to a post on Alex Schroeder's blog titled with the same name as the term from the Dune books. And that post correctly credits Dune/Herbert. But the post is not about Dune, it's about Spam bots so it's more related to what the original author's post if about.

Speaking of the Butlerian Jihad, Frank Herbert's son (Brian) and another author named Kevin J Anderson co-wrote a few books in the Dune universe and one of them was about the Butlerian Jihad. I read it. It was good, not as good at Frank Herbert's books but I still enjoyed it. One of the authors is not as good as the other because you can kind of tell the writing quality changing per chapter.

https://en.wikipedia.org/wiki/Dune:_The_Butlerian_Jihad

dirkc · 1h ago

Interesting to think that the answer to banning thinking computers in Dune was basically to indoctrinate kids from birth (mentats) and/or doing large quantities of drugs (guild navigators).

Hizonner · 2h ago

It's a simple translation error. They really meant "Feed me worthless synthetic shit at the highest rate you feel comfortable with. It's also OK to tarpit me."

teekert · 5h ago

These IP addresses being released at some point, and making their way into something else is probably the reason I never got to fully run my mailserver from my basement. These companies are just massively giving IP addresses a bad reputation, messing them up for any other use and then abandoning them. I wonder what this would look like when plotted: AI (and other toxic crawling) companies slowly consuming the IPv4 address space? Ideally we'd forced them into some corner of the IPv6 space I guess. I mean robots.txt seems not to be of any help here.

praptak · 6h ago

"I'm seriously thinking that the CCP encourage this with maybe the hope of externalizing the cost of the Great Firewall to the rest of the world. If China scrapes content, that's fine as far as the CCP goes; If it's blocked, that's fine by the CCP too (I say, as I adjust my tin foil hat)."

Then turn the tables on them and make the Great Firewall do your job! Just choose a random snippet about illegal Chinese occupation of Tibet or human rights abuses of Uyghur people each time you generate a page and insert it as a breaker between paragraphs. This should get you blocked in no time :)

_def · 8h ago

I've seen blocks like that for e.g. alibaba cloud. It's sad indeed, but it can be really difficult to handle aggressive scrapers.

orochimaaru · 2h ago

Is there a list of chinese ASN’s that you can ban if you don’t do much business there - eg all of China, Macau, select Chinese clouds in SE Asia, Polynesia and Africa. I think they’ve kept HK clean so far.

herbst · 6h ago

More than half of my traffic is Bing, Claude and for whatever reason the Facebook bots.

None of these are main main traffic drivers, just the main resource hogs. And the main reason when my site turns slow (usually an AI, microsoft or Facebook ignoring any common sense)

China and co is only a very small portion of my malicious traffic. Gladly. It's usually US companies who disrespect my robots.txt and DNS rate limits who make me the most problems.

devoutsalsa · 6h ago

There are a lot of dumb questions, and I pose all of them to Claude. There's no infrastructure in place for this, but I would support some business model where LLM-of-choice compensates website operators for resources consumed by my super dumb questions. Like how content creators get paid when I watch with a YouTube Premium subscription. I doubt this is practical in practice.

herbst · 3h ago

For me it looks more like out of the control bots than average requests. For example a few days ago I blocked a few bots. Google was about 600 requests in 24 hours. Bing 1500, Facebook is mostly blocked right now, Claude with 3 different bot types was about 100k requests in the same time.

There is no reason to query all my sub-sites, it's like a search engine with way to many theoretical pages.

Facebook also did aggressively, daily indexing of way to many pages, using large IP ranges until I blocked it. I get like one user per week from them, no idea what they want.

And bing, I learned, "simply" needs hard enforced rate limits it kinda learns to agree on.

ricardo81 · 2h ago

Any respectable web scale crawler(/scraper) should have reverse DNS so that it can automatically be blocked.

Though it would seem all bets are off and anyone will scrape anything. Now we're left with middlemen like cloudflare that cost people millions of hours of time ticking boxes to prove they're human beings.

sim7c00 · 3h ago

i think there is an opportunity to train an neural network on browser user agent s(they are catalogued but vary and change a lot). then u can block everything not matching.

it will work better than regex. a lot of these companies rely on 'but we are clearly recognizable' via fornexample these user agents, as excuse to put burden on sysadmins to maintains blocklists instead of otherway round (keep list of scrapables..)

maybe someone mathy can unburden them ?

you could also look who ask for nonexisting resources, and block anyone who asks for more than X (large enough not to let config issue or so kill regular clients). block might be just a minute so u dont have too many risk when an FP occurs. it will be enough likely to make the scraper turn away.

there are many things to do depending on context, app complexity, load etc. , problem is there's no really easy way to do these things.

ML should be able to help a lot in such a space??

arewethereyeta · 1h ago

What exactly do you want to train on a falsifiable piece of info? We do something like this at https://visitorquery.com in order to detect HTTP proxies and VPNs but the UA is very unreliable. I guess you could detect based on multiple pieces with UA being one of them where one UA must have x, y, z or where x cannot be found on one UA. Most of the info is generated tho.

BLKNSLVR · 5h ago

I've mentioned my project[0] before, and it's just as sledgehammer-subtle as this bot asks.

I have a firewall that logs every incoming connection to every port. If I get a connection to a port that has nothing behind it, then I consider the IP address that sent the connection to be malicious, and I block the IP address from connecting to any actual service ports.

This works for me, but I run very few things to serve very few people, so there's minimal collateral damage when 'overblocking' happens - the most common thing is that I lock myself out of my VPN (lolfacepalm).

I occasionally look at the database of IP addresses and do some pivot tabling to find the most common networks and have identified a number of cough security companies that do incessant scanning of the IPv4 internet among other networks that give me the wrong vibes.

[0]: Uninvited Activity: https://github.com/UninvitedActivity/UninvitedActivity

P.S. If there aren't any Chinese or Russian IP addresses / networks in my lists, then I probably block them outright prior to the logging.

poisonborz · 3h ago

Naive question: why isn't there a publicly accessible central repository of bad IPs and domains, stewarded by the industry, operated by a nonprofit, like W3C? Yes it wouldn't be enough by itself ("bad" is a very subjective term) but it could be a popular well-maintained baseline.

sim7c00 · 3h ago

there are many of these and they are always outdated

another issue is things like cloud hosting will overlap their ranges with legit business ranges happily, so if you go that route you will inadvertently also block legitimate things. not that a regular person care too much for that, but an abuse list should be accurate.

nubinetwork · 2h ago

https://xkcd.com/927/

For what it's worth, I'm also guilty of this, even if I made my site to replace one that died.

niczem · 5h ago

I think banning IPs is a treadmill you never really get off of. Between cloud providers, VPNs, CGNAT, and botnets, you spend more time whack-a-moling than actually stopping abuse. What’s worked better for me is tarpitting or just confusing the hell out of scrapers so they waste their own resources.

There’s a great talk on this: Defense by numbers: Making Problems for Script Kiddies and Scanner Monkeys https://www.youtube.com/watch?v=H9Kxas65f7A

What I’d really love to see - but probably never will—is companies joining forces to share data or support open projects like Common Crawl. That would raise the floor for everyone. But, you know… capitalism, so instead we all reinvent the wheel in our own silos.

BLKNSLVR · 5h ago

If you can automate the treadmill and set a timeout at which point the 'bad' IPs will go back to being 'not necessarily bad', then you're minimising the effort required.

An open project that classifies and records this - would need a fair bit of on-going protection, ironically.

mellosouls · 5h ago

Unfortunately, HN itself is occasionally used for publicising crawling services that rely on underhand techniques that don't seem terribly different to the ones here.

I don't know if its because they operate in the service of capital rather than China, as here, but use of those methods in the former case seems to get more of a pass here.

hackrmn · 2h ago

I know opinions are divided on what I am about to mention, but what about CAPTCHA to filter bots? Yes, I am well aware we're a decade past a lot of CAPTCHA being broken by _algorithms_, but I believe it is still a relatively useful general solution, technically -- question is, would we want to filter non-humans, effectively? I am myself on the fence about this, big fan of what HTTP allows us to do, and I mean specifically computer-to-computer (automation/bots/etc) HTTP clients. But with the geopolitical landscape of today, where Internet has become a tug of war (sometimes literally), maybe Butlerian Jihad was onto something? China and Russia are blatantly and near-openly shoving their fingers in every hole they can find, and if this is normalized so will Europe and U.S., for countermeasure (at least one could imagine it being the case). One could also allow bots -- clients unable to solve CAPTCHA -- access to very simplified, distilled and _reduced_ content, to give them the minimal goodwill to "index" and "crawl" for ostensibly "good" purposes.

yumraj · 6h ago

Wouldn't it be better, if there's an easy way, to just feed such bots shit data instead of blocking them. I know it's easier to block and saves compute and bandwidth, but perhaps feeding them shit data at scale would be a much better longer term solution.

sotspecatcle · 6h ago

    if ($http_user_agent ~* "BadBot") {
        limit_rate 1k;
        default_type application/octet-stream;
        proxy_buffering off;
        alias /dev/zero;
        return 200;
    }

internet_points · 5h ago

https://zadzmo.org/code/nepenthes/

throwawayffffas · 6h ago

No serving shit data costs bandwidth and possibly compute time.

Blocking IPS is much cheaper for the blocker.

fuckaj · 4h ago

Zip bomb?

aspenmayer · 3h ago

Doesn’t that tie up a socket on the server similarly to how a keepalive would on the bot user end?

that_lurker · 6h ago

Why not just block the User Agent?

arewethereyeta · 1h ago

Because it's the single most falsifiable piece of information you would find on ANY "how to scrape for dummies" article out there. They all start with changing your UA.

N_Lens · 4h ago

Bots often rotate the UA too, their entire goal is to get through and scrape as much content as possible, using any means possible.

lexicality · 3h ago

because you have to parse the http request to do that, while blocking the IP can be done at the firewall

aspenmayer · 3h ago

I think the UA is easily spoofed, whereas the AS and IP are less easily spoofed. You have everything you need already to spoof UA, while you will need resources to spoof your IP, whether it’s wall clock time to set it up, CPU time to insert another network hop, and/or peers or other third parties to route your traffic, and so on. The User Agent are variables that you can easily change, no real effort or expense or third parties required.

timpera · 3h ago

As someone who uses VPNs all the time, these comments make me sad. Blocking by IP is not the solution.

roguebloodrage · 6h ago

This is everything I have for AS132203 (Tencent). It has your addresses plus others I have found and confirmed using ipinfo.io

43.131.0.0/18 43.129.32.0/20 101.32.0.0/20 101.32.102.0/23 101.32.104.0/21 101.32.112.0/23 101.32.112.0/24 101.32.114.0/23 101.32.116.0/23 101.32.118.0/23 101.32.120.0/23 101.32.122.0/23 101.32.124.0/23 101.32.126.0/23 101.32.128.0/23 101.32.130.0/23 101.32.13.0/24 101.32.132.0/22 101.32.132.0/24 101.32.136.0/21 101.32.140.0/24 101.32.144.0/20 101.32.160.0/20 101.32.16.0/20 101.32.17.0/24 101.32.176.0/20 101.32.192.0/20 101.32.208.0/20 101.32.224.0/22 101.32.228.0/22 101.32.232.0/22 101.32.236.0/23 101.32.238.0/23 101.32.240.0/20 101.32.32.0/20 101.32.48.0/20 101.32.64.0/20 101.32.78.0/23 101.32.80.0/20 101.32.84.0/24 101.32.85.0/24 101.32.86.0/24 101.32.87.0/24 101.32.88.0/24 101.32.89.0/24 101.32.90.0/24 101.32.91.0/24 101.32.94.0/23 101.32.96.0/20 101.33.0.0/23 101.33.100.0/22 101.33.10.0/23 101.33.10.0/24 101.33.104.0/21 101.33.11.0/24 101.33.112.0/22 101.33.116.0/22 101.33.120.0/21 101.33.128.0/22 101.33.132.0/22 101.33.136.0/22 101.33.140.0/22 101.33.14.0/24 101.33.144.0/22 101.33.148.0/22 101.33.15.0/24 101.33.152.0/22 101.33.156.0/22 101.33.160.0/22 101.33.164.0/22 101.33.168.0/22 101.33.17.0/24 101.33.172.0/22 101.33.176.0/22 101.33.180.0/22 101.33.18.0/23 101.33.184.0/22 101.33.188.0/22 101.33.24.0/24 101.33.25.0/24 101.33.26.0/23 101.33.30.0/23 101.33.32.0/21 101.33.40.0/24 101.33.4.0/23 101.33.41.0/24 101.33.42.0/23 101.33.44.0/22 101.33.48.0/22 101.33.52.0/22 101.33.56.0/22 101.33.60.0/22 101.33.64.0/19 101.33.64.0/23 101.33.96.0/22 103.52.216.0/22 103.52.216.0/23 103.52.218.0/23 103.7.28.0/24 103.7.29.0/24 103.7.30.0/24 103.7.31.0/24 43.130.0.0/18 43.130.64.0/18 43.130.128.0/19 43.130.160.0/19 43.132.192.0/18 43.133.64.0/19 43.134.128.0/18 43.135.0.0/18 43.135.64.0/18 43.135.192.0/19 43.153.0.0/18 43.153.192.0/18 43.154.64.0/18 43.154.128.0/18 43.154.192.0/18 43.155.0.0/18 43.155.128.0/18 43.156.192.0/18 43.157.0.0/18 43.157.64.0/18 43.157.128.0/18 43.159.128.0/19 43.163.64.0/18 43.164.192.0/18 43.165.128.0/18 43.166.128.0/18 43.166.224.0/19 49.51.132.0/23 49.51.140.0/23 49.51.166.0/23 119.28.64.0/19 119.28.128.0/20 129.226.160.0/19 150.109.32.0/19 150.109.96.0/19 170.106.32.0/19 170.106.176.0/20

bigiain · 6h ago

For anyone wondering how to do this (like me from a month or two back).

Here's a useful tool/site:

https://bgp.tools

You can feed it an ip address to get an AS ("Autonomous System"), then ask it for all prefixes associated with that AS.

I fed it that first ip address from that list (43.131.0.0) and it showed my the same Tencent owned AS132203, and it gives back all the prefixes they have here:

https://bgp.tools/as/132203#prefixes

(Looks like roguebloodrage might have missed at least the 1.12.x.x and 1.201.x.x prefixes?)

I started searching about how to do that after reading a RachelByTheBay post where she wrote:

Enough bad behavior from a host -> filter the host.

Enough bad hosts in a netblock -> filter the netblock.

Enough bad netblocks in an AS -> filter the AS. Think of it as an "AS death penalty", if you like.

(from the last part of https://rachelbythebay.com/w/2025/06/29/feedback/ )

roguebloodrage · 2h ago

I add re-actively. I figure there are "legitimate" IP's that companies use and I only look at IP addresses that are 'vandalizing' my servers with inappropriate scans and block them.

If I saw the two you have identified, then they would have been added. I do play a balance between "might be a game CDN" or a "legit server" and an outright VPS that is being used to abuse other servers.

But thanks, I will keep an eye on those two ranges.

BLKNSLVR · 5h ago

This is what I've used to find ASs to block: https://hackertarget.com/as-ip-lookup/

eg. Chuck 'Tencent' into the text box and execute.

nubinetwork · 3h ago

FWIW, I looked through my list of ~8000 IP addresses, there isn't as many hits for these ranges as I would have thought. It's possible that they're more focused on using known DNS names than simply connecting to 80/443 on random IPs.

Edit: I also checked my Apache logs, I couldn't find any recent logs for "thinkbot".

speleding · 4h ago

For the Thinkbot problem mentioned in the article, it's less maintenance work to simply block on the User Agent string.

sim7c00 · 3h ago

jep, good tip! for ppl that do this be sure to make it case insensitive and only capture few distinct parts, not too specific. especially if u only expect browsers this can mitigate a lot.

u can also filter for allowing but this gives a risk of allowing the wrong thing as headers are easy to set, so its better to do it via blocking (sadly)

geokon · 5h ago

Is there a way to reverse look up IPs by company? Like a list off all IPs owned by Alphabet, Meta Bing etc?

BLKNSLVR · 5h ago

https://hackertarget.com/as-ip-lookup/

Chuck 'Tencent' into the text box and execute.

ta8645 · 7h ago

If ipv6 ever becomes a thing, it'll make blocking all that much harder.

rnhmjoj · 7h ago

No, it's really the same thing with just different (and more structured) prefix lengths. In IPv4 you usually block a single /32 address first, then a /24 block, etc. In IPv6 you start with a single /128 address, a single LAN is /64, an entire site is usually /56 (residential) or /48 (company), etc.

withinboredom · 6h ago

Hmmm... that isn't my experience:

/128: single application

/64: single computer

/56: entire building

/48: entire (digital) neighborhood

rnhmjoj · 5h ago

A /64 is the smallest network on which you can run SLAAC, so almost all VLANs should use this. /56 and /48 for end users is what RIRs are recommending, in reality the prefixes are longer, because ISPs and hosting providers wants you to pay like IPv6 space is some scarse resource.

[1]: https://www.ripe.net/publications/docs/ripe-690/

withinboredom · 5h ago

Everyone at my isp is issued a /56 (and as far as I can tell, the entire country is this way).

snerbles · 7h ago

For ipv6 you just start nuking /64s and /48s if they're really rowdy.

jedisct1 · 2h ago

I don’t understand why people want to block bots, especially from a major player like Tencent, while at the same time doing everything they can to be indexed by Google

mediumsmart · 4h ago

fiefdom internet 1.0 release party - information near your fingertips

sneak · 6h ago

I feel like people seem to forget that an HTTP request is, after all, a request. When you serve a webpage to a client, you are consenting to that interaction with a voluntary response.

You can blunt instrument 403 geoblock entire countries if you want, or any user agent, or any netblock or ASN. It’s entirely up to you and it’s your own server and nobody will be legitimately mad at you.

You can rate limit IPs to x responses per day or per hour or per week, whatever you like.

This whole AI scraper panic is so incredibly overblown.

I’m currently working on a sniffer that tracks all inbound TCP connections and UDP/ICMP traffic and can trigger firewall rule addition/removal based on traffic attributes (such as firewalling or rate limiting all traffic from certain ASNs or countries) without actually having to be a reverse proxy in the HTTP flow. That way your in-kernel tables don’t need to be huge and they can just dynamically be adjusted from userspace in response to actual observed traffic.

worthless-trash · 6h ago

> This whole AI scraper panic is so incredibly overblown.

The problem is that its eating into peoples costs, and if you're not concerned with money, I'm just asking, can you send me $50.00 USD ?

znpy · 6h ago

Oh i recognise those ip addresses… they gave us quite an headache a while ago

baxuz · 4h ago

Geoblocking China and Russia should be the default.

rglullis · 5h ago

All of the "blockchain is only for drug dealing and scams" people will sooner or later realize that it is the exact type of scenarios that makes it imperative to keep developing trustless systems.

The internet was a big level-playing field, but for the past half century corporations and state actors managed to keep control and profit to themselves while giving the illusion that us peasants could still benefit from it and had a shot at freedom. Now that computing power is so vast and cheap, it has become an arms race and the cyberpunk dystopia has become apparent.

latexr · 4h ago

> All of the "blockchain is only for drug dealing and scams" people will sooner or later realize that it is the exact type of scenarios that makes it imperative to keep developing trustless systems.

This is like saying “All the “sugar-sweetened beverages are bad for you” people will sooner or later realize it is imperative to drink liquids”. It is perfectly congruent to believe trustless systems are important and that the way the blockchain works is more harmful than positive.

Additionally, the claim is that cryptocurrencies are used like that. Blockchains by themselves have a different set of issues and criticisms.

rglullis · 2h ago

Tell that to the "web3 is doing great" crowd.

I've met and worked with many people who never shilled a coin in their whole life and were treated as criminals for merely proposing any type of application on Ethereum.

I got tired of having people yelling online about how "we are burning the planet" and who refused to understand that proof of stake made energy consumption negligible.

To this day, I have my Mastodon instance on some extreme blocklist because "admin is a crypto shill" and their main evidence was some discussion I was having to use ENS as an alternative to webfinger so that people could own their identity without relying on domain providers.

The goalposts keep moving. The critics will keep finding reasons and workarounds. Lots of useful idiots will keep doubling down on the idea that some holy government will show up and enact perfect regulation, even though it's the institutions themselves who are the most corrupt and taking away their freedoms.

The open, anonymous web is on the verge of extinction. We no longer can keep ignoring externalities. We will need to start designing our systems in a way where everyone will need to either pay or have some form of social proof for accessing remote services. And while this does not require any type of block chains or cryptocurrency, we certainly will need to start showing some respect to all the people who were working on them and have learned a thing or two about these problems.

latexr · 1h ago

> and who refused to understand that proof of stake made energy consumption negligible.

Proof of stake brought with it its own set of flaws and failed to solve many of the ones which already existed.

> To this day, I have my Mastodon instance on some extreme blocklist because (…)

Maybe. Or maybe you misinterpreted the reason? I don’t know, I only have your side of the story, so won’t comment either way.

> The goalposts keep moving. The critics will keep finding reasons and workarounds.

As will proponents. Perhaps if initial criticisms had been taken seriously and addressed in a timely manner, there wouldn’t have been reason to thoroughly dismiss the whole field. Or perhaps it would’ve played out exactly the same. None of us know.

> even though it's the institutions themselves who are the most corrupt and taking away their freedoms.

Curious that what is probably the most corrupt administration in the history of the USA, the one actively taking away their citizens’ freedoms as we speak, is the one embracing cryptocurrency to the max. And remember all the times the “immutable” blockchains were reverted because it was convenient to those with the biggest stakes in them? They’re far from impervious to corruption.

> And while this does not require any type of block chains or cryptocurrency, we certainly will need to start showing some respect to all the people who were working on them and have learned a thing or two about these problems.

Er, no. For one, the vast majority of blockchain applications were indeed grifts. It’s unfortunate for the minority who had good intentions, but it is what it is. For another, they didn’t invent the concept of trustless systems and cryptography. The biggest lesson we learned from blockchains is how bad of a solution they are. I don’t feel the need to thank anyone for grabbing an idea, doing it badly, wasting tons of resources while ignoring the needs of the world, using it to scam others, then doubling down on it when presented with the facts of its failings.

DonHopkins · 2h ago

So are you telling us that "web3 is (non-sarcastically) doing great", hmmmm?

>"blockchain is only for drug dealing and scams"

You forgot money laundering.

You lie down with dogs, you get up with fleas.

rglullis · 1h ago

Your absolute lack of reading comprehension works as a perfect example to illustrate my point.

DonHopkins · 45m ago

Oh dear, you mean web3 is NOT going great??!

It delights me to hear how frustrated you are that people have finally universally caught on to what a scam crypto is, and how nobody takes you seriously any more.

The fact that Trump is embracing crypto should be a huge clue to you. Care to explain why he's so honest and right about crypto (aka corrupto), and his economic policies are so sound and good for society that we should trust him? How many of his Big Beautiful NFTs and MAGA coins do you own?

uz3snolc3t6fnrq · 5m ago

is your trigger-word "Ethereum"? he's not even talking about trading crypto or anything you could remotely consider scammy, he's talking about a blockchain based naming system. you're freaking out over nothing, go home man...

PeterStuer · 7h ago

FAFO from both sides. Not defending this bot at all. That said, the shenanigans some rogue or clueless webmasters are up to blocking legitimate and non intrusive or load causing M2M trafic is driving some projects into the arms of 'scrape services' that use far less considerate nor ethical means to get to the data you pay them for.

IP blocking is useless if your sources are hundreds of thousands of people worldwide just playing a "free" game on their phone that once in a while on wifi fetches some webpages in the background for the game publisher's scraping as a service side revenue deal.

geocar · 5h ago

Exactly. If someone can harm your website on accident, they can absolutely harm it on purpose.

If you feel like you need to do anything at all, I would suggest treating it like any other denial-of-service vulnerability: Fix your server or your application. I can handle 100k clients on a single box, which equates to north of 8 billion daily impressions, and so I am happy to ignore bots and identify them offline in a way that doesn't reveal my methodologies any further than I absolutely have to.

BLKNSLVR · 5h ago

> IP blocking is useless if your sources are hundreds of thousands of people worldwide just playing a "free" game on their phone that once in a while on wifi fetches some webpages in the background for the game publisher's scraping as a service side revenue deal.

That's traffic I want to block, and that's behaviour that I want to punish / discourage. If a set of users get caught up in that, even when they've just been given recycled IP addresses, then there's more chance to bring the shitty 'scraping as a service' behaviour to light, thus to hopefully disinfect it.

(opinion coming from someone definitely NOT hosting public information that must be accessible by the common populace - that's an issue requiring more nuance, but luckily has public funding behind it to develop nuanced solutions - and can just block China and Russia if it's serving a common populace outside of China and Russia).

PeterStuer · 3h ago

Trust me, there's nothing 'nuanced' about the contractor that won the website management contract for the next 6-12 months by being the cheapest bidder for it.

ahtihn · 6h ago

What? Are you trying to say it's legitimate to want to scrape websites that are actively blocking you because you think you are "not intrusive"? And that this justifies paying for bad actors to do it for you?

I can't believe the entitlement.

PeterStuer · 6h ago

No. I'm talking about literally legitimate, information that has to be public by law and/or regulation (typically gov stuff), in formats specifically meant for m2m consuption, and still blocked by clueless or malicious outsourced lowest bidder site managers.

And no, I do not use those paid services, even though it would make it much easier.

AWS CEO says using AI to replace junior staff is 'Dumbest thing I've ever heard' (theregister.com)

Anna's Archive: An Update from the Team (annas-archive.org)

FFmpeg 8.0 (ffmpeg.org)

Show HN: I was curious about spherical helix, ended up making this visualization (visualrambling.space)

AGENTS.md – Open format for guiding coding agents (agents.md)

Why are anime catgirls blocking my access to the Linux kernel? (lock.cmpxchg8b.com)

Copilot broke audit logs, but Microsoft won't tell customers (pistachioapp.com)

Mark Zuckerberg freezes AI hiring amid bubble fears (telegraph.co.uk)

DeepSeek-v3.1 (api-docs.deepseek.com)

A German ISP changed their DNS to block my website (lina.sh)

AI tooling must be disclosed for contributions (github.com)

Obsidian Bases (help.obsidian.md)

How we exploited CodeRabbit: From simple PR to RCE and write access on 1M repos (research.kudelskisecurity.com)

Web apps in a single, portable, self-updating, vanilla HTML file (hyperclay.com)

Go is still not good (blog.habets.se)

OpenMower – An open source lawn mower (github.com)

U.S. government takes 10% stake in Intel (cnbc.com)

Show HN: Whispering – Open-source, local-first dictation you can trust (github.com)

Waymo granted permit to begin testing in New York City (cnbc.com)

Comet AI browser can get prompt injected from any site, drain your bank account (twitter.com)

Zedless: Zed fork focused on privacy and being local-first (github.com)

How to Draw a Space Invader (muffinman.io)

Show HN: NextDNS Adds "Bypass Age Verification"

Ask HN: Why does the US Visa application website do a port-scan of my network?

Counter-Strike: A billion-dollar game built in a dorm room (nytimes.com)

The Enterprise Experience (churchofturing.github.io)

Claudia – Desktop companion for Claude code (claudiacode.com)

Weaponizing image scaling against production AI systems (blog.trailofbits.com)

Io_uring, kTLS and Rust for zero syscall HTTPS server (blog.habets.se)

D4D4 (nmichaels.org)

Show HN: OverType – A Markdown WYSIWYG editor that's just a textarea

An interactive guide to SVG paths (joshwcomeau.com)

How to build a coding agent (ghuntley.com)

Sequoia backs Zed (zed.dev)

Left to Right Programming (graic.net)

What makes Claude Code so damn good (minusx.ai)

D2 (text to diagram tool) now supports ASCII renders (d2lang.com)

Pixel 10 Phones (blog.google)

Line scan camera image processing for train photography (daniel.lawrence.lu)

ArchiveTeam has finished archiving all goo.gl short links (tracker.archiveteam.org)

FFmpeg Assembly Language Lessons (github.com)

Home Depot sued for 'secretly' using facial recognition at self-checkouts (petapixel.com)

Manim: Animation engine for explanatory math videos (github.com)

Gemma 3 270M re-implemented in pure PyTorch for local tinkering (github.com)

What are OKLCH colors? (jakub.kr)

95% of Companies See 'Zero Return' on $30B Generative AI Spend (thedailyadda.com)

4chan will refuse to pay daily online safety fines, lawyer tells BBC (bbc.co.uk)

Show HN: OS X Mavericks Forever (mavericksforever.com)

How to Think About GPUs (jax-ml.github.io)

T-Mobile claimed selling location data without consent is legal–judges disagree (arstechnica.com)

Ban me at the IP level if you don't like me

Comments (253)