> Neither Cloudflare, nor any other service, will ever be able to block all scrapers. They can make their operations more expensive,
Cloudflare presents like a single platform to crawlers. They get the same amount of data that big platforms get to block crawlers they don't want. Other big platforms (Google, Facebook, etc.) can block scrapers effectively when they don't want them. A nifty new scraper might crawl a few million URLs before it's detected.
mmarian · 5h ago
Hey! Sorry, didn't quite catch what you meant.
Is it that Cloudflare can always spot crawlers because of the amount of data they collect? Or is it that there's always a nifty new scraper that will get away with it?
nabla9 · 3h ago
It's that Cloudflare can always spot crawlers. A few million random URLs crawled is nothing and provides no value for AI companies; they want everything.
A comprehensive crawl of LinkedIn, FB, Instagram, IMDB, Amazon would be worth a lot.
I mention in the post a scraping service that Cloudflare isn't spotting: https://www.scrapingbee.com/blog/how-to-bypass-cloudflare-an...
Plenty of open-source ones could bypass it as well, e.g. this one that came up in a search: https://github.com/VeNoMouS/cloudscraper Combine it with residential proxies and you're just not going to find them.
> A comprehensive crawl of LinkedIn, FB, Instagram, IMDB, Amazon would be worth a lot.
Just from a quick Google search:
- LinkedIn: https://brightdata.com/products/datasets/linkedin
- Amazon: https://www.junglescout.com/features/product-database/