Scraperr – A Self Hosted Webscraper

263 jpyles 94 5/11/2025, 6:29:18 PM github.com ↗

Comments (94)

lucb1e · 32d ago

Funny, I saw this HN headline just after banning another scraper's IP range

You're welcome to scrape my sites but please do it ethically. Idk how to define that but some examples of things I consider not cool:

- Scraping without a contact method, or at least some unique identifier (like your project's codename), in the user agent string.

This is common practice, see e.g.: <https://en.wikipedia.org/wiki/User-Agent_header#Format_for_a...>. Many sites mention in public API guidelines to include an email address so you can be contacted in case of problems. If you don't include this and you're causing trouble, all I can do is ban your IP address altogether (or entire ranges: if you hop between several IPs I'll have to assume you have access to the whole range). Nobody likes IP bans: you have to get a new IP, your provider has a burned IP address, the next customer runs into issues... don't be this person, include an identifier.

- Timing out the request after a few seconds.

Some pages on my site involve number crunching and take 20 seconds to load. I could add complexity to do this async instead, but, by having it live, the regular users get the latest info and they know to just wait a few seconds and everybody is happy. Even the scrapers can get the info, I'm fine computing those pages for you. But if you ask for me to do work and then walk away, that's just rude. It shows up in my logs as HTTP status 499 and I'll ban scrapers that I notice doing this regularly

- Ignoring robots.txt.

I have exactly 1 entry in there, and that's a caching proxy for another site that is struggling with load. If you ignore the robots file and just crawl the thing from A to Z at a high rate, that causes a lot of requests to the upstream site for updating stale caches. You can obviously expect a ban because it's again just a waste of resources

whazor · 32d ago

I find it unethical for a website robots.txt to allow-list particular search engines and ban all others. Essentially you are colluding with established search providers.

Loic · 32d ago

Not necessarily, I have a website with 95% (maybe even more) of the traffic generated by crawlers. If some of them are behaving badly, it is fair to exclude them with my robots.txt.

But of course, the ones behaving badly tend to not respect the robots.txt, so you end up banning the IP or IP block.

And here, I am a nice guy, the crawler must really be a piece of crap for me to start to block.

whazor · 31d ago

Deny-listing/banning bad crawlers is fine. Especially if they ignore the robots.txt.

But allow-listing particular crawlers only is collusion.

throawayonthe · 32d ago

the parent comment is talking about allow-listing (aka 'whitelisting') just a few crawlers from like, google

ToucanLoucan · 32d ago

This rather bluntly runs up against the fact that permitting crawling is an expense the web operator is taking on, ergo, receiving that content is by definition a privilege not a right.

p3rls · 32d ago

I think my team has spent about a quarter of this year's dev time on defending against scrapers. Their arrogance is the stuff of ancient greek dramas.

lucb1e · 30d ago

I don't know if that's a reply at me or a general remark, but yes, I never understood why you'd include a few big names and ban the rest for example. That's just screaming for anticompetitiveness. I don't know if my mention of robots.txt sounded like I do this, but I do not

datatrashfire · 32d ago

> - Scraping without a contact method, or at least some unique identifier (like your project's codename), in the user agent string.

This is a very effective way to make sure you won't get any scraping done!

lucb1e · 30d ago

Tell that to Googlebot, Bingbot, Petalbot, SemrushBot, MJ12bot, MojeekBot, DotBot, YandexBot, SeznamBot, Barkrowler, AhrefsBot, DuckDuckBot, AcademicBotRTU, Bytespider, Applebot, ZoominfoBot, TelegramBot, TwitterBot, SemanticScholarBot, redditbot, Pinterestbot... From a quick peek at my access log, all include either a link (most) or an email address (zoom, tiktok/bytedance, dotbot, and that academic bot)

Very few individual bots don't follow this good practice. Most of the IP ranges of violating bots are owned by Huawei (a few is Huawei Cloud so it could be anyone, but the majority seems to be Huawei themselves) and the remainder is all small beans as far as I remember (few thousand accesses in a day and then disappear forever, for example)

datatrashfire · 28d ago

None of the institutional market intelligence products I’ve ever worked on in nearly a decade of doing this do. Why? Cause they wouldn’t otherwise work.

Many APIs require specific user agents. Tools like curl impersonate require specific user agents.

Fokamul · 32d ago

Who cares, IP ranges are cheap. You're just banning datacenters.

edoceo · 32d ago

What do you have for log analytics and ban automation? Could you say more about how to identify these bad-bots?

lucb1e · 32d ago

There is no automation, I use `tail -f access.log`

I just look at what's happening on my server every now and then. Sometimes not for months, but then when I set up a project like that caching proxy, I'm currently keeping a more regular eye to see that crawlers aren't bothering the upstream via me. Most respect the robots policy, most of the ones that don't set a user agent string that include the word 'bot' and so I know not to refresh the cache based on that request. So far it has mostly been Huawei who pretend to be a regular user but request millions of pages (from 12 separate IP ranges so far, some of them bigger than /16, some of them a handful of /24s).

> Could you say more about how to identify these bad-bots?

Many requests per day to random pages from either the same IP address (range), or ranges owned by the same corporation

reconnecting · 32d ago

Interesting. Our open-source platform [1] has the capacity to help with all of this through a GUI and rule engine, but I'm still concerned about whether we should present this way of bot hunting as a feature. I worry that this approach may be irrelevant in today's context.

[1] https://github.com/TirrenoTechnologies/tirreno

tough · 32d ago

I mean if anything with AI data and main sources are becoming the actual precious resource again.

So i'd expect an uptick in bots as everyone races to try and compete with google on data hoarding

reconnecting · 32d ago

As I can see, there is already a heavy wave of new AI/startup/VC etc. data companies that goes beyond the data consumption expectations of websites in the pre-AI era.

However, I see the development of new bot types that tackle security in more aggressive ways. It's not just simple SQL injection as it was before, but more sophisticated and custom bots that not only request but also push a lot.

Or just a couple of days ago, I found a new type of bot that "brute-forces" website folder structure. ~205,000 requests in a couple of days.

These new bots are probably not directly the work of AI, but they seem to be a consequence of it.

reconnecting · 32d ago

We use tirreno [1] to manually and automatically analyze traffic and block unwanted bots. Although bot management is not currently listed as an official feature, it works well and is particularly helpful in complex bot hunting.

[1] https://github.com/TirrenoTechnologies/tirreno

VladVladikoff · 32d ago

What sort of pages require 20 seconds to generate? This is extremely slow by most web standards and even your users would be frustrated by this. It sounds like poorly designed database queries with unindexed joins.

Google will also abandon page loads that take too long, and will demote rankings for that page (or the entire site!)

lucb1e · 32d ago

> It sounds like poorly designed database queries with unindexed joins

Neither of those assumptions are correct. As an example, one page needs to look through 2.5 million records to find where the world record holder changed because it provides stats on who held the most records, held them for the greatest cumulative time, etc. The only thing to do would be introducing caching layers for parts of the computation, but for the number of users this system has, it's just not worth spending more development time than I already have. Also keep in mind it's a free web service and I don't run ads or anything, it's just a fan project for a game

> Google will ... demote rankings for that page (or the entire site!)

Google employs anticompetitive practices to maintain the search monopoly. We need more diversity in search engines, I don't know how else to encourage people to use something instead of, or at least in addition to, Google, besides by making Google Search just not competitive anymore. Google's crawler cannot access my site in the first place (but their other crawlers can; I'm pretty selective about this). My sites never show up in Google searches, on purpose

It's also not the whole site that's slow, it's when you click on a handful of specific pages. If that makes those pages not appear in search results, that's fine. Besides that it's not my loss, it's not like any other site has the info so people will find their way to the main page and click on what they want to see

VladVladikoff · 32d ago

Like I said then, you need indexes on those columns which you filter on in this table. Search a table of 2.5 million records for a value is still blazing fast if you use indexes correctly. I’m talking about 0.01 seconds or less. Even with tables much larger.

I agree about Google being shit. However, my website makes my living, and feeds and clothes my children, so I have to play along to their rules, or suffer.

Please take your slowest performing query and run it with EXPLAIN in front. And share that (or dump it into an LLM and it will tell you have to fix it)

monooso · 32d ago

You have very strong opinions about a site you effectively know nothing about.

Instead of immediately concluding that the person actually building the system is an incompetent fool who doesn't know any better, maybe work on the assumption that they know what they're doing, and have already considered the various trade-offs.

If nothing else, that would be considerably less obnoxious.

Loic · 32d ago

You need to drive and fine tune a Ferrari because it feeds your family. The OP just drive a nice little car, because it is fun to drive and he enjoys it. He could extract another +5% of torque by fine tuning, but he does not care, this is not where his joy is and where he wants to spend his time.

lucb1e · 30d ago

To be completely fair to the person you're responding to, they're talking about pages that take 20 seconds to load. On a regular website that hopes to get visitors from search engines, say, that is indeed insane and the fix is not to squeeze out 5% by fine tuning, the fix is to re-architect the thing

I don't mind people asking why it is this slow (whether I can't or why I don't re-architect it) or suggest fixes, but as the sibling comment to yours (from u/monooso) put it well, it would be nice if one does not assume that I'm an incompetent fool. The person also doesn't seem to read what I explained before suggesting more of the same in another reply. Thanks for adding your comment as well, I appreciate the sentiment. Even if I'm not sure if it applies in this case, it definitely applies to other things I do (I may have too many hobby projects running on that server.. ^^)

Breza · 28d ago

Well said! At work, I deal with these kinds of issues, and they get messy. I've spent hours this week discussing and debating table indexes and caching parameters for a system that's been running for fifteen years but we think we can improve. There's a diminishing return to putting tons of your time into every little thing, especially when a project is not your livelihood. It sounds like you've taken a thoughtful approach to your system architecture. If it works for you, it works!

a_ko · 32d ago

OP is driving with handbrake engaged.

lucb1e · 30d ago

How so?

lucb1e · 30d ago

Right...

In case it helps to understand: compare it to something like weather models. You can't "just add indexes" to make it fast, but you can store the result of an hourly run and serve that to people in milliseconds. In my case, nobody's paying me to serve them that 'weather report' so it is what it is

> Like I said then, you need indexes on those columns

Dude, like I said ("that's not a correct assumption" in response to "It sounds like ... unindexed joins"), I have indexes on the relevant columns in the correct order

Believe me I've benchmarked and SQL-EXPLAINed everything. All substantial queries have a code comment saying what index it uses as a way to make sure that changes to one query (and its corresponding index) don't affect another. I've learned a lot in this project about how everything from the different Where parts to the Order By clause to cardinality estimates to explicit index hints affect which indexes it can use as well as chooses to use. I enjoy learning about it, but now that I know the things relevant for this project, I'm also just done with the project and would rather spend my spare time on something new rather than adding code and introducing code and/or infrastructure complexity for storing parts of the computation that don't frequently change for example. Or if it was a popular site with frequent new visitors, that could be worth it as well. It's not. That doesn't make it necessarily a poor design if it's simply a lot of data

> Search a table of 2.5 million records for a value is still blazing fast

If you read what I write then you'd know it's not about looking up a single record

Breza · 28d ago

But but my GCP-PDE exam said every project needs massive and expensive infrastructure to optimize every tiny detail, minimize latency, and deliver five nines of availability! If your cloud bill isn't five figures every month, are you even a dev? /s

selcuka · 32d ago

> It sounds like poorly designed database queries with unindexed joins.

I find it amusing that you think every database operation imaginable can be performed in less than 20 seconds if we throw in a few indexes. Some things are slow no matter how much you optimise them.

The GP could have implemented them as async endpoints, or callbacks, but obviously they've already considered those options.

throwup238 · 32d ago

It's the kind of prescriptive cargo culting that is responsible for a significant fraction of pain involved in software engineering, right up there with DRY and KISS and shitty management.

I bet the GP abstracts out a function the second there's a third callsite too, regardless of where it's used or how it will evolved - only to add an options argument and blow up the cyclomatic complexity three days later.

beatthatflight · 32d ago

So what about flight searches where we have to query several 3rd party providers, and can take 45 seconds to get results from all of them (out of my control). I can dynamically update the page (and do) but a scraper would have to wait 20-45 seconds to get the 'cheapest' flight from my site. I can add async the queries and have the fastest pipes, but if the upstream providers take their time (they need to query their GDSs as well), there's not much you can do.

Breza · 28d ago

Oof, reminds me of the data pipelines I maintain that pull data from Salesforce. I've optimized the heck out of all of our internal database queries and they're blazing fast. Then my pipeline has to wait patiently for the SOQL response.

andrethegiant · 32d ago

Shameless plug: prefix any URL with https://pure.md/ to get the pure markdown of that page. Useful for direct piping into an LLM. Has bot detection avoidance, proxy rotation, and headless JS rendering built in.

yoble · 32d ago

Love the easter egg when going to https://pure.md/https://pure.md

matt-p · 32d ago

That's excellent pricing from a structural perspective.

fredoliveira · 32d ago

that looks fantastic - well done!

smartmic · 32d ago

My preferred "self-hosted" webscraper is a local, single binary called xidel [1]. The feature I really like is that it can also follow links.

[1] https://github.com/benibela/xidel

darkwater · 32d ago

Wow, it's written in Pascal! That surely brings me to memory lane.

DocTomoe · 32d ago

With Pascal being my first "adult" language, not used in 20 years ... it is surprising how readable that code is. Makes me wish for such simpler times.

benibela · 28d ago

that fits, I wrote the first code for xidel almost 20 years ago

and it still uses Pascal because I didn't plan to change it but just wanted to show people what I programmed 20 years ago

darkwater · 28d ago

What environment are you using nowadays to develop in Pascal? Editor, possible plugins etc?

benibela · 23d ago

FreePascal / Lazarus

There was actually just a large discussion thread in the Lazarus forum wondering about why it is not more popular, with almost 300 comments. But now the thread got locked. That might be a reason.

I used Delphi on Windows 98. But that became outdated, so I ported it to FreePascal. FreePascal has a lot of integrated libraries, but I do not really use anything I did not already use with Delphi.

renegat0x0 · 32d ago

Not a web scraper, but a web crawler software. Allows to specify method of crawling, selenium, and others. Returns data in JSON (status code, text contents, etc).

[1] https://github.com/rumca-js/crawler-buddy

TheTaytay · 32d ago

Does anyone know of a scraper that uses LLMs/natural language to build a deterministic, robust script that I can use to scrape the same site in the future? All of the natural language extractors I’ve seen so far need an LLM every time, but that seems unnecessary…

throwup238 · 32d ago

llm-scraper [1] does a decent job but it's still a bit fragile. The biggest problem I have is all the React CSS-in-JS libraries that use hashes in their class names, which the LLM isn't smart enough to ignore.

[1] https://github.com/mishushakov/llm-scraper

cdolan · 32d ago

What have you had success doing with this? Curious to test it

throwup238 · 32d ago

I mostly use it to aggregate event calendars for all the concert/sport/etc venues, meetups, and clubs in my area and do some other scraping tasks. I host a little wrapper around llm-scraper on a DigitalOcean droplet that I call from Val.town scripts

I only check most places once a week so I use the LLM to do the scraping but there are a few cases where I have to scrape thousands of pages very frequently so I use the more deterministic script it generates instead.

dddw · 31d ago

Oh Im interested in doing something similiar, is it hard to do?

cdolan · 32d ago

Great thanks!

TheTaytay · 32d ago

Nice! Thanks!

cdolan · 32d ago

We’ve built one internally using browser-use to generate playwright code

Works ok. Not as automated as I’d like

nicman23 · 32d ago

they are all quite bad

nomilk · 32d ago

I used to scrape back in the day when it was easy (literally just make a request and parse html). Seems cloudflare checkboxes / human verification are very commonplace nowdays. Curious how(/if) web scrapers get around those?

welanes · 32d ago

1. Clicking the box programmatically – possible but inconsistent

2. Outsourcing the task to one of the many CAPTCHA-solving services (2Captcha etc) – better

3. Using a pool of reliable IP addresses so you don't encounter checkboxes or turnstiles – best

I run a web scraping startup (https://simplescraper.io) and this is usually the approach[0]. It has become more difficult, and I think a lot of the AI crawlers are peeing in the pool with aggressive scraping, which is making the web a little bit worse for everyone.

[0] Worth mentioning that once you're "in" past the captcha, a smart scraper will try to use fetch to access more pages on the same domain so you only need to solve a fraction of possible captchas.

nomilk · 32d ago

That's awesome. Thanks for sharing.

First time hearing of the fetch() approach! If I understand correctly, regular browser automation might typically involve making separate GET requests for each page. Whereas the fetch() strategy involves making a GET for the first page (just as with regular browser automation), then after satisfying cloudflare, rather than going on to the next GET request, use fetch(<url>) to retrieve the rest of the pages you're after.

This approach is less noisy/impact on the server and therefore less likely to get noticed by bot detection.

This is fascinating stuff. (I'd previously used very little javascript in scrapes, preferring ruby, R, or python but this may tilt my tooling preferences toward using more js)

therein · 32d ago

Almost. I mean it's not like fetch(..) is going to lead to some esoteric kind of HTTP request method. I am guessing parent comment is saying what it is saying because fetch will utilize the cookies and other crumbs set by the successful completion of the captcha. If you can take all those crumbs and include it in your next GET request, you don't need to resort to utilizing fetch.

tough · 32d ago

Scammers will use fingerprints from their victims browser/IP/geolocation to try and impersonate them, you basically can buy not only stolen credentials but also the environment in which to run them -safely- from such vendors

Tokumei-no-hito · 32d ago

first time hearing about fetch too. but i don't see the advantage. is fetch reusing the connection and a manual page load not?

cess11 · 32d ago

Low effort baseline would be https://seleniumbase.io/, to drive a preconfigured web browser that looks relatively human to the network service. Typically it just clicks through the one-click captcha:s.

If that's not good enough you'll likely have to fiddle with your own web driver and possibly a computer vision rig to manage to click through 'find the motorcycle' kind of challenges. Paying a click farm to do it for you is probably cheaper in the short run.

An important hurdle is getting reputable IPv4 addresses to do it from, if you're going to do it a lot. Having or renting a botnet could help, but might be too illegal for your use case.

gruez · 32d ago

>Seems cloudflare checkboxes / human verification are very commonplace nowdays. Curious how(/if) web scrapers get around those?

You can get a real browser[1] to check the box for you, then use the cookies in your "dumb" scraper.

[1] https://github.com/FlareSolverr/FlareSolverr

ricardo81 · 32d ago

Some CDNs go to the length of fingerprinting the TLS and HTTP/2 handshakes to see if you're a bot. As others have mentioned, using an automated browser tends to be the broadest solution.

nicman23 · 32d ago

i usually use a real browser that i use, profile and all

anxman · 32d ago

By clicking the box

monkeydust · 32d ago

Practical use-case.

I am looking for a way to throw an address at a planning authority (UK) and download the associated documents for that property. Could this or another tool help?

e.g.

https://publicaccess.barnet.gov.uk/online-applications/appli...

As pure random example.

A property can have multiple planning applications and under each many documents.

What I have found useful (saved me time and potential lost £££) is to take the documents, combine to single pdf and provide to Gemini 2.5 Pro and then ask it to validate against agent specification for a property.

Over the weekend found a place that was advertising a feature of the house that was explicitly prohibited through planning decision notice.

Called the Agent up on it who claimed no knowledge but said this would have come up through solicitor checks, which it would have done, much later down the process with more or my money spent and considerable time lost.

Of course all this possible without LLMs but just makes it easier/cheaper to check at scale.

cess11 · 32d ago

Could just cut out the href-value with grep and sed or a bit of scripting, '.pdf' seems to only occur on those links.

I'd keep it simple like that until I need to do periodic comparisons, i.e. actually need scrapers and is prepared to build what's needed to automatically watch and process directories where the scrapers put the files.

3abiton · 32d ago

> extract data from websites with precision using XPath selectors.

I've used XPath for crawling with selenium, and it used to be my favorite way, but turned out quite unreliable if you don't combine it with other selectors as certain website are really badly designed and have no good patterns. So what's the added value over pure selenium?

cess11 · 32d ago

Check whether the site is actually server side rendered, because if it's a browser client that talks JSON to the backend, you could do the same.

tengbretson · 32d ago

Anyone have any experience webscraping from a Starlink IP? My assumption is you could stay under the radar due to cg nat, but it's not exactly something I want to be the first to find out about.

lyjackal · 32d ago

Mobile 4g USB sticks you can usually rotate your IP address by reconnecting. I tried on a pi, it was inconsistent. This was just with some random test mobile plan from rando carrier renting off Verizon I think

dewey · 32d ago

Seems much easier to just pay for a rotating proxy pool.

iSloth · 32d ago

Interesting, wish it had markdown output like firecrawl for embedding/llm use cases

jsemrau · 32d ago

I would prefer if we'd build a programmable web that provides value without relying on "scraping" websites for content. Most applications that do this are not well intended.

gzkk · 32d ago

There is quite high probability, that your own UserScripts will be well intended ;)

_QrE · 32d ago

Is there a reason for using Selenium over something like Playwright? I haven't had very many positive experiences with selenium, and playwright I found is easier to use and more flexible.

Also, for stuff like this:

`modified_value = original_value.replace("HeadlessChrome", "Chrome")`

There's quite a few ways to figure out that a browser is a bot, and I don't think replacing a few values like this does much. Not asking you to reveal any tricks, just saying that if you're using something like Playwright, you can e.g. run scripts in the browser to adjust your fingerprint more easily.

jpyles · 32d ago

I am quite aware, but I actually built most of the scraping logic a long time ago, before I even knew that playwright was a thing.

I am looking to refactor a lot of this, and switching over to playwright is a high priority, using something like camoufox for scraping, instead of just chromium.

Most of my work on this the past month has been simple additions that are nice to haves

michaeljx · 32d ago

I was in a similar boat with my scrapers. Started with Selenium 5-6 years ago and only discovered Playwright 2 years ago. Spend a month or so swapping the two, which was well worth it. Cleaner API, async support.

nkozyra · 32d ago

Playwright was miles ahead of selenium but what I think is really overlooked is chromedp

jpyles · 32d ago

Luckily, I have some experience with playwright, so swapping shouldn't take me too long.

Currently working on a PR to swap over

windexh8er · 32d ago

If you're a fan of Playwright check out Crawlee [0]. I've used it for a few small projects and it's been faster for me to get what I've needed done.

[0] https://crawlee.dev/

kristopolous · 32d ago

It's by apify which is an interesting community

jpyles · 32d ago

With the custom headers, you can actually trick a lot of sites with bot protection to let you load their sites (even big sites like youtube, which I have found success in)

dotancohen · 32d ago

How do you work around pop-ups for newsletters and such? Look at the BBC for a good example.

anxman · 32d ago

Pack ad blockers into your containers. They can be loaded into Chrome and help immensely in suppressing popovers while crawling.

dotancohen · 32d ago

Thank you, I'll experiment with that. Tips and advice welcome!

tough · 32d ago

Another cool trick is to deny all the content types you don't care about in your playwright. so if you only want text why bother allowing requests for fonts, css, svgs, images, videos, etc

Just request the html and cap down all the other stuff

PS: I also think this has the nice side-effect of you consuming less resources (that you didnt care about/need anyways) from the server, so win win

dotancohen · 25d ago

That is a great tip, thank you!

throwaway81523 · 32d ago

Last time I looked, Selenium was able to use Firefox. IDK about Playwright, but Puppeteer was Chrome-only.

stuffoverflow · 32d ago

Playwright supports Firefox, chromium and webkit

vivzkestrel · 32d ago

does this implement a rotating proxy IP address service?

mellosouls · 32d ago

From the repo (clipped):

When using Scraperr, please remember to:

Respect robots.txt

Terms of Service

Rate Limiting

Kudos for promoting ethical usage; makes a change from some of the grifters selling borderline ddos-bots as crawlers.

gitroom · 32d ago

pretty cool seeing people still tweak their own scraping tools, but the cat and mouse game never ends huh - you think the web ever gets more open again or just keeps locking down?

tommica · 32d ago

Well, it won't get more open by us just bitching here and doing nothing else

Asteroid Defense: The Real-World Impact of NASA's Dart Mission (spectrum.ieee.org)

Does the model know how many times it's called a tool? (github.com)

I write novels and build AI. The debate is more complex than either side admits (fastcompany.com)

LLM provider will go down, but you don't have to (assembled.com)

AMD Advancing AI: MI350X and MI400 UALoE72, MI500 UAL256 (semianalysis.com)

ConverseJS 11.0.1 – A powerful, open-source and web-based XMPP chat client (conversejs.org)

We Want Robots at Work but Humans in Art (twitter.com)

Nitpicking Gladiator's Iconic Opening Battle, Part II (acoup.blog)

Jensen Huang says he disagrees with almost everything Anthropic CEO Amodei says (fortune.com)

Logitech's new Flip Folio iPad case has a removable wireless keyboard (theverge.com)

Swimming world body will banish participants in pro-doping Enhanced Games (theguardian.com)

Cargo fuzz: a cargo subcommand for fuzzing with libFuzzer (github.com)

Modularizing George Cave's eInk Energy Dashboard (slipway.co)

How to Get the RSS Feed for Any YouTube Channel (2024) (chuck.is)

HTML WARDen - an HTML-based wiki (ratfactor.com)

Shoelace: A forward-thinking library of web components (shoelace.style)

When Does US Debt Become Genuinely Bad? [video] (youtube.com)

Scaling Laws in Autonomous Driving (waymo.com)

Don't stop till you get enough – sample size in machine learning (blog.engora.com)

What's Not to Like about "Unlikeable Characters"? (countercraft.substack.com)

I collect 8k leads/week w AI Lead Magnet (quickgen.ai)

Erie Insurance Reports 'Information Security Event' Caused Network Outage (insurancejournal.com)

IBM Has a Roadmap to a 'Fault-Tolerant' Quantum Computer by 2029 (wsj.com)

Webb telescope spots infant planets in different stages of development (dawn.com)

Fucoidans are senotherapeutics that enhance SIRT6-dependent DNA repair (researchsquare.com)

The Hat, the Spectre and SAT Solvers (2024) (nhatcher.com)

Transfer – A simple local file server app for Android (github.com)

Luxe Game Engine (luxeengine.com)

Tech Giants' New AI Ad Tools Threaten Big Agencies (wsj.com)

A Simple Showcase for the Sea-of-Nodes Compiler IR (github.com)

'We're done with Teams': German state hits uninstall on Microsoft (france24.com)

Israel "had clear U.S. green light for Iran strike" (axios.com)

Parquet on Iceberg Outperforms MergeTree (altinity.com)

Ask HN: Why does EU institutions and member gov. websites have cookie banners?

Ask HN: Is ageism in tech still a problem?

Show HN: 10k Lines of Basic (10klob.com)

Show HN: Algochat – Real-Time AI Chatbots for Streamers (algochat.io)

NASA scientist tasked with identifying asteroids on collision course with Earth (abc.net.au)

Stuck? Build Your Language Backwards (jimmyhmiller.com)

What I talk about when I talk about IRs (bernsteinbear.com)

RFC: type safe search params defined in routes.ts (github.com)

US Streetlights Are Turning Purple (scientificamerican.com)

30 Minutes with a Stranger (pudding.cool)

Honored to accept a direct commission as a Lt. Colonel (twitter.com)

Compiling C with Zig (mitchellhanberg.com)

Delay, Interfere, Undermine (propublica.org)

Ask HN: Casual Math Book Suggestions

Launch your WebSocket within seconds (cloudfully.io)

Show HN: Garlic – Java/Android decompiler written in C

Why did Apple build Linux Container support? (micahhausler.com)

Scraperr – A Self Hosted Webscraper

Comments (94)