QRS: Epsilon Wrangling

13 zdw 4 7/9/2025, 8:45:19 PM tbray.org ↗

Comments (4)

o11c · 13h ago
I also have been playing with regexes recently. One observation that I've made: if you're willing to setting for something slightly weaker than regexes, you can make your code trivial to understand (and go straight to a DFA instead of going through NFA). AFAICT the only "hard" case (which I'm erroring out on) involves things like /(AB)+|(AC)+/ (where A, B, and C are arbitrarily complex patterns), since everything else can be easily simplified. And at least for the contexts I care about, that kind of regex is exceptionally rare.

... I probably should actually read the papers on how to do it properly though. Last time I tried, I got stuck in a tangle of "why does C make efficient hashmaps so hard!" - this time, I'm avoiding C until I have a fully-tested prototype (current status: 1.0KLoC logic, 0.5KLoC comments, 4.4KLoC test suite, 40% tests failing after a recent refactor [edit: I forgot Python enums don't compare equal to integers], 100% of the time being annoyed at how stupidly slow Python is if you use obscure programming features like "loops", "strings", "branching", or "functions").

kragen · 11h ago
Plausibly this approach is trivial to understand and implements full regexes, but it is slower than the NFA or DFA approach: http://canonical.org/~kragen/sw/dev3/redl.py

PEGs are in some ways more powerful than regexes, and in other ways less powerful, but usually the former matter more. This is not trivial to understand but I think it's not that hard either; it's a page of code: https://github.com/kragen/peg-bootstrap/blob/master/peg.md

That version doesn't memoize and so doesn't enjoy Packrat parsing's linear-time guarantee, but you can easily modify it to do so.

Another subset of regexes that's easy to understand is this single-page pattern matcher by Rob Pike from TPOP: https://www.cs.princeton.edu/courses/archive/spr09/cos333/be... which is enormously less code than my single-page PEG parser generator above. But then, it doesn't have to compile itself.

o11c · 11h ago
Unfortunately, neither "waste time" nor "waste space" are approaches worth pursuing.

We already have too many programs being written that are too simple and thus slow and/or wrong. We need to write code that is as simple as possible, but no simpler.

kragen · 6h ago
In practice you can always make a program take less space if it can use more time, or take less time if it can use more space; the guiltless perfection you seek does not exist.

I feel like PEG parsing can be fast and space-efficient, but I haven't seen an existence proof yet.