Faster substring search with SIMD in Zig

Comments (18)

burntsushi · 33m ago
Good write-up. This is a very popular approach to substring search! It is still worst case `O(m*n)` though. Do you have a fallback implementation like the `memchr` crate has to guarantee `O(m+n)`?

I'll also push back on some bits in the end:

    > But if it’s so much better, then why haven’t I made a pull request to
    > change std.mem.indexOf to use SIMD? Well, the reason is that
    >
    > std.mem.indexOf is generic over element size, and having a size
    > larger than u8 makes the algorithm much slower
    >
    > The algorithm used in std.mem.indexOf is cross-platform, while the
    > SIMD code wouldn’t be. (not all platforms have SIMD registers at all,
    > Arm has only 128-bit)
Does Zig not have a way to specialize this for sequences of unsigned 8-bit integers? If not, and you're therefore forced to use a more generic algorithm, that seems pretty unfortunate.

    > Substring searching is rarely the bottleneck in programs,
    > especially ones written in a fast language like Zig. That’s why
    > I don’t personally think it would be worth it to add it to the
    > standard library.
Oh, I'm not sure I buy this at all! Substring search is a primitive operation and can easily be a bottleneck. There's a reason why widely used substring search implementations tend to be highly optimized.
ashvardanian · 1h ago
I like that more people are getting involved with SIMD, and there have been several posts lately on both memmem-like and memcpy-like operations implemented in SIMD in different programming languages.

In most cases, though, these still focus on AVX/NEON instructions from over 10 years ago, rather than newer and more powerful AVX-512 variations, SVE & SVE2, or RVV.

These newer ISAs can noticeably change how one would implement a state-of-the-art substring search or copy/move operation. In my projects, such as StringZilla, I often use mask K registers (https://github.com/ashvardanian/StringZilla/blob/2f4b1386ca2...) and an input-dependent mix of temporal and non-temporal loads and stores (https://github.com/ashvardanian/StringZilla/blob/2f4b1386ca2...).

In typical cases, the difference between the suggested SIMD kernels and the state-of-the-art can be as significant as 50% in throughput. As SIMD becomes more widespread, it would be beneficial to focus more on delivering software and bundling binaries, rather than just the kernels.

lukaslalinsky · 1h ago
I really wish Zig decided to add SIMD intrinsics. There are many SIMD algorithms that could be implemented, but you have to switch back to C for those, because they depend on operations outside of what LLVM provides for vectors.
lokeg · 2h ago
What about the worst case? I.e. something like searching for 1000 'a's in a long string of 'a's interspersed with 'b's every 500-1000 steps? Seems accidentally quadratic, unfortunately, in the absence of some KMP-like fallback.
expenses3 · 31m ago
How is it quadratic? You do 1000 checks at every character in the haystack, but that's still O(n).
burntsushi · 11m ago
The worst case is that `std.mem.eql`[1] is executed at every position in the haystack, which gives you `O(n*m)` time complexity. Several substring search algorithms are `O(n)`.

[1]: https://github.com/aarol/substr/blob/9392f9557de735929dfb79e...
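
To make the worst case concrete, here is a hedged sketch (my own construction following lokeg's description above, not code from the article) of an input where the cheap byte filter matches almost everywhere and each failed `std.mem.eql` verification scans nearly the whole needle, giving `O(n*m)` work overall:

```zig
const std = @import("std");

test "pathological input for filter-then-verify substring search" {
    const allocator = std.testing.allocator;

    // Needle: 1000 'a's.
    const needle = try allocator.alloc(u8, 1000);
    defer allocator.free(needle);
    @memset(needle, 'a');

    // Haystack: 'a's with a 'b' every 999 bytes, so no window of
    // 1000 consecutive 'a's exists, but every verification gets
    // close to the end of the needle before failing.
    const haystack = try allocator.alloc(u8, 1_000_000);
    defer allocator.free(haystack);
    @memset(haystack, 'a');
    var i: usize = 999;
    while (i < haystack.len) : (i += 999) haystack[i] = 'b';

    try std.testing.expect(std.mem.indexOf(u8, haystack, needle) == null);
}
```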

codethief · 2h ago
Nice article!

Also, this might be a stupid question (I'm a Zig newbie) but… instead of calling std.mem.eql() in the while loop to look at each potential match individually, couldn't you repeat the same trick as before? That is, use SIMD to search for the second and second-to-last character of the needle, then third and third-to-last, and so on, and finally take a bitwise AND of all the resulting bit masks? This way, one would avoid looking at each potential match one by one, and instead look at all of them at the same time.

Even if that doesn't work for some reason and you still need to loop over all potential matches individually, couldn't you use SIMD inside the while loop to replace std.mem.eql and thereby speed up string comparison? My understanding was that std.mem.eql loops over bytes one by one and compares them?
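
For what it's worth, a rough sketch of the first idea (mine, not code from the article; it assumes the needle is at least four bytes long and that 16 bytes are readable past each window start): probe a few byte positions of every candidate window with SIMD compares and AND the masks, so `std.mem.eql` only runs for windows that survive every probe.

```zig
const Block = @Vector(16, u8);

// Returns a 16-bit mask of candidate start positions in
// haystack[base .. base + 16] that match the needle at all probed offsets.
fn candidateMask(haystack: []const u8, base: usize, needle: []const u8) u16 {
    const m = needle.len;
    // Probe the first, second, second-to-last and last bytes of the window.
    const offsets = [_]usize{ 0, 1, m - 2, m - 1 };
    var mask: u16 = 0xffff;
    for (offsets) |off| {
        const chunk: Block = haystack[base + off ..][0..16].*;
        const probe: Block = @splat(needle[off]);
        const eq: u16 = @bitCast(chunk == probe);
        mask &= eq; // only positions matching every probed byte survive
    }
    return mask;
}
```

Each set bit in the returned mask is still a candidate that needs a full `std.mem.eql` check, but far fewer of them should survive than with a two-byte probe alone.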

ncruces · 2h ago
Knowing little about Zig, I'd guess std.mem.eql very likely already uses SIMD.

This is about using SIMD to avoid even calling std.mem.eql for 99% of the possible attempts.

llimllib · 2h ago
std.mem.eql is here, super easy to read: https://github.com/ziglang/zig/blob/master/lib/std/mem.zig#L...

My read is that it would use SIMD if T is a @Vector, and not otherwise? But I'm neither a Zig nor a SIMD expert.
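
For codethief's second question, the core of a vectorized byte comparison is small. A hedged sketch (mine, not the stdlib's actual implementation): compare 16-byte chunks with `@Vector` and finish the tail with a scalar loop.

```zig
fn eqlBytes(a: []const u8, b: []const u8) bool {
    if (a.len != b.len) return false;
    const Block = @Vector(16, u8);
    var i: usize = 0;
    // Compare 16 bytes at a time; LLVM lowers this to SIMD where available.
    while (i + 16 <= a.len) : (i += 16) {
        const va: Block = a[i..][0..16].*;
        const vb: Block = b[i..][0..16].*;
        if (!@reduce(.And, va == vb)) return false;
    }
    // Scalar tail for the remaining 0-15 bytes.
    while (i < a.len) : (i += 1) {
        if (a[i] != b[i]) return false;
    }
    return true;
}
```

That said, as ncruces notes above, the compiler may well vectorize the stdlib's simple loop on its own, so an explicit version like this isn't necessarily faster.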

jiehong · 2h ago
Nice!

But, does that work with non-ascii characters? (aka Unicode).

llimllib · 1h ago
Kind of! This code assumes you're dealing with a byte slice, which means you've already encoded your Unicode data.

If you just encoded your string to bytes naïvely, it will mostly still work, but it will get some combining characters wrong if they're represented differently in the two sources you're comparing (e.g., e-with-accent as a single character vs. e followed by a combining accent character).

If you want to be more correct, you'll normalize your Unicode string[1], but note that there are four defined normalization forms, so you'll need to choose the one that is the best tradeoff for your particular application and data sources.

[1]: https://en.wikipedia.org/wiki/Unicode_equivalence#Normalizat...

codethief · 1h ago
> If you just encoded your string to bytes naïvely

By "naïvely" I assume you mean you would just plug in UTF-8 bytestrings for haystack & needle, without adjusting the implementation?

Wouldn't the code still need to take into account where characters (code points) begin and end, though, in order to prevent incorrect matches?

burntsushi · 54m ago
IDK what "encoded your string to bytes naively" means personally. There is only one way to correctly UTF-8 encode a sequence of Unicode scalar values.

In any case, no, this works because UTF-8 is self synchronizing. As long as both your needle and your haystack are valid UTF-8, the byte offsets returned by the search will always fall on a valid codepoint boundary.
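
A small illustration of that property (a hedged example of mine, not from the original comment):

```zig
const std = @import("std");

// Because UTF-8 is self-synchronizing, a byte-level match of one valid
// UTF-8 string inside another always starts on a code point boundary.
test "byte search on valid UTF-8 lands on a code point boundary" {
    const haystack = "こんにちは日本語";
    const needle = "日本";
    const idx = std.mem.indexOf(u8, haystack, needle).?;
    try std.testing.expectEqual(@as(usize, 15), idx); // five 3-byte code points precede it
    // Everything before the match decodes cleanly, so idx is a boundary.
    try std.testing.expect(std.unicode.utf8ValidateSlice(haystack[0..idx]));
}
```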

In terms of getting "combining characters wrong," this is a reference to different Unicode normalization forms.

To be more precise... Consider a needle and a haystack, represented by a sequence of Unicode scalar values (typically represented by a sequence of unsigned 32-bit integers). Now encode them to UTF-8 (a sequence of unsigned 8-bit integers) and run a byte level search as shown by the OP here. That will behave as if you've executed the search on the sequence of Unicode scalar values.

So semantically, a "substring search" is a "sequence of Unicode scalar values search." At the semantic level, this may or may not be what you want. For example, if you always want `office` to find substrings like `oﬃce` (with the `ﬃ` ligature) in your haystack, then this byte level search will not do what you want.

The standard approach for performing a substring search that accounts for normalization forms is to convert both the needle and haystack to the same normal form and then execute a byte level search.
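
A minimal sketch of why that matters (my example, not from the original comment): NFC encodes "é" as the single code point U+00E9, NFD as "e" plus combining U+0301, and the resulting UTF-8 bytes differ, so a byte-level search for one form won't find the other.

```zig
const std = @import("std");

test "NFC needle does not match an NFD haystack at the byte level" {
    const nfc = "caf\u{e9}"; // 'é' as one code point: 63 61 66 c3 a9
    const nfd = "cafe\u{301}"; // 'e' + combining acute:  63 61 66 65 cc 81
    try std.testing.expect(std.mem.indexOf(u8, nfd, nfc) == null);
    // Normalize both sides to the same form first and the byte search works;
    // Zig's stdlib doesn't ship a normalizer, so that takes a library.
    try std.testing.expect(std.mem.indexOf(u8, nfc, nfc) != null);
}
```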

(One small caveat is when the needle is an empty string. If you want to enforce correct UTF-8 boundaries, you'll need to handle that specially.)

llimllib · 12m ago
By naively, I meant without normalization.

You know much more about this than I do though

jiehong · 1h ago
Thanks for this detailed answer!
codethief · 1h ago
I suppose generalizing the approach to UTF-32 should be straightforward, but variable-length encodings like UTF-8 and UTF-16 might be more involved(?) Either way, I'm sure BurntSushi found a solution and built it into ripgrep.
burntsushi · 50m ago
ripgrep always deals with UTF-8. When it sees a different encoding, like UTF-16, ripgrep first transcodes to UTF-8 and then searches.

This is absolutely in part because of all of the byte-oriented optimizations that are baked into ripgrep (and its regex engine). Note that I said "in part": making ripgrep (and its regex engine) work on things other than a sequence of bytes is far more difficult than just porting a bunch of SIMD algorithms. There are also many optimizations and architectural constraints in the code based on the alphabet size. That is, with 8-bit integers, the alphabet size is 256; with 16-bit integers, it is 65,536.

suddenlybananas · 2h ago
> The difference between 4μs vs 1μs is extremely small, but it’s slightly faster nonetheless.

Put that in a loop and it's an enormous speed-up.