On the positive side, you can scale out memory quite a lot: fill up PCIe slots, even have memory external to your chassis. Memory tiering has a lot of potential.
On the negative side, you've got latency costs to swallow. You don't get distance from the CPU for free (there's a reason the memory on your motherboard is as close to the CPU as practical): https://www.nextplatform.com/2022/12/05/just-how-bad-is-cxl-.... The CXL 2.0 spec adds roughly 200ns of latency to every access to memory behind it, so you've got to think carefully about how you use it, or you'll cripple yourself.
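To get a feel for what 200ns of added latency does, here's a minimal sketch of a dependent pointer-chase: each load must complete before the next can start, so time per access approximates round-trip latency. The buffer size and iteration count are arbitrary; run it with the buffer bound to local DRAM vs. the CXL node (e.g. via numactl --membind) and compare.

    /* chase.c - rough per-access latency probe; build: cc -O2 chase.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64u * 1024 * 1024 / sizeof(void *))    /* 64 MiB, well past LLC */

    int main(void) {
        void **buf = malloc(N * sizeof(void *));
        size_t *idx = malloc(N * sizeof(size_t));
        for (size_t i = 0; i < N; i++) idx[i] = i;
        for (size_t i = N - 1; i > 0; i--) {          /* shuffle: defeat prefetcher */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < N; i++)                /* link into one random cycle */
            buf[idx[i]] = &buf[idx[(i + 1) % N]];

        const long iters = 50 * 1000 * 1000;
        void **p = &buf[idx[0]];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            p = (void **)*p;                          /* serialized dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns/access (sink: %p)\n", ns / iters, (void *)p);
        return 0;
    }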
There's been work on the OS side around data locality, but CXL hardware hasn't been widely available, so there's an element of "Well, we'll have to see". Azure has some interesting whitepapers out, as they've been investigating ways to use CXL with VMs: https://www.microsoft.com/en-us/research/wp-content/uploads/....
Yup, for best results you wouldn't just dump your existing pointer-chasing and linked-list data structures onto CXL (like Optane's transparent Memory Mode did).
But CXL-backed memory can use your CPU caches as usual, and PCIe 5.0 lane throughput is still good, assuming the CXL controller/DRAM side doesn't become a bottleneck. So you could design your engines and data structures to account for these tradeoffs: fetching/scanning columnar data structures, prefetching to hide latency, etc. You probably don't want global shared locks and frequent atomic operations on CXL-backed shared memory (once that becomes possible, in theory, with CXL 3.0).
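For instance, here's a hedged sketch of the batched-prefetch idea applied to hash probes (the kind of dependent random access where the extra latency hurts most). The bucket layout, hash function and batch size are all made up for illustration:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint64_t key, val; } bucket_t;  /* toy open-addressing table */

    static inline uint64_t hash64(uint64_t x) {      /* cheap mixer, illustrative */
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;
        return x;
    }

    /* Probe keys in batches: issue prefetches for every bucket the batch will
       touch, then do the lookups, so many ~200ns CXL line fills are in flight
       at once instead of one dependent miss at a time. */
    void probe_batch(const bucket_t *table, size_t mask,
                     const uint64_t *keys, size_t n, uint64_t *out)
    {
        enum { BATCH = 16 };                          /* tuning knob, assumption */
        for (size_t base = 0; base < n; base += BATCH) {
            size_t end = base + BATCH < n ? base + BATCH : n;
            for (size_t i = base; i < end; i++)
                __builtin_prefetch(&table[hash64(keys[i]) & mask], 0, 0);
            for (size_t i = base; i < end; i++) {
                const bucket_t *b = &table[hash64(keys[i]) & mask];
                out[i] = (b->key == keys[i]) ? b->val : 0;  /* no collision chain */
            }
        }
    }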
Edit: I'll plug my own article here - if you've wondered whether there were actual large-scale commercial products that used Intel's Optane as intended, the Oracle database took good advantage of it (both the Exadata and plain database engines). One use was low-latency durable (local) commits on Optane: https://tanelpoder.com/posts/testing-oracles-use-of-optane-p...
VMware supports it as well, but uses it as a simpler tiered-memory layer.
GordonS · 1h ago
Huh, 200ns is less than I imagined; even if it is still almost 100x slower than regular RAM, it's still around 100x faster than NVMe storage.
Dylan16807 · 57m ago
Regular RAM is 50-100ns.
jauntywundrkind · 1h ago
Most cross-socket traffic is >100ns.
mdaniel · 2h ago
> Buy From One of the Regions Below
> Egypt
:-/
But, because I'm a good sport, I actually chased a couple of those links, figuring that I could convert Egyptian Pounds into USD, but <https://www.sigma-computer.com/en/search?q=CXL%20R5X4> is "No results", and it's similar for the other ones that I could even get to load.
tanelpoder · 1h ago
Yeah, I saw the same. I've been keeping an eye on the CXL world for ~5 years and so far it's 99% announcements, unveilings and great predictions. The only CXL cards a consumer/small business can actually buy today are some experimental-ish 64GB/128GB ones. Haven't seen any of my larger clients use it either. Both Intel Optane and the DSSD storage effort got discontinued after years of fanfare; from a technical point of view, I hope the same doesn't happen to CXL.
sheepscreek · 1h ago
That is pretty hilarious. I wonder what’s the reason behind this. Maybe they wanted plausible deniability in case someone tried to buy it (“oh the phone lines were down, you’ll have to go there to buy one”).
JonChesterfield · 23m ago
I don't get it. The point of (DDRn) memory is latency. If it's on the far side of PCIe, latency is much worse than system memory. In what sense is this better than an SSD on the far side of PCIe?
wmf · 13m ago
It's only ~2x worse latency than main memory but 100x lower than SSD.
jonhohle · 1h ago
Why did something like this take so long to exist? I’ve always wanted swap or tmpfs available on old RAM I have lying around.
I'd rather ask why we had single-core (or early dual-core) CPUs with dual-channel memory controllers, yet now we have 16-core CPUs still with only dual-channel RAM.
Dylan16807 · 27m ago
DDR1 and DDR2 were clocked 20x and 10x slower than DDR5. The CPU cores we have now are faster, but not that much faster, and with the typical user having 8 or fewer performance cores, 128 bits of memory width has stayed a good balance.
If you need a lot of memory bandwidth, workstation boards have DDR5 at 256-512 bits wide. Apple Silicon supports that range on Pro and Max, and Ultra is 1024.
(I'm using bits instead of channels because channels/subchannels can be 16 or 32 or 64 bits wide.)
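Rough numbers, assuming DDR5-6400 across the board:

    128 bits:  16 B/transfer x 6400 MT/s ~= 102 GB/s  (typical desktop)
    512 bits:  64 B/transfer x 6400 MT/s ~= 410 GB/s  (workstation / Max)
    1024 bits: 128 B/transfer x 6400 MT/s ~= 819 GB/s (Ultra)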
christkv · 41m ago
Check out the Strix Halo (AI Max+ 395): it's got 8 memory channels, up to 128 GB, and 16 cores.
Dylan16807 · 25m ago
That's a true but misleading number. It's the equivalent of "quad channel" in normal terms.
aidenn0 · 47m ago
(S)ATA or PCI to DRAM adapters were widely available until NAND became cheaper per bit than DRAM, at which point the use for them kind of went away. For example: https://en.wikipedia.org/wiki/I-RAM (not a unique thing, merely the first one I found).
And then there are the more exotic options, like the stuff these folks used to make: https://en.wikipedia.org/wiki/Texas_Memory_Systems - IIRC Eve Online used the RamSan product line (apparently starting in 2005: https://www.eveonline.com/news/view/a-history-of-eve-databas... ).
IIRC Intel even made a DRAM card that was drum-memory compatible.
Dylan16807 · 54m ago
RAM controllers are expensive enough that it's rarely worth pairing them with old RAM lying around.
trebligdivad · 1h ago
My god - a CXL product! I'm really surprised anything got that far.
I'd been expecting external CXL boxes, not internal stuff.
bri3d · 1h ago
CXL is a standard for compute and I/O extension over PCIe signaling; it has been around for a few years, with a couple of RAM boards available (from SMART and others).
I think the main bridge chipsets come from Microchip (this one) and Montage.
This Gigabyte product is interesting since it's a little lower end than most CXL solutions - so far CXL memory expansion has mostly appeared in esoteric racked designs like the particularly wild https://www.servethehome.com/cxl-paradigm-shift-asus-rs520qa... .
roscas · 2h ago
That is amazing. Most consumer boards will only have 32 or 64 GB. To have 512 GB is great!
justincormack · 1h ago
You haven't seen the price of 128GB DDR5 RDIMMs; they are maybe $1300 each.
A lot of the initial use cases for CXL seem to be using up lots of older DDR4 RDIMMs to expand memory in newer systems - e.g. cloud providers have a lot of them.
kvemkon · 1h ago
Micron DDR5-5600 (128GB) for 900 euros (business pricing, without VAT).
tanelpoder · 2h ago
... and if you have the money, you can use 3 out of 4 PCIe 5.0 slots for CXL expansion. So that could be 2TB DRAM + 1.5TB DRAM-over-CXL, all cache coherent thanks to CXL.mem.
I guess there are some use cases for this for local users, but I think the biggest wins could come from the CXL shared memory arrays in smaller clusters. So you could, for example, cache the entire build-side of a big hash join in the shared CXL memory and let all other nodes performing the join see the single shared dataset. Or build a "coherent global buffer cache" using CPU+PCI+CXL hardware, like Oracle Real Application Clusters has been doing with software+NICs for the last 30 years.
Edit: One example of the CXL shared memory pool devices is Samsung CMM-B. Still just an announcement; I haven't seen it in the wild. So CXL arrays might become something like the SAN arrays in the future, but byte-addressable and with direct loading into CPU cache (with cache coherence): https://semiconductor.samsung.com/news-events/tech-blog/cxl-...
Both of the supported motherboards can take up to 2TB of DRAM.
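On the shared hash-join idea above: on Linux, CXL.mem expanders typically show up as a CPU-less NUMA node, so placing a build-side table on one is just NUMA-aware allocation. A minimal sketch with libnuma (the node number and size are assumptions; check numactl -H for the real topology):

    #include <numa.h>       /* link with -lnuma */
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        int cxl_node = 2;                /* assumption: the CPU-less CXL node */
        size_t sz = 1UL << 30;           /* 1 GiB for the build-side hash table */
        void *build_side = numa_alloc_onnode(sz, cxl_node);
        if (!build_side) { perror("numa_alloc_onnode"); return 1; }

        /* ... build the hash table here; reads still go through the normal
           CPU cache hierarchy, just with extra CXL latency on misses ... */

        numa_free(build_side, sz);
        return 0;
    }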
reilly3000 · 35m ago
Presumably this is about adding more memory channels via PCIe lanes. I'm very curious what kind of bandwidth one could expect from such a setup, as that is the primary bottleneck for inference speed.
Dylan16807 · 8m ago
The raw speed of PCIe 5.0 x16 is 63 billion bytes per second each way. Assuming we transfer several cache lines at a time, the overhead should be pretty small, so expect 50-60GB/s, which is on par with a single high-clocked channel of DRAM.
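The back-of-envelope, for reference:

    PCIe 5.0 x16: 32 GT/s x 16 lanes / 8 ~= 64 GB/s raw per direction,
                  minus protocol overhead -> ~50-60 GB/s usable
    DDR5-6400 channel: 6400 MT/s x 8 B = 51.2 GB/s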