I hope AMD can produce a chip that matches the H100 in training workloads.
lhl · 2h ago
Last year I had issues using MI300X for training, and when it did work, it was about 20-30% slower than an H100. But I'm doing some OpenRLHF (transformers/DeepSpeed-based) DPO training atm w/ the latest ROCm and PyTorch, and it seems to be doing OK, roughly matching the GPU-hour perf of an H200 for small ~12h runs.
Note: my previous testing was on a single (8x) MI300X node, while currently I'm testing on just a single MI300X GPU, so it's not quite apples-to-apples. Multi-GPU/multi-node training is still a question mark; this is just a single data point.
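For anyone trying to reproduce this, a minimal sanity check that a ROCm build of PyTorch actually sees the GPU. Just a sketch, assuming a working ROCm install; on ROCm builds the HIP device is exposed through the familiar torch.cuda API, so CUDA-targeted training code generally runs unchanged:

    import torch

    # torch.version.hip is set on ROCm builds of PyTorch (None on CUDA builds).
    print(torch.version.hip)
    # HIP devices are surfaced through the torch.cuda API.
    print(torch.cuda.is_available())      # True if the MI300X is visible
    print(torch.cuda.get_device_name(0))  # e.g. "AMD Instinct MI300X"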
fooker · 1h ago
It's even more jarring when you consider that the H100 is about three years old now.
AMD's priority should be figuring out how to reproduce the performance numbers they “claim” they are getting.
halJordan · 5h ago
Honestly, that was a hard read. I hope that guy gets an MI355 just for writing this.
AMD deserves exactly zero of the credulity this writer heaps onto them. They just spent four months not supporting their RDNA4 lineup in ROCm after launch; AMD, it turns out, is functionally capable of day-120 support. None of the benchmarks disambiguated where the performance is coming from. 100% they are lying on some level, e.g. representing their fp4 performance against fp8/16.
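To make the precision point concrete, here's a back-of-the-envelope sketch with made-up numbers (the roughly 2x peak throughput per precision halving is the usual spec-sheet pattern on these accelerators, not a measured result):

    # Hypothetical headline specs -- purely illustrative, not real chip numbers.
    vendor_fp4_pflops = 20.0      # vendor's quoted fp4 peak
    competitor_fp8_pflops = 10.0  # competitor's quoted fp8 peak

    # Peak throughput roughly doubles with each halving of precision,
    # so normalize fp4 down to fp8-equivalent before comparing.
    vendor_fp8_equiv = vendor_fp4_pflops / 2.0
    print(vendor_fp8_equiv)                          # 10.0
    print(vendor_fp8_equiv > competitor_fp8_pflops)  # False: the 2x "lead" vanishes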
jchw · 2h ago
I still find their delay in properly investing in ROCm on client hardware rather shocking, but in fairness they did finally announce that they will be supporting client cards on day 1 [1]. Of course, AMD has to keep the promise for it to matter, but they really do seem to have, for whatever reason, finally realized just how important it is that ROCm is well supported across their entire stack (among many other investments they've announced recently).
It's baffling that AMD is the same company that makes both Ryzen and Radeon, but the year-to-date for Radeon has been very good, aside from official ROCm support for RDNA4 taking far too long. I wouldn't get overly optimistic; even if AMD finally commits hard to ROCm and Radeon, that doesn't mean they'll be able to compete effectively against NVIDIA. But the consumer showing so far, with the 9070 XT and FSR4, wasn't bad, so I'm cautiously optimistic they've decided to miss some opportunities to miss opportunities. Let's see how long these promises last... maybe longer than a Threadripper socket, if we're lucky :)
[1]: https://www.phoronix.com/news/AMD-ROCm-H2-2025
Is this day 1 support a claim about the future or something they've demonstrated? Because if it involves the future it is safer to just assume AMD will muck it up somehow when it comes to their AI chips. It isn't like their failure in the space is a weird one-off - it has been confusingly systemic for years. It'd be nice if they pull it off, but it could easily be day 1 support for a chip that turns out to crash the computer.
I dunno; I suppose they can execute on server parts. But regardless, a good plan here is to let someone else go first and report back.
pclmulqdq · 5h ago
AMD doesn't care about you being able to do computing on their consumer GPUs. The datacenter GPUs have a pretty good software stack and great support.
fc417fc802 · 4h ago
I'm inclined to believe it, but that difference is exactly how Nvidia got so far ahead of them in this space. They've consistently gone out of their way to put their GPGPU hardware and software in the hands of the average student and professional, and the results speak for themselves.
zombiwoof · 1h ago
Just look at the disaster of ROCm, where you need to spend $300k on software engineers to get anything to work:
"25 complimentary GPU hours (approximately $50 US of credit for a single MI300X GPU instance), available for 10 days. If you need additional hours, we've made it easy to request additional credits."
stingraycharles · 2h ago
Yes, but then they fail to understand that a lot of “long tail” home projects, open-source stuff, etc. is done on consumer GPUs at home, which is tremendously important for ecosystem support.
wmf · 2h ago
What if they understand that and they don't care? Getting one hyperscaler as a customer is worth more than the entire long tail.
stingraycharles · 1h ago
The problem is that this is short-term thinking. You need students and professionals playing around with your tools at home and/or on their work computers to drive hyperscale demand in the long term.
This is why it’s so important AMD gets their act together quickly, as the benefits of these kinds of things are measured in years, not months.
moffkalast · 36s ago
Then they will stay irrelevant in the GPU space like they have been so far.
selectodude · 2h ago
Then they’re fools. Every AI maestro knows CUDA because they learned it at home.
jiggawatts · 1h ago
It’s the same reason there’s orders of magnitude more code written for Linux than for mainframes.
danielheath · 39m ago
Why would a hyperscaler pick the technology that’s harder to hire for (because there’s no hobbyist-to-expert pipeline)?
cma · 2h ago
Nvidia started removing NVLink with the 4000 series; they aren't heavily focused on that segment anymore either, and want to sell the workstation cards for uses like training models at home.
archerx · 3h ago
If they care about their future, they should. I am a die-hard AMD supporter, and even I am getting over their mediocrity and what seems to be constant self-sabotage in the GPU department.
zombiwoof · 1h ago
It’s AMD management. They’re just recycling 20-year VP lifers at AMD to take over key projects.
booder1 · 3h ago
I have trained on both large AMD and Nvidia clusters, and you're right, AMD support is good. I never had to talk to Nvidia support. That was better.
They should care about the broad availability of their hardware so that large customers don't have to find and fix their bugs. Let consumers do that...
fooker · 1h ago
It’s the same software stack.
echelon · 3h ago
> AMD doesn't care about you being able to do computing on their consumer GPUs
Makes it a little hard to develop for without consumer GPU support...
caycep · 5h ago
this is ROCm?
fooblaster · 4h ago
Yes, the MI300X/MI250 are best supported, as they directly compete with the datacenter GPUs from Nvidia, which actually make money. Desktop is a rounding error by comparison.
shmerl · 2h ago
Aren't they addressing it with the unified UDNA architecture? That's going to be a thing in future GPUs, making consumer and datacenter ones share the same arch.
Having different architectures was probably a big reason for the issue above.
Their MI300s already beat the H100, and the MI400s are coming soon.
AMD is a marketing company now