Is there anything like this for more commodity Arm cores (Neoverse V2), or do we think the insights from Apple silicon cores will generalize well to those other Arm architectures?
mananaysiempre · 45m ago
Finally, a processor manufacturer defects from the obfuscatory equilibrium. Granted, Apple’s processor people are not saints—I’ve yet to see even a full table of throughputs, latencies, and port loads from them, let alone an accurate CPU model—but I welcome anything that might maybe, hopefully, pretty please start a race of giving more accurate data to people doing low-level optimization.
touisteur · 25m ago
Intel Processor Trace was already pretty great. Built an MC/DC coverage tool with it. Used it for fine-grained profiling, live program monitoring...
bri3d · 40m ago
What’s your beef with VTune and uProf?
jauntywundrkind · 40m ago
Longer term I sort of dream of doing computing from the inside out, using all this tracing data we've started gathering not just for observability but as a log and engine of compute: the record of what computing has been done as an event-source, for an event sourcing computing architecture.
ip26 · 13m ago
The present opportunity, in my view, is to feed this tracing into the development of superior compilers. This is starting to happen with automated profiling by the compiler, but you can imagine the profiling expanding to an enormous degree, with the compiler tracing the program it is building in great detail.
do_not_redeem · 1h ago
> Instead of statistical sampling like most profilers, you get a complete picture of your app’s execution flow.
Potentially interesting, but it's not really clear whether this is anything new or not. valgrind + kcachegrind does this too.
These screenshots look a lot like kcachegrind with a slightly reimagined UI. Is there actually anything new here, or is this another case of Apple finally catching up to the open source world?
nkurz · 47m ago
As 'GeekyBear' implies in a sibling comment, valgrind works with an emulation of an ideal processor rather than directly on the actual CPU. Sometimes this gives you a good idea of how the program will actually run, and sometimes it doesn't. As processors became more complex, it got farther and farther from the truth. Personally, I started in the Valgrind era and stopped using it as soon as better tools using native instrumentation became available. If Apple's approach works as well as described, it is much better than anything from that era.
do_not_redeem · 36m ago
I've never found cachegrind inaccurate, but maybe I'm not doing hardcore enough performance work. You can also use perf and get your numbers straight from the hardware if that's what you need. Truth be told I mainly use cachegrind because I prefer kcachegrind's UI to hotspot.
(I even prefer cachegrind's approach since the numbers will be less distorted by other random background activity on the machine, but that could just be idealism on my part, who knows.)
If perf or the vendor-specific tools like vtune/uprof aren't sufficient for you, then I'm curious: what do you use?
nkurz · 7m ago
I switched from emulator tools like valgrind to tools with hardware support like perf, pmu-tools, and VTune. I generally found them sufficient, but sometimes buggy and difficult to use.
Cachegrind is occasionally inaccurate due to an inaccurate model, but the greater problem was that cache hit percentages only tell a fraction of the story. To be able to predict performance I often needed to be able to accurately measure things like the number of memory requests in flight.
In general I have much greater faith in the on chip performance registers. That said, other than glancing at news stories like this I haven't been keeping up with recent advances. I guess it's possible that cachegrind and friends have improved since I was using them.
GeekyBear · 1h ago
> Potentially interesting, but it's not really clear whether this is anything new or not. valgrind + kcachegrind does this too.
Looking at the kcachegrind homepage, it doesn't sound like they are pulling their data directly from the CPU core itself:
> Callgrind uses runtime instrumentation via the Valgrind framework for its cache simulation and call-graph generation.
Apple seems to have modified its core design so that it will stream data to a log file while the code is running.
> Recent Apple silicon devices can capture a processor trace where the CPU stores information about the code it runs, including the branches it takes and the instructions it jumps to. The CPU streams this information to an area on the file system so that you can analyze it with the Processor Trace instrument.
jauntywundrkind · 47m ago
Intel has a Performance Monitoring Unit on its core that has significant overlap.
Forgetting this tool-space, but at least some of these tools can make use of that hardware:
If you need data straight from the hardware, you can use e.g. perf+hotspot, although I've heard that perf's tracing (not sampling!) supports fewer CPUs (but still more than just one).
urbandw311er · 1h ago
I feel like it probably would work on older hardware; this very much smacks of forced obsolescence. Just guessing though.
nozzlegear · 38m ago
Is forced obsolescence the right term for a somewhat obscure debug tool built for developers of macOS/iOS software? I don't imagine there are many people who would feel forced to upgrade their machines more quickly just to get access to this.
astrange · 1h ago
It would not. You could port cachegrind I suppose.
(Even if hardware support did exist earlier, you don't want to deal with errata for a new hardware feature. It's kind of amazing anything ever works.)
https://developer.apple.com/documentation/xcode/analyzing-cp...
Searching now for an example, I hit on a comment I made here a few years ago where this new tool probably would have been helpful: https://news.ycombinator.com/item?id=18442131
https://kcachegrind.github.io/html/Home.html
https://github.com/intel/pcm https://github.com/andikleen/pmu-tools