This is a CGRA. It's like an FPGA but with bigger cells. It's not a VLIW core.
I assume that, like all past attempts at this, it's about 20x more efficient when the code fits in the array (FPGAs get this ratio), but if your code size grows past something very trivial, the grid config needs to switch and that costs tons of time and power.
rf15 · 13h ago
I agree this is very "FPGA-shaped" and I wonder if they have further switching optimisations on hand.
RossBencina · 8h ago
My understanding is that they have a grid configuration cache, and are certainly trying to reduce the time/power cost of changing the grid connectivity.
pclmulqdq · 8h ago
An FPGA startup called Tabula had the same thesis and it didn't work out well for them. Their configurable blocks had 16 configurations that they would let you cycle through. Reportedly, the chips were hell to program and the default tools were terrible.
reactordev · 6h ago
Is that a design flaw or a tooling flaw? The dev experience is usually left until the very end of a proof of concept like this.
gchadwick · 3h ago
Both? Theoretically amazing hardware that just needs a magic compiler to work well is a well-worn path in the hardware world (the Itanium being a notable example). A design can be impossible to compile well for, and very hard to program manually, if it hasn't been developed well. Equally, you can indeed have a bad toolchain for that hard-to-use design, making it even harder to get the best out of it.
wmf · 3h ago
It doesn't matter. You have to get both right or you go out of business.
(And then your IP is thrown away so the next startup also has to get both right...)
That was my first thought too. I really like the idea of an array of interconnected nodes. There's something biological about it, thinking in terms of topology and diffusion between neighbours, that I find appealing.
londons_explore · 11h ago
One day someone will get it working...
Data transfer is slow and power hungry - it's obvious that putting a little bit of compute next to every bit of memory is the way to minimize data transfer distance.
The laws of physics can't be broken, yet people demand more and more performance, so eventually solving this issue will be worth the difficulty.
AnimalMuppet · 8h ago
That minimizes the data transfer distance from that bit of memory to that bit of compute. But it increases the distance between that bit of (memory and compute) and all the other bits of (memory and compute). If your problem is bigger than one bit of memory, such a configuration is probably a net loss, because of the increased data transfer distance between all the bits.
Your last paragraph... you're right that, sooner or later, something will have to give. There will be some scale such that, if you create clumps either larger or smaller than that scale, things will only get worse. (But that scale may be problem-dependent...) I agree that sooner or later we will have to do something about it.
Earw0rm · 1h ago
We already do.
Cache hierarchies operate on the principle that the probability of a bit being operated on is inversely proportional to the time since it was last operated on.
Registers can be thought of in this context as just another cache, the memory closest to the compute units for the most frequent operations.
It's possible to have register-less machines (everything expressed as memory-to-memory operations), but it blows up the instruction word length; better to let the compiler do some of the thinking.
Imustaskforhelp · 13h ago
Pardon me, but could somebody here explain it to me like I am 15? Because I guess it's late night and I can't go into another rabbit hole, and I guess I would appreciate it. Cheers and good night, fellow HN users.
elseless · 11h ago
Sure. You can think of a (simple) traditional CPU as executing instructions in time, one-at-a-time[1] — it fetches an instruction, decodes it, performs an arithmetic/logical operation, or maybe a memory operation, and then the instruction is considered to be complete.
The Efficient architecture is a CGRA (coarse-grained reconfigurable array), which means that it executes instructions in space instead of time. At compile time, the Efficient compiler looks at a graph made up of all the “unrolled” instructions (and data) in the program, and decides how to map it all spatially onto the hardware units. Of course, the graph may not all fit onto the hardware at once, in which case it must also be split up to run in batches over time. But the key difference is that there’s this sort of spatial unrolling that goes on.
This means that a lot of the work of fetching and decoding instructions and data can be eliminated, which is good. However, it also means that the program must be mostly, if not completely, static, meaning there’s a very limited ability for data-dependent branching, looping, etc. to occur compared to a CPU. So even if the compiler claims to support C++/Rust/etc., it probably does not support, e.g., pointers or dynamically-allocated objects as we usually think of them.
[1] Most modern CPUs don’t actually execute instructions one-at-a-time — that’s just an abstraction to make programming them easier. Under the hood, even in a single-core CPU, there is all sorts of reordering and concurrent execution going on, mostly to hide the fact that memory is much slower to access than on-chip registers and caches.
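To make the "space instead of time" idea concrete, here's a toy sketch in Python (purely illustrative and my own, not Efficient's toolchain; the Tile class, op names, and "wave" model are all made up). Each operation is pinned to a node and fires once all of its operands have arrived:

    import operator

    # Each op is pinned to a tile; a tile fires once every source tile has a value.
    OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

    class Tile:
        def __init__(self, op, srcs=(), value=None):
            self.op, self.srcs, self.value = op, list(srcs), value

    def run(tiles):
        """Each pass models one wave of data moving across the fabric in lock step."""
        waves = 0
        while any(t.value is None for t in tiles):
            ready = [t for t in tiles
                     if t.value is None and all(s.value is not None for s in t.srcs)]
            for t in ready:                        # snapshot first, then fire together
                t.value = OPS[t.op](*(s.value for s in t.srcs))
            waves += 1
        return waves

    # (x + y) * (x - y), fully unrolled into a five-node graph
    x, y = Tile("in", value=3.0), Tile("in", value=2.0)
    add, sub = Tile("add", [x, y]), Tile("sub", [x, y])
    mul = Tile("mul", [add, sub])
    print(run([x, y, add, sub, mul]), mul.value)   # 2 waves, result 5.0

The point of the toy: nothing fetches or decodes anything per operation; the "program" is the wiring, and only the data moves.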
pclmulqdq · 11h ago
Pointers and dynamic objects are probably fine given the ability to do indirect loads, which I assume they have (Side note: I have built b-trees on FPGAs before, and these kinds of data structures are smaller than you think). It's actually pure code size that is the problem here rather than specific capabilities, as long as the hardware supports those instructions.
Instead of assembly instructions taking time in these architectures, they take space. You will have a capacity of 1000-100000 instructions (including all the branches you might take), and then the chip is full. To get past that limit, you have to store state to RAM and then reconfigure the array to continue computing.
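A toy cost model of that capacity cliff (every constant below is invented for illustration; these are not measurements of this or any real chip):

    ARRAY_CAPACITY = 4000      # instructions that fit on the fabric at once (assumed)
    RECONFIG_US = 50.0         # assumed cost to stream in one new configuration
    WAVE_US = 0.001            # assumed time per data item once configured

    def runtime_us(program_size, samples):
        """Worst-case time to push `samples` items through `program_size` instructions."""
        batches = -(-program_size // ARRAY_CAPACITY)       # ceil: configs needed per pass
        if batches == 1:
            return RECONFIG_US + samples * WAVE_US         # configure once, then stream
        # Doesn't fit: every item has to cycle through all the configurations,
        # spilling state to RAM and reconfiguring between them (no batching assumed).
        return samples * batches * (RECONFIG_US + WAVE_US)

    print(runtime_us(3_000, 10_000))    # 60.0 us: configure once, then it just streams
    print(runtime_us(40_000, 10_000))   # 5,000,100 us (~5 s): reconfiguration swamps the compute

Real designs would batch data per configuration to amortize the swaps, but the shape of the problem is the same: the win depends on the working set of code fitting on the array.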
elseless · 10h ago
Agree that code size is a significant potential issue, and that going out to memory to reprogram the fabric will be costly.
Re: pointers, I should clarify that it’s not the indirection per se that causes problems — it’s the fact that, with (traditional) dynamic memory allocation, the data’s physical location isn’t known ahead of time. It could be cached nearby, or way off in main memory. That makes dataflow operator latencies unpredictable, so you either have to 1. leave a lot more slack in your schedule to tolerate misses, or 2. build some more-complicated logic into each CGRA core to handle the asynchronicity. And with 2., you run the risk that the small, lightweight CGRA slices will effectively just turn into CPU cores.
pclmulqdq · 8h ago
Oh, many embedded architectures don't have a cache hierarchy and instead place dynamic objects in one SRAM. Access latency is constant anywhere you go.
kannanvijayan · 10h ago
Hmm. You'd be able to trade off time for that space by using more general configurations that you can dynamically map instruction-sequences onto, no?
The mapping wouldn't be as efficient as a bespoke compilation, but it should be able to avoid the configuration swap-outs.
Basically a set of configurations that can be used as an interpreter.
majkinetor · 54m ago
> meaning there’s a very limited ability for data-dependent branching, looping, etc. to occur compared to a CPU
Not very useful then if I can't do this very basic thing?
markhahn · 5h ago
I think that footnote is close to the heart of it: on a modern OoO superscalar processor, there are hundreds of instructions in flight. That means a lot of work done to maintain their state and ensure that they "fire" when their operands are satisfied. I think that's what this new system is about: a distributed, scalable dataflow-orchestration engine.
I think this still depends very much on the compiler: whether it can assemble "patches" of direct dependencies to put into each of the little processing units. The edges between patches are either high-latency operations (memory) or inter-patch links resulting from partitioning the overall dataflow graph. I suspect it's the NoC addressing that will be most interesting.
esperent · 4h ago
> it executes instructions in space instead of time. At compile time, the Efficient compiler looks at a graph made up of all the “unrolled” instructions (and data) in the program, and decides how to map it all spatially onto the hardware units.
Naively that sounds similar to a GPU. Is it?
Nevermark · 4h ago
Instead of large cores operating mostly independently in parallel (with a few standardized hardwired pipeline steps per core), …
You have many more very small ALU cores, configurable into longer custom pipelines with each step more or less as wide/parallel or narrow as it needs to be for each step.
Instead of streaming instructions over & over to large cores, you use them to set up those custom pipeline circuits, each running until it’s used up its data.
And you also have some opportunity for multiple such pipelines operating in parallel depending on how many operations (tiles) each pipeline needs.
Probably not. This is graduate-level computer architecture.
archipelago123 · 13h ago
It's a dataflow architecture.
I assume the hardware implementation is very similar to what is described here:
https://csg.csail.mit.edu/pubs/memos/Memo-229/Memo-229.pdf.
The problem is that it becomes difficult to exploit data locality, and there is only so much optimization you can perform at compile time.
Also, the motivation for these types of architectures (e.g. the lack of ILP in von Neumann-style architectures) is non-existent in modern OoO cores.
timschmidt · 13h ago
Out of order cores spend an order of magnitude more logic and energy than in-order cores handling invalidation, pipeline flushes, branch prediction, etc etc etc... All with the goal of increasing performance. This architecture is attempting to lower the joules / instruction at the cost of performance, not increase energy use in exchange for performance.
gchadwick · 3h ago
> The interconnect between tiles is also statically routed and bufferless, decided at compile time. As there's no flow control or retry logic, if two data paths would normally collide, the compiler has to resolve it at compile time.
This sounds like the most troublesome part of the design to me. It's very hard to do this static scheduling well. You can end up having to hold up everything waiting for some tiny thing to complete so you can proceed forward in lock step. You'll also have situations where the static scheduling works 95% of the time, but in the other 5% of cases something fiddly happens. Without any ability for dynamic behaviour and data movement, small corner cases dominate how the rest of the system behaves.
Interestingly, you see this very problem in hardware design! All paths between logic gates need to be under some maximum length to reach a target clock frequency. Often you get long, fiddly paths relating to corner cases in behaviour that require significant manual effort to resolve and achieve timing closure.
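As a toy version of what that compile-time conflict resolution looks like (my own sketch, nothing to do with Efficient's actual router; the link names and the greedy delay policy are invented):

    from collections import defaultdict

    def schedule(routes):
        """routes: {name: list of links a value traverses, one link per cycle}.
        Pick a start cycle per route so no link is claimed twice in the same
        cycle, by greedily delaying later routes (i.e. inserting slack)."""
        occupied = defaultdict(set)                  # link -> cycles already claimed
        starts = {}
        for name, links in routes.items():
            start = 0
            while any(start + i in occupied[link] for i, link in enumerate(links)):
                start += 1                           # hold the whole path back a cycle
            for i, link in enumerate(links):
                occupied[link].add(start + i)
            starts[name] = start
        return starts

    # Two values that both want link "B->C" on their second hop:
    routes = {"x": ["A->B", "B->C", "C->D"],
              "y": ["E->B", "B->C", "C->F"]}
    print(schedule(routes))    # {'x': 0, 'y': 1}: y is held back a cycle to dodge the clash

In a bufferless fabric that "hold back a cycle" has to come from somewhere upstream, which is exactly how one fiddly conflict ends up delaying everything scheduled in lock step with it.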
regularfry · 1h ago
Was I misreading, or is this thing not essentially unclocked? There have been asynchronous designs in the past (of ARM6 cores, no less) but they've not taken the world by storm.
pedalpete · 9h ago
Though I'm sure this is valuable in certain instances, thinking about many embedded designs today, is the CPU/micro really the energy hog in these systems?
We're building an EEG headband with a bone-conduction speaker, and in order of power draw, our speaker/sounder and LEDs are orders of magnitude more expensive than our microcontroller.
In anything with a screen, that screen is going to suck all the juice, then your radios, etc. etc.
I'm sure there are very specific use-cases that a more energy efficient CPU will make a difference, but I struggle to think of anything that has a human interface where the CPU is the bottleneck, though I could be completely wrong.
schobi · 1h ago
I would not expect that this becomes competitive against a low power controller that is sleeping most of the time, like in a typical wristwatch wearable.
However, the examples indicate that if you have a loop that is executed over and over, the setup cost for configuring the fabric could be worth paying. Like a continuous audio stream in wake-word detection, a hearing aid, or continuous signals from an EEG.
Instead of running a general-purpose CPU at 1 MHz, the fabric would be used to unroll the loop: you use (up to) 100 building blocks for all the individual operations. Instead of one instruction after another, you have a pipeline that can execute one operation in each building block every cycle. The compute thus only needs to run at 1/100 of the clock, e.g. the 10 kHz sampling rate of the incoming data. Each tick of the clock moves data through the pipeline, one step at a time.
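Written out as back-of-envelope arithmetic (same made-up round numbers as above, nothing measured):

    sample_rate_hz = 10_000      # incoming data rate, e.g. an always-on audio/EEG stream
    ops_per_sample = 100         # the unrolled loop body, one op per building block

    cpu_clock_hz = sample_rate_hz * ops_per_sample   # sequential core: 1 MHz to keep up
    fabric_clock_hz = sample_rate_hz                 # pipelined fabric: one result per tick

    print(cpu_clock_hz, fabric_clock_hz, cpu_clock_hz // fabric_clock_hz)  # 1000000 10000 100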
I have no insights but can imagine how marketing thinks: "let's build a 10x10 grid of building blocks, if they are all used, the clock can be 1/100... Boom - claim up to 100x more efficient!"
I hope their savings estimate is more elaborate though...
montymintypie · 7h ago
Human interfaces, sure, but there's a good chunk of industrial sensing IoT that might do some non-trivial edge processing to decide if firing up the radio is even worth it. I can see this being useful there. Potentially also in smart watches with low power LCD/epaper displays, where the processor starts to become more visible in power charts.
Wonder if it could also be a coprocessor, if the fabric has a limited cell count? Do your DSP work on the optimised chip and hand off to the expensive radio softdevice when your code size is known to be large.
kendalf89 · 14h ago
This grid-based architecture reminds me of a programming game from Zachtronics, TIS-100.
mcphage · 6h ago
I thought the same thing :-)
ZiiS · 14h ago
Percentage chance this is 100X more efficient at the general purpose computing ARM is optimized for: 1/100%
Grosvenor · 14h ago
Is this the return of Itanium? Static scheduling and pushing everything to the compiler: it sounds like it.
wood_spirit · 14h ago
The Mill videos are worth watching again - there are variations on NaT handling and looping and branching etc that make DSPs much more general-purpose.
I don’t know how similar this Electron is, but the Mill explained how it could be done.
Edit: aha, found them! https://m.youtube.com/playlist?list=PLFls3Q5bBInj_FfNLrV7gGd...
I love these videos and his enthusiasm for the problem space. Unfortunately, it seems to me that the progress/ideas have floundered because of concerns around monetizing intellectual property, which is a shame. If he had gone down a more RISC-V like route, I wonder if we would see more real-world prototypes and actual use cases. This type of thing seems great for microprocessor workloads.
darksaints · 14h ago
It kinda sounds like it, though the article explicitly said it's not VLIW.
I've always felt like Itanium was a great idea but came too soon and was too poorly executed. It seemed like the majority of the commercial failure came down to friction from switching architecture and the inane pricing rather than the merits of the architecture itself. Basically Intel being Intel.
bri3d · 13h ago
I disagree; Itanium was fundamentally flawed for general purpose computing and especially for time-shared general purpose computing. VLIW is not practical in time-sharing systems without completely rethinking the way cache works, and Itanium didn't really do that.
As soon as a system has variable instruction latency, VLIW completely stops working; the entire concept is predicated on the compiler knowing how many cycles each instruction will take to retire ahead of time. With memory access hierarchy and a nondeterministic workload, the system inherently cannot know how many cycles an instruction will take to retire because it doesn't know what tier of memory its data dependencies live in up front.
The advantage of out-of-order execution is that it dynamically adapts to data availability.
This is also why VLIW works well where data availability is _not_ dynamic, for example in DSP applications.
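A toy stall model of that point (invented numbers, not modeling any specific VLIW): the compiler schedules the consumer of a load a fixed number of cycles later, and if the data is late the whole wide core waits in lock step.

    ISSUE_WIDTH = 6          # ops the machine could have issued each cycle
    SCHEDULED_LATENCY = 3    # load latency the compiler baked into the schedule

    def stall_cost(actual_latency):
        stall_cycles = max(0, actual_latency - SCHEDULED_LATENCY)
        return stall_cycles, stall_cycles * ISSUE_WIDTH    # cycles lost, issue slots wasted

    print(stall_cost(3))      # (0, 0)       cache hit: the static schedule is perfect
    print(stall_cost(12))     # (9, 54)      L2-ish miss: 54 issue slots thrown away
    print(stall_cost(200))    # (197, 1182)  DRAM miss: nothing the compiler could plan for

An out-of-order core eats the same miss, but it can keep retiring whatever independent work it finds in the meantime, which is the dynamic adaptation described above.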
As for this Electron thing, the linked article is too puffed to tell what it's actually doing. The first paragraph says something about "no caches" but the block diagram has a bunch of caches in it. It sort of sounds like an FPGA with bigger primitives (configurable instruction tiles rather than gates), which means that synchronization is going to continue to be the problem and I don't know how they'll solve for variable latency.
hawflakes · 13h ago
Not to detract from your point, but Itanium's design was meant to address code compatibility between generations. You could have code optimized for a wider chip run on a narrower chip because of the stop bits.
The compiler still needs to know how to schedule to optimize for a specific microarchitecture but the code would still run albeit not as efficiently.
As an aside, I never looked into the perf numbers but having adjustable register windows while cool probably made for terrible context switching and/or spilling performance.
als0 · 13h ago
> VLIW is not practical in time-sharing systems without completely rethinking the way cache works
Just curious as to how you would rethink the design of caches to solve this problem. Would you need a dedicated cache per execution context?
bri3d · 13h ago
That's the simplest and most obvious way I can think of. I know the Mill folks were deeply into this space and probably invented something more clever but I haven't kept up with their research in many years.
bobmcnamara · 9h ago
Itanic did exactly what it was supposed to do - kill off most of the RISCs.
markhahn · 7h ago
haha! very droll.
cmrdporcupine · 11h ago
It does feel maybe like the world has changed a bit now that LLVM is ubiquitous, with its intermediate representation form being available for specialized purposes. Translation from IR to a VLIW plan should be easier now than it was with the state of compiler tech in the 90s.
But "this is a good idea just poorly executed" seems to be the perennial curse of VLIW, and how Itanium ended up shoved onto people in the first place.
mochomocha · 14h ago
On the other hand, Groq seems pretty successful.
rpiguy · 16h ago
The architecture diagram in the article resembles the approach Apple took in the design of their neural engine.
https://www.patentlyapple.com/2021/04/apple-reveals-a-multi-...
Typically these architectures are great for compute. How will it do on scalar tasks with a lot of branching? I doubt well.
variadix · 12h ago
Pretty interesting concept, though as other commenters have pointed out the efficiency gains likely break down once your program doesn’t fit onto the mesh all at once. Also this looks like it requires a “sufficiently smart compiler”, which isn’t a good sign either. The need to do routing etc. reminds me of the problems FPGAs have during place and route (effectively the minimum cut problem on a graph, i.e. NP), hopefully compilation doesn’t take as long as FPGA synthesis takes.
kyboren · 8h ago
> The need to do routing etc. reminds me of the problems FPGAs have during place and route (effectively the minimum cut problem on a graph, i.e. NP)
I'd like to take this opportunity to plug the FlowMap paper, which describes the polynomial-time delay-optimal FPGA LUT-mapping algorithm that cemented Jason Cong's 31337 reputation: https://limsk.ece.gatech.edu/book/papers/flowmap.pdf
Very few people even thought that optimal depth LUT mapping would be in P. Then, like manna from heaven, this paper dropped... It's well worth a read.
almostgotcaught · 3h ago
I don't see what this has to do with what you're responding to - tech mapping and routing are two completely different things, and routing is known to be NP-complete.
icandoit · 13h ago
I wondered if this was using interaction combinators like the vine programming language does.
I haven't read much that explains how they do it.
I have been very slowly trying to build a translation layer between Starlark and Vine as a proof of concept of massively parallel computing. If someone better qualified finds a better solution, the market is sure to have demand for you. A translation layer is bound to be cheaper than teaching devs to write in JAX or Triton or whatever comes next.
wolfi1 · 14h ago
reminds me of the architecture of transputers, but on the same silicon
fidotron · 14h ago
Yep, or the old GreenArrays GA144 or even maybe XMOS with more compiler magic.
One of the big questions here is how quickly it can switch between graphs, or if that will be like a context switch from hell. In an embedded context that's likely to become a headache way too fast, so the idea of a magic compiler fixing it so you don't have to know what it's doing sounds like a fantasy honestly.
icodestuff · 5h ago
Yep, that’s definitely the question. The article says that there are caches of recently used graphs for use in large loops. Presumably those are pretty fast to swap, but I have to imagine programming a whole new graph in isn’t fast. But maybe the E2 or E3 will have the ability to reprogram partial graphs with good AOT dataflow analysis.
nolist_policy · 14h ago
Also, what would cycle-accurate assembly look like for this chip?
artemonster · 13h ago
As a person who is highly invested and interested in the CPU space, especially embedded, I am HIGHLY skeptical of such claims. Somebody played TIS-100, remembered GA144 failed and decided to try their own. You know what would be a simple proof of your claims? No, not a press release. No, not a pitch deck or a youtube video. And NO, not even working silicon, you silly. A SIMPLE FUCKING ISA EMULATOR WITH A PROFILER. Instead we got a bunch of whitepapers. Yeah, I call it a 90% chance of total BS and vaporware.
jecel · 12h ago
The 2022 PhD thesis linked from their web site includes a picture of what they claim was an actual chip made using a 22nm process. I understand that the commercial chip might be different, but it is possible that the measurements made for the thesis could be valid for their future products as well.
wmf · 12h ago
There's >20 years of academic research behind dataflow architectures going back to TRIPS and MIT RAW. It's not literally a scam but the previous versions weren't practical and it's unlikely this version succeeds either. I agree that if the compiler was good they would release it and if they don't release it that's probably because it isn't good.
bmenrigh · 12h ago
I like Ian but he’s rapidly losing credibility by posting so much sponsored content. Many of his videos and articles now are basically just press releases.
vendiddy · 14h ago
I don't know much about CPUs so maybe someone can clarify.
Is this effectively having a bunch of tiny processors on a single chip each with its own storage and compute?
lawlessone · 13h ago
I think it's more like having the instructions your program executes spread across multiple tiny processors.
So one instruction gets done... its output is passed to the next.
Hopefully I've made somebody mad enough to explain why I am wrong.
ACCount36 · 8h ago
I can't see this ever replacing general purpose Arm cores, but it might be viable in LP-optimized always-on processors and real time control cores.
SoftTalker · 15h ago
> Efficient’s goal is to approach the problem by static scheduling and control of the data flow - don’t buffer, but run. No caches, no out-of-order design, but it’s also not a VLIW or DSP design. It’s a general purpose processor.
Sounds like a mainframe. Is there any similarity?
wmf · 12h ago
This has nothing to do with mainframes (which are fairly normal general purpose computers).
nnx · 8h ago
Not sure about general-purposeness, but the architecture looks rather perfect for LLM inference?
Wonder why they do not focus their marketing on this.
trhway · 10h ago
> spatial data flow model. Instead of instructions flowing through a centralized pipeline, the E1 pins instructions to specific compute nodes called tiles and then lets the data flow between them. A node, such as a multiply, processes its operands when all the operand registers for that tile are filled. The result then travels to the next tile where it is needed. There's no program counter, no global scheduler. This native data-flow execution model supposedly cuts a huge amount of the energy overhead typical CPUs waste just moving data.
Should work great for NN.