One thing I'd love to see is dynamic CPU allocation, or otherwise something similar to Jenkins' concept of a flyweight runner. Certain pipelines can often spend minutes to hours using zero CPU just polling for completion (e.g. CloudFormation, hosted E2E tests, etc.). In these cases I'd be charged for 2 vCPUs but use almost nothing.
Otherwise, the customers are stuck with the same sizing/packing/utilisation problems. And imagine being the CI vendor in this world: you know which pipeline steps use what resources on average (and at the p99), and with that information you could over-provision customer jobs so that you sell 20 vCPUs but schedule them on 10 vCPUs. 200% utilisation baby!
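A rough sketch of the overcommit arithmetic described above, assuming the vendor has per-job CPU usage samples. The job names, the sample data, and the `physical_vcpus_needed` helper are all hypothetical; this is an illustration of the idea, not how any particular CI vendor schedules.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of numbers (pct in 0..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def physical_vcpus_needed(jobs, pct=99):
    """Sum each job's pct-th percentile CPU usage and round up.

    Summing per-job percentiles is conservative: it assumes every job
    hits its p99 at the same moment.
    """
    return math.ceil(sum(percentile(usage, pct) for usage in jobs.values()))

# Hypothetical data: each job is sold 2 vCPUs; samples are vCPUs actually used.
jobs = {
    "cloudformation-poll": [0.01] * 99 + [0.2],
    "e2e-wait": [0.05] * 99 + [0.5],
    "unit-tests": [1.8] * 100,
}
sold = 2 * len(jobs)
needed = physical_vcpus_needed(jobs)
print(f"sold {sold} vCPUs, need about {needed} at p99")
```

With these made-up profiles the two polling jobs barely register, so the ratio of sold to scheduled capacity is driven almost entirely by the one job that actually computes.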
matt-p · 15m ago
I'm sure they're doing this, they'd be mad not to - firecracker has cgroup support.
hinkley · 1h ago
I had a service that was used to do a bunch of compute at deployment time but even with the ramp up in deployment rates anticipated by the existence of the tool, we had machines that were saturated about 6 hours a month, 12 at the outside. The amount of hardware we had sitting around for this was initially equivalent to about 10% of our primary cluster, and I got it down to about 3%.
But at the end of that project I realized that all this work could have been done on a CI agent if only they had more compute on them. My little cluster was still almost the size the build agent pool tended to be. If I could convince them to double or quadruple the instance size on the CI pipeline I could turn these machines off entirely, which would be a lower total cost at 2x and only a 30% increase at 4x, especially since some builds would go faster resulting in less autoscaling.
So if one other team could also eliminate a similar service, it would be a huge win. I unfortunately did not get to finish that thought due to yet another round of layoffs.
arccy · 3h ago
i think cloudflare workers does this
Havoc · 2h ago
Surprised they’re doing fixed leases. I would have thought a fixed base with a layer of spot priced VMs for peaks would be more efficient on cost
matt-p · 51m ago
Outside of the big clouds just buying a 1 Year lease (say) on a dedicated server is so cheap that you'd not be saving much vs spot instances and with spot instances you need code to manage this and you're introducing risk of slowdowns. Probably not worth the trade off.
To illustrate: a 128GB RAM, 20-core server with a 10Gbps NIC and some small SSD storage is probably going to cost you <$2000 USD for a year's rental.
Havoc · 4m ago
They've got usage that plummets 80% 2 days a week and the other 5 have a broad predictable time based pattern where usage drops ~66% judging by graph.
If that works out to same prices as keeping compute at literally your peak requirement level round the clock then something is very wrong somewhere. Maybe that issue is not in-house at blacksmith - perhaps spot pricing is a joke...but something there doesn't check out.
Loads of companies do scaling with much less predictable patterns.
>risk of slowdowns
Yeah you do probably want the scaling to be super conservative...but -80% fluctuation is a comically large gap to not actively scale
>To illustrate
Better view I'd say is: That chart looks like ~4.5 hrs of peak a day. So you're paying for 730 hours of peak capacity a month and using all of it for about 90 hrs.
Given that they wrote a blog about this topic they probably have a good reason for doing it this way. Just doesn't really make sense to me based on given info
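The back-of-envelope figure above works out as follows. The two inputs are the commenter's own estimates read off a chart, not measured data.

```python
HOURS_PER_MONTH = 730  # hours of capacity paid for, per the comment above
PEAK_HOURS_USED = 90   # "about 90 hrs" at full peak, per the comment above

peak_utilisation = PEAK_HOURS_USED / HOURS_PER_MONTH
print(f"capacity is fully used {peak_utilisation:.1%} of the time")
```

This only counts the hours at full peak; the off-peak hours still use part of the fleet, so overall utilisation would be higher than this number.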
tsaifu · 24m ago
yeah, like the others have said, the tradeoff isn't really worth it for us as a business. spot instances also generally come with low qos guarantees (since they tend to be interruptible). tbf there are on-demand alternatives with better guarantees though
another thing to note is that we bootstrap the hosts, and tune them a decent amount, to support certain high-performance features which takes time and makes control + fixed-term ownership desirable
[disclaimer: i work at blacksmith]
whizzter · 1h ago
If their business model is high performance runners and cheap cost they probably don't want to budge on speed, and once renting something fast on the cloud the costs run up quickly enough that they are probably just better off with a few more machines that pay for themselves over time.
TuringTest · 2h ago
Back in the ancient era of the mainframes, this "multitenancy" concept would have been called "time sharing".
It looks like everything old is new again.
pphysch · 20m ago
Yeah, isn't this just HPCaaS, with an emphasis on CI workloads?
jeffreygoesto · 1h ago
Oh the memory. IBM3090 MVS with TSO...
hinkley · 1h ago
I was kind of disappointed the first time I saw an IBM mainframe and it kinda just looked like a rack of servers. To be fair, it was taking up a little bit of a server room that had clearly been designed for a larger predecessor and now almost had enough free space for a proper game of ping pong.
Hyperscaler rack designs definitely blur this line further. In some ways I think Oxide is trying to reinvent the mainframe, in a world where the suppliers got too much leverage and started getting uppity.
andrewstuart · 1h ago
It’s a common refrain on HN that this thing is the same as something old. Dagnab those young folks!
nitwit005 · 8m ago
People trying to market things create a lot of new terms. They don't want to seem like they're selling the same old thing, but something new and innovative.
Occasionally, the new term is warranted, of course, but that's far less common than simply trying to appear different.
morkalork · 27m ago
It pretty much is the same, the only change is the level of abstraction. Apparently the easiest thing for everyone is just giving the user access to the whole damn OS via a container, rather than have them deal with vendor specific mainframe minutiae.
coolcase · 2h ago
Was thinking about this exact thing today. Where I work, taking X services out of their own scaling sets and packing them together into a kubernetes cluster (or similar tech) should "smooth out" the spikes relatively, and reduce both wastage and the need to scale. This is on cloud so there's no fixed hardware concern, but even then it helps with reserved instances, discounts and keeping cost down generally. This was intuition but I might math the maths on it now, inspired by this.
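A minimal sketch of the maths behind that intuition: capacity for separate scaling sets is the sum of each service's peak, while a shared cluster only needs the peak of the summed demand. The service names and hourly demand figures below are made up for illustration.

```python
def peak(series):
    """Highest demand seen in a series."""
    return max(series)

# Hypothetical hourly CPU demand (cores) for three services whose spikes
# land at different hours.
services = {
    "api": [2, 2, 8, 2, 2, 2],
    "batch": [1, 1, 1, 1, 9, 1],
    "search": [3, 7, 3, 3, 3, 3],
}

# Separate scaling sets: each one must be sized for its own peak.
sum_of_peaks = sum(peak(s) for s in services.values())

# Shared cluster: size for the peak of the combined demand.
combined = [sum(hour) for hour in zip(*services.values())]
peak_of_sum = peak(combined)

print(f"separate: {sum_of_peaks} cores, shared: {peak_of_sum} cores")
```

The peak of the sum can never exceed the sum of the peaks, but the saving shrinks toward zero if the services all spike at the same time.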
mlhpdx · 2h ago
Everyone doing multi-tenant SaaS wants cost to be a sub-linear function of usage. This model of large unit capacity divided by small work units is an example of how to get there. The tough bit is that it’s stepwise at low volumes, and becomes linear at large scale, so it’s only magic during the growth phase — which is pretty solid for a growth phase company showing numbers for the next raise.
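A small sketch of the stepwise cost curve described above, assuming capacity is bought in whole hosts. The `monthly_cost` helper and all the numbers are hypothetical.

```python
import math

def monthly_cost(work_units, units_per_host, cost_per_host):
    """Cost when capacity is bought in whole hosts (a step function)."""
    hosts = max(1, math.ceil(work_units / units_per_host))
    return hosts * cost_per_host

# Hypothetical numbers: one host handles 1000 jobs and costs 200 a month.
for jobs in (100, 900, 1100, 100_000):
    cost = monthly_cost(jobs, units_per_host=1000, cost_per_host=200)
    print(f"{jobs} jobs -> cost {cost}, per job {cost / jobs:.3f}")
```

At low volume the cost jumps a whole host at a time, and per-job cost falls quickly as the first hosts fill up. At high volume the steps are small relative to the total and cost per job flattens out, which is the linear regime.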
hinkley · 1h ago
Something for nothing or the Tragedy of the Commons. Many want a fair division of the cost but an unfair portion of the shared resource, subsidized by people who have not figured out how to minmax their slice of the pie. Doesn’t work when several clever people share the same resource pool.
shadowgovt · 4h ago
Interesting writeup. I wonder somewhat what this looks like from the customer side; one downside I've observed with some serverless in the past is that it can introduce up-front latency delays as the system spins up support to handle your spike. I know the CI consensus seems to be that latency matters little in a process that's going to take a long time to run to completion anyway... But I'm also a developer of CI, and that latency is painful during a tight-loop development cycle.
(The good news is that if the spikes are regular, a sufficiently-advanced serverless can "prime the pump" and prep-and-launch instances into surplus compute before the spike since historical data suggests the spike is coming).
aayushshah15 · 4h ago
> one downside I've observed with some serverless in the past is that it can introduce up-front latency delays as the system spins up support to handle your spike
[cofounder of blacksmith here]
This is exactly one of the symptoms of running CI on traditional hyperscalers we're setting out to solve. The fundamental requirement for CI is that each job requires its own fresh VM (which is unlike traditional serverless workloads like lambdas). To provision an EC2 instance for a CI job:
- you're contending against general on-demand production workloads (which have a particular demand curve based on, say, the time of day). This can typically imply high variance in instance provisioning times.
- since AWS/GCP/Azure deploy capacity out as spot instances with a guaranteed pre-emption warning, you're also waiting for the pre-emption windows to expire before a VM can be handed to you!
LunaSea · 1m ago
> The fundamental requirement for CI is that each job requires its own fresh VM (which is unlike traditional serverless workloads like lambdas). To provision an EC2 instance for a CI job
Is this different to lambdas or ECS services because you need to set up a VM / container, and nested virtualisation / Docker-in-Docker is not supported?
shadowgovt · 3h ago
Excellent! I did some work in the past on prediction of behavior given past data, and I can tell you two things we learned:
- there are low-frequency and high frequency effects (so you can make predictions based on last week, for example, but those predictions fall flat if the company rushes launches at the EOQ or takes the last couple weeks in December off).
- you can try to capture those low-frequency effects, but in practice we consistently found that comprehension by end-users beat out a high-fidelity model, and users were just not going to learn an idea like "you can generate any wave by summing two other waves." The user feedback we got was that they consistently preferred the predictive model being a very dumb "The next four weeks look like the past four weeks" and an explicit slider to flag "Christmas is coming: we anticipate our load to be 10% of normal" (which can simultaneously tune the prediction for Christmas and drop Christmas as an outlier when making future predictions). When they set the slider wrong they'd get the wrong predictions, but they were wrong predictions that were "their fault" and they understood; they preferred wrong predictions they could understand to less-wrong predictions they had to think about Fourier analysis to understand.
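A minimal sketch of the "very dumb" model described above: copy the past four weeks forward and let the user scale individual weeks. The `forecast` function, its `overrides` parameter, and the load numbers are hypothetical, and the outlier-dropping part is left out.

```python
def forecast(past_weeks, overrides=None):
    """Predict the next len(past_weeks) weeks as a copy of the past ones,
    scaled by any user-set multipliers (the "slider")."""
    overrides = overrides or {}
    return [load * overrides.get(i, 1.0) for i, load in enumerate(past_weeks)]

past_four_weeks = [100, 110, 105, 95]

# The user flags the fourth upcoming week (index 3) as a holiday week
# at 10% of normal load.
prediction = forecast(past_four_weeks, overrides={3: 0.1})
print(prediction)
```

The appeal is that every number in the output can be traced back to either last month's load or a multiplier the user set themselves.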
0xbadcafebee · 4h ago
tl;dr for this particular case it's bin packing
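For anyone unfamiliar, here is a sketch of one common bin packing heuristic, first-fit decreasing. This is a generic textbook approach with made-up job sizes, not a claim about what the article's scheduler actually does.

```python
def first_fit_decreasing(job_sizes, host_capacity):
    """Pack jobs onto hosts using the first-fit-decreasing heuristic.

    Returns a list of hosts, each a list of job sizes. Not optimal in
    general, but a common approximation for bin packing.
    """
    hosts = []
    for size in sorted(job_sizes, reverse=True):
        if size > host_capacity:
            raise ValueError(f"job of size {size} exceeds host capacity")
        for host in hosts:
            if sum(host) + size <= host_capacity:
                host.append(size)
                break
        else:
            # No existing host had room, so open a new one.
            hosts.append([size])
    return hosts

# Hypothetical vCPU requests packed onto 16-vCPU hosts.
packed = first_fit_decreasing([8, 4, 4, 2, 8, 2, 4], host_capacity=16)
print(packed)
```

Sorting the largest jobs first tends to leave small gaps that the small jobs can fill, which is why this simple heuristic usually packs tightly.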
other business cases have economics where multitenancy has (almost) nothing to do with "efficient computing", and more to do with other efficiencies, like human costs, organizational costs, and (like the other post linked in the article) functional efficiencies