Google's Liquid Cooling

159 points by giuliomagnifico | 75 comments | 8/25/2025, 5:57:18 PM | chipsandcheese.com

Comments (75)

ykl · 6m ago
I once saw an interview with the SVP who oversees Azure datacenter buildout, or something like that, and a thing that stuck with me was that he said his job got a lot easier when he realized he was no longer in the computer business; he was now in the industrial cooling business.

Reading this article immediately made me think back to that.

jonathaneunice · 2h ago
It’s very odd, when mainframes (S/3x0, Cray, yadda yadda) have been extensively water-cooled for over 50 years and super-dense HPC data centers have used liquid cooling for at least 20, to hear Google-scale data center design compared to PC hobbyist rigs. Selective amnesia + laughably off-target point of comparison.
liquidgecka · 1h ago
I posted this further down in a reply-to-a-reply but I should call it out a little closer to the top: The innovation here is not “we are using water for cooling”. The innovation here is that they are direct-cooling the servers with chillers that are outside of the facility. Most mainframes use water cooling to get the heat from the core out to the edges, where it can be picked up by traditional heatsinks and cooling fans. Even home PCs do this by moving the heat to a reservoir that can be more effectively cooled.

What Google is doing is using the huge chillers that would normally be cooling the air in the facility to cool water which is pumped directly into every server. The return water is then cooled in the chiller tower. This eliminates ANY air-based transfer besides the chiller tower. And this isn't being done on a server or a rack... it's being done on the whole data center all at once.
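
(A rough steady-state sketch of what moving all of that heat in water implies for the facility loop; the load and temperature-rise figures below are illustrative assumptions, not Google's numbers:)

    # Back-of-the-envelope sizing for a direct-to-chip facility loop.
    # All numbers here are illustrative assumptions, not Google's figures.
    SPECIFIC_HEAT_WATER = 4186.0  # J/(kg*K)

    def required_flow_kg_per_s(it_load_watts: float, delta_t_kelvin: float) -> float:
        """Mass flow needed to carry the IT load at a given coolant temperature rise."""
        return it_load_watts / (SPECIFIC_HEAT_WATER * delta_t_kelvin)

    # Assume a 10 MW hall and a 10 K rise between supply and return water:
    flow = required_flow_kg_per_s(10e6, 10.0)
    print(f"~{flow:.0f} kg/s, i.e. ~{flow:.0f} L/s of water")  # ~239 L/s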

I am super curious how they handle things like chiller maintenance or pump failures. I am sure they have redundancy but the system for that has to be super impressive because it can’t be offline long before you experience hardware failure!

[Edit: It was pointed out in another comment that AWS is doing this as well and honestly their pictures make it way clearer what is happening: https://www.aboutamazon.com/news/aws/aws-liquid-cooling-data...]

nitwit005 · 49m ago
This was before I was born, so I'm hardly an expert, but I've heard of feeding IBM mainframes chilled water. A quick check of Wikipedia found some mention of the idea: https://en.wikipedia.org/wiki/IBM_3090
ChuckMcM · 38m ago
When our mainframe in 1978 sprung a leak in its water cooling jacket it took down the main east/west node on IBM's internal network at the time. :-). But that was definitely a different chilling mechanism than the types Google uses.
ChuckMcM · 37m ago
Much of Google's use of liquid chillers was protected behind NDAs as part of its "hidden advantage" with respect to the rest of the world. It was the secret behind really low PUE (power usage effectiveness) numbers.
ambicapter · 1h ago
So every time they plug in a server they also plug in water lines?
liquidgecka · 1h ago
[I am not a current Google employee, so my understanding of this is based on externally written articles and “leap of faith” guesstimation]

Yes. A supply and return line along with power. Though if I had to guess how it's set up, this would be done with some super slick “it just works” kind of mount that lets them just slide the case in and lock it in place. When I was there almost all hardware replacement was made downright trivial, so it could just be more or less slide in place and walk away.

scrlk · 12m ago
You can see the male quick disconnect fittings for the liquid cooling at each corner of the server in this photo:

https://substackcdn.com/image/fetch/$s_!8aMm!,f_auto,q_auto:...

Looks like the power connector is in the centre. I'm not sure if the backplane connectors are what's covered up by the orange plugs?

michaelt · 20m ago
Interestingly, entire supercomputers have been decommissioned [1] due to faulty quick disconnects causing water spray.

So you can get a single, blind-mating connector combining power, data and water - but you might not want to :)

[1] https://gsaauctions.gov/auctions/preview/282996

nielsbot · 29m ago
Maybe similar to a gasoline hose breakaway

https://www.opwglobal.com/products/us/retail-fueling-product...

jayd16 · 16m ago
Maybe we can declutter things if they get PWoE (power and water over Ethernet) or just a USB-W standard.
Nition · 6m ago
It worked for MONIAC.
ajb · 54m ago
I remember reading somewhere that they don't operate at the level of servers; if one dies they leave it in place until they're ready to replace the whole rack. Don't know if that's true now, though.

It does sound like connections do involve water lines though. As they are isolating different water circuits, in theory they could have a dry connection between heat exchanger plates, or one made through thermal paste. It doesn't sound like they're doing that though.

liquidgecka · 18m ago
It has not been true for a LONG time. That was part of Google's early “compute unit” strategy that involved things like sealed containers and such. Turns out that's not super efficient or useful because you leave large swaths of hardware idle.

In my day we had software that would “drain” a machine and release it to hardware ops to swap the hardware on. This could be a drive, memory, CPU or a motherboard. If it was even slightly complicated they would ship it to Mountain View for diagnostic and repair. But every machine was expected to be cycled to get it working as fast as possible.

We did a disk upgrade on a whole datacenter that involved switching from 1TB to 2TB disks or something like that (I am dating myself), and minimizing total downtime was so important they hired temporary workers to work nights to get the swap done as quickly as possible. If I remember correctly that was part of the “holy cow Gmail is out of space!” chaos though, so added urgency.

cavisne · 20m ago
Definitely not true for these workloads. TPUs are interconnected, one dying makes the whole cluster significantly less useful.
jedberg · 1h ago
Looks like it. New server means power, internet, and water.
fudgy73 · 39m ago
just like humans.
legulere · 1h ago
It's not so surprising when considering Google's history coming from inexpensive commodity hardware. It's pretty similar to how it took decades for x86 servers and operating systems to gain mainframe functionality like virtualisation.

https://blog.codinghorror.com/building-a-computer-the-google...

spankalee · 2h ago
From the article:

> Liquid cooling is a familiar concept to PC enthusiasts, and has a long history in enterprise compute as well.

For a while, the trend in data centers was to move towards more passive cooling at the individual servers and hotter operating temperatures. This is interesting because it largely reverses that trend, possibly because of the per-row cooling.

dekhn · 1h ago
We've basically been watching Google gradually re-discover all the tricks of supercomputing (and other high performance areas) over the past 10+ years. For a long time, websearch and ads were the two main drivers of Google's datacenter architecture, along with services like storage and jobs like mapreduce. I would describe the approach as "horizontal scaling with statistical multiplexing for load balancing".

That style of job worked well, but as Google has realized it has more high performance computing with unique, mission-critical workload characteristics (https://cloud.google.com/blog/topics/systems/the-fifth-epoch...), its infrastructure has had to undergo a lot of evolution to adapt to that.

Google PR has always been full of "look we discovered something important and new and everybody should do it", often for things that were effectively solved using that approach a long time ago. MapReduce is a great example of that- Google certainly didn't invent the concepts of Map or Reduce, or even the idea of using those for doing high throughput computing (and the shuffle phase of MapReduce is more "interesting" from a high performance computing perspective than mapping or reducing anyway).

liquidgecka · 1h ago
As somebody that worked on Google data centers after coming from a high performance computing world, I can categorically say that Google is not “re-learning” old technology. In the early days (when I was there) they focused heavily on moving from thinking of computers to thinking of compute units. This is where containers and self-contained data centers came from. This was actually a joke inside of Google because it failed but was copied by all the other vendors for years after Google had given up on it. They then moved from thinking about cooling as something that happens within a server case to something that happens to a whole facility. This was the first major leap forward, where they moved from cooling the facility and pushing conditioned air in to cooling the air immediately behind the server.

Liquid cooling at Google scale is different from mainframes as well. Mainframes needed to move heat from the core out to the edges of the server where traditional data center cooling would transfer it away to be conditioned. Google liquid cooling is moving the heat completely outside of the building while it's still liquid. That's never been done before as far as I am aware. Not at this scale at least.

mattofak · 1h ago
It's possible it never made it into production; but when I was helping to commission a 4-rack "supercomputer" circa 2010 we used APC's in-row cooling (which did glycol exchange to the outside but still maintained the hot/cold aisle), and I distinctly remember reading a whitepaper about racks with built-in water cooling and the problems with pressure loss, dripless connectors, and corrosion. I no longer recall if the direct cooling loop exited the building or just cycled in the rack to an adjacent secondary heat exchanger. (And I don't remember if it was an APC whitepaper or some other integrator.)

There's also all the fun experiments with dunking the whole server into oil, but I'll give you that again I've only seen setups described with secondary cooling loops - probably because of corrosion and wanting to avoid contaminants.

jonathaneunice · 54m ago
"From the core to the edges of the server"—what does that even mean?

Unless Google has discovered a way to directly transfer heat to the aethereal plane, nothing they’re doing is new. Mainframes were moving chip and module heat entirely outside the building decades ago. Immersion cooling? Chip, module, board, rack, line, and facility-level work? Rear-door and hybrid strategies? Integrated thermal management sensors and controls? Done. Done. Done. Done. Richard Chu, Roger Schmidt, and company were executing all these strategies at scale long before Google even existed.

liquidgecka · 24m ago
Where does the “heat” leave the server? In most mainframes they use liquid cooling to move the heat from the chips at the hyper-dense core of the machine to the exterior of the machine, where it's transferred to air via fans and heat sinks, to be picked up and cooled by a data center level cooling system (chillers and such).

As far as I know there were no mainframes of old that would use coolant that had moved from outside of the building, directly into the core of the chip and back. Most either used an intermediate transfer reservoir for fluids, or transferred to air outside of the center of the compute system, then cooled the air via some other cooling system.

I could be completely wrong of course. I mostly worked in the HPC world of the late 90’s and early 2000’s before switching into enterprise supercomputing. There are a lot of machines I never worked with or even knew about before my time, when things were far more experimental.

zer00eyz · 59m ago
> cooling is moving the heat completely outside of the building while it’s still liquid.

We have been doing this for decades; it's how refrigerants work.

The part that is new is not having an air-interface in the middle of the cycle.

Water isn't the only coolant being looked at, mostly because high-pressure PtC (push-to-connect) fittings and monitoring/sensor hardware have evolved. If a coolant is more expensive but leaks don't destroy equipment and can be quickly isolated, then it becomes a cost/accounting question.

marcosdumay · 24m ago
The claim is that Google has larger pipes that go all the way out of the building, while mainframes have short pipes that go only to a heat exchanger on the end of the rack.

IMO, it's not a big difference. There are probably many details more noteworthy than this. And yeah, mainframes are that way because the vendor only creates them up to the rack level, while Google has the "vendor" design the entire datacenter. Supercomputers have had single-vendor datacenters for decades too, and have been using large pipes for a while too.

liquidgecka · 49m ago
> The part that is new is not having an air-interface in the middle of the cycle.

I wasn’t clear when I was writing but this was the point I was trying to make. Heat from the chip is transferred in the same medium all the way from the chip to the exterior chiller without intermediate transfers to a new medium.

Sesse__ · 43m ago
> MapReduce is a great example of that- Google certainly didn't invent the concepts of Map or Reduce, or even the idea of using those for doing high throughput computing (and the shuffle phase of MapReduce is more "interesting" from a high performance computing perspective than mapping or reducing anyway).

The “Map” in MapReduce does not originally stand for the map operation, it comes from the concept of “a map” (or, I guess, a multimap). MapReduce descends from “the ripper”, an older system that mostly did per-element processing, but wasn't very robust or flexible. I believe the map operation was called “Filter()” at the time, and reduce also was called something else. Eventually things were cleaned up and renamed into Map() and Reduce() (and much more complexity was added, such as combiners), in a sort of backnaming.

It may be tangential, but it's not like the MapReduce authors started with “aha, we can use functional programming here”; it's more like the concept fell out. The fundamental contribution of MapReduce is not to invent lambda calculus, but to show that with enough violence (and you should know there was a lot of violence in there!), you can actually make a robust distributed system that appears simple to the users.

dekhn · 28m ago
I believe Map in MapReduce stood for the "map" function, not a multimap; I've never heard or read otherwise (maps can operate over lists of items, not just map types). That's consistent both with the original MapReduce paper: """Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately"""

and with the internal usage of the program (I only started in 2008, but spoke to Jeff extensively about the history of MR as part of Google's early infra) where the map function can be fed with recordio (list containers) or sstable (map containers).
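
(As a toy illustration of the contract the quoted paper describes: map takes one logical record and emits intermediate key/value pairs, and reduce combines all values sharing a key. The sketch below is for readers following along, not Google's implementation:)

    from collections import defaultdict

    # Toy word-count in the map/reduce style described in the quoted paper.
    def map_fn(record):            # one logical record -> (key, value) pairs
        for word in record.split():
            yield (word, 1)

    def reduce_fn(key, values):    # all values sharing a key -> combined result
        return (key, sum(values))

    records = ["the quick brown fox", "the lazy dog"]
    grouped = defaultdict(list)    # the "shuffle": group intermediate pairs by key
    for r in records:
        for k, v in map_fn(r):
            grouped[k].append(v)
    print([reduce_fn(k, vs) for k, vs in grouped.items()])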

As for the ripper, if you have any links to that (rather than internal Google lore), I'd love to hear about it. Jeff described the early infrastructure as being very brittle.

dmayle · 1h ago
As to MapReduce, I think you're fundamentally mistaken. You can talk about map and reduce in the lambda calculus sense of the term, but in terms of high performance distributed calculations, MapReduce was definitely invented at Google (by Jeff Dean and Sanjay Ghemawat in 2004).
jonathaneunice · 38m ago
Not quite. Google brilliantly rebranded the work of John McCarthy, C.A.R. Hoare, Guy Steele, _et al_ from 1960 ff. e.g. https://dl.acm.org/doi/10.1145/367177.367199

Dean, Ghemawat, and Google at large deserve credit not for inventing map and reduce—those were already canonical in programming languages and parallel algorithm theory—but for reframing them in the early 2000s against the reality of extraordinarily large, scale-out distributed networks.

Earlier takes on these primitives had been about generalizing symbolic computation or squeezing algorithms into environments of extreme resource scarcity. The 2004 MapReduce paper was also about scarcity—but scarcity redefined, at the scale of global workloads and thousands of commodity machines. That reframing was the true innovation.

dekhn · 23m ago
CERN was doing the equivalent of MapReduce before Google existed.
echelon · 1h ago
This abundance of marketing (not necessarily this blog post) is happening because of all the environmental chatter about AI and data centers recently.

Google wants you to know it recycles its water. It's free points.

Edit: to clarify, normal social media is being flooded with stories about AI energy and water usage. Google isn't greenwashing, they're simply showing how things work and getting good press for something they already do.

Legend2440 · 1h ago
The environmental impact of water usage seems way overblown to me.

Last year, U.S. data centers consumed 17 billion gallons of water. That sounds like a lot, but the US as a whole uses 300 billion gallons of water every day. Water is not a scarce resource in much of the country.
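
(Putting those two figures, as cited in the comment, on the same footing:)

    # Comparing the figures cited above: 17B gal/year vs. 300B gal/day.
    dc_gal_per_year = 17e9
    us_gal_per_day = 300e9
    dc_gal_per_day = dc_gal_per_year / 365        # ~47 million gallons/day
    share = dc_gal_per_day / us_gal_per_day       # ~0.016% of daily US use
    print(f"{dc_gal_per_day/1e6:.0f}M gal/day, {share:.4%} of daily US use")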

Guvante · 1h ago
To be clear, all the talk of water usage has been focusing on local usage, which isn't well represented by national numbers.
foota · 1h ago
Personally, I agree. That said, I think it might be worth considering the impact of water usage in local areas that are relatively water constrained.
mlyle · 1h ago
Google uses plenty of water in cooling towers and chillers. Sure, water cooling loops to the server may reduce the amount a little bit compared to fans, but this isn't "recycling" in any normal sense.

It's mostly a play for density.

coliveira · 45m ago
This is typical for Google: they will reinvent things and say they're the first to do it. Google is typically concerned with costs; there are probably solutions out there for these problems, but they don't want to pay for them. It is cheaper, at their scale, to reinvent things and get it done internally, and then claim they're first.
jsnell · 29m ago
Why make up such an absurd grievance to get fake-outraged about? Nowhere in the article does it say that Google claims to have invented liquid cooling. In fact, nowhere does it say they claim any part of this is a new invention.

But the point of this kind of paper is typically not what is new, it's what combination of previously known and novel techniques have been found to work well at massive scale over a timespan of years.

jeffbee · 2h ago
Hyper-scale data centers normally need not be concerned with power density, and their designers might avoid density because of the problems it causes. Arguably modern HPC clusters that are still concerned about density are probably misguided. But when it comes to ML workloads, putting everything physically close together starts to bring benefits in terms of interconnectivity.
jonathaneunice · 1h ago
LOLWUT? Hyperscalers and HPC data centers have been very concerned about power and thermal density for decades IME. If you're just running web sites or VM farms, sure, keep racks cooler and more power efficient. But for those that deeply care about performance, distance drives latency. That drives a huge demand to "pack things closely" and that drives thermal density up, up, up. "Hot enough to roast a pig" was a fintech data center meme of 20 years ago, back at 20kW+ racks. Today you're not really cooking until you get north of 30kW.
jeffbee · 1h ago
Name a hyper-scale data center where the size and shape suggests that power density ever entered the conversation.
michaelt · 2h ago
> TPU chips are hooked up in series in the loop, which naturally means some chips will get hotter liquid that has already passed other chips in the loop. Cooling capacity is budgeted based on the requirements of the last chip in each loop.

Of course, it's worth noting that if you've got four chips, each putting out 250W of power, and a pump pushing 1 litre of water per minute through them, water at the outlet must be about 14°C hotter than water at the inlet, because of the specific heat capacity of water. That's true whether the water flows through the chips in series or in parallel.
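
(A quick sketch checking that arithmetic, and showing what series plumbing means for the inlet temperature each chip sees; the figures are the ones from the comment above:)

    # Temperature rise along a series loop: 4 x 250 W chips, 1 L/min of water.
    C_WATER = 4186.0          # J/(kg*K)
    flow = 1.0 / 60.0         # 1 L/min of water is ~1/60 kg/s
    chips = [250.0] * 4       # heat output per chip, in watts

    rise = 0.0
    for i, power in enumerate(chips):
        print(f"chip {i}: coolant inlet at +{rise:.1f} K over loop inlet")
        rise += power / (flow * C_WATER)    # ~3.6 K added per chip
    print(f"loop outlet: +{rise:.1f} K")    # ~14.3 K total, series or parallel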

foota · 2h ago
Hm... but when the chips are in series, the heat transfer from the last chip will be less than when the chips are in parallel, because the rate of heating is proportional to the difference in temperature, and in the parallel case the water reaching that last theoretical chip starts at a lower temperature.
chickenbig · 51m ago
In steady state the power put into the chip is removed by the water (neglecting heat transfer away from the cooling system). The increased water temperature on entering into the cooling block is offset by a correspondingly higher chip temperature.
fraserphysics · 2h ago
One way to characterize the cost of cooling is entropy production. As you say, cooling is proportional to difference in temperature. However, entropy production is also proportional to temperature difference. It's not my field, but it looks like an interesting challenge to optimize competing objectives.
0x457 · 54m ago
Yes, but the water is constantly moving in a loop. It's not like you use water to cool chip #1 and then it moves to chip #2; it's constantly moving, so the temperature delta isn't that much.
smachiz · 23m ago
in their first serial design, that's exactly what it was doing.
friendzis · 2h ago
While there is some truth to your comment, it has no practical engineering relevance. Since the energy transfer rate is proportional to the temperature difference, you compute the flow rate required, which is going to be different if the chips are in series or in parallel.
idiotsecant · 2h ago
It just means in series that some of your chips get overcooled in order to achieve the required cooling on the hottest chip. You need to run more water for the same effect.
k7sune · 1h ago
I can imagine a setup where multiple streams of slower, cooler water converge into a faster, warmer stream, and the water will extract an equal amount of heat away from all the chips whether upstream or downstream.
miohtama · 34m ago
How much of a moat does Google have in the AI race due to its in-house expertise in data centers and in-house IPR in TPUs? Can any company on Earth match them?
owebmaster · 11m ago
Not enough to prevent open source models from making it all useless, I think.
hnburnsy · 38m ago
Kind of related from B1M...

>The Paris Olympic Pool is Heated by the Internet

>https://www.youtube.com/watch?v=2gWudPtN6z4&t=4s

betaby · 3h ago
bri3d · 2h ago
Your linked articles are about immersion cooling, which is "liquid cooling," I suppose, but a very different thing. Do OVH actually use immersion cooling in production? This seems like a "labs" product that hasn't been fully baked yet.

OVH _do_ definitely use traditional water-loop/water-block liquid cooling like Google are presenting, described here: https://blog.ovhcloud.com/understanding-ovhclouds-data-centr... and visually in https://www.youtube.com/watch?v=RFzirpvTiOo , but while it looks very similar architecturally their setup seems almost comically inefficient compared to Google's according to their PUE disclosures.

jeffbee · 2h ago
And yet their claimed PUE is 1.26, which is pretty bad. One way to characterize that overall PUE figure is that they waste 3x as much on overhead as Google (claimed 1.09 global PUE) or Meta (1.08).
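
(Since PUE is total facility power divided by IT power, the overhead per watt of IT load is simply PUE minus 1; using the figures quoted above:)

    # Overhead per watt of IT load, from the PUE figures quoted above.
    for name, pue in [("OVH", 1.26), ("Google", 1.09), ("Meta", 1.08)]:
        print(f"{name}: {pue - 1:.2f} W of overhead per W of IT load")
    print(f"{0.26 / 0.09:.1f}x")   # ~2.9x, the "3x as much overhead" claim
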
m463 · 1h ago
I wonder what the economics of water cooling really is.

Is it because chips are getting more expensive, so it is more economical to run them faster by liquid cooling them?

Or is it data center footprint is more expensive, so denser liquid cooling makes more sense?

Or is it that wiring distances (1 ft ≈ 1 nanosecond) make dense computing faster and more efficient?

MurkyLabs · 1h ago
It's a mixture of 2 and 3. The chips are getting hotter because they're compacting more stuff into a small space and throwing more power into them. At the same time, powering all those fans that cool the computers takes a lot of power (when you have racks and racks, those small fans add up quickly), and that heat is then blown into hot aisles that need to then circulate the heat to A/C units. With liquid cooling they're able to save costs due to lower electricity usage and having direct liquid-to-liquid cooling as opposed to chip->air->AC->liquid. ServeTheHome did a write-up on it last year: https://www.servethehome.com/estimating-the-power-consumptio...
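
(To get a feel for why those fans add up, a very rough illustration; every number below is an assumption for the sake of the sketch, not a figure from the linked write-up:)

    # Rough fan-power estimate for one hall; all figures are assumptions.
    servers_per_rack = 40
    racks = 500
    fans_per_server = 6
    watts_per_fan = 12.0      # assumed high-RPM server fan under load
    fan_watts = servers_per_rack * racks * fans_per_server * watts_per_fan
    print(f"~{fan_watts/1e6:.1f} MW spent just spinning fans")   # ~1.4 MW
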
mikepurvis · 1h ago
I've never done DC ops, but I bet fan failure is a factor too: basically there'd be a benefit to centralizing all the cooling for N racks in 2-3 large redundant pumps, rather than having each node bring its own battalion of fans that are all going to individually fail in a bell curve centered on 30k hours of operation, with each failure knocking out the node and requiring hands-on maintenance.
summerlight · 1h ago
Not sure about classical computing demands, but I think wiring distances definitely matter for TPU-like memory heavy computation.
moffkalast · 1h ago
It's more of a testament to inefficiency, with TDP rising year after year as losses get larger with smaller-nm processes. It's so atrocious that even in the consumer sector, Nvidia can't design a connector that doesn't melt during normal usage because their power draw has become beyond absurd.

People don't really complain about crappy shovels during a gold rush though unfortunately, they're just happy they got one before they ran out. They have no incentive to innovate in efficiency while the performance line keeps going up.

BoppreH · 2h ago
I see frequent mentions of AI wasting water. Is this one such setup, perhaps with the CDU using the facility's water supply for evaporative cooling?
bri3d · 2h ago
The CDU is inside the datacenter and strictly liquid to liquid exchange. It transfers heat from the rack block's coolant to the facility coolant. The facility then provides outdoor heat exchange for the facility coolant, which is sometimes accomplished using open-loop evaporative cooling (spraying down the cooling towers). All datacenters have some form of facility cooling, whether there's a CDU and local water cooling or not, so it's not particularly relevant.

The whole AI-water conversation is sort of tiring, since water just moves to more or less efficient parts or locations in the water cycle - I think a "total runtime energy consumption" metric would be much more useful if it were possible to accurately price in water-related externalities (ie - is a massive amount of energy spent moving water because a datacenter evaporates it? or is it no big deal?). And the whole thing really just shows how inefficient and inaccurately priced the market for water is, especially in the US where water rights, price, and the actual utility of water in a given location are often shockingly uncorrelated.

maartin0 · 1h ago
Lerc · 2h ago
I have encountered a lot of references to AI using water, but with scant details. Is it using water in the same way a car uses a road? The road remains largely unchanged?

The implication is clear that it is a waste, but I feel like if they had the data to support that, it wouldn't be left for the reader to infer.

I can see two models where you could say water is consumed: either talking about drinkable water rendered undrinkable, or turning water into something else where it is not practically recaptured. Turning it into steam, sequestering it in some sludge, etc.

Are these things happening? If it is happening, is it bad? Why?

I'd love to see answers on this, because I have seen the figures used like a cudgel without specifying what the numbers actually refer to. It's frustrating as hell.

tony_cannistra · 1h ago
This article will help you. https://www.construction-physics.com/p/how-does-the-us-use-w...

> ...actual water consumed by data centers is around 66 million gallons per day. By 2028, that’s estimated to rise by two to four times. This is a large amount of water when compared to the amount of water homes use, but it's not particularly large when compared to other large-scale industrial uses. 66 million gallons per day is about 6% of the water used by US golf courses, and it's about 3% of the water used to grow cotton in 2023.

jeffbee · 1h ago
The water is "used" in the sense that it evaporates, at a global average rate of 1 liter per kilowatt-hour of energy, Google claims.
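
(For scale, taking that claimed 1 L/kWh figure at face value; the 30 kW rack is an assumed example, not a number from the thread:)

    # Scale of the claimed evaporation figure: 1 L of water per kWh.
    rack_kw = 30.0                         # assumed rack power, for illustration
    liters_per_day = rack_kw * 24 * 1.0    # 1 L evaporated per kWh
    print(f"~{liters_per_day:.0f} L/day for a 30 kW rack")   # 720 L/day
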
sleepydog · 2h ago
AWS had a similar article a couple months ago:

https://www.aboutamazon.com/news/aws/aws-liquid-cooling-data...

In either case I cannot find out how they dump the heat from the output water before recycling it. That's a problem I find far more interesting.

jeffbee · 2h ago
The reason you see frequent mentions of AI wasting water is there is a massive influence operation that seeks to destroy knowledge, science, and technology in the United States and return most of us to a state of bare subsistence, working menial jobs or reduced to literal slavery. There is no subjective measure by which the water used by AI is even slightly concerning.
wredcoll · 2h ago
I looked very hard but I don't see a way to subscribe to your newsletter?
stripe_away · 1h ago
> there is a massive influence operation that seeks to destroy knowledge, science, and technology in the United States

Agreed. It started with Big Tobacco discrediting the connection to lung cancer; the playbook was copied by many and weaponized by Russia.

> There is no subjective measure by which the water used by AI is even slightly concerning.

Does not follow from your first point. The water has to be sourced from somewhere, and debates over water rights are as old as civilization. For one recent example, see i.e. https://www.texaspolicy.com/legewaterrights/

You are probably correct that the AI does not damage the water, but unless there are guarantees that the water is rapidly returned "undamaged" to the source, there are many reasons to be concerned about who is sourcing water from where.

hnburnsy · 48m ago
>"Google extensively validates components with leak testing, uses alerting systems to discover problems like leaks, and takes preventative measures like scheduled maintenance and filtration. They also have a clear set of protocols to respond to alerts and issues, allowing a large workforce to respond to problems in a consistent fashion. It’s a far cry from the ad-hoc measures enthusiasts take to maintain their water cooling setups."

How about the author compare it to MS, Meta, or AWS instead of Joe Blow buying parts online? I would hope that Google had extensive validation and clear protocols. [Roll-Eyes]