Measuring the environmental impact of AI inference

93 points · ksec · 38 comments · 8/23/2025, 3:22:33 AM · arstechnica.com

Research paper: https://services.google.com/fh/files/misc/measuring_the_envi...

Google blog post: https://cloud.google.com/blog/products/infrastructure/measur...

Comments (38)

jjani · 1h ago

Here's what happened:

1. Google rolled out AI summaries on all of their search queries, through some very tiny model.

2. Given worldwide search volume, that model now represents more than 50% of all queries if you throw it on a big heap with "intentional" LLM usage.

3. Google gets to claim "the median is now 33x lower!", as the median is now that tiny model giving summaries nobody asked for.

It's very concerning that this marketing puff piece is being eaten up by HN of all places as evidenced by the other thread.

Google is basing all of this on the "median" because there are orders of magnitude of difference between strong models (what most people think of when you talk AI) and tiny models, which Google uses "most" by virtue of running them for every single Google search to produce the summaries. So the "median" will be whatever tiny model they use for those summaries. Never mind that Gemini 2.5 Pro, which is what everyone here would actually be using, may well consume >100x as much.

It's absurdly misleading and rather obvious, but it feels like most are very eager to latch on to this so they can tell themselves their usage and work (for the many here in AI or at Google) is all peachy. I've been reading this place for years and have never before seen such uncritical adoption of an obvious PR piece detached from reality.

raincole · 15m ago

It's not what the report says.

> It's very concerning that this marketing puff piece is being eaten up by HN of all places as evidenced by the other thread.

It's very concerning that you can just make shit up on HN and be the top comment as long as it's to bash Google.

> Never mind that Gemini 2.5 Pro, which is what everyone here would actually be using, may well consume >100x as much

Yes, exactly, never mind that. The report compares against a data point from May 2024, before Gemini 2.5 Pro became a thing. Google never said that the AI summary is made by Gemini 2.5 Pro.

mgraczyk · 1h ago

As others have pointed out, this is false. Google has made their models and hardware more efficient; you can read the linked report. Most of the efficiency comes from quantization, MoE, new attention techniques, and distillation (making smaller models usable in place of bigger models).

shwaj · 1h ago

Are you sure? It wouldn’t shock me, but they specifically say “Gemini Apps”. I wasn’t familiar with the term, but a web search indicated that it has a specific meaning, and it doesn’t seem to me like web search AI summaries would be covered by it. Am I missing something?

user568439 · 25m ago

"It's very concerning that this marketing puff piece is being eaten up by HN of all places as evidenced by the other thread."

It's very concerning that you claim this without previously fully reading and understanding Google's publication...

jonas21 · 1h ago

What exactly are you basing this assertion on (other than your feelings)? Are you accusing Google of lying when they say in the technical report [1]:

> This impact results from: A 33x reduction in per-prompt energy consumption driven by software efficiencies—including a 23x reduction from model improvements, and a 1.4x reduction from improved machine utilization.

followed by a list of specific improvements they've made?

[1] https://services.google.com/fh/files/misc/measuring_the_envi...
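A quick sanity check on how the two quoted factors relate to the 33x headline. This is a minimal sketch that assumes the factors compose multiplicatively (the quote does not state this explicitly); the variable names are made up for illustration.

```python
# Factors quoted from the technical report.
model_factor = 23.0        # "23x reduction from model improvements"
utilization_factor = 1.4   # "1.4x reduction from improved machine utilization"

# Assumption: independent reduction factors multiply.
combined = model_factor * utilization_factor

print(f"combined reduction: {combined:.1f}x")
```

Under that assumption the product comes out near 32x, close to the quoted 33x. The small gap is presumably rounding in the published factors, though that is a guess.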

esperent · 1h ago

Unless marketing blogs from any company specifically say what model they are talking about, we should always assume they're hiding/conflating/mislabeling/misleading in every way possible. This is corporate media literacy 101.

The burden of proof is on Google here. If they've reduced Gemini 2.5 energy use by 33x, they need to state that clearly. Otherwise we should assume they're fudging the numbers, for example:

A) they've chosen one particular tiny model for this number

B) it's a median across all models including the tiny one they use for all search queries

EDIT: I've read over the report and it's B) as far as I can see

Without more info, any other reading of this is a failing on the reader's part, or wishful thinking if they want to feel good about their AI usage.

We should also be ready to change these assumptions if Google or another reputable party does confirm this applies to large models like Gemini 2.5, but should assume the least impressive possible reading until that missing info arrives.

Even more useful info would be how much electricity Google uses per month, and whether that has gone down or continued to grow in the period following this announcement. Because total energy use across their whole AI product range, including training, is the only number that really matters.

mquander · 55m ago

You should not assume that "they've chosen one particular tiny model", or "it's a median across all models including the tiny one they use for all search queries" because those are totally made up assumptions that have nothing to do with what they say they measured. They measured the Gemini Apps product that completes text prompts. They also provided a chart showing that the thing they are measuring scores comparably to GPT-4o on LM Arena.

mgraczyk · 29m ago

> total energy use across their whole AI product range, including training, is the only number that really matters.

What if they are serving more requests?

mgraczyk · 1h ago

They did specifically say in the linked report

esperent · 39m ago

Here's the report. Could you tell me where in it you found a link to 33x reduction (or any large reduction) for any specific non-tiny model? Because all I can find is lots of references to "median Gemini". In fact, I would say they're being extremely careful in this paper not to mention any particular Google models with regards to energy reduction.

https://services.google.com/fh/files/misc/measuring_the_envi...

mgraczyk · 27m ago

Figure 4

I think you are assuming we are talking about swapping API usage from one model to another. That is not what happened. A specific product doing a specific thing uses less energy now.

To clarify: the way models become more efficient is usually by training a new one with a new architecture, quantization, etc.

This is analogous to making a computer more efficient by putting a new CPU in it. It would be completely normal to say that you made the computer more efficient, even though you've actually swapped out the hardware.

sigilis · 5m ago

Don’t they call all their LLM models Gemini? The paper indicates that they specifically used all the AI models to come up with this figure when they describe the methodology. It looks like they even include classification and search models in this estimate.

I’m inclined to believe that they are issuing a misleading figure here, myself.

esperent · 14m ago

> Figure 4: Median Gemini Apps text prompt emissions over time—broken down by Scope 2 MB emissions (top) and Scope 1+3 emissions (bottom). Over 12 months, we see that AI model efficiency efforts have led to a 47x reduction in the Scope 2 MB emissions per prompt, and 36x reduction in the Scope 1+3 emissions per user prompt—equivalent to a 44x reduction in total emissions per prompt.

Again, it's talking about "median Gemini" while being very careful not to name any specific numbers for any specific models.

mgraczyk · 43s ago

That isn't what that means. Look at the paragraph above that where they explain.

This is the median model used to serve requests for a specific product surface. It's exactly analogous to upgrading the CPU in a computer over time

RajT88 · 1h ago

Big tech seems all about the fluff.

But, wasn't it always so?

Wasn't it always so in business of all kinds?

Why should we expect anything different? We should have been skeptical all along.

camillomiller · 1h ago

I’ve been covering tech for 20 years. No, it wasn’t always like that. There was a sincere mutual respect between the companies and the media industry that I don’t see anymore. Both sides have their faults, but you know it’s not media that hyperscaled and created gazillionaires by the score. Also, software is way more bendable to the emperors’ whims, and Google has become particularly hypocritical in the way it publicly represents itself.

tobr · 1h ago

I’ve dramatically reduced my median calories per meal, by scheduling eight new meals a day, each consisting of one lettuce leaf.
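The arithmetic behind this quip can be sketched in a few lines. All calorie values are hypothetical, and the sketch only shows how a median reacts when many tiny items join a population. It says nothing about whether Google's measured prompts actually look like this, which other commenters dispute.

```python
from statistics import median

# Hypothetical numbers, for illustration only.
real_meals = [700, 800, 900]   # kcal for three ordinary meals
lettuce_meals = [1] * 8        # eight one-leaf "meals", roughly 1 kcal each

before = real_meals
after = real_meals + lettuce_meals

print("median before:", median(before))
print("median after: ", median(after))
print("total before: ", sum(before))
print("total after:  ", sum(after))
```

The median collapses to the lettuce leaf while the total calories eaten go up slightly, which is why a median alone cannot tell you whether overall consumption fell.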

kingstnap · 2h ago

If you have a market for it, the hardware industry will aggressively dig in to try to deliver. Maximum performance and maximum efficiency. So I can imagine there is still more to go.

I'm sure the relatively clean directed computational graph + massively parallel + massively hungry workload of AI is a breath of fresh air to the industry.

Hardware gains were for the longest time doing very little for consumers because the bottlenecks were not in the hardware but instead in extremely poorly written software running in very poorly designed layers of abstraction that nothing could be done about.

sbierwagen · 1h ago

The hardware overhang embodied: that early AI will be inefficiently embodied as a blob of differentiable floating point numbers in order to do gradient descent on them, and shortly after be translated into a dramatically simpler and faster form. An AGI that requires a full rack of H100s to run, suddenly appearing on single video game consoles. https://www.lesswrong.com/w/computing-overhang

Fun fact: Deep Blue was a dedicated chess compute cluster that ran on 30 RS/6000 processors and 480 VLSI chips. If the Stockfish chess program existed in 1997 it would have beaten it with a single 486 CPU: https://www.lesswrong.com/posts/75dnjiD8kv2khe9eQ/measuring-...

textlapse · 1h ago

What’s the cost of training vs inference?

If it’s like Marvel sequels every year then there is a significant added training cost as the expectations get higher and higher to churn out better models every year like clockwork.

theanonymousone · 40m ago

Finally someone using drop not in a teenage slang sense.

jillesvangurp · 1h ago

There are two ways to make AI cheaper: make energy cheaper or make AI hardware and algorithms more efficient and use less energy that way. Google is investing in doing both. And that's a good thing.

I actually see growth in energy demand because of AI or other reasons as a positive thing. It's putting pressure on the world to deliver more energy cheaply. And it seems the most popular and straightforward way is through renewables + batteries. The more clean and cheap capacity like that is added, the more marginalized traditional more expensive solutions get.

The framing on this topic can be a bit political. I prefer to look at this through the lens of economics. The simple economic reality is that coal and gas plant construction has been bottlenecked for years on a lot of things, to the point where only very little of it gets planned and realized. And what little comes online has pretty poor economics. The cost and growth curves for renewables + batteries paint a pretty optimistic picture here, with traditional generation plateauing for a while (we'll still build more coal/gas plants, not a lot, and they'll be underutilized) and then dropping rapidly in the second half of the century as the cost and availability of alternatives improve and completely steamroll anything that can't keep up. Fossil fuel based generation could be all but gone by the 2060s.

There are lots of issues with regulations, planning, approval, etc for fossil fuel based generation. There are issues with supply chains for things like turbines. Long term access to cooling water (e.g. rivers) is becoming problematic because of climate change. And there are issues with investors voting with their feet and being reluctant to make long term commitments in what could end up being very poor long term investments. A lot of this also impacts nuclear, which while clean remains expensive and hard to deliver. The net result of all this is that investments in new energy capacity are heavily biased towards battery + renewables. It's the only thing that works on short notice. And it's also the cheapest way to add new capacity. Current growth is already 80-90% renewable. It's not even close at this point. We're talking tens/hundreds of GW added annually.

Of course AI is so hungry for energy that there is a temporary increase in usage for coal/gas. That's existing underutilized plants temporarily getting utilized a bit more mainly because they are there and utilizing them a bit more is relatively easy and quick to realize. It's not actually cheaper and future cost reductions will likely come in the form of replacing that capacity with cheaper power generation as soon as that can be delivered.

zekrioca · 26m ago

The water consumption measurements seem cherry-picked and incorrect, chosen to look better than they actually are. When asked about it, they doubled down and incorrectly claimed that the study they compared against was incorrect. See https://www.linkedin.com/posts/shaolei-ren-68557415_today-go...

philberto · 1h ago

How do you drop something by 33x? That is literally impossible unless they make money by purchasing energy.

mgraczyk · 1h ago

No it isn't

Suppose you were running a computation that requires doing 33,000 multiplies. Later you find a way to do the same computation using only 1,000 multiplies.

That's basically what happened here

playforclaude · 44m ago

33,000 multiples - (33 * 33,000 multiples) = -1056000 multiples

mgraczyk · 38m ago

Reducing something 33x means to make it 33 times smaller. It's a common way of saying this in English
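Under that reading, "reduced 33x" is a division, not a subtraction of 33 times the original. A minimal sketch using the 33,000-multiply example from upthread (the helper name `reduce_by_factor` is made up for illustration):

```python
def reduce_by_factor(value, factor):
    """Interpret 'reduced by Nx' as dividing the original value by N."""
    return value / factor

old = 33_000                       # multiplies before the optimization
new = reduce_by_factor(old, 33)    # multiplies after a "33x reduction"
percent_saved = (1 - new / old) * 100

print(f"{old} -> {new:.0f} multiplies, about {percent_saved:.0f}% fewer")
```

So a 33x reduction corresponds to roughly a 97% decrease, and the result stays positive.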

energy123 · 1h ago

Cost/prompt is a ratio. "Prompt" is not a normalized metric that is stable over time. It can increase (as context lengths increase) or decrease (as Google's product suite integrates LLMs).
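A toy illustration of why a per-prompt figure depends on what counts as a "prompt". Every number here is hypothetical, and the assumption that energy scales linearly with token count is a simplification made only for this sketch.

```python
def energy_per_prompt(energy_per_token, tokens_per_prompt):
    """Toy model: energy grows linearly with the number of tokens processed."""
    return energy_per_token * tokens_per_prompt

# Hypothetical values in arbitrary energy units.
ENERGY_PER_TOKEN = 2.0

short_prompt = energy_per_prompt(ENERGY_PER_TOKEN, 100)     # e.g. a one-line question
long_prompt = energy_per_prompt(ENERGY_PER_TOKEN, 1_000)    # e.g. a long-context request

print("short prompt:", short_prompt)
print("long prompt: ", long_prompt)
```

With identical per-token efficiency, the per-prompt number still differs tenfold, so a shift in the mix of prompt lengths alone can move the metric in either direction.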

drakenot · 1h ago

This from quantizing their Gemini model?

There are a lot of anecdotal reports of quality differences following some Gemini 2.5 Pro releases earlier in the year.

ant6n · 29m ago

I for one think that Gemini 2.5 pro has become much more stupid than before. This isn’t for coding, just simple business type support. It keeps forgetting queries, making really obviously bad suggestions, simple mistakes etc etc.

It’s kind of funny, because they keep talking about how close we are to AGI, and in reality they keep making the models dumber (uh, I mean more efficient).

lalaithion · 1h ago

They didn’t account for training. From the paper:

> LLM training & data storage: This study specifically considers the inference and serving energy consumption of an AI prompt. We leave the measurement of AI model training to future work.

This is disappointing, and no analysis is complete without attempting to account for training, including training runs that were never deployed. I’m worried these numbers would be significantly worse and that’s why we don’t have them.

sbierwagen · 1h ago

If I download a copy of llama and run a single query, what was the cost of that query?

progval · 46m ago

No, because you don't incentivize the training of the next version of LLama, and the current version was not trained because you wanted to run that query.

This is not true of Gemini.

ChrisArchitect · 3h ago

[dupe] https://news.ycombinator.com/item?id=44972808

benreesman · 1h ago

I'm on record as pretty stridently anti-AI Hype Bullshit (I was calling Altman a criminal back when that had real-world consequences, check the history).

But this is in the vanishing minority of frontpage AI threads where it's a really interesting conversation about quantifiable things: what quantization, what engagement metrics, what NDCG on downstream IR. People are complaining they gamed the number: that's an improvement! Normally they just lie. This is amenable to analysis and frankly an interesting one.

If it were up to me they'd flat regex ban "llm" and "ai" on HN, that's about the right ROC. But if we're going to have it? I'll take this over "How AI Saved My Vibecode Startup From Vibe Coding".

motorest · 1h ago

> People are complaining they gamed the number: that's an improvement!

Is it, though?

There's a post in this discussion claiming that Google rolled out AI summaries on all of their search queries. This means they greatly increased the number of queries by triggering queries at each Google search. These are unsolicited queries that users do not send by themselves or want.

Then the post claims each of these unsolicited queries are executed using small models that are cheaper to run.

The post asserts these unsolicited queries represent half of the queries.

Google's claims are that now the median cost of their queries is lower. The post asserts around half of Google's AI queries are not requested by users and instead forced upon them with searches.

To me, what this spells is the exact opposite of an improvement. It's waste that is not requested by anyone and adds no value. It's just waste.

Consequently, if Google pulled the plug on these queries then they would reduce their total query count by around 50%. How much energy and carbon emissions would that save? Well, if you pick up that value and flip it over to show how much is being wasted, that's your "improvement".
