I've come to realize that I liked believing that there was something special about the human mental ability to use our mind's eye and visual imagination to picture something, such as how we would look with a different hairstyle. It's uncomfortable seeing that skill reproduced by machinery at the same level as my own imagination, or even better. It makes me feel like my ability to use my imagination is no more remarkable than my ability to hold a coat off the ground like a coat hook would.
FuckButtons · 1h ago
I have aphantasia, I’m glad we’re all on a level playing field now.
yoz-y · 59m ago
I always thought I had a vivid imagination. But then aphantasia was mentioned on Hello Internet once; I looked it up, saw comments like these, and honestly…
I’ve no idea how to even check. According to various tests I believe I have aphantasia. But mostly I haven’t got even the slightest idea how not having it is supposed to work. I guess this is one of those mysteries where a missing sense can’t be described in any way.
foofoo12 · 20m ago
Ask people to visualize a thing. Pick something like a house, dog, tree, etc. Then ask about details. Where is the dog?
I have aphantasia and my dog isn't anywhere. It's just a dog, you didn't ask me to visualize anything else.
When you ask about details, like color, tail length, or eyes, then I have to make them up on the spot. I can do that very quickly, but I don't "see" the good boy.
jmcphers · 51m ago
A simple test for aphantasia that I gave my kids when they asked about it is to picture an apple with three blue dots on it. Once you have it, describe where the dots are on the apple.
Without aphantasia, it should be easy to "see" where the dots are since your mind has placed them on the apple somewhere already. Maybe they're in a line, or arranged in a triangle, across the middle or at the top.
Sohcahtoa82 · 39m ago
After reading your first sentence, I immediately saw an apple with three dots in a triangle pointing downwards on the side. Interestingly, the 3 dots in my image were flat, as if merely superimposed on an image of an apple, rather than actually being on an apple.
How do people with aphantasia answer the question?
foofoo12 · 27m ago
I found out recently that I have aphantasia, based on everything I've read. When you tell me to visualize, I imagine; I don't see it. An apple, I can imagine that. I can describe it, but only in the sparsest detail. When you ask for details, I have to fill them in.
I hadn't really placed those three dots in a specific place on the apple. But when you ask where they are, I'll decide to put them in a line on the apple. If you ask what color they are, I'll have to decide.
jvanderbot · 27m ago
They may not answer, but what they'll realize is that the "placing" comes consciously after the "thinking of", which is not how it happens for others.
That is, they have to ascribe a placement rather than describe one in the image their mind conjured up.
wrs · 26m ago
There's no apple, much less any dots. Of course, I'm happy to draw you an apple on a piece of paper, and draw some dots on that, then tell you where those are.
aaronblohowiak · 31m ago
Oh, just close your eyes and imagine an apple for a few moments, then open your eyes, look at the Wikipedia article about aphantasia, and pick the one that best fits the level of detail you imagined.
Revisional_Sin · 1h ago
Aphantasia gang!
m3kw9 · 1h ago
To be fair, the model's ability came from us generating the training data.
quantummagic · 43m ago
To be fair, we're the beneficiaries of nature generating the data we trained on ourselves. Our ability came from being exposed to training in school, in the world, and from examples across all of human history. I.e., if you locked a child in a dark room for their entire life and gave them no education or social interaction, they wouldn't have a very impressive imagination or artistic ability either.
We're reliant on training data too.
micromacrofoot · 55m ago
it can only do this because it's been trained on millions of human works
vunderba · 1h ago
Nano-Banana can produce some astonishing results. I maintain a comparison website for state-of-the-art image models with a very high focus on adherence across a wide variety of text-to-image prompts.
I recently finished putting together an Editing Comparison Showdown counterpart where the focus is still adherence but testing the ability to make localized edits of existing images using pure text prompts. It's currently comparing 6 multimodal models including Nano-Banana, Kontext Max, Qwen 20b, etc.
https://genai-showdown.specr.net/image-editing
Gemini Flash 2.5 leads with a score of 7 out of 12, but Kontext comes in at 5 out of 12 which is especially surprising considering you can run the Dev model of it locally.
No comments yet
namibj · 3m ago
After looking at Cases 4, 9, 23, 33, and 61, I think it might be well suited to taking in several wide-angle pictures or photospheres from inside a residence and outputting a corresponding floor plan schematic.
If anyone has examples, guides, or anything that would save me from pouring unnecessary funds into API credits just to figure out how to feed it for this kind of task, I'd really appreciate you sharing them.
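For what it's worth, here is a rough, untested sketch of how one might feed several interior photos to the model in a single request, using the google-genai Python SDK; the model name, file names, and prompt wording are illustrative assumptions, not something from this thread.

    # Sketch: several wide-angle interior shots in, one floor-plan image out.
    from io import BytesIO

    from google import genai
    from PIL import Image

    client = genai.Client()  # picks up the API key from the environment

    rooms = [Image.open(p) for p in ["kitchen.jpg", "living_room.jpg", "hallway.jpg"]]

    response = client.models.generate_content(
        model="gemini-2.5-flash-image-preview",
        contents=rooms + [
            "These photos are all from one apartment. Draw a single top-down "
            "floor-plan schematic consistent with all of them, with rooms labeled."
        ],
    )

    # Generated images come back as inline bytes in the response parts.
    for part in response.candidates[0].content.parts:
        if part.inline_data is not None:
            Image.open(BytesIO(part.inline_data.data)).save("floor_plan.png")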
xnx · 1h ago
Amazing model. The only limit is your imagination, and it's only $0.04/image.
Yep, Google actually recommends using Imagen 4 / Imagen 4 Ultra for straight image generation. In spite of that, Flash 2.5 still scored shockingly high on my text-to-image comparisons, though image fidelity is obviously not as good as the dedicated text-to-image models.
It came within striking distance of OpenAI's gpt-image-1, at only one point less.
minimaxir · 1h ago
Since the page doesn't mention it, this is the Google Gemini Image Generation model: https://ai.google.dev/gemini-api/docs/image-generation
Good collection of examples. Really weird to choose an inappropriate-for-work one as the second example.
vunderba · 1h ago
They're referring to Case 1 Illustration to Figure, the anime figurine dressed in a maid outfit in the HN post.
pdpi · 1h ago
I assume OP means the actual post.
The second example under "Case 1: Illustration to Figure" is a panty shot.
darkamaul · 1h ago
This is amazing. Not that long ago, even getting a model to reliably output the same character multiple times was a real challenge. Now we’re seeing this level of composition and consistency. The pace of progress in generative models is wild.
Huge thanks to the author (and the many contributors) as well for gathering so many examples; it’s incredibly useful to see them to better understand the possibilities of the tool.
Through that testing, there is one prompt engineering trend that was consistent but controversial: both a) LLM-style prompt engineering with Markdown-formatted lists and b) old-school AI image-style quality syntactic sugar such as "award-winning" and "DSLR camera" are extremely effective with Gemini 2.5 Flash Image, due to its text encoder and larger training dataset, which can now more accurately discriminate which specific image traits are present in an award-winning image and which aren't. I've tried generations both with and without those tricks, and the tricks definitely have an impact. Google's developer documentation encourages the latter.
However, taking advantage of the 32k context window (compared to 512 for most other models) can make things interesting. It's possible to render HTML as an image (https://github.com/minimaxir/gemimg/blob/main/docs/notebooks...) and providing highly nuanced JSON can allow for consistent generations. (https://github.com/minimaxir/gemimg/blob/main/docs/notebooks...)
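As a concrete, hypothetical illustration of those two tricks (a Markdown-style list plus the old quality keywords), here is roughly what such a prompt looks like when sent through the google-genai Python SDK; the prompt text and scene are made up for the example.

    from google import genai

    client = genai.Client()

    prompt = """Generate an image of a cozy reading nook.

    - Style: award-winning editorial photograph, DSLR, 85mm lens, soft window light
    - Subject: worn leather armchair, stack of hardcover books, steaming mug
    - Composition: rule of thirds, shallow depth of field
    """

    response = client.models.generate_content(
        model="gemini-2.5-flash-image-preview",
        contents=prompt,
    )
    # The image arrives as inline bytes in response.candidates[0].content.parts,
    # the same as any other native image generation call.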
This is the first time I really don't understand how people are getting good results. On https://aistudio.google.com with Nano Banana selected (gemini-2.5-flash-image-preview) I get - garbage - results. I'll upload a character reference photo and a scene and ask Gemini to place the character in the scene. What it then does is to simply cut and paste the character into the scene, even if they are completely different in style, colours, etc.
I get far better results using ChatGPT for example. Of course, the character seldom looks anything like the reference, but it looks better than what I could do in paint in two minutes.
Am I using the wrong model, somehow??
SweetSoftPillow · 35m ago
Play around with your prompt, try asking Gemini 2.5 Pro to improve your prompt before sending it to Gemini 2.5 Flash, retry, and learn what works and what doesn't.
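A minimal sketch of that two-step loop with the google-genai Python SDK, assuming the public gemini-2.5-pro and gemini-2.5-flash-image-preview model IDs; the rough idea text is just an example.

    from google import genai

    client = genai.Client()

    rough_idea = "put the person from photo 1 into the scene from photo 2, matching its style"

    # Step 1: have a text model expand the rough idea into a detailed image prompt.
    rewrite = client.models.generate_content(
        model="gemini-2.5-pro",
        contents="Rewrite this as a detailed, unambiguous image-editing prompt: " + rough_idea,
    )

    # Step 2: send the improved prompt (plus any reference images) to the image model.
    result = client.models.generate_content(
        model="gemini-2.5-flash-image-preview",
        contents=[rewrite.text],  # append PIL images of the character and scene here
    )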
epolanski · 37m ago
+1
I understand the results are non-deterministic, but I get absolute garbage too.
I uploaded pics of my (32-year-old) wife and asked it to give her a fringe/bangs to see how she would look. It either refused "because of safety", or when it complied the results were horrible; it was a different person.
After many days and tries we got it to make one, but there was no way to tweak the fringe; the model kept returning the same pic every time (with plenty of "content blocked" in between).
SweetSoftPillow · 29m ago
Are you in gemini.google.com interface? If so, try Google AI Studio instead, there you can disable safety filters.
istjohn · 45m ago
Personally, I'm underwhelmed by this model. I feel like these examples are cherry-picked. Here are some fails I've had:
- Given a face shot in direct sunlight with severe shadows, it would not remove the shadows
- Given an old black and white photo, it would not render the image in vibrant color as if taken with a modern DSLR camera. It will colorize the photo, but only with washed out, tinted colors
- When trying to reproduce the 3x3 grid of hair styles, it repeatedly created a 2x3 grid. Finally, it made a 3x3 grid, but one of the nine models was Black instead of Caucasian.
- It is unable to integrate real images into fabricated imagery. For example, when given an image of a tutu and asked to create an image of a dolphin flying over clouds wearing the tutu, the result looks like a crude photoshop snip and copy/paste job.
foofoo12 · 17m ago
> I feel like these examples are cherry-picked
I don't know of a demo, image, film, project or whatever where the showoff pieces are not cherry picked.
mustaphah · 19m ago
In a side-by-side comparison with GPT-4o [1], they are pretty much on par.
[1] https://github.com/JimmyLv/awesome-nano-banana
I have two friends who are excellent professional graphic artists and I hesitate to send them this.
SweetSoftPillow · 1h ago
Better they learn it today than tomorrow, even though it might be painful for those who don't like learning new tools and exploring new horizons.
mitthrowaway2 · 1h ago
Maybe they're better off switching careers? At some point, your customers aren't going to pay you very much to do something that they've become able to do themselves.
There used to be a job people would do, where they'd go around in the morning and wake people up so they could get to work on time. They were called "knocker-ups". When the alarm clock was invented, these people didn't lose their jobs to other knockers-up with alarm clocks; they lost their jobs to the alarm clocks themselves.
non_aligned · 1h ago
A lot of technological progress is about moving in the other direction: taking things you can do yourself and having others do it instead.
You can paint your own walls or fix your own plumbing, but people pay others instead. You can cook your food, but you order take-out. It's not hard to sew your own clothes, but...
So no, I don't think it's as simple as that. A lot of people will not want the mental burden of learning a new tool and will have no problem paying someone else to do it. The main thing is that the price structure will change. You won't be able to charge $1,000 for a project that takes you a couple of days. Instead, you will need to charge $20 for stuff you can crank out in 20 minutes with gen AI.
GMoromisato · 1h ago
I agree with this. And it's not just about saving time/effort--an artist with an AI tool will always create better images than an amateur, just as an artist with a camera will always produce a better picture than me.
That said, I'm pretty sure the market for professional photographers shrank after the digital camera revolution.
AstroBen · 1h ago
I don't know if "learning this tool" is gunna help..
frfl · 1h ago
While these are incredibly good, it's sad to think about the unfathomable amount of abuse, spam, disinformation, manipulation, and who knows what other negatives these advancements are gonna cause. It was one thing when you could spot an AI image, but now and moving forward it'll be increasingly futile to even try.
Almost all "human" interaction online will be subject to doubt soon enough.
Hard to be cheerful when technology will be a net negative overall even if it benefits some.
signatoremo · 1h ago
By your logic email is clearly a net negative, given how much junk it generates - spam, phishing, hate mail, etc. Most of my emails at this point are spam.
frfl · 54m ago
If we're talking objectively, yeah by definition if it's a net negative, it's a net negative. But we can both agree in absolute terms the negatives of email are manageable.
Hopefully you understand the sentiment of my original message without getting into the semantics. AI advancements, like email when it arrived, are gonna turbocharge the negatives. The difference is the magnitude of the problem. We're dealing with a whole different scale, one we have never seen before.
Re: "Most of my emails at this point are spam" - 99% of my emails are not spam. Yet AI spam is everywhere else I look online.
No comments yet
flysonic10 · 50m ago
I added some of these examples into my Nanna Banana image generator: https://nannabanana.ai
stoobs · 28m ago
I'm pretty sure these are cherry-picked out of many generation attempts. I tried a few basic things and it flat out refused to do many of them, like turning a cartoon illustration into a real-world photographic portrait; it kept wanting to create a Pixar-style image. Then when I used an AI-generated portrait as an example, it refused with an error saying it wouldn't modify real-world people...
I then tried to generate some multi-angle product shots from a single photo of an object, and it just refused to do the whole left, right, front, back thing, and kept doing things like a left, a front, another left, and a weird half-back/half-side view combination.
Very frustrating.
SweetSoftPillow · 26m ago
Are you in gemini.google.com interface? If so, try Google AI Studio instead, there you can disable safety filters.
stoobs · 12m ago
I'm in AI Studio, and weirdly I get no safety settings.
I had them before when I was trying this and yes, I had them turned off.
destel · 1h ago
Some examples are mind-blowing. I wonder if it can generate web/app designs.
AstroBen · 1h ago
I just tried it for an app I'm working on.. very bad results
eig · 1h ago
While I think most of the examples are incredible...
...the technical graphics (especially text) are generally wrong. Case 16 is an annotated heart and the anatomy is nonsensical. Case 28 with the tallest buildings has decent images, but has the wrong names, locations, and years.
vunderba · 1h ago
Yeah I think some of them are really more proof of concept than anything.
Case 8 Substitute for ControlNet
The two characters in the final image are VERY obviously not in the instructed set of poses.
SweetSoftPillow · 1h ago
Yes, it's a Gemini Flash model, meaning it's fast and relatively small and cheap, optimized for performance rather than quality. I would not expect mind-blowing capabilities in fine details from this class of models, but still, even in this regard this model is sometimes just surprisingly good.
AstroBen · 1h ago
The #1 most frustrating part of image models to me has always been their inability to keep the relevant details. Ask to change a hairstyle and you'd get a subtly different person.
Guess that's solved now... overnight. Mindblowing.
The ability to pretty accurately keep the same image from an input is a clear sign of its improved abilities.
moralestapia · 1h ago
Wow, just amazing.
Is this model open? Open weights at least? Can you use it commercially?
SweetSoftPillow · 1h ago
This is Google's Gemini Flash 2.5 model with native image output capability. It's fast, relatively cheap, SOTA quality, and available via API.
I think getting this kind of quality in open-source models will take some time, probably first from Chinese models and then from BlackForestLabs or Google's open-source (Gemma) team.
vunderba · 1h ago
Outside of Google Deepmind open sourcing the code and weights of AlphaFold, I don't think they've released any of their GenAI stuff (Imagen, Gemini, Flash 2.5, etc).
The best multimodal models that you can run locally right now are probably Qwen-Edit 20b and Kontext.Dev.
https://qwenlm.github.io/blog/qwen-image-edit
https://bfl.ai/blog/flux-1-kontext-dev
Flux Kontext has similar quality, is open weight, and the outputs can be used commercially; however, prompt adherence is good-but-not-as-good.
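If anyone wants to try the local route, a minimal sketch with Hugging Face diffusers might look like the following, assuming the FluxKontextPipeline class and the gated black-forest-labs/FLUX.1-Kontext-dev weights; the file names and edit prompt are placeholders.

    import torch
    from diffusers import FluxKontextPipeline
    from diffusers.utils import load_image

    # Requires accepting the FLUX.1 Kontext dev license on Hugging Face first.
    pipe = FluxKontextPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
    ).to("cuda")

    source = load_image("input.png")
    edited = pipe(
        image=source,
        prompt="Change the jacket to bright red, keep everything else identical",
        guidance_scale=2.5,
    ).images[0]
    edited.save("edited.png")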
ChrisArchitect · 53m ago
sigh
So many little details are off even when the instructions are clear and/or the details are there. Brad Pitt jeans? The results are not the same style and are missing clear details that you'd expect to just translate over.
Another one where the prompt ended with "output in a 16:9 ratio". The image isn't in that ratio.
The results are visually something but then still need so much review. Can't trust the model. Can't trust people lazily using it. Someone mentioned something about 'net negative'.
istjohn · 36m ago
Yes, almost all of the examples are off in one way or another. The viewpoints don't actually match the arrow directions, for example. And if you actually use the model, you will see that even these examples must be cherry-picked.