Create and edit images with Gemini 2.0 in preview
211 points by meetpateltech 5/7/2025, 4:06:44 PM | 92 comments | developers.googleblog.com
https://genai-showdown.specr.net
I don't know how much of Google's original Imagen 3.0 is incorporated into this new model, but the overall aesthetic quality unfortunately seems significantly worse.
The big "wins" are:
- Multimodal aspect in trying to keep parity with OpenAI's offerings.
- An order of magnitude faster than OpenAI 4o image gen
I hope that we get an open weights multimodal image gen model. I'm slightly concerned that if these things take tens to hundreds of millions of dollars to train, that only Google and OpenAI will provide them.
That said, the one weakness in multimodal models is that they don't let you structure the outputs yet. Multimodal + ControlNets would fix that, and that would be like literally painting with the mind.
The future, when these models are deeply refined and perfected, is going to be wild.
If that happens, then I'm sure we'll see slimmer multimodal models over the course of the next year or so. And that teams like Black Forest Labs will make more focused and performant multimodal variants.
We need the incredible instructability of multimodality. That's without question. But we also need to be able to fine-tune, use ControlNets to guide diffusion, and compose these into workflows.
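For anyone who hasn't played with it, this is roughly what "ControlNets to guide diffusion" looks like with open models today, via Hugging Face diffusers and a Canny-edge ControlNet. A minimal sketch only; the model IDs and the local edge-map file are placeholders:

    import torch
    from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
    from diffusers.utils import load_image

    # Load a ControlNet trained on Canny edges, then attach it to a Stable Diffusion pipeline.
    controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
    ).to("cuda")
    pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

    # The edge map pins down the composition; the prompt only controls style and content.
    edges = load_image("my_edge_map.png")  # placeholder: any Canny edge image
    image = pipe("a cozy reading nook, warm evening light", image=edges, num_inference_steps=30).images[0]
    image.save("out.png")

Compose a few of these (pose, depth, edges) and you get exactly the kind of structured control the multimodal models don't expose yet.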
If you're looking for other suggestions, a summary table showing which models are ahead would be great.
Wrt the summary table, did you have a different metric in mind? The top of the display should already be showing a "Model Performance" chart with OpenAI 4 and Google Imagen 3 leading the pack.
> The top of the display should already be showing a "Model Performance" chart
I guess I missed this earlier!
I've been thinking about possibly rerunning the Flux Dev prompts using the 1.1 Pro but I liked having a base reference for images that can be generated on consumer hardware.
Really now?
For example, asking the models to show clocks set to a specific time or people drawing with their left hand. I think most, if not all models, will likely display every clock with the same time...And portray subjects drawing with their right hand.
Thanks for the suggestions. Most of the current prompts are a result of personal images that I wanted to generate, so I'll try to add some "classic GenAI failure modes". Musical instruments such as pianos also used to be a pretty big failure point.
I should also add an image that is heavy with "greebles". GenAI usually lacks the fidelity for these kinds of minor details, so although it adds them, they tend to fall apart under more than a cursory examination.
https://en.wikipedia.org/wiki/Greeble
I'm sure part of this is a lack of imagination on my part about how to describe the vague image in my own head. But I guess I have a lot of doubts about using a conversational interface for this kind of stuff
Sometimes sketching it could be helpful, but more abstract technical things like LUTs still feel out of reach.
This is more related to our ability to articulate than is easy to demonstrate, in my experience. I can certainly produce images in my head I have difficulty reproducing well and consistently via linguistic description.
As some have mentioned, LLMs are treasure troves of information for learning how to prompt the LLM. One thing to get over is a fear of embarrassment in what you say to the LLM. Just write a stream of consciousness to the LLM about what you want and ask it to generate a prompt based on that. "I have an image that I am trying to get an image LLM to add some clutter to. But when I ask it to do it, like I say add some stack of paper and notebooks, but it doesn't look like I want because they are neat stacks of paper. What I want is a desk that kind of looks like it has been worked at for a while by a typical office worker, like at the end of the day with a half empty coffee cup and .... ". Just ramble away and then ask the LLM to give you the best prompt. And if it doesn't work, literally go back to the same message chain and say "I tried that prompt and it was [better|worse] than before because ...".
This is one of those opportunities where life is giving you an option: give up or learn. Choose wisely.
At 4c per image that's more than a dollar on that single prompt.
I built this quick tool https://tools.simonwillison.net/gemini-image-json for pasting that JSON into, to see it rendered.
source: https://ai.google.dev/gemini-api/docs/models#gemini-2.0-flas... and my Google AI Studio
https://aistudio.google.com/apps/bundled/gemini-co-drawing?s...
The lamp is put on a different desk in a totally different room, with AI mush in the foreground. Props for not cherry-picking a first example, I guess. The sofa colour one is somehow much better, with a less specific instruction.
The main difference is that Gemini allows incorporating a conversation to generate the image, as demoed here, while Imagen 3 is strictly text-in/image-out with optional mask-constrained edits, though Imagen 3 likely allows for higher-quality images overall if you're skilled with prompt engineering. It's an annoying nuance to have to differentiate.
What makes you say that?
You can see the full table with images here: https://tabulator-ai.notion.site/1df2066c65b580e9ad76dbd12ae...
I think the results came out quite well. Be aware I don't generate a text prompt based on row data for image generation. Instead, the raw row data (ingredients, instructions...) and table metadata (column names and descriptions) are sent directly to gemini-2.0-flash-exp-image-generation.
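For anyone curious, the call is roughly this with the google-genai Python SDK. Just a sketch, not the actual Tabulator code; the row/metadata fields and key handling here are made up:

    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_API_KEY")

    # Raw row data and table metadata go in as-is; no intermediate text prompt is built.
    row = {"ingredients": "2 eggs, 100 g flour, 200 ml milk", "instructions": "Whisk, rest 10 min, fry thin."}
    metadata = {"columns": {"ingredients": "comma-separated list", "instructions": "free-text steps"}}

    response = client.models.generate_content(
        model="gemini-2.0-flash-exp-image-generation",
        contents=f"Table metadata: {metadata}\nRow: {row}",
        config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
    )

    # The model returns a mix of text and image parts; save the first image it produces.
    for part in response.candidates[0].content.parts:
        if part.inline_data is not None:
            with open("row_image.png", "wb") as f:
                f.write(part.inline_data.data)
            break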
There are a lot of failure modes still, but what I want is a very large cookbook showing what the known-good workflows are. Since this is so directly downstream of (limited) training data, it might be that I am just prompting in an ever so slightly bad way.
Very well done!
Seems to help if you explicitly describe the scene, but then the drawing-along aspect seems relatively pointless.
It seems like the real goal here, for Google and other AI companies, is a world flooded with endless AI-generated variants of objects that don’t even exist yet, crafted to be sold and marketed (probably by AI too) to hyper-targeted audiences. This feels like an incoming wave of "AI slop", mass-produced synthetic content, crashing against the small island of genuine human craftsmanship and real, existing objects.
I get that they are trying to find some practical use cases for their tools. But there's no enlightenment in the product development here.
If this is already the part of the s-curve where these AI tools get diminishing returns...what a waste of everybody's time.
It should be illegal in my view.
Maybe this is why all of the future AI fiction has people dressed in the same bland clothing.
If, for example, you use ControlNets, you can get very close to the style composition you need with an open model like Flux, which will be far better. Flux also has a few successors coming up now.
I take an image with some desired colors or typography from an already existing music album or from Ideogram's poster section. I pass it to gemini and give the command:
"describe the texture of the picture, all the element and their position in the picture, left side, center right side, up and down, the color using rgb, the artistic style and the calligraphy or font of the letters"
Then I take the result and pass it through a different LLM, because I don't like Gemini that much; I find it much less coherent than other models. I usually use qwen-qwq-32b, so I take the description Gemini outputs and give it to Qwen:
" write a similar description, but this time i want a surreal painting with several imaginative colors. Follow the example of image description, add several new and beautiful shapes of all elements and give all details, every side which brushstrokes it uses, and rgb colors it uses, the color palette of the elements of the page, i want it to be a pastel painting like the example, and don't put bioluminesence. I want it to be old style retro style mystery sci fi. Also i want to have a title of "Song Title" and describe the artistic font it uses and it's position in the painting, it should be designed as a drum n bass album cover "*
Then i take the result and give it back to gemini with command: "Create an image with text "Song Title" for an album cover: here is the description of the rest of the album"
If the resulting image is good, then it is time to add the font: I take the new image description and pass it through Qwen again, supposing the image description has the fields Title and Typography:
"rewrite the description and add full description of the letters and font of text, clean or distressed, jagged or fluid letters or any other property they might have, where they are overlayed, and make some new patterns about the letter appearance and how big they are and the material they are made of, rewrite the Title and Typography."
I replace the previous description's section Title and Typography with the new description and create images with beautiful fonts.
[1] https://imgur.com/a/8TCUJ75
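If anyone wants to script this loop instead of copy-pasting between tabs, it's roughly the following. A sketch only: the Qwen endpoint is whatever OpenAI-compatible host you use for qwen-qwq-32b, and the prompts are abbreviated versions of the ones above:

    from PIL import Image
    from google import genai
    from google.genai import types
    from openai import OpenAI

    gem = genai.Client(api_key="GEMINI_KEY")
    qwen = OpenAI(base_url="https://your-qwen-host/v1", api_key="QWEN_KEY")  # placeholder host

    # 1) Gemini describes the reference cover: texture, element positions, RGB colors, typography.
    reference = Image.open("reference_cover.png")
    described = gem.models.generate_content(
        model="gemini-2.0-flash",
        contents=[reference, "describe the texture of the picture, all the elements and their position, ..."],
    ).text

    # 2) Qwen rewrites that description into the new surreal / retro sci-fi brief.
    brief = qwen.chat.completions.create(
        model="qwen-qwq-32b",
        messages=[{"role": "user", "content": "write a similar description, but this time ...\n\n" + described}],
    ).choices[0].message.content

    # 3) The rewritten brief goes back to Gemini's image model.
    cover = gem.models.generate_content(
        model="gemini-2.0-flash-exp-image-generation",
        contents='Create an image with text "Song Title" for an album cover: ' + brief,
        config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
    )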
Now I can use:
- Gemini 2.0 Flash Image Generation Preview (May) instead of Gemini 2.0 Flash Image Generation Preview (March)
- or when I need text, Gemini 2.5 Flash Thinking 04-17 Preview ("natively multimodal" w/o image generation)
- When I need to control thinking budgets, I can do that with Gemini 2.5 Flash Preview 04-17, with not-thinking at a 50% price increase over a month prior
- And when I need realtime, fallback to Gemini 2.0 Flash 001 Live Preview (announced as In Preview on April 9 2025 after the Multimodal Live API was announced as released on December 11 2024)
- I can't control Gemini 2.5 Pro Experimental/Preview/Preview IO Edition's thinking budgets, but good news follows in the next bullet: they'll swap the model out underneath me with one that thinks ~10x less, so at least it's in the same cost ballpark as their competitors
- and we all got autoupgraded from Gemini 2.5 Pro Preview (03/25 released 4/2) to Gemini 2.5 Pro Preview (IO Edition) yesterday! Yay!
Which is probably what makes me so cranky here. It's very hard keeping track of all of it and doing my best to lever up the models that are behind Claude's agentic capabilities, and all the Newspeak of Google PR makes it consume almost as much energy as the rest of the providers combined. (I'm v frustrated that I didn't realize till yesterday that 2.0 Flash had quietly gone from 10 RPM to 'you can actually use it')
I'm a Xoogler and I get why this happens ("preview" is a magic wand that means "you don't have to get everyone in bureaucracy across DeepMind/Cloud/? to agree to get this done and fill out their damn launchcal"), but, man.
Btw, still not as good as ChatGPT but much, much faster; it's nice progress compared to the previous model.
Is it just me or is the market just absolutely terrible at understanding the implications and speed of progress behind what's happening right now in the walls of big G?
https://www.bloomberg.com/news/articles/2025-05-07/apple-wor...
Although AI is fun and great, an AI search engine may have trouble being profitable. It's similar to how 23andMe got many customers by selling a $500 test to people for $100.
So Apple may not be making their own, but they won't be spending billions either. I'm wondering how people will be able to monetize the searches so that they make money.
I am not sure why people think OpenAI et al are going to eat Google's lunch here. Seems like they're already doing AI-for-search and if there is anyone who can do it cheaply and at scale I bet on Google being the ones to do it (with all their data centers, data integrations/crawlers, and custom hardware and experience etc). I doubt some startup using the Bing-index and renting off-the-shelf Nvidia hardware using investor-funds is going to leapfrog Google-scale infrastructure and expertise.
LLMs are insanely competitive and a dime a dozen now. Most professional uses can get away with local models.
This is image generation... Niche cases in another saturated market.
How are any of these supposed to make google billions of dollars?
> Okay, I understand. You want me to replace ONLY the four windows located underneath the arched openings on the right side of the house with bifold doors, leaving all other features of the house unchanged. Here is the edited image:
Followed by no image. This is a behaviour I have seen many times from Gemini in the past, so it's frustrating that it's still a problem.
I give this a 0/10 for my first use case.
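One way to catch that failure programmatically is to treat a response with no inline image parts as an error and retry. A sketch assuming the google-genai Python SDK; the prompt and file names are placeholders:

    from PIL import Image
    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_API_KEY")
    house = Image.open("house.jpg")  # placeholder input photo

    def image_parts(resp):
        # Image output comes back as inline_data parts alongside any explanatory text.
        return [p for p in resp.candidates[0].content.parts if p.inline_data is not None]

    for attempt in range(3):
        resp = client.models.generate_content(
            model="gemini-2.0-flash-exp-image-generation",
            contents=[house, "Replace only the four windows under the arched openings on the right with bifold doors."],
            config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
        )
        if image_parts(resp):
            break  # got an edited image back, not just a description of one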