MLX is worth paying attention to. It's still pretty young (just over a year old) but the amount of activity in that ecosystem is really impressive, and it's quickly becoming the best way to run LLMs (and vision LLMs and increasingly audio models) on a Mac.

Here's a fun way to start interacting with it (this loads and runs Llama 3.2 3B in a terminal chat UI):

  uv run --isolated --with mlx-lm python -m mlx_lm.chat

masto · 59d ago

Ran it and it crapped out with a huge backtrace. I spotted `./build_bundled.sh: line 21: cmake: command not found` in it, so I guessed I needed cmake installed. Ran `brew install cmake` and tried again. Then it crapped out with `Compatibility with CMake < 3.5 has been removed from CMake.`. Then I gave up.

This is typical of what happens any time I try to run something written in Python. It may be easier than setting up an NVIDIA GPU, but that's a low bar.

H3X_K1TT3N · 59d ago

This is absolutely every experience I have with python.

simonw · 59d ago

Which Python version was that? Could be that MLX have binary wheels for some versions but not others.

masto · 59d ago

Adding `-p 3.12` to the `uv run` command made it work. Leaving that here in case it helps someone.

porridgeraisin · 59d ago

Aha, knew you wouldn't give up. Not what our kind do

hnfong · 59d ago

Never gonna give you up…

(Sorry, I’ll excuse myself now…)

jack_pp · 59d ago

for the record these problems don't really exist on linux in my experience

mobiuscog · 59d ago

Python problems exist on all platforms. It's just that most people using Python have figured out their 'happy path' workarounds in the past and keep using them.

Python is awesome in many ways, one of my favourite languages, but unless you are happy with venv manipulation (or live in Conda), it's often a nightmare that ends up worse than DLL-hell.

masto · 58d ago

Python is in a category of things you can't just use without being an expert in the minutiae. This is unfortunate because there are a lot of people who are not Python developers who would like to run programs which happen to be written in Python.

Python is by no means alone in this or particularly egregious. Having been a heavy Perl developer in the 2000s, I was part of the problem. I didn't understand why other people had so much trouble doing things that seemed simple to me, because I was eating, breathing, and sleeping Perl. I knew how to prolong the intervals between breaking my installation, and how to troubleshoot and repair it, but there was no reason why anyone who wanted to deploy, or even develop on, my code base should have needed that encyclopedic knowledge.

This is why, for all their faults, I count containers as the biggest revolution in the software industry, at least for us "backend" folks.

esafak · 59d ago

https://ml-explore.github.io/mlx/

marci · 59d ago

For those who never heard of those:

mlx is similar to numpy/pytorch, but only for Apple Silicon.

mlx-lm is a llama.cpp equivalent, but built on top of mlx.

https://github.com/ml-explore/mlx-lm

mathfailure · 59d ago

How much disk & RAM does it need?

What's your tokens/sec rate (and on which device)?

simonw · 59d ago

I've been running it on a 64GB M2. My favorite models to run tend to be about 20GB to download (eg Mistral Small 3.1) and use about 20GB of RAM while they are running.

I don't have a token/second figure to hand but it's fast enough that I'm not frustrated by it.

_bin_ · 59d ago

I wish Apple would spend some more time paying attention to jax-metal :) It still crashes on just a few lines of code, and it seems like an obvious need if Apple wants to be serious about enabling ML work on their new MBPs.

MLX looks really nice from the demo-level playing around with it I've done, but I usually stick to jax so, you know, I can actually deploy it on a server without trying to find someone who racks macs.

dkga · 59d ago

So, on an M4 I sometimes get faster training on plain vanilla jax compared to the same model in pytorch or tensorflow. And jax-metal often breaks :/

_bin_ · 59d ago

No kidding? Might switch to CPU then. And yeah, jax-metal is so utterly unreliable. I ran across an issue that, it turns out, reduces to a two-line repro, and it has been open on GitHub for the better part of a year without updates.

fsiefken · 59d ago

That's great. Like the Ryzen AI Max+ 395, Apple Silicon chips are also more energy efficient for LLMs (or gaming) than Nvidia.

For 4 bit deepseek-r1-distill-llama-70b on a MacBook Pro M4 Max with the MLX version on LM Studio: 10.2 tok/sec on power and 4.2 tok/sec on battery / low power

For 4 bit gemma-3-27b-it-qat I get: 26.37 tok/sec on power and 9.7 tok/sec on battery / low power

It'd be nice to know all the possible power tweaks to get the value higher and get additional insight on how LLMs work and interact with the CPU and memory.
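To put those figures side by side, here is a minimal Python sketch that computes how much of the plugged-in throughput each model keeps on battery / low power. The tok/sec numbers are copied from the comment above; the dictionary layout and the `battery_retention` helper are made up for illustration and are not part of MLX or LM Studio.

```python
# Tokens/sec figures quoted above (MacBook Pro M4 Max, MLX models in LM Studio).
# The structure and function name are illustrative assumptions, not a real API.
BENCHMARKS = {
    "deepseek-r1-distill-llama-70b (4-bit)": {"on_power": 10.2, "battery": 4.2},
    "gemma-3-27b-it-qat (4-bit)": {"on_power": 26.37, "battery": 9.7},
}


def battery_retention(on_power: float, battery: float) -> float:
    """Fraction of plugged-in throughput that remains on battery / low power."""
    if on_power <= 0:
        raise ValueError("on_power must be positive")
    return battery / on_power


def report() -> None:
    """Print one line per model with both rates and the retained fraction."""
    for name, rates in BENCHMARKS.items():
        kept = battery_retention(rates["on_power"], rates["battery"])
        print(f"{name}: {rates['on_power']} -> {rates['battery']} tok/sec "
              f"({kept:.0%} retained)")


if __name__ == "__main__":
    report()
```

With these numbers, both models keep somewhat under half of their plugged-in speed on battery, which suggests the low power mode is the dominant factor rather than the model size.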

robbru · 58d ago

Interesting benchmarks, thanks for sharing!

If you're optimizing for lower power draw + higher throughput on Mac (especially in MLX), definitely keep an eye on the Desloth LLMs that are starting to appear.

Desloth models are basically aggressively distilled and QAT-optimized versions of larger instruction models (think: 7B → 1.3B or 2B) designed specifically for high tokens/sec at minimal VRAM. They're tiny but surprisingly capable for structured outputs, fast completions, and lightweight agent pipelines.

I'm seeing Desloth-tier models consistently hit >50 tok/sec on M1/M2 hardware without needing active cooling ramps, especially when combined with low-bit quant like Q4_K_M or Q5_0.

If you care about runtime efficiency per watt + low-latency inference (vs. maximum capability), these newer Desloth styled architectures are going to be a serious unlock.

bigyabai · 59d ago

> Apple Silicon chips are also more energy efficient for LLMs (or gaming) than Nvidia.

Which benchmarks are you working off of, exactly? Unless your memory is bottlenecked, neither raster nor compute workloads on M4 are more energy efficient than Nvidia's 50-series silicon: https://browser.geekbench.com/opencl-benchmarks

fouc · 59d ago

NVIDIA GeForce RTX 5090 - 376224 - 400-550W for gpu + 250-500W for cpu/ram/cooling/etc.

Apple M3 Ultra - 131247 - 200W [1]

Looks like it might be 2.8x faster in the benchmark, but ends up using 3.25x more power at a minimum.

[1] https://www.tweaktown.com/news/103891/apples-new-m3-ultra-ru...
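The arithmetic behind that comparison can be checked with a short Python sketch. The scores and wattages are the ones quoted in the comment above (the wattages are the commenter's estimates, not measurements), and the `score_per_watt` helper and constant names are purely illustrative. Keep in mind this is a Geekbench OpenCL score, so it only approximates LLM inference efficiency.

```python
# Benchmark scores and wattages are the figures quoted in the comment above.
# The wattages are the commenter's estimates; all names here are illustrative.
RTX_5090_SCORE = 376_224
M3_ULTRA_SCORE = 131_247

RTX_5090_WATTS_MIN = 400 + 250   # GPU low end + rest-of-system low end
RTX_5090_WATTS_MAX = 550 + 500   # GPU high end + rest-of-system high end
M3_ULTRA_WATTS = 200


def score_per_watt(score: float, watts: float) -> float:
    """Benchmark points delivered per watt of estimated system power."""
    return score / watts


speedup = RTX_5090_SCORE / M3_ULTRA_SCORE
power_ratio_min = RTX_5090_WATTS_MIN / M3_ULTRA_WATTS
power_ratio_max = RTX_5090_WATTS_MAX / M3_ULTRA_WATTS

print(f"RTX 5090 speedup over M3 Ultra: {speedup:.2f}x")
print(f"Power ratio: {power_ratio_min:.2f}x to {power_ratio_max:.2f}x")
print(f"M3 Ultra: {score_per_watt(M3_ULTRA_SCORE, M3_ULTRA_WATTS):.0f} points/W")
print(f"RTX 5090 (best case): "
      f"{score_per_watt(RTX_5090_SCORE, RTX_5090_WATTS_MIN):.0f} points/W")
print(f"RTX 5090 (worst case): "
      f"{score_per_watt(RTX_5090_SCORE, RTX_5090_WATTS_MAX):.0f} points/W")
```

Under these estimates the M3 Ultra comes out ahead on points per watt even at the low end of the 5090's power range, and the gap widens at the high end.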

nico · 59d ago

Thank you for the numbers

What have you used those models for, and how would you rate them in those tasks?

realo · 59d ago

RPG prompts work very well with many of the models, but not the reasoning ones, because they end up thinking endlessly about how to be the absolute best game master possible...

nico · 59d ago

Great use case. And very funny situation with the reasoning models! :)

vlovich123 · 59d ago

How does mlx compare with the llama.cpp backend for LM Studio?

robbru · 59d ago

TinyLLM is very cool to see! I will def tinker with it. I've been using MLX format for local LLMs as of late. Kinda amazing to see these models become cheaper and faster. Check out the MLX community on HuggingFace. https://huggingface.co/mlx-community

nico · 59d ago

Great recommendation about the community

Any other resources like that you could share?

Also, what kind of models do you run with mlx and what do you use them for?

Lately I’ve been pretty happy with gemma3:12b for a wide range of things (generating stories, some light coding, image recognition). Sometimes I’ve been surprised by qwen2.5-coder:32b. And I’m really impressed by the speed and versatility, at such a tiny size, of qwen2.5:0.5b (I'm playing with fine-tuning it to see if I can get it to generate some decent conversations roleplaying as a character).

simonw · 59d ago

I've shared a bunch of notes on MLX over the past year, many of them with snippets of code I've used to try out models: https://simonwillison.net/tags/mlx/

I mainly use MLX for LLMs (with https://github.com/ml-explore/mlx-lm and my own https://github.com/simonw/llm-mlx which wraps that), vision LLMs (via https://github.com/Blaizzy/mlx-vlm) and running Whisper (https://github.com/ml-explore/mlx-examples/tree/main/whisper)

I haven't tried mlx-audio yet (which can synthesize speech) but it looks interesting too: https://github.com/Blaizzy/mlx-audio

The two best people to follow for MLX stuff are Apple's Awni Hannun - https://twitter.com/awnihannun and https://github.com/awni - and community member Prince Canuma who's responsible for both mlx-vlm and mlx-audio: https://twitter.com/Prince_Canuma and https://github.com/Blaizzy

robbru · 58d ago

Very cool insight, Simonw! I will check out the audio mlx stuff soon. I think that is kinda new still. Prince Canuma is the GOAT.

nico · 59d ago

Amazing. Thank you for the great resources!

robbru · 58d ago

Hey Nico,

Very cool to hear your perspective on how you are using the small LLMs! I’ve been experimenting extensively with local LLM stacks on:

• M1 Max (MLX native)

• LM Studio (GLM, MLX, GGUFs)

• llama.cpp (GGUFs)

• n8n for orchestration + automation (multi-stage LLM workflows)

My emerging use cases:

-Rapid narration scripting

-Roleplay agents with embedded prompt personas

-Reviewing image/video attachments + structuring copy for clarity

-Local RAG and eval pipelines

My current lineup of small LLMs (this changes every month depending on what is updated):

MLX-native models (mlx-community):

-Qwen2.5-VL-7B-Instruct-bf16 → excellent VQA and instruction following

-InternVL3-8B-3bit → fast, memory-light, solid for doc summarization

-GLM-Z1-9B-bf16 → reliable multilingual output + inference density

GGUF via LM Studio / llama.cpp:

-Gemma-3-12B-it-qat → well-aligned, solid for RP dialogue

-Qwen2.5-0.5B-MLX-4bit → blazing fast; chaining 2+ agents at once

-GLM-4-32B-0414-8bit (Cobra4687) → great for iterative copy drafts

Emerging / niche models tested:

-MedFound-7B-GGUF → early tests for narrative medicine tasks

-X-Ray_Alpha-mlx-8Bit → experimental story/dialogue hybrid

-llama-3.2-3B-storyteller-Q4_K_M → small, quick, capable of structured hooks

-PersonalityParty_saiga_fp32-i1 → RP grounding experiments (still rough)

I test most new LLMs on release. QAT models in particular are showing promise, balancing speed + fidelity for chained inference. The meta-trend: models are getting better, smaller, faster, especially for edge workflows.

Happy to swap notes if others are mixing MLX, GGUF, and RAG in low-latency pipelines.

nico · 58d ago

Impressive! Thank you for the amazing notes, I have a lot to learn and test

pj_mukh · 59d ago

Super cool, and will definitely check it out.

But as a measure of what you can achieve with a course like this: does anyone know what the max tok/s vs iPhone model plot looks like, and how MLX changes that plot?

gitroom · 59d ago

dang, i've been messing with mlx too and it's blowing my mind how quick this stuff is getting on macs. feels like something's changing every time i blink

robbru · 58d ago

I think people are sleeping on MLX and directing their attention to the "Apple Intelligence" marketing atm.

Tiny-LLM – a course of serving LLM on Apple Silicon for systems engineers

Comments (35)