Strix Halo Local AI Workstation: Real-World Performance Test
When AMD announced the Ryzen AI Max+ 395, the chip the industry now calls Strix Halo, I treated it the same way I treat any new piece of hardware that promises to change local AI. I was skeptical. The marketing talked about up to 128GB of unified memory on a single APU, which on paper sounds like the answer everyone running large models has been waiting for. But marketing slides do not run inference. So I put one on my desk next to a Mac Studio and a 4090 box and ran the workloads I actually care about. This is the honest report.
Why does Strix Halo matter for local AI?
The bottleneck for running large language models on consumer hardware has never really been compute. It has always been memory. If a model does not fit in the memory your GPU can address, you either drop to a smaller model or you accept paging penalties that destroy throughput. I have written about this problem before in my breakdown of VRAM requirements for local AI coding, and the math has not changed. A 70B model in 4 bit quantization wants roughly 40GB of fast memory. A 32B coding model wants around 20GB. A 4090 gives you 24GB. A 5090 gives you 32GB. Anything beyond that, in the consumer Nvidia world, means stacking cards or paying workstation prices.
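If you want to check that arithmetic yourself, here is the back-of-the-envelope version I run in my head. The bits-per-weight figures are rough averages for common GGUF quant families, not exact numbers for any specific file, and KV cache and runtime buffers come on top:

```python
# Back-of-the-envelope memory estimate for quantized model weights.
# Bits-per-weight values are rough averages for common GGUF quant families;
# KV cache and runtime buffers add several more GB on top of these numbers.

BITS_PER_WEIGHT = {"Q4": 4.5, "Q8": 8.5, "F16": 16.0}

def weight_memory_gb(params_billion: float, quant: str) -> float:
    return params_billion * BITS_PER_WEIGHT[quant] / 8

for params, quant in [(70, "Q4"), (32, "Q4"), (70, "Q8")]:
    print(f"{params}B {quant}: ~{weight_memory_gb(params, quant):.0f} GB of weights")

# ~39 GB for 70B Q4, ~18 GB for 32B Q4, ~74 GB for 70B Q8 -- which is why
# 24GB and 32GB cards hit a wall and a 96GB unified pool does not.
```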
Strix Halo sidesteps this by using a unified memory architecture, the same trick Apple uses on M series chips. The CPU and the integrated GPU share one pool. Configure the machine with 128GB and you can hand 96GB or more to the iGPU as effective VRAM. Suddenly a 70B model in Q8 fits without a second thought. That is the promise. The question is whether it works in practice.
How does Strix Halo compare to Mac Studio on real workloads?
I ran the same suite on three machines: a Strix Halo mini PC with 128GB unified memory, a Mac Studio with 96GB unified memory on M4 Max, and a desktop with a 4090 and 64GB of system RAM. I tested a 7B Mistral, a 14B Phi, a 32B Qwen coder, and a 70B Llama, all in common GGUF quantizations through llama.cpp.
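For anyone who wants to reproduce the shape of the test, here is a minimal sketch of the kind of harness I mean. It uses the llama-cpp-python bindings rather than the raw llama.cpp CLI, and the model filenames are placeholders, so treat it as a starting point rather than my exact setup:

```python
# Minimal tokens-per-second harness using the llama-cpp-python bindings.
# Model paths are placeholders; n_gpu_layers=-1 offloads every layer to the GPU,
# which only works when the whole model fits in GPU-addressable memory.
import time
from llama_cpp import Llama

MODELS = {
    "mistral-7b": "models/mistral-7b.Q4_K_M.gguf",
    "phi-14b": "models/phi-14b.Q4_K_M.gguf",
    "qwen-coder-32b": "models/qwen-coder-32b.Q4_K_M.gguf",
    "llama-70b": "models/llama-70b.Q4_K_M.gguf",
}
PROMPT = "Write a Python function that merges two sorted lists."

for name, path in MODELS.items():
    llm = Llama(model_path=path, n_gpu_layers=-1, n_ctx=4096, verbose=False)
    start = time.perf_counter()
    out = llm(PROMPT, max_tokens=256)
    elapsed = time.perf_counter() - start
    generated = out["usage"]["completion_tokens"]
    print(f"{name}: {generated / elapsed:.1f} tok/s (prompt processing included)")
    del llm  # release the model before loading the next one
```

The interesting comparison across the three machines starts before the speeds: which entries in that dictionary load cleanly at all is already half the story.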
The Mac Studio was the most predictable performer. Apple has been doing unified memory for years and the software stack reflects that. Metal kernels are mature, llama.cpp has been tuned aggressively for Apple Silicon, and the memory bandwidth on M4 Max is genuinely fast. On the 32B Qwen model I saw token generation in the high teens per second. On the 70B Llama I was looking at single digits, but it ran cleanly. This matches what I cover in my cost effective local LLM setup guide, and it is exactly what the Apple pitch promises. As I explained in the video that pairs with this post, Apple has a real technological advantage when it comes to memory architecture, and that shows up in benchmarks the moment you cross the 24GB ceiling.
Strix Halo, on smaller models, was the surprise. In the 7B and 14B range it was competitive with the Mac Studio, and occasionally faster on prompt processing because the AMD iGPU has more raw compute headroom at these sizes. The 32B Qwen ran fine and stayed responsive. The 70B Llama loaded without issue, which is the headline benefit, and produced usable output, though generation speed was lower than on the Mac Studio. The unified memory promise is real. The execution is not as polished yet.
One thing worth flagging: the small form factor Strix Halo machines I tested ran notably warmer under sustained load than the Mac Studio doing the same work. Apple’s chassis design and idle power story remain class leading, and if you plan to leave a model running as a background service all day on your desk, that quiet idle behavior actually matters more than peak benchmark numbers.
Where does the 4090 still win?
If your model fits in 24GB, nothing else here comes close to the 4090. It is not even the same conversation. On the 14B Phi model the 4090 was producing tokens at multiples of what either unified memory machine could do. CUDA is mature, the kernels are optimized down to the metal, and Nvidia has spent a decade making sure every popular inference engine treats their hardware as the reference target.
The catch is the moment your model exceeds 24GB. Then the 4090 either offloads layers to system RAM, which collapses throughput, or it simply cannot run the model at full quality. This is the wall I keep hitting with clients who bought a single 4090 and now want to run a 70B coder. They have to drop quantization aggressively, which hurts quality, or they have to look at multi GPU setups that get expensive fast. Strix Halo and Mac Studio do not have this wall. They have a slower lane, but the lane exists.
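To make that slower lane concrete, this is roughly what partial offload looks like with the llama-cpp-python bindings. The layer count and path are illustrative, not tuned values:

```python
# Partial offload on a 24GB card: push as many layers as fit onto the GPU,
# run the rest on the CPU. Every generated token now waits on the CPU layers,
# which is why throughput collapses.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-70b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=40,  # illustrative: a 70B Llama has 80 layers, only some fit in 24GB
    n_ctx=4096,
    verbose=False,
)
print(llm("Explain what a B-tree is.", max_tokens=128)["choices"][0]["text"])
```

The more layers you push back to the CPU, the more every token waits on the slowest part of the pipeline. A unified memory machine keeps that number at zero and simply runs slower overall.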
If you want to understand why this quantization tradeoff matters so much, my piece on model quantization and local AI performance walks through it in detail.
What about the ROCm versus CUDA software gap?
This is the part of the Strix Halo story that does not show up in the benchmark charts but absolutely shows up in your week. ROCm, AMD’s answer to CUDA, has improved a lot. It is no longer the disaster it was three years ago. But it is still behind, and on a brand new chip like Strix Halo the gap is wider than usual.
Here is what that meant for me in practice. llama.cpp worked. Ollama worked. LM Studio worked for the GGUF formats I cared about. These cover the vast majority of what most people doing local inference actually need, and if your workflow looks like mine that is honestly enough. Where it got rough was anything outside that golden path. Some image generation pipelines that assume CUDA needed workarounds. A few research repos I tried simply did not run without patching. Fine tuning, in particular, is still much smoother on Nvidia and on Apple than it is on Strix Halo today.
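A quick way to see whether a machine's ROCm install is actually visible to the Python stack is the same check you would run on an Nvidia box, because the ROCm builds of PyTorch reuse the torch.cuda namespace. A minimal sketch:

```python
# Quick sanity check that the ROCm stack is visible to PyTorch.
# ROCm builds of PyTorch reuse the torch.cuda namespace, so the calls
# look identical to an Nvidia machine; torch.version.hip tells them apart.
import torch

print("device available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device name:", torch.cuda.get_device_name(0))
    print("hip version:", torch.version.hip)  # None on a CUDA build, set on ROCm
```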
If you are someone who only runs popular inference engines on standard quantized models, Strix Halo is fine right now. If you are someone who lives at the bleeding edge of new research code, you will fight the toolchain regularly. Be honest with yourself about which one you are before you spend the money.
Want hands-on local AI projects to test on your own hardware?
If you want to put any of these machines through their paces on something more interesting than a generic benchmark, I keep a set of practical local AI projects on the open source page. They cover the workflows I actually use day to day, from local coding assistants to RAG over personal notes, and they run on anything from a 16GB Mac Mini up to a maxed out workstation. Running the same project across different hardware tiers is actually the fastest way to build intuition for where each platform genuinely shines and where the marketing is hiding the rough edges.
Who should buy a Strix Halo machine?
After living with the chip for a while, here is my honest read. Strix Halo is the right pick for one specific person: someone who wants to run very large models locally, who values the lower price point relative to a maxed out Mac Studio or a multi GPU Nvidia rig, and who is comfortable when the software stack occasionally requires patience. If that is you, the value is genuinely strong. 128GB of memory accessible to a capable iGPU at this price point did not exist in the consumer market a year ago.
For most people starting out, I would still point them at a Mac Mini or an entry Mac Studio first. The software path is smoother, resale is better, and the machine is useful for plenty of things beyond inference. I made this same argument in the companion video, and the logic has not changed just because a new chip exists. You do not need to spend big to get started, and I broke down why in my piece on learning AI without expensive hardware.
For anyone whose models fit in 24GB and who values raw throughput above all else, a single 4090 or a 5090 is still the answer. CUDA dominance is not going away this year.
Strix Halo is not a Mac Studio killer and it is not a 4090 killer. It is a third option that did not exist before, and for the right buyer it is genuinely compelling. That is more than most new hardware launches deliver.
Where to go next
If you want to see the visual walkthrough where I show the actual tooling and the side by side numbers, the full breakdown is on YouTube here: https://www.youtube.com/watch?v=VGnw5Blcmm0
And if you want to go deeper on building real local AI systems with people who are doing the same work, come join us in the AI Engineer community: https://aiengineer.community/join