Local AI Performance on Integrated Graphics with Vulkan Offload

When I first told people you can run capable language models on a laptop without an Nvidia GPU, the reaction was usually the same. They assumed you needed a desktop with a 3090 or 4090. That assumption is outdated. The integrated graphics chip already sitting inside your everyday laptop, whether that is Intel Iris Xe, AMD Radeon 780M, or an Apple M-series chip, can offload meaningful portions of inference work through the Vulkan backend in llama.cpp. The performance is not magical. It is real, and it changes who can practically run local AI.

In this guide I want to be canonical about what integrated GPU offload actually delivers in 2026. I will walk through how Vulkan offload works, what tokens per second to expect on common integrated GPUs, when this approach beats CPU only inference, and the honest cutoff where you need a discrete card.

What is Vulkan offload and why does it matter for local AI?

Vulkan is a cross-platform graphics and compute API. Most people know it as a gaming technology, but for local AI it serves a different role. The llama.cpp project, which has become the de facto runtime for running quantized large language models on consumer hardware, ships a Vulkan backend. That backend lets the same binary push tensor operations onto any GPU that exposes a working Vulkan driver. You do not need CUDA. You do not need ROCm. You do not need an Nvidia card.

This matters because integrated GPUs almost universally support Vulkan. Intel Iris Xe, AMD Radeon graphics built into Ryzen chips, and even Apple Silicon through MoltenVK all expose Vulkan compute. When llama.cpp offloads model layers to the Vulkan backend, those layers run on the iGPU instead of the CPU. The iGPU is generally slower than a discrete card, but it has two advantages over the CPU. It has more parallel compute units suited to matrix multiplication, and on shared-memory designs it can access the same RAM the CPU uses, which means no copying penalty.

For anyone trying to learn AI engineering without buying new hardware, this is the unlock. I covered the broader picture in my guide on how to learn AI without expensive hardware, but Vulkan offload is the specific technical lever that makes it work on the machine you already own.

How does integrated GPU performance compare to CPU only?

The honest answer depends on the chip generation and the model size, but the pattern is consistent. CPU-only inference on a modern eight core laptop with a 7B parameter quantized model usually lands somewhere between three and seven tokens per second. That is readable, barely. It feels slow when you are waiting for a code suggestion or a long answer. Pushing the same model through Vulkan offload on an Iris Xe or Radeon 780M typically gets you into the eight to fifteen tokens per second range. On an Apple M2 or M3 with the Metal backend, which is the Apple equivalent of the same idea, you can see twenty to forty tokens per second on the same model size.

The reason for the jump is simple. Matrix multiplication is embarrassingly parallel. A CPU has eight or sixteen cores doing this work serially per core. An iGPU has dozens of execution units doing it in parallel. Even when the iGPU has lower clock speeds and shares memory bandwidth with the CPU, it still wins for this workload. The shared memory architecture also means you avoid the pcie copy overhead that a low-end discrete card would incur for small batches.

Quantization is the other half of this story. None of these numbers are achievable with full precision weights. You need quantized models, typically Q4 or Q5 GGUF formats, which is why I wrote a deeper piece on model quantization as the key to faster local AI performance. Without quantization, the model will not fit in memory, let alone run at usable speeds.

What tokens per second should I expect on Intel Iris Xe?

Intel Iris Xe ships in most Intel laptops from the 11th generation onward. It is not a powerhouse. It has 80 to 96 execution units depending on the variant, and it shares system memory.

For a Q4 quantized 7B model like Mistral 7B or Llama 3.1 8B, a typical Iris Xe machine running llama.cpp with Vulkan offload will deliver eight to twelve tokens per second on prompt evaluation and six to ten tokens per second on generation. That is genuinely usable for chat, summarization, and code completion at small context sizes. Push the context window past 4,000 tokens and the numbers degrade quickly because memory bandwidth becomes the bottleneck.

For 3B class models, the picture is much better. Phi-3.5 Mini, the model I demoed in the video this article is based on, runs comfortably at fifteen to twenty-five tokens per second on Iris Xe. That is fast enough that you stop noticing the wait. For learning, prototyping retrieval augmented generation pipelines, and building local agents, a 3B class model on Iris Xe is a reasonable starting point.

The catch with Iris Xe is RAM. Because the iGPU shares system memory, you need at least 16 GB total to leave headroom for the OS, your editor, a browser, and the model. 8 GB machines technically work but you will swap constantly.

What about AMD Radeon 780M and the latest Ryzen iGPUs?

The Radeon 780M, found in Ryzen 7040 and 8040 series chips, is the strongest integrated GPU on the Windows side. It has twelve RDNA3 compute units and noticeably more raw throughput than Iris Xe.

Expect twelve to twenty tokens per second on Q4 7B models with Vulkan offload, and thirty to fifty tokens per second on 3B models. In practice this puts the 780M close to entry level discrete GPUs from a few years ago. For anyone running a Framework 13, a recent ThinkPad with a Ryzen AI chip, or a handheld like the ROG Ally, this is the sweet spot for local AI without any extra hardware.

One caveat. AMD’s Vulkan driver on Linux is generally faster and more stable than on Windows for compute workloads. I wrote about this in my comparison of Linux vs Windows VRAM usage for local AI, and the same pattern holds for iGPU offload. If you are serious about extracting performance from a Radeon iGPU, dual booting or using WSL2 is worth the friction.

How do Apple M-series chips compare?

Apple Silicon is in its own category. The M1, M2, M3, and M4 chips have unified memory and a GPU that, while integrated, is genuinely powerful. llama.cpp uses the Metal backend on Apple, which is technically not Vulkan, but the conceptual model is identical. Layers offload to the GPU, and the GPU shares memory with the CPU.

A base M2 with 16 GB of unified memory runs Q4 7B models at thirty to forty-five tokens per second. An M3 Pro can hit sixty to eighty. An M4 Max with 64 GB of unified memory can run 70B class models at four to eight tokens per second, which is something no integrated GPU on the Windows side can do at all because they cannot address enough memory.

If you are buying a laptop today specifically to learn AI engineering and you want maximum local inference per dollar, an M2 or M3 MacBook with 24 GB or more of unified memory is hard to beat. The unified memory architecture is the reason. You are not constrained by a tiny VRAM pool the way you are on consumer Nvidia laptops.

I want to be careful not to oversell this. Apple Silicon is great for inference. It is not great for training, and the software ecosystem still assumes CUDA in many places. For pure local inference and learning, though, it is the most capable integrated solution available.

If you want a curated set of starter projects that work on all of these platforms, including Docker compose files tuned for Vulkan and Metal backends, I keep them updated in my open source projects collection. They are the same setups I use when I am demoing local AI on whatever laptop I happen to have with me.

When does integrated GPU offload beat pure CPU inference?

Almost always, with one exception. If you cannot get a working Vulkan driver, or if you are on a server platform where the iGPU is disabled in BIOS, CPU only is your fallback. Otherwise, Vulkan offload to the iGPU is faster than CPU on every modern laptop chip I have tested.

The bigger question is whether the speedup is worth the complexity. For a casual user who just wants to chat with a local model occasionally, CPU only with a 3B model is fine. For anyone building actual applications, doing retrieval augmented generation, running agents that make multiple model calls per task, or iterating on prompts dozens of times per hour, the two to three times speedup from iGPU offload is the difference between a usable workflow and a frustrating one.

There is also a thermal angle. CPU only inference pegs every core at 100 percent and heats the laptop aggressively. iGPU offload spreads the load across the GPU compute units, which on most laptops have better sustained thermal headroom than the CPU package. Battery life also tends to be slightly better, though neither approach is what I would call efficient on battery.

When do I actually need a discrete GPU?

Three scenarios push you past what integrated graphics can handle.

First, model size. Once you want to run 13B class models or larger at usable speeds, integrated GPUs run out of memory bandwidth. A 13B Q4 model needs around 8 GB of working memory and benefits enormously from dedicated VRAM with high bandwidth. My VRAM requirements guide for local AI coding goes into the specifics, but the rough cutoff is that 7B and below works on iGPUs, 13B is borderline, and 30B and up effectively requires discrete VRAM.

Second, context length. Long context windows, anything past 16,000 tokens, hammer memory bandwidth. Integrated GPUs sharing system memory choke here. Discrete GPUs with GDDR6 or GDDR6X memory have several times the bandwidth and handle long context dramatically better.

Third, batch inference and serving. If you are running a model as a service for multiple users, or doing batched generation for evaluation, the throughput gap between integrated and discrete widens significantly. A single user chatting with a 7B model is fine on Iris Xe. Ten concurrent users is not.

For everything else, learning, prototyping, building personal tools, running coding assistants on small models, and even shipping internal apps to small teams, integrated graphics with Vulkan offload is enough. The gap between what is possible on a 1500 dollar laptop today and what required a 3000 dollar GPU two years ago is smaller than most people assume.

How do I get started with Vulkan offload on my laptop?

The shortest path is llama.cpp with the Vulkan backend, or one of the wrappers that bundles it. LocalAI, which I demoed in the source video for this post, uses llama.cpp under the hood and exposes an OpenAI-compatible API. Ollama is another option that has added Vulkan support in recent builds. LM Studio offers a graphical interface that makes backend selection and offload configuration straightforward.

The configuration knobs that matter most are the number of layers to offload to the GPU and the context size. Start with offloading all layers if the model fits in memory, drop the offload count if you see out of memory errors, and tune context size to match your actual use case rather than maxing it out by default.

The community I run at aiengineer.community has dozens of members running local AI on integrated graphics across Intel, AMD, and Apple hardware. We compare benchmarks, share working configurations, and troubleshoot driver issues together. If you want hands-on help getting Vulkan offload tuned on your specific machine, that is the place.

Final thoughts

Local AI on integrated graphics is one of those topics where the conventional wisdom lags two years behind reality. People still tell beginners they need an expensive GPU to get started. They do not. A 2022 laptop with Iris Xe or a Ryzen with Radeon graphics, paired with llama.cpp’s Vulkan backend and a quantized 7B model, gets you into double-digit tokens per second and a genuinely productive learning environment. An Apple Silicon Mac does even better.

The honest limits are real. Large models, long contexts, and serving workloads still need discrete GPUs or cloud inference. For everything below those thresholds, which covers almost every learning scenario and most personal projects, integrated graphics with Vulkan offload is enough. Stop waiting for the right hardware. The hardware you already own is, very likely, ready to go.

Watch the full walkthrough on YouTube: https://www.youtube.com/watch?v=GqrmkpKBlyI

Join the community of AI engineers building local AI projects: https://aiengineer.community/join

Zen van Riel

Senior AI Engineer | Ex-Microsoft, Ex-GitHub

I went from a $500/month internship to Senior AI Engineer. Now I teach 30,000+ engineers on YouTube and coach engineers toward six-figure AI careers in the AI Engineering community.

Blog last updated Jul 7, 2026