Local AI Pair Programmer That Works Offline on a MacBook
I spend a lot of time on planes, in coffee shops with terrible Wi Fi, and in airport lounges where the network is so locked down that even Claude Code refuses to handshake. That used to mean my AI pair programming workflow simply stopped. Today it does not. I run a local AI pair programmer that works fully offline on a MacBook, and once you understand the few things that actually matter on Apple Silicon, you can do the same.
This post is about the specific combination most tutorials gloss over. Offline plus Apple Silicon plus a real coding agent. I want to give you the same mental model I use to pick a model, configure LM Studio, and connect it to an agent like Continue, Kilo Code, or even Claude Code itself.
Why does a MacBook make sense for offline AI coding?
On any normal gaming PC, the first question is how much VRAM is on the GPU. VRAM is expensive, and most consumer Nvidia cards do not even have 32 GB of dedicated memory. That is the wall most people hit when they try to run a serious coding model.
Apple Silicon breaks that wall in a way that is genuinely strange the first time you see it. The M chips use unified memory, which means the same RAM is shared between the CPU and the GPU. If you buy a MacBook Pro with an M4 Pro and 48 GB of unified memory, you effectively have around 48 GB of VRAM available for a language model. That is more than almost any consumer Nvidia card on the market, in a laptop that fits in a backpack and runs on battery.
That is the quiet superpower behind running a local AI pair programmer offline on a MacBook. You are not paying gaming PC prices for a tower that needs a wall outlet. You are using a laptop you probably already wanted to own, and getting model capacity that would otherwise require a workstation GPU.
The trade off is raw speed. A high end discrete GPU still pushes more tokens per second. The same 20 billion parameter model that ran at around 175 tokens per second on my desktop GPU runs noticeably slower on an M4 Pro MacBook. But slower is not unusable. For real pair programming, where you read the diff before you accept anything, the MacBook stays comfortably inside the speed envelope where the experience feels natural.
What does “fully offline” actually mean for this workflow?
Offline does not just mean no internet. It means no surprise dependencies. No remote API the tool silently calls for embeddings. No telemetry handshake that hangs the IDE on a plane. No license check that fails at 35,000 feet.
The reason I lean on LM Studio is exactly this. Once a model is downloaded, LM Studio runs the entire inference loop locally. The local server uses the OpenAI compatible API format, which means any code agent that knows how to talk to OpenAI can talk to your MacBook instead. You point the agent at a localhost URL and the request never leaves the machine.
That is the property that matters when the airport captive portal is fighting you. Your editor still loads. Your agent still calls tools. Your model still answers. The only thing missing is the cloud, and on this workflow you do not need it.
If you want a deeper walkthrough of the runtime, my Ollama local development guide covers the terminal first alternative. For pure offline MacBook work, LM Studio remains my default because the GUI makes it trivial to swap models, watch memory pressure, and verify nothing is leaking out to the network.
How do I pick a model that actually fits on Apple Silicon?
This is where most people go wrong. The size on disk of a model is not the same as the memory it will consume when you load it for real coding work. A 21 GB Qwen 2.5 32 billion parameter model fits in 48 GB of unified memory at first glance. The moment you ask for a context window large enough to load real source files, that estimate balloons fast. With Qwen 2.5 32B and a 75,000 token context window, the memory estimate climbs to around 45 GB, which leaves almost no headroom for the rest of your system.
The rule I use on a MacBook is simple. Pick the smallest capable model that still supports tool calling, then push the context window as far as it will go without spilling into territory where the system starts to swap. For most current MacBook Pro configurations, that means a 20 billion parameter model with a generous context window beats a 32 billion parameter model with a cramped one, every time.
A few specifics that matter on Apple Silicon:
- The OpenAI 20 billion parameter open source model loads cleanly with a 50,000 token context window in roughly 20 GB of memory. That is the sweet spot for a 32 GB or 48 GB MacBook.
- A 32 billion parameter model can run, but only if you accept a small context window. On a real codebase, that small context window is the thing that breaks agents.
- Quantized models are not a compromise on Apple Silicon. They are the default. The math is designed to keep accuracy nearly intact while shrinking the footprint, and that shrinkage is what makes a laptop viable.
If you want to see the full intuition on why parameter count is the wrong thing to optimize for, I broke it down in local AI coding reality check. The short version: pick for context window, not for raw parameter count.
Where does MLX fit into this?
MLX is Apple’s array framework, and it is the thing that makes Apple Silicon punch above its weight on language models. It is built specifically around unified memory, so weights do not have to be copied between CPU and GPU memory the way they would on a discrete card. That copy avoidance is part of why a MacBook can serve a 20 billion parameter model at usable speeds even though its raw compute is far below a desktop GPU.
You will not interact with MLX directly. LM Studio uses Apple optimized backends under the hood, and that is enough. But it is worth knowing this workflow is viable because Apple invested in a math framework that treats unified memory as a feature, not a workaround. When you read benchmarks claiming a MacBook is “surprisingly competitive” on local inference, MLX is a big part of the reason.
The practical takeaway is that you do not need to chase exotic configurations. Download the model in LM Studio, pick a quantization that fits, and trust the Apple optimized inference path to do its job.
What is the offline by default workflow inside LM Studio?
Here is the rhythm I use when I know I am about to lose connectivity. The night before a flight, I open LM Studio at home and download the two models I want to have available. I usually pick the OpenAI 20 billion parameter model for general work and one slightly larger coding model for moments when I want a second opinion. Both live on disk now and never need the network again.
In LM Studio, I pre configure each model with the context window and parameters I want. I tick the box that lets me manually choose load parameters, and I dial the context window to the largest value that does not spill into shared territory. I turn on flash attention and set the K cache quantization to F16, which is the lever that buys me a bit more context on a tight budget. I verify the local server is running on the developer tab.
Then I close everything except my editor. On the plane, I open LM Studio, load whichever model I want, and point my agent at the localhost URL. That is the entire ritual. The agent does not know or care that there is no internet. As far as it is concerned, the OpenAI API is reachable, because LM Studio is wearing that face on localhost.
This is also where I remind myself that the tools I rely on need to be installed before I lose connectivity. Continue, Kilo Code, Claude Code Router, the model files, and any project repositories all need to be on disk before the flight. The actual offline session is the easy part. The preparation is what makes it work.
If you want practical starter projects already wired up to run against a local model, you can grab them from my open source vault. They are sized to work cleanly in the context windows a MacBook can realistically support, and they are the same projects I use to test new local model setups. Get the Local AI Starter Projects.
Which agents work cleanly with a local model on a MacBook?
The three I keep coming back to are Continue, Kilo Code, and Claude Code via Claude Code Router. Each one has a different personality, but all of them speak the OpenAI compatible API that LM Studio exposes, which means none of them care whether the model is running in San Francisco or on your lap.
Continue is the easiest entry point. You add a chat model, scroll past the cloud providers to LM Studio in the list, and let it auto detect whichever model you have loaded. Within seconds you have an agent that lists files, reads them, and replies, all locally, all offline. Watching the GPU usage spike on the local machine while the agent thinks is the moment most people understand what they have built.
Kilo Code is more aggressive about tool calling, which is where the limits of smaller models show. A 20 billion parameter model can struggle with the structured output Kilo expects when the context window is tight, and you will see the agent loop on the same file. The fix is not a bigger model, which makes the context problem worse. The fix is more explicit hints. Tell it which file. Tell it where the function lives. This setup rewards specific prompts the way pairing with a focused junior engineer does.
Claude Code through Claude Code Router is the most surprising one. Claude Code is normally tied to the cloud API, but the community built Claude Code Router so it can route through any OpenAI compatible endpoint. Point it at LM Studio’s local server, run “ccr code” instead of plain “claude”, and you get the Claude Code experience driven entirely by your local model. On a flight, this is the closest thing to magic in the stack.
For more on how this slots into a sustainable career, I wrote about how local AI is shaping software engineering careers. The short version: engineers who can run models offline are no longer the weird ones. They are increasingly the ones who keep shipping when others are stuck waiting for an API to come back up.
What are the honest limits of this setup?
I want to be square with you about where this stops working, because the YouTube version of local AI coding tends to skip this part.
Context window is the ceiling. A real codebase eats tokens fast. Even a small auction site sample I use in demos is around 9,000 tokens of Python, and dumping the repo three times pushes you to 34,000 tokens. Agentic tools consume context as their favorite lunch. On a 48 GB MacBook with a 20 billion parameter model, you can support meaningful sessions, but you do not have unlimited room.
Speed degrades as context fills. The first few exchanges are fast. By the time the agent has explored several files, you will feel the slowdown. This is more pronounced on a MacBook because the absolute speed ceiling is lower than on a discrete GPU.
Bigger is not better. A 32 billion parameter Qwen model is more capable per token, but slower per token, and forces a smaller context window. On a MacBook, the context window almost always wins.
Heat and battery are real. Sustained inference on Apple Silicon is efficient, but not free. Expect the fans to spin up and battery life to drop during long sessions. Bring a charger.
So is this actually viable for daily work?
For me, yes, with a clear understanding of what I use it for. Local AI on a MacBook is my daily driver for travel work, focused refactors, and sessions where I want certainty that no code is leaving the machine. It is also the workflow I use when teaching, because demonstrating a working agent without network dependency removes a whole class of “well, it usually works” excuses.
For deep architectural work on a large codebase, I still reach for cloud models at my desk. The combination of speed, context, and tool calling is just better there today. But the gap is closing, and the offline workflow is already good enough that I never feel stranded when I leave the desk.
The skill that matters is not memorizing one configuration. It is understanding the trade space well enough to pick a model and context window that fit your specific MacBook. Once you have that intuition, you can adapt to whatever new model drops next month.
If you want to keep going deeper on this, the full master class is on my YouTube channel. Watch Ultimate Local AI Coding Guide For 2026 for the long form walkthrough of every step, including the LM Studio configuration, the agent setups, and the moments where things break in interesting ways.
And if you want to be in the room with other engineers who are building this kind of skill seriously, that is exactly what we do inside the AI Engineer community. Join us at aiengineer.community/join and tell me what you are building offline. I read every introduction.