Local AI Coding Setup for VS Code Without Cloud API Keys


I run my entire AI coding workflow on local models. No Anthropic key, no OpenAI key, no cloud subscription burning tokens in the background. Everything routes through my own GPU, exposed to my MacBook over an encrypted link, and I plug it into VS Code through a few different paths depending on the project.

If you want a setup that works offline, costs nothing per token, and never sends your repository to a third party, this is the workflow I actually use in 2026. I will walk through the extension picks, the Ollama and LM Studio wiring, the model choices that survive real coding tasks, and the parts that genuinely work without internet access.

Why skip cloud API keys for VS Code AI coding?

The obvious reason is privacy. If you work on proprietary code, regulated data, or anything under NDA, sending your repository to a hosted model is a compliance problem you do not want to argue about with your security team.

The less obvious reason is cost predictability. Cloud coding agents inject huge system prompts on every turn. A long agentic session can burn through a surprising amount of money before you realize it. With a local model, your only cost is the electricity to keep your GPU warm, and you can run sessions all day without watching a meter tick.

The third reason is simply that local models in 2026 are good enough for a lot of real work. Not as strong as the latest frontier model, but strong enough to scaffold features, fix bugs, write tests, and refactor code when you set the system up correctly. I covered the honest version of this tradeoff in my local AI coding reality check, and this post drills into the specific VS Code angle.

Which VS Code extensions work without cloud keys?

Two extensions cover almost every use case I have run into. Continue.dev is the more flexible one. It supports any OpenAI-compatible endpoint, any Anthropic-compatible endpoint, and direct Ollama integration. You configure providers in a JSON file, point it at your local server, and the chat panel and inline edits start working immediately. No login, no key required.

Cline is the second pick. It is more agentic, closer in spirit to a full coding assistant that can read files, run commands, and iterate on tasks. It also accepts local endpoints through the OpenAI-compatible interface, so you can wire it to LM Studio or Ollama without ever creating an account.

I also use Claude Code itself, but routed through my local server. Since it added support for arbitrary base URLs, you can override the Anthropic endpoint with environment variables and point it at LM Studio’s Anthropic-compatible API. It is technically a CLI rather than a VS Code extension, but it runs inside the VS Code terminal, so functionally it lives in the same window. The main caveat is that Claude Code injects a very large system prompt, and on a small local model that prompt alone can saturate your context window before you have written a single instruction.

How do I wire Ollama or LM Studio into VS Code?

The simplest path is Ollama. You install it, pull a model, and Ollama exposes a local server on port 11434 by default. In Continue.dev’s config, you select Ollama as the provider, name the model you pulled, and you are done. Everything runs on localhost. If you want a step-by-step walkthrough of getting Ollama configured properly, I wrote a full Ollama local development guide that covers the gotchas.

LM Studio is the other path I lean on heavily. It exposes three different endpoints from one server: a native LM Studio API, an OpenAI-compatible endpoint, and an Anthropic-compatible endpoint at v1/messages. That last one is the magic, because it lets tools that expect Anthropic, like Claude Code, talk to your local model with no adapter in the middle.

The feature that changed my workflow this year is LM Studio’s linking. My main GPU lives in a Linux machine with an RTX 5090 and 32 GB of VRAM. My day-to-day development happens on a MacBook. With linking, I sign into LM Studio on both devices, the MacBook sees the Linux box’s loaded models over an encrypted connection, and from VS Code’s perspective the model is running locally. Setup is genuinely a few clicks. No port forwarding, no SSH tunnels, no certificates to manage.

Which local models actually fit a coding workflow?

Model choice is where most people make the setup unusable. The hard constraint is that the entire model needs to fit on your GPU’s VRAM. If even part of the weights spill into system RAM, the GPU has to shuttle data back and forth on every token, and your tokens-per-second falls off a cliff. For agentic coding with large context windows, that penalty compounds because compute cost scales steeply with context size.

On a 32 GB card, a Qwen 3 coder model around 30 billion parameters runs comfortably at over 100 tokens per second when it fits cleanly. A larger model like Qwen 3.5 with 35 billion parameters works because it is a mixture-of-experts architecture, where only a fraction of the parameters are active per token. Mixture-of-experts is the trick that makes bigger local models feasible on consumer hardware.

If you have a smaller GPU, you have to be honest about what fits. A 12 GB card runs smaller coding models well but struggles with the larger context windows that agentic VS Code extensions assume. A 24 GB card opens up more options. Below 12 GB, you are mostly limited to small models that work fine for autocomplete but fall apart on multi-file refactors.

The other variable is context window. LM Studio defaults to a 4,000 token context window for many models, and that is a trap. Claude Code’s system prompt alone is around 3,000 tokens. Continue.dev and Cline are leaner but still send substantial context. If you do not bump your context window to at least 32,000 tokens, and ideally 80,000 or more, your requests will hang silently with no clear error message. I always set context to the maximum my GPU can handle and watch VRAM usage as I go.

If you want plug-and-play starting points for the projects I run on this stack, including the configs and example apps, I share them in my open-source projects. They are the same setups I use to test new models when they release.

What works without internet, and what does not?

Once your local model is running, the parts that work offline are the parts that matter most for coding. Chat with your code, inline completions, multi-file edits, agentic task execution, plan mode, and sub-agents all work without a network connection. The model weights live on your disk. The extension talks to localhost. There is nothing in that loop that needs the internet.

What does not work offline is anything that depends on external services. Web search tools, documentation fetchers, package registry lookups, and any MCP servers that hit cloud APIs all break the moment you disconnect. For most of my coding sessions, this is fine. For research-heavy work where the agent needs to look up library documentation, I either pre-fetch the docs into the repo or accept that I need a connection for that specific task.

One workflow tip that pays off heavily with local models: use sub-agents aggressively. Each sub-agent gets a fresh context window, does one piece of work, and reports back to the main agent. With a local model that has a tighter context budget than a frontier cloud model, this is the difference between finishing a feature and watching the agent forget what it was doing halfway through. I cover the patterns that work in my post on sub-agent strategies for local AI coding.

How do I keep coding sessions running for hours?

The answer is bypass-all-permissions mode inside a dev container. I let Claude Code run autonomously, route everything through my local model, and walk away. There is no token meter to watch and no rate limit to hit. If a task takes 30 minutes instead of 5, I do not care, because the cost is the same either way.

This is the actual unlock of local AI coding for me. Frontier cloud models are faster per token, but they are not faster per dollar when you let them run for hours. With a local model, time becomes the only variable, and time is something I can spend liberally on background tasks while I do something else. I broke down this approach in unlimited AI coding sessions with local models for anyone who wants to push it further.

The catch is that you have to set up your environment so the agent cannot do damage. A dev container isolates the file system, sandboxes shell commands, and means a misbehaving agent at most wrecks the container, not your machine. If you skip this step and run bypass mode on your host, you are asking for trouble eventually.

What about model self-awareness and weird quirks?

One thing that surprises people: when you route Claude Code through a Qwen model, the model often claims it is Sonnet. This is not a bug. Language models do not have reliable self-awareness. Their behavior is shaped by the system prompt they receive, and Claude Code’s system prompt tells the model it is a Claude variant. The Qwen model just plays along.

Practically, this means your local model will follow the coding conventions and tool-use patterns that the host CLI prescribes. That is usually good, because those patterns are well-tuned for agentic work. It also means you cannot trust the model’s introspection. If you want to verify which model is actually running, check your local server’s logs, not the model’s claim.

Another quirk worth knowing: when your context window fills up, LM Studio gives you options for how to handle overflow. Truncating the middle of the conversation preserves the early codebase exploration while dropping intermediate steps. This trades memory for the ability to keep going. Claude Code sometimes summarizes proactively before this kicks in, but knowing the option exists has saved me from dead-ended sessions more than once.

Is this setup actually worth the effort?

For me, yes, and not because it replaces frontier cloud models. It does not. Frontier models still produce cleaner code with fewer bugs, especially on novel problems. What this setup gives me is a coding environment that is fully under my control, costs nothing per use, runs offline, and never sends a single line of my code to anyone else’s server.

For privacy-sensitive work, that is non-negotiable. For long-running agentic tasks, the unlimited-time tradeoff beats the per-token economics of cloud. For learning how AI engineering actually works under the hood, running your own model end-to-end teaches you more in a week than reading documentation for a month.

If you have the hardware to make it work, build this setup once and you will reach for it constantly. If you do not have the hardware yet, prioritize VRAM over everything else when you upgrade. A 24 GB card opens up almost every workflow I described here. A 32 GB card lets you run the larger mixture-of-experts models that are genuinely competitive for coding tasks.

If you want to see this exact workflow in action, including the LM Studio linking setup and the Claude Code routing, watch the full walkthrough on YouTube: https://www.youtube.com/watch?v=3zSANOIBHYw

And if you want to learn AI engineering with people building real systems on local and cloud models alike, join my community at https://aiengineer.community/join. I share the projects, the failures, and the fixes there before they ever make it into a video.

Zen van Riel

Zen van Riel

Senior AI Engineer | Ex-Microsoft, Ex-GitHub

I went from a $500/month internship to Senior AI Engineer. Now I teach 30,000+ engineers on YouTube and coach engineers toward six-figure AI careers in the AI Engineering community.

Blog last updated