RTX 3090 Qwen Coder VS Code Daily Driver Setup
I run a local AI coding environment on my own hardware every day, and the single piece of gear that makes this practical for most engineers is a used RTX 3090 with 24GB of VRAM. In the master class I recorded recently I was using an RTX 5090, but that card is overkill for what 90% of you actually need. The RTX 3090 is the sweet spot for an rtx 3090 qwen coder vs code daily driver setup, and once you understand exactly which model fits, what context window you can afford, and how to plug it into VS Code, you can replace a meaningful chunk of your cloud spend with hardware you already own or can buy used for around 700 to 850 dollars.
This is the hardware-specific guide I wish someone had written for me. No hand waving about “it depends on your machine.” Specific quants, specific context lengths, specific tokens per second, and the real workflow I use when I do not want to ship code to a cloud API.
Why is the RTX 3090 the right card for local AI coding?
The RTX 3090 has 24GB of GDDR6X VRAM. That number is the only one that matters for picking a local coding model. The 4090 has the same 24GB. The 5090 jumps to 32GB, which is nice but doubles the price. The 4080 and 3080 only have 16GB or less, which forces you into smaller models that struggle with tool calling and agentic coding.
Used RTX 3090s show up on local marketplaces and eBay between 700 and 900 dollars depending on condition. That is the price of about three months of a Claude Max subscription. The card pays for itself surprisingly fast if you are someone who runs heavy agentic sessions, and unlike a cloud subscription you keep the hardware. If you are weighing whether the money is worth it, I covered the broader trade off in my local AI coding reality check where I lay out where local actually wins and where cloud models still dominate.
The reason I keep coming back to 24GB is simple. Tool calling and agentic coding need a model with at least 20 to 32 billion parameters to behave. Anything smaller falls apart the moment a coding agent like Continue or Kilo Code asks it to produce structured output. And once you commit to a 32B parameter model, you need every gigabyte of VRAM you can get for context window.
Which Qwen Coder quant actually fits on 24GB?
Here is where most tutorials lie to you. They show a 32B model loading on a 24GB card and call it a day. What they do not show is what happens when you actually try to feed it a real codebase.
Qwen 2.5 Coder 32B at the standard Q4_K_M quantization weighs in around 19 to 20GB. That leaves you roughly 4GB for context, which sounds fine until you remember that a real coding agent eats context like it is its favorite lunch. A typical Continue session with a few open files, the system prompt, and tool definitions will burn through 8,000 to 12,000 tokens before you have even asked anything useful. With only 4GB of headroom on a 3090, you are looking at a context window of maybe 6,000 to 8,000 tokens. Workable for one file edits, painful for anything else.
The trick I have settled on for the 3090 is to step down to Qwen 2.5 Coder 14B at Q5_K_M or Q6_K. The 14B model at Q5_K_M lands around 10 to 11GB, which leaves a generous 12 to 13GB of VRAM for context. That gives you a 30,000 to 40,000 token context window without spilling into shared system memory. And once you spill into shared memory, the entire setup falls over. I demonstrated that exact failure in the video. The card maxes out, the model technically loads, but inference drops from 40 tokens per second to a crawl. Even the video feed lagged.
If your work tolerates a smaller model, Qwen 2.5 Coder 7B at Q8_0 is shockingly capable for autocomplete and small refactors, and you can run it with a 64,000 token context on a 3090 without breaking a sweat. The full breakdown of how to size these decisions lives in my VRAM requirements guide, which goes deeper into the math.
What tokens per second can you actually expect?
Numbers from my own testing on the 3090, with Qwen 2.5 Coder 14B Q5_K_M, flash attention enabled, K cache quantization at F16:
Cold prompt with about 500 tokens of context. Generation hits roughly 55 to 65 tokens per second. Indistinguishable from a fast cloud API in feel.
Warm session with around 15,000 tokens of loaded codebase context. Generation drops to 35 to 45 tokens per second. Still very usable for editing tasks. Prompt processing takes about 4 to 6 seconds before the first token appears.
Heavy session with 30,000 plus tokens. Generation slows to 20 to 28 tokens per second, and prompt processing balloons to 15 to 20 seconds. This is the real ceiling on a 3090 with a 14B model.
For comparison, the 32B Qwen at the same context size on this card simply will not run without spilling memory. And the OpenAI gpt-oss 20B model, which I also tested on the bigger card, comes in faster than the 32B Qwen but loses ground on actual coding quality, especially around tool calling.
How do you wire it into VS Code with Continue.dev?
This is the part I get asked about most. The short version is that LM Studio runs the model and exposes an OpenAI compatible API on localhost, and Continue.dev plugs into that API like it would plug into any cloud provider.
Inside LM Studio, load Qwen 2.5 Coder 14B Q5_K_M. Set the context length to 30,000. Turn on flash attention and enable K cache quantization at F16 in the advanced settings. Those two flags shave just enough VRAM off your context overhead to push the window higher without spilling. Start the local server from the developer tab and confirm it is listening on the default port.
In VS Code, install the Continue.dev extension. Open the model configuration, click add chat model, scroll down past the cloud providers, and pick LM Studio. Let it auto detect. Continue will read the running model from the API and add it to the config automatically. That is the entire setup. No API keys, no billing, no rate limits.
If you prefer Kilo Code instead of Continue, the flow is almost identical. Pick “use your own API key,” search for LM Studio, paste the base URL from the LM Studio developer tab, and Kilo will detect the loaded model. I touch on which agent works better for which use case in my piece on sub agent strategies for local AI coding.
If you want a tested project repo to verify your setup actually works end to end, I keep a few reference applications and configurations in my open source collection. Get the Local AI Starter Projects and clone something concrete instead of debugging an empty workspace.
What does a real daily driver workflow look like?
My honest workflow on the 3090 is hybrid. I do not pretend the local model replaces Claude or GPT for hard problems. It does not. What it does is replace the 80% of my coding interactions that are mechanical: rename this across the file, write a test for this function, explain what this regex does, generate a Pydantic model from this JSON shape, refactor this loop into a comprehension.
For those tasks the 14B Qwen is fast, free, and private. I never have to think about whether the snippet I am pasting contains a client secret or proprietary architecture. The model runs on my desk, the data never leaves the machine, and there is no monthly cap. If you want to understand why the unlimited aspect matters more than people admit, I wrote about unlimited AI coding sessions with local models covering exactly that point.
For the harder 20%, deep architectural changes, novel algorithms, debugging across many files, I switch to a cloud model. I do this without guilt. Hybrid is the correct answer for almost everyone. The mistake is thinking you have to pick one.
The other thing I do is keep LM Studio set up to swap models in under thirty seconds. When I am writing Python, I run Qwen 2.5 Coder 14B. When I want autocomplete only, I drop to the 7B at Q8 and crank context up to 64K. When I am writing prose or documentation, I sometimes run the OpenAI gpt-oss 20B because it has a slightly more natural tone for explanations. If you prefer a more terminal driven approach instead of LM Studio’s UI, the Ollama local development guide walks through that path with the same models.
What about Claude Code with a local model?
Yes, this works on a 3090, with caveats. The community project Claude Code Router lets you point Claude Code at any OpenAI compatible endpoint, which means your local LM Studio instance becomes a drop in replacement for Anthropic’s API. You install the router, run “ccr ui” to configure it, paste your local URL with the chat completions endpoint appended, give it a fake API key string because something has to be there, and select the loaded model.
The catch is context. Claude Code is built around enormous context windows, and a 14B local model with 30K context will choke on anything beyond a single file investigation. For that flow I actually drop to the OpenAI gpt-oss 20B because it tolerates a larger context window on the same VRAM budget, and Claude Code’s planning mode degrades more gracefully when it is not constantly hitting the wall. It is not magic, but for plan mode investigations on a single file or two it works surprisingly well.
Is the RTX 3090 still worth buying in late 2026?
Yes, with one important framing. The 3090 is the cheapest path to the 24GB VRAM tier, and 24GB is the floor for serious local coding work. The 4090 gives you better speed at nearly double the used price. The 5090 gives you 32GB and lets you run the full 32B Qwen comfortably, but that card costs three to four times as much as a used 3090.
If your goal is to learn the fundamentals of running local AI for coding without spending five thousand dollars to do it, the 3090 is the right answer. If your goal is to run the absolute largest open source coding models with full context, save up for the 5090 or wait for the 6090. For the rest of us, a 3090 paired with Qwen 2.5 Coder 14B inside VS Code is a daily driver that genuinely changes how you work.
I cover the full master class on YouTube here: Local AI Coding Master Class. If you want to go deeper with people building this stuff seriously, join the AI Engineer community at aiengineer.community/join and bring your hardware questions.