Run Llama 3.2 in the Browser with WebGPU Tutorial


When I first told people that they could run Llama 3.2 directly inside a Chrome tab without a server, most of them did not believe me. A year ago, I would not have believed it either. But the browser has quietly become one of the most exciting deployment targets for local AI, and WebGPU is the reason. In this tutorial I want to walk you through exactly how I got Llama 3.2 running in the browser, what kind of token speeds you can realistically expect, which libraries do the heavy lifting, and where this approach actually makes sense versus where it falls apart.

I built a full demo site that runs five different AI models entirely in the browser. Image classification, real-time hand tracking, speech to text with Moonshine, semantic search through embeddings, and an LLM chat powered by Llama 3.2. No server. No API keys. No expensive cloud bill. Just a static front end that pulls a model into the browser cache and uses your own GPU to run inference. If you have ever wanted to ship an AI feature without provisioning infrastructure, this changes the math entirely.

What is WebGPU and why does it matter for running Llama 3.2?

WebGPU is a modern browser API that gives JavaScript direct access to the GPU. It replaces the older WebGL approach with something far more capable, exposing compute shaders that map almost one to one onto how machine learning frameworks already think about tensors. That matters because large language models are dominated by matrix multiplications, and those are exactly the workloads GPUs were designed to chew through.

Before WebGPU, running a model like Llama 3.2 in the browser meant falling back to WebAssembly on the CPU. That works for tiny models, but a 1B or 3B parameter language model needs proper GPU acceleration to feel responsive. With WebGPU, the same hardware that runs your favorite game or your local Ollama setup is now reachable from a webpage. That is a profound shift. It means anyone visiting your site can run real inference using their own machine, and you do not pay a cent for the compute.

The catch is that WebGPU is not universally available yet. Chrome and Edge have it stable. Firefox is shipping it progressively. Safari is rolling it out. In the demo project I built, I check for WebGPU availability with a hook before loading any model that needs it. If WebGPU is missing, smaller models can still run on the CPU, but the LLM chat requires acceleration. Always feature detect before you commit to a model load.
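
As a rough illustration of that feature detection step, here is a minimal TypeScript sketch. It is not the hook from the demo project. The names canUseWebGPU, GpuLike, and NavigatorLike are hypothetical, and the small interfaces exist only so the snippet compiles without extra type packages. The one browser call it relies on is navigator.gpu.requestAdapter(), which resolves to null when no suitable adapter is found.

```typescript
// Minimal structural types (assumptions for this sketch) so it compiles
// without the @webgpu/types package.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// Resolves to true only when the API exists AND an adapter is returned.
// The presence of navigator.gpu alone does not guarantee a usable GPU.
async function canUseWebGPU(nav: NavigatorLike): Promise<boolean> {
  if (!nav.gpu) return false;
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null && adapter !== undefined;
  } catch {
    // Treat any failure during adapter lookup as "not available".
    return false;
  }
}
```

In a browser you would pass the global navigator object and only start the model download once the promise resolves to true.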

How do you actually load Llama 3.2 in a browser tab?

There are two libraries doing the real work in this ecosystem. The first is Transformers.js from Hugging Face, which mirrors the Python transformers API but runs in JavaScript and uses ONNX Runtime Web with a WebGPU backend. The second is Web LLM from MLC AI, which is purpose built for running large language models in the browser using a compiled MLC engine.

For Llama 3.2 specifically, I went with Web LLM. The reason is that Web LLM ships with prebuilt, quantized model artifacts and a chat completions style API that feels almost identical to what you would write against OpenAI. In the project I built, the LLM chat worker calls something close to engine.chat.completions.create with the message history, and the engine streams tokens back. If you have ever wired up a server side chat endpoint, the mental model is the same. The only difference is that the engine lives inside a web worker and the weights live in your browser cache.
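
To make the chat completions pattern concrete, here is a hedged sketch of the streaming call. The ChatEngine interface below is an assumption that mirrors the OpenAI-style shape described above. It is not imported from Web LLM, and streamReply is a hypothetical helper rather than code from the demo. As far as I understand the library, an engine created with its CreateMLCEngine factory exposes a compatible chat.completions.create method, but check the current Web LLM documentation before relying on the exact types.

```typescript
// Structural types mirroring the OpenAI-style shapes (assumptions, not library imports).
type Role = "system" | "user" | "assistant";
interface ChatMessage {
  role: Role;
  content: string;
}
interface ChatChunk {
  choices: Array<{ delta: { content?: string | null } }>;
}
interface ChatEngine {
  chat: {
    completions: {
      create(req: { messages: ChatMessage[]; stream: true }): Promise<AsyncIterable<ChatChunk>>;
    };
  };
}

// Sends the history, forwards each streamed token, and returns the full reply.
async function streamReply(
  engine: ChatEngine,
  history: ChatMessage[],
  onToken: (token: string) => void,
): Promise<string> {
  const stream = await engine.chat.completions.create({ messages: history, stream: true });
  let reply = "";
  for await (const chunk of stream) {
    // Some chunks carry no text (for example the final one), so default to "".
    const token = chunk.choices[0]?.delta?.content ?? "";
    if (token) {
      reply += token;
      onToken(token);
    }
  }
  return reply;
}
```

Keeping the engine behind a small interface like this also makes the chat logic easy to exercise with a fake engine, without downloading any weights.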

The first time a user opens the chat, the model has to download. For Llama 3.2 1B in a quantized format, that is somewhere in the range of 600 MB to 1 GB depending on the quantization level. Llama 3.2 3B can push past 1.5 GB. That is a real cost you need to budget for. The good news is that browsers cache the weights aggressively. After the first visit, the model loads from the browser's on-disk cache in seconds, not minutes. In my recordings I had everything precached, so the chat was ready almost instantly.
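
Those download sizes follow from simple arithmetic: parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate, not a measurement. The parameter counts (about 1.23 billion and 3.21 billion) are the figures I recall from the Llama 3.2 model card, the 4-bit setting is an assumed quantization level, and both function names are hypothetical. Real artifacts come out somewhat larger, because quantized formats also store scaling factors and may keep some tensors at higher precision.

```typescript
// Lower-bound estimate: parameters x bits per weight, converted to bytes.
function estimateWeightBytes(paramCount: number, bitsPerWeight: number): number {
  return (paramCount * bitsPerWeight) / 8;
}

// Rough download time in seconds for a link speed given in megabits per second.
function estimateDownloadSeconds(bytes: number, megabitsPerSecond: number): number {
  return (bytes * 8) / (megabitsPerSecond * 1e6);
}

const oneB = estimateWeightBytes(1.23e9, 4); // 615,000,000 bytes, about 0.6 GB
const threeB = estimateWeightBytes(3.21e9, 4); // 1,605,000,000 bytes, about 1.6 GB

console.log(oneB / 1e9, threeB / 1e9);
// About 49 seconds for the 1B weights on an assumed 100 Mbit/s connection.
console.log(estimateDownloadSeconds(oneB, 100));
```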

If you want to see a similar pattern applied to Ollama based workflows, my local development guide for Ollama covers how the same quantization tricks make small Llama variants viable on consumer hardware.

What token speeds can you expect from Llama 3.2 in the browser?

This is the question everyone asks, and the honest answer is that it depends heavily on your GPU. On a modern discrete GPU, I was generating tokens fast enough that a haiku response finished before I could even point at the GPU utilization graph. The activity barely registered because the prompt was so small. The moment I pasted a long Wikipedia article and asked for a summary, the GPU spiked to near full utilization and stayed there until the response finished streaming. That is the signature of real local inference, and it is exactly what you want to see.

For ballpark numbers, Llama 3.2 1B in the browser tends to land in the range of 30 to 80 tokens per second on a decent GPU, and Llama 3.2 3B drops into the 15 to 40 tokens per second range on the same hardware. Integrated graphics will be slower, sometimes dramatically so. Mobile devices vary wildly. The point is not to match a hosted API on raw throughput. The point is that you get useful, interactive speeds with zero infrastructure.

Latency for the first token is also worth thinking about. Once the model is loaded, time to first token is typically well under a second for short prompts. For long prompts, prefill takes proportionally longer because the model has to process every input token through its attention layers. This is the same dynamic you see on a server, just running on the user's hardware instead of yours.
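
If you want to measure these two numbers on your own hardware, a small helper like the following is enough. This is a sketch under stated assumptions: the GenerationTiming fields and the summarizeTiming name are hypothetical, the timestamps would come from something like performance.now() in your own streaming loop, and decode speed is defined here as the tokens that arrive after the first one divided by the time since the first one.

```typescript
interface GenerationTiming {
  requestSentMs: number; // when the prompt was sent to the engine
  firstTokenMs: number; // when the first token arrived
  lastTokenMs: number; // when the last token arrived
  tokenCount: number; // total tokens generated
}

function summarizeTiming(t: GenerationTiming): {
  timeToFirstTokenMs: number;
  decodeTokensPerSecond: number;
} {
  // Prefill cost shows up as the delay before the first token.
  const timeToFirstTokenMs = t.firstTokenMs - t.requestSentMs;
  const decodeSeconds = (t.lastTokenMs - t.firstTokenMs) / 1000;
  // Tokens after the first, divided by the time spent producing them.
  const decodeTokensPerSecond = decodeSeconds > 0 ? (t.tokenCount - 1) / decodeSeconds : 0;
  return { timeToFirstTokenMs, decodeTokensPerSecond };
}

// Example: 101 tokens, first token after 400 ms, last token 2 seconds later.
console.log(summarizeTiming({ requestSentMs: 0, firstTokenMs: 400, lastTokenMs: 2400, tokenCount: 101 }));
// { timeToFirstTokenMs: 400, decodeTokensPerSecond: 50 }
```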

Which use cases actually make sense for browser based Llama 3.2?

This is where I want to be honest with you. Browser based LLMs are not a replacement for hosted inference in every situation. The download cost alone rules out a lot of casual use cases. If a visitor is going to spend thirty seconds on your page, you are not going to ask them to download a gigabyte of weights first. That would be absurd.

Where browser inference shines is in tools and applications where the user has committed to staying. Think internal tools, productivity apps, learning environments, privacy sensitive workflows where data must never leave the device, and proof of concept projects you want to share without standing up a backend. In my demo I paired Llama 3.2 with a small embedding model so the LLM could answer questions grounded in local documents. That kind of retrieval augmented generation, fully client side, is genuinely magical. No data leaves the browser. No API costs. No rate limits.
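
The retrieval half of that setup can be surprisingly small. The sketch below assumes you already have embedding vectors for your document chunks and for the question, produced by whatever embedding model you load in the browser. It ranks chunks by cosine similarity and builds a grounded prompt. All names and the prompt wording are hypothetical, and this is an illustration of the general technique rather than the demo's code.

```typescript
interface EmbeddedChunk {
  text: string;
  vector: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Returns the k chunks most similar to the query, best match first.
function topK(query: number[], chunks: EmbeddedChunk[], k: number): EmbeddedChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

// Puts the retrieved text in front of the question so the LLM answers from it.
function buildGroundedPrompt(question: string, context: EmbeddedChunk[]): string {
  const sources = context.map((c, i) => `[${i + 1}] ${c.text}`).join("\n");
  return `Answer using only these sources:\n${sources}\n\nQuestion: ${question}`;
}
```

For a few hundred chunks a linear scan like this is fast enough that you do not need a vector database in the browser.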

Another pattern I love is using browser LLMs to prototype an idea before committing to infrastructure. You can ship a working demo to a stakeholder for free, validate the concept, and only later decide whether the production version needs hosted inference. My write up on setting up local LLMs cost effectively goes deeper into this trade off. And if you are weighing browser inference against running models on a small device or edge box, my breakdown of what edge AI actually means frames the spectrum well.

If you want to skip ahead and grab the full project I built, including the Llama 3.2 chat worker, the WebGPU detection hooks, and every other model in the demo, browse my open source local AI starter projects and clone the one that fits your use case.

How do you build a Llama 3.2 browser chat from scratch?

Let me describe the architecture in plain words first, because the structure matters more than the syntax. You start with a static front end. In my case it is a TypeScript project with a small component tree. There is no backend at all. The entire site is shipped as static files.

Inside the components folder, the LLM chat is split into two pieces. There is the visual component that renders the chat UI, the input box, and the message bubbles. And there is a web worker that owns the model. The worker exists because LLM inference blocks the JavaScript thread, and you do not want your UI to freeze while a token is being generated. The worker holds a reference to the MLC engine, receives messages from the main thread, runs inference, and posts streamed tokens back as they arrive.

When the user opens the chat for the first time, the worker initializes the engine, which triggers the model download. Progress events flow back to the UI so you can show a loading bar. Once the model is ready, every user message gets appended to a running conversation history, sent to the worker, and the worker calls the chat completions style API. Tokens stream back into the UI in real time, exactly like a hosted chat endpoint, just with no network call and no cost.
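
One way to keep the main thread and the worker in sync is a small typed message protocol plus a pure function that folds each worker message into UI state. The sketch below is an assumption about how such a protocol could look. The message names, the ChatUiState fields, and applyWorkerMessage are all hypothetical and not taken from the demo project.

```typescript
// Messages the worker posts back to the main thread (hypothetical names).
type WorkerMessage =
  | { type: "progress"; fraction: number } // model download and init, 0 to 1
  | { type: "ready" }
  | { type: "token"; text: string }
  | { type: "done" }
  | { type: "error"; message: string };

interface ChatUiState {
  status: "loading" | "ready" | "generating" | "error";
  loadFraction: number;
  draftReply: string; // assistant reply currently being streamed
  replies: string[]; // completed assistant replies
  errorMessage?: string;
}

const initialState: ChatUiState = {
  status: "loading",
  loadFraction: 0,
  draftReply: "",
  replies: [],
};

// Pure reducer: given the current state and one message, return the next state.
function applyWorkerMessage(state: ChatUiState, msg: WorkerMessage): ChatUiState {
  switch (msg.type) {
    case "progress":
      return { ...state, status: "loading", loadFraction: msg.fraction };
    case "ready":
      return { ...state, status: "ready", loadFraction: 1 };
    case "token":
      return { ...state, status: "generating", draftReply: state.draftReply + msg.text };
    case "done":
      return { ...state, status: "ready", replies: [...state.replies, state.draftReply], draftReply: "" };
    case "error":
      return { ...state, status: "error", errorMessage: msg.message };
  }
}
```

On the main thread, the worker's onmessage handler would call applyWorkerMessage with each event's data and re-render, which keeps the loading bar and the streaming bubble driven by a single state object.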

For the other models in the demo, the pattern is even simpler. Image classification, for example, uses Hugging Face Transformers.js. You create a classifier pipeline, await it on an image URL, and you get back the top predictions. In my recording I dropped a photo of a lion into the page and got a confident lion classification in about 230 milliseconds. The Egyptian cat photo classified just as quickly. The community has done excellent work abstracting the hard parts away, so you spend your energy on the use case rather than on plumbing.
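
Here is a hedged sketch of the classify-and-time pattern. To keep it self-contained, the classifier is passed in as a plain async function. With Transformers.js, my understanding is that a pipeline created for the "image-classification" task is callable with an image URL and resolves to an array of label and score pairs, which would fit the ImageClassifier type below. The type names and classifyAndTime are hypothetical.

```typescript
interface Prediction {
  label: string;
  score: number;
}
// Assumed shape of a classification pipeline: image URL in, scored labels out.
type ImageClassifier = (imageUrl: string) => Promise<Prediction[]>;

// Runs the classifier once and reports the best label plus elapsed time.
async function classifyAndTime(
  classifier: ImageClassifier,
  imageUrl: string,
  now: () => number = () => Date.now(),
): Promise<{ best: Prediction | undefined; elapsedMs: number }> {
  const started = now();
  const predictions = await classifier(imageUrl);
  const elapsedMs = now() - started;
  // Pick the highest score explicitly instead of assuming the array is sorted.
  let best: Prediction | undefined;
  for (const p of predictions) {
    if (!best || p.score > best.score) best = p;
  }
  return { best, elapsedMs };
}
```

Note that the first call usually includes model load and warm-up, so time a second call if you want a number comparable to steady-state latency.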

If you want a broader perspective on what is possible with consumer hardware in general, my piece on running advanced language models on your local machine lays out the full landscape, from browser to desktop to dedicated inference boxes.

What are the real limitations of running Llama 3.2 in the browser?

I want to close with the honest trade offs. First, model size is a hard ceiling. Anything beyond a few billion parameters becomes impractical to download and load. You will not be running Llama 3.1 70B in a browser tab any time soon. For that you need a real server or a serious local rig.

Second, cold start matters. Even with aggressive caching, the first visit is slow. If your users are first time visitors, you need to design around that. Sometimes the right answer is to lazy load the model only when the user explicitly opts into the AI feature, rather than forcing the download on page load.
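
A simple way to implement that opt-in behavior is a lazy loader that starts the expensive work on first use and shares one in-flight promise across all callers. The sketch below is generic. createLazyLoader is a hypothetical helper, and the load function you pass in would be whatever creates your engine.

```typescript
// Wraps an expensive async load so it runs at most once, on first request.
function createLazyLoader<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (!pending) {
      pending = load().catch((err) => {
        // Reset on failure so a later click can retry the download.
        pending = undefined;
        throw err;
      });
    }
    return pending;
  };
}

// Usage sketch: nothing is loaded until getModel() is called for the first time.
let loadCalls = 0;
const getModel = createLazyLoader(async () => {
  loadCalls += 1;
  return "model-handle"; // stand-in for a real engine object
});
```

You would call getModel() from the click handler of an "Enable AI chat" button rather than at page load, so visitors who never open the chat never pay for the download.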

Third, browser support is still uneven. WebGPU is great in Chromium browsers, decent in Firefox, and improving in Safari. You should always feature detect and provide a graceful fallback, whether that is a smaller CPU model, a hosted API, or a polite message asking the user to try a supported browser.
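
The fallback decision itself can be a tiny pure function, which makes it easy to test and to change later. In the sketch below the plan names, the Capabilities fields, and the priority order are all assumptions chosen for illustration. Your own order might prefer a hosted API over a CPU model.

```typescript
type InferencePlan = "webgpu-llm" | "cpu-small-model" | "hosted-api" | "ask-to-switch-browser";

interface Capabilities {
  webgpu: boolean; // result of your feature detection
  cpuModelAvailable: boolean; // a smaller model that can run without a GPU
  hostedApiAvailable: boolean; // the site has a server-side fallback
}

// Picks the first option that is available, in an assumed order of preference.
function choosePlan(c: Capabilities): InferencePlan {
  if (c.webgpu) return "webgpu-llm";
  if (c.cpuModelAvailable) return "cpu-small-model";
  if (c.hostedApiAvailable) return "hosted-api";
  return "ask-to-switch-browser";
}
```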

Fourth, debugging is harder than on a server. Errors inside web workers can be opaque, and GPU memory issues manifest in confusing ways. Plan for extra time when something goes wrong.

Despite all of that, I keep coming back to this approach because the deployment story is just so good. A static site costs almost nothing to host. There is no scaling problem because every user brings their own compute. There is no API key to leak. There is no rate limit to manage. For the right use case, browser based Llama 3.2 is not a curiosity. It is genuinely the best option.

If you want to see the full project running and read the code that wires Llama 3.2 to the browser chat, watch the full walkthrough on YouTube where I demo every model in real time and talk through the architecture. And if you want to go deeper, learn how to ship production grade AI features, and connect with other engineers building these systems, join my community at aiengineer.community/join. I would love to see what you build.

Zen van Riel


Senior AI Engineer | Ex-Microsoft, Ex-GitHub

I went from a $500/month internship to Senior AI Engineer. Now I teach 30,000+ engineers on YouTube and coach engineers toward six-figure AI careers in the AI Engineering community.
