Local AI for Embedded Engineers: Running Models on Edge Devices


I keep meeting embedded engineers who think local AI is still science fiction on their hardware. They picture an RTX 4090 chugging away in a server room and assume their 8 watt board has no business running anything intelligent. Then I show them what I run inside a browser tab on a midrange laptop, and the whole conversation shifts. If WebGPU can serve a Llama 3.2 chat, a Moonshine speech model, real time hand tracking, image classification, and semantic search from cached files, then a Jetson Orin Nano or a Raspberry Pi 5 with a Coral accelerator is sitting on far more capability than most teams ever use.

This guide is for the embedded crowd. The people who care about board temperature, flash wear, deterministic timing, and whether the model still fits when the bootloader takes its share of memory. I want to walk you through how I think about local AI when the deployment target is not a GPU server but a board the size of a credit card, a smartphone NPU, or a custom industrial gateway with a fan that nobody wants to hear.

Why should embedded engineers care about local AI right now?

The cloud first era of AI assumed bandwidth was free and latency did not matter. Embedded work has always known better. A factory robot cannot wait 400 milliseconds for a round trip to a regional data center. A medical device cannot stream patient audio to a third party. A drone cannot rely on cellular reception over a forest. These constraints used to mean either no AI at all or a heavily watered down rule based system.

That window has closed. Models in the under 1 billion parameter range now perform tasks that needed 7B or larger just two years ago. Quantization has gone from a research curiosity to a default. Runtimes like ONNX Runtime, TensorFlow Lite, ExecuTorch, and llama.cpp ship binaries that fit into the kind of memory budget an embedded engineer is used to negotiating. The skill ceiling for shipping local intelligence is lower than it has ever been, which is exactly why I think this is the best moment in a decade to be an AI engineer working close to the metal.

What hardware actually matters for edge AI deployment?

I get asked for hardware recommendations every week, and the honest answer is that the right board depends on the workload class, not the brand. Let me break down how I categorize the common targets.

The Jetson family from NVIDIA still owns the high end of the embedded AI space. A Jetson Orin Nano gives you genuine GPU acceleration with a CUDA toolchain that ports cleanly from your development workstation. If your model relies on transformer attention with longer sequence lengths, this is where I start. The thermal envelope is forgiving compared to a phone, and the software stack is mature.

The Raspberry Pi 5 is the budget workhorse. With the right quantized model and a Hailo or Coral accelerator over PCIe or USB, you can run object detection at usable frame rates and small language models at a few tokens per second. Without an accelerator, the CPU still handles classification and embedding workloads through ONNX Runtime acceptably, especially on integer quantized weights.
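To make that concrete, here is a minimal sketch of the CPU-only path on a Pi 5 with ONNX Runtime. The model filename and input shape are assumptions; substitute your own integer quantized classifier.

```python
# Minimal sketch: CPU-only classification with ONNX Runtime on a Pi 5.
# "mobilenetv2_int8.onnx" is a hypothetical quantized model file.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "mobilenetv2_int8.onnx",            # your int8-quantized classifier
    providers=["CPUExecutionProvider"],  # no accelerator: plain ARM CPU path
)

input_name = session.get_inputs()[0].name
frame = np.random.rand(1, 3, 224, 224).astype(np.float32)  # stand-in for a camera frame

logits = session.run(None, {input_name: frame})[0]
print("predicted class:", int(np.argmax(logits)))
```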

Google Coral TPUs are specialized. They love static quantized integer graphs and they punish you for anything they were not designed for. If your inference path is a fixed convolutional pipeline, Coral hits power efficiency numbers that nothing else matches. If you need flexibility, look elsewhere.

Smartphone NPUs are the dark horse. Qualcomm Hexagon, Apple Neural Engine, and the various MediaTek APUs are all underused by general purpose developers. The tooling is improving fast through Core ML, NNAPI replacements, and vendor SDKs. For consumer products, ignoring the NPU sitting in every modern phone is leaving performance on the table.

How small can a useful model actually get?

This is where embedded engineers get the most upside, because the field has been quietly rewriting the answer to this question. In the WebGPU project I ran through recently, the hand tracking model that handled real time gesture recognition was 5 megabytes. Five. That is smaller than a single high resolution photograph, and it ran competently on what I described as a fairly old device. Image classification weighed in at 80 megabytes and recognized an Egyptian cat in 230 milliseconds without breaking a sweat.

For language tasks, sub 1 billion parameter models have become genuinely useful. Llama 3.2 in the 1B and 3B range, Phi 3 Mini, Gemma 2 2B, and Qwen 2.5 in its smallest variants all handle structured extraction, classification, summarization of short documents, and tool calling well enough for production. They will not write a novel, but embedded use cases rarely need a novel. They need a reliable function caller, a deterministic intent classifier, or a translator that fits in 700 megabytes of flash.
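As a sketch of what that looks like in practice, here is a deterministic intent classifier built on llama-cpp-python with a quantized small model. The GGUF filename, labels, and prompt are illustrative assumptions, not a recipe lifted from the project above.

```python
# Intent classification sketch with a quantized sub-1B model via llama-cpp-python.
# The model filename and label set are assumptions for illustration.
from llama_cpp import Llama

llm = Llama(model_path="llama-3.2-1b-instruct-q4_k_m.gguf", n_ctx=512, n_threads=4)

prompt = (
    "Classify the user request into exactly one label: "
    "set_timer, read_sensor, open_valve, unknown.\n"
    "Request: 'flush line 3 for ten seconds'\nLabel:"
)
out = llm(prompt, max_tokens=4, temperature=0.0)  # temperature 0 keeps the output deterministic
print(out["choices"][0]["text"].strip())
```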

The combination that unlocks this is aggressive quantization paired with thoughtful model selection. A 1B parameter model at 4 bit quantization lands somewhere between 600 and 800 megabytes. At 2 bit with the right calibration data, you can squeeze it under 400. Whether that fits your board depends on what else lives there, but the math has finally tilted in our favor. If you want to understand exactly why this works, my deep dive on model quantization as the key to faster local AI performance covers the trade offs in detail.
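The arithmetic behind those numbers is worth seeing once. The sketch below assumes a rough 25 percent overhead for higher precision embeddings, quantization scales, and runtime metadata; real sizes vary by format and model family.

```python
# Back-of-the-envelope size math for a 4-bit quantized 1B-parameter model.
# The 25% overhead figure is an assumption, not a measured value.
params = 1_000_000_000
bits_per_weight = 4

raw_mb = params * bits_per_weight / 8 / 1e6   # ~500 MB of packed weights
with_overhead_mb = raw_mb * 1.25              # ~625 MB on flash

print(f"packed weights: {raw_mb:.0f} MB, with overhead: {with_overhead_mb:.0f} MB")
```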

What latency budgets should you plan for on edge devices?

Latency budgets are where embedded engineering instincts pay off, because we already think in milliseconds rather than seconds. I structure local AI latency in four buckets.

Cold start is the time from idle to first token or first inference. On a Jetson with a quantized small language model loaded into shared memory, this can be under a second. On a Pi 5 reading a 600 megabyte model from a microSD card, expect 8 to 15 seconds the first time. The fix is keeping the model resident in RAM if you have the headroom, or using a fast NVMe drive over the Pi 5 PCIe lane.

Time to first token matters for any streaming language workload. For a 1B parameter model on edge silicon, sub 500 milliseconds is the threshold I aim for. Below that, the interaction feels live. Above one second, users start to wonder if it crashed.

Throughput matters for batch and vision pipelines. The 230 millisecond classification I demonstrated is fine for a user uploading a photo. For a 30 frame per second camera feed, you need to be under 33 milliseconds, which usually means moving to a smaller model, dedicated accelerator silicon, or both.

Tail latency is the one most teams forget. The 99th percentile is what your users will complain about, not the median. Thermal throttling on a fanless enclosure can double inference time once the board has been running for an hour. Profile under the actual deployment conditions, not on a cool desk.
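A simple way to keep yourself honest across all four buckets is to log percentiles on the target board itself. The sketch below uses a placeholder in place of the real inference call; run it inside the actual enclosure after the board has been warm for a while.

```python
# Latency percentile profiling sketch. run_inference() is a placeholder for the
# real model call; collect the numbers under deployment thermal conditions.
import time
import statistics

def run_inference():
    time.sleep(0.02)  # placeholder for the real model call

latencies_ms = []
for _ in range(1000):
    start = time.perf_counter()
    run_inference()
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
p50 = statistics.median(latencies_ms)
p99 = latencies_ms[int(len(latencies_ms) * 0.99) - 1]
print(f"p50: {p50:.1f} ms, p99: {p99:.1f} ms")  # the p99 is what users complain about
```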

Which quantization tricks actually matter in production?

I treat quantization as the single most important lever an embedded engineer has for local AI. The headline number is the bit width, but the nuance is in how you get there.

Post training quantization is where most projects start, and for many it is enough. You take a trained float16 or bfloat16 model and convert it to int8 or int4 using a calibration dataset. The accuracy drop is often in the noise, especially for classification and detection workloads. This is the path I recommend if you want results this week.
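For reference, this is roughly what post training static quantization looks like with ONNX Runtime. The model paths, input name, and random calibration tensors are placeholders; in practice the calibration set should be a few hundred representative samples from your deployment data.

```python
# Post-training static quantization sketch with ONNX Runtime.
# Paths, input name, and calibration tensors are placeholders.
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class CalibReader(CalibrationDataReader):
    def __init__(self, samples, input_name="input"):
        self.data = iter([{input_name: s} for s in samples])

    def get_next(self):
        return next(self.data, None)

# Stand-in calibration data: replace with real preprocessed deployment inputs.
samples = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(200)]

quantize_static(
    "classifier_fp32.onnx",   # trained float model
    "classifier_int8.onnx",   # int8 artifact for the edge target
    CalibReader(samples),
    weight_type=QuantType.QInt8,
)
```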

Quantization aware training matters when you go below 4 bits or when the model is sensitive. The model learns during training to be robust to the quantization noise, which buys you another bit or two of compression without the accuracy collapse. It costs you compute upfront, but the deployment artifact is dramatically smaller.
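If you do go the quantization aware training route, the eager mode flow in PyTorch looks roughly like this. The tiny model and single training step are stand-ins for your real architecture and fine tuning loop.

```python
# Compressed sketch of eager-mode QAT in PyTorch. TinyNet and the single
# training step are placeholders for a real model and fine-tuning run.
import torch
import torch.nn as nn
from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qat_qconfig, prepare_qat, convert

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # marks where tensors enter the quantized region
        self.fc = nn.Linear(64, 10)
        self.dequant = DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyNet()
model.qconfig = get_default_qat_qconfig("fbgemm")
model = prepare_qat(model.train())    # inserts fake-quant observers

# Fine-tune as usual so the weights adapt to quantization noise (one step shown).
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
loss = nn.functional.cross_entropy(model(torch.randn(8, 64)), torch.randint(0, 10, (8,)))
loss.backward()
opt.step()

model_int8 = convert(model.eval())    # final int8 deployment artifact
```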

Mixed precision is underused. Not every layer needs the same precision. Attention heads often tolerate aggressive quantization while the embedding and final projection layers prefer higher precision. Tools like AWQ and GPTQ for language models, and the various ONNX quantization passes for vision models, expose enough control to mix and match. For a deeper treatment of all the levers available, my guide on model compression explained walks through the full toolbox.

Want to skip the trial and error? I keep a running set of starter projects that show end to end local AI deployment patterns, including the quantization recipes I actually use.

Get the Local AI Starter Projects

How do you choose between CPU, GPU, and dedicated accelerators?

The choice depends on three factors I evaluate in order: the operator coverage of your accelerator, the memory bandwidth available to it, and the power envelope of the deployment.

Operator coverage is the silent killer. A Coral TPU is fast for the operations it supports and useless for the ones it does not. If your model uses dynamic shapes, custom attention variants, or operations the vendor never compiled for the device, you will fall back to CPU and lose all the speedup you bought the chip for. Always run the actual model architecture through the vendor compiler before committing to silicon.
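A cheap way to check coverage before you buy the silicon is to diff the operators your graph actually uses against the vendor's supported list. The supported set below is a placeholder you would fill from the vendor's documentation or compiler report.

```python
# Operator coverage check sketch. SUPPORTED_OPS is an assumed placeholder --
# fill it from the accelerator vendor's documentation or compiler output.
import onnx

model = onnx.load("detector.onnx")   # hypothetical model path
used_ops = {node.op_type for node in model.graph.node}

SUPPORTED_OPS = {"Conv", "Relu", "MaxPool", "Add", "Concat", "Resize"}

unsupported = used_ops - SUPPORTED_OPS
if unsupported:
    print("These ops will fall back to CPU:", sorted(unsupported))
else:
    print("Full operator coverage on the accelerator.")
```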

Memory bandwidth determines what model size you can serve at speed. Language models are memory bandwidth bound during generation, not compute bound. A board with fast LPDDR5 will outperform a board with more theoretical TOPS but slower memory for token generation. This is why the Jetson Orin family punches above its weight class on language workloads.
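The intuition is that every generated token streams the entire weight set through memory once, so bandwidth divided by model size gives a hard ceiling on tokens per second. The numbers below are illustrative assumptions, not datasheet values.

```python
# Rough upper bound on generation speed for a memory-bandwidth-bound model.
# Both figures are illustrative assumptions.
model_size_gb = 0.7     # 1B params at 4-bit plus overhead
bandwidth_gbps = 68     # assumed effective LPDDR5 bandwidth

tokens_per_second = bandwidth_gbps / model_size_gb
print(f"theoretical ceiling: ~{tokens_per_second:.0f} tokens/s")  # real numbers land well below this
```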

Power envelope shapes everything else. A board that draws 25 watts under inference cannot live in a battery powered enclosure. A 5 watt budget rules out almost every GPU and pushes you toward NPUs and tiny models. Decide the power budget before the architecture, not after, because it constrains everything downstream. The broader landscape of compression techniques that make these tradeoffs possible is covered well in my model compression techniques guide.

What does a real local AI deployment workflow look like?

I follow the same loop on every embedded AI project, and it has saved me from a lot of dead ends.

I start by defining the latency, accuracy, and size budget on paper before touching code. If the budget is impossible, the project is impossible, and I would rather know in week one than week ten.

I then prototype on the development workstation with the full precision model. The goal here is not deployment; it is verifying that the task is solvable with current models at all. If a 7B parameter model on my desktop cannot do the task reliably, no amount of compression will save a 1B parameter model on the edge.

Once the task is proven, I select the smallest model family that solves it and quantize aggressively. I measure accuracy on a held out validation set after every quantization pass, because the failure modes are not always obvious from a few example prompts.
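The habit is simple enough to sketch: one evaluation function, run against every candidate artifact. The prediction stub and toy validation set below are placeholders for your own inference wrapper and held out data.

```python
# "Measure after every quantization pass" sketch. predict() and the toy
# validation set are placeholders for a real inference wrapper and real data.
import random

validation_set = [(f"sample_{i}", random.randint(0, 4)) for i in range(100)]  # toy stand-in

def predict(model_path, sample):
    return random.randint(0, 4)  # placeholder: run the quantized model here

def evaluate(model_path):
    correct = sum(predict(model_path, x) == y for x, y in validation_set)
    return correct / len(validation_set)

for candidate in ["model_fp16.onnx", "model_int8.onnx", "model_int4.onnx"]:
    print(f"{candidate}: accuracy {evaluate(candidate):.3f}")  # catch the collapse before it ships
```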

Then I port to the target board, profile under real thermal conditions, and iterate. The first port is always slower than expected. The second is usually fast enough.

Finally, I ship with telemetry. I want to know cold start times, percentile latencies, and accuracy proxies in production. Edge deployments drift in ways cloud deployments do not, and observability is what catches it.

Where should embedded engineers go from here?

If you take one thing from this guide, let it be that the local AI capability ceiling is no longer set by the hardware. It is set by your willingness to engage with quantization, model selection, and runtime tuning as first class engineering concerns. The boards we already deploy can do far more than they are doing.

The WebGPU project I keep coming back to is proof at the consumer end of the spectrum. Five models, all running locally, no server, no API keys, on whatever device the user happens to open the page on. The embedded equivalent is sitting on your bench right now. The Jetson, the Pi, the Coral, the phone in your pocket. They are all waiting for you to ship something on them.

If you want to see exactly how I architect these projects, the full WebGPU walkthrough is on my YouTube channel here: https://www.youtube.com/watch?v=1mix7WnuEK0. And if you want to skip the trial and error and learn directly from engineers shipping local AI in production, come join the AI Engineer community at https://aiengineer.community/join. We talk about this stuff every day, and the people there are exactly who you want in your corner when the model needs to ship next quarter.

Zen van Riel

Senior AI Engineer | Ex-Microsoft, Ex-GitHub

I went from a $500/month internship to Senior AI Engineer. Now I teach 30,000+ engineers on YouTube and coach engineers toward six-figure AI careers in the AI Engineering community.
