Local AI for Government Contractors Working Air Gapped


When I tell people I run AI inside vaults that have no internet connection, they usually picture me wheeling in a server, plugging in an HDMI cable, and somehow conjuring GPT-4 out of thin air. The reality is messier and far more interesting. Government contractors working in air gapped environments operate under constraints that most AI engineers never encounter. No package managers reaching out to PyPI. No model weights pulled from Hugging Face on a whim. No telemetry phoning home to validate a license. Every byte that crosses the boundary has to be accounted for, scanned, and approved.

I have spent enough time in these environments to know that the playbook for cloud AI is almost useless here. What works is a careful, almost archival approach to local AI. You think about provenance. You think about the supply chain. You think about what happens when a CUDA driver update is six months away because nothing leaves the secure facility without a compliance review. This post is my field guide to running serious local AI behind the wire.

Why does air gapped local AI demand a different mindset?

Most tutorials assume you have a fat pipe to the internet. They tell you to run a single command and let the package manager handle the rest. In an air gapped facility that single command is the enemy. Every dependency that tries to fetch something at runtime is a bug. Every library that pings a vendor server to check for updates is a security finding waiting to happen. The mental model has to flip from “pull whatever I need on demand” to “everything I need has to already be present, reviewed, and signed off.”

This is why the local AI conversation matters so much for federal work. Contractors building under FedRAMP High, IL4, IL5, or IL6 boundaries cannot use commercial cloud AI APIs the way a startup can. The data simply cannot leave the enclave. That makes local inference the only real option, and it raises the stakes for getting your environment right the first time. If you are new to running models on your own hardware, my Ubuntu setup guide for AI engineers is a good place to start, because in my experience Ubuntu is what most secure environments standardize on, aside from Red Hat variants.

How do I actually source models when I cannot touch the internet?

Model sourcing is the part nobody talks about. In a normal workflow you point your script at a Hugging Face repository and the weights land on disk. In an air gapped facility you cannot do that. You need a process that brings model weights across the boundary in a controlled, auditable way.

What I do looks like this. On a clean low side workstation I download the model weights, the tokenizer files, and the configuration JSON. I verify the checksums against what the model card publishes. I generate my own SHA-256 hashes and write them to a manifest. Then the whole bundle goes onto approved removable media, gets scanned, gets logged, and crosses the diode or the transfer kiosk into the secure environment. On the high side I verify the hashes again before anyone touches the weights.
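The manifest step can be sketched in a few lines of Python. This is a minimal illustration, not the exact tooling used in any particular facility: the function names and the "hash, two spaces, relative path" manifest layout are assumptions chosen for the example, and it assumes Python 3.9 or later on both sides of the boundary.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte weights never sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bundle_dir: Path, manifest_path: Path) -> None:
    """Low side: record one 'hash  relative/path' line per file in the bundle."""
    lines = []
    for path in sorted(p for p in bundle_dir.rglob("*") if p.is_file()):
        rel_path = path.relative_to(bundle_dir).as_posix()
        lines.append(f"{sha256_of(path)}  {rel_path}")
    manifest_path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def verify_manifest(bundle_dir: Path, manifest_path: Path) -> list[str]:
    """High side: return the relative paths that are missing or do not match."""
    failures = []
    for line in manifest_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines, e.g. from an empty bundle
        expected, rel_path = line.split("  ", 1)
        target = bundle_dir / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel_path)
    return failures
```

Keep the manifest file outside the bundle directory so it does not end up hashing itself. An empty list from `verify_manifest` means every listed file arrived intact. The sketch does not flag extra files that appear in the bundle but not in the manifest, so a stricter check would compare the two file sets as well.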

The choice of model matters too. I gravitate toward models with permissive licenses and clean provenance. Llama family models, Mistral family models, Qwen, and similar weights that are openly distributed and have a clear pedigree. I avoid anything that requires a license server check or that loads adapter code from a remote URL at startup. The model file should be exactly that. A file. No phoning home, no callbacks, no hidden network behavior baked into the loader.
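If the loader in question happens to be the Hugging Face Python stack, the "just a file" rule can be made explicit in configuration. The sketch below is an assumption about one possible setup, not a description of the workflow above: `MODEL_DIR` is a hypothetical path, and `force_offline` is a helper invented for this example. The environment variable names and the two keyword arguments are the ones documented by `huggingface_hub` and `transformers`.

```python
import os

# Switches read by the Hugging Face libraries. They should be set before
# huggingface_hub or transformers is imported.
OFFLINE_ENV = {
    "HF_HUB_OFFLINE": "1",            # huggingface_hub: make no HTTP calls
    "TRANSFORMERS_OFFLINE": "1",      # transformers: use local files only
    "HF_HUB_DISABLE_TELEMETRY": "1",  # send no usage telemetry
}


def force_offline(env=os.environ):
    """Apply the offline switches and return the mapping that was updated."""
    env.update(OFFLINE_ENV)
    return env


# Hypothetical location of weights already verified on the high side.
MODEL_DIR = "/opt/models/example-model"

# Keyword arguments for a from_pretrained-style loader call.
LOADER_KWARGS = {
    "local_files_only": True,    # fail instead of attempting a download
    "trust_remote_code": False,  # never execute code shipped with the model repo
}
```

With these in place, a call such as `AutoModelForCausalLM.from_pretrained(MODEL_DIR, **LOADER_KWARGS)` should raise an error rather than reach for the network when a file is missing. Treat the settings as a belt-and-braces measure. They do not replace verifying the behavior with a network monitor.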

What inference stack survives with no internet update path?

Once the weights are inside, the question becomes what runs them. This is where I get picky. Many popular local AI tools assume occasional internet access. They check for updates on launch. They fetch tokenizer fixes. They load a remote configuration file the first time you run a new model. None of that works in an air gapped facility, and worse, it can trigger alerts that put your whole project under review.

I lean on inference engines that are fully self contained. llama.cpp compiled from source. vLLM running from a vendored wheel set. TensorRT-LLM when the hardware supports it and the team has the skills to maintain it. Whatever I pick, I make sure it can start, load a model, and serve requests without a single outbound packet. I test that explicitly with network monitoring before anything ships to a customer environment.

This is also where my Linux versus Windows VRAM analysis becomes operationally important. In a constrained environment where you cannot just buy bigger GPUs on a whim, the roughly eight hundred megabytes of VRAM that Linux saved over Windows in that analysis can be the difference between a model fitting and not fitting. I have seen procurement cycles for a single GPU stretch over a year inside government programs. You squeeze every megabyte out of what you already have.

How do I provision CUDA drivers without breaking compliance?

CUDA is its own special problem. Nvidia ships drivers, the CUDA toolkit, cuDNN, and a stack of libraries that all need to align with the kernel version, the GPU architecture, and the inference engine you picked. In a normal lab you run the installer and move on. In an air gapped lab you have to bring all of that across the boundary as signed packages, verify them, and install them with no connectivity.

My approach is to build a known good driver bundle on a low side machine that mirrors the high side hardware exactly. Same GPU, same kernel, same Ubuntu version. I install the Nvidia driver and CUDA toolkit, capture the exact deb files used, and bundle them with their dependencies. That bundle becomes the artifact that crosses the boundary. On the high side I install from those local debs. No apt update, no network calls, just a clean install from files that are already present and approved.

When something breaks, and something always breaks, you cannot just pull a newer driver. You have to plan for that. I keep two known good driver bundles on hand. The current one and the previous one. If the new one misbehaves I can roll back without waiting weeks for a new transfer cycle. This is the kind of operational discipline that separates contractors who deliver from contractors who get stuck.
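The bundle and rollback idea can be sketched as a small helper that builds the install command without running it. The layout is an assumption made for this example: a root directory with `current` and `previous` subdirectories, each holding the approved `.deb` files. `pick_bundle` and `local_install_command` are hypothetical names. The only external tool referenced is `dpkg --install`, which installs from the files it is given and does not contact a repository.

```python
from pathlib import Path


def pick_bundle(root: Path, rollback: bool = False) -> Path:
    """Choose the current bundle, or the previous one when rolling back."""
    return root / ("previous" if rollback else "current")


def local_install_command(bundle_dir: Path) -> list[str]:
    """Build a dpkg command that installs every .deb in an approved bundle."""
    debs = sorted(str(path) for path in bundle_dir.glob("*.deb"))
    if not debs:
        raise FileNotFoundError(f"no .deb files found in {bundle_dir}")
    # Passing the whole set at once lets dpkg unpack everything before it
    # configures, which only works if the bundle really includes every
    # dependency.
    return ["dpkg", "--install", *debs]
```

Returning the argument list instead of executing it keeps the step reviewable: the exact command can be logged or attached to a change record, then run with `subprocess.run(command, check=True)` under the appropriate privileges.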

What does an offline package mirror actually look like?

Python packages are the next minefield. Your AI code depends on torch, transformers, numpy, and probably another hundred indirect dependencies. None of that can come from PyPI in real time. You need an offline mirror.

I usually maintain a curated wheelhouse. On a low side machine I create a fresh virtual environment, install exactly the dependencies my project needs, and then use pip download to pull every wheel into a directory. I include the platform specific wheels for the target architecture. I include source distributions for anything that has to compile. The whole directory becomes a tarball that crosses the boundary alongside the model weights and the driver bundle.

On the high side, pip installs from that local directory and never touches the network. For larger programs I have set up internal devpi servers or Nexus repositories that act as a permanent mirror, so multiple teams can install from the same vetted package set. Either way, the principle is the same. Nothing comes from the public internet at install time. Everything was already reviewed before it crossed the wire.
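The two halves of the wheelhouse workflow map onto two pip invocations. The sketch below only builds the argument lists. The function names and the idea of driving pip from Python are assumptions made for illustration, while the flags themselves (`download --dest`, `install --no-index --find-links`) are standard pip options.

```python
import sys


def download_command(requirements: str, wheelhouse: str) -> list[str]:
    """Low side: fetch every pinned requirement and its dependencies."""
    return [
        sys.executable, "-m", "pip", "download",
        "--requirement", requirements,
        "--dest", wheelhouse,
    ]


def offline_install_command(requirements: str, wheelhouse: str) -> list[str]:
    """High side: resolve only against the local directory, never an index."""
    return [
        sys.executable, "-m", "pip", "install",
        "--no-index",                # do not consult PyPI or any other index
        "--find-links", wheelhouse,  # look for distributions in this directory
        "--requirement", requirements,
    ]
```

Each list can be passed to `subprocess.run(command, check=True)`. If the low side machine does not match the target interpreter or architecture, `pip download` also accepts `--platform`, `--python-version`, and `--only-binary=:all:` to fetch wheels for a different target. Whether that is needed depends on how closely the two machines mirror each other.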

If you are building production grade systems on top of this, my guide to building production RAG systems walks through the architectural patterns that hold up under real workloads. The same patterns work air gapped. You just have to be ruthless about which dependencies you bring with you.

Want practical local AI projects you can adapt for secure environments?

I publish open source local AI starter projects that are deliberately built to run without external services. They use local model files, local vector stores, and local inference. You can browse my open source local AI projects and use them as a starting point for your own air gapped builds. The code is structured so you can vendor every dependency and run the whole stack offline.

Why do telemetry-heavy tools fail in classified environments?

This is the trap that catches a lot of teams new to government work. They pick a slick local AI tool, get it working in a lab, and then discover during the security review that the tool sends usage analytics, checks for updates on launch, or validates a cloud license. Every one of those behaviors is a non starter inside a classified or controlled enclave.

I treat telemetry as a binary filter. If a tool cannot be configured to be fully silent on the network, I do not use it for air gapped work. I read the source. I run it under a network monitor. I look at startup behavior, idle behavior, and behavior on first model load. Anything that reaches out, even once, even just to check for updates, gets cut. There are plenty of fully local alternatives. I would rather invest the time finding a clean tool than fight a compliance battle later.
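Reading the source goes faster with a first pass that lists every hard-coded URL in a vendored package. The scanner below is a rough sketch with assumptions of its own: the `find_hardcoded_urls` name, the regular expression, and the list of file suffixes are all illustrative. A URL in a docstring is harmless and an endpoint assembled at runtime will not show up at all, so the output is a reading list, not a verdict.

```python
import re
from pathlib import Path

# Matches http and https URLs up to the next whitespace, quote, or bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")


def find_hardcoded_urls(source_dir: Path,
                        suffixes=(".py", ".json", ".toml", ".yaml")):
    """Yield (relative path, line number, url) for every URL literal found."""
    for path in sorted(source_dir.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            for match in URL_PATTERN.finditer(line):
                rel_path = path.relative_to(source_dir).as_posix()
                yield rel_path, line_number, match.group(0)
```

Running `list(find_hardcoded_urls(Path("vendor/some_tool")))` on a candidate tool (the path is hypothetical) gives a short set of lines to inspect by hand before the tool ever goes under a network monitor.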

This applies to model loaders, vector databases, observability stacks, and even the editor people use to write code. The whole environment has to be quiet on the network. My post on data privacy in AI covers the broader privacy principles that underpin this kind of thinking, and they apply with extra force when the data in question is controlled unclassified information or higher.

How do FedRAMP and IL4 considerations shape the architecture?

I am not going to pretend a blog post can substitute for a real authorization to operate. What I can say is that the architectural choices you make for local AI line up well with what FedRAMP and the DoD impact levels expect. A self contained inference stack with no outbound connectivity, vetted models with documented provenance, an offline package supply chain, and explicit driver provisioning are exactly the kinds of things assessors want to see.

The work I do at the higher impact levels lives in environments where every component has been reviewed, every package has a documented source, and every model has a chain of custody from download to deployment. The local AI mindset I described in this post is not an optimization. It is the baseline. If your stack does any of this badly, it will not survive an authorization review, no matter how clever your prompts are.

For contractors who want to build this capability into their delivery, the real differentiator is being able to demonstrate the discipline, not just the model. Anyone can get Llama 3 running on a workstation. Far fewer teams can get it running inside an enclave with full traceability, predictable updates, and zero network noise. That is the skill set that wins recompetes.

What is the path forward for engineers who want this niche?

Local AI for air gapped government work is one of the most underserved corners of the AI engineering field. The demand is real, the budgets are real, and the people who can do this well are rare. If you can combine solid local AI fundamentals with the operational discipline of a cleared environment, you are unusually valuable.

The way in is to build the muscle on your own hardware first. Run native Linux. Run your models without internet. Vendor your dependencies. Monitor your network. Treat every external call as a problem to solve. Once those habits are second nature, the air gapped version is just a stricter application of the same principles.

I dig into the operating system side of all this in my full benchmark video on YouTube: https://www.youtube.com/watch?v=wudNmLHcZeE. And if you want to be in the room with other engineers building serious local AI systems, including a few who work in regulated environments, come join the community at https://aiengineer.community/join. It is the fastest way I know to go from curious about local AI to genuinely employable in this niche.

Zen van Riel

Senior AI Engineer | Ex-Microsoft, Ex-GitHub

I went from a $500/month internship to Senior AI Engineer. Now I teach 30,000+ engineers on YouTube and coach engineers toward six-figure AI careers in the AI Engineering community.
