Local AI for Healthcare Engineers Building HIPAA Compliant Tools


I get the same question from healthcare engineers almost every week. They watch a demo of an AI assistant pulling answers from documents, they see how much time it could save their clinicians, and then they hit a wall. The wall is always the same. The data they want the AI to read contains protected health information, and sending PHI to an external API endpoint is not a conversation they want to have with their compliance officer.

This post is the engineering answer to that wall. I am not going to pretend to be a lawyer, and nothing here is legal advice. What I can do is show you the architecture I use when I want a system that keeps every byte of patient data inside a network you control. The pattern works for clinicians searching internal guidelines, for engineers building intake tooling, and for anyone who needs the convenience of AI search without shipping data to OpenAI or Anthropic.

I built a self hosted AI search engine in a recent video, and the same building blocks map directly onto a HIPAA aligned deployment. The ingredients are a local language model, a metasearch layer that points at internal sources, and a deployment topology where nothing leaves the boundary you set. Let me walk you through how that works.

Why Does HIPAA Push You Toward Local AI in the First Place?

Most managed AI providers will sign a Business Associate Agreement if you push hard enough and pay enough. That is one valid path. The path I prefer for sensitive workloads is much simpler. If PHI never leaves your network for inference, no outside vendor handles it during that step, so there is no business associate involved in inference and no BAA to negotiate for it. You still have plenty of compliance work to do around storage, access control, and audit logging, and if you rent the hardware, your hosting provider is still part of that work. But you have removed an entire category of vendor risk by never making the API call in the first place.

This is the core mental shift I want healthcare engineers to make. Local AI is not just a cost optimization or a privacy preference. It is an architectural choice that changes which compliance conversations you need to have. I covered the broader tradeoffs in my local versus cloud LLM decision guide, and the healthcare case is where those tradeoffs become sharpest.

What Does the Reference Architecture Actually Look Like?

The system I demoed in the video has three layers, and each layer maps cleanly onto the healthcare use case.

The first layer is the language model itself, running locally through Ollama or a similar runtime. In the video I started with a small Llama variant, then switched to a larger Phi model because I noticed the smaller one was not reliably citing sources. That observation matters here. In healthcare you cannot accept hallucinated citations, so model selection is not a cosmetic decision. You pick a model that is large enough to follow instructions about grounding, and you run it on hardware you own or rent in a controlled environment.
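Here is a minimal sketch of what talking to the local model can look like. It assumes Ollama is listening on its default local port (11434) and uses its `/api/generate` endpoint. The prompt wording, the helper names, and whatever model tag you pass in are my illustrative assumptions, not the exact setup from the video.

```python
import json
import urllib.request

# Assumption: Ollama running locally on its default port.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_grounded_prompt(question: str, sources: list[dict]) -> str:
    """Number each source so the model can cite it as [1], [2], ..."""
    numbered = "\n".join(
        f"[{i}] {s['title']}: {s['text']}" for i, s in enumerate(sources, start=1)
    )
    return (
        "Answer using ONLY the numbered sources below. "
        "Cite every claim like [1]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}"
    )


def build_request(model: str, prompt: str) -> dict:
    """Payload for a single, non-streaming completion."""
    return {"model": model, "prompt": prompt, "stream": False}


def ask_local_model(model: str, prompt: str, url: str = OLLAMA_URL) -> str:
    """POST the prompt to the local runtime and return the generated text."""
    body = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Keeping the prompt builder separate from the network call means you can test the grounding instructions without a running model, and swap the runtime later without touching the prompt.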

The second layer is the search and retrieval layer. In the public version I used SearXNG, which queries multiple public engines and combines the results. For a healthcare deployment you swap that out. Instead of pointing at Bing and DuckDuckGo, you point at your internal document stores, your clinical guideline repositories, your formularies, your internal wikis. The retrieval pattern is the same one I describe in my guide on building production RAG systems. The only thing that changes is the source of the documents and the access controls in front of them.
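To make the swap concrete, here is a deliberately tiny stand-in for an internal retrieval layer. It ranks documents by plain keyword overlap, which is a simplification I chose for illustration and not what a production index would do. The `Doc` type and `search_internal` function are hypothetical names.

```python
from dataclasses import dataclass

_PUNCT = ".,;:()?!"


@dataclass(frozen=True)
class Doc:
    doc_id: str
    title: str
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip(_PUNCT).lower() for w in text.split() if w.strip(_PUNCT)}


def search_internal(query: str, docs: list[Doc], top_k: int = 3) -> list[Doc]:
    """Return up to top_k docs that share at least one word with the query."""
    q = tokenize(query)
    scored = []
    for doc in docs:
        overlap = len(q & tokenize(doc.title + " " + doc.text))
        if overlap:
            scored.append((overlap, doc))
    # Sort by overlap only; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```

The point is the shape of the interface: a query goes in, a short ranked list of internal documents comes out, and nothing in the function knows about the public web.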

The third layer is the orchestration layer that takes a user query, runs retrieval, hands the results to the local model, and streams the answer back with citations. This is the layer where you enforce that every factual claim links back to a source the user can open and verify. That is a habit borrowed from the AI search world, and it is exactly the habit you want in clinical tooling.
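A rough sketch of that orchestration step, under a few assumptions: retrieval and generation are passed in as plain callables, sources are dicts with hypothetical `text` and `url` keys, and the model is asked to cite with bracketed numbers. Streaming is left out to keep the example short.

```python
import re
from typing import Callable


def answer_with_citations(
    question: str,
    retrieve: Callable[[str], list[dict]],
    generate: Callable[[str], str],
) -> dict:
    """Retrieve, generate, and only return answers whose citations resolve."""
    sources = retrieve(question)
    if not sources:
        return {"answer": "No internal source covers this question.", "citations": []}

    context = "\n".join(f"[{i}] {s['text']}" for i, s in enumerate(sources, start=1))
    prompt = (
        "Use only these sources and cite them like [1].\n"
        f"{context}\n\nQuestion: {question}"
    )
    answer = generate(prompt)

    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    valid = [n for n in cited if 1 <= n <= len(sources)]
    if not valid or len(valid) != len(cited):
        # Refuse rather than show an uncited or mis-cited answer.
        return {"answer": "The model did not produce a verifiable answer.", "citations": []}

    return {"answer": answer, "citations": [sources[n - 1]["url"] for n in valid]}
```

This only checks that each citation points at a real retrieved source. Whether the source actually supports the claim is a separate question that belongs in evaluation.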

How Do You Keep PHI From Ever Leaving the Network?

This is where the engineering discipline matters. A local model on its own is not enough. You need to be deliberate about every place data could escape.

Start with the model runtime. Run it on infrastructure inside your boundary. That can be on premise, a private cloud subscription, or a controlled enclave. Disable telemetry. Confirm there are no outbound calls during inference. Most local runtimes are quiet by default, but you should verify rather than assume.
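One cheap check you can automate is that every model endpoint in your configuration resolves to a loopback or private address. The sketch below treats "private or loopback" as a proxy for "inside your boundary", which is my assumption and may not match how your network is actually drawn. It checks configuration only; it does not prove the runtime makes no outbound calls, so pair it with egress rules and traffic inspection.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_inside_boundary(url: str) -> bool:
    """True only if every address the host resolves to is loopback or private."""
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False  # Unresolvable hosts fail closed.
    for info in infos:
        # Drop any IPv6 zone suffix such as "%eth0" before parsing.
        ip = ipaddress.ip_address(info[4][0].split("%")[0])
        if not (ip.is_loopback or ip.is_private):
            return False
    return bool(infos)
```

Running this at startup and refusing to boot on a `False` result turns a silent misconfiguration into a loud one.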

Next, audit the retrieval layer. If you reuse a metasearch tool that was designed for public web search, double check that you have removed every external engine from its configuration. The same tool that is convenient because it queries Google and Bing becomes a liability if a misconfiguration sends a query containing PHI out to those engines. The fix is simple. You replace the public engines with your internal connectors and you put a network policy in place that blocks egress for that container entirely.
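That double check is easy to script. The sketch below takes an already parsed settings dictionary and flags every enabled engine that is not on an allowlist of internal connectors. The `engines`, `name`, and `disabled` keys mirror the general shape of a SearXNG settings file, but treat them as assumptions and verify them against the version you run. If your setup merges in default engines, audit the merged result rather than only your own file.

```python
def external_engines(settings: dict, internal_allowlist: set[str]) -> list[str]:
    """Return names of enabled engines that are not approved internal connectors."""
    flagged = []
    for engine in settings.get("engines", []):
        name = engine.get("name", "<unnamed>")
        enabled = not engine.get("disabled", False)
        if enabled and name not in internal_allowlist:
            flagged.append(name)
    return flagged


def assert_internal_only(settings: dict, internal_allowlist: set[str]) -> None:
    """Fail loudly (for example in CI or at startup) if anything external is enabled."""
    flagged = external_engines(settings, internal_allowlist)
    if flagged:
        raise RuntimeError(f"External engines still enabled: {flagged}")
```

An allowlist is the safer direction here: a new public engine added by an upgrade gets flagged by default instead of slipping through a blocklist. The egress block on the container remains the real backstop.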

Then think about the prompts. The system prompt and the user prompt are both places where PHI lives momentarily. They should never be logged in plaintext to a system that is outside your boundary. If you use a hosted observability tool, either run it locally too or strip identifiers before anything leaves. This is where a de-identification step run by a local model can earn its keep. You can use a small local model to scrub names, dates of birth, and identifiers from text before it flows into any logging or analytics pipeline that might extend beyond your perimeter.
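To show where a scrubbing step sits, here is a small sketch that uses regular expressions instead of a local model. That is a deliberate substitution to keep the example self-contained: patterns like these catch only rigidly formatted identifiers and will miss names and free-text details, so treat it as a placeholder for the model-based scrubber, not a replacement. The patterns and placeholder tokens are my assumptions.

```python
import re

# Assumed formats; real data will need patterns tuned to your own systems.
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "[DATE]"),
    (re.compile(r"\bMRN[:#]?\s*\d+\b", re.IGNORECASE), "[MRN]"),
]


def scrub(text: str, known_names: tuple[str, ...] = ()) -> str:
    """Replace formatted identifiers and any explicitly supplied names."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    for name in known_names:
        text = re.sub(re.escape(name), "[NAME]", text, flags=re.IGNORECASE)
    return text
```

Call it on every string right before the logging or analytics call, so nothing reaches those pipelines unscrubbed by accident.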

If you want a deeper treatment of the privacy side of this, my post on data privacy in AI walks through the broader threat model.

What About Audit Trails and Access Controls?

Compliance teams care about who saw what, when, and through which system. AI tools are not exempt. The good news is that a self hosted architecture makes audit logging straightforward, because every component is something you operate.

Log every query at the orchestration layer with the authenticated user, the timestamp, the documents that were retrieved, and the response that was generated. Store those logs in the same audit system you already use for your other clinical applications. When a reviewer asks why a clinician received a particular answer, you can reconstruct the entire chain.
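A sketch of what one audit record could look like, serialized as a single JSON line. The field names are hypothetical; match them to whatever schema your existing audit system expects. Because the record contains the query and the response, it contains PHI and must stay inside your boundary.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditRecord:
    user_id: str
    timestamp: str  # ISO 8601, UTC
    query: str
    retrieved_doc_ids: list[str]
    response: str


def make_record(
    user_id: str,
    query: str,
    doc_ids: list[str],
    response: str,
    now: Optional[datetime] = None,
) -> AuditRecord:
    """Build a record; `now` is injectable so tests can pin the timestamp."""
    now = now or datetime.now(timezone.utc)
    return AuditRecord(user_id, now.isoformat(), query, list(doc_ids), response)


def to_log_line(record: AuditRecord) -> str:
    """One JSON object per line, with stable key order for easy diffing."""
    return json.dumps(asdict(record), sort_keys=True)
```

Recording document IDs rather than document text keeps the log small while still letting a reviewer reconstruct exactly what the model was shown.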

Access control is where the retrieval layer earns its complexity. A clinician should only be able to retrieve documents they are authorized to see. If your search index does not enforce that, the AI will happily summarize content the user was never supposed to read. The pattern that works is to filter retrieval by the user's role and patient relationships before the documents reach the model, not after. Filtering after the fact is a leak waiting to happen.
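Here is a minimal sketch of filtering before the model sees anything. The `User` and `Doc` shapes, and the rule that a patient-specific document requires a matching patient relationship, are assumptions for illustration. Your real authorization rules will come from your identity and records systems.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    user_id: str
    roles: frozenset
    patient_ids: frozenset  # patients this user has a care relationship with


@dataclass(frozen=True)
class Doc:
    doc_id: str
    allowed_roles: frozenset
    patient_id: Optional[str] = None  # None means not patient-specific


def authorized(user: User, doc: Doc) -> bool:
    """Require a matching role, plus a patient relationship for patient docs."""
    if not (user.roles & doc.allowed_roles):
        return False
    return doc.patient_id is None or doc.patient_id in user.patient_ids


def filter_for_user(user: User, docs: list) -> list:
    """Run this on retrieval results BEFORE building the model prompt."""
    return [d for d in docs if authorized(user, d)]
```

Anything that fails the check never enters the prompt, so there is nothing for the model to summarize or leak.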

If you want a starting point for these patterns, browse my open source local AI projects for examples you can adapt. Several of them are deliberately structured so you can see where to plug in your own access control and audit hooks.

How Do You Pick the Right Model Without Calling External APIs?

Model selection in a healthcare context is its own discipline. You are choosing for instruction following, for citation discipline, and for safety on edge cases, not for raw benchmark scores. In the video I had to switch from a small model to a larger one because the small one was not reliable about returning sources. That same lesson applies here, just with higher stakes.

I run an evaluation harness locally before any model goes near a real workload. I feed it a curated set of representative queries, I compare the outputs against ground truth, and I check whether citations actually support the claims. Open weights models from the Llama, Phi, Qwen, and Mistral families all have variants that are worth testing. The right answer for your workload depends on your hardware budget and your latency tolerance, and you can only learn it by measuring.
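A stripped-down sketch of that kind of harness. It assumes each test case carries a hypothetical `expected` phrase and a list of `supporting_sources` (the source numbers that genuinely back the answer), and it uses substring matching as a crude stand-in for comparing against ground truth. A real harness would need a stronger notion of correctness.

```python
import re
from typing import Callable


def cited_indices(answer: str) -> set:
    """Extract citation numbers written like [1] or [2]."""
    return {int(n) for n in re.findall(r"\[(\d+)\]", answer)}


def evaluate(cases: list, generate: Callable[[str], str]) -> dict:
    """Pass a case only if the expected fact appears AND every citation is supported."""
    passed = 0
    failures = []
    for case in cases:
        answer = generate(case["prompt"])
        has_fact = case["expected"].lower() in answer.lower()
        cited = cited_indices(answer)
        cites_right = bool(cited) and cited <= set(case["supporting_sources"])
        if has_fact and cites_right:
            passed += 1
        else:
            failures.append(case["id"])
    return {
        "pass_rate": passed / len(cases) if cases else 0.0,
        "failures": failures,
    }
```

Because `generate` is just a callable, you can run the same cases against each candidate model and compare pass rates side by side.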

This is also the place where the patterns from my AI system design article show up. Caching, batching, and routing between models of different sizes are how you keep a local deployment fast enough to feel like a product rather than a science experiment.
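As a rough illustration of two of those ideas, the sketch below combines an in-memory cache with a router that sends short prompts to a small model and everything else to a large one. Routing on word count is a placeholder heuristic I made up for the example, and batching is omitted. Note that the cache holds prompts in memory, so it needs to live inside the same boundary as everything else.

```python
from typing import Callable


class RoutedModel:
    """Cache answers and route prompts between a small and a large model."""

    def __init__(
        self,
        small: Callable[[str], str],
        large: Callable[[str], str],
        max_small_words: int = 20,
    ) -> None:
        self._models = {"small": small, "large": large}
        self.max_small_words = max_small_words
        self._cache = {}
        self.calls = {"small": 0, "large": 0}

    def route(self, prompt: str) -> str:
        # Hypothetical heuristic: short prompts go to the small model.
        return "small" if len(prompt.split()) <= self.max_small_words else "large"

    def ask(self, prompt: str) -> str:
        if prompt in self._cache:
            return self._cache[prompt]  # Cache hit: no model call at all.
        tier = self.route(prompt)
        self.calls[tier] += 1
        answer = self._models[tier](prompt)
        self._cache[prompt] = answer
        return answer
```

The call counters make it easy to measure how much traffic each tier absorbs before you commit to hardware.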

What Is the Cleared Deployment Pattern I Recommend?

When I help a healthcare team go from prototype to production, the path looks the same almost every time. You start with a contained pilot, ideally on synthetic or de-identified data, so you can iterate on the model and the retrieval layer without compliance friction. Once the system behaves the way you want, you move to a controlled enclave with real data, you wire in the audit and access control hooks, and you put it in front of a small group of clinicians who agree to give honest feedback.

The reason this sequence works is that it separates the engineering risks from the compliance risks. You finish the engineering on synthetic data, then you run the compliance review on a system that is already known to work. Trying to do both at once is how projects stall for a year.

If you have not seen the underlying architecture in action, watch the build video here. It is the public web search version, but every component shown maps directly onto the private healthcare version I described above.

Watch the full build on YouTube.

If you want to talk through your specific deployment with other engineers working on the same problems, come join us in the AI Engineer community at aiengineer.community/join. The healthcare engineers in there are some of the sharpest people I get to work with, and the conversations about local AI architectures are exactly the ones you cannot have on public forums.

Zen van Riel

Senior AI Engineer | Ex-Microsoft, Ex-GitHub

I went from a $500/month internship to Senior AI Engineer. Now I teach 30,000+ engineers on YouTube and coach engineers toward six-figure AI careers in the AI Engineering community.
