Local AI RAG Pipeline Without Sending Data to OpenAI

I built my own private version of Google last week. No API keys, no usage dashboards, no data going to OpenAI or Anthropic. Every query, every embedding, every generated answer stayed on my machine. If you have ever felt uncomfortable shipping internal documents, customer data, or proprietary research through a cloud LLM endpoint, this is the architecture you have been looking for.

The premise is simple. A retrieval augmented generation pipeline has three moving parts that talk to AI models. The embedding step that turns your documents into vectors. The vector store that holds those vectors. The synthesis step where an LLM reads retrieved chunks and writes an answer. Most tutorials assume the embedding model and the LLM live behind a paid API. They do not have to. You can replace every single one of those calls with a local equivalent and get a working system that never leaks a byte.

This post walks through the full local stack. I will explain the choices, the tradeoffs, and the configuration decisions that actually matter when you flip the switch from cloud to fully self hosted.

Why would I run a RAG pipeline locally instead of using OpenAI?

There are three reasons people end up here, and they usually arrive in this order.

The first reason is privacy. If you work in healthcare, legal, finance, or any regulated industry, the legal team will not let you paste client data into a third party API. It does not matter how many SOC 2 reports the vendor has. The simplest way to satisfy a privacy review is to make the data physically incapable of leaving the building.

The second reason is cost. Embedding ten million documents through a paid API gets expensive fast. Running the same workload on a local GPU costs you electricity. If you are processing large corpora repeatedly, the math tilts toward local within weeks.

The third reason is control. When OpenAI deprecates a model, your pipeline changes whether you want it to or not. When you self host, the model on disk today is the model you run next year. For research workflows and reproducible experiments, that stability is worth a lot.

I covered the broader strokes in my complete guide to building production RAG systems. This post drills specifically into the no cloud variant.

What does a fully local RAG architecture actually look like?

Think of it as four layers that all run on your machine or your private network.

At the bottom you have a document store. This is where the raw source files live. PDFs, markdown, transcripts, scraped pages, internal wikis. Nothing fancy, just a folder or a database.

Above that you have an embedding service. A local embedding model reads each chunk of text and produces a vector. Popular open weights options include BGE from BAAI, the nomic embed family, and mxbai embed large. All of them run comfortably on CPU for small workloads and absolutely fly on a consumer GPU. I tend to default to BGE for English heavy corpora and nomic when I need a longer context window.

Above that you have a vector store. This is where the embeddings get indexed for fast similarity search. You have real options here and the right one depends on your scale. I cover the practical differences in my breakdown of Chroma for local development and the comparison between pgvector and dedicated vector databases.

At the top you have a local LLM doing the synthesis. This is the model that reads the retrieved chunks and writes the final answer. Ollama makes this trivial. You pull a model once and you have a local API endpoint that speaks the OpenAI protocol.

The key insight is that every layer in this stack has a mature open source option. Nothing is missing. The pieces just need to be wired together correctly.

Which local embedding models should I use for production quality results?

The embedding model is where most people get tripped up. They assume that since OpenAI charges money for text-embedding-3-large, it must be meaningfully better than what you can run for free. It is not. The open source embedding leaderboards have been tightly competitive for over a year, and the top open models are within a few points of the best commercial options on retrieval benchmarks.

My current shortlist looks like this.

BGE large from BAAI is my default for English documents. It is fast, the retrieval quality is excellent, and it is small enough to run on CPU if you really have to. The vector dimension is sensible at 1024, which keeps your storage costs down.

Nomic embed text is what I reach for when I need long context. It handles inputs up to 8192 tokens, which means you can chunk less aggressively and preserve more semantic structure per vector.

Mxbai embed large is a strong all rounder that frequently lands at the top of the MTEB leaderboard. If you want a single model and you do not want to think about it, this is a safe pick.

All three run inside Ollama or directly through Hugging Face transformers. You point your pipeline at localhost instead of api.openai.com and you are done. The retrieved results will be comparable in quality, and you will sleep better at night knowing the embeddings of your private data are sitting on your own SSD.

How do I pick a vector store that runs entirely on my machine?

There is no single right answer, but the decision tree is short.

If you are prototyping or your corpus is under a few hundred thousand chunks, use Chroma. It runs in process, it persists to disk, and it has an extremely friendly Python API. You can have a working index in fifteen minutes.

If you already run Postgres in your stack, use pgvector. The operational story is identical to any other Postgres extension. You get transactions, joins against your relational data, and a battle tested backup story. I wrote up the tradeoffs in detail in pgvector vs dedicated vector databases.

If you are heading toward production scale with millions of vectors and aggressive latency requirements, Qdrant is my pick. It runs in a Docker container, it has excellent filtering support, and it scales horizontally when you eventually need that.

For a private RAG pipeline, the operational simplicity matters more than the absolute throughput numbers. Pick the option that you can debug at three in the morning. For most readers that means Chroma first, pgvector second, Qdrant when you outgrow both.

The important point is that none of these tools call home. Once you pull the container or install the package, they work entirely offline. You can airgap the whole machine and the vector store will not notice.

Want the exact stack I run?

I keep my fully local AI projects organized in one place so you can clone them and have a working pipeline in an afternoon. If you want the embedding configurations, the docker compose files, and the example chunkers I actually use, grab them from my open source projects page. Everything there is built on the no cloud principle this post describes.

Which local LLM should I use for the synthesis step?

This is the layer where model size matters most. Embedding quality is roughly flat across the top open models. Generation quality is not. A small model will hallucinate, miss nuance, and ignore parts of the retrieved context. A larger model will follow your instructions and actually use the sources you handed it.

In the video that accompanies this post, I demonstrate this concretely. I tried using a small Llama 3.2 model for a self hosted search experience and the answers were not reliable. Switching to a Phi 4 class model immediately fixed the citation behavior. The model was finally large enough to read the retrieved sources and quote them faithfully.

For local RAG synthesis on a single workstation, my current recommendations are these.

Phi 4 from Microsoft is a remarkable middle ground. It is small enough to run on a 16 gigabyte GPU and large enough to produce coherent grounded answers.

Llama 3.3 70B is the right call if you have the hardware. The grounding quality and instruction following are excellent and it handles complex multi document synthesis without losing the thread.

Qwen 2.5 32B sits between the two. It is a workhorse for general purpose RAG and frequently the best speed to quality tradeoff.

Pull whichever you choose through Ollama, point your RAG pipeline at the local endpoint, and you are running an end to end private inference loop. Nothing leaves the machine.

How do I make local search feel as good as a commercial product?

This is where most local RAG attempts fall short. The embedding works, the retrieval works, the LLM generates an answer, but the experience feels clunky compared to Perplexity or Google.

The trick is to add a meta search layer for live web queries when you need them, and keep your private corpus retrieval entirely local. The Perplexica project is a good example of this hybrid pattern. It uses SearXNG under the hood to aggregate results from multiple search engines, then hands the aggregated context to a local LLM for synthesis. I broke down the architecture in Perplexica vs SearXNG for self hosted search and the broader case for owning your search experience in self hosted search advantages.

You can run the same pattern against your own document corpus. Replace the SearXNG layer with your local vector search, keep the synthesis LLM exactly as it is, and you have an internal Perplexity for your private knowledge base. The frontend pattern is identical. The privacy story is dramatically better.

For a fully airgapped setup you skip the meta search entirely and only retrieve from your local index. For a hybrid setup you add public web search as an optional source the user can toggle. Either way the synthesis stays on your hardware.

What does the end to end flow look like in practice?

Let me walk through a single query as it moves through the system, so the architecture stops feeling abstract.

A user types a question into the frontend. The frontend posts the question to a backend running on localhost. The backend embeds the question using the local BGE model loaded inside Ollama. The resulting vector is sent to the local Chroma index, which returns the top retrieved chunks from the user’s private corpus. The backend assembles those chunks into a prompt and calls the local Phi 4 model, again through Ollama. The model streams an answer back to the frontend. The frontend renders the answer with inline citations linking to the retrieved sources.

At no point in that flow does a packet leave the machine. The embedding model is local. The vector store is local. The LLM is local. Even the source documents never moved. You have built a complete RAG product where the privacy boundary is the chassis of your computer.

This is the part that people underestimate before they try it. Once you have run a query through a stack like this, the cloud version feels unnecessarily intrusive for any task that involves your own data. You stop reaching for the API key.

What about hardware, am I going to need a server rack?

A surprising amount of this runs on consumer hardware. A modern laptop with 32 gigabytes of RAM and an Apple Silicon chip will handle BGE embeddings and a Phi 4 synthesis model without complaint. For larger corpora you eventually want a dedicated machine with a beefier GPU, but there is a wide middle ground where the local stack is not just possible, it is faster than the cloud round trip would be.

Where do I go from here?

If you want to see the self hosted search version of this stack in action, including the local LLM swapping and the SearXNG configuration, the full walkthrough is on YouTube at Build Your Private Google (Self-Hosted AI Search). The video shows the pieces I described above wired together into a working private search engine.

If you want to discuss your specific use case, share what you are building, or get feedback on a local RAG architecture you are designing, come join the engineers building this stuff with me at aiengineer.community. The conversations there are exactly the ones you want to be part of as private AI infrastructure becomes the default.

Zen van Riel

Senior AI Engineer | Ex-Microsoft, Ex-GitHub

I went from a $500/month internship to Senior AI Engineer. Now I teach 30,000+ engineers on YouTube and coach engineers toward six-figure AI careers in the AI Engineering community.

Blog last updated Jul 7, 2026