GGUF Export After LoRA Training Step by Step
When I finished my first weekend of LoRA training on my home lab, I had a working adapter that made an open source Qwen model sound like me. The fine-tune passed my evaluation tests. The tone was right. The thinking traces were disabled. The responses were short and direct, just like I talk on the channel. But there was still one big problem. None of that mattered if I could not actually run the model on the tools I use every day. That is where the GGUF export after LoRA training step by step process comes in, and it is the part of the pipeline that almost nobody walks through honestly.
In this guide, I share how I take a freshly trained LoRA adapter and turn it into a single GGUF file that loads cleanly into Ollama and LM Studio. I cover merging the adapter into the base, running the llama.cpp conversion script, picking the right quantization, validating the output, and registering the model with a Modelfile. My video covers the broader fine-tuning pipeline. This article picks up where that ends.
Why does fine-tuning need a GGUF export at all?
The output of a LoRA training run is usually a folder of safetensors files containing only the adapter weights. Those weights represent the small slice of parameters you actually retrained, often around half a percent to one and a half percent of the full model. That format is great for further training and for evaluation inside the same Python environment you trained in, but it is not portable. You cannot drop a raw adapter into Ollama. You cannot hand it to a teammate running LM Studio on a MacBook. You cannot serve it from a small inference container without dragging in the entire training stack.
GGUF solves that problem. It is a single binary file format designed for efficient inference on consumer hardware. It packs the model weights, the tokenizer, the chat template, and the metadata into one file that runs anywhere llama.cpp or its derivatives run. That includes Ollama, LM Studio, llamafile, and a long tail of local AI tools. If you care about running models locally without paying API costs, GGUF is the format that actually delivers on that promise.
The catch is that you cannot convert a LoRA adapter directly into GGUF in a useful way. You first need to fold the adapter back into the base model so the conversion script sees a complete set of weights.
How do I merge a LoRA adapter into the base model?
The first real step is the merge. When I finish training, I have two things on disk. I have the original base model, for example a Qwen 3 checkpoint that I pulled from Hugging Face. And I have the adapter folder my training script produced, which contains the small set of trained weights plus a configuration file that points back at the base.
To merge them, I load the base model and the adapter together using the PEFT library, then I call the merge and unload method. That method takes the low rank matrices from the adapter, multiplies them out to their full shape, adds them into the corresponding layers of the base model, and then drops the adapter wrapper. What you are left with is a regular Hugging Face model directory that looks identical in structure to the base you started from. Same config file. Same tokenizer. Same safetensors layout. Just with the trained behavior baked in.
I save that merged model to a fresh output directory. I do not overwrite my original base. I do not overwrite my adapter folder either. Both stay around in case I need to retrain, re-merge with a different base revision, or compare behavior. Disk is cheap. Hours of training time are not.
One detail that catches people. Make sure the base model you are loading matches the precision you trained against. If you trained against a 4 bit quantized base and then merge into a different precision, you can get subtle drift in the merged weights.
How do I convert the merged model to GGUF?
Once the merged model is on disk, I move over to llama.cpp. I keep a clone of the llama.cpp repository in my home lab specifically for conversions. The script I use is called convert_hf_to_gguf.py and it lives in the root of the repository. It takes a Hugging Face style model directory as input and produces a GGUF file as output.
I point the script at my merged model directory. I tell it where to write the output GGUF. I pick an output type for the initial conversion, usually 16 bit floating point because I want a clean high precision GGUF first that I can then quantize down from. The script reads the model config, figures out the architecture, walks every tensor, rewrites the weights into the GGUF tensor layout, and embeds the tokenizer and the chat template into the file.
This step is mostly mechanical, but a few things can break. If your tokenizer files are missing or the chat template is not where the script expects, the conversion will either fail loudly or, worse, succeed silently with a model that has no idea how to format conversations. I always check that my merged directory has the tokenizer.json, the tokenizer_config.json, and a chat template either embedded in the config or sitting in a jinja file. If the base model you trained from had those files, the merge step preserves them automatically.
The output is a single GGUF file, still in 16 bit and often quite large. For a 27 billion parameter model, that file lands around fifty gigabytes. Time to quantize.
Which GGUF quantization should I pick: Q4_K_M, Q5_K_M, or Q8_0?
Quantization is where the GGUF format really shines. Once you have a 16 bit GGUF, you can run the llama.cpp quantize binary against it to produce a smaller file in any of dozens of quantization levels. The three I actually reach for in practice are Q4_K_M, Q5_K_M, and Q8_0. Each one trades off file size, memory usage, and quality differently, and the right pick depends on what hardware you are targeting and how much quality drop you are willing to accept on your fine-tuned behavior.
Q4_K_M is the default I recommend for most fine-tuned models that need to run on consumer GPUs or Apple Silicon laptops. It compresses the model down to roughly a quarter of its 16 bit size while keeping quality very close to the original for most prompts. For a 27 billion parameter model, this lands you around the 16 to 18 gigabyte range, which is exactly what I ended up shipping for my own persona model. If you have not already read about why model quantization matters for local AI performance, it is worth a detour.
Q5_K_M is what I pick when I have the VRAM headroom and I want a touch more fidelity. It uses five bit weights instead of four, which means the file is bigger and the memory footprint at inference is bigger too, but the quality recovery against the original is noticeably better on long generations and on edge case prompts. For persona fine-tunes, this can help when you notice your Q4 export starting to drift back toward generic base model behavior on rare topics.
Q8_0 is what I pick when I am evaluating quality against the unquantized merged model. It is essentially eight bit weights with very minimal loss compared to 16 bit. The file is large, around half the size of the full precision GGUF, but if you care about the absolute best inference quality your fine-tune can produce on local hardware, Q8_0 is the level to reach for. I rarely ship Q8_0 to others because the size hurts, but I keep one around for my own reference runs.
A quick rule I follow. Start at Q4_K_M. If your fine-tune feels weaker than it did during evaluation in the training environment, jump to Q5_K_M before you blame your training pipeline. If both of those feel off, grab Q8_0 and confirm whether the issue is quantization or something deeper. And if you are running on a machine where VRAM is tight, the lower Q4 variants are often the only realistic option anyway.
Want to skip the trial and error?
Fine-tuning and exporting is one of those skills that gets a lot easier when you have already worked through a clean local AI project from end to end. I bundled my best starter projects, including local model serving, retrieval augmented generation, and small inference experiments, into a free pack you can grab in one click. They will give you the foundation that makes the GGUF export step feel obvious instead of intimidating.
How do I validate that my GGUF export actually works?
Before I import a fresh GGUF into Ollama and start telling people it works, I run two validation passes. The first is a smoke test directly with the llama.cpp main binary. I load the GGUF, send it a handful of prompts I used during evaluation, and confirm that the responses match what I saw at the end of training. If they match, I know the conversion did not corrupt anything. If they do not match, the problem is almost always either a tokenizer mismatch or a chat template that did not survive the export.
The second pass is a quality comparison. I run the same prompts against the unquantized 16 bit GGUF and against my chosen quantized version, and I read the responses side by side. I am specifically looking for places where the quantized version drifts in tone, length, or factual content. Because model compression always trades quality for size, some drift is expected. The question is whether the drift is acceptable for the use case. For a persona fine-tune, I care a lot about tone preservation. For a structured output fine-tune, I care more about format adherence. Your validation criteria should match your training goals.
If validation passes, the file is ready to ship. If it fails, the fastest debug path is to go back one step at a time. Re-test the merged Hugging Face model in Python to confirm it still behaves correctly. Re-run the conversion to a fresh 16 bit GGUF. Re-quantize. Most failures live in one of those three steps, not in the model itself.
How do I import the GGUF into Ollama with a Modelfile?
The final step is making the GGUF feel like a first class citizen on my local machine. Ollama is my daily driver for that, and the way you teach Ollama about a custom GGUF is through a small text file called a Modelfile. The Modelfile is a recipe. It tells Ollama which GGUF file to load, what the chat template looks like, what the system prompt should default to, what stop tokens to use, and what default sampling parameters to apply.
A minimal Modelfile for a fine-tuned model has four pieces. A FROM line that points at the GGUF file on disk. A TEMPLATE block that describes how to format conversations for this specific model, which usually mirrors the chat template that was baked into the GGUF during conversion. A PARAMETER block with sensible defaults for temperature, top p, and stop sequences. And optionally a SYSTEM block with a default system prompt, though for a persona fine-tune I usually leave this empty because the persona is already baked in.
Once the Modelfile is written, I run the Ollama create command, give my model a memorable name, and Ollama copies the GGUF into its own model store and registers the recipe. From that point on, the model is available in any tool that talks to Ollama. I can chat with it from the CLI. I can pull it from any local app that supports the Ollama API. I can swap it in and out next to other base models with no extra ceremony.
That is the moment where the work pays off. I open a fresh chat, paste the same question I tested against the vanilla base model at the start of the project, and the response comes back short, direct, and in my actual voice. No poetic detours. No fifteen second thinking trace. Just a real answer to a real question, exactly the way I would have written it myself.
Where to go next
The GGUF export is the bridge between a successful training run and a model that people can actually use. If you skip it or rush it, all the work you put into data collection, dataset engineering, and LoRA training stays trapped on the machine where you trained it. If you do it carefully, you walk away with a portable artifact that runs anywhere local AI runs.
If you want to watch the full fine-tuning pipeline in action, including the evaluation step that tells you whether your export is actually worth shipping, the video is here: https://www.youtube.com/watch?v=v7qMjy_RxOs.
And if you want to talk through your own fine-tuning project with engineers doing this for real, join us at https://aiengineer.community/join.