Synthetic Data Generation Pipeline for Fine Tuning Local LLM
The first time I fine-tuned a model on every YouTube transcript I had ever recorded, I expected something close to a digital twin. What I got instead was hours of training time burned on a model that produced complete slop. The output read like a broken voice memo. No personality, no clean punctuation, no real understanding of what I sound like when I write.
That painful first run forced me to rebuild the entire data side of the project from scratch. In this post I want to walk you through the exact synthetic data generation pipeline I now use for fine tuning a local LLM, including the parts most tutorials skip. Almost every fine tuning video on the internet jumps straight to step five, which is the actual training run. The first four steps are the ones that decide whether your model learns your voice or learns your noise.
Why does the data pipeline matter more than the training run?
Fine tuning is mostly a data problem in disguise. Modern training stacks are remarkably forgiving. You can launch a LoRA run with a single command, and the loss curve will go down regardless of whether your inputs make any sense. The model will dutifully memorize whatever you feed it, including the spelling mistakes, the run on sentences, and the weird capitalization quirks of automatic captions.
In my own pipeline I learned this the hard way. My YouTube transcripts come from Google’s automatic captions, which means they are split mid thought every three seconds, they lowercase product names like Notion and Claude, and they sometimes hear “open AI” as three separate words. If you fine tune on that raw text, your model inherits every one of those quirks. The bad behaviors of the data carry straight over to the final model, and you spend two weeks debugging a training loss that was never the real issue.
So before I touch a training script, I run four pipeline stages. Fetch the data, clean it, rewrite it into a question and answer shape, and then augment it with synthetic instructions. Only after those four stages do I split into train and validation and start the actual fine tune.
How do I fetch and seed the pipeline from real data?
Synthetic data only works if it is grounded in something real. I do not generate from a blank prompt. I seed everything from my own spoken words.
For me that means pulling transcripts with yt-dlp from a curated playlist of videos I actually want the model to learn from. I do not just grab everything I have ever uploaded. I pick the videos that represent the voice and the topics I want the fine tuned model to specialize in. This is the same principle behind building an AI knowledge base, where the quality of the source material decides the ceiling of what the system can do.
The seeding step matters because it gives every synthetic sample a real anchor. When I later ask a local model to generate three different user questions for a paragraph, the paragraph itself is something I genuinely said on camera. The synthetic part is the instruction wrapped around it, not the answer. That is the safest form of synthetic data generation for fine tuning a local LLM, because the model never learns words I did not actually use.
How do I clean transcripts without flattening my voice?
Cleaning is where most pipelines either underdo it or overdo it. Underdo it and you train on garbage. Overdo it and your model loses everything that makes it sound like you.
I run cleaning in two passes. The first pass is regular expressions for the obvious systematic problems. YouTube annotations, filler words like “uh” and “um” that I rarely actually say, repeated stutters, and known boilerplate. Regex is great when the pattern is mechanical and predictable.
The second pass is a local language model. I run a Mistral 14 billion parameter model entirely on my own hardware to handle the judgment calls. Regex cannot tell where a sentence really ends. It cannot decide whether “cloud code” should be “Claude Code” in this specific context, because sometimes the speech to text drops the space and sometimes it hears the wrong word entirely. A language model can read the surrounding context and make the right call.
Running Mistral over thousands of transcript snippets takes about two hours on my machine. That sounds like a lot until you remember it saved me roughly two weeks of debugging training loss. If you want to speed this stage up, model quantization is key to faster local AI performance, and a properly quantized 14B model can chew through a long transcript backlog overnight.
The cleaning system prompt I use focuses on a small list of common speech to text errors specific to my domain. Things like “open AAI” splitting into pieces, product names losing their capitals, and run on sentences that need a full stop somewhere. The cleaning user prompt is more about punctuation and capitalization, adding periods, commas, and question marks where sentences naturally end. The better the local model you point at this stage, the better the rest of the pipeline performs.
How do I rewrite monologue into question and answer pairs?
This is the step almost nobody talks about, and it is the difference between a model that sounds like a podcast host stuck on a long ramble and a model that actually responds to questions.
The way I talk on YouTube is not the way I would explain something to you face to face. On video I narrate. I set up scenes, I point at the screen, I reference what just happened. None of that translates to a chat interface. If you fine tune directly on monologue text, the model will respond to “what is fine tuning” with a five hundred word essay that opens with “in this video”.
So I rewrite. I take each cleaned paragraph and reshape it as if it were the answer to a question someone might actually ask. The paragraph itself stays close to my real words, but it loses the visual references and the long winded setup. This is the rewrite step that earns the “synthetic” in synthetic data generation. The paragraph is grounded in something I said, but the framing is engineered for chat.
Where do instruction generation strategies fit in?
Once I have clean rewritten paragraphs, I generate instructions for them. This is where the academic literature has done a lot of the heavy lifting, and it is worth knowing the names.
Self-instruct is the foundational idea. You take a small set of seed instructions, then ask a strong model to generate more instructions in the same style, filter for quality, and add the survivors back into the pool. It is how the first wave of instruction tuned open models got off the ground.
Evol-instruct goes further. Instead of just generating new instructions, it evolves existing ones along axes like deeper reasoning, more constraints, or higher complexity. You end up with a curriculum that pushes the model harder than the original seeds.
In my pipeline I use a simpler hybrid. For each cleaned paragraph I ask Mistral to generate three different user inputs that the paragraph could plausibly answer. One is a direct question, like “what is the right way to use AI coding tools”. One is an opinion request, like “what is your take on shipping code you did not write”. One is an explicit task, like “walk me through how to think about code you do not fully grasp” or “write a blog post about this”.
The same paragraph then appears three times in the training set, paired with three different instructions. This teaches the model to respond to the meaning of the question rather than memorize a single phrasing. It is a cheap, robust version of self-instruct that runs entirely locally and never sends my data to an external API.
Get the local AI starter projects
If you want to see this kind of pipeline running on real hardware before you build your own, my open source local AI projects are the fastest place to start. They cover the full local stack I use day to day, including the model serving setup that makes a 14B cleaning model practical on a single machine.
How do I use a judge model to detect slop?
Generating synthetic data is easy. Generating good synthetic data is hard. Between those two points sits the judge model.
After Mistral produces its three instructions per paragraph, I run a second pass where a model acts as a judge. The judge reads the instruction and the paragraph, and answers a small set of questions. Does the paragraph actually answer the instruction. Is the instruction specific enough to be useful. Does the pair contain obvious artifacts like hallucinated product names, repeated phrases, or filler. Anything that fails gets dropped or regenerated.
This is the slop detection layer, and it is non negotiable for synthetic data generation pipelines for fine tuning local LLMs. Without it, your training set slowly drifts toward the failure modes of your generator model. With it, you keep the distribution honest. The judge does not need to be huge. It needs to be consistent. I find that running the same local model with a tighter, more skeptical prompt is often enough.
I also run a deduplication pass at this stage. Near duplicate instructions are common when you generate three variants per paragraph across thousands of paragraphs. I hash normalized forms of each instruction and drop exact matches, then run a similarity check on the remainder to catch the near misses. Deduplication is boring work that quietly raises the ceiling of every model you ever train on the dataset.
How do I handle persona oversampling for personality?
One of the hardest problems in fine tuning your own voice is that you almost never talk about yourself. My videos are mostly instructional. I explain coding tools, I walk through pipelines, I break down concepts. Out of five thousand transcript segments, maybe fifty talk about me as a person. That is one percent of the data carrying one hundred percent of the personality signal.
If I trained on the raw distribution, the model would have no idea who I am. Ask it about my background and it would hallucinate something generic. The fix is persona oversampling. I duplicate the persona rows in my training set, typically around ten times. The model learns by frequency, and repeating a row tells the model to treat that pattern as a rule rather than an outlier.
Ten times is roughly my cap. Push it higher and the model starts parroting persona answers on completely unrelated prompts, which is its own kind of slop. The right multiplier depends on your data, and you will need a few runs to dial it in. But the principle generalizes. Any signal that matters but is rare in your raw corpus needs to be deliberately oversampled, or it will get drowned out.
The same logic applies to length. My videos are long, but I respond in chat in short, direct paragraphs. So I split long transcripts into smaller chunks before training. My average input dropped from around 1200 tokens to 140 tokens, which sped up training enormously. The attention mechanism does roughly n times n comparisons per example, so cutting the length is a multiplicative win. My fine tunes went from sixty hour runs to ninety minute runs, which means I can actually iterate.
What does the final training format look like?
Each training row ends up as a JSON object with three fields. A system prompt that frames the model’s role, like “you are an AI engineer”. A human input, which is one of the three synthetic instructions. And a model response, which is the cleaned rewritten paragraph from the original transcript. The same response paragraph appears multiple times across rows, paired with different human inputs.
If you have not set up your local serving environment yet, my Ollama local development guide covers the runtime side, and my local AI coding reality check is honest about what current local models can and cannot do once they are fine tuned.
Where should you start with your own pipeline?
If you are building a synthetic data generation pipeline for fine tuning a local LLM, do not start with the training script. Start with the data. Pick a focused source you actually own, clean it in two passes with regex and a local model, rewrite monologue into chat shaped pairs, generate three instruction variants per paragraph, judge and deduplicate the output, and oversample the rare signals you care about. Only then run the fine tune.
That is the pipeline that took my model from slop to something that genuinely sounds like me. The training run is the easy part once the data is right.
You can watch the full walkthrough on my YouTube channel here: https://www.youtube.com/watch?v=XGwp1tN4LKw
And if you want to go deeper with a community of engineers building local AI systems and fine tuning their own models, join us at https://aiengineer.community/join.