How to Clean YouTube Transcripts for LLM Fine Tuning
I fine tuned a language model on every YouTube transcript from my channel because I wanted an AI that sounded just like me. The first attempt failed in the most frustrating way possible. After hours of training, the model produced complete slop. Run on sentences. Lowercase product names. Sentences that ended in the middle of a thought. The training had worked exactly as designed, which was the problem. The model learned the bad habits inside my raw transcripts and reproduced them faithfully.
Most fine tuning tutorials skip the part that actually matters. They jump straight to step five, the training run, and ignore the four steps that decide whether your model will be useful or unusable. If your input data is poor, no amount of clever training arguments will save you. So I want to walk you through the pipeline that took my transcripts from garbage into a clean, augmented, instruction ready dataset. This is the part nobody shows you, and it is the part that turned my fine tuned model into something that genuinely captures my voice.
Why are raw YouTube transcripts unusable as training data?
Google automatically transcribes every video you upload. That is incredibly convenient, but the output has problems that will absolutely poison a fine tune if you feed it in directly. The first issue is that automatic captions break sentences mid thought. The transcription engine cuts a new line every three seconds or so, regardless of where you actually paused. So you get one sentence split across six lines, with no punctuation anchoring where the idea begins and ends.
The second issue is filler. Every “uh” and “um” gets transcribed faithfully. I do not want my fine tuned model writing “uh” in the middle of a paragraph, but if I leave those tokens in the dataset, that is exactly what it will learn to do.
The third issue is misheard words. Speech to text gets product names wrong constantly. On my channel, “Claude” gets transcribed as “cloud” half the time. “OpenAI” gets split into “open AI” or even “open AAI”. “Notion” comes out lowercased. If you train on raw captions, the model learns that “cloud code” is a real product, which is obviously wrong.
The fourth issue is format. Spoken explanations are not the same as written answers. When I record a video I gesture at the screen and say “browse to this URL”. That makes sense on camera but it falls apart in a chat interface where the model has no screen to point at. Your training data needs to read like written answers, not narrated demos.
How do I pull transcripts from a YouTube playlist for fine tuning?
The fetching step is the easy part, and that is partly why it gets so much attention in tutorials while the cleaning gets ignored. I use yt dlp, the Python based command line tool, to pull captions for every video in a curated playlist. I do not pull my entire channel. I select the videos I actually want the model to learn from and put them in a dedicated playlist, then point yt dlp at that playlist URL with the subtitle and auto subtitle flags enabled.
If you want a programmatic alternative, the youtube transcript api Python package fetches captions directly without downloading the video file. That is faster when you only need the text. Either approach gives you a folder of subtitle files, usually in SRT or VTT format, ready for the next stage. If you have never set up a local Python workflow for AI projects before, my Ollama local development guide walks through the environment setup that pairs nicely with this kind of pipeline.
What is the right way to clean transcript filler and formatting issues?
Cleaning happens in two layers. The first layer is regular expressions, because some problems are perfectly systematic. SRT files come with timestamp lines and sequence numbers that are useless for training. Strip those out with a simple pattern. Bracketed annotations like “[Music]” or “[Applause]” that YouTube inserts also go. My known filler list, which is mostly “uh” and “um” with surrounding whitespace, gets stripped the same way. None of this is glamorous, but it removes maybe forty percent of the noise in a few lines of code.
The second layer is where the real work happens, and regular expressions cannot handle it. Run on sentences need natural language judgment to figure out where one thought ends and the next begins. Misheard product names need context. I cannot hardcode a rule that says “replace cloud with Claude” because sometimes I am genuinely talking about cloud servers. The fix has to understand what I meant.
This is where a local language model earns its keep. I run Mistral at fourteen billion parameters locally and feed it transcript chunks with a cleaning system prompt. The system prompt lists the common speech to text errors specific to my channel. Open AAI should be OpenAI. Cloud code should be Claude Code. Notion should always be capitalized. Then a user prompt instructs the model to add periods, commas, and question marks where sentences naturally end, and to fix capitalization without changing the actual words.
That last constraint matters. The cleaning model is not allowed to rewrite the meaning. It only fixes punctuation, casing, and obvious misheards. If you let it paraphrase, you lose the voice you are trying to capture. Running Mistral over thousands of transcript chunks took me about two hours of compute. It saved at least two weeks of debugging training loss on a model that would have learned the wrong patterns. If you are wondering whether a fourteen billion parameter model is realistic on your hardware, model quantization is the key to faster local AI performance and what makes this kind of bulk cleaning practical on a single workstation.
How do I recover sentence boundaries in raw captions?
Sentence boundary recovery is the single most underrated step in transcript cleaning. Raw captions are essentially one long stream of words with arbitrary line breaks. Training on that stream teaches the model to output the same shapeless wall of text. So I explicitly ask the local cleaning model to insert full stops, commas, and question marks where my speech actually paused.
The trick is to give the model context windows that overlap. If you feed it tiny three second chunks, it cannot tell whether a fragment is the start of a new sentence or the middle of an old one. I feed it paragraphs of around two hundred words at a time, with a small overlap between adjacent chunks, and reassemble the output. The model has enough context to make confident calls about where ideas begin and end.
After this step, my transcripts read like written paragraphs. They have full stops. They have proper nouns. They look like something a person could plausibly have written rather than a stream of dictation.
How do I deduplicate and split transcripts before augmentation?
Once the text is clean, I split it. My videos run forty seven minutes on the long end. Converted to tokens, that is thousands of tokens per video, which is terrible training data for a chat model. If someone asks my fine tuned model a quick opinion question, I want a brief direct answer, not a three hundred word essay. So I cut each cleaned transcript into paragraph sized snippets of roughly one hundred and forty tokens each.
Shorter snippets also train dramatically faster. The attention mechanism inside fine tuning compares every token in an example against every other token, which scales as n squared. I went from average examples of around twelve hundred tokens down to one hundred and forty tokens, and my training time dropped from sixty hours to about ninety minutes per model. That speedup matters because it lets me iterate. A failed sixty hour run is catastrophic. A failed ninety minute run is just lunch.
After splitting, I deduplicate. I record on related topics across many videos and the same explanations recur. Near duplicate snippets bias the model toward whatever I happened to repeat most often, which is rarely the most important content. A simple similarity check on each pair of snippets, dropping anything above a high overlap threshold, keeps the dataset honest.
If you want to skip the build and start tinkering with a clean local AI environment today, the open source projects on my open source page include the kind of starter setups I use when I am bootstrapping a new fine tuning experiment.
How do I augment transcripts into instruction tuning pairs?
Cleaned snippets still are not training data for an instruction tuned chat model. Each snippet is just an answer floating in space. I need to attach a question. Better yet, I need to attach multiple questions, because real users ask the same thing in many different ways.
For every cleaned paragraph I run a second pass through Mistral with an augmentation prompt. The prompt tells the model that it is creating question and answer training pairs, that the paragraph is the answer, and that it should write a focused single sentence question that this paragraph directly answers. Then I run it three times with three different instruction styles.
The first style is a direct question. “What is the right way to use AI coding tools?” The second style is an opinion request. “What is your take on shipping code you did not write?” The third style is a writing task. “Walk me through how to think about code you do not fully grasp.” All three instructions point at the exact same paragraph as the answer.
This teaches the model to respond to the meaning of a question, not the exact wording. If I only ever trained it on direct questions, it would fail on writing tasks. By rotating instruction styles for the same response, the model generalizes across the kinds of prompts real users actually send. If you want to see how this generalization principle plays out in retrieval based systems, building an AI knowledge base covers the parallel idea for RAG style applications.
What does the final JSON format look like for instruction tuning?
The training pipeline expects a specific JSON shape. Each example has a system prompt, a human input, and an assistant output. My system prompt is something like “You are an AI engineer.” The human input is one of the three generated questions. The assistant output is the cleaned paragraph itself. Three rows per paragraph, all sharing the same answer with different framings.
The fields you use depend on the format your training framework expects. ShareGPT format wraps the conversation in a list of role tagged messages. Alpaca format uses an instruction field and a response field. Pick whichever your training stack consumes natively and stick to it across the whole dataset. Mixing formats inside one dataset is a surprisingly common source of broken training runs.
How does persona oversampling fix a thin training set?
Here is the problem nobody warns you about. Most of my videos are instructional. I talk about coding tools, fine tuning pipelines, and local AI setups. I rarely talk about myself as a person, but the whole point of fine tuning a personal model is for it to know who I am. Out of around five thousand transcript segments, maybe fifty cover anything personal. That is one percent. The model will essentially never learn it.
The fix is persona oversampling. I duplicate the personal segments inside the training set. The more often a row appears during training, the more strongly the model treats it as a rule. I oversample personal segments by ten times, which brings them up to roughly nine percent of the data, enough for the model to actually pick up the pattern.
Ten times is my cap. Push it higher and the model starts parroting persona answers on completely unrelated prompts, which is its own kind of failure. The right multiplier depends on what you want the model to retrieve reliably. Start at five times, evaluate, and adjust. The same idea applies any time you have a high value but rare class in your dataset, which is one of those practical realities I cover in my local AI coding reality check on what actually works.
What does the finished dataset look like in practice?
When everything is wired up, the pipeline reads like this. yt dlp pulls the raw captions. Regular expressions strip the timestamps, brackets, and known fillers. Mistral fixes punctuation, capitalization, and misheard product names. The cleaned text gets cut into one hundred and forty token snippets and deduplicated. Mistral runs again to generate three instruction styles per snippet. Personal snippets get duplicated ten times. Everything serializes into the JSON format my training framework expects, splits into training and validation sets, and is ready for the actual fine tune.
The entire data pipeline runs in maybe four hours on my workstation for a few thousand snippets. The training itself takes ninety minutes. Compare that to the alternative, which is feeding raw captions into a sixty hour training run that produces unusable output, and the pipeline pays for itself immediately.
The lesson I keep coming back to is that fine tuning is a data engineering problem dressed up as a machine learning problem. Once you understand the steps, the actual code falls out easily, especially with a coding agent helping you write the glue. What matters is knowing what to clean, how to augment, and where to oversample. Get those right and the training run is almost a formality.
If you want the video walkthrough where I open the actual repository and show the cleaned versus raw transcripts side by side, watch it on YouTube at https://www.youtube.com/watch?v=XGwp1tN4LKw. And if you want to swap notes with other engineers building local AI and fine tuning pipelines, join the community at https://aiengineer.community/join.