How to · 10 min read

How to Clone Your Voice for a Podcast: The 90-Minute Fix Workflow

Clone your voice with ElevenLabs to patch podcast mistakes without re-recording. Honest limits on word-level fixes vs paragraph-length generation.

Difficulty: intermediate
Time needed: 1h 30min
Screenshot: ElevenLabs — used to clone a podcaster's voice for corrections without re-recording

Disclosure: Some tool links below are affiliate links. If you sign up through one we may earn a commission — at no extra cost to you. We'd recommend the same tools either way.


Who this is for

You shipped a 60-minute episode. Somewhere at 14:32, you said the wrong stat — cited the 2023 number when you meant 2024, or called your guest by the wrong first name, or mispronounced a sponsor. Re-recording is painful: you have to match tone, mic distance, and room acoustics, and editors still catch the seam. Cloning your voice lets you type the correction, generate one sentence in your own voice, and drop it into the timeline.

This guide is for solo podcasters and co-hosts who already edit their own episodes and want a recovery workflow that doesn't involve pulling the host back into the booth for a 15-second fix. If you outsource editing, forward this to your editor — the workflow is theirs to run, not yours.

What you'll need

  • A voice clone training sample: 1 minute of clean audio for ElevenLabs Instant cloning, or 30+ minutes for Professional cloning. Must be you speaking, mono, minimal background noise, no music bed, no heavy processing. Pull it from an existing episode's raw track.
  • An ElevenLabs account on Starter ($6/mo) for Instant Voice Cloning, or Creator ($22/mo) for Professional Voice Cloning. Free tier clones are commercially unlicensed, so don't ship from them.
  • Your episode's edit project open in Descript, Premiere, Audition, Logic, or whatever DAW you use. You need to be able to drop a WAV file onto the timeline at a specific timestamp.
  • A backup of the original episode file before you touch anything. Non-negotiable.
  • Budget: $6–$22/month on ElevenLabs, or your existing Descript subscription if you're using Overdub instead.
  • Skill level: you know how to splice audio in your DAW and match levels. This guide does not cover the DAW work.

Step 1: Record or extract a clean training sample

Action: Get 60 seconds of your voice, alone, cleanly captured. Either record a fresh sample or extract one from a recent episode's raw multitrack.

For Instant Voice Cloning, one minute is the floor. Go up to 2–3 minutes if you have it — more sample audio reduces prosody drift on edge-case words. What matters more than length is quality: no music under the voice, no room echo, no compression artifacts, no breaths edited out (the model needs your natural breath pattern). Export as WAV, 44.1 kHz, 16-bit minimum.

For Professional Voice Cloning, you need 30+ minutes of varied content. Read passages, conversational clips, different emotional registers. One long monotone reading trains a flat clone. Pull from three or four different episodes if you need variety.

The cheapest source is your own podcast's raw solo track — the one from before your editor mixed in the guest audio and theme music. Export the first 60 seconds of a cold open where you're speaking alone.

Failure mode: training on a mixed track with music or guest audio bleeding in. The clone will intermittently try to generate music under the voice or shift toward the bleed voice's prosody. Train on isolated host audio only.
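Before uploading, it's worth a quick pre-flight check on the file itself. Below is a minimal sketch using Python's standard `wave` module; the thresholds (mono, 44.1 kHz, 16-bit, at least 60 seconds) mirror this step's requirements, and `make_test_wav` is just a silent stand-in file for demonstration, not real training audio.

```python
import io
import wave

def validate_training_sample(wav_bytes: bytes, min_seconds: float = 60.0) -> list[str]:
    """Return a list of problems with a would-be clone training sample.

    Thresholds follow this guide: mono, 44.1 kHz or better, 16-bit or
    better, at least 60 seconds. An empty list means the file passes.
    """
    problems = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        if w.getnchannels() != 1:
            problems.append(f"expected mono, got {w.getnchannels()} channels")
        if w.getframerate() < 44100:
            problems.append(f"sample rate {w.getframerate()} Hz is below 44.1 kHz")
        if w.getsampwidth() < 2:
            problems.append(f"bit depth {8 * w.getsampwidth()}-bit is below 16-bit")
        duration = w.getnframes() / w.getframerate()
        if duration < min_seconds:
            problems.append(f"only {duration:.1f} s of audio; need {min_seconds:.0f} s")
    return problems

def make_test_wav(seconds: float, rate: int = 44100) -> bytes:
    """Generate a silent mono 16-bit WAV in memory (demo stand-in only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()

# A 5-second file fails the length check; 61 seconds passes everything.
print(validate_training_sample(make_test_wav(5)))
# → ['only 5.0 s of audio; need 60 s']
```

Run this against your exported sample before you burn an upload on a file your DAW silently rendered as stereo or 22 kHz.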

Step 2: Train the clone in ElevenLabs

Action: Sign in, open the voice library, and create a new cloned voice.

For Instant Voice Cloning, upload your 60-second sample, confirm you have rights to the voice (yes — it's yours), name the clone something specific like show-name-host-v1, and submit. Processing takes under two minutes.

For Professional Voice Cloning, the flow is longer: you upload the training audio, wait for ElevenLabs' queue (typically a few hours), verify the result with test generation, and iterate if the prosody is off. The quality gap between Instant and Professional is real for podcast work — Professional handles 2–3 sentence fixes before the AI-tell becomes audible; Instant starts showing cracks after one sentence.

Success signal: you can play back a test sentence ("Welcome back to the show") and it sounds like you within the first listen, not after three replays.

Alternative: if you already pay for Descript, Overdub is the built-in option. Lower ceiling than ElevenLabs Professional, but the clone lives inside the editor you're already using — no export-import dance. For word-level fixes under 10 syllables, Overdub is faster end-to-end even if the audio is slightly less convincing.
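The Instant vs. Professional vs. re-record tradeoff in this step reduces to a small decision rule. The thresholds below come from this guide's numbers (Instant holds up for one sentence, Professional for two to three, anything around 15+ seconds gets re-recorded); the function is an illustration, not part of any tool's API.

```python
def pick_fix_strategy(n_sentences: int, est_seconds: float) -> str:
    """Map a planned patch to the cheapest workable option.

    Thresholds follow this guide: Instant cloning holds up for one
    sentence, Professional for two to three, and anything around
    15+ seconds of continuous speech should be re-recorded.
    """
    if est_seconds >= 15 or n_sentences > 3:
        return "re-record"
    if n_sentences <= 1:
        return "instant-clone"
    return "professional-clone"

print(pick_fix_strategy(1, 4.0))   # short stat fix → instant-clone
print(pick_fix_strategy(2, 9.0))   # two-sentence correction → professional-clone
print(pick_fix_strategy(5, 25.0))  # whole paragraph → re-record
```

If you find yourself routinely landing in the "professional-clone" branch, the Creator upgrade pays for itself faster than repeated Instant regeneration cycles.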

Step 3: Write the correction with the exact context

Action: Type out the full sentence you want to patch, not just the wrong word. Include the three to five words before and after the error.

If the original was "in 2023, revenue hit 40 million dollars" and you meant 2024, don't generate just "2024" — generate the full "in 2024, revenue hit 40 million dollars." Word-level splices are detectable; sentence-level splices aren't. You'll trim the crossfade points in the DAW later.

Copy your delivery style in the text itself. If you naturally say "uh, so, like" before a stat, include it. If you pause after numbers, add a comma or ellipsis — ElevenLabs respects punctuation as timing hints. For emphasis, use italics or capitalize the word; the model reads this as prosodic stress.

Success signal: the text you've written reads like something you'd actually say out loud, not like a press release.

Failure mode: generating the single wrong word. The clone will produce it in isolation, and when you paste it over the original, the level, room tone, and mic distance won't match, making the splice audible.
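The full-sentence rule can be enforced mechanically. Here's a sketch, assuming you have the original sentence from your transcript; it refuses word-level patches outright, matching the failure mode above.

```python
def build_patch_text(original_sentence: str, wrong: str, right: str) -> str:
    """Return the full corrected sentence to feed the clone.

    Generating the whole sentence (not just the fixed word) keeps the
    splice at natural sentence boundaries, per Step 3. Refuses output
    too short to splice cleanly.
    """
    if wrong not in original_sentence:
        raise ValueError(f"{wrong!r} not found in the sentence")
    patched = original_sentence.replace(wrong, right, 1)
    if len(patched.split()) < 4:
        raise ValueError("patch too short: generate a full sentence, not a word")
    return patched

print(build_patch_text(
    "in 2023, revenue hit 40 million dollars", "2023", "2024"))
# → in 2024, revenue hit 40 million dollars
```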

Step 4: Generate and tune the output

Action: Paste the sentence into ElevenLabs' text-to-speech panel, select your cloned voice, and generate.

Before you hit generate, set the voice settings: Stability, Similarity, and Style Exaggeration. For podcast patches:

  • Stability moderate-to-high (0.5–0.7). Low stability produces more emotional range but drifts in long sentences; podcast corrections need consistency.
  • Similarity high (0.75+). Push the clone toward your training sample's exact timbre.
  • Style Exaggeration low (0–0.3). Exaggeration is for dramatic reads, not corrections.

Generate three to five takes. The model produces different micro-variations each time — tiny shifts in breath placement and emphasis. Listen to all of them and pick the one that matches the surrounding audio's energy level. If none match, adjust the text (add a breath comma, split a long clause) and regenerate.

Success signal: one of the takes reads naturally when played without the surrounding episode audio. If you need the original to cover up the AI-ness, the take isn't good enough.
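The ranges above can be baked into a small payload builder. The key names (`stability`, `similarity_boost`, `style`) follow ElevenLabs' voice-settings convention as an assumption; verify against the current API docs before wiring this into anything real.

```python
def patch_voice_settings(stability: float = 0.6,
                         similarity: float = 0.8,
                         style: float = 0.1) -> dict:
    """Clamp voice settings into the podcast-patch ranges from Step 4.

    Key names mirror ElevenLabs' voice_settings payload (an assumption;
    check the current API docs before shipping).
    """
    def clamp(x: float, lo: float, hi: float) -> float:
        return max(lo, min(hi, x))
    return {
        "stability": clamp(stability, 0.5, 0.7),           # moderate-to-high
        "similarity_boost": clamp(similarity, 0.75, 1.0),  # high
        "style": clamp(style, 0.0, 0.3),                   # low exaggeration
    }

# Out-of-range values get pulled back into the safe band:
print(patch_voice_settings(stability=0.9, similarity=0.5, style=0.8))
# → {'stability': 0.7, 'similarity_boost': 0.75, 'style': 0.3}
```

Clamping matters because the temptation under deadline is to crank style exaggeration when a take sounds flat; per this step, that makes corrections worse, not better.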

Step 5: Drop the patch into your DAW and match levels

Action: Export the chosen take as WAV. Import it into your editor at the exact timestamp of the error.

Three things to match against the original audio around the patch:

  • Peak level: use a compressor or a simple gain match.
  • Room tone: splice 100–200ms of ambient room audio from a silent moment in the same episode and feather it under the patch.
  • EQ and processing: apply the same channel strip you used on the original host track — reverb, high-pass filter, saturation plugin, all of it.

Place the patch with 50–100ms crossfades on each side. Listen to the stitched section three times: at your monitoring level, at phone-speaker level, and in headphones. If you can tell where the splice is on any of the three, nudge the crossfade length or re-match the level.

Failure mode: pasting the patch dry with no room tone. Listeners experience a "dead air" moment right where the patch lives, which is louder than the AI-ness itself.
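The level-match and crossfade mechanics can be sketched numerically. This is illustrative math rather than a DAW plugin: peak-ratio gain matching plus an equal-power crossfade curve, which keeps perceived loudness constant through the fade.

```python
import math

def gain_to_match_peak(patch_peak: float, original_peak: float) -> float:
    """Linear gain factor that brings the patch's peak to the original's."""
    return original_peak / patch_peak

def equal_power_crossfade(n: int) -> list[tuple[float, float]]:
    """(fade_out, fade_in) gain pairs over n samples.

    Equal-power curves: cos/sin quarter-waves, so the summed power
    (fade_out**2 + fade_in**2) stays at 1 across the whole fade and
    the splice doesn't dip in loudness.
    """
    pairs = []
    for i in range(n):
        t = i / (n - 1)  # 0 → 1 across the fade
        pairs.append((math.cos(t * math.pi / 2), math.sin(t * math.pi / 2)))
    return pairs

# Patch peaking at 0.5 against an original peaking at 0.25: halve it.
print(gain_to_match_peak(0.5, 0.25))  # → 0.5

# 50 ms fade at 44.1 kHz is roughly 2205 samples.
fade = equal_power_crossfade(2205)
mid_out, mid_in = fade[len(fade) // 2]
print(round(mid_out**2 + mid_in**2, 6))  # power preserved at the midpoint
```

Your DAW's default crossfade may be linear rather than equal-power; if the splice sounds like a brief dip, that's usually why.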

Step 6: Spot-check the full section before export

Action: Play the full 30 seconds around the patch — 15 seconds before, the patch itself, 15 seconds after. Listen end-to-end once without pausing.

What you're checking for: energy continuity (is the patch calmer or more animated than the surrounding delivery?), breath flow (does the patch inherit a breath from the original that doesn't quite fit?), and logical consistency (does the corrected statement actually make sense in context — sometimes fixing one number breaks a later reference to it).

If the patch passes all three, render the episode. If anything feels off, go back to Step 4 and regenerate with adjusted text or settings. Budget one regeneration cycle as normal; three or more means the sample quality in Step 1 wasn't good enough — re-train with cleaner audio.

Success signal: you play the section for someone who hasn't heard the episode, ask where the edit is, and they can't point to it.

Common pitfalls

  • Training on over-processed audio: the reference track has been through compression, EQ, and noise reduction. Clones trained on processed audio generate processed-sounding output that won't match your raw capture. Train on pre-processing files when possible, then apply your channel chain to the output.
  • Trying to patch more than one sentence: paragraph-length generation (20+ seconds) is where the AI-tell becomes audible even on Professional clones. If you need to replace a whole paragraph, re-record it for real — the 10 minutes in the booth beat the 30 minutes of regeneration cycles.
  • Ignoring the commercial license: free-tier ElevenLabs output is not licensed for commercial use. A podcast with even a single sponsor, YouTube ad revenue, or a Patreon is commercial. Upgrade to Starter before you ship.
  • Not disclosing synthetic audio: listeners who discover an undisclosed clone feel tricked even when the fix was trivial. A single line in your show notes — "this episode includes AI-corrected audio for the guest's name at 14:32" — defuses the trust issue for zero cost.

When not to use this approach

  • Whole-segment replacement: if you need to re-record a 90-second intro or a 3-minute sponsor read, voice cloning produces detectable output at that length. Re-record in the booth. The cloning workflow is for fixes under 15 seconds.
  • High-stakes sponsor reads: most brand contracts prohibit synthetic voice in paid placements, and the one that slips through catches up with you on the next contract review. Re-record sponsor mistakes; save cloning for editorial content.
  • Live or near-live podcasts: if your show ships within hours of recording, the training-plus-generation cycle is slower than a retake. Voice cloning pays off when the episode is already published and re-recording means coordinating schedules.

Bottom line

Voice cloning is the right recovery tool when the error is small, the stakes are editorial rather than sponsor-paid, and you've been honest with your listeners that synthetic audio is part of your production workflow. For a weekly solo or co-hosted podcast with occasional stat corrections or name fixes, a Starter or Creator subscription pays for itself the first time you avoid a re-record.

Start with ElevenLabs on Starter if you're testing the workflow on a backlog of fixes; move to Creator the first time you need Professional Voice Cloning quality on a flagship episode.

Common questions


Is it legal to clone my own voice for podcast corrections?
For your own voice, yes — the tools require you to confirm consent during training, and since you are both the trainer and the subject, you are covered. For a co-host or guest, you need their recorded consent before training, and some jurisdictions require written sign-off for synthetic audio even when verbal consent was captured. If the podcast is commercial, disclose synthetic patches in your show notes once per episode that uses them.
What does this actually cost in a typical month?
ElevenLabs Starter at $6/month gives roughly 30 minutes of generated audio, which covers about 40–60 single-sentence podcast patches. If you hit that cap, Creator at $22/month unlocks Professional Voice Cloning and 2 hours of output. Descript Overdub rides on your existing Descript subscription — Hobbyist at $24/month includes it, no extra line item. The realistic monthly cost for a weekly podcaster doing one or two fixes per episode is $6–$22.
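The patch-count arithmetic here is easy to sanity-check. The estimator below assumes roughly 30–45 seconds of generated audio per shipped patch (several takes per fix, per Step 4); that per-patch figure is this article's approximation, not ElevenLabs' billing formula.

```python
def patches_per_month(minutes_of_output: float, patch_seconds: float) -> int:
    """Rough patch budget for a plan's audio allowance.

    patch_seconds is total generated audio burned per shipped patch,
    including discarded takes (an approximation from this article).
    """
    return int(minutes_of_output * 60 // patch_seconds)

# Starter (~30 min of generated audio) at 30-45 s burned per patch:
print(patches_per_month(30, 45))  # → 40
print(patches_per_month(30, 30))  # → 60
```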
Will listeners be able to tell the audio is AI?
For single sentences or isolated phrase fixes, a well-trained clone with clean reference audio is effectively undetectable — ElevenLabs Professional Voice Cloning holds up even on high-fidelity monitoring. For paragraph-length generation (20+ seconds of continuous speech), the tell becomes audible: prosody flattens, micro-pauses land in slightly wrong places, and breath patterns repeat. Use cloning for fixes, not to replace whole segments.
Can I do this entirely for free?
Technically yes on ElevenLabs Free (10,000 credits, about 10 minutes of output), but the free tier has no commercial license — anything monetized including a YouTube channel with ads violates the terms. Descript Overdub requires a paid Descript tier. The honest minimum for a real podcast is Starter at $6 on ElevenLabs or your existing Descript subscription.
How long does it take to train the voice clone the first time?
ElevenLabs Instant Voice Cloning processes a one-minute sample in under two minutes. Professional Voice Cloning takes several hours of asynchronous processing after you submit 30+ minutes of training audio. Descript Overdub sits between the two — about 10 minutes of asynchronous training after you record its scripted sample. Budget 90 minutes total for the first end-to-end session including the fix itself; subsequent fixes drop to 10–15 minutes each.
