How to · 10 min read
How to Make an AI Avatar Explainer Video in 45 Minutes
Build a 60-second AI avatar explainer for a landing page or LinkedIn post with HeyGen. Honest workflow, script template, and when to skip avatars entirely.
- Difficulty: beginner
- Time needed: 45 min
- Published

Disclosure: Some tool links below are affiliate links. If you sign up through one we may earn a commission — at no extra cost to you. We'd recommend the same tools either way.
Who this is for
You need a 60-second explainer for a landing page hero, a LinkedIn feed post, a YouTube intro, or a product onboarding flow. You don't want to set up a camera, light yourself properly, or record six takes trying not to stumble on the phrase "conversion rate optimization." You have an idea of what you want to say, and you'd rather spend 45 minutes than a full afternoon shooting it.
This guide assumes your audience is prospects, employees, or a LinkedIn professional network — people watching for information, not for personality. If you're building a creator channel where fans come for your face and voice, AI avatars are the wrong tool and no workflow fixes that.
What you'll need
- A 150-word script for a 60-second video at conversational pace. More on this in Step 1.
- A HeyGen account on Creator at $24/month annual (or the free tier for a one-time test with watermark).
- A decision on avatar type: stock (immediate), Instant Avatar clone of yourself (2 hours to train), or Studio Avatar (higher fidelity, guided shoot).
- A voice decision: stock voice (300+ options), cloned voice from 30 seconds of recording, or pre-recorded audio you upload for HeyGen to lip-sync to.
- An optional brand asset: your logo as a PNG, your brand color as a hex code, a background image if you want something other than a neutral studio backdrop.
- Budget: $0 for the free-tier test, $24–$29/month for ongoing production, $89+ if you move to Synthesia for enterprise features.
- Skill level: you've used a web-based video tool before. You don't need editing chops — the avatar tool handles the render.
Step 1: Write the 150-word script before you open the tool
Action: Draft a 150-word script that reads cleanly out loud in 55–60 seconds.
The single biggest mistake in AI avatar video is pasting a written-for-reading paragraph into the tool. Written prose has different rhythm than spoken delivery — an AI avatar reading written prose sounds stilted in a way that amplifies the uncanny tell. Write for the ear:
- Short sentences. Aim for 10–15 words each. Long complex sentences expose the avatar's lack of natural mid-sentence breathing.
- Contractions. "We are building" becomes "we're building." Formal grammar signals AI; conversational contractions humanize the delivery.
- One idea per sentence. The avatar can't carry a three-clause argument with a satisfying pause before the conclusion. Break it up.
- Open on a hook, close on a CTA. 60 seconds is enough for: hook (10s), problem (15s), solution (25s), CTA (10s). Don't try to fit your whole company story.
Read the script out loud and time it. If it runs over 65 seconds at conversational pace, cut 10%. If it runs under 50 seconds, you're probably rushing — add a breath moment between the problem and solution sections.
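If you want to sanity-check length before reading aloud, a rough word-count estimate gets you close. The pace constant below is a general rule of thumb for conversational delivery, not a HeyGen-specific measurement:

```python
# Rough spoken-duration estimate: conversational pace is roughly
# 140-160 words per minute. These numbers are rules of thumb,
# not measurements of any particular avatar voice.
def estimate_spoken_seconds(script: str, words_per_minute: int = 150) -> float:
    """Return an estimated read-aloud duration in seconds."""
    word_count = len(script.split())
    return word_count / words_per_minute * 60

script = " ".join(["word"] * 150)  # stand-in for a 150-word script
print(f"{estimate_spoken_seconds(script):.0f}s")
```

At 150 words per minute, a 150-word script lands right at the 60-second target; the out-loud read with a timer is still the real test.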
Success signal: you've read the script out loud three times, the timer reads 55–60 seconds, and you don't stumble on any phrase.
Step 2: Pick stock avatar or train your own
Action: Decide whether to use a stock avatar, train an Instant Avatar of yourself, or schedule a Studio Avatar shoot.
Stock avatars (175+ in HeyGen, 230+ in Synthesia) are the fastest path and often the safest choice. Browse by demographic, setting, and delivery style. Watch the preview clip of your top three picks to the end — the one that looks least AI on a 15-second hold is the right pick, even if another looks more polished in the thumbnail.
Instant Avatar clones you from about 2 minutes of webcam footage following HeyGen's scripted guidance. Processing takes roughly 2 hours asynchronously. The result is passable for short clips, but quality drifts visibly on takes past 30 seconds. Good enough for a 60-second explainer if your face has been seen by the audience before; poor fit if this is their first impression of you.
Studio Avatar requires a longer, higher-quality training session — either in-person or guided remote with specific lighting and framing. The output holds up on longer takes but gates you to HeyGen's Studio tier pricing. Only worth it if you'll reuse the avatar for 10+ videos over six months.
For most 60-second explainers on Creator tier, stock avatar is the right call. The audience isn't evaluating your face; they're evaluating whether the message lands.
Failure mode: picking a stock avatar based on the thumbnail and not watching the preview all the way through. Some stock avatars look great at second 1 and obviously AI at second 8.
Step 3: Pick your voice and set pacing
Action: Browse the voice library, filter by language and accent, and audition three voices against your script's first sentence.
HeyGen's 300+ stock voices span 175+ languages. The defaults sound flatter than ElevenLabs standalone, but they're good enough for informational delivery. Key controls:
- Pitch and speed: adjust slightly from defaults only. Extreme adjustments produce artifacts.
- Emphasis markers: HeyGen supports SSML-lite syntax for pauses ("...") and emphasis (*italics*). Use pauses before key claims and emphasis on numbers and brand names.
- Voice match to avatar: pick a voice that matches the avatar's apparent demographic. A deep male voice on a young female avatar is immediately uncanny.
If you need higher voice quality, record the voiceover separately in ElevenLabs, export the WAV, and upload it to HeyGen's "use my own audio" flow. The avatar will lip-sync to your uploaded audio. This adds 15 minutes but noticeably improves emotional range.
Success signal: you play the first sentence with your chosen voice and avatar, and it lands with the intended tone (confident, curious, urgent — whatever your script calls for).
Step 4: Paste the script, adjust scene settings, and render
Action: Open a new project, paste your script, select your avatar and voice, set the aspect ratio and background, and render.
Aspect ratio decisions based on destination:
- Landing page hero: 16:9 (1920×1080) for a desktop-first page, 1:1 (1080×1080) for a responsive mobile-first page.
- LinkedIn feed post: 1:1 (1080×1080) for the feed, 9:16 (1080×1920) if you're using LinkedIn's vertical video option.
- YouTube intro: 16:9 (1920×1080), Creator tier caps at 1080p.
- Product onboarding modal: 16:9 for desktop apps, 9:16 for mobile onboarding screens.
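To avoid rendering at the wrong size, it can help to pin the destinations above to a lookup you check before hitting render. The preset names below are illustrative; the resolutions mirror the list above:

```python
# Destination -> (width, height) render presets, mirroring the
# decisions above. Names are illustrative; confirm the size in
# the export dialog before rendering.
PRESETS = {
    "landing_page_desktop": (1920, 1080),  # 16:9
    "landing_page_mobile": (1080, 1080),   # 1:1
    "linkedin_feed": (1080, 1080),         # 1:1
    "linkedin_vertical": (1080, 1920),     # 9:16
    "youtube_intro": (1920, 1080),         # 16:9; 1080p cap on Creator tier
    "onboarding_mobile": (1080, 1920),     # 9:16
}

def preset_for(destination: str) -> tuple:
    """Look up the render size; fail loudly rather than render the wrong ratio."""
    try:
        return PRESETS[destination]
    except KeyError:
        raise ValueError(f"No preset for {destination!r}; add it before rendering")
```

Failing loudly on an unknown destination is the point: a 3-8 minute re-render costs more than a one-line lookup.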
Background: the default studio backdrop works for most uses. If you have a brand asset (office photo, product screenshot, solid brand color), upload it. Avoid busy backgrounds that pull attention from the avatar's face.
Hit render. HeyGen typically completes a 60-second 1080p render in 3–8 minutes. For vertical mobile-first content, Captions.ai is sometimes the faster path — its AI Twin is built around phone output and renders faster than HeyGen on short vertical clips, though the avatar catalog is smaller.
Failure mode: rendering at the wrong aspect ratio for your destination and having to re-render. Confirm the output size before you hit the button.
Step 5: Review the render, then re-render once if needed
Action: Watch the full render three times: once at full screen, once at phone size in a browser preview, once muted with captions visible.
What you're checking for:
- Lip sync drift: if the avatar's mouth visibly lags or leads the audio by more than a frame, re-render. Usually a pacing issue — shorten a long sentence or add a comma for a beat.
- Frozen gesture loops: avatars sometimes repeat the same subtle hand movement every 4 seconds. If you see the loop, pick a different avatar or regenerate — there's no way to manually edit gesture within the tool.
- Facial inconsistencies: the avatar's expression should match the script's emotional arc. A deadpan avatar reading excited copy is worse than a stock voice reading flat copy.
- Caption accuracy: HeyGen auto-generates captions. Read every line. Proper nouns and brand names are the most common misses.
Budget one re-render cycle as normal. Two or more re-renders without improvement usually means the script is the problem, not the tool — go back to Step 1 and shorten.
Success signal: you've watched the render muted with captions and the message still lands. If it only works with sound, the script relied too heavily on voice delivery.
Step 6: Export and ship to the destination
Action: Export as MP4 (H.264, 1080p or 4K depending on tier), download, and upload to your destination.
Upload flow by destination:
- Landing page: upload to your site's CDN or video host (Mux, Cloudflare Stream, YouTube unlisted). Embed with autoplay muted and a visible play button. Don't autoplay with sound — browsers block it and users hate it.
- LinkedIn: native upload through the LinkedIn composer. Add captions as an SRT file (or enable LinkedIn's auto-captions) — most LinkedIn feed viewers watch muted.
- YouTube intro: upload as a separate short, or use as the first 10 seconds of a longer upload.
- Product onboarding: host on Mux or Cloudflare Stream, embed in your app with controls disabled for a smooth inline experience.
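If a destination wants an SRT caption file and you only have the script, a minimal generator is a few lines. The cue text and timings below are hypothetical placeholders; you'd align them to the actual render before upload:

```python
# Minimal SRT writer: numbered cues with HH:MM:SS,mmm timestamps.
# Cue timings here are placeholders, not taken from a real render.
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp style HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(cues: list) -> str:
    """cues: (start_seconds, end_seconds, text) triples, in playback order."""
    blocks = []
    for i, (start, end, text) in enumerate(cues, start=1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)

captions = [
    (0.0, 3.2, "Most explainer videos take an afternoon to shoot."),
    (3.2, 6.0, "This one took 45 minutes."),
]
print(to_srt(captions))
```

Save the output as `captions.srt` and attach it in the LinkedIn composer, or let LinkedIn's auto-captions do the work and proofread the result.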
For LinkedIn specifically, consider adding a 1–2 second static frame at the start with your headline overlaid as text — most feed viewers decide whether to keep watching in the first 2 seconds, and text hooks outperform avatar hooks on that metric.
Success signal: the video lives at the destination URL, plays on desktop and mobile, and the first viewer's feedback is about the message, not about the video being AI.
Common pitfalls
- Writing for the page instead of the ear: the fastest way to produce an uncanny AI video is to paste a paragraph of marketing copy. Spoken scripts run about 30% shorter than written versions of the same idea.
- Picking a stock avatar from the thumbnail: thumbnails hide the avatar's worst moment. Always watch the preview clip to the end before committing.
- Over-relying on the avatar for emotional delivery: avatars can't perform. If your script needs genuine excitement, urgency, or empathy, either record yourself or write the script to let the words do the emotional work.
- Shipping without a mute test: 80% of LinkedIn and Instagram feed viewers watch muted. If your video doesn't work with captions alone, it doesn't work on social.
When not to use this approach
- Founder sales videos for enterprise deals: prospects in high-ticket B2B contexts respond to authenticity, and AI avatars erode it. Record yourself, even badly; bad-real beats polished-AI in this context.
- Creator-brand content where your face is the product: if your audience follows you for you, don't swap yourself out. The moment the first viewer catches the switch, the comment thread becomes the story.
- Highly regulated industries requiring human accountability on camera: financial advice, medical content, and some legal contexts have disclosure rules that treat synthetic avatars as misleading. Check your industry's rules before you ship.
Bottom line
AI avatar explainers work best when the audience is paying attention to the message, not the messenger — marketing, training, onboarding, and information-led social posts. For those contexts, a 45-minute workflow that produces a 60-second video is a clear win over a half-day shoot.
See HeyGen on Creator if you need flexibility and speed; step up to Synthesia if your use case is enterprise training with SCORM and localization built in.
Common questions
- Do I need to appear on camera to use my own avatar?
- For HeyGen's Instant Avatar, you record about 2 minutes of yourself on a webcam following their scripted guidance, and their model trains a clone within a couple of hours. For Studio Avatar (higher fidelity), you do a guided shoot — remote with phone-quality framing, or in-person if your client needs broadcast polish. If you don't want to appear at all, the 175+ stock avatars in HeyGen and 230+ in Synthesia cover most use cases, and the stock avatars often look less uncanny than weak custom clones.
- How much does this realistically cost per month?
- HeyGen Creator at $24/month annual (or $29 monthly) is the working tier — unlimited videos up to 30 minutes, 1080p, watermark-free. Synthesia Starter sits at $29/month for 120 minutes of output per year, which works out to 10 minutes a month; Creator at $89/month unlocks 360 minutes a year. For one explainer video a month, HeyGen Creator is the cheaper and more flexible option. For enterprise training workflows with SCORM, Synthesia's tooling justifies the premium.
- Will my audience recognize it as AI?
- Inside 10–15 seconds for most viewers, yes. The tells are micro-expression repetition, blink rhythm, and hands that sit unnaturally still. For marketing and training where viewers know they're watching a produced asset, this is fine. For content that depends on you personally being on camera — a creator channel, a sales founder video — the AI-tell erodes trust faster than the production savings compensate. Run a three-person test on your target audience before committing.
- Can I do this entirely for free?
- HeyGen's free tier gives 3 × 1-minute videos per month with a watermark at 720p. Enough to validate the avatar quality for one landing page test, not enough for ongoing production. Synthesia has no free tier — only a demo video. Captions.ai's AI Twin starts at $9.99/month and works for mobile-first vertical output, not landing-page 16:9. Plan to pay $24–$29/month once you ship for real.
- How long does the whole workflow take end-to-end?
- About 45 minutes the first time: 10 minutes to write a 150-word script, 5 minutes to pick the avatar and voice, 10 minutes to paste the script and adjust pacing, 5 minutes to render, 10 minutes to review and re-render once, 5 minutes to export and upload. Once you have a brand-matched avatar and a voice preset saved, subsequent explainers drop to 15–20 minutes each — the script is the remaining bottleneck.
Keep reading
- Review: Captions Review: The cheapest caption app, until you want the AI features
- Review: HeyGen Review: Fastest talking-head AI, if you can live with the uncanny tell
- Review: Synthesia Review: The enterprise AI avatar tool — wrong fit for YouTubers
- Head-to-head: HeyGen vs Synthesia: Which one should you actually buy?