AI Short Drama Voice and Lip-Sync Guide
What you will learn
- How to choose preset voices and use voice cloning
- How to apply the 9 core emotion labels
- How batch dubbing and incremental generation work
- How the two lip-sync model paths differ
- How to generate sound effects and schedule them on a timeline
- How BGM generation and vocal modes work
- How to balance dialogue, music, and effects in the audio mix panel
1. AI voice overview
Linghui AI uses TTS technology to turn script dialogue into speech, then applies lip-sync processing so characters can appear to speak naturally on screen.
(Figure: voice workflow)
2. Voice selection guide
2.1 Preset voice types
| Voice type | Traits | Best for |
|---|---|---|
| Sweet female | Soft and cute | Young lead actresses and youthful roles |
| Mature female | Calm and steady | Professional women and supporting female roles |
| Sunny male | Bright and energetic | Young male leads and school-age roles |
| Deep male | Low and charismatic | Powerful or mature male characters |
| Child voice | Young and innocent | Child characters |
Every voice can be previewed, the list can be filtered by gender, and additional preset voices are available to explore inside the platform.
2.2 Selection tips
- Match the personality - Softer characters need gentler voices, while dominant roles need stronger presence.
- Keep voices distinct - Different roles should sound clearly different from each other.
- Preview with the same line - Use the same dialogue line to compare multiple voices fairly.
2.3 Voice cloning
Recording mode
- The system provides 5 sample reading scripts and displays one at random.
- Recommended recording length is 10 to 30 seconds.
- A real-time waveform preview is shown while recording.
- Typical flow: start → pause → continue → stop → preview → rerecord or use.
- Submission is disabled when the recording is shorter than 10 seconds.
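The recording-length rules above can be sketched as a simple validation check (the function name and messages are illustrative, not the platform's actual API):

```python
def check_recording(duration_s: float) -> tuple[bool, str]:
    """Validate a voice-cloning recording by length.

    Submission is blocked under 10 s; 10-30 s is the recommended range.
    """
    if duration_s < 10:
        return False, "too short: submission disabled under 10 s"
    if duration_s <= 30:
        return True, "ok: within the recommended 10-30 s range"
    return True, "ok: accepted, but 10-30 s is recommended"

print(check_recording(8))   # blocked
print(check_recording(20))  # recommended range
```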
Upload mode
You can also upload an audio file of at least 5 seconds in a supported format to create a cloned voice.
Cloned voice management
Cloned voices support viewing, renaming, deleting, and batch deletion.
✅ Voice cloning is currently a 0-credit feature and can be used repeatedly.
2.4 Global dubbing setup
At the top of the storyboard page, you can assign voices to all roles at once to avoid repeating the same setup shot by shot.
3. Emotion labels
Emotion labels directly affect speaking tone and delivery. The current base set contains 9 emotions, each with a distinct delivery style:
- Rising tone and lively delivery
- Lower tone and slower pace
- Heavier emphasis and quicker attack
- Tense and trembling delivery
- Rising pitch and short bursts
- Flat and neutral delivery
- Rejecting and dismissive tone
- Soft, close, and private feeling
- Higher volume with strong emphasis
More than 30 extended emotions are also available in script editing. Let emotion evolve with the plot instead of staying flat the whole time.
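Letting emotion evolve with the plot amounts to assigning a label per line rather than one label per character. A minimal sketch (the label names and data layout here are illustrative, not the platform's exact identifiers):

```python
# Per-line emotion assignment: the arc changes with the plot
# instead of staying flat for the whole drama.
script = [
    {"line": "We finally made it!",    "emotion": "happy"},
    {"line": "...but at what cost?",   "emotion": "sad"},
    {"line": "Someone is at the door.", "emotion": "fear"},
]

def emotion_arc(script):
    """Return the sequence of emotion labels across the script."""
    return [seg["emotion"] for seg in script]

print(emotion_arc(script))  # ['happy', 'sad', 'fear']
```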
4. Batch dubbing and incremental generation
- Batch generate voices for all storyboard dialogue in one run
- Incremental mode only processes unfinished storyboard items and skips completed ones
- Dubbing progress is shown as completed N / total M
- Each dialogue segment can be previewed and downloaded independently
- You can upload a custom recorded voice file for a specific storyboard item
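The incremental behavior above can be sketched as a loop that skips finished items and reports progress in the completed N / total M form (the shot schema and TTS placeholder are assumptions for illustration):

```python
def incremental_dub(shots):
    """Dub only unfinished storyboard items; skip completed ones.

    `shots` is a list of dicts with a 'done' flag (hypothetical schema).
    Returns a 'completed N / total M' progress string.
    """
    for shot in shots:
        if shot["done"]:
            continue  # incremental mode: already dubbed, skip
        shot["audio"] = f"tts:{shot['line']}"  # placeholder for a real TTS call
        shot["done"] = True
    n = sum(s["done"] for s in shots)
    return f"completed {n} / total {len(shots)}"

shots = [
    {"line": "Hello",   "done": True, "audio": "tts:Hello"},
    {"line": "Goodbye", "done": False},
]
print(incremental_dub(shots))  # completed 2 / total 2
```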
5. Lip sync
5.1 How it works
- Analyze phonemes and timing from the generated speech
- Generate matching mouth animation
- Apply the animation to the character image
- Keep the rest of the character stable
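The second step, turning phoneme timing into mouth animation, is commonly done with a phoneme-to-viseme (mouth-shape) lookup. A minimal sketch under that assumption; the mapping table below is illustrative, not the platform's actual one:

```python
# Illustrative phoneme -> mouth-shape (viseme) table.
VISEMES = {"AA": "open", "M": "closed", "F": "teeth-on-lip", "SIL": "rest"}

def mouth_track(phonemes):
    """phonemes: list of (label, start_s, end_s) from speech analysis.

    Returns a timed sequence of mouth shapes to drive the animation.
    """
    return [(VISEMES.get(p, "rest"), start, end) for p, start, end in phonemes]

print(mouth_track([("M", 0.0, 0.1), ("AA", 0.1, 0.3), ("SIL", 0.3, 0.4)]))
```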
5.2 Optimization tips
✅ Front-facing characters usually produce the most natural lip-sync result.
⚠️ Side-angle characters may need a more frontal dialogue shot for better results.
5.3 Model selection
| Model | Traits | Recommended for |
|---|---|---|
| Tongyi | Natural and smooth result | Dialogue-heavy scenes |
| Kling | Stable and reliable result | Action-oriented scenes |
5.4 Automatic skip
If the selected video generation model already contains embedded audio, such as Vidu, the system skips the lip-sync step automatically.
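The skip rule is a simple gate on the chosen video model. A sketch, assuming a lookup set of models with embedded audio (only Vidu is named in this guide; the set and function name are illustrative):

```python
# Video generation models whose output already embeds audio.
MODELS_WITH_EMBEDDED_AUDIO = {"vidu"}

def needs_lip_sync(video_model: str) -> bool:
    """Lip-sync runs only when the video model has no embedded audio."""
    return video_model.lower() not in MODELS_WITH_EMBEDDED_AUDIO

print(needs_lip_sync("Vidu"))        # False: lip-sync skipped automatically
print(needs_lip_sync("other-model")) # True: lip-sync runs
```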
5.5 Lip-sync status
Progress states: processing (blue) / completed (green) / pending (gray).
6. Sound effect generation
6.1 Overview
Sound effects are generated from natural-language descriptions through AI audio generation.
6.2 Basic mode
Enter a sound-effect prompt such as "thunder in a storm" or "coffee shop ambience" and generate directly.
6.3 Timeline segment mode
You can place multiple sound-effect segments inside one storyboard shot.
- Each segment defines start time, end time, and a sound description
- Validation rules: end time must be greater than start time, the description cannot be empty, and it must stay within 1500 characters
Example segment layout for one shot:
- 0-3s: Footsteps approaching from a distance
- 3-5s: Door creaks open
- 5-8s: Thunder rolls
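The three validation rules can be sketched as one check per segment (function name and messages are illustrative):

```python
def validate_segment(start_s: float, end_s: float, description: str) -> str:
    """Apply the segment rules: end > start, non-empty description,
    description within 1500 characters."""
    if end_s <= start_s:
        return "end time must be greater than start time"
    if not description.strip():
        return "description cannot be empty"
    if len(description) > 1500:
        return "description exceeds 1500 characters"
    return "ok"

print(validate_segment(0, 3, "Footsteps approaching from a distance"))  # ok
print(validate_segment(5, 5, "Thunder rolls"))  # invalid time range
```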
6.4 AI prompt recommendation
The system can recommend sound-effect prompts based on scene description, dialogue, emotion, and shot type.
6.5 Upload custom sound effects
Custom sound-effect files in common audio formats are supported.
7. BGM generation
7.1 Overview
Background music is generated through the Doubao Music API.
7.2 Music prompt writing
- Edit prompts manually, such as "light piano music" or "tense electronic score"
- Use AI-recommended prompts based on the current scene mood
7.3 Vocal mode selection
| Mode | Description |
|---|---|
| Instrumental | Music only, recommended for most scenes |
| Light vocal hum | Music with soft vocal texture |
| Lead vocal | Full singing track. Lyrics are required, otherwise it falls back to light vocal hum. |
7.4 Upload custom BGM
Custom BGM files such as MP3 are supported.
7.5 Credit usage
Before BGM generation starts, a credit confirmation dialog shows the duration and estimated cost.
8. Audio mixing panel
8.1 Independent three-track volume control
- Dialogue volume (0-100)
- Music volume (0-100)
- Sound-effect volume (0-100)
8.2 Preset mixing profiles
You can apply one-click profiles such as "Dialogue First" with dialogue 100 / music 40 / effects 60.
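A one-click profile is just a stored triple of track volumes. A sketch; only the "Dialogue First" values (100/40/60) come from this guide, and the profile key is an illustrative identifier:

```python
# Preset three-track volumes; each channel is on a 0-100 scale.
PROFILES = {
    "dialogue_first": {"dialogue": 100, "music": 40, "effects": 60},
}

def apply_profile(name: str) -> dict[str, int]:
    """Return the three-track volumes for a preset profile."""
    profile = PROFILES[name]
    assert all(0 <= v <= 100 for v in profile.values())
    return profile

print(apply_profile("dialogue_first"))
```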
8.3 Best practice
✅ Keep dialogue the loudest, music next, and sound effects lowest so viewers can always hear the lines clearly.
FAQ
Q: Can I add background music?
Yes. Linghui AI includes built-in BGM generation during the dubbing stage. It supports three vocal modes and also allows custom music uploads.
Q: What should I do if the voice does not match the visuals?
Check whether the storyboard duration is long enough. Longer dialogue often needs longer shot duration.
Q: What if the lip sync looks unnatural?
Use a front-facing character image, avoid overly long continuous dialogue, try switching the lip-sync model, and regenerate the affected shot if needed.
Q: Which is better, voice cloning or preset voices?
Preset voices are stable and require no sampling. Cloned voices are more personalized but depend on recording quality. Start with presets and only clone if needed.
Q: Can sound effects and BGM be used together?
Yes. Use the audio mixing panel to balance them independently and avoid overlap issues.
Q: How do I make dubbing feel more emotional?
Set an appropriate emotion label for each line and let the emotional profile change with the plot. Voice cloning can strengthen the result further.