How to record video samples for custom text to speech avatar

This article provides instructions on preparing high-quality video samples for creating a custom text to speech avatar.

Building a custom text to speech avatar model requires training on a video recording of a real person speaking. This person is the avatar talent. You must get sufficient consent under all relevant laws and regulations from the avatar talent to create a custom avatar from the talent's image or likeness. To learn about the requirements for the consent statement video, see Get consent file from the avatar talent.

Recording environment

We recommend recording in a professional video recording studio or a well-lit place.

Background requirement

If you need a commercial, multi-scene avatar, use a clean, smooth, solid-colored background. A green screen is the best choice.

If your avatar will only be used in a single scene, you can record in that specific scene (such as your office), but the background can't be removed or changed afterward.

Here are best practices to consider when you use a pure-colored background (such as green screen) for recording:

  • Set a green screen behind the actor. If the avatar video shows the actor's full body, including the feet, also place a green screen under the feet, and connect the back and floor green screens seamlessly.
  • Keep the green screen flat and uniform in color.
  • Keep the actor 0.5 m to 1 m away from the back background.
  • Light the green screen properly to prevent shadows.
  • Keep the full outline of the actor within the edges of the green screen.
  • Don't let the actor stand too close to the green screen.
  • Make sure the actor's head and hands don't extend beyond the green screen when speaking.

Lighting requirement

  • Light the actor's face evenly and brightly, avoiding shadows on the face and reflections on the actor's glasses and clothes.
  • Avoid changes in ambient light on the actor. For example, turn off projectors, close the curtains to block changing daylight, and use a stable artificial light source.

Devices

  • Camera requirement: A minimum of 1080p resolution and 25 FPS (frames per second).
  • Don't change the position of the lights or the camera once they're set up; keep them fixed for the whole shoot.
  • You can use a teleprompter to prompt the script during recording, but ensure it doesn't affect the actor's gaze toward the camera.
  • For half-length or seated avatars, provide a place for the actor to sit. If you don't want the image of the chair to appear, you can choose a chair.

Appearance of the actor

The custom text to speech avatar doesn't support customization of clothes or looks. Therefore, it's essential to carefully design and prepare the avatar's appearance when recording the training data. Consider the following tips:

Hair:

Dos:

  • The actor's hair should have a smooth and glossy surface.
  • Even the actor's bangs or stray hairs should have a clear and smooth border.
  • Choose a hairstyle that is easy to keep consistent during the whole video recording.

Don'ts:

  • Avoid messy hair or backgrounds showing through the hair.
  • Don't let hair block the eyes or eyebrows.
  • Avoid shadows on the face caused by the hairstyle.
  • Avoid hairstyles that change too much during speech and body gestures. For example, a high ponytail might appear, disappear, and swing while the actor speaks.

Clothing:

Dos:

  • Pay attention to the state of the clothing and make sure it doesn't change significantly while speaking.

Don'ts:

  • Avoid wearing clothing and accessories that are too loose, heavy, or complex, as they might affect the consistency of the clothing during speech and body gestures.
  • Avoid wearing clothing that is too similar to the background color, as well as reflective materials like white shirts, or translucent materials.
  • Avoid clothing with obvious lines or items with logos and brand names you don't want to highlight.
  • Avoid reflective elements such as metal belts, shiny leather shoes, and leather pants.

Face:

Dos:

  • Ensure the actor's face is clearly visible.

Don'ts:

  • Avoid a face obscured by hair, sunglasses, or accessories.

What video clips to record

You need four types of basic video clips:

Consent Video:

  • The consent video must show the same avatar talent speaking, following the requirements of the consent statement. Make sure the statement is recorded correctly and each word is clearly spoken. For details, see Get consent file from the avatar talent. You can use any one of the supported languages.
  • The avatar talent should face the camera directly at all times, without large movements.
  • Record the video in a quiet environment, with the voice at a reasonable volume. Try to keep the signal-to-noise ratio higher than 20 dB. For voice recording guidance, see the Recording custom voice samples guide.
  • Ensure that the head isn't occluded in any frame of the video.
  • Make sure no other objects appear in the frame, including filming equipment and mobile phones.
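
The signal-to-noise guidance above can be checked roughly before you submit a recording. The following is a minimal sketch, not the service's own measurement: it assumes the threshold is expressed in decibels, that you have already extracted a stretch of speech samples and a stretch of room-tone (silence) samples as plain lists of floats, and that SNR is estimated as 20 * log10 of the ratio of their RMS amplitudes. The function names are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of audio samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech_samples, noise_samples):
    """Rough SNR estimate in dB: 20 * log10(rms(speech) / rms(noise))."""
    noise = rms(noise_samples)
    speech = rms(speech_samples)
    if noise == 0:
        return math.inf   # perfectly silent noise floor
    if speech == 0:
        return -math.inf  # no speech energy at all
    return 20 * math.log10(speech / noise)


def meets_snr_target(speech_samples, noise_samples, target_db=20.0):
    """True when the estimated SNR is higher than the target (assumed 20 dB)."""
    return snr_db(speech_samples, noise_samples) > target_db


# Synthetic example: speech amplitude 0.5, room tone amplitude 0.01.
speech = [0.5, -0.5] * 100
room_tone = [0.01, -0.01] * 100
print(round(snr_db(speech, room_tone), 1))    # about 34 dB
print(meets_snr_target(speech, room_tone))    # True
```

Because the speech segment also contains the room noise, this estimate is slightly optimistic; treat it as a quick screening check only.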

Status 0 speaking:

  • Status 0 represents the posture you can naturally maintain most of the time while speaking. For example, arms crossed in front of the body or hanging down naturally at the sides.
  • Maintain a front-facing pose. The actor can move slightly to show a relaxed status, like moving the head or shoulder slightly, but don't move the body too much.
  • Length: keep speaking in status 0 for 3-5 minutes.

Samples of status 0 speaking:

Animated graphic depicting Lisa speaking in status 0, representing the posture naturally maintained while speaking.

Animated graphic depicting Harry speaking in status 0, representing the posture naturally maintained while speaking.

Animated graphic depicting Lori speaking in status 0, representing the posture naturally maintained while speaking.

Naturally speaking:

  • The actor speaks in status 0 but with natural hand gestures from time to time.
  • Hands should start from status 0 and return to status 0 after making gestures.
  • Use natural and common gestures when speaking. Avoid meaningful gestures like pointing, applause, or thumbs up.
  • Length: Minimum 5 minutes, maximum 30 minutes in total. At least one piece of 5-minute continuous video recording is required. If recording multiple video clips, keep each clip under 10 minutes.
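
The length rules above can be expressed as a small pre-upload check. This is a sketch under stated assumptions: clip durations are given in minutes, the function name is hypothetical, and the "under 10 minutes" rule is applied only when there is more than one clip, which is one reading of the guidance.

```python
def check_natural_speaking_clips(durations_min):
    """Return a list of problems with 'naturally speaking' clip lengths.

    durations_min: clip lengths in minutes (assumed input format).
    An empty list means all length rules are satisfied.
    """
    problems = []
    total = sum(durations_min)
    if total < 5:
        problems.append("total length is under 5 minutes")
    if total > 30:
        problems.append("total length is over 30 minutes")
    if not any(d >= 5 for d in durations_min):
        problems.append("no clip has at least 5 continuous minutes")
    if len(durations_min) > 1 and any(d >= 10 for d in durations_min):
        problems.append("with multiple clips, each clip should be under 10 minutes")
    return problems


print(check_natural_speaking_clips([6, 4, 8]))  # []
print(check_natural_speaking_clips([3, 4]))     # one problem: no 5-minute clip
```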

Samples of natural speaking:

Animated graphic depicting sample of Lisa speaking in status 0 with natural hand gestures, representing the posture naturally maintained while speaking.

Animated graphic depicting sample of Harry speaking in status 0 with natural hand gestures, representing the posture naturally maintained while speaking.

Animated graphic depicting sample of Lori speaking in status 0 with natural hand gestures, representing the posture naturally maintained while speaking.

Silent status:

This video clip is important if you build a real-time conversation with the custom avatar. The video clip is used as the main template for both speaking and listening status for a chatbot.

  • Maintain status 0 and don't speak, but stay relaxed.
  • Even while remaining in status 0, don't stay completely still; you can move slightly, but not too much. Act as if you're waiting.
  • Maintain a smile as if listening or waiting patiently.
  • Avoid nodding frequently.
  • Length: 1 minute.

Samples of silent status:

Animated graphic depicting sample of Lisa maintaining silent status without speaking but still feeling relaxed.

Animated graphic depicting sample of Harry maintaining silent status without speaking but still feeling relaxed.

Animated graphic depicting sample of Lori maintaining silent status without speaking but still feeling relaxed.

Gestures (optional):

Gesture video clips are optional. If you need to insert certain gestures while the avatar speaks, follow these guidelines to record gesture videos. Gesture insertion is only available for batch mode avatars; real-time avatars don't support gesture insertion at this point. Each custom avatar model supports no more than 10 gestures.

Gesture tips:

  • Each gesture clip should be no longer than 10 seconds.
  • Gestures should start from status 0 and end in status 0. It's essential that the character keeps the same position as in status 0, in the middle of the screen, throughout the gesture. Otherwise, the gesture clip can't be smoothly inserted into the avatar video.
  • The gesture clip only captures body gestures; the actor doesn't have to speak while making gestures.
  • We recommend designing a list of gestures before recording; here are some examples of gesture video clips:

Samples of gesture:

  • Delivering sell link/promotion code: An animated graphic depicting sample of delivering sell link.
  • Praising the product: An animated graphic depicting sample of praising the product.
  • Introducing the product: An animated graphic depicting sample of introducing the product.
  • Displaying the price (number from 1 to 10-fist-number with each hand), right hand: An animated graphic depicting sample of displaying the price with right hand.
  • Displaying the price (number from 1 to 10-fist-number with each hand), left hand: An animated graphic depicting sample of displaying the price with left hand.
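
The two numeric limits on gestures described in this section (no more than 10 gestures per model, each clip within 10 seconds) can be checked against a planned gesture list. This is a minimal sketch; the function name, the gesture names, and the dictionary input format are all hypothetical.

```python
MAX_GESTURES = 10           # per custom avatar model, from the guidance above
MAX_GESTURE_SECONDS = 10.0  # per gesture clip, from the guidance above


def check_gesture_plan(gestures):
    """Return a list of problems with a planned set of gesture clips.

    gestures: dict mapping a gesture name to its clip length in seconds
    (an assumed input format).
    """
    problems = []
    if len(gestures) > MAX_GESTURES:
        problems.append(
            f"{len(gestures)} gestures planned; the limit is {MAX_GESTURES}"
        )
    for name, seconds in gestures.items():
        if seconds > MAX_GESTURE_SECONDS:
            problems.append(f"{name}: {seconds:g} s is longer than 10 s")
    return problems


plan = {"praise_product": 6.0, "show_price_right_hand": 9.5}
print(check_gesture_plan(plan))  # []
```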

High-quality avatar models are built from high-quality video recordings, including audio quality. Here are more tips for the actor's performance and for recording video clips:

Dos:

  • Ensure all video clips are taken in the same conditions.
  • During the recording process, plan the size and display area of the character you need so that the character is displayed on the screen appropriately.
  • The actor should be steady during the recording.
  • Mind facial expressions, which should suit the avatar's use case. For example, look positive and smile if the custom text to speech avatar is used for customer service. Look professional if the avatar is used for news reporting.
  • Maintain eye gaze toward the camera, even when using a teleprompter.
  • Return your body to status 0 when pausing speaking.
  • Speak on a self-chosen topic. Minor speech mistakes, like missing or mispronouncing a word, are acceptable. If the actor misses a word or mispronounces something, just go back to status 0, pause for 3 seconds, and then continue speaking.
  • Consciously pause between sentences and paragraphs. When pausing, go back to status 0 and close your lips.
  • The audio should be clear and loud enough; bad audio quality impacts the training result.
  • Keep the shooting environment quiet.

Don'ts:

  • Don't adjust the camera parameters, focal length, position, or angle of view. Don't move the camera; keep the person's position, size, and angle consistent in the frame.
  • Don't frame the character too small or too large. Characters that are too small might lead to a loss of image quality during post-processing. Characters that are too large might cause the screen to overflow during gestures and movements.
  • Don't make gestures that last too long or involve too much movement. For example, avoid gesturing constantly and forgetting to go back to status 0.
  • The actor's movements and gestures must not block the face.
  • Avoid small movements like licking lips, touching hair, talking sideways, constant head shaking during speech, and not closing the lips after speaking.
  • Avoid background noise; staff should avoid walking and talking during video recording.
  • Avoid recording other people's voices while the actor is speaking.

How to prepare an interaction video clip

Creating a high-quality interaction video clip is essential if you're building a real-time conversation with a custom avatar. The clip should follow a question-and-answer format, where a photographer asks a question and the actor responds. Repeat the question-answer pairs until the conversation is complete. If you're filming alone, imagine someone else asking the questions during the asking phase.

Here are some tips for each phase:

Asking phase:

  • Maintain status 0 and don't speak, but stay relaxed.
  • Even while remaining in status 0, don't stay completely still. Act as if you're waiting.
  • Maintain a smile as if listening or waiting patiently.
  • Avoid nodding frequently.
  • Length: Each asking slot should last around 3–5 seconds.

Answering phase:

  • Speak naturally with natural hand gestures from time to time.
  • Use natural and common gestures when speaking. Avoid meaningful gestures like pointing, applause, or thumbs up.
  • Begin gestures after starting to speak, and stop them before you finish.
  • Length: Each answering slot should last around 5 seconds.

Total video length:

  • Aim for a total video length of 1–5 minutes.
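
The slot lengths and total length above imply a range for how many question-answer pairs to plan. The following sketch works that arithmetic out. It assumes each pair is simply one asking slot plus one answering slot with no gap between pairs; the function name and defaults are hypothetical.

```python
import math


def qa_pair_range(ask_seconds, answer_seconds=5.0,
                  min_total_seconds=60.0, max_total_seconds=300.0):
    """Return (fewest, most) question-answer pairs for a 1-5 minute clip."""
    if not 3.0 <= ask_seconds <= 5.0:
        raise ValueError("asking slot should last around 3-5 seconds")
    pair_seconds = ask_seconds + answer_seconds
    fewest = math.ceil(min_total_seconds / pair_seconds)   # reach 1 minute
    most = math.floor(max_total_seconds / pair_seconds)    # stay within 5 minutes
    return fewest, most


print(qa_pair_range(3))  # (8, 37)
print(qa_pair_range(5))  # (6, 30)
```

Under these assumptions, planning somewhere between roughly 8 and 30 pairs keeps the clip inside the recommended total length for any asking slot in the 3 to 5 second range.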

Data requirements

Some basic processing of your video data helps model training efficiency. For example:

  • Make sure that the character is in the middle of the screen and that its size and position stay consistent during video processing. Keep each video processing parameter, such as brightness and contrast, the same throughout. The output avatar's size, position, brightness, and contrast directly reflect those present in the training data. We don't apply any alterations during processing or model building.
  • Keep the start and end of each clip in status 0: the actor should close their mouth, smile, and look ahead. The video should be continuous, not abrupt.

Avatar training video recording file format: .mp4 or .mov.

Resolution: At least 1920x1080.

Frame rate: At least 25 FPS.
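
The file requirements above can be checked before upload once you know each file's metadata. This is a minimal sketch: it assumes you have already read the width, height, and frame rate with a tool of your choice, it assumes landscape orientation, and the function name is hypothetical.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp4", ".mov"}
MIN_WIDTH, MIN_HEIGHT = 1920, 1080
MIN_FPS = 25.0


def check_video_spec(filename, width, height, fps):
    """Return a list of problems with a training video's basic specs.

    width, height, and fps are assumed to be read from the file beforehand.
    An empty list means the file meets the stated requirements.
    """
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file format: {extension or '(none)'}")
    if width < MIN_WIDTH or height < MIN_HEIGHT:
        problems.append(f"resolution {width}x{height} is below 1920x1080")
    if fps < MIN_FPS:
        problems.append(f"frame rate {fps:g} FPS is below 25 FPS")
    return problems


print(check_video_spec("status0.mp4", 1920, 1080, 25))  # []
print(check_video_spec("status0.avi", 1280, 720, 24))   # three problems
```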
