Using the Audio Lip Sync Filter (Windows Embedded CE 6.0)
1/6/2010
The process of tuning a configuration for good Lip-Sync is a time-consuming cycle of measuring and adjusting the Lip-Sync Filter. These suggestions provide general guidance and best practices for this task. Your own situation will be unique, and you will want to refine these techniques to suit your own needs.
Reload dvraudiosync.dll
Be sure that dvraudiosync.dll is reloaded after you change a registry setting. The registry settings are cached during load, so your new values will not be applied until the dvraudiosync.dll library is unloaded and reloaded.
Enable Logging
Use an optimized retail build with the retail logging of the DVR engine and audio sync filters enabled. You will need this diagnostic information in order to measure your progress.
Build the Audio Lip-Sync Filter with the Lip-Sync stats debug zone enabled by default. This avoids throwing off the timing by using Platform Builder (PB) to turn on the zone. Having the zone on by default also speeds up your test-and-go cycle. This debug zone only runs every 10 seconds, so it does not have a major impact on what you are measuring.
Tune the Capture Side Clock Ratio
Tune the capture side clock ratio to be stable in the range 0.99 to 1.01. You are not going to get a good playback experience if the clock ratio is fluctuating.
If the clock ratio is significantly off from 1.0, audio playback frequency will be especially distorted. The rendering filters are designed to introduce subtle changes in order to re-sync to a varying clock. Audio and video won’t recover at the same rate so Lip-Sync will rarely be accurate in this case. Look for the following logged information from the DVR engine about the clock ratio:
Old Rate:1.0000 New Rate:1.0065 Rebased:-87Delta:-87
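When you are scanning a long log capture, a small helper can pull the new rate out of each line and flag values outside the 0.99 to 1.01 band. The following is a minimal, portable sketch that assumes the log line has exactly the layout shown above; the function names are hypothetical and are not part of the DVR engine.

```c
#include <assert.h>
#include <stdio.h>

/* Extracts the "New Rate" value from a DVR engine clock log line such as
 * "Old Rate:1.0000 New Rate:1.0065 Rebased:-87Delta:-87".
 * Returns 1 on success, 0 if the line does not have that layout.
 * Hypothetical helper: the line layout is assumed from the sample above. */
int parse_new_rate(const char *line, double *new_rate)
{
    double old_rate;

    assert(line != NULL && new_rate != NULL);
    return sscanf(line, "Old Rate:%lf New Rate:%lf", &old_rate, new_rate) == 2;
}

/* Returns 1 when the ratio lies inside the 0.99 to 1.01 target band. */
int rate_in_target_band(double rate)
{
    return rate >= 0.99 && rate <= 1.01;
}
```

For the sample line above, parse_new_rate() yields 1.0065, which rate_in_target_band() accepts. A run of lines that fall outside the band, or that swing back and forth inside it, suggests that the capture side still needs work.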
Balance Thread Priority
The capture graph’s source filter and any other filters upstream of the DVR source filter should be run at a priority higher than the UI, audio, and the rest of the a/v stream, for example priorities 140 to 148. If your source filter for the playback side is reading from disk, ideally put its content on a separate volume. This minimizes the chance that shared locks trigger priority inversions that give high priority to other code. If you are reading files on the same volume as the captured files, try a priority not far below audio (149) or on par with audio to minimize the impact of priority inversions.
The DVR engine’s sink filter should be running at a priority more urgent than UI and ideally on-par with the decoder filters on the playback side. StreamingThreadPri and DVRWriterPriority are the registry settings controlling these priorities.
Reduce the Default Thread Quantum
You need smooth arrival of samples. Bursty arrival to the DVR sink filter will de-stabilize the clock and force you to increase the size of the DVR sink filter’s RAM cache. In one product, the default quantum was lowered to 5 ms in order to achieve this goal. That product experienced a reduction of CPU usage from an average of about 80% to an average of about 65%. The added overhead in thread swapping was totally swamped by the savings in the decoding and rendering code because there was no need to play catch-up. Another hint that you are having bursty a/v is that the total CPU reported by the performance monitor tool (under process statistics) will exhibit a saw-tooth. On one product, a short saw-tooth (about 3 to 5% of the CPU) was associated with thread quantum while a large saw-tooth (about 20 to 30% of the CPU) was associated with poor choices of thread priority. Note that Audio Lip-Sync Filter registry settings may also cause large swings in CPU usage.
Here is a sample code snippet for setting the default thread quantum. OEMInit() is commonly found in a file with a name like platform\YOUR_PLATFORM\src\kernel\libs\startup\oeminit.c:
extern DWORD dwDefaultThreadQuantum;
// Lower the default thread quantum (for smoother
// A/V rendering in the face of limited memory and
// high CPU demand)
void OEMInit()
{
    // 5 ms thread quantum gives better time-slicing
    // between the various A/V rendering and drawing threads
    dwDefaultThreadQuantum = 5;
    ....
Tune Playback Side Priorities
You will need to alternate between tuning the thread priorities for the playback side and tuning the audio Lip-Sync statistics.
Suggestions:
- The thread that sends video out in sync with the VBI interval must be set to a very high priority, higher than anything on the capture side.
- Threads that dispatch audio (but not necessarily decode it) should be at the standard audio priority (249). These must be more urgent than the decoders and ideally should be more urgent than the DVR sink.
- Decoder threads should be more urgent than UI and key dispatch but less urgent than audio dispatch. Priority 250 is good.
- The DVR source filter’s reader thread should be similar to the decoder threads. (See DVRReaderPriority). Priority 250 is good.
- The DVR lazy open and lazy delete thread priorities should be less urgent than MPEG decoding and ideally should be similar to the UI and key dispatch. (See DVRLazyOpenPriority and DVRLazyDeletePriority). Priority 251 is good.
- The DVR thread that pre-creates files for the capture graph should be less urgent than the lazy open and delete threads. Being similar to other moderately important background activities is good. Priority 252 is recommended.
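It is easy to lose track of the relative ordering while you experiment with these values. The sketch below records the suggested playback-side priorities in a table and checks that the table runs from most urgent to least urgent, using the Windows CE convention that a numerically lower priority is more urgent. The role names, the structure, and the function are hypothetical illustrations; the very high priority video output thread is left out because the list above does not give it a number.

```c
#include <assert.h>
#include <stddef.h>

/* One entry per playback-side thread role (names are illustrative only). */
struct thread_role {
    const char *name;
    int priority;   /* Windows CE priority: lower number = more urgent */
};

/* Values taken from the suggestions above. */
static const struct thread_role playback_roles[] = {
    { "audio dispatch",       249 },
    { "decoders",             250 },
    { "DVR reader",           250 },
    { "DVR lazy open/delete", 251 },
    { "DVR file pre-create",  252 },
};

/* Returns 1 if the table is ordered from most urgent to least urgent
 * (equal neighbors are allowed), 0 otherwise. */
int roles_ordered_by_urgency(const struct thread_role *roles, size_t count)
{
    size_t i;

    assert(roles != NULL);
    for (i = 1; i < count; ++i) {
        if (roles[i].priority < roles[i - 1].priority) {
            return 0;
        }
    }
    return 1;
}
```

Running the check after each change to the table catches the case where a background thread has been made more urgent than the decoders or the audio dispatch threads.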
Verify that your Decoder is Sending Audio Sample Durations Compatible with the Audio Lip-Sync Filter
Lip-sync in the Audio Lip-Sync Filter (dvraudiosync.dll) is based on audio samples having an average duration matching a registry-supplied value (MillisecTypicalSamplePlayTime).
You can set the debugger to break in AudioSyncFilter::Transform, step over the early call to IMediaSample::GetTimes(), and then query rtSampleEnd and rtSampleStart. The difference is in 100 ns units (that is, divide by 10,000 to get milliseconds).
The default registry setting is 32 ms. For MPEG1 layer 2 audio this should be 24.
If your decoder is sending down a different average duration, update the registry value to match. In addition, you may need to re-tune the registry setting that identifies how long it takes for an audio sample to be rendered if pushed downstream by the Audio Lip-Sync Filter when all downstream buffering is full. Set the DesiredLeadTimeMillisec registry value to its default value * (your average sample duration in milliseconds) / 32 ms.
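The two calculations in this section can be sketched as follows. This is a portable illustration rather than code from the filter: ref_time_t stands in for REFERENCE_TIME, and the function names are hypothetical.

```c
#include <assert.h>

/* Stand-in for REFERENCE_TIME, which counts 100 ns units. */
typedef long long ref_time_t;

/* Converts a sample's start/end times (100 ns units) to a duration
 * in milliseconds: 10,000 units of 100 ns make 1 ms. */
double sample_duration_ms(ref_time_t rt_start, ref_time_t rt_end)
{
    return (double)(rt_end - rt_start) / 10000.0;
}

/* Scales a lead time that was tuned for 32 ms samples to a different
 * average sample duration, following the formula in the text:
 * default value * (average sample duration in ms) / 32 ms. */
double scaled_lead_time_ms(double default_lead_ms, double avg_sample_ms)
{
    assert(avg_sample_ms > 0.0);
    return default_lead_ms * avg_sample_ms / 32.0;
}
```

For example, a sample whose end time is 240,000 units after its start time lasts 24 ms. If a lead time of 320 ms had been tuned for 32 ms samples (a made-up figure used only for the arithmetic), the scaled value for 24 ms samples would be 240 ms.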
Match the Acceptable Audio Playback Rates to the Typical Clock Rate Ratio
Observe your typical clock rate ratio and tune the minimum / maximum acceptable audio playback rates to match.
The default settings assume that your capture and audio (playback) clocks are close, within 1% almost all the time. If you cannot get the clocks to come that close, you are going to have to live with incorrect playback frequency.
If your clock ratio is typically more than 1.01, then increase the following Audio Lip-Sync Filter registry setting by roughly 200 for each 0.01 that your clock ratio is over 1.01:
- MaximumAudiblePlaybackRate
Increase the following by roughly 300 for each 0.01 over 1.01:
- MaximumGoodPlaybackRate
Increase the following by roughly 400 for each 0.01 that your clock ratio is greater than 1.01:
- ModerateSampleDecimationThreshold
These changes allow the Audio Lip-Sync Filter to play faster to keep up with the faster-than-recommended clock. They also delay the point at which the Lip-Sync Filter discards samples to keep up.
If your clock ratio is typically less than 0.99, then decrease the following Audio Lip-Sync Filter registry settings by roughly 200 for each 0.01 that your clock ratio is under 0.99:
- MinimumAudiblePlaybackRate
- BeginMuteToSlowRateThreshold
- EndMuteToSlowRateThreshold
These changes allow the Audio Lip-Sync Filter to play more slowly, to keep pace with a slower-than-recommended clock. The changes also push back the point at which the heuristic mutes audio and does a drastic slowdown of playback to sync back up with the clock.
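The rules of thumb in this section reduce to a little arithmetic. The sketch below turns an observed clock ratio into suggested changes, which you would add to whatever values the registry settings currently hold. The structure and function names are hypothetical, and the results are only the rough starting points described above, not values computed by the filter.

```c
#include <assert.h>

/* Suggested changes to add to the current registry values. */
struct rate_deltas {
    int max_audible;          /* MaximumAudiblePlaybackRate */
    int max_good;             /* MaximumGoodPlaybackRate */
    int moderate_decimation;  /* ModerateSampleDecimationThreshold */
    int slow_side;            /* MinimumAudiblePlaybackRate,
                                 BeginMuteToSlowRateThreshold and
                                 EndMuteToSlowRateThreshold (zero or negative) */
};

/* Scales "per_step for each 0.01 of excess", rounded to the nearest integer. */
static int scale_per_hundredth(double excess, int per_step)
{
    return (int)(excess / 0.01 * per_step + 0.5);
}

struct rate_deltas suggest_rate_deltas(double clock_ratio)
{
    struct rate_deltas d = { 0, 0, 0, 0 };

    assert(clock_ratio > 0.0);
    if (clock_ratio > 1.01) {
        double over = clock_ratio - 1.01;
        d.max_audible = scale_per_hundredth(over, 200);
        d.max_good = scale_per_hundredth(over, 300);
        d.moderate_decimation = scale_per_hundredth(over, 400);
    } else if (clock_ratio < 0.99) {
        d.slow_side = -scale_per_hundredth(0.99 - clock_ratio, 200);
    }
    return d;
}
```

A typical ratio of 1.03 is 0.02 over the limit, so the sketch suggests raising the three fast-side settings by about 400, 600, and 800 respectively. A typical ratio of 0.97 suggests lowering each of the three slow-side settings by about 400.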
Adjust Downstream Buffering Estimate
For this you’ll need to turn on the Lip-Sync heuristic debug zone.
Pause or rewind if you are testing bound-to-live scenarios to make sure you are playing with at least a 10 second lag behind live.
You will see debug text that looks like this:
### AUDIO Deviation from ideal lip sync: 132 (132) ms average [min: 84, max: 179] @ offset 73 of 73 ### Audio: backpressure min 0, max 0, average 0.000000
Do a trick mode that involves flushing the graph. Look at the backpressure “max” value. The max during a time-period in which you’ve flushed the graph will be the maximum number of free media samples ever seen during the Audio Lip-Sync Filter’s receive method. So the total number of buffers is 1 plus that maximum.
The default setting of the registry value DesiredLeadTimeMillisec is based on a configuration in which there are 8 buffers in this pool plus more buffering downstream. Tentatively adjust this registry setting up [if you have more buffers] or down [if you have fewer] by your average audio sample duration times the difference of buffers-vs-8-assumed.
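As a worked form of this adjustment, the sketch below derives the buffer count from the observed maximum and shifts the lead time by the average sample duration for each buffer above or below the assumed 8. The function names are hypothetical, and the result is only a tentative starting value that still has to be refined by measurement.

```c
#include <assert.h>

/* Total buffers in the pool: 1 plus the maximum number of free media
 * samples seen after flushing the graph. */
int total_buffers(int max_free_seen)
{
    assert(max_free_seen >= 0);
    return max_free_seen + 1;
}

/* Tentative DesiredLeadTimeMillisec: shift the current value by the
 * average sample duration for each buffer above or below the 8 that
 * the default setting assumes. */
double adjusted_lead_time_ms(double current_lead_ms, int max_free_seen,
                             double avg_sample_ms)
{
    assert(avg_sample_ms > 0.0);
    return current_lead_ms + avg_sample_ms * (total_buffers(max_free_seen) - 8);
}
```

For example, a reported max of 9 means 10 buffers. With 24 ms samples that is 2 buffers, or 48 ms, more than the default assumes, so the setting would be raised by 48. A reported max of 5 with 32 ms samples means 6 buffers, so the setting would be lowered by 64.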
Start by limiting playback to a position at least 10 seconds behind live. Let playback roll (behind live) for a few minutes so that you can see the steady state behavior. The goal is to find a value of DesiredLeadTimeMillisec where the ‘backpressure’ measure is 0 or 1 in steady state and the reported deviation from ideal Lip-Sync hovers around 0.
If you are seeing a max that is often larger than 1, your estimate of DesiredLeadTimeMillisec is too large. The entire quota of downstream buffers can be played in less than that time. If the max is pegged near zero and audio is arriving early (per the chatter), then your DesiredLeadTimeMillisec is too small.
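The decision rule in the paragraph above can be written down as a small classifier. This is a simplified sketch: it looks at a single report, whereas the text asks you to judge what happens often over several minutes. The enum, the function name, and the tolerance parameter are hypothetical. It also assumes that a positive average deviation means early audio, which matches the later statement that late values are negative.

```c
#include <assert.h>

enum lead_hint {
    LEAD_OK,          /* nothing in this report suggests a change */
    LEAD_TOO_LARGE,   /* lower DesiredLeadTimeMillisec */
    LEAD_TOO_SMALL    /* raise DesiredLeadTimeMillisec */
};

/* Classifies one steady-state report.
 * backpressure_max: the "max" from the backpressure debug text.
 * avg_deviation_ms: average deviation from ideal Lip-Sync
 *                   (assumed positive when audio is early).
 * early_tolerance_ms: how far above 0 still counts as "hovering
 *                   around 0" (a judgment call, not a filter setting). */
enum lead_hint classify_lead_time(int backpressure_max, int avg_deviation_ms,
                                  int early_tolerance_ms)
{
    assert(backpressure_max >= 0);
    if (backpressure_max > 1) {
        return LEAD_TOO_LARGE;
    }
    if (backpressure_max == 0 && avg_deviation_ms > early_tolerance_ms) {
        return LEAD_TOO_SMALL;
    }
    return LEAD_OK;
}
```

Applied to the sample debug text shown earlier (backpressure max 0, average deviation 132 ms) with a tolerance of 10 ms, the classifier reports LEAD_TOO_SMALL.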
Notes:
- The range of usable values can be very tight (perhaps only 2 to 5 ms). Be prepared to experiment here.
Caution: This method is tuned for a clock rate of 1.0. It is possible that it won’t work for other clock rates.
- The Audio Lip-Sync Filter strips time-stamps from the audio before sending it downstream (to work around an issue with clock slaving). It controls Lip-Sync by keeping the buffer pipeline full (or with just one empty buffer) so that the audio decoder cannot run ahead. It adjusts the audio drift rate so that audio is rendered in sync with the playback clock instead of the audio clock. This mechanism only works if the Audio Lip-Sync Filter has accurate knowledge of how long it takes to play a full buffer pipeline.
Caution: This method requires that a common IMemAllocator be negotiated to span the gap between the output pin of the audio decoder, through the Audio Lip-Sync Filter, and on into the input pin of the downstream audio filter. If separate allocators are used between the audio decoder and the Audio Lip-Sync Filter versus the Audio Lip-Sync Filter and its downstream filter, the Audio Lip-Sync Filter will be unable to control backpressure.
If you are having trouble gaining control, especially if your clock rate is not near 1.0, here are some things to try.
- Revert DesiredLeadTimeMillisec back to its default (322).
- If your playback is stubborn about letting the lead time grow too large (positive values), try:
  - Reducing MinimumOngoingTroubleRateAdjust well below 1000. For example, a value of 500 will significantly tilt the method towards slowing down playback and so whittle down the lead.
  - Reducing MillisecToCorrectLeadLag. For example, decreasing the value to 15000 will significantly increase the pressure to accomplish adjustments quickly.
- If your playback is stubborn about letting the lag time grow (toward large negative numbers), try:
  - Increasing MaximumOngoingTroubleRateAdjust well above 1000. For example, a value of 1500 will significantly tilt the method towards speeding up playback and so catch up more quickly.
  - Reducing MillisecToCorrectLeadLag. For example, decreasing the value to 15000 will significantly increase the pressure to accomplish adjustments quickly.
Adjust the Video Versus Audio Rendering Targets
By carrying out the previous steps, you will have put the Audio Lip-Sync Filter into a state in which it can control when audio renders. It is only at that point that you can achieve lip sync.
Observe playback and decide whether audio is in sync with video, audio leads video, or video leads audio.
If audio and video are in sync, you’re done with this step.
If you hear audio before you see the corresponding video, lower the value of the AdditionalMillisecVideoLead registry setting. This will cause the video to render sooner (relative to audio).
If you see video before the corresponding audio, raise the value of the AdditionalMillisecVideoLead registry setting. This will delay the video relative to audio.
Tune the Live Position of Live TV
Now try playing at live (‘now’) when bound to live TV. Because the data at this position is coming from an MPEG encoder that may need to alternate between waiting for enough data to generate an i-frame and nearby b-frames and spitting out the newly created data, audio and video will be very jerky if played at the extreme edge of what is coming out of the encoder. The severity of the variance of the MPEG encoder may change with the type of content and the bit rate selected for the output.
The Audio Lip-Sync Filter has registry settings to control how far away from live you need to be for normal content and for the most severely variable conditions. The closer to live your playback starts, the greater the changes that playback will need to make under worst-case conditions. More change means poorer initial Lip-Sync and a greater chance of visible glitches. The DVR Source Filter honors the Audio Lip-Sync Filter’s estimates of how far to back off from live when doing trick modes that cause playback to reach live, but the estimates (in rare cases) may not reach the DVR Source Filter in time or may not be quite accurate. It should be your goal to be as close to live as you can be without causing a/v glitches.
Your primary control knobs are the following registry settings:
MillisecAdjustWhenFixingNormal
MillisecAdjustWhenFixingSevere
The average of these two values is used for moderately jittery live TV.
The playback position will be adjusted to be N milliseconds behind live where N = the registry setting minus 1000.
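The relationship between the two settings and the resulting distance from live can be sketched as follows. The enum and function names are hypothetical, and the sketch assumes that the "minus 1000" rule applies to the averaged value in the same way as to the two individual settings, which the text does not state explicitly.

```c
#include <assert.h>

/* Jitter (burstiness) level of the content; names are illustrative. */
enum jitter_level { JITTER_NORMAL, JITTER_MODERATE, JITTER_SEVERE };

/* Returns N, the number of milliseconds behind live, for the given
 * level and the two registry settings
 * (MillisecAdjustWhenFixingNormal and MillisecAdjustWhenFixingSevere). */
int ms_behind_live(enum jitter_level level, int adjust_normal_ms,
                   int adjust_severe_ms)
{
    int setting;

    assert(adjust_normal_ms >= 0 && adjust_severe_ms >= 0);
    switch (level) {
    case JITTER_SEVERE:
        setting = adjust_severe_ms;
        break;
    case JITTER_MODERATE:
        /* Moderate jitter uses the average of the two settings. */
        setting = (adjust_normal_ms + adjust_severe_ms) / 2;
        break;
    default:
        setting = adjust_normal_ms;
        break;
    }
    return setting - 1000;
}
```

With made-up settings of 1500 and 2500, the sketch gives 500 ms behind live for normal content, 1000 ms for moderately jittery content, and 1500 ms for severely jittery content.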
By default, content is deemed to have normal jitter (burstiness). You can use IStreamBufferPlayback to tell the DVR engine what level of burstiness to expect. The Audio Lip-Sync Filter will ask the DVR engine for the current level and tell the DVR engine how far to offset seeks/fast-forward-to-live.
If you are too close to live, at steady state you will see the Lip-Sync stats chatter reporting good Lip-Sync a few times, then see a report in which the minimum early/late number is 100 or more milliseconds late (i.e., < -100). This cycle will repeat periodically.
### AUDIO Deviation from ideal lip sync: 1 (0) ms average [min: -17, max: 20] @ offset 1750 of 1750 ###
### AUDIO Deviation from ideal lip sync: 0 (0) ms average [min: -14, max: 15] @ offset 1750 of 1750 ###
### AUDIO Deviation from ideal lip sync: -125 (-127) ms average [min: -242, max: 16] @ offset 1750 of 1750 ###
If you turn on the Lip-Sync heuristic zone, you will typically also see a pattern of a backpressure measure near 0 during the good times followed by a max backpressure of 4 or more during the bad glitch.
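If you capture the stats chatter to a file, the pattern described above can be spotted mechanically. The sketch below parses one stats line and flags reports whose minimum is 100 ms or more late. It assumes that every line has exactly the layout of the three samples above. The names are hypothetical, and the value in parentheses is stored without interpretation because this article does not say what it measures.

```c
#include <assert.h>
#include <stdio.h>

struct lipsync_stats {
    int avg_ms;    /* average deviation from ideal Lip-Sync */
    int paren_ms;  /* value in parentheses; meaning not documented here */
    int min_ms;    /* most negative (latest) deviation in the report */
    int max_ms;    /* most positive (earliest) deviation in the report */
};

/* Parses one "### AUDIO Deviation from ideal lip sync: ..." line.
 * Returns 1 on success, 0 if the line does not have that layout. */
int parse_lipsync_stats(const char *line, struct lipsync_stats *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line,
                  "### AUDIO Deviation from ideal lip sync: %d (%d) ms average"
                  " [min: %d, max: %d]",
                  &out->avg_ms, &out->paren_ms,
                  &out->min_ms, &out->max_ms) == 4;
}

/* Returns 1 if the report shows audio 100 ms or more late. */
int report_shows_late_glitch(const struct lipsync_stats *s)
{
    assert(s != NULL);
    return s->min_ms <= -100;
}
```

Of the three sample lines above, only the third (min: -242) is flagged. Seeing such a report recur after a few good ones is the sign of being too close to live.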
When too close to live, increase MillisecAdjustWhenFixing… If the default value isn’t causing trouble, try decreasing it until trouble starts then back off a bit.
Fine Tuning to Improve Recovery
Once you have Lip-Sync tracking well, you can tweak parameters to improve how rapidly Lip-Sync recovers from a transient period of unusually fast/slow clock or other system stress.
The “decimation” values control how often samples are discarded to catch up if audio is lagging behind its target.
The min/max adjustment values were described earlier. They control how much pressure is brought to bear for an on-going early/late issue. Note that picking more extreme values raises the risk of oscillating due to overcorrection. Extreme values also make the corrections for loss of audio sync more obvious. (There is a fundamental trade-off between correcting quickly and correcting in a non-obvious way.)
The “initial offset” values control how far back from live you are right after a trick mode or starting the graph. Smaller values mean that the system appears more responsive to the user when changing the channel. Larger values mean that there are fewer a/v glitches and better initial Lip-Sync when starting the graph, jumping to live, or fast-forwarding to live.