ALTREFs are non-displayable pictures that are used as reference for other pictures. They are usually constructed using several source frames but can hold any type of information useful for compression and the given use-case. In the current version of SVT-AV1, temporal filtering of adjacent video frames is used to construct some of the ALTREF pictures. The resulting temporally filtered pictures will be encoded in place of or in addition to the original source pictures. This methodology is especially useful for source pictures that contain a high level of noise since the temporal filtering process produces reference pictures with reduced noise level.
Temporal filtering is currently applied to the base layer picture and can also be applied to layer 1 pictures of each mini-GOP (e.g. source frame positions 16 and 8 respectively in a mini-GOP in a 5-layer hierarchical prediction structure). In addition, filtering of the key-frames and intra-only frames is also supported.
The diagram in Figure 1 illustrates the use of five adjacent pictures: Two past, two future and one central picture, in order to produce a single filtered picture. Motion estimation is applied between the central picture and each future or past pictures generating multiple motion-compensated predictions. These are then combined using adaptive weighting (filtering) to produce the final noise-reduced picture.
Fig. 1. Example of motion estimation for temporal filtering in a temporal window consisting of 5 adjacent pictures
Since temporal filtering makes use of a number of adjacent frames, the Look Ahead Distance (lad_mg_pictures) needs to be incremented by the number of future frames used for ALTREF temporal filtering. When applying temporal filtering to ALTREF pictures, an Overlay picture might be necessary. This Overlay picture corresponds to the same original source picture but uses only the temporally filtered version of the source picture as a reference to reconstruct the original picture.
Various signals are used to specify the temporal filtering settings and are described in Table 1 below. The settings could be different based on the frame type; however, the same set of signals is used for all frame types. The temporal filtering flow diagram in Figure 2 below further explains how and where each of the defined signals is used. These parameters are decided as a function of the encoder preset (enc_mode).
Category | Signal(s) | Description |
---|---|---|
Filtering | Enabled | Specifies whether the current input will be filtered or not (0: OFF, 1: ON). |
Filtering | do_chroma | Specifies whether the U&V planes will be filered or not (0: filter all planes, 1: filter Y plane only). |
Filtering | use_medium_filter | Specifies whether the weights generation will use approximations or not (0: do not use approximations, 1: per-block weights derivation, use an approximated exponential & log, use an approximated noise level). |
Filtering | use_fast_filter | Specifies whether the weights derivation will use the distance factor (MV-based correction) and the 5x5 window error or not (0: OFF, 1: ON). |
Filtering | use_fixed_point | Specifies noise-level-estimation and filtering precision (0: use float/double precision, 1: use fixed point precision). |
Number of reference frame(s) | num_past_pics | Specifies the default number of frame(s) from past. |
Number of reference frame(s) | num_future_pics | Specifies the default number of frame(s) from future. |
Number of reference frame(s) | noise_adjust_past_pics | Specifies whether num_past_pics will be incremented or not based on the noise level of the central frame (0: OFF or 1: ON). |
Number of reference frame(s) | noise_adjust_future_pics | Specifies whether num_future_pics will be incremented or not based on the noise level of the central frame (0: OFF or 1: ON). |
Number of reference frame(s) | use_intra_for_noise_est | Specifies whether to use the key-frame noise level for all inputs or to re-compute the noise level for each input. |
Number of reference frame(s) | activity_adjust_th | Specifies whether num_past_pics and num_future_pics will be decremented or not based on the activity of the outer reference frame(s) compared to the central frame (∞: OFF, else remove the reference frame if the cumulative differences between the histogram bins of the central frame and the histogram bins of the reference frame is higher than activity_adjust_th . |
Number of reference frame(s) | max_num_past_pics | Specifies the maximum number of frame(s) from past (after all adjustments). |
Number of reference frame(s) | max_num_future_pics | Specifies the maximum number of frame(s) from future (after all adjustments). |
Motion search | hme_me_level | Specifies the accuracy of the ME search (note that ME performs a HME search, then a Full-Pel search). |
Motion search | half_pel_mode | Specifies the accuracy of the Half-Pel search (0: OFF, 1: perform refinement for the 8 neighboring positions, 2/3: perform refinement for the 2 horizontal-neighboring positions and for the 2 vertical-neighboring positions, but not for all the 4 diagonal-neighboring positions = function (horizontal & vertical distortions). |
Motion search | quarter_pel_mode | Specifies the accuracy of the Quarter-Pel search (0: OFF, 1: perform refinement for the 8 neighboring positions, 2/3: perform refinement for the 2 horizontal-neighboring positions and for the 2 vertical-neighboring positions, but not for all the 4 diagonal-neighboring positions = function (horizontal & vertical distortions). |
Motion search | eight_pel_mode | Specifies the accuracy of the Eight-Pel search (0: OFF, 1: perform refinement for the 8 neighboring positions). |
Motion search | use_8bit_subpel | Specifies whether the Sub-Pel search for a 10bit input will be performed in 8bit resolution (0: OFF, 1: ON, NA if 8bit input). |
Motion search | avoid_2d_qpel | Specifies whether the Sub-Pel positions that require a 2D interpolation will be tested or not (0: OFF, 1: ON, NA if 16x16 block or if the Sub-Pel mode is set to 1). |
Motion search | use_2tap | Specifies the Sub-Pel search filter type (0: regular, 1: bilinear, NA if 16x16 block or if the Sub-Pel mode is set to 1). |
Motion search | sub_sampling_shift | Specifies whether sub-sampled input/prediction will be used at the distortion computation of the Sub-Pel search. |
Motion search | pred_error_32x32_th | Specifies the 32x32 prediction error (after subpel) under which the subpel for the 16x16 block(s) is bypassed. |
Motion search | tf_me_exit_th | Specifies whether to exit ME after HME or not (0: perform both HME and Full-Pel search, else if the HME distortion is less than me_exit_th then exit after HME (i.e. do not perform the Full-Pel search), NA if use_fast_filter is set 0). |
Motion search | use_pred_64x64_only_th | Specifies whether to perform Sub-Pel search for only the 64x64 block or to use default size(s) (32x32 or/and 16x16) (∞: perform Sub-Pel search for default size(s), else if the deviation between the 64x64 ME distortion and the sum of the 4 32x32 ME distortions is less than use_pred_64x64_only_th then perform Sub-Pel search for only the 64x64 block, NA if use_fast_filter is set 0). |
The block diagram in Figure 2 outlines the flow of the temporal filtering operations.
Fig. 2. Block diagram of the temporal filtering operations. (image is too large to display in md, please click the link to see)
In order to decide temporal window length according to the content
characteristics, the amount of noise is estimated from the central source
picture. The algorithm considered is based on a simplification of the algorithm
proposed in [1]. The standard deviation (sigma) of the noise is estimated using
the Laplacian operator. Pixels that belong to an edge (i.e. as determined by
how the magnitude of the Sobel gradients compare to a predetermined threshold),
are not considered in the computation. The current noise estimation considers
only the luma component. When use_intra_for_noise_est
is set to 1, the noise
level of the I-frame will be used for ALTREF_FRAME or ALTREF2_FRAME.
As mentioned previously, the temporal filtering algorithm uses multiple frames to generate a temporally denoised or filtered picture at the central picture location. If enough pictures are available in the list of source picture buffers, the number of pictures used will generally be given by the num_past_pics and num_future_pics in addition to the central picture, unless not enough frames are available (e.g. end of sequence).
The number of pictures will be first increased based on the noise level of the central picture. Basically, the lower the noise of the central picture, the widerthe temporal window (+3 on each side if noise <0.5, +2 on each side if noise < 1.0, and +1 on each if noise < 2.0). Both sides of the window could be adjusted or just one side depending on noise_adjust_past_pics and noise_adjust_future_pics.
In order to account for illumination changes, which might compromise the
quality of the temporally filtered picture, an adjustment of both
num_past_pics
and num_future_pics
is conducted to remove cases where a
significant illumination change is found in the defined temporal window. This
algorithm first computes and accumulates the absolute difference between the
luminance histograms of adjacent pictures in the temporal window, starting from
the first past picture to the last past picture and from the first future
picture to the last future picture. Then, depending on a threshold, ahd_th, if
the cumulative difference is high enough, edge pictures will be removed. The
current threshold is chosen based on the picture width and height:ahd_th =
(width * height) * activity_adjust_th / 100
After this step, the list of pictures to use for the temporal filtering is ready. However, given that the number of past and future frames can be different, the index of the central picture needs to be known.
The central picture is split into 64x64 pixel non-overlapping blocks. For each block, (num_past_pics + num_future_pics )–, motion-compensated predictions will be determined from the adjacent frames and weighted in order to generate a final filtered block. All blocks are then combined to build the final filtered picture.
The motion search consists of three steps: (1) Hierarchical Motion Estimation (HME), (2) Full-Pel search, and (3) Sub-Pel search, and performed for only the Luma plane.
HME is performed for each single 64x64-block, while Full-Pel search is performed for the 85 square blocks between 8x8 and 64x64, and the Sub-Pel search (using regular or bilinear as filter type depending on use_2tap) is performed for only the 4 32x32-blocks and the 16 16x16-blocks.
After obtaining the motion information, an inter-depth decision between the 4 32x32-blocks and the 16 16x16-blocks is performed towards a final partitioning for the 64x64. The latter will be considered at the final compensation (using sharp as filter type and for all planes).
However, if the 64x64 distortion after HME is less than tf_me_exit_th, then the Full_Pel search is bypassed and Sub-Pel search/final compensation is performed for only the 64x64.
Also, Sub-Pel search/final compensation is performed for only 64x64 blocks, if the deviation between the 64x64 ME distortion and the 4 32x32 ME distortions (after the Full-Pel search) is less than use_pred_64x64_only_th.
The decay factor (tf_decay_factor
) is derived per block/per component and
will be used at the sample-based filtering operations.
tf_decay_factor = 2 * n_decay * n_decay * q_decay * s_decay
The noise-decay (n_decay
) is mainly an increasing function of the input noise
level, but is also adjusted depending on the filtering method
(use_fast_filter
), the input resolution, and the input QP; where a higher
noise level implies a larger n_decay value and a stronger filtering. The
computations of n_decay
are simplified when use_fixed_point
or
use_fast_filter
is set to 1.
The QP-decay (q_decay
) is an increasing function of the input QP. For a high
QP, the quantization leads to a higher loss of information, and thus a stronger
filtering is less likely to distort the encoded quality, while a stronger
filtering could reduce bit rates. For a low QP, more details are expected to be
retained. Filtering is thus more conservative.
The strength decay (s_decay
) is a function of the filtering strength that is
set in the code.
After multiplying each pixel of the co-located 64x64 blocks by the respective weight, the blocks are then added and normalized to produce the final output filtered block. These are then combined with the rest of the blocks in the frame to produce the final temporally filtered picture.
The process of generating one filtered block is illustrated in diagram of
Figure 3. In this example, only 3 pictures are used for the temporal filtering
(num_past_pics = 1
and num_future_pics = 1
). Moreover, the values of the
filter weights are used for illustration purposes only and are in the range
{0,32}.
Fig. 2. Example of the process of generating the filtered block from the predicted blocks of adjacent picture and their corresponding pixel weights.
Inputs: list of picture buffer pointers to use for filtering, location of central picture, initial filtering strength
Outputs: the resulting temporally filtered picture, which replaces the location of the central pictures in the source buffer. The original source picture is stored in an additional buffer.
Control flags:
Flag | Level |
---|---|
tf-controls | Sequence |
enable-overlays | Sequence |
The current implementation supports 8-bit and 10-bit sources as well as 420, 422 and 444 chroma sub-sampling. Moreover, in addition to the C versions, SIMD implementations of some of the more computationally demanding functions are also available.
Most of the variables and structures used by the temporal filtering
process are located at the picture level, in the PictureControlSet (PCS)
structure. For example, the list of pictures is stored in the
temp_filt_pcs_list
pointer array.
For purposes of quality metrics computation, the original source picture
is stored in save_enhanced_picture_ptr
and
save_enhanced_picture_bit_inc_ptr
(for high bit-depth content)
located in the PCS.
The current implementation disables temporal filtering on key-frames if
the source has been classified as screen content (sc_content_detected
in the PCS is 1).
Due to the fact that HME is open-loop, which means it operates on the
source pictures, HME can only use the source picture which is going to
be filtered after the filtering process has been finalized. The strategy
for synchronizing the processing of the pictures for this case is
similar to the one employed for the determination of the prediction
structure in the Picture Decision Process. The idea is to write to a
queue, the picture_decision_results_input_fifo_ptr
, which is
consumed by the HME process.
Three uint8_t or uint16_t buffers of size 64x64x3 are allocated: the accumulator, predictor and counter. In addition, an extra picture buffer (or two in case of high bit-depth content) is allocated to store the original source. Finally, a temporary buffer is allocated for high-bit depth sources, due to the way high bit-depth sources are stored in the encoder implementation (see sub-section on high bit-depth considerations).
For some of the operations, different but equivalent functions are
implemented for 8-bit and 10-bit sources. For 8-bit sources, uint8_t
pointers are used, while for 10-bit sources, uint16_t pointers are
used. In addition, the current implementation stores the high bit-depth
sources in two separate uint8_t buffers in the EbPictureBufferDesc
structure, for example, buffer_y
for the luma 8 MSB and
buffer_bit_inc_y
for the luma LSB per pixel (2 in case of 10-bit).
Therefore, prior to applying the temporal filtering, in case of 10-bit
sources, a packing operation converts the two 8-bit buffers into a
single 16-bit buffer. Then, after the filtered picture is obtained, the
reverse unpacking operation is performed.
The filtering algorithm operates independently in units of 64x64 blocks
and is currently multi-threaded. The number of threads used is
controlled by the variable tf_segment_column_count
, which depending
the resolution of the source pictures, will allocate more or less
threads for this task. Each thread will process a certain number of
blocks.
Most of the filtering steps are multi-threaded, except the
pre-processing steps: packing (in case of high bit-depth sources) and
unpacking, estimation of noise, adjustment of strength, padding and
copying of the original source buffers. These steps are protected by a
mutex, temp_filt_mutex
, and a binary flag, temp_filt_prep_done
in
the PCS structure.
If the temporally filtered picture location is of type ALTREF_FRAME
or
ALTREF2_FRAME
, the frame should not be displayed with the
show_existing_frame
strategy and should contain an associated Overlay
picture. In addition, the frame has the following field values in the
frame header OBU:
-
show_frame = 0
-
showable_frame = 0
-
order_hint
= the index that corresponds to the central picture of the ALTREF frame
In contrast, the temporally filtered key-frame will have showable_frame
= 1 and no Overlay picture.
According to current SVT-AV1 implementation, if high level control
enable_overlays = 1
, Overlay picture will be used to associate with
ALTREF picture with temporal_layer_index == 0
.
In resource_coordination, each picture (except the first picture) will be associated with an additional potential Overlay picture.
In picture_analysis function, the processing for potential Overlay picture is skipped because Overlay picture shares the same results as the associated ALTREF picture.
In picture_decision function, the Overlay picture will not be used to update the picture_decision_reorder_queue. When GOP is decided, the position of ALTREF and Overlay pictures is clear, then the additional potential Overlay picture of non-ALTREF will be released, and the Overlay picture of ALTREF will be initiated.
Overlay picture will not be used as reference, and Overlay picture will only reference the associated ALTREF picture with the same picture number.
In set_frame_display_params function, frm_hdr->show_frame = EB_TRUE
and
pcs_ptr->has_show_existing = EB_FALSE
for Overlay picture, so that the
associate ALTREF picture will not be displayed, and the reconstructed Overlay
picture will be displayed instead.
Consider the example of a five-layer prediction structure shown in Figure 4 below. The ALTREF and Overlay picture settings are shown in Table 3.
Example when picture 16 is ALTREF:
Key Point | ALTREF Picture | Overlay Picture |
---|---|---|
picture_number | 16 | 16 |
is_alt_ref | 1 | 0 |
is_overlay | 0 | 1 |
show_frame | 0 | 1 |
slice_type | B_SLICE | P_SLICE |
The feature settings that are described in this document were compiled at v1.4.0 of the code and may not reflect the current status of the code. The description in this document represents an example showing how features would interact with the SVT architecture. For the most up-to-date settings, it's recommended to review the section of the code implementing this feature.
[1] Tai, Shen-Chuan, and Shih-Ming Yang. "A fast method for image noise estimation using Laplacian operator and adaptive edge detection." In 2008 3rd International Symposium on Communications, Control and Signal Processing, pp. 1077-1081. IEEE, 2008.