Distributed Video Coding (DVC) is a new paradigm for video compression, based on Slepian and Wolf's (lossless coding) and Wyner and Ziv's (lossy coding) information-theoretic results. DVC is useful for emerging applications such as wireless video cameras, wireless low-power surveillance networks, disposable video cameras and certain medical applications. The primary objective of DVC is low-complexity video encoding, where the bulk of the computation is shifted to the decoder, as opposed to the low-complexity decoders of conventional video compression standards such as H.264 and MPEG. There are a couple of early architectures/implementations of DVC: from Stanford University[2][3] in 2002, Berkeley University's PRISM (Power-efficient, Robust, hIgh-compression, Syndrome-based Multimedia coding)[4][5] in 2002, and the European project DISCOVER (DIStributed COding for Video SERvices)[6] in 2007. There are primarily two types of DVC techniques: pixel-domain and transform-domain. Pixel-domain DVC designs are a subset of transform-domain designs. A transform-domain design has better rate-distortion (RD) performance, as it exploits the spatial correlation between neighboring sample values and compacts the block energy into as few transform coefficients as possible. In this paper, the architecture, implementation details and C model implementation results of a transform-domain DVC are presented.
First, the incoming video sequence is split into Groups of Pictures (GOPs). The first frame of each GOP is called the key frame and the remaining frames are called Wyner-Ziv (WZ) frames. Key frames are encoded using an H.264 Main profile intra coder. The splitting of the input video sequence depends on the relative motion between frames; increasing the GOP size increases the number of WZ frames, which reduces the data rate
Every WZ frame undergoes a block-based transform, i.e. a DCT is applied to each 4 x 4 block of the WZ frame
The DCT coefficients of the entire frame are grouped together, forming the so-called DCT coefficient bands
After transform coding, each band is uniform scalar quantized with a predefined number of levels
Bit plane ordering is performed on the quantized bins
Each bit plane is encoded separately by a Low Density Parity Check Accumulator (LDPCA) encoder, which computes a set of parity bits representing the accumulated syndrome of the encoded bit plane
An 8-bit Cyclic Redundancy Check (CRC) sum is sent for each bit plane to verify correct decoding
The parity bits are stored in a buffer and progressively transmitted to the decoder, which iteratively requests more bits during decoding via the feedback channel
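The per-bit-plane CRC in the steps above can be sketched as follows. The generator polynomial (0x07) and the byte-wise processing are assumptions for illustration; the paper does not specify which 8-bit CRC is used:

```c
#include <stdint.h>
#include <stddef.h>

/* CRC-8 over a bit plane packed into bytes.
   Polynomial x^8 + x^2 + x + 1 (0x07), init 0x00 -- an assumed,
   commonly used parameterization, not necessarily the paper's. */
uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0x00;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07)
                               : (uint8_t)(crc << 1);
    }
    return crc;
}
```

The decoder recomputes this sum over the decoded bit plane and compares it with the received value.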
The decoding steps in the presented DVC are:
Key frames are decoded by H.264 main profile intra decoder
The key frames are used for reconstruction of the so-called side information (SI), which is an estimate of the original WZ frame. For SI generation, a motion-compensated interpolation between the two closest reference frames is performed
The difference between the original frame and the corresponding SI is modeled as correlation noise in a virtual channel; an online Laplacian model is used to obtain a good approximation of the residual (WZ - SI)
The same transform used at the encoder is applied to the SI and an estimate of the coefficients of the WZ frame is thus obtained
From these coefficients, soft values for the information bits are computed, taking into account the statistical modeling of the virtual noise
The conditional probability obtained for each DCT coefficient is converted into conditional bit probabilities by considering the previously decoded bit planes and the value of side information
These soft values are fed to the LDPCA decoder which performs the proper decoding operation
Decoder success or failure is verified by the 8-bit CRC sum received from the encoder for the current bit plane
If the decoding fails, i.e. if the received parity bits are not sufficient to guarantee successful decoding with a low bit error rate, then more parity bits are requested using the feedback channel
This is iterated until successful decoding is obtained. After the successful decoding of all bit planes, inverse bit plane ordering is performed
Inverse quantization and reconstruction are performed on the decoded bins
Then the inverse DCT (IDCT) is performed and each WZ frame is restored to the pixel domain
Finally, to obtain the decoded video sequence, the decoded WZ frames and key frames are interleaved as per the GOP structure
Section 2 highlights the DVC codec architecture and implementation details. Section 3 presents the results of the C code implementation, followed by conclusions and further work in Section 4.
2. DVC Codec Architecture & Implementation details
The architecture of the implemented DVC codec is shown in Figure-1.
2.1 DVC Encoder
The presented DVC encoder has the following modules, which are explained in the subsequent sub-sections:
Adaptive Video Splitter
Transform
Quantizer & Bit plane ordering
LDPCA Encoder & buffer
H.264 Intra Encoder
2.1.1 Adaptive Video Splitter
The Adaptive Video Splitter controls the (non-periodic) insertion of key frames between the WZ frames in an adaptive way. The GOP size control mechanism added to the encoder shall not significantly increase its complexity, i.e. it shall not perform any motion estimation. Simple yet powerful metrics such as Difference of Histograms, Histogram of Difference, Block Histogram Difference and Block Variance Difference[7][8] are used to evaluate the motion activity along the video sequence.
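As an illustration of one such low-complexity metric, the sketch below computes a histogram-of-difference style activity measure. The thresholded-pixel-fraction form is an assumption for illustration; the exact metric definitions are given in [7][8]:

```c
#include <stdlib.h>

/* Motion activity between two luma frames, measured as the fraction of
   pixels whose absolute difference exceeds a threshold -- one plausible
   realization of a histogram-of-difference metric. No motion estimation
   is performed, keeping the splitter cheap. */
double hod_activity(const unsigned char *f0, const unsigned char *f1,
                    int n_pixels, int threshold)
{
    int moving = 0;
    for (int i = 0; i < n_pixels; i++) {
        int d = abs((int)f0[i] - (int)f1[i]);
        if (d > threshold)
            moving++;
    }
    return (double)moving / (double)n_pixels; /* 0 = static, 1 = all moving */
}
```

The splitter would insert a key frame whenever this activity exceeds a chosen limit, and extend the GOP otherwise.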
2.1.2 Transform
The transform enables the codec to exploit the statistical dependencies within a frame, thus achieving better RD performance. The H.264 intra coder was chosen for the key frame path due to its lower bit rates; hence the obvious choice of transform in the WZ path is the DCT, to match that of H.264. The WZ frames are transformed using a 4 x 4 DCT by breaking the image down into 4 x 4 blocks of pixels, from left to right and top to bottom.
Once the DCT operation has been performed over all the 4 x 4 blocks of the image, the DCT coefficients are grouped together according to the standard zig-zag scan order[9] within the 4 x 4 DCT coefficient blocks, organizing the coefficients into 16 bands. The first band, containing the low-frequency information, is called the DC band; the remaining bands, called AC bands, contain the high-frequency information.
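The band grouping can be sketched as below, using the standard H.264 4x4 zig-zag scan order:

```c
/* H.264 4x4 zig-zag scan: entry k gives the raster index of the k-th
   scanned coefficient, so band 0 is the DC coefficient and bands 1..15
   are the AC coefficients in scan order. */
static const int zigzag4x4[16] = {
    0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15
};

/* Coefficient of a 4x4 block (16 values in raster order) that belongs
   to a given band; collecting this value over every block of the frame
   forms that band. */
int band_of(const int *block, int band)
{
    return block[zigzag4x4[band]];
}
```

Looping `band_of` over all blocks for band k = 0..15 yields the 16 DCT coefficient bands described above.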
2.1.3 Quantizer & Bit plane ordering
To encode the WZ frames, each DCT band is quantized separately using a predefined number of levels, depending on the target quality for the WZ frame. DCT coefficients representing lower spatial frequencies are quantized with a uniform scalar quantizer using small step sizes, i.e. a higher number of levels. Higher frequencies are quantized more coarsely, i.e. with fewer levels, without significantly decreasing the visual quality of the decoded image. AC bands are quantized with a dead-zone quantizer whose zero interval is doubled, which reduces blocking artifacts because the AC coefficients are concentrated mainly around zero amplitude. Instead of using a fixed value, the dynamic range of each AC band is calculated, so the quantization step size is adjusted to the dynamic range of each band; this range is computed separately for each AC band to be quantized and transmitted to the decoder inside the encoded bit stream. Depending on the target quality and data rates, one of the quantization matrices given in Figure-2 can be used.
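A minimal sketch of the dead-zone quantizer with a doubled zero interval (the step size W is assumed to have already been derived from the band's dynamic range and number of levels):

```c
/* Dead-zone uniform scalar quantizer for AC coefficients: the zero bin
   spans (-W, W), i.e. twice the width of the other bins, which reduces
   blocking artifacts since AC coefficients cluster around zero. */
int deadzone_quantize(int x, int W)
{
    if (x >= 0)
        return x / W;          /* C integer division truncates toward 0 */
    else
        return -((-x) / W);    /* mirror for negative coefficients */
}
```

Any |x| < W falls into the zero bin, while all other bins have width W.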
Figure 1: DVC Encoder and Decoder Architecture
Figure 2: Eight quantization matrices associated to different RD performances
After quantizing a DCT coefficient band, the quantized symbols are converted into a bit stream: bits of the same significance (e.g. the MSB) are grouped together to form the corresponding bit plane, and each bit plane is independently encoded by the LDPCA encoder.
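Bit plane extraction from the quantized symbols can be sketched as:

```c
/* Bit of quantized symbol `sym` (nbits wide) in plane p, where p = 0 is
   the most significant plane. */
int bitplane_bit(unsigned sym, int nbits, int p)
{
    return (int)((sym >> (nbits - 1 - p)) & 1u);
}

/* Collect plane p of a whole band; each such plane is then handed to
   the LDPCA encoder independently, MSB plane first. */
void extract_bitplane(const unsigned *band, int n, int nbits, int p,
                      unsigned char *plane)
{
    for (int i = 0; i < n; i++)
        plane[i] = (unsigned char)bitplane_bit(band[i], nbits, p);
}
```

For a band quantized to 2^M levels, planes p = 0..M-1 are extracted and encoded in order of significance.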
2.1.4 LDPCA Encoder & buffer
The Low Density Parity Check Accumulator (LDPCA)[10] channel encoder (a.k.a. the WZ encoder) is employed, as it performs better with reduced complexity compared to turbo codes[10]. An LDPCA encoder consists of an LDPC syndrome-former concatenated with an accumulator. For each bit plane, syndrome bits are created using the LDPC code and accumulated modulo 2 to produce the accumulated syndrome. The encoder stores these accumulated syndromes in a buffer and initially transmits only a few of them in chunks. If the decoder fails, more accumulated syndromes are requested from the encoder buffer via the feedback channel. To help the decoder detect residual errors, an 8-bit CRC sum of the encoded bit plane is also transmitted.
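The syndrome former and modulo-2 accumulator can be sketched as below. A small dense parity-check matrix H is assumed purely for illustration; real LDPCA codes[10] use large sparse parity-check graphs:

```c
/* One LDPC syndrome bit: inner product (mod 2) of a parity-check row
   with the bit plane x. */
unsigned char syndrome_bit(const unsigned char *h_row,
                           const unsigned char *x, int n)
{
    unsigned char s = 0;
    for (int c = 0; c < n; c++)
        s ^= (unsigned char)(h_row[c] & x[c]);
    return s;
}

/* Form all syndrome bits s = H*x (mod 2) and pass them through a
   modulo-2 accumulator; acc[] holds the accumulated syndromes that the
   encoder buffers and releases in chunks on decoder request. H is a
   flat rows*cols 0/1 matrix. */
void ldpca_accumulate(const unsigned char *H, int rows, int cols,
                      const unsigned char *x, unsigned char *acc)
{
    unsigned char prev = 0;
    for (int r = 0; r < rows; r++) {
        prev ^= syndrome_bit(H + r * cols, x, cols); /* mod-2 accumulation */
        acc[r] = prev;
    }
}
```

The accumulation step is what allows the decoder to recover any prefix of the syndrome sequence by inverse accumulation, enabling rate-adaptive transmission.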
2.1.5 H.264 Intra Encoder
Key frames are encoded with H.264 Intra (Main profile)[9]. Coding with H.264/AVC in Main profile without exploiting temporal redundancy is intra coding, and H.264 intra coding is among the most efficient intra coding standard solutions available, even better than JPEG2000 in many cases. The JM reference software[11] is used as the Main profile intra encoder in this implementation. The Quantization Parameter (QP) for each RD point is chosen to match the WZ frame quality.
2.2 DVC Decoder
The presented DVC decoder has the following modules, which are explained in the subsequent sub-sections:
H.264 Intra Decoder
SI Estimation
DCT & Virtual Channel model
Soft Input Computation
LDPCA Decoder
Inverse Quantizer & Reconstruction
IDCT
2.2.1 H.264 Intra Decoder
As with the H.264 Intra encoder, the decoder (Main profile) specification and reference software are taken from [9] and [11], respectively. The decoded key frames are used to estimate the side information.
2.2.2 SI estimation
A frame interpolation algorithm[7] is used at the decoder to generate the side information. The higher the correlation between the side information and the frame to be encoded, the fewer parity bits need to be requested from the encoder to reach a given quality. Another important issue in the motion interpolation framework is the capability to work with longer GOPs without a significant decrease in the quality of the interpolated frame, especially when the correlation of the frames in the GOP is high. This is a complex task, since interpolation and quantization errors propagate inside the GOP when the frame interpolation algorithm uses WZ decoded frames as references. The presented frame interpolation framework works for GOPs of any length, including long and high-motion GOPs. Figure-3 shows the architecture of the frame interpolation scheme.
Figure 3: Side information estimation
The frame interpolation structure used to generate the side information is based on the previously decoded frames XB and XF, the backward (past) and forward (future) references. An example frame structure definition for GOP 4 is shown in Figure-4.
Figure 4: Frame structure definition
In forward motion estimation, a full-search block motion estimation algorithm is used to estimate the motion between the decoded frames XB and XF. The search for the candidate block is carried out exhaustively over all possible positions within the specified range ±M (M = 32) in the reference frame.
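The full-search step can be sketched as follows; candidate positions falling outside the frame are simply skipped for brevity:

```c
#include <stdlib.h>
#include <limits.h>

/* Sum of absolute differences between a BxB block at (x0,y0) in cur and
   a BxB block at (x1,y1) in ref; frames are w-pixel-wide 8-bit luma planes. */
int block_sad(const unsigned char *cur, const unsigned char *ref, int w,
              int x0, int y0, int x1, int y1, int B)
{
    int sad = 0;
    for (int j = 0; j < B; j++)
        for (int i = 0; i < B; i++)
            sad += abs((int)cur[(y0 + j) * w + x0 + i] -
                       (int)ref[(y1 + j) * w + x1 + i]);
    return sad;
}

/* Exhaustive full search over +/-M around (x0,y0); returns the best SAD
   and writes the winning motion vector into (*mvx, *mvy). */
int full_search(const unsigned char *cur, const unsigned char *ref,
                int w, int h, int x0, int y0, int B, int M,
                int *mvx, int *mvy)
{
    int best = INT_MAX;
    for (int dy = -M; dy <= M; dy++)
        for (int dx = -M; dx <= M; dx++) {
            int x1 = x0 + dx, y1 = y0 + dy;
            if (x1 < 0 || y1 < 0 || x1 + B > w || y1 + B > h)
                continue;                    /* skip out-of-frame candidates */
            int sad = block_sad(cur, ref, w, x0, y0, x1, y1, B);
            if (sad < best) { best = sad; *mvx = dx; *mvy = dy; }
        }
    return best;
}
```

With M = 32 this evaluates up to 65 x 65 candidate positions per block, which is affordable because the search runs at the decoder.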
The bidirectional motion estimation[7][12] module refines the motion vectors obtained in the forward motion estimation step using a scheme similar to the B-frame coding mode of current video standards[13]. Since here the pixels of the interpolated frame are not known, a different motion estimation technique is used: it selects the linear trajectory between the next and previous key frames passing through the center of each block in the interpolated frame, as shown in Figure-5.
Figure 5: Bidirectional Motion Estimation
This bidirectional motion estimation technique combines a hierarchical block size (first 16x16, then 8x8) with an adaptive search range in the backward and forward reference frames based on the motion vectors of neighboring blocks.
Extensive experimentation was carried out to determine whether half-pel accuracy for both forward and bidirectional motion estimation works best, or some combination with integer-pel. The author proposes to use half-pel motion estimation for the forward step and integer-pel motion estimation for the 16x16 and 8x8 bidirectional steps, which gave the best results.
Next, a spatial motion smoothing algorithm[14] is used to make the final motion vector field smoother, except at object boundaries and uncovered regions by using weighted vector median filters, and by evaluating prediction error and the spatial properties of the motion field.
Once the final motion vector field is obtained, the interpolated frame can be filled by simply using bidirectional motion compensation as defined in standard video coding schemes[13].
2.2.3 DCT & Virtual Channel model
In DVC, the decoding efficiency for WZ frames critically depends on the capability to model the statistical dependency[15][16] between the original information at the encoder and the side information computed at the decoder. This is a complex task, since the original information is not available at the decoder and the side information quality varies throughout the sequence, i.e. the error distribution is not temporally constant. A Laplacian distribution, which offers a good trade-off between model accuracy and complexity, is used to model the correlation noise, i.e. the error distribution between corresponding DCT bands of the SI and WZ frames. The Laplacian distribution parameter is estimated online at the decoder and takes into consideration the temporal and spatial variability of the correlation noise statistics. The technique used estimates the Laplacian parameter α at the DCT band level (one α per DCT band and frame) and at the coefficient level (one α per DCT coefficient). The estimation approach uses the residual frame, i.e. the difference between XB and XF (along the motion vectors), as a confidence measure of the frame interpolation operation and as a rough estimate of the side information quality.
2.2.4 Soft input computation
The conditional probability obtained for each DCT coefficient is converted into conditional bit probabilities by considering the previously decoded bit planes and the value of the side information. A benefit of LDPCA codes is that they incorporate the underlying statistics of the channel noise into the decoding process in the form of soft inputs, or a priori probabilities. The probability calculations differ between the AC and DC bands, because the DC band contains only unsigned coefficients while the AC bands contain signed coefficients.
The DC band probability calculations (for either 0 or 1) depend on the previously decoded bit planes and the SI DCT coefficients. For the AC bands, extensive experimentation was carried out to determine which works best for the probability calculations: SI quantized values or SI DCT coefficients. For the AC bands, the probability of an input bit-plane bit being 0 is calculated from the previously correctly decoded bit planes and the SI DCT coefficients. The author proposes that the probability of a bit being 1 be calculated differently for the MSB (which represents the sign) and the other bits: the MSB probability (of being 1) is calculated using the previously decoded bit planes and the SI quantized values, while the probabilities for all other bits (of being 1) are calculated using the previously decoded bit planes and the SI DCT coefficients. This gave the best RD performance.
From these probabilities a parameter called the intrinsic Log-Likelihood Ratio (LLR) is calculated; using the intrinsic LLR values, the LDPCA decoder decodes the current bit plane.
2.2.5 LDPCA decoder
The decoding procedure[10][17] decodes a bit plane given the soft inputs computed from the side information and the parity bits transmitted from the encoder, for a fixed number of received parity bits; the procedure is repeated for every increment in the number of parity bits requested by the decoder. Before decoding starts, the syndrome bits are extracted from the received parity bits by an inverse accumulation operation according to the graph structure[18]. A sum-product decoding operation is then performed on these syndrome bits; this is a soft-decision algorithm that accepts the probability of each received bit as input. To establish whether decoding is successful, the syndrome check error is computed, i.e. the Hamming distance between the received syndrome and the one generated from the decoded bit plane, followed by a cyclic redundancy check (CRC). If the Hamming distance is non-zero, the decoder proceeds to the next iteration and requests more accumulated syndromes via the feedback channel. If the Hamming distance is zero, the success of the decoding operation is verified using the 8-bit CRC sum: if the CRC computed on the decoded bit plane matches the value received from the encoder, decoding is declared successful and the decoded bit plane is sent to the inverse quantization & reconstruction module.
2.2.6 Inverse Quantizer & Reconstruction
Inverse quantization is carried out after all bit planes of a particular band have been successfully decoded. For each coefficient, the decoded bits from the bit planes are grouped together to form the quantized symbol (bin). Each bin tells the decoder the interval in which the original coefficient lies; the decoded quantization bin is an approximation of the true quantization bin obtained at the encoder before bit plane extraction.
Here, the author proposes to use different reconstruction methods for positive, negative and zero coefficients, as described below.
If decoded bin q > 0:
The inverse quantized coefficient range is [qW, (q+1)W), where q is the decoded bin and W is the quantization step size. The reconstructed coefficient is then computed from this range and the SI DCT coefficient y.
If decoded bin q < 0:
The decoded quantization bin range is ((q-1)W, qW], and the reconstructed coefficient is computed from this range and y.
If decoded bin q = 0:
The decoded quantization bin range is the dead zone (-W, W), and the reconstructed coefficient is computed from this range and y.
2.2.7 Inverse DCT
The IDCT operation is carried out after the inverse quantization and inverse zig-zag scan; it restores each WZ frame from the transform domain to the pixel domain.
3. DVC C model Implementation Results
The DVC encoder and decoder described are completely implemented in C. The implemented codec has been evaluated with four standard QCIF test sequences at a 15 Hz frame rate: Hall Monitor, Coast Guard, Foreman and Soccer. The chosen test sequences are representative of various levels of motion activity: the Hall Monitor video surveillance sequence has low to medium motion activity, the Coast Guard sequence has medium to high motion activity, the Foreman video-conferencing sequence has high motion activity, and the Soccer sequence has very high motion activity.
The H.264 key frame coder uses the Main profile, which can encode only 4:2:0 sequences, not 4:0:0. All metrics are measured only on the luma component of each video sequence; hence the chroma components of the test sequences are set to 0 and H.264 is run in 4:2:0 mode for the luma analysis.
Figures 6 to 9 show the RD performance and a comparison with H.264/AVC Intra, with a fixed GOP size of 2. For the Hall Monitor sequence, shown in Figure-6, the PSNR achieved is around 2-3 dB better than that of H.264 intra for a given bit rate.
Figure 6: Hall Monitor RD performance
For the Coast Guard sequence, shown in Figure-7, the PSNR achieved is 1-2 dB better than that of H.264 intra for a given bit rate.
Figure 7: Coast Guard RD performance
For the Foreman sequence, shown in Figure-8, the PSNR achieved is within about -1 to +1 dB of that of H.264 intra for a given bit rate.
Figure 8: Foreman RD performance
For the Soccer sequence, shown in Figure-9, the PSNR achieved is around 3 dB lower than that of H.264 intra for a given bit rate.
Figure 9: Soccer RD performance
4. Conclusions & Further work
The DVC architecture, implementation details and results have been presented. DVC has proved to have better RD performance than H.264 intra for low to medium motion activity sequences. For high motion sequences, the RD performance of DVC and H.264 intra are comparable, whereas for very high motion sequences the DVC RD performance is lower than that of H.264 intra. The author has highlighted the gaps and challenges of DVC in practical usage in another submitted paper[1]. Further work includes addressing the gaps and challenges highlighted in [1] and incorporating the resulting schemes into the codec presented in this paper. The DVC presented in this paper is mono-view; the methods can be extended to multi-view DVC.