Video is becoming the “biggest big data,” as it is among the most important and valuable sources of insights and information.
Research in our lab covers a broad range of topics on video signals. Inspired by state-of-the-art representation models such as deep learning, and driven by modern biologically inspired computational models of visual experience, we work on better solutions for video analysis, video processing, and video compression. Specifically, we explore each research topic from three aspects: algorithms, computation, and data.
Follow our GitHub for academic open-source projects.
Follow our WeChat Official Account for the latest progress in media technology.
In face synthesis tasks, commonly used 2D face representations (e.g., 2D landmarks and segmentation maps) are usually sparse and discontinuous. To overcome these shortcomings, we utilize a dense and continuous representation, named Projected Normalized Coordinate Code (PNCC), as guidance, and develop a PNCC-Spatio-Normalization (PSN) method to achieve face synthesis under arbitrary head poses and expressions. Based on PSN, we provide an effective framework for face reenactment and face swapping tasks. To ensure harmonious and seamless face swapping, a simple yet effective Appearance-Blending Module (ABM) is proposed to fit the synthesized face to the target face. Our method is subject-agnostic and can be applied to any pair of faces without extra fine-tuning. Both qualitative and quantitative experiments demonstrate the superiority of the proposed method in comparison with existing state-of-the-art systems.
It is well known that the high complexity of the High Efficiency Video Coding (HEVC) standard is the main hurdle to its wide deployment and use. To tackle this problem, a number of recent research efforts exploit heuristic algorithms and machine learning, including deep learning, to reduce coding complexity. In most cases, however, each encoder module, i.e., each encoding process, is first accelerated individually, and the different acceleration algorithms are then manually combined. Without a holistic strategy, the acceleration potential of multi-module combination is not exploited and the rate-distortion (RD) loss is generally not well controlled. To address these shortcomings, this paper exploits the acceleration properties of different modules, i.e., numerical representations of the potential time saving and the possible RD loss, from which a heuristic model is derived. A Heuristic Model Oriented Framework (HMOF) is then proposed, which adapts the properties of the modules to the underlying acceleration algorithms. Within the framework, two advanced acceleration algorithms, a Border Considered CNN (BC-CNN)-based Coding Unit (CU) partition and a Naive Bayes-based Prediction Unit (PU) partition, are proposed for the CU and PU modules, respectively. Further, by using the heuristic model to guide the combination of the proposed acceleration algorithms, HMOF is globally optimized: different time-saving budgets are wisely allocated to different modules, and a theoretically minimal RD loss is achieved. According to the experimental results, by fusing a suitable deep learning technique with Naive Bayes-based prediction, the proposed acceleration framework HMOF enables multiple acceleration choices, and the proposed joint optimization strategy helps select the choice with the best cost-performance.
Furthermore, within the proposed framework, intra coding time can be precisely controlled with negligible Bjøntegaard delta bit-rate (BDBR) loss. In this context, as a complexity control method, HMOF outperforms state-of-the-art complexity reduction algorithms under a similar complexity reduction ratio. These results further demonstrate the superiority of the proposed technique.
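The budget-allocation idea behind HMOF can be sketched as a small search over per-module operating points. The toy below is illustrative only (the module names, time savings, and RD losses are hypothetical placeholders, not the paper's measured values): each module offers acceleration options with an estimated time saving and RD loss, and the framework picks the combination that meets a total time-saving target at minimal total RD loss.

```python
from itertools import product

# Hypothetical acceleration operating points per module:
# each option is (time_saving_percent, rd_loss_in_bdbr_hundredths).
# All numbers are illustrative, not measurements from the paper.
MODULES = {
    "CU_partition": [(0, 0), (20, 3), (40, 9)],
    "PU_partition": [(0, 0), (10, 1), (20, 4)],
}

def allocate(target_saving):
    """Pick one operating point per module so that the total time saving
    meets the target while the total RD loss is minimal.
    Returns (chosen_options, total_rd_loss), or None if infeasible."""
    best = None
    for combo in product(*MODULES.values()):
        saving = sum(t for t, _ in combo)
        loss = sum(l for _, l in combo)
        if saving >= target_saving and (best is None or loss < best[1]):
            best = (combo, loss)
    return best

combo, loss = allocate(50)
print(combo, loss)
```

The exhaustive search is affordable here because each module exposes only a few operating points; the paper's heuristic model plays the role of supplying the per-option (time saving, RD loss) estimates.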
Bit-depth expansion (BDE) algorithms have made great progress on single images. However, they are hard to apply directly to videos, as they impose no temporal constraints. Aiming at the problem of video BDE, we adopt an encoder-decoder structure with 3D convolutions to fuse spatial- and temporal-domain information. The encoder uses three-stage downsampling with 3D ResBlocks to align the features of different time steps, and the decoder adopts the inverse structure with Coordinate Attention to fuse the aligned features and reconstruct the high bit-depth frame. Quantitative and qualitative experimental results show that the proposed video BDE network is superior to other methods. Compared with the best existing BDE algorithm, we obtain a 1.51 dB improvement in PSNR. With no additional temporal information such as optical flow, our method is also superior in running speed.
Dynamic Adaptive Streaming over HTTP (DASH) is a video transmission protocol that adapts to different network conditions and heterogeneous client devices. The client first requests and parses the Media Presentation Description (MPD) file from the DASH server to obtain basic information, including the average bitrate list and segment URLs. After that, the adaptive bitrate (ABR) algorithm decides the quality of the next requested segment based on network and buffer conditions. However, there is a large discrepancy between the segment size calculated from this coarse-grained average bitrate and the real size, due to constantly changing video scenes; this is especially obvious for variable bitrate (VBR) encoded videos, and the error is invisible to the ABR algorithm. This paper first analyzes and confirms the inter-layer correlation of scalable video coding (SVC) encoded content, then designs a segment size prediction module to cooperate with the ABR algorithm. Experimental results show that predicting the size of an enhancement layer (EL) segment from all its lower layers increases the probability that the predicted value falls within the accurate interval by 20%–58%, compared with predicting the EL from the base layer (BL) alone. Besides, the ABR algorithm assisted by the size prediction module increases the average bitrate by 29.2%, reduces the average number of bitrate switches by 13.4%, and reduces the average number of rebuffering events by 78.7% compared with the standalone ABR algorithm.
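The prediction idea can be sketched as follows. This is a deliberately simplified stand-in for the paper's predictor (the ratio-based model and the byte counts are hypothetical): the size of the next EL segment is estimated from the sizes of all its lower layers, using a statistic gathered from segments already downloaded.

```python
# Illustrative sketch: predict an enhancement-layer (EL) segment's size from
# the already-known sizes of all its lower layers, using the mean
# lower-layers-to-EL size ratio observed on past segments of the same video.

def predict_el_size(history, lower_layer_sizes):
    """history: list of (sum_of_lower_layer_sizes, el_size) for past segments.
    lower_layer_sizes: per-layer sizes (BL and lower ELs) of the next segment."""
    ratios = [el / low for low, el in history]
    mean_ratio = sum(ratios) / len(ratios)
    return sum(lower_layer_sizes) * mean_ratio

# Hypothetical segment sizes in KB: (lower-layers total, EL size) per segment.
history = [(100, 50), (120, 66), (80, 36)]
print(predict_el_size(history, [90, 20]))
```

An ABR algorithm could call such a predictor before each request and budget the download against the predicted size rather than the MPD's coarse average bitrate.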
Distortion in real videos is affected by many factors, and many existing old videos suffer from low definition. However, most video enhancement based on paired learning is trained for a specific degradation and lacks the ability to enhance real low-definition videos with unknown distortion. In this paper, we propose a joint video enhancement algorithm based on unpaired learning, which uses a high-definition video to enhance a low-definition video with similar content. To train the network, we also build three pairs of unpaired video datasets, with chimpanzees, city night views, and military figures as contents. In experiments, we compare our method with enhancement algorithms based on paired learning and other unpaired frameworks, and find that our method achieves higher performance.
Video conferencing introduces a new scenario for video transmission, which focuses on preserving facial fidelity even in low-bandwidth network environments. In this work, we propose VSBNet, a framework that utilizes face landmarks in video compression. Our method uses adversarial learning to reconstruct the original frames from the landmarks. To recover more details and keep identity consistency, we propose the concept of visual sensitivity to separate the contour of the face from fast-moving parts such as the eyes and mouth. Experimental results demonstrate the superiority of our framework at a low bit rate of around 1 KB/s.
Gaze target detection aims to infer where each person in a scene is looking. Existing works focus on 2D gaze and 2D saliency, but fail to exploit 3D contexts. In this work, we propose a three-stage method to simulate human gaze inference behavior in 3D space. In the first stage, we introduce a coarse-to-fine strategy to robustly estimate a 3D gaze orientation from the head. The predicted gaze is decomposed into a planar gaze on the image plane and a depth-channel gaze. In the second stage, we develop a Dual Attention Module (DAM), which takes the planar gaze to produce the field of view and, regulated by depth information according to the depth-channel gaze, masks interfering objects. In the third stage, we use the generated dual attention as guidance to perform two sub-tasks: (1) identifying whether the gaze target is inside or outside the image; (2) locating the target if it is inside. Extensive experiments demonstrate that our approach performs favorably against state-of-the-art methods on the GazeFollow and VideoAttentionTarget datasets.
Video coding standards such as HEVC and VVC have achieved significant coding performance. However, the rate-distortion optimization (RDO) module in the coding framework ignores the characteristics of the human visual system (HVS), which makes it insufficient for perceptual video coding. Recently, the learning-based objective quality metric VMAF has been developed and demonstrated to achieve higher quality assessment accuracy than conventional metrics. To incorporate VMAF into RDO and thereby improve perceptual coding efficiency, this paper proposes a perceptual RDO scheme. A CNN-based online training method is first explored to determine the VMAF-related distortion estimation coefficient. Based on this coefficient and the R-D model, a VMAF-based Lagrangian multiplier is proposed to adjust the R-D trade-off of each coding block. Experiments demonstrate that the proposed method achieves an average of -2.80% VMAF-based BD-rate compared with the original HEVC, which effectively improves coding performance.
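The block-level use of an adjusted Lagrangian multiplier can be sketched as follows. This is a minimal illustration, not the paper's implementation: `k` stands in for the per-block VMAF-related distortion-estimation coefficient, and the candidate modes, distortions, and rates are hypothetical numbers.

```python
# Minimal sketch of block-level RDO with a perceptually adjusted Lagrangian
# multiplier: the base multiplier is scaled by a per-block VMAF-related
# coefficient k, then the mode minimizing J = D + lambda * R is chosen.

def best_mode(candidates, lam_base, k):
    """candidates: list of (mode_name, distortion, rate_in_bits)."""
    lam = lam_base * k                       # VMAF-adjusted multiplier
    return min(candidates, key=lambda c: c[1] + lam * c[2])

# Hypothetical candidates for one coding block:
candidates = [("intra_DC", 120.0, 40), ("intra_planar", 100.0, 55)]
print(best_mode(candidates, lam_base=2.0, k=0.8))
```

Note how shrinking `k` cheapens rate relative to distortion and can flip the decision toward the lower-distortion, higher-rate mode.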
Recently, the Joint Video Experts Team (JVET) completed the new Versatile Video Coding (H.266/VVC) standard. VVC employs a new block partition structure named quad-tree with nested multi-type tree (QTMT) to improve coding efficiency. However, the new partition structure greatly increases encoding time compared with HEVC due to brute-force rate-distortion (RD) optimization. To reduce encoding complexity, in this paper we propose a Support Vector Machine (SVM) based fast CU partitioning algorithm for VVC intra coding, which terminates redundant partitions early by predicting the partition of a CU from texture information. We train classifiers for CUs of different sizes to improve accuracy and to control the complexity of the classifiers themselves. Different thresholds are set for each classifier to achieve a trade-off between encoding complexity and RD performance. Experimental results show that the proposed method saves encoding time ranging from 30.78% to 63.16%, with a 1.10% to 2.71% BD-BR increase.
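The thresholded early-termination logic can be sketched like this. It is an illustrative toy, not the trained models from the paper: a linear decision function stands in for a size-specific SVM, and two thresholds decide when the classifier is confident enough to skip part of the brute-force RD search.

```python
# Illustrative sketch of SVM-guided CU partition early termination:
# confident scores skip part of the RD search, uncertain ones fall back
# to full RDO. Feature values, weights, and thresholds are hypothetical.

def svm_score(features, weights, bias):
    """Linear decision function standing in for a trained SVM."""
    return sum(f * w for f, w in zip(features, weights)) + bias

def partition_decision(features, weights, bias, t_split, t_nosplit):
    s = svm_score(features, weights, bias)
    if s >= t_split:
        return "split_only"   # skip RD-testing the non-split mode
    if s <= t_nosplit:
        return "no_split"     # terminate further partitioning early
    return "full_rdo"         # uncertain: test all partition modes

# Hypothetical texture features (e.g., gradient magnitude, variance):
print(partition_decision([0.9, 0.7], [1.0, 1.0], -0.5, 0.8, -0.8))
```

Widening the gap between the two thresholds makes the scheme more conservative (less time saving, smaller BD-BR loss), which is the trade-off the per-classifier thresholds control.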
Because of the explosive growth of face photos and their widespread dissemination and easy accessibility in social media, the security and privacy of personal identity information have become an unprecedented challenge. Meanwhile, the convenience brought by advanced identity-agnostic computer vision technologies is attractive. Therefore, it is important to use face images while taking care to protect people's identities. Given a face image, face de-identification, also known as face anonymization, refers to generating another image with a similar appearance and the same background while hiding the real identity. Although extensive efforts have been made, existing face de-identification techniques are either insufficiently photo-realistic or incapable of well balancing privacy and utility. In this paper, we focus on tackling these challenges to improve face de-identification. We propose IdentityDP, a face anonymization framework that combines a data-driven deep neural network with a differential privacy (DP) mechanism. This framework encompasses three stages: facial representation disentanglement, $\epsilon$-IdentityDP perturbation, and image reconstruction. Our model can effectively obfuscate the identity-related information of faces, preserve significant visual similarity, and generate high-quality images that can be used for identity-agnostic computer vision tasks such as detection and tracking. Different from previous methods, we can adjust the balance between privacy and utility through the privacy budget according to practical demands, and provide a diversity of results without pre-annotations. Extensive experiments demonstrate the effectiveness and generalization ability of our proposed anonymization framework.
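The DP-style perturbation stage can be illustrated with the classical Laplace mechanism. This is only a sketch of the general idea, under assumptions not taken from the paper: the identity embedding, the sensitivity value, and the noise placement are hypothetical placeholders.

```python
import math
import random

# Sketch of a Laplace perturbation on an identity embedding: noise scale is
# sensitivity / epsilon, so a smaller privacy budget epsilon means stronger
# perturbation (more anonymity, less utility). All values are illustrative.

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) via the inverse-CDF method."""
    u = rng.random() - 0.5                    # uniform in [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def perturb_identity(embedding, epsilon, sensitivity=1.0, seed=0):
    rng = random.Random(seed)                 # fixed seed for a repeatable demo
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in embedding]

emb = [0.2, -0.1, 0.4]                        # hypothetical identity embedding
strong = perturb_identity(emb, epsilon=0.5)   # small budget: heavy perturbation
weak = perturb_identity(emb, epsilon=5.0)     # large budget: light perturbation
print(strong, weak)
```

Tuning `epsilon` is what lets such a scheme trade privacy against utility on demand, which mirrors the adjustable privacy budget described above.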
5G and edge computing have brought great changes to the video industry. Interactive video is becoming an emerging form of multimedia service that offers attraction beyond typical scenarios such as cloud gaming and remote virtual reality (VR), and it places great demands on the resource capacity, response latency, and function flexibility of its service system. In this paper, we propose an elastic system architecture with low-latency features to accommodate generic interactive video applications on near-user edges. To increase system flexibility, we first design a dynamic Directed Acyclic Graph (dDAG) model for efficient task representation. Second, based on this model, we present the elastic architecture together with its scalable workflow pipeline. Third, we propose a set of novel latency measurement metrics to analyze and optimize the performance of an interactive video system. Based on the proposed approaches, we disassemble a real-world free-viewpoint synthesis application and benchmark its performance with the metrics. Extensive experimental results show the flexibility of our system in handling stochastic human interactions during a video service session, with less than 5 ms of additional scheduling latency introduced. End-to-end latency is kept within 43 ms for complex functions and 28 ms for simpler scenarios, which satisfies the restrictions of most interactive video applications provided by an edge. The client of the architecture serves as a pure video player, which is also friendly to power-limited terminals such as 5G phones. Efficiency and stability analyses of the system show superiority over existing work and reveal potential optimization directions for future research.
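The DAG-based task representation can be sketched with a small example. The task names and the rewiring rule below are hypothetical, chosen only to illustrate the "dynamic" aspect: a user interaction can insert or remove stages of the pipeline, after which a valid execution order is recomputed.

```python
from graphlib import TopologicalSorter

# Sketch of an interactive-video pipeline as a DAG of tasks. A dynamic DAG
# lets the scheduler rewire the pipeline at interaction time, e.g. inserting
# a view-synthesis stage only when the user changes the viewpoint.
# Task names are hypothetical, not taken from the paper.

def build_pipeline(free_viewpoint=False):
    # Each task maps to the set of tasks it depends on (its predecessors).
    deps = {
        "decode": set(),
        "render": {"decode"},
        "encode": {"render"},
        "transmit": {"encode"},
    }
    if free_viewpoint:
        deps["view_synthesis"] = {"decode"}
        deps["render"] = {"view_synthesis"}   # rewire: render now follows synthesis
    return list(TopologicalSorter(deps).static_order())

print(build_pipeline())
print(build_pipeline(free_viewpoint=True))
```

Because the graph is recomputed per interaction, a scheduler built on it can dispatch ready tasks to edge workers in topological order, which is one plausible reading of the scalable workflow pipeline described above.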
Intra prediction is a key technology for reducing spatial redundancy in modern video coding standards. Recently, deep learning based methods that directly generate the intra prediction with a neural network have achieved superior performance over traditional direction-based intra prediction. However, these methods lack the ability to handle complex blocks containing mixed directional textures or recurrent patterns, since they only use the neighboring reference samples of the current block. Other intermediate information generated during the coding process, denoted as reference priors in this paper, is not exploited. In this paper, a Current Frame Priors assisted Neural Network (CFPNN) is presented to improve intra prediction efficiency. Specifically, we utilize the local contextual information provided by multiple neighboring references as the primary inference source. Beyond the neighboring references, we additionally use two other reference priors within the current frame: the predictor found by intra block copy (IntraBC) and the corresponding residual component. The IntraBC predictor provides useful nonlocal information that, together with the neighboring local information, helps generate more accurate predictions for complex blocks, while the residual component, which reflects the characteristics of the block to some extent, is utilized to reduce the noise contained in the reconstructed reference samples. Moreover, we investigate the best way to integrate the proposed method into the codec. Experimental results demonstrate that, compared with HEVC, our proposed CFPNN achieves an average 4.1% BD-rate reduction for the luma component under the All Intra configuration.
Generative Adversarial Networks (GANs) have shown promising improvements in face synthesis and image manipulation. However, it remains difficult to swap the faces in videos with a specific target. The most well-known face swapping method, Deepfakes, focuses on reconstructing the face image with an auto-encoder while paying less attention to the identity gap between the source and target faces, which causes the swapped face to look like both the source face and the target face. In this work, we propose a cross-identity adversarial training mechanism for highly photo-realistic face swapping. Specifically, we introduce a corresponding discriminator that tries to distinguish the swapped faces, reconstructed faces, and real faces during training. In addition, an attention mechanism is applied to make our network robust to illumination variation. Comprehensive experiments demonstrate the superiority of our method over baseline models, both quantitatively and qualitatively.
Low-light videos are usually accompanied by acquisition noise, motion blur, and other specific distortions, which make them hard to compress with video coding technologies and lead to less satisfying compressed videos. To enhance both the subjective and objective quality of compressed dark videos, we propose a novel learning-based post-processing scheme for the most recent VVC standard. Specifically, we adopt a multi-scale residue learning structure, named the Dark Video Restoration Convolutional Neural Network (DVRCNN), as an additional out-of-loop post-processing method. To avoid the over-smoothing effect of the MSE metric, SSIM and texture losses are also added to the final loss function. The luma and chroma components are decomposed and then fed to two corresponding models separately. Compared with the VVC baseline on the six sequences given in the ICME 2020 Grand Challenge, our approach significantly reduces the BD-rate by 36.08% and achieves a clear improvement in both objective and subjective quality, especially for low bit-rate compressed sequences with severe distortion. Validation results show that the proposed model generalizes well to continuous scenes and variable bitrates.
4K-UHD videos have become popular because they significantly improve the user's visual experience. As video coding, transmission, and enhancement technologies develop rapidly, existing quality assessment metrics are not suitable for the 4K-UHD scenario because of the expanded resolution and the lack of training data. In this paper, we present a no-reference video quality assessment model that achieves high performance in the 4K-UHD scenario by simulating the full-reference metric VMAF. Our approach extracts deep spatial features and optical-flow-based temporal features from cropped frame patches. The overall score for a video clip is obtained as a weighted average of the patch results, to fully reflect the content of high-resolution video frames. The model is trained on an automatically generated HEVC-encoded 4K-UHD dataset labeled by VMAF. This strategy of constructing the dataset can easily be extended to other scenarios, such as HD resolution and other distortion types, by modifying the dataset and adjusting the network. In the absence of a reference video, our proposed model achieves considerable accuracy against VMAF labels and high correlation with human ratings, as well as relatively fast processing speed.
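The patch-aggregation step can be sketched as a plain weighted average. The weights here are hypothetical (e.g., weighting textured patches more heavily); the model's learned weighting may differ.

```python
# Sketch of aggregating per-patch quality predictions into a clip-level
# score via a weighted average. Scores and weights are illustrative.

def clip_score(patch_scores, patch_weights):
    assert len(patch_scores) == len(patch_weights)
    total_w = sum(patch_weights)
    return sum(s * w for s, w in zip(patch_scores, patch_weights)) / total_w

scores = [82.0, 74.0, 90.0]    # hypothetical per-patch VMAF-like predictions
weights = [1.0, 2.0, 1.0]      # e.g., higher weight for a textured patch
print(clip_score(scores, weights))
```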
The popularization of intelligent devices, including smartphones and surveillance cameras, has led to more serious privacy issues. De-identification, the process of concealing or replacing identity information, is regarded as an effective tool for visual privacy protection. Most existing de-identification methods suffer from limitations, since they mainly focus on the protection process and are usually non-reversible. In this paper, we propose a personalized and invertible de-identification method based on a deep generative model, whose main idea is to introduce a user-specific password and an adjustable parameter to control the direction and degree of identity variation. Extensive experiments demonstrate the effectiveness and generalization of our proposed framework for both face de-identification and recovery.