Video Multimethod Assessment Fusion (VMAF), a state-of-the-art learning-based objective quality metric, has been shown to correlate more strongly with human visual perception than other metrics. However, because it fuses multiple elementary metrics with a Support Vector Machine (SVM), its high complexity and non-differentiability make direct integration into an encoder impractical. To alleviate this problem, we propose a novel perceptual distortion metric directed by VMAF (D-VMAF) for optimizing perceptual quality. D-VMAF fits the relationship between frame-level VMAF and block-level content using a combination of spatial and temporal factors. Based on the proposed perceptual distortion metric, the Lagrangian multiplier in rate-distortion optimization (RDO) is adjusted adaptively. Experimental results demonstrate that our method achieves a 3% coding performance gain in terms of BD-VMAF compared to the original HM encoder.
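
For orientation, the adaptive adjustment of the Lagrangian multiplier can be pictured with the standard RDO cost. The block below is only an illustrative sketch, not the formulation derived in this paper: the block index $k$, the weight $w_k$, and the per-block multiplier $\lambda_k$ are hypothetical symbols introduced here, and how $w_k$ would be obtained from the spatial and temporal factors is not specified.

```latex
% Standard Lagrangian rate-distortion cost minimized by the encoder:
%   D = distortion, R = rate (bits), \lambda = Lagrangian multiplier.
\begin{equation}
  J = D + \lambda R
\end{equation}

% Hypothetical block-adaptive variant (assumption, for illustration only):
% a positive weight w_k scales the multiplier for block k.
\begin{equation}
  J_k = D_k + \lambda_k R_k, \qquad \lambda_k = w_k \, \lambda, \quad w_k > 0
\end{equation}
```

Under this assumed form, a weight $w_k < 1$ lowers the penalty on rate, so the encoder tends to spend more bits on block $k$, while $w_k > 1$ has the opposite effect.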