Video conferencing introduces a new scenario for video transmission, one that prioritizes preserving the fidelity of faces even in low-bandwidth network environments. In this work, we propose VSBNet, a framework that utilizes face landmarks for video compression. Our method uses adversarial learning to reconstruct the original frames from the landmarks. To recover more detail and keep identity consistent, we introduce the concept of visual sensitivity, which separates the contour of the face from fast-moving parts such as the eyes and mouth. Experimental results demonstrate the superiority of our framework at a low bit rate of around 1 KB/s.