Recent studies have shown remarkable success in synthesizing realistic talking faces by exploiting generative adversarial networks. However, existing methods are mostly target-specific, meaning they cannot generate images of previously unseen people, and they suffer from artifacts such as blurriness and mismatched facial details. In this paper, we tackle these problems by proposing a target-agnostic framework. We introduce a geometry-aware feature transformation module that achieves shape transfer while preserving the appearance of the source face. To further improve the image quality of the synthesized results, we present a multi-scale spatially-consistent transfer unit that maintains spatial consistency between the encoder and decoder features. Experimental results show that our model synthesizes photo-realistic talking faces of previously unseen people, outperforming state-of-the-art methods both qualitatively and quantitatively.
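
To make the idea concrete, below is a minimal PyTorch sketch of one plausible form of geometry-aware feature transformation, in which the source face's appearance features are instance-normalized and then re-modulated with per-channel scale and shift parameters predicted from a target geometry code (e.g., an embedding of facial landmarks). The class name `GeometryAwareTransform`, its arguments, and the landmark-embedding input are illustrative assumptions for exposition, not the paper's actual architecture.

```python
# A hedged sketch of geometry-conditioned feature modulation (AdaIN-style).
# All names here are hypothetical; the paper's module may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryAwareTransform(nn.Module):
    def __init__(self, feat_channels: int, geom_dim: int):
        super().__init__()
        # Predict per-channel modulation parameters from the geometry code.
        self.to_scale = nn.Linear(geom_dim, feat_channels)
        self.to_shift = nn.Linear(geom_dim, feat_channels)

    def forward(self, appearance_feat: torch.Tensor, geom_code: torch.Tensor) -> torch.Tensor:
        # appearance_feat: (B, C, H, W) encoder features of the source face
        # geom_code:       (B, D) embedding of the target facial geometry
        b, c, _, _ = appearance_feat.shape
        # Instance normalization strips source-specific feature statistics...
        normalized = F.instance_norm(appearance_feat)
        # ...which are then replaced by geometry-conditioned scale and shift,
        # transferring the target shape while the normalized content
        # carries the source appearance.
        scale = self.to_scale(geom_code).view(b, c, 1, 1)
        shift = self.to_shift(geom_code).view(b, c, 1, 1)
        return normalized * (1 + scale) + shift

# Example usage with dummy tensors:
module = GeometryAwareTransform(feat_channels=64, geom_dim=128)
feats = torch.randn(2, 64, 32, 32)   # source appearance features
geom = torch.randn(2, 128)           # target geometry embedding
out = module(feats, geom)            # shape: (2, 64, 32, 32)
```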