The emerging HEVC standard supports up to 12 variable block sizes ranging from 4×8/8×4 to 64×64 to conduct motion estimation (ME) and motion compensation (MC). This feature contributes considerable coding gain compared with 7 variable block sizes in H.264/AVC at the cost of huge computational complexity. In the test model HM, ME with variable block sizes (VBSME) may be called up to 425 times for the mode decision procedure of one CTU (Coding Tree Unit). Obviously, VBSME becomes the bottleneck for real time encoding. In this paper, we focus on parallel realization architecture design of VBSME in HEVC. Firstly, an efficient parallel encoder framework is proposed for CPU plus GPU platform. With the framework, VBSME, fractional-pixel image interpolation and border padding processes run on GPU without burden on the host CPU. Secondly, for workload balance between CPU and GPU, a fast Prediction Unit partition mode decision algorithm is also proposed. Lastly, the parallel realization strategy of VBSME on GPU is improved for ME compression performance improvement. Experimental results based on the NVIDIA’s C2050 GPU show that the speed of the VBSME strategy on GPU is about 113 times faster than the one on CPU.