Testing AWS P3 instances

Intro

AWS has recently introduced the P3 instances. They come with Tesla V100 GPUs, so I decided to run a little benchmark to see how well they perform compared to my workstation when training neural networks. I installed the most recent versions of CUDA/cuDNN (9.0/7.0) and TensorFlow (1.4.0), and ran two non-trivial benchmarks that test both the GPU and the CPU.

Building the Software

Building TensorFlow from source is relatively straightforward, except that you need to install bazel. And gosh, never have I managed to build that stuff without issues. This time was no exception. I wrote an article about that in the past, so I won't go into much detail here. I will just say that you will need Protocol Buffers version 3.4.0, grpc-java version 1.6.1, and bazel version 0.7.0. You will then need to apply this patch, which I took from here and resolved the merge conflicts in. Then, you will need to apply this one on top. The rest should go smoothly.

Testing Setup

I used my workstation and two AWS GPU Compute instances. Their exact parameters are in the table below. Since my workstation has an SSD, I used RAM disks on AWS to make the tests more comparable.

Name	Description	CPU	GPU	CUDA Compute	Data Source
ti	My workstation	i7-6600U	GeForce GTX 1080 Ti	6.1	SSD
p2	AWS p2.xlarge	E5-2686	Tesla K80	3.7	RAM disk
p3	AWS p3.2xlarge	E5-2686	Tesla V100	7.0	RAM disk

The tests are object detection and semantic segmentation, each coming in a smaller and a larger flavor. Object detection processes all the input data in parallel with the GPU thread, whereas semantic segmentation does the processing serially in the main thread. On both the p3 and ti machines, CPU utilization was at 100% when running the semantic segmentation test, which means that the CPU is the bottleneck there.
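To make the difference between the two input strategies concrete, here is a minimal sketch. It is not the code of either benchmark: `preprocess`, `train_step`, `run_serial`, `run_parallel`, and the `depth` parameter are hypothetical stand-ins, and the real tests may well use processes or TensorFlow's own input queues rather than a plain Python thread.

```python
import queue
import threading


def preprocess(sample):
    # Hypothetical stand-in for CPU-side work (decoding, augmentation).
    return sample * 2


def train_step(batch):
    # Hypothetical stand-in for the GPU-side training step.
    return batch + 1


def run_serial(samples):
    """Preprocess and train in the main thread, one sample after another."""
    return [train_step(preprocess(s)) for s in samples]


def run_parallel(samples, depth=4):
    """Preprocess in a background thread while the main thread trains."""
    batches = queue.Queue(maxsize=depth)  # bounded buffer between the two
    done = object()  # sentinel marking the end of the input

    def producer():
        for s in samples:
            batches.put(preprocess(s))
        batches.put(done)

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while True:
        batch = batches.get()
        if batch is done:
            break
        results.append(train_step(batch))
    return results
```

Both functions produce the same results. In the serial variant the training step has to wait for every sample to be prepared, which is the pattern that would leave a fast GPU idle while the CPU is pegged at 100%.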

Results

Normalized Performance

Machine	VGG300	VGG512	KITTI	Cityscapes
ti	11:38	28:09	00:16	09:22
p2	46:39	1:49:05	00:50	20:31
p3	08:15	20:10	did not work	13:01

The results in the image above are normalized, with 1 being the score for the ti setup. The table contains the exact execution times (mm:ss or h:mm:ss) of training over one epoch. On the VGG tests, the V100 needs around 30% less time per epoch than the 1080 Ti, which works out to roughly 1.4 times the throughput. The 1080 Ti, in turn, is about four times faster than the K80. Also, a Core i7 seems to be more performant than the Xeon Amazon uses in its instances: on Cityscapes, the ti machine beats the p3 despite having the slower GPU. The KITTI test did not work on the V100; it hit a strange CUDA bug.
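For anyone who wants to recompute the ratios from the table, here is a small sketch. It rests on two assumptions: the times are in mm:ss or h:mm:ss, and the normalized score is the ti time divided by the machine's time, so higher means faster. The exact formula behind the chart is not spelled out, and the names `to_seconds`, `TIMES`, and `normalized` are made up for this illustration.

```python
def to_seconds(stamp):
    """Convert 'mm:ss' or 'h:mm:ss' into a number of seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


# Epoch times copied from the table; None marks the run that did not work.
TIMES = {
    "ti": {"VGG300": "11:38", "VGG512": "28:09",
           "KITTI": "00:16", "Cityscapes": "09:22"},
    "p2": {"VGG300": "46:39", "VGG512": "1:49:05",
           "KITTI": "00:50", "Cityscapes": "20:31"},
    "p3": {"VGG300": "08:15", "VGG512": "20:10",
           "KITTI": None, "Cityscapes": "13:01"},
}


def normalized(times, baseline="ti"):
    """Score each run as baseline_time / machine_time (assumed formula)."""
    scores = {}
    for machine, runs in times.items():
        scores[machine] = {}
        for test, stamp in runs.items():
            if stamp is None:
                scores[machine][test] = None
            else:
                base = to_seconds(times[baseline][test])
                scores[machine][test] = base / to_seconds(stamp)
    return scores


SCORES = normalized(TIMES)
for machine, runs in SCORES.items():
    print(machine, {t: (None if s is None else round(s, 2))
                    for t, s in runs.items()})
```

Under these assumptions the ti row is 1.0 by construction, and a score above 1 means the machine finished the epoch faster than the workstation.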

2017-11-09