AWS has recently introduced the P3 instances. They come with Tesla V100 GPUs, so I decided to run a little benchmark to see how well they perform compared to my workstation when training neural networks. I installed the most recent versions of CUDA/cuDNN (9.0/7.0) and TensorFlow (1.4.0), and run two non-trivial benchmarks that test both the GPU and the CPU.
Building the Software
Building TensorFlow from source is relatively straightforward, except that you need to install bazel. And gosh, never have I ever managed to build that stuff without issues. This time was not an exception. I wrote an article about that in the past, so I won't go into much detail here. I will just say that you will need Protocol Buffers version 3.4.0, grpc-java version 1.6.1, and bazel version 0.7.0. You will then need to apply this patch that I have taken from here and resolved the merge conflicts. Then, you will need to apply this one on top. The rest should go smooth.
I used my workstation, and two AWS GPU Compute instances. Their exact parameters are in the table below. Since my workstation has an SSD, I used RAM disks on AWS to make the tests more comparable.
|Name||Description||CPU||GPU||CUDA Compute||Data Source|
|ti||My workstation||i7-6600U||GeForce GTX 1080 Ti||6.1||SSD|
|p2||AWS p2.xlarge||E5-2686||Tesla K80||3.7||RAM disk|
|p3||AWS p3.2xlarge||E5-2686||Tesla V100||7.0||RAM disk|
The tests are object detection and semantic segmentation, both coming in a smaller and a larger flavor. The former one processes all the input data in parallel to the GPU thread, whereas the latter does the processing serially in the main thread. On both, the p3 and ti machines, the CPU utilization was at 100% when running the semantic segmentation test. It means that the CPU is a bottleneck here.
|p3||08:15||20:10||did not work||13:01|
The results in the image above are normalized, with 1 being the score for the ti setup. The table contains the exact execution times of training over one epoch. The V100 is around 30% faster than the 1080 Ti. The 1080 Ti, in turn, is about four times faster than the K80. Also, a Core i7 seems to be more performant than the Xeon Amazon uses in their instances. The KITTI test did not work on the V100 - it has hit a strange CUDA bug.