The project

A neural network learned how to drive a car by observing how I do it! :) I must say that it's one of the coolest projects that I have ever done. Udacity provided a simulator program where you had to drive a car for a while on two tracks to collect training data. Each sample consisted of a steering angle and images from three front-facing cameras.

The view from the cameras

Then, in the autonomous driving mode, you are given an image from the central camera and must send back an appropriate steering angle, such that the car does not go off-track.
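Conceptually, the serving side boils down to mapping a frame to an angle. Here's a minimal sketch of that step, assuming a Keras-style model; `preprocess` is a hypothetical placeholder for whatever cropping and normalization the training pipeline uses, and in reality the simulator exchanges frames and angles over a local socket:

```python
import numpy as np

def preprocess(frame):
    # Placeholder: in practice, crop to the road area and resize to the
    # network's input shape before normalizing.
    return frame.astype(np.float32) / 255.0 - 0.5

def steering_for(model, frame):
    """Map one central-camera frame to a steering angle."""
    x = preprocess(frame)
    return float(model.predict(x[None, ...])[0, 0])  # batch of one
```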

An elegant solution to this problem was described in an April 2016 paper by NVIDIA. I managed to replicate it in the simulator. Not without issues, though. The key takeaways for me were:

  • The importance of making sure that the training data set is balanced. That is, making sure that some categories of steering angles are not over-represented.
  • The importance of randomly jittering the input images. To quote another paper: "ConvNets architectures have built-in invariance to small translations, scaling and rotations. When a dataset does not naturally contain those deformations, adding them synthetically will yield more robust learning to potential deformations in the test set." (Both of these points are sketched in code right after this list.)
  • Not over-using dropout.
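Here's a rough sketch of the first two points, balancing and jittering. The numbers (bin count, per-bin cap, steering correction per pixel of shift) are illustrative assumptions, not the values I actually used:

```python
import numpy as np

def balance(images, angles, n_bins=25, max_per_bin=400):
    """Downsample over-represented steering-angle bins."""
    edges = np.linspace(angles.min(), angles.max(), n_bins + 1)
    bins = np.digitize(angles, edges)
    keep = []
    for b in np.unique(bins):
        idx = np.where(bins == b)[0]
        np.random.shuffle(idx)
        keep.extend(idx[:max_per_bin])       # cap each bin's contribution
    keep = np.sort(keep)
    return images[keep], angles[keep]

def jitter(image, angle, max_shift=40, angle_per_px=0.004):
    """Randomly shift the image horizontally, correcting the angle."""
    shift = np.random.randint(-max_shift, max_shift + 1)
    shifted = np.roll(image, shift, axis=1)  # crude translation; edges wrap
    return shifted, angle + shift * angle_per_px
```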

The model needed to train for 35 epochs, each consisting of 24 batches of 2048 images jittered on the fly. Processing one epoch took 104 seconds on Amazon's p2.xlarge instance and 826 seconds on my laptop: what took an hour on a Tesla K80 GPU would have taken my laptop over 8 hours.
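For reference, here's a minimal Keras sketch of the architecture from the NVIDIA paper together with the training loop described above. The layer sizes come from the paper; the activations, the optimizer, and the normalization are my assumptions, and `images`, `angles`, and `jitter` are the balanced training set and the helper from the earlier sketch:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Lambda, Conv2D, Flatten, Dense

model = Sequential([
    Lambda(lambda x: x / 255.0 - 0.5, input_shape=(66, 200, 3)),  # normalize
    Conv2D(24, (5, 5), strides=(2, 2), activation='relu'),
    Conv2D(36, (5, 5), strides=(2, 2), activation='relu'),
    Conv2D(48, (5, 5), strides=(2, 2), activation='relu'),
    Conv2D(64, (3, 3), activation='relu'),
    Conv2D(64, (3, 3), activation='relu'),
    Flatten(),
    Dense(100, activation='relu'),
    Dense(50, activation='relu'),
    Dense(10, activation='relu'),
    Dense(1),                # the steering angle: a single regression output
])
model.compile(optimizer='adam', loss='mse')

def batches(images, angles, batch_size=2048):
    """Endless generator yielding jittered batches."""
    while True:
        idx = np.random.choice(len(images), batch_size)
        pairs = [jitter(images[i], angles[i]) for i in idx]
        x, y = zip(*pairs)
        yield np.array(x), np.array(y)

# 35 epochs of 24 batches x 2048 images, as described above
model.fit_generator(batches(images, angles), steps_per_epoch=24, epochs=35)
```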

Results

Below are some sample results. The driving is not very smooth, but I blame that on myself not being a good driving model ;) The second track is especially interesting because it differs from the one the network was trained on. Interestingly enough, a MacBook Air did not have enough juice to run both the simulator and the model, even though the model is fairly small. I ended up having to create an ssh tunnel to my Linux laptop.

Track #1

Track #2

The classifier

I recently built a road sign classifier as an assignment for one of the online courses I participate in. The particulars of the implementation are unimportant. Suffice it to say that it's a variation on the solution found in this paper by Sermanet and LeCun and operates on the same data set of German road signs. The solution has around 4.3M trainable parameters, and there are around 300k training samples (after augmentation), 40k validation samples (after augmentation), and 12k testing samples. The classifier reached a test accuracy of 98.67%, which is just about human performance. That's not bad.
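The one particular worth naming is the multi-scale idea from that paper: features from an earlier stage skip ahead and feed the classifier together with the final-stage features. A minimal Keras sketch of that wiring, with made-up layer sizes (43 is the number of classes in the German data set):

```python
from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, concatenate

inputs = Input(shape=(32, 32, 1))   # 32x32 grayscale inputs, as in the paper
stage1 = MaxPooling2D(2)(Conv2D(32, (5, 5), padding='same', activation='relu')(inputs))
stage2 = MaxPooling2D(2)(Conv2D(64, (5, 5), padding='same', activation='relu')(stage1))

# Multi-scale: stage-1 features (pooled once more) join the stage-2 features
# at the classifier instead of passing only through the later layers.
merged = concatenate([Flatten()(MaxPooling2D(2)(stage1)), Flatten()(stage2)])
hidden = Dense(1024, activation='relu')(merged)
outputs = Dense(43, activation='softmax')(hidden)

model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
```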

What I want to share the most, though, is not any of the above, but the training benchmarks. I tested the classifier on four different machines in six configurations:

  • x1-cpu: My laptop, an i7-6600U CPU (2 cores / 4 threads at 2.60GHz, 4MB cache), 16GB RAM
  • g2.8-cpu: Amazon's g2.8xlarge instance, 32 Xeon E5-2670 vCPUs at 2.60GHz with 20MB cache, 60GB RAM
  • g2.2-cpu: Amazon's g2.2xlarge instance, 8 Xeon E5-2670 vCPUs at 2.60GHz with 20MB cache, 15GB RAM
  • g2.8-gpu: The same as g2.8-cpu but using its 4 GRID K520 GPUs
  • g2.2-gpu: The same as g2.2-cpu but using its 1 GRID K520 GPU
  • p2-gpu: Amazon's p2.xlarge instance, 4 Xeon E5-2686 vCPUs at 2.30GHz with 46MB cache, 60GB RAM, 1 Tesla K80 GPU

Here are the times it took to train one epoch, as well as how long the 540 epochs needed for the final solution would have taken on each configuration:

  • x1-cpu: 40:00/epoch, 15 days to train
  • g2.8-cpu: 6:24/epoch, 2 days 9 hours 36 minutes to train
  • g2.2-cpu: 16:15/epoch, 6 days 2 hours 15 minutes to train
  • g2.8-gpu: 1:37/epoch, 14 hours 33 minutes to train
  • g2.2-gpu: 1:37/epoch, 14 hours 33 minutes to train
  • p2-gpu: 0:56/epoch, 8 hours 24 minutes to train

I was quite surprised by these results. I knew that GPUs were better suited for this purpose, but I did not know that they were this much better. Some of the laptop's slowness might have been due to swapping: I ran the test with my usual laptop workload in the background, with Chrome taking a lot of RAM, although I was not actively doing anything during the test. When testing g2.8-cpu, it looked like only 24 out of the 32 CPU cores were busy. The three additional GPUs on g2.8-gpu did not seem to make any difference either; the test just ran the exact same graph as g2.2-gpu. TensorFlow allows you to pin operations to devices, but I did not do any of that, so there's likely a lot to gain from manual tuning.
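For completeness, pinning looks roughly like this in TensorFlow; this is a toy example, not anything from the actual training graph:

```python
import tensorflow as tf

# Toy example: put two independent chunks of work on different GPUs.
with tf.device('/gpu:0'):
    a = tf.matmul(tf.random_normal([1000, 1000]),
                  tf.random_normal([1000, 1000]))
with tf.device('/gpu:1'):
    b = tf.matmul(tf.random_normal([1000, 1000]),
                  tf.random_normal([1000, 1000]))
c = a + b  # placement of the sum is left to TensorFlow

# log_device_placement prints where each op actually ran
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(c)
```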

The results

I tested the classifier on pictures of a bunch of French and Swiss road signs taken around where I live. These are, in some cases, different from their German counterparts. When the network had enough training examples, it generalized well; otherwise, not so much. In the images below, you'll find sample sign pictures and the classifier's top three predictions, i.e., the three largest probabilities after applying softmax to the logits.
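The softmax step itself, for reference (a small numpy sketch, not the exact code I used):

```python
import numpy as np

def top3(logits):
    """Turn logits into probabilities and return the three best classes."""
    e = np.exp(logits - np.max(logits))  # shift for numerical stability
    probs = e / e.sum()
    best = np.argsort(probs)[::-1][:3]
    return [(int(i), float(probs[i])) for i in best]

print(top3(np.array([2.0, 0.5, 5.1, 1.3])))
# [(2, 0.928...), (0, 0.041...), (3, 0.020...)]
```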

Speed limit - training (top) vs. French (bottom)

No entry - training (top) vs. French (bottom)

Traffic lights - training (top) vs. French (bottom)

No trucks - training (top) vs. French (bottom)