The classifier

I have built a road sign classifier recently as an assignment for one of the online courses I participate in. The particulars of the implementation are unimportant. It suffices to say that it's a variation on the solution found in this paper by Sermanet and LeCun and operates on the same data set of German road signs. The solution has around 4.3M trainable parameters, and there are around 300k training (after augmentation), 40k validation (after augmentation), and 12k testing samples. The classifier reached the testing accuracy of 98.67%, which is just about human performance. That's not bad.

The thing that I want to share the most is not all mentioned above, but the training benchmarks. I tested it on three different machines in 5 configurations in total:

  • x1-cpu: My laptop, four i7-6600U CPU cores at 2.60GHz each and 4MB cache, 16GB RAM
  • g2.8-cpu: Amazon's g2.8xlarge instance, 32 Xeon E5-2670 CPU cores at 2.60GHz each with 20MB cache, 60GB RAM
  • g2.2-cpu: Amazon's g2.2xlarge instance, 8 Xeon E5-2670 CPU cores at 2.60GHz each with 20MB cache, 15GB RAM
  • g2.8-gpu: The same as g2.8-cpu but used the 4 GRID K520 GPUs
  • g2.2-gpu: The same as g2.2-cpu but used the 1 GRID K520 GPU
  • p2-gpu: Amazon's p2.xlarge instance, 4 Xeon E5-2686 CPU cores 2.30GHz each with 46MB cache, 60GB RAM, 1 Tesla K80 GPU

Here are the times it took to train one epoch as well as how long it would have taken to train for 540 epochs (it took 540 epochs to get the final solution):

  • x1-cpu: 40 minutes/epoch, 15 days to train
  • g2.8-cpu: 6:24/epoch, 2 days 9 hours 36 minutes to train
  • g2.2-cpu: 16:15/epoch, 6 days, 2 hours 15 minutes to train
  • g2.8-gpu: 1:37/epoch, 14 hours, 33 minutes to train
  • g2.2-gpu: 1:37/epoch, 14 hours, 33 minutes to train
  • p2-gpu: 56 seconds/epoch, 8 hours, 24 minutes to train

I was quite surprised by these results. I knew that GPUs are better suited for this purpose, but I did not know that they are this much better. The slowness of the laptop might have been due to swapping. I run the test with the usual (unused) laptop workload and Chrome taking a lot of RAM. I was not doing anything during the test, though. When testing with g2.8-cpu, it looked like only 24 out of the 32 CPU cores were busy. Three additional GPUs on g2.8-gpu did not seem to have made any difference. TensorFlow allows you to pin operations to devices, but I did not do any of that. The test just runs the same exact graph as g2.2-gpu. There's likely a lot to gain by doing manual tuning.

The results

I tested it on pictures of a bunch of French and Swiss road signs taken around where I live. These are in some cases different from their German counterparts. When the network had enough training examples, it generalized well, otherwise, not so much. In the images below, you'll find sample sign pictures and the top three logits returned by the classifier after applying softmax.

Speed limit - training (top) vs. French (bottom)
Speed limit - training (top) vs. French (bottom)

No entry - training (top) vs. French (bottom)
No entry - training (top) vs. French (bottom)

Traffic lights - training (top) vs. French (bottom)
Traffic lights - training (top) vs. French (bottom)

No trucks - training (top) vs. French (bottom)
No trucks - training (top) vs. French (bottom)