Content from 2017-01

The project

A neural network learned how to drive a car by observing how I do it! :) I must say that it's one of the coolest projects that I have ever done. Udacity provided a simulator program where you had to drive a car for a while on two tracks to collect training data. Each sample consisted of a steering angle and images from three front-facing cameras.

The view from the cameras
The view from the cameras

Then, in the autonomous driving mode, you are given an image from the central camera and must send back an appropriate steering angle, such that the car does not go off-track.

An elegant solution to this problem was described in a paper by nVidia from April 2016. I managed to replicate it in the simulator. Not without issues, though. The key takeaways for me were:

  • The importance of making sure that the training data sample is balanced. That is, making sure that some categories of steering angles are not over-represented.
  • The importance of randomly jittering the input images. To quote another paper: "ConvNets architectures have built-in invariance to small translations, scaling and rotations. When a dataset does not naturally contain those deformations, adding them synthetically will yield more robust learning to potential deformations in the test set."
  • Not over-using dropout.

The model needed to train for 35 epochs. Each epoch consisted of 24 batches of 2048 images with on-the-fly jittering. It took 104 seconds to process one epoch on Amazon's p2.xlarge instance and 826 seconds to do the same thing on my laptop. What took an hour on a Tesla K80 GPU would have taken my laptop over 8 hours.


Below are some sample results. The driving is not very smooth, but I blame that on myself not being a good driving model ;) The second track is especially interesting, because it differs from the one that the network was trained on. Interestingly enough, a MacBook Air did no have enough juice to run both the simulator and the model, even though the model is fairly small. I ended up having to create an ssh tunnel to my Linux laptop.

Track #1

Track #2


Writing this blog became increasingly tedious over time. The reason for this was the slowness of the rendering tool I use - coleslaw. It seemed to work well for other people, though, so I decided to investigate what I am doing wrong. The problem came from the fact that the code coloring implementation (which I co-wrote) spawned a Python process every time it received a code block to handle. The coloring itself was fast. Starting and stopping Python every time was the cause of the issue. A solution for this malady is fairly simple. You keep the Python process running at all times and communicate with it via standard IO.

Surprisingly enough, I could not find an easy and convenient way to do it. The dominant paradigm of uiop:run-program seems to be spawn-process-close, and it does not allow for easy access to the actual streams. sb-ext:run-program does hand me the stream objects that I need, but it's not portable. While reading the code of uiop trying to figure out how to extract the stream objects from run-program, I accidentally discovered uiop:launch-program which does exactly what I need in a portable manner. It was implemented in asdf- released on Dec 1st, 2016 (a month and a half ago!). This post is meant as a piece of documentation that can be indexed by search engines to help spread my happy discovery. :)


The Python code reads commands from standard input and writes the responses to standard output. Both, commands and response headers are followed by newlines and an optional payload.

The commands are:

  • exit - what it does is self-evident
  • pygmentize|len|lang[|opts]:
    • len is the length of the code snippet
    • lang is the language to colorize
    • optional parameter opts is the configuration of the HTML formatter
    • after the newline, len utf-8 characters of the code block need to follow

There's only one response: colorized|len, followed by a newline and len utf-8 characters of the colorized code as an HTML snippet.

Python's automatic inference of standard IO's encoding is still pretty messed up, even in Python 3. It's a good idea to create wrapper objects and interact only with them:

1 input  = io.TextIOWrapper(sys.stdin.buffer,  encoding='utf-8')
2 output = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

Printing diagnostic messages to standard error output is useful for debugging:

1 def eprint(*args, **kwargs):
2     print(*args, file=sys.stderr, **kwargs)


OK, I have a python script that does the coloring. Before I can use it, I need to tell ASDF about it and locate where it is in the filesystem. The former is done by using the :static-file qualifier in the :components list. The latter is a bit more complicated. Since the file's location is known relative to the lisp file it will be used with, it's doable.

1 (defvar *pygmentize-path*
2   (merge-pathnames ""
3                    #.(or *compile-file-truename* *load-truename*))
4   "Path to the pygmentize script")

The trick here is to use #. to execute the statement at read-time. You can see the full explanation here.

With that out of the way, I can start the renderer with:

1 (defmethod start-concrete-renderer ((renderer (eql :pygments)))
2   (setf *pygmentize-process* (uiop:launch-program
3                               (list *python-command*
4                                     (namestring *pygmentize-path*))
5                               :input :stream
6                               :output :stream)))

For debugging purposes, it's useful to add :error-output "/tmp/debug", so that the diagnostics do not get eaten up by /dev/null.

To stop the process, we send it the exit command, flush the stream, and wait until the process dies:

1 (defmethod stop-concrete-renderer ((renderer (eql :pygments)))
2   (write-line "exit" (process-info-input *pygmentize-process*))
3   (force-output  (process-info-input *pygmentize-process*))
4   (wait-process *pygmentize-process*))

The Lisp part of the colorizer sends the pygmentize command together with the code snippet to Python and receives the colorized HTML:

 1 (defun pygmentize-code (lang params code)
 2   (let ((proc-input (process-info-input *pygmentize-process*))
 3         (proc-output (process-info-output *pygmentize-process*)))
 4     (write-line (format nil "pygmentize|~a|~a~@[|~a~]"
 5                         (length code) lang params)
 6                 proc-input)
 7     (write-string code proc-input)
 8     (force-output proc-input)
 9     (let ((nchars (parse-integer
10                    (nth 1
11                         (split-sequence #\| (read-line proc-output))))))
12       (coerce (loop repeat nchars
13                  for x = (read-char proc-output)
14                  collect x)
15               'string))))

See the entire pull request here.


I was able to get down from well over a minute to less that three seconds with the time it takes to generate this blog.

]==> time ./coleslaw-old.x /path/to/blog/
./coleslaw-old.x /path/to/blog/  66.40s user 6.19s system 98% cpu 1:13.55 total
]==> time ./coleslaw-new-no-renderer.x /path/to/blog/
./coleslaw-new-no-renderer.x /path/to/blog/  65.50s user 6.03s system 98% cpu 1:12.53 total
]==> time ./coleslaw-new-renderer.x /path/to/blog/
./coleslaw-new-renderer.x /path/to/blog/  2.78s user 0.27s system 106% cpu 2.849 total
  • coleslaw-old.x is the original code
  • coleslaw-new-no-renderer.x starts and stops the renderer with every code snippet
  • coleslaw-new-renderer.x starts the renderer beforehand and stops it after all the job is done

The classifier

I have built a road sign classifier recently as an assignment for one of the online courses I participate in. The particulars of the implementation are unimportant. It suffices to say that it's a variation on the solution found in this paper by Sermanet and LeCun and operates on the same data set of German road signs. The solution has around 4.3M trainable parameters, and there are around 300k training (after augmentation), 40k validation (after augmentation), and 12k testing samples. The classifier reached the testing accuracy of 98.67%, which is just about human performance. That's not bad.

The thing that I want to share the most is not all mentioned above, but the training benchmarks. I tested it on three different machines in 5 configurations in total:

  • x1-cpu: My laptop, four i7-6600U CPU cores at 2.60GHz each and 4MB cache, 16GB RAM
  • g2.8-cpu: Amazon's g2.8xlarge instance, 32 Xeon E5-2670 CPU cores at 2.60GHz each with 20MB cache, 60GB RAM
  • g2.2-cpu: Amazon's g2.2xlarge instance, 8 Xeon E5-2670 CPU cores at 2.60GHz each with 20MB cache, 15GB RAM
  • g2.8-gpu: The same as g2.8-cpu but used the 4 GRID K520 GPUs
  • g2.2-gpu: The same as g2.2-cpu but used the 1 GRID K520 GPU
  • p2-gpu: Amazon's p2.xlarge instance, 4 Xeon E5-2686 CPU cores 2.30GHz each with 46MB cache, 60GB RAM, 1 Tesla K80 GPU

Here are the times it took to train one epoch as well as how long it would have taken to train for 540 epochs (it took 540 epochs to get the final solution):

  • x1-cpu: 40 minutes/epoch, 15 days to train
  • g2.8-cpu: 6:24/epoch, 2 days 9 hours 36 minutes to train
  • g2.2-cpu: 16:15/epoch, 6 days, 2 hours 15 minutes to train
  • g2.8-gpu: 1:37/epoch, 14 hours, 33 minutes to train
  • g2.2-gpu: 1:37/epoch, 14 hours, 33 minutes to train
  • p2-gpu: 56 seconds/epoch, 8 hours, 24 minutes to train

I was quite surprised by these results. I knew that GPUs are better suited for this purpose, but I did not know that they are this much better. The slowness of the laptop might have been due to swapping. I run the test with the usual (unused) laptop workload and Chrome taking a lot of RAM. I was not doing anything during the test, though. When testing with g2.8-cpu, it looked like only 24 out of the 32 CPU cores were busy. Three additional GPUs on g2.8-gpu did not seem to have made any difference. TensorFlow allows you to pin operations to devices, but I did not do any of that. The test just runs the same exact graph as g2.2-gpu. There's likely a lot to gain by doing manual tuning.

The results

I tested it on pictures of a bunch of French and Swiss road signs taken around where I live. These are in some cases different from their German counterparts. When the network had enough training examples, it generalized well, otherwise, not so much. In the images below, you'll find sample sign pictures and the top three logits returned by the classifier after applying softmax.

Speed limit - training (top) vs. French (bottom)
Speed limit - training (top) vs. French (bottom)

No entry - training (top) vs. French (bottom)
No entry - training (top) vs. French (bottom)

Traffic lights - training (top) vs. French (bottom)
Traffic lights - training (top) vs. French (bottom)

No trucks - training (top) vs. French (bottom)
No trucks - training (top) vs. French (bottom)


I must be getting old and eccentric. I have recently started working on the Coursera's Scala Specialization. It's all great, but my first realization was that the tools they use are not going to work for me. The problem lies mainly with sbt - their build tool. It fetches and installs in your system God knows what from God knows where to bootstrap itself. I don't trust this kind of behavior in the slightest. I understand that automatic pulling of dependencies and auto-update may save work, but they are also dangerous. I don't even want to mention that they contribute to general bloat and sluggishness that plagues the Java world. You don't need to know what depends on what, so everything uses everything, without a single thought.

Having said all that, I do trust and use QuickLisp. So, go figure.

I would still like to take the class, but I would like to do it using Emacs and command line. Here's what I did.


You'll need the following packages:

==> sudo apt-get install default-jdk scala ant junit4 scala-mode-el

The assignments they give seem to have a stub for implementation and a bunch of scalatest test suites. I will also want to write some other code to play with things. This is the directory structure I used:

==> find -type f

You can get all this here and will need to run the script to fetch the two scalatest jar files before you can do anything else.


I wrote an ant build template that sets up scalac and scalatest tasks and provides compile and test targets. All that the build.xml files in the subdirectories need to do is define some properties and import the template:

1 <project default="compile">
2   <property name="" value="recfun.jar" />
3   <property name="jar.class" value="recfun.Main" />
4   <property name="tests-wildcard" value="recfun" />
5   <import file="../build-targets.xml" />
6 </project>
  • is the name of the resulting jar file
  • jar.class is the default class that should run
  • test-wildcard is the name of the package containing the test suites (they are discovered automatically)

You can then run the thing (some output has been omitted for clarity):

]==> ant compile
    [mkdir] Created dir: ./build/classes
    [mkdir] Created dir: ./build-test

   [scalac] Compiling 1 source file to ./build/classes
      [jar] Building jar: ./build/recfun.jar

]==> ant test
   [scalac] Compiling 3 source files to ./build-test

[scalatest] Discovery starting.
[scalatest] Discovery completed in 127 milliseconds.
[scalatest] Run starting. Expected test count is: 11
[scalatest] BalanceSuite:
[scalatest] - balance: '(if (zero? x) max (/ 1 x))' is balanced
[scalatest] - balance: 'I told him ...' is balanced
[scalatest] - balance: ':-)' is unbalanced
[scalatest] - balance: counting is not enough
[scalatest] PascalSuite:
[scalatest] - pascal: col=0,row=2
[scalatest] - pascal: col=1,row=2
[scalatest] - pascal: col=1,row=3
[scalatest] CountChangeSuite:
[scalatest] - countChange: example given in instructions
[scalatest] - countChange: sorted CHF
[scalatest] - countChange: no pennies
[scalatest] - countChange: unsorted CHF
[scalatest] Run completed in 246 milliseconds.
[scalatest] Total number of tests run: 11
[scalatest] Suites: completed 4, aborted 0
[scalatest] Tests: succeeded 11, failed 0, canceled 0, ignored 0, pending 0
[scalatest] All tests passed.

Submiting the assignments

I could probably write some code to do the assignment submission in python or using ant, but I am too lazy for that. I will use the container handling script that I wrote for another occasion and install sbt in there. The docker/devel sub dir contains a Dockerfile for an image that has sbt installed in it.

==> git clone
==> cd jail/docker/devel
==> docker build -t jail:v01-dev .

This is the configurations script: