Content from 2017-01
A neural network learned how to drive a car by observing how I do it! :) I must say that it's one of the coolest projects that I have ever done. Udacity provided a simulator program where you had to drive a car for a while on two tracks to collect training data. Each sample consisted of a steering angle and images from three front-facing cameras.
Then, in the autonomous driving mode, you are given an image from the central camera and must send back an appropriate steering angle, such that the car does not go off-track.
An elegant solution to this problem was described in a paper by nVidia from April 2016. I managed to replicate it in the simulator. Not without issues, though. The key takeaways for me were:
- The importance of making sure that the training data sample is balanced. That is, making sure that some categories of steering angles are not over-represented.
- The importance of randomly jittering the input images. To quote another paper: "ConvNets architectures have built-in invariance to small translations, scaling and rotations. When a dataset does not naturally contain those deformations, adding them synthetically will yield more robust learning to potential deformations in the test set."
- Not over-using dropout.
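One way to keep some steering angles from dominating the training set is to bin the samples by angle and cap the size of each bin. The sketch below only illustrates that idea and is not necessarily what was done in this project: the function name, the (angle, image_path) sample layout, the bin width, and the cap are all assumptions made for the example.

```python
import random
from collections import defaultdict


def balance_by_angle(samples, bin_width=0.1, max_per_bin=200, seed=0):
    """Cap the number of samples that fall into each steering-angle bin.

    `samples` is assumed to be a list of (angle, image_path) tuples;
    this layout is hypothetical, not the simulator's actual format.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for sample in samples:
        angle = sample[0]
        # Samples whose angles round to the same multiple of bin_width
        # share a bin.
        bins[round(angle / bin_width)].append(sample)

    balanced = []
    for key in sorted(bins):
        group = bins[key]
        if len(group) > max_per_bin:
            # Randomly drop the surplus from over-represented bins.
            group = rng.sample(group, max_per_bin)
        balanced.extend(group)

    rng.shuffle(balanced)
    return balanced
```

Driving straight tends to produce far more near-zero angles than anything else, so in practice the cap mostly trims the bin around zero and leaves the rarer sharp turns untouched.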
The model needed to train for 35 epochs. Each epoch consisted of 24 batches of 2048 images with on-the-fly jittering. It took 104 seconds to process one epoch on Amazon's p2.xlarge instance and 826 seconds to do the same thing on my laptop. What took an hour on a Tesla K80 GPU would have taken my laptop over 8 hours.
Below are some sample results. The driving is not very smooth, but I blame that on myself not being a good driving model ;) The second track is especially interesting, because it differs from the one that the network was trained on. Interestingly enough, a MacBook Air did not have enough juice to run both the simulator and the model, even though the model is fairly small. I ended up having to create an ssh tunnel to my Linux laptop.
Writing this blog became increasingly tedious over time. The reason was the slowness of the rendering tool I use, coleslaw. It seemed to work well for other people, though, so I decided to investigate what I was doing wrong. The problem came from the fact that the code coloring implementation (which I co-wrote) spawned a Python process every time it received a code block to handle. The coloring itself was fast; starting and stopping Python every time was the cause of the issue. The solution for this malady is fairly simple: you keep the Python process running at all times and communicate with it via standard IO.
Surprisingly enough, I could not find an easy and convenient way to do it. The
dominant paradigm of
uiop:run-program seems to be spawn-process-close, and it
does not allow for easy access to the actual streams.
hand me the stream objects that I need, but it's not portable. While reading the
code of uiop trying to figure out how to extract the stream objects from
run-program, I accidentally discovered
uiop:launch-program which does
exactly what I need in a portable manner. It was implemented in asdf-220.127.116.11
released on Dec 1st, 2016 (a month and a half ago!). This post is meant as a
piece of documentation that can be indexed by search engines to help spread my
happy discovery. :)
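The keep-it-running pattern itself is independent of Lisp. Below is a minimal sketch of it in Python, using only the standard library. The child process is a toy that upper-cases each line it receives, and the class and method names are made up for this example; it is not the colorizer described in this post.

```python
import subprocess
import sys

# Toy child program (an assumption for the example): echoes every line
# back in upper case until it receives "exit".
CHILD = r"""
import sys
while True:
    line = sys.stdin.readline()
    if not line or line.rstrip("\n") == "exit":
        break
    sys.stdout.write(line.rstrip("\n").upper() + "\n")
    sys.stdout.flush()
"""


class PersistentWorker:
    """Start the child once and reuse it for many requests."""

    def __init__(self):
        self.proc = subprocess.Popen(
            [sys.executable, "-c", CHILD],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            encoding="utf-8",  # text-mode pipes with an explicit encoding
        )

    def request(self, text):
        # One line out, one line back; flush so the child actually sees it.
        self.proc.stdin.write(text + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def stop(self):
        # Ask the child to quit, then wait for it to die.
        self.proc.stdin.write("exit\n")
        self.proc.stdin.close()
        code = self.proc.wait()
        self.proc.stdout.close()
        return code
```

The cost of starting the interpreter is paid once in the constructor; every later `request` call is just a write and a read on pipes that are already open.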
The Python code reads commands from standard input and writes the responses to standard output. Both commands and response headers are followed by a newline and an optional payload.
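The commands are:
- pygmentize|len|lang[|params], followed by a newline and len UTF-8 characters of the code to colorize
- exit - what it does is self-evident
There's only one response:
- colorized|len, followed by a newline and len UTF-8 characters of the colorized code as an HTML snippet.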
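To make the framing concrete, here is a rough sketch of what the Python side of such a protocol could look like. It is not the actual script: `serve` and `colorize_stub` are hypothetical names, and the stub only HTML-escapes the code instead of calling Pygments, so that the example stays self-contained.

```python
import html
import io


def colorize_stub(code, lang, params=None):
    # Stand-in for the real Pygments call (an assumption for this sketch):
    # wraps the escaped code in a <pre> tagged with the language name.
    return '<pre class="{}">{}</pre>'.format(html.escape(lang), html.escape(code))


def serve(inp, out, colorize=colorize_stub):
    """Read commands from `inp` and write responses to `out` until exit or EOF."""
    while True:
        header = inp.readline()
        if not header:
            break  # the other end closed the stream
        fields = header.rstrip("\n").split("|")
        if fields[0] == "exit":
            break
        if fields[0] == "pygmentize":
            length, lang = int(fields[1]), fields[2]
            params = fields[3] if len(fields) > 3 else None
            code = inp.read(length)  # the payload is exactly `length` characters
            result = colorize(code, lang, params)
            out.write("colorized|{}\n".format(len(result)))
            out.write(result)
            out.flush()


# Usage with in-memory streams; a real script would pass wrapped stdin/stdout.
demo_in = io.StringIO("pygmentize|5|python\nx < 1exit\n")
demo_out = io.StringIO()
serve(demo_in, demo_out)
```

Because the header carries the payload length, the payload can contain newlines and pipe characters without any escaping.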
Python's automatic inference of standard IO's encoding is still pretty messed up, even in Python 3. It's a good idea to create wrapper objects and interact only with them:
import io
import sys
input = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
output = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
Printing diagnostic messages to standard error output is useful for debugging:
def eprint(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)
OK, I have a Python script that does the coloring. Before I can use it, I need
to tell ASDF about it and locate it in the filesystem. The former is done by
using the :static-file qualifier in the :components list. The latter is a bit
more complicated, but since the file's location is known relative to the Lisp
file it will be used with, it's doable.
(defvar *pygmentize-path*
  (merge-pathnames "pygmentize.py"
                   #.(or *compile-file-truename* *load-truename*))
  "Path to the pygmentize script")
The trick here is to use
#. to execute the statement at read-time. You can see
the full explanation here.
With that out of the way, I can start the renderer with:
(defmethod start-concrete-renderer ((renderer (eql :pygments)))
  (setf *pygmentize-process* (uiop:launch-program
                              (list *python-command*
                                    (namestring *pygmentize-path*))
                              :input :stream
                              :output :stream)))
For debugging purposes, it's useful to add
:error-output "/tmp/debug", so that
the diagnostics do not get eaten up by
To stop the process, we send it the
exit command, flush the stream, and wait
until the process dies:
(defmethod stop-concrete-renderer ((renderer (eql :pygments)))
  (write-line "exit" (process-info-input *pygmentize-process*))
  (force-output (process-info-input *pygmentize-process*))
  (wait-process *pygmentize-process*))
The Lisp part of the colorizer sends the
pygmentize command together with the
code snippet to Python and receives the colorized HTML:
(defun pygmentize-code (lang params code)
  (let ((proc-input (process-info-input *pygmentize-process*))
        (proc-output (process-info-output *pygmentize-process*)))
    (write-line (format nil "pygmentize|~a|~a~@[|~a~]"
                        (length code) lang params)
                proc-input)
    (write-string code proc-input)
    (force-output proc-input)
    (let ((nchars (parse-integer
                   (nth 1
                        (split-sequence #\| (read-line proc-output))))))
      (coerce (loop repeat nchars
                    for x = (read-char proc-output)
                    collect x)
              'string))))
See the entire pull request here.
I was able to bring the time it takes to generate this blog down from well over a minute to less than three seconds.
]==> time ./coleslaw-old.x /path/to/blog/
./coleslaw-old.x /path/to/blog/ 66.40s user 6.19s system 98% cpu 1:13.55 total
]==> time ./coleslaw-new-no-renderer.x /path/to/blog/
./coleslaw-new-no-renderer.x /path/to/blog/ 65.50s user 6.03s system 98% cpu 1:12.53 total
]==> time ./coleslaw-new-renderer.x /path/to/blog/
./coleslaw-new-renderer.x /path/to/blog/ 2.78s user 0.27s system 106% cpu 2.849 total
- coleslaw-old.x is the original code
- coleslaw-new-no-renderer.x starts and stops the renderer with every code snippet
- coleslaw-new-renderer.x starts the renderer beforehand and stops it after the whole job is done
I have built a road sign classifier recently as an assignment for one of the online courses I participate in. The particulars of the implementation are unimportant. It suffices to say that it's a variation on the solution found in this paper by Sermanet and LeCun and operates on the same data set of German road signs. The solution has around 4.3M trainable parameters, and there are around 300k training (after augmentation), 40k validation (after augmentation), and 12k testing samples. The classifier reached the testing accuracy of 98.67%, which is just about human performance. That's not bad.
What I most want to share is not any of the above, but the training benchmarks. I tested it on four different machines in six configurations in total:
- x1-cpu: My laptop, four i7-6600U CPU cores at 2.60GHz each and 4MB cache, 16GB RAM
- g2.8-cpu: Amazon's g2.8xlarge instance, 32 Xeon E5-2670 CPU cores at 2.60GHz each with 20MB cache, 60GB RAM
- g2.2-cpu: Amazon's g2.2xlarge instance, 8 Xeon E5-2670 CPU cores at 2.60GHz each with 20MB cache, 15GB RAM
- g2.8-gpu: The same as g2.8-cpu but used the 4 GRID K520 GPUs
- g2.2-gpu: The same as g2.2-cpu but used the 1 GRID K520 GPU
- p2-gpu: Amazon's p2.xlarge instance, 4 Xeon E5-2686 CPU cores at 2.30GHz each with 46MB cache, 60GB RAM, 1 Tesla K80 GPU
Here are the times it took to train one epoch as well as how long it would have taken to train for 540 epochs (it took 540 epochs to get the final solution):
- x1-cpu: 40 minutes/epoch, 15 days to train
- g2.8-cpu: 6:24/epoch, 2 days 9 hours 36 minutes to train
- g2.2-cpu: 16:15/epoch, 6 days 2 hours 15 minutes to train
- g2.8-gpu: 1:37/epoch, 14 hours 33 minutes to train
- g2.2-gpu: 1:37/epoch, 14 hours 33 minutes to train
- p2-gpu: 56 seconds/epoch, 8 hours 24 minutes to train
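The totals above are simply the per-epoch time multiplied by 540. A small sketch of that arithmetic follows; the dictionary and function names are made up for the example, while the timings are the ones listed above.

```python
from datetime import timedelta

EPOCHS = 540

# Per-epoch training time in seconds, taken from the list above.
SECONDS_PER_EPOCH = {
    "x1-cpu": 40 * 60,
    "g2.8-cpu": 6 * 60 + 24,
    "g2.2-cpu": 16 * 60 + 15,
    "g2.8-gpu": 1 * 60 + 37,
    "g2.2-gpu": 1 * 60 + 37,
    "p2-gpu": 56,
}


def total_training_time(config, epochs=EPOCHS):
    """Total wall-clock time for `epochs` epochs on the given configuration."""
    return timedelta(seconds=SECONDS_PER_EPOCH[config] * epochs)


def speedup(slow, fast):
    """How many times faster `fast` is than `slow`, per epoch."""
    return SECONDS_PER_EPOCH[slow] / SECONDS_PER_EPOCH[fast]


for name in SECONDS_PER_EPOCH:
    print(name, total_training_time(name))
```

By the same numbers, `speedup("x1-cpu", "p2-gpu")` comes out at roughly 43, and `speedup("g2.2-cpu", "g2.2-gpu")` at roughly 10.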
I was quite surprised by these results. I knew that GPUs are better suited for this purpose, but I did not know that they are this much better. The slowness of the laptop might have been due to swapping. I ran the test with the usual (unused) laptop workload and Chrome taking a lot of RAM. I was not doing anything during the test, though. When testing with g2.8-cpu, it looked like only 24 out of the 32 CPU cores were busy. The three additional GPUs on g2.8-gpu did not seem to make any difference. TensorFlow allows you to pin operations to devices, but I did not do any of that. The test just ran the exact same graph as on g2.2-gpu. There's likely a lot to gain by doing manual tuning.
I tested it on pictures of a bunch of French and Swiss road signs taken around where I live. These are in some cases different from their German counterparts. When the network had enough training examples of a sign, it generalized well; otherwise, not so much. In the images below, you'll find sample sign pictures and the top three probabilities returned by the classifier, obtained by applying softmax to the logits.
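For reference, turning raw classifier outputs into a top-three list works roughly as sketched below. This is plain Python rather than the TensorFlow code of the actual classifier, and the function names, logits, and labels are invented for the example.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, labels, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probabilities = softmax(logits)
    ranked = sorted(zip(labels, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical logits for four sign classes.
example_labels = ["stop", "yield", "speed limit 50", "no entry"]
example_logits = [2.0, 1.0, 0.1, -1.0]
for label, probability in top_k(example_logits, example_labels):
    print("{}: {:.3f}".format(label, probability))
```

Softmax preserves the ordering of the logits, so the ranking is the same either way; what it adds is a set of values that can be compared across images.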
I must be getting old and eccentric. I have recently started working on Coursera's Scala Specialization. It's all great, but my first realization was that the tools they use are not going to work for me. The problem lies mainly with sbt, their build tool. It fetches and installs in your system God knows what from God knows where to bootstrap itself. I don't trust this kind of behavior in the slightest. I understand that automatic pulling of dependencies and auto-update may save work, but they are also dangerous. I don't even want to mention that they contribute to the general bloat and sluggishness that plague the Java world. You don't need to know what depends on what, so everything uses everything, without a single thought.
Having said all that, I do trust and use QuickLisp. So, go figure.
I would still like to take the class, but I would like to do it using Emacs and the command line. Here's what I did.
You'll need the following packages:
==> sudo apt-get install default-jdk scala ant junit4 scala-mode-el
The assignments they give seem to have a stub for implementation and a bunch of scalatest test suites. I will also want to write some other code to play with things. This is the directory structure I used:
==> find -type f
./01-hello-world/build.xml
./01-hello-world/src/HelloWorld.scala
./02-square-root-newton/build.xml
./02-square-root-newton/src/SquareRoot.scala
./a00-lists/build.xml
./a00-lists/src/example/Lists.scala
./a00-lists/test/example/ListsSuite.scala
./a01-pascal-balance-countchange/build.xml
./a01-pascal-balance-countchange/src/recfun/Main.scala
./a01-pascal-balance-countchange/test/recfun/BalanceSuite.scala
./a01-pascal-balance-countchange/test/recfun/CountChangeSuite.scala
./a01-pascal-balance-countchange/test/recfun/PascalSuite.scala
./build.properties
./build-targets.xml
./lib/get-libs.sh
You can get all this here and will need to run the
get-libs.sh script to
fetch the two scalatest jar files before you can do anything else.
I wrote an ant build template that sets up the compile and test targets. All
that the build.xml files in the subdirectories need to do is define some
properties and import the template:
<project default="compile">
  <property name="jar.name" value="recfun.jar" />
  <property name="jar.class" value="recfun.Main" />
  <property name="tests-wildcard" value="recfun" />
  <import file="../build-targets.xml" />
</project>
- jar.name is the name of the resulting jar file
- jar.class is the default class that should run
- tests-wildcard is the name of the package containing the test suites (they are discovered automatically)
You can then run the thing (some output has been omitted for clarity):
]==> ant compile
init:
[mkdir] Created dir: ./build/classes
[mkdir] Created dir: ./build-test
compile:
[scalac] Compiling 1 source file to ./build/classes
[jar] Building jar: ./build/recfun.jar
]==> ant test
compile-test:
[scalac] Compiling 3 source files to ./build-test
test:
[scalatest] Discovery starting.
[scalatest] Discovery completed in 127 milliseconds.
[scalatest] Run starting. Expected test count is: 11
[scalatest] BalanceSuite:
[scalatest] - balance: '(if (zero? x) max (/ 1 x))' is balanced
[scalatest] - balance: 'I told him ...' is balanced
[scalatest] - balance: ':-)' is unbalanced
[scalatest] - balance: counting is not enough
[scalatest] PascalSuite:
[scalatest] - pascal: col=0,row=2
[scalatest] - pascal: col=1,row=2
[scalatest] - pascal: col=1,row=3
[scalatest] CountChangeSuite:
[scalatest] - countChange: example given in instructions
[scalatest] - countChange: sorted CHF
[scalatest] - countChange: no pennies
[scalatest] - countChange: unsorted CHF
[scalatest] Run completed in 246 milliseconds.
[scalatest] Total number of tests run: 11
[scalatest] Suites: completed 4, aborted 0
[scalatest] Tests: succeeded 11, failed 0, canceled 0, ignored 0, pending 0
[scalatest] All tests passed.
Submitting the assignments
I could probably write some code to do the assignment submission in Python or
using ant, but I am too lazy for that. I will use the container handling script
that I wrote for another occasion and install sbt in there. The docker/devel
subdirectory contains a Dockerfile for an image that has sbt installed in it.
==> git clone https://github.com/ljanyst/jail.git
==> cd jail/docker/devel
==> docker build -t jail:v01-dev .
This is the configuration script:
CONT_HOSTNAME=jail-scala
CONT_HOME=$HOME/Contained/jail-scala/home
CONT_NAME=jail:v01-dev
CONT_RESOLUTION=1680x1050