Recent Content


I recently stumbled upon two articles (1, 2) about running TensorFlow on CPU setups. Out of curiosity, I decided to check how the kinds of models I use behave in such situations. As you will see below, the results were somewhat unexpected. I did not put in the time to investigate what went wrong, and my attempts to reason about the performance problems are pure speculation. Instead, I just ran my models with a bunch of different threading and OpenMP settings that people typically recommend on the Internet and hoped to have a drop-in alternative to my GPU setup. In particular, I did not convert my models to use the NCHW format as recommended by the Intel article. This data format conversion seems to be particularly important, and people report performance doubling in some cases. However, since my largest test case uses transfer learning, applying the conversion is a pain. If you happen to know how to optimize the settings better without major tweaking of the models, please do drop me a line.
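For readers unfamiliar with the two layouts, here is a minimal NumPy sketch of what the NHWC to NCHW conversion amounts to for a single batch of images. The batch shape is made up for illustration, and converting a real model also involves its kernels and ops, which this sketch does not cover.

```python
import numpy as np

# A hypothetical batch: 2 images, 32x32 pixels, 3 channels (NHWC layout).
batch_nhwc = np.zeros((2, 32, 32, 3), dtype=np.float32)

# Move the channel axis in front of the spatial axes (NCHW layout).
batch_nchw = np.transpose(batch_nhwc, (0, 3, 1, 2))

print(batch_nchw.shape)  # (2, 3, 32, 32)
```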

Testing boxes

  • ti: My workstation
    • GPU: GeForce GTX 1080 Ti (11GB, Pascal)
    • CPU: 8 OS CPUs (Core i7-7700K, 1 packages x 4 cores/pkg x 2 threads/core (4 total cores))
    • RAM: 32GB (test data loaded from an SSD)
  • p2: An Amazon p2.xlarge instance
    • GPU: Tesla K80 (12GB, Kepler)
    • CPU: 4 OS CPUs (Xeon E5-2686 v4)
    • RAM: 60GB (test data loaded from a ramdisk)
  • m4: An Amazon m4.16xlarge instance
    • CPU: 64 OS CPUs (Xeon E5-2686 v4, 2 packages x 16 cores/pkg x 2 threads/core (32 total cores))
    • RAM: 256GB (test data loaded from a ramdisk)

TensorFlow settings

The GPU flavor was compiled with CUDA support; the CPU version was configured with only the default settings; the MKL flavor uses the MKL-ML library that TensorFlow's configuration script downloads automatically.

The GPU and the CPU setups run with the default session settings. The other configurations change the threading and OpenMP settings on a case-by-case basis. I use the following annotations when talking about the tests:

  • [xC,yT] means the KMP_HW_SUBSET envvar setting and the interop and intraop thread numbers set to 1.
  • [affinity] means the KMP_AFFINITY envvar set to granularity=fine,verbose,compact,1,0 and the interop thread number set to 2.
  • [intraop=x, interop=y] means the TensorFlow threading setting and no OpenMP setting.
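As a rough sketch of how such a configuration could be applied from Python: the helper name openmp_env and the lowercase 4c,2t value are my assumptions, not something taken from the test scripts. The environment variables are set before TensorFlow is imported so that the OpenMP runtime can pick them up when it initializes.

```python
import os

def openmp_env(cores=None, threads_per_core=None, affinity=None):
    """Build the OpenMP environment for one test configuration (hypothetical helper)."""
    env = {}
    if cores is not None and threads_per_core is not None:
        env['KMP_HW_SUBSET'] = '{}c,{}t'.format(cores, threads_per_core)
    if affinity is not None:
        env['KMP_AFFINITY'] = affinity
    return env

# The [4C,2T] annotation: restrict OpenMP to 4 cores with 2 threads each...
env = openmp_env(cores=4, threads_per_core=2)
os.environ.update(env)

# ...and set both TensorFlow thread pools to 1.
threads = {'intra_op_parallelism_threads': 1,
           'inter_op_parallelism_threads': 1}

# With TensorFlow 1.x installed, the session would then be created like this:
#   import tensorflow as tf
#   sess = tf.Session(config=tf.ConfigProto(**threads))
print(env, threads)
```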

More information on controlling thread affinity is here, and this is an article on managing thread allocation.

Tests and Results

The test results are the times it took to train one epoch normalized to the result obtained using the ti-gpu configuration - if some score is around 20, it means that this setting is 20 times slower than the baseline.
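The normalization is a plain division by the baseline time. A small sketch with made-up epoch times (the numbers and the configuration name some-cpu-config are for illustration only, they are not measurements):

```python
def normalize(epoch_times, baseline='ti-gpu'):
    """Express every epoch time as a multiple of the baseline's epoch time."""
    base = epoch_times[baseline]
    return {name: t / base for name, t in epoch_times.items()}

# Made-up epoch times in seconds.
times = {'ti-gpu': 10.0, 'some-cpu-config': 200.0}
scores = normalize(times)
print(scores['some-cpu-config'])  # 20.0, i.e. 20 times slower than the baseline
```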

LeNet - CIFAR10

The first test uses the LeNet architecture on the CIFAR-10 data. The MKL setup ran with [4C,2T] on ti and [affinity] on m4. The results are pretty surprising because the model consists almost exclusively of the operations that Intel claims to have optimized. The fact that ti ran faster than m4 might suggest that there is some synchronization issue in the graph handling algorithms preventing it from processing a bunch of tiny images efficiently.

Road Sign Classifier

The second test is my road sign classifier. It uses mainly 2D convolutions and pooling, but they are interleaved with hyperbolic tangents as activations as well as dropout layers. This fact probably prevents the graph optimizer from grouping the MKL nodes together, resulting in frequent data format conversions between NHWC and Intel's SIMD-friendly format. Also, ti scored better than m4 for the MKL version but not for the plain CPU implementation. This would suggest inefficiencies in the OpenMP implementation of threading.

Image Segmentation - KITTI (2 classes)

The third and the fourth tests run a fully convolutional neural network based on VGG16 for an image segmentation project. Apart from the usual suspects, this model uses transposed convolutions to handle learnable upsampling. The tests differ in the size of the input images and in the sizes of the weight matrices handled by the transposed convolutions. For the KITTI dataset, the ti-mkl config ran with [intraop=6, interop=6] and m4-mkl with [affinity].

Image Segmentation - Cityscapes (29 classes)

For the Cityscapes dataset, ti-mkl ran with [intraop=6, interop=6] and m4-mkl ran with [intraop=44, interop=6]. Here the MKL config was as fast as the baseline CPU configs for the dataset with fewer classes and thus smaller upsampling layers. The slowdown for the dataset with more classes could probably be explained by the difference in the handling of the transposed convolution nodes.


It was an interesting experience that aroused mixed feelings. On the one hand, with only the compiler optimizations, the best baseline CPU implementation was at worst two to four times slower than the Amazon P2 instance. It's a much better outcome than I had expected. On the other hand, the MKL support was a disappointment. To be fair, in large part it's probably because of my refusal to spend enough time tweaking the parameters, but hey, it was supposed to be a drop-in replacement, and I don't need to do any of this when using a GPU. Another reason is that TensorFlow probably has too few MKL-based kernels to be worth using in this mode, and the frequent data format conversions kill the performance. I have also noticed the MKL setup not making any progress with some threading configurations despite all the cores being busy. I might have hit the Intel Hyperthreading bug.

Notes on building TensorFlow

The GPU versions were compiled with GCC 5.4.0, CUDA 8.0 and cuDNN 6. The ti configuration used CUDA capability 6.1, whereas the p2 configuration used 3.7. The compiler flags were:

  • ti: -march=core-avx-i -mavx2 -mfma -O3
  • p2: -march=broadwell -O3

The CPU versions were compiled with GCC 7.1.0 with the following flags:

  • ti: -march=skylake -O3
  • m4: -march=broadwell -O3

I tried compiling the MKL version with the additional -DEIGEN_USE_MKL_VML flag but got worse results.

The MKL library is poorly integrated with TensorFlow's build system. For some strange reason, the configuration script creates a link to inside the build tree which results in the library being copied to the final wheel package. Doing so is a horrible idea because in glibc mostly provides an interface for's private API so a system update may break the TensorFlow installation. Furthermore, the way in which it figures out which library to link against is broken. The configuration script uses the locate utility to find all files named and picks the first one from the list. Now, locate is not installed on Ubuntu or Debian by default, so if you did not do:

]==> sudo apt-get install locate
]==> sudo updatedb

at some point in the past, the script will be killed without an error message leaving the source tree unconfigured. Moreover, the first pick is usually a wrong one. If you run a 64-bit version of Ubuntu with multilib support, the script will end up choosing a 32-bit version of the library. I happen to hack glibc from time to time, so in my case, it ended up picking one that was cross-compiled for a 64-bit ARM system.

I have also tried compiling Eigen with full MKL support as suggested in this thread. However, Eigen's and MKL's BLAS interfaces seem to be out of sync. I attempted to fix the situation but gave up when I noticed Eigen passing floats to MKL functions expecting complex numbers using incompatible data types. I will continue using the GPU setup, so fixing all that and doing proper testing was way more effort than I was willing to make.

Note 14.07.2017: My OCD took the upper hand again and I figured it out. Unfortunately, it did not improve the numbers at all.

What is it about?

Semantic segmentation is a process of dividing an image into sets of pixels sharing similar properties and assigning to each of these sets one of the pre-defined labels. Ideally, you would like to get a picture such as the one below. It's a result of blending color-coded class labels with the original image. This sample comes from the CityScapes dataset.

Segmented Image

Segmentation Classes

How is it done?

Figuring out object boundaries in an image is hard. There's a variety of "classical" approaches taking into account colors and gradients that obtained encouraging results, see this paper by Shi and Malik for example. However, in 2015 and 2016, Long, Shelhamer, and Darrell presented a method using Fully Convolutional Networks that significantly improved the accuracy (the mean intersection over union metric) and the inference speed. My goal was to replicate their architecture and use it to segment road scenes.

A fully convolutional network differs from a regular convolutional network in the fact that it has the final fully-connected classifier stripped off. Its goal is to take an image as an input and produce an equally-sized output in which each pixel is represented by a softmax distribution describing the probability of this pixel belonging to a given class. I took this picture from one of the papers mentioned above:

Fully Convolutional Network

For the results presented in this post, I used the pre-trained VGG16 network provided by Udacity for the beta test of their Advanced Deep Learning Capstone. I took layers 3, 4, and 7 and combined them in the manner described in the picture below, which, again, is taken from one of the papers by Long et al.

Upscaling and merging

First, I used 1x1 convolutions on top of each extracted layer to act as local classifiers. After doing that, these partial results are still 32, 16, and 8 times smaller than the input image, so I needed to upsample them (see below). Finally, I used a weighted addition to obtain the result. The authors of the original paper report that without weighting the learning process diverges.

Learnable Upsampling

Upsampling is done by applying a process called transposed convolution. I will not describe it here because this post over at does a great job of doing that. I will just say that transposed convolutions (just like the regular ones) use learnable weights to produce output. The trick here is the initialization of those weights. You don't use the truncated normal distribution, but you initialize the weights in such a way that the convolution operation performs a bilinear interpolation. It's easy and interesting to test whether the implementation works correctly. When fed an image, it should produce the same image but n times larger.

import sys
import cv2
import numpy as np
import tensorflow as tf

img = cv2.imread(sys.argv[1])
print('Original size:', img.shape)

# Wrap the image in a batch of one
imgs = np.zeros([1, *img.shape], dtype=np.float32)
imgs[0, :, :, :] = img

img_input = tf.placeholder(tf.float32, [None, *img.shape])
upscale = upsample(img_input, 3, 8, 'upscaled')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    upscaled = sess.run(upscale, feed_dict={img_input: imgs})

print('Upscaled:', upscaled.shape[1:])
cv2.imwrite(sys.argv[2], upscaled[0, :, :, :])

Where upsample is defined here.
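To give an idea of what the bilinear initialization mentioned above could look like, here is a NumPy sketch of the commonly used bilinear kernel. It is not necessarily what my upsample function does; the helper name bilinear_weights is made up for this example, and the weight layout [height, width, output channels, input channels] is the one expected by tf.nn.conv2d_transpose.

```python
import numpy as np

def bilinear_weights(factor, channels):
    """Weights that make a transposed convolution upsample by `factor`."""
    size = 2 * factor - factor % 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    rows, cols = np.ogrid[:size, :size]
    kernel = ((1 - abs(rows - center) / factor) *
              (1 - abs(cols - center) / factor))

    # Every channel is interpolated independently, so only the entries
    # connecting a channel to itself are non-zero.
    weights = np.zeros((size, size, channels, channels), dtype=np.float32)
    for c in range(channels):
        weights[:, :, c, c] = kernel
    return weights

w = bilinear_weights(factor=8, channels=3)
print(w.shape)  # (16, 16, 3, 3)
```

Such an array could be passed as the initial value of the filter variable in place of a truncated normal sample.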


I was mainly interested in road scenes, so I played with the KITTI Road and CityScapes datasets. The first one has 289 training images with two labels (road/not road) and 290 testing samples. The second one has 2975 training, 500 validation, and 1525 testing pictures taken while driving around large German cities. It has fine-grained annotations for 29 classes (including "unlabeled" and "dynamic"). The annotations are color-based and look like the picture below.

Picture Labels

Even though I concentrated on those two datasets, both the training and the inference software is generic and can handle any pixel-labeled dataset. All you need to do is to create a new file defining your custom samples. The definition is a class that contains seven attributes:

  • image_size - self-evident, both horizontal and vertical dimensions need to be divisible by 32
  • num_classes - number of classes that the model is supposed to handle
  • label_colors - a dictionary mapping a class number to a color; used for blending of the classification results with input image
  • num_training - number of training samples
  • num_validation - number of validation samples
  • train_generator - a generator producing training batches
  • valid_generator - a generator producing validation batches

See or for a concrete example. The training script picks the source based on the value of the --data-source parameter.
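For orientation, a toy data source with the seven attributes could look roughly like the sketch below. Everything beyond the attribute names is an assumption made for this example: the class name, the (width, height) order of image_size, the generators taking a batch size, and the batches being (images, labels) pairs of NumPy arrays.

```python
import numpy as np

class ToyDataSource:
    """A hypothetical data source that produces blank images."""

    def __init__(self):
        self.image_size = (64, 32)  # (width, height), both divisible by 32
        self.num_classes = 2
        self.label_colors = {0: (0, 0, 0), 1: (255, 0, 255)}
        self.num_training = 8
        self.num_validation = 4
        self.train_generator = self._make_generator(self.num_training)
        self.valid_generator = self._make_generator(self.num_validation)

    def _make_generator(self, num_samples):
        def generator(batch_size):
            width, height = self.image_size
            for offset in range(0, num_samples, batch_size):
                n = min(batch_size, num_samples - offset)
                images = np.zeros((n, height, width, 3), dtype=np.float32)
                labels = np.zeros((n, height, width, self.num_classes),
                                  dtype=np.float32)
                labels[..., 0] = 1  # one-hot: every pixel belongs to class 0
                yield images, labels
        return generator

source = ToyDataSource()
for images, labels in source.train_generator(3):
    print(images.shape, labels.shape)
```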


Typically, you would normalize the input dataset such that its mean is at zero and its standard deviation is at one. It significantly improves convergence of the gradient optimization. In the case of the VGG model, the authors just zeroed the mean without scaling the variance (see section 2.1 of the paper). Assuming that the model was trained on the ImageNet dataset, the mean values for each channel are muR = 123.68, muG = 116.779, muB = 103.939. The pre-trained model provided by Udacity already has a pre-processing layer handling these constants. Judging from the way it does it, it expects plain BGR scaled between 0 and 255 as input.
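The Udacity model does this step internally, so the following NumPy sketch is only meant to illustrate the operation. It subtracts the channel means quoted above from a batch of BGR images; the function name zero_center is made up for this example.

```python
import numpy as np

# The per-channel means quoted above, reordered to BGR.
MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def zero_center(images_bgr):
    """Subtract the channel means from BGR images with values in 0..255."""
    return images_bgr.astype(np.float32) - MEAN_BGR

# A made-up 2x2 gray image in a batch of one.
batch = np.full((1, 2, 2, 3), 128, dtype=np.uint8)
print(zero_center(batch)[0, 0, 0])
```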

Label Validation

Since the network outputs softmaxed logits for each pixel, the training labels need to be one-hot encoded. According to the TensorFlow documentation, each row of labels needs to be a proper probability distribution. Otherwise, the gradient calculation will be incorrect and the whole model will diverge. So, you need to make sure that you're never in a situation where you have all zeros or multiple ones in your label vector. I have made this mistake so many times that I decided to write a checker script for my data source modules. It produces examples of training images blended with their pixel labels to check if the color maps have been defined correctly. It also checks every pixel in every sample to see if the label rows are indeed valid. See here for the source.
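The per-pixel part of such a check can be sketched in a few lines of NumPy. This is not the checker script itself, just an illustration of the idea; the function name and the [height, width, num_classes] label layout are assumptions.

```python
import numpy as np

def invalid_label_pixels(labels):
    """Return the (row, column) indices of pixels whose label vector is not
    a valid one-hot row. `labels` has shape [height, width, num_classes]."""
    is_binary = np.all((labels == 0) | (labels == 1), axis=-1)
    sums_to_one = labels.sum(axis=-1) == 1
    return np.argwhere(~(is_binary & sums_to_one))

labels = np.zeros((2, 2, 3), dtype=np.float32)
labels[..., 0] = 1         # every pixel belongs to class 0
labels[0, 1] = [0, 0, 0]   # all zeros: invalid
labels[1, 0] = [1, 1, 0]   # multiple ones: invalid
print(invalid_label_pixels(labels).tolist())  # [[0, 1], [1, 0]]
```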

Initialization of variables

Initialization of variables is a bit of a pain in TensorFlow. You can use the global initializer if you create and train your model from scratch. However, in the case when you want to do transfer learning - load a pre-trained model and extend it - there seems to be no convenient way to initialize only the variables that you created. I ended up doing acrobatics like this:

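uninit_vars    = []
uninit_tensors = []
for var in tf.global_variables():
    uninit_vars.append(var)
    uninit_tensors.append(tf.is_variable_initialized(var))
# Evaluate all the "is initialized" flags in one go
uninit_bools = sess.run(uninit_tensors)
uninit = zip(uninit_bools, uninit_vars)
uninit = [var for init, var in uninit if not init]
# Initialize only the variables that are still uninitialized
sess.run(tf.variables_initializer(uninit))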


For training purposes, I reshaped both labels and logits in such a way that I ended up with 2D tensors for both. I then used tf.nn.softmax_cross_entropy_with_logits as a measure of loss and used AdamOptimizer with a learning rate of 0.0001 to minimize it. The model trained on the KITTI dataset for 500 epochs - 14 seconds per epoch on my GeForce GTX 1080 Ti. The CityScapes dataset took 150 epochs to train - 9.5 minutes per epoch on my GeForce vs. 25 minutes per epoch on an AWS P2 instance. The model exhibited some overfitting. However, the visual results seemed tighter the more it trained. In the picture below the top row contains the ground truth, the bottom one contains the inference results (TensorBoard rocks! :).
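The reshaping and the loss can be illustrated without TensorFlow. The NumPy sketch below flattens 4D logits and labels to 2D and computes the softmax cross entropy; averaging over the pixels is my assumption for this example, and the function name is made up.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross entropy over all pixels; both inputs have the shape
    [batch, height, width, num_classes]."""
    num_classes = logits.shape[-1]
    logits_2d = logits.reshape(-1, num_classes)  # one row per pixel
    labels_2d = labels.reshape(-1, num_classes)
    # Numerically stable log-softmax
    shifted = logits_2d - logits_2d.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels_2d * log_softmax).sum(axis=1).mean()

logits = np.zeros((1, 2, 2, 2))   # an undecided model with two classes
labels = np.zeros((1, 2, 2, 2))
labels[..., 0] = 1
print(softmax_cross_entropy(logits, labels))  # ln(2), about 0.693
```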

CityScapes Validation Examples

CityScapes Validation Loss

CityScapes Training Loss


The inference (including image processing) takes 80 milliseconds per image on average for CityScapes and 27 milliseconds for KITTI. Here are some examples from both datasets. The model seems to be able to distinguish a pedestrian from a bike rider with some degree of accuracy, which is pretty impressive!

CityScapes Example #1

CityScapes Example #2

KITTI Example #1

KITTI Example #2

Go here for the full code.

A month ago or so, I wrote a post about installing TensorFlow 1.1.0 on Jetson TX1. This post is an update for 1.2.0 which has one additional issue on top of the ones discussed previously. The problem is that Eigen is missing some template specializations when used on ARM. The bug has been fixed, but you need to make the TensorFlow build use the fixed version.

diff --git a/tensorflow/workspace.bzl b/tensorflow/workspace.bzl
index 2a206b0ac..f44a17405 100644
--- a/tensorflow/workspace.bzl
+++ b/tensorflow/workspace.bzl
@@ -150,11 +150,10 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       name = "eigen_archive",
       urls = [
-          "",
-          "",
+          "",
-      sha256 = "ca7beac153d4059c02c8fc59816c82d54ea47fe58365e8aded4082ded0b820c4",
-      strip_prefix = "eigen-eigen-f3a22f35b044",
+      sha256 = "a34b208da6ec18fa8da963369e166e4a368612c14d956dd2f9d7072904675d9b",
+      strip_prefix = "eigen-eigen-d781c1de9834",
       build_file = str(Label("//third_party:eigen.BUILD")),

The build instructions are the same as for the previous versions, but you need to check out the v1.2.0-jetson-tx1 tag from my repository to get all the fixes.


I had expected a smooth ride with this one, but it turned out to be quite an adventure and not one of a pleasant kind. To be fair, the likely reason why it's such a horror story is that I was bootstrapping bazel - the build software that TensorFlow uses - on an unsupported system. I spent more time figuring out the dependency issues related to that than working on TensorFlow itself. This post was initially supposed to be a rant on the Java dependency hell. However, in the end, my stubbornness took the upper hand, and I did not go to sleep until it all worked, so you have a HOWTO instead.


Jetson TX1

You'll need the board itself and the following installed on it:

  • Jetpack 3.0
  • L4T 24.2.1
  • Cuda Toolkit 8.0.34-1
  • cuDNN 5.1.5-1+cuda8.0


A Java Development Kit

First of all, you'll need a Java compiler and related utilities. Just type:

]==> sudo apt-get install default-jdk

It would not have been worth a separate paragraph, except that the version that comes with the system messes up the CA certificates. You won't be able to download things from GitHub without overriding SSL warnings. I fixed that by installing ca-certificates and ca-certificates-java from Debian.

Protocol Buffers

You'll need the exact two versions mentioned below. No other versions work down the road. I learned about this fact the hard way. Be sure to call on the master branch first - it needs to download gmock, and the link in older tags points to the void.

]==> sudo apt-get install curl
]==> git clone
]==> cd protobuf
]==> ./

This version is needed for the gRPC Java codegen plugin.

]==> git checkout v3.0.0-beta-3
]==> ./
]==> ./configure --prefix=/home/ljanyst/Temp/protobuf-3.0.0-beta-3
]==> make -j12 && make install

This one is needed by Bazel itself.

]==> git checkout v3.0.0
]==> ./
]==> ./configure --prefix=/home/ljanyst/Temp/protobuf-3.0.0
]==> make -j12 && make install

gRPC Java

Building this one took me a horrendous amount of time. At first, I thought that the whole package was needed. Apart from problems with the protocol buffer versions, it has some JNI dependencies that are problematic to compile. Even after I had successfully produced these, they had interoperability issues with other dependencies. After some digging, it turned out that only one component of the package is actually required, so the whole effort was unnecessary. Of course, the source needed patching to make it build on aarch64, but I won't bore you with that. Again, make sure you use the v0.15.0-jetson-tx1 tag - no other tag will work.

]==> git clone
]==> cd grpc-java
]==> git checkout v0.15.0-jetson-tx1
]==> echo protoc=/home/ljanyst/Temp/protobuf-3.0.0-beta-3/bin/protoc >
]==> CXXFLAGS=-I/home/ljanyst/Temp/protobuf-3.0.0-beta-3/include \
     LDFLAGS=-L/home/ljanyst/Temp/protobuf-3.0.0-beta-3/lib \
     ./gradlew java_pluginExecutable


The latest available release of Bazel (0.4.5) does not build on aarch64 without the patch listed below. I took it from the master branch.

diff --git a/src/main/java/com/google/devtools/build/lib/util/ b/src/main/java/com/google/devtools/build/lib/util/
index 7a85c29..ff8bc86 100644
--- a/src/main/java/com/google/devtools/build/lib/util/
+++ b/src/main/java/com/google/devtools/build/lib/util/
@@ -25,7 +25,7 @@ public enum CPU {
   X86_32("x86_32", ImmutableSet.of("i386", "i486", "i586", "i686", "i786", "x86")),
   X86_64("x86_64", ImmutableSet.of("amd64", "x86_64", "x64")),
   PPC("ppc", ImmutableSet.of("ppc", "ppc64", "ppc64le")),
-  ARM("arm", ImmutableSet.of("arm", "armv7l")),
+  ARM("arm", ImmutableSet.of("arm", "armv7l", "aarch64")),
   S390X("s390x", ImmutableSet.of("s390x", "s390")),
   UNKNOWN("unknown", ImmutableSet.<String>of());

The compilation is straightforward, but make sure you point to the right version of protocol buffers and the gRPC Java compiler built earlier.

]==> git clone
]==> cd bazel
]==> git checkout 0.4.5
]==> export PROTOC=/home/ljanyst/Temp/protobuf-3.0.0/bin/protoc
]==> export GRPC_JAVA_PLUGIN=/home/ljanyst/Temp/grpc-java/compiler/build/exe/java_plugin/protoc-gen-grpc-java
]==> ./
]==> export PATH=/home/ljanyst/Temp/bazel/output:$PATH


  • Note 22.06.2017: Go here for TensorFlow 1.2.0. See the v1.2.0-jetson-tx1 tag.
  • Note 03.07.2017: For TensorFlow 1.2.1, see the v1.2.1-jetson-tx1 tag.
  • Note 18.08.2017: For TensorFlow 1.3.0, see the v1.3.0-jetson-tx1 tag. I have tested it against JetPack 3.1 which fixes the CUDA-related bugs. Note that the kernel in this version of JetPack has been compiled without swap support, so you may want to add --local_resources=2048,0.5,0.5 to the bazel commandline if you want to avoid the out-of-memory kills.


The version of CUDA toolkit for this device is somewhat handicapped. nvcc has problems with variadic templates and compiling some kernels using Eigen makes it crash. I found that adding:


to these problematic files makes the problem go away. A constructor with an initializer list seems to be an issue in one of the cases as well. Using the default constructor instead, and then initializing the array elements one by one makes things go through.

Also, the cuBLAS API seems to be incomplete. It only defines 5 GEMM algorithms (General Matrix to Matrix Multiplication) where the newer patch releases of the toolkit define 8. TensorFlow enumerates them by name to experimentally determine which one is best for a given computation and the code notes that they may fail under perfectly normal circumstances (i.e., a GPU older than sm_50). Therefore, simply omitting the missing algorithms should be perfectly safe.

diff --git a/tensorflow/stream_executor/cuda/ b/tensorflow/stream_executor/cuda/
index 2c650af..49c6db7 100644
--- a/tensorflow/stream_executor/cuda/
+++ b/tensorflow/stream_executor/cuda/
@@ -1912,8 +1912,7 @@ bool CUDABlas::GetBlasGemmAlgorithms(
 #if CUDA_VERSION >= 8000
   for (cublasGemmAlgo_t algo :

The full patches are here and here.

Memory Consumption

The compilation process may take considerable amounts of RAM - more than the device has available. The documentation advises using only one execution thread (--local_resources 2048,.5,1.0 param for Bazel), so that you don't get the OOM kills. It's unnecessary most of the time, though, because the memory fills up completely only during the last 20% of the compilation steps. Instead, I used an SD card as a swap device.

]==> sudo mkswap /dev/mmcblk1p2
]==> sudo swapon /dev/mmcblk1p2

At peak times, the entire RAM and around 7.5GB of swap were used. However, only at most 5 to 6 compilation threads were in the D state (uninterruptible sleep due to IO), with 2 to 3 being runnable.


You need to install these packages before you can proceed.

]==> sudo apt-get install python3-numpy python3-dev python3-pip
]==> sudo apt-get install python3-wheel python3-virtualenv

Then clone my repo containing the necessary patches and configure the source. I used the system version of Python 3, located at /usr/bin/python3 with its default library in /usr/lib/python3/dist-packages. The answers to the CUDA related questions are:

  • the version of the SDK is 8.0;
  • the version of cuDNN is 5.1.5, and it's located in /usr;
  • the CUDA compute capability for TX1 is 5.3.

Go ahead and run:

]==> git clone
]==> cd tensorflow
]==> git checkout v1.1.0-jetson-tx1
]==> ./configure

Finally, run the compilation, and, 2 hours and change after, build the wheel:

]==> bazel build --config=opt --config=cuda --curses=no --show_task_finish \
]==> bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
]==> cp /tmp/tensorflow_pkg/*.whl ../

I don't particularly like polluting system directories with custom-built binaries, so I use virtualenv to handle pip-installed Python packages.

]==> mkdir -p ~/Apps/virtualenvs/tensorflow
]==> cd ~/Apps/virtualenvs/tensorflow
]==> python3 /usr/lib/python3/dist-packages/ -p /usr/bin/python3 .
]==> . ./bin/activate
(tensorflow) ]==> pip install ~/Temp/tensorflow-1.1.0-cp35-cp35m-linux_aarch64.whl

The device is identified correctly when starting a new TensorFlow session. You should see the following if you don't count warnings about NUMA:

Found device 0 with properties:
name: NVIDIA Tegra X1
major: 5 minor: 3 memoryClockRate (GHz) 0.072
pciBusID 0000:00:00.0
Total memory: 3.90GiB
Free memory: 2.11GiB
DMA: 0
0:   Y
Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0)

The CPU and the GPU share the memory controller, so the GPU does not have the 4GB just for itself. On the upside you can use the CUDA unified memory model without penalties (no memory copies).


I ran two benchmarks to see if things work as expected. The first one was my TensorFlow implementation of LeNet training on and classifying the MNIST data. The training code ran twice as fast on the TX1 as on my 4th generation Carbon X1 laptop. The second test was my slightly enlarged implementation of Sermanet applied to classifying road signs. The convolution part of the training process took roughly 20 minutes per epoch, which is a factor of two improvement over the performance of my laptop. The pipeline was implemented with a large device in mind, though, and expected 16GB of RAM. The TX1 has only 4GB, so the swap speed was a bottleneck here. Based on my observations of the processing speed of individual batches, I can speculate that a further improvement of a factor of two is possible with a properly optimized pipeline.

The algorithm

I took courses on probability, statistics, and the Monte Carlo methods while I was at school, but the ubiquity of the algorithms based on randomness still amazes me. Imagine that you are a robot moving in a 2D space. You have a map of the area, and you know your rough initial position in it. Every step you take, you get data from the controls (speed and angular speed) that is quite noisy. You can also sense obstacles within a certain range and with a certain precision with respect to your current position and heading. How do you figure out where you are?

One good strategy for solving this problem is using a particle filter. You start by creating N "particles," or guesses as to where you are, by drawing the x- and y-position as well as the heading from the Gaussian distribution around your initial estimate. Then, every step you take, for every particle, you:

  • move according to the data you got from controls taking the noise into account;
  • match the sensor data to the landmarks on the map using the perspective of the particle;
  • assign weight to the particle based on how well the observation matches the map.

Finally, you draw with replacement N particles from the initial set with the probability proportional to the weights. The particles that match the observation well will likely be drawn multiple times and those that don't are unlikely to be drawn at all. The repetitions are not a problem because the movement is noisy, so they will diverge after the next step you take. The particle with the highest weight is your best estimate of your actual position and heading.
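The steps above can be sketched in NumPy as follows. This is a simplified illustration, not my project code: the function names, the nearest-landmark association, the isotropic Gaussian sensor model, and all the numbers in the demo are assumptions made for this example.

```python
import numpy as np

def init_particles(n, pose, std, rng):
    """Draw n guesses of (x, y, heading) around the initial estimate."""
    return rng.normal(pose, std, size=(n, 3))

def move(particles, speed, yaw_rate, dt, noise_std, rng):
    """Apply the control data to every particle and add the motion noise."""
    x, y, theta = particles.T
    if abs(yaw_rate) < 1e-6:  # driving straight
        new_theta = theta
        new_x = x + speed * dt * np.cos(theta)
        new_y = y + speed * dt * np.sin(theta)
    else:                     # driving along an arc
        new_theta = theta + yaw_rate * dt
        new_x = x + speed / yaw_rate * (np.sin(new_theta) - np.sin(theta))
        new_y = y + speed / yaw_rate * (np.cos(theta) - np.cos(new_theta))
    moved = np.stack([new_x, new_y, new_theta], axis=1)
    return moved + rng.normal(0.0, noise_std, size=moved.shape)

def weigh(particles, observations, landmarks, sensor_std):
    """Weight each particle by how well the observations match the map."""
    weights = np.ones(len(particles))
    for i, (x, y, theta) in enumerate(particles):
        c, s = np.cos(theta), np.sin(theta)
        # Observations are relative to the robot; express them in map
        # coordinates from the perspective of this particle.
        obs_x = x + c * observations[:, 0] - s * observations[:, 1]
        obs_y = y + s * observations[:, 0] + c * observations[:, 1]
        for ox, oy in zip(obs_x, obs_y):
            # Squared distance to the nearest landmark on the map
            d2 = np.min((landmarks[:, 0] - ox) ** 2 +
                        (landmarks[:, 1] - oy) ** 2)
            weights[i] *= np.exp(-d2 / (2 * sensor_std ** 2))
    return weights

def resample(particles, weights, rng):
    """Draw with replacement, with probability proportional to the weights.
    Assumes that at least one weight is non-zero."""
    idx = rng.choice(len(particles), size=len(particles),
                     p=weights / weights.sum())
    return particles[idx]

rng = np.random.default_rng(0)
landmarks = np.array([[5.0, 5.0], [10.0, 0.0], [0.0, 10.0]])
observations = landmarks.copy()  # noise-free view of a robot at (0, 0, 0)

particles = init_particles(1000, [0.0, 0.0, 0.0], [0.3, 0.3, 0.05], rng)
weights = weigh(particles, observations, landmarks, sensor_std=0.3)
best = particles[np.argmax(weights)]
particles = resample(particles, weights, rng)
print('Best estimate:', best)
```

In a real loop, move would be called with the control data before every weigh and resample round.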

The result

I did an experiment on the Udacity data, and the approach using 1000 particles turned out to work well compared to using just one. The average deviation from the ground truth was around 10cm. Using one particle effectively ignores the observation data and relies only on the controls. You can follow the blue diamond in the video below to see how fast the effects of the noise accumulate. Both cases use the same noise values.

Particle filter localization - 1000 particles vs 1

The project

I try to avoid publishing my code solving homework assignments, but this Udacity SDC project is generic enough to be useful in a wider context. So here you have it. The task was to fuse together radar and lidar measurements using two kinds of Kalman Filters to estimate the trajectory of a moving bicycle. The unscented filter uses the CTRV model tracking the position, speed, yaw, and yaw rate, whereas the extended filter uses the constant velocity model.

The Unscented Filter result

Both algorithms performed well, with the CTRV model predicting the velocity significantly better. The values below are RMSEs of the prediction against the ground truth. The first two values represent the position, the last two - the velocity.

Extended filter:

]==>  ./ExtendedKF  ../data/obj_pose-laser-radar-synthetic-input.txt ../src/ekf.txt
Accuracy - RMSE:

Unscented filter:

]==>  ./UnscentedKF  ../data/obj_pose-laser-radar-synthetic-input.txt ../src/ukf.txt
Accuracy - RMSE:
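For reference, the RMSE of each state component can be computed as in the NumPy sketch below. The numbers are made up for illustration and are not the results of either filter.

```python
import numpy as np

def rmse(estimates, ground_truth):
    """Per-component RMSE; rows are time steps, columns are px, py, vx, vy."""
    diff = np.asarray(estimates) - np.asarray(ground_truth)
    return np.sqrt(np.mean(diff ** 2, axis=0))

# Made-up estimates and ground truth for two time steps.
est = [[1.0, 2.0, 0.5, 0.0], [2.0, 4.0, 0.5, 0.0]]
truth = [[1.0, 2.0, 0.0, 0.0], [2.0, 2.0, 0.0, 0.0]]
print(rmse(est, truth))
```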

The code

I wrote a handy library that does most of the math and provides various concrete implementations of Kalman predictors and updaters:

#include <cstdint>
#include <Eigen/Dense>

struct KalmanState;  // defined below

class KalmanPredictor {
  public:
    virtual ~KalmanPredictor() {}
    virtual void Predict(KalmanState &state, uint64_t dt) = 0;
};

class KalmanUpdater {
  public:
    virtual ~KalmanUpdater() {}
    virtual void Update(KalmanState           &state,
                        const Eigen::VectorXd &z) = 0;
};

The code you need to implement yourself depends on the sensor, the model, and the type of the filter you use. For example, for the CTRV model and a lidar measurement, you only need to specify the projection matrix and the sensor noise covariance:

class LidarUpdater: public LinearKalmanUpdater {
  public:
    LidarUpdater() {
      H_ = MatrixXd(2, 5);
      H_ << 1, 0, 0, 0, 0,
            0, 1, 0, 0, 0;

      R_ = MatrixXd(2, 2);
      R_ << 0.0225,      0,
            0,      0.0225;
    }
};
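To show what a linear updater does with these two matrices, here is a NumPy sketch of the textbook Kalman measurement update using the same H and R. It illustrates the math only; the LinearKalmanUpdater class in the library may be organized differently, and the function name linear_update is made up for this example.

```python
import numpy as np

# Projection matrix: the lidar measures only the first two state components.
H = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0, 0.0]])
# Sensor noise covariance
R = np.array([[0.0225, 0.0],
              [0.0, 0.0225]])

def linear_update(x, P, z):
    """Textbook linear Kalman update of the mean x and covariance P."""
    y = z - H @ x                    # innovation
    S = H @ P @ H.T + R              # innovation covariance
    S_inv = np.linalg.inv(S)
    K = P @ H.T @ S_inv              # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    nis = float(y @ S_inv @ y)       # Normalized Innovation Squared
    return x_new, P_new, nis

x, P, nis = linear_update(np.zeros(5), np.eye(5), np.array([1.0, 2.0]))
print(x[:2], nis)
```

With an identity covariance and such a small sensor noise, the updated position lands very close to the measurement.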

See here and here for more examples. The state travels around in an object of the KalmanState class:

struct KalmanState {
  KalmanState(int n) {
    x = Eigen::VectorXd(n);
    P = Eigen::MatrixXd(n, n);
    x.fill(0.0);
    P.fill(0.0);
  }
  Eigen::VectorXd x;              // mean
  Eigen::MatrixXd P;              // covariance
  Eigen::MatrixXd sigma_points;   // sigma points
  double          nis = 0;        // Normalized Innovation Squared
};

All this ends up with the measurement update code boiling down to this:

double dt = measurement.timestamp - previous_timestamp_;
previous_timestamp_ = measurement.timestamp;
predictor_->Predict(state_, dt);
updaters_[measurement.sensor_type]->Update(state_,;
return state_;

See the full code on GitHub.

Board bring-up

I started playing with the FRDM-K64F board recently. I want to use it as a base for a bunch of hobby projects. The start-up code is not that different from the one for Tiva, which I describe here - it's the same Cortex-M4 architecture after all. Two additional things need to be taken care of, though: flash security and the COP watchdog.

The K64F MCU restricts external access to a bunch of resources by default. It's a great feature if you want to ship a product, but it makes debugging impossible. The Flash Configuration Field (see section 29.3.1 of the datasheet) defines the default security and boot settings.

static const struct {
  uint8_t backdoor_key[8];  // backdoor key
  uint8_t fprot[4];         // program flash protection (FPROT{0-3})
  uint8_t fsec;             // flash security (FSEC)
  uint8_t fopt;             // flash nonvolatile option (FOPT)
  uint8_t feprot;           // EEPROM protection (FEPROT)
  uint8_t fdprot;           // data flash protection (FDPROT)
} fcf  __attribute__ ((section (".fcf"))) = {
  {0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00},
  {0xff, 0xff, 0xff, 0xff}, // disable flash program protection
  0x02,                     // disable flash security
  0x01,                     // disable low-power boot (section 6.3.3)
  0x00,
  0x00
};

If flash protection (the fprot field) is not disabled, you won't be able to flash new code by copying it to the MBED partition and will have to run mass erase from OpenOCD every time:

interface cmsis-dap
set CHIPNAME k60
source [find target/kx.cfg]
kinetis mdm mass_erase

If the MCU is in the secured state (the fsec field), the debugger will have no access to memory.
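As a quick sanity check of the layout, the field can be reproduced on the host. The sketch below is a Python illustration, not part of the firmware; the function names are made up. It packs the same values in the same order as the C struct and decodes the security bits, assuming (as I read the FSEC register description) that the two low bits hold the SEC field and that 0b10 is its only unsecured encoding.

```python
import struct

def build_fcf(backdoor_key=bytes(8), fprot=b"\xff" * 4,
              fsec=0x02, fopt=0x01, feprot=0x00, fdprot=0x00):
    """Pack the Flash Configuration Field in the same order as the C struct."""
    return struct.pack("<8s4sBBBB",
                       backdoor_key, fprot, fsec, fopt, feprot, fdprot)

def is_unsecured(fsec):
    """Assumption: SEC is FSEC[1:0] and 0b10 means 'unsecured'."""
    return (fsec & 0b11) == 0b10

fcf = build_fcf()
print(len(fcf), fcf.hex())
print(is_unsecured(0x02), is_unsecured(0xFF))
```

The packed field is 16 bytes long, and a value of 0xff in FSEC (what you get from erased flash) decodes as secured, which is why the field has to be programmed explicitly.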

The structure listed above needs to end up in flash just after the interrupt vector. I use the linker script to make sure it happens. I define the appropriate memory block:

FLASH-FCF  (rx)  : ORIGIN = 0x00000400, LENGTH = 0x00000010

And then put the .fcf section in it:

.fcf :

See here.

I also disable the COP (computer operating properly) watchdog, which resets the MCU if it is not serviced often enough.

WDOG_UNLOCK = 0xc520;        // unlock magic #1
WDOG_UNLOCK = 0xd928;        // unlock magic #2
for(int i = 0; i < 2; ++i);  // delay a couple of cycles
WDOG_STCTRLH &= ~0x0001;     // disable the watchdog

You can get the template code on GitHub.

The challenge

Here's another cool project I have done as a part of Udacity's self-driving car program. There were two problems to solve. The first one was to find the lane lines and compute some of their properties. The second one was to detect and draw bounding boxes around nearby vehicles. Here's the result I got:

Detecting lane lines and vehicles

Detecting lanes

The first thing I do after correcting for camera lens distortion is applying a combination of Sobel operators and color thresholding to get an image of edges. This operation makes lines more pronounced and therefore much easier to detect.


I then get a bird's-eye view of the scene by applying a perspective transform and produce a histogram of all the white pixels located in the bottom half of the image. The peaks in this histogram indicate the presence of mostly vertical lines, which is what we're looking for. I detect all these lines by using a sliding window search. I start at the bottom of the image and move towards the top, adjusting the horizontal position of each successive window to the average x coordinate of all the pixels contained in the previous one. Finally, I fit a parabola to all these pixels. Out of all the candidates detected this way, I select the pair that is closest to being parallel and is roughly in the place where lane lines would be expected.
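The search can be sketched in a few lines of plain Python. This is an illustration of the idea, not the project's code: the image is a list of rows of 0/1 values, the starting column is simply the tallest histogram bin rather than a proper peak finder, and the window count and margin are arbitrary values picked for the toy image.

```python
def column_histogram(image):
    """Count the white pixels in each column of the bottom half of the image."""
    bottom = image[len(image) // 2:]
    return [sum(col) for col in zip(*bottom)]

def sliding_window_search(image, start_x, n_windows=4, margin=1):
    """Collect line pixels window by window, from the bottom of the image up.

    Each window is re-centred on the average x coordinate of the pixels
    found in the previous one.
    """
    height, width = len(image), len(image[0])
    win_h = height // n_windows
    center = start_x
    pixels = []
    for w in range(n_windows):
        y_hi = height - w * win_h
        y_lo = y_hi - win_h
        x_lo = max(0, center - margin)
        x_hi = min(width, center + margin + 1)
        found = [(x, y)
                 for y in range(y_lo, y_hi)
                 for x in range(x_lo, x_hi)
                 if image[y][x]]
        pixels.extend(found)
        if found:
            center = round(sum(x for x, _ in found) / len(found))
    return pixels

# Toy 8x10 image: one line drifting to the right towards the top,
# plus a stray pixel that the search should ignore.
line_x = [5, 5, 4, 4, 3, 3, 2, 2]          # x of the line in rows 0..7
image = [[1 if x == lx else 0 for x in range(10)] for lx in line_x]
image[3][9] = 1

hist = column_histogram(image)
start = hist.index(max(hist))
pixels = sliding_window_search(image, start)
```

The collected (x, y) pairs are what the parabola is then fitted to, with x expressed as a function of y because the lines are close to vertical.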

The orange area in the picture below visualizes the histogram, and the red boxes with blue numbers in them indicate the positions of the peaks found by the find_peaks_cwt function from scipy.

Bird's eye view - histogram search

Once I have found the lanes in one video frame, locating them in the next one is much simpler because their position does not change much between frames. I just take all the pixels in the close vicinity of the previous detection and fit a new polynomial to them. The green area in the image below denotes the search range, and the blue lines are the newly fitted polynomials.

Bird's eye view - vicinity search

I then use the equations of the parabolas to calculate the curvature. The program that produced the video above uses cross-frame averaging to make the lines smoother and to vet new detections in successive video frames.
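For reference, this is the textbook radius-of-curvature formula for a parabola, written as a small Python function. It assumes the line is expressed as x = a*y^2 + b*y + c; the function name and the sample coefficients are made up. The result is in whatever units the fit was done in, so a fit in pixels needs to be rescaled before it means anything on the road.

```python
def curvature_radius(a, b, y):
    """Radius of curvature of x = a*y**2 + b*y + c at the given y."""
    return (1.0 + (2.0 * a * y + b) ** 2) ** 1.5 / abs(2.0 * a)

# A gently curving line evaluated at the vertex of the parabola.
radius = curvature_radius(a=0.001, b=0.0, y=0.0)
```

The constant term c does not appear because shifting the line sideways does not change how sharply it bends.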

Vehicle detection

I detect cars by dividing the image into a bunch of overlapping tiles of varying sizes and running each tile through a classifier to check if it contains a car or a fraction of a car. In this particular solution, I used a linear support vector machine (LinearSVC from sklearn). I also wrapped it in a CalibratedClassifierCV to get a measure of confidence. I rejected predictions of cars that were less than 85% certain. The classifier was trained on data harvested from the GTI, KITTI, and Udacity datasets, from which I collected around 25 times more background samples than cars to limit the occurrences of false-positive detections.
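A minimal sketch of the tiling and the confidence cut follows. The tile sizes and the 50% overlap are assumptions chosen for the example, not the values used in the project, and the probabilities are made-up stand-ins for the output of the calibrated classifier.

```python
def tiles(width, height, sizes=(64, 96, 128), overlap=0.5):
    """Yield (x, y, size) for square, overlapping tiles of several sizes."""
    for size in sizes:
        step = max(1, int(size * (1.0 - overlap)))
        for y in range(0, height - size + 1, step):
            for x in range(0, width - size + 1, step):
                yield x, y, size

def confident_hits(scored_tiles, threshold=0.85):
    """Keep only the tiles whose car probability reaches the threshold."""
    return [tile for tile, probability in scored_tiles
            if probability >= threshold]

# A 128x64 strip covered with 64-pixel tiles at 50% overlap.
strip = list(tiles(128, 64, sizes=(64,)))
# Made-up probabilities, one per tile.
hits = confident_hits(zip(strip, [0.95, 0.60, 0.85]))
```

Each tile would be resized to the classifier's input size before feature extraction, which is why tiles of different sizes can share one classifier.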

As far as image features are concerned, I use only Histograms of Oriented Gradients with parameters that are essentially the same as the ones presented in this paper dealing with detection of humans. I used OpenCV's HOGDescriptor to extract the HOGs. The reason for this is that it can compute the gradients taking into account all of the color channels. See here. This is a capability that other libraries typically lack, limiting you to some form of grayscale. The training set consists of roughly 2M images of 64 by 64 pixels.

Tiles containing cars

Since the samples the classifier trains on contain pictures of fractions of cars, the same car is usually detected multiple times in overlapping tiles. Also, the types of background differ quite a bit, and it's hard to find images of all the possible things that are not cars. Therefore, false positives are quite frequent. To combat these problems, I use heat maps that are averaged across five video frames. Every pixel that has fewer than three detections on average per frame is rejected as a false positive.
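The averaging can be sketched as follows, in plain Python with nested lists instead of image arrays. The class and its method names are made up for the example; only the five-frame window and the threshold of three come from the description above.

```python
from collections import deque

class HeatMap:
    """Per-pixel detection counts, averaged over the most recent frames."""

    def __init__(self, width, height, n_frames=5, min_avg=3.0):
        self.width, self.height = width, height
        self.frames = deque(maxlen=n_frames)   # old frames fall off the left
        self.min_avg = min_avg

    def add_frame(self, boxes):
        """Add one frame; boxes are (x0, y0, x1, y1), upper bounds exclusive."""
        heat = [[0] * self.width for _ in range(self.height)]
        for x0, y0, x1, y1 in boxes:
            for y in range(y0, y1):
                for x in range(x0, x1):
                    heat[y][x] += 1
        self.frames.append(heat)

    def mask(self):
        """True where the average number of detections per frame is high enough."""
        n = len(self.frames)
        if n == 0:
            return [[False] * self.width for _ in range(self.height)]
        return [[sum(f[y][x] for f in self.frames) / n >= self.min_avg
                 for x in range(self.width)]
                for y in range(self.height)]

# A car detected three times in every frame, plus a burst of false
# positives that shows up in the first frame only.
heat_map = HeatMap(width=4, height=2)
for i in range(5):
    boxes = [(0, 0, 2, 2)] * 3
    if i == 0:
        boxes = boxes + [(3, 0, 4, 1)] * 3
    heat_map.add_frame(boxes)
kept = heat_map.mask()
```

The one-frame burst averages out to 0.6 detections per frame and is dropped, while the persistent detection stays at exactly three and survives.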

Heat map

I then use OpenCV's connectedComponentsWithStats to find connected components and get centroids and bounding boxes for the detections. The centroids are used to track the objects across frames and smooth the bounding boxes by averaging them over the 12 previous frames. To further reject false positives, an object needs to be classified as a car in at least 6 out of 12 consecutive frames.
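The voting and smoothing for a single tracked object can be sketched like this. It is an illustration under simplifying assumptions, not the project's tracker: matching detections to tracks by centroid is left out, the class name is made up, and the box is averaged over the detections seen within the window.

```python
from collections import deque

class Track:
    """Detection history of one object over a sliding window of frames."""

    def __init__(self, window=12, min_hits=6):
        self.hits = deque(maxlen=window)    # True/False per frame
        self.boxes = deque(maxlen=window)   # boxes from frames with a hit
        self.min_hits = min_hits

    def update(self, box):
        """Record one frame; box is (x0, y0, x1, y1) or None if not detected."""
        self.hits.append(box is not None)
        if box is not None:
            self.boxes.append(box)

    def is_car(self):
        return sum(self.hits) >= self.min_hits

    def smoothed_box(self):
        """Average of the remembered boxes, or None if there are none."""
        if not self.boxes:
            return None
        n = len(self.boxes)
        return tuple(sum(b[i] for b in self.boxes) / n for i in range(4))

track = Track()
for _ in range(5):
    track.update((0, 0, 10, 10))
confirmed_after_5 = track.is_car()
track.update((12, 12, 22, 22))
confirmed_after_6 = track.is_car()
box = track.smoothed_box()
```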


The topic is pretty fascinating and the results I got could be significantly improved by:

  • employing smarter sliding window algorithms (e.g., ones with momentum) to better detect dashed lines that are substantially curved
  • finding better ways to do perspective transforms
  • using a better classifier for cars (a deep neural network perhaps)
  • using techniques like YOLO
  • using something smarter than connected components to distinguish overlapping detections of different vehicles - mean shift clustering comes to mind
  • making performance improvements here and there (use C++, parallelize video processing and so on)

I learned a lot of computer vision techniques and had plenty of fun doing this project. I also spent a lot of time reading the code of OpenCV. It has a lot of great tutorials, but its API documentation is lacking.

The project

A neural network learned how to drive a car by observing how I do it! :) I must say that it's one of the coolest projects that I have ever done. Udacity provided a simulator program where you had to drive a car for a while on two tracks to collect training data. Each sample consisted of a steering angle and images from three front-facing cameras.

The view from the cameras

Then, in the autonomous driving mode, you are given an image from the central camera and must send back an appropriate steering angle, such that the car does not go off-track.

An elegant solution to this problem was described in a paper by nVidia from April 2016. I managed to replicate it in the simulator. Not without issues, though. The key takeaways for me were:

  • The importance of making sure that the training data sample is balanced. That is, making sure that some categories of steering angles are not over-represented.
  • The importance of randomly jittering the input images. To quote another paper: "ConvNets architectures have built-in invariance to small translations, scaling and rotations. When a dataset does not naturally contain those deformations, adding them synthetically will yield more robust learning to potential deformations in the test set."
  • Not over-using dropout.
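One simple way to balance the sample, shown here as a sketch rather than as what I actually did, is to bin the samples by steering angle and cap the number of samples per bin. The number of bins, the cap, and the assumption that angles are normalized to [-1, 1] are all illustrative.

```python
import random
from collections import defaultdict

def balance_by_angle(samples, n_bins=21, max_per_bin=200, rng=random):
    """Cap the number of samples in each steering-angle bin.

    samples is a list of (angle, payload) pairs with the angle assumed
    to lie in [-1, 1]; over-full bins are randomly subsampled.
    """
    bins = defaultdict(list)
    for angle, payload in samples:
        index = min(n_bins - 1, int((angle + 1.0) / 2.0 * n_bins))
        bins[index].append((angle, payload))
    balanced = []
    for members in bins.values():
        if len(members) > max_per_bin:
            members = rng.sample(members, max_per_bin)
        balanced.extend(members)
    return balanced

# Mostly straight driving with a few turns, as is typical for a lap.
samples = ([(0.0, "straight")] * 1000 +
           [(0.5, "right")] * 50 +
           [(-0.5, "left")] * 50)
balanced = balance_by_angle(samples, max_per_bin=100, rng=random.Random(0))
```

Without the cap, going straight would make up about 90% of this toy data set; after it, straight driving and turning are equally represented.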

The model needed to train for 35 epochs. Each epoch consisted of 24 batches of 2048 images with on-the-fly jittering. It took 104 seconds to process one epoch on Amazon's p2.xlarge instance and 826 seconds to do the same thing on my laptop. What took an hour on a Tesla K80 GPU would have taken my laptop over 8 hours.
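The totals follow directly from the per-epoch numbers; here is the arithmetic spelled out, using only the figures quoted above:

```python
EPOCHS = 35
BATCHES_PER_EPOCH = 24
BATCH_SIZE = 2048
SECONDS_PER_EPOCH_P2 = 104
SECONDS_PER_EPOCH_LAPTOP = 826

images_per_epoch = BATCHES_PER_EPOCH * BATCH_SIZE        # 49152 images
p2_hours = EPOCHS * SECONDS_PER_EPOCH_P2 / 3600          # about 1.01 hours
laptop_hours = EPOCHS * SECONDS_PER_EPOCH_LAPTOP / 3600  # about 8.03 hours
speedup = SECONDS_PER_EPOCH_LAPTOP / SECONDS_PER_EPOCH_P2  # about 7.9x
```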


Below are some sample results. The driving is not very smooth, but I blame that on myself not being a good driving model ;) The second track is especially interesting, because it differs from the one that the network was trained on. Interestingly enough, a MacBook Air did not have enough juice to run both the simulator and the model, even though the model is fairly small. I ended up having to create an ssh tunnel to my Linux laptop.

Track #1

Track #2


Writing this blog became increasingly tedious over time. The reason for this was the slowness of the rendering tool I use - coleslaw. It seemed to work well for other people, though, so I decided to investigate what I was doing wrong. The problem came from the fact that the code coloring implementation (which I co-wrote) spawned a Python process every time it received a code block to handle. The coloring itself was fast. Starting and stopping Python every time was the cause of the issue. A solution for this malady is fairly simple. You keep the Python process running at all times and communicate with it via standard IO.

Surprisingly enough, I could not find an easy and convenient way to do it. The dominant paradigm of uiop:run-program seems to be spawn-process-close, and it does not allow for easy access to the actual streams. sb-ext:run-program does hand me the stream objects that I need, but it's not portable. While reading the code of uiop trying to figure out how to extract the stream objects from run-program, I accidentally discovered uiop:launch-program which does exactly what I need in a portable manner. It was implemented in asdf- released on Dec 1st, 2016 (a month and a half ago!). This post is meant as a piece of documentation that can be indexed by search engines to help spread my happy discovery. :)


The Python code reads commands from standard input and writes the responses to standard output. Both commands and response headers are followed by a newline and an optional payload.

The commands are:

  • exit - what it does is self-evident
  • pygmentize|len|lang[|opts]:
    • len is the length of the code snippet
    • lang is the language to colorize
    • optional parameter opts is the configuration of the HTML formatter
    • after the newline, len utf-8 characters of the code block need to follow

There's only one response: colorized|len, followed by a newline and len utf-8 characters of the colorized code as an HTML snippet.
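A minimal sketch of a loop that speaks this protocol is shown below. It is not the script from the pull request: the function name serve is made up, and the colorize argument is a stand-in for the real Pygments call so that the example stays self-contained.

```python
import io

def serve(inp, out, colorize):
    """Handle commands from inp until 'exit' or the end of the input."""
    while True:
        header = inp.readline()
        if not header:                      # the other side closed the stream
            break
        parts = header.rstrip("\n").split("|")
        if parts[0] == "exit":
            break
        if parts[0] == "pygmentize":
            length, lang = int(parts[1]), parts[2]
            opts = parts[3] if len(parts) > 3 else None
            code = inp.read(length)         # length counts characters, not bytes
            html = colorize(code, lang, opts)
            out.write("colorized|%d\n" % len(html))
            out.write(html)
            out.flush()                     # the client is waiting for the reply

# Exercise the loop with in-memory streams and a fake colorizer.
snippet = "print('héllo')\n"
request = "pygmentize|%d|python\n%sexit\n" % (len(snippet), snippet)
reply = io.StringIO()
serve(io.StringIO(request), reply,
      lambda code, lang, opts: "<pre>" + code + "</pre>")
response = reply.getvalue()
```

Counting characters rather than bytes only works if both ends agree on the encoding, which is one more reason to wrap the standard streams explicitly.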

Python's automatic inference of standard IO's encoding is still pretty messed up, even in Python 3. It's a good idea to create wrapper objects and interact only with them:

input  = io.TextIOWrapper(sys.stdin.buffer,  encoding='utf-8')
output = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

Printing diagnostic messages to standard error output is useful for debugging:

def eprint(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)


OK, I have a Python script that does the coloring. Before I can use it, I need to tell ASDF about it and locate where it is in the filesystem. The former is done by using the :static-file qualifier in the :components list. The latter is a bit more complicated. Since the file's location is known relative to the Lisp file it will be used with, it's doable.

(defvar *pygmentize-path*
  (merge-pathnames ""
                   #.(or *compile-file-truename* *load-truename*))
  "Path to the pygmentize script")

The trick here is to use #. to execute the statement at read-time. You can see the full explanation here.

With that out of the way, I can start the renderer with:

(defmethod start-concrete-renderer ((renderer (eql :pygments)))
  (setf *pygmentize-process* (uiop:launch-program
                              (list *python-command*
                                    (namestring *pygmentize-path*))
                              :input :stream
                              :output :stream)))

For debugging purposes, it's useful to add :error-output "/tmp/debug", so that the diagnostics do not get eaten up by /dev/null.

To stop the process, we send it the exit command, flush the stream, and wait until the process dies:

(defmethod stop-concrete-renderer ((renderer (eql :pygments)))
  (write-line "exit" (process-info-input *pygmentize-process*))
  (force-output (process-info-input *pygmentize-process*))
  (wait-process *pygmentize-process*))

The Lisp part of the colorizer sends the pygmentize command together with the code snippet to Python and receives the colorized HTML:

(defun pygmentize-code (lang params code)
  (let ((proc-input (process-info-input *pygmentize-process*))
        (proc-output (process-info-output *pygmentize-process*)))
    (write-line (format nil "pygmentize|~a|~a~@[|~a~]"
                        (length code) lang params)
                proc-input)
    (write-string code proc-input)
    (force-output proc-input)
    (let ((nchars (parse-integer
                   (nth 1
                        (split-sequence #\| (read-line proc-output))))))
      (coerce (loop repeat nchars
                 for x = (read-char proc-output)
                 collect x)
              'string))))

See the entire pull request here.


I was able to cut the time it takes to generate this blog from well over a minute to less than three seconds.

]==> time ./coleslaw-old.x /path/to/blog/
./coleslaw-old.x /path/to/blog/  66.40s user 6.19s system 98% cpu 1:13.55 total
]==> time ./coleslaw-new-no-renderer.x /path/to/blog/
./coleslaw-new-no-renderer.x /path/to/blog/  65.50s user 6.03s system 98% cpu 1:12.53 total
]==> time ./coleslaw-new-renderer.x /path/to/blog/
./coleslaw-new-renderer.x /path/to/blog/  2.78s user 0.27s system 106% cpu 2.849 total
  • coleslaw-old.x is the original code
  • coleslaw-new-no-renderer.x starts and stops the renderer with every code snippet
  • coleslaw-new-renderer.x starts the renderer beforehand and stops it after all the work is done