Intro

It's kind of silly, but I have wanted to build a drone that can evade being hit by a lightsaber ever since I first watched this scene in "Attack of the Clones":

Star Wars - Jedi Younglings

To gain an understanding of the ecosystem, I decided to buy parts semi-randomly on Amazon, build something that can just fly, and then iteratively improve on this basic design. People can do astounding things with drones these days. My ultimate goal is to be able to build stuff like that.

The hardware

Here's the list of things I have bought:

  • A carbon fiber quadrotor frame. (Amazon)
  • Brushless rotors. I cannot find the exact model anymore, but they were similar to the ones in the link. (Amazon)
  • Electronic Speed Controllers. (Amazon)
  • Clock-wise and counter-clock-wise propellers. You only need two of each kind, but they easily break if you do tuning in a confined space, so it's wise to buy more of them upfront. (Amazon)
  • A battery. The one in the link is large enough and still fits inside the frame. I need it inside because I wanted the electronics to be easily accessible on the top - I will likely want to change it quite a bit later. (Amazon)
  • A power connection board. You can do without it, but it's quite a lot of connections, so soldering wires together and wrapping the joints in isolation tape is painful and looks ugly. (Amazon)
  • An autopilot board. I bought a cheap CC3D because I ultimately want to ditch it and build one myself. (Amazon)
  • A Raspberry Pi. I want the drone to fly by itself, so I did not buy any radio controller - it will be the task of a computer to do the steering. I used an RPi model 2 because I had one readily available at home. However, these days model 3 is cheaper, so it's probably a much better idea to buy that one. (Amazon)
  • A WiFi dongle. I want this first version to be controllable from a web browser via WiFi. A later version will send some telemetry and receive high-level commands via GSM. (Amazon)

Apart from all that, I used some electronics components to power things up and connect them. I had most of them at home, but I will put some links below nonetheless. You'll likely need these:

  • A prototyping board. It's nice to solder things together to something stable so that the components don't fly around attached to loose wires. The one in the following link should do. (Farnell)
  • A 5V voltage regulator. You will need one to power the Raspberry Pi up. The documentation of the CC3D says that the board puts the unregulated output from the ESCs on the output of its serial ports. This output happens to be at 5V, so I initially used it to power the Pi. Unfortunately, the Pi needs to draw at least around 600 mA of current to work, so the ESC that powered it got extremely hot and the motor it was controlling lagged behind the others. Make sure you buy a regulator with as stable an output as possible. Some of the cheap ones will make the Raspberry reboot in the middle of the flight due to voltage oscillations. This, in turn, will make the autopilot think it has lost the connection to the radio controller, and it will go into failsafe mode. A TO-220-compatible heatsink for that regulator is not a bad idea either. You will also need two capacitors. I used 10 μF and 22 μF. Alternatively, you can get yourself a DC-to-DC converter, in which case you won't need capacitors or heatsinks, and it should be much easier on the battery. (Farnell) (Farnell) or (Farnell)
  • An NPN Transistor and two 10 kΩ resistors for a logic inverter with voltage level adjustment. (Farnell)
  • Header pins and jumper cables so that you can connect things nicely. (Amazon, Amazon)

I used some extra components, even though they are not necessary to make things work. I am not exactly sure where this project will take me, so it seemed prudent to plan far ahead.

  • A 3.3V voltage regulator. I will likely want to power a 3.3V-based microcontroller to act as an autopilot. It needs an extra 10μF capacitor. (Farnell)
  • Four NPN Transistors and eight 10 kΩ resistors for bi-directional voltage level adjusting.

Wiring things up

Wiring things up is not hugely complicated. I put the power connection board on the bottom side of the drone together with all the cables powering the ESCs. The ESC control cables and the power for the RaspberryPi go from the bottom to the top in two bunches in the middle of each side of the drone. All sorts of electronics-related connections sit on the top. The battery is inside the drone frame.

Drone Wiring

Pretty much the only thing to pay attention to at this stage is making sure that all the rotors are placed in the right positions and that they connect to the ESCs such that they spin in the right directions. Here's a great video on that. The image below was produced by the LibrePilot configuration wizard.

Drone Rotor Topology - produced by LibrePilot

You cannot connect the communication ports of the Raspberry Pi to the CC3D directly because there is a difference in the voltage levels at which the ports operate. The Pi's GPIO works at 3.3V and cannot tolerate 5V. The autopilot should, in principle, work at 3.3V with tolerance to 5V. However, in practice, I found that only 5V-based logic works. This is why I needed to build two voltage level converters out of transistors. I want to use them to send commands to and receive telemetry from the FlexiPort of the autopilot.

Voltage Leveling Circuit

As shown in the rotor topology diagram, the computer controls the autopilot using the S-Bus protocol. The protocol just transmits data over UART, with two quirks: S-Bus is a logical inversion of UART (every 1 in UART is a 0 in S-Bus and vice-versa), and we need to take care of the voltage level difference in the high states. The best solution here is again a small transistor circuit that does the inversion. It is half of the voltage leveling circuit:

Inverter Circuit

Here's what the resulting board together with the voltage regulators looks like in my case:

Complete board

Control and Telemetry

A massive bummer of the RaspberryPi for me right now is that it only has one hardware UART controller. I will need more: at least one extra to read telemetry data from the CC3D and, later on, another to talk to my WaveShare GSM modem. You can bit-bang UART on GPIOs, and some external kernel modules out there can do that. The problem is that they rely on the kernel's hrtimers and are not reliable enough at higher speeds, especially if the system is under load. I use one for now at a low speed, but I am working on my own implementation that uses hardware timers to flip the GPIO states reliably and on time. The CC3D and the Raspberry Pi can talk telemetry over such a simulated serial port using the UAVTalk protocol. The LibrePilot source code provides Python bindings for that.

I used the one hardware UART port that the Pi has for the control link because it needs to operate at a high and non-standard speed. On RaspberryPis 1 and 2 this port is used as a Linux console output by default, so you will need to disable that in /boot/config.txt. RaspberryPis 0 and 3 use the hardware UART to control Bluetooth. This behavior may be disabled by installing the pi3-disable-bt device tree overlay. All the necessary details are here. Once you're done with that, you can connect pin 14 of the Pi to the input of the inverter and the yellow (orange) cable of the CC3D's main port to its output.

After doing all that, it's a matter of opening the serial port in the right mode and sending the protocol byte stream down the pipe. S-Bus expects a baud rate of 100000, one even parity bit, and two stop bits. Here's how to open the port in this mode using Python's pyserial:

import serial

port = serial.Serial('/dev/ttyAMA0', baudrate=100000,
                     parity=serial.PARITY_EVEN,
                     stopbits=serial.STOPBITS_TWO)

I found an excellent description of the S-Bus data frames here. Each frame is 25 bytes long and consists of: a start byte, 16 11-bit channels packed in the next 22 bytes, a byte containing flags and extra binary channels, and, finally, a stop byte. The controller is supposed to send a frame every 7ms, but after reading the code, I found that the LibrePilot firmware is fine as long as it receives a frame at least ten times per second (at least more often than every 102.4ms to be precise). You can see the code of my encoder here.
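
For illustration, here's a minimal sketch of such an encoder in Python. It follows the frame layout described above (start byte 0x0f, 16 channels packed into 22 bytes, a flag byte, and a stop byte), but it is not the code from my repository, and the exact bit order of the packed channels may need adjusting:

def encode_sbus_frame(channels, flags=0):
    # channels: 16 integers, each in the 0-2047 range (11 bits)
    assert len(channels) == 16
    data = 0
    for i, value in enumerate(channels):
        data |= (value & 0x7ff) << (11 * i)  # pack 11 bits per channel
    return bytes([0x0f]) + data.to_bytes(22, 'little') + bytes([flags & 0xff, 0x00])

# send a neutral frame using the port opened above
port.write(encode_sbus_frame([1024] * 16))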

I quickly got tired of putting these numbers in a terminal window, so I wrote a trivial controller that works in a browser and uses a bunch of sliders. The code is on GitHub.

Controller interface

Open/Libre Pilot

There seems to have been some disagreement in the Open/Libre Pilot community, and the project does not look like it's in great shape. I needed to make a bunch of trivial changes to the GCS source code to make it compile on my Debian Testing laptop. Furthermore, the firmware does not build with the cross-compiler toolchain they supply due to some GCC configuration issues. I managed to build the firmware by using the stock Debian cross-compiler for ARM and modifying the Makefile so that it does not use the -Werror flag. The firmware code has plenty of unused variables that make the build process fail with this setting turned on. Even after building everything, the GCS crashes every other time you try to power cycle the board. As far as the CC3D boards themselves are concerned, I have two of them, and only one works in a more or less stable way. The other one does not load the configuration correctly or hangs on 3 out of 5 boots.

I used the config wizard at the beginning but found it confusing, so I later decided to do the configuration manually. Here's a list of what I did screen-by-screen:

  • Hardware:
    • Receiver Port: Disabled+OneShot
    • Flexi Port: Telemetry
    • Main Port: S.Bus
    • USB HID Port: USBTelemetry
    • Telemetry Speed: 9600 - faster than that is not reliable with current implementations of software UART for RaspberryPi.
  • Vehicle - Multirotor:
    • Airframe Type: Quad X
    • Assigned the rotors to the appropriate channels
  • Input:
    • Remote Control Input:
      • All channels need to be assigned even though not all of them correspond to any inputs in the pipilot interface. Otherwise, you will get receiver warnings, and the copter won't arm. I figured that out the hard way by reading the firmware code. Here's to the great diagnostics!
      • Throttle is Channel 1, Yaw - Channel 2, Roll - Channel 3, Pitch - Channel 4.
      • Accessories are Channels 5 to 8.
      • You can assign other controls to whatever other channels you like.
      • S.Bus transmits 11bits worth of data per channel, so the minimum is at 0 and the maximum is at 2047.
    • Flight Mode Settings
      • Flight Mode Count: 1
      • Stabilized 1: Attitude, Attitude, Axis Lock, CruiseControl - CruiseControl is particularly important. If you set it to Manual, the copter will behave erratically.
    • Arming Settings:
      • Arm airframe using throttle off and: Yaw Left
  • Output:
  • Attitude:
    • You want to level your gyros
    • People say that there are two ways to combat the copter drifting while hovering:
      • Increase the amount of low-pass filtering.
      • Set the virtual rotation to compensate for the board not being completely flat. See this link.
    • Neither of these solutions helped me.

Results

Flight Test #0

If you think it looks completely underwhelming, then I have to agree with you. The main problem is the drift while hovering. I tried virtual rotation, low-pass filtering, and PID tuning, but no amount of configuration tweaking alleviates the problem. The setup does not have any optical sensors, and accelerometers, by definition, don't see drifting at a constant speed. On the other hand, the copter is stationary at the beginning and starts to move after the take-off, so the acceleration is not zero. It might be that the sensors are not sensitive enough to pick it up. That's something I intend to investigate once I get the telemetry connection working reliably.

Next steps

Here's what I plan to do next:

  • Get my kernel soft UART module based on hardware timers to work. I have the timer interface finished and tested, but still need to do the byte encoder, the GPIO state changing logic, and the TTY interface.
  • Connect the telemetry at higher speed to see if the sensors see the drift.
  • If the sensors see the drift, either write a PID controller at the level of pipilot or see why the firmware does not compensate for it.

Medium-term plans include:

  • Attach the Crazyflie sensor and the IMUs directly to the Raspberry Pi.
  • Hack the CC3D firmware so that the Pi can control the motors directly.
  • See if Linux (a non-RTOS) on RPi is reliable enough to control the copter and keep it hovering stably.

Long-term plans:

  • Port everything to my FRDM-K64F board to see if things improve if implemented on top of "bare metal."
  • Start playing with more complex control and estimation math.
  • Add cameras, lidar, and implement some autonomy.
  • Perhaps write all the microcontroller code in Rust instead of C.

Intro

I got tired of having to wait for several hours every time I want to build TensorFlow on my Jetson board. The process got especially painful since NVIDIA removed the swap support from the kernel that came with the most recent JetPack. The swap was pretty much only used during the compilation of CUDA sources and free otherwise. Without it, I have to restrict Bazel resources to a bare minimum to avoid OOM kills when the memory usage spikes for a split second. I, therefore, decided to cross-compile TensorFlow for Jetson on a more powerful machine. As usual, it was not exactly smooth sailing, so here's a quick guide.

The toolchain and target side dependencies

First of all, you will need a compiler capable of producing binaries for the target CPU. I initially built one from source but then noticed that Ubuntu provides one that is suitable for the task.

]==> sudo apt-get install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu

Let's see if it can indeed produce the binaries for the target:

]==> aarch64-linux-gnu-g++ hello.cxx
]==> scp a.out jetson:
]==> ssh jetson ./a.out
Hello, World!

You will also need the same version of CUDA that comes with the JetPack. You can download the repository setup file from NVIDIA's website. I usually go for the deb (network) option. Go ahead and set that up, and then type:

]==> sudo apt-get install cuda-toolkit-8-0

You will need cuDNN 6, both on the build host and the target. It's quite surprising, but both versions are necessary because the TensorFlow code generators that need to run on the build host depend on libtensorflow_framework.so, which seems to provide everything from string helpers to cuDNN and cuBLAS wrappers. It's not a great design choice, but to do them justice, it's not noticeable unless you try to do weird stuff, like cross-compiling CUDA applications. You can get the host-side version here, and you can download the target version from the device:

]==> export TARGET_PACKAGES=/some/empty/directory
]==> mkdir -p $TARGET_PACKAGES/cudnn/{include,lib}
]==> cd $TARGET_PACKAGES/cudnn
]==> ln -sf lib lib64
]==> scp jetson:/usr/lib/aarch64-linux-gnu/libcudnn.so.6.0.21 lib
]==> cd lib
]==> ln -sf libcudnn.so.6.0.21 libcudnn.so.6
]==> cd ..
]==> scp jetson:/usr/include/aarch64-linux-gnu/cudnn_v6.h include/cudnn.h

The next step is to install the CUDA libraries for the target. Unfortunately, their packaging is broken. The target architecture in the metadata of the relevant packages is marked as arm64. It makes sense on the surface because they contain aarch64 binaries after all, but it makes them impossible to install on an x86-64 build host. The convention is to mark the architecture of such packages as all (universal) because the binary shared objects they contain are only meant to be stubs for the cross-compiler (see here) and are not supposed to be runnable on the build host. We will, therefore, need some apt trickery to install them:

]==> sudo apt-get -o Dpkg::Options::="--force-architecture" install \
         cuda-nvrtc-cross-aarch64-8-0:arm64 cuda-cusolver-cross-aarch64-8-0:arm64 \
         cuda-cudart-cross-aarch64-8-0:arm64 cuda-cublas-cross-aarch64-8-0:arm64 \
         cuda-cufft-cross-aarch64-8-0:arm64 cuda-curand-cross-aarch64-8-0:arm64 \
         cuda-cusparse-cross-aarch64-8-0:arm64 cuda-npp-cross-aarch64-8-0:arm64 \
         cuda-nvml-cross-aarch64-8-0:arm64 cuda-nvgraph-cross-aarch64-8-0:arm64

Furthermore, the names of the libraries installed by these packages are inconsistent with their equivalents for the build host, so we will need to make some symlinks in order not to confuse the TensorFlow build scripts.

]==> cd  /usr/local/cuda-8.0/targets/aarch64-linux/lib
]==> for i in cublas curand cufft cusolver; do \
       sudo ln -sf stubs/lib$i.so lib$i.so.8.0.61 && \
       sudo ln -sf lib$i.so.8.0.61 lib$i.so.8.0 && \
       sudo ln -sf lib$i.so.8.0 lib$i.so; \
     done

Let's check if all this works at least in a trivial test:

]==> wget https://raw.githubusercontent.com/ljanyst/kicks-and-giggles/master/hello/hello.cu
]==> /usr/local/cuda-8.0/bin/nvcc -ccbin /usr/bin/aarch64-linux-gnu-g++ -std=c++11 \
       --gpu-architecture=compute_53 --gpu-code=sm_53,compute_53 \
       hello.cu
]==> scp a.out jetson:
]==> ssh jetson ./a.out
Input: 1, 2, 3, 4, 5,
Output: 2, 3, 4, 5, 6,

And with cuBLAS:

]==> wget https://raw.githubusercontent.com/ljanyst/kicks-and-giggles/master/hello/hello-cublas.cxx
]==> aarch64-linux-gnu-g++ hello-cublas.cxx \
       -I /usr/local/cuda-8.0/targets/aarch64-linux/include \
       -L /usr/local/cuda-8.0/targets/aarch64-linux/lib \
       -lcudart -lcublas
]==> scp a.out jetson:
]==> ssh jetson ./a.out
A =
1 2 3
4 5 6

B =
7 8
9 10
11 12

A*B =
58 64
139 154

TensorFlow also needs Python headers for the target. The installation process for these is straightforward. You will just need to pre-define the results of some of the configuration tests in the config.site file. These tests require either access to the dev file system of the target or need to run compiled C code, so they cannot be executed on the build host.

]==> wget https://www.python.org/ftp/python/3.5.2/Python-3.5.2.tar.xz
]==> tar xf Python-3.5.2.tar.xz
]==> cd Python-3.5.2
]==> cat config.site
ac_cv_file__dev_ptmx=yes
ac_cv_file__dev_ptc=no
ac_cv_buggy_getaddrinfo=no
]==> CONFIG_SITE=config.site ./configure --prefix=$TARGET_PACKAGES --enable-shared \
       --host aarch64-linux-gnu --build x86_64-linux-gnu --without-ensurepip
]==> make -j12 && make install
]==> cd $TARGET_PACKAGES/include
]==> mkdir aarch64-linux-gnu
]==> cd aarch64-linux-gnu && ln -sf ../python3.5m

Bazel setup and TensorFlow mods

The way to tell Bazel about a compiler configuration is to write a CROSSTOOL file. The file is just a collection of paths to various tools and default configuration parameters for them. There are, however, some things to note here as well. First, the configuration script of TensorFlow asks about the host Python installation and sets the source up to use it. However, what we need in this case is the target Python. Since there seems to be no easy way to plug that into the standard build scripts, we pass the right include directory to the compiler here:

cxx_flag: "-isystem"
cxx_flag: "__TARGET_PYTHON_INCLUDES__"

We will also need to inject some compiler parameters on the fly for some of the binaries, so we call neither the build-host nor the target compiler directly:

tool_path { name: "gcc" path: "crosstool_wrapper_driver_is_not_gcc" }
...
tool_path { name: "gcc" path: "crosstool_wrapper_host_tf_framework" }

As mentioned before, one of the problems is that the code generators that need to run on the build host depend on libtensorflow_framework.so, which, in turn, depends on CUDA. We, therefore, need to let the compiler know where the host versions of the CUDA libraries are installed. The second problem is that Bazel fails to link the code generators against the framework library. We fix both in the host wrapper script:

if ofile is not None:
    is_gen = ofile.endswith('py_wrappers_cc') or ofile.endswith('gen_cc')
    if is_cuda == 'yes' and (ofile.endswith('libtensorflow_framework.so') or is_gen):
        cuda_libdirs = [
            '-L', '{}/targets/x86_64-linux/lib'.format(cuda_dir),
            '-L', '{}/targets/x86_64-linux/lib/stubs'.format(cuda_dir),
            '-L', '{}/lib64'.format(cudnn_dir)
        ]

    if is_gen:
        tf_libs += [
            '-L', 'bazel-out/host/bin/tensorflow',
            '-ltensorflow_framework'
        ]

As far as the target is concerned, the only problem that I noticed is Bazel failing to set up RPATH for the target version of libtensorflow_framework.so correctly. It causes build failures of some of the binaries that depend on this library. We fix this problem in the wrapper script for the target compiler:

ofile = GetOptionValue(sys.argv[1:], 'o')
if ofile and ofile[0].endswith('libtensorflow_framework.so'):
  cpu_compiler_flags += [
      '-Wl,-rpath,'+os.getcwd()+'/bazel-out/arm-py3-opt/genfiles/external/local_config_cuda/cuda/cuda/lib',
  ]

Some adjustments need to be made to the paths where TensorFlow looks for CUDA libraries and header files. Also, the build_pip_package.sh script needs to be patched to make sure that the resulting wheel file has the correct platform metadata specified in it.

Building the CPU and the GPU packages

I have put all of the patches I mentioned above in a git repo, so you will need to check that out:

]==> git clone https://github.com/ljanyst/tensorflow.git
]==> cd tensorflow
]==> git checkout v1.5.0-cross-jetson-tx1

Let's try a CPU-only setup first. You need to configure the toolchain, and then you can configure and compile TensorFlow as usual. Use /usr/bin/python3 for python, use -O2 for the compilation flags and say no to everything but jemalloc.

]==> cd third_party/toolchains/cpus/aarch64
]==> ./configure.py
]==> cd ../../../..
]==> ./configure
]==> bazel  build  --config=opt \
       --crosstool_top=//third_party/toolchains/cpus/aarch64:toolchain \
       --cpu=arm  //tensorflow/tools/pip_package:build_pip_package
]==> mkdir out-cpu
]==> bazel-bin/tensorflow/tools/pip_package/build_pip_package out-cpu --platform linux_aarch64

To get the GPU setup working, you will need to rerun the toolchain configuration script and tell it the paths to the build-host-side CUDA 8.0 and cuDNN 6. Then configure TensorFlow with the same settings as above, but enable CUDA this time. Tell it the paths to your CUDA 8.0 installation and your target-side cuDNN 6, and specify /usr/bin/aarch64-linux-gnu-gcc as the compiler.

]==> cd third_party/toolchains/cpus/aarch64
]==> ./configure.py
]==> cd ../../../..
]==> ./configure
]==> bazel build  --config=opt --config=cuda \
         --crosstool_top=//third_party/toolchains/cpus/aarch64:toolchain \
         --cpu=arm  --compiler=cuda \
         //tensorflow/tools/pip_package:build_pip_package

The compilation takes roughly 15 minutes for the CPU-only setup and 22 minutes for the CUDA setup on my Core i7 build host. It's a vast improvement compared to hours on the Jetson board.

Tests

I haven't done any extensive testing, but my SSD implementation works fine and reproduces the results I get on other boxes. There is, therefore, a strong reason to believe that things compiled fine.

Detections in a Pascal-VOC example on the Jetson

Intro

AWS has recently introduced the P3 instances. They come with Tesla V100 GPUs, so I decided to run a little benchmark to see how well they perform compared to my workstation when training neural networks. I installed the most recent versions of CUDA/cuDNN (9.0/7.0) and TensorFlow (1.4.0), and ran two non-trivial benchmarks that test both the GPU and the CPU.

Building the Software

Building TensorFlow from source is relatively straightforward, except that you need to install Bazel. And gosh, never have I ever managed to build that stuff without issues. This time was not an exception. I wrote an article about that in the past, so I won't go into much detail here. I will just say that you will need Protocol Buffers version 3.4.0, grpc-java version 1.6.1, and Bazel version 0.7.0. You will then need to apply this patch that I have taken from here and resolved the merge conflicts. Then, you will need to apply this one on top. The rest should go smoothly.

Testing Setup

I used my workstation and two AWS GPU Compute instances. Their exact parameters are in the table below. Since my workstation has an SSD, I used RAM disks on AWS to make the tests more comparable.

Name | Description | CPU | GPU | CUDA Compute | Data Source
ti | My workstation | i7-6600U | GeForce GTX 1080 Ti | 6.1 | SSD
p2 | AWS p2.xlarge | E5-2686 | Tesla K80 | 3.7 | RAM disk
p3 | AWS p3.2xlarge | E5-2686 | Tesla V100 | 7.0 | RAM disk

The tests are object detection and semantic segmentation, both coming in a smaller and a larger flavor. The former processes all the input data in parallel to the GPU thread, whereas the latter does the processing serially in the main thread. On both the p3 and ti machines, the CPU utilization was at 100% when running the semantic segmentation test, which means that the CPU is the bottleneck there.

Results

Normalized Performance

Machine | VGG300 | VGG512 | KITTI | Cityscapes
ti | 11:38 | 28:09 | 00:16 | 09:22
p2 | 46:39 | 1:49:05 | 00:50 | 20:31
p3 | 08:15 | 20:10 | did not work | 13:01

The results in the image above are normalized, with 1 being the score for the ti setup. The table contains the exact execution times of training over one epoch. The V100 is around 30% faster than the 1080 Ti. The 1080 Ti, in turn, is about four times faster than the K80. Also, a Core i7 seems to be more performant than the Xeon Amazon uses in their instances. The KITTI test did not work on the V100 - it hit a strange CUDA bug.

Intro

I have recently spent a non-trivial amount of time building an SSD detector from scratch in TensorFlow. I had initially intended for it to help identify traffic lights in my team's SDCND Capstone Project. However, it turned out that it's not particularly efficient with tiny objects, so I ended up using the TensorFlow Object Detection API for that purpose instead. In the end, I managed to bring my implementation of SSD to a pretty decent state, and this post gathers my thoughts on the matter. It is not intended to be a tutorial. Instead, it's a discussion of all the pieces of information that were unclear to me or that I needed to research independently of the original paper.

Object Detection

Base Network and Extensions

SSD-VGG300

SSD is based on a modified VGG-16 network pre-trained on the ImageNet data. I happened to have one from one of my previous projects, and I used it here as well. The following modifications have been made to the base network:

  • pool5 was changed from 2x2 (stride: 2) to 3x3 (stride: 1)
  • fc6 and fc7 were converted to convolutional layers and subsampled
  • à trous convolution was used in fc6
  • fc8 and all of the dropout layers were removed

As you can see from the above image, the fc6 and fc7 convolutions are 3x3x1024 and 1x1x1024 respectively, whereas in the original VGG they are 7x7x4096 and 1x1x4096. Having huge filters like these is a computational bottleneck. According to one of the references, we can address this problem by "spatially subsampling (by simple decimation)" the weights and then using the à trous convolution to keep the filter's receptive field unchanged. It was not immediately clear to me what that means, but after reading this page of MatLab's documentation, I came up with the following:

with tf.variable_scope('mod_conv6'):
    orig_w, orig_b = sess.run([self.vgg_fc6_w, self.vgg_fc6_b])
    mod_w = np.zeros((3, 3, 512, 1024))
    mod_b = np.zeros(1024)

    for i in range(1024):
        mod_b[i] = orig_b[4*i]
        for h in range(3):
            for w in range(3):
                mod_w[h, w, :, i] = orig_w[3*h, 3*w, :, 4*i]

    w = array2tensor(mod_w, 'weights')
    b = array2tensor(mod_b, 'biases')
    x = tf.nn.atrous_conv2d(self.mod_pool5, w, rate=6, padding='SAME')
    x = tf.nn.bias_add(x, b)
    self.mod_conv6 = tf.nn.relu(x)

It doubled the speed of training and did not seem to have any adverse effects on accuracy. Note that the dilation rate of the à trous convolution is set to 6 instead of 3. This setting is inconsistent with the size of the original filter, but it is nonetheless used in the reference code.

The output of the conv4_3 layer differs in magnitude compared to other layers used as feature maps of the detector. As pointed out in the ParseNet paper, this fact may lead to reduced performance because "larger" features may overwhelm the "smaller" ones. They propose to use L2 normalization with a scale learnable separately for each channel as a remedy to this problem. This is what I ended up doing in TensorFlow:

def l2_normalization(x, initial_scale, channels, name):
    with tf.variable_scope(name):
        scale = array2tensor(initial_scale*np.ones(channels), 'scale')
        x = scale*tf.nn.l2_normalize(x, dim=-1)
    return x

The initial scale for each channel is set to 20, and it does not change very much over the training time.

Furthermore, a bunch of extra convolutional layers were added on top of the modified fc7. The number of these layers depends on the flavor of the detector: vgg300 or vgg512. The paper does not explain the parameters of the convolutions well enough, especially the padding settings, even though getting this part wrong can significantly impact the performance. I looked these up in the reference code for vgg300 and worked my way backward from the number of anchors in the case of vgg512. Here's what I ended up with:

  • conv8_1: 1x1x256 (stride: 1, pad: same)
  • conv8_2: 3x3x512 (stride: 2, pad: same)
  • conv9_1: 1x1x128 (stride: 1, pad: same)
  • conv9_2: 3x3x256 (stride: 2, pad: same)
  • conv10_1: 1x1x128 (stride: 1, pad: same)
  • conv10_2: 3x3x256 (stride: 1, pad: valid) for vgg300, (stride: 2, pad: same) for vgg512
  • conv11_1: 1x1x128 (stride: 1, pad: same)
  • conv11_2: 3x3x256 (stride: 1, pad: valid)

For the vgg512 flavor, there are two extra layers:

  • conv12_1: 1x1x128 (stride: 1, pad: same)
  • padding of the conv12_1 feature map with one extra cell in each spatial dimension
  • conv12_2: 3x3x256 (stride: 1, pad: valid)

It's not possible to use the predefined padding options (VALID or SAME) for extending conv12_1, so I ended up doing it manually:

x, l2 = conv_map(self.ssd_conv11_2, 128, 1, 1, 'conv12_1')
paddings = [[0, 0], [0, 1], [0, 1], [0, 0]]
x = tf.pad(x, paddings, "CONSTANT")
self.ssd_conv12_1 = self.__with_loss(x, l2)
x, l2 = conv_map(self.ssd_conv12_1, 256, 3, 1, 'conv12_2', 'VALID')
self.ssd_conv12_2 = self.__with_loss(x, l2)

Default Boxes (a. k. a. Anchors)

Default Boxes

The model takes the outputs of some of these convolutional layers and associates a scale with each of them. The exact formula is presented in the paper; the reference implementation does not seem to follow it exactly, though. In general, the further away the feature map is from the input, the larger is the scale assigned to it. The scale only loosely correlates with the receptive field of the respective filter.

The model adds a bunch of 3x3xp convolutional filters on top of each of these maps. Each of these filters predicts p parameters of a default box (or an anchor) at the location to which it is applied. Four of these p parameters are the coordinates of the window (relative width and height, as well as x and y offsets from the center of the anchor). The remaining parameters define the probability distribution of the box belonging to one of the classes that the model predicts (the softmaxed logits). We need to add as many of these filters per feature map as we want aspect ratios for the default boxes of a given scale. In general, the more, the better. The paper advises using six aspect ratios per map. However, the implementation uses fewer of them in some cases.
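
To make the idea concrete, here is a sketch of one such prediction head in TensorFlow 1.x. The function name and its arguments are hypothetical, not the ones used in my code:

import tensorflow as tf

def detection_head(feature_map, num_anchors, num_classes, name):
    # Predicts, for each anchor at every location, 4 box parameters
    # plus the class logits that get softmaxed later.
    num_params = num_anchors * (4 + num_classes)
    in_channels = feature_map.get_shape().as_list()[-1]
    with tf.variable_scope(name):
        w = tf.get_variable('weights', shape=[3, 3, in_channels, num_params])
        b = tf.get_variable('biases', shape=[num_params],
                            initializer=tf.zeros_initializer())
        x = tf.nn.conv2d(feature_map, w, strides=[1, 1, 1, 1], padding='SAME')
        return tf.nn.bias_add(x, b)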

We now need to create the ground truth labels for the optimizer. We match each ground truth box to an anchor box with the highest Jaccard overlap (if it exceeds 0.5). Additionally, we match it to every anchor with overlap higher than 0.5. The original code uses a mixture of bipartite matching and maximum overlap to resolve conflicts, but I just used the latter criterion for simplicity. For every matched anchor we set the class label accordingly and use the following for the box parameters:

\[ w = 10 \cdot log(\frac{w_{gt}}{w_{a}}) \\ h = 10 \cdot log(\frac{h_{gt}}{h_{a}}) \\ x_c = 5 \cdot \frac{x_{c,gt} - x_{c,a}}{w_a} \\ y_c = 5 \cdot \frac{y_{c,gt} - y_{c,a}}{h_a} \]

The code uses the scaling constants above (5, 10) and calls them "prior variance," but the paper does not mention this fact.
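
Putting the formulas together, the encoding of a single matched box can be sketched as follows (boxes given as center coordinates plus width and height; an illustration, not the exact code from the repository):

import numpy as np

def encode_box(gt, anchor):
    # gt and anchor are (x_center, y_center, width, height)
    x_gt, y_gt, w_gt, h_gt = gt
    x_a, y_a, w_a, h_a = anchor
    return np.array([
        5.0 * (x_gt - x_a) / w_a,    # x offset, scaled by the "prior variance"
        5.0 * (y_gt - y_a) / h_a,    # y offset
        10.0 * np.log(w_gt / w_a),   # relative width
        10.0 * np.log(h_gt / h_a),   # relative height
    ])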

Training Objective

The loss function consists of three parts:

  • the confidence loss
  • the localization loss
  • the l2 loss (weight decay in the Caffe parlance)

The confidence loss is what TensorFlow calls softmax_cross_entropy_with_logits, and it's computed for the class probability part of the parameters of each anchor. Since there are many more positive (matched) anchors than negative (unmatched/background) ones, the learning ends up being more stable if not every background score contributes to the final loss. We need to mine the scores of all the positive anchors and at most three times as many negative anchors, using only the background anchors with the highest confidence loss. It results in somewhat involved code in the declarative style of TensorFlow.
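
Stripped of the TensorFlow plumbing, the hard negative mining itself boils down to something like this numpy sketch (not the code from the repository):

import numpy as np

def mine_hard_negatives(conf_loss, positives, neg_ratio=3):
    # conf_loss: per-anchor confidence loss; positives: boolean mask of
    # matched anchors. Returns the mask of anchors contributing to the loss.
    num_neg = neg_ratio * int(positives.sum())
    neg_loss = np.where(positives, -np.inf, conf_loss)  # ignore positives
    hardest = np.argsort(-neg_loss)[:num_neg]           # highest loss first
    selected = positives.copy()
    selected[hardest] = True
    return selected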

The localization loss sums up the Smooth L1 losses of differences between the prediction and the ground truth labels. The Smooth L1 loss is defined as follows:

\[ SmoothL1(x) = \begin{cases} |x| - 0.5 & |x| \geq 1 \\ 0.5 \cdot x^2 & |x| \lt 1 \\ \end{cases} \]

It translates to the following code in TensorFlow:

def smooth_l1_loss(x):
    square_loss   = 0.5*x**2
    absolute_loss = tf.abs(x)
    return tf.where(tf.less(absolute_loss, 1.), square_loss, absolute_loss-0.5)

The paper advises using the batch size of 32. However, this recommendation assumes training in parallel on four GPUs. If you have just one (like I do), 8 is a better number. The original code uses the SGD optimizer with momentum, rate decay at predefined steps, and doubling of the rate for biases. I found that using the Adam optimizer with the exponential decay rate of 0.97 per epoch and using 0.1 as the stability constant (epsilon) works better for this implementation. The TensorFlow documentation warns that the default epsilon may not be a good choice in general and recommends using a higher value in some cases. Indeed, I found that using the default makes the weights very small very fast and the learning process becomes unstable.
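
In TensorFlow 1.x terms, that optimizer setup amounts to roughly the following; initial_lr and steps_per_epoch are placeholder values standing in for whatever the training script actually uses:

import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
initial_lr, steps_per_epoch = 0.001, 1000   # hypothetical values

learning_rate = tf.train.exponential_decay(initial_lr, global_step,
                                           steps_per_epoch, 0.97,
                                           staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate,
                                   epsilon=0.1)  # raised from the 1e-8 default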

Non-Maximum Suppression

Because of the anchor matching strategy and the vast irregularity of the shapes we train on, the network will produce multiple overlapping detections of the same object. One way to get rid of duplicates is to perform non-maximum suppression. The algorithm is straightforward:

  • you pick your favorite box
  • you remove all the boxes that have the Jaccard overlap with your selection above a certain threshold
  • you choose your second favorite box and repeat step 2
  • you continue until there is no new favorite to select

This article provides a more detailed description, although their selection criterion is rather strange (the position of the lower-right corner) and the implementation is pretty inefficient. My code using numpy's bulk operations is here. I should reimplement it using TensorFlow tensors and will likely do that when I have a spare moment.
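
For reference, the greedy procedure above translates into a short numpy routine along these lines (a sketch; my actual implementation linked above is organized differently):

import numpy as np

def non_maximum_suppression(boxes, scores, threshold=0.45):
    # boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) confidences
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(-scores)  # favorite boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= threshold]  # drop overlapping boxes
    return keep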

Data Augmentation and Issues with Parallelism in Python

The SSD training depends heavily on data augmentation. I won't describe it at all here because the paper does a great job at that. The only tricky part that it does not mention is the fact that you do not clip any ground truth box if it happens to span outside the boundaries of a subsampled input image. See transforms.py if you want more details.

Things run much faster when the data is preprocessed in parallel before being fed to TensorFlow. However, the poor support for multithreading/multiprocessing in Python turned out to be a significant obstacle here. As you probably know, running your computation in multiple threads is utterly pointless in Python because the execution ends up being serial due to GIL issues. The GIL problem is typically addressed with multiprocessing. However, it comes with a separate can of worms.

First, if you want to transfer any significant amount of data between the processes efficiently, you need to avoid pickling and use the POSIX shared memory instead. It's not hugely complicated, but it's not trivial either. Second, if any of the packages you import uses threading underneath, you're almost guaranteed to encounter fork-safety issues. Add strange errors while forking CUDA-enabled libraries to the mix and you end up with a minor horror story. It took me about a full day of work to write and debug the shared memory queue and to debug the fork safety issues in the pipeline. In case you wonder, this code does the trick for the latter:

import os
import cv2
import multiprocessing as mp

workers = []
# hide the GPU and limit OpenCV threading while the workers fork
os.environ['CUDA_VISIBLE_DEVICES'] = ""
cv2_num_threads = cv2.getNumThreads()
cv2.setNumThreads(1)
for i in range(num_workers):
    args = (sample_queue, batch_queue)
    w = mp.Process(target=batch_producer, args=args)
    workers.append(w)
    w.start()
del os.environ['CUDA_VISIBLE_DEVICES']
cv2.setNumThreads(cv2_num_threads)

Pascal VOC and the mAP Metric

The Pascal VOC (Visual Object Classes) project provides standardized datasets for object class recognition as well as tools for evaluation and comparison of different detection methods. The datasets contain several thousands of annotated Flickr pictures. The metric they use for method comparison of object detection algorithms is called mAP - Mean Average Precision - and is an arithmetic mean of the AP (Average Precision) scores for each object class in the dataset.

The task of object detection is treated as a ranked document retrieval system (as in search) and the AP metric is an 11-point interpolated average precision. More specifically, the system:

  • sorts the detections of a given class in all the images of the dataset by confidence in descending order
  • loops over the detections and classifies them according to the following greedy algorithm:
    • if a detection overlaps with the ground truth object with the IoU score of 50% or more and the object has not been previously detected, it's a true positive
    • if IoU is above 50% but the object has been detected before, or the IoU is below 50%, it's a false positive
    • ground truth objects with no matching detections are false negatives
    • the precision and recall are calculated for the current state

Precision and recall data points calculated at each iteration contribute to the precision vs. recall curve which is then interpolated according to the following formula, sampled at 11 equally spaced recall points between 0 and 1, and averaged.

\[ p_{interp}(r) = \max_{r' \geq r} p(r') \]
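
Given the per-detection precision and recall values computed as above, the 11-point interpolated AP can be sketched like this (not the official Pascal VOC evaluation code):

import numpy as np

def average_precision_11pt(precision, recall):
    # precision[i] and recall[i] are numpy arrays holding the values after
    # processing the i-th detection, sorted by descending confidence
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        above = precision[recall >= r]
        ap += (above.max() if above.size else 0.0) / 11.0
    return ap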

The graph below shows what the curves for the bottle class look like when we decide to accept objects above different confidence thresholds. Note how the curves for lower confidence levels extend the ones for the higher levels.

Precision vs. Recall - Bottle class

Here are the AP values for the corresponding confidence thresholds:

Confidence | AP
0.01 | 0.497
0.10 | 0.471
0.30 | 0.353
0.50 | 0.270

The lower the confidence of the results we're willing to accept, the higher our AP gets, but the number of low-confidence false positives also grows. It makes perfect sense for a ranked document retrieval system: we care a lot about getting only relevant results in the first couple of pages of a Google search, but we don't care all that much if there are a bunch of false positives on the hundredth page. Does it make sense when it comes to object detection? That probably varies widely depending on your application. I would argue that, in the general case, when you just care about quality detections, it's somewhat confusing. Below are examples of detections in the same picture with boxes above the 0.5 and 0.01 confidence levels coming from the same SSD model. The parameters used to produce the second picture score a higher mAP over the entire dataset than the ones used to generate the first one.

Detections above 0.5 confidence

Detections above 0.01 confidence

You can get more info about it here.

Results

I trained a somewhat modified version of the vgg300 flavor of the detector on the union of VOC2007+VOC2012 trainval and VOC2007 test samples with heavy data augmentation. It scored 74.7% mAP when tested on the samples it trained on, while the reference score is around 77.5%. The result on the VOC2012 test samples was 68.71% with the reference at 75.8%. I did not use the same aspect ratio and scale settings as the ones utilized by the original implementation. Surprisingly, sticking to the reference parameters produced even worse results. Another reason for the discrepancy may be a different choice of the optimizer and the fact that the reference implementation doubles the learning rate for biases. Using different learning rates for different variables is possible in TensorFlow. However, I have not been able to do that without the system repeating the forward pass and most of the backward pass for each learning rate setting. It effectively almost doubled the training time per epoch, and I was not patient enough to wait for the results.

When I exported the model as a static inference graph, it took roughly 100MB, compared to around 1.3GB when in the checkpoint format. I then used it as a detector in the vehicle detection project I did some time ago. It processed 1261 frames of the testing video, including the FFmpeg compression and decompression time, in roughly 25 seconds reaching over 50 FPS on average. It's a blazing speed considering that my fairly inefficient SVM implementation took well over 8 minutes (~2.5 FPS) to process the same video. Note, however, that, due to the non-maximum suppression, the speed is a function of the number of positive predictions, and this video has relatively few detected objects. You can see the results below.

Vehicle detection with SSD

Conclusion

The project took quite a bit longer than I had initially anticipated but it was a great learning experience and ultimately a great deal of fun. With the hard negative mining, it was probably the most complicated loss function I have implemented in TensorFlow to date. I learned about adaptive feature map scaling, dug through a lot of Caffe's and TensorFlow's source code, learned about the stability of AdamOptimizer, and read a whole bunch of deep learning research papers. I wasted some time fighting mostly non-existent issues because I had not initially paid sufficient attention to what is measured by the accuracy metric. I have a bunch of ideas on how to improve the model to reach the reference performance and I will likely try some of them out in the near future.

All my code is here.

Update 10.03.2018: I have had a look at the PyTorch SSD implementation which achieves better results than mine in the VOC2012 test, but still lower than the baseline. I discovered that the way I did the data augmentation reflected what the paper describes but not what the original Caffe implementation does. I have updated the code in the repo to match the reference. I have also discovered a bug where the ground truth boxes produced by the sampler were sometimes too small to match any anchors. This behavior did not cause any runtime errors, but such samples did not contribute to the loss function and, therefore, had no impact on the optimization process. With these two changes, I was able to shrink the number of anchors used by my models to the level of the original implementation and reproduce my previous results. The performance of my code is still somewhat behind the original one. At this point, I am reasonably sure it's because of the base network weights I used. I will have a look at that when I have a spare moment.

Intro

The goal of this project is to plan a path for a car to make its way through highway traffic. You have at your disposal a map consisting of a set of waypoints. The waypoints follow the center of the road, and each of them comes with the normal vector. You also know the positions and velocities of nearby vehicles. Your car needs to obey the speed limit of 50 MPH (22.35 m/s), not collide with other traffic, keep acceleration within certain limits, and minimize jerk (the time derivative of acceleration). The path that you need to compute is a set of successive Cartesian coordinates that the car will visit perfectly every 0.02 seconds.

Here's the result I got. It's not completely optimal at times, but I used large safety margins and did not spend much time tweaking it.

Path Planning

Path planning

Udacity recommends using minimum jerk trajectories defined in the Frenet coordinate space to solve the path planning problem. This approach, however, struck me as suboptimal and hard to do well because of the distance distortions caused by the nonlinearity of the coordinate-space transforms between Cartesian and Frenet and back. Therefore, I decided to reuse the code I wrote for doing model-predictive control. However, instead of using actuations computed by the algorithm, I used the model states themselves as a source of coordinates that the vehicle should visit.

The controller follows a trajectory defined as a polynomial fitted to the waypoints representing the center of one of the lanes. In the first step, it computes 75 points starting from the current position of the car. In each successive step, it takes 25 points from the previously computed trajectory that the car has not yet visited, estimates the vehicle's state at the 25th point, and uses this estimated state as an input to the solver. The points produced this way complement the path. This cycle repeats itself every 250 ms with the target speed given by:

\[ v = \begin{cases} v_l - 0.25 \cdot (25 - d_l) & d_l \leq 25 \\ 22.2 & d_l \gt 25 \\ \end{cases} \]

Where \( v_l \) is the speed of the leader and \( d_l \) is the car's distance from it. Subtracting the proximity penalty tends to make speed "bounce" less than when using fractional terms. I was too lazy to tune a PID controller for this task.
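
The formula translates directly into code; a minimal version (speeds in m/s, distance in meters) looks like this:

def target_speed(leader_speed, leader_distance):
    # the proximity penalty kicks in when the leader is closer than 25 meters
    if leader_distance <= 25:
        return leader_speed - 0.25 * (25 - leader_distance)
    return 22.2  # just under the 22.35 m/s speed limit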

Lane selection

The optimal lane is selected using a simple finite state machine with cost functions associated with state transitions. Every three seconds, the algorithm evaluates these functions for all reachable states and chooses the cheapest move.

Lane Changing FSM

The cost function for keeping the current lane (the KL2KL transition) penalizes the difference in speed between the vehicle and the leader (if any) and their proximity. The one for changing lanes (the KL2CL transition) does a similar calculation for the target lane. Additionally, it increases the cost if a follower is close enough adding a further penalty proportional to its speed.

The start up state allows the car to ramp up to the target speed before the actual lane-change logic kicks in. Doing so avoids erratic behavior arising from the penalization of speed differences.

The exact logic is here.

Conclusions

Using the MPC approach has the advantage of letting you tune more parameters of the trajectory. You can choose how much acceleration and jerk you're going to tolerate, increasing or decreasing the perceived aggressiveness of lane changes.

I also tried incorporating the predicted trajectories of nearby vehicles into the solver's cost function, but the car ended up going out of the lane, accelerating or braking aggressively to avoid collisions. It is something that you would do while driving in the real world, but it breaks the rules of this particular game. Another problem was that I used a Naïve Bayes Classifier to predict the behavior of other vehicles, and it is not accurate enough. The mispredictions made the car behave erratically, trying to avoid collisions with non-existent obstacles.

You can see the full code here.

Intro

I have recently stumbled upon two articles (1, 2) about running TensorFlow on CPU-only setups. Out of curiosity, I decided to check how the kinds of models I use behave in such situations. As you will see below, the results were somewhat unexpected. I did not put in the time to investigate what went wrong, so my attempts to reason about the performance problems are pure speculation. Instead, I just ran my models with a bunch of different threading and OpenMP settings that people typically recommend on the Internet and hoped to get a drop-in alternative to my GPU setup. In particular, I did not convert my models to use the NCHW format as recommended by the Intel article. This data format conversion seems to be particularly important, and people report performance doubling in some cases. However, since my largest test case uses transfer learning, applying the conversion is a pain. If you happen to know how to optimize the settings better without major tweaking of the models, please do drop me a line.

Testing boxes

  • ti: My workstation
    • GPU: GeForce GTX 1080 Ti (11GB, Pascal)
    • CPU: 8 OS CPUs (Core i7-7700K, 1 packages x 4 cores/pkg x 2 threads/core (4 total cores))
    • RAM: 32GB (test data loaded from an SSD)
  • p2: An Amazon p2.xlarge instance
    • GPU: Tesla K80 (12GB, Kepler)
    • CPU: 4 OS CPUs (Xeon E5-2686 v4)
    • RAM: 60GB (test data loaded from a ramdisk)
  • m4: An Amazon m4.16xlarge instance
    • CPU: 64 OS CPUs (Xeon E5-2686 v4, 2 packages x 16 cores/pkg x 2 threads/core (32 total cores))
    • RAM: 256GB (test data loaded from a ramdisk)

TensorFlow settings

The GPU flavor was compiled with CUDA support; the CPU version was configured with only the default settings; the MKL flavor uses the MKL-ML library that TensorFlow's configuration script downloads automatically.

The GPU and the CPU setups run with the default session settings. The other configurations change the threading and OpenMP settings on a case-by-case basis. I use the following annotations when talking about the tests:

  • [xC,yT] means the KMP_HW_SUBSET envvar setting and the interop and intraop thread numbers set to 1.
  • [affinity] means the KMP_AFFINITY envvar set to granularity=fine,verbose,compact,1,0 and the interop thread number set to 2.
  • [intraop=x, interop=y] means the TensorFlow threading setting and no OpenMP setting.

More information on controlling thread affinity is here, and this is an article on managing thread allocation.
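
The benchmark scripts themselves are not shown in this post, so the snippet below is only an illustration of how the annotations above map onto code:

import os
import tensorflow as tf

# e.g. the [4C,2T] annotation: KMP_HW_SUBSET plus one thread per TF pool
os.environ['KMP_HW_SUBSET'] = '4C,2T'
config = tf.ConfigProto(intra_op_parallelism_threads=1,
                        inter_op_parallelism_threads=1)

# or the [affinity] annotation: pinned OpenMP threads, two interop threads
# os.environ['KMP_AFFINITY'] = 'granularity=fine,verbose,compact,1,0'
# config = tf.ConfigProto(inter_op_parallelism_threads=2)

session = tf.Session(config=config)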

Tests and Results

The test results are the times it took to train one epoch normalized to the result obtained using the ti-gpu configuration - if some score is around 20, it means that this setting is 20 times slower than the baseline.

LeNet - CIFAR10

The first test uses the LeNet architecture on the CIFAR-10 data. The MKL setup ran with [4C,2T] on ti and [affinity] on m4. The results are pretty surprising because the model consists almost exclusively of operations that Intel claims to have optimized. The fact that ti ran faster than m4 might suggest that there is some synchronization issue in the graph handling algorithms preventing it from processing a bunch of tiny images efficiently.

Road Sign Classifier

The second test is my road sign classifier. It uses mainly 2D convolutions and pooling, but they are interleaved with hyperbolic tangent activations as well as dropout layers. This fact probably prevents the graph optimizer from grouping the MKL nodes together, resulting in frequent data format conversions between NHWC and Intel's SIMD-friendly format. Also, ti scored better than m4 for the MKL version but not for the plain CPU implementation. This would suggest inefficiencies in the OpenMP implementation of threading.

Image Segmentation - KITTI (2 classes)

The third and the fourth tests run a fully convolutional neural network based on VGG16 for an image segmentation project. Apart from the usual suspects, this model uses transposed convolutions to handle learnable upsampling. The tests differ in the size of the input images and in the sizes of the weight matrices handled by the transposed convolutions. For the KITTI dataset, the ti-mkl config ran with [intraop=6, interop=6] and m4-mkl with [affinity].

Image Segmentation - Cityscapes (29 classes)

For the Cityscapes dataset, ti-mkl ran with [intraop=6, interop=6] and m4-mkl with [intraop=44, interop=6]. Here the MKL config was as fast as the baseline CPU configs only for the dataset with fewer classes and thus smaller upsampling layers. The slowdown for the dataset with more classes can probably be explained by the difference in the handling of the transposed convolution nodes.

Conclusions

It was an interesting experience that left me with mixed feelings. On the one hand, the best baseline CPU implementation, built with nothing more than compiler optimizations, was at worst two to four times slower than the Amazon P2 GPU setup. It's a much better outcome than I had expected. On the other hand, the MKL support was a disappointment. To be fair, that is in large part because of my refusal to spend enough time tweaking the parameters, but hey, it was supposed to be a drop-in replacement, and I don't need to do any of this when using a GPU. Another reason is that TensorFlow probably has too few MKL-based kernels to be worth using in this mode, and the frequent data format conversions kill the performance. I have also noticed the MKL setup not making any progress with some threading configurations despite all the cores being busy. I might have hit the Intel Hyperthreading bug.

Notes on building TensorFlow

The GPU versions were compiled with GCC 5.4.0, CUDA 8.0 and cuDNN 6. The ti configuration used CUDA capability 6.1, whereas the p2 configuration used 3.7. The compiler flags were:

  • ti: -march=core-avx-i -mavx2 -mfma -O3
  • p2: -march=broadwell -O3

The CPU versions were compiled with GCC 7.1.0 with the following flags:

  • ti: -march=skylake -O3
  • m4: -march=broadwell -O3

I tried compiling the MKL version with the additional -DEIGEN_USE_MKL_VML flag but got worse results.

The MKL library is poorly integrated with TensorFlow's build system. For some strange reason, the configuration script creates a link to libdl.so.2 inside the build tree, which results in the library being copied to the final wheel package. Doing so is a horrible idea because, in glibc, libdl.so mostly provides an interface to libc.so's private API, so a system update may break the TensorFlow installation. Furthermore, the way in which it figures out which library to link against is broken. The configuration script uses the locate utility to find all files named libdl.so.2 and picks the first one from the list. Now, locate is not installed on Ubuntu or Debian by default, so if you did not do:

]==> sudo apt-get install locate
]==> sudo updatedb

at some point in the past, the script will be killed without an error message, leaving the source tree unconfigured. Moreover, the first pick is usually the wrong one. If you run a 64-bit version of Ubuntu with multilib support, the script will end up choosing a 32-bit version of the library. I happen to hack on glibc from time to time, so in my case, it ended up picking one that was cross-compiled for a 64-bit ARM system.

I have also tried compiling Eigen with full MKL support as suggested in this thread. However, the Eigen's and MKL's BLAS interfaces seem to be out of sync. I attempted to fix the situation but gave up when I noticed Eigen passing floats to MKL functions expecting complex numbers using incompatible data types. I will continue using the GPU setup, so fixing all that and doing proper testing was way more effort than I was willing to make.

Note 14.07.2017: My OCD took the upper hand again and I figured it out. Unfortunately, it did not improve the numbers at all.

What is it about?

Semantic segmentation is a process of dividing an image into sets of pixels sharing similar properties and assigning to each of these sets one of the pre-defined labels. Ideally, you would like to get a picture such as the one below. It's a result of blending color-coded class labels with the original image. This sample comes from the CityScapes dataset.

Segmented Image

Segmentation Classes

How is it done?

Figuring out object boundaries in an image is hard. There's a variety of "classical" approaches that take colors and gradients into account and have obtained encouraging results; see this paper by Shi and Malik for example. However, in 2015 and 2016, Long, Shelhamer, and Darrell presented a method using Fully Convolutional Networks that significantly improved the accuracy (the mean intersection over union metric) and the inference speed. My goal was to replicate their architecture and use it to segment road scenes.

A fully convolutional network differs from a regular convolutional network in that it has the final fully-connected classifier stripped off. Its goal is to take an image as input and produce an equally-sized output in which each pixel is represented by a softmax distribution describing the probability of that pixel belonging to a given class. I took this picture from one of the papers mentioned above:

Fully Convolutional Network

For the results presented in this post, I used the pre-trained VGG16 network provided by Udacity for the beta test of their Advanced Deep Learning Capstone. I took layers 3, 4, and 7 and combined them in the manner described in the picture below, which, again, is taken from one of the papers by Long et al.

Upscaling and merging

First, I used 1x1 convolutions on top of each extracted layer to act as local classifiers. After doing that, these partial results are still 32, 16, and 8 times smaller than the input image, so I needed to upsample them (see below). Finally, I used a weighted addition to obtain the result. The authors of the original paper report that without the weighting the learning process diverges.
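A rough sketch of this combination, assuming the upsample helper described in the next section; the layer and score names are mine, and the addition weights are placeholders, not the values I actually used:

import tensorflow as tf

def decoder(vgg_layer3, vgg_layer4, vgg_layer7, num_classes):
    # 1x1 convolutions acting as local, per-pixel classifiers
    score3 = tf.layers.conv2d(vgg_layer3, num_classes, 1, name='score3')
    score4 = tf.layers.conv2d(vgg_layer4, num_classes, 1, name='score4')
    score7 = tf.layers.conv2d(vgg_layer7, num_classes, 1, name='score7')

    # Bring all three branches back to the input resolution
    up3 = upsample(score3, num_classes, 8,  'up3')   # 8x smaller
    up4 = upsample(score4, num_classes, 16, 'up4')   # 16x smaller
    up7 = upsample(score7, num_classes, 32, 'up7')   # 32x smaller

    # Weighted addition; the weights below are placeholders that need tuning,
    # but without some form of weighting the training diverges
    w3, w4, w7 = 0.0001, 0.01, 1.0
    return w3 * up3 + w4 * up4 + w7 * up7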

Learnable Upsampling

Upsampling is done by applying a process called transposed convolution. I will not describe it here because this post over at cv-tricks.com does a great job of that. I will just say that transposed convolutions (just like the regular ones) use learnable weights to produce their output. The trick here is the initialization of those weights. Instead of the truncated normal distribution, you initialize them in such a way that the convolution operation performs a bilinear interpolation. It's easy and interesting to test whether the implementation works correctly: when fed an image, it should produce the same image but n times larger.

import sys

import cv2
import numpy as np
import tensorflow as tf

# upsample() comes from the module linked below
img = cv2.imread(sys.argv[1])
print('Original size:', img.shape)

imgs = np.zeros([1, *img.shape], dtype=np.float32)
imgs[0, :, :, :] = img

img_input = tf.placeholder(tf.float32, [None, *img.shape])
upscale = upsample(img_input, 3, 8, 'upscaled')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    upscaled = sess.run(upscale, feed_dict={img_input: imgs})

print('Upscaled:', upscaled.shape[1:])
cv2.imwrite(sys.argv[2], upscaled[0, :, :, :])

Where upsample is defined here.
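For illustration only, here is a minimal sketch of what such a bilinear-initialized transposed convolution can look like; the helper names and exact parameters are mine, not necessarily those of the linked module:

import numpy as np
import tensorflow as tf

def bilinear_weights(factor, channels):
    """Transposed-convolution weights that perform bilinear upsampling."""
    size = 2 * factor - factor % 2
    center = (size - 1) / 2
    og = np.ogrid[:size, :size]
    kernel = (1 - abs(og[0] - center) / factor) * \
             (1 - abs(og[1] - center) / factor)
    weights = np.zeros((size, size, channels, channels), dtype=np.float32)
    for i in range(channels):
        weights[:, :, i, i] = kernel  # each channel is upsampled independently
    return weights

def upsample(x, channels, factor, name):
    size = 2 * factor - factor % 2
    return tf.layers.conv2d_transpose(
        x, channels, size, strides=factor, padding='same',
        kernel_initializer=tf.constant_initializer(
            bilinear_weights(factor, channels)),
        name=name)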

Datasets

I was mainly interested in road scenes, so I played with the KITTI Road and CityScapes datasets. The first one has 289 training images with two labels (road/not road) and 290 testing samples. The second one has 2975 training, 500 validation, and 1525 testing pictures taken while driving around large German cities. It has fine-grained annotations for 29 classes (including "unlabeled" and "dynamic"). The annotations are color-based and look like the picture below.

Picture Labels

Even though I concentrated on those two datasets, both the training and the inference software are generic and can handle any pixel-labeled dataset. All you need to do is create a new source_xxxxxx.py file defining your custom samples. The definition is a class that contains seven attributes:

  • image_size - self-evident, both horizontal and vertical dimensions need to be divisible by 32
  • num_classes - number of classes that the model is supposed to handle
  • label_colors - a dictionary mapping a class number to a color; used for blending the classification results with the input image
  • num_training - number of training samples
  • num_validation - number of validation samples
  • train_generator - a generator producing training batches
  • valid_generator - a generator producing validation batches

See source_kitti.py or source_cityscapes.py for a concrete example. The training script picks the source based on the value of the --data-source parameter.
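A minimal sketch of what such a module might look like; apart from the attribute names, everything below is made up for illustration, and whether the generators take a batch size is an implementation detail of the real sources:

# source_example.py - a hypothetical data source
import numpy as np

class ExampleSource:
    def __init__(self):
        self.image_size     = (256, 512)   # both dimensions divisible by 32
        self.num_classes    = 2
        self.label_colors   = {0: (0, 0, 0), 1: (255, 0, 255)}
        self.num_training   = 289
        self.num_validation = 29

    def train_generator(self, batch_size):
        """Yield (images, labels) batches; labels are one-hot per pixel."""
        h, w = self.image_size
        while True:
            images = np.zeros((batch_size, h, w, 3), dtype=np.float32)
            labels = np.zeros((batch_size, h, w, self.num_classes),
                              dtype=np.float32)
            labels[..., 0] = 1   # dummy data: every pixel gets class 0
            yield images, labels

    def valid_generator(self, batch_size):
        return self.train_generator(batch_size)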

Normalization

Typically, you would normalize the input dataset such that its mean is at zero and its standard deviation is at one. It significantly improves the convergence of the gradient optimization. In the case of the VGG model, the authors just zeroed the mean without scaling the variance (see section 2.1 of the paper). Assuming that the model was trained on the ImageNet dataset, the mean values for each channel are μR = 123.68, μG = 116.779, μB = 103.939. The pre-trained model provided by Udacity already has a pre-processing layer handling these constants. Judging from the way it does it, it expects plain BGR values scaled between 0 and 255 as input.
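The pre-processing itself boils down to a simple per-channel subtraction; a sketch, assuming BGR input in the 0-255 range:

import numpy as np

# ImageNet channel means from the VGG paper, reordered to match the
# BGR channel order expected by the model
VGG_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def preprocess(image_bgr):
    """Zero the mean of each channel without touching the variance."""
    return image_bgr.astype(np.float32) - VGG_MEAN_BGR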

Label Validation

Since the network outputs softmaxed logits for each pixel, the training labels need to be one-hot encoded. According to the TensorFlow documentation, each row of labels needs to be a proper probability distribution. Otherwise, the gradient calculation will be incorrect and the whole model will diverge. So, you need to make sure that you're never in a situation where you have all zeros or multiple ones in your label vector. I have made this mistake so many times that I decided to write a checker script for my data source modules. It produces examples of training images blended with their pixel labels to verify that the color maps have been defined correctly. It also checks every pixel in every sample to see if the label rows are indeed valid. See here for the source.
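The per-pixel part of that check is straightforward; a sketch of it (the real checker linked above also renders the blended images):

import numpy as np

def labels_valid(labels):
    """Check that every pixel has exactly one class set in its one-hot vector.

    `labels` is expected to have shape [height, width, num_classes]
    with entries that are either 0 or 1.
    """
    binary = np.all((labels == 0) | (labels == 1))
    one_hot = np.all(labels.sum(axis=-1) == 1)
    return binary and one_hot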

Initialization of variables

Initialization of variables is a bit of a pain in TensorFlow. You can use the global initializer if you create and train your model from scratch. However, when you want to do transfer learning - load a pre-trained model and extend it - there seems to be no convenient way to initialize only the variables that you created. I ended up doing acrobatics like this:

# Query the initialization status of every variable in one session run...
uninit_vars    = []
uninit_tensors = []
for var in tf.global_variables():
    uninit_vars.append(var)
    uninit_tensors.append(tf.is_variable_initialized(var))
uninit_bools = sess.run(uninit_tensors)

# ...and initialize only the ones that are not initialized yet
uninit = zip(uninit_bools, uninit_vars)
uninit = [var for init, var in uninit if not init]
sess.run(tf.variables_initializer(uninit))

Training

For training purposes, I reshaped both the labels and the logits in such a way that I ended up with 2D tensors for both. I then used tf.nn.softmax_cross_entropy_with_logits as a measure of loss and minimized it with AdamOptimizer and a learning rate of 0.0001. The model trained on the KITTI dataset for 500 epochs - 14 seconds per epoch on my GeForce GTX 1080 Ti. The CityScapes dataset took 150 epochs to train - 9.5 minutes per epoch on my GeForce vs. 25 minutes per epoch on an AWS P2 instance. The model exhibited some overfitting. However, the visual results seemed tighter the longer it trained. In the validation examples below, the top row contains the ground truth and the bottom one the inference results (TensorBoard rocks! :).
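As for the code, the loss setup amounts to roughly the sketch below; the function and tensor names are mine, with logits coming from the decoder and labels from the data source:

import tensorflow as tf

def build_train_op(logits, labels, num_classes, learning_rate=0.0001):
    # Flatten the spatial dimensions so that both tensors are 2D:
    # [num_pixels, num_classes]
    logits = tf.reshape(logits, [-1, num_classes])
    labels = tf.reshape(labels, [-1, num_classes])

    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
    loss = tf.reduce_mean(cross_entropy)

    optimizer = tf.train.AdamOptimizer(learning_rate)
    return optimizer.minimize(loss), loss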

CityScapes Validation Examples

CityScapes Validation Loss

CityScapes Training Loss

Results

The inference (including image processing) takes 80 milliseconds per image on average for CityScapes and 27 milliseconds for KITTI. Here are some examples from both datasets. The model seems to be able to distinguish a pedestrian from a bike rider with some degree of accuracy, which is pretty impressive!

CityScapes Example #1

CityScapes Example #2

KITTI Example #1

KITTI Example #2

Go here for the full code.

A month ago or so, I wrote a post about installing TensorFlow 1.1.0 on Jetson TX1. This post is an update for 1.2.0 which has one additional issue on top of the ones discussed previously. The problem is that Eigen is missing some template specializations when used on ARM. The bug has been fixed, but you need to make the TensorFlow build use the fixed version.

diff --git a/tensorflow/workspace.bzl b/tensorflow/workspace.bzl
index 2a206b0ac..f44a17405 100644
--- a/tensorflow/workspace.bzl
+++ b/tensorflow/workspace.bzl
@@ -150,11 +150,10 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
   native.new_http_archive(
       name = "eigen_archive",
       urls = [
-          "http://mirror.bazel.build/bitbucket.org/eigen/eigen/get/f3a22f35b044.tar.gz",
-          "https://bitbucket.org/eigen/eigen/get/f3a22f35b044.tar.gz",
+          "https://bitbucket.org/eigen/eigen/get/d781c1de9834.tar.gz",
       ],
-      sha256 = "ca7beac153d4059c02c8fc59816c82d54ea47fe58365e8aded4082ded0b820c4",
-      strip_prefix = "eigen-eigen-f3a22f35b044",
+      sha256 = "a34b208da6ec18fa8da963369e166e4a368612c14d956dd2f9d7072904675d9b",
+      strip_prefix = "eigen-eigen-d781c1de9834",
       build_file = str(Label("//third_party:eigen.BUILD")),
   )

The build instructions are the same as for the previous versions, but you need to checkout the v1.2.0-jetson-tx1 tag from my repository to get all the fixes.

Intro

I had expected a smooth ride with this one, but it turned out to be quite an adventure, and not a pleasant one. To be fair, the likely reason why it's such a horror story is that I was bootstrapping bazel - the build software that TensorFlow uses - on an unsupported system. I spent more time figuring out the dependency issues related to that than working on TensorFlow itself. This post was initially supposed to be a rant about Java dependency hell. However, in the end, my stubbornness took the upper hand, and I did not go to sleep until it all worked, so you have a HOWTO instead.

Prerequisites

Jetson TX1

You'll need the board itself and the following installed on it:

  • Jetpack 3.0
  • L4T 24.2.1
  • Cuda Toolkit 8.0.34-1
  • cuDNN 5.1.5-1+cuda8.0

Dependencies

A Java Development Kit

First of all, you'll need a Java compiler and related utilities. Just type:

]==> sudo apt-get install default-jdk

It would not have been worth a separate paragraph, except that the version that comes with the system messes up the CA certificates. You won't be able to download things from GitHub without overriding SSL warnings. I fixed that by installing ca-certificates and ca-certificates-java from Debian.

Protocol Buffers

You'll need exactly the two versions mentioned below. No other versions work down the road. I learned this the hard way. Be sure to call autogen.sh on the master branch first - it needs to download gmock, and the link in the older tags no longer works.

]==> sudo apt-get install curl
]==> git clone https://github.com/google/protobuf.git
]==> cd protobuf
]==> ./autogen.sh

This version is needed for the gRPC Java codegen plugin.

]==> git checkout v3.0.0-beta-3
]==> ./autogen.sh
]==> ./configure --prefix=/home/ljanyst/Temp/protobuf-3.0.0-beta-3
]==> make -j12 && make install

This one is needed by Bazel itself.

]==> git checkout v3.0.0
]==> ./autogen.sh
]==> ./configure --prefix=/home/ljanyst/Temp/protobuf-3.0.0
]==> make -j12 && make install

gRPC Java

Building this one took me a horrendous amount of time. At first, I thought that the whole package was needed. Apart from problems with the protocol buffer versions, it has some JNI dependencies that are problematic to compile. Even after I had successfully produced them, they had interoperability issues with other dependencies. After some digging, it turned out that only one component of the package is actually required, so the whole effort was unnecessary. Of course, the source needed patching to make it build on aarch64, but I won't bore you with that. Again, make sure you use the v0.15.0-jetson-tx1 tag - no other tag will work.

]==> git clone https://github.com/ljanyst/grpc-java.git
]==> cd grpc-java
]==> git checkout v0.15.0-jetson-tx1
]==> echo protoc=/home/ljanyst/Temp/protobuf-3.0.0-beta-3/bin/protoc > gradle.properties
]==> CXXFLAGS=-I/home/ljanyst/Temp/protobuf-3.0.0-beta-3/include \
     LDFLAGS=-L/home/ljanyst/Temp/protobuf-3.0.0-beta-3/lib \
     ./gradlew java_pluginExecutable

Bazel

The latest available release of Bazel (0.4.5) does not build on aarch64 without the patch listed below. I took it from the master branch.

diff --git a/src/main/java/com/google/devtools/build/lib/util/CPU.java b/src/main/java/com/google/devtools/build/lib/util/CPU.java
index 7a85c29..ff8bc86 100644
--- a/src/main/java/com/google/devtools/build/lib/util/CPU.java
+++ b/src/main/java/com/google/devtools/build/lib/util/CPU.java
@@ -25,7 +25,7 @@ public enum CPU {
   X86_32("x86_32", ImmutableSet.of("i386", "i486", "i586", "i686", "i786", "x86")),
   X86_64("x86_64", ImmutableSet.of("amd64", "x86_64", "x64")),
   PPC("ppc", ImmutableSet.of("ppc", "ppc64", "ppc64le")),
-  ARM("arm", ImmutableSet.of("arm", "armv7l")),
+  ARM("arm", ImmutableSet.of("arm", "armv7l", "aarch64")),
   S390X("s390x", ImmutableSet.of("s390x", "s390")),
   UNKNOWN("unknown", ImmutableSet.<String>of());

The compilation is straightforward, but make sure you point to the right version of protocol buffers and the gRPC Java compiler built earlier.

]==> git clone https://github.com/bazelbuild/bazel.git
]==> cd bazel
]==> git checkout 0.4.5
]==> export PROTOC=/home/ljanyst/Temp/protobuf-3.0.0/bin/protoc
]==> export GRPC_JAVA_PLUGIN=/home/ljanyst/Temp/grpc-java/compiler/build/exe/java_plugin/protoc-gen-grpc-java
]==> ./compile.sh
]==> export PATH=/home/ljanyst/Temp/bazel/output:$PATH

TensorFlow

  • Note 22.06.2017: Go here for TensorFlow 1.2.0. See the v1.2.0-jetson-tx1 tag.
  • Note 03.07.2017: For TensorFlow 1.2.1, see the v1.2.1-jetson-tx1 tag.
  • Note 18.08.2017: For TensorFlow 1.3.0, see the v1.3.0-jetson-tx1 tag. I have tested it against JetPack 3.1, which fixes the CUDA-related bugs. Note that the kernel in this version of JetPack has been compiled without swap support, so you may want to add --local_resources=2048,0.5,0.5 to the bazel command line if you want to avoid the out-of-memory kills.
  • Note 21.11.2017: TensorFlow 1.4.0 builds on the Jetson without any modification. However, you will need a newer version of bazel. This is the combination of versions that worked for me:
    • protocol buffers 3.4.0
    • grpc-java 1.6.1 with the ARM patch
    • bazel 0.7.0 with this and this patch

Patches

The version of the CUDA toolkit for this device is somewhat handicapped. nvcc has problems with variadic templates, and compiling some kernels that use Eigen makes it crash. I found that adding:

#define EIGEN_HAS_VARIADIC_TEMPLATES 0

to the problematic files makes the problem go away. A constructor with an initializer list seems to be an issue in one of the cases as well. Using the default constructor instead, and then initializing the array elements one by one, makes things go through.

Also, the cuBLAS API seems to be incomplete. It only defines 5 GEMM (General Matrix to Matrix Multiplication) algorithms, whereas the newer patch releases of the toolkit define 8. TensorFlow enumerates them by name to experimentally determine which one is best for a given computation, and the code notes that they may fail under perfectly normal circumstances (i.e., on a GPU older than sm_50). Therefore, simply omitting the missing algorithms should be perfectly safe.

diff --git a/tensorflow/stream_executor/cuda/cuda_blas.cc b/tensorflow/stream_executor/cuda/cuda_blas.cc
index 2c650af..49c6db7 100644
--- a/tensorflow/stream_executor/cuda/cuda_blas.cc
+++ b/tensorflow/stream_executor/cuda/cuda_blas.cc
@@ -1912,8 +1912,7 @@ bool CUDABlas::GetBlasGemmAlgorithms(
 #if CUDA_VERSION >= 8000
   for (cublasGemmAlgo_t algo :
        {CUBLAS_GEMM_DFALT, CUBLAS_GEMM_ALGO0, CUBLAS_GEMM_ALGO1,
-        CUBLAS_GEMM_ALGO2, CUBLAS_GEMM_ALGO3, CUBLAS_GEMM_ALGO4,
-        CUBLAS_GEMM_ALGO5, CUBLAS_GEMM_ALGO6, CUBLAS_GEMM_ALGO7}) {
+        CUBLAS_GEMM_ALGO2, CUBLAS_GEMM_ALGO3, CUBLAS_GEMM_ALGO4}) {
     out_algorithms->push_back(algo);
   }
 #endif

The full patches are here and here.

Memory Consumption

The compilation process may take considerable amounts of RAM - more than the device has available. The documentation advises using only one execution thread (the --local_resources 2048,.5,1.0 param for Bazel) so that you don't get OOM kills. Most of the time this is unnecessary, though, because the memory only fills up completely during the last 20% of the compilation steps. Instead, I used an SD card as a swap device.

]==> sudo mkswap /dev/mmcblk1p2
]==> sudo swapon /dev/mmcblk1p2

At peak times, the entire RAM and around 7.5GB of swap were used. However, at most 5 to 6 compilation threads were in the D state (uninterruptible sleep due to IO), with 2 to 3 being runnable.

Compilation

You need to install these packages before you can proceed.

]==> sudo apt-get install python3-numpy python3-dev python3-pip
]==> sudo apt-get install python3-wheel python3-virtualenv

Then clone my repo containing the necessary patches and configure the source. I used the system version of Python 3, located at /usr/bin/python3, with its default library in /usr/lib/python3/dist-packages. The answers to the CUDA-related questions are:

  • the version of the SDK is 8.0;
  • the version of cuDNN is 5.1.5, and it's located in /usr;
  • the CUDA compute capability for TX1 is 5.3.

Go ahead and run:

]==> git clone https://github.com/ljanyst/tensorflow.git
]==> cd tensorflow
]==> git checkout v1.1.0-jetson-tx1
]==> ./configure

Finally, run the compilation, and, two hours and change later, build the wheel:

]==> bazel build --config=opt --config=cuda --curses=no --show_task_finish \
     //tensorflow/tools/pip_package:build_pip_package
]==> bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
]==> cp /tmp/tensorflow_pkg/*.whl ../

I don't particularly like polluting system directories with custom-built binaries, so I use virtualenv to handle pip-installed Python packages.

]==> mkdir -p ~/Apps/virtualenvs/tensorflow
]==> cd ~/Apps/virtualenvs/tensorflow
]==> python3 /usr/lib/python3/dist-packages/virtualenv.py -p /usr/bin/python3 .
]==> . ./bin/activate
(tensorflow) ]==> pip install ~/Temp/tensorflow-1.1.0-cp35-cp35m-linux_aarch64.whl

The device is identified correctly when starting a new TensorFlow session. You should see the following if you don't count warnings about NUMA:

Found device 0 with properties:
name: NVIDIA Tegra X1
major: 5 minor: 3 memoryClockRate (GHz) 0.072
pciBusID 0000:00:00.0
Total memory: 3.90GiB
Free memory: 2.11GiB
DMA: 0
0:   Y
Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0)

The CPU and the GPU share the memory controller, so the GPU does not have the 4GB just for itself. On the upside, you can use the CUDA unified memory model without penalties (no memory copies).

Benchmark

I ran two benchmarks to see if things work as expected. The first one was my TensorFlow implementation of LeNet training on and classifying the MNIST data. The training code ran twice as fast on the TX1 compared to my 4th-generation Carbon X1 laptop. The second test was my slightly enlarged implementation of Sermanet applied to classifying road signs. The convolutional part of the training process took roughly 20 minutes per epoch, which is a factor of two improvement over the performance of my laptop. The pipeline was implemented with a large device in mind, though, and expected 16GB of RAM. The TX1 has only 4GB, so the swap speed was a bottleneck here. Based on my observations of the processing speed of individual batches, I can speculate that a further factor-of-two improvement is possible with a properly optimized pipeline.

The algorithm

I took courses on probability, statistics, and Monte Carlo methods while I was at school, but the ubiquity of algorithms based on randomness still amazes me. Imagine that you are a robot moving in a 2D space. You have a map of the area, and you know your rough initial position in it. Every step you take, you get data from the controls (speed and angular speed) that is quite noisy. You can also sense obstacles within a certain range and with a certain precision, relative to your current position and heading. How do you figure out where you are?

One good strategy for solving this problem is to use a particle filter. You start by creating N "particles," or guesses as to where you are, by drawing the x and y position as well as the heading from a Gaussian distribution around your initial estimate. Then, every step you take, for every particle, you:

  • move according to the data you got from controls taking the noise into account;
  • match the sensor data to the landmarks on the map using the perspective of the particle;
  • assign weight to the particle based on how well the observation matches the map.

Finally, you draw with replacement N particles from the initial set with probability proportional to the weights. The particles that match the observation well will likely be drawn multiple times, and those that don't are unlikely to be drawn at all. The repetitions are not a problem because the movement is noisy, so they will diverge after the next step you take. The particle with the highest weight is your best estimate of your actual position and heading.
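The resampling step sounds more mysterious than it is; here is a sketch of it in NumPy (the weight computation depends on the sensor model and is omitted):

import numpy as np

def resample(particles, weights):
    """Draw len(particles) particles with replacement, with probability
    proportional to their weights."""
    weights = np.asarray(weights, dtype=np.float64)
    weights /= weights.sum()   # normalize to a proper probability distribution
    indices = np.random.choice(len(particles), size=len(particles), p=weights)
    return [particles[i] for i in indices]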

The result

I did an experiment on the Udacity data, and the approach using 1000 particles turned out to work well compared to using just one. The average deviation from the ground truth was around 10 cm. Using one particle effectively ignores the observation data and relies only on the controls. You can follow the blue diamond in the video below to see how fast the effects of the noise accumulate. Both cases use the same noise values.

Particle filter localization - 1000 particles vs 1