Recent Content

Table of Contents

  1. Chassis & Assembly
  2. Making Bits Flow and Disks Spin


Alright, we have the chassis, the disks, the board, and all the other stuff nicely put together. It is time to put some life into the silicon and the wires and make the disks work. The goal is to have a fully functional and glitch-free Linux system that can spin the disks up and down.

A year ago or so, I wrote a post about installing Linux on the NanoPi M4 board. Most of it is still applicable, except for U-Boot and some of the Linux kernel configuration. U-Boot has progressed nicely, and, as of version 2021.1, no binary crapware is necessary to run the board stably. The memory timing (DRAM needs to be refreshed to work) issues have been fixed. Let's start from the beginning, though.

Fun with Bootloaders

There are about half a quadrillion ways in which you can integrate an ARM processing core into your system. It means that the boot-up process needs to be flexible and, more often than not, ends up being a pretty weird beast. Perhaps the situation is similar on PCs, but I don't know enough about it yet to judge. I will play more with a PC boot-up process as one of my following toy projects.

Sometimes things are easy. You always have to start with some boot-up code in ROM that does some very basic device initialization, looks for a bootloader (i.e., on an SD card), loads that bootloader to SRAM, and jumps to executing it. The bootloader is loaded to SRAM because dynamically initializing DRAM timings using SPD or even statically is too much to do for the ROM bootloader (a.k.a. the Primary Program Loader). The code required to do that is usually too large. The bootloader then initializes the DRAM and whatever else is needed, loads the OS kernel to DRAM, and jumps to it.

The bootloader code itself may be pretty big, though, with all the filesystem, USB, network, and other drivers. It may not fit into SRAM, which is expensive and usually very limited. In that case, the bootloader could be split into two parts: the Secondary Program Loader (SPL) and the bootloader proper. The ROM code then loads the SPL to SRAM, the SPL sets up the DRAM, loads the bootloader proper to DRAM, and starts executing it.

Still, the amount of available SRAM may be so low (e.g., 4K) that having a big enough SPL to dynamically configure DRAM using SPD is impossible. There are usually other things we could still do. For example, in many such platforms, an L2 CPU cache could be locked and used as SRAM. To support those cases, people came up with Tertiary Program Loaders. The ROM code starts with loading the SPL to the small SRAM. The SPL locks the L2 cache to use it as a larger SRAM bank and loads the TPL there. The TPL's role is the same as the SPL's was in the previous paragraph.

NanoPi M4 with a SATA shield
NanoPi M4 with a SATA shield

On the Rockchip RK3399 SOC in the NanoPi M4 things are even more peculiar. The details are described here and here in section 1.2. It boils down to the following:

  • The ROM code looks for the bootloader image on various devices.
  • The ROM code loads the DRAM initializer part of the bootloader to SRAM and runs it. The DRAM initializer's role is typically played either by U-Boot's TPL or by Rockchip's proprietary code.
  • When the DRAM initializer is done, it passes the control back to the ROM code.
  • The ROM code loads small boot code to DRAM and runs it. This role is played either by the U-Boot's SPL or by the Rockchip's proprietary mini loader.
  • The boot code loads the ARM Trusted Firmware stage BL31 (more below) at EL3 (Exception Level 3, more here) as a secure monitor.
  • The trusted firmware runs U-Boot proper at EL2.

ARM Trusted Firmware

To quote ARM's documentaiton:

Arm TrustZone technology is used on billions of application processors to protect high-value code and data. Arm TrustZone technology offers an efficient, system-wide approach to security with hardware-enforced isolation built into the CPU.

That's the theory and probably even the practice. That said, I have mostly seen it used as an abstraction/mediation layer between the OS and the hardware so that the OS can have a uniform way to manage the power settings and devices and not mess up too much. I am probably quite biased because I do not care about the DRM use cases or anything similar.

The Trusted Firmware on Aarch64 divides the boot stages as follows:

  • Boot Loader stage 1 (BL1) AP Trusted ROM
  • Boot Loader stage 2 (BL2) Trusted Boot Firmware
  • Boot Loader stage 3-1 (BL31) EL3 Runtime Software
  • Boot Loader stage 3-2 (BL32) Secure-EL1 Payload (optional)
  • Boot Loader stage 3-3 (BL33) Non-trusted Firmware

You can read the detailed documentation for all these here. On Rockchip, the BL1 and BL2 stages are supplied by U-Boot's TPL/SPL duo or the Rockchip's proprietary equivalents. BL31 is implemented by ARM Trusted Firmware. I omit BL32, but you could use something like OP-TEE, and the Rockchip's bootloader likely includes something of their own making. I never needed anything like this, so I don't know much about it. BL33 is U-Boot proper.

Going back to BL31, the documentation says that it redoes some of the configuration done by the previous stages of the boot process, installs its own exception handlers, and is responsible for providing a bunch of services to the operating system. One of them is Power State Coordination Interface (PSCI) that switches the CPU cores on and off, performs reboots, etc. On platforms like Xilinx's ZynqMP, I have seen it managing access to clock sources.

The Bootloader Build Instructions

Okay, all the theory is nice, but we need to get the board to boot. The first step is to build the BL31 stage of the ARM Trusted Firmware.

git clone
cd arm-trusted-firmware
git checkout v2.4
make CROSS_COMPILE=aarch64-linux-gnu- PLAT=rk3399 bl31

Then, you need to build U-Boot and make the SD card bootable:

git clone
cd u-boot
git checkout v2021.01
export BL31=/path/to/arm-trusted-firmware/to/bl31.elf
make nanopi-m4-rk3399_defconfig
sudo dd if=u-boot-rockchip.bin of=/dev/sda seek=64

Installing Linux

We are now free of the Rochkchip's BL32 stage, and the ARM's BL31 stage executes solely in SRAM. Therefore, there is no longer any inaccessible undeclared RAM allocated for the EL3-level firmware and nothing for Linux to trip over. The Linux memory patch described in the previous post is no longer necessary.

Configuration-wise, we need to add a couple of features to the kernel that are generally not meant for ARM and thus not enabled by defconfig. These are RAID-4/RAID-5/RAID-6 mode and Crypt target support, and you can find them in Device Drivers -> Multiple devices driver support (RAID and LVM). We need those for the RAID setup and the LUKS encryption. I also want my NAS device to serve NFS; this requires some kernel support which is enabled by ticking NFS server support for NFS version 4 in File systems -> Network File Systems -> NFS server support. That's pretty much it. Everything else should work as described in the other article.

Making Disks Quiet

My NAS box lives in my living room, so I want my disks absolutely quiet whenever they are not used. Seagate has their "Extended Power Conditions - PowerChoice" thingy that just works, as opposed to the WD disks that are tricky to spin down and seemingly unable to do that without manual intervention from the OS. Seagate provides a bunch of open-source tools to manage the disk settings and query the status.

git clone --recursive
cd openSeaChest/Make/gcc
make release

The build process creates a bunch of executables in the openseachest_exes directory.

The EPC functionality mentioned above supports four power-saving modes that you can set up to kick in after a certain amount of time has passed since the last activity. Here's a summary of these modes and the settings I use:

Mode Description My timer
idle_a Electronics in power saving mode 100 ms
idle_b Heads unloaded; spinning at full RPM 120000 ms
idle_c Heads unloaded; spinning ar reduced RPM 300000 ms
standby_z Heads unloaded; motor stopped 900000 ms

Here's how to set it up:

./openSeaChest_PowerControl --device /dev/sda  --EPCfeature enable
./openSeaChest_PowerControl --device /dev/sda  --idle_a 100 --idle_b 120000 --idle_c 300000 --standby_z 900000

And here's how to query it:

./openSeaChest_PowerControl --device /dev/sdc --showEPCSettings

You want to monitor the number of the load/unload cycles (aka. the number of head parks) because the heads are susceptible to wear and tear. The user's manual says that the IronWolf disks can support 600k of these cycles before a failure. Start/stop cycles (aka. spinning down and up again) is probably another metric worth tracking, but I have not seen any info on how many of those a disk can handle before failing.

You can poll these with the command below. The value that you want to look at is the last one, and the tools show it only in hex for your convenience ;)

./openSeaChest_SMART -d /dev/sdd --smartAttributes raw | grep  -E "Start/Stop|Load-Unload"

Disk Power Issues

When choosing the power supply, I had foolishly assumed that the power consumption information for the disks posted on the retailer's website paints a more or less complete picture of what's needed. I then chose a power supply with a healthy margin and assumed things are going to work fine. They did for my previous setup with WD disks, so I saw no reason they would not work here. Things indeed work fine as long as you don't spin the disks down, which is a must for me. According to Seagate's user manual that you need to look pretty hard for, the disks need roughly 1.7 Amps at 12 Volts to spin up. The Phobya power supply I had initially intended to use can only deliver about half of the necessary power. I learned it the hard way by observing the Linux kernel spit out many AHCI link errors like the one posted below (in case someone wants to google it) and losing data on some of the disks. Fortunately, it never happened to more than one disk at a time. Hooray for RAID5!

[  697.033445] ata2.00: exception Emask 0x10 SAct 0x80000000 SErr 0x190002 action 0xe frozen
[  697.034246] ata2.00: irq_stat 0x80400000, PHY RDY changed
[  697.034754] ata2: SError: { RecovComm PHYRdyChg 10B8B Dispar }
[  697.035315] ata2.00: failed command: READ FPDMA QUEUED
[  697.035796] ata2.00: cmd 60/08:f8:00:89:59/01:00:6f:01:00/40 tag 31 ncq dma 135168 in
                        res 40/00:f8:00:89:59/00:00:6f:01:00/40 Emask 0x10 (ATA bus error)
[  697.037699] ata2.00: status: { DRDY }
[  697.038058] ata2: hard resetting link

The openSeaChest toolkit makes it possible to enable the feature called Low Current Spinup. I could not find any helpful information about it, and I did not measure how much less current the device took when the feature enabled. It comes with three modes: disable, low, ultra. Neither low nor ultra-low mode made the issues mentioned above go away.

openSeaChest_Configure --device /dev/sdX --lowCurrentSpinup ultra

The ATX PSU I ended up using
The ATX PSU I ended up using

I ended up buying a 600W ATX power supply unit to power up the disks. I will also make it supply what ended up being a small cluster of ARM single-board servers that live under my TV table. Don't try that at home unless you understand what you're doing.


I had expected getting this far to be just going through the motions, but it proved a bit more challenging than that. I am happy, though, because it provided an excellent excuse to dig deeper into topics that interest me quite a bit and that I hate exploring in the vacuum.

Table of Contents

  1. Chassis & Assembly
  2. Making Bits Flow and Disks Spin


For some time now, I have been watching people at work and over the Internet building all sorts of cool stuff by mixing 3D-printed components of their design and electronic DYI hardware. That's something I want to learn how to do myself. Fortunately, lack of ideas for fun projects, most of which I will sadly never have the time realize, is not exactly something I can complain about. As it happens, for instance, I am in dire need of a pretty unusual storage solution for my home network that I amazingly cannot buy anywhere. Why not kill two birds with one stone then and try my hand at designing my own NAS device?

Spoiler Alert! Here's what the thing ended up looking like:

The result
The result

The Building Blocks

Conveniently, I already had a Prisa Mini printer and a bunch of other building blocks and tools. All I needed to do was to buy some disks. I ended up using Seagate IronWolf Pro for their open-source admin tools and because of my previous bad experiences with Western Digital and their issues with firmware upgrades, messed up head parking, problems with putting them to sleep, and the company's issues with honesty in disclosure of the SMR technology being used even in high-end devices. The problems mentioned above are described in more detail here and here.

The materials
The materials

Here's the full BOM:

  1. A NanoPI M4
  2. A NanoPi M4 SATA Hat
  3. Four Seagate IronWolf Pro CMR disks
  4. An external Blu-Ray drive
  5. A power supply for the disks (I ended up using an ATX supply, see here)
  6. A power supply for the board and the Blu-Ray drive
  7. Four short (10-15 cm) long flexible SATA cables
  8. Four-way splitter SATA power supply cable
  9. Some jumper cables to drive out the UART debug interface
  10. 24 #6-32 screws and isolation rings to attach the disks
  11. 4 20mm long M3 screws to attach the NanoPi M4 board
  12. Some PLA filament to print the chasis
  13. Fast drying plastic glue to connect the components


I decided that pointing and dragging things in a typical CAD program is for rookies, so I went for OpenSCAD. Using it, you can code everything up in a text editor and version control the result in git like a real pro[grammer]. There's a pretty great tutorial here that will teach you how to do all sorts of magic with it. That, the Prusa Mini instruction manual and some elementary math are all you need to know to make the figments of your imagination real. Stop here for a second and ponder how mind-blowing that is! Seriously, you type some stuff in a text editor, and there's this thing on your desk that you can buy for 346 euros that makes it such that you can touch it with your fingers. I can't stop being amazed.


The video above is the rendering of what I came up with. It's not that great because it's pretty hard to change the disks when they break, which is the whole point of having a NAS RAID array. I treat it as an upside, though, because it will be an opportunity to redesign the thing when a disk breaks down.

I loved playing with OpenSCAD, but I don't think it's suitable for large or even moderately-sized designs. It is primarily because of the limitations of the language it uses. I have briefly looked at its implementation. It's a tree-walking interpreter that builds the CGAL geometry as it goes from AST node to AST node and uses a caching mechanism to avoid recomputation of the geometry represented by the subtrees that it has seen in the past. This mechanism allows OpenSCAD to quickly re-parse and re-render the design whenever the user changes the SCAD files and saves them. It is pretty cool!

In my opinion, however, building the geometry directly from the parse tree is a design flaw that gravely handicaps composability and makes the construction of complex models with multiple custom objects very hard. Instead, a model similar to the one employed in the Java3D API would make putting modules together much more manageable. With Java3D, you can write functions that return the geometry tree nodes that can have properties such as coordinates of the joints, dimensions, etc. It, in turn, allows you to write functions taking these properties as parameters and returning transforms aligning the coordinate frames such that the joint points meet right where you want them.

This problem could be somewhat alleviated by extending the language to have maps and using functions to compute module properties. It would be pretty tedious in practice, though, because you would have to make the same call twice, first to the module itself and then to the function computing the map of the module's properties for the given input parameters. In my chassis model, I have made an attempt at a poor man's alternative to this approach using the OpenSCAD vector type. It helped somewhat, but things got pretty convoluted and confusing pretty fast.

I have a module called support with the following interface:

module support(top, num_disks, disk_dims) {}

I then define a function that computes a vector of properties of the support that I call in multiple places to figure out how to fit things together in a parent module:

function support_properties(num_disks, disk_dims) = [
        (num_disks - 1) * 2 * comb_size + disk_dims[0] + 2 * frame_thickness,
        disk_dims[1] + 2 * frame_thickness,

It could have been made much simpler if the module returned an object that you could pass as a child node to another module, but before you do that, you could read some of its properties to transform the coordinate frame such that things fit together in a way that you want. My next toy project will be an attempt to create such an API and a wrapper for OpenSCAD in go that will compile the geometry to the SCAD language and have OpenSCAD render it.

The language also misses other features that seem crucial. The most important of them is treating functions as first-class citizens, i.e., the ability to assign them to variables and pass them as parameters to modules. It would make things like a function grapher, one example in the tutorial I mentioned earlier, generic. Ie. You could define any function \( f: \mathbb{R}^{2}\rightarrow \mathbb{R} \) and have a module that plots it.

Initially, I wanted to add this and some other features to the OpenSCAD language. However, using a programming language like go to create the geometry tree and then having this tree compiled to the SCAD language will make this and other problems go away for free.

3D Printing
3D Printing

The printing and putting things together was pretty easy. The Prusa slicer takes the STL files, produces the GCODE files that the printer can handle directly. For some parts of the model, it was necessary to generate additional support, but you can read all that in the Prusa instruction manual.


It was my first non-trivial 3D printing project. Even though I am not completely happy with the design, it's pretty stable and solid. Most importantly, the whole process taught me a lot and gave me many ideas to improve the open-source ecosystem. I will work on them in my "copious spare time" (sarcasm ;).

The whole model is available on GitHub here.


It starts to feel like this blog has become mostly about installing Debian on more or less esoteric pieces of hardware. One of the reasons for this is that I have lately been getting more and more into building embedded systems of various kinds and probably the largest annoyance in all this is the issue of the operating system that controls the hardware. If the device is powerful enough to run Linux, more often than not, it comes with some more or less useless vendor-specific and horrendously outdated IoT distribution. The only reasonable argument that I have heard for this being the state of affairs is that these embedded devices become obsolete pretty quickly, and the vendors are not keen on providing proper support. That's, in fact, an argument for making things more open, if you ask me, and for enabling the community to provide adequate support. I would have gladly put some time into it if getting the documentation was not close to impossible.

Even disregarding the fact that running this outdated software usually is a huge security risk, it ends up being an enormous drag when you want to build something that the vendor has not foreseen. The good people at Armbian do a pretty great job of bringing mainstream Linux to these small devices. However, I'd still instead run a vanilla distribution whenever possible. In most cases, making it work is just a matter of building the bootloader and tweaking the kernel.

So what's the deal this time around? Well, I need a file server and a media player for my home network. I found a suitable device with a bunch of SATA controllers attached over PCIe, a relatively powerful GPU, and a hardware video decoder. But then again, making vanilla Linux distribution run on it is somewhat of a challenge, so here's a howto.

NanoPi M4 with a SATA shield
NanoPi M4 with a SATA shield


You need to get a MicroSD card and partition it. You only need one partition because the preloader will look for the SPL, U-boot, and ARM Trusted Firmware directly in the block device. U-boot can boot from ext4 just fine, so there is no need for any FAT. That said, you may consider adding some swap if your use case calls for it. I created a GPT disklabel with the first partition starting at the 65536th block to have enough room for all the bootloader payload.

]==> fdisk /dev/sdc

Welcome to fdisk (util-linux 2.34).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Command (m for help): g
Created a new GPT disklabel (GUID: 3124F0FC-9528-A44C-BD06-E3598E61CE99).
The old dos signature will be removed by a write command.

Command (m for help): n
Partition number (1-128, default 1):
First sector (2048-124735454, default 2048): 65536
Last sector, +/-sectors or +/-size{K,M,G,T,P} (65536-124735454, default 124735454): 116346879

Created a new partition 1 of type 'Linux filesystem' and of size 55.5 GiB.

You can then format this partition with mkfs.ext4 /dev/sdc1.


Note 01.04.2021: I wrote a followup here.

You can, in theory, build the entire bootloader from open-source components. However, it seems that the memory timing setup is then somehow messed up. You will get the Linux kernel to panic due to memory errors while trying to access random chunks of RAM. I decided not to dig any deeper into it and use the mini loader provided by Rockchip, which comes with its own mess that I describe later. Anyways, you can start with vanilla U-boot:

]==> sudo apt-get install gcc-aarch64-linux-gnu bison flex u-boot-tools
]==> sudo apt-get install python-pyelftools device-tree-compiler libncurses-dev rsync
]==> git clone
]==> cd u-boot
]==> git checkout v2019.10
]==> make  nanopi-m4-rk3399_defconfig
]==> make ARCH=arm CROSS_COMPILE=aarch64-linux-gnu-

Once you have the U-boot image, you need to create the Rockchip bootloader payload. I run the commands below in a container because I am not that fond of running binaries of suspicious origin unchecked. Yeah, they produce other binaries that I then run uncontrolled on the target board. I see the irony, but I'd still rather keep my workstation safe.

]==> git clone
]==> cd rkbin
]==> ./tools/trust_merger RKTRUST/RK3399TRUST.ini
]==> ./tools/loaderimage --pack --uboot /path/to/u-boot/u-boot-dtb.bin uboot.img

Finally, you need to build the mini loader with the right DRAM timing settings and flash all that stuff to the disk:

]==> mkimage -n rk3399 -T rksd -d bin/rk33/rk3399_ddr_933MHz_v1.24.bin idbloader.img
]==> cat bin/rk33/rk3399_miniloader_v1.19.bin >> idbloader.img
]==> sudo dd if=idbloader.img of=/dev/sdc seek=64
]==> sudo dd if=trust.img of=/dev/sdc seek=24576
]==> sudo dd if=uboot.img of=/dev/sdc seek=16384
]==> sync

That should give you a working bootloader capable of starting Linux from an ext4 filesystem.

Base System

You need to create the Debian filesystem for aarch64 the usual way. I add mdadm here because I will want to run my disks in RAID4 mode.

]==> apt-get install qemu-user-static debootstrap
]==> mount /dev/sdc1 /mnt
]==> sudo qemu-debootstrap --include=u-boot-tools,mc,initramfs-tools,network-manager,openssh-server,mdadm --arch=arm64 testing /mnt/

Modify the necessary configuration files, in particular the fstab, and set up the required user accounts:

]==> chroot /mnt /usr/bin/qemu-aarch64-static /bin/bash
]==> cat /etc/fstab
UUID=15c39b7d-fad1-4a76-93eb-050b94791312 /               ext4    errors=remount-ro 0       1
]==> cat /etc/hostname
]==> cat /etc/apt/sources.list
deb testing main contrib non-free
deb-src testing main contrib non-free
deb testing-security main contrib non-free
deb-src testing-security main contrib non-free
]==> useradd -m your-user
]==> passwd your-user
]==> passwd
]==> exit

You then need to build the kernel. I mentioned before that the Rockchip's bootloader and the ATF come with their mess. The main problem is that they reserve a chunk of memory at EL3. The kernel running at EL1 can then trip over this memory, get denied access, and Oops. So we need to declare it as inaccessible in the device tree. It is what the memory.patch does.

]==> wget
]==> tar xf linux-5.6.3.tar.xz
]==> cd linux-5.6.3
]==> patch -Np1 -i ../memory.patch
]==> make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- defconfig
]==> make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- menuconfig
]==> make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- KBUILD_IMAGE=arch/arm64/boot/Image -j12 bindeb-pkg
]==> sudo cp ../linux-headers-5.6.3_5.6.3-1_arm64.deb ../linux-libc-dev_5.6.3-1_arm64.deb ../linux-libc-dev_5.6.3-1_arm64.deb /mnt
]==> chroot /mnt /usr/bin/qemu-aarch64-static /bin/bash
]==> dpkg -i linux-*deb
]==> rm linux-*
]==> exit

As you probably have noticed, I manually include the Image instead of zImage into the package. It is because the aarch64 kernel does not currently provide a decompressor. Therefore, either the bootloader needs to ungzip it, or you need to have an already decompressed kernel image.

Bootloader setup and automation

The last thing before being able to boot into the new system is telling the bootloader how to do it. U-boot will look for the boot.scr file in the root of your filesystem. For the first boot, it's best to set things up statically. Here's what I do:

]==> cat /boot.cmd
setenv bootargs 'root=/dev/mmcblk1p1 rootfstype=ext4 rootwait console=ttyS2,1500000 console=tty1 usb-storage.quirks=0x2537:0x1066:u,0x2537:0x1068:u memtest=4 earlycon=uart8250,mmio32,0xff1a0000'
load mmc 1:1 ${kernel_addr_r} /boot/vmlinuz-5.6.3
load mmc 1:1 ${fdt_addr_r} /usr/lib/linux-image-5.6.3/rockchip/rk3399-nanopi-m4.dtb
load mmc 1:1 ${ramdisk_addr_r} /boot/uinitrd.img-5.6.3
booti ${kernel_addr_r} ${ramdisk_addr_r} ${fdt_addr_r}
]==> mkimage -A arm64 -O linux -T ramdisk -C gzip -d /boot/initrd.img-5.6.3 /boot/uinitrd.img-5.6.3
]==> mkimage -A arm -O linux -T script -C none -n "Initial u-boot script" -d /boot.cmd /boot.scr

There are three tricks to the above. Firstly, we use the booti command instead of the usual bootz. It is because of the decompressor issue mentioned above. Secondly, we need to convert the initial ramdisk to the format digestible by U-boot. Thirdly, we need to convert the text boot script to the U-boot's binary format.

That's it. You can now boot your system.

It's a pain to run all this by hand every time you want to update your kernel, so it's a good idea to automate the process. To that end, I create a template file in the root of a filesystem:

]==> cat /
setenv bootargs 'root=/dev/mmcblk1p1 rootfstype=ext4 rootwait console=ttyS2,1500000 console=tty1 usb-storage.quirks=0x2537:0x1066:u,0x2537:0x1068:u'
load mmc 1:1 ${kernel_addr_r} /boot/vmlinuz-__VERSION__
load mmc 1:1 ${fdt_addr_r} /usr/lib/linux-image-__VERSION__/rockchip/rk3399-nanopi-m4.dtb
load mmc 1:1 ${ramdisk_addr_r} /boot/uinitrd.img-__VERSION__
booti ${kernel_addr_r} ${ramdisk_addr_r} ${fdt_addr_r}

And then run a hook that converts the ramdisk, creates boot.cmd, and converts it to the U-boot's binary format.

]==> cat /etc/kernel/postinst.d/uboot
#!/bin/sh -e
/usr/bin/mkimage -A arm64 -O linux -T ramdisk -C gzip -d /boot/initrd.img-${version} /boot/uinitrd.img-${version}
/bin/cat /  | /usr/bin/sed -e "s/__VERSION__/${version}/g" > /boot.cmd
/usr/bin/mkimage -A arm -O linux -T script -C none -n "Initial u-boot script" -d /boot.cmd /boot.scr

Post boot setup and media

After you've booted up the system, you can install some desktop environment and add your user account to all the useful groups. Reconfiguring locales and time zone data gets rid of annoying internationalization warnings and lets you handle the time correctly.

]==> apt-get update
]==> apt-get dist-upgrade
]==> apt-get install xfce4 alsa-utils mesa-utils pulseaudio mdadm locales tzdata mesa-utils-extra
]==> dpkg-reconfigure locales
]==> dpkg-reconfigure tzdata
]==> usermod -a -G pulse,pulse-access,netdev,plugdev,video,audio,sudo,dialout,users,render your-user

If you have some fancy surround sound setup as I do, you can set PulseAudio up to use the Alsa sink by adding load-module module-alsa-sink to /etc/pulse/ You can also configure the number of audio channels in /etc/pulse/daemon.conf. The setting is called default-sample-channels. As for ALSA itself, this is what I put in /etc/asound.conf:

]==> cat /etc/asound.conf
pcm.dmixer  {
        type dmix
        ipc_key 1024
        slave {
                pcm "hw"
                channels 6

pcm.!default "plug:dmixer

The last step is to make the hardware media decoder work. Rockchip has its proprietary acceleration hardware and the software driving it. Here's how to get it:

]==> sudo apt-get install build-essential dh-exec git cmake
]==> git clone mpp-1.4.0
]==> git checkout 50a96555

It happens to have a rules file to create a Debian package, but the file is messed up, so you will have to apply the following patch:

diff --git a/debian/rules b/debian/rules
index 876d6e6e..26686f39 100755
--- a/debian/rules
+++ b/debian/rules
@@ -24,7 +24,5 @@ include /usr/share/dpkg/
 # This is example for Cmake (See )
        dh_auto_configure -- \
-       -DCMAKE_TOOLCHAIN_FILE=/etc/dpkg-cross/cmake/CMakeCross.txt \
        -DCMAKE_BUILD_TYPE=Release \
-       -DHAVE_DRM=ON \
-       -DARM_MIX_32_64=ON
+       -DHAVE_DRM=ON

Then build and install it the usual way:

]==> tar czf mpp_1.4.0.orig.tar.gz mpp-1.4.0/
]==> cd mpp-1.4.0
]==> DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage -rfakeroot --no-sign
]==> dpkg -i ../librockchip-mpp-dev_1.4.0-1_arm64.deb ../librockchip-mpp1_1.4.0-1_arm64.deb ../librockchip-vpu0_1.4.0-1_arm64.deb 

That's not the end. You still need to make ffmpeg use the hardware acceleration.

]==> apt-get source ffmpeg
]==> apt-get build-dep ffmpeg

Apply the following patch to enable Rockchip's MPP:

diff -Naur ffmpeg-4.2.2.orig/debian/rules ffmpeg-4.2.2/debian/rules
--- ffmpeg-4.2.2.orig/debian/rules      2020-01-25 17:22:32.000000000 +0100
+++ ffmpeg-4.2.2/debian/rules   2020-04-11 23:15:44.465637513 +0200
@@ -104,7 +104,10 @@
        --enable-openal \
        --enable-opencl \
        --enable-opengl \
-       --enable-sdl2
+       --enable-sdl2 \
+       --enable-rkmpp \
+       --enable-version3
 # The standard configuration only uses the shared CONFIG.
 CONFIG_standard = --enable-shared

Then build and install it the usual way:

]==> cd ffmpeg-4.2.2
]==> dpkg-buildpackage -rfakeroot --no-sign
]==> sudo dpkg -i ../libpostproc-dev_*_arm64.deb ../libavformat-dev_*_arm64.deb ../libavcodec-dev_*_arm64.deb ../libavformat58_*_arm64.deb ../libavutil-dev_*_arm64.deb ../libavutil56_*_arm64.deb ../libswresample-dev_*_arm64.deb ../libswresample3_*_arm64.deb ../libavfilter-dev_*_arm64.deb  ../libavfilter7_*_arm64.deb ../libswscale-dev_*_arm64.deb ../libswscale5_*_arm64.deb ../ffmpeg_*_arm64.deb ../libavresample4_*_arm64.deb ../libavresample-dev_*_arm64.deb ../libavcodec58_*_arm64.deb ../libpostproc55_*_arm64.deb ../libavdevice58_*_arm64.deb  ../libavdevice-dev_*_arm64.deb
]==> sudo apt-mark hold libpostproc-dev libavformat-dev libavcodec-dev libavformat58 libavutil-dev libavutil56 libswresample-dev libswresample3 libavfilter-dev libavfilter7 libswscale-dev libswscale5 ffmpeg libavresample4 libavresample-dev libavcodec58 libpostproc55 libavdevice58 libavdevice-dev

The last command tells apt not to update these packages as the default system version does not support hardware acceleration on this platform.


The board works quite well, and I am reasonably happy with it. I am currently waiting for a 3D printer to come to put it in a case together with the disks. I also intend to set up NFS4 with Kerberos, so there will be follow up articles.


It's been quite a long while since the 64-bit CPUs took over the world. It seems that most of the ecosystem moved on, and people stopped paying close attention to the 32-bit compatibility issues. I know I did. It's all fine - compatibility can be quite a pain for very negligible gain. However, some notable software projects are still stuck in the 32-bit-only rut. Furthermore, it seems to be the case even when their primary target platforms have been 64-bit capable for quite a long time now.

One of the projects in the first category is Bitcoin. The Bitcoin developers made their stable release without noticing that the crypto unit tests fail on 32-bit platforms. On the other side of the spectrum, there is the RaspberryPi software stack. It could not run well in 64-bit mode up until a couple of days ago. The good people at Balena Linux distro did the necessary porting work and submitted it to the upstream RPi kernel. This work is so fresh and has been done mainly by a third-party, even though the boards from RaspberryPi 2B 1.2 onwards have an ARMv8 CPU. That's early 2016. I guess that this situation is the result of not pushing the changes to the upstream kernel that seems to be very common in the embedded world and the lack of documentation.

The reason I am writing about this is that I have just hit both of these issues when trying to run a full Bitcoin node on my fresh and new RaspberryPi 4B. I ended up having to install stock Debian on it. It was not exactly hard, but there seemed to have been no instruction on the Internet, so I write down what I did here.


Except for a RaspberryPi board, you will need a microSD card and a Linux system that can write to this card. Following what Raspbian does, I partitioned my card in the following way:

]==> fdisk /dev/sdb

Welcome to fdisk (util-linux 2.34).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Command (m for help): p
Disk /dev/sdb: 7.41 GiB, 7948206080 bytes, 15523840 sectors
Disk model: USB SD Reader   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x8c7e33d1

Device     Boot  Start      End  Sectors  Size Id Type
/dev/sdb1         8192   532480   524289  256M  c W95 FAT32 (LBA)
/dev/sdb2       540672 15523839 14983168  7.1G 83 Linux

The first partition is the boot partition holding the VideoCore GPU firmware necessary to boot a kernel as well as the Linux kernel image. I will discuss how to get both later. The second partition is for the root filesystem. I mounted them in my host system as described below and the rest of this document follows this convention.

]==> mount /dev/sdb2 /mnt
]==> mkdir /mnt/boot
]==> mount /dev/sdb1 /mnt/boot/

Furthermore, you will need to have the following packages installed in the host system:

]==> apt-get install debootstrap qemu-user-static
]==> apt-get install gcc-aarch64-linux-gnu bison flex python-pyelftools device-tree-compiler

The packages in the first group are necessary to bootstrap the root filesystem, while the ones in the second group are needed to cross-compile the kernel image.

System installation

You'll need the arm64 flavor of Debian. I use the testing distribution for pretty much every system I have, but you probably can successfully run any other. I also like using Midnight Commander, so I include that package in the target installation. The include specification is a comma-separated list, and you can declare any package you want there. My network operator is Init7, so I use their mirror. You should select one that is close to you. Finally, the target for the installation is /mnt.

]==> qemu-debootstrap --include=mc --arch=arm64 testing /mnt/

After the bootstrapping process completes, you need to build the kernel image that can run on the board. The official website describes the process quite well, but I like building proper Debian packages instead. The commit with hash db690083a4 worked fine for me.

]==> cd /tmp
]==> git clone
]==> cd linux
]==> make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- bcm2711_defconfig
]==> make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- KBUILD_IMAGE=arch/arm64/boot/Image -j12 bindeb-pkg
]==> make ARCH=arm64 CROSS_COMPILE=/usr/bin/aarch64-linux-gnu- dtbs
]==> cp /tmp/linux/arch/arm64/boot/dts/broadcom/bcm2711-rpi-4-b.dtb /mnt/boot/
]==> mkdir /mnt/boot/overlays
]==> cp /tmp/linux/arch/arm64/boot/dts/overlays/vc4-fkms-v3d.dtbo /mnt/boot/overlays/
]==> cp /tmp/linux*deb /mnt

The first make command configures the kernel for this particular board. The second one builds the Debian packages with the kernel, the kernel headers, and the kernel userspace headers. The third one compiles the Device Tree files. The last four commands install the necessary Device Tree files in the boot partition and copy the kernel packages to the root filesystem.

You can now chroot into the root filesystem, install the kernel, create a user account, and change passwords. We also need to prevent the linux-libc-dev package from being updated using apt-mark.

]==> chroot /mnt /usr/bin/qemu-aarch64-static /bin/bash
]==> dpkg -i linux-*deb
]==> rm linux*deb
]==> apt-mark hold linux-libc-dev
]==> useradd -m youruser
]==> passwd youruser
]==> passwd root

You then need to edit the files that play a role in the boot process. The relevant listings are presented below. Make sure that you got the partition setup right. You can get the PARTUUIDs by listing: ls -l /dev/disk/by-partuuid.

]==> cat /etc/hostname 
]==> cat /etc/fstab   
PARTUUID=8c7e33d1-01  /boot           vfat    defaults          0       2
PARTUUID=8c7e33d1-02  /               ext4    defaults,noatime  0       1
]==> cat /etc/apt/sources.list
deb testing main contrib non-free
deb-src testing main contrib non-free

deb testing-security main contrib non-free
deb-src testing-security main contrib non-free

Finally, all that's left is the installation of the VideoCore GPU firmware files and the configuration of the bootloader. Again, make sure that the partition IDs are right and that you name the kernel image name correctly.

]==> cd /mnt/boot
]==> wget
]==> wget
]==> wget
]==> wget
]==> wget
]==> wget
]==> wget
]==> wget
]==> cat config.txt 


]==> cat cmdline.txt
dwc_otg.lpm_enable=0 console=serial0,115200 console=tty1 root=PARTUUID=8c7e33d1-02 rootfstype=ext4 elevator=deadline quiet rootwait

That's it. You can now unmount the filesystems from the host, insert the microSD card into your RaspberryPi, and the Debian system should boot.

Post-installation setup

Generally, it's nice to have the network interface start up at the boot time so that you can download stuff from the Internet without a hassle. NetworkManager does that job well. The SSH daemon is also useful for other reasons.

]==> dhclient eth0
]==> apt-get install network-manager
]==> systemctl enable NetworkManager
]==> apt-get install openssh-server
]==> systemctl enable ssh

You may also want to suppress the annoying Perl warnings about missing locale files and set up your favorite time zone.

]==> apt-get install locales tzdata
]==> dpkg-reconfigure locales
]==> dpkg-reconfigure tzdata


My Bitcoin node runs fine with all its >250GB of blockchain data, but I have not checked if anything else works at all. In particular, I have not tested the display drivers nor any camera setup. However, random people on the Internet claim that the GPU drivers are now in the kernel, so things should be fine.

Have fun!


It's kind of silly, but I wanted to build a drone that can evade being hit by a lightsaber ever since I first watched this scene in "Attack of the Clones":

Star Wars - Jedi Younglings

To gain the understanding of the ecosystem, I decided to buy parts semi-randomly on Amazon, build something that can just fly, and then iteratively improve on this basic design. People can do astounding things with drones these days. My ultimate goal is to be able to build stuff like that.

The hardware

Here's the list of things I have bought:

  • A carbon fiber quadrotor frame. (Amazon)
  • Brushless rotors. I cannot find the exact model anymore, but they were similar to the ones in the link. (Amazon)
  • Electronic Speed Controllers. (Amazon)
  • Clock-wise and counter-clock-wise propellers. You only need two of each kind, but they easily break if you do tuning in a confined space, so it's wise to buy more of them upfront. (Amazon)
  • A battery. The one in the link is large enough and still fits inside the frame. I need it inside because I wanted the electronics to be easily accessible on the top - I will likely want to change it quite a bit later. (Amazon)
  • A power connection board. You can do without it, but it's quite a lot of connections, so soldering wires together and wrapping the joints in isolation tape is painful and looks ugly. (Amazon)
  • An autopilot board. I bought a cheap CC3D because I ultimately want to ditch it and build one myself. (Amazon)
  • A Raspberry Pi. I want the drone to fly by itself, so I did not buy any radio controller - it will be the task of a computer to do the steering. I used an RPi model 2 because I had one readily available at home. However, these days model 3 is cheaper, so it's probably a much better idea to buy that one. (Amazon)
  • A WiFi dongle. I want this first version to be controllable from a web browser via WiFi. A later version will send some telemetry and receive high-level commands via GSM. (Amazon)

Apart from all that, I used some electronics components to power things up and connect them. I had most of them at home, but I will put some links below nonetheless. You'll likely need these:

  • A prototyping board. It's nice to solder things together to something stable so that the components don't fly around attached to loose wires. The one in the following link should do. (Farnell)
  • A 5V voltage regulator. You will need one to power the Raspberry Pi up. The documentation of CC3D says that the board puts the unregulated output from the ESCs on the output of its serial ports. This output happens to be at 5V, so I initially used that for powering the Pi. Unfortunately, it needs to draw at least around 600 mA of current to work, so the ESC that powered the Raspberry got extremely hot and the motor it was controlling lagged behind the others. Make sure you buy a regulator with as stable output as possible. Some of the cheap ones will make the Raspberry reboot in the middle of the flight due to voltage oscillations. This, in turn, will make the autopilot think it lost the connection to the radio controller and it will go into failsafe mode. A TO-220-compatible heatsink for that regulator is not a bad idea either. You will also need two capacitors. I used 10 μF and 22 μF. Alternatively, you can get yourself a DC to DC converter, in which case you won't need capacitors or heatsinks, and it should be much easier on the battery. (Farnell) (Farnell) or (Farnell)
  • An NPN Transistor and two 10 kΩ resistors for a logic inverter with voltage level adjustment. (Farnell)
  • Header pins and jumper cables so that you can connect things nicely. (Amazon, Amazon)

I used some extra components, even though they are not necessary to make things work. I am not exactly sure where this project will take me, so it seemed prudent to plan far ahead.

  • A 3.3V voltage regulator. I will likely want to power a 3.3V-based microcontroller to act as an autopilot. It needs an extra 10μF capacitor. (Farnell)
  • Four NPN Transistors and eight 10 kΩ resistors for bi-directional voltage level adjusting.

Wiring things up

Wiring things up is not hugely complicated. I put the power connection board on the bottom side of the drone together with all the cables powering the ESDs. The ESD control cables and the power for the RaspberryPi go from the bottom to the top in two bunches in the middle of each side of the drone. There are all sorts of electronic-related connections on the top. The battery is inside the drone frame.

Drone Wiring
Drone Wiring

Pretty much the only thing to pay attention to at this stage is making sure that all the rotors are placed in the right positions and that they connect to the ESCs such that they spin in the right directions. Here's a great video on that. The image below was produced by the LibrePilot configuration wizard.

Drone Rotor Topology - produced by LibrePilot
Drone Rotor Topology - produced by LibrePilot

You cannot connect the communication ports of the Raspberry Pi to the CC3D directly because there is a difference in the voltage levels at which the ports operate. The Pi's GPIO works at 3.3V and cannot tolerate 5V. The autopilot should, in principle, work at 3.3V with tolerance to 5V. However, in practice, I found that what only 5V based logic works. This is why I needed to build two voltage level converters out of transistors. I want to use them to send commands and receive telemetry from the FlexiPort of the autopilot.

Voltage Leveling Circuit
Voltage Leveling Circuit

As shown in the rotor topology diagram, the computer controls the autopilot using the S-Bus protocol. This protocol is just transmitting some data over UART with the added quirk of S-Bus being a logical inversion of UART (every 1 in UART is a 0 in S-Bus and vice-versa) plus we need to take care of the voltage level difference in the high states. The best solution here is again to build a circuit that does the inversion. It is a half of the voltage leveling circuit:

Inverter Circuit
Inverter Circuit

Here's what the resulting board together with the voltage regulators looks like in my case:

Complete board
Complete board

Control and Telemetry

A massive bummer of the RaspberryPi for me right now is that it only has one hardware UART controller. I will need many. At least one more to read telemetry data from CC3D and, later on, one extra to talk to my WaveShare GSM modem. You can bitbang UART on GPIOs, and some external kernel modules out there can do that. The problem is that they rely on the kernel's hrtimers and are not reliable enough at higher speeds, especially if the system is under load. I use one for now at a low speed, but I am working on my own implementation that uses hardware timers to flip the GPIO states reliably and on time. The CC3D and the Raspberry Pi can talk telemetry over such simulated serial port using the UAVTalk protocol. The LibrePilot source code provides python bindings for that.

I used the one hardware UART port that the Pi has for the control link because it needs to operate at a high and non-standard speed. On RaspberryPis 1 and 2 this port is used as a Linux console output by default, so you will need to disable that in /boot/config.txt. RaspberryPis 0 and 3 use the hardware UART to control Bluetooth. This behavior may be disabled by installing the pi3-disable-bt device tree overlay. All the necessary details are here. Once you're done with that, you can connect pin 14 of the Pi to the input of the inverter and the yellow (orange) cable of the CC3D's main port to its output.

After doing all that, it's a matter of opening the serial port in the right mode and sending the protocol byte stream down the pipe. S-Bus expects a baud rate of 100000, one even parity bit, and two stop bits. Here's how to open the port in this mode using Python's pyserial:

import serial

port = serial.Serial('/dev/ttyAMA0', baudrate=100000,

I found an excellent description of the S-Bus data frames here. Each frame is 25 bytes long and consists of: a start byte, 16 11-bit channels packed in the next 22 bytes, a byte containing flags and extra binary channels, and, finally, a stop byte. The controller is supposed to send a frame every 7ms, but after reading the code, I found that the LibrePilot firmware is fine as long as it receives a frame at least ten times per second (at least more often than every 102.4ms to be precise). You can see the code of my encoder here.

I quickly got tired of putting these numbers in a terminal window, so I wrote a trivial controller that works in a browser and uses a bunch of sliders. The code is on GitHub.

Controller interface
Controller interface

Open/Libre Pilot

There seems to have been some disagreement in the Open/Libre Pilot community, and the project does not look like it's in a great shape. I needed to make a bunch trivial changes to the GCS source code to make it compile on my Debian Testing laptop. Furthermore, the firmware does not build with the cross-compiler toolchain they supply due to some GCC configuration issues. I managed to build the firmware using the stock Debian cross-compiler for ARM and modifying the Makefile to make it not use the -Werror flag. The firmware code has plenty of unused variables that make the build process fail with this setting turned on. After building everything, the GCS crashes every other time you try to power cycle the board. As far as the CC3D boards themselves are concerned, I have two of them, and only one works in a more or less stable way. The other one does not load the configuration correctly or hangs every 3 out of 5 boots.

I used the config wizard at the beginning but found it confusing, so I later decided to do the configuration manually. Here's a list of what I did screen-by-screen:

  • Hardware:
    • Receiver Port: Disabled+OneShot
    • Flexi Port: Telemetry
    • Main Port: S.Bus
    • USB HID Port: USBTelemetry
    • Telemetry Speed: 9600 - faster than that is not reliable with current implementations of software UART for RaspberryPi.
  • Vehicle - Multirotor:
    • Airframe Type: Quad X
    • Assigned the rotors to the appropriate channels
  • Input:
    • Remote Control Input:
      • All channels need to be assigned even though not all of them correspond to any inputs in the pipilot interface. Otherwise, you will get receiver warnings, and the copter won't arm. I figured that out the hard way by reading the firmware code. Here's to the great diagnostics!
      • Throttle is Channel 1, Yaw - Channel 2, Roll - Channel 3, Pitch - Channel 4.
      • Accessories are Channels 5 to 8.
      • You can assign other controls to whatever other channels you like.
      • S.Bus transmits 11bits worth of data per channel, so the minimum is at 0 and the maximum is at 2047.
    • Flight Mode Settings
      • Flight Mode Count: 1
      • Stabilized 1: Attitude, Attitude, Axis Lock, CruiseControl - CruiseControl is particularly important. If you set it Manual, the copter will behave crazy.
    • Arming Settings:
      • Arm airframe using throttle off and: Yaw Left
  • Output:
  • Attitude:
    • You want to level your gyros
    • People say that there are two ways to combat the copter drifting while hovering:
      • Increase the amount of low-pass filtering.
      • Set the virtual rotation to compensate for the board not being completely flat. See this link.
    • Neither of these solutions helped me.


Flight Test #0

If you think it looks completely underwhelming, then I have to agree with you. The main problem is the drift while hovering. I tried virtual rotation, low-pass filtering, and PID tuning, but no amount of configuration tweaking alleviates the problem. The setup does not have any optical sensors and accelerometers, by definition, don't see drifting at a constant speed. On the other hand, the copter is stationary at the beginning and starts to move after the take-off, so the acceleration is not zero. It might be that the sensors are not sensitive enough to pick it up. That's something that I intend to investigate once I get the telemetry connection working reliably.

Next steps

Here's what I plan to do next:

  • Get my kernel soft UART module based on hardware timers to work. I have the timer interface finished and tested, but still need to do the byte encoder, the GPIO state changing logic, and the TTY interface.
  • Connect the telemetry at higher speed to see if the sensors see the drift.
  • If the sensors see the drift, either write a PID controller at the level of pipilot or see why the firmware does not compensate for it.

Medium-term plans include:

  • Attach the Crazyflie sensor and the IMUs directly to the Raspberry Pi.
  • Hack the CC3D firmware so that the Pi can control the motors directly.
  • See if Linux (a non-RTOS) on RPi is reliable enough to control the copter and keep it hovering stably.

Long-term plans:

  • Port everything to my FRDM-K64F board to see if things improve if implemented on top of "bare metal."
  • Start playing with more complex control and estimation math.
  • Add cameras, lidar, and implement some autonomy.
  • Perhaps write all the microcontroller code in Rust instead of C.


I got tired of having to wait for several hours every time I want to build TensorFlow on my Jetson board. The process got especially painful since NVIDIA removed the swap support from the kernel that came with the most recent JetPack. The swap was pretty much only used during the compilation of CUDA sources and free otherwise. Without it, I have to restrict Bazel resources to a bare minimum to avoid OOM kills when the memory usage spikes for a split second. I, therefore, decided to cross-compile TensorFlow for Jetson on a more powerful machine. As usual, it was not exactly smooth sailing, so here's a quick guide.

The toolchain and target side dependencies

First of all, you will need a compiler capable of producing binaries for the target CPU. I have initially built one from sources but then noticed that Ubuntu provides one that is suitable for the task.

]==> sudo apt-get install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu

Let's see if it can indeed produce the binaries for the target:

]==> aarch64-linux-gnu-g++ hello.cxx
]==> scp a.out jetson:
]==> ssh jetson ./a.out
Hello, World!

You will also need the same version of CUDA that comes with the JetPack. You can download the repository setup file from the NVIDIA's website. I usually go for the deb (network) option. Go ahead and set that and then type:

]==> sudo apt-get install cuda-toolkit-8-0

You will need cuDNN 6, both on the build-host- and the target-side. It's quite surprising, but both versions are necessary because the TensorFlow code generators that need to run on the build-host depend on which seems to provide everything from string helpers to cuDNN and cuBLAS wrappers. It's not a great design choice, but to give them justice it's not noticeable unless you try to do weird stuff, like cross-compiling CUDA applications. You can get the host-side version here, and you can download the target-version from the device:

]==> export TARGET_PACKAGES=/some/empty/directory
]==> mkdir -p $TARGET_PACKAGES/cudnn/{include,lib}
]==> cd $TARGET_PACKAGES/cudnn
]==> ln -sf lib lib64
]==> scp jetson:/usr/lib/aarch64-linux-gnu/ lib
]==> cd lib
]==> ln -sf
]==> cd ..
]==> scp jetson:/usr/include/aarch64-linux-gnu/cudnn_v6.h include/cudnn.h

The next step is to install the CUDA libraries for the target. Unfortunately, their packaging is broken. The target architecture in the metadata of the relevant packages is marked as arm64. It makes sense on the surface because they contain aarch64 binaries after all, but it makes them not installable. The convention is to mark the architecture of such packages as all (universal) because the binary shared objects they contain are only meant to be stubs for the cross compiler (see here) and are not supposed to be runnable on the build-host. We will, therefore, need some apt trickery to install them:

]==> sudo apt-get -o Dpkg::Options::="--force-architecture" install \
         cuda-nvrtc-cross-aarch64-8-0:arm64 cuda-cusolver-cross-aarch64-8-0:arm64 \
         cuda-cudart-cross-aarch64-8-0:arm64 cuda-cublas-cross-aarch64-8-0:arm64 \
         cuda-cufft-cross-aarch64-8-0:arm64 cuda-curand-cross-aarch64-8-0:arm64 \
         cuda-cusparse-cross-aarch64-8-0:arm64 cuda-npp-cross-aarch64-8-0:arm64 \
         cuda-nvml-cross-aarch64-8-0:arm64 cuda-nvgraph-cross-aarch64-8-0:arm64

Furthermore, the names of the libraries installed by these packages are inconsistent with their equivalents for the build host, so we will need to make some symlinks in order not to confuse the TensorFlow build scripts.

]==> cd  /usr/local/cuda-8.0/targets/aarch64-linux/lib
]==> for i in cublas curand cufft cusolver; do \
       sudo ln -sf stubs/lib$ lib$ && \
       sudo ln -sf lib$ lib$ && \
       sudo ln -sf lib$ lib$; \

Let's check if all this works at least in a trivial test:

]==> wget
]==> /usr/local/cuda-8.0/bin/nvcc -ccbin /usr/bin/aarch64-linux-gnu-g++ -std=c++11 \
       --gpu-architecture=compute_53 --gpu-code=sm_53,compute_53 \
]==> scp a.out jetson:
]==> ssh jetson ./a.out
Input: 1, 2, 3, 4, 5,
Output: 2, 3, 4, 5, 6,

And with cuBLAS:

]==> wget
]==> aarch64-linux-gnu-g++ hello-cublas.cxx \
       -I /usr/local/cuda-8.0/targets/aarch64-linux/include \
       -L /usr/local/cuda-8.0/targets/aarch64-linux/lib \
       -lcudart -lcublas
]==> scp a.out jetson:
]==> ssh jetson ./a.out
A =
1 2 3
4 5 6

B =
7 8
9 10
11 12

A*B =
58 64
139 154

TensorFlow also needs Python headers for the target. The installation process for these is straight-forward. You will just need to pre-define results of some of the configuration tests in the file. These tests require either access to the dev file system of the target or need to run compiled C code, so they cannot be executed on the build-host.

]==> wget
]==> tar xf Python-3.5.2.tar.xz
]==> cd Python-3.5.2
]==> cat
]==> ./configure --prefix=$TARGET_PACKAGES --enable-shared \
       --host aarch64-linux-gnu --build x86_64-linux-gnu --without-ensurepip
]==> make -j12 && make install
]==> cd $TARGET_PACKAGES/include
]==> mkdir aarch64-linux-gnu
]==> cd aarch64-linux-gnu && ln -sf ../python3.5m

Bazel setup and TensorFlow mods

The way to tell Bazel about a compiler configuration is to write a CROSSTOOL file. The file is just a collection of paths to various tools and the default configuration parameters for them. There are however some things to note here as well. First, the configuration script of TensorFlow asks about the host Python installation and sets the source up to use it. However, what we need in this case is the target Python. Since there seems to be no easy way to plug that into the standard build scripts, we pass the right include directory to the compiler here:

cxx_flag: "-isystem"

We will also need to inject some compiler parameters on the fly for some of the binaries, so we call neither the build-host nor the target compiler directly:

tool_path { name: "gcc" path: "crosstool_wrapper_driver_is_not_gcc" }
tool_path { name: "gcc" path: "crosstool_wrapper_host_tf_framework" }

As mentioned before, one of the problems is that the code generators that need to run on the build host depend on, which in turn, depends on CUDA. We, therefore, need to let the compiler know where the host versions of the CUDA libraries are installed. The second problem is that Bazel fails to link the code generators against the framework library. We fix that in the host wrapper script:

 1if ofile is not None:
 2    is_gen = ofile.endswith('py_wrappers_cc') or ofile.endswith('gen_cc')
 3    if is_cuda == 'yes' and (ofile.endswith('') or is_gen):
 4        cuda_libdirs = [
 5            '-L', '{}/targets/x86_64-linux/lib'.format(cuda_dir),
 6            '-L', '{}/targets/x86_64-linux/lib/stubs'.format(cuda_dir),
 7            '-L', '{}/lib64'.format(cudnn_dir)
 8        ]
10    if is_gen:
11        tf_libs += [
12            '-L', 'bazel-out/host/bin/tensorflow',
13            '-ltensorflow_framework'
14        ]

As far as the target is concerned, the only problem that I noticed is Bazel failing to set up RPATH for the target version of correctly. It causes build failures of some of the binaries that depend on this library. We fix this problem in the wrapper script for the target compiler:

ofile = GetOptionValue(sys.argv[1:], 'o')
if ofile and ofile[0].endswith(''):
  cpu_compiler_flags += [

Some adjustments need to be made to the paths where TensorFlow looks for CUDA libraries and header files. Also, the script needs to be patched to make sure to make sure that the resulting wheel file has the correct platform metadata specified in it.

Building the CPU and the GPU packages

I have put all of the patches I mentioned above in a git repo, so you will need to check that out:

]==> git clone
]==> cd tensorflow
]==> git checkout v1.5.0-cross-jetson-tx1

Let's try a CPU-only setup first. You need to configure the toolchain, and then you can configure and compile TensorFlow as usual. Use /usr/bin/python3 for python, use -O2 for the compilation flags and say no to everything but jemalloc.

]==> cd third_party/toolchains/cpus/aarch64
]==> ./
]==> cd ../../../..
]==> ./configure
]==> bazel  build  --config=opt \
       --crosstool_top=//third_party/toolchains/cpus/aarch64:toolchain \
       --cpu=arm  //tensorflow/tools/pip_package:build_pip_package
]==> mkdir out-cpu
]==> bazel-bin/tensorflow/tools/pip_package/build_pip_package out-cpu --platform linux_aarch64

To get GPU setup working, you will need to rerun the toolchain configuration script and tell it the paths to the build-host side CUDA 8.0 and cuDNN 6. Then configure TensorFlow with the same settings as above, but enable CUDA this time. Tell it the paths to your CUDA 8.0 installation, your target-side cuDNN 6, and specify /usr/bin/aarch64-linux-gnu-gcc as the compiler.

]==> cd third_party/toolchains/cpus/aarch64
]==> ./
]==> cd ../../../..
]==> ./configure
]==> bazel build  --config=opt --config=cuda \
         --crosstool_top=//third_party/toolchains/cpus/aarch64:toolchain \
         --cpu=arm  --compiler=cuda \

The compilation takes roughly 15 minutes for the CPU-only setup and 22 minutes for the CUDA setup on my Core i7 build host. It's a vast improvement comparing to hours on the Jetson board.


I haven't done any extensive testing, but my SSD implementation works fine and reproduces the results I get on other boxes. There is, therefore, a strong reason to believe that things compiled fine.

Detections in a Pascal-VOC example on the Jetson
Detections in a Pascal-VOC example on the Jetson


AWS has recently introduced the P3 instances. They come with Tesla V100 GPUs, so I decided to run a little benchmark to see how well they perform compared to my workstation when training neural networks. I installed the most recent versions of CUDA/cuDNN (9.0/7.0) and TensorFlow (1.4.0), and run two non-trivial benchmarks that test both the GPU and the CPU.

Building the Software

Building TensorFlow from source is relatively straightforward, except that you need to install bazel. And gosh, never have I ever managed to build that stuff without issues. This time was not an exception. I wrote an article about that in the past, so I won't go into much detail here. I will just say that you will need Protocol Buffers version 3.4.0, grpc-java version 1.6.1, and bazel version 0.7.0. You will then need to apply this patch that I have taken from here and resolved the merge conflicts. Then, you will need to apply this one on top. The rest should go smooth.

Testing Setup

I used my workstation, and two AWS GPU Compute instances. Their exact parameters are in the table below. Since my workstation has an SSD, I used RAM disks on AWS to make the tests more comparable.

Name Description CPU GPU CUDA Compute Data Source
ti My workstation i7-6600U GeForce GTX 1080 Ti 6.1 SSD
p2 AWS p2.xlarge E5-2686 Tesla K80 3.7 RAM disk
p3 AWS p3.2xlarge E5-2686 Tesla V100 7.0 RAM disk

The tests are object detection and semantic segmentation, both coming in a smaller and a larger flavor. The former one processes all the input data in parallel to the GPU thread, whereas the latter does the processing serially in the main thread. On both, the p3 and ti machines, the CPU utilization was at 100% when running the semantic segmentation test. It means that the CPU is a bottleneck here.


Normalized Performance
Normalized Performance

Machine VGG300 VGG512 KITTI Cityscapes
ti 11:38 28:09 00:16 09:22
p2 46:39 1:49:05 00:50 20:31
p3 08:15 20:10 did not work 13:01

The results in the image above are normalized, with 1 being the score for the ti setup. The table contains the exact execution times of training over one epoch. The V100 is around 30% faster than the 1080 Ti. The 1080 Ti, in turn, is about four times faster than the K80. Also, a Core i7 seems to be more performant than the Xeon Amazon uses in their instances. The KITTI test did not work on the V100 - it has hit a strange CUDA bug.


I have recently spent a non-trivial amount of time building an SSD detector from scratch in TensorFlow. I had initially intended for it to help identify traffic lights in my team's SDCND Capstone Project. However, it turned out that it's not particularly efficient with tiny objects, so I ended up using the TensorFlow Object Detection API for that purpose instead. In the end, I managed to bring my implementation of SSD to a pretty decent state, and this post gathers my thoughts on the matter. It is not intended to be a tutorial. Instead, it's a discussion of all the pieces of information that were unclear to me or that I needed to research independently of the original paper.

Object Detection
Object Detection

Base Network and Extensions


SSD is based on a modified VGG-16 network pre-trained on the ImageNet data. I happened to have one from one of my previous projects, and I used it here as well. The following modifications have been made to the base network:

  • pool5 was changed from 2x2 (stride: 2) to 3x3 (stride: 1)
  • fc6 and fc7 were converted to convolutional layers and subsampled
  • à trous convolution was used in fc6
  • fc8 and all of the dropout layers were removed

As you can see from the above image, the fc6 and fc7 convolutions are 3x3x1024 and 1x1x1024 respectively, whereas in the original VGG they are 7x7x4096 and 1x1x4096. Having huge filters like these is a computational bottleneck. According to one of the references, we can address this problem by "spatially subsampling (by simple decimation)" the weights and then using the à trous convolution to keep the filter's receptive field unchanged. It was not immediately clear to me what it means, but after reading this page of MatLab's documentation, I came up the following:

with tf.variable_scope('mod_conv6'):
    orig_w, orig_b =[self.vgg_fc6_w, self.vgg_fc6_b])
    mod_w = np.zeros((3, 3, 512, 1024))
    mod_b = np.zeros(1024)

    for i in range(1024):
        mod_b[i] = orig_b[4*i]
        for h in range(3):
            for w in range(3):
                mod_w[h, w, :, i] = orig_w[3*h, 3*w, :, 4*i]

    w = array2tensor(mod_w, 'weights')
    b = array2tensor(mod_b, 'biases')
    x = tf.nn.atrous_conv2d(self.mod_pool5, w, rate=6, padding='SAME')
    x = tf.nn.bias_add(x, b)
    self.mod_conv6 = tf.nn.relu(x)

It doubled the speed of training and did not seem to have any adverse effects on accuracy. Note that the dilation rate of the à trous convolution is set to 6 instead of 3. This setting is inconsistent with the size of the original filter, but it is nonetheless used in the reference code.

The output of the conv4_3 layer differs in magnitude compared to other layers used as feature maps of the detector. As pointed out in the ParseNet paper, this fact may lead to reduced performance because "larger" features may overwhelm the "smaller" ones. They propose to use L2 normalization with a scale learnable separately for each channel as a remedy to this problem. This is what I ended up doing in TensorFlow:

def l2_normalization(x, initial_scale, channels, name):
    with tf.variable_scope(name):
        scale = array2tensor(initial_scale*np.ones(channels), 'scale')
        x = scale*tf.nn.l2_normalize(x, dim=-1)
    return x

The initial scale for each channel is set to 20, and it does not change very much over the training time.

Furthermore, a bunch of extra convolutional layers were added on top of the modified fc7. The number of these layers depends on the flavor of the detector: vgg300 or vgg512. The paper does not explain well enough the parameters of the convolutions, especially the padding settings, even though getting this part wrong can significantly impact the performance. I looked these up in the reference code for vgg300 and worked my way backward from the number of anchors in the case of vgg512. Here's what I ended up with:

  • conv8_1: 1x1x256 (stride: 1, pad: same)
  • conv8_2: 3x3x512 (stride: 2, pad: same)
  • conv9_1: 1x1x128 (stride: 1, pad: same)
  • conv9_2: 3x3x256 (stride: 2, pad: same)
  • conv10_1: 1x1x128 (stride: 1, pad: same)
  • conv10_2: 3x3x256 (stride: 1, pad: valid) for vgg300, (stride: 2, pad: same) for vgg515
  • conv11_1: 1x1x128 (stride: 1, pad: same)
  • conv11_2: 3x3x256 (stride: 1, pad: valid)

For the vgg512 flavor, there are two extra layers:

  • conv12_1: 1x1x128 (stride: 1, pad: same)
  • padding of the conv12_1 feature map with one extra cell in each spacial dimension
  • conv12_2: 3x3x256 (stride: 1, pad: valid)

It's not possible to use the predefined padding options (VALID or SAME) for extending conv12_1, so I ended doing it manually:

x, l2 = conv_map(self.ssd_conv11_2, 128, 1, 1, 'conv12_1')
paddings = [[0, 0], [0, 1], [0, 1], [0, 0]]
x = tf.pad(x, paddings, "CONSTANT")
self.ssd_conv12_1 = self.__with_loss(x, l2)
x, l2 = conv_map(self.ssd_conv12_1, 256, 3, 1, 'conv12_2', 'VALID')
self.ssd_conv12_2 = self.__with_loss(x, l2)

Default Boxes (a. k. a. Anchors)

Default Boxes
Default Boxes

The model takes the outputs of some of these convolutional layers and associates a scale with each of them. The exact formula is presented in the paper; the reference implementation does not seem to follow it exactly, though. In general, the further away the feature map is from the input, the larger is the scale assigned to it. The scale only loosely correlates with the receptive field of the respective filter.

The model adds a bunch of 3x3xp convolutional filters on top of each of these maps. Each of these filters predicts p parameters of a default box (or an anchor) at the location to which it is applied. Four of these p parameters are the coordinates of the window (relative width and height, as well as x and y offsets from the center of the anchor). The remaining parameters define the probability distribution of the box belonging to one of the classes that the model predicts (the softmaxed logits). We need to add as many of these filters per feature map as we want aspect ratios for the default boxes of a given scale. In general, the more, the better. The paper advises using six aspect ratios per map. However, the implementation uses fewer of them in some cases.

We now need to create the ground truth labels for the optimizer. We match each ground truth box to an anchor box with the highest Jaccard overlap (if it exceeds 0.5). Additionally, we match it to every anchor with overlap higher than 0.5. The original code uses a mixture of bipartite matching and maximum overlap to resolve conflicts, but I just used the latter criterion for simplicity. For every matched anchor we set the class label accordingly and use the following for the box parameters:

\[ w = 10 \cdot log(\frac{w_{gt}}{w_{a}}) \\ h = 10 \cdot log(\frac{h_{gt}}{h_{a}}) \\ x_c = 5 \cdot \frac{x_{c,gt} - x_{c,a}}{w_a} \\ y_c = 5 \cdot \frac{y_{x,gt} - y_{c,a}}{h_a} \]

The code uses the scaling constants above (5, 10) and calls them "prior variance," but the paper does not mention this fact.

Training Objective

The loss function consists of three parts:

  • the confidence loss
  • the localization loss
  • the l2 loss (weight decay in the Caffe parlance)

The confidence loss is what TensorFlow calls softmax_cross_entropy_with_logits, and it's computed for the class probability part of the parameters of each anchor. Since there are many more positive (matched) anchors than negative (unmatches/background) ones, the learning ends up being more stable if not every background score contributes to the final loss. We need to mine the scores of all the positive anchors and at most three times as of many negative anchors. We only use the background anchors with the highest confidence loss. It results in a somewhat involved code in the declarative style of TensorFlow.

The localization loss sums up the Smooth L1 losses of differences between the prediction and the ground truth labels. The Smooth L1 loss is defined as follows:

\[ SmoothL1(x) = \begin{cases} |x| - 0.5 & x \geq 1 \\ 0.5 \cdot x^2 & x \lt 1 \\ \end{cases} \]

It translates to the following code in TensorFlow:

def smooth_l1_loss(x):
    square_loss   = 0.5*x**2
    absolute_loss = tf.abs(x)
    return tf.where(tf.less(absolute_loss, 1.), square_loss, absolute_loss-0.5)

The paper advises using the batch size of 32. However, this recommendation assumes training in parallel on four GPUs. If you have just one (like I do), 8 is a better number. The original code uses the SGD optimizer with momentum, rate decay at predefined steps, and doubling of the rate for biases. I found that using the Adam optimizer with the exponential decay rate of 0.97 per epoch and using 0.1 as the stability constant (epsilon) works better for this implementation. The TensorFlow documentation warns that the default epsilon may not be a good choice in general and recommends using a higher value in some cases. Indeed, I found that using the default makes the weights very small very fast and the learning process becomes unstable.

Non-Maximum Suppression

Because of the anchor matching strategy and the vast irregularity of the shapes we train on, the network will produce multiple overlapping detections of the same object. One way to get rid of duplicates is to perform a non-maxima suppression. The algorithm is straightforward:

  • you pick your favorite box
  • you remove all the boxes that have the Jaccard overlap with your selection above a certain threshold
  • you choose your second favorite box and repeat step 2
  • you continue until there is no new favorite to select

This article provides a more detailed description, although their selection criterion is rather strange (the position of the lower-right corner) and the implementation is pretty inefficient. My code using numpy's bulk operations is here. I should reimplement it using TensorFlow tensors and will likely do that when I have a spare moment.

Data Augmentation and Issues with Parallelism in Python

The SSD training depends heavily on data augmentation. I won't describe it at all here because the paper does a great job at that. The only tricky part that it does not mention is the fact that you do not clip any ground truth box if it happens to span outside the boundaries of a subsampled input image. See if you want more details.

Things run much faster when the data is preprocessed in parallel before being fed to TensorFlow. However, the poor support for multithreading/multiprocessing in Python turned out to be a significant obstacle here. As you probably know, running your computation in multiple threads is utterly pointless in Python because the execution ends up being serial due to GIL issues. The GIL problem is typically addressed with multiprocessing. However, it comes with a separate can of worms.

First, if you want to transfer any significant amount of data between the processes efficiently, you need to avoid pickling and use the POSIX shared memory instead. It's not hugely complicated, but it's not trivial either. Second, if any of the packages you import uses threading underneath, you're almost guaranteed to encounter fork-safety issues. Add strange errors while forking CUDA-enabled libraries to the mix and you end up with a minor horror story. It took me about a full day of work to write and debug the shared memory queue and to debug the fork safety issues in the pipeline. In case you wonder, this code does the trick for the latter:

workers = []
os.environ['CUDA_VISIBLE_DEVICES'] = ""
cv2_num_threads = cv2.getNumThreads()
for i in range(num_workers):
    args = (sample_queue, batch_queue)
    w = mp.Process(target=batch_producer, args=args)
del os.environ['CUDA_VISIBLE_DEVICES']

Pascal VOC and the mAP Metric

The Pascal VOC (Visual Object Classes) project provides standardized datasets for object class recognition as well as tools for evaluation and comparison of different detection methods. The datasets contain several thousands of annotated Flickr pictures. The metric they use for method comparison of object detection algorithms is called mAP - Mean Average Precision - and is an arithmetic mean of the AP (Average Precision) scores for each object class in the dataset.

The task of object detection is treated as a ranked document retrieval system (as in search) and the AP metric is an 11-point interpolated average precision. More specifically, the system:

  • sorts the detections of a given class in all the images of the dataset by confidence in descending order
  • loops over the detections and classifies them according to the following greedy algorithm:
    • if a detection overlaps with the ground truth object with the IoU score of 50% or more and the object has not been previously detected, it's a true positive
    • if IoU is above 50% but the object has been detected before, or the IoU is below 50%, it's a false positive
    • ground truth object with no matching detections are false negatives
    • calculate the precision and recall for the current state

Precision and recall data points calculated at each iteration contribute to the precision vs. recall curve which is then interpolated according to the following formula, sampled at 11 equally spaced recall points between 0 and 1, and averaged.

\[ p_{interp}(r) = \max_{r' \geq r} p(r') \]

The graph below shows what the curves for the bottle class look like when we decide to accept objects above different confidence thresholds. Note how the curves for lower confidence levels extend the ones for the higher levels.

Precision vs. Recall - Bottle class
Precision vs. Recall - Bottle class

Here are the AP values for the corresponding confidence thresholds:

Confidence AP
0.01 0.497
0.10 0.471
0.30 0.353
0.50 0.270

The lower confidence results we're willing to accept, the higher our AP gets, but also the number of low confidence false positives grows. It makes perfect sense for a ranked document retrieval system. We care a lot whether we get only the relevant results in the first couple of pages of a Google search, but we don't care all that much if we have a bunch of false positives on the hundredth page. Does it make sense when it comes to object detection? That probably varies widely depending on your application. I would argue that, in a general case, when you just care about quality detections, it's somewhat confusing. Below are examples of detections in the same picture with boxes above 0.5 and 0.01 confidence levels coming from the same SSD model. The parameters used to produce the second picture score higher mAP over the entire dataset than the ones used to generate the first one.

Detections above 0.5 confidence
Detections above 0.5 confidence

Detections above 0.01 confidence
Detections above 0.01 confidence

You can get more info about it here.


I trained a somewhat modified version of the vgg300 flavor of the detector on the union of VOC2007+VOC2012 trainval and VOC2007 test samples with heavy data augmentation. It scored 74.7% mAP when tested on the samples it trained on, while the reference score is around 77.5%. The result on the VOC2012 test samples was 68.71% with the reference at 75.8%. I did not use the same aspect ratio and scale settings as the ones utilized by the original implementation. Surprisingly, sticking to the reference parameters produced even worse results. Another reason for the discrepancy may be a different choice of the optimizer and the fact that the reference implementation doubles the learning rate for biases. Using different learning rates for different variables is possible in TensorFlow. However, I have not been able to do that without the system repeating the forward pass and most of the backward pass for each learning rate setting. It effectively almost doubled the training time per epoch, and I was not patient enough to wait for the results.

When I exported the model as a static inference graph, it took roughly 100MB, compared to around 1.3GB when in the checkpoint format. I then used it as a detector in the vehicle detection project I did some time ago. It processed 1261 frames of the testing video, including the FFmpeg compression and decompression time, in roughly 25 seconds reaching over 50 FPS on average. It's a blazing speed considering that my fairly inefficient SVM implementation took well over 8 minutes (~2.5 FPS) to process the same video. Note, however, that, due to the non-maximum suppression, the speed is a function of the number of positive predictions, and this video has relatively few detected objects. You can see the results below.

Vehicle detection with SSD


The project took quite a bit longer than I had initially anticipated but it was a great learning experience and ultimately a great deal of fun. With the hard negative mining, it was probably the most complicated loss function I have implemented in TensorFlow to date. I learned about adaptive feature map scaling, dug through a lot of Caffe's and TensorFlow's source code, learned about the stability of AdamOptimizer, and read a whole bunch of deep learning research papers. I wasted some time fighting mostly non-existent issues because I had not initially paid sufficient attention to what is measured by the accuracy metric. I have a bunch of ideas on how to improve the model to reach the reference performance and I will likely try some of them out in the near future.

All my code is here.

Update 10.03.2018: I have had a look at the PyTorch SSD implementation which achieves better results than mine in the VOC2012 test, but still lower than the baseline. I discovered that the way I did the data augmentation reflected what the paper describes but not what the original Caffe implementation does. I have updated the code in the repo to match the reference. I have also discovered a bug where the ground truth boxes produced by the sampler were sometimes too small to match any anchors. This behavior did not cause any runtime errors, but such samples did not contribute to the loss function and, therefore, had no impact on the optimization process. With these two changes, I was able to shrink the number of anchors used by my models to the level of the original implementation and reproduce my previous results. The performance of my code is still somewhat behind the original one. At this point, I am reasonably sure it's because of the base network weights I used. I will have a look at that when I have a spare moment.

Update 16.08.2020: I have just noticed that the post has not been updated to reflect the fact that my implementation does reproduce the performance results of the original paper after some more tweaks. Please see the GitHub repo for details.


The goal of this project is to plan a path for a car to make its way through highway traffic. You have at your disposal a map consisting of a set of waypoints. The waypoints follow the center of the road, and each of them comes with the normal vector. You also know the positions and velocities of nearby vehicles. Your car needs to obey the speed limit of 50 MPH (22.35 m/s), not collide with other traffic, keep acceleration within certain limits, and minimize jerk (the time derivative of acceleration). The path that you need to compute is a set of successive Cartesian coordinates that the car will visit perfectly every 0.02 seconds.

Here's the result I got. It's not completely optimal at times, but I used large safety margins and did not spend much time tweaking it.

Path Planning

Path planning

Udacity recommends using minimum jerk trajectories defined in the Frenet coordinate space to solve the path planning problem. This approach, however, struck me as suboptimal and hard to do well because of the distance distortions caused by the nonlinearity of the coordinate-space transforms between Cartesian and Frenet and back. Therefore, I decided to reuse the code I wrote for doing model-predictive control. However, instead of using actuations computed by the algorithm, I used the model states themselves as a source of coordinates that the vehicle should visit.

The controller follows a trajectory defined as a polynomial fitted to the waypoints representing the center of one of the lanes. In the first step, it computes 75 points starting from the current position of the car. In each successive step, it takes 25 points from the previously computed trajectory that the car still did not visit, it estimates the vehicle's state at the 25th point and uses this estimated state as an input to the solver. The points produced this way complement the path. This cycle repeats itself every 250 ms with the target speed given by:

\[ v = \begin{cases} v_l - 0.25 \cdot (25 - d_l) & d_l \leq 25 \\ 22.2 & d_l \gt 25 \\ \end{cases} \]

Where \( v_l \) is the speed of the leader and \( d_l \) is the car's distance from it. Subtracting the proximity penalty tends to make speed "bounce" less than when using fractional terms. I was too lazy to tune a PID controller for this task.

Lane selection

The most optimal lane is selected using a simple finite state machine with cost functions associated with state transitions. Every three seconds the algorithm evaluates these functions for all reachable states and chooses the cheapest move.

Lane Changing FSM
Lane Changing FSM

The cost function for keeping the current lane (the KL2KL transition) penalizes the difference in speed between the vehicle and the leader (if any) and their proximity. The one for changing lanes (the KL2CL transition) does a similar calculation for the target lane. Additionally, it increases the cost if a follower is close enough adding a further penalty proportional to its speed.

The start up state allows the car to ramp up to the target speed before the actual lane-change logic kicks in. Doing so avoids erratic behavior arising from the penalization of speed differences.

The exact logic is here.


Using the MPC approach has the advantage of letting you tune more parameters of the trajectory. You can choose how much acceleration and jerk you're going to tolerate, increasing or decreasing the perceived aggressiveness of lane changes.

I also tried incorporating the predicted trajectories of nearby vehicles into the solver's cost function, but the car ended up going out of the lane, accelerating or breaking aggressively to avoid collisions. It is something that you would do while driving in the real world, but it breaks the rules of this particular game. Another problem was that I used a Naïve Bayes Classifier to predict the behavior of other vehicles and it is not accurate enough. The mispredictions made the car behave erratically trying to avoid collisions with non-existing obstacles.

You can see the full code here.


I have recently stumbled upon two articles (1, 2) treating about running TensorFlow on CPU setups. Out of curiosity, I decided to check how the kinds of models I use behave in such situations. As you will see below, the results were somewhat unexpected. I did not put in the time to investigate what went wrong, and my attempts to reason about the performance problems are pure speculations. Instead, I just run my models with a bunch of different threading and OpenMP settings that people typically recommend on the Internet and hoped to have a drop-in alternative to my GPU setup. In particular, I did not convert my models to use the NCHW format as recommended by the Intel article. This data format conversion seems to be particularly important, and people report performance doubling in some cases. However, since my largest test case uses transfer learning, applying the conversion is a pain. If you happen to know how to optimize the settings better without major tweaking of the models, please do drop me a line.

Testing boxes

  • ti: My workstation
    • GPU: GeForce GTX 1080 Ti (11GB, Pascal)
    • CPU: 8 OS CPUs (Core i7-7700K, 1 packages x 4 cores/pkg x 2 threads/core (4 total cores))
    • RAM: 32GB (test data loaded from an SSD)
  • p2: An Amazon p2.xlarge instance
    • GPU: Tesla K80 (12GB, Kepler)
    • CPU: 4 OS CPUs (Xeon E5-2686 v4)
    • RAM: 60GB (test data loaded from a ramdisk)
  • m4: An Amazon m4.16xlarge instance
    • CPU: 64 OS CPUs (Xeon E5-2686 v4, 2 packages x 16 cores/pkg x 2 threads/core (32 total cores))
    • RAM: 256GB (test data loaded from a ramdisk)

TensorFlow settings

The GPU flavor was compiled with CUDA support; the CPU version was configured with only the default settings; the MKL flavor uses the MKL-ML library that the TensorFlow's configuration script downloads automatically;

The GPU and the CPU setups run with the default session settings. The other configurations change the threading and OpenMP setting on the case-by-case basis. I use the following annotations when talking about the tests:

  • [xC,yT] means the KMP_HW_SUBSET envvar setting and the interop and intraop thread numbers set to 1.
  • [affinity] means the KMP_AFFINITY envvar set to granularity=fine,verbose,compact,1,0 and the interop thread number set to 2.
  • [intraop=x, interop=y] means the TensorFlow threading setting and no OpenMP setting.

More information on controlling thread affinity is here, and this is an article on managing thread allocation.

Tests and Results

The test results are the times it took to train one epoch normalized to the result obtained using the ti-gpu configuration - if some score is around 20, it means that this setting is 20 times slower than the baseline.

LeNet - CIFAR10
LeNet - CIFAR10

The first test uses the LeNet architecture on the CIFAR-10 data. The MKL setup run with [4C,2T] on ti and [affinity] on m4. The results are pretty surprising because the model consists of almost exclusively the operations that Intel claims to have optimized. The fact that ti run faster than m4 might suggest that there is some synchronization issue in the graph handling algorithms preventing it from processing a bunch of tiny images efficiently.

Road Sign Classifier
Road Sign Classifier

The second test is my road sign classifier. It uses mainly 2D convolutions and pooling, but they are interleaved with hyperbolic tangents as activations as well as dropout layers. This fact probably prevents the graph optimizer from grouping the MKL nodes together resulting with frequent data format conversions between NHWC and the Intel's SIMD friendly format. Also, ti scored better than m4 for the MKL version but not for the plain CPU implementation. It would suggest inefficiencies in the OpenMP implementation of threading.

Image Segmentation - KITTI (2 classes)
Image Segmentation - KITTI (2 classes)

The third and the fourth test run a fully convolutional neural network besed on VGG16 for an image segmentation project. Apart from the usual suspects, this model uses transposed convolutions to handle learnable upsampling. The tests differ in the size of the input images and in the sizes of the weight matrices handled by the transposed convolutions. For the KITTI dataset, the ti-mkl config run with [intraop=6, interop=6] and m4-mkl with [affinity].

Image Segmentation - Cityscapes (29 classes)
Image Segmentation - Cityscapes (29 classes)

For the Cityscapes dataset, ti-mkl run with [intraop=6, interop=6] and m4-mkl run with [intraop=44, interop=6]. Here the MKL config was as fast as the baseline CPU configs for the dataset with fewer classes and thus smaller upsampling layers. The slowdown for the dataset with more classes could probably be explained by the difference in the handling of the transposed convolution nodes.


It was an interesting experience that arose mixed feelings. On the one hand, the best baseline CPU implementation was at worst two to four times slower with only the compiler optimization than Amazon P2. It's a much better outcome than I had expected. On the other hand, the MKL support was a disappointment. To be fair, in large part it's probably because of my refusal to spend enough time tweaking the parameters, but hey, it was supposed to be a drop-in replacement, and I don't need to do any of these when using a GPU. Another reason is that TensorFlow probably has too few MKL-based kernels to be worth using in this mode and the frequent data format conversions kill the performance. I have also noticed the MKL setup not making any progress with some threading configurations despite all the cores being busy. I might have hit the Intel Hyperthreading bug.

Notes on building TensorFlow

The GPU versions were compiled with GCC 5.4.0, CUDA 8.0 and cuDNN 6. The ti configuration used CUDA capability 6.1, whereas the p2 configuration used 3.7. The compiler flags were:

  • ti: -march=core-avx-i -mavx2 -mfma -O3
  • p2: -march=broadwell -O3

The CPU versions were compiled with GCC 7.1.0 with the following flags:

  • ti: -march=skylake -O3
  • m4: -march=broadwell -O3

I tried compiling the MKL version with the additional -DEIGEN_USE_MKL_VML flag but got worse results.

The MKL library is poorly integrated with the TensorFlow's build system. For some strange reason, the configuration script creates a link to inside the build tree which results with the library being copied to the final wheel package. Doing so is a horrible idea because in glibc mostly provides an interface for's private API so a system update may break the TensorFlow installation. Furthermore, the way in which it figures out which library to link against is broken. The configuration script uses the locate utility to find all files named and picks the first one from the list. Now, locate is not installed on Ubuntu or Debian by default, so if you did not do:

]==> sudo apt-get install locate
]==> sudo updatedb

at some point in the past, the script will be killed without an error message leaving the source tree unconfigured. Moreover, the first pick is usually a wrong one. If you run a 64-bit version of Ubuntu with multilib support, the script will end up choosing a 32-bit version of the library. I happen to hack glibc from time to time, so in my case, it ended up picking one that was cross-compiled for a 64-bit ARM system.

I have also tried compiling Eigen with full MKL support as suggested in this thread. However, the Eigen's and MKL's BLAS interfaces seem to be out of sync. I attempted to fix the situation but gave up when I noticed Eigen passing floats to MKL functions expecting complex numbers using incompatible data types. I will continue using the GPU setup, so fixing all that and doing proper testing was way more effort than I was willing to make.

Node 14.07.2017: My OCD took the upper hand again and I figured it out. Unfortunately, it did not improve the numbers at all.