Intro

I got tired of waiting several hours every time I wanted to build TensorFlow on my Jetson board. The process became especially painful after NVIDIA removed swap support from the kernel that comes with the most recent JetPack. The swap was used pretty much only during the compilation of CUDA sources and stayed free otherwise. Without it, I have to restrict Bazel's resource usage to a bare minimum to avoid OOM kills when memory usage spikes for a split second. I, therefore, decided to cross-compile TensorFlow for Jetson on a more powerful machine. As usual, it was not exactly smooth sailing, so here's a quick guide.

The toolchain and target side dependencies

First of all, you will need a compiler capable of producing binaries for the target CPU. I initially built one from source, but then noticed that Ubuntu provides one that is suitable for the task.

]==> sudo apt-get install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu

Let's see if it can indeed produce the binaries for the target:

]==> aarch64-linux-gnu-g++ hello.cxx
]==> scp a.out jetson:
]==> ssh jetson ./a.out
Hello, World!

You will also need the same version of CUDA that comes with the JetPack. You can download the repository setup file from NVIDIA's website; I usually go for the deb (network) option. Go ahead and set that up, and then type:

]==> sudo apt-get install cuda-toolkit-8-0

You will need cuDNN 6 on both the build-host and the target side. It's quite surprising, but both versions are necessary because the TensorFlow code generators that need to run on the build-host depend on libtensorflow_framework.so, which seems to provide everything from string helpers to cuDNN and cuBLAS wrappers. It's not a great design choice, but to do them justice, it's not noticeable unless you try to do weird stuff, like cross-compiling CUDA applications. You can get the host-side version here, and you can download the target version from the device:

]==> export TARGET_PACKAGES=/some/empty/directory
]==> mkdir -p $TARGET_PACKAGES/cudnn/{include,lib}
]==> cd $TARGET_PACKAGES/cudnn
]==> ln -sf lib lib64
]==> scp jetson:/usr/lib/aarch64-linux-gnu/libcudnn.so.6.0.21 lib
]==> cd lib
]==> ln -sf libcudnn.so.6.0.21 libcudnn.so.6
]==> cd ..
]==> scp jetson:/usr/include/aarch64-linux-gnu/cudnn_v6.h include/cudnn.h

The next step is to install the CUDA libraries for the target. Unfortunately, their packaging is broken: the target architecture in the metadata of the relevant packages is marked as arm64. It makes sense on the surface, because they do contain aarch64 binaries, but it makes them impossible to install on an x86_64 build-host. The convention is to mark the architecture of such packages as all (universal), because the binary shared objects they contain are only meant to be stubs for the cross-compiler (see here) and are not supposed to be runnable on the build-host. We will, therefore, need some apt trickery to install them:

]==> sudo apt-get -o Dpkg::Options::="--force-architecture" install \
         cuda-nvrtc-cross-aarch64-8-0:arm64 cuda-cusolver-cross-aarch64-8-0:arm64 \
         cuda-cudart-cross-aarch64-8-0:arm64 cuda-cublas-cross-aarch64-8-0:arm64 \
         cuda-cufft-cross-aarch64-8-0:arm64 cuda-curand-cross-aarch64-8-0:arm64 \
         cuda-cusparse-cross-aarch64-8-0:arm64 cuda-npp-cross-aarch64-8-0:arm64 \
         cuda-nvml-cross-aarch64-8-0:arm64 cuda-nvgraph-cross-aarch64-8-0:arm64

Furthermore, the names of the libraries installed by these packages are inconsistent with their build-host equivalents, so we will need to create a few symlinks in order not to confuse the TensorFlow build scripts.

]==> cd /usr/local/cuda-8.0/targets/aarch64-linux/lib
]==> for i in cublas curand cufft cusolver; do \
       sudo ln -sf stubs/lib$i.so lib$i.so.8.0.61 && \
       sudo ln -sf lib$i.so.8.0.61 lib$i.so.8.0 && \
       sudo ln -sf lib$i.so.8.0 lib$i.so; \
     done

Let's check if all this works, at least in a trivial test:

]==> wget https://raw.githubusercontent.com/ljanyst/kicks-and-giggles/master/hello/hello.cu
]==> /usr/local/cuda-8.0/bin/nvcc -ccbin /usr/bin/aarch64-linux-gnu-g++ -std=c++11 \
       --gpu-architecture=compute_53 --gpu-code=sm_53,compute_53 \
       hello.cu
]==> scp a.out jetson:
]==> ssh jetson ./a.out
Input: 1, 2, 3, 4, 5,
Output: 2, 3, 4, 5, 6,

And with cuBLAS:

]==> wget https://raw.githubusercontent.com/ljanyst/kicks-and-giggles/master/hello/hello-cublas.cxx
]==> aarch64-linux-gnu-g++ hello-cublas.cxx \
       -I /usr/local/cuda-8.0/targets/aarch64-linux/include \
       -L /usr/local/cuda-8.0/targets/aarch64-linux/lib \
       -lcudart -lcublas
]==> scp a.out jetson:
]==> ssh jetson ./a.out
A =
1 2 3
4 5 6

B =
7 8
9 10
11 12

A*B =
58 64
139 154

TensorFlow also needs Python headers for the target. The installation process for these is straightforward. You will just need to pre-define the results of some of the configuration tests in the config.site file. These tests either require access to the target's /dev file system or need to run compiled C code, so they cannot be executed on the build-host.

]==> wget https://www.python.org/ftp/python/3.5.2/Python-3.5.2.tar.xz
]==> tar xf Python-3.5.2.tar.xz
]==> cd Python-3.5.2
]==> cat config.site
ac_cv_file__dev_ptmx=yes
ac_cv_file__dev_ptc=no
ac_cv_buggy_getaddrinfo=no
]==> CONFIG_SITE=config.site ./configure --prefix=$TARGET_PACKAGES --enable-shared \
       --host aarch64-linux-gnu --build x86_64-linux-gnu --without-ensurepip
]==> make -j12 && make install
]==> cd $TARGET_PACKAGES/include
]==> mkdir aarch64-linux-gnu
]==> cd aarch64-linux-gnu && ln -sf ../python3.5m

Bazel setup and TensorFlow mods

The way to tell Bazel about a compiler configuration is to write a CROSSTOOL file. The file is just a collection of paths to various tools and the default configuration parameters for them. There are, however, some things to note here as well. First, the configuration script of TensorFlow asks about the host Python installation and sets the source up to use it. However, what we need in this case is the target Python. Since there seems to be no easy way to plug that into the standard build scripts, we pass the right include directory to the compiler here:

cxx_flag: "-isystem"
cxx_flag: "__TARGET_PYTHON_INCLUDES__"
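
The configure.py script in the toolchain directory (used later in this guide) is responsible for filling in placeholders like this one. A minimal sketch of that substitution step, with the CROSSTOOL.tpl template name and the params dictionary being assumptions of mine, could look like this:

# A sketch of how a toolchain configuration script can fill in CROSSTOOL
# placeholders; the template file name and the params dict are assumptions.
import os

def render_crosstool(template_path, output_path, params):
    with open(template_path) as f:
        content = f.read()
    # Replace every __PLACEHOLDER__ with its configured value.
    for key, value in params.items():
        content = content.replace(key, value)
    with open(output_path, 'w') as f:
        f.write(content)

render_crosstool(
    'CROSSTOOL.tpl', 'CROSSTOOL',
    {'__TARGET_PYTHON_INCLUDES__':
         os.environ['TARGET_PACKAGES'] + '/include/python3.5m'})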

We will also need to inject some compiler parameters on the fly for some of the binaries, so we call neither the build-host compiler nor the target compiler directly; each toolchain definition points to a wrapper script instead:

tool_path { name: "gcc" path: "crosstool_wrapper_driver_is_not_gcc" }
...
tool_path { name: "gcc" path: "crosstool_wrapper_host_tf_framework" }
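
Both wrappers follow the same general pattern: take the arguments that Bazel passes, adjust them as needed, and hand off to the real compiler. A minimal sketch of that pattern, with the compiler path being an assumption, might look like this:

#!/usr/bin/env python
# The general shape of such a wrapper script (a sketch, not the actual
# script from the repo); the compiler path below is an assumption.
import os
import sys

CPU_COMPILER = '/usr/bin/aarch64-linux-gnu-gcc'

def main():
    args = sys.argv[1:]
    # ... inspect the arguments and inject extra flags here ...
    os.execv(CPU_COMPILER, [CPU_COMPILER] + args)

if __name__ == '__main__':
    main()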

As mentioned before, one of the problems is that the code generators that need to run on the build-host depend on libtensorflow_framework.so, which, in turn, depends on CUDA. We, therefore, need to let the compiler know where the host versions of the CUDA libraries are installed. The second problem is that Bazel fails to link the code generators against the framework library. We fix both in the host wrapper script:

if ofile is not None:
    is_gen = ofile.endswith('py_wrappers_cc') or ofile.endswith('gen_cc')
    if is_cuda == 'yes' and (ofile.endswith('libtensorflow_framework.so') or is_gen):
        # Let the host linker find the build-host versions of the CUDA
        # and cuDNN libraries.
        cuda_libdirs = [
            '-L', '{}/targets/x86_64-linux/lib'.format(cuda_dir),
            '-L', '{}/targets/x86_64-linux/lib/stubs'.format(cuda_dir),
            '-L', '{}/lib64'.format(cudnn_dir)
        ]

    if is_gen:
        # Link the code generators against the host framework library
        # explicitly, since Bazel fails to do it on its own.
        tf_libs += [
            '-L', 'bazel-out/host/bin/tensorflow',
            '-ltensorflow_framework'
        ]

As far as the target is concerned, the only problem I noticed is Bazel failing to set up the RPATH correctly for the target version of libtensorflow_framework.so. This causes build failures for some of the binaries that depend on this library. We fix the problem in the wrapper script for the target compiler:

ofile = GetOptionValue(sys.argv[1:], 'o')
if ofile and ofile[0].endswith('libtensorflow_framework.so'):
    # Bazel does not set this RPATH on its own, which breaks binaries
    # linked against the target framework library.
    cpu_compiler_flags += [
        '-Wl,-rpath,' + os.getcwd() +
        '/bazel-out/arm-py3-opt/genfiles/external/local_config_cuda/cuda/cuda/lib',
    ]
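
GetOptionValue extracts the values of a given command-line option from the argument list. A simplified stand-in, handling only the detached "-o value" form used above, might look like this:

# A simplified stand-in for GetOptionValue; it handles only the detached
# "-o value" form that matters in the snippet above.
def GetOptionValue(argv, option):
    values = []
    for i, arg in enumerate(argv):
        if arg == '-' + option and i + 1 < len(argv):
            values.append(argv[i + 1])
    return values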

Some adjustments need to be made to the paths where TensorFlow looks for CUDA libraries and header files. Also, the build_pip_package.sh script needs to be patched to make sure that the resulting wheel file has the correct platform metadata in it.

Building the CPU and the GPU packages

I have put all of the patches I mentioned above in a git repo, so you will need to check that out:

]==> git clone https://github.com/ljanyst/tensorflow.git
]==> cd tensorflow
]==> git checkout v1.5.0-cross-jetson-tx1

Let's try a CPU-only setup first. You need to configure the toolchain, and then you can configure and compile TensorFlow as usual. Use /usr/bin/python3 for Python, -O2 for the compilation flags, and say no to everything but jemalloc.

]==> cd third_party/toolchains/cpus/aarch64
]==> ./configure.py
]==> cd ../../../..
]==> ./configure
]==> bazel build --config=opt \
       --crosstool_top=//third_party/toolchains/cpus/aarch64:toolchain \
       --cpu=arm //tensorflow/tools/pip_package:build_pip_package
]==> mkdir out-cpu
]==> bazel-bin/tensorflow/tools/pip_package/build_pip_package out-cpu --platform linux_aarch64

To get the GPU setup working, you will need to rerun the toolchain configuration script and tell it the paths to the build-host-side CUDA 8.0 and cuDNN 6. Then configure TensorFlow with the same settings as above, but enable CUDA this time. Tell it the paths to your CUDA 8.0 installation and your target-side cuDNN 6, and specify /usr/bin/aarch64-linux-gnu-gcc as the compiler.

]==> cd third_party/toolchains/cpus/aarch64
]==> ./configure.py
]==> cd ../../../..
]==> ./configure
]==> bazel build --config=opt --config=cuda \
         --crosstool_top=//third_party/toolchains/cpus/aarch64:toolchain \
         --cpu=arm --compiler=cuda \
         //tensorflow/tools/pip_package:build_pip_package

The compilation takes roughly 15 minutes for the CPU-only setup and 22 minutes for the CUDA setup on my Core i7 build host. It's a vast improvement compared to the hours it takes on the Jetson board.

Tests

I haven't done any extensive testing, but my SSD implementation works fine and reproduces the results I get on other boxes. There is, therefore, a strong reason to believe that things compiled fine.
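
For a quick additional sanity check on the device, a minimal TensorFlow 1.x session along these lines should run and report GPU device placement once the wheel is installed:

# A minimal smoke test for the GPU build; run it on the Jetson after
# installing the wheel produced above.
import tensorflow as tf

a = tf.constant([1.0, 2.0, 3.0])
b = tf.constant([4.0, 5.0, 6.0])
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(a + b))  # expect [5. 7. 9.] plus device placement logs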

Detections in a Pascal-VOC example on the Jetson