Content tagged cuda


I got tired of having to wait for several hours every time I want to build TensorFlow on my Jetson board. The process got especially painful since NVIDIA removed the swap support from the kernel that came with the most recent JetPack. The swap was pretty much only used during the compilation of CUDA sources and free otherwise. Without it, I have to restrict Bazel resources to a bare minimum to avoid OOM kills when the memory usage spikes for a split second. I, therefore, decided to cross-compile TensorFlow for Jetson on a more powerful machine. As usual, it was not exactly smooth sailing, so here's a quick guide.

The toolchain and target side dependencies

First of all, you will need a compiler capable of producing binaries for the target CPU. I have initially built one from sources but then noticed that Ubuntu provides one that is suitable for the task.

]==> sudo apt-get install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu

Let's see if it can indeed produce the binaries for the target:

]==> aarch64-linux-gnu-g++ hello.cxx
]==> scp a.out jetson:
]==> ssh jetson ./a.out
Hello, World!

You will also need the same version of CUDA that comes with the JetPack. You can download the repository setup file from the NVIDIA's website. I usually go for the deb (network) option. Go ahead and set that and then type:

]==> sudo apt-get install cuda-toolkit-8-0

You will need cuDNN 6, both on the build-host- and the target-side. It's quite surprising, but both versions are necessary because the TensorFlow code generators that need to run on the build-host depend on which seems to provide everything from string helpers to cuDNN and cuBLAS wrappers. It's not a great design choice, but to give them justice it's not noticeable unless you try to do weird stuff, like cross-compiling CUDA applications. You can get the host-side version here, and you can download the target-version from the device:

]==> export TARGET_PACKAGES=/some/empty/directory
]==> mkdir -p $TARGET_PACKAGES/cudnn/{include,lib}
]==> cd $TARGET_PACKAGES/cudnn
]==> ln -sf lib lib64
]==> scp jetson:/usr/lib/aarch64-linux-gnu/ lib
]==> cd lib
]==> ln -sf
]==> cd ..
]==> scp jetson:/usr/include/aarch64-linux-gnu/cudnn_v6.h include/cudnn.h

The next step is to install the CUDA libraries for the target. Unfortunately, their packaging is broken. The target architecture in the metadata of the relevant packages is marked as arm64. It makes sense on the surface because they contain aarch64 binaries after all, but it makes them not installable. The convention is to mark the architecture of such packages as all (universal) because the binary shared objects they contain are only meant to be stubs for the cross compiler (see here) and are not supposed to be runnable on the build-host. We will, therefore, need some apt trickery to install them:

]==> sudo apt-get -o Dpkg::Options::="--force-architecture" install \
         cuda-nvrtc-cross-aarch64-8-0:arm64 cuda-cusolver-cross-aarch64-8-0:arm64 \
         cuda-cudart-cross-aarch64-8-0:arm64 cuda-cublas-cross-aarch64-8-0:arm64 \
         cuda-cufft-cross-aarch64-8-0:arm64 cuda-curand-cross-aarch64-8-0:arm64 \
         cuda-cusparse-cross-aarch64-8-0:arm64 cuda-npp-cross-aarch64-8-0:arm64 \
         cuda-nvml-cross-aarch64-8-0:arm64 cuda-nvgraph-cross-aarch64-8-0:arm64

Furthermore, the names of the libraries installed by these packages are inconsistent with their equivalents for the build host, so we will need to make some symlinks in order not to confuse the TensorFlow build scripts.

]==> cd  /usr/local/cuda-8.0/targets/aarch64-linux/lib
]==> for i in cublas curand cufft cusolver; do \
       sudo ln -sf stubs/lib$ lib$ && \
       sudo ln -sf lib$ lib$ && \
       sudo ln -sf lib$ lib$; \

Let's check if all this works at least in a trivial test:

]==> wget
]==> /usr/local/cuda-8.0/bin/nvcc -ccbin /usr/bin/aarch64-linux-gnu-g++ -std=c++11 \
       --gpu-architecture=compute_53 --gpu-code=sm_53,compute_53 \
]==> scp a.out jetson:
]==> ssh jetson ./a.out
Input: 1, 2, 3, 4, 5,
Output: 2, 3, 4, 5, 6,

And with cuBLAS:

]==> wget
]==> aarch64-linux-gnu-g++ hello-cublas.cxx \
       -I /usr/local/cuda-8.0/targets/aarch64-linux/include \
       -L /usr/local/cuda-8.0/targets/aarch64-linux/lib \
       -lcudart -lcublas
]==> scp a.out jetson:
]==> ssh jetson ./a.out
A =
1 2 3
4 5 6

B =
7 8
9 10
11 12

A*B =
58 64
139 154

TensorFlow also needs Python headers for the target. The installation process for these is straight-forward. You will just need to pre-define results of some of the configuration tests in the file. These tests require either access to the dev file system of the target or need to run compiled C code, so they cannot be executed on the build-host.

]==> wget
]==> tar xf Python-3.5.2.tar.xz
]==> cd Python-3.5.2
]==> cat
]==> ./configure --prefix=$TARGET_PACKAGES --enable-shared \
       --host aarch64-linux-gnu --build x86_64-linux-gnu --without-ensurepip
]==> make -j12 && make install
]==> cd $TARGET_PACKAGES/include
]==> mkdir aarch64-linux-gnu
]==> cd aarch64-linux-gnu && ln -sf ../python3.5m

Bazel setup and TensorFlow mods

The way to tell Bazel about a compiler configuration is to write a CROSSTOOL file. The file is just a collection of paths to various tools and the default configuration parameters for them. There are however some things to note here as well. First, the configuration script of TensorFlow asks about the host Python installation and sets the source up to use it. However, what we need in this case is the target Python. Since there seems to be no easy way to plug that into the standard build scripts, we pass the right include directory to the compiler here:

cxx_flag: "-isystem"

We will also need to inject some compiler parameters on the fly for some of the binaries, so we call neither the build-host nor the target compiler directly:

tool_path { name: "gcc" path: "crosstool_wrapper_driver_is_not_gcc" }
tool_path { name: "gcc" path: "crosstool_wrapper_host_tf_framework" }

As mentioned before, one of the problems is that the code generators that need to run on the build host depend on, which in turn, depends on CUDA. We, therefore, need to let the compiler know where the host versions of the CUDA libraries are installed. The second problem is that Bazel fails to link the code generators against the framework library. We fix that in the host wrapper script:

 1if ofile is not None:
 2    is_gen = ofile.endswith('py_wrappers_cc') or ofile.endswith('gen_cc')
 3    if is_cuda == 'yes' and (ofile.endswith('') or is_gen):
 4        cuda_libdirs = [
 5            '-L', '{}/targets/x86_64-linux/lib'.format(cuda_dir),
 6            '-L', '{}/targets/x86_64-linux/lib/stubs'.format(cuda_dir),
 7            '-L', '{}/lib64'.format(cudnn_dir)
 8        ]
10    if is_gen:
11        tf_libs += [
12            '-L', 'bazel-out/host/bin/tensorflow',
13            '-ltensorflow_framework'
14        ]

As far as the target is concerned, the only problem that I noticed is Bazel failing to set up RPATH for the target version of correctly. It causes build failures of some of the binaries that depend on this library. We fix this problem in the wrapper script for the target compiler:

ofile = GetOptionValue(sys.argv[1:], 'o')
if ofile and ofile[0].endswith(''):
  cpu_compiler_flags += [

Some adjustments need to be made to the paths where TensorFlow looks for CUDA libraries and header files. Also, the script needs to be patched to make sure to make sure that the resulting wheel file has the correct platform metadata specified in it.

Building the CPU and the GPU packages

I have put all of the patches I mentioned above in a git repo, so you will need to check that out:

]==> git clone
]==> cd tensorflow
]==> git checkout v1.5.0-cross-jetson-tx1

Let's try a CPU-only setup first. You need to configure the toolchain, and then you can configure and compile TensorFlow as usual. Use /usr/bin/python3 for python, use -O2 for the compilation flags and say no to everything but jemalloc.

]==> cd third_party/toolchains/cpus/aarch64
]==> ./
]==> cd ../../../..
]==> ./configure
]==> bazel  build  --config=opt \
       --crosstool_top=//third_party/toolchains/cpus/aarch64:toolchain \
       --cpu=arm  //tensorflow/tools/pip_package:build_pip_package
]==> mkdir out-cpu
]==> bazel-bin/tensorflow/tools/pip_package/build_pip_package out-cpu --platform linux_aarch64

To get GPU setup working, you will need to rerun the toolchain configuration script and tell it the paths to the build-host side CUDA 8.0 and cuDNN 6. Then configure TensorFlow with the same settings as above, but enable CUDA this time. Tell it the paths to your CUDA 8.0 installation, your target-side cuDNN 6, and specify /usr/bin/aarch64-linux-gnu-gcc as the compiler.

]==> cd third_party/toolchains/cpus/aarch64
]==> ./
]==> cd ../../../..
]==> ./configure
]==> bazel build  --config=opt --config=cuda \
         --crosstool_top=//third_party/toolchains/cpus/aarch64:toolchain \
         --cpu=arm  --compiler=cuda \

The compilation takes roughly 15 minutes for the CPU-only setup and 22 minutes for the CUDA setup on my Core i7 build host. It's a vast improvement comparing to hours on the Jetson board.


I haven't done any extensive testing, but my SSD implementation works fine and reproduces the results I get on other boxes. There is, therefore, a strong reason to believe that things compiled fine.

Detections in a Pascal-VOC example on the Jetson
Detections in a Pascal-VOC example on the Jetson

A month ago or so, I wrote a post about installing TensorFlow 1.1.0 on Jetson TX1. This post is an update for 1.2.0 which has one additional issue on top of the ones discussed previously. The problem is that Eigen is missing some template specializations when used on ARM. The bug has been fixed, but you need to make the TensorFlow build use the fixed version.

diff --git a/tensorflow/workspace.bzl b/tensorflow/workspace.bzl
index 2a206b0ac..f44a17405 100644
--- a/tensorflow/workspace.bzl
+++ b/tensorflow/workspace.bzl
@@ -150,11 +150,10 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       name = "eigen_archive",
       urls = [
-          "",
-          "",
+          "",
-      sha256 = "ca7beac153d4059c02c8fc59816c82d54ea47fe58365e8aded4082ded0b820c4",
-      strip_prefix = "eigen-eigen-f3a22f35b044",
+      sha256 = "a34b208da6ec18fa8da963369e166e4a368612c14d956dd2f9d7072904675d9b",
+      strip_prefix = "eigen-eigen-d781c1de9834",
       build_file = str(Label("//third_party:eigen.BUILD")),

The build instructions are the same as for the previous versions, but you need to checkout the v1.2.0-jetson-tx1 tag from my repository to get all the fixes.


I had expected a smooth ride with this one, but it turned out to be quite an adventure and not one of a pleasant kind. To be fair, the likely reason why it's such a horror story is that I was bootstrapping bazel - the build software that TensorFlow uses - on an unsupported system. I spent more time figuring out the dependency issues related to that than working on TensorFlow itself. This post was initially supposed to be a rant on the Java dependency hell. However, in the end, my stubbornness took the upper hand, and I did not go to sleep until it all worked, so you have a HOWTO instead.


Jetson TX1
Jetson TX1

You'll need the board itself and the following installed on it:

  • Jetpack 3.0
  • L4T 24.2.1
  • Cuda Toolkit 8.0.34-1
  • cuDNN 5.1.5-1+cuda8.0


A Java Development Kit

First of all, you'll need a Java compiler and related utilities. Just type:

]==> sudo apt-get install default-jdk

It would not have been worth a separate paragraph, except that the version that comes with the system messes up the CA certificates. You won't be able to download things from GitHub without overriding SSL warnings. I fixed that by installing ca-certificates and ca-certificates-java from Debian.

Protocol Buffers

You'll need the exact two versions mentioned below. No other versions work down the road. I learned about this fact the hard way. Be sure to call on the master branch first - it needs to download gmock, and the link in older tags points to the void.

]==> sudo apt-get install curl
]==> git clone
]==> cd protobuf
]==> ./

This version is needed for the gRPC Java codegen plugin.

]==> git checkout v3.0.0-beta-3
]==> ./
]==> ./configure --prefix=/home/ljanyst/Temp/protobuf-3.0.0-beta-3
]==> make -j12 && make install

This one is needed by Bazel itself.

]==> git checkout v3.0.0
]==> ./
]==> ./configure --prefix=/home/ljanyst/Temp/protobuf-3.0.0
]==> make -j12 && make install

gRPC Java

Building this one took me a horrendous amount of time. At first, I thought that the whole package is needed. Apart from problems with the protocol buffer versions, it has some JNI dependencies that are problematic to compile. Even after I have successfully produced these, they had interoperability issues with other dependencies. After some digging, it turned out that only one component of the package is actually required, so the whole effort was unnecessary. Of course, the source needed patching to make it build on aarch64, but I won't bore you with that. Again, make sure you use the v0.15.0-jetson-tx1 tag - no other tag will work.

]==> git clone
]==> cd grpc-java
]==> git checkout v0.15.0-jetson-tx1
]==> echo protoc=/home/ljanyst/Temp/protobuf-3.0.0-beta-3/bin/protoc >
]==> CXXFLAGS=-I/home/ljanyst/Temp/protobuf-3.0.0-beta-3/include \
     LDFLAGS=-L/home/ljanyst/Temp/protobuf-3.0.0-beta-3/lib \
     ./gradlew java_pluginExecutable


The latest available release of Bazel (0.4.5) does not build on aarch64 without the patch listed below. I took it from the master branch.

diff --git a/src/main/java/com/google/devtools/build/lib/util/ b/src/main/java/com/google/devtools/build/lib/util/
index 7a85c29..ff8bc86 100644
--- a/src/main/java/com/google/devtools/build/lib/util/
+++ b/src/main/java/com/google/devtools/build/lib/util/
@@ -25,7 +25,7 @@ public enum CPU {
   X86_32("x86_32", ImmutableSet.of("i386", "i486", "i586", "i686", "i786", "x86")),
   X86_64("x86_64", ImmutableSet.of("amd64", "x86_64", "x64")),
   PPC("ppc", ImmutableSet.of("ppc", "ppc64", "ppc64le")),
-  ARM("arm", ImmutableSet.of("arm", "armv7l")),
+  ARM("arm", ImmutableSet.of("arm", "armv7l", "aarch64")),
   S390X("s390x", ImmutableSet.of("s390x", "s390")),
   UNKNOWN("unknown", ImmutableSet.<String>of());

The compilation is straightforward, but make sure you point to the right version of protocol buffers and the gRPC Java compiler built earlier.

]==> git clone
]==> cd bazel
]==> git checkout 0.4.5
]==> export PROTOC=/home/ljanyst/Temp/protobuf-3.0.0/bin/protoc
]==> export GRPC_JAVA_PLUGIN=/home/ljanyst/Temp/grpc-java/compiler/build/exe/java_plugin/protoc-gen-grpc-java
]==> ./
]==> export PATH=/home/ljanyst/Temp/bazel/output:$PATH


  • Note 22.06.2017: Go here for TensorFlow 1.2.0. See the v1.2.0-jetson-tx1 tag.
  • Note 03.07.2017: For TensorFlow 1.2.1, see the v1.2.1-jetson-tx1 tag.
  • Note 18.08.2017: For TensorFlow 1.3.0, see the v1.3.0-jetson-tx1 tag. I have tested it against JetPack 3.1 which fixes the CUDA-related bugs. Note that the kernel in this version of JetPack has been compiled without swap support, so you may want to add --local_resources=2048,0.5,0.5 to the bazel commandline if you want to avoid the out-of-memory kills.
  • Note 21.11.2017: TensorFlow 1.4.0 builds on the Jetson without any modification. However, you will need a newer version of bazel. This is the combination of versions that worked for me:
    • protocol buffers 3.4.0
    • grpc-java 1.6.1 with the ARM patch
    • bazel 0.7.0 with this and this patch


The version of CUDA toolkit for this device is somewhat handicapped. nvcc has problems with variadic templates and compiling some kernels using Eigen makes it crash. I found that adding:


to these problematic files makes the problem go away. A constructor with an initializer list seems to be an issue in one of the cases as well. Using the default constructor instead, and then initializing the array elements one by one makes things go through.

Also, the cuBLAS API seems to be incomplete. It only defines 5 GEMM algorithms (General Matrix to Matrix Multiplication) where the newer patch releases of the toolkit define 8. TensorFlow enumerates them by name to experimentally determine which one is best for a given computation and the code notes that they may fail under perfectly normal circumstances (i.e., a GPU older than sm_50). Therefore, simply omitting the missing algorithms should be perfectly safe.

diff --git a/tensorflow/stream_executor/cuda/ b/tensorflow/stream_executor/cuda/
index 2c650af..49c6db7 100644
--- a/tensorflow/stream_executor/cuda/
+++ b/tensorflow/stream_executor/cuda/
@@ -1912,8 +1912,7 @@ bool CUDABlas::GetBlasGemmAlgorithms(
 #if CUDA_VERSION >= 8000
   for (cublasGemmAlgo_t algo :

See the full patches are here and here.

Memory Consumption

The compilation process may take considerable amounts of RAM - more than the device has available. The documentation advises to use only one execution thread (--local_resources 2048,.5,1.0 param for Bazel), so that you don't get the OOM kills. It's unnecessary most of the time, though, because it's only the last 20% of the compilation steps when the memory is filled completely. Instead, I used an SD card as a swap device.

]==> sudo mkswap /dev/mmcblk1p2
]==> sudo swapon /dev/mmcblk1p2

At peak times, the entire RAM and around 7.5GB of swap were used. However, only at most 5 to 6 compilation threads were in the D state (uninterruptable sleep due to IO), with 2 to 3 being runnable.


You need to install these packages before you can proceed.

]==> sudo apt-get install python3-numpy python3-dev python3-pip
]==> sudo apt-get install python3-wheel python3-virtualenv

Then clone my repo containing the necessary patches and configure the source. I used the system version of Python 3, located at /usr/bin/python3 with its default library in /usr/lib/python3/dist-packages. The answers to the CUDA related questions are:

  • the version of the SDK is 8.0;
  • the version of cuDNN is 5.1.5, and it's located in /usr;
  • the CUDA compute capability for TX1 is 5.3.

Go ahead and run:

]==> git clone
]==> cd tensorflow
]==> git checkout v1.1.0-jetson-tx1
]==> ./configure

Finally, run the compilation, and, 2 hours and change after, build the wheel:

]==> bazel build --config=opt --config=cuda --curses=no --show_task_finish \
]==> bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
]==> cp /tmp/tensorflow_pkg/*.whl ../

I don't particularly like polluting system directories with custom-built binaries, so I use virtualenv to handle pip-installed Python packages.

]==> mkdir -p ~/Apps/virtualenvs/tensorflow
]==> cd ~/Apps/virtualenvs/tensorflow
]==> python3 /usr/lib/python3/dist-packages/ -p /usr/bin/python3 .
]==> . ./bin/activate
(tensorflow) ]==> pip install ~/Temp/tensorflow-1.1.0-cp35-cp35m-linux_aarch64.whl

The device is identified correctly when starting a new TensorFlow session. You should see the following if you don't count warnings about NUMA:

Found device 0 with properties:
name: NVIDIA Tegra X1
major: 5 minor: 3 memoryClockRate (GHz) 0.072
pciBusID 0000:00:00.0
Total memory: 3.90GiB
Free memory: 2.11GiB
DMA: 0
0:   Y
Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0)

The CPU and the GPU share the memory controller, so the GPU does not have the 4GB just for itself. On the upside you can use the CUDA unified memory model without penalties (no memory copies).


I run two benchmarks to see if things work as expected. The first one was my TensorFlow implementation of LeNet training on and classifying the MNIST data. The training code run twice as fast on the TX1 comparing to my 4th generation Carbon X1 laptop. The second test was my slightly enlarged implementation of Sermanet applied to classifying road signs. The convolution part of the training process took roughly 20 minutes per epoch, which is a factor of two improvement over the performance of my laptop. The pipeline was implemented with a large device in mind, though, and expected 16GB of RAM. The TX1 has only 4GB, so the swap speed was a bottleneck here. Based on my observations of the processing speed of individual batches, I can speculate that a further improvement of a factor of two is possible with a properly optimized pipeline.