Table of Contents

  1. Syscalls, memory, and your first therad
  2. The pointer to self and thread-local storage
  3. Futexes, mutexes, and memory sychronization
  4. Joining threads and dynamic initialization
  5. Cancellation
  6. Scheduling and task priority
  7. RW Locks
  8. Condition variables
  9. Final thoughts


This is a first of hopefully many posts documenting my attempts to understand how to implement a pthread-style threading system on Linux. To this end, I started implementing a small, non-portable and a rather useless library that I called Thread Bites. Thread Bites is useless mainly because it lacks support for compiler-generated thread-local storage. It may not sound like a grave issue, but it makes Thread Bites incompatible with most of glibc. Therefore, I had to provide my own functionality for invoking syscalls, managing the heap and even printing stuff to stdout. It's all pretty simple and understandable so far, so I hope I will be able to implement most of the pthreads' functionality in a couple of small bites. You can get the source from GitHub.


For a program to be even remotely useful, it needs to communicate with the user in one way or another. A standard library for the programming language, like glibc, typically provides all the necessary components for such communication. For reasons mentioned in the introduction, using glibc in this case is not advisable. Hence, I need to find a way to call the operating system directly without using the syscall function, because it is also a part glibc and sets errno, which is supposed to reside in the TLS that I did not set up.

All that is needed to implement an equivalent function is shuffling around the values of the registers to translate between the calling conventions of C and the Linux kernel, as described here and here on page 20. As it turns out, it's not that hard to do it using inline assembly in C, and there's an excellent tutorial here.

 1#define SYSCALL(name, a1, a2, a3, a4, a5, a6)           \
 2  ({                                                    \
 3    long result;                                        \
 4    long __a1 = (long)(a1);                             \
 5    long __a2 = (long)(a2);                             \
 6    long __a3 = (long)(a3);                             \
 7    long __a4 = (long)(a4);                             \
 8    long __a5 = (long)(a5);                             \
 9    long __a6 = (long)(a6);                             \
10    register long _a1 asm("rdi") = __a1;                \
11    register long _a2 asm("rsi") = __a2;                \
12    register long _a3 asm("rdx") = __a3;                \
13    register long _a4 asm("r10") = __a4;                \
14    register long _a5 asm("r8")  = __a5;                \
15    register long _a6 asm("r9")  = __a6;                \
16    asm volatile (                                      \
17      "syscall\n\t"                                     \
18      : "=a" (result)                                   \
19      : "0" (name), "r" (_a1), "r" (_a2), "r" (_a3),    \
20        "r" (_a4), "r" (_a5), "r" (_a6)                 \
21      : "memory", "cc", "r11", "cx");                   \
22    (long) result; })
24#define SYSCALL1(name, a1) \
25  SYSCALL(name, a1, 0, 0, 0, 0, 0)
26#define SYSCALL2(name, a1, a2) \
27  SYSCALL(name, a1, a2, 0, 0, 0, 0)
28#define SYSCALL3(name, a1, a2, a3) \
29  SYSCALL(name, a1, a2, a3, 0, 0, 0)
30#define SYSCALL4(name, a1, a2, a3, a4) \
31  SYSCALL(name, a1, a2, a3, a4, 0, 0)
32#define SYSCALL5(name, a1, a2, a3, a4, a5) \
33  SYSCALL(name, a1, a2, a3, a4, a5, 0)
34#define SYSCALL6(name, a1, a2, a3, a4, a5, a6) \
35  SYSCALL(name, a1, a2, a3, a4, a5, a6)

All this is, of course, horribly inefficient because it messes up with all the registers even in the situations it does not have to. The intermediate variables for the parameters (__a1 and friends) are used to prevent embedded function calls from messing with the registers that have already been set; think of strlen in SYSCALL3(__NR_write, 1, blah, strlen(blah)).

See the full code on GitHub.

Printing to stdout

It seems that glibc is using errno and other thread local stuff to calculate buffer sizes in one of the subroutines called by printf. It causes printf to segfault when called concurrently from different threads because of the same TLS story. Thread Bites provides a convenience function similar to printf and supporting %s %x %u %o %d in l and ll flavors:

1void tbprint(const char *format, ...);


Glibc's default malloc implementation, a ptmalloc2 derivative, uses thread-specific arenas to limit lock congestion caused by calling malloc concurrently from multiple threads. It looks like it depends on TLS, so using it is not the best idea. Thread Bites comes with its own evil version of malloc. It's pathological because it's extremely prone to fragmentation, it never shrinks the heap, and it's essentially one big critical section. It has some undeniable advantages too: it works, it's incredibly simple, and it fits in around 50 lines of code. Look here to find it.

The only thing worth noting in this section is that the sys_brk syscall does not behave like glibc's brk or sbrk functions. On error, it would return the previous location of the heap boundary, so the code calls it with an obviously wrong parameter (0) to figure out what the initial heap boundary is.


Clone is an interesting beast. From the standpoint of this section, the relevant thing about it is that it behaves mostly like fork, except that the child is launched on a new stack. For this reason and to generate proper Call Frame Information (see here and here), it needs to be implemented in assembly. The implementation puts the user function pointer and its argument on the child's stack so that they can be later popped and called by the child. Then, it puts the syscall parameters in the appropriate registers and makes the syscall. Again, all the code is on GitHub.

Creating a thread

After dealing with all this boilerplate, the actual thread creation is relatively straightforward. First, the new thread needs a stack it can run on. The stack could be allocated on the heap, but it's probably safer to get a separate memory region for it. It can be done using the mmap syscall. A proper threading library should check the system limits to figure out the size of the stack, but Thread Bites is not proper, so it will use 8MiB as a default. All the function parameters in the snippet below match glibc's mmap call.

1void *stack = tbmmap(NULL, attr->stack_size, PROT_READ | PROT_WRITE,
2                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
3long status = (long)stack;
4if(status < 0)
5  return status;

It's a good idea to mark a page at the beginning of the stack as non-readable and non-writable to protect any valid adjecent memory if there is any. Stepping over the guard page will get us a familiar segmentation fault. Why at the beginning and not at the end? Because the stack grows downwards, i.e. from a high address towards a lower one. Go here for more details.

1status = SYSCALL3(__NR_mprotect, stack, EXEC_PAGESIZE, PROT_NONE);
2if(status < 0) {
3  tbmunmap(stack, attr->stack_size);
4  return status;

We finally have all that is needed to spawn a new kernel task sharing with the main task everything that you would expect threads running within the same process to share. The man page of clone discusses all of the parameters in details.

2flags |= CLONE_THREAD;
3int tid = tbclone(start_thread, *thread, flags, stack+attr->stack_size);
4if(tid < 0) {
5  tbmunmap(stack, attr->stack_size);
6  free(*thread);
7  return tid;

Note that the user function is not called directly and start_thread is called instead. It does some internal book-keeping and eventually calls the user-supplied function:

1tbthread_t th = (tbthread_t)arg;
2uint32_t stack_size = th->stack_size;
3void *stack = th->stack;

When the function returns, the stack memory needs to be freed (munmaped) and the thread terminated. The clone function took some provisions for calling sys_exit with the status returned by the wrapper. However, here we remove the stack from underneath our feet, and we cannot risk any C function write over it. Therefore, we need to call sys_munmap and sys_exit in one piece of assembly.

 1register long a1 asm("rdi") = (long)stack;
 2register long a2 asm("rsi") = stack_size;
 3asm volatile(
 4  "syscall\n\t"
 5  "movq $60, %%rax\n\t" // 60 = __NR_exit
 6  "movq $0, %%rdi\n\t"
 7  "syscall"
 8  :
 9  : "a" (__NR_munmap), "r" (a1), "r" (a2)
10  : "memory", "cc", "r11", "cx");
11return 0;

The full code.

Lessons learned

  • The logic involved in running threads does not seem to be horrendously complicated, at least not yet.
  • Glibc is quite convoluted and figuring out what is going on and where is a challenge. It is probably justified by the range of platforms it supports, but the documentation could be better.
  • Threading is tightly coupled with glibc in the areas that I have not initially suspected: the dynamic linker to support TLS, globals in TLS (errno), locales, finalizers, and so on.
  • POSIX threads have an interesting relationship with the Linux task model. This document describes the initial incompatibilities that have ultimately been ironed out. It is a very interesting read if you want to understand how the userspace and the kernelspace interact to implement a threading system.
  • Ultimately all this is not necessarily a black magic. At least not so far.
If you like this kind of content, you can subscribe to my newsletter, follow me on Twitter, or subscribe to my RSS channel.