Table of Contents
- Syscalls, memory, and your first therad
- The pointer to self and thread-local storage
- Futexes, mutexes, and memory sychronization
- Joining threads and dynamic initialization
- Cancellation
- Scheduling and task priority
- RW Locks
- Condition variables
- Final thoughts
Intro
This is a first of hopefully many posts documenting my attempts to understand how to implement a pthread-style threading system on Linux. To this end, I started implementing a small, non-portable and a rather useless library that I called Thread Bites. Thread Bites is useless mainly because it lacks support for compiler-generated thread-local storage. It may not sound like a grave issue, but it makes Thread Bites incompatible with most of glibc. Therefore, I had to provide my own functionality for invoking syscalls, managing the heap and even printing stuff to stdout. It's all pretty simple and understandable so far, so I hope I will be able to implement most of the pthreads' functionality in a couple of small bites. You can get the source from GitHub.
Syscalls
For a program to be even remotely useful, it needs to communicate with the user
in one way or another. A standard library for the programming language, like
glibc, typically provides all the necessary components for such communication.
For reasons mentioned in the introduction, using glibc in this case is not
advisable. Hence, I need to find a way to call the operating system directly
without using the syscall
function, because it is also a part glibc and sets
errno
, which is supposed to reside in the TLS that I did not set up.
All that is needed to implement an equivalent function is shuffling around the values of the registers to translate between the calling conventions of C and the Linux kernel, as described here and here on page 20. As it turns out, it's not that hard to do it using inline assembly in C, and there's an excellent tutorial here.
1#define SYSCALL(name, a1, a2, a3, a4, a5, a6) \
2 ({ \
3 long result; \
4 long __a1 = (long)(a1); \
5 long __a2 = (long)(a2); \
6 long __a3 = (long)(a3); \
7 long __a4 = (long)(a4); \
8 long __a5 = (long)(a5); \
9 long __a6 = (long)(a6); \
10 register long _a1 asm("rdi") = __a1; \
11 register long _a2 asm("rsi") = __a2; \
12 register long _a3 asm("rdx") = __a3; \
13 register long _a4 asm("r10") = __a4; \
14 register long _a5 asm("r8") = __a5; \
15 register long _a6 asm("r9") = __a6; \
16 asm volatile ( \
17 "syscall\n\t" \
18 : "=a" (result) \
19 : "0" (name), "r" (_a1), "r" (_a2), "r" (_a3), \
20 "r" (_a4), "r" (_a5), "r" (_a6) \
21 : "memory", "cc", "r11", "cx"); \
22 (long) result; })
23
24#define SYSCALL1(name, a1) \
25 SYSCALL(name, a1, 0, 0, 0, 0, 0)
26#define SYSCALL2(name, a1, a2) \
27 SYSCALL(name, a1, a2, 0, 0, 0, 0)
28#define SYSCALL3(name, a1, a2, a3) \
29 SYSCALL(name, a1, a2, a3, 0, 0, 0)
30#define SYSCALL4(name, a1, a2, a3, a4) \
31 SYSCALL(name, a1, a2, a3, a4, 0, 0)
32#define SYSCALL5(name, a1, a2, a3, a4, a5) \
33 SYSCALL(name, a1, a2, a3, a4, a5, 0)
34#define SYSCALL6(name, a1, a2, a3, a4, a5, a6) \
35 SYSCALL(name, a1, a2, a3, a4, a5, a6)
All this is, of course, horribly inefficient because it messes up with all the
registers even in the situations it does not have to. The intermediate variables
for the parameters (__a1
and friends) are used to prevent embedded function
calls from messing with the registers that have already been set; think of
strlen
in SYSCALL3(__NR_write, 1, blah, strlen(blah))
.
See the full code on GitHub.
Printing to stdout
It seems that glibc is using errno
and other thread local stuff to calculate
buffer sizes in one of the subroutines called by printf
. It causes printf
to
segfault when called concurrently from different threads because of the same TLS
story. Thread Bites provides a convenience function similar to printf
and
supporting %s
%x
%u
%o
%d
in l
and ll
flavors:
1void tbprint(const char *format, ...);
Malloc
Glibc's default malloc implementation, a ptmalloc2 derivative, uses
thread-specific arenas to limit lock congestion caused by calling malloc
concurrently from multiple threads. It looks like it depends on TLS, so using it
is not the best idea. Thread Bites comes with its own evil version of
malloc
. It's pathological because it's extremely prone to fragmentation, it
never shrinks the heap, and it's essentially one big critical section. It has
some undeniable advantages too: it works, it's incredibly simple, and it fits in
around 50 lines of code. Look here to find it.
The only thing worth noting in this section is that the sys_brk
syscall does
not behave like glibc's brk
or sbrk
functions. On error, it would return
the previous location of the heap boundary, so the code calls it with an
obviously wrong parameter (0
) to figure out what the initial heap boundary is.
Clone
Clone
is an interesting beast. From the standpoint of this section, the
relevant thing about it is that it behaves mostly like fork
, except that the
child is launched on a new stack. For this reason and to generate proper Call
Frame Information (see here and here), it needs to be implemented in
assembly. The implementation puts the user function pointer and its argument on
the child's stack so that they can be later popped and called by the child.
Then, it puts the syscall parameters in the appropriate registers and makes the
syscall. Again, all the code is on GitHub.
Creating a thread
After dealing with all this boilerplate, the actual thread creation is
relatively straightforward. First, the new thread needs a stack it can run on.
The stack could be allocated on the heap, but it's probably safer to get a
separate memory region for it. It can be done using the mmap
syscall. A proper
threading library should check the system limits to figure out the size of the
stack, but Thread Bites is not proper, so it will use 8MiB as a default. All
the function parameters in the snippet below match glibc's mmap
call.
1void *stack = tbmmap(NULL, attr->stack_size, PROT_READ | PROT_WRITE,
2 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
3long status = (long)stack;
4if(status < 0)
5 return status;
It's a good idea to mark a page at the beginning of the stack as non-readable and non-writable to protect any valid adjecent memory if there is any. Stepping over the guard page will get us a familiar segmentation fault. Why at the beginning and not at the end? Because the stack grows downwards, i.e. from a high address towards a lower one. Go here for more details.
1status = SYSCALL3(__NR_mprotect, stack, EXEC_PAGESIZE, PROT_NONE);
2if(status < 0) {
3 tbmunmap(stack, attr->stack_size);
4 return status;
5}
We finally have all that is needed to spawn a new kernel task sharing with the
main task everything that you would expect threads running within the same
process to share. The man page of clone
discusses all of the parameters
in details.
1int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SYSVSEM | CLONE_SIGHAND;
2flags |= CLONE_THREAD;
3int tid = tbclone(start_thread, *thread, flags, stack+attr->stack_size);
4if(tid < 0) {
5 tbmunmap(stack, attr->stack_size);
6 free(*thread);
7 return tid;
8}
Note that the user function is not called directly and start_thread
is called
instead. It does some internal book-keeping and eventually calls the
user-supplied function:
1tbthread_t th = (tbthread_t)arg;
2uint32_t stack_size = th->stack_size;
3void *stack = th->stack;
4th->fn(th->arg);
5free(th);
When the function returns, the stack memory needs to be freed (munmap
ed) and
the thread terminated. The clone
function took some provisions for calling
sys_exit
with the status returned by the wrapper. However, here we remove the
stack from underneath our feet, and we cannot risk any C function write over it.
Therefore, we need to call sys_munmap
and sys_exit
in one piece of assembly.
1register long a1 asm("rdi") = (long)stack;
2register long a2 asm("rsi") = stack_size;
3asm volatile(
4 "syscall\n\t"
5 "movq $60, %%rax\n\t" // 60 = __NR_exit
6 "movq $0, %%rdi\n\t"
7 "syscall"
8 :
9 : "a" (__NR_munmap), "r" (a1), "r" (a2)
10 : "memory", "cc", "r11", "cx");
11return 0;
Lessons learned
- The logic involved in running threads does not seem to be horrendously complicated, at least not yet.
- Glibc is quite convoluted and figuring out what is going on and where is a challenge. It is probably justified by the range of platforms it supports, but the documentation could be better.
- Threading is tightly coupled with glibc in the areas that I have not
initially suspected: the dynamic linker to support TLS, globals in TLS
(
errno
), locales, finalizers, and so on. - POSIX threads have an interesting relationship with the Linux task model. This document describes the initial incompatibilities that have ultimately been ironed out. It is a very interesting read if you want to understand how the userspace and the kernelspace interact to implement a threading system.
- Ultimately all this is not necessarily a black magic. At least not so far.