Content from 2016-02

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion

Recycling the thread descriptors

How do we know when the thread task has died? This is what the CHILD_SETTID and CHILD_CLEARTID flags to sys_clone are for. If they are set, the kernel will store the new thread's TID at the location pointed to by the ctid argument (see tb #1). When the thread terminates, the kernel will set the TID to 0 and wake the futex at this location. It is a convenient way to wait for a thread to finish. Unfortunately, as far as I can tell, there is no way to unset these flags, and it makes implementing tbthread_detach a pain. We cannot delete the thread descriptor in the thread it refers to anymore. Doing so would cause the kernel to write to a memory location that might have been either unmapped or reused. Therefore, we need to have some sort of a cache holding thread descriptors and make sure that we re-use them only after the thread they were referring to before has exited. Thread bites uses two linked lists to maintain this cache, and the descriptor allocation function calls the following procedure to wait until the corresponding task is gone:

1 static void wait_for_thread(tbthread_t thread)
2 {
3   uint32_t tid = thread->exit_futex;
4   long ret = 0;
5   if(tid != 0)
6     do {
7       ret = SYSCALL3(__NR_futex, &thread->exit_futex, FUTEX_WAIT, tid);
8     } while(ret != -EWOULDBLOCK && ret != 0);
9 }

See the full patch at GitHub.

Joining threads

Joining complicates things a bit further because it does quite a bit of error checking to prevent deadlocks and such. To perform these checks, the thread calling tbthread_join needs to have a valid thread descriptor obtainable by tbthread_self. The problem is that we have never set this thread descriptor up for the main thread, and we need to do it by hand at the beginning of the program. The original state needs to be restored at the end because glibc uses it internally and not cleaning things up causes segfaults.

 1 static void *glibc_thread_desc;
 2 void tbthread_init()
 3 {
 4   glibc_thread_desc = tbthread_self();
 5   tbthread_t thread = malloc(sizeof(struct tbthread));
 6   memset(thread, 0, sizeof(struct tbthread));
 7   thread->self = thread;
 8   SYSCALL2(__NR_arch_prctl, ARCH_SET_FS, thread);
 9 }
11 void tbthread_finit()
12 {
13   free(tbthread_self());
14   SYSCALL2(__NR_arch_prctl, ARCH_SET_FS, glibc_thread_desc);
15 }

After performing all the validity and deadlock checks, the meat of tbthread_join is rather simple:

1 wait_for_thread(thread);
2 if(retval)
3   *retval = thread->retval;
4 release_descriptor(thread);
5 return 0;

See the full patch at GitHub.

Dynamic initialization

pthread_once is an interesting beast. Its purpose is to initialize dynamically some resources by calling a designated function exactly once. The fun part is that the actual initialization call may be made from multiple threads at the same time. pthread_once_t, therefore, is kind of like a mutex, but has three states instead of two:

  • new: the initialization function has not been called yet; one of the threads needs to call it.
  • in progress: the initialization function is running; the threads are waiting for it to finish.
  • done: the initialization function is done; all the threads may be woken up.

The thread that manages to change the state from new to in progress gets to call the function. All the other threads wait until the done state is reached.

 1 int tbthread_once(tbthread_once_t *once, void (*func)(void))
 2 {
 3   if(!once || !func)
 4     return -EINVAL;
 6   if(*once == TB_ONCE_DONE)
 7     return 0;
 9   if(__sync_bool_compare_and_swap(once, TB_ONCE_NEW, TB_ONCE_IN_PROGRESS)) {
10     (*func)();
11     *once = TB_ONCE_DONE;
12     SYSCALL3(__NR_futex, once, FUTEX_WAKE, INT_MAX);
13     return 0;
14   }
16   while(*once != TB_ONCE_DONE)
18   return 0;
19 }

Side effects

The original glibc thread descriptor stores the localization information for the thread. Changing it to ours causes seemingly simple functions, like strerror, to segfault. Therefore, we need to implement strerror ourselves.

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion


This part discusses an implementation of a mutex. It will not be a particularly efficient mutex, but it will be an understandable and straightforward one. We will not bother minimizing the number of system calls or implementing lock elision. We will also not handle the case of inter-process communication. Therefore, the process-shared and robust mutexes will not be discussed. If you are interested in these ideas, I recommend the kernel's documentation file on robust mutexes and Intel's blog post on lock elision. The scheduling related parameters will likely be dealt with on another occasion.


POSIX defines a bunch of different mutexes. See the manpage for pthread_mutexattr_settype to learn more. On Linux, all of them are implemented using the same locking primitive - a Futex (Fast User-Space Mutex). It is a 4-byte long chunk of aligned memory. Its contents can be updated atomically, and its address can be used to refer to a kernel-level process queue. The kernel interface is defined as follows:

1 asmlinkage long sys_futex(u32 __user *uaddr, int op, u32 val,
2                          struct timespec __user *utime, u32 __user *uaddr2,
3                          u32 val3);

We will only use the first four out of the six parameters here. The first one is the address of the futex, and the second one is the type of the operation to be performed. The meaning of the remaining parameters depends on the context. We will need only two of the available operations to implement a mutex:

  • FUTEX_WAIT puts a thread to sleep if the value passed in val is the same as the value stored in the memory pointed to by *uaddr. Optionally, the sleep time may be limited by passing a pointer to a timespec object. The return values are:
    • 0, if the thread was woken up by FUTEX_WAIT.
    • EWOULDBLOCK, if the value of the futex was different than val.
    • EINTR, if the sleep was interrupted by a signal.
  • FUTEX_WAKE wakes the number of threads specified in val. In practice, it only makes sense to either wake one or all sleeping threads, so we pass either 1 or INT_MAX respectively.

See the original futex paper for more details.

Normal mutex

We start with a normal mutex because it is possible to implement all the other kinds using the procedures defined for it. If the value of the associated futex is 0, then the mutex is unlocked. Locking it means changing the value to 1. To avoid race conditions, both the checking and the changing need to be done atomically. GCC has a built-in function to do this that results with lock cmpxchgq or similar being emitted in assembly. If the locking fails, we need to wait until another thread releases the mutex and re-check the lock.

1 static int lock_normal(tbthread_mutex_t *mutex)
2 {
3   while(1) {
4     if(__sync_bool_compare_and_swap(&mutex->futex, 0, 1))
5       return 0;
6     SYSCALL3(__NR_futex, &mutex->futex, FUTEX_WAIT, 1);
7   }
8 }

The logic of trylock is essentially the same, except that we return a failure instead of sleeping.

1 static int trylock_normal(tbthread_mutex_t *mutex)
2 {
3   if(__sync_bool_compare_and_swap(&mutex->futex, 0, 1))
4       return 0;
5   return -EBUSY;
6 }

To unlock a mutex, we do the reverse. We change the futex value from 1 to 0 and wake one thread waiting for the futex to be released.

1 static int unlock_normal(tbthread_mutex_t *mutex)
2 {
3   if(__sync_bool_compare_and_swap(&mutex->futex, 1, 0))
4     SYSCALL3(__NR_futex, &mutex->futex, FUTEX_WAKE, 1);
5   return 0;
6 }

Note that the values stored in the futex are application-specific and arbitrary. The kernel does not care and does not change this variable except in one case, which we will discuss in a later chapter.

This mutex is not particularly efficient because we make a system call while unlocking regardless of whether there is a waiting thread or not. To see how the situation could be improved, please refer to Ulrich Drepper's Futexes Are Tricky.

Error-check mutex

These guys are the same as the ones discussed earlier, except that they do additional bookkeeping. An error-check mutex remembers who owns it, to report the following types of errors:

  • re-locking of a mutex the thread already owns
  • unlocking a mutex owned by another thread
  • unlocking a mutex that is not locked

The behavior of normal mutexes in these cases is undefined.

 1 static int lock_errorcheck(tbthread_mutex_t *mutex)
 2 {
 3   tbthread_t self = tbthread_self();
 4   if(mutex->owner == self)
 5     return -EDEADLK;
 6   lock_normal(mutex);
 7   mutex->owner = self;
 8   return 0;
 9 }
11 static int trylock_errorcheck(tbthread_mutex_t *mutex)
12 {
13   int ret = trylock_normal(mutex);
14   if(ret == 0)
15     mutex->owner = tbthread_self();
16   return ret;
17 }
19 static int unlock_errorcheck(tbthread_mutex_t *mutex)
20 {
21   if(mutex->owner != tbthread_self() || mutex->futex == 0)
22     return -EPERM;
23   mutex->owner = 0;
24   unlock_normal(mutex);
25   return 0;
26 }

Recursive mutex

Recursive mutexes may be locked multiple times by the same thread and require the same numbers of unlock operations to be released. To provide this kind of functionality, we just need to add a counter counter.

 1 static int lock_recursive(tbthread_mutex_t *mutex)
 2 {
 3   tbthread_t self = tbthread_self();
 4   if(mutex->owner != self) {
 5     lock_normal(mutex);
 6     mutex->owner   = self;
 7   }
 8   if(mutex->counter == (uint64_t)-1)
 9     return -EAGAIN;
10   ++mutex->counter;
11   return 0;
12 }
14 static int trylock_recursive(tbthread_mutex_t *mutex)
15 {
16   tbthread_t self = tbthread_self();
17   if(mutex->owner != self && trylock_normal(mutex))
18     return -EBUSY;
20   if(mutex->owner != self) {
21     mutex->owner = self;
22     mutex->counter = 1;
23     return 0;
24   }
26   if(mutex->counter == (uint64_t)-1)
27     return -EAGAIN;
29   ++mutex->counter;
30   return 0;
31 }
33 static int unlock_recursive(tbthread_mutex_t *mutex)
34 {
35   if(mutex->owner != tbthread_self())
36     return -EPERM;
37   --mutex->counter;
38   if(mutex->counter == 0) {
39     mutex->owner = 0;
40     return unlock_normal(mutex);
41   }
42   return 0;
43 }

Other code

This is pretty much it. The remaining part of the code is not interesting. Both mutex and mutexattr objects need to be initialized to their default values, but the futexes don't need any initialization or cleanup. As always, the full patch is available on GitHub.

Edit 28.03.2016: There are more details about the startup code in this post.


I have recently started playing with the Tiva launchpad. It's a pity, though, that most of the tutorials and course material out there show you how to program it only using something or other on Windows. I have even gone as far as installing it on my old laptop to follow some of these tutorials. But, I have quickly re-discovered the reasons for my dislike of Windows.

There are some great resources available explaining how to use the Stellaris board on Linux. Stellaris is a predecessor of Tiva, and much of this advice applies to Tiva as well. Everyone seems to use Make, though. I don't like it because generating source file dependencies and discovering libraries with it involves black magic and blood of goats. I decided, then, to add my two cents and create a template for CMake (GitHub). It works fine both with or without TivaWare and uses my BSD-licensed start-up files. To use it for your project, all you need to do is:

 1 #-------------------------------------------------------------------------------
 2 # Some boilerplate
 3 #-------------------------------------------------------------------------------
 4 cmake_minimum_required(VERSION 3.4)
 5 set(CMAKE_TOOLCHAIN_FILE ${CMAKE_SOURCE_DIR}/cmake/TM4C_toolchain.cmake)
 8 include(Firmware)
10 #-------------------------------------------------------------------------------
11 # Configure your project
12 #-------------------------------------------------------------------------------
13 project(tm4c-template)
14 add_executable(tm4c-template.axf main.c tm4c/TM4C_startup.c)
15 add_raw_binary(tm4c-template.bin tm4c-template.axf)
16 target_link_libraries(tm4c-template.axf ${TIVAWARE_LIB})

And then:

]==> mkdir build
]==> cd build
]==> cmake ../
-- The CXX compiler identification is GNU 4.9.3
-- Check for working CXX compiler: /usr/bin/arm-none-eabi-c++
-- Check for working CXX compiler: /usr/bin/arm-none-eabi-c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /home/ljanyst/Temp/board/cmake/build
]==> make
Scanning dependencies of target tm4c-template.axf
[ 25%] Building C object CMakeFiles/tm4c-template.axf.dir/main.c.obj
[ 50%] Building C object CMakeFiles/tm4c-template.axf.dir/tm4c/TM4C_startup.c.obj
[ 75%] Linking C executable tm4c-template.axf
[ 75%] Built target tm4c-template.axf
Scanning dependencies of target tm4c-template.bin
[100%] Creating raw binary tm4c-template.bin
[100%] Built target tm4c-template.bin

Or, if you want TivaWare, do this instead:

]==> cmake .. -DTIVAWARE_PATH=/path/to/tivaware/


I wrote a short piece of code that lets you test things without the need for TivaWare. Go here and compile lm4flash, it needs libusb-1.0-0-dev on Debian.

]==> lm4flash tm4c-template.bin
Found ICDI device with serial: 0E21xxxx
ICDI version: 9270


You can tweak a bit the instructions from the tutorial over at to run a debugging session. Plug-in the board and start an Open On-Chip Debugger session:

]==> openocd -f /usr/share/openocd/scripts/board/ek-tm4c123gxl.cfg
Open On-Chip Debugger 0.9.0 (2015-05-28-17:08)
Licensed under GNU GPL v2
For bug reports, read
Info : The selected transport took over low-level target control. The results might differ compared to plain JTAG/SWD
adapter speed: 500 kHz
Info : clock speed 32767 kHz
Info : ICDI Firmware version: 9270
Info : tm4c123gh6pm.cpu: hardware has 6 breakpoints, 4 watchpoints

Then, in another terminal window, run gdb as follows:

]==> cat gdb-embeded.init
target extended-remote :3333
monitor reset halt
monitor reset init
break main
]==> arm-none-eabi-gdb --command=gdb-embeded.init  tm4c-template.axf
GNU gdb (7.10-1+9) 7.10
Copyright (C) 2015 Free Software Foundation, Inc.

-- cut --

Breakpoint 1, main () at /home/ljanyst/Temp/board/cmake/main.c:113
113       init_sys_tick();
(gdb) n
114       init_gpio();
(gdb) n
116       unsigned long led = 0x02;
(gdb) n
119         unsigned long sw1 = !(GPIODATA_REG_PORTF & 0x01);
(gdb) n
120         unsigned long sw2 = !(GPIODATA_REG_PORTF & 0x10);
(gdb) p sw1
$1 = 0
(gdb) p /t *(unsigned long *)0x4005d3fc
$2 = 10001

Have fun!

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion


The second part of our threading journey covers the thread-local storage. I do not mean here the compiler generated TLS that I mentioned in the first part. I mean the stuff that involves pthread_setspecific and friends. I don't think it's the most critical part of all this threading business. I barely ever use it in practice. However, we will need to refer to the current thread all over the place, and this requires the TLS. It's better to deal with it once and for all, especially that it's not particularly complicated.


How does one store something in a fixed location and make it distinct for each execution thread? The only two answers that come to my mind are either using syscalls to associate something with kernel's task_struct or using CPU registers. The first approach requires context switches to retrieve the value, so it's rather inefficient. The second option, though, should be pretty fast. Conveniently, on x86_64, some registers are left unused (see StackOverflow). In fact, the SETTLS option to clone takes a pointer and puts it in the file segment register for us, so we don't even need to make an extra syscall just for that.

Since fs is a segment register, we cannot retrieve it's absolute value without engaging the operating system. Linux on x86_64 uses the arch_prctl syscall for this purpose:

1 #include "tb.h"
2 #include <asm/prctl.h>
4 tbthread_t tbthread_self()
5 {
6   tbthread_t self;
7   SYSCALL2(__NR_arch_prctl, ARCH_GET_FS, &self);
8   return self;
9 }

This syscall seems expensive (link) and making it defies the reason for using a register in the first place. We can read the memory at an address relative to the value of the register, though. Using this fact, we can make our thread struct point to itself and then retrieve this pointer using inline assembly. Here's how:

1 tbthread_t tbthread_self()
2 {
3   tbthread_t self;
4   asm("movq %%fs:0, %0\n\t" : "=r" (self));
5   return self;
6 }

For the main thread, the linker initializes the TLS segment for pthreads automatically using arch_prctl. We cannot use it for our purposes, but we can count on tbthread_self returning a unique, meaningful value.

See the full patch here.


The actual TLS is handled by the following four functions: pthread_key_create, pthread_key_delete, pthread_getspecific, pthread_setspecific. I will not explain what they do because it should pretty self-evident. If it's not, see the man pages.

In principle, we could just have a hash table in the tbthread struct. We could then use setspecific and getspecific to set and retrieve the value associated with each given key. Calling setspecific with a NULL pointer would delete the key. Unfortunately, the designers of pthreads made it a bit more complicated by having separate key_create and key_delete functions, with the delete function invalidating the key in all the threads. Glibc uses a global array of keys and sequence numbers in a clever way to solve this problem. We will take almost the same approach in a less efficient but a bit clearer way.

We will have a global array representing keys. Each element of this array will host a pointer to a destructor and a sequence number. The index in the array will be the actual key passed around between the TLS functions. Both key_create and key_delete will bump the sequence number associated with a key in an atomic way. If the number is even, the key is not allocated, if the number is odd, it is.

 1 static struct
 2 {
 3   uint64_t seq;
 4   void (*destructor)(void *);
 7 #define KEY_UNUSED(k) ((keys[k].seq&1) == 0)
 8 #define KEY_ACQUIRE(k) (__sync_bool_compare_and_swap(&(keys[k].seq), keys[k].seq, keys[k].seq+1))
 9 #define KEY_RELEASE(k) (__sync_bool_compare_and_swap(&(keys[k].seq), keys[k].seq, keys[k].seq+1))
11 int tbthread_key_create(tbthread_key_t *key, void (*destructor)(void *))
12 {
13   for(int i = 0; i < TBTHREAD_MAX_KEYS; ++i) {
14     if(KEY_UNUSED(i) && KEY_ACQUIRE(i)) {
15       *key = i;
16       keys[i].destructor = destructor;
17       return 0;
18     }
19   }
20   return -ENOMEM;
21 }

Each tbthread struct will hold an array of the same size as the global array of keys. Each element in this array will hold a data pointer and a sequence number. Storing data will set the sequence number as well. Retrieving the data will check whether the local and the global sequence number match before proceeding.

 1 void *tbthread_getspecific(tbthread_key_t key)
 2 {
 3   if(key >= TBTHREAD_MAX_KEYS || KEY_UNUSED(key))
 4     return 0;
 6   tbthread_t self = tbthread_self();
 7   if(self->tls[key].seq == keys[key].seq)
 8     return self->tls[key].data;
 9   return 0;
10 }

See the full patch here.


KEY_ACQUIRE and KEY_RELEASE use gcc atomic builtins described here.