Table of Contents

  1. Syscalls, memory, and your first therad
  2. The pointer to self and thread-local storage
  3. Futexes, mutexes, and memory sychronization
  4. Joining threads and dynamic initialization
  5. Cancellation
  6. Scheduling and task priority
  7. RW Locks
  8. Condition variables
  9. Final thoughts


The second part of our threading journey covers the thread-local storage. I do not mean here the compiler generated TLS that I mentioned in the first part. I mean the stuff that involves pthread_setspecific and friends. I don't think it's the most critical part of all this threading business. I barely ever use it in practice. However, we will need to refer to the current thread all over the place, and this requires the TLS. It's better to deal with it once and for all, especially that it's not particularly complicated.


How does one store something in a fixed location and make it distinct for each execution thread? The only two answers that come to my mind are either using syscalls to associate something with kernel's task_struct or using CPU registers. The first approach requires context switches to retrieve the value, so it's rather inefficient. The second option, though, should be pretty fast. Conveniently, on x86_64, some registers are left unused (see StackOverflow). In fact, the SETTLS option to clone takes a pointer and puts it in the file segment register for us, so we don't even need to make an extra syscall just for that.

Since fs is a segment register, we cannot retrieve it's absolute value without engaging the operating system. Linux on x86_64 uses the arch_prctl syscall for this purpose:

1#include "tb.h"
2#include <asm/prctl.h>
4tbthread_t tbthread_self()
6  tbthread_t self;
7  SYSCALL2(__NR_arch_prctl, ARCH_GET_FS, &self);
8  return self;

This syscall seems expensive (link) and making it defies the reason for using a register in the first place. We can read the memory at an address relative to the value of the register, though. Using this fact, we can make our thread struct point to itself and then retrieve this pointer using inline assembly. Here's how:

1tbthread_t tbthread_self()
3  tbthread_t self;
4  asm("movq %%fs:0, %0\n\t" : "=r" (self));
5  return self;

For the main thread, the linker initializes the TLS segment for pthreads automatically using arch_prctl. We cannot use it for our purposes, but we can count on tbthread_self returning a unique, meaningful value.

See the full patch here.


The actual TLS is handled by the following four functions: pthread_key_create, pthread_key_delete, pthread_getspecific, pthread_setspecific. I will not explain what they do because it should pretty self-evident. If it's not, see the man pages.

In principle, we could just have a hash table in the tbthread struct. We could then use setspecific and getspecific to set and retrieve the value associated with each given key. Calling setspecific with a NULL pointer would delete the key. Unfortunately, the designers of pthreads made it a bit more complicated by having separate key_create and key_delete functions, with the delete function invalidating the key in all the threads. Glibc uses a global array of keys and sequence numbers in a clever way to solve this problem. We will take almost the same approach in a less efficient but a bit clearer way.

We will have a global array representing keys. Each element of this array will host a pointer to a destructor and a sequence number. The index in the array will be the actual key passed around between the TLS functions. Both key_create and key_delete will bump the sequence number associated with a key in an atomic way. If the number is even, the key is not allocated, if the number is odd, it is.

 1static struct
 3  uint64_t seq;
 4  void (*destructor)(void *);
 7#define KEY_UNUSED(k) ((keys[k].seq&1) == 0)
 8#define KEY_ACQUIRE(k) (__sync_bool_compare_and_swap(&(keys[k].seq), keys[k].seq, keys[k].seq+1))
 9#define KEY_RELEASE(k) (__sync_bool_compare_and_swap(&(keys[k].seq), keys[k].seq, keys[k].seq+1))
11int tbthread_key_create(tbthread_key_t *key, void (*destructor)(void *))
13  for(int i = 0; i < TBTHREAD_MAX_KEYS; ++i) {
14    if(KEY_UNUSED(i) && KEY_ACQUIRE(i)) {
15      *key = i;
16      keys[i].destructor = destructor;
17      return 0;
18    }
19  }
20  return -ENOMEM;

Each tbthread struct will hold an array of the same size as the global array of keys. Each element in this array will hold a data pointer and a sequence number. Storing data will set the sequence number as well. Retrieving the data will check whether the local and the global sequence number match before proceeding.

 1void *tbthread_getspecific(tbthread_key_t key)
 3  if(key >= TBTHREAD_MAX_KEYS || KEY_UNUSED(key))
 4    return 0;
 6  tbthread_t self = tbthread_self();
 7  if(self->tls[key].seq == keys[key].seq)
 8    return self->tls[key].data;
 9  return 0;

See the full patch here.


KEY_ACQUIRE and KEY_RELEASE use gcc atomic builtins described here.

If you like this kind of content, you can subscribe to my newsletter, follow me on Twitter, or subscribe to my RSS channel.