Content tagged thread-bites

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion


I have implemented all of the interesting functions listed here and, thus, reached my goal. There were quite a few surprises. I had expected some things to bo more complicated than they are. Conversely, some things that had seemed simple turned out to be quite complex.

  1. I had initially hoped that I would be able to re-use much of glibc and concentrate only on the thread-specific functionality. I was surprised to discover how much of glibc code refers to thread-local storage.
  2. I had expected the interaction between join and detach to be much simpler to handle. Having to implement descriptor caching was an unexpected event.
  3. I had never heard of pthread_once before.
  4. I had not used much of the real-time functionality before, so figuring out the scheduling part was very entertaining. I especially enjoyed implementing the PRIO_INHERIT mutex.

I may revisit this project in the future because there are still some things that I would like to learn more about.

  1. If I'll have the time to learn DWARF, I would like to provide proper .eh_frame for the signal trampoline. It would allow me to implement cancellation using stack unwinding the way glibc does it.
  2. I may look into the inter-process synchronization to learn about the robust futexes.
  3. The Intel article on lock elision seemed interesting, and I'd like to play with this stuff as well.
  4. I may have a look at the compiler-generated TLS.

The End

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion

Condition Variables

Condition variables are a mechanism used for signaling that a certain predicate has become true. The POSIX mechanism for handling them boils down to three functions: cond_wait, cond_signal and cond_broadcast. The first one causes the thread that calls it to wait. The second wakes a thread up so that it can verify whether the condition is true. The third wakes all the waiters up.

The waiter

 4int bseq = cond->broadcast_seq;
 5int futex = cond->futex;
 8while(1) {
 9  st = SYSCALL3(__NR_futex, &cond->futex, FUTEX_WAIT, futex);
10  if(st == -EINTR)
11    continue;
13  tb_futex_lock(&cond->lock);
14  if(cond->signal_num) {
15    --cond->signal_num;
16    goto exit;
17  }
19  if(bseq != cond->broadcast_seq)
20    goto exit;
21  tb_futex_unlock(&cond->lock);

The algorithm is as follows:

  1. We lock the internal lock.
  2. We remember the value of the broadcast sequence and the futex.
  3. In a loop, we wait for the futex.
  4. We go back to sleeping if the FUTEX_WAIT syscall was interrupted by a signal.
  5. We consume a signal, if we can, and exit.
  6. If there was a broadcast, we exit too.
  7. Otherwise, we go back to sleep.


We wake one of the threads up. We bump the value of the futex to prevent a deadlock. Then we bump the number of signals and wake the futex.

 1int tbthread_cond_signal(tbthread_cond_t *cond)
 3  tb_futex_lock(&cond->lock);
 4  if(cond->waiters == cond->signal_num)
 5    goto exit;
 6  ++cond->futex;
 7  ++cond->signal_num;
 8  SYSCALL3(__NR_futex, &cond->futex, FUTEX_WAKE, 1);
10  tb_futex_unlock(&cond->lock);
11  return 0;


We wake all the waiters. The algorithm is essentially the same as for signal, except that, we bump the broadcast sequence number and wake all the threads instead of just one.

 1int tbthread_cond_broadcast(tbthread_cond_t *cond)
 3  tb_futex_lock(&cond->lock);
 4  if(!cond->waiters)
 5    goto exit;
 6  ++cond->futex;
 7  ++cond->broadcast_seq;
 8  SYSCALL3(__NR_futex, &cond->futex, FUTEX_WAKE, INT_MAX);
10  tb_futex_unlock(&cond->lock);
11  return 0;

See the full patch at GitHub.

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion

RW Locks

A read-write lock protects a critical section by allowing multiple readers when there are no writers. We won't bother implementing lock attributes handling because we don't support process-shared locks (irrelevant in our case) and we don't let the user prefer readers (a non-POSIX extension). The implementation remembers the ID of the current writer if any. It also counts readers as well as the queued writers. We use two futexes, one to block the readers and one to block the writers.

A writer first bumps the number of queued writers. If there is no other writer and no readers, it marks itself as the owner of the lock and decrements the number of queued writers. It goes to sleep on the writer futex otherwise.

 1int tbthread_rwlock_wrlock(tbthread_rwlock_t *rwlock)
 3  int queued = 0;
 4  while(1) {
 5    tb_futex_lock(&rwlock->lock);
 7    if(!queued) {
 8      queued = 1;
 9      ++rwlock->writers_queued;
10    }
12    if(!rwlock->writer && !rwlock->readers) {
13      rwlock->writer = tbthread_self();
14      --rwlock->writers_queued;
15      tb_futex_unlock(&rwlock->lock);
16      return 0;
17    }
18    int sleep_status = rwlock->wr_futex;
20    tb_futex_unlock(&rwlock->lock);
22    SYSCALL3(__NR_futex, &rwlock->wr_futex, FUTEX_WAIT, sleep_status);
23  }

A reader acquires the lock if there are no writers at all. It goes to sleep on the reader futex otherwise.

 1int tbthread_rwlock_rdlock(tbthread_rwlock_t *rwlock)
 3  while(1) {
 4    tb_futex_lock(&rwlock->lock);
 6    if(!rwlock->writer && !rwlock->writers_queued) {
 7      ++rwlock->readers;
 8      tb_futex_unlock(&rwlock->lock);
 9      return 0;
10    }
11    int sleep_status = rwlock->rd_futex;
13    tb_futex_unlock(&rwlock->lock);
15    SYSCALL3(__NR_futex, &rwlock->rd_futex, FUTEX_WAIT, sleep_status);
16  }

When unlocking, we use the writer field to determine whether we were a reader or a writer. If we were a writer, we've had an exclusive ownership of the lock. Therefore, we need to either wake another writer or all of the readers, depending on the state of the counters. If we were a reader, we've had a non-exclusive lock. Therefore, we only need to wake a writer when we're the last reader and there is a writer queued. We bump the value of the futex because we want to handle the cases when FUTEX_WAKE was called before the other thread manged to call FUTEX_WAIT.

 1int tbthread_rwlock_unlock(tbthread_rwlock_t *rwlock)
 3  tb_futex_lock(&rwlock->lock);
 4  if(rwlock->writer) {
 5    rwlock->writer = 0;
 6    if(rwlock->writers_queued) {
 7      __sync_fetch_and_add(&rwlock->wr_futex, 1);
 8      SYSCALL3(__NR_futex, &rwlock->wr_futex, FUTEX_WAKE, 1);
 9    } else {
10      __sync_fetch_and_add(&rwlock->rd_futex, 1);
11      SYSCALL3(__NR_futex, &rwlock->rd_futex, FUTEX_WAKE, INT_MAX);
12    }
13    goto exit;
14  }
16  --rwlock->readers;
17  if(!rwlock->readers && rwlock->writers_queued) {
18    __sync_fetch_and_add(&rwlock->wr_futex, 1);
19    SYSCALL3(__NR_futex, &rwlock->wr_futex, FUTEX_WAKE, 1);
20  }
23  tb_futex_unlock(&rwlock->lock);
24  return 0;

See the full patch at GitHub.

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion

Thread Scheduling

The scheduler makes a decision of which thread to run next based on two parameters: scheduling policy and priority.

Conceptually, the scheduler maintains a list of runnable threads for each possible sched_priority value. In order to determine which thread runs next, the scheduler looks for the nonempty list with the highest static priority and selects the thread at the head of this list.

A thread's scheduling policy determines where it will be inserted into the list of threads with equal static priority and how it will move inside this list. the sched(7) man page

The supported policies are:

  • SCHED_NORMAL - It's the default Linux policy for threads not requiring any real-time machinery. All the threads have priority of 0, and the decision of which thread gets run next is based on the nice mechanism.
  • SCHED_FIFO - The threads have priority from 1 (low) to 99 (high). When a SCHED_FIFO thread becomes runnable, it will preempt any thread of lower priority. There is no time slicing, so the hight priority thread runs as long as it must.
  • SCHED_RR - RR stands for round-robin. It's the same as SCHED_FIFO except each thread is allowed to run for a limited time quantum.

See the sched(7) man page for more details.

Setting the scheduling parameters of a kernel task boils down to invoking the sys_sched_setscheduler syscall, as shown below.

 1struct tb_sched_param
 3  int sched_priority;
 6int tb_set_sched(tbthread_t thread, int policy, int priority)
 8  struct tb_sched_param p; p.sched_priority = priority;
 9  int ret = SYSCALL3(__NR_sched_setscheduler, thread->tid, policy, &p);
10  if(!ret)
11    thread->sched_info = SCHED_INFO_PACK(policy, priority);
12  return ret;

There is a bit more fun to it, though. As you can see, the kernel can only schedule a task that already exists. Therefore, we need to have a way to set the thread's priority before this thread invokes the user function. The reason for this is that we may need to abort this thread immediately should the scheduler setting fail. We do it by having another futex that we wake when we know whether the thread can run or not:

1if(th->start_status != TB_START_OK) {
2  SYSCALL3(__NR_futex, &th->start_status, FUTEX_WAIT, TB_START_WAIT);
3  if(th->start_status == TB_START_EXIT)
4    SYSCALL1(__NR_exit, 0);

This futex is initialized in the tbthread_create function depending on the thread attributes:

2  (*thread)->start_status = TB_START_WAIT;
3else {
4  tbthread_t self = tbthread_self();
5  (*thread)->sched_info = self->sched_info;

And then set to either TB_START_OK or TB_START_EXIT after we spawn the thread but before tbthread_create exits:

 1if(!attr->sched_inherit) {
 2  ret = tb_set_sched(*thread, attr->sched_policy, attr->sched_priority);
 4  if(ret) (*thread)->start_status = TB_START_EXIT;
 5  else (*thread)->start_status = TB_START_OK;
 6  SYSCALL3(__NR_futex, &(*thread)->start_status, FUTEX_WAKE, 1);
 8  if(ret) {
 9    wait_for_thread(*thread);
10    goto error;
11  }

See the patch at GitHub.

Priority mutexes

Priority mutexes come in three varieties:

  • TBTHREAD_PRIO_NONE - Acquiring this type of mutex does not change the scheduling characteristics of the thread.
  • TBTHREAD_PRIO_INHERIT - When the mutex owner blocks another thread with a higher priority, the owner inherits the priority of the thread it blocks if it's higher than its own.
  • TBTHREAD_PRIO_PROTECT - Acquiring this kind of mutex raises the priority of the owner to the prioceiling value of the mutex.

Thread Bites implements this functionality by keeping lists of PRIO_INHERIT and PRIO_PROTECT mutexes. It then calculates the highest possible priority taking into account the priority of the mutexes and the priority set by the user.

The implementation of PRIO_PROTECT is relatively straightforward. Whenever a thread acquires this kind of mutex, it is added to the list, and the priority of the thread is recalculated:

 1static int lock_prio_protect(tbthread_mutex_t *mutex)
 3  lock_prio_none(mutex);
 4  tb_protect_mutex_sched(mutex);
 5  return 0;
 8static int unlock_prio_protect(tbthread_mutex_t *mutex)
10  tbthread_t self = tbthread_self();
11  tb_protect_mutex_unsched(mutex);
12  unlock_prio_none(mutex);
13  return 0;

Implementing PRIO_INHERIT is a lot more tricky. We add the mutex to the appropriate list when a thread acquires it. Whenever a higher priority thread tries to lock the mutex, it bumps the priority of the blocker. But the priority recalculation is done only at this point. Implementing it like this covers all the main cases and is not horrendously hard. It allows for simple recursion: if the owner of a mutex gets blocked, the blocker inherits the priority that comes with the first mutex. It also has a couple of drawbacks:

  • It assumes that the kernel will always wake the highest priority thread. It makes sense and is most likely the case. However, I have not tested it.
  • If the owner of a PRIO_INHERIT mutex is already blocked on another mutex of the same kind and it's priority gets bumped later, the last thread in the line won't be affected.
 1static int lock_prio_inherit(tbthread_mutex_t *mutex)
 3  tbthread_t self = tbthread_self();
 5  while(1) {
 6    int locked = 0;
 7    tb_futex_lock(&mutex->internal_futex);
 8    if(mutex->futex == 0) {
 9      locked = 1;
10      mutex->owner = self;
11      mutex->futex = 1;
12      tb_inherit_mutex_add(mutex);
13    }
14    else
15      tb_inherit_mutex_sched(mutex, self);
16    tb_futex_unlock(&mutex->internal_futex);
17    if(locked)
18      return 0;
19    SYSCALL3(__NR_futex, &mutex->futex, FUTEX_WAIT, 1);
20  }
23static int unlock_prio_inherit(tbthread_mutex_t *mutex)
25  tb_futex_lock(&mutex->internal_futex);
26  tb_inherit_mutex_unsched(mutex);
27  mutex->owner = 0;
28  mutex->futex = 0;
29  SYSCALL3(__NR_futex, &mutex->futex, FUTEX_WAKE, 1);
30  tb_futex_unlock(&mutex->internal_futex);
31  return 0;

It was by far the most challenging part so far. See the patch at GitHub.

Remaining functions

  • pthread_setconcurrency- It defines how many kernel tasks should be created to handle the user-level threads. It does not make sense in our case because we create a kernel task for every thread.
  • pthread_attr_setscope - It defines the set of threads against which the thread will compete for resources. There are two settings: PTHREAD_SCOPE_SYSTEM meaning all the threads in the entire system and PTHREAD_SCOPE_PROCESS meaning only the threads within the process. The man page says that Linux only supports PTHREAD_SCOPE_SYSTEM, but I am not sure whether it's still the case with all the cgroups stuff.

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion


Cancellation boils down to making one thread exit following a request from another thread. It seems that calling tbthread_exit at an appropriate point is enough to implement all of the behavior described in the man pages. We will go this way despite the fact that it is not the approach taken by glibc. Glibc unwinds the stack back to the point invoking the user-supplied thread function. This behavior allows it to simulate an exception if C++ code is using the library. We don't bother with C++ support for the moment and don't always care to supply valid DWARF information. Therefore, we will take the easier approach.

tbthread_setcancelstate and tbthread_setcanceltype are the two functions controlling the response of a thread to a cancellation request. The former enables or disables cancellation altogether queuing the requests for later handling if necessary. The latter decides whether the thread should abort immediately or at a cancellation point. POSIX has a list of cancellation points, but we will not bother with them. Instead, we'll just use tbthread_testcancel and the two functions mentioned before for this purpose.

The thread must not get interrupted after it disables or defers cancellation. It would likely lead to deadlocks due to unreleased mutexes, memory leaks and such. The trick here is to update all the cancellation related flags atomically. So, we use one variable to handle the following flags:

  • TB_CANCEL_ENABLED: The cancellation is enabled; if a cancellation request has been queued, reaching a cancellation point will cause the thread to exit.
  • TB_CANCEL_DEFERRED: The cancellation is deferred (not asynchronous); SIGCANCEL will not be sent; see the paragraph on signal handling.
  • TB_CANCELING: A cancellation request has been queued; depending on other flags, SIGCANCEL may be sent.
  • TB_CANCELED: A cancellation request has been taken into account and the thread is in the process of exiting; this flag is used to handle the cases when a cancellation point has been reached before SIGCANCEL has been delivered by the kernel.

The tbhread_testcancel looks as follows:

 1void tbthread_testcancel()
 3  tbthread_t thread = tbthread_self();
 4  uint8_t val, newval;
 6  while(1) {
 7    newval = val = thread->cancel_status;
 8    if(!(val & TB_CANCEL_ENABLED) || !(val & TB_CANCELING) ||
 9       (val & TB_CANCELED))
10      return;
11    newval |= TB_CANCELED;
12    if(__sync_bool_compare_and_swap(&thread->cancel_status, val, newval))
13      break;
14  }
15  tbthread_exit(TBTHREAD_CANCELED);

See the full patch at GitHub.

Clean-up handlers

The user may register a bunch of functions cleaning up the mess caused by an unexpected interruption. They are installed with tbthread_cleanup_push and called when the thread exits abnormally. The purpose of these functions is to unlock mutexes, free the heap memory and such. tbthread_cleanup_pop removes them and optionally executes in the process.

 1void tbthread_cleanup_push(void (*func)(void *), void *arg)
 3  tbthread_t self = tbthread_self();
 4  struct cleanup_elem *e = malloc(sizeof(struct cleanup_elem));
 5  e->func = func;
 6  e->arg = arg;
 7  list_add_elem(&self->cleanup_handlers, e, 1);
10void tbthread_cleanup_pop(int execute)
12  tbthread_t self = tbthread_self();
13  list_t *node = self->;
14  if(!node)
15    return;
16  list_rm(node);
17  struct cleanup_elem *e = (struct cleanup_elem*)node->element;
18  if(execute)
19    (*e->func)(e->arg);
20  free(e);
21  free(node);

See the full patch at GitHub.

Signals and asynchronous cancellation

The asynchronous cancellation uses the first real-time signal, SIGRTMIN, that we call SIGCANCEL here for clarity.

Registering a signal handler is somewhat more tricky than just calling the appropriate syscall. It is so because, on x86_64, we need to provide a function that restores the stack after the signal handler returns. The function is called a signal trampoline and its purpose is to invoke sys_rt_sigreturn. The trampoline is registered with the kernel using a special sigaction flag:

1void __restore_rt();
2#define SA_RESTORER 0x04000000
4int tbsigaction(int signum, struct sigaction *act, struct sigaction *old)
6  act->sa_flags |= SA_RESTORER;
7  act->sa_restorer = __restore_rt;
8  return SYSCALL4(__NR_rt_sigaction, signum, act, old, sizeof(sigset_t));

The trampoline itself, called __restore_rt here, is defined in assembly as follows:

1  .text
3  .global __restore_rt
4  .type   __restore_rt,@function
5  .align  16
8  movq $__NR_rt_sigreturn, %rax
9  syscall

Looking at the corresponding glibc code, you can see that they add the eh_frame info here. The comments say that it is to aid gdb and handle the stack unwinding. I don't know enough DWARF to write one on my own, gdb does not seem to be utterly confused without it, and we won't do stack unwinding, so we just won't bother with it for the moment.

In the cancellation handler, we first check whether it's the right signal and that it has been sent by a thread in the same thread group. We then need to check whether the thread is still in the asynchronous cancellation mode. It might have changed between the time the signal was sent and the time the it is delivered. Finally, we call thread_testcancel to see if the thread should exit.

 1void tb_cancel_handler(int sig, siginfo_t *si, void *ctx)
 3  if(sig != SIGCANCEL || si->si_pid != tb_pid || si->si_code != SI_TKILL)
 4    return;
 6  tbthread_t self = tbthread_self();
 7  if(self->cancel_status & TB_CANCEL_DEFERRED)
 8    return;
10  tbthread_testcancel();

We invoke sys_tgkill to send the signal:

1SYSCALL3(__NR_tgkill, tb_pid, thread->exit_futex, SIGCANCEL);

See the full patch at GitHub.

Cancellation of a "once" function

The implementation of tbthread_once gets quite a bit more interesting as well. If the thread invoking the initialization function gets canceled, another thread needs to pick it up. We need to install a cleanup handler that will change the state of the once control back to TB_ONCE_NEW and wake all the threads so that they could restart from the beginning:

 1static void once_cleanup(void *arg)
 3  tbthread_once_t *once = (tbthread_once_t *)arg;
 4  *once = TB_ONCE_NEW;
 5  SYSCALL3(__NR_futex, once, FUTEX_WAKE, INT_MAX);
 8int tbthread_once(tbthread_once_t *once, void (*func)(void))
10  if(!once || !func)
11    return -EINVAL;
13  int cancel_state;
15  while(1) {
16    if(*once == TB_ONCE_DONE)
17      return 0;
19    //--------------------------------------------------------------------------
20    // The executor
21    //--------------------------------------------------------------------------
22    tbthread_setcancelstate(TBTHREAD_CANCEL_DISABLE, &cancel_state);
23    if(__sync_bool_compare_and_swap(once, TB_ONCE_NEW, TB_ONCE_IN_PROGRESS)) {
24      tbthread_cleanup_push(once_cleanup, once);
25      tbthread_setcancelstate(cancel_state, 0);
27      (*func)();
29      tbthread_setcancelstate(TBTHREAD_CANCEL_DISABLE, &cancel_state);
30      tbthread_cleanup_pop(0);
32      *once = TB_ONCE_DONE;
33      SYSCALL3(__NR_futex, once, FUTEX_WAKE, INT_MAX);
34      tbthread_setcancelstate(cancel_state, 0);
35      return 0;
36    }
38    tbthread_setcancelstate(cancel_state, 0);
40    //--------------------------------------------------------------------------
41    // The waiters
42    //--------------------------------------------------------------------------
43    while(1) {
44      SYSCALL3(__NR_futex, once, FUTEX_WAIT, TB_ONCE_IN_PROGRESS);
45      if(*once != TB_ONCE_IN_PROGRESS)
46        break;
47    }
48  }

See the patch at GitHub.

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion

Recycling the thread descriptors

How do we know when the thread task has died? This is what the CHILD_SETTID and CHILD_CLEARTID flags to sys_clone are for. If they are set, the kernel will store the new thread's TID at the location pointed to by the ctid argument (see tb #1). When the thread terminates, the kernel will set the TID to 0 and wake the futex at this location. It is a convenient way to wait for a thread to finish. Unfortunately, as far as I can tell, there is no way to unset these flags, and it makes implementing tbthread_detach a pain. We cannot delete the thread descriptor in the thread it refers to anymore. Doing so would cause the kernel to write to a memory location that might have been either unmapped or reused. Therefore, we need to have some sort of a cache holding thread descriptors and make sure that we re-use them only after the thread they were referring to before has exited. Thread bites uses two linked lists to maintain this cache, and the descriptor allocation function calls the following procedure to wait until the corresponding task is gone:

1static void wait_for_thread(tbthread_t thread)
3  uint32_t tid = thread->exit_futex;
4  long ret = 0;
5  if(tid != 0)
6    do {
7      ret = SYSCALL3(__NR_futex, &thread->exit_futex, FUTEX_WAIT, tid);
8    } while(ret != -EWOULDBLOCK && ret != 0);

See the full patch at GitHub.

Joining threads

Joining complicates things a bit further because it does quite a bit of error checking to prevent deadlocks and such. To perform these checks, the thread calling tbthread_join needs to have a valid thread descriptor obtainable by tbthread_self. The problem is that we have never set this thread descriptor up for the main thread, and we need to do it by hand at the beginning of the program. The original state needs to be restored at the end because glibc uses it internally and not cleaning things up causes segfaults.

 1static void *glibc_thread_desc;
 2void tbthread_init()
 4  glibc_thread_desc = tbthread_self();
 5  tbthread_t thread = malloc(sizeof(struct tbthread));
 6  memset(thread, 0, sizeof(struct tbthread));
 7  thread->self = thread;
 8  SYSCALL2(__NR_arch_prctl, ARCH_SET_FS, thread);
11void tbthread_finit()
13  free(tbthread_self());
14  SYSCALL2(__NR_arch_prctl, ARCH_SET_FS, glibc_thread_desc);

After performing all the validity and deadlock checks, the meat of tbthread_join is rather simple:

3  *retval = thread->retval;
5return 0;

See the full patch at GitHub.

Dynamic initialization

pthread_once is an interesting beast. Its purpose is to initialize dynamically some resources by calling a designated function exactly once. The fun part is that the actual initialization call may be made from multiple threads at the same time. pthread_once_t, therefore, is kind of like a mutex, but has three states instead of two:

  • new: the initialization function has not been called yet; one of the threads needs to call it.
  • in progress: the initialization function is running; the threads are waiting for it to finish.
  • done: the initialization function is done; all the threads may be woken up.

The thread that manages to change the state from new to in progress gets to call the function. All the other threads wait until the done state is reached.

 1int tbthread_once(tbthread_once_t *once, void (*func)(void))
 3  if(!once || !func)
 4    return -EINVAL;
 6  if(*once == TB_ONCE_DONE)
 7    return 0;
 9  if(__sync_bool_compare_and_swap(once, TB_ONCE_NEW, TB_ONCE_IN_PROGRESS)) {
10    (*func)();
11    *once = TB_ONCE_DONE;
12    SYSCALL3(__NR_futex, once, FUTEX_WAKE, INT_MAX);
13    return 0;
14  }
16  while(*once != TB_ONCE_DONE)
18  return 0;

Side effects

The original glibc thread descriptor stores the localization information for the thread. Changing it to ours causes seemingly simple functions, like strerror, to segfault. Therefore, we need to implement strerror ourselves.

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion


This part discusses an implementation of a mutex. It will not be a particularly efficient mutex, but it will be an understandable and straightforward one. We will not bother minimizing the number of system calls or implementing lock elision. We will also not handle the case of inter-process communication. Therefore, the process-shared and robust mutexes will not be discussed. If you are interested in these ideas, I recommend the kernel's documentation file on robust mutexes and Intel's blog post on lock elision. The scheduling related parameters will likely be dealt with on another occasion.


POSIX defines a bunch of different mutexes. See the manpage for pthread_mutexattr_settype to learn more. On Linux, all of them are implemented using the same locking primitive - a Futex (Fast User-Space Mutex). It is a 4-byte long chunk of aligned memory. Its contents can be updated atomically, and its address can be used to refer to a kernel-level process queue. The kernel interface is defined as follows:

1asmlinkage long sys_futex(u32 __user *uaddr, int op, u32 val,
2                         struct timespec __user *utime, u32 __user *uaddr2,
3                         u32 val3);

We will only use the first four out of the six parameters here. The first one is the address of the futex, and the second one is the type of the operation to be performed. The meaning of the remaining parameters depends on the context. We will need only two of the available operations to implement a mutex:

  • FUTEX_WAIT puts a thread to sleep if the value passed in val is the same as the value stored in the memory pointed to by *uaddr. Optionally, the sleep time may be limited by passing a pointer to a timespec object. The return values are:
    • 0, if the thread was woken up by FUTEX_WAIT.
    • EWOULDBLOCK, if the value of the futex was different than val.
    • EINTR, if the sleep was interrupted by a signal.
  • FUTEX_WAKE wakes the number of threads specified in val. In practice, it only makes sense to either wake one or all sleeping threads, so we pass either 1 or INT_MAX respectively.

See the original futex paper for more details.

Normal mutex

We start with a normal mutex because it is possible to implement all the other kinds using the procedures defined for it. If the value of the associated futex is 0, then the mutex is unlocked. Locking it means changing the value to 1. To avoid race conditions, both the checking and the changing need to be done atomically. GCC has a built-in function to do this that results with lock cmpxchgq or similar being emitted in assembly. If the locking fails, we need to wait until another thread releases the mutex and re-check the lock.

1static int lock_normal(tbthread_mutex_t *mutex)
3  while(1) {
4    if(__sync_bool_compare_and_swap(&mutex->futex, 0, 1))
5      return 0;
6    SYSCALL3(__NR_futex, &mutex->futex, FUTEX_WAIT, 1);
7  }

The logic of trylock is essentially the same, except that we return a failure instead of sleeping.

1static int trylock_normal(tbthread_mutex_t *mutex)
3  if(__sync_bool_compare_and_swap(&mutex->futex, 0, 1))
4      return 0;
5  return -EBUSY;

To unlock a mutex, we do the reverse. We change the futex value from 1 to 0 and wake one thread waiting for the futex to be released.

1static int unlock_normal(tbthread_mutex_t *mutex)
3  if(__sync_bool_compare_and_swap(&mutex->futex, 1, 0))
4    SYSCALL3(__NR_futex, &mutex->futex, FUTEX_WAKE, 1);
5  return 0;

Note that the values stored in the futex are application-specific and arbitrary. The kernel does not care and does not change this variable except in one case, which we will discuss in a later chapter.

This mutex is not particularly efficient because we make a system call while unlocking regardless of whether there is a waiting thread or not. To see how the situation could be improved, please refer to Ulrich Drepper's Futexes Are Tricky.

Error-check mutex

These guys are the same as the ones discussed earlier, except that they do additional bookkeeping. An error-check mutex remembers who owns it, to report the following types of errors:

  • re-locking of a mutex the thread already owns
  • unlocking a mutex owned by another thread
  • unlocking a mutex that is not locked

The behavior of normal mutexes in these cases is undefined.

 1static int lock_errorcheck(tbthread_mutex_t *mutex)
 3  tbthread_t self = tbthread_self();
 4  if(mutex->owner == self)
 5    return -EDEADLK;
 6  lock_normal(mutex);
 7  mutex->owner = self;
 8  return 0;
11static int trylock_errorcheck(tbthread_mutex_t *mutex)
13  int ret = trylock_normal(mutex);
14  if(ret == 0)
15    mutex->owner = tbthread_self();
16  return ret;
19static int unlock_errorcheck(tbthread_mutex_t *mutex)
21  if(mutex->owner != tbthread_self() || mutex->futex == 0)
22    return -EPERM;
23  mutex->owner = 0;
24  unlock_normal(mutex);
25  return 0;

Recursive mutex

Recursive mutexes may be locked multiple times by the same thread and require the same numbers of unlock operations to be released. To provide this kind of functionality, we just need to add a counter counter.

 1static int lock_recursive(tbthread_mutex_t *mutex)
 3  tbthread_t self = tbthread_self();
 4  if(mutex->owner != self) {
 5    lock_normal(mutex);
 6    mutex->owner   = self;
 7  }
 8  if(mutex->counter == (uint64_t)-1)
 9    return -EAGAIN;
10  ++mutex->counter;
11  return 0;
14static int trylock_recursive(tbthread_mutex_t *mutex)
16  tbthread_t self = tbthread_self();
17  if(mutex->owner != self && trylock_normal(mutex))
18    return -EBUSY;
20  if(mutex->owner != self) {
21    mutex->owner = self;
22    mutex->counter = 1;
23    return 0;
24  }
26  if(mutex->counter == (uint64_t)-1)
27    return -EAGAIN;
29  ++mutex->counter;
30  return 0;
33static int unlock_recursive(tbthread_mutex_t *mutex)
35  if(mutex->owner != tbthread_self())
36    return -EPERM;
37  --mutex->counter;
38  if(mutex->counter == 0) {
39    mutex->owner = 0;
40    return unlock_normal(mutex);
41  }
42  return 0;

Other code

This is pretty much it. The remaining part of the code is not interesting. Both mutex and mutexattr objects need to be initialized to their default values, but the futexes don't need any initialization or cleanup. As always, the full patch is available on GitHub.

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion


The second part of our threading journey covers the thread-local storage. I do not mean here the compiler generated TLS that I mentioned in the first part. I mean the stuff that involves pthread_setspecific and friends. I don't think it's the most critical part of all this threading business. I barely ever use it in practice. However, we will need to refer to the current thread all over the place, and this requires the TLS. It's better to deal with it once and for all, especially that it's not particularly complicated.


How does one store something in a fixed location and make it distinct for each execution thread? The only two answers that come to my mind are either using syscalls to associate something with kernel's task_struct or using CPU registers. The first approach requires context switches to retrieve the value, so it's rather inefficient. The second option, though, should be pretty fast. Conveniently, on x86_64, some registers are left unused (see StackOverflow). In fact, the SETTLS option to clone takes a pointer and puts it in the file segment register for us, so we don't even need to make an extra syscall just for that.

Since fs is a segment register, we cannot retrieve it's absolute value without engaging the operating system. Linux on x86_64 uses the arch_prctl syscall for this purpose:

1#include "tb.h"
2#include <asm/prctl.h>
4tbthread_t tbthread_self()
6  tbthread_t self;
7  SYSCALL2(__NR_arch_prctl, ARCH_GET_FS, &self);
8  return self;

This syscall seems expensive (link) and making it defies the reason for using a register in the first place. We can read the memory at an address relative to the value of the register, though. Using this fact, we can make our thread struct point to itself and then retrieve this pointer using inline assembly. Here's how:

1tbthread_t tbthread_self()
3  tbthread_t self;
4  asm("movq %%fs:0, %0\n\t" : "=r" (self));
5  return self;

For the main thread, the linker initializes the TLS segment for pthreads automatically using arch_prctl. We cannot use it for our purposes, but we can count on tbthread_self returning a unique, meaningful value.

See the full patch here.


The actual TLS is handled by the following four functions: pthread_key_create, pthread_key_delete, pthread_getspecific, pthread_setspecific. I will not explain what they do because it should pretty self-evident. If it's not, see the man pages.

In principle, we could just have a hash table in the tbthread struct. We could then use setspecific and getspecific to set and retrieve the value associated with each given key. Calling setspecific with a NULL pointer would delete the key. Unfortunately, the designers of pthreads made it a bit more complicated by having separate key_create and key_delete functions, with the delete function invalidating the key in all the threads. Glibc uses a global array of keys and sequence numbers in a clever way to solve this problem. We will take almost the same approach in a less efficient but a bit clearer way.

We will have a global array representing keys. Each element of this array will host a pointer to a destructor and a sequence number. The index in the array will be the actual key passed around between the TLS functions. Both key_create and key_delete will bump the sequence number associated with a key in an atomic way. If the number is even, the key is not allocated, if the number is odd, it is.

 1static struct
 3  uint64_t seq;
 4  void (*destructor)(void *);
 7#define KEY_UNUSED(k) ((keys[k].seq&1) == 0)
 8#define KEY_ACQUIRE(k) (__sync_bool_compare_and_swap(&(keys[k].seq), keys[k].seq, keys[k].seq+1))
 9#define KEY_RELEASE(k) (__sync_bool_compare_and_swap(&(keys[k].seq), keys[k].seq, keys[k].seq+1))
11int tbthread_key_create(tbthread_key_t *key, void (*destructor)(void *))
13  for(int i = 0; i < TBTHREAD_MAX_KEYS; ++i) {
14    if(KEY_UNUSED(i) && KEY_ACQUIRE(i)) {
15      *key = i;
16      keys[i].destructor = destructor;
17      return 0;
18    }
19  }
20  return -ENOMEM;

Each tbthread struct will hold an array of the same size as the global array of keys. Each element in this array will hold a data pointer and a sequence number. Storing data will set the sequence number as well. Retrieving the data will check whether the local and the global sequence number match before proceeding.

 1void *tbthread_getspecific(tbthread_key_t key)
 3  if(key >= TBTHREAD_MAX_KEYS || KEY_UNUSED(key))
 4    return 0;
 6  tbthread_t self = tbthread_self();
 7  if(self->tls[key].seq == keys[key].seq)
 8    return self->tls[key].data;
 9  return 0;

See the full patch here.


KEY_ACQUIRE and KEY_RELEASE use gcc atomic builtins described here.

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion


This is a first of hopefully many posts documenting my attempts to understand how to implement a pthread-style threading system on Linux. To this end, I started implementing a small, non-portable and a rather useless library that I called Thread Bites. Thread Bites is useless mainly because it lacks support for compiler-generated thread-local storage. It may not sound like a grave issue, but it makes Thread Bites incompatible with most of glibc. Therefore, I had to provide my own functionality for invoking syscalls, managing the heap and even printing stuff to stdout. It's all pretty simple and understandable so far, so I hope I will be able to implement most of the pthreads' functionality in a couple of small bites. You can get the source from GitHub.


For a program to be even remotely useful, it needs to communicate with the user in one way or another. A standard library for the programming language, like glibc, typically provides all the necessary components for such communication. For reasons mentioned in the introduction, using glibc in this case is not advisable. Hence, I need to find a way to call the operating system directly without using the syscall function, because it is also a part glibc and sets errno, which is supposed to reside in the TLS that I did not set up.

All that is needed to implement an equivalent function is shuffling around the values of the registers to translate between the calling conventions of C and the Linux kernel, as described here and here on page 20. As it turns out, it's not that hard to do it using inline assembly in C, and there's an excellent tutorial here.

 1#define SYSCALL(name, a1, a2, a3, a4, a5, a6)           \
 2  ({                                                    \
 3    long result;                                        \
 4    long __a1 = (long)(a1);                             \
 5    long __a2 = (long)(a2);                             \
 6    long __a3 = (long)(a3);                             \
 7    long __a4 = (long)(a4);                             \
 8    long __a5 = (long)(a5);                             \
 9    long __a6 = (long)(a6);                             \
10    register long _a1 asm("rdi") = __a1;                \
11    register long _a2 asm("rsi") = __a2;                \
12    register long _a3 asm("rdx") = __a3;                \
13    register long _a4 asm("r10") = __a4;                \
14    register long _a5 asm("r8")  = __a5;                \
15    register long _a6 asm("r9")  = __a6;                \
16    asm volatile (                                      \
17      "syscall\n\t"                                     \
18      : "=a" (result)                                   \
19      : "0" (name), "r" (_a1), "r" (_a2), "r" (_a3),    \
20        "r" (_a4), "r" (_a5), "r" (_a6)                 \
21      : "memory", "cc", "r11", "cx");                   \
22    (long) result; })
24#define SYSCALL1(name, a1) \
25  SYSCALL(name, a1, 0, 0, 0, 0, 0)
26#define SYSCALL2(name, a1, a2) \
27  SYSCALL(name, a1, a2, 0, 0, 0, 0)
28#define SYSCALL3(name, a1, a2, a3) \
29  SYSCALL(name, a1, a2, a3, 0, 0, 0)
30#define SYSCALL4(name, a1, a2, a3, a4) \
31  SYSCALL(name, a1, a2, a3, a4, 0, 0)
32#define SYSCALL5(name, a1, a2, a3, a4, a5) \
33  SYSCALL(name, a1, a2, a3, a4, a5, 0)
34#define SYSCALL6(name, a1, a2, a3, a4, a5, a6) \
35  SYSCALL(name, a1, a2, a3, a4, a5, a6)

All this is, of course, horribly inefficient because it messes up with all the registers even in the situations it does not have to. The intermediate variables for the parameters (__a1 and friends) are used to prevent embedded function calls from messing with the registers that have already been set; think of strlen in SYSCALL3(__NR_write, 1, blah, strlen(blah)).

See the full code on GitHub.

Printing to stdout

It seems that glibc is using errno and other thread local stuff to calculate buffer sizes in one of the subroutines called by printf. It causes printf to segfault when called concurrently from different threads because of the same TLS story. Thread Bites provides a convenience function similar to printf and supporting %s %x %u %o %d in l and ll flavors:

1void tbprint(const char *format, ...);


Glibc's default malloc implementation, a ptmalloc2 derivative, uses thread-specific arenas to limit lock congestion caused by calling malloc concurrently from multiple threads. It looks like it depends on TLS, so using it is not the best idea. Thread Bites comes with its own evil version of malloc. It's pathological because it's extremely prone to fragmentation, it never shrinks the heap, and it's essentially one big critical section. It has some undeniable advantages too: it works, it's incredibly simple, and it fits in around 50 lines of code. Look here to find it.

The only thing worth noting in this section is that the sys_brk syscall does not behave like glibc's brk or sbrk functions. On error, it would return the previous location of the heap boundary, so the code calls it with an obviously wrong parameter (0) to figure out what the initial heap boundary is.


Clone is an interesting beast. From the standpoint of this section, the relevant thing about it is that it behaves mostly like fork, except that the child is launched on a new stack. For this reason and to generate proper Call Frame Information (see here and here), it needs to be implemented in assembly. The implementation puts the user function pointer and its argument on the child's stack so that they can be later popped and called by the child. Then, it puts the syscall parameters in the appropriate registers and makes the syscall. Again, all the code is on GitHub.

Creating a thread

After dealing with all this boilerplate, the actual thread creation is relatively straightforward. First, the new thread needs a stack it can run on. The stack could be allocated on the heap, but it's probably safer to get a separate memory region for it. It can be done using the mmap syscall. A proper threading library should check the system limits to figure out the size of the stack, but Thread Bites is not proper, so it will use 8MiB as a default. All the function parameters in the snippet below match glibc's mmap call.

1void *stack = tbmmap(NULL, attr->stack_size, PROT_READ | PROT_WRITE,
2                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
3long status = (long)stack;
4if(status < 0)
5  return status;

It's a good idea to mark a page at the beginning of the stack as non-readable and non-writable to protect any valid adjecent memory if there is any. Stepping over the guard page will get us a familiar segmentation fault. Why at the beginning and not at the end? Because the stack grows downwards, i.e. from a high address towards a lower one. Go here for more details.

1status = SYSCALL3(__NR_mprotect, stack, EXEC_PAGESIZE, PROT_NONE);
2if(status < 0) {
3  tbmunmap(stack, attr->stack_size);
4  return status;

We finally have all that is needed to spawn a new kernel task sharing with the main task everything that you would expect threads running within the same process to share. The man page of clone discusses all of the parameters in details.

2flags |= CLONE_THREAD;
3int tid = tbclone(start_thread, *thread, flags, stack+attr->stack_size);
4if(tid < 0) {
5  tbmunmap(stack, attr->stack_size);
6  free(*thread);
7  return tid;

Note that the user function is not called directly and start_thread is called instead. It does some internal book-keeping and eventually calls the user-supplied function:

1tbthread_t th = (tbthread_t)arg;
2uint32_t stack_size = th->stack_size;
3void *stack = th->stack;

When the function returns, the stack memory needs to be freed (munmaped) and the thread terminated. The clone function took some provisions for calling sys_exit with the status returned by the wrapper. However, here we remove the stack from underneath our feet, and we cannot risk any C function write over it. Therefore, we need to call sys_munmap and sys_exit in one piece of assembly.

 1register long a1 asm("rdi") = (long)stack;
 2register long a2 asm("rsi") = stack_size;
 3asm volatile(
 4  "syscall\n\t"
 5  "movq $60, %%rax\n\t" // 60 = __NR_exit
 6  "movq $0, %%rdi\n\t"
 7  "syscall"
 8  :
 9  : "a" (__NR_munmap), "r" (a1), "r" (a2)
10  : "memory", "cc", "r11", "cx");
11return 0;

The full code.

Lessons learned

  • The logic involved in running threads does not seem to be horrendously complicated, at least not yet.
  • Glibc is quite convoluted and figuring out what is going on and where is a challenge. It is probably justified by the range of platforms it supports, but the documentation could be better.
  • Threading is tightly coupled with glibc in the areas that I have not initially suspected: the dynamic linker to support TLS, globals in TLS (errno), locales, finalizers, and so on.
  • POSIX threads have an interesting relationship with the Linux task model. This document describes the initial incompatibilities that have ultimately been ironed out. It is a very interesting read if you want to understand how the userspace and the kernelspace interact to implement a threading system.
  • Ultimately all this is not necessarily a black magic. At least not so far.