Content tagged threads
Tags
Months
Table of Contents
- Creating threads
- Thread-local storage
- Mutexes
- Joining and initialization
- Cancellation
- Scheduling
- RW Locks
- Condition Variables
- Conclusion
Conclusion
I have implemented all of the interesting functions listed here and, thus, reached my goal. There were quite a few surprises. I had expected some things to bo more complicated than they are. Conversely, some things that had seemed simple turned out to be quite complex.
- I had initially hoped that I would be able to re-use much of glibc and concentrate only on the thread-specific functionality. I was surprised to discover how much of glibc code refers to thread-local storage.
- I had expected the interaction between
join
anddetach
to be much simpler to handle. Having to implement descriptor caching was an unexpected event. - I had never heard of
pthread_once
before. - I had not used much of the real-time functionality before, so figuring out the scheduling part was very entertaining. I especially enjoyed implementing the PRIO_INHERIT mutex.
I may revisit this project in the future because there are still some things that I would like to learn more about.
- If I'll have the time to learn DWARF, I would like to provide proper
.eh_frame
for the signal trampoline. It would allow me to implement cancellation using stack unwinding the way glibc does it. - I may look into the inter-process synchronization to learn about the robust futexes.
- The Intel article on lock elision seemed interesting, and I'd like to play with this stuff as well.
- I may have a look at the compiler-generated TLS.
The End
Table of Contents
- Creating threads
- Thread-local storage
- Mutexes
- Joining and initialization
- Cancellation
- Scheduling
- RW Locks
- Condition Variables
- Conclusion
Condition Variables
Condition variables are a mechanism used for signaling that a certain predicate
has become true. The POSIX mechanism for handling them boils down to three
functions: cond_wait
, cond_signal
and cond_broadcast
. The first one causes
the thread that calls it to wait. The second wakes a thread up so that it can
verify whether the condition is true. The third wakes all the waiters up.
The waiter
1 tb_futex_lock(&cond->lock);
2 ...
3 ++cond->waiters;
4 int bseq = cond->broadcast_seq;
5 int futex = cond->futex;
6 tb_futex_unlock(&cond->lock);
7
8 while(1) {
9 st = SYSCALL3(__NR_futex, &cond->futex, FUTEX_WAIT, futex);
10 if(st == -EINTR)
11 continue;
12
13 tb_futex_lock(&cond->lock);
14 if(cond->signal_num) {
15 --cond->signal_num;
16 goto exit;
17 }
18
19 if(bseq != cond->broadcast_seq)
20 goto exit;
21 tb_futex_unlock(&cond->lock);
22 }
The algorithm is as follows:
- We lock the internal lock.
- We remember the value of the broadcast sequence and the futex.
- In a loop, we wait for the futex.
- We go back to sleeping if the
FUTEX_WAIT
syscall was interrupted by a signal. - We consume a signal, if we can, and exit.
- If there was a broadcast, we exit too.
- Otherwise, we go back to sleep.
Signal
We wake one of the threads up. We bump the value of the futex to prevent a deadlock. Then we bump the number of signals and wake the futex.
1 int tbthread_cond_signal(tbthread_cond_t *cond)
2 {
3 tb_futex_lock(&cond->lock);
4 if(cond->waiters == cond->signal_num)
5 goto exit;
6 ++cond->futex;
7 ++cond->signal_num;
8 SYSCALL3(__NR_futex, &cond->futex, FUTEX_WAKE, 1);
9 exit:
10 tb_futex_unlock(&cond->lock);
11 return 0;
12 }
Broadcast
We wake all the waiters. The algorithm is essentially the same as for signal, except that, we bump the broadcast sequence number and wake all the threads instead of just one.
1 int tbthread_cond_broadcast(tbthread_cond_t *cond)
2 {
3 tb_futex_lock(&cond->lock);
4 if(!cond->waiters)
5 goto exit;
6 ++cond->futex;
7 ++cond->broadcast_seq;
8 SYSCALL3(__NR_futex, &cond->futex, FUTEX_WAKE, INT_MAX);
9 exit:
10 tb_futex_unlock(&cond->lock);
11 return 0;
12 }
See the full patch at GitHub.
Table of Contents
- Creating threads
- Thread-local storage
- Mutexes
- Joining and initialization
- Cancellation
- Scheduling
- RW Locks
- Condition Variables
- Conclusion
RW Locks
A read-write lock protects a critical section by allowing multiple readers when there are no writers. We won't bother implementing lock attributes handling because we don't support process-shared locks (irrelevant in our case) and we don't let the user prefer readers (a non-POSIX extension). The implementation remembers the ID of the current writer if any. It also counts readers as well as the queued writers. We use two futexes, one to block the readers and one to block the writers.
A writer first bumps the number of queued writers. If there is no other writer and no readers, it marks itself as the owner of the lock and decrements the number of queued writers. It goes to sleep on the writer futex otherwise.
1 int tbthread_rwlock_wrlock(tbthread_rwlock_t *rwlock)
2 {
3 int queued = 0;
4 while(1) {
5 tb_futex_lock(&rwlock->lock);
6
7 if(!queued) {
8 queued = 1;
9 ++rwlock->writers_queued;
10 }
11
12 if(!rwlock->writer && !rwlock->readers) {
13 rwlock->writer = tbthread_self();
14 --rwlock->writers_queued;
15 tb_futex_unlock(&rwlock->lock);
16 return 0;
17 }
18 int sleep_status = rwlock->wr_futex;
19
20 tb_futex_unlock(&rwlock->lock);
21
22 SYSCALL3(__NR_futex, &rwlock->wr_futex, FUTEX_WAIT, sleep_status);
23 }
24 }
A reader acquires the lock if there are no writers at all. It goes to sleep on the reader futex otherwise.
1 int tbthread_rwlock_rdlock(tbthread_rwlock_t *rwlock)
2 {
3 while(1) {
4 tb_futex_lock(&rwlock->lock);
5
6 if(!rwlock->writer && !rwlock->writers_queued) {
7 ++rwlock->readers;
8 tb_futex_unlock(&rwlock->lock);
9 return 0;
10 }
11 int sleep_status = rwlock->rd_futex;
12
13 tb_futex_unlock(&rwlock->lock);
14
15 SYSCALL3(__NR_futex, &rwlock->rd_futex, FUTEX_WAIT, sleep_status);
16 }
17 }
When unlocking, we use the writer
field to determine whether we were a reader
or a writer. If we were a writer, we've had an exclusive ownership of the lock.
Therefore, we need to either wake another writer or all of the readers,
depending on the state of the counters. If we were a reader, we've had a
non-exclusive lock. Therefore, we only need to wake a writer when we're the last
reader and there is a writer queued. We bump the value of the futex because we
want to handle the cases when FUTEX_WAKE
was called before the other thread
manged to call FUTEX_WAIT
.
1 int tbthread_rwlock_unlock(tbthread_rwlock_t *rwlock)
2 {
3 tb_futex_lock(&rwlock->lock);
4 if(rwlock->writer) {
5 rwlock->writer = 0;
6 if(rwlock->writers_queued) {
7 __sync_fetch_and_add(&rwlock->wr_futex, 1);
8 SYSCALL3(__NR_futex, &rwlock->wr_futex, FUTEX_WAKE, 1);
9 } else {
10 __sync_fetch_and_add(&rwlock->rd_futex, 1);
11 SYSCALL3(__NR_futex, &rwlock->rd_futex, FUTEX_WAKE, INT_MAX);
12 }
13 goto exit;
14 }
15
16 --rwlock->readers;
17 if(!rwlock->readers && rwlock->writers_queued) {
18 __sync_fetch_and_add(&rwlock->wr_futex, 1);
19 SYSCALL3(__NR_futex, &rwlock->wr_futex, FUTEX_WAKE, 1);
20 }
21
22 exit:
23 tb_futex_unlock(&rwlock->lock);
24 return 0;
25 }
See the full patch at GitHub.
Table of Contents
- Creating threads
- Thread-local storage
- Mutexes
- Joining and initialization
- Cancellation
- Scheduling
- RW Locks
- Condition Variables
- Conclusion
Thread Scheduling
The scheduler makes a decision of which thread to run next based on two parameters: scheduling policy and priority.
Conceptually, the scheduler maintains a list of runnable threads for each possible sched_priority value. In order to determine which thread runs next, the scheduler looks for the nonempty list with the highest static priority and selects the thread at the head of this list.
A thread's scheduling policy determines where it will be inserted into the list of threads with equal static priority and how it will move inside this list. the sched(7) man page
The supported policies are:
SCHED_NORMAL
- It's the default Linux policy for threads not requiring any real-time machinery. All the threads have priority of 0, and the decision of which thread gets run next is based on the nice mechanism.SCHED_FIFO
- The threads have priority from 1 (low) to 99 (high). When aSCHED_FIFO
thread becomes runnable, it will preempt any thread of lower priority. There is no time slicing, so the hight priority thread runs as long as it must.SCHED_RR
- RR stands for round-robin. It's the same asSCHED_FIFO
except each thread is allowed to run for a limited time quantum.
See the sched(7) man page for more details.
Setting the scheduling parameters of a kernel task boils down to invoking the
sys_sched_setscheduler
syscall, as shown below.
1 struct tb_sched_param
2 {
3 int sched_priority;
4 };
5
6 int tb_set_sched(tbthread_t thread, int policy, int priority)
7 {
8 struct tb_sched_param p; p.sched_priority = priority;
9 int ret = SYSCALL3(__NR_sched_setscheduler, thread->tid, policy, &p);
10 if(!ret)
11 thread->sched_info = SCHED_INFO_PACK(policy, priority);
12 return ret;
13 }
There is a bit more fun to it, though. As you can see, the kernel can only schedule a task that already exists. Therefore, we need to have a way to set the thread's priority before this thread invokes the user function. The reason for this is that we may need to abort this thread immediately should the scheduler setting fail. We do it by having another futex that we wake when we know whether the thread can run or not:
1 if(th->start_status != TB_START_OK) {
2 SYSCALL3(__NR_futex, &th->start_status, FUTEX_WAIT, TB_START_WAIT);
3 if(th->start_status == TB_START_EXIT)
4 SYSCALL1(__NR_exit, 0);
5 }
This futex is initialized in the tbthread_create
function depending on the
thread attributes:
1 if(!attr->sched_inherit)
2 (*thread)->start_status = TB_START_WAIT;
3 else {
4 tbthread_t self = tbthread_self();
5 (*thread)->sched_info = self->sched_info;
6 }
And then set to either TB_START_OK
or TB_START_EXIT
after we spawn the
thread but before tbthread_create
exits:
1 if(!attr->sched_inherit) {
2 ret = tb_set_sched(*thread, attr->sched_policy, attr->sched_priority);
3
4 if(ret) (*thread)->start_status = TB_START_EXIT;
5 else (*thread)->start_status = TB_START_OK;
6 SYSCALL3(__NR_futex, &(*thread)->start_status, FUTEX_WAKE, 1);
7
8 if(ret) {
9 wait_for_thread(*thread);
10 goto error;
11 }
12 }
See the patch at GitHub.
Priority mutexes
Priority mutexes come in three varieties:
TBTHREAD_PRIO_NONE
- Acquiring this type of mutex does not change the scheduling characteristics of the thread.TBTHREAD_PRIO_INHERIT
- When the mutex owner blocks another thread with a higher priority, the owner inherits the priority of the thread it blocks if it's higher than its own.TBTHREAD_PRIO_PROTECT
- Acquiring this kind of mutex raises the priority of the owner to the prioceiling value of the mutex.
Thread Bites implements this functionality by keeping lists of PRIO_INHERIT
and PRIO_PROTECT
mutexes. It then calculates the highest possible priority
taking into account the priority of the mutexes and the priority set by the
user.
The implementation of PRIO_PROTECT
is relatively straightforward. Whenever a
thread acquires this kind of mutex, it is added to the list, and the priority
of the thread is recalculated:
1 static int lock_prio_protect(tbthread_mutex_t *mutex)
2 {
3 lock_prio_none(mutex);
4 tb_protect_mutex_sched(mutex);
5 return 0;
6 }
7
8 static int unlock_prio_protect(tbthread_mutex_t *mutex)
9 {
10 tbthread_t self = tbthread_self();
11 tb_protect_mutex_unsched(mutex);
12 unlock_prio_none(mutex);
13 return 0;
14 }
Implementing PRIO_INHERIT
is a lot more tricky. We add the mutex to the
appropriate list when a thread acquires it. Whenever a higher priority thread
tries to lock the mutex, it bumps the priority of the blocker. But the priority
recalculation is done only at this point. Implementing it like this covers all
the main cases and is not horrendously hard. It allows for simple recursion: if
the owner of a mutex gets blocked, the blocker inherits the priority that comes
with the first mutex. It also has a couple of drawbacks:
- It assumes that the kernel will always wake the highest priority thread. It makes sense and is most likely the case. However, I have not tested it.
- If the owner of a
PRIO_INHERIT
mutex is already blocked on another mutex of the same kind and it's priority gets bumped later, the last thread in the line won't be affected.
1 static int lock_prio_inherit(tbthread_mutex_t *mutex)
2 {
3 tbthread_t self = tbthread_self();
4
5 while(1) {
6 int locked = 0;
7 tb_futex_lock(&mutex->internal_futex);
8 if(mutex->futex == 0) {
9 locked = 1;
10 mutex->owner = self;
11 mutex->futex = 1;
12 tb_inherit_mutex_add(mutex);
13 }
14 else
15 tb_inherit_mutex_sched(mutex, self);
16 tb_futex_unlock(&mutex->internal_futex);
17 if(locked)
18 return 0;
19 SYSCALL3(__NR_futex, &mutex->futex, FUTEX_WAIT, 1);
20 }
21 }
22
23 static int unlock_prio_inherit(tbthread_mutex_t *mutex)
24 {
25 tb_futex_lock(&mutex->internal_futex);
26 tb_inherit_mutex_unsched(mutex);
27 mutex->owner = 0;
28 mutex->futex = 0;
29 SYSCALL3(__NR_futex, &mutex->futex, FUTEX_WAKE, 1);
30 tb_futex_unlock(&mutex->internal_futex);
31 return 0;
32 }
It was by far the most challenging part so far. See the patch at GitHub.
Remaining functions
pthread_setconcurrency
- It defines how many kernel tasks should be created to handle the user-level threads. It does not make sense in our case because we create a kernel task for every thread.pthread_attr_setscope
- It defines the set of threads against which the thread will compete for resources. There are two settings:PTHREAD_SCOPE_SYSTEM
meaning all the threads in the entire system andPTHREAD_SCOPE_PROCESS
meaning only the threads within the process. The man page says that Linux only supportsPTHREAD_SCOPE_SYSTEM
, but I am not sure whether it's still the case with all the cgroups stuff.
Table of Contents
- Creating threads
- Thread-local storage
- Mutexes
- Joining and initialization
- Cancellation
- Scheduling
- RW Locks
- Condition Variables
- Conclusion
Cancellation
Cancellation boils down to making one thread exit following a request from
another thread. It seems that calling tbthread_exit
at an appropriate point
is enough to implement all of the behavior described in the man pages. We will
go this way despite the fact that it is not the approach taken by glibc.
Glibc unwinds the stack back to the point invoking the user-supplied thread
function. This behavior allows it to simulate an exception if C++ code is using
the library. We don't bother with C++ support for the moment and don't always
care to supply valid DWARF information. Therefore, we will take the easier
approach.
tbthread_setcancelstate
and tbthread_setcanceltype
are the two functions
controlling the response of a thread to a cancellation request. The former
enables or disables cancellation altogether queuing the requests for later
handling if necessary. The latter decides whether the thread should abort
immediately or at a cancellation point. POSIX has a list of cancellation points,
but we will not bother with them. Instead, we'll just use tbthread_testcancel
and the two functions mentioned before for this purpose.
The thread must not get interrupted after it disables or defers cancellation. It would likely lead to deadlocks due to unreleased mutexes, memory leaks and such. The trick here is to update all the cancellation related flags atomically. So, we use one variable to handle the following flags:
TB_CANCEL_ENABLED
: The cancellation is enabled; if a cancellation request has been queued, reaching a cancellation point will cause the thread to exit.TB_CANCEL_DEFERRED
: The cancellation is deferred (not asynchronous);SIGCANCEL
will not be sent; see the paragraph on signal handling.TB_CANCELING
: A cancellation request has been queued; depending on other flags,SIGCANCEL
may be sent.TB_CANCELED
: A cancellation request has been taken into account and the thread is in the process of exiting; this flag is used to handle the cases when a cancellation point has been reached beforeSIGCANCEL
has been delivered by the kernel.
The tbhread_testcancel
looks as follows:
1 void tbthread_testcancel()
2 {
3 tbthread_t thread = tbthread_self();
4 uint8_t val, newval;
5
6 while(1) {
7 newval = val = thread->cancel_status;
8 if(!(val & TB_CANCEL_ENABLED) || !(val & TB_CANCELING) ||
9 (val & TB_CANCELED))
10 return;
11 newval |= TB_CANCELED;
12 if(__sync_bool_compare_and_swap(&thread->cancel_status, val, newval))
13 break;
14 }
15 tbthread_exit(TBTHREAD_CANCELED);
16 }
See the full patch at GitHub.
Clean-up handlers
The user may register a bunch of functions cleaning up the mess caused by an
unexpected interruption. They are installed with tbthread_cleanup_push
and
called when the thread exits abnormally. The purpose of these functions is to
unlock mutexes, free the heap memory and such. tbthread_cleanup_pop
removes
them and optionally executes in the process.
1 void tbthread_cleanup_push(void (*func)(void *), void *arg)
2 {
3 tbthread_t self = tbthread_self();
4 struct cleanup_elem *e = malloc(sizeof(struct cleanup_elem));
5 e->func = func;
6 e->arg = arg;
7 list_add_elem(&self->cleanup_handlers, e, 1);
8 }
9
10 void tbthread_cleanup_pop(int execute)
11 {
12 tbthread_t self = tbthread_self();
13 list_t *node = self->cleanup_handlers.next;
14 if(!node)
15 return;
16 list_rm(node);
17 struct cleanup_elem *e = (struct cleanup_elem*)node->element;
18 if(execute)
19 (*e->func)(e->arg);
20 free(e);
21 free(node);
22 }
See the full patch at GitHub.
Signals and asynchronous cancellation
The asynchronous cancellation uses the first real-time signal, SIGRTMIN
, that
we call SIGCANCEL
here for clarity.
Registering a signal handler is somewhat more tricky than just calling the
appropriate syscall. It is so because, on x86_64, we need to provide a function
that restores the stack after the signal handler returns. The function is called
a signal trampoline and its purpose is to invoke sys_rt_sigreturn
. The
trampoline is registered with the kernel using a special sigaction
flag:
1 void __restore_rt();
2 #define SA_RESTORER 0x04000000
3
4 int tbsigaction(int signum, struct sigaction *act, struct sigaction *old)
5 {
6 act->sa_flags |= SA_RESTORER;
7 act->sa_restorer = __restore_rt;
8 return SYSCALL4(__NR_rt_sigaction, signum, act, old, sizeof(sigset_t));
9 }
The trampoline itself, called __restore_rt
here, is defined in assembly as follows:
1 .text
2
3 .global __restore_rt
4 .type __restore_rt,@function
5 .align 16
6
7 __restore_rt:
8 movq $__NR_rt_sigreturn, %rax
9 syscall
Looking at the corresponding glibc code, you can see that they add the
eh_frame
info here. The comments say that it is to aid gdb and handle the
stack unwinding. I don't know enough DWARF to write one on my own, gdb does
not seem to be utterly confused without it, and we won't do stack unwinding, so
we just won't bother with it for the moment.
In the cancellation handler, we first check whether it's the right signal and
that it has been sent by a thread in the same thread group. We then need to
check whether the thread is still in the asynchronous cancellation mode. It
might have changed between the time the signal was sent and the time the it is
delivered. Finally, we call thread_testcancel
to see if the thread should
exit.
1 void tb_cancel_handler(int sig, siginfo_t *si, void *ctx)
2 {
3 if(sig != SIGCANCEL || si->si_pid != tb_pid || si->si_code != SI_TKILL)
4 return;
5
6 tbthread_t self = tbthread_self();
7 if(self->cancel_status & TB_CANCEL_DEFERRED)
8 return;
9
10 tbthread_testcancel();
11 }
We invoke sys_tgkill
to send the signal:
1 SYSCALL3(__NR_tgkill, tb_pid, thread->exit_futex, SIGCANCEL);
See the full patch at GitHub.
Cancellation of a "once" function
The implementation of tbthread_once
gets quite a bit more interesting as well.
If the thread invoking the initialization function gets canceled, another thread
needs to pick it up. We need to install a cleanup handler that will change the
state of the once control back to TB_ONCE_NEW
and wake all the threads so that
they could restart from the beginning:
1 static void once_cleanup(void *arg)
2 {
3 tbthread_once_t *once = (tbthread_once_t *)arg;
4 *once = TB_ONCE_NEW;
5 SYSCALL3(__NR_futex, once, FUTEX_WAKE, INT_MAX);
6 }
7
8 int tbthread_once(tbthread_once_t *once, void (*func)(void))
9 {
10 if(!once || !func)
11 return -EINVAL;
12
13 int cancel_state;
14
15 while(1) {
16 if(*once == TB_ONCE_DONE)
17 return 0;
18
19 //--------------------------------------------------------------------------
20 // The executor
21 //--------------------------------------------------------------------------
22 tbthread_setcancelstate(TBTHREAD_CANCEL_DISABLE, &cancel_state);
23 if(__sync_bool_compare_and_swap(once, TB_ONCE_NEW, TB_ONCE_IN_PROGRESS)) {
24 tbthread_cleanup_push(once_cleanup, once);
25 tbthread_setcancelstate(cancel_state, 0);
26
27 (*func)();
28
29 tbthread_setcancelstate(TBTHREAD_CANCEL_DISABLE, &cancel_state);
30 tbthread_cleanup_pop(0);
31
32 *once = TB_ONCE_DONE;
33 SYSCALL3(__NR_futex, once, FUTEX_WAKE, INT_MAX);
34 tbthread_setcancelstate(cancel_state, 0);
35 return 0;
36 }
37
38 tbthread_setcancelstate(cancel_state, 0);
39
40 //--------------------------------------------------------------------------
41 // The waiters
42 //--------------------------------------------------------------------------
43 while(1) {
44 SYSCALL3(__NR_futex, once, FUTEX_WAIT, TB_ONCE_IN_PROGRESS);
45 if(*once != TB_ONCE_IN_PROGRESS)
46 break;
47 }
48 }
49 }
See the patch at GitHub.
Table of Contents
- Creating threads
- Thread-local storage
- Mutexes
- Joining and initialization
- Cancellation
- Scheduling
- RW Locks
- Condition Variables
- Conclusion
Recycling the thread descriptors
How do we know when the thread task has died? This is what the CHILD_SETTID
and CHILD_CLEARTID
flags to sys_clone
are for. If they are set, the kernel
will store the new thread's TID at the location pointed to by the ctid
argument (see tb #1). When the thread terminates, the kernel will set the
TID to 0
and wake the futex at this location. It is a convenient way to wait
for a thread to finish. Unfortunately, as far as I can tell, there is no way to
unset these flags, and it makes implementing tbthread_detach
a pain. We cannot
delete the thread descriptor in the thread it refers to anymore. Doing so would
cause the kernel to write to a memory location that might have been either
unmapped or reused. Therefore, we need to have some sort of a cache holding
thread descriptors and make sure that we re-use them only after the thread they
were referring to before has exited. Thread bites uses two linked lists to
maintain this cache, and the descriptor allocation function calls the following
procedure to wait until the corresponding task is gone:
1 static void wait_for_thread(tbthread_t thread)
2 {
3 uint32_t tid = thread->exit_futex;
4 long ret = 0;
5 if(tid != 0)
6 do {
7 ret = SYSCALL3(__NR_futex, &thread->exit_futex, FUTEX_WAIT, tid);
8 } while(ret != -EWOULDBLOCK && ret != 0);
9 }
See the full patch at GitHub.
Joining threads
Joining complicates things a bit further because it does quite a bit of error
checking to prevent deadlocks and such. To perform these checks, the thread
calling tbthread_join
needs to have a valid thread descriptor obtainable by
tbthread_self
. The problem is that we have never set this thread descriptor
up for the main thread, and we need to do it by hand at the beginning of the
program. The original state needs to be restored at the end because glibc
uses it internally and not cleaning things up causes segfaults.
1 static void *glibc_thread_desc;
2 void tbthread_init()
3 {
4 glibc_thread_desc = tbthread_self();
5 tbthread_t thread = malloc(sizeof(struct tbthread));
6 memset(thread, 0, sizeof(struct tbthread));
7 thread->self = thread;
8 SYSCALL2(__NR_arch_prctl, ARCH_SET_FS, thread);
9 }
10
11 void tbthread_finit()
12 {
13 free(tbthread_self());
14 SYSCALL2(__NR_arch_prctl, ARCH_SET_FS, glibc_thread_desc);
15 }
After performing all the validity and deadlock checks, the meat of
tbthread_join
is rather simple:
1 wait_for_thread(thread);
2 if(retval)
3 *retval = thread->retval;
4 release_descriptor(thread);
5 return 0;
See the full patch at GitHub.
Dynamic initialization
pthread_once
is an interesting beast. Its purpose is to initialize dynamically
some resources by calling a designated function exactly once. The fun part is
that the actual initialization call may be made from multiple threads at the
same time. pthread_once_t
, therefore, is kind of like a mutex, but has three
states instead of two:
- new: the initialization function has not been called yet; one of the threads needs to call it.
- in progress: the initialization function is running; the threads are waiting for it to finish.
- done: the initialization function is done; all the threads may be woken up.
The thread that manages to change the state from new to in progress gets to call the function. All the other threads wait until the done state is reached.
1 int tbthread_once(tbthread_once_t *once, void (*func)(void))
2 {
3 if(!once || !func)
4 return -EINVAL;
5
6 if(*once == TB_ONCE_DONE)
7 return 0;
8
9 if(__sync_bool_compare_and_swap(once, TB_ONCE_NEW, TB_ONCE_IN_PROGRESS)) {
10 (*func)();
11 *once = TB_ONCE_DONE;
12 SYSCALL3(__NR_futex, once, FUTEX_WAKE, INT_MAX);
13 return 0;
14 }
15
16 while(*once != TB_ONCE_DONE)
17 SYSCALL3(__NR_futex, once, FUTEX_WAIT, TB_ONCE_IN_PROGRESS);
18 return 0;
19 }
Side effects
The original glibc thread descriptor stores the localization information for
the thread. Changing it to ours causes seemingly simple functions, like
strerror
, to segfault. Therefore, we need to implement strerror
ourselves.
Table of Contents
- Creating threads
- Thread-local storage
- Mutexes
- Joining and initialization
- Cancellation
- Scheduling
- RW Locks
- Condition Variables
- Conclusion
Intro
This part discusses an implementation of a mutex. It will not be a particularly efficient mutex, but it will be an understandable and straightforward one. We will not bother minimizing the number of system calls or implementing lock elision. We will also not handle the case of inter-process communication. Therefore, the process-shared and robust mutexes will not be discussed. If you are interested in these ideas, I recommend the kernel's documentation file on robust mutexes and Intel's blog post on lock elision. The scheduling related parameters will likely be dealt with on another occasion.
Futexes
POSIX defines a bunch of different mutexes. See the manpage for
pthread_mutexattr_settype
to learn more. On Linux, all of them are implemented
using the same locking primitive - a Futex (Fast User-Space Mutex). It is a
4-byte long chunk of aligned memory. Its contents can be updated atomically, and
its address can be used to refer to a kernel-level process queue. The kernel
interface is defined as follows:
1 asmlinkage long sys_futex(u32 __user *uaddr, int op, u32 val,
2 struct timespec __user *utime, u32 __user *uaddr2,
3 u32 val3);
We will only use the first four out of the six parameters here. The first one is the address of the futex, and the second one is the type of the operation to be performed. The meaning of the remaining parameters depends on the context. We will need only two of the available operations to implement a mutex:
FUTEX_WAIT
puts a thread to sleep if the value passed inval
is the same as the value stored in the memory pointed to by*uaddr
. Optionally, the sleep time may be limited by passing a pointer to atimespec
object. The return values are:- 0, if the thread was woken up by
FUTEX_WAIT
. EWOULDBLOCK
, if the value of the futex was different thanval
.EINTR
, if the sleep was interrupted by a signal.
- 0, if the thread was woken up by
FUTEX_WAKE
wakes the number of threads specified inval
. In practice, it only makes sense to either wake one or all sleeping threads, so we pass either1
orINT_MAX
respectively.
See the original futex paper for more details.
Normal mutex
We start with a normal mutex because it is possible to implement all the other
kinds using the procedures defined for it. If the value of the associated futex
is 0
, then the mutex is unlocked. Locking it means changing the value to 1
.
To avoid race conditions, both the checking and the changing need to be done
atomically. GCC has a built-in function to do this that results
with lock cmpxchgq
or similar being emitted in assembly. If the locking fails,
we need to wait until another thread releases the mutex and re-check the lock.
1 static int lock_normal(tbthread_mutex_t *mutex)
2 {
3 while(1) {
4 if(__sync_bool_compare_and_swap(&mutex->futex, 0, 1))
5 return 0;
6 SYSCALL3(__NR_futex, &mutex->futex, FUTEX_WAIT, 1);
7 }
8 }
The logic of trylock
is essentially the same, except that we return a failure
instead of sleeping.
1 static int trylock_normal(tbthread_mutex_t *mutex)
2 {
3 if(__sync_bool_compare_and_swap(&mutex->futex, 0, 1))
4 return 0;
5 return -EBUSY;
6 }
To unlock a mutex, we do the reverse. We change the futex value from 1
to 0
and wake one thread waiting for the futex to be released.
1 static int unlock_normal(tbthread_mutex_t *mutex)
2 {
3 if(__sync_bool_compare_and_swap(&mutex->futex, 1, 0))
4 SYSCALL3(__NR_futex, &mutex->futex, FUTEX_WAKE, 1);
5 return 0;
6 }
Note that the values stored in the futex are application-specific and arbitrary. The kernel does not care and does not change this variable except in one case, which we will discuss in a later chapter.
This mutex is not particularly efficient because we make a system call while unlocking regardless of whether there is a waiting thread or not. To see how the situation could be improved, please refer to Ulrich Drepper's Futexes Are Tricky.
Error-check mutex
These guys are the same as the ones discussed earlier, except that they do additional bookkeeping. An error-check mutex remembers who owns it, to report the following types of errors:
- re-locking of a mutex the thread already owns
- unlocking a mutex owned by another thread
- unlocking a mutex that is not locked
The behavior of normal mutexes in these cases is undefined.
1 static int lock_errorcheck(tbthread_mutex_t *mutex)
2 {
3 tbthread_t self = tbthread_self();
4 if(mutex->owner == self)
5 return -EDEADLK;
6 lock_normal(mutex);
7 mutex->owner = self;
8 return 0;
9 }
10
11 static int trylock_errorcheck(tbthread_mutex_t *mutex)
12 {
13 int ret = trylock_normal(mutex);
14 if(ret == 0)
15 mutex->owner = tbthread_self();
16 return ret;
17 }
18
19 static int unlock_errorcheck(tbthread_mutex_t *mutex)
20 {
21 if(mutex->owner != tbthread_self() || mutex->futex == 0)
22 return -EPERM;
23 mutex->owner = 0;
24 unlock_normal(mutex);
25 return 0;
26 }
Recursive mutex
Recursive mutexes may be locked multiple times by the same thread and require the same numbers of unlock operations to be released. To provide this kind of functionality, we just need to add a counter counter.
1 static int lock_recursive(tbthread_mutex_t *mutex)
2 {
3 tbthread_t self = tbthread_self();
4 if(mutex->owner != self) {
5 lock_normal(mutex);
6 mutex->owner = self;
7 }
8 if(mutex->counter == (uint64_t)-1)
9 return -EAGAIN;
10 ++mutex->counter;
11 return 0;
12 }
13
14 static int trylock_recursive(tbthread_mutex_t *mutex)
15 {
16 tbthread_t self = tbthread_self();
17 if(mutex->owner != self && trylock_normal(mutex))
18 return -EBUSY;
19
20 if(mutex->owner != self) {
21 mutex->owner = self;
22 mutex->counter = 1;
23 return 0;
24 }
25
26 if(mutex->counter == (uint64_t)-1)
27 return -EAGAIN;
28
29 ++mutex->counter;
30 return 0;
31 }
32
33 static int unlock_recursive(tbthread_mutex_t *mutex)
34 {
35 if(mutex->owner != tbthread_self())
36 return -EPERM;
37 --mutex->counter;
38 if(mutex->counter == 0) {
39 mutex->owner = 0;
40 return unlock_normal(mutex);
41 }
42 return 0;
43 }
Other code
This is pretty much it. The remaining part of the code is not interesting. Both
mutex
and mutexattr
objects need to be initialized to their default values,
but the futexes don't need any initialization or cleanup. As always, the full
patch is available on GitHub.
Table of Contents
- Creating threads
- Thread-local storage
- Mutexes
- Joining and initialization
- Cancellation
- Scheduling
- RW Locks
- Condition Variables
- Conclusion
Intro
The second part of our threading journey covers the thread-local storage. I do
not mean here the compiler generated TLS that I mentioned in the first part. I
mean the stuff that involves pthread_setspecific
and friends. I don't think
it's the most critical part of all this threading business. I barely ever use it
in practice. However, we will need to refer to the current thread all over the
place, and this requires the TLS. It's better to deal with it once and for all,
especially that it's not particularly complicated.
Self
How does one store something in a fixed location and make it distinct for each
execution thread? The only two answers that come to my mind are either using
syscalls to associate something with kernel's task_struct
or using CPU
registers. The first approach requires context switches to retrieve the value,
so it's rather inefficient. The second option, though, should be pretty fast.
Conveniently, on x86_64, some registers are left unused (see
StackOverflow). In fact, the SETTLS
option to clone
takes a pointer
and puts it in the file segment register for us, so we don't even need to make
an extra syscall just for that.
Since fs
is a segment register, we cannot retrieve it's absolute value without
engaging the operating system. Linux on x86_64 uses the arch_prctl
syscall for
this purpose:
1 #include "tb.h"
2 #include <asm/prctl.h>
3
4 tbthread_t tbthread_self()
5 {
6 tbthread_t self;
7 SYSCALL2(__NR_arch_prctl, ARCH_GET_FS, &self);
8 return self;
9 }
This syscall seems expensive (link) and making it defies the reason for using a register in the first place. We can read the memory at an address relative to the value of the register, though. Using this fact, we can make our thread struct point to itself and then retrieve this pointer using inline assembly. Here's how:
1 tbthread_t tbthread_self()
2 {
3 tbthread_t self;
4 asm("movq %%fs:0, %0\n\t" : "=r" (self));
5 return self;
6 }
For the main thread, the linker initializes the TLS segment for pthreads
automatically using arch_prctl
. We cannot use it for our purposes, but we can
count on tbthread_self
returning a unique, meaningful value.
See the full patch here.
TLS
The actual TLS is handled by the following four functions: pthread_key_create
,
pthread_key_delete
, pthread_getspecific
, pthread_setspecific
. I will not
explain what they do because it should pretty self-evident. If it's not, see the
man pages.
In principle, we could just have a hash table in the tbthread
struct. We could
then use setspecific
and getspecific
to set and retrieve the value
associated with each given key. Calling setspecific
with a NULL pointer would
delete the key. Unfortunately, the designers of pthreads made it a bit more
complicated by having separate key_create
and key_delete
functions, with the
delete function invalidating the key in all the threads. Glibc uses a global
array of keys and sequence numbers in a clever way to solve this problem. We
will take almost the same approach in a less efficient but a bit clearer way.
We will have a global array representing keys. Each element of this array will
host a pointer to a destructor and a sequence number. The index in the array
will be the actual key passed around between the TLS functions. Both
key_create
and key_delete
will bump the sequence number associated with a
key in an atomic way. If the number is even, the key is not allocated, if the
number is odd, it is.
1 static struct
2 {
3 uint64_t seq;
4 void (*destructor)(void *);
5 } keys[TBTHREAD_MAX_KEYS];
6
7 #define KEY_UNUSED(k) ((keys[k].seq&1) == 0)
8 #define KEY_ACQUIRE(k) (__sync_bool_compare_and_swap(&(keys[k].seq), keys[k].seq, keys[k].seq+1))
9 #define KEY_RELEASE(k) (__sync_bool_compare_and_swap(&(keys[k].seq), keys[k].seq, keys[k].seq+1))
10
11 int tbthread_key_create(tbthread_key_t *key, void (*destructor)(void *))
12 {
13 for(int i = 0; i < TBTHREAD_MAX_KEYS; ++i) {
14 if(KEY_UNUSED(i) && KEY_ACQUIRE(i)) {
15 *key = i;
16 keys[i].destructor = destructor;
17 return 0;
18 }
19 }
20 return -ENOMEM;
21 }
Each tbthread
struct will hold an array of the same size as the global array
of keys. Each element in this array will hold a data pointer and a sequence
number. Storing data will set the sequence number as well. Retrieving the data
will check whether the local and the global sequence number match before
proceeding.
1 void *tbthread_getspecific(tbthread_key_t key)
2 {
3 if(key >= TBTHREAD_MAX_KEYS || KEY_UNUSED(key))
4 return 0;
5
6 tbthread_t self = tbthread_self();
7 if(self->tls[key].seq == keys[key].seq)
8 return self->tls[key].data;
9 return 0;
10 }
See the full patch here.
Atomics
KEY_ACQUIRE
and KEY_RELEASE
use gcc atomic builtins described here.
Table of Contents
- Creating threads
- Thread-local storage
- Mutexes
- Joining and initialization
- Cancellation
- Scheduling
- RW Locks
- Condition Variables
- Conclusion
Intro
This is a first of hopefully many posts documenting my attempts to understand how to implement a pthread-style threading system on Linux. To this end, I started implementing a small, non-portable and a rather useless library that I called Thread Bites. Thread Bites is useless mainly because it lacks support for compiler-generated thread-local storage. It may not sound like a grave issue, but it makes Thread Bites incompatible with most of glibc. Therefore, I had to provide my own functionality for invoking syscalls, managing the heap and even printing stuff to stdout. It's all pretty simple and understandable so far, so I hope I will be able to implement most of the pthreads' functionality in a couple of small bites. You can get the source from GitHub.
Syscalls
For a program to be even remotely useful, it needs to communicate with the user
in one way or another. A standard library for the programming language, like
glibc, typically provides all the necessary components for such communication.
For reasons mentioned in the introduction, using glibc in this case is not
advisable. Hence, I need to find a way to call the operating system directly
without using the syscall
function, because it is also a part glibc and sets
errno
, which is supposed to reside in the TLS that I did not set up.
All that is needed to implement an equivalent function is shuffling around the values of the registers to translate between the calling conventions of C and the Linux kernel, as described here and here on page 20. As it turns out, it's not that hard to do it using inline assembly in C, and there's an excellent tutorial here.
1 #define SYSCALL(name, a1, a2, a3, a4, a5, a6) \
2 ({ \
3 long result; \
4 long __a1 = (long)(a1); \
5 long __a2 = (long)(a2); \
6 long __a3 = (long)(a3); \
7 long __a4 = (long)(a4); \
8 long __a5 = (long)(a5); \
9 long __a6 = (long)(a6); \
10 register long _a1 asm("rdi") = __a1; \
11 register long _a2 asm("rsi") = __a2; \
12 register long _a3 asm("rdx") = __a3; \
13 register long _a4 asm("r10") = __a4; \
14 register long _a5 asm("r8") = __a5; \
15 register long _a6 asm("r9") = __a6; \
16 asm volatile ( \
17 "syscall\n\t" \
18 : "=a" (result) \
19 : "0" (name), "r" (_a1), "r" (_a2), "r" (_a3), \
20 "r" (_a4), "r" (_a5), "r" (_a6) \
21 : "memory", "cc", "r11", "cx"); \
22 (long) result; })
23
24 #define SYSCALL1(name, a1) \
25 SYSCALL(name, a1, 0, 0, 0, 0, 0)
26 #define SYSCALL2(name, a1, a2) \
27 SYSCALL(name, a1, a2, 0, 0, 0, 0)
28 #define SYSCALL3(name, a1, a2, a3) \
29 SYSCALL(name, a1, a2, a3, 0, 0, 0)
30 #define SYSCALL4(name, a1, a2, a3, a4) \
31 SYSCALL(name, a1, a2, a3, a4, 0, 0)
32 #define SYSCALL5(name, a1, a2, a3, a4, a5) \
33 SYSCALL(name, a1, a2, a3, a4, a5, 0)
34 #define SYSCALL6(name, a1, a2, a3, a4, a5, a6) \
35 SYSCALL(name, a1, a2, a3, a4, a5, a6)
All this is, of course, horribly inefficient because it messes up with all the
registers even in the situations it does not have to. The intermediate variables
for the parameters (__a1
and friends) are used to prevent embedded function
calls from messing with the registers that have already been set; think of
strlen
in SYSCALL3(__NR_write, 1, blah, strlen(blah))
.
See the full code on GitHub.
Printing to stdout
It seems that glibc is using errno
and other thread local stuff to calculate
buffer sizes in one of the subroutines called by printf
. It causes printf
to
segfault when called concurrently from different threads because of the same TLS
story. Thread Bites provides a convenience function similar to printf
and
supporting %s
%x
%u
%o
%d
in l
and ll
flavors:
1 void tbprint(const char *format, ...);
Malloc
Glibc's default malloc implementation, a ptmalloc2 derivative, uses
thread-specific arenas to limit lock congestion caused by calling malloc
concurrently from multiple threads. It looks like it depends on TLS, so using it
is not the best idea. Thread Bites comes with its own evil version of
malloc
. It's pathological because it's extremely prone to fragmentation, it
never shrinks the heap, and it's essentially one big critical section. It has
some undeniable advantages too: it works, it's incredibly simple, and it fits in
around 50 lines of code. Look here to find it.
The only thing worth noting in this section is that the sys_brk
syscall does
not behave like glibc's brk
or sbrk
functions. On error, it would return
the previous location of the heap boundary, so the code calls it with an
obviously wrong parameter (0
) to figure out what the initial heap boundary is.
Clone
Clone
is an interesting beast. From the standpoint of this section, the
relevant thing about it is that it behaves mostly like fork
, except that the
child is launched on a new stack. For this reason and to generate proper Call
Frame Information (see here and here), it needs to be implemented in
assembly. The implementation puts the user function pointer and its argument on
the child's stack so that they can be later popped and called by the child.
Then, it puts the syscall parameters in the appropriate registers and makes the
syscall. Again, all the code is on GitHub.
Creating a thread
After dealing with all this boilerplate, the actual thread creation is
relatively straightforward. First, the new thread needs a stack it can run on.
The stack could be allocated on the heap, but it's probably safer to get a
separate memory region for it. It can be done using the mmap
syscall. A proper
threading library should check the system limits to figure out the size of the
stack, but Thread Bites is not proper, so it will use 8MiB as a default. All
the function parameters in the snippet below match glibc's mmap
call.
1 void *stack = tbmmap(NULL, attr->stack_size, PROT_READ | PROT_WRITE,
2 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
3 long status = (long)stack;
4 if(status < 0)
5 return status;
It's a good idea to mark a page at the beginning of the stack as non-readable and non-writable to protect any valid adjecent memory if there is any. Stepping over the guard page will get us a familiar segmentation fault. Why at the beginning and not at the end? Because the stack grows downwards, i.e. from a high address towards a lower one. Go here for more details.
1 status = SYSCALL3(__NR_mprotect, stack, EXEC_PAGESIZE, PROT_NONE);
2 if(status < 0) {
3 tbmunmap(stack, attr->stack_size);
4 return status;
5 }
We finally have all that is needed to spawn a new kernel task sharing with the
main task everything that you would expect threads running within the same
process to share. The man page of clone
discusses all of the parameters
in details.
1 int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SYSVSEM | CLONE_SIGHAND;
2 flags |= CLONE_THREAD;
3 int tid = tbclone(start_thread, *thread, flags, stack+attr->stack_size);
4 if(tid < 0) {
5 tbmunmap(stack, attr->stack_size);
6 free(*thread);
7 return tid;
8 }
Note that the user function is not called directly and start_thread
is called
instead. It does some internal book-keeping and eventually calls the
user-supplied function:
1 tbthread_t th = (tbthread_t)arg;
2 uint32_t stack_size = th->stack_size;
3 void *stack = th->stack;
4 th->fn(th->arg);
5 free(th);
When the function returns, the stack memory needs to be freed (munmap
ed) and
the thread terminated. The clone
function took some provisions for calling
sys_exit
with the status returned by the wrapper. However, here we remove the
stack from underneath our feet, and we cannot risk any C function write over it.
Therefore, we need to call sys_munmap
and sys_exit
in one piece of assembly.
1 register long a1 asm("rdi") = (long)stack;
2 register long a2 asm("rsi") = stack_size;
3 asm volatile(
4 "syscall\n\t"
5 "movq $60, %%rax\n\t" // 60 = __NR_exit
6 "movq $0, %%rdi\n\t"
7 "syscall"
8 :
9 : "a" (__NR_munmap), "r" (a1), "r" (a2)
10 : "memory", "cc", "r11", "cx");
11 return 0;
Lessons learned
- The logic involved in running threads does not seem to be horrendously complicated, at least not yet.
- Glibc is quite convoluted and figuring out what is going on and where is a challenge. It is probably justified by the range of platforms it supports, but the documentation could be better.
- Threading is tightly coupled with glibc in the areas that I have not
initially suspected: the dynamic linker to support TLS, globals in TLS
(
errno
), locales, finalizers, and so on. - POSIX threads have an interesting relationship with the Linux task model. This document describes the initial incompatibilities that have ultimately been ironed out. It is a very interesting read if you want to understand how the userspace and the kernelspace interact to implement a threading system.
- Ultimately all this is not necessarily a black magic. At least not so far.