Content from 2016-03

Table of Contents

  1. Compiling and start-up code
  2. Hardware Abstraction Layer and UART
  3. Debugging, display, heap and fonts
  4. Timers and ADC
  5. DAC, Sound and Nokia Tunes
  6. Random Number Generator, Rendering Engine and the Game
  7. Operating System

Intro

I have recently been playing with microcontrollers a lot. Among other things, I have worked through some of the labs from this course on EdX. The material does not use much high-level code, so it gives a good overview of how the software interacts with the hardware. There are some "black box" components in there, though. For me, the best way to learn something well has always been building things from "first principles," and I find black boxes frustrating. This post describes the first step on my way to making an Alien Invaders game from "scratch."

Compiling for Tiva

First, we need to be able to compile C code for Tiva. To this end, we will use GCC as a cross-compiler, so make sure you have the arm-none-eabi-gcc command available on your system. We will use the following flags to build Tiva-compatible binaries:

  • -mcpu=cortex-m4 - produce the code for ARM Cortex-M4 CPU
  • -mfpu=fpv4-sp-d16 - FPv4 single-precision floating point with the register bank seen by the software as 16 double-words
  • -mfloat-abi=hard - generate floating point instructions and use FPU-specific calling conventions
  • -mthumb - use the Thumb instruction set
  • -std=c11 - use the C11 standard
  • -O0 - don't perform any optimizations
  • -Wall and -pedantic - warn about all the potential issues with the code
  • -ffunction-sections and -fdata-sections - place every function and data item in a separate section in the resulting object file; this allows unused code and data to be removed at link time

Object files

To generate a proper binary image, we need to have some basic understanding of object files produced by the compiler. In short, they consist of sections containing various pieces of compiled code and the corresponding data. These sections may be loadable, meaning that the contents of the section should be read from the object file and stored in memory. They may also be just allocatable, meaning that there is nothing to be loaded, but a chunk of memory needs to be put aside for them nonetheless. There are multiple sections in a typical ELF object file, but we need to know only four of them:

  • .text - contains the program code
  • .rodata - contains the constants (read-only data)
  • .data - contains the read-write data
  • .bss - contains statically allocated variables (initialized to zero)

Let's consider the following code:

 1 #include <stdio.h>
 2 
 3 int a = 12;
 4 int b;
 5 const char *c = "The quick brown fox jumps over the lazy dog.";
 6 const char * const d = "The quick brown fox jumps over the lazy dog.";
 7 int e[20];
 8 const int f[4] = {7, 4, 2, 1};
 9 
10 int main(int argc, char **argv)
11 {
12   printf("Hello world!\n");
13   return 0;
14 }

After compiling it, we end up with an ELF file containing the following sections (most have been omitted for clarity). Note where the variables ended up: a and c are in .data (the string literals they point to are in .rodata), d and f are in .rodata, and the uninitialized b and e are in .bss:

]==> objdump -h test

test:     file format elf64-x86-64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
 13 .text         00000192  00000000004003f0  00000000004003f0  000003f0  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
 15 .rodata       0000006d  0000000000400590  0000000000400590  00000590  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 24 .data         00000020  0000000000600948  0000000000600948  00000948  2**3
                  CONTENTS, ALLOC, LOAD, DATA
 25 .bss          00000090  0000000000600980  0000000000600980  00000968  2**5
                  ALLOC

As you can see, every section has two addresses:

  • VMA (virtual memory address) - This is the location of the section the code expects when it runs.
  • LMA (load memory address) - This is the location where the section is stored by the loader.

These two addresses are in most cases the same, except in the situation that we care about here: an embedded system. In our binary image, we need to put the .data section in ROM because it contains initialized variables whose values would otherwise be lost on reset. The section's LMA, therefore, must point to a location in ROM. However, this data is not constant, so its final position at run time needs to be in RAM. Therefore, the VMA must point to a location in RAM. We will see an example later.

Tiva's memory layout

Tiva has 256K of ROM (range: 0x00000000-0x0003ffff) and 32K of RAM (range: 0x20000000-0x20007fff). See Table 2-4 on page 90 of the data sheet for details. The NVIC (interrupt vector) table needs to be located at address 0x00000000 (section 2.5 of the data sheet). We will create this table in C, put it in a separate object file section, and fill it with weak aliases of the default handler function. This approach enables the user to redefine the interrupt handlers without having to edit the start-up code. The linker will resolve the handler addresses to strong symbols if any are present.

So, we define a dummy interrupt handler that loops indefinitely:

1 void __int_handler(void)
2 {
3   while(1);
4 }

and then create a bunch of weak aliases to this function:

1 #define DEFINE_HANDLER(NAME) void NAME ## _handler() __attribute__ ((weak, alias ("__int_handler")))
2 
3 DEFINE_HANDLER(nmi);
4 DEFINE_HANDLER(hard_fault);
5 DEFINE_HANDLER(mman);
6 DEFINE_HANDLER(bus_fault);
7 DEFINE_HANDLER(usage_fault);
8 ...

Finally, we construct the nvic_table, place it in the .nvic section in the resulting object file and fill it with handler addresses:

1 #define HANDLER(NAME) NAME ## _handler
2 void (*nvic_table[])(void) __attribute__ ((section (".nvic"))) = {
3   HANDLER(reset),
4   HANDLER(nmi),
5   HANDLER(hard_fault),
6   HANDLER(mman),
7 ...

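Thanks to the weak aliases, overriding a handler is just a matter of defining a strong symbol with the same name in the user code; a minimal sketch, assuming the start-up code declares the SysTick entry with DEFINE_HANDLER(systick):

// Because the default handler is only a weak alias, this strong definition
// wins at link time and ends up in the NVIC table instead of __int_handler.
volatile unsigned long systick_counter = 0;

void systick_handler(void)
{
  ++systick_counter;
}
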
Linker scripts

We will use linker scripts to set the VMAs and the LMAs to the values we like and to create some symbols whose addresses we can play with in the C code. We first need to define the memory layout:

1 MEMORY
2 {
3   FLASH (rx)  : ORIGIN = 0x00000000, LENGTH = 0x00040000
4   RAM   (rwx) : ORIGIN = 0x20000000, LENGTH = 0x00008000
5 }

We then need to tell the linker where to put the sections in the final executable:

 1 SECTIONS
 2 {
 3   .text :
 4   {
 5     LONG(0x20007fff)
 6     KEEP(*(.nvic))
 7     *(.text*)
 8     *(.rodata*)
 9      __text_end_vma = .;
10   } > FLASH
11 
12   .data :
13   {
14     __data_start_vma = .;
15     *(.data*)
16     *(vtable)
17     __data_end_vma = .;
18   } > RAM AT > FLASH
19 
20   .bss :
21   {
22     __bss_start_vma = .;
23     *(.bss*)
24     *(COMMON)
25     __bss_end_vma = .;
26   } > RAM
27 }

  1. We start with the .text section and begin it with 0x20007fff. It is the initial value of the stack pointer (see the data sheet). Since the stack grows towards lower addresses, we initialize the top of the stack to the last byte of available RAM.
  2. We then place the .nvic section. The KEEP function forces the linker to keep this section even when the link-time optimizations are enabled and the section appears to be unused. The asterisk in *(.nvic) is a wildcard for an input object file name. Whatever is in the brackets is a wildcard for a section name.
  3. We put all the code and read-only data from all of the input files in this section as well.
  4. We define a new symbol, __text_end_vma, and set it to the current VMA (the dot means the current VMA).
  5. We put this section in FLASH: > FLASH at line 10.
  6. We combine the .data* sections from all input files into one section and put it behind the .text section in FLASH. We set the VMAs to be in RAM: > RAM AT > FLASH.
  7. Apparently TivaWare changes the value of the VTABLE register and needs to have the NVIC table in RAM, so we oblige: *(vtable).
  8. We put .bss in RAM after .data.
  9. We use asterisks in section names (i.e. .bss*) because -ffunction-sections and -fdata-sections parameters cause the compiler to generate a separate section for each function and data item.

Edit 02.04.2016: The initial stack pointer needs to be aligned to 8 bytes for passing 64-bit variadic parameters to work. Therefore, the value of the first four bytes of the text section should be: LONG(0x20007ff8). See this post for details.

See the binutils documentation for more details.

Start-up code

On system start-up, we need to copy the contents of the .data section from FLASH to RAM ourselves before the program proper can run. We do it by defining a reset handler:

 1 extern unsigned long __text_end_vma;
 2 extern unsigned long __data_start_vma;
 3 extern unsigned long __data_end_vma;
 4 extern unsigned long __bss_start_vma;
 5 extern unsigned long __bss_end_vma;
 6 
 7 extern void main();
 8 
 9 void __rst_handler()
10 {
11   unsigned long *src = &__text_end_vma;
12   unsigned long *dst = &__data_start_vma;
13 
14   while(dst < &__data_end_vma) *dst++ = *src++;
15   dst = &__bss_start_vma;
16   while(dst < &__bss_end_vma) *dst++ = 0;
17 
18   main();
19 }
20 
21 void reset_handler() __attribute__ ((weak, alias ("__rst_handler")));

We first declare the linker-script symbols as externals; it is their addresses that carry the information we need, which is why the code takes &__text_end_vma and friends. The reset handler then copies the .data section from FLASH to RAM, zeroes the .bss section, and calls main.

A test

Let's put everything together. I wrote a short program that blinks an LED using the SysTick interrupt. The color of the LED depends on the switch pressed. The files are here:

Compile and link:

]==> arm-none-eabi-gcc -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mthumb -std=c11 -O0 -Wall -pedantic -ffunction-sections -fdata-sections -c main.c -g
]==> arm-none-eabi-gcc -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mthumb -std=c11 -O0 -Wall -pedantic -ffunction-sections -fdata-sections -c TM4C_startup.c -g
]==> arm-none-eabi-ld -T TM4C.ld TM4C_startup.o main.o -o main --gc-sections

Let's see what we have in the resulting binary:

]==> arm-none-eabi-objdump -h main

main:     file format elf32-littlearm

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         00000484  00000000  00000000  00010000  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .data         00000004  20000000  00000484  00020000  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  2 .bss          00000004  20000004  00000488  00020004  2**2
                  ALLOC

The .text section starts at 0x00000000 for both VMA and LMA. The .data section starts at 0x00000484 LMA (in FLASH), but the code expects it to start at 0x20000000 VMA (in RAM). The symbol addresses match the expectations as well:

]==> arm-none-eabi-objdump -t main | grep vma
20000004 g       .bss   00000000 __bss_start_vma
00000484 g       .text  00000000 __text_end_vma
20000008 g       .bss   00000000 __bss_end_vma
20000000 g       .data  00000000 __data_start_vma
20000004 g       .data  00000000 __data_end_vma

We now need to create a raw binary file that we can flash to the board. The arm-none-eabi-objcopy utility can take the relevant sections and put them in an output file aligned according to their LMAs.

]==> arm-none-eabi-objcopy -O binary main main.bin
]==> stat --format=%s main.bin
1160

The total size of the raw binary matches the sum of the sizes of the .text and .data sections (0x488 == 1160). Let's flash it and see if it works!

]==> lm4flash main.bin
Found ICDI device with serial: xxxxxxxx
ICDI version: 9270


Get the full code at GitHub.

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion

Conclusion

I have implemented all of the interesting functions listed here and, thus, reached my goal. There were quite a few surprises. I had expected some things to be more complicated than they are. Conversely, some things that had seemed simple turned out to be quite complex.

  1. I had initially hoped that I would be able to re-use much of glibc and concentrate only on the thread-specific functionality. I was surprised to discover how much of the glibc code refers to thread-local storage.
  2. I had expected the interaction between join and detach to be much simpler to handle. Having to implement descriptor caching came as a surprise.
  3. I had never heard of pthread_once before.
  4. I had not used much of the real-time functionality before, so figuring out the scheduling part was very entertaining. I especially enjoyed implementing the PRIO_INHERIT mutex.

I may revisit this project in the future because there are still some things that I would like to learn more about.

  1. If I have the time to learn DWARF, I would like to provide a proper .eh_frame for the signal trampoline. It would allow me to implement cancellation using stack unwinding the way glibc does it.
  2. I may look into the inter-process synchronization to learn about the robust futexes.
  3. The Intel article on lock elision seemed interesting, and I'd like to play with this stuff as well.
  4. I may have a look at the compiler-generated TLS.

The End

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion

Condition Variables

Condition variables are a mechanism used for signaling that a certain predicate has become true. The POSIX mechanism for handling them boils down to three functions: cond_wait, cond_signal and cond_broadcast. The first one causes the thread that calls it to wait. The second wakes a thread up so that it can verify whether the condition is true. The third wakes all the waiters up.

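Before looking at the implementation, here is a minimal usage sketch. It assumes the interface mirrors pthreads, i.e. that tbthread_cond_wait takes the associated mutex as its second argument and that initializer macros exist; the ready flag is just an example predicate.

static tbthread_mutex_t lock = TBTHREAD_MUTEX_INIT;  // assumed initializer
static tbthread_cond_t  cond = TBTHREAD_COND_INIT;   // assumed initializer
static int ready = 0;

void *waiter(void *arg)
{
  tbthread_mutex_lock(&lock);
  while(!ready)                        // always re-check the predicate
    tbthread_cond_wait(&cond, &lock);  // releases the mutex and sleeps
  tbthread_mutex_unlock(&lock);
  return 0;
}

void *notifier(void *arg)
{
  tbthread_mutex_lock(&lock);
  ready = 1;
  tbthread_mutex_unlock(&lock);
  tbthread_cond_signal(&cond);  // wake one waiter; _broadcast wakes them all
  return 0;
}
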
The waiter

 1 tb_futex_lock(&cond->lock);
 2 ...
 3 ++cond->waiters;
 4 int bseq = cond->broadcast_seq;
 5 int futex = cond->futex;
 6 tb_futex_unlock(&cond->lock);
 7 
 8 while(1) {
 9   st = SYSCALL3(__NR_futex, &cond->futex, FUTEX_WAIT, futex);
10   if(st == -EINTR)
11     continue;
12 
13   tb_futex_lock(&cond->lock);
14   if(cond->signal_num) {
15     --cond->signal_num;
16     goto exit;
17   }
18 
19   if(bseq != cond->broadcast_seq)
20     goto exit;
21   tb_futex_unlock(&cond->lock);
22 }

The algorithm is as follows:

  1. We lock the internal lock.
  2. We remember the value of the broadcast sequence and the futex.
  3. In a loop, we wait for the futex.
  4. We go back to sleep if the FUTEX_WAIT syscall was interrupted by a signal.
  5. We consume a signal, if we can, and exit.
  6. If there was a broadcast, we exit too.
  7. Otherwise, we go back to sleep.

Signal

We wake one of the threads up. We first bump the value of the futex so that a FUTEX_WAKE issued before the waiter manages to call FUTEX_WAIT does not result in a deadlock. Then we bump the number of pending signals and wake the futex.

 1 int tbthread_cond_signal(tbthread_cond_t *cond)
 2 {
 3   tb_futex_lock(&cond->lock);
 4   if(cond->waiters == cond->signal_num)
 5     goto exit;
 6   ++cond->futex;
 7   ++cond->signal_num;
 8   SYSCALL3(__NR_futex, &cond->futex, FUTEX_WAKE, 1);
 9 exit:
10   tb_futex_unlock(&cond->lock);
11   return 0;
12 }

Broadcast

We wake all the waiters. The algorithm is essentially the same as for signal, except that we bump the broadcast sequence number and wake all the threads instead of just one.

 1 int tbthread_cond_broadcast(tbthread_cond_t *cond)
 2 {
 3   tb_futex_lock(&cond->lock);
 4   if(!cond->waiters)
 5     goto exit;
 6   ++cond->futex;
 7   ++cond->broadcast_seq;
 8   SYSCALL3(__NR_futex, &cond->futex, FUTEX_WAKE, INT_MAX);
 9 exit:
10   tb_futex_unlock(&cond->lock);
11   return 0;
12 }

See the full patch at GitHub.

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion

RW Locks

A read-write lock protects a critical section by allowing multiple readers when there are no writers. We won't bother implementing lock attributes handling because we don't support process-shared locks (irrelevant in our case) and we don't let the user prefer readers (a non-POSIX extension). The implementation remembers the ID of the current writer if any. It also counts readers as well as the queued writers. We use two futexes, one to block the readers and one to block the writers.

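The state needed for this looks roughly like the structure below; it is a sketch inferred from how the fields are used in the functions that follow, not the library's actual definition.

// A sketch of the lock state, inferred from the functions below; the actual
// definition in the library may differ.
typedef struct
{
  int        lock;           // internal futex protecting the fields below
  tbthread_t writer;         // current writer, or 0 if there is none
  int        readers;        // number of readers holding the lock
  int        writers_queued; // number of writers waiting for the lock
  int        wr_futex;       // writers sleep on this futex
  int        rd_futex;       // readers sleep on this futex
} tbthread_rwlock_t;
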
A writer first bumps the number of queued writers. If there is no other writer and no readers, it marks itself as the owner of the lock and decrements the number of queued writers. It goes to sleep on the writer futex otherwise.

 1 int tbthread_rwlock_wrlock(tbthread_rwlock_t *rwlock)
 2 {
 3   int queued = 0;
 4   while(1) {
 5     tb_futex_lock(&rwlock->lock);
 6 
 7     if(!queued) {
 8       queued = 1;
 9       ++rwlock->writers_queued;
10     }
11 
12     if(!rwlock->writer && !rwlock->readers) {
13       rwlock->writer = tbthread_self();
14       --rwlock->writers_queued;
15       tb_futex_unlock(&rwlock->lock);
16       return 0;
17     }
18     int sleep_status = rwlock->wr_futex;
19 
20     tb_futex_unlock(&rwlock->lock);
21 
22     SYSCALL3(__NR_futex, &rwlock->wr_futex, FUTEX_WAIT, sleep_status);
23   }
24 }

A reader acquires the lock if there are no writers at all. It goes to sleep on the reader futex otherwise.

 1 int tbthread_rwlock_rdlock(tbthread_rwlock_t *rwlock)
 2 {
 3   while(1) {
 4     tb_futex_lock(&rwlock->lock);
 5 
 6     if(!rwlock->writer && !rwlock->writers_queued) {
 7       ++rwlock->readers;
 8       tb_futex_unlock(&rwlock->lock);
 9       return 0;
10     }
11     int sleep_status = rwlock->rd_futex;
12 
13     tb_futex_unlock(&rwlock->lock);
14 
15     SYSCALL3(__NR_futex, &rwlock->rd_futex, FUTEX_WAIT, sleep_status);
16   }
17 }

When unlocking, we use the writer field to determine whether we were a reader or a writer. If we were a writer, we had exclusive ownership of the lock, so we need to either wake another writer or all of the readers, depending on the state of the counters. If we were a reader, we held a non-exclusive lock, so we only need to wake a writer when we are the last reader and there is a writer queued. We bump the value of the futex because we want to handle the case when FUTEX_WAKE is called before the other thread managed to call FUTEX_WAIT.

 1 int tbthread_rwlock_unlock(tbthread_rwlock_t *rwlock)
 2 {
 3   tb_futex_lock(&rwlock->lock);
 4   if(rwlock->writer) {
 5     rwlock->writer = 0;
 6     if(rwlock->writers_queued) {
 7       __sync_fetch_and_add(&rwlock->wr_futex, 1);
 8       SYSCALL3(__NR_futex, &rwlock->wr_futex, FUTEX_WAKE, 1);
 9     } else {
10       __sync_fetch_and_add(&rwlock->rd_futex, 1);
11       SYSCALL3(__NR_futex, &rwlock->rd_futex, FUTEX_WAKE, INT_MAX);
12     }
13     goto exit;
14   }
15 
16   --rwlock->readers;
17   if(!rwlock->readers && rwlock->writers_queued) {
18     __sync_fetch_and_add(&rwlock->wr_futex, 1);
19     SYSCALL3(__NR_futex, &rwlock->wr_futex, FUTEX_WAKE, 1);
20   }
21 
22 exit:
23   tb_futex_unlock(&rwlock->lock);
24   return 0;
25 }

See the full patch at GitHub.

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion

Thread Scheduling

The scheduler decides which thread to run next based on two parameters: the scheduling policy and the priority.

Conceptually, the scheduler maintains a list of runnable threads for each possible sched_priority value. In order to determine which thread runs next, the scheduler looks for the nonempty list with the highest static priority and selects the thread at the head of this list.

A thread's scheduling policy determines where it will be inserted into the list of threads with equal static priority and how it will move inside this list. (the sched(7) man page)

The supported policies are:

  • SCHED_NORMAL - It's the default Linux policy for threads not requiring any real-time machinery. All the threads have a priority of 0, and the decision of which thread gets to run next is based on the nice mechanism.
  • SCHED_FIFO - The threads have a priority from 1 (low) to 99 (high). When a SCHED_FIFO thread becomes runnable, it will preempt any running thread of lower priority. There is no time slicing, so a high-priority thread runs for as long as it needs to.
  • SCHED_RR - RR stands for round-robin. It's the same as SCHED_FIFO except that each thread is only allowed to run for a limited time quantum.

See the sched(7) man page for more details.

Setting the scheduling parameters of a kernel task boils down to invoking the sys_sched_setscheduler syscall, as shown below.

 1 struct tb_sched_param
 2 {
 3   int sched_priority;
 4 };
 5 
 6 int tb_set_sched(tbthread_t thread, int policy, int priority)
 7 {
 8   struct tb_sched_param p; p.sched_priority = priority;
 9   int ret = SYSCALL3(__NR_sched_setscheduler, thread->tid, policy, &p);
10   if(!ret)
11     thread->sched_info = SCHED_INFO_PACK(policy, priority);
12   return ret;
13 }

There is a bit more fun to it, though. As you can see, the kernel can only schedule a task that already exists. Therefore, we need to have a way to set the thread's priority before this thread invokes the user function. The reason for this is that we may need to abort this thread immediately should the scheduler setting fail. We do it by having another futex that we wake when we know whether the thread can run or not:

1 if(th->start_status != TB_START_OK) {
2   SYSCALL3(__NR_futex, &th->start_status, FUTEX_WAIT, TB_START_WAIT);
3   if(th->start_status == TB_START_EXIT)
4     SYSCALL1(__NR_exit, 0);
5 }

This futex is initialized in the tbthread_create function depending on the thread attributes:

1 if(!attr->sched_inherit)
2   (*thread)->start_status = TB_START_WAIT;
3 else {
4   tbthread_t self = tbthread_self();
5   (*thread)->sched_info = self->sched_info;
6 }

And then set to either TB_START_OK or TB_START_EXIT after we spawn the thread but before tbthread_create exits:

 1 if(!attr->sched_inherit) {
 2   ret = tb_set_sched(*thread, attr->sched_policy, attr->sched_priority);
 3 
 4   if(ret) (*thread)->start_status = TB_START_EXIT;
 5   else (*thread)->start_status = TB_START_OK;
 6   SYSCALL3(__NR_futex, &(*thread)->start_status, FUTEX_WAKE, 1);
 7 
 8   if(ret) {
 9     wait_for_thread(*thread);
10     goto error;
11   }
12 }

See the patch at GitHub.
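
From the user's point of view, all of this hides behind the thread attributes. Here is a sketch of how it might be used; tbthread_attr_init, the direct field access, and the SCHED_FIFO constant are assumptions based on the snippets above rather than the library's documented interface:

extern void *thread_func(void *);  // hypothetical user function

void spawn_fifo_thread(void)
{
  tbthread_attr_t attr;
  tbthread_t      thread;

  tbthread_attr_init(&attr);         // assumed initializer
  attr.sched_inherit  = 0;           // don't inherit the creator's parameters
  attr.sched_policy   = SCHED_FIFO;
  attr.sched_priority = 10;

  // If setting the scheduler fails, the new thread is woken up with
  // TB_START_EXIT and exits before it ever calls thread_func.
  tbthread_create(&thread, &attr, thread_func, 0);
}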

Priority mutexes

Priority mutexes come in three varieties:

  • TBTHREAD_PRIO_NONE - Acquiring this type of mutex does not change the scheduling characteristics of the thread.
  • TBTHREAD_PRIO_INHERIT - When the mutex owner blocks another thread with a higher priority, the owner inherits the priority of the thread it blocks if it's higher than its own.
  • TBTHREAD_PRIO_PROTECT - Acquiring this kind of mutex raises the priority of the owner to the prioceiling value of the mutex.

Thread Bites implements this functionality by keeping lists of PRIO_INHERIT and PRIO_PROTECT mutexes. It then calculates the highest possible priority taking into account the priority of the mutexes and the priority set by the user.

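In other words, the effective priority is the maximum of the user-set priority, the ceilings of the owned PRIO_PROTECT mutexes, and the priorities inherited through PRIO_INHERIT mutexes. Below is a self-contained sketch of that calculation; the real code walks the library's internal mutex lists rather than plain arrays:

// Effective priority: the highest of the user-set priority, the PRIO_PROTECT
// ceilings, and the priorities inherited via PRIO_INHERIT mutexes. The result
// would then be applied with tb_set_sched.
int tb_effective_priority(int user_prio,
                          const int *protect_ceilings, int nprotect,
                          const int *inherited_prios,  int ninherit)
{
  int prio = user_prio;
  for(int i = 0; i < nprotect; ++i)
    if(protect_ceilings[i] > prio) prio = protect_ceilings[i];
  for(int i = 0; i < ninherit; ++i)
    if(inherited_prios[i] > prio) prio = inherited_prios[i];
  return prio;
}
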
The implementation of PRIO_PROTECT is relatively straightforward. Whenever a thread acquires this kind of mutex, it is added to the list, and the priority of the thread is recalculated:

 1 static int lock_prio_protect(tbthread_mutex_t *mutex)
 2 {
 3   lock_prio_none(mutex);
 4   tb_protect_mutex_sched(mutex);
 5   return 0;
 6 }
 7 
 8 static int unlock_prio_protect(tbthread_mutex_t *mutex)
 9 {
10   tbthread_t self = tbthread_self();
11   tb_protect_mutex_unsched(mutex);
12   unlock_prio_none(mutex);
13   return 0;
14 }

Implementing PRIO_INHERIT is a lot more tricky. We add the mutex to the appropriate list when a thread acquires it. Whenever a higher priority thread tries to lock the mutex, it bumps the priority of the blocker. But the priority recalculation is done only at this point. Implementing it like this covers all the main cases and is not horrendously hard. It allows for simple recursion: if the owner of a mutex gets blocked, the blocker inherits the priority that comes with the first mutex. It also has a couple of drawbacks:

  • It assumes that the kernel will always wake the highest priority thread. It makes sense and is most likely the case. However, I have not tested it.
  • If the owner of a PRIO_INHERIT mutex is already blocked on another mutex of the same kind and its priority gets bumped later, the last thread in the line won't be affected.

 1 static int lock_prio_inherit(tbthread_mutex_t *mutex)
 2 {
 3   tbthread_t self = tbthread_self();
 4 
 5   while(1) {
 6     int locked = 0;
 7     tb_futex_lock(&mutex->internal_futex);
 8     if(mutex->futex == 0) {
 9       locked = 1;
10       mutex->owner = self;
11       mutex->futex = 1;
12       tb_inherit_mutex_add(mutex);
13     }
14     else
15       tb_inherit_mutex_sched(mutex, self);
16     tb_futex_unlock(&mutex->internal_futex);
17     if(locked)
18       return 0;
19     SYSCALL3(__NR_futex, &mutex->futex, FUTEX_WAIT, 1);
20   }
21 }
22 
23 static int unlock_prio_inherit(tbthread_mutex_t *mutex)
24 {
25   tb_futex_lock(&mutex->internal_futex);
26   tb_inherit_mutex_unsched(mutex);
27   mutex->owner = 0;
28   mutex->futex = 0;
29   SYSCALL3(__NR_futex, &mutex->futex, FUTEX_WAKE, 1);
30   tb_futex_unlock(&mutex->internal_futex);
31   return 0;
32 }

It was by far the most challenging part so far. See the patch at GitHub.

Remaining functions

  • pthread_setconcurrency - It defines how many kernel tasks should be created to handle the user-level threads. It does not make sense in our case because we create a kernel task for every thread.
  • pthread_attr_setscope - It defines the set of threads against which the thread will compete for resources. There are two settings: PTHREAD_SCOPE_SYSTEM meaning all the threads in the entire system and PTHREAD_SCOPE_PROCESS meaning only the threads within the process. The man page says that Linux only supports PTHREAD_SCOPE_SYSTEM, but I am not sure whether it's still the case with all the cgroups stuff.

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion

Cancellation

Cancellation boils down to making one thread exit following a request from another thread. It seems that calling tbthread_exit at an appropriate point is enough to implement all of the behavior described in the man pages. We will go this way despite the fact that it is not the approach taken by glibc. Glibc unwinds the stack back to the point invoking the user-supplied thread function. This behavior allows it to simulate an exception if C++ code is using the library. We don't bother with C++ support for the moment and don't always care to supply valid DWARF information. Therefore, we will take the easier approach.

tbthread_setcancelstate and tbthread_setcanceltype are the two functions controlling the response of a thread to a cancellation request. The former enables or disables cancellation altogether, queuing the requests for later handling if necessary. The latter decides whether the thread should abort immediately or at a cancellation point. POSIX has a list of cancellation points, but we will not bother with them. Instead, we'll just use tbthread_testcancel and the two functions mentioned before for this purpose.

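In practice, a cancellable worker built on these calls sprinkles its loop with explicit cancellation points and disables cancellation around sections that must not be interrupted. A sketch; do_some_work is a made-up placeholder:

extern void do_some_work(void);  // hypothetical; must not be interrupted

void *worker(void *arg)
{
  int old;
  while(1) {
    tbthread_testcancel();           // explicit cancellation point

    tbthread_setcancelstate(TBTHREAD_CANCEL_DISABLE, &old);
    do_some_work();                  // safe from cancellation here
    tbthread_setcancelstate(old, 0); // a queued request takes effect at the
                                     // next cancellation point
  }
  return 0;
}
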
The thread must not get interrupted after it disables or defers cancellation. It would likely lead to deadlocks due to unreleased mutexes, memory leaks and such. The trick here is to update all the cancellation related flags atomically. So, we use one variable to handle the following flags:

  • TB_CANCEL_ENABLED: The cancellation is enabled; if a cancellation request has been queued, reaching a cancellation point will cause the thread to exit.
  • TB_CANCEL_DEFERRED: The cancellation is deferred (not asynchronous); SIGCANCEL will not be sent; see the paragraph on signal handling.
  • TB_CANCELING: A cancellation request has been queued; depending on other flags, SIGCANCEL may be sent.
  • TB_CANCELED: A cancellation request has been taken into account and the thread is in the process of exiting; this flag is used to handle the cases when a cancellation point has been reached before SIGCANCEL has been delivered by the kernel.

tbthread_testcancel looks as follows:

 1 void tbthread_testcancel()
 2 {
 3   tbthread_t thread = tbthread_self();
 4   uint8_t val, newval;
 5 
 6   while(1) {
 7     newval = val = thread->cancel_status;
 8     if(!(val & TB_CANCEL_ENABLED) || !(val & TB_CANCELING) ||
 9        (val & TB_CANCELED))
10       return;
11     newval |= TB_CANCELED;
12     if(__sync_bool_compare_and_swap(&thread->cancel_status, val, newval))
13       break;
14   }
15   tbthread_exit(TBTHREAD_CANCELED);
16 }

See the full patch at GitHub.

Clean-up handlers

The user may register a bunch of functions cleaning up the mess caused by an unexpected interruption. They are installed with tbthread_cleanup_push and called when the thread exits abnormally. The purpose of these functions is to unlock mutexes, free heap memory, and the like. tbthread_cleanup_pop removes the most recently installed handler and, optionally, executes it in the process.

 1 void tbthread_cleanup_push(void (*func)(void *), void *arg)
 2 {
 3   tbthread_t self = tbthread_self();
 4   struct cleanup_elem *e = malloc(sizeof(struct cleanup_elem));
 5   e->func = func;
 6   e->arg = arg;
 7   list_add_elem(&self->cleanup_handlers, e, 1);
 8 }
 9 
10 void tbthread_cleanup_pop(int execute)
11 {
12   tbthread_t self = tbthread_self();
13   list_t *node = self->cleanup_handlers.next;
14   if(!node)
15     return;
16   list_rm(node);
17   struct cleanup_elem *e = (struct cleanup_elem*)node->element;
18   if(execute)
19     (*e->func)(e->arg);
20   free(e);
21   free(node);
22 }

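A typical use pairs the push with a matching pop around a cancellable region, for instance to guarantee that a mutex gets released. A sketch, assuming the mutex API follows the pthread naming; unlock_wrapper and do_cancellable_work are made-up helpers:

extern void do_cancellable_work(void);  // hypothetical

static void unlock_wrapper(void *arg)
{
  tbthread_mutex_unlock((tbthread_mutex_t *)arg);
}

void process(tbthread_mutex_t *mutex)
{
  tbthread_mutex_lock(mutex);
  tbthread_cleanup_push(unlock_wrapper, mutex);

  do_cancellable_work();   // if the thread gets canceled in here,
                           // unlock_wrapper releases the mutex

  tbthread_cleanup_pop(1); // remove the handler and run it, releasing the
                           // mutex on the normal path as well
}
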
See the full patch at GitHub.

Signals and asynchronous cancellation

The asynchronous cancellation uses the first real-time signal, SIGRTMIN, that we call SIGCANCEL here for clarity.

Registering a signal handler is somewhat more tricky than just calling the appropriate syscall. It is so because, on x86_64, we need to provide a function that restores the stack after the signal handler returns. The function is called a signal trampoline and its purpose is to invoke sys_rt_sigreturn. The trampoline is registered with the kernel using a special sigaction flag:

1 void __restore_rt();
2 #define SA_RESTORER 0x04000000
3 
4 int tbsigaction(int signum, struct sigaction *act, struct sigaction *old)
5 {
6   act->sa_flags |= SA_RESTORER;
7   act->sa_restorer = __restore_rt;
8   return SYSCALL4(__NR_rt_sigaction, signum, act, old, sizeof(sigset_t));
9 }

The trampoline itself, called __restore_rt here, is defined in assembly as follows:

1 .text
2 
3   .global __restore_rt
4   .type   __restore_rt,@function
5   .align  16
6 
7 __restore_rt:
8   movq $__NR_rt_sigreturn, %rax
9   syscall

Looking at the corresponding glibc code, you can see that they add the eh_frame info here. The comments say that it is to aid gdb and handle the stack unwinding. I don't know enough DWARF to write one on my own, gdb does not seem to be utterly confused without it, and we won't do stack unwinding, so we just won't bother with it for the moment.

In the cancellation handler, we first check whether it's the right signal and whether it has been sent by a thread in the same thread group. We then need to check whether the thread is still in the asynchronous cancellation mode; it might have changed between the time the signal was sent and the time it is delivered. Finally, we call tbthread_testcancel to see if the thread should exit.

 1 void tb_cancel_handler(int sig, siginfo_t *si, void *ctx)
 2 {
 3   if(sig != SIGCANCEL || si->si_pid != tb_pid || si->si_code != SI_TKILL)
 4     return;
 5 
 6   tbthread_t self = tbthread_self();
 7   if(self->cancel_status & TB_CANCEL_DEFERRED)
 8     return;
 9 
10   tbthread_testcancel();
11 }

We invoke sys_tgkill to send the signal:

1 SYSCALL3(__NR_tgkill, tb_pid, thread->exit_futex, SIGCANCEL);

See the full patch at GitHub.

Cancellation of a "once" function

The implementation of tbthread_once gets quite a bit more interesting as well. If the thread invoking the initialization function gets canceled, another thread needs to pick the work up. We need to install a cleanup handler that changes the state of the once control back to TB_ONCE_NEW and wakes all the waiting threads so that they can restart from the beginning:

 1 static void once_cleanup(void *arg)
 2 {
 3   tbthread_once_t *once = (tbthread_once_t *)arg;
 4   *once = TB_ONCE_NEW;
 5   SYSCALL3(__NR_futex, once, FUTEX_WAKE, INT_MAX);
 6 }
 7 
 8 int tbthread_once(tbthread_once_t *once, void (*func)(void))
 9 {
10   if(!once || !func)
11     return -EINVAL;
12 
13   int cancel_state;
14 
15   while(1) {
16     if(*once == TB_ONCE_DONE)
17       return 0;
18 
19     //--------------------------------------------------------------------------
20     // The executor
21     //--------------------------------------------------------------------------
22     tbthread_setcancelstate(TBTHREAD_CANCEL_DISABLE, &cancel_state);
23     if(__sync_bool_compare_and_swap(once, TB_ONCE_NEW, TB_ONCE_IN_PROGRESS)) {
24       tbthread_cleanup_push(once_cleanup, once);
25       tbthread_setcancelstate(cancel_state, 0);
26 
27       (*func)();
28 
29       tbthread_setcancelstate(TBTHREAD_CANCEL_DISABLE, &cancel_state);
30       tbthread_cleanup_pop(0);
31 
32       *once = TB_ONCE_DONE;
33       SYSCALL3(__NR_futex, once, FUTEX_WAKE, INT_MAX);
34       tbthread_setcancelstate(cancel_state, 0);
35       return 0;
36     }
37 
38     tbthread_setcancelstate(cancel_state, 0);
39 
40     //--------------------------------------------------------------------------
41     // The waiters
42     //--------------------------------------------------------------------------
43     while(1) {
44       SYSCALL3(__NR_futex, once, FUTEX_WAIT, TB_ONCE_IN_PROGRESS);
45       if(*once != TB_ONCE_IN_PROGRESS)
46         break;
47     }
48   }
49 }

See the patch at GitHub.