Content tagged linux

Board bring-up

I started playing with the FRDM-K64F board recently. I want to use it as a base for a bunch of hobby projects. The start-up code is not that different from the one for Tiva, which I describe here - it's the same Cortex-M4 architecture after all. Two additional things need to be taken care of, though: flash security and the COP watchdog.

The K64F MCU restricts external access to a bunch of resources by default. It's a great feature if you want to ship a product, but it makes debugging impossible. The Flash Configuration Field (see section 29.3.1 of the datasheet) defines the default security and boot settings.

static const struct {
  uint8_t backdoor_key[8];  // backdoor comparison key
  uint8_t fprot[4];         // program flash protection (FPROT{0-3})
  uint8_t fsec;             // flash security (FSEC)
  uint8_t fopt;             // flash nonvolatile option (FOPT)
  uint8_t feprot;           // EEPROM protection (FEPROT)
  uint8_t fdprot;           // data flash protection (FDPROT)
} fcf  __attribute__ ((section (".fcf"))) = {
  {0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00},
  {0xff, 0xff, 0xff, 0xff}, // disable flash program protection
  0x02,                     // disable flash security
  0x01,                     // disable low-power boot (section 6.3.3)
  0x00,
  0x00
};

If flash protection (the fprot field) is not disabled, you won't be able to flash new code by copying it to the MBED partition and will have to run mass erase from OpenOCD every time:

interface cmsis-dap
set CHIPNAME k60
source [find target/kx.cfg]
kinetis mdm mass_erase

If the MCU is in the secured state (the fsec field), the debugger will have no access to memory.
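To make the meaning of the fsec byte a bit more concrete, here is a small host-side sketch. It assumes the FSEC bit layout as I read it in the K64 reference manual (SEC in bits 1:0, MEEN in bits 5:4); the struct and function names are made up for this illustration and are not part of the template code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical mirror of the Flash Configuration Field shown above.
struct fcf_layout {
  uint8_t backdoor_key[8];
  uint8_t fprot[4];
  uint8_t fsec;
  uint8_t fopt;
  uint8_t feprot;
  uint8_t fdprot;
};

// The field has to fill its 0x10-byte flash block exactly.
static_assert(sizeof(struct fcf_layout) == 0x10, "FCF must be 16 bytes");

// FSEC[1:0] is the SEC field; 0b10 is the only unsecured encoding.
bool fsec_is_unsecured(uint8_t fsec)
{
  return (fsec & 0x03) == 0x02;
}

// FSEC[5:4] is the MEEN field; only 0b10 disables mass erase.
bool fsec_mass_erase_enabled(uint8_t fsec)
{
  return ((fsec >> 4) & 0x03) != 0x02;
}
```

With the value 0x02 used above, both helpers return true: the part is unsecured and mass erase stays available.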

The structure listed above needs to end up in flash just after the interrupt vector table. I use the linker script to make sure it happens. I define the appropriate memory block:

FLASH-FCF  (rx)  : ORIGIN = 0x00000400, LENGTH = 0x00000010

And then put the .fcf section in it:

.fcf :

See here.

I also disable the COP (computer operates properly) watchdog which resets the MCU if it is not serviced often enough.

WDOG_UNLOCK = 0xc520;        // unlock magic #1
WDOG_UNLOCK = 0xd928;        // unlock magic #2
for(int i = 0; i < 2; ++i);  // delay a couple of cycles
WDOG_STCTRLH &= ~0x0001;     // disable the watchdog

You can get the template code at GitHub.

Table of Contents

  1. Compiling and start-up code
  2. Hardware Abstraction Layer and UART
  3. Debugging, display, heap and fonts
  4. Timers and ADC
  5. DAC, Sound and Nokia Tunes
  6. Random Number Generator, Rendering Engine and the Game
  7. Operating System


The game code up until this point abuses timers a lot. It has a timer to handle rendering and to refresh the display, and a timer to change notes of a tune. These tasks are not very time sensitive. A couple of milliseconds delay here or there is not going to be noticeable to users. The timer interrupts are more appropriate for things like maintaining a sound wave of the proper frequency. A slight delay here lowers the quality of the user experience significantly.

We could, of course, do even more complex time management to handle both the graphics and the sound in one loop, but that would be painful. It's much nicer to have a scheduling system that can alternate between multiple threads of execution. It is what I will describe in this post.

Thread Control Block and the Stack

Since there's usually only one CPU, the threads need to share it. The easiest way to achieve time sharing is to have a fixed time slice at the end of which the system will switch to another thread. The systick interrupt is perfect for this purpose. Not only is it invoked periodically, but it can also be requested manually by manipulating a register. This property will be useful in the implementation of sleeping and blocking.

But first things first: we need to have a structure that will describe a thread, a.k.a. a Thread Control Block (TCB):

struct IO_sys_thread {
  uint32_t             *stack_ptr;
  uint32_t              flags;
  void (*func)();
  struct IO_sys_thread *next;
  uint32_t              sleep;
  IO_sys_semaphore     *blocker;
  uint8_t               priority;
};

  • stack_ptr - points to the top of the thread's stack
  • flags - properties describing the thread; we will need just one to indicate whether the thread used the floating-point coprocessor
  • func - thread's entry point
  • next - pointer to the next thread in the queue (used for scheduling)
  • sleep - number of milliseconds the thread still needs to sleep
  • blocker - a pointer to a semaphore blocking the thread (if any)
  • priority - thread's priority

When invoking an interrupt handler, the CPU saves most of the running state of the current thread to the stack. Therefore, the task of the interrupt handler boils down to switching the stack pointers. The CPU will then pop all the registers back from the new stack. This behavior means that we need to do some initialization first:

void IO_sys_stack_init(IO_sys_thread *thread, void (*func)(void *), void *arg,
  void *stack, uint32_t stack_size)
{
  uint32_t sp1 = (uint32_t)stack;
  uint32_t sp2 = (uint32_t)stack;
  sp2 += stack_size;
  sp2 = (sp2 >> 3) << 3;          // the stack base needs to be 8-aligned
  if(sp1 % 4)
    sp1 = ((sp1 >> 2) << 2) + 4;  // make the end of the stack 4-aligned
  stack_size = (sp2 - sp1) / 4;   // new size in 32-bit words

  uint32_t *sp = (uint32_t *)sp1;
  sp[stack_size-1] = 0x01000000;          // PSR with thumb bit
  sp[stack_size-2] = (uint32_t)func;      // program counter
  sp[stack_size-3] = 0xffffffff;          // link register
  sp[stack_size-8] = (uint32_t)arg;       // r0 - the argument
  thread->stack_ptr = &sp[stack_size-16]; // top of the stack
}

The ARM ABI requires that the top of the stack is 8-aligned and we will typically push and pop 4-byte words. The first part of the setup function makes sure that the stack boundaries are right. The second part sets the initial values of the registers. Have a look here for details.

  • the PSR register needs to have the Thumb bit switched on
  • we put the startup function address to the program counter
  • we put 0xffffffff to the link register to avoid confusing stack traces in GDB
  • r0 gets the argument to the startup function
  • the exception entry and our handler together push 16 words worth of registers to the stack (eight by the hardware, r4-r11 by the handler), so the initial value of the stack pointer needs to reflect that
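The frame layout described above can be checked on a development machine. What follows is only a sketch: stack_frame_init is a hypothetical host-side variant of IO_sys_stack_init that takes the entry point as a plain 32-bit number (so that it also works on a 64-bit host) and, unlike the original, zeroes the registers it does not set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { FRAME_WORDS = 16 };  // 8 words stacked by hardware + r4-r11

// Returns the initial stack pointer, or NULL when the buffer is too small
// to hold one complete frame.
uint32_t *stack_frame_init(void *stack, size_t stack_size,
                           uint32_t entry, uint32_t arg)
{
  assert(stack != NULL);
  uintptr_t low  = ((uintptr_t)stack + 3) & ~(uintptr_t)3;          // round up to 4
  uintptr_t high = ((uintptr_t)stack + stack_size) & ~(uintptr_t)7; // round down to 8
  if(high < low || (high - low) / 4 < FRAME_WORDS)
    return NULL;

  uint32_t *sp = (uint32_t *)high - FRAME_WORDS;
  for(int i = 0; i < FRAME_WORDS; ++i)
    sp[i] = 0;            // r4-r11, r1-r3 and r12 start as zero
  sp[8]  = arg;           // r0 - the argument
  sp[13] = 0xffffffff;    // link register
  sp[14] = entry;         // program counter
  sp[15] = 0x01000000;    // PSR with the Thumb bit
  return sp;
}
```

Indices 0-7 of the returned pointer hold r4-r11, and indices 8-15 hold r0-r3, r12, lr, pc and psr, which is the order in which the exception return pops them.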

This function is typically called as:

1 IO_sys_stack_init(thread, thread_wrapper, thread, stack, stack_size);

Note that we do not call the user thread function directly. Rather, we have a wrapper function that gets the TCB as its argument. It is because we need to remove the thread from the scheduling queue if the user-specified function returns.

The context switcher

Let's now have a look at the code that does the actual context switching. Since it needs to operate directly on the stack, it needs to be written in assembly. It is not very complicated, though. What it does is:

  • pushing some registers to the stack
  • storing the current stack pointer in the stack_ptr variable of the current TCB
  • calling the scheduler to select the next thread
  • loading the stack pointer from the new thread's TCB
  • popping some registers from the new stack

#define OFF_STACK_PTR 0
#define OFF_FLAGS     4
#define FLAG_FPU      0x01

  .thumb
  .syntax unified

  .global IO_sys_current
  .global IO_sys_schedule

  .text

  .global systick_handler
  .type systick_handler STT_FUNC
  .thumb_func
  .align  2
systick_handler:
  cpsid i                     @ disable interrupts
  push  {r4-r11}              @ push r4-r11
  ldr   r0, =IO_sys_current   @ r0 = address of IO_sys_current
  ldr   r1, [r0]              @ r1 = IO_sys_current (the current TCB)

  ubfx  r2, lr, #4, #1        @ extract bit 4 of EXC_RETURN from lr
  cbnz  r2, .Lsave_stack      @ bit set: no FPU context to save
  vstmdb sp!, {s16-s31}       @ push FPU registers, this triggers pushing of
                              @ s0-s15
  ldr   r2, [r1, #OFF_FLAGS]  @ load the flags
  orr   r2, r2, #FLAG_FPU     @ set the FPU context flag
  str   r2, [r1, #OFF_FLAGS]  @ store the flags

.Lsave_stack:
  str   sp, [r1, #OFF_STACK_PTR] @ store the stack pointer in the current TCB

  push  {r0, lr}              @ calling C code, so store r0 and the link
                              @ register
  bl    IO_sys_schedule       @ call the scheduler
  pop   {r0, lr}              @ restore r0 and lr

  ldr   r1, [r0]              @ load the new TCB pointer to r1
  ldr   sp, [r1, #OFF_STACK_PTR] @ get the stack pointer of the new thread

  orr   lr, lr, #0x10         @ set bit 4 of EXC_RETURN: no FPU context
  ldr   r2, [r1, #OFF_FLAGS]  @ load the flags
  tst   r2, #FLAG_FPU         @ see if we have the FPU context
  beq   .Lrestore_regs        @ no FPU context
  vldmia sp!, {s16-s31}       @ pop the FPU registers
  bic   lr, lr, #0x10         @ clear bit 4 of EXC_RETURN: FPU context present

.Lrestore_regs:
  pop   {r4-r11}              @ restore r4-r11
  cpsie i                     @ enable interrupts
  bx    lr                    @ exit the interrupt, restore r0-r3, r12, lr, pc,
                              @ psr

The only complication here is that we sometimes need to store the floating point registers in addition to the regular ones. It is, however, only necessary if the thread used the FPU. Bit 4 of EXC_RETURN, the value the CPU places in the LR register on exception entry, indicates the status of the FPU. Go here and here for more details. If the value of the bit is 0, we need to save the high floating-point registers to the stack and set the FPU flag in the TCB.

Also, after selecting the new thread, we check if its stack contains the FPU registers by checking the FPU flag in its TCB. If it does, we pop these registers and change EXC_RETURN accordingly.

The Lazy Stacking is taken care of by simply pushing and popping the high registers - it counts as an FPU operation.
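The bit test done by the handler is easy to get backwards, so here is the same logic as a C sketch that runs on a host. The function names are hypothetical; the word counts follow the ARMv7-M frame layout (8 words for the basic frame, 18 more for s0-s15, FPSCR and a reserved word in the extended frame).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EXC_RETURN_BASIC_FRAME 0x10u  // bit 4: 1 = no FPU state, 0 = FPU state

// True when the interrupted thread had an active FPU context.
bool exc_return_has_fpu_frame(uint32_t exc_return)
{
  return (exc_return & EXC_RETURN_BASIC_FRAME) == 0;
}

// Number of 32-bit words on the thread's stack once the handler has
// finished saving the context.
unsigned saved_context_words(uint32_t exc_return)
{
  assert((exc_return & 0xffffff00u) == 0xffffff00u); // sanity check on the value
  unsigned words = 8 + 8;   // hardware: r0-r3, r12, lr, pc, psr; handler: r4-r11
  if(exc_return_has_fpu_frame(exc_return))
    words += 18 + 16;       // hardware: s0-s15, FPSCR, reserved; handler: s16-s31
  return words;
}
```

For example, 0xfffffffd (thread mode, process stack, no FPU) yields 16 words, while 0xffffffed (same, with FPU state) yields 50.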

Semaphores, sleeping and idling

We can now run threads and switch between them, but it would be useful to be able to put threads to sleep and make them wait for events.

Sleeping is easy. We just need to set the sleep field in the TCB of the current thread and make the scheduler ignore threads whenever their sleep field is not zero:

void IO_sys_sleep(uint32_t time)
{
  IO_sys_current->sleep = time;
  IO_sys_yield();
}

The ISR that handles the system time can loop over all threads and decrement this counter every millisecond.
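A minimal sketch of such a tick routine, assuming the threads sit on a circular list as the scheduler expects. The sleeper struct and the sleep_tick name are made up for this example and carry only the two fields the loop needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical cut-down TCB: just the fields needed for sleeping.
struct sleeper {
  uint32_t        sleep;  // milliseconds left to sleep
  struct sleeper *next;   // next thread on the circular list
};

// Meant to be called once per millisecond from the system time ISR.
void sleep_tick(struct sleeper *threads)
{
  if(!threads)
    return;
  struct sleeper *t = threads;
  do {
    assert(t->next != NULL);  // the list has to be circular
    if(t->sleep)
      --t->sleep;             // never wrap below zero
    t = t->next;
  }
  while(t != threads);
}
```

A thread whose counter reaches zero becomes eligible again the next time the scheduler runs.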

Waiting for a semaphore works in a similar way. We mark the current thread as blocked:

void IO_sys_wait(IO_sys_semaphore *sem)
{
  IO_disable_interrupts();
  --*sem;
  if(*sem < 0) {
    IO_sys_current->blocker = sem;
    IO_sys_yield();
  }
  IO_enable_interrupts();
}

The purpose of IO_sys_yield is to indicate that the current thread does not need to run anymore and force a context switch. The function resets the systick counter and forces the interrupt:

void IO_sys_yield()
{
  STCURRENT_REG = 0;          // clear the systick counter
  INTCTRL_REG   = 0x04000000; // trigger systick
}

Waking a thread waiting for a semaphore is somewhat more complex:

void IO_sys_signal(IO_sys_semaphore *sem)
{
  IO_disable_interrupts();
  ++*sem;
  if(*sem <= 0 && threads) {
    IO_sys_thread *t;
    for(t = threads; t->blocker != sem; t = t->next);
    t->blocker = 0;
  }
  IO_enable_interrupts();
}

If the value of the semaphore was negative, we find a thread that it was blocking and unblock it. It will make the scheduler consider this thread for running in the future.
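The counting rule is the part that is easy to misread: a negative value means that many threads are waiting. The sketch below models only the arithmetic, with hypothetical names, and leaves out the interrupt masking and the TCB bookkeeping done by the real functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t semaphore;  // stand-in for IO_sys_semaphore, assumed signed

// Returns true when the caller would have to block.
bool sem_wait_blocks(semaphore *sem)
{
  assert(sem != NULL);
  --*sem;
  return *sem < 0;
}

// Returns true when one blocked thread would have to be woken up.
bool sem_signal_wakes(semaphore *sem)
{
  assert(sem != NULL);
  ++*sem;
  return *sem <= 0;
}
```

Starting from 1, the first wait passes, the next two block (the value drops to -2), and the following two signals each release one waiter.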

None of the user-defined threads may be runnable at the time the scheduler makes its decision. All of them may be either sleeping or waiting for a semaphore. In that case, we need to keep the CPU occupied with something, i.e., we need a fake thread:

static void iddle_thread_func(void *arg)
{
  (void)arg;
  while(1) IO_wait_for_interrupt();
}


The system maintains a circular linked list of TCBs called threads. The job of the scheduler is to loop over this list and select the next thread to run. It places its selection in a global variable called IO_sys_current so that other functions may access it.

void IO_sys_schedule()
{
  if(!threads) {
    IO_sys_current = &iddle_thread;
    return;
  }

  IO_sys_thread *stop = IO_sys_current->next;

  if(IO_sys_current == &iddle_thread)
    stop = threads;

  IO_sys_thread *cur  = stop;
  IO_sys_thread *sel  = 0;
  int            prio = 266;  // anything above the largest uint8_t priority

  do {
    if(!cur->sleep && !cur->blocker && cur->priority < prio) {
      sel = cur;
      prio = sel->priority;
    }
    cur = cur->next;
  }
  while(cur != stop);

  if(!sel)
    sel = &iddle_thread;

  IO_sys_current = sel;
}

This scheduler is simple:

  • whenever there is nothing to run, select the idle thread
  • otherwise select the next highest priority thread that is neither sleeping nor blocked on a semaphore
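To see what the selection rule does in practice, the loop can be lifted into a host-side function. This is only a model: sched_thread and pick_next are hypothetical names, the idle thread is represented by a NULL result, and a lower number means a higher priority, as in the comparison used by the scheduler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical cut-down TCB with the fields the scheduler looks at.
struct sched_thread {
  struct sched_thread *next;
  uint32_t             sleep;
  const void          *blocker;
  uint8_t              priority;
};

// Returns the thread to run next, or NULL when only the idle thread can run.
struct sched_thread *pick_next(struct sched_thread *current)
{
  assert(current != NULL && current->next != NULL);
  struct sched_thread *stop = current->next;  // start right after the current one
  struct sched_thread *cur  = stop;
  struct sched_thread *sel  = NULL;
  int                  prio = 256;            // above any uint8_t priority

  do {
    if(!cur->sleep && !cur->blocker && cur->priority < prio) {
      sel  = cur;
      prio = cur->priority;
    }
    cur = cur->next;
  }
  while(cur != stop);
  return sel;
}
```

Because the comparison is strict, threads of equal priority take turns in list order, while a runnable thread with a smaller priority number is picked every time until it sleeps or blocks.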

Starting up the beast

So how do we get this whole business running? We need to invoke the scheduler that will preempt the current thread and select the next one to run. The problem is that we're running using the stack provided by the bootstrap code and don't have a TCB. Nothing prevents us from creating a dummy one, though. We can create it on the current stack (it's useful only once) and point it to the beginning of our real queue of TCBs:

IO_sys_thread dummy;
dummy.next = threads;
IO_sys_current = &dummy;

We then set the systick up:

STCTRL_REG     = 0;            // turn off
STCURRENT_REG  = 0;            // reset
SYSPRI3_REG   |= 0xE0000000;   // priority 7
STRELOAD_REG   = time_slice-1; // reload value

And force its interrupt:

IO_sys_yield();
IO_enable_interrupts();


Tests 11 and 12 run a dummy calculation for some time and then return. After this happens, the system can only run the idle thread. If we plug in the profiler code, we can observe the timings on a logic analyzer:

Test #11

Test 13 is more complicated than the two previous ones. Three threads are running in a loop, sleeping, and signaling semaphores. Two more threads are waiting for these semaphores, changing some local variables and signaling other semaphores. Finally, there is the writer thread that blocks on the last set of semaphores and displays the current state of the environment. The output from the logic analyzer shows that the writer thread needs around 3.3 time-slices to refresh the screen:

Test #13

Silly Invaders

How does all this make Silly Invaders better? The main advantage is that we don't need to calculate complex timings for multiple functions of the program. We create two threads, one for rendering the scene and another one for playing the music tune. Each thread cares about its own timing. Everything else takes care of itself with good enough time guarantees.

IO_sys_thread game_thread;
void game_thread_func()
{
  while(1) {
    SI_scene_render(&scenes[current_scene].scene, &display);
    IO_sys_sleep(1000/scenes[current_scene].scene.fps);
  }
}

IO_sys_thread sound_thread;
void sound_thread_func()
{
  IO_sound_player_run(&sound_player);
}

The threads are registered with the system in the main function:

IO_sys_thread_add(&game_thread,  game_thread_func,  2000, 255);
IO_sys_thread_add(&sound_thread, sound_thread_func, 1000, 255);
IO_sys_run(1000);

For the complete code see:


There is a great course on EdX called Realtime Bluetooth Networks explaining the topic in more detail. I highly recommend it.


Life would be so much easier if all the software was open source and came packaged with Debian. Much of the stuff I use is available this way, but there are still some programs that come as binary blobs with custom installers. I don't like that. I don't trust that. Every now and then, you hear the stories of software coming from reputable companies and misbehaving in dramatic ways. It would be great to contain the potential damage, but running virtual machines on a laptop is a pain. As it turns out, things may work pretty well with Docker, but as usual, the configuration is not so trivial.


The X Server

The solutions I found on the Internet either share the main X server with the container or use VNC. The first approach is problematic because apparently the X architecture has been designed by happy hippies and has no notion of security. If two applications share a screen, for instance, one can sniff the keystrokes typed into the other, and all this is by design. The VNC solution, on the other hand, is terribly slow: windows smudge when moved, and the Netflix playback is lagging.

Starting a Xephyr instance on the host and sharing its socket with the container seems to solve the sniffing problem. The programs running inside the container can't listen to the keystrokes typed outside of it anymore. Xephyr is also fast enough to handle high-resolution movie playback smoothly.

You can start Xephyr like this:

Xephyr :1 -ac -br -screen 1680x1050 -resizeable

The server will run as display :1 in a resizable window of initial size defined by the screen parameter. Adding the following to the Docker command line makes the server visible inside of the container:

-e DISPLAY=:1 -v /tmp/.X11-unix/X1:/tmp/.X11-unix/X1

The only remaining pain point is the fact that you cannot share the clipboard by default. Things copied outside of the container do not paste inside and vice versa. The xsel utility and a couple of lines of bash code can solve this problem easily:

CLIP1=""
CLIP2=$(xsel -o -b --display :1)
while true; do
  CLIP1NEW=$(xsel -o -b --display :0)
  CLIP2NEW=$(xsel -o -b --display :1)
  if [ "x$CLIP1" != "x$CLIP1NEW" ]; then
    xsel -o -b --display :0 | xsel -i -b --display :1
    CLIP1=$CLIP1NEW
  fi
  if [ "x$CLIP2" != "x$CLIP2NEW" ]; then
    xsel -o -b --display :1 | xsel -i -b --display :0
    CLIP2=$CLIP2NEW
  fi
  sleep 1
done


Making the audio work both ways (the sound and the microphone) is surprisingly easy with PulseAudio. The host just needs to configure the native protocol plug-in and ensure that port 4713 is not blocked by the firewall:

pactl load-module module-native-protocol-tcp auth-ip-acl=

All you need to do in the container is make sure that the PULSE_SERVER envvar points to the host. It is less straightforward than you might expect when you run a desktop environment and don't want to start all your programs in a terminal window. For XFCE, I do the following in the script driving the container:

mkdir -p /home/prisoner/.config/xfce4
chown prisoner:prisoner -R /home/prisoner
XINITRC=/home/prisoner/.config/xfce4/xinitrc
rm -f $XINITRC

echo -e "#!/bin/sh\n\n" >> $XINITRC
echo -e "export PULSE_SERVER=$PULSE_SERVER\n\n" >> $XINITRC
tail -n +2 /etc/xdg/xfce4/xinitrc >> $XINITRC

sudo -u prisoner startxfce4

This code prepends the appropriate export statement to XFCE's xinitrc file and assumes that the PULSE_SERVER variable is known inside the container:


I also disable the local PulseAudio server in XFCE. It has strange interactions with the one running on the host.


Doing all this by hand every time you want to run the container is way too painful. I wrote a script to automate the process. It can:

  • set up the PulseAudio server
  • forward static devices (e.g., /dev/video0)
  • forward USB devices if they are present (e.g., 21a9:1006 as /dev/bus/usb/001/016)
  • run the Xephyr X server instance for the container
  • forward the clipboards
  • set up docker's command line to make everything work together

This is what it all looks like:

[i] Running jail: 20161121-213528
[i] Container: jail:v01
[i] Hostname: jail
[i] Home: /home/ljanyst/Contained/jail/home
[i] Setting up the local PulseAudio server ( OK
[i] Attaching device /dev/video0
[i] USB device 21a9:1006 not present
[i] Running Xephyr display at :1 (1680x1050)... OK
[i] Running docker... OK
[i] Removing container a545365592c1... OK
[i] Killing clipboard forwarder, PID: 2776... DONE
[i] Killing Xephyr, PID: 2767... DONE
[i] All done. Bye!



Saleae Logic Analyzer


I have finally received my Kickstarter-backed UP boards. So far they seem great! There are three minor drawbacks, though:

  1. They don't have the exact same shape as the Raspberry Pi, so they don't fit the Raspberry Pi cases. It's nothing that could not be rectified with small pliers, though.
  2. The audio chip on Cherry Trail (Intel Atom x5 z8350) SoCs is not yet supported by Linux out of the box, so some fiddling with the kernel is necessary.
  3. Debian's UEFI boot configuration does not seem to work from the get-go either.


You can install Debian Testing using a USB stick. Don't try Jessie, though - the kernel will not detect the MMC card. Things should work fine, except that grub will install itself in /EFI/debian/grubx64.efi on the EFI partition. You will need to move it to /EFI/boot/bootx64.efi manually. It's possible to do it from the UEFI shell using familiar commands.


Kodi installs and works out of the box from Debian Multimedia. Unfortunately, to get the sound working, you will need to recompile the kernel :)

Get the sources and the necessary patches and create the config:

git clone
cd linux
git remote add cht
git fetch cht
git checkout byt-cht-hdmi-v4.7
make oldconfig

You will need to edit .config:

  • to set CONFIG_SYSTEM_TRUSTED_KEYS variable to an empty string
  • to enable CONFIG_HDMI

Then the only thing that's left is building and installing the package:

fakeroot make-kpkg --initrd --append-to-version=-up --revision=1 -j 8 kernel_image kernel_headers
cd ..
sudo dpkg -i linux-image-4.7.0-up+_1_amd64.deb

I wanted to see how efficient it is, so I ran the compilation on the board itself. It took roughly 2.5 hours, and the board got very hot. The board handles FullHD video files over Samba perfectly fine, which the Raspberry Pi 2 couldn't do. The audio quality is much better too. It seems that surround 5.1 actually works. :)


Random Number Generator

To make the game more engaging, we introduce some randomness into it. We don't need anything cryptographically secure, so a Linear Congruential Generator will do just fine. We count the time from the start-up in millisecond-long jiffies and wait for the first button press to select the seed.

void button_event(IO_io *io, uint16_t event)
{
  uint64_t btn;
  IO_get(io, &btn);
  if(btn)
    button_value = 1;

  if(!rng_initialized) {
    rng_initialized = 1;
    IO_rng_seed(IO_time());
  }
}
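For reference, a Linear Congruential Generator fits in a few lines. The sketch below is not the code behind IO_rng_seed; it uses the well-known multiplier and increment from Numerical Recipes with a modulus of 2^32, and the function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t lcg_state;

// Seed the generator, e.g. with the jiffies count at the first button press.
void lcg_seed(uint32_t seed)
{
  lcg_state = seed;
}

// x(n+1) = 1664525 * x(n) + 1013904223 (mod 2^32, via unsigned overflow).
uint32_t lcg_next(void)
{
  lcg_state = lcg_state * 1664525u + 1013904223u;
  return lcg_state;
}

// A value in [0, bound); uses the upper bits, which are of better quality
// than the lower ones in this kind of generator.
uint32_t lcg_below(uint32_t bound)
{
  assert(bound > 0);
  return (lcg_next() >> 16) % bound;
}
```

Seeding from the time of the first button press works because the human delay differs from run to run, even though the sequence itself is fully deterministic.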

Rendering Engine

The rendering engine takes a scene descriptor, a display device, and a timer. Based on this information it computes new positions of objects, draws them on the screen if necessary and checks for collisions.

struct SI_scene {
  SI_object **objects;
  void      (*pre_render)(struct SI_scene *);
  void      (*collision)(SI_object *obj1, SI_object *obj2);
  uint8_t     fps;
  uint8_t     num_objects;
  uint8_t     flags;
};

void SI_scene_render(SI_scene *scene, IO_io *display, IO_io *timer);

Each SI_scene holds a list of "polymorphic" objects that should be rendered, a pointer to a pre_render function that calculates a new position of each object, and a pointer to a collision callback that is invoked when the scene renderer detects an overlap between two objects. The SI_scene_render function runs after every interrupt:

while(1) {
  SI_scene_render(&scenes[current_scene].scene, &display, &scene_timer);
  IO_wait_for_interrupt();
}

Whether it gets executed or not depends on the flags field of the scene. If it's set to SI_SCENE_IGNORE, the renderer returns immediately. On the other hand, if it's set to SI_SCENE_RENDER, the renderer calls the pre_render callback, draws the objects on the screen, and computes the object overlaps, notifying the collision callback if necessary. After each frame, the scene is disabled (SI_SCENE_IGNORE). It is re-enabled by the timer interrupt after a time quantum that depends on the fps parameter.

See SI_scene.h and SI_scene.c.
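The overlap computation mentioned above can be done with a plain bounding-box test. This is a sketch under the assumption that every object is described by a position and a size; the box struct and the function name are hypothetical and not taken from SI_scene.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical bounding box of an object on the screen.
struct box {
  uint16_t x;
  uint16_t y;
  uint16_t width;
  uint16_t height;
};

// Two boxes overlap when their extents intersect on both axes.
// Boxes that merely touch along an edge do not count as overlapping.
bool boxes_overlap(const struct box *a, const struct box *b)
{
  assert(a != NULL && b != NULL);
  return a->x < b->x + b->width  && b->x < a->x + a->width &&
         a->y < b->y + b->height && b->y < a->y + a->height;
}
```

A renderer would run this test over every pair of visible objects and invoke the collision callback for the pairs that return true.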

Each object has a draw function that enables the renderer to draw it on the screen. There are three types of objects: a generic object, a bitmap object, and a text object:

struct SI_object {
  uint16_t x;
  uint16_t y;
  uint16_t width;
  uint16_t height;
  uint8_t  flags;
  uint8_t  user_flags;
  void (*draw)(struct SI_object *this, IO_io *display);
};

struct SI_object_bitmap {
  SI_object        obj;
  const IO_bitmap *bmp;
};

struct SI_object_text {
  SI_object      obj;
  const char    *text;
  const IO_font *font;
};

The object array in the scene is initialized with the SI_object pointers:

static SI_object         score_obj;
static SI_object_bitmap  invader_obj[5];
scene->objects[1] = &score_obj;
scene->objects[i+5] = &invader_obj[i].obj;

See SI_scene_game.c.

The renderer calls the draw function of each SI_OBJECT_VISIBLE object:

obj->draw(obj, display);

Finally, each draw method uses the CONTAINER_OF macro to compute the pointer to the actual object of concrete type:

#define CONTAINER_OF(TYPE, MEMBER, MEMBER_ADDR) \
  ((TYPE *) ( (char *)MEMBER_ADDR - offsetof(TYPE, MEMBER)))

void SI_object_bitmap_draw(SI_object *obj, IO_io *display)
{
  SI_object_bitmap *this = CONTAINER_OF(SI_object_bitmap, obj, obj);
  IO_display_print_bitmap(display, obj->x, obj->y, this->bmp);
}

void SI_object_text_draw(SI_object *obj, IO_io *display)
{
  SI_object_text *this = CONTAINER_OF(SI_object_text, obj, obj);
  IO_display_set_font(display, this->font);
  IO_display_cursor_goto(display, obj->x, obj->y);
  IO_print(display, "%s", this->text);
}
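The same trick can be tried out in isolation. In the sketch below the embedded member is deliberately not the first field, to show that the pointer arithmetic still recovers the enclosing object; the shape and rect types are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define CONTAINER_OF(TYPE, MEMBER, MEMBER_ADDR) \
  ((TYPE *) ((char *)(MEMBER_ADDR) - offsetof(TYPE, MEMBER)))

// Hypothetical "base class" with one virtual method.
struct shape {
  int x;
  int (*area)(struct shape *self);
};

// Hypothetical "derived class" embedding the base at a non-zero offset.
struct rect {
  int          tag;
  struct shape base;
  int          w;
  int          h;
};

int rect_area(struct shape *self)
{
  assert(self != NULL);
  struct rect *r = CONTAINER_OF(struct rect, base, self);
  return r->w * r->h;
}
```

The caller only ever sees a struct shape pointer and calls s->area(s), just like the renderer calls obj->draw(obj, display).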

The Game

All this seems to work pretty well when put together:

The Game

See silly-invaders.c.



Tiva does not have a DAC, but we'd like to have some sound effects while playing the game. Fortunately, it's easy to make a simple binary-weighted DAC using resistors and GPIO signals. It's not very accurate, but will do.

A binary-weighted DAC

As far as the software is concerned, we will simply take 4 GPIO pins and set them up as output. We will then get an appropriate bit-banded alias such that writing an integer to it is reflected only in the state of these four pins.

int32_t IO_dac_init(IO_io *io, uint8_t module)
{
  if(module > 0)
    return -IO_EINVAL;

  TM4C_gpio_port_init(GPIO_PORTD_NUM);
  TM4C_gpio_pin_init(GPIO_PORTD_NUM, GPIO_PIN0_NUM, 0, 0, 1);
  TM4C_gpio_pin_init(GPIO_PORTD_NUM, GPIO_PIN1_NUM, 0, 0, 1);
  TM4C_gpio_pin_init(GPIO_PORTD_NUM, GPIO_PIN2_NUM, 0, 0, 1);
  TM4C_gpio_pin_init(GPIO_PORTD_NUM, GPIO_PIN3_NUM, 0, 0, 1);

  uint32_t addr =  GPIO_REG_BASE + GPIO_PORTD;
  addr += GPIO_PIN0_BIT_OFFSET;
  addr += GPIO_PIN1_BIT_OFFSET;
  addr += GPIO_PIN2_BIT_OFFSET;
  addr += GPIO_PIN3_BIT_OFFSET;
  dac_data = (uint32_t*)addr;

  io->type    = IO_DAC;
  io->sync    = 0;
  io->channel = 0;
  io->flags   = 0;
  io->read    = 0;
  io->write   = dac_write;
  io->event   = 0;

  return 0;
}

See TM4C_platform01.c.
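The address arithmetic in IO_dac_init relies on the way the Tiva GPIO data register decodes bits 9:2 of the address as a pin mask. The sketch below models that behavior on a host; it assumes that each GPIO_PINn_BIT_OFFSET equals (1 << n) << 2, which is my reading of the code rather than something shown here, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

// Address offset that selects a single pin in the data register window.
uint32_t gpio_pin_offset(unsigned pin)
{
  assert(pin < 8);
  return (1u << pin) << 2;  // the pin mask lands in address bits 9:2
}

// Offset selecting several pins at once: the per-pin offsets simply add up.
uint32_t gpio_mask_offset(uint8_t pin_mask)
{
  return (uint32_t)pin_mask << 2;
}

// Model of a write through the masked address: only selected pins change.
uint8_t gpio_masked_write(uint8_t port_state, uint8_t pin_mask, uint8_t value)
{
  return (uint8_t)((port_state & ~pin_mask) | (value & pin_mask));
}
```

For pins 0-3 the offset comes out as 0x3c, and writing any integer through that address changes only the four DAC pins.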


We will create a virtual device consisting of a DAC and a timer. Using the timer, we will change the output of the DAC frequently enough to produce sound. Since the timer interrupt needs to be executed often and any delay makes the sound break, we need to assign the highest possible priority to this interrupt so that it does not get preempted.

int32_t IO_sound_init(IO_io *io, uint8_t module)
{
//...
  IO_dac_init(&snd_dac, 0);
  IO_timer_init(&snd_timer, 11);
  TM4C_enable_interrupt(104, 0); // adjust the interrupt priority for timer 11
  snd_timer.event = snd_timer_event;
//...
}

Writing to this virtual device sets the frequency of the tone that we want to play by adjusting the timer's firing rate accordingly.

static int32_t snd_write(IO_io *io, const void *data, uint32_t length)
{
  if(length != 1)
    return -IO_EINVAL;
  const uint64_t *val = data;

  if(!(*val)) {
    snd_interval = 0;
    return 1;
  }
  uint8_t turn_on = 0;
  if(!snd_interval)
    turn_on = 1;

  double interval = 1.0/(*val);
  interval /= 32.0;
  interval *= 1000000000;
  snd_interval = interval;

  if(turn_on)
    IO_set(&snd_timer, interval);
  return 1;
}

In reality, the timer fires 32 times more often than the frequency of the tone requires. It is because we use a table with 32 entries to simulate the actual sound wave. In principle, we could just use a sinusoid, but it turns out that the quality of the sound is not so great if we do so. I have found another waveform in the lab materials of EdX's Embedded Systems course that works much better.

static const uint8_t snd_trumpet[] = {
  10, 11, 11, 12, 10,  8,  3,  1,  8, 15, 15, 11, 10, 10, 11, 10, 10, 10, 10,
  10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 10, 10, 10 };

static void snd_timer_event(IO_io *io, uint16_t event)
{
  IO_set(&snd_dac, snd_trumpet[snd_step++]);
  snd_step %= 32;
  if(snd_interval)
    IO_set(io, snd_interval);
}

See TM4C_platform01.c.
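The timer interval computed in snd_write can be checked with a bit of integer arithmetic. The helper below is a hypothetical integer-only equivalent of the floating-point calculation, assuming the timer is programmed in nanoseconds as in the other examples.

```c
#include <assert.h>
#include <stdint.h>

#define WAVE_STEPS 32u  // entries in the waveform table

// Time between two DAC updates, in nanoseconds, for a tone of freq_hz.
uint32_t tone_interval_ns(uint32_t freq_hz)
{
  assert(freq_hz > 0);
  return 1000000000u / (freq_hz * WAVE_STEPS);
}
```

For a 440 Hz tone the timer has to fire 14080 times per second, that is roughly every 71022 ns, which is why this interrupt cannot tolerate being delayed.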

Nokia tunes

The tune player API consists of four functions:

int32_t IO_sound_play(IO_io *io, IO_io *timer, IO_tune *tune, uint16_t start);
int32_t IO_sound_stop(IO_io *io);
IO_tune *IO_sound_decode_RTTTL(const char *tune);
void IO_sound_free_tune(IO_tune *tune);

  • IO_sound_play uses a sound device and a timer to play a tune. It sends an IO_EVENT_DONE to the virtual sound device when the playback finishes.
  • IO_sound_stop stops the playback on the given device and returns the index of the last note it played so that it can be restarted from that point.
  • IO_sound_decode_RTTTL takes an RTTTL representation and produces the IO_tune structure that can be handled by the player.
  • IO_sound_free_tune frees the memory allocated by IO_sound_decode_RTTTL when the tune is no longer needed.

There are plenty of tunes all over the Internet. The ones in the demo video are taken from here. The code of the player is based on this work.

It plays music! :)

See IO_sound.c.



Tiva has 12 timer modules that can be configured in various relatively complex ways. However, for the purpose of this game, we don't need anything fancy. We will, therefore, represent a timer as an IO_io device with the IO_set function (generalized from IO_gpio_set) setting and arming it. When it counts to 0, the IO_TICK event will be reported to the event handler.

void timer_event(IO_io *io, uint16_t event)
{
  IO_set(&timer, 500000000); // fire in half a second
}

int main()
{
  IO_init();
  IO_timer_init(&timer, 0);
  timer.event = timer_event;
  IO_set(&timer, 500000000); // fire in half a second

  while(1)
    IO_wait_for_interrupt();
}

See TM4C_timer.c and test-07-timer.c.
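
The value passed to IO_set in the example is a delay in nanoseconds, so somewhere it has to be turned into a number of timer ticks. The sketch below shows that conversion for an arbitrary clock rate. The function name is hypothetical, and I am assuming here that the timer counts at the 80MHz system clock; the actual driver may do it differently.

```c
#include <stdint.h>

/* Convert a delay in nanoseconds to timer ticks for a given clock rate.
   The multiplication is done in 64 bits so it does not overflow for
   delays of a few seconds at tens of MHz. */
uint64_t ns_to_ticks(uint64_t ns, uint32_t clock_hz)
{
  return ns * clock_hz / 1000000000ULL;
}
```

Under that assumption, half a second comes out as 40 million ticks, which fits comfortably in a 32-bit counter (it overflows after roughly 53 seconds at 80MHz).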


Similarly to the timers, the ADC sequencers on Tiva may be set up in fairly sophisticated ways. There are 12 analog pins and two modules with four sequencers each. Again, we don't need anything sophisticated here, so we will just use the first eight pins and assign each of them to a separate sequencer. In the blocking mode, IO_get initiates the readout and returns the value. In the non-blocking and asynchronous mode, IO_set requests sampling, and IO_get returns the value when it is ready. An IO_EVENT_DONE event is reported to the event handler if enabled.

IO_io slider;
IO_io timer;
uint64_t sliderR = 0;

void timer_event(IO_io *io, uint16_t event)
{
  IO_set(&slider, 0); // request a sample
}

void slider_event(IO_io *io, uint16_t event)
{
  IO_get(&slider, &sliderR);
  IO_set(&timer, 100000000); // fire in 0.1 sec
}

int main()
{
  IO_init();

  IO_timer_init(&timer, 0);
  IO_slider_init(&slider, 0, IO_ASYNC);

  timer.event     = timer_event;
  slider.event    = slider_event;

  IO_event_enable(&slider,    IO_EVENT_DONE);

  IO_set(&slider, 0); // request a sample

  while(1)
    IO_wait_for_interrupt();
}

See TM4C_adc.c and test-08-input.c.
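
A raw sample is not very useful to the game on its own; it has to be scaled to something like a screen column. Here is a small sketch of such a mapping. It assumes 12-bit samples (0 to 4095) and a hypothetical helper name; neither the function nor the constant comes from the driver.

```c
#include <stdint.h>

#define ADC_MAX 4095u /* largest 12-bit sample (assumed resolution) */

/* Map a raw sample to a column in the range 0..width-1.
   The caller is expected to pass a width of at least 1. */
uint16_t adc_to_column(uint16_t sample, uint16_t width)
{
  if(sample > ADC_MAX)
    sample = ADC_MAX; /* clamp anything out of range */
  return (uint16_t)(((uint32_t)sample * (width - 1u)) / ADC_MAX);
}
```

For an 84-pixel-wide display, the two ends of the slider would land on columns 0 and 83.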

The game board

Everything works fine when soldered together as well.

Buttons and the Slider


The paper shows that, despite the often-repeated mantra, OS task scheduling is far from easy. The authors developed two new tools to investigate the CPU usage and the state of the associated run queue. These tools allowed them to uncover four interesting performance bugs on a 64-core NUMA system. They discovered that some cores often stay idle for a long time while tasks are waiting. This violates one of the design principles of the Completely Fair Scheduler, the Linux default, which is supposed to be work-conserving. Fixing these bugs resulted in a 138-fold speedup in an extreme test case (multithreaded, using spinlocks) and 13-23% speedups in other test cases. Bugs of this type are hard to uncover because they typically waste cycles hundreds of milliseconds at a time, which is beyond the resolution of standard monitoring tools.

Completely Fair Scheduler

CFS defines an interval in which each task must run at least once. This interval is then divided between all the tasks in proportion to their weight (niceness). A running thread accumulates vruntime, which is the amount of time it was running divided by its weight. The scheduler keeps these tasks in a run queue, which is implemented as a red-black tree. When the CPU becomes idle, the leftmost node is picked because it has accumulated the least weighted runtime.
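
The bookkeeping described above can be sketched in a few lines of C. This is a toy model, not kernel code: the structure and function names are invented, the run queue is a plain array scanned linearly instead of a red-black tree, and the division is the simplified formula from the paragraph above rather than the scaled arithmetic the kernel uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a schedulable task. */
struct task {
  uint64_t vruntime; /* weighted runtime accumulated so far */
  uint32_t weight;   /* larger weight means a larger share of the CPU */
};

/* Charge a task for the time it has just spent running. */
void task_account(struct task *t, uint64_t ran_ns)
{
  t->vruntime += ran_ns / t->weight;
}

/* Pick the task with the smallest vruntime, i.e. what would be
   the leftmost node of the tree. Expects n to be at least 1. */
size_t pick_next(const struct task *tasks, size_t n)
{
  size_t best = 0;
  for(size_t i = 1; i < n; ++i)
    if(tasks[i].vruntime < tasks[best].vruntime)
      best = i;
  return best;
}
```

A task with twice the weight accumulates vruntime at half the rate, so after both have run for the same wall-clock time, it is the one picked next.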

In a multi-core system, each core has its own run queue. To fairly distribute the load among the cores, the run queues must be periodically re-balanced. In today's systems, with dozens of run queues, the balancing procedure is expensive and not run very often. This is due to the need to take into account other factors, such as power-saving, cache and memory locality. The load balancing algorithm takes the threads from the most loaded cores and distributes them between the least loaded cores, taking into account the topology of the system. The more complex the system gets, the more rules need to be applied and the harder it gets to reason about performance.

The bugs and the tools

The bugs uncovered by the authors are all related to migrating tasks between NUMA nodes. They were detected using new tools:

  • Sanity Checker checks every second whether there are idle cores in the presence of waiting threads in other run queues. If there are, it monitors the system for another 100ms. If the situation is not remediated, it begins to record the profiling information for further off-line analysis.
  • The scheduler visualization tool taps into various kernel functions to monitor and visualize scheduling activity over time.
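
The condition that the Sanity Checker looks for can be written down as a simple predicate over the per-core run queue lengths. The following is only my sketch of that invariant, with a made-up function name and input format; the authors' tool works inside the kernel and is more involved.

```c
#include <stddef.h>

/* nr_running[i] is the number of runnable threads on core i,
   including the one currently executing. Returns 1 if some core is
   idle while another core has threads waiting in its run queue. */
int work_conserving_violated(const unsigned *nr_running, size_t ncores)
{
  int idle = 0;
  int waiting = 0;
  for(size_t i = 0; i < ncores; ++i) {
    if(nr_running[i] == 0)
      idle = 1;
    if(nr_running[i] > 1)
      waiting = 1;
  }
  return idle && waiting;
}
```

A momentary violation is normal, which is why the tool described above only starts recording when the condition persists.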


The authors note that the problems were caused by people wanting to optimize CFS to compensate for the complexity of the modern hardware. They suggest rewriting of the scheduler as a core and a bunch of optimization modules.

Table of Contents

  1. Compiling and start-up code
  2. Hardware Abstraction Layer and UART
  3. Debugging, display, heap and fonts
  4. Timers and ADC
  5. DAC, Sound and Nokia Tunes
  6. Random Number Generator, Rendering Engine and the Game
  7. Operating System


To test and debug the SSI code, I connected two boards and made them talk to each other. It mostly worked. However, it turned out that, by default, you can run the OpenOCD-GDB duo for only one board at a time: the one that libusb enumerates first. There is a patch that lets OpenOCD choose the device to attach to by its serial number. The patch has not made it to any release yet, but applying it and recompiling the whole thing is relatively straightforward: clone the source, apply the patch and run the usual autotools combo. You will then need to create a config file for each board that specifies unique port numbers and defines the serial number of the device to attach to:

]==> cat board1.cfg
gdb_port 3333
telnet_port 4444
tcl_port 6666
interface hla
hla_serial 0Exxxxxx
source [find board/ek-tm4c123gxl.cfg]

]==> cat board2.cfg
gdb_port 3334
telnet_port 4445
tcl_port 6667
interface hla
hla_serial 0Exxxxxx
source [find board/ek-tm4c123gxl.cfg]

Separate GDB batch files come in handy as well:

]==> cat gdb-board1.conf
target extended-remote :3333
monitor reset init
break main

]==> cat gdb-board2.conf
target extended-remote :3334
monitor reset init
break main

Tweaking the linker script

GCC started generating .init and .fini sections that contain no-op functions:

]==> arm-none-eabi-objdump -d  test-06-display.axf
Disassembly of section .init:

00007af8 <_init>:
    7af8:       b5f8            push    {r3, r4, r5, r6, r7, lr}
    7afa:       bf00            nop

Disassembly of section .fini:

00007afc <_fini>:
    7afc:       b5f8            push    {r3, r4, r5, r6, r7, lr}
    7afe:       bf00            nop

We will discard this code by adding the following to the linker script:

/DISCARD/ :
  {
    *(.init*)
    *(.fini*)
  }

GCC also started generating the stack unwinding code and GDB gets confused in some places if it is not present, so we put it after the code in FLASH:

.ARM.exidx :
  {
    *(.ARM.exidx*)
    *(.gnu.linkonce.armexidx*)
  } > FLASH

See TM4C.ld.


We need both SSI and GPIO to control the Nokia display that we want to use for the game. Since, in the end, both these systems need to push and receive data, they fit the generic interface used for UART well. The SSI's initialization function needs many more parameters than the one for UART, so we pack them all in a struct. As far as GPIO is concerned, there are two helpers, IO_gpio_get_state and IO_gpio_set_state, that just read or write the appropriate byte from or to the IO device. GPIO also comes with a new event type: IO_EVENT_CHANGE.

struct IO_ssi_attrs {
  uint8_t  master;        //!< 1 for master, 0 for slave
  uint8_t  slave_out;     //!< 1 slave output enabled, 0 slave output disabled
  uint32_t bandwidth;     //!< bandwidth in bps
  uint8_t  frame_format;  //!< frame format
  uint8_t  freescale_spo; //!< SPO value for freescale frames
  uint8_t  freescale_sph; //!< SPH value for freescale frames
  uint8_t  frame_size;    //!< size of the frame in bits
};

See TM4C_ssi.c and TM4C_gpio.c.

Platforms, the display interface, and fonts

All the devices that are not directly on the board may be connected in many different ways. To handle all these configurations with the same board, we split the driver into libtm4c.a (for the board-specific stuff) and libtm4c_platform_01.a (for the particular configuration). For now, the only thing that the platform implements is the display interface. It passes the appropriate SSI module and GPIOs to the actual display driver. The user sees the usual IO_io structure that is initialized with IO_display_init and can be written to and synced. write renders the text to the back-buffer, while sync sends the back-buffer to the device for rendering. There are also a couple of specialized functions that have to do only with display devices:

int32_t IO_display_get_attrs(IO_io *io, IO_display_attrs *attrs);
int32_t IO_display_clear(IO_io *io);
int32_t IO_display_put_pixel(IO_io *io, uint16_t x, uint16_t y, uint32_t argb);
int32_t IO_display_print_bitmap(IO_io *io, uint16_t x, uint16_t y,
  const IO_bitmap *bitmap);
int32_t IO_display_set_font(IO_io *io, const IO_font *font);
int32_t IO_display_cursor_goto(IO_io *io, uint32_t x, uint32_t y);
int32_t IO_display_cursor_goto_text(IO_io *io, uint32_t line, uint32_t space);
int32_t IO_display_cursor_move(IO_io *io, int32_t dx, int32_t dy);
int32_t IO_display_cursor_move_text(IO_io *io, int32_t dline, int32_t dspace);

See IO_display.h.

Platform 01 provides one display device, a PCD8544, the one used in Nokia 5110. It translates and passes the interface calls to the lower-level driver. See pcd8544.c.

If you haven't noticed in the list of the functions above, the display interface supports multiple fonts. In fact, I wrote a script that rasterizes TrueType fonts and creates internal IO_font structures. These can then be used to render text on a display device. All you need to do is provide a TTF file, declare the font name and size in CMake, and then reference it in the font manager. The code comes with DejaVuSans10 and DejaVuSerif10 by default.

The heap

Malloc comes in handy from time to time, so I decided to implement one. It is extremely prone to fragmentation and never merges chunks, so using free is not advisable. Still, sometimes you just wish you had one, for instance, when you need to define a buffer for pixels and don't have a good way to ask for display parameters at compile time. For alignment reasons, the heap management code reserves a bit more than 4K for the stack. It then creates a 32-byte guard region protected by the MPU. Everything between the end of the .bss section and the guard region is handled by IO_malloc and IO_free.

void TM4C_heap_init()
{
  uint8_t *stack_start = (uint8_t *)0x20007ff8;
  uint8_t *stack_end   = stack_start-4120;
  uint8_t *stack_guard = stack_end-32;
  uint8_t *heap_start  = (uint8_t *)&__bss_end_vma;

  MPUCTRL_REG |= (uint32_t)0x05; // enable MPU and the background region

  uint32_t val = (uint32_t)stack_guard;
  val |= 0x10; // valid
  val |= 0x07; // highest priority region
  MPUBASE_REG &= ~0xfffffff7;
  MPUBASE_REG |= val;

  val = 0;
  val |= (1 << 28); // disable instruction fetches
  val |= (4 << 1);  // 0x04 == 32 bytes
  val |= 1;         // enable the region
  MPUATTR_REG &= ~0x173fff3f;
  MPUATTR_REG |= val;

  IO_set_up_heap(heap_start, stack_guard);
}

See TM4C.c and IO_malloc.c.
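
The "a bit more than 4K" is easier to see when the arithmetic is spelled out. An MPU region on Cortex-M4 has to start at an address that is a multiple of its size, so the 32-byte guard must sit on a 32-byte boundary. The sketch below recomputes the addresses from the listing above on the host; the macro and function names are mine and only serve the illustration.

```c
#include <stdint.h>

#define STACK_TOP  0x20007ff8u /* initial stack pointer */
#define STACK_SIZE 4120u       /* 4096 bytes plus 24 bytes of padding */
#define GUARD_SIZE 32u         /* size of the MPU-protected region */

/* Start address of the guard region placed right below the stack. */
uint32_t stack_guard_address(void)
{
  return STACK_TOP - STACK_SIZE - GUARD_SIZE;
}

/* Check that an address is a multiple of a power-of-two size. */
int is_aligned(uint32_t address, uint32_t size)
{
  return (address & (size - 1u)) == 0;
}
```

The guard ends up at 0x20006fc0, which is 32-byte aligned. With a stack of exactly 4096 bytes, it would land at 0x20006fd8, which is not.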

A display test

The LCD demo works fine on the breadboard. As you can see, there is a text printed with two kinds of fonts: with and without serifs. Later, the code plays the Game of Life shooting gliders.

Glider Gun on a breadboard


Since the display works fine, it's safe to do some soldering. We'll use a 9-volt battery as a power source and an LM1086 power regulator to supply 3.3 volts to the microcontroller and other devices.

Soldered Display - Front

Soldered Display - Back

Glider Gun - Soldered

Table of Contents

  1. Compiling and start-up code
  2. Hardware Abstraction Layer and UART
  3. Debugging, display, heap and fonts
  4. Timers and ADC
  5. DAC, Sound and Nokia Tunes
  6. Random Number Generator, Rendering Engine and the Game
  7. Operating System

Hardware Abstraction Layer

I'd like the game to be as portable as possible. As far as the game logic is concerned, the actual interaction with the hardware is immaterial. Ideally, we just need means to write a pixel to a screen, blink an LED or check the state of a push-button. It means that hiding the hardware details behind a generic interface is desirable. This interface can then be re-implemented for a different kind of board, and the whole thing can act as a cool tool for getting to know new hardware.

In this project, we will use one static library (libio.a) to provide the interface. This library will implement all the hardware independent functions as well as the stubs for the driver (as weak symbols). Another library (libtm4c.a) will provide the real driver logic for Tiva and the strong symbols. This kind of approach enables us to use the linker to easily produce the final binary for other platforms in the future.

Initialization, PLL and FPU

To initialize the hardware platform, the user calls IO_init(). The stub for this function is provided by libio.a as follows:

int32_t __IO_init()
{
  return -IO_ENOSYS;
}

WEAK_ALIAS(__IO_init, IO_init);

The actual implementation for Tiva in libtm4c.a initializes the PLL to provide an 80MHz clock and turns on microDMA. It also sets the access permissions to the FPU by setting the appropriate bits in the CPAC register and resetting the pipeline in assembly. We will likely need floating point in the game, and it comes in handy when calculating UART transmission speed parameters.

int32_t IO_init()
{
  TM4C_pll_init();
  TM4C_dma_init();

  // Enable the floating point coprocessor
  CPAC_REG |= (0x0f << 20);
  __asm__ volatile (
    "dsb\r\n"        // force memory writes to complete before continuing
    "isb\r\n" );     // reset the pipeline
  return 0;
}

Simple read/write interface and functions

We provide an IO device abstraction called IO_io and implement four generic functions for accessing it:

int32_t IO_write(IO_io *io, const void *data, uint32_t length);
int32_t IO_print(IO_io *io, const char *format, ...);
int32_t IO_read(IO_io *io, void *data, uint32_t length);
int32_t IO_scan(IO_io *io, uint8_t type, void *data, uint32_t param);

IO_write and IO_read, respectively, push bytes to and fetch bytes from the device. IO_print writes a formatted string to the device using the standard printf semantics. IO_scan reads a word (a stream of characters surrounded by whitespace) and tries to convert it to the requested type.

Each subsystem needs to provide its initialization function to fill the IO_io struct with the information required to perform the IO operations. For instance, the following function initializes UART:

int32_t IO_uart_init(IO_io *io, uint8_t module, uint16_t flags, uint32_t baud);

It needs to know which UART module to use, what the desired mode of operation is (non-blocking, asynchronous, DMA...) and what should be the speed of the link. This approach hides the hardware details from the user well and is very generic, see test-01-uart.c. For instance, you can write something like this:

IO_init();
IO_io uart0;
IO_uart_init(&uart0, 0, 0, 115200);
IO_print(&uart0, "Hello %s\r\n", "World");

Passing 0 as flags to the UART initialization routine creates a blocking device that is required for IO_print and IO_scan to work.

Non-blocking and asynchronous IO

A blocking IO device will cause the IO functions to return only after they have pushed or pulled all the data to or from the hardware. If, however, you configure a non-blocking (IO_NONBLOCKING) device, the functions will process as many bytes as they can and return. They return -IO_WOULDBLOCK if it is not possible to handle any data.

The IO_ASYNC flag makes the system notify the user about the device readiness for reading or writing. These events are received and processed by a user-defined call-back function:

void uart_event(IO_io *io, uint16_t event)
{
  if(event & IO_EVENT_READ) {
  }

  if(event & IO_EVENT_WRITE) {
  }
}

int main()
{
  IO_init();
  IO_io uart0;
  IO_uart_init(&uart0, 0, IO_NONBLOCKING|IO_ASYNC, 115200);
  uart0.event = uart_event;
  IO_event_enable(&uart0, IO_EVENT_READ|IO_EVENT_WRITE);
  while(1) IO_wait_for_interrupt();
}

See test-02-uart-async.c.


The DMA mode allows for transferring data between the peripheral and the main memory in the background. It uses the memory bus when the CPU does not need it for anything else. When in this mode, IO_read and IO_write only initiate a background transfer. The next invocation will either block or return -EWOULDBLOCK, depending on other configuration flags, as long as the current DMA operation is in progress. The memory buffer cannot be changed until the DMA transfer is done. Passing IO_ASYNC will generate completion events for DMA operations. It enables us to implement a pretty neat UART echo app:

#include <io/IO.h>
#include <io/IO_uart.h>

char buffer[30];

void uart_event(IO_io *io, uint16_t event)
{
  if(event & IO_EVENT_DMA_READ)
    IO_write(io, buffer, 30);

  if(event & IO_EVENT_DMA_WRITE)
    IO_read(io, buffer, 30);
}

int main()
{
  IO_init();
  IO_io uart0;
  IO_uart_init(&uart0, 0, IO_DMA|IO_ASYNC, 115200);
  uart0.event = uart_event;
  IO_read(&uart0, buffer, 30);
  while(1) IO_wait_for_interrupt();
}

See test-03-uart-dma.c.

The driver

There was nothing particularly hard about writing the driver part. It all boils down to reading the data sheet and following the instructions contained therein. It took quite some time to put everything together into a coherent whole, though. See: TM4C_uart.c.

Table of Contents

  1. Compiling and start-up code
  2. Hardware Abstraction Layer and UART
  3. Debugging, display, heap and fonts
  4. Timers and ADC
  5. DAC, Sound and Nokia Tunes
  6. Random Number Generator, Rendering Engine and the Game
  7. Operating System


I have recently been playing with microcontrollers a lot. Among other things, I have worked through some of the labs from this course on EdX. The material does not use much high-level code, so it gives a good overview of how the software interacts with the hardware. There are some "black box" components in there, though. For me, the best way to learn something well has always been building things from "first principles." I find black boxes frustrating. This post describes the first step on my way to make an Alien Invaders game from "scratch."

Compiling for Tiva

First, we need to be able to compile C code for Tiva. To this end, we will use GCC as a cross-compiler, so make sure you have the arm-none-eabi-gcc command available on your system. We will use the following flags to build Tiva-compatible binaries:

  • -mcpu=cortex-m4 - produce the code for ARM Cortex-M4 CPU
  • -mfpu=fpv4-sp-d16 - FPv4 single-precision floating point with the register bank seen by the software as 16 double-words
  • -mfloat-abi=hard - generate floating point instructions and use FPU-specific calling conventions
  • -mthumb - use the Thumb instruction set
  • -std=c11 - use the C11 standard
  • -O0 - don't perform any optimizations
  • -Wall and -pedantic - warn about all the potential issues with the code
  • -ffunction-sections and -fdata-sections - place every function and data item in a separate section in the resulting object file; this allows the linker to remove all unused code and data at link time

Object files

To generate a proper binary image, we need to have some basic understanding of object files produced by the compiler. In short, they consist of sections containing various pieces of compiled code and the corresponding data. These sections may be loadable, meaning that the contents of the section should be read from the object file and stored in memory. They may also be just allocatable, meaning that there is nothing to be loaded, but a chunk of memory needs to be put aside for them nonetheless. There are multiple sections in a typical ELF object file, but we need to know only four of them:

  • .text - contains the program code
  • .rodata - contains the constants (read-only data)
  • .data - contains the read-write data
  • .bss - contains statically allocated variables (initialized to zero)

Let's consider the following code:

#include <stdio.h>

int a = 12;
int b;
const char *c = "The quick brown fox jumps over the lazy dog.";
const char * const d = "The quick brown fox jumps over the lazy dog.";
int e[20];
const int f[4] = {7, 4, 2, 1};

int main(int argc, char **argv)
{
  printf("Hello world!\n");
  return 0;
}

After compiling it, we end up with an object file containing the following sections (most have been omitted for clarity):

]==> objdump -h test

test:     file format elf64-x86-64

Idx Name          Size      VMA               LMA               File off  Algn
 13 .text         00000192  00000000004003f0  00000000004003f0  000003f0  2**4
 15 .rodata       0000006d  0000000000400590  0000000000400590  00000590  2**4
 24 .data         00000020  0000000000600948  0000000000600948  00000948  2**3
                  CONTENTS, ALLOC, LOAD, DATA
 25 .bss          00000090  0000000000600980  0000000000600980  00000968  2**5

As you can see, every section has two addresses:

  • VMA (virtual memory address) - This is the location of the section the code expects when it runs.
  • LMA (load memory address) - This is the location where the section is stored by the loader.

These two addresses are the same in most cases, except in the situation that we care about here: an embedded system. In our binary image, we need to put the .data section in ROM because it contains initialized variables whose values would otherwise be lost on reset. The section's LMA, therefore, must point to a location in ROM. However, this data is not constant, so its final position at the program's runtime needs to be in RAM. Therefore, the VMA must point to a location in RAM. We will see an example later.

Tiva's memory layout

Tiva has 256K of ROM (range: 0x00000000-0x0003ffff) and 32K of RAM (range: 0x20000000-0x20007fff). See table 2-4 on page 90 of the data sheet for details. The NVIC (Interrupt) table needs to be located at address 0x00000000 (section 2.5 of the data sheet). We will create this table in C, put it in a separate object file section, and fill it with weak aliases of the default handler function. This approach will enable the user to redefine the interrupt handlers without having to edit the start-up code. The linker will resolve the handler addresses to strong symbols if any are present.

So, we define a dummy interrupt handler that loops indefinitely:

void __int_handler(void)
{
  while(1);
}

and then create a bunch of weak aliases to this function:

#define DEFINE_HANDLER(NAME) void NAME ## _handler() __attribute__ ((weak, alias ("__int_handler")))

DEFINE_HANDLER(hard_fault);
DEFINE_HANDLER(bus_fault);
DEFINE_HANDLER(usage_fault);
...

Finally, we construct the nvic_table, place it in the .nvic section in the resulting object file and fill it with handler addresses:

#define HANDLER(NAME) NAME ## _handler
void (*nvic_table[])(void) __attribute__ ((section (".nvic"))) = {
  HANDLER(reset),
  HANDLER(nmi),
  HANDLER(hard_fault),
  HANDLER(mman),
...

Linker scripts

We will use linker scripts to set the VMAs and the LMAs to the values we like and to create some symbols whose addresses we can play with in the C code. We first need to define the memory layout:

MEMORY
{
  FLASH (rx)  : ORIGIN = 0x00000000, LENGTH = 0x00040000
  RAM   (rwx) : ORIGIN = 0x20000000, LENGTH = 0x00008000
}

We then need to tell the linker where to put the sections in the final executable:

 1 SECTIONS
 2 {
 3   .text :
 4   {
 5     LONG(0x20007fff)
 6     KEEP(*(.nvic))
 7     *(.text*)
 8     *(.rodata*)
 9      __text_end_vma = .;
10   } > FLASH
12   .data :
13   {
14     __data_start_vma = .;
15     *(.data*)
16     *(vtable)
17     __data_end_vma = .;
18   } > RAM AT > FLASH
20   .bss :
21   {
22     __bss_start_vma = .;
23     *(.bss*)
24     *(COMMON)
25     __bss_end_vma = .;
26   } > RAM
27 }
  1. We start with the .text section and begin it with 0x20007fff. It is the initial value of the stack pointer (see the data sheet). Since the stack grows towards lower addresses, we initialize the top of the stack to the last byte of available RAM.
  2. We then put the .nvic section. The KEEP function forces the linker to keep this section even when the link-time optimizations are enabled, and the section seems to be unused. The asterisk in *(.nvic) is a wildcard for an input object file name. Whatever is in the brackets is a wildcard for a section name.
  3. We put all the code and read-only data from all of the input files in this section as well.
  4. We define a new symbol: __text_end_vma and assign its address to the current VMA (the dot means the current VMA).
  5. We put this section in FLASH: > FLASH at line 10.
  6. We combine the .data* sections from all input files into one section and put it behind the .text section in FLASH. We set the VMAs to be in RAM: > RAM AT > FLASH.
  7. Apparently TivaWare changes the value of the VTABLE register and needs to have the NVIC table in RAM, so we oblige: *(vtable).
  8. We put .bss in RAM after .data.
  9. We use asterisks in section names (e.g. .bss*) because the -ffunction-sections and -fdata-sections parameters cause the compiler to generate a separate section for each function and data item.

Edit 02.04.2016: The initial stack pointer needs to be aligned to 8 bytes for passing of 64-bit long variadic parameters to work. Therefore, the value of the first four bytes in the text section should be: LONG(0x20007ff8). See this post for details.
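
Both values of the initial stack pointer follow directly from the RAM block defined in the linker script. As a quick sanity check, here is the arithmetic in C. The helper names are made up, and the code is meant to be run on the host, not on the board.

```c
#include <stdint.h>

/* Address of the last byte of a memory block. */
uint32_t region_last_byte(uint32_t origin, uint32_t length)
{
  return origin + length - 1u;
}

/* Round an address down to a multiple of a power-of-two alignment. */
uint32_t align_down(uint32_t address, uint32_t alignment)
{
  return address & ~(alignment - 1u);
}
```

The last byte of RAM is 0x20000000 + 0x8000 - 1 = 0x20007fff, and rounding it down to 8 bytes gives 0x20007ff8.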

See the binutils documentation for more details.

Start-up code

On the system start-up, we need to copy the contents of the .data section from FLASH to RAM ourselves before we can run any code. We do it by defining a reset handler:

extern unsigned long __text_end_vma;
extern unsigned long __data_start_vma;
extern unsigned long __data_end_vma;
extern unsigned long __bss_start_vma;
extern unsigned long __bss_end_vma;

extern void main();

void __rst_handler()
{
  unsigned long *src = &__text_end_vma;
  unsigned long *dst = &__data_start_vma;

  while(dst < &__data_end_vma) *dst++ = *src++;
  dst = &__bss_start_vma;
  while(dst < &__bss_end_vma) *dst++ = 0;

  main();
}

void reset_handler() __attribute__ ((weak, alias ("__rst_handler")));

We first declare external symbols. They are put in the symbol table by the linker. The reset handler then moves the .data section from FLASH to RAM, zeroes the .bss section, and calls main.
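
The copy-and-zero logic can be tried out on a regular PC by replacing the linker symbols with pointers into ordinary arrays. The function below is a sketch for that purpose only; its name and parameters are not part of the start-up code.

```c
/* Copy initialized data from its load address to its run-time address
   and clear the zero-initialized area, one word at a time. The pointer
   arguments stand in for the symbols provided by the linker script. */
void startup_init_ram(const unsigned long *load,
                      unsigned long *data_start, unsigned long *data_end,
                      unsigned long *bss_start, unsigned long *bss_end)
{
  const unsigned long *src = load;
  unsigned long *dst = data_start;

  /* .data: copy from "FLASH" to "RAM" */
  while(dst < data_end)
    *dst++ = *src++;

  /* .bss: fill with zeroes */
  dst = bss_start;
  while(dst < bss_end)
    *dst++ = 0;
}
```

Calling it with a two-word "flash" array and a four-word "ram" array, where the second half plays the role of .bss, leaves the first two words of "ram" equal to the "flash" contents and the other two zeroed.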

A test

Let's put everything together. I wrote a short program that blinks an LED using the SysTick interrupt. The color of the LED depends on the switch pressed. The files are here:

Compile and link:

]==> arm-none-eabi-gcc -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mthumb -std=c11 -O0 -Wall -pedantic -ffunction-sections -fdata-sections -c main.c -g
]==> arm-none-eabi-gcc -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mthumb -std=c11 -O0 -Wall -pedantic -ffunction-sections -fdata-sections -c TM4C_startup.c -g
]==> arm-none-eabi-ld -T TM4C.ld TM4C_startup.o main.o -o main --gc-sections

Let's see what we have in the resulting binary:

]==> arm-none-eabi-objdump -h main

main:     file format elf32-littlearm

Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         00000484  00000000  00000000  00010000  2**2
  1 .data         00000004  20000000  00000484  00020000  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  2 .bss          00000004  20000004  00000488  00020004  2**2

The .text section starts at 0x00000000 for both VMA and LMA. The .data section starts at 0x00000484 LMA (in FLASH), but the code expects it to start at 0x20000000 VMA (in RAM). The symbol addresses seem to match the expectations as well:

]==> arm-none-eabi-objdump -t main | grep vma
20000004 g       .bss   00000000 __bss_start_vma
00000484 g       .text  00000000 __text_end_vma
20000008 g       .bss   00000000 __bss_end_vma
20000000 g       .data  00000000 __data_start_vma
20000004 g       .data  00000000 __data_end_vma

We now need to create a raw binary file that we can flash to the board. The arm-none-eabi-objcopy utility can take the relevant sections and put them in an output file aligned according to their LMAs.

]==> arm-none-eabi-objcopy -O binary main main.bin
]==> stat --format=%s main.bin

The total size of the raw binary matches the sum of the sizes of the .text and .data sections (0x488 == 1160). Let's flash it and see if it works!

]==> lm4flash main.bin
Found ICDI device with serial: xxxxxxxx
ICDI version: 9270


Get the full code at GitHub.

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion


I have implemented all of the interesting functions listed here and, thus, reached my goal. There were quite a few surprises. I had expected some things to be more complicated than they are. Conversely, some things that had seemed simple turned out to be quite complex.

  1. I had initially hoped that I would be able to re-use much of glibc and concentrate only on the thread-specific functionality. I was surprised to discover how much of glibc code refers to thread-local storage.
  2. I had expected the interaction between join and detach to be much simpler to handle. Having to implement descriptor caching was an unexpected event.
  3. I had never heard of pthread_once before.
  4. I had not used much of the real-time functionality before, so figuring out the scheduling part was very entertaining. I especially enjoyed implementing the PRIO_INHERIT mutex.

I may revisit this project in the future because there are still some things that I would like to learn more about.

  1. If I have the time to learn DWARF, I would like to provide a proper .eh_frame for the signal trampoline. It would allow me to implement cancellation using stack unwinding the way glibc does it.
  2. I may look into the inter-process synchronization to learn about the robust futexes.
  3. The Intel article on lock elision seemed interesting, and I'd like to play with this stuff as well.
  4. I may have a look at the compiler-generated TLS.

The End

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion

Condition Variables

Condition variables are a mechanism used for signaling that a certain predicate has become true. The POSIX mechanism for handling them boils down to three functions: cond_wait, cond_signal and cond_broadcast. The first one causes the thread that calls it to wait. The second wakes a thread up so that it can verify whether the condition is true. The third wakes all the waiters up.

The waiter

tb_futex_lock(&cond->lock);
...
++cond->waiters;
int bseq = cond->broadcast_seq;
int futex = cond->futex;
tb_futex_unlock(&cond->lock);

while(1) {
  st = SYSCALL3(__NR_futex, &cond->futex, FUTEX_WAIT, futex);
  if(st == -EINTR)
    continue;

  tb_futex_lock(&cond->lock);
  if(cond->signal_num) {
    --cond->signal_num;
    goto exit;
  }

  if(bseq != cond->broadcast_seq)
    goto exit;
  tb_futex_unlock(&cond->lock);
}

The algorithm is as follows:

  1. We lock the internal lock.
  2. We remember the value of the broadcast sequence and the futex.
  3. In a loop, we wait for the futex.
  4. We go back to sleeping if the FUTEX_WAIT syscall was interrupted by a signal.
  5. We consume a signal, if we can, and exit.
  6. If there was a broadcast, we exit too.
  7. Otherwise, we go back to sleep.


We wake one of the threads up. We bump the value of the futex to prevent a deadlock. Then we bump the number of signals and wake the futex.

    int tbthread_cond_signal(tbthread_cond_t *cond)
    {
      tb_futex_lock(&cond->lock);
      if(cond->waiters == cond->signal_num)
        goto exit;
      ++cond->futex;
      ++cond->signal_num;
      SYSCALL3(__NR_futex, &cond->futex, FUTEX_WAKE, 1);
    exit:
      tb_futex_unlock(&cond->lock);
      return 0;
    }
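
To make the role of the futex bump a bit more concrete, here is a minimal single-threaded sketch. It is not the Thread Bites code: the sketch_ names are made up for illustration, the internal lock is left out, and it goes through the libc syscall() wrapper, which reports errors as -1 plus errno instead of returning a negative error code. The waiter takes a snapshot of the futex, the signaler bumps the futex before the waiter gets to sleep, and FUTEX_WAIT then refuses to block on the stale snapshot.

```c
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical, trimmed-down condition variable state (no internal lock). */
struct sketch_cond {
  uint32_t futex;      /* bumped on every signal */
  uint32_t waiters;    /* threads that announced they are going to wait */
  uint32_t signal_num; /* signals that have not been consumed yet */
};

/* Signaler side: bump the futex word first, then wake one sleeper. */
void sketch_cond_signal(struct sketch_cond *c)
{
  if (c->waiters == c->signal_num)
    return;
  ++c->futex;
  ++c->signal_num;
  syscall(SYS_futex, &c->futex, FUTEX_WAKE, 1, NULL, NULL, 0);
}

/* Waiter side: sleep only if the futex still equals the snapshot.
   Returns 0 after a wake-up, or the errno value reported by the kernel. */
int sketch_cond_sleep(struct sketch_cond *c, uint32_t snapshot)
{
  if (syscall(SYS_futex, &c->futex, FUTEX_WAIT, snapshot, NULL, NULL, 0) == -1)
    return errno;
  return 0;
}
```

If sketch_cond_signal runs between taking the snapshot and calling sketch_cond_sleep, the sleep call comes back immediately with EAGAIN (the same value as EWOULDBLOCK on Linux), and the waiter can go on to check signal_num instead of sleeping forever.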


Broadcasting wakes all the waiters. The algorithm is essentially the same as for signal, except that we bump the broadcast sequence number and wake all the threads instead of just one.

    int tbthread_cond_broadcast(tbthread_cond_t *cond)
    {
      tb_futex_lock(&cond->lock);
      if(!cond->waiters)
        goto exit;
      ++cond->futex;
      ++cond->broadcast_seq;
      SYSCALL3(__NR_futex, &cond->futex, FUTEX_WAKE, INT_MAX);
    exit:
      tb_futex_unlock(&cond->lock);
      return 0;
    }

See the full patch at GitHub.

RW Locks

A read-write lock protects a critical section by allowing multiple readers when there are no writers. We won't bother implementing lock attributes handling because we don't support process-shared locks (irrelevant in our case) and we don't let the user prefer readers (a non-POSIX extension). The implementation remembers the ID of the current writer if any. It also counts readers as well as the queued writers. We use two futexes, one to block the readers and one to block the writers.

A writer first bumps the number of queued writers. If there is no other writer and no readers, it marks itself as the owner of the lock and decrements the number of queued writers. It goes to sleep on the writer futex otherwise.

    int tbthread_rwlock_wrlock(tbthread_rwlock_t *rwlock)
    {
      int queued = 0;
      while(1) {
        tb_futex_lock(&rwlock->lock);

        if(!queued) {
          queued = 1;
          ++rwlock->writers_queued;
        }

        if(!rwlock->writer && !rwlock->readers) {
          rwlock->writer = tbthread_self();
          --rwlock->writers_queued;
          tb_futex_unlock(&rwlock->lock);
          return 0;
        }
        int sleep_status = rwlock->wr_futex;

        tb_futex_unlock(&rwlock->lock);

        SYSCALL3(__NR_futex, &rwlock->wr_futex, FUTEX_WAIT, sleep_status);
      }
    }

A reader acquires the lock if there are no writers at all. It goes to sleep on the reader futex otherwise.

    int tbthread_rwlock_rdlock(tbthread_rwlock_t *rwlock)
    {
      while(1) {
        tb_futex_lock(&rwlock->lock);

        if(!rwlock->writer && !rwlock->writers_queued) {
          ++rwlock->readers;
          tb_futex_unlock(&rwlock->lock);
          return 0;
        }
        int sleep_status = rwlock->rd_futex;

        tb_futex_unlock(&rwlock->lock);

        SYSCALL3(__NR_futex, &rwlock->rd_futex, FUTEX_WAIT, sleep_status);
      }
    }

When unlocking, we use the writer field to determine whether we were a reader or a writer. If we were a writer, we had exclusive ownership of the lock. Therefore, we need to either wake another writer or all of the readers, depending on the state of the counters. If we were a reader, we had a non-exclusive lock. Therefore, we only need to wake a writer when we're the last reader and there is a writer queued. We bump the value of the futex because we want to handle the cases when FUTEX_WAKE was called before the other thread managed to call FUTEX_WAIT.

    int tbthread_rwlock_unlock(tbthread_rwlock_t *rwlock)
    {
      tb_futex_lock(&rwlock->lock);
      if(rwlock->writer) {
        rwlock->writer = 0;
        if(rwlock->writers_queued) {
          __sync_fetch_and_add(&rwlock->wr_futex, 1);
          SYSCALL3(__NR_futex, &rwlock->wr_futex, FUTEX_WAKE, 1);
        } else {
          __sync_fetch_and_add(&rwlock->rd_futex, 1);
          SYSCALL3(__NR_futex, &rwlock->rd_futex, FUTEX_WAKE, INT_MAX);
        }
        goto exit;
      }

      --rwlock->readers;
      if(!rwlock->readers && rwlock->writers_queued) {
        __sync_fetch_and_add(&rwlock->wr_futex, 1);
        SYSCALL3(__NR_futex, &rwlock->wr_futex, FUTEX_WAKE, 1);
      }

    exit:
      tb_futex_unlock(&rwlock->lock);
      return 0;
    }
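
The admission and wake-up rules are easier to check once they are separated from the futex calls. The sketch below does just that. It is only an illustration: the sketch_ names and the enum are made up, and there is no locking or sleeping in it, only the decisions that the three functions above make while holding the internal lock.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping, mirroring the counters described in the text. */
struct sketch_rwlock {
  int writer;         /* non-zero while a writer owns the lock */
  int readers;        /* number of readers holding the lock */
  int writers_queued; /* number of writers waiting for the lock */
};

/* A reader gets in only if there are no writers at all, owning or queued. */
bool sketch_reader_may_enter(const struct sketch_rwlock *l)
{
  return !l->writer && !l->writers_queued;
}

/* A writer gets in only if nobody else holds the lock. */
bool sketch_writer_may_enter(const struct sketch_rwlock *l)
{
  return !l->writer && !l->readers;
}

enum sketch_wake { SKETCH_WAKE_NONE, SKETCH_WAKE_ONE_WRITER, SKETCH_WAKE_ALL_READERS };

/* Update the counters for an unlock and say whom the caller should wake. */
enum sketch_wake sketch_unlock(struct sketch_rwlock *l)
{
  if (l->writer) {
    l->writer = 0;
    return l->writers_queued ? SKETCH_WAKE_ONE_WRITER : SKETCH_WAKE_ALL_READERS;
  }
  --l->readers;
  if (!l->readers && l->writers_queued)
    return SKETCH_WAKE_ONE_WRITER;
  return SKETCH_WAKE_NONE;
}
```

With two readers inside and one writer queued, a new reader is turned away, the first unlock wakes nobody, and the second one wakes the queued writer.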

See the full patch at GitHub.

Thread Scheduling

The scheduler decides which thread to run next based on two parameters: the scheduling policy and the priority.

Conceptually, the scheduler maintains a list of runnable threads for each possible sched_priority value. In order to determine which thread runs next, the scheduler looks for the nonempty list with the highest static priority and selects the thread at the head of this list.

A thread's scheduling policy determines where it will be inserted into the list of threads with equal static priority and how it will move inside this list.

(Quoted from the sched(7) man page.)

The supported policies are:

  • SCHED_NORMAL - It's the default Linux policy for threads not requiring any real-time machinery. All the threads have priority of 0, and the decision of which thread gets run next is based on the nice mechanism.
  • SCHED_FIFO - The threads have priority from 1 (low) to 99 (high). When a SCHED_FIFO thread becomes runnable, it will preempt any thread of lower priority. There is no time slicing, so the highest-priority thread runs as long as it must.
  • SCHED_RR - RR stands for round-robin. It's the same as SCHED_FIFO except each thread is allowed to run for a limited time quantum.

See the sched(7) man page for more details.
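
The priority ranges listed above can be queried at run time instead of being hard-coded. Here is a small sketch that uses the libc wrappers sched_get_priority_min and sched_get_priority_max rather than raw syscalls. The helper name sketch_priority_valid is made up, and SCHED_OTHER is the name that the libc headers use for the policy called SCHED_NORMAL above.

```c
#include <sched.h>

/* Return 1 if the priority is within the range the system reports for the
   given policy, 0 otherwise (including when the policy is unknown). */
int sketch_priority_valid(int policy, int priority)
{
  int lo = sched_get_priority_min(policy);
  int hi = sched_get_priority_max(policy);
  if (lo == -1 || hi == -1)
    return 0;
  return priority >= lo && priority <= hi;
}
```

On Linux, this accepts 1 through 99 for SCHED_FIFO and SCHED_RR, and only 0 for SCHED_OTHER.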

Setting the scheduling parameters of a kernel task boils down to invoking the sys_sched_setscheduler syscall, as shown below.

    struct tb_sched_param
    {
      int sched_priority;
    };

    int tb_set_sched(tbthread_t thread, int policy, int priority)
    {
      struct tb_sched_param p;
      p.sched_priority = priority;
      int ret = SYSCALL3(__NR_sched_setscheduler, thread->tid, policy, &p);
      if(!ret)
        thread->sched_info = SCHED_INFO_PACK(policy, priority);
      return ret;
    }

There is a bit more fun to it, though. As you can see, the kernel can only schedule a task that already exists. Therefore, we need to have a way to set the thread's priority before this thread invokes the user function. The reason for this is that we may need to abort this thread immediately should the scheduler setting fail. We do it by having another futex that we wake when we know whether the thread can run or not:

    if(th->start_status != TB_START_OK) {
      SYSCALL3(__NR_futex, &th->start_status, FUTEX_WAIT, TB_START_WAIT);
      if(th->start_status == TB_START_EXIT)
        SYSCALL1(__NR_exit, 0);
    }

This futex is initialized in the tbthread_create function depending on the thread attributes:

    if(!attr->sched_inherit)
      (*thread)->start_status = TB_START_WAIT;
    else {
      tbthread_t self = tbthread_self();
      (*thread)->sched_info = self->sched_info;
    }

And then set to either TB_START_OK or TB_START_EXIT after we spawn the thread but before tbthread_create exits:

    if(!attr->sched_inherit) {
      ret = tb_set_sched(*thread, attr->sched_policy, attr->sched_priority);

      if(ret)
        (*thread)->start_status = TB_START_EXIT;
      else
        (*thread)->start_status = TB_START_OK;
      SYSCALL3(__NR_futex, &(*thread)->start_status, FUTEX_WAKE, 1);

      if(ret) {
        wait_for_thread(*thread);
        goto error;
      }
    }

See the patch at GitHub.

Priority mutexes

Priority mutexes come in three varieties:

  • TBTHREAD_PRIO_NONE - Acquiring this type of mutex does not change the scheduling characteristics of the thread.
  • TBTHREAD_PRIO_INHERIT - When the mutex owner blocks another thread with a higher priority, the owner inherits the priority of the thread it blocks if it's higher than its own.
  • TBTHREAD_PRIO_PROTECT - Acquiring this kind of mutex raises the priority of the owner to the prioceiling value of the mutex.

Thread Bites implements this functionality by keeping lists of PRIO_INHERIT and PRIO_PROTECT mutexes. It then calculates the highest possible priority taking into account the priority of the mutexes and the priority set by the user.
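
As a rough illustration of that calculation, the effective priority can be thought of as the maximum over the user-set priority and whatever the held mutexes contribute. The sketch below assumes a simple singly linked list and made-up names (sketch_held_mutex, sketch_effective_priority); the actual Thread Bites data structures are different.

```c
#include <stddef.h>

/* Hypothetical list node: one entry per mutex the thread currently holds. */
struct sketch_held_mutex {
  int priority; /* the ceiling, or the priority inherited through the mutex */
  struct sketch_held_mutex *next;
};

/* Return the larger of best and the highest priority found on the list. */
int sketch_list_max(const struct sketch_held_mutex *m, int best)
{
  for (; m != NULL; m = m->next)
    if (m->priority > best)
      best = m->priority;
  return best;
}

/* Highest priority among the user setting and both mutex lists. */
int sketch_effective_priority(int user_priority,
                              const struct sketch_held_mutex *protect_list,
                              const struct sketch_held_mutex *inherit_list)
{
  int prio = user_priority;
  prio = sketch_list_max(protect_list, prio);
  prio = sketch_list_max(inherit_list, prio);
  return prio;
}
```

Recomputing from scratch on every lock and unlock keeps the logic simple: releasing a mutex just removes its entry, and the priority falls back to the next highest contribution.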

The implementation of PRIO_PROTECT is relatively straightforward. Whenever a thread acquires this kind of mutex, it is added to the list, and the priority of the thread is recalculated:

    static int lock_prio_protect(tbthread_mutex_t *mutex)
    {
      lock_prio_none(mutex);
      tb_protect_mutex_sched(mutex);
      return 0;
    }

    static int unlock_prio_protect(tbthread_mutex_t *mutex)
    {
      tbthread_t self = tbthread_self();
      tb_protect_mutex_unsched(mutex);
      unlock_prio_none(mutex);
      return 0;
    }

Implementing PRIO_INHERIT is a lot more tricky. We add the mutex to the appropriate list when a thread acquires it. Whenever a higher priority thread tries to lock the mutex, it bumps the priority of the blocker. But the priority recalculation is done only at this point. Implementing it like this covers all the main cases and is not horrendously hard. It allows for simple recursion: if the owner of a mutex gets blocked, the blocker inherits the priority that comes with the first mutex. It also has a couple of drawbacks:

  • It assumes that the kernel will always wake the highest priority thread. It makes sense and is most likely the case. However, I have not tested it.
  • If the owner of a PRIO_INHERIT mutex is already blocked on another mutex of the same kind and its priority gets bumped later, the last thread in the line won't be affected.

    static int lock_prio_inherit(tbthread_mutex_t *mutex)
    {
      tbthread_t self = tbthread_self();

      while(1) {
        int locked = 0;
        tb_futex_lock(&mutex->internal_futex);
        if(mutex->futex == 0) {
          locked = 1;
          mutex->owner = self;
          mutex->futex = 1;
          tb_inherit_mutex_add(mutex);
        }
        else
          tb_inherit_mutex_sched(mutex, self);
        tb_futex_unlock(&mutex->internal_futex);
        if(locked)
          return 0;
        SYSCALL3(__NR_futex, &mutex->futex, FUTEX_WAIT, 1);
      }
    }

    static int unlock_prio_inherit(tbthread_mutex_t *mutex)
    {
      tb_futex_lock(&mutex->internal_futex);
      tb_inherit_mutex_unsched(mutex);
      mutex->owner = 0;
      mutex->futex = 0;
      SYSCALL3(__NR_futex, &mutex->futex, FUTEX_WAKE, 1);
      tb_futex_unlock(&mutex->internal_futex);
      return 0;
    }

It was by far the most challenging part so far. See the patch at GitHub.

Remaining functions

  • pthread_setconcurrency - It defines how many kernel tasks should be created to handle the user-level threads. It does not make sense in our case because we create a kernel task for every thread.
  • pthread_attr_setscope - It defines the set of threads against which the thread will compete for resources. There are two settings: PTHREAD_SCOPE_SYSTEM meaning all the threads in the entire system and PTHREAD_SCOPE_PROCESS meaning only the threads within the process. The man page says that Linux only supports PTHREAD_SCOPE_SYSTEM, but I am not sure whether it's still the case with all the cgroups stuff.


Cancellation boils down to making one thread exit following a request from another thread. It seems that calling tbthread_exit at an appropriate point is enough to implement all of the behavior described in the man pages. We will go this way despite the fact that it is not the approach taken by glibc. Glibc unwinds the stack back to the point invoking the user-supplied thread function. This behavior allows it to simulate an exception if C++ code is using the library. We don't bother with C++ support for the moment and don't always care to supply valid DWARF information. Therefore, we will take the easier approach.

tbthread_setcancelstate and tbthread_setcanceltype are the two functions controlling the response of a thread to a cancellation request. The former enables or disables cancellation altogether, queuing the requests for later handling if necessary. The latter decides whether the thread should abort immediately or at a cancellation point. POSIX has a list of cancellation points, but we will not bother with them. Instead, we'll just use tbthread_testcancel and the two functions mentioned before for this purpose.

The thread must not get interrupted after it disables or defers cancellation. It would likely lead to deadlocks due to unreleased mutexes, memory leaks and such. The trick here is to update all the cancellation related flags atomically. So, we use one variable to handle the following flags:

  • TB_CANCEL_ENABLED: The cancellation is enabled; if a cancellation request has been queued, reaching a cancellation point will cause the thread to exit.
  • TB_CANCEL_DEFERRED: The cancellation is deferred (not asynchronous); SIGCANCEL will not be sent; see the paragraph on signal handling.
  • TB_CANCELING: A cancellation request has been queued; depending on other flags, SIGCANCEL may be sent.
  • TB_CANCELED: A cancellation request has been taken into account and the thread is in the process of exiting; this flag is used to handle the cases when a cancellation point has been reached before SIGCANCEL has been delivered by the kernel.

The tbthread_testcancel function looks as follows:

    void tbthread_testcancel()
    {
      tbthread_t thread = tbthread_self();
      uint8_t val, newval;

      while(1) {
        newval = val = thread->cancel_status;
        if(!(val & TB_CANCEL_ENABLED) || !(val & TB_CANCELING) ||
           (val & TB_CANCELED))
          return;
        newval |= TB_CANCELED;
        if(__sync_bool_compare_and_swap(&thread->cancel_status, val, newval))
          break;
      }
      tbthread_exit(TBTHREAD_CANCELED);
    }
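
To show how the other side of this flag word might look, here is a sketch of a function that queues a cancellation request with the same compare-and-swap pattern, next to a variant of the check above that only returns a verdict instead of exiting. The bit values, the SK_ and sketch_ names, and the rule for when to send the signal are my assumptions for illustration; the real code is in the patch.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical bit assignments for the four flags described above. */
enum {
  SK_CANCEL_ENABLED  = 0x01,
  SK_CANCEL_DEFERRED = 0x02,
  SK_CANCELING       = 0x04,
  SK_CANCELED        = 0x08
};

/* Queue a cancellation request. Returns 1 if the caller should follow up
   with SIGCANCEL: cancellation was enabled, asynchronous, and no request
   was pending before this call. */
int sketch_request_cancel(_Atomic uint8_t *status)
{
  uint8_t val = atomic_load(status);
  /* On failure, val is refreshed with the current value and we retry. */
  while (!atomic_compare_exchange_weak(status, &val,
                                       (uint8_t)(val | SK_CANCELING)))
    ;
  return (val & SK_CANCEL_ENABLED) && !(val & SK_CANCEL_DEFERRED) &&
         !(val & (SK_CANCELING | SK_CANCELED));
}

/* Cancellation-point check: returns 1 exactly once, for the caller that
   manages to set the CANCELED bit on an enabled, pending request. */
int sketch_should_exit(_Atomic uint8_t *status)
{
  uint8_t val = atomic_load(status);
  do {
    if (!(val & SK_CANCEL_ENABLED) || !(val & SK_CANCELING) ||
        (val & SK_CANCELED))
      return 0;
  } while (!atomic_compare_exchange_weak(status, &val,
                                         (uint8_t)(val | SK_CANCELED)));
  return 1;
}
```

Because every transition goes through a compare-and-swap on the whole byte, a request that races with a state change either sees the old flags or the new ones, never a mix.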

See the full patch at GitHub.

Clean-up handlers

The user may register a bunch of functions cleaning up the mess caused by an unexpected interruption. They are installed with tbthread_cleanup_push and called when the thread exits abnormally. The purpose of these functions is to unlock mutexes, free the heap memory and such. tbthread_cleanup_pop removes them and optionally executes them in the process.

    void tbthread_cleanup_push(void (*func)(void *), void *arg)
    {
      tbthread_t self = tbthread_self();
      struct cleanup_elem *e = malloc(sizeof(struct cleanup_elem));
      e->func = func;
      e->arg = arg;
      list_add_elem(&self->cleanup_handlers, e, 1);
    }

    void tbthread_cleanup_pop(int execute)
    {
      tbthread_t self = tbthread_self();
      list_t *node = self->;
      if(!node)
        return;
      list_rm(node);
      struct cleanup_elem *e = (struct cleanup_elem*)node->element;
      if(execute)
        (*e->func)(e->arg);
      free(e);
      free(node);
    }
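
For illustration, here is a self-contained sketch of the same idea with the stack passed in explicitly instead of being looked up through tbthread_self. All the sketch_ names are made up, and the intrusive singly linked list is a simplification of the list helpers used above. It also shows the run-everything step that would happen on an abnormal exit.

```c
#include <stdlib.h>

/* Hypothetical handler record; the newest handler sits at the top. */
struct sketch_cleanup {
  void (*func)(void *);
  void *arg;
  struct sketch_cleanup *next;
};

/* Register a handler. Returns 0 on success, -1 if the allocation fails. */
int sketch_cleanup_push(struct sketch_cleanup **top,
                        void (*func)(void *), void *arg)
{
  struct sketch_cleanup *e = malloc(sizeof(*e));
  if (!e)
    return -1;
  e->func = func;
  e->arg = arg;
  e->next = *top;
  *top = e;
  return 0;
}

/* Remove the most recently pushed handler, running it if execute is set. */
void sketch_cleanup_pop(struct sketch_cleanup **top, int execute)
{
  struct sketch_cleanup *e = *top;
  if (!e)
    return;
  *top = e->next;
  if (execute)
    e->func(e->arg);
  free(e);
}

/* What an abnormal exit would do: run all handlers, newest first. */
void sketch_cleanup_run_all(struct sketch_cleanup **top)
{
  while (*top)
    sketch_cleanup_pop(top, 1);
}

/* Example handler: appends a digit to a decimal log to expose the order. */
struct sketch_note { int *log; int digit; };

void sketch_note_handler(void *arg)
{
  struct sketch_note *n = arg;
  *n->log = *n->log * 10 + n->digit;
}
```

Pushing a handler with digit 1 and then one with digit 2, and running them all, leaves 21 in the log: the handlers run in the reverse order of registration.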

See the full patch at GitHub.

Signals and asynchronous cancellation

The asynchronous cancellation uses the first real-time signal, SIGRTMIN, that we call SIGCANCEL here for clarity.

Registering a signal handler is somewhat more tricky than just calling the appropriate syscall. It is so because, on x86_64, we need to provide a function that restores the stack after the signal handler returns. The function is called a signal trampoline and its purpose is to invoke sys_rt_sigreturn. The trampoline is registered with the kernel using a special sigaction flag:

    void __restore_rt();
    #define SA_RESTORER 0x04000000

    int tbsigaction(int signum, struct sigaction *act, struct sigaction *old)
    {
      act->sa_flags |= SA_RESTORER;
      act->sa_restorer = __restore_rt;
      return SYSCALL4(__NR_rt_sigaction, signum, act, old, sizeof(sigset_t));
    }

The trampoline itself, called __restore_rt here, is defined in assembly as follows:

    .text

      .global __restore_rt
      .type   __restore_rt,@function
      .align  16

    __restore_rt:
      movq $__NR_rt_sigreturn, %rax
      syscall

Looking at the corresponding glibc code, you can see that they add the eh_frame info here. The comments say that it is to aid gdb and handle the stack unwinding. I don't know enough DWARF to write one on my own, gdb does not seem to be utterly confused without it, and we won't do stack unwinding, so we just won't bother with it for the moment.

In the cancellation handler, we first check whether it's the right signal and that it has been sent by a thread in the same thread group. We then need to check whether the thread is still in the asynchronous cancellation mode. It might have changed between the time the signal was sent and the time it is delivered. Finally, we call tbthread_testcancel to see if the thread should exit.

    void tb_cancel_handler(int sig, siginfo_t *si, void *ctx)
    {
      if(sig != SIGCANCEL || si->si_pid != tb_pid || si->si_code != SI_TKILL)
        return;

      tbthread_t self = tbthread_self();
      if(self->cancel_status & TB_CANCEL_DEFERRED)
        return;

      tbthread_testcancel();
    }
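
The sender checks in the handler can be tried out in isolation. The sketch below is not the Thread Bites code: it goes through glibc's sigaction, which installs its own trampoline, and it uses glibc's SIGRTMIN, which is not the same number as the kernel's first real-time signal because glibc reserves a couple for itself. The sketch_ names are made up. The thread sends the signal to itself with tgkill, and the handler accepts it only if it carries our PID and the SI_TKILL code.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Set by the handler when a signal passes all the sender checks. */
static volatile sig_atomic_t sketch_accepted;

void sketch_handler(int sig, siginfo_t *si, void *ctx)
{
  (void)ctx;
  if (sig != SIGRTMIN || si->si_pid != getpid() || si->si_code != SI_TKILL)
    return;
  sketch_accepted = 1;
}

/* Install the handler with SA_SIGINFO and make sure the signal is unblocked. */
int sketch_install(void)
{
  struct sigaction sa;
  sigset_t set;
  memset(&sa, 0, sizeof(sa));
  sa.sa_sigaction = sketch_handler;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  if (sigaction(SIGRTMIN, &sa, NULL) != 0)
    return -1;
  sigemptyset(&set);
  sigaddset(&set, SIGRTMIN);
  return sigprocmask(SIG_UNBLOCK, &set, NULL);
}

/* Send the signal to the calling thread through the tgkill syscall. */
int sketch_signal_self(void)
{
  pid_t tid = (pid_t)syscall(SYS_gettid);
  return (int)syscall(SYS_tgkill, getpid(), tid, SIGRTMIN);
}

int sketch_was_accepted(void)
{
  return sketch_accepted;
}
```

A signal sent with kill or sigqueue from another process would fail the si_pid or si_code test and be ignored, which is the point of the checks.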

We invoke sys_tgkill to send the signal:

    SYSCALL3(__NR_tgkill, tb_pid, thread->exit_futex, SIGCANCEL);

See the full patch at GitHub.

Cancellation of a "once" function

The implementation of tbthread_once gets quite a bit more interesting as well. If the thread invoking the initialization function gets canceled, another thread needs to pick it up. We need to install a cleanup handler that will change the state of the once control back to TB_ONCE_NEW and wake all the threads so that they could restart from the beginning:

    static void once_cleanup(void *arg)
    {
      tbthread_once_t *once = (tbthread_once_t *)arg;
      *once = TB_ONCE_NEW;
      SYSCALL3(__NR_futex, once, FUTEX_WAKE, INT_MAX);
    }

    int tbthread_once(tbthread_once_t *once, void (*func)(void))
    {
      if(!once || !func)
        return -EINVAL;

      int cancel_state;

      while(1) {
        if(*once == TB_ONCE_DONE)
          return 0;

        //--------------------------------------------------------------------------
        // The executor
        //--------------------------------------------------------------------------
        tbthread_setcancelstate(TBTHREAD_CANCEL_DISABLE, &cancel_state);
        if(__sync_bool_compare_and_swap(once, TB_ONCE_NEW, TB_ONCE_IN_PROGRESS)) {
          tbthread_cleanup_push(once_cleanup, once);
          tbthread_setcancelstate(cancel_state, 0);

          (*func)();

          tbthread_setcancelstate(TBTHREAD_CANCEL_DISABLE, &cancel_state);
          tbthread_cleanup_pop(0);

          *once = TB_ONCE_DONE;
          SYSCALL3(__NR_futex, once, FUTEX_WAKE, INT_MAX);
          tbthread_setcancelstate(cancel_state, 0);
          return 0;
        }

        tbthread_setcancelstate(cancel_state, 0);

        //--------------------------------------------------------------------------
        // The waiters
        //--------------------------------------------------------------------------
        while(1) {
          SYSCALL3(__NR_futex, once, FUTEX_WAIT, TB_ONCE_IN_PROGRESS);
          if(*once != TB_ONCE_IN_PROGRESS)
            break;
        }
      }
    }

See the patch at GitHub.

Recycling the thread descriptors

How do we know when the thread task has died? This is what the CLONE_CHILD_SETTID and CLONE_CHILD_CLEARTID flags to sys_clone are for. If they are set, the kernel will store the new thread's TID at the location pointed to by the ctid argument (see tb #1). When the thread terminates, the kernel will set the TID to 0 and wake the futex at this location. It is a convenient way to wait for a thread to finish. Unfortunately, as far as I can tell, there is no way to unset these flags, and it makes implementing tbthread_detach a pain. We cannot delete the thread descriptor in the thread it refers to anymore. Doing so would cause the kernel to write to a memory location that might have been either unmapped or reused. Therefore, we need to have some sort of a cache holding thread descriptors and make sure that we re-use them only after the thread they were referring to before has exited. Thread Bites uses two linked lists to maintain this cache, and the descriptor allocation function calls the following procedure to wait until the corresponding task is gone:

    static void wait_for_thread(tbthread_t thread)
    {
      uint32_t tid = thread->exit_futex;
      long ret = 0;
      if(tid != 0)
        do {
          ret = SYSCALL3(__NR_futex, &thread->exit_futex, FUTEX_WAIT, tid);
        } while(ret != -EWOULDBLOCK && ret != 0);
    }

See the full patch at GitHub.

Joining threads

Joining complicates things a bit further because it does quite a bit of error checking to prevent deadlocks and such. To perform these checks, the thread calling tbthread_join needs to have a valid thread descriptor obtainable by tbthread_self. The problem is that we have never set this thread descriptor up for the main thread, and we need to do it by hand at the beginning of the program. The original state needs to be restored at the end because glibc uses it internally and not cleaning things up causes segfaults.

    static void *glibc_thread_desc;
    void tbthread_init()
    {
      glibc_thread_desc = tbthread_self();
      tbthread_t thread = malloc(sizeof(struct tbthread));
      memset(thread, 0, sizeof(struct tbthread));
      thread->self = thread;
      SYSCALL2(__NR_arch_prctl, ARCH_SET_FS, thread);
    }

    void tbthread_finit()
    {
      free(tbthread_self());
      SYSCALL2(__NR_arch_prctl, ARCH_SET_FS, glibc_thread_desc);
    }

After performing all the validity and deadlock checks, the meat of tbthread_join is rather simple:

    wait_for_thread(thread);
    if(retval)
      *retval = thread->retval;
    release_descriptor(thread);
    return 0;

See the full patch at GitHub.

Dynamic initialization

pthread_once is an interesting beast. Its purpose is to dynamically initialize some resources by calling a designated function exactly once. The fun part is that the actual initialization call may be made from multiple threads at the same time. pthread_once_t, therefore, is kind of like a mutex, but has three states instead of two:

  • new: the initialization function has not been called yet; one of the threads needs to call it.
  • in progress: the initialization function is running; the threads are waiting for it to finish.
  • done: the initialization function is done; all the threads may be woken up.

The thread that manages to change the state from new to in progress gets to call the function. All the other threads wait until the done state is reached.

    int tbthread_once(tbthread_once_t *once, void (*func)(void))
    {
      if(!once || !func)
        return -EINVAL;

      if(*once == TB_ONCE_DONE)
        return 0;

      if(__sync_bool_compare_and_swap(once, TB_ONCE_NEW, TB_ONCE_IN_PROGRESS)) {
        (*func)();
        *once = TB_ONCE_DONE;
        SYSCALL3(__NR_futex, once, FUTEX_WAKE, INT_MAX);
        return 0;
      }

      // sleep until the executor moves the state away from "in progress"
      while(*once != TB_ONCE_DONE)
        SYSCALL3(__NR_futex, once, FUTEX_WAIT, TB_ONCE_IN_PROGRESS);
      return 0;
    }
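
For experimenting outside of the library, the same three-state logic can be sketched with C11 atomics and the libc syscall() wrapper. This is an assumption-laden stand-in rather than the Thread Bites code: the SK_ and sketch_ names are made up, and the counter at the bottom only exists to make the example observable.

```c
#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

enum { SK_ONCE_NEW = 0, SK_ONCE_IN_PROGRESS = 1, SK_ONCE_DONE = 2 };
typedef _Atomic uint32_t sketch_once_t;

int sketch_once(sketch_once_t *once, void (*func)(void))
{
  uint32_t expected = SK_ONCE_NEW;

  if (atomic_load(once) == SK_ONCE_DONE)
    return 0;

  /* Only one thread wins the new -> in progress transition. */
  if (atomic_compare_exchange_strong(once, &expected, SK_ONCE_IN_PROGRESS)) {
    func();
    atomic_store(once, SK_ONCE_DONE);
    syscall(SYS_futex, once, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
    return 0;
  }

  /* Everybody else sleeps for as long as the state is "in progress". */
  while (atomic_load(once) != SK_ONCE_DONE)
    syscall(SYS_futex, once, FUTEX_WAIT, SK_ONCE_IN_PROGRESS, NULL, NULL, 0);
  return 0;
}

/* Example initializer that counts how many times it has been run. */
static int sketch_init_calls;
void sketch_init(void) { ++sketch_init_calls; }
int sketch_init_count(void) { return sketch_init_calls; }
```

Calling sketch_once twice on the same control variable runs sketch_init a single time; the second call returns right away because the state is already done.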

Side effects

The original glibc thread descriptor stores the localization information for the thread. Changing it to ours causes seemingly simple functions, like strerror, to segfault. Therefore, we need to implement strerror ourselves.


This part discusses an implementation of a mutex. It will not be a particularly efficient mutex, but it will be an understandable and straightforward one. We will not bother minimizing the number of system calls or implementing lock elision. We will also not handle the case of inter-process communication. Therefore, the process-shared and robust mutexes will not be discussed. If you are interested in these ideas, I recommend the kernel's documentation file on robust mutexes and Intel's blog post on lock elision. The scheduling related parameters will likely be dealt with on another occasion.


POSIX defines a bunch of different mutexes. See the manpage for pthread_mutexattr_settype to learn more. On Linux, all of them are implemented using the same locking primitive - a Futex (Fast User-Space Mutex). It is a 4-byte long chunk of aligned memory. Its contents can be updated atomically, and its address can be used to refer to a kernel-level process queue. The kernel interface is defined as follows:

    asmlinkage long sys_futex(u32 __user *uaddr, int op, u32 val,
                             struct timespec __user *utime, u32 __user *uaddr2,
                             u32 val3);

We will only use the first four out of the six parameters here. The first one is the address of the futex, and the second one is the type of the operation to be performed. The meaning of the remaining parameters depends on the context. We will need only two of the available operations to implement a mutex:

  • FUTEX_WAIT puts a thread to sleep if the value passed in val is the same as the value stored in the memory pointed to by *uaddr. Optionally, the sleep time may be limited by passing a pointer to a timespec object. The return values are:
    • 0, if the thread was woken up by FUTEX_WAKE.
    • EWOULDBLOCK, if the value of the futex was different than val.
    • EINTR, if the sleep was interrupted by a signal.
  • FUTEX_WAKE wakes the number of threads specified in val. In practice, it only makes sense to either wake one or all sleeping threads, so we pass either 1 or INT_MAX respectively.

See the original futex paper for more details.
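
If you want to poke at these two operations from an ordinary program, a pair of thin wrappers is enough. The sketch below uses the libc syscall() function instead of a hand-written SYSCALL macro, so errors show up as a return value of -1 with the code in errno rather than as a negative return value. The sketch_ names are made up.

```c
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sleep on addr if it still holds expected. Returns 0 when woken up,
   -1 with errno set (EWOULDBLOCK, EINTR, ...) otherwise. */
long sketch_futex_wait(uint32_t *addr, uint32_t expected)
{
  return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Wake up to count sleepers on addr. Returns the number of threads woken. */
long sketch_futex_wake(uint32_t *addr, int count)
{
  return syscall(SYS_futex, addr, FUTEX_WAKE, count, NULL, NULL, 0);
}
```

Waiting on a word that holds 0 while expecting 1 fails immediately with EWOULDBLOCK, and waking a word that nobody sleeps on reports 0 woken threads.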

Normal mutex

We start with a normal mutex because it is possible to implement all the other kinds using the procedures defined for it. If the value of the associated futex is 0, then the mutex is unlocked. Locking it means changing the value to 1. To avoid race conditions, both the checking and the changing need to be done atomically. GCC has a built-in function to do this that results in lock cmpxchgq or similar being emitted in assembly. If the locking fails, we need to wait until another thread releases the mutex and re-check the lock.

    static int lock_normal(tbthread_mutex_t *mutex)
    {
      while(1) {
        if(__sync_bool_compare_and_swap(&mutex->futex, 0, 1))
          return 0;
        SYSCALL3(__NR_futex, &mutex->futex, FUTEX_WAIT, 1);
      }
    }

The logic of trylock is essentially the same, except that we return a failure instead of sleeping.

    static int trylock_normal(tbthread_mutex_t *mutex)
    {
      if(__sync_bool_compare_and_swap(&mutex->futex, 0, 1))
        return 0;
      return -EBUSY;
    }

To unlock a mutex, we do the reverse. We change the futex value from 1 to 0 and wake one thread waiting for the futex to be released.

    static int unlock_normal(tbthread_mutex_t *mutex)
    {
      if(__sync_bool_compare_and_swap(&mutex->futex, 1, 0))
        SYSCALL3(__NR_futex, &mutex->futex, FUTEX_WAKE, 1);
      return 0;
    }

Note that the values stored in the futex are application-specific and arbitrary. The kernel does not care and does not change this variable except in one case, which we will discuss in a later chapter.

This mutex is not particularly efficient because we make a system call while unlocking regardless of whether there is a waiting thread or not. To see how the situation could be improved, please refer to Ulrich Drepper's Futexes Are Tricky.
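
To give a flavor of the improvement, here is a sketch in the spirit of the three-state scheme from that paper: 0 means unlocked, 1 means locked with no waiters, and 2 means locked with possible waiters. It is not part of Thread Bites; the sketch_ names are made up, and it uses C11 atomics and the libc syscall() wrapper. The point is that unlocking an uncontended mutex no longer needs a system call.

```c
#include <errno.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* state: 0 = unlocked, 1 = locked, 2 = locked and somebody may be waiting */
typedef struct { _Atomic uint32_t state; } sketch_mutex_t;

static long sketch_futex(_Atomic uint32_t *addr, int op, uint32_t val)
{
  return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

int sketch_trylock(sketch_mutex_t *m)
{
  uint32_t expected = 0;
  return atomic_compare_exchange_strong(&m->state, &expected, 1) ? 0 : -EBUSY;
}

void sketch_lock(sketch_mutex_t *m)
{
  uint32_t c = 0;
  if (atomic_compare_exchange_strong(&m->state, &c, 1))
    return;                       /* fast path: the mutex was free */
  if (c != 2)
    c = atomic_exchange(&m->state, 2);
  while (c != 0) {                /* sleep until we grab it in state 2 */
    sketch_futex(&m->state, FUTEX_WAIT, 2);
    c = atomic_exchange(&m->state, 2);
  }
}

void sketch_unlock(sketch_mutex_t *m)
{
  /* Only pay for the wake-up syscall if a waiter may exist (state was 2). */
  if (atomic_fetch_sub(&m->state, 1) != 1) {
    atomic_store(&m->state, 0);
    sketch_futex(&m->state, FUTEX_WAKE, 1);
  }
}
```

The price is that a thread that had to sleep always leaves the mutex in state 2, so its own unlock will make one possibly unnecessary FUTEX_WAKE call.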

Error-check mutex

These guys are the same as the ones discussed earlier, except that they do additional bookkeeping. An error-check mutex remembers who owns it, to report the following types of errors:

  • re-locking of a mutex the thread already owns
  • unlocking a mutex owned by another thread
  • unlocking a mutex that is not locked

The behavior of normal mutexes in these cases is undefined.

    static int lock_errorcheck(tbthread_mutex_t *mutex)
    {
      tbthread_t self = tbthread_self();
      if(mutex->owner == self)
        return -EDEADLK;
      lock_normal(mutex);
      mutex->owner = self;
      return 0;
    }

    static int trylock_errorcheck(tbthread_mutex_t *mutex)
    {
      int ret = trylock_normal(mutex);
      if(ret == 0)
        mutex->owner = tbthread_self();
      return ret;
    }

    static int unlock_errorcheck(tbthread_mutex_t *mutex)
    {
      if(mutex->owner != tbthread_self() || mutex->futex == 0)
        return -EPERM;
      mutex->owner = 0;
      unlock_normal(mutex);
      return 0;
    }

Recursive mutex

Recursive mutexes may be locked multiple times by the same thread and require the same number of unlock operations to be released. To provide this kind of functionality, we just need to add a counter.

    static int lock_recursive(tbthread_mutex_t *mutex)
    {
      tbthread_t self = tbthread_self();
      if(mutex->owner != self) {
        lock_normal(mutex);
        mutex->owner   = self;
      }
      if(mutex->counter == (uint64_t)-1)
        return -EAGAIN;
      ++mutex->counter;
      return 0;
    }

    static int trylock_recursive(tbthread_mutex_t *mutex)
    {
      tbthread_t self = tbthread_self();
      if(mutex->owner != self && trylock_normal(mutex))
        return -EBUSY;

      if(mutex->owner != self) {
        mutex->owner = self;
        mutex->counter = 1;
        return 0;
      }

      if(mutex->counter == (uint64_t)-1)
        return -EAGAIN;

      ++mutex->counter;
      return 0;
    }

    static int unlock_recursive(tbthread_mutex_t *mutex)
    {
      if(mutex->owner != tbthread_self())
        return -EPERM;
      --mutex->counter;
      if(mutex->counter == 0) {
        mutex->owner = 0;
        return unlock_normal(mutex);
      }
      return 0;
    }

Other code

This is pretty much it. The remaining part of the code is not interesting. Both mutex and mutexattr objects need to be initialized to their default values, but the futexes don't need any initialization or cleanup. As always, the full patch is available on GitHub.

Edit 28.03.2016: There are more details about the startup code in this post.


I have recently started playing with the Tiva launchpad. It's a pity, though, that most of the tutorials and course material out there show you how to program it only using something or other on Windows. I have even gone as far as installing it on my old laptop to follow some of these tutorials. But, I have quickly re-discovered the reasons for my dislike of Windows.

There are some great resources available explaining how to use the Stellaris board on Linux. Stellaris is a predecessor of Tiva, and much of this advice applies to Tiva as well. Everyone seems to use Make, though. I don't like it because generating source file dependencies and discovering libraries with it involves black magic and blood of goats. I decided, then, to add my two cents and create a template for CMake (GitHub). It works fine both with and without TivaWare and uses my BSD-licensed start-up files. To use it for your project, all you need to do is:

    #-------------------------------------------------------------------------------
    # Some boilerplate
    #-------------------------------------------------------------------------------
    cmake_minimum_required(VERSION 3.4)
    set(CMAKE_TOOLCHAIN_FILE ${CMAKE_SOURCE_DIR}/cmake/TM4C_toolchain.cmake)

    include(Firmware)

    #-------------------------------------------------------------------------------
    # Configure your project
    #-------------------------------------------------------------------------------
    project(tm4c-template)
    add_executable(tm4c-template.axf main.c tm4c/TM4C_startup.c)
    add_raw_binary(tm4c-template.bin tm4c-template.axf)
    target_link_libraries(tm4c-template.axf ${TIVAWARE_LIB})

And then:

]==> mkdir build
]==> cd build
]==> cmake ../
-- The CXX compiler identification is GNU 4.9.3
-- Check for working CXX compiler: /usr/bin/arm-none-eabi-c++
-- Check for working CXX compiler: /usr/bin/arm-none-eabi-c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /home/ljanyst/Temp/board/cmake/build
]==> make
Scanning dependencies of target tm4c-template.axf
[ 25%] Building C object CMakeFiles/tm4c-template.axf.dir/main.c.obj
[ 50%] Building C object CMakeFiles/tm4c-template.axf.dir/tm4c/TM4C_startup.c.obj
[ 75%] Linking C executable tm4c-template.axf
[ 75%] Built target tm4c-template.axf
Scanning dependencies of target tm4c-template.bin
[100%] Creating raw binary tm4c-template.bin
[100%] Built target tm4c-template.bin

Or, if you want TivaWare, do this instead:

]==> cmake .. -DTIVAWARE_PATH=/path/to/tivaware/


I wrote a short piece of code that lets you test things without the need for TivaWare. Go here and compile lm4flash; it needs libusb-1.0-0-dev on Debian.

]==> lm4flash tm4c-template.bin
Found ICDI device with serial: 0E21xxxx
ICDI version: 9270


You can slightly tweak the instructions from the tutorial to run a debugging session. Plug in the board and start an Open On-Chip Debugger session:

]==> openocd -f /usr/share/openocd/scripts/board/ek-tm4c123gxl.cfg
Open On-Chip Debugger 0.9.0 (2015-05-28-17:08)
Licensed under GNU GPL v2
For bug reports, read
Info : The selected transport took over low-level target control. The results might differ compared to plain JTAG/SWD
adapter speed: 500 kHz
Info : clock speed 32767 kHz
Info : ICDI Firmware version: 9270
Info : tm4c123gh6pm.cpu: hardware has 6 breakpoints, 4 watchpoints

Then, in another terminal window, run gdb as follows:

]==> cat gdb-embeded.init
target extended-remote :3333
monitor reset halt
monitor reset init
break main
]==> arm-none-eabi-gdb --command=gdb-embeded.init  tm4c-template.axf
GNU gdb (7.10-1+9) 7.10
Copyright (C) 2015 Free Software Foundation, Inc.

-- cut --

Breakpoint 1, main () at /home/ljanyst/Temp/board/cmake/main.c:113
113       init_sys_tick();
(gdb) n
114       init_gpio();
(gdb) n
116       unsigned long led = 0x02;
(gdb) n
119         unsigned long sw1 = !(GPIODATA_REG_PORTF & 0x01);
(gdb) n
120         unsigned long sw2 = !(GPIODATA_REG_PORTF & 0x10);
(gdb) p sw1
$1 = 0
(gdb) p /t *(unsigned long *)0x4005d3fc
$2 = 10001

Have fun!

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion


The second part of our threading journey covers thread-local storage. I do not mean here the compiler-generated TLS that I mentioned in the first part. I mean the stuff that involves pthread_setspecific and friends. I don't think it's the most critical part of all this threading business. I barely ever use it in practice. However, we will need to refer to the current thread all over the place, and this requires the TLS. It's better to deal with it once and for all, especially since it's not particularly complicated.


How does one store something in a fixed location and make it distinct for each execution thread? The only two answers that come to my mind are either using syscalls to associate something with the kernel's task_struct or using CPU registers. The first approach requires context switches to retrieve the value, so it's rather inefficient. The second option, though, should be pretty fast. Conveniently, on x86_64, some registers are left unused (see StackOverflow). In fact, the CLONE_SETTLS option to clone takes a pointer and puts it in the fs segment register for us, so we don't even need to make an extra syscall just for that.

Since fs is a segment register, we cannot retrieve its absolute value without engaging the operating system. Linux on x86_64 uses the arch_prctl syscall for this purpose:

#include "tb.h"
#include <asm/prctl.h>

tbthread_t tbthread_self()
{
  tbthread_t self;
  SYSCALL2(__NR_arch_prctl, ARCH_GET_FS, &self);
  return self;
}

This syscall seems expensive (link), and making it defeats the purpose of using a register in the first place. We can read the memory at an address relative to the value of the register, though. Using this fact, we can make our thread struct point to itself and then retrieve this pointer using inline assembly. Here's how:

tbthread_t tbthread_self()
{
  tbthread_t self;
  asm("movq %%fs:0, %0\n\t" : "=r" (self));
  return self;
}

For the main thread, the linker initializes the TLS segment for pthreads automatically using arch_prctl. We cannot use it for our purposes, but we can count on tbthread_self returning a unique, meaningful value.
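To convince myself that the two ways of finding the thread pointer agree, a small check like the one below can help. It is only a sketch and not part of Thread Bites: the function names are made up for illustration, it uses glibc's syscall wrapper instead of the SYSCALL macros, and it assumes that the C library stores a self-pointer at offset 0 of the fs segment, which is what the x86_64 TLS convention expects as far as I can tell.

```c
#include <stdint.h>

#if defined(__linux__) && defined(__x86_64__)
#include <asm/prctl.h>    /* ARCH_GET_FS */
#include <sys/syscall.h>  /* SYS_arch_prctl */
#include <unistd.h>       /* syscall */

/* Slow path: ask the kernel for the base address of the fs segment. */
uintptr_t fs_base_from_kernel(void)
{
  unsigned long base = 0;
  syscall(SYS_arch_prctl, ARCH_GET_FS, &base);
  return (uintptr_t)base;
}

/* Fast path: read the first word of the fs segment, no syscall involved. */
uintptr_t fs_self_pointer(void)
{
  uintptr_t self;
  __asm__ volatile("movq %%fs:0, %0" : "=r"(self));
  return self;
}

/* Returns 1 if both paths yield the same address, 0 otherwise. */
int fs_paths_agree(void)
{
  return fs_base_from_kernel() == fs_self_pointer();
}
#else
/* Not x86_64 Linux: there is nothing to compare, so report -1. */
int fs_paths_agree(void)
{
  return -1;
}
#endif
```

If the self-pointer assumption holds, fs_paths_agree() returns 1 in any thread, which is exactly the property tbthread_self relies on.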

See the full patch here.


The actual TLS is handled by the following four functions: pthread_key_create, pthread_key_delete, pthread_getspecific, pthread_setspecific. I will not explain what they do because it should be pretty self-evident. If it's not, see the man pages.

In principle, we could just have a hash table in the tbthread struct. We could then use setspecific and getspecific to set and retrieve the value associated with each given key. Calling setspecific with a NULL pointer would delete the key. Unfortunately, the designers of pthreads made it a bit more complicated by having separate key_create and key_delete functions, with the delete function invalidating the key in all the threads. Glibc uses a global array of keys and sequence numbers in a clever way to solve this problem. We will take almost the same approach in a less efficient but a bit clearer way.

We will have a global array representing keys. Each element of this array will host a pointer to a destructor and a sequence number. The index in the array will be the actual key passed around between the TLS functions. Both key_create and key_delete will bump the sequence number associated with a key in an atomic way. If the number is even, the key is not allocated; if the number is odd, it is.

static struct
{
  uint64_t seq;
  void (*destructor)(void *);
} keys[TBTHREAD_MAX_KEYS];

#define KEY_UNUSED(k) ((keys[k].seq&1) == 0)
#define KEY_ACQUIRE(k) (__sync_bool_compare_and_swap(&(keys[k].seq), keys[k].seq, keys[k].seq+1))
#define KEY_RELEASE(k) (__sync_bool_compare_and_swap(&(keys[k].seq), keys[k].seq, keys[k].seq+1))

int tbthread_key_create(tbthread_key_t *key, void (*destructor)(void *))
{
  for(int i = 0; i < TBTHREAD_MAX_KEYS; ++i) {
    if(KEY_UNUSED(i) && KEY_ACQUIRE(i)) {
      *key = i;
      keys[i].destructor = destructor;
      return 0;
    }
  }
  return -ENOMEM;
}

Each tbthread struct will hold an array of the same size as the global array of keys. Each element in this array will hold a data pointer and a sequence number. Storing data will set the sequence number as well. Retrieving the data will check whether the local and the global sequence number match before proceeding.

void *tbthread_getspecific(tbthread_key_t key)
{
  if(key >= TBTHREAD_MAX_KEYS || KEY_UNUSED(key))
    return 0;

  tbthread_t self = tbthread_self();
  if(self->tls[key].seq == keys[key].seq)
    return self->tls[key].data;
  return 0;
}
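To see the whole even/odd sequence-number idea in one place, here is a stand-alone sketch. It is not the Thread Bites code: all the demo_ names are invented for illustration, destructors are left out, and a single static array stands in for the per-thread storage, so it models one thread only.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_KEYS 4

/* Global key table: an odd sequence number means "allocated", even means "free". */
static uint64_t demo_key_seq[DEMO_MAX_KEYS];

/* Per-thread slots; this demo models a single thread. */
struct demo_slot { uint64_t seq; void *data; };
static struct demo_slot demo_tls[DEMO_MAX_KEYS];

/* Claim the first free key; returns its index, or -1 if none is left. */
int demo_key_create(void)
{
  for (int i = 0; i < DEMO_MAX_KEYS; ++i) {
    uint64_t seq = demo_key_seq[i];
    if ((seq & 1) == 0 &&
        __sync_bool_compare_and_swap(&demo_key_seq[i], seq, seq + 1))
      return i;
  }
  return -1;
}

/* Release a key; bumping the number makes every stored value stale. */
int demo_key_delete(int key)
{
  if (key < 0 || key >= DEMO_MAX_KEYS)
    return -1;
  uint64_t seq = demo_key_seq[key];
  if ((seq & 1) == 0)
    return -1;
  return __sync_bool_compare_and_swap(&demo_key_seq[key], seq, seq + 1) ? 0 : -1;
}

/* Store a value together with the key's current sequence number. */
int demo_setspecific(int key, void *data)
{
  if (key < 0 || key >= DEMO_MAX_KEYS || (demo_key_seq[key] & 1) == 0)
    return -1;
  demo_tls[key].seq = demo_key_seq[key];
  demo_tls[key].data = data;
  return 0;
}

/* Return the value only if it was stored under the key's current incarnation. */
void *demo_getspecific(int key)
{
  if (key < 0 || key >= DEMO_MAX_KEYS || (demo_key_seq[key] & 1) == 0)
    return NULL;
  if (demo_tls[key].seq != demo_key_seq[key])
    return NULL;
  return demo_tls[key].data;
}
```

The point of the design is that deleting a key never has to visit other threads: a value stored under an old incarnation of the key simply stops matching the global sequence number, even after the same index gets handed out again.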

See the full patch here.


KEY_ACQUIRE and KEY_RELEASE use gcc atomic builtins described here.

Table of Contents

  1. Creating threads
  2. Thread-local storage
  3. Mutexes
  4. Joining and initialization
  5. Cancellation
  6. Scheduling
  7. RW Locks
  8. Condition Variables
  9. Conclusion


This is the first of hopefully many posts documenting my attempts to understand how to implement a pthread-style threading system on Linux. To this end, I started implementing a small, non-portable, and rather useless library that I called Thread Bites. Thread Bites is useless mainly because it lacks support for compiler-generated thread-local storage. It may not sound like a grave issue, but it makes Thread Bites incompatible with most of glibc. Therefore, I had to provide my own functionality for invoking syscalls, managing the heap and even printing stuff to stdout. It's all pretty simple and understandable so far, so I hope I will be able to implement most of the pthreads' functionality in a couple of small bites. You can get the source from GitHub.


For a program to be even remotely useful, it needs to communicate with the user in one way or another. A standard library for the programming language, like glibc, typically provides all the necessary components for such communication. For reasons mentioned in the introduction, using glibc in this case is not advisable. Hence, I need to find a way to call the operating system directly without using the syscall function, because it is also a part of glibc and sets errno, which is supposed to reside in the TLS that I did not set up.

All that is needed to implement an equivalent function is shuffling around the values of the registers to translate between the calling conventions of C and the Linux kernel, as described here and here on page 20. As it turns out, it's not that hard to do it using inline assembly in C, and there's an excellent tutorial here.

#define SYSCALL(name, a1, a2, a3, a4, a5, a6)           \
  ({                                                    \
    long result;                                        \
    long __a1 = (long)(a1);                             \
    long __a2 = (long)(a2);                             \
    long __a3 = (long)(a3);                             \
    long __a4 = (long)(a4);                             \
    long __a5 = (long)(a5);                             \
    long __a6 = (long)(a6);                             \
    register long _a1 asm("rdi") = __a1;                \
    register long _a2 asm("rsi") = __a2;                \
    register long _a3 asm("rdx") = __a3;                \
    register long _a4 asm("r10") = __a4;                \
    register long _a5 asm("r8")  = __a5;                \
    register long _a6 asm("r9")  = __a6;                \
    asm volatile (                                      \
      "syscall\n\t"                                     \
      : "=a" (result)                                   \
      : "0" (name), "r" (_a1), "r" (_a2), "r" (_a3),    \
        "r" (_a4), "r" (_a5), "r" (_a6)                 \
      : "memory", "cc", "r11", "cx");                   \
    (long) result; })

#define SYSCALL1(name, a1) \
  SYSCALL(name, a1, 0, 0, 0, 0, 0)
#define SYSCALL2(name, a1, a2) \
  SYSCALL(name, a1, a2, 0, 0, 0, 0)
#define SYSCALL3(name, a1, a2, a3) \
  SYSCALL(name, a1, a2, a3, 0, 0, 0)
#define SYSCALL4(name, a1, a2, a3, a4) \
  SYSCALL(name, a1, a2, a3, a4, 0, 0)
#define SYSCALL5(name, a1, a2, a3, a4, a5) \
  SYSCALL(name, a1, a2, a3, a4, a5, 0)
#define SYSCALL6(name, a1, a2, a3, a4, a5, a6) \
  SYSCALL(name, a1, a2, a3, a4, a5, a6)

All this is, of course, horribly inefficient because it messes with all the registers even in situations where it does not have to. The intermediate variables for the parameters (__a1 and friends) are used to prevent embedded function calls from messing with the registers that have already been set; think of strlen in SYSCALL3(__NR_write, 1, blah, strlen(blah)).

See the full code on GitHub.

Printing to stdout

It seems that glibc is using errno and other thread local stuff to calculate buffer sizes in one of the subroutines called by printf. It causes printf to segfault when called concurrently from different threads because of the same TLS story. Thread Bites provides a convenience function similar to printf and supporting %s %x %u %o %d in l and ll flavors:

void tbprint(const char *format, ...);
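For a feel of how little is needed for a printf-like function that does not touch errno or locales, here is a rough sketch. It is not the tbprint implementation: mini_format and its helpers are hypothetical names, it formats into a caller-supplied buffer instead of writing to stdout, and it skips the l and ll flavors.

```c
#include <stdarg.h>
#include <stddef.h>

/* Append one character if there is room, always leaving space for '\0'. */
static size_t put_char(char *buf, size_t size, size_t pos, char c)
{
  if (pos + 1 < size)
    buf[pos] = c;
  return pos + 1;
}

/* Append an unsigned number in the given base (8, 10 or 16). */
static size_t put_unsigned(char *buf, size_t size, size_t pos,
                           unsigned long val, unsigned base)
{
  char digits[32];
  int n = 0;
  do {
    digits[n++] = "0123456789abcdef"[val % base];
    val /= base;
  } while (val != 0);
  while (n > 0)
    pos = put_char(buf, size, pos, digits[--n]);
  return pos;
}

/* Format into buf; returns the length the complete output would need. */
size_t mini_format(char *buf, size_t size, const char *fmt, ...)
{
  va_list ap;
  size_t pos = 0;
  va_start(ap, fmt);
  for (; *fmt; ++fmt) {
    if (*fmt != '%') {
      pos = put_char(buf, size, pos, *fmt);
      continue;
    }
    ++fmt;
    if (*fmt == '\0')
      break;                       /* lone '%' at the end of the format */
    switch (*fmt) {
      case 's': {
        const char *s = va_arg(ap, const char *);
        while (*s)
          pos = put_char(buf, size, pos, *s++);
        break;
      }
      case 'd': {
        int v = va_arg(ap, int);
        unsigned long u = (unsigned long)v;
        if (v < 0) {
          pos = put_char(buf, size, pos, '-');
          u = 0UL - u;             /* magnitude, safe even for INT_MIN */
        }
        pos = put_unsigned(buf, size, pos, u, 10);
        break;
      }
      case 'u': pos = put_unsigned(buf, size, pos, va_arg(ap, unsigned), 10); break;
      case 'x': pos = put_unsigned(buf, size, pos, va_arg(ap, unsigned), 16); break;
      case 'o': pos = put_unsigned(buf, size, pos, va_arg(ap, unsigned), 8);  break;
      default:  pos = put_char(buf, size, pos, *fmt); break;  /* e.g. "%%" */
    }
  }
  va_end(ap);
  if (size > 0)
    buf[pos < size ? pos : size - 1] = '\0';
  return pos;
}
```

Everything lives on the stack of the caller, so the function is safe to call from several threads at once as long as each thread uses its own buffer.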


Glibc's default malloc implementation, a ptmalloc2 derivative, uses thread-specific arenas to limit lock congestion caused by calling malloc concurrently from multiple threads. It looks like it depends on TLS, so using it is not the best idea. Thread Bites comes with its own evil version of malloc. It's pathological because it's extremely prone to fragmentation, it never shrinks the heap, and it's essentially one big critical section. It has some undeniable advantages too: it works, it's incredibly simple, and it fits in around 50 lines of code. Look here to find it.

The only thing worth noting in this section is that the sys_brk syscall does not behave like glibc's brk or sbrk functions. On error, it returns the current location of the heap boundary, so the code calls it with an obviously wrong parameter (0) to figure out what the initial heap boundary is.
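The trick is easy to try out in a regular program. The sketch below uses glibc's generic syscall wrapper rather than the SYSCALL macros, and the function names are mine, so treat it as an illustration of the kernel behavior rather than as the Thread Bites code.

```c
#include <stdint.h>
#include <sys/syscall.h>  /* SYS_brk */
#include <unistd.h>       /* syscall */

/* Raw brk: the kernel returns the resulting program break, whether or not
   the request succeeded, instead of 0 or -1 like glibc's brk(). */
uintptr_t raw_brk(uintptr_t addr)
{
  return (uintptr_t)syscall(SYS_brk, addr);
}

/* Address 0 can never be a valid break, so the request is rejected and
   we get the current heap boundary back. */
uintptr_t current_heap_boundary(void)
{
  return raw_brk(0);
}
```

Growing the heap is then a matter of calling raw_brk with the current boundary plus the requested size and checking whether the returned value actually moved.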


Clone is an interesting beast. From the standpoint of this section, the relevant thing about it is that it behaves mostly like fork, except that the child is launched on a new stack. For this reason and to generate proper Call Frame Information (see here and here), it needs to be implemented in assembly. The implementation puts the user function pointer and its argument on the child's stack so that they can be later popped and called by the child. Then, it puts the syscall parameters in the appropriate registers and makes the syscall. Again, all the code is on GitHub.

Creating a thread

After dealing with all this boilerplate, the actual thread creation is relatively straightforward. First, the new thread needs a stack it can run on. The stack could be allocated on the heap, but it's probably safer to get a separate memory region for it. It can be done using the mmap syscall. A proper threading library should check the system limits to figure out the size of the stack, but Thread Bites is not proper, so it will use 8MiB as a default. All the function parameters in the snippet below match glibc's mmap call.

void *stack = tbmmap(NULL, attr->stack_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
long status = (long)stack;
if(status < 0)
  return status;

It's a good idea to mark a page at the beginning of the stack as non-readable and non-writable to protect any valid adjacent memory if there is any. Stepping over the guard page will get us a familiar segmentation fault. Why at the beginning and not at the end? Because the stack grows downwards, i.e. from a high address towards a lower one. Go here for more details.

status = SYSCALL3(__NR_mprotect, stack, EXEC_PAGESIZE, PROT_NONE);
if(status < 0) {
  tbmunmap(stack, attr->stack_size);
  return status;
}

We finally have all that is needed to spawn a new kernel task sharing with the main task everything that you would expect threads running within the same process to share. The man page of clone discusses all of the parameters in detail.

flags |= CLONE_THREAD;
int tid = tbclone(start_thread, *thread, flags, stack+attr->stack_size);
if(tid < 0) {
  tbmunmap(stack, attr->stack_size);
  free(*thread);
  return tid;
}

Note that the user function is not called directly and start_thread is called instead. It does some internal book-keeping and eventually calls the user-supplied function:

tbthread_t th = (tbthread_t)arg;
uint32_t stack_size = th->stack_size;
void *stack = th->stack;
th->fn(th->arg);
free(th);

When the function returns, the stack memory needs to be freed (munmapped) and the thread terminated. The clone function made provisions for calling sys_exit with the status returned by the wrapper. However, here we remove the stack from underneath our feet, and we cannot risk any C function writing over it. Therefore, we need to call sys_munmap and sys_exit in one piece of assembly.

register long a1 asm("rdi") = (long)stack;
register long a2 asm("rsi") = stack_size;
asm volatile(
  "syscall\n\t"
  "movq $60, %%rax\n\t" // 60 = __NR_exit
  "movq $0, %%rdi\n\t"
  "syscall"
  :
  : "a" (__NR_munmap), "r" (a1), "r" (a2)
  : "memory", "cc", "r11", "cx");
return 0;

The full code.

Lessons learned

  • The logic involved in running threads does not seem to be horrendously complicated, at least not yet.
  • Glibc is quite convoluted and figuring out what is going on and where is a challenge. It is probably justified by the range of platforms it supports, but the documentation could be better.
  • Threading is tightly coupled with glibc in areas that I had not initially suspected: the dynamic linker to support TLS, globals in TLS (errno), locales, finalizers, and so on.
  • POSIX threads have an interesting relationship with the Linux task model. This document describes the initial incompatibilities that have ultimately been ironed out. It is a very interesting read if you want to understand how the userspace and the kernelspace interact to implement a threading system.
  • Ultimately, none of this is necessarily black magic. At least not so far.


Following a piece of advice from a friend, I decided to buy this new domain name and start writing down all the cool things I do. I have written a bit in a bunch of other places before and have other quasi-failed blogs, so I actually already have a bit of content to bootstrap this one.

There's plenty of blogware options out there, but, as a programmer, I like the ones that keep the content in a version control system intended for software. From the alternatives available in this department, I decided to go for c()λeslaw. It's kind of similar to Jekyll, which I have used before, and it's written in Common Lisp, which, typically, is a good sign in general.

There is very little instruction on the Internet on how to use it, but it's not hard to figure out after reading the code. This post is a brief summary of what I have done to create this website and convert the content from Jekyll.

Site Structure

The first thing that you need to do is to create a .coleslawrc file describing the layout of the site, the theme to be used to render the final HTML, and other such things. There's a good example here and you can get the full picture by reading the source. :) I like to change the separator (:separator "---") so that --- is used to distinguish the metadata from the content section in source files; this makes things look the Jekyll way. The static-pages plugin makes it possible to create content other than blog posts and indices.

Coleslaw will search the repo for files ending with .post (and .page if the static-pages plugin is enabled) and run them through the renderer selected in the page's metadata section. It will generate the indices automatically and copy verbatim everything it finds in the static directory.

You can create your own theme following the rules described here or choose something from the built-in options. I built the theme you see here more or less from scratch using Bootstrap and the live customizer to tweak the colors. It was a fairly easy and pleasant exercise.

In the end, the resulting directory structure looks roughly like this:

==> find

The first few lines of the post you are reading right now look like this:

title: Blogging with Coleslaw
date: 2015-12-07
tags: blogging, lisp, programming, linux, sbcl
format: md


Following a piece of advice from a friend, I decided to buy this new domain name


Coleslaw and the packages it depends on work pretty well to begin with, but I made a couple of improvements to make them fit my particular tastes better:

  1. Some themes and plugins are site specific and cannot be generalized. There is very little point in keeping them in the coleslaw source tree when they really belong with the site content. I submitted patches to make it possible to define themes and plugins in the content repo. See PR-98 and PR-101.
  2. I like to have the HTML files named in a certain way in the resulting web site, so it's convenient for me to be able to specify lambdas in .coleslawrc mapping the content metadata to file names. I made a pull request to allow that (PR-100), but Brit, the maintainer of coleslaw, has different ideas on how to approach this problem.
  3. I think pygments has no real competition when it comes to coloring source code, so I made changes to 3bmd - the markdown rendering library used by coleslaw - allowing it to use pygments. See PR-24.
  4. It's nice to be able to control how the rendered HTML tables look. In order to do that, you need to be able to specify the css class for the table. See PR-25.


3bmd makes it fairly easy to customize how the final HTML is rendered. For instance, you can change the resulting markup for images by defining a method :around print-tagged-element. I want the images on this web site to have frames and captions, so I did this:

(defmethod print-tagged-element :around ((tag (eql :image)) stream rest)
  (setf rest (cdr (first rest)))
  (let ((fmt (concatenate 'string
                          "<div class=\"center-wrapper\">"
                          "  <div class=\"img\">"
                          "    <img src=\"~a\" ~@[alt=\"~a\"~] />"
                          "    <div class=\"img-caption\">~@[~a~]</div>"
                          "  </div>"
                          "</div>"))
        (caption (with-output-to-string (s)
                   (mapcar (lambda (a) (print-element a s))
                           (getf rest :label)))))
    (format stream
            fmt
            (getf rest :source)
            caption
            caption)))

Being able to use $config.domain and other variables in the markdown makes it possible to define relative paths to images and other resources. This comes in handy if you want to test the web site using different locations. In order to achieve this, you can define a method :around render-text in the following way:

(defmethod render-text :around (text format)
  (let ((processed
         (reduce #'funcall
                 (list
                  #'process-embeds
                  (lambda (text)
                    (regex-replace-all "{\\\$config.domain}"
                                       text
                                       (domain *config*)))
                  (lambda (text)
                    (regex-replace-all "{\\\$config.repo-dir}"
                                       text
                                       (namestring (repo-dir *config*))))
                  text)
                 :from-end t)))
    (call-next-method processed format)))


I use DreamHost for my web hosting and want to use sbcl as the lisp implementation. Unfortunately, all of my attempts to run sbcl there ended up with error messages like this one:

mmap: wanted 1040384 bytes at 0x20000000, actually mapped at 0x3cfc6467000
ensure_space: failed to validate 1040384 bytes at 0x20000000
(hint: Try "ulimit -a"; maybe you should increase memory limits.)

After some investigation, it turned out that DreamHost uses grsecurity kernel patches and, it looks like, their implementation of ASLR (Address Space Layout Randomization) does not respect the ADDR_NO_RANDOMIZE personality that is indeed set by sbcl at startup. They still allow the memory to be mapped at a specific location, which is a requirement for sbcl, if the MAP_FIXED flag is passed to mmap. The patch fixing this problem was a fairly simple one once I figured out what's going on. It looks like it will be included in sbcl 1.3.2. Until then, you will have to recompile the sources yourself.

Let's see if we get a speedup if we compile the code. The snippets below list the contents of col1.lisp and col2.lisp respectively:

(require 'coleslaw)
(coleslaw:main "/path/to/repo/")

(require 'coleslaw)
(defun main () (coleslaw:main (nth 1 *posix-argv*)))
(sb-ext:save-lisp-and-die "coleslaw.x" :toplevel #'main :executable t)

And this is what you get:

]==> time sbcl --noinform --load col1.lisp
sbcl --load col2.lisp  6.39s user 1.05s system 97% cpu 7.609 total

]==> sbcl --noinform --load col2.lisp
[undoing binding stack and other enclosing state... done]
[saving current Lisp image into coleslaw.x:
writing 4944 bytes from the read-only space at 0x20000000
writing 3168 bytes from the static space at 0x20100000
writing 85229568 bytes from the dynamic space at 0x1000000000

]==> time ./coleslaw.x /path/to/repo/
./coleslaw.x /path/to/repo/  3.37s user 0.74s system 95% cpu 4.304 total

]==> du -sh ./coleslaw.x
83M     ./coleslaw.x

The compiled code runs almost twice as fast, but the executable weighs 83M!

I wrote the following post-receive hook in order to have the site rendered automatically every time I push the new content to the master branch of the repo.

CLONE_DIR=`mktemp -d`

echo "Cloning the repository..."
# Abort if the clone fails; note that it is ||, not a pipe
git clone "$PWD" "$CLONE_DIR" > /dev/null || exit 1

while read oldrev newrev refname; do
  if [ "$refname" = "refs/heads/master" ]; then
    echo "Running coleslaw..."
    coleslaw.x "$CLONE_DIR/" > /dev/null
  fi
done

rm -rf "$CLONE_DIR"


Building this web site was quite an instructive experience, especially since it was my first non-toy project done in Common Lisp. It showed me how easy it is to use and hack on CL projects and how handy QuickLisp is. There are plenty of good libraries around and, if they have areas in which they are lacking, it's quite a bit of fun to fill the gaps. The library environment definitely is not as mature as that of Python or Ruby, so new users may find it difficult, but, overall, I think it's worth it to spend the time getting comfortable with Common Lisp. I finally feel emotionally prepared to go through Peter Norvig's Paradigms of Artificial Intelligence Programming. :)

Generally, LWN runs top quality articles. I always read them with pleasure and they are good enough to make me a paid subscriber. Every now and then, though, they publish something pretty great even by their standards. I read this and was amazed. I had not realized that it is this easy to create and run a simple virtual machine. I typed the code in and played with it for a couple of hours. You can get the file that actually compiles (C++14) and runs here.

//------------------------------------------------------------------------------
// Based on:
//------------------------------------------------------------------------------

#include <iostream>
#include <iomanip>
#include <cstdint>
#include <cstring>
#include <cerrno>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

//------------------------------------------------------------------------------
// Error handling macro
//------------------------------------------------------------------------------
#define RUN(var, command) \
var = command;       \
if (var == -1) {     \
  std::cerr << #command ": " << strerror(errno) << std::endl; \
  return 1;          \
}

//------------------------------------------------------------------------------
// The code to be run inside of the virtual machine
//------------------------------------------------------------------------------
const uint8_t code[] = {
  0xba, 0xf8, 0x03, /* mov $0x3f8, %dx */
  0x00, 0xd8,       /* add %bl, %al */
  0x04, '0',        /* add $'0', %al */
  0xee,             /* out %al, (%dx) */
  0xb0, '\n',       /* mov $'\n', %al */
  0xee,             /* out %al, (%dx) */
  0xf4              /* hlt */
};

int main(int argc, char **argv)
{
  using namespace std;

  //----------------------------------------------------------------------------
  // Open KVM
  //----------------------------------------------------------------------------
  int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
  if (kvm == -1) {
    cerr << "Unable to open /dev/kvm: " << strerror(errno) << endl;
    return 1;
  }

  //----------------------------------------------------------------------------
  // Check the version of the API, we need 12
  //----------------------------------------------------------------------------
  int ret;
  RUN(ret, ioctl(kvm, KVM_GET_API_VERSION, 0));

  if (ret != 12) {
    cerr << "KVM_GET_API_VERSION " << ret << ", expected 12" << endl;
    return 1;
  }

  //----------------------------------------------------------------------------
  // Check if the extension required to set up guest memory is present
  //----------------------------------------------------------------------------
  RUN(ret, ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_USER_MEMORY));

  if (!ret) {
    cerr << "KVM_CAP_USER_MEM extension is not available" << endl;
    return 1;
  }

  //----------------------------------------------------------------------------
  // Set up a virtual machine
  //----------------------------------------------------------------------------
  int vmfd;
  RUN(vmfd, ioctl(kvm, KVM_CREATE_VM, (unsigned long)0));

  //----------------------------------------------------------------------------
  // Get some page-aligned memory and copy the code to it
  //----------------------------------------------------------------------------
  void *mem = mmap(0, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS,
                   -1, 0);
  if (mem == MAP_FAILED) {
    cerr << "Failed to get a page of memory: " << strerror(errno) << endl;
    return 1;
  }
  memcpy(mem, code, sizeof(code));

  //----------------------------------------------------------------------------
  // Tell the virtual machine about this region
  //----------------------------------------------------------------------------
  kvm_userspace_memory_region region;
  memset(&region, 0, sizeof(region));
  region.slot            = 0;
  region.guest_phys_addr = 0x1000;
  region.memory_size     = 0x1000;
  region.userspace_addr  = (uint64_t)mem;

  RUN(ret, ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region));

  //----------------------------------------------------------------------------
  // Create a virtual CPU #0
  //----------------------------------------------------------------------------
  int vcpufd;
  RUN(vcpufd, ioctl(vmfd, KVM_CREATE_VCPU, (unsigned long)0));

  //----------------------------------------------------------------------------
  // Allocate memory for kvm_run data structure
  //----------------------------------------------------------------------------
  int vcpu_run_size;
  RUN(vcpu_run_size, ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0));

  kvm_run *run;
  run = (kvm_run *)mmap(0, vcpu_run_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                        vcpufd, 0);
  if (run == MAP_FAILED) {
    cerr << "Allocating VCPU run struct failed: " << strerror(errno) << endl;
    return 1;
  }

  //----------------------------------------------------------------------------
  // Set up the special registers of the VCPU #0
  //----------------------------------------------------------------------------
  kvm_sregs sregs;
  RUN(ret, ioctl(vcpufd, KVM_GET_SREGS, &sregs));
  sregs.cs.base = 0;
  sregs.cs.selector = 0;
  RUN(ret, ioctl(vcpufd, KVM_SET_SREGS, &sregs));

  //----------------------------------------------------------------------------
  // Set up the standard registers
  //----------------------------------------------------------------------------
  kvm_regs regs;
  memset(&regs, 0, sizeof(regs));
  regs.rip    = 0x1000; // start executing where the code was mapped
  regs.rax    = 2;
  regs.rbx    = 2;
  regs.rflags = 0x2; // starting the VM will fail with this not set, x86
                     // architecture requirement
  RUN(ret, ioctl(vcpufd, KVM_SET_REGS, &regs));

  //----------------------------------------------------------------------------
  // Run the VCPU #0
  //----------------------------------------------------------------------------
  while (1) {
    RUN(ret, ioctl(vcpufd, KVM_RUN, 0));
    switch (run->exit_reason) {

      //------------------------------------------------------------------------
      // HLT instruction - we're done
      //------------------------------------------------------------------------
      case KVM_EXIT_HLT:
        cerr << "KVM_EXIT_HLT" << endl;
        return 0;

      //------------------------------------------------------------------------
      // Simulate an IO port at 0x3f8
      //------------------------------------------------------------------------
      case KVM_EXIT_IO:
        if (run->io.direction == KVM_EXIT_IO_OUT &&
            run->io.size      == 1 &&
            run->io.port      == 0x3f8 &&
            run->io.count     == 1)
          cout << *(((char *)run) + run->io.data_offset) << flush;
        else
          cerr << "Unhandled KVM_EXIT_IO" << endl;
        break;

      //------------------------------------------------------------------------
      // Underlying virtualization mechanism can't start the VM
      //------------------------------------------------------------------------
      case KVM_EXIT_FAIL_ENTRY:
        cerr << "KVM_EXIT_FAIL_ENTRY: hardware_entry_failure_reason = 0x";
        cerr << hex;
        cerr << (unsigned long long)run->fail_entry.hardware_entry_failure_reason;
        cerr << endl;
        return 1;

      //------------------------------------------------------------------------
      // Error from the KVM subsystem
      //------------------------------------------------------------------------
      case KVM_EXIT_INTERNAL_ERROR:
        cerr << "KVM_EXIT_INTERNAL_ERROR: suberror = 0x" << hex;
        cerr << run->internal.suberror << endl;
        return 1;
    }
  }

  //----------------------------------------------------------------------------
  // Cleanup
  //----------------------------------------------------------------------------
194   munmap(mem, 0x1000);
195   munmap(run, vcpu_run_size);
196   close(vcpufd);
197   close(vmfd);
198   close(kvm);
200   return 0;
201 }

When writing software, I have always assumed that I could trust the underlying platform, at least to some basic extent. For instance, when writing a multi-threaded program running on Linux, it is not unreasonable to think that the POSIX thread synchronization mechanisms are actually, you know, thread-safe. As it turns out, that's not quite true. We learnt this the painful way: a heavily loaded production system kept crashing every now and then. I ended up having to implement my own semaphores.
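
For readers who have never looked inside one, here is a minimal sketch of what a counting semaphore boils down to. It is not the implementation mentioned above, which is not shown here. The class name `Semaphore` and its methods are hypothetical, and the sketch builds on `std::mutex` and `std::condition_variable`, so it relies on the very platform primitives the paragraph is suspicious of.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical counting semaphore built from a mutex and a condition
// variable. An illustration only, not the code referred to in the post.
class Semaphore {
public:
  explicit Semaphore(unsigned initial = 0) : count_(initial) {}

  // Increment the counter and wake up one waiting thread, if any.
  void post() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      ++count_;
    }
    cv_.notify_one();
  }

  // Block until the counter is positive, then decrement it.
  void wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return count_ > 0; });
    --count_;
  }

  // Decrement the counter if it is positive; never blocks.
  bool try_wait() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (count_ == 0)
      return false;
    --count_;
    return true;
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  unsigned count_;
};
```

The counter is only ever touched with the mutex held, and the predicate passed to `cv_.wait` guards against spurious wake-ups.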


My media center box has recently died tragically of overheating and I decided to replace it with a brand new Cubox-4iPro. While the hardware seems pretty great, the software support is less than perfect, to say the least, especially if you decide to put, say, Debian Testing on it instead of one of the images prepared by the vendor. Based on these notes, I was able to install and boot the system, and had quite some fun doing so. Not everything worked as described in the notes and some tweaking was needed, so I present here what worked for me.

Preparing the Micro SD card and bootstrapping the system

You need to create at least two partitions: a swap and a root partition. Remember to leave some space at the beginning for the boot loader, 4MB or 8192 sectors should be more than enough. The following layout works fine for my 64GB card.

]==> fdisk /dev/mmcblk0

Command (m for help): p

Disk /dev/mmcblk0: 63.9 GB, 63864569856 bytes
255 heads, 63 sectors/track, 7764 cylinders, total 124735488 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

        Device Boot      Start         End      Blocks   Id  System
/dev/mmcblk0p1            8192     8396799     4194304   82  Linux swap / Solaris
/dev/mmcblk0p2         8396800   124735487    58169344   83  Linux
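
The numbers in the listing can be cross-checked with a bit of arithmetic. The sketch below is only an illustration: the helper names `sectors_to_bytes` and `partition_bytes` are made up for this example, and the constants are copied from the fdisk output above.

```cpp
#include <cassert>
#include <cstdint>

// Sector size in bytes, as reported by fdisk above.
constexpr std::uint64_t kSectorSize = 512;

// Convert a sector count to bytes.
constexpr std::uint64_t sectors_to_bytes(std::uint64_t sectors) {
  return sectors * kSectorSize;
}

// Size in bytes of a partition given its first and last sector (inclusive).
constexpr std::uint64_t partition_bytes(std::uint64_t first,
                                        std::uint64_t last) {
  return sectors_to_bytes(last - first + 1);
}

// The gap left for the boot loader: 8192 sectors is exactly 4 MiB.
static_assert(sectors_to_bytes(8192) == 4ULL * 1024 * 1024, "boot loader gap");
// Swap partition: matches the 4194304 1-KiB blocks in the listing (4 GiB).
static_assert(partition_bytes(8192, 8396799) == 4194304ULL * 1024, "swap size");
// Root partition: matches the 58169344 1-KiB blocks in the listing.
static_assert(partition_bytes(8396800, 124735487) == 58169344ULL * 1024,
              "root size");
```

Since everything is `constexpr`, the checks run at compile time. If the file compiles, the layout is consistent.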

Then create the file systems and mount the root partition.

]==> mkfs.ext4 /dev/mmcblk0p2
]==> mkswap /dev/mmcblk0p1
]==> mkdir /mnt/tmp
]==> mount /dev/mmcblk0p2 /mnt/tmp/

You will need qemu-user-static and debootstrap packages to install the system on the root partition. Go here to select the appropriate kernel version.

]==> apt-get install qemu-user-static debootstrap

]==> qemu-debootstrap --foreign \
     --include=ntp,ntpdate,less,u-boot,u-boot-tools,linux-image-3.14-2-armmp,bash-completion,fake-hwclock,emacs,mc \
     --exclude=nano --arch=armhf testing /mnt/tmp

Installing and setting up the boot loader

Cubox has the Initial Program Loader (IPL) in its NVRAM and you need to put the Secondary Program Loader (SPL) and the primary boot loader (Das U-Boot in our case) at the beginning of the Micro SD card.

]==> dd if=/mnt/tmp/usr/lib/u-boot/mx6_cubox-i/SPL of=/dev/mmcblk0 bs=1K seek=1
39+0 records in
39+0 records out
39936 bytes (40 kB) copied, 0.00685229 s, 5.8 MB/s

]==> dd if=/mnt/tmp/usr/lib/u-boot/mx6_cubox-i/u-boot.img of=/dev/mmcblk0 bs=1K seek=42
292+1 records in
292+1 records out
299804 bytes (300 kB) copied, 0.0451772 s, 6.6 MB/s
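
It is worth convincing yourself that the two images neither overlap each other nor spill into the first partition. The following sketch does that with the byte counts reported by dd above. The `Region` type and the `fits_before` helper are hypothetical names introduced for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// A byte range [start, start + size) on the card.
struct Region {
  std::uint64_t start;
  std::uint64_t size;
  constexpr std::uint64_t end() const { return start + size; }
};

// True if region a ends at or before the start of region b.
constexpr bool fits_before(const Region &a, const Region &b) {
  return a.end() <= b.start;
}

constexpr Region mbr{0, 512};               // partition table, first sector
constexpr Region spl{1 * 1024, 39936};      // dd bs=1K seek=1
constexpr Region uboot{42 * 1024, 299804};  // dd bs=1K seek=42
constexpr Region first_part{8192 * 512, 0}; // first partition, sector 8192

static_assert(fits_before(mbr, spl), "SPL must not clobber the MBR");
static_assert(fits_before(spl, uboot), "SPL must end before u-boot.img");
static_assert(fits_before(uboot, first_part), "U-Boot must fit in the gap");
```

With these sizes the SPL ends exactly at the 40 KiB mark, leaving 2 KiB of slack before u-boot.img, and U-Boot ends well within the 4 MiB left in front of the first partition.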

You will also need an appropriate DTB file and a U-Boot environment. The DTB (Device Tree Blob) is a database that represents the hardware components in the system; it is provided by the kernel package. You can get the U-Boot environment by slightly massaging the one provided by the flash-kernel package. Finally, you will need to make an environment image using the mkimage command (u-boot-tools package).

]==> cp /mnt/tmp/usr/lib/linux-image-3.14-2-armmp/imx6q-cubox-i.dtb /mnt/tmp/dtb

]==> cat /mnt/tmp/root/boot.cmd

setenv device mmc
setenv partition ${mmcdev}:${mmcpart}
setenv bootargs 'root=/dev/mmcblk0p2 rootfstype=ext4 rootwait console=ttymxc0,115200n8 console=tty1'

image_locations='/boot/ /'
for pathprefix in ${image_locations}
do
  load ${device} ${partition} ${loadaddr} ${pathprefix}vmlinuz \
  && load ${device} ${partition} ${fdt_addr} ${pathprefix}dtb \
  && load ${device} ${partition} ${ramdiskaddr} ${pathprefix}initrd.img \
  && echo "Booting Debian ${kvers} from ${device} ${partition}..." \
  && bootz ${loadaddr} ${ramdiskaddr}:${filesize} ${fdt_addr}
done

]==> mkimage -A arm -O linux -T script -C none -n "Initial u-boot script" -d /mnt/tmp/root/boot.cmd /mnt/tmp/boot/boot.scr
Image Name:   Initial u-boot script
Created:      Tue Aug 12 22:59:31 2014
Image Type:   ARM Linux Script (uncompressed)
Data Size:    576 Bytes = 0.56 kB = 0.00 MB
Load Address: 00000000
Entry Point:  00000000
   Image 0: 568 Bytes = 0.55 kB = 0.00 MB

Applying basic settings

Before you can boot into the system, you need to provide the file system information, root password, console settings, host name, apt sources and such.

]==> cat /mnt/tmp/etc/fstab
/dev/mmcblk0p2 /               ext4    errors=remount-ro 0       1
/dev/mmcblk0p1 none            swap    defaults          0       0
]==> chroot /mnt/tmp passwd root
]==> echo 'T0:23:respawn:/sbin/getty -L ttymxc0 115200 vt100' >> /mnt/tmp/etc/inittab
]==> echo 'cubox' > /mnt/tmp/etc/hostname
]==> cat /mnt/tmp/etc/apt/sources.list
deb testing/updates main contrib non-free
deb-src testing/updates main contrib non-free

deb testing-updates main contrib non-free
deb-src testing-updates main contrib non-free

deb testing main contrib non-free
deb-src testing main contrib non-free

The system does not have a real time clock, but you can cheat and preserve the time at least between reboots with good enough accuracy. To do this you can, for instance, set the current time to the last modification time of /var/log/syslog as early in the boot sequence as possible. Download this script and put it in /mnt/tmp/etc/init.d then run:

]==> chmod 755 /mnt/tmp/etc/init.d/rtcemu
]==> chroot /mnt/tmp /bin/bash
]==> update-rc.d rtcemu defaults
]==> touch /var/log/syslog
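
The idea behind the trick is simple enough to sketch in a few lines. What follows is not the rtcemu script mentioned above, just a rough illustration of the same approach: read the modification time of a file and set the system clock to it. The function names `last_modified` and `restore_clock_from` are made up for this example, and setting the clock requires root.

```cpp
#include <cassert>
#include <ctime>
#include <sys/stat.h>
#include <sys/time.h>

// Return the last modification time of path, or (time_t)-1 if the file
// cannot be examined.
time_t last_modified(const char *path) {
  struct stat st;
  if (stat(path, &st) != 0)
    return (time_t)-1;
  return st.st_mtime;
}

// Set the system clock to the modification time of path.
// Returns 0 on success and -1 on failure (for instance, when not run as
// root or when the file does not exist).
int restore_clock_from(const char *path) {
  time_t stamp = last_modified(path);
  if (stamp == (time_t)-1)
    return -1;
  struct timeval tv;
  tv.tv_sec = stamp;
  tv.tv_usec = 0;
  return settimeofday(&tv, nullptr);
}
```

Calling `restore_clock_from("/var/log/syslog")` early at boot would bring the clock back to roughly where it was when the system last wrote to the log, which is good enough until ntp takes over.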

Serial console

As of writing this, the HDMI output does not really work out of the box, so you will need to access the serial console to boot into the system and make it accessible over the network. You can use a Micro USB cable and screen for this purpose, like this:

]==> screen /dev/ttyUSB0 115200


You are now ready to boot the box. Insert the Micro SD card, attach the power cable and see the system boot in your screen session. When it starts the boot count-down, stop it by pressing enter and type the following:

setenv mmcpart 2
saveenv

Doing this changes the boot partition to 2 and saves the environment on the card so that you won't have to re-do this every time you reboot the system.

Post-boot settings

Now you need to configure the network, set up time zones, locale settings, keyboard layout and install some useful packages, like network-manager and openssh-server.

]==> dhclient eth0
]==> dpkg-reconfigure tzdata
]==> apt-get update
]==> apt-get install console-data keyboard-configuration locales
]==> dpkg-reconfigure console-data
]==> dpkg-reconfigure keyboard-configuration
]==> dpkg-reconfigure locales
]==> service keyboard-setup restart
]==> apt-get install network-manager openssh-server
]==> update-rc.d network-manager defaults
]==> update-rc.d ssh defaults

flash-kernel is a utility that can hook into dpkg and make newly installed kernels visible to U-Boot.

]==> apt-get install flash-kernel
]==> cat /etc/default/flash-kernel
LINUX_KERNEL_CMDLINE="root=/dev/mmcblk0p2 rootfstype=ext4 rootwait console=ttymxc0,115200n8 console=tty1"

You have an operational base system now!

What's next?

So, what is next? Plenty of fun! :) The device needs a kernel that can leverage all its features, and then xserver and xbmc. It's supposed to be a media center box after all. Stay tuned.

Google's Android NDK makes things that used to be a real pain really simple. Building a cross-compiler toolchain for TF700T is as simple as typing:

]==> ./build/tools/            \
         --platform=android-14                             \
         --install-dir=$HOME/Apps/android-toolchain-tf700t \

And then:

]==> cat hello.c
#include <stdio.h>
#include <math.h>

int main( int argc, char **argv )
{
  printf( "Hello world!\n" );
  printf( "Sin PI/2: %f\n", sin( M_PI/2.0 ) );
  return 0;
}

]==> arm-linux-androideabi-gcc -march=armv7-a -mfloat-abi=softfp -mfpu=neon -o hello hello.c
]==> adb push hello /data/local
753 KB/s (63720 bytes in 0.082s)
]==> adb shell /data/local/hello
Hello world!
Sin PI/2: 1.000000

Let's see whether I can build zsh. :)

ASUS Transformer Pad Infinity (TF700T) is a really lovely piece of equipment but, from the perspective of a Linux geek, the lack of access to certain command-line utilities through a nice-looking and functional terminal emulator seriously limits its usefulness. I want zsh, for heaven's sake! :) And ssh, git, python, midnight commander, ImageMagick and others. In order to install and use these comfortably, I need root access to the device that I own, after all! And I am denied it. O tempora, o mores!

There's a certain "workaround" to this problem that leverages the fact that the block device holding the system partition is mounted read-only and, despite seeming access-protected, is actually writeable. This could potentially enable a dissatisfied owner to use one of the e2fsprogs tools to plant su, with its setuid bit set right, and finally enjoy his property somewhat more. :) It looks like no Windows installation is actually needed, just a functional adb and the binaries.