Table of Contents
- Syscalls, memory, and your first thread
- The pointer to self and thread-local storage
- Futexes, mutexes, and memory synchronization
- Joining threads and dynamic initialization
- Cancellation
- Scheduling and task priority
- RW Locks
- Condition variables
- Final thoughts
Thread Scheduling
The scheduler decides which thread to run next based on two parameters: the scheduling policy and the priority.
Conceptually, the scheduler maintains a list of runnable threads for each possible sched_priority value. In order to determine which thread runs next, the scheduler looks for the nonempty list with the highest static priority and selects the thread at the head of this list.
A thread's scheduling policy determines where it will be inserted into the list of threads with equal static priority and how it will move inside this list. (the sched(7) man page)
The supported policies are:
- SCHED_NORMAL - the default Linux policy for threads not requiring any real-time machinery. All such threads have a priority of 0, and the decision of which thread runs next is based on the nice mechanism.
- SCHED_FIFO - threads have a priority from 1 (low) to 99 (high). When a SCHED_FIFO thread becomes runnable, it will preempt any thread of lower priority. There is no time slicing, so the high-priority thread runs for as long as it must.
- SCHED_RR - RR stands for round-robin. It is the same as SCHED_FIFO, except that each thread is allowed to run only for a limited time quantum.
See the sched(7) man page for more details.
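To see the priority range a given kernel accepts for each policy, you can simply query it. The snippet below uses the glibc wrappers instead of raw syscalls for brevity; note that SCHED_NORMAL is exposed to userspace as SCHED_OTHER:

#include <sched.h>
#include <stdio.h>

int main()
{
  /* SCHED_NORMAL is called SCHED_OTHER in the userspace headers. */
  printf("SCHED_OTHER: %d-%d\n", sched_get_priority_min(SCHED_OTHER),
         sched_get_priority_max(SCHED_OTHER));
  printf("SCHED_FIFO:  %d-%d\n", sched_get_priority_min(SCHED_FIFO),
         sched_get_priority_max(SCHED_FIFO));
  printf("SCHED_RR:    %d-%d\n", sched_get_priority_min(SCHED_RR),
         sched_get_priority_max(SCHED_RR));
  return 0;
}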
Setting the scheduling parameters of a kernel task boils down to invoking the sys_sched_setscheduler syscall, as shown below.
struct tb_sched_param
{
  int sched_priority;
};

int tb_set_sched(tbthread_t thread, int policy, int priority)
{
  struct tb_sched_param p;
  p.sched_priority = priority;
  int ret = SYSCALL3(__NR_sched_setscheduler, thread->tid, policy, &p);
  if(!ret)
    thread->sched_info = SCHED_INFO_PACK(policy, priority);
  return ret;
}
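A call site could look like the short sketch below. The raw syscall returns a negative errno value, and switching to a real-time policy normally requires CAP_SYS_NICE (or an adequate RLIMIT_RTPRIO), so -EPERM is the most likely failure mode:

int ret = tb_set_sched(thread, SCHED_FIFO, 42);
if(ret)
  return ret;  /* e.g. -EPERM when the caller lacks CAP_SYS_NICE */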
There is a bit more fun to it, though. As you can see, the kernel can only schedule a task that already exists, so we need a way to set the thread's priority after it has been spawned but before it invokes the user function; the reason is that we may need to abort the thread immediately should the scheduler setting fail. We do it with another futex that we wake once we know whether the thread is allowed to run or not:
if(th->start_status != TB_START_OK) {
  SYSCALL3(__NR_futex, &th->start_status, FUTEX_WAIT, TB_START_WAIT);
  if(th->start_status == TB_START_EXIT)
    SYSCALL1(__NR_exit, 0);
}
This futex is initialized in the tbthread_create function depending on the thread attributes:
if(!attr->sched_inherit)
  (*thread)->start_status = TB_START_WAIT;
else {
  tbthread_t self = tbthread_self();
  (*thread)->sched_info = self->sched_info;
}
And then set to either TB_START_OK or TB_START_EXIT after we spawn the thread but before tbthread_create exits:
if(!attr->sched_inherit) {
  ret = tb_set_sched(*thread, attr->sched_policy, attr->sched_priority);

  if(ret) (*thread)->start_status = TB_START_EXIT;
  else (*thread)->start_status = TB_START_OK;
  SYSCALL3(__NR_futex, &(*thread)->start_status, FUTEX_WAKE, 1);

  if(ret) {
    wait_for_thread(*thread);
    goto error;
  }
}
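Putting it together, creating a thread with explicit scheduling parameters could look roughly like this. The attribute initializer and the exact tbthread_create signature are assumed to mirror their pthread counterparts and may not match the actual thread-bites API:

/* Sketch only: attribute handling assumed to mirror pthreads. */
tbthread_attr_t attr;
tbthread_attr_init(&attr);        /* assumed initializer */
attr.sched_inherit  = 0;          /* do not inherit the creator's settings */
attr.sched_policy   = SCHED_RR;
attr.sched_priority = 20;

tbthread_t thread;
int ret = tbthread_create(&thread, &attr, worker, 0);  /* worker is the user function */
if(ret)
  return ret;  /* the new thread was aborted before reaching the user function */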
See the patch at GitHub.
Priority mutexes
Priority mutexes come in three varieties:
- TBTHREAD_PRIO_NONE - acquiring this type of mutex does not change the scheduling characteristics of the thread.
- TBTHREAD_PRIO_INHERIT - when the mutex owner blocks another thread with a higher priority, the owner inherits the priority of the thread it blocks if it is higher than its own.
- TBTHREAD_PRIO_PROTECT - acquiring this kind of mutex raises the priority of the owner to the prioceiling value of the mutex.
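These correspond one-to-one to the POSIX mutex protocols; for comparison, the same thing is expressed with the standard API like this:

#include <pthread.h>

pthread_mutexattr_t ma;
pthread_mutexattr_init(&ma);
pthread_mutexattr_setprotocol(&ma, PTHREAD_PRIO_PROTECT);  /* or PTHREAD_PRIO_INHERIT, PTHREAD_PRIO_NONE */
pthread_mutexattr_setprioceiling(&ma, 50);                 /* only meaningful for PRIO_PROTECT */

pthread_mutex_t mutex;
pthread_mutex_init(&mutex, &ma);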
Thread Bites implements this functionality by keeping lists of PRIO_INHERIT and PRIO_PROTECT mutexes. It then calculates the highest possible priority, taking into account the priority of the mutexes and the priority set by the user.
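The recalculation itself boils down to taking a maximum. A minimal sketch of the idea, with the mutex priorities flattened into an array because the real list and field names are not shown here:

/* Illustrative only: the effective priority is the highest of the priority the
   user asked for and the priorities contributed by the mutexes the thread
   holds (ceilings for PRIO_PROTECT, inherited values for PRIO_INHERIT). */
static int tb_effective_priority(int user_priority, const int *mutex_prios, int count)
{
  int prio = user_priority;
  for(int i = 0; i < count; ++i)
    if(mutex_prios[i] > prio)
      prio = mutex_prios[i];
  return prio;
}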
The implementation of PRIO_PROTECT is relatively straightforward. Whenever a thread acquires this kind of mutex, it is added to the list, and the priority of the thread is recalculated:
static int lock_prio_protect(tbthread_mutex_t *mutex)
{
  lock_prio_none(mutex);
  tb_protect_mutex_sched(mutex);
  return 0;
}

static int unlock_prio_protect(tbthread_mutex_t *mutex)
{
  tbthread_t self = tbthread_self();
  tb_protect_mutex_unsched(mutex);
  unlock_prio_none(mutex);
  return 0;
}
Implementing PRIO_INHERIT is a lot trickier. We add the mutex to the appropriate list when a thread acquires it. Whenever a higher-priority thread tries to lock the mutex, it bumps the priority of the blocker, but the priority recalculation is done only at this point. Implementing it like this covers all the main cases and is not horrendously hard. It allows for simple recursion: if the owner of a mutex gets blocked, the blocker inherits the priority that comes with the first mutex. It also has a couple of drawbacks:
- It assumes that the kernel will always wake the highest priority thread. It makes sense and is most likely the case. However, I have not tested it.
- If the owner of a PRIO_INHERIT mutex is already blocked on another mutex of the same kind and its priority gets bumped later, the last thread in the line won't be affected.
static int lock_prio_inherit(tbthread_mutex_t *mutex)
{
  tbthread_t self = tbthread_self();

  while(1) {
    int locked = 0;
    tb_futex_lock(&mutex->internal_futex);
    if(mutex->futex == 0) {
      locked = 1;
      mutex->owner = self;
      mutex->futex = 1;
      tb_inherit_mutex_add(mutex);
    }
    else
      tb_inherit_mutex_sched(mutex, self);
    tb_futex_unlock(&mutex->internal_futex);
    if(locked)
      return 0;
    SYSCALL3(__NR_futex, &mutex->futex, FUTEX_WAIT, 1);
  }
}

static int unlock_prio_inherit(tbthread_mutex_t *mutex)
{
  tb_futex_lock(&mutex->internal_futex);
  tb_inherit_mutex_unsched(mutex);
  mutex->owner = 0;
  mutex->futex = 0;
  SYSCALL3(__NR_futex, &mutex->futex, FUTEX_WAKE, 1);
  tb_futex_unlock(&mutex->internal_futex);
  return 0;
}
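tb_inherit_mutex_sched, called on the contended path above, is where the priority bump happens. A minimal sketch of the idea, with the unpacking macros and the exact bookkeeping assumed rather than taken from the sources:

/* Sketch only: if the thread about to block outranks the current owner, raise
   the owner's priority to match. SCHED_INFO_PRIORITY/SCHED_INFO_POLICY are
   assumed counterparts of SCHED_INFO_PACK; the real code also has to remember
   the boost so it can be dropped again in tb_inherit_mutex_unsched. */
static void tb_inherit_mutex_sched(tbthread_mutex_t *mutex, tbthread_t waiter)
{
  int waiter_prio = SCHED_INFO_PRIORITY(waiter->sched_info);
  int owner_prio  = SCHED_INFO_PRIORITY(mutex->owner->sched_info);
  if(waiter_prio > owner_prio)
    tb_set_sched(mutex->owner, SCHED_INFO_POLICY(waiter->sched_info), waiter_prio);
}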
It was by far the most challenging part so far. See the patch at GitHub.
Remaining functions
- pthread_setconcurrency - defines how many kernel tasks should be created to handle the user-level threads. It does not make sense in our case because we create a kernel task for every thread.
- pthread_attr_setscope - defines the set of threads against which the thread will compete for resources. There are two settings: PTHREAD_SCOPE_SYSTEM, meaning all the threads in the entire system, and PTHREAD_SCOPE_PROCESS, meaning only the threads within the process. The man page says that Linux only supports PTHREAD_SCOPE_SYSTEM, but I am not sure whether that is still the case with all the cgroups stuff.
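For completeness, this is how the scope is selected through the POSIX API; on Linux, glibc rejects PTHREAD_SCOPE_PROCESS with ENOTSUP:

#include <pthread.h>

pthread_attr_t attr;
pthread_attr_init(&attr);

/* The only scope Linux supports; PTHREAD_SCOPE_PROCESS fails with ENOTSUP. */
pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);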