Intro

The paper shows that, despite the often-repeated mantra, OS task scheduling is far from easy. The authors developed two new tools to investigate CPU usage and the state of the per-core run queues, which allowed them to uncover four interesting performance bugs on a 64-core NUMA system. They discovered that cores often stay idle for a long time while runnable tasks wait in other run queues. This violates a design principle of the Completely Fair Scheduler, the Linux default, which is supposed to be work-conserving. Fixing these bugs resulted in a 138x speedup in an extreme test case (multithreaded, using spinlocks) and 13-23% speedups in other test cases. Bugs of this kind are hard to uncover because they typically waste cycles a few hundred milliseconds at a time, too short to show up in standard monitoring tools.

Completely Fair Scheduler

CFS defines an interval in which each runnable task must run at least once. This interval is divided among the tasks in proportion to their weight, which is derived from their niceness. A running task accumulates vruntime, the amount of time it has been running divided by its weight. The scheduler keeps the runnable tasks in a run queue implemented as a red-black tree ordered by vruntime. Whenever a core needs to pick the next task, it takes the leftmost node, i.e. the task that has accumulated the least weighted runtime.
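
To make the vruntime idea concrete, here is a minimal user-space sketch in C. It is not kernel code: the weights follow the usual nice-to-weight table, the red-black tree is replaced by a linear scan, and the task structure is invented for illustration; only the selection rule (pick the task with the smallest vruntime) matches what CFS does.

/*
 * Minimal sketch of the vruntime idea, not the actual kernel code.
 * Weights are illustrative values from the nice-to-weight table.
 */
#include <stdio.h>

#define NICE_0_WEIGHT 1024ULL          /* weight of a nice-0 task */

struct task {
    const char *name;
    unsigned long weight;              /* higher weight = larger CPU share */
    unsigned long long vruntime;       /* weighted runtime, in ns */
};

/* Charge 'delta_ns' of real runtime to a task: vruntime grows more
 * slowly for heavier (higher-priority) tasks. */
static void account_runtime(struct task *t, unsigned long long delta_ns)
{
    t->vruntime += delta_ns * NICE_0_WEIGHT / t->weight;
}

/* The real scheduler keeps runnable tasks in a red-black tree keyed by
 * vruntime and picks the leftmost node; a linear scan is enough to show
 * the selection rule. */
static struct task *pick_next(struct task *tasks, int n)
{
    struct task *next = &tasks[0];
    for (int i = 1; i < n; i++)
        if (tasks[i].vruntime < next->vruntime)
            next = &tasks[i];
    return next;
}

int main(void)
{
    struct task tasks[] = {
        { "nice-0 task", 1024, 0 },
        { "nice-5 task",  335, 0 },    /* lower weight: vruntime grows faster */
    };

    /* Run both tasks for the same wall-clock time (10 ms each)... */
    account_runtime(&tasks[0], 10000000ULL);
    account_runtime(&tasks[1], 10000000ULL);

    /* ...the lighter task now has the larger vruntime, so the heavier
     * (higher-priority) task is picked next. */
    printf("next: %s\n", pick_next(tasks, 2)->name);
    return 0;
}

Compiled with a plain C compiler, this prints "next: nice-0 task": both tasks ran for the same real time, but the nice-5 task was charged roughly three times as much vruntime for it.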

In a multi-core system, each core has its own run queue. To distribute the load fairly among the cores, the run queues must be periodically rebalanced. In today's systems, with dozens of run queues, the balancing procedure is expensive and therefore not run very often, partly because it must also take other factors into account, such as power saving and cache and memory locality. The load-balancing algorithm takes threads from the most loaded cores and distributes them among the least loaded ones, following the topology of the system. The more complex the system gets, the more rules have to be applied and the harder it becomes to reason about performance.
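
The sketch below illustrates the basic pull-from-busiest idea in C, assuming a flat array of per-core run queues and counting only the number of runnable tasks; the real balancer works on a hierarchy of scheduling domains and weighs priorities, cache hotness and NUMA placement before migrating anything.

/*
 * Highly simplified sketch of periodic load balancing over a flat
 * array of per-core run queues; not how the kernel actually does it.
 */
#include <stdio.h>

#define NR_CORES 8

struct run_queue {
    int nr_running;                    /* runnable tasks queued on this core */
};

static struct run_queue rq[NR_CORES];

/* Move tasks from the busiest core to the idlest one until the two are
 * roughly balanced. A real balancer would also consider task weight,
 * cache hotness and NUMA placement before migrating anything. */
static void rebalance(void)
{
    int busiest = 0, idlest = 0;

    for (int cpu = 1; cpu < NR_CORES; cpu++) {
        if (rq[cpu].nr_running > rq[busiest].nr_running)
            busiest = cpu;
        if (rq[cpu].nr_running < rq[idlest].nr_running)
            idlest = cpu;
    }

    while (rq[busiest].nr_running - rq[idlest].nr_running > 1) {
        rq[busiest].nr_running--;      /* "migrate" one task */
        rq[idlest].nr_running++;
    }
}

int main(void)
{
    /* One core is overloaded, another sits idle, the rest run one task. */
    rq[0].nr_running = 6;
    rq[5].nr_running = 0;
    for (int cpu = 1; cpu < NR_CORES; cpu++)
        if (cpu != 5)
            rq[cpu].nr_running = 1;

    rebalance();

    for (int cpu = 0; cpu < NR_CORES; cpu++)
        printf("cpu%d: %d runnable\n", cpu, rq[cpu].nr_running);
    return 0;
}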

The bugs and the tools

The bugs uncovered by the authors are all related to migrating tasks between NUMA nodes. They were detected using two new tools:

  • Sanity Checker checks every second whether there are idle cores while threads are waiting in other run queues. If there are, it monitors the system for another 100 ms; if the situation has not been remedied by then, it starts recording profiling information for further off-line analysis (see the sketch after this list).
  • The scheduler visualization tool taps into various kernel functions to monitor and visualize scheduling activity over time.
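
Below is a rough user-space sketch in C of the invariant the Sanity Checker enforces. The per-core counters and start_profiling() are stand-ins for the authors' kernel-side instrumentation; only the control flow (check every second, re-check after 100 ms, then record) follows the paper's description.

/*
 * Sketch of the work-conservation check behind the Sanity Checker.
 * The nr_runnable array and start_profiling() are placeholders for the
 * real tool's kernel instrumentation and profiling hook.
 */
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define NR_CORES 64
#define RECHECK_DELAY_US (100 * 1000)  /* 100 ms grace period */

/* Runnable threads queued per core (0 means the core is idle). */
static int nr_runnable[NR_CORES];

static void start_profiling(void)      /* placeholder for off-line capture */
{
    fprintf(stderr, "recording profiling data for off-line analysis\n");
}

/* True if some core is idle while another core has threads waiting in
 * its run queue, i.e. the work-conserving invariant is violated. */
static bool invariant_violated(void)
{
    bool has_idle = false, has_waiting = false;

    for (int cpu = 0; cpu < NR_CORES; cpu++) {
        if (nr_runnable[cpu] == 0)
            has_idle = true;
        else if (nr_runnable[cpu] > 1)
            has_waiting = true;
    }
    return has_idle && has_waiting;
}

void sanity_checker_loop(void)
{
    for (;;) {
        sleep(1);                      /* check once per second */
        if (!invariant_violated())
            continue;

        usleep(RECHECK_DELAY_US);      /* give the balancer 100 ms to react */
        if (invariant_violated())
            start_profiling();         /* still broken: capture evidence */
    }
}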

Conclusion

The authors note that the problems were caused by optimizations added to CFS to compensate for the complexity of modern hardware. They suggest rewriting the scheduler as a small core plus a set of optimization modules.

If you like this kind of content, you can subscribe to my newsletter, follow me on Twitter, or subscribe to my RSS channel.