- Authors: J.-P. Lozi, B. Lepers, J. Funston, F. Gaud, V. Quéma, and A. Fedorova
- Source: http://www.i3s.unice.fr/~jplozi/wastedcores/
The paper shows that, despite the often-repeated mantra, OS task scheduling is far from easy. The authors developed two new tools to investigate CPU usage and the state of the associated run queues. These allowed them to uncover four interesting performance bugs on a 64-core NUMA system. They discovered that cores often stay idle for a long time while tasks wait in other run queues. This violates one of the design principles of the Completely Fair Scheduler, the Linux default, which is supposed to be work-conserving. Fixing these bugs resulted in a 138x speedup in an extreme test case (multithreaded, using spinlocks) and 13-23% speedups in other test cases. This type of bug is hard to uncover because it typically wastes cycles hundreds of milliseconds at a time, which is below the resolution of standard monitoring tools.
Completely Fair Scheduler
CFS defines an interval in which each task must run at least once. This interval is then divided between all the tasks in proportion to their weight (niceness). A running thread accumulates vruntime, which is the amount of time it has been running divided by its weight. The scheduler keeps these tasks in a run queue implemented as a red-black tree. When a core needs a new task to run, it picks the leftmost node, because that task has accumulated the least weighted runtime.
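The pick-next logic above can be sketched as follows. This is a hypothetical toy model, not kernel code: it uses a plain list and `min` where the kernel uses a red-black tree, and the weights and time slices are made up.

```python
# Toy model of CFS task selection: the task with the least vruntime runs next.
class Task:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight      # derived from niceness
        self.vruntime = 0.0       # weighted runtime accumulated so far

def pick_next(run_queue):
    """Pick the task with the least vruntime (the leftmost tree node in CFS)."""
    return min(run_queue, key=lambda t: t.vruntime)

def account(task, ran_for):
    """A running task accumulates runtime divided by its weight."""
    task.vruntime += ran_for / task.weight

run_queue = [Task("a", 1.0), Task("b", 2.0)]
for _ in range(3):
    t = pick_next(run_queue)
    account(t, ran_for=10.0)      # the heavier task "b" gets picked more often
```

Because vruntime grows more slowly for heavier tasks, they naturally receive a proportionally larger share of the interval.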
In a multi-core system, each core has its own run queue. To distribute the load fairly among the cores, the run queues must be periodically re-balanced. In today's systems, with dozens of run queues, the balancing procedure is expensive and is not run very often. This is because it must take into account other factors as well, such as power saving and cache and memory locality. The load balancing algorithm takes threads from the most loaded cores and distributes them among the least loaded cores, taking the topology of the system into account. The more complex the system gets, the more rules need to be applied and the harder it becomes to reason about performance.
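The core idea of the balancing step can be sketched as below. This is a deliberately flat sketch: the real algorithm is hierarchical (cores, packages, NUMA nodes) and weighs locality and power, none of which is modeled here.

```python
# Hypothetical sketch of the basic balancing move: shift waiting threads
# from the most loaded run queue to the least loaded one.
def rebalance(run_queues):
    """run_queues: one list of task names per core."""
    busiest = max(run_queues, key=len)
    idlest = min(run_queues, key=len)
    # Move threads until the two queues differ by at most one task.
    while len(busiest) > len(idlest) + 1:
        idlest.append(busiest.pop())

queues = [["t1", "t2", "t3", "t4"], [], ["t5"]]
rebalance(queues)   # the idle core now has work to do
```

Even this trivial version shows why the procedure is expensive at scale: it needs a global view of all run queues, which is exactly what makes frequent re-balancing costly on large machines.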
The bugs and the tools
The bugs uncovered by the authors are all related to migrating tasks between NUMA nodes. They were detected using new tools:
- Sanity Checker checks every second whether there are idle cores while threads are waiting in other run queues. If there are, it monitors the system for another 100ms. If the situation is not remedied, it starts recording profiling information for further offline analysis.
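The invariant the Sanity Checker enforces can be stated very compactly. A small sketch, with hypothetical names, where each core is represented by the length of its run queue (0 means idle, more than 1 means threads are waiting):

```python
# The work-conservation invariant: no core should sit idle while another
# core's run queue holds waiting threads.
def violates_work_conservation(run_queue_lengths):
    """True if some core is idle while another core has threads waiting."""
    return 0 in run_queue_lengths and any(n > 1 for n in run_queue_lengths)

violates_work_conservation([0, 3, 1])   # idle core + backlog: violation
violates_work_conservation([1, 1, 1])   # everyone busy, no backlog: fine
```

The actual tool adds the 100ms grace period and profiling on top of this check, since a transient imbalance right before a balancing pass is expected.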
- The scheduler visualization tool taps into various kernel functions to monitor and visualize scheduling activity over time.
The authors note that the problems were caused by people trying to optimize CFS to compensate for the complexity of modern hardware. They suggest rewriting the scheduler as a minimal core plus a set of optimization modules.
In the ancient past, around the time we were designing EOS at CERN, I read about half a gazillion papers about mass storage systems, high scalability, high availability and such. Having a horribly bad memory, I made a lot of notes with key take-aways, mostly on paper and mostly in the form of bullet points. All these notes were extremely useful to me at the time. Now I need to revise all of this information, so I'll digitize the notes and, since they are likely generally useful, I'll put them here as well.
Let's start with Haystack. The original USENIX paper by D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel is here.
- store 260 billion images / 26 PB
- add 1 billion / 60 TB every week
- serve 1 million images per second at peak
- an object store
- write once, read often, delete rarely
- high throughput and low latency
- fault tolerance
- straightforward to implement and maintain
- long tail: requests for less popular photos (i.e. the older ones) constitute a significant amount of the traffic
The original solution consisted of a bunch of photo servers connected to a bunch of NAS nodes, with a CDN in front of the system. Caching a significant part of the photos at the CDN is not economically viable, so the photo servers needed to handle a significant amount of traffic due to the long tail problem. With the NAS solution, reading the (in this case useless) file system metadata is a throughput bottleneck, because several disk operations are needed before the photo data can be read: translate the file name to an i-node number, retrieve the i-node, and finally retrieve the file data. The i-node metadata takes several hundred bytes of memory and there are a lot of files, so caching the metadata in RAM is infeasible.
The system has 3 components: the Directory, the Cache and the Store. Multiple
photos are stored as sequential records in large files (~ 100GB each) called
physical volumes. A group of physical volumes storing the same content and
located at different machines is called a logical volume. When a user visits
a page, the web server uses the Directory to construct a URL for each photo:
http://<CDN>/<cache>/<machine id>/<logical volume, photo id>:
- The CDN looks up the photo using <logical volume, photo id>. If it fails, it strips the CDN part of the URL and contacts the Cache.
- The Cache does a similar look-up to find the photo. If it fails, it strips the cache address and requests the photo from the specified store machine.
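The URL-peeling behavior of the tiers can be sketched as below. The URL layout follows the scheme above; the helper name and the concrete host names are hypothetical.

```python
# Each tier strips its own routing component off the front of the path:
# http://<CDN>/<cache>/<machine id>/<logical volume, photo id>
def strip_tier(url):
    """Drop the next routing component; return (component, remaining URL)."""
    scheme, _, rest = url.partition("://")
    component, _, path = rest.partition("/")
    return component, scheme + "://" + path

url = "http://cdn.example/cache42/store7/lv12,photo99"
cdn, url = strip_tier(url)     # CDN misses -> forward the rest to the Cache
cache, url = strip_tier(url)   # Cache misses -> forward the rest to the Store
```

The nice property of this scheme is that the URL itself encodes the full fallback chain, so no tier needs extra routing state to forward a miss.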
The directory holds the cluster metadata information such as the state of the store machines and mapping from logical to physical volumes. It is implemented in PHP with a sharded, replicated and memcached MySQL in the back-end. Its primary purpose is to determine whether photo requests should be handled by the CDN or the Haystack Cache and to construct the image URLs for page requests. It also does load-balancing for writes across logical volumes.
The Cache is a kind of internal CDN, handling requests both from the CDNs and from users. It is essentially a distributed hash table with photo ids as keys. If a photo is not in the cache, it is fetched from the Store. Photos are cached only when:
- the request comes directly from a user - it makes little sense to also cache requests coming from CDNs
- the photo is fetched from a write-enabled store machine - photos are most heavily accessed soon after they are uploaded.
The store has a very basic interface: read a photo by id from a particular
logical volume, from a particular store machine. Each store machine manages
multiple logical volumes by hosting one of their replicas (physical volumes).
A physical volume consists of a superblock and a sequence of millions of
needles. The volume file is always open and the system holds an in-memory
mapping between photo ids and offsets in the store file. They need 10 bytes
of metadata per needle; by comparison
xfs_inode_t takes 536 bytes of RAM.
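The index described above can be sketched in a few lines. This is a hypothetical simplification (the real index also tracks flags and cookies), but it shows why 10 bytes per needle matters: the whole mapping fits in RAM, so locating a photo costs one memory lookup instead of several disk seeks for file system metadata.

```python
# Toy model of a store machine's in-memory index: photo id -> location
# of the needle inside the always-open physical volume file.
index = {}

def record_needle(photo_id, offset, size):
    """Remember where a needle lives in the volume file."""
    index[photo_id] = (offset, size)

def locate(photo_id):
    """One RAM lookup replaces the name->i-node->data chain of a NAS read."""
    return index.get(photo_id)

record_needle(99, offset=4096, size=120_000)
```

At roughly 10 bytes per entry, a billion needles need on the order of 10 GB of RAM; at 536 bytes per xfs_inode_t the same metadata would need over 500 GB.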
Each needle has a random cookie value stored with it. For the system to authorize a read, the cookie provided by the user must match the cookie stored with the needle. This prevents unauthorized access by URL-guessing.
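The check itself is trivial; a minimal sketch with hypothetical helper names (the paper does not specify the cookie size or comparison details):

```python
import os
import hmac

def new_cookie():
    """A random value generated at upload time and stored with the needle."""
    return os.urandom(8)

def authorize(request_cookie, needle_cookie):
    # A constant-time compare avoids leaking the cookie byte by byte.
    return hmac.compare_digest(request_cookie, needle_cookie)

cookie = new_cookie()
authorize(cookie, cookie)            # legitimate URL: allowed
authorize(os.urandom(8), cookie)     # guessed URL: rejected
```

Since photo ids are sequential and easy to enumerate, the cookie is what keeps a valid-looking URL from leaking someone else's photo.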
The web server provides the store with a logical volume id, a key, a cookie and the data to be written to all the store machines hosting the given logical volume. Each store machine synchronously appends the needle to its physical volume file and updates the in-memory metadata. Users typically upload entire albums at once, so the writes can easily be bulked. Modifications are handled by appending a new needle with the same key; deletions by setting the delete flag both in memory and in the file. The space wasted this way is later reclaimed by an asynchronous compaction process.
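The append/shadow/compact lifecycle can be sketched as below. This models the volume as a Python list of needle records rather than a real file, and all names are hypothetical; the point is the interplay between the append-only log and the in-memory index.

```python
# Toy model of a physical volume: an append-only log of needle records,
# plus an index mapping each key to its latest needle.
volume = []     # append-only "physical volume file"
index = {}      # photo id -> position of the latest needle

def write(key, data):
    volume.append({"key": key, "data": data, "deleted": False})
    index[key] = len(volume) - 1       # a new needle shadows any older one

def delete(key):
    volume[index[key]]["deleted"] = True   # flag set "in the file"...
    del index[key]                         # ...and in memory

def compact():
    """Reclaim space by keeping only live, most-recent needles."""
    live = [n for i, n in enumerate(volume)
            if not n["deleted"] and index.get(n["key"]) == i]
    volume[:] = live
    for i, n in enumerate(volume):
        index[n["key"]] = i

write("p1", b"old")
write("p1", b"new")    # modification = new needle, same key
write("p2", b"x")
delete("p2")
compact()              # only the latest live needle for "p1" survives
```

Keeping writes strictly append-only is what makes the synchronous write path a single sequential disk operation; the cost of shadowed and deleted needles is deferred to the background compactor.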
XFS is used as the underlying file system because of its small block maps for contiguous files and efficient pre-allocation of space. RAID6 is used for fault tolerance and performance gains.
Faults are detected by a heartbeat monitor that periodically tries to read random photos from random machines. The detected issues are then addressed manually.
- The number of photos written is 12 times the number of photos uploaded: 4 sizes in 3 locations.
- 10% of all requests come from CDNs.