In the ancient past, around the time when we were designing EOS at CERN, I read around half a gazillion papers about mass storage systems, high scalability, high availability and such. Having a horribly bad memory, I made a lot of notes with key take-aways, mostly on paper and mostly in the form of bullet points. All these notes were extremely useful to me at the time. Now I need to revisit all of this information, so I'll digitize the notes and, since they are likely generally useful, I'll put them here as well.
Let's start with Haystack. The original USENIX paper by D. Beaver, S. Kumar, H. C. Li, J. Sobel, P. Vajgel is here.
- store 260 billion images / 26 PB
- add 1 billion / 60 TB every week
- serve 1 million images per second at peak
- an object store
- write once, read often, delete rarely
- high throughput and low latency
- fault tolerance
- straightforward to implement and maintain
- long tail: requests for less popular photos (i.e. the older ones) constitute a significant amount of the traffic
The original solution consisted of a bunch of photo servers connected to a bunch of NAS nodes, with a CDN in front of the system. Caching a significant part of the photos at the CDN is not economically viable, so the photo servers needed to handle a significant amount of traffic due to the long tail problem. With the NAS solution, reading the file system metadata, which is useless in this case, is a throughput bottleneck. This is because several disk operations are needed before the photo data can be read: translate the file name to an i-node number, retrieve the i-node, retrieve the file data. The i-node metadata takes several hundred bytes of memory and there are a lot of files, so caching the metadata in RAM is infeasible.
The system has 3 components: the Directory, the Cache and the Store. Multiple photos are stored as sequential records in large files (~100 GB each) called physical volumes. A group of physical volumes storing the same content and located on different machines is called a logical volume. When a user visits a page, the web server uses the Directory to construct a URL for each photo of the form http://<CDN>/<cache>/<machine id>/<logical volume, photo id>, which is then resolved in steps:
- The CDN looks up the photo using <logical volume, photo id>. If it fails, it strips the CDN address from the URL and contacts the Cache.
- The Cache does a similar look-up to find the photo. If it fails, it strips the cache address and requests the photo from the specified store machine.
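To make the URL scheme above concrete, here is a minimal sketch of how such a URL could be built and then peeled one hop at a time. The function names, the example host names and the comma used to join the logical volume and the photo id are my own assumptions for illustration; the notes only give the overall shape of the URL.

```python
def build_url(cdn: str, cache: str, machine_id: str,
              logical_volume: int, photo_id: int) -> str:
    """Assemble http://<CDN>/<cache>/<machine id>/<logical volume, photo id>.

    Joining the last two fields with a comma is an assumption.
    """
    return f"http://{cdn}/{cache}/{machine_id}/{logical_volume},{photo_id}"


def strip_first_hop(url: str) -> str:
    """Drop the leading address so the next tier in line gets the request."""
    scheme, rest = url.split("://", 1)
    _current_hop, remainder = rest.split("/", 1)
    return f"{scheme}://{remainder}"


def lookup_key(url: str) -> tuple[int, int]:
    """Extract (logical volume, photo id), the key every tier looks up."""
    volume, photo = url.rsplit("/", 1)[1].split(",")
    return int(volume), int(photo)


# Hypothetical host names, used only for the example.
url = build_url("cdn.example", "cache7", "store42", 12, 3456)
after_cdn_miss = strip_first_hop(url)               # request goes to the Cache
after_cache_miss = strip_first_hop(after_cdn_miss)  # request goes to the Store
```

Each miss removes one address from the front, while the `<logical volume, photo id>` pair at the end stays untouched, so every tier can use the same look-up key.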
The directory holds the cluster metadata information such as the state of the store machines and mapping from logical to physical volumes. It is implemented in PHP with a sharded, replicated and memcached MySQL in the back-end. Its primary purpose is to determine whether photo requests should be handled by the CDN or the Haystack Cache and to construct the image URLs for page requests. It also does load-balancing for writes across logical volumes.
The cache is kind of an internal CDN handling requests from the CDNs and from users. It essentially is a distributed hash table with photo ids as keys. If the photo is not in the cache, then it's fetched from the store. Photos are cached only when:
- request comes directly from a user - it makes little sense to cache requests coming from CDNs
- the photo is fetched from a write-enabled store - photos are most heavily accessed soon after they are uploaded.
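The two caching conditions boil down to a one-line predicate. Below is a rough sketch of a cache tier applying it; the class, the method names and the `fetch_from_store` callback are hypothetical, and a plain dict stands in for the distributed hash table.

```python
from typing import Callable, Dict


def should_cache(from_cdn: bool, store_write_enabled: bool) -> bool:
    """Cache only direct user requests served by a write-enabled store."""
    return (not from_cdn) and store_write_enabled


class HaystackCacheSketch:
    """Toy stand-in for the Cache: a dict keyed by photo id."""

    def __init__(self, fetch_from_store: Callable[[int], bytes]):
        self.fetch_from_store = fetch_from_store  # hypothetical store client
        self.photos: Dict[int, bytes] = {}

    def get(self, photo_id: int, from_cdn: bool,
            store_write_enabled: bool) -> bytes:
        if photo_id in self.photos:
            return self.photos[photo_id]
        data = self.fetch_from_store(photo_id)
        if should_cache(from_cdn, store_write_enabled):
            self.photos[photo_id] = data
        return data


cache = HaystackCacheSketch(lambda photo_id: b"photo-%d" % photo_id)
cache.get(1, from_cdn=True, store_write_enabled=True)    # not kept
cache.get(2, from_cdn=False, store_write_enabled=False)  # not kept
cache.get(3, from_cdn=False, store_write_enabled=True)   # kept
```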
The store has a very basic interface: read a photo by id from a particular logical volume, from a particular store machine. Each store machine manages multiple logical volumes by hosting one of their replicas (physical volumes). A physical volume consists of a superblock and a sequence of millions of needles. The volume file is always open and the system holds an in-memory mapping between photo ids and offsets in the store file. They need 10 bytes of metadata per needle; by comparison, xfs_inode_t takes 536 bytes of RAM.
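A back-of-the-envelope calculation shows why the per-needle footprint matters. It combines the 260 billion images quoted at the top of these notes with the two per-entry sizes above. Treating every image as exactly one entry and ignoring replication is a simplification of mine, not something taken from the paper.

```python
IMAGES = 260 * 10**9          # stored images, from the figures above
HAYSTACK_BYTES_PER_NEEDLE = 10
XFS_INODE_BYTES = 536         # size of xfs_inode_t quoted above
TB = 10**12                   # decimal terabyte


def metadata_bytes(entries: int, bytes_per_entry: int) -> int:
    """Total RAM needed to keep one metadata entry per stored image."""
    return entries * bytes_per_entry


haystack_total = metadata_bytes(IMAGES, HAYSTACK_BYTES_PER_NEEDLE)
xfs_total = metadata_bytes(IMAGES, XFS_INODE_BYTES)

print(f"Haystack index: {haystack_total / TB:.2f} TB")
print(f"XFS i-nodes:    {xfs_total / TB:.2f} TB")
print(f"ratio:          {xfs_total / haystack_total:.1f}x")
```

Under these assumptions the needle index needs 2.6 TB across the whole cluster, against roughly 139 TB for cached i-nodes, a factor of 53.6.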
Each needle has a random cookie value stored with it. In order for the system to authorize a read operation, the cookie provided by the user needs to match the cookie stored with the needle. This prevents unauthorized access by URL guessing.
The web server provides the store with a logical volume id, a key, a cookie and the data to be written to all the store machines hosting the given logical volume. Each store machine synchronously appends the needle to its physical volume file and updates the in-memory metadata. Users typically upload entire albums at once, so the writes can easily be batched. Modifications are handled by adding a new needle with the same key; deletions by setting the delete flag both in memory and in the file. The space wasted this way is then reclaimed by an asynchronous compaction process.
XFS is used as the underlying file system because of small block maps for contiguous files and efficient pre-allocation of space. RAID6 is used for fault tolerance and performance gains.
Faults are detected by a heartbeat monitor that tries to read random photos from random machines. The detected issues are then addressed manually.
- The number of photos written is 12 times the number of photos uploaded: 4 sizes in 3 locations.
- 10% of all requests come from CDNs.