Content from 2015-12


In the ancient past, around the time when we were designing EOS at CERN, I read around half a gazzilion of papers about mass storage systems, high scalability, high availability and such. Having a horribly bad memory, I made a lot of notes with key take-aways, mostly on paper and mostly in the form of bullet points. All these notes were extremely useful to me at the time. Now, I need to revise all of this information, so I'll digitize them all and, since they are likely generally useful, I'll put them here as well.

Let's start with Haystack. The original USENIX paper by D. Beaver, S. Kumar, H. C. Li, J. Sobel, P. Vajgel is here.

The Problem

  • store 260 billion images / 26 PB
  • add 1 billion / 60 TB every week
  • serve 1 million images per second at peak

Design Constraints

  • an object store
  • write once, read often, delete rarely
  • high throughput and low latency
  • fault tolerance
  • straight-forward to implement and maintain
  • long tail: requests for less popular photos (ie. the older ones) constitute constitute a significant amount of the traffic

Original Solution

The original solution constituted of a bunch of photo servers connected to a bunch of NAS nodes. There was a CDN in front of the system. Caching a significant part of the photos at CDN is not economically viable, so the photo servers needed to handle a significant amount of traffic due to the long tail problem. With the NAS solution, reading the useless (in this case) file system metadata is a throughput bottleneck. This is because several disk operations are needed before the photo data can be read: translate file name to i-node number, retrieve the i-node, retrieve the file data. The i-node metadata takes several hundred bytes of memory and there is a lot of files, so caching the metadata in RAM is infeasible.

New Solution

The system has 3 components: the Directory, the Cache and the Store. Multiple photos are stored as sequential records in large files (~ 100GB each) called physical volumes. A group of physical volumes storing the same content and located at different machines is called a logical volume. When a user visits a page, the web server uses the Directory to construct a URL for each photo: http://<CDN>/<cache>/<machine id>/<logical volume, photo id>:

  • The CDN looks up for the photo using <logical volume, photo id>. If it fails, it strips the CDN URL and contacts the cache.
  • The Cache does a similar look-up to find the photo. If it fails, it strips the cache address and requests the photo from the specified store machine.

Haystack Directory

The directory holds the cluster metadata information such as the state of the store machines and mapping from logical to physical volumes. It is implemented in PHP with a sharded, replicated and memcached MySQL in the back-end. Its primary purpose is to determine whether photo requests should be handled by the CDN or the Haystack Cache and to construct the image URLs for page requests. It also does load-balancing for writes across logical volumes.

Haystack Cache

The cache is kind of an internal CDN handling requests from the CDNs and from users. It essentially is a distributed hash table with photo ids as keys. If the photo is not in the cache, then it's fetched from the store. Photos are cached only when:

  • request comes directly from a user - it makes little sense to cache requests coming from CDNs
  • the photo is fetched from w write-enabled store - photos are most heavily accessed soon after they are uploaded.

Haystack Store

The store has a very basic interface: read a photo by id from a particular logical volume, from a particular store machine. Each store machine manages multiple logical volumes by hosting one of their replicas (physical volumes). A physical volume consists of a superblock and a sequence of millions of needles. The volume file is always open and the system holds an in-memory mapping between photo ids and offsets in the store file. They need 10 bytes of metadata per needle; by comparison xfs_inode_t takes 536 bytes of RAM.

Each needle has a random cookie value stored with it. In order for the system to authorize a read operation, the cookie provided by the user needs to match the cookie stored with the needle. This prevents unauthorized accesses by URL- guessing.

The web server provides the store with a logical volume id, a key, a cookie and the data to be written to all the store machines hosting given logical volume. Each store machine synchronously appends the needle to its physical volume file and updates the in-memory metadata. Users typically update entire albums at once so the writes can easily be bulked. Modifications are handled by adding a new needle with the same key; deletions by setting the delete flag both in memory and in the file. The space wasted this way is then reclaimed by an asynchronous compaction process.

XFS is used as the underlying file system because of small block maps for contiguous files and efficient pre-allocation of space. RAID6 is used for fault tollerance and performance gains.

Faults are detected by a hartbeat monitor trying to randomly read photos from random machines. The detected issues are then addressed manually.


  • The number of photos written is 12 times the number of photos uploaded: 4 sizes in 3 locations.
  • 10% of all requests come from CDNs.

Within a couple weeks of learning Lisp I found programming in any other language unbearably constraining. Paul Graham

The more I play with Lisp, the more I agree with the statement above. The truth is, however, that the barrier to entry is incredibly high for Lisp. This is probably due to the vicious circle: there is fairly few people programming in it in comparison to other languages, so there is relatively little documentation over the Internet; the lack of documentation, in turn, makes it hard for new people to start using Lisp. The effect of this is twofold. If you search for the answers, they are likely not readily available, except in the code itself - this may be pretty frustrating and cause projects to move slowly. On the other hand, the people who make it past the obstacles are pretty good at what they do, so the community is great and the quality of the work they do can be mind-blowing.

But let's get back to the beginning. I want to start writing web apps in Common Lisp. After some research, it turned out that people recommend using one of the frameworks based on Clack for this purpose. I went to the web site, found an example and typed it into repl.

1(ql:quickload :clack)
4  (lambda (env)
5    '(200
6      (:content-type "text/plain")
7      ("Hello, Clack!"))))

Within seconds I had Hunchentoot running and all was fine and dandy. It was, in fact, much easier than what I typically have to do when I am dealing with Python. Before I could go any further, though, I needed to figure out how I deploy this in production in my hosting environment, with Apache as the web server. And here's where the fun begins. :) Apparently, I can run Clack in FastCGI mode, so, in principle, I can just pass :server :fcgi to clackup and be all set. Not so, unfortunately. Clack's fastcgi module, by default, opens a TCP listening socket and expects the web server to connect to it. Apache, however, can communicate with FastCGI only via Unix domain sockets. This is not a problem, because I can pass an :fd parameter to clackup telling it to communicate over this descriptor instead of a TCP connection. All I need to do, is to figure out the path to the socket, open it, and pass the descriptor to Clack.

So, how do you figure out where Apache keeps it's FastCGI unix sockets? The documentation won't tell you, and even if it does, how do you know which one you should connect to? Let's look at the relevant part of the source of Apache's mod_fcgid:

1        || (rv = apr_os_file_put(&file, &unix_socket, 0,
2                                 procnode->proc_pool)) != APR_SUCCESS
3        || (rv = apr_procattr_child_in_set(procattr, file, NULL)) != APR_SUCCESS) {
4        ap_log_error(APLOG_MARK, APLOG_ERR, rv, procinfo->main_server,
5                     "mod_fcgid: couldn't set child process attributes: %s",
6                     unix_addr.sun_path);
7        close(unix_socket);
8        return rv;

It seems that Apache will spawn a child process, replace it's stdin with the unix socket and execve the child's executable. All I need to do then, is pass :fd 0 to Clack to make things work. And indeed they do. After compiling the script below, I could see "Hello, Clack!" with Apache.

 1(ql:quickload 'clack :silent t)
 2(ql:quickload 'lack-middleware-backtrace :silent t)
 3(ql:quickload 'clack-handler-fcgi :silent t)
 5(defun main ()
 6  (clack:clackup
 7    (lambda (env)
 8      '(200
 9        (:content-type "text/plain")
10        ("Hello, Clack!")))
11   :server :fcgi
12   :use-thread nil
13   :fd 0))
15(sb-ext:save-lisp-and-die "test.fcgi" :toplevel #'main :executable t)

This is a fairly trivial thing and it still took me around half an hour to figure out. I am a fairly experienced engineer, but this would likely be an impossible obstacle for someone rather new to programming. By comparison, you can write a web "Hello, World!" in Python or Ruby within minutes without much prior experience.


Following a piece of advice from a friend, I decided to buy this new domain name and start writing down all the cool things I do. I have written a bit in a bunch of other places before and have other quasi-failed blogs, so I actually already have a bit of content to bootstrap this one.

There's plenty of blogware options out there, but, as a programmer, I like the ones that keep the content in a version control system intended for software. From the alternatives available in this department, I decided to go for c()╬╗eslaw. It's kind of similar to Jekyll, which I have used before, and it's written in Common Lisp, which, typically, is a good sign in general.

There is very little instruction over the Internet on how to use it, but it's not hard to figure out after reading the code. This post is a brief summary of what I have done to create this website and convert the content from Jekyll.

Site Structure

The first thing that you need to do is to create a .coleslawrc file describing the layout of the site, the theme to be used to render the final HTML, and other such things. There's a good example here and you can get the full picture by reading the source. :) I like to change the separator (:separator "---"), so that --- is used to distinguish the metadata from the content section in source files, this makes things look the Jekyll way. The static-pages plugin, makes it possible to create content other that blog posts and indices.

Coleslaw will search the repo for files ending with .post (and .page if the static-pages plugin is enabled) and run them through the renderer selected in the page's metadata section. It will generate the indices automatically and copy verbatim everything it finds in the static directory.

You can create our own theme following the rules described here or choose something from the built-in options. I built the theme you see here more or less from scratch using Bootstrap and the live customizer to tweak the colors. It was a fairly easy and pleasant exercise.

In the end, the resulting directory structure looks roughly like this:

==> find

The first few lines of the post you are reading right now look like this:

title: Blogging with Coleslaw
date: 2015-12-07
tags: blogging, lisp, programming, linux, sbcl
format: md


Following a piece of advice from a friend, I decided to by this new domain name


Coleslaw and the packages it depends on work pretty well to begin with, but I made a couple of improvements to make them fit my particular tastes better:

  1. Some themes and plugins are site specific and cannot be generalized. There is very little point in keeping them in the coleslaw source tree when they really belong with the site content. I submitted patches to make it possible to define themes and plugins in the content repo. See PR-98 and PR-101.
  2. I like to have the HTML files named in a certain way in the resulting web site, so it's convenient for me to be able to specify lambdas in .coleslawrc mapping the content metadata to file names. I made a pull request to allow that (PR-100), but Brit, the maintainer of coleslaw, has different ideas on how to approach this problem.
  3. I think pygments have no real competition if it comes to coloring source code, so I made changes to 3bmd - the markdown rendering library used by coleslaw - allowing it to use pygments. See PR-24.
  4. It's nice to be able to control how the rendered HTML tables look. In order to do that, you need to be able to specify the css class for the table. See PR-25.


3bmd makes it fairly easy to customize how the final HTML is rendered. For instance, you can change the resulting markup for images by defining a method :around print-tagged-element. I want the images on this web site to have frames and captions, so I did this:

 1(defmethod print-tagged-element :around ((tag (eql :image)) stream rest)
 2  (setf rest (cdr (first rest)))
 3  (let ((fmt (concatenate 'string
 4                          "<div class=\"center-wrapper\">"
 5                          "  <div class=\"img\">"
 6                          "    <img src=\"~a\" ~@[alt=\"~a\"~] />"
 7                          "    <div class=\"img-caption\">~@[~a~]</div>"
 8                          "  </div>"
 9                          "</div>"))
10        (caption (with-output-to-string (s)
11                   (mapcar (lambda (a) (print-element a s))
12                           (getf rest :label)))))
13    (format stream
14            fmt
15            (getf rest :source)
16            caption
17            caption)))

Being able to use $config.domain and other variables in the markdown makes it possible to define relative paths to images and other resources. This comes handy if you want to test the web site using different locations. In order to acheve this you can define a method :around render-text in the following way:

 1(defmethod render-text :around (text format)
 2  (let ((processed
 3         (reduce #'funcall
 4                 (list
 5                  #'process-embeds
 6                  (lambda (text)
 7                    (regex-replace-all "{\\\$config.domain}"
 8                                       text
 9                                       (domain *config*)))
10                  (lambda (text)
11                    (regex-replace-all "{\\\$config.repo-dir}"
12                                       text
13                                       (namestring (repo-dir *config*))))
14                  text)
15                 :from-end t)))
16    (call-next-method processed format)))


I use DreamHost for my web hosting and want to use sbcl as the lisp implementation. Unfortunately, all of my attempts to run sbcl there ended up with error messages like this one:

mmap: wanted 1040384 bytes at 0x20000000, actually mapped at 0x3cfc6467000
ensure_space: failed to validate 1040384 bytes at 0x20000000
(hint: Try "ulimit -a"; maybe you should increase memory limits.)

After some investigation, it turned out that DreamHost uses grsecurity kernel patches and, it looks like, their implementation of ASLR (Address Space Layout Randomization) does not respect the ADDR_NO_RANDOMIZE personality that is indeed set by sbcl at startup. They still allow the memory to be mapped at a specific location, which is a requirement for sbcl, if the MAP_FIXED flag is passed to mmap. The patch fixing this problem was a fairly simple one once I figured out what's going on. It looks like it will be included in sbcl 1.3.2. Until then, you will have to recompile the sources yourself.

Let's see if we get a speedup if we compile the code. The snippets below list the contents of col1.lisp and col2.lisp respectively:

(require 'coleslaw)
(coleslaw:main "/path/to/repo/")
(require 'coleslaw)
(defun main () (coleslaw:main (nth 1 *posix-argv*)))
(sb-ext:save-lisp-and-die "coleslaw.x" :toplevel #'main :executable t)

And this is what you get:

]==> time sbcl --noinform --load col1.lisp
sbcl --load col2.lisp  6.39s user 1.05s system 97% cpu 7.609 total

]==> sbcl --noinform --load col2.lisp
[undoing binding stack and other enclosing state... done]
[saving current Lisp image into coleslaw.x:
writing 4944 bytes from the read-only space at 0x20000000
writing 3168 bytes from the static space at 0x20100000
writing 85229568 bytes from the dynamic space at 0x1000000000

]==> time ./coleslaw.x /path/to/repo/
./coleslaw.x /path/to/repo/  3.37s user 0.74s system 95% cpu 4.304 total

]==> du -sh ./coleslaw.x
83M     ./coleslaw.x

The compiled code runs almost twice as fast, but the executable weights 83M!

I wrote the following post-receive hook in order to have the site rendered automatically every time I push the new content to the master branch of the repo.

 1CLONE_DIR=`mktemp -d`
 3echo "Cloning the repository..."
 4git clone $PWD $CLONE_DIR > /dev/null | exit 1
 6while read oldrev newrev refname; do
 7  if [ $refname = "refs/heads/master" ]; then
 8    echo "Running coleslaw..."
 9    coleslaw.x $CLONE_DIR/ > /dev/null
10  fi
13rm -rf $CLONE_DIR


Building this web site was quite an instructive experience, especially that it was my first non-toy project done in Common Lisp. It showed me how easy it is to use and hack on CL projects and how handy QuickLisp is. There's plenty of good libraries around and, if they have areas in which they are lacking, it's quite a bit of fun to fill the gaps. The library environment definitely is not as mature as the one of Python or Ruby, so new users may find it difficult, but, overall, I think it's worth it to spend the time getting comfortable with Common Lisp. I finally feel emotionally prepared to go through Peter Norvig's Paradigms of Artificial Intelligence Programming. :)