Recent Content


Writing this blog became increasingly tedious over time. The reason was the slowness of the rendering tool I use - coleslaw. It seemed to work well for other people, though, so I decided to investigate what I was doing wrong. The problem turned out to be that the code coloring implementation (which I co-wrote) spawned a Python process every time it received a code block to handle. The coloring itself was fast; starting and stopping Python every time was the cause of the issue. The solution to this malady is fairly simple: keep the Python process running at all times and communicate with it via standard IO.

Surprisingly enough, I could not find an easy and convenient way to do it. The dominant paradigm of uiop:run-program seems to be spawn-process-close, and it does not allow for easy access to the actual streams. sb-ext:run-program does hand me the stream objects that I need, but it's not portable. While reading the code of uiop trying to figure out how to extract the stream objects from run-program, I accidentally discovered uiop:launch-program which does exactly what I need in a portable manner. It was implemented in asdf- released on Dec 1st, 2016 (a month and a half ago!). This post is meant as a piece of documentation that can be indexed by search engines to help spread my happy discovery. :)


The Python code reads commands from standard input and writes responses to standard output. Both commands and responses consist of a header followed by a newline and an optional payload.

The commands are:

  • exit - what it does is self-evident
  • pygmentize|len|lang[|opts]:
    • len is the length of the code snippet
    • lang is the language to colorize
    • optional parameter opts is the configuration of the HTML formatter
    • after the newline, len utf-8 characters of the code block need to follow

There's only one response: colorized|len, followed by a newline and len utf-8 characters of the colorized code as an HTML snippet.

Python's automatic inference of standard IO's encoding is still pretty messed up, even in Python 3. It's a good idea to create wrapper objects and interact only with them:

input  = io.TextIOWrapper(sys.stdin.buffer,  encoding='utf-8')
output = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

Printing diagnostic messages to standard error output is useful for debugging:

def eprint(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)


OK, I have a Python script that does the coloring. Before I can use it, I need to tell ASDF about it and locate it in the filesystem. The former is done by using the :static-file qualifier in the :components list. The latter is a bit more complicated, but since the file's location is known relative to the Lisp file it will be used with, it's doable.

(defvar *pygmentize-path*
  (merge-pathnames ""
                   #.(or *compile-file-truename* *load-truename*))
  "Path to the pygmentize script")

The trick here is to use #. to execute the statement at read-time. You can see the full explanation here.

With that out of the way, I can start the renderer with:

(defmethod start-concrete-renderer ((renderer (eql :pygments)))
  (setf *pygmentize-process* (uiop:launch-program
                              (list *python-command*
                                    (namestring *pygmentize-path*))
                              :input :stream
                              :output :stream)))

For debugging purposes, it's useful to add :error-output "/tmp/debug", so that the diagnostics do not get eaten up by /dev/null.

To stop the process, we send it the exit command, flush the stream, and wait until the process dies:

(defmethod stop-concrete-renderer ((renderer (eql :pygments)))
  (write-line "exit" (process-info-input *pygmentize-process*))
  (force-output (process-info-input *pygmentize-process*))
  (wait-process *pygmentize-process*))

The Lisp part of the colorizer sends the pygmentize command together with the code snippet to Python and receives the colorized HTML:

(defun pygmentize-code (lang params code)
  (let ((proc-input (process-info-input *pygmentize-process*))
        (proc-output (process-info-output *pygmentize-process*)))
    (write-line (format nil "pygmentize|~a|~a~@[|~a~]"
                        (length code) lang params)
                proc-input)
    (write-string code proc-input)
    (force-output proc-input)
    (let ((nchars (parse-integer
                   (nth 1
                        (split-sequence #\| (read-line proc-output))))))
      (coerce (loop repeat nchars
                 for x = (read-char proc-output)
                 collect x)
              'string))))

See the entire pull request here.


I was able to cut the time it takes to generate this blog from well over a minute to less than three seconds.

]==> time ./coleslaw-old.x /path/to/blog/
./coleslaw-old.x /path/to/blog/  66.40s user 6.19s system 98% cpu 1:13.55 total
]==> time ./coleslaw-new-no-renderer.x /path/to/blog/
./coleslaw-new-no-renderer.x /path/to/blog/  65.50s user 6.03s system 98% cpu 1:12.53 total
]==> time ./coleslaw-new-renderer.x /path/to/blog/
./coleslaw-new-renderer.x /path/to/blog/  2.78s user 0.27s system 106% cpu 2.849 total
  • coleslaw-old.x is the original code
  • coleslaw-new-no-renderer.x starts and stops the renderer with every code snippet
  • coleslaw-new-renderer.x starts the renderer beforehand and stops it after all the job is done

The classifier

I have built a road sign classifier recently as an assignment for one of the online courses I participate in. The particulars of the implementation are unimportant. It suffices to say that it's a variation on the solution found in this paper by Sermanet and LeCun and operates on the same data set of German road signs. The solution has around 4.3M trainable parameters, and there are around 300k training (after augmentation), 40k validation (after augmentation), and 12k testing samples. The classifier reached the testing accuracy of 98.67%, which is just about human performance. That's not bad.

What I want to share the most, though, is none of the above but the training benchmarks. I tested the model on three different machines, in 5 configurations in total:

  • x1-cpu: My laptop, four i7-6600U CPU cores at 2.60GHz each and 4MB cache, 16GB RAM
  • g2.8-cpu: Amazon's g2.8xlarge instance, 32 Xeon E5-2670 CPU cores at 2.60GHz each with 20MB cache, 60GB RAM
  • g2.2-cpu: Amazon's g2.2xlarge instance, 8 Xeon E5-2670 CPU cores at 2.60GHz each with 20MB cache, 15GB RAM
  • g2.8-gpu: The same as g2.8-cpu but used the 4 GRID K520 GPUs
  • g2.2-gpu: The same as g2.2-cpu but used the 1 GRID K520 GPU

Here are the times it took to train one epoch as well as how long it would have taken to train for 540 epochs (it took 540 epochs to get the final solution):

  • x1-cpu: 40 minutes/epoch, 15 days to train
  • g2.8-cpu: 6:24/epoch, 2 days 9 hours 36 minutes to train
  • g2.2-cpu: 16:15/epoch, 6 days, 2 hours 15 minutes to train
  • g2.8-gpu: 1:37/epoch, 14 hours, 33 minutes to train
  • g2.2-gpu: 1:37/epoch, 14 hours, 33 minutes to train

I was quite surprised by these results. I knew that GPUs are better suited for this purpose, but I did not know that they are this much better. The slowness of the laptop might have been due to swapping. I ran the test with the usual (unused) laptop workload, with Chrome taking a lot of RAM; I was not doing anything during the test, though. When testing with g2.8-cpu, it looked like only 24 of the 32 CPU cores were busy. The three additional GPUs on g2.8-gpu did not seem to make any difference. TensorFlow allows you to pin operations to devices, but I did not do any of that; the test just ran the same exact graph as g2.2-gpu. There's likely a lot to gain from manual tuning.

The results

I tested it on pictures of a bunch of French and Swiss road signs taken around where I live. These are in some cases different from their German counterparts. When the network had enough training examples, it generalized well, otherwise, not so much. In the images below, you'll find sample sign pictures and the top three logits returned by the classifier after applying softmax.
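For reference, "applying softmax to the logits" means converting the raw class scores into probabilities that sum to one; the top-three display can then be sketched like this (a minimal illustration with invented names, not the classifier's actual code):

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    m = max(logits)                        # subtract the max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top3(logits, labels):
    """Return the three most probable labels with their probabilities."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda p: p[1], reverse=True)
    return ranked[:3]
```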

Speed limit - training (top) vs. French (bottom)

No entry - training (top) vs. French (bottom)

Traffic lights - training (top) vs. French (bottom)

No trucks - training (top) vs. French (bottom)


I must be getting old and eccentric. I have recently started working on Coursera's Scala Specialization. It's all great, but my first realization was that the tools they use are not going to work for me. The problem lies mainly with sbt - their build tool. It fetches and installs in your system God knows what from God knows where to bootstrap itself. I don't trust this kind of behavior in the slightest. I understand that automatic pulling of dependencies and auto-update may save work, but they are also dangerous. I don't even want to mention that they contribute to the general bloat and sluggishness that plagues the Java world. You don't need to know what depends on what, so everything uses everything, without a single thought.

Having said all that, I do trust and use QuickLisp. So, go figure.

I would still like to take the class, but I would like to do it using Emacs and command line. Here's what I did.


You'll need the following packages:

==> sudo apt-get install default-jdk scala ant junit4 scala-mode-el

The assignments they give seem to have a stub for implementation and a bunch of scalatest test suites. I will also want to write some other code to play with things. This is the directory structure I used:

==> find -type f

You can get all this here and will need to run the script to fetch the two scalatest jar files before you can do anything else.


I wrote an ant build template that sets up scalac and scalatest tasks and provides compile and test targets. All that the build.xml files in the subdirectories need to do is define some properties and import the template:

<project default="compile">
  <property name="" value="recfun.jar" />
  <property name="jar.class" value="recfun.Main" />
  <property name="tests-wildcard" value="recfun" />
  <import file="../build-targets.xml" />
</project>
  • is the name of the resulting jar file
  • jar.class is the default class that should run
  • tests-wildcard is the name of the package containing the test suites (they are discovered automatically)

You can then run the thing (some output has been omitted for clarity):

]==> ant compile
    [mkdir] Created dir: ./build/classes
    [mkdir] Created dir: ./build-test

   [scalac] Compiling 1 source file to ./build/classes
      [jar] Building jar: ./build/recfun.jar

]==> ant test
   [scalac] Compiling 3 source files to ./build-test

[scalatest] Discovery starting.
[scalatest] Discovery completed in 127 milliseconds.
[scalatest] Run starting. Expected test count is: 11
[scalatest] BalanceSuite:
[scalatest] - balance: '(if (zero? x) max (/ 1 x))' is balanced
[scalatest] - balance: 'I told him ...' is balanced
[scalatest] - balance: ':-)' is unbalanced
[scalatest] - balance: counting is not enough
[scalatest] PascalSuite:
[scalatest] - pascal: col=0,row=2
[scalatest] - pascal: col=1,row=2
[scalatest] - pascal: col=1,row=3
[scalatest] CountChangeSuite:
[scalatest] - countChange: example given in instructions
[scalatest] - countChange: sorted CHF
[scalatest] - countChange: no pennies
[scalatest] - countChange: unsorted CHF
[scalatest] Run completed in 246 milliseconds.
[scalatest] Total number of tests run: 11
[scalatest] Suites: completed 4, aborted 0
[scalatest] Tests: succeeded 11, failed 0, canceled 0, ignored 0, pending 0
[scalatest] All tests passed.

Submitting the assignments

I could probably write some code to do the assignment submission in Python or using ant, but I am too lazy for that. I will use the container handling script that I wrote for another occasion and install sbt in there. The docker/devel subdirectory contains a Dockerfile for an image that has sbt installed in it.

==> git clone
==> cd jail/docker/devel
==> docker build -t jail:v01-dev .

This is the configuration script:


Table of Contents

  1. Compiling and start-up code
  2. Hardware Abstraction Layer and UART
  3. Debugging, display, heap and fonts
  4. Timers and ADC
  5. DAC, Sound and Nokia Tunes
  6. Random Number Generator, Rendering Engine and the Game
  7. Operating System


The game code up until this point abuses timers a lot. It has a timer to handle rendering and to refresh the display, and a timer to change notes of a tune. These tasks are not very time sensitive. A couple of milliseconds delay here or there is not going to be noticeable to users. The timer interrupts are more appropriate for things like maintaining a sound wave of the proper frequency. A slight delay here lowers the quality of the user experience significantly.

We could, of course, do even more complex time management to handle both the graphics and the sound in one loop, but that would be painful. It's much nicer to have a scheduling system that can alternate between multiple threads of execution. This is what I will describe in this post.

Thread Control Block and the Stack

Since there's usually only one CPU, the threads need to share it. The easiest way to achieve time sharing is to have a fixed time slice at the end of which the system switches to another thread. The systick interrupt is perfect for this purpose. Not only is it invoked periodically, but it can also be requested manually by manipulating a register. This property will be useful in the implementation of sleeping and blocking.

But first things first: we need a structure that describes a thread, a.k.a. a Thread Control Block:

struct IO_sys_thread {
  uint32_t             *stack_ptr;
  uint32_t              flags;
  void (*func)();
  struct IO_sys_thread *next;
  uint32_t              sleep;
  IO_sys_semaphore     *blocker;
  uint8_t               priority;
};
  • stack_ptr - points to the top of the thread's stack
  • flags - properties describing the thread; we will need just one to indicate whether the thread used the floating-point coprocessor
  • func - thread's entry point
  • next - pointer to the next thread in the queue (used for scheduling)
  • sleep - number of milliseconds the thread still needs to sleep
  • blocker - a pointer to a semaphore blocking the thread (if any)
  • priority - thread's priority

When invoking an interrupt handler, the CPU saves most of the running state of the current thread to the stack. Therefore, the task of the interrupt handler boils down to switching the stack pointers. The CPU will then pop all the registers back from the new stack. This behavior means that we need to do some initialization first:

void IO_sys_stack_init(IO_sys_thread *thread, void (*func)(void *), void *arg,
  void *stack, uint32_t stack_size)
{
  uint32_t sp1 = (uint32_t)stack;
  uint32_t sp2 = (uint32_t)stack;
  sp2 += stack_size;
  sp2 = (sp2 >> 3) << 3;          // the stack base needs to be 8-aligned
  if(sp1 % 4)
    sp1 = ((sp1 >> 2) << 2) + 4;  // make the end of the stack 4-aligned
  stack_size = (sp2 - sp1) / 4;   // new size in words

  uint32_t *sp = (uint32_t *)sp1;
  sp[stack_size-1] = 0x01000000;          // PSR with thumb bit
  sp[stack_size-2] = (uint32_t)func;      // program counter
  sp[stack_size-3] = 0xffffffff;          // link register
  sp[stack_size-8] = (uint32_t)arg;       // r0 - the argument
  thread->stack_ptr = &sp[stack_size-16]; // top of the stack
}

The ARM ABI requires that the top of the stack is 8-aligned and we will typically push and pop 4-byte words. The first part of the setup function makes sure that the stack boundaries are right. The second part sets the initial values of the registers. Have a look here for details.

  • the PSR register needs to have the Thumb bit switched on
  • we put the startup function address to the program counter
  • we put 0xffffffff to the link register to avoid confusing stack traces in GDB
  • r0 gets the argument to the startup function
  • an interrupt pushes 16 words worth of registers to the stack, so the initial value of the stack pointer needs to reflect that

This function is typically called as:

IO_sys_stack_init(thread, thread_wrapper, thread, stack, stack_size);

Note that we do not call the user thread function directly. Rather, we have a wrapper function that gets the TCB as its argument. This is because we need to remove the thread from the scheduling queue if the user-specified function returns.
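The wrapper's job can be modeled in a few lines of Python (the names TCB, link, and thread_wrapper are mine; the real implementation is C code that must also disable interrupts around the unlinking and then yield):

```python
class TCB:
    """A minimal model of a thread control block in the circular queue."""
    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.next = None

def link(tcbs):
    """Link the TCBs into a circular list, mirroring the scheduler queue."""
    for a, b in zip(tcbs, tcbs[1:] + tcbs[:1]):
        a.next = b

def thread_wrapper(tcb):
    """Call the user function; if it ever returns, unlink the TCB so that
    the scheduler never selects this thread again."""
    tcb.func()
    prev = tcb
    while prev.next is not tcb:  # find the predecessor in the circular list
        prev = prev.next
    prev.next = tcb.next         # splice this TCB out of the queue
```

After the splice, the wrapper gives up the CPU for good; the dead thread's stack can then be reclaimed.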

The context switcher

Let's now have a look at the code that does the actual context switching. Since it needs to operate directly on the stack, it needs to be written in assembly. It is not very complicated, though. What it does is:

  • pushing some registers to the stack
  • storing the current stack pointer in the stack_ptr variable of the current TCB
  • calling the scheduler to select the next thread
  • loading the stack pointer from the new thread's TCB
  • popping some registers from the new stack
#define OFF_STACK_PTR 0
#define OFF_FLAGS     4
#define FLAG_FPU      0x01

  .thumb
  .syntax unified

  .global IO_sys_current
  .global IO_sys_schedule

  .text

  .global systick_handler
  .type systick_handler STT_FUNC
  .thumb_func
  .align  2
systick_handler:
  cpsid i                     ; disable interrupts
  push  {r4-r11}              ; push r4-11
  ldr   r0, =IO_sys_current   ; pointer to IO_sys_current to r0
  ldr   r1, [r0]              ; r1 = IO_sys_current

  ubfx  r2, lr, #4, #1        ; extract the fourth bit from the lr register
  cbnz  r2, .Lsave_stack      ; no FPU context to save
  vstmdb sp!, {s16-s31}       ; push FPU registers, this triggers pushing of
                              ; s0-s15
  ldr   r2, [r1, #OFF_FLAGS]  ; load the flags
  orr   r2, r2, #FLAG_FPU     ; set the FPU context flag
  str   r2, [r1, #OFF_FLAGS]  ; store the flags

.Lsave_stack:
  str   sp, [r1, #OFF_STACK_PTR] ; store the stack pointer at *IO_sys_current

  push  {r0, lr}              ; calling c code, so store r0 and the link
                              ; register
  bl    IO_sys_schedule       ; call the scheduler
  pop   {r0, lr}              ; restore r0 and lr

  ldr   r1, [r0]              ; load the new TCB pointer to r1
  ldr   sp, [r1, #OFF_STACK_PTR] ; get the stack pointer of the new thread

  orr   lr, lr, #0x10         ; clear the floating point flag in EXC_RETURN
  ldr   r2, [r1, #OFF_FLAGS]  ; load the flags
  tst   r2, #0x01             ; see if we have the FPU context
  beq   .Lrestore_regs        ; no FPU context
  vldmia sp!, {s16-s31}       ; pop the FPU registers
  bic   lr, lr, #0x10         ; set the floating point flag in EXC_RETURN

.Lrestore_regs:
  pop   {r4-r11}              ; restore regs r4-11
  cpsie i                     ; enable interrupts
  bx    lr                    ; exit the interrupt, restore r0-r3, r12, lr, pc,
                              ; psr

The only complication here is that we sometimes need to store the floating point registers in addition to the regular ones. It is, however, only necessary if the thread used the FPU. The fourth bit of EXC_RETURN, the value in the LR register, indicates the status of the FPU. Go here and here for more details. If the value of the bit is 0, we need to save the high floating-point registers to the stack and set the FPU flag in the TCB.

Also, after selecting the new thread, we check if its stack contains the FPU registers by checking the FPU flag in its TCB. If it does, we pop these registers and change EXC_RETURN accordingly.

The Lazy Stacking is taken care of by simply pushing and popping the high registers - it counts as an FPU operation.

Semaphores, sleeping and idling

We can now run threads and switch between them, but it would be useful to be able to put threads to sleep and make them wait for events.

Sleeping is easy. We just need to set the sleep field in the TCB of the current thread and make the scheduler ignore threads whenever their sleep field is not zero:

void IO_sys_sleep(uint32_t time)
{
  IO_sys_current->sleep = time;
  IO_sys_yield();
}

The ISR that handles the system time can loop over all threads and decrement this counter every millisecond.
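That bookkeeping can be modeled as a walk over the circular thread list (a Python sketch with invented names; the real handler is a C interrupt routine operating on the same structure):

```python
class Thread:
    """Just the two TCB fields the time ISR cares about."""
    def __init__(self, sleep):
        self.sleep = sleep
        self.next = self

def tick(threads):
    """Called once per millisecond: age every sleeping thread by one tick."""
    if threads is None:
        return
    cur = threads
    while True:
        if cur.sleep > 0:
            cur.sleep -= 1   # runnable again once this reaches zero
        cur = cur.next
        if cur is threads:   # full lap around the circular list
            break
```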

Waiting for a semaphore works in a similar way. We mark the current thread as blocked:

void IO_sys_wait(IO_sys_semaphore *sem)
{
  IO_disable_interrupts();
  --*sem;
  if(*sem < 0) {
    IO_sys_current->blocker = sem;
    IO_sys_yield();
  }
  IO_enable_interrupts();
}

The purpose of IO_sys_yield is to indicate that the current thread does not need to run anymore and force a context switch. The function resets the systick counter and forces the interrupt:

void IO_sys_yield()
{
  STCURRENT_REG = 0;          // clear the systick counter
  INTCTRL_REG   = 0x04000000; // trigger systick
}

Waking a thread waiting for a semaphore is somewhat more complex:

void IO_sys_signal(IO_sys_semaphore *sem)
{
  IO_disable_interrupts();
  ++*sem;
  if(*sem <= 0 && threads) {
    IO_sys_thread *t;
    for(t = threads; t->blocker != sem; t = t->next);
    t->blocker = 0;
  }
  IO_enable_interrupts();
}

If the value of the semaphore was negative, we find a thread that it was blocking and unblock it. It will make the scheduler consider this thread for running in the future.

None of the user-defined threads may be runnable at the time the scheduler makes its decision. All of them may be either sleeping or waiting for a semaphore. In that case, we need to keep the CPU occupied with something, i.e., we need a fake thread:

static void iddle_thread_func(void *arg)
{
  (void)arg;
  while(1) IO_wait_for_interrupt();
}


The system maintains a circular linked list of TCBs called threads. The job of the scheduler is to loop over this list and select the next thread to run. It places its selection in a global variable called IO_sys_current so that other functions may access it.

void IO_sys_schedule()
{
  if(!threads) {
    IO_sys_current = &iddle_thread;
    return;
  }

  IO_sys_thread *stop = IO_sys_current->next;

  if(IO_sys_current == &iddle_thread)
    stop = threads;

  IO_sys_thread *cur  = stop;
  IO_sys_thread *sel  = 0;
  int            prio = 266;

  do {
    if(!cur->sleep && !cur->blocker && cur->priority < prio) {
      sel = cur;
      prio = sel->priority;
    }
    cur = cur->next;
  }
  while(cur != stop);

  if(!sel)
    sel = &iddle_thread;

  IO_sys_current = sel;
}

This scheduler is simple:

  • whenever there is nothing to run, select the idle thread
  • otherwise select the next highest priority thread that is neither sleeping nor blocked on a semaphore

Starting up the beast

So how do we get this whole business running? We need to invoke the scheduler that will preempt the current thread and select the next one to run. The problem is that we're running using the stack provided by the bootstrap code and don't have a TCB. Nothing prevents us from creating a dummy one, though. We can create it on the current stack (it's useful only once) and point it to the beginning of our real queue of TCBs:

IO_sys_thread dummy;
dummy.next = threads;
IO_sys_current = &dummy;

We then set the systick up:

STCTRL_REG     = 0;            // turn off
STCURRENT_REG  = 0;            // reset
SYSPRI3_REG   |= 0xE0000000;   // priority 7
STRELOAD_REG   = time_slice-1; // reload value

And force its interrupt:

IO_sys_yield();
IO_enable_interrupts();


Tests 11 and 12 run a dummy calculation for some time and then return. After this happens, the system can only run the idle thread. If we plug in the profiler code, we can observe the timings on a logic analyzer:

Test #11

Test 13 is more complicated than the two previous ones. Three threads are running in a loop, sleeping, and signaling semaphores. Two more threads are waiting for these semaphores, changing some local variables and signaling other semaphores. Finally, there is the writer thread that blocks on the last set of semaphores and displays the current state of the environment. The output from the logic analyzer shows that the writer thread needs around 3.3 time-slices to refresh the screen:

Test #13

Silly Invaders

How does all this make Silly Invaders better? The main advantage is that we don't need to calculate complex timings for multiple functions of the program. We create two threads: one for rendering the scene and another for playing the music tune. Each thread cares about its own timing. Everything else takes care of itself, with good enough time guarantees.

IO_sys_thread game_thread;
void game_thread_func()
{
  while(1) {
    SI_scene_render(&scenes[current_scene].scene, &display);
    IO_sys_sleep(1000/scenes[current_scene].scene.fps);
  }
}

IO_sys_thread sound_thread;
void sound_thread_func()
{
  IO_sound_player_run(&sound_player);
}

The threads are registered with the system in the main function:

IO_sys_thread_add(&game_thread,  game_thread_func,  2000, 255);
IO_sys_thread_add(&sound_thread, sound_thread_func, 1000, 255);
IO_sys_run(1000);

For the complete code see:


There is a great course on EdX called Realtime Bluetooth Networks explaining the topic in more detail. I highly recommend it.


Life would be so much easier if all the software was open source and came packaged with Debian. Much of the stuff I use is available this way, but there are still some programs that come as binary blobs with custom installers. I don't like that. I don't trust that. Every now and then, you hear the stories of software coming from reputable companies and misbehaving in dramatic ways. It would be great to contain the potential damage, but running virtual machines on a laptop is a pain. As it turns out, things may work pretty well with Docker, but as usual, the configuration is not so trivial.


The X Server

The solutions I found on the Internet either share the main X server with the container or use VNC. The first approach is problematic because apparently the X architecture has been designed by happy hippies and has no notion of security. If two applications share a screen, for instance, one can sniff the keystrokes typed into the other, and all this is by design. The VNC solution, on the other hand, is terribly slow: windows smudge when moved, and the Netflix playback is lagging.

Starting a Xephyr instance on the host and sharing its socket with the container seems to solve the sniffing problem. The programs running inside the container can't listen to the keystrokes typed outside of it anymore. Xephyr is also fast enough to handle high-resolution movie playback smoothly.

You can start Xephyr like this:

Xephyr :1 -ac -br -screen 1680x1050 -resizeable

The server will run as display :1 in a resizable window of initial size defined by the screen parameter. Adding the following to the Docker command line makes the server visible inside of the container:

-e DISPLAY=:1 -v /tmp/.X11-unix/X1:/tmp/.X11-unix/X1

The only remaining pain point is the fact that you cannot share the clipboard by default. Things copied outside of the container do not paste inside and vice versa. The xsel utility and a couple of lines of bash code can solve this problem easily:

CLIP1=""
CLIP2=`xsel -o -b --display :1`
while true; do
  CLIP1NEW=`xsel -o -b --display :0`
  CLIP2NEW=`xsel -o -b --display :1`
  if [ "x$CLIP1" != "x$CLIP1NEW" ]; then
    xsel -o -b --display :0 | xsel -i -b --display :1
    CLIP1=$CLIP1NEW
  fi
  if [ x"$CLIP2" != x"$CLIP2NEW" ]; then
    xsel -o -b --display :1 | xsel -i -b --display :0
    CLIP2=$CLIP2NEW
  fi
  sleep 1
done


Making the audio work both ways (the sound and the microphone) is surprisingly easy with PulseAudio. The host just needs to configure the native protocol plug-in and ensure that port 4713 is not blocked by the firewall:

pactl load-module module-native-protocol-tcp auth-ip-acl=

All you need to do in the container is to make sure that the PULSE_SERVER envvar points to the host. It is less straightforward than you might expect when you run a desktop environment and don't want to start all your programs in a terminal window. For XFCE, I do the following in the script driving the container:

mkdir -p /home/prisoner/.config/xfce4
chown prisoner:prisoner -R /home/prisoner
XINITRC=/home/prisoner/.config/xfce4/xinitrc
rm -f $XINITRC

echo -e "#!/bin/sh\n\n" >> $XINITRC
echo -e "export PULSE_SERVER=$PULSE_SERVER\n\n" >> $XINITRC
tail -n +2 /etc/xdg/xfce4/xinitrc >> $XINITRC

sudo -u prisoner startxfce4

This code prepends the appropriate export statement to XFCE's xinitrc file and assumes that the PULSE_SERVER variable is known inside the container:


I also disable the local PulseAudio server in XFCE. It has strange interactions with the one running on the host.


Doing all this by hand every time you want to run the container is way too painful. I wrote a script to automate the process. It can:

  • set up the PulseAudio server
  • forward static devices (e.g., /dev/video0)
  • forward USB devices if they are present (e.g., 21a9:1006 as /dev/bus/usb/001/016)
  • run the Xephyr X server instance for the container
  • forward the clipboards
  • set up docker's command line to make everything work together

This is what it all looks like:

[i] Running jail: 20161121-213528
[i] Container: jail:v01
[i] Hostname: jail
[i] Home: /home/ljanyst/Contained/jail/home
[i] Setting up the local PulseAudio server ( OK
[i] Attaching device /dev/video0
[i] USB device 21a9:1006 not present
[i] Running Xephyr display at :1 (1680x1050)... OK
[i] Running docker... OK
[i] Removing container a545365592c1... OK
[i] Killing clipboard forwarder, PID: 2776... DONE
[i] Killing Xephyr, PID: 2767... DONE
[i] All done. Bye!



Saleae Logic Analyzer


It took some playing around, but I have finally managed to figure out how to build from source all the tools necessary to put Zephyr on Arduino 101. You may say that the effort is pointless because you could just use whatever is provided by the SDK. For me, however, the deal is more about what I can learn from the experience than about the result itself. There is enough open source code around to make things work reasonably well, but putting it all together is a bit of a challenge, so what follows is a short HOWTO.

Arduino 101 setup


Arduino 101 has a Quark core and an ARC EM core. The appropriate targets seem to be i586-none-elfiamcu and arc-none-elf for the former and the latter, respectively. Since there is no pre-packaged toolchain for either of these in Debian, you'll need to build your own. You can use the vanilla binutils (version 2.27 worked for me) and the vanilla newlib (version did not require any patches). GCC is somewhat more problematic. Since apparently not all the necessary ARC patches have been accepted into the mainline yet, you'll need to download it from the Synopsys GitHub repo. GDB requires tweaking for both cores.


]==> mkdir binutils && cd binutils
]==> wget
]==> tar jxf binutils-2.27.tar.bz2
]==> mkdir i586-none-elfiamcu && cd i586-none-elfiamcu
]==> ../binutils-2.27/configure --prefix=/home/ljanyst/Apps/cross-compilers/i586-none-elfiamcu --target=i586-none-elfiamcu
]==> make -j12 && make install
]==> cd .. && mkdir arc-none-elf && cd arc-none-elf
]==> ../binutils-2.27/configure --prefix=/home/ljanyst/Apps/cross-compilers/arc-none-elf --target=arc-none-elf
]==> make -j12 && make install
]==> cd ../..


]==> mkdir gcc && cd gcc
]==> wget
]==> tar jxf gcc-6.2.0.tar.bz2
]==> git clone
]==> cd gcc && git checkout arc-4.8-dev && cd ..
]==> mkdir i586-none-elfiamcu && cd i586-none-elfiamcu
]==> ../gcc-6.2.0/configure --prefix=/home/ljanyst/Apps/cross-compilers/i586-none-elfiamcu --target=i586-none-elfiamcu --enable-languages=c --with-newlib
]==> make -j12 all-gcc && make install-gcc
]==> cd .. && mkdir arc-none-elf && cd arc-none-elf
]==> ../gcc/configure --prefix=/home/ljanyst/Apps/cross-compilers/arc-none-elf --target=arc-none-elf  --enable-languages=c --with-newlib --with-cpu=arc700
]==> make -j12 all-gcc && make install-gcc
]==> cd ../..


]==> mkdir newlib && cd newlib
]==> wget
]==> tar zxf newlib-
]==> mkdir i586-none-elfiamcu && cd i586-none-elfiamcu
]==> ../newlib- --prefix=/home/ljanyst/Apps/cross-compilers/i586-none-elfiamcu --target=i586-none-elfiamcu
]==> make -j12 && make install
]==> cd .. && mkdir arc-none-elf && cd arc-none-elf
]==> ../newlib- --prefix=/home/ljanyst/Apps/cross-compilers/arc-none-elf --target=arc-none-elf
]==> make -j12 && make install
]==> cd ../..


]==> cd gcc/i586-none-elfiamcu
]==> make -j12 all-target-libgcc && make install-target-libgcc
]==> cd ../arc-none-elf
]==> make -j12 all-target-libgcc && make install-target-libgcc
]==> cd ../..

GDB does not work for either platform out of the box. For Quark it compiles the i386 version but does not recognize the iamcu architecture even though, according to Wikipedia, it's essentially the same as i586 and libbfd knows about it. After some poking around the code, it seems that initializing the i386 platform with the iamcu bfd architecture definition does the trick:

diff -Naur gdb-7.11.1.orig/gdb/i386-tdep.c gdb-7.11.1/gdb/i386-tdep.c
--- gdb-7.11.1.orig/gdb/i386-tdep.c     2016-06-01 02:36:15.000000000 +0200
+++ gdb-7.11.1/gdb/i386-tdep.c  2016-09-24 15:39:11.000000000 +0200
@@ -8890,6 +8890,7 @@
 _initialize_i386_tdep (void)
 {
   register_gdbarch_init (bfd_arch_i386, i386_gdbarch_init);
+  register_gdbarch_init (bfd_arch_iamcu, i386_gdbarch_init);
 
   /* Add the variable that controls the disassembly flavor.  */
   add_setshow_enum_cmd ("disassembly-flavor", no_class, valid_flavors,

For ARC the Synopsys open source repo provides a solution.

]==> mkdir gdb && cd gdb
]==> wget
]==> tar xf gdb-7.11.1.tar.xz
]==> cd gdb-7.11.1 && patch -Np1 -i ../iamcu-tdep.patch && cd ..
]==> git clone
]==> cd binutils-gdb && git checkout arc-2016.09-gdb && cd ..
]==> mkdir i586-none-elfiamcu && cd i586-none-elfiamcu
]==> ../gdb-7.11.1/configure --prefix=/home/ljanyst/Apps/cross-compilers/i586-none-elfiamcu --target=i586-none-elfiamcu
]==> make -j12 && make install
]==> cd .. && mkdir arc-none-elf && cd arc-none-elf
]==> ../binutils-gdb/configure --prefix=/home/ljanyst/Apps/cross-compilers/arc-none-elf --target=arc-none-elf 
]==> make -j12 all-gdb && make install-gdb
]==> cd ../..


There has not been an OpenOCD release for quite some time, and the last one does not seem to have any support for Quark SE. The situation is not much better if you look at the head of the master branch of their repo. Fortunately, both Intel and Synopsys provide some support for their parts of the platform, and making it work with mainline OpenOCD does not seem to be hard.

]==> git clone && cd openocd
]==> git checkout lj
]==> ./bootstrap
]==> ./configure --prefix=/home/ljanyst/Apps/openocd
]==> make -j12 && make install

Zephyr uses the following configuration for the Arduino (referred to as openocd.cfg below):

source [find interface/ftdi/flyswatter2.cfg]
source [find board/quark_se.cfg]

quark_se.quark configure -event gdb-attach {
        reset halt
        gdb_breakpoint_override hard
}

quark_se.quark configure -event gdb-detach {
        resume
        shutdown
}

You can use the following commands to run the GDB server, flash the Quark core, and flash the ARC core, respectively (this is what Zephyr does):

]==> openocd -s /home/ljanyst/Apps/openocd/share/openocd/scripts/ -f openocd.cfg  -c 'init' -c 'targets' -c 'reset halt'
]==> openocd -s /home/ljanyst/Apps/openocd/share/openocd/scripts/ -f openocd.cfg  -c 'init' -c 'targets' -c 'targets quark_se.arc-em' -c 'reset halt' -c 'load_image zephyr.bin 0x40010000' -c 'reset halt' -c 'verify_image zephyr.bin 0x40010000' -c 'reset run' -c 'shutdown'
]==> openocd -s /home/ljanyst/Apps/openocd/share/openocd/scripts/ -f openocd.cfg  -c 'init' -c 'targets' -c 'targets quark_se.arc-em' -c 'reset halt' -c 'load_image zephyr.bin 0x40034000' -c 'reset halt' -c 'verify_image zephyr.bin 0x40034000' -c 'reset run' -c 'shutdown'

Hello world!

You need to compile and flash Zephyr's Hello World sample. The two commands below do the trick for the compilation part:

make BOARD=arduino_101_factory CROSS_COMPILE=i586-none-elfiamcu- CFLAGS="-march=lakemont -mtune=lakemont -msoft-float -miamcu -O0"
make BOARD=arduino_101_sss_factory CROSS_COMPILE=arc-none-elf-

After flashing, you should see the following on your UART console:

]==> screen /dev/ttyUSB0 115200,cs8
ipm_console0: 'Hello World! arc'
Hello World! x86


If you follow the instructions from the Zephyr wiki, debugging for the Intel part works fine. I still have some trouble making breakpoints work for ARC and will try to write an update if I have time to figure it out.

]==> i586-none-elfiamcu-gdb outdir/zephyr.elf
(gdb) target remote :3333
Remote debugging using :3333
0x0000fff0 in ?? ()
(gdb) b main
Breakpoint 1 at 0x400100ed: file /home/ljanyst/Projects/zephyr/zephyr-project/samples/hello_world/nanokernel/src/main.c, line 37.
(gdb) c
target running
hit hardware breakpoint (hwreg=0) at 0x400100ed

Breakpoint 1, main () at /home/ljanyst/Projects/zephyr/zephyr-project/samples/hello_world/nanokernel/src/main.c:37
37              PRINT("Hello World! %s\n", CONFIG_ARCH);
(gdb) s
step done from EIP 0x400100ed to 0x400100f2
step done from EIP 0x400100f2 to 0x400100f7
step done from EIP 0x400100f7 to 0x40013129
target running
hit hardware breakpoint (hwreg=1) at 0x4001312f
printk (fmt=0x40013e04 "Hello World! %s\n") at /home/ljanyst/Projects/zephyr/zephyr-project/misc/printk.c:164
164             va_start(ap, fmt);
(gdb) s
step done from EIP 0x4001312f to 0x40013132
step done from EIP 0x40013132 to 0x40013135
165             _vprintk(fmt, ap);


I have finally received my Kickstarter-backed UP boards. So far they seem great! There are three minor drawbacks, though:

  1. They don't have the exact same shape as a Raspberry Pi, so they don't fit Raspberry Pi cases. It's nothing that could not be rectified with small pliers, though.
  2. The audio chip on Cherry Trail (Intel Atom x5 z8350) SoCs is not yet supported by Linux out of the box, so some fiddling with the kernel is necessary.
  3. Debian's UEFI boot configuration does not seem to work from the get-go either.


You can install Debian Testing using a USB stick. Don't try Jessie, though - the kernel will not detect the MMC card. Things should work fine, except that grub will install itself in /EFI/debian/grubx64.efi on the EFI partition. You will need to move it to /EFI/boot/bootx64.efi manually. It's possible to do it from the UEFI shell using familiar commands.


Kodi installs and works out of the box from Debian Multimedia. Unfortunately, to get the sound working, you will need to recompile the kernel :)

Get the sources and the necessary patches and create the config:

git clone
cd linux
git remote add cht
git fetch cht
git checkout byt-cht-hdmi-v4.7
make oldconfig

You will need to edit .config:

  • to set CONFIG_SYSTEM_TRUSTED_KEYS variable to an empty string
  • to enable CONFIG_HDMI

Then the only thing that's left is building and installing the package:

fakeroot make-kpkg --initrd --append-to-version=-up --revision=1 -j 8 kernel_image kernel_headers
cd ..
sudo dpkg -i linux-image-4.7.0-up+_1_amd64.deb

I wanted to see how efficient the board is, so I ran the compilation on the board itself. It took roughly 2.5 hours, and the board got very hot. It handles FullHD video files over Samba perfectly well, something the Raspberry Pi 2 couldn't do. The audio quality is much better too. It seems that surround 5.1 actually works. :)


My medium-term goal is to port my Silly Invaders game to a Real Time Operating System. Zephyr seems to be a good choice. It's open source, operates under the auspices of the Linux Foundation, and has an active community with many developers from Intel committing code.

They, unfortunately, do not support Tiva, so I will need to port the OS before I can proceed with the application. I decided to buy the Freescale K64F board, which is supported, to familiarize myself a little with Zephyr before I start the porting work. The howto page for setting up K64F seems terribly complicated and requires a JTAG programmer, so I summarize here a simpler way using cmsis-dap over USB.


I updated the MBED interface firmware following the instructions on this site. I also built my own OpenOCD from the head of the master branch using the following configuration options:

./configure --prefix=/home/ljanyst/Apps/openocd  --enable-cmsis-dap

Things may work fine with the stock firmware and the stock OpenOCD as well, but I did not try that. It's also probably a good idea to add the following udev rule so that you don't have to run things as root:

]==> cat /etc/udev/rules.d/99-openocd.rules
# frdm-k64f
ATTRS{idVendor}=="0d28", ATTRS{idProduct}=="0204", GROUP="plugdev", MODE="0660"
]==> sudo udevadm control --reload-rules

Hello world!

I use the ARM cross-compiler provided by Debian to compile Zephyr and then just copy the resulting binary to the MBED disk:

]==> cd samples/hello_world/nanokernel
]==> make BOARD=frdm_k64f CROSS_COMPILE=arm-none-eabi- CFLAGS=-O0
]==> cp outdir/zephyr.bin /media/ljanyst/MBED/

You can see the effects in the UART console using screen:

]==> screen /dev/ttyACM0 115200,cs8
Hello World!

I then run OpenOCD using the following script:

]==> cat k64f.cfg
set CHIPNAME k60
source [find target/kx.cfg]

$_TARGETNAME configure -event gdb-attach {
        halt
}
]==> openocd -s /home/ljanyst/Apps/openocd/share/openocd/scripts/ -c "interface cmsis-dap" -f k64f.cfg

And GDB:

]==> cat remote1.conf
target extended-remote :3333
monitor reset init
break main
]==> arm-none-eabi-gdb  --command=remote1.conf outdir/zephyr.elf
Breakpoint 1 at 0x129c: file /home/ljanyst/Projects/zephyr/zephyr-project/samples/hello_world/nanokernel/src/main.c, line 37.
Note: automatically using hardware breakpoints for read-only addresses.

Breakpoint 1, main () at /home/ljanyst/Projects/zephyr/zephyr-project/samples/hello_world/nanokernel/src/main.c:37
37              PRINT("Hello World!\n");
(gdb) s
printk (fmt=0x2c90 "Hello World!\n") at /home/ljanyst/Projects/zephyr/zephyr-project/misc/printk.c:164
164             va_start(ap, fmt);
(gdb) s
165             _vprintk(fmt, ap);
(gdb) s
_vprintk (fmt=0x2c90 "Hello World!\n", ap=...) at /home/ljanyst/Projects/zephyr/zephyr-project/misc/printk.c:75
75              int might_format = 0; /* 1 if encountered a '%' */
(gdb) where
#0  _vprintk (fmt=0x2c90 "Hello World!\n", ap=...) at /home/ljanyst/Projects/zephyr/zephyr-project/misc/printk.c:75
#1  0x00001b46 in printk (fmt=0x2c90 "Hello World!\n") at /home/ljanyst/Projects/zephyr/zephyr-project/misc/printk.c:165
#2  0x000012a2 in main () at /home/ljanyst/Projects/zephyr/zephyr-project/samples/hello_world/nanokernel/src/main.c:37

The problem

I have been playing with one of my toy projects and ended up having to shuffle text many times in a file that is around 10MB long. A naïve implementation chopping a string and gluing it together again ended up being painfully slow:

$$ O(n \cdot k) $$

where n is the length of the string and k is the number of shuffling operations.
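For reference, the naïve version I started with looked roughly like the sketch below (the function name and exact interface are illustrative, not the actual project code); every substr and concatenation copies characters, so a single operation is O(n):

```cpp
#include <string>

// Move the substring [i, j] of s so that it starts at position k of the
// remaining text. Each call copies the whole string, so k calls on a
// string of length n cost O(n * k).
std::string shuffle_naive(const std::string &s, size_t i, size_t j, size_t k)
{
  std::string chunk = s.substr(i, j - i + 1);           // the part to move
  std::string rest  = s.substr(0, i) + s.substr(j + 1); // everything else
  return rest.substr(0, k) + chunk + rest.substr(k);    // splice it back in
}
```

With a 10MB string and 100k operations this adds up to roughly a terabyte of copied memory, which explains the running times in the test section below.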

A solution

It turns out that the problem can be efficiently solved using splay trees, where each node holds:

  • the starting offset in the source string
  • the length of the substring it represents
  • the offset of the node's substring within its subtree, i.e. the total length of all the children on the left-hand side
  • the length of the node's substring plus the total length of all the children on the right-hand side

The find operation looks for a node containing the offset and splits it in two if the node's substring does not start exactly at that offset.
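The Node structure and the update helper that the code relies on are not shown in the post; a sketch consistent with how the fields are used might look like this (the real definitions are in rope-splay.cpp):

```cpp
#include <cstdint>

struct Node {
  uint32_t start;        // starting offset in the source string
  uint32_t length;       // length of the substring this node represents
  uint32_t offset;       // total length of the left-hand-side children
  uint32_t total_length; // length plus total length of the right-hand-side children
  Node    *left;
  Node    *right;
};

// total number of characters in the subtree rooted at n
static uint32_t Total(Node *n)
{
  return n ? n->offset + n->total_length : 0;
}

// recompute the cached sizes after the children of n have changed
static void update(Node *n)
{
  n->offset       = Total(n->left);
  n->total_length = n->length + Total(n->right);
}
```

With this bookkeeping, the total size of any subtree is available in O(1) as offset + total_length, which is what Find and process exploit below.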

static Node *Find(uint32_t offset, Node *root)
{
  if(!root)
    return 0;
  if(offset >= root->offset && offset < (root->offset + root->length))
    return root;
  if(offset < root->offset)
    return Find(offset, root->left);
  return Find(offset-root->offset-root->length, root->right);
}

static Node *STFind(uint32_t offset, Node *root)
{
  Node *n = Find(offset, root);
  Splay(n);

  if(n->offset < offset) {
    Node *newNode = new Node();
    newNode->start  = n->start;
    newNode->length = offset - n->offset;
    n->start  = newNode->start + newNode->length;
    n->length -= newNode->length;
    newNode->left = n->left;
    update(newNode);
    n->left = newNode;
    update(n);
  }
  return n;
}

The actual shuffling boils down to splitting and merging sub-trees; the result is produced by traversing the whole tree in-order.

void process(int i, int j, int k)
{
  Node *n, *t, *left, *right = 0;
  STSplit(left, n, i, pRoot);
  if(j+1 < pS.length())
    STSplit(n, right, j+1-i, n);
  t = STMerge(left, right);
  if(k < (t->offset + t->total_length))
    STSplit(left, right, k, t);
  else {
    right = 0;
    left = t;
  }
  pRoot = STMerge(left, n);
  pRoot = STMerge(pRoot, right);
}

The amortized complexity is:

$$ O(k \cdot \log(k) + n) $$

See rope-splay.cpp.

A test

It all runs pretty nicely for a test case with 10M characters and 100k shuffle operations:

]==> ./testgen 10000000 100000 > test
]==> time ./rope-naive < test > test-o-naive
./rope-naive < test > test-o-naive  386.17s user 66.19s system 99% cpu 7:32.86 total
]==> time ./rope-splay < test > test-o-splay
./rope-splay < test > test-o-splay  0.71s user 0.04s system 99% cpu 0.752 total
]==> diff test-o-splay test-o-naive

Table of Contents

  1. Compiling and start-up code
  2. Hardware Abstraction Layer and UART
  3. Debugging, display, heap and fonts
  4. Timers and ADC
  5. DAC, Sound and Nokia Tunes
  6. Random Number Generator, Rendering Engine and the Game
  7. Operating System

Random Number Generator

To make the game more engaging, we introduce some randomness into it. We don't need anything cryptographically secure, so a Linear Congruential Generator will do just fine. We count the time from start-up in millisecond-long jiffies and wait for the first button press to select the seed.

void button_event(IO_io *io, uint16_t event)
{
  uint64_t btn;
  IO_get(io, &btn);
  if(btn)
    button_value = 1;

  if(!rng_initialized) {
    rng_initialized = 1;
    IO_rng_seed(IO_time());
  }
}
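The generator behind IO_rng_seed is not shown above; a minimal linear congruential generator could look like the sketch below. The constants are the common Numerical Recipes ones, and the IO_rng_next name is my guess, so neither is necessarily what the library actually uses:

```cpp
#include <cstdint>

static uint32_t rng_state;

// Seed the generator; in the game the seed is IO_time() at the first
// button press.
void IO_rng_seed(uint64_t seed)
{
  rng_state = (uint32_t)seed;
}

// x(n+1) = (a * x(n) + c) mod 2^32 -- the modulus comes for free from
// 32-bit unsigned overflow.
uint32_t IO_rng_next()
{
  rng_state = rng_state * 1664525u + 1013904223u;
  return rng_state;
}
```

The low bits of an LCG are weak, so for things like invader movement it is better to use the high bits of the returned value.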

Rendering Engine

The rendering engine takes a scene descriptor, a display device, and a timer. Based on this information it computes new positions of objects, draws them on the screen if necessary and checks for collisions.

struct SI_scene {
  SI_object **objects;
  void      (*pre_render)(struct SI_scene *);
  void      (*collision)(SI_object *obj1, SI_object *obj2);
  uint8_t     fps;
  uint8_t     num_objects;
  uint8_t     flags;
};

void SI_scene_render(SI_scene *scene, IO_io *display, IO_io *timer);

Each SI_scene holds a list of "polymorphic" objects that should be rendered, a pointer to a pre_render function that calculates a new position of each object, and a pointer to a collision callback that is invoked when the scene renderer detects an overlap between two objects. The SI_scene_render function runs after every interrupt:

while(1) {
  SI_scene_render(&scenes[current_scene].scene, &display, &scene_timer);
  IO_wait_for_interrupt();
}

Whether it gets executed or not depends on the flags parameter of the scene. If it's set to SI_SCENE_IGNORE, the renderer returns immediately. On the other hand, if it's set to SI_SCENE_RENDER, the renderer calls the pre_render callback, draws the objects on the screen, and computes the object overlaps, notifying the collision callback if necessary. After each frame, the scene is disabled (SI_SCENE_IGNORE). It is re-enabled by the timer interrupt in a time quantum that depends on the fps parameter.

See SI_scene.h and SI_scene.c.
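The gating between the renderer and the timer can be sketched as follows; the structure layout, the frames_drawn counter, the flag values, and the millisecond tick are simplified assumptions for illustration, not the actual SI_scene.h contents:

```cpp
#include <cstdint>

enum { SI_SCENE_IGNORE = 0, SI_SCENE_RENDER = 1 };

struct Scene {
  uint8_t  fps;          // how often the scene should be drawn
  uint8_t  flags;        // SI_SCENE_IGNORE or SI_SCENE_RENDER
  uint32_t frames_drawn; // stand-in for the actual rendering work
};

// called from the main loop after every interrupt
void scene_render(Scene *s)
{
  if(s->flags != SI_SCENE_RENDER)
    return;                   // scene disabled, nothing to do this wake-up
  ++s->frames_drawn;          // pre_render, drawing and collisions go here
  s->flags = SI_SCENE_IGNORE; // disabled until the timer re-enables it
}

// called from the timer interrupt once per millisecond-long jiffy
void timer_tick(Scene *s, uint32_t jiffy)
{
  if(jiffy % (1000 / s->fps) == 0)
    s->flags = SI_SCENE_RENDER;
}
```

Running this for one simulated second with fps set to 50 draws exactly 50 frames, regardless of how many other interrupts wake the main loop in between.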

Each object has a draw function that enables the renderer to draw it on the screen. There are three types of objects: a generic object, a bitmap object, and a text object:

struct SI_object {
  uint16_t x;
  uint16_t y;
  uint16_t width;
  uint16_t height;
  uint8_t  flags;
  uint8_t  user_flags;
  void (*draw)(struct SI_object *this, IO_io *display);
};

struct SI_object_bitmap {
  SI_object        obj;
  const IO_bitmap *bmp;
};

struct SI_object_text {
  SI_object      obj;
  const char    *text;
  const IO_font *font;
};

The object array in the scene is initialized with the SI_object pointers:

static SI_object         score_obj;
static SI_object_bitmap  invader_obj[5];
scene->objects[1] = &score_obj;
scene->objects[i+5] = &invader_obj[i].obj;

See SI_scene_game.c.

The renderer calls the draw function of each SI_OBJECT_VISIBLE object:

obj->draw(obj, display);

Finally, each draw method uses the CONTAINER_OF macro to compute the pointer to the actual object of concrete type:

#define CONTAINER_OF(TYPE, MEMBER, MEMBER_ADDR) \
  ((TYPE *) ( (char *)MEMBER_ADDR - offsetof(TYPE, MEMBER)))

void SI_object_bitmap_draw(SI_object *obj, IO_io *display)
{
  SI_object_bitmap *this = CONTAINER_OF(SI_object_bitmap, obj, obj);
  IO_display_print_bitmap(display, obj->x, obj->y, this->bmp);
}

void SI_object_text_draw(SI_object *obj, IO_io *display)
{
  SI_object_text *this = CONTAINER_OF(SI_object_text, obj, obj);
  IO_display_set_font(display, this->font);
  IO_display_cursor_goto(display, obj->x, obj->y);
  IO_print(display, "%s", this->text);
}

The Game

All this seems to work pretty well when put together:

The Game

See silly-invaders.c.