Recent Content

Intro

It took some playing around, but I have finally managed to figure out how to build from source all the tools necessary to put Zephyr on Arduino 101. You may say that the effort is pointless because you could just use whatever is provided by the SDK. For me, however, the deal is more about what I can learn from the experience than about the result itself. There is enough open source code around to make things work reasonably well, but putting it all together is a bit of a challenge, so what follows is a short HOWTO.

Arduino 101 setup

Toolchain

Arduino 101 has a Quark core and an ARC EM core. The appropriate targets seem to be i586-none-elfiamcu and arc-none-elf for the former and the latter respectively. Since there is no pre-packaged toolchain for either of these in Debian, you'll need to build your own. You can use the vanilla binutils (version 2.27 worked for me) and the vanilla newlib (version 2.4.0.20160527 did not require any patches). GCC is somewhat more problematic. Since apparently not all the necessary ARC patches have been accepted into the mainline yet, you'll need to download it from the Synopsys GitHub repo. GDB requires tweaking for both cores.

binutils:

]==> mkdir binutils && cd binutils
]==> wget https://ftp.gnu.org/gnu/binutils/binutils-2.27.tar.bz2
]==> tar jxf binutils-2.27.tar.bz2
]==> mkdir i586-none-elfiamcu && cd i586-none-elfiamcu
]==> ../binutils-2.27/configure --prefix=/home/ljanyst/Apps/cross-compilers/i586-none-elfiamcu --target=i586-none-elfiamcu
]==> make -j12 && make install
]==> cd .. && mkdir arc-none-elf && cd arc-none-elf
]==> ../binutils-2.27/configure --prefix=/home/ljanyst/Apps/cross-compilers/arc-none-elf --target=arc-none-elf
]==> make -j12 && make install
]==> cd ../..

gcc:

]==> mkdir gcc && cd gcc
]==> wget ftp://ftp.uvsq.fr/pub/gcc/releases/gcc-6.2.0/gcc-6.2.0.tar.bz2
]==> tar jxf gcc-6.2.0.tar.bz2
]==> git clone git@github.com:foss-for-synopsys-dwc-arc-processors/gcc.git
]==> cd gcc && git checkout arc-4.8-dev && cd ..
]==> mkdir i586-none-elfiamcu && cd i586-none-elfiamcu
]==> ../gcc-6.2.0/configure --prefix=/home/ljanyst/Apps/cross-compilers/i586-none-elfiamcu --target=i586-none-elfiamcu --enable-languages=c --with-newlib
]==> make -j12 all-gcc && make install-gcc
]==> cd .. && mkdir arc-none-elf && cd arc-none-elf
]==> ../gcc/configure --prefix=/home/ljanyst/Apps/cross-compilers/arc-none-elf --target=arc-none-elf  --enable-languages=c --with-newlib --with-cpu=arc700
]==> make -j12 all-gcc && make install-gcc
]==> cd ../..

newlib:

]==> mkdir newlib && cd newlib
]==> wget ftp://sourceware.org/pub/newlib/newlib-2.4.0.20160527.tar.gz
]==> tar zxf newlib-2.4.0.20160527.tar.gz
]==> mkdir i586-none-elfiamcu && cd i586-none-elfiamcu
]==> ../newlib-2.4.0.20160527/configure --prefix=/home/ljanyst/Apps/cross-compilers/i586-none-elfiamcu --target=i586-none-elfiamcu
]==> make -j12 && make install
]==> cd .. && mkdir arc-none-elf && cd arc-none-elf
]==> ../newlib-2.4.0.20160527/configure --prefix=/home/ljanyst/Apps/cross-compilers/arc-none-elf --target=arc-none-elf
]==> make -j12 && make install
]==> cd ../..

libgcc:

]==> cd gcc/i586-none-elfiamcu
]==> make -j12 all-target-libgcc && make install-target-libgcc
]==> cd ../arc-none-elf
]==> make -j12 all-target-libgcc && make install-target-libgcc
]==> cd ../..

GDB does not work for either platform out of the box. For Quark it compiles the i386 version but does not recognize the iamcu architecture even though, according to Wikipedia, it's essentially the same as i586 and libbfd knows about it. After some poking around the code, it seems that initializing the i386 platform with the iamcu bfd architecture definition does the trick:

diff -Naur gdb-7.11.1.orig/gdb/i386-tdep.c gdb-7.11.1/gdb/i386-tdep.c
--- gdb-7.11.1.orig/gdb/i386-tdep.c     2016-06-01 02:36:15.000000000 +0200
+++ gdb-7.11.1/gdb/i386-tdep.c  2016-09-24 15:39:11.000000000 +0200
@@ -8890,6 +8890,7 @@
 _initialize_i386_tdep (void)
 {
   register_gdbarch_init (bfd_arch_i386, i386_gdbarch_init);
+  register_gdbarch_init (bfd_arch_iamcu, i386_gdbarch_init);
 
   /* Add the variable that controls the disassembly flavor.  */
   add_setshow_enum_cmd ("disassembly-flavor", no_class, valid_flavors,

For ARC the Synopsys open source repo provides a solution.

]==> mkdir gdb && cd gdb
]==> wget ftp://sourceware.org/pub/gdb/releases/gdb-7.11.1.tar.xz
]==> tar xf gdb-7.11.1.tar.xz
]==> cd gdb-7.11.1 && patch -Np1 -i ../iamcu-tdep.patch && cd ..
]==> git clone git@github.com:foss-for-synopsys-dwc-arc-processors/binutils-gdb.git
]==> cd binutils-gdb && git checkout arc-2016.09-gdb && cd ..
]==> mkdir i586-none-elfiamcu && cd i586-none-elfiamcu
]==> ../gdb-7.11.1/configure --prefix=/home/ljanyst/Apps/cross-compilers/i586-none-elfiamcu --target=i586-none-elfiamcu
]==> make -j12 && make install
]==> cd .. && mkdir arc-none-elf && cd arc-none-elf
]==> ../binutils-gdb/configure --prefix=/home/ljanyst/Apps/cross-compilers/arc-none-elf --target=arc-none-elf
]==> make -j12 all-gdb && make install-gdb
]==> cd ../..

OpenOCD

There was no OpenOCD release for quite some time, and it does not seem to have any support for Quark SE. The situation is not much better if you look at the head of the master branch of their repo. Fortunately, both Intel and Synopsys provide some support for their parts of the platform and making it work with mainline openocd does not seem to be hard.

]==> git clone git@github.com:ljanyst/openocd.git && cd openocd
]==> git checkout lj
]==> ./bootstrap
]==> ./configure --prefix=/home/ljanyst/Apps/openocd
]==> make -j12 && make install

Zephyr uses the following configuration for the Arduino (referred to as openocd.cfg below):

source [find interface/ftdi/flyswatter2.cfg]
source [find board/quark_se.cfg]

quark_se.quark configure -event gdb-attach {
        reset halt
        gdb_breakpoint_override hard
}

quark_se.quark configure -event gdb-detach {
        resume
        shutdown
}

You can use the following commands to run the GDB server, flash for Quark and flash for ARC respectively (this is what Zephyr does):

]==> openocd -s /home/ljanyst/Apps/openocd/share/openocd/scripts/ -f openocd.cfg  -c 'init' -c 'targets' -c 'reset halt'
]==> openocd -s /home/ljanyst/Apps/openocd/share/openocd/scripts/ -f openocd.cfg  -c 'init' -c 'targets' -c 'targets quark_se.arc-em' -c 'reset halt' -c 'load_image zephyr.bin 0x40010000' -c 'reset halt' -c 'verify_image zephyr.bin 0x40010000' -c 'reset run' -c 'shutdown'
]==> openocd -s /home/ljanyst/Apps/openocd/share/openocd/scripts/ -f openocd.cfg  -c 'init' -c 'targets' -c 'targets quark_se.arc-em' -c 'reset halt' -c 'load_image zephyr.bin 0x40034000' -c 'reset halt' -c 'verify_image zephyr.bin 0x40034000' -c 'reset run' -c 'shutdown'

Hello world!

You need to compile and flash Zephyr's Hello World sample. The two commands below do the trick for the compilation part:

make BOARD=arduino_101_factory CROSS_COMPILE=i586-none-elfiamcu- CFLAGS="-march=lakemont -mtune=lakemont -msoft-float -miamcu -O0"
make BOARD=arduino_101_sss_factory CROSS_COMPILE=arc-none-elf-

After flashing, you should see the following on your UART console:

]==> screen /dev/ttyUSB0 115200,cs8
ipm_console0: 'Hello World! arc'
Hello World! x86

Debugging

If you follow the instructions from the Zephyr wiki, debugging for the Intel part works fine. I still have some trouble making breakpoints work for ARC and will try to write an update if I have time to figure it out.

]==> i586-none-elfiamcu-gdb outdir/zephyr.elf
...
(gdb) target remote :3333
Remote debugging using :3333
0x0000fff0 in ?? ()
(gdb) b main
Breakpoint 1 at 0x400100ed: file /home/ljanyst/Projects/zephyr/zephyr-project/samples/hello_world/nanokernel/src/main.c, line 37.
(gdb) c
Continuing.
target running
hit hardware breakpoint (hwreg=0) at 0x400100ed

Breakpoint 1, main () at /home/ljanyst/Projects/zephyr/zephyr-project/samples/hello_world/nanokernel/src/main.c:37
37              PRINT("Hello World! %s\n", CONFIG_ARCH);
(gdb) s
step done from EIP 0x400100ed to 0x400100f2
step done from EIP 0x400100f2 to 0x400100f7
step done from EIP 0x400100f7 to 0x40013129
target running
hit hardware breakpoint (hwreg=1) at 0x4001312f
printk (fmt=0x40013e04 "Hello World! %s\n") at /home/ljanyst/Projects/zephyr/zephyr-project/misc/printk.c:164
164             va_start(ap, fmt);
(gdb) s
step done from EIP 0x4001312f to 0x40013132
step done from EIP 0x40013132 to 0x40013135
165             _vprintk(fmt, ap);
(gdb)

Intro

I have finally received my Kickstarter-backed UP boards. So far they seem great! There are three minor drawbacks, though:

  1. They don't have the exact same shape as Raspberry PI's, so they don't fit the raspberry cases. It's nothing that could not be rectified with small pliers, though.
  2. The audio chip on Cherry Trail (Intel Atom x5 z8350) SoCs is not yet supported by Linux out of the box, so some fiddling with the kernel is necessary.
  3. Debian's UEFI boot configuration does not seem to work from the get-go either.

Boot

You can install Debian Testing using a USB stick. Don't try Jessie, though - the kernel will not detect the MMC card. Things should work fine, except that grub will install itself in /EFI/debian/grubx64.efi on the EFI partition. You will need to move it to /EFI/boot/bootx64.efi manually. It's possible to do it from the UEFI shell using familiar commands.

Media

Kodi installs and works out of the box from Debian Multimedia. Unfortunately, to get the sound working, you will need to recompile the kernel :)

Get the sources and the necessary patches and create the config:

git clone https://github.com/torvalds/linux.git
cd linux
git remote add cht https://github.com/plbossart/sound.git
git fetch cht
git checkout byt-cht-hdmi-v4.7
make oldconfig

You will need to edit .config :

  • to set CONFIG_SYSTEM_TRUSTED_KEYS variable to an empty string
  • to enable CONFIG_HDMI

Then the only thing that's left is building and installing the package:

fakeroot make-kpkg --initrd --append-to-version=-up --revision=1 -j 8 kernel_image kernel_headers
cd ..
sudo dpkg -i linux-image-4.7.0-up+_1_amd64.deb

I wanted to see how efficient it is, so I ran the compilation on the board itself. It took roughly 2.5 hours and the board got very hot. The board has no trouble playing the FullHD video files over Samba that the Raspberry PI 2 couldn't handle. The audio quality is much better too. It seems that surround 5.1 actually works. :)

Intro

My medium-term goal is to port my Silly Invaders game to a Real Time Operating System. Zephyr seems to be a good choice. It's open source, operates under the auspices of the Linux Foundation and has an active community with many developers from Intel committing the code.

They, unfortunately, do not support Tiva so I will need to port the OS before I can proceed with the application. I decided to buy the Freescale K64F board, which is supported, to familiarize myself a little with Zephyr before I start the porting work. The howto page for setting up K64F seems to be terribly complicated and requires a JTAG programmer. I summarize here a simpler way using cmsis-dap over USB.

Setup

I updated the MBED interface firmware following the instructions on this site. I also built my own OpenOCD from the head of the master branch using the following configuration options:

./configure --prefix=/home/ljanyst/Apps/openocd  --enable-cmsis-dap

Things may work fine with the stock firmware and the stock OpenOCD as well, but I did not try that. It's also probably a good idea to add the following udev rule so that you don't have to run things as root:

]==> cat /etc/udev/rules.d/99-openocd.rules
# frdm-k64f
ATTRS{idVendor}=="0d28", ATTRS{idProduct}=="0204", GROUP="plugdev", MODE="0660"
]==> sudo udevadm control --reload-rules

Hello world!

I use the ARM cross-compiler provided by Debian to compile Zephyr and then just copy the resulting binary to the MBED disk:

]==> cd samples/hello_world/nanokernel
]==> make BOARD=frdm_k64f CROSS_COMPILE=arm-none-eabi- CFLAGS=-O0
]==> cp outdir/zephyr.bin /media/ljanyst/MBED/

You can see the effects in the UART console using screen :

]==> screen /dev/ttyACM0 115200,cs8
Hello World!

I then run OpenOCD using the following script:

]==> cat k64f.cfg
set CHIPNAME k60
source [find target/kx.cfg]

$_TARGETNAME configure -event gdb-attach {
  halt
}

]==> openocd -s /home/ljanyst/Apps/openocd/share/openocd/scripts/ -c "interface cmsis-dap" -f k64f.cfg

And GDB:

]==> cat remote1.conf
target extended-remote :3333
monitor reset init
break main
continue
]==> arm-none-eabi-gdb  --command=remote1.conf outdir/zephyr.elf
...
Breakpoint 1 at 0x129c: file /home/ljanyst/Projects/zephyr/zephyr-project/samples/hello_world/nanokernel/src/main.c, line 37.
Note: automatically using hardware breakpoints for read-only addresses.

Breakpoint 1, main () at /home/ljanyst/Projects/zephyr/zephyr-project/samples/hello_world/nanokernel/src/main.c:37
37              PRINT("Hello World!\n");
(gdb) s
printk (fmt=0x2c90 "Hello World!\n") at /home/ljanyst/Projects/zephyr/zephyr-project/misc/printk.c:164
164             va_start(ap, fmt);
(gdb) s
165             _vprintk(fmt, ap);
(gdb) s
_vprintk (fmt=0x2c90 "Hello World!\n", ap=...) at /home/ljanyst/Projects/zephyr/zephyr-project/misc/printk.c:75
75              int might_format = 0; /* 1 if encountered a '%' */
(gdb) where
#0  _vprintk (fmt=0x2c90 "Hello World!\n", ap=...) at /home/ljanyst/Projects/zephyr/zephyr-project/misc/printk.c:75
#1  0x00001b46 in printk (fmt=0x2c90 "Hello World!\n") at /home/ljanyst/Projects/zephyr/zephyr-project/misc/printk.c:165
#2  0x000012a2 in main () at /home/ljanyst/Projects/zephyr/zephyr-project/samples/hello_world/nanokernel/src/main.c:37

The problem

I have been playing with one of my toy projects and ended up having to shuffle text many times in a file that is around 10MB long. A naïve implementation chopping a string and gluing it together again ended up being painfully slow:

$$ O(n \cdot k) $$

where n is the length of the string and k is the number of shuffling operations.

A solution

It turns out that the problem can be efficiently solved using splay trees, where each node holds:

  • the starting offset in the source string
  • the length of the substring it represents
  • the offset from the beginning of the substring taking into account all of the children on the left-hand-side
  • total length of the substring taking into account all the children on the right-hand side

The find operation looks for a node containing the offset and splits it in two, if the node does not start with the offset.

static Node *Find(uint32_t offset, Node *root)
{
  if(!root)
    return 0;
  if(offset >= root->offset && offset < (root->offset + root->length))
    return root;
  if(offset < root->offset)
    return Find(offset, root->left);
  return Find(offset-root->offset-root->length, root->right);
}

static Node *STFind(uint32_t offset, Node *root)
{
  Node *n = Find(offset, root);
  Splay(n);

  if(n->offset < offset) {
    Node *newNode = new Node();
    newNode->start  = n->start;
    newNode->length = offset - n->offset;
    n->start  = newNode->start + newNode->length;
    n->length -= newNode->length;
    newNode->left = n->left;
    update(newNode);
    n->left = newNode;
    update(n);
  }
  return n;
}
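To make the bookkeeping a bit more concrete, here is a simplified sketch in plain C. It is not the code from rope-splay.cpp: the `PieceNode` type and the `piece_update` and `piece_find` names are made up for illustration, there is no splaying, and instead of the two per-node aggregates listed above it keeps a single subtree length, which is enough to locate the piece that contains a given logical offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified node: one aggregate instead of two. */
typedef struct PieceNode {
  uint32_t start;   /* offset of this piece in the source string */
  uint32_t length;  /* length of this piece */
  uint32_t total;   /* length of all pieces in the subtree rooted here */
  struct PieceNode *left, *right;
} PieceNode;

static uint32_t subtree_total(const PieceNode *n)
{
  return n ? n->total : 0;
}

/* Recompute the aggregate after the children have changed. */
static void piece_update(PieceNode *n)
{
  n->total = subtree_total(n->left) + n->length + subtree_total(n->right);
}

/* Find the piece containing the logical offset; *local receives the
   position inside that piece. Returns NULL if the offset is out of range. */
static PieceNode *piece_find(PieceNode *root, uint32_t offset, uint32_t *local)
{
  assert(local != NULL);
  while(root) {
    uint32_t left = subtree_total(root->left);
    if(offset < left)
      root = root->left;
    else if(offset < left + root->length) {
      *local = offset - left;
      return root;
    }
    else {
      offset -= left + root->length;
      root = root->right;
    }
  }
  return NULL;
}
```

Splitting a piece at `*local` and re-running `piece_update` on the way up is then all that is needed to keep the aggregates consistent.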

The actual shuffling boils down to splitting and merging sub-trees, and producing the result boils down to traversing the whole tree in-order.

void process(int i, int j, int k)
{
  Node *n, *t, *left, *right = 0;
  STSplit(left, n, i, pRoot);
  if(j+1 < pS.length())
    STSplit(n, right, j+1-i, n);
  t = STMerge(left, right);
  if(k < (t->offset + t->total_length))
    STSplit(left, right, k, t);
  else {
    right = 0;
    left = t;
  }
  pRoot = STMerge(left, n);
  pRoot = STMerge(pRoot, right);
}

The amortized complexity is:

$$ O(k \cdot \log k + n) $$
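To put rough numbers on the two bounds, take the test case used in this post (n = 10^7 characters, k = 10^5 operations). Assuming a base-2 logarithm and ignoring all constant factors, which matter a lot in practice, the naïve approach is on the order of

$$ n \cdot k = 10^{7} \cdot 10^{5} = 10^{12} $$

elementary steps, while the splay tree needs something on the order of

$$ k \cdot \log_2 k + n \approx 10^{5} \cdot 16.6 + 10^{7} \approx 1.2 \cdot 10^{7} $$

This is only an order-of-magnitude comparison, not a prediction of the actual running times.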

See rope-splay.cpp.

A test

It all runs pretty nicely for a test case with 10M characters and 100k shuffle operations:

]==> ./testgen 10000000 100000 > test
]==> time ./rope-naive < test > test-o-naive
./rope-naive < test > test-o-naive  386.17s user 66.19s system 99% cpu 7:32.86 total
]==> time ./rope-splay < test > test-o-splay
./rope-splay < test > test-o-splay  0.71s user 0.04s system 99% cpu 0.752 total
]==> diff test-o-splay test-o-naive

Table of Contents

  1. Compiling and start-up code
  2. Hardware Abstraction Layer and UART
  3. Debugging, display, heap and fonts
  4. Timers and ADC
  5. DAC, Sound and Nokia Tunes
  6. Random Number Generator, Rendering Engine and the Game

Random Number Generator

To make the game more engaging, we introduce some randomness into it. We don't need anything cryptographically secure, so a Linear Congruential Generator will do just fine. We count the time from the start-up in millisecond-long jiffies and wait for a first button press to select the seed.

void button_event(IO_io *io, uint16_t event)
{
  uint64_t btn;
  IO_get(io, &btn);
  if(btn)
    button_value = 1;

  if(!rng_initialized) {
    rng_initialized = 1;
    IO_rng_seed(IO_time());
  }
}
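For reference, a Linear Congruential Generator fits in a few lines. The sketch below is not the implementation behind IO_rng_seed; the `lcg_*` names are hypothetical, and the multiplier and increment are the widely quoted Numerical Recipes constants with an implicit modulus of 2^32.

```c
#include <assert.h>
#include <stdint.h>

/* Generator state; set from the jiffies count on the first button press. */
static uint32_t lcg_state;

static void lcg_seed(uint32_t seed)
{
  lcg_state = seed;
}

/* x' = (a * x + c) mod 2^32; the modulus comes for free from the
   wrap-around of unsigned 32-bit arithmetic. */
static uint32_t lcg_next(void)
{
  lcg_state = lcg_state * 1664525u + 1013904223u;
  return lcg_state;
}

/* Value in [0, bound). The low bits of an LCG are the weakest, so the
   upper half of the state is used. */
static uint32_t lcg_below(uint32_t bound)
{
  assert(bound > 0);
  return (lcg_next() >> 16) % bound;
}
```

Seeding from the time of the first button press works because the player never presses the button at exactly the same millisecond twice.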

Rendering Engine

The rendering engine takes a scene descriptor, a display device, and a timer. Based on this information it computes new positions of objects, draws them on the screen if necessary and checks for collisions.

struct SI_scene {
  SI_object **objects;
  void      (*pre_render)(struct SI_scene *);
  void      (*collision)(SI_object *obj1, SI_object *obj2);
  uint8_t     fps;
  uint8_t     num_objects;
  uint8_t     flags;
};

void SI_scene_render(SI_scene *scene, IO_io *display, IO_io *timer);

Each SI_scene holds a list of "polymorphic" objects that should be rendered, a pointer to a pre_render function that calculates a new position of each object, and a pointer to a collision callback that is invoked when the scene renderer detects an overlap between two objects. The SI_scene_render function runs after every interrupt:

  while(1) {
    SI_scene_render(&scenes[current_scene].scene, &display, &scene_timer);
    IO_wait_for_interrupt();
  }

Whether it does any work depends on the flags field of the scene. If it's set to SI_SCENE_IGNORE, the renderer returns immediately. On the other hand, if it's set to SI_SCENE_RENDER, the renderer calls the pre_render callback, draws the objects on the screen, and computes the object overlaps, notifying the collision callback if necessary. After each frame, the scene is disabled (SI_SCENE_IGNORE). It is re-enabled by the timer interrupt after a time quantum that depends on the fps parameter.
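The gating logic can be sketched in isolation. Everything below is a stand-in rather than the code from SI_scene.c: the `Sketch*` names are hypothetical, drawing is reduced to a frame counter, and the frame period is assumed to be simply one second divided by fps, expressed in nanoseconds like the other timer values in this series.

```c
#include <assert.h>
#include <stdint.h>

enum { SKETCH_SCENE_IGNORE = 0, SKETCH_SCENE_RENDER = 1 };

typedef struct {
  uint8_t          fps;
  volatile uint8_t flags;        /* written from the timer interrupt */
  uint32_t         frames_drawn; /* stands in for the actual drawing */
} SketchScene;

/* Time between two frames, in nanoseconds. */
static uint64_t sketch_frame_period_ns(const SketchScene *s)
{
  assert(s->fps > 0);
  return 1000000000ull / s->fps;
}

/* Timer interrupt handler: allow exactly one more frame. */
static void sketch_timer_tick(SketchScene *s)
{
  s->flags = SKETCH_SCENE_RENDER;
}

/* Called after every interrupt; returns 1 if a frame was rendered. */
static int sketch_render(SketchScene *s)
{
  if(s->flags == SKETCH_SCENE_IGNORE)
    return 0;
  s->frames_drawn++;
  s->flags = SKETCH_SCENE_IGNORE; /* wait for the next timer tick */
  return 1;
}
```

The point of the flag is that unrelated interrupts (a button press, a finished ADC sample) wake the main loop without causing extra frames to be drawn.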

See SI_scene.h and SI_scene.c.

Each object has a draw function that enables the renderer to draw it on the screen. There are three types of objects: a generic object, a bitmap object, and a text object:

struct SI_object {
  uint16_t x;
  uint16_t y;
  uint16_t width;
  uint16_t height;
  uint8_t  flags;
  uint8_t  user_flags;
  void (*draw)(struct SI_object *this, IO_io *display);
};

struct SI_object_bitmap {
  SI_object        obj;
  const IO_bitmap *bmp;
};

struct SI_object_text {
  SI_object      obj;
  const char    *text;
  const IO_font *font;
};

The object array in the scene is initialized with the SI_object pointers:

static SI_object         score_obj;
static SI_object_bitmap  invader_obj[5];
scene->objects[1] = &score_obj;
scene->objects[i+5] = &invader_obj[i].obj;

See SI_scene_game.c.

The renderer calls the draw function of each SI_OBJECT_VISIBLE object:

obj->draw(obj, display);

Finally, each draw method uses the CONTAINER_OF macro to compute the pointer to the actual object of concrete type:

#define CONTAINER_OF(TYPE, MEMBER, MEMBER_ADDR) \
  ((TYPE *) ( (char *)MEMBER_ADDR - offsetof(TYPE, MEMBER)))

void SI_object_bitmap_draw(SI_object *obj, IO_io *display)
{
  SI_object_bitmap *this = CONTAINER_OF(SI_object_bitmap, obj, obj);
  IO_display_print_bitmap(display, obj->x, obj->y, this->bmp);
}

void SI_object_text_draw(SI_object *obj, IO_io *display)
{
  SI_object_text *this = CONTAINER_OF(SI_object_text, obj, obj);
  IO_display_set_font(display, this->font);
  IO_display_cursor_goto(display, obj->x, obj->y);
  IO_print(display, "%s", this->text);
}
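The trick is easier to see in a stand-alone example. The `Shape` and `Rect` types below are hypothetical and have nothing to do with the game code; the base member is deliberately not the first field, to show that the macro does real pointer arithmetic rather than relying on the base sitting at offset zero.

```c
#include <assert.h>
#include <stddef.h>

#define CONTAINER_OF(TYPE, MEMBER, MEMBER_ADDR) \
  ((TYPE *) ((char *)(MEMBER_ADDR) - offsetof(TYPE, MEMBER)))

/* The "base class": generic data plus a virtual method. */
typedef struct Shape {
  int x, y;
  int (*area)(struct Shape *self);
} Shape;

/* The "derived class" embeds the base somewhere inside itself. */
typedef struct {
  int   tag;
  Shape base;
  int   w, h;
} Rect;

static int rect_area(Shape *self)
{
  /* Step back from the embedded member to the enclosing object. */
  Rect *r = CONTAINER_OF(Rect, base, self);
  return r->w * r->h;
}

static void rect_init(Rect *r, int w, int h)
{
  assert(r != NULL);
  r->tag = 0;
  r->base.x = 0;
  r->base.y = 0;
  r->base.area = rect_area;
  r->w = w;
  r->h = h;
}
```

A caller that only holds a `Shape *` can invoke `shape->area(shape)` without knowing the concrete type, which is exactly what the renderer does with the draw pointer.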

The Game

All this seems to work pretty well when put together:

The Game

See silly-invaders.c.

DAC

Tiva does not have a DAC, but we'd like to have some sound effects while playing the game. Fortunately, it's easy to make a simple binary-weighted DAC using resistors and GPIO signals. It's not very accurate, but will do.

A binary-weighted DAC

As far as the software is concerned, we will simply take 4 GPIO pins and set them up as output. We will then get an appropriate bit-banded alias such that writing an integer to it is reflected only in the state of these four pins.

int32_t IO_dac_init(IO_io *io, uint8_t module)
{
  if(module > 0)
    return -IO_EINVAL;

  TM4C_gpio_port_init(GPIO_PORTD_NUM);
  TM4C_gpio_pin_init(GPIO_PORTD_NUM, GPIO_PIN0_NUM, 0, 0, 1);
  TM4C_gpio_pin_init(GPIO_PORTD_NUM, GPIO_PIN1_NUM, 0, 0, 1);
  TM4C_gpio_pin_init(GPIO_PORTD_NUM, GPIO_PIN2_NUM, 0, 0, 1);
  TM4C_gpio_pin_init(GPIO_PORTD_NUM, GPIO_PIN3_NUM, 0, 0, 1);

  uint32_t addr = GPIO_REG_BASE + GPIO_PORTD;
  addr += GPIO_PIN0_BIT_OFFSET;
  addr += GPIO_PIN1_BIT_OFFSET;
  addr += GPIO_PIN2_BIT_OFFSET;
  addr += GPIO_PIN3_BIT_OFFSET;
  dac_data = (uint32_t*)addr;

  io->type    = IO_DAC;
  io->sync    = 0;
  io->channel = 0;
  io->flags   = 0;
  io->read    = 0;
  io->write   = dac_write;
  io->event   = 0;

  return 0;
}

See TM4C_platform01.c.
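The address arithmetic can be tried out on a PC. The sketch below rests on my reading of the TM4C123 datasheet, so treat the details as assumptions: bits [9:2] of the address used to access the GPIO data register act as a per-pin mask, and the Port D base address on the APB aperture is 0x40007000. The `sketch_*` names are hypothetical, and the second function only simulates what the hardware does with a masked write.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed base address of GPIO Port D (APB aperture) on TM4C123. */
#define SKETCH_GPIO_PORTD_BASE 0x40007000u

/* Address through which only the pins selected by pin_mask are visible:
   the mask is placed in address bits [9:2]. */
static uint32_t sketch_gpio_masked_addr(uint32_t port_base, uint8_t pin_mask)
{
  assert((port_base & 0x3FFu) == 0);
  return port_base + ((uint32_t)pin_mask << 2);
}

/* Simulation of a write through such an address: pins outside the mask
   keep their previous state. */
static uint8_t sketch_masked_write(uint8_t port_state, uint8_t pin_mask,
                                   uint8_t value)
{
  return (uint8_t)((port_state & ~pin_mask) | (value & pin_mask));
}
```

With pins 0 to 3 selected, the mask is 0x0F and the resulting address is the port base plus 0x3C, which should correspond to the sum of the four per-pin offsets in the code above.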

Sound

We will create a virtual device consisting of a DAC and a timer. Using the timer, we will change the output of the DAC frequently enough to produce sound. Since the timer interrupt needs to be executed often and any delay makes the sound break, we need to assign the highest possible priority to this interrupt so that it does not get preempted.

int32_t IO_sound_init(IO_io *io, uint8_t module)
{
//...
  IO_dac_init(&snd_dac, 0);
  IO_timer_init(&snd_timer, 11);
  TM4C_enable_interrupt(104, 0); // adjust the interrupt priority for timer 11
  snd_timer.event = snd_timer_event;
//...
}

Writing to this virtual device sets the frequency of the tone that we want to play by adjusting the timer's firing rate accordingly.

static int32_t snd_write(IO_io *io, const void *data, uint32_t length)
{
  if(length != 1)
    return -IO_EINVAL;
  const uint64_t *val = data;

  if(!(*val)) {
    snd_interval = 0;
    return 1;
  }
  uint8_t turn_on = 0;
  if(!snd_interval)
    turn_on = 1;

  double interval = 1.0/(*val);
  interval /= 32.0;
  interval *= 1000000000;
  snd_interval = interval;

  if(turn_on)
    IO_set(&snd_timer, interval);
  return 1;
}
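The interval computation is worth a worked example. The helper below is a hypothetical integer-only restatement of the arithmetic in snd_write, assuming the same 32 steps per period and nanosecond timer units; it is not part of the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Number of waveform samples played per period of the tone. */
#define SKETCH_WAVE_STEPS 32u

/* Nanoseconds between two timer events for a tone of freq_hz. */
static uint64_t sketch_tone_interval_ns(uint32_t freq_hz)
{
  assert(freq_hz > 0);
  return 1000000000ull / ((uint64_t)freq_hz * SKETCH_WAVE_STEPS);
}
```

For a 440 Hz tone this gives about 71 µs between timer events, which explains why the interrupt must not be delayed by anything else.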

In reality, the timer fires 32 times more often than the frequency of the tone requires. This is because we use a table with 32 entries to simulate the actual sound wave. In principle, we could just use a sinusoid, but it turns out that the quality of the sound is not so great if we do so. I have found another waveform in the lab materials of EdX's Embedded Systems course that works much better.

static const uint8_t snd_trumpet[] = {
  10, 11, 11, 12, 10,  8,  3,  1,  8, 15, 15, 11, 10, 10, 11, 10, 10, 10, 10,
  10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 10, 10, 10 };

static void snd_timer_event(IO_io *io, uint16_t event)
{
  IO_set(&snd_dac, snd_trumpet[snd_step++]);
  snd_step %= 32;
  if(snd_interval)
    IO_set(io, snd_interval);
}

See TM4C_platform01.c.

Nokia tunes

The tune player API consists of four functions:

int32_t IO_sound_play(IO_io *io, IO_io *timer, IO_tune *tune, uint16_t start);
int32_t IO_sound_stop(IO_io *io);
IO_tune *IO_sound_decode_RTTTL(const char *tune);
void IO_sound_free_tune(IO_tune *tune);

  • IO_sound_play uses a sound device and a timer to play a tune. It sends an IO_EVENT_DONE to the virtual sound device when the playback finishes.
  • IO_sound_stop stops the playback on the given device and returns the index of the last note it played so that it can be restarted from that point.
  • IO_sound_decode_RTTTL takes an RTTTL representation and produces the IO_tune structure that can be handled by the player.
  • IO_sound_free_tune frees the memory allocated by IO_sound_decode_RTTTL when the tune is no longer needed.
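To give an idea of what decoding involves, here is a small sketch that handles only the defaults section of an RTTTL string (the part between the two colons) and converts note lengths to milliseconds. It is not the code from IO_sound.c: the `Rtttl*` names are hypothetical, whitespace is not tolerated, and the fallback values (d=4, o=6, b=63) are what I understand the format's defaults to be.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
  unsigned duration; /* default note length as a divisor: 4 = quarter note */
  unsigned octave;   /* default octave */
  unsigned bpm;      /* quarter notes per minute */
} RtttlDefaults;

/* Parse "name:d=4,o=5,b=120:notes"; returns 0 on success, -1 on error. */
static int rtttl_parse_defaults(const char *tune, RtttlDefaults *out)
{
  const char *p = strchr(tune, ':');
  if(!p)
    return -1;
  ++p;
  out->duration = 4;
  out->octave   = 6;
  out->bpm      = 63;
  while(*p && *p != ':') {
    char key = *p;
    char *end;
    if(p[1] != '=')
      return -1;
    unsigned long v = strtoul(p + 2, &end, 10);
    if(end == p + 2 || v == 0)
      return -1;
    switch(key) {
      case 'd': out->duration = (unsigned)v; break;
      case 'o': out->octave   = (unsigned)v; break;
      case 'b': out->bpm      = (unsigned)v; break;
      default:  return -1;
    }
    p = end;
    if(*p == ',')
      ++p;
  }
  return *p == ':' ? 0 : -1;
}

/* Length of a note in ms; divisor 0 means "use the default duration". */
static unsigned rtttl_note_ms(const RtttlDefaults *d, unsigned divisor)
{
  assert(d->bpm > 0);
  unsigned whole_note_ms = 4u * 60000u / d->bpm;
  return whole_note_ms / (divisor ? divisor : d->duration);
}
```

A full decoder would go on to walk the comma-separated notes, translating each one to a frequency and a duration that can be fed to the sound device.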

There are plenty of tunes all over the Internet. The ones in the demo video are taken from here. The code of the player is based on this work.

It plays music! :)

See IO_sound.c.

Timers

Tiva has 12 timer modules that can be configured in various relatively complex ways. However, for the purpose of this game, we don't need anything fancy. We will, therefore, represent a timer as an IO_io device with the IO_set function (generalized from IO_gpio_set) setting and arming it. When it counts to 0, the IO_TICK event will be reported to the event handler.

void timer_event(IO_io *io, uint16_t event)
{
  IO_set(&timer, 500000000); // fire in half second
}

int main()
{
  IO_init();
  IO_timer_init(&timer, 0);
  timer.event = timer_event;
  IO_set(&timer, 500000000); // fire in half second

  while(1)
    IO_wait_for_interrupt();
}

See TM4C_timer.c and test-07-timer.c.
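Since the values passed to IO_set are in nanoseconds while the hardware counts clock ticks, the driver has to convert between the two at some point. The helper below is only a guess at what such a conversion could look like, not the code from TM4C_timer.c; the `sketch_ns_to_ticks` name is hypothetical, and it assumes a clock frequency that is a whole number of MHz and a 32-bit counter.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a delay in nanoseconds to timer ticks at the given clock rate.
   Dividing the clock down to MHz first keeps the product in 64 bits. */
static uint32_t sketch_ns_to_ticks(uint64_t ns, uint32_t clock_hz)
{
  uint64_t ticks = ns * (clock_hz / 1000000u) / 1000u;
  assert(ticks <= UINT32_MAX); /* must fit the 32-bit counter */
  return (uint32_t)ticks;
}
```

At an assumed 80 MHz system clock, the half-second delay used above corresponds to 40 million ticks, comfortably within a 32-bit counter.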

ADC

Similarly to the timers, the ADC sequencers on Tiva may be set up in fairly sophisticated ways. There are 12 analog pins and two modules with four sequencers each. Again, we don't need anything sophisticated here, so we will just use the first eight pins and assign each of them to a separate sequencer. In the blocking mode, IO_get initiates the readout and returns the value. In the non-blocking and asynchronous modes, IO_set requests sampling and IO_get returns the sample when it is ready. An IO_EVENT_DONE event is reported to the event handler if enabled.

IO_io slider;
IO_io timer;
uint64_t sliderR = 0;

void timer_event(IO_io *io, uint16_t event)
{
  IO_set(&slider, 0); // request a sample
}

void slider_event(IO_io *io, uint16_t event)
{
  IO_get(&slider, &sliderR);
  IO_set(&timer, 100000000); // fire in 0.1 sec
}

int main()
{
  IO_init();

  IO_timer_init(&timer, 0);
  IO_slider_init(&slider, 0, IO_ASYNC);

  timer.event     = timer_event;
  slider.event    = slider_event;

  IO_event_enable(&slider,    IO_EVENT_DONE);

  IO_set(&slider, 0); // request a sample

  while(1)
    IO_wait_for_interrupt();
}

See TM4C_adc.c and test-08-input.c.

The game board

Everything works fine when soldered together as well.

Buttons and the Slider

Intro

The paper shows that, despite the often-repeated mantra, OS task scheduling is far from easy. The authors developed two new tools to investigate the CPU usage and the state of the associated run queues. These allowed them to uncover four interesting performance bugs on a 64-core NUMA system. They discovered that some cores often stay idle for a long time while tasks are waiting. This violates one of the design principles of the Completely Fair Scheduler, the Linux default, which is supposed to be work-conserving. Fixing these bugs resulted in a 138 times speedup in an extreme test case (multithreaded, using spinlocks) and 13-23% speedups in other test cases. This type of bug is hard to uncover because it typically wastes cycles hundreds of milliseconds at a time, which is beyond the resolution of standard monitoring tools.

Completely Fair Scheduler

CFS defines an interval in which each task must run at least once. This interval is then divided between all the tasks in proportion to their weight (niceness). A running thread accumulates vruntime, which is the amount of time it was running divided by its weight. The scheduler keeps these tasks in a run queue which is implemented as a red-black tree. When the CPU becomes idle, the leftmost node is picked because it has accumulated the least weighted runtime.
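The effect of weighting is easy to reproduce in a toy model. The sketch below is not kernel code: the `Sketch*` names are hypothetical, a linear scan over an array stands in for the red-black tree, and the runtime is scaled by 1024 (the weight Linux uses for nice 0) before dividing so that integer arithmetic keeps some precision.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NICE_0_WEIGHT 1024u

typedef struct {
  uint32_t weight;   /* higher weight = larger share of the CPU */
  uint64_t vruntime; /* weighted runtime accumulated so far */
  uint64_t ran_ns;   /* real runtime, kept only for inspection */
} SketchTask;

/* Pick the task with the smallest vruntime (the "leftmost node"). */
static size_t sketch_pick_next(const SketchTask *tasks, size_t n)
{
  assert(n > 0);
  size_t best = 0;
  for(size_t i = 1; i < n; ++i)
    if(tasks[i].vruntime < tasks[best].vruntime)
      best = i;
  return best;
}

/* Charge delta_ns of real runtime to a task. */
static void sketch_account(SketchTask *t, uint64_t delta_ns)
{
  t->ran_ns   += delta_ns;
  t->vruntime += delta_ns * SKETCH_NICE_0_WEIGHT / t->weight;
}
```

With two tasks whose weights are 1024 and 2048, three equal time slices end up split one to two: the heavier task accumulates vruntime half as fast, so it stays leftmost for longer.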

In a multi-core system, each core has its own run queue. To fairly distribute the load among the cores, the run queues must be periodically re-balanced. In today's systems, with dozens of run queues, the balancing procedure is expensive and is not run very often. The cost comes from the need to take other factors into account, such as power saving and cache and memory locality. The load balancing algorithm takes the threads from the most loaded cores and distributes them between the least loaded cores, taking into account the topology of the system. The more complex the system gets, the more rules need to be applied and the harder it gets to reason about performance.

The bugs and the tools

The bugs uncovered by the authors are all related to migrating tasks between NUMA nodes. They were detected using new tools:

  • Sanity Checker checks every second whether there are idle cores in the presence of waiting threads in other run queues. If there are, it monitors the system for another 100ms. If the situation is not remediated, it begins to record the profiling information for further off-line analysis.
  • The scheduler visualization tool taps into various kernel functions to monitor and visualize scheduling activity over time.
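The invariant that the Sanity Checker looks at can be written down in a few lines. This is only a sketch of the condition as described above, not the authors' tool: the function name is hypothetical, the input is a plain array with the number of runnable threads per core, and CPU affinity, which can legitimately prevent a thread from migrating, is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns false if some core is idle while another core has at least
   one thread waiting behind the one that is currently running. */
static bool sketch_work_conserving(const unsigned *nr_running, size_t ncores)
{
  assert(nr_running != NULL || ncores == 0);
  bool idle = false, waiting = false;
  for(size_t i = 0; i < ncores; ++i) {
    if(nr_running[i] == 0)
      idle = true;
    else if(nr_running[i] > 1)
      waiting = true;
  }
  return !(idle && waiting);
}
```

A single failed check is not a bug in itself, since short-lived imbalances are normal; that is why the tool described above only starts recording when the situation persists.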

Conclusion

The authors note that the problems were caused by people wanting to optimize CFS to compensate for the complexity of modern hardware. They suggest rewriting the scheduler as a core and a set of optimization modules.

Debugging

To test and debug the SSI code, I connected two boards and made them talk to each other. It mostly worked. However, it turned out that, by default, you can run the OpenOCD-GDB duo only for one board at a time: the one that libusb enumerates first. There is a patch that lets OpenOCD choose the device to attach to by its serial number. The patch has not made it to any release yet, but applying it and recompiling the whole thing is relatively straightforward: clone the source, apply the patch and run the usual autotools combo. You will then need to create a config file for each board that specifies unique port numbers and defines the serial number of the device to attach to:

]==> cat board1.cfg
gdb_port 3333
telnet_port 4444
tcl_port 6666
interface hla
hla_serial 0Exxxxxx
source [find board/ek-tm4c123gxl.cfg]

]==> cat board2.cfg
gdb_port 3334
telnet_port 4445
tcl_port 6667
interface hla
hla_serial 0Exxxxxx
source [find board/ek-tm4c123gxl.cfg]

Separate GDB batch files come in handy as well:

]==> cat gdb-board1.conf
target extended-remote :3333
monitor reset init
break main
continue

]==> cat gdb-board2.conf
target extended-remote :3334
monitor reset init
break main
continue

Tweaking the linker script

GCC started generating .init and .fini sections that contain no-op functions:

]==> arm-none-eabi-objdump -d  test-06-display.axf
...
Disassembly of section .init:

00007af8 <_init>:
    7af8:       b5f8            push    {r3, r4, r5, r6, r7, lr}
    7afa:       bf00            nop

Disassembly of section .fini:

00007afc <_fini>:
    7afc:       b5f8            push    {r3, r4, r5, r6, r7, lr}
    7afe:       bf00            nop

We will discard this code by adding the following to the linker script:

  /DISCARD/ :
  {
    *(.init*)
    *(.fini*)
  }

GCC also started generating the stack unwinding code and GDB gets confused in some places if it is not present, so we put it after the code in FLASH:

  .ARM.exidx :
  {
    *(.ARM.exidx*)
    *(.gnu.linkonce.armexidx*)
  } > FLASH

See TM4C.ld.

SSI and GPIO

We need both SSI and GPIO to control the Nokia display that we want to use for the game. Since, in the end, both of these subsystems need to push and receive data, they fit the generic interface used for UART well. The SSI initialization function needs many more parameters than the one for UART, so we pack them all in a struct. As far as GPIO is concerned, there are two helpers, IO_gpio_get_state and IO_gpio_set_state, that just write the appropriate byte to the IO device. GPIO also comes with a new event type: IO_EVENT_CHANGE.

struct IO_ssi_attrs {
  uint8_t  master;        //!< 1 for master, 0 for slave
  uint8_t  slave_out;     //!< 1 slave output enabled, 0 slave output disabled
  uint32_t bandwidth;     //!< bandwidth in bps
  uint8_t  frame_format;  //!< frame format
  uint8_t  freescale_spo; //!< SPO value for freescale frames
  uint8_t  freescale_sph; //!< SPH value for freescale frames
  uint8_t  frame_size;    //!< size of the frame in bits
};

See TM4C_ssi.c and TM4C_gpio.c.
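
The bandwidth field has to be translated into clock divisors at some point. Here is a rough sketch of how that could be done, assuming the relation from the TM4C123 data sheet: bit rate = SysClk / (CPSDVSR * (1 + SCR)), with CPSDVSR an even number between 2 and 254 and SCR between 0 and 255. The struct and function names are made up for the example; the real driver may well do it differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: bit rate = sys_clk / (cpsdvsr * (1 + scr)) */
struct sketch_ssi_divisors {
  uint8_t cpsdvsr; /* even prescaler, 2..254 */
  uint8_t scr;     /* serial clock rate, 0..255 */
};

/* Picks the smallest prescaler that yields a bit rate not exceeding the
 * requested bandwidth. Returns 0 on success and -1 if no pair fits. */
int sketch_ssi_find_divisors(uint32_t sys_clk, uint32_t bandwidth,
                             struct sketch_ssi_divisors *out)
{
  assert(out != NULL);
  if (bandwidth == 0)
    return -1;

  /* total divisor needed so that the rate stays at or below the target */
  uint64_t needed = ((uint64_t)sys_clk + bandwidth - 1) / bandwidth;

  for (uint32_t pre = 2; pre <= 254; pre += 2) {
    uint64_t steps = (needed + pre - 1) / pre; /* this is 1 + SCR */
    if (steps == 0)
      steps = 1;
    if (steps <= 256) {
      out->cpsdvsr = (uint8_t)pre;
      out->scr = (uint8_t)(steps - 1);
      return 0;
    }
  }
  return -1; /* the requested rate is too low for this system clock */
}
```

With an 80MHz system clock and a 4Mbps target, this gives a prescaler of 2 and an SCR of 9, that is 80MHz / (2 * 10) = 4MHz.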

Platforms, the display interface, and fonts

All the devices that are not directly on the board may be connected in many different ways. To handle all these configurations with the same board, we split the driver into libtm4c.a (for the board-specific stuff) and libtm4c_platform_01.a (for the particular configuration). For now, the only thing that the platform implements is the display interface. It passes the appropriate SSI module and GPIOs to the actual display driver. The user sees the usual IO_io structure that is initialized with IO_display_init and can be written to and synced. write renders the text to the back-buffer, while sync sends the back-buffer to the device for rendering. There are also a few specialized functions that apply only to display devices:

int32_t IO_display_get_attrs(IO_io *io, IO_display_attrs *attrs);
int32_t IO_display_clear(IO_io *io);
int32_t IO_display_put_pixel(IO_io *io, uint16_t x, uint16_t y, uint32_t argb);
int32_t IO_display_print_bitmap(IO_io *io, uint16_t x, uint16_t y,
  const IO_bitmap *bitmap);
int32_t IO_display_set_font(IO_io *io, const IO_font *font);
int32_t IO_display_cursor_goto(IO_io *io, uint32_t x, uint32_t y);
int32_t IO_display_cursor_goto_text(IO_io *io, uint32_t line, uint32_t space);
int32_t IO_display_cursor_move(IO_io *io, int32_t dx, int32_t dy);
int32_t IO_display_cursor_move_text(IO_io *io, int32_t dline, int32_t dspace);

See IO_display.h.

Platform 01 provides one display device, a PCD8544, the one used in Nokia 5110. It translates and passes the interface calls to the lower-level driver. See pcd8544.c.
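
To give an idea of what the lower-level driver has to deal with, here is a minimal sketch of a back-buffer for this kind of display. It assumes the PCD8544 memory layout: 84 columns by 48 rows, stored as 6 banks of 84 bytes, where each byte holds 8 vertically stacked pixels with the least significant bit on top. All the sketch_ names are invented for the example and are not the ones used in pcd8544.c.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LCD_WIDTH = 84, LCD_HEIGHT = 48, LCD_BANKS = LCD_HEIGHT / 8 };

/* One bit per pixel: 84 * 6 = 504 bytes */
static uint8_t back_buffer[LCD_WIDTH * LCD_BANKS];

/* Hypothetical transport callback, e.g. a wrapper around an SSI write */
typedef int32_t (*sketch_send_fn)(const void *data, uint32_t length);

void sketch_clear(void)
{
  memset(back_buffer, 0, sizeof(back_buffer));
}

/* Sets or clears a single pixel in the back-buffer; returns -1 if the
 * coordinates are off-screen. */
int sketch_put_pixel(uint16_t x, uint16_t y, int on)
{
  if (x >= LCD_WIDTH || y >= LCD_HEIGHT)
    return -1;

  uint8_t *byte = &back_buffer[(y / 8) * LCD_WIDTH + x]; /* bank, then column */
  uint8_t mask = (uint8_t)(1u << (y % 8));               /* row within the bank */

  if (on)
    *byte |= mask;
  else
    *byte &= (uint8_t)~mask;
  return 0;
}

/* Read-only view of the buffer, handy for inspection */
const uint8_t *sketch_buffer(void)
{
  return back_buffer;
}

/* The equivalent of sync: hand the whole back-buffer to the transport */
int32_t sketch_sync(sketch_send_fn send)
{
  assert(send != 0);
  return send(back_buffer, (uint32_t)sizeof(back_buffer));
}
```

Drawing only touches memory; nothing reaches the display until the sync step, which is what makes it cheap to redraw a whole frame.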

If you haven't noticed in the list of the functions above, the display interface supports multiple fonts. In fact, I wrote a script that rasterizes TrueType fonts and creates internal IO_font structures. These can then be used to render text on a display device. All you need to do is provide a TTF file, declare the font name and size in CMake, and then reference it in the font manager. The code comes with DejaVuSans10 and DejaVuSerif10 by default.

The heap

Malloc comes in handy from time to time, so I decided to implement one. It is extremely prone to fragmentation and never merges chunks, so using free is not advisable. Still, sometimes you just wish you had one, for instance, when you need to define a buffer for pixels and don't have a good way to ask for display parameters at compile time. For alignment reasons, the heap management code reserves a bit more than 4K (4120 bytes) for the stack. It then creates a 32-byte guard region protected by the MPU. Everything between the end of the .bss section and the guard region is handled by IO_malloc and IO_free.

void TM4C_heap_init()
{
  uint8_t *stack_start = (uint8_t *)0x20007ff8;
  uint8_t *stack_end   = stack_start-4120;
  uint8_t *stack_guard = stack_end-32;
  uint8_t *heap_start  = (uint8_t *)&__bss_end_vma;

  MPUCTRL_REG |= (uint32_t)0x05; // enable MPU and the background region

  uint32_t val = (uint32_t)stack_guard;
  val |= 0x10; // valid
  val |= 0x07; // highest priority region
  MPUBASE_REG &= ~0xfffffff7;
  MPUBASE_REG |= val;

  val = 0;
  val |= (1 << 28); // disable instruction fetches
  val |= (4 << 1);  // 0x04 == 32bytes
  val |= 1;         // enable the region
  MPUATTR_REG &= ~0x173fff3f;
  MPUATTR_REG |= val;

  IO_set_up_heap(heap_start, stack_guard);
}

See TM4C.c and IO_malloc.c.
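
For illustration, an allocator with the properties described above (first fit, chunks get split but never merged) could look more or less like this. It is a sketch written for this explanation, not the code from IO_malloc.c; the sketch_ names, the chunk header layout and the 8-byte granularity are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Header placed in front of every chunk; size counts payload bytes only */
struct sketch_chunk {
  uint32_t size;
  uint32_t used;
};

static uint8_t *heap_begin;
static uint8_t *heap_end;

/* Turns the memory between start and end into one big free chunk */
void sketch_set_up_heap(void *start, void *end)
{
  uintptr_t s = ((uintptr_t)start + 7u) & ~(uintptr_t)7u; /* round up to 8 */
  uintptr_t e = (uintptr_t)end & ~(uintptr_t)7u;          /* round down to 8 */
  assert(e > s + sizeof(struct sketch_chunk));

  heap_begin = (uint8_t *)s;
  heap_end = (uint8_t *)e;

  struct sketch_chunk *first = (struct sketch_chunk *)heap_begin;
  first->size = (uint32_t)(e - s - sizeof(struct sketch_chunk));
  first->used = 0;
}

/* First fit: walk the chunks and take the first free one that is big enough */
void *sketch_malloc(uint32_t size)
{
  size = (size + 7u) & ~7u; /* keep every chunk 8-byte aligned */
  if (size == 0)
    return NULL;

  uint8_t *p = heap_begin;
  while (p < heap_end) {
    struct sketch_chunk *c = (struct sketch_chunk *)p;
    if (!c->used && c->size >= size) {
      /* split only if the remainder can hold a header and some payload */
      if (c->size >= size + sizeof(struct sketch_chunk) + 8) {
        struct sketch_chunk *rest =
          (struct sketch_chunk *)(p + sizeof(struct sketch_chunk) + size);
        rest->size = c->size - size - (uint32_t)sizeof(struct sketch_chunk);
        rest->used = 0;
        c->size = size;
      }
      c->used = 1;
      return p + sizeof(struct sketch_chunk);
    }
    p += sizeof(struct sketch_chunk) + c->size;
  }
  return NULL;
}

/* Marks the chunk as free; neighbouring free chunks are never merged */
void sketch_free(void *ptr)
{
  if (ptr == NULL)
    return;
  struct sketch_chunk *c =
    (struct sketch_chunk *)((uint8_t *)ptr - sizeof(struct sketch_chunk));
  c->used = 0;
}
```

Because freed chunks keep their size forever, a long sequence of small allocations and frees chops the heap into pieces that can no longer serve a large request. This is the fragmentation problem mentioned above.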

A display test

The LCD demo works fine on the breadboard. As you can see, the text is printed in two kinds of fonts: with and without serifs. After that, the code runs the Game of Life with a gun that shoots gliders.

Glider Gun on a breadboard

Soldering

Since the display works fine, it's safe to do some soldering. We'll use a 9-volt battery as a power source and an LM1086 power regulator to supply 3.3 volts to the microcontroller and other devices.

Soldered Display - Front

Soldered Display - Back

Glider Gun - Soldered

Table of Contents

  1. Compiling and start-up code
  2. Hardware Abstraction Layer and UART
  3. Debugging, display, heap and fonts
  4. Timers and ADC
  5. DAC, Sound and Nokia Tunes
  6. Random Number Generator, Rendering Engine and the Game

Hardware Abstraction Layer

I'd like the game to be as portable as possible. As far as the game logic is concerned, the actual interaction with the hardware is immaterial. Ideally, we just need means to write a pixel to a screen, blink an LED or check the state of a push-button. It means that hiding the hardware details behind a generic interface is desirable. This interface can then be re-implemented for a different kind of board, and the whole thing can act as a cool tool for getting to know new hardware.

In this project, we will use one static library (libio.a) to provide the interface. This library will implement all the hardware-independent functions as well as the stubs for the driver (as weak symbols). Another library (libtm4c.a) will provide the real driver logic for Tiva and the strong symbols. This approach lets the linker do the work of producing the final binary for other platforms in the future.

Initialization, PLL and FPU

To initialize the hardware platform, the user calls IO_init(). The stub for this function is provided by libio.a as follows:

int32_t __IO_init()
{
  return -IO_ENOSYS;
}

WEAK_ALIAS(__IO_init, IO_init);
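
The definition of WEAK_ALIAS is not shown here. With GCC on an ELF target, a macro like this could plausibly be built from the weak and alias function attributes, as in the sketch below. Both the macro body and the numeric value of IO_ENOSYS are assumptions made for the sake of a self-contained example.

```c
#include <assert.h>
#include <stdint.h>

#define IO_ENOSYS 38 /* assumed value, for illustration only */

/* Assumed definition: declare aname as a weak alias of name
 * (GCC and Clang on ELF targets). */
#define WEAK_ALIAS(name, aname) \
  extern __typeof__(name) aname __attribute__((weak, alias(#name)))

int32_t __IO_init(void)
{
  return -IO_ENOSYS;
}

WEAK_ALIAS(__IO_init, IO_init);

/* With no strong IO_init linked in, the call resolves to the stub */
void sketch_check_stub(void)
{
  assert(IO_init() == -IO_ENOSYS);
}
```

If another object file or library defines a regular (strong) IO_init, the linker picks that one and the weak alias is silently ignored, so the same application code links against either the stub or the real driver.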

The actual implementation for Tiva in libtm4c.a initializes the PLL to provide an 80MHz clock and turns on microDMA. It also grants access to the FPU by setting the appropriate bits in the CPAC register and then resetting the pipeline in assembly. We will likely need floating point in the game, and it comes in handy when calculating UART transmission speed parameters.

int32_t IO_init()
{
  TM4C_pll_init();
  TM4C_dma_init();

  // Enable the floating point coprocessor
  CPAC_REG |= (0x0f << 20);
  __asm__ volatile (
    "dsb\r\n"        // force memory writes to complete before continuing
    "isb\r\n" );     // reset the pipeline
  return 0;
}
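
As an example of where floating point helps, the UART baud-rate divisor is a fractional number. The sketch below assumes the formula from the TM4C123 data sheet with the default divide-by-16 clock: BRD = SysClk / (16 * baud), where the integer part goes to the IBRD register and the fraction, scaled by 64 and rounded, goes to FBRD. The struct and function names are made up, and the real driver code may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the two divisor register values */
struct sketch_uart_divisors {
  uint16_t ibrd; /* integer part of the divisor */
  uint8_t  fbrd; /* fractional part in 1/64 units, 0..63 */
};

struct sketch_uart_divisors sketch_uart_divisors(uint32_t sys_clk, uint32_t baud)
{
  assert(baud != 0);

  double brd = (double)sys_clk / (16.0 * (double)baud);
  uint32_t ibrd = (uint32_t)brd;
  uint32_t frac = (uint32_t)((brd - (double)ibrd) * 64.0 + 0.5);

  if (frac == 64) { /* the fraction rounded up to a whole unit */
    ++ibrd;
    frac = 0;
  }

  struct sketch_uart_divisors d;
  d.ibrd = (uint16_t)ibrd;
  d.fbrd = (uint8_t)frac;
  return d;
}
```

For an 80MHz clock and 115200 baud, BRD is about 43.403, which gives an integer part of 43 and a fractional part of 26.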

Simple read/write interface and functions

We provide an IO device abstraction called IO_io and implement four generic functions for accessing it:

int32_t IO_write(IO_io *io, const void *data, uint32_t length);
int32_t IO_print(IO_io *io, const char *format, ...);
int32_t IO_read(IO_io *io, void *data, uint32_t length);
int32_t IO_scan(IO_io *io, uint8_t type, void *data, uint32_t param);

IO_write pushes bytes to the device and IO_read fetches bytes from it. IO_print writes a formatted string to the device using the standard printf semantics. IO_scan reads a word (a run of characters delimited by whitespace) and tries to convert it to the requested type.

Each subsystem needs to provide its initialization function to fill the IO_io struct with the information required to perform the IO operations. For instance, the following function initializes UART:

int32_t IO_uart_init(IO_io *io, uint8_t module, uint16_t flags, uint32_t baud);

It needs to know which UART module to use, what the desired mode of operation is (non-blocking, asynchronous, DMA...) and what the speed of the link should be. This approach hides the hardware details from the user well and is very generic; see test-01-uart.c. For instance, you can write something like this:

IO_init();
IO_io uart0;
IO_uart_init(&uart0, 0, 0, 115200);
IO_print(&uart0, "Hello %s\r\n", "World");

Passing 0 as flags to the UART initialization routine creates a blocking device that is required for IO_print and IO_scan to work.

Non-blocking and asynchronous IO

A blocking IO device will cause the IO functions to return only after they have pushed or pulled all the data to or from the hardware. If, however, you configure a non-blocking (IO_NONBLOCKING) device, the functions will process as many bytes as they can and return. They return -IO_WOULDBLOCK if it is not possible to handle any data.
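
The non-blocking semantics can be modelled on the host with a simulated 16-byte transmit FIFO. This is a sketch only: the FIFO structure, the function name and the numeric error code are invented for the example and are not taken from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EWOULDBLOCK 11 /* hypothetical error code */

enum { SKETCH_FIFO_SIZE = 16 };

/* Stand-in for a hardware transmit FIFO */
struct sketch_fifo {
  uint8_t  data[SKETCH_FIFO_SIZE];
  uint32_t count;
};

/* Pushes as many bytes as fit and returns how many were accepted, or
 * -SKETCH_EWOULDBLOCK if not even one byte could be handled. */
int32_t sketch_write_nonblocking(struct sketch_fifo *fifo, const void *data,
                                 uint32_t length)
{
  assert(fifo != NULL && data != NULL);

  const uint8_t *bytes = data;
  uint32_t written = 0;

  while (written < length && fifo->count < SKETCH_FIFO_SIZE)
    fifo->data[fifo->count++] = bytes[written++];

  if (written == 0 && length > 0)
    return -SKETCH_EWOULDBLOCK;
  return (int32_t)written;
}
```

A blocking variant would instead wait in a loop until the hardware drains the FIFO and keep going until all the bytes have been accepted.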

The IO_ASYNC flag makes the system notify the user about the device readiness for reading or writing. These events are received and processed by a user-defined call-back function:

void uart_event(IO_io *io, uint16_t event)
{
  if(event & IO_EVENT_READ) {
  }

  if(event & IO_EVENT_WRITE) {
  }
}

int main()
{
  IO_init();
  IO_io uart0;
  IO_uart_init(&uart0, 0, IO_NONBLOCKING|IO_ASYNC, 115200);
  uart0.event = uart_event;
  IO_event_enable(&uart0, IO_EVENT_READ|IO_EVENT_WRITE);
  while(1) IO_wait_for_interrupt();
}

See test-02-uart-async.c.

DMA

The DMA mode allows for transferring data between the peripheral and the main memory in the background. It uses the memory bus when the CPU does not need it for anything else. When in this mode, IO_read and IO_write only initiate a background transfer. The next invocation will either block or return -EWOULDBLOCK, depending on other configuration flags, as long as the current DMA operation is in progress. The memory buffer cannot be changed until the DMA transfer is done. Passing IO_ASYNC will generate completion events for DMA operations. It enables us to implement a pretty neat UART echo app:

#include <io/IO.h>
#include <io/IO_uart.h>

char buffer[30];

void uart_event(IO_io *io, uint16_t event)
{
  if(event & IO_EVENT_DMA_READ)
    IO_write(io, buffer, 30);

  if(event & IO_EVENT_DMA_WRITE)
    IO_read(io, buffer, 30);
}

int main()
{
  IO_init();
  IO_io uart0;
  IO_uart_init(&uart0, 0, IO_DMA|IO_ASYNC, 115200);
  uart0.event = uart_event;
  IO_read(&uart0, buffer, 30);
  while(1) IO_wait_for_interrupt();
}

See test-03-uart-dma.c.
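
The bookkeeping behind "only one transfer in flight" is simple enough to model on the host. In the sketch below, starting a transfer marks the channel as busy, a second start is refused until the completion handler runs, and completion yields an event for the user's callback. The channel structure, the error code and the event bit are all hypothetical, and the return value of 0 for a successfully started transfer is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EWOULDBLOCK    11   /* hypothetical error code */
#define SKETCH_EVENT_DMA_DONE 0x01 /* hypothetical event bit */

/* State of one simulated DMA channel */
struct sketch_dma_channel {
  const void *buffer; /* must stay untouched while busy is set */
  uint32_t    length;
  int         busy;
};

/* Only initiates the transfer; refuses if one is already in progress */
int32_t sketch_dma_start(struct sketch_dma_channel *ch, const void *buffer,
                         uint32_t length)
{
  assert(ch != NULL && buffer != NULL);
  if (ch->busy)
    return -SKETCH_EWOULDBLOCK;

  ch->buffer = buffer;
  ch->length = length;
  ch->busy = 1;
  return 0;
}

/* What an interrupt handler would do when the hardware signals completion:
 * release the channel and report the event to pass on to the callback. */
uint16_t sketch_dma_complete(struct sketch_dma_channel *ch)
{
  assert(ch != NULL);
  ch->busy = 0;
  return SKETCH_EVENT_DMA_DONE;
}
```

In the echo app above, the event callback plays the role of the code that runs after completion: each finished read kicks off a write of the same buffer and the other way around.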

The driver

There was nothing particularly hard about writing the driver part. It all boils down to reading the data sheet and following the instructions contained therein. It took quite some time to put everything together into a coherent whole, though. See TM4C_uart.c.