It’s all fine and dandy to say you got C compiling for the Roku, but can you really in good conscience claim that C works if you haven’t got DOOM running?

If you want to see the end result, click here to play DOOM on your Roku. And if you’d like to see the source, visit the wasm2brs project and check out the samples directory where you’ll find DOOM:

https://github.com/TrevorSundberg/wasm2brs

When it came to picking an implementation of DOOM, I decided to go with the SDL port because I’m familiar with the SDL API. Before even trying to get it running on the Roku I wanted to see it running on my local machine.

Several segmentation faults later, I realized that the original DOOM code assumed pointers were 32 bits wide, while my default compiler was targeting 64 bits. Eventually that led me to this modification of the SDL port. The user makava provided the command line arguments I needed to get a running version on my PC, and from there I was confident I could make it run on the Roku!

WASI WAD file

Before getting any graphical or audio output, our first goal is to get the game loading. All of the textures, sounds, levels, text, etc. that DOOM loads come from its proprietary format, the .WAD file. What’s more impressive is that the same engine can run DOOM, DOOM2, and many other flavors. It does this in the function IdentifyVersion by scanning for the existence of any of the known .WAD file names. In our case, we’re using doom1.wad because that’s the shareware / demo version of the original DOOM.

By default, it checks whether the environment variable DOOMWADDIR is present, and if not, it uses the relative path ./doom1.wad. I use wasienv as a convenience to compile WebAssembly programs that target wasi-libc, and it includes a preamble.h that defines getcwd as returning an empty string. On the Roku, a file path must begin with a device such as pkg:/, and since our working directory was empty, the relative path ./doom1.wad did not resolve to a proper Roku path. The simplest fix was to set DOOMWADDIR to pkg:/source and place doom1.wad in the source directory. It should be noted that as of November 30th 2020, wasi-libc now emulates getcwd.
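As a quick sanity check that the WAD actually resolves at a Roku path, something like the following BrightScript can be run on the device (purely illustrative):

' Purely illustrative: confirm the shareware WAD is reachable at a pkg:/ path.
fs = CreateObject("roFileSystem")
wadPath = "pkg:/source/doom1.wad"
If fs.Exists(wadPath) Then
    print "Found WAD at "; wadPath
Else
    print "WAD missing - make sure doom1.wad was packaged under source/"
End If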

The Clock

Our second goal is to put a print statement in the main loop and show that frames are indeed being processed. The SDL port calls SDL_GetTicks every frame to determine how much time has passed. In DOOM there are 35 ticks per second, defined by TICRATE.

Luckily, wasi-libc has an implementation of clock_gettime, which in turn calls our implementation of wasi_snapshot_preview1_clock_time_get. On the Roku, we used the ifDateTime interface to compute seconds and milliseconds. Our minor change in DOOM now looks like:

struct timespec spec;
clock_gettime(CLOCK_MONOTONIC, &spec);
int ms = spec.tv_sec * 1000 + spec.tv_nsec / 1000000;
return (ms * TICRATE) / 1000;
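On the BrightScript side, the millisecond clock can be derived from roDateTime. Here is a minimal sketch, assuming we only need millisecond precision for computing deltas; the actual host function in wasm2brs has a different name and packs nanoseconds for the WASI call:

' Rough sketch: milliseconds since the Unix epoch using roDateTime.
Function RokuTimeMilliseconds() as LongInteger
    now = CreateObject("roDateTime")
    Return now.AsSeconds() * 1000& + now.GetMilliseconds()
End Function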

Even more exciting, at this point I added a print statement, and sure enough, when I ran it on the Roku I saw it being called every frame!

Pixel Buffer 1: Drawing Rectangles as Pixels

When I first started playing with graphics on Roku, I was surprised to find that BrightScript has no exposed functions to upload an array of pixels to a texture. The primary drawing interface is ifDraw2D which is akin to the HTML canvas element. They even have GetByteArray(…), but sadly no SetByteArray.

By default, with no scaling, DOOM runs at a resolution of 320×200, which is 64,000 pixels. I’m a fan of going for the slowest brute-force approach first, but unfortunately this was the only approach I could think of: since BrightScript gives us no way to set individual pixels, we could draw a single 1×1 rectangle for each pixel using DrawRect(…).
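A minimal sketch of that brute-force loop, assuming screenBytes holds DOOM’s 8-bit palette indices, paletteRgba is a 256-entry array of &HRRGGBBAA colors, and screen implements ifDraw2D (all names hypothetical):

For y = 0 To 199
    rowStart = y * 320
    For x = 0 To 319
        ' One DrawRect call per pixel: look up the palette color, draw a 1x1 rect.
        screen.DrawRect(x, y, 1, 1, paletteRgba[screenBytes[rowStart + x]])
    End For
End For
screen.SwapBuffers()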

This approach was slower than I could have imagined. I achieved about 3–4 frames per second on my TCL 7137X with its dual-core 1.2 GHz ARM processor, which doesn’t leave any time for the game logic and rendering. We’re looping over every pixel, loading memory to get its color, and calling DrawRect.

I also found an internal limitation where it seems Roku uses an unsigned 16-bit integer for the number of draw calls: beyond 65,535 calls to DrawRect it would discard all the original calls. Keep in mind 320×200 is 64,000, so you can’t get much bigger than that.

Overall this approach was a bust. I needed to put my thinking cap on to come up with another way to make a pixel buffer.

Pixel Buffer 2: Reading BMP Files From The Disk

That’s right, you read that correctly. If the first approach above seemed at least reasonable, this one usually makes people’s jaws drop.

BrightScript again has no interface to upload binary image data, but it does have roBitmap, whose constructor can take a file path to an image. Moreover, ifByteArray has methods to write a portion of the array to disk, namely the second overloads of WriteFile and AppendFile. The fact that they take an offset and length is a huge win, because it means we can implement a zero-copy pixel buffer… at least in the sense that we don’t have to loop over the pixels ourselves in BrightScript, which would be extremely slow.

As I dug into writing files on Roku, I found this documentation, which tells us that the tmp:/ storage device is most likely backed by memory, since its contents are discarded when the app exits. That means reads and writes to this device will be very quick!

And this final detail takes the cake. I was trying to figure out what image format I wanted to write… uncompressed PNG maybe? Ideally it had to be a format that required no transformation of the pixel data. In other words, DOOM’s internal screen should be directly writable to the pixel data section of the image on disk, with no modification or looping over pixels.

DOOM uses 8 bits per pixel (a single byte) and a 256-color palette. That’s when I remembered that Microsoft’s BMP format supports 8-bit color. To my excitement, DOOM’s screen buffer could be written directly into a section of the BMP, and so could the color palette!

For a test, I created a 320×200 bitmap with 256 colors and timed how many I could load in one second. Obviously it’s possible Roku has some caching behavior, but I was able to get about 125 fps, with each load taking 7 to 8 milliseconds.
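The test itself was just a loop creating an roBitmap from the same file over and over, roughly like this (illustrative only; the file name is hypothetical):

ts = CreateObject("roTimespan")
ts.Mark()
count = 0
While ts.TotalMilliseconds() < 1000
    ' Create an roBitmap from a pre-written 320x200 8-bit BMP each iteration.
    bitmap = CreateObject("roBitmap", "tmp:/test.bmp")
    count = count + 1
End While
print "Loaded "; count; " bitmaps in one second"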

That was a great static test, but the dynamic case was bound to be slower since we’re writing a file each time. As an optimization, I would build the bitmap headers once and store them in their own roByteArray. Roughly, the 256-color BMP format looks like:

54 bytes of headers (14-byte file header + 40-byte info header)
256 × 4 bytes of color palette (one four-byte entry per color)
320 × 200 bytes of pixels

Wikipedia has much better examples, though without a palette. For writing the headers, I was able to use a lot of the WebAssembly instructions I already implemented:

' Header
I32Store8(headers, &H00, &H42) 'B
I32Store8(headers, &H01, &H4D) 'M
I32Store(headers, &H02, fileSize) ' FileSize
I32Store(headers, &H06, 0) ' Reserved
I32Store(headers, &H0A, dataOffset) ' DataOffset

'InfoHeader
I32Store(headers, &H0E, 40) ' InfoHeader Size
I32Store(headers, &H12, width) ' Width
I32Store(headers, &H16, height) ' Height
I32Store16(headers, &H1A, 1) ' Planes
I32Store16(headers, &H1C, bitsPerPixel) ' BitsPerPixel
I32Store(headers, &H1E, 0) ' Compression (BI_RGB No compression)
I32Store(headers, &H22, 0) ' ImageSize (0 is valid for no compression)
I32Store(headers, &H26, 2835) ' XpixelsPerM
I32Store(headers, &H2A, 2835) ' YpixelsPerM
I32Store(headers, &H2E, usedColors) ' UsedColors
I32Store(headers, &H32, 0) ' ImportantColors

And finally, for both writing the data and immediately reading it back into an roBitmap:

path = "tmp:/s"
m.surfaceHeaders.WriteFile(path)
If m.surfaceColorTable <> Invalid Then m.surfaceColorTable.AppendFile(path)
m.wasi_memory.AppendFile(path, pixelDataOffset, m.surfacePixelDataSize)
bitmap = CreateObject("roBitmap", path)
m.screen.DrawScaledObject(0, m.surfaceHeight, 1, -1, bitmap)
m.screen.SwapBuffers()

In this case m.screen is an instance of roScreen, which implements ifDraw2D, so effectively it’s a full-screen canvas. The m.surfaceColorTable is the color palette, and m.wasi_memory is actually the entire memory of DOOM, of which we’re writing only a tiny portion: the pixel data (screens[0] if you’re familiar with the DOOM source).

You might notice the call to DrawScaledObject; that’s because, as I found out, BMP files actually store their rows in reverse order, where the first row is the bottom of the image. In DOOM, the first row is the top of the image, hence the flip. I found out later that the BMP format actually has a mechanism to support both orders: storing a negative number for the height.

Making these functions readily callable from C was straightforward in wasm2brs. By default we compile with -Wl,--allow-undefined, and any undefined function is assumed to be defined on the Roku/BrightScript side. It’s important to remember that in WebAssembly all pointers are integers, so the following functions (taken from our roku.h):

void roku_create_surface(int bitsPerPixel, int w, int h);
void roku_set_surface_colors(void* colorTable);
void roku_draw_surface(void* pixelData);

are defined in BrightScript as follows (taken from our roku.brs):

Function roku_create_surface(bitsPerPixel As Integer, w As Integer, h As Integer) as Void
Function roku_set_surface_colors(colorTableOffset as Integer) as Void
Function roku_draw_surface(pixelDataOffset As Integer) as Void

To my amazement, when we called the above functions from DOOM’s i_video.c, it actually worked! After DOOM got through loading the WAD file, the intro screen popped up and it blew my mind.

Input

Every iteration of the DOOM loop pumps all available messages from the operating system and stores them in a queue. Lucky for us, that translates pretty well to the non-blocking BrightScript roPort.GetMessage(). Moreover, by using an roScreen we now receive roUniversalControlEvent messages, which tell us which button on the remote was pressed or released.
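Roughly, the BrightScript side of that pump looks like the following sketch, where m.port is an roMessagePort attached to the roScreen (names are illustrative, and forwarding into DOOM’s event queue is elided):

While True
    msg = m.port.GetMessage() ' non-blocking; returns Invalid when the queue is empty
    If msg = Invalid Then Exit While
    If Type(msg) = "roUniversalControlEvent" Then
        ' GetInt() returns the button code; a release is reported as the code plus 100.
        button = msg.GetInt()
        ' ...forward the press/release into DOOM's event queue here...
    End If
End While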

All that was left was to map the Roku remote buttons to DOOM keys. DOOM has a built-in mapping from keys to game actions, but we still need our own mapping:

switch(key)
{
case ROKU_LEFT:           return KEY_LEFTARROW;
case ROKU_RIGHT:          return KEY_RIGHTARROW;
case ROKU_DOWN:           return KEY_DOWNARROW;
case ROKU_UP:             return KEY_UPARROW;
case ROKU_BACK:           return KEY_ESCAPE;
case ROKU_OK:             return KEY_ENTER;
case ROKU_OPTIONS:        return KEY_TAB;
case ROKU_INSTANTREPLAY:  return KEY_SPACE;
case ROKU_FASTFORWARD:    return KEY_EQUALS;
case ROKU_REWIND:         return KEY_MINUS;
case ROKU_PLAY:           return KEY_RCTRL;
}

Audio… :(

Unfortunately, DOOM’s audio is not something I have solved yet. To ensure the audio has no gaps and doesn’t repeat, it’s important to have a callback that executes at a regular interval to populate the audio channel data. This is why many audio implementations call a user-provided callback on a separate thread, ensuring the main thread does not interfere with the timing.

Roku has the concept of Tasks, which run on their own thread; however, at best they are akin to JavaScript Workers. You can send data back and forth (like messages), but they share no writable state and have no atomics. This means I can’t make a proper implementation of POSIX threads, and alas, there can’t be a dedicated audio thread.

Obviously I could still have an audio Task and modify DOOM’s code to send messages instead of assuming shared memory. However, the point of getting DOOM working was to show that programs can nearly work without heavy modification. In fact, if I had implemented the SDL 1 functions directly, the SDL port of DOOM would work on Roku without any modification (but without audio).

Even if I did have a thread, we’re back to the same problem as the pixel buffer. BrightScript has no interface to upload and play raw audio data, so at best we’d be writing out a .wav file and repeatedly playing it.

At some point I may attempt a non-threaded implementation that runs after input is polled, or make an explicit call that updates the audio. Because of the low frame rates we would need a large audio buffer, which probably means the audio would lag behind.

Profiling

When I finally got DOOM running, it ran at around 2 frames per second, even with the smallest screen size. It was a bittersweet victory: I was happy to see it running, but quite sad to see how poor the performance was. Keep in mind this is after running clang with -O3 and producing a highly optimized WebAssembly module.

The first step to rectifying this situation was to get profiling working. I could guess all day about what would make it faster, and I’d probably guess wrong. Roku has a very nice tool called the BrightScript Profiler Visualization Tool.

The fact that it’s a web-based tool makes the barrier to entry basically nothing. And as a person who appreciates simplicity, I loved that the “Tutorial Mode” adds a little hoverable question mark next to each button, which meant I never had to look anything up. It’s quite a brilliant piece of software!

Inlining

The first major observation I had was that in BrightScript, function calls have considerable overhead. For all instructions that weren’t directly implementable with a single operator, I had generated a function call instead. For example, the WASM instruction i32.shr_s (32-bit integer shift right, signed) is implemented by a call to I32ShrS:

Function I32ShrS(lhs as Integer, rhs as Integer) as Integer
    rhsWrapped = rhs And &H1F
    leftBits = 0
    If lhs < 0 And rhsWrapped <> 0 Then
        leftBits = &HFFFFFFFF << (32& - rhsWrapped)
    End If
    Return (lhs >> rhsWrapped) Or leftBits
End Function

One nice fact about using functions is that I could make use of local variables. When inlining this function, I need to be careful not to stomp on any generated variables. BrightScript also has a limit on the number of variables allowed in a single function (around 253), so introducing temporaries can cause a function to exceed this limit.

The original wasm2c code generates temporary variables for all values popped from the stack. This means that when i32.shr_s is generated, it pops two values from the stack and creates two temporaries for lhs and rhs. We also know that the i32.shr_s instruction consumes those two values and pushes the result back on, so those temporaries can’t be used anywhere else. This means we can actually reuse them for our own modifications!

It would be a pain to write the C++ code that writes out the BrightScript for this instruction. Take for example how we output any binary instruction as a function call:

Write(StackVar(1, result_type), " = ", op, "(", StackVar(1), ", ", StackVar(0), ")", Newline());

This is used to produce lines like: w2b_d0 = F64Div(w2b_d0, w2b_d1).

So for the I32ShrS case above, rather than writing out a bunch of calls to Write, I created a simple replacement mini-language:

WriteExprReplacement(expr.opcode, 2, 0,
  "$in0 = $in0 And &H1F\n"
  "If $in1 < 0 And $in0 <> 0 Then\n"
  "    $out1 = ($in1 >> $in0) Or (&HFFFFFFFF << (32 - $in0))\n"
  "Else\n"
  "    $out1 = $in1 >> $in0\n"
  "End If\n");

In this case, $in0 means the top of the WASM stack (the rhs in I32ShrS) and $in1 is second from the top, or lhs. Because of the temporary lifetimes mentioned above, we know we can modify $in0 in place. $out1 is the result that we push onto the stack. We use $out1 instead of $out0 because the instruction pops 2 values and pushes 1, which means the result is not placed at the top of the stack but rather 1 down from the top.
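For example, if the shift amount had been popped into w2b_i0 and the value into w2b_i1 (illustrative names; the actual generated temporaries depend on the surrounding code), the replacement expands to roughly:

' $in0 -> w2b_i0 (rhs), $in1 -> w2b_i1 (lhs), $out1 -> w2b_i1 (result slot)
w2b_i0 = w2b_i0 And &H1F
If w2b_i1 < 0 And w2b_i0 <> 0 Then
    w2b_i1 = (w2b_i1 >> w2b_i0) Or (&HFFFFFFFF << (32 - w2b_i0))
Else
    w2b_i1 = w2b_i1 >> w2b_i0
End If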

And to give another well-deserved shout-out, testing that the inlining behavior is correct would have been impossible without the WebAssembly/testsuite. By making a change and running all the tests again, I had a high degree of confidence that the inlining worked appropriately. For example, in my replacement language above I accidentally swapped $in0 and $in1, which was caught by the tests, as was using $out0 instead of $out1!

Aligned Loads

The biggest wins came from inlining the I32Load and I32Store instructions, as well as the single-byte I32Load8 and I32Store8, which comes as no surprise.

However, with I32Load there was still more speed to be gleaned. If you remember from the article where we got WebAssembly working, because we can only read the byte array one byte at a time, a load looks like:

value = (mem[n]) + (mem[n + 1] << 8) + (mem[n + 2] << 16) + (mem[n + 3] << 24)

Though we can’t see the bytecode, we can assume that’s a lot of operations: at least one for each index [] operator (which may be implemented as a function call), one for each addition, and one for each shift. This is a scenario where using the built-in GetSignedLong(n) might save us some cycles. Time to put that to the test:

iterations = 1000000
mem = CreateObject("roByteArray")
mem[1000] = 0 ' Resize the array
n = 400
ts = CreateObject("roTimespan")

ts.Mark()
For i = 1 To iterations
    value = (mem[n]) + (mem[n + 1] << 8) + (mem[n + 2] << 16) + (mem[n + 3] << 24)
End For
print ts.TotalMilliseconds()

ts.Mark()
For i = 1 To iterations
    value = mem.GetSignedLong(n >> 2)
End For
print ts.TotalMilliseconds()

We got 2979 milliseconds for the byte by byte approach and 902 milliseconds for the GetSignedLong(n) approach, a clear winner!

In the previous article I mentioned the undocumented fact that the n parameter is multiplied by 4 internally, which means you can only perform 4-byte-aligned loads (0, 4, 8, etc.). This means you have to divide n by 4. However, in my tests I found that n >> 2 was actually faster than integer division. Usually when I see people using bit-shifts to perform division by a power-of-two constant, I scoff: the compiler will do that optimization for you, and you’re just making your code more confusing for a beginner to read. In this case, however, BrightScript does not optimize n / 4 into n >> 2.

For a little while I was excited because WebAssembly actually has an alignment member on each load instruction. To me, it made sense that, given several compiler passes, it could compute that a specific instruction would only ever perform 4-byte-aligned loads. I was wrong:

The alignment hint is like a promise to the virtual machine executing our code that “the effective address is going to be aligned at N bits”, information which the VM will use to optimize our code. When we promise a certain alignment but fail to keep that promise by providing an address that is misaligned, the operation will likely be much slower than if we had provided an aligned effective address.

https://rsms.me/wasm-intro#addressing-memory

The words I had been missing for a long time were “fail to keep that promise”, which is legal. The alignment parameter of the load instructions is merely a hint. It would be awesome if some day we could get a guarantee from WebAssembly that an index given to i32.load would always be a multiple of 4, but that is not the case today.

It should be mentioned that the WebAssembly/testsuite came to the rescue again. In address.wast, the test 32_good5 performs a load declared as 4-byte aligned at an address of 0 + 25, which results in an unaligned load and broke my assumption.

Ultimately this means we need an If check before every i32.load to see whether the address is aligned. I was worried that the overhead of an If / Else would negate our win, but in practice almost every load in DOOM was an aligned load (and I suspect that’s true of most code). So our final implementation looks like:

If n And 3 Then
    ' Misaligned address: compose the value from individual bytes.
    value = (mem[n]) + (mem[n + 1] << 8) + (mem[n + 2] << 16) + (mem[n + 3] << 24)
Else
    ' Aligned address: one built-in 32-bit read (GetSignedLong indexes in 4-byte units).
    value = mem.GetSignedLong(n >> 2)
End If

Again, using bitwise math for n And 3 was faster than n Mod 4.

Caching the Global Memory Object

After my first pass of generating instructions in wasm2brs, every load / store looked like this:

w2b_j0 = I64Load(m.w2b_memory, w2b_i0)

Rather than I64Load internally assuming the name of the memory, we passed it in because multiple WASM modules can have their own memory objects.

For example, you can translate using wasm2brs --name-prefix=abc and get the following:

w2b_j0 = I64Load(m.abc_memory, w2b_i0)

If a function generates many loads or stores, it is constantly accessing the m.abc_memory object. As I later found out, this works like JavaScript or Lua: member access is a string lookup in a hash table.

A very simple optimization was to cache the memory object in a local variable, which we called mem, at the beginning of each function:

Function w2b_am_drawwalls_476578975() As Void
  mem = m.w2b_memory
  ...
  w2b_j1 = I64Load(mem, w2b_i1)
End Function

Unfortunately, with the implementation as it stands, we sometimes cache this mem variable even when there are no loads or stores in the function; however, the speed gain in DOOM outweighed the extra temporary. In the future we should do a pass where we walk all instructions beforehand and determine whether any operation needs mem.

Memcpy and Memset

After running the profiler several times and knocking out each of the most-hit instructions, I saw that memcpy was one of the slower function calls. It occurs often during DOOM’s screen-wipe animation, where the screen melts down, and in the rendering of the levels. As a convenience, I had already written a simple version of memcpy in BrightScript that I used to copy data sections during initialization:

Function MemCpy(memory as Object, dst as Integer, src as Integer, size as Integer) as Integer
    For i = 0 To size - 1
        memory[dst + i] = memory[src + i]
    End For
    Return dst
End Function

It turns out, after testing, that my simple version was 2x faster than the one in wasi-libc, which originally came from the libc implementation called musl. For each architecture, musl provides a hand-written assembly version of memcpy; the generic C version is only a fallback, and that’s what WebAssembly gets. The fallback does a lot of tricks, like trying to read and write memory in aligned chunks, which is probably fast for actual JIT WebAssembly implementations. For our BrightScript translation, however, the overhead of reading and composing an integer just to decompose it again is clearly slower than a basic byte-by-byte copy.

Replacing the implementation in wasi-libc meant either forking it or doing something in the translation tool to swap it out. As a bonus, I also found that memset suffered from a similar 2x performance issue, and decided to replace both with my own hand-rolled implementations. Given that the translation tool was already replacing instructions, replacing a function call seemed like a logical step; we just needed to detect it:

static bool IsReplaceableMemFunction(const wabt::Func& func) {
  return
    (func.name == "$memcpy" || func.name == "$memset") &&
    func.GetNumParams() == 3 &&
    func.GetParamType(0) == wabt::Type::I32 &&
    func.GetParamType(1) == wabt::Type::I32 &&
    func.GetParamType(2) == wabt::Type::I32 &&
    func.GetNumResults() == 1 &&
    func.GetResultType(0) == wabt::Type::I32;
}

When a function matches, rather than emitting a call to the mangled function name, we just emit a call to our BrightScript MemCpy. The only exception is that we have to pass the cached mem object into MemCpy, which meant adding an extra parameter at the beginning. The generated code now looks like:

w2b_i0 = MemCpy(mem, w2b_i0, w2b_i1, w2b_i2)
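The BrightScript MemSet that stands in for memset is essentially the same loop, just writing a constant byte; a sketch (the actual helper in wasm2brs may differ slightly):

' Sketch of a byte-by-byte memset in BrightScript.
Function MemSet(memory as Object, dst as Integer, value as Integer, size as Integer) as Integer
    ' Fill size bytes starting at dst with value.
    For i = 0 To size - 1
        memory[dst + i] = value
    End For
    Return dst
End Function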

One last interesting observation: when clang detects a series of copies, or stores of the same value, it can actually turn them into calls to memcpy or memset. This is especially true when compiling with -Oz and trying to reduce code size. To prevent this behavior, you can use -fno-builtin.

Future Work

One place where I see obvious inefficiency is in how we create temporaries for every value we pop off the WebAssembly stack. For example, in the following code w2b_i3 is a temporary used to store a constant, which is then immediately used again on the next line:

w2b_i3 = 121388%
w2b_i3 = (mem[w2b_i3])

You can take my word for it that w2b_i3 was never used again in this case. Ideally we would avoid creating new temporaries when we know a value is never read again. Inlining the constant would save extra bytecode instructions:

w2b_i3 = (mem[121388%])

I believe this type of optimization is best done on instructions in Static Single Assignment (SSA) form, after lifetime analysis has been run and all redundant copies eliminated. It makes me wonder whether translating directly from WebAssembly was the best idea, or whether we could get more gains using something like Binaryen’s IR (or even LLVM).

Another example that could probably be more easily eliminated is the following:

w2b_i3 = w2b_p1
w2b_i2 *= w2b_i3

Here w2b_p1 is a parameter that gets copied into a temporary w2b_i3 just to be used again immediately. Considering that BrightScript parameters are also writable, I believe I could come up with a mechanism to just use the parameter directly in these cases.

Conclusion

Though I had to make some concessions, such as running DOOM at the lowest possible resolution just to get interactive frame rates, I am proud to say it’s a job well done. Using our new tool wasm2brs, we can finally bring working C/C++ and even Rust programs to the Roku.