Introduction
Two CPU architectures currently dominate the embedded market, and both are almost certain to dominate the IoT technologies deployed with 5G over the coming 10-15 years.
Those architectures are the Advanced RISC Machine (ARM) and the Argonaut RISC Core (ARC).
ARC is an architecture we'll hear more about as 5G is deployed, and I hope to discuss it in a future post, but this post is about ARM assembly.
This isn't a tutorial on ARM assembly, but I thought sharing the code, in addition to explaining how the operations were implemented, would be helpful to others learning ARM.
This post is intended for beginners, but it does assume a basic grasp of C or a similar language. Knowledge of another assembly language would be an advantage, but if you're already familiar with C, the comments should make clear what the assembly is doing.
A number of cryptographic primitives suitable for resource constrained environments were implemented in C. The assembly output generated by the GNU C compiler was then examined to determine the best implementation possible.
Finally, ARM assembly was written by hand in order to reduce space further. If you’re new to the ARM architecture, and want to learn assembly, the information here may be of some use to you.
A good resource for those new to ARM assembly
Generating assembly
Learning any assembly language can be a tedious and challenging task for many of us. A lack of knowledge of another assembly language will also impede progress, but if you persevere, eventually you'll reap the rewards and be reading assembly like you're reading a newspaper. 😉
My approach to learning an assembly language involves obtaining the instruction set or programmer's manuals, a C compiler that can generate assembly for the target architecture, and sometimes existing source code written by other people, which can be invaluable for learning new tricks.
Implement your program in C, have the compiler generate the assembly, then study the output and learn what it does.
- Optimization Flags
To generate assembly from a C file with GCC, use the -S switch. You may also want to generate more compact code with the -O2 or -Os optimization flags.
The Microsoft compiler for ARM provides similar switches, including /FAs, which generates an assembly file with the source code included.
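If you want something small to experiment with (the file and function names here are just placeholders, not from the original sources), a trivial routine like this is enough to see what each flag does to the output:

```c
#include <stdint.h>

// toy example: XOR a 128-bit block with a 128-bit key,
// the kind of loop that appears in the ciphers below
void xor_block(uint32_t *x, const uint32_t *k) {
    for (int i = 0; i < 4; i++) x[i] ^= k[i];
}
```

Assuming the file is saved as xor.c, `gcc -S -Os xor.c` writes the assembly to xor.s; comparing that against the -O0 and -O2 output shows how the optimizer unrolls or tightens the loop.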
Personally, I'd never try to optimize assembly for speed, since compilers are usually already good at this, but there are definitely situations where the assembly generated by the compiler is inefficient.
- Debugging symbols
If we want to include symbols from the original C code, to help us understand what each assembly operation does, we can use the -g switch, which includes debugging symbols.
To then produce a disassembly with those symbols included, use objdump -d -S source.o, which tells objdump to disassemble and interleave the source code.
ARM Directives
The GNU Assembler provides a number of directives which can be useful and save you from pulling your hair out and crying over your keyboard.
A more comprehensive list can be viewed here; I only include those I think are useful for avoiding buggy code.
- CPU
.arm indicates 32-bit code, and .thumb indicates 16-bit code.
The two are essentially the same as using .code 32 or .code 16
All the code used here is 32-bit.
- Architecture
It's completely optional, but the .arch directive tells the assembler which instruction set our program should use.
ARMv6 should be fine, since Raspberry Pis run on this architecture.
Using .arch armv6 should prevent the use of instructions that aren't part of ARMv6.
If you accidentally attempt to use an instruction unavailable on the target architecture, the assembler will display an error. Better that than assembling the code into a binary, executing it, and watching it segfault.
- Alignment
The .align directive ensures your code is properly aligned.
Of course, depending on what you’re writing, you may not want your code padded with null bytes.
If that's the case, you might want to align manually by placing a simple NOP instruction at the end of the file.
- Exporting Functions
The .global directive will allow a function to be accessed from another source file.
- Symbolic names
It could be considered bad practice in the eyes of some hardcore assembly programmers, but redefining registers just makes my life much easier. I sleep reasonably well at night if I used symbolic names during the day to complete a bit of code.
As some of you know, we define symbolic constants in C using the #define preprocessor directive, which allows us to substitute one value for another.
Those of you familiar with x86 or AMD64 assembly will likely have encountered the %define directive in the NASM and YASM assemblers, which is essentially the same thing.
The .req directive allows us to assign our own names to registers.
I recommend using symbolic names where possible to reduce the potential for error…you can thank me later.
ARM instructions
Almost every ARM instruction supports suffixes that can affect, or depend on, the flags register.
SUB, for subtraction, can be written as SUBS, which sets the flags based on the result of the subtraction.
A condition suffix like NE (Not Equal) or EQ (Equal) can also be added, which determines whether the instruction is executed at all.
Multiple loads with LDM (Load Multiple) and multiple stores with STM (Store Multiple) are not commonly generated by compilers, but they're used throughout the examples shown here.
Block Ciphers
Lightweight block ciphers are popular among developers of embedded systems. They can be used with Counter (CTR) Mode to implement a stream cipher.
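To illustrate the CTR construction mentioned above (this sketch is not from the original sources; any of the block ciphers below, e.g. xtea_encrypt, could replace the trivial stand-in permutation used here, which exists only to keep the example self-contained):

```c
#include <stdint.h>
#include <string.h>

// hypothetical 64-bit block encryption stand-in; NOT a real
// cipher, used only so the CTR loop below can run on its own
static void block_encrypt(const void *key, void *blk) {
    uint32_t *x = (uint32_t*)blk;
    const uint32_t *k = (const uint32_t*)key;
    x[0] += k[0] ^ 0x9E3779B9;
    x[1] ^= x[0] + k[1];
}

// encrypts or decrypts len bytes in place: CTR mode turns a
// block cipher into a stream cipher, so the same routine
// performs both operations
void ctr_crypt(const void *key, uint64_t ctr, void *data, int len) {
    uint8_t ks[8], *p = (uint8_t*)data;
    int i;

    while (len > 0) {
        memcpy(ks, &ctr, 8);        // counter is the block input
        block_encrypt(key, ks);     // keystream = E(key, counter)
        for (i = 0; i < 8 && len > 0; i++, len--)
            *p++ ^= ks[i];          // xor keystream with data
        ctr++;                      // next counter value
    }
}
```

Running ctr_crypt a second time with the same key and starting counter recovers the original data, which is exactly why CTR is convenient on constrained devices: one primitive, one direction.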
What I illustrate here should not be interpreted as an endorsement. I do not claim these ciphers are without weaknesses, or that you should be using them in your project.
- XTEA
TEA Extensions (XTEA) is a 64-bit block cipher with support for 128-bit keys. It was published in 1998 in response to weaknesses found in the Tiny Encryption Algorithm (TEA).
Compared to its predecessor, XTEA has a more complex key schedule and a rearrangement of the shifts, XORs, and additions.
// uses 32 rounds by default
void xtea_encrypt(void *key, void *data) {
    int      i;
    uint32_t x0, x1, t, sum=0;
    uint32_t *k=(uint32_t*)key;
    uint32_t *x=(uint32_t*)data;

    // load 64-bit plain text
    x0 = x[0]; x1 = x[1];

    for (i=64; i>0; i--) {
      t = sum;
      // add constant every 2nd round
      if (i & 1) {
        sum += 0x9E3779B9;
        t = sum >> 11;
      }
      x0 += ((((x1 << 4) ^ (x1 >> 5)) + x1) ^ (sum + k[t & 3]));
      XCHG(x0, x1);
    }
    // save 64-bit cipher text
    x[0] = x0; x[1] = x1;
}
The ARM code makes use of conditional execution in order to avoid branching. In x86 code, the if statement requires a conditional jump, but thanks to ARM's conditional suffixes, the addition and logical shift right are only executed every 2nd round.
// key
k    .req r0
x    .req r1
// data
x0   .req r2
x1   .req r3
sum  .req r4
t0   .req r5
t1   .req r6
c    .req r7
i    .req r8

// xtea_encryptx(void *key, void *data);
xtea_encryptx:
    // save registers
    push   {r0-r12, lr}

    // 32 rounds by default
    mov    i, #64              // number of rounds
    // load 64-bit plain text
    ldm    x, {x0, x1}         // x0 = x[0], x1 = x[1];
    mov    sum, #0             // sum = 0;
    ldr    c, =#0x9E3779B9     // c = 0x9E3779B9;
xtea_loop:
    mov    t0, sum             // t0 = sum;
    tst    i, #1               // if (i & 1)

    // the next 2 only execute if (i % 2) is not zero
    addne  sum, c              // sum += 0x9E3779B9;
    lsrne  t0, sum, #11        // t0 = sum >> 11

    and    t0, #3              // t0 %= 4
    ldr    t0, [k, t0, lsl #2] // t0 = k[t0];
    add    t1, sum, t0         // t1 = sum + t0
    mov    t0, x1, lsl #4      // t0 = (x1 << 4)
    eor    t0, x1, lsr #5      // t0 ^= (x1 >> 5)
    add    t0, x1              // t0 += x1
    eor    t0, t1              // t0 ^= t1
    mov    t1, x1              // backup x1
    add    x1, t0, x0          // x1 = t0 + x0
    // XCHG(x0, x1)
    mov    x0, t1              // x0 = x1
    subs   i, i, #1            // i--
    bne    xtea_loop           // i>0

    // save 64-bit cipher text
    stm    x, {x0, x1}
    // restore registers
    pop    {r0-r12, pc}
- CHASKEY
The Chaskey cipher is a 128-bit block, 128-bit key symmetric encryption algorithm which is the underlying function used for the Chaskey Message Authentication Code (MAC).
It’s based on an Even-Mansour construction which makes it very simple to implement. The permutation function was derived from SipHash, and uses only ARX instructions making it highly suitable for 32-bit architectures.
void chas_encryptx(void *key, void *data) {
    int i;
    uint32_t *x=(uint32_t*)data;
    uint32_t *k=(uint32_t*)key;

    // mix key
    for (i=0; i<4; i++) x[i] ^= k[i];

    // apply permutation
    for (i=16; i>0; i--) {
      x[0] += x[1]; x[1]=ROTR32(x[1],27) ^ x[0];
      x[2] += x[3]; x[3]=ROTR32(x[3],24) ^ x[2];
      x[2] += x[1]; x[0]=ROTR32(x[0],16) + x[3];
      x[3]=ROTR32(x[3],19) ^ x[0];
      x[1]=ROTR32(x[1],25) ^ x[2];
      x[2]=ROTR32(x[2],16);
    }
    // mix key
    for (i=0; i<4; i++) x[i] ^= k[i];
}
There's not much to say about the implementation.
k    .req r0
x    .req r1
k0   .req r2
k1   .req r3
k2   .req r4
k3   .req r5
x0   .req r6
x1   .req r7
x2   .req r8
x3   .req r9
i    .req r10

// chas_encryptx(void *key, void *data);
chas_encryptx:
    // save registers
    push   {r0-r12,lr}

    // load 128-bit key
    ldm    k, {k0, k1, k2, k3}
    // load 128-bit plain text
    ldm    x, {x0, x1, x2, x3}

    // xor plaintext with key
    eor    x0, k0              // x[0] ^= k[0];
    eor    x1, k1              // x[1] ^= k[1];
    eor    x2, k2              // x[2] ^= k[2];
    eor    x3, k3              // x[3] ^= k[3];

    mov    i, #16              // i = 16
chaskey_loop:
    add    x0, x1              // x[0] += x[1];
    eor    x1, x0, x1, ror #27 // x[1]=ROTR32(x[1],27) ^ x[0];
    add    x2, x3              // x[2] += x[3];
    eor    x3, x2, x3, ror #24 // x[3]=ROTR32(x[3],24) ^ x[2];
    add    x2, x1              // x[2] += x[1];
    add    x0, x3, x0, ror #16 // x[0]=ROTR32(x[0],16) + x[3];
    eor    x3, x0, x3, ror #19 // x[3]=ROTR32(x[3],19) ^ x[0];
    eor    x1, x2, x1, ror #25 // x[1]=ROTR32(x[1],25) ^ x[2];
    ror    x2, #16             // x[2]=ROTR32(x[2],16);
    subs   i, i, #1            // i--
    bne    chaskey_loop        // i>0

    // xor cipher text with key
    eor    x0, k0              // x[0] ^= k[0];
    eor    x1, k1              // x[1] ^= k[1];
    eor    x2, k2              // x[2] ^= k[2];
    eor    x3, k3              // x[3] ^= k[3];

    // save 128-bit cipher text
    stm    x, {x0, x1, x2, x3}
    // restore registers, and return
    pop    {r0-r12,pc}
- CHAM
CHAM: A Family of Lightweight Block Ciphers for Resource-Constrained Devices was published in December 2017 at the 20th Annual International Conference on Information Security and Cryptology held in South Korea.
// initialize sub keys
void cham128_setkey(void *in, void *out) {
    int i;
    uint32_t *k=(uint32_t*)in;
    uint32_t *rk=(uint32_t*)out;

    for (i=0; i<KW; i++) {
      rk[i] = k[i] ^ ROTL32(k[i], 1) ^ ROTL32(k[i], 8);
      rk[(i + KW) ^ 1] = k[i] ^ ROTL32(k[i], 1) ^ ROTL32(k[i], 11);
    }
}

// encrypt 128-bits
void cham128_encrypt(void *keys, void *data) {
    int i;
    uint32_t x0, x1, x2, x3;
    uint32_t t;
    uint32_t *rk=(uint32_t*)keys;
    uint32_t *x=(uint32_t*)data;

    x0 = x[0]; x1 = x[1];
    x2 = x[2]; x3 = x[3];

    for (i=0; i<R; i++) {
      if (i & 1) {
        x0 = ROTL32((x0 ^ i) + (ROTL32(x1, 8) ^ rk[i & 7]), 1);
      } else {
        x0 = ROTL32((x0 ^ i) + (ROTL32(x1, 1) ^ rk[i & 7]), 8);
      }
      XCHG(x0, x1);
      XCHG(x1, x2);
      XCHG(x2, x3);
    }
    x[0] = x0; x[1] = x1;
    x[2] = x2; x[3] = x3;
}
The code makes use of conditional execution to avoid branching. The TST instruction performs the (i & 1) test, and the EOR and ROR instructions that follow are executed or skipped based on the result.
k    .req r0
x    .req r1
// data
x0   .req r0
x1   .req r2
x2   .req r3
x3   .req r4
// round keys
rk   .req sp
k0   .req r6
k1   .req r7
k2   .req r8
i    .req r10

cham128_encryptx:
    // save registers
    push   {r0-r12,lr}
    // allocate memory for round keys
    sub    sp, #32

    // derive round keys from 128-bit key
    mov    i, #0               // i = 0
cham_init:
    ldr    k0, [k, i, lsl #2]  // k0 = k[i];
    ror    k1, k0, #31         // k1 = ROTR32(k0, 31);
    ror    k2, k0, #24         // k2 = ROTR32(k0, 24);
    eor    k0, k1              // k0 ^= k1;
    eor    k1, k0, k2          // rk[i] = k0 ^ k2;
    str    k1, [rk, i, lsl #2]
    eor    k0, k2, ror #29     // k0 ^= ROTR32(k2, 29);
    add    k1, i, #4           // k1 = (i+KW)
    eor    k1, #1              // k1 = (i+KW) ^ 1
    str    k0, [rk, k1, lsl #2] // rk[(i+KW)^1] = k0;
    add    i, #1               // i++
    cmp    i, #4               // i<KW
    bne    cham_init

    // load 128-bit plain text
    ldm    x, {x0, x1, x2, x3}

    // perform encryption
    mov    i, #0               // i = 0
cham_enc:
    mov    k0, x3
    eor    x0, i               // x0 ^= i
    tst    i, #1               // if (i & 1)

    // x3 = rk[i & 7];
    and    k1, i, #7           // k1 = i & 7;
    ldr    x3, [rk, k1, lsl #2] // x3 = rk[i & 7];

    // execution depends on (i % 2)
    // x3 ^= (i & 1) ? ROTR32(x1, 24) : ROTR32(x1, 31);
    eorne  x3, x1, ror #24
    eoreq  x3, x1, ror #31

    add    x3, x0              // x3 += x0;

    // x3 = (i & 1) ? ROTR32(x3, 31) : ROTR32(x3, 24);
    rorne  x3, #31             // x3 = ROTR32(x3, 31);
    roreq  x3, #24             // x3 = ROTR32(x3, 24);

    // swap
    mov    x0, x1              // x0 = x1;
    mov    x1, x2              // x1 = x2;
    mov    x2, k0              // x2 = k0;
    add    i, #1               // i++;
    cmp    i, #80              // i<R
    bne    cham_enc

    // save 128-bit cipher text
    stm    x, {x0, x1, x2, x3} // x[0] = x0; x[1] = x1;
                               // x[2] = x2; x[3] = x3;
    // release memory for round keys
    add    sp, #32
    // restore registers
    pop    {r0-r12, pc}
- LEA
LEA is a 128-bit block cipher with support for 128, 192 and 256-bit keys published in 2014.
The only operations used for encryption and the key schedule are 32-bit Addition, eXclusive OR and Rotation (ARX structure): the designers state that "the usage of 32-bit and 64-bit processors will grow rapidly compared to 8-bit and 16-bit ones".
void lea128_encryptx(void *key, void *data) {
    uint32_t k0, k1, k2, k3;
    uint32_t x0, x1, x2, x3;
    uint32_t c0, c1, c2, c3;
    uint32_t *x, *k, r, t0;

    x = (uint32_t*)data;
    k = (uint32_t*)key;

    // initialize 32-bit constants
    c0 = 0xc3efe9db; c1 = 0x88c4d604;
    c2 = 0xe789f229; c3 = 0xc6f98763;

    // load 128-bit key
    k0 = k[0]; k1 = k[1];
    k2 = k[2]; k3 = k[3];

    // load 128-bit plain text
    x0 = x[0]; x1 = x[1];
    x2 = x[2]; x3 = x[3];

    // perform encryption
    for (r=24; r>0; r--) {
      // create subkey
      k0 = ROTR32(k0 + c0, 31);
      k1 = ROTR32(k1 + ROTR32(c0, 31), 29);
      k2 = ROTR32(k2 + ROTR32(c0, 30), 26);
      k3 = ROTR32(k3 + ROTR32(c0, 29), 21);

      // encrypt block
      t0 = x0;
      x0 = ROTR32((x0 ^ k0) + (x1 ^ k1), 23);
      x1 = ROTR32((x1 ^ k2) + (x2 ^ k1),  5);
      x2 = ROTR32((x2 ^ k3) + (x3 ^ k1),  3);
      x3 = t0;

      // update constants
      t0 = c0; c0 = c1; c1 = c2; c2 = c3;
      c3 = ROTR32(t0, 28);
    }
    // save 128-bit cipher text
    x[0] = x0; x[1] = x1;
    x[2] = x2; x[3] = x3;
}
Although we could use the stack for the constants, 4 registers were used instead. Because almost all the general purpose registers are in use, this only just fits on ARM in 32-bit mode.
k    .req r0
x    .req r1
// key
k0   .req r0
k1   .req r2
k2   .req r3
k3   .req r4
// data
x0   .req r5
x1   .req r6
x2   .req r7
x3   .req r8
// constants
g0   .req r9
g1   .req r10
g2   .req r11
g3   .req r12
// counter
r    .req r1

lea128_encryptx:
    // save registers
    push   {r0-r12, lr}

    ldr    g0, =#0xc3efe9db    // g0 = 0xc3efe9db;
    ldr    g1, =#0x88c4d604    // g1 = 0x88c4d604;
    ldr    g2, =#0xe789f229    // g2 = 0xe789f229;
    ldr    g3, =#0xc6f98763    // g3 = 0xc6f98763;

    // load 128-bit key
    ldm    k, {k0, k1, k2, k3}

    // load 128-bit plain text
    push   {x}
    ldm    x, {x0, x1, x2, x3}

    // perform encryption
    mov    r, #24              // r = 24
lea_main:
    push   {r}

    // create subkey
    // k0 = ROTR32(k0 + g0, 31);
    add    k0, g0
    ror    k0, #31

    // k1 = ROTR32(k1 + ROTR32(g0, 31), 29);
    add    k1, g0, ror #31
    ror    k1, #29

    // k2 = ROTR32(k2 + ROTR32(g0, 30), 26);
    add    k2, g0, ror #30
    ror    k2, #26

    // k3 = ROTR32(k3 + ROTR32(g0, 29), 21);
    add    k3, g0, ror #29
    ror    k3, #21

    // encrypt block
    // t0 = x0;
    push   {x0}

    // x0 = ROTR32((x0 ^ k0) + (x1 ^ k1), 23);
    eor    r1, x1, k1
    eor    x0, k0
    add    x0, r1
    ror    x0, #23

    // x1 = ROTR32((x1 ^ k2) + (x2 ^ k1), 5);
    eor    r1, x2, k1
    eor    x1, k2
    add    x1, r1
    ror    x1, #5

    // x2 = ROTR32((x2 ^ k3) + (x3 ^ k1), 3);
    eor    r1, x3, k1
    eor    x2, k3
    add    x2, r1
    ror    x2, #3

    // x3 = t0;
    pop    {x3}

    // update constants
    push   {g0}
    mov    g0, g1
    mov    g1, g2
    mov    g2, g3
    // g3 = ROTR32(t0, 28);
    pop    {g3}
    ror    g3, #28

    pop    {r}
    subs   r, #1               // r--
    bne    lea_main            // r>0

    // save 128-bit cipher text
    pop    {x}
    stm    x, {x0, x1, x2, x3}
    // restore registers
    pop    {r0-r12, pc}
- SPECK
SPECK is a family of lightweight block ciphers publicly released by the National Security Agency (NSA) in June 2013. It’s an ARX (add-rotate-xor) design optimized for performance in software implementations, and has been suggested for use on resource constrained devices.
void speck64_encryptx(const void *key, void *data) {
    uint32_t i, t, k0, k1, k2, k3, x0, x1;
    uint32_t *k=(uint32_t*)key;
    uint32_t *x=(uint32_t*)data;

    // copy 128-bit key to local space
    k0 = k[0]; k1 = k[1];
    k2 = k[2]; k3 = k[3];

    // copy input to local space
    x0 = x[0]; x1 = x[1];

    for (i=0; i<27; i++) {
      // encrypt block
      x0 = (ROTR32(x0, 8) + x1) ^ k0;
      x1 =  ROTL32(x1, 3) ^ x0;

      // create next sub key
      t  = k3;
      k3 = (ROTR32(k1, 8) + k0) ^ i;
      k0 =  ROTL32(k0, 3) ^ k3;
      k1 = k2; k2 = t;
    }
    // save result
    x[0] = x0; x[1] = x1;
}
A straightforward implementation with no trickery required.
// key
k0   .req r2
k1   .req r3
k2   .req r4
k3   .req r5
// plaintext
x0   .req r6
x1   .req r7
// parameters
k    .req r0
x    .req r1
i    .req r0
t    .req r8

// speck64_encryptx(void *key, void *data);
speck64_encryptx:
    // save registers
    push   {r0-r12, lr}

    // load 128-bit key
    // k0 = k[0]; k1 = k[1]; k2 = k[2]; k3 = k[3];
    ldm    k, {k0, k1, k2, k3}

    // load 64-bit plain text
    ldm    x, {x0, x1}         // x0 = x[0]; x1 = x[1];

    mov    i, #0               // i=0
speck_loop:
    add    x0, x1, x0, ror #8  // x0 = (ROTR32(x0, 8) + x1) ^ k0;
    eor    x0, k0
    eor    x1, x0, x1, ror #29 // x1 = ROTL32(x1, 3) ^ x0;

    mov    t, k3               // backup k3
    add    k3, k0, k1, ror #8  // k3 = (ROTR32(k1, 8) + k0) ^ i;
    eor    k3, i
    eor    k0, k3, k0, ror #29 // k0 = ROTL32(k0, 3) ^ k3;
    mov    k1, k2              // k1 = k2;
    mov    k2, t               // k2 = t;

    add    i, #1               // i++;
    cmp    i, #27              // i<27;
    bne    speck_loop

    // save result
    stm    x, {x0, x1}         // x[0] = x0; x[1] = x1;
    // restore registers
    pop    {r0-r12, pc}
- NOEKEON
NOEKEON is a 128-bit block cipher submitted to the NESSIE project in September 2000.
void Noekeonx(void *key, void *data) {
    int i, j;
    uint32_t t;
    w256_t rc;
    uint32_t *x, *k;

    x = (uint32_t*)data;
    k = (uint32_t*)key;

    // constants
    rc.w[0] = 0x6c361b80; rc.w[1] = 0x9a4dabd8;
    rc.w[2] = 0x63bc5e2f; rc.w[3] = 0x6a3597c6;
    rc.b[16] = 0xd4;

    for (i=0;;i++) {
      x[0] ^= rc.b[i];

      // Theta
      t = x[0] ^ x[2];
      t ^= ROTR32(t, 8) ^ ROTR32(t, 24);
      x[1] ^= t;
      x[3] ^= t;

      // mix key
      x[0]^= k[0]; x[1]^= k[1];
      x[2]^= k[2]; x[3]^= k[3];

      t = x[1] ^ x[3];
      t ^= ROTR32(t, 8) ^ ROTR32(t, 24);
      x[0]^= t;
      x[2]^= t;

      if (i==Nr) break;

      // Pi1
      x[1] = ROTR32(x[1], 31);
      x[2] = ROTR32(x[2], 27);
      x[3] = ROTR32(x[3], 30);

      // Gamma
      x[1]^= ~((x[3]) | (x[2]));
      x[0]^= x[2] & x[1];
      XCHG(x[0], x[3]);
      x[2]^= x[0] ^ x[1] ^ x[3];
      x[1]^= ~((x[3]) | (x[2]));
      x[0]^= x[2] & x[1];

      // Pi2
      x[1] = ROTR32(x[1], 1);
      x[2] = ROTR32(x[2], 5);
      x[3] = ROTR32(x[3], 2);
    }
}
What's missing here is the code to convert between little- and big-endian byte order.
On x86 there's the BSWAP instruction for this; on ARM, there's REV.
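In C, the conversion can be written portably; this sketch is my own addition, not from the original sources, and compilers typically recognize the pattern and emit REV on ARM:

```c
#include <stdint.h>

// reverse the byte order of a 32-bit word, the software
// equivalent of BSWAP on x86 or REV on ARM
uint32_t swap32(uint32_t x) {
    return ((x & 0x000000FF) << 24) |
           ((x & 0x0000FF00) <<  8) |
           ((x & 0x00FF0000) >>  8) |
           ((x & 0xFF000000) >> 24);
}
```

Since the operation is its own inverse, the same routine converts in both directions.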
k    .req r0
x    .req r1
// key
k0   .req r2
k1   .req r3
k2   .req r4
k3   .req r5
// data
x0   .req r6
x1   .req r7
x2   .req r8
x3   .req r9
// constants
i    .req r10
t0   .req r11
t1   .req r12

Noekeonx:
    // save registers
    push   {r0-r12, lr}

    // load 128-bit key
    ldm    k, {k0, k1, k2, k3}
    // load 128-bit plain text
    ldm    x, {x0, x1, x2, x3}

    mov    i, #0
noekeon_main:
    adr    t0, rc_tab
    ldrb   t0, [t0, i]         // x[0] ^= rc.b[i];
    eor    x0, t0

    // Theta
    // t = x[0] ^ x[2];
    // t ^= ROTR32(t, 8) ^ ROTR32(t, 24);
    eor    t0, x0, x2
    mov    t1, t0
    eor    t0, t1, ror #8
    eor    t0, t1, ror #24

    // x[1] ^= t; x[3] ^= t;
    eor    x1, t0
    eor    x3, t0

    // mix key
    // x[0]^= k[0]; x[1]^= k[1];
    // x[2]^= k[2]; x[3]^= k[3];
    eor    x0, k0
    eor    x1, k1
    eor    x2, k2
    eor    x3, k3

    // t = x[1] ^ x[3];
    // t ^= ROTR32(t, 8) ^ ROTR32(t, 24);
    eor    t0, x1, x3
    mov    t1, t0
    eor    t0, t1, ror #8
    eor    t0, t1, ror #24

    // x[0] ^= t; x[2] ^= t;
    eor    x0, t0
    eor    x2, t0

    cmp    i, #16              // if (i==Nr) break;
    beq    noekeon_end

    // Pi1
    ror    x1, #31             // x[1] = ROTR32(x[1], 31);
    ror    x2, #27             // x[2] = ROTR32(x[2], 27);
    ror    x3, #30             // x[3] = ROTR32(x[3], 30);

    // Gamma
    // x[1]^= ~((x[3]) | (x[2]));
    orr    t0, x3, x2
    mvn    t0, t0
    eor    x1, t0

    mov    t1, x3              // backup x3
    // x[0] ^= x[2] & x[1];
    and    t0, x2, x1
    eor    x3, x0, t0
    // XCHG(x[0], x[3]);
    mov    x0, t1

    // x[2]^= x[0] ^ x[1] ^ x[3];
    eor    x2, x0
    eor    x2, x1
    eor    x2, x3

    // x[1]^= ~((x[3]) | (x[2]));
    orr    t0, x3, x2
    mvn    t0, t0
    eor    x1, t0

    // x[0]^= x[2] & x[1];
    and    t0, x2, x1
    eor    x0, t0

    // Pi2
    ror    x1, #1              // x[1] = ROTR32(x[1], 1);
    ror    x2, #5              // x[2] = ROTR32(x[2], 5);
    ror    x3, #2              // x[3] = ROTR32(x[3], 2);

    add    i, #1               // i++
    b      noekeon_main
noekeon_end:
    // save 128-bit cipher text
    stm    x, {x0, x1, x2, x3}
    // restore registers, and return
    pop    {r0-r12, pc}

rc_tab:
    .byte  0x80
    .byte  0x1B, 0x36, 0x6C, 0xD8
    .byte  0xAB, 0x4D, 0x9A, 0x2F
    .byte  0x5E, 0xBC, 0x63, 0xC6
    .byte  0x97, 0x35, 0x6A, 0xD4
- RC5
RC5 is a symmetric block cipher designed in 1994 by Ronald Rivest. The version shown here uses 128-bit keys, processes 64-bit blocks of data, and uses 12 rounds.
It's been suggested that 20 rounds be used in order to mitigate a known weakness.
void rc5_encryptx(void *key, void *data) {
    uint32_t A, B, T;
    uint32_t L[4], S[RC5_KR];
    int i, j, r;
    uint32_t *kp=(uint32_t*)key;
    uint32_t *x=(uint32_t*)data;

    // initialize L with 128-bit key
    memcpy(&L, key, 16);

    A = 0xC983AE2D;
    // initialize S with constants
    for (i=RC5_KR; i>0;) {
      A -= RC5_Q;
      S[--i] = A;
    }

    A = B = 0;
    r = (RC5_KR*3);

    // create sub keys
    for (i=0, j=0; r>0; i++, j++, r--) {
      // i needs reset?
      if (i==RC5_KR) i=0;
      j &= 3;
      A = S[i] = ROTL32(S[i] + A+B, 3);
      B = L[j] = ROTL32(L[j] + A+B, A+B);
    }

    // assign sub keys to k ptr
    kp = S;

    // load 64-bit plain text
    A = x[0] + *kp++;
    B = x[1] + *kp++;

    // apply encryption
    for (i=RC5_ROUNDS*2; i>0; i--) {
      T = B;
      B = ROTL32(A ^ B, B) + *kp++;
      A = T;
    }
    // save 64-bit ciphertext
    x[0] = A; x[1] = B;
}
When initializing 2 registers to zero, instead of 2 separate instructions, we use UMULL with a register already set to zero, which is similar to using MUL with zero on x86 to initialize EAX and EDX to zero.
Since ARM doesn't have a ROL instruction, RSB is used to calculate 32 - n before using ROR: rotating right by 32 - n is equivalent to rotating left by n.
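The rotation macros used throughout the C sources can be defined as below (a sketch; the post's own headers may define them differently):

```c
#include <stdint.h>

// rotate a 32-bit word right / left by n bits
// (assumes 0 < n < 32 to avoid undefined behaviour)
#define ROTR32(v, n) (((v) >> (n)) | ((v) << (32 - (n))))
#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))
```

Note that ROTL32(v, n) equals ROTR32(v, 32 - n), which is exactly the identity the RSB instruction exploits when the rotate amount is in a register.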
k    .req r0
x    .req r1
x0   .req r2
x1   .req r3
t0   .req r4
t1   .req r5
t2   .req r6
i    .req r7
j    .req r8
r    .req r9
L    .req r10
S    .req r11
kp   .req r11

rc5_encryptx:
    // save registers
    push   {r0-r12, lr}
    // allocate memory
    sub    sp, #128

    // initialize L with 128-bit key
    // memcpy(&L, key, 16);
    mov    L, sp
    ldm    k, {r2-r5}
    stm    L, {r2-r5}

    // initialize S with constants
    ldr    x0, =#0xC983AE2D    // x0 = RC5_P precomputed
    ldr    x1, =#0x9E3779B9    // x1 = RC5_Q
    add    S, sp, #16
    mov    i, #26              // i = RC5_KR, RC5_KR=2*(ROUNDS+1)
init_sub:
    sub    x0, x0, x1          // x0 -= RC5_Q
    subs   i, i, #1            // i--
    str    x0, [S, i, lsl #2]  // S[i] = x0;
    bne    init_sub            // i>=0

    umull  x0, x1, i, i        // x0 = 0, x1 = 0
    mov    r, #78              // r = (RC5_KR*3)

    // ***********************************************
    // create the round keys
    // ***********************************************
    mov    j, i                // j = 0
rc5_sub:
    cmp    i, #26              // if (i == RC5_KR) i = 0
    moveq  i, #0               // zero if equal
    and    j, #3               // j &= 3

    // x0 = S[i] = ROTL32(S[i] + x0+x1, 3);
    add    x0, x1              // x0 = x0 + x1;
    ldr    t2, [S, i, lsl #2]
    add    x0, t2              // x0 += t2
    ror    x0, #29
    str    x0, [S, i, lsl #2]  // S[i] = x0

    // x1 = L[j] = ROTL32(L[j] + x0+x1, x0+x1);
    add    x1, x0, x1          // x1 = x0 + x1
    rsb    t0, x1, #32         // t0 = 32 - x1
    ldr    t2, [L, j, lsl #2]
    add    x1, t2              // x1 += t2
    ror    x1, t0
    str    x1, [L, j, lsl #2]  // L[j] = x1

    add    i, #1               // i++
    add    j, #1               // j++
    subs   r, #1               // r--
    bne    rc5_sub

    // ***********************************************
    // perform encryption
    // ***********************************************
    // load plaintext
    ldm    x, {x0, x1}         // x0 = x[0]; x1 = x[1];

    ldr    t0, [kp], #4
    add    x0, t0              // x0 += *kp++;
    ldr    t0, [kp], #4
    add    x1, t0              // x1 += *kp++;

    // apply encryption
    mov    i, #24              // i = RC5_KR - 2
rc5_enc:
    ldr    t0, [kp], #4        // t0 = *kp++;
    eor    x0, x1
    rsb    t1, x1, #32         // t1 = 32 - x1
    mov    t2, x1              // backup x1
    add    x1, t0, x0, ror t1
    mov    x0, t2
    subs   i, #1               // i--
    bne    rc5_enc             // i>0

    // save ciphertext
    stm    x, {x0, x1}         // x[0] = x0; x[1] = x1;
    // release memory
    add    sp, #128
    // restore registers
    pop    {r0-r12, pc}
- RC6
RC6 is a symmetric block cipher published in 1998. It was one of the 5 finalists considered for the Advanced Encryption Standard.
The main changes with RC6 are a 128-bit block, and support for 128, 192 and 256-bit keys. The version here will support only 256-bit keys.
The key schedule is essentially the same as RC5's, but obviously works with 256-bit keys. The encryption includes additional operations, most notably a multiplication.
void rc6_encryptx(void *key, void *data) {
    uint32_t x0, x1, x2, x3, t0, t1, t2;
    uint32_t L[8], S[RC6_KR];
    uint32_t *x, *kp;
    int i, j, r;

    x=(uint32_t*)data;

    // initialize L with 256-bit key
    memcpy(&L, key, 32);

    x0 = 0xE96A3D2F;
    // initialize S with constants
    for (i=RC6_KR; i>0;) {
      x0 -= RC6_Q;
      S[--i] = x0;
    }

    x0 = x1 = 0;
    r = (RC6_KR*3);

    // create sub keys
    for (i=0, j=0; r>0; i++, j++, r--) {
      // i needs reset?
      if (i==RC6_KR) i=0;
      j &= 7;
      x0 = S[i] = ROTL32(S[i] + x0+x1, 3);
      x1 = L[j] = ROTL32(L[j] + x0+x1, x0+x1);
    }

    // assign sub keys to k
    kp = S;

    // load plain text
    x0 = x[0]; x1 = x[1] + *kp++;
    x2 = x[2]; x3 = x[3] + *kp++;

    // apply encryption
    for (i=RC6_ROUNDS; i>0; i--) {
      t0 = ROTL32(x1 * (x1 + x1 + 1), 5);
      t1 = ROTL32(x3 * (x3 + x3 + 1), 5);

      t2 = x3;              // backup x3
      x0 ^= t0;
      x2 ^= t1;

      x3 = ROTL32(x0, t1);
      t1 = *kp++;
      x3 += t1;

      x0 = x1;              // move x1 into x0

      x1 = ROTL32(x2, t0);
      t0 = *kp++;
      x1 += t0;

      x2 = t2;              // move x3 into x2
    }
    // save cipher text
    x[0] = x0 + kp[0]; x[1] = x1;
    x[2] = x2 + kp[1]; x[3] = x3;
}
I considered using a multiplication to perform the division, but a conditional test is perfect here, especially on ARM with its conditional execution feature.
k    .req r0
x    .req r1
x0   .req r0
x1   .req r2
x2   .req r3
x3   .req r4
t0   .req r5
t1   .req r6
t2   .req r7
i    .req r8
j    .req r9
r    .req r10
L    .req r11
S    .req r12
kp   .req r12

// rc6_encryptx(void *key, void *data);
rc6_encryptx:
    // save registers
    push   {r0-r12, lr}
    // allocate memory
    sub    sp, #256

    // initialize L with 256-bit key
    // memcpy(&L, key, 32);
    mov    L, sp
    ldm    k, {r2-r9}
    stm    L, {r2-r9}

    // initialize S with constants
    ldr    x0, =#0xE96A3D2F    // x0 = RC6_P precomputed
    ldr    x1, =#0x9E3779B9    // x1 = RC6_Q
    add    S, sp, #32
    mov    i, #44              // i = RC6_KR, RC6_KR=2*(ROUNDS+2)
init_sub:
    sub    x0, x1              // x0 -= RC6_Q
    subs   i, #1               // --i
    str    x0, [S, i, lsl #2]  // S[i] = x0;
    bne    init_sub            // i>=0

    umull  x0, x1, i, i        // x0 = 0, x1 = 0
    mov    r, #132             // r = (RC6_KR*3)

    // ***********************************************
    // create the round keys
    // ***********************************************
    mov    j, i                // j = 0
rc6_sub:
    cmp    i, #44              // if (i == RC6_KR)
    moveq  i, #0               //   i = 0
    and    j, #7               // j &= 7

    // x0 = S[i] = ROTL32(S[i] + x0+x1, 3);
    add    x0, x1              // x0 = x0 + x1;
    ldr    t0, [S, i, lsl #2]
    add    x0, t0              // x0 += t0
    ror    x0, #29
    str    x0, [S, i, lsl #2]  // S[i] = x0

    // x1 = L[j] = ROTL32(L[j] + x0+x1, x0+x1);
    add    x1, x0, x1          // x1 = x0 + x1
    rsb    t0, x1, #32         // t0 = 32 - x1
    ldr    t1, [L, j, lsl #2]
    add    x1, t1              // x1 += t1
    ror    x1, t0
    str    x1, [L, j, lsl #2]  // L[j] = x1

    add    i, #1               // i++
    add    j, #1               // j++
    subs   r, #1               // r--
    bne    rc6_sub

    // ***********************************************
    // perform encryption
    // ***********************************************
    // load plaintext
    ldm    x, {x0, x1, x2, x3} // x0 = x[0]; x1 = x[1];
                               // x2 = x[2]; x3 = x[3];
    ldr    t0, [kp], #4
    add    x1, t0              // x1 += *kp++;
    ldr    t0, [kp], #4
    add    x3, t0              // x3 += *kp++;

    // apply encryption
    mov    i, #20              // i = RC6_ROUNDS
rc6_enc:
    // t0 = ROTL32(x1 * (x1 + x1 + 1), 5);
    add    t0, x1, x1
    add    t0, #1
    mul    t0, x1, t0
    ror    t0, #27

    // t1 = ROTL32(x3 * (x3 + x3 + 1), 5);
    add    t1, x3, x3
    add    t1, #1
    mul    t1, x3, t1
    ror    t1, #27

    mov    t2, x3              // backup x3
    eor    x0, t0              // x0 ^= t0;
    eor    x2, t1              // x2 ^= t1;

    // x3 = ROTL32(x0 ^ t0, t1) + *kp++;
    rsb    t1, #32             // t1 = 32 - t1
    ror    x0, t1
    ldr    t1, [kp], #4        // t1 = *kp++
    add    x3, x0, t1          // x3 = ROTL32(x0, t1) + *kp++

    mov    x0, x1              // move x1 into x0

    // x1 = ROTL32(x2 ^ t1, t0) + *kp++;
    rsb    t0, #32             // t0 = 32 - t0
    ror    x2, t0
    ldr    t1, [kp], #4        // t1 = *kp++
    add    x1, x2, t1          // x1 = ROTL32(x2, t0) + *kp++

    mov    x2, t2              // move x3 into x2
    subs   i, #1               // i--
    bne    rc6_enc             // i>0

    // save ciphertext
    ldr    t0, [kp], #4        // x0 += *kp++;
    add    x0, t0
    ldr    t0, [kp], #4        // x2 += *kp++;
    add    x2, t0
    stm    x, {x0, x1, x2, x3} // x[0] = x0; x[1] = x1;
                               // x[2] = x2; x[3] = x3;
    // release memory
    add    sp, #256
    // restore registers
    pop    {r0-r12, pc}
Permutation Functions
- Gimli
Gimli, named after the Lord Of The Rings character, is a 384-bit permutation function, designed specifically to be lightweight, high performance, and secure across multiple architectures.
void gimli(void *state) {
    int      r, j;
    uint32_t t, x, y, z;
    uint32_t *s=(uint32_t*)state;

    for (r=24; r>0; --r) {
      // apply SP-box
      for (j=0; j<4; j++) {
        x = ROTR32(s[    j],  8);
        y = ROTR32(s[4 + j], 23);
        z =        s[8 + j];

        s[8 + j] = x ^ (z << 1) ^ ((y & z) << 2);
        s[4 + j] = y ^ x        ^ ((x | z) << 1);
        s[j]     = z ^ y        ^ ((x & y) << 3);
      }

      // apply Linear layer
      t = r & 3;

      // if zero, do small swap
      if (t == 0) {
        XCHG(s[0], s[1]);
        XCHG(s[2], s[3]);
        // add constant
        s[0] ^= (0x9e377900 | r);
      }
      // if 2, do big swap
      if (t == 2) {
        XCHG(s[0], s[2]);
        XCHG(s[1], s[3]);
      }
    }
}
One difference between the assembly and the C is that the state is denoted x instead of s.
Implementing the swap was a problem: ARM doesn't have an XCHG instruction like x86, so I thought of a few ways it could be done.
// XOR swap
cmp   r1, r2    // a == b
eorne r1, r2    // a ^= b
eorne r2, r1    // b ^= a
eorne r1, r2    // a ^= b
// Pushing/popping registers is a viable option
push  {r1}
mov   r1, r2
pop   {r2}
// Using the load/store instructions
ldr   r1, [r0]
ldr   r2, [r0, #4]
str   r2, [r0]
str   r1, [r0, #4]
// double load and store
ldrd  r2, r3, [r0]
strd  r3, r2, [r0]
The version here uses a temporary register, which will do for now.
s    .req r0
// state
x0   .req r1
x1   .req r2
x2   .req r3
x3   .req r4
// loop counters
r    .req r5
c    .req r6
j    .req r7
// used during permutation
x    .req r8
y    .req r9
z    .req r10
// temporary registers
t0   .req r11
t1   .req r12

gimlix:
    // save registers
    push   {r0-r12, lr}

    mov    r, #24              // r = 24
    ldr    c, =#0x9e377900     // c = 0x9e377900;
gimli_main:
    mov    j, #4               // j = 4
    sub    x0, s, #4           // x0 = s - 1
gimli_perm:
    ldr    x, [x0, #4]         // x = ROTR32(s[j], 8);
    ror    x, #8
    ldr    y, [x0, #20]        // y = ROTR32(s[4 + j], 23);
    ror    y, #23
    ldr    z, [x0, #36]        // z = s[8 + j];

    // s[8 + j] = x ^ (z << 1) ^ ((y & z) << 2);
    eor    t0, x, z, lsl #1    // t0 = x ^ (z << 1)
    and    t1, y, z            // t1 = y & z
    eor    t0, t1, lsl #2      // t0 = t0 ^ (t1 << 2)
    str    t0, [x0, #36]       // s[8 + j] = t0

    // s[4 + j] = y ^ x ^ ((x | z) << 1);
    eor    t0, y, x            // t0 = y ^ x
    orr    t1, x, z            // t1 = x | z
    eor    t0, t1, lsl #1      // t0 = t0 ^ (t1 << 1)
    str    t0, [x0, #20]       // s[4 + j] = t0

    // s[j] = z ^ y ^ ((x & y) << 3);
    eor    t0, z, y            // t0 = z ^ y
    and    t1, x, y            // t1 = x & y
    eor    t0, t1, lsl #3      // t0 = t0 ^ (t1 << 3)
    str    t0, [x0, #4]!       // s[j] = t0, s++

    subs   j, #1               // j--
    bne    gimli_perm          // j>0

    // load 16 bytes of state
    ldm    s, {x0, x1, x2, x3}

    // apply linear layer
    // t0 = (r & 3);
    ands   t0, r, #3

    // XCHG(s[0], s[1]);
    moveq  t1, x0
    moveq  x0, x1
    moveq  x1, t1

    // XCHG(s[2], s[3]);
    moveq  t1, x2
    moveq  x2, x3
    moveq  x3, t1

    // s[0] ^= (0x9e377900 | r);
    orreq  t1, c, r
    eoreq  x0, t1

    // if (t == 2)
    cmp    t0, #2

    // XCHG(s[0], s[2]);
    moveq  t1, x0
    moveq  x0, x2
    moveq  x2, t1

    // XCHG(s[1], s[3]);
    moveq  t1, x1
    moveq  x1, x3
    moveq  x3, t1

    // save state
    stm    s, {x0, x1, x2, x3}

    subs   r, #1               // r--
    bne    gimli_main          // r>0

    // restore registers
    pop    {r0-r12, pc}
- Xoodoo
Xoodoo is a new permutation function first described in November 2017. It's very similar to Keccak, but uses the same parameters as Gimli.
void xoodoo(void *state) {
    uint32_t e[4], x0, x1, x2, t;
    int r, i;
    uint32_t *x=(uint32_t*)state;
    uint32_t rc[12]=
    { 0x058, 0x038, 0x3c0, 0x0d0,
      0x120, 0x014, 0x060, 0x02c,
      0x380, 0x0f0, 0x1a0, 0x012 };

    // 12 rounds by default
    for (r=0; r<12; r++) {
      // Theta
      for (i=0; i<4; i++) {
        e[i] = ROTR32(x[i] ^ x[i+4] ^ x[i+8], 18);
        e[i]^= ROTR32(e[i], 9);
      }
      for (i=0; i<12; i++) {
        x[i]^= e[(i - 1) & 3];
      }

      // Rho west
      XCHG(x[7], x[4]);
      XCHG(x[7], x[5]);
      XCHG(x[7], x[6]);

      // Iota
      x[0] ^= rc[r];

      // Chi and Rho east
      for (i=0; i<4; i++) {
        x0 = x[i+0];
        x1 = x[i+4];
        x2 = ROTR32(x[i+8], 21);

        x[i+8] = ROTR32((x1 & ~x0) ^ x2, 24);
        x[i+4] = ROTR32((x0 & ~x2) ^ x1, 31);
        x[i+0]^= x2 & ~x1;
      }
      XCHG(x[8], x[10]);
      XCHG(x[9], x[11]);
    }
}
The assembly here is not optimized; it's a compact implementation written for simplicity.
state .req r0
x     .req r1
r     .req r2
i     .req r3
x0    .req r4
x1    .req r5
x2    .req r6
x3    .req r7
rc    .req r8
xt    .req r9
e     .req sp

xoodoo:
    // save registers
    push   {r0-r12, lr}

    mov    r, #12              // 12 rounds
    sub    sp, #16             // allocate 16 bytes
    adr    rc, rc_tab
xoodoo_main:
    mov    i, #0               // i = 0
    mov    x, state
theta_l0:
    ldr    x2, [x, #32]        // x2 = x[i+8]
    ldr    x1, [x, #16]        // x1 = x[i+4]
    ldr    x0, [x], #4         // x0 = x[i+0], advance x by 4

    // e[i] = ROTR32(x[i] ^ x[i+4] ^ x[i+8], 18);
    eor    x0, x1
    eor    x0, x2
    ror    x0, #18

    // e[i]^= ROTR32(e[i], 9);
    eor    x0, x0, ror #9
    str    x0, [sp, i, lsl #2] // store in e

    add    i, #1               // i++
    cmp    i, #4               // i<4
    bne    theta_l0

    // x[i]^= e[(i - 1) & 3];
    mov    i, #0               // i = 0
    mov    x, state            // x = state
theta_l1:
    sub    xt, i, #1
    and    xt, #3              // xt = (i - 1) & 3
    ldr    xt, [sp, xt, lsl #2] // xt = e[(i - 1) & 3]
    ldr    x0, [x, i, lsl #2]  // x0 = x[i]
    eor    x0, xt              // x0 ^= xt
    str    x0, [x, i, lsl #2]  // x[i] = x0
    add    i, #1               // i++
    cmp    i, #12              // i<12
    bne    theta_l1

    // Rho west
    // XCHG(x[7], x[4]);
    // XCHG(x[7], x[5]);
    // XCHG(x[7], x[6]);
    add    x, state, #16       // x = &state[4]
    ldm    x, {x0, x1, x2, x3}
    mov    xt, x0              // xt = x[4]
    mov    x0, x3              // x[4] = x[7]
    mov    x3, x2              // x[7] = x[6]
    mov    x2, x1              // x[6] = x[5]
    mov    x1, xt              // x[5] = xt
    stm    x, {x0, x1, x2, x3}

    mov    x, state

    // Iota
    // x[0] ^= rc[r];
    ldrh   xt, [rc], #2        // load half-word, advance by 2
    ldr    x0, [x]             // load word
    eor    x0, xt              // xor
    str    x0, [x]             // store word

    mov    i, #4
chi:
    // Chi and Rho east
    // x0 = x[i+0];
    ldr    x0, [x]
    // x1 = x[i+4];
    ldr    x1, [x, #16]
    // x2 = ROTR32(x[i+8], 21);
    ldr    x2, [x, #32]
    ror    x2, #21

    // x[i+8] = ROTR32((x1 & ~x0) ^ x2, 24);
    bic    xt, x1, x0
    eor    xt, x2
    ror    xt, #24
    str    xt, [x, #32]

    // x[i+4] = ROTR32((x0 & ~x2) ^ x1, 31);
    bic    xt, x0, x2
    eor    xt, x1
    ror    xt, #31
    str    xt, [x, #16]

    // x[i+0]^= x2 & ~x1;
    bic    xt, x2, x1
    eor    xt, x0
    str    xt, [x], #4

    // i--
    subs   i, #1
    bne    chi

    add    x, state, #32       // x = &state[8]
    // XCHG(x[8], x[10]);
    ldm    x, {x0, x1, x2, x3}
    push   {x0}
    mov    x0, x2
    pop    {x2}
    // XCHG(x[9], x[11]);
    push   {x1}
    mov    x1, x3
    pop    {x3}
    stm    x, {x0, x1, x2, x3}

    subs   r, #1               // r--
    bne    xoodoo_main         // r>0

    // release stack
    add    sp, #16
    // restore registers, and return
    pop    {r0-r12, pc}

// round constants
rc_tab:
    .hword 0x058, 0x038, 0x3c0, 0x0d0
    .hword 0x120, 0x014, 0x060, 0x02c
    .hword 0x380, 0x0f0, 0x1a0, 0x012
Summary
You can accelerate your learning of a CPU's assembly language by using a C compiler to generate the assembly. The ARM code generator may not be as mature as the one for x86, but you'll still learn a lot from studying its output.
Sources to all algorithms mentioned can be found here.
If you have comments or requests, my e-mail can be found here