Introduction
Two CPU architectures currently dominate the embedded market, and both are almost certain to dominate the IoT technologies deployed with 5G over the coming 10-15 years.
Those architectures are the Advanced RISC Machine (ARM) and the Argonaut RISC Core (ARC).
ARC is an architecture we'll hear more about as 5G is deployed, and I hope to discuss it in a future post, but this post is about ARM assembly.
This isn't a tutorial on ARM assembly, but I thought sharing the code, in addition to explaining how the operations were implemented, would be helpful to others learning ARM.
This post is intended for beginners, but it does assume a basic grasp of C or a similar language. Knowledge of another assembly language would be an advantage, but if you're already familiar with C, the comments should make clear what the assembly is doing.
A number of cryptographic primitives suitable for resource constrained environments were implemented in C. The assembly output generated by the GNU C compiler was then examined to determine the best implementation possible.
Finally, ARM assembly was written by hand in order to reduce space further. If you’re new to the ARM architecture, and want to learn assembly, the information here may be of some use to you.
A good resource for those new to ARM assembly
Generating assembly
Learning any assembly language can be a tedious and challenging task for many of us. A lack of knowledge of another assembly language will also impede progress, but if you persevere, eventually you'll reap the rewards and be reading assembly like you're reading a newspaper. 😉
My approach to learning an assembly language involves obtaining the instruction set or programmer's manuals, a C compiler that can generate assembly for the target architecture, and sometimes existing source code written by other people, which can be invaluable for learning new tricks.
Implement your program in C, have the compiler generate the assembly, then study the output and learn what it does.
- Optimization Flags
To generate assembly from a C file with GCC, use the -S switch. You may also want to generate more compact code with the -O2 or -Os optimization flags.
The Microsoft compiler for ARM provides similar switches, including /FAs, which generates an assembly file with the source code included.
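If you want something small to experiment with (the file and function names here are just placeholders, not from the original sources), a trivial routine like this is enough to see what each flag does to the output:

```c
#include <stdint.h>

// toy example: XOR a 128-bit block with a 128-bit key,
// the kind of loop that appears in the ciphers below
void xor_block(uint32_t *x, const uint32_t *k) {
    for (int i = 0; i < 4; i++) x[i] ^= k[i];
}
```

Assuming the file is saved as xor.c, `gcc -S -Os xor.c` writes the assembly to xor.s; comparing that against the -O0 and -O2 output shows how the optimizer unrolls or tightens the loop.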
Personally, I'd never try to optimize assembly for speed, since compilers are usually already good at this, but there are definitely situations where the assembly generated by the compiler is inefficient.
- Debugging symbols
If we want to include symbols from the original C code, to help us understand what each assembly operation does, we can use the -g switch, which includes debugging symbols.
To then produce a disassembly with those symbols included, use objdump -d -S source.o, which tells objdump to disassemble and interleave the source code.
ARM Directives
The GNU Assembler provides a number of directives which can be useful and save you from pulling your hair out and crying over your keyboard.
A more comprehensive list can be viewed here; I only include those I think are useful for avoiding buggy code.
- CPU
.arm indicates 32-bit code, and .thumb indicates 16-bit code.
The two are essentially the same as using .code 32 or .code 16
All the code used here is 32-bit.
- Architecture
It's completely optional, but the .arch directive tells the assembler which instruction set our program should use.
ARMv6 should be fine, since Raspberry Pis run on this architecture.
Using .arch armv6 should prevent the use of instructions that aren't part of ARMv6.
If you accidentally attempt to use an instruction unavailable on the target architecture, the assembler will display an error. Better that than assembling the code into a binary, executing it, and watching it segfault.
- Alignment
The .align directive ensures your code is properly aligned.
Of course, depending on what you’re writing, you may not want your code padded with null bytes.
If that's the case, you might want to align manually by placing a simple NOP instruction at the end of the file.
- Exporting Functions
The .global directive will allow a function to be accessed from another source file.
- Symbolic names
It could be considered bad practice in the eyes of some hardcore assembly programmers, but redefining registers just makes my life much easier. I sleep reasonably well at night if I used symbolic names during the day to complete a bit of code.
As some of you know, we define symbolic constants in C using the #define preprocessor directive, which allows us to substitute one value for another.
Those of you familiar with x86 or AMD64 assembly will likely have encountered the %define directive in the NASM and YASM assemblers, which is essentially the same thing.
The .req directive allows us to assign our own names to registers.
I recommend using symbolic names where possible to reduce the potential for error…you can thank me later.
ARM instructions
Almost every ARM instruction supports suffixes that can affect, or depend on, the flags register.
SUB, for subtraction, can be written as SUBS, which sets the flags based on the result of the subtraction.
A condition suffix like NE (Not Equal) or EQ (Equal) can also be added, which determines whether the instruction is executed at all.
Multiple loads with LDM (Load Multiple) and multiple stores with STM (Store Multiple) are not commonly generated by compilers, but they're used throughout the examples shown here.
Block Ciphers
Lightweight block ciphers are popular among developers of embedded systems. They can be used with Counter (CTR) Mode to implement a stream cipher.
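To illustrate the CTR construction mentioned above (this sketch is not from the original sources; any of the block ciphers below, e.g. xtea_encrypt, could replace the trivial stand-in permutation used here, which exists only to keep the example self-contained):

```c
#include <stdint.h>
#include <string.h>

// hypothetical 64-bit block encryption stand-in; NOT a real
// cipher, used only so the CTR loop below can run on its own
static void block_encrypt(const void *key, void *blk) {
    uint32_t *x = (uint32_t*)blk;
    const uint32_t *k = (const uint32_t*)key;
    x[0] += k[0] ^ 0x9E3779B9;
    x[1] ^= x[0] + k[1];
}

// encrypts or decrypts len bytes in place: CTR mode turns a
// block cipher into a stream cipher, so the same routine
// performs both operations
void ctr_crypt(const void *key, uint64_t ctr, void *data, int len) {
    uint8_t ks[8], *p = (uint8_t*)data;
    int i;

    while (len > 0) {
        memcpy(ks, &ctr, 8);        // counter is the block input
        block_encrypt(key, ks);     // keystream = E(key, counter)
        for (i = 0; i < 8 && len > 0; i++, len--)
            *p++ ^= ks[i];          // xor keystream with data
        ctr++;                      // next counter value
    }
}
```

Running ctr_crypt a second time with the same key and starting counter recovers the original data, which is exactly why CTR is convenient on constrained devices: one primitive, one direction.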
What I illustrate here should not be interpreted as an endorsement. I do not claim these ciphers are without weaknesses, or that you should be using them in your project.
- XTEA
TEA Extensions (XTEA) is a 64-bit block cipher with support for 128-bit keys. It was published in 1998 in response to weaknesses found in the Tiny Encryption Algorithm (TEA).
Compared to its predecessor, XTEA has a more complex key schedule and a rearrangement of the shifts, XORs, and additions.
// uses 32 rounds by default
void xtea_encrypt(void *key, void *data) {
    int      i;
    uint32_t x0, x1, t, sum=0;
    uint32_t *k=(uint32_t*)key;
    uint32_t *x=(uint32_t*)data;

    // load 64-bit plain text
    x0 = x[0]; x1 = x[1];

    for (i=64; i>0; i--) {
      t = sum;
      // add constant every 2nd round
      if (i & 1) {
        sum += 0x9E3779B9;
        t = sum >> 11;
      }
      x0 += ((((x1 << 4) ^ (x1 >> 5)) + x1) ^ (sum + k[t & 3]));
      XCHG(x0, x1);
    }
    // save 64-bit cipher text
    x[0] = x0; x[1] = x1;
}
The ARM code makes use of conditional execution in order to avoid branching. In x86 code, the if statement requires a conditional jump, but thanks to ARM's conditional suffixes, the addition and logical shift right are only executed every 2nd round.
// key
k    .req r0
x    .req r1
// data
x0   .req r2
x1   .req r3
sum  .req r4
t0   .req r5
t1   .req r6
c    .req r7
i    .req r8

// xtea_encryptx(void *key, void *data);
xtea_encryptx:
    // save registers
    push   {r0-r12, lr}

    // 32 rounds by default
    mov    i, #64              // number of rounds
    // load 64-bit plain text
    ldm    x, {x0, x1}         // x0 = x[0], x1 = x[1];
    mov    sum, #0             // sum = 0;
    ldr    c, =#0x9E3779B9     // c = 0x9E3779B9;
xtea_loop:
    mov    t0, sum             // t0 = sum;
    tst    i, #1               // if (i & 1)

    // the next 2 only execute if (i % 2) is not zero
    addne  sum, c              // sum += 0x9E3779B9;
    lsrne  t0, sum, #11        // t0 = sum >> 11

    and    t0, #3              // t0 %= 4
    ldr    t0, [k, t0, lsl #2] // t0 = k[t0];
    add    t1, sum, t0         // t1 = sum + t0
    mov    t0, x1, lsl #4      // t0 = (x1 << 4)
    eor    t0, x1, lsr #5      // t0 ^= (x1 >> 5)
    add    t0, x1              // t0 += x1
    eor    t0, t1              // t0 ^= t1
    mov    t1, x1              // backup x1
    add    x1, t0, x0          // x1 = t0 + x0
    // XCHG(x0, x1)
    mov    x0, t1              // x0 = x1
    subs   i, i, #1            // i--
    bne    xtea_loop           // i>0

    // save 64-bit cipher text
    stm    x, {x0, x1}
    // restore registers
    pop    {r0-r12, pc}
- CHASKEY
The Chaskey cipher is a 128-bit block, 128-bit key symmetric encryption algorithm which is the underlying function used for the Chaskey Message Authentication Code (MAC).
It’s based on an Even-Mansour construction which makes it very simple to implement. The permutation function was derived from SipHash, and uses only ARX instructions making it highly suitable for 32-bit architectures.
void chas_encryptx(void *key, void *data) {
    int i;
    uint32_t *x=(uint32_t*)data;
    uint32_t *k=(uint32_t*)key;

    // mix key
    for (i=0; i<4; i++) x[i] ^= k[i];

    // apply permutation
    for (i=16; i>0; i--) {
      x[0] += x[1]; x[1]=ROTR32(x[1],27) ^ x[0];
      x[2] += x[3]; x[3]=ROTR32(x[3],24) ^ x[2];
      x[2] += x[1]; x[0]=ROTR32(x[0],16) + x[3];
      x[3]=ROTR32(x[3],19) ^ x[0];
      x[1]=ROTR32(x[1],25) ^ x[2];
      x[2]=ROTR32(x[2],16);
    }
    // mix key
    for (i=0; i<4; i++) x[i] ^= k[i];
}
There's not much to say about the implementation.
k    .req r0
x    .req r1
k0   .req r2
k1   .req r3
k2   .req r4
k3   .req r5
x0   .req r6
x1   .req r7
x2   .req r8
x3   .req r9
i    .req r10

// chas_encryptx(void *key, void *data);
chas_encryptx:
    // save registers
    push   {r0-r12,lr}

    // load 128-bit key
    ldm    k, {k0, k1, k2, k3}
    // load 128-bit plain text
    ldm    x, {x0, x1, x2, x3}

    // xor plaintext with key
    eor    x0, k0              // x[0] ^= k[0];
    eor    x1, k1              // x[1] ^= k[1];
    eor    x2, k2              // x[2] ^= k[2];
    eor    x3, k3              // x[3] ^= k[3];

    mov    i, #16              // i = 16
chaskey_loop:
    add    x0, x1              // x[0] += x[1];
    eor    x1, x0, x1, ror #27 // x[1]=ROTR32(x[1],27) ^ x[0];
    add    x2, x3              // x[2] += x[3];
    eor    x3, x2, x3, ror #24 // x[3]=ROTR32(x[3],24) ^ x[2];
    add    x2, x1              // x[2] += x[1];
    add    x0, x3, x0, ror #16 // x[0]=ROTR32(x[0],16) + x[3];
    eor    x3, x0, x3, ror #19 // x[3]=ROTR32(x[3],19) ^ x[0];
    eor    x1, x2, x1, ror #25 // x[1]=ROTR32(x[1],25) ^ x[2];
    ror    x2, #16             // x[2]=ROTR32(x[2],16);
    subs   i, i, #1            // i--
    bne    chaskey_loop        // i>0

    // xor cipher text with key
    eor    x0, k0              // x[0] ^= k[0];
    eor    x1, k1              // x[1] ^= k[1];
    eor    x2, k2              // x[2] ^= k[2];
    eor    x3, k3              // x[3] ^= k[3];

    // save 128-bit cipher text
    stm    x, {x0, x1, x2, x3}
    // restore registers, and return
    pop    {r0-r12,pc}
- CHAM
CHAM: A Family of Lightweight Block Ciphers for Resource-Constrained Devices was published in December 2017 at the 20th Annual International Conference on Information Security and Cryptology held in South Korea.
// initialize sub keys
void cham128_setkey(void *in, void *out) {
    int i;
    uint32_t *k=(uint32_t*)in;
    uint32_t *rk=(uint32_t*)out;

    for (i=0; i<KW; i++) {
      rk[i] = k[i] ^ ROTL32(k[i], 1) ^ ROTL32(k[i], 8);
      rk[(i + KW) ^ 1] = k[i] ^ ROTL32(k[i], 1) ^ ROTL32(k[i], 11);
    }
}

// encrypt 128-bits
void cham128_encrypt(void *keys, void *data) {
    int i;
    uint32_t x0, x1, x2, x3;
    uint32_t t;
    uint32_t *rk=(uint32_t*)keys;
    uint32_t *x=(uint32_t*)data;

    x0 = x[0]; x1 = x[1];
    x2 = x[2]; x3 = x[3];

    for (i=0; i<R; i++) {
      if (i & 1) {
        x0 = ROTL32((x0 ^ i) + (ROTL32(x1, 8) ^ rk[i & 7]), 1);
      } else {
        x0 = ROTL32((x0 ^ i) + (ROTL32(x1, 1) ^ rk[i & 7]), 8);
      }
      XCHG(x0, x1);
      XCHG(x1, x2);
      XCHG(x2, x3);
    }
    x[0] = x0; x[1] = x1;
    x[2] = x2; x[3] = x3;
}
The code makes use of conditional execution to avoid branching. The TST instruction performs the (i & 1) test, and the EOR and ROR instructions that follow are executed or skipped based on the result.
k    .req r0
x    .req r1
// data
x0   .req r0
x1   .req r2
x2   .req r3
x3   .req r4
// round keys
rk   .req sp
k0   .req r6
k1   .req r7
k2   .req r8
i    .req r10

cham128_encryptx:
    // save registers
    push   {r0-r12,lr}
    // allocate memory for round keys
    sub    sp, #32

    // derive round keys from 128-bit key
    mov    i, #0               // i = 0
cham_init:
    ldr    k0, [k, i, lsl #2]  // k0 = k[i];
    ror    k1, k0, #31         // k1 = ROTR32(k0, 31);
    ror    k2, k0, #24         // k2 = ROTR32(k0, 24);
    eor    k0, k1              // k0 ^= k1;
    eor    k1, k0, k2          // rk[i] = k0 ^ k2;
    str    k1, [rk, i, lsl #2]
    eor    k0, k2, ror #29     // k0 ^= ROTR32(k2, 29);
    add    k1, i, #4           // k1 = (i+KW)
    eor    k1, #1              // k1 = (i+KW) ^ 1
    str    k0, [rk, k1, lsl #2] // rk[(i+KW)^1] = k0;
    add    i, #1               // i++
    cmp    i, #4               // i<KW
    bne    cham_init

    // load 128-bit plain text
    ldm    x, {x0, x1, x2, x3}

    // perform encryption
    mov    i, #0               // i = 0
cham_enc:
    mov    k0, x3
    eor    x0, i               // x0 ^= i
    tst    i, #1               // if (i & 1)

    // x3 = rk[i & 7];
    and    k1, i, #7           // k1 = i & 7;
    ldr    x3, [rk, k1, lsl #2] // x3 = rk[i & 7];

    // execution depends on (i % 2)
    // x3 ^= (i & 1) ? ROTR32(x1, 24) : ROTR32(x1, 31);
    eorne  x3, x1, ror #24
    eoreq  x3, x1, ror #31

    add    x3, x0              // x3 += x0;

    // x3 = (i & 1) ? ROTR32(x3, 31) : ROTR32(x3, 24);
    rorne  x3, #31             // x3 = ROTR32(x3, 31);
    roreq  x3, #24             // x3 = ROTR32(x3, 24);

    // swap
    mov    x0, x1              // x0 = x1;
    mov    x1, x2              // x1 = x2;
    mov    x2, k0              // x2 = k0;
    add    i, #1               // i++;
    cmp    i, #80              // i<R
    bne    cham_enc

    // save 128-bit cipher text
    stm    x, {x0, x1, x2, x3} // x[0] = x0; x[1] = x1;
                               // x[2] = x2; x[3] = x3;
    // release memory for round keys
    add    sp, #32
    // restore registers
    pop    {r0-r12, pc}
- LEA
LEA is a 128-bit block cipher with support for 128, 192 and 256-bit keys published in 2014.
The only operations used for encryption and the key schedule are 32-bit Addition, eXclusive OR and Rotation (ARX structure): the designers state that "the usage of 32-bit and 64-bit processors will grow rapidly compared to 8-bit and 16-bit ones".
void lea128_encryptx(void *key, void *data) {
    uint32_t k0, k1, k2, k3;
    uint32_t x0, x1, x2, x3;
    uint32_t c0, c1, c2, c3;
    uint32_t *x, *k, r, t0;

    x = (uint32_t*)data;
    k = (uint32_t*)key;

    // initialize 32-bit constants
    c0 = 0xc3efe9db; c1 = 0x88c4d604;
    c2 = 0xe789f229; c3 = 0xc6f98763;

    // load 128-bit key
    k0 = k[0]; k1 = k[1];
    k2 = k[2]; k3 = k[3];

    // load 128-bit plain text
    x0 = x[0]; x1 = x[1];
    x2 = x[2]; x3 = x[3];

    // perform encryption
    for (r=24; r>0; r--) {
      // create subkey
      k0 = ROTR32(k0 + c0, 31);
      k1 = ROTR32(k1 + ROTR32(c0, 31), 29);
      k2 = ROTR32(k2 + ROTR32(c0, 30), 26);
      k3 = ROTR32(k3 + ROTR32(c0, 29), 21);

      // encrypt block
      t0 = x0;
      x0 = ROTR32((x0 ^ k0) + (x1 ^ k1), 23);
      x1 = ROTR32((x1 ^ k2) + (x2 ^ k1),  5);
      x2 = ROTR32((x2 ^ k3) + (x3 ^ k1),  3);
      x3 = t0;

      // update constants
      t0 = c0; c0 = c1; c1 = c2; c2 = c3;
      c3 = ROTR32(t0, 28);
    }
    // save 128-bit cipher text
    x[0] = x0; x[1] = x1;
    x[2] = x2; x[3] = x3;
}
Although we could use the stack for the constants, 4 registers were used instead. Because almost all the general purpose registers are in use, this only just fits on ARM in 32-bit mode.
k    .req r0
x    .req r1
// key
k0   .req r0
k1   .req r2
k2   .req r3
k3   .req r4
// data
x0   .req r5
x1   .req r6
x2   .req r7
x3   .req r8
// constants
g0   .req r9
g1   .req r10
g2   .req r11
g3   .req r12
// counter
r    .req r1

lea128_encryptx:
    // save registers
    push   {r0-r12, lr}

    ldr    g0, =#0xc3efe9db    // g0 = 0xc3efe9db;
    ldr    g1, =#0x88c4d604    // g1 = 0x88c4d604;
    ldr    g2, =#0xe789f229    // g2 = 0xe789f229;
    ldr    g3, =#0xc6f98763    // g3 = 0xc6f98763;

    // load 128-bit key
    ldm    k, {k0, k1, k2, k3}

    // load 128-bit plain text
    push   {x}
    ldm    x, {x0, x1, x2, x3}

    // perform encryption
    mov    r, #24              // r = 24
lea_main:
    push   {r}

    // create subkey
    // k0 = ROTR32(k0 + g0, 31);
    add    k0, g0
    ror    k0, #31

    // k1 = ROTR32(k1 + ROTR32(g0, 31), 29);
    add    k1, g0, ror #31
    ror    k1, #29

    // k2 = ROTR32(k2 + ROTR32(g0, 30), 26);
    add    k2, g0, ror #30
    ror    k2, #26

    // k3 = ROTR32(k3 + ROTR32(g0, 29), 21);
    add    k3, g0, ror #29
    ror    k3, #21

    // encrypt block
    // t0 = x0;
    push   {x0}

    // x0 = ROTR32((x0 ^ k0) + (x1 ^ k1), 23);
    eor    r1, x1, k1
    eor    x0, k0
    add    x0, r1
    ror    x0, #23

    // x1 = ROTR32((x1 ^ k2) + (x2 ^ k1), 5);
    eor    r1, x2, k1
    eor    x1, k2
    add    x1, r1
    ror    x1, #5

    // x2 = ROTR32((x2 ^ k3) + (x3 ^ k1), 3);
    eor    r1, x3, k1
    eor    x2, k3
    add    x2, r1
    ror    x2, #3

    // x3 = t0;
    pop    {x3}

    // update constants
    push   {g0}
    mov    g0, g1
    mov    g1, g2
    mov    g2, g3
    // g3 = ROTR32(t0, 28);
    pop    {g3}
    ror    g3, #28

    pop    {r}
    subs   r, #1               // r--
    bne    lea_main            // r>0

    // save 128-bit cipher text
    pop    {x}
    stm    x, {x0, x1, x2, x3}
    // restore registers
    pop    {r0-r12, pc}
- SPECK
SPECK is a family of lightweight block ciphers publicly released by the National Security Agency (NSA) in June 2013. It’s an ARX (add-rotate-xor) design optimized for performance in software implementations, and has been suggested for use on resource constrained devices.
void speck64_encryptx(const void *key, void *data) {
    uint32_t i, t, k0, k1, k2, k3, x0, x1;
    uint32_t *k=(uint32_t*)key;
    uint32_t *x=(uint32_t*)data;

    // copy 128-bit key to local space
    k0 = k[0]; k1 = k[1];
    k2 = k[2]; k3 = k[3];

    // copy input to local space
    x0 = x[0]; x1 = x[1];

    for (i=0; i<27; i++) {
      // encrypt block
      x0 = (ROTR32(x0, 8) + x1) ^ k0;
      x1 =  ROTL32(x1, 3) ^ x0;

      // create next sub key
      t  = k3;
      k3 = (ROTR32(k1, 8) + k0) ^ i;
      k0 =  ROTL32(k0, 3) ^ k3;
      k1 = k2; k2 = t;
    }
    // save result
    x[0] = x0; x[1] = x1;
}
A straightforward implementation with no trickery required.
// key
k0   .req r2
k1   .req r3
k2   .req r4
k3   .req r5
// plaintext
x0   .req r6
x1   .req r7
// parameters
k    .req r0
x    .req r1
i    .req r0
t    .req r8

// speck64_encryptx(void *key, void *data);
speck64_encryptx:
    // save registers
    push   {r0-r12, lr}

    // load 128-bit key
    // k0 = k[0]; k1 = k[1]; k2 = k[2]; k3 = k[3];
    ldm    k, {k0, k1, k2, k3}

    // load 64-bit plain text
    ldm    x, {x0, x1}         // x0 = x[0]; x1 = x[1];

    mov    i, #0               // i=0
speck_loop:
    add    x0, x1, x0, ror #8  // x0 = (ROTR32(x0, 8) + x1) ^ k0;
    eor    x0, k0
    eor    x1, x0, x1, ror #29 // x1 = ROTL32(x1, 3) ^ x0;

    mov    t, k3               // backup k3
    add    k3, k0, k1, ror #8  // k3 = (ROTR32(k1, 8) + k0) ^ i;
    eor    k3, i
    eor    k0, k3, k0, ror #29 // k0 = ROTL32(k0, 3) ^ k3;
    mov    k1, k2              // k1 = k2;
    mov    k2, t               // k2 = t;

    add    i, #1               // i++;
    cmp    i, #27              // i<27;
    bne    speck_loop

    // save result
    stm    x, {x0, x1}         // x[0] = x0; x[1] = x1;
    // restore registers
    pop    {r0-r12, pc}
- NOEKEON
NOEKEON is a 128-bit block cipher submitted to the NESSIE project in September 2000.
void Noekeonx(void *key, void *data) {
    int i, j;
    uint32_t t;
    w256_t rc;
    uint32_t *x, *k;

    x = (uint32_t*)data;
    k = (uint32_t*)key;

    // constants
    rc.w[0] = 0x6c361b80; rc.w[1] = 0x9a4dabd8;
    rc.w[2] = 0x63bc5e2f; rc.w[3] = 0x6a3597c6;
    rc.b[16] = 0xd4;

    for (i=0;;i++) {
      x[0] ^= rc.b[i];

      // Theta
      t = x[0] ^ x[2];
      t ^= ROTR32(t, 8) ^ ROTR32(t, 24);
      x[1] ^= t;
      x[3] ^= t;

      // mix key
      x[0]^= k[0]; x[1]^= k[1];
      x[2]^= k[2]; x[3]^= k[3];

      t = x[1] ^ x[3];
      t ^= ROTR32(t, 8) ^ ROTR32(t, 24);
      x[0]^= t;
      x[2]^= t;

      if (i==Nr) break;

      // Pi1
      x[1] = ROTR32(x[1], 31);
      x[2] = ROTR32(x[2], 27);
      x[3] = ROTR32(x[3], 30);

      // Gamma
      x[1]^= ~((x[3]) | (x[2]));
      x[0]^= x[2] & x[1];
      XCHG(x[0], x[3]);
      x[2]^= x[0] ^ x[1] ^ x[3];
      x[1]^= ~((x[3]) | (x[2]));
      x[0]^= x[2] & x[1];

      // Pi2
      x[1] = ROTR32(x[1], 1);
      x[2] = ROTR32(x[2], 5);
      x[3] = ROTR32(x[3], 2);
    }
}
What's missing here is the code to convert between little- and big-endian byte order.
On x86 there's the BSWAP instruction for this; on ARM, there's REV.
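In C, the conversion can be written portably; this sketch is my own addition, not from the original sources, and compilers typically recognize the pattern and emit REV on ARM:

```c
#include <stdint.h>

// reverse the byte order of a 32-bit word, the software
// equivalent of BSWAP on x86 or REV on ARM
uint32_t swap32(uint32_t x) {
    return ((x & 0x000000FF) << 24) |
           ((x & 0x0000FF00) <<  8) |
           ((x & 0x00FF0000) >>  8) |
           ((x & 0xFF000000) >> 24);
}
```

Since the operation is its own inverse, the same routine converts in both directions.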
k    .req r0
x    .req r1
// key
k0   .req r2
k1   .req r3
k2   .req r4
k3   .req r5
// data
x0   .req r6
x1   .req r7
x2   .req r8
x3   .req r9
// constants
i    .req r10
t0   .req r11
t1   .req r12

Noekeonx:
    // save registers
    push   {r0-r12, lr}

    // load 128-bit key
    ldm    k, {k0, k1, k2, k3}
    // load 128-bit plain text
    ldm    x, {x0, x1, x2, x3}

    mov    i, #0
noekeon_main:
    adr    t0, rc_tab
    ldrb   t0, [t0, i]         // x[0] ^= rc.b[i];
    eor    x0, t0

    // Theta
    // t = x[0] ^ x[2];
    // t ^= ROTR32(t, 8) ^ ROTR32(t, 24);
    eor    t0, x0, x2
    mov    t1, t0
    eor    t0, t1, ror #8
    eor    t0, t1, ror #24

    // x[1] ^= t; x[3] ^= t;
    eor    x1, t0
    eor    x3, t0

    // mix key
    // x[0]^= k[0]; x[1]^= k[1];
    // x[2]^= k[2]; x[3]^= k[3];
    eor    x0, k0
    eor    x1, k1
    eor    x2, k2
    eor    x3, k3

    // t = x[1] ^ x[3];
    // t ^= ROTR32(t, 8) ^ ROTR32(t, 24);
    eor    t0, x1, x3
    mov    t1, t0
    eor    t0, t1, ror #8
    eor    t0, t1, ror #24

    // x[0] ^= t; x[2] ^= t;
    eor    x0, t0
    eor    x2, t0

    cmp    i, #16              // if (i==Nr) break;
    beq    noekeon_end

    // Pi1
    ror    x1, #31             // x[1] = ROTR32(x[1], 31);
    ror    x2, #27             // x[2] = ROTR32(x[2], 27);
    ror    x3, #30             // x[3] = ROTR32(x[3], 30);

    // Gamma
    // x[1]^= ~((x[3]) | (x[2]));
    orr    t0, x3, x2
    mvn    t0, t0
    eor    x1, t0

    mov    t1, x3              // backup x3
    // x[0] ^= x[2] & x[1];
    and    t0, x2, x1
    eor    x3, x0, t0
    // XCHG(x[0], x[3]);
    mov    x0, t1

    // x[2]^= x[0] ^ x[1] ^ x[3];
    eor    x2, x0
    eor    x2, x1
    eor    x2, x3

    // x[1]^= ~((x[3]) | (x[2]));
    orr    t0, x3, x2
    mvn    t0, t0
    eor    x1, t0

    // x[0]^= x[2] & x[1];
    and    t0, x2, x1
    eor    x0, t0

    // Pi2
    ror    x1, #1              // x[1] = ROTR32(x[1], 1);
    ror    x2, #5              // x[2] = ROTR32(x[2], 5);
    ror    x3, #2              // x[3] = ROTR32(x[3], 2);

    add    i, #1               // i++
    b      noekeon_main
noekeon_end:
    // save 128-bit cipher text
    stm    x, {x0, x1, x2, x3}
    // restore registers, and return
    pop    {r0-r12, pc}

rc_tab:
    .byte  0x80
    .byte  0x1B, 0x36, 0x6C, 0xD8
    .byte  0xAB, 0x4D, 0x9A, 0x2F
    .byte  0x5E, 0xBC, 0x63, 0xC6
    .byte  0x97, 0x35, 0x6A, 0xD4
- RC5
RC5 is a symmetric block cipher designed in 1994 by Ronald Rivest. The version shown here uses 128-bit keys, processes 64-bit blocks of data, and uses 12 rounds.
It's been suggested that 20 rounds be used in order to mitigate a known weakness.
void rc5_encryptx(void *key, void *data) {
    uint32_t A, B, T;
    uint32_t L[4], S[RC5_KR];
    int i, j, r;
    uint32_t *kp=(uint32_t*)key;
    uint32_t *x=(uint32_t*)data;

    // initialize L with 128-bit key
    memcpy(&L, key, 16);

    A = 0xC983AE2D;
    // initialize S with constants
    for (i=RC5_KR; i>0;) {
      A -= RC5_Q;
      S[--i] = A;
    }

    A = B = 0;
    r = (RC5_KR*3);

    // create sub keys
    for (i=0, j=0; r>0; i++, j++, r--) {
      // i needs reset?
      if (i==RC5_KR) i=0;
      j &= 3;
      A = S[i] = ROTL32(S[i] + A+B, 3);
      B = L[j] = ROTL32(L[j] + A+B, A+B);
    }

    // assign sub keys to k ptr
    kp = S;

    // load 64-bit plain text
    A = x[0] + *kp++;
    B = x[1] + *kp++;

    // apply encryption
    for (i=RC5_ROUNDS*2; i>0; i--) {
      T = B;
      B = ROTL32(A ^ B, B) + *kp++;
      A = T;
    }
    // save 64-bit ciphertext
    x[0] = A; x[1] = B;
}
When initializing 2 registers to zero, instead of 2 separate instructions, we use UMULL with a register already set to zero, which is similar to using MUL with zero on x86 to initialize EAX and EDX to zero.
Since ARM doesn't have a ROL instruction, RSB is used to calculate 32 - n before using ROR: rotating right by 32 - n is equivalent to rotating left by n.
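The rotation macros used throughout the C sources can be defined as below (a sketch; the post's own headers may define them differently):

```c
#include <stdint.h>

// rotate a 32-bit word right / left by n bits
// (assumes 0 < n < 32 to avoid undefined behaviour)
#define ROTR32(v, n) (((v) >> (n)) | ((v) << (32 - (n))))
#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))
```

Note that ROTL32(v, n) equals ROTR32(v, 32 - n), which is exactly the identity the RSB instruction exploits when the rotate amount is in a register.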
k    .req r0
x    .req r1
x0   .req r2
x1   .req r3
t0   .req r4
t1   .req r5
t2   .req r6
i    .req r7
j    .req r8
r    .req r9
L    .req r10
S    .req r11
kp   .req r11

rc5_encryptx:
    // save registers
    push   {r0-r12, lr}
    // allocate memory
    sub    sp, #128

    // initialize L with 128-bit key
    // memcpy(&L, key, 16);
    mov    L, sp
    ldm    k, {r2-r5}
    stm    L, {r2-r5}

    // initialize S with constants
    ldr    x0, =#0xC983AE2D    // x0 = RC5_P precomputed
    ldr    x1, =#0x9E3779B9    // x1 = RC5_Q
    add    S, sp, #16
    mov    i, #26              // i = RC5_KR, RC5_KR=2*(ROUNDS+1)
init_sub:
    sub    x0, x0, x1          // x0 -= RC5_Q
    subs   i, i, #1            // i--
    str    x0, [S, i, lsl #2]  // S[i] = x0;
    bne    init_sub            // i>=0

    umull  x0, x1, i, i        // x0 = 0, x1 = 0
    mov    r, #78              // r = (RC5_KR*3)

    // ***********************************************
    // create the round keys
    // ***********************************************
    mov    j, i                // j = 0
rc5_sub:
    cmp    i, #26              // if (i == RC5_KR) i = 0
    moveq  i, #0               // zero if equal
    and    j, #3               // j &= 3

    // x0 = S[i] = ROTL32(S[i] + x0+x1, 3);
    add    x0, x1              // x0 = x0 + x1;
    ldr    t2, [S, i, lsl #2]
    add    x0, t2              // x0 += t2
    ror    x0, #29
    str    x0, [S, i, lsl #2]  // S[i] = x0

    // x1 = L[j] = ROTL32(L[j] + x0+x1, x0+x1);
    add    x1, x0, x1          // x1 = x0 + x1
    rsb    t0, x1, #32         // t0 = 32 - x1
    ldr    t2, [L, j, lsl #2]
    add    x1, t2              // x1 += t2
    ror    x1, t0
    str    x1, [L, j, lsl #2]  // L[j] = x1

    add    i, #1               // i++
    add    j, #1               // j++
    subs   r, #1               // r--
    bne    rc5_sub

    // ***********************************************
    // perform encryption
    // ***********************************************
    // load plaintext
    ldm    x, {x0, x1}         // x0 = x[0]; x1 = x[1];

    ldr    t0, [kp], #4
    add    x0, t0              // x0 += *kp++;
    ldr    t0, [kp], #4
    add    x1, t0              // x1 += *kp++;

    // apply encryption
    mov    i, #24              // i = RC5_KR - 2
rc5_enc:
    ldr    t0, [kp], #4        // t0 = *kp++;
    eor    x0, x1
    rsb    t1, x1, #32         // t1 = 32 - x1
    mov    t2, x1              // backup x1
    add    x1, t0, x0, ror t1
    mov    x0, t2
    subs   i, #1               // i--
    bne    rc5_enc             // i>0

    // save ciphertext
    stm    x, {x0, x1}         // x[0] = x0; x[1] = x1;
    // release memory
    add    sp, #128
    // restore registers
    pop    {r0-r12, pc}
- RC6
RC6 is a symmetric block cipher published in 1998. It was one of the 5 finalists considered for the Advanced Encryption Standard.
The main changes with RC6 are a 128-bit block, and support for 128, 192 and 256-bit keys. The version here will support only 256-bit keys.
The key schedule is essentially the same as RC5's, but obviously works with 256-bit keys. The encryption includes additional operations, most notably a multiplication.
void rc6_encryptx(void *key, void *data) {
    uint32_t x0, x1, x2, x3, t0, t1, t2;
    uint32_t L[8], S[RC6_KR];
    uint32_t *x, *kp;
    int i, j, r;

    x=(uint32_t*)data;

    // initialize L with 256-bit key
    memcpy(&L, key, 32);

    x0 = 0xE96A3D2F;
    // initialize S with constants
    for (i=RC6_KR; i>0;) {
      x0 -= RC6_Q;
      S[--i] = x0;
    }

    x0 = x1 = 0;
    r = (RC6_KR*3);

    // create sub keys
    for (i=0, j=0; r>0; i++, j++, r--) {
      // i needs reset?
      if (i==RC6_KR) i=0;
      j &= 7;
      x0 = S[i] = ROTL32(S[i] + x0+x1, 3);
      x1 = L[j] = ROTL32(L[j] + x0+x1, x0+x1);
    }

    // assign sub keys to k
    kp = S;

    // load plain text
    x0 = x[0]; x1 = x[1] + *kp++;
    x2 = x[2]; x3 = x[3] + *kp++;

    // apply encryption
    for (i=RC6_ROUNDS; i>0; i--) {
      t0 = ROTL32(x1 * (x1 + x1 + 1), 5);
      t1 = ROTL32(x3 * (x3 + x3 + 1), 5);

      t2 = x3;              // backup x3
      x0 ^= t0;
      x2 ^= t1;

      x3 = ROTL32(x0, t1);
      t1 = *kp++;
      x3 += t1;

      x0 = x1;              // move x1 into x0

      x1 = ROTL32(x2, t0);
      t0 = *kp++;
      x1 += t0;

      x2 = t2;              // move x3 into x2
    }
    // save cipher text
    x[0] = x0 + kp[0]; x[1] = x1;
    x[2] = x2 + kp[1]; x[3] = x3;
}
I considered using a multiplication to perform the division, but a conditional test is perfect here, especially on ARM with its conditional execution feature.
k    .req r0
x    .req r1
x0   .req r0
x1   .req r2
x2   .req r3
x3   .req r4
t0   .req r5
t1   .req r6
t2   .req r7
i    .req r8
j    .req r9
r    .req r10
L    .req r11
S    .req r12
kp   .req r12

// rc6_encryptx(void *key, void *data);
rc6_encryptx:
    // save registers
    push   {r0-r12, lr}
    // allocate memory
    sub    sp, #256

    // initialize L with 256-bit key
    // memcpy(&L, key, 32);
    mov    L, sp
    ldm    k, {r2-r9}
    stm    L, {r2-r9}

    // initialize S with constants
    ldr    x0, =#0xE96A3D2F    // x0 = RC6_P precomputed
    ldr    x1, =#0x9E3779B9    // x1 = RC6_Q
    add    S, sp, #32
    mov    i, #44              // i = RC6_KR, RC6_KR=2*(ROUNDS+2)
init_sub:
    sub    x0, x1              // x0 -= RC6_Q
    subs   i, #1               // --i
    str    x0, [S, i, lsl #2]  // S[i] = x0;
    bne    init_sub            // i>=0

    umull  x0, x1, i, i        // x0 = 0, x1 = 0
    mov    r, #132             // r = (RC6_KR*3)

    // ***********************************************
    // create the round keys
    // ***********************************************
    mov    j, i                // j = 0
rc6_sub:
    cmp    i, #44              // if (i == RC6_KR)
    moveq  i, #0               //   i = 0
    and    j, #7               // j &= 7

    // x0 = S[i] = ROTL32(S[i] + x0+x1, 3);
    add    x0, x1              // x0 = x0 + x1;
    ldr    t0, [S, i, lsl #2]
    add    x0, t0              // x0 += t0
    ror    x0, #29
    str    x0, [S, i, lsl #2]  // S[i] = x0

    // x1 = L[j] = ROTL32(L[j] + x0+x1, x0+x1);
    add    x1, x0, x1          // x1 = x0 + x1
    rsb    t0, x1, #32         // t0 = 32 - x1
    ldr    t1, [L, j, lsl #2]
    add    x1, t1              // x1 += t1
    ror    x1, t0
    str    x1, [L, j, lsl #2]  // L[j] = x1

    add    i, #1               // i++
    add    j, #1               // j++
    subs   r, #1               // r--
    bne    rc6_sub

    // ***********************************************
    // perform encryption
    // ***********************************************
    // load plaintext
    ldm    x, {x0, x1, x2, x3} // x0 = x[0]; x1 = x[1];
                               // x2 = x[2]; x3 = x[3];
    ldr    t0, [kp], #4
    add    x1, t0              // x1 += *kp++;
    ldr    t0, [kp], #4
    add    x3, t0              // x3 += *kp++;

    // apply encryption
    mov    i, #20              // i = RC6_ROUNDS
rc6_enc:
    // t0 = ROTL32(x1 * (x1 + x1 + 1), 5);
    add    t0, x1, x1
    add    t0, #1
    mul    t0, x1, t0
    ror    t0, #27

    // t1 = ROTL32(x3 * (x3 + x3 + 1), 5);
    add    t1, x3, x3
    add    t1, #1
    mul    t1, x3, t1
    ror    t1, #27

    mov    t2, x3              // backup x3
    eor    x0, t0              // x0 ^= t0;
    eor    x2, t1              // x2 ^= t1;

    // x3 = ROTL32(x0 ^ t0, t1) + *kp++;
    rsb    t1, #32             // t1 = 32 - t1
    ror    x0, t1
    ldr    t1, [kp], #4        // t1 = *kp++
    add    x3, x0, t1          // x3 = ROTL32(x0, t1) + *kp++

    mov    x0, x1              // move x1 into x0

    // x1 = ROTL32(x2 ^ t1, t0) + *kp++;
    rsb    t0, #32             // t0 = 32 - t0
    ror    x2, t0
    ldr    t1, [kp], #4        // t1 = *kp++
    add    x1, x2, t1          // x1 = ROTL32(x2, t0) + *kp++

    mov    x2, t2              // move x3 into x2
    subs   i, #1               // i--
    bne    rc6_enc             // i>0

    // save ciphertext
    ldr    t0, [kp], #4        // x0 += *kp++;
    add    x0, t0
    ldr    t0, [kp], #4        // x2 += *kp++;
    add    x2, t0
    stm    x, {x0, x1, x2, x3} // x[0] = x0; x[1] = x1;
                               // x[2] = x2; x[3] = x3;
    // release memory
    add    sp, #256
    // restore registers
    pop    {r0-r12, pc}
Permutation Functions
- Gimli
Gimli, named after the Lord Of The Rings character, is a 384-bit permutation function, designed specifically to be lightweight, high performance, and secure across multiple architectures.
void gimli(void *state) {
    int      r, j;
    uint32_t t, x, y, z;
    uint32_t *s=(uint32_t*)state;

    for (r=24; r>0; --r) {
      // apply SP-box
      for (j=0; j<4; j++) {
        x = ROTR32(s[    j],  8);
        y = ROTR32(s[4 + j], 23);
        z =        s[8 + j];

        s[8 + j] = x ^ (z << 1) ^ ((y & z) << 2);
        s[4 + j] = y ^ x        ^ ((x | z) << 1);
        s[j]     = z ^ y        ^ ((x & y) << 3);
      }

      // apply Linear layer
      t = r & 3;

      // if zero, do small swap
      if (t == 0) {
        XCHG(s[0], s[1]);
        XCHG(s[2], s[3]);
        // add constant
        s[0] ^= (0x9e377900 | r);
      }
      // if 2, do big swap
      if (t == 2) {
        XCHG(s[0], s[2]);
        XCHG(s[1], s[3]);
      }
    }
}
One difference between the assembly and the C is that the state is denoted x instead of s.
Implementing the swap was a problem: ARM doesn't have an XCHG instruction like x86, so I thought of a few ways it could be done.
// XOR swap
cmp   r1, r2    // a == b
eorne r1, r2    // a ^= b
eorne r2, r1    // b ^= a
eorne r1, r2    // a ^= b
// Pushing/popping registers is a viable option
push  {r1}
mov   r1, r2
pop   {r2}
// Using the load/store instructions
ldr   r1, [r0]
ldr   r2, [r0, #4]
str   r2, [r0]
str   r1, [r0, #4]
// double load and store
ldrd  r2, r3, [r0]
strd  r3, r2, [r0]
The version here uses a temporary register, which will do for now.
s    .req r0
// state
x0   .req r1
x1   .req r2
x2   .req r3
x3   .req r4
// loop counters
r    .req r5
c    .req r6
j    .req r7
// used during permutation
x    .req r8
y    .req r9
z    .req r10
// temporary registers
t0   .req r11
t1   .req r12

gimlix:
    // save registers
    push   {r0-r12, lr}

    mov    r, #24              // r = 24
    ldr    c, =#0x9e377900     // c = 0x9e377900;
gimli_main:
    mov    j, #4               // j = 4
    sub    x0, s, #4           // x0 = s - 1
gimli_perm:
    ldr    x, [x0, #4]         // x = ROTR32(s[j], 8);
    ror    x, #8
    ldr    y, [x0, #20]        // y = ROTR32(s[4 + j], 23);
    ror    y, #23
    ldr    z, [x0, #36]        // z = s[8 + j];

    // s[8 + j] = x ^ (z << 1) ^ ((y & z) << 2);
    eor    t0, x, z, lsl #1    // t0 = x ^ (z << 1)
    and    t1, y, z            // t1 = y & z
    eor    t0, t1, lsl #2      // t0 = t0 ^ (t1 << 2)
    str    t0, [x0, #36]       // s[8 + j] = t0

    // s[4 + j] = y ^ x ^ ((x | z) << 1);
    eor    t0, y, x            // t0 = y ^ x
    orr    t1, x, z            // t1 = x | z
    eor    t0, t1, lsl #1      // t0 = t0 ^ (t1 << 1)
    str    t0, [x0, #20]       // s[4 + j] = t0

    // s[j] = z ^ y ^ ((x & y) << 3);
    eor    t0, z, y            // t0 = z ^ y
    and    t1, x, y            // t1 = x & y
    eor    t0, t1, lsl #3      // t0 = t0 ^ (t1 << 3)
    str    t0, [x0, #4]!       // s[j] = t0, s++

    subs   j, #1               // j--
    bne    gimli_perm          // j>0

    // load 16 bytes of state
    ldm    s, {x0, x1, x2, x3}

    // apply linear layer
    // t0 = (r & 3);
    ands   t0, r, #3

    // XCHG(s[0], s[1]);
    moveq  t1, x0
    moveq  x0, x1
    moveq  x1, t1

    // XCHG(s[2], s[3]);
    moveq  t1, x2
    moveq  x2, x3
    moveq  x3, t1

    // s[0] ^= (0x9e377900 | r);
    orreq  t1, c, r
    eoreq  x0, t1

    // if (t == 2)
    cmp    t0, #2

    // XCHG(s[0], s[2]);
    moveq  t1, x0
    moveq  x0, x2
    moveq  x2, t1

    // XCHG(s[1], s[3]);
    moveq  t1, x1
    moveq  x1, x3
    moveq  x3, t1

    // save state
    stm    s, {x0, x1, x2, x3}

    subs   r, #1               // r--
    bne    gimli_main          // r>0

    // restore registers
    pop    {r0-r12, pc}
- Xoodoo
Xoodoo is a new permutation function first described in November 2017. It's very similar to Keccak, but uses the same parameters as Gimli.
void xoodoo(void *state) {
    uint32_t e[4], x0, x1, x2, t;
    int r, i;
    uint32_t *x=(uint32_t*)state;
    uint32_t rc[12]=
    { 0x058, 0x038, 0x3c0, 0x0d0,
      0x120, 0x014, 0x060, 0x02c,
      0x380, 0x0f0, 0x1a0, 0x012 };

    // 12 rounds by default
    for (r=0; r<12; r++) {
      // Theta
      for (i=0; i<4; i++) {
        e[i] = ROTR32(x[i] ^ x[i+4] ^ x[i+8], 18);
        e[i]^= ROTR32(e[i], 9);
      }
      for (i=0; i<12; i++) {
        x[i]^= e[(i - 1) & 3];
      }

      // Rho west
      XCHG(x[7], x[4]);
      XCHG(x[7], x[5]);
      XCHG(x[7], x[6]);

      // Iota
      x[0] ^= rc[r];

      // Chi and Rho east
      for (i=0; i<4; i++) {
        x0 = x[i+0];
        x1 = x[i+4];
        x2 = ROTR32(x[i+8], 21);

        x[i+8] = ROTR32((x1 & ~x0) ^ x2, 24);
        x[i+4] = ROTR32((x0 & ~x2) ^ x1, 31);
        x[i+0]^= x2 & ~x1;
      }
      XCHG(x[8], x[10]);
      XCHG(x[9], x[11]);
    }
}
The assembly here is not optimized; it's a compact implementation written for simplicity.
state .req r0
x     .req r1
r     .req r2
i     .req r3
x0    .req r4
x1    .req r5
x2    .req r6
x3    .req r7
rc    .req r8
xt    .req r9
e     .req sp

xoodoo:
    // save registers
    push   {r0-r12, lr}

    mov    r, #12              // 12 rounds
    sub    sp, #16             // allocate 16 bytes
    adr    rc, rc_tab
xoodoo_main:
    mov    i, #0               // i = 0
    mov    x, state
theta_l0:
    ldr    x2, [x, #32]        // x2 = x[i+8]
    ldr    x1, [x, #16]        // x1 = x[i+4]
    ldr    x0, [x], #4         // x0 = x[i+0], advance x by 4

    // e[i] = ROTR32(x[i] ^ x[i+4] ^ x[i+8], 18);
    eor    x0, x1
    eor    x0, x2
    ror    x0, #18

    // e[i]^= ROTR32(e[i], 9);
    eor    x0, x0, ror #9
    str    x0, [sp, i, lsl #2] // store in e

    add    i, #1               // i++
    cmp    i, #4               // i<4
    bne    theta_l0

    // x[i]^= e[(i - 1) & 3];
    mov    i, #0               // i = 0
    mov    x, state            // x = state
theta_l1:
    sub    xt, i, #1
    and    xt, #3              // xt = (i - 1) & 3
    ldr    xt, [sp, xt, lsl #2] // xt = e[(i - 1) & 3]
    ldr    x0, [x, i, lsl #2]  // x0 = x[i]
    eor    x0, xt              // x0 ^= xt
    str    x0, [x, i, lsl #2]  // x[i] = x0
    add    i, #1               // i++
    cmp    i, #12              // i<12
    bne    theta_l1

    // Rho west
    // XCHG(x[7], x[4]);
    // XCHG(x[7], x[5]);
    // XCHG(x[7], x[6]);
    add    x, state, #16       // x = &state[4]
    ldm    x, {x0, x1, x2, x3}
    mov    xt, x0              // xt = x[4]
    mov    x0, x3              // x[4] = x[7]
    mov    x3, x2              // x[7] = x[6]
    mov    x2, x1              // x[6] = x[5]
    mov    x1, xt              // x[5] = xt
    stm    x, {x0, x1, x2, x3}

    mov    x, state

    // Iota
    // x[0] ^= rc[r];
    ldrh   xt, [rc], #2        // load half-word, advance by 2
    ldr    x0, [x]             // load word
    eor    x0, xt              // xor
    str    x0, [x]             // store word

    mov    i, #4
chi:
    // Chi and Rho east
    // x0 = x[i+0];
    ldr    x0, [x]
    // x1 = x[i+4];
    ldr    x1, [x, #16]
    // x2 = ROTR32(x[i+8], 21);
    ldr    x2, [x, #32]
    ror    x2, #21

    // x[i+8] = ROTR32((x1 & ~x0) ^ x2, 24);
    bic    xt, x1, x0
    eor    xt, x2
    ror    xt, #24
    str    xt, [x, #32]

    // x[i+4] = ROTR32((x0 & ~x2) ^ x1, 31);
    bic    xt, x0, x2
    eor    xt, x1
    ror    xt, #31
    str    xt, [x, #16]

    // x[i+0]^= x2 & ~x1;
    bic    xt, x2, x1
    eor    xt, x0
    str    xt, [x], #4

    // i--
    subs   i, #1
    bne    chi

    add    x, state, #32       // x = &state[8]
    // XCHG(x[8], x[10]);
    ldm    x, {x0, x1, x2, x3}
    push   {x0}
    mov    x0, x2
    pop    {x2}
    // XCHG(x[9], x[11]);
    push   {x1}
    mov    x1, x3
    pop    {x3}
    stm    x, {x0, x1, x2, x3}

    subs   r, #1               // r--
    bne    xoodoo_main         // r>0

    // release stack
    add    sp, #16
    // restore registers, and return
    pop    {r0-r12, pc}

// round constants
rc_tab:
    .hword 0x058, 0x038, 0x3c0, 0x0d0
    .hword 0x120, 0x014, 0x060, 0x02c
    .hword 0x380, 0x0f0, 0x1a0, 0x012
Summary
You can accelerate your learning of a CPU's assembly language by using a C compiler to generate the assembly. The ARM code generator may not be as mature as the one for x86, but you'll still learn a lot from studying its output.
Sources to all algorithms mentioned can be found here.
If you have comments or requests, my e-mail can be found here