Shellcode: Windows API hashing with block ciphers ( Maru Hash )


String/Pattern Matching algorithms are by far the most popular and easy way to detect a shellcode. The principle is simple: all codes have unique characteristics that can be used as signatures to identify in memory. Even shellcodes with no prior analysis can contain some known value, or piece of code that at least makes it appear suspicious, and worthy of closer inspection. The known values I’m referring to are of course API strings and hashes that have always been useful in detecting malicious code for 20+ years. If we’re going to develop more advanced shellcodes that go undetected by modern AV solutions, we need to consider using a HLL like C or C++ along with unconventional programming that guarantees everything in a shellcode is permutable. It’s not just the executable code that should be permutable, but the entire thing; code, data, constants, strings..everything. What I show here today is how to create permutable API hashes using a hash function called Maru (Ma-roo), named after the Japanese cat. Maru uses a block cipher to generate a hash of string. You don’t have to use a block cipher for this, but a cryptographic primitive can help protect the hashes against collision attacks found using Satisfiability Modulo Theories (SMT). For more information, read Using SAT and SMT to defeat simple hashing algorithms.Readers may also wish to consider reading the section on cryptography in Quick introduction into SAT/SMT solvers and symbolic execution


**WARNING: Maru is not a cryptographic hash, and should not be used for applications where a secure hash algorithm is required.**

Signature detection

When viruses for MS-DOS started to appear 30 years ago, detection by signature was relatively simple. In an attempt to hide code, authors used a simple XOR operation with an 8-bit value. For example, the ‘Skism Rythem Stack Virus-808‘ from 1991 used milliseconds of the current time to “encrypt” each new infection, so it would appear differently for 100 versions. This is the entry point of that code, and its “decryption” process.

jmp    virus_start

encrypt_val db 00h


    call   encrypt         ; encrypt/decrypt file
    jmp    virus           ; go to start of code


    push   cx
    mov    bx, offset virus_code ; start encryption at data

    mov    ch, [bx]        ; read current byte
    xor    ch, encrypt_val ; get encryption key
    mov    [bx], ch        ; switch bytes
    inc    bx              ; move bx up a byte
    cmp    bx,  offset virus_code + virus_size
    ; are we done with the encryption
    jle    xor_loop        ; no? keep going
    pop    cx

Later in the code, the virus will update ‘encrypt_val’ using the milliseconds value of the current time.

    mov    ah, 2ch         ; get time
    int    21h

    mov    encrypt_val, dl ; save m_seconds to encrypt val so
                           ; theres 100 mutations possible

There’s some effort to hide the body of the virus here, but the decrypter is still visible and easily detected. Moreover, a scanner can emulate the decrypter routine, exposing the body of virus in memory before performing detection by signature. When Win32 viruses started to appear in 1997, there were no longer MS-DOS interrupts. Instead there were API functions, accessible through dynamic libraries or DLL, and this made detecting malware even easier. Any code with a large block of API strings in it was immediately going to raise alarm bells. Authors eventually started to use 32-bit checksums, the most popular of which was CRC32. Later, LSD-PL inspired a trend of using some basic Add-Rotate-Xor routine, like the following that appeared in their WASM package.

i3: mov   esi,[4*eax+ebx]    ; ptr to symbol name
    add   esi,edx            ; rva2va

    push  ebx      ; hash: h=((h<<5)|(h>>27))+c
    push  eax
    xor   ebx,ebx
i4: xor   eax,eax  ; hash loop
    rol   ebx,5
    add   ebx,eax
    cmp   eax,0
    jnz   i4 
    ror   ebx,5 
    cmp   ebx,ecx  ; hash compare
    pop   eax
    pop   ebx
    je    i5       ; same: symbol found
    inc   eax
    jmp   i3       ; different: go to the next symbol

Most legitimate code doesn’t have to resolve an API by string or hash manually through Import Address Table (IAT) or Export Address Table (EAT) and if some code does behave this way, it’s usually up to no good. What’s funny about API hashing; It was supposed to prevent signature detection of strings. So, instead of detecting by string, AV just detected the hash of an API string, that still makes it a perfectly valid signature. Even the shellcodes generated by many popular payload tools (BeEF, Meterpreter, Veil) all use basic hashing algorithms with no entropy or randomization. They are consistently the same, thus are just as easy to detect as any API string itself. Of course, these packages offer the ability to encrypt payloads with a stream or block cipher like RC4 or AES. Again though, even if it will pass a simple scan by AV or IDS, a more thorough analysis using an emulator can decrypt the main code and perform pattern matching afterwards, having no problem with detection.

Plain strings problem

Ideally, an attacker would want everything obfuscated/encrypted during compilation and decrypted at run-time without using additional tools or code. constexpr is ideal for C++ code, but for C, there’s no such feature unless of course, a compiler can be modified to encrypt all strings at compile time? Some assemblers such as MASM and NASM support macro based compilation, but are cumbersome to work with on complex functions. (not that you couldn’t) Z0MBiE already discussed the problem of plain strings in Data Encoding In Meta Viruses and Solving Plain Strings Problem In HLL around 2002 that did somewhat “solve” the problem, but if you look closely at the sources, there’s insufficient entropy used. At the time, he provided a library VirStr with TASM macros that worked with Borland’s C compiler, but these do not work with mingw, msvc or clang and the library has never been updated since. The purpose of using entropy for hashing of API strings is obviously to increase the difficulty of detecting a malicious payload. It’s not enough to avoid detection forever, but it’s certainly better than existing hash algorithms with no entropy at all.

Hash function constructions

We can construct new hash algorithms from block ciphers, and these offer us the potential to use unique keys, thus generating completely different API hashes every time the key changes. However, as you will see, very few block ciphers are suitable for lightweight applications such as shellcode. Here are 3 potential constructions to use. (there are more, but only 3 are covered)

1. Davies–Meyer

H_i=E_{m_i}(H_{i-1})\oplus H_{i-1}

Feeds each block of the message (m_i) as the key to a block cipher. Feeds the previous hash value (H_{i-1}) as the plaintext to be encrypted. The output ciphertext is then also XORed with the previous hash value (H_{i-1}) to produce the next hash value (H_i). In the first round when there is no previous hash value it uses a constant pre-specified initial value (H_0).

2. Matyas–Meyer–Oseas

H_i=E_g{(H_{i-1})}(m_i)\oplus m_i

Feeds each block of the message (m_i) as the plaintext to be encrypted. The output ciphertext is then also XORed with the same message block (m_i) to produce the next hash value (H_i). The previous hash value (H_{i-1}) is fed as the key to the block cipher. In the first round when there is no previous hash value it uses a constant pre-specified initial value (H_0).

3. Miyaguchi–Preneel

H_i = E_{g(H_{i-1})}(m_i)\oplus H_{i-1}\oplus m_i

Feeds each block of the message (m_i) as the plaintext to be encrypted. The output ciphertext is then XORed with the same message block (m_i) and then also XORed with the previous hash value (H_{i-1}) to produce the next hash value (H_i). The previous hash value (H_{i-1}) is fed as the key to the block cipher. In the first round when there is no previous hash value it uses a constant pre-specified initial value (H_0).

Choosing a block cipher

We know signature detections require some static value, and what better static values exist than cryptographic constants? You can of course change them, but then it might affect security of the cipher. So, it’s better to choose a cipher that doesn’t use any constants, at all. While looking at potential block ciphers, my top 3 were based on Add-Rotate-Xor (ARX) designs.


Designed by the NSA and published in 2013. Supports a wide range of key and block sizes for different architectures. Proven to be the smallest software based block cipher available for most architectures. Parameters for x86: 64-bit block, 128-bit key. For x86-64: 128-bit block, 256-bit key. Both implemented in 64 and 86 bytes respectively. Here’s the x86 code.


Designed and published in 2015. Uses an Even-Mansour design with permutation function derived from SipHash. 128-bit keys, 128-bit blocks. Can be implemented in approx. 90 bytes using x86 assembly.


eXtended Tiny Encryption Algorithm. A Feistel cipher, published in 1997 after weaknesses found in TEA. 128-bit keys, and 64-bit block. Has a known constant, 0x9E3779B9 that makes it susceptible to signature based detection. Can be implemented in approx. 80 bytes.

SPECK is just the better design for lightweight applications and can’t be easily identified by signature. (at least not from any constants)


The padding is similar that used in MD4 and MD5, but instead of the string length stored in bits as 64-bit integer, only the string length as a 32-bit integer is stored. Since it will never exceed 256-bits for Maru 1 or 512 bits for Maru 2, there didn’t seem any point using a 64-bit integer.

SPECK Parameters

Version 1 is suitable for 32-bit architectures while version 2 is suitable for 64-bit.

Ver. Cipher Block Size Key Size
1 SPECK 64 128
2 SPECK 128 256

32-bit Architectures

The following is implementation of SPECK for Ver. 1

// SPECK-64/128
uint64_t speck(void *mk, uint64_t p) {
    uint32_t k[4], i, t;
    union {
      uint32_t w[2];
      uint64_t q;
    } x;
    // copy plaintext to local buffer
    x.q = p;
    // copy master key to local buffer
    for(i=0;i<4;i++) k[i]=((uint32_t*)mk)[i];
    for(i=0;i<27;i++) {
      // encrypt plaintext
      x.w[0] = (ROTR32(x.w[0], 8) + x.w[1]) ^ k[0],
      x.w[1] =  ROTR32(x.w[1],29) ^ x.w[0], t = k[3],
      // create next subkey
      k[3] = (ROTR32(k[1],8)  + k[0]) ^ i,
      k[0] =  ROTR32(k[0],29) ^ k[3],
      k[1] = k[2], k[2] = t;
    // return 64-bit ciphertext
    return x.q;

64-bit architectures

The following is implementation for Ver.2

void speck(void *in, void *mk, void *out){
  uint64_t i,t,k[4],

  // copy 128-bit plaintext to local buffer
  r[0] = h[0]; r[1] = h[1];
  // copy 256-bit master key to local buffer
  for(i=0;i<4;i++) k[i] = ((uint64_t*)mk)[i];
  for(i=0;i<34;i++) {
    // encrypt plaintext
    r[1] = (ROTR64(r[1], 8) + r[0]) ^ k[0],
    r = ROTR64(r[0], 61) ^ r[1], t=k[3],
    // create next subkey
    k[3] = (ROTR64(k[1], 8) + k[0]) ^ i,
    k[0] = ROTR64(k[0], 61) ^ k[3],
    k[1] = k[2], k[2] = t;

Maru Parameters

The numbers below (excluding Ver.) represent number of bits.

Ver. CPU Construction Padding Key Message Max. Input Output
1 32 Davies-Meyer Merkle-Damgard 64 128 256 64
2 64 Davies-Meyer Merkle-Damgard 128 256 512 128

There are 2 64-bit constants used to initialize H_0 that are not really of any significance. I just wanted something “random” without using a random number generator. Here is how to generate them using SpeedCrunch.

The 2nd value shown is used to initialize H_0 in both versions. The n-bit seed is xor’d against these values to make the hash output different. I’ve emphasized the importance of eliminating constants that are not really permutable, and can be used as signatures, so why use them here? You don’t have to use them. I’m only using here for illustration, but you can generate your own values to initialize H_0 or just use zero instead.

Maru 1

The only real difference between the two would be cipher parameters and local capacity for keys. We use a 32-bit seed here to initialize H_0.

uint64_t maru(const char *api, uint64_t iv) {
    uint64_t h;
    uint32_t len, idx, end, i;
    union {
      uint8_t  b[MARU_BLK_LEN];
      uint32_t w[MARU_BLK_LEN/4];
    } m;
    // set H to initial value
    h = iv;
    for(idx=0, len=0, end=0;!end;) {
      // end of string or max len?
      if(api[len] == 0 || len == MARU_MAX_STR) {
        // zero remainder of M
        for(i=idx;i<MARU_BLK_LEN;i++) m.b[i]=0;
        // store the end bit
        m.b[idx] = 0x80;
        // have we space in M for api length?
        if(idx >= MARU_BLK_LEN - 4) {
          // no, update H with E
          h ^= MARU_CRYPT(&m, h);
          // zero M
          for(i=0;i<MARU_BLK_LEN;i++) m.b[i]=0;
        // store total length in bits
        m.w[(MARU_BLK_LEN/4)-1] = (len * 8);
        idx = MARU_BLK_LEN;
      } else {    
        // store character from api string
        m.b[idx] = (uint8_t)api[len]; 
        idx++; len++;
      if(idx == MARU_BLK_LEN) {
        // update H with E
        h ^= MARU_CRYPT(&m, h);
        // reset idx
        idx = 0;
    return h;

Maru 2

Here, we use a 64-bit seed. The 128-bit result is stored in out parameter.

void maru2(const char *key, uint64_t iv, void *out) {
    union { uint64_t q[2]; uint32_t w[4]; uint8_t b[16]; } c, h;
    union { uint64_t q[4]; uint32_t w[8]; uint8_t b[32]; } m;
    int     len, idx, i, end;

    // initialize H with iv
    h.q[0] = MARU2_INIT_B ^ iv;
    h.q[1] = MARU2_INIT_D ^ iv;
    for (idx=0, len=0, end=0; !end; ) {
      // end of string or max len?
      if (key[len]==0 || len==MARU2_MAX_STR) {
        // zero remainder of M
        for(i=idx;i<MARU2_BLK_LEN;i++) m.b[i]=0;
        // add end bit
        m.b[idx] = 0x80;
        // have we space in M for len?
        if (idx >= MARU2_BLK_LEN-4) {
          // no, encrypt H
          MARU2_CRYPT(&h, &m, &c);
          // update H
          h.q[0] ^= c.q[0];
          h.q[1] ^= c.q[1];          
          // zero M
          for(i=0;i<MARU2_BLK_LEN;i++) m.b[i]=0;
        // add total len in bits
        m.w[(MARU2_BLK_LEN/4)-1] = (len * 8);
        idx = MARU2_BLK_LEN;
      } else {    
        // add byte to M
        m.b[idx] = (uint8_t)key[len];
        idx++; len++;
      if (idx == MARU2_BLK_LEN) {
        // encrypt H
        MARU2_CRYPT(&h, &m, &c);
        // update H
        h.q[0] ^= c.q[0];
        h.q[1] ^= c.q[1];
        idx = 0;
      ((uint8_t*)out)[i] = h.b[i];   


Both versions performed well when tested with Robert G. Browns dieharder tool. The tests involved setting a string buffer to all 1s and incrementing sequentially to generate hashes, similar to how one might test a PRNG.

  • Maru 1

  • Maru 2

Of course, just because the hash output passed dieharder tests, that doesn’t necessarily mean we have a good hash function, but it’s at the very least a good sign.

Example Hashes

  • Maru 1

  • Maru 2

As you can see above, once the seed value is changed, the hash output also changes.


We can construct better hashing algorithms for API strings using block ciphers. Maru isn’t intended for digital signatures or mission critical applications that requires robust collision resistant hashing functions, but it is good enough for hashing short strings like API. Maru is really an idea about how to introduce entropy into the string hashing process, while trying to increase difficulty of string collisions. You can see full source for Maru hash here. About Maru the cat

This entry was posted in assembly, programming, shellcode, windows and tagged , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s