Introduction
Injecting Position Independent Code (PIC) into a remote process is simple enough for most programmers, but calling the CreateRemoteThread() API from a Wow64 process against a 64-bit process fails. Transitioning from 32-bit to 64-bit mode was discussed by rgb/29a in his article Heaven’s Gate: 64-bit code in 32-bit file around 2009. ReWolf has also written extensively on the issue and published a helper library in C/ASM that lets x86 applications read and write the memory of x64 applications and call any NTDLL.DLL API through his X64Call() function. It is probably the best solution for developers who want to solve the problem without hand-writing assembly. There is plenty of information available about what I’m discussing here, and a freely available open source solution exists if compiling two separate binaries isn’t to your liking. This post only documents a DLL/PIC injection tool called ‘pi’, written for the purpose of testing win32/win64 shellcodes.
Traditional method
The steps will be familiar to anyone who has struggled with the problem:
API | Description |
---|---|
OpenProcess | self explanatory, open the process we want to inject PIC into |
VirtualAllocEx | allocate read/write/executable memory for our PIC |
WriteProcessMemory | write PIC code |
VirtualProtectEx | optionally change the memory to read/execute only |
CreateRemoteThread | run the PIC code as a thread |
WaitForSingleObject | optionally wait for the thread to exit (or crash!) |
All is well until you hit CreateRemoteThread. Even calling NtCreateThreadEx() directly from Wow64 still won’t work (I’m open to correction on this). To circumvent the limitation, the Wow64 process transitions to 64-bit mode, as demonstrated by rgb/29a and ReWolf in their articles and code, resolves the address of NtCreateThreadEx(), and uses it to start the thread in the remote process.
Wow64 detection
Various methods of detecting Wow64/64-bit mode in assembly have been suggested over the years. Peter Ferrie published one that uses REX prefixes, and there are undoubtedly many others published on the internet. isWow64 also relies on a REX prefix: the byte 0x48 is “DEC EAX” in 32-bit mode but a REX.W prefix in 64-bit mode. The function returns TRUE when executed in 32-bit mode and FALSE in 64-bit mode. If calling from ASM, you can simply check ZF (zero flag) after it returns: it is clear in 32-bit mode and set in 64-bit mode.
    ; returns TRUE or FALSE
    isWow64:
    _isWow64:
        bits 32
        xor   eax, eax
        dec   eax
        neg   eax
        ret
32-bit mode: xor eax, eax makes eax zero. dec eax makes it -1. neg eax makes it 1 and leaves ZF clear. The function returns 1 in eax.
    /* 0000 */ "\x31\xc0" /* xor eax, eax */
    /* 0002 */ "\x48"     /* dec eax */
    /* 0003 */ "\xf7\xd8" /* neg eax */
    /* 0005 */ "\xc3"     /* ret */
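64-bit mode: xor eax, eax makes rax zero. The 0x48 byte is consumed as a REX prefix, so the next instruction becomes neg rax, which leaves rax at zero and sets ZF. The function returns zero in rax.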
    /* 0000 */ "\x31\xc0"     /* xor eax, eax */
    /* 0002 */ "\x48\xf7\xd8" /* neg rax */
    /* 0005 */ "\xc3"         /* ret */
The following by Peter Ferrie uses the 0x40 REX prefix, which is “inc eax” in 32-bit mode.
32-bit mode: xor eax, eax makes eax zero. inc eax makes eax 1 and clears ZF. xchg edx, eax moves the 1 into edx without touching the flags, so the jz is not taken.
    /* 0000 */ "\x31\xc0" /* xor eax, eax */
    /* 0002 */ "\x40"     /* inc eax */
    /* 0003 */ "\x92"     /* xchg edx, eax */
    /* 0004 */ "\x74\x00" /* jz 0x6 */
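64-bit mode: xor eax, eax makes eax zero and sets ZF. The 0x40 byte is consumed as a REX prefix, so the next instruction is xchg edx, eax, which makes edx zero and leaves ZF set. The jz is then taken to the 64-bit code.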
    /* 0000 */ "\x31\xc0" /* xor eax, eax */
    /* 0002 */ "\x40\x92" /* xchg edx, eax */
    /* 0004 */ "\x74\x00" /* jz 0x6 */
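
To make the dual decoding concrete, here is a minimal C sketch that interprets only the handful of opcodes used by the two stubs above, once as 32-bit code and once as 64-bit code. It is a toy model and not a disassembler; the names cpu_state and run_stub are invented for this illustration and are not part of pi.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy CPU state: only what the two stubs touch. */
typedef struct {
    uint64_t rax;
    uint64_t rdx;
    int      zf;
} cpu_state;

/* Interprets xor eax,eax / inc eax / dec eax / neg / xchg edx,eax / ret.
 * mode64 != 0 selects 64-bit decoding, where 0x40..0x4F are REX prefixes.
 * Returns 0 on success, -1 on any byte this toy model does not know. */
int run_stub(const uint8_t *code, size_t len, int mode64, cpu_state *st)
{
    size_t i = 0;

    assert(code != NULL && st != NULL);
    st->rax = 0;
    st->rdx = 0;
    st->zf  = 0;

    while (i < len) {
        int     rex_w = 0;
        uint8_t op    = code[i++];

        if (op >= 0x40 && op <= 0x4F) {
            if (mode64) {
                /* REX prefix: modifies the instruction that follows. */
                rex_w = (op & 0x08) != 0;
                if (i >= len) return -1;
                op = code[i++];
            } else {
                /* 32-bit mode: one-byte inc/dec (only eax is modelled). */
                if (op == 0x40)      st->rax = (uint32_t)(st->rax + 1);
                else if (op == 0x48) st->rax = (uint32_t)(st->rax - 1);
                else return -1;
                st->zf = (st->rax == 0);
                continue;
            }
        }

        switch (op) {
        case 0x31: /* 31 C0: xor eax, eax */
            if (i >= len || code[i++] != 0xC0) return -1;
            st->rax = 0;
            st->zf  = 1;
            break;
        case 0xF7: /* F7 D8: neg eax, or neg rax with REX.W */
            if (i >= len || code[i++] != 0xD8) return -1;
            st->rax = rex_w ? (0 - st->rax)
                            : (uint32_t)(0 - (uint32_t)st->rax);
            st->zf = (st->rax == 0);
            break;
        case 0x92: { /* xchg edx, eax: flags are not modified */
            uint64_t tmp = (uint32_t)st->rax;
            st->rax = (uint32_t)st->rdx;
            st->rdx = tmp;
            break;
        }
        case 0xC3: /* ret */
            return 0;
        default:
            return -1;
        }
    }
    return 0;
}
```

Feeding it the isWow64 bytes 31 C0 48 F7 D8 C3 leaves rax at 1 with ZF clear when mode64 is 0, and rax at 0 with ZF set when mode64 is 1. Feeding it the first four bytes of the second stub (31 C0 40 92) shows the same split in ZF.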
Obtaining API address
Rather than searching the NTDLL export table for LdrGetProcedureAddress and passing it the string “NtCreateThreadEx”, we search for a hash of the name using the old, simple hash algorithm originally suggested by LSD-PL in their winasm presentation, published in 2002.
I found a nifty NASM macro, originally written by Vecna/29a and converted to NASM syntax by Jibz, the author of aPLib, that computes the hash at assembly time.
The getapi function locates NTDLL through the 64-bit PEB (Process Environment Block) and then searches the export table for the function whose name hash matches the one passed in eax.
    %define ROL_N 5

    %macro HASH 1.nolist
      %assign %%h 0
      %strlen %%len %1
      %assign %%i 1
      %rep %%len
        %substr %%c %1 %%i
        %assign %%h ((%%h + %%c) & 0FFFFFFFFh)
        %assign %%h ((%%h << ROL_N) & 0FFFFFFFFh) | (%%h >> (32-ROL_N))
        %assign %%i (%%i+1)
      %endrep
      %assign %%h ((%%h << ROL_N) & 0FFFFFFFFh) | (%%h >> (32-ROL_N))
      dd %%h
    %endmacro

    ; mov eax, HASH "string"
    %macro hmov 1.nolist
      db 0B8h
      HASH %1
    %endmacro

    getapi:
        bits 64
        push  rsi
        push  rdi
        push  rbx
        push  rcx
        mov   r8, rax
        push  60h
        pop   rsi
        mov   rax, qword [gs:rsi]
        mov   rax, [rax+18h]
        mov   r10, [rax+30h]
    l_dll:
        mov   rbp, [r10+10h]
        test  rbp, rbp
        mov   eax, ebp
        jz    xit_getapi
        mov   r10, [r10]
        mov   eax, [rbp+3Ch]        ; IMAGE_DOS_HEADER.e_lfanew
        add   eax, 10h
        mov   eax, [rbp+rax+78h]
        lea   rsi, [rbp+rax+18h]    ; IMAGE_EXPORT_DIRECTORY.NumberOfNames
        lodsd
        xchg  eax, ecx
        jecxz l_dll
        lodsd                       ; IMAGE_EXPORT_DIRECTORY.AddressOfFunctions
        ; EMET will break on the following instruction
        lea   r11, [rbp+rax]
        lodsd                       ; IMAGE_EXPORT_DIRECTORY.AddressOfNames
        lea   rdi, [rbp+rax]
        lodsd                       ; IMAGE_EXPORT_DIRECTORY.AddressOfNameOrdinals
        lea   rbx, [rbp+rax]
    l_api:
        mov   esi, [rdi+4*rcx-4]
        add   rsi, rbp
        xor   eax, eax
        cdq
    h_api:
        lodsb
        add   edx, eax
        rol   edx, ROL_N
        dec   eax
        jns   h_api
        cmp   edx, r8d
        loopne l_api
        jne   l_dll
        movzx edx, word [rbx+2*rcx]
        mov   eax, [r11+4*rdx]
        add   rax, rbp
    xit_getapi:
        pop   rcx
        pop   rbx
        pop   rdi
        pop   rsi
        ret
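
The hash computed by the HASH macro and by the h_api loop can be cross-checked with a few lines of C. This is only a sketch of the same add-then-rotate scheme as I read it from the listing; the names rol32 and api_hash are made up for the example. Note that the terminating null byte is hashed as well, which is why the macro performs one extra rotate after its loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROL_N 5

/* Rotate a 32-bit value left by n bits (0 < n < 32). */
static uint32_t rol32(uint32_t v, unsigned n)
{
    return (v << n) | (v >> (32 - n));
}

/* Add each byte of the name to the running hash, then rotate left by
 * ROL_N. The loop also processes the terminating null byte, mirroring
 * "lodsb / add edx, eax / rol edx, ROL_N / dec eax / jns h_api". */
uint32_t api_hash(const char *name)
{
    uint32_t      h = 0;
    unsigned char c;

    assert(name != NULL);
    do {
        c = (unsigned char)*name++;
        h += c;
        h = rol32(h, ROL_N);
    } while (c != 0);

    return h;
}
```

For example, api_hash("A") is 0x41 rotated left by 5 twice, which gives 0x10400. Calling api_hash("NtCreateThreadEx") should produce the same 32-bit constant that hmov embeds in the mov eax instruction.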
Switching to 64-bit mode
Again, this has been described and documented very well by rgb/29a and ReWolf, and their work is the basis for the following function.
        bits 32
    ; switch to x64 mode
    sw64:
        call  isWow64
        jz    ext64         ; we're already x64
        pop   eax           ; get return address
        push  33h           ; x64 selector
        push  eax           ; return address
        retf                ; go to x64 mode
    ext64:
        ret
Switching back to x86 mode
This piece of code returns to 32-bit mode after we’ve created the thread.
        bits 32
    ; switch to x86 mode
    sw32:
        call  isWow64
        jnz   ext32         ; we're already x86
        pop   eax
        sub   esp, 8
        mov   dword[esp], eax
        mov   dword[esp+4], 23h   ; x86 selector
        retf
    ext32:
        ret
CreateRemoteThread
The following is a wrapper for calling NtCreateThreadEx. It first switches to x64 mode, then calls the function before returning to 32-bit mode.
The arguments for NtCreateThreadEx() are loaded through a structure because that makes it easier to align the stack.
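The stack has to be aligned correctly before calling an API, otherwise it will cause problems and occasionally crash.
The wrapper below aligns rsp down to a 16-byte boundary and then reserves the argument frame so that rsp ends up at a multiple of 16 minus 8; the remaining eight bytes are taken by the return address once call executes. (For reference, the Microsoft x64 convention itself asks for rsp to be 16-byte aligned at the call instruction, which leaves it at 16n+8 on entry to the callee.)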
The HOME_SPACE structure reserves the 32 bytes of home space for rcx, rdx, r8 and r9 that the caller must provide for every API call.
Have a look at The history of calling conventions, part 5: amd64 for more information.
    struc HOME_SPACE
      ._rcx resq 1
      ._rdx resq 1
      ._r8  resq 1
      ._r9  resq 1
      .size:
    endstruc

    struc ct_stk
      .hs:                resb HOME_SPACE.size
      .lpStartAddress     resq 1
      .lpParameter        resq 1
      .CreateSuspended    resq 1
      .StackZeroBits      resq 1
      .SizeOfStackCommit  resq 1
      .SizeOfStackReserve resq 1
      .lpBytesBuffer      resq 1
      .size:
    endstruc

    %ifndef BIN
      global CreateRemoteThread64
      global _CreateRemoteThread64
    %endif

    CreateRemoteThread64:
    _CreateRemoteThread64:
        bits 32
        push  ebx
        push  esi
        push  edi
        push  ebp
        call  sw64          ; switch to x64 mode
        test  eax, eax      ; we're already in x64 mode and will only work with wow64 process, return 0
        jz    exit_create

        bits 64
        mov   rsi, rsp
        and   rsp, -16
        sub   rsp, ((ct_stk.size & -16) + 16) - 8

        hmov  "NtCreateThreadEx"
        call  getapi
        mov   rbx, rax

        xor   r8, r8
        xor   rax, rax
        mov   [rsp+ct_stk.lpBytesBuffer], rax       ; NULL
        mov   [rsp+ct_stk.SizeOfStackReserve], rax  ; NULL
        mov   [rsp+ct_stk.SizeOfStackCommit], rax   ; NULL
        mov   [rsp+ct_stk.StackZeroBits], rax       ; NULL
        mov   [rsp+ct_stk.CreateSuspended], rax
        mov   eax, [rsi+9*4]                        ; lpParameter
        mov   [rsp+ct_stk.lpParameter], rax
        mov   eax, [rsi+8*4]                        ; lpStartAddress
        mov   [rsp+ct_stk.lpStartAddress], rax
        mov   r9d, [rsi+5*4]                        ; hProcess
        mov   edx, 10000000h                        ; GENERIC_ALL
        mov   ecx, [rsi+12*4]                       ; &hThread
        call  rbx

        mov   rsp, rsi      ; restore stack value
        call  sw32          ; switch back to x86 mode
    exit_create:
        bits 32
        pop   ebp
        pop   edi
        pop   esi
        pop   ebx
        ret
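
The frame arithmetic in the listing can be checked on paper with a small C sketch. The struct below mirrors ct_stk using the field names from the listing, and reserve_frame (a name invented here) repeats the and/sub pair on a plain integer instead of the real stack pointer. It assumes the frame size is a multiple of 8, which holds for a structure made entirely of qwords.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* C mirror of the NASM ct_stk structure, for offset checking only. */
typedef struct {
    uint64_t home[4];            /* HOME_SPACE: rcx, rdx, r8, r9 */
    uint64_t lpStartAddress;     /* first stack argument, at offset 32 */
    uint64_t lpParameter;
    uint64_t CreateSuspended;
    uint64_t StackZeroBits;
    uint64_t SizeOfStackCommit;
    uint64_t SizeOfStackReserve;
    uint64_t lpBytesBuffer;
} ct_stk;

/* Same arithmetic as:
 *     and rsp, -16
 *     sub rsp, ((size & -16) + 16) - 8
 * Returns the value rsp would hold just before the call instruction. */
uint64_t reserve_frame(uint64_t rsp, uint64_t frame_size)
{
    /* Assumption: qword-sized frames only; other sizes may not fit. */
    assert(frame_size % 8 == 0);

    rsp &= ~(uint64_t)15;                            /* round down to 16 */
    rsp -= ((frame_size & ~(uint64_t)15) + 16) - 8;  /* reserve frame    */
    return rsp;
}
```

With sizeof(ct_stk) equal to 88, the subtraction works out to ((88 & -16) + 16) - 8 = 88 bytes. Starting from an example stack pointer of 0x12FF4C, the result is 0x12FEE8, which is 8 modulo 16 as the text describes.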
Calling from C
Initially, the CreateRemoteThread64() function was linked with pi.c, but I decided to assemble it as a binary and convert it to a C string to keep things simple.
Writable, executable memory is allocated with VirtualAlloc, and the string is copied into it and executed.
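
The binary-to-C-string conversion mentioned above can be sketched in a few lines. This is not the converter used for pi, just one plausible way to do it; the function name bin_to_cstring and its buffer-based interface are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes code[0..len) into out as a run of "\xNN" escapes followed by
 * a terminating null. Returns the number of characters written (not
 * counting the null), or 0 if out_size is too small. An empty input
 * also returns 0 and leaves an empty string in out. */
size_t bin_to_cstring(const uint8_t *code, size_t len,
                      char *out, size_t out_size)
{
    static const char hex[] = "0123456789abcdef";
    size_t i;
    size_t n = 0;

    assert(code != NULL && out != NULL);

    /* Each byte needs four characters, plus one for the terminator. */
    if (out_size < len * 4 + 1)
        return 0;

    for (i = 0; i < len; i++) {
        out[n++] = '\\';
        out[n++] = 'x';
        out[n++] = hex[code[i] >> 4];
        out[n++] = hex[code[i] & 0x0F];
    }
    out[n] = '\0';
    return n;
}
```

Given the three bytes 31 C0 48, the output buffer holds the twelve characters \x31\xc0\x48, ready to be pasted between double quotes in a C source file.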
All the steps required to inject into a remote process using the APIs described are the same as before, with one exception.
If pi is running as a 32-bit process and the remote process is 64-bit, we call CreateRemoteThread64(); otherwise we call CreateRemoteThread().
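
The selection rule above is small enough to express as a pure function, with the Windows-specific detection kept separate. The sketch below is an illustration only and not taken from pi.c: the enum, choose_thread_api and choose_for_target are hypothetical names, and the Windows part assumes an x86/x64 system and a target handle opened with query access. IsWow64Process is the documented kernel32 API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    USE_CREATE_REMOTE_THREAD,
    USE_CREATE_REMOTE_THREAD64
} thread_api;

/* Only a 32-bit injector facing a 64-bit target needs the
 * CreateRemoteThread64() wrapper; every other pairing uses the
 * ordinary CreateRemoteThread(). */
thread_api choose_thread_api(bool self_is_32bit, bool target_is_64bit)
{
    return (self_is_32bit && target_is_64bit)
               ? USE_CREATE_REMOTE_THREAD64
               : USE_CREATE_REMOTE_THREAD;
}

#ifdef _WIN32
#include <windows.h>

/* TRUE when the process runs under Wow64 (32-bit on a 64-bit OS). */
static bool is_wow64(HANDLE process)
{
    BOOL wow64 = FALSE;
    assert(process != NULL);
    return IsWow64Process(process, &wow64) && wow64;
}

/* Assumption: on a 64-bit OS, a process that is not under Wow64 is
 * treated as a native 64-bit process. */
thread_api choose_for_target(HANDLE target)
{
    bool self_is_32bit   = sizeof(void *) == 4;
    bool os_is_64bit     = !self_is_32bit || is_wow64(GetCurrentProcess());
    bool target_is_64bit = os_is_64bit && !is_wow64(target);

    return choose_thread_api(self_is_32bit, target_is_64bit);
}
#endif
```

Keeping the rule in a pure function makes it easy to check all four combinations without needing a Windows machine.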
Usage
The tool still needs testing and development, but I’ve uploaded the source and binaries to GitHub if you’re interested.
That “dec eax” instruction that you have is a REX prefix in 64-bit mode. There is no one-byte dec instruction in 64-bit mode.
windbg can debug code that transitions between 32-bit and 64-bit code from user-mode.
Hi Peter. Is it possible to debug a 64-bit process from wow64 though? I couldn’t get it to work.
“dec eax/neg eax” becomes “neg rax” in 64-bit mode. I should have been clearer about “DEC EAX” being a REX prefix in 64-bit mode. It’s edited now.