Shellcoding 0x3: Dropping Multi-stage Payload

Abhinav Thakur
Jul 12, 2022


May I bring some friends to this madhouse?

Shellcode optimisation (with regard to size) is an important skill that an exploit developer or malware researcher is expected to possess. The problem is that a shellcode which achieves anything useful usually needs more than the few bytes of memory available to it. One often encounters memory regions so small that the shellcode they can hold serves only trivial purposes. This article describes an approach to crafting a self-destructing shellcode loader that bypasses this space constraint, allowing a much larger shellcode to reside and execute in the target process address space. Later, it describes how an attacker might use these principles to craft a first stage dropper payload.

Although the articles in this series are written in chronological order, feel free to skip to whichever interests you most.

The Constraints

Let’s say the target buffer is 45 bytes long and the payload we’re trying to execute is still around 100 bytes after optimisation. Below is one way to achieve execution of the larger payload.

  1. Allocate a memory region large enough to accommodate the next stage.
  2. Read the 2nd stage payload into the freshly allocated region.
  3. Transfer code flow to the 2nd stage payload.
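
The data flow of steps 1 and 2 can be modelled in a few lines of Python. This is only a sketch under assumptions: `load_stage`, the `capacity` default and the in-memory `io.BytesIO` source are hypothetical names chosen for illustration, the mapping is plain read/write memory, and step 3 (transferring control) is deliberately left out because it only makes sense inside the assembly stage1.

```python
import io
import mmap

def load_stage(source, capacity=0x10000):
    """Model steps 1 and 2: map a fresh region, then read the next stage into it."""
    region = mmap.mmap(-1, capacity)   # step 1: anonymous mapping, not backed by a file
    data = source.read(capacity)       # step 2: read at most `capacity` bytes of stage2
    region[:len(data)] = data          # copy the bytes to the start of the region
    return region, len(data)

# Stand-in for the real input stream: 109 placeholder bytes (0x90 is the x86 nop opcode).
fake_stage2 = io.BytesIO(b"\x90" * 109)
region, size = load_stage(fake_stage2)
print(f"stored {size} bytes in a {len(region)}-byte mapping")
```

The point of the sketch is the ordering: the region exists before any stage2 byte arrives, so the size of stage2 is bounded by the mapping rather than by the 45-byte buffer.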

When World is all Sunshine and Rainbows

First, we aim to read(2) the next stage payload from standard input (STDIN_FILENO). We might also want stage1 to be self-destructing or self-mutating, i.e. it must either remove itself from memory after execution or mutate itself to confuse disassemblers. The self-destruct logic can be as simple as reading the stage2 payload into stage1’s own load address (i.e. overwriting itself). For this to happen, 2 conditions must be met:

  • the memory segment holding the payload (stage1, and stage2 once it overwrites stage1) must have RWX permissions.
  • stage1 must know a base memory address pointing to the part of itself that is to be overwritten.

Assume the memory region accommodating stage1 is RWX. One can try getting the current RIP (the base memory address of stage1) in multiple ways, including lea rsi, [rip] or a call trampoline; jmp getCurrentRIP; pop <dest_reg> sequence, but these generate NULL bytes (bad characters) when assembled. Let’s look at a way to get the base load address of stage1 (CurrentRIP) without generating bad characters.
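
To make the bad-character problem concrete, the following Python sketch scans a few hand-assembled byte sequences for NULL bytes. The encodings are my own assumptions rather than output from the author's listings (verify them with your assembler), `bad_offsets` and `CANDIDATES` are hypothetical names, and `call $+5; pop rsi` is a simplified forward-call variant used for illustration, not the exact trampoline sequence named above.

```python
# Hand-assembled x86-64 encodings (assumed; confirm with an assembler before relying on them).
CANDIDATES = {
    "lea rsi, [rip]":             bytes.fromhex("488d3500000000"),  # disp32 of 0 is four NULLs
    "call $+5; pop rsi":          bytes.fromhex("e8000000005e"),    # forward rel32 of 0
    "syscall; push rcx; pop rsi": bytes.fromhex("0f05515e"),        # no immediates at all
}

def bad_offsets(code, bad=b"\x00"):
    """Return the offset of every byte in `code` that appears in `bad`."""
    return [i for i, b in enumerate(code) if b in bad]

for name, code in CANDIDATES.items():
    offsets = bad_offsets(code)
    verdict = "clean" if not offsets else f"bad bytes at offsets {offsets}"
    print(f"{name:28} {code.hex():16} {verdict}")
```

The first two candidates carry a 32-bit displacement that is all zeroes, while the third has no immediate or displacement field, which is why it avoids NULL bytes.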

The RCX Revenge (stage1)

As per the System V ABI (AMD64 ABI Draft 0.99.6, July 2, 2012, 17:14):

A.2.1.2 A system-call is done via the syscall instruction. The kernel destroys registers %rcx and %r11.

Let’s leverage this to our benefit. The syscall instruction saves the return address (the address of the instruction following syscall) in %rcx and RFLAGS in %r11, and the kernel uses these saved values to return to user space. Therefore, after a system call returns, the address of the next instruction (after the syscall instruction) can be found in %rcx. This gives us the current RIP, which can then serve as the buffer address for the read(2) syscall. Below is how the syscall; push rcx; pop <dest_reg> sequence can be leveraged to get the current RIP (see lines 23, 29 and 30 below).

ideal world stage1
  • getpid(2) (lines 21-23): used simply because of its small size once assembled.
  • read(2) (lines 29-35): reads up to 0xffff bytes of stage2 payload from standard input into its own memory location, effectively clearing stage1 from memory.
  • line 42: pads the remaining region with a nop sled.

Peaceful Stub (stage2)

Below is a harmless stage2 payload that simply writes the string “compilepeace was here x_x\n” to STDOUT_FILENO to confirm successful code execution. Feel free to skip this part or read the comments (which explain the write(2) syscall in more detail).

harmless stage2 shellcode

NOTE: line 54 pads a memory region equal in size to the space occupied by stage1 with a nop sled (i.e. junk that overwrites the stage1 bytes), to ensure the code flow lands on the nop sled and continues down to the second stage. Feel free to pad a larger area, as we have 0xffff bytes of wiggle room!

Gdb, Let’s put it to test !

Let’s see under GDB how the shellcode overwrites itself with a nop sled (which also serves as padding in front of the second stage payload).

first stage shellcode
mutated first stage shellcode

From the above test, it is evident that the stage1 shellcode read the next stage into its own body (now overwritten with a nop sled).

Land Far from Ideal World

A perfect world where one could afford to inject shellcode into an RWX memory region is probably the world of JIT compilers and Looney Tunes. Memory regions these days are marked with W^X permissions, i.e. under the default linker configuration a memory segment is either writable or executable but not both. To bypass this mitigation, an exploit would reuse byte sequences (instructions) already present in executable memory regions and chain them to achieve the desired result, a technique called Return Oriented Programming (ROP).

The shellcode above assumes that the bytes beyond the allocated space (45 bytes) fall into the same memory region with RWX permissions, which might not always be the case. The sequence read() + jmp stage2; would cause a segmentation fault (signal SIGSEGV) on jmp stage2 if stage2 falls into a memory segment without execute permission (RW-). This could happen, for example, if the buffer for stage1 lies 45 bytes from the page boundary.
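
The page-boundary case is easy to check with a little arithmetic. The sketch below assumes 4096-byte pages and uses a made-up address; `crosses_page`, `PAGE_SIZE` and `page_end` are hypothetical names introduced only for this illustration.

```python
PAGE_SIZE = 4096  # assumed page size (typical on x86-64 Linux)

def crosses_page(addr, length, page_size=PAGE_SIZE):
    """True if the byte range [addr, addr + length) spans more than one page."""
    if length <= 0:
        return False
    first_page = addr // page_size
    last_page = (addr + length - 1) // page_size
    return first_page != last_page

page_end = 0x7ffff7ffb000   # hypothetical page boundary
stage1 = page_end - 45      # a 45-byte buffer that ends exactly at the boundary

print(crosses_page(stage1, 45))        # stage1 on its own
print(crosses_page(stage1, 45 + 109))  # nop padding plus a 109-byte stage2
```

Whenever the second call reports a crossing, the tail of stage2 lands in the neighbouring page, whose permissions are not under stage1's control.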

stage2: could you make me some room ?

Since stage2 requires an RWX memory segment to function properly, we can use mmap(2) to allocate RWX memory pages, inject the stage2 shellcode and transfer control flow to it. In order to make syscalls, one frequently needs the numeric constants behind the macros used as arguments. Below is how these can be found, either with grep or by using strace to trace the mmap(2) call:

$ grep -rE "define PROT_(READ|WRITE|EXEC)" /usr/include
/usr/include/x86_64-linux-gnu/bits/mman-linux.h:
#define PROT_READ 0x1 /* Page can be read. */
...
OR
$ strace -e raw=mmap ./harness.elf    # raw=mmap prints the mmap arguments as numbers, not macros
...
mmap(0, 0xffff, 0x7, 0x22, 0, 0) = 0x7f3b97554000
...
mmap(2)
  • mmap(2) (lines 29-41): the above syscall maps 0xffff bytes (rounded up to 16 pages) of RWX (0x7) memory that is private, i.e. not shared with any other process, and anonymous (0x22 = MAP_PRIVATE | MAP_ANONYMOUS). Here we let the kernel choose an appropriate page-aligned address (the 1st argument is 0), and the mapped region is not backed by any file on disk (hence the 5th and 6th arguments are 0).
  • (line 40): pushes RSI (which stores the length of the mapped memory); it is popped again for the read(2) call invoked next (as below).
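
The raw mmap(2) arguments described above can be decoded back into macro names with a short Python sketch. The flag values are the usual Linux x86-64 ones, hard-coded here so the example is self-contained; `decode`, `pages_needed` and the two tables are hypothetical helpers, and a 4096-byte page size is assumed.

```python
# Usual Linux x86-64 values, hard-coded (assumption: other platforms may differ).
PROT = {"PROT_READ": 0x1, "PROT_WRITE": 0x2, "PROT_EXEC": 0x4}
MAP = {"MAP_SHARED": 0x01, "MAP_PRIVATE": 0x02, "MAP_FIXED": 0x10, "MAP_ANONYMOUS": 0x20}
PAGE_SIZE = 4096  # assumed page size

def decode(value, table):
    """List the names of all flags from `table` that are set in `value`."""
    return [name for name, bit in table.items() if value & bit]

def pages_needed(length, page_size=PAGE_SIZE):
    """Number of whole pages the kernel must map to cover `length` bytes."""
    return (length + page_size - 1) // page_size

print(decode(0x7, PROT))                    # 3rd argument of the traced call
print(decode(0x22, MAP))                    # 4th argument of the traced call
print(pages_needed(0xffff))                 # pages covering the requested length
print(hex(pages_needed(0xffff) * PAGE_SIZE))  # size actually mapped
```

Reading the numbers this way confirms that the request is for readable, writable and executable memory in a private anonymous mapping, and that 0xffff bytes end up occupying 16 full pages.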
read(2)
  • read(2) (lines 47-53): mmap(2) returns the memory address of the mapped region in RAX. Via STDIN_FILENO, we read stage2 into the freshly mapped region (RAX). On line 47, pop rdx retrieves the size of the mapped region (RSI, pushed on line 40 in the mmap syscall).
  • (lines 55-56): simply jumps to the stage2 shellcode, i.e. to RBX, which stores the stage2 load address (see the push and pop on lines 49 and 55 respectively).
  • (line 62): places 5 nops to pad up to 45 bytes.
Stage 1 shellcode

Above is how the stage1 shellcode would look in memory (see field 2, which stores the encoded bytes). The stage2 shellcode is still the same payload (it writes a string constant to STDOUT). Let’s test our shellcode:

testing multistage payload

Finally, we were able to execute a harmless 109-byte shellcode within a space constraint of 45 bytes. What a day!

Dropper Payload

The above shellcode is nice for delivery via local POCs/malware, but usually an attacker is at a remote location, meaning the entire communication must happen between the target process and the attacker's machine. To achieve this, a dropper payload would need to read the next stages from a socket rather than from standard input. With position independent code, it is now easy to drop the next stages into a freshly mmapped region and jmp to it. A trivial dropper agent might use syscalls in the order below to iteratively read and execute shellcode from a C2 server deployed within the attacker's infrastructure:

Syscall sequence of a simple multistage dropper

EPILOGUE

This article crafted a shellcode loader that bypasses a space constraint on code execution by staging a large payload in multiple phases, which leaves ample room for an ambitiously large parasite. It then looked at how an attacker could apply the same principles with the UNIX socket API to craft an in-memory dropper. I hope this article contributes to your research journey.

DISCLAIMER: Since attackers are already making use of this knowledge, it is the defenders who stand to gain from the approach described in this article. This article series is intended for exploit developers, malware researchers, folks involved in red/blue team operations and independent researchers struggling to find relevant resources in this area. The content is intended to be used solely for educational purposes, and the author takes no responsibility for anyone attracting hell by carrying out malicious intentions. Happy hacking ×_×

Cheers,
Abhinav Thakur
(a.k.a. compilepeace)
