Shellcoding 0x0: An Introduction to Red Pills

Abhinav Thakur
Jul 6, 2022

I wonder what you would choose, blue or red x_x

Hello folks! Through a series of articles, I plan to describe Linux shellcoding (on the Intel x86-64 CPU architecture) in enough depth to bypass, at the very least, static detections applied on the defensive side. This article covers some fundamentals. It starts by providing context around code injection, then builds a simple assembly program (performing system calls is explained in the next article), and finally discusses different ways to test shellcode, each with its own limitations and advantages.

Although the article series is written in chronological order, feel free to skip to whatever interests you most.

Prerequisites

  • Familiarity with the compilation process (mainly the assembling phase) is assumed throughout the article series.
  • Familiarity with Intel x86-64 assembly (check out OST2’s Arch. 1001, which does a better job of teaching it) will be helpful but is not strictly necessary, as I plan to fill in some gaps on the fly.

Context First

Since code injection starts with exploitation and ends in the post-exploitation phase (a later phase in any kill-chain model), this article series implicitly assumes the following code injection and execution scenario:

The vulnerable software being targeted is running with root privileges (effective UID 0, without any process isolation techniques in place) on a Linux platform and an Intel x86-64 processor. The attacker exploiting the software managed to inject data into an executable region of the target process’s address space and succeeded in hijacking the code flow, achieving a state called arbitrary code execution (ACE), or remote code execution (RCE) if network software is being exploited remotely.

In the context of the above scenario, we plan to craft the injected data, i.e. the shellcode (a.k.a. payload), to which an attacker may want to transfer the hijacked code flow.

NOTE: Due to mitigations like DEP/NX (applied at both the software and hardware layers), it is generally no longer possible to execute shellcode in DATA regions of memory (unless you’re dealing with JIT compilation or similar software that keeps an RWX memory region in its address space). Return Oriented Programming (ROP) is used to bypass this mitigation by chaining short sequences of instructions (called gadgets) that can be interpreted from raw bytes residing in the CODE segment (which has R-X permissions).

So, why bother learning shellcoding?

Malware researchers need to be able to understand and write shellcode for obvious reasons. For exploit developers, a ROP-only exploit is limited to the gadgets present in the target process’s CODE segment, and chaining them isn’t as flexible as writing shellcode. The first strategy one might therefore apply is to use ROP to create an RWX memory region and bring shellcode into it for execution (in a staged manner, which we discuss later). Also, since ROP is an evolution of classical shellcoding, it seems reasonable to practice running before beginning to parkour (hoping that’s enough talk to get some context).

Before proceeding, make sure you can differentiate between source code and compiled code; you might want to read this highlighted topic.

Shellcode — Code Suitable for Injection

What we intend to craft is a self-sufficient program that is able to execute regardless of its base load address in memory (a relocatable piece of code that doesn’t require linking); that’s what we call shellcode. Shellcode is position-independent code (PIC) comprising a stream of data bytes that correspond to valid instructions when interpreted by the target CPU. From an attacker’s perspective, it can be understood as CPU instructions (in the form of hexadecimal-encoded bytes) crafted under certain constraints and meant to gain a foothold on a compromised machine. Below is what your to-be-favourite shellcode may look like (highlighted in RED).

execve_binsh.s

Wait, did you just say constraints ?

While crafting shellcode one has to overcome certain restrictions (depending upon how it is delivered and interpreted by the target software) which when violated can get the shellcode corrupted before it reaches target address space. Apart from target-specific restrictions, we must ensure that shellcode is —

  • position-independent (PIC) in nature. This is usually achieved via relative addressing and clever use of CPU instructions.
  • self-sufficient, i.e. it should not rely on any external API (other than the system call interface provided by the OS), since no standard C library functions can be assumed to be available. Shellcode capable of finding the libc base address may be able to call libc functions; later we’ll see an implementation of this approach.
  • restricted to the smallest possible size. This largely depends on the state of exploitation, as the point of injection may not have enough wiggle room to accommodate the payload. Later, we learn to craft multistage payloads to deal with space constraints.
  • free from bad characters. A bad character is a byte (or a sequence of bytes) which, if present inside our shellcode, may get the shellcode corrupted by the time it reaches the target’s address space and may therefore result in unintended behaviour (usually illegal instructions and crashes). In the context of defense evasion, bad chars may also be a sequence of bytes whose existence or execution is filtered or flagged by defensive software (AV/EDR/firewalls etc.). For now, we treat only the NULL byte (‘\x00’) as a bad character (slowly adding more bytes to this list), and later we’ll see ways to bypass such detections.
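To make the bad-character constraint concrete, here is a minimal Python sketch that scans a byte string for forbidden bytes. The helper name find_bad_bytes and both sample payloads are my own illustration, not part of the original tooling; the second sample is the usual encoding of mov eax, 60, picked only because its 32-bit immediate drags in three NULL bytes.

```python
from typing import List, Tuple


def find_bad_bytes(shellcode: bytes, bad: bytes = b"\x00") -> List[Tuple[int, int]]:
    """Return (offset, byte value) for every byte of `shellcode` that appears in `bad`."""
    return [(offset, value) for offset, value in enumerate(shellcode) if value in bad]


# nop ; int3 -> contains no NULL bytes
CLEAN_SAMPLE = bytes([0x90, 0xCC])

# mov eax, 60 -> the 32-bit immediate is padded with NULL bytes (illustrative sample)
DIRTY_SAMPLE = bytes([0xB8, 0x3C, 0x00, 0x00, 0x00])

for name, sample in (("clean", CLEAN_SAMPLE), ("dirty", DIRTY_SAMPLE)):
    hits = find_bad_bytes(sample)
    print(name, "->", [(offset, hex(value)) for offset, value in hits])
```

The bad parameter accepts any set of bytes, so the same check keeps working as the list of bad characters grows beyond the NULL byte.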

Throughout the article series, we will see techniques that can be leveraged to overcome each of these constraints !

NOTE: An optimised shellcode can be made even smaller if we can leverage CPU state at the point when code execution is achieved. Since we can’t predict it in advance, we assume a garbage state for each register we use in this article series.

Let's Get Our Hands Red

There are many assemblers ([NYFM]asm) that can do the job really well. In this series, I’ll be using GCC, which relies on GAS (the GNU Assembler) as its backend, to assemble my shellcode. Below is a program that performs a No OPeration and then breaks: nop followed by int 3.

nop.s

A brief overview —

  • [Line 1–4]: multi-line comments (C-style) begin with /* and end with */.
  • [Line 6–8]: .global (defines a global symbol) and .intel_syntax are assembler directives. Assembler directives are not instructions to the CPU (and therefore don’t generate any machine code); they are instructions to the assembler itself. They are not very relevant for the moment, but for those interested, a list of GNU assembler directives can be found here. On [Line 7], we have the _start: symbol (a label), which can be thought of as a way to refer to this memory location. The linker (ld) uses this symbol as the entry point of the program.
  • [Line 11–12]: The program consists of two instructions: nop and int 3. nop stands for No OPeration and merely wastes a CPU cycle, while int 3 raises a breakpoint exception, which the kernel delivers to the process as SIGTRAP.

Next, we see how GAS (the GNU Assembler, the backend of GCC) can be leveraged to pack these instructions into an executable ELF binary.

./nop.elf

Above we see a Trace/breakpoint trap because the CPU, while executing instructions, landed on the 0xcc byte, which indicates that our shellcode runs just fine. Some flags passed to the GCC toolchain need explanation:

  • -Wl,-N: passes -N to the linker, which makes the TEXT segment writable in addition to readable and executable (RWX). This is required for testing self-modifying code; more about such code in upcoming articles.
  • -nostdlib: do not link the standard startup files or libraries (including the standard C library).
  • -static: generate a statically linked binary.

NOTE: The single-byte nop (opcode 0x90) shares its encoding with the historical xchg eax, eax; in 64-bit mode the CPU treats 0x90 as a true no-op. Also, int 3 (opcode 0xcc) is how most debuggers implement software breakpoints.
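Putting the two opcodes from the note together, the machine code for nop followed by int 3 is just two bytes. The small Python sketch below prints such bytes in the familiar \x-escaped form; to_escaped is a hypothetical helper name of mine, not something from the GCC toolchain.

```python
def to_escaped(shellcode: bytes) -> str:
    """Render raw bytes as a \\x-escaped string suitable for pasting into C or Python."""
    return "".join(f"\\x{byte:02x}" for byte in shellcode)


# nop (0x90) followed by int3 (0xcc), as described in the note above
NOP_INT3 = bytes([0x90, 0xCC])

print(to_escaped(NOP_INT3))  # prints \x90\xcc
```

This is handy for comparing hand-assembled bytes against what a disassembler reports for the .text section.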

Debugging shellcode

Writing code inevitably gives birth to bugs, which creates a need for testing. The idea is to transfer code flow to the shellcode and trace through its execution to ensure it works as intended. There are many ways to debug shellcode; the most common ones use:

  • Strace
  • GCC packed shellcode (as seen above)
  • Harness program

Strace

Strace traces the system calls invoked by a program. You might want to check out its manual for all the features it has to offer (thank me later). Below is the output we get while strace’ing the exit.elf program, whose sole purpose is to perform the exit() system call (one way to ask the Linux kernel to terminate a process).

strace ./exit.elf

NOTE: When strace runs a SUID program, the set-user-ID bit is not honoured (unless strace itself runs with sufficient privileges), so any operation requiring elevated privileges will fail, typically with the error code EPERM (operation not permitted).

GCC packed shellcode

Strace is a quick way to learn which syscalls a program performs, but it doesn’t let us step through the program’s instructions. Packing the shellcode with GCC lets us debug it as a regular ELF program and take advantage of any debugging information (added via the -g flag). Note that debugging information is stored in separate sections of the ELF binary and therefore doesn’t interfere with the CODE section (where the shellcode bytes reside). Below is the binary under gdb, stopped at _start.

loading ./nop.elf inside gdb

Harness

While the above approach to debugging may sound promising, it suffers from one drawback: it presents a clear CPU state (general-purpose registers set to 0) at the beginning of shellcode execution, as can be seen when the binary is loaded under GDB. Shellcode that assumes a clear CPU state is far from practical, since the register state at the moment a vulnerability gets triggered is unlikely to be clean (check out publicly available shellcode and you’ll find a couple of samples presuming a clear register state).

Optionally, we can confirm that it’s not GDB clearing the register state right before shellcode execution. To do so, we run the program outside GDB and examine the CPU state right before the first shellcode instruction executes. Modify the binary by placing the opcode byte for int3 (0xcc) at the beginning of the .text section. Launch ./exit.elf and wait until the CPU hits 0xcc (int3); when the process terminates, its address space is dumped to a core file. Loading the dumped core under gdb (via -c <core_path>) confirms that the registers were 0 at the time of the crash. Below are the respective steps:

# get the offset of 1st byte in .text section
$ readelf -S --wide ./exit.elf
There are 12 section headers, starting at offset 0x328:
...
[Nr] Name  Type     Address          Off    Size   ES Flg Lk Inf Al
[ 2] .text PROGBITS 00000000004000d4 0000d4 000008 00 WAX 0 0 1
...
# overwrite a byte at offset 0xd4 to value 0xcc (i.e. opcode for int3 instruction)
$ hexedit ./exit.elf
# (Ubuntu users)
# check & ensure a core dump in current working directory
$ cat /proc/sys/kernel/core_pattern
|/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E
$ echo "core_%e.%p_%t" | sudo tee /proc/sys/kernel/core_pattern
# run the program
$ ./exit.elf
Trace/breakpoint trap (core dumped)
cleared register state (from dumped core) when shellcode begins to execute
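If you would rather script the one-byte patch than open hexedit, a few lines of Python do the same job. This is only a sketch under the assumptions from the readelf output above (the .text section of exit.elf starts at file offset 0xd4); patch_byte is a hypothetical helper name of mine.

```python
def patch_byte(path: str, offset: int, value: int) -> int:
    """Overwrite one byte of the file at `offset` with `value`; return the byte that was there."""
    with open(path, "r+b") as binary:
        binary.seek(offset)
        original = binary.read(1)
        if len(original) != 1:
            raise ValueError(f"offset {offset:#x} is past the end of {path}")
        binary.seek(offset)
        binary.write(bytes([value]))
    return original[0]
```

For the binary above, the call would be patch_byte("./exit.elf", 0xd4, 0xCC). The returned value is the original first byte of .text, which is worth noting down in case you want to restore the binary afterwards.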

Hello Harness

Let’s create a program that reads shellcode into an RWX memory region, clobbers the registers rax, rdi, rsi, rdx, r10, r8 and r9 (ensuring the shellcode doesn’t presume any CPU state) and transfers code flow to the base address at which the shellcode was read. This creates a more realistic and robust environment for testing the effectiveness of our shellcode. I hope the source code below is self-explanatory (see comments).

NOTE: The kernel interface expects syscall arguments in the order rdi, rsi, rdx, r10, r8, r9, whereas the glibc syscall wrappers, like any other user-space function, receive their arguments in the order rdi, rsi, rdx, rcx, r8, r9.
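The difference between the two orders is easy to miss, so here is a tiny Python sketch that maps a list of arguments onto registers under each convention. The helper assign_args and the placeholder argument names are hypothetical. The kernel uses r10 instead of rcx because the syscall instruction itself overwrites rcx (with the return address) and r11 (with the saved flags).

```python
from typing import Dict, Sequence

# Argument registers, in order, for each calling convention
KERNEL_SYSCALL_REGS = ("rdi", "rsi", "rdx", "r10", "r8", "r9")
USER_FUNCTION_REGS = ("rdi", "rsi", "rdx", "rcx", "r8", "r9")


def assign_args(args: Sequence[object], regs: Sequence[str]) -> Dict[str, object]:
    """Pair each argument with the register that carries it under the given convention."""
    if len(args) > len(regs):
        raise ValueError("at most six arguments are passed in registers")
    return dict(zip(regs, args))


# Placeholder arguments; only the fourth one lands in a different register
example_args = ["arg1", "arg2", "arg3", "arg4"]
print("syscall :", assign_args(example_args, KERNEL_SYSCALL_REGS))
print("function:", assign_args(example_args, USER_FUNCTION_REGS))
```

Only the fourth argument differs, which is why shellcode issuing raw syscalls with four or more arguments must load r10 rather than rcx.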

harness.c

Here, clobberContext[] is a byte array that stores assembled instructions (to clobber the CPU state). The idea is to inject this byte array before the shellcode so that these instructions mess up the CPU state just before code flow reaches our shellcode. Below is how we confirm that this actually happens using GDB.

building exit.raw and harness.s
memory region where shellcode is stored
Clobbered CPU state just before executing supplied shellcode
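To give a feel for what such a clobbering prefix can contain, here is a Python sketch that builds one from 64-bit mov-immediate instructions and prepends it to a payload. It is my own illustration and not the actual contents of clobberContext[]; the garbage value 0x4141414141414141 and the helper names are assumptions. The script only constructs bytes and never executes them.

```python
import struct

# Opcode prefixes for `mov <reg>, imm64` (REX.W + B8+rd) for each register the harness clobbers
MOV_IMM64_PREFIX = {
    "rax": b"\x48\xb8",
    "rdi": b"\x48\xbf",
    "rsi": b"\x48\xbe",
    "rdx": b"\x48\xba",
    "r10": b"\x49\xba",
    "r8": b"\x49\xb8",
    "r9": b"\x49\xb9",
}


def build_clobber_stub(garbage: int = 0x4141414141414141) -> bytes:
    """Assemble `mov reg, garbage` for every register in MOV_IMM64_PREFIX."""
    immediate = struct.pack("<Q", garbage)  # 8-byte little-endian immediate
    return b"".join(prefix + immediate for prefix in MOV_IMM64_PREFIX.values())


def wrap_shellcode(shellcode: bytes) -> bytes:
    """Place the clobbering stub in front of the shellcode under test."""
    return build_clobber_stub() + shellcode


wrapped = wrap_shellcode(bytes([0x90, 0xCC]))  # nop ; int3 as a stand-in payload
print(len(wrapped), "bytes:", wrapped.hex())
```

Each instruction is 10 bytes (2 bytes of opcode plus an 8-byte immediate), so the stub adds 70 bytes in front of the payload. Since the stub lives in the harness rather than in the injected data, it does not need to respect the bad-character constraints.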

Epilogue

This article serves as an introduction to the world of machine code and lays a logical foundation for the area of code injection. It describes a few tools and approaches one can leverage to debug shellcode, and then explains why a harness program is needed to test the effectiveness of shellcode. I hope this gets you familiar with the area; next, we move on to writing some meaningful shellcode. Feel free to grab a beer before proceeding !

DISCLAIMER: Since attackers are already making use of this knowledge, it’s the defenders who may find value in the approach described in this series. The series is intended for exploit developers, malware researchers, folks involved in red/blue team operations and independent researchers struggling to find relevant resources in this area. The content is intended to be used solely for educational purposes, and the author takes no responsibility for anyone attracting hell by carrying out malicious intentions. Happy hacking ×_×

Cheers,
Abhinav Thakur
(a.k.a. compilepeace)
