Artificial Intelligence tools, such as ChatGPT, are rapidly advancing and proving to be valuable assets for various applications.

This has produced numerous publications that may overstate what AI systems built on neural networks can do, alongside voices that dismiss outright the idea that this technology has changed, or will change, how people work across a broad range of fields.

To assess ChatGPT’s capabilities in addressing a low-level system programming task for which there are few, if any, publicly available solutions, I conducted a small experiment.

The objective was to use ChatGPT with the GPT-4 model to develop shellcode capable of determining whether it is running on a 64-bit Windows or Linux operating system. To anticipate the outcome: offensive security engineers need not worry about their job security for the moment.

A Promising Beginning at First Sight

The initial request to ChatGPT was:

Q: Please provide me with a FASM-compatible x86_64 assembly listing for code that detects whether it is executed on a Windows or Linux host.

The result provided by ChatGPT looked promising at first glance:

; Windows or Linux detector for x86_64 in FASM
format ELF64 executable

entry start

segment readable executable

start:
    ; Save registers
    push rax
    push rdi
    push rsi
    push rdx
    push rcx
    push r10
    push r8
    push r9

    ; Linux syscall - getpid (39)
    mov rax, 39
    syscall

    ; Check if syscall executed successfully
    cmp rax, 0
    jle linux_detected

windows_detected:
    ; Windows code execution continues here
    ; Restore registers and exit (Windows)
    pop r9
    pop r8
    pop r10
    pop rcx
    pop rdx
    pop rsi
    pop rdi
    pop rax

    ; Exit process with success (Windows)
    ; using int 0x2e, service 0x18 (NtTerminateProcess)
    mov eax, 0x18
    xor edi, edi
    mov edx, 0x0
    int 0x2e
    ret

linux_detected:
    ; Linux code execution continues here
    ; Restore registers and exit (Linux)
    pop r9
    pop r8
    pop r10
    pop rcx
    pop rdx
    pop rsi
    pop rdi
    pop rax

    ; Exit process with success (Linux)
    ; using syscall 60 (exit)
    mov rax, 60
    xor rdi, rdi
    syscall

However, although it appears plausible at first, this code has several issues. The most obvious one is that it assembles to an ELF executable rather than raw shellcode, which is not what I wanted. This is a minor problem that can be resolved by replacing the format, entry, and segment statements with a “use64” directive. The code implementing process termination on Windows also seems incorrect, but I will disregard it since I didn’t ask for it in the first place.

Now, let’s discuss the logic used here to determine whether the operating system is Windows or Linux. The explanation from ChatGPT provided with the generated code reads as follows:

GPT: This code first attempts to execute a Linux system call, ‘getpid’ (syscall number 39). If the result is greater than 0, it assumes that the code is running on Linux and jumps to the ‘linux_detected’ label. If the result is less than or equal to 0, it assumes that the code is running on Windows and continues executing at the ‘windows_detected’ label. It then exits the process using the appropriate system call for the detected operating system.

This is a sound approach. getpid takes no arguments and always succeeds on Linux, where syscall numbers stay the same across kernel versions, as defined in syscalls.h. On Windows, by contrast, syscall numbers vary with the specific kernel version. The syscall IDs for some kernels are documented in the Microsoft Windows System Call Table.
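The Linux half of this probe can be tried out from user space. The following is a minimal sketch, assuming an x86_64 Linux host and a libc that exposes the generic `syscall()` wrapper. The helper name `raw_getpid` is made up for illustration, and the function returns `None` on any other platform instead of issuing the call.

```python
import ctypes
import os
import platform
import sys

SYS_GETPID_X86_64 = 39  # getpid in the Linux x86_64 syscall table


def raw_getpid():
    """Invoke syscall number 39 directly (hypothetical helper).

    Returns None unless running as a 64-bit process on x86_64 Linux,
    because the number 39 means something else on other platforms.
    """
    is_linux = sys.platform.startswith("linux")
    is_x86_64 = platform.machine() == "x86_64"
    is_64bit_process = ctypes.sizeof(ctypes.c_void_p) == 8
    if not (is_linux and is_x86_64 and is_64bit_process):
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    return libc.syscall(ctypes.c_long(SYS_GETPID_X86_64))


if __name__ == "__main__":
    print("raw syscall 39:", raw_getpid())
    print("os.getpid():   ", os.getpid())
```

On a matching host both lines should print the same positive process ID, which is the property the detection logic relies on.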

Assuming that syscall 39, which is getpid on a Linux kernel, is called on a Windows kernel, the corresponding function could be any of the following:

  • NtAccessCheckAndAuditAlarm
  • NtMapViewOfSection
  • NtSetInformationFile
  • NtUnmapViewOfSection

There is a high probability that these functions will fail with whatever leftover state is present in the argument registers and on the stack, resulting in a negative signed 32-bit error code from the Windows kernel, for example, 0xc0000003.
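To make the sign argument concrete, here is a small sketch that reinterprets the example value 0xc0000003 as a signed 32-bit integer and extracts the two severity bits at the top of an NTSTATUS value. The helper names are chosen for illustration only.

```python
def as_int32(value):
    """Reinterpret the low 32 bits of value as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def ntstatus_severity(status):
    """Return the severity field (bits 31..30) of an NTSTATUS value.

    0 = success, 1 = informational, 2 = warning, 3 = error.
    """
    return (status >> 30) & 0b11


status = 0xC0000003
print(as_int32(status))           # -1073741821
print(ntstatus_severity(status))  # 3
```

Any status with error severity has bit 31 set, so it is negative when read as a signed 32-bit value, whereas a Linux process ID is always positive.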

The overall logic used for system detection seems sound. However, with this explanation in mind, revisiting the assembly listing provided by ChatGPT reveals more issues.

The comparison that checks the return code from the syscall (cmp rax, 0) will not work reliably, since the error codes returned here are 32-bit signed integers: a 64-bit compare treats the value as negative only if bit 63 of rax is set, which a 32-bit status code does not guarantee. Additionally, the conditional branch instruction (jle) does precisely the opposite of what the written explanation describes. Even if the check for a negative result worked, it would jump to the wrong label.
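Both problems can be modeled in a few lines. The sketch below contrasts the branch decision of `cmp rax, 0` / `jle linux_detected` with a 32-bit sign test (`test eax, eax` followed by `js windows_detected`). The sample process ID is made up, and whether the upper half of rax is zero or sign-extended after a failed Windows syscall is treated as an open assumption, so both cases are shown.

```python
def as_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def first_listing(rax):
    """cmp rax, 0 / jle linux_detected: branch taken when signed 64-bit rax <= 0."""
    return "linux" if as_signed(rax, 64) <= 0 else "windows"


def sign_test_32(rax):
    """test eax, eax / js windows_detected: branch taken when bit 31 of eax is set."""
    return "windows" if as_signed(rax, 32) < 0 else "linux"


samples = {
    "Linux pid (hypothetical 4242)": 4242,
    "NT error, upper half zero": 0x00000000C0000003,
    "NT error, sign-extended": 0xFFFFFFFFC0000003,
}

for name, rax in samples.items():
    print(f"{name:32} first listing: {first_listing(rax):8} "
          f"32-bit sign test: {sign_test_32(rax)}")
```

In this model the first listing reports Windows for a genuine Linux process ID, and its answer for an NT error code depends on the upper half of rax. The 32-bit sign test gives the intended answer in all three cases.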

More Mistakes Are Made

When I confronted ChatGPT with these issues, it provided another flawed code listing.

// removed

start:
    ; Save registers
    // removed

    ; Linux syscall - getpid (39)
    xor rax, rax
    mov eax, 39
    syscall

    ; Check if syscall executed successfully
    test eax, eax
    js linux_detected

windows_detected:
    ; Windows code execution continues here
    ; Restore registers and exit (Windows)
    // removed

linux_detected:
    ; Linux code execution continues here
    ; Restore registers and exit (Linux)
    // removed

The explanation accompanying this code listing reads like a solution, but the code does not implement the logic it describes:

GPT: This code first attempts to execute a Linux system call, ‘getpid’ (syscall number 39). If the result is greater than 0, it assumes that the code is running on Linux and jumps to the ‘linux_detected’ label. If the result is less than or equal to 0, it assumes that the code is running on Windows and continues executing at the ‘windows_detected’ label. It then exits the process using the appropriate system call for the detected operating system.

Apparently, the model struggles with conditional branch instructions, as the check is inverted again: this implementation takes the Linux branch when the system call returns a negative 32-bit integer.

Inquiring about this implementation error seemed to confuse the AI and resulted in a nonsensical implementation and explanation.

Q: But wouldn’t getpid return a positive value when it is called on Linux?

GPT: You’re right, and I apologize for the confusion again. The code I provided was incorrect. Here’s a better approach: we can leverage the fact that Windows and Linux have different syscall numbers for the same operation. For example, the ‘getpid’ syscall has number 39 in Linux and 0x20 in Windows.

// code listing

In this version, we attempt to execute the ‘getpid’ syscall on Linux (number 39), and if the result is greater or equal to 0x20 (Windows ‘getpid’ syscall number), we assume the process is running on Linux, jumping to the linux_detected label. If the result is less than 0x20, it is assumed that the process is running on Windows and continues executing at the ‘windows_detected’ label.

// …

Guiding ChatGPT Towards a Solution

To make some progress, I decided to guide the AI towards a solution.

Q: I think your first approach wasn’t too far off from a solution. Attempting to call getpid and checking whether the returned 32-bit value is negative (NT error code) should work when implemented properly if all other registers are zeroed out, thus causing an error in the Windows syscalls corresponding to the same syscall id.

The code returned by ChatGPT, with the minor modifications described initially, compiles and works as intended.

//removed

start:
    ; Save registers
    // removed

    ; Zero out other registers
    xor rdi, rdi
    xor rsi, rsi
    xor rdx, rdx
    xor rcx, rcx
    xor r10, r10
    xor r8, r8
    xor r9, r9

    ; Try Linux syscall - getpid (39)
    mov rax, 39
    syscall

    ; Check if syscall executed successfully
    test eax, eax
    js windows_detected

linux_detected:
    ; Linux code execution continues here
    ; Restore registers and exit (Linux)
    // removed

windows_detected:
    ; Windows code execution continues here
    ; Restore registers and exit (Windows)
    // removed

Executing the code on a Linux system, the conditional branch is not taken, and execution resumes at the linux_detected label.

[Figure: shellcode test on Linux]

Testing the shellcode on a Windows system results in the conditional branch being taken and the execution of the code at the windows_detected label.

[Figure: shellcode test on Windows]

Improvements

A few small improvements, such as zeroing only the registers that carry Windows syscall arguments and adding an egg hunter that makes it easy to combine this shellcode with separate stagers for Windows and Linux, result in the following listing:

use64

start:
    ; Zero out syscall argument registers
    xor rdx, rdx
    xor r8, r8
    xor r9, r9
    xor r10, r10

    ; Try Linux syscall - getpid (39)
    mov rax, 27h
    syscall

    ; Check if syscall executed successfully
    test eax, eax
    js windows_detected

linux_egg_hunter:
    ; Linux: search for the egg and jump to the payload behind it
    mov eax, 0x50905050 ; egg value (bytes 50 50 90 50 in memory)
    call next
next:
    pop rdi           ; rdi = address of next, the start of the search space
    and rdi, 0xfffffffffffffffc ; align down to a 4-byte boundary
search:
    scasd             ; compare the dword at [rdi] with eax, then advance rdi by 4
    jne search        ; no match: keep searching
    jmp rdi           ; match: rdi now points just past the egg
    
windows_detected:
    ; windows payload goes here
    
egg:
    dd 0x50905050

linux_detected:
    ; linux payload goes here

This structure enables easy assembly of payloads, as illustrated below.

[Figure: structure diagram]
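The egg search in the listing can be modeled on a byte buffer to see where execution ends up. This is a sketch only: the buffer layout and filler bytes are invented for illustration, and unlike the assembly loop, which has no bound, the model stops at the end of the buffer.

```python
import struct

EGG = 0x50905050  # same dword as in the listing


def hunt(memory, start):
    """Return the offset just past the first aligned EGG dword at or after start.

    Mirrors the listing: align the start down to a 4-byte boundary, then
    compare one dword at a time, advancing by 4 after every comparison.
    Returns None if no egg is found before the buffer ends.
    """
    offset = start & ~3
    while offset + 4 <= len(memory):
        (dword,) = struct.unpack_from("<I", memory, offset)
        offset += 4
        if dword == EGG:
            return offset
    return None


# Hypothetical layout: detector stub, Windows payload, egg, Linux payload.
image = (
    b"\x90" * 8                 # stand-in for the detector and egg hunter
    + b"\xcc" * 8               # stand-in for the Windows payload
    + struct.pack("<I", EGG)    # the egg marker
    + b"LINUXPAY"               # stand-in for the Linux payload
)

target = hunt(image, 5)
print(target, image[target:])  # 20 b'LINUXPAY'

# An egg that is not on a 4-byte boundary is never seen by the dword scan.
misaligned = b"\x90" + struct.pack("<I", EGG) + b"\x90" * 3
print(hunt(misaligned, 0))  # None
```

The second case shows a constraint of scanning in dword steps: the egg has to sit on a 4-byte boundary relative to the aligned start address, and the Windows payload must not contain the same dword at such a boundary.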

Wrapping It Up

I conducted a similar experiment with the same task using GPT-3 a while ago, and the results produced by GPT-4 are much better than what GPT-3 was capable of.

Nonetheless, someone without expert knowledge is probably not capable of using GPT-4 to produce working solutions for low-level system programming tasks. There is just too little “understanding” of assembly syntax, low-level operating system behavior, and mechanisms.

For my part, I am not worried about being replaced by a machine learning model anytime soon. I am hopeful, though, that I can automate an increasing share of the annoying and tedious tasks of my daily work using artificial intelligence.