A quine in x86-64 assembly

This summer term, I'm taking a really interesting course on computer security: While the lectures are pretty theoretical (one of the topics is a proof that shows that proving the general security properties of certain models is equivalent to the halting problem, which is done by implementing a turing machine within the access model...), the homeworks are partially about x86(-64) assembly programming. My only assembly programming experiences until now were with MMIX, which is almost completely on the opposite end of the RISC/CISC spectrum than the good old x86 (not to mention that the architecture is completely theoretical and has never been implemented in hardware). To make a long story short, I finally have a reason to program some x86 assembly!

Our most recent exercise sounds quite simple, but kept me busy longer than I expected: We are supposed to write a quine in x86-64 assembly, that is, a program that has its own source code as its (only) output – identical to the last byte. I was already familiar with quines, but I had never tried to create one before, and if I had, assembly would definitely not have been my language of choice, but since it was one of two non-optional homework exercises, I figured that it couldn't be so hard after all. So without further ado, here is a quine in x86-64 assembly (GNU assembler syntax):

.att_syntax noprefix
.globl main
main:
pushq rbp
movq rsp, rbp
mov $.Cs, rdi
mov $0xa, rsi
mov $0x22, edx
mov $.Cs, ecx
mov $0x22, r8d
mov $0xa, r9d
xor eax, eax
call printf
xor eax, eax
leave
ret
.Cs: .string ".att_syntax noprefix
.globl main
main:
pushq rbp
movq rsp, rbp
mov $.Cs, rdi
mov $0xa, rsi
mov $0x22, edx
mov $.Cs, ecx
mov $0x22, r8d
mov $0xa, r9d
xor eax, eax
call printf
xor eax, eax
leave
ret%c.Cs: .string %c%s%c%c"

It can be compiled, executed and verified for proper quine-ness like this:

gcc quine.s -o quine && ./quine > output && diff quine.s output

As do most quines, this probably needs some explanation. Generally speaking, all quines (at least the ones I've come across) share a common structure: There is some code of the language in question, and one or more rather long strings, which contain most of that code in quoted form. The trick is to print the quoted code twice: Once verbatim, and once with quotation marks and some additional characters, so that the string declaration itself is printed out. To do that, the string quotation signs have to be escaped or stored in some modified form – otherwise, they would be interpreted simply as the end of the string.

The most obvious guess is to just escape them with a backslash (like so: \"), but that doesn't really help us – now we also have to print a backslash in our output! The solution here (and in many other quines) is to store them in another form that can be safely quoted verbatim inside of a string, but still evaluates to the desired character in the output. In this case, the printf C library function is used to recreate two occurences of two otherwise problematic characters: The aforementioned quotation mark and the newline character. Both characters are stored as their numeric (or more specific, hexadecimal) representation according to the ASCII codepage: 0xa for the newline, and 0x22 for the double quotation mark. All that I was able to learn from the very nice example for a quine in C given in the (German) Wikipedia article on the subject, which also taught me the really neat trick of using the same string twice – once as a formatting string, and once again as one of the parameters of the printf function.

My approach was then to find a valid translation for that quine to assembly, which revolved mostly around two problems: Generating the machine instructions for the printf call, and escaping all occurences of problematic characters in the resulting program so that it can be stored as a valid string.

The first part of the problem can be easily solved by some experimentation with a similar C program and by disassembling the binary as compiled by GCC (and identifying the relevant lines in the output!) – it boils down to moving the address of the string and the literal values of the ASCII characters to the right registers (according to the x86-64 calling convention) and executing the call.

A minor detail of interest is the instruction xor eax, eax right before the function call: As it turns out, functions with variable-length parameter lists like printf expect the number of parameters passed to them in the eax register; in this case, there are exactly zero. I can only speculate about the reasons for this part of the calling conventions (after all, the total number of arguments is not passed in a register!), but I gather it has something to do with possible optimizations in functions further down/up the call stack, since saving those registers is rather costly and should be avoided if not necessary. I only figured out the importance of zeroing the register when I tried the program on a workstation at my university – while I could get away without it on my own laptop, it would invariably crash there without that instruction.

Another problem would have been the newline characters: Since the GASM syntax requires a newline after every machine instruction statement, it's not possible to get away like the C quine from the mentioned article, that is, to simply write the quine on a single line. Fortunately, GCC/GASM does the right thing when confronted with "raw" newline characters inside of a string, and just treats them the same way it would handle a proper \n newline. This causes some warnings from my version of GCC, but compiles/assembles nevertheless – otherwise, all the newline characters would probably have to be submitted as parameters for printf.

If you are familiar with the GASM assembly syntax, you might have noticed a minor oddity about the code: Register names are not prefixed with a percent sign as they usually have to be. The reason for that is that the percent sign has a very special meaning in the formatting string parameter of printf – it indicates that the following character(s) should be interpreted as a formatting directive, and replaced by a specific parameter of the function! This leads to a problem similar to simple backslash escaping: For every percent character, an ASCII-encoded percent sign has to be given to printf as a parameter, but for every new parameter, we need a new mov instruction to a register – which includes a percent sign...

This part of the problem is actually easier to solve on x86 (the 32 bit variant): Since all function parameters are there passed on the stack, they can be pushed there with the push instruction (push $0x22, push $0x0a, ...) – no percent sign necessary! On x86-64, the first 13 parameters are passed in processor register instead, which means that additional parameters would have to be generated by using up the first few parameter slots in a way that still creates the same output – not impossible, but very tedious (both in manual execution and code size/readability).

A trick to circumvent that problem is the use of the .noprefix directive of GCC/GASM: Since a percent sign in front of a variable is only a visual aid to the programmer and not necessary to correctly parse the program, this option allows us to simply omit all the percent prefixes – which is just what we need.

After the encoding has been taken care of, all that is left is the exact structure of the printf format string. As I've mentioned, every occurence of a percent character is replaced by one of the function parameters in the output, and by careful construction of the formatting string, together with the trick of using the very same string as both a formatting specification for printf and a parameter being substituted inside that formatting string it is possible to create an output that is exactly identical to the source code – a quine!

Comments !

blogroll

social