Introduction
A buffer overflow is a very well-known vulnerability that occurs when it is possible to put more data into a buffer than that buffer can hold. In the coming tutorials, we will learn about this vulnerability and how it can be exploited. In this tutorial, however, we will go over some concepts that will be necessary to know in order to understand the buffer overflow.
The Stack Structure
The stack is a data structure that holds information about functions while the program is executing. Whenever a function is called, a new stack frame is pushed onto the stack. When the function is finished executing, its stack frame is popped off the stack. Consider the following program
#include <stdio.h> int first(int f1, int f2, int f3); void second(int s1, char s2); int main(){ first(1,2,3); second (4,'a'); return 0; } int first(int f1, int f2, int f3){ printf("My numbers: %d, %d, and %d\n", f1,f2,f3); int sum = f1 + f2 +f3; return sum; } void second(int s1, char s2){ s1++; }There are four functions: main(), first(), second(), and printf(). When the program first starts, there is one stack frame on the stack for main(). Then when the program gets to first(), it puts another stack frame on main()'s frame. Then when it gets to printf, it puts the another frame on top of the one for first. When printf finishes printing, the printf frame gets popped off leaving just main and first.
Below is a depiction of how the stack frames would look like during the execution of the program.
|printf| low addresses -------- |first| |first| |first| |second| -------- -------- -------- -------- -------- |main| --> |main| --> |main| --> |main| --> |main| --> |main| --> |main| high addressesNotice that the frames get put on top of the previous frame. This is because the stack data structure works in a First In Last Out (FILO) structure (aka Last In First Out - LIFO), so the first frame that gets pushed onto the stack (main) is the last frame to pop off the stack. The stack also grows from high addresses to low addresses. Thus, the address of main()'s frame is higher than the address of printf(), for example.
Detour into Registers and Assembly
Now that we understand how the stack changes during program execution, we now need to understand the stack frame. But before we can get into that, we need to talk about registers and the x86 instruction set.
Note: When talking about numerical values in assembly, we will always use base 16 numbers (ex: 0xAA, 0x14, 0x2FCD).
When we program, we have our high level C code, but when we compile it, it becomes machine code. Assembly is an interpretation of machine code (opcode) that makes it human readable. There are two main flavors/syntax of assembly: intel and At&T. We will use intel in this tutorial.
An assembly program is comprised of a series of instructions that the CPU executes. Each instruction is made of a mnemonic, a source operand, and a destination operand. The syntax for an instruction is
mnemonic destination_operand source_operand mov eax, 0x55Here, we are moving the value 0x55 into the register, eax.
A register is a data storage that the program uses to help execute the program. In the x86 processor, we have general, segment, status, and instruction registers.
For now, we only really need to worry about the general registers: EAX, EBX, ECX, EDX, EBP, ESP, ESI, EDI. Each of these registers have 32 bits of memory (or 4 bytes), and they can be referenced by their lower 16 bits, AX, BX, CX, DX, BP, SP, SI. So for example if EAX contains the 32 bit value: 0x12345678. Then AX would contain 0x5678.
For AX, BX, CX, and DX, we can also refer to their higher and lower 8 bits. For AX from the same example, the higher 8 bits, AH, would contain 0x56, and the lower 8 bits, AL, contain 0x78.
A register can hold both values and addresses.
mov eax, 0x080483fc mov ebx, [eax]In the first line, we move the value 0x080483fc into eax. In the next line, we move the value at the address contained in eax to ebx. So if the address 0x080483fc contained the value 0x5, then we would be moving 0x5 into ebx.
For the purpose of understanding stacks, there are three registers we need to know about: EBP, ESP, and EIP.
EIP is the instruction pointer, and it contains the memory address of the next instruction that the program will execute. In other words, it points to the next instruction that we are executing. When we call a function, EIP gets the address of the function's first instruction. When we later talk about exploiting buffer overflows, our objective will be to control EIP so that we can tell the program what instructions we want it to execute.
ESP is the stack pointer, and it points to the top of the stack. Whenever we push things onto the stack, ESP decreases in value since the stack grows towards lower addresses. It points to the last thing that we have pushed onto the stack.
EBP is the base pointer, and it points to the bottom of the current stack frame.
Stack Frame, Function Prologue, Function Epilogue
Now that we understand some of the registers and assembly, we can get back to the stack frame. Right before a function is called, the function's arguments get pushed onto the stack in reverse order. So notice that first() has three parameters f1, f2, f3 with the values 1, 2, 3 respectively. These get pushed onto the stack like this:
low address --------------------------------- 1 <--ESP 2 3 |main| <--EBP high address ---------------------------------(In these diagrams, the dotted lines are the stack frame boundaries.)
Now when the function is called, EIP first pushes its value (the address of the next instruction) onto the stack. This is called the return address since it is the address of the instruction that our program will return to executing right after we finish executing our function.
After pushing the return address, EIP will get the address of our function's instructions and start executing.
Every function, has a function prologue that occurs before the function carrys out what we've programmed it to do. The prologue pushes the current value of EBP onto the stack. Then it changes the value of EBP to point at this value at the top of the stack. This marks the beginning of the new stack frame.
low address |first| Old EBP <-- EBP <--ESP --------------------------------- return address 1 2 3 |main| high address ---------------------------------After this is done, the prologue allocates space for local variables by subtracting from ESP. Similarly, space is later freed on the stack by adding to ESP.
--------------------------------- low address | 24 bytes for <--ESP |local variables | |first| Old EBP <-- EBP (doesn't change within the function) --------------------------------- return address 1 2 3 |main| high address ---------------------------------So now that the function is set up, it carries out its instructions, and then when it finishes, the function goes through the function epilogue, which restores the stack to the calling frame. (In the case of first() and main(), the calling frame is main()). It set ESP equal to EBP and pops the old EBP value off the stack and sets EBP to it. The assembly for this is usually "mov esp, ebp" then "pop ebp", but the x86 instruction set allows you to use "leave", which will do the same thing.
Old EBP <--EBP <--ESP low address ------------------ ----------- return address return address <--ESP 1 1 2 2 3 3 |main| |main| <--EBP ----------------- ----------- high address //set esp = ebp //pop ebpAfter doing this, the program then pops the return address off the stack and sets EIP equal to it, which has the effect of returning execution to our main method instructions. In x86 instruction set, we simply use "ret" to do this.
When EIP returns to the instructions in main, we add to ESP in order to restore the memory we allocated when we pushed the function arguments.
low address ------------------ ----------- 1 <--ESP 2 3 <--ESP |main| <--EBP |main| <--EBP ----------------- ----------- high address //ret //free space by adding to ESP
x86 Assembly
Now that we understand what is going on in the stack, let's try to see the stack changes in assembly. Below is the assembly for the main() function. The semicolons and the words following are comments that I've added to make reading the disassembly easier.
hypervelocity ~ $ gcc -m32 -o stackexample stackexample.c hypervelocity ~ $ gdb -q ./stackexample (gdb) set disassembly-flavor intel (gdb) disassemble main Dump of assembler code for function main: 0x080483eb <+0>: lea ecx,[esp+0x4] 0x080483ef <+4>: and esp,0xfffffff0 0x080483f2 <+7>: push DWORD PTR [ecx-0x4] 0x080483f5 <+10>: push ebp ;function prologue 0x080483f6 <+11>: mov ebp,esp 0x080483f8 <+13>: push ecx 0x080483f9 <+14>: sub esp,0x4 0x080483fc <+17>: sub esp,0x4 0x080483ff <+20>: push 0x3 ;push arguments for first() 0x08048401 <+22>: push 0x2 0x08048403 <+24>: push 0x1 0x08048405 <+26>: call 0x8048429 ;call first 0x0804840a <+31>: add esp,0x10 ;free 16 bytes 0x0804840d <+34>: sub esp,0x8 ;allocate 8 bytes 0x08048410 <+37>: push 0x61 ;push arguments, 0x61 = 'a' in ASCII 0x08048412 <+39>: push 0x4 0x08048414 <+41>: call 0x804845a ;call second 0x08048419 <+46>: add esp,0x10 ;free 16 bytes 0x0804841c <+49>: mov eax,0x0 0x08048421 <+54>: mov ecx,DWORD PTR [ebp-0x4] 0x08048424 <+57>: leave 0x08048425 <+58>: lea esp,[ecx-0x4] 0x08048428 <+61>: retAt line 20 (0x080483ff), we can see the that the parameters for first() are being pushed onto the stack in reverse order. We call first() at line 26. Notice that it says call 0x8048429, which matches the address of the first instruction listed in the assembler code for first() .
The return address that we push on the stack before calling first() will be 0x0804840a since that is the address of the next instruction after our function call.
Below are the assembler codes for for first() and second().
(gdb) disassemble first Dump of assembler code for function first: 0x08048429 <+0>: push ebp ;function prologue 0x0804842a <+1>: mov ebp,esp 0x0804842c <+3>: sub esp,0x18 ;allocate 24 bytes 0x0804842f <+6>: push DWORD PTR [ebp+0x10] ;push f1,f2,f3 parameters 0x08048432 <+9>: push DWORD PTR [ebp+0xc] ; onto the stack for printf 0x08048435 <+12>: push DWORD PTR [ebp+0x8] 0x08048438 <+15>: push 0x80484f0 ;push the string argument for printf 0x0804843d <+20>: call 0x80482c0 <printf@plt> ;call printf 0x08048442 <+25>: add esp,0x10 ;free 16 bytes from stack 0x08048445 <+28>: mov edx,DWORD PTR [ebp+0x8] ;mov f1(1) to edx (1) 0x08048448 <+31>: mov eax,DWORD PTR [ebp+0xc] ;mov f2 to eax (2) 0x0804844b <+34>: add edx,eax ;add eax to edx (edx = 3) 0x0804844d <+36>: mov eax,DWORD PTR [ebp+0x10] ;mov f3 to eax (3) 0x08048450 <+39>: add eax,edx ;add edx to eax (6) 0x08048452 <+41>: mov DWORD PTR [ebp-0xc],eax ;prepare to return (6) 0x08048455 <+44>: mov eax,DWORD PTR [ebp-0xc] ; by moving into eax 0x08048458 <+47>: leave 0x08048459 <+48>: ret
(gdb) disassemble second Dump of assembler code for function second: 0x0804845a <+0>: push ebp ;prologue 0x0804845b <+1>: mov ebp,esp 0x0804845d <+3>: sub esp,0x4 ;allocate 4 bytes 0x08048460 <+6>: mov eax,DWORD PTR [ebp+0xc] ;move parameter value to eax ('a') 0x08048463 <+9>: mov BYTE PTR [ebp-0x4],al ;move al (lower 8 bits = 'a') to local var 0x08048466 <+12>: add DWORD PTR [ebp+0x8],0x1 ;add 1 to parameter (4) = 5 0x0804846a <+16>: nop 0x0804846b <+17>: leave 0x0804846c <+18>: retNotice that during the function's execution, variables are referenced relative to EBP by adding or subtracting offsets. Variables declared inside the function are EBP subtracted by some value, and the parameter values are referenced with EBP+some value. This is consistent with what we know about stack structure since parameter values are pushed first so they are at higher addresses than EBP, and local variables are allocated after EBP, so they are at lower addresses.
A DWORD is 4 bytes (or 32 bits) and it means "double word". A WORD therefore, is 2 bytes. The DWORD PTR is therefore a pointer to a 4 byte value. For example, in line 6 of the assembler code for second, the instruction is saying to move the value located at ebp+0xc into EAX. Notice the brackets around ebp+0xc, which tells us that we are using the value at the address ebp+0xc and not simply the value ebp+0xc.
Also, we know that this value will be the character 'a'. The positive offset from EBP means that it is a parameter value, and we can see in line 12, a DWORD PTR to ebp + 0x8, which we can infer is the other parameter value. Since 0xc is higher than 0x8, we know the [ebp+0xc] value was pushed on first, which means it's the second parameter of our function, char s2, which is 'a'.
Variations in Assembly
What you see when you disassemble your executable will depend on how the program was compiled. The assembly instructions you get from an executable compiled by different compilers will be different but not unrecognizably different.
For example, calling conventions, the way the stack is set up and restored for function calls, will vary depending on the compiler used. So even though the C source code of the program is the same, the assembly instructions will be different. For example, some compilers will not push parameters onto the stack but will move them instead. If you were trying to follow along on OSX, you will notice that the assembly you see is pretty different. That is because OSX uses Clang, which is a different compiler than GCC even though you might have typed gcc into the terminal.
You may have also noticed that not all the instructions in the assembly we have above 'makes sense'. For example, in lines 14 and 17 of the main function, we allocate 4 bytes on the stack. Why couldn't we have just allocated 8 bytes at one time? And after we return from first, why do we free 16 bytes then allocate 8 bytes especially when it seems that we only allocated 8 bytes before calling first? And also, in lines 41 and 44 of first(), why do we move eax to DWORD PTR [ebp-0xc] and then immediately move DWORD PTR [ebp-0xc] back to eax? These reasons for these discrepancies lie in compiler optimization. As computers advanced, people figured out ways to make compilers better in terms of making the resulting executable faster and more efficient. The drawback, however, is that the resulting code is not very intuitive. We can disable optimization by compiling with -O0 option for now, but keep in mind that executables in the 'real world' will not have optimization disabled. In fact, some will even have obfuscating, or anti-reversing, features that will make the code even more confusing. However, that is not something we need to worry about right now. So, if we were to compile without optimizations, then we would get the assembly results below.
hypervelocity ~ $ gcc -g -m32 -O0 -o stackexample stackexample.c hypervelocity ~ $ gdb -q ./stackexample (gdb) set disassembly-flavor intel (gdb) disassemble main Dump of assembler code for function main: 0x080483e4 <+0>: push ebp 0x080483e5 <+1>: mov ebp,esp 0x080483e7 <+3>: and esp,0xfffffff0 0x080483ea <+6>: sub esp,0x10 0x080483ed <+9>: mov DWORD PTR [esp+0x8],0x3 0x080483f5 <+17>: mov DWORD PTR [esp+0x4],0x2 0x080483fd <+25>: mov DWORD PTR [esp],0x1 0x08048404 <+32>: call 0x8048424 <first> 0x08048409 <+37>: mov DWORD PTR [esp+0x4],0x61 0x08048411 <+45>: mov DWORD PTR [esp],0x4 0x08048418 <+52>: call 0x804845f <second> 0x0804841d <+57>: mov eax,0x0 0x08048422 <+62>: leave 0x08048423 <+63>: ret End of assembler dump. (gdb) disassemble first Dump of assembler code for function first: 0x08048424 <+0>: push ebp 0x08048425 <+1>: mov ebp,esp 0x08048427 <+3>: sub esp,0x28 0x0804842a <+6>: mov eax,0x8048550 0x0804842f <+11>: mov edx,DWORD PTR [ebp+0x10] 0x08048432 <+14>: mov DWORD PTR [esp+0xc],edx 0x08048436 <+18>: mov edx,DWORD PTR [ebp+0xc] 0x08048439 <+21>: mov DWORD PTR [esp+0x8],edx 0x0804843d <+25>: mov edx,DWORD PTR [ebp+0x8] 0x08048440 <+28>: mov DWORD PTR [esp+0x4],edx 0x08048444 <+32>: mov DWORD PTR [esp],eax 0x08048447 <+35>: call 0x8048300 <printf@plt> 0x0804844c <+40>: mov eax,DWORD PTR [ebp+0xc] 0x0804844f <+43>: mov edx,DWORD PTR [ebp+0x8] 0x08048452 <+46>: add eax,edx 0x08048454 <+48>: add eax,DWORD PTR [ebp+0x10] 0x08048457 <+51>: mov DWORD PTR [ebp-0xc],eax 0x0804845a <+54>: mov eax,DWORD PTR [ebp-0xc] 0x0804845d <+57>: leave 0x0804845e <+58>: ret End of assembler dump. (gdb) disassemble second Dump of assembler code for function second: 0x0804845f <+0>: push ebp 0x08048460 <+1>: mov ebp,esp 0x08048462 <+3>: sub esp,0x4 0x08048465 <+6>: mov eax,DWORD PTR [ebp+0xc] 0x08048468 <+9>: mov BYTE PTR [ebp-0x4],al 0x0804846b <+12>: add DWORD PTR [ebp+0x8],0x1 0x0804846f <+16>: leave 0x08048470 <+17>: ret End of assembler dump.
No comments:
Post a Comment