I was supposed to write this months ago, but better late than never, right?
Disclaimer: It is recommended that you have some experience with programming or scripting and a basic understanding of "computer logic" in general. In fact, for this primer I will assume that you are somewhat familiar with an object-oriented programming language such as C++, Python, Delphi (which AoW was written in), Java etc. because otherwise I'd need to start with Ada and Steve.
Furthermore, I'm assuming you're using PE Explorer or a similarly capable disassembler and have figured out how to disassemble AoWEPACK.dpl, which is really the main engine library of AoW. You should also have a means of modifying AoWEPACK.dpl, either manually with a hex editor or by using a script.
WARNING! ONLY EVER OVERWRITE BYTES AND NEVER INSERT NEW BYTES! YOU HAVE BEEN WARNED!
(Also it might be a good idea to make backups before messing around with any files.)
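If you go the script route, a minimal Python sketch of "overwrite, never insert" could look like the one below. The function name overwrite_bytes and the ".bak" backup naming are made up for this illustration and are not part of any existing tool.

```python
import shutil
from pathlib import Path


def overwrite_bytes(path, offset, new_bytes):
    """Overwrite len(new_bytes) bytes at a file offset without changing the file size."""
    path = Path(path)
    size = path.stat().st_size
    if offset < 0 or offset + len(new_bytes) > size:
        raise ValueError("patch would run past the end of the file")
    # Keep an untouched copy next to the original before changing anything.
    backup = path.with_name(path.name + ".bak")
    if not backup.exists():
        shutil.copy2(path, backup)
    # "r+b" opens for in-place writing: existing bytes are replaced, nothing is inserted.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(new_bytes)
    # The file must not have grown or shrunk.
    assert path.stat().st_size == size
```

The size check at the end is the important part: if the file length ever changes, every address after the patch shifts and the library will most likely stop working.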
The following tools and resources will likely come in handy over the course of this crash course:
A hexadecimal calculator: calc.exe (set to "programmer")
A quick assembler/disassembler: https://disasm.pro (make sure to set to 32 bit)
A command line hex editor: https://anonfiles.com/jeU98dreu4/hex_zip (feel free to use a different one)
A visual explanation of x86-32 register bytes: http://www.godevtool.com/GoasmHelp/usreg.htm
A list of jump instructions and flags: http://www.unixwiz.net/techtips/x86-jumps.html
With that out of the way...
INTRO
Assembly (I will call it ASM from here on) is essentially a very low-level programming language that describes exactly what a program does. Unlike higher-level (more abstract) programming languages such as the ones mentioned earlier, ASM directly corresponds to the actual machine code which a program executes at runtime. Because of this, ASM can essentially be considered a more readable representation of machine code.
Because ASM is a very low-level language consisting of only a few basic building blocks, ASM code can get very confusing for a complex program. Furthermore, machine code that has been generated by a compiler is generally optimised for runtime rather than human readability. Luckily, the Borland Delphi compiler used for AoW produces machine code that is much closer to the original code than most other compilers, and hence more readable, but understanding ASM is still a challenge.
The first thing you need to understand about ASM is that everything is a memory operation. In a typical programming language you have abstract storage locations called variables where you can store values for later use. Such variables don't exist in ASM; instead you have only registers and the stack. I'll talk more about the stack (insert spooky noises here) another time; right now we'll just focus on registers.
REGISTERS
There are different types of registers, but the most important registers are general purpose registers or GPRs. These GPRs are essentially what you use in lieu of variables. The four GPRs you will deal with most are called A, B, C, and D (x86-32 has four more, ESI, EDI, EBP, and ESP, which tend to serve more specialised roles). Each of these registers can hold up to 32 bits of information. You may be familiar with terms such as "32-bit architecture" or "x86-32". Simply put, this refers to the size of the registers, which in our case is 32 bits.
32 bits is 4 bytes, which is important because bytes, not bits, are the building blocks of machine code. You can think of bytes as the underlying grid structure of a program, and in fact this is how hex editors typically display machine code. The visual representation of each byte is a hexadecimal two-digit number, with the first digit representing the upper 4 bits and the second digit representing the lower 4 bits. For example, if we store the decimal number 10 in a single byte, its representation is 0A. Accordingly, if we store the same value in a 32-bit register, its representation is 0000000A. To distinguish such hexadecimal representations from decimal values, they are usually written 0x0000000A or 0000000Ah, which not only gives information about the value itself, but also about the memory used to store it. As will be explained later, 0x0A and 0x0000000A are not necessarily identical.
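To make the notation concrete, here is a small Python sketch that prints the decimal value 10 as a single byte and as a dword, and splits a byte into its two hex digits. The example byte 0xA7 is an arbitrary value picked for the illustration.

```python
value = 10

# One byte is two hex digits, four bytes are eight hex digits.
as_byte = format(value, "02X")
as_dword = format(value, "08X")

print(as_byte)    # 0A
print(as_dword)   # 0000000A

# The upper and lower 4 bits of a byte map onto its two hex digits.
byte = 0xA7
upper, lower = byte >> 4, byte & 0x0F
print(format(upper, "X"), format(lower, "X"))  # A 7
```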
I should point out here that bytes do not actually have concrete values, as different instructions can interpret bytes in different ways. For example, the byte 0xFF can be interpreted as the decimal value 255, the decimal value -1, the Latin-1 character ÿ, or something else entirely. However, when it comes to reverse-engineering AoW, you can usually assume that the disassembler automatically chooses the correct representation for each case.
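The following Python sketch reads the same single byte 0xFF in three different ways. The character interpretation uses Latin-1, a single-byte encoding chosen here as an assumption for the sake of the example.

```python
import struct

raw = bytes([0xFF])

unsigned = struct.unpack("B", raw)[0]   # unsigned 8-bit integer
signed = struct.unpack("b", raw)[0]     # signed 8-bit integer (two's complement)
char = raw.decode("latin-1")            # character in a single-byte encoding

print(unsigned, signed, char)  # 255 -1 ÿ
```

The byte never changes; only the format code, i.e. the interpretation, does.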
This is especially important when it comes to integer values that consist of several bytes. As humans we are accustomed to reading numbers in big-endian format, i.e. with the most significant digit to the left and the least significant digit to the right. When looking at the AoW machine code, however, you will often find that integers are stored in little-endian format, with the least significant byte first and the most significant byte last. So our decimal number 10 will be stored in 4 bytes as 0x0A, 0x00, 0x00, and 0x00, which the disassembler automatically interprets as 0x0000000A.
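As a quick sanity check, this Python sketch stores the decimal number 10 as a little-endian dword and reads it back. It only illustrates the byte order and is not taken from the AoW code.

```python
import struct

value = 10
stored = struct.pack("<I", value)       # 4 bytes, least significant byte first
print(stored.hex(" "))                  # 0a 00 00 00

# Reading the bytes back as a little-endian integer restores the value.
restored = int.from_bytes(stored, "little")
print(format(restored, "08X"))          # 0000000A

# The same four bytes read as big-endian give a very different number.
print(hex(int.from_bytes(stored, "big")))  # 0xa000000
```

This is why a dword you see in the disassembler appears "backwards" when you look for it in a hex editor.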
Of course, we don't always need 4 bytes of storage, as most smaller numbers can be stored in a single byte. With GPRs it's actually possible to access certain bytes directly, and in fact many instructions chosen by the compiler do so, so it's important to recognise and understand them.
Back in 8-bit days, each GPR had exactly one byte and all was well, but with the advent of 16-bit there came a need to reference each byte of a register independently, and so the 16-bit GPRs were split into lower and higher bytes. For example, if we store the decimal value 10 in the 16-bit register A, the lower byte AL will have the value 0x0A, while the higher byte AH will have the value 0x00. The combination of both AL and AH is called AX (the X stands for extended, I guess). In 32-bit architecture, we can also use the entire 32-bit register A, which is then called EAX (the E stands for extended, I guess). The same of course also goes for the B, C, and D registers; in practice however, you will pretty much only need and encounter the lowest byte and the full 32-bit register, i.e. AL, EAX, BL, EBX, CL, ECX, DL, and EDX.
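The relationship between EAX, AX, AH, and AL can be sketched with a few bit masks in Python. The function names split_register and set_al are hypothetical helpers for this illustration, and 0x12345678 is just an arbitrary test value.

```python
def split_register(eax):
    """Return the AX, AH and AL views of a 32-bit EAX value."""
    ax = eax & 0xFFFF          # lower 16 bits of EAX
    ah = (eax >> 8) & 0xFF     # higher byte of AX
    al = eax & 0xFF            # lower byte of AX
    return ax, ah, al


def set_al(eax, value):
    """Writing AL replaces only the lowest byte and leaves the rest of EAX alone."""
    return (eax & 0xFFFFFF00) | (value & 0xFF)


ax, ah, al = split_register(0x12345678)
print(hex(ax), hex(ah), hex(al))        # 0x5678 0x56 0x78
print(hex(set_al(0x12345678, 0x0A)))    # 0x1234560a
```

The second function shows why it matters which part of a register an instruction touches: storing a value in AL does not clear the upper three bytes of EAX.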
Two bytes are called a word, and four bytes (e.g. EAX) are called a dword (double word). You can find the term dword occasionally in ASM code and documentation, where it simply means 32-bit, e.g. DWORD PTR marks a memory operand that is 32 bits wide (more on pointers in the future). In the AoW code, and in fact in most 32-bit programs, the vast majority of values you deal with are either dwords or single bytes.
ADDRESSES
When a program is executed, the operating system loads not only the executable file itself into memory, but also all its dependencies (libraries), and essentially creates one large virtual file that contains all the machine code the program ever needs. Each byte of a file then exists in memory, where it has a virtual address. These virtual addresses are the numbers displayed to the left in the main window of your disassembler.
In a 32-bit program, addresses are always dwords. For example, if you look at AoWEPACK.dpl in PE Explorer and scroll to the top, you'll see that the address of the very first byte is 0x55701000. For reverse engineering, it's important to understand how compiled binary files (i.e. executables and libraries) map onto the virtual address space. To this end, a disassembler will display information about the virtual address space at the beginning of each file segment.
Right above the first address 0x55701000 is a block of meta information labelled Code Section, of which two values are of particular interest to us: Virtual Address, which is 0x55701000, and Pointer to RawData, which is 0x00000400. We subtract the second from the first and get 0x55700C00, which is the difference between the virtual address and the file offset for all bytes in the code segment. This means that if you want to modify e.g. the byte at 0x55764D0D with a hex editor, you would actually need to look for the byte at 0x0006410D.
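The subtraction above is easy to get wrong by hand, so here is a minimal Python sketch of it. The two constants are the code section values quoted above; the function name va_to_file_offset is a hypothetical helper for this illustration.

```python
CODE_VIRTUAL_ADDRESS = 0x55701000   # "Virtual Address" of the code section
CODE_RAW_POINTER = 0x00000400       # "Pointer to RawData" of the code section


def va_to_file_offset(va, section_va=CODE_VIRTUAL_ADDRESS, raw_pointer=CODE_RAW_POINTER):
    """Convert a virtual address inside a section to an offset in the file on disk."""
    if va < section_va:
        raise ValueError("address lies before the start of the section")
    return va - (section_va - raw_pointer)


print(format(CODE_VIRTUAL_ADDRESS - CODE_RAW_POINTER, "08X"))  # 55700C00
print(format(va_to_file_offset(0x55764D0D), "08X"))            # 0006410D
```

The same function works for any other section once you plug in that section's own Virtual Address and Pointer to RawData.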
If you scroll down to the virtual address 0x558E8000, you'll find another block of meta information, this one labelled Data Section. This is where the data segment starts, and using the same method we can calculate the difference between virtual address and file offset for the data segment, which is 0x55701200.
For the moment, this is all we need to know about addresses. We'll look at addresses again in more detail once we've learned about pointers and the stack.
[This message has been edited by And G (edited 06-19-2021 @ 10:35 PM).]