Go up to the main DADA homeworks page (md)
This assignment will refresh your knowledge of x86 assembly language (which you will have to analyze in all the virus detection assignments later in this semester), expose you to the "tricky jump" style of code often used by virus writers, and familiarize you with a couple of key tools used in the analysis of programs in their binary form.
The reference platform for this project is the 64-bit Linux VirtualBox image. Although most viruses are for the Windows platform, for reasons described in class, we are using the Linux platform to make this assignment easier to perform. Similar binary analysis tools exist on Windows (and Mac OS X) platforms.
You are welcome to use Mac OS X -- or even Windows -- to write and compile your code. However, it needs to work under Linux. Your C/C++ code may have to call _foo()
instead of foo()
- try each one to see which one works. When you submit it, it should not have the underscore. The binary analysis will have to be done under the reference platform of 64-bit Linux.
main()
function, and the numerical values of the length, width, height, and weight of the object are to be passed in as parameters to the subroutine(s).main()
, passing back the density (not volume!) as the return value.volume
!volume
), and put those answers into a file named analysis.pdf.Your Makefile MUST generate an executable called volume
! Anything else will not be graded!
Use the following code for your main.cpp. Note that, for the parameters given, the volume is 960 and the density is 5.
#include <iostream>
using namespace std;
extern "C" int volume (int, int, int, int);
int main () {
int len = 12, width = 8, height = 10, weight = 4800;
int vol = volume(len, width, height, weight);
cout << "volume: " << vol << endl;
return 0;
}
The questions to answer are below. There are a number of command-line tools that can be used to generate these answers, which are listed by each question. You can type man foo
at the Linux command prompt to see the full manual page about the 'foo' program. Some of the commands below require to to use specific flags described in the manual page for each command.
See "Notes on interpreting the objdump file", below, for help interpreting the output.
main()
?nm
command; there should be 50 or so. Which of these do you recognize? Google for some others to determine what their purpose is.objdump -sRrd volume
. WARNING: this output is not in the assembly format that was studied in CS 2150; see here for the details of the differences.
volume()
?__libc_start_main
is passed several (constant) addresses. What do these appear to represent? (It is sufficient to identify what things they are addresses of.)readelf -h
.
Answer these questions in a PDF file called analysis.pdf. Copy the questions above (you can from the web page copy of this assignment) and type your answers below each question. Below each answer, copy and paste the specific portions of the tools output that were used to answer the question.
Write a separate C or C++ program that takes in, as the input, an objdump from the above executable. It should read it in as standard input - we are not dealing with file input and output. First, you may want to save the output of objdump to a file:
objdump -sRrd volume > input.txt
Your program must read in that file via standard input:
./tricky < input.txt
Your program is to detect if there is a tricky jump in that code. You can assume a few things to make this easier:
objdump -sRrd
outputs; this means the commands you are looking for are pushq
and retq
If a tricky jump is found, then you should output "tricky jump found!" for each time it's found. If there are no tricky jumps, then output "tricky jump not found".
We are not looking for error-proof or long-winded code! The reference solution was about 20 lines.
Your Makefile MUST generate an executable called tricky
! Anything else will not be graded!
You will need to create a Makefile that will compile the programs. There are three required aspects to the Makefile.
volume
tricky
You may want to start with the Makefile for the vecsum program provided in CS 2150 (see here). That example uses clang++ as the compiler, but you can use either clang++ or g++.
It's fine if your source code files are named something slightly different, as long as your Makefile can compile both of them. We just have to be able to reasonably tell which source code file name is which. But the executables need to be named exactly the same as specified herein.
This section (and some of the questions for analysis.pdf) was taken with permission from Charles Reiss (original was here).
General format
The objdump output we provide contains several parts corresponding to several parts of the executable, which are described in more detail below:
Information about the type of executable file. This indicates what architecture it is for, that it is in Linux's ELF format, and the address at which execution of the program starts.
The actual contents of each "section" of the executable that will be loaded into memory. The sections have names like ".text" and ".dynstr" depending on their purpose. These will look something like:
Contents of section .text:
4004d0 4883ec28 ba050000 00be0c00 00004889 H..(..........H.
4004e0 e764488b 04252800 00004889 44241831 .dH..%(...H.D$.1
4004f0 c0e83a01 0000488d 74240c48 89e031d2 ..:...H.t$.H..1.
The leftmost column indicates the address (in hexadecimal) where this data will be loaded in memory. The next four columns are the hexadecimal values actually placed in memory. These values are written in the order the bytes appear in memory, so the value "0x12345678" in little Endian will appear as "78563412". The final columns are the same values represented as characters, except a period (.
) is used to represent bytes which do not correspond to a printable ASCII character.
Disassembled versions of sections that contain executable code. These will look something like:
0000000000400460 <_init>:
400460: 48 83 ec 08 sub $0x8,%rsp
400464: 48 8b 05 8d 0b 20 00 mov 0x200b8d(%rip),%rax # 600ff8 <_DYNAMIC+0x1d0>
40046b: 48 85 c0 test %rax,%rax
40046e: 74 05 je 400475 <_init+0x15>
400470: e8 3b 00 00 00 callq 4004b0 <__gmon_start__@plt>
400475: 48 83 c4 08 add $0x8,%rsp
400479: c3 retq
This indicates that there is a label called _init
which has the address "0x400460" when the executable is loaded. Each following line is an instruction. The value before the colon indicates the memory address in hexadecimal of the first byte of the instruction. The hexadecimal values after the colon are the bytes of the instruction in hexadecimal. Following this is the disassembled instruction itself.
Within the disassembled instructions, objdump
attempts to provide information about addresses in addition to showing the addresses encoded in the instruction. In cases where the label is exactly equal to the address, like for the label __gmon_start__@plt
in the example above, the format is address <LABEL>
with the address in hexadecimal. In cases where the address does not correspond to a label, the format is address <LABEL+offset>
. For example 400475 <_init+0x15>
indicates the address "0x400475", which is 0x15 bytes after the label _init
.
On 64-bit x86, some instructions specify an address relative to %rip
. %rip
represents the "instruction pointer", which in 2150 and 3330 we have called the "program counter". It is the address of the current instruction, so 0x200b8d(%rip)
means memory 0x200b8d bytes after the address of the current instruction. objdump
's disassembly includes a comment indicating what address is computed. In the case of the example above, the address is 0x600ff8, which is 0x1d0 bytes after the label _DYNAMIC
.
Note that this is not all the information in the executable and not all the information that objdump is capable of providing.
On dynamic linking
This executable is dynamically linked, so it doesn't include code for functions in the C standard library like printf()
. These are loaded at runtime by the dynamic linker which is contained in /lib64/ld-linux-x86-64.so.2
. The way Linux implements dynamic linking involves having this program handle loading all dynamically linked executables as an interpreter.
As part of Linux's implementation of dynamic linking, there is a Procedure Linkage Table (PLT). This contains "stubs" for each function the executable expects to find in a dynamically linked library, like the C standard library. One of the "stubs" looks like:
00000000004004c0 <__printf_chk@plt>:
4004c0: ff 25 6a 0b 20 00 jmpq *0x200b6a(%rip) # 601030 <_GLOBAL_OFFSET_TABLE_+0x30>
4004c6: 68 03 00 00 00 pushq $0x3
4004cb: e9 b0 ff ff ff jmpq 400480 <_init+0x20>
This stub is called __print_chk@plt
and is loaded into the program's memory at address 0x4004c0. The first instruction in this function reads the address of a function from memory at 0x601030, then jumps to that function. As indicated by the comment added by objdump
this address is part of the "global offset table". This is an array of pointers used to find functions like printf()
which are loaded every time the executable runs. Using this table allows the same program to work with different implementations of printf()
, where printf may end up at different locations in memory. For example, in this case the global offset table will eventually contain the address at which __printf_chk
, part of the Linux C library's implementation of printf()
is loaded into memory.
By default, the values in this global offset table are initialized to point to the instruction following the jump, for example 0x601030 contains 0x4004c6. This means that the first time the "stub" is called, it will "fall through" to the code after the global offset table jump. This code pushes an indicator of what function was called on the stack, then jumps to part of the dynamic linker. (This code is not included in the executable file, and therefore not present in the objdump output.) The dynamic linker will then locate the actual routine (the implementation of __printf_chk
in the standard library, in this case) and update the global offset table to contain its address.
On _start
Execution of the program does not actually start in main but starts in a function called _start
that is provided by the compiler -- this is the start address specified in the program header. This function calls a special function in the C standard library called __libc_start_main
. It is this function that actually calls main()
and takes care of exiting when main()
returns.
On %fs
x86 has a feature called "segmentation". As part of this feature, the processor has several "segment registers" which specify a region of memory -- essentially the segment register acts as a pointer. %fs:0x28
specifies to use segment register %fs
and access a value 0x28 bytes from the beginning of the memory region it identifies.
On Linux, the %fs
segment register is used for "thread-local storage" -- to point to a block of data particular to a thread, even in a multithreaded process.
On Windows, the %gs
segment register is used for something similar.
The use of a segment register for this purpose instead of a normal register is just to make sure as many registers are available to the program as possible.
Segmentation was originally intended to provide functionality similar to virtual memory. These days, it is rarely used for this purpose, and its primary use is to support thread-local storage, as occurs briefly in the assembly in this assignment. It is, still, however, universally present on x86 and is entangled with x86's implementation of kernel mode and exceptions.
These files need to be submitted to the HW 2 assignment via the course tools submission.
main()
function for the volume/density code (from part 1)main()
function for the tricky jump detection code from part 3volume
and tricky
!