DADA: HW 2: x64 assembly

Introduction

This assignment will refresh your knowledge of x86 assembly language (which you will have to analyze in all the virus detection assignments later in this semester), expose you to the "tricky jump" style of code often used by virus writers, and familiarize you with a couple of key tools used in the analysis of programs in their binary form.

The reference platform for this project is the 64-bit Linux VirtualBox image. Although most viruses are for the Windows platform, for reasons described in class, we are using the Linux platform to make this assignment easier to perform. Similar binary analysis tools exist on Windows (and Mac OS X) platforms.

You are welcome to use Mac OS X -- or even Windows -- to write and compile your code. However, it needs to work under Linux. Your C/C++ code may have to call _foo() instead of foo() - try each one to see which one works. When you submit it, it should not have the underscore. The binary analysis will have to be done under the reference platform of 64-bit Linux.

Part 1: x86 code

Read the tricky jump document (md)
Use the C++ code shown below for the main.cpp
In your favorite text editor, create an x86 assembly language file called density.s that computes the volume and density of a rectangular solid object
- To avoid the messiness of doing input and output in x86 assembly, the routine will be called from a C++ language main() function, and the numerical values of the length, width, height, and weight of the object are to be passed in as parameters to the subroutine(s).
- The program may assume that the dimensions are in inches and the weight in pounds, and the density is to be computed in units of pounds per cubic inch.
- Integer computations are expected, not floating point.
- The assembly routines should compute the volume (as that is the function that was called), then execute a "tricky jump" to a function that computes the density.
- Return directly from this density function to main(), passing back the density (not volume!) as the return value.
- Thus, main() ends up storing the density value in its vol variable, so for the parameters in the main.cpp below the program prints 5 rather than 960.
Create an appropriate Makefile to compile your project; see the links above for details. Your executable needs to be named volume!
Answer the questions below about the executable that you created (volume), and put those answers into a file named analysis.pdf.

Your Makefile MUST generate an executable called volume! Anything else will not be graded!

Use the following code for your main.cpp. Note that, for the parameters given, the volume is 960 and the density is 5.

#include <iostream>
using namespace std;

extern "C" int volume (int, int, int, int);

int  main () {
  int len = 12, width = 8, height = 10, weight = 4800;
  int vol = volume(len, width, height, weight);
  cout << "volume: " << vol << endl;
  return 0;
}
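
As a sanity check on the expected result, the integer arithmetic that the assembly routine has to reproduce can be sketched in C++. This is only a reference for the numbers; the function names `volume_of` and `density_of` are hypothetical, and the graded computation still has to live in density.s.

```cpp
// Reference arithmetic only; volume_of and density_of are hypothetical
// names used for illustration and are not part of the assignment.
constexpr int volume_of(int len, int width, int height) {
    return len * width * height;                      // cubic inches
}

constexpr int density_of(int len, int width, int height, int weight) {
    // Integer division: any fractional part is truncated.
    return weight / volume_of(len, width, height);    // pounds per cubic inch
}

// The parameters used in main.cpp.
static_assert(volume_of(12, 8, 10) == 960, "volume for the given parameters");
static_assert(density_of(12, 8, 10, 4800) == 5, "density for the given parameters");
```

Because the computation uses integer division, a weight that is not an exact multiple of the volume is rounded toward zero.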

Part 2: analysis

The questions to answer are below. There are a number of command-line tools that can be used to generate these answers, which are listed by each question. You can type man foo at the Linux command prompt to see the full manual page about the 'foo' program. Some of the commands below require you to use specific flags described in the manual page for each command.

See "Notes on interpreting the objdump file", below, for help interpreting the output.

What are the addresses or registers of the four local variables in main()?
Look at the list of symbols in the executable via the nm command; there should be 50 or so. Which of these do you recognize? Google for some others to determine what their purpose is.
View a disassembly of the executable via objdump -sRrd volume. WARNING: this output is not in the assembly format that was studied in CS 2150; see here for the details of the differences.
- What are the addresses or registers of the four parameters passed into volume()?
- What shared libraries are used by your program? Briefly, what are each of these libraries? Look at the 'ldd' command.
- __libc_start_main is passed several (constant) addresses. What do these appear to represent? (It is sufficient to identify what things they are addresses of.)
- What is the shortest x86 assembly command, in terms of bytes? If there are many, you need only list one.
- What is the longest command? Again, if there are many, you need only list one.
- How could we determine if a tricky jump appears in the assembly code?
Look at the information output by readelf -h.
- What information can we glean about this executable from the ELF header?
- Based on the "entry point address", and the output from the 'nm' command, above, what is the entry point of the program?

Answer these questions in a PDF file called analysis.pdf. Copy the questions above (you can copy them from the web page version of this assignment) and type your answers below each question. Below each answer, copy and paste the specific portions of the tool output that were used to answer the question.

Part 3: detecting tricky jumps

Write a separate C or C++ program that takes, as its input, the objdump output for the above executable. It must read that input from standard input; we are not dealing with file input and output. First, you may want to save the output of objdump to a file:

objdump -sRrd volume > input.txt

Your program must read in that file via standard input:

./tricky < input.txt

Your program is to detect if there is a tricky jump in that code. You can assume a few things to make this easier:

A tricky jump will appear as a push right before a ret, as described herein (and implemented above)
You can assume the EXACT format of the input file is what objdump -sRrd outputs; this means the commands you are looking for are pushq and retq

If a tricky jump is found, then you should output "tricky jump found!" for each time it's found. If there are no tricky jumps, then output "tricky jump not found".

We are not looking for error-proof or long-winded code! The reference solution was about 20 lines.
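
The detection rule itself (a push immediately before a ret) can be sketched as follows. This is not the reference solution: it assumes the mnemonics have already been pulled out of the objdump text into a list, and the function name `count_push_then_ret` is hypothetical. Reading standard input and printing the required messages are left out.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: counts the places where a pushq is immediately
// followed by a retq in a list of already-extracted mnemonics.
std::size_t count_push_then_ret(const std::vector<std::string>& mnemonics) {
    std::size_t count = 0;
    for (std::size_t i = 1; i < mnemonics.size(); ++i) {
        if (mnemonics[i - 1] == "pushq" && mnemonics[i] == "retq") {
            ++count;   // one tricky jump found at this position
        }
    }
    return count;
}
```

A count of zero corresponds to the "tricky jump not found" case; otherwise the message is printed once per occurrence.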

Your Makefile MUST generate an executable called tricky! Anything else will not be graded!

Part 4: Makefile

You will need to create a Makefile that will compile the programs. There are two required aspects to the Makefile.

The volume/density problem from part 1 must compile to an executable named volume
The tricky jump detection problem from part 3 must compile to an executable named tricky

You may want to start with the Makefile for the vecsum program provided in CS 2150 (see here). That example uses clang++ as the compiler, but you can use either clang++ or g++.

It's fine if your source code files are named something slightly different, as long as your Makefile can compile both of them. We just have to be able to reasonably tell which source file belongs to which program. But the executables need to be named exactly as specified herein.

Notes on interpreting the objdump file

This section (and some of the questions for analysis.pdf) was taken with permission from Charles Reiss (original was here).

General format

The objdump output we provide contains several parts corresponding to several parts of the executable, which are described in more detail below:

Information about the type of executable file. This indicates what architecture it is for, that it is in Linux's ELF format, and the address at which execution of the program starts.
The actual contents of each "section" of the executable that will be loaded into memory. The sections have names like ".text" and ".dynstr" depending on their purpose. These will look something like:
```
Contents of section .text:
 4004d0 4883ec28 ba050000 00be0c00 00004889  H..(..........H.
 4004e0 e764488b 04252800 00004889 44241831  .dH..%(...H.D$.1
 4004f0 c0e83a01 0000488d 74240c48 89e031d2  ..:...H.t$.H..1.
```
The leftmost column indicates the address (in hexadecimal) where this data will be loaded in memory. The next four columns are the hexadecimal values actually placed in memory. These values are written in the order the bytes appear in memory, so the little-endian value "0x12345678" will appear as "78563412". The final column shows the same values represented as characters, except that a period (.) is used to represent bytes which do not correspond to a printable ASCII character.
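
To make the byte ordering concrete, here is a small C++ sketch that reverses the bytes of a 32-bit value, so that the hex digits of the result read left to right in the order the dump shows them. The function name `as_dumped` is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: reverses the four bytes of a 32-bit value. Reading the
// result's hex digits left to right gives the order in which a little-endian
// machine stores the original value in memory (least significant byte first).
constexpr std::uint32_t as_dumped(std::uint32_t v) {
    return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
           ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

static_assert(as_dumped(0x12345678u) == 0x78563412u, "0x12345678 is dumped as 78563412");
static_assert(as_dumped(5u) == 0x05000000u, "the constant 5 is dumped as 05000000");
```

The second check mirrors the sample dump above, where the bytes 05 00 00 00 after ba are the 32-bit constant 5.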
Disassembled versions of sections that contain executable code. These will look something like:
```
0000000000400460 <_init>:
  400460:       48 83 ec 08             sub    $0x8,%rsp
  400464:       48 8b 05 8d 0b 20 00    mov    0x200b8d(%rip),%rax        # 600ff8 <_DYNAMIC+0x1d0>
  40046b:       48 85 c0                test   %rax,%rax
  40046e:       74 05                   je     400475 <_init+0x15>
  400470:       e8 3b 00 00 00          callq  4004b0 <__gmon_start__@plt>
  400475:       48 83 c4 08             add    $0x8,%rsp
  400479:       c3                      retq
```
This indicates that there is a label called _init which has the address "0x400460" when the executable is loaded. Each following line is an instruction. The value before the colon is the memory address, in hexadecimal, of the first byte of the instruction. The hexadecimal values after the colon are the bytes that encode the instruction. Following this is the disassembled instruction itself.
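
The line layout described above (address, colon, instruction bytes, then the instruction) can be picked apart with whitespace-separated tokens. The sketch below is one possible approach, not a required one. The function name `mnemonic_of` is hypothetical, and it assumes that no mnemonic consists of exactly two hexadecimal digits.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: returns the mnemonic from one disassembly line such as
//   "  400479:       c3                      retq"
// or an empty string if the line is not an instruction line.
std::string mnemonic_of(const std::string& line) {
    std::istringstream in(line);
    std::string token;
    // The first token must be a hexadecimal address followed by a colon.
    if (!(in >> token) || token.size() < 2 || token.back() != ':') return "";
    for (std::size_t i = 0; i + 1 < token.size(); ++i) {
        if (!std::isxdigit(static_cast<unsigned char>(token[i]))) return "";
    }
    // Skip the raw instruction bytes: tokens of exactly two hex digits.
    while (in >> token) {
        bool is_byte = token.size() == 2 &&
                       std::isxdigit(static_cast<unsigned char>(token[0])) &&
                       std::isxdigit(static_cast<unsigned char>(token[1]));
        if (!is_byte) return token;   // first non-byte token is the mnemonic
    }
    return "";   // only bytes on this line, so there is no mnemonic to report
}
```

Label lines such as `0000000000400460 <_init>:` and the hex dump lines are rejected because their first token is not an address followed directly by a colon.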

Within the disassembled instructions, objdump attempts to provide information about addresses in addition to showing the addresses encoded in the instruction. In cases where the address exactly matches a label, like for the label __gmon_start__@plt in the example above, the format is address <LABEL> with the address in hexadecimal. In cases where the address does not exactly match a label, the format is address <LABEL+offset>. For example 400475 <_init+0x15> indicates the address "0x400475", which is 0x15 bytes after the label _init.

On 64-bit x86, some instructions specify an address relative to %rip. %rip represents the "instruction pointer", which in 2150 and 3330 we have called the "program counter". While an instruction executes, %rip holds the address of the next instruction, so 0x200b8d(%rip) means the memory 0x200b8d bytes after the end of the current instruction. objdump's disassembly includes a comment indicating what address is computed. In the example above, the instruction starts at 0x400464 and is 7 bytes long, so the address is 0x40046b + 0x200b8d = 0x600ff8, which is 0x1d0 bytes after the label _DYNAMIC.
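
The address in objdump's comment can be reproduced by hand. The worked sketch below uses a hypothetical helper, `rip_relative_target`. Note that the arithmetic only matches objdump's comment when the displacement is added to the address of the following instruction, that is, the instruction's own address plus its length in bytes.

```cpp
#include <cstdint>

// Hypothetical helper: target of a %rip-relative operand. The displacement is
// applied to the address of the *next* instruction, so the current
// instruction's length is added to its address first.
constexpr std::uint64_t rip_relative_target(std::uint64_t insn_addr,
                                            std::uint64_t insn_len,
                                            std::int64_t displacement) {
    return insn_addr + insn_len + static_cast<std::uint64_t>(displacement);
}

// mov 0x200b8d(%rip),%rax at 0x400464 is 7 bytes long (48 8b 05 8d 0b 20 00).
static_assert(rip_relative_target(0x400464, 7, 0x200b8d) == 0x600ff8,
              "matches the address in objdump's comment");
// The same arithmetic gives relative jump targets: je at 0x40046e is 2 bytes
// long (74 05) with an offset of 5.
static_assert(rip_relative_target(0x40046e, 2, 5) == 0x400475, "je target");
// 0x400475 lies 0x15 bytes past _init at 0x400460, hence <_init+0x15>.
static_assert(0x400475 - 0x400460 == 0x15, "offset from the _init label");
```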

Note that this is not all the information in the executable and not all the information that objdump is capable of providing.

On dynamic linking

This executable is dynamically linked, so it doesn't include code for functions in the C standard library like printf(). These are loaded at runtime by the dynamic linker, which is contained in /lib64/ld-linux-x86-64.so.2. Linux implements dynamic linking by having this program act as the "interpreter" that loads every dynamically linked executable.

As part of Linux's implementation of dynamic linking, there is a Procedure Linkage Table (PLT). This contains "stubs" for each function the executable expects to find in a dynamically linked library, like the C standard library. One of the "stubs" looks like:

00000000004004c0 <__printf_chk@plt>:
  4004c0:       ff 25 6a 0b 20 00       jmpq   *0x200b6a(%rip)        # 601030 <_GLOBAL_OFFSET_TABLE_+0x30>
  4004c6:       68 03 00 00 00          pushq  $0x3
  4004cb:       e9 b0 ff ff ff          jmpq   400480 <_init+0x20>

This stub is called __printf_chk@plt and is loaded into the program's memory at address 0x4004c0. The first instruction in this function reads the address of a function from memory at 0x601030, then jumps to that function. As indicated by the comment added by objdump, this address is part of the "global offset table". This is an array of pointers used to find functions like printf(), which are loaded every time the executable runs. Using this table allows the same program to work with different implementations of printf(), where printf() may end up at different locations in memory. For example, in this case the global offset table will eventually contain the address at which __printf_chk, part of the Linux C library's implementation of printf(), is loaded into memory.

By default, the values in this global offset table are initialized to point to the instruction following the jump; for example, 0x601030 initially contains 0x4004c6. This means that the first time the "stub" is called, it will "fall through" to the code after the global offset table jump. This code pushes an indicator of what function was called on the stack, then jumps to part of the dynamic linker. (The dynamic linker's code is not included in the executable file, and is therefore not present in the objdump output.) The dynamic linker will then locate the actual routine (the implementation of __printf_chk in the standard library, in this case) and update the global offset table to contain its address.
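
The resolve-on-first-call behavior can be imitated in ordinary C++ with a function pointer standing in for one global offset table entry. This is a simplified simulation under stated assumptions, not how the dynamic linker is implemented. Every name below is hypothetical, and `real_printf_chk` is an arbitrary stand-in for the library routine.

```cpp
// Simplified simulation of lazy binding; all names are hypothetical.
using Func = int (*)(int);

int real_printf_chk(int x) { return x + 1; }   // stand-in for the library routine

int resolve_then_call(int x);                  // stand-in for the dynamic linker path

Func got_slot = resolve_then_call;             // initial entry leads to the resolver
int resolver_runs = 0;                         // counts how often resolution happens

int resolve_then_call(int x) {
    ++resolver_runs;
    got_slot = real_printf_chk;                // patch the table with the real address
    return got_slot(x);                        // complete the original call
}

// Like the stub's first instruction: an indirect jump through the table entry.
int plt_stub(int x) { return got_slot(x); }
```

The first call to `plt_stub` goes through the resolver and patches the entry. Every later call reaches `real_printf_chk` directly, so `resolver_runs` stays at 1.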

On _start

Execution of the program does not actually start in main but starts in a function called _start that is provided by the compiler -- this is the start address specified in the ELF header. This function calls a special function in the C standard library called __libc_start_main. It is this function that actually calls main() and takes care of exiting when main() returns.

On %fs

x86 has a feature called "segmentation". As part of this feature, the processor has several "segment registers" which specify a region of memory -- essentially the segment register acts as a pointer. %fs:0x28 specifies to use segment register %fs and access a value 0x28 bytes from the beginning of the memory region it identifies.

On Linux, the %fs segment register is used for "thread-local storage" -- to point to a block of data particular to a thread, even in a multithreaded process.

On Windows, the %gs segment register is used for something similar.

The use of a segment register for this purpose instead of a normal register is just to make sure as many registers are available to the program as possible.

Segmentation was originally intended to provide functionality similar to virtual memory. These days, it is rarely used for this purpose, and its primary use is to support thread-local storage, as occurs briefly in the assembly in this assignment. It is still, however, universally present on x86 and is entangled with x86's implementation of kernel mode and exceptions.

Items to Submit

These files need to be submitted to the HW 2 assignment via the course tools submission.

File density.s with your assembly language solution for volume/density (from part 1)
File main.cpp with your C++ code for the main() function for the volume/density code (from part 1)
A PDF document called analysis.pdf with the analysis questions given above (in part 2) and their answers.
File tricky.cpp with your C++ code for the main() function for the tricky jump detection code from part 3
A Makefile that compiles the programs, as explained above in part 4 -- make sure the executables are named volume and tricky!