/ REV, CTF, CODE

# Breaking CMU's Bomblab with Angr for Fun and Profit - Part 1

I have recently been learning about Angr, a binary analysis framework developed by UC Santa Barbara and Arizona State University. It caught my eye because of its versatility and utility in reverse engineering binaries whose disassembly and decompilation are hard to understand manually. Oftentimes, it is simply due to the fact that it was compiled from newer or relatively less popular languages like Rust or Haskell, where the state of currently publicly available decompilers leaves much to be desired. Angr’s ability to perform symbolic execution therefore allows us to blackbox certain functionality within the program (or even the entire program) by attempting to find the right input for a desired output.

### But what is symbolic execution anyway?

Do you recall when you had to first begin manipulating symbols in math class during elementary school? Yes, algebra! Symbolic execution can be thought of as manipulating symbols in order to derive certain constraints. These constraints can then be solved by a Satisfiability Modulo Theories (SMT) solver like Z3.

Here is a simple example:

Suppose we want Angr to figure out how how to reach line 5 (`print_flag`) of the program during its execution. From a very high level, you can tell Angr that the instruction corresponding to what happens at line 4 is an address that you want it to find, and Angr will be able to work backwards and deduce symbolically that we need to constrain `y = 20`, which implies that we have to then constrain `x=15`, and so the user input from stdin must correspond to 15.

Of course, this is a very contrived example, and in practice the constraints are usually in ranges (i.e `x > 0`), and you can end up with a lot of potential inputs for a desired output (maybe even infinitely many). When situations like this happens, you can ask Angr to return an arbitrary valid input, or return `n` such inputs, and many other options which you can refer to here.

### Bomblab

Now that we have a basic understanding of what Angr is and what symbolic execution is about, let’s put our newfound skills to the test! I decided to try it out with Carnegie Mellon’s Bomb Lab (you can download it here). It is the second lab for the class 15-213 Introduction to Computer Systems in CMU which I took last year, and which is a required class for all computer science majors. Back then when I took the class, I printed out the disassembly from `objdump` onto paper and traced all of the function calls and loops manually. I also did some basic dynamic analysis with `gdb` to debug my inputs and confirm that my intuition for what was happening was correct. It was a slow but fun and rewarding process.

Let’s see how easily we can solve Bomblab with Angr together! I have structured this walkthrough into 7 separate posts, one post for each phase of the bomb (including the secret phase). I will also approach it without relying on any prior knowledge of the phases. Without further ado, let’s get started!

### Phase 1

It’s always good to first check the attributes of the binary. The main thing I am concerned about is PIE (position independent code), because that would potentially make referencing addresses more difficult (Angr does have support for PIE though):

Awesome, we see that there is no PIE, and that this is also a 64 bit binary.

Let’s create a new Angr project skeleton:

Now let’s look at the disassembly in gdb (truncated to the interesting parts for brevity):

Here we see that there are 6 phases in the `main` function, which are logically isolated from one another. We also see that it calls a `read_line` function, which is not a standard glibc function. Let’s look at `phase_1`:

What it does is really simple - compare the user input to the string at 0x402400, and we easily solve it without Angr. But let’s try doing it with Angr anyway, because it will yield several valuable learning points.

The first thing to take note of is that user input is being read with a custom `read_line` function. Let’s take a look at that:

We don’t see any functions which reads in input directly being called, like scanf, read, or fgets. Digging further though, we see that the input is actually being read in the `skip` function, which calls fgets:

The point here is that we don’t want to reverse precisely what `read_line` does, and just rely on our intuition that it basically reads in a line. We also see that there are a couple of checks in the loops, which can potentially lead to state explosion.

### State Explosion

State explosion is an extremely important concept in symbolic execution, and it is the primary reason why achieving general symbolic execution is hard. With every branch, our number of states double. This exponential growth in the number of states can quickly render our search infeasible, which is why it is important for us to limit the number of potential paths as much as possible. Doing this automatically is currently an area of active research. There is a great paper from CMU titled Enhancing Symbolic Execution with Veritesting, which introduces the idea of veritesting in order to mitigate state explosion. This is achieved by combining both static and dynamic symbolic execution to produce heuristics to avoid states which are likely to lead to failure, and by combining branches. Angr does support veritesting in its simulation manager, but this is out of scope for this series. Do read the paper if you are interested for more!

In order to prevent that, we will inject our own symbolic memory for the input instead, and bypass the `read_line` function entirely.

If we look at the disassembly for `main`, the place where we call `phase_1` is a good point to start program execution. We can then simply pass the symbolic input in the `rdi` register!

`claripy.BVS` creates a symbolic value referred to by `phase_1_input`, with a size of 100 bytes or 100 * 8 bits. 100 bytes should be big enough, if Angr is unable to find a solution, we can always add more. BVS stands for bit vector symbolic. It has a cousin, BVV, which stands for bit vector value, that holds concrete values. Our string input value is stored at `fake_addr`. `fake_addr` was chosen arbitrarily, it can simply be any area of memory not being used by any of the other sections (text, heap, stack, libs) during execution. So with this, we have now a reference to our symbolic bit vector in `rdi`!

Now, if we look back at the disassembly for `phase_1`, we see that it primarily calls `strings_not_equal`. We have a hunch about what it does, let’s look at it in assembly:

Again, we see lots of comparisons and jumps, which is what we don’t like since it can easily lead to state explosion. We can solve this by using a hook, which is a way of overwriting instructions to call our own instructions instead. Angr provides an easy way to do this via what is known as a SimProcedure, which allows you to replace a function with one that you write in Python:

In this code snippet, we replace all function calls to `'strings_not_equal'` to our own function defined in `ReplacementStringsNotEqual`. We can use `project.hook_symbol` because the binary includes debugging symbols, which are debugging information that was not stripped by the compiler after compilation. While it makes the binary larger, it is useful as it makes debugging with a debugger easier. However, if you encounter stripped binaries, then you would need to hook by the address of the function instead.

Recall that `strings_not_equal` takes in two string pointers as input, and therefore we need to load it into memory first. `state.memory.load(addr, size)` allows us to load `size` number of bytes from `address` to a variable. We can then compare the two input strings, and return 0 or 1 accordingly. We then return a concrete 32 bit integer value of either 0 or 1, depending on the results of the comparison. Note that we cannot directly return a Python integer, because Python integers do not have sizes.

We are now done with the basic set-up. Add the following lines to create a simulation manager, which will help to explore our states later on:

### Find and Avoid

Now we are faced with the question of how to tell Angr what we want it to do.

The above introduces a few new concepts. Let’s start from the end - `simulation.explore` tells Angr to explore states starting from our initial state, and to terminate whenever it has found something that matches the `find` condition, and also to stop exploring states that have fulfilled the `avoid` condition. Our `find=is_successful` condition is what happens when we clear the stage. We see in the disassembly of `main` that after `phase_defused` is called after `phase_1`, we call `puts` with `0x4023a8`, which corresponds to:

So what `is_successful` does is that if Angr ever sees “Phase 1 defused” from stdout of the program, it will know that it has succeeded.

On the flipside, `explode_addr` is set to the address of `explode_bomb`, which we can see in the disassembly of Phase 1:

This tells Angr to avoid calling `explode_bomb` and truncate such states from its search.

Finally, we check if we are able to find a solution:

Here, if a solution was found, we convert our symbolic `phase_1_input` value into bytes and print it, and if not, an exception is raised.

### Full Solution Script

The full solution script for Phase 1 is below:

And if we run it, we get the following output:

Because we specified 100 bytes for our symbolic input, we got 100 bytes back. Since strings are null terminated, our correct input for phase 1 is therefore “Border relations with Canada have never been better.”. We verify that it works when run with the actual binary:

Awesome! I hope you have enjoyed the first part of this series and that it was helpful to you :). You can continue on to the second part here.

#### Fan Pu Zeng

Carnegie Mellon School of Computer Science, Class of 2022