LLVM

Background

Learning Outcomes

The goals of this assignment are to:

Understand the benefits of applying optimizations to the IR code, prior to the standard HLS algorithms.
Gain experience using LLVM to modify the IR code of a program.
Practise C++ skills.

Motivation

In this assignment you will get to write your own compiler optimization pass! You will use LLVM (a commonly used compiler, like GCC). The goal of the transformation will be to identify adder trees in the IR, and perform tree balancing, as shown below. This optimization is especially helpful when targeting hardware.

Example of adder balancer

Consider the following code (from simple.c) that sums the values of a 100-entry array.

int A[100];
for (int i = 0; i < 100; i++)
    sum += A[i];

To implement this in hardware, we could build an RTL circuit with a state machine that repeatedly read a value from memory, added it to a sum register, and incremented the address. We could also use an HLS tool to build that RTL for us, and it may take a similar approach if no optimizations were applied.

The code could be made much faster by performing multiple loop iterations in parallel. This is referred to as loop unrolling.

The following code, (from simple_unrolled_partitioned/simple.c), shows the array summation unrolled by a factor of 10. One issue with unrolling like this is that it is usually not possible to load 10 values from memory at once. If values are loaded sequentially, there will be little to no benefits to loop unrolling. To get around this problem, the code also includes memory banking, where the A array is split into 5 equal sized arrays, A1, A2, A3, A4, and A5. Each of these can be placed in a separate dual-ported memory, allowing us to read 10 values from memory simultaneously. (While this C code implements loop unrolling and memory banking manually, modern HLS tools will do this for you when requested. Certain tools may even be smart enough to do it automatically.)

int A1[20];
int A2[20];
int A3[20];
int A4[20];
int A5[20];

for (int i = 0; i < 20; i += 2) {
    sum += A1[i] + A1[i + 1] + 
           A2[i] + A2[i + 1] + 
           A3[i] + A3[i + 1] +
           A4[i] + A4[i + 1] + 
           A5[i] + A5[i + 1];

The above code will end up creating a cascading adder tree as shown in the left side of the figure above. We would like to automatically optimize the IR to balance the adder tree, producing a structure similar to the right-hand side of the figure above. This optimization can improve the critical path of the resulting hardware, or if timing constraints are imposed, reduce the latency of each loop iteration.

Getting Started

Required Packages

sudo apt install llvm-18 llvm-18-dev libclang-common-18-dev clang-18 liblpsolve55-dev texlive-latex-base texlive-pictures

Running LLVM

Your scheduler is going to run as an LLVM pass. This means that after you compile a C program to LLVM IR, you are doing to run your pass on the LLVM IR.

You are given a few different C programs in the benchmarks directory. You can cd into these benchmark directories, and run make <target> to compile them. Try running just clang (C front-end compiler) to compile the simple benchmark your C code into LLVM IR.

cd benchmarks/simple
make clang

The provided Makefile will first run clang to produce a binary IR file (simple.clang.bc), after which it will run llvm-dis to dissassemble the binary into human readable IR (simple.clang.ll)

Look through the LLVM IR file and try to understand how it implements the simple.c program. Then go to the simple_unrolled_partitioned bechmark and look at the IR that program produces. Make sure you can find the long sequence of cascaing add instructions.

Writing an LLVM pass

While you could download the LLVM source code, and add your pass code to it, compiling LLVM from source can take 20-30 minutes. Instead, we are going to develop your pass out-of-tree. Above you should have installed the LLVM binaries using the Ubuntu package manger. We are going to compile your pass code into a shared library object (.so) file, than can be loaded by LLVM at runtime.

Your pass code and CMakeLists.txt file are provided in the src directory. Take a look at AdderTreeBalancer.cpp:

The declaration of our class inherits from PassInfoMixin, which is a helper class for creating new passes:

class AdderTreeBalancer : public PassInfoMixin<AdderTreeBalancer> {

The run() method is where the pass code is implemented. This function is polymorphic, and by specifying the argument type as Function &, we are creating a Function Pass, that will be run on each Function in our code.

PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);

This pass returns a PreservedAnalyses object, which is used to tell LLVM which analyses are invalidated by the pass. Since we will be modifying the IR, we will return PreservedAnalyses::none(), which means that all previous analyses are invalidated.

The final step is to register your pass with LLVM. This would be more straightforward if we were compiling it into LLVM, but since we are dynamically linking it, we have to define a specific function that LLVM will call to allow us to register the pass. This is by defining the llvmGetPassPluginInfo() function:

extern "C" ::llvm::PassPluginLibraryInfo LLVM_ATTRIBUTE_WEAK
llvmGetPassPluginInfo() {
  return {LLVM_PLUGIN_API_VERSION, "ATB_625", "v0.1", [](PassBuilder &PB) {...}

If you inspect the full code inside this function, you will see code that looks for passes on the opt command line, and if it finds the ATB_625 pass listed, it will add the AdderTreeBalancer pass to the list of passes to be run.

Compiling Your LLVM Pass

Compile the provided code into the shared library object file using the following commands:

cd lab_llvm/build
cmake ../src
make

This will produce your library ATB_625.so.

Running Your LLVM Pass

The benchmark Makfile contains several targets that can be used to run your AdderTreeBalancer pass.:

make atb: This will run your pass, followed by dead code elimination (which will remove unused instructions). This will produce a file *.atb.ll which contains the LLVM IR after your pass is run.
make atb_check: Since you are modifying the original program, you will want to double check that you haven’t broken the functionality. This target will first run the atb target to produce the transformed IR, followed by running it using the LLVM emulator (lli). The benchmarks print CORRECT or ERROR, depending on whether the sum is correct.
make atb_schedule: This will run the atb pass, and then provide the new LLVM IR to an HLS scheduler, which will determine which instructions can run in each cycle in hardware. This will produce PDF files which display the schedule of each IR instructions. If you implement the pass correctly, you will see improvements made to the produced schedule.
make schedule: This will run the HLS schedule without your ATB pass.

Code Organization

lab_llvm/src/AdderTreeBalancer.cpp/.h – You will add all your code for this assignment in theses files. You are welcome to add additional source files if you like.
benchmarks – This contains some C programs to test your code. You are welcome to test your pass with any design in this folder (or create your own test designs); however, only the simple_unrolled_partitioned and shared_add_tree designs have large adder trees. The shared_add_tree design has an adder tree used by another adder tree, so it’s useful to test that you are handling that correctly (see the first tip listed later on this page).

Implementation

How you code the tree balancer is up to you.

You probably want to follow this general approach:

Look through the Instructions in the BasicBlock for add instructions, treat this instruction I as the root of an adder tree, and collect a list of inputs to this adder tree.
Create a balanced adder tree with these inputs, and add these new add instructions to the BasicBlock.
Replace uses of the root adder instruction, I, with the output of your newly created adder tree.
Repeat until all adder trees are balanced.

Some tips with this approach:

When recursively visiting adders, you need to make sure the output of an intermediate node is not used elsewhere in the code. You can do this by checking that the number of uses is one (Value.getNumUses()). The reason for this is that all of these instructions will be deleted and replaced by your adder tree; only the final output of the adder tree can be used multiple places in the code.
The inputs to your adder tree may be instructions, constants, function arguments, etc. You should use the LLVM Value to keep track of the inputs.
You will need to add all of your newly created addition instructions to the BasicBlock. Make sure you add them in a valid position, and order (ie. the final addition in your tree should be located before any uses, and adder nodes in your tree should be in a valid ordering). If you are smart in how you create your adder tree, you can build a list of add instructions that are already in the correct order. You can then add all of these add instructions immediately before instruction I (as described in Step 3 of the technique above).
You don’t need to delete the adder instructions you are replacing with your tree. As long as you complete Step 3, these additions will become dead code. Since dead code elimination is performed after your pass, these instructions will automatically be removed. Likewise, if you only complete Step 1 and 2, and never use your new adder tree, dead code elimination will remove it.
In Step 1 you are iterating over the BasicBlock. However, if you end up inserting an adder tree, you are modifying the very data structure you are iterating over. You cannot continue iterating at this point. Either exit out of the iteration and start over, or keep a track of pending changes and wait until the iteration is complete before actually making the changes.
You will probably want to keep track of which additions are part of your balanced adder trees, and then skip over them in Step 1. Otherwise you can end up in an endless loop of replacing your balanced adder trees with new identical adder trees. The other benefit of this approach is that you don’t need to check whether existing adder networks in the user’s code are balanced or not. Simply replace them with a balanced adder tree regardless of whether the existing structure is already optimal.

LLVM Coding Tips

Here are a few suggestions to help you with the LLVM API. If you are still unsure how you do something in LLVM, please post on Slack and I would be happy to help you out.

Resources

We are using LLVM version 18.1. The documentation can be found at https://releases.llvm.org/18.1.0/docs/. Some useful pages:

The Language Reference Manual describes each LLVM IR instruction.
The Programmers Manual is a good place to start reading about coding in LLVM.

The LLVM Classes

LLVM is an object-oriented code base, with lots of inheritance. You can find auto-generated documentation for each class. For example, check out the Instruction class. Usually the quickest way to find these documentation pages is a google search.

Take a look at the illustrated class hierarchy on that webpage. You will see that Instruction inherits from User, which in turn inherits from Value. From LLVM programmers manual:

Value: The Value class is the most important class in the LLVM Source base. It represents a typed value that may be used (among other things) as an operand to an instruction. There are many different types of Values, such as Constants, Arguments. Even Instructions and Functions are Values.
User: The User class is the common base class of all LLVM nodes that may refer to other Values. It exposes a list of “Operands” that are all of the Values that the User is referring to. The User class itself is a subclass of Value.
Instruction: The Instruction class is the common base class for all LLVM instructions. It provides only a few methods, but is a very commonly used class. The primary data tracked by the Instruction class itself is the opcode (instruction type) and the parent BasicBlock the Instruction is embedded into. To represent a specific type of instruction, one of many subclasses of Instruction are used.

In this assignment you will be frequently using the add instruction. This instruction is of type BinaryOperator (a subclass of Instruction), which is used by all instructions that operate on two pieces of data (thus Binary). To check if it is an adder, you can call getOpcode(), which for an adder will return the constant Instruction::Add.

Debugging

In LLVM, you can use outs() << to print out just about any class you want. You can print Values, Users, Instructions, BasicBlocks, Functions, etc. Just keep in mind that to print an object you should use the object itself, not a pointer to the object. If you have a pointer, you should dereference it, such as:

Instruction * I = ...;
outs() << *I << "\n";

You can also use report_fatal_error("message") to stop execution immediately and report an error.

Checking Instruction Types

When iterating through the Instruction objects in a BasicBlock, you may need to check the type of Instruction. This can be done a few different ways. For example, here are a few ways to check if an Instruction is a BinaryOperator.

Instruction * I;
if (isa<BinaryOperator>(I)) {
   // I is a BinaryOperator
}

Instruction * I;
if (BinaryOperator * bo = dyn_cast<BinaryOperator>(I)) {
    dbgs() << "Instruction is a binary operator with opcode " << bo->getOpcode() << "\n";
}

Creating Instructions

Generally to create new instructions, you do not call the constructor directly, but rather call a static class method. For example, one way to create a new add instruction:

Value * in1 = ...
Value * in2 = ...
BinaryOperator * bo = BinaryOperator::Create(BinaryOperator::Add, in1, in2, "myInsnName", (Instruction*) NULL);

The last argument in this case is a pointer to an existing Instruction, and the new Instruction will be inserted preceeding it. If you pass in NULL, the new instruction is not inserted into the code, but you will need to do so later.

Inserting Instructions, Replacing Instructions or Usage

There are many ways to insert instructions into a basic block. These are discussed in the Programmers Manual, in sections Creating and inserting new Instructions and Replacing an Instruction with another Value.

Deliverables

Submit your code on Github using the tag lab2_submission
Include a 1-page PDF document (lab_llvm/report.pdf) that includes the following:

For the simple_unrolled_partitioned program, obtain the missing results in the following tables (there are 8 cells to fill in), and include the completed tables in your report. You can find the HLS schedule by looking at the main.schedule.rpt file, and the longest path by looking at the main.timing.rpt file, which lists the longest path for each BasicBlock. The cycles per loop iteration can be obtained from the scheduling report (or Gantt chart PDF), by manually looking at how many states are created for the summation loop. The total summation time is simply the product of the longest path, loop iteration latency, and number of iterations (10).

CLOCK_PERIOD <= 20ns

	Longest Path (ns)	Loop iteration latency	Summation Time (ns)
WITHOUT TreeBalancer	17.5	4	700
WITH TreeBalancer
Speedup	-	-	?x

CLOCK_PERIOD <= 5ns

	Longest Path (ns)	Loop iteration latency	Summation Time (ns)
WITHOUT TreeBalancer	5.0	12	600
WITH TreeBalancer
Speedup	-	-	?x