Malware development part 6 - advanced obfuscation with LLVM and template metaprogramming
Introduction
This is the sixth post in a series about the development of malicious software. In this series we explore and try to implement multiple techniques used by malicious applications to execute code, hide from defenses and persist.
Today we will explore anti-disassembly obfuscation using LLVM and template metaprogramming.
LLVM obfuscation
LLVM is a compiler infrastructure. To understand what exactly it is, we need to dive into the compilation process (this is most accurate for unmanaged code like C/C++).
We can distinguish three stages of assembly generation from source code:
- Front end, which includes:
  - scanner, which performs lexical analysis of the code and produces tokens (strings with certain meaning)
  - parser, which produces an abstract syntax tree (tokens grouped in a tree which represents the actual algorithm implemented in the source code)
  - semantic analysis (mainly type checking), during which the AST is checked for errors like wrong use of types or use of variables before initialization
  - generation of intermediate representation, usually based on AST
- Optimization, which aims at reducing code complexity for example by precalculating stuff. Optimization must not change the algorithm/program itself.
- Back end, which translates the intermediate representation to expected output (assembly or bytecode).
The core of LLVM is the optimizer, but the project also includes a compiler front end, clang, which is intended to be used with the LLVM toolchain.
Obfuscator-LLVM
We will leverage the Obfuscator-LLVM project, which is an open-source fork of LLVM.
Obfuscation works at the level of the intermediate representation (IR) mentioned above. In other words, it’s a kind of ‘anti’-optimization. Clang is used to generate IR from source code, then the IR is processed to obfuscate the code flow, and finally the assembly is generated.
Setup
Having gone through the theoretical introduction, let’s prepare the environment for C++ code obfuscation. Obfuscator-LLVM needs to be downloaded and compiled. The latest branch is llvm-4.0 (from 2017; the latest LLVM version at the time of writing is 11.0), and the code needs to be compiled with Visual Studio 2017 rather than 2019 (which gives some errors during compilation). We need to use CMake to generate a VS2017 project and then compile it (minding the target architecture). We can use the Developer Command Prompt for VS 2017, which is a part of Visual Studio 2017:
git clone -b llvm-4.0 https://github.com/obfuscator-llvm/obfuscator
cd obfuscator
mkdir build
cd build
cmake -G "Visual Studio 15 2017 Win64" ..
Note: I had to manually define the ENDIAN_LITTLE identifier to get rid of some compilation errors.
There are different ways to use the Obfuscator-LLVM compiler:
- use it manually via the command line
- add the compiler as a custom build tool for .cpp and other files in Visual Studio (in the relevant file’s Property Pages)
- use VS Installer to install a clang-cl platform toolset and manually swap Visual Studio’s clang version with the compiled compiler (this kinda sounds like a chicken-egg problem :))
Usage and features
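Let’s write a simple program which performs some rather simple calculations based on a pseudorandom value:
#include <windows.h>

int main()
{
    // Truncate the 64-bit tick count; it only serves as a pseudorandom value.
    int a = (int)GetTickCount64();
    int b = a % 10;
    int c = 0;
    // Start at 1: a % 0 would be a division by zero.
    for (int i = 1; i < b; i++)
    {
        c += a % i;
    }
    return c;
}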
Note: I compiled this code without the CRT dependency so the binary is small and there’s no additional code (like mainCRTStartup etc.) - see part 4 of the malware development series.
This is what the code looks like after decompiling with Ghidra:
And the program graph:
Obfuscator-LLVM has 3 code obfuscation features: instructions substitution, bogus control flow and control flow flattening. Let’s explore them. Details can be found in the project’s repository.
These features use a random value which has to be provided as a command line parameter (-mllvm -aesSeed=1234567890ABCDEF1234567890ABCDEF) on Windows systems (on Linux it is taken from /dev/random).
Instructions substitution
This replaces simple arithmetic operations with more complex but equivalent ones. For example, a = b + c may be changed to r = rand(); a = b + r; a = a + c; a = a - r;. The random value is calculated during compilation.
It’s possible to apply substitutions multiple times. The random seed from the command line is used to select the substitute instruction sequence, so it brings some additional uniqueness to the resulting binary.
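To see why such a substitution is safe, here is a minimal source-level sketch of the a = b + c example. The real obfuscator rewrites IR instructions, not C++ source; the function names and the constant R below are hypothetical and only stand in for the random value chosen at compile time.

```cpp
#include <cstdint>

// Straightforward addition, as written in the source code.
constexpr std::uint32_t add_plain(std::uint32_t b, std::uint32_t c)
{
    return b + c;
}

// Substituted form: a = b + r; a = a + c; a = a - r;
// Unsigned arithmetic wraps around, so the result is identical for every input.
constexpr std::uint32_t add_substituted(std::uint32_t b, std::uint32_t c)
{
    constexpr std::uint32_t R = 0xDEADBEEFu; // hypothetical "random" constant
    std::uint32_t a = b + R;
    a = a + c;
    a = a - R;
    return a;
}

// Both forms agree, including when the addition overflows.
static_assert(add_plain(2, 3) == add_substituted(2, 3), "same result");
static_assert(add_plain(0xFFFFFFFFu, 1) == add_substituted(0xFFFFFFFFu, 1), "wraps identically");
```

An optimizing compiler would fold the substituted form back into a single addition, which is one reason the obfuscator runs its passes on the IR itself.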
Let’s add the following switches to the compilation command line: -mllvm -sub -mllvm -sub_loop=5 -mllvm -aesSeed=1234567890ABCDEF1234567890ABCDEF
Resulting assembly (decompiled):
And the graph:
Note here that the Ghidra decompiler handled obfuscator ‘deoptimizations’ quite well.
Bogus control flow
This adds opaque predicates before instruction blocks. An opaque predicate is basically a portion of (preferably random) code which evaluates at runtime to a predetermined logical value (true or false). It is followed by a conditional jump which points to the original instruction block.
This obfuscation can also be applied multiple times, and can target random blocks of code.
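As an illustration, here is a small sketch of a classic opaque predicate written in plain C++. It only shows the logic: the function names are hypothetical, and a real obfuscator inserts the predicate at the IR level in a way the optimizer cannot simply fold away.

```cpp
#include <cstdint>

// Opaque predicate: x * (x + 1) is a product of two consecutive integers,
// so it is always even. Unsigned wraparound (mod 2^32) preserves parity,
// hence this returns true for every x, although that is not obvious
// when looking at the machine code.
constexpr bool opaque_true(std::uint32_t x)
{
    return (x * (x + 1u)) % 2u == 0u;
}

// Original block: return v * 3.
// Obfuscated version: the real block is guarded by the opaque predicate
// and a junk block that is never executed is added next to it.
constexpr std::uint32_t triple_obfuscated(std::uint32_t v, std::uint32_t x)
{
    if (opaque_true(x))
    {
        return v * 3u;        // original block, always taken
    }
    return (v ^ x) + 0x1234u; // bogus block, never reached
}
```

Whatever value is passed as x, the function behaves exactly like the original block, but a static analysis tool has to prove that before it can discard the bogus branch.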
Example usage: -mllvm -bcf -mllvm -bcf_prob=100 -mllvm -bcf_loop=1 -mllvm -aesSeed=1234567890ABCDEF1234567890ABCDEF
Resulting assembly (decompiled):
And the graph:
Control flow flattening
This one disrupts the sequence of instruction blocks by placing them on the same level in a looped switch statement. Additional variables are defined which actually control the order of execution. See the diagram below - it should make this more clear:
This obfuscation can also be applied multiple times on a single block.
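The idea can be sketched at the source level with a loop shaped like the one in the sample program (starting at 1 here to avoid a modulo by zero). This is only an illustration: the state numbering is hypothetical, and the real transformation is applied to IR basic blocks.

```cpp
// Straightforward version of the loop.
constexpr int sum_plain(int a, int b)
{
    int c = 0;
    for (int i = 1; i < b; i++)
    {
        c += a % i;
    }
    return c;
}

// Flattened version: every basic block becomes a case of one switch,
// and the 'state' variable decides which block runs next.
constexpr int sum_flattened(int a, int b)
{
    int c = 0;
    int i = 0;
    int state = 0; // dispatcher variable (hypothetical numbering)
    while (true)
    {
        switch (state)
        {
        case 0: // loop initialization
            i = 1;
            state = 1;
            break;
        case 1: // loop condition
            state = (i < b) ? 2 : 3;
            break;
        case 2: // loop body and increment
            c += a % i;
            i++;
            state = 1;
            break;
        default: // exit block
            return c;
        }
    }
}

static_assert(sum_plain(17, 6) == sum_flattened(17, 6), "both versions compute the same value");
```

In the flattened version the control flow graph no longer shows a loop with a body; it shows a dispatcher with several sibling blocks, and the real order is only visible by tracking the state variable.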
Example usage: -mllvm -fla -mllvm -split -mllvm -aesSeed=1234567890ABCDEF1234567890ABCDEF
Resulting assembly (decompiled):
And the graph:
Testing
Now let’s compile and obfuscate some simple malware. Remember the simplest shellcode injector from part 1 of the series? LLVM obfuscation won’t do much with it because the most obvious indicators (shellcode and imports) will still be present and intact.
That’s why we will test another piece of code - for example this classic reverse shell. It actually uses the same method as the shell_reverse_tcp shellcode (create an IP socket and create a cmd process with its standard streams attached to the socket).
Interestingly, uploading compiled binaries to VirusTotal resulted in only one detection for the code compiled without obfuscation and 6 detections when multiple obfuscation methods were applied.
Conclusion
Obfuscator-LLVM is a great resource to learn and understand what actually happens during code compilation and how this process can be modified to make static assembly analysis more difficult and time-consuming. However, it’s important to remember that IR-level obfuscation can be reversed (not completely, but still). See this great article for an example of the deobfuscation process.
Here are some general thoughts and considerations: From an offensive penetration tester’s perspective, it’s important to combine multiple layers of code protection measures to minimize the chances of detection and hinder manual analysis as much as possible (well, with a reasonable amount of effort on our side). This helps to deliver effective adversary emulations focused on the actual objectives. Of course more advanced malware requires more work from defensive teams, which is also a good thing.
Anyway, make sure to consider implementing some intermediate representation level obfuscation into your offensive tooling build process.
Other LLVM-based obfuscators
Also be sure to check other LLVM-based obfuscators and articles on building custom obfuscators with LLVM:
https://github.com/HikariObfuscator/Hikari/
https://medium.com/@polarply/build-your-first-llvm-obfuscator-80d16583392b
http://www.babush.me/dumbo-llvm-based-dumb-obfuscator.html
https://github.com/emc2314/YANSOllvm
https://blog.scrt.ch/2020/06/19/engineering-antivirus-evasion/
https://blog.scrt.ch/2020/07/15/engineering-antivirus-evasion-part-ii/
Template metaprogramming
Before diving into the details of C++ constructs like templates, constant expressions and metaprogramming, let’s consider a simple case: we have source code with some string literals (like IP addresses, domain names etc.) that need to be obfuscated so they are invisible in the assembly and only revealed at runtime. The easiest thing to do here is to encrypt these literals and replace them with a call to a decryption routine, for example:
const char* address = "www.example.com";
replaced with:
char* Decrypt(const char* data);
(...)
char* addr = Decrypt("xxx.yyyyyyy.zzz");
Of course we would have to consider string length, null-byte terminators etc.
We would prefer to use plaintext values in the source code and obfuscate/encrypt them automatically during the build process. Replacement of plain strings with encrypted ones can be automated with a pre-build task, e.g. some Python script. But there’s another, cooler way to do this.
Introduction
Let’s get familiar with two C++ features: templates and constant expressions (constexpr, introduced in the C++11 standard). The following won’t cover all the details of metaprogramming concepts - it’s just a simple introduction which will help to understand how obfuscation based on template metaprogramming actually works.
Templates
Templates let us write functions and types that operate on generic types. They allow simple creation of functions which work with multiple types (basic types, structs, classes). For example we can use the following template:
template <typename T>
bool Equal(T arg1, T arg2)
{
    return (arg1 == arg2);
}
instead of defining overloaded functions:
bool Equal(int arg1, int arg2);
bool Equal(double arg1, double arg2);
And example template usage:
Equal<int>(1, 2);
Of course types must implement the == operator in order to be used with the Equal function template.
Templates can be also used to create a generic struct or class, which then can be instantiated to be used with a specific type:
template <typename T>
struct Stack
{
void push(T* object);
T* pop();
};
Stack<Fruit> fruitStack;
Stack<Vegetable> vegetableStack;
This also provides type safety: in this case you won’t be able to mix fruits with vegetables - fruitStack.push(new Vegetable()); will produce a compilation error.
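For completeness, here is one possible way to fill in the Stack template so it can actually be compiled and used. The std::vector storage, the empty Fruit and Vegetable types and the stack_demo helper are assumptions made for this sketch.

```cpp
#include <vector>

struct Fruit {};
struct Vegetable {};

// Generic stack of non-owning pointers to T.
template <typename T>
struct Stack
{
    void push(T* object) { items.push_back(object); }

    // Returns the most recently pushed pointer, or nullptr when empty.
    T* pop()
    {
        if (items.empty())
        {
            return nullptr;
        }
        T* top = items.back();
        items.pop_back();
        return top;
    }

private:
    std::vector<T*> items;
};

// Usage sketch: returns true when the stack behaves as expected.
inline bool stack_demo()
{
    Fruit apple;
    Stack<Fruit> fruitStack;
    fruitStack.push(&apple);
    // fruitStack.push(new Vegetable()); // would not compile: Vegetable* is not Fruit*
    return fruitStack.pop() == &apple && fruitStack.pop() == nullptr;
}
```

Stack<Fruit> and Stack<Vegetable> are two unrelated types generated by the compiler from the same template, which is where the type safety comes from.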
Let’s see another example - usage of a template for recursive factorial calculation:
template <int N>
struct Factorial
{
enum { value = N * Factorial<N - 1>::value };
};
template <>
struct Factorial<0>
{
enum { value = 1 };
};
Factorial<5>::value // 5! = 120
We see here that an integer can be a template argument and that a template specialization (template <>) is needed to define a value for a specific argument.
Constant expressions
The constexpr specifier indicates that the value of an expression can be evaluated at compile time. For example, when such a constant expression is defined:
constexpr int sum(int a, int b)
{
    return (a + b);
}
a call with constant arguments, such as sum(1, 2), can be precalculated at compile time (this is guaranteed when the result is used in a constant expression, e.g. to initialize a constexpr variable), so the calculation won’t consume resources at the application’s runtime.
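A simple way to convince ourselves that a constexpr function really is evaluated by the compiler is to use its result where only compile-time constants are accepted. The sketch below repeats a sum function like the one above and adds a hypothetical constexpr counterpart of the Factorial template.

```cpp
constexpr int sum(int a, int b)
{
    return a + b;
}

// constexpr counterpart of the Factorial template (illustrative).
constexpr unsigned long long factorial(unsigned int n)
{
    return n == 0 ? 1ULL : n * factorial(n - 1);
}

// static_assert only accepts constant expressions, so the fact that
// these lines compile shows the calls were evaluated at compile time.
static_assert(sum(1, 2) == 3, "sum is evaluated by the compiler");
static_assert(factorial(5) == 120, "5! = 120");

// An array size must also be a compile-time constant.
int buffer[sum(1, 2)];
```

When the same functions are called with values known only at runtime, they behave like ordinary functions.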
Metaprogramming
Metaprogramming is just modifying programs by other programs or by themselves. It turns out that templates are a kind of functional programming language and can be used by the compiler to generate source code.
Remember? It’s exactly what we were doing with pre-build scripts - creating a temporary source code with sensitive data obfuscated.
String obfuscation
Having understood the ability to write code which can be executed by compilers, let’s create a simple string obfuscator which will replace plaintext data with XORed values just before compilation. We would like to use the obfuscation in the following manner: Obfuscated("secret");. The Obfuscated macro should replace "secret" with a decryption function with an encrypted argument: Decrypt_runtime(Encrypt_compiletime(secret)).
To use a constant string at compile time, we need to know its exact length, so we will need a compile time function which operates on this length value. First, we need to create a template which takes an integer as an argument: template <unsigned int N>.
Now we will create a struct which holds the obfuscated string (which will replace the plaintext in the source code) and has a compile time function (constexpr) as a constructor to obfuscate the plaintext (note that a loop inside a constexpr constructor requires C++14 or later):
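template <unsigned int N>
struct Obfuscator
{
    char data[N] = { 0 };

    // Evaluated at compile time: stores the XORed bytes instead of the plaintext.
    constexpr Obfuscator(const char* plaintext)
    {
        for (unsigned int i = 0; i < N; i++)
        {
            data[i] = plaintext[i] ^ 0x11; // same key as in Deobfuscate()
        }
    }
};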
Now we obfuscate data in source code by creating an Obfuscator<7> struct from the Obfuscator<N> template (7 = string length + null byte):
constexpr Obfuscator<7> obfuscated = Obfuscator<7>("secret");
To actually use the data in the application we need to decrypt it, so we add a deobfuscation function (which operates on a constant value, hence the const qualifier following its declaration) to the Obfuscator template:
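const char* Deobfuscate() const
{
    // static: the returned pointer must outlive this call. The buffer is shared
    // by all strings of the same length N and is not thread-safe.
    static char plaintext[N] = { 0 };
    for (unsigned int i = 0; i < N; i++)
    {
        plaintext[i] = data[i] ^ 0x11;
    }
    return plaintext;
}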
Now we can deobfuscate the obfuscated constant variable: obfuscated.Deobfuscate().
The last thing to do is to create a helper macro which simplifies the obfuscation in the source code. We will use another goodie of C++11 - lambda functions:
#define Obfuscated(string) []() -> const char* \
{ \
constexpr auto secret = Obfuscator<sizeof(string) / sizeof(string[0])>(string); \
return secret.Deobfuscate(); \
}()
Thanks to this, string literals appearing in the binary are XOR encrypted. It’s possible to enhance this method to make the application create stack based strings, which won’t appear as contiguous strings in the data sections of the PE file.
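Putting the pieces together, the whole obfuscator can be sketched as a single header-style snippet. This is a simplified variant of the code above, not a definitive implementation: it assumes C++14 or later, returns a std::string instead of a raw pointer to avoid lifetime problems, takes the literal by array reference, and uses the hypothetical names KEY and OBFUSCATED.

```cpp
#include <cstddef>
#include <string>

// N is the literal's length including the terminating null byte.
template <std::size_t N>
struct Obfuscator
{
    static constexpr char KEY = 0x11; // hypothetical single-byte XOR key

    char data[N] = { 0 };

    // Runs at compile time: stores the XORed bytes instead of the plaintext.
    constexpr Obfuscator(const char (&plaintext)[N])
    {
        for (std::size_t i = 0; i < N; i++)
        {
            data[i] = static_cast<char>(plaintext[i] ^ KEY);
        }
    }

    // Runs at run time: returns an owning copy of the decrypted string.
    std::string Deobfuscate() const
    {
        std::string plaintext(N - 1, '\0'); // N - 1: skip the null terminator
        for (std::size_t i = 0; i + 1 < N; i++)
        {
            plaintext[i] = static_cast<char>(data[i] ^ KEY);
        }
        return plaintext;
    }
};

// The constexpr variable forces the constructor to be evaluated by the compiler.
#define OBFUSCATED(str) []() -> std::string { \
        constexpr auto secret = Obfuscator<sizeof(str)>(str); \
        return secret.Deobfuscate(); \
    }()
```

With this in place, std::string address = OBFUSCATED("www.example.com"); yields the original text at runtime. Whether the plaintext is really absent from the binary depends on the compiler and its settings, so it is worth checking the output with a strings tool.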
Other possibilities
It’s possible to implement quite advanced string and code obfuscation using template metaprogramming. For a more detailed explanation see this awesome workpaper by Sebastien Andrivet and his ADVobfuscator tool, which implements the described concepts. There are a number of such obfuscators available, and the best thing about them is that we can use them by just adding header files to the project:
https://github.com/fritzone/obfy
https://github.com/revsic/cpp-obfuscator
Summary
This post was just an introduction to advanced and powerful obfuscation methods which leverage LLVM compiler infrastructure and template metaprogramming.
Next time we will talk about keyloggers and implement one.