A feature scheduled for release in Inko 0.18.0 is the ability to define types that are allocated on the stack instead of on the heap:
# Instances of this type are heap allocated.
type User {
  let @name: String
}

# This goes on the stack instead due to the use of the `inline` keyword.
type inline User {
  let @name: String
}
This also meant adding support for passing such values as arguments, capturing them in closures, and returning them. This turned out to be a lot more complicated than anticipated, as the way LLVM handles structure arguments and returns with regard to the system ABI is surprising at best, and downright terrible at worst. Let's take a look at why that is.
Within the context of this article, "ABI" refers to the system ABI (Application Binary Interface). In essence, the ABI is a specification that states how values should be passed around (i.e. what registers to place the data in), how to call functions, who (the caller or callee) is supposed to clean up certain registers when returning from a function, and so on.
Two examples of ABIs are the System V AMD64 ABI and the ARM ABI (well, technically there are many ARM ABIs, but you get the point).
One important aspect is that in spite of CPUs not having a notion of structures or arrays (typically referred to as "aggregate" types in ABI specifications), ABIs still specify how one should pass them around to ensure consistency across platforms and compilers.
At least, that's what should be happening, but as you can probably guess from the title of this article that's not quite the case.
LLVM supports aggregate types such as structures, arrays, and vectors (the SIMD kind of vector), on top of its various scalar types such as integers and floats. The syntax LLVM's textual IR uses for structures is the following:
{ type1, type2, ... }
For example, take this Rust structure:
struct Example {
    a: i64,
    b: i32,
    c: f64,
}
This would be expressed in LLVM as follows (LLVM uses double for 64-bit floats instead of f64):
{ i64, i32, double }
With that in mind, one might think that returning a structure in a function is as simple as something along the lines of the following:
define { i64, i32, double } @example() {
  %1 = alloca { i64, i32, double }, align 8
  ...
  %2 = load { i64, i32, double }, ptr %1
  ret { i64, i32, double } %2
}
Similarly, it would make sense that if you want to accept a structure as an argument you'd do that as follows:
define void @example({ i64, i32, double } %0) {
  ...
}
Surely it's that easy, right? RIGHT?
Unfortunately, it's not, because that would just make too much sense. The problem comes down to the following: while LLVM supports aggregate types, it doesn't make any attempt to lower them to machine code that's compliant with the target ABI. This means the naive approach might work for simple structures, depending on the ABI of whatever target you're compiling for, or it might fail in a mysterious and difficult to debug manner.
Instead of LLVM handling this, it's up to each frontend to generate the correct IR for the target ABI. This is difficult to get right, as the ABI specifications aren't always clear, and thus it shouldn't come as a surprise that compilers get this wrong; I ran into several such cases while implementing stack allocated types for Inko.
In fact, go to the issue tracker of your favourite compiled programming language and search for "struct abi" and you'll likely find at least a dozen or so issues related to generated code incorrectly handling structure arguments and returns.
The point here is that getting this right is difficult, and LLVM's lack of, well, anything to make this easier isn't helping. This isn't a recent issue either: users of LLVM have been asking for years how to handle structure arguments and returns, and why LLVM does things the way it does. These aren't the only discussions either, as a search for terms such as "struct returns" and "struct abi" yields dozens of results spanning almost two decades.
When presented with these questions, the answers from maintainers and other contributors are typically the same: LLVM doesn't do this but clang does, so just copy whatever clang does. I've also seen plenty of mailing list discussions where people acknowledge the current state of affairs is less than ideal, but nobody seems interested in actually doing something about it (at least that I know of).
OK, so we know LLVM doesn't handle structures and ABI compliance for us, so what now? Fear not, for I have gained a few new grey hairs going through the trouble of figuring this out for at least AMD64 and ARM64, so you don't have to.
This isn't an in-depth specification, but rather a brief and easy to understand overview of how to generate the correct LLVM IR for passing structures as arguments and returning them, based on what existing compilers such as clang and rustc do, and on what I ended up implementing for Inko. Let's start with passing structures as arguments on AMD64.
If the size of the structure is 8 bytes or less, pass the structure as an integer of its size in bits. Thus, if the structure is 5 bytes you'd use i40 as the type. LLVM takes care of rounding this up to the correct size. I think it should be fine to do this rounding yourself, but I stuck with what clang does for the sake of making it easier to compare outputs between compilers.
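
To make this concrete, here's a sketch of what a call might look like for a hypothetical 5-byte structure { i32, i8 }, using the memcpy pattern covered later in this article:

declare void @llvm.memcpy.p0.p0.i64(ptr, ptr, i64, i1)

define void @callee(i40 %0) {
  ; The callee would copy %0 into a { i32, i8 } slot; see the section on
  ; memcpy further down.
  ret void
}

define void @caller(ptr %original) {
  ; %original points to the { i32, i8 } value. Copy it into a slot with
  ; the ABI type, then load and pass that value.
  %abi = alloca i40, align 8
  call void @llvm.memcpy.p0.p0.i64(ptr %abi, ptr %original, i64 5, i1 false)
  %tmp = load i40, ptr %abi
  call void @callee(i40 %tmp)
  ret void
}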
If the size is between 8 and 16 bytes, the logic is a little more difficult. First, "flatten" the structure such that you end up with a list of all the fields (taking into account nested structures). So { { { { i64, i64 } } }, double } is essentially treated the same as { i64, i64, double }. Using this list of fields, classify each field as either an integer or a float, along with its size in bytes. So for { i64, double } you'd end up with the following list of classes:
[Int(8), Float(8)]
The next step is to combine/squash these classes together into two classes, turning this list of classes:
[Int(4), Int(4), Float(8)]
Into the following pair:
(Int(8), Float(8))
The first field/class is always 8 bytes, while the second field/class might be smaller based on the alignment of the structure. It's possible that a structure consists of both floats and integers, such as { i32, float, i64 }. To handle such cases, the combine routine uses the following logic: if all the fields combined into a single field are of the same class, use that class as-is (with the summed size). If there's instead a mixture of classes, promote the class to an integer:
{ float, float, i64 } -> { double, i64 }
{ i32, float, i64 } -> { i64, i64 }
{ double, i32, i32 } -> { double, i64 }
Once you have the pair of two classes, turn them into their corresponding LLVM types and use those types as the two fields for a newly generated structure.
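
The end result is a signature along these lines, using a hypothetical 16-byte structure { i32, float, i64 } that classifies as (Int(8), Int(8)):

define void @callee({ i64, i64 } %0) {
  ; %0 carries the raw bits of the { i32, float, i64 } value. The body
  ; would copy it into a properly typed slot, as discussed further down.
  ret void
}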
For structures larger than 16 bytes, keep the structure type as-is, but set the byval parameter attribute when generating the function signature and when passing the argument to a call instruction.
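
For example, a sketch for a hypothetical 24-byte structure looks like this. Note that byval takes the pointee type as its argument, and that the attribute is repeated at the call site:

declare void @callee(ptr byval({ i64, i64, i64 }) align 8)

define void @caller() {
  ; The structure is passed indirectly: the caller allocates it and passes
  ; a pointer, with byval instructing LLVM to pass a copy per the ABI.
  %1 = alloca { i64, i64, i64 }, align 8
  call void @callee(ptr byval({ i64, i64, i64 }) align 8 %1)
  ret void
}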
The logic for returning structures up to 16 bytes is the same as for passing them as arguments.
For structures larger than 16 bytes, keep the structure type as-is. For functions returning such a structure, the first argument must be a pointer with the sret attribute (and optionally the noalias and nocapture attributes). The sret attribute takes a type as its argument, which must be the type of the structure that's "returned".

The presence of the sret argument means you'll need to shift any user-provided arguments accordingly. When processing a return in such a function, transform it into a pointer write to the sret pointer and return void instead. In other words, a function like this:
define { i64, i64, i64 } @example(...other arguments...) {
  %1 = alloca { i64, i64, i64 }, align 8
  ...
  %2 = load { i64, i64, i64 }, ptr %1
  ret { i64, i64, i64 } %2
}
Is turned into this:
define void @example(ptr sret({ i64, i64, i64 }) noalias nocapture %0, ...other arguments...) {
  %1 = alloca { i64, i64, i64 }, align 8
  ...
  %2 = load { i64, i64, i64 }, ptr %1
  store { i64, i64, i64 } %2, ptr %0
  ret void
}
Let's move on to ARM64. When passing a structure as an argument, first check if the structure consists of up to four floating-point fields of the same type (e.g. four double fields or three float fields), known as a "homogeneous floating-point aggregate" (HFA). This check can/should reuse the same classify routine used by the AMD64 logic.
If the structure is an HFA, pass a flattened version of the structure, such that this:

{ { { float } }, float, float }

Is turned into this:

{ float, float, float }
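
As a sketch, a function taking and returning such a hypothetical three-float HFA (returns are covered below, and follow the same rules) would look like this:

define { float, float, float } @identity({ float, float, float } %0) {
  ; Both the argument and the return value use the flattened HFA type.
  ret { float, float, float } %0
}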
If the structure is not an HFA, the following rules apply (in order):

1. If the size is 8 bytes or less, pass the structure as a single i64.
2. If the size is between 8 and 16 bytes, pass the structure as two i64 values.
3. If the structure is larger than 16 bytes, pass it as a pointer with the byval attribute.

The rules for returning structures are the same as those for passing them, and use the same generated argument approach as AMD64 (i.e. a pointer with the sret attribute), with one small change compared to the argument rules: for structures with a size of up to 8 bytes, the return type is an integer of the structure's size in bits. So if the structure is 5 bytes, the returned type is an i40.
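
As a sketch, a hypothetical 12-byte structure { i64, i32 } would then be passed like this (clang uses the array type [2 x i64] to represent the two i64 values, which describes the same 16 bytes):

define void @callee([2 x i64] %0) {
  ; The body would spill %0 and memcpy it into a { i64, i32 } slot,
  ; as covered in the next section.
  ret void
}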
Due to the above rules, it's likely that the types of structure arguments don't match the types of their corresponding alloca slots. For example, if the user-defined type of an argument is some structure A, then the type at the LLVM level might be different based on its size. To handle such cases, we have to memcpy structure arguments into these slots. Using just load and store with the correct ABI types isn't enough, and seems to result in LLVM generating incorrect machine code.
Based on the output of clang and my own testing, it appears the rules to apply are as follows:
When passing a structure as an argument, the caller must first memcpy it into an alloca slot that has the correct ABI type based on the argument rules mentioned above. The caller must then substitute the original argument with the result of a load (using the ABI type as part of the load) from that alloca slot. This isn't necessary if the original and ABI types are identical. This means you'll end up with something like this:
call void @llvm.memcpy.p0.p0.i64(ptr ABI_ALLOCA, ptr ORIGINAL_ALLOCA, i64 SIZE, i1 false)
%tmp = load ABI_TYPE, ptr ABI_ALLOCA
call void @the_function(ABI_TYPE %tmp)
Here ABI_ALLOCA is the alloca slot for the structure using its ABI type, while ORIGINAL_ALLOCA is the alloca slot storing the structure using its true/user-defined type. SIZE is the size of the structure in bytes, while ABI_TYPE is the type of the structure according to the ABI argument rules outlined above.
Functions that receive structure arguments must also memcpy them into a local alloca slot, and then use that slot instead of the original argument. Similarly, when returning a structure using the sret attribute, the data must be copied into the sret pointer using memcpy.
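
Putting this together, here's a sketch of the callee side for the hypothetical { i32, float, i64 } structure from earlier:

declare void @llvm.memcpy.p0.p0.i64(ptr, ptr, i64, i1)

define void @callee({ i64, i64 } %0) {
  ; Spill the ABI-typed argument into a temporary slot.
  %abi = alloca { i64, i64 }, align 8
  store { i64, i64 } %0, ptr %abi
  ; Copy the bits into a slot with the structure's real type; the rest of
  ; the function body then uses %real instead of %0.
  %real = alloca { i32, float, i64 }, align 8
  call void @llvm.memcpy.p0.p0.i64(ptr %real, ptr %abi, i64 16, i1 false)
  ret void
}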
To ensure these memcpy calls don't stick around, the memcpyopt optimization pass is used to remove them where possible. This pass is included automatically when using the default<O1>, default<O2> or default<O3> optimization pipelines.
Note that I'm still not sure we always need to use memcpy when passing structures around. An ABI might mandate that callees copy the structures and use the copy, but I recall running into issues when not using memcpy even when it wasn't strictly required. My pessimistic guess is that this is what clang does, and that LLVM's optimization passes are written with clang in mind, generating incorrect code when the IR differs from what clang produces. Or perhaps the ABI did require it and it just wasn't clear to me.
The simplest thing LLVM could do to improve this while retaining backwards compatibility is to introduce some sort of system-abi function attribute. When this attribute is applied, one can pass/return structures to/from a function just like any other scalar value, and LLVM takes care of lowering this to the correct ABI-compliant code:
; Function Attrs: system-abi
define { i64, i32, double } @example() {
  %1 = alloca { i64, i32, double }, align 8
  ...
  %2 = load { i64, i32, double }, ptr %1
  ret { i64, i32, double } %2
}
This would remove the need for every frontend to reimplement the same logic for every target it wishes to support, and likely reduce the number of compilers running into the same ABI related bugs. While it might theoretically inhibit certain optimizations, I can't think of any that would justify every frontend having to reimplement the same complexity.
Whether something like this will ever be added to LLVM remains to be seen. I doubt it will happen any time soon though, as making LLVM easier to use or improving its compile-time performance just doesn't appear to be much of a priority for the LLVM maintainers and contributors.