BFC, or (B)rain(F)uck (C)ompiler, is a Brainfuck compiler capable of taking Brainfuck source files and producing from them GNU-as-compatible assembly files encompassing equivalent functionality. I made this project because my friend and I wanted to make brainfuck interpreters, but I decided that too many of those already existed. And while many brainfuck compilers also exist, there are substantially fewer of those than there are interpreters.
It only supports an x86_64 Linux target, but the underlying architecture is designed in such a way that it should be fairly trivial to expand it to other targets. I am not supporting / working on BFC anymore but still decided to write this article in case someone wants to read some learned experience on implementing a Brainfuck compiler.
BFC is free software.
First, clone the source from the GitHub repo, build the project using mincbuild, and install it.
This process will install the compiled bfc
binary to /usr/bin
. To
uninstall it you can either manually delete it, or run:
Generate an assembly source file from a Brainfuck source file:
Generate an object file from a Brainfuck source file (this will invoke GNU
assembler on the system from /usr/bin/as
after initial compilation):
Generate an executable binary from a Brainfuck source file (this will invoke
GNU assembler, /usr/bin/as
, and then GNU linker, /usr/bin/ld
, after initial
compilation):
This is all detailed in the help menu displayed upon running:
This is probably the only thing that caused me any headache while implementing BFC, so I will document it here in case any prospective Brainfuck compiler implementors want to save themself the struggle.
The thing is, this is actually very simple; you just need to think about it in the proper way.
Start by assigning every conditional ('[' and ']', which I shall call condbegin and condend respectively) its own "index" or "value" within the source. For example, take this short piece of Brainfuck code:
I have suffixed each conditional with a letter-comment name. So when I refer to "condbegin A", I am talking about the one on the first line, on the left.
Imagine assigning each of these conditionals a unique index:
When writing out the compiled assembly you will refer to each of these conditionals by their number indices - BFC does this in hexadecimal to save space in files many conditionals.
Now imagine you are currently at condbegin A, and you need to decide whether to jump to condend B. The way this is done in Brainfuck is that if the value at the current position of the data pointer is 0, then you perform the jump. Otherwise, the block is executed. The minimum to fulfill this would be two jump labels, and a conditional jump instruction, something like this:
r13
to store the data pointer.
je .Lce_1
.Lcb_0: // BFC uses cb_
as a prefix for condbegin jump labels.
// ...
.Lce_1: // BFC uses ce_
as a prefix for condend jump labels.Assuming a well-formed program - you can assume, upon entering a condbegin, that there will be at least N + 1 condend labels following it, where N is the number of other condbegin labels that follow it. This is because every condbegin must have a condend to terminate it.
Thus, you can just take the current jump label which you're in upon encountering a condbegin (you can either store this as compiler state or determine it by seeking backwards through the file and counting all instances of '[' and ']'), and run through the source until you have found N + 1 condend labels.
Then, you can conditionally jump to the label with index J + 2N + 1, where J is the index of the encountered condbegin label and N is the minimum number of other condbegin labels encountered before a condend label.
In practice, what BFC does is it creates a variable, nests_rem
, and
assigns it an initial value of 1. It also creates a variable called jmplabel
with the currently stored label value, as described above. Then, it runs through
the Brainfuck source from the current point; incrementing nests_rem
if the
current character is '[', and decrementing nests_rem
if it is ']'. In either
case, if the character is either '[' or ']', jmplabel
is incremented. As soon
as nests_rems
reaches zero, this algorithm terminates. By the end of this
process, jmplabel
is the index of the destination of the conditional jump. BFC
doesn't need to worry about whether the label with this index actually exists
yet, since it knows that it will inevitably be generated later in a well-formed
program.
In the case of encountering a condend, you can simply follow this procedure
but seek backwards through the source instead of forwards, and increment
nests_rem
in the case of a ']', decrementing it in the case of a '['.
BFC also adds jump label prefixes for readability when the generation is
complete (i.e. cb_XXXX
, ce_XXXX
), but this is actually not necessary. Every
label has a unique index anyway, so the label prefixes are not required.
Most obviously, you can try to solve the problems described above in your implementation.
As a short note, try to look at optimizations you implement and optimize them further. Actually, one thing I missed when initially implementing compiler optimizations (which I won't add now since I'm not working on BFC anymore) is the way data increments and decrements interact.
See, one optimization I implemented was something like "increment shortening". That is, something like the following code:
++++
.
.Lcb_0:
incb (%r13)
incb (%r13)
incb (%r13)
incb (%r13)
// ...... would get optimized to:
++++
.
.Lcb_0:
addb $4, (%r13)
// ...... which is not only more efficient in terms of speed, but also in terms of
generated filesize. This optimization works by literally scanning the Brainfuck
source and counting how many consecutive +
s occur at a specific point in the
code. But, actually, this can be taken further. You could have a counter
variable keeping track of the current "increment amount" and also count -
s.
BFC doesn't do this, but it should be absolutely trivial to implement in your own compiler.
For instance, this code:
+++--
.
.Lcb_0:
inb (%r13)
inb (%r13)
inb (%r13)
decb (%r13)
decb (%r13)
// ...... could trivially be optimized to:
+++--
.
// three increments and two decrements is mathematically equivalent to one
// increment.
.Lcb_0:
addb $1, (%r13)
// ...The compiler could even be made to check for a special case of +/-1 and
generate an incb
/ decb
instead of an addb
/ subb
, if such an
instruction would be more efficient in the targeted instruction set.
This work by tirimid is licensed under CC BY-SA 4.0