# Expanding on ArchiveLib
## Background
ArchiveLib is a proprietary algorithm for compressing code. I wrote about the process of turning the code from obfuscated C into Rust here: *A journey to the center of ArchiveLib*. In this post I want to provide a pseudo-implementation of the algorithm in plain-ish language so that others can implement it.

The easier of the two algorithms to reason about is the expansion/decompression algorithm, so that is the one described here.
## Some conventions
A lot of the work to read this data relies on bitwise operations and, worse, reading weird numbers of bits from weird offsets, so some conventions:

- Bits within a byte are represented as `byte(hex)/bit`; so to read the `F` from the 2nd byte: `01/4-01/8`.
- Big-endian everywhere (largest byte on the left).
- All binary numbers will have underscores where the 4-bit boundary exists in the file. For example, `01/2-01/8` is `0b11_1111`.
- All numbers are decimal, except the byte offsets, and where one of the prefixes `0b`, `0o` and `0x` is used, which represent binary, octal and hexadecimal respectively. For example `11 == 0xB == 0o13 == 0b1011`.
- The code is Python-ish; the functions `read_unsigned` and `read_signed` take two parameters, `start_bit` and `bit_count`, and return an unsigned or signed integer respectively.
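To make these bit conventions concrete, here is a minimal sketch of the two reader functions. This is illustrative only: the global `data` buffer and the two's-complement interpretation in `read_signed` are my assumptions, not details taken from ArchiveLib.

```python
data = bytes([0x00, 0x3F])  # example buffer; byte 01 is 0x3F

def read_unsigned(start_bit, bit_count):
    # Bits are taken big-endian: bit 0 is the most significant bit of byte 0.
    value = 0
    for i in range(start_bit, start_bit + bit_count):
        value = (value << 1) | ((data[i // 8] >> (7 - i % 8)) & 1)
    return value

def read_signed(start_bit, bit_count):
    # Assumption: signed values are two's complement over bit_count bits.
    value = read_unsigned(start_bit, bit_count)
    if value >= 1 << (bit_count - 1):
        value -= 1 << bit_count
    return value

print(read_unsigned(10, 6))  # 63: the 0b11_1111 from the example above
```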
## Some caveats
I have learnt this by spending many hours staring at code in various forms of obfuscation, and whilst I hope that this guide will be clear, I expect there to be things that are not. If you find anything like that, please tweet/message me on Twitter or open an issue on GitHub.
## Some helpers

Some helper functions that clarify the code that follows.
### `read_coded_length`
In quite a few places the idea of extending the size of a value by reading extra bits appears. To give an idea of what it is doing, take this table:
| Bit pattern | Read value |
| --- | --- |
| `0b000` | 0 |
| `0b001` | 1 |
| … | … |
| `0b110` | 6 |
| `0b1110` | 7 |
| `0b11110` | 8 |
| `0b111110` | 9 |
| `0b1111110` | 10 |
```python
def read_coded_length(start_bit, initial_length):
    ...
```
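The listing above is truncated to its first line, but the behaviour can be reconstructed from the table: read `initial_length` bits, and if they are all ones, each further 1 bit adds one to the value until a 0 bit ends the run. The following is a sketch, not the original implementation; the `data` buffer, the `(value, bits_read)` tuple return, and the bundled `read_unsigned` (restated so this block runs on its own) are my assumptions.

```python
data = bytes([0b11110000])  # example input: 0b11110 should decode as 8

def read_unsigned(start_bit, bit_count):
    # Big-endian bit reader, as described in the conventions.
    value = 0
    for i in range(start_bit, start_bit + bit_count):
        value = (value << 1) | ((data[i // 8] >> (7 - i % 8)) & 1)
    return value

def read_coded_length(start_bit, initial_length):
    # Read initial_length bits; an all-ones value is then extended by one
    # for every further 1 bit, until a 0 bit terminates the run.
    value = read_unsigned(start_bit, initial_length)
    bits_read = initial_length
    if value == (1 << initial_length) - 1:
        while read_unsigned(start_bit + bits_read, 1) == 1:
            value += 1
            bits_read += 1
        bits_read += 1  # consume the terminating 0 bit
    return value, bits_read

print(read_coded_length(0, 3))  # (8, 5): 0b11110 -> 8, using 5 bits
```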
## In the beginning
We need some sample data to begin with. To take advantage of some of the strengths of this library, repetition is heavily used.
```
The code is easy.
```

```
00000000  54 68 65 20 63 6F 64 65 20 69 73 20 65 61 73 79  |The code is easy|
```
This produces the following output when compressed:
```
00000000  00 3F 4B 52 8D AF F8 FB 80 3E 55 87 A6 A9 35 DA  |.?KR.....>U...5.|
```
## Building our tables
The algorithm has 2 parts; the first (and harder to explain) part is the generation of a series of lookup tables. But before we get there we must first read the number of operations in this lookup table:

```python
operations = read_unsigned(0, 16)  # In our example this is 63.
```
## Our first tables: `tree_generator` & `tree_generator_len`

- Bytes used `2/0-5/2`: `0b0100_1011_0101_0010_1000_1101_10`
So, now we have got that out of the way, we can start building the first of our 3 tables.
```python
fill_count = read_unsigned(16, 5)  # In our example this is 9.
```
Now we can start reading out values into `tree_generator_len`. This sample will use `2/5` to `5/2` (`0b011_0101_0010_1000_1101_10`), which is:

```python
offset = 21
```
In our example this produces the following items, with the remaining elements being zero: `[3, 2, 4, 0, 0, 4, 3, 3, 2, ...]`
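The listing above only shows the starting offset, but the loop that fills the table can be sketched so that it reproduces the sample values exactly. Everything here is an inference, not the original source: the helpers are restated so the block runs on its own, the table size of 19 is a guess, and the 2-bit zero-run read after the third entry is deduced purely from the sample bits (the `10` at `3/6-3/8` decodes as "skip 2 entries", which matches the two zeros in the output).

```python
data = bytes.fromhex("003F4B528DAFF8FB803E5587A6A935DA")  # the compressed sample

def read_unsigned(start_bit, bit_count):
    # Big-endian bit reader, as described in the conventions.
    value = 0
    for i in range(start_bit, start_bit + bit_count):
        value = (value << 1) | ((data[i // 8] >> (7 - i % 8)) & 1)
    return value

def read_coded_length(start_bit, initial_length):
    # As in the helper above: an all-ones prefix is extended bit by bit.
    value = read_unsigned(start_bit, initial_length)
    bits_read = initial_length
    if value == (1 << initial_length) - 1:
        while read_unsigned(start_bit + bits_read, 1) == 1:
            value += 1
            bits_read += 1
        bits_read += 1  # the terminating 0 bit
    return value, bits_read

fill_count = read_unsigned(16, 5)   # 9
tree_generator_len = [0] * 19       # hypothetical table size
offset = 21
i = 0
while i < fill_count:
    value, used = read_coded_length(offset, 3)
    tree_generator_len[i] = value
    offset += used
    i += 1
    if i == 3:
        # Assumption: after the third entry, a 2-bit count of zero
        # entries follows (here 0b10 == 2, so entries 3 and 4 stay zero).
        i += read_unsigned(offset, 2)
        offset += 2

print(tree_generator_len[:9])  # [3, 2, 4, 0, 0, 4, 3, 3, 2]
```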
## Tree generation
As a bit of an aside: there is a data structure here that I describe as a tree, but if someone has a better name for it then feel free to send it to me.
Let’s take these two arrays:
```python
left = [0, 2, 3, 1]
```