## Introduction

Compressed, encrypted, and random data all have high entropy, which is why many products use entropy analysis to detect malicious code in binaries that have never been examined before.

In a previous post about masking, I suggested using a deterministic random number generator with the Fisher-Yates shuffle to scramble data without the increase in entropy that compression and encryption cause. Of course, that approach provides no confidentiality, which may make it unsuitable.

This got me thinking about what algorithms could be used to reduce entropy, and the answer is much simpler than I expected. If all we need to do is reduce the number of bits per byte, then simple encoding is the answer. Or am I mistaken? Let me know in the comments.

The obvious disadvantage of Base32 encoding is the increase in data size: every 5 bytes become 8, an increase of 60%. However, if the compression ratio of the original data is close to 50-70% (which can be achieved with something like GZip, LZMA or ZPAQ), then the growth after reducing entropy isn't that bad. We keep the confidentiality of the data, and it looks benign under analysis.

## Increasing Entropy

Compression algorithms such as LZ77, and entropy coders such as Huffman or arithmetic coding, are designed to remove redundant or repetitive data. Encryption algorithms use techniques like transposition and byte substitution to make data unpredictable or seemingly random. If they're good, the output will appear random with no discernible pattern or structure, which makes it difficult to analyse and determine exactly what it is.

## Decreasing Entropy

Adding redundant data, such as null bytes, can lower the entropy score. However, some tests disregard zeros because they skew the accuracy of the results. The reason for using Base32 rather than Base64 is that the latter doesn't reduce the entropy enough, while Base16 simply wastes too much space. Base32 isn't perfect, but it's good enough to demonstrate the idea.

## Encoding

There are problems with this code, and it's inadvisable to use it outside this blog. For it to work, inbuf must allow out-of-bounds reads of up to 4 bytes; in practice, pad inbuf with at least 4 null bytes. It might seem odd for it to work that way, but it helps later when combining encoding and decoding. For every encoded byte, the top three bits will always be zero, and you could say that's a pattern. That can be avoided by flipping bits of some bytes.

```
#include <stdint.h>
#include <stddef.h>

size_t
base32_encode(size_t inlen, void *inbuf, void *outbuf) {
    uint8_t *out = (uint8_t*)outbuf;
    uint8_t *in = (uint8_t*)inbuf;
    uint32_t x = 0, z = 0;
    size_t outlen = (inlen + 4) / 5 * 8;
    // return size of buffer required?
    if (!outbuf) return outlen;
    // encode bytes
    while (outlen) {
        // accumulate 8 bits; may read past the end of inbuf,
        // which is why the input needs null-byte padding
        x = (x << 8) | *in++;
        z += 8;
        // emit complete 5-bit groups
        while (z >= 5) {
            z -= 5;
            *out++ = (x >> z) & 31;
            outlen--;
        }
    }
    // return encoded length
    return (out - (uint8_t*)outbuf);
}
```
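The bit-flipping idea mentioned above can be sketched as a separate pass over the encoded buffer. This is my own illustration, not part of the original routine: XOR the top three (always-zero) bits of every other byte with a fixed pattern, and run the same pass again before decoding, since XOR is its own inverse.

```
#include <stdint.h>
#include <stddef.h>

// Flip the top three (always-zero) bits of every other encoded byte
// to break up the pattern. Running the same function a second time
// restores the original buffer, so it must be applied before decoding.
void flip_pattern(uint8_t *buf, size_t len) {
    for (size_t i = 0; i < len; i += 2) {
        buf[i] ^= 0xE0; // 0b11100000
    }
}
```

A more thorough version might vary the XOR pattern per byte; the point is only that the transform must be undone before the decoder sees the data.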

## Decoding

The decoder needs to know the original size of the data before encoding. That's not a big issue, considering most compression and encryption schemes work the same way; just include the original length when transporting the encoded data.

```
void
base32_decode(size_t outlen, void *inbuf, void *outbuf) {
    uint8_t *out = (uint8_t*)outbuf;
    uint8_t *in = (uint8_t*)inbuf;
    uint32_t x = 0, z = 0;
    while (outlen) {
        x = (x << 5) | *in++;
        z += 5;
        while (z >= 8) {
            z -= 8;
            *out++ = (x >> z) & 255;
            outlen--;
        }
    }
}
```

## Combined Encoding and Decoding

You may have noticed the similarity between the two routines by now and thought about combining them. So long as the correct length is calculated for the output, we can indeed combine both and switch between encoding and decoding with a flag.

```
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

void
base32(size_t outlen, void *inbuf, void *outbuf, bool encode) {
    uint8_t *out = (uint8_t*)outbuf;
    uint8_t *in = (uint8_t*)inbuf;
    uint32_t x = 0, z = 0;
    uint8_t wl = 8, rl = 5, m = 255; // defaults are for decoding
    if (encode) {
        rl = 8;  // read length
        wl = 5;  // write length
        m = 31;  // mask
    }
    while (outlen) {
        x = (x << rl) | *in++;
        z += rl;
        while (z >= wl) {
            z -= wl;
            *out++ = (x >> z) & m;
            outlen--;
        }
    }
}
```

## Results

To test whether the encoding reduces the entropy of data, random bytes are generated from the operating system. A Shannon entropy test calculates the score before and after encoding.

```
#include <stdint.h>
#include <stddef.h>
#include <math.h>

double
shannon_entropy(void *inbuf, size_t len) {
    uint8_t *in = (uint8_t*)inbuf;
    double entropy = 0;
    uint32_t frequency[256] = {0};
    // Count the frequency of each byte value
    for (size_t i = 0; i < len; i++) {
        frequency[in[i]]++;
    }
    // Calculate the entropy in bits per byte
    for (size_t i = 0; i < 256; i++) {
        if (frequency[i] > 0) {
            double probability = (double)frequency[i] / len;
            entropy -= probability * log2(probability);
        }
    }
    return entropy;
}
```

Before encoding, the Shannon test scores 7.835897 for 1024 random bytes. After using Base32 to encode it, Shannon reports 4.983226.

## Summary

Some of you may be saying: "This isn't Base32 encoding!" It's not what's described in RFC 4648, which maps the values to an alphabet as a final step, but it uses the same process. This code is only intended to reduce entropy, but it wouldn't take much to make it work like the Base32 described in the RFC. The main difference, of course, is that padding with the equals sign isn't used. Instead, the code depends on the caller to specify the output length and to pad the input buffer for out-of-bounds reads.

Finally, the decoding algorithm above can be used to compress strings, but only if each character fits into 5 bits of information. For that reason, it might only be suitable for uppercase or lowercase characters together with a few symbols or digits.