Tunstall Codes

A coding scheme unlike any Huffman codes

Hayden Sather
Experience Stack

--

Photo by Ricardo Gomez Angel on Unsplash

“Intelligence is the ability to adapt to change” — Stephen Hawking (1942–2018)

This article is inspired by Introduction to Data Compression 5th edition. Chapter 3 — Sayood, Khalid

Introduction

Previous coding schemes that have been covered in this blog have operated under the same principle: The more frequent a symbol is, the shorter its code length should be. Tunstall codes, developed in 1967 by Brian Parker Tunstall, do not follow that trend. In fact, every Tunstall code for the same source is the same length. This article covers Tunstall codes, their applications, and covers real some real-world examples.

Tunstall Codes

A single Tunstall codeword can represent multiple letters in a sequence. Tunstall codes are a lossless compression scheme that always provides a unique representation.

Example

An example codebook is shown below:

For example, the sequence AAABAABABB will be translated.

AAA=00
B=11
AAB=01
AB=10
B=11

Then, concatenate all the codes together:

0011011011

This is the compressed representation of the original sequence.

Properties

The design of a code with fixed-length codes that can represent multiple symbols per code should satisfy the following conditions:

Condition 1

Any sequence output by the source should be able to be compressed/represented by the codebook.

Condition 2

The average number of source symbols represented by each codeword should be maximized.

The second condition is easy to understand why it must be met. The more symbols/codeword, the fewer bits that are required to compress a sequence. The first condition requires more explanation, which will be discussed below.

Condition 1 Explanation

This condition is best explained with an example of a codebook that does not meet this condition:

Counter Example

For example, consider the following codebook:

Non-Tunstall Code

The same sequence, AAABAABABB, will be translated.

AAA=00
B=11
Conflict! The following sequence, AABABB, can not be translated by the codebook.

This codebook is an example of a codebook that does not follow the first condition. Fortunately, Tunstall provided a method of generating codebooks to ensure that this condition is always met.

Generation Method

Suppose there exists a source that generates iid letters from an alphabet where the size of the alphabet is N.

The sequences that this data generates can be coded by an n-bit Tunstall code. The number of codewords will be 2ⁿ.

Start with all N letters of the alphabet in the codebook, along with their probabilities.

Then, remove the entry with the highest probability.

Then, append this letter to all the other letters (including itself) to make new strings and add all of these as entries in the codebook. The probabilities of the new strings are equal to the probability of both of the letters multiplied together. This will increase the size of the codebook from N to N + (N-1).

Repeat this last step again, but with all the new symbols included. Note that although the combined strings can be considered for the highest probability item, they don’t get combined with the highest probability item. Only the original letters get combined with the highest probability item.

This will be repeated as many times as possible (K times) as long as the following inequality holds:

Example

This example will walk through creating a 3-bit Tunstall code for a source with the following probabilities. Note that the probabilities are calculated from a source. This is an important distinction because if the probabilities are not calculated from the source:

This is a 3-bit Tunstall code, so the maximum number of codes is 2³=8. To find the number of iterations the algorithm will run:

First iteration

The symbol with the highest probability is A.

Remove A from the list, then append A to all of the current symbols to make new symbols. The probability of the new symbols will be the probability of A, 0.70, multiplied by the probability of the symbol it is being added to.

Now there are 5 symbols.

Second iteration

The symbol with the highest probability is AA. Repeat the above process but for AA.

Now there are 7 symbols. If this process were to be repeated again, it would result in 9 symbols. That would exceed the maximum number of 2³=8 symbols. This confirms the above result that the maximum number of iterations that can be run, for this example, is 2. So, the above table is the final codebook for the Tunstall code.

Conclusion

This article gave an introduction to Tunstall codes. This is a great option to compress letters from a source if the probabilities and the alphabet are known beforehand. Tunstall codes are unique because they can represent multiple letters for each symbol. Additionally, they have the same length of codes for all symbols. This also makes them more resilient to errors such as random bit-switches. Bit switches would only affect the current symbol that Tunstall codes are reading, while they would affect all following symbols for prefix-code-based schemes.

--

--

Machine Learning Engineer 🤖 | Computer Vision and Deep Learning Specialist 👀 | M.S. Data Science, M.S./B.S. Computer Science ‎‍🎓