The Derivation of Average Information
How Claude Shannon discovered the most important idea in information theory
“Information is the resolution of uncertainty” — Claude Shannon
This article is inspired by Introduction to Data Compression, 5th Edition, Chapter 2 — Khalid Sayood
Introduction
Introduction to Data Compression, 5th Edition, written by Khalid Sayood, is one of the seminal textbooks in the field of data compression. It provides a solid foundation in information theory and the other preliminary disciplines required to understand the field. It also includes several optional sections that explore a single idea in great detail without being required for the rest of the book. One such optional section is the derivation of average information, a measure Claude Shannon developed in the 1940s. The idea of the average information of a set of events is likely the first thing one learns when studying information theory. (I may write about that later; for now, I assume the reader is familiar with it.) Understanding how Shannon arrived at this concept, however, is a mathematically involved process. Sayood's textbook walks through the derivation, but I believe it can be presented in a more digestible way. This article is my attempt to do so.
Properties to Include in the Measure of Average Information
For any set of n independent events A₁, A₂, …, Aₙ, where each event has probability pᵢ = P(Aᵢ), the measure of average information over all of the events, H, should have the following properties. Note that H is treated as unknown throughout this article; it is discovered gradually through its properties, so until the end, H is referred to only abstractly:
Property 1
H should be a continuous function of all the probabilities pᵢ. In other words, H is a function of the entire set of probabilities,

H = H(p₁, p₂, …, pₙ)

and a small change in any pᵢ should produce only a small change in H.
Property 2
If all events are equally likely, then H should be an increasing function of n. The more outcomes there are, the more information the occurrence of any one of them carries.
So when n increases, H should increase.
This means that H can also be represented by:

H(1/n, 1/n, …, 1/n) = Hₙ(n)
(The notation Hₙ is chosen to represent H as a function of n alone when all of the outcomes are equally likely, which happens if and only if pᵢ = 1/n for all i.)
Property 3
The average information for a set of events should be the same regardless of whether and how the events are grouped.
For example, consider three events A₁, A₂, and A₃.
The average information of A₁, A₂, and A₃ considered separately is:

H = H(p₁, p₂, p₃)
If we instead put A₂ and A₃ together in one group and A₁ in its own group, the average information is calculated slightly differently. First, each group is identified by its group probability; then each outcome within a group is accounted for by its conditional probability given that its group occurred. The group probabilities are

q₁ = p₁ and q₂ = p₂ + p₃

and the third property requires that

H(p₁, p₂, p₃) = H(q₁, q₂) + q₁·H(p₁/q₁) + q₂·H(p₂/q₂, p₃/q₂)
The first term in the addition represents the information needed to specify which group the outcome falls in. The second term represents the information from within the first group, and the third term represents the information from within the second group, each weighted by its group's probability.
In summary, the third property requires that regardless of how the events are grouped, the average information remains the same.
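The derivation later in this article applies this property to an arbitrary number of groups, so it is worth writing down the general form (my paraphrase, not a quotation from Sayood). If the n events are split into disjoint groups G₁, …, G_r, the requirement reads:

```latex
% General statement of Property 3 (my paraphrase): split the events into
% disjoint groups G_1, ..., G_r, and let q_j be the total probability of group G_j.
% The second H in each summand is applied to the conditional probabilities of
% the members of G_j given that G_j occurred.
H(p_1, \ldots, p_n)
  = H(q_1, \ldots, q_r)
  + \sum_{j=1}^{r} q_j \, H\!\left( \frac{p_i}{q_j} : i \in G_j \right)
```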
Satisfying the Properties
In his paper, Shannon proved that the only function that satisfies the above three properties is:

H = −K Σᵢ₌₁ⁿ pᵢ log(pᵢ)

where K is an arbitrary positive constant; it simply sets the scale of the measure.
The base of the logarithm is likewise a matter of convention, since changing it is equivalent to changing K. Base 2 is most commonly used. Shannon's derivation is explained below.
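As a quick numerical illustration (my own sketch, not from Sayood's book), the Python snippet below evaluates this formula with K = 1 and base-2 logarithms, and checks the grouping requirement of Property 3 for the three-event example above. The function name average_information and the example probabilities are choices made for the illustration.

```python
import math

def average_information(probs):
    """H = -sum(p_i * log2(p_i)) with K = 1; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries 1 bit of average information; a fair 8-sided die carries 3 bits.
print(average_information([0.5, 0.5]))   # 1.0
print(average_information([1/8] * 8))    # 3.0

# Property 3 check: group A2 and A3 together and leave A1 in its own group.
p1, p2, p3 = 0.5, 0.3, 0.2
q1, q2 = p1, p2 + p3
direct  = average_information([p1, p2, p3])
grouped = (average_information([q1, q2])
           + q1 * average_information([p1 / q1])
           + q2 * average_information([p2 / q2, p3 / q2]))
print(abs(direct - grouped) < 1e-12)     # True: grouping does not change H
```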
Shannon’s Derivation
For the Case of Equally Likely Events
Let an experiment have n = kᵐ equally likely outcomes, produced by a sequence of m independent choices with k possibilities each. The outcomes of such an experiment can be represented as a tree.
For example, when there are m=3 choices and k=2 possibilities for each choice, the tree below shows the possible outcomes.
(Figure: a three-level binary tree showing the 2³ = 8 possible outcomes.)
From the second property above, the average information of this experiment is given by:

H = H(1/n, 1/n, …, 1/n) = Hₙ(n) = Hₙ(kᵐ)
For any given final outcome, the experiment made three selections, each between two equally likely choices. Each selection contributes Hₙ(2), so the information associated with specifying a final outcome is:

3·Hₙ(2)
Putting this together,

Hₙ(8) = Hₙ(2³) = 3·Hₙ(2)
This generalizes to:

Hₙ(kᵐ) = m·Hₙ(k)
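For readers who want to see why the information from successive selections simply adds, here is one way to spell out the step using Property 3 (my own elaboration; the article above states the result more briefly). Group the k² equally likely outcomes of two selections by the result of the first selection, giving k groups of k outcomes, each group with probability 1/k:

```latex
% Two selections with k possibilities each: group the k^2 equally likely outcomes
% by the result of the first selection, giving k groups of probability 1/k each.
H_n(k^2) = H_n(k) + \sum_{i=1}^{k} \frac{1}{k} H_n(k) = 2 H_n(k)
```

Repeating the same step m − 1 more times gives Hₙ(kᵐ) = m·Hₙ(k).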
For any integers j and k (each at least 2) and any integer l, there exists an integer m such that the following inequality holds. To make the later algebraic steps easier, l is taken to be arbitrarily large:

kᵐ ≤ jˡ ≤ kᵐ⁺¹
Taking the log of all terms:

log(kᵐ) ≤ log(jˡ) ≤ log(kᵐ⁺¹)
Bringing the exponents down in front of the logarithms:

m·log(k) ≤ l·log(j) ≤ (m+1)·log(k)
Dividing by l·log(k):

m/l ≤ log(j)/log(k) ≤ m/l + 1/l
Since l is very large, 1/l is a very small positive number. It will be called epsilon (ε):

ε = 1/l

m/l ≤ log(j)/log(k) ≤ m/l + ε
This means that both the lower and upper bounds of log(j)/log(k) can be made as close to m/l as desired by taking l large enough, so log(j)/log(k) itself must lie within ε of m/l. Formally:

| log(j)/log(k) − m/l | ≤ ε

where ε is a very small number.
This result will be used later.
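A small numerical sketch (my own, not from the book) makes the squeeze concrete. Fix j = 10 and k = 2; for each l, the largest integer m with kᵐ ≤ jˡ is m = ⌊l·log(j)/log(k)⌋, and m/l closes in on log(j)/log(k) as l grows:

```python
import math

j, k = 10, 2
target = math.log(j) / math.log(k)   # the value that m/l is squeezed toward

for l in (10, 100, 1_000, 10_000):
    m = math.floor(l * target)       # largest integer m with k**m <= j**l
    gap = abs(target - m / l)
    print(f"l={l:>6}  m/l={m / l:.6f}  gap={gap:.6f}  within 1/l: {gap <= 1 / l}")
```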
Finding an expression for Hₙ(n)
The second property above states that Hₙ monotonically increases with n. From our previous argument, assume the following again:

kᵐ ≤ jˡ ≤ kᵐ⁺¹
If the previous statement is true, then to satisfy the second property, the following must also hold:

Hₙ(kᵐ) ≤ Hₙ(jˡ) ≤ Hₙ(kᵐ⁺¹)
Applying the earlier result Hₙ(kᵐ) = m·Hₙ(k) to each term:

m·Hₙ(k) ≤ l·Hₙ(j) ≤ (m+1)·Hₙ(k)
Dividing by l·Hₙ(k):

m/l ≤ Hₙ(j)/Hₙ(k) ≤ m/l + 1/l
By the same argument as before, 1/l is a very small number, again called ε:

m/l ≤ Hₙ(j)/Hₙ(k) ≤ m/l + ε
Using the same argument as before, this means that Hₙ(j)/Hₙ(k) must also lie within ε of m/l as long as l is large enough. Formally:

| Hₙ(j)/Hₙ(k) − m/l | ≤ ε
The variable l can be chosen so that ε is the same in this inequality and in the earlier logarithm inequality. Both Hₙ(j)/Hₙ(k) and log(j)/log(k) are then within ε of m/l, so by the triangle inequality:

| Hₙ(j)/Hₙ(k) − log(j)/log(k) | ≤ 2ε
Essentially, the two ratios inside the absolute value must be arbitrarily close to each other, because 2ε can be made as small as we like while j and k stay fixed. The only way this can hold for every choice of j and k is if both:

Hₙ(j) = K·log(j) and Hₙ(k) = K·log(k)

for the same positive constant K.
Therefore, for any number of equally likely outcomes n:

Hₙ(n) = K·log(n)
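As a quick sanity check (mine, not the book's), take K = 1 and base-2 logarithms; the result reproduces the tree example from earlier:

```latex
% K = 1 and base-2 logarithms, matching the m = 3, k = 2 tree example:
H_n(2) = \log_2 2 = 1 \ \text{bit}, \qquad
H_n(2^3) = \log_2 8 = 3 = 3 H_n(2) \ \text{bits}
```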
For the Case of Unequally Likely Events
Shannon's derivation of average information for equally likely events has now been covered. Building on that result, his derivation for unequally likely events is explained next.
Consider an experiment with n groups of unequal sizes, where group i contains nᵢ outcomes. The experiment then has

Σⱼ₌₁ⁿ nⱼ
total outcomes. Each individual outcome is equally likely. The probability of group i occurring is then the rational number:

pᵢ = nᵢ / Σⱼ₌₁ⁿ nⱼ

(Restricting the pᵢ to rational values is not a real limitation: by Property 1, H is continuous in the probabilities, so the final result extends to arbitrary probabilities.)
Since all of the individual outcomes are equally likely, the result from the previous section applies within each group, so the information from group i on its own is:

Hₙ(nᵢ) = K·log(nᵢ)
If we specify an outcome by first naming the group it belongs to and then which member of that group it is, the grouping property gives the average information of the events as:

H = H(p₁, p₂, …, pₙ) + Σᵢ₌₁ⁿ pᵢ·Hₙ(nᵢ) = H(p₁, p₂, …, pₙ) + K Σᵢ₌₁ⁿ pᵢ·log(nᵢ)
(Note that each group's term carries the same arbitrary constant K from the equally likely result, so only a single arbitrary constant appears in the final expression.)
The experiment as a whole is simply a collection of Σⱼ nⱼ equally likely outcomes, so the equally likely result also gives H = K·log(Σⱼ₌₁ⁿ nⱼ). Combining our last two expressions for H gives:

K·log(Σⱼ₌₁ⁿ nⱼ) = H(p₁, p₂, …, pₙ) + K Σᵢ₌₁ⁿ pᵢ·log(nᵢ)
This can be solved for H(p₁, p₂, …, pₙ) and simplified in a few steps. (One fact that helps: the probabilities pᵢ sum to 1, so log(Σⱼ nⱼ) can be written as Σᵢ pᵢ·log(Σⱼ nⱼ).)

H(p₁, p₂, …, pₙ) = K Σᵢ₌₁ⁿ pᵢ·log(Σⱼ₌₁ⁿ nⱼ) − K Σᵢ₌₁ⁿ pᵢ·log(nᵢ)
H(p₁, p₂, …, pₙ) = −K Σᵢ₌₁ⁿ pᵢ·log(nᵢ / Σⱼ₌₁ⁿ nⱼ)
H(p₁, p₂, …, pₙ) = −K Σᵢ₌₁ⁿ pᵢ·log(pᵢ)
By convention, set K = 1:

H = −Σᵢ₌₁ⁿ pᵢ·log(pᵢ)
This finally brings us to Shannon’s equation for average information.
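To close the loop, here is a small numerical sketch of the construction above (my own example, not from the book): 8 equally likely outcomes split into three groups of sizes 1, 2, and 5, with K = 1 and base-2 logarithms. The information of the whole experiment equals the group-level average information plus the weighted within-group information, exactly as the combined equation requires.

```python
import math

def H(probs):
    """Average information with K = 1 and base-2 logarithms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

group_sizes = [1, 2, 5]
total = sum(group_sizes)                      # 8 equally likely outcomes in all
p = [n / total for n in group_sizes]          # group probabilities p_i = n_i / sum(n_j)

whole_experiment = math.log2(total)           # K * log(sum of n_j) with K = 1
decomposition = H(p) + sum(p_i * math.log2(n_i) for p_i, n_i in zip(p, group_sizes))

print(whole_experiment, decomposition)                   # both are 3.0
print(abs(whole_experiment - decomposition) < 1e-12)     # True: the two expressions for H agree
```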
Conclusion
This has been quite a long and difficult read. However, we emerge with a deep understanding of the average information of an experiment, which is one of the fundamental ideas of information theory. I hope you enjoyed this article. I will leave you with some words from the reference textbook’s author, Khalid Sayood.
Note that this formula is a natural outcome of the requirements we imposed in the beginning. It was not artificially forced in any way. Therein lies the beauty of information theory. Like the laws of physics, its laws are intrinsic in the nature of things. Mathematics is simply a tool to express these relationships. — Khalid Sayood