The Derivation of Average Information

How Claude Shannon discovered the most important idea in information theory

Photo by Crissy Jarvis on Unsplash

“Information is the resolution of uncertainty” — Claude Shannon

This article is inspired by Introduction to Data Compression 5th edition. Chapter 2 — Sayood, Khalid


The Introduction to Data Compression, 5th edition, written by Khalid Sayood, is one of the seminal textbooks for the field of Data Compression. It provides a solid foundation of information theory and other preliminary disciplines required to understand the field of Data Compression. It also includes multiple sections of optional knowledge that delve into great detail on a single idea which is not necessarily required for continuing with the book. One such optional section is the derivation of average information, which was invented by Claude Shannon in the 1940s. Likely the first thing one learns when studying information theory is the idea of average information from a set of events. (I may write about that later, but for now, I am assuming the reader is familiar with the idea of average information.) However, it is a quite mathematically involved process to understand how Shannon invented this concept. Sayood’s textbook runs through this explanation, however, I believe that it can be rewritten in a more digestible way. This article is my attempt at achieving this.

Properties to Include in the Measure of Average Information

For any set of n independent events A₁, A₂, …, A, where each of these events has a probability of p = P(A), the measure of average information for all of the events, H, should have the following properties. Note that H is assumed to be unknown throughout this article and it is slowly discovered through its properties. So when H is referred to, it is referred to abstractly until the end:

Property 1

H should be a continuous function of all the probabilities p

Property 2

If all events are equally likely, then H should be an increasing function of n. The more outcomes that are available, the more information the occurrence of each of these outcomes contains.
So when n increases, H should increase.
This means that H can also be represented by:

(The notation, H, is chosen to represent H in terms of just n when the probabilities of all the outcomes are equal, which is only true if and only if p = 1/n for all i)

Property 3

The average information for a set of events should be the same regardless of how and if the events are grouped.
For example, consider three events A₁, A₂, and A₃
The average information of A₁, A₂, A₃ separately:

If we assume that A₂ and A₃ are grouped together and A₁ is in its own group, the average information is calculated slightly differently. First, each group is indicated by the group probability, then each outcome in that group is considered by its conditional probability given that it is in the group:

The first term in the addition represents the information from both of the groups combined. The second term represents the information from the first group independently. The third term represents the information from the second group independently.

In summary, the third property requires that regardless of how the events are grouped, the average information remains the same.

Satisfying the Properties

In Shannon’s paper, he proved that the only equation to satisfy the above three properties is:

Where K is a positive arbitrary constant that changes based on the data and problem.
The base of the log is irrelevant. It only affects the value of K. Usually base 2 is used. His derivation will be explained below.

Shannon’s Derivation

For the Case of Equally Likely Events

Let there be n=kᵐ independent, equally likely outcomes for an experiment (where m is the number of choices and k is the number of possibilities for each choice). The outcome of an experiment like this can be represented as a tree.
For example, when there are m=3 choices and k=2 possibilities for each choice, the tree below shows the possible outcomes.

Introduction to data compression, 5th edition — Khalid Sayood

From the second property above, the average information of this experiment is given by:

For any given final outcome, the experiment made three selections between two choices. Therefore, the information of any final outcome is given by:

Putting this together,

This can be generalized by

There exists integers j, k, l, and m such that the following inequality holds. To make future algebraic steps easier, l is arbitrarily large:

Taking the log of all terms:

Dropping the exponents in the logs:

Dividing by l log(k):

Since l is very large, 1/l is a very small, positive number. It will be called epsilon.

This means that both the upper and lower bounds of log(j) / log(k) can be made very close to m/l as long as l is large enough. Formally:

Referred to later as the “log inequality”

Where epsilon is a very small number.
This result will be used later.

Finding an expression for H(n)

The second property above states that Hₙ monotonically increases with n. From our previous argument, assume the following again:

If the previous statement is true, then to satisfy the second property, the following statement must hold:

Dropping the exponents:

Dividing by l*H(k):

Via the same arguments as before, 1/l is a very small number which will be called epsilon.

Using the same arguments as before, this means that both the upper and lower bounds of log(j) / log(k) can be made very close to m/l as long as l is large enough. Formally:

Referred to later as the “Hₙ(n) inequality”

The variable, l, can be controlled so that epsilon from both this inequality and the inequality of the logarithms are the same value. If this is done, then the H and log inequalities are both a distance of epsilon away from m/l. Therefore:

Essentially, both values in the absolute value function must be very close to each other because 2*epsilon is a very small number. The only way for this inequality to hold is if both:


For the Case of Unequally Likely Events

Shannon’s derivation of average information for equally likely events has been explained. Using this information, his derivation of average information for unequally likely events will be explained next.

Consider an experiment with n unequal groups. Each group is of size nᵢ. The experiment then has

total outcomes. Each individual outcome is equally likely. The probability of each of the groups occurring is then given by (the assumed rational number):

Since all of the events are equally likely, the information from each group is given by:

If we specify the outcome by the group it belongs in, then by which member of the group it is, we can find the average information of the events by:

(Note that the arbitrary constants Kadd together for the final arbitrary K)

Combining our last two separate equations for H gives:

This can be algebraically solved for and simplified with a few steps. (One important fact that helps this simplification is the fact that the probabilities of all p sum to 1):

By convention, set K=1

This finally brings us to Shannon’s equation for average information.


This has been quite a long and difficult read. However, we emerge with a deep understanding of the average information of an experiment, which is one of the fundamental ideas of information theory. I hope you enjoyed this article. I will leave you with some words from the reference textbook’s author, Khalid Sayood.

Note that this formula is a natural outcome of the requirements we imposed in the beginning. It was not artificially forced in any way. Therein lies the beauty of information theory. Like the laws of physics, its laws are intrinsic in the nature of things. Mathematics is simply a tool to express these relationships. — Khalid Sayood

 by the author.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store