## Probability, information and Shannon entropy

I would like to thank John Baez one more time for his heroic efforts in educating the public. I learn a lot from his website. The tutorial material below is from his July 2022 diary. I added commentary.

### Quantitative measure of information

Information we get when we learn that an event of probability $p$ has happened: $-\log_2(p)$.

Let me explain this in more familiar terms. Compare 1 bit of computer memory to 1 byte of computer memory. Which one can store more information? Obviously, 1 byte can store more than 1 bit. The word “probability” confuses us. How many different ways can you write into 1 byte (8 bits) of memory? The answer is $2^8 = 256$ different ways, each one representing an event of probability $1/2^8$. Representing each bit as 0 or 1, there are 256 different ways we can arrange 8 digits. For example, (1,0,0,0,0,0,0,0), (0,1,0,0,0,0,0,0), (1,1,1,1,1,1,1,1), ….
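As a quick numerical check of the formula above, here is a minimal sketch (the function name `information_bits` is my own):

```python
import math

# Self-information (surprisal) of learning that an event of
# probability p has happened, measured in bits: -log2(p).
def information_bits(p):
    return -math.log2(p)

# Each of the 256 equally likely byte states has probability 1/256,
# so observing one particular state conveys 8 bits of information.
print(information_bits(1 / 256))  # 8.0
```

As expected, learning which of the 256 states the byte is in is exactly 8 bits of information.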

The actual stored information is one of those states, say (0,1,0,0,0,0,0,0). Other states are possibilities. Then it is natural to ask “what is the expected (average) amount of information contained in a given state of the system?” That’s quantified as Shannon entropy.

### Shannon entropy

For a system whose states occur with probabilities $p_i$, the Shannon entropy is $H = -\sum_i p_i \log_2(p_i)$. When we apply this formula to 1 byte of computer memory, assuming each of those 256 arrangements (states) has equal probability ($1/256$), we get $H = 8$ bits of information. This is a trivial example; in general, the probability of each state can be different. Don’t forget the logical rule: the sum of all $p_i$ is equal to 1.
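The entropy formula $H = -\sum_i p_i \log_2(p_i)$ can be sketched in a few lines of code; the function name `shannon_entropy` and the example distributions are my own:

```python
import math

# Shannon entropy H = -sum_i p_i * log2(p_i): the expected (average)
# number of bits of information per observed state.
def shannon_entropy(probs):
    assert abs(sum(probs) - 1.0) < 1e-9  # probabilities must sum to 1
    return -sum(p * math.log2(p) for p in probs if p > 0)

# 256 equally likely states (1 byte of memory): H = 8 bits.
uniform_byte = [1 / 256] * 256
print(shannon_entropy(uniform_byte))  # 8.0

# Unequal probabilities give lower entropy than the uniform case.
skewed = [0.5, 0.25, 0.25]
print(shannon_entropy(skewed))  # 1.5
```

The second example shows the general (non-uniform) case: three states with probabilities 0.5, 0.25, 0.25 carry 1.5 bits on average, less than the $\log_2 3 \approx 1.58$ bits of the uniform three-state case.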
If a highly likely event occurs, we learn very little. If an event with probability $p = 1$ occurs (an event we knew was going to happen), we learn zero new information, since $\log_2(1) = 0$. On the other hand, if a highly unlikely event occurs, the message is much more informative.
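This inverse relationship between probability and information can be seen directly by evaluating $-\log_2(p)$ for a few values (the probabilities chosen here are just illustrative):

```python
import math

# The less likely the event, the more bits of information its
# occurrence carries: -log2(p) grows as p shrinks.
for p in (1.0, 0.5, 0.01):
    print(p, -math.log2(p))
# p = 1.0  -> 0 bits (no surprise, no new information)
# p = 0.5  -> 1 bit
# p = 0.01 -> about 6.64 bits
```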