# Entropy (in Information Theory)

- The **expected number of bits** needed to encode (or distinguish between) **probabilistic events** when using an optimal coding scheme. ($\log_2 N$ bits for encoding $N$ equally likely events.)
- Entropy can also be interpreted as the **average rate** at which **information is produced** by a stochastic source of data. (A rare event carries more information than a frequently occurring one.)
- Entropy can be calculated as $H(p) = -\sum_i p_i \log_2 p_i$, where $p$ is the probability distribution. (Shannon’s source coding theorem)
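As a minimal sketch of the entropy formula, the following Python snippet computes $H(p)$ in bits for a discrete distribution given as a list of probabilities. The function name `entropy` and the two sample distributions are illustrative choices, not part of the original text.

```python
import math

def entropy(p):
    """Shannon entropy, in bits, of a discrete distribution p."""
    # Outcomes with zero probability are skipped (0 * log 0 is taken as 0).
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

uniform = [1/4, 1/4, 1/4, 1/4]   # four equally likely outcomes
skewed = [1/2, 1/4, 1/8, 1/8]    # one outcome dominates

print(entropy(uniform))  # 2.0
print(entropy(skewed))   # 1.75
```

The skewed distribution has lower entropy because its outcomes are, on average, easier to predict.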

Suppose you need to encode the characters “**A**, **B**, **C**, **D**”, which appear stochastically in a sentence, as bits. You can simply encode each character with 2 bits: for example, “00” for **A**, “01” for **B**, “10” for **C**, and “11” for **D**. If every character has the same probability ($p = (1/4, 1/4, 1/4, 1/4)$), this encoding is optimal. You use **2 bits for each character on average** ($2 \times 1/4 \times 4 = 2$).

But what if $p = (1/2, 1/4, 1/8, 1/8)$? With the same scheme, you still use **2 bits** for each character on average ($2 \times 1/2 + 2 \times 1/4 + 2 \times 1/8 \times 2 = 2 \times (1/2 + 1/4 + 1/8 + 1/8) = 2$). However, this is not the optimal scheme for encoding the 4 characters. Because **A** is used more frequently than the others, encoding **A** with fewer bits lowers the average. With “1” for **A**, “01” for **B**, “000” for **C**, and “001” for **D**, **1.75 bits are used on average** ($1 \times 1/2 + 2 \times 1/4 + 3 \times 1/8 \times 2 = 1.75$). In that case, you can decode a bit stream with the following rules:

- Add the current bit to the group of bits. If the bit is 1, or the group now has 3 bits, the group decodes to one character; start a new group.
- Otherwise (the bit is 0 and the group is shorter than 3 bits), move on to the next bit.
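The variable-length code and its decoding rules can be sketched in Python as follows. The names `CODE`, `encode`, `decode`, and `average_length` are hypothetical helpers introduced for illustration; the code table and the probabilities are the ones from the example above.

```python
# Code table from the example: frequent characters get shorter codewords.
CODE = {"A": "1", "B": "01", "C": "000", "D": "001"}
DECODE = {bits: ch for ch, bits in CODE.items()}
PROB = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}

def encode(text):
    """Concatenate the codeword of each character."""
    return "".join(CODE[ch] for ch in text)

def decode(bits):
    """Decode a bit string using the two rules above."""
    out, group = [], ""
    for b in bits:
        group += b
        # A codeword ends when the current bit is 1 or the group has 3 bits.
        if b == "1" or len(group) == 3:
            out.append(DECODE[group])
            group = ""
    if group:
        raise ValueError("trailing bits do not form a complete codeword")
    return "".join(out)

# Expected codeword length under the given probabilities.
average_length = sum(PROB[ch] * len(CODE[ch]) for ch in CODE)

print(encode("ABACD"))       # 1011000001
print(decode("1011000001"))  # ABACD
print(average_length)        # 1.75
```

Decoding works without separators because no codeword is a prefix of another, so the end of each codeword can be recognized as soon as it is read.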

# Cross-Entropy and KL-Divergence

The **cross-entropy** of the distribution $q$ relative to a distribution $p$ over a given set is defined as follows:

$$H(p, q) = -\sum_i p_i \log_2 q_i$$

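You can think of **cross-entropy** as the average number of bits used when a coding scheme that is optimal for the probability distribution $q$ ($l_i = -\log_2 q_i$) is applied to data that follows the probability distribution $p$, where $l_i$ is the code length in bits for the $i$-th event.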
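A short Python sketch of this interpretation, assuming both distributions are given as lists of probabilities over the same events (the function name `cross_entropy` and the sample values are illustrative):

```python
import math

def cross_entropy(p, q):
    """Average bits per event when data follows p but the code is optimal for q."""
    # -log2(q_i) is the code length assigned to event i by a code optimal for q.
    # Note: math.log2 raises ValueError if q_i is 0 while p_i > 0.
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [1/2, 1/4, 1/8, 1/8]   # true distribution of the data
q = [1/4, 1/4, 1/4, 1/4]   # distribution the code was designed for

print(cross_entropy(p, q))  # 2.0
print(cross_entropy(p, p))  # 1.75
```

Using the 2-bit code (optimal for the uniform $q$) on data distributed as $p$ costs 2 bits per event, while a code matched to $p$ costs only 1.75 bits, which is the entropy of $p$.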

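**Kullback–Leibler divergence (KL-Divergence)** can be thought of as a measure of how far the distribution $q$ is from the distribution $p$:

$$D_{KL}(p \| q) = \sum_i p_i \log_2 \frac{p_i}{q_i} = -\sum_i p_i \log_2 q_i + \sum_i p_i \log_2 p_i = H(p, q) - H(p)$$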

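In deep learning, $p$ is the distribution given by the dataset and $q$ is the neural network output. Making the cross-entropy loss smaller also makes the KL-divergence of $p$ and $q$ ($D_{KL}(p \| q) = H(p, q) - H(p)$) smaller, because $H(p)$ is a fixed value.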
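The relationship can be checked numerically with a small sketch. Here `p` stands in for a fixed dataset distribution and each entry of `candidates` for a possible model output; all names and values are illustrative assumptions, not taken from a real model.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1/2, 1/4, 1/8, 1/8]  # fixed "dataset" distribution (illustrative)
candidates = [
    [1/4, 1/4, 1/4, 1/4],  # far from p
    [0.4, 0.3, 0.2, 0.1],  # closer to p
    [1/2, 1/4, 1/8, 1/8],  # identical to p
]

for q in candidates:
    ce, kl = cross_entropy(p, q), kl_divergence(p, q)
    # The gap between the two is H(p), which does not depend on q.
    print(f"H(p,q)={ce:.4f}  KL={kl:.4f}  H(p,q)-KL={ce - kl:.4f}")
```

Since the last column stays constant across candidates, any change to $q$ that lowers the cross-entropy lowers the KL-divergence by the same amount, and both reach their minimum when $q = p$.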