# Entropy (at Information theory)

- The
**expectation of bits**that used for notating (or classify each other)**probabilistic events**when using optimal bits coding scheme. (\(log_2(\frac{1}{p})\) bits for notating events) - Entropy also can be interpreted as the
**average rate**at which**information is produced**by stochastic source of data. (rare events have more information than an often occurring event.) - Entropy can be calculated by \(\text{H}(X) = \text{E}[-\text{log}_2 (\text{P}(X)] = \sum\limits_{p \in P} p \text{log}_2(\frac{1}{p}) = -\sum\limits_{p \in P} p \text{log}_2({p})\) where \(P\) is probability distribution. (Shannon’s source coding theorem)

Let’s think about the situation that you need to notate characters “**A**, **B**, **C**, **D**” in bits that stochastically written in a sentence. You can simply notate each character with 2 bits. For example, “00” for **A**, “01” for **B**, “10” for **C**, “11” for D. If every character have the same probability, (\(\text{P}(A) = \text{P}(B) = \text{P}(C) = \text{P}(D) = 1/4\)) this notating is optimal notating. You used **2 bits for each character on average.** (2 * 1/4 * 4)

But how about \(\text{P}(A) = 1/2, \text{P}(B) = 1/4, \text{P}(C) = \text{P}(D) = 1/8\) ? If you use same scheme, you will use **2 bits** for each character on average. (2 * 1/2 + 2 * 1/4 + 2 * 1/8 * 2 = 2 * (1/2 + 1/4 + 1/8 + 1/8)) However, this is not optimal scheme for notating 4 character. As character **A** is frequently used than others, if you notate **A** with less bits, you can use less bits on average. So when notating “1” for **A**, “01” for **B**, “000” for **C**, “001” for **D**, **1.75 bits are used on average** (1 * 1/2 + 2 * 1/4 + 3 * 1/8 * 2 = 1.75). In that case, you can decode bits by following rules:

- If looking bit is 1 or length of group of bits is 3, finish one character decoding.
- If looking bit is 0, add looking bit (0) to group of bits and looking next bit.

# Cross-Entropy and KL-Divergence

The **cross-entropy** of the distribution \(q\) relative to distribution \(p\) over a given set is defined as follows:

You can think **cross-entropy** as applying coding scheme which is optimal to probability distribution \(q\) (\(l_i = - \text{log}_2(q(x_i)) \Leftrightarrow q(x_i) = (\frac{1}{2})^{l_i}\)) to probability distribution \(p\) where \(l_i\) is length of bits to coding i-th.

**Kullback–Leibler divergence (KL-Divergence)** can be thought of as something like a measurement of how far the distribution \(q\) is from the distribution \(p\).

In deep learning, \(p\) is dataset and \(q\) is neural network output. Making cross-entropy loss smaller is making KL-Divergence of \(p\) and \(q\) ( \(D_{KL}(p \Vert q))\) ) smaller because \(\text{H}(p)\) is fixed value.