I developed BAC (Block Arithmetic Coding) back in the 1980s and 1990s. For several reasons, I've become interested in this problem again.
This notebook discusses BAC and presents code to calculate its coding efficiency. BAC was presented in Ref [1] below.
Note: see companion notebook for encoder and decoder code.
-- Charles Boncelet, boncelet@udel.edu, 3/29/17
#python 3.6
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
#utility function
def entropy(p):
"""binary entropy function with parameter p"""
return -p*np.log2(p)-(1-p)*np.log2(1-p)
BAC (Block Arithmetic Coding) is a variable-to-fixed (VtoF) code. Here we describe the coder for IID binary inputs, but BAC works perfectly well for non-binary, non-stationary, and/or Markov inputs.
For each input bit $x$, BAC recursively divides the set of available output codewords into two non-empty groups, one for $x=1$ and one for $x=0$. The size (number of codewords) in each group is proportional to the probability of that group, i.e., $k$ codewords are divided into $k_0$ and $k_1$ with $k_0+k_1=k$ and $k_1 \approx pk$ and $k_0 \approx (1-p)k$. The recursion continues until the group contains only one codeword. That codeword represents that input phrase.
The expected length of a BAC input phrase is
$$ E(L(k)) = 1 + p E(L([pk])) + (1-p)E(L(k-[pk]))$$where $k_1=[pk]$ is a rounding of $pk$ with the restriction that $1\le k_1 \le k-1$. This also guarantees $1 \le k_0 = k-[pk] \le k-1$.
Let $p=\Pr(x=1)=0.3$ and let $k=16$ (i.e., 4 bit codewords). Index the output codewords from 0000 to 1111, or, more simply, from 0 to 15.
Generate some random bits with probability $p$:
#uncomment to generate bits
#p = 0.3
#bits = 1*(np.random.rand(10)<p)
#bits
The sequence I got was [1, 0, 0, 0, 0, 0, 0, 0, 0, 0].
The first bit is a 1. Let $k_1 = [0.3\times 16] = [4.8] = 5$. Thus, $x=1$ corresponds to the last five codewords, $\{11, 12, 13, 14, 15\}$.
The second bit is a 0. Let $k_0 = [0.7\times 5] = [3.5] = 3$. (Both $k_0=3$ and $k_0=4$ are the same distance from $k_0=3.5$; we arbitrarily choose $k_0=3$.) Thus, the set of codewords is now $\{11, 12, 13\}$.
The third bit is a 0. Let $k_0 = [0.7\times 3] = [2.1] = 2$. The set of codewords is $\{11, 12\}$.
The fourth bit is a 0. Since the set only has two codewords, the division is trivial: codeword 11 is chosen (if the bit were a 1, codeword 12 would be chosen).
Thus the input sequence 1000 is encoded by codeword 11 = 1011.
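The steps above can be sketched in code. This is a minimal illustration of the splitting rule, not the companion notebook's encoder: it uses Python's `round` for $[pk]$ (which happens to match the choices above) and glosses over the padding of a final partial phrase.

```python
# Minimal sketch of BAC encoding for IID bits (illustrative only).
# The current block of codewords is [lo, lo + size); ones take the
# top k1 codewords, zeros take the bottom k0.
def bac_encode(bits, p, k):
    lo, size = 0, k
    out = []
    for b in bits:
        if size == 1:              # block reduced to one codeword: emit it
            out.append(lo)
            lo, size = 0, k        # restart with the full codebook
        k1 = max(1, min(size - 1, round(p * size)))   # k1 = [p*size], clamped
        k0 = size - k1
        if b == 1:
            lo, size = lo + k0, k1   # last k1 codewords
        else:
            size = k0                # first k0 codewords
    out.append(lo)                   # flush (padding of a partial phrase ignored)
    return out

bac_encode([1, 0, 0, 0], 0.3, 16)    # -> [11], as in the worked example
```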
The full codebook is below:
Index | Input Phrase | Output Code |
---|---|---|
0 | 0000000 | 0000 |
1 | 0000001 | 0001 |
2 | 000001 | 0010 |
3 | 00001 | 0011 |
4 | 00010 | 0100 |
5 | 00011 | 0101 |
6 | 0010 | 0110 |
7 | 0011 | 0111 |
8 | 0100 | 1000 |
9 | 0101 | 1001 |
10 | 011 | 1010 |
11 | 1000 | 1011 |
12 | 1001 | 1100 |
13 | 101 | 1101 |
14 | 110 | 1110 |
15 | 111 | 1111 |
Note: usually we don't generate the full codebook. We generate the parsing for the specific input on the fly.
The expected input length can be calculated by the standard method:
$$ E(L) = 7(1-p)^7+7(1-p)^6p + 6(1-p)^5p + \cdots + 3p^3 = 4.41 \text{ bits per input phrase}$$Since each codeword uses 4 bits, the coding efficiency is $4.41\times h(p)/4 = 0.972$, about 3% worse than entropy. The BAC code is slightly worse than the entropy (as it must be).
p = 0.3
lengths = np.array([7,7,6,5,5,5,4,4,4,4,3,4,4,3,3,3])
ones = np.array([0,1,1,1,1,2,1,2,1,2,2,1,2,2,2,3])
pvector = p**ones * (1-p)**(lengths-ones)
EL = pvector @ lengths
Eff = EL*entropy(p)/4
#Check: the p's must sum to 1
print('Sum(p) = ', np.sum(pvector))
print('EL = ', EL, 'bits per input phrase')
print('Coding Efficiency = %.3f' % Eff, 'and entropy = %.3f' % entropy(0.3))
The standard algorithm requires us to actually compute the codebook. For large codebooks, this is expensive (slow). Fortunately, the recursive definition of the BAC code yields a recursive solution to the expected input length:
$$ EL(k) = 1 + pEL(k_1) + (1-p)EL(k_0)$$with the boundary conditions $EL(0)=EL(1)=0$.
E.g. $EL(16) = 1+0.7 EL(11)+0.3EL(5)$, and $EL(11) = 1+0.7EL(8)+0.3EL(3)$, etc.
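This recursion is easy to check directly with a small memoized helper (a standalone sketch using the same clamped rounding $k_1 = [pk]$; the `BAC` class below does the same more carefully):

```python
from functools import lru_cache

p = 0.3

@lru_cache(maxsize=None)
def EL(k):
    """expected input phrase length; boundary conditions EL(0) = EL(1) = 0"""
    if k <= 1:
        return 0.0
    k1 = max(1, min(k - 1, round(p * k)))   # k1 = [pk] with 1 <= k1 <= k-1
    return 1 + p * EL(k1) + (1 - p) * EL(k - k1)

EL(16)   # -> 4.4127 (approximately), agreeing with the table calculation
```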
The code below also gives 4.4127.
The BAC class below works with binary inputs and calculates the expected input phrase length of a BAC encoding with $k$ codewords, i.e., each input phrase is encoded with $\log_2(k)$ bits.
#reset if recursion problems arise, default = 1000
#import sys
#sys.setrecursionlimit(2000)
class BAC:
    def __init__(self, p, loopfirst=0):
        self.p = p
        self.els = {0: 0, 1: 0}
        #initialize dictionary with loop for small k
        if loopfirst > 0:
            self.calc_loop(loopfirst)

    def calc_loop(self, kmax):
        """use loop to calculate expected lengths, fast for small k"""
        for k in range(2, kmax+1):
            k1 = round(self.p*k)
            if k1 == 0:
                k1 = 1
            if k1 == k:
                k1 = k-1
            k0 = k-k1
            self.els[k] = 1 + self.p*self.els[k1] + (1-self.p)*self.els[k0]

    def calc(self, k):
        """calculate expected length for k codewords"""
        if k not in self.els:
            k1 = round(self.p*k)
            if k1 == 0:
                k1 = 1
            if k1 == k:
                k1 = k-1
            k0 = k-k1
            #call smaller value first, minimizes recursion depth
            if k0 < k1:
                e0 = self.calc(k0)
                e1 = self.calc(k1)
            else:
                e1 = self.calc(k1)
                e0 = self.calc(k0)
            self.els[k] = 1 + self.p*e1 + (1-self.p)*e0
        return self.els[k]
First, compute the example above with $p=0.3$ and $k=16$.
code = BAC(0.3)
code.calc(16)
code = BAC(p=0.95, loopfirst=2**16)
code.calc(k=64)
code.calc(2**32), code.calc(2**48), code.calc(2**64)
Notice the answer above: we did an exact calculation of the expected input phrase length for a codebook with $2^{64} = 18446744073709551616 = 1.8\times 10^{19}$ codewords. We couldn't do that calculation if we had to compute the codebook first.
2**64, 64*np.log10(2)
Calculate the efficiency of the 64 bit code. It's over 99%.
code.calc(2**64)*entropy(p=0.95)/64
An example is given in the companion notebook on the BAC encoder and decoder.
code = BAC(p=0.95, loopfirst=2**16)
code.calc(2**48)
logks = np.arange(1,17).astype('int')
codelengths = [code.calc(2**logk) for logk in logks]
plt.figure(figsize=(9,6))
plt.plot(logks,codelengths)
plt.plot(logks,logks/entropy(code.p))
plt.ylabel('Expected Input Phrase Length',fontsize=14)
plt.xlabel('Code Size in Bits',fontsize=14)
plt.annotate('$p=0.95$', xy=(12,10),fontsize=14)
_ = plt.title('BAC vs Entropy',fontsize=18)
The most famous VtoF code is the Tunstall code. The Tunstall code builds a parsing tree by continually breaking the highest probability leaf into two leaves ($m$ leaves for an m-ary alphabet). The analogy isn't perfect, but BAC is more top-down while Tunstall is bottom-up. BAC is similar to Shannon coding; Tunstall is similar to Huffman coding.
As suggested by the plot above, $\log_2(k)/h(p) - EL(k) \le C$ for some constant $C$ and for all $k$. I proved this result in Ref [1] below. While the Tunstall code is "optimal", Tunstall originally proved only the weaker result that $$\frac{h(p)EL(k)}{\log_2(k)} \rightarrow 1.$$ (Showing the ratio approaches 1 does not imply the difference between $EL(k)$ and $\log_2(k)/h(p)$ is bounded.)
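The bounded-difference behavior can be illustrated numerically (a sketch; the bound checked below is empirical, not the constant from the paper):

```python
# Check numerically that log2(k)/h(p) - EL(k) stays small as k grows.
from functools import lru_cache
import math

p = 0.3
hp = -p * math.log2(p) - (1 - p) * math.log2(1 - p)   # binary entropy h(p)

@lru_cache(maxsize=None)
def EL(k):
    """expected input phrase length via the BAC recursion"""
    if k <= 1:
        return 0.0
    k1 = max(1, min(k - 1, round(p * k)))
    return 1 + p * EL(k1) + (1 - p) * EL(k - k1)

# gap between the entropy bound log2(2**b)/h(p) = b/h(p) and EL(2**b)
gaps = [b / hp - EL(2**b) for b in (4, 8, 12, 16)]
```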
BAC has many advantages over Tunstall coding:

- BAC accommodates much larger alphabets. The encoding is top-down, and the codebook never needs to be stored.
- BAC gracefully allows non-identically distributed bits.
- BAC allows dependency between bits as long as $p(x_n) = f(x_1, x_2, \ldots, x_{n-1})$. E.g., the input sequence can be Markov or hidden Markov.
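As a sketch of the last point: the splitting step only needs $\Pr(x_n=1)$ at each step, so a history-dependent probability model drops straight in. Here `p_fn` is a hypothetical callback supplying that probability; with a constant model this reduces to the IID case.

```python
# Sketch: BAC encoding with a history-dependent bit probability (illustrative).
def bac_encode_adaptive(bits, p_fn, k):
    lo, size = 0, k
    hist, out = [], []
    for b in bits:
        if size == 1:                 # emit codeword, restart block
            out.append(lo)
            lo, size = 0, k
        p = p_fn(hist)                # Pr(x_n = 1 | x_1, ..., x_{n-1})
        k1 = max(1, min(size - 1, round(p * size)))
        k0 = size - k1
        if b == 1:
            lo, size = lo + k0, k1    # ones take the top k1 codewords
        else:
            size = k0                 # zeros take the bottom k0 codewords
        hist.append(b)
    out.append(lo)                    # flush (padding ignored in this sketch)
    return out

# A constant model reproduces the IID example: 1000 -> codeword 11
bac_encode_adaptive([1, 0, 0, 0], lambda h: 0.3, 16)
```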
[1] Boncelet, Charles G. "Block arithmetic coding for source compression." IEEE Transactions on Information Theory 39.5 (1993): 1546-1554.
[2] Boncelet, Charles G. "Block arithmetic coding for Markov sources." Proceedings of the 1993 IEEE International Symposium on Information Theory, 1993, p. 118.