is not Spanish (the input file was Cervantes's Don Quijote, also with m=3), and
Seule sontagne trait homarcher de la t au onze
le quance matices Maississait passepart penaientla ples les au cherche de je Chamain peut accide
bien avaien rie se vent puis il nez pande
is not French (the source was Le Tour du Monde en Quatre Vingts Jours, a translation of Jules Verne's Around the World in Eighty Days).
The input file to the program textsim.m is a Matlab .mat file that has been preprocessed to remove excessive line breaks, spaces, and capitalization using textman.m, which is why there is no punctuation in these examples. A large assortment of text files is available for downloading at the website of Project Gutenberg (at http://promo.net/pg/).
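The listing of textman.m is not reproduced in this excerpt, but a minimal sketch of this kind of cleanup might look like the following; the file name gutenberg.txt, the variable name txt, and the output file textdata.mat are assumptions made for illustration, not names taken from the book's files.

% sketch of textman.m-style preprocessing (details assumed):
% read a plain text file, fold case, and strip punctuation and extra whitespace
fid = fopen('gutenberg.txt','r');      % hypothetical downloaded text file
raw = fread(fid,inf,'uchar')';         % read every character as a number
fclose(fid);
txt = lower(char(raw));                % remove capitalization
txt(txt<'a' | txt>'z') = ' ';          % replace punctuation, digits, and line breaks with spaces
txt = regexprep(txt,' +',' ');         % collapse runs of spaces into one
save textdata.mat txt                  % store in the .mat format that textsim.m reads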
Text, in a variety of languages, retains some of the character of its language with correlations of 3 to 5 letters (21–35 bits, when coded in ASCII). Thus, messages written in those languages are not independent, except possibly at lengths greater than this. A result from probability theory suggests that if the letters are clustered into blocks that are longer than the correlation, then the blocks may be (nearly) independent. This is one strategy to pursue when designing codes that seek to optimize performance. "Source Coding" will explore some practical ways to attack this problem, but the next two sections establish a measure of performance such that it is possible to know how close to the optimal any given code lies.
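As a rough illustration of the blocking idea (a sketch only; the block length and the sample string below are arbitrary choices, not taken from the book's files), letters can be grouped into non-overlapping blocks and each block treated as a single symbol:

% group a letter stream into non-overlapping blocks of length m;
% if m exceeds the correlation length, the blocks are roughly independent
m = 5;                                  % block length, chosen larger than the 3-5 letter correlation
txt = 'the quick brown fox jumps over the lazy dog';  % arbitrary sample stream
nb = floor(length(txt)/m);              % number of complete blocks
blocks = reshape(txt(1:nb*m),m,nb)';    % each row of blocks is one block-symbol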
This section extends the concept of information from a single symbol to a sequence of symbols. As defined by Shannon (actually, Hartley was the first to use this as a measure of information, in his 1928 paper "Transmission of Information" in the Bell System Technical Journal), the information in a symbol is inversely proportional to its probability of occurring. Since messages are composed of sequences of symbols, it is important to be able to talk concretely about the average flow of information. This is called the entropy and is formally defined as
$$H(x) = \sum_{i=1}^{N} p(x_i)\, I(x_i) = -\sum_{i=1}^{N} p(x_i) \log_2 p(x_i),$$
where the $N$ symbols $x_i$ are drawn from an alphabet $\{x_1, x_2, \ldots, x_N\}$, each with probability $p(x_i)$. $H(x)$ sums the information $I(x_i)$ in each symbol, weighted by the probability of that symbol. Those familiar with probability and random variables will recognize this as an expectation. Entropy (warning: though the word is the same, this is not the same as the notion of entropy that is familiar from physics, since the units here are bits per symbol while the units in physics are energy per degree Kelvin) is measured in bits per symbol, and so gives a measure of the average amount of information transmitted by the symbols of the source. Sources with different symbol sets and different probabilities have different entropies. When the probabilities are known, the definition is easy to apply.
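When the probabilities are known numerically, the definition translates directly into a few lines of code. The probability vector below is an arbitrary example chosen for illustration (it is not a source from the text):

% entropy of a discrete source from its symbol probabilities
p = [0.5 0.25 0.125 0.125];   % example probabilities p(x_i); must be positive and sum to one
I = log2(1./p);               % information in each symbol, in bits
H = sum(p.*I)                 % entropy: average bits per symbol (here 1.75)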
Reconsider the fair die of [link]. What is its entropy?
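For reference (the worked answer is not reproduced in this excerpt): a fair die has six equally likely faces, so each $p(x_i) = 1/6$ and
$$H(x) = -\sum_{i=1}^{6} \frac{1}{6}\log_2\frac{1}{6} = \log_2 6 \approx 2.585 \ \text{bits per symbol}.$$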
Suppose that the message is received from a source characterized by