This document only contains the project problems. For the programming exercises on concepts needed for the project,

please refer to the project checklist.

The purpose of this project is to use a Markov chain to create a statistical model of a piece of English text and use the model

to generate stylized pseudo-random text and decode noisy messages.

Perspective In the 1948 landmark paper A Mathematical Theory of Communication, Claude Shannon founded the field of

information theory and revolutionized the telecommunications industry, laying the groundwork for today’s Information Age.

In this paper, Shannon proposed using a Markov chain to create a statistical model of the sequences of letters in a piece of

English text. Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data

compression, and spam filtering. They also have many scientific computing applications, including the GeneMark algorithm

for gene prediction, the Metropolis algorithm for measuring thermodynamic properties, and Google’s PageRank algorithm

for Web search. In this assignment, we consider two variants: generating stylized pseudo-random text and decoding noisy

messages.

Markov Model of Natural Language Shannon approximated the statistical structure of a piece of text using a simple

mathematical model known as a Markov model. A Markov model of order 0 predicts that each letter in the alphabet occurs

with a fixed probability. We can fit a Markov model of order 0 to a specific piece of text by counting the number of occurrences

of each letter in that text, and using these frequencies as probabilities. For example, if the input text is ’gagggagaggcgagaaa’,

the Markov model of order 0 predicts that each letter is ’a’ with probability 7/17, ’c’ with probability 1/17, and ’g’ with

probability 9/17, because these are the fractions of times each letter occurs. The following sequence of characters is a typical

example generated from this model:

gaggcgagaagagaagaaagagagagaaagagaag ...
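As a concrete sketch (not part of the required API), fitting an order-0 model to this text and sampling from it might look like the following, using random.choices for the weighted sampling:

```python
import random

text = 'gagggagaggcgagaaa'

# Fit an order-0 model: tally how often each letter occurs
freqs = {}
for c in text:
    freqs[c] = freqs.get(c, 0) + 1
# freqs is {'g': 9, 'a': 7, 'c': 1}, i.e. probabilities 9/17, 7/17, 1/17

# Generate text by sampling each letter independently, weighted by frequency
sample = ''.join(random.choices(list(freqs), weights=list(freqs.values()), k=35))
```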

A Markov model of order 0 assumes that each letter is chosen independently. This independence does not coincide with

statistical properties of English text because there is a high correlation among successive characters in a word or sentence. For

example, ’w’ is more likely to be followed by ’e’ than by ’u’, while ’q’ is more likely to be followed by ’u’ than by

’e’.

We obtain a more refined model by allowing the probability of choosing each successive letter to depend on the preceding

letter or letters. A Markov model of order k predicts that each letter occurs with a fixed probability, but that probability can

depend on the previous k consecutive characters. A k-gram is any string of k characters. For example, if the text

has 100 occurrences of ’th’, with 60 occurrences of ’the’, 25 occurrences of ’thi’, 10 occurrences of ’tha’, and 5 occurrences

of ’tho’, the Markov model of order 2 predicts that the next letter following the 2-gram ’th’ is ’e’ with probability 3/5, ’i’

with probability 1/4, ’a’ with probability 1/10, and ’o’ with probability 1/20.
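To make the arithmetic concrete, here is a small sketch that turns the counts following a k-gram into a probability distribution, using the hypothetical ’th’ counts from the example above:

```python
# Hypothetical counts of the letters that follow the 2-gram 'th'
# (the numbers come from the example above)
th_counts = {'e': 60, 'i': 25, 'a': 10, 'o': 5}

total = sum(th_counts.values())                        # 100 occurrences of 'th'
th_probs = {c: n / total for c, n in th_counts.items()}
# th_probs is {'e': 0.6, 'i': 0.25, 'a': 0.1, 'o': 0.05}, i.e. 3/5, 1/4, 1/10, 1/20
```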

A Brute-Force Solution Claude Shannon proposed a brute-force scheme to generate text according to a Markov model of

order 1:

“To construct [a Markov model of order 1], for example, one opens a book at random and selects a letter at

random on the page. This letter is recorded. The book is then opened to another page and one reads until

this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is

searched for and the succeeding letter recorded, etc. It would be interesting if further approximations could be

constructed, but the labor involved becomes enormous at the next stage.”

Your task in this project is to write a Python program to automate this laborious task in a more efficient way — Shannon’s

brute-force approach is prohibitively slow when the size of the input text is large.

Problem 1. (Markov Model Data Type) Create a data type MarkovModel in markov_model.py to represent a Markov model of

order k from a given text string. The data type must implement the following API:


CS110 Project 4 (Markov Model) Swami Iyer

method                     description
---------------------------------------------------------------------------------------
MarkovModel(text, k)       create a Markov model model of order k from text
model.order()              order k of the Markov model
model.kgram_freq(kgram)    number of occurrences of kgram in text
model.char_freq(kgram, c)  number of times that character c follows kgram
model.rand(kgram)          a random character following the given kgram
model.gen(kgram, T)        a string of length T characters generated by simulating a
                           trajectory through the corresponding Markov chain, the first
                           k characters of which is kgram

• Constructor To implement the data type, define two attributes: an integer _k that stores the order of the Markov

model, and a dictionary¹ _st whose keys are all the k-grams from the given text. The value corresponding to each key

(say kgram) in _st is a dictionary whose keys are the characters that follow kgram in the text, and the corresponding

values are their frequencies. You may assume that the input text is a sequence of characters over the ASCII alphabet

so that all values are between 0 and 127. The frequencies should be tallied as if the text were circular (i.e., as if it

repeated the first k characters at the end). For example, if the text is ’gagggagaggcgagaaa’ and k = 2, then the dictionary

_st should look like the following:

{'aa': {'a': 1, 'g': 1},
 'ag': {'a': 3, 'g': 2},
 'cg': {'a': 1},
 'ga': {'a': 1, 'g': 4},
 'gc': {'g': 1},
 'gg': {'a': 1, 'c': 1, 'g': 1}}

If you are careful enough, the entire dictionary can be built in just one pass through the circular text. Note that there

is no reason to save the original text or the circular text as an attribute of the data type. That would be a grossly

inefficient waste of space. Your MarkovModel object does not need either of these strings after the dictionary is built.

• Order. Return the order k of the Markov model.

• Frequency. There are two frequency methods.

– kgram_freq(kgram) returns the number of times kgram was found in the original text. Returns 0 when kgram is not

found. Raises² an error if kgram is not of length k.

– char_freq(kgram, c) returns the number of times kgram was followed by the character c in the original text. Returns

0 when kgram or c is not found. Raises an error if kgram is not of length k.

• Randomly generate a character. Return a character. It must be a character that followed the kgram in the original

text. The character should be chosen randomly, but the results of calling rand(kgram) several times should mirror the

frequencies of characters that followed the kgram in the original text. Raise an error if kgram is not of length k or if

kgram is unknown.

• Generate pseudo-random text. Return a string of length T that is a randomly generated stream of characters whose

first k characters are the argument kgram. Starting with the argument kgram, repeatedly call rand() to generate the

next character. Successive k-grams should be formed by using the most recent k characters in the newly generated

text.
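Putting the pieces above together, here is a minimal sketch of the core operations. It uses plain functions and hypothetical names (build_st, rand_char, gen) rather than the required MarkovModel class, and random.choices is one way to do the weighted sampling:

```python
import random

def build_st(text, k):
    """One pass over the circular text: map each k-gram to a dictionary
    of the characters that follow it and their frequencies."""
    circ = text + text[:k]                      # close the circle
    st = {}
    for i in range(len(text)):
        kgram, nxt = circ[i:i + k], circ[i + k]
        st.setdefault(kgram, {})
        st[kgram][nxt] = st[kgram].get(nxt, 0) + 1
    return st

def rand_char(st, kgram):
    """A random character following kgram, weighted by observed frequency."""
    if kgram not in st:
        raise ValueError('unknown kgram: ' + kgram)
    freqs = st[kgram]
    chars = list(freqs)
    return random.choices(chars, weights=[freqs[c] for c in chars])[0]

def gen(st, k, kgram, T):
    """A string of T characters: start with kgram, then do T - k transitions,
    always conditioning on the most recent k characters."""
    text = kgram
    while len(text) < T:
        text += rand_char(st, text[-k:])
    return text
```

For the example text ’gagggagaggcgagaaa’ with k = 2, build_st returns exactly the dictionary shown above, and gen(st, 2, ’ga’, 11) produces an 11-character string beginning with ’ga’.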

To avoid dead ends, treat the input text as a circular string: the last character is considered to precede the first character.

For example, if k = 2 and the text is the 17-character string ’gagggagaggcgagaaa’, then the salient features of the Markov

model are captured in the table below:

                 frequency of        probability that
                   next char          next char is
  kgram   freq    a    c    g        a     c     g
  ------------------------------------------------
   aa      2      1    0    1       1/2    0    1/2
   ag      5      3    0    2       3/5    0    2/5
   cg      1      1    0    0        1     0     0
   ga      5      1    0    4       1/5    0    4/5
   gc      1      0    0    1        0     0     1
   gg      3      1    1    1       1/3   1/3   1/3
  ------------------------------------------------
           17     7    1    9

¹ Dictionaries play a pivotal role in this project, so before you write any code, make sure you understand what dictionaries are and how they work. The dictionary methods such as keys(), values(), and setdefault() will come in handy.

² raise ValueError("error message")

Note that the frequency of ’ag’ is 5 (and not 4) because we are treating the string as circular.
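This circular counting can be checked directly with a few lines of Python (a sketch, not part of the required API):

```python
text = 'gagggagaggcgagaaa'
k = 2
circ = text + text[:k]                 # append the first k characters
# Count occurrences of 'ag' in the circular text
count = sum(1 for i in range(len(text)) if circ[i:i + k] == 'ag')
# count is 5: the occurrence that wraps around the end is included
```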

A Markov chain is a stochastic process in which the next state depends only on the current state. For text generation, the

current state is a k-gram. The next character is selected at random, using the probabilities from the Markov model. For

example, if the current state is ’ga’ in the Markov model of order 2 discussed above, then the next character is ’a’ with

probability 1/5 and ’g’ with probability 4/5. The next state in the Markov chain is obtained by appending the new character

to the end of the k-gram and discarding the first character. A trajectory through the Markov chain is a sequence of such

states. Shown below is a possible trajectory consisting of 9 transitions.

trajectory:        ga --> ag --> gg --> gc --> cg --> ga --> ag --> ga --> aa --> ag

probability for a:    1/5    3/5    1/3    0      1     1/5    3/5    1/5    1/2

probability for c:     0      0     1/3    0      0      0      0      0      0

probability for g:    4/5    2/5    1/3    1      0     4/5    2/5    4/5    1/2

Treating the input text as a circular string ensures that the Markov chain never gets stuck in a state with no next characters.

To generate random text from a Markov model of order k, set the initial state to k characters from the input text. Then,

simulate a trajectory through the Markov chain by performing T − k transitions, appending the random character selected

at each step. For example, if k = 2 and T = 11, the following is a possible trajectory leading to the output gaggcgagaag:

trajectory: ga --> ag --> gg --> gc --> cg --> ga --> ag --> ga --> aa --> ag

output: ga g g c g a g a a g

$ python3 markov_model.py banana 2

an a

na b

na a

na -

freq(an, a) = 2

freq(na, b) = 1

freq(na, a) = 0

freq(na) = 2

$ python3 markov_model.py gagggagaggcgagaaa 2

aa a

ga g

gg c

ag -

cg -

gc -

freq(aa, a) = 1

freq(ga, g) = 4

freq(gg, c) = 1

freq(ag) = 5

freq(cg) = 1

freq(gc) = 1

Problem 2. (Random Text Generator) Write a client program text_generator.py that takes two command-line integers k and T, reads the input text from standard input (for efficiency reasons, use sys.stdin.read() to read the text), and builds a Markov model of order k from the input text; then, starting with the k-gram consisting of the first k characters of the input text, prints out T characters generated by simulating a trajectory through the corresponding Markov chain, followed by a newline. You may assume that the text has length at least k, and also that T ≥ k.

$ python3 text_generator.py 2 50 < data/input17.txt

gaggcgaggcgagagagaaaagaaaggcgaaggagagaggagaggaggcg


Problem 3. (Noisy Message Decoder) Imagine you receive a message where some of the characters have been corrupted by noise. We represent unknown characters by the ~ symbol (we assume the ~ character does not otherwise appear in our messages). Implement the method replace_unknown() in the MarkovModel data type that decodes a noisy message by replacing each ~ in it with the most likely character, given an order-k Markov model and conditional on the surrounding text, and returns the decoded message. You may assume that the unknown characters are at least k characters apart and also appear at least k characters away from the start and end of the message.

This maximum-likelihood approach doesn’t always decode the message perfectly, but it recovers most of the missing characters correctly. Here

are some details on what it means to find the most likely replacement for each ~. For each unknown character, you should

consider all possible replacement characters. You want the replacement character that makes sense not only at the unknown

position (given the previous characters) but also when the replacement is used in the context of the k subsequent known

characters. For example, we expect the unknown character in "was ~he wo" to be ’t’ and not simply the most likely character

in the context of "was ". You can compute the probability of each hypothesis by multiplying the probabilities of generating

each of the k + 1 characters in sequence: the missing one and the k subsequent ones.
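One way to sketch this scoring step is shown below. The helper names (decode_char, prob) are hypothetical; in the project this logic lives inside MarkovModel.replace_unknown(), and st stands for the nested frequency dictionary described in Problem 1:

```python
def decode_char(st, k, msg, i, alphabet):
    """Most likely replacement for the unknown character msg[i]: score each
    candidate by the product of the k + 1 conditional probabilities of
    generating msg[i], msg[i+1], ..., msg[i+k].
    Assumes k <= i and i + k < len(msg), per the problem statement."""
    def prob(kgram, c):
        # Conditional probability that c follows kgram, estimated from counts
        freqs = st.get(kgram, {})
        total = sum(freqs.values())
        return freqs.get(c, 0) / total if total else 0.0

    best, best_p = None, -1.0
    for cand in alphabet:
        fixed = msg[:i] + cand + msg[i + 1:]
        p = 1.0
        for j in range(i, i + k + 1):   # the k + 1 windows containing position i
            p *= prob(fixed[j - k:j], fixed[j])
        if p > best_p:
            best, best_p = cand, p
    return best
```

For instance, with the order-2 model of ’gagggagaggcgagaaa’ from Problem 1, the unknown character in ’gag~ga’ is decoded as ’a’, since ’a’ maximizes the product of the three probabilities.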

Write a client program fix_corrupted.py that takes an integer k (model order) and a string s (noisy message) as command-line

arguments, reads the input text from standard input (for efficiency reasons, use sys.stdin.read() to read the text), and prints

out the most likely original string.

Original: it was the best of times, it was the worst of times.

Noisy:    it w~s th~ bes~ of tim~s, i~ was ~he wo~st of~times.

$ python3 fix_corrupted.py 4 "it w~s th~ bes~ of tim~s, i~ was ~he wo~st of~times." < data/obama.txt

it was the best of times, it was the worst of times.

$ python3 fix_corrupted.py 2 "it w~s th~ bes~ of tim~s, i~ was ~he wo~st of~times." < data/obama.txt

it was the best of times, is was the worst of times.

Acknowledgements This project is an adaptation of the Markov Model of Natural Language assignment developed at

Princeton University by Robert Sedgewick and Kevin Wayne.
