The Word Scramble Problem
Background
I first saw the word scramble problem on Slashdot in the fall of 2003. Here is a link to the story they posted: http://science.slashdot.org/article.pl?sid=03/09/15/2227256&tid=134&tid=133&tid=14. I'll re-post the bulk of the story below:
An aoynmnuos raeedr sumbtis: "An interesting tidbit from Bisso's blog site: Scrambled words are legible as long as first and last letters are in place. Word of mouth has spread to other blogs, and articles as well. From the languagehat site: 'Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht frist and lsat ltteer is at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae we do not raed ervey lteter by it slef but the wrod as a wlohe. ceehiro.' Jamie Zawinski has also written a perl script to convert normal text into text where letters excluding the first and last are scrambled."
In the following week or so, several versions of this puzzle were posted, each one crediting a different university. I recall Cambridge University and the University of British Columbia being credited within the puzzle, but according to other stories I saw, both universities denied it was them. The short of it: I have no idea who to credit for originally "discovering" this puzzle.
A few days after the original Slashdot post, a professor challenged our Natural Language Processing group to solve the puzzle computationally. The challenge was passed along to the NLP class I was taking at the time, CISC882, and became a class assignment (in a different form).
The Problem
More specifically, the scrambling leaves the first and last letter of each word in place, but randomly permutes all the letters in the middle of the word. The real problem is to descramble a scrambled message, that is, to restore the original text.
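For concreteness, here is a minimal sketch of the scrambling step in Python (my own rendering; punctuation handling is omitted for simplicity):

```python
import random

def scramble_word(word):
    """Keep the first and last letters in place; randomly permute the rest."""
    if len(word) <= 3:
        return word  # nothing in the middle to permute
    middle = list(word[1:-1])
    random.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def scramble_text(text):
    """Scramble each whitespace-separated token."""
    return " ".join(scramble_word(w) for w in text.split())

print(scramble_text("restore the original message"))
```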
Keith's Solution
My solution to the problem is based upon two principles:
- A scrambled word can only be descrambled to a particular set of words.
- n-grams can give the statistical likelihood of a given sequence of words.
Possible word descramblings
The possible descramblings of a given word are generated by analyzing a list of words (like a dictionary). The basis of my approach is what I call a scrambled word signature. The signature of any word, scrambled or not, is created by sorting the characters in the middle of the word in ascending alphanumeric order. Here's an example (a sketch of the computation follows the table):
| Word    | Signature |
|---------|-----------|
| the     | the       |
| word    | word      |
| given   | geivn     |
| middle  | mddile    |
| created | caeertd   |
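A minimal sketch of the signature computation in Python (again, my own rendering of the description above):

```python
def signature(word):
    """First and last letters stay put; the middle is sorted ascending."""
    if len(word) <= 3:
        return word
    return word[0] + "".join(sorted(word[1:-1])) + word[-1]

assert signature("given") == "geivn"
assert signature("created") == "caeertd"
```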
To apply this principle, I build a mapping from signatures to words, as in the following example (a sketch of the construction follows the table):
| Signature      | Word(s) with the signature                     |
|----------------|------------------------------------------------|
| sadis          | saids, sadis, sidas                            |
| caervn         | cavern, carven, craven                         |
| cabemrs        | cembras, cambers, crambes                      |
| uacenrtd       | uncrated, untraced, uncarted                   |
| paeilrsts      | pilasters, plaisters, psaltries                |
| paaellnrty     | paternally, prenatally, parentally             |
| gaeegimnotc    | geomagnetic, gamogenetic, gametogenic          |
| icdeeilnrsty   | indiscretely, iridescently, indiscreetly       |
| icdeeeinnrssts | indiscreteness, indiscreetness, indirectnesses |
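Building the mapping is a single pass over the lexicon. Here is a sketch using the signature function above (the file name in the comment is an assumption):

```python
from collections import defaultdict

def build_signature_map(lexicon):
    """Map each signature to the set of words that share it."""
    sig_map = defaultdict(set)
    for word in lexicon:
        sig_map[signature(word)].add(word)
    return sig_map

# e.g. with the yawl word list:
# with open("words.list") as f:
#     lexicon = [line.strip().lower() for line in f if line.strip()]
sig_map = build_signature_map(["cavern", "carven", "craven"])
print(sig_map["caervn"])  # {'cavern', 'carven', 'craven'}
```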
The particular signature/word-set pairs I've chosen illustrate that multiple words may share the same signature. More commonly, though, a signature maps to only one word. To quantify this, I computed the average number of words per signature as a function of word length; the data are shown below.
| Word Length | Average number of words per signature |
|-------------|---------------------------------------|
| 3           | 1                                     |
| 4           | 1.03470031545741                      |
| 5           | 1.06252759869293                      |
| 6           | 1.0484368119771                       |
| 7           | 1.03410997805602                      |
| 8           | 1.02328974105143                      |
| 9           | 1.01249966967046                      |
| 10          | 1.00711743772242                      |
| 11          | 1.00444796260696                      |
| 12          | 1.00324107418459                      |
| 13          | 1.00134871871722                      |
| 14          | 1.00171546203111                      |
| 15          | 1.00128865979381                      |
| 16          | 1.00154511742892                      |
| 17          | 1.00055710306407                      |
| 18          | 1.00103412616339                      |
| 19          | 1                                     |
| 20          | 1                                     |
| 21          | 1                                     |
| 22          | 1                                     |
| 23          | 1                                     |
| 24          | 1                                     |
| 25          | 1                                     |
| 27          | 1                                     |
| 28          | 1                                     |
| 29          | 1                                     |
| 30          | 1                                     |
| 31          | 1                                     |
| 32          | 1                                     |
Another way to look at the data is to ask: if I mapped each signature to just one word, how often would I guess the word correctly? Instead of taking (# of unique words / # of unique signatures) as above, this ratio is (# of unique signatures / # of unique words). It shows the same thing: word signatures alone can descramble most words. But "most of the time" here means most of the words in the dictionary, and it's likely that most of the words in the dictionary are uncommon words.
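Both statistics fall directly out of the signature map; here is a sketch of how they could be computed (function names are my own):

```python
from collections import defaultdict

def ambiguity_by_length(sig_map):
    """Average number of words per signature, grouped by word length."""
    word_counts, sig_counts = defaultdict(int), defaultdict(int)
    for sig, words in sig_map.items():
        sig_counts[len(sig)] += 1
        word_counts[len(sig)] += len(words)
    return {n: word_counts[n] / sig_counts[n] for n in sorted(sig_counts)}

def fraction_guessable(sig_map):
    """# of unique signatures / # of unique words: how many dictionary
    words a one-word-per-signature table would descramble correctly."""
    total_words = sum(len(words) for words in sig_map.values())
    return len(sig_map) / total_words
```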
The particular lexicon, or list of words, I used for my application was the words.list file from the yawl-0.3 package on this webpage: The Serious Scrabble Player. In all, there are about 260,000 words in this lexicon. To this list I added the words that occurred in the part of the Hansard Corpus used in the next section. The Hansard Corpus is available from here. It is what's called an aligned corpus, consisting of paired documents in English and French; I took text only from the English side.
Applying n-gram statistics to disambiguation
Suppose we have the scrambled word cavren. This has the signature caervn, which maps to three words: cavern, carven, and craven. Without knowing the context in which the word occurs, you'd guess it's cavern, right? Why? Because the word cavern is more frequent than carven and craven. In other words, the probability that a word is cavern is higher than the probability that it is craven or carven. This probability is measured over a collection of documents, or corpus, by dividing the number of occurrences of each word by the total number of words. These frequencies are called unigrams, or 1-grams.
The 2-gram extension is that you can measure the probability that a word is "cavern" given that the previous word is "big", or any other word. The n-gram extension, for some n, measures the probability that a word is a particular word given the previous n - 1 words.
My program is set up to allow any order of n-gram. A component runs over a specified corpus and collects n-grams for a specified n; its output is part of the input to my descrambling program. I used a subset of the Hansard Corpus, as it is freely available. The subset was about 15 MB, whereas the full Hansard Corpus is about 60 MB.
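Here is a minimal sketch of that collection component (the file name and whitespace tokenization are my assumptions):

```python
from collections import Counter

def collect_ngrams(tokens, n):
    """Count every n-gram (as a tuple of tokens) in the sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Build models for n = 1..3 from a tokenized corpus, e.g.:
# tokens = open("hansard_subset.txt").read().lower().split()
tokens = "the cat sat on the mat".split()
models = {n: collect_ngrams(tokens, n) for n in (1, 2, 3)}
print(models[2][("the", "cat")])  # 1
```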
Smoothing and back-off
A simple additive smoothing was used: anything with zero frequency was assigned a frequency of 1/2, and the probability distribution was not renormalized. Back-off was approximated by averaging the probabilities given by multiple language models (a language model here is the set of n-gram counts for a particular n); strictly speaking, this averaging is interpolation rather than back-off. The program takes the highest n as input, so if you specify 3, it will combine unigrams, bigrams, and trigrams to compute the probability of each word.
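To make the combination concrete, here is a minimal sketch of the scoring and decoding, building on signature, build_signature_map, and the models dictionary from the earlier sketches. Substituting 1/2 for zero counts mirrors the smoothing above; the greedy left-to-right decoding is my simplification, and the real program may search more thoroughly:

```python
def model_prob(models, history, word):
    """P(word | history) under the (len(history)+1)-gram model,
    substituting 1/2 for any zero count, without renormalizing."""
    n = len(history) + 1
    numerator = models[n].get(tuple(history) + (word,), 0) or 0.5
    if n == 1:
        denominator = sum(models[1].values())  # total tokens in the corpus
    else:
        denominator = models[n - 1].get(tuple(history), 0) or 0.5
    return numerator / denominator

def averaged_prob(models, history, word, max_n):
    """Average the probabilities from the 1..max_n gram models."""
    probs = []
    for n in range(1, max_n + 1):
        if len(history) < n - 1:
            continue  # not enough context for this model yet
        context = tuple(history[-(n - 1):]) if n > 1 else ()
        probs.append(model_prob(models, context, word))
    return sum(probs) / len(probs)

def descramble(scrambled_tokens, sig_map, models, max_n=3):
    """Greedily pick, for each token, the candidate word (by signature)
    with the highest averaged n-gram probability given the words so far."""
    output = []
    for token in scrambled_tokens:
        candidates = sig_map.get(signature(token), {token})
        best = max(candidates,
                   key=lambda w: averaged_prob(models, output, w, max_n))
        output.append(best)
    return output
```

With a reasonable corpus behind the models, cavren in context should come out as cavern rather than carven or craven.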
Evaluation
Since this was a class project, I haven't performed an extensive evaluation, but I've posted a few examples of scrambled input and my program's output below. The words that are descrambled incorrectly are noted after each example. A true evaluation starts with normal text, scrambles it, and compares the descrambling against the original. The original text for the first two examples is obvious; the original text for the third example is provided.
Cambridge input
- Input
Aoccdrnig to a rseheearcr at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a ttoal mses and you sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.
amzanig huh
- Output of my program
According to a researcher at Cambridge University, it doesn't matter in what order the letters in a word are, the only important thing is that the first and last letter be at the right place. The rest can be a total mess and you still read it without problem. This is because the human mind does not read every letter by itself, but the word as a whole.
amazing huh
Professor's email
- Input
Yes if all the fsirt and lsat lertets are the smae, tehn any wrod wtih 3 lertets is crecrot and wrods with 4 lertets are olny stilghly off.
But cidsoner scanitlifingy lightener lacixel citonaniboms wtih recuded particletibidy and tulorbe cloud pilfertorae.
- Output of my program
Yes if all the first and last letters are the same, then any word with 3 letters is correct and words with 4 letters are only slightly off.
But consider significantly lightener lexical combinations with reduced predictability and trouble could proliferate.
I checked my language models, and apparently the word lengthier never occurred in the subset of the Hansard Corpus that I used, so the descrambler could never choose it. This problem would likely be fixed with a larger corpus.
Pride and Prejudice
Pride and Prejudice, by Jane Austen, is available from Project Gutenberg, a free online collection of literary works. For brevity, I've included only the first few paragraphs below.
- Original text
Pride and Prejudice
by Jane Austen
Chapter 1
It is a truth universally acknowledged, that a single man in
possession of a good fortune, must be in want of a wife.
However little known the feelings or views of such a man may
be on his first entering a neighbourhood, this truth is so well
fixed in the minds of the surrounding families, that he is considered
the rightful property of some one or other of their daughters.
"My dear Mr. Bennet," said his lady to him one day, "have you
heard that Netherfield Park is let at last?"
Mr. Bennet replied that he had not.
"But it is," returned she; "for Mrs. Long has just been here, and
she told me all about it.
Mr. Bennet made no answer.
"Do you not want to know who has taken it?" cried his wife
impatiently.
"YOU want to tell me, and I have no objection to hearing it."
This was invitation enough.
"Why, my dear, you must know, Mrs. Long says that Netherfield
is taken by a young man of large fortune from the north of
England; that he came down on Monday in a chaise and four to
see the place, and was so much delighted with it, that he agreed
with Mr. Morris immediately; that he is to take possession
before Michaelmas, and some of his servants are to be in the
house by the end of next week."
- Scrambled
Pdrie and Piucejdre
by Jnae Asteun
Cpehatr 1
It is a turth uivlnasrely agkcondweled, that a siglne man in
psisesoosn of a good fountre, msut be in want of a wfie.
Hvoweer llitte kwnon the fngeelis or vwies of scuh a man may
be on his fsrit eennritg a nhhgrooobeiud, tihs tturh is so well
feixd in the mdnis of the snduoinrurg feimials, that he is cdneresiod
the rgtfhuil ptpoerry of smoe one or otehr of tehir dharutegs.
"My dear Mr. Bnneet," said his lday to him one day, "have you
haerd taht Nlhifreeted Prak is let at lsat?"
Mr. Benent rlpieed that he had not.
"But it is," rurenetd she; "for Mrs. Lnog has jsut been here, and
she told me all aobut it.
Mr. Bennet mdae no aneswr.
"Do you not wnat to know who has tkaen it?" ceird his wife
iitapnmlety.
"YOU want to tlel me, and I hvae no ocjbteoin to henarig it."
Tihs was inttoiiavn enoguh.
"Why, my dear, you must know, Mrs. Lnog says that Nehrifeletd
is tkean by a ynoug man of lrgae ftuonre form the nrtoh of
Ealngnd; that he cmae down on Mdoany in a cishae and four to
see the pclae, and was so mcuh dgitheled with it, that he areged
wtih Mr. Morirs itaeidmlmey; that he is to tkae psessoison
boefre Meiacmahls, and smoe of his sraevnts are to be in the
hosue by the end of nxet week."
- Output of the program
Pride and Prejudice
by Jane Austen
Chapter 1
It is a truth universally acknowledged, that a single man in
possession of a good fortune, must be in want of a wife.
However little known the feelings or views of such a man may
be on his first entering a neighbourhood, this truth is so well
fixed in the minds of the surrounding families, that he is considered
the rightful property of some one or other of their daughters.
"My dear Mr. Bennet," said his lady to him one day, "have you
heard that Nlhifreeted Park is let at last?"
Mr. Bennet replied that he had not.
"But it is," returned she; "for Mrs. Long has just been here, and
she told me all about it.
Mr. Bennet made no answer.
"Do you not want to know who has taken it?" cried his wife
impatiently.
"YOU want to tell me, and I have no objection to hearing it."
This was invitation enough.
"Why, my dear, you must know, Mrs. Long says that Nehrifeletd
is taken by a young man of large fortune from the north of
England; that he came down on Monday in a chaise and four to
see the place, and was so much delighted with it, that he agreed
with Mr. Morris immediately; that he is to take possession
before Meiacmahls, and some of his servants are to be in the
house by the end of next week."
In this passage, my algorithm only has trouble with proper nouns (Netherfield and Michaelmas).
Other Stuff
For my class assignment, I was required to implement a very particular algorithm: use n-gram statistics computed on the characters within each word, rather than on the words of a corpus. The idea is basically to search through all possible rearrangements of the characters so as to maximize the probability of the word under a trigram character model. I didn't try particularly hard to make a nice program for this task, and it performed rather poorly. However, such a program might be useful for proper nouns, which usually aren't in a corpus.
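Here is a rough sketch of that character-level approach, assuming char_trigrams is a Counter of three-character strings gathered from a word list; the boundary markers and the 0.5 smoothing constant are my assumptions, not the original program's:

```python
import math
from itertools import permutations

def char_trigram_descramble(scrambled, char_trigrams):
    """Try every permutation of the interior letters and keep the one whose
    character trigrams score highest. Summing log counts is equivalent to
    summing log probabilities here, since every permutation of the same word
    has the same number of trigrams. Note: the number of permutations grows
    factorially with word length, so this is only practical for short words."""
    if len(scrambled) <= 3:
        return scrambled
    best, best_score = scrambled, float("-inf")
    for middle in set(permutations(scrambled[1:-1])):
        candidate = scrambled[0] + "".join(middle) + scrambled[-1]
        padded = "^" + candidate + "$"  # mark the word boundaries
        score = sum(math.log(char_trigrams.get(padded[i:i + 3], 0) + 0.5)
                    for i in range(len(padded) - 2))
        if score > best_score:
            best, best_score = candidate, score
    return best
```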
Programs
I have a program to scramble text, the descrambling program described above, and a program to descramble text using the character language model. If you would like to use them for a particular purpose, contact me by email: trnka@udel.edu.