Projects‎ > ‎

Natural Language Processing and Breaking Codes

posted Dec 19, 2011, 2:43 AM by Jack Marrows   [ updated Dec 19, 2011, 3:50 PM ]

Today I finished the final task given by Sebastian Thrun and Peter Norvig in their Introduction to Artificial Intelligence Class.  It involved applying natural language processing concepts to decode a ciphered text and reconstruct a shredded text. I found completing these tasks to be a real treat and am excited to share my results.  Notably many other approaches to these problems would also have been successful.

Cipher Challenge


The above text was encoded using a Ceasar Cipher.  This means that when encoding all characters were incremented by the same value.  For example, if all characters were incremented by two 'a' would become 'c' or the word 'case' would become 'ecug'.

In solving this problem there is 25 possible solutions that the algorithm must consider.  Not 26 because we know the text we are starting with is not correct.  Therefore, all my algorithm must do is enumerate through all of the possible solutions and find the solution most likely to be the original text.

To find the most likely original text, I assigned each enumeration a probability of being correct.  To assign a probability of each word being correct I considered its frequency in the British National Corpus (BNC).  Frequencies are normalised with all of the other possible word frequencies and then combined across all words.

For example, working with 'esp', possible words are:

i: Increment Value
w: Possible Word
f: Frequency of word in BNC

 i
 1ftq  0 6kyv 0 11pda 7 16uif 0 21zmk 0
 2gur 2 7lzw 0 12qeb 0 17vjg 0 22anl 3
 3hvs 7 8max 400 13rfc 21 18wjh 0 23bom 11
 4iwt 0 9nby 0 14sgd 0 19xki 0 24cpn 0
 5jxu 0 10ocz 0 15the 6187267 20ylj 0 25dqo 0

Looking at the frequencies above it is immediately obvious that the most likely increment value is 15 because 'the' is far more frequent than any other words.  However, it is possible that the word may be 'max' or even a different combination with no matches (perhaps an obscure name).  Therefore, so that those rows are not ruled and assigned a probability of 0, I applied Laplace Smoothing assigning my constant as 20 (given this task k (the constant) was fairly unimportant - in other situations more thought may have been needed when assigning the value).

Following this I normalised all of the frequencies and generated probabilities for each word.  Below is the example for max.  Given the massive frequency of 'the', 'max is unlikely:


The overall probability for each possible increment, after considering all words, is the product of all the probabilities normalised - we find the probability of all of the words and multiply them.  After calculating the probability of each increment (possible solution) the one with the max probability is selected as the most likely choice.  

See the output below.  At this point I haven't added the necessary code to restore capital letters and punctuation.  However, such a task would be trivial given the challenge.



Paper Reconstruction Challenge


The text above contains a message however, it is not readable because it is jumbled.  The message has been cut up into 2 character strips and mixed.  Furthermore, the word level approach used in the previous challenge is not sufficient because where the previous challenge had only 25 possible solutions this on has 19! possible solutions (calculate the factorial).  Therefore, it is not possible to enumerate them all.

To solve this challenge instead of using probabilities generated by word frequencies I used probabilities generated by bigram and trigram frequencies.  Using these frequencies I then sorted one strip at a time, each iteration placing the strip highest probability.  Below are the steps taken by the algorithm to solve this puzzle.

Sampling

Firstly the algorithm generates a table of the bigram and trigram frequencies in general language.  To be effective samples must be taken from large bodies of text.  I used the Brown Corpus for this challenge.  My algorithm read every character (including spaces and punctuation) in the text and recorded the frequency of each bigram and trigram.  See an example of a frequency table below (much smaller scale):


After sampling the above text we are left with the following frequency table.  Notably, due to the size of the text all of the bigrams and trigrams are equally likely.  This is certainly not the case over a large corpus.

Bigram Frequencies
 'He' 1  'el' 1 'll' 1 'lo' 1
 'o ' 1 ' w' 1 'wo' 1 'or' 1
 'rl' 1 'ld' 1    

Trigram Frequencies
 'Hel' 1 'ell' 1 'llo' 1 'lo ' 1
 'o w' 1 ' wo' 1 'wor' 1 'orl' 1
 'rld 1      

Modelling the Problem

I modelled the shredded paper in my application using String arrays.  Each column was an array of strings and all of the columns were kept in an array.  Once sorted the array was copied to a List.

Sorting

My algorithm sorts from left to right.  Firstly it locates the shred that will be at the left most position.  In this situation that was done by looking for a strip that starts with a capital letter.  Probabilities could also be applied because we know that all of the letters on the first strip will be the start of a word.  Therefore, we could use our bigrams and find which strip contains letters that most frequently follow a space (new words always follow spaces in the English language).



The following algorithm was followed until all of the strips were sorted:

1. Place each unsorted strip next to the last sorted strip (right most strip)


2. Assign probabilities to the trigrams and bigrams created (by placing the strip)
3. Find the product of all probabilities for each given strip


4. The strip with the highest probability is added to the collection of sorted strips.



5. Repeat

Result

The algorithm effectively decoded the text only needing to consider bigrams and trigrams.  Perhaps one of the most exciting aspects of this method is that it seems to work even with names that would not be found in a dictionary.  





Comments