Natural Language Processing and Breaking Codes

posted Dec 19, 2011, 2:43 AM by Jack Marrows   [ updated Dec 19, 2011, 3:50 PM ]

Today I finished the final task given by Sebastian Thrun and Peter Norvig in their Introduction to Artificial Intelligence class.  It involved applying natural language processing concepts to decode a ciphered text and reconstruct a shredded text. I found completing these tasks to be a real treat and am excited to share my results.  Notably, many other approaches to these problems would also have been successful.

Cipher Challenge

The above text was encoded using a Caesar Cipher.  This means that when encoding, all characters were incremented by the same value.  For example, if all characters were incremented by two, 'a' would become 'c', or the word 'case' would become 'ecug'.

In solving this problem there are 25 possible solutions that the algorithm must consider.  Not 26, because we know the text we are starting with is not correct.  Therefore, all my algorithm must do is enumerate all of the possible solutions and find the one most likely to be the original text.
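The enumeration step can be sketched in a few lines of Python (the function names here are illustrative, not the actual code used for the class):

```python
# A minimal sketch of the enumeration step: generate all 25 candidate
# decodings of a Caesar-ciphered string.  Shift 0 is excluded because we
# know the ciphertext itself is not the original text.

def shift_text(text, shift):
    """Shift every letter forward by `shift`, wrapping around at 'z'."""
    result = []
    for ch in text:
        if ch.isalpha():
            result.append(chr((ord(ch.lower()) - ord('a') + shift) % 26 + ord('a')))
        else:
            result.append(ch)
    return ''.join(result)

def candidates(ciphertext):
    """Return the 25 possible original texts for a Caesar cipher."""
    return [shift_text(ciphertext, i) for i in range(1, 26)]

# 'case' incremented by 2 becomes 'ecug', so one of the 25 candidate
# decodings of 'ecug' must be 'case'.
assert shift_text('case', 2) == 'ecug'
assert 'case' in candidates('ecug')
```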

To find the most likely original text, I assigned each enumeration a probability of being correct.  To assign a probability to each word being correct, I considered its frequency in the British National Corpus (BNC).  Each frequency is normalised against the frequencies of the other possible words, and the results are then combined across all words.

For example, working with 'esp', possible words are:

i: Increment Value
w: Possible Word
f: Frequency of word in BNC

  i  w    f          i  w    f          i  w    f          i  w    f          i  w    f
  1  ftq  0          6  kyv  0         11  pda  7         16  uif  0         21  zmk  0
  2  gur  2          7  lzw  0         12  qeb  0         17  vjg  0         22  anl  3
  3  hvs  7          8  max  400       13  rfc  21        18  wjh  0         23  bom  11
  4  iwt  0          9  nby  0         14  sgd  0         19  xki  0         24  cpn  0
  5  jxu  0         10  ocz  0         15  the  6187267   20  ylj  0         25  dqo  0

Looking at the frequencies above, it is immediately obvious that the most likely increment value is 15, because 'the' is far more frequent than any other candidate word.  However, it is possible that the word may be 'max' or even a combination with no matches (perhaps an obscure name).  Therefore, so that those rows are not ruled out and assigned a probability of 0, I applied Laplace smoothing with a constant of k = 20 (for this task the value of k was fairly unimportant; in other situations more thought may be needed when assigning it).

Following this I normalised all of the frequencies and generated probabilities for each word.  Below is the example for 'max'.  Given the massive frequency of 'the', 'max' is unlikely:

The overall probability for each possible increment, after considering all words, is the normalised product of the individual word probabilities: we find the probability of each word and multiply them together.  After calculating the probability of each increment (possible solution), the one with the maximum probability is selected as the most likely choice.
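The scoring step can be sketched as follows.  The tiny `freq` table, the vocabulary size, and the function names here are placeholder assumptions for illustration; a real run would load the full BNC frequency list:

```python
# Sketch of the scoring step: Laplace-smoothed word probabilities
# combined by multiplication, then the best candidate selected.
from functools import reduce

# Stand-in for the British National Corpus word counts (tiny sample).
freq = {'the': 6187267, 'max': 400, 'rfc': 21, 'bom': 11, 'hvs': 7, 'pda': 7}
K = 20          # Laplace smoothing constant, as in the post
VOCAB = 100000  # assumed vocabulary size for the smoothing denominator
TOTAL = sum(freq.values())

def word_probability(word):
    """Laplace-smoothed probability: unseen words get small non-zero mass."""
    return (freq.get(word, 0) + K) / (TOTAL + K * VOCAB)

def score(words):
    """Probability of a candidate decoding: product of word probabilities."""
    return reduce(lambda p, w: p * word_probability(w), words, 1.0)

def best_candidate(texts):
    """Pick the decoding whose words are jointly most probable."""
    return max(texts, key=lambda text: score(text.split()))
```

Thanks to the smoothing, a candidate containing only unseen words still receives a small positive score rather than being ruled out entirely.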

See the output below.  At this point I haven't added the necessary code to restore capital letters and punctuation.  However, such a task would be trivial compared to the challenge itself.

Paper Reconstruction Challenge

The text above contains a message; however, it is not readable because it is jumbled.  The message has been cut up into two-character strips and mixed.  Furthermore, the word-level approach used in the previous challenge is not sufficient: where the previous challenge had only 25 possible solutions, this one has 19! (about 1.2 × 10^17) possible solutions.  Therefore, it is not feasible to enumerate them all.

To solve this challenge, instead of using probabilities generated by word frequencies, I used probabilities generated by bigram and trigram frequencies.  Using these frequencies I then sorted one strip at a time, on each iteration placing the strip with the highest probability.  Below are the steps taken by the algorithm to solve this puzzle.


Firstly, the algorithm generates a table of the bigram and trigram frequencies in general language.  To be effective, samples must be taken from large bodies of text; I used the Brown Corpus for this challenge.  My algorithm read every character (including spaces and punctuation) in the text and recorded the frequency of each bigram and trigram.  See an example of a frequency table below (on a much smaller scale):

After sampling the above text we are left with the following frequency table.  Notably, due to the size of the text all of the bigrams and trigrams are equally likely.  This is certainly not the case over a large corpus.

Bigram Frequencies
 'He' 1   'el' 1   'll' 1   'lo' 1
 'o ' 1   ' w' 1   'wo' 1   'or' 1
 'rl' 1   'ld' 1

Trigram Frequencies
 'Hel' 1   'ell' 1   'llo' 1   'lo ' 1
 'o w' 1   ' wo' 1   'wor' 1   'orl' 1
 'rld' 1
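Building such a table amounts to sliding a window over the corpus.  A minimal Python sketch (the function name is illustrative):

```python
# Sketch of the frequency-table step: slide a window over a corpus and
# count every character n-gram, spaces and punctuation included.
from collections import Counter

def ngram_counts(text, n):
    """Count all overlapping character n-grams in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

bigrams = ngram_counts('Hello world', 2)
trigrams = ngram_counts('Hello world', 3)

# As in the table above, every n-gram in this tiny sample occurs once.
assert bigrams['He'] == 1 and bigrams['o '] == 1
assert trigrams['llo'] == 1 and trigrams['rld'] == 1
```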

Modelling the Problem

I modelled the shredded paper in my application using String arrays.  Each column was an array of strings and all of the columns were kept in an array.  Once sorted the array was copied to a List.


My algorithm sorts from left to right.  Firstly, it locates the shred that will be at the leftmost position.  In this situation that was done by looking for a strip that starts with a capital letter.  Probabilities could also be applied, because we know that all of the letters on the first strip will be the start of a word.  Therefore, we could use our bigrams and find which strip contains letters that most frequently follow a space (new words always follow spaces in the English language).

The following algorithm was followed until all of the strips were sorted:

1. Place each unsorted strip next to the last sorted strip (the rightmost strip).

2. Assign probabilities to the trigrams and bigrams created by placing the strip.

3. Find the product of all probabilities for each candidate strip.

4. Add the strip with the highest probability to the collection of sorted strips.

5. Repeat.
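The steps above can be sketched as follows.  Each strip is assumed to be a list of two-character row fragments, matching the string arrays described earlier; for brevity this sketch scores only the bigrams formed across the seam (the real algorithm also used trigrams), and `join_score` and its smoothing constant are illustrative choices:

```python
# A sketch of the greedy sort, assuming a bigram frequency table has
# already been built from a large corpus (e.g. with ngram_counts).

def join_score(left, right, bigram_freq, k=1):
    """Score placing `right` after `left`: product of the (smoothed)
    frequencies of the bigrams formed across the seam on every row."""
    score = 1.0
    for l_row, r_row in zip(left, right):
        seam = l_row[-1] + r_row[0]
        score *= bigram_freq.get(seam, 0) + k  # Laplace-style smoothing
    return score

def sort_strips(first, unsorted_strips, bigram_freq):
    """Greedily append the strip that best follows the current right edge."""
    ordered = [first]
    remaining = list(unsorted_strips)
    while remaining:
        best = max(remaining,
                   key=lambda s: join_score(ordered[-1], s, bigram_freq))
        remaining.remove(best)
        ordered.append(best)
    return ordered
```

On a one-row example with the 'Hello world' bigram table, starting from the strip 'He' and the shuffled strips 'wo', 'll', 'rl', 'o ', the greedy sort recovers the original order.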


The algorithm effectively decoded the text only needing to consider bigrams and trigrams.  Perhaps one of the most exciting aspects of this method is that it seems to work even with names that would not be found in a dictionary.  

HackFest and Brisbane Toilet Finder

posted Dec 3, 2011, 5:23 PM by Jack Marrows

Unleashing Brisbane's Data

Last Saturday, I entered an exciting and rewarding competition called HackFest. About a week before the competition the Brisbane City Council (BCC) released all of their data.  It included datasets such as the locations of bus stops, the number of city cycles at each station and the locations of public toilets.  HackFest was a one-day version of a bigger competition being run by the BCC called Hack::Brisbane.

BCC ran their competition to motivate developers to create applications for people who live in Brisbane using their data. In my opinion it is a great way for the Council to enable developers to give back to the city that they live in.  Furthermore, it is more cost effective to run a competition and have multiple applications developed to exploit the data's potential.

On the day I developed a consumer application for Android phones called Brisbane Toilet Finder (follow the link to install). I won the open category of the competition with the application.  The apps that other people created were great; however, the judges felt that my app was most likely to be used by the general public on a regular basis.  My app had very clear core competencies that it did very well, whereas some of the other apps used more datasets but essentially just made them easier to search and displayed them on a map.  I think that to win the big competition, developers will need a very consumer-focused approach.

About the App

Brisbane Toilet Finder is an app that locates the closest public toilet to a person using an Android phone.  It was developed using the public toilet dataset.  The longitude and latitude of each toilet provided by the dataset is used with the geolocation capabilities of Android phones to direct a user to the closest toilet with only one click.
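The core of the lookup is a nearest-point search over the dataset's coordinates.  A sketch in Python for illustration (the app itself was built for Android; the names and coordinates below are placeholders, not the real BCC data):

```python
# Sketch of the "closest toilet" lookup: haversine great-circle distance
# from the user's location to each dataset entry, then take the minimum.
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # Earth radius ~6371 km

def closest_toilet(user_lat, user_lon, toilets):
    """Return the (name, lat, lon) entry nearest to the user."""
    return min(toilets,
               key=lambda t: haversine_km(user_lat, user_lon, t[1], t[2]))
```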

Looking at the UI, it is evident that all of the additional information about a restroom provided by the dataset is shown to the user, such as accessibility and opening hours.  Furthermore, using the Google Street View API it was possible to display a thumbnail picture of the toilet to the user.

The app also allows users to view the average cleanliness of a toilet for a given day and rate it.  The data generated by users rating the public facilities has the potential to be used by the BCC in the future when planning the allocation of resources.  These ratings are stored in the cloud using a webservice built upon Google App Engine.

Finally, users can get directions to the toilet from their current location with a single click.  This was implemented using the Android Google Maps API.



Oh Shit - I Love Cards

posted Oct 23, 2011, 3:22 AM by Jack Marrows   [ updated Nov 4, 2011, 6:54 PM ]

As a child my family used to play a crudely named game, "Oh Shit". We used to have a ball attempting to trip up each other's plays and achieve the scores that we had bid. Of late my social group has begun to invest a lot of our spare time in playing card games. Namely, we have taken a great interest in 500 and Oh Shit. This sparked my interest in developing Oh Shit as a browser-based computer game.
For myself the part that I enjoyed the most was programming the logic into the computer AI. It gave me a frame to view each of the strategies I often unconsciously played and a platform to analyse the most effective strategies.
During the development I built a test platform that let me put each of the AIs I developed, each employing different strategies, up against each other over a large number of games (usually around 1000). Doing this allowed me to rank the effectiveness of each strategy and in turn offer users different difficulty levels.
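A test platform of that shape can be sketched as below.  The actual Oh Shit game logic is replaced by a stub `play_game`, and the strategy names and "skill" numbers are entirely made up for illustration; the real platform plugged in the full game and AI implementations:

```python
# Sketch of a strategy-ranking harness: play every pair of strategies
# many times and rank them by total wins.
import random

def play_game(strategy_a, strategy_b, rng):
    """Placeholder for a full game.  A real implementation would deal
    hands, collect bids and score tricks; here the stronger stub simply
    wins proportionally more often."""
    total = strategy_a[1] + strategy_b[1]
    return strategy_a if rng.random() < strategy_a[1] / total else strategy_b

def rank_strategies(strategies, games=1000, seed=42):
    """Round-robin: each pair plays `games` games; rank by wins."""
    rng = random.Random(seed)
    wins = {name: 0 for name, _ in strategies}
    for i, a in enumerate(strategies):
        for b in strategies[i + 1:]:
            for _ in range(games):
                wins[play_game(a, b, rng)[0]] += 1
    return sorted(wins, key=wins.get, reverse=True)
```

Over roughly a thousand games per pairing, the win counts separate the strategies clearly enough to order them into difficulty levels.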
Halfway through the development I felt that, as a stand-alone game, people would quickly get bored. Therefore, I made the decision to integrate Facebook into the game. This firstly gave each player an identity with a very quick authentication process, allowing me to store their game histories in a database. Furthermore, I was able to incorporate a friend high-score function so that people could compete with their friends on my game.

Technically developing Oh Shit was very exciting. I applied much of what I learnt at university to make my code as maintainable as possible. Throughout my code there are many interfaces that allow classes to be easily interchanged. This style of development allowed me to easily develop multiple AIs without changing any external code.

The application has been developed as a web application and runs on all browsers with minimal HTML5 support. As such, the game can be played on iOS devices, Android devices and of course computers. I developed it using Java and GWT; hence, all of the code is compiled to JavaScript before being sent to the browser. The back-end that provided a datastore and the Facebook integration was built upon Google App Engine. Using these technologies was easy and allowed for seamless deployment.
I would love it if people would visit my new application and even attempt to beat the advanced AI:
Play Oh Shit

Client Side Calculator

posted Oct 23, 2011, 3:18 AM by Jack Marrows   [ updated Nov 4, 2011, 6:45 PM ]

One of the elementary tutorials in a subject I am studying was to create a client-side calculator using only HTML, CSS and JavaScript.  While I already had the capabilities to create this application, it was a great refresher and I enjoyed being able to do something quickly without the need to do a great deal of looking up.  I thought I would share it for any people who are beginning to use JavaScript.  I have commented all of the files and believe this is a great activity to complete when beginning to learn the scripting language.
Obviously JavaScript is vital in any web developer's toolbox, especially with the emergence of HTML5, which has many aspects that are heavily reliant on JavaScript.  Find the source code below.

Bloom Reading

posted Oct 23, 2011, 3:13 AM by Jack Marrows

Bloom Reading is a project I worked on after my second semester of university, in which I aimed to integrate what I had learnt about Web 2.0 and web development.  Bloom Reading aims to provide a powerful resource to teachers around the world.  It is a platform for teachers to generate and share high-quality comprehension questions and worksheets.  The rationale outlined on the website is:

Reading comprehension is an essential skill for all people, enabling students to live a comfortable and literate life. As students move through school, the skills educators aim to teach them change. In the lower grades, students learn how to read words and understand what they are reading literally. When students move into the upper school, the demands on reading broaden and students are required to inference, reorganise, evaluate and deeply consider the texts they read.
Teachers around the world teach these skills by encouraging students to regularly read and through strategic questioning that often takes the form of a basic comprehension exercise. They use known frameworks such as Bloom's Taxonomy to ensure that their questions provoke higher-order thinking in the students and facilitate them developing the skills they will need to be literate in the modern world.
Writing these questions can be time consuming for teachers, and that is where Bloom Reading is useful. Bloom Reading is a service that allows teachers to share comprehension questions they have written for books, instantly generate worksheets and access questions created by other teachers around the world. It is entirely free to use, and the more teachers use it, the better it becomes. The ultimate goal of Bloom Reading is to enable teachers to write high-quality questions, significantly increasing teacher productivity by allowing them to reuse their own and each other's questions.

Technically, I have included many Web 2.0 principles in an attempt to make the site work.  Firstly, all of the content is generated by the users.  This aims to allow me to harvest the collective intelligence.  Furthermore, I have allowed users to flag poor questions, bookmark books in 'My Books' and implicitly vote questions up by using them in their worksheets.  I am hoping that these measures will ensure the content on the site is scalable and maintains a good standard of quality.

Secondly, I have made it very quick for a user to reach the functionality they are after.  In fact, they don't even need to create an account for certain functions.  On top of this, I intend to reward my users by acknowledging the top contributors in the sidebar and by providing them high-quality worksheets efficiently.

Thirdly, I have integrated social media tools into my site in an effort to promote it.  Obviously this site relies on the network effect; the more people who use it, the better the content is.  The social integration aims to encourage people to share the site with their friends and, more importantly, their colleagues.

Notably, the service interacts with many other web services such as YouTube, Amazon and of course the social tools I mentioned.  Furthermore, it was built upon the Zend Framework using jQuery.  As a result, the development of this site has been agile without compromising the robustness of the code.

I intend on providing the service for free and monetising through advertising and referrals to Amazon.  In the future the data collected may also be valuable and with wider functionality there may be a potential for a subscription service (never for independent users however).

Future functionality for the site may include student portals to allow students to answer the comprehension questions electronically.  This could then allow student progress to be mapped easily across books.  I have some other ideas also, and if you have any, please don't hesitate to suggest them to me.  In the meantime this blog is intended to notify people of future functionality and Bloom Reading milestones.

Web 2.0 Analysis of Foursquare

posted Oct 23, 2011, 2:46 AM by Jack Marrows   [ updated Oct 23, 2011, 3:08 AM ]

Foursquare is a location based social networking game. Users 'check in' at locations announcing they have been there in competition to become mayor. People contribute through submitting their thoughts, locations, pictures, information about their current location and friends (Wikipedia, 2010).

Utilizing features only available on cell phones has heavily contributed to Foursquare's functionality and success. Photos are taken using cameras, locations are recorded using cell location services and the application relies on a phone's connectivity (Foursquare, 2010). Alongside desktop computers, Foursquare can be used on iPhones, Android phones, Blackberrys, Windows mobiles and systems running webOS (Foursquare, 2010), and currently an application is being developed for Nokia handsets.

Foursquare seeks to provide a service that will dramatically narrow the void between the offline and online business and entertainment models. This participation-built model incorporates many of the best practices of Web 2.0 and successfully aligns with the patterns that influence the dynamics of the Web. The future impact of this application is dependent on its successful integration of the eight patterns, which are analysed in greater detail in this report.

In a versatile industry there are also continual issues that surround large-scale social networking websites. Privacy and content legal battles have made headlines on a number of platforms such as Facebook and Google. Foursquare has a clean slate for now, but an inevitable consequence of its participative, user-driven model is that controlling content can be difficult. The collection of data also poses a threat to users, which means there must be a level of trust from the user. Many of these conflicts are common in today's technology climate and can be prevented if the best practices outlined within this report are followed. These guidelines can aid all websites to structure and provide content that will successfully integrate technology into society and avoid legal implications over the life-cycle.


Translatinor

posted Oct 23, 2011, 1:48 AM by Jack Marrows   [ updated Oct 23, 2011, 2:40 AM ]

During my second semester of my MIT I undertook a project I named Translatinor. Translatinor is a tool that may be used by teams that are internationalising software. It takes a resx file, finds strings within it that may need to be translated, finds translations for the strings using existing translation memories (TMX files) and Google Translate, and then presents the translations to the user for moderation. These translations can then be entered into an application-specific translation memory. This translation memory may then be used to generate localised resx files.
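The extraction step can be sketched as below, assuming the usual .resx layout of `<data name="..."><value>...</value></data>` entries.  This is an illustrative Python sketch, not the actual tool, which also consulted TMX translation memories and Google Translate:

```python
# Sketch of the extraction step: pull translatable (name, value) string
# pairs out of a .resx file, skipping non-string resources.
import xml.etree.ElementTree as ET

def translatable_strings(resx_source):
    """Yield (name, value) pairs for entries that look like UI strings."""
    root = ET.fromstring(resx_source)
    for data in root.iter('data'):
        value = data.findtext('value')
        # Entries with a `type` attribute usually hold binary resources
        # (bitmaps, icons, ...) rather than translatable text.
        if value and not data.get('type'):
            yield data.get('name'), value

sample = """<root>
  <data name="greeting"><value>Hello</value></data>
  <data name="icon" type="System.Drawing.Bitmap"><value>abc123</value></data>
</root>"""
```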

The significance of this project is that it was the first parallel program that I have individually developed. Secondly, it is a tool that could be used by the internationalisation community. Please follow the links below for more information.

Ed Maps

posted Oct 23, 2011, 1:08 AM by Jack Marrows   [ updated Oct 23, 2011, 1:20 AM ]

Ed Maps is a project I undertook in my first semester of IT studies.  At that point I only had experience developing in PHP, so I chose to develop Ed Maps in PHP.  My team and I did an analysis of the current market and education software aimed at primary schools and found an area that we believed would benefit from further development.  Ultimately, Ed Maps allows students to collaboratively create digital concept maps using mobile devices.  During this project I analysed the education industry and mobile-device best practices.  I also improved my PHP skill set and UI design.  The project was very enjoyable and something I am still proud of.  If I did this project again I would implement the service using a more loosely coupled API so that other platforms could easily interact with concept maps.

Web UI

Mobile UI

Idea Sticky

posted Oct 23, 2011, 12:56 AM by Jack Marrows   [ updated Nov 4, 2011, 6:53 PM ]

I have just finished creating my first application using Google Web Toolkit and Google App Engine.  Like the rest of the world, I chose to create a sticky-note application that syncs with a server.  Therefore, people can access sticky notes easily wherever they are.  I used Eclipse for the development and have deployed the application for free on Google's App Engine.

This application has not only been the first time I used GWT but in fact my first instance of using Java.  It was a steep learning curve, but now that I have a working application and I reflect on my learning, I think it is far more maintainable than applications I have created in the past because of the use of packages.  GWT also has a clear interface for implementing AJAX, which I preferred over simply sending HTTP requests.

Another of the features that impressed me is GAE's ability to use Google's authentication services.  This means users can use their Gmail accounts with the application, and that I didn't have to create my own authentication, which would have taken far longer and certainly been less secure.

One of the bigger challenges I faced throughout development was learning how to use the Google datastore, which is different to traditional databases and allows objects to be stored directly to the store.  After learning it, I have decided this is more efficient in a lot of instances.

I am still ironing out some kinks and bugs, and there are certainly ways I could make my application more efficient.  For instance, currently I am sending far too much data over the network.  It doesn't matter while not many people are using the application, but it could be expensive on bandwidth with a more popular application.  Through working through these kinks I want to further develop my GWT ability because, in my opinion, it is a very valuable tool.

I would encourage people to use my application for notes, but I wouldn't put privacy-sensitive information on it because the content sent isn't encrypted.  To use this application, it needs to be launched properly.  This is done by creating a bookmark using the link below: simply click and drag the link to your bookmark bar (this works on most common browsers).  You will need to log in the first time, which may spoil your window size, but after that it is a simple tool to store your notes in the cloud.

My Ideas Sticky

Update 16 July 2010

I have now changed the UI, and my biggest concern is that it is now too cluttered.  However, assuming users use the link above (or the bookmark they created with it), the sticky note now records the page that instigated it.  This means that it is now able to generate a list of references for the sticky.

Update 19 July 2010

IdeasSticky now automatically gathers the title and URL of a web page.  The other fields can also be prefilled using values other users have entered for the same web page.  References for websites can be exported in Harvard and APA styles.
