OCR in Qwoppy
Reinforcement Learning
Qwoppy is based on reinforcement learning. That means the agent chooses an action based on an input, and then gets some feedback to indicate whether the action was good or bad. In particular, Qwoppy is a model-free agent. Qwop is an HTML5 game, which means I could look at the JavaScript source code and use the physics calculations to derive a model, but where's the fun in that? Qwoppy is model-free: the only input the agent gets is the current score.
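To make that concrete, here is a minimal sketch of the agent-environment loop. The names here (read_score, press_keys, the agent's methods) are illustrative placeholders, not Qwoppy's actual API:

```python
# Illustrative agent-environment loop; read_score(), press_keys() and the
# agent's methods are placeholders, not Qwoppy's real API.
def run_episode(agent, game):
    prev_score = read_score(game)                 # the only input: the current score
    while not game.is_over():
        action = agent.choose_action(prev_score)  # some combination of Q/W/O/P presses
        press_keys(game, action)
        score = read_score(game)
        reward = score - prev_score               # feedback: did the action help?
        agent.update(prev_score, action, reward, score)
        prev_score = score
```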
But how do we get the score?
YouTube mad scientist Suckerpinch, whose Mario-playing AI inspired this project, gave his code direct access to the game's memory (not to make it sound like he took an easy route, since he actually runs the agent in a modified Mario NES cartridge). For this project I decided to use OCR instead.
OCR
I originally tried to avoid implementing the OCR myself, since the goal of the project was to teach myself about reinforcement learning. I started with Tesseract via its Python bindings but found it far too slow - around one step per second - to be feasible for training a deep network. This is probably because the Python binding subprocesses the Tesseract command-line program, which forces you to save the image to disk only for Tesseract to load it straight back. Tesseract also handles a wide range of edge cases and fonts that I would not need, which further reduces performance. With this in mind, I decided to roll my own OCR.
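For reference, the slow path looked roughly like this, assuming the pytesseract wrapper (which shells out to the tesseract binary for every call):

```python
# Rough sketch of the original, too-slow approach using pytesseract,
# a wrapper that spawns a tesseract subprocess for every image.
import pytesseract
from PIL import ImageGrab

def read_score_tesseract(bbox):
    # Grab the score region of the screen and hand it to Tesseract.
    image = ImageGrab.grab(bbox=bbox)          # bbox = (left, top, right, bottom)
    text = pytesseract.image_to_string(image)  # saves to disk, runs the CLI, reads back
    return text.strip()                        # e.g. "12.3 metres"
```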
In my postgraduate project I tried to build an x86 machine-language disassembler using techniques from neural machine translation. That didn't work out, for a slew of reasons, but I learned a lot about the techniques and was able to apply the encoder-decoder architecture to OCR with decent results.
The result is a pair of 256-unit GRU cells. The GRU is a recurrent unit that achieves performance similar to the LSTM at a lower memory and processing cost, because it keeps a single hidden state vector compared to the LSTM's two. The model can be built to scan the image either left-to-right or top-to-bottom; top-to-bottom turned out to be superior, reaching 97% accuracy on the test data vs. 87% for left-to-right.
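As a sketch of that architecture (assuming PyTorch; apart from the two 256-unit GRUs, the layer arrangement and token handling here are my guesses), each image row becomes one timestep for the encoder, and the decoder emits one token per step:

```python
# Minimal sketch of the encoder-decoder OCR model: two 256-unit GRUs.
# The real Qwoppy model may differ in details; everything beyond the
# 256 hidden units is an assumption.
import torch
import torch.nn as nn

HIDDEN = 256

class Encoder(nn.Module):
    def __init__(self, img_width):
        super().__init__()
        # Scanning top-to-bottom: each row of pixels is one timestep,
        # so the input size per step is the image width.
        self.gru = nn.GRU(input_size=img_width, hidden_size=HIDDEN, batch_first=True)

    def forward(self, images, h0):
        # images: (batch, height, width) -- rows fed in order, top to bottom
        _, hidden = self.gru(images, h0)
        return hidden                              # summary of the whole image

class Decoder(nn.Module):
    def __init__(self, n_tokens):
        super().__init__()
        self.embed = nn.Embedding(n_tokens, HIDDEN)
        self.gru = nn.GRU(input_size=HIDDEN, hidden_size=HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, n_tokens)

    def forward(self, prev_token, hidden):
        # Emit one output token per step, conditioned on the encoder state.
        x = self.embed(prev_token).unsqueeze(1)    # (batch, 1, HIDDEN)
        output, hidden = self.gru(x, hidden)
        return self.out(output.squeeze(1)), hidden
```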
I created the training and test data by taking screenshots of Qwop until I had a sample of each of the tokens - the digits 0-9, the minus sign, the space, the decimal point and the characters in "metres" - and then wrote a script to generate every legal combination of tokens (the minus sign can only appear at the start; there may be 1, 2 or 3 digits before an optional decimal point, which must be followed by exactly one more digit; and finally a space and the word "metres") and store each one as a black-and-white PNG. Another script splits the samples randomly at a specified rate (80% for training and 20% for testing) and writes a pair of CSV files.
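As a sketch, the rules above can be turned into a nested loop over sign, whole part and optional fraction. This is illustrative only: the real script also renders each string to a PNG, and may restrict the numeric range more tightly than shown here:

```python
# Sketch of enumerating label strings matching the legal score format:
# optional leading '-', 1-3 digits, optional '.' plus one digit, then ' metres'.
# The actual generation script may cover a narrower range of values.
def legal_scores():
    for sign in ("", "-"):
        for whole in range(1000):                  # renders as 1 to 3 digits
            for frac in [None] + list(range(10)):
                if frac is None:
                    yield f"{sign}{whole} metres"
                else:
                    yield f"{sign}{whole}.{frac} metres"
```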
The model was trained on 1577 training samples in 19 batches of 83 samples for 200 epochs. I used the Adam optimizer with a learning rate of 0.0001 for both the encoder and decoder GRU, and mean squared error as the loss function. I deviated from typical methodology by initializing the recurrent units' hidden states with random noise rather than zeros, after reading [1].
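Putting those choices together, one training step might look like the following sketch, reusing the Encoder and Decoder classes sketched earlier. The noise scale, the image width, the token count and the one-hot MSE target are my assumptions, not details from the write-up:

```python
# Sketch of a single training step, assuming PyTorch and the Encoder/Decoder above.
# Hidden states start from random noise rather than zeros, following [1].
import torch
import torch.nn.functional as F

encoder = Encoder(img_width=64)   # width of the score image: an assumption
decoder = Decoder(n_tokens=18)    # 10 digits + '-', '.', ' ', letters of "metres": an assumption

enc_opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)
dec_opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)

def train_step(images, targets):
    # images: (batch, height, width); targets: (batch, seq_len) token indices
    batch = images.size(0)
    h0 = torch.randn(1, batch, HIDDEN) * 0.01       # noisy initial hidden state
    hidden = encoder(images, h0)

    loss = 0.0
    prev = targets[:, 0]                            # start from the first token
    for t in range(1, targets.size(1)):
        logits, hidden = decoder(prev, hidden)
        one_hot = F.one_hot(targets[:, t], 18).float()
        loss = loss + F.mse_loss(logits, one_hot)   # mean squared error loss
        prev = targets[:, t]                        # teacher forcing

    enc_opt.zero_grad()
    dec_opt.zero_grad()
    loss.backward()
    enc_opt.step()
    dec_opt.step()
    return loss.item()
```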
References