Building an AI to Recognize my Handwriting – Part I
The theoretical (and practical) starting point
Image by the Author
Roughly ten years ago, inspired by Tim Ferriss and other authors in the self-help genre, I started writing (by hand) regularly in notebooks (the physical kind). I must have filled 10–15 books by now. Wouldn’t it be great to have them digitized?
This article series follows me on my quest to develop an AI that does exactly that: turn my handwritten notes into proper plain-text files. Judging by the way my handwriting looks, this will be very challenging indeed.
This first article, Part I, covers the fundamentals. I will start by explaining my motivation a bit further, followed by a theoretical outline of how one could approach the task.
I will briefly introduce Convolutional Neural Networks (CNNs), a special type of artificial neural network designed specifically for image recognition. After that, I will try an automated way of “recognizing” words in my handwriting, with the naive goal of avoiding having to label and annotate my input data manually.
The subsequent parts will be more hands-on and actually use CNNs with training data. But I think this Part I serves as a necessary foundation to follow along, even though it ends with a bummer.
Like many of my articles, I’ll publish as I progress, dealing with upcoming challenges yet unknown, taking the reader with me on the journey.
Following the feedback I got for my “Data Scientist turning Quant” series, I will upload the notebooks and files to GitHub to make following along easier. They are available here (GitHub link) and will be updated regularly.
I hope you enjoy reading this as much as I enjoyed producing it. Feel free to contact me with further ideas or questions, or to comment on this post.
Best,
Jonas
My Motivation: A Stack of Personal Notebooks
For ten years now, I have been doing something called “journaling”. Rather than the classical “dear diary” kind of writing, the purpose is not to summarize the details of your day, things that are mostly irrelevant a week later anyway, but to write in order to sort your thoughts: where you’re at right now, where you could go, what challenged you recently, or generally noteworthy experiences. For me it serves a dual purpose: enabling reflection today as well as cataloging thought processes for reference in the future.
Anyway, I like doing it, and over the years I have filled around 15 notebooks. They are quite valuable to me, as they followed my life, covering worries and challenges as well as highlights and life-changing moments. I picture myself reading through them in one or two decades, filled with nostalgia. Hence, I want to keep them!
Right now they are scattered across two different places in Germany, which at least limits what I could lose, say in an unexpected fire, to half of them. The more likely threat, of course, is time, our old enemy. CDs stop working after 25 years, and only God knows how long my notes on paper will survive. I once found some old writings from my time in school; they were already quite pale.
I had been thinking about digitizing them for a while. There are certainly service providers for this! However, since the content is extremely personal, I did not want to share it with a third party.
Working as a professional Data Scientist, I know it is possible to build and train my own image recognition and OCR models. Theoretically, I know what to do. But I have never actually done it.
Furthermore, I know there are plenty of models available that are already pre-trained on handwriting and that I could fine-tune on my own data set. Sure, running them locally wouldn’t count as involving a third party in this context. However, again, judging by my own handwriting, I am pretty sure that no system currently available is “intelligent” enough to understand what I wrote.
Long story short, this is my motivation. My goal is simple: save years of my work by digitizing the notebooks with AI, while getting hands-on experience in building an OCR system.
I did not want to read too much about how other people approach this, or about “what you’re supposed to do”, as that would hinder my learning experience. I expect to fail at some point, face yet another challenge, and solve it. Repeat.
Process Outline and Tech Used
Now that my motivation has become clear, it’s time to think about what we actually need to achieve it. Simply speaking, I am expecting to have a pipeline consisting of three parts: pre-processing, main processing (CNN), post-processing.
Image by the Author
Pre-processing
I need labeled image data of my handwriting, processed in a way that increases the system’s speed and performance metrics. There are different options to test and experiment with, for example different image sizes or numbers of channels. I will mostly use OpenCV and NumPy for preprocessing the data. Part II will introduce LabelImg, the open-source tool I will use to create annotated input data.
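To give a flavor of what this could look like, here is a minimal pre-processing sketch using OpenCV and NumPy. The file name and target size are placeholders I chose for illustration; the actual values will only be settled by experimenting.

```python
import cv2
import numpy as np

def preprocess(path: str, target_size: tuple = (128, 32)) -> np.ndarray:
    """Load a word image, convert it to grayscale, resize and scale it."""
    # Read the image as a single-channel (grayscale) array
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Resize to a fixed (width, height) expected by the network
    img = cv2.resize(img, target_size)
    # Scale pixel values from [0, 255] to [0, 1] for more stable training
    return img.astype(np.float32) / 255.0

word_image = preprocess("data/words/image_00001.jpg")  # hypothetical path
print(word_image.shape)  # (32, 128): note that OpenCV's resize takes (width, height)
```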
Main processing
This is where the magic happens: input images in the form of matrices of floating-point numbers are turned into a vector of predicted probabilities of belonging to a certain class. In simpler terms, the output would be something like “image_1 is the word ‘Test’ with 90% probability and ‘Toast’ with 10%”.
I’ll use various Convolutional Neural Networks (CNNs), the go-to architecture for image recognition. If you’re new to CNNs, I will briefly describe their main properties later in this article. I won’t be using any pre-trained networks, following my motivation above and assuming that no one else in the world has handwriting as ugly as mine.
I will not start fully from scratch, though. Instead I will define and deal with the neural networks using TensorFlow and Keras, and all of their available classes and helper functions.
Post-processing
At this point we will have a trained CNN that recognizes my handwriting more or less well. However, its output will probably need to be cleaned and processed further, using various NLP methods.
Example: the system might predict a word to be “Tost”, but this word doesn’t exist in the German language. Another model further down the pipeline might correct it to “Test” based on similarity. The post-processing part of the system won’t be covered in this article or the next, since I am still very far from knowing how to do it.
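Just to illustrate the idea with a toy example (the real post-processing will come much later, if at all in this form): a simple similarity lookup against a word list, here with difflib from Python’s standard library and a made-up mini vocabulary, could already turn an impossible prediction into a plausible one.

```python
import difflib

# Made-up mini vocabulary; in practice this would be a large German word list
vocabulary = ["Test", "Toast", "Text", "Dokumente", "ich"]

def correct(predicted_word: str) -> str:
    # Return the most similar known word, or the prediction itself if nothing is close
    matches = difflib.get_close_matches(predicted_word, vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else predicted_word

print(correct("Tost"))  # -> "Toast" or "Test", whichever scores higher
```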
So much for the three-part system I expect to build. Since CNNs are central to Part I and Part II, I will now briefly introduce, in a very non-scientific, pragmatic way, what Convolutional Neural Networks (CNNs) are. This summary is mostly taken from Aurélien Géron’s great, great book “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, 2nd Edition”, my absolute favorite introduction to practical Machine Learning.
A Very Short Introduction to CNNs
If you roughly know what Artificial Neural Networks (ANNs) are and how they work for classification tasks, you’ll be able to grasp the following quite easily. You’ll find a short list of sources about CNNs at the end of this section.
CNNs build on studies from the late 1950s of how our brains process visual inputs. These studies resulted in a Nobel Prize in 1981. The researchers’ main finding: many neurons react only to visual stimuli located in a small region of the visual field. These fields overlap, piecing together what we see as a whole. Furthermore, two neurons might share the same receptive field, but one reacts only to horizontal lines, the other only to vertical ones.
Based on these findings, computer scientists started to create neural networks designed specifically for the task of image recognition. In 1998, Yann LeCun (currently Chief AI Scientist at Meta) created the LeNet-5 architecture for recognizing handwritten characters.
So what differentiates a CNN from other (deep) neural networks? At their core, CNNs are similar to standard ANNs: higher-level neurons take the output of lower-level neurons as input to detect various patterns in the data, moving from simple to complex as the data progresses through the network. However, CNNs have two special building blocks, the Convolution Layer and the Pooling Layer.
The motivation for these building blocks is quite easy to grasp. Take a relatively small image of 100px by 100px. This translates to 10,000 inputs (30,000 if it’s a color image). For a fully connected standard ANN with a first layer of 1,000 neurons, this would mean 10,000,000 connections to be fitted. Convolution and Pooling Layers connect the layers only partially. Furthermore, CNNs recognize patterns regardless of where they appear in the image, while standard ANNs would have trouble with shifted images. There are many, many advantages to using these building blocks. Now more about them.
Convolution Layer
Following the ideas of the researchers mentioned above, neurons in a convolution layer are only connected to the pixels in their receptive field (if it’s the first layer) instead of the whole image. In later layers, neurons are only connected to the outputs in a small rectangle. It’s a simple idea but hard to describe in words alone. The image below could help.
By Aphex34 – Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374
This rectangular receptive field now slides over the image (the step size is known as the stride) and outputs one value at each position, thus reducing the size of the image in the next layer. The sets of weights used in these sliding windows are called filters, also known as convolution kernels.
Image by the Author
During training, the CNN will learn the most useful filters, such as detectors for horizontal vs. vertical lines, which can be visualized. These may start out quite simple but are combined into more and more complex filters as data progresses through the network. The result of applying a filter to an actual image is known as a Feature Map (i.e., the dot product of the filter and the image patch it overlays, computed at each position), highlighting which pixels activated the filter.
Image by the Author
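To make the sliding filter and the resulting feature map concrete, here is a tiny NumPy sketch with no learning involved: a hand-crafted vertical-line filter is slid over a toy binary image, and the dot product at each position forms the feature map.

```python
import numpy as np

# A 5x5 toy "image": a vertical bar of ink (1) on paper (0)
image = np.array([[0, 0, 1, 0, 0]] * 5, dtype=float)

# A hand-crafted 3x3 filter that reacts to vertical lines
kernel = np.array([[-1, 2, -1],
                   [-1, 2, -1],
                   [-1, 2, -1]], dtype=float)

# Slide the filter over the image with stride 1 (no padding) and take the
# dot product of the filter and the image patch it currently overlays
h, w = image.shape
k = kernel.shape[0]
feature_map = np.zeros((h - k + 1, w - k + 1))
for i in range(h - k + 1):
    for j in range(w - k + 1):
        feature_map[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)

print(feature_map)  # largest values where the vertical bar lies under the filter's center
```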
Pooling Layer
To further reduce computational load, so-called Pooling Layers are used to aggregate information from the input to a reduced output, used as input for the next (convolution) layer. This process is also known as sub-sampling. Like convolutional layers, pooling layers have a limited receptive field and slide over the input image. However, the stride is often set in such a way that receptive fields are not overlapping.
For example, the nowadays commonly used Max Pooling Layers output the maximum value of all neurons within their receptive field, thus keeping only the most relevant pixel information. The image below visualizes that.
By Aphex34 – Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45673581
Pooling layers can be quite destructive as they throw away 80–90% of the information. However, they also add some invariance to the data that can reduce the risk of overfitting to specific but not generalizable details.
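A tiny NumPy sketch of max pooling, again just for illustration: a 2x2 pooling with stride 2 (non-overlapping fields) keeps only the largest value of each field and shrinks the input to a quarter of its size.

```python
import numpy as np

# A 4x4 toy feature map
feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 1, 5, 2],
                        [2, 0, 1, 3]], dtype=float)

# 2x2 max pooling with stride 2: keep only the maximum of each non-overlapping field
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[4. 2.]
#  [2. 5.]]
```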
Output Layer
Convolution layers are often followed by pooling layers, a pattern that is repeated a couple of times, until the outputs are flattened (from matrices into a vector) and mapped to as many final outputs as there are expected classes. For example, if we want to classify handwritten digits from 0 to 9, we would use a dense layer at the end with 10 outputs and a softmax activation function, so that we can predict the probability of an image belonging to each of the ten possibilities.
The final output after feeding in an image of an “8” could be the vector [0.2, 0, 0, 0, 0, 0, 0.1, 0, 0.7, 0], meaning that the model predicted “8” to be most likely (70%), but it could also have been a “0” with 20% or a “6” with 10% probability. Depending on the training images, these handwritten digits could look quite similar to one another.
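Putting the building blocks together, a minimal Keras model for this digit example could look like the sketch below, assuming 28x28 grayscale inputs like the classic MNIST digits. The layer sizes and counts are illustrative, not tuned for anything.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                 # 28x28 grayscale digit images
    keras.layers.Conv2D(32, 3, activation="relu"),  # convolution layer: 32 learned 3x3 filters
    keras.layers.MaxPooling2D(2),                   # pooling layer: 2x2 fields, stride 2
    keras.layers.Conv2D(64, 3, activation="relu"),
    keras.layers.MaxPooling2D(2),
    keras.layers.Flatten(),                         # from matrices to a vector
    keras.layers.Dense(10, activation="softmax"),   # 10 class probabilities, one per digit
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```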
So that’s it, all you need to know to follow along. CNNs are a special kind of artificial neural network used for image recognition. They use convolution layers to learn simple and complex patterns in the images, and pooling layers to reduce the computational load and the risk of overfitting.
Of course there’s more to CNNs and the visual nature of their input data makes them naturally easier to understand, well, visually. Here are some further basic sources helping you understand how CNNs work.
Wikipedia: https://en.wikipedia.org/wiki/Convolutional_neural_network
Josh Starmer’s (humorous but nonetheless highly informational) StatQuest: https://www.youtube.com/watch?v=HGwBXDKFk9I
If you still don’t have enough, 3Blue1Brown’s video goes deeper into what convolutions are and how they are used in all sorts of applications, for example image processing (starting at 8:22).
Getting Started with My Own Handwriting
After outlining what tech we need and how it would theoretically work, it is now time to get more practical, a.k.a. start coding. You can see my code and files here.
I started by writing a small letter on paper, taking a picture of it, and reading it on my computer using OpenCV. Don’t mind that it’s in German. The computer doesn’t either (yet).
Image by the Author
Ignoring my actual task for now, I was looking for a way to automatically detect what’s paper and what’s ink, meaning detecting the text. My goal was to find a way to cut out the words in an automated manner while ignoring the actual meaning of each word for now.
I used something called line-text segmentation, adapting code I found in this notebook on GitHub. In simple terms, the process first identifies lines of text and then loops through the lines to identify words.
It starts by binarizing the image so that each pixel is either 0 (black) or 1 (white). Similar to what a Pooling Layer does, the next step is dilation, where a kernel (a.k.a. receptive field) slides over the image, replacing each pixel with the maximum value in that field before sliding on. The result is a “growing” part of the image, hence the name dilation.
Again, rather than trying to understand this from reading alone, it is easier to see what is actually happening. The left image shows my letter after binarizing, the right one after dilation. The effect of dilation is similar to what we would get with a text marker.
Image by the Author
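For reference, a sketch of what the binarization and dilation step could look like in OpenCV. This is my own simplification, not the exact code from the notebook linked above; the file name, thresholding method, and kernel size are illustrative choices.

```python
import cv2
import numpy as np

img = cv2.imread("letter.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical file name

# Binarize with Otsu's method; THRESH_BINARY_INV makes the ink white so dilation can "grow" it
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Dilate with a wide kernel so the letters of a line merge into one connected blob
line_kernel = np.ones((5, 25), np.uint8)
dilated = cv2.dilate(binary, line_kernel, iterations=1)
```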
From this, it’s just a small step to line detection. I used OpenCV’s findContours function. The blue boxes below are supposed to identify the lines in which my text is written. As you can see, this worked better in some parts than in others. Looping through the lines to mark individual words leads to the image with the yellow boxes.
This looks okay, right? However, it gives me 346 identified words (instead of the 58 that are actually present). The approach led to duplicates, i.e., identical identified areas in my image. Even after removing these obvious duplicates, I am left with 94 words. Those include overlapping parts of the same word, as the examples below show.
Image by the Author
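Continuing the sketch from above, the line and word detection described here boils down to two nested contour searches. Again, this is an adapted simplification under my own assumptions, not the original notebook code.

```python
# Outer contours of the dilated blobs are the line candidates (OpenCV 4.x returns
# contours and hierarchy), sorted top to bottom by their y coordinate
contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
lines = sorted((cv2.boundingRect(c) for c in contours), key=lambda box: box[1])

word_kernel = np.ones((5, 5), np.uint8)  # smaller kernel: merge letters, not whole lines
words = []
for (x, y, w, h) in lines:
    # Cut the line out of the binarized image and dilate it slightly ...
    line_img = cv2.dilate(binary[y:y + h, x:x + w], word_kernel, iterations=1)
    # ... then repeat the contour search inside the line to get word candidates
    word_contours, _ = cv2.findContours(line_img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for wc in word_contours:
        wx, wy, ww, wh = cv2.boundingRect(wc)
        words.append(binary[y + wy:y + wy + wh, x + wx:x + wx + ww])

print(len(words))  # far more boxes than actual words, as described above
```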
I now had snippets of my letter representing words, but I would still need to go through the samples manually and delete the nonsensical ones. Furthermore, I would need to create some kind of lookup table stating that image_00001.jpg means “ich” and image_00002.jpg means “Dokumente”.
So, all in all, not much is gained in terms of efficiency. We might be smarter but not yet further along in practical terms.
Summary and Conclusion of Part I
I know, I conclude this first article in my series on building and training an AI to recognize my handwriting with a bummer. I am sorry! But by re-interpreting this failure as a challenge, we get the foundation for a much more optimistic second part, I promise!
And it’s not like I wasted time here. I introduced the core technology we will use, Convolutional Neural Networks (CNNs), how they work, and what differentiates this type of network architecture from the more standard ones you might have known before. It’s also worth knowing why you get into something, not just how. Therefore, I outlined my situation and motivation.
The main lesson for now is this: if we want to train a CNN, we need properly labeled input data, and we cannot rely on automated ways of creating it, such as the line-word segmentation approach using dilation on binarized input images.
In Part II, I will start training a CNN to recognize the words in the letter above. I will introduce LabelImg, an open-source annotation tool, which I use to create a properly labeled training dataset of my handwriting.
We will see how well this out-of-the-box CNN performs compared to other, more famous architectures like LeNet-5 and VGGNet, and if and how we can improve accuracy by turning the data preprocessing knobs available to us.
I hope you had fun reading this and will follow along. I can already say that Part II, too, will end with a bummer, a.k.a. another challenge. The flip side is that there will be a Part III to (hopefully) resolve that, too.
All the best,
Jonas