Character Recognition Using a Fully Connected Neural Network

Recent advances in deep learning have generated excitement in many areas of computing, and character recognition is one of them. The character recognition problem takes a handwritten character (a digit or a letter) as input and produces the corresponding computer-recognizable character as output. This blog covers, in depth, how we build such a system.

For this project, a simple GUI is provided where a user can draw an English digit (0-9). When submitted, the digit is saved as a jpg image and then fed to a trained fully connected neural network. The class with the highest probability is displayed back to the user via the same GUI.
[Image: the drawing GUI]
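
The GUI code itself is not shown in this post, but as a rough illustration, a drawing window like this can be built with Tkinter and Pillow. The sketch below is an assumption about how it could work, not the project's actual code: every stroke is mirrored onto a Pillow image so it can be saved as image.jpg, the file name the pre-processing step expects. The digit is drawn black on white; the pre-processing step below inverts the colors.

import tkinter as tk
from PIL import Image, ImageDraw

CANVAS_SIZE = 280  # Drawn large; the pre-processing step downscales to 28x28.

root = tk.Tk()
canvas = tk.Canvas(root, width=CANVAS_SIZE, height=CANVAS_SIZE, bg='white')
canvas.pack()

# Mirror every stroke onto a Pillow image so the drawing can be saved as a jpg.
pil_image = Image.new('L', (CANVAS_SIZE, CANVAS_SIZE), 'white')
pil_draw = ImageDraw.Draw(pil_image)

def paint(event):
    r = 8  # Brush radius.
    canvas.create_oval(event.x - r, event.y - r, event.x + r, event.y + r, fill='black')
    pil_draw.ellipse([event.x - r, event.y - r, event.x + r, event.y + r], fill='black')

def submit():
    pil_image.save('image.jpg')  # The file name the pre-processing step reads.
    root.destroy()

canvas.bind('<B1-Motion>', paint)
tk.Button(root, text='Submit', command=submit).pack()
root.mainloop()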

Well, how do we do that?

Step 1: Image Pre-processing
Our neural network will be trained on the MNIST dataset, so it is essential that our input image matches the images the network was trained on. Each image in the dataset is:

  • grayscale
  • 28x28 pixels in size
  • normalized so that each pixel value lies between 0 and 1
  • black background with a white digit
  • the actual digit centered in a 20x20 box within the 28x28 image

Please refer to this page for further information about the dataset.

Therefore, in the pre-processing stage, it is essential that the submitted image meets all of these criteria. The following code does exactly that.

import cv2
import math
import numpy as np


def processed_image():
    print("Resizing Image")
    img = cv2.imread('image.jpg', cv2.IMREAD_GRAYSCALE)  # Load the image in grayscale.
    # Invert the colors (MNIST digits are white on black) and resize to 28x28.
    resized_image = cv2.resize(255 - img, (28, 28))
    # Scale the pixel values into the range [0, 1].
    norm_image = cv2.normalize(resized_image, None, alpha=0, beta=1, norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_32F)

    # ---------------Fit the image into a 20x20 pixel box-------------
    # Remove empty (all-black) rows and columns around the digit.
    while np.sum(norm_image[0]) == 0:
        norm_image = norm_image[1:]

    while np.sum(norm_image[:, 0]) == 0:
        norm_image = np.delete(norm_image, 0, 1)

    while np.sum(norm_image[-1]) == 0:
        norm_image = norm_image[:-1]

    while np.sum(norm_image[:, -1]) == 0:
        norm_image = np.delete(norm_image, -1, 1)

    rows, cols = norm_image.shape

    # Scale the cropped digit so that its longer side becomes 20 pixels.
    if rows > cols:
        factor = 20.0 / rows
        rows = 20
        cols = int(round(cols * factor))
    else:
        factor = 20.0 / cols
        cols = 20
        rows = int(round(rows * factor))
    norm_image = cv2.resize(norm_image, (cols, rows))

    # Pad the digit back out to a centered 28x28 image.
    colsPadding = (int(math.ceil((28 - cols) / 2.0)), int(math.floor((28 - cols) / 2.0)))
    rowsPadding = (int(math.ceil((28 - rows) / 2.0)), int(math.floor((28 - rows) / 2.0)))
    norm_image = np.pad(norm_image, (rowsPadding, colsPadding), 'constant')

    print("Resizing Completed.")
    return norm_image.flatten()

Note: this function returns a 1D array of length 784 (the 28x28 image flattened).
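
A quick sanity check (hypothetical usage, assuming image.jpg exists) confirms the shape and value range:

vec = processed_image()
print(vec.shape)             # (784,)
print(vec.min(), vec.max())  # values lie in [0, 1]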

The original image looks like this:
[Image: the original drawn digit]
The processed image looks like this:
[Image: the processed 28x28 digit]

Step 2: Building and Training the Neural Network
Okay, now that we have our image ready, let's build the neural network that will classify it.

Our neural network will have one input layer with 784 neurons (since our image is 28x28 pixels), one hidden layer with 300 neurons, another hidden layer with 100 neurons, and an output layer with 10 neurons (since our classes range from 0 to 9). The following code constructs this structure for us.

# Defining parameters.
n_inputs = 28*28  # MNIST
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10  # 10 Classes of prediction

# defining placeholder variables.
x = tf.placeholder(tf.float32, shape=(None, n_inputs), name='x')
y = tf.placeholder(tf.int32, shape=(None), name='y')

# defining deep neural network.
with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(x, n_hidden1, name="hidden1", activation=tf.nn.relu)
    hidden2 = tf.layers.dense(hidden1, n_hidden2, name="hidden2", activation=tf.nn.relu)
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

All the initial weights are set using Xavier initialization, which helps alleviate the vanishing gradient problem, and all the biases start at zero (the tf.layers.dense() function does both of these by default).
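
For illustration, here is the first hidden layer with those defaults spelled out explicitly. This is equivalent to the definition above, not an addition to it (tf.glorot_uniform_initializer is TensorFlow 1.x's name for Xavier initialization):

hidden1 = tf.layers.dense(
    x, n_hidden1, name="hidden1", activation=tf.nn.relu,
    kernel_initializer=tf.glorot_uniform_initializer(),  # Xavier weights (the default).
    bias_initializer=tf.zeros_initializer())             # Zero biases (the default).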

If we feed an image through this architecture, we get a prediction. During training, however, the prediction is not always right, so we need a way to measure that error. Cross-entropy is used to capture it; in other words, cross-entropy is our loss function. The following code does this for us.

# defining loss function.
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
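
To build some intuition, here is a tiny standalone illustration (made-up numbers, not part of the model code) of what sparse softmax cross-entropy computes for one training example:

import numpy as np

# Hypothetical logits for one image whose true label is 2.
logits_example = np.array([1.0, 2.0, 5.0, 0.5, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7])
true_label = 2

probs = np.exp(logits_example) / np.sum(np.exp(logits_example))  # softmax
print(-np.log(probs[true_label]))  # cross-entropy loss for this example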

Now that we have calculated the error, we use backpropagation to compute the partial derivative of that error with respect to each weight, propagating the error backwards layer by layer until we reach the input layer. We then use mini-batch gradient descent to minimize the error. The following code does this for us.

# defining the neural network optimizer: the gradient descent.
learning_rate = 0.01
with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)
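
Under the hood, minimize() is just two steps: computing the gradients via backpropagation and applying a gradient descent update. The following equivalent form (a sketch, not the original code) makes that explicit:

# Equivalent to training_op = optimizer.minimize(loss):
grads_and_vars = optimizer.compute_gradients(loss)       # backpropagation
training_op = optimizer.apply_gradients(grads_and_vars)  # gradient descent step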

We also want to measure the accuracy of the whole classifier, i.e. the neural network, so we add an evaluation operation to the graph. The following code does this for us.

# measuring the classifier's accuracy.
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)  # True wherever the top logit matches the label.
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
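
tf.nn.in_top_k(logits, y, 1) returns, for each example, whether the true label is the top prediction. A tiny standalone illustration with made-up values:

demo_logits = tf.constant([[0.1, 0.2, 0.9], [0.8, 0.1, 0.1]])  # argmax: 2, then 0
demo_labels = tf.constant([2, 1])
with tf.Session() as sess:
    print(sess.run(tf.nn.in_top_k(demo_logits, demo_labels, 1)))  # [ True False]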

However, in TensorFlow the code above only constructs the computation graph for the neural network. To actually train the network, we need to run the graph in a session:

init = tf.global_variables_initializer()  # Op that initializes all weights and biases.
saver = tf.train.Saver()  # Used to save the trained parameters to disk.

n_epochs = 20
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            x_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={x: x_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={x: x_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={x: mnist.test.images, y: mnist.test.labels})
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)

    save_path = saver.save(sess, "./checkpoints/my_model_final.ckpt")

Note: we use a batch size of 50, and the number of epochs for gradient descent is set to 20. A checkpoint is saved at the end so that all the learned parameters are readily available when predicting new instances.

However, we don't have a dataset yet, so we need to load it before the training code above is executed:

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/")
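
If you want to sanity-check what was loaded, the standard MNIST split looks like this (the images come back already flattened to 784 values and scaled to [0, 1]):

print(mnist.train.num_examples)   # 55000
print(mnist.train.images.shape)   # (55000, 784)
print(mnist.test.images.shape)    # (10000, 784)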

This model has about 89% accuracy. Not bad, right?

Step 3: Feeding the Image to the Trained Neural Network
This step is pretty easy: restore the checkpoint and feed the image through the neural network.

def return_prediction():
    with tf.Session() as sess:
        saver.restore(sess, './checkpoints/my_model_final.ckpt')
        x_new_scaled = [processed_image()]  # A batch of one flattened image.
        z = logits.eval(feed_dict={x: x_new_scaled})
        y_pred = np.argmax(z, axis=1)[0]  # The class with the highest logit.

        # Convert the logits to probabilities to report a confidence score.
        y_conf = tf.nn.softmax(z).eval()
        y_conf_list = [("%.3f" % i) for i in y_conf[0]]

    print("Digit predicted is", y_pred)
    print("Prediction confidence", y_conf_list[y_pred])
    return y_pred, y_conf_list[y_pred]
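
The GUI would call this function once the drawing has been saved; a minimal (hypothetical) usage looks like:

digit, confidence = return_prediction()
print("Predicted %s with confidence %s" % (digit, confidence))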

Aha! There we go. We just built a classifier that recognizes English handwritten digits with 89% accuracy. Please click here to get all the working code.
Click here to view the research paper.

Shoot me an email/message with any suggestions or questions.
