Sahana OCR

Introduction

NOTE: Sahana OCR is currently under development and may change before the final release.

Sahana OCR is a module designed to assist with data collection and entry in disaster zones. It is intended to address one of the most common and serious problems in disaster response: gathering standardized, reliable data. The final product aims to be accurate, consistent, and user friendly. Once completed, the OCR will be easy to deploy across large disaster zones and will help create a more organized recovery process.

Process

The process the OCR uses to read and enter data is outlined in the steps below.

  1. The XForm is loaded into memory.
  2. The image of the form is loaded into memory, either from a scanner or from disk.
  3. The FormProcessor uses the ImageProcessor library to align the image and segment it into individual characters.
  4. The OCR reads each character and returns the result to the FormProcessor.
  5. The FormProcessor creates an XML file containing the data.
  6. The data is shown to the user for any necessary corrections.
  7. The data is uploaded to a server where all the data is kept.

Current Status

The XFormParser, ImageProcessor, and ScannerMngr modules are complete. The OCR and user interface are under development. No release date has been set for the final module yet, but it is being actively developed and improved to make it a friendlier and more helpful tool. If you would like to help or contribute in any way to the Sahana OCR project, or any part of the Sahana initiative, please visit the Sahana webpage at http://sahanafoundation.org/

The current application can scan an image, load XForm XML from disk, and process the scanned image, but the logging and progress views are not yet complete. When the application starts, the user selects either the correct scanner from the list of available scanners or a folder on disk that contains the scanned images. If needed, the user can view or change the scanner configuration. Next, the correct XForm XML has to be selected: when the user enters a valid server path, the application lists the available XForms, and the user can pick one or load an XForm from disk. After selecting a scanner and an XForm, the user can process forms one by one, or continuously if the scanner supports automatic paper feeding. The user can choose to upload results directly or to buffer processed forms until he or she has manually reviewed and corrected any mistakes made by the OCR.

The major issues currently facing the OCR are the accuracy of character recognition and the usability of the GUI.

Modules

XFormParser:

This component handles XForm XMLs and consists of the following classes:

  • XForm
  • XFormParser
  • DataField
  • TextArea
  • XFormDomErrorHandler

This module loads an XML file into an XForm data structure in memory. The XForm structure is later used by the FormProcessor to segment the scanned image. The module was implemented using the Xerces open source XML library.
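
For illustration, a minimal Xerces-C++ sketch of loading an XForm XML into a DOM tree might look as follows; the XForm and DataField classes belong to this module, so the DOM walk that fills them is elided.

<code cpp>
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLString.hpp>
#include <xercesc/parsers/XercesDOMParser.hpp>
#include <xercesc/dom/DOM.hpp>

using namespace xercesc;

int main() {
    XMLPlatformUtils::Initialize();
    {
        XercesDOMParser parser;
        parser.setValidationScheme(XercesDOMParser::Val_Never);
        parser.parse("xform.xml");                 // path to the XForm XML
        DOMDocument* doc = parser.getDocument();
        DOMElement* root = doc->getDocumentElement();
        char* tag = XMLString::transcode(root->getTagName());
        // ... walk the child nodes here and fill the XForm/DataField structures ...
        XMLString::release(&tag);
    }   // the parser must be destroyed before Terminate()
    XMLPlatformUtils::Terminate();
    return 0;
}
</code>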

ImageProcessor:

All image processing methods are enclosed in this component. Its main objective is to segment the scanned image into letters. All of the image processing needed for segmentation and page alignment has been implemented using the OpenCV open source vision library. Most of the methods were written by the earlier developers.
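
As a rough illustration of the segmentation idea only (the actual module's functions differ, and the size thresholds below are made-up values), a sketch using OpenCV's C++ API:

<code cpp>
#include <opencv2/opencv.hpp>
#include <vector>

// Find candidate letter boxes in a grayscale scan of a form.
std::vector<cv::Rect> findLetterBoxes(const cv::Mat& gray) {
    cv::Mat bin;
    cv::threshold(gray, bin, 0, 255, cv::THRESH_BINARY_INV | cv::THRESH_OTSU);

    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(bin, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    std::vector<cv::Rect> boxes;
    for (const auto& c : contours) {
        cv::Rect r = cv::boundingRect(c);
        if (r.width > 10 && r.height > 10)   // drop specks and noise
            boxes.push_back(r);
    }
    return boxes;
}
</code>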

OCR:

This is the most critical component in the system. Earlier, recognition was done with an artificial neural network, but it could only recognize digits. The FANN framework was used for an efficient neural network implementation. The current system reuses the earlier OCR with some modifications, but it still does not work reliably; some context recognition and dictionary lookup are needed.
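
To show the shape of a FANN-based digit recognizer, here is a minimal sketch; the layer sizes and the training file name are assumptions for illustration, not the project's actual values.

<code cpp>
#include <fann.h>

int main() {
    // Illustrative sizes: an 8x8 glyph = 64 inputs, one hidden layer, 10 digit outputs.
    struct fann* ann = fann_create_standard(3, 64, 32, 10);
    fann_set_activation_function_hidden(ann, FANN_SIGMOID_SYMMETRIC);
    fann_set_activation_function_output(ann, FANN_SIGMOID_SYMMETRIC);

    // digits.data (hypothetical) is in FANN's plain-text training format.
    fann_train_on_file(ann, "digits.data", 500, 10, 0.001f);

    fann_type input[64] = {0};              // normalized pixels of one glyph
    fann_type* out = fann_run(ann, input);  // the largest output is the digit
    (void)out;
    fann_destroy(ann);
    return 0;
}
</code>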

ScannerMngr:

This is a wrapper module for the TWAIN API. The library provides functions to enumerate available scanners, connect to the required scanner, and configure and view scanner properties. The application can then scan continuously if the scanner supports automatic feeding; otherwise scanning is done page by page. This component is complete with the required functionality.
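
A minimal sketch of how TWAIN source enumeration typically works on Windows (not the module's actual code; error handling is omitted and hwnd is assumed to be a valid window handle):

<code cpp>
#include <windows.h>
#include "twain.h"
#include <cstdio>

void listScanners(HWND hwnd) {
    HMODULE dsm = LoadLibrary(TEXT("TWAIN_32.DLL"));
    DSMENTRYPROC DSM_Entry = (DSMENTRYPROC)GetProcAddress(dsm, "DSM_Entry");

    TW_IDENTITY app = {0};                  // describes our application to TWAIN
    app.ProtocolMajor = TWON_PROTOCOLMAJOR;
    app.ProtocolMinor = TWON_PROTOCOLMINOR;
    app.SupportedGroups = DG_CONTROL | DG_IMAGE;

    DSM_Entry(&app, NULL, DG_CONTROL, DAT_PARENT, MSG_OPENDSM, (TW_MEMREF)&hwnd);

    TW_IDENTITY src = {0};
    TW_UINT16 rc = DSM_Entry(&app, NULL, DG_CONTROL, DAT_IDENTITY,
                             MSG_GETFIRST, (TW_MEMREF)&src);
    while (rc == TWRC_SUCCESS) {            // walk the installed sources
        printf("%s\n", src.ProductName);
        rc = DSM_Entry(&app, NULL, DG_CONTROL, DAT_IDENTITY,
                       MSG_GETNEXT, (TW_MEMREF)&src);
    }
    DSM_Entry(&app, NULL, DG_CONTROL, DAT_PARENT, MSG_CLOSEDSM, (TW_MEMREF)&hwnd);
    FreeLibrary(dsm);
}
</code>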

NtMngr:

This component deals with the network. Its main functions are downloading the correct XForm XML from the server and uploading the results to the server. For a given path, it loads an XML file that describes the server and the available XForms so the user can select the proper XForm from the list. The module also has to handle proxy servers, if there are any. This component still has to be developed; currently, the XML is read from disk.
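
Since this component is still to be written, the following is only one plausible sketch of the download half, using libcurl; the helper name and overall design are assumptions.

<code cpp>
#include <curl/curl.h>
#include <string>

// Hypothetical helper: fetch the XForm list XML from the server into a string.
static size_t writeToString(char* ptr, size_t size, size_t nmemb, void* userdata) {
    static_cast<std::string*>(userdata)->append(ptr, size * nmemb);
    return size * nmemb;
}

std::string downloadXFormList(const std::string& url, const std::string& proxy) {
    std::string body;
    CURL* curl = curl_easy_init();
    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    if (!proxy.empty())
        curl_easy_setopt(curl, CURLOPT_PROXY, proxy.c_str());  // honour proxy servers
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writeToString);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
    curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    return body;
}
</code>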

FormProcessor:

This is the class that coordinates all of the above components, and it has been completed. The FormProcessor segments the image according to the XForm using the ImageProcessor library, passes the segmented images to the OCR, and creates a result XML from the OCR output. Finally, the result XML is uploaded to the server through the NtMngr module.
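
In outline, the coordination looks roughly like the sketch below; every class and method name here is an illustrative stand-in, not the actual FormProcessor interface.

<code cpp>
// Illustrative pseudo-C++ only: XForm, DataField, ImageProcessor, Ocr and
// ResultXml stand in for the real classes, whose interfaces differ.
ResultXml processForm(const XForm& form, const cv::Mat& scan) {
    cv::Mat aligned = ImageProcessor::alignPage(scan);        // align the page
    ResultXml result;
    for (const DataField& field : form.fields()) {
        // cut the field into individual letter boxes
        for (const cv::Rect& box : ImageProcessor::segment(aligned, field)) {
            char c = Ocr::recognize(aligned(box));            // recognize one glyph
            result.append(field, c);                          // accumulate into XML
        }
    }
    return result;   // later uploaded to the server through NtMngr
}
</code>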

Todo

These are the features I have identified to improve the system further in the future:

  • Create a proper training dataset of handwritten characters for Tesseract, to improve the accuracy of the outputs.
  • Change the current algorithm to handle rotated images. The current algorithm first locates the form using the five black boxes at the edges of the image, extracts the form section bounded by those boxes, and then extracts the data fields, input areas, and letter boxes by their coordinates relative to those edges. However, if the position of any data field deviates even slightly, none of the areas inside it are segmented correctly. The algorithm should instead extract an area slightly larger than each data field, detect the field's edges with image processing, and then process the contents within those edges. This should remove the problems with rotated images.
  • Complete the NtMngr and finish the system so that it uploads the recognized data to its corresponding module.

Items in Development

Designing the UI

I have worked on improving the UI. After discussing it with my mentor, I started by combining the main functionality:

  • loading images from the file system
  • loading XForms into the system
  • processing the form and getting the results

After that, I designed the Log Form to show how the forms are being processed. The Log Form contains the following features:

  • showing the segmented letter boxes in a picture box
  • showing the recognized letter corresponding to each segmented image
  • showing the full results of the form while it is being processed

The following screenshot shows the design of the Log Form.

{{:foundation:logform_modified.jpg}}

I then started integrating the Scanner Manager option into the UI, so that images can be loaded directly from the scanners. Using this, we can automate loading forms into the system.

Images are now correctly loaded into the system through the Scanner Manager.

Finally, I identified some additional functionality to add so that the system will be more usable.

Improving Accuracy and Efficiency

Integrating Tesseract with SahanaOCR

What is Tesseract?

The Tesseract OCR engine was one of the top three engines in the 1995 UNLV accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available. I went through the Tesseract documentation and, with help from the forum, got a good understanding of the Tesseract architecture. I then combined the OpenCV library files with the Tesseract code for image handling, used Tesseract's built library files to communicate with its functions, and called them through the Tesseract API (TessBaseAPI). I successfully replaced the existing FANN neural network functions with Tesseract and made the major changes in the FormProcessor class needed to send the segmented image data to Tesseract. The accuracy of the results improved greatly with the Tesseract integration, but not to 100%, so I had to train Tesseract on handwritten data to improve it further.
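
For illustration, a minimal TessBaseAPI call for one segmented letter box might look like the sketch below; the helper name is hypothetical, and the real changes live in the FormProcessor class.

<code cpp>
#include <tesseract/baseapi.h>
#include <opencv2/opencv.hpp>

// Recognize one segmented letter box with Tesseract instead of the FANN network.
char recognizeLetter(const cv::Mat& glyph /* 8-bit grayscale */) {
    tesseract::TessBaseAPI api;
    api.Init(NULL, "eng");                       // or a custom handwriting tessdata
    api.SetPageSegMode(tesseract::PSM_SINGLE_CHAR);
    api.SetImage(glyph.data, glyph.cols, glyph.rows, 1, (int)glyph.step);
    char* text = api.GetUTF8Text();
    char result = text ? text[0] : '\0';
    delete[] text;
    api.End();
    return result;
}
</code>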

Training Tesseract for handwritten letters

To start, I used a handwritten data set available on the web to build the tessdata folder of Tesseract trained data. Gihan (my mentor) helped me create the images from the dataset, and I created the necessary tessdata files from those images.

A sample image I used to train Tesseract on the handwritten letter “q”.

The accuracy did not improve at that stage, and most of the time Tesseract returned a segmentation fault on the images. I then tried a data set that I wrote myself.

Recognize only letters or digits

To improve the accuracy further, I added a new feature using Tesseract: reading the data field type from the XForm and restricting recognition accordingly. When the data field type reads

<dataType>String</dataType>

I set the possible letters that can appear in the data field as follows:

api.SetVariable("tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZ");

After that, Tesseract only matches letters against the images. Similarly, if the data field is a number,

<dataType>number</dataType>

I restricted the output to digits. This eliminated many ambiguous results that mixed up letters and numbers, and it improved the accuracy of the system.
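
Putting the two cases together, the whitelist switch might look like the sketch below; setWhitelistFor is an illustrative helper, not the actual code.

<code cpp>
#include <tesseract/baseapi.h>
#include <string>

// Constrain Tesseract's output alphabet based on the XForm <dataType> of a field.
void setWhitelistFor(tesseract::TessBaseAPI& api, const std::string& dataType) {
    if (dataType == "String")
        api.SetVariable("tessedit_char_whitelist",
                        "ABCDEFGHIJKLMNOPQRSTUVWXYZ");
    else if (dataType == "number")
        api.SetVariable("tessedit_char_whitelist", "0123456789");
}
</code>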

Modifying the rotation compensation functions

While working with the system, I noticed that SahanaOCR was unable to process images rotated by about 5 degrees. It could only flip an image that was upside down and process it; for images rotated between -5 and 5 degrees or between 175 and 185 degrees, the system failed to validate the forms against the XForms. I therefore modified the algorithm used to rotate the images, after which it rotated them correctly. The following two images show the system correctly rotating an image.

Original image rotated to 175 degrees, and the corresponding image properly rotated by the system.

Even with correct rotation, the data field coordinates showed small deviations on rotated images, so there were some errors in the segmented letter boxes. We plan to handle this with an improved algorithm, listed in the Todo section above.
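
For the small-angle case, one common OpenCV approach is sketched below; this is an illustration, not the modified algorithm itself, and it assumes the ink pixels dominate the skew estimate. The 175-185 degree case still has to be detected separately, for example from the five black boxes.

<code cpp>
#include <opencv2/opencv.hpp>
#include <vector>

// Estimate the page skew from the ink pixels and rotate the scan back upright.
cv::Mat deskew(const cv::Mat& gray) {
    cv::Mat bin;
    cv::threshold(gray, bin, 0, 255, cv::THRESH_BINARY_INV | cv::THRESH_OTSU);

    std::vector<cv::Point> ink;
    cv::findNonZero(bin, ink);
    float angle = cv::minAreaRect(ink).angle;    // in (-90, 0]
    if (angle < -45.0f) angle += 90.0f;          // pick the smaller rotation

    cv::Point2f center(gray.cols / 2.0f, gray.rows / 2.0f);
    cv::Mat rot = cv::getRotationMatrix2D(center, angle, 1.0);
    cv::Mat upright;
    cv::warpAffine(gray, upright, rot, gray.size(),
                   cv::INTER_LINEAR, cv::BORDER_REPLICATE);
    return upright;
}
</code>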

