Google Summer of Code 2010: Sahana OCR
Executive Summary
- Abstract : Collecting and entering data by hand is one of the most painful tasks during a large disaster. Sahana OCR is a promising tool for this problem. For an OCR module, reliability and consistency are the major concerns. By improving these two characteristics, the Sahana OCR module can be used effectively whenever and wherever a disaster occurs.
- Current Status : During a disaster, data is collected on forms distributed to the victims, which has been the most successful way of collecting data in such situations. The Sahana OCR module scans these forms using ScannerManager and creates the form images. The form images are then processed with FormProcessor and ImageProcessor to extract the data fields and then the letter boxes within them. Currently, character recognition is done by a neural network built with the FANN library, but its accuracy is very poor because the network has not been trained enough.
The following information about the current status is taken from the document written by Gihan Chamara during last year's GSoC session.
Design :
Several factors were considered when designing the system:
- Separate platform-dependent components from platform-independent ones.
- Accommodate future upgrades to major components independently.
- Stay independent of the application form.
Process :
The XForm is loaded into memory. Then an image from the scanner or disk is loaded into memory. FormProcessor preprocesses, aligns, and segments the image using the ImageProcessor library. The segmented letters are passed to the OCR, and FormProcessor creates the result XML from the OCR output. The result XML is passed to NtMngr and uploaded to the server. Before uploading, the user can review the result and make any necessary corrections. The following figure shows the overall process graphically.
Current implementation : The current application was developed to run on Microsoft Windows and the .NET 2.0 framework. XFormParser, ImageProcessor, and ScannerMngr are complete; the OCR and the user interface are partially developed.
XFormParser:
This is the part that handles XForm XML files. The component consists of the following classes:
- XForm
- XFormParser
- DataField
- TextArea
- XFormDomErrorHandler
This module loads an XML file into an in-memory XForm data structure. FormProcessor later uses the XForm structure to segment the scanned image. The module was implemented using the Xerces open source XML library.
ImageProcessor: All image processing methods are enclosed in this component. Its main objective is to segment the scanned image into letters. All the image processing needed for segmentation and page alignment has been implemented using the OpenCV open source vision library. Most of the methods were written by the earlier developers.
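As an illustration only, the in-memory structure could look something like the C++ sketch below. The class names XForm, DataField, and TextArea come from the list above, but every member name and the helper `totalLetterBoxes` are assumptions made for this example and are not taken from the real SahanaOCR sources.

```cpp
#include <string>
#include <vector>

// Hypothetical layout: member names are assumptions, not the project's real fields.
struct TextArea {
    int x;            // left edge of the input area on the form, in pixels
    int y;            // top edge of the input area, in pixels
    int width;        // width of the input area
    int height;       // height of the input area
    int letterCount;  // number of letter boxes inside this input area
};

struct DataField {
    std::string name;                 // field name as given in the XForm
    std::string dataType;             // for example "String" or "number"
    std::vector<TextArea> textAreas;  // input areas belonging to this field
};

struct XForm {
    std::string formId;             // hypothetical identifier of the form
    std::vector<DataField> fields;  // all data fields described by the XML
};

// Count every letter box the segmentation step is expected to produce.
int totalLetterBoxes(const XForm& form) {
    int total = 0;
    for (const DataField& field : form.fields) {
        for (const TextArea& area : field.textAreas) {
            total += area.letterCount;
        }
    }
    return total;
}
```

A count like this could serve as a quick sanity check that the segmentation produced as many letter images as the XForm describes.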
OCR:
This is the most critical component in the system. Earlier this was done using an artificial neural network, but it could recognize digits only. The FANN framework was used for an efficient neural network implementation. In the current system I also used the earlier OCR with some modifications, but it is still not working properly. We need some context recognition and dictionary lookup.
ScannerMngr:
This is a wrapper module for the TWAIN API. The library provides functions to query the available scanners, connect to the required scanner, and configure and view scanner properties. The application can then scan continuously if the scanner supports auto feed, or one page at a time otherwise. This component is complete with the required functionality.
NtMngr:
This component deals with the network. Its main functions are downloading the correct XForm XML from the server and uploading results to the server. For a given path, it loads the XML that describes the server and the available XForms, so that the user can select the proper XForm from the list. The module also has to handle proxy servers, if there are any. This component still has to be developed. Currently the XML is read from disk.
FormProcessor :
This is the class that coordinates all the above components. This part has been completed. FormProcessor segments the image according to the XForm using the ImageProcessor library, passes the segmented images to the OCR, and creates the result XML from the OCR output. Finally, the result XML will be uploaded to the server through the NtMngr module.
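The last step, turning OCR output into a result XML, can be sketched roughly as follows. This is a minimal illustration that assumes a made-up layout: the element names `result` and `field`, and the functions `escapeXml` and `buildResultXml`, are hypothetical and do not describe the actual format SahanaOCR sends to the server.

```cpp
#include <string>
#include <utility>
#include <vector>

// Escape the characters that are not allowed verbatim in XML text or attributes.
std::string escapeXml(const std::string& text) {
    std::string out;
    for (char c : text) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '"': out += "&quot;"; break;
            default:  out += c; break;
        }
    }
    return out;
}

// Build a result document from (field name, recognized text) pairs.
// The <result>/<field> layout is an assumption made for this sketch.
std::string buildResultXml(
        const std::vector<std::pair<std::string, std::string>>& fields) {
    std::string xml = "<result>\n";
    for (const auto& field : fields) {
        xml += "  <field name=\"" + escapeXml(field.first) + "\">"
             + escapeXml(field.second) + "</field>\n";
    }
    xml += "</result>\n";
    return xml;
}
```

Escaping matters here because recognized text can contain characters such as `&`, which would otherwise produce XML that the server cannot parse.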
MainApplication:
A user-friendly interface with several views. The interface is partially completed.
SahanaOCR Application
The current application can scan an image, load the XForm XML from disk, and then process the scanned image. The logging and progress views are not yet completed.
When the application starts, the user has to select the correct scanner from the list of available scanners, or a folder on disk that contains scanned images. If needed, the user can view or change the scanner configuration. Then the correct XForm XML has to be selected. When the user enters the correct server path, the application lists the available XForms. The user can then select the proper one or select an XForm from disk.
After selecting the scanner and the XForm, the user can process forms one by one, or continuously if the scanner supports auto paper feeding. The user can choose direct uploading, or buffer the processed forms until they have been manually reviewed and any OCR mistakes corrected.
ScreenShots
- Student : Thilanka Kaushalya.
- Mentor(s) : Gihan Chamara , Jo Fonseka, Chamindra de Silva, and Hayesha Somorathne.
Code
Progress
- I tested Tesseract with the existing system and achieved good recognition accuracy. Sample Code
- I followed the Tesseract training process to measure how well it can be trained for handwritten letters. Testing Results
Project plan and Timeline
- My project plan is to organize SahanaOCR as a complete module that can handle the whole data entry process with high accuracy.
The basic project ideas are as follows.
- Integrate the Tesseract code into the project.
- Distinguish forms and pages from each other and have the system identify them itself, to automate sending the data to the corresponding modules.
- Make the system platform independent.
These are the timelines allocated to specific tasks.

The official coding period started on 24th May.

Goal | Measure | Due date | Status |
---|---|---|---|
Integrating Tesseract with SahanaOCR | The system accurately recognizes letters using Tesseract | 06/20/2010 | Completed |
Training Tesseract for handwritten characters | The system accurately recognizes handwritten letters using Tesseract | 06/30/2010 | Completed (but there was no significant improvement in the accuracy of Tesseract, so another data set has to be tried) |
Solving the identified issue when processing rotated images | Rotated images are processed correctly and the data is recognized correctly | 07/10/2010 | Completed |

Midterm evaluation: 12th July to 16th July.

Goal | Measure | Due date | Status |
---|---|---|---|
Developing a UI for the system | The whole process can be handled through the UI | 07/28/2010 | Completed |
More training of Tesseract for handwritten characters | The system accurately recognizes handwritten letters using Tesseract | 08/05/2010 | Completed (but there was no significant improvement in the accuracy of Tesseract) |
Integrating the scanner manager with the SahanaOCR UI | Images are scanned correctly and fed to SahanaOCR | 08/16/2010 | Completed |
Project implementation
Initially I went through sample implementations using wxWidgets for developing platform-independent programs.
After discussing the suggestions with my mentor and prioritizing them, he said that integrating Tesseract and training it for handwritten characters were the most important tasks. So I started integrating Tesseract as the optical character recognition engine.
Integrating Tesseract with SahanaOCR
What is Tesseract?
The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available. I went through the Tesseract documentation and got help from the forum to get a good understanding of the Tesseract architecture. I then combined the OpenCV library files with the Tesseract code for image handling and used the built Tesseract library files to call its functions through the Tesseract API (TessBaseAPI). I successfully replaced the existing FANN neural network functions with Tesseract and made major changes to the FormProcessor class to send the segmented image data to Tesseract. Accuracy improved considerably with the Tesseract integration, but it was still short of 100%. So I had to train Tesseract on handwritten data to improve the accuracy.
Training Tesseract for handwritten letters
At the start I used a handwritten data set available on the web to build the tessdata folder of Tesseract trained data. Gihan (mentor) helped me create the images from the data set, and I created the necessary tessdata files from those images.
A sample image I used to train Tesseract for the handwritten letter “q”
However, the accuracy did not improve at that stage, and most of the time Tesseract returned a segmentation fault on the images. So I then tried a data set that I had written myself.
A portion of a sample image I wrote to train Tesseract for handwritten letters
Recognize only letters or digits
To improve the accuracy further, I added a new feature to the system using Tesseract: reading the data field type from the XForm and restricting recognition accordingly. When the data field type is read as
<dataType> String </dataType>
I restricted the characters that can appear in the data field to letters, as follows:
api.SetVariable("tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZ");
After that, Tesseract matches the images against letters only. Similarly, if the data field is a number,
<dataType>number</dataType>
I restricted the output to digits. This eliminated many ambiguous results that mixed letters and numbers, and it improved the accuracy of the system.
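The choice of whitelist from the data type can be sketched as a small helper. This is only an illustration: the function names `normalizeDataType` and `whitelistFor` are hypothetical, and the digit whitelist `0123456789` is an assumption, since the text above only states that the output was restricted to numbers.

```cpp
#include <cctype>
#include <string>

// Strip whitespace and lowercase the text, so that " String " and "string" compare equal.
std::string normalizeDataType(const std::string& raw) {
    std::string out;
    for (unsigned char c : raw) {
        if (!std::isspace(c)) {
            out += static_cast<char>(std::tolower(c));
        }
    }
    return out;
}

// Return the characters Tesseract should be allowed to output for a data type.
// An empty result means "no restriction": the caller leaves the whitelist unset.
std::string whitelistFor(const std::string& rawDataType) {
    const std::string type = normalizeDataType(rawDataType);
    if (type == "string") {
        return "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    }
    if (type == "number") {
        return "0123456789";  // assumed digit set, not quoted from the project
    }
    return "";
}
```

The returned string would then be passed as the second argument of the `api.SetVariable("tessedit_char_whitelist", ...)` call before recognizing the letter boxes of that field.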
Modifying the rotation compensation functions
While working with the system I noticed that SahanaOCR was unable to process images rotated by about 5 degrees. It could only turn an upside-down image over and process it. For images rotated between -5 and 5 degrees or between 175 and 185 degrees, the system failed to validate the forms against the XForms. So I had to modify the algorithm used to rotate the images, after which it rotated them correctly. The following two images show the correct rotation of the images by the system.
Original image rotated by 175 degrees and the corresponding image correctly rotated by the system
With rotated images, the data field coordinates deviate slightly, so there were some errors in the segmented letter boxes. We plan to handle this by applying an improved algorithm, which I list in the To do section.
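The geometry behind rotation compensation can be illustrated with a short sketch. It is not the algorithm used in SahanaOCR, which works on images through OpenCV; it only shows, under that caveat, how a measured rotation such as 175 or 185 degrees maps to a correction angle and how a data field coordinate moves under that correction. The names `normalizeDegrees`, `correctionDegrees`, and `rotateAbout` are hypothetical.

```cpp
#include <cmath>

struct Point {
    double x;
    double y;
};

// Wrap any angle in degrees into the range (-180, 180].
double normalizeDegrees(double degrees) {
    double a = std::fmod(degrees, 360.0);
    if (a > 180.0) a -= 360.0;
    if (a <= -180.0) a += 360.0;
    return a;
}

// Angle to rotate the scanned image by so that the form ends up upright.
double correctionDegrees(double measuredDegrees) {
    return -normalizeDegrees(measuredDegrees);
}

// Rotate point p about center c by the given angle.
// Positive angles are counter-clockwise in standard math axes; with image
// coordinates, where y grows downward, the same formula looks clockwise on screen.
Point rotateAbout(Point p, Point c, double degrees) {
    const double pi = std::acos(-1.0);
    const double rad = degrees * pi / 180.0;
    const double dx = p.x - c.x;
    const double dy = p.y - c.y;
    return Point{c.x + dx * std::cos(rad) - dy * std::sin(rad),
                 c.y + dx * std::sin(rad) + dy * std::cos(rad)};
}
```

For example, a form measured at 185 degrees is treated the same as one at -175 degrees, so the correction is a rotation by 175 degrees rather than a plain upside-down flip.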
Designing the UI
Next I worked on improving the UI features. After discussing with the mentor, I started by combining the main functionalities:
- loading images from the file system
- loading XForms into the system
- processing the form and getting results
After that I designed the Log Form to show how the forms are being processed. The Log Form contains the following features.
- Showing the segmented letter boxes in a picture box
- Showing the recognized letter corresponding to each segmented image
- Showing the full results of the form while processing
The following screenshot shows the design of the Log Form.
Screenshot of the Log Form of the UI while running a process
Then I started integrating the Scanner Manager into the UI, so that images can be loaded directly from scanners. This lets us automate loading forms into the system.
The images are now correctly loaded into the system using the Scanner Manager.
Finally, I identified some more functionality to add to the system to make it more usable.
To do
These are the features I have identified for improving the system in the future.
- To improve the accuracy of the output, we have to create a proper training data set for handwritten characters for Tesseract.
- To handle rotated images we have to change the current algorithm. The current algorithm first recognizes the form from the 5 black boxes at the edges of the image. It then extracts the form section bounded by those boxes and extracts the data fields, input areas, and letter boxes according to coordinates measured from them. But if the position of any data field deviates slightly, the areas inside it are not segmented correctly. So we have to change the algorithm to first extract an area slightly larger than the data field, then find the edges of the data field using image processing, and then process the fields within it. This may remove the issues with rotated images.
- Complete NtMngr and complete the system so that it uploads the recognized data to the corresponding module.
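The first step of the proposed change for rotated images, extracting an area slightly larger than the data field, could look roughly like this. It is a sketch under assumptions: the `Rect` type, the function name `expandAndClamp`, and the fixed pixel margin are hypothetical, and the later edge detection step is not shown.

```cpp
#include <algorithm>
#include <cassert>

struct Rect {
    int x;       // left edge in pixels
    int y;       // top edge in pixels
    int width;
    int height;
};

// Grow the expected data field rectangle by a margin on every side,
// then clip it so that it stays inside the image.
Rect expandAndClamp(const Rect& field, int margin, int imageWidth, int imageHeight) {
    assert(margin >= 0);
    const int left   = std::max(0, field.x - margin);
    const int top    = std::max(0, field.y - margin);
    const int right  = std::min(imageWidth,  field.x + field.width  + margin);
    const int bottom = std::min(imageHeight, field.y + field.height + margin);
    return Rect{left, top, std::max(0, right - left), std::max(0, bottom - top)};
}
```

The real edges of the data field would then be searched for inside the returned region, so a small shift of the field no longer pushes its letter boxes outside the cropped area.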
User Guide
This is the link for the user guide for the features that are provided by the existing SahanaOCR system.
http://wiki.sahanafoundation.org/doku.php/wiki:user:lgtkaushalya
This is the link for the video demo of the current SahanaOCR application
http://www.youtube.com/watch?v=Zl3KR8QEHyI
Here is the link for the progress report of the SahanaOCR project during Gsoc 2010
Conclusion
SahanaOCR is a system that came out of innovative ideas, and it applies some new concepts in a practical scenario. Working with the system during the project period, I gained a lot of knowledge, and it was a fascinating period of my life. Gihan, Jo, Chamindra, Michel, Hayesha, and Suryagith helped me a lot, and others from the community helped too. I'm willing to keep working on the project and finish it as a complete project in the near future.
- The weekly meetings are scheduled on Saturdays at 1530 UTC. Calendar
- A weekly report, containing the progress of the project during that week, will be sent to the mentor and the mailing list every Saturday.