foundation:gsoc_chillara


Executive Summary

Abstract: This project aims to create a module that can be integrated into the current framework, in which forms can be generated and a user can upload a scanned image and then get an interface where the scanned information is displayed on screen alongside the corresponding image segments for manual verification. Since training on the data set is the most critical factor affecting accuracy, an automated character-training sub-module shall be incorporated into this module.
Student: Suryajith Chillara
Mentor(s): Michael Howden and Praneeth Bodduluri

Code

* bzr branch
* Personal hg repo of the same

Functional Specifications

Users

The module has a single kind of user, the end user. No superuser is required.

Data Model

The data on the form has to be organized so as to minimise the amount of handwritten character recognition. Common data can therefore be handled with check boxes alone, while specific details such as a name have to be entered as text input. In discussion, Fran said that some free-text input boxes, like the comment boxes on the web page for additional information, also have to be present. These are the input types we might deal with:

Strings for names.
Integers for various data.
Check boxes.

Check boxes shall first be read when marked with an 'X', and shall later be tested with a completely filled circular dot.
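One simple way to decide whether a check box is marked, whether with an 'X' or with a filled dot, is to measure how much of the cell is covered in ink. The sketch below is only an illustration under stated assumptions: the cell is already cropped inside its printed border and binarized into rows of 0 (paper) and 1 (ink), and the names `ink_ratio`, `is_checked`, `make_x` and the 0.15 threshold are hypothetical, not part of the module.

```python
def ink_ratio(cell):
    """Return the fraction of ink pixels in a binarized cell (1 = ink, 0 = paper)."""
    total = sum(len(row) for row in cell)
    if total == 0:
        return 0.0
    dark = sum(sum(row) for row in cell)
    return dark / total


def is_checked(cell, threshold=0.15):
    """Treat the box as marked when enough of it is covered in ink.

    The threshold is an assumed value and would need tuning on real scans.
    """
    return ink_ratio(cell) >= threshold


def make_x(size):
    """Build a synthetic 'X' mark (both diagonals) for trying the detector."""
    return [
        [1 if col in (row, size - 1 - row) else 0 for col in range(size)]
        for row in range(size)
    ]


if __name__ == "__main__":
    empty = [[0] * 8 for _ in range(8)]
    dot = [[1] * 8 for _ in range(8)]
    print(is_checked(make_x(8)), is_checked(dot), is_checked(empty))
```

An ink-coverage test does not care about the shape of the mark, so the same check would cover both the 'X' and the filled dot; telling the two apart would need a different measure.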

Flowcharts

Sahana-Eden → xforms → [ Parse xforms + form a layout ] → pdf
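The "Parse xforms + form a layout" step could look roughly like the following. This is a minimal sketch, not the module's actual code: the sample form, the function name `layout_fields`, and the row positions (PDF points, measured from the bottom of the page) are all assumptions made for illustration. Only the W3C XForms namespace and element names (`input`, `select1`, `select`, `label`, `item`) come from the standard.

```python
import xml.etree.ElementTree as ET

# W3C XForms namespace, in ElementTree's "{uri}tag" notation.
XF = "{http://www.w3.org/2002/xforms}"

# Hypothetical sample form, used only to exercise the parser.
SAMPLE = """
<h:html xmlns:h="http://www.w3.org/1999/xhtml"
        xmlns:xf="http://www.w3.org/2002/xforms">
  <h:body>
    <xf:input ref="name"><xf:label>Name</xf:label></xf:input>
    <xf:input ref="age"><xf:label>Age</xf:label></xf:input>
    <xf:select1 ref="gender">
      <xf:label>Gender</xf:label>
      <xf:item><xf:label>Male</xf:label><xf:value>m</xf:value></xf:item>
      <xf:item><xf:label>Female</xf:label><xf:value>f</xf:value></xf:item>
    </xf:select1>
  </h:body>
</h:html>
"""


def layout_fields(xml_text, top=750, row_height=40):
    """Collect form controls in document order and give each one a row."""
    root = ET.fromstring(xml_text.strip())
    controls = (XF + "input", XF + "select1", XF + "select")
    rows = []
    y = top
    for el in root.iter():
        if el.tag not in controls:
            continue
        rows.append({
            "ref": el.get("ref"),
            "label": el.findtext(XF + "label", default=""),
            # Text inputs become character boxes; selects become check boxes.
            "kind": "text" if el.tag == XF + "input" else "checkboxes",
            "options": [item.findtext(XF + "label", default="")
                        for item in el.findall(XF + "item")],
            "y": y,
        })
        y -= row_height
    return rows


if __name__ == "__main__":
    for row in layout_fields(SAMPLE):
        print(row)
```

The resulting list of rows is what a PDF back end would then draw; keeping the layout as plain data also means the same positions can be reused later to cut the scanned image into fields.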

Scanned Images → [OCR blackbox] → User interface to correct/verify → xml → Sahana-Eden
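The "OCR blackbox" stage can be as thin as a call to the Tesseract command-line tool, which takes an image and an output base name and writes the recognized text to `<output base>.txt`. The sketch below assumes the `tesseract` binary is on the PATH and that each form field has already been cropped into its own image. The function names are hypothetical, and it uses `subprocess.run` with an argument list, which is a substitution made here so that file names are not passed through a shell.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, output_base, lang="eng"):
    """Run Tesseract on one image and return the recognized text.

    Requires the tesseract binary to be installed; raises
    subprocess.CalledProcessError if the command fails.
    """
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract appends ".txt" to the output base name.
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the command shape.
    print(" ".join(tesseract_command("field_01.tif", "out/field_01")))
```

Running the OCR once per cropped field, each with its own output base, gives one text file per field, which is what the correction UI needs in order to pair each text box with its image.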

Menu

Menus are screen-specific and shall be explained under each screen.

Screens

Correction UI

API

The end user shall be given an interface for uploading the scanned image, followed by a UI in which the recognized text can be compared against the image and corrected.

Technologies

Tesseract
Apache-FOP / rst2pdf / ReportLab

Open Issues

Comments

Meeting Schedule

Thursdays at 1600 hrs UTC

Timeline

Project timeline
SMART goal	Measure	Due date	Comments
Xforms to pdf	OCR Forms generated.	15th June	DONE
Correction UI	Web UI with a text box and a corresponding image for every element	28th June	IN PROGRESS
Tesseract integration	Tests with printed data.	5th July	TO BE DONE
Automated training sub-module	A web UI for the same	20th July	TO BE DONE
Web UI	Making some necessary modifications to the UI	2nd August	TO BE DONE
Testing	Accuracy	9th August	TO BE DONE

Updates

I worked on the form generation. One way to print a form is the W3C-standard approach of feeding an XML file and an XSL style sheet to Apache-FOP to generate PDFs. I tried this out but couldn't do everything I wanted for an OCR-compatible sheet, especially the form decorations used for alignment. I have instead used XML parsers and written a module to convert xforms into PDFs.
I have hooked up Tesseract with Python, so the forms can now be parsed by calling it through the os.system function. The parsed data now has to be written to the specific related files instead of being dumped as one block for the whole form.
The other thing I have been looking into is the training of the OCR engine. It has to be provided with training data for proper evaluation. The literature on training the engine that I have read is listed at the bottom of the page. The tesseract tools page has various shell scripts that help generate the training data and train Tesseract, whereas Training Tesseract and the page Tesseract - Summary explain how to generate the data and train Tesseract manually.
A training form has been generated, which the user fills in with his or her own handwriting. Ten samples of each letter have to be filled in. The box file that has to be generated shall be standardized: the segments inside the printed boxes are the letters to consider. The subsequent steps of generating a unicharset, clustering, and putting it all together shall be automated. No dictionary data is needed here.
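Because the training form is printed with a fixed grid, the box file can in principle be computed from the layout instead of being drawn by hand. The sketch below illustrates that idea under several assumptions: one letter per grid row, ten samples per row, and grid dimensions chosen arbitrarily in pixels. It emits lines in the Tesseract box-file shape of a character followed by left, bottom, right and top coordinates measured from the bottom-left corner of the image; newer Tesseract versions append a page number, which the optional `page` argument covers. The function name and all dimensions are hypothetical.

```python
def training_boxes(letters, samples=10, cell=60, margin=8,
                   origin=(50, 50), page=None):
    """Return box-file lines for a grid with one letter per row.

    letters: the characters printed as row prompts, top row first.
    cell:    side length of one grid cell in pixels (assumed).
    margin:  inset that keeps the printed cell border out of the box.
    origin:  bottom-left corner of the grid in image coordinates.
    page:    page index to append, or None to omit it.
    """
    x0, y0 = origin
    lines = []
    for row, ch in enumerate(letters):
        # Box files count y upwards, so the first row has the largest y.
        bottom = y0 + (len(letters) - 1 - row) * cell + margin
        top = bottom + cell - 2 * margin
        for col in range(samples):
            left = x0 + col * cell + margin
            right = left + cell - 2 * margin
            fields = [ch, left, bottom, right, top]
            if page is not None:
                fields.append(page)
            lines.append(" ".join(str(f) for f in fields))
    return lines


if __name__ == "__main__":
    for line in training_boxes("ab", samples=3):
        print(line)
```

This only works if the scan is aligned with the printed grid, which is one reason the alignment decorations on the form matter.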
