Executive Summary
  • Abstract: This project aims to create a module that can be integrated into the current framework, in which forms can be generated and a user can upload a scanned image and then get an interface where the scanned information is displayed on the screen along with the corresponding image segments for manual verification. Since the training data set is the most critical factor affecting accuracy, an automated character training sub-module shall be incorporated into this module.
  • Student: Suryajith Chillara
  • Mentor(s): Michael Howden and Praneeth Bodduluri
Code
Functional Specifications.

Users

There is only an end user who uses the module. Presence of a superuser is not required.

Data Model

The data of the form has to be organized so as to minimise the amount of handwritten character recognition required. Thus the common data can be handled with check boxes alone, while specific details such as names have to be text inputs. In discussion, Fran said that some free-text input boxes, like the comment boxes on the web page for additional info, also have to be present. These are the inputs we might deal with:

  1. Strings for names.
  2. Integers for various data.
  3. Check boxes.

The check boxes shall first be read when marked with an 'X' and shall later be tested with a completely filled circular dot.

Flowcharts

Sahana-Eden → xforms → [ Parse xforms + form a layout ] → pdf

Scanned Images → [OCR blackbox] → User interface to correct/verify → xml → Sahana-Eden

Menu

Menus are screen-specific and shall be explained along with the screens.

Screens Correction UI

API

An interface to the end user shall be provided where the user shall upload the scanned image and then get a UI where the user shall compare and correct.

Technologies

  1. Tesseract
  2. Apache-FOP / rst2pdf / ReportLab

Open Issues

<Shall be updated>

Comments

Meeting Schedule
Timeline
Project timeline
SMART goal | Measure | Due date | Comments
Xforms to pdf | OCR forms generated. | 15th June | DONE
Correction UI | Web UI with a text box and a corresponding image for every element | 28th June | (Postponed towards the end)
Tesseract integration | Tests with printed data. | 5th July | In Progress (Tesseract is integrated, testing the training data set integration too)
Automated training | A training form and automation scripts (a web UI for the same wasn't necessary) | 20th July | DONE
Web UI | Making some necessary modifications to the UI | 2nd August | TO BE DONE
Testing | Accuracy | 9th August | IN PROGRESS
Updates
  • I worked on the form generation. One way to print a form is the W3 standard approach of using an xml and an xsl style sheet as inputs for Apache-FOP to generate PDFs. I tried this system out but couldn't do all the things I wanted to have in an OCR-compatible sheet, especially the form decorations for alignment. I have used the xml parsers and written a module to convert xforms into PDFs. The location of the data is written into an xml document for easy parsing later.
  • I have hooked up tesseract with python, which enables the parsing of the forms via the os.system function. Now the parsed data has to be written specifically into the related files instead of a complete dump of the form.
  • The other thing I have been looking into is the training of the OCR engine. It has to be provided with training data for proper evaluation. The literature I have read on training the engine is mentioned at the bottom of the page. The tesseract tools page has various shell scripts which help generate the training data and thus train tesseract, whereas Training Tesseract and the page Tesseract - Summary explain how to generate the data and train tesseract manually.
  • A training form has been generated which the user fills in in his own handwriting. Here 10 samples of a particular letter have to be filled in. The box file which has to be generated shall be standardized; the segments in the boxes provided shall be the letters to consider. The subsequent steps of generating a unicharset, clustering, and putting it all together shall be automated. There is no need for dictionary data here.
  • Reading the following articles at the moment: Overview of Tesseract and Recognizing roman numerals via Tesseract. Also working with the base APIs of tesseract to enable isolation of the characters.
  • I have integrated the image processing routines (conversion into a binary image to improve the contrast, and alignment of the images by using connected component analysis to find the boxes, then finding their centroids and checking the angle between them, etc.). The sample collection is ongoing (I have distributed the forms in my father's office for data collection ;) ). I shall be scanning them and reading the data.
  • Testing with printed data is done using the Times New Roman font, for which tesseract is trained by default, and it has been giving decent results, but I am now selectively dumping the data into xml.
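
The os.system hook mentioned in the updates could also be written with the subprocess module, which reports failures instead of silently returning an exit code. The following is only a minimal sketch, not the project's actual code: the function names build_tesseract_command and run_tesseract are hypothetical, and it assumes the tesseract command-line tool is installed and writes its result to <outputbase>.txt.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base):
    """Return the argument list for `tesseract <image> <outputbase>`."""
    return ["tesseract", str(image_path), str(output_base)]


def run_tesseract(image_path, output_base):
    """Run the tesseract CLI on one image and return the recognised text.

    Hypothetical helper: assumes tesseract is on PATH and that it writes
    its plain-text result to `<output_base>.txt`.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Calling run_tesseract once per cropped field, rather than once per page, is one way to keep each value separate instead of producing a complete dump of the form.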
Technical Layout

This is how the scripts function. I am working to tune these to work efficiently.

The Technical layout of the project

README

src:

  • formHandler.py
  • functions.py
  • generateTrainingform.py
  • parseform.py
  • printForm.py
  • regions.py
  • train.py
  • xforms2pdf.py

Usage and their functionality:


formHandler.py

It is a class which handles the parsing of the xforms and which in turn uses an object of the printForm class.


functions.py

This file contains all the generic functions that may be used by various modules. The list of available functions is as follows.

→distance(li): Returns the Euclidean distance if the input is of the form [(x1, y1), (x2, y2)]

→convertImage2binary(image, threshold=default): Takes in any image and converts it to binary based on the threshold.

→getMarkers(image): Takes in an image and returns the list of markers on the image as a list of regions. (for regions check for the class regions.)

→scaleFactor(markers): Takes in a list of markers and their co-ordinates and returns the scaling factor of the image lengthwise and breadthwise.

→checkFolds(list): Takes in the list of markers and checks if the image has folds. The tolerance has been set to 15 degrees, which is very large and should be reduced.

→checkForm(image): It checks the validity of the image to be a valid form and aligns it.
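
To make the helper descriptions above concrete, here is a rough sketch of how three of them might look. It is not the code in functions.py: the snake_case names, the default threshold of 128, the choice of treating dark pixels as foreground, and the use of plain nested lists for the image are all assumptions made for illustration.

```python
import math


def distance(li):
    """Euclidean distance for input of the form [(x1, y1), (x2, y2)]."""
    (x1, y1), (x2, y2) = li
    return math.hypot(x2 - x1, y2 - y1)


def convert_image_to_binary(pixels, threshold=128):
    """Threshold a grayscale image given as rows of 0-255 values.

    Assumption: pixels darker than the threshold are ink (1),
    everything else is background (0).
    """
    return [[1 if value < threshold else 0 for value in row] for row in pixels]


def skew_angle(p1, p2):
    """Angle in degrees between the line p1 -> p2 and the horizontal axis.

    Two marker centroids that should lie on one horizontal line give the
    rotation of the scan; a fold check could compare this angle against
    a tolerance.
    """
    return math.degrees(math.atan2(p2[1] - p1[1], p2[0] - p1[0]))
```

For example, distance([(0, 0), (3, 4)]) gives 5.0, and two markers at (0, 0) and (10, 0) give a skew angle of 0.0 degrees.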


generateTrainingform.py

Generates the training forms. Uses the data input in the misc folder to generate the form and writes the location details of the characters to a file.

  Usage: python generateTrainingform.py ../misc/<datainput> <pdfoutputname>

parseform.py

It is a file which takes as input the xml file containing the logical data placement and outputs an xml dump of the parsed data. This particular module tessellates the required images and writes the data to an xml file.

  Usage: python parseform.py <xmlinput> <imageinput> <xmloutput>

<content to be added here>


printForm.py

A class which handles the forms in order to generate the ReportLab PDFs.


regions.py

It contains the class Region() and the function findRegions

→findRegions(image): returns a list of regions(objects of the class Region) based on the connected component analysis.

  
  Raster Scanning Algorithm for Connected Component Analysis:
  
  On the first pass:
  
  1. Iterate through each element of the data by column, then by row (Raster Scanning)
  2. If the element is not the background
      1. Get the neighboring elements of the current element
      2. If there are no neighbors, uniquely label the current element and continue
      3. Otherwise, find the neighbor with the smallest label and assign it to the current element
      4. Store the equivalence between neighboring labels
  
  On the second pass:
  
  1. Iterate through each element of the data by column, then by row
  2. If the element is not the background
        1. Relabel the element with the lowest equivalent label
  ( source: http://en.wikipedia.org/wiki/Connected_Component_Labeling )
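
The two passes described above can be sketched as follows. This is an illustrative version only, not the findRegions implementation: the name label_components is hypothetical, the image is assumed to be a nested list with 0 as background, and 4-connectivity (upper and left neighbours) is assumed.

```python
def label_components(grid, background=0):
    """Two-pass connected component labelling with 4-connectivity."""
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    parent = {}  # union-find table storing label equivalences

    def find(label):
        # Follow parent links to the representative (smallest) label.
        while parent[label] != label:
            parent[label] = parent[parent[label]]
            label = parent[label]
        return label

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller label as the representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    next_label = 1
    # First pass: assign provisional labels and record equivalences.
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == background:
                continue
            neighbours = []
            if r > 0 and labels[r - 1][c]:
                neighbours.append(labels[r - 1][c])
            if c > 0 and labels[r][c - 1]:
                neighbours.append(labels[r][c - 1])
            if not neighbours:
                parent[next_label] = next_label
                labels[r][c] = next_label
                next_label += 1
            else:
                smallest = min(neighbours)
                labels[r][c] = smallest
                for other in neighbours:
                    union(smallest, other)

    # Second pass: replace each label with its lowest equivalent label.
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                labels[r][c] = find(labels[r][c])
    return labels
```

A U-shaped blob such as [[1, 0, 1], [1, 0, 1], [1, 1, 1]] first receives two provisional labels for its two arms, which the second pass merges into one.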

train.py

Automates the training of tesseract.

→generateBoxfile(image, boxfilename): Generates the box file based on the location input produced when the training form was generated.

  Usage: python train.py <trainingimage> <locationinput_generated>
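
As an illustration of the box file step, the sketch below turns one character location into one box file line. It is not the project's generateBoxfile: the function name box_line and the input format (left, top, width, height in pixels, measured from the top-left corner of the image) are assumptions. Tesseract box files list the symbol followed by left, bottom, right and top measured from the bottom-left corner of the image; the trailing page number is expected by newer Tesseract versions and may need to be dropped for older ones.

```python
def box_line(char, left, top, width, height, image_height, page=0):
    """Format one Tesseract box file line for a single character.

    Hypothetical input: a box given from the top-left image corner.
    Box files count y from the bottom edge, so the y values are flipped.
    """
    bottom = image_height - (top + height)
    upper = image_height - top
    return f"{char} {left} {bottom} {left + width} {upper} {page}"
```

For a 100 pixel high image, box_line("A", 10, 20, 30, 40, 100) produces the line "A 10 40 40 80 0".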

xforms2pdf.py

It converts the input xform to a pdf to be printed.

  Usage: python xforms2pdf.py <xforminput> <OPTIONAL -> pdfoutput>
