foundation:gsoc_chillara

Executive Summary

Abstract: This project aims to create a module that can be integrated into the current framework, where forms can be generated and a user can upload a scanned image and then get an interface in which the scanned information is displayed alongside the corresponding image snippets for manual verification. Since training on the data set is the most critical factor affecting accuracy, an automated character training sub-module shall be incorporated into this module.
Student: Suryajith Chillara
Mentor(s): Michael Howden and Praneeth Bodduluri

Code

bzr branch

Functional Specifications.

Users

The module has a single user role, the end user. No superuser is required.

Data Model

The data of the form has to be organized so as to minimise handwritten character recognition. Common data can thus be handled with check boxes alone, while specific details such as names have to be text inputs. In discussion, Fran said that some text input boxes, like the comment boxes for additional info on the web page, also have to be present. These are the inputs we might deal with:

Strings for names.
Integers for various data.
Check boxes.

The check boxes shall first be read with an 'X' mark and later be tested with a completely filled circular dot.

Flowcharts

Sahana-Eden → xforms → [ Parse xforms + form a layout ] → pdf

Scanned Images → [OCR blackbox] → User interface to correct/verify → xml → Sahana-Eden

Menu

They are screen specific and they shall be explained in the screens.

Screens Correction UI

API

The end user shall be given an interface to upload the scanned image, followed by a UI in which the recognized text can be compared against the image and corrected.

Comments

Meeting Schedule

Thursdays at 1600 hrs UTC

Timeline

Project timeline
SMART goal	Measure	Due date	Comments
Xforms to pdf	OCR Forms generated.	15th June	DONE
Tesseract integration	Tests with printed data.	5th July	INTEGRATED: waiting for the community to check it.
Automated training	A training form and automation scripts (a web UI for the same wasn't necessary)	20th July	DONE: Awaiting testing from the community.
Assembly and Testing	Accuracy	2nd August	IN PROGRESS
Correction UI	Web UI with a text box and a corresponding image for every element	9th August	TO BE DONE

Updates

I worked on the form generation. One way to print a form is the W3C-standard approach of feeding an XML file and an XSL stylesheet to Apache FOP to generate PDFs. I tried this but couldn't do everything I wanted for an OCR-compatible sheet, especially the form decorations used for alignment. Instead I used the XML parsers and wrote a module to convert xforms into PDFs. The location of the data is written into an XML document for easy parsing later.
I have hooked up tesseract with python, enabling the parsing of the forms via the os.system function. The parsed data now has to be dumped into the specific related files instead of as one complete dump of the form.
The other thing I have been looking into is the training of the OCR engine. It has to be provided with training data for proper evaluation. The literature I have read on training the engine is listed at the bottom of the page. The tesseract tools page has various shell scripts which help generate the training data and thus train tesseract, whereas Training Tesseract and the page Tesseract - Summary explain how to generate the data and train tesseract manually.
A training form has been generated which the user fills in in their own handwriting. 10 samples of each letter have to be filled in. The box file to be generated has to be standardized: the segments in the boxes provided are the letters to consider. The subsequent steps of generating a unicharset, clustering and putting it all together shall be automated. No dictionary data is needed here.
I am currently reading the following articles: Overview of Tesseract and Recognizing roman numerals via Tesseract, and working with the base APIs of tesseract to enable isolation of the characters.
I have integrated the image processing routines (conversion into a binary image to improve the contrast, and alignment of the images by using connected component analysis to find the boxes, then finding their centroids and checking the angle between them). The sample collection is going on (I have distributed the forms in my father's office for data collection ;) ). I shall be scanning them and reading the data.
Testing with printed data in Times New Roman, the font for which tesseract is trained by default, has been giving decent results. I am now selectively dumping the data into xml.
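
The alignment check mentioned in the updates (find the marker boxes, take their centroids, check the angle between them) could look roughly like the sketch below. The function names and the representation of a box as a list of (x, y) pixel coordinates are assumptions made for illustration, not the module's actual API.

```python
import math


def centroid(pixels):
    """Mean position of a list of (x, y) pixel coordinates (hypothetical helper)."""
    count = float(len(pixels))
    return (sum(x for x, _ in pixels) / count,
            sum(y for _, y in pixels) / count)


def skew_angle(left_centroid, right_centroid):
    """Angle in degrees between the line joining two centroids and the horizontal."""
    dx = right_centroid[0] - left_centroid[0]
    dy = right_centroid[1] - left_centroid[1]
    return math.degrees(math.atan2(dy, dx))


# Two markers that should sit on the same horizontal line of the form.
left = centroid([(0, 0), (2, 0), (0, 2), (2, 2)])
right = centroid([(100, 10), (102, 10), (100, 12), (102, 12)])
angle = skew_angle(left, right)
```

A scan with no rotation gives an angle of 0 for two markers on the same printed line; the measured angle is what an alignment step would rotate by.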

Technical Layout

This is how the scripts function. I am working to tune these to work efficiently.

README

File structure of the ocr folder:

config ← Global config file

images: ← A possible storage area for the images

layoutfiles: ← Default storage area for the layout info of the forms generated

ocrforms: ← Default storage area for the xml forms

parseddata: ← Stores the parsed data

README ← Explains the howto

sahanahcr:

      |-dataHandler.py <- A class to parse the images and dump the data 
      |-formHandler.py <- A class to handle the xforms and print them to pdf
      |-functions.py <- A module with all the necessary functions for this entire ocr module
      |-parseform.py <- A script to parse the forms
      |-printForm.py <- A class to handle the reportlab api to print forms
      |-regions.py <- A Class which describes a region in an image
      |-upload.py <- A script to upload the files
      |-urllib2_file.py <- A module which augments urllib2's functionality to upload files
      |-xforms2pdf.py <- Converts xforms to pdfs and uses the classes from formhandler and printform

tessdata: ← A folder where the necessary training info is stored to parse the scanned forms

      |-configs 
      |-tessconfigs

training:

      |-generatetrainingform.py <- Generates the training form
      |-train.py <- Trains the engine and stores the training data in the tessdata folder
      |-datafiles: <- Contains the input to generate training form and also the training form layout info files
      |-printedpdfs: <- Printed training forms reside here

xmlInput:

Dependencies

Reportlab
Core xml libs like xml.dom.minidom and xml.sax
sane on unix and twain on Windows to support scanning
pyscanning (http://code.google.com/p/pyscanning/)
Imaging-sane (http://svn.effbot.python-hosting.com/pil/Sane/ on Unix , not necessary on windows )
urllib
urllib2
PIL >= 1.6

NOTE 1: All scripts have to be run from their respective directories at the moment.

NOTE 2: All the images used are to be provided in the .tif format.

USAGE

Setting up the config file :

[url]

url = http://suryajith.in:5000/

The URL to which the data is uploaded.

[tessdata]

tessdata = ../

The path, relative to the sahanahcr folder, where the tessdata folder is located.
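
For illustration, a config file with these two sections could be read as sketched below. This uses Python 3's configparser, which is an assumption: the scripts themselves may read the file differently, and `load_ocr_config` is a hypothetical helper name.

```python
import configparser

# Same content as the config file shown above, inlined so the sketch is self-contained.
SAMPLE_CONFIG = """
[url]
url = http://suryajith.in:5000/

[tessdata]
tessdata = ../
"""


def load_ocr_config(text):
    """Parse the config text and return (upload_url, tessdata_dir)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser.get("url", "url"), parser.get("tessdata", "tessdata")


upload_url, tessdata_dir = load_ocr_config(SAMPLE_CONFIG)
```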

Step 1: The form generation

The forms are generated with xforms2pdf.py using the syntax shown below. If the pdf name is not given, the file is saved as uuid.pdf. The files are stored in the ocrforms folder of the main directory structure.

  Usage: python xforms2pdf.py <xforminput> <OPTIONAL -> pdfoutput>
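
The optional output name can be handled along these lines. This is only a sketch: `resolve_pdf_name` is a hypothetical helper, and generating a fresh random uuid here is an assumption, since the page does not say where the form's uuid comes from.

```python
import uuid


def resolve_pdf_name(argv):
    """Return (xform_path, pdf_name) from an argument list like sys.argv[1:]."""
    if not argv:
        raise ValueError("usage: xforms2pdf.py <xforminput> [pdfoutput]")
    xform_path = argv[0]
    if len(argv) > 1:
        pdf_name = argv[1]
    else:
        # Assumption: fall back to a freshly generated uuid as the file name.
        pdf_name = "%s.pdf" % uuid.uuid4()
    return xform_path, pdf_name
```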

Step 2: Automated training

First generate the training form. The script reads datainput.txt from the “Training” folder, which lists the characters to be trained, and writes two files back into the 'Training' folder: the form itself as “Trainingform.pdf” and a layout file named “Traininglayoutinfo” that records the location of each character. The training form is then used for the automated box file generation.

  Usage: python generateTrainingform.py

The training script stores the necessary files in the tessdata folder with <user> as a prefix (this is generally the language), so that when <user> is given while parsing the forms, that specific trained data is used. The files are stored as user_alph.* for the alphabet training and user_num.* for the numeral training.

  Usage: python train.py <trainingimage> <layout info of the form which is generally training form> <user>

Step 3: Scan the Image or Add the Image

The images can either be added to the images folder described in the file structure or be scanned directly. They are parsed in accordance with the layout files located in the layout folder. The layout is chosen by the uuid printed on the form. Because the layout file is stored as uuid_page.xml, the page number has to be specified every time a page is scanned. <user> is the user whose training data has to be used.

  Usage: python parseform.py <imageinput> <uuid of the form> <pagenumber> <user>

The parsed data is stored as XML in the <uuid> folder inside the parseddata folder of the global file structure. The images of the text are also cut out and stored. Both the images and the text are uploaded to a web folder via the url given in the config file.
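
As a rough illustration of the uuid_page.xml naming scheme and of an XML dump of parsed fields, consider the sketch below. The element and attribute names (`form`, `field`, `name`) are made up for this example; the schema actually written by parseform.py is not documented on this page.

```python
from xml.dom.minidom import Document


def layout_file_name(form_uuid, page):
    """Layout files are stored as <uuid>_<page>.xml."""
    return "%s_%s.xml" % (form_uuid, page)


def parsed_data_xml(form_uuid, page, fields):
    """Serialise (field_name, recognised_text) pairs into a hypothetical XML layout."""
    doc = Document()
    root = doc.createElement("form")
    root.setAttribute("uuid", form_uuid)
    root.setAttribute("page", str(page))
    doc.appendChild(root)
    for name, value in fields:
        node = doc.createElement("field")
        node.setAttribute("name", name)
        node.appendChild(doc.createTextNode(value))
        root.appendChild(node)
    return doc.toxml()


xml_dump = parsed_data_xml("abc123", 1, [("name", "JOHN"), ("age", "42")])
```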

TODO

Implement option parser instead of the regular files
Some way to deal with files with multiple pages, and a way to store the data other than the present way of storing it in two different xmls
Do windows specific improvements and tests
Barcodes on each page
Make generateTrainingform.py generic for all languages
Improve the xforms parsing to use all the attributes of bind
Check the improvement due to parsing digits and alphabets independently
Try to check if individual character reading improves the accuracy rather than reading the entire string
Should check if multiple fields have been selected for a select1 element

LIMITATIONS

Just works with capital letters and digits now.
Can't really use the restricted attribute of bind in xforms. For example, a question like "Are you pregnant?" is valid only for females. This should be improved to use it and the other attributes of bind.
Parsing is not accurate.
Selects the field related to the first darkened bubble for a select1 element.
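
The select1 limitation above, together with the TODO item about multiple selections, can be sketched as follows. Regions are plain 2D lists of grey values here, and the threshold and fill ratio are illustrative guesses, not values taken from the module.

```python
def dark_fraction(region, threshold=128):
    """Fraction of pixels in a 2D list of grey values that are darker than threshold."""
    pixels = [p for row in region for p in row]
    if not pixels:
        return 0.0
    return sum(1 for p in pixels if p < threshold) / float(len(pixels))


def read_select1(bubbles, min_fill=0.5):
    """bubbles is a list of (option_name, region) pairs in form order.

    Returns (choice, ambiguous): the first darkened option, or None, plus a
    flag that is True when more than one bubble looks filled.
    """
    marked = [name for name, region in bubbles
              if dark_fraction(region) >= min_fill]
    choice = marked[0] if marked else None
    return choice, len(marked) > 1


filled = [[0, 0], [0, 255]]
empty = [[255, 255], [255, 255]]
result = read_select1([("male", empty), ("female", filled)])
```

Returning the ambiguity flag alongside the first choice keeps the current behaviour while letting the correction UI highlight doubtful answers.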

Detailed description of source

formHandler.py

It's a class which handles the parsing of the xforms and which in turn uses an object of the printForm class.

functions.py

This file contains all the generic functions that may be used by various modules. The list of available functions is as follows.

→distance(li): Returns the Euclidean distance when the input is of the form [(x1, y1), (x2, y2)].

→convertImage2binary(image, threshold=default): Takes in any image and converts it to binary based on the threshold.

→getMarkers(image): Takes in an image and returns the markers on the image as a list of regions (see the Region class in regions.py).

→scaleFactor(markers): Takes in a list of markers and their co-ordinates and returns the scaling factor of the image lengthwise and breadthwise.

→checkFolds(list): Takes in the list of markers and checks whether the image has folds. The tolerance is currently set to 15 degrees, which is very large and should be reduced.

→checkForm(image): Checks that the image is a valid form and aligns it.
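
A minimal sketch of two of these helpers follows. The signatures are assumptions: in particular, `scale_factor` here takes the scanned markers and the expected layout markers as two lists of three points (top-left, top-right, bottom-left), which may not match the real scaleFactor(markers).

```python
import math


def distance(points):
    """Euclidean distance for input of the form [(x1, y1), (x2, y2)]."""
    (x1, y1), (x2, y2) = points
    return math.hypot(x2 - x1, y2 - y1)


def scale_factor(scanned, expected):
    """Return (horizontal, vertical) scale of the scan relative to the layout.

    Each argument is [top_left, top_right, bottom_left] (assumed ordering).
    """
    horizontal = distance([scanned[0], scanned[1]]) / distance([expected[0], expected[1]])
    vertical = distance([scanned[0], scanned[2]]) / distance([expected[0], expected[2]])
    return horizontal, vertical


scale = scale_factor([(0, 0), (200, 0), (0, 300)],
                     [(0, 0), (100, 0), (0, 100)])
```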

generateTrainingform.py

Generates the training forms. Uses the datainput in the misc folder to generate the form and prints out the location details of the characters in a file.

  Usage: python generateTrainingform.py ../misc/<datainput> <pdfoutputname>

parseform.py

A script that takes as input the XML file describing the logical placement of the data and outputs an XML dump of the parsed data. This module tessellates the required images and writes the data to an XML file.

  Usage: python parseform.py <imageinput> <> <user>

printForm.py

A class that handles the forms and generates the PDFs with reportlab.

regions.py

It contains the class Region() and the function findRegions

→findRegions(image): returns a list of regions(objects of the class Region) based on the connected component analysis.

  
  Raster Scanning Algorithm for Connected Component Analysis:
  
  On the first pass:
  
  1. Iterate through each element of the data by column, then by row (Raster Scanning)
  2. If the element is not the background
      1. Get the neighboring elements of the current element
      2. If there are no neighbors, uniquely label the current element and continue
      3. Otherwise, find the neighbor with the smallest label and assign it to the current element
      4. Store the equivalence between neighboring labels
  
  On the second pass:
  
  1. Iterate through each element of the data by column, then by row
  2. If the element is not the background
        1. Relabel the element with the lowest equivalent label
  ( source: http://en.wikipedia.org/wiki/Connected_Component_Labeling )
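
The two passes above can be sketched in plain Python as below. This is an illustrative version, not the code in regions.py: it works on a 2D list where 0 is background, uses 4-connectivity (up and left neighbours), and keeps the label equivalences in a small union-find table.

```python
def label_components(grid):
    """Two-pass connected component labelling of a 2D list (0 = background)."""
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    parent = {}  # equivalence table: label -> parent label

    def find(x):
        # Follow parents to the representative (lowest) label.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    next_label = 1
    # First pass: provisional labels and equivalences.
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 0:
                continue
            neighbours = []
            if r > 0 and labels[r - 1][c]:
                neighbours.append(labels[r - 1][c])
            if c > 0 and labels[r][c - 1]:
                neighbours.append(labels[r][c - 1])
            if not neighbours:
                parent[next_label] = next_label
                labels[r][c] = next_label
                next_label += 1
            else:
                smallest = min(neighbours)
                labels[r][c] = smallest
                for n in neighbours:
                    union(smallest, n)
    # Second pass: replace each label with its lowest equivalent label.
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                labels[r][c] = find(labels[r][c])
    return labels


grid = [
    [1, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
]
labelled = label_components(grid)
```

In the sample grid the U-shaped blob starts out with two provisional labels that are merged in the second pass, while the isolated pixel keeps its own label.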

train.py

Automated training of tesseract.

Tesseract needs a 'box' file to go with each training image. The box file is a text file that lists the characters in the training image, in order, one per line, with the coordinates of the bounding box around the image. Tesseract has a mode in which it will output a text file of the required format, but if the character set is different to its current training, it will naturally have the text incorrect. So the key process here is to manually edit the file to put the correct characters in it, but manually editing the file is not possible in all circumstances and our aim is to automate the entire process. Thus we use the following function:

→generateBoxfile(image, boxfilename): Generates the box file based on the location input generated from the generatedTrainingimage

  Usage: python train.py <trainingimage either the path or just the name if its in src or Images> <layout info of the form> <user>

instead of

  tesseract fontfile.tif fontfile batch.nochop makebox
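
Generating box file lines from known character positions might look like the sketch below. It assumes the layout info gives each character cell as (left, top, right, bottom) with the origin at the top-left of the image, whereas Tesseract box files use `<char> <left> <bottom> <right> <top>` with the origin at the bottom-left. The helper names are hypothetical, and the optional trailing page number applies only to Tesseract versions whose box files carry one.

```python
def box_line(char, left, top, right, bottom, image_height, page=None):
    """Format one box file line, flipping y from top-left to bottom-left origin."""
    fields = [char, left, image_height - bottom, right, image_height - top]
    if page is not None:
        fields.append(page)
    return " ".join(str(f) for f in fields)


def box_file_text(cells, image_height):
    """cells is a list of (char, left, top, right, bottom) in reading order."""
    lines = [box_line(c, l, t, r, b, image_height) for c, l, t, r, b in cells]
    return "\n".join(lines) + "\n"


text = box_file_text([("A", 10, 20, 30, 50), ("B", 40, 20, 60, 50)], 100)
```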

Tesseract needs to know the set of possible characters it can output. To generate the unicharset data file, use the unicharset_extractor program on the box files generated above:

  unicharset_extractor <list of box files>

Important: Check for errors in the output from apply_box. If there are FATALITIES reported, then there is no point continuing with the training process until you fix the box file. The new box.train.stderr config file makes it easier to choose the location of the output. A FATALITY usually indicates that this step failed to find any training samples of one of the characters listed in your box file. Either the coordinates are wrong, or there is something wrong with the image of the character concerned. If there is no workable sample of a character, it can't be recognized, and the generated inttemp file won't match the unicharset file later and Tesseract will abort.

Another error that can occur that is also fatal and needs attention is an error about “Box file format error on line n”. If preceded by “Bad utf-8 char…” then the utf-8 codes are incorrect and need to be fixed. The error “utf-8 string too long…” indicates that you have exceeded the 24 byte limit on a character description. If you need a description longer than 24 bytes, please file an issue.

→Clustering: When the character features of all the training pages have been extracted, we need to cluster them to create the prototypes. The character shape features can be clustered using the mftraining and cntraining programs:

  mftraining -U unicharset -O lang.unicharset fontfile_1.tr fontfile_2.tr ...

or, on most systems, just

  mftraining fontfile.tr

and

  cntraining fontfile.tr

Tesseract uses up to 5 dictionary files for each language. Four of the files are coded as a Directed Acyclic Word Graph (DAWG), and the other is a plain UTF-8 text file. To make the DAWG dictionary files, you first need a wordlist for your language. The wordlist is formatted as a UTF-8 text file with one word per line. Split the wordlist into two sets: the frequent words, and the rest of the words, and then use wordlist2dawg to make the DAWG files:

  wordlist2dawg frequent_words_list freq-dawg
  wordlist2dawg words_list word-dawg

The final data file that Tesseract uses is called unicharambigs. It represents the intrinsic ambiguity between characters or sets of characters, and is currently generated only in a minimal form. This still needs work.

Finally, rename all the generated files to <lang>.<filename>.

The resulting lang.traineddata goes in the tessdata directory (usually /usr/share/tessdata). Tesseract can then recognize text in your language (in theory) with the following:

  tesseract image.tif output -l lang
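
From Python, the same command can be assembled and run as sketched below. This uses subprocess rather than the os.system call mentioned in the updates, and the helper names are hypothetical. Tesseract writes its result to `<output>.txt`, which is read back afterwards.

```python
import subprocess


def tesseract_command(image_path, output_base, lang=None):
    """Build the argument list for: tesseract <image> <output> [-l <lang>]."""
    cmd = ["tesseract", image_path, output_base]
    if lang:
        cmd += ["-l", lang]
    return cmd


def run_tesseract(image_path, output_base, lang=None):
    """Run tesseract (must be installed and on PATH) and return the recognised text."""
    subprocess.check_call(tesseract_command(image_path, output_base, lang))
    with open(output_base + ".txt") as handle:
        return handle.read()


command = tesseract_command("image.tif", "output", "lang")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.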

upload.py

The upload function has been implemented here

xforms2pdf.py

It converts the input xform to a pdf to be printed.

  Usage: python xforms2pdf.py <xforminput> <OPTIONAL -> pdfoutput>
