User guidelines for the OCR project for Sahana-Eden

Usage

Dependencies

  1. ReportLab
  2. Core XML libraries such as xml.dom.minidom and xml.sax
  3. SANE on Unix and TWAIN on Windows to support scanning
  4. Imaging-sane (http://svn.effbot.python-hosting.com/pil/Sane/), needed on Unix only, not on Windows
  5. urllib
  6. urllib2
  7. PIL >= 1.6

Step 1: The form generation

Forms are generated with xforms2pdf.py, using the syntax shown below. If no PDF name is given, the file is saved as <uuid>.pdf. If only a filename is given without a path, the script asks where the PDF should be stored; the default storage location is the ocrforms folder in the main directory structure.

  Usage: python xforms2pdf.py <xforminput> [<pdfoutput>]
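
The naming rules above can be sketched as a small helper. This is only an illustration of the described behaviour, not the code in xforms2pdf.py: the function name resolve_pdf_output and its parameters are hypothetical, and where the real script asks the user for a location, this sketch simply falls back to the default folder.

```python
import os

def resolve_pdf_output(form_uuid, pdfoutput=None, default_dir="ocrforms"):
    """Return the path where the generated PDF would be saved (hypothetical helper)."""
    if not pdfoutput:
        # No name given: fall back to <uuid>.pdf in the default folder.
        return os.path.join(default_dir, form_uuid + ".pdf")
    if os.path.dirname(pdfoutput):
        # A path was given together with the filename: use it unchanged.
        return pdfoutput
    # Only a bare filename: the real script asks for a location here;
    # this sketch assumes the default folder instead.
    return os.path.join(default_dir, pdfoutput)
```

For example, resolve_pdf_output("1234") yields a file named 1234.pdf inside the ocrforms folder.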

Step 2: Automated training

This step generates the training form. The script reads datainput.txt, located in the datafiles folder inside the “Training” folder, which contains the characters needed to generate the form. It writes the form to the “Training” folder as “Trainingform.pdf” and records the location details of the characters in a layout file named “Traininglayoutinfo”, which is stored in the datafiles folder. The training form layout info file is then used for automated box file generation.

  Usage: python generateTrainingform.py
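
To illustrate how character locations can become a box file, here is a minimal sketch. The format of Traininglayoutinfo is not described on this page, so the input used here, a list of (char, x, y, width, height) tuples measured from the top-left corner, is an assumption, as is the function name layout_to_box_lines. The output follows the Tesseract box file line format "symbol left bottom right top page", whose coordinates are measured from the bottom-left corner; older Tesseract versions omit the trailing page number.

```python
def layout_to_box_lines(entries, page_height, page=0):
    """Convert (char, x, y, width, height) entries with a top-left origin
    into Tesseract box-file lines, which use a bottom-left origin.

    The entry layout is an assumed format for illustration only.
    """
    lines = []
    for char, x, y, width, height in entries:
        left = x
        right = x + width
        # Flip the vertical axis: box files count y upwards from the bottom.
        top = page_height - y
        bottom = page_height - (y + height)
        lines.append("%s %d %d %d %d %d" % (char, left, bottom, right, top, page))
    return lines
```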

The automated Tesseract training stores the resulting files in the tessdata folder, prefixed with <user> (generally the language), so that when <user> is given while parsing forms, that specific trained data is used.

  Usage: python train.py <trainingimage> <layout info of the form> <user>
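
A quick way to check what a training run left behind is to list the files in tessdata that carry a given prefix. The sketch below assumes the usual Tesseract naming convention of <user> followed by a dot and an extension; the helper name trained_files_for and the default folder argument are hypothetical and not part of train.py.

```python
import os

def trained_files_for(user, tessdata_dir="tessdata"):
    """List files in tessdata_dir whose names start with '<user>.' (assumed convention)."""
    prefix = user + "."
    if not os.path.isdir(tessdata_dir):
        # Nothing has been trained yet, or the folder is elsewhere.
        return []
    return sorted(name for name in os.listdir(tessdata_dir)
                  if name.startswith(prefix))
```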

Step 3: Scan the Image or Add the Image

Images can either be added to the images folder described in the file structure or be scanned directly. They are parsed according to the layout files located in the layout folder. The layout is chosen by the uuid given on the form: each layout file is stored as <uuid>_<page>.xml, so the page number must also be specified every time a page is scanned.

  Usage: python parseform.py <imageinput> <uuid of the form> <pagenumber> <user>
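
The layout lookup described above amounts to building a filename from the uuid and the page number. A minimal sketch, assuming the layout folder is called "layout" relative to the working directory; the function name layout_path is hypothetical and not taken from parseform.py.

```python
import os

def layout_path(form_uuid, page, layout_dir="layout"):
    """Build the expected layout file path, <uuid>_<page>.xml, for a scanned page."""
    return os.path.join(layout_dir, "%s_%s.xml" % (form_uuid, page))
```

For example, page 2 of a form with uuid 1234 would map to 1234_2.xml inside the layout folder.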
