
User guidelines for the OCR project for Sahana-Eden



The following libraries are required:

  1. ReportLab
  2. Core XML libraries such as xml.dom.minidom and xml.sax
  3. SANE on Unix and TWAIN on Windows, to support scanning
  4. Imaging-sane (on Unix; not necessary on Windows)
  5. urllib
  6. urllib2
  7. PIL >= 1.6

Step 1: The form generation

Forms are generated using the syntax shown below. If no PDF name is given, the file is saved as <uuid>.pdf. If only a filename is given, without a path, the script asks where the PDF should be stored; the default location is the ocrforms folder in the main directory structure.

  Usage: python <xforminput> <OPTIONAL -> pdfoutput>
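
The naming rule described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the function name `resolve_pdf_path` and the `default_dir` parameter are hypothetical, and the interactive prompt for a storage location is left out, so a bare filename always goes to the default folder.

```python
import os
import uuid


def resolve_pdf_path(pdfname=None, default_dir="ocrforms"):
    """Return the path a generated form would be written to.

    Hypothetical helper; its name and parameters are assumptions
    made for illustration only.
    """
    if not pdfname:
        # No name given: fall back to the <uuid>.pdf format.
        pdfname = "%s.pdf" % uuid.uuid4()
    if os.path.dirname(pdfname):
        # A path was given along with the filename: use it unchanged.
        return pdfname
    # Bare filename: place it in the default storage folder.
    return os.path.join(default_dir, pdfname)
```

Calling `resolve_pdf_path("survey.pdf")` would give a path inside `ocrforms`, while calling it with no argument produces a fresh uuid-based name on every call.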

Step 2: Automated training

This step generates the training form. It reads datainput.txt, located in the datafiles folder inside the “Training” folder, which lists the characters needed to build the form. The form itself is written to the “Training” folder as “Trainingform.pdf”, and the positions of the characters are written to a layout file named “Traininglayoutinfo” in the datafiles folder. This layout file is later used for the automated box file generation.

  Usage: python

The Tesseract automation stores the resulting files in the tessdata folder with <user> as a prefix (generally the language), so that when <user> is given while parsing forms, the corresponding trained data is used.

  Usage: python <trainingimage> <layout info of the form> <user>
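
To give an idea of what the automated box file generation involves, here is a minimal sketch. It assumes the layout information has already been read into (character, x, y, width, height) tuples measured in pixels from the top-left corner of the training image; the real format of the Traininglayoutinfo file is not documented here, so that input shape and the function name `to_box_lines` are assumptions. The output follows the classic Tesseract box file line format, `<char> <left> <bottom> <right> <top>`, with coordinates measured from the bottom-left corner (newer Tesseract versions append a page number, which is omitted here).

```python
def to_box_lines(entries, image_height):
    """Convert character boxes to Tesseract box-file lines.

    entries: iterable of (char, x, y, width, height) tuples with a
             top-left origin (an assumed input format).
    image_height: height of the training image in pixels.
    """
    lines = []
    for char, x, y, width, height in entries:
        left = x
        right = x + width
        # Flip the vertical axis: box files measure from the bottom.
        top = image_height - y
        bottom = image_height - (y + height)
        lines.append("%s %d %d %d %d" % (char, left, bottom, right, top))
    return lines
```

The returned lines could then be written one per line to a box file next to the training image.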

Step 3: Scan the Image or Add the Image

Images can either be added to the images folder, as described in the file structure, or scanned directly. They are parsed according to the layout files located in the layout folder. The layout is selected by the uuid printed on the form: layout files are stored as <uuid>_<page>.xml, so the page number must also be specified every time a page is scanned.

  Usage: python <imageinput> <uuid of the form> <pagenumber> <user>
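
The layout lookup described above amounts to building a filename from the form uuid and the page number. A small sketch, assuming a hypothetical helper named `layout_file_path` and a `layout_dir` parameter that defaults to the layout folder:

```python
import os


def layout_file_path(form_uuid, page, layout_dir="layout"):
    """Return the expected layout file for one page of a form.

    Hypothetical helper; it only mirrors the <uuid>_<page>.xml
    naming convention and does not check that the file exists.
    """
    return os.path.join(layout_dir, "%s_%s.xml" % (form_uuid, page))
```

This is why both the uuid and the page number are needed on the command line: neither alone identifies a layout file.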

The output is stored as an XML file named with a new <uuid>, and the images cut out for the corresponding parts are stored in the parseddata folder, in a separate subfolder with the same uuid.
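
The output naming can be pictured with the sketch below. The function name `prepare_output` is hypothetical, and placing the XML file next to the image subfolder is an assumption, since the text does not say where the XML is written.

```python
import os
import uuid


def prepare_output(parsed_root="parseddata"):
    """Create the folder for cut-out images and name the result XML.

    Hypothetical helper for illustration only.
    """
    # A new uuid identifies this parsing run.
    result_id = str(uuid.uuid4())
    # Cut-out images go into parseddata/<uuid>/.
    image_dir = os.path.join(parsed_root, result_id)
    os.makedirs(image_dir)
    # Assumption: the XML file is stored alongside that subfolder.
    xml_path = os.path.join(parsed_root, result_id + ".xml")
    return result_id, xml_path, image_dir
```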

ocr_userguidelines.1282223217.txt.gz · Last modified: 2010/08/20 07:06 (external edit)