foundation:gsoc_chillara [2010/12/18 17:35] (current)
== Code ==
  * [[https://code.launchpad.net/~suryajith1987/sahana-eden/ocr | bzr branch]]
  
== Functional Specifications ==
An interface shall be provided where the end user uploads the scanned image and is then presented with a UI to compare and correct the recognised data.
  
  
__Comments__
  
{{:foundation:layout_ocr.jpg|The Technical layout of the project}}

--------------
  
== README ==
**File structure of the ocr folder:**
  
config <- Global config file
  
images: <- A possible storage area for the images
  
layoutfiles: <- Default storage area for the layout info of the generated forms

ocrforms: <- Default storage area for the xml forms
  
parseddata: <- Stores the parsed data

README <- Explains the howto
  
  
sahanahcr:
        |-dataHandler.py <- A class to parse the images and dump the data
        |-formHandler.py <- A class to handle the xforms and print them to pdf
        |-functions.py <- A module with all the necessary functions for this entire ocr module
        |-parseform.py <- A script to parse the forms
        |-printForm.py <- A class to handle the reportlab api to print forms
        |-regions.py <- A class which describes a region in an image
        |-upload.py <- A script to upload the files
        |-urllib2_file.py <- A module which augments urllib2's functionality to upload files
        |-xforms2pdf.py <- Converts xforms to pdfs using the classes from formHandler and printForm
  
  
tessdata: <- A folder where the necessary training info is stored to parse the scanned forms
        |-configs
        |-tessconfigs

training:
        |-generatetrainingform.py <- Generates the training form
        |-train.py <- Trains the engine and stores the training data in the tessdata folder
        |-datafiles: <- Contains the input to generate the training form and the training form layout info files
        |-printedpdfs: <- Printed training forms reside here

xmlInput:
------------

**Dependencies**
------------
  - Reportlab
  - Core xml libs like xml.dom.minidom and xml.sax
  - sane on Unix and twain on Windows to support scanning
  - pyscanning (http://code.google.com/p/pyscanning/)
  - Imaging-sane (http://svn.effbot.python-hosting.com/pil/Sane/ on Unix, not necessary on Windows)
  - urllib
  - urllib2
  - PIL >= 1.6

NOTE 1: All scripts have to be run from their respective directories at the moment.

NOTE 2: All the images used are to be provided in the .tif format.

**USAGE**

__Setting up the config file:__

    [url]
    url = http://suryajith.in:5000/

The url to which the parsed data can be uploaded.

    [tessdata]
    tessdata = ../

The folder, relative to the sahanahcr folder, where the tessdata folder is located.
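Reading these two settings in a script could look like the following minimal sketch. It uses Python 3's configparser (the original, Python 2-era code would use the ConfigParser module); `load_settings` is a hypothetical helper, not a function from the actual codebase.

```python
from configparser import ConfigParser

def load_settings(path):
    """Hypothetical sketch: return (upload_url, tessdata_prefix)
    from the [url] and [tessdata] sections of the ocr config file."""
    parser = ConfigParser()
    with open(path) as fh:
        parser.read_file(fh)  # fails loudly if the file is missing
    return parser.get("url", "url"), parser.get("tessdata", "tessdata")
```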
------------
  
__Step 1: The form generation__

The forms can be generated with xforms2pdf.py using the syntax below. If the pdf name is not mentioned, the output is saved as <uuid>.pdf. The files are stored in the ocrforms folder in the main directory structure.

    Usage: python xforms2pdf.py <xforminput> <OPTIONAL -> pdfoutput>
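The default naming described above (fall back to a uuid-based pdf name when none is given) could be sketched as follows; `default_pdf_name` and the handling of the ocrforms path are illustrative assumptions, not code from the real script.

```python
import os
import uuid

def default_pdf_name(pdfoutput=None, outdir="ocrforms"):
    """Hypothetical sketch: pick the output pdf path the way Step 1
    describes, falling back to <uuid>.pdf when no name is given."""
    name = pdfoutput if pdfoutput else "%s.pdf" % uuid.uuid4()
    return os.path.join(outdir, name)
```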
  
__Step 2: Automated training__

Generation of the training forms uses datainput.txt, located in the Training folder, which contains the characters needed to generate the form. The script prints the form to the Training folder as "Trainingform.pdf", and writes the location details of the characters to a layout file named "Traininglayoutinfo" in the same folder. The training form file is used for the automated boxfile generation.
  
    Usage: python generateTrainingform.py
  
The automation of tesseract stores the necessary files in the tessdata folder with <user> as a prefix (generally the language), so that when <user> is mentioned while parsing the forms, that specific trained data is used. The files are stored as user_alph.* for the alphabet training and user_num.* for the numeral training.
  
    Usage: python train.py <trainingimage> <layout info of the form, generally the training form> <user>

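The tesseract invocations that such an automated run wraps could be sketched as command lists like the ones below. The command shapes follow the "tesseract fontfile.tif fontfile batch.nochop makebox" and box.train steps of standard tesseract training; `build_training_commands` is hypothetical, not the real train.py.

```python
def build_training_commands(trainingimage):
    """Hypothetical sketch: the tesseract training commands an automated
    run would issue for one training image. Not the real train.py."""
    base = trainingimage.rsplit(".", 1)[0]
    return [
        # generate the box file for the training image
        ["tesseract", trainingimage, base, "batch.nochop", "makebox"],
        # train on the image/box pair, producing <base>.tr
        ["tesseract", trainingimage, base, "nobatch", "box.train"],
        # extract the character set seen during training
        ["unicharset_extractor", base + ".box"],
    ]
```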
__Step 3: Scan the Image or Add the Image__

The images can either be added to the images folder as described in the file structure (or be scanned directly), and the parsing of the images takes place in accordance with the layout files located in the layout folder. The layout is chosen as per the uuid mentioned on the form; since the layout file is stored in the form uuid_page.xml, the page number has to be specified every time a page is scanned. <user> is the user whose training data has to be used.

    Usage: python parseform.py <imageinput> <uuid of the form> <pagenumber> <user>

The parsed data is stored as an xml file in the <uuid> folder inside the parseddata folder of the global file structure. The images of the text are also cut out and stored. The images and text are also uploaded to a web folder via the url mentioned in the config file.
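Locating and reading a uuid_page.xml layout file could look like the sketch below, using xml.dom.minidom from the listed dependencies. The element and attribute names (region, name, x, y, w, h) are assumptions for illustration, since the actual layout schema is not shown on this page.

```python
import os
from xml.dom.minidom import parseString

def read_layout(layoutdir, form_uuid, page):
    """Hypothetical sketch: load the uuid_page.xml layout file for one
    scanned page and return its regions as (name, (x, y, w, h)) pairs.
    The <region> element and its attributes are assumed, not the real schema."""
    path = os.path.join(layoutdir, "%s_%s.xml" % (form_uuid, page))
    with open(path) as fh:
        dom = parseString(fh.read())
    regions = []
    for node in dom.getElementsByTagName("region"):
        box = tuple(int(node.getAttribute(a)) for a in ("x", "y", "w", "h"))
        regions.append((node.getAttribute("name"), box))
    return regions
```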


TODO

  - Implement an option parser instead of the regular files
  - Some way to deal with files with multiple pages; a way to store the data instead of the present way of storing it in two different xmls
  - Do Windows-specific improvements and tests
  - Barcodes on each page
  - Make generateTrainingform.py generic for all languages
  - Improve the xforms parsing to use all the attributes of bind
  - Check the improvement due to parsing digits and alphabets independently
  - Try to check if reading individual characters improves the accuracy rather than reading the entire string
  - Should check if multiple fields have been selected for a select1 element
LIMITATIONS
============
  * Just works with capital letters and digits now.
  * Can't really use the restricted attribute of bind in xforms; for example, a question like "are you pregnant" is valid only for females. The parsing should be improved to use this and the other attributes of bind.
  * Parsing is not fully accurate.
  * Selects the field related to the first darkened bubble for a select1 element.

--------------
  
__**Detailed description of source**__
  
--------------
It's a file which takes as input the xml file that has the logical data placement and outputs the xml dump of the parsed data. This particular module tessellates the required images and writes the data to an xml file.
  
    Usage: python parseform.py <imageinput> <uuid of the form> <pagenumber> <user>
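Writing the xml dump of the parsed data with xml.dom.minidom (one of the listed dependencies) could look like this minimal sketch; the element names (parseddata, field) are illustrative assumptions, not the real output schema.

```python
from xml.dom.minidom import Document

def dump_parsed_data(fields):
    """Hypothetical sketch: serialise {field name: recognised text} into
    the kind of xml dump parseform.py writes. Element names are assumed."""
    doc = Document()
    root = doc.createElement("parseddata")
    doc.appendChild(root)
    for name, value in fields.items():
        node = doc.createElement("field")
        node.setAttribute("name", name)
        node.appendChild(doc.createTextNode(value))
        root.appendChild(node)
    return doc.toprettyxml(indent="  ")
```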
  
<content to be added here>
->generateBoxfile(image, boxfilename): Generates the box file based on the location input generated from the generatedTrainingimage
  
    Usage: python train.py <trainingimage (either the path or just the name if it's in src or Images)> <layout info of the form> <user>
instead of
    tesseract fontfile.tif fontfile batch.nochop makebox
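A tesseract box file holds one line per character: the glyph followed by its left, bottom, right and top pixel coordinates and a page number. Generating such lines from known character positions, which is what the layout info makes possible, could be sketched as follows; `boxfile_lines` is illustrative, not the real generateBoxfile.

```python
def boxfile_lines(chars, page=0):
    """Sketch: format (glyph, left, bottom, right, top) tuples as
    tesseract box file lines. The real generateBoxfile works from the
    training form's layout info; this only shows the output format."""
    return ["%s %d %d %d %d %d" % (c, l, b, r, t, page)
            for c, l, b, r, t in chars]
```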
The resulting lang.traineddata goes in the tessdata (usually /usr/share/tessdata) directory. Tesseract can then recognize text in your language (in theory) with the following:
    tesseract image.tif output -l lang


--------------
  
__upload.py__

The upload function has been implemented here.
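urllib2_file.py augments urllib2 so that file objects can be posted; the multipart/form-data encoding it relies on can be sketched in a standalone way like this. The boundary string and field names here are illustrative, not taken from the real module.

```python
def encode_multipart(fields, files, boundary="sahanahcr-boundary"):
    """Sketch of multipart/form-data encoding, the mechanism upload.py's
    urllib2_file helper uses to post parsed images and text. The field
    and boundary names are illustrative, not from the real module."""
    parts = []
    for name, value in fields.items():
        parts.append("--" + boundary)
        parts.append('Content-Disposition: form-data; name="%s"' % name)
        parts.append("")
        parts.append(value)
    for name, (filename, data) in files.items():
        parts.append("--" + boundary)
        parts.append('Content-Disposition: form-data; name="%s"; filename="%s"'
                     % (name, filename))
        parts.append("Content-Type: application/octet-stream")
        parts.append("")
        parts.append(data)
    parts.append("--" + boundary + "--")
    body = "\r\n".join(parts)
    return body, "multipart/form-data; boundary=" + boundary
```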
  
  
