Differences
This shows you the differences between two versions of the page.
foundation:gsoc_chillara [2010/08/17 10:11] suryajith |
foundation:gsoc_chillara [2010/12/18 17:35] (current) |
||
---|---|---|---|
Line 43: | Line 43: | ||
An interface to the end user shall be provided where the user shall upload the scanned image and then get a UI where the user shall compare and correct. | An interface to the end user shall be provided where the user shall upload the scanned image and then get a UI where the user shall compare and correct. | ||
- | |||
- | __Technologies__ | ||
- | |||
- | - Tesseract | ||
- | - ReportLab | ||
Line 89: | Line 84: | ||
==README== | ==README== | ||
- | Structure: | + | **File structure of the ocr folder:** |
- | -------------- | + | |
- | data/ | + | config <- Global config file |
- | + | ||
- | | + | |
- | images/ | + | images: <- A possible storage area for the images |
- | misc/ | + | layoutfiles: |
+ | |||
+ | ocrforms: <- Default storage area for the xml forms | ||
- | | + | parseddata: <- Stores the parsed data |
- | README | + | README |
- | src/ | ||
- | * config | ||
- | * formHandler.py | ||
- | * functions.py | ||
- | * generateTrainingform.py | ||
- | * parseform.py | ||
- | * printForm.py | ||
- | * regions.py | ||
- | * train.py | ||
- | * upload.py | ||
- | * xforms2pdf.py | ||
- | tessdata/ | + | sahanahcr: |
+ | |-dataHandler.py <- A class to parse the images and dump the data | ||
+ | |-formHandler.py <- A class to handle the xforms and print them to pdf | ||
+ | |-functions.py <- A module with all the necessary functions for this entire ocr module | ||
+ | |-parseform.py <- A script to parse the forms | ||
+ | |-printForm.py <- A class to handle the reportlab api to print forms | ||
+ | |-regions.py <- A Class which describes a region in an image | ||
+ | |-upload.py <- A script to upload the files | ||
+ | |-urllib2_file.py <- A module which augments urllib2' | ||
+ | |-xforms2pdf.py <- Converts xforms to pdfs and uses the classes from formhandler and printform | ||
+ | |||
+ | |||
+ | tessdata: <- A folder where the necessary training info is stored to parse the scanned forms | ||
+ | |-configs | ||
+ | |-tessconfigs | ||
+ | |||
+ | training: | ||
+ | |-generatetrainingform.py <- Generates the training form | ||
+ | |-train.py <- Trains the engine and stores the training data in the tessdata folder | ||
+ | |-datafiles: | ||
+ | |-printedpdfs: | ||
+ | |||
+ | |||
+ | xmlInput: | ||
+ | ------------ | ||
- | | ||
- | -------------- | ||
- | __**Usage**__ | ||
- | __Dependencies__ | + | **Dependencies** |
+ | ------------ | ||
- Reportlab | - Reportlab | ||
- | - core xml libs like xml.dom.minidom and xml.sax | + | - Core xml libs like xml.dom.minidom and xml.sax |
- sane on unix and twain on Windows to support scanning | - sane on unix and twain on Windows to support scanning | ||
- pyscanning (http:// | - pyscanning (http:// | ||
Line 134: | Line 139: | ||
- PIL >= 1.6 | - PIL >= 1.6 | ||
- | __Step | + | NOTE 1: All scripts have to be run from their respective directories at the moment. |
- | The forms could be generated using the xforms2pdf.py using the syntax as mentioned below. | + | NOTE 2: All the images used are to be provided in the .tif format. |
+ | |||
+ | |||
+ | **USAGE** | ||
+ | |||
+ | |||
+ | __Setting up the config file :__ | ||
+ | |||
+ | [url] | ||
+ | |||
+ | url = http:// | ||
+ | |||
+ | The URL to which the data is uploaded | ||
+ | |||
+ | [tessdata] | ||
+ | |||
+ | tessdata = ../ | ||
+ | |||
+ | The path to the tessdata folder, relative to the sahanahcr folder | ||
+ | |||
+ | ------------ | ||
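As an illustration of the config layout described above, here is a minimal sketch of reading the two sections with Python 3's standard `configparser` module. The original scripts date from 2010 and most likely used the Python 2 equivalent, so this is not the project's own code. The function name `load_settings` and both sample values are hypothetical placeholders, since the real values are cut off in this page.

```python
import configparser

# Placeholder config text. The real url and tessdata values are truncated
# in the page, so these two values are assumptions for illustration only.
SAMPLE_CONFIG = """
[url]
url = http://example.invalid/upload

[tessdata]
tessdata = ../placeholder-tessdata
"""


def load_settings(config_text):
    """Return the upload url and tessdata path from the config text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return {
        "url": parser.get("url", "url"),
        "tessdata": parser.get("tessdata", "tessdata"),
    }


settings = load_settings(SAMPLE_CONFIG)
print(settings["url"])
print(settings["tessdata"])
```

To read the actual file instead of a string, `parser.read("config")` could be used, assuming the script runs from a directory where the relative path resolves.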
+ | |||
+ | __Step 1: The form generation__ | ||
+ | The forms can be generated with xforms2pdf.py using the syntax below. If no pdfname is given, the file is saved as uuid.pdf. The generated files are stored in the ocrforms folder of the main directory structure.
Usage: python xforms2pdf.py < | Usage: python xforms2pdf.py < | ||
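The fallback naming rule for Step 1 can be sketched as follows. This is an assumption about the behaviour, not the code in xforms2pdf.py: the function `output_pdf_path` and its parameters are hypothetical, and the `ocrforms` folder name is taken from the file structure listed in the README.

```python
import os


def output_pdf_path(form_uuid, pdfname=None, out_dir="ocrforms"):
    """Pick the output path for a generated form.

    If no pdfname is supplied, fall back to <uuid>.pdf.
    """
    name = pdfname if pdfname else "%s.pdf" % form_uuid
    return os.path.join(out_dir, name)


# Hypothetical uuid value, used only for the example.
print(output_pdf_path("1234abcd"))
print(output_pdf_path("1234abcd", "registration.pdf"))
```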
__Step 2: Automated training__ | __Step 2: Automated training__ | ||
- | + | Generation of the training forms. Uses the datainput.txt located in the " | |
- | Generation of the training forms. Uses the datainput in the misc folder to generate the form and prints out the location details of the characters in a file, this goes into the layout | + | |
Usage: python generateTrainingform.py | Usage: python generateTrainingform.py | ||
- | The automation of the tesseract stores the necessary files in the tessdata folder with a < | + | The automation of the tesseract stores the necessary files in the tessdata folder with a < |
- | Usage: python train.py < | + | Usage: python train.py < |
- | __Step 3: Scan the Image or Add the Image__ | ||
- | The Images could either be added to the folder images as described in the files structure (or be scanned directly) and the parsing of the images takes place with accordance to the the layout files located in the layout folder. The layout is chosen as per the uuid mentioned on the form as the layout file is stored in the form uuid_page.xml so everytime a page is scanned, the page number has to be specified too. | + | __Step 3: Scan the Image or Add the Image__ |
+ | The images can either be added to the images folder described in the file structure or be scanned directly. The images are parsed in accordance with the layout files located in the layout folder. The layout is chosen by the uuid printed on the form: layout files are stored as uuid_page.xml, so every time a page is scanned, the page number has to be specified too. < | ||
Usage: python parseform.py < | Usage: python parseform.py < | ||
+ | The parsed data is stored as XML in the parseddata folder of the global file structure. It is stored in the < | ||
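The layout lookup described in Step 3 depends on the uuid_page.xml naming convention. A minimal sketch of building that file name is shown here. The helper `layout_path` is hypothetical, and the `layoutfiles` directory name is assumed from the file structure listing; parseform.py may resolve the path differently.

```python
import os


def layout_path(form_uuid, page, layout_dir="layoutfiles"):
    """Build the path of the layout file for one page of a form.

    Follows the uuid_page.xml convention described in the README.
    """
    return os.path.join(layout_dir, "%s_%d.xml" % (form_uuid, int(page)))


# Hypothetical uuid and page number, used only for the example.
print(layout_path("1234abcd", 2))
```

Because the page number is part of the file name, a scan of page 2 cannot be matched to a layout unless the page number is supplied along with the uuid.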
- | **NOTE TO WINDOWS USERS:** Please provide the paths using a '/' | ||
- | __**TODO**__ | ||
+ | |||
+ | TODO | ||
+ | |||
+ | - Implement option parser instead of the regular files | ||
+ | - Find some way to deal with forms that span multiple pages, storing their data together instead of in two different xmls as at present | ||
- Do windows specific improvements and tests | - Do windows specific improvements and tests | ||
- Barcodes on each page | - Barcodes on each page | ||
Line 169: | Line 200: | ||
- Try to check if individual character reading improves the accuracy rather than reading the entire string | - Try to check if individual character reading improves the accuracy rather than reading the entire string | ||
- Should check if multiple fields have been selected for a select1 element | - Should check if multiple fields have been selected for a select1 element | ||
- | + | ||
- | + | LIMITATIONS | |
- | __**LIMITATIONS**__ | + | ============ |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
-------------- | -------------- |