foundation:gsoc_chillara (current revision 2010/12/18 17:35; previous revision 2010/06/29 21:17 by suryajith)
== Code ==
* [[https://
==Functional Specifications==
An interface will be provided where the end user uploads the scanned image and is then presented with a UI to compare and correct the recognised text.
__Comments__
^SMART goal^Measure^Due date^Comments^
| Xforms to pdf | [[http:// |
| Tesseract integration | Tests with printed data. | 5th July | INTEGRATED: waiting for the community to check it. |
| Automated training |
| Assembly and Testing |
| Correction UI | Web UI with a text box and a corresponding image for every element |
* The other thing I have been looking into is the training of the OCR engine. It has to be provided with training data for proper evaluation. The literature on training the engine that I have read is listed at the bottom of the page. The Tesseract tools page has various shell scripts which help generate the training data and thus train Tesseract, whereas Training Tesseract and the page Tesseract - Summary explain how to generate the data and train Tesseract manually.
* A {{:
* Reading the following articles at the moment: [[http://
* I have integrated the image processing routines (conversion into a binary image to improve the contrast, alignment of the images using connected component analysis to find the boxes, then finding their centroids and checking the angle between them, etc.). The sample collection is going on (I have distributed the forms in my father'
* The testing with printed data, done using the Times New Roman font for which Tesseract is trained by default, has been giving decent results, but I am now selectively dumping the data into XML.
- [[http://
- [[http://
- [[http://

==Technical Layout==
This is how the scripts function. I am working on tuning them to run efficiently.

{{:

--------------

==README==

**File structure of the ocr folder:**

  config       <- Global config file
  images:      <- A possible storage area for the images
  layoutfiles:
  ocrforms:    <- Default storage area for the xml forms
  parseddata:  <- Stores the parsed data
  README       <- Explains the howto

  sahanahcr:
  |-dataHandler.py   <- A class to parse the images and dump the data
  |-formHandler.py   <- A class to handle the xforms and print them to pdf
  |-functions.py     <- A module with all the necessary functions for this entire ocr module
  |-parseform.py     <- A script to parse the forms
  |-printForm.py     <- A class to handle the reportlab api to print forms
  |-regions.py       <- A class which describes a region in an image
  |-upload.py        <- A script to upload the files
  |-urllib2_file.py  <- A module which augments urllib2'
  |-xforms2pdf.py    <- Converts xforms to pdfs; uses the classes from formHandler and printForm

  tessdata:    <- A folder where the necessary training info is stored to parse the scanned forms
  |-configs
  |-tessconfigs

  training:
  |-generatetrainingform.py  <- Generates the training form
  |-train.py                 <- Trains the engine and stores the training data in the tessdata folder
  |-datafiles:
  |-printedpdfs:

  xmlInput:
------------

**Dependencies**
------------
- Reportlab
- Core xml libs like xml.dom.minidom and xml.sax
- sane on unix and twain on Windows to support scanning
- pyscanning (http://
- Imaging-sane (http://
- urllib
- urllib2
- PIL >= 1.6

NOTE 1: All scripts have to be run from their respective directories at the moment.

NOTE 2: All images are to be provided in .tif format.

**USAGE**

__Setting up the config file:__

  [url]
  url = http://

The url to which the data can be uploaded.

  [tessdata]
  tessdata = ../

The folder, relative to the sahanahcr folder, where the tessdata folder is located.
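For illustration, these two settings could be read with Python's standard library config parser. The section and option names come from the snippet above, but the function name and default path are assumptions, not necessarily how the sahanahcr code reads the file.

```python
# Sketch of reading the config file shown above with the standard
# library parser. Section and option names follow the snippet; the
# real sahanahcr code may read them differently.
try:
    from configparser import ConfigParser   # Python 3
except ImportError:
    from ConfigParser import ConfigParser   # Python 2, the era of this code

def read_settings(path="../config"):
    cp = ConfigParser()
    cp.read(path)
    url = cp.get("url", "url")                 # upload target
    tessdata = cp.get("tessdata", "tessdata")  # path to the tessdata folder
    return url, tessdata
```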
+ | |||
+ | ------------ | ||
+ | |||
__Step 1: The form generation__
The forms can be generated with xforms2pdf.py using the syntax below. In case the pdf name is not given, the uuid.pdf format is used to save the files. They are stored in the OCRforms folder in the main directory structure.

Usage: python xforms2pdf.py <

__Step 2: Automated training__
Generation of the training forms. Uses the datainput.txt located in the "

Usage: python generateTrainingform.py

The automated Tesseract training stores the necessary files in the tessdata folder with a <

Usage: python train.py <

__Step 3: Scan the Image or Add the Image__
The images can either be added to the images folder described in the file structure (or be scanned directly), and the parsing of the images takes place in accordance with the layout files located in the layout folder. The layout is chosen per the uuid mentioned on the form; the layout file is stored in the form uuid_page.xml, so every time a page is scanned, the page number has to be specified too. <

Usage: python parseform.py <

The parsed data is stored as an xml file in the parseddata folder in the global file structure. It is stored in the <

TODO

- Implement an option parser instead of the regular files
- Find a way to deal with files with multiple pages, and a way to store the data instead of the present approach of storing it in two different xmls
- Do Windows-specific improvements and tests
- Barcodes on each page
- Make generateTrainingform.py generic for all languages
- Improve the xforms parsing to use all the attributes of bind
- Check the improvement from parsing digits and alphabets independently
- Check whether reading individual characters improves accuracy compared to reading the entire string
- Check whether multiple fields have been selected for a select1 element

LIMITATIONS
============
* Only works with capital letters and digits for now.
* Can't really use the restricted attribute of bind in xforms; for example, a question like "are you pregnant" is valid only for females. This should be improved.
* Parsing is not accurate.
* Selects the field related to the first darkened bubble for a select1 element.

--------------

__**Detailed description of source**__

--------------
__formHandler.py__

A class which handles the parsing of the xforms; it in turn uses the object from the printForm class.

--------------
__functions.py__

This file contains all the generic functions that may be used by the various modules. The list of available functions is as follows.

->
->

->

->

->

->
--------------
__generateTrainingform.py__

Generates the training forms. Uses the datainput in the misc folder to generate the form and prints the location details of the characters to a file.

Usage: python generateTrainingform.py ../
--------------
__parseform.py__

A script which takes as input the xml file containing the logical data placement and outputs the xml dump of the parsed data. This module tessellates the required images and writes the data to an xml file.

Usage: python parseform.py <

<content to be added here>
--------------
__printForm.py__

A class to handle the forms so as to generate the reportlab pdfs.
--------------
__regions.py__

It contains the class Region() and the function findRegions.

->

Raster Scanning Algorithm for Connected Component Analysis:

On the first pass:

  1. Iterate through each element of the data by column, then by row (raster scanning)
  2. If the element is not the background:
    1. Get the neighboring elements of the current element
    2. If there are no neighbors, uniquely label the current element and continue
    3. Otherwise, find the neighbor with the smallest label and assign it to the current element
    4. Store the equivalence between neighboring labels

On the second pass:

  1. Iterate through each element of the data by column, then by row
  2. If the element is not the background:
    1. Relabel the element with the lowest equivalent label

(source: http://

--------------
__train.py__

Automates the training of Tesseract.

Tesseract needs a '

->

Usage: python train.py <

instead of

  tesseract fontfile.tif fontfile batch.nochop makebox

Tesseract needs to know the set of possible characters it can output. To generate the unicharset data file, use the unicharset_extractor program on the box files generated above:

  unicharset_extractor <list of box files>

**Important:

Another error that can occur, which is also fatal and needs attention, is a "Box file format error on line n". If preceded by "Bad utf-8 char..."
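Since a malformed box file is fatal to training, a quick sanity check can catch such lines before running the trainer. This validator is illustrative and not part of the module; it assumes the usual box-file layout of one glyph per line followed by four integer coordinates, with an optional page number in newer Tesseract versions.

```python
# Sketch of a box-file sanity check that would flag the lines behind
# a "Box file format error on line n" before training starts.
# Illustrative only; not part of the sahanahcr module.

def check_box_file(lines):
    """Return a list of (line_number, reason) for malformed lines."""
    problems = []
    for n, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue
        # glyph + 4 coords (old format), or + trailing page number
        if len(parts) not in (5, 6):
            problems.append((n, "expected 5 or 6 fields, got %d" % len(parts)))
            continue
        try:
            left, bottom, right, top = map(int, parts[1:5])
        except ValueError:
            problems.append((n, "non-integer coordinate"))
            continue
        if not (left < right and bottom < top):
            problems.append((n, "degenerate box"))
    return problems
```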
+ | |||
+ | -> | ||
+ | When the character features of all the training pages have been extracted, we need to cluster them to create the prototypes. The character shape features can be clustered using the mftraining and cntraining programs: | ||
+ | mftraining -U unicharset -O lang.unicharset fontfile_1.tr fontfile_2.tr ... | ||
+ | or just in most systems | ||
+ | mftraining fontfile.tr | ||
+ | and | ||
+ | cntraining fontfile.tr | ||
+ | | ||
+ | Tesseract uses up to 5 dictionary files for each language. Four of the files are coded as a Directed Acyclic Word Graph (DAWG), and the other is a plain UTF-8 text file. To make the DAWG dictionary files, you first need a wordlist for your language. The wordlist is formatted as a UTF-8 text file with one word per line. Split the wordlist into two sets: the frequent words, and the rest of the words, and then use wordlist2dawg to make the DAWG files: | ||
+ | |||
+ | wordlist2dawg frequent_words_list freq-dawg | ||
+ | wordlist2dawg words_list word-dawg | ||
+ | |||
+ | The final data file that Tesseract uses is called unicharambigs. It represents the intrinsic ambiguity between characters or sets of characters, and is currently entirely generated minimally. | ||
+ | |||
+ | And rename all those files generated as < | ||
+ | |||
+ | The resulting lang.traineddata goes in the tessdata(usually / | ||
+ | tesseract image.tif output -l lang | ||
+ | |||
+ | |||
--------------

__upload.py__

The upload function has been implemented here.

--------------

__xforms2pdf.py__

It converts the input xform to a pdf to be printed.

Usage: python xforms2pdf.py <