

Overview

In the aftermath of a disaster, a great deal of information needs to be collected about various aspects of disaster relief, such as victims, organizations involved, volunteers, and available shelters. It is often inconvenient to fill out forms for each item of information, so spreadsheets are created to cater to these data needs. Unfortunately, accessing information in these spreadsheets can be a tedious job. The flexibility and robustness of a database is required in many such cases, hence such data has to be imported into databases.

The spreadsheet importer extracts data from spreadsheets and parses it into a form suitable for storage in a database.

Scenarios/ User stories

A disaster occurred in XYZ country 2 days ago. Local agencies and NGOs quickly got together to aid the victims of the disaster and have already started collecting data about the survivors and about the needs for an imminent relief mission. Although the data collected is comprehensive and complete, every agency has uploaded this data in the form of spreadsheets on its own servers. Médecins Sans Frontières has stepped in, and while it has the resources to alleviate the suffering of the disaster-stricken, it has no data on the number of casualties and/or the number of survivors. The data stored by the local agencies is useful but scattered, and it will have to be used if any effective relief operation is to be attempted. However, the data is so large and so diverse that manually going through each spreadsheet is a task that could take weeks.

Brief Outline of the System

1. The user fills out a form specifying the path of the XLS file or the Google Docs spreadsheet.

# The following steps are invisible to the user:

<!– By MH (29 May) No, these CANNOT happen automatically. The user will have to identify many of the features here (underlined). Don't presume that the import can be done automatically –>

2.

  o For XLS files, the upload facility in web2py (the upload field) is used to fetch the XLS file from a remote computer to the server. It is then stored in the uploads folder and assigned a different name (recorded in the database).
  o For gDocs, a list of spreadsheets imported from gdocs is presented to the user.

3. The module then uses xlrd to read the cells present in the spreadsheet. These cells are stored in a list-of-lists format. Alternative data structures could be used too (lxml has been suggested as well, i.e. an XML tree).

4. Empty rows and columns are automatically removed (is this fine?).
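
A minimal sketch of steps 3 and 4, assuming the uploaded file has already been saved to disk (the path below is hypothetical): xlrd reads every cell into a list of lists, and rows and columns that are entirely blank are then dropped.

  import xlrd

  def extract_cells(path):
      book = xlrd.open_workbook(path)
      sheet = book.sheet_by_index(0)          # first worksheet only, for simplicity
      return [[sheet.cell_value(r, c) for c in range(sheet.ncols)]
              for r in range(sheet.nrows)]

  def drop_empty(rows):
      # Remove rows whose cells are all blank
      rows = [row for row in rows if any(str(cell).strip() for cell in row)]
      # Remove columns whose cells are all blank
      if rows:
          keep = [c for c in range(len(rows[0]))
                  if any(str(row[c]).strip() for row in rows)]
          rows = [[row[c] for c in keep] for row in rows]
      return rows

  cells = drop_empty(extract_cells("uploads/spreadsheet.xls"))   # hypothetical path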

<!– 5. Data type checking can also be done at this stage. The cleaning is implemented mainly using the levenshtein function in s3vita.py. The levenshtein function measures the 'distance' between two strings, the minimum number of insertions, deletions, and substitutions required to change one string to another. –>

#User view

6. Spreadsheet view:

        o The data (after being extracted and cleaned) is then put into a spreadsheet which is displayed using a suitable JavaScript library (please check this).
        o The cells can be edited by the user.
        o Entire rows and columns may be dropped according to the user's requirements.
        o The data cleaning process can be used to flag some cells (by giving them a red background? - NICE!).
        o The user can consider these cells for further correction. Eventually, after all the editing has been done, the user clicks the Done button.

7. The data is then mapped. The user maps columns to database fields.

<!– By MH → Generally, the first row of the spreadsheet has the column headers (No really - this will have to be selected by the user - often headers will be split into 2 rows)–>.

However, certain spreadsheets may not have these headers. In such cases, the entire spreadsheet is scanned for a row where such column headers can be found, and these are put into the first row of the spreadsheet. In case the column headers are still not found, default (custom) names are assigned to the columns.

8. Another view is rendered (the data, which has been reviewed by the user, is now stored in the data structure and is clean). Now, the user selects a table (or tables) which is already in the database. A list of tables is displayed, with table names and short descriptions shown to the user.

9. The user selects a table. The view changes and a list of the columns currently in the table is presented, and the user maps the column names in the spreadsheet to those in the table.

Edit: Validation will be done at this stage, along with data cleaning, i.e. after mapping.

<!– MH → More data cleaning will have to happen at this stage - in fact, I don't think that you have to worry about entering the data until this stage

Could you also do data validation here – flagging invalid cells?

–>

<!– @MH: Data validation will be implemented via s3 methods (when the import button is hit)? Otherwise I'll have to write separate methods, right? –>

10. S3XRC functions ( import_xml/import_json ) are now used to import the data to the database. Any errors encountered are handled by the module and the user is notified of them. The user can modify some cells if need be.

Terms

Normalization: Bringing the data to a standard format for cleaning, e.g. converting all occurrences of jAmEs to James. - This isn't normalization - that refers to another process.

APIs (tentative)

  • xlrd
  • gdocs
  • qooxdoo

Goals

  • To extract data from a spreadsheet
  • To be able to satisfactorily clean data
  • To be able to render a view which has all the cells
  • To allow the user to make changes to the extracted data prior to import
  • To ensure that data being imported is validated

Data Model

Primarily, this tool is meant to facilitate data transformation and not its storage.

The data model includes the data structure that will be used for storing the cells of the spreadsheet.

XML: An XML document can be prepared from the data structure. The XML can be parsed by using lxml (or any other XML parser). The import_xml() function in the ResourceController class can then be used to import the actual data to the table in the database.
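
As a rough illustration of this idea, the sketch below builds an XML document from the extracted rows with lxml. The element and attribute names (resources/resource/data, field) are placeholders, not the actual S3XML schema expected by import_xml().

  from lxml import etree

  rows = [["James", "Smith"], ["Mary", "Jones"]]   # sample extracted data
  fields = ["first_name", "last_name"]             # fields chosen in the mapping step

  root = etree.Element("resources")
  for row in rows:
      resource = etree.SubElement(root, "resource")
      for field, value in zip(fields, row):
          data = etree.SubElement(resource, "data", field=field)
          data.text = str(value)

  xml_doc = etree.tostring(root, pretty_print=True)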

Spreadsheet table: In case the upload field functionality of Web2Py is used to upload the spreadsheets, a table will have to be created which holds the original names of the spreadsheets as well as their new names (as modified by web2py). This will basically be a history of all the spreadsheets that have been imported to the uploads folder in web2py. <We are going to delete files after import, right?>

JSON: JSON will have to be used to transfer the extracted and parsed information to the spreadsheet-like interface, so that the user can edit the data that has been extracted from the spreadsheet. Alternatively, we could also extract spreadsheet data straight into JSON and run all other operations on it.
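
A minimal sketch of handing the extracted cells to the editing UI as JSON; the payload shape (headers plus rows) is only an assumption about what the JavaScript grid will expect.

  import json

  headers = ["First name", "Last name"]
  rows = [["James", "Smith"], ["Mary", "Jones"]]

  payload = json.dumps({"headers": headers, "rows": rows})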

Spell-check dictionary ('nice to have' functionality): This will be a table with primarily two fields (apart from id and uuid, of course). The first field will have incorrect words, and the second field will have their corrections. Of course, the usefulness of this dictionary is directly proportional to the usage of that particular instance of Eden. The dictionaries in various instances of Eden could possibly be periodically synced to improve the utility of the dictionary.
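
A hedged sketch of what this table could look like in web2py DAL syntax, written as it would appear in a model file where db and Field are in scope; the table and field names are placeholders, and in Eden the uuid would normally come from the standard meta-fields.

  db.define_table("importer_spellcheck",          # hypothetical table name
                  Field("uuid", length=128),      # normally supplied by Eden's meta-fields
                  Field("incorrect", "string"),   # the misspelt word
                  Field("correction", "string"))  # its correction

  # Usage: look up a correction for a flagged word
  match = db(db.importer_spellcheck.incorrect == "jAmEs").select().first()
  correction = match.correction if match else None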

History

version 0.4 on 30 May, 2010 00:30 - Changes made according to Michael's directions; UI mockups and flowchart added

version 0.3 on 29 May, 2010 - Michael's comments

version 0.2 on 29 May, 2010 - Flowchart added (in separate doc)

version 0.1 on 28 May, 2010 - First draft

An explanation of the working of the module

This page seeks to explain the flowchart and the kind of processing that goes on for each step. (I shall elaborate this page even further.)

Step 1

“Select import source”

The user selects whether an XLS file or a Google Docs spreadsheet is to be imported.

As soon as the selection is made, jQuery displays one of two forms.

One form is for XLS, and has the fields “File Name” and “Path” (for uploading). File type checking is done to ensure that the file uploaded is indeed an XLS file.
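
A small sketch of that check, assuming the uploaded file is already on disk: legacy .xls files are OLE2 compound documents, so the first eight bytes can be compared against the OLE2 signature in addition to checking the extension.

  OLE2_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"

  def looks_like_xls(path):
      if not path.lower().endswith(".xls"):
          return False
      with open(path, "rb") as handle:
          return handle.read(8) == OLE2_SIGNATURE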

One form is for gDocs, and asks for the username and password of the Google account. A list of spreadsheets is then fetched from gDocs, and the user subsequently selects a spreadsheet.

Alternatively, the user may just specify the URL of the spreadsheet and the data from the spreadsheet would then be extracted. However, I'm not quite sure that the gDocs API supports this functionality.

There are going to be separate controllers for handling gDocs and XLS.
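
For the gDocs branch, a hedged sketch using the legacy gdata Python client is shown below; the module and method names are what that client exposed at the time and may differ with the library version actually used.

  import gdata.spreadsheet.service

  client = gdata.spreadsheet.service.SpreadsheetsService()
  client.email = "user@example.com"        # credentials supplied through the form
  client.password = "secret"
  client.source = "sahana-eden-importer"   # hypothetical application name
  client.ProgrammaticLogin()

  feed = client.GetSpreadsheetsFeed()
  titles = [entry.title.text for entry in feed.entry]   # list shown to the user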

Step 2

“Import from source to a data structure ”

The following steps are going to be common to both gDocs and XLS.

A list of lists of the extracted data is created. This is most suitable since it reflects the manner in which the data was found and would make it easier to identify blank rows, column headers etc.

Empty rows and columns are removed.

UI begins

This will be a spreadsheet-like view of the data to be imported. Hopefully, I should be able to work out a method to effectively flag invalid data (like giving a red background to a cell).

The user will be able to modify each and every record in this view. Once identified, the column headers will be moved to the top of the view, no matter where they were defined in the original spreadsheet. Again, the column headers are editable.

Step 3

“Select column headers”

Column headers are identified; this is done by the user.

Basically, we can have radio buttons down the side of the UI, and the user can click the radio button corresponding to the row with the column headers.

Step 4

“Convert sub header rows to columns”

Sub-header rows are identified by the user, and are then copied to all the rows (specified by the user) in a particular column.
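
A hedged sketch of this conversion; it assumes each sub-header applies to the data rows beneath it until the next sub-header, rather than to an arbitrary user-specified range.

  def fold_sub_headers(rows, sub_header_rows):
      # "sub_header_rows" holds the indices the user marked as sub-headers.
      # Each sub-header's text is copied into a new column for the data rows
      # below it, and the sub-header rows themselves are dropped.
      result, current = [], ""
      for index, row in enumerate(rows):
          if index in sub_header_rows:
              current = next((str(c) for c in row if str(c).strip()), "")
          else:
              result.append(row + [current])   # append the new column value
      return result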

User presses submit.

Step 5

“Select table from the database”

A list of tables present in the Sahana DB is displayed. The user can then select which table to put the data in. The fields of this table are fetched and stored for the mapping step.
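
Inside a web2py controller this list can be built directly from the DAL, as in the minimal sketch below (table descriptions would have to come from elsewhere).

  # "db" is the web2py DAL instance; db.tables lists the defined table names
  # and db[name].fields lists the columns offered in the mapping step.
  table_choices = [(name, db[name].fields) for name in db.tables]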

Step 6

“Mapping”

After the user selects the table, a drop down menu is introduced above/below the column headers, which contains the fields present in the selected table.

The user selects fields for the different columns.

If the table in the Eden DB has fewer columns than the spreadsheet, the leftover data can be appended to the comments column.
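
A hedged sketch of applying the user's mapping, including the fallback just described of folding unmapped columns into the comments field; the mapping dict and field names are illustrative only.

  def map_row(row, headers, mapping, comments_field="comments"):
      # "mapping" is the user's column-index -> table-field choice.
      record = {field: row[col] for col, field in mapping.items() if col < len(row)}
      # Anything not mapped is concatenated into the comments field.
      extras = ["%s: %s" % (headers[col], row[col])
                for col in range(len(row)) if col not in mapping]
      if extras:
          record[comments_field] = "; ".join(extras)
      return record

  # Example: three spreadsheet columns, only two mapped to table fields
  print(map_row(["James", "Smith", "O+"],
                ["First", "Last", "Blood group"],
                {0: "first_name", 1: "last_name"}))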

Step 7

“Duplicate identification and data validation”

We now return to the screen which has the spreadsheet like view of the data to be imported.

Duplicates are now identified, and are flagged for the user. I intend to use the levenshtein function in s3vita.py to calculate the “distance” between the strings and then decide which strings are too similar.
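
For reference, a self-contained version of the distance calculation, plus a naive pass that flags suspiciously similar value pairs; the threshold is an arbitrary choice, and the real implementation would reuse the function in s3vita.py.

  def levenshtein(a, b):
      # Classic dynamic-programming edit distance.
      previous = range(len(b) + 1)
      for i, ca in enumerate(a, 1):
          current = [i]
          for j, cb in enumerate(b, 1):
              current.append(min(previous[j] + 1,                 # deletion
                                 current[j - 1] + 1,              # insertion
                                 previous[j - 1] + (ca != cb)))   # substitution
          previous = current
      return previous[len(b)]

  def flag_near_duplicates(values, threshold=2):
      # Return index pairs whose values are within "threshold" edits of each other.
      flagged = []
      for i in range(len(values)):
          for j in range(i + 1, len(values)):
              if levenshtein(values[i], values[j]) <= threshold:
                  flagged.append((i, j))
      return flagged

  print(flag_near_duplicates(["James Smith", "James Smiht", "Mary Jones"]))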

Validation will have to be done via JavaScript validation functions. I think those functions are part of another GSoC project. Validation in the UI (ie JS validation) can only be done if JS validation functions are present.

This will be done *pre-import*, i.e. invalid data will be flagged for the user and displayed before the import step.

Step 8

“Import”

Data is now imported using s3xrc. Invalid records are returned to the user.

Step 9

“Return invalid records to the UI”

Records which could not be imported are shown again to the user in the view. After modifying those records, the user can attempt importing again.

“Post import cleaning” (what happens after the user hits the import button)

Michael recently posted on the groups regarding merging and updating of the records.

Updating records:

When a certain number of fields in two records are similar, they can be identified as different instances of the same information, with one simply being newer. So, instead of inserting new records in the database, we can update the existing ones.

From the implementation POV, I feel that we should check for duplication in case insertion fails due to primary key duplication, or we can just update straight away without checking anything when the primary fields match. IMHO, to-be-inserted records are most likely to be fresher information when the primary values are the same.
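
A hedged sketch of that update-instead-of-insert idea in web2py DAL terms; the key fields named here are placeholders for whichever columns are treated as identifying the record.

  def upsert(db, table, record, key_fields=("first_name", "last_name")):
      # Build a query matching the identifying fields of the incoming record.
      query = None
      for field in key_fields:
          condition = (table[field] == record[field])
          query = condition if query is None else (query & condition)
      existing = db(query).select().first()
      if existing:
          existing.update_record(**record)   # treat the spreadsheet row as fresher
      else:
          table.insert(**record)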

Merging records:

As I see it, there are three aspects to this.

First, when both the records are in the db.

Second, when one record is in db and the to-be-inserted record is in the spreadsheet. This case maps almost exactly to the update implementation described above.

Third, when both records are in the spreadsheet. For this case, we will have to define certain parameters which can identify two records as candidates for merging. As column mapping has already been done, we can check whether the columns mapped to primary keys have redundant entries. Then, we will have to identify which row is newer. We can leave the decision to the user by flagging “merging candidate” rows and letting the user identify which row should be used.
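
A hedged sketch of this third case: rows are grouped by the values in the columns that were mapped to key fields, and any group with more than one row is flagged as a set of merging candidates for the user.

  from collections import defaultdict

  def merge_candidates(rows, key_columns):
      groups = defaultdict(list)
      for index, row in enumerate(rows):
          # Normalise the key values slightly so trivial differences still group together.
          key = tuple(str(row[c]).strip().lower() for c in key_columns)
          groups[key].append(index)
      return [indices for indices in groups.values() if len(indices) > 1]

  print(merge_candidates([["James", "Smith", "A+"],
                          ["james ", "Smith", "A-"],
                          ["Mary", "Jones", "O+"]], key_columns=[0, 1]))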

# After writing the above, I feel that these methods may not be limited just to post-import cleaning (especially spreadsheet-only merging) but could also apply to the first data cleaning step. Can we discuss this in the meeting?

History

Version|Date|Comment

version 0.6|3 June, 2010|Suggestions discussed in meeting (2 June) incorporated

version 0.5|2 June, 2010|Introduced the cleaning-after-mapping step

version 0.4|2 June, 2010|Added step 8

version 0.3|1 June, 2010|Tweaked the doc

version 0.2|29-30 May, 2010|UI explanation, changes according to Michael's suggestions

version 0.1|29 May, 2010|First draft

