

Overview

In the aftermath of a disaster, a great deal of information needs to be collected about various aspects of disaster relief such as victims, organizations involved, volunteers, available shelters etc. It is often inconvenient to fill out a form for each item of information, so spreadsheets are created to cater to these data needs. Unfortunately, accessing information in these spreadsheets can be a tedious job. Many such cases require the flexibility and robustness of a database, hence the data has to be imported into one.

The spreadsheet importer extracts data from spreadsheets and parses them to a form suitable for storage in a database.

Scenarios / User stories

A disaster occurred in country XYZ two days ago. Local agencies and NGOs quickly got together to aid the victims of the disaster and have already started collecting data about the survivors and about the needs of an imminent relief mission. Although the data collected is comprehensive and complete, every agency has uploaded it in the form of spreadsheets on its own servers. Médecins Sans Frontières has stepped in, and while it has the resources to alleviate the suffering of the disaster-stricken, it has no data on the number of casualties and/or the number of survivors. The data stored by the local agencies is useful but scattered, and will have to be used if any effective relief operation is to be attempted. However, the data is so large and so diverse that manually going through each spreadsheet is a task that could take weeks.

APIs (tentative)

  • xlrd
  • gdocs
  • ExtJS

Goals

  • To extract data from a spreadsheet
  • To be able to satisfactorily clean data
  • To be able to render a view which has all the cells
  • To allow the user to make changes to the extracted data prior to import
  • To insert data into the database

Data Model

Primarily, this tool is meant to facilitate data transformation and not its storage.

The data model includes the data structure that will be used for storing the cells of the spreadsheet.

XML

An XML document can be prepared from the data structure. The XML can be parsed using lxml (or any other XML parser). The import_xml() function in the ResourceController class can then be used to import the actual data into the table in the database.

Spreadsheet table

In case the upload field functionality of Web2Py is used to upload the spreadsheets, a table will have to be created which stores the original names of the spreadsheets as well as their new names (as modified by web2py). This will basically be a history of all the spreadsheets that have been imported to the uploads folder in web2py. <We are going to delete files after import, right?>

Spell-check dictionary ('nice to have' functionality)

This will be a table with primarily two fields (apart from id and uuid, of course). The first field will have incorrect words, and the second field will have their corrections. Of course, the usefulness of this dictionary is directly proportional to the usage of that particular instance of Eden. The dictionaries in various instances of Eden could possibly be synced periodically to improve the utility of the dictionary.
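
A minimal sketch of how these two tables might be declared in web2py; the table and field names below are assumptions for illustration, not a final schema:

    # db is web2py's DAL instance, available in Eden's model files
    db.define_table("importer_spreadsheet",
                    Field("name"),                    # original file name of the spreadsheet
                    Field("path", "upload"),          # new name assigned by web2py's upload field
                    Field("imported_on", "datetime")) # when the spreadsheet was imported

    db.define_table("importer_spelling",
                    Field("incorrect", unique=True),  # misspelt word
                    Field("correct"))                 # suggested correction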

An explanation of the working of the module

This page seeks to explain the flowchart and the kind of processing that goes on for each step.

Step 1

“Select import source”

The user selects whether an XLS file or a spreadsheet is to be imported.

As soon as the selection is made, jQuery displays one of two forms.

One form is for XLS, and has the fields “File Name” and “Path” (for uploading). File type checking is done to ensure that the file uploaded is indeed XLS.

One form is for gDocs, and asks for the username and password of the Google account. A list of spreadsheets is then fetched from gDocs, and the user subsequently selects a spreadsheet.

Alternatively, the user may just specify the URL of the spreadsheet, and the data would then be extracted from it. However, I'm not quite sure that the gDocs API supports this functionality.

There are going to be separate controllers for handling gDocs and XLS.
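
A minimal sketch of how the gDocs controller could fetch the user's list of spreadsheets, assuming the gdata Python client library is used (the credentials and source string below are placeholders):

    import gdata.spreadsheet.service

    client = gdata.spreadsheet.service.SpreadsheetsService()
    client.email = "user@example.com"        # supplied by the user in the gDocs form
    client.password = "password"             # supplied by the user in the gDocs form
    client.source = "sahana-eden-importer"   # arbitrary application identifier
    client.ProgrammaticLogin()

    # list the user's spreadsheets so that one can be selected for import
    feed = client.GetSpreadsheetsFeed()
    for entry in feed.entry:
        print(entry.title.text)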

Step 2

“Import from source to a data structure”

The following steps are going to be common to both gDocs and XLS.

A list of lists of the extracted data is created. The sublists are the rows of the spreadsheet.

This format is the most suitable since it reflects the manner in which the data was found and makes it easier to identify blank rows, column headers etc. Empty rows and columns are removed.
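
A minimal sketch of the XLS extraction path, assuming xlrd is used to read the uploaded file into a list of lists (the file name is a placeholder):

    import xlrd

    book = xlrd.open_workbook("uploaded_file.xls")   # placeholder path to the uploaded XLS
    sheet = book.sheet_by_index(0)

    # build a list of lists; each sublist is one row of the spreadsheet
    rows = [[sheet.cell_value(r, c) for c in range(sheet.ncols)]
            for r in range(sheet.nrows)]

    # drop rows in which every cell is empty
    rows = [row for row in rows if any(cell != "" for cell in row)]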

XML is generated. This XML provides data in the spreadsheet to the UI.

UI begins

The data extracted from the spreadsheets is converted from the list of lists format to an XML file. This is done for two reasons:

1. A list of lists cannot be easily and properly represented in the view. It has to be converted to XML so that it can be parsed using Ext.data.XmlReader.

2. The data import using import_xml() can only be achieved if the XML is in the Sahana schema. This can be worked out using XSLT.

Now we can take the XML file, parse it and make the spreadsheet view (i.e. an editable grid).

Proposed format is:

   <data>
        <row>
             <column1>->Data in cell(1,1)<-</column1>
             <column2>->Data in cell(1,2)<-</column2>
             .
             .
             . 
             .
             . 
        </row>
        <row>
            .
            .
            .
            .
            .
            .
        </row>    
  </data>
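
A minimal sketch of generating this XML from the list of lists using lxml; the columnN element names simply follow the proposed format above:

    from lxml import etree

    def rows_to_xml(rows):
        """Build the proposed <data>/<row>/<columnN> document from a list of lists."""
        data = etree.Element("data")
        for row in rows:
            row_element = etree.SubElement(data, "row")
            for index, cell in enumerate(row):
                cell_element = etree.SubElement(row_element, "column%d" % (index + 1))
                cell_element.text = str(cell)
        return etree.tostring(data, pretty_print=True)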

This will be a spreadsheet-like view of the data present in the to-be-imported spreadsheet (minus empty rows and columns). Hopefully, I should be able to work out a method to effectively flag invalid data (like giving a red background to a cell).

The user will be able to modify each and every record in this view. Column headers can also be identified. Again, the column headers are editable.

Step 3

“Convert sub header rows to columns”

Sub-header rows are identified by the user, and are then copied to all the rows (specified by the user) in a particular column (which can also be at the end).

The user then assigns a column header to the new column.

User presses submit.
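
A minimal sketch of this transformation on the list of lists; the function name and arguments are illustrative only:

    def subheader_to_column(rows, subheader_indices):
        """Copy each sub-header value into a new trailing column of the rows below it,
        then drop the sub-header rows themselves."""
        result = []
        current = ""
        for index, row in enumerate(rows):
            if index in subheader_indices:
                current = row[0]          # sub-header text, e.g. a region name
            else:
                result.append(row + [current])
        return result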

Step 4

“Select column headers”

Column headers are identified. This is done by the user.

Basically, we can have radio buttons down the side of the UI, and the user can click the radio button corresponding to the row with the column headers.

The records are editable, therefore the user can modify the column headers too.

In the XML, we can have a header attribute on the data tag, which contains the number of the row with the column headers (as selected by the user).

Alternatively, we could also add the header attribute to the row tags (such that <row header=“true”></row>) to identify column headers.

Step 5

“Select table from the database”

A list of tables present in the Sahana DB is displayed. The user can then select which table he wants to put the data in.

The fields of this table are fetched and stored for the mapping step.

Again, we will need a pre-generated XML file with table data which can be parsed to the view.
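
A minimal sketch of gathering this information through web2py's DAL (db is the DAL instance already available in Eden controllers; the selected table name is a hypothetical example):

    # list the tables in the Sahana database and the fields of the selected table
    tables = db.tables                      # table names available for import
    selected_table = "pr_person"            # hypothetical user selection
    fields = db[selected_table].fields      # field names, stored for the mapping step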

Step 6

“Mapping”

After the user selects the table, a drop down menu is introduced beside each of the column headers, which contains the fields present in the selected table.

The user selects fields for the different columns.

If the number of columns in the Eden db is less than the number of columns in the spreadsheet, the surplus data can be appended to the comments column.
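
A minimal sketch of applying the mapping to a single row, with unmapped columns folded into the comments field; the mapping dictionary and field names are hypothetical:

    # hypothetical mapping from spreadsheet column index to table field name;
    # columns without a mapped field are appended to the comments field
    mapping = {0: "first_name", 1: "last_name", 2: "gender"}

    def map_row(row, mapping):
        record = {}
        extras = []
        for index, cell in enumerate(row):
            if index in mapping:
                record[mapping[index]] = cell
            else:
                extras.append(str(cell))
        if extras:
            record["comments"] = ", ".join(extras)
        return record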

Step 7

“Duplicate identification and data validation”

We now return to the screen which has the spreadsheet like view of the data to be imported.

Duplicates are now identified, and are flagged for the user. I intend to use the levenshtein function in s3vita.py to calculate the “distance” between the strings and then decide which strings are too similar.
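
A minimal sketch of the flagging step, assuming levenshtein() from s3vita.py takes two strings and returns their edit distance; the threshold value is an assumption:

    # levenshtein(a, b) is assumed to be the edit-distance helper from s3vita.py
    THRESHOLD = 3   # hypothetical cut-off below which two values count as "too similar"

    def find_duplicates(values):
        """Return pairs of indices whose values are close enough to be flagged."""
        pairs = []
        for i in range(len(values)):
            for j in range(i + 1, len(values)):
                if levenshtein(values[i], values[j]) <= THRESHOLD:
                    pairs.append((i, j))
        return pairs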

“Select characters to parse cell on”

Sometimes, users separate different records by using characters such as , (comma), . (dot) or | (pipe).

The cell can be parsed on such characters and new records can be created.

Eg:

Consider a spreadsheet of languages

1. English

2. Spanish

3. French|1337

4. Russian, Latin

This can be broken up into

1. English

2. Spanish

3. French

4. 1337

5. Russian

6. Latin
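
A minimal sketch of splitting cells like the ones above on a set of separator characters using Python's re module; the exact set of separators is an assumption:

    import re

    SEPARATORS = r"[,.|]"   # hypothetical set of characters to parse cells on

    def split_cell(cell):
        """Split one cell value into several records, dropping empty fragments."""
        return [part.strip() for part in re.split(SEPARATORS, cell) if part.strip()]

    # e.g. split_cell("Russian, Latin") -> ["Russian", "Latin"]
    #      split_cell("French|1337")    -> ["French", "1337"]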

Validation will have to be done via JavaScript validation functions. I think those functions are part of another GSoC project. Validation in the UI (i.e. JS validation) can only be done if the JS validation functions are present.

This will be done pre-import, i.e. invalid data will be flagged for the user and displayed before the import step.

Step 8

“Import”

Data is now imported using S3XRC. Invalid records are returned to the user.

In the UI, the data can be represented in JSON format. However, it works best to use import_xml() to import into Eden's DB, and Ext supports methods to parse XML. Therefore, it's best to keep all the data in XML format.

The XML document thus produced can then be transformed according to an XSLT stylesheet.

Then, an XML suitable for import to the Eden DB can be obtained.

Therefore: XML (spreadsheet format) -> (via XSLT) -> XML (Sahana)
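
A minimal sketch of that transformation using lxml; the file names are placeholders:

    from lxml import etree

    spreadsheet_xml = etree.parse("spreadsheet.xml")               # XML in the spreadsheet format
    stylesheet = etree.XSLT(etree.parse("spreadsheet2eden.xsl"))   # hypothetical XSLT stylesheet
    sahana_xml = stylesheet(spreadsheet_xml)                       # XML in the Sahana schema

    print(etree.tostring(sahana_xml, pretty_print=True))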

Please see http://eden.sahanafoundation.org/wiki/S3XRC

Note: Importing to multiple tables

One of the suggestions for this project was to enable importing to multiple tables. However, this is not currently supported by S3XRC.

Step 9

“Return invalid records to the UI”

Records which could not be imported are shown again to the user in the view. After modifying those records, the user can attempt importing again.

“Post import cleaning” (what happens after the user hits the import button)

Michael recently posted on the groups regarding merging and updating of the records.

Updating records:

When a certain number of fields in two records are similar, they can be identified as different instances of the same information, one being newer than the other. So, instead of inserting new records into the database, we can update the existing ones.

From the implementation POV, I feel that we should check for duplication in case insertion fails due to primary key duplication, or we can just update straightaway without checking anything in case the primary fields match. IMHO, to-be-inserted records are most likely to be fresher information when the primary values are the same.

Merging records:

As I see it, there are three aspects to this.

First, when both the records are in the db.

Second, when one record is in db and the to-be-inserted record is in the spreadsheet. This case maps almost exactly to the update implementation described above.

Third, when both records are in the spreadsheet. For this case, we will have to define certain parameters which can identify two records as candidates for merging. As column mapping has already been done, we can check if the columns mapped to primary keys have redundant entries. Then, we will have to identify which row is newer. We can leave the decision to the user by flagging “merging candidate” rows and letting the user choose which row's data is to be used.

After writing the above, I feel that these methods may not be limited just to post-import cleaning (especially spreadsheet-only merging) but could also apply to the first data cleaning step. Can we discuss this in the meeting?

History

Version | Date | Comment

version 0.7 | 5 June 2010 | Clean up of the documents
version 0.6 | 3 June 2010 | Suggestions discussed in meeting (2 June) incorporated
version 0.5 | 2 June 2010 | Introduced the cleaning after mapping step
version 0.4 | 2 June 2010 | Added step 8
version 0.3 | 1 June 2010 | Tweaked the doc
version 0.2 | 29-30 May 2010 | UI explanation, changes according to Michael's suggestions
version 0.1 | 29 May 2010 | First draft

