foundation:gsoc_kohli_specs [2010/06/04 10:46]
shikharkohli
foundation:gsoc_kohli_specs [2010/12/18 17:35] (current)
Line 8: Line 8:
 === Scenarios/ User stories ===
 A disaster occurred in XYZ country two days ago. Local agencies and NGOs quickly got together to aid the victims of the disaster and have already started collecting data about the survivors and about the needs of an imminent relief mission. Although the data collected is comprehensive and complete, every agency has uploaded this data in the form of spreadsheets on its own servers. Médecins Sans Frontières has stepped in and, while it has the resources to alleviate the suffering of the disaster-stricken, it has no data on the number of casualties or the number of survivors. The data stored by the local agencies is useful but scattered, and it will have to be used if any effective relief operation is to be attempted. However, the data is so large and so diverse that manually going through each spreadsheet is a task that could take weeks.
- 
-=== Brief Outline of the System === 
- 
- 
-1. The user fills out a form specifying the path of the XLS file or google doc. 
- 
- 
-#The following steps are invisible to the user - 
- 
-<!-- By MH (29 May) No, these CANNOT happen automatically. The user will have to identify many of the features here (underlined). Don't presume that the import  can be done automatically  --> 
- 
-2.    
-    o Next, the upload facility in web2py (using the upload field in web2py) is used to fetch the XLS file from a remote computer to the server. It is now stored in the uploads folder and has been assigned a different name (recorded in the database). 
-    o A list of spreadsheets imported from gdocs is presented to the user. 
- 
- 
-3. The module then uses xlrd to read the cells present in the spreadsheet. These cells are stored in a list of lists format. Alternative data structures can be used too (the use of lxml has been suggested as well, so an xml tree). 
- 
- 
-4. Empty rows and columns are automatically removed (is this fine?). 
- 
- 
-<!-- 
-5. Data type checking can also be done at this stage. The cleaning is implemented mainly using the levenshtein function in s3vita.py. The levenshtein function measures the 'distance' between two strings: the minimum number of insertions, deletions, and substitutions required to change one string into another. 
---> 
- 
-#User view 
- 
- 6. Spreadsheet view:  
- 
-          o The data (after being extracted and cleaned) is then put into a spreadsheet which is displayed using a suitable JavaScript library (please check this). 
-          o The cells can be edited by the user. 
-          o Entire rows and columns may be dropped according to the user's requirements. 
-          o The data cleaning process can be used to flag some cells (by giving them a red background? - NICE!). 
-          o The user can consider these cells for further correction. Eventually, after all the editing has been done, the user clicks the Done button. 
- 
-7. The data is then mapped. The user maps columns to database fields. 
- 
-<!-- By MH -> Generally, the first row of the spreadsheet has the column headers (No really - this will have to be selected by the user - often headers will be split into 2 rows) --> 
- 
- 
-However, certain spreadsheets may not have these headers. In such cases, the entire spreadsheet is scanned for a row where such column headers can be found, and these are put into the first row of the spreadsheet. In case the column headers are still not found, default names (custom) are assigned to the columns. 
- 
-8. Another view is rendered (the data, which has been reviewed by the user, has been stored in the data structure; this data is clean). Now, the user selects a table (or tables) which is already in the database. A list of tables is displayed; table names with short descriptions are shown to the user. 
- 
-9. The user selects a table. The view changes and a list of columns currently in the table is presented, and the user maps the column names in the spreadsheet to those in the table. 
- 
-Edit: Validation will be done at this stage + data cleaning, i.e. after mapping 
- 
-<!-- MH -> More data cleaning will have to happen at this stage - in fact, I don't think that you have to worry about entering the data until this stage 
- 
- 
-Could you also do data validation here -- flagging invalid cells? 
- 
---> 
- 
-<!-- @MH: Data validation will be implemented via s3 methods (when the import button is hit)? Otherwise I'll have to write separate methods, right? --> 
- 
- 
-10. S3XRC functions ( import_xml/import_json ) are now used to import the data into the database. Any errors encountered are handled by the module and the user is notified of the same. The user can modify some cells if need be. 
- 
- 
-=== Terms === 
- 
-Normalization: Bring the data to a standard format for cleaning. E.g. convert all occurrences of jAmEs to James - This isn't normalization - that refers to another process. 
                
  
Line 80: Line 14:
     *  xlrd
     *  gdocs
-    *  qooxdoo      +    *  ExtJS      
  
  
Line 90: Line 24:
     * To be able to render a view which has all the cells
     * To allow the user to make changes to the extracted data prior to import
-    * To ensure that data being imported is validated+    * To insert data into the database
  
 === Data Model ===
Line 104: Line 38:
 __Spreadsheet table__
 In case the upload field functionality of web2py is used to upload the spreadsheets, a table will have to be created which has the original names of the spreadsheets as well as their new names (as modified by web2py). This will basically be a history of all the spreadsheets that have been imported to the uploads folder in web2py. <We are going to delete files after import, right?>
- 
-__JSON__ 
-JSON will have to be used to transfer the information extracted and parsed to the spreadsheet like interface for the user to edit the data that has been extracted from the spreadsheet. Alternatively, we could also extract spreadsheet data straightaway into JSON, and run all other operations on it. 
  
 __Spell-check dictionary ('nice to have' functionality)__
 This will be a table with primarily two fields (apart from id and uuid, of course). The first field will have incorrect words, and the second field will have their corrections. Of course, the usefulness of this dictionary is directly proportional to the usage of that particular instance of Eden. The dictionaries in various instances of Eden could possibly be periodically synced to improve the utility of the dictionary.
- 
-History 
- 
-version 0.4 on 30 May,2010 0030 
-Changes made according to Michael's directions, UI mockups and flowchart added 
- 
-version 0.3 on 29 May, 2010 
-Michael's comments 
- 
-version 0.2 on 29 May,2010 
-Flowchart added (in separate doc) 
- 
-version 0.1 on 28 May,2010 
-First draft 
  
  
Line 129: Line 46:
  
  
-This page seeks to explain the flowchart and the kind of processing that goes on for each step. (I shall elaborate this page even further.)+This page seeks to explain the flowchart and the kind of processing that goes on for each step.
  
 __**Step 1**__
Line 154: Line 71:
 The following steps are going to be common to both gDocs and XLS.
  
-A list of lists of the extracted data is created. This is most suitable since it reflects the manner in which data was found and would make it easier to identify blank rows, column headers etc.+A list of lists of the extracted data is created. The sublists are the rows in the spreadsheet.
  
 +This is most suitable since it reflects the manner in which data was found and would make it easier to identify blank rows, column headers etc
 Empty rows and columns are removed.
 +
 +**XML is generated**. This XML provides the data in the spreadsheet to the UI.
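A minimal sketch of the extraction step, assuming the sheet has already been read with xlrd into a list of lists (e.g. `[sheet.row_values(r) for r in range(sheet.nrows)]`); the function name is illustrative:

```python
def prune(rows):
    # Drop rows that contain only empty cells...
    rows = [r for r in rows if any(str(c).strip() for c in r)]
    if not rows:
        return []
    # ...then drop columns that are empty in every remaining row.
    keep = [i for i in range(len(rows[0]))
            if any(str(r[i]).strip() for r in rows)]
    return [[r[i] for i in keep] for r in rows]
```

The pruned list of lists is what gets serialised to XML for the UI.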
 +
  
 __**UI begins**__
  
-{{:foundation:uimockups1.jpg|}} 
  
-This will be a spreadsheet-like view of the data to be imported. Hopefully, I should be able to work out a method to effectively flag invalid data (like giving a red background to a cell).+{{:foundation:input.jpg|}}
  
-The user will be able to modify each and every record in this view. The column headers, after being identified, will be moved to the top of the view, no matter where they were defined in the original spreadsheet. Again, the column headers are editable.+{{:foundation:spreadsheeturl.jpg|}}
  
 +The data extracted from the spreadsheets is converted from list-of-lists format to an XML file. This is done for two reasons:
 +1. A list of lists cannot be easily and properly represented in the view. It has to be converted to XML so that it can be parsed using Ext.data.XmlReader.
  
-__**Step 3**__+2. Secondly, the data import using import_xml() can only be achieved if the XML is in the Sahana schema. This can be worked out using XSLT.
  
-//"Select column headers"//+Now, we can use the XML file, parse it and make the spreadsheet view (i.e. an editable grid).
  
-Column headers are identified. This is done by the user. 
  
-Basically, we can have radio buttons down the side of the UI, and the user can click the radio button corresponding to the row with the column headers.+Proposed format is: 
  
-__**Step 4**__+     <data> 
 +          <row> 
 +               <column1>->Data in cell(1,1)<-</column1> 
 +               <column2>->Data in cell(1,2)<-</column2> 
 +               . 
 +               . 
 +               .  
 +               . 
 +               .  
 +          </row> 
 +          <row> 
 +              . 
 +              . 
 +              . 
 +              . 
 +              . 
 +              . 
 +          </row>     
 +    </data> 
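The proposed format above could be generated from the list of lists roughly like this (a sketch using the standard library; the `->`/`<-` markers in the mock-up are treated as placeholders and omitted):

```python
import xml.etree.ElementTree as ET

def rows_to_xml(rows):
    # Wrap each row in <row> and each cell in <columnN>, under a <data> root.
    data = ET.Element("data")
    for r in rows:
        row = ET.SubElement(data, "row")
        for i, cell in enumerate(r, start=1):
            ET.SubElement(row, "column%d" % i).text = str(cell)
    return ET.tostring(data, encoding="unicode")
```

Ext.data.XmlReader on the client side can then be pointed at the `row` records.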
 + 
 + 
 +This will be a spreadsheet-like view of the data present in the to-be-imported spreadsheet (minus empty rows and columns). Hopefully, I should be able to work out a method to effectively flag invalid data (like giving a red background to a cell). 
 + 
 +The user will be able to modify each and every record in this view. Column headers can also be identified. Again, the column headers are editable. 
 + 
 +{{:foundation:spreadsheet.jpg?1170x500|}} 
 + 
 + 
 +__**Step 3**__
  
 //"Convert sub header rows to columns"//
  
-Sub header rows are identified by the user, and are then copied to all the rows (specified by the user) in a particular column (which can also be at the end). 
 + 
 +The user then assigns a column header to the new column.
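A rough sketch of this step (function and argument names are illustrative): the user-marked sub-header rows are removed, and their text is appended as a new trailing column to the rows that follow them.

```python
def subheaders_to_column(rows, subheader_rows):
    # subheader_rows: indices of the rows the user marked as sub-headers.
    out, current = [], ""
    for i, row in enumerate(rows):
        if i in subheader_rows:
            current = row[0]  # sub-header text assumed to sit in the first cell
        else:
            out.append(row + [current])  # copy it into the new trailing column
    return out
```

The user then names this new column like any other header.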
  
 User presses **submit**.
 +
 +__**Step 4**__
 +
 +//"Select column headers"//
 +
 +Column headers are identified. This is done by the user.
 +
 +Basically, we can have radio buttons down the side of the UI, and the user can click the radio button corresponding to the row with the column headers.
 +
 +The records are editable, therefore the user can modify the column headers too.
 +
 +In the XML, we can have an attribute for the data tag, header, which contains the row number having the header columns (as selected by the user).
 +
 +Alternatively, we could also add the header attribute to the row tags (such that <row header="true"></row>) to identify column headers.
  
 __**Step 5**__
  
 //"Select table from the database"//
- 
  
 A list of tables present in the Sahana DB is displayed. The user can then select which table he wants to put the data in. 
  
 The fields of this table are fetched and stored for the mapping step.
 +
 +Again, we will need a pre-generated XML file with table data which can be parsed into the view.
 +
 +{{:foundation:selectable.jpeg|}}
  
 __**Step 6**__
Line 202: Line 170:
 If the number of columns in the Eden DB is less than the number of columns in the spreadsheet, the extra data can be appended to the comments column.
  
-{{:foundation:uimockups2_001.jpg|}}+{{:foundation:mapping.jpg|}} 
  
 __**Step 7**__
Line 211: Line 180:
  
 Duplicates are now identified and flagged for the user. I intend to use the levenshtein function in s3vita.py to calculate the "distance" between the strings and then decide which strings are too similar.
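For illustration, a standard dynamic-programming Levenshtein distance and a naive pairwise flagging pass; the real module would reuse the function in s3vita.py, and the threshold here is an assumption:

```python
def levenshtein(a, b):
    # Classic edit distance: minimum insertions, deletions and
    # substitutions needed to turn string a into string b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def flag_duplicates(values, threshold=2):
    # Flag pairs whose distance is at or below the (assumed) threshold.
    return [(x, y) for i, x in enumerate(values)
            for y in values[i + 1:] if levenshtein(x, y) <= threshold]
```

The flagged pairs would then be highlighted in the grid for the user to resolve.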
 +
 +__//"Select characters to parse cell on"//__
 +
 +Sometimes, users separate different records by using characters such as , (comma), . (dot) or | (pipe).
 +
 +The cell can be split on such characters and new records can be created.
 +
 +Eg:
 +
 +Consider a spreadsheet of languages
 +
 +1. English
 +
 +2. Spanish
 +
 +3. French|1337
 +
 +4. Russian, Latin
 +
 +This can be broken up into
 +
 +1. English
 +
 +2. Spanish
 +
 +3. French
 +
 +4. 1337
 +
 +5. Russian
 +
 +6. Latin
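The splitting in the example above might be sketched like this (the separator set is an assumption and would be chosen by the user):

```python
import re

def split_cell(value, separators=",.|"):
    # Split a cell on the user-chosen separator characters and
    # strip whitespace; each non-empty piece becomes its own record.
    parts = re.split("[%s]" % re.escape(separators), value)
    return [p.strip() for p in parts if p.strip()]
```

So `"French|1337"` becomes two records and `"Russian, Latin"` becomes two records, as in the language list above.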
  
 Validation will have to be done via JavaScript validation functions. I think those functions are part of another GSoC project. Validation in the UI (i.e. JS validation) can only be done if JS validation functions are present.
Line 222: Line 223:
 Data is now imported using S3XRC. Invalid records are returned to the user.
  
-In the UI, the data is present in the JSON format. However, it works best to use import_xml() to import into Eden's DB. Therefore, the JSON will have to be converted to an elementary XML format.+In the UI, the data can be represented in the JSON format. However, it works best to use import_xml() to import into Eden's DB. Ext supports methods to parse XML. 
 +Therefore, it's best to keep all the data in XML format.
  
 The XML document thus produced could then be transformed according to an XSLT file.
Line 228: Line 230:
 Then, an XML suitable for import into the Eden DB can be obtained.
  
-Therefore, JSON --> XML (spreadsheet format) --> (via XSLT) --> XML (sahana)+Therefore,  XML (spreadsheet format) --> (via XSLT) --> XML (sahana) 
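As a rough stand-in for the XSLT step, the tag renaming can be sketched in plain Python. The `s3xml`/`resource`/`data` element shape follows the S3XML convention, but the resource name, field names and mapping here are purely illustrative:

```python
import xml.etree.ElementTree as ET

def to_sahana_xml(spreadsheet_xml, mapping, resource="pr_person"):
    # mapping: {"column1": "first_name", ...} as chosen by the user in
    # the mapping step; unmapped columns are dropped.
    src = ET.fromstring(spreadsheet_xml)
    root = ET.Element("s3xml")
    for row in src.findall("row"):
        rec = ET.SubElement(root, "resource", name=resource)
        for cell in row:
            field = mapping.get(cell.tag)
            if field:
                d = ET.SubElement(rec, "data", field=field)
                d.text = cell.text
    return ET.tostring(root, encoding="unicode")
```

In the actual module an XSLT stylesheet would perform this transformation, keeping the mapping declarative.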
 + 
 +Please see http://eden.sahanafoundation.org/wiki/S3XRC 
 + 
 +__Note:__//Importing to multiple tables// 
 + 
 +One of the suggestions for this project was to enable importing to multiple tables. However, this is not currently supported by S3XRC.
  
 __**Step 9**__
Line 235: Line 243:
  
 Records which could not be imported are shown again to the user in the view. After modifying those records, the user can attempt importing again.
- 
  
  
Line 266: Line 273:
  
 ^Version^Date^Comment^
 +
 +
 +version 0.7|5 June, 2010|Clean up of the documents
 +
 version 0.6|3 June, 2010|Suggestions discussed in meeting (2 June) incorporated
 +
 version 0.5|2 June, 2010|Introduced the cleaning after mapping step
  
Line 275: Line 287:
 version 0.2|29-30 May, 2010|UI explanation, changes according to Michael's suggestions
  
-version 0.1|29 May, 2010|First draft{{:foundation:uimockups1.jpg|}}+version 0.1|29 May, 2010|First draft
