=== Scenarios/User stories ===
A disaster occurred in XYZ country two days ago. Local agencies and NGOs quickly got together to aid the victims of the disaster and have already started collecting data about the survivors and about the needs of an imminent relief mission. Although the data collected is comprehensive and complete, every agency has uploaded this data in the form of spreadsheets on its own servers. Médecins Sans Frontières has stepped in, and while it has the resources to alleviate the suffering of the disaster-stricken, it has no data on the number of casualties or the number of survivors. The data stored by the local agencies is useful but scattered, and it will have to be used if any effective relief operation is to be attempted. However, the data is so large and so diverse that manually going through each spreadsheet could take weeks.
    *  xlrd
    *  gdocs
    *  ExtJS
  
  
    * To be able to render a view which has all the cells
    * To allow the user to make changes to the extracted data prior to import
    * To insert data into the database
  
=== Data Model ===
__Spreadsheet table__
In case the upload field functionality of Web2Py is used to upload the spreadsheets, a table will have to be created which has the original names of the spreadsheets as well as their new names (as assigned by web2py). This will basically be a history of all the spreadsheets that have been imported to the uploads folder in web2py. <We are going to delete files after import, right?>
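
A minimal sketch of such a table in the web2py DAL (the table and field names here are placeholders, not the final design):

<code python>
# Hypothetical history table for imported spreadsheets.
# web2py's 'upload' field type stores the new, web2py-assigned file
# name, so the original name is kept in a separate string field.
db.define_table('spreadsheet_import',
                Field('name', 'string'),        # original file name
                Field('spreadsheet', 'upload')) # renamed file in uploads/
</code>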
  
__Spell-check dictionary ('nice to have' functionality)__
This will be a table with primarily two fields (apart from id and uuid, of course). The first field will hold incorrect words, and the second field will hold their corrections. Of course, the usefulness of this dictionary is directly proportional to the usage of that particular instance of Eden. The dictionaries in various instances of Eden could possibly be synced periodically to improve the utility of the dictionary.
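
A rough sketch in the web2py DAL (table and field names are placeholders; id is added automatically by the DAL, and uuid by the S3 framework):

<code python>
# Hypothetical spell-check dictionary table.
db.define_table('spreadsheet_spellcheck',
                Field('incorrect', 'string'),   # the misspelt word
                Field('correction', 'string'))  # its correction
</code>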
  
  
  
  
This page seeks to explain the flowchart and the kind of processing that goes on for each step.
  
__**Step 1**__
The following steps are going to be common to both gDocs and XLS.
  
A list of lists of the extracted data is created. The sublists are the rows in the spreadsheet.
  
This is most suitable since it reflects the manner in which the data was found and makes it easier to identify blank rows, column headers, etc.

Empty rows and columns are removed.

**XML is generated**. This XML provides the spreadsheet data to the UI.
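
A minimal sketch of this extraction, assuming a local path to the uploaded .xls file (xlrd is the library listed above; the helper name is hypothetical):

<code python>
import xlrd

def extract_rows(path):
    """Read the first sheet into a list of lists and drop empty rows
    and columns."""
    sheet = xlrd.open_workbook(path).sheet_by_index(0)
    rows = [[sheet.cell_value(r, c) for c in range(sheet.ncols)]
            for r in range(sheet.nrows)]
    # Drop rows that are entirely empty
    rows = [row for row in rows if any(str(cell).strip() for cell in row)]
    # Drop columns that are entirely empty
    keep = [c for c in range(sheet.ncols)
            if any(str(row[c]).strip() for row in rows)]
    return [[row[c] for c in keep] for row in rows]
</code>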
  
__**UI begins**__
  
{{:foundation:input.jpg|}}

{{:foundation:spreadsheeturl.jpg|}}

The data extracted from the spreadsheets is converted from the list-of-lists format to an XML file. This is done for two reasons:

1. A list of lists cannot be easily and properly represented in the view. It has to be converted to XML so that it can be parsed using Ext.data.XmlReader.

2. The data import using import_xml() can only be achieved if the XML is in the Sahana schema. This can be worked out using XSLT.

Now, we can take the XML file, parse it, and make the spreadsheet view (i.e. an editable grid).

The proposed format is:

     <data>
          <row>
               <column1>Data in cell (1,1)</column1>
               <column2>Data in cell (1,2)</column2>
               ...
          </row>
          <row>
               ...
          </row>
     </data>

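A sketch of producing this format from the list of lists (ElementTree is just one way to build it; the element names follow the proposed format above):

<code python>
from xml.etree import ElementTree as ET

def rows_to_xml(rows):
    """Serialise the list of lists into the proposed
    <data>/<row>/<columnN> XML."""
    data = ET.Element('data')
    for row in rows:
        row_el = ET.SubElement(data, 'row')
        for i, cell in enumerate(row):
            col = ET.SubElement(row_el, 'column%d' % (i + 1))
            col.text = str(cell)
    return ET.tostring(data)
</code>
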
This will be a spreadsheet-like view of the data present in the to-be-imported spreadsheet (minus empty rows and columns). Hopefully, I should be able to work out a method to effectively flag invalid data (like giving a cell a red background).

The user will be able to modify each and every record in this view. Column headers can also be identified. Again, the column headers are editable.

{{:foundation:spreadsheet.jpg?1170x500|}}
  
  
__**Step 3**__
  
-//"Select column headers"//+//"Convert sub header rows to columns"//
  
Sub header rows are identified by the user, and their values are then copied down to all the rows (specified by the user) in a particular column (which can also be placed at the end).
  
The user then assigns a column header to the new column.

The user presses **submit**.
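
Illustratively, the server-side part of this step could look like the sketch below (the function name and the append-at-the-end behaviour are assumptions based on the description above):

<code python>
def subheader_to_column(rows, subheader, first, last, cell=0):
    """Copy the value of a user-identified sub header row into a new
    column (appended at the end) for the user-specified row range,
    then drop the sub header row itself (assumed behaviour)."""
    value = rows[subheader][cell]
    for r in range(first, last + 1):
        rows[r].append(value)
    del rows[subheader]
    return rows
</code>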
  
__**Step 4**__
  
-//"Convert sub header rows to columns"//+//"Select column headers"//
  
Column headers are identified. This is done by the user.
  
Basically, we can have radio buttons down the side of the UI, and the user can click the radio button corresponding to the row with the column headers.
  
The records are editable, so the user can modify the column headers too.
  
In the XML, we can have an attribute on the data tag, header, which contains the number of the row holding the column headers (as selected by the user).

Alternatively, we could add the header attribute to the row tags (such that <row header="true"></row>) to identify the column headers.
  
__**Step 5**__
 //"Select table from the database"// //"Select table from the database"//
  
A list of tables present in the Sahana DB is displayed. The user can then select which table he wants to put the data in.
  
The fields of this table are fetched and stored for the mapping step.
  
Again, we will need a pre-generated XML file with the table data, which can be parsed to the view.

{{:foundation:selectable.jpeg|}}
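
In a web2py controller, this could be sketched as follows (the action name is hypothetical; db.tables and db[name].fields come from the web2py DAL):

<code python>
def select_table():
    """List the tables in the Sahana DB so the user can pick the
    import target; the fields are kept for the mapping step."""
    tables = [{'name': name, 'fields': db[name].fields}
              for name in db.tables]
    return dict(tables=tables)
</code>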
  
__**Step 6**__
 //"Mapping"// //"Mapping"//
  
After the user selects the table, a drop-down menu is introduced beside each of the column headers, which contains the fields present in the selected table.
  
The user selects fields for the different columns.
  
If the number of columns in the Eden db is less than the number of columns in the spreadsheet, the extra data can be appended to the comments column.

{{:foundation:mapping.jpg|}}

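A sketch of applying the chosen mapping on the server, assuming the mapping arrives as a column-header-to-field dictionary (unmapped columns fall back to the comments field, as described above):

<code python>
def apply_mapping(headers, mapping, table_fields):
    """Map each spreadsheet column header to a field of the selected
    table; anything unmapped is routed to the 'comments' field."""
    result = []
    for header in headers:
        field = mapping.get(header)
        result.append(field if field in table_fields else 'comments')
    return result
</code>
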
  
__**Step 7**__
We now return to the screen which has the spreadsheet-like view of the data to be imported.
  
Duplicates are now identified and flagged for the user. I intend to use the levenshtein function in s3vita.py to calculate the "distance" between the strings and then decide which strings are too similar.
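
A sketch of this flagging pass, assuming the levenshtein function in s3vita.py takes two strings and returns their edit distance (the similarity threshold below is an arbitrary placeholder):

<code python>
def flag_duplicates(values, levenshtein, threshold=2):
    """Return indices of cell values within `threshold` edits of one
    another, so they can be highlighted in the grid."""
    flagged = set()
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if levenshtein(values[i], values[j]) <= threshold:
                flagged.update([i, j])
    return flagged
</code>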

__//"Select characters to parse cell on"//__

Sometimes, users separate different records within a cell by using characters such as , (comma), . (dot) or | (pipe).

The cell can be split on such characters and new records can be created (see the sketch after this example).

E.g.:

Consider a spreadsheet of languages:

1. English

2. Spanish

3. French|1337

4. Russian,Latin

This can be broken up into:

1. English

2. Spanish

3. French

4. 1337

5. Russian

6. Latin
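
A sketch of the splitting step, using the separator characters from the example (the function name is hypothetical):

<code python>
import re

def split_cells(values, separators=',.|'):
    """Split each cell on the user-selected separator characters and
    emit one record per non-empty fragment."""
    pattern = '[' + re.escape(separators) + ']'
    records = []
    for value in values:
        records.extend(p.strip() for p in re.split(pattern, value) if p.strip())
    return records
</code>

With the example above, split_cells(['English', 'Spanish', 'French|1337', 'Russian,Latin']) returns ['English', 'Spanish', 'French', '1337', 'Russian', 'Latin'].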
  
Validation will have to be done via JavaScript validation functions. I think those functions are part of another GSoC project. Validation in the UI (i.e. JS validation) can only be done if JS validation functions are present.
  
This will be done **pre-import**, i.e. invalid data will be flagged for the user and displayed before the import step.
  
__**Step 8**__
 //"Import"// //"Import"//
  
Data is now imported using S3XRC. Invalid records are returned to the user.
  
In the UI, the data could be represented in the JSON format. However, it works best to use import_xml() to import into Eden's DB, and Ext supports methods to parse XML. Therefore, it's best to keep all the data in XML format.
  
The XML document thus produced can then be transformed according to an XSLT file.
  
Then, an XML document suitable for import to the Eden DB can be obtained.
  
Therefore: XML (spreadsheet format) --> (via XSLT) --> XML (Sahana)

Please see http://eden.sahanafoundation.org/wiki/S3XRC
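
As a sketch of the transform step (lxml is assumed here since it ships XSLT support, and the stylesheet is taken to encode the mapping chosen in Step 6):

<code python>
from lxml import etree

def to_sahana_xml(spreadsheet_xml, stylesheet_path):
    """Transform spreadsheet-format XML into Sahana-schema XML,
    ready to be passed to import_xml()."""
    transform = etree.XSLT(etree.parse(stylesheet_path))
    return etree.tostring(transform(etree.fromstring(spreadsheet_xml)))
</code>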

__Note:__ //Importing to multiple tables//

One of the suggestions for this project was to enable importing to multiple tables. However, this is not currently supported by S3XRC.
  
__**Step 9**__
 //"Return invalid records to the UI"// //"Return invalid records to the UI"//
  
Records which could not be imported are shown again to the user in the view. After modifying those records, the user can attempt the import again.
  
  
  
^Version^Date^Comment^
|0.7|5 June 2010|Clean-up of the documents|
|0.6|3 June 2010|Suggestions discussed in the meeting (2 June) incorporated|
|0.5|2 June 2010|Introduced the cleaning-after-mapping step|
  
