
Sahana Data mining and Visualization library

The Sahana Data mining and Visualization library is a PHP library that can perform data mining, statistical analysis and visualization. It consists of two parts:

Java data mining engine
PHP data mining library front-end

Because a data mining job can take much longer (possibly hours) than the PHP execution time-out allows, the data mining algorithms are implemented as a separate engine. The engine is currently developed outside Sahana, with the intention of integrating it fully with Sahana later.

Abstract design

Usage

Data mining engine

To perform data mining, the data mining engine must be started and connected to the Sahana database. The engine is located at /sahana-set/sahana-datamining/ in the Sahana repository on SourceForge.

Requirements for Sahana data mining engine

Java SE version 6
MySQL JDBC connector

After downloading the data mining engine, compile it with the javac command and run it with the java command. The engine then asks for the database server IP address, the Sahana database name, and the database username and password. Once it has connected to the Sahana database, the engine periodically checks for data mining jobs. When a new job is added, the engine picks it up, executes the requested mining algorithm and stores the mining result in the database.

Sahana Data mining library

Module developers do not need to interact with the data mining engine directly. They perform data mining through the data mining library ( /inc/lib_datamine.inc ).

Functions required to perform a data mining job

shn_dm_begin()

To perform a data mining job, call this function first. It initializes the environment for data mining and returns the data mining job id that represents the current job.

shn_dm_set_algo($algorithm)

Pass the name of the algorithm to use for the data mining job as a string. The string must contain two parts separated by a space: 1. the algorithm type, 2. the algorithm name. Eg: shn_dm_set_algo('associations Apriori'). Calling shn_dm_get_available_algos($type) returns an array of the supported algorithms, where $type is the type of algorithm. It can be associations, classifiers or clusterers. You can find algorithm specifications here
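As an illustration of the two-part algorithm string, the sketch below splits and checks such a string before it is passed on. The helper name split_algo_spec() is hypothetical and is not part of lib_datamine.inc; how the library itself parses the string is not documented here, so treat this only as one possible way to validate input early.

```php
<?php
// Hypothetical helper (not part of lib_datamine.inc): split an algorithm
// string such as 'associations Apriori' into its type and name.
function split_algo_spec(string $spec): array
{
    // Split on whitespace after trimming the ends.
    $parts = preg_split('/\s+/', trim($spec));
    if (count($parts) !== 2) {
        throw new InvalidArgumentException("Expected '<type> <name>', got '$spec'");
    }
    [$type, $name] = $parts;

    // The three algorithm types named in this page.
    $types = ['associations', 'classifiers', 'clusterers'];
    if (!in_array($type, $types, true)) {
        throw new InvalidArgumentException("Unknown algorithm type '$type'");
    }
    return ['type' => $type, 'name' => $name];
}

$spec = split_algo_spec('associations Apriori');
echo $spec['type'] . ' / ' . $spec['name'] . "\n";
```

A check like this could run before calling shn_dm_set_algo(), so that a malformed string is rejected in the module rather than later in the engine.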

shn_dm_add_attribute($name,$type,$set=null)

Add the attributes of the data set one by one using this function. $name is the name of the attribute and must not contain spaces. $type is the type of the attribute: int, string or class. Use class when the attribute has a limited number of distinct values (Eg: blood group), and pass those distinct values as an array in $set. Eg: shn_dm_add_attribute('height','int') , shn_dm_add_attribute('blood_group','class',array('A+','A-','B+','B-','AB+','AB-','O+','O-'))

shn_dm_add_param($name,$value)

Sets a parameter for the data mining algorithm. $name is the single-character name of the parameter and $value is its value.

The available parameters depend on the algorithm. Eg: shn_dm_add_param('N','10'); selects the best 10 association rules in the associations Apriori algorithm. To run the algorithm with its default parameters, call shn_dm_load_default_params() instead; it loads the default parameters for the algorithm set by shn_dm_set_algo().

shn_dm_add_data($data)

Adds one row of data to the data mining job as an array. Call this function in a loop to add multiple records, which is the usual case.

Eg: shn_dm_add_data(array(170,'A+')); where 170 is the height and A+ is the blood group. Most of the time this function is used to add the rows of a query result to the job.

Eg:

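  // Fetch blood type and height for every person record
  $sql = "SELECT opt_blood_type,height FROM person_physical";
  $res = $db->GetAll($sql);
  foreach($res as $r)
  {
      // One row per record: height first, then blood type
      shn_dm_add_data(array($r['height'],$r['opt_blood_type']));
  }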

shn_dm_submit($label)

Submits the data mining job to the job queue. $label is a string that can be used later to retrieve the results. Eg: shn_dm_submit('mpr')
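The calls described above can be put together as in the following sketch. To make it runnable outside Sahana, it defines simple in-memory stand-ins for the library functions when they are not already loaded; the stand-ins, the job id of 1 and the sample rows are assumptions for illustration only and say nothing about how lib_datamine.inc is implemented. Inside a Sahana module you would include the library and read the rows from the database instead.

```php
<?php
// Stand-ins so the sketch runs without Sahana. These are assumptions,
// not the real lib_datamine.inc code; they only record what was passed.
if (!function_exists('shn_dm_begin')) {
    function shn_dm_begin() {
        $GLOBALS['dm_job'] = ['algo' => null, 'attributes' => [],
                              'params' => [], 'data' => [], 'label' => null];
        return 1; // hypothetical job id
    }
    function shn_dm_set_algo($algorithm) { $GLOBALS['dm_job']['algo'] = $algorithm; }
    function shn_dm_add_attribute($name, $type, $set = null) {
        $GLOBALS['dm_job']['attributes'][] = [$name, $type, $set];
    }
    function shn_dm_add_param($name, $value) { $GLOBALS['dm_job']['params'][$name] = $value; }
    function shn_dm_add_data($data) { $GLOBALS['dm_job']['data'][] = $data; }
    function shn_dm_submit($label) { $GLOBALS['dm_job']['label'] = $label; }
}

// Sample rows standing in for a query result from person_physical.
$rows = [
    ['opt_blood_type' => 'A+', 'height' => 170],
    ['opt_blood_type' => 'O-', 'height' => 182],
];

// 1. Start the job and choose the algorithm.
$job_id = shn_dm_begin();
shn_dm_set_algo('associations Apriori');

// 2. Describe the data set, one attribute at a time.
shn_dm_add_attribute('height', 'int');
shn_dm_add_attribute('blood_group', 'class',
    ['A+', 'A-', 'B+', 'B-', 'AB+', 'AB-', 'O+', 'O-']);

// 3. Set algorithm parameters (here: keep the best 10 rules).
shn_dm_add_param('N', '10');

// 4. Add the rows in the same order as the attributes.
foreach ($rows as $r) {
    shn_dm_add_data([$r['height'], $r['opt_blood_type']]);
}

// 5. Queue the job under a label used later to find the results.
shn_dm_submit('mpr');
```

The order matters in this sketch: attributes are declared before any data is added, and each data row lists its values in the same order as the attribute declarations.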

Additional Functions

shn_dm_get_arff()

After data has been added to the mining job, this function can be used to download an ARFF file containing the data set. ARFF is a file format commonly used by data mining software to store data. The downloaded file can be analyzed with any data mining software that reads ARFF, such as WEKA or RapidMiner.
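For readers unfamiliar with ARFF, the sketch below builds a small ARFF document from attribute declarations and rows. It is not the code behind shn_dm_get_arff(); the helper names build_arff() and arff_quote() are hypothetical, and the mapping of int to NUMERIC, string to STRING and class to a nominal value list is an assumption about how the library's types would translate. It needs PHP 7.4 or later because it uses an arrow function.

```php
<?php
// Hypothetical helper: leave simple tokens bare, single-quote anything else.
function arff_quote(string $v): string
{
    if (preg_match('/^[A-Za-z0-9_.-]+$/', $v)) {
        return $v;
    }
    // Escape backslashes first, then single quotes.
    return "'" . str_replace(["\\", "'"], ["\\\\", "\\'"], $v) . "'";
}

// Hypothetical helper: $attributes is a list of [name, type, set-or-null],
// $rows is a list of value arrays in attribute order.
function build_arff(string $relation, array $attributes, array $rows): string
{
    $lines = ['@RELATION ' . arff_quote($relation), ''];
    foreach ($attributes as [$name, $type, $set]) {
        if ($type === 'class') {
            // Nominal attribute: list the allowed values in braces.
            $decl = '{' . implode(',', array_map('arff_quote', $set)) . '}';
        } elseif ($type === 'int') {
            $decl = 'NUMERIC';
        } else {
            $decl = 'STRING';
        }
        $lines[] = "@ATTRIBUTE $name $decl";
    }
    $lines[] = '';
    $lines[] = '@DATA';
    foreach ($rows as $row) {
        // One comma-separated line per row.
        $lines[] = implode(',', array_map(fn($v) => arff_quote((string) $v), $row));
    }
    return implode("\n", $lines) . "\n";
}

$arff = build_arff(
    'person_physical',
    [['height', 'int', null], ['blood_group', 'class', ['A+', 'O-']]],
    [[170, 'A+'], [182, 'O-']]
);
echo $arff;
```

Values such as A+ are quoted here to stay on the safe side; ARFF readers accept quoted nominal values, so the file remains valid either way.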

shn_dm_remove_dm_job($job_id)

The data set, data mining result, statistical results and graphs remain in the Sahana database after the data mining process finishes, so that they can be used later. To remove a data mining job or its result from storage, call this function with the data mining job id.

shn_dm_get_available_algos($type)

This function returns all supported data mining algorithms of type $type. Currently the type can be associations, classifiers or clusterers.

shn_dm_get_current_jobs($label)

This function returns a list of the data mining jobs currently in the data mining results whose label matches $label.
