| @@ -0,0 +1,210 @@ | |||||
| This directory includes some useful tools: | |||||
| 1. subset selection tools. | |||||
| 2. parameter selection tools. | |||||
| 3. LIBSVM format checking tools. | |||||
| Part I: Subset selection tools | |||||
| Introduction | |||||
| ============ | |||||
| Training on large data sets is time-consuming. Sometimes one should work on a | |||||
| smaller subset first. The python script subset.py randomly selects a | |||||
| specified number of samples. For classification data, we provide a | |||||
| stratified selection to ensure the same class distribution in the | |||||
| subset. | |||||
| Usage: subset.py [options] dataset number [output1] [output2] | |||||
| This script selects a subset of the given data set. | |||||
| options: | |||||
| -s method : method of selection (default 0) | |||||
| 0 -- stratified selection (classification only) | |||||
| 1 -- random selection | |||||
| output1 : the subset (optional) | |||||
| output2 : the rest of data (optional) | |||||
| If output1 is omitted, the subset will be printed on the screen. | |||||
| Example | |||||
| ======= | |||||
| > python subset.py heart_scale 100 file1 file2 | |||||
| From heart_scale, 100 samples are selected (using the default stratified | |||||
| selection) and stored in file1. All remaining instances are stored in file2. | |||||
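The idea behind stratified selection can be illustrated with a minimal sketch. This is not the code in subset.py; the function name `stratified_subset` and its arguments are hypothetical, and the rounding strategy is only one possible choice. It picks indices so that each class appears in the subset in roughly the same proportion as in the full data.

```python
import random
from collections import defaultdict

def stratified_subset(labels, subset_size, seed=0):
    """Return sorted indices of a subset that roughly preserves class proportions.

    Hypothetical helper for illustration; subset.py may differ in detail.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    total = len(labels)
    remaining = subset_size
    chosen = []
    # Visit larger classes first; the last (smallest) class takes what is left,
    # so rounding errors do not change the requested subset size.
    groups = sorted(by_label.values(), key=len, reverse=True)
    for i, members in enumerate(groups):
        if i == len(groups) - 1:
            k = remaining
        else:
            k = int(round(subset_size * len(members) / total))
        # Never take more than the class has or more than is still needed.
        k = max(0, min(k, len(members), remaining))
        chosen.extend(rng.sample(members, k))
        remaining -= k
    return sorted(chosen)

labels = [1] * 60 + [-1] * 40
subset = stratified_subset(labels, 10)
print(len(subset), sum(1 for i in subset if labels[i] == 1))
```

With 60 positive and 40 negative labels, a subset of 10 keeps the 6:4 ratio. If the last class has fewer members than the remainder, this sketch returns a slightly smaller subset.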
| Part II: Parameter Selection Tools | |||||
| Introduction | |||||
| ============ | |||||
| grid.py is a parameter selection tool for C-SVM classification using | |||||
| the RBF (radial basis function) kernel. It uses the cross validation (CV) | |||||
| technique to estimate the accuracy of each parameter combination in | |||||
| the specified range and helps you to decide the best parameters for | |||||
| your problem. | |||||
| grid.py directly executes libsvm binaries (so no python binding is needed) | |||||
| for cross validation and then draws a contour of CV accuracy using gnuplot. | |||||
| You must have libsvm and gnuplot installed before using it. The package | |||||
| gnuplot is available at http://www.gnuplot.info/ | |||||
| On Mac OS X, the precompiled gnuplot file needs the library Aquaterm, | |||||
| which thus must be installed as well. In addition, this version of | |||||
| gnuplot does not support png, so you need to change "set term png | |||||
| transparent small" and use other image formats. For example, you may | |||||
| have "set term pbm small color". | |||||
| Usage: grid.py [grid_options] [svm_options] dataset | |||||
| grid_options : | |||||
| -log2c {begin,end,step | "null"} : set the range of c (default -5,15,2) | |||||
| begin,end,step -- c_range = 2^{begin,...,begin+k*step,...,end} | |||||
| "null" -- do not grid with c | |||||
| -log2g {begin,end,step | "null"} : set the range of g (default 3,-15,-2) | |||||
| begin,end,step -- g_range = 2^{begin,...,begin+k*step,...,end} | |||||
| "null" -- do not grid with g | |||||
| -v n : n-fold cross validation (default 5) | |||||
| -svmtrain pathname : set svm executable path and name | |||||
| -gnuplot {pathname | "null"} : | |||||
| pathname -- set gnuplot executable path and name | |||||
| "null" -- do not plot | |||||
| -out {pathname | "null"} : (default dataset.out) | |||||
| pathname -- set output file path and name | |||||
| "null" -- do not output file | |||||
| -png pathname : set graphic output file path and name (default dataset.png) | |||||
| -resume [pathname] : resume the grid task using an existing output file (default pathname is dataset.out) | |||||
| Use this option only if some parameters have been checked for the SAME data. | |||||
| svm_options : additional options for svm-train | |||||
| The program conducts v-fold cross validation using parameter C (and gamma) | |||||
| = 2^begin, 2^(begin+step), ..., 2^end. | |||||
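As a rough illustration of how a begin,end,step triple expands into parameter values, the following sketch enumerates the default ranges. The helper name `log2_range` is hypothetical, and the order in which grid.py actually evaluates the combinations may differ.

```python
def log2_range(begin, end, step):
    """Return [begin, begin+step, ...] without passing end (step may be negative)."""
    values = []
    current = begin
    while (step > 0 and current <= end) or (step < 0 and current >= end):
        values.append(current)
        current += step
    return values

# Defaults from the usage text: -log2c -5,15,2 and -log2g 3,-15,-2
c_exponents = log2_range(-5, 15, 2)
g_exponents = log2_range(3, -15, -2)

# Every (C, gamma) pair that would be cross validated.
grid = [(2.0 ** c, 2.0 ** g) for c in c_exponents for g in g_exponents]
print(len(c_exponents), len(g_exponents), len(grid))
```

With the defaults this gives 11 values of C and 10 values of gamma, that is, 110 cross validation runs, which is why narrowing the ranges or working on a subset first can save considerable time.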
| You can specify where the libsvm executable and gnuplot are using the | |||||
| -svmtrain and -gnuplot parameters. | |||||
| For Windows users, please use pgnuplot.exe. If you are using gnuplot | |||||
| 3.7.1, please upgrade to version 3.7.3 or higher. The version 3.7.1 | |||||
| has a bug. If you use Cygwin on Windows, please use gnuplot-x11. | |||||
| If the task is terminated accidentally or you would like to change the | |||||
| range of parameters, you can apply '-resume' to save time by re-using | |||||
| previous results. You may specify the output file of a previous run | |||||
| or use the default (i.e., dataset.out) without giving a name. Please | |||||
| note that the same conditions must be used in both runs. For example, | |||||
| you cannot use '-v 10' earlier and resume the task with '-v 5'. | |||||
| The value of some options can be "null". For example, `-log2c -1,0,1 | |||||
| -log2g "null"' means that C=2^-1,2^0,2^1 and g=LIBSVM's default gamma | |||||
| value. That is, you do not conduct parameter selection on gamma. | |||||
| Example | |||||
| ======= | |||||
| > python grid.py -log2c -5,5,1 -log2g -4,0,1 -v 5 -m 300 heart_scale | |||||
| Users (in particular MS Windows users) may need to specify the path of | |||||
| executable files. You can either change paths in the beginning of | |||||
| grid.py or specify them in the command line. For example, | |||||
| > grid.py -log2c -5,5,1 -svmtrain "c:\Program Files\libsvm\windows\svm-train.exe" -gnuplot c:\tmp\gnuplot\binary\pgnuplot.exe -v 10 heart_scale | |||||
| Output: two files | |||||
| dataset.png: the CV accuracy contour plot generated by gnuplot | |||||
| dataset.out: the CV accuracy at each (log2(C),log2(gamma)) | |||||
| The following example saves running time by loading the output file of a previous run. | |||||
| > python grid.py -log2c -7,7,1 -log2g -5,2,1 -v 5 -resume heart_scale.out heart_scale | |||||
| Parallel grid search | |||||
| ==================== | |||||
| You can conduct a parallel grid search by dispatching jobs to a | |||||
| cluster of computers which share the same file system. First, you add | |||||
| machine names in grid.py: | |||||
| ssh_workers = ["linux1", "linux5", "linux5"] | |||||
| and then set up your ssh so that the authentication works without | |||||
| asking for a password. | |||||
| The same machine (e.g., linux5 here) can be listed more than once if | |||||
| it has multiple CPUs or has more RAM. If the local machine is the | |||||
| best, you can also increase nr_local_worker. For example: | |||||
| nr_local_worker = 2 | |||||
| Example: | |||||
| > python grid.py heart_scale | |||||
| [local] -1 -1 78.8889 (best c=0.5, g=0.5, rate=78.8889) | |||||
| [linux5] -1 -7 83.3333 (best c=0.5, g=0.0078125, rate=83.3333) | |||||
| [linux5] 5 -1 77.037 (best c=0.5, g=0.0078125, rate=83.3333) | |||||
| [linux1] 5 -7 83.3333 (best c=0.5, g=0.0078125, rate=83.3333) | |||||
| . | |||||
| . | |||||
| . | |||||
| If -log2c, -log2g, or -v is not specified, default values are used. | |||||
| If your system uses telnet instead of ssh, list the computer names | |||||
| in telnet_workers. | |||||
| Calling grid in Python | |||||
| ====================== | |||||
| In addition to using grid.py as a command-line tool, you can use it as a | |||||
| Python module. | |||||
| >>> rate, param = find_parameters(dataset, options) | |||||
| You need to specify `dataset' and `options' (default ''). See the following example. | |||||
| > python | |||||
| >>> from grid import * | |||||
| >>> rate, param = find_parameters('../heart_scale', '-log2c -1,1,1 -log2g -1,1,1') | |||||
| [local] 0.0 0.0 rate=74.8148 (best c=1.0, g=1.0, rate=74.8148) | |||||
| [local] 0.0 -1.0 rate=77.037 (best c=1.0, g=0.5, rate=77.037) | |||||
| . | |||||
| . | |||||
| [local] -1.0 -1.0 rate=78.8889 (best c=0.5, g=0.5, rate=78.8889) | |||||
| . | |||||
| . | |||||
| >>> rate | |||||
| 78.8889 | |||||
| >>> param | |||||
| {'c': 0.5, 'g': 0.5} | |||||
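One possible follow-up is to turn the returned dictionary into an option string for svm-train. The sketch below assumes `param` has the form shown above; the helper name `to_svm_train_options` is hypothetical, and the assumption that a key may be absent when a parameter was not searched is ours, not something stated here.

```python
def to_svm_train_options(param):
    """Build an svm-train option string such as '-c 0.5 -g 0.5' from a parameter dict.

    Hypothetical helper; only the keys 'c' and 'g' are considered.
    """
    parts = []
    for name in ('c', 'g'):
        if name in param:          # skip parameters that were not searched
            parts.append('-{0} {1}'.format(name, param[name]))
    return ' '.join(parts)

param = {'c': 0.5, 'g': 0.5}       # values from the session above
print(to_svm_train_options(param))
```

The resulting string can then be passed to svm-train together with any other options used during the search.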
| Part III: LIBSVM format checking tools | |||||
| Introduction | |||||
| ============ | |||||
| `svm-train' conducts only a simple check of the input data. To do a | |||||
| detailed check, we provide a python script `checkdata.py.' | |||||
| Usage: checkdata.py dataset | |||||
| Exit status (returned value): 1 if there are errors, 0 otherwise. | |||||
| This tool was written by Rong-En Fan at National Taiwan University. | |||||
| Example | |||||
| ======= | |||||
| > cat bad_data | |||||
| 1 3:1 2:4 | |||||
| > python checkdata.py bad_data | |||||
| line 1: feature indices must be in an ascending order, previous/current features 3:1 2:4 | |||||
| Found 1 lines with error. | |||||
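The kind of check reported above can be sketched in a few lines. This is not the code of checkdata.py, which performs more checks; the function name `indices_ascending` is hypothetical, and treating repeated indices as errors is an assumption of this sketch.

```python
def indices_ascending(line):
    """Return True if the feature indices on one LIBSVM-format line strictly increase."""
    tokens = line.split()
    previous = None
    for token in tokens[1:]:                     # tokens[0] is the label
        index = int(token.split(':', 1)[0])      # "3:1" -> feature index 3
        if previous is not None and index <= previous:
            return False
        previous = index
    return True

print(indices_ascending("1 3:1 2:4"))   # the line from bad_data
print(indices_ascending("1 2:4 3:1"))   # same features in ascending order
```

The first call reports a problem for the line in bad_data, and the second accepts the same features once they are sorted by index.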