Discretize

Discretize is a TCL program that divides continuous data into an arbitrary number of discrete bins. There are two ways to divide the observations: you can provide the bin boundaries yourself, or you can state how many bins you want and let the program find boundaries such that every bin contains an equal number of observations. Each variable can be discretized separately.

The data file must be comma-separated, with one observation on each row and one variable in each column. Because the comma is the separator, numbers must use a decimal point instead of a decimal comma. The first row must contain the names of the variables. Missing values must be marked with N/A (see the example below).
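The format described above can be checked before loading a file into the program. The following is a minimal Python sketch, not part of Discretize itself. The function names (read_table, discretizable_columns) are hypothetical, and the rule "a column is discretizable when every non-missing cell parses as a number" is an assumption, since the program's own test is not documented here.

```python
import csv
import io

MISSING = "N/A"  # missing-value marker required by the data format


def read_table(text):
    """Split comma-separated text into a header row and data rows."""
    reader = csv.reader(io.StringIO(text))
    rows = [[cell.strip() for cell in row] for row in reader if row]
    return rows[0], rows[1:]


def is_number(cell):
    """True if the cell parses as a number written with a decimal point."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def discretizable_columns(header, rows):
    """Names of columns whose non-missing cells are all numeric (assumed rule)."""
    names = []
    for index, name in enumerate(header):
        cells = [row[index] for row in rows if row[index] != MISSING]
        if cells and all(is_number(cell) for cell in cells):
            names.append(name)
    return names


sample = """Header 1,Header 2,Header 3,Header 4,Header 5
1,2,3,4,a
2,3,4,5,b
-3,N/A,-2,100,c
0,0,0,0,d
"""
header, rows = read_table(sample)
print(discretizable_columns(header, rows))
```

Note that a value written with a decimal comma, such as 1,5, would be split into two cells by this reader, which is why the format requires a decimal point.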

The program writes the data out in the same format in which it was read, but each observation is replaced with the corresponding discrete bin (see the example below). The data file and the result file have the same extension, but their names cannot be identical. This prevents accidental overwriting of the original data.

File                Size     Description
discretize-2.2.zip  14.7 kB  Discretize TCL file and instructions for use
discretize          1.3 MB   Freewrap compiled binary version for Linux. Version 2.2
discretize.exe      2.4 MB   Freewrap compiled binary version for Windows. Version 2.2

Running the program:

To run the program you need a TCL interpreter on your computer. If you don't have one, you can get either a pre-compiled binary or the source files from http://www.tcl.tk/ . Many Linux distributions include a TCL interpreter in their default installation. Running the program with an interpreter is the most efficient way to run it.

Another way to run Discretize is to use a program called Freewrap. Freewrap makes it possible to use TCL programs even if you cannot install programs on your computer yourself. Just download the latest release of Freewrap, unpack it, and follow its instructions for creating an executable.

Instructions for use:

On the first screen, shown when you start the program, you choose the data file (which must exist) and the result file (which does not have to exist). If you select an existing file as the result file, the program asks you to confirm that you want to overwrite it. This screen also has a log window where the program prints information about the work done.

When you hit "Ok", the program tries to read the given data file and gives a warning if the data looks suspicious. If reading succeeds, the program opens a new window that lists all column headings, with an entry field below each one. This is where you tell the program what kind of bins you want. Only columns that can be discretized are shown. Enter your instructions in the entry fields. There are three possibilities:

  1. You can enter the boundaries of the bins separated by commas. For example, entering 0,1,2,3 creates bins 0-1, 1-2 and 2-3. The program gives a warning if any of the observations does not fall into the range of any bin.
  2. You can enter just the number of bins (a single number). The bins are then created so that the smallest value in the column is the lower boundary of the lowest bin and the largest value in the column is the upper boundary of the highest bin. The other boundaries are set so that the bins hold as nearly equal a number of observations as possible. For example, entering 3 in the entry field divides the observations 3, 5, 6, 9 and 10 into bins 3-5.5, 5.5-9.5 and 9.5-10. The program creates at most as many bins as there are observations.
  3. If you leave the entry field empty, the column is left untouched. This way you can skip numerical columns you don't want to discretize.
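The three kinds of instruction can be illustrated with a short Python sketch. It is not the program's own TCL code, and the function names are hypothetical. The equal-count rule used here (sort the values, split them into groups, and place each interior boundary midway between neighbouring groups) is an assumption chosen because it reproduces the 3, 5, 6, 9, 10 example above. The real program may handle remainders and tied values differently.

```python
def parse_instruction(text):
    """Classify one entry field: blank, a bin count, or a list of boundaries."""
    parts = [p.strip() for p in text.split(",") if p.strip()]
    if not parts:
        return ("skip", None)           # case 3: leave the column untouched
    if len(parts) == 1:
        return ("count", int(parts[0]))  # case 2: a single number of bins
    return ("boundaries", [float(p) for p in parts])  # case 1


def equal_frequency_boundaries(values, bins):
    """Boundaries giving each bin about the same number of observations.

    Assumes at least one value and bins >= 1; tied values are not treated
    specially in this sketch.
    """
    ordered = sorted(values)
    n = len(ordered)
    bins = min(bins, n)  # at most as many bins as observations
    boundaries = [ordered[0]]  # smallest value opens the lowest bin
    for i in range(1, bins):
        cut = -(-i * n // bins)  # ceil(i * n / bins): index of first value in bin i
        # interior boundary: midpoint between the two neighbouring observations
        boundaries.append((ordered[cut - 1] + ordered[cut]) / 2)
    boundaries.append(ordered[-1])  # largest value closes the highest bin
    return boundaries


kind, count = parse_instruction("3")
print(kind, equal_frequency_boundaries([3, 5, 6, 9, 10], count))
```

With this rule the five observations are grouped as {3, 5}, {6, 9} and {10}, so the bins cannot hold exactly the same number of observations when the count does not divide evenly.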

When you hit the "Discretize" button, the program processes all the columns and writes the discretized results to the result file. It then writes to the log window the names of the data and result files and the bins that were created for each column. After that the program is ready for the next discretization.

Example:

Let's assume the input file contains the following data:

    Header 1,Header 2,Header 3,Header 4,Header 5
    1,2,3,4,a
    2,3,4,5,b
    -3,N/A,-2,100,c
    0,0,0,0,d

We enter the following instructions into the entry fields of the columns:

    Header 1    -4,-2,0,2,4
    Header 2    -10,0,10
    Header 3    2
    Header 4    2

(Header 5 is not shown because the fifth column contains data that cannot be discretized.)

This means that the program creates bins -4 - -2, -2 - 0, 0 - 2 and 2 - 4 for the first column and bins -10 - 0 and 0 - 10 for the second column, and divides the third and fourth columns into two bins each (namely -2 - 1.5 and 1.5 - 4 for the third column and 0 - 4.5 and 4.5 - 100 for the fourth column). The fifth column is left as it is.

The contents of the output file would be the following:

    Header 1,Header 2,Header 3,Header 4,Header 5
    0 - 2,0 - 10,1.5 - 4,0 - 4.5,a
    2 - 4,0 - 10,1.5 - 4,4.5 - 100,b
    -4 - -2,N/A,-2 - 1.5,4.5 - 100,c
    0 - 2,0 - 10,-2 - 1.5,0 - 4.5,d
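The mapping from observations to bin labels in this output can be sketched in Python as follows. This is an illustration, not the program's TCL code, and the function names are hypothetical. Two details are inferred from the example rather than stated in the description: a value equal to a boundary goes to the upper bin (2 becomes 2 - 4), and the highest bin also includes its upper boundary (100 becomes 4.5 - 100). Leaving an out-of-range value unchanged is purely an assumption, since the description says only that the program warns about such values.

```python
def bin_label(value, boundaries):
    """Label of the bin containing value, or None if it is out of range."""
    last = len(boundaries) - 2  # index of the highest bin
    for i in range(len(boundaries) - 1):
        low, high = boundaries[i], boundaries[i + 1]
        # lower boundary inclusive; upper boundary inclusive only in the last bin
        if low <= value < high or (i == last and value == high):
            return f"{low:g} - {high:g}"
    return None


def discretize_row(row, boundaries_by_column):
    """Replace each numeric cell with its bin label; keep N/A and other columns."""
    out = []
    for index, cell in enumerate(row):
        bounds = boundaries_by_column.get(index)
        if bounds is None or cell == "N/A":
            out.append(cell)
            continue
        label = bin_label(float(cell), bounds)
        out.append(cell if label is None else label)  # assumed out-of-range handling
    return out


# Boundaries from the example: columns 1-2 entered by hand, columns 3-4 from "2".
boundaries_by_column = {
    0: [-4, -2, 0, 2, 4],
    1: [-10, 0, 10],
    2: [-2, 1.5, 4],
    3: [0, 4.5, 100],
}
rows = [
    ["1", "2", "3", "4", "a"],
    ["2", "3", "4", "5", "b"],
    ["-3", "N/A", "-2", "100", "c"],
    ["0", "0", "0", "0", "d"],
]
for row in rows:
    print(",".join(discretize_row(row, boundaries_by_column)))
```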

The program would write the following to the log:

    Found 5 columns and 5 rows from the data file.
    4 of the columns are discretizable.
    2004-10-17 17:10:41:
    /home/edu/data.txt ->
    /home/edu/result.txt
    Header 1 (4 bins):
    -4 - -2
    -2 - 0
    0 - 2
    2 - 4
    Header 2 (2 bins):
    -10 - 0
    0 - 10
    Header 3 (2 bins):
    -2 - 1.5
    1.5 - 4
    Header 4 (2 bins):
    0 - 4.5
    4.5 - 100
    Header 5 (not discretized):