Weka - Modified for Data Mining Course at WPI

Introduction

Weka is a collection of machine learning algorithms for data mining tasks. It contains tools for data preprocessing, classification, regression, clustering, association rules and visualization. Since the Weka system is open source (covered by the GNU General Public License), people can modify it for their own use, as seen in the large list of Weka-related projects on the Weka website. This site provides a modified version of the Weka system, which adds more filters for preprocessing, an integrated multivariate visualization system and tools for similarity analysis of time series datasets.

The project started as a way to learn the Weka system for the Data Mining class at WPI (Worcester Polytechnic Institute). As the different topics of Data Mining were covered in the class, the need arose to understand the system, both the implementation of the algorithms and the framework as a whole. Initially, the aim was to implement filters to aid in preprocessing datasets. But as the course progressed and more of the system's internals were learned, adding more utilities to the Weka system became something of a passion.

Features

New Filters:

  1. Remove Missing Instances Filter: This is the simplest filter one can implement in the Weka system. The filter simply removes all the instances that have missing values in the dataset (relation).
  2. Scale By Means Filter: This is also a simple filter. It adjusts each value by the mean of all data points in its attribute, so that the mean of every numeric attribute is zero.
  3. Fourier Transform Filter: This is a somewhat more complex filter, which creates Fourier-transformed attributes from the original dataset. Each time-domain series is assumed to be given as an attribute of the data. The output has two attributes for every input attribute, holding the real and imaginary parts of the coefficients of the transformed frequency-domain series. One can specify how many coefficients to retain after preprocessing the dataset.
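
The first two filters can be illustrated with a minimal sketch in plain Java. This is not the code shipped in the distribution and it does not use the Weka filter API: the class name PreprocessSketch and both method names are hypothetical, the dataset is assumed to be a rectangular array of numeric values, and a missing value is assumed to be encoded as NaN. Scaling by means is read here as subtracting the attribute mean, since that is what makes the resulting mean zero.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; not part of Weka or of this distribution.
final class PreprocessSketch {

    // Keep only the rows (instances) that contain no missing value.
    // Assumption: a missing value is encoded as NaN.
    static double[][] removeMissing(double[][] rows) {
        List<double[]> kept = new ArrayList<>();
        for (double[] row : rows) {
            boolean complete = true;
            for (double v : row) {
                if (Double.isNaN(v)) {
                    complete = false;
                    break;
                }
            }
            if (complete) {
                kept.add(row.clone());
            }
        }
        return kept.toArray(new double[0][]);
    }

    // Subtract each column's (attribute's) mean so that every column has mean zero.
    // Assumption: all rows have the same length and contain no NaN.
    static double[][] centerByMeans(double[][] rows) {
        if (rows.length == 0) {
            return new double[0][];
        }
        int cols = rows[0].length;
        double[] mean = new double[cols];
        for (double[] row : rows) {
            for (int j = 0; j < cols; j++) {
                mean[j] += row[j] / rows.length;
            }
        }
        double[][] out = new double[rows.length][cols];
        for (int i = 0; i < rows.length; i++) {
            for (int j = 0; j < cols; j++) {
                out[i][j] = rows[i][j] - mean[j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] data = { {1.0, 10.0}, {Double.NaN, 20.0}, {3.0, 30.0} };
        double[][] clean = removeMissing(data);      // drops the second row
        double[][] centered = centerByMeans(clean);  // each column now averages to zero
        System.out.println(Arrays.deepToString(centered));
    }
}
```

Removing incomplete rows before centering keeps NaN values from contaminating the column means.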

New Panels:
  1. Time Series Panel: A panel within the Explorer GUI of Weka that allows one to add Time Series Analysis tools to the Weka system. One example of such a tool has been added: the Similarity Analysis tool. The tool takes a set of time-domain sequences and reports which of them are similar. For two sequences, it first computes the Euclidean distance in the frequency domain using a reduced number of coefficients. If this distance is less than some threshold value, the actual Euclidean distance in the time domain is computed, and if that is also less than the threshold, the two sequences are deemed similar. For more details, please look at the Project web page for Sequence Mining in Data Mining.
    [Figure: Time Series Panel]
  2. XMDV Panel: A panel that adds four multivariate visualizations for the data loaded in Weka: Parallel Coordinates, Scatterplot, Star Glyphs and Dimensional Stacking. You can find more details on these techniques on the XMDV webpage.
    [Figure: XMDV Panel]
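
The two-stage check used by the Similarity Analysis tool can be sketched in plain Java as follows. This is a sketch under stated assumptions, not the tool's actual code: the class name SimilaritySketch and all method names are hypothetical, both sequences are assumed to have the same length n, the retained coefficients are assumed to be the first k (with k no larger than n), and the transform is a direct discrete Fourier transform normalised by 1/sqrt(n). With that normalisation the distance over the first k coefficients can never exceed the time-domain distance, so the frequency-domain test only discards pairs that could not pass the time-domain test anyway.

```java
// Hypothetical stand-alone sketch; not part of Weka or of this distribution.
final class SimilaritySketch {

    // First k DFT coefficients of x, normalised by 1/sqrt(n).
    // Returns two arrays: the real parts and the imaginary parts.
    static double[][] dftPrefix(double[] x, int k) {
        int n = x.length;
        double[] re = new double[k];
        double[] im = new double[k];
        double norm = 1.0 / Math.sqrt(n);
        for (int f = 0; f < k; f++) {
            for (int t = 0; t < n; t++) {
                double angle = -2.0 * Math.PI * f * t / n;
                re[f] += x[t] * Math.cos(angle);
                im[f] += x[t] * Math.sin(angle);
            }
            re[f] *= norm;
            im[f] *= norm;
        }
        return new double[][] { re, im };
    }

    // Euclidean distance between two equal-length sequences in the time domain.
    static double timeDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int t = 0; t < a.length; t++) {
            double d = a[t] - b[t];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Euclidean distance over the first k DFT coefficients of a and b.
    static double frequencyDistance(double[] a, double[] b, int k) {
        double[][] fa = dftPrefix(a, k);
        double[][] fb = dftPrefix(b, k);
        double sum = 0.0;
        for (int f = 0; f < k; f++) {
            double dr = fa[0][f] - fb[0][f];
            double di = fa[1][f] - fb[1][f];
            sum += dr * dr + di * di;
        }
        return Math.sqrt(sum);
    }

    // Stage 1: cheap test on k coefficients. Stage 2: exact time-domain test.
    static boolean similar(double[] a, double[] b, int k, double threshold) {
        if (frequencyDistance(a, b, k) >= threshold) {
            return false; // rejected by the frequency-domain filter
        }
        return timeDistance(a, b) < threshold;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4, 5, 6, 7, 8};
        double[] b = {1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1};
        double[] c = {8, 7, 6, 5, 4, 3, 2, 1};
        System.out.println("a ~ b: " + similar(a, b, 3, 0.5));
        System.out.println("a ~ c: " + similar(a, c, 3, 0.5));
    }
}
```

In practice the reduced coefficients would be computed once per sequence (for example by a filter such as the Fourier Transform Filter) rather than recomputed for every comparison as in this sketch.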

Converters:

  1. OKC Loader: Weka can now load OKC files. The OKC file format is defined for XmdvTool, and this converter allows one to load such files into the Weka system. You can find examples of OKC files on the XMDV website.
  2. Savers: Originally, Weka only offered saving to the ARFF file format, which is Weka's native file format. The Saver feature allows one to write converters for saving to other file formats. It is activated by choosing the 'All Files' file filter instead of 'Arff file' in the Save dialog, which opens another dialog for choosing the converter.
    [Figure: Saver]
    Implemented Savers:
    1. OKC Saver: Save to the Xmdv OKC file format (saves data to .okc file and meta information to .meta file). For more details, look at the Xmdv website.
    2. C45 Saver: Save to the .data and .names file formats.

Miscellaneous:

  1. Tree Visualizer for ID3 Decision Tree: The original Weka version implements the tree visualizer for the J4.8 decision tree algorithm. This modified version of Weka also supports the Tree Visualizer for the ID3 algorithm.
  2. Visual Cluster Comparator: A display for visually comparing the cluster assignments produced by different clustering algorithms in Weka. This allows one to see how the clusters constructed by the different algorithms compare. It uses the XMDV visualization capabilities to display the comparison. The following screenshot shows the comparison of the Simple K-Means and EM clustering algorithms for 5 clusters.
    [Screenshot: Clustering Comparison]

System Requirements

  1. A Java2 compliant Virtual Machine. The Java JRE / JDK can be downloaded from Sun's site.
  2. A graphics capable machine (which can run OpenGL).
  3. Jausoft's Java binding for OpenGL which can be downloaded here.
  4. Apache Ant software for building the system or any IDE that supports Ant build files (Eclipse, IntelliJ IDEA, etc.)

Download

  1. Binary Distribution - weka.zip (approx. 1.76 MB)
  2. Source Distribution - weka-src.zip (approx. 2.0 MB)

Installation Instructions

The Weka environment lacks a standard module registration procedure. Hence, the distribution packages the modified modules with the Weka system.

  1. Install Jausoft's Java binding for OpenGL (GL4Java) (Note that this might be slightly involved).
  2. Unzip the distribution file.
  3. If you have downloaded the binary package, you can use the RunWeka batch file to run the system. If you have downloaded the source package, look at the build instructions to build and run the system.

Build Instructions

Unzip the source zip file into a separate directory. This creates the src directory, which holds the weka sources, the xmdv sources and the files created by me in separate directories. The build.xml Ant file is unzipped into the same directory. If you have Ant installed and the system paths set, you should be able to compile the complete directory tree by running ant from the command prompt. The class files are created in the bin directory. You can then run the modified Weka system by running java weka.gui.GUIChooser from the bin directory on the command line.
