Original document: Knowledge Discovery Mine

S*i*ftware: Tools for Knowledge Discovery in Data

This document contains information about available tools for Knowledge Discovery in Databases, also known as Data mining, Knowledge extraction, etc. It includes commercial products, public-domain systems, and research prototypes.

Pointers to additional relevant learning tools are in the Other Information Servers section. Ads for commercial tools can be found in AI Expert, PC AI, AI Magazine, IEEE Expert, Expert Systems, and similar magazines.

Please e-mail any corrections and comments to the maintainer. E-mail descriptions of relevant systems using the template below.

Official Disclaimer: This is an informal list, representing the opinions of the contributors, but not necessarily of their employers or of GTE Laboratories. The information is provided without any warranty. It is intended to be useful, but is certainly not always complete nor up-to-date.
Maintainer: Gregory Piatetsky-Shapiro.


Tool Description Template

*Name: ...
*Description: ...
*Discovery methods: Clustering, Classification, Summarization, Deviation Detection, Dependency Derivation, Visualization, ...
*Comments: ...
*Source: source of the information, e.g. magazine, posting, ...
*Platform(s): Windows, DOS, Mac, Unix, etc...
*Contact: person, organization, e-mail, phone, fax, address, etc
*Status: public domain, product, prototype, etc
*Updated by: person on 19??-MM-DD

== Tools

=== Classification Tools

==== Classification: Decision-tree approach

*Name: AC2
*Description: AC2 is a decision tree classification tool developed in C++. AC2 allows the user to create and manipulate decision trees from data sets with symbolic, numeric, noisy, and unknown descriptions. AC2's scientific foundation rests on its discriminatory methods and on the representation language of the data set.
  • AC2 integrates several discriminatory methods: regression methods (CART), gain ratio (J.R. Quinlan), Gini (CART), information gain (Shannon), and information class and distance (Mantaras), all of which have been used and tested extensively. AC2 provides confusion and cost matrices, pre- and post-pruning methods to avoid overfitting, and true error rate estimation through statistical procedures such as cross-validation and bootstrapping.
  • Data can be flat (usual matrix format) or structured by the use of a representation language based on an object-oriented representation extended with relationships between objects. The representation language allows the user to make use of domain knowledge in order to benefit from the semantics of the domain during the classification process.
  • AC2 has been designed for the analysis of "real-world" data sets in areas such as banking, marketing, risk analysis, decision help systems, quality control, science, medical diagnosis and epidemiology, population analysis and typology.
    *Discovery methods: Classification, regression and discriminatory methods, decision tree approach.
    *Comments: The system has a well-designed and attractive interface allowing strong interaction with the user. The decision tree is displayed as a graph, allowing the user to inspect nodes, make changes, and run tests easily. References:
  • MLT : Machine Learning Toolbox, Esprit Project 2154, Deliverable D2.2, Specification of the CKRL of MLT.
  • StatLog : Comparative Testing of Statistical and Logical Learning, Esprit Project 5170, Deliverable D3.11, Description of AC2.
  • T. Brunet 93 : Le probleme de la resistance aux valeurs inconnues dans l'induction : une foret de branches, JFA-93.
  • T. Brunet 94 : Le probleme de la resistance aux valeurs inconnues dans l'induction, These Universite de Paris VI, France.

    *Source: ISoft S.A.
    *Platform(s): AC2, coded in C++, is available on PC under Windows 3.1 and on Unix Workstations, SUN, IBM RS6000, BULL DPX20, HP 700, DEC Alpha.
    *Contact: H. Perdrix, ISoft S.A., Chemin de Moulon, 91190 Gif sur Yvette, France
    *Status: commercial product
    *Updated by: ISoft S.A. on 1995-01-23
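
The split criteria named in the AC2 entry (information gain, gain ratio, Gini) can be illustrated with a short sketch. This is not AC2's code: the function names and the toy attribute below are assumptions made for illustration, and the formulas are the standard textbook definitions for a symbolic attribute.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of symbols."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def partition(values, labels):
    """Group the class labels by the value of one symbolic attribute."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())

def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after the split."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain normalised by the entropy of the split itself."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0

def gini_decrease(values, labels):
    """Gini impurity of the labels minus the weighted impurity after the split."""
    n = len(labels)
    return gini(labels) - sum(len(g) / n * gini(g) for g in partition(values, labels))

# Hypothetical toy data: one symbolic attribute and a two-class label.
outlook = ["sunny", "sunny", "rain", "rain"]
play = ["no", "no", "yes", "yes"]
print(information_gain(outlook, play), gain_ratio(outlook, play), gini_decrease(outlook, play))
```

A tree builder would evaluate one of these scores for every candidate attribute at a node and split on the best one; which score to use is the choice the entry describes.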
    *Name: C4.5
    *Description: one of the most popular and best developed decision-tree tools. Latest version from J.R. Quinlan, the author of ID3. Also includes a module for converting a decision tree to a set of rules.
    *Platform(s): Unix
    *Contact: Morgan Kaufmann publishers. The code, including source C-code, comes with the book, J. R. Quinlan, C4.5 -- Programs for Machine Learning, Morgan Kaufmann, 1993.
    *Status: product.
    *Updated: 1993
    *Name: IND
    *Description: IND is a C program for the creation and manipulation of decision trees from data, integrating the CART, ID3/C4.5, Buntine's smoothing and option trees, Wallace and Patrick's MML method, and Oliver and Wallace's MML decision graphs which extend the tree representation to graphs. Written by Wray Buntine.
    *Comments: Cannot be exported from the US unless you buy it from NASA.
    *Platform(s): Unix
    *Contact: NASA COSMIC, Tel: 706-542-3265 (ask for customer support), Fax: 706-542-4807
    *Status: product. Relatively inexpensive
    *Updated: 1994
    *Name: Knowledge Seeker
    *Description: A data-mining tool that extracts multiple cause-and-effect relationships from a data set and displays them interactively as a graphic decision tree. It is designed for the analysis of industrial-strength, "real-world" data sets. Price $899 (Windows) and $799 (DOS)
    *Source: AI Expert, April 1994
    *Platform(s): Windows, DOS
    *Contact: Angoss Software, 430 King St., W., Suite 201, Toronto M5V 1J5, Canada, (416) 593-1122, fax (416) 593-5077
    *Status: product.
    *Updated: 1994-03-21

    *Name: MLC++
    *Description: A Machine Learning Library in C++.
    A library of C++ classes and tools for supervised classification learning. While MLC++ provides general learning algorithms that can be used by end users, the main objective is to provide researchers and experts with a wide variety of tools that can accelerate algorithm development, increase software reliability, provide comparison tools, and display information visually. More than just a collection of existing algorithms, MLC++ is an attempt to extract commonalities of algorithms and decompose them for a unified view that is simple, coherent, and extensible.
    *Induction algorithm: Decision trees, decision graphs, decision tables, nearest-neighbors (Instance-based methods), naive bayes, perceptron, winnow.
    *Other tools: Accuracy estimation (holdout, cross-validation, bootstrap), feature subset selection (wrapper), discretization algorithms (binning, entropy, 1R).
    *Source: Ron Kohavi.
    *Platform(s): Unix, Sun (ObjectCenter C++). Could be ported to other unix machines/compilers, but requires good template support.
    *Contact: Ronny Kohavi
    *Status: public domain (object code), source code available to selected sites.
    *Updated by: Ron Kohavi on 1995-02-23
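
The accuracy-estimation methods listed in the MLC++ entry (cross-validation, bootstrap) can be sketched as index-splitting routines. This is not MLC++ code: the function names are hypothetical, and a trivial majority-class predictor stands in for a real induction algorithm.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

def bootstrap_indices(n, seed=0):
    """Sample n training indices with replacement; test on the indices never drawn."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    drawn = set(train)
    test = [i for i in range(n) if i not in drawn]
    return train, test

def majority_class(labels):
    """Stand-in 'learner': always predict the most frequent training label."""
    return max(sorted(set(labels)), key=labels.count)

def cross_validated_accuracy(labels, k=5, seed=0):
    """Estimate accuracy by training on k-1 folds and testing on the held-out fold."""
    correct = 0
    for train, test in k_fold_indices(len(labels), k, seed):
        guess = majority_class([labels[i] for i in train])
        correct += sum(labels[i] == guess for i in test)
    return correct / len(labels)

labels = ["a"] * 8 + ["b"] * 2
print(cross_validated_accuracy(labels, k=5))
```

Replacing `majority_class` with a real classifier turns the same loop into an accuracy estimate for that classifier.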
    *Name: OC1
    *Description: multivariate decision tree induction system
    *Comments: OC1 (Oblique Classifier 1) is a multivariate decision tree induction system designed for applications where the instances have numeric feature values. OC1 builds decision trees that contain linear combinations of one or more attributes at each internal node; these trees then partition the space of examples with both oblique and axis-parallel hyperplanes. OC1 has been used for classification of data from several real world domains, such as astronomy and cancer diagnosis. A technical description of the algorithm can be found in the AAAI-93 paper by Sreerama K. Murthy, Simon Kasif, Steven Salzberg and Richard Beigel. A postscript version of this paper is provided with the package. OC1 is written entirely in ANSI C. It incorporates a number of features intended to support flexible experimentation on real and artificial data sets. We have provided support for cross-validation experiments, generation of artificial data, and graphical display of data sets and decision trees. The OC1 software allows the user to create both standard, axis-parallel decision trees and oblique (multivariate) trees.
         The latest version of OC1 is available free of charge, and may be
    obtained via anonymous FTP from the Department of Computer Science at
    Johns Hopkins University.
         To obtain a copy of OC1, type the following commands:
         UNIX_prompt> ftp 
    [Note: the Internet address of is]
         Name: anonymous
         Password: [enter your email address]
         ftp>  bin
         ftp>  cd pub/oc1
         ftp>  get oc1.tar.Z
    [This announcement is also contained in pub/oc1.]
         ftp>  bye
    [Place the file oc1.tar.Z in a convenient subdirectory.]
         UNIX_prompt> uncompress oc1.tar.Z
         UNIX_prompt> tar -xf oc1.tar
    [Read the file "README", to get cues to other documentation files, and
     to run the programs.]
    If you have any comments, questions or suggestions, please contact
           Sreerama K. Murthy or
           Steven Salzberg or
           Simon Kasif
           Department of Computer Science
           The Johns Hopkins University
           Baltimore, MD 21218
           Email: (primary contact)
    OC1 IS INTENDED FOR NON-COMMERCIAL PURPOSES ONLY. OC1 may be used, copied, and modified freely for this purpose. Any commercial use of OC1 is strictly prohibited without the express written consent of Sreerama K. Murthy, Simon Kasif, and Steven Salzberg, at the Department of Computer Science, Johns Hopkins University.
    *Status: public domain, prototype
    *Updated: Tue, 12 Oct 93 from salzberg@blaze.cs.jhu.EDU
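
The difference between the axis-parallel and oblique tests mentioned in the OC1 entry can be shown in a few lines. This sketch only evaluates a given node test; it does not reproduce how OC1 searches for the hyperplane coefficients, and the function names and sample points are assumptions.

```python
def axis_parallel_test(x, feature, threshold):
    """Standard decision-tree test: compare a single attribute with a threshold."""
    return x[feature] <= threshold

def oblique_test(x, weights, bias):
    """Oblique test: which side of the hyperplane sum(w_i * x_i) + bias = 0 is x on?"""
    return sum(w * xi for w, xi in zip(weights, x)) + bias <= 0

def split(points, test):
    """Partition points into (left, right) branches according to a node test."""
    left = [p for p in points if test(p)]
    right = [p for p in points if not test(p)]
    return left, right

# Hypothetical two-attribute instances separated by the line x0 + x1 = 1.
points = [(0.2, 0.3), (0.1, 0.6), (0.8, 0.9), (0.7, 0.5)]
left, right = split(points, lambda p: oblique_test(p, weights=(1.0, 1.0), bias=-1.0))
print(left, right)
```

An axis-parallel test is the special case of an oblique test in which exactly one weight is non-zero, which is why a single tree can mix both kinds of node.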
    *Name: SE-Learn
    *Description: An SE-tree-based induction and classification tool.

    Set Enumeration (SE) trees provide the basis for an induction and classification framework which generalizes decision trees. In this framework, called SE-Learn, rather than splitting according to a single attribute, one recursively branches on all (or most) relevant attributes. A single SE-tree economically embeds many decision trees, supporting a more expressive representation. SE-Learn benefits from many techniques developed for decision trees, e.g., attribute-selection and pruning measures. In particular, SE-Learn can be tailored to start off with anyone's favorite decision tree, and then improve upon it via further exploring the SE-tree. This hill-climbing algorithm allows trading time/space for added accuracy. Current studies show that SE-trees are particularly advantageous in domains where (relatively) few examples are available for training, and in noisy domains. Finally, SE-trees provide a unified framework for combining induced knowledge with knowledge available from other sources.

    A LISP implementation of SE-Learn, not to be used for any commercial purpose, is freely available from Ron Rymon. It includes a choice of exploration policies and resolution criteria, as well as hill-climbing from common decision trees: GINIindex (CART), Information Measure and Gain Ratio (ID3, C4.5), and Chi Square statistic (ChAID). An enhanced C version is currently under development by Modeling Labs.

    *Discovery methods: Classification

    *Contact: Ron Rymon
    *Name: XPERTRule
    *Description: Inductive Rule Learning,
    *Platform(s): Windows ?
    *Contact: CINCOM Systems, 1-800-543-3010
    *Status: product.
    *Updated: 1994-03-15

    ==== Classification: Neural network approach

    *Name: @Brain
    *Description: Neural network tool
    *Platform(s): DOS, Windows ?
    *Status: product.
    *Contact: Talon Development, 414-962-7246
    *Name: 4Thought
    *Description: neural net based tool to make predictions in financial and marketing environments. Intended for data-knowledgeable users, who know little or nothing about neural nets.
    *Comments: Price/performance ratio appears poor (6500 british pounds), but the reviewer describes it as "fun to use". The only neural net used is a basic MLP with one or two hidden layers, and either a small or a large learning step size. No other parameters can be set, and also the export of graphical / model information from 4Thought is very restricted.
    *Source: Review of `4Thought' / by A. Harvey and S. Toulson. - International Journal of Forecasting (Amsterdam) 10 (1994.06) nr.1 p.35-41 (7 refs)
    *Platform(s): Windows, fast 486 PC, preferably math-coprocessor
    *Contact: Right Information Systems Ltd, 9 Westminster Palace Gardens, Artillery Row, London SW1P1RL, UK
    *Status: product
    *Updated by: Sandra Oudshoff on 1994-10-20
    *Name: AIM
    *Description: A modeling tool that uses abductive modeling technology to learn relationships from a database of examples. Uses 1-, 2-, and 3-dimensional polynomials.
    *Platform(s): Windows, DOS
    *Contact: Abtech Corp., 508 Dale Ave, Charlottesville, VA, 22903, (804) 977-0686, fax (804) 977-9615
    *Status: product.
    *Updated: 1994-03-15
    *Name: BrainMaker
    *Description: tool for training backprop neural nets
    *Discovery methods: Neural Networks (back-prop)
         BrainMaker package includes:
          The book Introduction to Neural Networks
          BrainMaker Users Guide and reference manual
              300 pages , fully indexed, with tutorials, and sample networks
              Netmaker makes building and training Neural Networks easy, by
              importing and automatically creating BrainMaker's Neural Network
              files.  Netmaker imports Lotus, Excel, dBase, and ASCII files.
              Full menu and dialog box interface, runs Backprop at 750,000 cps
              on a 33 MHz 486.

    *Source: FAQ
    *Platform(s): DOS, Windows, Mac
    	Company: California Scientific Software
     	Address: 10024 Newtown rd, Nevada City, CA, 95959 USA
          	Phone: 800-284-8112 or 916 478 9040
    	Tech Support: 916 478 9035
    	Fax: 916 478 9041
     	Email:  calsci! (flakey connection)

    *Status: product.
    *Updated: 1994-10-27 by GPS
    *Name: MATLAB Neural Network Toolbox
    *Description: a complete engineering environment for neural network research, design, and simulation. Offers over fifteen proven network architectures and learning rules.
    *Discovery methods: Classification
    *Comments: Includes backpropagation, perceptron, linear, recurrent, associative, and self-organizing networks. Fully open and customizable.
    *Source: Product ad in PC AI, Nov/Dec 1994
    *Platform(s): PCs, Macs, and Workstations
    *Contact: The Math Works, 24 Prime Park Way, Natick, MA 01760-1500,
    fax: 508-653-6284
    *Status: product
    *Updated: 1994-12-29 by Gregory Piatetsky-Shapiro
    *Comments: proprietary modeling algorithm
    *Platform(s): Windows, DOS, ?
    *Contact: Teranet IA, 800-663-8611
    *Status: product
    *Updated: 1993
    *Name: N-train
    *Description: statistical and Neural network tool
    *Platform(s): Windows, DOS, ?
    *Contact: Scientific Consultant Services, 516-696-3333
    *Status: product
    *Updated: 1993

    ==== Classification: Rule Discovery Approach

    *Name: Datalogic/R
    *Description: Software for data mining and decision support using a rough set-based system for knowledge discovery, predictive modeling, and reasoning.
    *Platform(s): MS-DOS
    *Contact: Reduct Systems, Regina, Canada. (306) 586-9408, fax (306) 586-9442.
    *Status: product.
    *Updated: 1994-03-15
    *Name: Data Surveyor
    *Description: Data Surveyor is a data mining tool for the discovery of strategically relevant information in large databases.
    *Discovery methods: Induction of classification rules
    *Source: information by author
    *Comment: uses a separate front and back-end. The front end directs the mining process. The back-end is a fast, parallel, main memory database server, which performs all massive data handling.
    *Platform(s): Back-end currently runs on (parallel) Unix systems, front-end runs on Unix workstations and MS-Windows.
    *Contact: Marcel Holsheimer
    CWI, P.O. Box 94079
    1090 GB Amsterdam
    The Netherlands.
    	tel. +31-20-592 4134, fax +31-20-592 4199

    *Status: product
    *Updated by: Marcel Holsheimer on 1994-12-20
    *Name: IDIS
    *Description: IDIS is the Information Discovery System that analyzes databases by itself and discovers patterns and rules. IDIS automatically decides what to look at, generates hypotheses, discovers hidden and unexpected patterns, rules of knowledge, graphs and anomalies. The results are displayed within a hypermedia environment.
    IDIS examines databases with a set of built-in data analysis algorithms that automatically form hypotheses about what is relevant. It then tests the hypotheses to generate interesting and unexpected rules and graphs that characterize the database. The automatic hypotheses formation and testing cycle continues until important rules and patterns emerge.
    IDIS pre-analyzes large databases to discover the important graphs to be displayed. The analyses of IDIS may be focused by the user towards a specific task or IDIS may be set to roam freely through the database. The system outputs rules and graphs which characterize data. Both numeric and non-numeric data values are shown in two and three dimensional hypermedia graphs.

    *Comments: IDIS works on several databases such as Oracle, Sybase, etc. It works both in client server and stand alone models. IDIS has discovered more rules in more applications than any other program. Discoveries have been made by IDIS in many areas such as point-of-sales data, quality control, finance, banking, the petroleum industry, agriculture, science, business forecasting, forest fire prevention, chemical structure identification, securities trading, crime detection and medical diagnosis, among others.
    *Source: IntelligenceWare
    *Platform(s): Windows, DOS, Unix, various SMP and MPP systems.
    *Contact: 5933 West Century Blvd,
    	Los Angeles, CA 90045,
    	tel. (310) 216-6177, fax (310) 417-8897

    *Status: Commercial product since 1991. Previously the IXL system. IDIS is the new system.
    *Updated: 1994-07-22
    *Name: PQ->R
    *Description: A program for Computer Aided Induction of general rules (if .. then) from cases. Functions: 1) automatic inductive classification of a minimal chain of independent variables that would predict a user selected dependent variable 2) interactive construction of hypotheses based on a lookahead facility
    *Comments: Can handle only up to 1200 cases and 50 variables with up to 15 attributes each
    *Source: Sandra Oudshoff
    *Platform(s): DOS, minimum 286 with 500K free program memory
    *Contact: Finite Epistemics, Passeerderstraat 76, 1016 XZ Amsterdam, Holland, tel +31 20 624 7137
    *Status: product
    *Updated by: Sandra Oudshoff on 1994-07-27

    ==== Classification: Genetic Algorithm approach

    *Name: GAAF
    *Description: GAAF is a genetic algorithm based tool for the approximation of mathematical formulae from raw data. These formulae capture the relationships within the data. GAAF overcomes the problems of neural network or statistical approximation methods in several ways. It can generate a symbolic representation for any kind of function, including discontinuous ones. Overfitting is avoided by separating the raw data into a development sample and a validation sample and by applying several statistical robustness tests. Because the generated models are represented in a simple mathematical form, they are easy to explain and analyse.
    *Source: Product folder
    *Platform(s): High performance IBM PC compatible machine under Windows 3.1
    *Contact: Cap Volmac, Division Service Development, Dolderseweg 2, 3712 Huis ter Heide, Holland, tel +31 3404 35411, fax +31 3404 31174
    *Status: product
    *Updated by: Sandra Oudshoff on 1994-07-27
    *Name: FUGA
    *Description: FUGA, Financial modelling Using Genetic Algorithms, is a financial modelling tool based on the GAAF toolbox. FUGA allows for the development of models in diverse financial domains such as credit scoring, risk management and product marketing. Additional functions allow financial operators to speed up the search process, and reporting facilities are targeted at financial managers.
    *Source: product folder
    *Platform(s): High performance IBM PC compatible machine under Windows 3.1
    *Contact: Cap Volmac, Division Service Development, Dolderseweg 2, 3712 Huis ter Heide, Holland, tel +31 3404 35411, fax +31 3404 31174
    *Status: product
    *Updated by: Sandra Oudshoff on 1994-07-27

    ==== Classification: Nearest Neighbour

    *Name: PEBLS
    *Description: PEBLS is a nearest-neighbor learning system designed for applications where the instances have symbolic feature values. PEBLS has been applied to the prediction of protein secondary structure and to the identification of DNA promoter sequences. A technical description appears in the article by Cost and Salzberg, Machine Learning journal 10:1 (1993).
     Version 3.0 incorporates a
    number of additions to version 2.1 (released in 1993) and to the
    original PEBLS described in the paper:
         S. Cost and S. Salzberg.  A Weighted Nearest Neighbor 
         Algorithm for Learning with Symbolic Features,
         Machine Learning, 10:1, 57-78 (1993).
         PEBLS 3.0 now makes it possible to draw more comparisons between
    nearest-neighbor and probabilistic approaches to machine learning, by
    incorporating a capability for tracking statistics for Bayesian
    inferences.  The system can thus serve to show specifically where
    nearest-neighbor and Bayesian methods differ.  The system is also able
    to perform tests using simple distance metrics (overlap, Euclidean,
    Manhattan) for baseline comparisons.  Research along these lines was
    described in the following paper:
         J. Rachlin, S. Kasif, S. Salzberg, and D. Aha.  Towards a Better
         Understanding of Memory-Based and Bayesian Classifiers.
         Proceedings of the Eleventh International Conference on Machine
         Learning (pp. 242-250).  New Brunswick, NJ, July 1994, Morgan
         Kaufmann Publishers.

    *Source: ML list
    *Platform(s): PEBLS 3.0 is written entirely in ANSI C. It is thus capable of running on a wide range of platforms.
         The latest version of PEBLS is available free of charge, and may
    be obtained via anonymous FTP from the Johns Hopkins University
    Computer Science Department.
         To obtain a copy of PEBLS, type the following commands:
         UNIX_prompt>  ftp
    [Note: the Internet address of is]
         Name: anonymous
         Password: [enter your email address]
         ftp>  bin
         ftp>  cd pub/pebls
         ftp>  get pebls.tar.Z
         ftp>  bye
    [Place the file pebls.tar.Z in a convenient subdirectory.]
         UNIX_prompt> uncompress pebls.tar.Z
         UNIX_prompt> tar -xf pebls.tar
    [Read the files "README" and "pebls_3.doc"]
    For further information, contact:
                   Prof. Steven Salzberg
                   Department of Computer Science
                   Johns Hopkins University
                   Baltimore, Maryland 21218

    *Status: PEBLS 3.0 IS INTENDED FOR RESEARCH AND EDUCATIONAL PURPOSES ONLY. PEBLS 3.0 may be used, copied, and modified freely for this purpose. Any commercial or for-profit use of PEBLS 3.0 is strictly prohibited without the express written consent of Prof. Steven Salzberg, Department of Computer Science, The Johns Hopkins University.
    *Updated by: GPS on 1994-10-20
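
The baseline distance metrics named in the PEBLS entry (overlap, Euclidean, Manhattan) are simple enough to sketch together with a one-nearest-neighbor lookup. This does not reproduce PEBLS's own weighted metric for symbolic features from the Cost and Salzberg paper; the function names and the toy sequences are assumptions.

```python
import math

def overlap(a, b):
    """Overlap metric for symbolic features: count the positions that differ."""
    return sum(x != y for x, y in zip(a, b))

def manhattan(a, b):
    """Manhattan (city-block) distance for numeric features."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Euclidean distance for numeric features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(query, examples, distance):
    """Return the label of the stored example closest to the query."""
    _, label = min(examples, key=lambda ex: distance(query, ex[0]))
    return label

# Hypothetical symbolic instances (short DNA-like strings as tuples).
examples = [(("a", "c", "g"), "promoter"), (("t", "t", "t"), "other")]
print(nearest_neighbor(("a", "c", "t"), examples, overlap))
```

Passing the metric in as a parameter mirrors the kind of baseline comparison the entry describes: the same stored examples can be queried under each metric in turn.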

    ==== Classification: Other approaches

    *Name: Clementine
    *Description: Built around a visual programming interface that links data access, manipulation, and visualisation with machine learning (decision tree induction and neural networks). Applications are developed with a graphical 'building block' approach. Trained rules and networks can be exported as C source code.
    *Source: Product brochure.
    *Platform(s): Sun, DEC, HP, SG
    *Contact: Colin Shearer, Tom Khabaza,
        	Integral Solutions Ltd, 3, Campbell Court, Bramley,
            Basingstoke RG26 5EG, UK
            Phone: +44 1256 882028    Fax: +44 1256 882182

    *Status: product
    *Updated by: Tom Khabaza, 1994/09/20
    *Name: DISCOVER-IT
    *Description: ?
    *Platform(s): DOS, Windows ?
    *Contact: SourceCode Inc, 800-294-5840
    *Status: product.
    *Updated: 1993
    *Name: HCV
    *Description: representative of the extension matrix approach based family of attribute-based induction algorithms, originating with J.R. Hong's AE1. By dividing the positive examples (PE) of a specific concept in a given example set into intersecting groups and adopting a set of strategies to find a heuristic conjunctive rule in each group which covers all the group's positive examples and none of the negative examples (NE), the HCV algorithm can find a rule in the form of variable-valued logic for the concept based on PE against NE in low-order polynomial time. If there exists at least one conjunctive rule in a given training example set for PE against NE, the rule produced by the HCV algorithm must be a conjunctive one. The rules in variable-valued logic generated by the HCV algorithm have been shown empirically to be more compact than the decision trees or their equivalent decision rules produced by the ID3 algorithm (the best-known induction algorithm to date) and its successors in terms of the numbers of conjunctive rules and conjunctions.
    The term ``HCV'' in this description indicates the current implementation (Version 1.0) of the HCV algorithm in SICStus Prolog which runs on SUN3, SPARC and DEC workstations. In this implementation, HCV can classify more than 2 classes of examples by incorporating the AQ technique developed in the generalization-specialization strategy based family of induction algorithms. It takes a set of pre-classified training examples (vectors of attribute-values) as its input and produces a set of rules as its output classifying the training examples. It also allows you to evaluate the rules' accuracy in terms of a set of pre-classified testing examples.
    *Comments: To use the program, you must prepare your training and testing examples as ASCII files in a fixed format. During program execution, all you need to do is provide your file names. All outputs and some intermediate results are shown on the screen and stored in a file you specify.
    	X. Wu, HCV User's Manual (Release 1.0 June 1992), DAI Technical Paper
    	No. 9, 30 pp., Department of Artificial Intelligence, University of
    	Edinburgh, 1992.
    	X. Wu, The HCV Induction Algorithm, Proceedings of the 21st ACM
    	Computer Science Conference, S.C. Kwasny and J.F. Buck (Eds.), ACM
    	Press, U.S.A., 1993, 168--175.

    *Platform(s): It runs on SUN3, SPARC and DEC workstations under Unix or Ultrix with SICStus Prolog, and PCs under the DOS environment.
    *Contact: Xindong Wu, Dept of Computer Science, James Cook University, Townsville, Qld 4812, Australia
    *Status: HCV (Version 1.0) is available electronically for academic use at no cost, and for commercial use by arrangement with the author.
    *Updated: 1994-04
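
The requirement stated in the HCV entry, that a conjunctive rule cover all positive examples of a group and none of the negative examples, can be checked with a small sketch. This does not implement the extension matrix search that HCV uses to find such rules; the rule representation (each attribute mapped to a set of allowed values) and all names are assumptions.

```python
def covers(rule, example):
    """A conjunctive rule covers an example if every selector accepts its value."""
    return all(example[attr] in allowed for attr, allowed in rule.items())

def is_consistent(rule, positives, negatives):
    """True if the rule covers every positive example and no negative example."""
    return (all(covers(rule, e) for e in positives)
            and not any(covers(rule, e) for e in negatives))

# Hypothetical attribute-value examples.
positives = [{"shape": "round", "colour": "red"},
             {"shape": "round", "colour": "green"}]
negatives = [{"shape": "square", "colour": "red"}]

rule = {"shape": {"round"}, "colour": {"red", "green"}}
print(is_consistent(rule, positives, negatives))
```

Allowing a set of values per attribute corresponds to the variable-valued logic form mentioned in the entry, where one selector may accept several values of the same attribute.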
    *Name: Information Harvesting
    *Description: A tool for rule discovery in databases ?
    *Comments: shown at AAAI-93, but was very expensive (~ $40,000)
    *Platform(s): ?
    *Contact: Ryan Corp., 53 Wall Street, Fifth Floor, New York, NY 10005 212-858-7730
    *Status: product.
    *Updated: 1993
    *Name: NEXTRA
    *Description: Tool for knowledge acquisition for an expert system. Can synthesize rules from user preferences. Nice graphical abilities.
    *Comments: ?
    *Platform(s): Mac, ?
    *Contact: Neuron Data, 156 University Ave., Palo Alto, CA 94301. 1-800-876-4900
    *Status: product
    *Updated: 1993

    === Deviation Detection

    *Name: EXPLORA
    *Description: An interactive system for discovery of interesting patterns in databases.
    *Comment: See also papers in KDD book, 1991, KDD-91 and KDD-93 proceedings.
    *Platform(s): Mac
    *Contact: Willi Kloesgen, GMD, D-53757 Sankt Augustin
    If you are interested in using the system, you can get it via anonymous ftp: transfer the file "Explora.sit.hqx" from the directory "gmd/explora". The file "README" explains how to install Explora. A user manual is included.
    *Status: public domain, prototype
    *Updated: 1993

    === Dependency Derivation

    *Name: TETRAD II
    *Description: A multi-module program that assists in the construction of causal explanations for sample data and their use in prediction. With continuous variables the program will aid in the search for "path models" or "structural equation models"; with discrete data the program will construct and update a Bayes network from sample data and user knowledge of the domain. The program also includes Monte Carlo facilities.
    *Comment: Proofs of the asymptotic correctness of all but one of the search modules are available in P. Spirtes, C. Glymour and R. Scheines, Causation, Prediction and Search, Springer Lecture Notes in Statistics, 1993. Should be available as of September 1, 1994.
    *Source: C. Glymour
    *Platform(s): DOS, Unix
    *Contact: Erlbaum Statistical Software. The system is available along with a book
      Richard Scheines, Peter Spirtes, Clark Glymour, and Christopher Meek.
      TETRAD II: Tools for Discovery.
      Lawrence Erlbaum Associates, Hillsdale, NJ, 1994.

    *Status: product
    *Updated: 1994-07-22

    === Clustering

    *Name: AUTOCLASS
    *Description: AutoClass is an unsupervised Bayesian classification system for independent data. It seeks a maximum posterior probability classification.
    ( NASA Ames Research Center )
         The program AUTOCLASS III, Automatic Class Discovery from Data, uses
    Bayesian probability theory to provide a simple and extensible approach to
    problems such as classification and general mixture separation. Its
    theoretical basis is free from ad hoc quantities, and in particular free of
    any measures which alter the data to suit the needs of the program. As a
    result, the elementary classification model used lends itself easily to ex-
         The standard approach to classification in much of artificial
    intelligence and statistical pattern recognition research involves
    partitioning of the data into separate subsets, known as classes.
    AUTOCLASS III uses the Bayesian approach in which classes are described by
    probability distributions over the attributes of the objects, specified by
    a model function and its parameters. The calculation of the probability of
    each object's membership in each class provides a more intuitive
    classification than absolute partitioning techniques.
         AUTOCLASS III is applicable to most data sets consisting of
    independent instances, each described by a fixed length vector of attribute
    values. An attribute value may be a number, one of a set of attribute
    specific symbols, or omitted. The user specifies a class probability
    distribution function by associating attribute sets with supplied
    likelihood function terms. AUTOCLASS then searches in the space of class
    numbers and parameters for the maximally probable combination. It returns
    the set of class probability function parameters, and the class membership
    probabilities for each data instance.
    DISTRIBUTION MEDIA: .25 inch Tape Cartridge in TAR Format
    P. Cheeseman, et al. "Autoclass: A Bayesian Classification System",
      Proceedings of the Fifth International Conference on Machine Learning,
      pp. 54-64, Ann Arbor, MI. June 12-14 1988.
    P. Cheeseman, et al. "Bayesian Classification", Proceedings of the
      Seventh National Conference on Artificial Intelligence (AAAI-88),
      pp. 607-611, St. Paul, MN. August 22-26, 1988.
    J. Goebel, et al. "A Bayesian Classification of the IRAS LRS Atlas",
      Astron. Astrophys. 222, L5-L8 (1989).
    P. Cheeseman, et al. "Automatic Classification of Spectra from the Infrared
      Astronomical Satellite (IRAS)", NASA Reference Publication 1217 (1989)
    P. Cheeseman, "On Finding the Most Probable Model", Computational Models
      of Discovery and Theory Formation, ed. by Jeff Shrager and Pat Langley,
      Morgan Kaufmann, Palo Alto, 1990, pp. 73-96.
    R. Hanson, J. Stutz, P. Cheeseman, "Bayesian Classification with
      Correlation and Inheritance", Proceedings of 12th International Joint
      Conference on Artificial Intelligence, Sydney, Australia. August 24-30,

    *Platform(s): Common Lisp on Unix and Mac. AUTOCLASS III, ARC-13180, is written in Common Lisp and is designed to be platform independent. This program has been successfully run on Symbolics and Explorer Lisp machines. It has been successfully used with the following implementations of Common Lisp on the Sun and similar UNIX platforms: Franz Allegro CL, Lucid Common Lisp, and Austin Kyoto Common Lisp; under the Lucid Common Lisp implementations on VAX/VMS v5.4, VAX/Ultrix v4.1, and MIPS/Ultrix v4, rev. 179; and on the Macintosh personal computer. The minimum Macintosh required is the IIci. This program will not run under CMU Common Lisp or VAX/VMS DEC Common Lisp. A minimum of 8 MB of RAM is required for Macintosh platforms and 16 MB for workstations. The standard distribution medium for this program is a .25 inch streaming magnetic tape cartridge in UNIX tar format. It is also available on a 3.5 inch diskette in UNIX tar format and a 3.5 inch diskette in Macintosh format. An electronic copy of the documentation is included on the distribution medium. Domestic pricing is $900 for the program, and $21 for the documentation -- there is a 50% educational discount. International pricing is $1800 for the program, and $42 for the documentation -- there is *no* educational discount. Sun is a trademark of Sun Microsystems, Inc. UNIX is a registered trademark of AT&T Bell Laboratories. DEC, VAX, VMS, and ULTRIX are trademarks of Digital Equipment Corporation. Macintosh is a trademark of Apple Computer, Inc. Allegro CL is a registered trademark of Franz, Inc. COSMIC and the COSMIC logo are registered trademarks of the National Aeronautics and Space Administration. All other brands and product names are the trademarks of their respective holders.
    *Contact: AutoClass III is the official released implementation of AutoClass
       available from COSMIC (NASA's software distribution agency):
    	University of Georgia
    	382 East Broad Street
    	Athens, GA  30602  USA
    	voice: (706) 542-3265  fax: (706) 542-4807
    	telex: 41- 190 UGA IRC ATHENS
    	e-mail:	cosmic@uga.bitnet

    *Status: product
    *Updated: 1994
    *Name: COBWEB/3
    *Description: A portable implementation of an algorithm for data clustering and incremental concept formation.
    *Platform(s): PC with 16 MB RAM
    *Contact:
    	University of Georgia
    	382 East Broad Street
    	Athens, GA  30602  USA
    	voice: (706) 542-3265  fax: (706) 542-4807
    	telex: 41- 190 UGA IRC ATHENS
    	e-mail:	cosmic@uga.bitnet

    *Status: product
    *Updated by: GPS, 1994-07
    *Name: DataEngine
    *Description: Uses fuzzy systems, neural nets and their combination. Applications are developed by graphically linking together function blocks (similar to Clementine). Includes: a) fuzzy clustering methods (Fuzzy C-Means, or FCM); b) rule-based fuzzy methods; c) neural nets (backpropagation and Kohonen); d) fuzzy neuro methods; e) signal processing module; f) basic module (statistical and mathematical functions, spreadsheet data editor).
    *Source: AI Watch article, Feb 94
    *Platform(s): PC with 16 MB RAM
    *Contact: Management Intelligenter Technologien GmbH, Aachen, Germany, Tel: +49 2408-94 580, Fax: +49 2408-94 582
    Price: basic module (DM 2498), signal processing module (DM 598), other module (DM 998), complete package (DM 7000)
    *Status: product
    *Updated by: Hing-Yan Lee, 1994-06-20
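The Fuzzy C-Means method named in the DataEngine entry can be sketched as follows. This is a textbook version for one-dimensional data, not DataEngine's implementation, which the source does not describe; the function names, the fuzzifier value m=2, the fixed iteration count, and the sample data are all assumptions for illustration.

```python
def memberships_for(points, centers, m):
    """Return one row per point; row[i] is its degree of membership in cluster i."""
    power = 2.0 / (m - 1.0)
    rows = []
    for x in points:
        dists = [abs(x - c) for c in centers]
        if 0.0 in dists:
            # The point sits exactly on a center: share full membership there.
            hits = dists.count(0.0)
            rows.append([1.0 / hits if d == 0.0 else 0.0 for d in dists])
        else:
            rows.append([1.0 / sum((d_i / d_k) ** power for d_k in dists)
                         for d_i in dists])
    return rows


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Alternate membership and center updates for a fixed number of rounds."""
    centers = list(centers)
    for _ in range(iterations):
        rows = memberships_for(points, centers, m)
        for i in range(len(centers)):
            # Each center becomes the mean of the points, weighted by u^m.
            weights = [row[i] ** m for row in rows]
            centers[i] = sum(w * x for w, x in zip(weights, points)) / sum(weights)
    return centers, memberships_for(points, centers, m)


# Two hypothetical groups of measurements and rough initial centers.
points = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
centers, memberships = fuzzy_c_means(points, [0.0, 6.0])
```

Unlike a hard clustering, every point receives a degree of membership in every cluster, and the degrees for one point sum to 1.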
    *Name: SDISCOVER
    *Description: This tool discovers regular expression style motifs in each family among a set of families of sequences. This will soon be extended to trees. We use the edit distance for sequences as the measure of similarity and a variety of distance measures including edit distance, alignment distance and top-down edit distance in the tree case. On protein data (SWISS-PROT), we have shown the motifs to be significant by using them successfully as classifiers.
    *Discovery methods: Generate and Test, Clustering, Sampling.
    *Comments: The tool works by encoding the sequences into a suffix tree and traversing the tree to generate candidate motifs. We then evaluate the activity of a candidate motif by comparing it with the sequences in the family.
    *Source: from the authors
    *Platform(s): Windows, DOS and Unix.
    *Contact:
     Jason Tsong-Li Wang, Department of Computer and Information
     Science, New Jersey Institute of Technology, University Heights,
     Newark, NJ 07102, phone: (201) 596-3396,
     fax: (201) 596-5777.
     Dennis Shasha, Department of Computer Science,
     Courant Institute of Mathematical Sciences, New York University,
     251 Mercer Street, New York, NY 10012,
     phone: (212) 998-3086, fax: (212) 995-4122.

    *Status: binary is freely available, prototype, server available on the Internet.
    *Updated by: Jason Wang on 1994-11-17
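The SDISCOVER entry uses edit distance as its similarity measure for sequences. A minimal sketch of the standard dynamic-programming computation (unit cost for insertion, deletion, and substitution) is given below. It is not taken from SDISCOVER itself; the function name and the example strings are assumptions.

```python
def edit_distance(a, b):
    """Minimum number of single-symbol insertions, deletions, and
    substitutions needed to turn sequence a into sequence b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Hypothetical short sequences.
d_same = edit_distance("ACDE", "ACDE")
d_one_deletion = edit_distance("ACDE", "ACE")
d_classic = edit_distance("kitten", "sitting")
```

A candidate motif could then be scored by how many sequences in a family lie within a chosen distance of it, although the scoring SDISCOVER actually uses is not specified here.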

    === Visualization

    *Name: Data Desk
    *Description: stat. package with excellent graphics
    *Comments: Joe Gorberg writes: On the Mac side, a good visualization tool I like and recommend is Data Desk (you can get it from Egghead and MacWarehouse). It's a stat. package with excellent graphics for x-y-z rotating plots, histograms and much more. It really has helped me get value out of neural nets and understand the data.
    *Platform(s): Mac
    *Contact: Egghead and MacWarehouse
    *Status: product
    *Updated: 1993
    *Name: NetMAP
    *Description: data mining and visual relationship mapping
    *Source: Computing (a UK magazine), 20 Jan 94
    *Comments: ?
    *Platform(s): ?
    *Contact: Software AG
    *Status: product
    *Updated: 1994-02-01
    *Name: PV-Wave
    *Description: data visualization tool
    *Platform(s): Unix, ?
    *Contact: Visual Numerics, 5105 East 41st Avenue, Denver CO 80216-9952, 1-800-447-7147
    *Status: product
    *Updated: May 1994
    *Name: WinViz
    *Description: A data analysis tool based on visualization. Uses the parallel coordinates technique to present multi-dimensional datasets. An interactive visual query facility on the parallel coordinates is also available.
    *Platform(s): Windows
    *Contact: Information Technology Institute, 71 Science Park Drive, Singapore 0511, Republic of Singapore.
    *Status: product
    *Updated by: Hing-Yan Lee, 1994-06-20
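The parallel coordinates technique named in the WinViz entry draws one vertical axis per attribute and one polyline per record. The sketch below computes the polyline heights and a simple range query of the kind an interactive tool might offer. It says nothing about how WinViz itself works; the function names, the min-max scaling, and the sample records are assumptions.

```python
def to_polylines(records, attributes):
    """Scale each attribute to [0, 1] and return one polyline per record.

    A polyline is a list of (axis_index, height) pairs, one per attribute.
    """
    lows = {a: min(r[a] for r in records) for a in attributes}
    highs = {a: max(r[a] for r in records) for a in attributes}
    lines = []
    for r in records:
        line = []
        for i, a in enumerate(attributes):
            span = highs[a] - lows[a]
            # A constant attribute has no spread; place it mid-axis.
            height = (r[a] - lows[a]) / span if span else 0.5
            line.append((i, height))
        lines.append(line)
    return lines


def brush(records, ranges):
    """Keep the records whose values fall inside every (low, high) range."""
    return [r for r in records
            if all(low <= r[a] <= high for a, (low, high) in ranges.items())]


# Hypothetical records with two numeric attributes.
records = [
    {"age": 20, "income": 30},
    {"age": 40, "income": 50},
    {"age": 60, "income": 40},
]
polylines = to_polylines(records, ["age", "income"])
selected = brush(records, {"age": (30, 70)})
```

Passing the polylines to any plotting routine gives the familiar picture, and the range query corresponds to dragging a selection along one or more axes.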

    === Statistics

    *Name: BBN Cornerstone
    *Description: A user-friendly, integrated software package for accessing, visualizing, analyzing, and presenting data. Can import data from several popular databases, including Oracle, SYBASE, and Informix.
    *Comments: Nice user interface
    *Platform(s): HP, Sun (as of 7/94). Soon Windows NT and Windows 4.0
    *Contact: James Fitzgerald, BBN, 150 CambridgePark Drive, Cambridge, MA 02140, tel: 617-873-8191, fax: 617-873-4751
    *Status: product
    *Updated: 1994-07-14
    *Name: PC-MARS
    *Description: A software package for developing models of non-linear multivariable processes from past input/output data.
    *Comments: Useful for predicting future outputs. Advertised as an alternative to neural networks that helps the user understand the process being modelled. Provides graphical tools.
    *Platform(s): IBM PC and compatibles.
    *Contact: Data Patterns, 528 S. 45th street, Philadelphia, PA 19104, (215) 387-1844. 495
    *Status: product
    *Updated: 12/1992

    === Dimensional Analysis

    *Name: CrossTarget
    *Description: This product provides a flexible way of looking at data and "drilling down" into your data. It has a spreadsheet format with some graphical tools.
    *Platform(s): ?
    *Contact: Dimensional Insight, Inc., 99 South Bedford Street Burlington, Mass. 01803, 617-229-9111
    *Updated: 1992-12
    *Name: Cross/Z
    *Discovery methods: Clustering, ?
    *Comments: Uses fractal compression techniques to compress huge datasets to manageable sets of non-linear coefficients. Included are data mining tools, such as chaos-theory-based cluster analysis.
    *Source: Internet posting
    *Platform(s): ?
    *Contact:
    Cross/Z International, Inc.
    9 Park Place
    Great Neck, NY 11021
    516 482 6300
    516 482 6463 fax

    *Status: product
    *Updated by: GPS on 1995-01-25
    *Name: Essbase Multi-Dimensional OLAP Server
    *Description: Essbase is a high-performance multi-dimensional analytical engine for OLAP (On-Line Analytical Processing). It allows very rapid analysis of extremely large data sets. Essbase is a fully client/server, 32-bit, multithreaded, SMP-enabled product. Essbase supports an unlimited number of dimensions, and an unlimited number of members per dimension. Essbase provides an Open API for client access, and works with a number of popular front-end tools.
    *Discovery methods: N/A: Essbase acts as a server to a range of analytical front-end tools. Essbase provides vastly superior performance as compared with a typical relational database.
    *Comments: See also the comp.databases.olap Usenet newsgroup for a discussion of On-Line Analytical Processing (OLAP), a relatively new category of analytical tools defined by Dr. E. F. Codd.
    *Platform(s): Windows, Mac, OS2, NT, Unix
    *Contact:
    Arbor Software Corporation
    1325 Chesapeake Terrace
    Sunnyvale, CA, 94089

    *Status: commercial software product
    *Updated by: Dan Druker, 12/2/94

    === Other methods

    *Name: FOIL 6.0
    *Description: Learns relations from data. FOIL6.0 is a fairly comprehensive (and overdue) rewrite of FOIL5.2. The code is now more compact, better documented, and faster for most tasks. The language has changed to ANSI C.
    *Platform(s): Unix
    *Contact: FOIL6.0 is available by anonymous ftp. Login as anonymous with your email address as password. The file is "~ftp/pub/" (a shar file). Comments and bug reports are most welcome!
    Ross Quinlan, Mike Cameron-Jones
    *Updated: Fri, 29 Oct 1993 11:07:26 +1000

    === Multistrategy Tools

    *Name: Darwin
    *Description: Comprises 4 tools: a) StarMatch - uses memory-based reasoning technology to compare, in parallel, the characteristics of one database record to all others to find similar situations that can be used to predict outcomes; b) StarNet - uses neural network technology to define groups; c) StarTree - uses a parallel implementation of classification and regression trees (CART) technology to develop segmentation rules that define clusters; d) StarGene - uses simulated evolutionary techniques to optimize forecasting algorithms.
    *Platform(s): Thinking Machines
    *Contact: Thinking Machines, 245 First Street, Cambridge, MA 02142-1264, Tel: (617) 234-1000, Fax: (617) 234-4444
    *Status: product
    *Updated by: Hing-Yan Lee, 1994-06-20
    *Name: DataMariner
    *Description: DataMariner combines classical statistical techniques with inductive machine learning to discover multivariate relationships in numerical and discrete data. The product consists of a set of tools for KDD, including: clustering algorithms, automatic formation of new attributes, simplifying attributes, rule induction, incremental rule induction, rule pruning, cross-validation, rule evaluation and a graphical display of rules. The rule induction algorithm has several unique features that differentiate it from the ID3-based algorithms. These include the per-class nature of the algorithm and a well-founded treatment of noise and unknown values. The tools are integrated into a desktop-style graphical user interface, but are also available as independent command line programs.
    *Discovery methods: clustering, visualization, classification
    *Comments: See C. Bryant's paper in KDD Colloquium, 1-2 Feb 1995.
    *Source: Logica UK Ltd
    *Platform(s): Sun workstations, Solaris 1 (SunOS 4) or Solaris 2
    *Status: product
    *Contact:
         Richard Dallaway
         Logica UK Ltd
         Stephenson House
         75 Hampstead Road
         London NW1 2PL
         Phone: +44 (0)171 637 9111
         Fax: +44 (0)171 344 3621
    *Updated by: Richard Dallaway 13 Feb 1995
    *Name: DBlearn
    *Description: An integrated system for finding characteristic and classification rules from data in relational databases. It applies an attribute-oriented induction method and the system has been tested on several large databases with good performance.
    *Platform(s): Unix
    *Contact: Jiawei Han, Simon Fraser University, Canada.
    *Status: research system
    *Updated: 1994-06-20
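The attribute-oriented induction method named in the DBlearn entry generalizes attribute values by climbing concept hierarchies and then merges tuples that have become identical. The sketch below shows a single generalization step of that kind. It is not DBlearn code, and a full method would also apply thresholds and drop attributes that cannot be generalized; the function name, the hierarchies, and the sample rows are assumptions.

```python
from collections import Counter


def generalize(rows, hierarchies):
    """Replace each value by its parent concept, then merge equal rows.

    `hierarchies` maps an attribute name to a {value: parent} dict.
    Values without a listed parent are left unchanged. The result is a
    list of (generalized_row, count) pairs.
    """
    merged = Counter()
    for row in rows:
        general = tuple(
            (attr, hierarchies.get(attr, {}).get(value, value))
            for attr, value in sorted(row.items())
        )
        merged[general] += 1
    return [(dict(key), count) for key, count in merged.items()]


# Hypothetical student records and concept hierarchies.
rows = [
    {"major": "physics", "birthplace": "Vancouver"},
    {"major": "chemistry", "birthplace": "Toronto"},
    {"major": "history", "birthplace": "Boston"},
]
hierarchies = {
    "major": {"physics": "science", "chemistry": "science", "history": "arts"},
    "birthplace": {"Vancouver": "Canada", "Toronto": "Canada", "Boston": "USA"},
}
summary = generalize(rows, hierarchies)
```

The two science students born in Canada collapse into one generalized tuple with a count of 2, which is the raw material for a characteristic rule.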
    *Name: EMERALD (version 2)
    *Description: a system of machine learning and discovery tools for education and research.
    Developed by the Artificial Intelligence Center at George Mason University,
    EMERALD (version 2) introduces users to five different learning programs,
    explains how they work, and allows users to experiment with them by
    designing their own problems, made up from predefined objects. The system
    has a well-designed and attractive interface that uses color graphics.
    Rules learned by the system are automatically translated to English and
    spoken by a speech synthesizer. The system has already been delivered to
    many universities, including many in Europe, where it was demonstrated at
    several summer schools.
    The system includes several learning systems integrated at the user's level:
    1) for learning rules from examples,
    2) for learning structural descriptions of objects,
    3) for conceptually grouping objects or events, 
    4) for discovering rules characterizing sequences, and 
    5) for learning equations based on qualitative and quantitative data.  
    It is envisioned that users could add their own modules in the future 
    that represent other learning paradigms.

    *Platform(s): EMERALD runs on a Sun Workstation with a color monitor. Sun Common Lisp and OpenWindows (version 2 or higher) are required. A Sun Pascal library is necessary to run the Pascal applications. While not necessary, DecTalk voice synthesis device is highly recommended to enhance the presentation. The system is delivered on a high-density 1.5" tape unless other arrangements are made.
    *Contact:
     Dr. Janusz Wnek
     Assistant Director for Research Management
     Center for Artificial Intelligence
     George Mason University
     4400 University Dr.
     Fairfax, VA 22030, USA
     tel. (703) 993-1717
     fax. (703) 993-3729

    *Status: public domain, prototype
    *Updated: 1993

    *Name: INSPECT
    *Description: It is an MSDOS-based tool for the interpretation of data (a lot of graphics, visualisation, PCA, MLR, neural networks, etc).
    From: Hans LOHNINGER
    Subject: Re: RBF NN Good Function Estimator?
    Date: 11 Nov 1993 14:00:37 GMT
    I have been working with RBF Networks for some time and had no problem
    in approximating whatever function I wanted. ... If you are interested
    in another RBF implementation (in order to verify your results) feel free to
    download INSPECT. It is an MSDOS-based tool for the interpretation of data 
    (a lot of graphics, visualisation, PCA, MLR, neural networks, etc).
    It runs on IBM PCs (at least 286, 486 recommended), exhibits a graphical
    user interface and provides some of the more important techniques of data
    interpretation. This software is now available in an early test version via
    anonymous ftp.
    Of course, if you are seriously using INSPECT, I would appreciate
    any bug reports or suggestions for further development.
    The server address:
         machine: (
         directory:   Sources/NeuralNet/Inst-of-Chem.
         files:       i-prgXXX.exe, and i-docXXX.exe
    The files in this directory are self-extracting MSDOS files and contain both
    the program files (i-prgXXX.exe) and the documentation (i-docXXX.exe, which
    is a bit fragmentary). The characters 'XXX' in the file-names stand for the
    version number (currently around 067).
         1. Copy these files from the ftp-server to a dedicated directory on the PC
         2. Run i-docXXX.exe (this extracts a PostScript file which contains the
            documentation, approx. 1200 kByte, and some README file with the
            latest information on INSPECT)
         3. Run i-prgXXX.exe (this extracts the program files and some sample
            data, approx. 800 kByte)
         4. Read the documentation (Installation of INSPECT), install it (a very
            simple task) and go.

    *Source: Usenet posting 6025
    *Platform(s): IBM PC (at least 286, 486 recommended)
    *Contact:
      **    Dr. Hans Lohninger                     **
      **    Institute of General Chemistry         **
      **    Technical University Vienna            **
      **    Lehargasse 4/152                       **
      **    A-1060 Vienna, Austria                 **
      **    fax:    ++43-1-587-4835                **
      **    voice:  ++43-1-58801-5048              **

    *Status: public domain prototype
    *Updated: 1994-07-26
    *Name: Mobal
    *Description: Mobal 3.0 is an enhanced version of the GMD knowledge acquisition and machine learning system for first-order KBS development on Sparc workstations. Mobal is a multistrategy learning system that integrates a manual knowledge acquisition and inspection environment, a powerful first-order inference engine, and various machine learning methods for automated knowledge acquisition, structuring, and theory revision.
    *Comments: As the most visible change, the new release 3.0 no longer requires Open Windows, but features an X11 graphical user interface built using Tcl/Tk. This should make installation trouble-free for most users, and through its networked client-server structure, allows easy integration with other programs. As a second change resulting from work in the ILP ESPRIT Basic Research project, Mobal 3.0 now offers an "external tool" facility that allows other (ILP) learning algorithms to be interfaced to the system and used from within the same knowledge acquisition environment. The current release of Mobal includes interfaces to GOLEM by S. Muggleton and C. Feng (Oxford University), GRDT by V. Klingspor (Univ. Dortmund) and FOIL 6.1 by R. Quinlan and M. Cameron-Jones (Sydney Univ.).
    *Platform(s): Unix
    *Contact: GMD grants a cost-free license to use Mobal for academic purposes. The system can be obtained by anonymous ftp from directory /ml-archive/GMD/software/Mobal (login anonymous, password your e-mail address).
    For details about the scientific background of Mobal, see the book "Knowledge Acquisition and Machine Learning", by K. Morik, S. Wrobel, J.-U. Kietz and W. Emde (Academic Press, 1993). A user guide is available via FTP.
    *Status: public domain
    *Updated: 1994-07-06
    *Name: Recon
    *Description: Recon provides data analysts and decision-makers with a suite of data mining services. It combines:
  • Top-down data mining. Analysts propose relationships that may hold among the data based on their knowledge and models. Recon validates the relationships against the data and helps analysts refine them.
  • Bottom-up data mining. Recon automatically extracts relationships from data, and analysts use them to augment their models.
    Recon's data mining methods include: deductive database, rule induction, clustering, visualization, neural networks, and nearest neighbor.
    *Comments: Recon's open architecture allows it to interface with a variety of data sources, including relational databases (Oracle, Sybase, DB2, etc.), spreadsheets, proprietary databases and domain-specific databases, and ASCII files. Recon is also expensive. Data analysis by Lockheed engineers is included, as is customization of the selected data mining algorithm for the customer's operating environment, plus training of the customer's data analysts on the use of the system. The simplest contract is $30,000; the most extensive contract is priced at over $250,000.
    *Platform(s): Unix
    *Contact:
    	Dr. Evangelos Simoudis
    	Lockheed AI Center,
    	3251 Hanover Street, Palo Alto CA 94304
    	Voice: (415) 354-5271	Fax: 415-424-3425

    *Status: Lockheed product
    *Updated by: Sandra Oudshoff on 1994-07-27