TIED

Target Information Equivalent Dataset

Contact: Alexander Statnikov - Submitted: 2008-09-12 20:24

Abstract:

TIED dataset

© 2008 Alexander Statnikov and Constantin Aliferis

Introduction

TIED stands for Target Information Equivalent Dataset. It is an artificial simulated dataset constructed to illustrate that there may be many minimal sets of features with optimal predictivity (i.e., Markov boundaries) and likewise many sets of features that are statistically indistinguishable from the set of direct causes and direct effects of the target.

Data-analysis tasks

Participants are encouraged to complete all 3 tasks given below; however, submitted results will be evaluated as long as a participant has completed at least task 1 or task 2.

1. Using training data, find all sets of variables that are statistically indistinguishable from the set of direct causes and direct effects (DCE) of the target variable.
2. Using training data, find all Markov boundaries (defined as in Pearl, "Probabilistic Reasoning in Intelligent Systems", 1988).
3. For each of the Markov boundaries identified in task 2, build a classifier model of the target variable using training data and apply it to the testing data.

Submission requirements

The submission should be prepared according to the requirements given below and sent by email to alexander.statnikov@vanderbilt.edu in an archive file named "Lastnameofparticipant_TIED.zip" (e.g., Statnikov_TIED.zip). If the submission file is larger than 10 MB, please contact us at the above email address before sending the file.

The complete submission (for all 3 tasks) should consist of four text files:

1. File DCE.txt: Each line contains indices of a set of variables that is statistically indistinguishable from the set DCE. Maximum number of lines in this file is 10000. E.g.:

============= Example of DCE.txt =============
1 2 3 4 % This is a set of variables that is statistically indistinguishable from the set of DCE
3 2 8 3 % This is another set of variables that is statistically indistinguishable from the set of DCE
1 2 9 10 % This is one more set of variables that is statistically indistinguishable from the set of DCE
=========================================

2. File MB.txt: Each line contains indices of variables that participate in a Markov boundary. Maximum number of lines in this file is 10000. E.g.:

============== Example of MB.txt =============
1 2 3 4 8 % This is a Markov boundary (#1)
3 2 8 3 11 % This is another Markov boundary (#2)
1 2 9 10 12 % This is one more Markov boundary (#3)
=========================================

3. File MB_Predictions.txt: Line k contains the predictions for all 3000 samples in the testing data, made by the classifier built on the Markov boundary given in line k of MB.txt. E.g.:

========= Example of MB_Predictions.txt ========
0 1 2 3 0 1 2 ... 0 1 % Predictions for Markov boundary given in line #1 of MB.txt
2 1 2 3 3 1 2 ... 2 1 % Predictions for Markov boundary given in line #2 of MB.txt
1 1 2 3 2 1 2 ... 1 2 % Predictions for Markov boundary given in line #3 of MB.txt
=========================================

4. File README.txt: Brief description of the algorithms/methods used.
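
The file layouts above can be produced with a few lines of Python. The following is only a minimal sketch: the function names `format_sets` and `format_predictions` are hypothetical, and it assumes that the "%" comments in the examples are explanatory and need not appear in a real submission, and that variable indices are written exactly as the participant numbers them.

```python
def format_sets(variable_sets, max_lines=10000):
    """Return DCE.txt- or MB.txt-style text: one set of variable indices per line."""
    if len(variable_sets) > max_lines:
        raise ValueError(f"at most {max_lines} lines are allowed")
    lines = [" ".join(str(i) for i in s) for s in variable_sets]
    return "\n".join(lines) + "\n"


def format_predictions(predictions_per_boundary, n_test=3000):
    """Return MB_Predictions.txt-style text.

    Line k holds the predictions made with the Markov boundary
    listed in line k of MB.txt, one value per testing sample.
    """
    for row in predictions_per_boundary:
        if len(row) != n_test:
            raise ValueError(f"expected {n_test} predictions per line, got {len(row)}")
    lines = [" ".join(str(p) for p in row) for row in predictions_per_boundary]
    return "\n".join(lines) + "\n"


# Two Markov boundaries taken from the MB.txt example above.
mb_text = format_sets([[1, 2, 3, 4, 8], [1, 2, 9, 10, 12]])
```

The returned strings can then be written to MB.txt, DCE.txt and MB_Predictions.txt with an ordinary `open(..., "w")` call before the files are zipped.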

Evaluation metrics

The performance metrics to evaluate results in the file MB.txt will include the following:

I. Total number of Markov boundaries output by the algorithm (i.e., number of lines in the file MB.txt).
II. Number of Markov boundaries that were correctly discovered (relative to the gold standard) with no false negative variables but with possible false positive variables.
III. Average number of false positive variables in the output Markov boundaries used for computation of the metric II.
IV. Penalized proportion of discovered Markov boundaries (this metric seeks to maximize the product of sensitivity and specificity for identification of each true Markov boundary).

Evaluation of the results in the file DCE.txt will include the above performance metrics adjusted for identification of the sets of variables that are statistically indistinguishable from the set DCE.
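
For self-checking against a known gold standard (for example on simulated data of one's own), metrics I to III can be approximated as below. This is a sketch under stated assumptions, not the organizers' scoring code: the function name `evaluate_sets` is hypothetical, metric II is assumed to count gold-standard sets that are fully contained in at least one output set, and when several output sets contain the same gold-standard set the one with the fewest extra variables is assumed to be used for metric III. Metric IV is left out because its exact formula is not given here.

```python
def evaluate_sets(found, gold):
    """Return (metric I, metric II, metric III) for output sets vs. gold-standard sets."""
    found = [set(s) for s in found]
    gold = [set(s) for s in gold]

    total_output = len(found)  # metric I: number of lines in the output file

    discovered = 0   # metric II: gold sets recovered with no false negatives
    fp_counts = []   # false positives of the output set credited for each recovered gold set
    for g in gold:
        # Output sets containing every variable of g, i.e. no false negatives.
        supersets = [f for f in found if g <= f]
        if supersets:
            discovered += 1
            # Assumption: credit the containing set with the fewest extra variables.
            fp_counts.append(min(len(f - g) for f in supersets))

    # Metric III: average false positives over the sets counted in metric II.
    avg_fp = sum(fp_counts) / len(fp_counts) if fp_counts else 0.0
    return total_output, discovered, avg_fp


gold_standard = [[1, 2, 3], [1, 2, 9]]
output_sets = [[1, 2, 3, 4], [1, 2, 9, 10, 11], [5, 6]]
metrics = evaluate_sets(output_sets, gold_standard)
```

The same function applies unchanged to the sets in DCE.txt, with the gold standard replaced by the sets that are statistically indistinguishable from DCE.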

Evaluation of classification predictions in the file MB_Predictions.txt will be performed using a weighted accuracy metric.
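
The weighted accuracy metric is not defined further on this page. One common reading, assumed here purely for illustration, is the average of the per-class accuracies, so that each target class counts equally regardless of how many samples it has. The function name `weighted_accuracy` is hypothetical, and the organizers' weighting may differ.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Average of per-class accuracies (assumed definition, each class weighted equally)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    return sum(correct[c] / total[c] for c in total) / len(total)


# Always predicting the majority class gives 75% plain accuracy here,
# but only 50% under this class-balanced definition.
score = weighted_accuracy([0, 0, 0, 1], [0, 0, 0, 0])
```

Since labels for the testing data are not among the provided files, such a check would have to be run on a held-out part of the training data.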

Provided data files

The following files are provided:
- Training target labels (file: tied_train.targets)
- Training data (file: tied_train.data)
- Testing data (file: tied_test.data)

Note: both training and testing data are drawn from the same distribution.
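
A minimal loading sketch follows. The on-disk layout of the files is not described on this page, so the code assumes plain text with one sample per line and whitespace-separated integer values; the helper names `parse_matrix` and `select_columns`, and the 1-based indexing default, are likewise assumptions.

```python
def parse_matrix(text):
    """Parse whitespace-separated integers, one sample per non-empty line."""
    rows = [[int(tok) for tok in line.split()]
            for line in text.splitlines() if line.strip()]
    if len({len(r) for r in rows}) > 1:
        raise ValueError("rows have differing numbers of columns")
    return rows


def select_columns(rows, indices, one_based=True):
    """Keep only the variables in `indices`, e.g. one candidate Markov boundary."""
    offset = 1 if one_based else 0
    return [[row[i - offset] for i in indices] for row in rows]


# Tiny in-memory stand-in for the contents of a data file.
sample = "0 1 2\n3 0 1\n"
rows = parse_matrix(sample)
subset = select_columns(rows, [1, 3])

# With the real files this would read, for example:
#   with open("tied_train.data") as f:
#       train = parse_matrix(f.read())
```

Restricting both the training and the testing matrices to the same columns gives the inputs needed to build and apply a classifier for one Markov boundary in task 3.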
