Causality Workbench (Challenges in Machine Learning)

Causality Challenge #1: Causation and Prediction

Challenge Datasets

We propose datasets from various application domains, and we took great care to use real data. Some datasets encode real data directly, while others result from "re-simulation", i.e. they were obtained from data simulators trained on real data. All datasets whose names differ only in the last digit share the same training set. The digit zero indicates an unmanipulated test set. See the details of our data generating process.

| Dataset | Description | Test data | Variables (num) | Target | Training examples | Test examples | Download (text format) | Download (Matlab format) |
|---|---|---|---|---|---|---|---|---|
| REGED0 | Genomics re-simulated data | Not manipulated | Numeric (999) | Binary | 500 | 20000 | 31 MB | 25 MB |
| REGED1 | Genomics re-simulated data | Manipulated (see list of manipulated variables) | Numeric (999) | Binary | 500 | 20000 | 31 MB | 25 MB |
| REGED2 | Genomics re-simulated data | Manipulated | Numeric (999) | Binary | 500 | 20000 | 31 MB | 25 MB |
| SIDO0 | Pharmacology real data w. probes | Not manipulated | Binary (4932) | Binary | 12678 | 10000 | 12 MB | 14 MB |
| SIDO1 | Pharmacology real data w. probes | Manipulated | Binary (4932) | Binary | 12678 | 10000 | 12 MB | 14 MB |
| SIDO2 | Pharmacology real data w. probes | Manipulated | Binary (4932) | Binary | 12678 | 10000 | 12 MB | 14 MB |
| CINA0 | Census real data w. probes | Not manipulated | Mixed (132) | Binary | 16033 | 10000 | 1 MB | 1 MB |
| CINA1 | Census real data w. probes | Manipulated | Mixed (132) | Binary | 16033 | 10000 | 1 MB | 1 MB |
| CINA2 | Census real data w. probes | Manipulated | Mixed (132) | Binary | 16033 | 10000 | 1 MB | 1 MB |
| MARTI0 | Genomics re-simulated data w. noise | Not manipulated | Numeric (1024) | Binary | 500 | 20000 | 47 MB | 35 MB |
| MARTI1 | Genomics re-simulated data w. noise | Manipulated | Numeric (1024) | Binary | 500 | 20000 | 47 MB | 35 MB |
| MARTI2 | Genomics re-simulated data w. noise | Manipulated | Numeric (1024) | Binary | 500 | 20000 | 47 MB | 35 MB |
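
The naming convention described above (a family name followed by one digit, with zero meaning an unmanipulated test set) can be captured in a few lines of Python. This is only an illustrative sketch: the helper names `parse_dataset_name` and `same_training_set` are hypothetical and are not part of the challenge material.

```python
import re
from typing import NamedTuple


class DatasetName(NamedTuple):
    family: str        # e.g. "REGED"; one family shares one training set
    version: int       # trailing digit of the dataset name
    manipulated: bool  # True when the test set was manipulated (digit != 0)


def parse_dataset_name(name: str) -> DatasetName:
    """Split a dataset name such as 'REGED1' into family and digit."""
    match = re.fullmatch(r"([A-Za-z]+)(\d)", name)
    if match is None:
        raise ValueError(f"unexpected dataset name: {name!r}")
    family, digit = match.group(1).upper(), int(match.group(2))
    return DatasetName(family, digit, manipulated=digit != 0)


def same_training_set(a: str, b: str) -> bool:
    """Names that differ only in the last digit share a training set."""
    return parse_dataset_name(a).family == parse_dataset_name(b).family
```

For example, `parse_dataset_name("SIDO2")` reports a manipulated test set, and `same_training_set("CINA0", "CINA2")` is true because both names belong to the CINA family.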

A small example

We provide a small toy example dataset for practice purposes. You may submit results on these data on the Submit page as you would for the challenge datasets. Your results will appear on the Result page, giving you immediate feedback. The results on these practice datasets WILL NOT COUNT as part of the challenge.
You may download all the example data as a single archive (Text format 1.1 MB, Matlab format 1.2 MB), or as individual datasets from the table below.

| Dataset | Description | Test data | Variables (num) | Target | Training examples | Test examples | Download (text format) | Download (Matlab format) |
|---|---|---|---|---|---|---|---|---|
| LUCAS0 | Toy medicine data | Not manipulated | Binary (11) | Binary | 2000 | 10000 | 31 KB | 22 KB |
| LUCAS1 | Toy medicine data | Manipulated | Binary (11) | Binary | 2000 | 10000 | 31 KB | 22 KB |
| LUCAS2 | Toy medicine data | Manipulated | Binary (11) | Binary | 2000 | 10000 | 31 KB | 22 KB |
| LUCAP0 | Toy medicine data w. probes | Not manipulated | Binary (143) | Binary | 2000 | 10000 | 341 KB | 262 KB |
| LUCAP1 | Toy medicine data w. probes | Manipulated | Binary (143) | Binary | 2000 | 10000 | 342 KB | 654 KB |
| LUCAP2 | Toy medicine data w. probes | Manipulated | Binary (143) | Binary | 2000 | 10000 | 342 KB | 263 KB |

Dataset Formats

All datasets are formatted in a similar way. They include a training set from a "natural" distribution, in the form of a data table with variables in columns and samples in the rows. One particular column called "target" is singled out. A larger test set in the same format is provided, without the targets. The test set is not necessarily drawn from the same distribution as the training set. In particular, some variables may have been "manipulated", i.e. an external agent may have set them to given values, therefore de facto disconnecting them from their natural causes.
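
To make the notion of manipulation concrete, here is a toy Python simulation. It is not the challenge's data generating process: the chain cause -> target -> effect, the probabilities, and the function names are all assumptions chosen for illustration. It shows that a variable set by an external agent stops carrying information about the target, while an unmanipulated cause remains informative.

```python
import random


def sample(n, manipulate_effect, seed=0):
    """Draw n samples from a toy chain cause -> target -> effect.

    When manipulate_effect is True, an external agent sets `effect` by a
    coin flip, which disconnects it from its natural cause (the target).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        cause = rng.random() < 0.5
        # The target copies its cause 80% of the time (assumed value).
        target = cause if rng.random() < 0.8 else not cause
        if manipulate_effect:
            effect = rng.random() < 0.5
        else:
            # Naturally, the effect copies the target 90% of the time.
            effect = target if rng.random() < 0.9 else not target
        rows.append((int(cause), int(target), int(effect)))
    return rows


def agreement(rows, column):
    """Fraction of samples in which the chosen column equals the target."""
    return sum(r[column] == r[1] for r in rows) / len(rows)


natural = sample(10_000, manipulate_effect=False)
manipulated = sample(10_000, manipulate_effect=True, seed=1)

print("effect vs target, natural:    ", agreement(natural, 2))
print("effect vs target, manipulated:", agreement(manipulated, 2))
print("cause vs target, manipulated: ", agreement(manipulated, 0))
```

In this sketch the effect agrees with the target about 90% of the time in the natural distribution but only about half the time once it is manipulated, whereas the cause keeps agreeing with the target about 80% of the time. A predictor trained on the natural data that leans on the effect would therefore degrade on the manipulated test set.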

All data sets are in the same format and include 4 files in text format:

For convenience, we also provide the data tables in Matlab format: dataname_train.mat and dataname_test.mat.

If you are a Matlab user, you can download some sample code to read and check the data.
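
For readers who do not use Matlab, a rough Python counterpart for reading and checking the text-format tables might look like the sketch below. It assumes whitespace-separated numeric values with one sample per line, and it assumes the training targets are held in a separate sequence. Neither detail is specified on this page, and the function names `read_table` and `check_split` are hypothetical.

```python
def read_table(lines):
    """Parse whitespace-separated numeric rows, one sample per line."""
    table = [[float(v) for v in line.split()] for line in lines if line.strip()]
    widths = {len(row) for row in table}
    if len(widths) > 1:
        raise ValueError(f"rows have inconsistent lengths: {sorted(widths)}")
    return table


def check_split(train, targets, test):
    """Basic sanity checks on a training table, its targets, and a test table."""
    if len(train) != len(targets):
        raise ValueError("need exactly one target per training sample")
    if train and test and len(train[0]) != len(test[0]):
        raise ValueError("training and test tables must have the same variables")
    return {
        "n_train": len(train),
        "n_test": len(test),
        "n_variables": len(train[0]) if train else 0,
    }


# Tiny made-up tables standing in for real files.
train = read_table(["0 1 1", "1 0 1"])
targets = [1, 0]  # hypothetical label coding
test = read_table(["1 1 0", "0 0 0", "1 0 1"])
summary = check_split(train, targets, test)
print(summary)
```

Because `read_table` accepts any iterable of lines, an open file object can be passed to it directly. The resulting counts can then be compared against the "Training examples", "Test examples", and "Variables (num)" columns of the tables above.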