Causality Challenge #3: Cause-effect pairs

Join our Google group causalitychallenge to stay informed!

Data for track 2

The data may be downloaded:

From the Kaggle website.
From the Causality Workbench.

We provide hundreds of pairs of real variables with known causal relationships from domains as diverse as chemistry, climatology, ecology, economy, engineering, epidemiology, genomics, medicine, physics, and sociology. Those are intermixed with controls (pairs of independent variables and pairs of variables that are dependent but not causally related) and semi-artificial cause-effect pairs (real variables mixed in various ways to produce a given outcome). We will also test the top-ranking algorithms with data provided in track 1, but only the pairs provided by the organizers will be used for scoring.

August 26, 2013 release of the test data decryption key:

The test data decryption key is AL_3CAv2++
The participants have until September 2, 2013 to make their final submission.

July 1, 2013 final data release:

We released new datasets including both artificial and real data, distributed similarly. The final test data is encrypted and the decryption key will be released when the final test phase starts.
The new data includes pairs of variables generated in a similar way to those of SUP2data and pairs of real variables from various sources. These data are different from the original training and validation data with respect to normalization and quantization of variables. Algorithms that are invariant with respect to shift and scale should be unaffected by the change in normalization, and the distribution of the number of unique values is approximately even across classes.
The validation and test sets have the same number of examples. In this way, the participants can make sure that their code runs fast enough to deliver the results on time once the final test data decryption key is released. Also, the risk of overfitting the validation data is lessened.

Set	Num pairs
FINAL TRAIN	4050
FINAL VALID	4050
FINAL TEST	4050

REMINDER: You are not limited to using the provided training data.

May-June 2013 supplementary artificial data release (SUP1data, SUP2data, and SUP3data):

To address a bias and normalization problem in the original data release, we released three datasets. All variables are postprocessed in the same way:

A random sub-sample of num_val of the original values is drawn without replacement, where the sample size num_val is drawn uniformly on a log2 scale between min_size and max_size, with min_size=500 and max_size=8000. Pairs with fewer than 500 examples are not subsampled.
Pairs having at least one variable with only 1 value are eliminated.
Variables with 2 values are considered binary and mapped to 0/1.
Categorical variables with C values are randomly assigned class numbers between 1 and C.
Numerical variables (discrete or continuous) are standardized (the mean is subtracted and then the result is divided by the standard deviation) and then quantized by multiplying the result by 10000 and rounding to the nearest integer.
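
The postprocessing steps listed above can be sketched in Python as follows. This is only an illustration, not the organizers' code: the function names, the `kind` label, the choice of which binary value maps to 0, and the use of the population standard deviation are all assumptions made for the example.

```python
import math
import random
import statistics

MIN_SIZE, MAX_SIZE = 500, 8000  # values stated in the text


def draw_sample_size(rng):
    """Draw num_val uniformly on a log2 scale between MIN_SIZE and MAX_SIZE."""
    log_size = rng.uniform(math.log2(MIN_SIZE), math.log2(MAX_SIZE))
    return int(round(2 ** log_size))


def subsample_pair(a, b, rng):
    """Subsample the rows of a pair without replacement.

    Pairs with fewer than MIN_SIZE examples are returned unchanged.
    """
    n = len(a)
    if n < MIN_SIZE:
        return list(a), list(b)
    num_val = min(draw_sample_size(rng), n)  # cannot draw more than n rows
    idx = rng.sample(range(n), num_val)
    return [a[i] for i in idx], [b[i] for i in idx]


def standardize_and_quantize(values):
    """Subtract the mean, divide by the std, multiply by 10000, round."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population std
    return [round((v - mean) / std * 10000) for v in values]


def postprocess_variable(values, kind, rng):
    """Return the processed variable, or None if it has only one value.

    `kind` is a hypothetical label: "numerical" or "categorical".
    """
    levels = sorted(set(values))
    if len(levels) == 1:
        return None  # the caller should eliminate the whole pair
    if len(levels) == 2:
        mapping = {levels[0]: 0, levels[1]: 1}  # assumption: smaller -> 0
        return [mapping[v] for v in values]
    if kind == "categorical":
        codes = list(range(1, len(levels) + 1))
        rng.shuffle(codes)  # random class numbers between 1 and C
        mapping = dict(zip(levels, codes))
        return [mapping[v] for v in values]
    return standardize_and_quantize(values)
```

For example, `postprocess_variable([1.0, 2.0, 3.0], "numerical", random.Random(0))` standardizes the three values and returns the integers `[-12247, 0, 12247]`.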

May 2013: We provided two additional artificially generated training datasets. Those training datasets have a balanced number of unique values across all classes. SUP1data includes ~6000 pairs of numerical variables. SUP2data includes ~6000 pairs of mixed variables (numerical, categorical, binary).

June 2013: We provided one additional training dataset generated from real data (SUP3data), except for the A-B pairs, which are semi-artificial. The SUP3 data were drawn from a large pool of real A->B cause-effect pairs of variables from various sources. The roles of A and B were reversed in half of them to create B->A pairs. A random subset of half of the original pairs was selected to create A|B pairs by randomly and independently permuting the values of A and B. The A-B pairs were obtained from a random selection of half of the original pairs, to which an algorithm that preserves the marginal distributions while destroying the causal relationships was applied. For these, pairs of artificially generated dependent variables that are not in a causal relationship were used, and their values were replaced by the values of the real variables in a way that preserves the rank ordering of the values (i.e. the smallest value in the artificial variable is the smallest in the real variable, the second smallest artificial value is the second smallest real value, etc.).
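
The three transformations described above (role reversal, independent permutation, and rank-preserving substitution) might look like the following minimal Python sketch. The function names are hypothetical, and the handling of ties (broken by position) is an assumption; the text does not specify the exact algorithm the organizers used.

```python
import random


def reverse_pair(a, b):
    """Swap the roles of A and B to turn an A->B pair into a B->A pair."""
    return list(b), list(a)


def independent_pair(a, b, rng):
    """Permute A and B independently to create an A|B pair.

    Each marginal distribution is kept; the dependence between them is broken.
    """
    a_perm, b_perm = list(a), list(b)
    rng.shuffle(a_perm)
    rng.shuffle(b_perm)
    return a_perm, b_perm


def rank_substitute(artificial, real):
    """Replace each artificial value by the real value of the same rank.

    The smallest artificial value becomes the smallest real value, and so on.
    Assumption: ties are broken by position (stable sort).
    """
    if len(artificial) != len(real):
        raise ValueError("both variables need the same number of values")
    order = sorted(range(len(artificial)), key=lambda i: artificial[i])
    real_sorted = sorted(real)
    out = [None] * len(artificial)
    for rank, i in enumerate(order):
        out[i] = real_sorted[rank]
    return out
```

For instance, `rank_substitute([0.3, 0.1, 0.2], [10, 30, 20])` returns `[30, 10, 20]`: the output contains exactly the real values, arranged in the rank order of the artificial variable.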

Set	Num pairs
SUP1 [numerical]	5998
SUP2 [mixed]	5989
SUP3 [numerical+binary]	162

March 2013 original data release (CEdata):

These data are obsolete; we recommend using the SUP2 data described above for training. The originally released data have a flaw in data normalization and value quantization that introduces some bias among the causal classes.

Set	Num pairs
TRAIN	7831
VALID	2642