- Causality and prediction
- Pot-luck
- Active learning
- Unsupervised learning
- Cause-effect pairs
- Connectomics

**Register to our Google group causalitychallenge to stay informed!**

- From the Kaggle website.
- From the Causality Workbench.

The test data decryption key is AL_3CAv2++

The participants have until September 2, 2013 to make their final submission.

We released new datasets including both artificial and real data, distributed similarly. The final test data is encrypted and the decryption key will be released when the final test phase starts.

The new data include pairs of variables generated in a similar way as those of SUP2data, together with pairs of real variables from various sources. These data are **different** from the original training and validation data with respect to the **normalization** and **quantization** of variables, so algorithms should be invariant with respect to shift and scale. In addition, the distribution of the number of unique values is approximately even across classes.

The validation and test sets have the same number of examples. In this way, the participants can make sure that their code runs fast enough to deliver the results on time once the final test data decryption key is released. Also, the risk of overfitting the validation data is lessened.

| Set | Num pairs |
|---|---|
| FINAL TRAIN | 4050 |
| FINAL VALID | 4050 |
| FINAL TEST | 4050 |

REMINDER: You are not limited to using the provided training data.

To address a bias and normalization problem in the original data release, we released 3 datasets. All variables are postprocessed in the same way:

- A random sub-sample of the original values is drawn without replacement; the sample size num_val is drawn uniformly on a log2 scale between min_size and max_size, where min_size=500 and max_size=8000. Pairs with fewer than 500 examples are not subsampled.
- Pairs having at least one variable with only 1 value are eliminated.
- Variables with 2 values are considered binary and mapped to 0/1.
- Categorical variables with C values are randomly assigned class numbers between 1 and C.
- Numerical variables (discrete or continuous) are **standardized** (the mean is subtracted and the result is divided by the standard deviation) and then **quantized** by multiplying the result by 10000 and rounding to the nearest integer.
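The postprocessing steps above can be sketched as follows. This is a minimal illustration, not the organizers' code; the function names and the helper structure are my own.

```python
import numpy as np

MIN_SIZE, MAX_SIZE = 500, 8000  # subsampling bounds stated in the text

def subsample(values, rng):
    """Sub-sample without replacement; the target size is drawn
    uniformly on a log2 scale between MIN_SIZE and MAX_SIZE."""
    if len(values) < MIN_SIZE:
        return values  # pairs with fewer than 500 examples are not subsampled
    num_val = int(2 ** rng.uniform(np.log2(MIN_SIZE), np.log2(MAX_SIZE)))
    num_val = min(num_val, len(values))
    return rng.choice(values, size=num_val, replace=False)

def postprocess_numerical(x):
    """Standardize (subtract mean, divide by std), then quantize by
    multiplying by 10000 and rounding to the nearest integer."""
    z = (x - x.mean()) / x.std()
    return np.rint(z * 10000).astype(int)

def postprocess_binary(x):
    """Map a two-valued variable to 0/1 (smaller value -> 0)."""
    lo, hi = np.unique(x)
    return (x == hi).astype(int)
```

For example, `postprocess_numerical(np.array([1.0, 2.0, 3.0, 4.0]))` yields symmetric integer codes around zero, and variables with a single unique value would be caught before this step and eliminated, as described above.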

**June 2013:** We provided one additional training dataset generated from real data (SUP3data); all pairs are real except the A-B pairs, which are semi-artificial. The SUP3 data were drawn using a large pool of real A->B cause-effect pairs of variables from various sources. The roles of A and B were reversed in half of them to create B->A pairs.
A random subset of half of the original pairs was selected to create A|B pairs by independently permuting the values of A and B.
The A-B pairs were obtained from a random selection of half of the original pairs, to which an algorithm was applied that preserves the marginal distributions while destroying the causal relationship: pairs of artificially generated dependent variables that are not in a causal relationship were created, and their values were replaced by the values of the real variables in a way that preserves the rank ordering (i.e. the smallest artificial value becomes the smallest real value, the second smallest artificial value becomes the second smallest real value, etc.).
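The three pair-construction schemes described above can be illustrated roughly as follows. This is my own sketch under the assumptions stated in the text, not the organizers' generator; the rank-matching step for A-B pairs is implemented here via `argsort`.

```python
import numpy as np

def make_ba_pair(a, b):
    """Reverse the roles of cause and effect: an A->B pair becomes B->A."""
    return b, a

def make_independent_pair(a, b, rng):
    """Destroy the dependency (A|B) by permuting A and B independently."""
    return rng.permutation(a), rng.permutation(b)

def rank_match(artificial, real):
    """Replace artificial values by real values while preserving rank order:
    the smallest artificial value receives the smallest real value,
    the second smallest receives the second smallest, and so on."""
    real = np.asarray(real, dtype=float)
    out = np.empty_like(real)
    out[np.argsort(artificial)] = np.sort(real)
    return out
```

Applying `rank_match` to each variable of an artificial dependent-but-non-causal pair gives a pair whose marginal distributions are those of the real variables, while the causal relationship of the original real pair is gone.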

| Set | Num pairs |
|---|---|
| SUP1 [numerical] | 5998 |
| SUP2 [mixed] | 5989 |
| SUP3 [numerical+binary] | 162 |

These data have become obsolete; we recommend using the SUP2 data described above for training. The originally released data have a flaw in data normalization and value quantization that introduces some bias among the causal classes.

| Set | Num pairs |
|---|---|
| TRAIN | 7831 |
| VALID | 2642 |