Join our Google group causalitychallenge to stay informed!
The test data decryption key is AL_3CAv2++
The participants have until September 2, 2013 to make their final submission.
We released new datasets including both artificial and real data, distributed similarly. The final test data is encrypted and the decryption key will be released when the final test phase starts.
The new data include pairs of variables generated in a similar way to those of SUP2data, as well as pairs of real variables from various sources. These data differ from the original training and validation data in the normalization and quantization of variables. Algorithms that are invariant with respect to shift and scale should be unaffected by the change in normalization, and the distribution of the number of unique values is approximately even across classes.
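As an illustration of what invariance to shift and scale means in practice, here is a minimal sketch, assuming NumPy. The helpers `standardize` and `num_unique` are hypothetical names introduced for this example; they are not the organizers' postprocessing code. Standardizing each variable before computing features is one common way to make a method insensitive to shift and scale, and counting unique values is a simple way to inspect quantization.

```python
import numpy as np


def standardize(x):
    """Return x shifted to zero mean and scaled to unit standard deviation.

    Hypothetical helper: constant variables are mapped to all zeros
    to avoid dividing by zero.
    """
    x = np.asarray(x, dtype=float)
    std = x.std()
    if std == 0:
        return np.zeros_like(x)
    return (x - x.mean()) / std


def num_unique(x):
    """Count the distinct values of a variable (a rough quantization check)."""
    return int(np.unique(np.asarray(x)).size)


# A positive rescaling plus a shift leaves the standardized values unchanged.
x = np.array([1.0, 2.0, 3.0, 4.0])
same = np.allclose(standardize(x), standardize(3.0 * x + 7.0))
```

Features computed on standardized variables cannot pick up differences in offset or units between the original and the new data releases.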
The validation and test sets have the same number of examples. In this way, the participants can make sure that their code runs fast enough to deliver the results on time once the final test data decryption key is released. Also, the risk of overfitting the validation data is lessened.
| Set | Num pairs |
|---|---|
| FINAL TRAIN | 4050 |
| FINAL VALID | 4050 |
| FINAL TEST | 4050 |
To address a problem of bias and normalization in the original data release, we released 3 datasets. All variables are postprocessed in the same way.
June 2013: We provided one additional training dataset generated from real data (SUP3data), except for the A-B pairs, which are semi-artificial. The SUP3 data were drawn from a large pool of real A->B cause-effect pairs of variables from various sources. The roles of A and B were reversed in half of them to create B->A pairs. A random subset of half of the original pairs was selected to create A|B pairs by randomly and independently permuting the values of A and B. The A-B pairs were obtained from a random selection of half of the original pairs, to which an algorithm that preserves the marginal distributions while destroying the causal relationships was applied. For this, pairs of artificially generated dependent variables that are not in a causal relationship were used, and their values were replaced by the values of the real variables in a way that preserves the rank ordering of the values (i.e. the smallest value in the artificial variable is replaced by the smallest value in the real variable, the second smallest artificial value by the second smallest real value, etc.).
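The three constructions described above can be sketched as follows. This is a minimal illustration assuming NumPy, not the organizers' actual generation code. The function names `reverse_pair`, `independent_pair`, and `rank_replace` are hypothetical, and the way the artificial dependent pairs were generated is not specified in the text, so it is left out.

```python
import numpy as np


def reverse_pair(a, b):
    """Swap the roles of A and B so that an A->B pair becomes a B->A pair."""
    return np.array(b), np.array(a)


def independent_pair(a, b, rng):
    """Permute A and B independently of each other.

    Each marginal distribution is preserved, but the pairing between
    values of A and values of B is destroyed (an A|B pair).
    """
    return rng.permutation(a), rng.permutation(b)


def rank_replace(artificial, real):
    """Replace artificial values by real values of the same rank.

    The smallest artificial value becomes the smallest real value,
    the second smallest becomes the second smallest, and so on.
    Both inputs must have the same length.
    """
    artificial = np.asarray(artificial)
    real = np.asarray(real)
    if artificial.shape != real.shape:
        raise ValueError("artificial and real must have the same shape")
    order = np.argsort(artificial, kind="stable")  # positions sorted by value
    out = np.empty_like(real)
    out[order] = np.sort(real)  # i-th smallest position gets i-th smallest real
    return out


rng = np.random.default_rng(0)  # fixed seed, only for reproducibility here
a = np.array([0.1, 0.4, 0.2, 0.9])
b = np.array([1.0, 3.0, 2.0, 5.0])

b_to_a = reverse_pair(a, b)
a_perm, b_perm = independent_pair(a, b, rng)
replaced = rank_replace([0.3, -1.0, 2.5], [10, 30, 20])  # -> [20, 10, 30]
```

Applying `rank_replace` to each variable of an artificial dependent pair keeps the dependency structure of the artificial pair (its rank ordering) while giving each variable the marginal distribution of a real variable.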
| Set | Num pairs |
|---|---|
| SUP1 [numerical] | 5998 |
| SUP2 [mixed] | 5989 |
| SUP3 [numerical+binary] | 162 |
These data have become obsolete; we recommend using the SUP2 data listed above for training instead. The originally released data have a flaw in data normalization and value quantization that introduces some bias among the causal classes.
| Set | Num pairs |
|---|---|
| TRAIN | 7831 |
| VALID | 2642 |