Causality Workbench                                                             Challenges in Machine Learning

Causality Challenge #1: Causation and Prediction

Frequently Asked Questions

What is the goal of the challenge?
The goal is to make the best possible predictions of a target variable from a number of predictive variables in situations where some variables may have been "manipulated" by an external agent.

Are there prizes?
There will be prizes for winning on each dataset, a special achievement award, and a best paper award (see details). In addition, student participants can apply for travel grants to attend the WCCI 2008 workshop.

Is causality needed to solve the problems of the challenge?
The datasets ending with the digit "0" (e.g. REGED0, SIDO0) have a test set distributed identically to the training set. These are therefore regular machine learning tasks requiring no knowledge of causality. The other datasets have "manipulated" test sets. The knowledge of causal relationships may help for those datasets. You are welcome to take any approach you want to solve these problems, subject to the restriction of not using test data in training.

Why should I care about causality?
If you are a data mining/machine learning specialist who has encountered problems in which feature/variable selection is important for data understanding, prediction improvement, and efficiency, causality will expand your horizon. By understanding causal relationships between variables, you will be able to:
  • select variables that can be used to influence the target (causes),
  • select variables that can be used to monitor the value of your target without measuring it directly (consequences),
  • understand in which way predictive variables may be redundant (e.g. indirect causes may be redundant with more direct causes),
  • understand in which way predictive variables complement each other, and eventually eliminate variables that are falsely predictive (e.g. experimental artifacts),
  • define a minimal subset of optimally predictive variables.
Is there code I can use to perform the challenge tasks?
Yes, see the software repository. Note that many available software packages capable of learning causal graphs cannot handle large amounts of variables. In this challenge, you do not necessarily need to learn the full graph structure since we are focusing on a given target variable. Some algorithms are specifically optimized to discover "local" structure, others can be adapted.

How do you define causality?
There are many possible definitions of causality. See for example our tutorial page. In this challenge, we connect tightly the notion of causality to that of "manipulation". Informally, if the action of an external agent on a given variable influences the target, then that variable is a cause of the target. Conversely, acting on consequences of the target, or on other variables not causally related to the target, should have no effect on the target.

How do you define a "manipulation"?
For the purpose of the challenge, manipulations consist in "clamping" a set of "manipulated" variables to given values and letting the system evolve according to its own dynamics and settle in a "stable state" before sampling other variables. More generally, manipulations are "actions" or "experiments" performed by an "external agent" on a system, whose effect disrupts the "natural" functioning of the system. We distinguish between:
  • "Observational data", which are data collected in the absence of manipulation (like the training set and the test set in the datasets ending with the digit "0").
  • "Manipulated data", which are data obtained as a result of manipulations.
The dependency between a variable and its immediate causes is altered by the manipulations.
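As an illustration, clamping can be sketched in a toy two-variable structural model. The variables, coefficients, and seed below are invented for illustration; this is not one of the challenge generators:

```python
import random

# Toy structural system (invented for illustration): X causes Y.
def sample(clamp_x=None):
    """Draw one example; clamp_x simulates an external agent clamping X."""
    x = clamp_x if clamp_x is not None else random.gauss(0, 1)
    y = 2 * x + random.gauss(0, 0.1)  # Y depends on its cause X
    return x, y

random.seed(0)
observational = [sample() for _ in range(1000)]           # natural functioning
manipulated = [sample(clamp_x=1.0) for _ in range(1000)]  # X clamped to 1.0

mean_y_obs = sum(y for _, y in observational) / len(observational)
mean_y_man = sum(y for _, y in manipulated) / len(manipulated)
# Clamping the cause X shifts Y (mean_y_man is near 2.0); clamping Y instead
# would leave X unchanged, which is what distinguishes causes from effects.
```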

Did you manipulate the target variable?
No. We never manipulate the target variable.

Did you manipulate hidden variables?
No. We do not manipulate hidden variables.

What is a probe?
A probe is an artificial variable added to real data. It is constructed as a random function of the real variables, plus some noise. It is never a cause of the target.
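A probe of this kind could be sketched as follows. This is a hypothetical linear construction with made-up sizes; the actual probe generators of the challenge are not disclosed:

```python
import numpy as np

rng = np.random.default_rng(0)
X_real = rng.normal(size=(100, 5))  # stand-in for 100 samples of 5 real variables

def make_probe(X, rng, noise_level=0.5):
    """Build one probe: a random linear function of the real variables plus noise.
    By construction it is an effect of real variables, never a cause of the target."""
    w = rng.normal(size=X.shape[1])  # random mixing weights
    return X @ w + noise_level * rng.normal(size=X.shape[0])

probe = make_probe(X_real, rng)
X_with_probe = np.column_stack([X_real, probe])  # probe appended as a 6th column
```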

Which variables do you manipulate?
In artificial or "re-simulated" data, we manipulate a subset of variables (not the target). In some cases we tell you which ones and in some cases we don’t. Your strategy to make the best predictions may therefore change depending on the situation. For real data with probes, we manipulate the probes, not the real variables. We do not tell you which variables are probes. You only know that the probes are NOT causes of the target. A possible strategy is to use only variables that are causes of the target for making predictions, and to exclude all non-causes.

In real data with probes you do not manipulate causes of the target. Is this realistic?
The intent is not to simulate realistic manipulations in this case. The probes are instruments to evaluate the effectiveness of causal discovery when only observational data are available; they are not emulating real variables. We do not disturb the original data when we add probes; the probes act as "distractors" for the causal discovery algorithms and will allow us to determine the fraction of false positives. To make a connection between causation and prediction, in manipulated data, we manipulate the probes; the other variables retain their original distribution.

It takes years to validate a genomic model, how plausible is the REGED data?
The REGED model is inspired by a real task and trained on real data, but we do not make claims that it is biologically plausible. We prefer using data generated from a task emulating a real problem rather than purely artificial data. The benefits of using a data generative model are:
  • the possibility of generating sufficiently large test sets to obtain statistically significant results,
  • the knowledge of the model structure, which provides truth values of causal relationships.
Is it possible to learn causality without manipulation or experimentation, and if yes, can it be done without Bayesian networks?
Yes, it is possible. There is a large literature on learning causal relationships from "observational data". Bayesian networks are one approach, and several methods to learn the network structure exist. See our tutorial section for introductory papers. Causal approaches other than Bayesian networks exist, including structural equation modeling.
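To give a flavor of what constraint-based structure learning does with observational data, one basic building block is a conditional independence test. Below is a minimal version using partial correlation on an invented generative story (purely illustrative, not a full algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented story: Z causes both X and Y, so X and Y are marginally
# correlated but independent given Z.
z = rng.normal(size=5000)
x = z + 0.3 * rng.normal(size=5000)
y = z + 0.3 * rng.normal(size=5000)

def partial_corr(x, y, z):
    """Correlation of X and Y after regressing out Z from both."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

marginal = np.corrcoef(x, y)[0, 1]   # strong dependence
conditional = partial_corr(x, y, z)  # near zero: Z screens off X from Y
```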

Learning causality from "observational data" is theoretically impossible in many cases, so aren't we trying to solve an unsolvable problem?
You just need to do the best you can. It is true that ambiguities can remain, which must eventually be resolved by experimentation (when this can be done).
But in many real applications, all we have is a limited amount of "observational data". Yet we need to infer causality from this alone, to the extent possible.

You are not asking us about causes and consequences, so in what sense is this a causality challenge?
The challenge explores the usefulness of causal models by putting them to work. Allegedly, the ability to predict the consequences of actions is what distinguishes causal modeling from other types of approaches. The challenge investigates this issue.

Why should "causal models" perform better than usual predictive models on these tasks?
The knowledge of the causal relationships of the variables to the target, inferred for instance using causal discovery algorithms, can be instrumental in selecting the right variables, but then any powerful enough predictive model may be used to predict the target values. In the LUCAS toy example, we explain how some manipulations "disconnect" variables from the target. Only direct causes never get disconnected as a result of manipulations. Hence the problem is not really the choice of the predictive model, but rather a problem of variable/feature selection.

Your datasets are not time series. Isn't the notion of time an inherent part of causality?
Time is relevant to causality because "causes always precede their effects". In the challenge data, even though time is not explicitly provided, some variables may be causally related. For instance, in the LUCAS example, if "smoking" is a cause of "lung cancer" it must precede it, even though we do not say exactly when.

Can your systems have hidden variables?
In any practical setting it is not possible to record all possible variables influencing a given target variable. Some of our datasets may include hidden variables.

Can your systems have feed-back loops?
Natural systems often have feed-back loops. We do not exclude the possibility of feed-back loops in the real data we provide. The artificial systems do not have feed-back loops.

Are the artificial variables generated by Bayesian networks?
Not necessarily...

Are there non-linear dependencies between variables?
This is very likely.

Are there non-Gaussian variable distributions?

Are the dependencies between features and the target the same in the training and test datasets?
No. The system under study, which generated the unmanipulated data, remains the same. The dependencies remain exactly the same between the training set and the unmanipulated test set (e.g. REGED0, SIDO0, LUCAS0, LUCAP0). But in the manipulated test sets, some links between variables are broken by manipulations. See the LUCAS example. What always remains unchanged are the dependencies between the target and its direct causes, because we never manipulate the target and we do not manipulate hidden variables.

Are all the training sets the same for the datasets whose names differ only in the last digit?
Yes. For a given task, the training set is always the same; only the test sets differ (e.g. REGED0, REGED1, and REGED2 share one training set).

Can domain knowledge be used to facilitate solving the tasks of the challenge?
We purposely did not disclose the identity of the features, which may be available in real applications, so that causal models would not be hand-crafted or biased by human knowledge of the feature semantics. We provide information on the datasets to make things more concrete and motivate participation, but we do not expect the participants to tailor their algorithms to the specifics of the domain. However, the information we provide on the datasets may be used in model design, including the list of manipulated features (when we provide it) and partial information on the structure of the data-generating process, such as the absence of consequences of the target, or the fact that all probes are non-causes of the target.

Should we use the same predictive model on the 3 test set variants of a given task?
Probably not! The training set is the same, but different strategies must be applied for unmanipulated and manipulated test sets. In some cases, we tell you which variables are manipulated (REGED1); this should give you a useful hint about which variables will remain predictive. For the probe methods, when the probes are manipulated, they cannot be predictive anymore, since all probes are non-causes of the target.

Is this challenge related to the "distribution shift problem"?
In a way it is, because the training set is distributed differently from the test set, when manipulations are performed. However, the setting of this challenge is rather different from that of a "distribution shift" challenge. To solve the distribution shift problem, the participants are usually expected to learn the distribution changes from the "unlabeled test set" (the test set deprived of the values of the target variable). In this challenge, we explicitly forbid this. The participants should build their model from training data only and then use it to make predictions of the target variable for each test example independently. This rule will be enforced.

Why do you forbid "learning from unlabeled data", or using "transduction"?
There are conceptual and practical reasons:
  • The conceptual reason is that we are investigating problems in which only "observational" training data are available for model building and asking the question: "what if we did this and that manipulation in the future"? Therefore, test data are not supposed to be available at model building time; we use them only to test the ability of our model to make predictions about the effect of hypothetical actions performed on the system in the future.
  • The practical reason is that, in a challenge, we need very large test sets to obtain small error bars on the participants' performances; otherwise most differences between algorithms would not be statistically significant. However, such large amounts of "manipulated" test data would not be available all at once in many real-world situations.
This rule will be enforced.
Are we allowed to compare the distributions of variables in REGED2 with the ones that we learned in the training phase? Or are we only allowed to use the test examples one by one without estimating their distributions?
You are not allowed to compute any statistic from the test data. You must use samples one by one to make predictions on the test set. This rule will be enforced.

Where is the borderline between using test data for prediction and model training on test data, especially with Bayesian networks?
The test data is supposed to be unknown to you when you build your model. It represents future "hypothetical" data you have never seen. So the model (structure and parameters) should be obtained by training on training data, possibly using information about the type of manipulations performed on test data. Predictions are then made for each test example individually (rather than taken jointly).
This rule will be enforced.
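A sketch of what compliance with this rule looks like: all model parameters come from training data, and each test example is scored on its own. The toy nearest-centroid model and the data below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 10))
y_train = (X_train[:, 0] > 0).astype(int)  # toy target
X_test = rng.normal(size=(50, 10)) + 0.3   # possibly shifted test distribution

# Model parameters are estimated from training data only.
mu0 = X_train[y_train == 0].mean(axis=0)
mu1 = X_train[y_train == 1].mean(axis=0)

def predict_one(x):
    """Discriminant value for a single example; no test-set statistic is used."""
    return float(np.dot(x, mu1 - mu0))

scores = [predict_one(x) for x in X_test]  # each test example scored independently
```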

Couldn't people get a conscious or subconscious advantage by just "looking" at the test data?
We have generated the data in such a way that it is not obvious to determine which variables are manipulated by just "looking" at the test data. However, we urge you not to visualize the test data and/or compute any kind of statistics on it, to avoid biasing your results and risking significant differences in performance in the post-challenge tests.

If we are not allowed to identify the manipulated distributions, can we still find out which variables were manipulated? We need some information to eventually discard manipulated variables.
For the datasets REGED1 and MARTI1, the list of manipulated variables is disclosed. For the others it is not. Hence the only variables that you can be sure will affect the target in manipulated data are the direct causes. See the next question for more details.

How can I make predictions without changing my model if the distribution changes?
Even though you may not use test data to adapt your model, we expect that you will build a model that takes into account the distribution changes.
A possible strategy is to:
  • uncover structural (causal) relationships between variables using training data
  • use disclosed information about the nature of the manipulations performed (which changes from dataset to dataset) to select a subset of variables to be included in your predictive model.
Suppose that the training data suggests a given causal graph, here are a couple of cases, which may arise:
  • If all the variables are manipulated in test data (except the target): only the direct causes will be predictive of the target. So, you may want to include only those in your model, because the others can only introduce noise in the predictions.
  • If only a subset of variables are manipulated, but we do not tell you which ones: a possible strategy is still to rely only on direct causes, or perhaps emphasize more direct causes.
  • If only a subset of variables are manipulated, and we tell you which ones: you can infer from the causal graph which variables are no longer predictive because of the manipulation. For example a manipulated consequence of the target variable is no longer predictive of that target.
  • In the case of manipulated "probe" variables (artificial variables added to real data, which are non-causes of the target; we manipulate all the probes but do not tell you which variables are probes): a possible strategy is to avoid using variables that are non-causes of the target, so as to avoid including probes in the model.
These are only examples, other strategies are possible and may be more efficient.
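The cases above can be sketched on a toy causal graph. The graph below is given as child -> list of parents, with made-up variable names that do not correspond to any challenge dataset:

```python
# Hypothetical causal graph: child -> list of parents (names are invented).
parents = {
    "target": ["cause1", "cause2"],
    "effect1": ["target"],
    "cause1": ["indirect_cause"],
}

def predictive_vars(graph, target, manipulated=None):
    """Worst case (manipulated set unknown): keep only direct causes.
    Otherwise, also keep direct consequences that were not manipulated."""
    causes = set(graph.get(target, []))
    if manipulated is None:
        return causes
    consequences = {v for v, ps in graph.items() if target in ps}
    return causes | (consequences - manipulated)

# Unknown manipulations: fall back on direct causes only.
assert predictive_vars(parents, "target") == {"cause1", "cause2"}
# Known manipulations: a manipulated consequence is no longer predictive.
assert predictive_vars(parents, "target", manipulated={"effect1"}) == {"cause1", "cause2"}
# An unmanipulated consequence remains predictive.
assert "effect1" in predictive_vars(parents, "target", manipulated=set())
```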

Can I nevertheless use the test data to learn the distribution shift?
No. If you have made entries that by mistake violate this rule, please let us know; they will not count towards the challenge.

Why do you not always disclose the set of manipulated variables?
For 2 reasons:
  • In some applications, you do not know which variables will be manipulated. For example, if you administer a new drug to a patient, some genes will be turned on/off, but you do not necessarily know which ones.
  • For the real data in which we use artificial probes, we manipulate all the artificial probes, so giving away their identification would give away the solution to the problem.
What is a Markov blanket of the target?
A Markov blanket of the target (called MB) is a sufficient set of variables such that all other variables are independent of the target, given MB. A minimal Markov blanket is called a Markov boundary. Under some conditions, the Markov boundary is unique.
Many people include the minimality restriction in the definition of Markov blankets, thereby identifying the Markov blanket with the Markov boundary; this is what we do in our examples and instructions.

Can causality and/or Markov blanket be defined without using Bayesian networks?
Absolutely. We just gave above a definition of the MB that does not refer to Bayesian networks. The notion of manipulation provides a way of assessing causal relationships (if not really defining them) that does not refer to Bayesian networks either. We use Bayesian networks in our LUCAS example because the language of Bayesian networks is simple and easy to understand. For instance, in the language of causal Bayesian networks, under some conditions known as the causal Markov condition and the causal faithfulness condition, the causal Markov blanket is unique and coincides with the set of parents (direct causes), children (direct effects), and spouses of the target. The set of parents, children, and spouses is sometimes taken as the definition of the Markov blanket.
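The parents-children-spouses characterization can be read directly off a graph. A small sketch on an invented graph (child -> list of parents; the names are loosely LUCAS-flavored but do not reproduce the actual LUCAS structure):

```python
graph = {  # invented edges: child -> list of parents
    "lung_cancer": ["smoking", "genetics"],
    "coughing": ["lung_cancer", "allergy"],
    "smoking": ["anxiety"],
}

def markov_blanket(graph, target):
    pa = set(graph.get(target, []))                                # parents (direct causes)
    ch = {v for v, ps in graph.items() if target in ps}            # children (direct effects)
    sp = {p for c in ch for p in graph.get(c, []) if p != target}  # spouses (co-parents)
    return pa | ch | sp

mb = markov_blanket(graph, "lung_cancer")
# {'smoking', 'genetics', 'coughing', 'allergy'}; 'anxiety' is screened off
```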

Why is the Markov blanket different in train and test data?
The Markov Blanket for the training set or "Discovery" data  (MBD) may differ from the Markov Blanket for the "Test" set (MBT), when test data are manipulated, because manipulations "disconnect" variables from  their direct "natural" causes. Hence the causal graph changes. For instance, a direct consequence of the target, which is manipulated in test data, is in the MBD, but not in the MBT. The MBT is always a subset of the MBD.

How can you make use of the MB found by an algorithm on training data when what you need is the MB for test data?
Knowing the MBD and the manipulations performed is sufficient to deduce the MBT: manipulated direct causes remain in the MB, while manipulated direct consequences and spouses are removed. If only partial information or no information is known about the manipulations, a worst-case scenario may be adopted, e.g. using only direct causes for making predictions.
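This deduction can be sketched as a set operation. The MBD is assumed to be already partitioned into direct causes, direct effects, and spouses (the variable names are invented):

```python
def mbt_from_mbd(causes, effects, spouses, manipulated):
    """Manipulated causes stay; manipulated effects and spouses drop out.
    (Simplification: spouses whose common child is manipulated should drop too.)"""
    return causes | (effects - manipulated) | (spouses - manipulated)

mbt = mbt_from_mbd(
    causes={"c1", "c2"}, effects={"e1", "e2"}, spouses={"s1"},
    manipulated={"e2", "s1"},
)
# mbt == {'c1', 'c2', 'e1'}: a subset of the MBD, as stated above
```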

I do not understand why you call an importance-sorted list of variables "causal discovery". Do you mean the variables forming the Markov blanket should be ranked as high as possible in the list?
This could be a reasonable choice. The MB variables of the manipulated test distribution (MBT) should be the most predictive variables and logically should be ranked the highest. However, for various reasons, including the fact that there is statistical uncertainty with a finite sample training set, increased predictivity could be gained by adding more variables. We give to participants the flexibility to rank variables in order of preference rather than returning a single subset. We are aware that most existing algorithms are not returning ranked lists. Yet such ranked lists would be practically very useful, e.g. to plan experiments.

Why can’t one completely forget about causality, and just find the most predictive features using any feature-ranking classifier, shouldn't Markov blanket variables be at the top of the importance list in most cases?
For the datasets ending with the digit "0" (unmanipulated test sets), this is certainly a valid approach. For the other datasets (manipulated test sets), better performance would probably be achieved with knowledge of the causal direction, in order to obtain the MBT (or get as close to it as possible). As explained previously, the MBD is learned from training data, but the MBT should be used to make predictions. For manipulated test data, the MBT is a subset of the MBD. The variables to be removed depend on the manipulation performed.

How do you find the number of “predictive” features (Fnum) from the sorted lists? Are both slist and ulist compulsory?
We do not require that participants tell us which subset is best when they provide a slist and multi-column prediction result files. The Fnum is the number of features corresponding to the best prediction performances. This gives an advantage to people who return ranked lists vs. people who return a single subset with a ulist. In this way we encourage people to return a slist and multi-column predictions, which will give us richer information to analyze. In real applications, people would also have to solve the model selection problem of determining which subset is best. But we do not ask the participants to do that in this challenge. There are already many difficult problems to solve in this challenge; this is one they do not have to worry about.

How do you process multi-column prediction file format, will you select best column (in terms of AUC)? The multicolumn format will obviously score higher than the single column format, so is it compulsory for final entries?
Yes, we will take the best scoring column. No we do not make the multicolumn format mandatory for final entries. The fact that it gives the opportunity of getting better results is enough of an incentive for the participants to use it.
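The scoring of a multi-column prediction file can thus be sketched as taking the maximum column-wise AUC. The tiny pairwise AUC implementation below is illustrative, with made-up labels and predictions:

```python
import numpy as np

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    comparisons = (pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :])
    return comparisons.mean()

y = np.array([1, 0, 1, 0, 1])
predictions = np.array([          # one column per nested feature subset
    [0.9, 0.1, 0.8, 0.4, 0.7],    # column 1: perfect separation
    [0.6, 0.5, 0.4, 0.3, 0.9],    # column 2: one inversion
]).T
best = max(auc(y, predictions[:, j]) for j in range(predictions.shape[1]))
# best == 1.0: the best-scoring column is the one retained
```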

Scoring method : There are several scores, which one will be used for final ranking or what combination of them?
The Tscore will be used for final ranking and it is the AUC for the datasets presently provided on the platform. The Dscore, Fscore and Fnum are just given for information. We will compute other diagnostics of causal discovery during the analysis and see how they correlate with the Tscore. But the participants will be judged only on test data target value predictions (Tscore). We designed the tasks such that the participants should not get good results if they do not use some notion of causality to select the predictive variables. Of course, you may prove us wrong!

How are the “Overall” results calculated? What is the meaning of “Mean” column?
On the overall page, we show only the test-set prediction scores (Tscore), and the mean is the average over all test sets (e.g. REGED0, REGED1, REGED2). Note that all training sets are the same for the datasets that differ only in the last digit.

Why do you use the AUC to compute the Tscore and not the BAC?
Using the BAC puts another layer of difficulty on the participants: estimating the bias value. It should not make a lot of difference, it is just a bit easier for the participants that way. If we use the AUC, we also give an incentive to participants to return discriminant values, from which we can compute the whole ROC curve. This will make our analysis of the results more interesting.

Since you are using the AUC to compute the Tscore do we still need to adjust the bias on the scores?
No, you do not need to.

What is the Fscore and why do you not use it for ranking participants?
For artificially generated data from a known causal model, the Fscore uses the true Markov blanket of the data-generating model as reference "good features". The Fscore is then the area under the ROC curve for the classification between MB features and non-MB features. To perform this AUC calculation, if you return a ulist, the elements of the list are interpreted as being classified as MB and the others as non-MB. If you provide a slist, we interpret the feature rank as a classification score, the first features being most confidently classified as MB and the last ones most confidently classified as non-MB (if some features are not included in the list, they are arbitrarily given the same highest rank).
For real data to which artificial "probe" variables are added, the MB is only partially known, because only the relationships of the probes to that target are known. We use the set of probes not belonging to the MB as a "negative class" or "not-so-good features" and all the other variables as "positive class", then compute the Fscore in a similar way as explained above.
The Fscore is an imperfect evaluation score to assess the feature selection algorithms, particularly in the case of real data for which the MB is not known. We compute the Fscore (and several other scores not shown on the result page) to analyze the mechanisms by which causal or non-causal algorithms select features. We anticipate that the Fscore should be correlated with the Tscore to some extent.
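For the slist case, the rank-as-score computation could look like this. This is our own sketch: we interpret "unlisted features get the same highest rank" as all unlisted features sharing one score past the end of the list, and the feature names are made up:

```python
def fscore(slist, mb_features, all_features):
    """AUC separating MB from non-MB features, using slist rank as the score."""
    worst = len(slist)  # unlisted features all share one rank past the end
    rank = {f: i for i, f in enumerate(slist)}
    score = {f: -rank.get(f, worst) for f in all_features}  # higher = more MB-like
    pos = [score[f] for f in all_features if f in mb_features]
    neg = [score[f] for f in all_features if f not in mb_features]
    pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
    return sum(pairs) / len(pairs)

features = ["f1", "f2", "f3", "f4"]
perfect = fscore(["f1", "f3", "f2", "f4"], {"f1", "f3"}, features)  # MB ranked first
# perfect == 1.0
```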

I do not understand what are "good" and "not-so-good" features. Do we have: "not-so-good = probes = negative class of features"?
No, this is only the case for the manipulated distributions in real data with added probes (e.g. SIDO1, SIDO2, LUCAP1, LUCAP2). We use this distinction between "good" and "not-so-good" features in the dataset generation and scoring page, to explain the Fscore. The answer to this question is given in the answer to the previous question, but here are some more details.

For artificial or "re-simulated" data (for which the MB is known):
  • For the unmanipulated test set, good = MBD (MB for training or "Discovery" data), not-so-good = others. This does not mean that the variables in the not-so-good set are bad or not predictive. We just drew the line between the subset of variables that is "theoretically" sufficient to obtain optimal predictions and the other features (but in fact this gold standard is not perfect for many reasons; it is just the best we can do with the theory).
  • For the manipulated test sets, we use good = MBT (MB for Test data), not-so-good = others. See the LUCAS example to understand what that means.
For the real data with "probes":
  • For the unmanipulated test set, not-so-good = non-MBD probes (i.e. all variables, which we know for sure are NOT in the Markov blanket, namely all probes that are neither direct consequences nor spouses of the target),  good = others (including all real variables and probes that are direct consequences or spouses of the target).
  • For the manipulated test set, not-so-good = probes (i.e. all variables that we know for sure are not in the Markov blanket of the manipulated distribution), good = others = real variables.
Note that in the latter case, the problem is not that of separating real variables from probes, it is that of separating causes from non-causes. The probes are there as artificial examples of non-causes to assess the fraction of features falsely selected as causes of the target. Even though we made a significant effort to disguise the probes and make them look like real variables, it may be possible to identify which variables are probes, particularly from the test data distribution (but we forbid that). If you think it is trivial to separate the real variables from the probes even from training data or if you see other flaws in the probe method, please let us know.

What is the difference between the two manipulated datasets?
It depends on the datasets:
  • For the "re-simulated" dataset (REGED), we have two different scenarios:
    1. In one case we imagine that we have a controllable system and can act upon given variables directly. For instance, you can turn off the heating system in a building on week-ends to save energy, with potential other side effects, which you would like to predict before making it a new policy. The on-off button is under your control. For REGED1, we simulated turning on and off some genes (we assume this can be done) and let you know which ones.
    2. In the other case, we imagine that we know some interventions are going to be performed, but we do not know which variables will be affected. For instance, in our genomic problem, if a drug is administered, some pathways may be affected, resulting in a change in the distribution of the gene expression coefficients, but we do not necessarily know which variables (gene expression coefficients) will be affected. We simulated this situation in REGED2.
  • For the real data with probes (SIDO and CINA), we manipulate all the probes in both cases, but in two different ways.
Isn't it unrealistic to manipulate all the probes?
Manipulations are not meant to simulate a real situation. They are an "instrument" we use to statistically measure how well causal discovery algorithms are performing, by tying causation and prediction. Participants need to assume that any variable that is not a cause of the target might be a probe. We think that a reasonable strategy would be to build a predictor only from causes (direct or indirect), since any non-cause might be a probe. In fact, the fraction of probes in the set of variables called "causes" can be used as an indicator of the false positive rate. We intend to use this kind of statistic to perform significance testing of the variables called "causes", and to correlate such results with the correctness of target value predictions, to quantify the relationship between causation and prediction.

With the probe method, isn't it possible that an optimal learner having discovered the true causes could be beaten by a sub-optimal method that selects a certain fraction of predictive consequences?
Yes, it is possible. However, in the case of a large number of non-causes and probes, it is unlikely that a method (which does not cheat by trying to guess which variables are probes) would select predictive non-causes without also selecting probes. It suffices that an algorithm selects a certain fraction of probes to counterbalance the positive bias of (wrongly) selecting predictive non-causes.

If the generation process and ranking criteria are different for artificial and real+probe datasets, are participants supposed to use different methods for learning on artificial and real+probe datasets? Or are they obliged to use one method for all datasets?
The challenge participants are free to use different methods on the different datasets. The ranking criterion is NOT the Fscore (assessing the feature selection), it is the Tscore (assessing target prediction accuracy), so the same scoring method applies in all cases and we expect the same methods can be successfully used in all cases. The methods will have to take into account, though, whether or not test data are manipulated, and how.

In MARTI, what does it mean that "the noise pattern is different in every training example"?
Using our noise model, we drew a noise pattern for every example and added it to that example. When the features are arranged in a 2d 32x32 array (as explained in the documentation), the noise pattern has a smooth structure (neighboring coefficients have similar values). This is a kind of low-frequency background. A different noise template is added to each example, but all templates are drawn from the same noise model. If you visualize the training examples after rearranging them as a 32x32 array, you will see this right away. For each feature, the expected value of the noise is zero, but the noise of two neighboring features is correlated. We show below examples of noise patterns (positive values in red and negative values in green).

[Figure: two example noise patterns]
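One illustrative way to draw such smooth templates (not the actual MARTI noise model) is to low-pass filter white noise so that neighboring coefficients become correlated:

```python
import numpy as np

rng = np.random.default_rng(0)

def smooth_noise(rng, size=32, cutoff=4):
    """One low-frequency size x size noise template: keep only low spatial frequencies."""
    spectrum = np.fft.fft2(rng.normal(size=(size, size)))
    fy = np.fft.fftfreq(size)[:, None] * size  # integer spatial frequencies
    fx = np.fft.fftfreq(size)[None, :] * size
    spectrum[np.sqrt(fx**2 + fy**2) > cutoff] = 0  # zero out high frequencies
    return np.real(np.fft.ifft2(spectrum))

pattern = smooth_noise(rng)  # a fresh template would be drawn per training example
# Neighboring coefficients have similar values, as in the MARTI "background".
```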

In MARTI, what does it mean that 25 "calibrant features" have value zero plus a small amount of Gaussian noise? The averages for every calibrant feature is far from zero.
We have 2 kinds of noise. The calibrants are 0 +- [small Gaussian noise]. Then, on top of that, in training data only, we add the correlated noise model. After we add the correlated noise, because of the small sample size and the large variance, the calibrant values are no longer close to zero (even on average) in training data. However, the median is close to zero on average for almost all calibrants, relative to the signal amplitude: abs(mean(median(X(:,calib))/std(abs(X(:)))))~e-005.
In training data, we get: mean(abs(mean(X)))~e+004 but mean(abs(mean(X(:,calib))))~5e+003. In test data, because we did not add noise, the calibrant values are close to zero, relatively speaking: mean(abs(mean(X)))~5e+003 but mean(abs(mean(X(:,calib))))~1. The calibrants can be used to preprocess the training data by subtracting a bias value after the low-frequency noise is removed, so that the calibrant values are zero after preprocessing the training data.
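The calibrant-based bias subtraction suggested above can be sketched as follows. This is an illustrative helper, not an official recipe: the organizers do not prescribe how to preprocess, and the use of the per-example median is our choice.

```python
import numpy as np

def subtract_calibrant_bias(X, calib_idx):
    """Zero out the per-example offset estimated from the calibrant features.

    X         : (n_examples, n_features) array, after low-frequency noise removal
    calib_idx : indices of the 25 calibrant features (listed in the documentation)
    """
    # The calibrants are zero before noise, so their median estimates the bias.
    bias = np.median(X[:, calib_idx], axis=1, keepdims=True)  # one offset per example
    return X - bias
```

After this step the calibrant values are approximately zero in the training data, matching the property they already have in the (noise-free) test data.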

REGED and MARTI do not look like regular microarray data. What kind of normalization did you do?
REGED was obtained by fitting a model to real microarray data. REGED features were shifted and rescaled individually, then rounded to integers spanning the range 0:999. MARTI was obtained from data generated by the same model as REGED, without rescaling features individually. For MARTI, a particular type of correlated noise was added. The data were then scaled and quantized globally so the features span -999999:999999.
We chose to make the noise model simple but of high amplitude, so that the noise is easy to filter out but hard to ignore. If you think of the spots on a microarray as an image (MARTI patterns are 32x32 "images"), the noise in MARTI corresponds to patches of more or less intense values added on top of the original image, representing some kind of slowly varying background. Nowadays, microarray technology has progressed to a point that such heavy backgrounds are not common and occasional contaminated arrays would not pass quality control; furthermore, microarray reading software calibrates and normalizes the data, so you would not see data that "bad". But for new instruments under development, such levels of noise are not uncommon.
MARTI illustrates the fact that if you do not take out correlated noise, the result of causal discovery may be severely impaired. Even though the amplitude of the noise is large, the noise is easy to filter out, using the fact that neighboring spots are affected similarly, and using the spots having constant values before noise is added (calibrants). After noise filtering, the residual noise may still impair causal discovery, so it is your challenge to see what can be done to avoid drawing wrong conclusions in the presence of correlated noise.
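The "neighboring spots are affected similarly" idea can be sketched for a single example: a heavy box blur of the 32x32 image estimates the slowly varying background, which is then subtracted. This is a simplification under our assumptions (any low-pass filter would serve; kernel size and function name are hypothetical).

```python
import numpy as np

def remove_background(example, kernel=9):
    """Estimate the low-frequency background of one 1024-feature MARTI example
    by box-blurring its 32x32 image, then subtract it from the image."""
    img = example.reshape(32, 32).astype(float)
    pad = kernel // 2
    padded = np.pad(img, pad, mode="edge")
    background = np.zeros_like(img)
    for i in range(32):
        for j in range(32):
            # Local average keeps only slowly varying (low-frequency) content.
            background[i, j] = padded[i:i + kernel, j:j + kernel].mean()
    return (img - background).ravel()
```

A calibrant-based bias correction can then be applied to the filtered data, as discussed in the previous answer.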

Why are there no multiclass and regression tasks?
There will eventually be some. We are working on including more datasets.

I do not see my results in the "Overall" table, what's wrong?
An entry must be "complete" to be displayed in the "Overall" table, i.e. it must have results for the 3 test sets 0, 1, 2 for at least one task (for instance REGED0, REGED1, and REGED2). You must upload these results in a single archive, otherwise they count as separate entries.

I see xxxxx in the result tables, is there a problem with the display?
No, during the development period, we show only results for LUCAS and LUCAP, the toy datasets, which are just used as examples and do not count for the competition. The other results will be revealed at the end of the challenge.

I get the message “Error with the contents of the archive, missing compulsory file(s), or no valid file found” but I verified my files are correct, what's wrong?
The server expects lowercase file names; check that your files are named accordingly.
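If your files were created with uppercase names, a small helper can normalize them before archiving. This is an illustrative script, not an official tool.

```python
import os

def lowercase_filenames(directory):
    """Rename every file in `directory` to its lowercase form, since the
    submission server expects lowercase file names."""
    for name in os.listdir(directory):
        lower = name.lower()
        if lower != name:
            os.rename(os.path.join(directory, name),
                      os.path.join(directory, lower))
```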

Do I have to submit results on all datasets?
No. During the development period, you can submit results on any number of datasets. However, for your final entry counting towards the prizes, you must submit at least one set of results for all the datasets having the same name and differing only in the last digit (e.g. SIDO0, SIDO1, and SIDO2). Entries fulfilling this condition show up in the "Overall" result table.

Is there a limit to the number of submissions?
You can make as many submissions as you want (albeit no more than 5 per day in order not to overload our system.) Only the last submission of each entrant will count towards competing for the prizes.

Can I use a robot to make submissions?
Robot submissions are not explicitly forbidden. However, we require that the total number of submissions per 24 hours from the same origin does not exceed 5. Please be courteous; otherwise we run the risk of overloading the server and would need to take more drastic measures.

Can I use an alias or a funky email not to reveal my identity?
To enter the final ranking, we require participants to identify themselves by their real name. You cannot win the competition if you use an alias. However, you may use an alias instead of your real name during the development period, to make development entries that do not include results on test data. You must always provide a valid email. Since the system identifies you by email, please always use the same email. Your email will only be used by the administrators to contact you with information that affects the challenge. It will not be visible to others during the challenge.

Do I need to let you know what my method is?
Disclosing information about your method is optional during the development period. However, to participate in the final ranking, you will have to fill out a fact sheet about your method(s). We encourage participants not only to fill out the fact sheets, but also to write a paper with more details. A best paper award will distinguish entries with particularly original methods, methods with definite advantages (other than best performance), and good experimental design.

Can I make a submission with mixed methods?
Mixed submissions containing results of different methods on the various datasets are permitted. 

How will you enforce the rules of the challenge?
There are two rules, which we will enforce with post challenge tests:

  1. It is forbidden to use test data for training, and predictions must be made independently for each test example.
  2. The prediction results must be obtained with the declared set of features (in particular, result tables must correspond to nested subsets of features provided as a "slist").
The top-ranking participants will be asked to cooperate in reproducing their results, and the outcome of the tests will be published. We will ask them to provide two executables. One "training module" will take training data (in standard format) as input (and optionally a list of variables to be manipulated in test data and/or some hyperparameter values that may differ between the 0, 1, 2 cases) and return a model and a list of features used. One "test module" will use the trained model and produce predictions of the target variable for one test example at a time, restricted to the declared set of features. We will run these tests on the datasets of the challenge and on new versions of those datasets, to detect possible significant differences. If the results cannot be reproduced, this will shed doubt on their validity. The organizers reserve the right to disqualify entries that do not pass the tests.

Would it be O.K. to submit predictions made using the same form of model (e.g. ridge regression) but with several different feature selection algorithms?
We want a single submission per task; the only degree of freedom for providing multiple prediction results is reporting results on nested subsets of features. However, you can generate the feature order any way you want (e.g. you can put first the direct causes, then the other members of the MB, then the most correlated features, then all other features). The ordering can come out of a hybrid strategy combining the results of several feature selection and causal discovery techniques. You can then form nested subsets, train a classifier on each subset, and turn in the results as a table. To make things comparable, we ask participants to vary the subset sizes by powers of 2. This does not necessarily cut at an optimum point, but it facilitates comparisons between methods.
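The powers-of-2 schedule can be sketched as follows. Note that including the full feature set as the last subset is our reading of the protocol, and the function name is hypothetical.

```python
def nested_subset_sizes(n_features):
    """Subset sizes 1, 2, 4, ... doubling up to n_features (included)."""
    sizes, k = [], 1
    while k < n_features:
        sizes.append(k)
        k *= 2
    sizes.append(n_features)
    return sizes

# Given a feature ranking (best first), the nested subsets are its prefixes:
ranking = list(range(999))  # e.g. a ranking of REGED's 999 features
subsets = [ranking[:k] for k in nested_subset_sizes(len(ranking))]
```

Each subset would then be used to train one classifier, and the per-subset predictions are turned in as one result table.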

Can I participate in the competition if I cannot attend the workshop?
Yes. You do not need to attend the workshop to enter the challenge.

Can I attend the workshop if I do not participate in the challenge?
Yes. You can even submit a paper for presentation on the themes of the workshop.

Why do you not have a validation set?
In past challenges, we used to give feedback during the development period on a validation set. Then we disclosed the target values on that set and used a different "final" test set for ranking participants. We adopted a different setting in this challenge because disclosing validation targets would reveal information about the test set distribution, which we do not wish to reveal.
Instead, we give partial feed-back directly on the test set, via "quartile" information.

What motivates the proportion of the data split?
The size of the training set corresponds to realistic amounts of training data available in real applications, yet the data are sufficient to uncover causal relationships at least to a certain extent. The test set was made large enough to get small error bars on the prediction error for the methods we have tried.

Will the organizers enter the competition?
The prize winners may not be challenge organizers. The challenge organizers will enter development submissions from time to time, under the name "Reference". Reference entries are shown for information only and are not part of the competition.

Can a participant give an arbitrarily hard time to the organizers?


Who can I ask for more help?
For all other questions, email

Last updated April 3, 2008.