Dataset generation

Types of datasets:

  • Purely artificial data: The data were generated by an artificial stochastic process in which the target variable is an explicit function of some of the variables, called "causes", and of other hidden variables (noise). We resort to purely artificial data to illustrate particular technical difficulties inherent to some causal models, e.g. problems of causal sufficiency (not all causes are given in the variable set), Markov equivalence (several causal graphs are consistent with the same data) or causal faithfulness (no causal graph can satisfactorily represent the data). Truth values of causal relationships are known for the data-generating model and will be used for scoring your causal discovery results. A toy example of such a generating process is sketched after this list.
  • Re-simulated data: We have trained a causal model (such as a causal Bayesian network or a structural equation model) with real data. The model was then used to generate artificial training and test data for the challenge. Truth values of causal relationships are known for the data-generating model and will be used for scoring your causal discovery results. REGED is an example of a re-simulated dataset.
  • Real data with probe variables: We are using a dataset of real samples for which you are given real values of a number of variables. Some of these variables may be causally related to the target and some may be predictive but non-causal. The nature of the causal relationships of the variables to the target is unknown to us (although domain knowledge may allow us to validate the discoveries to some extent). We have added to the set of real variables a number of distracter variables called "probes", which are generated by an artificial stochastic process involving explicit functions of some of the real variables, other artificial variables, and/or the target. All probes are non-causes of the target; some are completely unrelated to the target. The identity of the probes is concealed. The fact that truth values of causal relationships are known only for the probes affects our scoring method.
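
For concreteness, here is a minimal sketch of such a data-generating process (a toy Python example; the variable names, functional forms, noise levels, and linear mechanisms are illustrative assumptions, not the generators actually used in the challenge). The target is an explicit function of its causes plus noise, and the probes are built as functions of real variables and/or the target, so that by construction none of them causes the target:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000                                 # number of samples (illustrative)

    # "Real" variables, sampled from their natural distribution.
    x1 = rng.normal(size=n)                  # a cause of the target
    x2 = rng.normal(size=n)                  # another cause of the target
    x3 = rng.normal(size=n)                  # unrelated to the target

    # Target: an explicit function of its causes plus hidden noise.
    target = 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

    # Probe variables: all non-causes of the target by construction.
    probe_effect    = 0.8 * target + rng.normal(scale=0.3, size=n)  # consequence of the target
    probe_relative  = 1.2 * x1 + rng.normal(scale=0.3, size=n)      # related to the target via x1
    probe_unrelated = rng.normal(size=n)                            # completely unrelated

    # Participants would receive all columns; which ones are probes is concealed.
    data = np.column_stack([x1, x2, x3, probe_effect, probe_relative, probe_unrelated])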

Generation of training and test data, manipulations:

  • Training data: Training data are generated from a so-called "natural distribution" or "unmanipulated distribution", i.e. the variable values are sampled from the system when it is left to evolve according to its own dynamics, after it has settled in a steady state. For the probe method, the system includes the artificial probe-generating mechanism.
  • Test data: Test data are generated from a so-called "manipulated distribution". An external agent performs an "intervention" on the system. Depending on the problem at hand, interventions can take several forms:
    • Clamping one or several variables to given values, then sampling the other variables from the natural distribution of the system.
    • Randomizing the values of given variables, i.e. sampling them from a distribution chosen by the external agent, which is not governed by the system under study, then sampling the other variables from the natural distribution of the system.
    • For the probe method, since we do not have the possibility of manipulating real variables, we only manipulate the probes. We actually manipulate all the probes in every sample.
The effect of manipulations is to disconnect the manipulated variables from their natural causes. Manipulations allow us to influence the target if we manipulate causes of the target; manipulating non-causes should have no effect on the target. Without inferring causal relationships, it should be more difficult to make accurate predictions under the manipulated distribution, as sketched below.
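
The following sketch reuses the toy system above (with the same caveat that all mechanisms and values are illustrative assumptions) to show how manipulated test data could be drawn: clamping or randomizing a variable replaces its natural mechanism with one imposed by the external agent, while the rest of the system is still sampled from its own mechanisms. Because x1 is a cause of the target, manipulating it shifts the target's distribution, whereas manipulating a probe would not:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1000

    def sample_system(n, rng, x1=None):
        """Sample the toy system; if x1 is given, its natural mechanism is overridden."""
        if x1 is None:
            x1 = rng.normal(size=n)                  # natural mechanism for x1
        x2 = rng.normal(size=n)
        target = 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)
        probe_effect = 0.8 * target + rng.normal(scale=0.3, size=n)
        return x1, x2, target, probe_effect

    # Clamping: x1 is fixed to a value chosen by the external agent.
    clamped = sample_system(n, rng, x1=np.full(n, 3.0))

    # Randomizing: x1 is drawn from a distribution chosen by the agent,
    # which is not governed by the system under study.
    randomized = sample_system(n, rng, x1=rng.uniform(-5.0, 5.0, size=n))

    # In both cases x1 is disconnected from its natural causes. Since x1 causes
    # the target, the target's distribution changes; clamping or randomizing a
    # non-cause such as probe_effect would leave the target's distribution unchanged.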