Dataset generation
Types of datasets:
- Purely artificial data: The data were generated by
an artificial stochastic process for which the target variable is an explicit
function of some of the variables called "causes" and other hidden variables
(noise). We resort to using purely artificial data for the purpose of illustrating
particular technical difficulties inherent to some causal models, e.g.
problems of causal sufficiency (not all causes are given in the variable
set), Markov equivalence (several causal graphs are consistent with the
same data) or causal faithfulness (no causal graph can satisfactorily represent
the data). Truth values of causal relationships are known for the data generating
model and will be used for scoring your causal discovery results.
- Re-simulated data: We have trained a causal model
(such as a causal Bayesian network or a structural equation model) with
real data. The model was then used to generate artificial training and
test data for the challenge. Truth values of causal relationships
are known for the data generating model and will be used for scoring your
causal discovery results. REGED is an example of a re-simulated dataset.
- Real data with probe variables: We are using a dataset
of real samples for which you are given real values of a number of variables.
Some of these variables may be causally related to the target and some
may be predictive but non-causal. The nature of the causal relationships
of the variables to the target is unknown to us (although domain knowledge
may allow us to validate the discoveries to some extent). We have added
to the set of real variables a number of distracter variables called "probes",
which are generated by an artificial stochastic process, including explicit
functions of some of the real variables, other artificial variables, and/or
the target. All probes are non-causes of the target; some are completely
unrelated to the target. The identity of the probes is concealed.
The fact that truth values of causal relationships are known only for the
probes affects our scoring method.
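The probe mechanism described above can be sketched as follows. This is a
minimal illustration and not the generator used for the challenge: the function
name add_probes, the three probe recipes (a noisy function of the target, a
noisy function of one real variable, pure noise) and all coefficients are
assumptions chosen for the example.

```python
import numpy as np

def add_probes(X, y, n_probes, rng):
    """Append artificial probe columns to real data X and hide their positions.

    Hypothetical recipe: probes cycle through three kinds, none of which
    is a cause of the target y.
    """
    n_samples, n_real = X.shape
    probes = np.empty((n_samples, n_probes))
    for j in range(n_probes):
        noise = rng.normal(size=n_samples)
        if j % 3 == 0:
            # Function of the target: an effect of y, hence predictive but non-causal.
            probes[:, j] = y + noise
        elif j % 3 == 1:
            # Function of one randomly chosen real variable.
            parent = rng.integers(n_real)
            probes[:, j] = 0.5 * X[:, parent] + noise
        else:
            # Pure noise: completely unrelated to the target.
            probes[:, j] = noise
    augmented = np.hstack([X, probes])
    # Shuffle the columns so that participants cannot tell probes from real variables.
    perm = rng.permutation(n_real + n_probes)
    is_probe = perm >= n_real  # ground truth, known only to the organizers
    return augmented[:, perm], is_probe

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                      # stand-in for real variables
y = (X[:, 0] + rng.normal(size=200) > 0).astype(float)
X_aug, is_probe = add_probes(X, y, n_probes=3, rng=rng)
```

In this sketch only the is_probe mask carries known causal truth (every probe
is a non-cause), which mirrors why scoring on real data has to rely on the
probes.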
Generation of training and test data, manipulations:
- Training data: Training data are generated from a
so-called "natural distribution" or "unmanipulated distribution",
i.e. the variable values are sampled from the system when it is left to
evolve according to its own dynamics, after it has settled into a steady state.
For the probe method, the system includes the artificial probe generating
mechanism.
- Test data: Test data are generated from a so-called
"manipulated distribution". An external agent performs
an "intervention" on the system. Depending on the problem at hand, interventions
can be of several kinds:
  - Clamping one or several variables to given values,
    then sampling the other variables from the natural distribution of the system.
  - Randomizing the values of given variables, i.e. sampling
    them from a distribution chosen by the external agent, which is not governed
    by the system under study, then sampling the other variables from the natural
    distribution of the system.
  - For the probe method, since we do not have the possibility
    of manipulating real variables, we only manipulate the probes. We actually
    manipulate all the probes in every sample.
The effect of manipulations is to disconnect the manipulated variables from
their natural causes. Manipulations can influence the target only if we
manipulate causes of the target; manipulating non-causes should have
no effect on the target. Without inferring causal relationships, it should
be more difficult to make predictions for manipulated distributions.
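The difference between the natural and the manipulated distribution can be
illustrated with a toy model. The sketch below assumes a hypothetical
three-variable chain C -> Y -> E (a cause, the target, an effect) with linear
Gaussian mechanisms; the function name sample, the do argument and the
coefficients are inventions for this example and are not taken from the
challenge datasets.

```python
import numpy as np

def sample(n, rng, do=None):
    """Draw n samples from the toy chain C -> Y -> E.

    `do` maps a variable name to a constant (clamping) or to a callable
    (rng, n) -> array (randomizing). A manipulated variable is cut off
    from its natural causes; the others follow their usual mechanisms.
    """
    do = do or {}

    def value(name, natural):
        if name not in do:
            return natural
        setting = do[name]
        if callable(setting):
            return setting(rng, n)
        return np.full(n, float(setting))

    c = value("C", rng.normal(size=n))
    y = value("Y", 2.0 * c + rng.normal(size=n))
    e = value("E", -1.0 * y + rng.normal(size=n))
    return {"C": c, "Y": y, "E": e}

rng = np.random.default_rng(0)
train = sample(10_000, rng)                          # natural distribution
test_cause = sample(10_000, rng, do={"C": 3.0})      # clamp a cause of Y
test_effect = sample(                                # randomize an effect of Y
    10_000, rng, do={"E": lambda r, n: r.uniform(-1.0, 1.0, size=n)}
)

corr_train = np.corrcoef(train["E"], train["Y"])[0, 1]
corr_test = np.corrcoef(test_effect["E"], test_effect["Y"])[0, 1]
```

Under these assumptions, clamping C shifts the mean of Y, whereas randomizing E
leaves the distribution of Y unchanged but removes the strong correlation
between E and Y seen in the training data. A predictor that leans on E would
therefore do well on natural data and poorly on this manipulated test set.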