REGEDP studies the probe method on REGED

REGEDP uses the artificially generated dataset REGED to study the probe method. We assume that the REGED data came from a real, but unknown, generative process. To the 999 variables of REGED we add 3996 "probes". These are artificially generated variables, including randomly generated variables completely independent of the target, and consequences of subsets of the original variables (including the target) and of other probes. Importantly, no probe is a cause of the target. Ideally, the probes should be generated from the (unknown) distribution of non-causes of the target. Instead, we use a method for generating probes that uses permutations of the values of some of the real variables, while enforcing some causal dependencies.
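To illustrate the permutation idea only, here is a minimal sketch of how a single probe independent of the target could be built by shuffling the values of one real variable across examples. It is not the procedure actually used to build REGEDP, which also enforces causal dependencies that this sketch ignores. The function name `permutation_probe` and the `seed` parameter are hypothetical.

```python
import random


def permutation_probe(column, seed=0):
    """Return a shuffled copy of one variable's values (hypothetical helper).

    Shuffling across examples keeps the marginal distribution of the
    variable but breaks its association with the target and with every
    other variable. This covers only the "independent probe" case.
    """
    probe = list(column)                # copy, so the real variable is untouched
    random.Random(seed).shuffle(probe)  # seeded for reproducibility
    return probe


# Placeholder values standing in for one real variable over 6 examples.
real_variable = [3, 1, 4, 1, 5, 9]
probe = permutation_probe(real_variable, seed=42)
```

The probe contains exactly the same values as the real variable, only assigned to different examples.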

Assume that we want to uncover causes of the target variable (lung cancer) and that we use a causal discovery algorithm for that purpose. The fraction of probes selected as candidate causes is an indication of the fraction of false positives. Because in this case we know the true generative model of the data, we can analyze how useful the probe method is, despite the ad hoc way in which the probes are generated.
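The quantity described above can be sketched as follows. The sketch assumes that "fraction of probes selected" means the number of selected probes divided by the total number of probes, that columns are indexed from 0, and that the 999 original variables precede the 3996 probes. The function name `selected_probe_fraction` is hypothetical.

```python
N_ORIGINAL = 999   # original variables (from the description)
N_PROBES = 3996    # appended probes (from the description)


def selected_probe_fraction(selected, n_original=N_ORIGINAL, n_probes=N_PROBES):
    """Fraction of all probes that appear among the selected columns.

    `selected` holds 0-based column indices returned by some causal
    discovery algorithm. Indices >= n_original are taken to be probes
    (assumption: probes are stored after the original variables).
    """
    n_selected_probes = sum(1 for j in set(selected) if j >= n_original)
    return n_selected_probes / n_probes


# Hypothetical output of a causal discovery algorithm:
# two original variables (0, 5) and three probes (999, 1000, 4994).
candidates = [0, 5, 999, 1000, 4994]
fraction = selected_probe_fraction(candidates)
```

Since no probe is a cause of the target, every selected probe is a known false positive, so a larger fraction suggests that more non-causes are being selected among the original variables as well.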

The data include the same 500 training examples as REGED (in the same order). All original variables come first and the probes are appended as extra columns. No test data are provided.
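Given this layout, separating the original variables from the probes is a matter of slicing columns. The sketch below uses placeholder rows instead of the real files, since the file names and parsing details are not given here. It assumes each row holds only the 999 + 3996 variable columns. The function name `split_columns` is hypothetical.

```python
N_ORIGINAL = 999   # original variables come first
N_PROBES = 3996    # probes are appended as extra columns


def split_columns(rows, n_original=N_ORIGINAL):
    """Split each example into (original variables, probes)."""
    originals = [row[:n_original] for row in rows]
    probes = [row[n_original:] for row in rows]
    return originals, probes


# Placeholder data: 3 examples of zeros standing in for the 500 real ones.
rows = [[0] * (N_ORIGINAL + N_PROBES) for _ in range(3)]
originals, probes = split_columns(rows)
```

Because the examples are in the same order as in REGED, the rows of `originals` line up one-to-one with the REGED training examples.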

Download the data in text format [7.8 Mb].
Download the data in Matlab format [8 Mb].