REGED

REGED is a genomics dataset

The goal of REGED (REsimulated Gene Expression Dataset) is to find genes that could be responsible for lung cancer. The data are “re-simulated”, i.e. generated by a model derived from real human lung-cancer microarray gene expression data. From the causal discovery point of view, it is important to separate genes whose activity causes lung cancer from those whose activity is a consequence of the disease.
We propose three tasks: REGED0, REGED1, and REGED2. All three datasets include 999 features and the same 500 training examples, but each has a different test set of 20000 examples. The target variable is binary; it separates malignant samples (adenocarcinoma) from control samples (squamous). The three tasks differ in the test data distribution, which results from various types of manipulations:
REGED0: No manipulation (distribution identical to the training data).
REGED1: The following variables are manipulated:
20, 27, 36, 70, 82, 83, 85, 91, 118, 125, 139, 143, 160, 169, 176, 185, 191, 204, 219, 224, 229, 239, 243, 251, 252, 269, 281, 282, 295, 297, 301, 319, 320, 321, 342, 350, 357, 359, 361, 378, 387, 407, 409, 412, 429, 430, 469, 472, 499, 501, 507, 512, 540, 545, 552, 561, 566, 572, 580, 586, 593, 618, 622, 637, 651, 663, 674, 681, 683, 686, 690, 702, 727, 754, 762, 764, 773, 786, 805, 815, 835, 861, 872, 873, 877, 880, 889, 904, 935, 936, 939, 942, 949, 962, 977, 985, 989, 991, 992, 994.
REGED2: Many variables are manipulated, including all the consequences of the target.
When a manipulation is performed, the values of the manipulated variables are clamped to given values by an "external agent". The values of all other variables are obtained by letting the system evolve according to its own dynamics until it stabilizes.
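
To illustrate why manipulations matter for prediction, here is a minimal Python sketch of a toy chain cause -> target -> effect. It is not the REGED generative model: the three binary variables, the 90% copy probability, and the function names `simulate` and `agreement` are all assumptions made for illustration. In the manipulated setting, the effect is clamped to a value chosen independently of the target, as an external agent would do.

```python
import random


def simulate(n, manipulate_effect=False, seed=0):
    """Draw n samples from a toy chain: cause -> target -> effect.

    Hypothetical toy model, not the model used to generate REGED.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        cause = rng.random() < 0.5
        # The target copies its cause 90% of the time.
        target = cause if rng.random() < 0.9 else not cause
        if manipulate_effect:
            # Clamped by an "external agent": the value ignores the target.
            effect = rng.random() < 0.5
        else:
            # Natural dynamics: the effect copies the target 90% of the time.
            effect = target if rng.random() < 0.9 else not target
        rows.append((cause, target, effect))
    return rows


def agreement(rows, feature_index):
    """Fraction of samples where the feature equals the target (column 1)."""
    return sum(r[feature_index] == r[1] for r in rows) / len(rows)


# Unmanipulated test set (analogous in spirit to REGED0).
natural = simulate(20000, manipulate_effect=False, seed=1)
# Test set in which the consequence of the target is clamped.
manipulated = simulate(20000, manipulate_effect=True, seed=2)

for name, rows in [("natural", natural), ("manipulated", manipulated)]:
    print(name,
          "cause/target agreement:", round(agreement(rows, 0), 3),
          "effect/target agreement:", round(agreement(rows, 2), 3))
```

In this sketch the cause stays informative about the target in both settings, while the clamped effect agrees with the target only at chance level. This is the reason a predictor trained on unmanipulated data may degrade on a manipulated test set if it relies on consequences of the target.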

Download the data.