We provide datasets from various application domains, and we took great care to use
real data. Some datasets encode real data directly, while others result from
"re-simulation", i.e. they were generated by data simulators trained on real data. Datasets whose names differ only in the last digit share the same training set. The digit 0 indicates an unmanipulated test set.
See the details of our data generating process.
Dataset | Description | Test data | Variable type (count) | Target | Training examples | Test examples | Download size (text format) | Download size (Matlab format) |
---|---|---|---|---|---|---|---|---|
REGED0 | Genomics re-simulated data | Not manipulated | Numeric (999) | Binary | 500 | 20000 | 31 MB | 25 MB |
REGED1 | Genomics re-simulated data | Manipulated (see list of manipulated variables) | Numeric (999) | Binary | 500 | 20000 | 31 MB | 25 MB |
REGED2 | Genomics re-simulated data | Manipulated | Numeric (999) | Binary | 500 | 20000 | 31 MB | 25 MB |
SIDO0 | Pharmacology real data w. probes | Not manipulated | Binary (4932) | Binary | 12678 | 10000 | 12 MB | 14 MB |
SIDO1 | Pharmacology real data w. probes | Manipulated | Binary (4932) | Binary | 12678 | 10000 | 12 MB | 14 MB |
SIDO2 | Pharmacology real data w. probes | Manipulated | Binary (4932) | Binary | 12678 | 10000 | 12 MB | 14 MB |
CINA0 | Census real data w. probes | Not manipulated | Mixed (132) | Binary | 16033 | 10000 | 1 MB | 1 MB |
CINA1 | Census real data w. probes | Manipulated | Mixed (132) | Binary | 16033 | 10000 | 1 MB | 1 MB |
CINA2 | Census real data w. probes | Manipulated | Mixed (132) | Binary | 16033 | 10000 | 1 MB | 1 MB |
MARTI0 | Genomics re-simulated data w. noise | Not manipulated | Numeric (1024) | Binary | 500 | 20000 | 47 MB | 35 MB |
MARTI1 | Genomics re-simulated data w. noise | Manipulated | Numeric (1024) | Binary | 500 | 20000 | 47 MB | 35 MB |
MARTI2 | Genomics re-simulated data w. noise | Manipulated | Numeric (1024) | Binary | 500 | 20000 | 47 MB | 35 MB |
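
The naming convention used in the table above (a family name followed by one digit, with 0 marking an unmanipulated test set) can be sketched in Python as follows. This is only an illustration: the helper names `parse_dataset_name` and `share_training_set` are hypothetical and are not part of any code distributed with the challenge.

```python
import re
from typing import NamedTuple


class DatasetName(NamedTuple):
    family: str             # e.g. "REGED"
    version: int            # trailing digit of the dataset name
    test_manipulated: bool  # False only when the digit is 0


# Assumption: a dataset name is a run of letters followed by exactly one digit.
_NAME_RE = re.compile(r"^([A-Za-z]+)([0-9])$")


def parse_dataset_name(name: str) -> DatasetName:
    """Split a name such as 'SIDO1' into its family and trailing digit."""
    match = _NAME_RE.match(name.strip())
    if match is None:
        raise ValueError(f"unexpected dataset name: {name!r}")
    family = match.group(1).upper()
    version = int(match.group(2))
    # Digit zero indicates an unmanipulated test set.
    return DatasetName(family, version, version != 0)


def share_training_set(a: str, b: str) -> bool:
    """Names that differ only in the last digit share the same training set."""
    return parse_dataset_name(a).family == parse_dataset_name(b).family


if __name__ == "__main__":
    for name in ("REGED0", "REGED1", "MARTI2"):
        print(parse_dataset_name(name))
    print(share_training_set("CINA0", "CINA2"))
```

A helper of this kind makes it easy to train once per family and reuse the model on every test set of that family.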
We provide small toy example datasets for practice purposes.
You may submit results on these data on the Submit page, as
you would for the challenge datasets. Your results will appear on the Result page,
giving you immediate feedback. Results on these practice datasets
WILL NOT COUNT as part of the challenge.
You may download all the example data as a single archive (Text format 1.1 MB, Matlab format 1.2 MB), or as individual datasets from the table below.
Dataset | Description | Test data | Variable type (count) | Target | Training examples | Test examples | Download size (text format) | Download size (Matlab format) |
---|---|---|---|---|---|---|---|---|
LUCAS0 | Toy medicine data | Not manipulated | Binary (11) | Binary | 2000 | 10000 | 31 KB | 22 KB |
LUCAS1 | Toy medicine data | Manipulated | Binary (11) | Binary | 2000 | 10000 | 31 KB | 22 KB |
LUCAS2 | Toy medicine data | Manipulated | Binary (11) | Binary | 2000 | 10000 | 31 KB | 22 KB |
LUCAP0 | Toy medicine data w. probes | Not manipulated | Binary (143) | Binary | 2000 | 10000 | 341 KB | 262 KB |
LUCAP1 | Toy medicine data w. probes | Manipulated | Binary (143) | Binary | 2000 | 10000 | 342 KB | 654 KB |
LUCAP2 | Toy medicine data w. probes | Manipulated | Binary (143) | Binary | 2000 | 10000 | 342 KB | 263 KB |
All data sets are in the same format and include 4 files in text format:
For convenience, we also provide the data tables in Matlab format: dataname_train.mat and dataname_test.mat.
If you are a Matlab user, you can download some sample code to read and check the data.
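
For readers who prefer Python, the Matlab files can probably be inspected with SciPy as well. The sketch below builds the two file names from the pattern dataname_train.mat and dataname_test.mat given above, then lists the variables stored in each file instead of assuming their names. It assumes that SciPy is installed, that the files are not saved in the HDF5-based Matlab v7.3 format (which `scipy.io` does not read), and that the downloaded files sit in the current directory. The helper names and the lower-case `lucas0` example are hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def matlab_file_names(dataname: str) -> Tuple[str, str]:
    """Return the training and test file names for one dataset."""
    return f"{dataname}_train.mat", f"{dataname}_test.mat"


def list_mat_variables(path: Path) -> List[tuple]:
    """List (name, shape, type) for every variable stored in a .mat file."""
    # Imported lazily so the file name helper works without SciPy installed.
    from scipy.io import whosmat
    return whosmat(str(path))


if __name__ == "__main__":
    # Assumption: the files were downloaded under this (hypothetical) name.
    for file_name in matlab_file_names("lucas0"):
        path = Path(file_name)
        if path.exists():
            print(file_name, list_mat_variables(path))
        else:
            print(file_name, "not found in the current directory")
```

Once the variable names are known, `scipy.io.loadmat` can load the corresponding arrays.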