MARTI

MARTI studies measurement artifacts!

MARTI (Measurement ARTIfact) is obtained from the same data-generating process as REGED, a source of simulated genomic data, but a noise model is added to simulate the imperfections of the measurement device.

The goal is still to find genes that could be responsible for lung cancer. The target variable is binary; it separates malignant samples (adenocarcinoma) from control samples (squamous). The feature values, which represent measurements of gene expression levels, are assumed to have been recorded from a two-dimensional 32x32 microarray. The training set was perturbed by a zero-mean correlated noise model: neighboring values in one array are generally affected similarly, but the noise pattern is different in every training example.
The test sets have no added noise. This simulates a case where different instruments are used at "training time" and "test time", e.g. DNA microarrays to collect training data and PCR for testing. We avoided adding noise to the test set because it would be too difficult to filter it without visualizing the test data or computing statistics on the test data, which we forbid. So the scenario is that the second instrument (used at test time) is more accurate. In practice, its measurements would probably also be more expensive, so one goal of training would be to reduce the size of the feature set (we are not making this a requirement in this first challenge).

Technical details:
- The features/variables are randomly arranged in a 32x32 two-dimensional array. Variables 1:32 form the first column, 33:64 the second, etc.
- To obtain 1024 features, the 999 features of REGED are complemented by 25 "calibrant features", which have value zero plus a small amount of Gaussian noise. The calibrants are spread regularly across the array and have variable indices 34 44 54 64 199 209 219 354 364 374 384 519 529 539 674 684 694 704 839 849 859 994 1004 1014 1024.
- As for REGED, we proposed 3 tasks, MARTI0, MARTI1, and MARTI2, all having the same training set of 500 examples (from the "unmanipulated distribution") and different test sets of 20000 examples.
- As for REGED, the three tasks differ in the test data distribution, which results from various types of manipulations:
MARTI0: No manipulation (distribution identical to the training data).
MARTI1: The following variables are manipulated:
5, 19, 27, 35, 37, 42, 49, 67, 70, 71, 102, 137, 144, 145, 153, 158, 185, 188, 194, 221, 225, 229, 232, 235, 244, 268, 273, 284, 294, 295, 305, 310, 331, 356, 368, 379, 385, 396, 398, 404, 411, 412, 413, 417, 425, 430, 455, 479, 481, 482, 491, 492, 509, 510, 550, 553, 555, 603, 609, 627, 642, 646, 654, 679, 682, 706, 736, 744, 755, 761, 763, 771, 807, 809, 812, 821, 853, 869, 870, 872, 888, 894, 895, 906, 914, 918, 926, 931, 932, 941, 963, 973, 978, 979, 986, 988, 990, 1001, 1010, 1017.
MARTI2: Many variables are manipulated, including all the consequences of the target.
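
The array layout described above can be sketched in a few lines of Python. This is only an illustration, not part of the challenge material: it assumes one example is available as a flat vector of 1024 values in variable order, and the names CALIBRANTS, to_array and position are hypothetical helpers introduced here.

```python
import numpy as np

# 1-based variable indices of the 25 calibrant features, as listed above.
CALIBRANTS = [34, 44, 54, 64, 199, 209, 219, 354, 364, 374, 384, 519, 529,
              539, 674, 684, 694, 704, 839, 849, 859, 994, 1004, 1014, 1024]


def to_array(x):
    """Reshape one example (1024 values in variable order) into a 32x32 array.

    Variables 1:32 form the first column, 33:64 the second, and so on,
    which corresponds to column-major (Fortran) order.
    """
    x = np.asarray(x)
    if x.shape != (1024,):
        raise ValueError("expected a flat vector of 1024 feature values")
    return x.reshape(32, 32, order="F")


def position(var_index):
    """Map a 1-based variable index to its 0-based (row, column) in the array."""
    if not 1 <= var_index <= 1024:
        raise ValueError("variable index must be in 1..1024")
    return (var_index - 1) % 32, (var_index - 1) // 32
```

For example, position(34) places the first calibrant in the second row of the second column, and to_array(x)[position(34)] reads its value in a given example x.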

We anticipate that filtering the noise and/or taking into account the geometry of the array will be necessary to obtain good results.
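
One possible way to use the calibrants for noise filtering is sketched below in Python. It is an assumption of this example, not a method prescribed by the challenge: since the calibrants are nominally zero, their measured values in a training example are taken as samples of the noise, a quadratic surface is fitted to them by least squares, and that surface is subtracted from the whole array. The choice of a quadratic surface and the function names are hypothetical; the actual noise may call for a different smoother.

```python
import numpy as np

# 1-based variable indices of the 25 calibrant features (from the list above).
CALIBRANTS = np.array([34, 44, 54, 64, 199, 209, 219, 354, 364, 374, 384, 519,
                       529, 539, 674, 684, 694, 704, 839, 849, 859, 994, 1004,
                       1014, 1024])


def _design(rows, cols):
    """Quadratic design matrix in (row, column), with coordinates scaled to [0, 1]."""
    r = rows / 31.0
    c = cols / 31.0
    return np.column_stack([np.ones_like(r), r, c, r * r, r * c, c * c])


def subtract_calibrant_surface(x):
    """Subtract a smooth noise estimate from one example of 1024 values.

    The estimate is a quadratic surface fitted by least squares to the 25
    calibrant values (an illustrative assumption about the noise shape).
    """
    x = np.asarray(x, dtype=float)
    if x.shape != (1024,):
        raise ValueError("expected a flat vector of 1024 feature values")
    cal = CALIBRANTS - 1                      # 0-based positions in the vector
    # Column-major layout: row = index mod 32, column = index div 32.
    coef, *_ = np.linalg.lstsq(_design(cal % 32, cal // 32), x[cal], rcond=None)
    idx = np.arange(1024)
    surface = _design(idx % 32, idx // 32) @ coef
    return x - surface
```

The function is meant to be applied to each training example separately, because the noise pattern differs from one example to the next.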

Download the data.