We propose datasets from various application domains. We took great care to use real data. From this page, we make available the unlabeled data and one "seed" label for one example. In the data tables, the top examples are training data and the bottom examples are test data. You must "pay" virtual cash to get other training labels; see the Instructions.
Dataset | Domain | Feat. Type | Feat. num. | Sparsity % | Missing % | Label | Train num. | Test num. | Positive labels % | Seed | Data (zip) | Data (Matlab) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
A | xxx | mixed | 92 | 79.02 | 0 | binary | 17535 | 17535 | xxx | 1 | 673 KB | 1 MB |
B | xxx | mixed | 250 | 46.89 | 25.76 | binary | 25000 | 25000 | xxx | 1 | 6.5 MB | 6.6 MB |
C | xxx | mixed | 851 | 8.6 | 0 | binary | 25720 | 25720 | xxx | 1 | 62.7 MB | 72.8 MB |
D | xxx | binary | 12000 | 99.67 | 0 | binary | 10000 | 10000 | xxx | 1 | 1.7 MB | 1.6 MB |
E | xxx | continuous | 154 | 0.04 | 0.0004 | binary | 32252 | 32252 | xxx | 1 | 34 MB | 55.8 MB |
F | xxx | mixed | 12 | 1.02 | 0 | binary | 67628 | 67628 | xxx | 1 | 2.3 MB | 1.9 MB |
Dataset | Domain | Feat. Type | Feat. num. | Sparsity % | Missing % | Label | Train num. | Test num. | Positive labels % | Seed | Data (zip) | Data (Matlab) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
HIVA | Chemo-informatics | binary | 1617 | 90.88 | 0 | binary | 21339 | 21339 | 3.52 | 1 | 5.9 MB | 9.3 MB |
IBN_SINA | Handwriting recognition | mixed | 92 | 80.67 | 0 | binary | 10361 | 10361 | 37.84 | 4 | 346 KB | 537 KB |
NOVA | Text processing | binary | 16969 | 99.67 | 0 | binary | 9733 | 9733 | 28.45 | 11 | 2.3 MB | 2.3 MB |
ORANGE | Marketing | mixed | 230 | 9.57 | 65.46 | binary | 25000 | 25000 | 1.78 | 54 | 6.8 MB | 6.4 MB |
SYLVA | Ecology | mixed | 216 | 77.88 | 0 | binary | 72626 | 72626 | 6.15 | 4 | 14.5 MB | 20.2 MB |
ZEBRA | Embryology | continuous | 154 | 0.04 | 0.004 | binary | 30744 | 30744 | 4.58 | 23 | 28.6 MB | 53.2 MB |
The ORANGE dataset contains categorical variables; see the data description. The column "Data (zip)" points to archives containing the data in ASCII format, while the column "Data (Matlab)" points to the same data in Matlab(R) format. The column "Seed" gives the line number of one example of the positive class. Important: The goal is to purchase as few labels as possible with "virtual cash" while achieving the best possible performance. However, to facilitate algorithm development, we give you direct access to all the labels of the development datasets. Read the "Algorithm Development" section of the Instructions.
We provide a toy dataset called ALEX (Active Learning EXample dataset). It consists of 5000 training examples and 5000 test examples generated with a Bayesian network (the LUCAS model) having 12 binary variables, including the target variable. The seed example belonging to the positive class is the first example. We used this dataset to provide example queries (see the Instructions) with our Matlab sample code and example learning curves (see the Evaluation page). You may download ALEX as a zip archive (21 KB) or as a Matlab matrix (20 KB).
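The layout described on this page (training examples on top, test examples at the bottom, and a seed given as a line number) can be sketched in Python as follows. This is only an illustration, not the challenge's sample code: the helper names `split_train_test` and `seed_row` are hypothetical, the data matrix is a random stand-in with ALEX's dimensions rather than the real file, and we assume that the 12 variables minus the target leave 11 feature columns and that line numbers are counted from 1.

```python
import numpy as np


def split_train_test(data, n_train):
    """Split a data matrix whose top n_train rows are training examples
    and whose remaining rows are test examples."""
    if not 0 < n_train < data.shape[0]:
        raise ValueError("n_train must lie strictly between 0 and the number of rows")
    return data[:n_train], data[n_train:]


def seed_row(seed_line):
    """Convert a 1-based seed line number (assumed convention) into a
    0-based row index."""
    if seed_line < 1:
        raise ValueError("line numbers are assumed to start at 1")
    return seed_line - 1


# Random stand-in with ALEX's dimensions: 5000 training rows followed by
# 5000 test rows, and 11 binary features (assumption: 12 variables minus
# the target). The real data would be read from the downloaded archive.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(10000, 11))

train, test = split_train_test(data, n_train=5000)

# For ALEX the seed is the first example, i.e. line 1.
seed_index = seed_row(1)
seed_example = train[seed_index]

print(train.shape, test.shape)
```

The same two helpers apply to the development datasets by substituting the "Train num." and "Seed" values from the table, for example `n_train=10361` and `seed_row(4)` for IBN_SINA.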