Active Learning Challenge

Challenge Datasets

We propose datasets from various application domains, and we took great care to use real data. This page makes available the unlabeled data and one "seed" label for one example. In the data tables, the top examples are training data and the bottom examples are test data. You must "pay" virtual cash to get other training labels; see Instructions.

Final Datasets

Dataset	Domain	Feat. Type	Feat. num.	Sparsity %	Missing %	Label	Train num.	Test num.	Positive labels %	Seed	Data (zip)	Data (Matlab)
A	xxx	mixed	92	79.02	0	binary	17535	17535	xxx	1	673 KB	1 MB
B	xxx	mixed	250	46.89	25.76	binary	25000	25000	xxx	1	6.5 MB	6.6 MB
C	xxx	mixed	851	8.6	0	binary	25720	25720	xxx	1	62.7 MB	72.8 MB
D	xxx	binary	12000	99.67	0	binary	10000	10000	xxx	1	1.7 MB	1.6 MB
E	xxx	continuous	154	0.04	0.0004	binary	32252	32252	xxx	1	34 MB	55.8 MB
F	xxx	mixed	12	1.02	0	binary	67628	67628	xxx	1	2.3 MB	1.9 MB

The identity of the domains and the fraction of positive labels were purposely omitted. They will be revealed at the end of the challenge. Dataset B has one categorical variable (column 14) and dataset F has 2 categorical variables (columns 2 and 4).

Development Datasets

Dataset	Domain	Feat. Type	Feat. num.	Sparsity %	Missing %	Label	Train num.	Test num.	Positive labels %	Seed	Data (zip)	Data (Matlab)
HIVA	Chemo-informatics	binary	1617	90.88	0	binary	21339	21339	3.52	1	5.9 MB	9.3 MB
IBN_SINA	Handwriting recognition	mixed	92	80.67	0	binary	10361	10361	37.84	4	346 KB	537 KB
NOVA	Text processing	binary	16969	99.67	0	binary	9733	9733	28.45	11	2.3 MB	2.3 MB
ORANGE	Marketing	mixed	230	9.57	65.46	binary	25000	25000	1.78	54	6.8 MB	6.4 MB
SYLVA	Ecology	mixed	216	77.88	0	binary	72626	72626	6.15	4	14.5 MB	20.2 MB
ZEBRA	Embryology	continuous	154	0.04	0.004	binary	30744	30744	4.58	23	28.6 MB	53.2 MB

The ORANGE dataset contains categorical variables; see the data description. The column "Data (zip)" points to archives containing the data in ASCII format, while the column "Data (Matlab)" points to the same data in Matlab(R) format. The column "Seed" indicates the line number of one example of the positive class. Important: The goal is to purchase as few labels as possible with "virtual cash" while achieving the best possible performance. However, to facilitate algorithm development, we give you direct access to all the labels of the development datasets. Read the "Algorithm Development" section of the Instructions.

Toy Dataset

We provide a toy dataset called ALEX (Active Learning EXample dataset). It consists of 5000 training examples and 5000 test examples generated with a Bayesian network (the LUCAS model) having 12 binary variables, including the target variable. The seed example belonging to the positive class is the first example. We used this dataset to provide example queries (see the Instructions) with our Matlab sample code and example learning curves (see the Evaluation page). You may download ALEX as a zip archive (21 KB) or as a Matlab matrix (20 KB).

Dataset Formats

Unlabeled data (provided in the tables above):

dataname.data - All the unlabeled data in ASCII format (a space delimited table with samples in rows, features/variables in columns). The table is compressed in a zip archive. For T training examples, the first T lines are training examples. The remaining examples are reserved for testing.
dataname.mat - The same matrix in Matlab format.
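
The layout described above (a space-delimited table whose first T lines are training examples) can be read with a few lines of Python. This is only a minimal sketch, not part of the challenge's sample code: the function names read_data_table and split_train_test are hypothetical, the tiny table is made up for illustration, and the sketch assumes every field parses as a number.

```python
import io


def read_data_table(handle):
    """Parse a space-delimited table: one example per line, one feature per column.

    Assumption: every field is numeric (float() also accepts the token "NaN").
    """
    rows = []
    for line in handle:
        fields = line.split()
        if fields:  # skip blank lines
            rows.append([float(value) for value in fields])
    return rows


def split_train_test(rows, num_train):
    """Return (training rows, test rows): the first num_train rows are training data."""
    if not 0 <= num_train <= len(rows):
        raise ValueError("num_train must be between 0 and the number of rows")
    return rows[:num_train], rows[num_train:]


# Made-up 4-example table standing in for the contents of dataname.data.
demo_file = io.StringIO("1 0 0\n0 1 1\n1 1 0\n0 0 1\n")
rows = read_data_table(demo_file)

# T comes from the "Train num." column of the dataset tables; here T = 2.
train, test = split_train_test(rows, num_train=2)
print(len(train), len(test))
```

For a real dataset, the same functions would be applied to the opened dataname.data file after unzipping, with num_train taken from the "Train num." column.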

Data labels returned when queries are sent:

dataname.sample - The examples (identified by the line number in the data matrix) for which the labels are provided.
dataname.label - The corresponding labels (target values).
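
The two files returned by a query are meant to be read together: each entry of dataname.sample names an example, and the entry at the same position in dataname.label gives its label. The sketch below pairs them up. It rests on several assumptions that are not stated on this page: that both files are whitespace-separated with one value per example, that line numbers are integers, and that labels are numeric. The function name read_queried_labels and the sample values are hypothetical.

```python
import io


def read_queried_labels(sample_handle, label_handle):
    """Map each queried line number to its label.

    Assumptions: whitespace-separated values, integer line numbers,
    numeric labels, and the same ordering in both files.
    """
    sample_ids = [int(token) for token in sample_handle.read().split()]
    labels = [float(token) for token in label_handle.read().split()]
    if len(sample_ids) != len(labels):
        raise ValueError("sample and label files must have the same number of entries")
    return dict(zip(sample_ids, labels))


# Made-up contents standing in for dataname.sample and dataname.label.
sample_file = io.StringIO("1\n7\n42\n")
label_file = io.StringIO("1\n-1\n1\n")

known_labels = read_queried_labels(sample_file, label_file)
print(known_labels)
```

Keeping the result as a dictionary keyed by line number makes it easy to merge the labels from successive queries and to look up the matching row in the data matrix.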

To send queries and obtain labels, see Instructions.