
Active Learning Challenge

Challenge Datasets

We propose datasets from various application domains, and we took great care to use real data. This page provides the unlabeled data and one "seed" label for a single example of each dataset. In the data tables, the top examples are training data and the bottom examples are test data. You must "pay" virtual cash to get other training labels; see Instructions.
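The top/bottom layout described above can be sketched as a simple row split. This is only an illustration: the function name `split_train_test` and the in-memory list-of-rows representation are assumptions, not part of the challenge code, and the row counts would come from the "Train num." and "Test num." columns of the tables on this page.

```python
def split_train_test(rows, n_train, n_test):
    """Split a data table whose top rows are training and bottom rows are test.

    `rows` is any sequence of examples (hypothetical in-memory form of a data
    table); `n_train` and `n_test` are the counts listed for the dataset.
    """
    if len(rows) != n_train + n_test:
        raise ValueError(f"expected {n_train + n_test} rows, got {len(rows)}")
    return rows[:n_train], rows[n_train:]


# Tiny made-up table: 3 training rows followed by 2 test rows.
table = [[0, 1], [1, 1], [0, 0], [1, 0], [0, 1]]
train, test = split_train_test(table, n_train=3, n_test=2)
```

Checking the total row count first guards against silently splitting a truncated or partially downloaded file.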

Final Datasets

Dataset | Domain | Feat. type | Feat. num. | Sparsity % | Missing % | Label | Train num. | Test num. | Positive labels % | Seed | Data (zip) | Data (Matlab)
A | xxx | mixed | 92 | 79.02 | 0 | binary | 17535 | 17535 | xxx | 1 | 673 KB | 1 MB
B | xxx | mixed | 250 | 46.89 | 25.76 | binary | 25000 | 25000 | xxx | 1 | 6.5 MB | 6.6 MB
C | xxx | mixed | 851 | 8.6 | 0 | binary | 25720 | 25720 | xxx | 1 | 62.7 MB | 72.8 MB
D | xxx | binary | 12000 | 99.67 | 0 | binary | 10000 | 10000 | xxx | 1 | 1.7 MB | 1.6 MB
E | xxx | continuous | 154 | 0.04 | 0.0004 | binary | 32252 | 32252 | xxx | 1 | 34 MB | 55.8 MB
F | xxx | mixed | 12 | 1.02 | 0 | binary | 67628 | 67628 | xxx | 1 | 2.3 MB | 1.9 MB

The identity of the domains and the fraction of positive labels were purposely omitted. They will be revealed at the end of the challenge. Dataset B has one categorical variable (column 14) and dataset F has 2 categorical variables (columns 2 and 4).
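Many learning algorithms expect numeric inputs, so the categorical columns noted above usually need to be recoded first. The following is a minimal sketch of one common option, one-hot encoding, in plain Python. The function name `one_hot_encode`, the list-of-rows representation, and the reading of the column numbers as 1-based are all assumptions made for illustration. The example rows are invented and do not come from dataset F.

```python
def one_hot_encode(rows, categorical_columns):
    """Replace each categorical column with one 0/1 indicator per category.

    `categorical_columns` holds 1-based column numbers (an assumption about
    how the columns are numbered on this page).
    """
    cat_idx = sorted(c - 1 for c in categorical_columns)
    # Collect the categories seen in each categorical column, in sorted order.
    categories = {j: sorted({row[j] for row in rows}) for j in cat_idx}
    encoded = []
    for row in rows:
        new_row = []
        for j, value in enumerate(row):
            if j in categories:
                # One indicator per known category of column j.
                new_row.extend(1 if value == c else 0 for c in categories[j])
            else:
                new_row.append(value)
        encoded.append(new_row)
    return encoded


# Made-up rows with 4 columns; columns 2 and 4 are treated as categorical.
rows = [
    [0.5, 3, 1.0, 7],
    [0.1, 1, 0.0, 7],
    [0.9, 3, 2.0, 2],
]
encoded = one_hot_encode(rows, categorical_columns=[2, 4])
```

Because the unlabeled training and test examples are both available, the categories can be collected over all rows at once, so the test data never contains a category the encoding has not seen. Missing values would need separate handling before sorting the categories.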

Development Datasets

Dataset | Domain | Feat. type | Feat. num. | Sparsity % | Missing % | Label | Train num. | Test num. | Positive labels % | Seed | Data (zip) | Data (Matlab)
HIVA | Chemo-informatics | binary | 1617 | 90.88 | 0 | binary | 21339 | 21339 | 3.52 | 1 | 5.9 MB | 9.3 MB
IBN_SINA | Handwriting recognition | mixed | 92 | 80.67 | 0 | binary | 10361 | 10361 | 37.84 | 4 | 346 KB | 537 KB
NOVA | Text processing | binary | 16969 | 99.67 | 0 | binary | 9733 | 9733 | 28.45 | 11 | 2.3 MB | 2.3 MB
ORANGE | Marketing | mixed | 230 | 9.57 | 65.46 | binary | 25000 | 25000 | 1.78 | 54 | 6.8 MB | 6.4 MB
SYLVA | Ecology | mixed | 216 | 77.88 | 0 | binary | 72626 | 72626 | 6.15 | 4 | 14.5 MB | 20.2 MB
ZEBRA | Embryology | continuous | 154 | 0.04 | 0.004 | binary | 30744 | 30744 | 4.58 | 23 | 28.6 MB | 53.2 MB

The ORANGE dataset contains categorical variables; see the data description. The column "Data (zip)" points to archives containing the data in ASCII format, while the column "Data (Matlab)" points to the same data in Matlab(R) format. The column "Seed" gives the line number of one example of the positive class.

Important: The goal is to purchase as few labels as possible with "virtual cash" while achieving the best possible performance. However, to facilitate algorithm development, we give you direct access to all the labels of the development datasets. Read the "Algorithm Development" section of the Instructions.
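Since all labels of the development datasets are available, the label-purchasing process can be simulated offline. The sketch below shows one hypothetical way to do so. The class `LabelOracle`, the function `random_sampling_run`, and the cost of one unit of virtual cash per label are assumptions for illustration; the actual query protocol and pricing are defined in the Instructions. Random sampling is used here only as a simple baseline strategy, and indices are 0-based, whereas the "Seed" column gives a line number.

```python
import random


class LabelOracle:
    """Simulates buying labels; each new label costs one unit (assumed)."""

    def __init__(self, labels, seed_index):
        self._labels = labels
        # The seed label is given for free.
        self.purchased = {seed_index: labels[seed_index]}
        self.spent = 0

    def buy(self, index):
        # Charge only the first time a given label is requested.
        if index not in self.purchased:
            self.purchased[index] = self._labels[index]
            self.spent += 1
        return self.purchased[index]


def random_sampling_run(labels, seed_index, budget, batch_size, rng):
    """Buy labels in random batches until the budget is used up."""
    oracle = LabelOracle(labels, seed_index)
    pool = [i for i in range(len(labels)) if i != seed_index]
    rng.shuffle(pool)
    history = []  # cumulative spending after each batch
    while oracle.spent < budget and pool:
        n = min(batch_size, budget - oracle.spent)
        batch = pool[:n]
        del pool[:n]
        for i in batch:
            oracle.buy(i)
        history.append(oracle.spent)
    return oracle, history


# Made-up labels (+1 / -1); index 0 plays the role of the positive seed.
labels = [1, -1, -1, 1, -1, -1, -1, 1, -1, -1]
oracle, history = random_sampling_run(
    labels, seed_index=0, budget=6, batch_size=4, rng=random.Random(0)
)
```

In a real experiment, a classifier would be retrained on `oracle.purchased` after each batch and its test performance recorded against `oracle.spent`, replacing the random batch choice with the query strategy under study.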

Toy Dataset

We provide a toy dataset called ALEX (Active Learning EXample dataset). It consists of 5000 training examples and 5000 test examples generated with a Bayesian network (the LUCAS model) having 12 binary variables, including the target variable. The seed example, which belongs to the positive class, is the first example. We used this dataset to provide example queries (see the Instructions) with our Matlab sample code and example learning curves (see the Evaluation page). You may download ALEX as a zip archive (21 KB) or as a Matlab matrix (20 KB).

Dataset Formats

Unlabeled data (provided in the table above):
Data labels returned when queries are sent:
To send queries and obtain labels, see Instructions.