We propose datasets from various application domains, and we took great care to use real data. From this page you can download the unlabeled data and one "seed" label for a single example of each dataset. In the data tables, the top examples are training data and the bottom examples are test data. You must "pay" virtual cash to obtain other training labels; see the Instructions.

Dataset | Domain | Feat. Type | Feat. num. | Sparsity % | Missing % | Label | Train num. | Test num. | Positive labels % | Seed | Data (zip) | Data (Matlab) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
A | xxx | mixed | 92 | 79.02 | 0 | binary | 17535 | 17535 | xxx | 1 | 673 KB | 1 MB |
B | xxx | mixed | 250 | 46.89 | 25.76 | binary | 25000 | 25000 | xxx | 1 | 6.5 MB | 6.6 MB |
C | xxx | mixed | 851 | 8.6 | 0 | binary | 25720 | 25720 | xxx | 1 | 62.7 MB | 72.8 MB |
D | xxx | binary | 12000 | 99.67 | 0 | binary | 10000 | 10000 | xxx | 1 | 1.7 MB | 1.6 MB |
E | xxx | continuous | 154 | 0.04 | 0.0004 | binary | 32252 | 32252 | xxx | 1 | 34 MB | 55.8 MB |
F | xxx | mixed | 12 | 1.02 | 0 | binary | 67628 | 67628 | xxx | 1 | 2.3 MB | 1.9 MB |

Dataset | Domain | Feat. Type | Feat. num. | Sparsity % | Missing % | Label | Train num. | Test num. | Positive labels % | Seed | Data (zip) | Data (Matlab) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
HIVA | Chemo-informatics | binary | 1617 | 90.88 | 0 | binary | 21339 | 21339 | 3.52 | 1 | 5.9 MB | 9.3 MB |
IBN_SINA | Handwriting recognition | mixed | 92 | 80.67 | 0 | binary | 10361 | 10361 | 37.84 | 4 | 346 KB | 537 KB |
NOVA | Text processing | binary | 16969 | 99.67 | 0 | binary | 9733 | 9733 | 28.45 | 11 | 2.3 MB | 2.3 MB |
ORANGE | Marketing | mixed | 230 | 9.57 | 65.46 | binary | 25000 | 25000 | 1.78 | 54 | 6.8 MB | 6.4 MB |
SYLVA | Ecology | mixed | 216 | 77.88 | 0 | binary | 72626 | 72626 | 6.15 | 4 | 14.5 MB | 20.2 MB |
ZEBRA | Embryology | continuous | 154 | 0.04 | 0.004 | binary | 30744 | 30744 | 4.58 | 23 | 28.6 MB | 53.2 MB |

The ORANGE dataset contains categorical variables; see the data description. The column "Data (zip)" points to archives containing the data in ASCII format, while the column "Data (Matlab)" points to the same data in Matlab(R) format. The column "Seed" gives the line number of one example of the positive class. **Important:** The goal is to purchase as few labels as possible with "virtual cash" while achieving the best possible performance, **BUT** to facilitate algorithm development, we give you direct access to all the labels of the development datasets. Read the "Algorithm Development" section of the Instructions.
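Because all labels of the development datasets are available, the label-purchasing process can be simulated offline. The sketch below is only an illustration of that idea: the `LabelOracle` class, its method names, and the cost of one unit of virtual cash per label are hypothetical and are not part of the challenge protocol described in the Instructions.

```python
class LabelOracle:
    """Hypothetical helper that reveals known labels on request and
    keeps track of how much virtual cash has been spent."""

    def __init__(self, labels, cost_per_label=1):
        # labels: sequence of all known labels of a development dataset.
        # cost_per_label: assumed price of one label (an illustration only).
        self._labels = labels
        self.cost_per_label = cost_per_label
        self.spent = 0
        self._revealed = {}

    def purchase(self, indices):
        """Return {index: label} for the requested examples.
        A label that was already purchased is not charged again."""
        for i in indices:
            if i not in self._revealed:
                self._revealed[i] = self._labels[i]
                self.spent += self.cost_per_label
        return {i: self._revealed[i] for i in indices}


# Toy labels, invented purely for this demonstration.
oracle = LabelOracle([1, -1, -1, 1])
first = oracle.purchase([0, 2])
second = oracle.purchase([2, 3])  # index 2 was already paid for
total_spent = oracle.spent
```

Tracking the amount spent alongside model performance makes it possible to compare query strategies on the development datasets before spending virtual cash on the final ones.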

We provide a toy dataset called ALEX (Active Learning EXample dataset). It consists of 5000 training examples and 5000 test examples generated with a Bayesian network (the LUCAS model) that has 12 binary variables, including the target variable. The seed example, which belongs to the positive class, is the first example. We used this dataset to provide example queries (see the Instructions) with our Matlab **sample code** and example learning curves (see the Evaluation page). You may download ALEX as a zip archive (21 KB) or as a Matlab matrix (20 KB).

- **dataname.data**: All the unlabeled data in ASCII format (a space-delimited table with samples in rows and features/variables in columns). The table is compressed in a zip archive. For T training examples, the first T lines are training examples; the remaining lines are reserved for testing.
- **dataname.mat**: The same matrix in Matlab format.
- **dataname.sample**: The examples (identified by their line number in the data matrix) for which labels are provided.
- **dataname.label**: The corresponding labels (target values).
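As a minimal sketch of how these files fit together, the Python snippet below reads the ASCII table, splits it into training and test rows, and pairs each line number in the `.sample` file with its label. It rests on several assumptions that the page does not state: line numbers are taken to be 1-based, every entry is assumed to parse as a number, and the function names and the tiny `toy.*` files are invented for the demonstration.

```python
import tempfile
from pathlib import Path


def read_table(path):
    """Read a space-delimited ASCII table into a list of rows of floats."""
    with open(path) as fh:
        return [[float(v) for v in line.split()] for line in fh if line.strip()]


def load_dataset(data_path, sample_path, label_path, n_train):
    """Split the data matrix and map labeled row indices to their labels."""
    rows = read_table(data_path)
    # The first n_train lines are training examples, the rest are test examples.
    train, test = rows[:n_train], rows[n_train:]
    sample_lines = [int(r[0]) for r in read_table(sample_path)]
    labels = [r[0] for r in read_table(label_path)]
    # Assumption: .sample holds 1-based line numbers, so shift to 0-based.
    labeled = {line - 1: lab for line, lab in zip(sample_lines, labels)}
    return train, test, labeled


# Demonstration on a made-up 4 x 3 table (2 training and 2 test examples).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "toy.data").write_text("1 0 1\n0 0 1\n1 1 0\n0 1 0\n")
    (tmp / "toy.sample").write_text("1\n")  # the seed: line 1
    (tmp / "toy.label").write_text("1\n")   # label of the seed example
    train, test, labeled = load_dataset(
        tmp / "toy.data", tmp / "toy.sample", tmp / "toy.label", n_train=2
    )
```

For the real datasets, `n_train` would be taken from the "Train num." column of the tables, and the zip archive would need to be extracted first.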