Challenge Datasets
We propose datasets from various application domains, all consisting of real data. Each dataset has three unlabeled subsets: a development set, a validation set, and a final evaluation set. During the development period, you can get immediate feedback on the Leaderboard and in My Lab by submitting valid data representations of the validation set. Submit your representation of the final evaluation set when you are ready for final testing.
In phase 1, no labels were available.
| Dataset | Domain | Feat. num. | Sparsity (%) | Development num. | Transfer num. | Validation num. | Final Eval. num. | Data (text) | Data (Matlab) |
|---|---|---|---|---|---|---|---|---|---|
| AVICENNA | Arabic manuscripts | 120 | 0.00 | 150205 | 50000 | 4096 | 4096 | 16 MB | 14 MB |
| HARRY | Human action recognition | 5000 | 98.12 | 69652 | 20000 | 4096 | 4096 | 13 MB | 15 MB |
| RITA | Object recognition | 7200 | 1.19 | 111808 | 24000 | 4096 | 4096 | 1026 MB | 762 MB |
| SYLVESTER | Ecology | 100 | 0.00 | 572820 | 100000 | 4096 | 4096 | 81 MB | 69 MB |
| TERRY | Text recognition | 47236 | 99.84 | 217034 | 40000 | 4096 | 4096 | 73 MB | 56 MB |
| ULE (toy data) | Handwritten digits | 784 | 80.85 | 26808 | 10000 | 4096 | 4096 | 7 MB | 13 MB |
Data Mirrors and Download Tips
- Data mirrors -- Preferably download the data from a location near you:
  - ETH, Zurich, Switzerland (this page).
  - Orange Labs, Brittany, France.
  - Brandeis Univ., Massachusetts, USA.
  - NEC Labs, New Jersey, USA.
  - NYU, New York, USA.
  - Synchromedia, ÉTS Montreal, Canada.
  - Acadia University, Canada.
  - UC Irvine, California, USA.
  - You? Help us by setting up a data mirror: make the datasets available from a web page on your server and email the URL to ul@clopinet.com. Many thanks in advance.
- Matlab format -- Matlab users should download the Matlab format. The data in the text and Matlab columns are identical.
- Download retry -- To download large files (like RITA) over a slow connection with frequent interruptions, use a download tool that retries and resumes automatically, such as Wget.
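If Wget is not at hand, the same retry-and-resume idea can be sketched in Python. This is only an illustration, not part of the challenge tooling: the function names `resume_request` and `download_with_resume`, the retry count, and the wait time are hypothetical, and the sketch assumes the mirror answers HTTP `Range` requests (if it does not, the file is simply downloaded again from the start).

```python
import http.client
import os
import shutil
import time
import urllib.error
import urllib.request


def resume_request(url, dest):
    """Build a request for the bytes of `url` that are not yet in `dest`."""
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    request = urllib.request.Request(url)
    if offset > 0:
        # Ask the server to skip what is already on disk.
        request.add_header("Range", f"bytes={offset}-")
    return request, offset


def download_with_resume(url, dest, max_attempts=20, wait_seconds=5):
    """Retry an interrupted download, appending to the partial file."""
    for _ in range(max_attempts):
        request, offset = resume_request(url, dest)
        try:
            with urllib.request.urlopen(request, timeout=60) as response:
                # Status 206 means the Range header was honoured; any other
                # status delivers the whole file, so the copy starts over.
                resumed = response.status == 206
                length = response.headers.get("Content-Length")
                expected = None
                if length is not None:
                    expected = int(length) + (offset if resumed else 0)
                with open(dest, "ab" if resumed else "wb") as out:
                    shutil.copyfileobj(response, out)
            # A dropped connection can look like a normal end of stream,
            # so compare the file size with what the server announced.
            if expected is None or os.path.getsize(dest) >= expected:
                return dest
        except urllib.error.HTTPError as err:
            if err.code == 416 and offset > 0:
                return dest  # nothing left to fetch: file already complete
        except (OSError, http.client.HTTPException):
            pass  # network error: fall through and retry
        time.sleep(wait_seconds)
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")
```

Call `download_with_resume(url, dest)` with the archive URL from the mirror you chose and a local file name; running it again after a crash continues from the partial file.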
Toy Dataset
We provide a toy dataset called ULE (Unsupervised Learning Example dataset). This dataset is NOT part of the challenge; it is provided for practice only. We used it to provide example submissions (see the Instructions) with our Matlab sample code, and example learning curves (see the Evaluation page). For ULE you get all the data labels. For all other datasets, the data come with no labels in phase 1, and you will get only the transfer labels in phase 2.
Dataset Formats
Below are the formats of the data found in the archives, where dataname is one of the dataset names and subset is one of: devel (development set), valid (validation set), or final (final evaluation set).
- dataname.param - Data statistics.
- dataname_subset.data - Unlabeled data in ASCII format. For all datasets except TERRY: a space-delimited table with samples in rows and features/variables in columns. For TERRY: a sparse matrix M(i, j), with one nonzero entry per line, given as the two indices followed by the value (i, j, M(i, j)).
- dataname_subset.mat - The same matrix in Matlab format.
- dataname_transfer.label - Transfer data labels (will be available in phase 2 only).
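The two ASCII layouts described above can be read with a few lines of Python. This is a minimal sketch under stated assumptions, not the organizers' loader: the function names are hypothetical, the triplet fields are assumed to be whitespace-separated, and the TERRY indices are assumed to be 1-based (Matlab convention); pass `one_based=False` if the files turn out to use 0-based indices.

```python
def data_filename(dataname, subset):
    """Compose the file name pattern dataname_subset.data."""
    return f"{dataname}_{subset}.data"


def parse_dense(lines):
    """Parse a space-delimited table: one sample per row, one feature per column."""
    return [[float(tok) for tok in line.split()] for line in lines if line.strip()]


def parse_triplets(lines, one_based=True):
    """Parse 'i j value' lines into a {(row, col): value} dict with 0-based keys."""
    shift = 1 if one_based else 0  # assumption: indices start at 1
    entries = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        i, j, value = line.split()
        entries[(int(i) - shift, int(j) - shift)] = float(value)
    return entries


def load(path, sparse=False):
    """Read one .data file; use sparse=True for the TERRY triplet layout."""
    with open(path) as handle:
        return parse_triplets(handle) if sparse else parse_dense(handle)


# Tiny in-memory examples of both layouts.
dense = parse_dense(["0.5 0 2", "1 3 0"])
# dense == [[0.5, 0.0, 2.0], [1.0, 3.0, 0.0]]
sparse = parse_triplets(["1 2 4.0", "3 1 1.5"])
# sparse == {(0, 1): 4.0, (2, 0): 1.5}
```

A dictionary of nonzero entries keeps the sketch dependency-free; for the full TERRY matrix a dedicated sparse-matrix library would be a more practical container.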
To prepare a valid submission, see the Instructions.