Unsupervised and Transfer Learning Challenge

Challenge Datasets

We propose datasets from various application domains (all real data). Every dataset has 3 unlabeled subsets: a development set, a validation set, and a final evaluation set. During the development period, you can get immediate feedback on the Leaderboard and in My Lab by submitting valid data representations of the validation set. Turn in your representation of the final evaluation set when you are ready for final testing.
In phase 1, no labels were available.

For phase 2, download the transfer labels.

Dataset	Domain	Feat. num.	Sparsity (%)	Development num.	Transfer num.	Validation num.	Final Eval. num.	Data (text)	Data (Matlab)
AVICENNA	Arabic manuscripts	120	0.00	150205	50000	4096	4096	16 MB	14 MB
HARRY	Human action recognition	5000	98.12	69652	20000	4096	4096	13 MB	15 MB
RITA	Object recognition	7200	1.19	111808	24000	4096	4096	1026 MB	762 MB
SYLVESTER	Ecology	100	0.00	572820	100000	4096	4096	81 MB	69 MB
TERRY	Text recognition	47236	99.84	217034	40000	4096	4096	73 MB	56 MB
ULE (toy data)	Handwritten digits	784	80.85	26808	10000	4096	4096	7 MB	13 MB

Data Mirrors and Download Tips

Data mirrors -- Preferably download the data from a location near you:
1. ETH, Zurich, Switzerland: this page.
2. Orange Labs, Brittany, France.
3. Brandeis Univ., Massachusetts, USA.
4. NEC Labs, New Jersey, USA.
5. NYU, New York, USA.
6. Synchromedia, ÉTS Montreal, Canada.
7. Acadia University, Canada.
8. UC Irvine, California.
9. You??? Help us by setting up a data mirror: make the datasets available from a web page on your server and email the URL to ul@clopinet.com. Many thanks in advance.
Matlab format -- Matlab users should download the Matlab format. The data in the text and Matlab columns are identical.
Download retry -- To download large files (like RITA) over a slow connection with frequent interruptions, use a tool that resumes interrupted downloads automatically, such as Wget.
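
For readers who prefer a script to Wget, the retry idea above can be sketched in Python. This is only an illustration under stated assumptions: the function names, the retry count, and the wait time are hypothetical, the archive URL must be taken from one of the mirrors, and the mirror is assumed to support HTTP Range requests (if it does not, the sketch simply restarts the download from the beginning).

```python
import http.client
import os
import time
import urllib.error
import urllib.request


def resume_headers(dest):
    """Return (offset, headers) asking only for the bytes not yet on disk."""
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": f"bytes={offset}-"} if offset else {}
    return offset, headers


def download_with_resume(url, dest, max_tries=20, wait_s=5, chunk=1 << 16):
    """Download url to dest, resuming after interruptions (hypothetical helper)."""
    for _ in range(max_tries):
        offset, headers = resume_headers(dest)
        request = urllib.request.Request(url, headers=headers)
        try:
            with urllib.request.urlopen(request, timeout=60) as response:
                # Status 206 means the server honoured the Range request;
                # any other status means we receive the whole file again.
                mode = "ab" if offset and response.status == 206 else "wb"
                with open(dest, mode) as out:
                    while True:
                        block = response.read(chunk)
                        if not block:
                            return dest
                        out.write(block)
        except urllib.error.HTTPError as err:
            # Assumption: 416 (range not satisfiable) means the file on disk
            # is already complete.
            if err.code == 416:
                return dest
            time.sleep(wait_s)
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            # Connection dropped or timed out: wait, then resume.
            time.sleep(wait_s)
    raise RuntimeError(f"giving up on {url} after {max_tries} attempts")
```

A call would look like download_with_resume(url, "rita.zip"), where both the URL and the local file name are placeholders.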

Toy Dataset

We provide a toy dataset called ULE (Unsupervised Learning Example dataset). This dataset is NOT part of the challenge. It is provided for practice purposes only. We used it to provide example submissions (see the Instructions) with our Matlab sample code and example learning curves (see the Evaluation page). For ULE you get all the data labels. For all other datasets, the data come with no labels in phase 1 and you will get only the transfer labels in phase 2.

Dataset Formats

Below are the formats of the data found in the archives, where dataname is one of the dataset names and subset is one of: devel (development set), valid (validation set), or final (final evaluation set).

dataname.param - Data statistics.
dataname_subset.data - Unlabeled data in ASCII format. For all datasets except TERRY: a space-delimited table with samples in rows and features/variables in columns. For TERRY: a sparse matrix M, with one nonzero entry per line, given as the two indices followed by the value (i, j, M(i, j)).
dataname_subset.mat - The same matrix in Matlab format.
dataname_transfer.label - Transfer data labels (will be available in phase 2 only).
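
As an illustration of the two ASCII layouts described above, here is a minimal Python sketch of a reader for the .data files. The function names are hypothetical, and the sketch assumes that the indices in the TERRY file are 1-based (Matlab convention) and that file names are built exactly as dataname_subset.data; check both against the downloaded archives. The layout of the .param and .label files is not described here, so they are not parsed.

```python
import os


def load_dense(path):
    """Read a space-delimited table: one sample per line, one feature per column."""
    with open(path) as handle:
        return [[float(tok) for tok in line.split()]
                for line in handle if line.strip()]


def load_sparse_triples(path, one_based=True):
    """Read 'i j value' lines into a dict {(row, col): value} with 0-based keys.

    Assumption: indices in the file start at 1; pass one_based=False otherwise.
    """
    shift = 1 if one_based else 0
    entries = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue
            row = int(float(fields[0])) - shift
            col = int(float(fields[1])) - shift
            entries[(row, col)] = float(fields[2])
    return entries


def load_subset(dataname, subset, folder="."):
    """Load dataname_subset.data, using the sparse reader for TERRY only."""
    path = os.path.join(folder, f"{dataname}_{subset}.data")
    if dataname.lower() == "terry":
        return load_sparse_triples(path)
    return load_dense(path)
```

For example, load_subset("ule", "valid") would return a list of rows, one per validation sample. For the large datasets, a numerical array or sparse-matrix library is a more practical container than plain lists and dictionaries.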

To prepare a valid submission, see Instructions.