
Unsupervised and Transfer Learning Challenge

Frequently Asked Questions

Contents

Setup
Tasks of the challenge
Data
Evaluation
Submissions
Rules
Help


Setup


What is the goal of the challenge?
The goal is to devise preprocessing algorithms to create good data representations. The algorithms can be trained with unlabeled data only during phase 1 (unsupervised learning). Some labels (from other classes than those used for evaluation) will be made available during phase 2 (transfer learning).

Are there prizes and travel grants?
Yes, there will be free conference registrations, cash prizes, and travel grants. See the Prize section.

Will there be a workshop and proceedings?
Yes, we are planning to have two workshops, one at IJCNN 2011, July 31 to August 3, 2011, San Jose, California and one at ICML 2011, June 28 to July 2, 2011, Bellevue, Washington, USA (pending acceptance). The IJCNN proceedings will be published by the IEEE. Extended versions of selected papers will be published in the proceedings by JMLR W&CP. NOTE: the IJCNN paper deadline is February 1, 2011, which is before the final tests of the challenge.

Since the IJCNN deadline is before the end of the challenge, how will you judge the papers?
The papers will be judged as regular conference papers based on relevance to the workshop topics, novelty/originality, usefulness, and sanity. We encourage challenge participants to incorporate their results on the validation sets.

In the IJCNN submission system, how do I make sure to direct the papers to the right place?
You must either submit to special session S25 (Autonomous and Incremental Learning, AIL) or to competition session Ce (Benchmark for unsupervised learning and transfer learning algorithms, Clopinet UTL challenge). If you submit to S25, your paper will have to be relevant to AIL topics and will be considered for oral presentation. If you submit to Ce, your paper will have to include challenge results; it will be eligible for a special poster session at the main conference. The winners of phase 1 or phase 2 will also have the opportunity to make an oral presentation at a post-meeting workshop on the competitions.

Are we obliged to attend the workshop(s) or publish our method(s) to participate in the challenge?
No.

Can I attend the workshop(s) if I do not participate in the challenge?
Yes. You can even submit papers for presentation on the topics of the workshops, including: unsupervised learning, learning from unlabeled data, or transfer learning.

Do I need to register to participate?
Yes, to make submissions. You can download the data and experiment with the sample code without registering, but to make submissions you must register. Registration gives you access to the "Submit" page and to the "Mylab" page, which helps you manage your experiments.

What is an "experiment"?
An experiment is a set of submissions you make on various datasets under the same experiment name. An experiment is complete only when you have submitted results on all datasets for the "final evaluation sets".

Can I participate on a subset of the datasets?
Yes. However, to enter the final ranking and compete towards the prizes, you must make one complete experiment using the final evaluation sets. This means that, under the same experiment name, you must enter results for all datasets.

Which experiment will count towards the final ranking?
At the end of the challenge, to enter the final ranking, you will have to fill out a fact sheet identifying your team members and explaining roughly your methods. You will get to select the complete experiment that you want to use for the final ranking (a different one for each phase).

Can I participate in one phase only?
Yes. There are separate rankings and prizes for each phase.


Tasks of the challenge


Is causality needed to solve the problems of the challenge?
No.

So, why is this part of the Causality Workbench?
The challenge uses the Virtual Lab of the Causality Workbench. We organized this challenge because of the importance of the problem (many applications have large volumes of unlabeled data).

How do you define unsupervised learning?
For this challenge, unsupervised learning means data preprocessing from unlabeled data with the purpose of getting better results on an unknown supervised learning task. The algorithms used may include space transformation algorithms (e.g. Principal Component Analysis), clustering algorithms (e.g. k-means), and various normalizations.

How do you define transfer learning?
There are several kinds of transfer learning. We are NOT using the setting of "supervised" or "inductive" transfer learning in which training examples for many classes are provided, and the task is to use the "knowledge" acquired from learning one class to help learning another. This setting is useful to tackle supervised learning problems with very unbalanced class distributions. For this challenge, NO LABELS for the "primary task(s)" of interest are provided. Transfer learning means preprocessing using data partially labeled for "secondary tasks" not used for evaluation. The objective is to preprocess data to get better results on an unknown supervised learning "primary task", which will be used by the organizers for evaluation.

Is the data distribution the same in the various subsets?
No. In the development set, there are examples of the classes found in the validation set and in the final evaluation set, plus examples of other classes. The validation set and the final evaluation set contain examples of disjoint sets of classes. For instance, in the toy example ULE: The development set contains examples of all the digits 0-9. The validation set contains examples of digits 1, 3, and 7. The final evaluation set contains examples of digits 0, 2, and 6.

What supervised learning tasks are used for evaluation?
We use several two-class classification problems and average the performances over all such problems. For instance, in the ULE toy example, the validation set contains only examples from the digits 1, 3, and 7. When the organizers compute the learning curves, they train 3 linear classifiers:
- one to separate class 1 from 3 and 7
- one to separate class 3 from 1 and 7
- one to separate class 7 from 1 and 3
The performances (AUC) of the three classifiers are averaged over the 3 problems.
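As an illustration, here is a minimal Matlab sketch of this averaging for the ULE validation set; the label vector y, the matrix scores of discriminant values (one column per binary problem), and the auc routine are hypothetical placeholders standing in for the organizers' evaluation code.
  % y: true digit labels of the validation examples (values 1, 3, or 7)
  % scores: discriminant values, one column per binary problem (hypothetical)
  classes = [1 3 7];
  auc_values = zeros(1, numel(classes));
  for k = 1:numel(classes)
      target = 2*(y == classes(k)) - 1;          % +1 for the digit, -1 for the other two
      auc_values(k) = auc(scores(:, k), target); % AUC of the k-th two-class problem
  end
  mean_auc = mean(auc_values);                   % performance averaged over the 3 problems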

What application do you have in mind?
There are many applications in which it would be desirable to learn from very few examples, including just one (one-shot learning). A typical example is computer-aided "albuming". Imagine that you want to classify all your digital photos according to certain tags. In the future, consumer products may be equipped with pattern recognition algorithms allowing users to classify documents, images, or videos the way they want, given very few examples (say, examples of pictures of their family members). The classification accuracy of classifiers trained with very few examples largely rests upon the quality of the data representation. The hypothesis tested in this challenge is that it is possible to develop data representations suitable for a family of similar tasks using either unlabeled data (unsupervised learning) or data labeled with different classes than those of the end task (transfer learning). In our photo classification example, the developers may have large databases of photos available for training, including photos downloaded from the Internet that are either unlabeled or labeled with the names of celebrities.

Does phase 1 (unsupervised learning) have any practical relevance?
The purpose of phase 1 is to get baseline results to be compared with the results of phase 2. In practical applications, there is generally some amount of labeled data available for preprocessing development, even if the final task is of a different nature (as in the transfer learning setting we are proposing in the second phase). However, transfer learning might be prone to overfitting: a data representation too well tuned to solve a given task might not be suitable for another similar task. In that sense, one might be better off performing unsupervised learning and using the available "transfer" labels for model selection (preprocessing selection). For clarity of evaluation, in phase 1 we withhold the "transfer" labels to force the competitors to learn from unlabeled data only. The supervision in phase 1 is limited to model selection (preprocessing selection) using the performance obtained on the validation set.

Is the evaluation carried out with a single binary classification problem?
No, both in the validation set and in the final evaluation set we use several 2-class classification problems and average performances over all problems. We do not treat the multiclass problem.

What do you mean by a "class"?
A dataset may have several targets or labels (all categorical), e.g. for handwritten digits (the ULE example) the labels are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. In some cases, the classes form a partition of the data and one talks of a multi-class problem. In other cases, the classes overlap and one talks of a multi-label problem. For example, in the ULE example, you can define the classes {odd, even, (lower than 5), (higher or equal to 5)}. For each dataset, we have defined several binary classification problems, using the original labels of the data.

Why are there no multiclass and regression tasks?
It is difficult to provide a good unified criterion for all problems. Moreover, multiclass and regression problems are harder. We prefer to test unsupervised and transfer learning methods with one criterion that is consistent across all tasks and let the participants focus on solving these problems rather than dealing with multiple difficulties.


Data


Are the datasets using real data?
Yes, all of them, including the toy problem ULE.

How do we get the transfer learning labels?
They will be made available from the Data page at the end of Phase 1.

Can domain knowledge be used to facilitate solving the tasks of the challenge?
We purposely did not disclose the identity of the features, which may be available in real applications. We provide information on the datasets to make things more concrete and motivate participation, but we do not expect the participants to tailor their algorithms to the specifics of the domain.

Why did you hide the identity of the features?
For two reasons: (1) some datasets have patterns that could be labeled by hand by visualizing the data; (2) some datasets are either in the public domain or have been used partially in previous challenges.

Isn't it a pity that you hide the identity of the features?
We agree that it would be interesting to also test how much could be gained by exploiting the data structure (for instance the two dimensional structure of images) and other domain knowledge. But in this challenge we do not test that aspect. In an upcoming challenge on gesture recognition we will.

How easy would it be to decrypt the data and identify the pattern?
It would be very hard, if not impossible. It is against the rules of the challenge to reverse engineer the datasets to try to gain access to the identity of the patterns.

Can we use the unlabeled data of the validation and final evaluation sets for learning?
Yes, you can use ALL the unlabeled data for learning: from the development, validation, and final evaluation sets.

Are the distributions of the three subsets identical?
No. The development, validation, and final evaluation sets are not identically distributed. This is by design. The class proportions differ across the three subsets to illustrate the problem of transfer learning.

Are the classes balanced?
No. There are uneven numbers of examples in the various classes.

Is the validation set different in difficulty from the final evaluation set?
Yes. We tried not to make them too different in difficulty, but they unavoidably differ because the classes are different.


Evaluation


Why do you compute learning curves?
Learning curves allow us to see how the data representations provided by the participants perform over a range of numbers of training examples. To emphasize small numbers of training examples, we stop the learning curves at 64 examples.

How do you compute learning curves?
Consider the set of P=4096 available examples in either the validation set or the final evaluation set. For each point in the learning curve corresponding to p examples, we draw p training examples among the P available examples. A linear classifier is trained on the p training examples and tested on the (P-p) remaining examples. We repeat this procedure several times and average the results. We also compute the standard error over all the repeats. We perform between 10 and 500 repeats, terminating early if the standard error is below 0.01. Hence, for the first few points of the learning curve, where the variance is large, we average over a large number of repeats to reduce the variance.
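A minimal Matlab sketch of this procedure for a single binary problem is given below. It assumes a data representation X of size (P, n), labels y in {-1, +1}, a grid of training-set sizes (powers of two up to 64, an assumption on our part), the Hebbian classifier described further down, and a generic auc routine; the organizers' evaluation code remains the reference.
  P = size(X, 1);                          % P = 4096 available examples
  points = [1 2 4 8 16 32 64];             % training-set sizes (curve stops at 64)
  auc_curve = zeros(size(points));
  for i = 1:numel(points)
      p = points(i);
      rep_auc = zeros(1, 500);             % at most 500 repeats
      for r = 1:500
          idx = randperm(P); tr = idx(1:p); te = idx(p+1:end);
          ytr = y(tr);
          Y = (ytr==1)/max(sum(ytr==1),1) - (ytr==-1)/max(sum(ytr==-1),1); % balanced targets
          w = X(tr,:)' * Y;                % Hebbian training (see below)
          rep_auc(r) = auc(X(te,:)*w, y(te));  % AUC on the (P-p) remaining examples
          if r >= 10 && std(rep_auc(1:r))/sqrt(r) < 0.01, break; end % standard error < 0.01
      end
      auc_curve(i) = mean(rep_auc(1:r));
  end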

Wouldn't I be better off choosing different representations for different numbers of training examples?
Possibly. With more training examples you may be better off using more features. But we want to keep things simple and we focus on the regime of small numbers of training examples. You just need to find a good tradeoff that optimizes the area under the learning curve.

What type of classifier do you use to compute the learning curves?
We use a linear discriminant classifier to evaluate the quality of the data representations. Classification is performed according to the linear discriminant function given by the sum of w_i x_i, where the x_i are the feature values of a given pattern x = [x_1, x_2, ..., x_n], the w_i are the weights of the linear model, and the index i runs over all the features, i = 1, ..., n. In other words,
f(x) = w . x .
If a threshold is set, patterns having a discriminant function value exceeding the threshold are classified in the positive class. Otherwise they are classified in the negative class. In this challenge, we use a linear discriminant function to rank the patterns and compute the AUC to evaluate performance, see the Evaluation page.
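In code, evaluating this discriminant is just a dot product; here is a minimal Matlab sketch with a hypothetical weight vector w and test matrix Xtest:
  % x: one test pattern (1 x n), w: weight vector (n x 1) of the linear model
  f = x * w;               % discriminant value f(x) = w . x
  % With a threshold t, the pattern is classified positive if f > t.
  scores = Xtest * w;      % in the challenge, these values are only used to
                           % rank the patterns and compute the AUC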

What motivates choosing a linear discriminant as classifier?
We want to use a classifier that is commonly used, fast, and simple. Moreover, many methods including kernel methods and neural networks can be thought of as a linear classifier operating on top of an elaborate internal representation. The task of the challenge participants is to develop such a representation on top of which a linear classifier can operate with success.

What algorithm trains the linear discriminant used to compute the learning curves?
We use the learning object @hebbian provided in the sample code. The weights w_i are computed as the difference between the average of feature x_i over the examples of the positive class and its average over the examples of the negative class. In other words, if we call X the training data matrix of dimensions (p, n), p being the number of patterns and n the number of features, and Y the target vector with weighted binary values 1/p+ and -1/p-, where p+ and p- are the numbers of examples of the positive and negative classes respectively, we compute the weight vector w of the linear discriminant as:
w = X' Y.
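A minimal sketch of this rule in Matlab, assuming a training matrix X of size (p, n) and labels y in {-1, +1}:
  pos = (y == 1); neg = (y == -1);
  Y = pos/sum(pos) - neg/sum(neg);   % weighted binary targets: 1/p+ and -1/p-
  w = X' * Y;                        % Hebbian weights
  % equivalently, the difference of the class means:
  % w = mean(X(pos,:), 1)' - mean(X(neg,:), 1)';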

What do you do for the first point on the learning curves?
We use the same algorithm as for the other points. For a single training example, this boils down to using the training example itself as weight vector, multiplied by +-1 depending on the class membership of the example.

How do you "kernelize" the learning algorithm?
If you submit XX' instead of X, or any positive semi-definite matrix, the evaluation code will automatically detect that your submitted matrix is positive semi-definite (that is, a symmetric matrix whose eigenvalues are all positive or zero). The linear discriminant will then be computed as follows:
f(x) = sum_k alpha_k (x_k . x)
where the x_k are the training examples,
alpha_k = 1/p+ for the examples of the positive class,
alpha_k = -1/p- for the examples of the negative class,
and p+ and p- are the numbers of examples of the positive and negative classes respectively.
Compare with
w = X' Y
where Y is the "balanced" target vector with values 1/p+ and -1/p-, so that
w = sum_k alpha_k x_k
and hence
f(x) = w . x .
Therefore, whether you submit X or XX', you should get the exact same result.
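The small Matlab sketch below illustrates this equivalence, assuming a training matrix X, the balanced target vector Y defined above, and one test pattern x (a row vector):
  % Primal form: train on X, score x with the explicit weight vector
  w  = X' * Y;                 % w = sum_k alpha_k x_k
  f1 = x * w;                  % f(x) = w . x

  % Dual (kernelized) form: only dot products between patterns are needed,
  % which is what a submitted matrix XX' provides
  f2 = (x * X') * Y;           % f(x) = sum_k alpha_k (x_k . x)

  % f1 and f2 are identical up to numerical precision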

What motivates your choice of learning algorithm?
We want something fast, simple, and robust against overfitting and numerical instabilities. Moreover, we want to leave as much work as possible to the challenge participants and in this way identify the benefits of simple preprocessing steps including normalization and orthogonalization of features.

What is the relationship between feature orthogonalization and the efficiency of your learning algorithm?
Hebbian learning does not take into account correlations between features. In some cases, the presence of highly correlated features is detrimental to performance. Many linear classifier algorithms (including pseudo-inverse methods, Fisher linear discriminant, ridge regression, LSSVM, PLS, SVM) perform an implicit feature orthogonalization. If you orthogonalize your features, our Hebbian learning algorithm will become equivalent to a pseudo-inverse algorithm. Here is a brief explanation:
Let us call X your training data matrix of dimensions (p, n), p being the number of patterns and n the number of features. Let us call w of dimension n the weight vector of the linear classifier. Assume for simplicity that p>n and that the matrix X'X is invertible. To "balance" the classes, assume that we use as target values (1/p+) and -(1/p-), where p+ and p- are the number of examples of the positive and negative classes respectively. With the pseudo-inverse technique, we solve the matrix equation:
X w = Y
We transform it into the "normal equations":
X' X w = X' Y
Then if X'X is invertible, we compute w as:
w = (X'X)^-1 X' Y = X^+ Y
where X^+ is the pseudo-inverse. If the features are orthogonal (that is, such that X'X = I), the solution of the Hebbian algorithm and that of the pseudo-inverse are identical:
w = X' Y
Note that there can be at most p orthogonal features.
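The following Matlab sketch checks this equivalence numerically on random data (a toy illustration, not the organizers' code):
  p = 100; n = 20;
  X0 = randn(p, n);                              % raw features
  y  = sign(randn(p, 1)); y(y == 0) = 1;         % random binary labels
  Y  = (y==1)/sum(y==1) - (y==-1)/sum(y==-1);    % balanced targets

  w_pinv = pinv(X0) * Y;                         % pseudo-inverse solution, (X'X)^-1 X' Y

  [U, S, V] = svd(X0, 'econ');                   % orthogonalized features: U'U = I
  w_hebb = U' * Y;                               % Hebbian solution on the orthogonal features
  % U * w_hebb and X0 * w_pinv give the same predictions (up to numerical precision)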

What is an example of feature orthogonalization method?
Principal Component Analysis (PCA) is a classical example. Diagonalize your matrix X X':
X X' = U D U'
The eigenvectors U constitute a set of orthogonal features: U'U=I.
So, you may want to submit U or a subset of the columns of U, for instance those columns corresponding to the largest eigenvalues.
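A minimal Matlab sketch of this preprocessing, where X is the matrix of all available patterns and the number of retained components (50 here) is an arbitrary illustrative choice:
  K = X * X';                            % (P x P) matrix of dot products between patterns
  [U, D] = eig((K + K') / 2);            % symmetrize for numerical safety, then diagonalize
  [d, order] = sort(diag(D), 'descend'); % sort eigenvalues in decreasing order
  U = U(:, order);
  Xnew = U(:, 1:50);                     % keep the 50 leading components as the new representation
  % Xnew' * Xnew is (close to) the identity: the new features are orthogonal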

Why do you incorporate no feature selection in your algorithm?
Using a particular feature selection strategy may bias the results. Also, results of past challenges indicate that feature selection does not necessarily improve the results.

Why do you incorporate no feature normalization in your algorithm?
Feature normalization is a kind of preprocessing that influences the results and may be data dependent. We leave it up to the participants to discover which normalization works best.

Why do you use the AUC to compute the score?
Many learning problems have one class that is much rarer than the other, so the problem is more one of finding the best candidates of the positive class (a ranking problem) than one of classification. Furthermore, using another metric such as the "balanced error rate" (the average of the error rates of the positive and the negative class) requires choosing a bias value for the linear classifier. This is a difficult problem, distinct from what we want to test in this challenge, and making a particular choice may penalize the participants.

Will you use the ALC or the AUC to score submissions?
The global score (normalized ALC) is used to rank the participants as explained on the Evaluation page.

How exactly is the global score computed?
The Matlab code for the global score is provided, see the function alc.m in the sample code. Explanations are provided on the Evaluation page.
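For orientation only, here is a rough Matlab sketch of an area-under-the-learning-curve computation; the logarithmic x-axis and the normalization between the random (AUC = 0.5) and ideal (AUC = 1) baselines are assumptions on our part, and the exact conventions are those of alc.m and the Evaluation page.
  % points: training-set sizes, e.g. [1 2 4 8 16 32 64]; auc_curve: matching AUC values
  x = log2(points);                                  % logarithmic x-axis (assumption)
  alc_raw = trapz(x, auc_curve);                     % area under the learning curve
  a_rand  = trapz(x, 0.5 * ones(size(x)));           % area of the random-guess curve
  a_max   = trapz(x, ones(size(x)));                 % area of the ideal curve
  global_score = (alc_raw - a_rand) / (a_max - a_rand);  % normalized ALC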

Can I do better than random for the first point on the learning curve?
Yes. This is one-shot learning. Some representations allow you to do better than random. In fact, we expect unsupervised and transfer learning to help mostly at the beginning of the learning curve.

Will the results on the validation sets count for the final ranking?
No. However, you may report these results in your paper submitted to the workshop.


Submissions


How many submissions can I make?
As many as you want. However, we ask you to limit yourself to 5 submissions per day. If the server becomes congested, this limitation will be strictly enforced. Note that only complete experiments (including submissions for all datasets) will be taken into account in the final ranking for prize winning.

Can I make experiments with mixed methods?
We encourage you to use a single unified methodology for all the datasets and group the results using such methodology in one "experiment". However, we acknowledge that the datasets are very different in nature and may require adjustments or changes in strategy, so we do not impose that you use strictly the same method on all datasets.

Can I submit preprocessed data involving no learning at all?
Yes. You may use the provided data to generate a better data representation. However, you are free not to train any model and not to perform unsupervised or transfer learning. You may submit representations that are derived from simple normalizations, products of features, or other transformations involving no learning.

Can I submit the raw data?
Yes. However, to win you need to get better performance than that obtained with the raw data on every dataset.

Can I submit results using a non linear kernel?
Yes. As long as you submit a matrix that is positive semi-definite, we will identify it as a dot-product matrix XX' and treat it as such.

Can I submit results using a similarity matrix that is not positive semi-definite in place of a kernel matrix?
No. We ask you to submit positive semi-definite similarity matrices. Even though the linear classifier that we use for evaluation does not need it, we will use other classifiers in post-challenge analyses that will require positive semi-definite similarity matrices.

I do not see my results in the "Leaderboard" table or in "My Lab", what's wrong?
Make sure your submission complies with the Instructions.

Can I overwrite a result on one dataset in a particular experiment?
Yes. If you keep submitting results on the same dataset in a particular experiment, your last result will be taken into account (and appear on the Leaderboard).

Do I have to submit results on all "final" evaluation sets?
Yes. Only experiments including results on all datasets will count towards winning prizes. See the rules on the Synopsis page.

I have problems submitting large files, what should I do?
Do everything possible to limit the size of the files you upload. It is a good idea to quantize your features and submit 8-bit integers. If everything else fails, send email to ul@clopinet.com. We will arrange an alternative way for you to submit your files.

I accidentally submitted results on the wrong dataset and the system did not generate an error, why?
A zip archive containing files ending with "_valid.prepro" and "_final.prepro", with the correct number of lines and the same number of features on every line, is considered valid. We do not check the file names beyond their endings.


Rules


Can I reverse engineer the datasets to uncover the identity of the patterns?
No. This is forbidden. We believe that it is impossible.

How will you enforce this "reverse engineering" rule?
If we suspect that there is a possibility that the datasets have been reverse engineered (for instance if a participant points out to us a scheme to do it), we will run post-challenge verification, which may include reproducing the results under time pressure after the data have been preprocessed in a different way.

Couldn't people cheat by guessing the labels of the validation set?
If a participant makes many submissions on the validation set, they can partially guess the labels of the validation set. We do not consider that cheating. It may actually be detrimental to the participant because of the risk of overfitting.

Can I use a robot to make submissions?
Robot submissions are not explicitly forbidden. However, if our server gets overloaded, we will enforce a strict maximum number of submissions per day and per participant. Please be courteous and make no more than 5 submissions per day.

Can I use an alias or a funky email not to reveal my identity?
We require participants to identify themselves by their real name when they register, and you must always provide a valid email so we can communicate with you. But your name will remain confidential, unless you agree to reveal it. Your email will always remain confidential. You may select an alias for your Workbench ID to hide your identity in the result tables and remain anonymous during the challenge.

The rules specify I will have to fill out a fact sheet, do you have information about that fact sheet?
You will have to fill out a multiple-choice questionnaire that will be sent to you when the challenge is over. It will include high-level questions about the methods you used and about your software and hardware platform. Details or proprietary information may be withheld, and the participants retain all intellectual property rights on their methods.

Do I need to let you know what my method is?
Disclosing information about your method is optional. However, to be included in the final ranking, you will have to fill out a fact sheet about your method(s). We encourage the participants not only to fill out the fact sheets, but also to write a paper with more details. Best paper awards will distinguish entries with principled, original, and effective methods, and with a clear demonstration of the advantages of the method via theoretical derivations and well-designed experiments.

The submission button is not activated unless I enter a description. Will the description be accessible by other participants?
The description is only available to you. It is a means for you to keep track of your experiments.

Will the organizers enter the competition?
The prize winners may not be challenge organizers. The challenge organizers will enter development submissions from time to time, under the name "ULref". Reference entries are shown for information only and are not part of the competition.

Can a participant give an arbitrary hard time to the organizers?
DISCLAIMER: ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE PROVIDED "AS-IS". ISABELLE GUYON, the IEEE AND/OR OTHER ORGANIZERS DISCLAIM ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY PARTICULAR PURPOSE, AND THE WARRANTY OF NON-INFRINGEMENT OF ANY THIRD PARTY'S INTELLECTUAL PROPERTY RIGHTS. IN NO EVENT SHALL ISABELLE GUYON AND/OR OTHER ORGANIZERS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE, DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE FOR THE CHALLENGE.
In case of dispute about prize attribution or possible exclusion from the competition, the participants agree not to take any legal action against the organizers, IEEE, or data donors. Decisions can be appealed by submitting a letter to the IJCNN 2011 conference chair Ali Minai and will be resolved by the committee of co-chairs of the IJCNN 2011 conference.


Help


Is there code I can use to perform the challenge tasks?
We provide the following tools written in Matlab (R):
  • Sample code creating entries for any dataset and computing the learning curves for the toy example ULE.
  • The CLOP package, which includes many machine learning algorithms that were successful in past challenges.

Who can I ask for more help?
For all other questions, email ul@clopinet.com.

Last updated December 27, 2011.