Causality Workbench                                                             Challenges in Machine Learning

Unsupervised and Transfer Learning Challenge

Fact sheet


Team name: 1055A
Team leader:
First Name: Chuanren
Last Name: Liu
Institution: Rutgers
Team member 1:
First Name: Chuanren
Last Name: Liu
Team member 2:
First Name: Jianjun
Last Name: Xie
Institution: CoreLogic
Team member 3:
First Name: Hui
Last Name: Xiong
Team member 4:
First Name: Yong
Last Name: Ge
Phase 1 experiment: exp1
Phase 2 experiment: phase2exp1 (phase 2a), abc1055a (phase 2b)
Title of the contribution:

Stochastic Clustering for Unsupervised Learning


K-means clustering is a popular approach to unsupervised learning, and principal component analysis (PCA) is an effective method for feature preprocessing and dimensionality reduction. In this competition, we proposed a stochastic clustering algorithm that clusters the data points with K-means on the first n principal components of the dataset. We first ran PCA on the validation dataset and used the on-line feedback to determine the number n of leading principal components that gave the best global score. Each clustering run then started from randomly chosen initial cluster seeds and used the n principal components to compute the distances between points. This process was repeated 100 times, and the final result for each data point was the 'average' of the 100 K-means clusterings. We used the feedback on the validation dataset to choose the number of clusters in each experiment. We converted the cluster labels into binary notation and submitted them for evaluation. We kept this approach for all 5 datasets and found that it produced good results on most of them. The most difficult dataset was AVICENNA, which has overlapping labels, i.e., one example can belong to multiple classes.
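The procedure described above can be sketched roughly as follows. This is a minimal illustration, not the team's implementation (which used Matlab and SAS). The abstract does not say how the 100 clusterings were 'averaged', so the sketch assumes one possible reading: each run's cluster labels are written in binary (one-hot) notation, and the resulting co-membership matrices are averaged over runs, which avoids having to match cluster indices between runs. The function names `pca_project`, `kmeans_labels`, and `stochastic_clustering` are hypothetical.

```python
import numpy as np


def pca_project(X, n_components):
    """Project the rows of X onto the first n_components principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def kmeans_labels(Z, k, rng, n_iter=50):
    """Plain Lloyd's K-means with randomly chosen data points as initial seeds."""
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = Z[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def stochastic_clustering(X, n_components, k, n_runs=100, seed=0):
    """Average the binary co-membership matrices of n_runs K-means runs.

    ASSUMPTION: this is one interpretation of the 'average' of the runs;
    the fact sheet does not specify the exact averaging scheme.
    """
    rng = np.random.default_rng(seed)
    Z = pca_project(X, n_components)
    coassoc = np.zeros((len(Z), len(Z)))
    for _ in range(n_runs):
        labels = kmeans_labels(Z, k, rng)
        onehot = np.eye(k)[labels]      # cluster labels in binary notation
        coassoc += onehot @ onehot.T    # 1 where two points share a cluster
    return coassoc / n_runs
```

In this sketch, `n_components` and `k` play the role of the two quantities the team tuned with the validation feedback: the number of principal components and the number of clusters. Entry (i, j) of the returned matrix is the fraction of runs in which points i and j fell into the same cluster.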
PHASE 1: UNSUPERVISED LEARNING
Dataset      Validation               Final evaluation         Rank
avicenna     0.6437      0.138572     0.688514    0.19061      6
harry        0.987791    0.90853      0.934806    0.735657     3
rita         0.792667    0.373677     0.821416    0.478162     5
sylvester    0.938693    0.71457      0.839931    0.58279      1
terry        0.984694    0.817631     0.993058    0.843724     2

PHASE 2a: TRANSFER LEARNING (Official ranking)
Dataset      Validation               Final evaluation         Rank
avicenna     0.668133    0.166751     0.659408    0.151111     4
harry        0.986947    0.906208     0.928677    0.738125     3
rita         0.789524    0.369143     0.830367    0.499217     2
sylvester    0.935239    0.756142     0.845877    0.587319     2
terry        0.984694    0.817631     0.993058    0.843724     1

* The organizers detected that team 1055A mistakenly submitted their results on the validation set instead of those on the final evaluation set for the Sylvester dataset. The team was allowed to resubmit their results on that dataset, and those are the results shown in the table. Without this correction, team 1055A ranks 3rd, and this is the official ranking (with the correction, they rank 2nd ex aequo with tkgw).

PHASE 2b: (Supplemental ranking)
Dataset      Validation               Final evaluation         Rank
avicenna     0.568039    0.0646479    0.564046    0.0574749    8
harry        0.986947    0.906208     0.928677    0.738125     3
rita         0.791816    0.375655     0.831324    0.501497     2
sylvester    0.935239    0.756142     0.845877    0.587319     2
terry        0.984694    0.817631     0.993058    0.843724     1

** Due to an accidental release of the results on the final evaluation set on the scheduled deadline of phase 2, the planned grace period was canceled. However, the participants were permitted to make one last submission.

Algorithm Phase 1 Phase 2
Preprocessing with no learning at all: Did you use...
P1 Normalization of data matrix lines (patterns)?
P2 Normalization of data matrix columns (features)?
P3 Construction of new features (e.g. products of original features)?
P4 Functional data transformations (e.g. take log or sqrt)?
P5 Feature orthogonalization?
P6 Another preprocessing with no learning at all?
Unsupervised learning: Did you use...
U1 Linear manifold learning (e.g. factor analysis, PCA, ICA)?
U2 "Shallow" non-linear manifold learning for dimensionality reduction (e.g. KPCA, MDS, LLE, Laplacian Eigenmaps, Kohonen maps)?
U3 "Shallow" non-linear manifold learning to expand dimension (e.g. sparse coding)?
U4 Clustering (e.g. K-means, hierarchical clustering)?
U5 Deep Learning (e.g. stacks of auto-encoders, stacks of RBMs)?
U6 Another unsupervised learning method?
Transfer learning: Did you...
T1 - Not use the transfer labels at all?
T2 - Use the transfer labels for selection of unsupervised learning methods, not for training?
T3 - Use only a subset of the available transfer labels (i.e. select the tasks that are most suitable for transfer)?  
T4 - Learn a "shallow" representation with the transfer labels?  
T5 - Learn a "deep" representation with the transfer labels?  
T6 - Use transfer learning in another way?  
Feature selection: Did you...
F1 Not perform any feature selection?
F2 Use a feature selection mechanism embedded in your algorithm?
F3 Use a filter method not taking into account the prediction performances of the classifier (e.g. use reconstruction error)?
F4 Use a wrapper method to select features based on the performance of the classifier (e.g. use the validation set results)?
Kernel (or metric) learning: Did you...
K1 - Learn parameters in a "shallow" architecture (e.g. kernel width, NCA)?
K2 - Learn parameters in a "deep" architecture (e.g. a Siamese neural network)?
Ensemble methods: Did you...
E1 - Concatenate multiple representations?
E2 - Average several kernels?
Model selection: Did you...
M1 - Submit results with the same algorithm on all datasets (possibly with some hyperparameter tuning)?
M2 - Select the model performing best on the validation set?
M3 - Use cross-validation on development data?
Induction/Transduction: To prepare the final results did you...
I1 - Use the development dataset for training?
I2 - Use the validation dataset for training?
I3 - Use the final evaluation dataset for training?
Classifier: Did you...
C1 - Make specific changes to your algorithm knowing that it would be evaluated with a linear classifier?
C2 - Take into account the specific type of linear classifier algorithm we are using?
Advantages of the methods employed:
  • Quantitative advantages

    The advantage of our method is its stochastic process, which is immune to overfitting. Running K-means on the principal components instead of the raw data also helps generate robust results.
  • Qualitative advantages

    The stochastic clustering process is another form of ensemble: our final results are essentially the ensemble of many clustering runs. This method was also used in the recent Active Learning Challenge, where it produced very good results.
  • Other methods

    Principal Component Analysis and TF-IDF (Term Frequency and Inverse Document Frequency). PCA alone can produce better results than the raw features, but clustering on the principal components generates the best results. TF-IDF on the Terry dataset generates better results than using the raw features.
  • Software implementation
    • Availability
      Proprietary in house software
      Commercially available in house software
      Freeware or shareware in house software
      Off-the-shelf third party commercial software
      Off-the-shelf third party freeware or shareware
    • Language
      Other (specify below)

    Details on software implementation:

    The main software packages we used are Matlab and SAS.
  • Hardware implementation
    • Platform
      Linux or other Unix
      Mac OS
      Other (specify below)
    • Memory
      <= 2GB    <= 8 GB    > 8 GB    >= 32 GB
    • Parallelism
      Multi-processor machine
      Run in parallel different algorithms on different machines
      Other (specify below)
  • Development effort:

    How much time did you spend customizing your existing code (total human effort)?
    A few hours    A few days    1-2 weeks    >2 weeks   

    How much time did you spend experimenting with the validation datasets (total computer time effort)?
    A few hours    A few days    1-2 weeks    >2 weeks   

    Did you get enough development time?
    Yes    No


Stochastic Semi-supervised Learning, JMLR Workshop and Conference Proceedings: Volume 15, 85-98, 2011.