The Pot-luck challenge datasets are a selection of the Repository datasets. Presently, we propose the following tasks:
- CYTO: Causal Protein-Signaling Networks in human T cells. The task is to learn a protein signaling network from multicolor flow cytometry data, recording the molecular activity of 11 proteins. There are on average 800 samples per experimental condition, corresponding to various perturbations of the system (manipulations). The authors used a Bayesian network approach and demonstrated that they recover most of the known signaling network structure, while discovering some new hypothetical regulations (causal relationships). The tasks suggested to the challenge participants include reproducing the results of the paper and finding a method to assess the confidence of the causal relationships uncovered. The evaluation is via submitted papers.
- LOCANET: LOcal CAusal NETwork. We group under the name LOCANET a number of tasks that consist of finding the local causal structure around a given target variable (depth 3 network). The following datasets lend themselves to performing such a task:
MARTI. The results are evaluated by the organizers upon submission of the local structure in a designated format. In addition, the toy dataset LUCAS can be used for self-evaluation.
- PROMO: Simple causal effects in time series. The task is to identify which promotions affect sales. This is an artificial dataset of about 1000 promotion variables and 100 product sales. The goal is to predict a 1000x100 boolean influence matrix, indicating for each (i,j) element whether the ith promotion has a causal influence on the sales of the jth product. Data is provided as time series, with a daily value for each variable for three years (i.e., 1095 days). The ground truth for the influence matrix is provided, so the participants can self-evaluate their results and submit a paper to compete for the prizes.
- SIGNET: Abscisic Acid Signaling Network. The objective is to determine the set of 43 boolean rules that describe the interactions of the nodes within a plant signaling network. The dataset includes 300 separate boolean pseudodynamic simulations of the true rules, using an asynchronous update scheme. This is an artificial dataset inspired by a real biological system. The results are evaluated by the contact person upon submission of the results in a designated format.
- TIED: Target Information Equivalent Dataset. This is an artificial simulated dataset constructed to illustrate that there may be many minimal sets of features with optimal predictivity (i.e., Markov boundaries) and likewise many sets of features that are statistically indistinguishable from the set of direct causes and direct effects of the target. The tasks suggested include determining all statistically indistinguishable sets of direct causes and effects, or Markov boundaries of the target variable, and predicting the target variable on test data. The results are evaluated by the contact person upon submission of the results in a designated format.
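Because the PROMO ground truth is provided, a predicted influence matrix can be scored locally. The following is a minimal sketch of one way to do so, assuming both matrices have already been loaded as nested lists of booleans (rows are promotions, columns are products). The choice of precision and recall as scores, the function names, and the toy 3x2 matrices are illustrative assumptions, not the organizers' evaluation procedure or the actual data.

```python
from typing import List, Tuple

Matrix = List[List[bool]]


def confusion_counts(predicted: Matrix, truth: Matrix) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) counted over all (promotion, product) entries."""
    same_shape = len(predicted) == len(truth) and all(
        len(p) == len(t) for p, t in zip(predicted, truth)
    )
    if not same_shape:
        raise ValueError("predicted and truth must have the same shape")
    tp = fp = fn = tn = 0
    for pred_row, true_row in zip(predicted, truth):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def precision_recall(predicted: Matrix, truth: Matrix) -> Tuple[float, float]:
    """Precision and recall of the predicted influences (0.0 when undefined)."""
    tp, fp, fn, _ = confusion_counts(predicted, truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy example: 3 promotions x 2 products (the real matrix is 1000x100).
truth = [[True, False], [False, False], [False, True]]
predicted = [[True, True], [False, False], [False, False]]

if __name__ == "__main__":
    print(precision_recall(predicted, truth))
```

Since true influences are likely to be a small fraction of the 100,000 entries, precision and recall are probably more informative here than plain accuracy, but any score suited to the paper's argument could be substituted.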
Note that the participants are ultimately judged on their paper(s). For each dataset, the tasks proposed are only suggestions. The participants are invited to use the data in a creative way and propose their own task(s).
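For readers unfamiliar with the asynchronous update scheme mentioned for SIGNET, the sketch below shows one common variant, random-order asynchronous updating: in each round every node is updated once, in a random order, and each update sees the most recent values of the other nodes. The three-node rule set is hypothetical and is not taken from the SIGNET data; whether the dataset uses exactly this variant is an assumption.

```python
import random
from typing import Callable, Dict, List

State = Dict[str, bool]
Rule = Callable[[State], bool]

# Hypothetical rule set for illustration only (NOT the 43 SIGNET rules).
RULES: Dict[str, Rule] = {
    "A": lambda s: s["A"],                 # input node keeps its value
    "B": lambda s: s["A"] and not s["C"],  # activated by A, inhibited by C
    "C": lambda s: s["B"],                 # follows B
}


def async_step(state: State, rules: Dict[str, Rule], rng: random.Random) -> State:
    """Update every node once, in random order, using the latest values."""
    new_state = dict(state)
    order = list(rules)
    rng.shuffle(order)
    for node in order:
        new_state[node] = rules[node](new_state)
    return new_state


def simulate(initial: State, rules: Dict[str, Rule], steps: int, seed: int) -> List[State]:
    """Return the trajectory of states, starting with the initial state."""
    rng = random.Random(seed)
    trajectory = [dict(initial)]
    for _ in range(steps):
        trajectory.append(async_step(trajectory[-1], rules, rng))
    return trajectory


if __name__ == "__main__":
    for state in simulate({"A": True, "B": False, "C": False}, RULES, steps=5, seed=0):
        print(state)
```

Different seeds can yield different trajectories from the same initial state, which is why several simulations of the same rules carry more information than one.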
New: tasks proposed by participants
October 30: The following tasks proposed by participants are now included in the challenge.
- CauseEffectPairs: Distinguishing between cause and effect. The data set consists of 8 N x 2 matrices, each representing a cause-effect pair, and the task is to identify which variable is the cause and which one is the effect. The origin of the data is hidden from the participants but known to the organizers. The data sets are chosen such that we expect common agreement on which one is the cause and which one is the effect. Even though part of the statistical dependences may also be due to hidden common causes, common sense tells us that there is a significant cause-effect relation.
- STEMMATOLOGY: Computer-assisted stemmatology. Stemmatology (a.k.a. stemmatics) studies relations among different variants of a document that have been gradually built from an original text by copying and modifying earlier versions. The aim of such study is to reconstruct the family tree (causal graph) of the variants. We provide a dataset to evaluate methods for computer-assisted stemmatology. The ground truth is provided, as are evaluation criteria to allow the ranking of the results of different methods. We hope this will facilitate the development of novel approaches, including but not restricted to hierarchical clustering, graphical modeling, link analysis, phylogenetics, string-matching, etc.
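As a starting point for the STEMMATOLOGY task, the sketch below combines two of the ingredients listed above, string matching and a tree-building step: it computes character-level edit distances between text variants and links them with a minimum spanning tree (Prim's algorithm). This is a deliberately simple baseline chosen for illustration, not the method or the evaluation criteria supplied with the dataset, and the four variant texts are made up. The resulting tree is undirected, so it does not by itself say which variant is the original.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def minimum_spanning_tree(texts: Dict[str, str]) -> List[Tuple[str, str]]:
    """Link variants with the undirected edges of a minimum spanning tree."""
    names = sorted(texts)
    if not names:
        return []
    in_tree = {names[0]}
    edges: List[Tuple[str, str]] = []
    while len(in_tree) < len(names):
        # Cheapest edge from a variant in the tree to one outside it.
        _, u, v = min(
            (edit_distance(texts[u], texts[v]), u, v)
            for u in sorted(in_tree)
            for v in names
            if v not in in_tree
        )
        edges.append((u, v))
        in_tree.add(v)
    return edges


# Hypothetical variants of a short text, for illustration only.
VARIANTS = {
    "orig": "the quick brown fox",
    "copy1": "the quick brown fax",
    "copy2": "the quick brown fax!",
    "copy3": "the quik brown fox",
}

if __name__ == "__main__":
    for u, v in minimum_spanning_tree(VARIANTS):
        print(u, "--", v)
```

On real manuscripts, word-level or alignment-based distances would likely be more appropriate than raw character edits, and choosing a root requires additional assumptions.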
Another way to participate in the challenge is to donate data. You will need to first Register. Then you will need to fill out a form to Deposit.
Our repository does not actually store data; it points to a web page YOUR_DATA.html, which you maintain and from which your data is accessible. If you do not have a web server allowing you to maintain a web page for your data, you may use the UCI Machine Learning Repository, which physically archives data, or contact us at email@example.com.
Your entry can be edited after submission, but we recommend that you prepare your submission in a text file before filling out the form.
Tips to fill out your submission form:
- Contact name/URL: Select a person who will be available to answer questions and evaluate results.
- Resource type: Choose "data" (you may also submit a generative model of data; in that case choose "model").
- Resource name: A short easy-to-memorize acronym.
- Resource url: This is the web page you maintain YOUR_DATA.html.
- Title: A title describing your dataset.
- Authors: A comma separated list of authors.
- Key facts: Data dimensions (number of variables, number of entries), variable types, missing data, etc.
- Keywords: A comma separated list of keywords.
- Abstract: A brief description of your dataset and the task to be solved, including:
  - Data description.
  - Task description.
  - Result format.
  - Result submission method.
  - Evaluation metrics.
  - Suggestion of other tasks.
  If you provide a web page YOUR_DATA.html, more details can be given there.
- Supplements 1, 2, and 3: Use these fields to provide a direct pointer to a zip archive to download data, a published paper or a report on the data, some slides, etc.