Causality Workbench: Challenges in Machine Learning Causality

What is the Virtual Laboratory?

The Virtual Laboratory allows causality researchers to perform experiments on artificial systems in order to unravel causal relationships. Download our Technical Report.

One important goal of causal modeling is to unravel enough of the data-generating process to predict the consequences of actions (also called interventions, experiments, or manipulations) performed by an external agent. This setup violates the classical i.i.d. assumption commonly made in machine learning. For instance, in policy making, one may want to predict "the effect on the health status of a population" of "banning smoking in public places" before passing a law.

While many algorithms have been proposed recently to learn causal relationships from non-experimental data (simple observations without external intervention), experimentation is still considered the ultimate means of validating a causal model. But experiments are often costly, and sometimes impossible or unethical to perform.

The virtual lab offers researchers the possibility to experiment with simulated systems (for a list, see the Index page) by setting the values of certain variables and observing others. It features a realistic setup in which the artificial systems model real applications in medicine, marketing, life sciences, social sciences, etc., and the experiments cost a certain amount of virtual cash, with simple observations without intervention generally costing less. The predictive power of the resulting models can then be evaluated on test data corresponding to new interventions.

To understand the basic concept of experimentation in causal modeling, you may want to first look at these historical examples in epidemiology brought to us by ThinkQuest:

Lung cancer
Smallpox
Food poisoning

How to Design Experiments?

Here is a brief outline of the steps taken in experimenting and causal modeling:
  1. Problem specification: Define your problem and your goals. In the Virtual Lab, problems are already formalized.
  2. Feature set definition: Identify potentially relevant factors. In the Virtual Lab, the feature set is already given: these are the system variables. In some cases, the task designer may hide a number of variables to test the robustness of algorithms against hidden confounders, which are unknown common causes of several variables in your system.
  3. Manipulation protocol: Figure out how to perform actions on the system and manipulate the variables of interest. This step is often very complex in real experiments because we do not always have easy means of influencing variables individually as an external agent. Not all variables are actionable or even observable, and some may be unethical to manipulate. In the Virtual Lab, things are simple: we tell you which variables are actionable, and all you have to do to carry out experiments is to initialize or clamp the desired variables.
  4. Experimental design: Given a budget (here, your "virtual cash"), spend it on data collection, observations, and manipulations to achieve the goals you have set for yourself (a small budget-planning sketch is given below).
  5. Modeling: Carry out the experiments and build models with the data collected. Iterate this process if necessary until a satisfactory model is obtained. In the Virtual Lab, all you have to do is submit queries via the Upload page, using the format described below. Your virtual cash account will be automatically debited, and you will be able to download the results of your experiments from your private Mylab page.
  6. Deployment: Deploy your model to predict the consequences of actions in new situations. In the Virtual Lab, we provide you with test data drawn from a post-manipulation distribution. The manipulations are performed by the task designers; depending on the task, they may or may not tell you which exact manipulation(s) were performed on the test data. When you are done with modeling, and before you run out of virtual cash, you must ask for the test data. WARNING: the test data will cost you virtual cash, so make sure you keep enough in reserve. We do not withhold a fixed amount from your cash account to pay for the test data because, if you design your experiments and your model cleverly, you may get the test data at a discount by querying only a subset of the variables. Once you ask for the test data, you must return your predictions on it; no further data queries are allowed.
We will organize competitions in the future. In a competition setup, it will not be possible to work several times on the same task. However, for the time being, you are free to experiment multiple times on the same problem and even to run concurrent experiments with different strategies.
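
As a rough illustration of steps 4 and 6, here is a minimal Python sketch of budget bookkeeping before any queries are submitted. The initial budget, the planned queries, and all prices are invented placeholders: actual costs are task-dependent and are only reported in the cost files returned with each answer (see the file formats below).

  # Hypothetical budget planning for a sequence of queries (step 4).
  # All numbers are placeholders: real costs depend on the task and are
  # reported in the [answer].*cost files returned with each answer.
  budget = 10000.0                       # initial virtual cash (placeholder)
  reserve_for_test = 2500.0              # keep enough cash to buy the test data (step 6)
  planned_queries = [
      ("default training set (TRAIN)", 2000.0),
      ("observational data (OBS 500)", 1500.0),
      ("experiment on actionable variables (EXP)", 3000.0),
  ]

  spent = sum(cost for _, cost in planned_queries)
  remaining = budget - spent
  print(f"planned spending: {spent:.0f}, remaining: {remaining:.0f}")
  if remaining < reserve_for_test:
      print("WARNING: not enough virtual cash left to buy the test data")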

How to Submit Queries and Get Data?

Easy start

Data requests and prediction results are submitted via the Upload page. A submitted query should be a zip file bundling the files described below. Use
zip query.zip *
or
tar cvf query.tar *; gzip query.tar
to create valid archives. We provide several examples of queries for the LUCAS model:
  1. Observations. Request 25 examples of all the variables. No manipulation is performed; this is observational data only.
  2. Experiment 1. Request 10 values of the target variable. Most covariate values are provided, except for a few missing values.
  3. Experiment 2. Not all variables are manipulated; the pre-manipulation values are given by the selection of training samples.
  4. Test data. We ask for test set 2. Here we also ask for post-manipulation variables, but we will not get them because the test data does not include any post-manipulation observations.
  5. Default training set. Training data can be purchased unlabeled, which is cheaper; the labels may then be queried separately.
  6. Survey data. A query asking for a subset of the labels of the default training set.
  7. Prediction results. Predictions of the target post-manipulation values on test set 2.
If you want baseline results without experimenting, the initial budget always suffices to buy the default training set and the entire test set. Just submit two separate queries, each consisting of a single query file containing a single word (see the sketch after this list):
  1. To get training data, write the word TRAIN on the first line.
  2. To get test data, write the word TEST on the first line.
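
For instance, the following Python sketch writes each of these one-word query files and bundles it into a zip archive ready for the Upload page. The prefix "baseline" stands in for the [submission] name used in the tables below and is an arbitrary choice.

  # Minimal sketch: build the two baseline query archives described above.
  # "baseline" is an arbitrary placeholder for the [submission] file prefix.
  import zipfile

  for keyword in ("TRAIN", "TEST"):
      with open("baseline.query", "w") as f:
          f.write(keyword + "\n")                    # a single word on the first line
      with zipfile.ZipFile(f"query_{keyword.lower()}.zip", "w") as archive:
          archive.write("baseline.query")            # submit each archive separately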

File formats for data queries and prediction results

The query archive may contain the files listed below. For each file, we indicate whether it is compulsory, optional, or not applicable (NA) for the four types of submission: non-experimental data, experimental data, survey data, and prediction results.

[submission].query
  Non-experimental data: compulsory (TRAIN, TEST [n] or OBS [num]). Experimental data: compulsory (EXP). Survey data: optional (SURVEY). Prediction results: optional (PREDICT [n]).
  Description: type of query.
  Format: a single keyword on the first line, optionally followed by a number on the same line:
    TRAIN: get the default training set.
    TEST [n]: replace [n] by 1, 2, 3, ... to get the nth test set (TEST is equivalent to TEST 1).
    OBS [num]: get observational data; replace [num] by the number of samples requested.
    EXP: get experimental data (the number of samples is determined by the number of lines in [submission].sample and [submission].manipval).
    SURVEY: get training labels.
    PREDICT [n]: replace [n] by 1, 2, 3, ... to indicate that the predictions correspond to the nth test set (PREDICT is equivalent to PREDICT 1).

[submission].sample
  Non-experimental data: NA. Experimental data: optional. Survey data: compulsory. Prediction results: NA.
  Description: sample IDs in the default training set; the corresponding samples are used to set the pre-manipulation values.
  Format: a list of sample numbers, one per line (the numbering is 1-based and corresponds to lines in the training data).

[submission].premanipvar
  Non-experimental data: optional. Experimental data: optional. Survey data: optional. Prediction results: NA.
  Description: list of the pre-manipulation variables (observed before or without experimentation). By default (no file given): (1) for non-experimental and experimental data, all the observable variables except the target; (2) for survey data, the target.

[submission].manipvar
  Non-experimental data: NA. Experimental data: compulsory. Survey data: NA. Prediction results: NA.
  Description: list of the variables to be manipulated (clamped).

[submission].postmanipvar
  Non-experimental data: NA. Experimental data: compulsory. Survey data: NA. Prediction results: optional.
  Description: list of the post-manipulation variables (observed after experimentation). By default: the target variable.

Format of the variable-list files ([submission].premanipvar, [submission].manipvar, [submission].postmanipvar): a space-delimited list of variable numbers on the first line of the file. All variables are numbered from 1 to the maximum number of visible variables, except the target variable (if any), which is numbered 0.

[submission].premanipval
  Not applicable for any query type: use [submission].sample to initialize the values of the pre-manipulation variables.

[submission].manipval
  Non-experimental data: NA. Experimental data: compulsory. Survey data: NA. Prediction results: NA.
  Description: clamped values for the manipulated variables listed in [submission].manipvar.

[submission].postmanipval or [submission].predict
  Non-experimental data: NA. Experimental data: NA. Survey data: NA. Prediction results: compulsory.
  Description: prediction values for all the samples of TESTn.

Format of the value files ([submission].manipval, [submission].postmanipval, [submission].predict): each line corresponds to an instance (sample) and should contain space-delimited variable values for all the variables of that instance. Use NaN if a value is missing or omitted. The number of lines in [submission].manipval should match the number of samples in [submission].sample (if provided). You may omit [submission].query and provide [submission].predict instead of [submission].postmanipval if there is a single test set and no experiments are involved.
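
To make the formats above concrete, here is a minimal Python sketch that assembles a hypothetical experimental-data (EXP) query. The submission prefix "my", the sample IDs, the variable numbers, and the clamped values are all invented for illustration and would need to be adapted to the task at hand.

  # Sketch of an experimental-data (EXP) query bundle following the formats above.
  # Sample IDs, variable numbers and clamped values are made-up examples.
  import zipfile

  files = {
      "my.query":        "EXP\n",            # type of query
      "my.sample":       "3\n17\n42\n",      # training samples giving the pre-manipulation values
      "my.manipvar":     "2 5\n",            # clamp variables 2 and 5
      "my.manipval":     "1 0\n1 0\n0 1\n",  # one line of clamped values per sample above
      "my.postmanipvar": "0\n",              # observe the target (variable 0) after manipulation
  }
  for name, content in files.items():
      with open(name, "w") as f:
          f.write(content)
  with zipfile.ZipFile("my_exp_query.zip", "w") as archive:
      for name in files:
          archive.write(name)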

File formats for data received and prediction scores

Data archives with the training or test data you requested become available on your private Mylab page a short time after you place your query. Prediction scores are also displayed on the Leaderboard page.

The data archive you receive may contain the files listed below. For each file, we indicate whether it is present, optional, or not applicable (NA) in the five types of answer: non-experimental training data, survey data, experimental training data, test data, and evaluation scores.

[submission].query
  Non-experimental training data: optional (TRAIN or OBS). Survey data: optional (SURVEY). Experimental training data: optional (EXP). Test data: optional (TEST). Evaluation score: optional (PREDICT).
  Description: type of query. One keyword, optionally followed by a number, on the first line (copied from the query submitted).

[answer].premanipvar
  Non-experimental training data: optional. Survey data: NA. Experimental training data: present if requested. Test data: present. Evaluation score: NA.
  Description: list of the pre-manipulation variables. By default, all the observable variables except the target.

[answer].manipvar
  Non-experimental training data: NA. Survey data: NA. Experimental training data: present. Test data: optional. Evaluation score: NA.
  Description: list of the variables to be manipulated (clamped).

[answer].postmanipvar
  Non-experimental training data: NA. Survey data: NA. Experimental training data: optional. Test data: optional. Evaluation score: NA.
  Description: list of the post-manipulation variables. By default: the target variable 0.

Format of the variable-list files ([answer].premanipvar, [answer].manipvar, [answer].postmanipvar): a space-delimited list of variable numbers on the first line of the file. All variables are numbered from 1 to the maximum number of visible variables, except the default target variable (if any), which is numbered 0. If the file is missing or empty, an empty list is assumed.

[answer].premanipval or [answer].data
  Non-experimental training data: present. Survey data: NA. Experimental training data: present if requested. Test data: present. Evaluation score: NA.
  Description: pre-manipulation values. [answer].data files contain the unlabeled default training data for problems without experimentation.

[answer].label
  Non-experimental training data: NA (to get the target variable values, use the index 0). Survey data: present. Experimental training data: NA (to get the target variable values, use the index 0). Test data: NA. Evaluation score: NA.
  Description: target values for the default training examples. Equivalent to [answer].premanipval when [answer].premanipvar (with the single value 0) is omitted.

[answer].manipval
  Non-experimental training data: NA. Survey data: NA. Experimental training data: present. Test data: optional. Evaluation score: NA.
  Description: clamped values for the manipulated variables listed in [answer].manipvar. These correspond to manipulations performed by the organizers, so they are free of charge.

[answer].postmanipval
  Non-experimental training data: NA. Survey data: NA. Experimental training data: present. Test data: hidden from the participants. Evaluation score: NA.
  Description: post-manipulation values, in answer to [submission].postmanipvar.

Format of the value files ([answer].premanipval, [answer].data, [answer].label, [answer].manipval, [answer].postmanipval): each line corresponds to an instance (sample) and contains space-delimited variable values for all the variables of that instance (or a single target value for [answer].label files).

[answer].is_overbudget
  Optional in any answer. Indicates that the budget was overspent and the query was not processed. Contains the value 1.

[answer].score
  Present only with the evaluation score. The prediction score; a numeric value.

[answer].ebar
  Present only with the evaluation score. The error bar; a numeric value.

The following bookkeeping files are present in every data answer (non-experimental, survey, experimental, and test data, but not with the evaluation score); each contains a single numeric value:
  [answer].varnum: total number of observable variables (excluding the target).
  [answer].samplenum: number of samples requested.
  [answer].obsernum: number of variable values observed (including the target).
  [answer].manipnum: number of values manipulated.
  [answer].targetnum: number of target values observed.
  [answer].samplecost: cost of the samples requested (labeled samples may cost more than unlabeled samples).
  [answer].obsercost: cost of the observations made.
  [answer].manipcost: cost of the manipulations made.
  [answer].targetcost: additional cost for target observations.
  [answer].totalcost: total cost.
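
As an illustration of how these answer files might be consumed, here is a minimal Python sketch that loads a downloaded data archive and prints the bookkeeping values. It assumes the files were extracted to the current directory with the prefix "answer", and that all variable values are numeric with NaN marking missing entries; both assumptions are placeholders rather than requirements of the Virtual Lab.

  # Sketch: read the files of a downloaded data answer and check the charges.
  # "answer" is a placeholder for the actual file prefix used by the organizers.
  import os

  def read_matrix(path):
      # Space-delimited numeric values, one sample per line; NaN marks missing values.
      with open(path) as f:
          return [[float(v) for v in line.split()] for line in f if line.strip()]

  if os.path.exists("answer.premanipval"):
      data = read_matrix("answer.premanipval")       # pre-manipulation values
  else:
      data = read_matrix("answer.data")              # unlabeled default training data
  labels = read_matrix("answer.label") if os.path.exists("answer.label") else None

  # Each bookkeeping file holds a single numeric value.
  for name in ("varnum", "samplenum", "obsernum", "manipnum", "targetnum",
               "samplecost", "obsercost", "manipcost", "targetcost", "totalcost"):
      path = "answer." + name
      if os.path.exists(path):
          with open(path) as f:
              print(name, float(f.read()))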