Using real data (the anonymized logs of a web server),
determine the causal structure: which pages link to other pages and so lead to visits of them.
The ground truth is known with certainty from the referrer information, but this information is withheld so that the evaluation is objective.
The trend towards privacy, and its relation to electronic data storage, motivates this problem.
Data format - Input:
A matrix of 512 days by 20 pages containing integers: each entry is the number of visits to that page during that day.
The calendar dates are also provided for those who need them.
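
As a sanity check on the input described above, the following is a minimal sketch that verifies a table of daily counts has the stated shape. The function name `validate_counts` and the list-of-lists representation are assumptions made for illustration; the on-disk file format is not specified here.

```python
def validate_counts(counts, n_days=512, n_pages=20):
    """Check that `counts` is an n_days x n_pages table of non-negative integers.

    `counts` is assumed to be a list of rows, one row per day, each row
    holding one visit count per page (hypothetical in-memory layout).
    """
    if len(counts) != n_days:
        raise ValueError(f"expected {n_days} rows (days), got {len(counts)}")
    for day, row in enumerate(counts):
        if len(row) != n_pages:
            raise ValueError(
                f"day {day}: expected {n_pages} columns (pages), got {len(row)}"
            )
        for page, value in enumerate(row):
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value < 0:
                raise ValueError(f"day {day}, page {page}: invalid count {value!r}")
    return True
```

Running this once after loading the data catches a transposed matrix (20 by 512) before any modelling starts.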
Data format - Output:
A 20-by-20 matrix whose entry at position (u,v) is the probability that a visit of the page 'u' causes a visit of the page 'v'.
Thus, 1 means 100% causal implication (deterministic: each visit of the page 'u' causes a visit of the page 'v'), while 0 means that visits of page 'u' have no causal effect on visits of page 'v'.
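
To make the meaning of entry (u,v) concrete, here is a hedged sketch of how such a matrix could be measured if referrer records were available. The record layout, a `(referrer, page)` pair with `None` for arrivals from outside the site, and the name `transition_probabilities` are assumptions for illustration, not the organizers' actual procedure.

```python
def transition_probabilities(visits, n_pages=20):
    """Estimate P(visit of u causes visit of v) from referrer records.

    `visits` is an iterable of (referrer, page) pairs of page indices;
    referrer is None when the visit did not come from one of the pages
    (hypothetical record format).
    """
    visit_counts = [0] * n_pages
    transitions = [[0] * n_pages for _ in range(n_pages)]
    for referrer, page in visits:
        visit_counts[page] += 1
        if referrer is not None:
            transitions[referrer][page] += 1

    probs = [[0.0] * n_pages for _ in range(n_pages)]
    for u in range(n_pages):
        if visit_counts[u] == 0:
            continue  # page never visited: leave its row at 0
        for v in range(n_pages):
            # Assumption: clip at 1, since one visit of u may be followed
            # by several referred visits of v.
            probs[u][v] = min(1.0, transitions[u][v] / visit_counts[u])
    return probs
```

For example, if page 0 is visited three times and two of those visits lead to page 1, the entry (0,1) is 2/3, while the entry (1,0) stays 0.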
Since we have the ground truth, we will compute the correlation between the submitted arc strengths and the transition probabilities measured on a hold-out dataset of the same size.
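
The scoring step can be sketched as follows. This assumes the Pearson correlation over the matrix entries and, by default, excludes the diagonal; the text above says only "correlation", so both choices, and the name `correlation_score`, are assumptions.

```python
import math


def correlation_score(predicted, measured, include_diagonal=False):
    """Pearson correlation between two square matrices, entry by entry.

    `predicted` holds the submitted arc strengths and `measured` the
    hold-out transition probabilities, both as lists of rows.
    """
    xs, ys = [], []
    n = len(predicted)
    for u in range(n):
        for v in range(n):
            if u == v and not include_diagonal:
                continue  # assumption: self-links are not scored
            xs.append(predicted[u][v])
            ys.append(measured[u][v])

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant matrix carries no information; score it as 0
    return sxy / math.sqrt(sxx * syy)
```

Because the Pearson correlation is invariant to positive scaling and shifting, under this assumption only the relative ordering and spacing of the arc strengths affect the score, not their absolute calibration.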