## Supplementary Information on Experimental Methods

Some of this is already in the paper, but is repeated here for completeness and clarity. I attempt to describe, in order, all data processing done to obtain the models and numerical results.

### How we handled data and trained models

#### Splitting into cross-validation sets

If the full dataset (before any train/test split) has fewer than 2,000 examples, AND no predefined train/test split is provided, we split the dataset into 10 subsets. Each example from the original complete dataset is randomly assigned to one of the subsets with uniform probability. (Some subsets are therefore larger than others.) We train on 9 subsets and test on the 10th, rotating so that each subset serves as the test set exactly once.
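The fold assignment and rotation can be sketched as follows. This is a Python illustration, not the original implementation (which used Perl); the function name and seed handling are ours.

```python
import random

def assign_folds(n_examples, n_folds=10, seed=1):
    """Assign each example index to one of n_folds subsets,
    uniformly at random; subset sizes may therefore differ."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for i in range(n_examples):
        folds[rng.randrange(n_folds)].append(i)
    return folds

folds = assign_folds(1000)
# Rotation: fold k is the test set; the other nine form the training set.
splits = [(sum((folds[j] for j in range(10) if j != k), []), folds[k])
          for k in range(10)]
```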

#### Discretization

For each training/test set pair, each continuous variable was discretized. Only the training data was used to determine the discretization, but the test set followed the same discretization as the training set. Note that different cross-validation splits could have slightly different discretizations (e.g., the value 0.2 might map to bin 1 in one split but to bin 2 in another).

For each continuous variable, we sorted all values. We used the values at the 20th, 40th, 60th, and 80th percentiles as the bin thresholds. If all values are distinct, then each bin will contain the same number of values (plus or minus one, since the number of values might not be a multiple of 5). If there are duplicates, then the bin sizes could be unequal. For example, if 85% of the values are 0, then the discretization yields two effective bins: one containing only zeroes, and one containing everything greater than zero. This happens because the 20th, 40th, 60th, and 80th percentile values are all 0 (in this example).
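A minimal Python sketch of this percentile binning follows. The exact index convention for picking the percentile values (`int(n * p / 100)` over the sorted list) is an assumption about the implementation, not something stated in the text.

```python
def bin_thresholds(values):
    """Bin edges at the 20th/40th/60th/80th percentiles of the
    sorted training values (index convention is an assumption)."""
    s = sorted(values)
    n = len(s)
    return [s[int(n * p / 100)] for p in (20, 40, 60, 80)]

def discretize(x, thresholds):
    """Bin index 0..4; duplicate thresholds collapse bins."""
    return sum(x > t for t in thresholds)

# The 85%-zeros example: all four thresholds are 0,
# so only two effective bins remain (zero vs. greater than zero).
thr = bin_thresholds([0] * 85 + list(range(1, 16)))
```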

As in the discrete case, "Missing" remained a distinct value.

#### Hold-out data

We run the following procedure on every single training set. Therefore, for cross-validated datasets, it gets run 10 times, once for each train/test split. Examples that are held out from one split may not be held out in another split, even if the examples are present in both, because the hold-out sets are selected independently.

• If the training set has fewer than 3,000 examples, examples are placed in the hold-out set with probability 1/3.
• If the training set has more than 10,000 examples, examples are placed in the hold-out set with probability 1/10.
• If the training set contains at least 3,000 but no more than 10,000 examples, then examples are placed in the hold-out set with probability 1,000/(training set size). That is, the expected size of the hold-out set is 1,000, but the actual number may differ.
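The selection rule above can be sketched in Python (the original used the Perl random number generator seeded with 1; the function names here are ours):

```python
import random

def holdout_prob(train_size):
    """Per-example probability of going into the hold-out set."""
    if train_size < 3000:
        return 1.0 / 3
    if train_size > 10000:
        return 1.0 / 10
    return 1000.0 / train_size  # expected hold-out size: 1,000

def select_holdout(train_size, seed=1):
    """Independently flip a coin for each training example."""
    rng = random.Random(seed)
    p = holdout_prob(train_size)
    return [i for i in range(train_size) if rng.random() < p]
```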

The same hold-out set was used for training both the NBE models and the WinMine models.

For the random selection, the Perl random number generator was seeded with 1. (This is probably unimportant.)

### NBE parameters

The following NBE parameters were set on the command line:

• Random seed: 1
• Initial number of clusters:
  • 50 if the training set has more than 1,500 examples
  • 10 if the training set has 1,500 examples or fewer

All other parameters are model defaults (and therefore findable in the source code):

• \delta_{EM} = 0.0001

The variance used for initializing the Gaussian distributions in new clusters is (overall observed variable variance)/2.5. This is irrelevant for our present experiments, however, since we discretized all continuous variables.

#### Pruning

The pruning frequency for removing unlikely clusters is once after every 5 iterations of EM. Pruning removes the lowest-weight clusters that together contribute less than 0.1% of the total probability mass. Algorithmically, we sort the clusters in order of decreasing weight (P(cluster)), then add up the cluster weights in that order; once the running total reaches 0.999, we discard all the smaller clusters we have not yet reached. We also prune before adding more clusters (and reset the pruning counter that tracks how long it has been since the last pruning).
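The pruning step itself can be written as a short Python sketch (the function name is ours; weights are assumed to sum to 1):

```python
def prune_clusters(weights, keep_mass=0.999):
    """Sort clusters by decreasing weight, keep the largest ones
    until their cumulative weight reaches keep_mass, and discard
    the rest.  Returns the indices of the surviving clusters."""
    order = sorted(range(len(weights)), key=lambda i: -weights[i])
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += weights[i]
        if total >= keep_mass:
            break
    return sorted(kept)

# Drops only the 0.0005-weight cluster; the first three cover 0.9995.
survivors = prune_clusters([0.5, 0.4, 0.0995, 0.0005])
```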

### WinMine parameters

We used the -acyclic flag (so we learned Bayes nets, not dependency nets).

We used the .xplan file default.xplan. This gives every variable the default role "input-output". The default categorical distribution was "tree-multinomial" and the default numerical distribution was "tree-binGaussian" (irrelevant here, because we pre-discretized all numerical variables). We never altered these defaults for any variable, so WinMine applied them throughout.

No random seed is necessary, because WinMine uses a greedy search.

We used kappa as described below.

#### Kappa

We trained validation models on all training data except the hold-out set (the very same hold-out set used for the NBE models) and evaluated their log-likelihood on the hold-out set. Our initial value of kappa was 0.0001 (the WinMine default is 0.01). We increased it by an order of magnitude each iteration until the log-likelihood of the hold-out set got worse. We then used the best kappa value seen (the one from the next-to-last validation model trained) when learning a new model on all the data (training plus hold-out).
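The kappa search can be sketched as follows. `train_fn` and `holdout_ll_fn` are hypothetical callables standing in for invoking WinMine and scoring the hold-out set; the demo "model" is a toy whose hold-out log-likelihood peaks at kappa = 0.01.

```python
import math

def tune_kappa(train_fn, holdout_ll_fn, kappa0=1e-4):
    """Grow kappa by 10x per iteration until the hold-out
    log-likelihood worsens; return the best kappa seen
    (the one from the next-to-last model trained)."""
    kappa = kappa0
    best_kappa, best_ll = kappa, holdout_ll_fn(train_fn(kappa))
    while True:
        kappa *= 10
        ll = holdout_ll_fn(train_fn(kappa))
        if ll < best_ll:
            return best_kappa
        best_kappa, best_ll = kappa, ll

# Toy demo: the "model" is just kappa itself, and the hold-out
# score is a concave function of log10(kappa) peaking at 0.01.
best = tune_kappa(lambda k: k,
                  lambda k: -(round(math.log10(k)) + 2) ** 2)
```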

#### Other WinMine defaults

• -min (minimum count to split): 10
• -bdu (use BDu param prior with given equiv. sample size): -1
• -ksqmarg (use marginal mean KSQ parameter prior with given equiv. sample size): -1
• -maxin (maximum number of input features): -1 (= all)
• -maxout (maximum number of output features): -1 (= all)
• -dynsplits (number of split points for continuous predictors): 7

The last is irrelevant, since we did our own discretization, and most of the others are effectively disabled; -min is the only remaining setting of interest.

### Gibbs sampling details

For initializing the sample (before burn-in), we randomly sample each variable given its parents. (We can always find an order that allows this, because the network is acyclic.) If there is no evidence, this produces a random sample from the joint distribution. If there is evidence, the result can be wrong: an evidence variable might be the child of a non-evidence variable, and this initialization ignores that probabilistic dependency. Even so, it seemed like a better strategy than uniform random initialization, because it is sometimes right (though it could sometimes be more wrong than uniform random).
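This forward-sampling initialization can be sketched in Python. Here `sample_fn` is a hypothetical interface to the learned network's conditional distributions; `order` is any topological order of the variables.

```python
import random

def forward_init(order, sample_fn, evidence, seed=1):
    """Initialize the Gibbs state by sampling each non-evidence
    variable given its parents, in topological order.  Evidence
    variables are clamped; note that downstream evidence has no
    effect on its non-evidence ancestors here, which is exactly
    the approximation described in the text."""
    rng = random.Random(seed)
    state = dict(evidence)  # clamp evidence variables
    for v in order:
        if v not in state:
            state[v] = sample_fn(v, state, rng)
    return state
```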

Then we burn in, and then we sample. Each burn-in or sampling iteration consists of randomly sampling each non-evidence variable in turn. We generate counts every time we sample a query variable, so the total number of counts exceeds the number of iterations when there are multiple query variables. (We still normalize properly to arrive at the correct probabilities.)

For single-variable queries, we do not do any smoothing -- this is because all of a variable's states receive some probability each time it is sampled (thanks to the Rao-Blackwellisation, as described in the paper). This is not necessarily the case with joint queries.

For joint queries, due to the large number of potential configurations, we do not keep counts for every single one. In fact, we only keep counts for the chosen configuration (the one observed in the test data). For smoothing, if there are N configurations of the query variables, then we give each configuration 1.0/N initial counts. (We also divide by (#query vars) * (num sampling iterations) + 1 at the end, in order to achieve proper normalization. That is, we do the right thing.)
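The smoothing and normalization for joint queries can be written out as a one-line Python sketch (the function name is ours). A configuration is counted once per query variable per sampling iteration, and each of the N configurations receives 1/N initial counts, so the normalizer is (#query vars) * (#iterations) + 1.

```python
def joint_query_prob(hits, n_query_vars, n_iters, n_configs):
    """Smoothed probability estimate for one configuration of the
    query variables: hits is how many times that configuration was
    counted during sampling; each configuration starts with
    1/n_configs initial counts."""
    return (hits + 1.0 / n_configs) / (n_query_vars * n_iters + 1)
```

With 2 query variables, 100 sampling iterations, and 4 configurations, the estimates over all configurations sum to 1, since the raw counts sum to 2 * 100 and the smoothing adds a total of 1.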

We do not use Rao-Blackwellisation for the multiple-variable queries.

Comments to Daniel Lowd (lowd at cs dot washington dot edu)