Belief Networks II
Lecture 24
(Chapter 15.3-4 + new)
Artificial Intelligence I
Autumn 2001
Henry Kautz
Outline
Exact inference by enumeration
Exact inference by variable elimination
Approximate inference by stochastic simulation
Approximate inference by Markov chain Monte Carlo
Inference tasks
Causal: Given burglary, what is probability John calls?
Diagnostic: Given John calls, what is probability of earthquake?
Mixed: Given John calls and there is an earthquake, what is the probability of burglary?
Most Probable Explanation: Given earthquake, what is the most likely simultaneous setting of all of the other variables?
Inference by enumeration
Slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation
Simple query on the burglary network:
$P(B \mid j, m) = P(B, j, m)/P(j, m)$
$= \alpha\, P(B, j, m)$
$= \alpha \sum_e \sum_a P(B, e, a, j, m)$
Rewrite full joint entries using product of CPT entries:
$P(B \mid j, m) = \alpha \sum_e \sum_a P(B)\, P(e)\, P(a \mid B, e)\, P(j \mid a)\, P(m \mid a)$
$= \alpha\, P(B) \sum_e P(e) \sum_a P(a \mid B, e)\, P(j \mid a)\, P(m \mid a)$
Enumeration algorithm
Exhaustive depth-first enumeration: $O(n)$ space, $O(d^n)$ time
[Pseudocode: the Enumeration-Ask algorithm]
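As a concrete illustration, here is a minimal Python sketch of depth-first enumeration on the burglary network. The CPT values are the standard textbook numbers (an assumption; they are not given on this slide), with B, E, A, J, M abbreviating Burglary, Earthquake, Alarm, JohnCalls, MaryCalls:

```python
# Burglary network: parents and CPTs (standard textbook values, an assumption).
# Each CPT maps a tuple of parent values to P(variable = true | parents).
parents = {'B': [], 'E': [], 'A': ['B', 'E'], 'J': ['A'], 'M': ['A']}
cpt = {
    'B': {(): 0.001},
    'E': {(): 0.002},
    'A': {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001},
    'J': {(True,): 0.90, (False,): 0.05},
    'M': {(True,): 0.70, (False,): 0.01},
}

def prob(var, value, assignment):
    """P(var = value | parents), read off the CPT."""
    p_true = cpt[var][tuple(assignment[pa] for pa in parents[var])]
    return p_true if value else 1.0 - p_true

def enumerate_all(variables, assignment):
    """Depth-first sum over all unassigned variables (topological order)."""
    if not variables:
        return 1.0
    first, rest = variables[0], variables[1:]
    if first in assignment:
        return prob(first, assignment[first], assignment) * enumerate_all(rest, assignment)
    return sum(prob(first, v, {**assignment, first: v}) *
               enumerate_all(rest, {**assignment, first: v})
               for v in (True, False))

def enumeration_ask(X, evidence, order=('B', 'E', 'A', 'J', 'M')):
    """Normalized distribution P(X | evidence) by full enumeration."""
    dist = {v: enumerate_all(list(order), {**evidence, X: v}) for v in (True, False)}
    z = sum(dist.values())
    return {v: p / z for v, p in dist.items()}

# P(Burglary | JohnCalls = true, MaryCalls = true) is about <0.284, 0.716>.
print(enumeration_ask('B', {'J': True, 'M': True}))
```

Note the exponential blowup: the recursion revisits the same subproblems once per branch, which is exactly the inefficiency variable elimination removes.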
Inference by variable elimination
Enumeration is inefficient: repeated computation, e.g., computes $P(j \mid a)\, P(m \mid a)$ for each value of $e$
Variable elimination: carry out summations right-to-left,
storing intermediate results (factors) to avoid recomputation
$P(B \mid j, m) = \alpha\, f_B(B) \times \sum_e f_E(e) \times \sum_a f_A(a, B, e) \times f_J(a) \times f_M(a)$
$= \alpha\, f_B(B) \times \sum_e f_E(e) \times f_{\bar{A}JM}(B, e)$ (sum out $A$)
$= \alpha\, f_B(B) \times f_{\bar{E}\bar{A}JM}(B)$ (sum out $E$)
A form of dynamic programming. Can also be implemented using message passing of intermediate results.
Variable elimination: Basic operations
Pointwise product of factors $f_1$ and $f_2$:
$f_1(x_1, \ldots, x_j, y_1, \ldots, y_k) \times f_2(y_1, \ldots, y_k, z_1, \ldots, z_l)$
$= f(x_1, \ldots, x_j, y_1, \ldots, y_k, z_1, \ldots, z_l)$
E.g.,
$f_1(a, b) \times f_2(b, c) = f(a, b, c)$
Summing out a variable from a product of factors: move any constant factors outside the summation:
$\sum_x f_1 \times \cdots \times f_k = f_1 \times \cdots \times f_i \sum_x f_{i+1} \times \cdots \times f_k = f_1 \times \cdots \times f_i \times f_{\bar{X}}$
assuming $f_1, \ldots, f_i$ do not depend on $X$
Variable elimination algorithm
[Pseudocode: the Elimination-Ask algorithm]
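A runnable Python sketch of variable elimination, with factors represented as tables over boolean variables. The burglary-network CPT values are again the standard textbook numbers (an assumption), and the elimination order follows the slide's derivation: sum out $A$, then $E$:

```python
from itertools import product

# A factor maps assignments of its variables (tuples of booleans) to numbers.
class Factor:
    def __init__(self, variables, table):
        self.variables = variables
        self.table = table

def pointwise_product(f1, f2):
    """Pointwise product: f(X,Y,Z) = f1(X,Y) * f2(Y,Z)."""
    variables = f1.variables + [v for v in f2.variables if v not in f1.variables]
    table = {}
    for values in product([True, False], repeat=len(variables)):
        a = dict(zip(variables, values))
        table[values] = (f1.table[tuple(a[v] for v in f1.variables)] *
                         f2.table[tuple(a[v] for v in f2.variables)])
    return Factor(variables, table)

def sum_out(f, var):
    """Sum a variable out of a factor, collapsing rows that agree elsewhere."""
    i = f.variables.index(var)
    variables = f.variables[:i] + f.variables[i + 1:]
    table = {}
    for values, p in f.table.items():
        key = values[:i] + values[i + 1:]
        table[key] = table.get(key, 0.0) + p
    return Factor(variables, table)

# Burglary network CPTs as factors (standard textbook values, an assumption).
f_B = Factor(['B'], {(True,): 0.001, (False,): 0.999})
f_E = Factor(['E'], {(True,): 0.002, (False,): 0.998})
f_A = Factor(['A', 'B', 'E'], {
    (True, True, True): 0.95,   (True, True, False): 0.94,
    (True, False, True): 0.29,  (True, False, False): 0.001,
    (False, True, True): 0.05,  (False, True, False): 0.06,
    (False, False, True): 0.71, (False, False, False): 0.999})
# Evidence j = true, m = true turns the J and M CPTs into factors over A alone.
f_J = Factor(['A'], {(True,): 0.90, (False,): 0.05})
f_M = Factor(['A'], {(True,): 0.70, (False,): 0.01})

# Eliminate A, then E (right-to-left in the slide's expression).
f1 = sum_out(pointwise_product(pointwise_product(f_A, f_J), f_M), 'A')
f2 = sum_out(pointwise_product(f_E, f1), 'E')
posterior = pointwise_product(f_B, f2)
z = sum(posterior.table.values())
print(posterior.table[(True,)] / z)   # P(B = true | j, m), about 0.284
```

Each intermediate factor is computed once and reused, which is the dynamic-programming saving over plain enumeration.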
Complexity of Bayes net inference
Singly connected networks (or polytrees):
- any two nodes are connected by at most one (undirected) path
- time and space cost of variable elimination are $O(d^k n)$, i.e., linear in the size of the network
Multiply connected networks:
- can reduce 3SAT to exact inference $\Rightarrow$ NP-hard
- equivalent to counting 3SAT models $\Rightarrow$ #P-complete
[Figure: reduction of 3SAT to Bayes net inference]
Inference by stochastic simulation
Basic idea:
1) Draw $N$ samples from a sampling distribution $S$
2) Compute an approximate posterior probability $\hat{P}$
3) Show this converges to the true probability $P$
Outline:
- Sampling from an empty network
- Rejection sampling: reject samples disagreeing with evidence
- Likelihood weighting: use evidence to weight samples
- MCMC: sample from a stochastic process whose stationary distribution is the true posterior
Sampling from an empty network
[Pseudocode: the Prior-Sample algorithm]
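A runnable sketch of prior sampling on the four-variable sprinkler network used in the later examples. The CPT values are the standard textbook numbers (an assumption; they are not given on this slide):

```python
import random

# Sprinkler network: Cloudy -> {Sprinkler, Rain} -> WetGrass.
# CPT values are the standard textbook numbers (an assumption).
def prior_sample(rng):
    """Sample each variable in topological order from its CPT."""
    c = rng.random() < 0.5                          # P(Cloudy)
    s = rng.random() < (0.1 if c else 0.5)          # P(Sprinkler | Cloudy)
    r = rng.random() < (0.8 if c else 0.2)          # P(Rain | Cloudy)
    w = rng.random() < {(True, True): 0.99, (True, False): 0.90,
                        (False, True): 0.90, (False, False): 0.0}[(s, r)]
    return c, s, r, w

rng = random.Random(0)
samples = [prior_sample(rng) for _ in range(10000)]
# The fraction of samples with Rain = true approaches the prior P(Rain) = 0.5.
print(sum(r for _, _, r, _ in samples) / len(samples))
```

Because each variable is sampled from its CPT given already-sampled parents, a complete sample occurs with exactly its joint probability, which is the consistency property proved on the next slide.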
[Figure: the sprinkler network]
Sampling from an empty network contd.
Probability that PriorSample generates a particular event
$S_{PS}(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid parents(X_i)) = P(x_1, \ldots, x_n)$
i.e., the true prior probability
Let $N_{PS}(\mathbf{V} = \mathbf{v})$ be the number of samples generated for which $\mathbf{V} = \mathbf{v}$, for any set of variables $\mathbf{V}$.
Then $\hat{P}(\mathbf{V} = \mathbf{v}) = N_{PS}(\mathbf{V} = \mathbf{v})/N$ and
$\lim_{N \to \infty} \hat{P}(\mathbf{V} = \mathbf{v}) = S_{PS}(\mathbf{V} = \mathbf{v}) = P(\mathbf{V} = \mathbf{v})$
i.e., the estimates are consistent
Rejection sampling
$\hat{\mathbf{P}}(X \mid \mathbf{e})$ estimated from samples agreeing with $\mathbf{e}$
[Pseudocode: the Rejection-Sampling algorithm]
E.g., estimate $\mathbf{P}(Rain \mid Sprinkler = true)$ using 100 samples
27 samples have $Sprinkler = true$
Of these, 8 have $Rain = true$ and 19 have $Rain = false$.
$\hat{\mathbf{P}}(Rain \mid Sprinkler = true) = \alpha\, \langle 8, 19 \rangle = \langle 0.296, 0.704 \rangle$
Similar to a basic real-world empirical estimation procedure
Analysis of rejection sampling
$\hat{\mathbf{P}}(X \mid \mathbf{e}) = \alpha\, \mathbf{N}_{PS}(X, \mathbf{e})$ (algorithm defn.)
$= \mathbf{N}_{PS}(X, \mathbf{e}) / N_{PS}(\mathbf{e})$ (normalized by $N_{PS}(\mathbf{e})$)
$\approx \mathbf{P}(X, \mathbf{e}) / P(\mathbf{e})$ (property of PriorSample)
$= \mathbf{P}(X \mid \mathbf{e})$ (defn. of conditional probability)
Hence rejection sampling returns consistent posterior estimates
Problem: hopelessly expensive if $P(\mathbf{e})$ is small
Likelihood weighting
Idea: fix evidence variables, sample only nonevidence variables,
and weight each sample by the likelihood it accords the evidence
[Pseudocode: the Likelihood-Weighting algorithm]
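A minimal Python sketch for the sprinkler network (standard textbook CPT values, an assumption): the evidence variables $Sprinkler = true$ and $WetGrass = true$ are never sampled; instead each sample accumulates the probability the evidence would have had.

```python
import random

rng = random.Random(0)

# Likelihood weighting for P(Rain | Sprinkler = true, WetGrass = true)
# in the sprinkler network (standard textbook CPT values, an assumption).
def weighted_sample():
    """Sample nonevidence variables; weight by the likelihood of the evidence."""
    w = 1.0
    c = rng.random() < 0.5                     # sample Cloudy from its prior
    w *= 0.1 if c else 0.5                     # evidence: P(Sprinkler = true | c)
    r = rng.random() < (0.8 if c else 0.2)     # sample Rain given Cloudy
    w *= 0.99 if r else 0.90                   # evidence: P(WetGrass = true | s = true, r)
    return r, w

num = den = 0.0
for _ in range(10000):
    r, w = weighted_sample()
    den += w
    num += w if r else 0.0
print(num / den)   # close to the exact posterior, about 0.320
```

Every sample is used, unlike rejection sampling, but samples whose nonevidence values make the evidence unlikely contribute only tiny weights.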
Likelihood weighting example
Estimate $\mathbf{P}(Rain \mid Sprinkler = true, WetGrass = true)$
[Figure: the sprinkler network]
LW example contd.
Sample generation process:
1. $w \leftarrow 1.0$
2. Sample $\mathbf{P}(Cloudy) = \langle 0.5, 0.5 \rangle$; say $true$
3. $Sprinkler$ has evidence value $true$, so $w \leftarrow w \times P(Sprinkler = true \mid Cloudy = true) = 0.1$
4. Sample $\mathbf{P}(Rain \mid Cloudy = true) = \langle 0.8, 0.2 \rangle$; say $true$
5. $WetGrass$ has evidence value $true$, so $w \leftarrow w \times P(WetGrass = true \mid Sprinkler = true, Rain = true) = 0.099$
Likelihood weighting analysis
Sampling probability for WeightedSample is
$S_{WS}(\mathbf{z}, \mathbf{e}) = \prod_{i=1}^{l} P(z_i \mid parents(Z_i))$
Note: pays attention to evidence in ancestors only
$\Rightarrow$ somewhere ``in between'' prior and posterior distribution
Weight for a given sample $\mathbf{z}, \mathbf{e}$ is $w(\mathbf{z}, \mathbf{e}) = \prod_{i=1}^{m} P(e_i \mid parents(E_i))$
Weighted sampling probability is
$S_{WS}(\mathbf{z}, \mathbf{e})\, w(\mathbf{z}, \mathbf{e}) = \prod_{i=1}^{l} P(z_i \mid parents(Z_i)) \prod_{i=1}^{m} P(e_i \mid parents(E_i)) = P(\mathbf{z}, \mathbf{e})$ (by standard global semantics of network)
Hence likelihood weighting returns consistent estimates
but performance still degrades with many evidence variables
Approximate inference using MCMC
``State'' of network = current assignment to all variables
Generate next state by sampling one variable given Markov blanket
Sample each variable in turn, keeping evidence fixed
[Pseudocode: the MCMC-Ask (Gibbs sampling) algorithm]
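A runnable Gibbs-sampling sketch for the sprinkler network (standard textbook CPT values, an assumption), estimating $P(Rain \mid Sprinkler = true, WetGrass = true)$. Each nonevidence variable is resampled from its distribution given its Markov blanket:

```python
import random

rng = random.Random(0)

# Gibbs sampling for P(Rain | Sprinkler = true, WetGrass = true) in the
# sprinkler network (standard textbook CPT values, an assumption).

def p_cloudy(r):
    """P(Cloudy = true | Sprinkler = true, Rain = r): Cloudy's Markov blanket."""
    pt = 0.5 * 0.1 * (0.8 if r else 0.2)    # P(c) P(s = true | c) P(r | c), c = true
    pf = 0.5 * 0.5 * (0.2 if r else 0.8)    # same with c = false
    return pt / (pt + pf)

def p_rain(c):
    """P(Rain = true | Cloudy = c, Sprinkler = true, WetGrass = true)."""
    pt = (0.8 if c else 0.2) * 0.99         # P(r | c) P(w = true | s = true, r), r = true
    pf = (0.2 if c else 0.8) * 0.90         # same with r = false
    return pt / (pt + pf)

c, r = True, False          # arbitrary initial state; evidence stays fixed
count = 0
N = 50000
for _ in range(N):
    c = rng.random() < p_cloudy(r)     # resample Cloudy given its blanket
    r = rng.random() < p_rain(c)       # resample Rain given its blanket
    count += r
print(count / N)   # long-run fraction of Rain = true approaches the posterior
```

The long-run fraction of states with $Rain = true$ converges to the true posterior (about 0.32 with these CPTs), illustrating the stationary-distribution claim above; in practice one would also discard an initial burn-in period.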
Approaches stationary distribution: long-run fraction of time spent in each state is exactly proportional to its posterior probability
MCMC Example
Estimate $\mathbf{P}(Rain \mid Sprinkler = true, WetGrass = true)$
Sample $Cloudy$, then $Rain$; repeat.
Count number of times $Rain$ is true and false in the samples.
Markov blanket of $Cloudy$ is $Sprinkler$ and $Rain$
Markov blanket of $Rain$ is $Cloudy$, $Sprinkler$, and $WetGrass$
[Figure: the sprinkler network]
MCMC example contd.
Random initial state: $Cloudy = true$ and $Rain = false$
1. $\mathbf{P}(Cloudy \mid MB(Cloudy)) = \mathbf{P}(Cloudy \mid Sprinkler = true, Rain = false)$; sample $\Rightarrow$ $false$
2. $\mathbf{P}(Rain \mid MB(Rain)) = \mathbf{P}(Rain \mid Cloudy = false, Sprinkler = true, WetGrass = true)$; sample $\Rightarrow$ $true$
Visit 100 states: 31 have $Rain = true$, 69 have $Rain = false$
$\hat{\mathbf{P}}(Rain \mid Sprinkler = true, WetGrass = true) = \alpha\, \langle 31, 69 \rangle = \langle 0.31, 0.69 \rangle$
MCMC analysis: Outline
Transition probability $q(\mathbf{x} \to \mathbf{x}')$
Occupancy probability $\pi_t(\mathbf{x})$ at time $t$
Equilibrium condition on $\pi_t$ defines stationary distribution $\pi(\mathbf{x})$
Note: stationary distribution depends on choice of $q(\mathbf{x} \to \mathbf{x}')$
Pairwise detailed balance on states guarantees equilibrium
Gibbs sampling transition probability:
sample each variable given current values of all others
$\Rightarrow$ detailed balance with the true posterior
For Bayesian networks, Gibbs sampling reduces to
sampling conditioned on each variable's Markov blanket
Stationary distribution
$\pi_t(\mathbf{x})$ = probability in state $\mathbf{x}$ at time $t$
$\pi_{t+1}(\mathbf{x}')$ = probability in state $\mathbf{x}'$ at time $t+1$
$\pi_{t+1}$ in terms of $\pi_t$ and $q(\mathbf{x} \to \mathbf{x}')$:
$\pi_{t+1}(\mathbf{x}') = \sum_{\mathbf{x}} \pi_t(\mathbf{x})\, q(\mathbf{x} \to \mathbf{x}')$
Stationary distribution: $\pi_t = \pi_{t+1} = \pi$, so $\pi(\mathbf{x}') = \sum_{\mathbf{x}} \pi(\mathbf{x})\, q(\mathbf{x} \to \mathbf{x}')$ for all $\mathbf{x}'$
In equilibrium, expected ``outflow'' = expected ``inflow''
Detailed balance
``Outflow'' = ``inflow'' for each pair of states:
$\pi(\mathbf{x})\, q(\mathbf{x} \to \mathbf{x}') = \pi(\mathbf{x}')\, q(\mathbf{x}' \to \mathbf{x})$ for all $\mathbf{x}, \mathbf{x}'$
Detailed balance implies stationarity:
$\sum_{\mathbf{x}} \pi(\mathbf{x})\, q(\mathbf{x} \to \mathbf{x}') = \sum_{\mathbf{x}} \pi(\mathbf{x}')\, q(\mathbf{x}' \to \mathbf{x}) = \pi(\mathbf{x}') \sum_{\mathbf{x}} q(\mathbf{x}' \to \mathbf{x}) = \pi(\mathbf{x}')$
MCMC algorithms typically constructed by designing a transition probability $q$ that is in detailed balance with desired $\pi$
Gibbs sampling
Sample each variable in turn, given all other variables
Sampling $X_i$, let $\bar{\mathbf{Y}}_i$ be all other nonevidence variables
Current values are $x_i$ and $\bar{\mathbf{y}}_i$; $\mathbf{e}$ is fixed
Transition probability is given by
$q(x_i, \bar{\mathbf{y}}_i \to x_i', \bar{\mathbf{y}}_i) = P(x_i' \mid \bar{\mathbf{y}}_i, \mathbf{e})$
Markov blanket sampling
A variable is independent of all others given its Markov blanket:
$P(x_i' \mid \bar{\mathbf{y}}_i, \mathbf{e}) = P(x_i' \mid mb(X_i))$
Probability given the Markov blanket is calculated as follows:
$P(x_i' \mid mb(X_i)) = \alpha\, P(x_i' \mid parents(X_i)) \prod_{Z_j \in Children(X_i)} P(z_j \mid parents(Z_j))$
Hence computing the sampling distribution over $X_i$ for each flip requires just $O(kd)$ multiplications if $X_i$ has $k$ children and $d$ values; can cache it if $k$ is not too large.
Main computational problems:
1) Difficult to tell if convergence has been achieved
2) Can be wasteful if Markov blanket is large: $P(x_i' \mid mb(X_i))$ won't change much (law of large numbers)
Performance of approximation algorithms
Absolute approximation: $|P(X \mid \mathbf{e}) - \hat{P}(X \mid \mathbf{e})| \le \epsilon$
Relative approximation: $\frac{|P(X \mid \mathbf{e}) - \hat{P}(X \mid \mathbf{e})|}{P(X \mid \mathbf{e})} \le \epsilon$
Relative $\Rightarrow$ absolute since $0 \le P \le 1$ (but $P$ may be $O(2^{-n})$)
Randomized algorithms may fail with probability at most $\delta$
Polytime approximation: $poly(n, \epsilon^{-1}, \log \delta^{-1})$
Theorem (Dagum and Luby, 1993): both absolute and relative approximation for either deterministic or randomized algorithms are NP-hard for any $\epsilon, \delta < 0.5$
(Absolute approximation polytime with no evidence--Chernoff bounds)