Goes beyond statistical correlations between individual variables and detects direct vs. indirect correlations
Set expectations: at best, we can recover the structure up to the I-equivalence class
Density estimation
Estimate a statistical model of the underlying distribution and use it to reason with and predict new instances
Advantages of Accurate Structure
Structure Learning Approaches
Constraint based methods
View the Bayesian network as representing dependencies
Find a network that best explains dependencies
Limitation: sensitive to errors in single dependencies
Score based approaches
View learning as a model selection problem
Define a scoring function specifying how well the model fits the data
Search for a high-scoring network structure
Limitation: super-exponential search space
Bayesian model averaging methods
Average predictions across all possible structures
Can be done exactly (some cases) or approximately
Constraint Based Approaches
Goal: Find the best minimal I-Map for the domain
G is an I-Map for P if I(G) ⊆ I(P)
Minimal I-Map if deleting an edge from G renders it not an I-Map
G is a P-Map for P if I(G)=I(P)
Strategy
Query the distribution for independence relationships that hold between sets of variables
Construct a network which is the best minimal I-Map for P
Constructing Minimal I-Maps
Reverse factorization theorem
If P factorizes according to G, i.e., P(X1,…,Xn) = ∏i P(Xi | Pa(Xi)), then G is an I-Map of P
Algorithm for constructing a minimal I-Map
Fix an ordering of nodes X1,…,Xn
Select parents of Xi as a minimal subset of {X1,…,Xi-1} such that Ind(Xi ; {X1,…,Xi-1} − Pa(Xi) | Pa(Xi))
(Outline of) Proof of minimal I-map
I-map since the factorization above holds by construction
Minimal since by construction, removing one edge destroys the factorization
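The parent-selection step can be sketched as follows. The oracle `ind(x, rest, given)`, answering whether Ind(x ; rest | given) holds in P, is an assumption (in practice it would be implemented with statistical tests); all names are illustrative.

```python
from itertools import combinations

def minimal_imap_parents(order, ind):
    """Minimal I-map construction for a fixed ordering.
    ind(x, rest, given) is an assumed oracle answering Ind(x ; rest | given)."""
    parents = {}
    for i, x in enumerate(order):
        preds = order[:i]
        chosen = None
        # try candidate parent sets from smallest to largest: the first set U
        # with Ind(Xi ; predecessors - U | U) is a minimal parent set
        for k in range(len(preds) + 1):
            for u in combinations(preds, k):
                rest = [v for v in preds if v not in u]
                if ind(x, rest, list(u)):
                    chosen = list(u)
                    break
            if chosen is not None:
                break
        parents[x] = chosen
    return parents
```

For the chain A → B → C with a hand-coded oracle, this recovers parents {A: [], B: [A], C: [B]}.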
Constructing P-Maps
Simplifying assumptions
Network has bounded in-degree d per node
Oracle can answer independence queries over up to 2d+2 variables
Distribution P has a P-Map
Algorithm
Step I: Find skeleton
Step II: Find immoralities (v-structures)
Step III: Direct constrained edges
Step I: Identifying the Skeleton
For each pair X,Y, query all candidate conditioning sets Z for Ind(X;Y | Z)
X–Y is in the skeleton if no such Z is found
If graph in-degree is bounded by d, running time is O(n^(2d+2))
Correct since, if no direct edge exists, then Ind(X;Y | Pa(X), Pa(Y))
Reminder
If there is no Z for which Ind(X;Y | Z) holds, then X→Y or Y→X in G*
Proof: Assume no Z exists, and G* has neither X→Y nor Y→X
Then, we can find a set Z such that every path from X to Y is blocked
Then, G* implies Ind(X;Y | Z), and since G* is a P-Map, Ind(X;Y | Z) holds in P
Contradiction
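Step I can be sketched with the same kind of assumed independence oracle; conditioning sets are capped at size 2d, so each query touches at most 2d+2 variables.

```python
from itertools import combinations

def find_skeleton(variables, d, ind):
    """Step I sketch: keep edge X-Y unless some conditioning set Z with
    |Z| <= 2d separates X and Y. ind(x, y, z) is an assumed oracle
    answering Ind(X ; Y | Z) in the distribution P."""
    skeleton = set()
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        separated = False
        for k in range(2 * d + 1):
            for z in combinations(others, k):
                if ind(x, y, set(z)):
                    separated = True
                    break
            if separated:
                break
        if not separated:
            skeleton.add(frozenset((x, y)))
    return skeleton
```

On a chain A → B → C (where A ⊥ C | B), this keeps exactly the edges A–B and B–C.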
Step II: Identifying Immoralities
For each pair X,Y query candidate triplets X,Y,Z
X→Z←Y if no W containing Z is found with Ind(X;Y | W)
If graph in-degree is bounded by d, running time is O(n^(2d+3))
If such a W exists with Ind(X;Y | W) and X–Z–Y is not an immorality, then Z ∈ W
Reminder
If there is no W such that Z ∈ W and Ind(X;Y | W), then X→Z←Y is an immorality
Proof: Assume no such W exists, but X–Z–Y is not an immorality
Then, either X→Z→Y or X←Z←Y or X←Z→Y exists
But then, conditioning on Z blocks the path X–Z–Y
Then, since X and Y are not adjacent, we can find a W that includes Z such that Ind(X;Y | W)
Contradiction
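Step II can be sketched the same way: for each candidate triplet (non-adjacent X,Y with common neighbor Z in the skeleton), orient X→Z←Y if no separating set containing Z is found. Again `ind` is an assumed oracle.

```python
from itertools import combinations

def find_immoralities(variables, skeleton, ind, d):
    """Step II sketch: a non-adjacent pair X,Y with common neighbor Z is
    oriented X->Z<-Y iff no separating set W containing Z exists.
    ind(x, y, w) is an assumed independence oracle."""
    immoralities = []
    for z in variables:
        neighbors = [v for v in variables if frozenset((v, z)) in skeleton]
        for x, y in combinations(neighbors, 2):
            if frozenset((x, y)) in skeleton:
                continue  # X and Y adjacent: X-Z-Y is not a candidate triplet
            others = [v for v in variables if v not in (x, y)]
            separated_with_z = False
            for k in range(1, 2 * d + 2):
                for w in combinations(others, k):
                    if z in w and ind(x, y, set(w)):
                        separated_with_z = True
                        break
                if separated_with_z:
                    break
            if not separated_with_z:
                immoralities.append((x, z, y))
    return immoralities
```

On the v-structure A → C ← B (A and B marginally independent, dependent given C), the only separating sets for A,B exclude C, so (A, C, B) is reported as an immorality.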
Answering Independence Queries
Basic query
Determine whether two variables are independent
Well studied question in statistics
Common to frame query as hypothesis testing
Null hypothesis is H0
H0: Data was sampled from P*(X,Y)=P*(X)P*(Y)
Need a procedure that will Accept or Reject the hypothesis
χ² test to assess deviance of data from the hypothesis
Alternatively, use mutual information between X and Y
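A minimal self-contained sketch of the χ² independence test for two discrete variables. The critical value 3.841 is the 0.95 quantile of χ² with 1 degree of freedom, so the acceptance rule as written is tuned to binary variables; a real implementation would compute the degrees of freedom and a p-value (e.g., with scipy.stats.chi2_contingency).

```python
from collections import Counter

def chi2_stat(xs, ys):
    """Pearson chi-squared statistic for H0: P*(X,Y) = P*(X) P*(Y),
    assuming every expected cell count is positive."""
    n = len(xs)
    cx, cy = Counter(xs), Counter(ys)
    joint = Counter(zip(xs, ys))
    stat = 0.0
    for x in cx:
        for y in cy:
            expected = cx[x] * cy[y] / n
            observed = joint.get((x, y), 0)
            stat += (observed - expected) ** 2 / expected
    return stat

def accept_independence(xs, ys, critical=3.841):
    """Accept H0 (independence) iff the statistic is below the critical value."""
    return chi2_stat(xs, ys) <= critical
```

Perfectly correlated samples are rejected; samples with a uniform joint distribution give statistic 0 and are accepted.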
Structure Based Approaches
Strategy
Define a scoring function for each candidate structure
Search for a high scoring structure
Key: choice of scoring function
Likelihood based scores
Bayesian based scores
Likelihood Scores
Goal: find (G, θ) that maximize the likelihood
ScoreL(G:D) = log P(D | G, θ̂G) where θ̂G is the MLE of the parameters for G
Find G that maximizes ScoreL(G:D)
Example
General Decomposition
The likelihood score decomposes as:
ScoreL(G:D) = M ∑i IP̂(Xi ; PaXiG) − M ∑i HP̂(Xi)
(IP̂ and HP̂ are mutual information and entropy under the empirical distribution P̂)
Proof:
General Decomposition
The likelihood score decomposes as:
ScoreL(G:D) = M ∑i IP̂(Xi ; PaXiG) − M ∑i HP̂(Xi)
Second term does not depend on network structure and thus is irrelevant for selecting between two structures
Score increases as mutual information, or strength of dependence between connected variable increases
After some manipulation one can show that the score reflects to what extent the implied Markov assumptions are valid
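The decomposition can be checked numerically. The sketch below scores a structure as M·∑I − M·∑H under the empirical distribution; the function and variable names are mine.

```python
from math import log
from collections import Counter

def entropy(samples):
    """Empirical entropy (natural log) of a list of hashable samples."""
    n = len(samples)
    return -sum(c / n * log(c / n) for c in Counter(samples).values())

def mutual_info(xs, ys):
    """Empirical mutual information I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def likelihood_score(data, parents):
    """ScoreL(G:D) = M * sum_i I(Xi; Pa(Xi)) - M * sum_i H(Xi), with mutual
    information and entropy under the empirical distribution.
    data: dict variable -> sample list; parents: dict variable -> parent list."""
    m = len(next(iter(data.values())))
    score = 0.0
    for x, pa in parents.items():
        if pa:
            pa_samples = list(zip(*(data[p] for p in pa)))
            score += m * mutual_info(data[x], pa_samples)
        score -= m * entropy(data[x])
    return score
```

With Y an exact copy of X, the edge X → Y recovers the maximal log-likelihood (4·log 0.5 here), and scores strictly higher than the empty structure.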
Limitations of Likelihood Score
The likelihood score never decreases when an edge is added, so it is maximized by fully connected networks: it overfits the training data
Avoiding Overfitting
Classical problem in machine learning
Solutions
Restricting the hypotheses space
Limits the overfitting capability of the learner
Example: restrict # of parents or # of parameters
Minimum description length
Description length measures complexity
Prefer models that compactly describe the training data
Bayesian methods
Average over all possible parameter values
Use prior knowledge
Bayesian Score
scoreB(G : D) = log P(D | G) + log P(G)
Marginal Likelihood of Data Given G
P(D | G) = ∫ P(D | θG, G) P(θG | G) dθG
Marginal Likelihood: Binomial Case
Assume a sequence of m coin tosses
By the chain rule for probabilities
Recall that for Dirichlet priors:
P(Xm+1 = H | x1,…,xm) = (MmH + αH) / (m + αH + αT)
where MmH is the number of heads in the first m examples
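The chain-rule computation can be sketched directly; `alpha_h` and `alpha_t` are the Beta/Dirichlet hyperparameters. Note that the result depends only on the counts, not on the order of the tosses.

```python
def binomial_marginal_likelihood(tosses, alpha_h=1.0, alpha_t=1.0):
    """Chain-rule computation of P(D) for a toss sequence under a
    Beta/Dirichlet prior, using
    P(X_{m+1}=H | x1..xm) = (MmH + alpha_h) / (m + alpha_h + alpha_t)."""
    p, heads, tails = 1.0, 0, 0
    for x in tosses:
        m = heads + tails
        if x == 'H':
            p *= (heads + alpha_h) / (m + alpha_h + alpha_t)
            heads += 1
        else:
            p *= (tails + alpha_t) / (m + alpha_h + alpha_t)
            tails += 1
    return p
```

With a uniform prior (αH = αT = 1), 'HHT' gives (1/2)·(2/3)·(1/4) = 1/12, and any reordering of the same counts gives the same value.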
Marginal Likelihood: Binomial Case
P(D) = [Γ(αH + αT) / Γ(αH + αT + M)] · [Γ(αH + MH) / Γ(αH)] · [Γ(αT + MT) / Γ(αT)]
Marginal Likelihood Example
Actual experiment with P(H) = 0.25
Marginal Likelihood: BayesNets
Network structure determines form of marginal likelihood
Idealized Experiment
P(X = H) = 0.5
P(Y = H | X = H) = 0.5 + p
P(Y = H | X = T) = 0.5 − p
Marginal Likelihood: BayesNets
The marginal likelihood has the form:
P(D | G) = ∏i ∏paiG [Γ(α(paiG)) / Γ(α(paiG) + M(paiG))] · ∏xi [Γ(α(xi, paiG) + M(xi, paiG)) / Γ(α(xi, paiG))]
where
M(..) are the counts from the data
α(..) are the hyperparameters for each family
Priors
Structure prior P(G)
Uniform prior: P(G) constant
Prior penalizing number of edges: P(G) ∝ c^|G| (0 < c < 1)
Normalizing constant across networks is similar and can thus be ignored
Priors
Parameter prior P(θ | G)
BDe prior
M0: equivalent sample size
B0: network representing the prior probability of events
Set α(xi, paiG) = M0 · P(xi, paiG | B0)
Note: paiG are not the same as parents of Xi in B0
Compute P(xi,paiG| B0) using standard inference in B0
BDe has the desirable property of score equivalence: I-equivalent networks receive the same Bayesian score
Bayesian Score: Asymptotic Behavior
As M → ∞, a network G with Dirichlet priors satisfies
log P(D | G) = l(θ̂G : D) − (log M / 2) · Dim(G) + O(1)
The approximation is called the BIC score
Score exhibits tradeoff between fit to data and complexity
Mutual information grows linearly with M while complexity grows logarithmically with M
As M grows, more emphasis is given to the fit to the data
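The BIC tradeoff can be sketched for discrete networks, with Dim(G) counted as the number of free parameters per family; helper names are mine.

```python
from math import log

def dim_g(card, parents):
    """Dim(G) for a discrete network: each family Xi contributes
    (|Val(Xi)| - 1) * product of its parents' cardinalities.
    card: dict variable -> cardinality; parents: dict variable -> parent list."""
    total = 0
    for x, pa in parents.items():
        k = card[x] - 1
        for p in pa:
            k *= card[p]
        total += k
    return total

def bic_score(loglik, card, parents, m):
    """BIC(G:D) = l(theta_hat : D) - (log M / 2) * Dim(G): the fit term
    grows linearly in M, the penalty only logarithmically."""
    return loglik - (log(m) / 2) * dim_g(card, parents)
```

For two binary variables with the edge X → Y, Dim(G) = 1 + 2 = 3, so the penalty at M = 100 is 3·log(100)/2.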
Bayesian Score: Asymptotic Behavior
As M → ∞, a network G with Dirichlet priors satisfies
Bayesian score is consistent
As M → ∞, the true structure G* maximizes the score
Spurious edges will not contribute to likelihood and will be penalized
Required edges will be added, since the likelihood term grows linearly in M while the complexity penalty grows only logarithmically
Summary: Network Scores
Likelihood, MDL, (log) BDe have the form
Score(G : D) = ∑i FamScore(Xi | PaiG : D)
BDe requires assessing prior network
Can naturally incorporate prior knowledge
BDe is consistent and asymptotically equivalent (up to a constant) to BIC/MDL
All are score-equivalent
G I-equivalent to G’ ⇒ Score(G) = Score(G’)
Optimization Problem
Input:
Training data
Scoring function (including priors, if needed)
Set of possible structures
Including prior knowledge about structure
Output:
A network (or networks) that maximize the score
Key Property:
Decomposability: the score of a network is a sum of per-family terms (one term for each variable and its parents)
Learning Trees
Trees
At most one parent per variable
Why trees?
Elegant math
we can solve the optimization problem efficiently (with a greedy algorithm)
Sparse parameterization
avoid overfitting while adapting to the data
Learning Trees
Let p(i) denote the parent of Xi, or 0 if Xi has no parent
We can write the score as
Score(G : D) = ∑i: p(i)>0 [Score(Xi | Xp(i)) − Score(Xi)] + ∑i Score(Xi)
Score = sum of edge scores + constant
Learning Trees
Algorithm
Construct graph with vertices: 1,...n
Set w(i→j) = Score(Xj | Xi) − Score(Xj)
Find tree (or forest) with maximal weight
This can be done using standard algorithms in low-order polynomial time by building a tree in a greedy fashion (Kruskal’s maximum spanning tree algorithm)
Theorem: Procedure finds the tree with maximal score
When the score is the likelihood, w(i→j) is proportional to I(Xi ; Xj). This is known as the Chow & Liu method
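The Chow & Liu procedure can be sketched end-to-end: empirical mutual information as edge weights, then a greedy maximum spanning tree (Kruskal with union-find). This is a self-contained sketch, not a production implementation.

```python
from math import log
from collections import Counter

def entropy(samples):
    n = len(samples)
    return -sum(c / n * log(c / n) for c in Counter(samples).values())

def mutual_info(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def chow_liu_tree(data):
    """data: dict variable -> list of samples. Returns the edge list of a
    maximum mutual-information spanning tree (Kruskal's algorithm)."""
    names = list(data)
    edges = sorted(
        ((mutual_info(data[a], data[b]), a, b)
         for i, a in enumerate(names) for b in names[i + 1:]),
        reverse=True)
    parent = {v: v for v in names}
    def find(v):  # union-find with path compression
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    tree = []
    for _, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree.append((a, b))
    return tree
```

With B a copy of A and C independent of both, the heaviest edge A–B (weight I(A;B) = H(A)) is always selected, and C is attached by a zero-weight edge.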
Learning Trees: Example
Beyond Trees
Problem is not easy for more complex networks
Example: if we allow two parents per variable, a greedy algorithm is no longer guaranteed to find the optimal network
In fact, no efficient algorithm exists
Theorem:
Finding maximal scoring network structure with at most k parents for each variable is NP-hard for k>1
Fixed Ordering
For any decomposable scoring function Score(G:D) and fixed ordering ≺, the maximal scoring network has:
Pa(Xi) = argmax over U ⊆ {Xj : Xj ≺ Xi} of FamScore(Xi | U : D)
For fixed ordering we have independent problems
If we bound the in-degree per variable by d, each subproblem considers O(n^d) candidate parent sets, so complexity is exponential in d
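The fixed-ordering optimization can be sketched as independent per-variable maximizations; `family_score(x, parent_tuple)` is an assumed decomposable family score supplied by the caller.

```python
from itertools import combinations

def best_network_given_order(order, family_score, d):
    """With a fixed ordering and a decomposable score, the optimal parent
    set of each Xi is chosen independently among subsets (of size <= d)
    of its predecessors in the ordering."""
    parents = {}
    for i, x in enumerate(order):
        preds = order[:i]
        candidates = [u for k in range(min(d, len(preds)) + 1)
                      for u in combinations(preds, k)]
        parents[x] = max(candidates, key=lambda u: family_score(x, u))
    return parents
```

With a toy score table that rewards B ← A and C ← B, the procedure recovers the chain A → B → C.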
Heuristic Search
We address the problem by using heuristic search
Define a search space:
nodes are possible structures
edges denote adjacency of structures
Traverse this space looking for high-scoring structures
Search techniques:
Greedy hill-climbing
Best first search
Simulated Annealing
...
Heuristic Search
Typical operations: add an edge, delete an edge, reverse an edge
Exploiting Decomposability
Caching: To update the score after a local change, we only need to rescore the families that were changed in the last move
Greedy Hill Climbing
Simplest heuristic local search
Start with a given network
empty network
best tree
a random network
At each iteration
Evaluate all possible changes
Apply change that leads to best improvement in score
Reiterate
Stop when no modification improves score
Each step requires evaluating O(n) new changes
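Greedy hill-climbing with add/delete/reverse operators can be sketched as follows; `score` is any structure score over a parent map, acyclicity is checked per candidate, and for clarity the sketch rescores whole candidates rather than caching family scores.

```python
def is_acyclic(parents):
    """Kahn-style topological check on a parent map {var: set of parents}."""
    indeg = {v: len(pa) for v, pa in parents.items()}
    children = {v: [] for v in parents}
    for v, pa in parents.items():
        for p in pa:
            children[p].append(v)
    frontier = [v for v, k in indeg.items() if k == 0]
    seen = 0
    while frontier:
        v = frontier.pop()
        seen += 1
        for c in children[v]:
            indeg[c] -= 1
            if indeg[c] == 0:
                frontier.append(c)
    return seen == len(parents)

def hill_climb(variables, score, max_iters=100):
    """At each step, evaluate all single-edge additions, deletions and
    reversals; apply the best strictly improving acyclic change; stop at
    a local maximum. score(parents) -> float is any structure score."""
    parents = {v: set() for v in variables}
    for _ in range(max_iters):
        current = score(parents)
        best, best_delta = None, 0.0
        for x in variables:
            for y in variables:
                if x == y:
                    continue
                cand = {v: set(pa) for v, pa in parents.items()}
                if x in parents[y]:
                    cand[y].discard(x)          # delete x -> y
                elif y in parents[x]:
                    cand[x].discard(y)          # reverse y -> x ...
                    cand[y].add(x)              # ... into x -> y
                else:
                    cand[y].add(x)              # add x -> y
                if not is_acyclic(cand):
                    continue
                delta = score(cand) - current
                if delta > best_delta:
                    best, best_delta = cand, delta
        if best is None:
            return parents
        parents = best
    return parents
```

With a toy score that rewards only the edge A → B and penalizes every edge, the search adds exactly that edge and then stops at the local maximum.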
Greedy Hill Climbing Pitfalls
Greedy Hill-Climbing can get stuck in:
Local Maxima
All one-edge changes reduce the score
Plateaus
Some one-edge changes leave the score unchanged
Happens because equivalent networks receive the same score and are neighbors in the search space
Both occur during structure search
Standard heuristics can escape both
Random restarts
TABU search
Equivalence Class Search
Idea
Search the space of equivalence classes
Equivalence classes can be represented by PDAGs (partially directed acyclic graphs)
Advantages
PDAGs space has fewer local maxima and plateaus
There are fewer PDAGs than DAGs
Equivalence Class Search
Evaluating changes is more expensive
In addition to search, need to score a consistent network
These algorithms are more complex to implement
Learning Example: Alarm Network
Model Selection
So far, we focused on single model
Find best scoring model
Use it to predict next example
Implicit assumption:
Best scoring model dominates the weighted sum
Valid with many data instances
Pros:
We get a single structure
Allows for efficient use in our tasks
Cons:
We are committing to the independencies of a particular structure
Other structures might be as probable given the data
Model Selection
Density estimation
One structure may suffice, if its joint distribution is similar to other high scoring structures
Structure discovery
Define features f(G) (e.g., edge, sub-structure, d-sep query)
Compute P(f | D) = ∑G f(G) P(G | D)
Still requires summing over exponentially many structures
Model Averaging Given an Order
Assumptions
Known total order of variables
Maximum in-degree for variables d
Marginal likelihood
Model Averaging Given an Order
Posterior probability of a general feature
Posterior probability of particular choice of parents
Posterior probability of particular edge choice
Model Averaging
We cannot assume that order is known
Solution: Sample from posterior distribution of P(G|D)
Estimate feature probability by P(f | D) ≈ (1/K) ∑k f(Gk), for structures G1,…,GK sampled from P(G | D)
Sampling can be done by MCMC
Sampling can be done over orderings
Notes on Learning Local Structures
Define score with local structures
Example: in tree CPDs, score decomposes by leaves
Prior may need to be extended
Example: in tree CPDs, penalty for tree structure per CPD
Extend search operators to local structure
Example: in tree CPDs, we need to search for tree structure
Can be done by local encapsulated search or by defining new global operations
Search: Summary
Discrete optimization problem
In general, NP-Hard
Need to resort to heuristic search
In practice, search is relatively fast (~100 vars in ~10 min):
Decomposability
Sufficient statistics
In some cases, we can reduce the search problem to an easy optimization problem