辅导案例-MSDS534

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

Statistical Learning for Data Science
MSDS534
Lecture 06: Graphical Models
Department of Statistics & Biostatistics
Rutgers University
0Do not reproduce or distribute the lecture notes. Unauthorized reproduction or distribution of
the contents of this slides is a copyright violation.
Acknowledgement
Some of the figures in this presentation are taken from Elements of
Statistical Learning (Springer, 2009) with permission from the authors:
T. Hastie, R. Tibshirani and J. Friedman.
Some of the figures in this presentation are taken from Pattern
Recognition and Machine Learning (Springer, 2006) with permission
from the author: Christopher M. Bishop.
Reading Assignments
ESL: §17.1–17.3
PRML: §8.1–8.3
Graphical Models
They provide a simple way to visualize the structure of a probabilistic
model and can be used to design and motivate new models.
Insights into the properties of the model, including conditional
independence properties, can be obtained by inspection of the graph.
Complex computations, required to perform inference and learning in
sophis- ticated models, can be expressed in terms of graphical
manipulations, in which underlying mathematical expressions are carried
along implicitly.
Graphs and Graphical Models
A graph comprises nodes (also called vertices) connected by edges (also
known as links or arcs).
In a probabilistic graphical model, each node represents a random
variable, and the edges express probabilistic relationships between these
variables.
Bayesian networks, also known as directed graphical models: the edges
of the graphs have a particular directionality indicated by arrows.
Markov random fields, also known as undirected graphical models: the
edges do not carry arrows and have no directional significance.
Directed graphs are useful for expressing causal relationships between
random variables. Be careful!
Will focus on continuous, and in particular, Gaussian graphical models.
However, to better understand the conditional probabilities calculation,
it might be easier to have discrete random variables in mind.
Outline
1 Bayesian Networks/Directed Graphical Models
2 Conditional Independence
3 Markov Random Fields/Undirected Graphical Models
Bayesian Networks/Directed Graphical Models
Suppose there are K random variables {x1, . . . , xK}.
The complete factorization of the joint density p(x1, . . . , xK) is given by
p(x1, . . . , xK) = p(xK |x1, . . . , xK−1) · · · p(x2|x1)p(x1).
Can use a directed graph to introduce additional structure into and to
simplify this factorization.
According to the graph, the joint density p(x1, . . . , x7) is factorized as
p(x1)p(x2)p(x3)p(x4|x1, x2, x3)p(x5|x1, x3)p(x6|x4)p(x7|x4, x5).362 8. GRAPHICAL MODELS
Figure 8.2 Example of a directed acyclic graph describing the joint
distribution over variables x1, . . . , x7. The corresponding
decomposition of the joint distribution is given by (8.4).
x1
x2 x3
x4 x5
x6 x7
is therefore given by
p(x1)p(x2)p(x3)p(x4|x1, x2, x3)p(x5|x1, x3)p(x6|x4)p(x7|x4, x5). (8.4)
The reader should take a moment to study carefully the correspondence between
(8.4) and Figure 8.2.
We can now state in general terms the relationship between a given directed
graph and the corresponding distribution over the variables. The joint distribution
defined by a graph is given by the product, over all of the nodes of the graph, of
a conditional distribution for each node conditioned on the variables corresponding
to the parents of that node in the graph. Thus, for a graph with K nodes, the joint
distribution is given by
p(x) =
K∏
k=1
p(xk|pak) (8.5)
where pak denotes the set of parents of xk, and x = {x1, . . . , xK}. This key
equation expresses the factorization properties of the joint distribution for a directed
graphical model. Although we have considered each node to correspond to a single
variable, we can equally well associate sets of variables and vector-valued variables
with the nodes of a graph. It is easy to show that the representation on the right-
hand side of (8.5) is always correctly normalized provided the individual conditional
distributions are normalized.Exercise 8.1
The directed graphs that we are considering are subject to an important restric-
tion namely that there must be no directed cycles, in other words there are no closed
paths within the graph such that we can move from node to node along links follow-
ing the direction of the arrows and end up back at the starting node. Such graphs are
also called directed acyclic graphs, or DAGs. This is equivalent to the statement thatExercise 8.2
there exists an ordering of the nodes such that there are no links that go from any
node to any lower numbered node.
8.1.1 Example: Polynomial regression
As an illustration of the use of directed graphs to describe probability distri-
butions, we consider the Bayesian polynomial regression model introduced in Sec-
I there is a edge going from a node a to a node b, then we say that node
a is the parent of node b, and we say that node b is the child of node a.
– If there is a edge ( rrow) g ing from a node a to a node b, we also
say that a is at the tail of the edge, and b at the head of the edge.
Bayesian Networks
Order the nodes such that if b is a child of b, then b is ordered after a,
i.e. b receives a larger ordering number.
Assume there is NO directed cycles, i.e. there are no closed paths within
the graph such that we can move from node to node along edges
following the direction of the arrows and end up back at the starting
node. Such a graph is also called a directed acyclic graph, or a DAG.
The ordering is always possible for a DAG.
Denote by pak the parents of the node xk. The joint distribution is
factorized as
p(x) =
K∏
k=1
p(xk|pak).
– The complete factorization corresponds to a complete DAG, with
edges from every node to all the nodes after it.
The factorization also leads to the ancestral sampling algorithm to
sample from the joint distribution.
Generally speaking, an undirected graphical model refers to the set of
distributions satisfying the factorization induced by the graph.
Gaussian DAG
Suppose there is a DAG with K nodes. We consider the K-dimensional
Gaussian distribution induced by the graph.
Order the random variables x1, . . . , xK according to the ordering of the
nodes.
The conditional distribution of xk given its parents pak is
xk|pak ∼ N
 ∑
j∈pak
wkjxj + bk, vk
 , 1 ≤ k ≤ K.
It follows that the logarithm of the joint density is
log p(x) = −
K∑
k=1
1
2vk
xk − ∑
j∈pak
wkjxj − bk
2 + const.
Equivalently,
xk =
∑
j∈pak
wkjxj + bk +
√
vkk,
where 1, . . . , K are iid N(0, 1).
Build a Gaussian DAG
If the graph is known, estimate the coefficients wkj and the variance
parameters vk by LSE and RSS.
If the graph is unknown, can use the following greedy algorithm.
1 Suppose the current graph is G(t).
2 Try three possible operations on G(t):
– remove an edge,
– add an edge if the graph is still acyclic,
– reverse the direction of an edge if it is still acyclic.
3 For each allowed operation in Step 2, calculate the BIC of the
resulting graph. Select the one of the smallest BIC, call it G(t+1).
4 Iterate until none of the operations in Step 2 leads to a model with
a smaller BIC.
For a given graph G, its BIC is calculated as
BIC(G) =
K∑
k=1
BICk,
where
BICk = N log
 1
N
N∑
i=1
xik − ∑
j∈pak
wˆkjxj − bˆk
2
+ |pak| · logN.
Outline
1 Bayesian Networks/Directed Graphical Models
2 Conditional Independence
3 Markov Random Fields/Undirected Graphical Models
Conditional Independence
Two random variables a and b are independent if
p(a, b) = p(a)p(b) or p(a|b) = p(a).
Now consider three random variables a, b and c. The conditional density
of (a, b) given c is
p(a, b|c) = p(a, b, c)
p(c)
.
We say a and b are conditional independent given c if
p(a, b|c) = p(a|c)p(b|c) or p(a|b, c) = p(a|c).
Denote the conditional independence by a ⊥⊥ b | c.
Denote the independence of a and b by a ⊥⊥ b | ∅, or simply a ⊥⊥ b.
In general, it is the case that a ⊥⊥ b | c, but NOT a ⊥⊥ b.
Example: DAG with 3 Nodes
Exmaple 1. a ⊥/⊥ b and a ⊥⊥ b | c. The node c is said to be tail-to-tail
with respect to the path.
Exmaple 2. a ⊥/⊥ b and a ⊥⊥ b | c. The node c is said to be head-to-tail
with respect to the path.
Exmaple 3. a ⊥⊥ b. Is a ⊥⊥ b | c? NO! The node c is said to be
head-to-head with respect to the path.
c
a b a c b
c
a b
c
a b a c b
c
a b
Example
Consider an example with three binary random variables relating to the
fuel system on a car.
The variable B represents the state of a battery that is either charged
(B = 1) or flat (B = 0).
The variable F represents the state of the fuel tank that is either full of
fuel (F = 1) or empty (F = 0).
The variable G represents the state of an electric fuel gauge and which
indicates either full (G = 1) or empty (G = 0).
Assume B and F are independent with P (B = 1) = P (F = 1) = 0.9.
Assume that conditional on B and F ,
P (G = 1|B = 1, F = 1) = 0.8
P (G = 1|B = 1, F = 0) = 0.2
P (G = 1|B = 0, F = 1) = 0.2
P (G = 1|B = 0, F = 0) = 0.1
G
B F
G
B F
G
B F
Example
Marginal and conditional probabilities of F = 0:
P (F = 0) = .1 P (F = 0|G = 0) ≈ .257 P (F = 0|G = 0, B = 0) ≈ .111
D-Separation
We say that node y is a descendant of node x if there is a path from x
to y in which each step of the path follows the directions of the arrows.
Consider a general directed graph in which A, B, and C are arbitrary
nonintersecting sets of nodes (whose union may be smaller than the
complete set of nodes in the graph).
Consider all possible paths from any node in A to any node in B. Any
such path is said to be blocked if it includes a node such that either
(a) the arrows on the path meet either head-to-tail or tail-to-tail at the
node, and the node is in the set C, or
(b) the arrows meet head-to-head at the node, and neither the node,
nor any of its descendants, is in the set C.
If all paths are blocked, then A is said to be d-separated from B by C,
and it codes that the joint distribution over all of the variables in the
graph will satisfy A ⊥⊥ B |C.
All the distributions that factorize according to the graph satisfy the
conditional independence statements read from the graph.
Example: D-Separation
Treat c as the conditioning set.
– The path from a to b is not blocked by node f because it is a
tail-to-tail node for this path, and it is not in the conditioning set.
– The path from a to b is not blocked by node e because, although
the latter is a head-to-head node, it has a descendant c in the
conditioning set.
– Thus the conditional independence statement a ⊥⊥ b | c does not
follow from this graph, i.e. a ⊥/⊥ b | c.
f
e b
a
c
Example: D-Separation
Treat f as the conditioning set.
– The path from a to b is blocked by node f because this is a
tail-to-tail node, and is the conditioning set.
– This path is also blocked by node e because e is a head-to-head
node and neither it nor its descendant are in the conditioning set.
– Thus the conditional independence statement a ⊥⊥ b | f follows from
this graph.
f
e b
a
c
Outline
1 Bayesian Networks/Directed Graphical Models
2 Conditional Independence
3 Markov Random Fields/Undirected Graphical Models
Markov Random Fields/Undirected Graphical Models
A Markov random field, also known as a Markov network or an
undirected graphical model, has a set of nodes each of which corresponds
to a random variable, as well as a set of edges each of which connects a
pair of nodes. The links are undirected, that is they do not carry arrows.
It is convenient to begin with the conditional independence properties.
(Compare with Bayesian networks!)
Suppose that in an undirected graph we identify three sets of nodes,
denoted A, B, and C. Consider all possible paths that connect a node in
set A to a node in set B. If all such paths pass through one or more
nodes in set C, we say C separates A and B.
Given an undirected graph G, the Markov random filed contains all
distributions satisfying A ⊥⊥ B |C, as long as C separates A and B.
A
C
B
Factorization
A clique is a subset of the nodes in a graph such that there exists a edge
between all pairs of nodes in the subset.
A maximal clique is a clique such that it is not possible to include any
other nodes from the graph in the set without it ceasing to be a clique.
Denote a clique by C and the set of random variables in it by xC .
The joint distribution is written as a product of potential functions
ψC(xC) over the maximal cliques of the graph:
p(x) =
1
Z
∏
C
ψc(xc).
Here the quantity Z, sometimes called the partition function, is a
normalization constant and is given by
Z =
∑
x
∏
C
ψc(xc).
Hammersley-Clifford Theorem asserts the equivalence between the
conditional independence interpretation and the factorization, under the
assumption that the joint densities/pmfs are positive everywhere.
Pairwise Markov Independence
Given an undirected graph G with nodes V . Consider all the joint
distributions of its nodes, such that a ⊥⊥ b |V \ {a, b} as long as a and b
are not connected, i.e. no edge between a and b.
This type of conditional independence is called the pairwise Markov
independence or pairwise Markov property of G.
If a and b are not connected, then the set V \ {a, b} separates a and b.
The aforementioned A ⊥⊥ B |C is called the global Markov properties of
G. It is obvious that the global Markov property implies the pairwise
Markov property.
The converse is also true, i.e. pairwise and global Markov properties are
equivalent, when restricted to the densities/pmfs that are positive
everywhere.
Undirected Gaussian Graphical Model
Suppose x = (x1, . . . , xp) has a multivariate normal distribution
N(µ,Σ).
The matrix Θ = Σ−1 is called the precision matrix.
xj ⊥⊥ xk |x \ {xj , xk} if and only if θjk = 0. HW problem.
Therefore, for an undirected Gaussian graphical model, the graph is
characterized by the precision matrix.
Let E be the set of edges, i.e. (j, k) ∈ E if θjk 6= 0. Make the convection
that all (j, j) ∈ E .
Given a sample {x1, . . . ,xN}, let S be the sample covariance matrix
S =
1
N
N∑
i=1
(xi − x¯)(xi − x¯)′.
The log-likelihood of the precision matrix Θ is given by
`(Θ) = log det(Θ)− trace(SΘ).
Modified Regression Algorithm
It can be verified that −`(Θ) is convex in Θ.
If E is known, the Lagrangian with respect to the given edge set E is
L(Θ,Γ) = log det(Θ)− trace(SΘ)−
∑
(j,k) 6∈E
γjkθjk.
The gradient condition is
Θ−1 − S − Γ = 0,
where Γ is defined as: Γ[j, k] = γjk if (j, k) 6∈ E , and Γ[j, k] = 0
otherwise.
The modified regression algorithm tries to update one column/row of Θ
at a time.
Let Θ(t) be the current estimate, and W (t) =
[
Θ(t)
]−1
. Will show how
to get Θ(t+1).
W.L.O.G., will show how to update the last column/row of Θ.
Modified Regression Algorithm
The algorithm relies on the identity(
W
(t)
11 w12
w′12 w22
)(
Θ
(t)
11 θ12
θ′12 θ22
)
=
(
I 0
0′ 1
)
.
Will try to update θ12 and θ22 such that the last row/column of the
gradient condition is fulfilled.
The vector w12 is linked to θ12 and θ22 through
W
(t)
11 θ12 + θ22w12 = 0.
θ12 has to satisfy the edge constraint.
Let β := −θ12/θ22, then w12 = W (t)11β, and
W
(t)
11β − s12 − γ12 = 0. (1)
– Note that β has the same zero coordinates as θ12.
The last diagonal element w22 has to equal to s22, and
w′12θ12 + w22θ22 = 1.
More details in class demonstration.
Modified Regression Algorithm
Graphical Lasso
To estimate the graph structure, consider maximizing the penalized
log-likelihood
log det(Θ)− trace(SΘ)− λ‖Θ‖1,
where ‖Θ‖1 denotes the sum of the absolute values of the elements of Θ.
The gradient condition becomes
Θ−1 − S − λ · Sign(Θ) = 0,
– Here Sign(x) is defined as the sub-differential of |x|, which is the
sign when x 6= 0, and can take any value in [−1, 1] when x = 0.
Similar to the Modified Regression Algorithm, update one column/row of
Θ at a time.
The gradient condition (1) becomes
W
(t)
11β − s12 − λ · Sign(β) = 0.
Can adapt Lasso to solve this optimization problem, using the pathwise
coordinate descent method. More details in the video.
Example
Read
https://people.math.aau.dk/~sorenh/misc/2012-Oslo-GMwR/GMwR-notes.pdf
Run setRepositories(), and make sure that BioC software is
added.
Install the following packages using the command
install.packages("gRbase", dependencies=TRUE)
install.packages("gRain", dependencies=TRUE)
install.packages("gRim", dependencies=TRUE)
install.packages("Rgraphviz", dependencies=TRUE)
Example: Generate and plot an undirected graph
library(gRbase)
library(Rgraphviz)
g1 <- ug(~a:b:e + a:c:e + b:d:e + c:d:e + c:g + d:f)
as(g1, "matrix")
## a b e c d g f
## a 0 1 1 1 0 0 0
## b 1 0 1 0 1 0 0
## e 1 1 0 1 1 0 0
## c 1 0 1 0 1 1 0
## d 0 1 1 1 0 0 1
## g 0 0 0 1 0 0 0
## f 0 0 0 0 1 0 0
plot(g1)
a
b
e
c
d g
f
Example: Using adjacency matrix
m <- matrix(c(0,1,1,0,1,1,0,0,1,1,1,0,0,1,1,0,1,1,0,1,1,1,1,1,0), nrow = 5)
rownames(m) <- colnames(m) <- c("a", "b", "c", "d", "e")
m
g1=as(m, "graphNEL")
g1
## A graphNEL graph with undirected edges
## Number of Nodes = 11
## Number of Edges = 8
plot(g1)
a
b
e
c
d g
f
Example: Generate and plot a directed graph
dag0 <- dag(~a, ~b * a, ~c * a * b, ~d * c * e, ~e * a)
dag0 <- dag(~a + b:a + c:a:b + d:c:e + e:a)
dag0 <- dag("a", c("b", "a"), c("c", "a", "b"), c("d", "c", "e"), c("e", "a"))
dag0
## A graphNEL graph with directed edges
## Number of Nodes = 5
## Number of Edges = 6
plot(dag0) a
b
c
d
e
Example: Using directed adjacency matrix
m <- matrix(c(0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,1,0), nrow = 5)
rownames(m) <- colnames(m) <- letters[1:5]
m
## 1 indicates an arrow from row to column
## a b c d e
## a 0 1 0 0 0
## b 0 0 1 1 0
## c 0 0 0 0 1
## d 0 0 0 0 1
## e 0 0 0 0 0
dag1 <- as(m, "graphNEL")
plot(dag1)
a
b
c d
e
Example: Protein flow cytometry data
https://web.stanford.edu/~hastie/ElemStatLearn/datasets/sachs.data
7466 cells, one per row
11 proteins, one per column
Protein names are:
"praf",
"pmek",
"plcg",
"PIP2",
"PIP3",
"p44/42",
"pakts473",
"PKA",
"PKC",
"P38",
"pjnk"
Sachs, K., and Perez, O., and Pe’er, D., and Lauffenburger, D. and Nolan, G.
Causal protein-signaling networks derived from multiparameter single-cell data
Science, 308(5721), 523--529, 2005.
Example: Protein flow cytometry data
pfc=read.table("sachs.txt",header=F)
colnames(pfc)=c(
"raf",
"mek",
"plcg",
"PIP2",
"PIP3",
"p44",
"akt",
"PKA",
"PKC",
"P38",
"jnk")
dim(pfc)
S=cov(pfc)
dim(S)
S=S/1000 ## rescale the sample covariance matrix
Example: λ = 36
fit1=glasso(S,rho=36)
Theta=fit1$wi
colnames(Theta)=c("raf","mek","plcg","PIP2","PIP3","p44","akt","PKA","PKC","P38","jnk")
rownames(Theta)=c("raf","mek","plcg","PIP2","PIP3","p44","akt","PKA","PKC","P38","jnk")
Adj=Theta!=0
Adj=Adj*1
diag(Adj)=0
g1=as(Adj, "graphNEL")
plot(g1)
raf
mek plcg
PIP2
PIP3 p44 akt
PKA PKC
P38
jnk
Example: λ = 27
fit2=glasso(S,rho=27)
Theta=fit2$wi
colnames(Theta)=c("raf","mek","plcg","PIP2","PIP3","p44","akt","PKA","PKC","P38","jnk")
rownames(Theta)=c("raf","mek","plcg","PIP2","PIP3","p44","akt","PKA","PKC","P38","jnk")
Adj=Theta!=0
Adj=Adj*1
diag(Adj)=0
g2=as(Adj, "graphNEL")
plot(g2)
raf
mek
plcg
PIP2
PIP3 p44
aktPKA PKC
P38
jnk
Example: λ = 7
fit3=glasso(S,rho=7)
Theta=fit3$wi
colnames(Theta)=c("raf","mek","plcg","PIP2","PIP3","p44","akt","PKA","PKC","P38","jnk")
rownames(Theta)=c("raf","mek","plcg","PIP2","PIP3","p44","akt","PKA","PKC","P38","jnk")
Adj=Theta!=0
Adj=Adj*1
diag(Adj)=0
g3=as(Adj, "graphNEL")
plot(g3)
raf
mek
plcg
PIP2
PIP3 p44
aktPKA PKC
P38
jnk

欢迎咨询51作业君