Statistical Learning for Data Science MSDS534 Lecture 06: Graphical Models Department of Statistics & Biostatistics Rutgers University 0Do not reproduce or distribute the lecture notes. Unauthorized reproduction or distribution of the contents of this slides is a copyright violation. Acknowledgement Some of the figures in this presentation are taken from Elements of Statistical Learning (Springer, 2009) with permission from the authors: T. Hastie, R. Tibshirani and J. Friedman. Some of the figures in this presentation are taken from Pattern Recognition and Machine Learning (Springer, 2006) with permission from the author: Christopher M. Bishop. Reading Assignments ESL: §17.1–17.3 PRML: §8.1–8.3 Graphical Models They provide a simple way to visualize the structure of a probabilistic model and can be used to design and motivate new models. Insights into the properties of the model, including conditional independence properties, can be obtained by inspection of the graph. Complex computations, required to perform inference and learning in sophis- ticated models, can be expressed in terms of graphical manipulations, in which underlying mathematical expressions are carried along implicitly. Graphs and Graphical Models A graph comprises nodes (also called vertices) connected by edges (also known as links or arcs). In a probabilistic graphical model, each node represents a random variable, and the edges express probabilistic relationships between these variables. Bayesian networks, also known as directed graphical models: the edges of the graphs have a particular directionality indicated by arrows. Markov random fields, also known as undirected graphical models: the edges do not carry arrows and have no directional significance. Directed graphs are useful for expressing causal relationships between random variables. Be careful! Will focus on continuous, and in particular, Gaussian graphical models. However, to better understand the conditional probabilities calculation, it might be easier to have discrete random variables in mind. Outline 1 Bayesian Networks/Directed Graphical Models 2 Conditional Independence 3 Markov Random Fields/Undirected Graphical Models Bayesian Networks/Directed Graphical Models Suppose there are K random variables {x1, . . . , xK}. The complete factorization of the joint density p(x1, . . . , xK) is given by p(x1, . . . , xK) = p(xK |x1, . . . , xK−1) · · · p(x2|x1)p(x1). Can use a directed graph to introduce additional structure into and to simplify this factorization. According to the graph, the joint density p(x1, . . . , x7) is factorized as p(x1)p(x2)p(x3)p(x4|x1, x2, x3)p(x5|x1, x3)p(x6|x4)p(x7|x4, x5).362 8. GRAPHICAL MODELS Figure 8.2 Example of a directed acyclic graph describing the joint distribution over variables x1, . . . , x7. The corresponding decomposition of the joint distribution is given by (8.4). x1 x2 x3 x4 x5 x6 x7 is therefore given by p(x1)p(x2)p(x3)p(x4|x1, x2, x3)p(x5|x1, x3)p(x6|x4)p(x7|x4, x5). (8.4) The reader should take a moment to study carefully the correspondence between (8.4) and Figure 8.2. We can now state in general terms the relationship between a given directed graph and the corresponding distribution over the variables. The joint distribution defined by a graph is given by the product, over all of the nodes of the graph, of a conditional distribution for each node conditioned on the variables corresponding to the parents of that node in the graph. Thus, for a graph with K nodes, the joint distribution is given by p(x) = K∏ k=1 p(xk|pak) (8.5) where pak denotes the set of parents of xk, and x = {x1, . . . , xK}. This key equation expresses the factorization properties of the joint distribution for a directed graphical model. Although we have considered each node to correspond to a single variable, we can equally well associate sets of variables and vector-valued variables with the nodes of a graph. It is easy to show that the representation on the right- hand side of (8.5) is always correctly normalized provided the individual conditional distributions are normalized.Exercise 8.1 The directed graphs that we are considering are subject to an important restric- tion namely that there must be no directed cycles, in other words there are no closed paths within the graph such that we can move from node to node along links follow- ing the direction of the arrows and end up back at the starting node. Such graphs are also called directed acyclic graphs, or DAGs. This is equivalent to the statement thatExercise 8.2 there exists an ordering of the nodes such that there are no links that go from any node to any lower numbered node. 8.1.1 Example: Polynomial regression As an illustration of the use of directed graphs to describe probability distri- butions, we consider the Bayesian polynomial regression model introduced in Sec- I there is a edge going from a node a to a node b, then we say that node a is the parent of node b, and we say that node b is the child of node a. – If there is a edge ( rrow) g ing from a node a to a node b, we also say that a is at the tail of the edge, and b at the head of the edge. Bayesian Networks Order the nodes such that if b is a child of b, then b is ordered after a, i.e. b receives a larger ordering number. Assume there is NO directed cycles, i.e. there are no closed paths within the graph such that we can move from node to node along edges following the direction of the arrows and end up back at the starting node. Such a graph is also called a directed acyclic graph, or a DAG. The ordering is always possible for a DAG. Denote by pak the parents of the node xk. The joint distribution is factorized as p(x) = K∏ k=1 p(xk|pak). – The complete factorization corresponds to a complete DAG, with edges from every node to all the nodes after it. The factorization also leads to the ancestral sampling algorithm to sample from the joint distribution. Generally speaking, an undirected graphical model refers to the set of distributions satisfying the factorization induced by the graph. Gaussian DAG Suppose there is a DAG with K nodes. We consider the K-dimensional Gaussian distribution induced by the graph. Order the random variables x1, . . . , xK according to the ordering of the nodes. The conditional distribution of xk given its parents pak is xk|pak ∼ N ∑ j∈pak wkjxj + bk, vk , 1 ≤ k ≤ K. It follows that the logarithm of the joint density is log p(x) = − K∑ k=1 1 2vk xk − ∑ j∈pak wkjxj − bk 2 + const. Equivalently, xk = ∑ j∈pak wkjxj + bk + √ vkk, where 1, . . . , K are iid N(0, 1). Build a Gaussian DAG If the graph is known, estimate the coefficients wkj and the variance parameters vk by LSE and RSS. If the graph is unknown, can use the following greedy algorithm. 1 Suppose the current graph is G(t). 2 Try three possible operations on G(t): – remove an edge, – add an edge if the graph is still acyclic, – reverse the direction of an edge if it is still acyclic. 3 For each allowed operation in Step 2, calculate the BIC of the resulting graph. Select the one of the smallest BIC, call it G(t+1). 4 Iterate until none of the operations in Step 2 leads to a model with a smaller BIC. For a given graph G, its BIC is calculated as BIC(G) = K∑ k=1 BICk, where BICk = N log 1 N N∑ i=1 xik − ∑ j∈pak wˆkjxj − bˆk 2 + |pak| · logN. Outline 1 Bayesian Networks/Directed Graphical Models 2 Conditional Independence 3 Markov Random Fields/Undirected Graphical Models Conditional Independence Two random variables a and b are independent if p(a, b) = p(a)p(b) or p(a|b) = p(a). Now consider three random variables a, b and c. The conditional density of (a, b) given c is p(a, b|c) = p(a, b, c) p(c) . We say a and b are conditional independent given c if p(a, b|c) = p(a|c)p(b|c) or p(a|b, c) = p(a|c). Denote the conditional independence by a ⊥⊥ b | c. Denote the independence of a and b by a ⊥⊥ b | ∅, or simply a ⊥⊥ b. In general, it is the case that a ⊥⊥ b | c, but NOT a ⊥⊥ b. Example: DAG with 3 Nodes Exmaple 1. a ⊥/⊥ b and a ⊥⊥ b | c. The node c is said to be tail-to-tail with respect to the path. Exmaple 2. a ⊥/⊥ b and a ⊥⊥ b | c. The node c is said to be head-to-tail with respect to the path. Exmaple 3. a ⊥⊥ b. Is a ⊥⊥ b | c? NO! The node c is said to be head-to-head with respect to the path. c a b a c b c a b c a b a c b c a b Example Consider an example with three binary random variables relating to the fuel system on a car. The variable B represents the state of a battery that is either charged (B = 1) or flat (B = 0). The variable F represents the state of the fuel tank that is either full of fuel (F = 1) or empty (F = 0). The variable G represents the state of an electric fuel gauge and which indicates either full (G = 1) or empty (G = 0). Assume B and F are independent with P (B = 1) = P (F = 1) = 0.9. Assume that conditional on B and F , P (G = 1|B = 1, F = 1) = 0.8 P (G = 1|B = 1, F = 0) = 0.2 P (G = 1|B = 0, F = 1) = 0.2 P (G = 1|B = 0, F = 0) = 0.1 G B F G B F G B F Example Marginal and conditional probabilities of F = 0: P (F = 0) = .1 P (F = 0|G = 0) ≈ .257 P (F = 0|G = 0, B = 0) ≈ .111 D-Separation We say that node y is a descendant of node x if there is a path from x to y in which each step of the path follows the directions of the arrows. Consider a general directed graph in which A, B, and C are arbitrary nonintersecting sets of nodes (whose union may be smaller than the complete set of nodes in the graph). Consider all possible paths from any node in A to any node in B. Any such path is said to be blocked if it includes a node such that either (a) the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set C, or (b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, is in the set C. If all paths are blocked, then A is said to be d-separated from B by C, and it codes that the joint distribution over all of the variables in the graph will satisfy A ⊥⊥ B |C. All the distributions that factorize according to the graph satisfy the conditional independence statements read from the graph. Example: D-Separation Treat c as the conditioning set. – The path from a to b is not blocked by node f because it is a tail-to-tail node for this path, and it is not in the conditioning set. – The path from a to b is not blocked by node e because, although the latter is a head-to-head node, it has a descendant c in the conditioning set. – Thus the conditional independence statement a ⊥⊥ b | c does not follow from this graph, i.e. a ⊥/⊥ b | c. f e b a c Example: D-Separation Treat f as the conditioning set. – The path from a to b is blocked by node f because this is a tail-to-tail node, and is the conditioning set. – This path is also blocked by node e because e is a head-to-head node and neither it nor its descendant are in the conditioning set. – Thus the conditional independence statement a ⊥⊥ b | f follows from this graph. f e b a c Outline 1 Bayesian Networks/Directed Graphical Models 2 Conditional Independence 3 Markov Random Fields/Undirected Graphical Models Markov Random Fields/Undirected Graphical Models A Markov random field, also known as a Markov network or an undirected graphical model, has a set of nodes each of which corresponds to a random variable, as well as a set of edges each of which connects a pair of nodes. The links are undirected, that is they do not carry arrows. It is convenient to begin with the conditional independence properties. (Compare with Bayesian networks!) Suppose that in an undirected graph we identify three sets of nodes, denoted A, B, and C. Consider all possible paths that connect a node in set A to a node in set B. If all such paths pass through one or more nodes in set C, we say C separates A and B. Given an undirected graph G, the Markov random filed contains all distributions satisfying A ⊥⊥ B |C, as long as C separates A and B. A C B Factorization A clique is a subset of the nodes in a graph such that there exists a edge between all pairs of nodes in the subset. A maximal clique is a clique such that it is not possible to include any other nodes from the graph in the set without it ceasing to be a clique. Denote a clique by C and the set of random variables in it by xC . The joint distribution is written as a product of potential functions ψC(xC) over the maximal cliques of the graph: p(x) = 1 Z ∏ C ψc(xc). Here the quantity Z, sometimes called the partition function, is a normalization constant and is given by Z = ∑ x ∏ C ψc(xc). Hammersley-Clifford Theorem asserts the equivalence between the conditional independence interpretation and the factorization, under the assumption that the joint densities/pmfs are positive everywhere. Pairwise Markov Independence Given an undirected graph G with nodes V . Consider all the joint distributions of its nodes, such that a ⊥⊥ b |V \ {a, b} as long as a and b are not connected, i.e. no edge between a and b. This type of conditional independence is called the pairwise Markov independence or pairwise Markov property of G. If a and b are not connected, then the set V \ {a, b} separates a and b. The aforementioned A ⊥⊥ B |C is called the global Markov properties of G. It is obvious that the global Markov property implies the pairwise Markov property. The converse is also true, i.e. pairwise and global Markov properties are equivalent, when restricted to the densities/pmfs that are positive everywhere. Undirected Gaussian Graphical Model Suppose x = (x1, . . . , xp) has a multivariate normal distribution N(µ,Σ). The matrix Θ = Σ−1 is called the precision matrix. xj ⊥⊥ xk |x \ {xj , xk} if and only if θjk = 0. HW problem. Therefore, for an undirected Gaussian graphical model, the graph is characterized by the precision matrix. Let E be the set of edges, i.e. (j, k) ∈ E if θjk 6= 0. Make the convection that all (j, j) ∈ E . Given a sample {x1, . . . ,xN}, let S be the sample covariance matrix S = 1 N N∑ i=1 (xi − x¯)(xi − x¯)′. The log-likelihood of the precision matrix Θ is given by `(Θ) = log det(Θ)− trace(SΘ). Modified Regression Algorithm It can be verified that −`(Θ) is convex in Θ. If E is known, the Lagrangian with respect to the given edge set E is L(Θ,Γ) = log det(Θ)− trace(SΘ)− ∑ (j,k) 6∈E γjkθjk. The gradient condition is Θ−1 − S − Γ = 0, where Γ is defined as: Γ[j, k] = γjk if (j, k) 6∈ E , and Γ[j, k] = 0 otherwise. The modified regression algorithm tries to update one column/row of Θ at a time. Let Θ(t) be the current estimate, and W (t) = [ Θ(t) ]−1 . Will show how to get Θ(t+1). W.L.O.G., will show how to update the last column/row of Θ. Modified Regression Algorithm The algorithm relies on the identity( W (t) 11 w12 w′12 w22 )( Θ (t) 11 θ12 θ′12 θ22 ) = ( I 0 0′ 1 ) . Will try to update θ12 and θ22 such that the last row/column of the gradient condition is fulfilled. The vector w12 is linked to θ12 and θ22 through W (t) 11 θ12 + θ22w12 = 0. θ12 has to satisfy the edge constraint. Let β := −θ12/θ22, then w12 = W (t)11β, and W (t) 11β − s12 − γ12 = 0. (1) – Note that β has the same zero coordinates as θ12. The last diagonal element w22 has to equal to s22, and w′12θ12 + w22θ22 = 1. More details in class demonstration. Modified Regression Algorithm Graphical Lasso To estimate the graph structure, consider maximizing the penalized log-likelihood log det(Θ)− trace(SΘ)− λ‖Θ‖1, where ‖Θ‖1 denotes the sum of the absolute values of the elements of Θ. The gradient condition becomes Θ−1 − S − λ · Sign(Θ) = 0, – Here Sign(x) is defined as the sub-differential of |x|, which is the sign when x 6= 0, and can take any value in [−1, 1] when x = 0. Similar to the Modified Regression Algorithm, update one column/row of Θ at a time. The gradient condition (1) becomes W (t) 11β − s12 − λ · Sign(β) = 0. Can adapt Lasso to solve this optimization problem, using the pathwise coordinate descent method. More details in the video. Example Read https://people.math.aau.dk/~sorenh/misc/2012-Oslo-GMwR/GMwR-notes.pdf Run setRepositories(), and make sure that BioC software is added. Install the following packages using the command install.packages("gRbase", dependencies=TRUE) install.packages("gRain", dependencies=TRUE) install.packages("gRim", dependencies=TRUE) install.packages("Rgraphviz", dependencies=TRUE) Example: Generate and plot an undirected graph library(gRbase) library(Rgraphviz) g1 <- ug(~a:b:e + a:c:e + b:d:e + c:d:e + c:g + d:f) as(g1, "matrix") ## a b e c d g f ## a 0 1 1 1 0 0 0 ## b 1 0 1 0 1 0 0 ## e 1 1 0 1 1 0 0 ## c 1 0 1 0 1 1 0 ## d 0 1 1 1 0 0 1 ## g 0 0 0 1 0 0 0 ## f 0 0 0 0 1 0 0 plot(g1) a b e c d g f Example: Using adjacency matrix m <- matrix(c(0,1,1,0,1,1,0,0,1,1,1,0,0,1,1,0,1,1,0,1,1,1,1,1,0), nrow = 5) rownames(m) <- colnames(m) <- c("a", "b", "c", "d", "e") m g1=as(m, "graphNEL") g1 ## A graphNEL graph with undirected edges ## Number of Nodes = 11 ## Number of Edges = 8 plot(g1) a b e c d g f Example: Generate and plot a directed graph dag0 <- dag(~a, ~b * a, ~c * a * b, ~d * c * e, ~e * a) dag0 <- dag(~a + b:a + c:a:b + d:c:e + e:a) dag0 <- dag("a", c("b", "a"), c("c", "a", "b"), c("d", "c", "e"), c("e", "a")) dag0 ## A graphNEL graph with directed edges ## Number of Nodes = 5 ## Number of Edges = 6 plot(dag0) a b c d e Example: Using directed adjacency matrix m <- matrix(c(0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,1,0), nrow = 5) rownames(m) <- colnames(m) <- letters[1:5] m ## 1 indicates an arrow from row to column ## a b c d e ## a 0 1 0 0 0 ## b 0 0 1 1 0 ## c 0 0 0 0 1 ## d 0 0 0 0 1 ## e 0 0 0 0 0 dag1 <- as(m, "graphNEL") plot(dag1) a b c d e Example: Protein flow cytometry data https://web.stanford.edu/~hastie/ElemStatLearn/datasets/sachs.data 7466 cells, one per row 11 proteins, one per column Protein names are: "praf", "pmek", "plcg", "PIP2", "PIP3", "p44/42", "pakts473", "PKA", "PKC", "P38", "pjnk" Sachs, K., and Perez, O., and Pe’er, D., and Lauffenburger, D. and Nolan, G. Causal protein-signaling networks derived from multiparameter single-cell data Science, 308(5721), 523--529, 2005. Example: Protein flow cytometry data pfc=read.table("sachs.txt",header=F) colnames(pfc)=c( "raf", "mek", "plcg", "PIP2", "PIP3", "p44", "akt", "PKA", "PKC", "P38", "jnk") dim(pfc) S=cov(pfc) dim(S) S=S/1000 ## rescale the sample covariance matrix Example: λ = 36 fit1=glasso(S,rho=36) Theta=fit1$wi colnames(Theta)=c("raf","mek","plcg","PIP2","PIP3","p44","akt","PKA","PKC","P38","jnk") rownames(Theta)=c("raf","mek","plcg","PIP2","PIP3","p44","akt","PKA","PKC","P38","jnk") Adj=Theta!=0 Adj=Adj*1 diag(Adj)=0 g1=as(Adj, "graphNEL") plot(g1) raf mek plcg PIP2 PIP3 p44 akt PKA PKC P38 jnk Example: λ = 27 fit2=glasso(S,rho=27) Theta=fit2$wi colnames(Theta)=c("raf","mek","plcg","PIP2","PIP3","p44","akt","PKA","PKC","P38","jnk") rownames(Theta)=c("raf","mek","plcg","PIP2","PIP3","p44","akt","PKA","PKC","P38","jnk") Adj=Theta!=0 Adj=Adj*1 diag(Adj)=0 g2=as(Adj, "graphNEL") plot(g2) raf mek plcg PIP2 PIP3 p44 aktPKA PKC P38 jnk Example: λ = 7 fit3=glasso(S,rho=7) Theta=fit3$wi colnames(Theta)=c("raf","mek","plcg","PIP2","PIP3","p44","akt","PKA","PKC","P38","jnk") rownames(Theta)=c("raf","mek","plcg","PIP2","PIP3","p44","akt","PKA","PKC","P38","jnk") Adj=Theta!=0 Adj=Adj*1 diag(Adj)=0 g3=as(Adj, "graphNEL") plot(g3) raf mek plcg PIP2 PIP3 p44 aktPKA PKC P38 jnk
欢迎咨询51作业君