AIT 681 Secure Software Engineering Topic #17. Program Analysis: Basics Instructor: Dr. Kun Sun 1 In Last Class • Topic #15. Secure Coding: Return-to-Libc Attacks – [Du]: Chapter 5 • Topic #16. Dirty COW Race Condition Attack – [Du]: Chapter 8 2 Program Analysis • The process of automatically analyzing the behavior of computer programs regarding a property such as correctness, robustness, safety and liveness. • Program analysis focuses on two major areas – program optimization: improving the program’s performance while reducing the resource usage – program correctness: ensuring that the program does what it is supposed to do. • Program analysis can be performed without executing the program (static program analysis), during runtime (dynamic program analysis) or in a combination of both. 3 Program Analysis on Security • Program analysis in the context of identifying security vulnerabilities and defending security attacks • Need for program analysis – A software related vulnerability is essentially a bug in the software • Identification – Defending software oriented attacks • Software transformation • Guidance for bug fixing 4 Learning Goal • Understand various state-of-the-art program analysis approaches • Useful Textbooks on Compiler – Alfred V. Aho, Monical S. Lam, Ravi Sethi, Jeffrey D. Ullman, Compilers Principles, Techniques, & Tools. – Torben Ægidius Mogensen , Basics of Compiler Design, http://hjemmesider.diku.dk/~torbenm/Basics/ 5 Program Representation Why Program Representations • Original representations – Source code (cross languages). – Binaries (cross machines and platforms). – Source code / binaries + test cases. • They are hard for machines to analyze. • Software is translated into certain representations before analyses are applied. 7 Outline • Control flow graphs • Program dependence graphs • Super control flow graphs • Call graph 8 Control Flow Graph • The most commonly used program representation. 9 Program representation: Basic blocks • A basic block in program P is a sequence of consecutive statements with a single entry and a single exit point. Thus, a block has unique entry and exit points. • Control always enters a basic block at its entry point and exits from its exit point. There is no possibility of exit or a halt at any point inside the basic block except at its exit point. • The entry and exit points of a basic block coincide when the block contains only one statement. 10 Basic blocks: Example Example: Computing x raised to y 11 Basic blocks: Example (contd.) Basic blocks 12 Building Basic Blocks • Identify leaders – The first instruction in a procedure, or – The target of any branch, or – An instruction immediately following a branch (implicit target) • Gobble all subsequent instructions until the next leader 13 Control Flow Graph (CFG) • A control flow graph (or flow graph) G is defined as a finite set N of nodes and a finite set E of edges. – An edge (i, j) in E connects two nodes ni and nj in N. – We often write G= (N, E) to denote a flow graph G with nodes given by N and edges by E. 14 Control Flow Graph (CFG) • In a flow graph of a program, each basic block becomes a node and edges are used to indicate the flow of control between blocks. • A directed edge (i, j) connecting basic blocks bi and bj implies that control can go from block bi to block bj. • We also assume that there is a node labeled Start in N that has no incoming edge, and another node labeled End, also in N, that has no outgoing edge. 15 CFG Example N={Start, 1, 2, 3, 4, 5, 6, 7, 8, 9, End} E={(Start,1), (1, 2), (1, 3), (2,4), (3, 4), (4, 5), (5, 6), (6, 5), (5, 7), (7, 8), (7, 9), (9, End)} 16 CFG Example N={Start, 1, 2, 3, 4, 5, 6, 7, 8, 9, End} E={(Start,1), (1, 2), (1, 3), (2,4), (3, 4), (4, 5), (5, 6), (6, 5), (5, 7), (7, 8), (7, 9), (9, End)} Same CFG with statements removed. 17 Building a CFG From Basic Block • Construction – Each CFG node represents a basic block – There is an edge from node i to j if • Last statement of block i branches to the first statement of j, or • Block i does not end with an unconditional branch and is immediately followed in program order by block j (fall through) 18 Paths 19 Consider a flow graph G= (N, E). A sequence of k edges, k>0, (e_1, e_2, … e_k) , denotes a path of length k through the flow graph if the following sequence condition holds. Given that np, nq, nr, and ns are nodes belonging to N, and 0< i
nr. } Complete path: a path from start to exit Subpath: a subsequence of a complete path Sample Paths 20 p1= ( Start, 1, 2, 4, 5, 6, 5, 7, 9, End) p2= (Start, 1, 3, 4, 5, 6, 5, 7, 9, End) Two feasible and complete paths: Bold edges: complete path. Dashed edges: subpath. p1= ( (Start, 1), (1, 2), (2, 4), (4, 5), (5, 6), (6, 5), (5, 7), (7, 9), (9, End)) Specified unambiguously using edges: Paths: (in)feasible paths p1= ( Start, 1, 3, 4, 5, 6, 5, 7, 8, 9, End) p2= (Start, 1, 2, 4, 5, 7, 9, End) A path p through a flow graph for program P is considered feasible if there exists at least one test case which when input to P causes p to be traversed. 21 Number of paths • There can be many distinct paths through a program. A program with no condition statements contains exactly one path that begins at node Start and terminates at node End. • Each additional condition in the program can increases the number of distinct paths by at least one. • Depending on their location, conditions can have a multiplicative effect on the number of paths. 22 Path Explosion 1. while (p1) { 2. if (p2) { 3. continue; 4. } else { 5. if (p3) 6. continue; 7. s1 8. } 9.} 23 A Simplified Version of CFG • Each statement is represented by a node – For readability. – Not for efficient implementation. 24 Dominator • X dominates Y if all possible program paths from START to Y have to pass X. 1: sum=0 2: i=1 3: while ( i4: i=i+1 5: sum=sum+i endwhile 6: print(sum) 3: while ( i1: sum=0 2: i=1 4: i=i+1 5: sum=sum+i 6: print (sum)DOM(6)={1,3, 6} 25 Note that a basic block is identified by the first statement in the block. Dominator • X strictly dominates Y if X dominates Y and X!=Y 1: sum=0 2: i=1 3: while ( i4: i=i+1 5: sum=sum+i endwhile 6: print(sum) 3: while ( i1: sum=0 2: i=1 4: i=i+1 5: sum=sum+i 6: print (sum)SDOM(6)={1,3} 26 Dominator • X is the immediate dominator of Y if X is the last dominator of Y along a path from Start to Y. 1: sum=0 2: i=1 3: while ( i4: i=i+1 5: sum=sum+i endwhile 6: print(sum) 3: while ( i1: sum=0 2: i=1 4: i=i+1 5: sum=sum+i 6: print (sum)IDOM(6)={3} 27 Is it possible X is not the last dominator of Y along another path? Postdominator • X post-dominates Y if every possible program path from Y to End has to pass X. – Strict post-dominator, immediate post-dominance. 1: sum=0 2: i=1 3: while ( i4: i=i+1 5: sum=sum+i endwhile 6: print(sum) 3: while ( i1: sum=0 2: i=1 4: i=i+1 5: sum=sum+i 6: print (sum)PDOM(4)={3,6} SPDOM(4)={3,6} IPDOM(4)=3 28 Back Edges • A back edge is an edge whose head dominates its tail – Back edges often identify loops 3: while ( i1: sum=0 2: i=1 4: i=i+1 5: sum=sum+i 6: print (sum) An directed edge (x, y) is considered to be directed from x to y; y is called the head and x is called the tail of the arrow; 29 Outline • Control flow graphs • Program dependence graphs • Super control flow graphs • Call graph 30 Program Dependence Graph • The second widely used program representation. • Nodes are constituted by statements instead of basic blocks. • Two types of dependences between statements – Data dependence – Control dependence • CFG focuses on the flow of control; PDG focuses on dependences. • Mainly used in debugging, security. 31 Data Dependence • X is data dependent on Y if (1) there is a variable v that is defined at Y and used at X and (2) there exists a path of nonzero length from Y to X along which v is not re-defined. 3: while ( i1: sum=0 2: i=1 4: i=i+1 5: sum=sum+i 6: print (sum) 32 6 is data dependent on 5, and 1. 5 is data dependent on 1, and 4. 5 is not data dependent on 2, since i is re-defined at 4. Computing Data Dependence is Hard in General • Aliasing – A variable can refer to multiple memory locations/objects. 1: int x, y, z …; 2: int * p; 3: x=…; 4: y=…; 5: p = & x; 6: p = p +z; 7: … = *p; 33 Control Dependence • Intuitively, Y is control-dependent on X iff X directly determines whether Y executes (statements inside one branch of a predicate are usually control dependent on the predicate) – X is not strictly post-dominated by Y – there exists a path from X to Y s.t. every node in the path other than X and Y is post-dominated by Y X Y Not post-dominated by Y Every node is post-dominated by Y There is a path from X to End that does not pass Y or X==Y No such paths for nodes in a path between X and Y. 34 Control Dependence • Intuitively, Y is control-dependent on X iff X directly determines whether Y executes (statements inside one branch of a predicate are usually control dependent on the predicate) • The intuition: 1) X can control not to execute Y; 2) X can direct the control along a path that Y is promised to be executed. X Y Not post-dominated by Y Every node is post-dominated by Y 35 Control Dependence - Example 1: sum=0 2: i=1 3: while ( i4: i=i+1 5: sum=sum+i endwhile 6: print(sum) 3: while ( i1: sum=0 2: i=1 4: i=i+1 5: sum=sum+i 6: print (sum)CD(5)=3 CD(3)=3, tricky! Y is control-dependent on X iff X directly determines whether Y executes X is not strictly post-dominated by Y there exists a path from X to Y s.t. every node in the path other than X and Y is post-dominated by Y 36 Control Dependence is not Syntactically Explicit 1: sum=0 2: i=1 3: while ( i4: i=i+1 5: if (i%2==0) 6: continue; 7: sum=sum+i endwhile 8: print(sum) 3: while ( i1: sum=0 2: i=1 4: i=i+1 5: if (i%2==0) 8: print (sum) 7: sum=sum+i Y is control-dependent on X iff X directly determines whether Y executes X is not strictly post-dominated by Y there exists a path from X to Y s.t. every node in the path other than X and Y is post-dominated by Y 37 CD(7) = 5 “Continue” will change the control dependency Control Dependence is Tricky! Can a statement control depend on two predicates? Y is control-dependent on X iff X directly determines whether Y executes X is not strictly post-dominated by Y there exists a path from X to Y s.t. every node in the path other than X and Y is post-dominated by Y 38 Control Dependence is Tricky! 1: if ( p1 || p2 ) 2: s1; 3: s2; 1: ? p1 Can a statement control depend on two predicates? 1: ? p2 2: s1 3: s2 What if ? 1: if ( p1 && p2 ) 2: s1; 3: s2; Y is control-dependent on X iff X directly determines whether Y executes X is not strictly post-dominated by Y there exists a path from X to Y s.t. every node in the path other than X and Y is post-dominated by Y 39 The Use of PDG • A program dependence graph consists of control dependence graph and data dependence graph • Why it is so important to software reliability? – In debugging, what could possibly induce the failure? – In security, sensitive data leakage p=getpassword( ); … if (p==“zhang”) { send (m); } p=getpassword( ); … send (p); 40 Outline • Control flow graphs • Program dependence graphs • Super control flow graphs • Call graph 41 Super Control Flow Graph (SCFG) • Also called interprocedural control flow graph (ICFG). • Besides the normal intraprocedural control flow graph, additional edges are added connecting – Each call site to the beginning of the procedure it calls. – The return statement back to the call site. 1: for (i=0; i2: t1= f(0); 3: t2 = f(243); 4: x[i] = t1 + t2 + t3; 5: } 6: int f (int v) { 7: return (v+1); 8: } 1 2 3 4 7 42The blue edges are called the calling edges. The green edges are called the return edges. Outline • Control flow graphs • Program dependence graphs • Super control flow graphs • Call graph 43 Call Graph (CG) • Each node represents a function; each edge represents a function invocation void A( ) { B( ); C( ); } void C ( ) { D( ); A( ); } void B( ) { L1: D( ); L2: D( ); } void D ( ) { } A CB D 44 Use of CG • CFI (control flow integrity) • Android framework access control inconsistencies • Hidden behavior detection in Android apps 45 Many Other Representations • Points-to Graph. • Static Single Assignment (SSA). • Code Property Graph (CPG) 46 Tools • C/C++: LLVM, CIL • Java: SOOT, Wala • Binary: Valgrind, Pin, DynamoRIO 47 Next Class • Topic #18. Program Analysis: Dynamic Program Analysis – Program Tracing – Dynamic Slicing 48 欢迎咨询51作业君